<a href="https://colab.research.google.com/github/jordanml7/DroneSimulation/blob/master/data_scraping_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example Implementation of Website Scraping

*Written by Jordan Lueck and Paddy Alton*

**How to use:**

First, run the installation and import codeblocks below. To run a codeblock, simply mouse over the little number in the top left of the codeblock and a small *play button* will appear - click it!

Next, determine the form of your data. Is it a text-centered webpage? Or is it a dataset, like a big table? If it's some other type of data you're trying to scrape, come ask Jordan or Paddy!

Pick the relevant subsection of this doc & uncollapse it to see your next steps...


## Import necessary libraries

You *always* need to run these blocks first before running any other part of this code!

In [0]:
!pip install apolitical-data-viz -qU
!pip install geopandas -qU
!pip install bokeh -qU

In [0]:
import apol_dataviz
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import ipywidgets as widgets
import geopandas as gpd
import re

from bokeh.io import output_notebook
from bokeh.models import LogColorMapper, LogTicker, ColorBar
import bokeh.palettes as palettes
from bokeh.plotting import figure, output_file, show
from bokeh.models.annotations import Title

## Scraping text

This section is useful if you've got an article or some press release that you'd like to scrape for its raw contents - i.e. the words. It's similar to just copying-and-pasting the text, but will do it for you in one step and remove any unnecessary graphics and whitespace.

The example executed below uses a Guardian report on the weather in the UK on Jan 31, 2019. 

Simply copy-and-paste the url of the article or release that you want to scrape into the **url** codeblock below, then individually *run* that codeblock and the three below it. Your output text is then stored in `stripped_text`.

In [0]:
url = "https://www.theguardian.com/uk-news/2019/jan/31/britain-coldest-night-winter-mercury-drops-minus-11" #@param{type: "string"}

In [0]:
def reduce_whitespace(plain_text):
  """ Reduces multiple newline markers to one """
  text_copy = plain_text.strip("\n")
  while True:
    parsed = text_copy.split("\n\n")
    if len(parsed) < 2:
      break
    text_copy = "\n".join(parsed)
  return text_copy
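
For comparison, here is a minimal sketch of the same clean-up written with a regular expression. The name `collapse_newlines` and the sample string are made up for illustration; the function is intended to give the same result as `reduce_whitespace` above.

```python
import re

def collapse_newlines(plain_text):
    """Collapse every run of newlines into a single newline."""
    # Drop leading/trailing newlines, then squash runs of two or more
    return re.sub(r"\n{2,}", "\n", plain_text.strip("\n"))

# Made-up sample standing in for scraped article text
sample = "\n\nHeadline\n\n\n\nFirst paragraph.\n\nSecond paragraph.\n"
print(collapse_newlines(sample))
```

Either version works; the loop above is arguably easier to follow if you haven't met regular expressions before.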

In [0]:
response = requests.get(url)

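# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")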

In [518]:
# Keep only the text inside <p> tags; find_all is the current name for findAll
plain_text = "\n".join([p.get_text() for p in soup.find_all("p")])

stripped_text = reduce_whitespace(plain_text)
print(stripped_text)

Snow and ice alert as Braemar in Scotland is hit by lowest temperature in UK since 2012
Steven Morris
Thu 31 Jan 2019 07.43 EST
First published on Thu 31 Jan 2019 03.16 EST
The lowest temperature in the UK for seven years was recorded on Thursday as snowy and icy weather continued to hit Britain.
Residents of Braemar in north-east Scotland were shivering in a temperature of -14.4C (6.1F), the Met Office said – the lowest temperature in the UK since 2012 when it reached -15.6C (3.9F) at Holbeach, Lincolnshire.
UPDATE: Braemar has now fallen to -14.4 °C. That's the lowest temperature in the UK since 2012 (-15.6 °C at Holbeach, Lincolnshire 11 February) pic.twitter.com/f1PVbiwDIZ
The Met Office said: “A band of rain will arrive from the southwest on Thursday afternoon, quickly turning to snow and becoming heavy at times.”
It said 3cm to 7cm could accumulate within two to three hours and there could be up to 10cm in some places. “The highest snowfall accumulations are likely to be in areas
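
To see why only the article's words survive, it can help to run the same paragraph extraction on a tiny page you control. The sketch below uses a made-up HTML snippet rather than a live website, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up page: two paragraphs surrounded by navigation and an advert
sample_html = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <p>First paragraph.</p>
  <div class="ad">Buy now!</div>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only <p> tags are kept, and get_text() drops any nested markup such as <b>
paragraphs = [p.get_text() for p in soup.find_all("p")]
print("\n".join(paragraphs))
```

Anything a site puts outside `<p>` tags (menus, adverts, captions) is dropped, which is usually what you want but can occasionally lose real content.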

## Scraping a dataset

If you're trying to scrape a dataset of some sort, first you've got to determine the form of the dataset. Is it a CSV you've downloaded? Or is it a table you've found online somewhere? Again, if your data is in some other form, come ask Jordan or Paddy!

### If it's a CSV / TSV ...

If you've downloaded the dataset to your own computer in a CSV or TSV format, upload the file into your Google Colab environment by opening the sidebar on the left of this window and selecting the **Files** section. Click **upload**.

Next, enter the entire name of the file as it appears in the sidebar, including the filetype extension (e.g. `my_dataset.csv`), into the **dataset** field below. Now individually *run* that codeblock and the 4 blocks below it.

Note that this code is currently configured to handle geographic data from `Eurostat` sources only. This can easily be adapted for other types of datasets and other sources as necessary. Additionally, the example below is extracting from a TSV file uploaded to Github, rather than a local file, so every user can access it.

In [0]:
dataset = "https://raw.githubusercontent.com/apolitical/journalism/scraping-tutorial/prepared-data/t2020_30.tsv" #@param{type: "string"}

Since you are using a CSV/TSV, you'll have to supply the name of the dataset, i.e. what the dataset is measuring, manually below.

In [0]:
dataset_name = "Greenhouse Gas Emissions" #@param{type: "string"}

In [0]:
if dataset.endswith(".csv"):
  df = pd.read_csv(dataset)
elif dataset.endswith(".tsv"):
  df = pd.read_csv(dataset, sep="\t")  # pass the separator by keyword; newer pandas rejects it positionally
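
If the file name ends in neither extension, the block above leaves `df` undefined and the error only shows up later. Below is a minimal sketch of a stricter approach. The helper `pick_separator`, the `SEPARATORS` lookup and the sample data are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical lookup of supported extensions and their separators
SEPARATORS = {".csv": ",", ".tsv": "\t"}

def pick_separator(file_name):
    """Return the column separator implied by the file extension."""
    for extension, separator in SEPARATORS.items():
        if file_name.lower().endswith(extension):
            return separator
    raise ValueError("Unsupported file type: " + file_name)

# Made-up TSV content standing in for an uploaded file
sample = io.StringIO("geo\t2016\nUK\t62.0\n")
sample_df = pd.read_csv(sample, sep=pick_separator("my_dataset.tsv"))
print(sample_df)
```

Failing straight away with a clear message tends to be friendlier than a `NameError` several cells further down.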

All the below code really does is mess around with some formatting from the `Eurostat` dataset to make it compatible with the methods for extracting and plotting its data.

In [0]:
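try:
  # The first header looks like "unit,geo\time"; find where "geo" sits in that list
  ind_geo = df.columns[0].split("\\")[0].split(",").index("geo")
except ValueError:
  # "geo" is not named in the header, so fall back to the second position
  ind_geo = 1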

In [0]:
df = df.rename(index=str, columns={df.columns[0]: "iso_a2"})

temp = df["iso_a2"].tolist()
code = [n.split(",")[ind_geo] for n in temp]
df["iso_a2"] = code

df.loc[df.iso_a2 == "UK","iso_a2"] = "GB"
df.loc[df.iso_a2 == "EL","iso_a2"] = "GR"
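
The header handling above is easier to follow on a small table. The sketch below builds a made-up two-row sample in the same layout (the numbers are invented, not real Eurostat figures) and pulls out the country codes in the same way.

```python
import io
import pandas as pd

# Made-up sample in the Eurostat TSV layout: the first column packs several
# fields together, and its header names them as "unit,geo\time"
sample_tsv = "unit,geo\\time\t2015 \t2016 \nI90,UK\t65.1 \t62.0 \nI90,EL\t: \t90.2 b\n"
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

first_col = sample_df.columns[0]

# Position of "geo" among the comma-separated field names before the backslash
geo_position = first_col.split("\\")[0].split(",").index("geo")

# Take the field at that position from every row
codes = [cell.split(",")[geo_position] for cell in sample_df[first_col]]
print(geo_position, codes)
```

Here `geo` is the second field, so the codes come out as `UK` and `EL`, which the cell above then maps to the ISO codes `GB` and `GR`.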

The below CSV contains the various codes used by the UN and other organizations to define nations (it's easier to compare a 3-digit code, like **826**, than a complicated name, like **The United Kingdom of Great Britain and Northern Ireland**)

In [0]:
country_codes = pd.read_csv("https://raw.githubusercontent.com/apolitical/journalism/scraping-tutorial/prepared-data/country_codes_complete.csv")
country_codes.loc[country_codes.iso_a3 == "NAM","iso_a2"] = "NA"
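
The second line of the cell above is presumably there because pandas treats the literal text `NA` as a missing value when reading a CSV, which would blank out Namibia's two-letter code. A small sketch with a made-up two-row file shows the effect, along with `keep_default_na=False` as one alternative way around it.

```python
import io
import pandas as pd

# Made-up extract in the same spirit as the country codes file
sample_csv = "country,iso_a2,iso_a3\nNamibia,NA,NAM\nNorway,NO,NOR\n"

# Default behaviour: the string "NA" is read as a missing value
default_read = pd.read_csv(io.StringIO(sample_csv))

# With the default missing-value markers switched off, "NA" stays as text
literal_read = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)

print(default_read["iso_a2"].tolist())
print(literal_read["iso_a2"].tolist())
```

Switching the markers off applies to every column, so the targeted fix used above is the safer choice when other columns really do contain blanks.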

In [0]:
csv_df = pd.merge(df, country_codes, how="left", on="iso_a2").drop(columns=["m49","iso_a3"])
csv_df.loc[csv_df.country.isnull(),"country"] = csv_df.loc[csv_df.country.isnull(),"iso_a2"]

csv_df.index = csv_df["country"]
csv_df = csv_df.drop(columns=["country","iso_a2"])
csv_df.columns = [year.strip(' ') for year in csv_df.columns.tolist()]
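
The merge above is a *left* join, so every row of the dataset is kept even when its code has no match in the country list (aggregates such as an EU-wide total, for example). Those rows then fall back to showing the raw code. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up data: "EU28" stands in for an aggregate with no entry in the lookup
values = pd.DataFrame({"iso_a2": ["GB", "GR", "EU28"], "2016": ["62.0", "90.2", "77.6"]})
lookup = pd.DataFrame({"iso_a2": ["GB", "GR"], "country": ["United Kingdom", "Greece"]})

# how="left" keeps all rows of `values`; unmatched rows get NaN for "country"
merged = pd.merge(values, lookup, how="left", on="iso_a2")

# Fall back to the code itself wherever no country name was found
merged["country"] = merged["country"].fillna(merged["iso_a2"])
print(merged.set_index("country")["2016"])
```

`fillna` here does the same job as the `isnull()` assignment in the cell above; use whichever you find easier to read.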

In [0]:
def format_numeric(x):
  return re.sub("[^0-9.]", "", x)

In [0]:
final_data_df = csv_df.applymap(format_numeric)
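# Treat ":" markers and empty or whitespace-only cells as missing values, then convert to numbers
final_data_df = final_data_df.replace(":", np.nan, regex=True).replace(r"^\s*$", np.nan, regex=True).astype(float).sort_index()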
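
To see what the clean-up does to individual cells, here is a small sketch on a made-up column. It reuses `format_numeric` from above and, as an alternative to the `replace` chain, uses `pd.to_numeric` with `errors="coerce"` to turn anything unparseable into `NaN`.

```python
import re
import pandas as pd

def format_numeric(x):
    # Keep digits and decimal points only; everything else is removed
    return re.sub("[^0-9.]", "", x)

# Made-up cells: trailing space, a letter flag, a ":" missing marker, a plain number
raw = pd.Series(["65.1 ", "90.2 b", ": ", "100"])

# Cells left empty after stripping cannot be parsed and so become NaN
cleaned = pd.to_numeric(raw.map(format_numeric), errors="coerce")
print(cleaned.tolist())
```

One thing to watch: the pattern also strips minus signs, so this approach is only safe for datasets where every value is positive.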

### If it's a website (HTML) ...

If you've found the dataset online, it's probably in HTML format. Simply copy the link to the webpage that the table appears on and paste it into the **link_address** field below. Then, *run* the two codeblocks below.

**NB**: While it's easier and quicker to load a webpage than download & upload a CSV, and it is accessible by everyone as opposed to just those with the relevant files, it can often be risky as different data sources will format their HTML tables differently, potentially leading to some funky results. Just something to keep in mind.

In [0]:
link_address = "https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=t2020_30&plugin=1" #@param{type: "string"}

In [0]:
html_df = pd.read_html(link_address)
dataset_name = html_df[-5][0][0].title()
final_data_df = html_df[-1]
rows = html_df[-2].columns.tolist()[0]
cols = html_df[-3].columns.tolist()

In [0]:
def format_numeric(x):
  return re.sub("[^0-9.]", "", x)

In [0]:
final_data_df = final_data_df.applymap(format_numeric)
final_data_df.columns = cols
final_data_df.index = rows
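# Treat ":" markers and empty or whitespace-only cells as missing values, then convert to numbers
final_data_df = final_data_df.replace(":", np.nan, regex=True).replace(r"^\s*$", np.nan, regex=True).astype(float).sort_index()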

The block below is commented out and currently does nothing. It's just here to show you some of what goes on under the hood of the `read_html` call above. This can be useful if you want to dig into the data extraction, perhaps because `read_html` isn't producing the data the way you think it should appear.

In [0]:
'''
The longer way...
(useful if you want to specify how exactly the data is extracted)

response = requests.get(link_address)
html_doc = response.text

soup = BeautifulSoup(html_doc, "html.parser")
data_title = soup.h2.string

html_columns = soup.find_all("th", class_="cell")
column_headers = [h.text.strip() for h in html_columns]

html_rows = soup.find_all("th", class_="hl_row_fix")
row_headers = [r.text.strip() for r in html_rows]

html_df = pd.DataFrame(index=row_headers, columns=column_headers)

for row in html_rows:
  header = row.text.strip()
  id_name = row.get("id").replace("_fix","")
  
  row_stats = soup.find("tr", id=id_name).find_all("td")
  row_data = [float(r.text.strip().replace(":","NaN")) for r in row_stats]
  
  html_df.loc[header] = row_data
  
html_df = html_df.sort_index()

'''

### What to do with the data...

Now you've got your dataset, what do you want to do with it?

#### You could present it as a line graph...

Just select the country you want to plot the yearly data for and run all the following codeblocks!

In [0]:
countries = final_data_df.index.tolist()
country_picker = widgets.Dropdown(options=countries, value=countries[0])
country_picker

In [0]:
country_choice = country_picker.value

country_data = final_data_df.loc[country_choice]
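
If you're unsure what `.loc` returns here, the sketch below shows both kinds of selection used in this notebook on a tiny made-up table: picking one row (a country across all years) and picking one column (a year across all countries).

```python
import pandas as pd

# Made-up table in the same shape as final_data_df: countries as rows, years as columns
demo_df = pd.DataFrame(
    {"2015": [65.1, 92.3], "2016": [62.0, 90.2]},
    index=["United Kingdom", "Greece"],
)

one_country = demo_df.loc["Greece"]   # a Series indexed by year
one_year = demo_df.loc[:, "2016"]     # a Series indexed by country

print(one_country)
print(one_year)
```

The row selection feeds the line graph; the column selection is what the world map section uses.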

You will need to supply your desired x-axis and y-axis labels yourself, below.

In [0]:
xlab = "Year" #@param{type: "string"}
ylab = "Percent (%) of 1990 Emissions" #@param{type: "string"}

In [0]:
ax = country_data.plot(title=country_choice + " " + dataset_name)
xlab = ax.set_xlabel(xlab)
ylab = ax.set_ylabel(ylab)

#### Or you could make an interactive world map...

Just select the year you want to plot the geographic country data for and run all the following codeblocks!

**NB**: This is a rather finicky graph that is quite case-specific, so changing the data is likely to break it. If this happens, reach out to Jordan or Paddy!

In [0]:
years = final_data_df.columns.tolist()
year_picker = widgets.Dropdown(options=years, value=years[len(years)-2])
year_picker

In [0]:
year_choice = year_picker.value

year_data = final_data_df.loc[:,year_choice]

##### Behind the scenes of map generation...

In [0]:
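# Reverse a copy of the palette so the shared Bokeh palette is left untouched
# and re-running this cell always gives the same order
ylr = list(palettes.YlOrRd6)[::-1]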

In [0]:
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world.drop(world.loc[world.continent == "Antarctica"].index.tolist())

world.loc[world.name == "France","iso_a3"] = "FRA"
world.loc[world.name == "Norway","iso_a3"] = "NOR"
world.loc[world.name == "N. Cyprus","iso_a3"] = "CYP"
world.loc[world.name == "Somaliland","iso_a3"] = "SOM"
world.loc[world.name == "Kosovo","iso_a3"] = "RKS"

In [0]:
geo_data = year_data.to_frame().reset_index().rename(columns={"index": "country"})
geo_data.loc[geo_data.country == "United Kingdom","country"] = "United Kingdom of Great Britain and Northern Ireland"

In [0]:
country_codes = pd.read_csv("https://raw.githubusercontent.com/apolitical/journalism/scraping-tutorial/prepared-data/country_codes_complete.csv")
country_codes.loc[country_codes.iso_a3 == "NAM","iso_a2"] = "NA"

In [0]:
expanded_world = pd.merge(country_codes, geo_data, how="left", on="country")
complete_data = pd.merge(world, expanded_world, how="left", on="iso_a3").replace(np.nan, "-", regex=True)
complete_data = complete_data.drop(columns=["pop_est","continent","name","iso_a3","gdp_md_est","m49"])
complete_data.loc[complete_data.loc[:,year_choice] == "-",year_choice] = "No Data"

In [0]:
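# Split each MultiPolygon into one row per polygon, so every row holds a single outline
split_rows = []
for index, row in complete_data.iterrows():
  if row.geometry.geom_type == "MultiPolygon":
    # .geoms lists the individual polygons that make up the MultiPolygon
    for polygon in row.geometry.geoms:
      newrow = row.copy()
      newrow.geometry = polygon
      split_rows.append(newrow)
    complete_data.drop(index, axis=0, inplace=True)

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
complete_data = pd.concat([complete_data, pd.DataFrame(split_rows)], sort=True, ignore_index=True)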

In [0]:
complete_data["x"] = [list(poly.exterior.coords.xy[0]) for poly in complete_data.geometry]
complete_data["y"] = [list(poly.exterior.coords.xy[1]) for poly in complete_data.geometry]

In [0]:
a = "http://flagpedia.net/data/flags/w580/"

good_data = complete_data[complete_data.loc[:,year_choice] != "No Data"]
nan_data = complete_data[complete_data.loc[:,year_choice] == "No Data"]

In [0]:
low_val = int(min(good_data.loc[:,year_choice]))
hi_val = int(max(good_data.loc[:,year_choice]))

In [0]:
color_mapper = LogColorMapper(palette=ylr, low=low_val, high=hi_val)

good = dict(
    x=good_data.x.tolist(),
    y=good_data.y.tolist(),
    name=good_data.country.tolist(),
    data=[float(i) for i in good_data.loc[:,year_choice]],
    img=[a + i.lower() + ".png" for i in good_data.iso_a2],
)

nan = dict(
    x=nan_data.x.tolist(),
    y=nan_data.y.tolist(),
    name=nan_data.country.tolist(),
    data=nan_data.loc[:,year_choice].tolist(),
    img=[a + i.lower() + ".png" for i in nan_data.iso_a2],
)

In [0]:
TOOLS = "pan,box_zoom,wheel_zoom,reset,hover,save"
title_string = dataset_name + ", " + year_choice
data_name = title_string
if len(data_name) > 40:
  list_words = title_string.split(' ')
  list_words[round(len(list_words)/2)] = list_words[round(len(list_words)/2)] + "<br>"
  data_name = ' '.join(list_words)

tooltips = """
<div style="width:300px;padding: 5px;border-style: solid;border-width: 1px;border-color: #00B3BF;">
  <div>
    <img
        src="@img" height="30" width="50"
        style="float: left; margin: 0px 10px 0px 0px;"
        border="1"
    ></img>
  </div>
  <div>
    <div style="height:30px">
      <b><font size=2vh; color=#00B3BF>@name</font></b>
    </div>
    <div style="padding-top:10px">
      <b>{data_name}:</b> @data <br>
    </div>
    <div>
      <img
          src="https://apolitical.co/wp-content/themes/apolitical/public/img/stamp.svg" height="16" width="12"
          style="position: absolute; bottom: 10px; right: 12px;"
      ></img>
    </div>
  </div>
</div>
""".format(data_name=data_name)
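
The title-wrapping logic at the top of the cell above can be tried on its own. The sketch below wraps it in a function for illustration; the name `wrap_title`, the `max_length` parameter and the second example title are made up.

```python
def wrap_title(title, max_length=40):
    """Insert an HTML line break after the middle word of a long title."""
    if len(title) <= max_length:
        return title
    words = title.split(" ")
    middle = round(len(words) / 2)
    words[middle] += "<br>"
    return " ".join(words)

# Short enough to fit on one line, so it is returned unchanged
print(wrap_title("Greenhouse Gas Emissions, 2016"))

# Longer than 40 characters, so a <br> is added after the middle word
print(wrap_title("Greenhouse Gas Emissions In The European Union, 2016"))
```

The break is placed by word count rather than by character count, so titles with very uneven word lengths may still wrap a little awkwardly in the tooltip.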

In [0]:
title = Title(text=title_string, text_color="#00B3BF",text_font_size="20px")

In [0]:
p = figure(
    title=title, tools=TOOLS,
    x_axis_location=None, y_axis_location=None,
    tooltips=tooltips,
    plot_width=1000, plot_height=500, toolbar_location="below",
    active_scroll = "wheel_zoom", outline_line_color = "#00B3BF",
    outline_line_width = 2)

p.toolbar.logo = None
p.grid.grid_line_color = None
p.hover.point_policy = "follow_mouse"

p.patches("x", "y", source=good,
          fill_color={"field": "data", "transform": color_mapper},
          fill_alpha=0.7, line_color="white", line_width=0.5)

p.patches("x", "y", source=nan, fill_color="#fbf7f5",
          fill_alpha=0.7, line_color="white", line_width=0.5, hatch_pattern="/", hatch_alpha=0.3)

color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
                     label_standoff=12, border_line_color=None, location=(0,0))

p.add_layout(color_bar, "right")

##### ... And the final product

In [0]:
output_file(dataset_name.lower().split(',')[0].replace(' ','_') + ".html")
output_notebook()
show(p)
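
The name of the saved HTML file is derived from `dataset_name`. A quick sketch of that step in isolation, using a made-up dataset name that contains a comma:

```python
# Made-up name; anything after the first comma is dropped from the file name
dataset_name = "Greenhouse Gas Emissions, Base Year 1990"

# Lowercase, keep the part before the first comma, swap spaces for underscores
file_name = dataset_name.lower().split(",")[0].replace(" ", "_") + ".html"
print(file_name)
```

Once the cell above has run, the saved file should appear in the **Files** sidebar under that name, ready to download.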