<a href="https://colab.research.google.com/github/mcsmith89/webscrapingprimer/blob/master/Copy_of_Worldometer_COVID19_Scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping in Python
COVID-19 stats by country: [worldometers.info](https://www.worldometers.info/coronavirus/)

---

*   [Selector Cheatsheet](https://devhints.io/xpath)
*   [Beautiful Soup Docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
*   [Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
*   [Workshop Presentation](https://docs.google.com/presentation/d/1zHwfvTlk9vXFhiDdJSeFzasBZgwor1xM1wTiQH_qRDY/edit?usp=sharing)

*   Create advanced scrapers with [Scrapy](https://scrapy.org/)




In [0]:
# Import necessary libraries

import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

In [0]:
# Make an HTTP request for the webpage

req = requests.get('https://www.worldometers.info/coronavirus/')
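
A bare `requests.get` call will happily return an error page, and it can hang if the server never answers. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original workshop code.

```python
import requests

URL = "https://www.worldometers.info/coronavirus/"


def fetch_html(url, timeout=10):
    """Return the response body as bytes, raising on HTTP error codes."""
    # timeout (seconds) stops the call from waiting forever on a dead server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Usage (needs network access):
# html = fetch_html(URL)
```

The returned bytes can be passed to `BeautifulSoup` in the same way as `req.content`.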

In [0]:
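# Parse the raw HTML into a soup
# Naming the parser explicitly avoids a warning and keeps results consistent across machines

soup = BeautifulSoup(req.content, 'html.parser')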

In [0]:
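# Convert the "cases by country" HTML table into a list of lists (one inner list per table row)
#
# map(function, iterable) applies a function to each element of an iterable.
# The lambda expressions define those functions inline: the outer one handles each row (tr),
# the inner one pulls the text out of each cell (td or th).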

data = list(map(lambda x: list(map(lambda y: y.text, x.select('td, th'))), soup.select('table#main_table_countries_yesterday tr')))
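
The nested `map`/`lambda` call can also be written as a nested list comprehension, which many readers find easier to follow. The sketch below runs both forms on a small inline table so it works without network access. The HTML snippet and the `demo` table id are made up for illustration; the two data rows reuse figures from this notebook's own output.

```python
from bs4 import BeautifulSoup

html = """
<table id="demo">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>China</td><td>81218</td></tr>
  <tr><td>Italy</td><td>74386</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Original style: nested map with lambdas
mapped = list(map(lambda x: list(map(lambda y: y.text, x.select("td, th"))),
                  soup.select("table#demo tr")))

# Equivalent nested list comprehension: outer loop over rows, inner loop over cells
rows = [
    [cell.text for cell in tr.select("td, th")]
    for tr in soup.select("table#demo tr")
]

print(rows)
```

Both forms produce the same list of lists, so the choice is a matter of readability.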

In [0]:
# Turn the list of lists into a pandas DataFrame
df = pd.DataFrame(data)

# Set the column names to the first row
df.columns = df.iloc[0]

# Drop the first row so the header is not mistaken for a country
df = df.drop(0, axis=0)
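
The three steps above (build the frame, promote the first row to column names, drop that row) can be collapsed into a single constructor call. This is a sketch on a tiny hand-written `data` list with the same shape as the scraped one. One difference to be aware of: the row index here starts at 0, whereas the `drop` approach leaves it starting at 1.

```python
import pandas as pd

# Stand-in for the scraped list of lists: header row first, then data rows
data = [
    ["Country,Other", "TotalCases"],
    ["China", "81218"],
    ["Italy", "74386"],
]

# data[0] supplies the column names, data[1:] supplies the rows
df = pd.DataFrame(data[1:], columns=data[0])

print(df)
```

Note that every value is still a string at this point, exactly as it was scraped.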

In [0]:
# Let's look at the first ten rows; they should match the table on the website.

df.head(10)

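|    | Country,Other | TotalCases | NewCases | TotalDeaths | NewDeaths | TotalRecovered | ActiveCases | Serious,Critical | Tot Cases/1M pop | Tot Deaths/1M pop |
|----|---------------|------------|----------|-------------|-----------|----------------|-------------|------------------|------------------|-------------------|
| 1  | China         | 81218      | 47       | 3281        | 4         | 73650          | 4287        | 1399.0           | 56               | 2                 |
| 2  | Italy         | 74386      | 5210     | 7503        | 683       | 9362           | 57521       | 3489.0           | 1230             | 124               |
| 3  | USA           | 68211      | 13355    | 1027        | 247       | 394            | 66790       | 1452.0           | 206              | 3                 |
| 4  | Spain         | 49515      | 7457     | 3647        | 656       | 5367           | 40501       | 3166.0           | 1059             | 78                |
| 5  | Germany       | 37323      | 4332     | 206         | 47        | 3547           | 33570       | 23.0             | 445              | 2                 |
| 6  | Iran          | 27017      | 2206     | 2077        | 143       | 9625           | 15315       |                  | 322              | 25                |
| 7  | France        | 25233      | 2929     | 1331        | 231       | 3900           | 20002       | 2827.0           | 387              | 20                |
| 8  | Switzerland   | 10897      | 1020     | 153         | 31        | 131            | 10613       | 141.0            | 1259             | 18                |
| 9  | UK            | 9529       | 1452     | 465         | 43        | 135            | 8929        | 163.0            | 140              | 7                 |
| 10 | S. Korea      | 9137       | 100      | 126         | 6         | 3730           | 5281        | 59.0             | 178              | 2                 |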


In [0]:
# Finally, save the data as a spreadsheet or another file format. The file will appear in the folder view on the left sidebar.

df.to_csv('covid19.csv')
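
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` leaves it out. The sketch below demonstrates the difference using in-memory text instead of a file; the two-row frame is a stand-in for the scraped data.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Country,Other": ["China", "Italy"],
    "TotalCases": ["81218", "74386"],
})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

The same keyword works when writing to a file, for example `df.to_csv('covid19.csv', index=False)`.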

# That's it! Now try scraping another website.


Wyatt Phillips | [HackWITus Team](https://hackwit.us/)

phillipsw1@wit.edu

Discord: Wyatt Phillips [Co-op]#7689
