# Python Lecture (November 15th, 2022)

Today we are going to review some of the basics of data scraping, cleaning, and model generation in Python. This should cover the foundational elements needed to complete your final projects, with each individual project likely requiring some further (independent) investigation.

We are going to mainly be working with the following website on live COVID data: https://www.worldometers.info/coronavirus/#countries

### Preliminaries

Packages we use:

The first set contains general-purpose packages for numerics and analysis that you will likely need in most Python work.

The second set is for plotting; we will see many examples later, and you have used several of these in your homework assignments.

The last set is mainly for retrieving data from the web in a usable format; we will go over these as we use them.

In [None]:
#-----General------#
import numpy as np
import pandas as pd
import os
import sys
import math
import random

#-----Plotting-----#
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import seaborn as sns
from pandas_profiling import ProfileReport

#-----Utility-----#
import itertools
import warnings
warnings.filterwarnings("ignore")
import re
import gc
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
from datetime import date, datetime

LOOK_AT = 5 # Controls how many bars the user can see in the bar graph
AT_LEAST = 50 # Controls what rank a country must be in terms of total cases to be shown on the bar graph

### Web scraping basics

In [None]:
fname = 'https://www.worldometers.info/coronavirus/#countries'

# This is the website from the midterm exam. How do we know how to best import it?
# A standard method is to specify your web browser as the user agent, but you can also leave the argument blank
req = Request(fname, headers={'User-Agent': 'Safari'})
# req = Request(fname, headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req)
page_soup = soup(webpage, "html.parser")
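from datetime import timedelta
today = datetime.now()
yesterday = today - timedelta(days=1)  # avoids "day 0" on the first of a month
today_str = "%s %d, %d" % (today.strftime("%b"), today.day, today.year)
yesterday_str = "%s %d, %d" % (yesterday.strftime("%b"), yesterday.day, yesterday.year)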
clean = True # optional parameter we use to decide if we want to parse strings as ints

#### Scraping Script
If `clean` is set to `True`, thousands separators are stripped from the numeric columns, and values carrying a `+` or `-` sign are converted from strings to floats. We drop China from our analysis because its row is positioned inconsistently when scraping the data.

In [None]:
table = page_soup.findAll("table", {"id": "main_table_countries_yesterday"})
# In your individual projects, you may need to access the prior day's data as well
# as today's. You can always inspect the webpage for this older data!

containers = table[0].findAll("tr", {"style": ""})
# You have encountered findAll before. Here it extracts the table rows,
# one per country

del containers[0]

all_data = []
for country in containers:
    country_data = []
    country_container = country.findAll("td")
    # Now iterate over the columns

    if country_container[1].text == 'China':
        continue
    
    for i in range(1, len(country_container)):
        final_feature = country_container[i].text

        # Clean up the numeric strings for easier usage later on (not necessary)
        if clean:
            if i != 1 and i != len(country_container)-1:
                final_feature = final_feature.replace(',', '')
                if final_feature.find('+') != -1:
                    final_feature = final_feature.replace('+', '')
                    final_feature = float(final_feature)
                elif final_feature.find('-') != -1:
                    final_feature = final_feature.replace('-', '')
                    final_feature = float(final_feature)*-1

        # Handle missing data
        if final_feature == 'N/A':
            final_feature = 0
        elif final_feature == '' or final_feature == ' ':
            final_feature = -1 # None
        country_data.append(final_feature)
    all_data.append(country_data)
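
The cleaning rules in the loop above can also be collected into one small function, which makes them easier to test on their own. The sketch below is only an illustration: `parse_cell` and the sample `row` are hypothetical names, and unlike the loop it converts every numeric string to a float right away instead of leaving unsigned values for `pd.to_numeric` later.

```python
def parse_cell(text, numeric=True):
    """Convert one scraped table cell, roughly following the loop above.

    Assumed conventions (taken from the scraping loop): 'N/A' becomes 0,
    a blank cell becomes -1, and thousands separators are stripped.
    """
    text = text.strip()
    if text == "N/A":
        return 0
    if text == "":
        return -1
    if not numeric:
        return text  # e.g. country or continent names
    cleaned = text.replace(",", "").replace("+", "")
    try:
        return float(cleaned)  # float() already understands a leading '-'
    except ValueError:
        return text  # leave anything unparseable untouched


# Hypothetical row: a name followed by numeric cells
row = ["USA", "99,935,041", "+13,999", "", "N/A"]
parsed = [parse_cell(row[0], numeric=False)] + [parse_cell(c) for c in row[1:]]
print(parsed)  # ['USA', 99935041.0, 13999.0, -1, 0]
```

Keeping the rules in one place also makes it easy to swap the `-1` placeholder for a proper missing value later on.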

In [None]:
df = pd.DataFrame(all_data)
df = df.drop([i for i in range(15, len(all_data[0]))], axis=1) # Get rid of unnecessary data
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,World,640524148,150248.0,6616287,586.0,620247011,360267.0,13660850,35798,82173,848.8,-1,-1,-1,All
1,USA,99935041,13999.0,1100296,67.0,97494783,34958.0,1339962,2637,298487,3286.0,1134554617,3388700,334805269,North America
2,India,44667251,327.0,530532,-1.0,44126924,-1.0,9795,698,31755,377.0,902709553,641753,1406631776,Asia
3,France,37133190,6017.0,157821,100.0,36508660,59636.0,466709,869,566188,2406.0,271490188,4139547,65584518,Europe
4,Germany,36033394,-1.0,155588,-1.0,35022500,93000.0,855306,1406,429564,1855.0,122332384,1458359,83883596,Europe


On the worldometers website, the category "New Recovered" doesn't appear; however, based on the numbers, we can infer which column of data holds it.

In [None]:
column_labels = ["Country", "Total Cases", "New Cases", "Total Deaths", "New Deaths", "Total Recovered", "New Recovered", "Active Cases", "Serious/Critical",
                "Tot Cases/1M", "Deaths/1M", "Total Tests", "Tests/1M", "Population", "Continent"]
df.columns = column_labels
df.head()

Unnamed: 0,Country,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,New Recovered,Active Cases,Serious/Critical,Tot Cases/1M,Deaths/1M,Total Tests,Tests/1M,Population,Continent
0,World,640524148,150248.0,6616287,586.0,620247011,360267.0,13660850,35798,82173,848.8,-1,-1,-1,All
1,USA,99935041,13999.0,1100296,67.0,97494783,34958.0,1339962,2637,298487,3286.0,1134554617,3388700,334805269,North America
2,India,44667251,327.0,530532,-1.0,44126924,-1.0,9795,698,31755,377.0,902709553,641753,1406631776,Asia
3,France,37133190,6017.0,157821,100.0,36508660,59636.0,466709,869,566188,2406.0,271490188,4139547,65584518,Europe
4,Germany,36033394,-1.0,155588,-1.0,35022500,93000.0,855306,1406,429564,1855.0,122332384,1458359,83883596,Europe


#### What Countries are not present in the Analysis?
For some reason, there are some countries that are not included when scraping the webpage.

In [None]:
country_labels = page_soup.findAll("a", {"class": "mt_a"})
# mt_a is a CSS class that is used in this table to mark country names. We are
# searching for country names that are labeled in the html file but do not appear on the 
# webpage

c_label = []
for country in country_labels:
    c_label.append(country.text)
c_label = set(c_label)

not_counted = []
sorted_countries = set(df['Country']) #Increase computational speed
for country in c_label:
    if country not in sorted_countries:
        not_counted.append(country)
  
print(not_counted + ['China'])

['DPRK', 'Western Sahara', 'China', 'Falkland Islands', 'Macao', 'Vatican City', 'Eritrea', 'China']
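
The same comparison can be written as a set difference. The sketch below uses made-up stand-ins (`html_labels`, `table_countries`) for the scraped link labels and the `Country` column. Note that a country dropped during scraping, such as China, already shows up in the difference, so it does not need to be appended again.

```python
# Hypothetical stand-ins for the scraped link labels and the DataFrame column
html_labels = {"USA", "India", "Macao", "Eritrea", "China"}
table_countries = {"USA", "India"}

# Set difference: labels present in the HTML but absent from the table
not_counted = sorted(html_labels - table_countries)
print(not_counted)  # ['China', 'Eritrea', 'Macao']
```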


### Final Processing
Here, we will convert all the numerical data to numeric dtypes (`int64` or `float64`) and add some other features that may be particularly useful.

In [None]:
for label in df.columns:
    if label != 'Country' and label != 'Continent':
        df[label] = pd.to_numeric(df[label])

In [None]:
df['%Inc Cases'] = df['New Cases']/df['Total Cases']*100
df['%Inc Deaths'] = df['New Deaths']/df['Total Deaths']*100
df['%Inc Recovered'] = df['New Recovered']/df['Total Recovered']*100

# Converting everything to percentages and putting the values in a new column
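
Because blank cells were stored as `-1`, the percentage columns contain small negative values wherever the "new" count was missing. One possible alternative, sketched here on a made-up three-row frame (`demo` is hypothetical), is to treat the placeholder as missing before dividing. This assumes a genuine value of exactly -1 never occurs in the column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the scraped table; -1 marks a blank cell
demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Total Cases": [1000, 2000, 4000],
    "New Cases": [10.0, -1.0, 100.0],
})

# Treat the -1 placeholder as missing before doing arithmetic on it
new_cases = demo["New Cases"].replace(-1, np.nan)
demo["%Inc Cases"] = new_cases / demo["Total Cases"] * 100
print(demo)
```

Rows with a missing count then show `NaN` instead of a misleading negative percentage, and pandas skips them by default in summaries such as `describe()`.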

In [None]:
pd.options.display.max_rows = None
df

Unnamed: 0,Country,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,New Recovered,Active Cases,Serious/Critical,Tot Cases/1M,Deaths/1M,Total Tests,Tests/1M,Population,Continent,%Inc Cases,%Inc Deaths,%Inc Recovered
0,World,640524148,150248.0,6616287,586.0,620247011,360267.0,13660850,35798,82173.0,848.8,-1,-1,-1,All,0.023457,0.008857,0.058084
1,USA,99935041,13999.0,1100296,67.0,97494783,34958.0,1339962,2637,298487.0,3286.0,1134554617,3388700,334805269,North America,0.014008,0.006089,0.035856
2,India,44667251,327.0,530532,-1.0,44126924,-1.0,9795,698,31755.0,377.0,902709553,641753,1406631776,Asia,0.000732,-0.000188,-2e-06
3,France,37133190,6017.0,157821,100.0,36508660,59636.0,466709,869,566188.0,2406.0,271490188,4139547,65584518,Europe,0.016204,0.063363,0.163348
4,Germany,36033394,-1.0,155588,-1.0,35022500,93000.0,855306,1406,429564.0,1855.0,122332384,1458359,83883596,Europe,-3e-06,-0.000643,0.265544
5,Brazil,34961403,5583.0,688746,8.0,34115188,-1.0,157469,8318,162344.0,3198.0,63776166,296146,215353593,South America,0.015969,0.001162,-3e-06
6,S. Korea,26217994,23765.0,29709,44.0,25401440,34902.0,786845,413,510774.0,579.0,15804065,307892,51329899,Asia,0.090644,0.148103,0.137402
7,UK,23954192,-1.0,195530,-1.0,23651083,9973.0,107579,146,349707.0,2855.0,522526476,7628357,68497907,Europe,-4e-06,-0.000511,0.042167
8,Italy,23823192,-1.0,179985,-1.0,23224653,-1.0,418554,203,395322.0,2987.0,254586854,4224613,60262770,Europe,-4e-06,-0.000556,-4e-06
9,Japan,23216265,37555.0,47627,48.0,20566940,8298.0,2601698,235,184865.0,379.0,79384829,632121,125584838,Asia,0.161762,0.100783,0.040346


Now that we have finished processing the dataframe, let's check the basics of this set.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223 entries, 0 to 222
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country           223 non-null    object 
 1   Total Cases       223 non-null    int64  
 2   New Cases         223 non-null    float64
 3   Total Deaths      223 non-null    int64  
 4   New Deaths        223 non-null    float64
 5   Total Recovered   223 non-null    int64  
 6   New Recovered     223 non-null    float64
 7   Active Cases      223 non-null    int64  
 8   Serious/Critical  223 non-null    int64  
 9   Tot Cases/1M      223 non-null    float64
 10  Deaths/1M         223 non-null    float64
 11  Total Tests       223 non-null    int64  
 12  Tests/1M          223 non-null    int64  
 13  Population        223 non-null    int64  
 14  Continent         223 non-null    object 
 15  %Inc Cases        223 non-null    float64
 16  %Inc Deaths       223 non-null    float64
 1

In [None]:
df.describe()

Unnamed: 0,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,New Recovered,Active Cases,Serious/Critical,Tot Cases/1M,Deaths/1M,Total Tests,Tests/1M,Population,%Inc Cases,%Inc Deaths,%Inc Recovered
count,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,223.0,207.0
mean,8594225.0,2012.641256,88984.0,7.049327,8314968.0,4843.520179,177922.0,481.121076,186284.382511,1218.881614,29826660.0,1981341.0,28993320.0,0.156581,-0.713793,0.258661
std,60938970.0,14533.270396,630915.8,56.347796,59020060.0,34939.473817,1306243.0,3445.783441,185933.599084,1252.078842,113144700.0,3492133.0,103359400.0,2.17397,16.5964,7.786449
min,87.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,381.0,-1.0,-1.0,-1.0,-1.0,-1.149425,-100.0,-50.0
25%,31969.5,-1.0,225.0,-1.0,16468.0,-1.0,79.0,-1.0,21999.5,174.5,249493.5,130371.5,444732.0,-0.002251,-0.383384,-0.002561
50%,223728.0,-1.0,2348.0,-1.0,180461.0,-1.0,1074.0,4.0,122520.0,800.0,1907195.0,730785.0,5834950.0,-6e-05,-0.033434,-0.000116
75%,1384000.0,43.5,16028.5,-1.0,1356312.0,96.5,14902.5,29.0,299514.5,1964.0,11532710.0,2162910.0,21839340.0,0.011786,-0.00255,0.021128
max,640524100.0,150248.0,6616287.0,586.0,620247000.0,360267.0,13660850.0,35798.0,703959.0,6448.0,1134555000.0,22236080.0,1406632000.0,32.443366,100.0,100.0


### Basics of Plotting with Plotly

In [None]:
cases_ser = df[["Total Recovered", "Active Cases", "Total Deaths"]].loc[0]
# Row 0 is the "World" aggregate: get its recovered + active cases + deaths

cases_df = pd.DataFrame(cases_ser).reset_index()
cases_df.columns = ['Type', 'Total']
cases_df['Percentage'] = np.round(100*cases_df['Total']/np.sum(cases_df['Total']), 2)
cases_df['Virus'] = ['COVID-19' for i in range(len(cases_df))]

fig = px.bar(cases_df, x='Virus', y='Percentage', color='Type', hover_data=['Total'])
fig.update_layout(title={'text': f"Total Number of Cases, Recoveries, and Deaths on {yesterday_str}", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, yaxis_title="Percentage", xaxis_title="")
fig.show(renderer="colab")

In [None]:
new_ser = df[["New Cases", "New Recovered", "New Deaths"]].loc[0]
new_df = pd.DataFrame(new_ser).reset_index()
new_df.columns = ['Type', 'Total']
new_df['Percentage'] = np.round(100*new_df['Total']/np.sum(new_df['Total']), 2)
new_df['Virus'] = ['COVID-19' for i in range(len(new_df))]

fig = px.bar(new_df, x='Virus', y='Percentage', color='Type', hover_data=['Total'])
fig.update_layout(title={'text': f"New Cases, Recoveries, and Deaths on {yesterday_str}", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, yaxis_title="Percentage", xaxis_title="")
fig.show(renderer="colab")

In [None]:
# numeric_only=True keeps the text "Country" column out of the sum
continent_df = df.groupby('Continent').sum(numeric_only=True).drop('All')
continent_df = continent_df.reset_index()
continent_df

Unnamed: 0,Continent,Total Cases,New Cases,Total Deaths,New Deaths,Total Recovered,New Recovered,Active Cases,Serious/Critical,Tot Cases/1M,Deaths/1M,Total Tests,Tests/1M,Population,%Inc Cases,%Inc Deaths,%Inc Recovered
0,Africa,12678911,1399.0,257866,-52.0,10881879,459.0,334559,812,2385331.0,18379.0,109174805,10855805,1402440339,1.191368,85.011093,-48.737815
1,Asia,190976843,98779.0,1487620,210.0,183598235,102352.0,5675383,9238,6995867.0,33174.0,2157938959,90608549,3236227214,1.58718,-6.030722,1.532527
2,Australia/Oceania,12768654,5965.0,21885,-11.0,12499409,2949.0,149061,111,3481540.0,8196.0,88293143,20323544,43469030,31.33704,-64.065965,98.566451
3,Europe,235982673,16318.0,1951493,190.0,230077162,206419.0,3674912,7645,17978929.0,120923.0,2788853285,209695330,747543038,0.684751,-10.44965,1.96405
4,North America,118424355,15482.0,1557929,38.0,113790998,39060.0,2198547,7780,8588175.0,56864.0,1268665722,99095462,598140916,-0.107224,-163.313405,-0.156228
5,South America,64632463,10380.0,1334066,25.0,62896111,8332.0,322450,10108,1947229.0,32577.0,238419628,11260318,437690904,0.177568,-0.344998,0.257662
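
Summing every column treats per-million rates and percentage columns as if they were counts, which is why some of the continent values above are hard to interpret. A minimal sketch of one alternative, on made-up numbers (`demo` is hypothetical): sum only the additive columns, then recompute the rate from the summed totals.

```python
import pandas as pd

# Hypothetical rows; the real table has one row per country
demo = pd.DataFrame({
    "Continent": ["Europe", "Europe", "Asia"],
    "Total Cases": [300, 100, 600],
    "Population": [1_000_000, 1_000_000, 3_000_000],
})

# Sum only the additive columns, then recompute the per-million rate
by_continent = demo.groupby("Continent")[["Total Cases", "Population"]].sum()
by_continent["Tot Cases/1M"] = (
    by_continent["Total Cases"] / by_continent["Population"] * 1_000_000
)
print(by_continent)
```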


### Model Fitting

Let's build a basic regression that relates the number of newly recovered individuals to the number in serious/critical condition.

In [None]:
# Variable to predict with
predictor = pd.DataFrame(df, columns=["New Recovered"])

# Put the target (number of serious/critical cases) in another DataFrame
target = pd.DataFrame(df, columns=["Serious/Critical"])

In [None]:
import statsmodels.api as sm

# Note the argument order: sm.OLS takes the target first, then the predictor
model = sm.OLS(target, predictor).fit()
predictions = model.predict(predictor) # make the predictions by the model

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,Serious/Critical,R-squared (uncentered):,0.918
Model:,OLS,Adj. R-squared (uncentered):,0.918
Method:,Least Squares,F-statistic:,2500.0
Date:,"Tue, 15 Nov 2022",Prob (F-statistic):,8.370000000000001e-123
Time:,02:20:51,Log-Likelihood:,-1854.9
No. Observations:,223,AIC:,3712.0
Df Residuals:,222,BIC:,3715.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
New Recovered,0.0945,0.002,49.998,0.000,0.091,0.098

0,1,2,3
Omnibus:,103.902,Durbin-Watson:,2.325
Prob(Omnibus):,0.0,Jarque-Bera (JB):,14456.081
Skew:,0.577,Prob(JB):,0.0
Kurtosis:,42.427,Cond. No.,1.0


From the above, we see that the (uncentered) R^2 is 0.918, so the two variables are strongly related. How do we interpret that? Maybe countries where more patients are recovering also have more patients in overwhelmed hospitals, so more people reach critical condition because hospital capacity is constrained.

Unfortunately, with a very simple model that views data on such a high level, we cannot make a meaningful prediction on this relationship.
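
One reason for caution is that the summary reports an *uncentered* R-squared, because the model was fit without an intercept. The NumPy sketch below uses synthetic data (not the COVID table) to illustrate how the uncentered value can be high even when the target barely depends on the predictor. All names and numbers here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: y has a large mean but no real dependence on x
x = rng.uniform(1.0, 2.0, size=200)
y = 10.0 + rng.normal(0.0, 1.0, size=200)

# Regression through the origin (no intercept column)
slope = np.sum(x * y) / np.sum(x * x)
resid_origin = y - slope * x
r2_uncentered = 1 - np.sum(resid_origin**2) / np.sum(y**2)

# Regression with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2_centered = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

print(round(r2_uncentered, 3), round(r2_centered, 3))
```

In statsmodels, an intercept column is typically added with `sm.add_constant` before calling `sm.OLS`, after which the summary reports the usual centered R-squared.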

Now, let's instead try to predict missing data (a common practice when data is scarce or unreliable).

In [None]:
# Mask out certain data
num_rows = len(df)
mask = random.sample(range(num_rows), k=10)  # without replacement, so no row is held out twice
train = df.index.difference(mask)
# print(train)

train_data = df.loc[train]
test_data  = df.iloc[mask]

# Restrict ourselves to just a few columns
missing_data = pd.DataFrame(df, columns=["New Recovered","Serious/Critical","Population"])
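
When holding out rows, it matters whether indices are drawn with or without replacement: `random.choices` draws with replacement and can pick the same row twice, while `random.sample` cannot. A small stdlib-only sketch, where `num_rows` is a hypothetical stand-in for `len(df)`:

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable
num_rows = 20   # hypothetical; stands in for len(df)

# Sampling without replacement guarantees 5 distinct held-out rows
test_idx = random.sample(range(num_rows), k=5)
held_out = set(test_idx)
train_idx = [i for i in range(num_rows) if i not in held_out]

print(sorted(test_idx), len(train_idx))
```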

Let's first try filling in the values with a simple linear regression:

In [None]:
# First build a regression using the available data
from sklearn.metrics import mean_squared_error

x_train = pd.DataFrame(train_data, columns=["New Recovered","Population"])
y_train = pd.DataFrame(train_data, columns=["Serious/Critical"])

x_test = pd.DataFrame(test_data, columns=["New Recovered","Population"])
y_test = pd.DataFrame(test_data, columns=["Serious/Critical"])

model = sm.OLS(y_train, x_train).fit()
y_pred = model.predict(x_test) # make the predictions by the model

rms = np.sqrt(mean_squared_error(y_test, y_pred)) # Compute the error of our model
print(rms)

471.6798818982283


The question remains, is this good? Can we do better?
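
One way to judge whether an RMSE is "good" is to compare it with a trivial baseline, such as always predicting the training mean. The sketch below assumes made-up target values; `rmse` is a hypothetical helper, not part of the lecture code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical training and held-out targets
y_train = np.array([4.0, 29.0, 146.0, 413.0, 2637.0])
y_test = np.array([10.0, 200.0, 900.0])

# Baseline: predict the training mean for every held-out row
baseline_pred = np.full_like(y_test, y_train.mean())
print(rmse(y_test, baseline_pred))
```

A model whose RMSE is not clearly below this baseline has learned little beyond the average.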

In [None]:
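# What about a Logistic Regression?
# Note: LogisticRegression is a classifier, so it treats every distinct
# target value as a separate class rather than as a number.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train.values.ravel())  # sklearn expects a 1-D target
y_pred = logreg.predict(x_test)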

rms = np.sqrt(mean_squared_error(y_test, y_pred)) # Compute the error of our model
print(rms)

698.9799711007462


In [None]:
# Sometimes we do worse! Logistic regression doesn't work well here. What else can we try?
from sklearn import svm
regr = svm.SVR()
regr.fit(x_train, y_train.values.ravel())  # 1-D target avoids a shape warning
y_pred = regr.predict(x_test)

rms = np.sqrt(mean_squared_error(y_test, y_pred)) # Compute the error of our model
print(rms)

482.4119507761734


In [None]:
from sklearn.neural_network import MLPRegressor

clf = MLPRegressor(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(15,), random_state=1)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

rms = np.sqrt(mean_squared_error(y_test, y_pred)) # Compute the error of our model
print(rms)

467.7310299815516


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0,loss='squared_error').fit(x_train, y_train)

y_pred = est.predict(x_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred)) # Compute the error of our model
print(rms)

462.18068718266295


It seems we are making modest improvements now, and some fine-tuning of the parameters could be crucial to improving the model further...
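
Before tuning anything, it may help to have an error estimate that is less noisy than a single 10-row test split. The following is a minimal NumPy sketch of k-fold cross-validation for a plain least-squares fit on synthetic data. `kfold_rmse` and the data are assumptions for illustration, and the same loop could wrap any of the models above.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Mean and spread of held-out RMSE for least-squares fits over k folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, k)
    scores = []
    for fold in folds:
        train = np.setdiff1d(order, fold)  # every index not in this fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        err = y[fold] - X[fold] @ beta
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical synthetic data: y = 2*x + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

mean_rmse, spread = kfold_rmse(X, y)
print(mean_rmse, spread)
```

If the gap between two models' scores is smaller than the spread across folds, it is hard to say that one is really better than the other.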