<h2> Segmenting and Grouping Neighborhoods Vulnerable to the Covid-19 Pandemic</h2>

<h2> By Michael Kumakech (Eng.) </h2>

<b> Import Libraries</b>

In [2]:
!pip install requests
!pip install lxml



In [3]:
!pip install pandas



In [4]:
from lxml import html
import requests
import lxml.html as lh
import pandas as pd

<b> Retrieve the web page to be loaded into the notebook</b>

In [5]:
COVID_19_url = 'https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases' # assign the ECDC page URL
#WHO_url = 'https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases'


page = requests.get(COVID_19_url) # create a handle for the contents of the page

doc = lh.fromstring(page.content) # store the content of the page under doc

tr_elements = doc.xpath('//tr') # parse the data stored between <tr> tags in the HTML

[len(T) for T in tr_elements[:12]] # check the length of the first 12 rows

[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

<b> Check the table headers</b>

In [6]:
tr_elements = doc.xpath('//tr') # collect all table rows; the first row holds the header

col = [] # create empty list
i = 0

for t in tr_elements[0]: # for each header cell, store its name together with an empty list for that column's data
    i+=1
    name=t.text_content()
    print("%d:%s" % (i,name))
    col.append((name,[]))

1:Region
2:Places reporting cases
3:Sum of Cases
4:Sum of Deaths
5:Confirmed cases in the last 14 days


<b> Check the data in the other rows</b>

In [7]:
for j in range(1,len(tr_elements)): # the header is the first row, so the data is stored in the subsequent rows
    T = tr_elements[j] # T is the j'th row
    
    if len(T)!=5: # if the row does not have 5 cells, the //tr data is not from the table
        break
        
    i = 0 #i is the index of the first column
    
    for t in T.iterchildren(): #iterate through each element of the row
        data=t.text_content()
            
        col[i][1].append(data) #append the data to the empty list of the i'th column
            
        i+=1 #increment i for the next column
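The header and row loops above can be sketched as one small function. This is a minimal illustration only: `SAMPLE_HTML` and `table_to_columns` are hypothetical names, the inline HTML stands in for the downloaded page so the snippet runs without a network connection, and the `.strip()` calls are an addition the notebook does not make.

```python
import lxml.html as lh

# Hypothetical inline HTML standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Places reporting cases</th><th>Sum of Cases</th></tr>
  <tr><td>Africa</td><td>Algeria</td><td>85084</td></tr>
  <tr><td></td><td>Angola</td><td>15319</td></tr>
</table>
"""


def table_to_columns(html_text):
    """Return {header: [cell text, ...]} for the rows found by //tr."""
    doc = lh.fromstring(html_text)
    rows = doc.xpath('//tr')
    # The first row holds the header cells.
    headers = [cell.text_content().strip() for cell in rows[0]]
    columns = {name: [] for name in headers}
    for row in rows[1:]:
        # Stop at the first row whose width differs from the header.
        if len(row) != len(headers):
            break
        for name, cell in zip(headers, row.iterchildren()):
            columns[name].append(cell.text_content().strip())
    return columns


columns = table_to_columns(SAMPLE_HTML)
print(columns)
```

Comparing each row against `len(headers)` instead of a hard-coded width keeps the check valid if the source table gains or loses a column.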

<b> Check the number of rows and columns</b>

In [8]:
[len(C) for (title,C) in col]

[215, 215, 215, 215, 215]

<b> Display the data frame with its five columns.</b>

In [10]:
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
df.head(12)

Unnamed: 0,Region,Places reporting cases,Sum of Cases,Sum of Deaths,Confirmed cases in the last 14 days
0,Africa,Algeria,85084,2464,14455
1,,Angola,15319,351,1501
2,,Benin,3055,44,139
3,,Botswana,9992,31,889
4,,Burkina_Faso,3010,68,340
5,,Burundi,689,1,47
6,,Cameroon,24487,441,1591
7,,Cape_Verde,10867,106,867
8,,Central_African_Republic,4918,63,18
9,,Chad,1705,102,89


In [26]:
df.tail(10)

Unnamed: 0,Region,Places reporting cases,Sum of Cases,Sum of Deaths,Confirmed cases in the last 14 days
205,,Marshall_Islands,4,0,0
206,,New_Caledonia,34,0,4
207,,New_Zealand,1713,25,59
208,,Northern_Mariana_Islands,106,2,3
209,,Papua_New_Guinea,669,7,65
210,,Solomon_Islands,17,0,1
211,,Vanuatu,1,0,0
212,,Wallis_and_Futuna,3,0,2
213,Other,Cases_on_an_international_conveyance_Japan,696,7,0
214,Total,,64455619,1495430,8080645


<b> Check the shape</b>

In [27]:
df.shape

(215, 5)

<h2> Clean the dataframe</h2>

In [11]:
import pandas as pd
import numpy as np

<b> Remove row 214, which holds the total record</b>

In [12]:
# Delete the last row (index 214), where Region is "Total"
df2 = df.drop([df.index[214]])
df2

Unnamed: 0,Region,Places reporting cases,Sum of Cases,Sum of Deaths,Confirmed cases in the last 14 days
0,Africa,Algeria,85084,2464,14455
1,,Angola,15319,351,1501
2,,Benin,3055,44,139
3,,Botswana,9992,31,889
4,,Burkina_Faso,3010,68,340
...,...,...,...,...,...
209,,Papua_New_Guinea,669,7,65
210,,Solomon_Islands,17,0,1
211,,Vanuatu,1,0,0
212,,Wallis_and_Futuna,3,0,2
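Dropping by position works only while the total stays in row 214. As a hedged alternative, the row can be filtered out by its label. The frame below is a hypothetical miniature built from a few rows shown above, and `df_small` and `df_small_no_total` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (values copied from rows shown above).
df_small = pd.DataFrame({
    'Region': ['Africa', '', 'Other', 'Total'],
    'Places reporting cases': ['Algeria', 'Angola',
                               'Cases_on_an_international_conveyance_Japan', ''],
    'Sum of Cases': ['85084', '15319', '696', '64455619'],
})

# Keep every row whose Region is not the summary label "Total".
df_small_no_total = df_small[df_small['Region'] != 'Total'].reset_index(drop=True)
print(df_small_no_total)
```

This assumes the summary row is always labelled exactly `Total` in the Region column.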


<h2> Data Analysis</h2>

In [33]:

%%capture
! pip install seaborn

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
%matplotlib inline

<h2> Explore the Data</h2> 

In [13]:
df2.describe()

Unnamed: 0,Region,Places reporting cases,Sum of Cases,Sum of Deaths,Confirmed cases in the last 14 days
count,214.0,214,214,214,214
unique,7.0,214,212,166,189
top,,Syria,85,0,0
freq,208.0,1,2,21,8
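The summary above reports count, unique, top and freq, which is what pandas returns for text columns, so the scraped numbers appear to be stored as strings. A minimal sketch of converting them with `pd.to_numeric` follows. The small frame and the names `df_text` and `df_num` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample: numbers held as strings, as they come out of text_content().
df_text = pd.DataFrame({
    'Sum of Cases': ['85084', '15319', '3055'],
    'Sum of Deaths': ['2464', '351', '44'],
})

numeric_cols = ['Sum of Cases', 'Sum of Deaths']
df_num = df_text.copy()
for col_name in numeric_cols:
    # errors='coerce' turns anything that is not a number into NaN.
    df_num[col_name] = pd.to_numeric(df_num[col_name], errors='coerce')

print(df_num.dtypes)
print(df_num.describe())
```

After the conversion, `describe()` reports mean, standard deviation and quartiles instead of string frequencies.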


<b> Check the columns for missing values. Blank Region entries are empty strings rather than NaN, so isnull() does not count them.</b>

In [38]:
print("number of NaN values for the column Region :", df2['Region'].isnull().sum())
print("number of NaN values for the column Places reporting cases:", df2['Places reporting cases'].isnull().sum())
print("number of NaN values for the column Cases :", df2['Sum of Cases'].isnull().sum())
print("number of NaN values for the column Deaths :", df2['Sum of Deaths'].isnull().sum())
#print("number of NaN values for the column Confirmed cases in the last 15 days :", df2['Confirmed cases in the last 15 days'].isnull().sum())

number of NaN values for the column Region : 0
number of NaN values for the column Places reporting cases: 0
number of NaN values for the column Cases : 0
number of NaN values for the column Deaths : 0


<h2> Exploratory data analysis</h2>

In [39]:
df2['Region'].value_counts().to_frame()

Unnamed: 0,Region
,208
Other,1
America,1
Africa,1
Europe,1
Asia,1
Oceania,1
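The 208 blank Region values arise because the source table prints each region name only on the first row of its group. One possible way to recover the grouping, sketched under that assumption, is to treat blanks as missing and forward-fill. The rows below are toy data, and which place carries the Oceania label is an assumption.

```python
import pandas as pd

# Toy rows: the region name appears only on the first row of each group.
df_regions = pd.DataFrame({
    'Region': ['Africa', '', '', 'Oceania', ''],
    'Places reporting cases': ['Algeria', 'Angola', 'Benin',
                               'New_Zealand', 'Vanuatu'],
})

# Turn blank strings into NaN, then carry the last seen region name downward.
is_blank = df_regions['Region'].str.strip() == ''
df_regions['Region'] = df_regions['Region'].mask(is_blank).ffill()

print(df_regions['Region'].value_counts())
```

With the labels filled in, `value_counts()` shows how many places fall under each region instead of one row per region.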


<h2> Model Development</h2>

In [16]:
#Import libraries 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression 
%matplotlib inline

<b> Import packages</b>

In [41]:
!conda install -c anaconda xlrd --yes


<h2> Model 1a: Simple Linear Regression Development</h2>

(a) We can fit a linear regression model using the Sum of Cases feature and calculate the R^2 for Sum of Deaths.

In [17]:
X = df2[['Sum of Cases']] 
Y = df2['Sum of Deaths'] 
lm = LinearRegression() 
lm 
lm.fit(X,Y)
lm.score(X, Y)

0.8984440430463697

<b> How could identified cases help us predict deaths? </b>

In [18]:
X = df2[['Sum of Cases']] 
Y = df2['Sum of Deaths'] 
lm = LinearRegression()
lm
lm.fit(X,Y)

Yhat=lm.predict(X)
Yhat[0:5]

array([2694.7806869 , 1308.84238555, 1065.20807998, 1203.01721059,
       1064.31411849])

<b> What is the value of the intercept? </b> 

In [19]:
lm.intercept_

1004.5180275140774

<b> What is the value of the slope? </b>

In [46]:
lm.coef_

array([0.01986581])
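For a single feature, the intercept and slope of an ordinary least-squares fit have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) minus slope times mean(x). The sketch below illustrates this on made-up points. `fit_simple_linear` is a hypothetical helper, not a scikit-learn function.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear(xs, ys)
print(intercept, slope)
```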

<h2> What is the final estimated linear model we get?</h2>

As we saw above, we should get a final linear model with the structure: Yhat = a + b X

Plugging in the actual values we get

<h2>Sum of Deaths = 1005 + 0.02 x Sum of Cases </h2> as of the data update of 4 Dec. 2020 at 9:15 am Nairobi time. However, this linear model keeps changing as the numbers of cases and deaths increase.
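As a quick check, the fitted equation can be evaluated by hand for Algeria, which reported 85084 cases. The sketch uses the intercept and slope printed above. The slope is the value as displayed, so the result may differ from `lm.predict` in the last decimals. `predict_deaths` and `predict_deaths_rounded` are illustrative names.

```python
INTERCEPT = 1004.5180275140774  # lm.intercept_ as reported above
SLOPE = 0.01986581              # lm.coef_[0] as displayed above


def predict_deaths(sum_of_cases):
    """Evaluate Yhat = a + b * X with the reported coefficients."""
    return INTERCEPT + SLOPE * sum_of_cases


def predict_deaths_rounded(sum_of_cases):
    """Same model with the rounded coefficients from the heading."""
    return 1005 + 0.02 * sum_of_cases


algeria = predict_deaths(85084)
algeria_rounded = predict_deaths_rounded(85084)
print(round(algeria, 2), round(algeria_rounded, 2))
```

The first value agrees with the first entry of `Yhat` shown earlier to about two decimals. The rounded equation gives a slightly higher figure because 0.02 overstates the slope.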

<h2> Model 1b: Simple Linear Regression</h2>

(b) We can fit a linear regression model using the Confirmed cases in the last 14 days feature and calculate the R^2 for Sum of Deaths.

In [20]:
X = df2[['Confirmed cases in the last 14 days']] 
Y = df2['Sum of Deaths'] 
lm1 = LinearRegression() 
lm1 
lm1.fit(X,Y)
lm1.score(X, Y)

0.7808154064078128

Comparing Model 1a and Model 1b, the R^2 values are <b> ~89.8% and ~78.1% </b>. Thus, Model 1a performs better than Model 1b, so we use Sum of Cases instead of Confirmed cases in the last 14 days when developing the linear regression model.

<b> Practice:</b> Draw the graphs showing the linear relationship in both 1(a) and 1(b).

<h2> Measures for In-Sample Evaluation of Linear Regression</h2> 

<h2>R-squared </h2>
R-squared, also known as the coefficient of determination, measures how close the data are to the fitted regression line. Its value is the proportion of the variation in the response variable (y) that is explained by the linear model.

<h2>Mean Squared Error (MSE)</h2>
The Mean Squared Error is the average of the squared errors, where each error is the difference between the actual value (y) and the estimated value (ŷ).
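Both measures can be written out directly from their definitions. The following is a minimal sketch on made-up numbers. `mse` and `r_squared` are hypothetical helper names, and scikit-learn's own metrics would normally be used instead.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

example_mse = mse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
print(example_mse, example_r2)
```

Here the squared errors are 0.25, 0.25, 0 and 0, giving an MSE of 0.125. The total sum of squares is 20, so R-squared is 1 - 0.5/20 = 0.975.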

<h2> Model 1: Simple Linear Regression</h2>

In [21]:
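# X still holds 'Confirmed cases in the last 14 days' from Model 1b, so this refits lm on that feature
lm.fit(X, Y)
# Find the R^2 (it therefore matches the Model 1b value)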
print('The R-square is: ', lm.score(X, Y))

The R-square is:  0.7808154064078128




<h2>Let's calculate the MSE</h2>
We can predict the output, i.e. "yhat", using the predict method, where X is the input variable.

In [22]:
Yhat=lm.predict(X)
print('The output of the first four predicted value is: ', Yhat[0:4])

The output of the first four predicted value is:  [3910.97028583 2200.62135253 2020.79309507 2119.81746592]
