### Task 1 - Data Collection
Here you will obtain the required data for the analysis. As described in the project instructions, you will scrape the NCDC website, import data from the Johns Hopkins repository, and import the provided external data.


### A - NCDC Website Scrape
Website - https://covid19.ncdc.gov.ng/

In [21]:
import requests
import numpy as np
import urllib.request
import pandas as pd
import csv
from bs4 import BeautifulSoup
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')  
import warnings
warnings.filterwarnings('ignore')

In [22]:
r = requests.get('https://covid19.ncdc.gov.ng')
soup = BeautifulSoup(r.text,'lxml')
soup


<!DOCTYPE html>
<html lang="en">
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<title>NCDC Coronavirus COVID-19 Microsite</title>
<!--[if lt IE 11]>
    	<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
    	<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
    	<![endif]-->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, user-scalable=0, minimal-ui" name="viewport"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<meta content="Codedthemes" name="author"/>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
  })(window,document,'script','da

In [23]:
table = soup.find('table',{'id':'custom1'})
table

<table id="custom1">
<thead>
<tr>
<th>States Affected</th>
<th>No. of Cases (Lab Confirmed)</th>
<th>No. of Cases (on admission)</th>
<th>No. Discharged</th>
<th>No. of Deaths</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Lagos
</td>
<td>103,145
</td>
<td>2
</td>
<td>102,372
</td>
<td>771
</td>
</tr>
<tr>
<td>
FCT
</td>
<td>29,161
</td>
<td>120
</td>
<td>28,792
</td>
<td>249
</td>
</tr>
<tr>
<td>
Rivers
</td>
<td>17,755
</td>
<td>180
</td>
<td>17,420
</td>
<td>155
</td>
</tr>
<tr>
<td>
Kaduna
</td>
<td>11,487
</td>
<td>20
</td>
<td>11,378
</td>
<td>89
</td>
</tr>
<tr>
<td>
Oyo
</td>
<td>10,328
</td>
<td>2
</td>
<td>10,124
</td>
<td>202
</td>
</tr>
<tr>
<td>
Plateau
</td>
<td>10,309
</td>
<td>27
 </td>
<td>10,207
</td>
<td>75
</td>
</tr>
<tr>
<td>
Edo
</td>
<td>7,821
</td>
<td>102
</td>
<td>7,398
</td>
<td>321
</td>
</tr>
<tr>
<td>
Ogun
</td>
<td>5,810
</td>
<td>11
</td>
<td>5,717
</td>
<td>82
</td>
</tr>
<tr>
<td>
Delta
</td>
<td>5,653
</td>
<td>371
</td>
<td>5,170
</td>
<td>112
</td>
</tr>
<t

In [24]:
headers = []
for i in table.find_all('th'):
    title = i.text
    headers.append(title)

In [25]:
df = pd.DataFrame(columns = headers)

In [26]:
df

Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths


In [27]:
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
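
Appending to `df.loc[length]` one row at a time works, but it gets slow as the table grows. As a minimal sketch, the same parsing can collect all rows first and build the DataFrame once. The helper name `html_table_to_frame` and the inline `sample_html` are assumptions made for illustration; the sample reuses the `custom1` table id and two values from the output above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_table_to_frame(html, table_id):
    """Parse the table with the given id into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        raise ValueError(f"No table with id={table_id!r} found")
    # Header cells become column names
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


# Small inline sample so the sketch runs without network access
sample_html = """
<table id="custom1">
<thead><tr><th>States Affected</th><th>No. of Deaths</th></tr></thead>
<tbody><tr><td>
Lagos
</td><td>771
</td></tr></tbody>
</table>
"""
sample = html_table_to_frame(sample_html, "custom1")
```

With the live page, the same helper could be called as `html_table_to_frame(r.text, "custom1")`.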

In [28]:
df

Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
0,Lagos,103145,2,102372,771
1,FCT,29161,120,28792,249
2,Rivers,17755,180,17420,155
3,Kaduna,11487,20,11378,89
4,Oyo,10328,2,10124,202
5,Plateau,10309,27,10207,75
6,Edo,7821,102,7398,321
7,Ogun,5810,11,5717,82
8,Delta,5653,371,5170,112
9,Kano,5186,50,5009,127


### B - Johns Hopkins Data Repository
Here you will obtain data from the Johns Hopkins repository. The task is to load the data behind each GitHub link below into a DataFrame for further analysis.
* Global Daily Confirmed Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
* Global Daily Recovered Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv)
* Global Daily Death Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv)
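
The links above point to GitHub "blob" pages, which return HTML rather than CSV, so `pd.read_csv` needs the `raw.githubusercontent.com` form of each URL. The sketch below shows one way to derive it; the helper name `github_blob_to_raw` is an assumption made for illustration.

```python
def github_blob_to_raw(blob_url: str) -> str:
    """Convert a GitHub 'blob' page URL to its raw-content equivalent."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    # Drop the host, then remove the first '/blob' path segment
    path = blob_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


blob_url = (
    "https://github.com/CSSEGISandData/COVID-19/blob/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)
raw_url = github_blob_to_raw(blob_url)
```

The resulting `raw_url` can then be passed straight to `pd.read_csv`.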

## Dataset of global daily confirmed cases



In [29]:
PATH = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

In [30]:
df_confirmed = pd.read_csv(PATH)

In [31]:
df_confirmed

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/21/22,8/22/22,8/23/22,8/24/22,8/25/22,8/26/22,8/27/22,8/28/22,8/29/22,8/30/22
0,,Afghanistan,33.939110,67.709953,0,0,0,0,0,0,...,190643,191040,191247,191585,191967,191967,191967,192463,192906,193004
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,326077,326181,326787,327232,327607,327961,328299,328515,328571,329017
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,269805,269894,269971,270043,270097,270145,270175,270194,270235,270272
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,45975,45975,45975,46027,46027,46027,46027,46027,46027,46027
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,102636,102636,102636,102636,102636,102636,102636,102636,102636,102636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280,,West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,...,697447,698384,698384,698384,698384,698384,698384,698384,698384,701739
281,,Winter Olympics 2022,39.904200,116.407400,0,0,0,0,0,0,...,535,535,535,535,535,535,535,535,535,535
282,,Yemen,15.552727,48.516388,0,0,0,0,0,0,...,11915,11915,11917,11919,11922,11922,11925,11925,11925,11926
283,,Zambia,-13.133897,27.849332,0,0,0,0,0,0,...,332264,332527,332527,332648,332710,332710,332710,332710,332822,332822


## Dataset of global daily recovered cases

In [32]:
PATHS_R = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

In [33]:
df_recover = pd.read_csv(PATHS_R)

In [34]:
df_recover.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/21/22,8/22/22,8/23/22,8/24/22,8/25/22,8/26/22,8/27/22,8/28/22,8/29/22,8/30/22
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Dataset of global daily death cases

In [35]:
PATH_D = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

In [36]:
df_death = pd.read_csv(PATH_D)

In [37]:
df_death

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/21/22,8/22/22,8/23/22,8/24/22,8/25/22,8/26/22,8/27/22,8/28/22,8/29/22,8/30/22
0,,Afghanistan,33.939110,67.709953,0,0,0,0,0,0,...,7762,7767,7768,7769,7771,7771,7771,7777,7777,7777
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,3576,3576,3577,3578,3579,3580,3581,3581,3581,3582
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,6878,6878,6878,6878,6878,6878,6878,6878,6878,6878
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,154,154,154,154,154,154,154,154,154,154
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,1917,1917,1917,1917,1917,1917,1917,1917,1917,1917
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280,,West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,...,5691,5694,5694,5694,5694,5694,5694,5694,5694,5700
281,,Winter Olympics 2022,39.904200,116.407400,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
282,,Yemen,15.552727,48.516388,0,0,0,0,0,0,...,2154,2154,2154,2155,2155,2155,2155,2155,2155,2155
283,,Zambia,-13.133897,27.849332,0,0,0,0,0,0,...,4016,4016,4016,4016,4016,4016,4016,4016,4016,4016


## C - External Data 
* Save the external data to a DataFrame
* External data includes, but is not limited to: `covid_external.csv`, `Budget data.csv`, `RealGDP.csv`
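
A `FileNotFoundError` at this step usually means the file name or working directory does not match what is on disk. As a hedged sketch, a small wrapper can report the resolved path before reading; `read_csv_if_present` and its `data_dir` parameter are hypothetical names, not part of the project.

```python
from pathlib import Path

import pandas as pd


def read_csv_if_present(filename, data_dir="."):
    """Return a DataFrame, or None (with a message) if the file is missing."""
    path = Path(data_dir) / filename
    if not path.exists():
        # Show the absolute path so a wrong folder or file name is easy to spot
        print(f"Missing file: {path.resolve()}")
        return None
    return pd.read_csv(path)
```

For example, `df_real_gdp = read_csv_if_present('RealGDP.csv')` returns `None` instead of raising when the file is absent.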

In [42]:
df_budget = pd.read_csv('Budget_data2.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'Budget_data2.csv'

In [43]:
df_covid_external = pd.read_csv('covid_external.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'covid_external.csv'

In [None]:
df_real_gdp = pd.read_csv('RealGDP.csv')

### Task 2 - View the data
Obtain basic information about the data using the `head()` and `info()` methods.

In [None]:
df_budget.head()

In [None]:
df_covid_external.head()

In [None]:
df_real_gdp

### Task 3 - Data Cleaning and Preparation
From the information obtained above, I fixed the data format. 
<br>
Examples: 
* Convert to appropriate data type.
* Rename the columns of the scraped data.
* Remove commas (,) from numerical data.
* Extract daily data for Nigeria from the Global daily cases data
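
The comma removal and type conversion listed above can be sketched on a two-row sample. The values are taken from the scraped table shown earlier, and the `numeric_cols` list is an assumption about which columns need converting.

```python
import pandas as pd

# Scraped values arrive as strings with thousands separators
sample = pd.DataFrame(
    {"states": ["Lagos", "FCT"], "Confirmed Cases": ["103,145", "29,161"]}
)

numeric_cols = ["Confirmed Cases"]  # assumed list of count columns
for col in numeric_cols:
    without_commas = sample[col].str.replace(",", "", regex=False)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    sample[col] = pd.to_numeric(without_commas, errors="coerce")
```

Restricting the replacement to the count columns avoids touching any commas that might appear in text columns.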

### Cleaning the Scraped Data 

In [None]:
df.dtypes

In [None]:
# Shorter column names for the scraped NCDC table
df.rename(
    columns={
        'States Affected': 'states',
        'No. of Cases (Lab Confirmed)': 'Confirmed Cases',
        'No. of Cases (on admission)': 'Admitted Cases',
        'No. Discharged': 'Discharged Cases',
        'No. of Deaths': 'Death Cases',
    },
    inplace=True,
)

In [None]:
df

In [None]:
df = df.replace(',','', regex=True)

In [None]:
df.head()

### Cleaning Johns Hopkins data

In [None]:
df_confirmed.drop(['Province/State','Lat','Long'], axis = 1,inplace = True)

In [None]:
df_confirmed.head()

In [None]:
df_recover.drop(['Province/State','Lat','Long'], axis = 1,inplace = True)

In [None]:
df_recover.head()

In [None]:
df_death.drop(['Province/State','Lat','Long'], axis = 1,inplace = True)

In [None]:
df_death.head()


### Extracting Daily Data from the Global Daily Cases

* #### Nigeria Daily Data for Confirmed Cases

In [None]:
nigeria_confirmed = df_confirmed[df_confirmed["Country/Region"] == "Nigeria"]

In [None]:
nigeria_confirmed

* #### Nigeria Daily Data for Recovered Cases

In [None]:
nigeria_recovered = df_recover[df_recover["Country/Region"] == "Nigeria"]

In [None]:
nigeria_recovered

* #### Nigeria Daily Data for Death Cases

In [None]:
nigeria_death_cases = df_death[df_death["Country/Region"] == "Nigeria"]

In [None]:
nigeria_death_cases

### A Pandas DataFrame for Daily Confirmed Cases in Nigeria. Columns are Date and Cases

In [None]:
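# Every column except the location metadata is a date column.
# (Slicing from position 4 would skip the first three dates once Province/State, Lat and Long are dropped.)
date_columns = nigeria_confirmed.columns.drop(["Province/State", "Country/Region", "Lat", "Long"], errors="ignore")

df_nigeria_confirmed = nigeria_confirmed.melt(value_vars=date_columns, var_name="Date", value_name="Cases")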

In [None]:
df_nigeria_confirmed.head()

* #### Cleaning the data by making date a datetime datatype.

In [None]:
# Column labels look like 1/22/20, so parse them as month/day/two-digit year
df_nigeria_confirmed["Date"] = pd.to_datetime(df_nigeria_confirmed["Date"], format="%m/%d/%y", errors='coerce')

### A Pandas DataFrame for Daily Recovered Cases in Nigeria. Columns are Date and Cases

In [None]:
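# Every column except the location metadata is a date column.
# (Slicing from position 4 would skip the first three dates once Province/State, Lat and Long are dropped.)
date_columns = nigeria_recovered.columns.drop(["Province/State", "Country/Region", "Lat", "Long"], errors="ignore")

df_nigeria_recovered = nigeria_recovered.melt(value_vars=date_columns, var_name="Date", value_name="Cases")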

In [None]:
df_nigeria_recovered.head()

* #### Cleaning the data by making date a datetime datatype.

In [None]:
# Column labels look like 1/22/20, so parse them as month/day/two-digit year
df_nigeria_recovered["Date"] = pd.to_datetime(df_nigeria_recovered["Date"], format="%m/%d/%y", errors='coerce')

### A Pandas DataFrame for Daily Death Cases in Nigeria. Columns are Date and Cases

In [None]:
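# Every column except the location metadata is a date column.
# (Slicing from position 4 would skip the first three dates once Province/State, Lat and Long are dropped.)
date_columns = nigeria_death_cases.columns.drop(["Province/State", "Country/Region", "Lat", "Long"], errors="ignore")

df_nigeria_death_cases = nigeria_death_cases.melt(value_vars=date_columns, var_name="Date", value_name="Cases")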

In [None]:
df_nigeria_death_cases.head()

* #### Cleaning the data by making date a datetime datatype.

In [None]:
# Column labels look like 1/22/20, so parse them as month/day/two-digit year
df_nigeria_death_cases["Date"] = pd.to_datetime(df_nigeria_death_cases["Date"], format="%m/%d/%y", errors='coerce')
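
The filter, melt, and date-conversion steps are repeated for confirmed, recovered, and death cases. One possible way to factor them out is sketched below. The function name `country_time_series` is hypothetical, and the toy frame uses invented counts purely to show the shape of the result.

```python
import pandas as pd


def country_time_series(wide, country, value_name="Cases"):
    """Reshape one country's wide JHU-style rows into Date/value rows."""
    rows = wide[wide["Country/Region"] == country]
    # Keep whichever metadata columns are still present as identifiers
    meta = [c for c in ["Province/State", "Country/Region", "Lat", "Long"]
            if c in rows.columns]
    tidy = rows.melt(id_vars=meta, var_name="Date", value_name=value_name)
    tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y", errors="coerce")
    # Sum per date in case a country is split across several province rows
    return tidy[["Date", value_name]].groupby("Date", as_index=False).sum()


# Toy data with invented counts, in the same layout as the JHU files
toy = pd.DataFrame(
    {
        "Country/Region": ["Nigeria", "Ghana"],
        "1/22/20": [0, 0],
        "1/23/20": [1, 0],
        "1/24/20": [3, 2],
    }
)
toy_ng = country_time_series(toy, "Nigeria")
```

Under this sketch, a call such as `country_time_series(df_confirmed, "Nigeria")` would stand in for the separate filter, melt, and conversion cells.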

## Task 4 - Data Visualization

### TODO A - Generate a plot that shows the Top 10 states in terms of Confirmed Covid cases by Laboratory test

In [None]:
df['Confirmed Cases'] = pd.to_numeric(df['Confirmed Cases'])

In [None]:
df_ncdc_confirm = df.sort_values(by = ['Confirmed Cases'],ascending=False).head(10)

In [None]:
df_ncdc_confirm

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(x = 'states',y = 'Confirmed Cases',data = df_ncdc_confirm).set_title('Top 10 States in terms of Confirmed Cases',fontsize = 20)

#### Summary: Lagos State has the most confirmed cases, followed by FCT (Abuja) and Rivers State.

### TODO B - Generate a plot that shows the Top 10 states in terms of Discharged Covid cases. Hint - Sort the values

In [None]:
df['Discharged Cases'] = pd.to_numeric(df['Discharged Cases'])

In [None]:
df_ncdc_discharged = df.sort_values(by = ['Discharged Cases'],ascending=False).head(10)

In [None]:
df_ncdc_discharged

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(x = 'states',y = 'Discharged Cases',data = df_ncdc_discharged).set_title('Top 10 States in terms of Discharged Cases',fontsize = 20)

#### Summary: Lagos State has the most discharged cases, followed by FCT (Abuja) and Rivers State.

### TODO C - Plot the top 10 Death cases

In [None]:
df['Death Cases'] = pd.to_numeric(df['Death Cases'])

In [None]:
df_ncdc_death = df.sort_values(by = ['Death Cases'],ascending=False).head(10)

In [None]:
df_ncdc_death

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(x = 'states',y = 'Death Cases',data = df_ncdc_death).set_title('Top 10 States in terms of Death Cases',fontsize = 20)

#### Summary: Lagos State has the most deaths, followed by Edo State and FCT.

### TODO D - Generate a line plot for the total daily confirmed, recovered and death cases in Nigeria

In [None]:
sns.set(rc={'figure.figsize':(12,10)})

# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.lineplot(x=df_nigeria_confirmed.Date, y=df_nigeria_confirmed.Cases, linewidth=3, label="Daily Confirmed Cases in Nigeria", color="r", markers="o")

sns.lineplot(x=df_nigeria_recovered.Date, y=df_nigeria_recovered.Cases, linewidth=3, label="Daily Recovered Cases in Nigeria", color="b")

sns.lineplot(x=df_nigeria_death_cases.Date, y=df_nigeria_death_cases.Cases, linewidth=3, label="Daily Death Cases in Nigeria", color="g")

#Title, labels and legend
plt.xlabel("Dates", fontsize=20)
plt.ylabel("Numbers of Cases", fontsize=20)
plt.xticks(rotation=90)
plt.title ("Total Daily Confirmed, Recovered and Death Cases in Nigeria", fontsize=30)
plt.show()

### TODO E - 
* Determine the daily infection rate. You can use the Pandas `diff` method to approximate the derivative of the total cases.
* Generate a line plot for the above
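
Because the Johns Hopkins series is cumulative, the difference between consecutive days gives the number of new cases per day. A minimal sketch on invented numbers:

```python
import pandas as pd

# Invented cumulative totals for five consecutive days
cumulative = pd.Series([0, 1, 3, 3, 8], name="Cases")

# diff() leaves NaN in the first position because there is no previous day;
# here it is filled with the first cumulative value
daily_new = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)
```

In this sample `daily_new` holds 0, 1, 2, 0, 5. A flat stretch in the cumulative series shows up as zero new cases.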

In [None]:
df_nigeria_daily_infection_rate = df_nigeria_confirmed.Cases.diff()
plt.figure(figsize=(15,5))
sns.lineplot(x=df_nigeria_confirmed.Date, y=df_nigeria_daily_infection_rate).set_title("Daily Infection Rates in Nigeria", fontdict = { 'fontsize': 20});

### TODO F - 
* Calculate maximum infection rate for a day (Number of new cases)


In [None]:
df_nigeria_confirmed["Daily Infection Rate"] = df_nigeria_confirmed["Cases"].diff()
max_infection = df_nigeria_confirmed["Daily Infection Rate"].max()

In [None]:
max_infection
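
Knowing when the peak occurred is often as useful as its size. As a sketch on invented data, `idxmax` returns the row label of the largest daily increase, which can then be used to look up the date:

```python
import pandas as pd

# Invented cumulative counts for four days
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"]),
        "Cases": [0, 2, 9, 11],
    }
)
toy["New Cases"] = toy["Cases"].diff()

# idxmax skips the leading NaN and returns the index label of the maximum
peak_row = toy.loc[toy["New Cases"].idxmax()]
peak_date, peak_value = peak_row["Date"], peak_row["New Cases"]
```

The same pattern should carry over to `df_nigeria_confirmed` with its `"Daily Infection Rate"` column.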

### TODO G - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
Here you will generate a line plot of top 10 confirmed cases and the overall community vulnerability index on the same axis. From the graph, explain your observation.
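
Merging on `states` only matches rows whose names are spelled identically in both tables. The sketch below normalises the key and uses `indicator=True` to list states that found no match. The CCVI values and the helper name `normalise_key` are invented for illustration; the case counts come from the scraped table above.

```python
import pandas as pd

ncdc = pd.DataFrame(
    {"states": ["Lagos", "FCT ", "Rivers"], "Confirmed Cases": [103145, 29161, 17755]}
)
# Invented index values; only the column name follows the notebook
external = pd.DataFrame(
    {"states": ["lagos", "Fct", "Kano"], "Overall CCVI Index": [0.1, 0.2, 0.3]}
)


def normalise_key(names):
    """Strip whitespace and lower-case so spelling variants still match."""
    return names.str.strip().str.lower()


ncdc["state_key"] = normalise_key(ncdc["states"])
external["state_key"] = normalise_key(external["states"])

merged = ncdc.merge(
    external.drop(columns="states"), on="state_key", how="left", indicator=True
)
# Rows marked left_only exist in the NCDC table but not in the external data
unmatched = merged.loc[merged["_merge"] == "left_only", "states"].tolist()
```

Checking `unmatched` before plotting helps confirm that no state was silently dropped by the join.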


In [39]:
df2 = pd.merge(df,df_covid_external,on = 'states')

NameError: name 'df_covid_external' is not defined

In [40]:
df2

NameError: name 'df2' is not defined

In [41]:
df_confirmed_vulnerabilty = df2.nlargest(10,['Confirmed Cases','Overall CCVI Index'])

NameError: name 'df2' is not defined

In [None]:
df_confirmed_vulnerabilty

In [None]:
x=df_confirmed_vulnerabilty['states']
y1 = df_confirmed_vulnerabilty['Confirmed Cases']
y2 = df_confirmed_vulnerabilty['Overall CCVI Index']
fig, ax1 = plt.subplots(figsize = (12,7))
plt.title('The Top 10 Confirmed Cases and the Overall CCVI Index',fontsize=25)
ax1.set_xlabel('States',fontsize=20)
ax1.set_ylabel('Top 10 Confirmed Cases',color = 'r',fontsize=20)
ax2 = ax1.twinx()
ax2.set_ylabel('Overall CCVI Index',color = 'b',fontsize=20)
curve1 = ax1.plot(x, y1, label = 'Confirmed Cases', color = 'r')
# Draw the index on the secondary axis so the case counts do not flatten it
curve2 = ax2.plot(x, y2, label='Overall CCVI Index', color = 'b')
plt.show()


### TODO H - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
* Here I generated a regression plot between two variables to visualize the linear relationships - Confirmed Cases and Population Density.
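
A regression plot shows the trend visually; a correlation coefficient can put a number on it. A minimal sketch with invented values, using the column names from this notebook:

```python
import pandas as pd

# Invented values, for illustration only
toy = pd.DataFrame(
    {
        "Confirmed Cases": [10, 20, 30, 40],
        "Population Density": [1.0, 3.0, 2.0, 4.0],
    }
)

# Series.corr computes the Pearson correlation by default
r = toy["Confirmed Cases"].corr(toy["Population Density"])
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear association. With only ten states, any such figure should be read cautiously.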


In [None]:
b = df_confirmed_vulnerabilty['Confirmed Cases']
c = df_confirmed_vulnerabilty['Population Density']
plt.subplots(figsize = (12,5))
plt.title('Confirmed Cases vs Population Density (Top 10 States)',fontsize=25)
plt.xlabel('Confirmed Cases',fontsize=20)
plt.ylabel('Population Density',fontsize=20)
sns.regplot(x=b,y= c, data= df_confirmed_vulnerabilty)
plt.show()

## TODO L -
Determine the effect of the pandemic on the economy. To do this, compare the Real GDP values from before COVID-19 with the Real GDP in 2020.
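
One way to make this comparison concrete is to place each quarter of 2020 next to the same quarter of the previous year and compute the percentage change. The sketch below assumes the `Year`, `Q1` to `Q4` layout used by `RealGDP.csv` in this notebook; the GDP figures are invented.

```python
import pandas as pd

# Invented figures in the Year/Q1..Q4 layout
toy_gdp = pd.DataFrame(
    {
        "Year": [2019, 2020],
        "Q1": [100.0, 102.0],
        "Q2": [104.0, 98.0],
        "Q3": [110.0, 106.0],
        "Q4": [115.0, 115.0],
    }
)

long_gdp = toy_gdp.melt(id_vars="Year", var_name="Quarter", value_name="Real GDP")
# One row per quarter, one column per year
by_quarter = long_gdp.pivot(index="Quarter", columns="Year", values="Real GDP")
by_quarter["pct_change"] = (by_quarter[2020] / by_quarter[2019] - 1) * 100
```

A negative `pct_change` marks a quarter where 2020 fell below the same quarter of 2019.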


In [None]:

df_nigeria_real_gdp = df_real_gdp.melt(id_vars= 'Year',value_vars= ['Q1','Q2','Q3','Q4'],var_name='Quarters',value_name = 'Quarter_value')
df_nigeria_real_gdp

In [None]:
d=df_nigeria_real_gdp['Year']
e= df_nigeria_real_gdp['Quarter_value']
plt.subplots(figsize=(12,5))
plt.bar(d,e)
plt.axhline(1.58e7,c='r',linewidth=2)
plt.title('Years and their GDP Quarter values',fontsize=25)
plt.xlabel('Year',fontsize=20)
plt.ylabel('Quarter value',fontsize=20)
plt.show()

#### Summary: The red horizontal line marks the 2020 Q2 value, the country's lowest quarterly GDP, which is attributed to the effect of COVID-19.