<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Machine Learning Fundamentals
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    Santiago Basulto
</h3>
</div>

<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Project
</h1>
    
<h3 style="color: #ef7d22; font-weight: normal;">
    COVID-19 Analysis
</h3>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

<img src="img/covid.jpg"
    style="width:250px; float: right; margin: 0 40px 40px 40px;"></img>

Now we will put into practice what we learned in the previous lessons.

Our final goal is to visualize the COVID-19 pandemic and its effects.

Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.

We will use the [COVID-19 dataset](https://www.kaggle.com/imdevskp/corona-virus-report#covid_19_clean_complete.csv), which includes, among others, the following columns:

* Lat: Latitude of the location
* Long: Longitude of the location
* Date: Date of the cumulative report
* Confirmed: Cumulative number of confirmed cases up to that date
* Deaths: Cumulative number of deaths up to that date
* Recovered: Cumulative number of recovered cases up to that date


### Hands on! 


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd 
%matplotlib inline 

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Load the `covid_19_clean_complete.csv` dataset which is in the `data` folder, and store it into `df`.

In [None]:
df = pd.read_csv("data/covid_19_clean_complete.csv",
                 parse_dates=['Date'])

Show the column names of the resulting `df`.

In [None]:
df.columns

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Data exploration

Let's first see some descriptive statistics of the data:

In [None]:
df.describe()

What do you think? Do all the statistics make sense?

> It does not make sense to calculate descriptive statistics for `Lat` and `Long`: they are coordinates that identify a location, not measurements.
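
One way to act on this observation is to describe only the count columns. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) that mimics the column names listed earlier.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (values are made up).
toy = pd.DataFrame({
    "Lat": [33.0, 41.2, -25.3],
    "Long": [65.0, 20.2, 133.8],
    "Confirmed": [10, 25, 40],
    "Deaths": [0, 1, 2],
    "Recovered": [2, 5, 30],
})

# Describe only the count columns; coordinates are left out on purpose.
count_cols = ["Confirmed", "Deaths", "Recovered"]
summary = toy[count_cols].describe()
print(summary)
```

The same column selection should work on `df`, assuming it contains these three columns.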

Now count the number of `NaN` values in the dataset:

In [None]:
df.isnull().sum()

Calculate the number of active cases in a new column: `Active`.

In [None]:
# Active Case = confirmed - deaths - recovered
df['Active'] = df['Confirmed'] - df['Deaths'] - df['Recovered']
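
To see what this subtraction produces, here is a minimal sketch on made-up numbers (`toy` is a hypothetical frame, not part of the dataset). It also counts rows where the result is negative, which would usually point to inconsistent source data rather than a real quantity.

```python
import pandas as pd

# Hypothetical cumulative counts for two locations (values are made up).
toy = pd.DataFrame({
    "Confirmed": [100, 250],
    "Deaths": [5, 10],
    "Recovered": [20, 100],
})

# Active = confirmed - deaths - recovered, computed row by row.
toy["Active"] = toy["Confirmed"] - toy["Deaths"] - toy["Recovered"]

# Sanity check: how many rows ended up with a negative active count?
negative_rows = int((toy["Active"] < 0).sum())
print(toy)
print(f"Rows with negative Active: {negative_rows}")
```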

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Data visualization and relationships

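First, we need to normalize the date format using the pandas `.dt` accessor (the `Date` column was already parsed as a datetime by `parse_dates`).

In [None]:
# Drop any time-of-day component, then store the dates as 'YYYY-MM-DD' strings
df['Date'] = df['Date'].dt.normalize()
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')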
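
The two steps can be illustrated on a small hypothetical series (`stamps` and its values are made up for this sketch): `normalize()` resets the time to midnight, and `strftime` turns each timestamp into text.

```python
import pandas as pd

# Hypothetical timestamps, one of them with a time-of-day component.
stamps = pd.Series(pd.to_datetime(["2020-01-22 13:45:00", "2020-01-23 00:00:00"]))

# Step 1: set every timestamp to midnight of its day.
midnight = stamps.dt.normalize()

# Step 2: format as ISO 'YYYY-MM-DD' strings.
as_text = midnight.dt.strftime("%Y-%m-%d")
print(as_text.tolist())
```

Because the `YYYY-MM-DD` format puts the year first and zero-pads the month and day, sorting these strings alphabetically also sorts them chronologically.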

In [None]:
a = df.Date.value_counts().sort_index()

print(f"The first date is: {a.index[0]}")
print(f"The last date is: {a.index[-1]}")

**Visualize the total number of confirmed cases versus time**

We need to build a new dataframe, `total_cases`, that holds the total number of confirmed cases for each date.

> Note: you should use `groupby`.


In [None]:
total_cases = df.loc[:, ['Date', 'Confirmed']].groupby('Date').sum().reset_index()

total_cases
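
As a rough illustration of what `groupby('Date').sum()` does, the sketch below sums a made-up frame with two locations per date (`toy` and `daily_totals` are hypothetical names). It uses `as_index=False` as an alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

# Hypothetical rows: two locations reported on each of two dates.
toy = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "Country/Region": ["A", "B", "A", "B"],
    "Confirmed": [1, 2, 3, 5],
})

# One row per date, with the confirmed counts of all locations added up.
daily_totals = toy.groupby("Date", as_index=False)["Confirmed"].sum()
print(daily_totals)
```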

Now plot the `total_cases` time series.

In [None]:
plt.figure(figsize= (14,5))

ax = sns.pointplot(x=total_cases['Date'],
                   y=total_cases['Confirmed'],
                   color='r')
ax.set(xlabel='Dates', ylabel='Total cases')

plt.xticks(rotation=90, fontsize=10)
plt.yticks(fontsize=12)

plt.xlabel('Dates', fontsize=14)
plt.ylabel('Total cases', fontsize=14)
plt.title('Worldwide Confirmed Cases Over Time', fontsize=20)

**Another option**

In [None]:
with sns.axes_style('white'):
    g = sns.relplot(x="Date", y="Deaths", kind="line", data=df)
    g.fig.autofmt_xdate()
    g.set_xticklabels(step=10)
    plt.title ("Covid-19 Deaths, Year:2020")

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Visualize the top 10 countries with the most cases

First, keep only the rows for the latest date in a `top` variable. Because the counts are cumulative, the latest date holds each location's highest total:

In [None]:
top = df.loc[df['Date'] == df['Date'].max()]

Now use `groupby` to select the ten countries with the highest number of confirmed cases in a `top_casualities` variable:

In [None]:
top_casualities = top.groupby('Country/Region')['Confirmed'].sum().sort_values(ascending=False).head(10).reset_index()

top_casualities
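
The filter-then-rank pattern can be checked on a small made-up frame. This sketch uses hypothetical names (`toy`, `latest`, `top2`) and swaps in `nlargest` in place of `sort_values(ascending=False).head()`; the two approaches should give the same ranking here.

```python
import pandas as pd

# Hypothetical data: country C is split into two provinces on the last date.
toy = pd.DataFrame({
    "Date": ["2020-01-22"] * 3 + ["2020-01-23"] * 4,
    "Country/Region": ["A", "B", "C", "A", "B", "C", "C"],
    "Confirmed": [1, 2, 3, 10, 40, 5, 7],
})

# Keep only the latest date (ISO date strings compare chronologically).
latest = toy[toy["Date"] == toy["Date"].max()]

# Sum provinces within each country, then keep the two largest totals.
top2 = (
    latest.groupby("Country/Region")["Confirmed"]
    .sum()
    .nlargest(2)
    .reset_index()
)
print(top2)
```

Summing within each country matters because some countries are reported as several province-level rows.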

Plot the total cases of the top 10 countries using a barplot:

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize= (15,10))

ax = sns.barplot(x=top_casualities['Confirmed'],
                 y=top_casualities['Country/Region'])

for i, (value, name) in enumerate(zip(top_casualities['Confirmed'], top_casualities['Country/Region'])):
    ax.text(value, i-.05, f'{value:,.0f}', size=10, ha='left', va='center')
ax.set(xlabel='Total cases', ylabel='Country/Region')

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Total cases', fontsize=30)
plt.ylabel('Country', fontsize=30)
plt.title('Top 10 countries having most confirmed cases', fontsize=20)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### USA analysis

Keep only the `US` rows and the columns we'll use:

In [None]:
us = df.loc[df['Country/Region'] == 'US', ['Date', 'Recovered', 'Deaths', 'Confirmed', 'Active']]

us.head()

Also, group by `Date` and drop the first 33 days:


In [None]:
us = us.groupby('Date').sum().reset_index()
us = us.iloc[33:]

us.head()

Plot the US's active cases over time using seaborn's `pointplot()`:


In [None]:
plt.figure(figsize=(15,5))
sns.set_color_codes("pastel")
# Pass x and y as keywords; seaborn 0.12+ no longer accepts them positionally
sns.pointplot(x=us.index, y=us.Active, color='b')
plt.xlabel('No. of Days', fontsize=15)
plt.ylabel('Active cases', fontsize=15)
plt.title("US's Active Cases Over Time", fontsize=25)


## Another solution
#plt.figure(figsize=(15,5))

#sns.pointplot(x=us.Date, y=us.Active, color='r')
#plt.xlabel('Date', fontsize=15)
#plt.ylabel('Active cases', fontsize=15)
#plt.xticks(rotation=90 ,fontsize=10)
#plt.title("US's Active Cases Over Time" , fontsize=25)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Optional : Stacked Bar Chart

A stacked bar graph (or stacked bar chart) uses bars to compare categories of data, with the added ability to break each bar down and compare the parts of a whole.
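
A minimal sketch of the idea, using matplotlib's `bottom` argument to place each segment on top of the previous one. The arrays are made up for illustration and the names (`days`, `stack_tops`) are hypothetical. This is one possible way to stack bars, not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical counts for three days (values are made up).
days = np.arange(3)
deaths = np.array([1, 2, 3])
recovered = np.array([2, 4, 6])
active = np.array([10, 12, 9])

fig, ax = plt.subplots(figsize=(6, 3))

# Each call draws one segment; `bottom` is where the segment starts.
ax.bar(days, deaths, color="r", label="Deaths")
ax.bar(days, recovered, bottom=deaths, color="g", label="Recovered")
ax.bar(days, active, bottom=deaths + recovered, color="b", label="Active")

ax.set_xlabel("Day")
ax.set_ylabel("No. of cases")
ax.legend()

# The top of each stacked bar is the sum of its parts.
stack_tops = deaths + recovered + active
print(stack_tops)
```

An alternative that gives the same visual result is to draw the cumulative totals as overlapping bars, tallest first, so that only the top slice of each taller bar stays visible.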

In [None]:
sns.set(style="whitegrid")

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(15, 5))

# Plot the total cases (the slice left visible on top is the active cases)
sns.set_color_codes("pastel")
sns.barplot(x=us.index, y=us.Active + us.Recovered + us.Deaths,
            label="Active", color="b")

# Plot recovered plus deaths (the slice left visible is the recovered cases)
sns.set_color_codes("muted")
sns.barplot(x=us.index, y=us.Recovered + us.Deaths,
            label="Recovered", color="g")

# Plot the deaths
sns.set_color_codes("dark")
sns.barplot(x=us.index, y=us.Deaths,
            label="Deaths", color="r")
plt.xlabel('No. of Days', fontsize=14)
plt.ylabel('No. of cases', fontsize=15)

# Add a legend and informative axis label
ax.legend(ncol=2, loc="upper left", frameon=True)
sns.despine(top=True)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Optional Reading

<p style='text-align:center'><a href="https://www.weforum.org/agenda/2020/03/covid-19-crisis-artificial-intelligence-creativity/">AI can help with the COVID-19 crisis - but the right human input is key</a></p>


<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>