# COVID-19 Pandas + Seaborn


In this project you will apply your Python Pandas and Seaborn skills to obtain useful insights into the live COVID-19 data. The [dataset](https://github.com/open-covid-19/data) is a processed version of the main [COVID-19 data repository](https://github.com/CSSEGISandData/COVID-19) for the 2019 Novel Coronavirus Visual Dashboard, which is operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

The dataset is updated on a public GitHub repository, so we can download the most recent version as follows:

In [0]:
import pandas as pd
data_covid19 = pd.read_csv("https://open-covid-19.github.io/data/data.csv",parse_dates=["Date"],index_col="Date")
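
To see what `parse_dates` and `index_col` do without downloading anything, here is a minimal sketch on a made-up, in-memory CSV. The two sample rows and the names `sample_csv` and `sample` are assumptions for illustration only, not values from the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with the same kind of columns as the real file.
sample_csv = io.StringIO(
    "Date,Key,Confirmed\n"
    "2020-03-20,BE,100\n"
    "2020-03-21,BE,150\n"
)

# parse_dates turns the Date strings into datetimes;
# index_col makes that column the row index.
sample = pd.read_csv(sample_csv, parse_dates=["Date"], index_col="Date")

# The index is now a DatetimeIndex, so rows can be looked up by date.
print(type(sample.index).__name__)
print(sample.loc[pd.Timestamp("2020-03-21"), "Confirmed"])
```

With a datetime index in place, date-based selection and plotting against time become straightforward later on.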

*How many rows and how many columns does this dataset have?*

In [0]:
###Start code here

###End code here

*Print the first 20 rows in the dataset.*

In [0]:
###Start code here

###End code here

The columns of
[data_covid19](https://open-covid-19.github.io/data/data.csv) are:

| Name | Description | Example |
| ---- | ----------- | ------- |
| **Date**\* | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-21 |
| **Key** | `CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}` | CN_HB |
| **CountryCode** | ISO 3166-1 code of the country | CN |
| **CountryName** | American English name of the country, subject to change | China |
| **RegionCode** | (Optional) ISO 3166-2 or NUTS 2 code of the region | HB |
| **RegionName** | (Optional) American English name of the region, subject to change | Hubei |
| **Confirmed**\*\* | Total number of cases confirmed after positive test | 67800 |
| **Deaths**\*\* | Total number of deaths from a positive COVID-19 case | 3139 |
| **Latitude** | Floating point representing the geographic coordinate | 30.9756 |
| **Longitude** | Floating point representing the geographic coordinate | 112.2707 |
| **Population** | Total count of humans living in the region | 1.153933e+07 |

\*Date used is **reporting** date, which generally lags a day from the actual
date and is subject to timezone adjustments. Whenever possible, dates
consistent with the ECDC daily reports are used.

\*\*Missing values will be represented as nulls, whereas zeroes are used when
a true value of zero is reported. For example, US states where deaths are not
being reported have null values.

For countries where both country-level and region-level data is available, the
entry which has a null value for the `RegionCode` and `RegionName` columns
indicates country-level aggregation. Please note that the country-level data
and the region-level data sometimes come from different sources, so the sum
of all region-level values may not exactly equal the reported country-level
value. See the [data loading tutorial][7] for more information.

The `CountryName` and `RegionName` values are subject to change. You may use
them for labels in your application, but you should not assume that they will
remain the same in future updates. Instead, use `CountryCode` and `RegionCode`
to perform joins with other data sources or for filtering within your
application.
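
The null conventions described above can be checked with `.isna()` and `.notna()`. The following is a small sketch on made-up rows; the DataFrame `toy` and all of its values are hypothetical and only mimic the column layout of the table.

```python
import pandas as pd

# Hypothetical rows; the numbers are invented for illustration.
toy = pd.DataFrame({
    "Key": ["CN", "CN_HB", "US", "US_NY"],
    "CountryCode": ["CN", "CN", "US", "US"],
    "RegionCode": [None, "HB", None, "NY"],
    "Confirmed": [10, 7, 5, None],
})

# Country-level rows are the ones with a null RegionCode.
country_level = toy[toy["RegionCode"].isna()]

# Region-level rows are the ones where RegionCode is present.
region_level = toy[toy["RegionCode"].notna()]

print(country_level["Key"].tolist())
print(region_level["Key"].tolist())

# A null in Confirmed means "not reported", which is different from 0.
print(toy["Confirmed"].isna().sum())
```

Note that for country-level rows the `Key` is simply the `CountryCode`, which is why filtering on `Key` with a list of country codes keeps only country-level data.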

For this analysis we will look at only a few countries. We can use the `Key` column to select these countries:

In [0]:
countries = ["FR","BE","SE","CH","AU","NL","CN","JP","GB","ES","IT","US"]

data_covid19 = data_covid19[data_covid19["Key"].isin(countries)]

*Print the last 10 rows in the dataset where `Key` is BE*.

In [0]:
###Start code here

###End code here

*Use the Pandas DataFrame `.describe()` method to print statistics about the confirmed cases for Belgium.*

In [0]:
###Start code here

###End code here

Each row in the dataset represents a day and contains the number of confirmed cases in a region (country) for that day. You will plot the number of confirmed cases for each country up until today. To do so you will first create a new DataFrame that contains the data to plot. 

In Pandas the `.groupby` method can group rows by the values in a certain column "A" such that the values in another column "B" can be aggregated.
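
As a minimal sketch of this group-then-aggregate pattern on unrelated, made-up data (the `team` and `points` columns are hypothetical and are not part of the COVID-19 dataset):

```python
import pandas as pd

# Hypothetical data: several rows per team.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red"],
    "points": [1, 2, 3, 4, 5],
})

# Group the rows by "team" (column A) and sum "points" (column B).
totals = toy.groupby("team")["points"].sum()

# The result is a Series indexed by the group labels.
print(totals)
```

Here `red` sums to 9 and `blue` to 6; other aggregations such as `.mean()` or `.max()` follow the same shape.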

*Use these methods to print the total number of confirmed cases for each country by assigning the correct columns to variables "A" and "B":*

In [0]:
###Start code here
A = 
B = 
###End code here

total_confirmed = data_covid19.groupby(A)[B].sum()
print(total_confirmed)

To plot the result of the aggregation we need to transform the result into a DataFrame as follows:

In [0]:
total_confirmed = total_confirmed.reset_index(name = "Confirmed")
print(total_confirmed)
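
What `.reset_index(name=...)` does may be easier to see on a tiny example. This sketch uses a made-up Series (the `team` labels and `points` values are hypothetical): the index becomes an ordinary column and the values get the column name passed as `name`.

```python
import pandas as pd

# Hypothetical aggregated Series, indexed by group label.
totals = pd.Series([9, 6], index=pd.Index(["red", "blue"], name="team"))

# Move the index into a regular column and name the values column.
totals_df = totals.reset_index(name="points")

print(type(totals_df).__name__)
print(totals_df.columns.tolist())
```

Having both the labels and the values as columns is convenient for plotting, since each axis can then be given a column of the DataFrame.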

*Sort the countries in `total_confirmed` by confirmed cases in descending order.*

In [0]:
###Start code here
total_confirmed = 
###End code here

print(total_confirmed)

We are now ready to make a nice looking barplot using the Seaborn module.

*Assign the correct data to the x and y-axis to plot this barplot:*

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

###Start code here
x = 
y = 
###End code here

plt.figure(figsize=(12,8))
plt.title("#confirmed cases/country")
sns.barplot(x=x,y=y,palette="Paired")
plt.xlabel("Country")
plt.show()

The `Confirmed` column in the `data_covid19` dataset contains the number of confirmed cases for each day. We will add a new column `Confirmed_total` to the dataset that contains the number of confirmed cases up until that day:

In [0]:
data_covid19["Confirmed_total"] = data_covid19.groupby('Key')['Confirmed'].transform(pd.Series.cumsum)
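
To illustrate how a grouped cumulative sum behaves, here is a sketch on made-up data (the `team` and `points` columns are hypothetical). Unlike `.sum()`, which returns one value per group, `.transform` returns one value per original row, so the result can be assigned straight back as a new column.

```python
import pandas as pd

# Hypothetical data with interleaved groups.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red"],
    "points": [1, 2, 3, 4, 5],
})

# Running total within each team, aligned with the original rows.
toy["points_total"] = toy.groupby("team")["points"].transform("cumsum")

# Calling .cumsum() on the grouped column gives the same values.
shortcut = toy.groupby("team")["points"].cumsum()

print(toy)
```

The running totals restart for every group: the `red` rows accumulate to 1, 4, 9 and the `blue` rows to 2, 6, regardless of how the rows are interleaved.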

*Use the `.lineplot` method in Seaborn to plot, for each country (on the same plot), the total number of confirmed cases (x-axis) against the confirmed cases for each day (y-axis):*

In [0]:
plt.figure(figsize=(12,8))

###Start code here
sns.lineplot(...,palette="Paired")
###End code here

plt.show()

*Add a new column to `data_covid19` called `Deaths_total` that contains the number of deaths up until that day:*

In [0]:
###Start code here
data_covid19["Deaths_total"] = 
###End code here

*Use the `.lineplot` method to plot these two new columns with a line for each country:*

In [0]:
plt.figure(figsize=(12,8))

###Start code here
sns.lineplot(...,palette="Paired")
###End code here

plt.show()

*Create one more interesting plot from this dataset:*

In [0]:
###Start code here

###End code here