# Data science 101 with COVID-19 dataset

This notebook is a guide to analyzing the COVID-19 dataset created by Johns Hopkins University.

In the notebook, you'll find several questions that you should answer using the dataset and the tools taught in the learning unit. Most questions are associated with one of the following topics:
- ***Polygraph*** - you have to confirm whether a specific news story is fake (or not)! Fake news buuusters
- ***Shooow time*** - sometimes it’s hard to draw conclusions just by looking at data, but visualizations (charts) make it muuuuch easier ;-)
- ***Fortune telling*** - data analysis is really cool, but what about predicting the future? Isn’t it niceeeer? :-D

To get there, we start with the Colab notebook setup so we can use Google Colab with zero problems.

After the setup, the notebook is split into the following units:
- Unit 0 - Load dataset
- Unit 1 - First overview
- Unit 2 - Data Analysis: Worldwide
- Unit 3 - Data Analysis: Country
- Unit 4 - Making predictions

Enjoooy ;)



## File setup

Run the cell below (to do it, click on the cell and then press "Shift+Enter", the shortcut that runs the commands written in a cell).

In [None]:
import sys
path = './'
sys.path.append(path)

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import requests
import io

pd.read_csv(io.BytesIO(requests.get('https://thorly.education/backend/send-csv',
                                    verify=False).content)).to_csv(path + 'dataset.csv',
                                                                   index=False)

with open("./hackathome.py", "w") as f:
    f.write(requests.get('https://thorly.education/backend/send-py').text)

print("Yeeeeaaahhhhh, Great success!")

**If you don't see 'Yeeeeaaahhhhh, Great success!' printed, call one of the mentors!**

Otherwise, you're ready to start your challenge. Please, write your code below this text! Good luck!

## Challenge start (11h45)

## Unit 0 - Load dataset (11h45-12h15)

#### **Import libraries**

- 'pandas' to handle data
- 'pyplot' to plot charts

In [None]:
# Code here:


#### **Load dataset to a dataframe**

You already have the dataset with the name ***'dataset.csv'*** ready in the folder where this file is ;-)

Load the CSV into a pandas DataFrame.
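
As a rough sketch of what loading looks like, here is `pd.read_csv` applied to a tiny in-memory CSV. The two rows are made up for illustration, and `sample_csv` and `df_sample` are hypothetical names; with the real file you would pass `'dataset.csv'` instead.

```python
import io
import pandas as pd

# Made-up two-row stand-in for 'dataset.csv', reusing the column names
# that show up later in this notebook.
sample_csv = io.StringIO(
    "Country/Region,Date (yyyy/mm/dd),Confirmed cases,Death cases,Recovered cases\n"
    "Germany,2020/01/27,1,0,0\n"
    "Germany,2020/01/28,4,0,0\n"
)

# read_csv accepts a file path or a file-like object.
df_sample = pd.read_csv(sample_csv)
print(df_sample.shape)
```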

In [None]:
# Code here:


## Unit 1. First overview (12h15-13h00)

In this chapter we take a first look at the dataset and make it (more) usable for the next steps.

#### **Check the head, tail and shape of the dataset**

In [None]:
# Code here:


#### **Q1. What are the columns of the dataset?**

In [None]:
# Code here:


##### **Q1.1. Rename the columns of the dataset to make it easier to work with.**
Use the following dictionary to rename the columns:
```
{'Country/Region': 'country',
 'Date (yyyy/mm/dd)': 'date',
 'Confirmed cases': 'confirmed',
 'Death cases': 'deaths',
 'Recovered cases': 'recovered'
}
```
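
One possible way to apply that dictionary is `DataFrame.rename(columns=...)`. The sketch below runs on a made-up one-row frame (`df_sample` and `new_names` are hypothetical names), so treat it as an illustration rather than the expected answer.

```python
import pandas as pd

# Made-up one-row frame with the original column names.
df_sample = pd.DataFrame({
    'Country/Region': ['Germany'],
    'Date (yyyy/mm/dd)': ['2020/01/27'],
    'Confirmed cases': [1],
    'Death cases': [0],
    'Recovered cases': [0],
})

new_names = {'Country/Region': 'country',
             'Date (yyyy/mm/dd)': 'date',
             'Confirmed cases': 'confirmed',
             'Death cases': 'deaths',
             'Recovered cases': 'recovered'}

# rename returns a new DataFrame; the original is left untouched.
df_renamed = df_sample.rename(columns=new_names)
print(list(df_renamed.columns))
```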

In [None]:
# Code here:


#### **Q2. What is the type of each column of the dataset?**

In [None]:
# Code here:


##### **Q2.1. Change the type of the 'date' column to datetime.**
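
A small sketch of the conversion with `pd.to_datetime`, assuming the dates are stored as `yyyy/mm/dd` strings as the original column name suggests (the two dates are made up):

```python
import pandas as pd

# Made-up dates stored as plain strings.
df_sample = pd.DataFrame({'date': ['2020/01/27', '2020/01/28']})

# An explicit format is optional, but it makes the assumption visible.
df_sample['date'] = pd.to_datetime(df_sample['date'], format='%Y/%m/%d')
print(df_sample.dtypes)
```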

In [None]:
# Code here:


#### **Q3. Do we have "nulls" in the dataset?**
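
One common way to check is to count missing values per column with `isnull().sum()`. The frame below is made up and deliberately contains one missing value:

```python
import pandas as pd

# Made-up frame with one missing 'confirmed' value.
df_sample = pd.DataFrame({
    'country': ['Germany', 'Spain', 'Italy'],
    'confirmed': [4, None, 7],
})

# Number of missing values in each column.
null_counts = df_sample.isnull().sum()

# True if there is at least one missing value anywhere in the frame.
has_nulls = bool(df_sample.isnull().values.any())
print(null_counts)
print(has_nulls)
```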

In [None]:
# Code here:


#### **Q4. How many countries do we have represented in the dataset?**
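
Counting distinct values can be sketched with `nunique()`, assuming the country column has already been renamed to `country` (the rows are made up):

```python
import pandas as pd

# Made-up rows: Germany appears twice but should be counted once.
df_sample = pd.DataFrame({'country': ['Germany', 'Germany', 'Spain', 'Italy']})

n_countries = df_sample['country'].nunique()
print(n_countries)
```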

In [None]:
# Code here:


#### **Q5. What are the first and the last 'date' in the dataset?**
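
Once `date` is a datetime column, `min()` and `max()` give the two ends of the range. A sketch on made-up, unsorted dates:

```python
import pandas as pd

# Made-up dates, deliberately out of order.
df_sample = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-28', '2020-01-22', '2020-02-03'])
})

first_date = df_sample['date'].min()
last_date = df_sample['date'].max()
print(first_date, last_date)
```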

In [None]:
# Code here:


## Unit 2. Data Analysis: Worldwide (14h00-15h00)

Time to do some worldwide analysis, looking at global metrics and comparing countries.

#### **Q6. We checked before that we have data for 'confirmed', 'recovered' and 'deaths' cases but something is missing... Add the column for 'active' cases to the dataset.**
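
A sketch under one assumption: that 'active' means confirmed cases that have neither recovered nor died, i.e. `confirmed - deaths - recovered`. That definition is not stated in this notebook, and the numbers are made up.

```python
import pandas as pd

# Made-up cumulative counts for two rows.
df_sample = pd.DataFrame({
    'confirmed': [100, 250],
    'deaths': [5, 10],
    'recovered': [20, 90],
})

# Assumed definition: active = confirmed - deaths - recovered.
df_sample['active'] = (
    df_sample['confirmed'] - df_sample['deaths'] - df_sample['recovered']
)
print(df_sample)
```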

In [None]:
# Code here:


#### **Q7. Create a dataset that has the cumulative ACTIVE cases worldwide per day.**
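
One way to get a worldwide figure per day is to group by `date` and sum over countries. The sketch assumes each row already holds a cumulative per-country count and that an `active` column exists; the values are made up.

```python
import pandas as pd

# Made-up per-country active counts for two days.
df_sample = pd.DataFrame({
    'country': ['Germany', 'Spain', 'Germany', 'Spain'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-01',
                            '2020-03-02', '2020-03-02']),
    'active': [10, 5, 20, 8],
})

# One row per date, with the active cases summed across all countries.
world_active = df_sample.groupby('date', as_index=False)['active'].sum()
print(world_active)
```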

In [None]:
# Code here:


![showtime](https://media.giphy.com/media/13ZVRnWnmSMaRy/giphy.gif)

***IT'S SHOOOOOOW TIME***

#### **Q8. Plot the curve with the evolution of the active cases worldwide.**

In [None]:
# Code here:


#### **Q9. Get the top 10 countries by confirmed cases and plot them in a bar chart.**
> **Hint**: You can filter the dataset by the latest day of the dataset (it has the number of confirmed cases in each country). Then sort the values by 'confirmed' and use `head(10)` to get the top 10.
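
The hint above can be sketched on a made-up three-country frame. Here `head(2)` stands in for `head(10)` only because the toy data is so small.

```python
import pandas as pd

# Made-up cumulative confirmed counts for three countries over two days.
df_sample = pd.DataFrame({
    'country': ['Germany', 'Spain', 'Italy'] * 2,
    'date': pd.to_datetime(['2020-03-01'] * 3 + ['2020-03-02'] * 3),
    'confirmed': [1, 5, 3, 10, 7, 30],
})

# Keep only the rows of the latest day.
latest = df_sample[df_sample['date'] == df_sample['date'].max()]

# Sort from highest to lowest and keep the first rows.
top = latest.sort_values('confirmed', ascending=False).head(2)
print(top)
```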

In [None]:
# Code here to get the top 10:


In [None]:
# Code here to plot the bar chart:


#### **Q10. Calculate the recovery percentage and mortality rate for each country on the latest day.**

> **Hint:** Create a dataset with just the last date and create a column for each of the ratios we want:
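
A sketch of the two ratio columns, assuming both are measured against confirmed cases (recovered / confirmed and deaths / confirmed, as percentages). The notebook does not spell out these definitions, and the numbers are made up.

```python
import pandas as pd

# Made-up latest-day figures for two countries.
latest = pd.DataFrame({
    'country': ['Germany', 'Spain'],
    'confirmed': [200, 100],
    'deaths': [10, 4],
    'recovered': [150, 50],
})

# Assumed definitions: both ratios are relative to confirmed cases.
latest['recovery_pct'] = latest['recovered'] / latest['confirmed'] * 100
latest['mortality_pct'] = latest['deaths'] / latest['confirmed'] * 100

# Country with the highest recovery percentage in this toy frame.
best = latest.loc[latest['recovery_pct'].idxmax(), 'country']
print(latest)
print(best)
```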

In [None]:
# Code here:


![](https://media.giphy.com/media/rbaC8w0QY1vGw/giphy.gif)

##### **POLYGRAPH TIME!!!**

> ***BREAKING NEWS:***
> 
> *Netherlands has the highest recovery percentage in the world!*


##### **Q10.1. Is this fake news?**

In [None]:
# Code here:


## Unit 3. Data Analysis: Country (15h00-16h15)

From now on, you will only analyse one country that should be selected from the list below.

For this chapter, use the following population values (you will need your country's population to answer a later question):

In [None]:
population = {'US': 331002651,
              'Spain': 46754778,
              'Italy': 60461826,
              'Portugal': 10196709,
              'United Kingdom': 67886011,
              'Germany': 83783942,
              'Norway': 5421241,
              'Belgium': 11589623,
              'Netherlands': 17134872}

### Attention: **The solution assumes your country is Germany**!

#### **Q11. Get a dataset that contains only information regarding your country.**
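
Filtering by country is a boolean mask on the `country` column. A sketch on made-up rows, with `my_country` as a hypothetical variable name:

```python
import pandas as pd

# Made-up rows for two countries.
df_sample = pd.DataFrame({
    'country': ['Germany', 'Spain', 'Germany'],
    'confirmed': [1, 2, 4],
})

my_country = 'Germany'

# .copy() gives an independent frame, so new columns can be added safely later.
country_df = df_sample[df_sample['country'] == my_country].copy()
print(country_df)
```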

In [None]:
# Code here:


#### **Q12. Create a dataset that only contains information from the day the first case was registered in your country onwards. What day was that, and how many cases were registered on that day?**
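
One way to read "since the first case" is to keep only the rows where `confirmed` is greater than zero, as in the sketch below (made-up counts; `since_first` and `first_row` are hypothetical names):

```python
import pandas as pd

# Made-up cumulative counts for one country; the first two days have no cases.
country_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-25', '2020-01-26',
                            '2020-01-27', '2020-01-28']),
    'confirmed': [0, 0, 1, 4],
})

# Keep only the days with at least one confirmed case.
since_first = country_df[country_df['confirmed'] > 0].copy()

# The earliest remaining row is the day of the first registered case.
first_row = since_first.sort_values('date').iloc[0]
print(first_row['date'], first_row['confirmed'])
```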

In [None]:
# Code here:


#### **Q13. Add a column with the confirmed cases per 1M of population.**
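
The per-million figure is `confirmed / population * 1,000,000`. The sketch uses Germany's population from the dictionary above, with made-up confirmed counts:

```python
import pandas as pd

# Population value taken from the notebook's dictionary.
population = {'Germany': 83783942}

# Made-up cumulative confirmed counts.
country_df = pd.DataFrame({'confirmed': [1, 4, 83784]})

# Scale to cases per 1M inhabitants.
country_df['confirmed_pop'] = (
    country_df['confirmed'] / population['Germany'] * 1_000_000
)
print(country_df)
```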

In [None]:
# Code here:


#### **Q14. Add a column that records the number of days since the first case was registered in your country.**
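
Subtracting the earliest date from a datetime column gives timedeltas, and `.dt.days` turns them into whole days. A sketch on made-up dates, assuming the frame already starts at the first case:

```python
import pandas as pd

# Made-up dates, starting on the day of the first case.
country_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-27', '2020-01-28', '2020-02-01'])
})

# Days elapsed since the earliest date in the frame.
country_df['days'] = (country_df['date'] - country_df['date'].min()).dt.days
print(country_df)
```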

In [None]:
# Code here:


![showtime](https://media.giphy.com/media/fxqt51CAMGITJlxcRI/giphy.gif)

***IT'S SHOOOOOOW TIME***

#### **Q15. Plot the curve with the evolution of confirmed cases per 1M of population since the 1st case for your country**

In [None]:
# Code here:


#### **Q16. How does your country compare to others in terms of confirmed cases?**

Just run the code below and analyze the chart. Compare your country with others.

In [None]:
# Run this cell:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

population_plot = {'US': 331002651,'Spain': 46754778,'Italy': 60461826,'Portugal': 10196709,
                   'United Kingdom': 67886011,'Germany': 83783942,'Norway': 5421241,
                   'Belgium': 11589623,'Netherlands': 17134872,'Sweden': 10086360}

df_thorly = pd.read_csv('./dataset.csv')
df_thorly['Date (yyyy/mm/dd)'] = pd.to_datetime(df_thorly['Date (yyyy/mm/dd)'])
plt.figure(figsize=(10,10))

# Loop over the dictionary defined in this cell so every listed country is plotted
for country in population_plot:
  country_df = df_thorly[df_thorly['Country/Region'] == country]
  # .copy() so the new columns are added to an independent frame, not a view
  country_df = country_df[country_df['Confirmed cases'] > 0].copy()
  country_df['confirmed_pop'] = ((country_df['Confirmed cases'] / population_plot[country]) * 1000000)
  country_df['days'] = (country_df['Date (yyyy/mm/dd)'] - country_df['Date (yyyy/mm/dd)'].min()).dt.days

  plt.plot(country_df.days, country_df.confirmed_pop, label = country)

plt.legend()
plt.grid()
plt.xlabel('Number of days since the first case')
plt.ylabel('Confirmed cases per 1M of population')
plt.title('Evolution of confirmed cases per 1M of population since the 1st case')
plt.show()

![](https://media.giphy.com/media/rbaC8w0QY1vGw/giphy.gif)

##### **POLYGRAPH TIME!!!**

> ***BREAKING NEWS:***
> 
> *Belgium is the country (among the countries shown on the chart) with the highest number of cases per 1M of population!*


##### **Q16.1. Is this fake news?**

*(There is no need to code in this question... Analyze the chart above)*

## Unit 4. Making predictions (16h30-17h45)
Here we define models and use them to make predictions. There are a lot of models that can be used, but we prefer to keep things simple for now... Use the models we explained in the learning unit.

Ah and remember that you should remove the "first wave data" from the dataset so your models can analyze and predict the second wave in better conditions!
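
The models from the learning unit are not shown in this notebook, so the sketch below swaps in a plain straight-line fit with `numpy.polyfit` as a stand-in. The cutoff date marking the start of the "second wave" and all the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up cumulative confirmed counts for eight consecutive days.
country_df = pd.DataFrame({
    'date': pd.date_range('2020-09-28', periods=8, freq='D'),
    'confirmed': [90, 95, 100, 110, 120, 130, 140, 150],
})

# Hypothetical cutoff: rows before this date count as "first wave" and are dropped.
cutoff = pd.Timestamp('2020-10-01')
second_wave = country_df[country_df['date'] >= cutoff].copy()

# Use the number of days since the cutoff as the model input.
second_wave['days'] = (second_wave['date'] - cutoff).dt.days

# Stand-in model: least-squares straight line, confirmed = slope * days + intercept.
slope, intercept = np.polyfit(second_wave['days'], second_wave['confirmed'], deg=1)

# Predict the day right after the last one in the data.
next_day = second_wave['days'].max() + 1
prediction = slope * next_day + intercept
print(prediction)
```

A straight line is rarely a good fit for epidemic curves over long periods; it is used here only because it is the simplest model to read.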

![prediction](https://media.giphy.com/media/3o72F5tx9CEhSDxonC/giphy.gif)

#### Time for **fortune telling**!!!

In [None]:
# Code here:

Then, to create the prediction file:

In [None]:
# Code here:


**Now you can get your file by opening the left pane and downloading it with a click of the left mouse button. After that, upload the file to the leaderboard using the instructions in the platform!**

## Finish challenge (17h45)

![](https://media.giphy.com/media/lD76yTC5zxZPG/giphy.gif)

**Congratulations, you've worked through an entire (simple) data science project.**

Now it's time to finish your presentations!