# Data Mining and Machine Learning - Assignment 1 
## EDA, Visualization and Regression

#### Due: Sunday Nov 01 11:59pm 

The purpose of this assignment is to review the different concepts seen in class so far (i.e. data cleaning, EDA, visualization and regression). To this end, we analyze COVID-19 data.

Run the first few cells to load the dataset and then get started with the questions! Once you are done you have to do **both**:

1. Submit your Python notebook [here](https://moodle.unil.ch/mod/assign/view.php?id=841447)
2. Answer the quiz questions [here](https://moodle.unil.ch/mod/quiz/view.php?id=918142)

The answers to the quiz should be supported by your code. If they are not, you will not receive points for them.

**IMPORTANT!** You can discuss the questions with other students but **do not exchange code!** We will run your code and check for similarities.

You can post your questions on Slack (channel `#assignment1_questions`).

If the questions need further clarification after the assignment is released, we will update this file, so make sure you check the git repo for updates.


Good luck!

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium

%matplotlib inline
sns.set_style("whitegrid")

In [2]:
# Load data
df = pd.read_csv("data/COVID.csv")

## 1. Understand and Clean the Dataset

### 1.1 Show the first 5 or 10 rows to get an idea of the data. What does each column show?

In [7]:
# YOUR CODE HERE
df.head()

| Unnamed: 0 | AdministrativeDivision | Country | Latitude | Longitude | Deaths | Confirmed | Recovered | Date |
|---|---|---|---|---|---|---|---|---|
| 0 | Maharashtra | India | 19.453778 | 76.12177 | 8053.0 | 180298.0 | 93154.0 | 2020-07-01 |
| 1 | Maharashtra | India | 19.453778 | 76.12177 | 8178.0 | 186626.0 | 101172.0 | 2020-07-02 |
| 2 | Maharashtra | India | 19.453778 | 76.12177 | 8376.0 | 192990.0 | 104687.0 | 2020-07-03 |
| 3 | Maharashtra | India | 19.453778 | 76.12177 | 8671.0 | 200064.0 | 108082.0 | 2020-07-04 |
| 4 | Maharashtra | India | 19.453778 | 76.12177 | 8822.0 | 206619.0 | 111740.0 | 2020-07-05 |


### 1.2 Describe the dataset.

#### 1.2.1 How many rows and columns are in the dataset? 

In [3]:
# YOUR CODE HERE


#### 1.2.2 Describe the different features (small description and their type)

In [3]:
# YOUR CODE HERE


#### 1.2.3 Over which time period were these data samples collected? (i.e., the oldest and the most recent dates in the dataset)

In [3]:
# YOUR CODE HERE


#### 1.2.4 Are there missing/null values?

In [3]:
# YOUR CODE HERE


#### 1.2.5 How many countries are there?

In [3]:
# YOUR CODE HERE


#### 1.2.6 How many "Administrative Divisions" are there for the country "United States"?

In [3]:
# YOUR CODE HERE


### 1.3 Data types
#### 1.3.1 Show how the column data types are interpreted when the data is loaded. For which column(s) would you like to change the data type? 

In [4]:
# YOUR CODE HERE

#### 1.3.2 Change the data types that need to be changed.

In [5]:
# YOUR CODE HERE
# hint: you will need to use the "pd.to_datetime" function
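
As an illustration of the hint, here is a minimal sketch of `pd.to_datetime` on made-up toy data. The DataFrame `toy` and its values are hypothetical and are not the assignment's dataset; the point is only that a date column read from text is not a datetime type until it is converted.

```python
import pandas as pd

# Toy data (hypothetical values): dates are stored as plain strings
toy = pd.DataFrame({
    "Date": ["2020-07-01", "2020-07-02", "2020-07-03"],
    "Confirmed": [10.0, 12.0, 15.0],
})
print(toy.dtypes)  # "Date" is not yet a datetime dtype

# Convert the string column to datetime
toy["Date"] = pd.to_datetime(toy["Date"])
print(toy.dtypes)  # "Date" is now datetime64

# Once converted, min/max give the oldest and most recent dates
print(toy["Date"].min(), toy["Date"].max())
```

After the conversion, date comparisons and the `.dt` accessor become available on the column.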

### 1.4 Null values
#### 1.4.1 Fill the null values in the columns `Confirmed`, `Deaths`, and `Recovered` with zero.

In [6]:
# YOUR CODE HERE
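
One possible way to fill nulls in selected columns is `DataFrame.fillna`. The sketch below uses a hypothetical toy DataFrame (not the assignment's data) with only two numeric columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing entries
toy = pd.DataFrame({
    "Region": ["A", "B", "C"],
    "Confirmed": [5.0, np.nan, 7.0],
    "Deaths": [np.nan, 1.0, 0.0],
})

# Fill nulls with zero, but only in the chosen numeric columns
cols = ["Confirmed", "Deaths"]
toy[cols] = toy[cols].fillna(0)

# Count how many nulls remain in those columns
remaining_nulls = int(toy[cols].isna().sum().sum())
print(remaining_nulls)
```

Restricting `fillna` to a column list leaves other columns, such as text fields, untouched.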

## 2. EDA and Visualization

### 2.1 For the most recent date, calculate the number of `Confirmed` cases by `Country`. Which country has the second highest number of cases?

In [7]:
# YOUR CODE HERE
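
The general pattern here (filter to the latest date, sum per group, sort) can be sketched on toy data as follows. All names and values in `toy` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): country X has two divisions on the latest date
toy = pd.DataFrame({
    "Country": ["X", "X", "X", "Y", "Z"],
    "Division": ["N", "N", "S", "-", "-"],
    "Date": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-02", "2020-07-02", "2020-07-02"]
    ),
    "Confirmed": [5, 8, 4, 20, 9],
})

# Keep only the rows for the most recent date
latest = toy[toy["Date"] == toy["Date"].max()]

# Sum per country and sort from largest to smallest
totals = latest.groupby("Country")["Confirmed"].sum().sort_values(ascending=False)

# The second entry of the sorted index is the second-highest country
second = totals.index[1]
print(totals)
print(second)
```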

### 2.2 For the most recent date, plot the top 10 countries in terms of number of cases using an appropriate plot type.

In [8]:
# YOUR CODE HERE

### 2.3 Create a new DataFrame where you group the information by `Country` and `Date` (i.e. we want to get rid of `AdministrativeDivision`). The remaining columns of the new DataFrame are as follows:
* Drop `AdministrativeDivision`.
* `Latitude` and `Longitude` should be averaged per `Country`.
* `Deaths`, `Confirmed`, and `Recovered` should be accumulated per `Country` and `Date`.

__In the remainder of the assignment, we will work with this DataFrame.__

In [9]:
# YOUR CODE HERE
# hint: you should first drop a column and then do a group by followed by the agg function
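
The drop, group by, and `agg` pattern from the hint can be sketched on toy data. The column `Division` and all values are hypothetical; a dictionary passed to `agg` lets each column use its own aggregation (mean for coordinates, sum for counts).

```python
import pandas as pd

# Toy data (hypothetical): one country, two divisions, two dates
toy = pd.DataFrame({
    "Division": ["N", "S", "N", "S"],
    "Country": ["X", "X", "X", "X"],
    "Date": pd.to_datetime(
        ["2020-07-01", "2020-07-01", "2020-07-02", "2020-07-02"]
    ),
    "Latitude": [10.0, 20.0, 10.0, 20.0],
    "Confirmed": [1.0, 2.0, 3.0, 4.0],
})

grouped = (
    toy.drop(columns=["Division"])                    # drop the division column
    .groupby(["Country", "Date"], as_index=False)     # one row per country and date
    .agg({"Latitude": "mean", "Confirmed": "sum"})    # per-column aggregation
)
print(grouped)
```

Using `as_index=False` keeps `Country` and `Date` as ordinary columns rather than an index.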

#### 2.3.1 How many rows and columns does the new DataFrame have?

In [10]:
# YOUR CODE HERE
# hint: as a useful sanity check, verify that the number of rows in this DataFrame equals
# the number of countries times the number of days spanned by this dataset

#### 2.3.2 What was the total number of deaths due to COVID-19 in India by the 3rd of July 2020?

In [11]:
# YOUR CODE HERE

#### 2.3.3 Which country has the lowest ratio of total recovered cases to total confirmed cases by the 6th of October 2020? (Ignore the countries that have zero recovered cases, as this is most likely due to missing data.)

In [12]:
# YOUR CODE HERE

### 2.4 BONUS QUESTION. The columns `Deaths`, `Confirmed`, and `Recovered` are expressed using cumulative amounts. Create three new columns `Deaths New`, `Confirmed New`, and `Recovered New` and compute the number of new cases per `Country` for each `Date`. How many new confirmed cases were there in Zimbabwe on October 2, 2020?

In [13]:
# YOUR CODE HERE
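
When several groups share one column, a plain `diff()` would subtract across group boundaries. A minimal sketch of a per-group difference, on hypothetical toy data, looks like this:

```python
import pandas as pd

# Toy data (hypothetical): cumulative counts for two countries
toy = pd.DataFrame({
    "Country": ["X", "X", "X", "Y", "Y", "Y"],
    "Date": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-03"] * 2
    ),
    "Confirmed": [1, 4, 9, 10, 10, 15],
})

# Sort so that each country's rows are in chronological order
toy = toy.sort_values(["Country", "Date"])

# diff() within each country; the first row of each group has no predecessor
toy["Confirmed New"] = toy.groupby("Country")["Confirmed"].diff()
print(toy)
```

The first row of each group is `NaN` because there is no earlier value to subtract.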

### 2.5 Plot the cumulative number of `Deaths` over time for Switzerland.

In [14]:
# YOUR CODE HERE

### 2.6 Plot the number of new `Deaths` over time for Switzerland (i.e. not cumulative). On which day in this time period (1st of July until 6th of October) do we see a peak in the daily number of deaths in Switzerland?
__Hint: Note that for this question you don't need to have the bonus question solved. There is a method that you can apply on pandas Series in order to compute the difference between each row and its previous row (or any other element in the Series). This method is called `diff()`. Check out the documentation for more info.__

In [15]:
# YOUR CODE HERE
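
To illustrate the `diff()` method mentioned in the hint, here is a small sketch on a made-up cumulative Series (values are hypothetical, not Swiss data). It turns cumulative totals into daily increments and locates the day with the largest increment.

```python
import pandas as pd

# Toy cumulative series (hypothetical values) indexed by date
dates = pd.date_range("2020-07-01", periods=5, freq="D")
cumulative = pd.Series([10, 12, 12, 17, 18], index=dates)

# Difference between each value and the previous one
daily_new = cumulative.diff()
print(daily_new)  # first entry is NaN, since it has no predecessor

# Date of the largest daily increase (NaN is skipped by default)
peak_day = daily_new.idxmax()
print(peak_day)
```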

### 2.7 Using the method you used in the previous question, plot the number of new deaths per week in Switzerland. How many new deaths occurred in the week starting on the 16th of September and ending on the 22nd of September?

In [16]:
# YOUR CODE HERE
# hint: a week is seven days so you should do ".resample('7D')"
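
A minimal sketch of `.resample('7D')` on a made-up daily Series (hypothetical values) is shown below. It assumes a Series with a `DatetimeIndex`; with pandas' default settings the 7-day bins start at the first date in the Series, so the week boundaries depend on where the data begins.

```python
import pandas as pd

# Toy daily series (hypothetical): 14 days, 1 per day then 2 per day
dates = pd.date_range("2020-09-02", periods=14, freq="D")
daily_new = pd.Series([1] * 7 + [2] * 7, index=dates)

# Sum the daily values in consecutive 7-day bins
weekly = daily_new.resample("7D").sum()
print(weekly)
```

Each label in the result's index is the first day of its 7-day bin.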

### 2.8 Bonus question: For the most recent date (i.e., 6th of October) and the top 10 countries, generate a map showing the cumulative number of confirmed cases per country. For example, color countries according to their number of confirmed cases.

In [17]:
# YOUR CODE HERE

## 3. Regression
__For this section you need to work on the data aggregated by country, i.e., the DataFrame you created in question 2.3. We provide that DataFrame here, so if something went wrong in Section 2, you can still complete this section correctly.__

In [18]:
df_countries = pd.read_csv("data/COVID_per_country.csv")

### 3.1 Create two new columns `Day` and `Month` containing, respectively, the day and month values of each row.

In [20]:
# YOUR CODE HERE
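
Day and month values can be extracted with the `.dt` accessor once a column holds datetimes. The sketch below uses a hypothetical toy DataFrame; note that a date column freshly loaded from a CSV file is text and needs converting first.

```python
import pandas as pd

# Toy data (hypothetical): dates as strings, as they would be after read_csv
toy = pd.DataFrame({"Date": ["2020-07-01", "2020-10-06"]})

# Convert to datetime so that the .dt accessor is available
toy["Date"] = pd.to_datetime(toy["Date"])

# Extract the calendar day and month as integer columns
toy["Day"] = toy["Date"].dt.day
toy["Month"] = toy["Date"].dt.month
print(toy)
```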

### 3.2 Regress `Confirmed` (y) on `Latitude`, `Longitude`, `Day`, and `Month` (X).

#### 3.2.1 Select the dependent variable (y) and independent (X) variables

In [38]:
# YOUR CODE HERE

#### 3.2.2 Split your dataset into a training set (80%) and a test set (20%). Use `sklearn.model_selection.train_test_split()` and set the `random_state` to 42.

In [21]:
# YOUR CODE HERE
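
For reference, a minimal sketch of an 80/20 split with `train_test_split` on made-up arrays. The arrays `X` and `y` are hypothetical placeholders, not the assignment's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the samples; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so results can be compared between runs.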

#### 3.2.3 Train a linear regression model on the training data. What is the R^2 score for the training data? (Round your answer to 2 decimal places.)

In [22]:
# YOUR CODE HERE
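
A minimal sketch of fitting and scoring a linear model with scikit-learn, on made-up data that lies exactly on a line (so the fit is perfect by construction). The arrays are hypothetical; `score` returns the R^2 value on whatever data it is given.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X[:, 0] + 1

# Fit the model and compute R^2 on the same data
model = LinearRegression().fit(X, y)
r2 = model.score(X, y)

# Predict for a new input; it must be 2-D (n_samples, n_features)
pred = model.predict(np.array([[4.0]]))
print(round(r2, 2), pred)
```

On real data the R^2 score is usually well below 1, and it can differ noticeably between the training and test sets.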

#### 3.2.4 Predict: What would be the total number of confirmed cases by 10th of October for France?

In [23]:
# YOUR CODE HERE

__You can see that this prediction is pretty inaccurate. Try to train a new model but this time only on the data samples for France. Could you improve the prediction?__

In [24]:
# YOUR CODE HERE
# hint: you should first filter the dataframe in order to have only the data samples for France