# Part 1: Import and cleaning of data
The first part of the project covers importing and cleaning the data used for the machine learning algorithm. Two datasets downloaded from Kaggle are used. The first is a weather dataset for 16 South Korean provinces. The second contains Covid-19 case numbers for the same 16 provinces. In Part 1, the datasets are imported, cleaned and merged.
## 1.1 Weather Dataset
This dataset was downloaded from https://www.kaggle.com/kimjihoo/coronavirusdataset. It contains data on different weather metrics in 16 provinces in South Korea. To import and clean the dataset, I used the pandas library in Python.
### 1.1.1 Import

In [None]:
# Import the pandas package and load the weather data
import pandas as pd
Weather = pd.read_csv('Weather.csv')

### 1.1.2 Initial analysis of data
In the following, an initial analysis of the data is conducted to get a better feel for the data.

In [None]:
# Get the first ten rows
Weather.head(10)

In [None]:
# Get the shape of the data frame
print('The shape of the dataframe is:', Weather.shape)

In [None]:
# Get the number of provinces and name them
provinces_list=Weather["province"].unique()
number=len(provinces_list)
print(f"There are {number} provinces. They are called:\n{provinces_list}")

In [None]:
# Get Descriptive statistics for each column
Weather.describe()

### 1.1.3 Cleaning the dataset
#### 1.1.3.1 Correcting wrong province name
In the rows for June 2020, the name of the province "Chungcheongbuk-do" is misspelled as "Chunghceongbuk-do". This is corrected with the code below.

In [None]:
# Correcting the name for Chungcheongbuk-do
# The correction was done with the method and code explained in: https://www.kite.com/python/answers/how-to-change-values-in-a-pandas-dataframe-column-based-on-a-condition-in-python
Weather.loc[(Weather.province == 'Chunghceongbuk-do'),'province']='Chungcheongbuk-do'

#### 1.1.3.2 Selecting the province you want to look at
Since weather (and its effects) can differ quite a bit between provinces, it is appropriate to create one algorithm per province, using only the weather data for that province. For this reason, the user selects one province through the input function. After that, all rows belonging to other provinces are dropped.

In [None]:
# In this part, the province is chosen.

# First, an overview of all available provinces is given.
# For this, all provinces are extracted from the dataset with the .unique() function and stored in a list
provinces=list(Weather["province"].unique())
print("These are the provinces you can choose from: ")
# Here, a for loop is used to print every province from the list on its own line
for province in provinces:
    print (province)

# After getting an overview of the provinces, the user is asked through an input function which province to look at
Province=input("Which province do you want to look at? ")

# With an if-else statement it is checked whether a correct province name was entered
# If the name is invalid, this cell needs to be run again before continuing
if Province in provinces:
    print(f"In the following, a machine learning algorithm for {Province} will be created. ")
else:
    print(f"{Province} is an invalid province. Please try again. ")
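The cell above only prints a message when the name is invalid, so the user has to re-run it by hand. As a minimal sketch of an alternative, the check can be wrapped in a loop that asks again. The function name `ask_for_province`, its `ask` parameter and the attempt limit are assumptions made for illustration and are not part of the original notebook.

```python
def ask_for_province(provinces, ask=input, max_attempts=3):
    """Prompt until a valid province is entered or the attempts run out."""
    for _ in range(max_attempts):
        answer = ask("Which province do you want to look at? ").strip()
        if answer in provinces:
            return answer
        print(f"{answer} is an invalid province. Please try again.")
    raise ValueError("No valid province was entered.")

# A scripted stand-in for input() so that the sketch runs unattended:
# the first answer contains a typo, the second one is valid
scripted_answers = iter(["Soul", "Seoul"])
chosen = ask_for_province(["Seoul", "Busan"], ask=lambda prompt: next(scripted_answers))
print(chosen)
```

In the notebook itself, `ask_for_province(provinces)` would use the real `input` function.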

#### 1.1.3.3 Cleaning the dataset
Here, the dataset is reduced to the chosen province and indexed by date.

In [None]:
# Here, all rows with other provinces are dropped
Weather=Weather[Weather["province"]==Province]
# Since every row corresponds to a date, the date is set as the index
Weather=Weather.set_index("date")
# The data type of the index is changed from object to datetime to make it easier to work with
Weather.index = pd.to_datetime(Weather.index)
# The columns code and province are not needed anymore, which is why they are dropped
Weather=Weather.drop(["code","province"], axis=1)
# The first ten rows are shown here
Weather.head(10)

## 1.2 Cases dataset
This dataset was downloaded from https://www.kaggle.com/kimjihoo/coronavirusdataset. It contains data on Covid-19 case numbers in 16 provinces in South Korea. To import and clean the dataset, I used the pandas library in Python.
### 1.2.1 Import dataset

In [None]:
#Reading in the cases dataframe and saving it as Cases
Cases= pd.read_csv("TimeProvince.csv")

### 1.2.2 Initial Analysis of Data
In the following, an initial analysis of the data is conducted to get a better feel for the data.

In [None]:
# Get the first ten rows
Cases.head(10)

In [None]:
# Get the shape of the data frame
print('The shape of the dataframe is:', Cases.shape)

In [None]:
# Get Descriptive statistics for each column
Cases.describe()

### 1.2.3 Cleaning the dataset
#### 1.2.3.1 Cleaning the dataset
From this dataset, only the date and the number of cases for one province are needed. That's why all the other information will be dropped from this dataframe.

In [None]:
# As stated, only one province will be looked at. That's why rows with other provinces are dropped
Cases= Cases[Cases["province"]==Province]
# From this dataset I only need the case numbers. Consequently, all other columns are dropped
Cases=Cases.drop(["time", "province", "released", "deceased"],axis=1)
# Since every row corresponds to a date, the date is set as index
Cases= Cases.set_index("date")
# The datatype of the index is changed from object to datetime to make it easier to work with
Cases.index=pd.to_datetime(Cases.index)
# The first ten rows are shown here
Cases.head(10)

#### 1.2.3.2 Data Manipulation
In the dataset, the case numbers are given in cumulative form. This means that the confirmed case number for a specific date comprises all cases up to and including that date. For the analysis, however, the daily case numbers are needed. Because of this, the daily case numbers are calculated in the following. The equation is:

***

$$\text{Daily Cases}_{t} = \text{Cumulative Cases}_{t} - \text{Cumulative Cases}_{t-1}$$

***

In [None]:
# First, a column for the daily cases is created. Initially, all values equal 0
Cases["daily_cases"]=0
# The daily cases are calculated from the cumulative cases.
# For this, all dates for which data exists (the index of the dataframe) are stored in a list
index_list=list(Cases.index)

# Here, the actual calculation is made. Two rows are needed at a time (the current and the previous date),
# which is why the loop runs over the positions in the index list.
# The loop starts at 1 because the first date has no previous date;
# starting at 0 would compare the first date with the last one (position -1)
for i in range(1, len(index_list)):
    Cases.loc[index_list[i],"daily_cases"]=Cases.loc[index_list[i],"confirmed"]-Cases.loc[index_list[i-1],"confirmed"]

# The daily case number for the first date cannot be calculated and is set by hand.
# It is 0 for all provinces except Incheon, where it is 1.
# An if-else statement checks whether the chosen province is Incheon
if Province == "Incheon":
    first_day=1
else:
    first_day=0
Cases.loc[index_list[0],"daily_cases"]=first_day
# Only the daily cases are needed, so the cumulative cases are dropped
Cases=Cases.drop("confirmed",axis=1)
# The first ten rows are shown
Cases.head(10)
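The difference between consecutive cumulative values can also be sketched with the pandas `diff` method, which avoids the explicit loop. The numbers below are made up for illustration. Filling the first day with the first cumulative value is an assumption: it matches the rule used above only if the cumulative series starts at 1 for Incheon and at 0 for the other provinces.

```python
import pandas as pd

# Hypothetical cumulative case counts for four consecutive days
toy = pd.DataFrame(
    {"confirmed": [1, 1, 4, 9]},
    index=pd.to_datetime(["2020-01-20", "2020-01-21", "2020-01-22", "2020-01-23"]),
)

# diff() computes confirmed_t - confirmed_(t-1); the first row has no predecessor and becomes NaN
toy["daily_cases"] = toy["confirmed"].diff()

# Assumption: there were no cases before the first recorded date,
# so the first daily value equals the first cumulative value
toy["daily_cases"] = toy["daily_cases"].fillna(toy["confirmed"]).astype(int)

daily = toy["daily_cases"].tolist()
print(daily)
```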

### 1.2.4 Visualizing the dataset
Just to get a feel for the daily cases, they were plotted using the matplotlib library.

In [None]:
import matplotlib.pyplot as plt
# Here the style of the plot is changed
# Note: from matplotlib 3.6 on, the seaborn styles carry a "seaborn-v0_8-" prefix (e.g. "seaborn-v0_8-poster")
plt.style.use(["tableau-colorblind10","seaborn-poster"])
# Here I set the label of the x-axis, the label of the y-axis and the title of the plot
plt.xlabel('Date'); plt.ylabel('Cases'); plt.title(f'Daily Covid-19 Cases in {Province}');
# Here the data that needs to be plotted (the Cases dataframe) is specified
plt.plot(Cases)
# The x-axis was not very readable since the labels were horizontal (the dates were overlapping).
# Because of this, I rotated them to be vertical by setting rotation = 90
plt.xticks(rotation=90)

## 1.3 Merging both dataframes
### 1.3.1 Merging
For the analysis, one dataframe containing the weather data and the case numbers is needed. The data is merged using the merge function from the pandas library. The datasets are merged by their index, meaning by the dates (for the same date, the data of both dataframes is now shown in one row). 

An inner join was chosen for this function. The figure below visualises the different join methods. An inner join only includes those dates that appear in both dataframes. For instance, the Weather dataframe contains data going back to 2016. This data is consequently not included with this join method. This method was used because it is an easy way to get rid of all the weather data from before Covid-19, which is useless in this analysis.

![Overview of the different join methods](https://miro.medium.com/max/1200/1*9eH1_7VbTZPZd9jBiGIyNA.png)

In [None]:
# Here, both dataframes are joined using the pandas merge function
# "how" specifies the join method; left_index=True and right_index=True specify that the dataframes are merged by their index
df=pd.merge(Weather,Cases, how='inner', left_index=True, right_index=True)
# Here the first and last five rows of the new dataframe are shown
df
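To illustrate what the inner join on the index does, here is a small sketch with two made-up dataframes. The column name `avg_temp` and all values are hypothetical. Only the two dates that occur in both frames survive the merge.

```python
import pandas as pd

# Hypothetical weather data: one date lies before the first case date
weather_toy = pd.DataFrame(
    {"avg_temp": [2.1, 3.4, 1.0]},
    index=pd.to_datetime(["2019-12-31", "2020-01-20", "2020-01-21"]),
)
# Hypothetical case data: one date has no matching weather row
cases_toy = pd.DataFrame(
    {"daily_cases": [1, 0, 2]},
    index=pd.to_datetime(["2020-01-20", "2020-01-21", "2020-01-22"]),
)

# Inner join on the index keeps only the dates present in both dataframes
merged_toy = pd.merge(weather_toy, cases_toy, how="inner", left_index=True, right_index=True)
print(merged_toy)
```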

### 1.3.2 Cleaning
To clean the dataset, only the dropna function from the pandas library is needed. It drops all rows in which a value is missing. This is done because the machine learning code doesn't work well if data is missing.

In [None]:
#Code from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
#Drops all rows where some values are missing
df=df.dropna()

### 1.3.3 Initial analysis of data
In the following, an initial analysis of the data is conducted to get a better feel for the merged dataframe.

In [None]:
# Get the first ten rows
df.head(10)

In [None]:
# Get the shape of the data frame
print('The shape of the dataframe is:', df.shape)

In [None]:
# Get Descriptive statistics for each column
df.describe()

# Part 2: Machine Learning
In the following, the actual machine learning algorithm is created. First, some data processing is needed to make the dataset compatible with the library (sklearn) and functions used. Secondly, a baseline value is defined, which the algorithm needs to beat to be deemed useful. Thirdly, the machine learning algorithm that predicts Covid-19 cases based on the weather is trained and tested, and the hyperparameters used in the model are tuned to get the best result. The model used for this is the Random Forest model, which is explained in more detail below. Fourthly, the results of the Random Forest are compared with the defined baseline value to determine whether the model is useful. Finally, the results are visualized.
## 2.1 Data Processing
To make the dataset compatible with the sklearn library, some data processing needs to be done. Firstly, the features need to be separated from the targets. The targets are what needs to be predicted; in this case, these are the daily cases. The features are all the other columns, which are used to predict the daily cases.

Secondly, the data needs to be normalized. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, not every dataset requires normalization. It is required only when features have different ranges. This is the case here. For instance, if you look at the precipitation and the wind direction values, the wind direction value is on average about 500 times as big as the precipitation value. When we do further analysis, the wind direction could intrinsically influence the result more due to its larger value. But this doesn’t necessarily mean it is more important as a predictor. So we normalize the data to bring all the variables to the same range. (Source: https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff#:~:text=Normalization%3A,in%20the%20ranges%20of%20values.&text=So%20we%20normalize%20the%20data,variables%20to%20the%20same%20range)

As a final step of data preparation, the data is split into training and testing data. During training, we let the model ‘see’ the answers, in this case the daily cases, so it can learn how to predict the case numbers from the features. We expect there to be some relationship between all the features and the target value, and the model’s job is to learn this relationship during training. Then, when it comes time to evaluate the model, we ask it to make predictions on a testing set where it only has access to the features (not the answers)! Because we do have the actual answers for the test set, we can compare these predictions to the true value to judge how accurate the model is. Generally, when training a model, we randomly split the data into training and testing sets to get a representation of all data points (if we trained on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it has not seen any data from those last three months). (Source: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)
### 2.1.1 Separation of features and targets

In [None]:
# This code was copied from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

# Labels are the values we want to predict (= the daily cases).
# Because of this we create a series with just the daily cases
labels = df['daily_cases']
# Next, we need to remove the labels from the features to get a dataframe with just the features
# axis 1 refers to the columns here
df= df.drop('daily_cases', axis = 1)
# Finally, the feature names are saved in a list for later use
feature_list = list(df.columns)


### 2.1.2 Normalization of data
To normalize the data, the normalize function from the sklearn library was used. With axis=0, it rescales each feature column to unit length (L2 norm).

In [None]:
# Code from: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
from sklearn.preprocessing import normalize
df=normalize(df, axis=0)
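As a small sketch of what this call does, the example below applies `normalize` with `axis=0` to a made-up array and checks that every column ends up with an L2 norm of 1. For comparison only, it also shows `MinMaxScaler`, which maps every column to the range 0 to 1. The scaler is not used in this notebook and is included here as an assumption about what a "common range" could alternatively mean.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical features on very different scales (two columns, three rows)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [2.0, 400.0]])

# axis=0 normalizes each column (feature) independently to unit L2 norm
X_norm = normalize(X, axis=0)
col_norms = np.linalg.norm(X_norm, axis=0)

# Alternative for comparison: rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_norm)
print(col_norms)
print(X_minmax)
```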

### 2.1.3 Splitting the dataset into training and testing data
As explained above, the data needs to be split into training and testing data. This is done using the train_test_split function from the sklearn library. In this function, you need to specify the share of the data you want to be testing data. The user can specify this through the input function in the code below.


In [None]:
# Choosing the split between training and testing data
# First, the user is asked through the input function what percentage of the data should be testing data
# To be able to work with it, the datatype is changed from string to float
try:
    split=float(input("What percentage of the data do you want to be testing data? (e.g. 30) "))
    # train_test_split needs a value greater than 0 but smaller than 1. This is why the entered value is divided by 100
    split=split/100
    # With an if-else statement the user is told whether a valid percentage was entered (i.e. between 0 and 1)
    # If an invalid number was entered, the user is asked to run the code again
    if split > 0 and split <1:
        print(f"The data is going to be split into {(1-split)*100}% training data and {split*100}% testing data. ")
    else:
        print("You have chosen an incorrect percentage. It needs to be larger than 0% and smaller than 100%. Please try again. ")
# If the number was entered in the wrong format (e.g. 30% instead of 30), a ValueError is raised
# The user is then asked to run the code again and enter the number in a valid format
except ValueError:
    print ("You entered the number in the wrong format. You need to enter just digits (e.g. 30). \nPlease try again")



Below, the data is finally split. The exact test size was specified by the user in the cell above. In this function, I set the random state to 42, which means the split is the same each time I run it, giving reproducible results.

In [None]:
# This code was copied from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(df, labels, test_size = split, random_state = 42)

In [None]:
# To check if everything worked, we look at the shapes of the arrays. The training arrays need to have the same number of rows
# The same applies to the testing arrays. Both feature arrays also need to have the same number of columns
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

## 2.2 Establish the baseline
To evaluate whether the machine learning model is any good, we need to establish a baseline. For this project, the baseline is the average number of cases per day. The algorithm needs to beat this baseline. If it does not, the algorithm doesn't help much: you would be better off just predicting the average number of daily cases every day. The baseline prediction and the baseline error are calculated in the following way:

***

$$ \text{Baseline Prediction: } \overline{x}=\frac{\sum\limits_{i=1}^{n}\text{Daily Cases}_{i}}{n}$$

$$ \text{Baseline Error: } \overline{X}=\frac{\sum\limits_{i=1}^{n}|\overline{x}-\text{Daily Cases}_{i}|}{n}$$

***

In [None]:
# This code was copied from https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# To calculate the mean, we need the numpy library
import numpy as np
# The baseline prediction is the historical average. For this, the average of the daily cases from the training data is used
# The average is calculated using the mean function from the numpy library
baseline_preds=np.mean(train_labels)
# All the individual baseline errors (on the testing data) are calculated
baseline_errors = abs(baseline_preds - test_labels)
# The baseline error is the average of all individual baseline errors, calculated with the mean function from the numpy library
baseline_error= round(np.mean(baseline_errors), 2)
# The average baseline error is displayed
print('Average baseline error: ', baseline_error )
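A small worked example with made-up numbers shows how the two formulas fit together. As in the cell above, the average is taken over the training labels and the error over the testing labels. The variable names ending in `_toy` are hypothetical.

```python
import numpy as np

# Hypothetical daily cases
train_toy = np.array([0, 2, 4, 6])   # training labels
test_toy = np.array([1, 5])          # testing labels

# Baseline prediction: the mean of the training labels -> (0 + 2 + 4 + 6) / 4 = 3.0
baseline_pred_toy = train_toy.mean()

# Individual baseline errors on the testing labels: |3 - 1| = 2 and |3 - 5| = 2
baseline_errors_toy = np.abs(baseline_pred_toy - test_toy)

# Baseline error: the mean of the individual errors -> (2 + 2) / 2 = 2.0
baseline_error_toy = baseline_errors_toy.mean()
print(baseline_pred_toy, baseline_error_toy)
```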

## 2.3 Training and Testing of Random Forest machine learning model
Before training and testing the model, the following gives an overview of how Random Forests work. The explanations provided are based on the following source: https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d

To understand the random forest model, we must first learn about the *decision tree*, the basic building block of a random forest. We all use decision trees in our daily life, even if we don’t know them by that name. To illustrate the concept, we’ll use an everyday example: predicting tomorrow’s maximum temperature.

In order to answer the single max temperature question, we actually need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for this problem might be 30–70 degrees (Fahrenheit) if we do not know the time of year before we begin. Gradually, through a set of questions and answers we reduce this range until we are confident enough to make a single prediction. The series of questions is displayed in the picture below.

![Decision tree for predicting the maximum temperature](https://miro.medium.com/max/1400/1*H3nZElqhfOE35AFAq8gy0A.png)

To give an example of how a prediction would work: if we know that
1. We are in the Winter Season,
2. The historical maximal temperature for this day of the year is below 50 degrees and
3. The average temperature for today is above 25 degrees

We would predict that the maximal temperature for today is 40 degrees.

During training, we give the model any historical data that is relevant to the problem domain (the temperature the day before, the season of the year, and the historical average) and the true value we want the model to learn to predict, in this case the max temperature tomorrow. The model learns any relationships between the data (known as features in machine learning) and the values we want to predict (called the target). The decision tree forms the structure shown above, calculating the best questions to ask in order to make the most accurate estimates possible. At each step, the model asks the question that lowers the prediction error the most. This means, for instance, that the first question is the best single predictor for the maximum temperature, since it decreases the prediction error the most.
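The idea of a tree as a flowchart of questions can be sketched with scikit-learn's `DecisionTreeRegressor`. The data below is randomly generated for illustration, and the feature names `temp_yesterday` and `historical_avg` are hypothetical. The target is constructed so that yesterday's temperature matters most, which is why the tree asks about it first.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Hypothetical features: yesterday's temperature and the historical average (degrees Fahrenheit)
X_toy = rng.uniform(20, 80, size=(200, 2))
# Hypothetical target: tomorrow's maximum depends mostly on yesterday's temperature
y_toy = 0.8 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(0, 2, size=200)

# A shallow tree: at most two questions before a prediction is made
tree_toy = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned questions as a text flowchart
rules = export_text(tree_toy, feature_names=["temp_yesterday", "historical_avg"])
print(rules)
```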

That is basically the entire high-level concept of a decision tree: a flowchart of questions leading to a prediction. Now, we take the mighty leap from a single decision tree to a *random forest*.

My prediction for the maximum temperature is probably wrong. There are too many factors to take into account, and chances are, each individual guess will be high or low. Every person comes to the problem with different background knowledge and may interpret the exact same answer to a question entirely differently. In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest! **The fundamental idea behind a random forest is to combine many decision trees into a single model**. Individually, predictions made by decision trees (or humans) may not be accurate, but combined together, the predictions will be closer to the mark on average.

Why exactly is a random forest better than a single decision tree? We can think about it in terms of having hundreds of humans make estimates for the max temperature problem: by **pooling predictions**, we can incorporate much more knowledge than from any one individual. Each individual brings their own background experience and information sources to the problem. This leads to a more accurate prediction.

Why the name ‘random forest?’ Well, much as people might rely on different sources to make a prediction, each decision tree in the forest considers a **random subset of features** when forming questions and only has access to a random set of the training data points. This **increases diversity** in the forest leading to **more robust overall predictions** and the name ‘random forest.’ When it comes time to make a prediction, the random forest takes an **average of all the individual decision tree estimates**.
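The averaging step can be checked directly on a small forest. The sketch below uses randomly generated data (an assumption for illustration) and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical data: three features, the target depends mainly on the first one
X_toy = rng.normal(size=(100, 3))
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

forest_toy = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Predictions of every single tree for the first five rows (shape: 25 trees x 5 rows)
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest_toy.estimators_])

# Averaging over the trees reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest_toy.predict(X_toy[:5])
print(np.allclose(manual_average, forest_prediction))
```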

With that in mind, we now have down all the conceptual parts of the random forest and can start with training the model.
### 2.3.1 Training the random forest model
To train the model, the scikit-learn library is used. We import the random forest regression model from scikit-learn, instantiate the model, and fit (scikit-learn’s name for training) the model on the training data. The random state is again set to 42 for reproducible results.

In [None]:
#This code was copied from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels)

### 2.3.2 Testing the random forest model
Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is. To do this, we make predictions on the test features (the model is never allowed to see the test answers). To make predictions, we use the predict function from the sklearn library. We then compare the predictions to the known answers. For this, we calculate the absolute difference between the two values. Finally, we calculate the mean of all differences to get the Mean Absolute Error, which serves as a measure of how good the created model is.

In [None]:
#This code was copied from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

### 2.3.3 Tuning of hyperparameters
(Source: https://bradleyboehmke.github.io/HOML/random-forest.html#hyperparameters)

Although random forests perform well out-of-the-box, there are several tunable hyperparameters that should be considered when training a model. The main hyperparameters to consider include:
1. The number of trees in the forest
2. The number of features to consider at any given split:  $m_{try}$
3. The complexity of each tree
4. The sampling scheme

**1. Number of trees**

The first consideration is the number of trees within your random forest. Although not technically a hyperparameter, the number of trees needs to be sufficiently large to stabilize the error rate. A good rule of thumb is to start with 10 times the number of features. More trees provide more robust and stable error estimates and variable importance measures; however, the impact on computation time increases linearly with the number of trees. This is why I decided to use 100 trees for the tuning: it is computationally not too heavy and still performs quite well.

**2. $m_{try}$**

This decides how many of the features are considered at each split of a tree in the random forest. After the number of features has been specified, the algorithm chooses that many features randomly. This makes sure that not all trees are dominated by the same feature, which can raise predictive strength. To find the best $m_{try}$, the code below checks all possibilities (at least two features and, for our dataframe, a maximum of 7 (all) features).

**3. Tree complexity (Node size)**

This decides how complex the individual trees will be. Complexity is defined here by the minimal size of the terminal nodes of each tree. A terminal node is an end point of a tree, which decides the value that is predicted. A visualisation of what a terminal node is can be seen in the first picture below. The node size decides how many data points a terminal node contains. For the tree in the second picture below, there are 50 observations in the dataset and the minimal node size is 10, meaning a terminal node needs to contain a minimum of 10 observations. When adjusting node size, start with three values between 1–10 and adjust depending on impact to accuracy and run time. In this case, we loop through all possibilities between 1 and 10 to get the ideal node size.

![Parts of a decision tree, including terminal nodes](https://blog.hackerearth.com/wp-content/uploads/2016/12/root-01.jpg)
![Example tree with a minimal node size of 10](https://i.stack.imgur.com/GMy2X.png)

**4. Sampling scheme**

The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement (in other words, each bootstrap copy has the same size as the original training data); however, we can adjust both the sample size and whether to sample with or without replacement. In this case, we only adjust whether we sample with or without replacement. The sample size is always 100% of all samples.
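The difference between the two sampling schemes can be sketched with NumPy alone. The row count of 1000 is an arbitrary assumption. Drawing 100% of the rows with replacement leaves out part of the data (roughly a third of the rows on average), while drawing 100% without replacement simply returns every row once, which corresponds to each tree seeing the whole training set.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000  # hypothetical number of training rows

# Bootstrap: draw n_rows row indices with replacement (some rows repeat, others are left out)
with_replacement = rng.choice(n_rows, size=n_rows, replace=True)

# Without replacement at 100% sample size: every row appears exactly once
without_replacement = rng.choice(n_rows, size=n_rows, replace=False)

# Share of distinct rows that ended up in the bootstrap sample
share_unique = len(np.unique(with_replacement)) / n_rows
print(f"Share of distinct rows in the bootstrap sample: {share_unique:.2f}")
```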

In [None]:
# Hyperparameter tuning: we tune mtry, tree complexity and sampling scheme
# First, I create two dictionaries: one stores which hyperparameters were used and what the resulting MAE was
# The second dictionary stores the hyperparameter values themselves (as a list) under the same key,
# because that will be easier to work with later
accuracy={}
parameter_tuning={}
# For each hyperparameter that we tune, one for-loop is created
# The first for-loop decides the tree complexity (node size)
for i in range(1,11):
    # The second for-loop decides mtry
    for n in range (2,8):
        # The third for-loop decides which of the two sampling schemes is used
        # RandomForestRegressor needs True (= with replacement) or False (= without replacement)
        for x in (True, False):
            # Here the hyperparameters are entered to create the corresponding model
            # n_estimators: number of trees
            # max_features: mtry
            # min_samples_leaf: tree complexity (node size)
            # bootstrap: sampling scheme
            rf = RandomForestRegressor(n_estimators = 100, max_features=n, min_samples_leaf=i, bootstrap=x,random_state=42)
            # Train the model on training data
            rf.fit(train_features, train_labels)
            # Use the forest's predict method on the test data
            predictions = rf.predict(test_features)
            # Calculate the absolute errors
            errors = abs(predictions - test_labels)
            # Calculate the mean absolute error (MAE)
            MAE= round(np.mean(errors), 2)
            # Here I store the MAE in the dictionary. As key I use a string with all the hyperparameter details
            accuracy[f"complexity {i} and mtry {n} and bootstrap {x}"]=MAE
            # Here I store the hyperparameters in a form that RandomForestRegressor can use later
            parameter_tuning[f"complexity {i} and mtry {n} and bootstrap {x}"]=[i, n, x]
# Here I get the hyperparameters which lead to the lowest MAE. I use the dictionary created beforehand for this
parameters=min(accuracy, key=accuracy.get)
# Printing the parameters
print(parameters)
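The loop above selects the hyperparameters by their error on the testing data. As a hedged sketch of a common alternative that is not used in this notebook, scikit-learn's `GridSearchCV` searches the same three hyperparameters with cross-validation on the training data only, which keeps the testing data untouched for the final evaluation. The data, the reduced grid and the small number of trees below are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical training data: four features, 120 rows
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=120)

# A reduced version of the grid searched in the loop above
param_grid = {
    "max_features": [2, 4],        # mtry
    "min_samples_leaf": [1, 5],    # tree complexity (node size)
    "bootstrap": [True, False],    # sampling scheme
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so the MAE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_mae = -search.best_score_
print(best_params, round(best_mae, 2))
```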

In [None]:
# Here I train the model with the best hyperparameters identified beforehand
# The best parameters are read from the dictionary created in the cell above
best_leaf, best_mtry, best_bootstrap = parameter_tuning[parameters]
# The model is instantiated with the best parameters
rf = RandomForestRegressor(n_estimators = 100, max_features=best_mtry, min_samples_leaf=best_leaf, bootstrap=best_bootstrap, random_state=42)
# Train the model on training data
rf.fit(train_features, train_labels)
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Calculate the mean absolute error (MAE)
MAE= round(np.mean(errors), 2)
# Print out the mean absolute error (MAE)
print('Mean Absolute Error:', MAE, "cases")

## 2.4 Evaluating the model
In the following the model is evaluated based on the comparison with the baseline model.

In [None]:
# Task: Evaluation of the model
# Here the error of the model is compared with the baseline.
# If it is lower than the baseline, the model is (somewhat) effective.
# If it is equal to or higher than the baseline, it is useless. You get a result at least as good by predicting the average every time
# This comparison is done with an if-else statement. It then prints out a message that says whether the model is good or not
if MAE < baseline_error:
    print (f"The mean absolute error (MAE) of the model, {MAE}, is lower than the baseline error of {baseline_error}.\nTherefore the model is sufficiently good and can be accepted!")
else:
    print (f"The mean absolute error (MAE) of the model, {MAE}, is equal to or higher than the baseline error of {baseline_error}.\nTherefore the model is useless and needs to be discarded!")

## 2.5 Visualisations
In the following, different results of this project will be visualised. First, one tree from the random forest will be visualised. Second, the importance of each of the features in predicting the case numbers will be visualised. Finally, the actual case numbers, the predictions and the baseline will be visualised to see how the model performed.
### 2.5.1 Tree-Visualisation
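To visualise trees you need the pydot package and the Graphviz software; the export_graphviz function itself is part of scikit-learn (sklearn.tree). In case you don't have pydot and Graphviz installed, you can uncomment the code in the two cells below to install them.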

In [None]:
# Installing the package needed for visualizing one decision tree in the random forest
# Code from: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/
#import sys
#!{sys.executable} -m pip install pydot

In [None]:
# Installing the package needed for visualizing one decision tree in the random forest
#Code from: https://www.codesofinterest.com/2017/02/visualizing-model-structures-in-keras.html
#conda install graphviz

In the following, one tree from the random forest (the tree at index 5, an arbitrary choice) will be visualised and saved as a png file. The cell after that shows the picture from the png file. However, if you want to see the picture in more detail, I would highly recommend opening the png file yourself on your own device.

In [None]:
#Visualizing one tree from random forest and exporting as png
# Code from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
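
If installing Graphviz is not an option, scikit-learn can also render a tree as plain text with `sklearn.tree.export_text`. The sketch below is an alternative to the png export, not the method used in this project. It trains a small forest on synthetic data, and the feature names `avg_temp` and `avg_humidity` as well as the variable `demo_rf` are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data: two placeholder features and a target that depends mainly on the first one
rng = np.random.RandomState(42)
features = rng.rand(50, 2)
labels = 10 * features[:, 0] + rng.rand(50)
feature_names = ["avg_temp", "avg_humidity"]

# Small, shallow forest so that the printed tree stays readable
demo_rf = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=42)
demo_rf.fit(features, labels)

# Pull out one tree and print its splitting rules as indented text
tree_text = export_text(demo_rf.estimators_[5], feature_names=feature_names)
print(tree_text)
```

The text output shows the same splits and leaf values as the picture, which can be handy for a quick look at a shallow tree.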

In [None]:
# Showing png of one tree of random forest
# For this, you need two modules from matplotlib
# Code from: https://stackoverflow.com/questions/20597088/display-a-png-image-from-python-on-mint-15-linux
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

#reading in the png file and saving it as img
img = mpimg.imread('tree.png')
# Plotting and showing the image of the tree
plt.imshow(img)
plt.show()

### 2.5.2 Feature importance
In the following, it will be identified how important each feature was in making the prediction. In the cell after that, the importances will be visualised using matplotlib.

In [None]:
#For entire forest: How important was every feature in making a prediction?
#Code from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Get numerical feature importances and save them in list
importances = list(rf.feature_importances_)
# Create list of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances in descending order
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
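
The same ranking can be sketched more compactly with a pandas Series. The feature names and importance values below are made up for illustration; with the real model one would pass `rf.feature_importances_` and `feature_list` instead.

```python
import pandas as pd

# Made-up importances; scikit-learn normalises the importances of a fitted forest to sum to 1
demo_features = ["avg_temp", "precipitation", "avg_relative_humidity"]
demo_importances = [0.25, 0.125, 0.625]

# Index the importances by feature name and sort with the most important first
ranking = pd.Series(demo_importances, index=demo_features).sort_values(ascending=False)
print(ranking.round(2))
```

Keeping the names as the index means the values and labels cannot get out of step when the ranking is sorted or plotted.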

In [None]:
#Visualizing how important each feature is (See task above)
# Code from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline
# Set the style
plt.style.use('fivethirtyeight')
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');

### 2.5.3 Visualising model results
Using matplotlib, the actual case numbers, the predicted case numbers and the average case numbers will be plotted in the following. This should give a better feel for how the model performed.

In [None]:
# Code from: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Dataframe with true values and dates
true_data = pd.DataFrame(data = {'date': test_labels.index, 'actual': test_labels})
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': test_labels.index, 'prediction': predictions})
#Dataframe with average cases
average_cases=pd.DataFrame(data={"date": test_labels.index, "average": baseline_preds})
# Because the newly created dataframes are not sorted, they need to be sorted by date in ascending order
# For this, the sort_index function is used. inplace=True means that no new variable needs to be defined
# Additionally, for two of the dataframes the date first has to be set as index because it was only a column before
true_data.sort_index(inplace=True, ascending=True)
predictions_data=predictions_data.set_index("date",drop=False)
predictions_data.sort_index(inplace=True, ascending=True)
average_cases=average_cases.set_index("date",drop=False)
average_cases.sort_index(inplace=True, ascending=True)
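# Select plot style
# Note: matplotlib 3.6 renamed the seaborn styles; on newer versions use "seaborn-v0_8-poster" instead
plt.style.use(["tableau-colorblind10","seaborn-poster"])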
# Plot the actual values
plt.plot( true_data['date'],true_data['actual'], 'b-', label = 'actual')
# Plot the predicted values
plt.plot( predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
#Plot the average cases
plt.plot( average_cases['date'],average_cases['average'], 'g-', label = 'average cases')
# The labels on the x-axis are rotated to be vertical instead of horizontal because they are more
# readable that way since they do not overlap
plt.xticks(rotation = 90);
# Plot the legend
plt.legend()
# Specifying the graph labels
plt.xlabel('Date'); plt.ylabel('Cases'); plt.title(f'Actual and Predicted Values for {Province}')
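
A more compact way to prepare the plot data, sketched here under the assumption that the values are indexed by date as above, is to collect all three series in a single dataframe and sort it once. The dates, values and the name `results` are invented for illustration.

```python
import pandas as pd

# Invented, unsorted test dates with actual and predicted case numbers
dates = pd.to_datetime(["2020-03-05", "2020-03-01", "2020-03-03"])
actual = [12, 7, 9]
predicted = [10.5, 8.0, 9.5]
average = 9.0  # constant baseline prediction, repeated for every date

# One dataframe indexed by date keeps all three columns aligned
results = pd.DataFrame(
    {"actual": actual, "prediction": predicted, "average": average},
    index=dates,
)
# A single sort puts every column in chronological order
results = results.sort_index()
print(results)
```

With this layout, the three `set_index` and `sort_index` steps collapse into one, and each column can be passed to `plt.plot` together with `results.index`.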