# Execute the code below

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
link = "https://raw.githubusercontent.com/murpi/wilddata/master/quests/weather2019.csv"
df_weather = pd.read_csv(link)

# Scoring and metrics
Last time, you ran a multivariate linear regression. But how can you be sure that it is better than a univariate one? You have to measure it!


## First regression
Let's begin with a first linear regression: create a new column `'predict_from_sun'` with the prediction of the MAX temperature from the SUNHOUR variable.

In [3]:
# Your code here :

# Creating DataFrame with all features for prediction:
#  * SUNHOUR

# Hint: If this is just a Series (df_weather["SUNHOUR"]),
#   then we get a shape warning from the LinearRegression()
#   algorithm.
#
# Solution: Make it a DataFrame, i.e. use double-brackets.
X = df_weather[["SUNHOUR"]]
y = df_weather[["MAX_TEMPERATURE_C"]]
model_from_sun = LinearRegression().fit(X, y)

# Predicting Max. Temperature from SUNHOUR using our new model
# to new DataFrame column `predict_from_sun`
df_weather["predict_from_sun"] = model_from_sun.predict(X)

# preview in notebook
df_weather[["SUNHOUR", "predict_from_sun", "MAX_TEMPERATURE_C"]]


| | SUNHOUR | predict_from_sun | MAX_TEMPERATURE_C |
|---|---|---|---|
| 0 | 5.1 | 11.396823 | 9 |
| 1 | 8.7 | 16.020019 | 8 |
| 2 | 8.7 | 16.020019 | 6 |
| 3 | 5.1 | 11.396823 | 5 |
| 4 | 8.7 | 16.020019 | 6 |
| ... | ... | ... | ... |
| 360 | 8.7 | 16.020019 | 13 |
| 361 | 6.9 | 13.708421 | 11 |
| 362 | 8.7 | 16.020019 | 9 |
| 363 | 8.7 | 16.020019 | 12 |


## R2 score
The best possible R2 score is 1, reached when the prediction matches reality perfectly. Let's see what our R2 score is:

In [4]:
# Change the name of the model if necessary
model_from_sun.score(X, y)

0.47654554059087306
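
To make the R2 score less of a black box, here is a minimal sketch that recomputes it by hand as `1 - SS_res / SS_tot` and compares the result with scikit-learn. The data is synthetic (an assumption, not the weather dataset), and the names `X_demo` and `demo_model` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption: NOT the weather dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 12, size=200)
y_true = 2.0 * x + 5.0 + rng.normal(0, 3, size=200)

X_demo = x.reshape(-1, 1)  # scikit-learn expects a 2D feature array
demo_model = LinearRegression().fit(X_demo, y_true)
y_pred = demo_model.predict(X_demo)

# R2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Two equivalent ways to get the same number from scikit-learn
r2_sklearn = r2_score(y_true, y_pred)
r2_from_model = demo_model.score(X_demo, y_true)

print("manual      :", r2_manual)
print("r2_score    :", r2_sklearn)
print("model.score :", r2_from_model)
```

A score of 1 means the residuals are all zero, while a score of 0 means the model does no better than always predicting the mean of the target.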

## Let's continue with 2 others regressions
- Second regression: create a new column 'predict_from_min' with the prediction of the MAX temperature from the MIN temperature variable
- Third regression: create a new column 'predict_from_both' with the prediction of the MAX temperature from both variables (MIN temperature and SUNHOUR)

In [5]:
# Your code here :

X_min = df_weather[["MIN_TEMPERATURE_C"]]
model2 = LinearRegression().fit(X_min, y)

X_both = df_weather[["MIN_TEMPERATURE_C", "SUNHOUR"]]
model3 = LinearRegression().fit(X_both, y)

## Calculate the R2 score of the 2 new predictions
Be careful: if you reuse the name `X`, you will overwrite it.

Which model has the best score? Does that seem logical to you?

In [6]:
# Your code here :
print(f"Model 2 score = {model2.score(X_min, y)}")
print(f"Model 3 score = {model3.score(X_both, y)}")

Model 2 score = 0.7689396999057355
Model 3 score = 0.867478798077497


**Which model has the best score?**  
> The model built from both `MIN_TEMPERATURE_C` and `SUNHOUR` has the best score.

**Does that seem logical?**  
> Yes. Sunlight over several hours of the day raises the temperature above the daily minimum, so both variables contribute to the build-up of heat during the day. It therefore makes sense to include both in the model, and the model with both variables should perform better than the models with one of them missing.
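
One caveat is worth a quick check: with ordinary least squares, the R2 score measured on the training data itself cannot decrease when a feature is added, even a useless one. The sketch below illustrates this on synthetic data (an assumption, not the weather dataset); `train_r2` and `noise_col` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: NOT the weather dataset)
rng = np.random.default_rng(1)
n = 300
min_temp = rng.normal(10, 5, n)
sun = rng.uniform(0, 12, n)
noise_col = rng.normal(0, 1, n)  # pure noise, unrelated to the target
max_temp = min_temp + 0.8 * sun + rng.normal(0, 2, n)


def train_r2(columns):
    """Fit a linear regression on the given columns and score it on the same data."""
    X = np.column_stack(columns)
    return LinearRegression().fit(X, max_temp).score(X, max_temp)


r2_one = train_r2([min_temp])
r2_two = train_r2([min_temp, sun])
r2_three = train_r2([min_temp, sun, noise_col])

print("min only           :", r2_one)
print("min + sun          :", r2_two)
print("min + sun + noise  :", r2_three)
```

So a higher score on the data used for fitting is not, on its own, proof that the extra variable is useful. That is one reason to also measure the score on data the model has never seen.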

# Train Test Split
One of the biggest problems in machine learning is **overfitting**.



To be sure that the machine didn't simply memorize the results, we use the train-test split methodology. We keep some data aside (often 25% of the initial dataset), then train the model on the remaining 75% (the "Train set").
Afterwards, we can calculate a score on the "Test set".

Let's do that !

In [7]:
# Just read and execute the code below
from sklearn.model_selection import train_test_split

X = df_weather[['SUNHOUR']]
y = df_weather['MAX_TEMPERATURE_C']

# Here, we split our 2 datasets (the variables "X" and the target "y") into 4 datasets: X and y for the train set, and X and y for the test set.
# We set the size of the train set to 75%. The rest is for the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.75)
print("The length of the initial dataset is :", len(X))
print("The length of the train dataset is   :", len(X_train))
print("The length of the test dataset is    :", len(X_test))

# Here we train the model only on the train dataset.
newmodel = LinearRegression().fit(X_train, y_train)

# And now we compare both scores:
print("\nScore for the Train dataset :", newmodel.score(X_train, y_train))
print("Score for the Test dataset :", newmodel.score(X_test, y_test))


The length of the initial dataset is : 365
The length of the train dataset is   : 273
The length of the test dataset is    : 92

Score for the Train dataset : 0.47243569075679914
Score for the Test dataset : 0.4749360350733982


## Both scores are very close: there is no overfitting, well done!

What happens if we don't shuffle the dataset? Here, the model learns only from the first 9 months.

In [8]:
# Just read and execute the code below
from sklearn.model_selection import train_test_split

X = df_weather[['MIN_TEMPERATURE_C']]
y = df_weather['MAX_TEMPERATURE_C']

# We set the size of the train set to 75%. The rest is for the test set.
# This time the split is NOT random (shuffle=False): the rows keep their original order.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.75, shuffle = False)


# Here we train the model only on the train dataset.
newmodel = LinearRegression().fit(X_train, y_train)

# And now we compare both scores :
print("\nScore for the Train dataset :", newmodel.score(X_train, y_train))
print("Score for the Test dataset :", newmodel.score(X_test, y_test))


Score for the Train dataset : 0.7875765302008688
Score for the Test dataset : 0.03610833322378593


## There is overfitting!
Indeed, the model gets a good score on the Train dataset because it learned from winter, spring, and summer data, but it gets a bad score on the autumn data...
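
To see which part of the year each split covers, here is a small sketch that applies the same two kinds of split to a plain calendar. It assumes one row per day of 2019 in chronological order; `dates` is a stand-in built for this example, not the actual `DATE` column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One entry per day of 2019 (assumption: rows are in calendar order)
dates = pd.Series(pd.date_range("2019-01-01", periods=365, freq="D"))

# Split WITHOUT shuffling: the first 75% of rows go to train, the rest to test
train_seq, test_seq = train_test_split(dates, train_size=0.75, shuffle=False)

# Split WITH shuffling (the default behaviour)
train_rnd, test_rnd = train_test_split(dates, train_size=0.75, random_state=42)

months_train_seq = sorted(train_seq.dt.month.unique().tolist())
months_test_seq = sorted(test_seq.dt.month.unique().tolist())
months_test_rnd = sorted(test_rnd.dt.month.unique().tolist())

print("Unshuffled train months:", months_train_seq)
print("Unshuffled test months :", months_test_seq)
print("Shuffled test months   :", months_test_rnd)
```

Without shuffling, the test set contains only days from a season the model never saw during training, whereas the shuffled test set is spread across the year.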

# Let's play !
Train a new model with all numeric variables (except your target, of course) and try to get a better score than before.

Remember to split your dataset randomly before training your model.

Display the Test score.

In [9]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   DATE                    365 non-null    object 
 1   MAX_TEMPERATURE_C       365 non-null    int64  
 2   MIN_TEMPERATURE_C       365 non-null    int64  
 3   WINDSPEED_MAX_KMH       365 non-null    int64  
 4   TEMPERATURE_MORNING_C   365 non-null    int64  
 5   TEMPERATURE_NOON_C      365 non-null    int64  
 6   TEMPERATURE_EVENING_C   365 non-null    int64  
 7   PRECIP_TOTAL_DAY_MM     365 non-null    float64
 8   HUMIDITY_MAX_PERCENT    365 non-null    int64  
 9   VISIBILITY_AVG_KM       365 non-null    float64
 10  PRESSURE_MAX_MB         365 non-null    int64  
 11  CLOUDCOVER_AVG_PERCENT  365 non-null    float64
 12  HEATINDEX_MAX_C         365 non-null    int64  
 13  DEWPOINT_MAX_C          365 non-null    int64  
 14  WINDTEMP_MAX_C          365 non-null    in

In [20]:
# Your code here :

# Let's list all numeric features.
# That is every column except:
#   - DATE (not numerical)
#   - MAX_TEMPERATURE_C (this is the target)
#   - OPINION (not numerical)
features = [
           col
           for col in df_weather.columns.values
           if col not in ["DATE", "MAX_TEMPERATURE_C", "OPINION"]
           ]

# Preparation of the dataframes (entire set)
X = df_weather[features]
y = df_weather["MAX_TEMPERATURE_C"]

# Train-test split with shuffling
#   Train set size = 80%
#   Test set size  = 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Checking the shapes
print(f"Training set shape   = {X_train.shape}")
print(f"Training target size = {y_train.shape}")
print(f"Testing set shape    = {X_test.shape}")
print(f"Testing target size  = {y_test.shape}")

# Training the model using LinearRegression() on the training set
last_model = LinearRegression().fit(X_train, y_train)

# Compute score for training set and testing set :
print("\nR2 coefficient for the Train dataset :", last_model.score(X_train, y_train))
print("R2 coefficient for the Test dataset :", last_model.score(X_test, y_test))


Training set shape   = (292, 22)
Training target size = (292,)
Testing set shape    = (73, 22)
Testing target size  = (73,)

R2 coefficient for the Train dataset : 0.9943467176553721
R2 coefficient for the Test dataset : 0.9912265382977196
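
As an alternative to excluding the non-numeric columns by name, pandas can select numeric columns by dtype. The sketch below shows the idea on a tiny toy DataFrame; the values (in particular the `MIN_TEMPERATURE_C` and `OPINION` entries) are invented for illustration and do not come from the weather dataset.

```python
import pandas as pd

# Toy DataFrame with the same kinds of columns (assumption: illustrative values only)
toy = pd.DataFrame({
    "DATE": ["2019-01-01", "2019-01-02", "2019-01-03"],
    "MAX_TEMPERATURE_C": [9, 8, 6],
    "MIN_TEMPERATURE_C": [4, 5, 3],
    "SUNHOUR": [5.1, 8.7, 8.7],
    "OPINION": ["bad", "bad", "bad"],
})

target = "MAX_TEMPERATURE_C"

# Keep every numeric column, then drop the target from the list
numeric_features = (
    toy.select_dtypes(include="number")
       .columns
       .drop(target)
       .tolist()
)

print(numeric_features)
```

This approach keeps working if new text columns are added later, but it would also pick up any numeric column created along the way, such as a column of earlier predictions, so the resulting list is worth a quick look before training.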
