# Notebook 05: Building baseline ML models

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import r2_score

from sklearn.base import clone

import pickle

from utility.model_evaluation import evaluate_data
from utility.model_evaluation import evaluate_cities



In [2]:
#Loading cleaned data.

with open(r'..\data\interim\04_data_merging\cleaned_data.pkl', 'rb') as f:
    data = pickle.load(f)

data.head()

Unnamed: 0,time,year,month,dom,doy,doysin,city,temp,rel_hum,abs_hum,ava_rad,et_rad,kt,cloudy
0,1993-01-01,1993.0,1.0,1.0,1.0,0.009,BND,19.11,0.782,12.748946,12.62,21.652331,0.582847,False
1,1993-01-01,1993.0,1.0,1.0,1.0,0.009,ESF,0.44,0.462,2.240238,12.9,18.352701,0.702894,False
3,1993-01-01,1993.0,1.0,1.0,1.0,0.009,KRM,4.83,0.576,3.915072,13.38,19.798337,0.675814,False
4,1993-01-01,1993.0,1.0,1.0,1.0,0.009,MSH,-0.06,0.732,3.549468,10.42,16.104011,0.647044,False
5,1993-01-01,1993.0,1.0,1.0,1.0,0.009,SHZ,4.61,0.602,4.091794,14.62,20.216349,0.723177,False


## Cloudy samples investigation:

In this section, the effect of cloudy samples on ML models is investigated.

In [3]:
# Ridge model on all samples (cloudy and not_cloudy).

evaluate_cities(Ridge(), data)
pass

Average R2 score                [train : 0.707, test : 0.708]
Average root mean squared error [train : 3.980, test : 3.966]


In [4]:
# Ridge model without cloudy samples

evaluate_data(Ridge(), data[data.cloudy == False])

Mean R2 score    [train : 0.776, test : 0.777]
Mean rmse        [train : 3.265, test : 3.262]


{'score': (0.7763433239290418, 0.7769685373418713),
 'rmse': (3.2646946468099585, 3.2623152868427363),
 'model': Ridge()}

In [5]:
print('Percentage of cloudy samples: {:.3f} %'.format(data.cloudy.mean() * 100))

Percentage of cloudy samples: 6.328 %


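The models above show that dropping cloudy samples improves the test R2 score from 0.708 to 0.777,
which is about 7 percentage points of explained variance.
Note that the first score is a per-city average (evaluate_cities), while the second comes from a single model on all cities (evaluate_data).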

Also, cloudy samples make up only about 6.3% of the whole dataset.

So we have to decide whether to drop the cloudy samples completely.

To answer this, we evaluate a model on the cloudy samples alone.
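
Because cloudy samples are a small minority, a plain random split can leave the train and test sets with slightly different cloudy fractions. The following is a minimal sketch, on synthetic data (the 6% share and the `toy` frame are stand-ins, not the real dataset), of how the `stratify` argument of `train_test_split` keeps that fraction the same on both sides:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 6% flagged as cloudy.
toy = pd.DataFrame({
    "x": np.arange(1000, dtype=float),
    "cloudy": [True] * 60 + [False] * 940,
})

# stratify= keeps the share of cloudy rows (almost) identical in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["cloudy"], random_state=42
)

train_frac = toy_train["cloudy"].mean()
test_frac = toy_test["cloudy"].mean()
print(f"cloudy fraction: train={train_frac:.3f}, test={test_frac:.3f}")
```

With a stratified split, scores computed separately on cloudy and not_cloudy rows are based on comparable proportions of each group.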

In [6]:
# Training a model on the whole training set and evaluating its score:

X_train, X_test, y_train, y_test = train_test_split(data[['doy', 'temp', 'rel_hum', 'et_rad', 'cloudy']], data['ava_rad'], stratify=data['cloudy'], random_state=42)

model = Ridge()
model.fit(X_train[['doy', 'temp', 'rel_hum', 'et_rad']], y_train)

score = model.score(X_train[['doy', 'temp', 'rel_hum', 'et_rad']], y_train)
print('R2 score of the training set: {:.3f}'.format(score))

R2 score of the training set: 0.700


In [7]:
# Separating cloudy and not_cloudy samples of the training set:

# Not_cloudy samples:
X_train_nc = X_train[X_train.cloudy == False]
y_train_nc = y_train[X_train.cloudy == False]

# Cloudy samples:
X_train_c = X_train[X_train.cloudy == True]
y_train_c = y_train[X_train.cloudy == True]

In [8]:
# Evaluating model on the training set:

# Not_cloudy samples:
score_train_nc = model.score(X_train_nc[['doy', 'temp', 'rel_hum', 'et_rad']], y_train_nc)
print('R2 score of the training set without cloudy samples: {:.3f}'.format(score_train_nc))

# Cloudy samples:
score_train_c = model.score(X_train_c[['doy', 'temp', 'rel_hum', 'et_rad']], y_train_c)
print('R2 score of the training set with just cloudy samples: {:.3f}'.format(score_train_c))

R2 score of the training set without cloudy samples: 0.765
R2 score of the training set with just cloudy samples: -4.247


The scores above show that the model cannot even predict the cloudy samples it was trained on: an R2 of -4.247 means its predictions for them are far worse than simply predicting the mean of the cloudy targets.

With these features, cloudy samples cannot be predicted, and they only lower the model's score.

So we separate out the cloudy samples and train our models only on not_cloudy samples.
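
A negative R2 on a subset can look surprising for a model that scores well overall. Since R2 = 1 - SS_res / SS_tot, it drops below zero whenever the squared errors on that subset exceed the subset's own variance around its mean. A small sketch on synthetic data (the two groups below are invented for illustration and are not the cloudy and not_cloudy samples):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in: a large "regular" group following y = 2x and a small
# group whose target stays low regardless of x.
x_regular = np.arange(20, dtype=float)
y_regular = 2.0 * x_regular
x_odd = np.array([5.0, 10.0, 15.0])
y_odd = np.array([1.0, 2.0, 3.0])

X_all = np.concatenate([x_regular, x_odd]).reshape(-1, 1)
y_all = np.concatenate([y_regular, y_odd])

toy_model = LinearRegression().fit(X_all, y_all)

# R2 on everything the model was trained on.
r2_all = r2_score(y_all, toy_model.predict(X_all))

# R2 on the small group only, computed by hand and with sklearn.
pred_odd = toy_model.predict(x_odd.reshape(-1, 1))
ss_res = np.sum((y_odd - pred_odd) ** 2)
ss_tot = np.sum((y_odd - y_odd.mean()) ** 2)
r2_odd_manual = 1.0 - ss_res / ss_tot
r2_odd = r2_score(y_odd, pred_odd)

print(f"R2 on all samples:     {r2_all:.3f}")
print(f"R2 on the small group: {r2_odd:.3f}")
```

The fit is dominated by the large group, so the small group is predicted with errors much larger than its own spread, which is the same pattern as the cloudy score above.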

In [9]:
# pickling not_cloudy samples:

not_cloudy = data[data.cloudy == False]
not_cloudy = not_cloudy.drop(['cloudy'], axis=1)

with open(r'..\data\interim\05_machine_learning\not_cloudy_data.pkl', 'wb') as f:
    pickle.dump(not_cloudy, f)

In [10]:
# Renaming DataFrames:

# As we are going to work only on not_cloudy samples most of the time,
# data is renamed to whole_data, and the not_cloudy samples are renamed to data.

whole_data = data
data = not_cloudy

## Selecting humidity features:

In this section, the effect of different humidity features is investigated.

In [11]:
# Only use relative humidity:

feature_set_1 = ['doysin', 'temp', 'rel_hum', 'et_rad']
scores1, _, _ = evaluate_cities(LinearRegression(), data, input_columns=feature_set_1)

Average R2 score                [train : 0.785, test : 0.784]
Average root mean squared error [train : 3.082, test : 3.063]


In [12]:
# Only use absolute humidity:

feature_set_2 = ['doysin', 'temp', 'abs_hum', 'et_rad']
scores2, _, _ = evaluate_cities(LinearRegression(), data, input_columns=feature_set_2)

Average R2 score                [train : 0.781, test : 0.779]
Average root mean squared error [train : 3.107, test : 3.094]


In [13]:
# Use both relative and absolute humidity:

feature_set_3 = ['doysin', 'temp', 'rel_hum', 'abs_hum', 'et_rad']
scores3, _, _ = evaluate_cities(LinearRegression(), data, input_columns=feature_set_3)

Average R2 score                [train : 0.789, test : 0.788]
Average root mean squared error [train : 3.054, test : 3.034]


The average scores show that using both relative and absolute humidity is the best combination.

But what is the best combination for each individual city?

In [14]:
# combs_scores shows the test score of each city with each feature combination.

combs_scores = pd.concat((scores1.test, scores2.test, scores3.test), axis=1, keys=(1, 2, 3))
combs_scores = combs_scores.round(decimals=3)
combs_scores

Unnamed: 0,1,2,3
BND,0.773,0.774,0.774
ESF,0.845,0.843,0.846
KRM,0.81,0.82,0.82
MSH,0.834,0.836,0.838
SHZ,0.809,0.815,0.815
TBR,0.773,0.767,0.773
AHV,0.784,0.785,0.786
JSK,0.73,0.728,0.732
KRMS,0.775,0.774,0.778
ZNJ,0.784,0.768,0.786


In [15]:
# best_comb shows, for each city, which combination has the best score

best_comb = combs_scores.eq(combs_scores.max(axis=1), axis=0)
best_comb

Unnamed: 0,1,2,3
BND,False,True,True
ESF,False,False,True
KRM,False,True,True
MSH,False,False,True
SHZ,False,True,True
TBR,True,False,True
AHV,False,False,True
JSK,False,False,True
KRMS,False,False,True
ZNJ,False,False,True


In [16]:
# To find out how many cities have the best performance with each feature combination:

best_comb.sum()

1     3
2     6
3    24
dtype: int64

The results above show that the third combination (having both relative and absolute humidity) is the best choice for most cities.
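
One detail of the `eq`/`max` pattern is worth keeping in mind: when two combinations tie after rounding, both are marked as best, so the column sums can add up to more than the number of cities. A minimal sketch using three rows copied from the combs_scores table (the `toy_` names are illustrative), with `idxmax` shown as one possible way to count each city only once:

```python
import pandas as pd

# Test scores of three cities, copied from the combs_scores table above.
toy_scores = pd.DataFrame(
    {1: [0.773, 0.845, 0.773], 2: [0.774, 0.843, 0.767], 3: [0.774, 0.846, 0.773]},
    index=["BND", "ESF", "TBR"],
)

# True wherever a column equals the row-wise maximum; ties give several Trues.
toy_best = toy_scores.eq(toy_scores.max(axis=1), axis=0)

# Column sums count how often each combination is best or tied for best.
toy_counts = toy_best.sum()

# idxmax keeps only the first best column per row, so each city counts once.
toy_first_best = toy_scores.idxmax(axis=1).value_counts()

print(toy_counts)
print(toy_first_best)
```

Which of the two counts is more useful depends on whether a tie should count as a win for every tied combination.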

## Feature importance:

Finding out how valuable each feature is:

In [17]:
rf = RandomForestRegressor()
evaluate_data(rf, data)
pass

Mean R2 score    [train : 0.971, test : 0.792]
Mean rmse        [train : 1.180, test : 3.152]


In [18]:
# Showing feature importance

fi = pd.DataFrame(index=rf.feature_names_in_,
                  data=rf.feature_importances_,
                  columns=['feature_importance'])
fi

Unnamed: 0,feature_importance
doysin,0.097698
temp,0.04712
rel_hum,0.091531
abs_hum,0.063453
et_rad,0.700199


The DataFrame above shows that et_rad is by far the most important feature.

So can our model predict well enough without et_rad, or with et_rad alone?

To test this, we train some models:

In [19]:
# Testing a model only with et_rad feature:

rf = RandomForestRegressor()
evaluate_data(rf, data, input_columns=['et_rad'])
pass

Mean R2 score    [train : 0.797, test : 0.750]
Mean rmse        [train : 3.114, test : 3.457]


In [20]:
# Testing a model without et_rad feature:

rf = RandomForestRegressor()
evaluate_data(rf, data, input_columns=['doysin', 'temp', 'rel_hum', 'abs_hum'])
pass

Mean R2 score    [train : 0.963, test : 0.755]
Mean rmse        [train : 1.327, test : 3.416]


In [21]:
# Showing feature importance

fi = pd.DataFrame(index=rf.feature_names_in_,
                  data=rf.feature_importances_,
                  columns=['feature_importance'])
fi

Unnamed: 0,feature_importance
doysin,0.72591
temp,0.073139
rel_hum,0.111197
abs_hum,0.089755


The results above show that it is possible to get a reasonably good score without the et_rad feature (test R2 of 0.755, against 0.792 with all features).

But et_rad is still the most important feature when it is available!
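
The jump of doysin from about 0.098 to 0.726 once et_rad is removed is consistent with how impurity-based importances behave: they sum to 1, and a feature that is correlated with the removed one can take over its share. The sketch below illustrates this on synthetic data only; the column names mirror the notebook, but the values and the relationship between them are assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins (NOT the real data): a seasonal signal, a driver that is
# partly determined by that signal, and an unrelated feature.
doysin = rng.uniform(0.0, 1.0, size=n)
et_rad = 20.0 + 20.0 * doysin + rng.normal(0.0, 4.0, size=n)
temp = rng.normal(0.0, 1.0, size=n)
target = 0.7 * et_rad + rng.normal(0.0, 1.0, size=n)

toy = pd.DataFrame({"doysin": doysin, "temp": temp, "et_rad": et_rad})


def importances(columns):
    """Fit a forest on the given columns and return its impurity-based importances."""
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(toy[columns], target)
    return pd.Series(forest.feature_importances_, index=columns)


fi_full = importances(["doysin", "temp", "et_rad"])
fi_reduced = importances(["doysin", "temp"])

print(fi_full)
print(fi_reduced)
```

A high importance for doysin in the reduced model therefore does not contradict the low value it had in the full model; the two numbers are relative to different feature sets.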

## Training models:


In [22]:
# Training a LinearRegression:

evaluate_data(LinearRegression(), data)
pass

Mean R2 score    [train : 0.776, test : 0.777]
Mean rmse        [train : 3.265, test : 3.262]


In [23]:
# Training a RandomForest:

evaluate_data(RandomForestRegressor(), data)
pass

Mean R2 score    [train : 0.971, test : 0.792]
Mean rmse        [train : 1.180, test : 3.147]


In [24]:
# Training a Ridge model:

evaluate_data(Ridge(), data)
pass

Mean R2 score    [train : 0.776, test : 0.777]
Mean rmse        [train : 3.265, test : 3.262]


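The scores of LinearRegression and Ridge are the same, and the test score of RandomForest is 0.015 higher.

This shows that, thanks to the large number of samples and the small number of features, the linear models do not overfit: their train and test scores are nearly identical. RandomForest does overfit (train R2 of 0.971 against a test R2 of 0.792) and still gains only a little on the test set.

And given their simple structure, linear models are a good choice for this data.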

But what if the data is split by city and an individual model is trained on each city?

The smaller size of each dataset may lower the scores.

So we have to test the result of splitting the data by city!
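
The project's `evaluate_cities` helper lives in `utility.model_evaluation` and is not shown in this notebook. As a rough idea of what a per-city evaluation involves, here is a minimal sketch with a hypothetical helper, `evaluate_per_group`, run on synthetic data; its name, signature and return value are assumptions and may differ from the real helper:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def evaluate_per_group(model, df, features, target, group_col, random_state=42):
    """Hypothetical helper: fit a fresh clone of `model` on each group and
    return train/test R2 per group. Not the project's `evaluate_cities`."""
    rows = {}
    for name, group in df.groupby(group_col):
        X_tr, X_te, y_tr, y_te = train_test_split(
            group[features], group[target], random_state=random_state
        )
        # clone() gives an unfitted copy, so groups do not share parameters.
        fitted = clone(model).fit(X_tr, y_tr)
        rows[name] = {
            "train": fitted.score(X_tr, y_tr),
            "test": fitted.score(X_te, y_te),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in data: two made-up cities with different slopes.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "city": ["AAA"] * 200 + ["BBB"] * 200,
    "x": rng.uniform(0.0, 10.0, size=400),
})
slope = toy["city"].map({"AAA": 1.0, "BBB": 3.0})
toy["y"] = slope * toy["x"] + rng.normal(0.0, 1.0, size=400)

toy_r2 = evaluate_per_group(Ridge(), toy, ["x"], "y", "city")
print(toy_r2)
print(toy_r2.mean())
```

Fitting one model per group lets each city have its own coefficients, at the cost of fewer training samples per model.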

In [25]:
# Training a Ridge model on each city:

r2s, rmses, models = evaluate_cities(Ridge(), data)

Average R2 score                [train : 0.788, test : 0.788]
Average root mean squared error [train : 3.054, test : 3.035]


In [26]:
# Showing r2 scores of trained models on each city:

r2s

Unnamed: 0,train,test
BND,0.778414,0.774149
ESF,0.837641,0.846319
KRM,0.826772,0.820154
MSH,0.840709,0.837895
SHZ,0.831814,0.815168
TBR,0.765357,0.772775
AHV,0.802316,0.786224
JSK,0.745455,0.731098
KRMS,0.762268,0.777983
ZNJ,0.770331,0.786223


## Conclusion:
In this notebook:

1. The effect of cloudy samples on ML models is investigated.
2. The best combination of humidity features is found.
3. The importance of different features is investigated.
4. The scores of ML models on each individual city, and on all cities at once, are calculated.