# ToDo:


1.   Define the baseline model and encode the categorical features.
2.   Build models on the baseline feature set, using:
    1.   Naive Bayes
    2.   Logistic Regression
    3.   K-Nearest Neighbors (KNN)
    4.   SVM (optional)

3.   Compare the results of the first models. Try to improve logistic regression.

4.   Then go back to the preparation step and check how changes there affect model quality.

# Set up

In [2]:
!pip install ISLP


In [63]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", palette="bright")

import os

from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)

from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score

# library for exporting dataset
from google.colab import files

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
path = '/content/drive/My Drive/Colab Notebooks/EPAM DS foundations course/DS_module3/df_prepared.csv'
df = pd.read_csv(path, sep=',')

In [7]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [6]:
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Georegion,ElevationZone,ClimaticZone,AdminReg,Latitude,Longitude,Month,Season,Evaporation,Cloud9am
0,2008-12-01,Albury,0.188978,-0.047262,-0.250908,0.050496,W,0.323625,W,WNW,...,Southeastern Australia,Lowland,Temperate,New South Wales,0.548407,0.404652,December,Winter,-0.118178,1.458449
1,2008-12-02,Albury,-0.749171,0.262153,-0.354204,1.22787,WNW,0.323625,NNW,WSW,...,Southeastern Australia,Lowland,Temperate,New South Wales,0.548407,0.404652,December,Winter,-0.118178,0.153124
2,2008-12-03,Albury,0.110799,0.346539,-0.354204,0.906768,WSW,0.479084,W,WSW,...,Southeastern Australia,Lowland,Temperate,New South Wales,0.548407,0.404652,December,Winter,-0.118178,0.153124
3,2008-12-04,Albury,-0.467726,0.670018,-0.354204,1.025068,NE,-1.230975,SE,E,...,Southeastern Australia,Lowland,Temperate,New South Wales,0.548407,0.404652,December,Winter,-0.118178,0.153124
4,2008-12-05,Albury,0.830047,1.274783,-0.182044,-0.163572,W,0.090435,ENE,NW,...,Southeastern Australia,Lowland,Temperate,New South Wales,0.548407,0.404652,December,Winter,-0.118178,1.02334


# Defining base model

For the baseline model I keep all the features I started with, with a few exceptions:
- Remove **Date**; encode **Season** and **Month** instead.
- Remove **Location** (the city); encode **Georegion** instead, since it has fewer categories.
- Remove the original **RainTomorrow** and **RainToday** columns (keep only the encoded versions).
- Encode the wind direction columns.

## Mapping wind directions

In [8]:
df['WindGustDir'].unique()

array(['W', 'WNW', 'WSW', 'NE', 'NNW', 'N', 'NNE', 'SW', 'ENE', 'SSE',
       'S', 'NW', 'SE', 'ESE', 'E', 'SSW'], dtype=object)

In [9]:
wind_mapping = {
    'N': 'North', 'NNE': 'North', 'NE': 'North', 'NNW': 'North',
    'S': 'South', 'SSE': 'South', 'SE': 'South', 'SSW': 'South',
    'E': 'East', 'ENE': 'East', 'ESE': 'East',
    'W': 'West', 'WNW': 'West', 'WSW': 'West', 'NW': 'West', 'SW': 'West'
}

In [10]:
df['WindGustDir'] = df['WindGustDir'].map(wind_mapping)
df['WindDir3pm'] = df['WindDir3pm'].map(wind_mapping)
df['WindDir9am'] = df['WindDir9am'].map(wind_mapping)
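One thing worth checking after `.map()`: any compass code that is missing from the dictionary silently becomes `NaN`. Below is a minimal sketch of such a check. The `toy` series and the invalid code `'XYZ'` are made up for illustration; on the real data the same two checks would run against `df['WindGustDir']` and the other wind columns before they are overwritten.

```python
import pandas as pd

# Same grouping of the 16 compass points as in the mapping cell above
wind_mapping = {
    'N': 'North', 'NNE': 'North', 'NE': 'North', 'NNW': 'North',
    'S': 'South', 'SSE': 'South', 'SE': 'South', 'SSW': 'South',
    'E': 'East', 'ENE': 'East', 'ESE': 'East',
    'W': 'West', 'WNW': 'West', 'WSW': 'West', 'NW': 'West', 'SW': 'West'
}

# Hypothetical stand-in for a wind direction column; 'XYZ' is deliberately invalid
toy = pd.Series(['W', 'NNE', 'SE', 'XYZ', None])

# Codes that appear in the data but have no entry in the mapping
unmapped = sorted(set(toy.dropna()) - set(wind_mapping))

# .map() turns both unmapped codes and missing values into NaN
mapped = toy.map(wind_mapping)
n_missing = int(mapped.isna().sum())

print("Unmapped codes:", unmapped)
print("NaN after mapping:", n_missing)
```

If `unmapped` is empty and the NaN count matches the number of missing values before mapping, the mapping covers every code in the column.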

## Baseline model scope

In [11]:
df_baseline = df.drop(['Date', 'Location', 'RainTomorrow', 'RainToday', 'Latitude', 'Longitude', 'AdminReg', 'ClimaticZone', 'ElevationZone'], axis=1)

In [12]:
df_baseline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141006 entries, 0 to 141005
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   MinTemp               141006 non-null  float64
 1   MaxTemp               141006 non-null  float64
 2   Rainfall              141006 non-null  float64
 3   Sunshine              141006 non-null  float64
 4   WindGustDir           141006 non-null  object 
 5   WindGustSpeed         141006 non-null  float64
 6   WindDir9am            141006 non-null  object 
 7   WindDir3pm            141006 non-null  object 
 8   WindSpeed9am          141006 non-null  float64
 9   WindSpeed3pm          141006 non-null  float64
 10  Humidity9am           141006 non-null  float64
 11  Humidity3pm           141006 non-null  float64
 12  Pressure9am           141006 non-null  float64
 13  Pressure3pm           141006 non-null  float64
 14  Cloud3pm              141006 non-null  float64
 15  

# One-hot encoding categorical variables

In [21]:
wind_dummies = pd.get_dummies(df_baseline['WindGustDir'], prefix='WindGustDir')
#wind_dummies.head()
df_baseline = pd.concat([df_baseline, wind_dummies], axis=1)
df_baseline = df_baseline.drop('WindGustDir', axis=1)

In [13]:
wind_dummies3pm = pd.get_dummies(df_baseline['WindDir3pm'], prefix='WindDir3pm')
df_baseline = pd.concat([df_baseline, wind_dummies3pm], axis=1)
df_baseline = df_baseline.drop('WindDir3pm', axis=1)

wind_dummies9am = pd.get_dummies(df_baseline['WindDir9am'], prefix='WindDir9am')
df_baseline = pd.concat([df_baseline, wind_dummies9am], axis=1)
df_baseline = df_baseline.drop('WindDir9am', axis=1)

In [14]:
geo_dummies = pd.get_dummies(df_baseline['Georegion'], prefix='Georegion')
df_baseline = pd.concat([df_baseline, geo_dummies], axis=1)
df_baseline = df_baseline.drop('Georegion', axis=1)

In [15]:
month_dummies = pd.get_dummies(df_baseline['Month'], prefix='Month')
df_baseline = pd.concat([df_baseline, month_dummies], axis=1)
df_baseline = df_baseline.drop('Month', axis=1)

In [16]:
season_dummies = pd.get_dummies(df_baseline['Season'], prefix='Season')
df_baseline = pd.concat([df_baseline, season_dummies], axis=1)
df_baseline = df_baseline.drop('Season', axis=1)
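The repeated get_dummies / concat / drop pattern above can probably be collapsed into a single call, because `pd.get_dummies` accepts a `columns` argument and drops the source columns itself. A minimal sketch on a made-up frame (`toy` and its values are placeholders, not the real data):

```python
import pandas as pd

# Hypothetical miniature of df_baseline with one numeric and two categorical columns
toy = pd.DataFrame({
    'MinTemp': [0.19, -0.75, 0.11],
    'WindGustDir': ['West', 'West', 'North'],
    'Season': ['Winter', 'Winter', 'Summer'],
})

# Encode the listed columns in one step; each column name is used as the prefix
# and the original categorical columns are removed from the result
encoded = pd.get_dummies(toy, columns=['WindGustDir', 'Season'])

print(encoded.columns.tolist())
```

For the notebook this would mean passing all five categorical columns (the three wind columns, Georegion, Month, Season) in one list. Passing `drop_first=True` as well would drop one level per variable, which may matter later for logistic regression, but that is a modeling choice I have not made here.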

In [17]:
df_baseline.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Sunshine,WindGustDir,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud3pm,Temp9am,Temp3pm,RainToday_encoded,RainTomorrow_encoded,Evaporation,Cloud9am,WindDir3pm_East,WindDir3pm_North,WindDir3pm_South,WindDir3pm_West,WindDir9am_East,WindDir9am_North,WindDir9am_South,WindDir9am_West,Georegion_Central Australia,Georegion_Eastern Australia,Georegion_External Territory,Georegion_Northern Australia,Georegion_Southeastern Australia,Georegion_Southern Australia,Georegion_Tasmania,Georegion_Western Australia,Month_April,Month_August,Month_December,Month_February,Month_January,Month_July,Month_June,Month_March,Month_May,Month_November,Month_October,Month_September,Season_Autumn,Season_Spring,Season_Summer,Season_Winter
0,0.188978,-0.047262,-0.250908,0.050496,West,0.323625,0.674442,0.611889,0.114095,-1.432044,-1.469626,-1.216459,0.207539,-0.013048,0.016522,0,0,-0.118178,1.458449,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
1,-0.749171,0.262153,-0.354204,1.22787,West,0.323625,-1.131835,0.38329,-1.308954,-1.286263,-1.041445,-1.112085,-1.371553,0.033195,0.379679,0,0,-0.118178,0.153124,False,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
2,0.110799,0.346539,-0.354204,0.906768,West,0.479084,0.56155,0.840488,-1.625187,-1.043296,-1.484391,-0.977891,-1.039112,0.618935,0.21989,0,0,-0.118178,0.153124,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
3,-0.467726,0.670018,-0.354204,1.025068,North,-1.230975,-0.341589,-1.102601,-1.256248,-1.723605,-0.007903,-0.366562,-0.706672,0.171923,0.699257,0,0,-0.118178,0.153124,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
4,0.830047,1.274783,-0.182044,-0.163572,West,0.090435,-0.793158,0.154692,0.693856,-0.897515,-1.011915,-1.380474,1.45419,0.12568,1.164098,0,0,-0.118178,1.02334,False,False,False,True,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True


# Train/test split

In [22]:
X = df_baseline.drop('RainTomorrow_encoded', axis=1)
y = df_baseline['RainTomorrow_encoded']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024)
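The split above is purely random. Since the target is imbalanced (the class priors reported in the Naive Bayes section are roughly 78% / 22%), a stratified split may be worth trying so that both parts keep the same share of rainy days. A minimal sketch on synthetic data; `X_toy` and `y_toy` are made up, and only the `stratify` argument differs from the call above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, about 22% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.22).astype(int)

# stratify=y_toy keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2024, stratify=y_toy
)

print(f"Positive share in train: {y_tr.mean():.3f}")
print(f"Positive share in test:  {y_te.mean():.3f}")
```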

# Modeling

## Naive Bayes

In [23]:
df_baseline.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud3pm,Temp9am,Temp3pm,RainToday_encoded,RainTomorrow_encoded,Evaporation,Cloud9am,WindDir3pm_East,WindDir3pm_North,WindDir3pm_South,WindDir3pm_West,WindDir9am_East,WindDir9am_North,WindDir9am_South,WindDir9am_West,Georegion_Central Australia,Georegion_Eastern Australia,Georegion_External Territory,Georegion_Northern Australia,Georegion_Southeastern Australia,Georegion_Southern Australia,Georegion_Tasmania,Georegion_Western Australia,Month_April,Month_August,Month_December,Month_February,Month_January,Month_July,Month_June,Month_March,Month_May,Month_November,Month_October,Month_September,Season_Autumn,Season_Spring,Season_Summer,Season_Winter,WindGustDir_East,WindGustDir_North,WindGustDir_South,WindGustDir_West
0,0.188978,-0.047262,-0.250908,0.050496,0.323625,0.674442,0.611889,0.114095,-1.432044,-1.469626,-1.216459,0.207539,-0.013048,0.016522,0,0,-0.118178,1.458449,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
1,-0.749171,0.262153,-0.354204,1.22787,0.323625,-1.131835,0.38329,-1.308954,-1.286263,-1.041445,-1.112085,-1.371553,0.033195,0.379679,0,0,-0.118178,0.153124,False,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
2,0.110799,0.346539,-0.354204,0.906768,0.479084,0.56155,0.840488,-1.625187,-1.043296,-1.484391,-0.977891,-1.039112,0.618935,0.21989,0,0,-0.118178,0.153124,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
3,-0.467726,0.670018,-0.354204,1.025068,-1.230975,-0.341589,-1.102601,-1.256248,-1.723605,-0.007903,-0.366562,-0.706672,0.171923,0.699257,0,0,-0.118178,0.153124,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False
4,0.830047,1.274783,-0.182044,-0.163572,0.090435,-0.793158,0.154692,0.693856,-0.897515,-1.011915,-1.380474,1.45419,0.12568,1.164098,0,0,-0.118178,1.02334,False,False,False,True,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True


In [31]:
# Initialize and train the Gaussian Naive Bayes classifier
NB = GaussianNB()
NB.fit(X_train, y_train)

In [32]:
# Make predictions on the test set
y_pred = NB.predict(X_test)

In [33]:
NB.classes_

array([0, 1])

In [34]:
NB.class_prior_

array([0.77630226, 0.22369774])

### Evaluate the model

In [36]:
accuracy = accuracy_score(y_test, y_pred) # sklearn.metrics
print(f"Gaussian Naive Bayes Accuracy: {accuracy}")

Gaussian Naive Bayes Accuracy: 0.7680306361250975
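Accuracy alone is hard to read with imbalanced classes. The class priors above put class 0 at about 77.6% of the training data, so a model that always predicts "no rain" would already score close to that, which is the same range as the Naive Bayes accuracy. A minimal sketch of how such a trivial baseline could be measured with sklearn's `DummyClassifier`; the data here is synthetic and only illustrates the idea:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in with about 22% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.22).astype(int)

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, dummy.predict(X_toy))
majority_share = max(y_toy.mean(), 1 - y_toy.mean())

print(f"Majority-class accuracy: {baseline_acc:.3f}")
print(f"Majority-class share:    {majority_share:.3f}")
```

On the real data the same classifier would be fitted on `X_train, y_train` and scored on `X_test, y_test`, giving a floor that every model in the ToDo list should beat.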


In [59]:
# ISLP's confusion_table takes (predicted_labels, true_labels):
# rows are predictions, columns are the truth
confusion_table(y_pred, y_test)

Truth,0,1
Predicted,,
0,17271,1949
1,4593,4389


In [60]:
confusion_matrix(y_test, y_pred)

array([[17271,  4593],
       [ 1949,  4389]])

| | Prediction==0 | Prediction==1 |
|-|---|---|
| Actual==0 | 17271 | 4593 |
| Actual==1 | 1949 | 4389 |

In [45]:
# Counts read from confusion_matrix(y_test, y_pred): rows are actual, columns are predicted
true_neg, true_pos = 17271, 4389
false_pos, false_neg = 4593, 1949
overall = true_neg + true_pos + false_neg + false_pos

In [62]:
accuracy = (true_neg + true_pos) / overall
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.7680306361250975
Precision: 0.48864395457581833
Recall: 0.692489744398864
F1 Score: 0.572976501305483
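Copying counts by hand makes it easy to mix up false positives and false negatives. sklearn's `confusion_matrix` puts the true labels on rows and the predictions on columns, so for a binary problem `ravel()` yields `tn, fp, fn, tp` in that order. The sketch below rebuilds label arrays from the matrix shown above and lets sklearn compute the metrics; the reconstruction with `np.repeat` is only a trick to make the example self-contained, in the notebook `y_test` and `y_pred` would be passed directly.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, accuracy_score)

# Matrix as printed by confusion_matrix(y_test, y_pred): rows = actual, columns = predicted
cm = np.array([[17271, 4593],
               [1949, 4389]])
tn, fp, fn, tp = cm.ravel()

# Rebuild label arrays that reproduce exactly this matrix
y_true = np.repeat([0, 0, 1, 1], [tn, fp, fn, tp])
y_hat = np.repeat([0, 1, 0, 1], [tn, fp, fn, tp])

accuracy = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

`classification_report(y_test, y_pred)`, which is already imported at the top of the notebook, prints the same metrics per class in one call.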


In [65]:
print(f"Roc Auc Score: {roc_auc_score(y_test, y_pred)}")

Roc Auc Score: 0.7412091971171049


In [66]:
# Returns (false positive rates, true positive rates, thresholds)
roc_curve(y_test, y_pred)

(array([0.        , 0.21007135, 1.        ]),
 array([0.        , 0.69248974, 1.        ]),
 array([inf,  1.,  0.]))
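The ROC curve above has only three points because it was built from hard 0/1 predictions, which supply a single threshold. ROC AUC is usually computed from scores or probabilities instead, so that every threshold contributes. A minimal sketch on synthetic data from `make_classification` (a stand-in, not the weather data), using `predict_proba` of the same kind of classifier:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic imbalanced binary problem, roughly 78% / 22%
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2024, stratify=y_toy
)

nb = GaussianNB().fit(X_tr, y_tr)

# Probability of the positive class (column order follows nb.classes_)
proba = nb.predict_proba(X_te)[:, 1]

auc_proba = roc_auc_score(y_te, proba)               # uses the full ranking
auc_labels = roc_auc_score(y_te, nb.predict(X_te))   # single-threshold version

fpr, tpr, thresholds = roc_curve(y_te, proba)

print(f"ROC AUC from probabilities: {auc_proba:.3f}")
print(f"ROC AUC from hard labels:   {auc_labels:.3f}")
print(f"Points on the ROC curve:    {len(fpr)}")
```

In the notebook the equivalent would be `NB.predict_proba(X_test)[:, 1]` passed to `roc_auc_score` and `roc_curve`; the resulting `fpr` and `tpr` arrays can then be plotted with matplotlib.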

# Conclusion

*   1
*   2