<a href="https://colab.research.google.com/github/numustafa/ML-Projects-/blob/main/Logistic%20Regression%20Project/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression with Scikit-Learn (Classification Challenge)

This is a learning project for revising classification concepts in machine learning. The workflow follows these steps:
* Problem statement
* Exploring the dataset
* Building the model


## 1. Problem Statement
I apply _logistic regression_ to a real-world dataset from [Kaggle](https://kaggle.com/datasets):

> **QUESTION**: The [Rain in Australia dataset](https://kaggle.com/jsphyg/weather-dataset-rattle-package) contains about 10 years of daily weather observations from numerous Australian weather stations. Here's a small sample from the dataset:
>
> ![](https://i.imgur.com/5QNJvir.png)
>
> As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.
>
>
> ![](https://i.imgur.com/KWfcpcO.png)




### 1.1 Download the Data


In [1]:
!pip install opendatasets --upgrade --quiet   # Lib by Jovian to download the public datasets from Kaggle
import opendatasets as od
od.version()


'0.1.22'

The dataset can now be downloaded using `od.download`. When you execute `od.download`, you will be asked to provide your Kaggle username and API key. Follow these instructions to create an API key: http://bit.ly/kaggle-creds

In [2]:
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: numustafa
Your Kaggle Key: ··········
Downloading weather-dataset-rattle-package.zip to ./weather-dataset-rattle-package


100%|██████████| 3.83M/3.83M [00:00<00:00, 197MB/s]







In [3]:
# Check the downloaded folder
import os
data_dir = "/content/weather-dataset-rattle-package"
os.listdir(data_dir)

['weatherAUS.csv']

The folder contains a single file, `weatherAUS.csv`.

In [4]:
train_csv = data_dir + "/weatherAUS.csv"

## 2. Explore the Dataset

### 2.1 Read the CSV

In [5]:
import pandas as pd
raw_df = pd.read_csv(train_csv)
raw_df

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145455,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,SE,...,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,...,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,SE,...,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No
145458,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,SSE,...,51.0,24.0,1019.4,1016.5,3.0,2.0,15.1,26.0,No,No


The dataset is fairly large: over 145,000 rows and 23 columns (the date, 21 input features, and the target `RainTomorrow`).

In [6]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

Some data is missing. Possible reasons: the data was unavailable, it was not measured, or it was not entered correctly.

It also appears that around 3,000 rows are missing values for `RainToday` and `RainTomorrow`. A model can only be trained and evaluated on rows where the target is present, so rows missing these key values need to be handled before modelling.
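As a quick way to quantify the gaps, the share of missing values per column can be computed directly. The sketch below uses a small hypothetical frame (`toy_df`, not the real dataset); the same expression should work on `raw_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_df (values are made up)
toy_df = pd.DataFrame({
    "Rainfall": [0.6, np.nan, 0.0, 1.0],
    "Sunshine": [np.nan, np.nan, 7.5, np.nan],
    "RainTomorrow": ["No", "No", None, "Yes"],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values
missing_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Sorting in descending order puts the sparsest columns first, which makes it easier to decide which ones need imputation.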

### 2.2 Handling the Null Values

#### 1. Drop rows where `RainTomorrow` or `RainToday` is N/A
The reason: `RainTomorrow` is the response variable, so rows without it cannot be used for training or evaluation. Besides, `RainToday` is a basic determining factor for rain tomorrow, so dropping the rows where it is missing should help the quality of the predictions.

In [7]:
raw_df.dropna(subset = ["RainToday", "RainTomorrow"], inplace = True)
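To make the behaviour of `dropna(subset=...)` concrete, here is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up). A row is dropped if either listed column is missing; missing values in other columns are left alone, and the original index labels are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real dataset
toy_df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, np.nan],
    "RainToday": ["No", None, "Yes", "No"],
    "RainTomorrow": ["No", "Yes", None, "Yes"],
})

# Drop a row if RainToday OR RainTomorrow is missing; MinTemp is not checked
cleaned = toy_df.dropna(subset=["RainToday", "RainTomorrow"])
print(cleaned)
print(list(cleaned.index))
```

The surviving rows keep their original labels, which is why the index of `raw_df` has gaps after this step.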

In [8]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140787 entries, 0 to 145458
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           140787 non-null  object 
 1   Location       140787 non-null  object 
 2   MinTemp        140319 non-null  float64
 3   MaxTemp        140480 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81093 non-null   float64
 6   Sunshine       73982 non-null   float64
 7   WindGustDir    131624 non-null  object 
 8   WindGustSpeed  131682 non-null  float64
 9   WindDir9am     131127 non-null  object 
 10  WindDir3pm     137117 non-null  object 
 11  WindSpeed9am   139732 non-null  float64
 12  WindSpeed3pm   138256 non-null  float64
 13  Humidity9am    139270 non-null  float64
 14  Humidity3pm    137286 non-null  float64
 15  Pressure9am    127044 non-null  float64
 16  Pressure3pm    127018 non-null  float64
 17  Cloud9am       88162 non-null

### 2.3 Data Exploration
Check how the distribution of each variable relates to the target variable.

In [9]:
# plotting libraries
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# necessary visualization parameters
sns.set_style("darkgrid")
matplotlib.rcParams["font.size"] = 14
matplotlib.rcParams["figure.figsize"] = (10,6)
matplotlib.rcParams["figure.facecolor"] = "#00000000"
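Alongside plots, a normalized cross-tabulation is one simple way to see how a categorical feature relates to the target. The sketch below uses a hypothetical toy frame (`toy_df`, made-up values), not the real data.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use raw_df instead
toy_df = pd.DataFrame({
    "RainToday": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "RainTomorrow": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# normalize="index" turns each row into proportions that sum to 1
rain_rates = pd.crosstab(
    toy_df["RainToday"], toy_df["RainTomorrow"], normalize="index"
)
print(rain_rates)
```

Each row shows the share of "No" and "Yes" outcomes for tomorrow given today's value.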

## 3. Model Building

With time-series data, it is important to split the data by date, so that the model is trained on past observations and evaluated on later ones, and never sees future labels during training.

In [10]:
# extract the year from the Date column
year = pd.to_datetime(raw_df.Date).dt.year

train_df = raw_df[year<2015]
val_df = raw_df[year==2015]
test_df = raw_df[year>2015]

In [11]:
print("train shape: ", train_df.shape)
print("validation shape: ", val_df.shape)
print("test shape: ", test_df.shape)


train shape:  (97988, 23)
validation shape:  (17089, 23)
test shape:  (25710, 23)
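The year-based split can also be wrapped in a small helper. This is a minimal sketch: `split_by_year` and `toy_df` are hypothetical names introduced here for illustration, not part of the notebook.

```python
import pandas as pd

def split_by_year(df, date_col, val_year):
    """Return (train, val, test): rows before, in, and after val_year."""
    year = pd.to_datetime(df[date_col]).dt.year
    return df[year < val_year], df[year == val_year], df[year > val_year]

# Hypothetical toy frame with made-up dates
toy_df = pd.DataFrame({
    "Date": ["2013-05-01", "2014-12-31", "2015-01-01", "2015-07-15", "2016-03-02"],
    "RainTomorrow": ["No", "Yes", "No", "No", "Yes"],
})

train_part, val_part, test_part = split_by_year(toy_df, "Date", 2015)
print(len(train_part), len(val_part), len(test_part))
```

Because the three masks are mutually exclusive, every row lands in exactly one split and the training rows always precede the validation rows in time.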


In [12]:
# specify the input and target cols
inputs_cols = list(train_df.columns)[1:-1]
target_cols = "RainTomorrow"

In [13]:
inputs_cols

['Location',
 'MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustDir',
 'WindGustSpeed',
 'WindDir9am',
 'WindDir3pm',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm',
 'RainToday']

### 3.1 Data Modification

In [14]:
train_inputs = train_df[inputs_cols].copy()
train_targets = train_df[target_cols].copy()
val_inputs = val_df[inputs_cols].copy()
val_targets = val_df[target_cols].copy()
test_inputs = test_df[inputs_cols].copy()
test_targets = test_df[target_cols].copy()



In [19]:
train_targets

0         No
1         No
2         No
3         No
4         No
          ..
144548    No
144549    No
144550    No
144551    No
144552    No
Name: RainTomorrow, Length: 97988, dtype: object

In [15]:
import numpy as np
# separate the numeric columns from the non-numeric (categorical) columns
numeric_cols = raw_df.select_dtypes(include = np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes("object").columns.tolist()

numeric_cols, categorical_cols

(['MinTemp',
  'MaxTemp',
  'Rainfall',
  'Evaporation',
  'Sunshine',
  'WindGustSpeed',
  'WindSpeed9am',
  'WindSpeed3pm',
  'Humidity9am',
  'Humidity3pm',
  'Pressure9am',
  'Pressure3pm',
  'Cloud9am',
  'Cloud3pm',
  'Temp9am',
  'Temp3pm'],
 ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'])

In [16]:
# find the unique values in categorical cols
train_inputs[categorical_cols].nunique()

Location       49
WindGustDir    16
WindDir9am     16
WindDir3pm     16
RainToday       2
dtype: int64

### 3.2 Data Imputation

In [17]:
from sklearn.impute import SimpleImputer

# Strategy - replace NaN with avg
imputer = SimpleImputer(strategy="mean")


In [18]:
# check the number of NaNs in each column of the raw data
raw_df[numeric_cols].isna().sum()

MinTemp            468
MaxTemp            307
Rainfall             0
Evaporation      59694
Sunshine         66805
WindGustSpeed     9105
WindSpeed9am      1055
WindSpeed3pm      2531
Humidity9am       1517
Humidity3pm       3501
Pressure9am      13743
Pressure3pm      13769
Cloud9am         52625
Cloud3pm         56094
Temp9am            656
Temp3pm           2624
dtype: int64

`Evaporation`, `Sunshine`, `Cloud9am` and `Cloud3pm` are each missing roughly 37% to 47% of their values.

In [20]:
imputer.fit(raw_df[numeric_cols])

In [22]:
# Check the mean for each col
list(imputer.statistics_), numeric_cols

([12.18482386562048,
  23.235120301822324,
  2.349974074310839,
  5.472515506887154,
  7.630539861047281,
  39.97051988882308,
  13.990496092519967,
  18.631140782316862,
  68.82683277087672,
  51.44928834695453,
  1017.6545771543717,
  1015.2579625879797,
  4.431160817585808,
  4.499250233195188,
  16.98706638787991,
  21.69318269001107],
 ['MinTemp',
  'MaxTemp',
  'Rainfall',
  'Evaporation',
  'Sunshine',
  'WindGustSpeed',
  'WindSpeed9am',
  'WindSpeed3pm',
  'Humidity9am',
  'Humidity3pm',
  'Pressure9am',
  'Pressure3pm',
  'Cloud9am',
  'Cloud3pm',
  'Temp9am',
  'Temp3pm'])

In [26]:
# transform the raw_df altogether (replace "only" nans with avg)
raw_df[numeric_cols] = imputer.transform(raw_df[numeric_cols])
raw_df[numeric_cols].isna().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
dtype: int64
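The cells above fit the imputer on all of `raw_df`. A common alternative, sketched below as an assumption rather than as this notebook's method, is to fit on the training split only and then transform each split, so that validation and test values do not influence the imputed means. Note also that `train_inputs`, `val_inputs` and `test_inputs` were copied before this step, so transforming `raw_df` does not change them. The frames `toy_train` and `toy_val` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with made-up values
toy_train = pd.DataFrame({"MinTemp": [10.0, np.nan, 14.0], "Sunshine": [8.0, 6.0, np.nan]})
toy_val = pd.DataFrame({"MinTemp": [np.nan, 20.0], "Sunshine": [np.nan, 9.0]})
cols = ["MinTemp", "Sunshine"]

# Learn the column means from the training split only
toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(toy_train[cols])

# Apply the same training means to every split
toy_train[cols] = toy_imputer.transform(toy_train[cols])
toy_val[cols] = toy_imputer.transform(toy_val[cols])
print(toy_val)
```

The missing validation values are filled with the training means (12.0 and 7.0 here), not with statistics from the validation rows themselves.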

### 3.3 Data Scaling
Use a min-max scaler: identify the minimum and maximum of each column and scale its values to the range 0 to 1.


In [27]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(raw_df[numeric_cols])

raw_df[numeric_cols] = scaler.transform(raw_df[numeric_cols])

raw_df.describe()


Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0,140787.0
mean,0.48785,0.529964,0.006334,0.037741,0.526244,0.263337,0.107619,0.214151,0.688268,0.514493,0.614125,0.610527,0.492351,0.499917,0.510276,0.520023
std,0.150784,0.134343,0.022817,0.021926,0.189061,0.101797,0.068099,0.100214,0.189607,0.20547,0.111557,0.10692,0.253806,0.234384,0.136727,0.131916
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.379717,0.429112,0.0,0.027586,0.526244,0.193798,0.053846,0.149425,0.57,0.37,0.545455,0.5424,0.333333,0.444444,0.411392,0.424184
50%,0.483491,0.519849,0.0,0.037741,0.526244,0.255814,0.1,0.214151,0.7,0.514493,0.614125,0.610527,0.492351,0.499917,0.506329,0.512476
75%,0.596698,0.623819,0.002156,0.037741,0.6,0.310078,0.146154,0.275862,0.83,0.65,0.682645,0.6768,0.666667,0.666667,0.605485,0.608445
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
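For reference, min-max scaling maps each value `x` to `(x - min) / (max - min)` per column. The sketch below checks this formula against `MinMaxScaler` on a tiny made-up array (`values` is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up column with three values
values = np.array([[2.0], [4.0], [10.0]])

# Manual min-max scaling, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same result from scikit-learn
scaled = MinMaxScaler().fit_transform(values)
print(manual.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.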


### 3.4 Encoding Categorical Data


In [29]:
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense array; scikit-learn 1.2 renamed this argument to sparse_output
encoder = OneHotEncoder(sparse = False, handle_unknown="ignore")


In [30]:
# replace nans in categorical cols with Unknown
raw2 = raw_df[categorical_cols].fillna("Unknown")
raw2

Unnamed: 0,Location,WindGustDir,WindDir9am,WindDir3pm,RainToday
0,Albury,W,W,WNW,No
1,Albury,WNW,NNW,WSW,No
2,Albury,WSW,W,WSW,No
3,Albury,NE,SE,E,No
4,Albury,W,ENE,NW,No
...,...,...,...,...,...
145454,Uluru,E,ESE,E,No
145455,Uluru,E,SE,ENE,No
145456,Uluru,NNW,SE,N,No
145457,Uluru,N,SE,WNW,No


In [31]:
encoder.fit(raw2)



In [32]:
encoder.categories_

[array(['Adelaide', 'Albany', 'Albury', 'AliceSprings', 'BadgerysCreek',
        'Ballarat', 'Bendigo', 'Brisbane', 'Cairns', 'Canberra', 'Cobar',
        'CoffsHarbour', 'Dartmoor', 'Darwin', 'GoldCoast', 'Hobart',
        'Katherine', 'Launceston', 'Melbourne', 'MelbourneAirport',
        'Mildura', 'Moree', 'MountGambier', 'MountGinini', 'Newcastle',
        'Nhil', 'NorahHead', 'NorfolkIsland', 'Nuriootpa', 'PearceRAAF',
        'Penrith', 'Perth', 'PerthAirport', 'Portland', 'Richmond', 'Sale',
        'SalmonGums', 'Sydney', 'SydneyAirport', 'Townsville',
        'Tuggeranong', 'Uluru', 'WaggaWagga', 'Walpole', 'Watsonia',
        'Williamtown', 'Witchcliffe', 'Wollongong', 'Woomera'],
       dtype=object),
 array(['E', 'ENE', 'ESE', 'N', 'NE', 'NNE', 'NNW', 'NW', 'S', 'SE', 'SSE',
        'SSW', 'SW', 'Unknown', 'W', 'WNW', 'WSW'], dtype=object),
 array(['E', 'ENE', 'ESE', 'N', 'NE', 'NNE', 'NNW', 'NW', 'S', 'SE', 'SSE',
        'SSW', 'SW', 'Unknown', 'W', 'WNW', 'WSW'], dtype=

In [35]:
# generate col names for each feature
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
encoded_cols

['Location_Adelaide',
 'Location_Albany',
 'Location_Albury',
 'Location_AliceSprings',
 'Location_BadgerysCreek',
 'Location_Ballarat',
 'Location_Bendigo',
 'Location_Brisbane',
 'Location_Cairns',
 'Location_Canberra',
 'Location_Cobar',
 'Location_CoffsHarbour',
 'Location_Dartmoor',
 'Location_Darwin',
 'Location_GoldCoast',
 'Location_Hobart',
 'Location_Katherine',
 'Location_Launceston',
 'Location_Melbourne',
 'Location_MelbourneAirport',
 'Location_Mildura',
 'Location_Moree',
 'Location_MountGambier',
 'Location_MountGinini',
 'Location_Newcastle',
 'Location_Nhil',
 'Location_NorahHead',
 'Location_NorfolkIsland',
 'Location_Nuriootpa',
 'Location_PearceRAAF',
 'Location_Penrith',
 'Location_Perth',
 'Location_PerthAirport',
 'Location_Portland',
 'Location_Richmond',
 'Location_Sale',
 'Location_SalmonGums',
 'Location_Sydney',
 'Location_SydneyAirport',
 'Location_Townsville',
 'Location_Tuggeranong',
 'Location_Uluru',
 'Location_WaggaWagga',
 'Location_Walpole',
 'Locat

In [41]:
# add the one-hot encoded columns to raw_df under the generated column names
raw_df[encoded_cols] = encoder.transform(raw_df[categorical_cols].fillna("Unknown"))

In [42]:
raw_df

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_Unknown,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,RainToday_No,RainToday_Yes
0,2008-12-01,Albury,0.516509,0.523629,0.001617,0.037741,0.526244,W,0.294574,W,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,2008-12-02,Albury,0.375000,0.565217,0.000000,0.037741,0.526244,WNW,0.294574,NNW,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,2008-12-03,Albury,0.504717,0.576560,0.000000,0.037741,0.526244,WSW,0.310078,W,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,2008-12-04,Albury,0.417453,0.620038,0.000000,0.037741,0.526244,NE,0.139535,SE,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2008-12-05,Albury,0.613208,0.701323,0.002695,0.037741,0.526244,W,0.271318,ENE,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,2017-06-20,Uluru,0.283019,0.502836,0.000000,0.037741,0.526244,E,0.193798,ESE,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
145455,2017-06-21,Uluru,0.266509,0.533081,0.000000,0.037741,0.526244,E,0.193798,SE,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
145456,2017-06-22,Uluru,0.285377,0.568998,0.000000,0.037741,0.526244,NNW,0.124031,SE,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
145457,2017-06-23,Uluru,0.327830,0.599244,0.000000,0.037741,0.526244,N,0.240310,SE,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [43]:
new_df = raw_df.copy()