### Variance Inflation Factor in Logistic Regression 

In [2]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn  as sns

Next, we read the dataset and store it in a dataframe using the read_csv() function from the Pandas library. We also load a second copy of the dataset for comparison purposes later.

In [5]:
# reading the dataset
df = pd.read_csv('AusDataForRainPred.csv')
vifdf = pd.read_csv('AusDataForRainPred.csv')

After that, we view the first few rows of the dataframe to get a glimpse of it. To do this, we use the head() function from the Pandas library.

In [8]:
# viewing the first 5 rows
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,12/1/2008,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,12/2/2008,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,12/3/2008,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,12/4/2008,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,12/5/2008,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


Q1. Preprocess the data and remove the attributes which are not useful for predicting rain. Also, remove rows with at least one missing value.

In [11]:
# viewing the dimensions of the dataframe
df.shape

(145460, 23)

As we can see, there are 145460 rows and 23 columns in the dataset. Next, we check for missing values. To do that, we use the isnull() and sum() functions from the Pandas library.

In [14]:
# checking for missing values
df.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

As we can see, only two columns, Date and Location, have no missing values. To deal with the missing values, we follow the instructions and drop the affected rows using the dropna() function from the Pandas library. To keep the duplicate dataset consistent, we drop the same rows from it as well.

In [17]:
# dropping the missing values
df = df.dropna()
vifdf = vifdf.dropna()

In [19]:
# viewing the dimensions of the dataframe
df.shape

(56420, 23)
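Dropping every incomplete row removes more than half of the data here (145460 rows down to 56420), largely because a few columns such as Sunshine and Evaporation are missing in a large share of the rows. A minimal sketch of one way to check this before calling dropna(), using a small made-up dataframe; the values and the 40% threshold are assumptions for illustration and are not part of the assignment:

```python
import numpy as np
import pandas as pd

# toy dataframe standing in for the weather data (values are made up)
toy = pd.DataFrame({
    'MinTemp':  [13.4, 7.4, np.nan, 9.2, 17.5],
    'Sunshine': [np.nan, np.nan, np.nan, 8.1, np.nan],
    'Rainfall': [0.6, 0.0, 0.0, 0.0, 1.0],
})

# fraction of missing values per column
missing_share = toy.isnull().mean()

# rows kept when every incomplete row is dropped
rows_all = len(toy.dropna())

# rows kept when columns that are more than 40% empty are removed first
# (the 40% cut-off is an arbitrary choice for this sketch)
sparse_cols = missing_share[missing_share > 0.4].index.tolist()
rows_after = len(toy.drop(columns=sparse_cols).dropna())

print(missing_share)
print(rows_all, rows_after)
```

Whether removing a sparse column is acceptable depends on how useful that column is for the prediction, so this is a trade-off to weigh rather than a rule.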

Next, we encode the binary column RainToday with 1 for yes and 0 for no. To do this we use the map() function.

In [22]:
# encoding the column
df['RainToday'] = df['RainToday'].map({'No':0, 'Yes':1})
vifdf['RainToday'] = vifdf['RainToday'].map({'No':0, 'Yes':1})

After that, we encode the columns with multiple categorical values. To do this, we loop over the categorical columns and create dummies using the get_dummies() function from the Pandas library. Then we merge the dummies into the dataframe using the merge() function and drop the original columns using the drop() function. To keep the duplicate dataset consistent, we carry out the same process on it as well.

In [25]:
# selecting the categorical columns
categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']

In [27]:
# encoding the values
for column in categorical_columns:
    
    # in original dataframe
    catdf = pd.get_dummies(df[column], prefix=column)    
    df = pd.merge(left=df, right=catdf, left_index=True, right_index=True)    
    df = df.drop(columns=column)
    
    # in duplicate dataframe
    catdf = pd.get_dummies(vifdf[column], prefix=column)    
    vifdf = pd.merge(left=vifdf, right=catdf, left_index=True, right_index=True)    
    vifdf = vifdf.drop(columns=column)
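The loop above can also be written as a single get_dummies() call on the whole dataframe. A minimal sketch on a made-up dataframe follows; the drop_first=True argument is an optional addition that this notebook does not use. It drops one dummy per encoded column so that the dummies of a column no longer sum to a constant, which is itself a source of multicollinearity.

```python
import pandas as pd

# toy dataframe standing in for the weather data (values are made up)
toy = pd.DataFrame({
    'WindGustDir': ['W', 'WNW', 'W', 'NE'],
    'Rainfall': [0.6, 0.0, 0.0, 1.0],
})

# one call encodes the listed columns and removes the originals
encoded = pd.get_dummies(toy, columns=['WindGustDir'])

# optional variant: drop the first level of each encoded column
encoded_reduced = pd.get_dummies(toy, columns=['WindGustDir'], drop_first=True)

print(encoded.columns.tolist())
print(encoded_reduced.columns.tolist())
```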

Next, we split the original dataset into independent variables (x) and the dependent variable (y), dropping the Date column since it is not needed for classification. We then use the train_test_split() function from the sklearn library to divide the data into training and testing sets.

In [30]:
# splitting the data into independent and dependent variables
x = df.drop(columns=['Date', 'RainTomorrow'])
y = df['RainTomorrow']

In [32]:
# dividing the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=105)

Next, we import the LogisticRegression class from the sklearn library, build a classification model, and train it on the original training set. Then we use the original testing set to calculate the accuracy and print it.

In [35]:
# building the model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [37]:
# training the model
logreg.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
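
The warning above means the solver stopped before it converged, which is common when the features are on very different scales (pressure values are around 1000 while rainfall values are close to 0). A minimal sketch of the two remedies the message itself suggests, scaling the data and raising max_iter, on synthetic data; the data, the value max_iter=1000, and the variable names are assumptions for illustration, and the accuracies reported in this notebook come from the unscaled model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(105)

# synthetic features on very different scales (made-up data)
z = rng.normal(size=(600, 3))
features = z * np.array([1.0, 7.0, 30.0]) + np.array([0.0, 1015.0, 50.0])
labels = (2.0 * z[:, 0] - 1.5 * z[:, 1] + z[:, 2] > 0).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, random_state=105)

# standardize inside a pipeline so the scaler is fitted on the training set only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_tr, y_tr)

accuracy = model.score(x_te, y_te)
print(f'Accuracy: {accuracy * 100:.2f}%')
```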


In [39]:
# printing the accuracy
print(f'Accuracy: {logreg.score(x_test, y_test) * 100:.2f}%')

Accuracy: 84.92%


As we can see, the accuracy of the model is 84.92%. This is the baseline against which we compare the model built after the VIF analysis.

Q2. Calculate the Variance Inflation Factor (VIF) value. VIF is a number that indicates how strongly a variable is affected by multicollinearity. It starts from 1 and has no upper limit; the larger the value, the stronger the multicollinearity involving that variable.

To solve Question 2, we use the variance_inflation_factor() function from the statsmodels library on the numerical columns to obtain the VIF of each column. We use the duplicate dataframe from here onwards.
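
For reference, the VIF of feature j is usually defined as 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features. A minimal NumPy sketch of that definition on synthetic data; the helper name vif_from_definition is hypothetical, and including an intercept in each auxiliary regression is an assumption of this sketch:

```python
import numpy as np

def vif_from_definition(matrix, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = matrix[:, j]
    others = np.delete(matrix, j, axis=1)
    # add an intercept column to the auxiliary regression (assumption of this sketch)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r_squared = 1.0 - residual.var() / target.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)               # independent of a
c = a + 0.1 * rng.normal(size=1000)     # almost a copy of a
toy = np.column_stack([a, b, c])

vifs = [vif_from_definition(toy, j) for j in range(toy.shape[1])]
print(vifs)
```

In this toy example the two nearly identical columns get a large VIF, while the independent column stays close to 1.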

In [45]:
# importing the variance_inflation_factor() function
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [47]:
# extracting the numerical columns
cols = [cname for cname in vifdf.columns if vifdf[cname].dtype in ['int64', 'float64']]
data = vifdf[cols]

In [49]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data['Feature'] = data.columns

In [51]:
# calculating VIF for each feature
vif_data['VIF']= [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]

In [53]:
# printing the VIF of each feature
print(vif_data)

          Feature            VIF
0         MinTemp      58.284148
1         MaxTemp     609.736465
2        Rainfall       1.627144
3     Evaporation       7.220027
4        Sunshine      17.318912
5   WindGustSpeed      26.969533
6    WindSpeed9am       8.403752
7    WindSpeed3pm      13.776204
8     Humidity9am      61.146908
9     Humidity3pm      47.805246
10    Pressure9am  432633.791186
11    Pressure3pm  430736.444717
12       Cloud9am       7.362031
13       Cloud3pm       8.322534
14        Temp9am     210.815320
15        Temp3pm     674.785736
16      RainToday       2.227511


As we can see, the columns MinTemp, MaxTemp, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Temp9am, and Temp3pm have very high VIF values, all above 45.
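
The extreme values for Pressure9am and Pressure3pm are likely inflated by how the number is computed. The variance_inflation_factor() function in statsmodels regresses each column on exactly the columns it is given and does not add an intercept, so the R-squared it uses is uncentered. A column whose mean is large compared with its spread then looks almost perfectly explained. The sketch below imitates both variants in NumPy on synthetic pressure-like data; the data and the helper name vif are made up for illustration, and it is not the statsmodels implementation:

```python
import numpy as np

def vif(matrix, j, add_intercept):
    """1 / (1 - R^2) for column j, with or without an intercept term."""
    target = matrix[:, j]
    design = np.delete(matrix, j, axis=1)
    if add_intercept:
        design = np.column_stack([np.ones(len(target)), design])
        total = ((target - target.mean()) ** 2).sum()   # centered total sum of squares
    else:
        total = (target ** 2).sum()                     # uncentered total sum of squares
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = ((target - design @ coef) ** 2).sum()
    # R^2 = 1 - residual / total, so 1 / (1 - R^2) = total / residual
    return total / residual

rng = np.random.default_rng(1)
p9 = 1015 + rng.normal(scale=7, size=2000)     # made-up morning pressure
p3 = p9 - 2 + rng.normal(scale=2, size=2000)   # made-up afternoon pressure
pressure = np.column_stack([p9, p3])

vif_no_const = vif(pressure, 0, add_intercept=False)
vif_const = vif(pressure, 0, add_intercept=True)
print(vif_no_const, vif_const)
```

With statsmodels itself, the usual way to get the centered version is to append a constant column to the data before computing the VIFs and to ignore the VIF reported for that constant.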

Q3. Remove multicollinearities by creating new features. Find the features that have paired values and create the new feature which is the difference value between those pairs.

To solve Question 3, we engineer the following features:

    Range: MaxTemp - MinTemp
    WindSpeed: WindSpeed3pm - WindSpeed9am
    Humidity: Humidity3pm - Humidity9am
    Pressure: Pressure3pm - Pressure9am
    Cloud: Cloud3pm - Cloud9am
    Temp: Temp3pm - Temp9am

Then, we drop the columns used to engineer the new features to remove the multicollinearity.

In [59]:
# engineering the new features
vifdf['Range'] = vifdf['MaxTemp'] - vifdf['MinTemp']
vifdf['WindSpeed'] = vifdf['WindSpeed3pm'] - vifdf['WindSpeed9am']
vifdf['Humidity'] = vifdf['Humidity3pm'] - vifdf['Humidity9am']
vifdf['Pressure'] = vifdf['Pressure3pm'] - vifdf['Pressure9am']
vifdf['Cloud'] = vifdf['Cloud3pm'] - vifdf['Cloud9am']
vifdf['Temp'] = vifdf['Temp3pm'] - vifdf['Temp9am']

In [61]:
# dropping the original columns
vifdf = vifdf.drop(columns=['MaxTemp', 'MinTemp', 'WindSpeed3pm', 'WindSpeed9am', 'Humidity3pm', 'Humidity9am', 'Pressure3pm',
                     'Pressure9am', 'Cloud3pm', 'Cloud9am', 'Temp3pm', 'Temp9am'])

Next, we use the variance_inflation_factor() function from the statsmodels library on the numerical columns to obtain the VIF of each column.

In [64]:
# importing the variance_inflation_factor() function
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [66]:
# extracting the numerical columns
cols = [cname for cname in vifdf.columns if vifdf[cname].dtype in ['int64', 'float64']]
data = vifdf[cols]

In [68]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data['Feature'] = data.columns

In [70]:
# calculating VIF for each feature
vif_data['VIF']= [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]

In [72]:
# printing the VIF of each feature
print(vif_data)

          Feature        VIF
0        Rainfall   1.591422
1     Evaporation   4.373922
2        Sunshine   8.469864
3   WindGustSpeed   6.641747
4       RainToday   2.061839
5           Range  19.539385
6       WindSpeed   1.371614
7        Humidity   5.575934
8        Pressure   3.304605
9           Cloud   1.077251
10           Temp  13.968278


As we can see, we have significantly reduced the VIF for many columns by engineering new features and removing the original ones.

Q4. Remove features that have a VIF value above 5.

To solve Question 4, we drop the columns Sunshine, WindGustSpeed, Range, and Temp since they have VIF above 5. Note that Humidity (VIF 5.58) is also marginally above 5 but is kept in the code below. Next, we split the duplicate dataset into independent variables (x) and the dependent variable (y), dropping the Date column since it is not needed for classification, and use the train_test_split() function from the sklearn library to divide the data into training and testing sets.
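
The columns to drop can also be selected programmatically instead of being typed by hand. A minimal sketch that rebuilds the VIF table printed above and applies the threshold of 5; the variable names vif_table, threshold, and to_drop are illustrative:

```python
import pandas as pd

# VIF values copied from the table printed above
vif_table = pd.DataFrame({
    'Feature': ['Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'RainToday',
                'Range', 'WindSpeed', 'Humidity', 'Pressure', 'Cloud', 'Temp'],
    'VIF': [1.591422, 4.373922, 8.469864, 6.641747, 2.061839,
            19.539385, 1.371614, 5.575934, 3.304605, 1.077251, 13.968278],
})

# keep the names of all features whose VIF exceeds the threshold
threshold = 5
to_drop = vif_table.loc[vif_table['VIF'] > threshold, 'Feature'].tolist()
print(to_drop)
```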

In [77]:
# splitting the data into independent and dependent variables
x = vifdf.drop(columns=['Date', 'Sunshine', 'WindGustSpeed', 'Range', 'Temp', 'RainTomorrow'])
y = vifdf['RainTomorrow']

In [79]:
# dividing the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=105)

Q5. Build a regression model to perform the rain prediction. Also, tabulate the accuracy of the prediction models before and after VIF computation.

To solve Question 5, we import the LogisticRegression class from the sklearn library, build a classification model, and train it on the duplicate training set. Then we use the duplicate testing set to calculate the accuracy and print it.

In [83]:
# building the model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [85]:
# training the model
logreg.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [87]:
# printing the accuracy
print(f'Accuracy: {logreg.score(x_test, y_test) * 100:.2f}%')

Accuracy: 81.25%


As we can see, the accuracy of the model is 81.25%, a drop of 3.67 percentage points compared to the model built using the original dataset (84.92%). In exchange, most of the multicollinearity among the numerical features has been removed, so the coefficient estimates should have lower variance and be less sensitive to minor changes in the data or the model.
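
Question 5 also asks for the accuracies to be tabulated. A minimal sketch that collects the two accuracies printed above into one dataframe; the row and column labels are illustrative:

```python
import pandas as pd

# accuracies copied from the outputs of the two models above
results = pd.DataFrame({
    'Model': ['Before VIF-based feature removal', 'After VIF-based feature removal'],
    'Accuracy (%)': [84.92, 81.25],
})

# change relative to the first model, in percentage points
results['Change (pp)'] = results['Accuracy (%)'] - results.loc[0, 'Accuracy (%)']
print(results.to_string(index=False))
```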