## Week 5 - Exercise 1

Author: Khushee Kapoor

Last Updated: 22/4/22

### Setting Up

To start, we have imported the following libraries:

- NumPy: to work with the data
- Pandas: to manipulate the dataframe
- MatPlotLib: for data visualization
- Seaborn: for data visulization

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn  as sns

Next, we read the dataset and store it into a dataframe using the read_csv() function from the Pandas library. We also create another dataset for comparsion purposes later.

In [2]:
# reading the dataset
df = pd.read_csv('weatherAUS.csv')
vifdf = pd.read_csv('weatherAUS.csv')

After that, we view the first few rows of the dataframe to get a glimpse of it. To do this, we use the head() function from the Pandas library.

In [3]:
# viewing the first 5 rows
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


### Q1. Preprocess the data, remove the attributes which were are not useful to predict rain. Also, remove rows with at least one missing value for each of them. 

To sovle Question 1, we first view the dimensions of the dataframe by using the shape attribute.

In [4]:
# viewing the dimensions of the dataframe
df.shape

(34978, 23)

As we can see, there are 34978 rows and 23 columns in the dataset. Next, we check for missing values. To do that, we use the isnull() and sum() functions from the Pandas library.

In [5]:
# checking for missing values
df.isnull().sum()

Date                 0
Location             1
MinTemp            500
MaxTemp            370
Rainfall           687
Evaporation      19104
Sunshine         23532
WindGustDir       4936
WindGustSpeed     4932
WindDir9am        4472
WindDir3pm        2101
WindSpeed9am       831
WindSpeed3pm      1475
Humidity9am        666
Humidity3pm       1330
Pressure9am       6692
Pressure3pm       6683
Cloud9am         15814
Cloud3pm         16140
Temp9am            437
Temp3pm           1105
RainToday          687
RainTomorrow       687
dtype: int64

As we can see, there is only one column - Date, with no missing values. To deal with the missing values, we follow the instructions and drop the rows using the dropna() function from the Pandas library. To maintain consistency between the duplicate dataset, we drop the values in that as well.

In [6]:
# dropping the missing values
df = df.dropna()
vifdf = vifdf.dropna()

Next, we encode the binary column RainToday with 1 for yes and 0 for no. To do this we use the map() function.

In [7]:
# encoding the column
df['RainToday'] = df['RainToday'].map({'No':0, 'Yes':1})
vifdf['RainToday'] = vifdf['RainToday'].map({'No':0, 'Yes':1})

After that, we encode the columns with multiple categorical values. To do this, we loop over the categorical columns and create dummies using the get_dummies() funciton from the Pandas library. Then we merge the dummies using the merge() funcion and drop the original columns using the drop() function. To maintain consistency between the duplicate dataset, we carry out the same process in that as well.

In [8]:
# selecting the categorical columns
categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']

# encoding the values
for column in categorical_columns:
    
    # in original dataframe
    catdf = pd.get_dummies(df[column], prefix=column)    
    df = pd.merge(left=df, right=catdf, left_index=True, right_index=True)    
    df = df.drop(columns=column)
    
    # in duplicate dataframe
    catdf = pd.get_dummies(vifdf[column], prefix=column)    
    vifdf = pd.merge(left=vifdf, right=catdf, left_index=True, right_index=True)    
    vifdf = vifdf.drop(columns=column)

Next we split the original dataset into independent variables (x) and dependent variable (y) and use the train_test_split() function from the sklearn library and divide the dataset into training and testing sets. We also drop the Date column since it is not necessary for classification.

In [9]:
# splitting the data into independent and dependent variables
x = df.drop(columns=['Date', 'RainTomorrow'])
y = df['RainTomorrow']

# diving the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=105)

Next, we import the LogisticRegression module from the sklearn library and build a classification model and train it on the original training set. Then we use the original testing set to calculate the accuracy and print it.

In [10]:
# building the model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

# training the model
logreg.fit(x_train, y_train)

# printing the accuracy
print(str.format('Accuracy: {:.2f}%', logreg.score(x_test, y_test)*100))

Accuracy: 83.71%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


As we can see, the accuracy of the model is 83.71%, so the model performs moderately well.

### Q2. Calculate the Variance Inflation Factor (VIF) value. VIF is a number that determines whether a variable has multicollinearity or not (starts from 1, and it has no upper limit. If the number gets larger, it means the variable has huge multicollinearity on it.). 

To solve Question 2, we use the variance_inflation_factor() function from the statsmodels library on the numerical columns to obtain the VIF of each column. **We use the duplicate dataframe here onwards.**

In [11]:
# importing the variance_inflation_factor() function
from statsmodels.stats.outliers_influence import variance_inflation_factor

# extracting the numerical columns
cols = [cname for cname in vifdf.columns if vifdf[cname].dtype in ['int64', 'float64']]
data = vifdf[cols]
  
# VIF dataframe
vif_data = pd.DataFrame()
vif_data['Feature'] = data.columns
  
# calculating VIF for each feature
vif_data['VIF']= [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]

# printing the VIF of each feature
print(vif_data)

          Feature            VIF
0         MinTemp      79.232526
1         MaxTemp     658.913695
2        Rainfall       1.744358
3     Evaporation       5.025893
4        Sunshine      19.252299
5   WindGustSpeed      29.240446
6    WindSpeed9am       9.415210
7    WindSpeed3pm      14.866406
8     Humidity9am      76.366273
9     Humidity3pm      64.338024
10    Pressure9am  582307.075465
11    Pressure3pm  579165.727765
12       Cloud9am       8.386911
13       Cloud3pm       9.098550
14        Temp9am     300.273019
15        Temp3pm     767.945037
16      RainToday       2.292319


As we can see, the columns - MinTemp, MaxTemp, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Temp9am, and Temp3pm have very high values of VIF.

### Q3. Remove multicollinearities by creating new features. Find the features that have paired values and create the new feature which is the difference value between those pairs.

To solve Question 3, we engineer the following features:

- Range: MaxTemp - MinTemp
- WindSpeed: WindSpeed3pm - WindSpeed9am
- Humidity: Humidity3pm - Humidity9am
- Pressure: Pressure3pm - Pressure9am
- Cloud: Cloud3pm - Cloud9am
- Temp: Temp3pm - Temp9am

Then, we drop the columns used to engineer the new features to remove the multicollinearity.

In [12]:
# engineering the new features
vifdf['Range'] = vifdf['MaxTemp'] - vifdf['MinTemp']
vifdf['WindSpeed'] = vifdf['WindSpeed3pm'] - vifdf['WindSpeed9am']
vifdf['Humidity'] = vifdf['Humidity3pm'] - vifdf['Humidity9am']
vifdf['Pressure'] = vifdf['Pressure3pm'] - vifdf['Pressure9am']
vifdf['Cloud'] = vifdf['Cloud3pm'] - vifdf['Cloud9am']
vifdf['Temp'] = vifdf['Temp3pm'] - vifdf['Temp9am']

# dropping the original columns
vifdf = vifdf.drop(columns=['MaxTemp', 'MinTemp', 'WindSpeed3pm', 'WindSpeed9am', 'Humidity3pm', 'Humidity9am', 'Pressure3pm',
                     'Pressure9am', 'Cloud3pm', 'Cloud9am', 'Temp3pm', 'Temp9am'])

Next, we use the variance_inflation_factor() function from the statsmodels library on the numerical columns to obtain the VIF of each column.

In [13]:
# importing the variance_inflation_factor() function
from statsmodels.stats.outliers_influence import variance_inflation_factor

# extracting the numerical columns
cols = [cname for cname in vifdf.columns if vifdf[cname].dtype in ['int64', 'float64']]
data = vifdf[cols]
  
# VIF dataframe
vif_data = pd.DataFrame()
vif_data['Feature'] = data.columns
  
# calculating VIF for each feature
vif_data['VIF']= [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]

# printing the VIF of each feature
print(vif_data)

          Feature        VIF
0        Rainfall   1.705682
1     Evaporation   3.567996
2        Sunshine   7.267535
3   WindGustSpeed   7.052104
4       RainToday   2.145453
5           Range  17.785578
6       WindSpeed   1.272272
7        Humidity   5.179944
8        Pressure   4.353885
9           Cloud   1.139004
10           Temp  12.927773


As we can see, we have significantly reduced the VIF for many columns by engineering new features and removing the original ones.

### Q4. Remove features that have a VIF value above 5.

To solve Question 4, we drop the columns Sunshine, WindGustSpeed, Range, and Temp since they have VIF above 5. Next we split the duplicate dataset into independent variables (x) and dependent variable (y) and use the train_test_split() function from the sklearn library and divide the dataset into training and testing sets. We also drop the Date column since it is not necessary for classification.

In [14]:
# splitting the data into independent and dependent variables
x = vifdf.drop(columns=['Date', 'Sunshine', 'WindGustSpeed', 'Range', 'Temp', 'RainTomorrow'])
y = vifdf['RainTomorrow']

# diving the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=105)

### Q5. Build a regression model to perform the Rain prediction. Also, tabulate accuracy of the prediction models, before and VIF computation

To solve Question 5, we import the LogisticRegression module from the sklearn library and build a classification model and train it on the duplicate training set. Then we use the duplicate testing set to calculate the accuracy and print it.

In [15]:
# building the model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

# training the model
logreg.fit(x_train, y_train)

# printing the accuracy
print(str.format('Accuracy: {:.2f}%', logreg.score(x_test, y_test)*100))

Accuracy: 79.39%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


As we can see, the accuracy of the model is 79.39%. The accuracy has slightly dropped in comparison to the model built using the original dataset. However, the model is now much more stable because we have removed the multicollinearity, so the variance of the coefficient estimate is stable and is not very sensitive to minor changes in the model.