**Note to grader:**


----

<i>General instructions for this and future notebooks:</i>
1. To run a cell and move to the next cell: Hold down <strong>Shift</strong> and press <strong>Enter</strong>.
2. To run a cell and stay in the same cell: Hold down <b>Ctrl</b> and press <b>Enter</b>.
3. Use the up and down arrow keys to move between cells, or click a cell directly if you prefer.
4. Escape from typing a cell: Hit <b>Esc</b>.

---------

<b>Note: </b>

> You must run/evaluate all cells. <b>Order of cell execution is important.</b>



> You must work directly out of a copy of the assignment notebook given to you, in the exact order.


In [94]:
# Grader's area

import numpy as np

# This assignment contains 3 exercises, each with at most 2 parts.
# We initialize a 4x3 array M of zeros; row 0 and column 0 are unused so that indices start at 1.
# The grade for question i, part j, will be recorded in M[i,j].
# Then the total grade can be easily computed in the last grader's area.

M = np.zeros((4,3))
max_score_M = np.zeros((4,3))

# **Assignment 3**
The third assignment helps you become familiar with using Python's machine learning libraries to make predictions. The goal of this assignment is to predict the weather type based on the conditions of the day.

### **Part 1: Data pre-processing**

Data pre-processing is a fundamental phase in machine learning. Real-world data is often messy: it can contain missing values, duplicates, and badly formatted records. If such data is fed directly to a machine learning model, the noise can degrade its predictions. Pre-processing detects, corrects, or removes corrupt, inaccurate, or irrelevant records from a dataset so that the model can predict outcomes more effectively.



**Step 1**: Download the weather dataset to the local Colab environment. Then, read the dataset into a dataframe using the <code>pandas</code> package.

The changes in this file were tracked using Git to demonstrate my development process and can be accessed using the following link: https://github.com/joshuakobuskie/CS670-M12Assignment. Cloning a Git repo into the current Git repo will cause an error. This should not be a problem if this file is run separately. In order to address this issue, the cloned repo Git tracking file was removed and just the dataset and README were stored in a folder for access and included as part of the current Git repo. The below command should only be run outside the Git environment and if the dataset folder does not already exist.

In [95]:
# This line will only work if the directory does not already exist
!git clone https://github.com/XYZNJIT/weather_prediction_dataset.git

fatal: destination path 'weather_prediction_dataset' already exists and is not an empty directory.


The following code was modified to work either in Colab or with a local copy of the file tracked using Git.

In [96]:
import pandas as pd
import os
import shutil

# Try colab notebook first, handle exception when working locally using VS Code and support Git
try:
    dataset = pd.read_csv('/content/weather_prediction_dataset/seattle-weather.csv')
except FileNotFoundError:
    dataset = pd.read_csv("weather_prediction_dataset/seattle-weather.csv")

**Step 2**: We use methods such as <code>info</code> in pandas to help us get a preliminary overview of the dataset.



Based on the information provided below, we understand that the dataset comprises six columns: date, precipitation, maximum temperature, minimum temperature, wind speed, and weather type. From the data types, we can discern which columns contain numeric data and which contain non-numeric data.



In [97]:
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1466 entries, 0 to 1465
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           1466 non-null   object 
 1   precipitation  1466 non-null   float64
 2   temp_max       1464 non-null   float64
 3   temp_min       1463 non-null   float64
 4   wind           1466 non-null   float64
 5   weather        1466 non-null   object 
dtypes: float64(4), object(2)
memory usage: 68.8+ KB
None


The <code>shape</code> attribute reports the size of the dataset as (rows, columns).

In [98]:
print(dataset.shape)

(1466, 6)


In [99]:
dataset

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012/1/1,0.0,12.8,5.0,4.7,drizzle
1,2012/1/2,10.9,10.6,2.8,4.5,rain
2,2012/1/3,0.8,11.7,7.2,2.3,rain
3,2012/1/4,20.3,12.2,5.6,4.7,rain
4,2012/1/5,1.3,8.9,2.8,6.1,rain
...,...,...,...,...,...,...
1461,2015/12/27,8.6,4.4,1.7,2.9,rain
1462,2015/12/28,1.5,5.0,1.7,1.3,rain
1463,2015/12/29,0.0,7.2,0.6,2.6,fog
1464,2015/12/30,0.0,5.6,-1.0,3.4,sun


**Step 3**: Remove bad data, such as empty cells, duplicates, and data in the wrong format.

The method <code>dropna()</code> deletes rows that have missing values. The number of rows in the shape changed from 1466 to 1461, indicating that five rows contained missing data.

In [100]:
dataset = dataset.dropna()
print(dataset.shape)

(1461, 6)
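
Before dropping rows, it can help to see where the missing values are. In the <code>info()</code> output above, <code>temp_max</code> has 1464 non-null entries and <code>temp_min</code> has 1463, which accounts for 2 + 3 = 5 missing cells. The sketch below shows the same check on a small made-up frame; the name `toy_weather` and its values are assumptions for illustration only, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data (values are made up).
toy_weather = pd.DataFrame({
    "temp_max": [12.8, np.nan, 11.7, 12.2],
    "temp_min": [5.0, 2.8, np.nan, 5.6],
    "wind": [4.7, 4.5, 2.3, 4.7],
})

# Count missing cells per column, and rows that contain at least one missing cell.
missing_per_column = toy_weather.isna().sum()
rows_with_missing = int(toy_weather.isna().any(axis=1).sum())
print(missing_per_column)
print("Rows with a missing value:", rows_with_missing)

# dropna() removes exactly those rows.
toy_cleaned = toy_weather.dropna()
print(toy_cleaned.shape)
```

The number of rows removed by <code>dropna()</code> equals the number of rows with at least one missing cell, which can be smaller than the total count of missing cells when one row has several.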


#### <font color='#008DFF'>  **Question 1**

Use function <code>drop_duplicates()</code> to remove duplicates. <br><br>

<b>Expected Output:</b> <br><code>print(dataset.shape) <br>(1456, 6)</code>

The drop_duplicates function is used to remove the duplicates from the data frame and the expected shape is achieved.

In [101]:
# Your code here. Add more cells if needed
dataset = dataset.drop_duplicates()

print(dataset.shape)

(1456, 6)
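
To check how many rows will be removed before calling <code>drop_duplicates()</code>, the <code>duplicated()</code> method can be used. This is a minimal sketch on a made-up frame; `toy_rows` is a hypothetical name and the values are illustrative.

```python
import pandas as pd

# Hypothetical toy frame: the second and third rows are identical.
toy_rows = pd.DataFrame({
    "date": ["2012/1/1", "2012/1/2", "2012/1/2", "2012/1/3"],
    "wind": [4.7, 4.5, 4.5, 2.3],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy_rows.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
toy_deduped = toy_rows.drop_duplicates()
print(toy_deduped.shape)
```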


In [102]:
# Grader's area


# M[1,1] =

max_score_M[1,1] = 2

**Step 4**: Convert data that cannot be directly used, such as string-type data, into numerical type.


In our current dataset, dates cannot be directly used as numeric data. However, the month may also be a contributing factor to the weather type. Therefore, we extract the month from the date and create a new column for it.

In [103]:
def getMonthInfo(row):
    month = row.split('/')[1]
    return month

dataset['month'] = dataset['date'].apply(lambda row : getMonthInfo(row))

By printing the dataset, you can see that a new column for the month has been added.

In [104]:
dataset

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather,month
0,2012/1/1,0.0,12.8,5.0,4.7,drizzle,1
1,2012/1/2,10.9,10.6,2.8,4.5,rain,1
2,2012/1/3,0.8,11.7,7.2,2.3,rain,1
3,2012/1/4,20.3,12.2,5.6,4.7,rain,1
4,2012/1/5,1.3,8.9,2.8,6.1,rain,1
...,...,...,...,...,...,...,...
1461,2015/12/27,8.6,4.4,1.7,2.9,rain,12
1462,2015/12/28,1.5,5.0,1.7,1.3,rain,12
1463,2015/12/29,0.0,7.2,0.6,2.6,fog,12
1464,2015/12/30,0.0,5.6,-1.0,3.4,sun,12
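
Splitting the string works here because every date follows the same `year/month/day` pattern. An alternative sketch, assuming that same pattern, is to parse the column with <code>pd.to_datetime</code> and read the parts from the <code>.dt</code> accessor. Note that this yields integers rather than the strings produced by <code>split</code>. The frame `toy_dates` is a hypothetical example.

```python
import pandas as pd

# Hypothetical toy frame using the same date layout as the weather dataset.
toy_dates = pd.DataFrame({"date": ["2012/1/1", "2012/1/2", "2015/12/30"]})

# Parse the strings once, then read integer components from the result.
parsed = pd.to_datetime(toy_dates["date"], format="%Y/%m/%d")
toy_dates["month"] = parsed.dt.month
toy_dates["day"] = parsed.dt.day
print(toy_dates)
```

Parsing also validates the data: a malformed date raises an error instead of silently producing a wrong month.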


#### <font color='#008DFF'>  **Question 2**

Please extract the day from the date as well, ensuring the day's range is in [1, 31]. <br><br>

<b>Expected Output:</b> <br><code>print(dataset.shape) <br>(1456, 8) <br>
print(len(dataset['day'].drop_duplicates()))<br>
31
</code>

The same methodology that was used to extract the month info is used to extract the day info, and the day info is then added to the data frame to create the desired data and shape.

In [105]:
# Your code here. Add more cells if needed
def getDayInfo(row):
    day = row.split("/")[2]
    return day

dataset["day"] = dataset["date"].apply(lambda row : getDayInfo(row))

print(dataset.shape)

(1456, 8)


A second cell is added to demonstrate that 31 unique days were extracted, as expected.

In [106]:
print(len(dataset['day'].drop_duplicates()))

31


In [107]:
# Grader's area


# M[1,2] =

max_score_M[1,2] = 3

In fact, using numbers to represent months is not ideal because there should not be a numerical order between months. For instance, December and January are consecutive months, but numerically, they are far apart. Therefore, we introduce an algorithm called "One-hot encoding" to address this issue.
<br><br> One-hot encoding turns categories (like colors or types) into separate columns with 0s and 1s. If the original data value matches the column's category, that entry is marked with a 1; otherwise, it's marked with a 0. This method helps overcome the problem of machine learning models misinterpreting categorical data as having some sort of numerical relationship or order when there is none.
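
As a small illustration of the encoding described above, the sketch below one-hot encodes a made-up month column. The names `toy_months` and `toy_dummies` are hypothetical. Converting the strings to integers first and passing a `prefix` are optional choices made here so that the new columns sort numerically and have readable names.

```python
import pandas as pd

# Hypothetical toy column of month strings, as produced by splitting the date.
toy_months = pd.DataFrame({"month": ["1", "2", "12", "1"]})

# One column per distinct month; each row has exactly one "on" entry.
toy_dummies = pd.get_dummies(toy_months["month"].astype(int), prefix="month")
print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Without the integer conversion, the columns are ordered as strings (`1`, `10`, `11`, `12`, `2`, ...). This affects only the column order, not the encoding itself.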


In [108]:
month_dummies = pd.get_dummies(dataset['month'])
dataset = pd.concat([dataset, month_dummies], axis=1)

dataset

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather,month,day,1,10,11,12,2,3,4,5,6,7,8,9
0,2012/1/1,0.0,12.8,5.0,4.7,drizzle,1,1,True,False,False,False,False,False,False,False,False,False,False,False
1,2012/1/2,10.9,10.6,2.8,4.5,rain,1,2,True,False,False,False,False,False,False,False,False,False,False,False
2,2012/1/3,0.8,11.7,7.2,2.3,rain,1,3,True,False,False,False,False,False,False,False,False,False,False,False
3,2012/1/4,20.3,12.2,5.6,4.7,rain,1,4,True,False,False,False,False,False,False,False,False,False,False,False
4,2012/1/5,1.3,8.9,2.8,6.1,rain,1,5,True,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1461,2015/12/27,8.6,4.4,1.7,2.9,rain,12,27,False,False,False,True,False,False,False,False,False,False,False,False
1462,2015/12/28,1.5,5.0,1.7,1.3,rain,12,28,False,False,False,True,False,False,False,False,False,False,False,False
1463,2015/12/29,0.0,7.2,0.6,2.6,fog,12,29,False,False,False,True,False,False,False,False,False,False,False,False
1464,2015/12/30,0.0,5.6,-1.0,3.4,sun,12,30,False,False,False,True,False,False,False,False,False,False,False,False


**Step 5**: Remove unnecessary columns.

In [109]:
dataset = dataset.drop(columns=['date', 'day'])

dataset

Unnamed: 0,precipitation,temp_max,temp_min,wind,weather,month,1,10,11,12,2,3,4,5,6,7,8,9
0,0.0,12.8,5.0,4.7,drizzle,1,True,False,False,False,False,False,False,False,False,False,False,False
1,10.9,10.6,2.8,4.5,rain,1,True,False,False,False,False,False,False,False,False,False,False,False
2,0.8,11.7,7.2,2.3,rain,1,True,False,False,False,False,False,False,False,False,False,False,False
3,20.3,12.2,5.6,4.7,rain,1,True,False,False,False,False,False,False,False,False,False,False,False
4,1.3,8.9,2.8,6.1,rain,1,True,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1461,8.6,4.4,1.7,2.9,rain,12,False,False,False,True,False,False,False,False,False,False,False,False
1462,1.5,5.0,1.7,1.3,rain,12,False,False,False,True,False,False,False,False,False,False,False,False
1463,0.0,7.2,0.6,2.6,fog,12,False,False,False,True,False,False,False,False,False,False,False,False
1464,0.0,5.6,-1.0,3.4,sun,12,False,False,False,True,False,False,False,False,False,False,False,False


Data cleaning is a lengthy process, and there are many techniques we have not yet introduced. You are encouraged to investigate further.

### **Part 2: Training and testing**

As we mentioned before, our goal is to predict the weather type. Predictions can be divided into two types: classification, which predicts discrete outcomes, and regression, which predicts continuous outcomes. Since the weather type is a categorical variable, we need to employ a classification model.

**Step 1**: The first step is to divide the current dataset into a training set and a testing set. In practice, a validation set is often held out as well; it is used to compare models and tune hyperparameters so that the testing set stays untouched until the final evaluation. For now, we will use only the training set and the testing set to train and test the model.

In [110]:
from sklearn.model_selection import train_test_split

y = dataset['weather']
X = dataset.drop(columns=['weather'])


X_train, X_test, y_train,  y_test = train_test_split(X,y ,
                          random_state=104,
                          train_size=0.8, shuffle=True)

print("X_train shape: " , X_train.shape, "X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape, "y_test shape: ", y_test.shape)

X_train shape:  (1164, 17) X_test shape:  (292, 17)
y_train shape:  (1164,) y_test shape:  (292,)
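
If a validation set is wanted in addition to the training and testing sets, one common approach is to call <code>train_test_split</code> twice. The sketch below uses made-up arrays and a 60/20/20 split; the `demo_` names and the proportions are assumptions for illustration, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, binary labels.
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100) % 2

# First split: hold out 20% of the data as the test set.
demo_X_trainval, demo_X_test, demo_y_trainval, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=104)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall).
demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X_trainval, demo_y_trainval, test_size=0.25, random_state=104)

print(len(demo_X_train), len(demo_X_val), len(demo_X_test))
```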


**Step 2**: Select a model to train the data. There are many types of models, and here are a few commonly used ones: Random Forest, SVM (Support Vector Machine), and KNN (K-Nearest Neighbors). Here, we will use Random Forest as an example.

In [111]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)


**Step 3**: Use the trained model to predict the data in the test set and compare the predictions with the actual labels in the test set. There are many ways to compare, with accuracy being one of the most common. Accuracy is the fraction of predictions that match the actual labels. Other metrics include the F1 score and the confusion matrix. For this assignment, we will use accuracy to measure the model's performance.

In [112]:
y_pred = rf.predict(X_test)

In [113]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

f1 = f1_score(y_test, y_pred, average=None)
print("F1 score:", f1)

Accuracy: 0.815068493150685
F1 score: [0.21052632 0.12903226 0.95057034 0.66666667 0.80916031]
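
Accuracy alone can hide weak performance on rare classes, which is why the per-class F1 scores above vary so much. The sketch below computes accuracy, a confusion matrix, and a macro-averaged F1 score on a tiny made-up example. The label lists are hypothetical, and `zero_division=0` is set so that a class that is never predicted scores 0 without a warning.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels.
toy_true = ["rain", "rain", "sun", "sun", "fog", "sun"]
toy_pred = ["rain", "sun", "sun", "sun", "sun", "sun"]
toy_labels = ["fog", "rain", "sun"]

# Fraction of exact matches: 4 of 6.
toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, in toy_labels order.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)

# Unweighted mean of the per-class F1 scores.
toy_macro_f1 = f1_score(toy_true, toy_pred, labels=toy_labels,
                        average="macro", zero_division=0)

print("Accuracy:", toy_accuracy)
print(toy_cm)
print("Macro F1:", toy_macro_f1)
```

Here "fog" is never predicted, so its F1 score is 0 and it pulls the macro average well below the accuracy.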


#### <font color='#008DFF'>  **Question 3**

Please choose a model other than the Random Forest classifier for training and predicting the data, and report the model's final accuracy. <br><br>

<b>Expected Output: </b> The value of accuracy is not required to be the same <br><code>print("My accuracy:", my_accuracy) <br>My accuracy: 0.8082191780821918<br>
</code>

The Support Vector Classifier (SVC), the classification variant of the Support Vector Machine (SVM), was selected for predicting the weather type. The SVC was fit on the same training data as the Random Forest Classifier and then used to predict the weather type for the test data. The accuracy of the SVC was lower than the accuracy achieved by the Random Forest Classifier.

In [114]:
# Your code here. Add more cells if needed
svm = SVC()
svm.fit(X_train, y_train)

svm_y_pred = svm.predict(X_test)

my_accuracy = accuracy_score(y_test, svm_y_pred)
print("My accuracy:", my_accuracy)

My accuracy: 0.7671232876712328


The F1 score for the SVC is also shown.

In [115]:
my_f1 = f1_score(y_test, svm_y_pred, average=None)
print("F1 score:", my_f1)

F1 score: [0.         0.         0.864      0.         0.78644068]
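
An SVC with the default RBF kernel is sensitive to the scale of its input features. A possible refinement, not part of the solution above, is to standardize the features inside a pipeline. The sketch below uses synthetic data from <code>make_classification</code>; all `demo_` names are hypothetical, and whether scaling helps on the weather data has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the weather features.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, train_size=0.8, random_state=0)

# The scaler is fit on the training data only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(demo_X_train, demo_y_train)

scaled_accuracy = scaled_svc.score(demo_X_test, demo_y_test)
print("Accuracy with scaling:", scaled_accuracy)
```

Keeping the scaler inside the pipeline means the test data never influences the scaling statistics.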


In [116]:
# Grader's area


# M[2,1] =

max_score_M[2,1] = 5

### **Part 3: Hyperparameter Tuning**
Next, we need to tune the model to make it more suitable for predicting the weather. Below are some parameters that can be adjusted, and more parameters can be found on the official website.

**Official link:** https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


*   n_estimators : The number of trees in the forest
*   max_depth : The maximum depth of the tree
*   max_features : The number of features to consider when looking for the best split
*   min_samples_leaf : The minimum number of samples required to be at a leaf node
*   min_samples_split : The minimum number of samples required to split an internal node








We use an exhaustive search (grid search) that tries every combination of the candidate parameter values and selects the combination that performs best. Because the number of models to train is the product of the number of candidate values for each parameter, this quickly becomes very time-consuming. Here, as an example, we adjust only one parameter and try three different values.
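
To make the cost concrete, the sketch below counts the combinations in an illustrative five-parameter grid using <code>ParameterGrid</code>. The grid values are assumptions chosen to be valid for a random forest (for example, `min_samples_split` must be at least 2 when given as an integer), and 5 is the default number of cross-validation folds in <code>GridSearchCV</code>.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 * 3 * 4 * 3 * 3 candidate combinations.
full_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5, 10],
    "max_features": [1, 3, 5, 7],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3, 4],
}

n_candidates = len(ParameterGrid(full_grid))
n_folds = 5  # GridSearchCV default
n_fits = n_candidates * n_folds

print("Candidate combinations:", n_candidates)
print("Model fits with 5-fold cross-validation:", n_fits)
```

By comparison, the single-parameter grid with three values needs only 3 × 5 = 15 fits.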

In [117]:
from sklearn.model_selection import GridSearchCV


# example_param_grid = {
#     'n_estimators':[10,50,100],
#     'max_depth':[3,5,10],
#     'max_features':[1,3,5,7],
#     'min_samples_leaf':[1,2,3],
#     'min_samples_split':[2,3,4]
# }

param_grid = {
    'n_estimators':[10,50,100],
}


grid_search = GridSearchCV(RandomForestClassifier(),
                           param_grid=param_grid, scoring='accuracy')

grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_)

RandomForestClassifier(n_estimators=50)


In [118]:
y_pred_optimal = grid_search.predict(X_test)

accuracy_optimal = accuracy_score(y_test, y_pred_optimal)
print("Accuracy:", accuracy_optimal)
print("Whether our model improves: ", (accuracy_optimal - accuracy) > 0)

Accuracy: 0.8253424657534246
Whether our model improves:  True
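
The comparison above rests on a single test split, and the forests are created without a fixed `random_state`, so the result can change from run to run. A fitted <code>GridSearchCV</code> object also exposes cross-validated scores, which give a steadier picture. The sketch below shows how to read them; it uses the iris dataset and a fixed seed purely for illustration, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

demo_X, demo_y = load_iris(return_X_y=True)

# random_state is fixed here so that repeated runs give the same scores.
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 100]},
    scoring="accuracy",
)
demo_search.fit(demo_X, demo_y)

# Best parameter setting and its mean cross-validated accuracy.
print(demo_search.best_params_)
print(demo_search.best_score_)

# Mean cross-validated accuracy for every candidate that was tried.
results = demo_search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(mean_score), 3))
```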


#### <font color='#008DFF'>  **Question 4**

Please tune the model you trained and tested in question 3, and improve its accuracy. <br><br>

<b>Expected Output:</b> The value of accuracy is not required to be the same <br><code>print("My accuracy:", my_accuracy_optimal)<br>My accuracy: 0.8116438356164384 <br>print("Whether our model improves: ", (my_accuracy_optimal - my_accuracy) > 0)
<br>
Whether our model improves:  True
</code>

Grid Search is used to explore a group of C values and kernel functions for the SVC to attempt to improve its accuracy. This section takes about 1 minute to run while it trains and explores different models' performance using the parameters provided. The best model is then used to predict the weather type, and the accuracy is shown. Using the parameters found with Grid Search, the SVC was able to improve its accuracy significantly, even outperforming the Random Forest Classifier.

In [None]:
# Your code here. Add more cells if needed

# This section takes about 1 minute to run due to Grid Search

svm_param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "kernel": ["linear", "poly", "rbf", "sigmoid"]
}

svm_grid_search = GridSearchCV(SVC(), param_grid=svm_param_grid, scoring="accuracy")

svm_grid_search.fit(X_train, y_train)

svm_y_pred_optimal = svm_grid_search.predict(X_test)

my_accuracy_optimal = accuracy_score(y_test, svm_y_pred_optimal)

print("My accuracy:", my_accuracy_optimal)
print("Whether our model improves: ", (my_accuracy_optimal - my_accuracy) > 0)

My accuracy: 0.8424657534246576
Whether our model improves:  True


In [120]:
# Grader's area


# M[3,1] =

max_score_M[3,1] = 5

In [121]:
#Grader's area


rawScore = np.sum(M)
maxScore = np.sum(max_score_M)
score = rawScore*100/maxScore

In [122]:
print("raw score: ", rawScore, ", max raw score: ", maxScore, ". final score: ", score)

raw score:  0.0 , max raw score:  15.0 . final score:  0.0
