**Author:** Prince Appiah
    
**Date:** February 3, 2025
    
**Description:** This project was created as part of my final assignment for the Machine Learning with Python course on Coursera.

<h1 align="center"><font size="5"> Classification with Python</font></h1>

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="https://#Section_1">Instructions</a></li>
    <li><a href="https://#Section_2">About the Data</a></li>
    <li><a href="https://#Section_3">Importing Data </a></li>
    <li><a href="https://#Section_4">Data Preprocessing</a> </li>
    <li><a href="https://#Section_5">One Hot Encoding </a></li>
    <li><a href="https://#Section_6">Train and Test Data Split </a></li>
    <li><a href="https://#Section_7">Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</a></li>
    </ul>
</div>

<hr>

# Instructions

In this notebook, we use classification algorithms to build models from our training data and evaluate them on our test data with several evaluation metrics.
Specifically, we will use these algorithms:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, we compile the results of all the models into a report at the end.


# About the dataset

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**

All libraries required for this project are listed in the code cell below. The libraries already installed in my Jupyter environment are commented out. To install one, remove the `#` before `!pip`.
Note: If your environment doesn't support `!pip install`, use an alternative method to install the required package, such as a system-specific package manager or a manual installation.

In [1]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn
# !pip install scikit-learn

In [17]:
import numpy as np
import pandas as pd
import sklearn.metrics as metrics
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score, log_loss, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [9]:
df = pd.read_csv('Weather_Data .csv')

In [10]:
# Print first five observations
df.head()

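|     | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0   | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1   | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2   | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3   | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4   | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |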


### Data Preprocessing

#### One Hot Encoding

First, we perform one-hot encoding to convert the categorical variables into binary indicator columns.

In [11]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
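As a small illustration of what `get_dummies` produces, the sketch below encodes a made-up three-row frame (the `toy` data is hypothetical, not taken from the weather file). Each category becomes its own indicator column. In recent pandas versions these columns are boolean rather than integer, which is one reason the later `astype(float)` conversion is convenient.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    "Rainfall": [0.0, 5.2, 1.4],
    "WindGustDir": ["W", "E", "W"],
})

# Only the listed column is expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(data=toy, columns=["WindGustDir"])

print(list(encoded.columns))
# Cast to int so the output looks the same regardless of the pandas version.
print(encoded["WindGustDir_W"].astype(int).tolist())
```

The original `WindGustDir` column is dropped and replaced by one column per observed category, named `<column>_<category>`.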

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary one. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [12]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
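The `replace` call above scans every column of the frame. An alternative, sketched here on a hypothetical toy frame, is to map only the target column, which leaves the other columns untouched and makes the intent explicit. This is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the processed weather data.
toy = pd.DataFrame({
    "Rainfall": [0.0, 5.2, 1.4, 0.0],
    "RainTomorrow": ["No", "Yes", "Yes", "No"],
})

# Map the two labels to 0/1 on the target column only.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

print(toy["RainTomorrow"].tolist())  # [0, 1, 1, 0]
```

One caveat with `map`: any label missing from the dictionary becomes `NaN`, so it is worth checking for missing values afterwards.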

### Training Data and Test Data

Now we set our features (the X values) and our target variable Y. We first drop the 'Date' column and convert the remaining columns to floats.

In [13]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [14]:
df_sydney_processed = df_sydney_processed.astype(float)

In [15]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression

#### Step 1: We use the `train_test_split` function to split `features` and `Y` into training and test sets, with `test_size` set to `0.2` and `random_state` set to `10`.

#### Step 2: We create and train a Linear Regression model called `LinearReg` using the training data (`x_train`, `y_train`).

#### Step 3: We use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.

#### Step 4: Using `predictions` and `y_test`, we calculate the value of each metric with the appropriate function.

#### Step 5: We show the MAE, MSE, and R2 of the linear model in a tabular format using a data frame.


In [16]:
#Step 1
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state = 10)
print('shape of x_train:', x_train.shape)
print('shape of x_test:',x_test.shape )
print('shape of y_train:',y_train.shape )
print('shape of y_test:',y_test.shape )

shape of x_train: (2616, 66)
shape of x_test: (655, 66)
shape of y_train: (2616,)
shape of y_test: (655,)
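The sizes above follow from the split rule: with 3,271 rows (2,616 + 655) and `test_size=0.2`, scikit-learn rounds the test set up to 655 rows and assigns the rest to training. The sketch below reproduces this on placeholder arrays (the `X_demo` and `y_demo` names and their contents are illustrative only) and checks that a fixed `random_state` yields the same split each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the weather frame.
n_rows = 3271
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2

# Two splits with identical settings.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# With the same random_state, the test rows are identical across calls.
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```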


In [18]:
#Step 2
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

In [19]:
#Step 3
predictions = LinearReg.predict(x_test)

In [20]:
#Step 4
LinearRegression_MAE = np.mean(np.absolute(predictions - y_test))
LinearRegression_MSE = np.mean((predictions - y_test) ** 2)
LinearRegression_R2 = r2_score(y_test, predictions)

In [21]:
#Step 5
Report = pd.DataFrame({
    "MAE": [LinearRegression_MAE],
    "MSE": [LinearRegression_MSE],
    "R2": [LinearRegression_R2]
})

Report

|     | MAE      | MSE     | R2       |
| --- | -------- | ------- | -------- |
| 0   | 0.256315 | 0.11572 | 0.427133 |
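The MAE and MSE in Step 4 are computed by hand with NumPy. As a sanity check, the sketch below shows on a small made-up example (the `y_true` and `y_pred` values are hypothetical) that those formulas agree with scikit-learn's `mean_absolute_error` and `mean_squared_error`, and spells out the R2 calculation as well.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical binary targets and continuous predictions from a linear model.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.4, 0.3])

# Manual formulas, written the same way as in Step 4.
mae_manual = np.mean(np.absolute(y_pred - y_true))
mse_manual = np.mean((y_pred - y_true) ** 2)

# R2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because the target is 0/1, the linear model's output is a continuous score rather than a class label, which is why it is evaluated with regression metrics instead of accuracy.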


---

### K-Nearest Neighbor (KNN)

#### Step 1: Create and train a KNN model called `KNN` using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.

#### Step 2: Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.

#### Step 3: Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.


In [24]:
#Step 1
KNN = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)

In [25]:
#Step 2
predictions = KNN.predict(x_test)

In [26]:
#Step 3
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)
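To make the three scores easier to compare, the sketch below derives them from confusion-matrix counts on a small made-up example (the labels are hypothetical). Accuracy counts true negatives as successes, while the Jaccard Index and F1 Score for the positive class ignore them, so on data where "no rain" is the common outcome accuracy can be noticeably higher than the other two.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score)

# Hypothetical labels: 3 rainy days (1) and 5 dry days (0).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
jaccard = tp / (tp + fp + fn)          # true negatives are not counted
f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall

print(accuracy, accuracy_score(y_true, y_pred))
print(jaccard, jaccard_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```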

---

### Decision Tree

#### Step 1: Create and train a Decision Tree model called `Tree` using the training data (`x_train`, `y_train`).

#### Step 2: Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.

#### Step 3: Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.


In [27]:
#Step 1
Tree = DecisionTreeClassifier().fit(x_train, y_train)

In [28]:
#Step 2
predictions = Tree.predict(x_test)

In [29]:
#Step 3
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)
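`DecisionTreeClassifier()` is created above without a `random_state`, and scikit-learn permutes the features at each split, so the fitted tree (and its scores) can change slightly from run to run. A minimal sketch on synthetic data, assuming reproducibility is wanted, fixes the seed. The dataset and the seed value `0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the weather dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two trees built with the same seed follow the same sequence of splits.
tree_a = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

identical = np.array_equal(tree_a.predict(X_demo), tree_b.predict(X_demo))
print(identical, tree_a.get_depth(), tree_b.get_depth())
```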

---

### Logistic Regression

#### Step 1: Create and train a LogisticRegression model called `LR` using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.

#### Step 2: Use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save the results as two arrays, `predictions` and `predict_proba`.

#### Step 3: Using `predictions`, `predict_proba`, and `y_test`, calculate the value of each metric with the appropriate function.


In [30]:
#Step 1
LR = LogisticRegression(solver = 'liblinear').fit(x_train, y_train)

In [31]:
#Step 2
predictions = LR.predict(x_test)

predict_proba = LR.predict_proba(x_test)

In [32]:
#Step 3
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
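Log loss is the only metric here that uses the predicted probabilities rather than the predicted labels. The sketch below computes it by hand on three made-up predictions (the `proba` values are hypothetical) and compares the result with `log_loss`. It also computes the score of an uninformative model that always predicts 0.5, which equals ln 2 (about 0.693) and gives a rough reference point for reading a log loss value.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities [P(class 0), P(class 1)].
y_true = np.array([0, 1, 1])
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Probability assigned to the true class of each sample.
p_correct = proba[np.arange(len(y_true)), y_true]

# Log loss is the mean negative log of those probabilities.
manual = -np.mean(np.log(p_correct))
library = log_loss(y_true, proba)

# Reference point: a model that always outputs 0.5 for both classes.
baseline = log_loss(y_true, np.full((3, 2), 0.5))

print(manual, library, baseline)
```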

---

### Support Vector Machine (SVM)

#### Step 1: Create and train an SVM model called `SVM` using the training data (`x_train`, `y_train`).

#### Step 2: Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.

#### Step 3: Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.


In [33]:
#Step 1
# The C parameter in SVM controls the trade-off between maximizing the margin and minimizing misclassification.
# - Small C (e.g., C=0.1) allows a wider margin and tolerates some misclassified training points, which often generalizes better.
# - Large C (e.g., C=100) penalizes misclassified training points heavily, which narrows the margin and risks overfitting.
# Here, C=0.1 is chosen to prioritize generalization over training accuracy.

SVM = svm.SVC(kernel='linear', C=0.1).fit(x_train, y_train)
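One way to see the effect of `C` is to fit the same linear SVM with a small and a large value and compare the results. The sketch below does this on synthetic data from `make_classification`, not on the weather data, so the numbers it prints say nothing about this project. The dataset settings and the `results` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, slightly noisy two-class data (flip_y adds label noise).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

results = {}
for c in (0.1, 100):
    model = SVC(kernel="linear", C=c).fit(X_tr, y_tr)
    results[c] = {
        # Points on or inside the margin; a wider margin usually means more.
        "n_support_vectors": int(model.n_support_.sum()),
        "train_acc": model.score(X_tr, y_tr),
        "test_acc": model.score(X_te, y_te),
    }

for c, row in results.items():
    print(c, row)
```

Comparing train and test accuracy across the two settings is a quick check for overfitting; a more careful choice of `C` would use cross-validation.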

In [34]:
#Step 2
predictions = SVM.predict(x_test)

In [35]:
#Step 3
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

---

### Report

#### Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [36]:
Report = pd.DataFrame({
    "Model": ["KNN", "Decision Tree", "Logistic Regression", "SVM"],
    "Accuracy": [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    "Jaccard Index": [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    "F1 Score": [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    "Log Loss": ['N/A', 'N/A', LR_Log_Loss, 'N/A']  # Only Logistic Regression has Log Loss
})

Report

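|     | Model               | Accuracy | Jaccard Index | F1 Score | Log Loss |
| --- | ------------------- | -------- | ------------- | -------- | -------- |
| 0   | KNN                 | 0.818321 | 0.425121      | 0.59661  | N/A      |
| 1   | Decision Tree       | 0.757252 | 0.4           | 0.571429 | N/A      |
| 2   | Logistic Regression | 0.839695 | 0.522727      | 0.686567 | 0.356595 |
| 3   | SVM                 | 0.842748 | 0.511848      | 0.677116 | N/A      |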
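Storing the string `'N/A'` next to a float makes the "Log Loss" column an object column, which gets in the way of sorting or rounding. A possible alternative, sketched below with the scores copied from the report above, is to use `np.nan` for the missing entries so that every metric column stays numeric. The lowercase `report` name is chosen here to avoid clashing with the notebook's `Report`.

```python
import numpy as np
import pandas as pd

# Scores copied from the report table above.
report = pd.DataFrame({
    "Model": ["KNN", "Decision Tree", "Logistic Regression", "SVM"],
    "Accuracy": [0.818321, 0.757252, 0.839695, 0.842748],
    "Jaccard Index": [0.425121, 0.4, 0.522727, 0.511848],
    "F1 Score": [0.59661, 0.571429, 0.686567, 0.677116],
    "Log Loss": [np.nan, np.nan, 0.356595, np.nan],  # NaN where not computed
}).set_index("Model")

# With numeric columns, the best model per metric is a single call.
best = report[["Accuracy", "Jaccard Index", "F1 Score"]].idxmax()

print(report.round(3))
print(best)
```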


### **Summary of Classification Model Performance**

Four machine learning models were evaluated: **K-Nearest Neighbors (KNN), Decision Tree, Logistic Regression, and Support Vector Machine (SVM)**. The key metrics analyzed include **Accuracy, Jaccard Index, F1 Score, and Log Loss** (for Logistic Regression).

- **SVM achieved the highest accuracy (0.843)**, followed closely by **Logistic Regression (0.840)**; on the 655-row test set this gap amounts to two predictions.
- **Logistic Regression had the highest Jaccard Index (0.523) and F1 Score (0.687)**, indicating strong overall performance.
- **KNN performed well (0.818 accuracy) but had a lower Jaccard Index (0.425) and F1 Score (0.597)**.
- **Decision Tree had the lowest accuracy (0.757), Jaccard Index (0.400), and F1 Score (0.571)**, suggesting weaker generalization.
- **Log Loss (0.357) was only computed for Logistic Regression.** It is well below the ln 2 ≈ 0.693 that a constant 0.5 prediction would score, so the predicted probabilities carry useful information.

Overall, **SVM and Logistic Regression** showed the best classification performance, while the **Decision Tree was the weakest performer** in this comparison.
