
<h1 align="center"><font size="5">Final Project: Classification with Python</font></h1>

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="https://#Section_1">Instructions</a></li>
    <li><a href="https://#Section_2">About the Data</a></li>
    <li><a href="https://#Section_3">Importing Data </a></li>
    <li><a href="https://#Section_4">Data Preprocessing</a> </li>
    <li><a href="https://#Section_5">One Hot Encoding </a></li>
    <li><a href="https://#Section_6">Train and Test Data Split </a></li>
    <li><a href="https://#Section_7">Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</a></li>
    </ul>
</div>


<hr>


# Instructions


In this notebook, we will practice the classification algorithms that we have learned in this course.


Below, we use these algorithms to build models from our training data and evaluate them on our testing data using the evaluation metrics learned in the course.

We will use the following algorithms taught in the course:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
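
As a rough illustration of the first three metrics in this list, the sketch below computes accuracy, the Jaccard index, and the F1-score from confusion counts for a binary target. The label lists and the helper name `binary_metrics` are made up for this example and are not part of the assignment.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, Jaccard index, F1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    union = tp + fp + fn
    # Both scores are reported as 0.0 here when there are no positives at all
    jaccard = tp / union if union else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 0.0
    return accuracy, jaccard, f1


# Hypothetical labels: 1 = rain tomorrow, 0 = no rain tomorrow
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy, jaccard, f1 = binary_metrics(y_true, y_pred)
print(accuracy, jaccard, f1)
```

Accuracy counts both classes, while the Jaccard index and F1-score look only at the positive class, so the three can diverge sharply when one class dominates.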




# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow', which were gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [74]:
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score, jaccard_score, f1_score, log_loss, mean_absolute_error, mean_squared_error, r2_score


### **Importing the Dataset**


In [75]:
# Importing the dataset
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


In [76]:
df.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0
mean,14.877102,23.005564,3.342158,5.175787,7.16897,41.476307,15.077041,19.294405,68.243962,54.698563,1018.334424,1016.003085,4.318557,4.176093,17.821461,21.543656
std,4.55471,4.483752,9.917746,2.757684,3.815966,10.806951,7.043825,7.453331,15.086127,16.279241,7.02009,7.019915,2.526923,2.411274,4.894316,4.297053
min,4.3,11.7,0.0,0.0,0.0,17.0,0.0,0.0,19.0,10.0,986.7,989.8,0.0,0.0,6.4,10.2
25%,11.0,19.6,0.0,3.2,4.25,35.0,11.0,15.0,58.0,44.0,1013.7,1011.3,2.0,2.0,13.8,18.4
50%,14.9,22.8,0.0,4.8,8.3,41.0,15.0,19.0,69.0,56.0,1018.6,1016.3,5.0,4.0,18.2,21.3
75%,18.8,26.0,1.4,7.0,10.2,44.0,20.0,24.0,80.0,64.0,1023.1,1020.8,7.0,7.0,21.7,24.5
max,27.6,45.8,119.4,18.4,13.6,96.0,54.0,57.0,100.0,99.0,1039.0,1036.7,9.0,8.0,36.5,44.7


In [77]:
df.shape

(3271, 22)

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

### **Data Preprocessing**


#### One Hot Encoding


First, we need to perform one-hot encoding to convert the categorical variables to binary variables.


In [79]:
# pd.get_dummies one-hot encodes the specified columns of the DataFrame
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [80]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
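
To make the two encoding steps concrete, here is a minimal sketch on a made-up three-row frame. The toy values are invented, and the sketch uses `Series.map` on the target column only, which is a swapped-in alternative to the DataFrame-wide `replace` call above rather than the notebook's own method.

```python
import pandas as pd

# Toy data with the same kind of columns as the weather dataset (values are made up)
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
    "RainTomorrow": ["Yes", "Yes", "No"],
})

# One-hot encode the categorical feature columns; each category becomes its own column
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

# Map the target to 0/1 in a single column instead of splitting it in two
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

# Cast everything to float so all columns share one numeric dtype
encoded = encoded.astype(float)
print(encoded)
```

Each encoded feature yields one column per category (for example `RainToday_No` and `RainToday_Yes`), while the target stays a single 0/1 column.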

### Training Data and Test Data


Now, we define our features (the x values) and our target variable Y.


In [81]:
df_sydney_processed.drop('Date',axis=1,inplace=True) #Drop the 'Date' column

In [82]:
df_sydney_processed = df_sydney_processed.astype(float)

In [83]:
features = df_sydney_processed.drop(columns='RainTomorrow')  # Define features and target
Y = df_sydney_processed['RainTomorrow']

### **Linear Regression**


In [84]:
# Split the features and Y dataframes with train_test_split,
# using a test_size of 0.2 and random_state set to 10
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

# Create and train the Linear Regression model
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

# Predict on test data
predictions = LinearReg.predict(x_test)

# Calculate metrics
LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 = r2_score(y_test, predictions)

# Show results
Report = pd.DataFrame({
    "Model": ["Linear Regression"],
    "MAE": [LinearRegression_MAE],
    "MSE": [LinearRegression_MSE],
    "R2": [LinearRegression_R2]
})
Report



Unnamed: 0,Model,MAE,MSE,R2
0,Linear Regression,0.256309,0.115719,0.427138
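
As a sanity check on the split used above, the sketch below applies `train_test_split` with the same `test_size` and `random_state` to placeholder arrays that have the notebook's 3271 rows. The arrays are stand-ins, not the weather data; only the row counts are meant to carry over.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3271  # row count reported by df.shape in this notebook

# Placeholder features and target with the right number of rows
X = np.arange(n_rows * 2).reshape(n_rows, 2)
y = np.arange(n_rows) % 2

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# The test fraction is rounded up, so 0.2 * 3271 = 654.2 becomes 655 test rows
print(x_train.shape, x_test.shape)
```

With 655 test rows, each test prediction moves the accuracy by about 0.0015, which is worth keeping in mind when comparing models whose scores differ only slightly.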


### KNN


In [85]:
# Create and train a KNN model called KNN on the training data (x_train, y_train)
# with the n_neighbors parameter set to 4
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train, y_train)

# Predict on test data
predictions = KNN.predict(x_test)

# Calculate metrics
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

# Add results to the report; metrics that do not apply (MAE, MSE, R2)
# are omitted here and filled with NaN by pd.concat
new_row = pd.DataFrame({
    "Model": ["KNN"],
    "Accuracy": [KNN_Accuracy_Score],
    "Jaccard Index": [KNN_JaccardIndex],
    "F1 Score": [KNN_F1_Score]
})

Report = pd.concat([Report, new_row],ignore_index=True)
Report


Unnamed: 0,Model,MAE,MSE,R2,Accuracy,Jaccard Index,F1 Score
0,Linear Regression,0.256309,0.115719,0.427138,,,
1,KNN,,,,0.818321,0.425121,0.59661
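
The blank cells in the report come from how `pd.concat` aligns columns: a column missing from one frame is filled with NaN for that frame's rows. A minimal sketch with two made-up rows (the model names and numbers are placeholders, not results from this notebook):

```python
import pandas as pd

# Two one-row frames that share only the "Model" column (values are placeholders)
regression_row = pd.DataFrame({"Model": ["A"], "MAE": [0.25]})
classifier_row = pd.DataFrame({"Model": ["B"], "Accuracy": [0.8]})

# Columns are unioned; cells with no value become NaN
combined = pd.concat([regression_row, classifier_row], ignore_index=True)
print(combined)
```

Leaving non-applicable metrics out of a row keeps the metric columns numeric, whereas writing a string such as "N/A" would turn them into object columns.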


### Decision Tree


In [86]:
# Create and train the Decision Tree model
Tree = DecisionTreeClassifier()
Tree.fit(x_train, y_train)

# Predict on test data
predictions = Tree.predict(x_test)

# Calculate metrics
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

# Add results to the report; metrics that do not apply (MAE, MSE, R2)
# are omitted here and filled with NaN by pd.concat
new_row = pd.DataFrame({
    "Model": ["Decision Tree"],
    "Accuracy": [Tree_Accuracy_Score],
    "Jaccard Index": [Tree_JaccardIndex],
    "F1 Score": [Tree_F1_Score]
})

Report = pd.concat([Report, new_row], ignore_index=True)
Report

Unnamed: 0,Model,MAE,MSE,R2,Accuracy,Jaccard Index,F1 Score
0,Linear Regression,0.256309,0.115719,0.427138,,,
1,KNN,,,,0.818321,0.425121,0.59661
2,Decision Tree,,,,0.752672,0.39777,0.569149


### Logistic Regression


In [87]:
# Split the features and Y dataframes again with train_test_split,
# using a test_size of 0.2 and random_state set to 1
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

# Create and train the Logistic Regression model
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

# Predict on test data
predictions = LR.predict(x_test)
predict_proba = LR.predict_proba(x_test)

#print(predict_proba)
#print(predictions)

# Calculate metrics
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)

# Add results to the report; metrics that do not apply (MAE, MSE, R2)
# are omitted here and filled with NaN by pd.concat
new_row = pd.DataFrame({
    "Model": ["Logistic Regression"],
    "Accuracy": [LR_Accuracy_Score],
    "Jaccard Index": [LR_JaccardIndex],
    "F1 Score": [LR_F1_Score],
    "Log Loss": [LR_Log_Loss]
})

Report = pd.concat([Report, new_row], ignore_index=True)
Report

Unnamed: 0,Model,MAE,MSE,R2,Accuracy,Jaccard Index,F1 Score,Log Loss
0,Linear Regression,0.256309,0.115719,0.427138,,,,
1,KNN,,,,0.818321,0.425121,0.59661,
2,Decision Tree,,,,0.752672,0.39777,0.569149,
3,Logistic Regression,,,,0.836641,0.509174,0.674772,0.381259
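
For reference, the log loss in the last column is the average negative log-probability assigned to the true class. The sketch below writes that formula out for a binary target, assuming `p_positive` plays the role of the second column of `predict_proba` (the probability of class 1). The labels, probabilities, and the helper name `binary_log_loss` are invented for illustration, and every probability is assumed to lie strictly between 0 and 1.

```python
import math


def binary_log_loss(y_true, p_positive):
    """Mean negative log-likelihood for 0/1 labels and P(class 1) predictions."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Only one of the two terms is non-zero for each row
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical labels and predicted probabilities of rain tomorrow
y_true = [1, 0, 1]
p_positive = [0.9, 0.2, 0.6]

loss = binary_log_loss(y_true, p_positive)
print(loss)
```

Lower is better: confident correct predictions contribute almost nothing, while confident wrong predictions are penalised heavily.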


### SVM


In [88]:
# Create and train the SVM model
# (reuses the random_state=1 split from the Logistic Regression cell)
SVM = svm.SVC()
SVM.fit(x_train, y_train)

# Predict on test data
predictions = SVM.predict(x_test)

# Calculate metrics
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

# Add results to the report; metrics that do not apply (MAE, MSE, R2)
# are omitted here and filled with NaN by pd.concat
new_row = pd.DataFrame({
    "Model": ["SVM"],
    "Accuracy": [SVM_Accuracy_Score],
    "Jaccard Index": [SVM_JaccardIndex],
    "F1 Score": [SVM_F1_Score]
})

Report = pd.concat([Report, new_row], ignore_index=True)
Report

Unnamed: 0,Model,MAE,MSE,R2,Accuracy,Jaccard Index,F1 Score,Log Loss
0,Linear Regression,0.256309,0.115719,0.427138,,,,
1,KNN,,,,0.818321,0.425121,0.59661,
2,Decision Tree,,,,0.752672,0.39777,0.569149,
3,Logistic Regression,,,,0.836641,0.509174,0.674772,0.381259
4,SVM,,,,0.722137,0.0,0.0,
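
A Jaccard index and F1 score of exactly 0 mean the SVM produced no true positives on the test set. One pattern consistent with that row is a classifier that predicts 'No' for every day. The sketch below reproduces this pattern with assumed class counts (473 'No' and 182 'Yes' days, chosen so that the accuracy matches 0.722137 on a 655-row test set); these counts are an inference from the reported scores, not something the notebook states.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Assumed test-set composition (inferred, not reported by the notebook)
y_true = [0] * 473 + [1] * 182

# A classifier that always predicts the majority class "No" (0)
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
jaccard = jaccard_score(y_true, y_pred)
# zero_division=0 suppresses the warning raised when nothing is predicted positive
f1 = f1_score(y_true, y_pred, zero_division=0)

print(round(accuracy, 6), jaccard, f1)
```

This is why accuracy alone can look acceptable on an imbalanced target while the positive-class metrics show that rainy days are never detected.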


### **Report**


#### Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using a data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [89]:
# Display the final report
Report


Unnamed: 0,Model,MAE,MSE,R2,Accuracy,Jaccard Index,F1 Score,Log Loss
0,Linear Regression,0.256309,0.115719,0.427138,,,,
1,KNN,,,,0.818321,0.425121,0.59661,
2,Decision Tree,,,,0.752672,0.39777,0.569149,
3,Logistic Regression,,,,0.836641,0.509174,0.674772,0.381259
4,SVM,,,,0.722137,0.0,0.0,
