# Classification Challenge


## Task

Your company, DS Pros, would like to win a contract with a big city council, as it would give us great PR. To do so, you think it would be a great idea to proactively browse the open data sets of this city (the one you choose, total freedom here), identify a situation that could be solved or improved using classification algorithms, and present it to the technical office of that city council.

You need to prepare the following:

- A presentation describing the problem you are trying to solve, how classification will solve it, and a summary of the proposed solution
- A well-documented and visually appealing notebook where you try different models, explain the steps followed, and choose one particular algorithm and its hyperparameters (explaining why)
- You should also export that model, once trained, using pickle or similar so it can be reused.
- You should implement a .py script that loads the exported model, accepts a file with samples to classify (each identified by an id), and stores the results in a database table (SQLite) with fields id and class.
- You should provide the files to test the .py script and clear instructions on how to run it.


Happy coding!!

## 0. Introduction & Proposal

Thank you very much for the possibility to present our project proposal to you. We are very excited about the opportunity to work with you and hope that you will find our proposal interesting and that we can work together on this project.

**The Problem:**
The city council of Seattle, Washington, is looking for a solution to predict the weather conditions for the upcoming days based on historical weather data. This would allow them to better plan and prepare for future weather conditions, such as rain, snow, or sunny weather. Especially in a city like Seattle, where the weather is often unpredictable, this would be a very useful tool for the city council to prepare for weather conditions in advance. The current system repeatedly fails to predict the weather correctly, which leads to a lot of frustration and wasted resources.

**The Solution**:
Our team has developed a cutting-edge weather prediction model that leverages historical weather data and advanced machine learning algorithms to provide highly accurate forecasts for the upcoming days. This solution will enable the city council of Seattle to make well-informed decisions and allocate resources more effectively, ultimately improving the city's preparedness for various weather conditions.

## 1. Key Features

1. Data-driven approach: Our model utilizes a vast amount of historical weather data, including temperature, humidity, wind speed, and precipitation levels, to identify patterns and trends that can help predict future weather conditions.

2. Advanced machine learning algorithms: We employ state-of-the-art machine learning techniques to analyze the data and generate accurate predictions. These algorithms continuously learn and improve their predictions as more data becomes available.

3. User-friendly visualization: Our solution includes an easy-to-understand visual representation of the predicted weather conditions, using color gradients to indicate the intensity of various weather elements. This allows the city council members to quickly grasp the forecast and make informed decisions.

4. Customizable and scalable: Our weather prediction model can be tailored to the specific needs of the city council and can be easily scaled up to cover larger geographical areas or extended timeframes.

## 2. Benefits for the Council

By implementing our weather prediction solution, the city council of Seattle will be able to:

1. Improve preparedness for various weather conditions, reducing the impact of adverse weather on the city's infrastructure and residents.

2. Optimize resource allocation, ensuring that the necessary resources are available and deployed efficiently during extreme weather events.

3. Enhance communication with the public, providing accurate and timely weather forecasts to help residents plan their activities and stay safe.

4. Save time and money by reducing the reliance on less accurate weather prediction methods and minimizing the consequences of incorrect forecasts.

We are confident that our weather prediction solution will greatly benefit the city council of Seattle and its residents. We look forward to discussing our proposal further and exploring the possibility of a fruitful collaboration. Thank you for considering our project proposal.

## 3. Setup and tool import

Throughout this project, various Python libraries are used for data analysis, data visualization, and machine learning. The following libraries are imported for this project:

1. `os`: The `os` module in Python provides functions for interacting with the operating system, such as file and directory management, environment variables, and process control.

2. `pickle`: The `pickle` module is used for serializing and deserializing Python objects, allowing you to save and load objects to and from files.

3. `pandas`: `pandas` is a powerful data manipulation library that provides data structures like DataFrames and Series for handling and analyzing data in a flexible and efficient way.

4. `seaborn`: `seaborn` is a data visualization library based on `matplotlib` that provides a high-level interface for creating informative and attractive statistical graphics.

5. `sklearn`: `scikit-learn` is a popular machine learning library that provides simple and efficient tools for data mining and data analysis, including various classification, regression, and clustering algorithms, as well as tools for model selection and preprocessing.

6. `numpy`: `numpy` is a fundamental library for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

We compare various classifiers from the `sklearn` library: `MLPClassifier`, `KNeighborsClassifier`, `SVC`, `GradientBoostingClassifier`, `DecisionTreeClassifier`, `ExtraTreesClassifier`, `RandomForestClassifier`, `AdaBoostClassifier`, `GaussianNB`, and `SGDClassifier`. `GaussianProcessClassifier` and `QuadraticDiscriminantAnalysis` are imported as well but are commented out in the comparison. Additionally, `train_test_split` is used to split the data, `LabelEncoder` and `StandardScaler` are used for preprocessing, and `GridSearchCV` is used for hyperparameter tuning.


In [123]:
import os
import pickle
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
import numpy as np

## 4. The Data
The Seattle weather dataset is a collection of historical weather data from Seattle, Washington, which can be used for the purpose of weather classification for upcoming days. The dataset contains information about various weather attributes, such as temperature, precipitation, and weather conditions (e.g., sunny, cloudy, rainy, etc.) for each day.

The main goal of using this dataset is to train a machine learning model to predict the weather conditions for the upcoming days based on the historical data. By analyzing patterns and trends in the past weather data, the models can learn to recognize the relationships between different weather attributes and make accurate predictions for future weather conditions.

To achieve this, the dataset is preprocessed and transformed into a suitable format for machine learning algorithms. This may involve handling missing values, converting date-time information into separate features (e.g., day, month, and year), and encoding categorical variables (e.g., weather conditions) into numerical values.
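Before preprocessing, it can help to check for missing values and to look at how often each weather class occurs. The following is a minimal sketch of such a check; the `sample` frame and the `summarize` helper are hypothetical stand-ins (they are not part of the project code), and the missing precipitation value was inserted only for illustration.

```python
import pandas as pd

# Hypothetical mini-sample with the same columns as seattle-weather.csv.
# The None in 'precipitation' is an artificial gap added for illustration.
sample = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"],
    "precipitation": [0.0, 10.9, None, 20.3],
    "temp_max": [12.8, 10.6, 11.7, 12.2],
    "temp_min": [5.0, 2.8, 7.2, 5.6],
    "wind": [4.7, 4.5, 2.3, 4.7],
    "weather": ["drizzle", "rain", "rain", "rain"],
})


def summarize(df: pd.DataFrame) -> dict:
    """Return missing-value counts per column and the frequency of each class."""
    return {
        "missing": df.isna().sum().to_dict(),
        "class_counts": df["weather"].value_counts().to_dict(),
    }


summary = summarize(sample)
print(summary["missing"])
print(summary["class_counts"])
```

A strongly skewed class distribution would suggest looking beyond plain accuracy when comparing models later on.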

In [124]:
# Load Data
df = pd.read_csv('./data/seattle-weather.csv')
df.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain


## 5. Data Preprocessing

1. Define a function `date_time(df)` to preprocess the date column:
   - Convert the 'date' column to a datetime object using `pd.to_datetime()`.
   - Extract the year, month, and day from the 'date' column and create new columns for each of them in the DataFrame.

2. Call the `date_time(df)` function and pass the DataFrame to it. This will preprocess the date column and return the modified DataFrame.

3. Use LabelEncoder to convert the text values in the 'weather' column into numeric values (e.g., 0, 1, 2, etc.; the encoder numbers the sorted class names starting at 0). Store the encoded values in a new column called 'weather_label'.

4. Create a dictionary called `weather_dict` to map the encoded weather labels back to their original text values. Save this dictionary to disk as a pickle file for future use.

5. Drop the original 'weather' and 'date' columns from the DataFrame, as they are no longer needed.

6. Split the DataFrame into input features (X) and target labels (Y). The input features are all columns except 'weather_label', and the target labels are the 'weather_label' column.

7. Check the shapes of X and Y to ensure they have the correct dimensions.

8. Split the data into training and testing sets using `train_test_split()`. The test set size is 20% of the total data.

9. Scale the input features (X_train and X_test) using StandardScaler. This step is important to ensure that all features have the same scale, which can improve the performance of machine learning algorithms.

10. Check the shapes of the training and testing sets to ensure they have the correct dimensions.

In [87]:
# Preprocess date column
def date_time(df):
    """Parse 'date' and add 'year', 'month', and 'day' columns."""
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year    # Generate year column
    df['month'] = df['date'].dt.month  # Generate month column
    df['day'] = df['date'].dt.day      # Generate day column

    return df

df = date_time(df)
df.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather,year,month,day
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle,2012,1,1
1,2012-01-02,10.9,10.6,2.8,4.5,rain,2012,1,2
2,2012-01-03,0.8,11.7,7.2,2.3,rain,2012,1,3
3,2012-01-04,20.3,12.2,5.6,4.7,rain,2012,1,4
4,2012-01-05,1.3,8.9,2.8,6.1,rain,2012,1,5


In [128]:
#Turn our Text Values into numeric Values ( e.g. 1,2,3,etc )
le = LabelEncoder()
df['weather_label'] = le.fit_transform(df['weather'])
weather_dict = dict(zip(df['weather_label'], df['weather']))

# Save weather dictionary to disk
with open('weather_dict.pickle', 'wb') as handle:
    pickle.dump(weather_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

df = df.drop(['weather', "date"], axis=1)

X = df.drop(['weather_label'], axis = 1)
Y = df['weather_label']
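As an alternative way to build the label mapping, the fitted encoder itself can be used: its `classes_` attribute holds the sorted class names, and position `i` is the text for code `i`. The sketch below assumes a small hypothetical label list rather than the real 'weather' column, and the names `encoder` and `label_to_name` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of text labels (stand-in for df['weather'])
labels = ["drizzle", "rain", "rain", "sun", "rain"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted alphabetically; index i is the name for numeric code i
label_to_name = {int(i): str(name) for i, name in enumerate(encoder.classes_)}

# inverse_transform maps numeric codes back to the original text labels
decoded = encoder.inverse_transform(encoded)

print(label_to_name)
print(list(decoded))
```

Building the dictionary from `classes_` does not depend on every class appearing in a particular row order, and it yields plain Python types that pickle cleanly.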

In [129]:
X

Unnamed: 0,precipitation,temp_max,temp_min,wind
0,0.0,12.8,5.0,4.7
1,10.9,10.6,2.8,4.5
2,0.8,11.7,7.2,2.3
3,20.3,12.2,5.6,4.7
4,1.3,8.9,2.8,6.1
...,...,...,...,...
1456,8.6,4.4,1.7,2.9
1457,1.5,5.0,1.7,1.3
1458,0.0,7.2,0.6,2.6
1459,0.0,5.6,-1.0,3.4


In [130]:
X.shape

(1461, 4)

In [131]:
Y.shape

(1461,)

In [132]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
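The split above draws a different random partition on every run, so the scores reported later will vary between executions. A hedged sketch of a reproducible variant follows; it assumes synthetic data (`X_demo`, `y_demo`) and an arbitrary seed of 42, and it additionally uses `stratify` so that both parts keep the same class proportions. Neither option is used in the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 80 samples of class 0, 20 of class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the 80/20 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Repeating the call with the same seed reproduces the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(int(y_te.sum()), "minority samples in the test set")
```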

In [133]:
X_train

Unnamed: 0,precipitation,temp_max,temp_min,wind
6,0.0,7.2,2.8,2.3
1267,0.0,25.6,13.9,3.4
1389,0.0,16.1,8.3,1.3
499,0.0,18.3,7.8,2.4
689,0.0,7.8,1.7,4.3
...,...,...,...,...
318,0.8,11.1,5.0,2.6
1184,1.8,17.8,10.6,2.9
1253,0.0,31.1,15.6,3.2
276,0.0,18.9,7.8,7.3


In [134]:
# Scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
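The scaler is fitted on the training data only and then reused for the test data, so no information from the test set leaks into preprocessing. The toy example below illustrates this under the assumption of two small hand-made arrays (`train` and `test` are hypothetical, not project data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[2.0, 20.0], [6.0, 40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

print(toy_scaler.mean_)   # per-feature mean of the training data
print(test_scaled)        # the first test row equals the training mean, so it maps to zeros
```

Because the test rows are scaled with the training statistics, their scaled values need not have zero mean or unit variance.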

In [135]:
X_train

array([[-0.45805889, -1.27934256, -1.08287842, -0.67017992],
       [-0.45805889,  1.27004239,  1.1394444 ,  0.10477027],
       [-0.45805889, -0.04621614,  0.01827252, -1.3746801 ],
       ...,
       [-0.45805889,  2.03208681,  1.47980014, -0.03612976],
       [-0.45805889,  0.34173374, -0.08183211,  2.85232096],
       [-0.45805889,  0.79896039,  1.35967459, -0.10657978]])

In [136]:
X_train.shape, Y_train.shape

((1168, 4), (1168,))

In [137]:
X_test.shape, Y_test.shape

((293, 4), (293,))

## 6. The Models
Each model uses a different algorithm to learn from the data and make predictions. Here's a brief description of each model:

1. **Nearest Neighbors**: This model classifies data points based on their similarity to their nearest neighbors in the training data. It's a simple and intuitive method for classification tasks.

2. **Linear SVM**: Linear Support Vector Machine (SVM) is a model that finds the best linear boundary (a straight line in 2D, a plane in 3D, etc.) that separates different classes of data points. It's effective for linearly separable data.

3. **Polynomial SVM**: This is an extension of the Linear SVM that uses a polynomial function to transform the data into a higher-dimensional space, allowing for more complex decision boundaries.

4. **RBF SVM**: Radial Basis Function (RBF) SVM is another extension of the Linear SVM that uses a non-linear kernel function to transform the data, enabling the model to capture more complex patterns in the data.

5. **Gradient Boosting**: This model combines multiple weak learners (usually decision trees) to create a strong learner. It iteratively improves the model by focusing on the errors made by the previous learners.

6. **Decision Tree**: This model uses a tree-like structure to make decisions based on the input data. It's easy to understand and interpret, but can be prone to overfitting.

7. **Extra Trees**: Extra Trees (extremely randomized trees) is an ensemble method that builds multiple decision trees and combines their predictions. It picks split thresholds at random instead of searching for the best one, which makes it more robust and less prone to overfitting than a single decision tree.

8. **Random Forest**: This model is another ensemble method that builds multiple decision trees and combines their predictions. Each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split, which makes the ensemble more robust and accurate than a single tree.

9. **Neural Net**: This model is inspired by the human brain and consists of interconnected nodes (neurons) that learn to make predictions based on the input data. It's highly flexible and can capture complex patterns in the data.

10. **AdaBoost**: AdaBoost is an ensemble method that combines multiple weak learners (usually decision trees) to create a strong learner. It focuses on the errors made by the previous learners and assigns more weight to the misclassified data points.

11. **Naive Bayes**: This model is based on the Bayes' theorem and assumes that the features in the data are independent. It's simple, fast, and works well with small datasets.

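12. **SGD**: Stochastic Gradient Descent (SGD) is an optimization algorithm used to train various types of models, including linear classifiers. Here it trains a linear classifier with hinge loss and an L2 penalty, which corresponds to a linear SVM. It's efficient and works well with large datasets.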

These models can be used individually or combined to create a more accurate and robust prediction system.

In [102]:
# Defining the  classifiers
names = ["Nearest_Neighbors", "Linear_SVM", "Polynomial_SVM", "RBF_SVM",
         "Gradient_Boosting", "Decision_Tree", "Extra_Trees", "Random_Forest", "Neural_Net", "AdaBoost",
         "Naive_Bayes", "SGD"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(kernel="poly", degree=3, C=0.025),
    SVC(kernel="rbf", C=1, gamma=2),
    #GaussianProcessClassifier(1.0 * RBF(1.0)),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0),
    DecisionTreeClassifier(max_depth=5),
    ExtraTreesClassifier(n_estimators=10, min_samples_split=2),
    RandomForestClassifier(max_depth=5, n_estimators=100),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(n_estimators=100),
    GaussianNB(),
    #QuadraticDiscriminantAnalysis(),
    SGDClassifier(loss="hinge", penalty="l2")]

In [103]:
scores = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)
    scores.append(score)
    print(f"""{name}: {score}""")

Nearest_Neighbors: 0.6484641638225256
Linear_SVM: 0.757679180887372
Polynomial_SVM: 0.6416382252559727
RBF_SVM: 0.7235494880546075
Gradient_Boosting: 0.8191126279863481
Decision_Tree: 0.825938566552901
Extra_Trees: 0.8225255972696246
Random_Forest: 0.8600682593856656
Neural_Net: 0.8054607508532423
AdaBoost: 0.6177474402730375
Naive_Bayes: 0.8225255972696246
SGD: 0.78839590443686


In [138]:
df = pd.DataFrame()
df['name'] = names
df['score'] = scores

# Sort and reset index
df = df.sort_values(by='score', ascending=False).reset_index(drop=True)
df

Unnamed: 0,name,score
0,Random_Forest,0.860068
1,Decision_Tree,0.825939
2,Extra_Trees,0.822526
3,Naive_Bayes,0.822526
4,Gradient_Boosting,0.819113
5,Neural_Net,0.805461
6,SGD,0.788396
7,Linear_SVM,0.757679
8,RBF_SVM,0.723549
9,Nearest_Neighbors,0.648464


In [139]:
# Highlight the scores with a green color gradient
cm = sns.light_palette("green", as_cmap=True)
results = df.style.background_gradient(cmap=cm)
results

Unnamed: 0,name,score
0,Random_Forest,0.860068
1,Decision_Tree,0.825939
2,Extra_Trees,0.822526
3,Naive_Bayes,0.822526
4,Gradient_Boosting,0.819113
5,Neural_Net,0.805461
6,SGD,0.788396
7,Linear_SVM,0.757679
8,RBF_SVM,0.723549
9,Nearest_Neighbors,0.648464


The results point towards using the Random Forest Classifier, as it has the highest accuracy score on the test set (0.86). We therefore tune its hyperparameters next.
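Since the ranking above comes from a single train/test split, the differences between closely scoring models may partly reflect the random split. One way to check this, sketched here under the assumption of synthetic data from `make_classification` and only two of the candidate models, is k-fold cross-validation with `cross_val_score`. The names `candidates` and `cv_summary` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class stand-in for the weather data (4 features, like X above)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

candidates = {
    "Random_Forest": RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0),
    "Naive_Bayes": GaussianNB(),
}

# Mean and standard deviation of the accuracy across 5 folds for each model
cv_summary = {}
for name, clf in candidates.items():
    fold_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two models, the single-split ranking between them should be read with caution.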

In [106]:
max_features_range = np.arange(1,7,1)
n_estimators_range = np.arange(10,500,10)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)

rf = RandomForestClassifier()

grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10)

In [107]:
grid.fit(X_train, Y_train)

In [108]:
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'max_features': 3, 'n_estimators': 40} with a score of 0.87


In [109]:
grid_results = pd.concat([pd.DataFrame(grid.cv_results_["params"]),pd.DataFrame(grid.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
grid_results.head()

Unnamed: 0,max_features,n_estimators,Accuracy
0,1,10,0.837305
1,1,20,0.851894
2,1,30,0.852756
3,1,40,0.856182
4,1,50,0.863867


In [110]:
grid_contour = grid_results.groupby(['max_features','n_estimators']).mean()
grid_contour

max_features,n_estimators,Accuracy
1,10,0.837305
1,20,0.851894
1,30,0.852756
1,40,0.856182
1,50,0.863867
...,...,...
6,450,0.857000
6,460,0.856130
6,470,0.856138
6,480,0.857854


In [111]:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_features', 'n_estimators', 'Accuracy']
grid_pivot = grid_reset.pivot(index='max_features', columns='n_estimators', values='Accuracy')

grid_pivot


max_features \ n_estimators,10,20,30,40,50,60,70,80,90,100,...,400,410,420,430,440,450,460,470,480,490
1,0.837305,0.851894,0.852756,0.856182,0.863867,0.85529,0.858753,0.855312,0.853625,0.853603,...,0.856167,0.855312,0.857877,0.857869,0.857877,0.855305,0.858731,0.85445,0.85445,0.855305
2,0.846729,0.862165,0.860455,0.854436,0.857884,0.862135,0.85616,0.860419,0.861295,0.857854,...,0.860433,0.863005,0.857,0.861288,0.859579,0.863867,0.863005,0.864707,0.862997,0.858724
3,0.849315,0.852726,0.854421,0.865561,0.861273,0.856985,0.858709,0.859579,0.86215,0.861288,...,0.862997,0.861273,0.862135,0.853574,0.859571,0.862997,0.862997,0.862135,0.861281,0.862135
4,0.845027,0.856145,0.855268,0.853574,0.853581,0.861281,0.855298,0.860426,0.853574,0.858694,...,0.859556,0.859564,0.857,0.859564,0.861281,0.856992,0.859571,0.858702,0.862128,0.861281
5,0.850147,0.851017,0.851009,0.854413,0.851857,0.851857,0.858709,0.856152,0.855283,0.852726,...,0.861273,0.858694,0.856992,0.857847,0.857854,0.860419,0.856992,0.857847,0.857854,0.859564
6,0.837297,0.851857,0.847598,0.846729,0.853566,0.851002,0.853574,0.857007,0.848445,0.856152,...,0.854421,0.858702,0.855283,0.856985,0.856992,0.857,0.85613,0.856138,0.857854,0.855268


In [112]:
x = grid_pivot.columns.values
y = grid_pivot.index.values
z = grid_pivot.values

## 7. Evaluation

In [113]:
import plotly.graph_objects as go

# X and Y axes labels
layout = go.Layout(
            xaxis=go.layout.XAxis(
              title=go.layout.xaxis.Title(
              text='n_estimators')
             ),
             yaxis=go.layout.YAxis(
              title=go.layout.yaxis.Title(
              text='max_features')
            ) )

fig = go.Figure(data = [go.Contour(z=z, x=x, y=y)], layout=layout )

fig.update_layout(title='Hyperparameter tuning', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

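# Create the output folder, including the 'plotly' subfolder, if it doesn't exist
os.makedirs('images/plotly', exist_ok=True)

# Save the plot (static image export requires the 'kaleido' package)
fig.write_image('images/plotly/contour_plot.png')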

The best parameter combinations appear as the brightest yellow regions of the plot.

In [140]:
import plotly.graph_objects as go


fig = go.Figure(data= [go.Surface(z=z, y=y, x=x)], layout=layout )
fig.update_layout(title='Hyperparameter tuning',
                  scene = dict(
                    xaxis_title='n_estimators',
                    yaxis_title='max_features',
                    zaxis_title='Accuracy'),
                  autosize=False,
                  width=800, height=800,
                  margin=dict(l=65, r=50, b=65, t=90))
fig.show()

We will use a model with 40 estimators and 3 max features, because the cross-validated accuracy peaks at this combination.
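The grid search score is a cross-validated mean computed on the training data, so a final check on the held-out test set is a useful complement. The following sketch assumes synthetic data and a fixed seed; it only illustrates how accuracy and a confusion matrix could be computed for a forest with the chosen settings, and it does not reproduce the project's actual numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the weather data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Same hyperparameters as selected above; random_state is an added assumption
model = RandomForestClassifier(max_features=3, n_estimators=40, random_state=0)
model.fit(X_tr, y_tr)

y_pred = model.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_te, y_pred)

print(f"Test accuracy: {test_accuracy:.3f}")
print(cm_demo)
```

The confusion matrix shows which classes are confused with each other, which a single accuracy value hides.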

## 8. Saving the best model
We train the final model with the selected hyperparameters and save it to disk, so it can be reused for weather prediction without retraining.

In [115]:
# Building the model
rf_best = RandomForestClassifier(max_features=3, n_estimators=40)
rf_best.fit(X_train, Y_train)

In [120]:
# Create directory models if it doesn't exist
if not os.path.exists('models'):
    os.makedirs('models')

In [121]:
# Saving rf_best (the context manager closes the file handle afterwards)
with open('models/best_model.sav', 'wb') as handle:
    pickle.dump(rf_best, handle)
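The model was trained on scaled features, so any script that loads it later needs the same fitted scaler. One possible approach, shown here as a sketch and not as what the project does, is to bundle the scaler and the classifier in a scikit-learn `Pipeline` and pickle them together. The synthetic data, the fixed seed, and the file name `best_pipeline.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training features and labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The pipeline scales the input and then classifies it in a single object
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(max_features=3, n_estimators=40, random_state=0),
)
pipeline.fit(X_demo, y_demo)

# Round trip through a temporary file (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_pipeline.pkl")
    with open(model_path, "wb") as handle:
        pickle.dump(pipeline, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

# The restored pipeline accepts raw (unscaled) features directly
same_predictions = bool((restored.predict(X_demo) == pipeline.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources and with a compatible scikit-learn version.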

## 9. Usage of the .py file for predictions
To predict the weather in Seattle for tomorrow using the given .py file, follow these steps:

1. Update the data file: Replace the `seattle-weather.csv` file in the `./FinalProject/data/` directory with a new file containing the weather data for Seattle, including tomorrow's date. Make sure the new file has the same format and columns as the original file.

2. Run the script: Execute the .py script. This will perform the following actions:
   - Import necessary libraries.
   - Load the pre-trained model from the `best_model.sav` file.
   - Read the updated weather data from the `seattle-weather.csv` file.
   - Preprocess the data by transforming the date and scaling the numerical values.
   - Predict the weather for each entry in the data using the loaded model.
   - Convert the predicted label numbers back to label names (e.g., 'drizzle', 'rain') using the mapping saved in `weather_dict.pickle`.
   - Create or connect to an SQLite database named `classification.db`.
   - Create a table named `classification` with fields `id`, `class`, and `class_name`.
   - Insert the predicted weather data into the `classification` table.
   - Save the changes to the database and close the connection.

3. Check the prediction: Open the `classification.db` file using an SQLite database viewer, and look for the row whose `id` corresponds to tomorrow's entry in the input file (the `classification` table stores no date column). The `class_name` field in that row will contain the predicted weather for Seattle tomorrow.
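The database steps listed above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not the project's actual script: the helper names `store_predictions` and `read_predictions`, the column types, and the example rows are hypothetical, and the model loading and prediction steps are omitted.

```python
import os
import sqlite3
import tempfile


def store_predictions(db_path, rows):
    """Insert (id, class, class_name) tuples into the 'classification' table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'CREATE TABLE IF NOT EXISTS classification '
                '(id INTEGER PRIMARY KEY, "class" INTEGER, class_name TEXT)'
            )
            # Parameterized query; an existing id is overwritten on re-runs
            conn.executemany(
                'INSERT OR REPLACE INTO classification (id, "class", class_name) '
                'VALUES (?, ?, ?)',
                rows,
            )
    finally:
        conn.close()


def read_predictions(db_path):
    """Return all stored rows ordered by id."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            'SELECT id, "class", class_name FROM classification ORDER BY id'
        ).fetchall()
    finally:
        conn.close()


# Demo with hypothetical predictions in a temporary database file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_db = os.path.join(tmp_dir, "classification.db")
    store_predictions(demo_db, [(0, 0, "drizzle"), (1, 2, "rain")])
    store_predictions(demo_db, [(1, 2, "rain")])  # re-running does not duplicate ids
    stored_rows = read_predictions(demo_db)

print(stored_rows)
```

Using `id` as the primary key together with `INSERT OR REPLACE` keeps the table free of duplicates when the script is executed more than once on the same input file.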