# Part 2

Tasks:
1. State the different modeling approaches that you will compare to address the business opportunity.
2. Iterate on your suite of possible models by modifying data transformations, pipeline architectures, hyperparameters and other relevant factors.
3. Re-train your model on all of the data using the selected approach and prepare it for deployment.
4. Articulate your findings in a summary report.

### 1. State the different modeling approaches that you will compare to address the business opportunity.

We will try to use different machine learning approaches based on features extracted from histroical data. We will compare accuracy of different methods to address the business opportunity 

### 2. Iterate on your suite of possible models by modifying data transformations, pipeline architectures, 

In [8]:
from cslib import fetch_ts, engineer_features

In [3]:
data_dir = "./data/cs-train/"
ts_data = fetch_ts(data_dir)

... loading ts data from files


In [6]:
ts_data.keys()

dict_keys(['all', 'netherlands', 'united_kingdom', 'germany', 'eire', 'france', 'portugal', 'norway', 'spain', 'singapore', 'hong_kong'])

In [9]:
all_features = engineer_features(ts_data["all"])

In [20]:
X, y, date = all_features

In [22]:
X.head()

Unnamed: 0,previous_7,previous_14,previous_28,previous_70,previous_year,recent_invoices,recent_views
0,14450.54,14450.54,14450.54,14450.54,0.0,5.928571,537.392857
1,27862.5,27862.5,27862.5,27862.5,0.0,10.310345,1007.103448
2,41152.75,41152.75,41152.75,41152.75,0.0,14.966667,1499.066667
3,50840.03,50840.03,50840.03,50840.03,0.0,18.533333,1916.533333
4,52283.29,52283.29,52283.29,52283.29,0.0,19.6,2026.966667


In [37]:
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
import numpy as np

### SVM model

In [38]:
## train a SVM

param_grid_svm = {'svm__kernel':('linear', 'rbf'), 'svm__C':[1, 10]}

param_grid_rf = {
    'rf__criterion': ['mse'],
    'rf__n_estimators': [10, 20,50]
}

pipe_svm = Pipeline(steps=[('scaler', StandardScaler()),
                          ('svm', svm.SVC())])

grid = GridSearchCV(pipe_rf, param_grid=param_grid_rf, cv=3, iid=False, n_jobs=-1)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
svm_eval_rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)))
print("svm_eval_rmse=",svm_eval_rmse)

svm_eval_rmse= 15333.0


### RandomForestRegressor model

In [39]:
## Perform a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    shuffle=True, random_state=42)
## train a random forest model
param_grid_rf = {
    'rf__criterion': ['mse'],
    'rf__n_estimators': [10, 20,50]
}

pipe_rf = Pipeline(steps=[('scaler', StandardScaler()),
                          ('rf', RandomForestRegressor())])

grid = GridSearchCV(pipe_rf, param_grid=param_grid_rf, cv=3, iid=False, n_jobs=-1)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
rf_eval_rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)))
print("rf_eval_rmse=",rf_eval_rmse)

rf_eval_rmse= 16592.0


## According to the above results RandomForestRegressor provides better performance in comparision to SVM model, so we will use former model  

### 3. Re-train your model on all of the data using the selected approach and prepare it for deployment.

In [41]:
from model import model_train
model_train(data_dir)

... loading ts data from files
... saving model: models/sl-all-0_1.joblib
... saving model: models/sl-netherlands-0_1.joblib
... saving model: models/sl-united_kingdom-0_1.joblib
... saving model: models/sl-germany-0_1.joblib
... saving model: models/sl-eire-0_1.joblib
... saving model: models/sl-france-0_1.joblib
... saving model: models/sl-portugal-0_1.joblib
... saving model: models/sl-norway-0_1.joblib
... saving model: models/sl-spain-0_1.joblib
... saving model: models/sl-singapore-0_1.joblib
... saving model: models/sl-hong_kong-0_1.joblib


### 4. Articulate your findings in a summary report.

After comparision of different models we selected RandomForestRegressor as the model that shown the best performace for the accuracy prediction metrics. This metrics aligned with buisness oppartunity. Finally we prepared models for all contries, however for of the contries the data should be improved as expained in EDA (part 1) of this work. 