---
# <span style='color:blue'> **1. Create Model (50 points)** </span>
---

- Create a **logistic regression model** and a **support vector machine model** for the **classification task** involved with your dataset.  

- Assess how well each model performs (use 80/20 training/testing split for your data).
    - <span style='color:green'> ***80/20 Split performed during preprocessing above***</span>

- Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. 
    - That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines.
        - For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe.

## <span style='color:blue'> Create a Logistic Regression classifier </span>


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Logistic regression
logr_clf = LogisticRegression(penalty='l2',  # default
                              C=1,  # default
                              class_weight='balanced',  # use with imbalanced dataset
                              solver='newton-cg',  # only solver that converged on this dataset
                              multi_class='multinomial',
                              random_state=42)
logr_clf.fit(X_train_class, olist_train_y_range)

# Note: for multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss;
# 'liblinear' is limited to one-versus-rest schemes.
# Note: the default solver 'lbfgs', as well as 'sag' and 'saga', hit max_iter without converging,
# even with max_iter set to 4000; only 'newton-cg' converged.

# predict on the held-out test set
yhat_lr = logr_clf.predict(X_test_class)

print('Logistic Regression Metrics:')
print(logr_clf)
print(classification_report(olist_test_y_range, yhat_lr))

## <span style='color:blue'>Utilize RandomizedSearchCV to tune our Logistic Regression Model</span>

source: https://chrisalbon.com/machine_learning/model_selection/hyperparameter_tuning_using_random_search/

In [None]:
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X = X_train_class
y = olist_train_y_range

# create logistic regression
lgr = LogisticRegression(solver='newton-cg')

# create regularization penalty space
# only penalties that work with newton-cg
# (newer scikit-learn releases expect None instead of the string 'none')
penalty = ['l2', 'none']

# create regularization hyperparameter distribution: C drawn uniformly from [0, 4]
C = uniform(loc=0, scale=4)

# create hyperparameter options
hyp = dict(C=C, penalty=penalty)

# create random search: 5-fold CV, 10 sampled parameter settings (n_iter=10)
clf = RandomizedSearchCV(lgr, hyp, random_state=42, n_iter=10, cv=5, verbose=0, n_jobs=-1)

# fit random search
best_model = clf.fit(X, y)

In [None]:
# view hyperparameter values of best model
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
print('Best Score: ', best_model.best_score_) 
print('Best Params: ', best_model.best_params_)
# refit_time_ is the time taken to refit the best estimator on the full training set
print('Refit Time (seconds): ', best_model.refit_time_)

In [None]:
# predict using best model
yhat = best_model.predict(X_test_class)

print('Best RandomSearchCV Logistic Regression Metrics:')
print(classification_report(olist_test_y_range, yhat, zero_division=0))

## <span style='color:blue'> Create a linear SVM classifier with stochastic gradient descent </span>

In [None]:
# Basic SVM model: SGDClassifier's default loss='hinge' gives a linear SVM
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train_class, olist_train_y_range)

# predict on the held-out test set
yhat_svm = sgd_clf.predict(X_test_class)


In [None]:
# SVM with SGD metrics
print('Linear SVM (SGD) Metrics:')
print(sgd_clf)
print(classification_report(olist_test_y_range, yhat_svm, zero_division=0))

## <span style='color:blue'>Utilize RandomizedSearchCV to tune our SVM Model</span>


In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# candidate values for each hyperparameter
# Note: only 'hinge' and 'squared_hinge' are SVM losses; 'log' is logistic regression
# (scikit-learn 1.1+ renames 'log' to 'log_loss')
loss = ['hinge', 'log', 'modified_huber', 'squared_hinge']
penalty = ['l1', 'l2', 'elasticnet']
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
learning_rate = ['constant', 'optimal', 'invscaling', 'adaptive']
eta0 = [1, 10, 100]

# create hyperparameter options
param_distributions = dict(loss=loss,
                           penalty=penalty,
                           alpha=alpha,
                           learning_rate=learning_rate,
                           eta0=eta0)

# create the classifier
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.15, max_iter=100, class_weight="balanced")

# create RandomizedSearchCV (named random_search to avoid shadowing Python's random module)
# no random_state is set, so the sampled settings differ from run to run
random_search = RandomizedSearchCV(estimator=sgd,
                                   param_distributions=param_distributions,
                                   verbose=1,
                                   n_iter=100,
                                   n_jobs=-1)
random_result = random_search.fit(X_train_class, olist_train_y_range)



In [None]:
# Print Best Results
print('Best Score: ', random_result.best_score_) 
print('Best Params: ', random_result.best_params_)
# refit_time_ is the time taken to refit the best estimator on the full training set
print('Refit Time (seconds): ', random_result.refit_time_)

In [None]:
# predict using best model
yhat_rand_svm = random_result.predict(X_test_class)

print('Best RandomizedSearchCV SVM (SGD) Metrics:')
print(classification_report(olist_test_y_range, yhat_rand_svm, zero_division=0))

## <span style='color:blue'> 2. Model Advantages (10 points) </span>
Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

**Discussion:**
* We ran two basic models with only minimal tuning for multinomial classification, and two refitted models tuned with RandomizedSearchCV.
* Basic logistic regression was the only model able to classify any deliveries made on time (value #2), and it did so with a precision of only 3%.
* Refitted logistic regression: once we tuned the model using RandomizedSearchCV with 5-fold cross-validation, the best model did not correctly classify any on-time deliveries. It did, however, increase precision on late deliveries from 64% to 79% and overall accuracy from 75% to 94%.
* SVM with stochastic gradient descent, with minor tuning to handle multinomial classification, returned an overall accuracy of 94% but failed to correctly classify any on-time packages. This model had the best precision on late deliveries at 85%, but fell short of the basic logistic regression model on early deliveries, 95% to 98%.
* Refitted SVM with stochastic gradient descent: we again used RandomizedSearchCV to tune the model and found that the resulting best model had the fastest time, about 0.7 seconds, while maintaining an overall accuracy of 94%, in line with both the refitted logistic regression model and the initial SVM model. This model also failed to correctly classify on-time deliveries and fell short of the untuned SVM model's precision on late deliveries, although it had better recall on late deliveries. It also had the best overall performance on early deliveries, judged by the balance between precision and recall. The values are rounded in this report, but the raw numbers show a slightly better ratio than the other models.
* Time: the fastest model was the best-fit SVM with stochastic gradient descent at about 0.7 seconds, compared with just over 9 seconds for the best-fit logistic regression model.
* Overall, the best-performing model was the refitted SVM with stochastic gradient descent, even though it failed to correctly classify any on-time deliveries. To improve on this, we should probably break the categories into more equal sizes that cover a similar time frame in terms of days.
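
One way to read the rebalancing idea in the last bullet is quantile binning, where the cut points are chosen so each category holds about the same number of deliveries. The sketch below is only an illustration under stated assumptions: `delta_days` is a hypothetical, synthetic array of delivery offsets in days (negative meaning early), not the project data, and three bins are used to mirror the three classes discussed above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical delivery offsets in days (negative = early); synthetic values
delta_days = rng.normal(loc=-10, scale=8, size=1000)

# Tercile cut points: each resulting bin holds roughly one third of the rows
edges = np.quantile(delta_days, [1 / 3, 2 / 3])

# Assign each delivery to bin 0, 1 or 2 according to the cut points
labels = np.digitize(delta_days, edges)

# Count rows per bin to confirm the classes are close to balanced
counts = np.bincount(labels, minlength=3)
print('Cut points (days):', edges)
print('Rows per category:', counts)
```

The trade-off is that quantile bins equalize class sizes but not bin widths, so the categories would no longer cover equal spans of days. Equal-width bins would do the opposite.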



In [None]:
print('\n================================================================\n')
print('=== Logistic Regression Metrics using minor tuning for class_weight, solver ===')
print(logr_clf)
print(classification_report(olist_test_y_range, yhat_lr))

print('\n================================================================\n')
print('=== Best RandomSearchCV Logistic Regression Metrics ===\n')
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
print('Best Score: ', best_model.best_score_) 
print('Best Params: ', best_model.best_params_)
print('Refit Time (seconds): ', best_model.refit_time_)
print(classification_report(olist_test_y_range, yhat, zero_division=0))

print('\n================================================================\n')
print('=== SVM with Stochastic Gradient Descent Metrics ===\n')
print(classification_report(olist_test_y_range, yhat_svm, zero_division=0))

print('\n================================================================\n')
print('=== Best RandomizedSearchCV SVM-SGD Metrics ===\n')
print('Best Score: ', random_result.best_score_) 
print('Best Params: ', random_result.best_params_)
print('Best Time (seconds): ', random_result.refit_time_)
print(classification_report(olist_test_y_range, yhat_rand_svm, zero_division=0))

### <span style='color:blue'> **Discussion** </span>
