# Project Template: Phase 2

Below are some concrete steps that you can take while doing your analysis for phase3. This guide isn't "one size fit all" so you will probably not do everything listed. But it still serves as a good "pipeline" for how to do data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


## 2.1) Decide on what models you will use and compare

Select at least 3 models to compare on your prediction task. At least 2 of your models should be ones we've covered in class. 

Some resources try to help you select a well-performing model for your data:
* [sklearn's Flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
* [geeks4geeks Flowchart](https://www.geeksforgeeks.org/flowchart-for-basic-machine-learning-models/)
* [SAS Cheatsheet](https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet.png)

**Note**: These are general guides, and not guarantees of success. Some of the models are also outside of what we have covered, but you can explore them if you want to.

In addition to selecting a model you think will perform well, there are other reasons to select a model:
* To serve as a baseline (naive) approach you expect to outperform with more complex/appropriate models.
* You need a model that is human interpretable (e.g. Decision Tree).
* The model has historically performed well on similar tasks.
* Some properties of the model are effective for the type of data you have. Remember, at the end of most Seminars, you learned the strengths and weaknesses of each model.

1. Model XXX: I am selecting XXX because...
2. Model YYY: I am selecting YYY because...
3. Model ZZZ: I am selecting ZZZ because...

## 2.2) Split into train and test
Make sure to split your data *before* you apply any transformations.

**Note**: If you have multiple records from the same object (e.g., multiple attempts from the same student), these should all go in either training or test, but not split between them. See the examples for how to accomplish this.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics

# set a seed for reproducibility



random_seed = 40
np.random.seed(random_seed)
data = pd.read_csv('/home/aaronlinder/Documents/DataMining/week7/df1.csv')

for i in data.columns[data.isnull().any(axis=0)]:     #---Applying Only on variables with NaN values
    data[i].fillna(data[i].mean(),inplace=True)
print(data.describe())


    
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

data.to_csv('/home/aaronlinder/Documents/DataMining/week7/df2.csv')

         Unnamed: 0      Severity    Start_Time      End_Time     Start_Lat  \
count  1.516064e+06  1.516064e+06  1.516064e+06  1.516064e+06  1.516064e+06   
mean   7.580315e+05  2.238630e+00  1.255732e+01  1.265853e+01  3.690056e+01   
std    4.376501e+05  6.081481e-01  6.188610e+00  7.072671e+00  5.165653e+00   
min    0.000000e+00  1.000000e+00  0.000000e+00  0.000000e+00  2.457022e+01   
25%    3.790158e+05  2.000000e+00  8.000000e+00  7.000000e+00  3.385422e+01   
50%    7.580315e+05  2.000000e+00  1.400000e+01  1.400000e+01  3.735113e+01   
75%    1.137047e+06  2.000000e+00  1.700000e+01  1.800000e+01  4.072593e+01   
max    1.516063e+06  4.000000e+00  2.300000e+01  2.300000e+01  4.900058e+01   

          Start_Lng       End_Lat       End_Lng  Distance(mi)  Temperature(F)  \
count  1.516064e+06  1.516064e+06  1.516064e+06  1.516064e+06    1.516064e+06   
mean  -9.859919e+01  3.690061e+01 -9.859901e+01  5.872617e-01    5.958460e+01   
std    1.849602e+01  5.165629e+00  1.849590e+

In [7]:


# The fraction of data that will be test data
test_data_fraction = 0.10

features = data.iloc[:,0:-1]
"""print(features.head())
print(features.describe())"""
labels = data["Severity"]
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_data_fraction,  random_state=random_seed)


# This is a dummy dataset that contains 500 positive and 500 negative samples
"""X,Y = make_classification(n_features=4, n_redundant=0, n_informative=1, n_clusters_per_class=1)"""

"""X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_data_fraction,  random_state=random_seed)"""
from sklearn.tree import DecisionTreeClassifier
Y_test_predicted = DecisionTreeClassifier(criterion = "gini", random_state=random_seed).fit(X=X_train, y=Y_train).predict(X_test)
print(f'Accuracy: {sklearn.metrics.accuracy_score(Y_test, Y_test_predicted)}')
print(f'Precision Macro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="macro")}')
print(f'Recall Macro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="macro")}')
print(f'F1 Macro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="macro") }')
print(f'Precision Micro: {sklearn.metrics.precision_score(Y_test, Y_test_predicted, average="micro")}')
print(f'Recall Micro: {sklearn.metrics.recall_score(Y_test, Y_test_predicted, average="micro")}')
print(f'F1 Micro: { sklearn.metrics.f1_score(Y_test, Y_test_predicted, average="micro") }')
from sklearn.metrics import classification_report

print(classification_report(Y_test,Y_test_predicted,digits=4))
print('X dimensionality', data.Severity.shape)
print('y dimensionality', data.Start_Time.shape)
# K-Nearest Neighbor Classifier


Accuracy: 1.0
Precision Macro: 1.0
Recall Macro: 1.0
F1 Macro: 1.0
Precision Micro: 1.0
Recall Micro: 1.0
F1 Micro: 1.0
              precision    recall  f1-score   support

           1     1.0000    1.0000    1.0000      2786
           2     1.0000    1.0000    1.0000    121209
           3     1.0000    1.0000    1.0000     16042
           4     1.0000    1.0000    1.0000     11570

    accuracy                         1.0000    151607
   macro avg     1.0000    1.0000    1.0000    151607
weighted avg     1.0000    1.0000    1.0000    151607

X dimensionality (1516064,)
y dimensionality (1516064,)


### 2.2.1) Sampling (If needed)

""If one of your classes is very underrepresented (e.g. 1000 of Class 0; 200 of Class 1), you might consider oversampling the minority class (e.g. sample 1000 times with replacement from 200 instances), or undersampling the majority class (e.g. sample 200 times from 1000 instances).

Check out [np.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) for how to sample a vector.

**Note 1**: You should only ever sample the *training dataset*, never the test. After all, you can't chose the class distribution of your test data!

**Note 2**: Sampling can help a classifier perform better on the minority class, often at the cost of *overall* performance. But this is no guarantee. If you chose to sample, you should compare your classifiers' performance with and without sampling to see if it actually helped.

**Note 3**: Make sure you sample the *same* indices from your training and test data -- otherwise they won't match anymore!


Play around with sampling below (or skip this step if you don't need sampling).

In [8]:
data["Severity"].value_counts()

2    1212382
3     161052
4     114452
1      28178
Name: Severity, dtype: int64

 When you're done, write the `sample_data` method to perform sampling on any training dataset.

In [25]:
def sample_data(X_train, Y_train):
    subseta = X_train.sample(frac=0.5)
    subsetb = Y_train.sample(frac=0.5)
   
    return subseta, subsetb


## 2.3) Feature Transformation

Use your training data to fit any transformers or encoder your need, then apply the fit transformer to your test data. This applies to:
* Normalizing/standardizing your features
* Using Bag of Words or TF-IDF to encode strings
* PCA or dimensionality reduction

**Rationale**: In practice, we won't be able to see the test data we'll be making predicting for, so we shouldn't use that data as the basis for any transformation or feature extractio.

Try your feature transformation below:

 When you're done, write the `apply_feature_transformation` method to perform transformation on any training/test split.

In [10]:
def apply_feature_transformation(X_train, X_test):
    """
    Input: The original X_train and X_test feature sets.
    Output: The transformed X_train and X_test feature sets.
    """
    return (X_train, X_test)

## 2.4) Train and Explore your Models
Using the models you decided upon in the beginning, now train these models. Conduct preliminary evaluations to see if using said models are even feasible, before potentially wasting time tuning a model thats no-good.

In [11]:
from sklearn.neighbors import KNeighborsClassifier
Y_test_predicted = KNeighborsClassifier(n_neighbors=3).fit(X=X_train, y=Y_train).predict(X_test)
print("KNN Classifer")
print(classification_report(Y_test,Y_test_predicted,digits=4))


KNN Classifer
              precision    recall  f1-score   support

           1     0.5058    0.5822    0.5413      2786
           2     0.8801    0.9357    0.9071    121209
           3     0.5302    0.3929    0.4513     16042
           4     0.5109    0.3374    0.4064     11570

    accuracy                         0.8261    151607
   macro avg     0.6067    0.5621    0.5765    151607
weighted avg     0.8080    0.8261    0.8139    151607



## 2.5) Hyperparameter Tuning
For promising models, tune them even further to squeeze out the best possible performance. Some questions to consider.

1. What hyperparamaters should I tune? Why?
2. What values ranges should I choose for each param? Why?
3. Should I use try the values manually, or use the [built-in tuning functions](https://scikit-learn.org/stable/modules/grid_search.html)?

**Make sure to only tune on the training dataset!**

In [28]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression




def find_best_hyperparameters_m1(X_train, Y_train):
    logistic = LogisticRegression()
    #penalty = ['l2', 'l2']
    C = np.logspace(0,4,10)
    hyperparemeters = dict(C=C)
    clf =  GridSearchCV(logistic, hyperparemeters, cv=5, verbose=0)
   
    clfa = clf.fit(X_train,Y_train)
    sorted(clfa.cv_results_.keys())
    return clfa 
find_best_hyperparameters_m1(X_train, Y_train)


GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([1.00000000e+00, 2.78255940e+00, 7.74263683e+00, 2.15443469e+01,
       5.99484250e+01, 1.66810054e+02, 4.64158883e+02, 1.29154967e+03,
       3.59381366e+03, 1.00000000e+04])})

data.columns()

## Put it All Together

Now, combine the "scratch work" that you did above into a tidy function that someone could use to replicate your work and process in a single step.

In [26]:

def evaluate_model1(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    hyperparameters = find_best_hyperparameters_m1(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions
    return hyperparameters


evaluate_model1(X_train, X_test, Y_train, Y_test)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([1.00000000e+00, 2.78255940e+00, 7.74263683e+00, 2.15443469e+01,
       5.99484250e+01, 1.66810054e+02, 4.64158883e+02, 1.29154967e+03,
       3.59381366e+03, 1.00000000e+04])})

NameError: name 'clfa' is not defined