# Project Template: Phase 3

Below are some concrete steps that you can take while doing your analysis for phase 3. This guide isn't "one size fits all", so you will probably not do everything listed. But it still serves as a good "pipeline" for how to do data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


In [1]:
import pandas as pd
import numpy as np
import math
import random
from random import randrange
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Either import the Part 2 notebook or re-paste its methods here
# from G3_FinalProject_Part2 import apply_feature_transformation

## Methods from Part 2

In [2]:
# Generalized SMOTE algorithm as defined by Chawla et al. in the report "SMOTE: Synthetic Minority Over-sampling Technique".
# https://arxiv.org/pdf/1106.1813.pdf

# Functional version
def SMOTE(minority_df, smote_percent, k):
    """
    Input:  minority_df   - dataframe consisting ONLY of minority class samples
            smote_percent - amount of SMOTE as a decimal (0.5 = 50%, 2 = 200%)
            k             - number of nearest neighbors to use for SMOTE
    Output: dataframe of synthetic minority class samples with
            floor(rows * smote_percent) rows when smote_percent < 1, and
            rows * floor(smote_percent) rows otherwise
    """
    
    # Number of objects (rows) in minority_df
    obj_count = minority_df.shape[0]
    # Number of attributes (columns) in minority_df
    attr_count = minority_df.shape[1]
    # Keeps count of number of synthetic samples generated
    synth_count = 0
    
    # If smote_percent is less than 100% (1), shuffle minority_df's rows because only some objects will be sampled
    if smote_percent < 1:
        # Shuffle the dataframe's rows (sample returns a shuffled copy) and reset the indices
        minority_df = minority_df.sample(frac=1).reset_index(drop=True)
        # Set number of objects to SMOTE
        obj_count = math.floor(obj_count * smote_percent)
        # Set smote_percent to 100% 
        smote_percent = 1
        
    # Else smote_percent is assumed to be in multiples of 100%
    smote_percent = math.floor( smote_percent )
    
    # List of synthetic samples to output, will be converted to dataframe before output
    synth_list = [[0 for x in range(attr_count)] for y in range(obj_count*smote_percent)] 

    # Inner function to generate the synthetic samples
    def Populate(smote_percent, i, nnarray):
        nonlocal synth_count
        while smote_percent != 0:
            # Choose a random number between 0 and k-1 and assign it to nn.
            # This chooses one of the KNNs of i
            nn = randrange(k)
            new_nnarray = nnarray[0]
            # Create the synthetic attributes
            for attr in range(attr_count):      
                dif = minority_df.iloc[new_nnarray[nn], attr] - minority_df.iloc[i, attr]
                gap = random.random() 
                synth_list[synth_count][attr] = minority_df.iloc[i, attr] + gap * dif
            synth_count += 1
            smote_percent -= 1
        return 
    
    # Compute the KNNs for each of minority_df's samples and use them to populate synth_list
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(minority_df)
    for i in range(obj_count):
        # List of i's KNNs as indices
        nnarray = neigh.kneighbors([minority_df.iloc[i].values.flatten().tolist()], return_distance=False)
        # Populate the list of synthetic samples
        Populate(smote_percent, i, nnarray)
    
    # Return the dataframe of synthetic samples
    
    return pd.DataFrame(synth_list, columns = minority_df.columns.values.tolist())
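
The core of `Populate` is a linear interpolation between a sample and one of its neighbours. The sketch below isolates that step with NumPy; the helper name `interpolate` and the toy values are assumptions made for illustration, not part of the Part 2 code.

```python
import numpy as np

def interpolate(sample, neighbor, rng):
    """Return one synthetic point between sample and neighbor (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    neighbor = np.asarray(neighbor, dtype=float)
    # One random gap in [0, 1) per attribute, mirroring the loop over attr in Populate
    gap = rng.random(sample.shape[0])
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(0)
x = [1.0, 10.0]        # toy minority sample
nn = [3.0, 20.0]       # toy nearest neighbour
synthetic = interpolate(x, nn, rng)
print(synthetic)
```

Each attribute of the synthetic point lies between the corresponding attributes of the sample and its neighbour, which is why the flag columns need rounding afterwards.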

In [3]:
# Specialized SMOTE function to apply to the ECONet dataset that
# automatically performs smote and manages the "Station" and "measure" data
# This version ignores the "station"-"measure" correlation. Only works for smote_percent >= 1
# TODO Update this description

def ECONet_SMOTE_IgnoreStationCorrelation(X_train, Y_train, smote_percent, k ):
    """
    Input: The original X_train and Y_train training dataset
           smote_percent - percent of samples to SMOTE as a decimal
           k             - number of nearest neighbors to use for SMOTE
    Output: A new training dataset with sampling applied (same columns, different rows)
    """
    
    # First, copy the dataframes so they are not modified in place
    X_train, Y_train = X_train.copy(deep=True), Y_train.copy(deep=True)
    
    # Now add Y back to X to more easily get the minority class
    df = pd.concat([X_train, Y_train], axis=1)
    
    # Get a new dataframe that has only the minority class
    minority_df = df[df["target"] == True]
    
    # Now remove the "target" column from the minority dataframe, it is useless now that all rows are "True"
    minority_df = minority_df[["Station", "Ob", "value", "measure", "R_flag", "I_flag","Z_flag", "B_flag"]]
    
    # Now, loop through every measure to produce synthetic dataframes for each
    # then merge them together into one big synthetic dataframe. This is necessary because SMOTE does not work
    # with categorical variables like measure
    
    # Big synthetic dataframe to output
    synthetic_df = []
    
    # List of unique "measure" values in the dataframe
    measure_values = minority_df["measure"].unique()
    
    # Loop through each measure
    for measure in measure_values:
        # Get a sub-dataframe that has just this measure 
        sub_df = minority_df[minority_df["measure"] == measure]

        # Continue through the loop if this dataframe has no rows
        if sub_df.shape[0] == 0:
            continue

        # Remove the "measure" attributes from the dataframe so we can SMOTE it
        sub_df = sub_df[["Station", "Ob", "value", "R_flag", "I_flag","Z_flag", "B_flag"]]
        
        # Get the "Station" column and drop it from the sub-dataframe
        stations = sub_df["Station"]
        sub_df = sub_df[["Ob", "value", "R_flag", "I_flag","Z_flag", "B_flag"]]
        
        # Repeat each "Station" entry smote_percent times so it lines up with the synthetic rows
        stations = stations.loc[stations.index.repeat(smote_percent)].reset_index(drop=True)

        # If k is greater than the number of rows in df, set it to number of rows in df
        # TODO Fix this? Explain?
        if(k > sub_df.shape[0]):
            k_new = sub_df.shape[0]
        else:
            k_new = k

        # Smote the dataframe to get synthetic samples that resemble this combination    
        synthetic_sub_df = SMOTE(sub_df, smote_percent, k_new)

        # Add the "Station" and "measure" attributes back to the dataframe
        synthetic_sub_df.insert(0, 'Station', stations)
        synthetic_sub_df.insert(3, 'measure', measure)

        # Round the flag values so they are discrete.
        # TODO This approach should be updated but it's fine for now
        synthetic_sub_df['R_flag'] = synthetic_sub_df['R_flag'].map(round)
        synthetic_sub_df['I_flag'] = synthetic_sub_df['I_flag'].map(round)
        synthetic_sub_df['Z_flag'] = synthetic_sub_df['Z_flag'].map(round)
        synthetic_sub_df['B_flag'] = synthetic_sub_df['B_flag'].map(round)

        # Finally, add this new synthetic sub-dataframe to the main synthetic dataframe to output
        synthetic_df.extend(synthetic_sub_df.values.tolist())
            
    # Now, add "target" column to the synthetic dataframe before output
    synthetic_df = pd.DataFrame(synthetic_df, columns = X_train.columns.values.tolist())
    synthetic_df.insert(4, 'target', True)
    
    # Output the new synthetic dataframe
    return (synthetic_df.sort_values("Station")).reset_index(drop=True)
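
The "Station" handling above depends on `index.repeat` producing one label per synthetic row, in the same order that `SMOTE` emits rows when `smote_percent >= 1`. A small sketch with made-up index values shows the effect:

```python
import pandas as pd

# Two minority rows that kept their original (non-contiguous) index
stations = pd.Series(["AURO", "BAHA"], index=[5, 9])
smote_percent = 3  # three synthetic rows per original row

# Repeat every index label, look the labels up, then renumber from 0
repeated = stations.loc[stations.index.repeat(smote_percent)].reset_index(drop=True)
print(repeated.tolist())
```

The result has each station repeated three times consecutively, matching the order in which the synthetic rows for each original row are generated.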

In [4]:
def sample_data(X_train, Y_train):
    """
    Input: The original X_train and Y_train training dataset
    Output: A new training dataset with sampling applied (same columns, different rows)
    """
    # For example, undersample the majority class, or oversample the minority class.
    synth_df = ECONet_SMOTE_IgnoreStationCorrelation(X_train, Y_train, 30, 4)
    
    # Split the synthetic dataframe into X and Y
    synth_df_copy = synth_df.copy(deep=True)

    X_synth_train, Y_synth_train = synth_df_copy.drop("target", axis=1), synth_df_copy["target"]

    # Combine the synthetic set with the original training set
    X_train_copy, Y_train_copy = X_train.copy(deep=True), Y_train.copy(deep=True)

    X_train_copy, Y_train_copy = pd.concat([X_train_copy, X_synth_train], ignore_index=True), pd.concat([Y_train_copy, Y_synth_train], ignore_index=True)
    
    return (X_train_copy, Y_train_copy)

In [28]:
def apply_feature_transformation(X_train, X_test):
    """
    Input: The original X_train and X_test feature sets.
    Output: The transformed X_train and X_test feature sets.
    """
    catStation = ['AURO', 'BAHA', 'BALD', 'BEAR', 'BUCK', 'BURN', 'CAST', 'CHAP', 'CLA2', 'CLAY', 'CLIN', 'DURH', 
                  'FLET', 'FRYI', 'GOLD', 'HAML', 'JACK', 'JEFF', 'KINS', 'LAKE', 'LAUR', 'LEWS', 'LILE', 'MITC', 
                  'NCAT', 'NEWL', 'OXFO', 'PLYM', 'REED', 'REID', 'ROCK', 'SALI', 'SASS', 'SILR', 'SPIN', 'SPRU', 
                  'TAYL', 'UNCA', 'WAYN', 'WHIT', 'WILD', 'WILL', 'WINE']
    
    catMeasure = ['temp_wxt', 'temp_hmp', 'rh_wxt', 'rh_hmp', 'ws10', 'wd10', 'gust10', 'precip', 'impact',
                   'pres', 'par', 'sr', 'st', 'sm', 'temp10', 'ws02', 'wd02', 'gust02', 'ws06', 'wd06', 'gust06',
                   'leafwetness', 'blackglobetemp']
    
    X_train['Station'] = X_train['Station'].astype('category').cat.set_categories(catStation)
    X_train['Station'] = X_train['Station'].astype('category').cat.codes
    X_train['measure'] = X_train['measure'].astype('category').cat.set_categories(catMeasure)
    X_train['measure'] = X_train['measure'].astype('category').cat.codes

    X_test['Station'] = X_test['Station'].astype('category').cat.set_categories(catStation)
    X_test['Station'] = X_test['Station'].astype('category').cat.codes
    X_test['measure'] = X_test['measure'].astype('category').cat.set_categories(catMeasure)
    X_test['measure'] = X_test['measure'].astype('category').cat.codes
    
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    return (X_train, X_test)
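
Encoding against a fixed category list keeps the integer codes identical between the training and test sets. It also means that any value missing from the list is encoded as -1. The sketch below uses `pd.Categorical` with a shortened list and a made-up station name, so it illustrates the behaviour rather than reproducing the function above.

```python
import pandas as pd

known_stations = ["AURO", "BAHA", "BALD"]   # shortened, illustrative list
values = ["BALD", "AURO", "ZZZZ"]           # "ZZZZ" is a made-up unseen station

# Codes follow the position in known_stations; unknown values become -1
codes = pd.Categorical(values, categories=known_stations).codes
print(list(codes))
```

A quick check such as `(codes == -1).any()` before scaling can flag categories that the static lists do not cover.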

## 3.0) Evaluation

Now that you have selected your models and have trained/tuned them, it's time to see how they stack up. Some important questions to ask:

1. How did your models compare to each other?
2. In what metrics do they differ, and why?

## 3.1) Comparing Models

To compare your models, you can try things such as:

1. Doing multiple random restarts of training/test splits (code below)
2. Using cross-validation
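
For the cross-validation option, one possible sketch uses scikit-learn's `cross_validate` with several scorers at once. The dataset, the pipeline, and the variable names here are assumptions chosen for illustration; they are not the ECONet pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# The scaler sits inside the pipeline, so it is refit on each training fold only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

metric_names = ["accuracy", "precision_macro", "recall_macro", "f1_macro", "roc_auc"]
cv_results = cross_validate(pipeline, X_cv, y_cv, cv=5, scoring=metric_names)

for name in metric_names:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Keeping the transformation inside the pipeline avoids fitting the scaler on data from the held-out fold.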

In your report, report back the following metrics:

**Classification**
* Accuracy
* Precision
* Recall
* F1
* AUC

**Regression**
* MSE
* MAE
* $R^2$
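
As a reminder of how the classification metrics relate to each other, here is a worked example from hypothetical confusion counts for the positive class. The numbers are invented for the arithmetic only.

```python
# Hypothetical confusion counts (not from any real run)
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With an imbalanced target, accuracy can stay high while recall on the minority class is poor, which is why all of these metrics are worth reporting together.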

**Sample Evaluation Code**: Here is some sample code for the evaluation procedure. **You do not need to use the sample code if you feel that it wouldn't work with your pipeline, but you can use it as inspiration. It lacks any sort of feature transformation or sampling procedure, so you would have to implement that yourself.** It runs a set number of trials using different splits and returns a dataframe in which each row represents a single random evaluation. It has 3 columns:

* Model: The name of the model being evaluated
* Evaluation: The name of the evaluation (e.g. acc, precision, MSE)
* Score: The score of the evaluation

In [6]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

"""
Evaluates a classification model
"""
def evaluate_classification(model,x_test_ev,y_test_ev):
    predictions = model.predict(x_test_ev)
    
    acc = accuracy_score(y_test_ev,predictions)
    
    # Depending on the type of classification you are doing (e.g. multiclass vs binary)
    # Make sure to change the "average" param depending on what you need
    prec = precision_score(y_test_ev,predictions,average="macro")
    recall = recall_score(y_test_ev,predictions,average="macro")
    f1 = f1_score(y_test_ev,predictions,average="macro")
    # Make sure to change/edit the `multi_class` of the ROC if you're doing multiclass
    auc = roc_auc_score(y_test_ev,predictions)
    
    return {"acc":acc,"precision":prec,"recall":recall,"f1":f1,"auc":auc}
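
Note that `roc_auc_score` is usually given scores or probabilities rather than hard labels. For a binary problem, passing hard labels reduces the AUC to the mean of the two per-class recalls, which is why the `auc` and `recall` rows in the sample output later in this notebook are identical. The sketch below compares both variants on a stand-in dataset; the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)
positive_proba = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, positive_proba)
macro_recall = recall_score(y_te, hard_labels, average="macro")

print(f"AUC from labels: {auc_from_labels:.4f} (macro recall: {macro_recall:.4f})")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```

If the model exposes `predict_proba` or `decision_function`, using it for the AUC gives a ranking-based score that carries information the other metrics do not.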

In [7]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

"""
Evaluates regression using MAE,MSE, and R^2
"""
def evaluate_regression(model,x_test_ev,y_test_ev):
    predictions = model.predict(x_test_ev)
    mae = mean_absolute_error(y_test_ev,predictions)
    mse = mean_squared_error(y_test_ev,predictions)
    r2 = r2_score(y_test_ev,predictions)
    return {"mae":mae,"mse":mse,"r2":r2}
    

In [8]:
"""
Trains and evaluates a single model on a random train/test split
"""
def evaluate_random(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train)
    
    # Switch this out with `evaluate_regression` if you're doing a regression problem
    evals = evaluate_classification(model,X_test,y_test)
    #evals = evaluate_regression(model,X_test,y_test)
    
    return evals

In [9]:
from sklearn.model_selection import train_test_split

"""
Input:
    X: Your features
    y: Your target
    models: A list of the models that you are evaluating
    n_trials (opt): The number of random trials
    
Output:
    A dataframe with three columns and len(models)*n_trials*(number of evaluation metrics) rows.
    Each row represents a single random evaluation.
    
    Model: The name of the model being evaluated
    Evaluation: The name of the evaluation (e.g. acc, precision, MSE)
    Score: The score of the evaluation
"""
def get_scores(X,y,models,n_trials=5):
    
    data = {
        "model": [],
        "evaluation": [],
        "score": [],
    }
    
    for n in range(n_trials):
        for model in models:
            # Put in special sampling methods
            
            X_train,X_test,y_train,y_test = train_test_split(X,y)
            # Put in feature scaling here
            # MinMaxScaler()
            
            scores = evaluate_random(model,X_train,y_train,X_test,y_test)
            
            for key in scores:
                data["model"].append(str(model))
                data["evaluation"].append(key)
                data["score"].append(scores[key])
    
    return pd.DataFrame.from_dict(data)
        

In [10]:
# Example of getting classification scores
# (See "Follow" doc for how to do it with regression)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
X,y = load_breast_cancer(return_X_y=True,as_frame=True)
neigh = KNeighborsClassifier(n_neighbors=3)
get_scores(X,y,[neigh],5).head()

|   | model | evaluation | score |
|---|-------|------------|-------|
| 0 | KNeighborsClassifier(n_neighbors=3) | acc | 0.902098 |
| 1 | KNeighborsClassifier(n_neighbors=3) | precision | 0.897727 |
| 2 | KNeighborsClassifier(n_neighbors=3) | recall | 0.879934 |
| 3 | KNeighborsClassifier(n_neighbors=3) | f1 | 0.887831 |
| 4 | KNeighborsClassifier(n_neighbors=3) | auc | 0.879934 |


### Print a boxplot of the different random runs (see Follow document)
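
One way to draw such a boxplot from a long-format scores dataframe is sketched below. The scores are made up, and the short model names are placeholders; a real run would use the dataframe returned by the evaluation code instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores in the same long format (model, evaluation, score)
scores_df = pd.DataFrame({
    "model": ["knn"] * 6 + ["tree"] * 6,
    "evaluation": ["acc", "f1"] * 6,
    "score": [0.90, 0.88, 0.92, 0.89, 0.91, 0.87,
              0.85, 0.80, 0.86, 0.83, 0.84, 0.81],
})

fig, ax = plt.subplots()
# One box per model, using only the rows for a single metric
f1_rows = scores_df[scores_df["evaluation"] == "f1"]
f1_rows.boxplot(column="score", by="model", ax=ax)
ax.set_title("F1 across random splits")
ax.set_ylabel("score")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```

Plotting one metric per figure keeps the y-axis scale meaningful when metrics have different ranges.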

In [None]:
random_seed = 42
np.random.seed(random_seed)

train = pd.read_csv("transformed_train.csv")
X_train, X_test, Y_train, Y_test = train_test_split( train.drop("target", axis=1), train["target"], test_size=0.25, random_state=random_seed)

X_train_smoted, Y_train_smoted = sample_data(X_train, Y_train)
X_train_tansformed, X_test_tansformed = apply_feature_transformation(X_train_smoted, X_test)

In [None]:
# Manually test the model on whatever metrics are desired to get boxplot data
from sklearn.metrics import classification_report

model = KNeighborsClassifier(n_neighbors=2, weights="distance", algorithm="auto", p=1).fit(X_train_tansformed, Y_train_smoted)
predictions = model.predict(X_test_tansformed)
print(classification_report(Y_test, predictions))

In [None]:
# Hyperparameters were tuned in part 2 using a combination of
# automatic and manual hyperparameter testing for models/SMOTE
model1 = DecisionTreeClassifier(criterion="entropy")
model2 = KNeighborsClassifier(n_neighbors=2, weights="distance", algorithm="auto", p=1)
model3 = SVC(max_iter=5)
#The sample method did not work well with our pipeline
#get_scores(X,Y,[model1, model2, model3],3).head()

### Create a table with the average evaluation scores of each metric for each model
**Bold** the best result for each.

Here's an example for a regression problem.

|     | LinReg | SVR   | MLP   |
|-----|--------|-------|-------|
| **MAE** | 3.061  | 4.143 | **2.71**  |
| **MSE** | 22.09  | 42.42 | **15.37** |
| **$R^2$** | 0.684  | 0.394 | **0.780** |
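
A table like this can be derived from the long-format scores dataframe with a pivot. The sketch below uses invented scores and assumes the column names `model`, `evaluation`, and `score`.

```python
import pandas as pd

# Invented scores: two random runs per model and metric
scores_df = pd.DataFrame({
    "model": ["LinReg", "LinReg", "SVR", "SVR"] * 2,
    "evaluation": ["mae"] * 4 + ["mse"] * 4,
    "score": [3.0, 3.2, 4.0, 4.2, 22.0, 22.2, 42.0, 42.4],
})

# Rows are metrics, columns are models, cells are the mean over runs
summary = scores_df.pivot_table(index="evaluation", columns="model",
                                values="score", aggfunc="mean")
# For error metrics the best model has the lowest mean; use idxmax for accuracy-style metrics
best = summary.idxmin(axis=1)

print(summary.round(3))
print(best)
```

The `best` series indicates which cell to bold in each row of the report table.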

In [None]:
# Run the model on the actual training/test data
train = pd.read_csv("transformed_train.csv")
test = pd.read_csv("transformed_test.csv")
X_train = train.drop("target", axis=1)
Y_train = train["target"]
X_test = test.drop("target", axis=1)
Y_test = test["target"]
X_train_smoted, Y_train_smoted = sample_data(X_train, Y_train)
X_train_tansformed, X_test_tansformed = apply_feature_transformation(X_train_smoted, X_test)

In [None]:
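# Model 0: Decision Tree on real test
from sklearn.metrics import classification_report

# Fit on the transformed training features so they match the transformed test features
model = DecisionTreeClassifier(criterion="entropy", max_depth=20, min_impurity_decrease=0.0004).fit(X_train_tansformed, Y_train_smoted)
test_pred = model.predict_proba(X_test_tansformed)
pd.DataFrame(test_pred[:,1], columns=['target']).to_csv('predictions.csv', index=False)
# classification_report expects predicted labels, not the model object
print(classification_report(Y_test, model.predict(X_test_tansformed)))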

## 4.0) Technical Retrospective

Now that you have your final model, go back and look at how your decisions impacted the results. This can take many forms; here are some ideas:

* Which of your decisions were helpful? With your best model:
    * Compare the results of the model with and without your feature selection
    * Compare the results with and without feature engineering
    * Compare if your sampling method made a difference
    
    
* Why did your model do well?
    * If your model is interpretable, discuss feature importance (e.g. decision tree splits, coefficients of linear regression)

* The most helpful decision in model training was implementing SMOTE: oversampling the minority class produced a major increase in the overall performance of our model. Encoding the categorical features using static arrays also helped performance and greatly reduced our program's execution time. Lastly, a simple decision like standardizing our features caused the performance of our models to improve drastically.
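
For the feature-importance idea in the list above, a fitted decision tree exposes `feature_importances_`. The sketch below uses a stand-in dataset and a shallow tree, so the ranking it prints says nothing about the ECONet features; it only shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_fi, y_fi = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0).fit(X_fi, y_fi)

# Importances are normalized to sum to 1; larger values mean more impurity reduction
importances = pd.Series(tree.feature_importances_, index=X_fi.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Features with an importance of zero were never used in a split, which can itself be a useful observation for the retrospective.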

## 5.0) Writeup 

Now it is time to reflect upon your work and tie up your report. The goal of this project was to get you familiar with doing a data science problem from scratch on a custom dataset. First, write some conclusions about your model. Then, consider how it could be used in practice. Finally, write about your experiences and what you learned from this project.

Use the following questions as inspiration.

1. How could we use this model in practice?
2. Would you trust the model to make decisions?
3. What are the limitations of the model?
4. What are alternative approaches you could try in the future?

1. We could use this to predict erroneous weather measures from the ECONet stations with a higher level of precision and recall than the current quality control system. This model could then be used to reduce the number of readings people have to review manually.
2. We could probably trust our model to make decisions, because the results have been fairly high across the board, showing high levels of average precision in both k-NN and decision trees. It probably would not be a universal answer for predicting erroneous ECONet data at a usable level, but it is a step in the right direction.
3. A major limitation of our model is the execution time. Oversampling takes over half an hour, which drastically slows down development and testing. Other than that, our model is fairly flexible with the data it can use, as long as categorical features are encoded properly and no unknown categories appear in either the training or testing set.
4. Alternative approaches could include other types of oversampling, which could potentially generate more accurate and useful synthetic samples. We could also try evaluating other models that we didn't look into this time, such as neural networks, which could potentially find better connections between features like "measure" and "value" to improve predictions.