This notebook covers the modeling and class-balancing techniques used for this project. The modeling target is:
    • 'popularity_category'
    
The features used are:

    • 'explicit'
    • 'danceability'
    • 'energy'
    • 'key'
    • 'loudness'
    • 'mode'
    • 'speechiness'
    • 'acousticness'
    • 'instrumentalness'
    • 'liveness'
    • 'valence'
    • 'tempo'
    • 'time_signature'

We saw in the EDA that some features had more impact than others on a song's popularity; the modeling process will show how much each feature contributes.
We also saw how unbalanced the data was after splitting the popularity feature into categories. These are the techniques used for balancing the data:

    • Random Over Sampler: This works by randomly selecting examples from the minority class, with replacement, and adding them to the training dataset.
    
    • Near Miss undersampling: Near Miss shrinks the larger class instead of growing the smaller one. It keeps only the majority-class samples that lie closest to minority-class samples (measured by distance in feature space) and discards the rest, until both classes are the same size.
    
    • Weight: By default every sample counts equally during training, so the larger class dominates. Class weights give more weight (importance) to the minority class to compensate for the imbalance, without adding or removing any rows.

# Imports

In [29]:
import pandas as pd

from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import NearMiss

from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
pd.set_option('display.max_columns', None)

# Load Data

In [4]:
tracks_clean = pd.read_csv('./data/tracks_clean.csv')

# Modeling

## Baseline Model

In [1]:
def baseline(model,
             X_train, y_train, X_test, y_test,
             verbose=True):
    """Fit `model` and return train/test accuracy plus test recall, precision, and F1.

    `verbose` is accepted so the signature matches `run_models`, but it is not used here.
    """
    results = {}

    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    results['train_accuracy'] = accuracy_score(y_train, y_pred_train)
    results['test_accuracy'] = accuracy_score(y_test, y_pred_test)
    # 'variance' is the train-minus-test accuracy gap, used as an overfitting indicator
    results['variance'] = results['train_accuracy'] - results['test_accuracy']
    # Recall, precision, and F1 are reported for class 1 (pos_label=1)
    results['test_recall'] = recall_score(y_test, y_pred_test, pos_label=1, zero_division=0)
    results['test_precision'] = precision_score(y_test, y_pred_test, pos_label=1, zero_division=0)
    results['test_f1'] = f1_score(y_test, y_pred_test, pos_label=1, zero_division=0)

    return results

## Helper Functions

In [2]:
def run_models(models, X_train, y_train, X_test, y_test, verbose=False):
    """Run `baseline` for every model in `models` and return the results as a DataFrame."""
    results = {}

    for name, model in models.items():
        if verbose:
            print('\nRunning {} - {}'.format(name, model))

        results[name] = baseline(model, X_train, y_train, X_test, y_test, verbose=False)

        if verbose:
            print('Results: ', results[name])

    # One row per model, one column per metric
    return pd.DataFrame.from_dict(results, orient='index')

## Models

In [6]:
models = {'Dummy': DummyClassifier(strategy='most_frequent'),
          'Logistic Regression': LogisticRegression(solver='lbfgs'),
          'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Random Forest': RandomForestClassifier(n_estimators=100),
          'XGBoost': XGBClassifier(),
          }

###### Why did I choose these models?
    When picking models, I had to consider a few things. Which models are best for classification? Which are good for binary classification? Which can handle non-linear relationships? I started from my EDA to see what kinds of relationships my data has. Logistic Regression is one of the simplest and most widely used classification models, so I included it as my starting point. Unfortunately my data was not very linear, so I included the Random Forest model to address that. Random Forest copes well with many features, large datasets, and noise, which made it a good choice. XGBoost is similar to Random Forest in that it also builds on decision trees, but it uses gradient boosting and offers a setting (scale_pos_weight) aimed at unbalanced data. K-Nearest Neighbors assigns a data point to one group or another based on the data points nearest to it. Since some of the EDA graphs showed a separation between the groups, I thought it would be a good model to include.

## Train/Test Split

In [7]:
# setting up X and y
X = tracks_clean.drop(columns=['popularity', 'popularity_category', 'release_date'])
y = tracks_clean['popularity_category']

# train/test split, stratified so both parts keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y
                                                   )
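
To illustrate what `stratify=y` does in the cell above, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function name `stratified_split` and the toy labels are hypothetical, and the 25% test share simply mirrors the `train_test_split` default. The idea is to split each class separately so both parts keep about the same class ratio.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with roughly the imbalance seen in this notebook (about 92% / 8%)
labels = [1] * 920 + [0] * 80
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

Both parts end up with the same share of class 1, which is why the train and test class distributions printed in the next section are nearly identical.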

## Unbalanced Model

In [8]:
# Unbalanced data

print('\ny_train Class Distribution:\n', y_train.value_counts(normalize=True))
print('\ny_test Class Distribution:\n', y_test.value_counts(normalize=True))


y_train Class Distribution:
 1    0.923916
0    0.076084
Name: popularity_category, dtype: float64

y_test Class Distribution:
 1    0.923915
0    0.076085
Name: popularity_category, dtype: float64


In [9]:
unbalanced_model_results = run_models(models,
                                  X_train, y_train,
                                  X_test, y_test,
                                  verbose=True)


Running Dummy - DummyClassifier(strategy='most_frequent')
Results:  {'train_accuracy': 0.9239163541311513, 'test_accuracy': 0.923914599968633, 'variance': 1.7541625182415643e-06, 'test_recall': 1.0, 'test_precision': 0.923914599968633, 'test_f1': 0.9604528184189635}

Running Logistic Regression - LogisticRegression()
Results:  {'train_accuracy': 0.9239163541311513, 'test_accuracy': 0.923914599968633, 'variance': 1.7541625182415643e-06, 'test_recall': 1.0, 'test_precision': 0.923914599968633, 'test_f1': 0.9604528184189635}

Running Nearest Neighbors - KNeighborsClassifier()
Results:  {'train_accuracy': 0.9296101829753382, 'test_accuracy': 0.9188276929581114, 'variance': 0.010782490017226753, 'test_recall': 0.9894828515126243, 'test_precision': 0.9275045486935045, 'test_f1': 0.9574917868875874}

Running Random Forest - RandomForestClassifier()
Results:  {'train_accuracy': 0.9979884077736106, 'test_accuracy': 0.9459874122917675, 'variance': 0.052000995481843115, 'test_recall': 0.99021351

## Undersampling

In [15]:
# Undersampling data using NearMiss
nr = NearMiss()

X_train_near, y_train_near = nr.fit_resample(X_train, y_train)

# Source: https://analyticsindiamag.com/using-near-miss-algorithm-for-imbalanced-datasets/
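
As a rough illustration of the selection rule, the sketch below implements the NearMiss-1 heuristic in pure Python, assuming the default NearMiss-1 variant with 3 neighbors is what `NearMiss()` uses here. It is not the imbalanced-learn implementation; `near_miss_1` and the toy points are hypothetical. Each majority point is scored by its mean distance to its closest minority points, and only the lowest-scoring points are kept.

```python
import math

def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their n_neighbors closest minority points."""
    def mean_dist_to_closest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        closest = dists[:n_neighbors]
        return sum(closest) / len(closest)

    ranked = sorted(majority, key=mean_dist_to_closest_minority)
    return ranked[:n_keep]

# Toy 2-D points: three minority points near the origin, six majority points
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0), (2.0, 2.0)]

# Shrink the majority class to the size of the minority class
kept = near_miss_1(majority, minority, n_keep=len(minority))
print(kept)
```

Only the majority points nearest the minority cluster survive, so the resampled training set no longer covers the full range of the majority class.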

In [16]:
print('\ny_train Class Distribution:\n', y_train_near.value_counts(normalize=True))
print('\ny_test Class Distribution:\n', y_test.value_counts(normalize=True))


y_train Class Distribution:
 0    0.5
1    0.5
Name: popularity_category, dtype: float64

y_test Class Distribution:
 1    0.923915
0    0.076085
Name: popularity_category, dtype: float64


In [18]:
undersampled_model_results = run_models(models,
                                        X_train_near, y_train_near,
                                        X_test, y_test,
                                        verbose=True)


Running Dummy - DummyClassifier(strategy='most_frequent')
Results:  {'train_accuracy': 0.5, 'test_accuracy': 0.07608540003136699, 'variance': 0.423914599968633, 'test_recall': 0.0, 'test_precision': 0.0, 'test_f1': 0.0}

Running Logistic Regression - LogisticRegression()
Results:  {'train_accuracy': 0.5, 'test_accuracy': 0.07608540003136699, 'variance': 0.423914599968633, 'test_recall': 0.0, 'test_precision': 0.0, 'test_f1': 0.0}

Running Nearest Neighbors - KNeighborsClassifier()
Results:  {'train_accuracy': 0.774773698204523, 'test_accuracy': 0.29907058253949853, 'variance': 0.4757031156650245, 'test_recall': 0.2697999158628121, 'test_precision': 0.9046051817574423, 'test_f1': 0.4156358013461888}

Running Random Forest - RandomForestClassifier()
Results:  {'train_accuracy': 0.9919786096256684, 'test_accuracy': 0.5295906608205877, 'variance': 0.4623879488050807, 'test_recall': 0.5018414235421756, 'test_precision': 0.9785709145858819, 'test_f1': 0.663446808925836}

Running XGBoost - X

## Oversampling

In [19]:
# Oversampling with RandomOverSampler
ros = RandomOverSampler()

X_train_over, y_train_over = ros.fit_resample(X_train, y_train)
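
The idea behind random oversampling can be sketched in a few lines of pure Python. This is an illustration only, not the imbalanced-learn code; `random_oversample` and the toy rows are hypothetical. Rows from the smaller class are drawn with replacement until every class is as large as the largest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes (with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy data: 8 rows of class 1 and 2 rows of class 0
rows = list(range(10))
labels = [1] * 8 + [0] * 2

over_rows, over_labels = random_oversample(rows, labels)
print(Counter(over_labels))
```

No new information is created: the added class-0 rows are exact copies of the original ones, which is worth keeping in mind when reading the training scores on oversampled data.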

In [20]:
print('\ny_train Class Distribution:\n', y_train_over.value_counts(normalize=True))
print('\ny_test Class Distribution:\n', y_test.value_counts(normalize=True))


y_train Class Distribution:
 1    0.5
0    0.5
Name: popularity_category, dtype: float64

y_test Class Distribution:
 1    0.923915
0    0.076085
Name: popularity_category, dtype: float64


In [21]:
oversampled_model_results = run_models(models,
                                        X_train_over, y_train_over,
                                        X_test, y_test,
                                        verbose=True)


Running Dummy - DummyClassifier(strategy='most_frequent')
Results:  {'train_accuracy': 0.5, 'test_accuracy': 0.07608540003136699, 'variance': 0.423914599968633, 'test_recall': 0.0, 'test_precision': 0.0, 'test_f1': 0.0}

Running Logistic Regression - LogisticRegression()
Results:  {'train_accuracy': 0.5190527385313314, 'test_accuracy': 0.8410921166579157, 'variance': -0.32203937812658434, 'test_recall': 0.8989394286051678, 'test_precision': 0.9268630479944602, 'test_f1': 0.9126877079399335}

Running Nearest Neighbors - KNeighborsClassifier()
Results:  {'train_accuracy': 0.9397739601502668, 'test_accuracy': 0.7977988557868682, 'variance': 0.14197510436339866, 'test_recall': 0.8323898651590858, 'test_precision': 0.9420092544643602, 'test_f1': 0.88381350918231}

Running Random Forest - RandomForestClassifier()
Results:  {'train_accuracy': 0.9985275919670732, 'test_accuracy': 0.9446099924310096, 'variance': 0.053917599536063565, 'test_recall': 0.9832611278811452, 'test_precision': 0.95790

## Weighted

In [22]:
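models_weighted = {'Logistic Regression': LogisticRegression(solver='lbfgs', class_weight='balanced'),
                   # Note: weights='distance' weights neighbors by inverse distance; it is not a class weight
                   'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5, weights='distance'),
                   'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
                   # Note: scale_pos_weight scales the positive class (label 1), which is the majority class here
                   'XGBoost': XGBClassifier(scale_pos_weight=5),
                   }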
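
For reference, the sketch below works through the arithmetic behind `class_weight='balanced'`, which scikit-learn documents as `n_samples / (n_classes * count_of_class)`. It also computes the negative-to-positive ratio that the XGBoost documentation suggests as a typical starting value for `scale_pos_weight`. The helper `balanced_class_weights` and the toy counts (924 vs. 76 rows, chosen to resemble this dataset's class shares) are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of rows in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Toy labels: 924 rows of class 1 and 76 rows of class 0
labels = [1] * 924 + [0] * 76

weights = balanced_class_weights(labels)
print({c: round(w, 4) for c, w in weights.items()})

# Ratio of negative (0) to positive (1) rows
neg_over_pos = labels.count(0) / labels.count(1)
print(round(neg_over_pos, 4))
```

With these weights each class contributes the same total weight to training. Because class 1 is the larger class here, the negative-to-positive ratio comes out below 1.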

In [23]:
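# Note: this prints the oversampled training distribution from the previous section.
# The weighted models below are fit on the original (unbalanced) X_train / y_train.
print('\ny_train Class Distribution:\n', y_train_over.value_counts(normalize=True))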
print('\ny_test Class Distribution:\n', y_test.value_counts(normalize=True))


y_train Class Distribution:
 1    0.5
0    0.5
Name: popularity_category, dtype: float64

y_test Class Distribution:
 1    0.923915
0    0.076085
Name: popularity_category, dtype: float64


In [24]:
weighted_model_results = run_models(models_weighted,
                                        X_train, y_train,
                                        X_test, y_test,
                                        verbose=True)


Running Logistic Regression - LogisticRegression(class_weight='balanced')
Results:  {'train_accuracy': 0.8426366632571883, 'test_accuracy': 0.8428036631185604, 'variance': -0.00016699986137203027, 'test_recall': 0.9009764342069332, 'test_precision': 0.9268403790087464, 'test_f1': 0.9137254168148051}

Running Nearest Neighbors - KNeighborsClassifier(weights='distance')
Results:  {'train_accuracy': 0.9979906807591772, 'test_accuracy': 0.9197755214761577, 'variance': 0.07821515928301948, 'test_recall': 0.9832611278811452, 'test_precision': 0.9334580513165455, 'test_f1': 0.9577125604298834}

Running Random Forest - RandomForestClassifier(class_weight='balanced')
Results:  {'train_accuracy': 0.997381520627344, 'test_accuracy': 0.9441599443576928, 'variance': 0.05322157626965118, 'test_recall': 0.9902947015712988, 'test_precision': 0.9512661996994016, 'test_f1': 0.970388182755067}

Running XGBoost - XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel

# Model Results

## Unbalanced Modeling Results:

In [25]:
unbalanced_model_results.sort_values(by='test_accuracy', ascending=False)

| Model | train_accuracy | test_accuracy | variance | test_recall | test_precision | test_f1 |
|---|---|---|---|---|---|---|
| Random Forest | 0.997988 | 0.945987 | 0.052001 | 0.990214 | 0.953148 | 0.971327 |
| XGBoost | 0.94708 | 0.943219 | 0.003861 | 0.988885 | 0.951558 | 0.969863 |
| Dummy | 0.923916 | 0.923915 | 2e-06 | 1.0 | 0.923915 | 0.960453 |
| Logistic Regression | 0.923916 | 0.923915 | 2e-06 | 1.0 | 0.923915 | 0.960453 |
| Nearest Neighbors | 0.92961 | 0.918828 | 0.010782 | 0.989483 | 0.927505 | 0.957492 |
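
The Dummy row can be reproduced with simple arithmetic, which helps show why accuracy alone looks strong on this data. The sketch below is a worked example on a hypothetical test set of 10,000 rows with about the reported class share (92.39% class 1). The helper `binary_metrics` is illustrative and is not part of the notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 for the positive class,
    returning 0.0 where a ratio is undefined (like zero_division=0)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, recall, precision, f1

# Hypothetical test set: 9,239 rows of class 1 and 761 rows of class 0
n_pos, n_neg = 9239, 761

# A model that always predicts class 1 (the majority class)
always_one = binary_metrics(tp=n_pos, fp=n_neg, fn=0, tn=0)
print([round(m, 4) for m in always_one])

# A model that always predicts class 0
always_zero = binary_metrics(tp=0, fp=0, fn=n_pos, tn=n_neg)
print([round(m, 4) for m in always_zero])
```

Always predicting class 1 gives recall 1.0 with accuracy and precision equal to the class-1 share, which matches the Dummy row above. Always predicting class 0 gives an accuracy equal to the class-0 share and zero recall, precision, and F1.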


## Undersampling Modeling Results:

In [26]:
undersampled_model_results.sort_values(by='test_accuracy', ascending=False)

| Model | train_accuracy | test_accuracy | variance | test_recall | test_precision | test_f1 |
|---|---|---|---|---|---|---|
| Random Forest | 0.991979 | 0.529591 | 0.462388 | 0.501841 | 0.978571 | 0.663447 |
| XGBoost | 0.897828 | 0.529413 | 0.368415 | 0.502771 | 0.976478 | 0.663776 |
| Nearest Neighbors | 0.774774 | 0.299071 | 0.475703 | 0.2698 | 0.904605 | 0.415636 |
| Dummy | 0.5 | 0.076085 | 0.423915 | 0.0 | 0.0 | 0.0 |
| Logistic Regression | 0.5 | 0.076085 | 0.423915 | 0.0 | 0.0 | 0.0 |


## Oversampling Modeling Results:

In [27]:
oversampled_model_results.sort_values(by='test_accuracy', ascending=False)

| Model | train_accuracy | test_accuracy | variance | test_recall | test_precision | test_f1 |
|---|---|---|---|---|---|---|
| Random Forest | 0.998528 | 0.94461 | 0.053918 | 0.983261 | 0.957902 | 0.970416 |
| XGBoost | 0.859905 | 0.857069 | 0.002836 | 0.862391 | 0.980565 | 0.917689 |
| Logistic Regression | 0.519053 | 0.841092 | -0.322039 | 0.898939 | 0.926863 | 0.912688 |
| Nearest Neighbors | 0.939774 | 0.797799 | 0.141975 | 0.83239 | 0.942009 | 0.883814 |
| Dummy | 0.5 | 0.076085 | 0.423915 | 0.0 | 0.0 | 0.0 |
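
The negative variance for Logistic Regression looks odd at first, but it may simply reflect that the training set is balanced 50/50 while the test set is still about 92/8. Accuracy is the class-share-weighted average of the per-class recalls, so the same classifier can score very differently on the two sets. The sketch below is a back-of-the-envelope check using the values in the table, under the assumption (not verified in this notebook) that per-class recall is about the same on train and test.

```python
# Logistic Regression on oversampled data, values from the results table
test_accuracy = 0.841092
test_recall_1 = 0.898939   # recall for class 1
share_1 = 0.923915         # share of class 1 in the test set
share_0 = 1 - share_1

# accuracy = share_1 * recall_1 + share_0 * recall_0, solved for recall_0
recall_0 = (test_accuracy - share_1 * test_recall_1) / share_0

# Expected accuracy on a 50/50 set if the per-class recalls carry over
balanced_accuracy_estimate = 0.5 * test_recall_1 + 0.5 * recall_0

print('implied recall for class 0:', round(recall_0, 3))
print('estimated accuracy on balanced data:', round(balanced_accuracy_estimate, 3))
```

The estimate lands close to the reported training accuracy of 0.519053. This suggests the model recovers few class-0 songs, and that its higher test accuracy mostly comes from class 1 dominating the test set.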


## Weighted Modeling Results:

In [28]:
weighted_model_results.sort_values(by='test_accuracy', ascending=False)

| Model | train_accuracy | test_accuracy | variance | test_recall | test_precision | test_f1 |
|---|---|---|---|---|---|---|
| Random Forest | 0.997382 | 0.94416 | 0.053222 | 0.990295 | 0.951266 | 0.970388 |
| XGBoost | 0.938825 | 0.936032 | 0.002793 | 0.998229 | 0.936694 | 0.966483 |
| Nearest Neighbors | 0.997991 | 0.919776 | 0.078215 | 0.983261 | 0.933458 | 0.957713 |
| Logistic Regression | 0.842637 | 0.842804 | -0.000167 | 0.900976 | 0.92684 | 0.913725 |


### Modeling Summary:
For this project we focus mainly on accuracy, because we want to know how well the model can predict the popularity category of a song.

    Unbalanced modeling: Surprisingly, the unbalanced models did not perform badly. This is likely because the data is so unbalanced: the Dummy model, which always predicts the majority class, already reaches a test accuracy of 0.923915, so the models may be overfit to the majority class. The best performing model was Random Forest with a training score of 0.997988 and a test score of 0.945987.
    
    Undersampling modeling: Undersampling performed the worst of all the balancing techniques. The results were extremely overfit, with high train scores and low test scores (Random Forest: 0.991979 train, 0.529591 test), which gives a very high variance score.
    
    Oversampling modeling: Oversampling had the best results of the balancing techniques. The best model was Random Forest with a training score of 0.998528, a test score of 0.944610, and a variance of 0.053918. We will proceed with this model for model tuning.
    
    Weighted modeling: Weighted modeling also performed very well. Random Forest was again the best performing model, with scores very similar to oversampling (test score of 0.944160). This is the second-best data balancing technique.
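
To make the comparison concrete, the sketch below ranks the Random Forest rows from the four results tables by test accuracy and prints the train-minus-test gap. The numbers are copied from the tables; the dictionary name `rf_results` is just for illustration.

```python
# (train_accuracy, test_accuracy) for Random Forest, copied from the results tables
rf_results = {
    'Unbalanced': (0.997988, 0.945987),
    'Undersampling': (0.991979, 0.529591),
    'Oversampling': (0.998528, 0.944610),
    'Weighted': (0.997382, 0.944160),
}

# Sort by test accuracy, best first
ranked = sorted(rf_results.items(), key=lambda kv: kv[1][1], reverse=True)

for name, (train, test) in ranked:
    print(f'{name:<14} test={test:.6f} gap={train - test:.6f}')
```

The unbalanced, oversampled, and weighted Random Forests are within about 0.002 test accuracy of each other, while undersampling falls far behind.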