# Modeling
In this notebook, we will (1) find the best classifier, (2) find the best parameters for the classifier, and (3) find the best set of features to use in our model.   
We will evaluate 3 models with our validation set:  
1. our chosen classifier with default parameters on all our features  
2. our chosen classifier with the best parameters on all our features  
3. our chosen classifier with the best parameters on only the most important features  

# Imports

In [50]:
# import libraries
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from time import time
from pyspark.sql.functions import lit
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.classification import LogisticRegression, GBTClassifier, LinearSVC, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline

In [2]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

# Load & Process Data
Labels and features were generated in the `Sparkify_Exploratory_Analysis_Feature_Engineering.ipynb` notebook. 

In [100]:
# loading transformed data as spark dataframe
data = spark.read.csv('sparkify_data', inferSchema =True, header = True).withColumnRenamed('user_churned', 'label')
print((data.count(), len(data.columns)))

(225, 85)


In [4]:
def vectorize_scale_features(data, features_list=None):
    '''
    Function to assemble features into a vector, scale it and return it with the labels
    data [spark dataframe] = data of labels and features with numerical values
    features_list [list of strings] = list of names of feature columns (default: every column after the first two)
    returns input_data [spark dataframe] = dataframe of labels and scaled vectors
    '''
    # resolve the default from the dataframe passed in, not from the global `data` at definition time
    if features_list is None:
        features_list = data.columns[2:]
    # vectorize the features
    assembler = VectorAssembler(inputCols=features_list, outputCol='vector')
    data = assembler.transform(data)
    # scale the feature vectors
    scaler = StandardScaler(inputCol='vector', outputCol='features', withMean=True, withStd=True)
    scaler_fit = scaler.fit(data)
    data = scaler_fit.transform(data)
    # select labels and features
    input_data = data.select('label', 'features')
    return input_data

In [104]:
def split_process_data(data, features_list=None):
    '''
    Function to split the data into train, test, and validation sets and process them into vectors
    data [spark dataframe] = data of labels and features with numerical values
    features_list [list of strings] = list of names of feature columns (default: every column after the first two)
    returns processed_train [spark dataframe] = dataframe of labels and feature vectors
    returns processed_test [spark dataframe] = dataframe of labels and feature vectors
    returns processed_validation [spark dataframe] = dataframe of labels and feature vectors
    '''
    # resolve the default from the dataframe passed in, not from the global `data` at definition time
    if features_list is None:
        features_list = data.columns[2:]
    # split the data into sets
    train, rest = data.randomSplit([0.6, 0.4], seed=42)
    print('train data: ', (train.count(), len(train.columns)))
    test, validation = rest.randomSplit([0.5, 0.5], seed=42)
    print('test data: ', (test.count(), len(test.columns)))
    print('validation data: ', (validation.count(), len(validation.columns)))
    # vectorize the features
    processed_train = vectorize_scale_features(train, features_list)
    processed_test = vectorize_scale_features(test, features_list)
    processed_validation = vectorize_scale_features(validation, features_list)
    return processed_train, processed_test, processed_validation

In [87]:
train, test, validation = split_process_data(data)

train data:  (117, 85)
test data:  (41, 85)
validation data:  (67, 85)


## Which Classifier Should We Use
To predict on the full 12GB dataset in the Spark cluster on AWS, we need a model that is both fast and accurate. We also need the ability to review feature importances or coefficients so that we can pare down the number of features. The data transformation required to generate the 83 features took several hours when run locally. Because the cost of the Spark cluster on AWS grows with the time we use it, I would like to perform as few data transformations as possible on the full dataset without sacrificing too much accuracy. For these reasons, we will look at 4 possible classifiers:  
- Logistic Regression  
- Random Forest Classifier  
- Gradient Boosted Trees  
- Support Vector Machines    
  
For each classifier, we will use the default parameters, train on the same train dataset, and test on the same test dataset. We will look at each model's f1 score as well as its speed.

In [24]:
def fit_model_predict(model, train_data, test_data, silent=False):
    '''
    model [pyspark classifier] = initialized classifier
    train_data [spark dataframe] = dataframe of labels and feature vectors
    test_data [spark dataframe] = dataframe of labels and feature vectors
    silent [boolean] = if False, print how long it took to train the model. If True, print nothing
    Function to train a classifier, predict with the model on the test set
    fitted_model [pyspark classifier model] = a classifier fitted to the train data
    predicted_results [pyspark dataframe] = the output of the model with predictions of the test set
    '''
    start = time()
    fitted_model = model.fit(train_data)
    predicted_results = fitted_model.transform(test_data)
    end = time()
    if not silent:
        print('Training the model took {} seconds.'.format((end-start)))
    return fitted_model, predicted_results

Whether a user has churned or stayed is an imbalanced class: there are far fewer users who churned. As such, we will evaluate our models with f1 scores. The f1 score is the harmonic mean of precision, which weighs true positives against false positives, and recall, which weighs true positives against false negatives. This makes the f1 score a more class-balanced evaluation measure than accuracy. However, given that Sparkify is considering providing incentives to users who are at risk of churn, there may be costs associated with false positives: mistaking content users for those who are at risk and giving them unnecessary incentives to stay. So while we want to focus on f1 scores, we don't want to overlook accuracy completely.  
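
To make the metric concrete, here is a minimal sketch of how precision, recall, and f1 are computed for a single class from raw counts. The counts are made up for illustration and are not results from this notebook. Note also that, as far as I know, `MulticlassClassificationEvaluator(metricName='f1')` reports an f1 averaged over both classes and weighted by class frequency, not the f1 of the churn class alone.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the churn class (illustration only)
tp, fp, fn = 6, 2, 4
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print('precision={:.3f} recall={:.3f} f1={:.3f}'.format(precision, recall, f1))
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high f1 by maximizing only precision or only recall.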

In [25]:
# Set evaluators
f1_evaluator = MulticlassClassificationEvaluator(metricName='f1')
accu_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')

def evaluate_predictions(predicted_results, silent=False):
    '''
    predicted_results [pyspark dataframe] = the output of the model with predictions of the test set
    Function to evaluate the predictions
    f1_score [float] = f1 score from evaluation of the model's predictions on the test set
    accu_score [float] = accuracy of the model's predictions on the test set
    '''
    accu_score = accu_evaluator.evaluate(predicted_results.select('label', 'prediction'))
    f1_score = f1_evaluator.evaluate(predicted_results.select('label', 'prediction'))
    if not silent:
        print('Accuracy: {0}\nF1 Score: {1}'.format(accu_score, f1_score))
    return f1_score, accu_score

In [26]:
def fit_predict_evaluate_model(model, train_data, test_data, silent=False):
    '''
    model [pyspark classifier] = initialized classifier
    train_data [spark dataframe] = dataframe of labels and feature vectors
    test_data [spark dataframe] = dataframe of labels and feature vectors
    silent [boolean] = if False, print how long it took to train the model and the evaluation scores. If True, print nothing
    Function to train a classifier, predict with the model on the test set, and evaluate the predictions
    fitted_model [pyspark classifier model] = a classifier fitted to the train data
    predicted_results [pyspark dataframe] = the output of the model with predictions of the test set
    '''
    fitted_model, predicted_results = fit_model_predict(model, train_data, test_data, silent)
    # the scores are printed by evaluate_predictions unless silent is True
    f1_score, accu_score = evaluate_predictions(predicted_results, silent)
    return fitted_model, predicted_results

### Logistic Regression

In [95]:
lr_model = LogisticRegression(featuresCol='features', labelCol='label')
lr_fitted_model, lr_predictions = fit_predict_evaluate_model(lr_model, train, test)

Training the model took 6.314727783203125 seconds.
Accuracy: 0.6585365853658537
F1 Score: 0.6354051927616049


In [39]:
lr_fitted_model

LogisticRegressionModel: uid=LogisticRegression_018d014e26c6, numClasses=2, numFeatures=83

### Random Forest Classifier

In [27]:
rfc_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42)
rfc_fitted_model, rfc_predictions = fit_predict_evaluate_model(rfc_model, train, test)

Training the model took 1.8004083633422852 seconds.
Accuracy: 0.6829268292682927
F1 Score: 0.574054436196536


In [46]:
rfc_fitted_model

RandomForestClassificationModel: uid=RandomForestClassifier_53de0dfc056c, numTrees=20, numClasses=2, numFeatures=83

### Gradient Boosted Trees

In [10]:
gbt_model = GBTClassifier(featuresCol='features', labelCol='label', seed=42)
gbt_fitted_model, gbt_predictions = fit_predict_evaluate_model(gbt_model, train, test)

Training the model took 19.85404109954834 seconds.
Accuracy: 0.5609756097560976
F1 Score: 0.5609756097560975


In [40]:
gbt_fitted_model

GBTClassificationModel: uid = GBTClassifier_136202859156, numTrees=20, numClasses=2, numFeatures=83

### Support Vector Machine

In [11]:
svm_model = LinearSVC(featuresCol='features', labelCol='label')
svm_fitted_model, svm_predictions = fit_predict_evaluate_model(svm_model, train, test)

Training the model took 26.991944074630737 seconds.
Accuracy: 0.6097560975609756
F1 Score: 0.5833202202989772


In [41]:
svm_fitted_model

LinearSVCModel: uid=LinearSVC_a3964e0705ae, numClasses=2, numFeatures=83

As expected, none of the classifiers did particularly well with default parameters. While logistic regression had a better f1 score (0.64 versus 0.57), the random forest classifier had the best accuracy. Significantly, the random forest classifier was by far the fastest classifier. Given that I anticipate performance to improve with any of the classifiers, and we are building a model that may need to run regularly on large data sets, I will proceed with the fastest classifier: random forest.  
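
The comparison above can be laid out as a small lookup. This is only a sketch: the numbers are copied (rounded) from the cell outputs above, and the dictionary and variable names are mine, not part of the notebook.

```python
# (training time in seconds, accuracy, f1), rounded from the outputs above
results = {
    'LogisticRegression': (6.31, 0.6585, 0.6354),
    'RandomForestClassifier': (1.80, 0.6829, 0.5741),
    'GBTClassifier': (19.85, 0.5610, 0.5610),
    'LinearSVC': (26.99, 0.6098, 0.5833),
}

fastest = min(results, key=lambda name: results[name][0])
best_accuracy = max(results, key=lambda name: results[name][1])
best_f1 = max(results, key=lambda name: results[name][2])

print('fastest:', fastest)
print('best accuracy:', best_accuracy)
print('best f1:', best_f1)

# how many times longer the best-f1 model took to train than the fastest one
slowdown = results[best_f1][0] / results[fastest][0]
print('{} took {:.1f}x as long to train as {}'.format(best_f1, slowdown, fastest))
```

With only 41 rows in the test set, these timings and scores are noisy, so I treat the ranking as a rough guide rather than a firm result.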

# First Validation: Validate the Best Default Classifier
Let's set a baseline evaluation to beat. We will see how the default random forest classifier performs on our validation data set. 

In [12]:
# rfc_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42)
rfc1_fitted_model, rfc1_predictions = fit_predict_evaluate_model(rfc_model, train, validation)

Training the model took 2.3147828578948975 seconds.
Accuracy: 0.7910447761194029
F1 Score: 0.7119402985074627


Surprisingly, our default random forest classifier performed pretty well on our validation set. 

# Which Parameters Should We Use
[Documentation for PySpark's Random Forest Classifier](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html) shows the following default parameters:
- maxBins=32
- maxDepth=5
- numTrees=20
- impurity='gini'
  
To find better parameters, we will explore `maxDepth`, `maxBins`, and `numTrees` values both lower than and higher than the defaults in the Parameter Grid. We will also consider `entropy` as an alternative `impurity`, the metric used to calculate information gain at each node.  
  
In addition to the parameter grid, we will also use cross validation. The random forest algorithm has 2 random components when training models: (1) each tree trains on a random sample of the data and (2) each split considers only a random subset of the features. A single experiment could therefore result in a lucky jump in the f1 score. Instead, we will use a cross validator to split the train dataset into 3 folds, train a model on 2 folds and test on the third, then rotate the folds and repeat twice more. Ultimately, for each combination of parameters, 3 models will be trained and evaluated, and we will get an average score. This gives us more confidence that the chosen combination of parameters actually improved the model.
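
To get a feel for the cost of this search, the sketch below enumerates the grid and the fold rotation in plain Python. The grid values mirror the parameter grid defined in the next cell; the `fold_assignments` helper is hypothetical and assigns rows round-robin, whereas Spark's `CrossValidator` assigns rows to folds randomly.

```python
from itertools import product

# Values mirror the ParamGridBuilder in the next cell
grid = {
    'maxBins': [10, 30, 50],
    'maxDepth': [2, 5, 10],
    'numTrees': [5, 25, 50],
    'impurity': ['entropy', 'gini'],
}
num_folds = 3

# every combination of parameter values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(combinations) * num_folds


def fold_assignments(n_rows, k):
    """Hypothetical helper: assign row indices to k folds round-robin."""
    return [list(range(i, n_rows, k)) for i in range(k)]


# rotate the held-out fold over a toy dataset of 9 rows
folds = fold_assignments(9, num_folds)
for held_out in range(num_folds):
    train_rows = [r for i, f in enumerate(folds) if i != held_out for r in f]
    print('evaluate on', folds[held_out], 'train on', sorted(train_rows))

print(len(combinations), 'combinations x', num_folds, 'folds =', total_fits, 'model fits')
```

Each of the 54 combinations is scored by the average f1 over its 3 folds, which is what `avgMetrics` exposes on the fitted cross validator.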

In [91]:
### Set up the Cross Validator
## Evaluator
# f1_evaluator = MulticlassClassificationEvaluator(metricName='f1')

## classifier
# rfc_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42)

## Parameter Grid
paramGrid = ParamGridBuilder() \
    .addGrid(rfc_model.maxBins, [10, 30, 50]) \
    .addGrid(rfc_model.maxDepth, [2, 5, 10]) \
    .addGrid(rfc_model.numTrees, [5, 25, 50]) \
    .addGrid(rfc_model.impurity,['entropy', 'gini']) \
    .build()

## Cross Validator
cv = CrossValidator(estimator=rfc_model,
                      evaluator=f1_evaluator, 
                      estimatorParamMaps=paramGrid,
                      numFolds=3)

In [92]:
### Train the Cross Validator
cv_model_fitted, cv_predictions = fit_predict_evaluate_model(cv, train, test, silent=True)

In [93]:
### The scores for each set of parameters
parameters_combo = [{p.name: v for p, v in m.items()} for m in cv_model_fitted.getEstimatorParamMaps()]
cv_scores = pd.DataFrame(parameters_combo)
cv_scores['f1_score'] = cv_model_fitted.avgMetrics
cv_scores = cv_scores.sort_values(by=['f1_score', 'maxDepth', 'numTrees', 'maxBins'], ascending=[False, True, True, True]).reset_index(drop=True)
cv_scores

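| | maxBins | maxDepth | numTrees | impurity | f1_score |
|---|---|---|---|---|---|
| 0 | 30 | 10 | 5 | gini | 0.761377 |
| 1 | 10 | 5 | 25 | gini | 0.717993 |
| 2 | 30 | 10 | 50 | entropy | 0.715141 |
| 3 | 10 | 2 | 5 | gini | 0.70557 |
| 4 | 30 | 10 | 25 | entropy | 0.700224 |
| 5 | 10 | 10 | 50 | entropy | 0.699738 |
| 6 | 10 | 2 | 5 | entropy | 0.699532 |
| 7 | 50 | 10 | 5 | gini | 0.697388 |
| 8 | 50 | 5 | 25 | gini | 0.696782 |
| 9 | 50 | 10 | 25 | entropy | 0.691625 |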


In [94]:
### best parameters
max_bins = cv_scores.iloc[0]['maxBins']
max_depth = cv_scores.iloc[0]['maxDepth']
num_trees = cv_scores.iloc[0]['numTrees']
impurity = cv_scores.iloc[0]['impurity']
print('The best parameters are {0} with {1} maxBins, {2} maxDepth, and {3} trees'.format(impurity, max_bins, max_depth, num_trees))

The best parameters are gini with 30 maxBins, 10 maxDepth, and 5 trees


The best parameters for our Random Forest Classifier are as follows:
- `maxBins`: 30 (the default was 32)  
`maxBins` is a parameter found in PySpark's Random Forest Classifier, but not in scikit-learn's counterpart. According to PySpark's documentation, it is the maximum number of bins used to discretize continuous features. Increasing this value allows the algorithm to make more fine-grained splits. I believe that setting this number too high could lead to overfitting on the training data, while lower values could lead to underfitting. So in the parameter grid, I considered a significantly lower number of bins than the default (10), a slightly lower number (30), and a higher number (50). The top performing models appear to have gravitated towards a lower number of bins. 
- `maxDepth`: 10 (the default was 5)  
This parameter sets the maximum depth of the individual trees. A larger number here could lead to overfitting. However, with multiple trees in the algorithm, this risk is lower.  
- `numTrees`: 5 (the default was 20)  
This parameter determines the number of trees. A lower number here could lead to overfitting. The combination of a higher `maxDepth` and a lower `numTrees` indicates that the default model was underfitting. 
- `impurity`: `gini` (the default was `gini`)  
This parameter determines the criterion the trees use to split data at a branch. In short, `gini` measures the probability of a randomly chosen data point being classified incorrectly at a branch/node. `entropy` measures the impurity of information at a branch (a node whose labels all belong to one class is considered pure). All other parameters being equal, switching this parameter to `entropy` would have dropped the f1 score by about 10%. 
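
As a rough illustration of the two impurity measures, here is a minimal sketch using the textbook formulas for Gini impurity and entropy. The node compositions are hypothetical, and this is not Spark's internal implementation.

```python
from math import log2


def gini(counts):
    """Gini impurity of a node given the per-class row counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given the per-class row counts."""
    total = sum(counts)
    # empty classes contribute nothing, so skip them to avoid log2(0)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


# Hypothetical node compositions: [stayed, churned]
for counts in ([10, 0], [5, 5], [8, 2]):
    print(counts, 'gini={:.3f}'.format(gini(counts)), 'entropy={:.3f}'.format(entropy(counts)))
```

Both measures are 0 for a pure node and peak for an even split, so they usually rank candidate splits similarly; differences like the one seen in the grid search come from the cases where they disagree.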

# Second Validation: Validate Classifier with the Best Parameters

In [96]:
# How does the best set of parameters do without cross validation
rfc2_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42, \
                                    impurity = impurity, maxDepth=max_depth, numTrees=num_trees, maxBins=max_bins)
rfc2_fitted_model, rfc2_predictions = fit_predict_evaluate_model(rfc2_model, train, validation)

Training the model took 2.5150463581085205 seconds.
Accuracy: 0.7910447761194029
F1 Score: 0.7652003142183817


In our first validation: 
- Training the model took 2.3147828578948975 seconds.
- Accuracy: 0.7910447761194029
- F1 Score: 0.7119402985074627  
  
With updated parameters but the same train and validation datasets, our F1 score improved by about 5 percentage points. Coincidentally, our accuracy is exactly the same. This suggests that our second model is better at predicting churned users, but that the gain is offset in accuracy by misidentifying users who would have stayed as at risk of churn. In addition, the time to train the second model increased by only about 0.2 seconds. We could stop here. However, for the sake of cost, time, and computational resources, we should consider feature selection.

# Which Features Should We Use?
When using the Spark cluster on AWS, the longer we need to run the code, the more it will cost us. The data transformation required to create the features was the most time-consuming portion of this project. It took a few hours when run locally. So Sparkify may want to limit the number of features. As such, we will look at the most important features and rerun our best classifier with the best parameters on fewer features. Ideally, we will find a smaller set of features to run in our model without sacrificing too much of the f1 score. 

### Rank features

In [68]:
feature_coefficients = rfc2_fitted_model.featureImportances
feature_names = data.columns[2:]
feature_importance = pd.DataFrame(list(zip(feature_names, feature_coefficients)),\
                                  columns=['Feature', 'Importance'])\
                                .sort_values('Importance', ascending=False).reset_index(drop=True)
feature_importance[feature_importance['Importance']>0]

| | Feature | Importance |
|---|---|---|
| 0 | page_count_Thumbs_Down | 0.075904 |
| 1 | page_count_Logout | 0.063622 |
| 2 | status_count_200 | 0.050785 |
| 3 | method_count_PUT | 0.045865 |
| 4 | avg_song_length | 0.041142 |
| 5 | page_count_NextSong | 0.040795 |
| 6 | total_session_length | 0.038328 |
| 7 | level_count_free_days | 0.036859 |
| 8 | avg_session_length | 0.036561 |
| 9 | page_count_Settings | 0.034629 |


Only 41 out of the 83 features had a non-zero importance in the second model, meaning they were actually used by its trees. Just looking at the top few, high values in `Thumbs Down` and `Logout` would strongly indicate dissatisfaction with the service. Conversely, `200` (the HTTP status code for a successful request), `PUT` (the HTTP method used to store data), `avg_song_length`, and `NextSong` indicate engagement with the service. If those values were high, they would signal a user's satisfaction with the service, whereas low numbers would signal a user's dissatisfaction.
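
One way to read the ranking is to accumulate the importances from the top down. The sketch below does that for the ten rows shown above; it assumes, as I understand Spark's behaviour, that `featureImportances` is normalized to sum to 1 across all features, so the running total reads as a share of the total importance.

```python
# Top 10 importances copied from the table above
top_importances = [
    ('page_count_Thumbs_Down', 0.075904),
    ('page_count_Logout', 0.063622),
    ('status_count_200', 0.050785),
    ('method_count_PUT', 0.045865),
    ('avg_song_length', 0.041142),
    ('page_count_NextSong', 0.040795),
    ('total_session_length', 0.038328),
    ('level_count_free_days', 0.036859),
    ('avg_session_length', 0.036561),
    ('page_count_Settings', 0.034629),
]

running = 0.0
cumulative = []
for name, importance in top_importances:
    running += importance
    cumulative.append((name, round(running, 6)))

for rank, (name, share) in enumerate(cumulative, start=1):
    print('top {:>2}: {:<24} cumulative importance {:.3f}'.format(rank, name, share))
```

The importance is spread fairly thinly: no single feature dominates, which is one reason to test several feature-set sizes rather than pick a cutoff by eye.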

### Set up pipeline

In [73]:
def score_cv_features(num_features):   
    '''
    num_features [int] = number of most important features
    Function to trim the data set down to a select number of the most important features and split it, 
    run the new sets through a pipeline to process, train a model, predict on the test set, and evaluate the predictions
    features_list [list of strings] = list of names of feature columns
    f1_score [float] = f1 score from evaluation of the model's predictions on the test set
    accu_score [float] = accuracy of the model's predictions on the test set
    '''
    # Set up data with features
    features_list= list(feature_importance['Feature'])[:num_features]
    features_data = data.select( ['userId', 'label'] + features_list)
    train, rest = features_data.randomSplit([0.6, 0.4], seed=42)
    test, validation = rest.randomSplit([0.5, 0.5], seed=42)
    
    # Set up pipelines
    assembler = VectorAssembler(inputCols=features_list, outputCol='vector')
    scaler = StandardScaler(inputCol='vector', outputCol='features', withMean=True, withStd=True)
#     rfc2_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42, \
#                                     impurity = impurity, maxDepth=max_depth, numTrees=num_trees, maxBins=max_bins)

    paramGrid = ParamGridBuilder().build()
    
    cv = CrossValidator(estimator=rfc2_model,
                      evaluator=f1_evaluator, 
                      estimatorParamMaps=paramGrid,
                      numFolds=3)
    
    pipeline = Pipeline(stages=[assembler, scaler, cv])
    
    pipeline_fitted, pipeline_predictions = fit_model_predict(pipeline, train, test, silent=True)
    f1_score, accu_score = evaluate_predictions(pipeline_predictions, silent=True)
    return features_list, f1_score, accu_score

Experimenting with various feature sets means that, for each set of features, we need to split, vectorize, and scale our data sets again before we train our models. To streamline this, we will use PySpark's pipeline for preprocessing and cross validation for a more reliable evaluation of the models. 

In [111]:
# Score the sets of top features
features_scores = pd.DataFrame(columns=['num_top_features', 'feature_list', 'accuracy', 'f1_score'])

for index, num_features in enumerate(range(1, len(feature_importance[feature_importance['Importance']>0]))):
    features_list, f1_score, accu_score = score_cv_features(num_features)
    features_scores.loc[index, 'num_top_features'] = num_features
    features_scores.loc[index, 'feature_list'] = str(features_list)
    features_scores.loc[index, 'accuracy'] = accu_score
    features_scores.loc[index, 'f1_score'] = f1_score
features_scores = features_scores.sort_values(by=['f1_score', 'num_top_features'], ascending=[False, True]).reset_index(drop=True)
features_scores

| | num_top_features | feature_list | accuracy | f1_score |
|---|---|---|---|---|
| 0 | 4 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.756098 | 0.72688 |
| 1 | 37 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.731707 | 0.707052 |
| 2 | 38 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.731707 | 0.690917 |
| 3 | 16 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.707317 | 0.672256 |
| 4 | 30 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.707317 | 0.672256 |
| 5 | 17 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.682927 | 0.653789 |
| 6 | 14 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.707317 | 0.651885 |
| 7 | 15 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.707317 | 0.651885 |
| 8 | 19 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.707317 | 0.651885 |
| 9 | 22 | ['page_count_Thumbs_Down', 'page_count_Logout'... | 0.731707 | 0.639585 |


Cross validation on the various sets of features shows that only the top 4 features are really necessary to predict user churn. 

In [75]:
# Best set of Top Features
top_features_list = list(feature_importance['Feature'])[:int(features_scores.iloc[0]['num_top_features'])]
print('The most important features are: ', top_features_list)

The most important features are:  ['page_count_Thumbs_Down', 'page_count_Logout', 'status_count_200', 'method_count_PUT']


# Third Validation: Validate Classifier with Best Parameters on only the Most Important Features
We now evaluate the best classifier, with the best parameters and only the most important features, on the validation data.

In [105]:
data2 = data.select(['userId', 'label'] + top_features_list)
train2, test2, validation2 = split_process_data(data2, top_features_list)
rfc3_model = RandomForestClassifier(featuresCol='features', labelCol='label', seed=42, \
                                    impurity=impurity, maxDepth=max_depth, numTrees=num_trees, maxBins=max_bins)
rfc3_fitted_model, rfc3_predictions = fit_predict_evaluate_model(rfc3_model, train2, validation2)

train data:  (117, 6)
test data:  (41, 6)
validation data:  (67, 6)
Training the model took 2.378504514694214 seconds.
Accuracy: 0.7164179104477612
F1 Score: 0.7017556167458828


In our first validation: 
- Training the model took 2.3147828578948975 seconds.
- Accuracy: 0.7910447761194029
- F1 Score: 0.7119402985074627  
  
In our second validation:
- Training the model took 2.5150463581085205 seconds.
- Accuracy: 0.7910447761194029
- F1 Score: 0.7652003142183817

Unsurprisingly, our 3rd model didn't perform as well. The F1 score dropped by about 6 percentage points from the second model and about 1 point from the default model. Accuracy dropped by about 7.5 percentage points from both models. The time to train the third model was about the same as the time it took to train the default model. Despite this dip in performance, the third model is worth considering. It only requires 4 features, which would probably take less than an hour of data processing when run locally. 
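
For reference, the differences between the three validations can be worked out directly from the printed scores. This is just arithmetic on the outputs above; the dictionary keys are my own labels for the three models.

```python
# Scores copied from the three validation outputs above
validations = {
    'default, all features': {'accuracy': 0.7910447761194029, 'f1': 0.7119402985074627},
    'tuned, all features': {'accuracy': 0.7910447761194029, 'f1': 0.7652003142183817},
    'tuned, top 4 features': {'accuracy': 0.7164179104477612, 'f1': 0.7017556167458828},
}

baseline = validations['default, all features']

# differences from the baseline, in percentage points
deltas = {}
for name, scores in validations.items():
    deltas[name] = {
        metric: round((scores[metric] - baseline[metric]) * 100, 1)
        for metric in ('accuracy', 'f1')
    }
    print('{:<22} accuracy {:+.1f} pts, f1 {:+.1f} pts vs. default'.format(
        name, deltas[name]['accuracy'], deltas[name]['f1']))
```

Keep in mind that the validation set has only 67 users, so a single user changes accuracy by about 1.5 percentage points.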

# Conclusion
## Summary
  
Sparkify asked us to predict which users are at risk of churn (cancelling the service) so that they can incentivize those users to stay. The raw data was event-level log data with timestamps that had to be converted and `location` and `useragent` data that had to be simplified. The logs had to be aggregated into user-level data. Interestingly, data exploration revealed that users who were on the paid tier were more likely to stay. 
  
Because this was a preliminary experiment, I was comprehensive in my feature generation, which took a few hours to create 83 features. It became apparent that such comprehensive feature generation could be exorbitantly time-consuming and costly to perform on the full data set; our analysis was performed on a mere 128MB subset of a 12GB data set. So I mainly considered classifier models that would allow me to rank feature importance. I evaluated model performance with the f1 score because we were predicting an imbalanced class: only 23% of users churned. The default versions of the classifiers performed similarly, but I chose to continue with the Random Forest Classifier because it was significantly faster and we are building a model that may need to run regularly on a large dataset.

## Improvement
Through experimentation, I produced 3 models to run on our validation data set:  
- Random Forest Classifier with default parameters trained on 83 features. F1 Score: 71.2%  
This model became our baseline for performance. 
- Random Forest Classifier with a better set of parameters trained on 83 features. F1 Score: 76.5%  
The `impurity` and `maxBins` parameters mostly remained the same as the default values. But the model performed much better with a higher `maxDepth` and a much lower `numTrees` than the defaults, which together indicated that the default model was underfit and that a better model just needed to fit the training data more closely.
- Random Forest Classifier with the same better set of parameters trained on 4 features. F1 Score: 70.2%  
Because data processing took so long, I continued my experiments to include feature selection. Cross validation on various sets of features indicated that only 4 features were really necessary to predict user churn: `Thumbs Down`, `Logout`, `200` HTTP status, and `PUT` HTTP method requests. Although training a model on just 4 features would sacrifice about 6 percentage points of F1, it would speed up data processing significantly. 
  
Ultimately, we improved our model's F1 score by about 5 percentage points and we found the most crucial features. I expect this model would improve even further when it is trained on a much larger set of the data. 

## Reflection
Although Sparkify asked us to predict users who were at risk of churning, its ultimate goal is to encourage users to stay. Exploratory data analysis indicated that users who churned weren't encountering `Error` pages or `404` status issues. Feature importance showed us that the top feature influencing the models was `Thumbs Down`, meaning that a high occurrence of this event in a user's account indicates dissatisfaction with the songs. While Sparkify said that they were considering providing discounts and incentives for users to stay, I would recommend that Sparkify improve its song recommendation engine. Perhaps the better approach to convincing users to stay is to keep them from being served songs they would `Thumbs Down`.