# Grade: /100 pts

# Assignment 05: Model Selection & Cross Validation

In this assignment you will be using a #TidyTuesday dataset on Spotify songs to build a classification model for predicting Spotify song popularity.

The dataset has already been preprocessed, and is ready to be used! 

The dataset contains 30947 Spotify songs. Your job: build one or more models, perform model selection using cross-validation techniques, and evaluate your final selected model.

### The Dataset

The data is stored in a csv file called `spotify_pre.csv`.  The data includes some information about playlist genre, playlist subgenre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration. The target variable is `track_popularity`, which has two categories `high` and `low`.


### Follow These Steps before submitting
Once you are finished, be sure to complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline. 

5.  Your submission document should be saved in the form: `LastName_FirstName_Assignment5.ipynb`

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_val_score
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import make_scorer, mean_squared_error,roc_curve, auc, roc_auc_score
from sklearn.base import BaseEstimator, TransformerMixin

pd.set_option('display.max_columns', 500)
plt.style.use('ggplot')
%matplotlib inline

_____

### Question 1: /15pts

First, import the dataset `spotify_pre.csv` as a dataframe. Print a few rows and inspect the dataframe's metadata to get a rough understanding of the type of each column.

In [None]:
# Import the dataset etc.
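As a rough illustration of the kind of inspection meant here, the sketch below reads a tiny in-memory CSV and prints its first rows and column metadata. The rows are invented placeholders, not values from `spotify_pre.csv`, and the name `toy` is hypothetical.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file (illustration only).
toy_csv = io.StringIO(
    "playlist_genre,danceability,track_popularity\n"
    "pop,0.75,high\n"
    "rock,0.52,low\n"
    "rap,0.81,low\n"
)
toy = pd.read_csv(toy_csv)

print(toy.head(2))  # first two rows
toy.info()          # column dtypes, non-null counts, memory usage
```

With the real data, the same two calls would be applied to the dataframe read from the csv file.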

Finally, create a bar chart of playlist_subgenre in which each bar shows the proportion of track_popularity categories (use ggplot with the 'fill' option set to 'track_popularity'). Make sure to flip the Cartesian coordinates so that the horizontal axis becomes vertical and the vertical axis becomes horizontal.

**Make sure to check out some online resources for plotting with ggplot in Python**

In [None]:
# Code to construct the barchart

Which category of playlist_subgenre is more likely to gain popularity?

**ANSWER HERE**


_____________

### Question 2: /10pts


Now, you will create boxplots with x axis set to `playlist_genre` and y axis set to `instrumentalness` and color option set to `track_popularity`. Make sure to change the y axis scale into log10 scale for a better representation.

In [None]:
## Your code

Taking track_popularity into account, does the instrumentalness score differ within some of the playlist genres?

**ANSWER HERE**

___________

### Question 3: /10

Create a basic logistic regression model (with default penalization) named `model1`. You need to create a model pipeline to be fit later. (Use `solver='lbfgs'`, `max_iter=10000` and `random_state=0`)

The predictor variables are `mode` and `loudness`. Use the following chunk of code. You will use `Data1` to build your model.

In [None]:
# Your code
# Hints:
#Data1 = spotify_pre[['track_popularity', 'mode', 'loudness']] 
#Data1 = pd.get_dummies(Data1, drop_first=True) 
#Data1 = Data1.rename({'track_popularity_low': 'track_popularity'}, axis='columns') 
#Data1.head()
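To make the hints above concrete, here is a minimal sketch on an invented four-row frame. It assumes `mode` is stored as a string column and that the pipeline holds a single logistic-regression step; both are assumptions, and the names `toy`, `encoded`, and `sketch_model` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy frame with invented values (not the Spotify data).
toy = pd.DataFrame({
    "track_popularity": ["high", "low", "low", "high"],
    "mode": ["major", "minor", "major", "minor"],
    "loudness": [-5.1, -7.3, -6.0, -4.2],
})

# drop_first=True keeps one indicator per two-level column, so the target
# becomes 'track_popularity_low' (true for 'low', false for 'high').
encoded = pd.get_dummies(toy, drop_first=True)
encoded = encoded.rename({"track_popularity_low": "track_popularity"}, axis="columns")
print(encoded.columns.tolist())

# A pipeline with one step: logistic regression with its default L2 penalty.
sketch_model = Pipeline([
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=10000, random_state=0)),
])
```

Note that after the rename, a value of 1 (or `True`) in `track_popularity` stands for the `low` category.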

Now that you have created the pipeline, fit `model1` to predict the target variable, track_popularity, from the two predictors. Use a 70/30 train-test split of the data, and remember to set `random_state=0` in the function `train_test_split`. After that, evaluate this model by plotting the ROC curve and reporting the AUC value.

In [None]:
# np.random.seed(0);np.random.rand(5)
# Create the training and test data


#Fit the model

In [None]:
# Create the ROC curve and report AUC
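The sketch below shows the general mechanics of a 70/30 split, a fit, and an ROC/AUC computation. It runs on synthetic data from `make_classification`, not the Spotify data, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-feature problem used purely for illustration.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=10000, random_state=0).fit(X_tr, y_tr)

# ROC curves are built from scores, so use the predicted probability
# of class 1 rather than the hard labels from predict().
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)  # area under the (fpr, tpr) curve
print(round(roc_auc, 3), round(roc_auc_score(y_te, scores), 3))
```

Plotting `fpr` on the x axis against `tpr` on the y axis (for example with `plt.plot`) gives the ROC curve; the diagonal from (0, 0) to (1, 1) corresponds to random guessing.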

At this point, would you use the baseline model as your final model? Why or why not?

**ANSWER HERE**

____________

### Question 4: /40

Here, we want to determine the best single numeric feature model for predicting track_popularity. Specifically, you will create one model for each predictor: 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_m'. Use cross-validation to make your decision. Make sure to use `pd.get_dummies(..., drop_first=True)` to encode the target variable, and rename the column 'track_popularity_low' to 'track_popularity', as in the previous question.

Recall that *cross-validation* is used to estimate the expected test error of a model. The hint code in the next cell displays the 5 folds of a 5-fold cross-validation, produced with the `StratifiedKFold()` function.

#### 4.1: /5
**Define a function `AUC_calculation` with inputs `(X, y, index_train, index_test)`  which calculates the AUC of the `model1` trained on `index_train` and tested on `index_test`.**

In [None]:
# Your code
# Hints:
#str_kf = StratifiedKFold(n_splits=5)
#for j, (index_train, index_test) in enumerate(str_kf.split(X, y)):
#    ## enumerate yields the fold number j and the (train, test) indices of that fold
#    ## print(j, (index_train, index_test))
#    plt.plot(index_train, [j+1 for s in index_train], '.')
#plt.title('Cross Validation')
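To see what stratification does, the following sketch splits a synthetic, imbalanced label vector (80 zeros, 20 ones; not the Spotify target) and records the positive rate in each test fold. The names `X_toy`, `y_toy`, and `fold_rates` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 80 zeros followed by 20 ones.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

str_kf = StratifiedKFold(n_splits=5)
fold_rates = []
for j, (index_train, index_test) in enumerate(str_kf.split(X_toy, y_toy)):
    # Each test fold keeps the overall class proportions.
    fold_rates.append(y_toy[index_test].mean())
    print(j, len(index_train), len(index_test), fold_rates[-1])
```

Because every fold contains both classes in about the same ratio, the AUC is defined on each fold, which a plain unshuffled split of sorted labels would not guarantee.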

#### 4.2: /8
**Using `AUC_calculation`, create a function named `AUC_cross_validation` that takes `(X, y, n_fold)` as input. `AUC_cross_validation` performs an `n_fold`-fold cross-validation (using `StratifiedKFold`) and returns a list with the AUC for each fold.**

In [None]:
# Your code
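One way to sanity-check a hand-written cross-validation function is to compare it against scikit-learn's built-in `cross_val_score` with `scoring="roc_auc"`. The sketch below runs on synthetic data and is not a replacement for the function asked for above; the per-fold values should agree with a manual loop only if the same splitter and the same model settings are used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression(solver="lbfgs", max_iter=10000, random_state=0)

# Passing the splitter explicitly fixes which rows land in which fold.
cv = StratifiedKFold(n_splits=5)
cv_auc = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="roc_auc")
print(cv_auc, cv_auc.mean())
```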

#### 4.3: /5
**Now you are ready to estimate and compare, through cross-validation, the performance of all the *simple models* that use only one numeric predictor as input. Prepare your dataset here!**

In [None]:
# Your code

#### 4.4: /8

**Use your function `AUC_cross_validation` to compute cross-validation estimates of the AUC for each single numeric feature model. Use a data frame (named `AUC_models`) to report the AUC value for each fold and each model. (Use `n_fold=10`.)**

**The column names of `AUC_models` must have the form `simple-<numeric predictor variable>`, e.g., `simple-tempo`.**

In [None]:
# Construct AUC_models dataframe

In [None]:
# Print AUC_models dataframe 

#### 4.5: /7

**Decide which of the studied models has the best and the worst performance, using a boxplot (without presenting outliers) that shows the distribution of the previous AUC scores for every model.** (Do not forget labels!)

In [None]:
# Code to plot the boxplots organized as required

**ANSWER HERE:**

#### 4.6: /7
**Now, let's compare these models with the one that includes all the numeric variables.**

**Again use 10-fold cross-validation to determine whether this new model performs better, then plot the boxplots again with this new model's scores included.**

In [None]:
# Your code

# Print the new data frame 

In [None]:
# Plot the boxplots

_____________

### Question 5: /10

Finally, you are going to include all the numeric predictors as well as the categorical variable `mode` in the model. Make sure to encode the categorical variable. Use 10-fold cross-validation to evaluate the performance of this model. Print the mean AUC of every model (including the previous models) in ascending order.

In [None]:
# Your code

# Print the new data frame 

In [None]:
# Print the AUC mean for each of the models in ascending order

Which of the above models is the best model? Why? 

**ANSWER HERE:**

______________

### Question 6: /15pts

Now it is time to train your chosen model on all the training data. Estimate the performance of this model on the test data and do the following (use a 70/30 train-test split of the data, and remember to set `random_state=0` in the function `train_test_split`):

- Use the bootstrap technique to find the 95% CI for the AUC.
- Plot the distribution of the bootstrap AUC scores.
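The two steps above can be sketched with one common variant, the percentile bootstrap over the test set: resample the test rows with replacement, recompute the AUC each time, and take the 2.5th and 97.5th percentiles. The labels and scores below are synthetic stand-ins (in the assignment they would come from the fitted model and the test split), and the number of resamples, 1000, is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for test labels and predicted probabilities.
y_true = rng.integers(0, 2, size=400)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=400), 0, 1)

n_boot = 1000  # arbitrary number of bootstrap resamples
boot_auc = []
for _ in range(n_boot):
    # Draw row indices with replacement, same size as the test set.
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:
        continue  # AUC is undefined when a resample has only one class
    boot_auc.append(roc_auc_score(y_true[idx], y_score[idx]))

# Percentile interval: middle 95% of the bootstrap AUC values.
ci_low, ci_high = np.percentile(boot_auc, [2.5, 97.5])
print(round(ci_low, 3), round(ci_high, 3))
```

A histogram of `boot_auc` (for example with `plt.hist` or `sns.histplot`) then shows the distribution of the bootstrap AUC scores.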

In [None]:
# Your code

In [None]:
# Your code

In [None]:
# Your code

Is the test AUC close to the cross-validated AUC of the model you chose? Why do you think this is the case?

**ANSWER HERE:**