<a href="https://colab.research.google.com/github/mariokart345/DS-Unit-2-Applied-Modeling/blob/master/module3-permutation-boosting/LS_DS_233_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [28]:
!pip install category_encoders==2.*
!pip install pandas_profiling==2.*
data = 'https://raw.githubusercontent.com/mariokart345/DS-Unit-2-Applied-Modeling/master/data/Video_Games_Sales_as_at_22_Dec_2016.csv'
data2 = '~/Desktop/video_games.csv'



In [5]:
import pandas as pd
df = pd.read_csv(data)

In [9]:
import numpy as np
def wrangle(df):
    #Engineering features
    df['Above_Average_Critic_Score'] = df['Critic_Score']>70
    df['User_Score'] = df['User_Score'].replace('tbd',np.nan)
    df['User_Score'] = df['User_Score'].astype(float)
    df['Above_Average_User_Score'] = df['User_Score']>7
    #Dropping high-cardinality columns
    df = df.drop(labels=['Name','Developer'],axis=1)
    #Dropping high NaN columns
    df = df.drop(labels=['Rating','User_Count','User_Score','Critic_Count','Critic_Score'],axis=1)
    #Using log1p to create a less skewed target distribution
    df['Log_Global_Sales'] = np.log1p(df['Global_Sales'])
    #Dropping Sales columns to prevent leakage
    df = df.drop(labels=['NA_Sales','EU_Sales','JP_Sales','Other_Sales','Global_Sales'],axis=1)
    #Converting 'Year_of_Release' to pandas datetime
    df['Year_of_Release'] = pd.to_datetime(df['Year_of_Release'],format='%Y')
    df['Year_of_Release'] = df['Year_of_Release'].dt.year
    #Removing rows below the 0.5th and above the 99.5th percentile of the target
    low, high = np.percentile(df['Log_Global_Sales'], [0.5, 99.5])
    df = df[df['Log_Global_Sales'].between(low, high)]
    return df
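
Because `wrangle` replaces `Global_Sales` with `Log_Global_Sales = log1p(Global_Sales)`, anything predicted from this target is on the log scale. The sketch below, which uses made-up sales figures rather than rows from the dataset, shows the round trip and how `np.expm1` would map log-scale values back to the original units.

```python
import numpy as np

# Hypothetical sales figures in millions of units (illustrative, not from the dataset)
global_sales = np.array([0.01, 0.5, 2.0, 82.53])

# log1p(x) = log(1 + x): defined at x = 0 and compresses the long right tail
log_sales = np.log1p(global_sales)

# expm1 is the inverse of log1p, so it converts log-scale values back to sales
recovered = np.expm1(log_sales)

print(log_sales.round(3))
print(np.allclose(recovered, global_sales))
```

The same `np.expm1` call could be applied to a model's predictions before reporting errors in sales units.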

In [10]:
game_sales = wrangle(df)

In [11]:
from sklearn.model_selection import train_test_split
train,val = train_test_split(game_sales,train_size=0.8,test_size=0.2,random_state=25)
#Splitting into features and target ('Global_Sales' was already dropped in wrangle to prevent leakage)
y_train = train['Log_Global_Sales']
X_train = train.drop('Log_Global_Sales',axis=1)
y_val = val['Log_Global_Sales']
X_val = val.drop('Log_Global_Sales',axis=1)
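
The assignment asks whether the model beats a baseline. One common choice for regression is to predict the training mean for every row, which scikit-learn's `DummyRegressor` does. The sketch below runs on synthetic data, not the game sales data, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (features, log sales); not the real dataset
rng = np.random.default_rng(25)
X_demo = rng.normal(size=(500, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=1.0, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=25
)

# Mean baseline: ignores the features and always predicts mean(y_tr)
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_train_r2 = r2_score(y_tr, baseline.predict(X_tr))
baseline_val_r2 = r2_score(y_va, baseline.predict(X_va))
baseline_val_mae = mean_absolute_error(y_va, baseline.predict(X_va))

print(f'Baseline training R^2: {baseline_train_r2:.4f}')
print(f'Baseline validation R^2: {baseline_val_r2:.4f}')
print(f'Baseline validation MAE: {baseline_val_mae:.4f}')
```

A mean baseline scores an R^2 of 0 on the training set by construction and at most 0 on a validation set, so any positive validation R^2 from a fitted model beats it on that metric.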

In [None]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(ce.OrdinalEncoder(),SimpleImputer(strategy='mean'),LinearRegression())
pipeline.fit(X_train,y_train)

In [24]:
from sklearn.metrics import r2_score
y_train_pred = pipeline.predict(X_train)
y_val_pred = pipeline.predict(X_val)
print(f'Training R^2:{r2_score(y_train,y_train_pred)}\nValidation R^2:{r2_score(y_val,y_val_pred)}')

Training R^2:0.16184962180487672
Validation R^2:0.14114293188165206
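
The remaining assignment tasks are to try gradient boosting and to get permutation importances. A rough sketch on synthetic data follows. It swaps in scikit-learn's `GradientBoostingRegressor` for xgboost so that the example needs no extra dependency (xgboost's `XGBRegressor` exposes the same fit/predict interface and could be substituted), and it uses `sklearn.inspection.permutation_importance`. The data and the feature names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 'strong' drives the target, 'weak' helps a little, 'noise' is irrelevant
rng = np.random.default_rng(25)
feature_names = ['strong', 'weak', 'noise']
X_demo = rng.normal(size=(600, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=600)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=25
)

# Gradient boosting: shallow trees fitted in sequence, each correcting earlier errors
model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=25)
model.fit(X_tr, y_tr)
val_r2 = model.score(X_va, y_va)

# Permutation importance: shuffle one validation column at a time and
# record how much the validation R^2 drops (averaged over n_repeats shuffles)
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=25)
importances = dict(zip(feature_names, result.importances_mean))

print(f'Validation R^2: {val_r2:.3f}')
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {value:.3f}')
```

Computing the importances on the validation set rather than the training set measures how much each feature matters for generalization, which is the approach the permutation importance readings recommend.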


Exploring a different video game dataset to see whether it yields better results, while still using my first video game dataset for the model above.

In [13]:
import pandas as pd
df2 = pd.read_csv('video_games.csv')

In [17]:
df2.head()

Unnamed: 0,Title,Features_Handheld?,Features_Max Players,Features_Multiplatform?,Features_Online?,Metadata_Genres,Metadata_Licensed?,Metadata_Publishers,Metadata_Sequel?,Metrics_Review Score,Metrics_Sales,Metrics_Used Price,Release_Console,Release_Rating,Release_Re-release?,Release_Year,Length_All PlayStyles_Average,Length_All PlayStyles_Leisure,Length_All PlayStyles_Median,Length_All PlayStyles_Polled,Length_All PlayStyles_Rushed,Length_Completionists_Average,Length_Completionists_Leisure,Length_Completionists_Median,Length_Completionists_Polled,Length_Completionists_Rushed,Length_Main + Extras_Average,Length_Main + Extras_Leisure,Length_Main + Extras_Median,Length_Main + Extras_Polled,Length_Main + Extras_Rushed,Length_Main Story_Average,Length_Main Story_Leisure,Length_Main Story_Median,Length_Main Story_Polled,Length_Main Story_Rushed
0,Super Mario 64 DS,True,1,True,True,Action,True,Nintendo,True,85,4.69,24.95,Nintendo DS,E,True,2004,22.716667,31.9,24.483333,57,14.3,29.766667,35.033333,30.0,20,22.016667,24.916667,29.966667,25.0,16,18.333333,14.333333,18.316667,14.5,21,9.7
1,Lumines: Puzzle Fusion,True,1,True,True,Strategy,True,Ubisoft,True,89,0.56,14.95,Sony PSP,E,True,2004,10.1,11.016667,10.0,5,9.516667,0.0,0.0,0.0,0,0.0,9.75,9.866667,9.75,2,9.616667,10.333333,11.083333,10.0,3,9.583333
2,WarioWare Touched!,True,2,True,True,"Action,Racing / Driving,Sports",True,Nintendo,True,81,0.54,22.95,Nintendo DS,E,True,2004,4.566667,11.566667,2.5,57,2.266667,10.0,14.1,7.25,16,6.8,3.85,5.666667,3.333333,11,2.783333,1.916667,2.933333,1.833333,30,1.433333
3,Hot Shots Golf: Open Tee,True,1,True,True,Sports,True,Sony,True,81,0.49,12.95,Sony PSP,E,True,2004,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
4,Spider-Man 2,True,1,True,True,Action,True,Activision,True,61,0.45,14.95,Nintendo DS,E,True,2004,13.25,48.383333,10.0,37,7.066667,72.566667,78.866667,72.566667,2,66.283333,12.766667,17.316667,12.5,12,10.483333,8.35,11.083333,8.0,23,5.333333


In [15]:
df2.shape

(1212, 36)

In [19]:
#Treating '.' as a literal dot rather than the regex wildcard
df2.columns = df2.columns.str.replace('.','_',regex=False)

In [27]:
from pandas_profiling import ProfileReport
#to_notebook_iframe() displays the report and returns None, so keep the report object separately
profile = ProfileReport(df2)
profile.to_notebook_iframe()
