# Maintainable Code in Data Science

This notebook should be used to test your implementation of the different functions provided in the exercise. If you're stuck, check the solution.

Since you will be working in an IDE and importing your functions and classes into this notebook, we can use the `autoreload` magic command, which automatically reloads your code when you modify it, so you do not have to restart the notebook every time.

In [1]:
%load_ext autoreload
%autoreload 2

First, the `load_dataset` function in your `model.py` should load X and y (either train or test). The command below loads the training set. Check the `data` folder and the `load_dataset` function for more details.

Note that we have fixed `dtype` in `read_csv` so that, whatever data we load, pandas always tries to load the columns with the same types.
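
As a minimal sketch of why fixing `dtype` matters, the snippet below reads the same small CSV twice, once with inferred types and once with declared types. The `DTYPES` mapping and the sample rows are assumptions made for illustration; the real mapping lives in the exercise code.

```python
import io

import pandas as pd

# Hypothetical column types, for illustration only.
DTYPES = {"goal": "float64", "disable_communication": "bool"}

csv_text = (
    "id,goal,disable_communication\n"
    "1,6000,False\n"
    "2,1500,True\n"
)

# Without dtype, pandas infers each type from the values it happens to see.
inferred = pd.read_csv(io.StringIO(csv_text), index_col="id")

# With dtype, the listed columns are always parsed with the declared types.
fixed = pd.read_csv(io.StringIO(csv_text), index_col="id", dtype=DTYPES)

print(inferred.dtypes["goal"])  # inferred from whole numbers
print(fixed.dtypes["goal"])     # forced by DTYPES
```

Here the `goal` column is inferred as an integer in the first read because the sample happens to contain only whole numbers, while the second read gives the same type regardless of the values in the file.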

In [2]:
from model import load_dataset

X_train, y_train = load_dataset("X_train.zip", "y_train.zip")

In [5]:
y_train.head()

Unnamed: 0,state
0,1
1,1
2,1
3,1
4,0


In [6]:
X_train.shape, y_train.shape

((10000, 24), (10000, 1))

In [3]:
X_train.head()

id,photo,name,blurb,goal,slug,disable_communication,country,currency,currency_symbol,currency_trailing_code,...,creator,location,category,profile,urls,source_url,friends,is_starred,is_backing,permissions
810643898,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Reverence: A Documentary Short on Branded Yarm...,A documentary exploring the phenomenon of cust...,6000.0,reverence-a-documentary-short-on-custom-yarmulkes,False,US,USD,$,True,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
407153952,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",YOU Release the NEW Goddamn Electric Bill VINY...,Guarantee yourself one of 250 limited vinyl re...,1500.0,you-release-the-new-goddamn-electric-bill-viny...,False,US,USD,$,True,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
531190382,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Lonely Estates Album,Lonely Estates are almost done with their debu...,1000.0,lonely-estates-album,False,US,USD,$,True,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
1253528325,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",Crit Hit! 2016 - a tabletop roleplaying gather...,A tabletop RPG focused event where you can pla...,3000.0,crit-hit-2016-a-tabletop-roleplaying-gathering-in,False,US,USD,$,True,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":1,""should_show_fea...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
379783411,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Hillsboro Arts Preschool Yearbook,"The Yearbook will be a collection of drawings,...",2500.0,hillsboro-arts-preschool-yearbook,False,US,USD,$,True,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,


# Part 1: Custom Transformers

In [34]:
X_train.category.iloc[0]

'{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/documentary"}},"color":16734574,"parent_id":11,"name":"Documentary","id":30,"position":4,"slug":"film & video/documentary"}'

## CategoriesExtractor

Here we will create a new transformer that extracts categories from the `category` column, which is stored as JSON. We want this transformer to have a parameter `use_all` that lets us choose between keeping only a hardcoded subset of categories we care about and getting all categories found in the JSON.

The `transform` method should return two new columns: `gen_cat`, the generic category, and `precise_cat`, the precise category. They are extracted by assuming the JSON contains a string in the format `gen_cat/precise_cat`. We provide the method `extract_slug` (called as `_get_slug` in the cells below), which gets the two categories and filters them if necessary, so you only have to implement `fit` and `transform` to return the two new columns we want.
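
The general shape of such a transformer can be sketched as follows. This is not the solution: `SlugSplitter` is a hypothetical name, it parses the JSON itself instead of using the provided helper, and it skips the `use_all` filtering entirely.

```python
import json

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SlugSplitter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CategoriesExtractor, without any filtering."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns self.
        return self

    def transform(self, X):
        # Parse each JSON string and split its slug on the first "/".
        pairs = [json.loads(raw)["slug"].partition("/") for raw in X["category"]]
        return pd.DataFrame(
            {
                "gen_cat": [pair[0] for pair in pairs],
                "precise_cat": [pair[2] for pair in pairs],
            },
            index=X.index,
        )


demo = pd.DataFrame(
    {"category": ['{"slug": "film & video/documentary"}', '{"slug": "music/rock"}']},
    index=pd.Index([810643898, 531190382], name="id"),
)
out = SlugSplitter().fit_transform(demo)
print(out)
```

Returning a DataFrame that reuses the index of `X` keeps the rows aligned with the input, which makes the output easy to compare against the original data.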

If your transformer is correctly implemented, the code below should return the correct new columns:

In [29]:
from transformers import CategoriesExtractor
ce = CategoriesExtractor(use_all=False)

In [32]:
X_train.category.apply(lambda x: ce._get_slug(x)[0]).head()

id
810643898     film & video
407153952            music
531190382            music
1253528325           games
379783411       publishing
Name: category, dtype: object

In [33]:
X_train.category.apply(lambda x: ce._get_slug(x)[1]).head()

id
810643898     documentary
407153952            misc
531190382            rock
1253528325           misc
379783411            misc
Name: category, dtype: object

In [25]:
from transformers import CategoriesExtractor

ce = CategoriesExtractor(use_all=False)
ce.fit(X_train)
ce.transform(X_train).head()

id,gen_cat,precise_cat
810643898,film & video,documentary
407153952,music,misc
531190382,music,rock
1253528325,games,misc
379783411,publishing,misc


In [27]:
ce.transform(X_train).shape

(10000, 2)

In [28]:
X_train.shape

(10000, 24)

## GoalAdjustor

Here we want to build a simple transformer that returns a column `adjusted_goal`, which is the `goal` multiplied by the `static_usd_rate`.
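
One possible shape for this transformer is sketched below, under the assumption that both columns are numeric. `GoalToUsd` is a hypothetical name used to keep the sketch separate from your own `GoalAdjustor`, and the second demo row is made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GoalToUsd(BaseEstimator, TransformerMixin):
    """Hypothetical sketch of a GoalAdjustor-style transformer."""

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to fit.
        return self

    def transform(self, X):
        # Element-wise product; the index of X is preserved.
        adjusted = X["goal"] * X["static_usd_rate"]
        return pd.DataFrame({"adjusted_goal": adjusted})


demo = pd.DataFrame({"goal": [6000.0, 100.0], "static_usd_rate": [1.0, 1.25]})
adjusted = GoalToUsd().fit_transform(demo)
print(adjusted)
```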

In [41]:
X_train.columns

Index(['photo', 'name', 'blurb', 'goal', 'slug', 'disable_communication',
       'country', 'currency', 'currency_symbol', 'currency_trailing_code',
       'deadline', 'created_at', 'launched_at', 'static_usd_rate', 'creator',
       'location', 'category', 'profile', 'urls', 'source_url', 'friends',
       'is_starred', 'is_backing', 'permissions'],
      dtype='object')

In [40]:
X_train.static_usd_rate.head()

id
810643898     1.0
407153952     1.0
531190382     1.0
1253528325    1.0
379783411     1.0
Name: static_usd_rate, dtype: float64

In [44]:
from transformers import GoalAdjustor

ga = GoalAdjustor()
ga.fit_transform(X_train).head()

id,adjusted_goal
810643898,6000.0
407153952,1500.0
531190382,1000.0
1253528325,3000.0
379783411,2500.0


## TimeTransformer

Here we want to build a transformer that returns two columns: 
- `launched_to_deadline`: the number of days between launching day and the deadline
- `created_to_launched`: the number of days between the creation of the page and the launch

Note: to turn a timestamp into a datetime object, you can multiply it by the constant `adj` defined in the class and then use the `to_datetime` function from pandas.
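
A small sketch of that conversion, assuming the timestamps are Unix seconds and that `adj` scales them to nanoseconds (the default unit `to_datetime` uses for integers). The value of `ADJ` below is an assumption; check the constant defined in the class. The two timestamps are those of the first training row.

```python
import pandas as pd

ADJ = 1_000_000_000  # assumed value: converts seconds to nanoseconds

times = pd.DataFrame(
    {"deadline": [1383676790], "launched_at": [1380217190]},
    index=pd.Index([810643898], name="id"),
)

# Integers are interpreted as nanoseconds since the Unix epoch by default.
deadline = pd.to_datetime(times["deadline"] * ADJ)
launched = pd.to_datetime(times["launched_at"] * ADJ)

# Subtracting datetimes gives timedeltas; .dt.days keeps the whole days.
launched_to_deadline = (deadline - launched).dt.days
print(launched_to_deadline)
```

Passing `unit="s"` to `pd.to_datetime` is an alternative that avoids the multiplication.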

In [45]:
X_train.columns

Index(['photo', 'name', 'blurb', 'goal', 'slug', 'disable_communication',
       'country', 'currency', 'currency_symbol', 'currency_trailing_code',
       'deadline', 'created_at', 'launched_at', 'static_usd_rate', 'creator',
       'location', 'category', 'profile', 'urls', 'source_url', 'friends',
       'is_starred', 'is_backing', 'permissions'],
      dtype='object')

In [51]:
X_train.iloc[0].loc[['deadline', 'launched_at']]

deadline       1383676790
launched_at    1380217190
Name: 810643898, dtype: object

In [96]:
from transformers import TimeTransformer

tt = TimeTransformer()
tt.fit_transform(X_train).head()

id,launched_to_deadline,created_to_launched
810643898,40,24
407153952,29,2
531190382,21,21
1253528325,39,61
379783411,30,201


## CountryTransformer

This transformer maps the country feature to a larger area, so that we have fewer dummy features later. We provide a dictionary of countries and their corresponding groups, but feel free to change them depending on the similarities you see between countries.
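
The grouping itself can be sketched with `Series.map`. The dictionary below is a made-up example, not the one provided with the exercise, and the `"Other"` fallback for unknown countries is an assumption.

```python
import pandas as pd

# Hypothetical grouping, for illustration only.
COUNTRY_GROUPS = {
    "US": "US",
    "GB": "UK & Ireland",
    "IE": "UK & Ireland",
    "AU": "Oceania",
    "NZ": "Oceania",
    "DE": "Europe",
    "FR": "Europe",
}

countries = pd.Series(["US", "GB", "AU", "FR", "JP"], name="country")

# Countries missing from the dictionary map to NaN, then fall back to "Other".
grouped = countries.map(COUNTRY_GROUPS).fillna("Other")
print(grouped.tolist())
```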

In [98]:
X_train.columns

Index(['photo', 'name', 'blurb', 'goal', 'slug', 'disable_communication',
       'country', 'currency', 'currency_symbol', 'currency_trailing_code',
       'deadline', 'created_at', 'launched_at', 'static_usd_rate', 'creator',
       'location', 'category', 'profile', 'urls', 'source_url', 'friends',
       'is_starred', 'is_backing', 'permissions'],
      dtype='object')

In [103]:
from transformers import CountryTransformer

ct = CountryTransformer()
ct.fit_transform(X_train).sample(10)

id,country
170993979,US
1640134491,US
1867035562,Oceania
2069677033,US
1508881112,US
876711018,UK & Ireland
962866307,Europe
2097061414,US
1406986200,Europe
1786207616,US


# Part 2: Column Transformer and Pipeline

Here we will implement the `build_model` function that is defined in `model.py`. This function should return a new Pipeline object that has two stages:
- `preprocessor`: A ColumnTransformer object that has all your preprocessing steps
- `model`: A predictive model, here we will use the `DecisionTreeClassifier`

We provide code to build two simple intermediary Pipeline objects: `cat_processor` and `country_processor`. They simply combine our `CategoriesExtractor` and `CountryTransformer` with a `OneHotEncoder` stage, so the output is an array of 1s and 0s.

You only have to implement: 
- the main ColumnTransformer that puts all the transformers together and applies them on the right columns
- the final Pipeline object that puts together the preprocessor and model.
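
The two-stage structure can be sketched with scikit-learn's built-in pieces. The step names `goal` and `countries` follow the parameter names the model exposes, but the transformers and the toy data are placeholders: `"passthrough"` and a bare `OneHotEncoder` stand in for your custom transformers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Each entry is (name, transformer, columns the transformer is applied to).
preprocessor = ColumnTransformer(
    transformers=[
        ("goal", "passthrough", ["goal"]),
        ("countries", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ]
)

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", DecisionTreeClassifier(random_state=0)),
    ]
)

# Toy data, invented for this sketch.
X = pd.DataFrame(
    {"goal": [100.0, 5000.0, 250.0, 8000.0], "country": ["US", "GB", "US", "FR"]}
)
y = [1, 0, 1, 0]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Columns that are not listed in any transformer are dropped by default (`remainder="drop"`), so only the columns you name reach the model.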

The code below gets a new model using your function, trains it on the data and generates predictions.

In [104]:
X_train.columns

Index(['photo', 'name', 'blurb', 'goal', 'slug', 'disable_communication',
       'country', 'currency', 'currency_symbol', 'currency_trailing_code',
       'deadline', 'created_at', 'launched_at', 'static_usd_rate', 'creator',
       'location', 'category', 'profile', 'urls', 'source_url', 'friends',
       'is_starred', 'is_backing', 'permissions'],
      dtype='object')

In [107]:
from model import build_model

model = build_model()
model.fit(X_train, y_train)
model.predict(X_train)

array([1, 1, 1, ..., 0, 1, 1])

# Part 3: Tuning, Training and Testing

## Tuning

First, implement `tune_model`, which loads the data (we already have a function to do that), instantiates a model (we have a function for that too) and runs a grid search on it (load `GRID_PARAMS` from `config.py` to use in the grid search).

It should then print out the best score and hyperparameters found. 

The code below should run your tuning function and print the best parameters. Before running it, make sure you define some parameters to tune in `GRID_PARAMS`. Their names have to match the pipeline format; you can list all parameter names in your model with `model.get_params().keys()`.

Note: Depending on your use case you might prefer to return the values instead of printing them, but to keep things simple here we assume that the user then modifies the config manually.
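
A minimal sketch of a grid search over pipeline parameters, using synthetic data and a stand-in pipeline. The grid values, the `StandardScaler` preprocessor and `cv=3` are assumptions for illustration; only the `<step>__<parameter>` naming is the point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for GRID_PARAMS: keys follow the "<step>__<parameter>" format.
GRID_PARAMS = {
    "model__max_depth": [3, 5],
    "model__min_samples_split": [2, 5],
}

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("model", DecisionTreeClassifier(random_state=0)),
    ]
)

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(pipeline, GRID_PARAMS, cv=3, scoring="accuracy")
search.fit(X, y)

print(f"Best Hyperparameters: {search.best_params_}")
print(f"Best score: {search.best_score_:.2%}")
```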

In [114]:
model = build_model()
model.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__goal', 'preprocessor__categories', 'preprocessor__disable_communication', 'preprocessor__time', 'preprocessor__countries', 'preprocessor__categories__memory', 'preprocessor__categories__steps', 'preprocessor__categories__verbose', 'preprocessor__categories__transformer', 'preprocessor__categories__one_hot', 'preprocessor__categories__transformer__use_all', 'preprocessor__categories__one_hot__categorical_features', 'preprocessor__categories__one_hot__categories', 'preprocessor__categories__one_hot__drop', 'preprocessor__categories__one_hot__dtype', 'preprocessor__categories__one_hot__handle_unknown', 'preprocessor__categories__one_hot__n_values', 'preprocessor__categories__one_hot__sparse', 'preprocessor__countries__memory', 'preproc

In [116]:
from model import tune_model

tune_model()

Best Hyperparameters: {'model__max_depth': 9, 'model__min_samples_split': 5, 'preprocessor__categories__transformer__use_all': True}
Best score: 67.64%


## Training

For training, implement `train_model`, which loads the data and the model, then uses `set_params` to set the parameters of the model to those defined in `PARAMS` inside `config.py` (make sure you use `**PARAMS` in `set_params` so it unpacks the dictionary).

It should then train the model and use `joblib` to save it as a file. The file should be named after the variable `MODEL_NAME`, again defined in `config.py`.
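
The sequence of setting parameters, training and saving can be sketched as below. The `PARAMS` values, the stand-in pipeline and the synthetic data are assumptions, and the file is written to a temporary directory so the sketch does not overwrite your own model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-ins for the values defined in config.py.
PARAMS = {"model__max_depth": 9, "model__min_samples_split": 5}
MODEL_NAME = "model.joblib"

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("model", DecisionTreeClassifier(random_state=0)),
    ]
)

# ** unpacks the dictionary into keyword arguments for set_params.
pipeline.set_params(**PARAMS)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, MODEL_NAME)
    joblib.dump(pipeline, path)    # save the fitted pipeline
    reloaded = joblib.load(path)   # load it back from disk

same_predictions = bool((reloaded.predict(X) == pipeline.predict(X)).all())
print(same_predictions)
```

Saving the whole pipeline, rather than only the classifier, means the preprocessing is stored together with the model and is applied automatically at prediction time.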

In [117]:
from model import train_model

train_model()

If this properly saved a model, the code below should load it and generate predictions:

In [118]:
import joblib

model_loaded = joblib.load("model.joblib")
model_loaded.predict(X_train)

array([1, 0, 1, ..., 1, 1, 0])

## Testing

For testing, we need to load the test dataset and the model that we saved in joblib format (do not instantiate a new model), generate predictions and print metrics such as the accuracy score:

In [119]:
from model import test_model

test_model()

Accuracy on the test set: 65.79%
              precision    recall  f1-score   support

           0       0.69      0.61      0.65       511
           1       0.63      0.70      0.66       474

    accuracy                           0.66       985
   macro avg       0.66      0.66      0.66       985
weighted avg       0.66      0.66      0.66       985
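
An accuracy line and per-class report in this style can be produced with scikit-learn's metrics functions. The sketch below covers only the metrics step; the labels and predictions are made up, and in `test_model` they would come from the test set and the loaded model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)

print(f"Accuracy on the test set: {accuracy:.2%}")
print(report)
```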



# Finally

Great, now that you have implemented everything, you can use the `run.py` script to work with your model on the command line:

- `python run.py tune`: will tune your model
- `python run.py train`: will train it
- `python run.py test`: will test it
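
How a script like this can dispatch on its first argument is sketched below with `argparse`. This is an assumption about the structure, not the contents of the actual `run.py`: the three functions are placeholders that only return a message, and `main` is a hypothetical entry point.

```python
import argparse


# Placeholders for the functions you implemented in model.py.
def tune_model():
    return "Tuning model..."


def train_model():
    return "Training model..."


def test_model():
    return "Testing model..."


COMMANDS = {"tune": tune_model, "train": train_model, "test": test_model}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Work with the model from the command line.")
    # Restrict the positional argument to the known command names.
    parser.add_argument("command", choices=sorted(COMMANDS))
    args = parser.parse_args(argv)
    message = COMMANDS[args.command]()
    print(message)
    return message


# Passing the arguments explicitly mimics `python run.py tune`.
main(["tune"])
```

Keeping the command names in a dictionary means adding a new command only requires one new entry.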

In [120]:
!python run.py tune

Tuning model...
Best Hyperparameters: {'model__max_depth': 9, 'model__min_samples_split': 5, 'preprocessor__categories__transformer__use_all': True}
Best score: 67.64%


In [122]:
!python run.py train

Training model...
Model was saved


In [123]:
!python run.py test 

Testing model...
Accuracy on the test set: 65.79%
              precision    recall  f1-score   support

           0       0.69      0.61      0.65       511
           1       0.63      0.70      0.66       474

    accuracy                           0.66       985
   macro avg       0.66      0.66      0.66       985
weighted avg       0.66      0.66      0.66       985

