Authors:
    <br>Alejandro Alvarez (axa)
    <br>Brenda Palma (bpalmagu)

# <center>ML-Jokes: Model ensemble</center>

## Setup

In [1]:
# Path to ml-jokes folder
import os
if os.getcwd().split('/')[-2] == 'ml-jokes': os.chdir('..')

print(f'Current directory: {os.getcwd()}')
assert set(['data', 'mljokes', 'environment.yml', 'nbs']) <= set(os.listdir()), \
    'Wrong path; go to ./heinz-95729-project/api/ml-jokes'

Current directory: /home/brendapalmag/eCommerce/heinz-95729-project/api/ml-jokes


In [2]:
import optuna
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score
from mljokes.data import read_ratings, read_jokes, load_test_idx   

## Data

In [3]:
# Load cb results (ordered by user, and joke asc)
with open('./results/predictions_nov28.pkl', 'rb') as f: predictions = pickle.load(f)
predictions.rename(columns={'joke:id': 'joke_id', 'rating_pred': 'cb_rating'}, inplace=True)

# Load cf results (ordered by user, and joke asc)
pred_cf = pd.read_pickle('./results/cf_preds.pkl')

# Merge results
predictions['cf_rating'] = pred_cf['pred_cf']
del pred_cf
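
The column assignment above lines rows up by index label, so it is only correct while both tables share the same index and ordering. A minimal sketch of a key-based alternative follows, assuming the collaborative-filtering table also carries `user_id` and `joke_id` columns (the source does not show its schema); the toy frames `cb` and `cf` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two prediction tables (values are made up).
cb = pd.DataFrame({
    'user_id': [0, 0, 1],
    'joke_id': [1, 2, 1],
    'cb_rating': [-0.16, -0.97, 1.92],
})
# Same (user, joke) pairs, deliberately stored in a different order.
cf = pd.DataFrame({
    'user_id': [1, 0, 0],
    'joke_id': [1, 2, 1],
    'cf_rating': [0.5, -1.7, -0.2],
})

# Join on the keys instead of relying on matching row order.
# validate='one_to_one' raises if a (user_id, joke_id) pair is duplicated.
merged = cb.merge(cf, on=['user_id', 'joke_id'], how='left', validate='one_to_one')
print(merged)
```

A left join keeps the row order of `cb`, so the downstream positional logic would still work if it were applied to the merged frame.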

In [7]:
# Load real ratings (ordered by user, and joke asc)
ratings = read_ratings()
test_idx = load_test_idx()

# Merge with the real ratings and the test-split flag
predictions['real_rating'] = ratings['rating']
predictions['test_idx'] = test_idx
del test_idx

predictions.head()

Unnamed: 0,user_id,joke_id,cb_rating,cf_rating,real_rating,test_idx
0,0,1,-0.162202,-0.208646,99.0,0
1,0,2,-0.974697,-1.717733,99.0,0
2,0,3,-0.172788,-0.732498,99.0,0
3,0,4,-2.501286,-3.822813,99.0,0
4,0,5,1.915733,-0.653723,-1.65,0


In [9]:
# Train and test split
all_idxs = predictions.index.values

train_idxs = all_idxs[(predictions['real_rating']!=99.) & (predictions['test_idx']==0)]
test_idxs = all_idxs[(predictions['real_rating']!=99.) & (predictions['test_idx']==1)]
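
The split above can be checked on a toy frame. This sketch reuses the same two conditions (a `real_rating` other than the 99. placeholder, and the `test_idx` flag) and adds two sanity checks; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 99.0 is the placeholder the notebook excludes from both splits.
df = pd.DataFrame({
    'real_rating': [99.0, -1.65, 3.2, 99.0, 7.1, 0.4],
    'test_idx':    [0,    0,     1,   1,    0,   1],
})

rated = df['real_rating'] != 99.0
train_idxs = df.index[rated & (df['test_idx'] == 0)].to_numpy()
test_idxs = df.index[rated & (df['test_idx'] == 1)].to_numpy()

# The two splits must not overlap, and together they must cover every rated row.
assert np.intersect1d(train_idxs, test_idxs).size == 0
assert len(train_idxs) + len(test_idxs) == rated.sum()
print(train_idxs, test_idxs)
```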


In [12]:
def tune(objective, n_trials=10):
    # The objective returns a negated MAE, so a higher value is better
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=n_trials)

    params = study.best_params
    best_score = study.best_value
    print(f'Best score: {best_score}\n')
    print(f'Optimized parameters: {params}\n')
    return params

def lm_objective(trial):
    # Sample alpha on a log scale between 1e-4 and 10
    _alpha = trial.suggest_float('alpha', 1e-4, 10, log=True)
    _random_state = trial.suggest_int('random_state', 0, 1000)

    lm = Ridge(alpha=_alpha, random_state=_random_state, fit_intercept=False)
    # The single (slice(None), slice(None)) split fits and scores on the same
    # training rows, so the returned value is an in-sample negated MAE
    scores = cross_val_score(lm, 
                             predictions.loc[train_idxs, ['cb_rating', 'cf_rating']].values, 
                             predictions.loc[train_idxs, 'real_rating'].values, 
                             cv=[(slice(None), slice(None))],
                             n_jobs=-1,
                             verbose=4,
                             scoring='neg_mean_absolute_error')
    return scores.mean()
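
With `fit_intercept=False`, the ensemble is just a weighted sum of the two base predictions, and the weights have a closed form. The sketch below solves the ridge objective that scikit-learn documents, `||y - Xw||^2 + alpha * ||w||^2`, directly with NumPy on synthetic data; the data, the seed, and the names `w_closed` and `gradient` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two columns play the role of cb_rating and cf_rating (synthetic values).
X = rng.normal(size=(200, 2))
y = 0.3 * X[:, 0] + 0.9 * X[:, 1] + rng.normal(scale=0.1, size=200)

alpha = 1e-4
# Closed-form ridge weights without an intercept: (X'X + alpha*I)^-1 X'y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

# At the minimizer, half the gradient of the objective, X'(Xw - y) + alpha*w, is zero.
gradient = X.T @ (X @ w_closed - y) + alpha * w_closed
print(w_closed, np.abs(gradient).max())
```

When the number of training rows is large, `X.T @ X` dominates `alpha * np.eye(2)` for any alpha in the searched range, which would be consistent with trial scores that differ only in the last few digits.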

In [13]:
lm_params = tune(lm_objective, n_trials=100)

[I 2021-11-29 20:20:48,007] A new study created in memory with name: no-name-ccc5a886-9fad-415b-840a-5ee5b0ea9968
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.7s finished
[I 2021-11-29 20:20:49,290] Trial 0 finished with value: -2.8050069357625156 and parameters: {'alpha': 0.10084748810418136, 'random_state': 848}. Best is trial 0 with value: -2.8050069357625156.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.4s finished
[I 2021-11-29 20:20:50,053] Trial 1 finished with value: -2.80500693530169 and parameters: {'alpha': 0.0006651852476783738, 'random_state': 571}. Best is trial 1 with value: -2.80500693530169.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.3s finished
[I 2021-11-29 ...

Best score: -2.80500693529909

Optimized parameters: {'alpha': 0.00010012505030418726, 'random_state': 107}



In [14]:
# Refit on the training rows with the tuned parameters (no intercept, as in tuning)
lm = Ridge(**lm_params, fit_intercept=False)
print(f'Ridge params: {lm.get_params()}')
lm.fit(predictions.loc[train_idxs, ['cb_rating', 'cf_rating']].values,
       predictions.loc[train_idxs, 'real_rating'].values)

# Evaluate on the held-out test rows
rating = predictions.loc[test_idxs, 'real_rating'].values
rating_pred = lm.predict(predictions.loc[test_idxs, ['cb_rating', 'cf_rating']].values)
print(f'MAE: {mean_absolute_error(rating, rating_pred):0.2f}')

Ridge params: {'alpha': 0.00010012505030418726, 'copy_X': True, 'fit_intercept': False, 'max_iter': None, 'normalize': 'deprecated', 'positive': False, 'random_state': 107, 'solver': 'auto', 'tol': 0.001}
MAE: 3.25
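
A single ensemble MAE is easier to judge next to the MAE of each base model on the same rows. The sketch below shows one way to compute that comparison; the helper name `mae_by_column` and the toy numbers are hypothetical and do not come from the notebook's data.

```python
import pandas as pd

def mae_by_column(df, pred_cols, target_col):
    """Return the mean absolute error of each prediction column against the target."""
    errors = df[pred_cols].sub(df[target_col], axis=0).abs()
    return errors.mean()

# Toy test rows (made-up values); the ensemble here is a plain 50/50 average.
toy = pd.DataFrame({
    'real_rating': [1.0, 2.0, 3.0, 4.0],
    'cb_rating':   [1.0, 2.0, 3.0, 6.0],
    'cf_rating':   [2.0, 3.0, 4.0, 5.0],
})
toy['ensemble_rating'] = 0.5 * toy['cb_rating'] + 0.5 * toy['cf_rating']

report = mae_by_column(toy, ['cb_rating', 'cf_rating', 'ensemble_rating'], 'real_rating')
print(report)
```

In the notebook, the same helper could be applied to `predictions.loc[test_idxs]` after adding the ensemble column, which would show whether blending improves on either base model.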


In [28]:
# Score every row (rated or not) with the fitted ensemble
predictions_ensemble = predictions.copy()
predictions_ensemble['ensemble_rating'] = lm.predict(
    predictions[['cb_rating', 'cf_rating']].values)

display(predictions_ensemble.head())
with open('./results/ensemble_nov28.pkl', 'wb') as f: pickle.dump(predictions_ensemble, f)

Unnamed: 0,user_id,joke_id,cb_rating,cf_rating,real_rating,ensemble_rating
0,0,1,-0.162202,-0.760591,99.0,-0.67343
1,0,2,-0.974697,-1.241289,99.0,-1.296617
2,0,3,-0.172788,-2.495778,99.0,-2.109742
3,0,4,-2.501286,-3.28885,99.0,-3.412844
4,0,5,1.915733,-0.267254,-1.65,0.312347


In [15]:
# Example: rank the jokes this user has not rated yet (real_rating == 99.)
user_id = 4493

x_user = predictions.loc[(predictions['user_id']==user_id) & (predictions['real_rating']==99.), ['cb_rating', 'cf_rating']].values
rating_pred_user = lm.predict(x_user)
# argsort returns positions within x_user (the unrated rows), best first
sorted_idx = np.argsort(rating_pred_user)[::-1]
sorted_ratings = rating_pred_user[sorted_idx]
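
Because `np.argsort` returns positions within the filtered rows, those positions match joke identifiers only when the user has rated nothing. A minimal sketch of mapping them back to `joke_id` follows; the toy frame, the `score` column (standing in for the model output), and the name `top_joke_ids` are assumptions, and how `read_jokes()` indexes its rows is not shown in the source.

```python
import numpy as np
import pandas as pd

# Toy predictions for one user; 99.0 marks the rows used for recommendation.
preds = pd.DataFrame({
    'user_id': [7, 7, 7, 7],
    'joke_id': [1, 2, 3, 4],
    'real_rating': [99.0, 5.0, 99.0, 99.0],
    'score': [0.1, 2.0, 1.5, -0.3],   # stands in for the ensemble prediction
})

unrated = preds[(preds['user_id'] == 7) & (preds['real_rating'] == 99.0)]
# Positions within `unrated`, highest score first.
order = np.argsort(unrated['score'].to_numpy())[::-1]
# Translate positions into the joke identifiers they refer to.
top_joke_ids = unrated['joke_id'].to_numpy()[order]
print(order, top_joke_ids)
```

Here the best-scoring unrated row sits at position 1 but refers to joke 3, so looking jokes up by position would return a different joke than looking them up by `joke_id`.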

In [16]:
# Load jokes
jokes = read_jokes()

In [18]:
# Display top k jokes
k = 10

for i in sorted_idx[:k]:
    print(jokes['text'][i], end='\n\n')

If pro- is the opposite of con- then congress must be the opposite of progress.

Reaching the end of a job interview, the human resources person asked a young engineer fresh out of Stanford, "And what starting salary were you looking for?" The engineer said, "In the neighborhood of $125,000 a year, depending on the benefits package." The interviewer said, "Well, what would you say to a package of 5-weeks vacation, 14 paid holidays, full medical and dental, company matching retirement fund to 50% of salary, and a company car leased every 2 years - say, a red Corvette?" The Engineer sat up straight and said, "Wow! Are you kidding?" And the interviewer replied, "Yeah, but you started it."

This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him. "What could they possibly have said to make you move out?" "They told me that you were a p