Objective: For a given player, rank a set of subjects by the likelihood that the player will like each one.

Solution:

We will try the two approaches described below.


The data is divided into a training set and a test set:
Because the data is in chronological order, we use the first 80% of the rows as the training set and the remaining 20% as the test set.
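
A minimal sketch of such a chronological split, assuming the rows are already sorted by time. The helper name `chronological_split` and the toy `events` list are illustrative only and do not appear in the notebook.

```python
import math

def chronological_split(rows, test_fraction=0.20):
    """Split time-ordered rows into an earlier training part and a later test part."""
    n_test = math.ceil(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]

# Toy stand-in for time-ordered rows (hypothetical data).
events = list(range(10))
train_rows, test_rows = chronological_split(events)
print(train_rows)
print(test_rows)
```

Because nothing is shuffled, every training row precedes every test row, so the model is never evaluated on data older than what it was trained on.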

Facts about the data:
- The dataset is imbalanced
- About 80% of the examples are "not like"
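
The imbalance matters for the choice of metric. The sketch below uses toy labels that only mirror the roughly 80/20 ratio stated above (they are not the real data) to show that a model which always predicts "not like" would already reach about 80% accuracy. The function names are hypothetical.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_shares(labels).values())

# Toy labels: 0 = "not like", 1 = "like" (hypothetical, mirrors the ~80/20 ratio).
toy_labels = [0] * 80 + [1] * 20
print(class_shares(toy_labels))
print(majority_baseline_accuracy(toy_labels))
```

This is one reason to prefer a ranking metric such as ROC AUC over plain accuracy for this problem.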

Approach 1:
- Use logistic regression to get a baseline performance
- Use ROC AUC as the scoring metric
- Evaluate using stratified cross-validation
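
ROC AUC fits the ranking objective because it can be read as the probability that a randomly chosen "like" example receives a higher score than a randomly chosen "not like" example. The sketch below is a plain-Python illustration of that definition, not the implementation used by scikit-learn. The function name `roc_auc` and the toy scores are assumptions for illustration.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (hypothetical).
y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_toy, score_toy))
```

Of the four positive/negative pairs in the toy data, three are ordered correctly, so the result is 0.75.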

Approach 2:
- Use XGBoost with logistic regression
- Use ROC AUC as the scoring metric
- Evaluate using stratified cross-validation
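
Both approaches rely on stratified cross-validation, which keeps the class ratio roughly the same in every fold. The sketch below is a simplified illustration of the idea, not scikit-learn's `StratifiedKFold` algorithm. The helper name `stratified_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy labels with an 80/20 class ratio (hypothetical).
fold_labels = [0] * 40 + [1] * 10
folds = stratified_folds(fold_labels, n_folds=5)
for fold in folds:
    print(len(fold), sum(fold_labels[i] for i in fold))
```

With an imbalanced target, plain random folds could end up with very few positive examples, which makes per-fold ROC AUC estimates noisy.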


In [5]:
# Add all the imports here
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import randint as sp_randint
from sklearn import decomposition, linear_model, model_selection, pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import ElasticNet, LassoCV
from sklearn.metrics import log_loss, make_scorer, mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Set the constants and the parameter grid for the grid search

In [6]:
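random_seed = 42
test_size = 0.20                    # fraction of rows held out for testing
iterations = [1000, 5000, 10000]    # candidate values for logistic__max_iter
n_components = [10, 20, 29]         # candidate numbers of PCA components
penalty = ['l1', 'l2']
Cs = np.logspace(-4, 4, 3)          # inverse regularization strengths: 1e-4, 1, 1e4
n_folds = 5
# Keys follow the Pipeline convention <step name>__<parameter name>.
param_dist_lasso = {"pca__n_components": n_components,
                    "logistic__C": Cs,
                    "logistic__penalty": penalty,
                    "logistic__max_iter": iterations}
label_encoder = LabelEncoder()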

Define the pipeline: reduce dimensionality with PCA, then apply logistic regression

In [7]:
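# liblinear supports both the 'l1' and 'l2' penalties searched over in the grid;
# the default solver in recent scikit-learn releases (lbfgs) does not support 'l1'.
logistic = linear_model.LogisticRegression(solver='liblinear')
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])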

Separate out the labels and drop the unnecessary columns

In [8]:
dataset = pd.read_csv('../../../data/processed_data.csv')
labels = dataset[['like']]
data = dataset.drop(['like','Unnamed: 0','player_id','subject_id'],axis=1)

Apply Encoding to Categorical variables

In [9]:
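def encode_features(df, encoder):
    # Label-encode every categorical/object column in place and return df.
    columns_to_encode = list(df.select_dtypes(include=['category', 'object']))
    for feature in columns_to_encode:
        try:
            df[feature] = encoder.fit_transform(df[feature])
        except Exception as exc:
            # Report the failing column and keep going with the others.
            print('Error encoding {0}: {1}'.format(feature, exc))
    return df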
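
To make the encoding step concrete: a label encoder maps each distinct value in a column to an integer code, with codes assigned in sorted order of the distinct values. The sketch below reproduces that mapping in plain Python for illustration. The helper `fit_label_codes` and the `colors` column are hypothetical and not part of the notebook.

```python
def fit_label_codes(values):
    """Map each distinct value to an integer, in sorted order of the distinct values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Hypothetical categorical column.
colors = ["red", "blue", "green", "blue"]
codes = fit_label_codes(colors)
encoded = [codes[v] for v in colors]
print(codes)
print(encoded)
```

Note that integer codes imply an ordering that a linear model will treat as numeric. Also, because `encode_features` refits the same encoder for each column, only the mapping of the last encoded column remains stored in `label_encoder` afterwards.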

In [10]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

Function for data preparation.
- Encode the data
- Divide the data into training and test set

In [11]:
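def data_prep():
    data_label_encoded = encode_features(data, label_encoder)
    data_array = np.array(data_label_encoded)
    # Rows are in chronological order, so split without shuffling:
    # the first 80% become the training set and the last 20% the test set.
    # `labels` stays a DataFrame because a later cell calls y_train.values.
    X_train, X_test, y_train, y_test = train_test_split(
        data_array, labels, test_size=test_size, shuffle=False)
    return X_train, X_test, y_train, y_test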

In [13]:
def random_search(n_iter, parameters, clf):
    # Sample n_iter parameter settings from `parameters` and report the best ones.
    # Relies on the global X_train and y_train created by data_prep().
    search = RandomizedSearchCV(clf, param_distributions=parameters,
                                n_iter=n_iter)
    search.fit(X_train, np.ravel(y_train))
    report(search.cv_results_)

Run Grid Search with the parameters

In [14]:
X_train, X_test, y_train, y_test = data_prep()

In [18]:
# With an integer cv and a classifier, GridSearchCV uses stratified folds.
grid_search = model_selection.GridSearchCV(estimator=pipe,
                                           param_grid=param_dist_lasso,
                                           scoring='roc_auc', n_jobs=4, cv=10)
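
Before starting the fit, it can help to estimate how much work the search involves. The sketch below counts the parameter combinations in the grid defined earlier and multiplies by the 10 folds. The values are copied from the parameter cell; the variable names `grid`, `cv_folds`, `n_candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Candidate values copied from the parameter cell earlier in the notebook.
grid = {
    "pca__n_components": [10, 20, 29],
    "logistic__C": [1e-4, 1.0, 1e4],          # np.logspace(-4, 4, 3)
    "logistic__penalty": ["l1", "l2"],
    "logistic__max_iter": [1000, 5000, 10000],
}
cv_folds = 10

# Every combination of one value per parameter is one candidate model.
combinations = list(product(*grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

That is 54 candidates and 540 cross-validation fits, plus one final refit of the best candidate on the full training set when `refit` is left at its default.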

In [None]:
grid_search.fit(X_train, y_train.values.ravel())