## Game plan:

DONE - 09/06/18
 - Retrieve user accounts created before 2017
 - Join the user account information with 2017 time series data
 - Keep users with more than 26 answers in 2017
 - Apply a 4-month window (September to December 2017) to label whether the user is still active at the end of 2017
 - Drop the last 4 months' data from the features
 - Train-test split 80:20
 - Build a baseline classification model
 
TO DO:
 - Build a pipeline (X: the oversampler does not fit in an sklearn Pipeline)
 - Try different oversampling techniques (DONE-RandomOverSampler works the best)
 - Implement time series
 - User segmentation?
 - Implement voting classifier
 - Further tune classifiers (check hyper-parameter for xgboost)
 - Learning curve
 - Back-testing with 2018 data/cross-check if user came back in 2018

Story-telling:
 - Feature importance (Draw on map)
 - Cohort analysis

Extension:
 - Predict lifetime by posts in first month
 - Lifetime Customer Value
 - How to build an online model that automatically accumulates data and produces the output
 - Build a flask app
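
The labelling steps in the DONE list can be sketched on a toy frame. This is a minimal illustration, assuming monthly answer counts stored in `m_2017MM` columns as in the notebook; the three users and their counts are made up.

```python
import pandas as pd

# Toy monthly answer counts for three hypothetical users (not real data)
months = ['m_2017{:02d}'.format(m) for m in range(1, 13)]
toy = pd.DataFrame(
    [[3] * 12,             # steady answerer: 36 answers, active all year
     [5] * 8 + [0] * 4,    # 40 answers, silent from September on
     [1] * 12],            # only 12 answers: below the volume cut-off
    index=[101, 102, 103], columns=months)

# Step 1: keep users with more than 26 answers over the year
heavy = toy[toy.sum(axis=1) > 26].copy()

# Step 2: label from the last four months (the observation window)
window = months[-4:]
heavy['Active'] = (heavy[window].sum(axis=1) > 0).astype(int)

# Step 3: drop the window months so the features cannot leak the label
features = heavy.drop(columns=window)
print(features)
```

User 103 is removed by the volume filter, user 101 is labelled active and user 102 inactive.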

In [88]:
import pickle
import patsy
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import seaborn as sns

from datetime import datetime
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from sklearn.pipeline import Pipeline

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

## Clean data

In [None]:
with open('./data/processed/user_reputation.pkl', 'rb') as picklefile:
    user_reputation = pickle.load(picklefile)

In [17]:
user_reputation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8917507 entries, 0 to 8917506
Data columns (total 19 columns):
id                                int64
about_me_length                   int64
age                               object
creation_date                     datetime64[ns, UTC]
last_access_date                  datetime64[ns, UTC]
location                          object
reputation                        int64
up_votes                          int64
down_votes                        int64
profile_image_url                 object
website_url                       object
answer_reputation_total_2017      float64
question_reputation_total_2017    float64
accepted_reputation_total_2017    float64
answer_reputation_total_2018      float64
question_reputation_total_2018    float64
accepted_reputation_total_2018    float64
reputation_2017                   float64
reputation_2018                   float64
dtypes: datetime64[ns, UTC](2), float64(8), int64(5), object(4)
memory usage: 1

In [28]:
# Extract the account creation year directly from the datetime column
user_reputation['account_year'] = user_reputation['creation_date'].dt.year
user_before_2017 = user_reputation[user_reputation['account_year'] < 2017]

In [14]:
with open('./data/processed/answer_time_series.pkl', 'rb') as picklefile:
    answer_time_series = pickle.load(picklefile)

In [15]:
answer_time_series.head()

Unnamed: 0,id,m_201701,m_201702,m_201703,m_201704,m_201705,m_201706,m_201707,m_201708,m_201709,m_201710,m_201711,m_201712
1,3,0,0,1,0,0,0,0,0,0,0,0,0
2,9,0,0,1,0,0,0,0,0,0,0,0,0
3,13,2,5,0,0,0,0,0,0,0,0,0,0
4,22,0,0,0,0,0,0,0,0,0,6,2,0
5,33,1,0,1,2,1,0,1,1,1,2,1,1


In [38]:
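# NOTE: this joins the full `user_reputation` table, so the pre-2017 filter above is not applied here
# (accounts created in 2017 still appear in the outputs below). Merging `user_before_2017` with
# how='inner' would enforce the filter.
user_before_2017_stats = pd.merge(user_reputation, answer_time_series, how='right', left_on='id', right_on='id')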
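
A small sketch of how the join type affects which rows survive, using two made-up frames (`users` and `series` are hypothetical stand-ins for the real tables). A right join keeps every row of the time-series table and fills missing user fields with NaN, which also turns integer columns into floats; an inner join keeps only ids present on both sides.

```python
import pandas as pd

# Hypothetical stand-ins for the user table and the monthly answer counts
users = pd.DataFrame({'id': [1, 2], 'up_votes': [10, 20]})
series = pd.DataFrame({'id': [2, 3], 'm_201701': [4, 7]})

# indicator=True adds a '_merge' column showing where each row came from
right = pd.merge(users, series, how='right', on='id', indicator=True)
inner = pd.merge(users, series, how='inner', on='id')

print(right)
print(inner)
```

Inspecting the `_merge` column is one way to check how many time-series rows have no matching user record before filtering further.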

In [42]:
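# Keep users with more than 26 answers in 2017 (sum over the 12 monthly columns)
month_cols = [c for c in user_before_2017_stats.columns if c.startswith('m_2017')]
topans = user_before_2017_stats[user_before_2017_stats[month_cols].sum(axis=1) > 26]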

In [45]:
topans.head()

Unnamed: 0,id,about_me_length,age,creation_date,last_access_date,location,reputation,up_votes,down_votes,profile_image_url,...,m_201703,m_201704,m_201705,m_201706,m_201707,m_201708,m_201709,m_201710,m_201711,m_201712
1274,6632595,0.0,,2016-07-24 21:10:28.363000+00:00,2018-05-09 10:01:09.740000+00:00,,86.0,3.0,0.0,https://graph.facebook.com/512195288970617/pic...,...,1,0,13,3,1,0,0,0,4,10
5951,5619724,217.0,,2015-11-30 04:46:23.650000+00:00,2018-06-01 12:39:36.507000+00:00,"Sydney, Australia",3071.0,87.0,3.0,https://i.stack.imgur.com/1sBBe.jpg?s=128&g=1,...,0,0,0,3,7,13,0,10,22,1
5977,4315695,0.0,,2014-12-02 12:34:10.307000+00:00,2018-06-02 10:51:50.053000+00:00,"Ahmedabad, Gujarat, India",974.0,4.0,9.0,https://www.gravatar.com/avatar/2643d3f2e3e430...,...,0,0,0,0,0,0,0,0,0,31
5979,8932080,354.0,,2017-11-13 10:17:00.683000+00:00,2018-05-30 20:31:49.667000+00:00,"Lyon, France",660.0,61.0,10.0,https://i.stack.imgur.com/llgyP.jpg?s=128&g=1,...,0,0,0,0,0,0,0,0,32,0
5993,1175029,334.0,,2012-01-28 09:53:57.623000+00:00,2018-06-01 06:41:12.667000+00:00,,3321.0,23.0,3.0,https://i.stack.imgur.com/v80j0.jpg?s=128&g=1,...,7,1,5,3,5,5,7,5,4,6


In [47]:
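# A user is labelled active (1) if they posted at least one answer between September and December 2017
window_cols = ['m_201709', 'm_201710', 'm_201711', 'm_201712']
topans['Active'] = (topans[window_cols].sum(axis=1) > 0).astype(int)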

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [55]:
cols_to_drop = (['m_201709', 'm_201710', 'm_201711', 'm_201712', 'account_year', 'last_access_date', 'reputation',
                'answer_reputation_total_2017', 'question_reputation_total_2017', 'accepted_reputation_total_2017',
                'answer_reputation_total_2018', 'question_reputation_total_2018', 'accepted_reputation_total_2018', 
                 'reputation_2017', 'reputation_2018', 'age', 'profile_image_url'])
topans.drop(cols_to_drop, axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [58]:
topans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16427 entries, 1274 to 536881
Data columns (total 18 columns):
id                   16427 non-null int64
about_me_length      16427 non-null float64
age                  16427 non-null object
creation_date        16427 non-null datetime64[ns, UTC]
location             16427 non-null object
up_votes             16427 non-null float64
down_votes           16427 non-null float64
profile_image_url    16427 non-null object
website_url          16427 non-null object
m_201701             16427 non-null int64
m_201702             16427 non-null int64
m_201703             16427 non-null int64
m_201704             16427 non-null int64
m_201705             16427 non-null int64
m_201706             16427 non-null int64
m_201707             16427 non-null int64
m_201708             16427 non-null int64
Active               16427 non-null int64
dtypes: datetime64[ns, UTC](1), float64(3), int64(10), object(4)
memory usage: 3.0+ MB


In [72]:
# Split the creation date into year and month features (by column name, not by position)
topans['creation_year'] = topans['creation_date'].dt.year
topans['creation_month'] = topans['creation_date'].dt.month
topans.drop('profile_image_url', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [80]:
topans.head()

Unnamed: 0,id,about_me_length,location,up_votes,down_votes,profile_image_url,website_url,m_201701,m_201702,m_201703,m_201704,m_201705,m_201706,m_201707,m_201708,Active,creation_year,creation_month
1274,6632595,0.0,,3.0,0.0,https://graph.facebook.com/512195288970617/pic...,,0,0,1,0,13,3,1,0,1,2016,7
5951,5619724,217.0,Australia,87.0,3.0,https://i.stack.imgur.com/1sBBe.jpg?s=128&g=1,,2,1,0,0,0,3,7,13,1,2015,11
5977,4315695,0.0,India,4.0,9.0,https://www.gravatar.com/avatar/2643d3f2e3e430...,,0,0,0,0,0,0,0,0,1,2014,12
5979,8932080,354.0,France,61.0,10.0,https://i.stack.imgur.com/llgyP.jpg?s=128&g=1,https://www.awesomeprods.fr,0,0,0,0,0,0,0,0,1,2017,11
5993,1175029,334.0,,23.0,3.0,https://i.stack.imgur.com/v80j0.jpg?s=128&g=1,,11,7,7,1,5,3,5,5,1,2012,1


In [79]:
# Keep only the last comma-separated part of the location (usually the country)
topans['location'] = topans['location'].str.split(', ').str[-1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [49]:
with open('./data/processed/topans.pkl','rb') as picklefile:
    topans = pickle.load(picklefile)

In [50]:
topans.head()

Unnamed: 0,id,about_me_length,location,up_votes,down_votes,website_url,m_201701,m_201702,m_201703,m_201704,m_201705,m_201706,m_201707,m_201708,Active,creation_year,creation_month
1274,6632595,0.0,,3.0,0.0,,0,0,1,0,13,3,1,0,1,2016,7
5951,5619724,217.0,Australia,87.0,3.0,,2,1,0,0,0,3,7,13,1,2015,11
5977,4315695,0.0,India,4.0,9.0,,0,0,0,0,0,0,0,0,1,2014,12
5979,8932080,354.0,France,61.0,10.0,https://www.awesomeprods.fr,0,0,0,0,0,0,0,0,1,2017,11
5993,1175029,334.0,,23.0,3.0,,11,7,7,1,5,3,5,5,1,2012,1


In [51]:
# Treat a website URL that appears exactly once in the data as a personal website
website_count = topans['website_url'].value_counts()
personal_website = set(website_count[website_count == 1].index)
topans['personal_website'] = topans['website_url'].isin(personal_website).astype(int)

In [52]:
topans.drop('website_url', axis=1, inplace=True)
topans.set_index('id', inplace=True)

## Define X and y

In [55]:
y = topans['Active']
X = topans.drop('Active', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4444, stratify=y)

In [56]:
with open('./data/processed/location_dict.pkl', 'rb') as picklefile:
    location_dict = pickle.load(picklefile)

In [57]:
def map_dummify_Locations(df, location_dict):
    # Map each raw location to its region; anything not in the dictionary becomes 'Others'
    df['location'] = df['location'].apply(lambda x: location_dict.get(x, 'Others'))
    # One-hot encode the region (patsy also adds an 'Intercept' column)
    location_dummy = patsy.dmatrix('location', data=df, return_type='dataframe')
    df = df.join(location_dummy).drop('location', axis=1)
    # Rename 'location[T.APAC]' to 'location-T.APAC'; XGBoost does not accept '[' or ']' in feature names
    df.columns = [col.replace('[', '-').replace(']', '') for col in df.columns]
    return df

In [58]:
X_train = map_dummify_Locations(X_train, location_dict)
X_test = map_dummify_Locations(X_test, location_dict)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
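
Encoding the train and test frames separately can yield different dummy columns when a region is missing from one of them. The sketch below shows one way to align the columns; it uses `pd.get_dummies` rather than patsy to stay short, and the two frames are made up.

```python
import pandas as pd

# Hypothetical frames: 'APAC' and 'Others' never occur in the test rows
train = pd.DataFrame({'location': ['APAC', 'EMEA', 'Others']})
test = pd.DataFrame({'location': ['EMEA', 'EMEA']})

train_dummies = pd.get_dummies(train['location'], prefix='location')
test_dummies = pd.get_dummies(test['location'], prefix='location')

# Give the test frame exactly the training columns, filling unseen regions with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

Comparing `list(X_train.columns)` with `list(X_test.columns)` after `map_dummify_Locations` is a quick check that the two frames agree.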
  


## Perform some basic GridSearch

In [70]:
# Normalize the data
ssX = StandardScaler()
X_train_norm = ssX.fit_transform(X_train)
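
The scaler is fitted on the training data only. When the test set is eventually scored, it should pass through the same fitted scaler via `transform`, not a fresh `fit_transform`. A minimal sketch on random toy arrays (the `_toy` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(4444)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean_ and scale_ from the training rows
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training statistics unchanged

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with zero mean by construction, while the test columns are only approximately centred, which is expected.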

In [85]:
def gridSearchFiveModels(X, y):
    # Only XGBoost is enabled. To run another model, uncomment it together with its
    # parameter grid below; the two lists are zipped, so their order must match.
    models = [
    #    ('knn', KNN),
    #    ('logistic', LogisticRegression),
    #    ('tree', DecisionTreeClassifier),
    #    ('forest', RandomForestClassifier),
        ('xgboost', XGBClassifier)
    ]

    param_choices = [
    #    {'n_neighbors': range(2, 12)},
    #    {'C': np.logspace(-3, 6, 12), 'penalty': ['l1', 'l2']},
    #    {'max_depth': [2, 3, 4, 5], 'min_samples_leaf': [3, 6, 10]},
    #    {'n_estimators': [50, 100, 200], 'max_depth': [1, 2, 3, 4, 5], 'min_samples_leaf': [3, 6, 10]},
        {
            'max_depth': [3, 4, 5],
            'n_estimators': [1, 50, 100, 200],
            'objective': ['binary:logistic']
        }
    ]

    grids = {}
    for (name, model), params in zip(models, param_choices):
        # 5-fold cross-validated grid search, scored on accuracy
        grid = GridSearchCV(model(), params, scoring='accuracy', cv=5, n_jobs=-1)
        grid.fit(X, y)
        print("{}: best score: {}".format(name, grid.best_score_))
        grids[name] = grid
    return grids

*Model performance*<br>
knn: best score: 0.8592953352104101<br>
logistic: best score: 0.8619587550414732<br>
tree: best score: 0.8675899855414352<br>
forest: best score: 0.8656875428049615<br>
xgboost: best score: 0.8759607335819192
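
These accuracies are easier to read against a majority-class baseline. The training confusion matrices later in this notebook give 1,814 inactive and 11,327 active users, so a model that always predicts "active" already reaches the accuracy computed below.

```python
# Class counts taken from the training confusion matrices in this notebook
n_inactive = 1707 + 107     # actual class 0
n_active = 1316 + 10011     # actual class 1

# Accuracy of always predicting the majority class (active)
baseline = n_active / (n_inactive + n_active)
print(round(baseline, 4))
```

The baseline is about 0.862, which matches the logistic score above to four decimal places and sits slightly above the knn score. This suggests accuracy alone says little about how well the inactive class is detected; recall on class 0 or ROC AUC may be more informative here.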

In [18]:
with open('./data/model/answerer_prediction.pkl','rb') as picklefile:
    grids = pickle.load(picklefile)



## Check OverSampler

In [93]:
# fit_resample is the current imbalanced-learn name for the older fit_sample method
X_train_resampled, y_train_resampled = RandomOverSampler(random_state=4444).fit_resample(X_train_norm, y_train)
grid_OverSampler = gridSearchFiveModels(X_train_resampled, y_train_resampled)

xgboost: best score: 0.8941467290544716


grids['xgboost']
GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': [3, 4, 5], 'n_estimators': [1, 50, 100, 200], 'objective': ['binary:logistic']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [94]:
accuracy_score(y_train, grid_OverSampler['xgboost'].predict(X_train_norm))


0.8917129594399209

In [92]:
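# `xg` was never defined in this notebook; the matrix below matches the accuracy of the oversampled grid search
confusion_matrix(y_train, grid_OverSampler['xgboost'].predict(X_train_norm))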


array([[ 1707,   107],
       [ 1316, 10011]])
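
Because `gridSearchFiveModels` receives data that was oversampled before cross-validation, duplicated minority rows can land in both the training and the validation part of a fold, which may make the CV score optimistic. The sketch below resamples inside each fold instead. It is an illustration under assumptions: the data is synthetic, `oversample_minority` is a hypothetical helper for the binary case written with NumPy, and logistic regression stands in for XGBoost. `imblearn.pipeline.Pipeline` accepts samplers as steps and packages the same idea for use with `GridSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes have the same count (binary labels)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Synthetic imbalanced data (roughly 14% in class 0), standing in for the real features
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.14, 0.86],
                           random_state=4444)

rng = np.random.RandomState(4444)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=4444)
fold_recalls = []
for train_idx, valid_idx in splitter.split(X, y):
    # Oversample the training part only; the validation fold keeps its natural balance
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    predictions = model.predict(X[valid_idx])
    # Recall on the minority class (label 0) for this fold
    fold_recalls.append(recall_score(y[valid_idx], predictions, pos_label=0))

print(np.mean(fold_recalls))
```

Scores obtained this way are measured on validation rows the model has never seen in duplicated form, so they are a fairer basis for comparing oversamplers.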

## Check performance of SMOTE

In [81]:
# fit_resample is the current imbalanced-learn name for the older fit_sample method
X_train_resampled, y_train_resampled = SMOTE(random_state=444).fit_resample(X_train_norm, y_train)

In [86]:
grid_SMOTE = gridSearchFiveModels(X_train_resampled, y_train_resampled)

xgboost: best score: 0.8938377328507107


In [87]:
confusion_matrix(y_train, grid_SMOTE['xgboost'].predict(X_train_norm))


array([[ 1098,   716],
       [  346, 10981]])

## Check Performance of ADASYN

In [89]:
# fit_resample is the current imbalanced-learn name for the older fit_sample method
X_train_resampled, y_train_resampled = ADASYN(random_state=444).fit_resample(X_train_norm, y_train)

In [90]:
grid_ADASYN = gridSearchFiveModels(X_train_resampled, y_train_resampled)

xgboost: best score: 0.8904878592114056


In [91]:
confusion_matrix(y_train, grid_ADASYN['xgboost'].predict(X_train_norm))


array([[ 1086,   728],
       [  320, 11007]])
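
The three confusion matrices above can be compared side by side. The snippet below derives accuracy and per-class recall from the reported counts (scikit-learn puts actual classes in rows and predicted classes in columns). These are training-set figures, so they describe fit rather than generalisation.

```python
import numpy as np

# Training confusion matrices reported above (rows: actual 0/1, columns: predicted 0/1)
matrices = {
    'RandomOverSampler': np.array([[1707, 107], [1316, 10011]]),
    'SMOTE': np.array([[1098, 716], [346, 10981]]),
    'ADASYN': np.array([[1086, 728], [320, 11007]]),
}

results = {}
for name, cm in matrices.items():
    results[name] = {
        'accuracy': np.trace(cm) / cm.sum(),
        'recall_inactive': cm[0, 0] / cm[0].sum(),  # share of inactive users caught
        'recall_active': cm[1, 1] / cm[1].sum(),    # share of active users caught
    }
    print('{:<18} accuracy={accuracy:.3f} recall_0={recall_inactive:.3f} '
          'recall_1={recall_active:.3f}'.format(name, **results[name]))
```

On these counts RandomOverSampler has the lowest training accuracy but by far the highest recall on inactive users, while SMOTE and ADASYN favour the active class. Which one "works best" therefore depends on whether missing a churning user or wrongly flagging an active one is the costlier mistake.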

## Check how to utilize time-series

In [96]:
X_train.head()

id,about_me_length,up_votes,down_votes,m_201701,m_201702,m_201703,m_201704,m_201705,m_201706,m_201707,m_201708,creation_year,creation_month,personal_website,Intercept,location-T.APAC,location-T.EMEA,location-T.NDF,location-T.Others
5291448,102.0,85.0,13.0,0,1,3,4,1,0,7,7,2015,9,0,1.0,0.0,1.0,0.0,0.0
4747626,0.0,214.0,27.0,8,1,0,2,7,6,1,1,2015,4,0,1.0,1.0,0.0,0.0,0.0
7893169,43.0,12.0,14.0,0,0,0,0,0,0,0,0,2017,4,0,1.0,0.0,0.0,1.0,0.0
8670372,137.0,39.0,8.0,0,0,0,0,0,0,0,0,2017,9,0,1.0,0.0,1.0,0.0,0.0
6803853,0.0,46.0,10.0,7,17,8,8,3,4,3,0,2016,9,0,1.0,0.0,1.0,0.0,0.0
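
One possible way to use the monthly columns as a time series is to summarise each user's trajectory with a few derived features. The sketch below is only an assumption about what such features could look like (`total`, `recent_3m` and `slope` are hypothetical names), computed on three of the rows shown above.

```python
import numpy as np
import pandas as pd

month_cols = ['m_2017{:02d}'.format(m) for m in range(1, 9)]

# Monthly answer counts copied from three of the X_train.head() rows above
counts = pd.DataFrame(
    [[0, 1, 3, 4, 1, 0, 7, 7],
     [8, 1, 0, 2, 7, 6, 1, 1],
     [7, 17, 8, 8, 3, 4, 3, 0]],
    index=[5291448, 4747626, 6803853], columns=month_cols)

summary = pd.DataFrame(index=counts.index)
summary['total'] = counts.sum(axis=1)                       # answers from January to August
summary['recent_3m'] = counts[month_cols[-3:]].sum(axis=1)  # answers from June to August

# Least-squares slope of answers per month; negative values mean declining activity
x = np.arange(len(month_cols))
summary['slope'] = counts.apply(
    lambda row: np.polyfit(x, row.values.astype(float), 1)[0], axis=1)

print(summary)
```

User 6803853 shows a clearly negative slope (activity fading toward August), whereas user 5291448 is trending upward, a distinction the raw monthly columns only encode implicitly.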
