# Classifying Burning Man Events Data: Predicting Probabilities

In the previous classification approach we treated each event as belonging exclusively to a single type, with precision, recall, and F1 score as the classification metrics. This may not hold in practice: an event could be both a party and a food event, for example. The possibility became more evident when analyzing the errors made by the classifiers, where certain kinds of mislabels were common.

Alternatively, we can predict the non-exclusive probability that an event belongs to a given type, and use a metric like ROC-AUC to gauge classifier performance.

We'll start off the same as before, establishing baselines one step at a time. Totally random, or even prior-weighted, guessing gives a ROC-AUC of 0.5, corresponding to no useful distinction between positive and negative labels. A simple rule-based system, just like the one used previously but this time allowing for multiple positives, gives an average ROC-AUC of 0.65 with a standard deviation of 0.08, a noticeable improvement. Adding more complex features along with logistic regression improves the ROC-AUC to almost 0.80, and including TF-IDF word vectors brings it up to 0.87. A hard 0-1 prediction mechanism built on the same features, on the other hand, only gives a ROC-AUC of about 0.56 to 0.62.

tl;dr Burning Man Org should consider allowing events to have multiple labels

- <a href='#total'> Totally Random Guessing </a>
- <a href='#simple'> Simple Rule Based Classification </a>
- <a href='#complex'> More Complex Engineered Features </a>
- <a href='#statsmodels'> Feature Importance with StatsModels </a>
- <a href='#tfidf'> With TF-IDF Features </a>

# Import Packages and Data

In [1]:
import pandas as pd;
import numpy as np;
import seaborn as sns;
import matplotlib.pyplot as plt;

import datetime  # used explicitly by get_time_diff further down
import string, nltk, re, pprint

from functools import reduce
from tqdm import tqdm
from pylab import *;
from scipy import sparse
from time import time

from nltk.corpus   import stopwords
from nltk          import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# ENGLISH_STOP_WORDS is exposed by the public text module; the stop_words submodule is private in newer scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV

import lightgbm as lgb;

import random

from wordcloud import WordCloud, STOPWORDS

import statsmodels.discrete.discrete_model as dm
# workaround for a statsmodels problem missing chi2
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

eng_stopwords = set(stopwords.words("english"))

%matplotlib inline



In [2]:
events = pd.read_csv('raw_data/cleaned_up.csv')

In [3]:
types_test = pd.get_dummies(events['Type'])

events = events.drop(['Type'], axis=1)

type_names = types_test.columns.values

<a id='total'></a> 

# Totally Random Guessing

Random guessing gives a ROC-AUC score of about 0.5, corresponding to no separation between positive and negative predictions.

In [38]:
types_pred = pd.DataFrame(columns=type_names)

for i in tqdm(range(len(types_test))):
    types_pred = types_pred.append({name:random.uniform(0, 1) for name in type_names}, ignore_index=True)

types_pred.head()

100%|███████████████████████████████████████████████████████████████████████████| 20165/20165 [00:45<00:00, 440.83it/s]


Unnamed: 0,Adult-oriented,Care/Support,Class/Workshop,Fire,Food,Game,Gathering/Party,Kid-friendly,Other,Parade,Performance,Ritual/Ceremony
0,0.975835,0.390713,0.106307,0.341328,0.55925,0.500633,0.396881,0.407683,0.877327,0.552137,0.315615,0.230538
1,0.627725,0.573499,0.580961,0.848518,0.077781,0.702374,0.11262,0.066988,0.107543,0.983907,0.814209,0.857691
2,0.635816,0.031868,0.506356,0.924487,0.898475,0.656403,0.907343,0.069315,0.614324,0.807869,0.392421,0.855286
3,0.541222,0.417662,0.961777,0.931613,0.673746,0.293533,0.974955,0.37394,0.938506,0.801623,0.644234,0.494834
4,0.100472,0.877497,0.411177,0.883322,0.10549,0.107107,0.555637,0.733455,0.121749,0.675545,0.054966,0.217408
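
The row-by-row `append` loop above takes about 45 seconds, and `DataFrame.append` is no longer available in pandas 2.0 and later. A minimal sketch of an equivalent vectorized construction follows. The three type names, the row count, the prior `p`, and the seeded generator are illustrative stand-ins, not values from the notebook.

```python
import numpy as np
import pandas as pd

type_names = ['Food', 'Game', 'Parade']   # stand-in for the 12 real event types
n_events = 5                              # stand-in for len(types_test)

# Uniform random scores: one draw per (event, type), built in a single call.
rng = np.random.default_rng(0)            # seeded only so the sketch is repeatable
random_pred = pd.DataFrame(rng.uniform(0, 1, size=(n_events, len(type_names))),
                           columns=type_names)

# Prior-weighted scores: the same row of priors repeated for every event.
p = np.array([0.5, 0.3, 0.2])             # hypothetical prior, one entry per type
prior_pred = pd.DataFrame(np.tile(p, (n_events, 1)), columns=type_names)
```

Both frames have one row per event and one column per type, so they can be passed to the same scoring loop unchanged.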


In [83]:
scores = []

for name in type_names:
    score = roc_auc_score(types_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.49593694351676354
Std Dev ROC-AUC: 0.008949933671384422


Weighted by the prior event-type distribution, every event receives the same score for a given type, so all predictions are tied and the ROC-AUC is exactly 0.5.

In [84]:
types_pred = pd.DataFrame(columns=type_names)

p = types_test.sum().values/types_test.sum().values.sum()

for i in tqdm(range(len(types_test))):
    types_pred = types_pred.append({name:p[i] for i, name in enumerate(type_names)}, ignore_index=True)

types_pred.head()

100%|███████████████████████████████████████████████████████████████████████████| 20165/20165 [00:44<00:00, 450.14it/s]


Unnamed: 0,Adult-oriented,Care/Support,Class/Workshop,Fire,Food,Game,Gathering/Party,Kid-friendly,Other,Parade,Performance,Ritual/Ceremony
0,0.05574,0.036896,0.318324,0.008034,0.041557,0.047012,0.261939,0.018001,0.070469,0.013935,0.077213,0.05088
1,0.05574,0.036896,0.318324,0.008034,0.041557,0.047012,0.261939,0.018001,0.070469,0.013935,0.077213,0.05088
2,0.05574,0.036896,0.318324,0.008034,0.041557,0.047012,0.261939,0.018001,0.070469,0.013935,0.077213,0.05088
3,0.05574,0.036896,0.318324,0.008034,0.041557,0.047012,0.261939,0.018001,0.070469,0.013935,0.077213,0.05088
4,0.05574,0.036896,0.318324,0.008034,0.041557,0.047012,0.261939,0.018001,0.070469,0.013935,0.077213,0.05088


In [85]:
scores = []

for name in type_names:
    score = roc_auc_score(types_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.5
Std Dev ROC-AUC: 0.0
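
The exact 0.5 follows from how ROC-AUC treats ties. ROC-AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with tied pairs counting one half. The sketch below implements that pairwise definition in plain Python. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 0, 1, 0]

constant_scores = [0.3] * len(labels)            # same score for every event
informative_scores = [0.9, 0.2, 0.4, 0.7, 0.1]   # positives always score higher

auc_constant = pairwise_auc(labels, constant_scores)
auc_informative = pairwise_auc(labels, informative_scores)
```

With a constant score every pair is a tie, so `auc_constant` is 0.5 regardless of the class balance, while the perfectly ordered scores give 1.0.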


<a id='simple'></a>

# Simple Rule-Based Classification

Let's see what ROC-AUC we can achieve using an extremely simple rule-based classification scheme, based on findings from the exploratory data analysis.

This simple rule system brings the average ROC-AUC up to 0.65, with a standard deviation of 0.08, a noticeable improvement over random guessing.

In [4]:
events["Description"] = (events["Description"].map(str) + ' ' + 
                         events["Title"].map(str) + ' ' + 
                         events["Hosted by Camp"].map(str) + ' ' + 
                         events["Location"].map(str))

events = events.drop(['Title', 'Hosted by Camp', 'Location'], axis=1)

In [5]:
adult_words  = ['adult', 'massage', 'sensual', 'erotic', 'sex', 'bdsm', 'pleasure']
care_words   = ['heal', 'massage', 'help', 'body']
class_words  = ['learn', 'workshop', 'practice', 'class']
fire_words   = ['fire', 'burn', 'spin', 'fuel', 'flame', 'light', 'flow']
food_words   = ['coffee', 'pickle', 'food', 'serv', 'fresh', 'bacon', 'cheese', 'delicious', 'pancake', 'tast']
game_words   = ['game', 'play', 'prize', 'race', 'tournament']
party_words  = ['party', 'dance', 'music', 'celebrate']
kids_words   = ['kid', 'scout']
parade_words = ['parade', 'march', 'tour']
perfor_words = ['perform', 'stage', 'live', 'show', 'audience']
ritual_words = ['ceremony', 'ritual', 'temple', 'sacred']

words = {'Adult-oriented': adult_words,
         'Care/Support':care_words,
         'Class/Workshop':class_words,
         'Fire':fire_words,
         'Food':food_words,
         'Game':game_words,
         'Gathering/Party':party_words,
         'Kid-friendly':kids_words,
         'Parade':parade_words,
         'Performance':perfor_words,
         'Ritual/Ceremony':ritual_words,
         'Other':[]}

def simple_classify(desc, words):
    return any([word in desc for word in words])

In [7]:
types_pred = pd.DataFrame(columns=type_names)

descriptions = events['Description'].values

for desc in tqdm(descriptions):
    types_pred = types_pred.append({name:simple_classify(desc, words[name]) for name in type_names}, ignore_index=True)
        
types_pred.head()

100%|███████████████████████████████████████████████████████████████████████████| 20165/20165 [01:16<00:00, 261.92it/s]


Unnamed: 0,Adult-oriented,Care/Support,Class/Workshop,Fire,Food,Game,Gathering/Party,Kid-friendly,Other,Parade,Performance,Ritual/Ceremony
0,False,False,False,True,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False


In [87]:
scores = []

for name in type_names:
    score = roc_auc_score(types_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.6527309204301003
Std Dev ROC-AUC: 0.08677021943808555
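
One detail of the rule worth keeping in mind: `word in desc` is a case-sensitive substring test, so a description written in capitals will not trigger a lowercase keyword. The sketch below shows a case-insensitive variant. `simple_classify_ci` is a hypothetical helper and the description is a made-up example, neither comes from the notebook.

```python
def simple_classify_ci(desc, words):
    """Case-insensitive variant of the substring rule (hypothetical helper)."""
    text = str(desc).lower()
    return any(word in text for word in words)

fire_words = ['fire', 'burn', 'spin', 'fuel', 'flame', 'light', 'flow']
desc = 'FIRE SPINNING at sunset'              # made-up description

case_sensitive_hit = any(word in desc for word in fire_words)
case_insensitive_hit = simple_classify_ci(desc, fire_words)
other_hit = simple_classify_ci(desc, [])      # empty word list, as for 'Other'
```

Because `any` of an empty list is `False`, the 'Other' column is constant under either version of the rule, which pins its ROC-AUC at 0.5.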


<a id='complex'></a>

# More Complex Feature Engineering

In [6]:
def convert_12_to_24(time):
    if 'a.m.' in time:
        time = time.replace(' a.m.', '')
        if ':' not in time:
            time = time + ':00'    
    elif 'p.m.' in time:
        time = time.replace(' p.m.', '')
        if ':' not in time:
            if '12' in time:
                time = time + ':00'
            else:
                time = str(int(time)+12) + ':00'
        elif '12' in time:
            pass
        else:
            time_split = time.split(':')
            time = str(int(time_split[0])+12) + ':' + time_split[1]
    elif 'midnight' in time:
        time = '23:45'
    elif 'noon' in time:
        time = '12:00'
            
    return time

def get_time_diff(df):      
    times = [];

    for row in tqdm(df.values): 
        if row == '0':
            times.append(np.nan);
        elif row == 'All Day':
            times.append((datetime.datetime.strptime('23:59', '%H:%M')-datetime.datetime.strptime('00:00', '%H:%M')).total_seconds()/3600)
        else:
            split = row.split(' – ');
            
            split[0] = convert_12_to_24(split[0])
            split[1] = convert_12_to_24(split[1])

            times.append((datetime.datetime.strptime(split[1], '%H:%M')-datetime.datetime.strptime(split[0], '%H:%M')).total_seconds()/3600)
    
    return times;

times_1 = pd.DataFrame(get_time_diff(events['Sunday']),    columns=['Event Length'])
times_2 = pd.DataFrame(get_time_diff(events['Monday']),    columns=['Event Length'])
times_3 = pd.DataFrame(get_time_diff(events['Tuesday']),   columns=['Event Length'])
times_4 = pd.DataFrame(get_time_diff(events['Wednesday']), columns=['Event Length'])
times_5 = pd.DataFrame(get_time_diff(events['Thursday']),  columns=['Event Length'])
times_6 = pd.DataFrame(get_time_diff(events['Friday']),    columns=['Event Length'])
times_7 = pd.DataFrame(get_time_diff(events['Saturday']),  columns=['Event Length'])
times_8 = pd.DataFrame(get_time_diff(events['Sunday2']),   columns=['Event Length'])
times_9 = pd.DataFrame(get_time_diff(events['Monday2']),   columns=['Event Length'])

times = times_1.fillna(times_2).fillna(times_3).fillna(times_4).fillna(times_5).fillna(times_6).fillna(times_7).fillna(times_8).fillna(times_9)

events['Event Length'] = abs(times)
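
The hand-written branches in `convert_12_to_24` can be fragile. For example, a time such as `1:12 p.m.` contains `12` and so appears to skip the 12-hour shift. A minimal alternative sketch delegates the parsing to `datetime.strptime`. The helper names `to_24h` and `hours_between` are hypothetical, the input formats are assumed from the code above, and midnight is mapped to `23:45` only to mirror the notebook's choice.

```python
from datetime import datetime

def to_24h(label, midnight='23:45'):
    """Convert labels such as '9 a.m.', '1:12 p.m.', 'noon' to 'HH:MM' (hypothetical helper)."""
    label = label.strip().lower()
    if label == 'noon':
        return '12:00'
    if label == 'midnight':
        return midnight
    cleaned = label.replace('a.m.', 'AM').replace('p.m.', 'PM')
    fmt = '%I:%M %p' if ':' in cleaned else '%I %p'
    return datetime.strptime(cleaned, fmt).strftime('%H:%M')

def hours_between(span, sep=' – '):
    """Length in hours of a 'start – end' span, using the same separator as the notebook."""
    start, end = (datetime.strptime(to_24h(t), '%H:%M') for t in span.split(sep))
    return abs((end - start).total_seconds()) / 3600

example_length = hours_between('9 a.m. – 11:30 a.m.')
```

The `%I` and `%p` directives handle the 12 a.m. and 12 p.m. edge cases without special-casing them.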


In [7]:
# Convert Days to Simple Binary (Lose Time of Day Information)

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday2', 'Monday2'];

events[days] = (events[days] != '0').astype(int)

In [8]:
events['Times Repeated'] = 0

for day in days:
    events['Times Repeated'] += events[day]

In [9]:
# Convert Contact Email, URL, Located at Art to binary flags (1 = field is missing, 0 = field is present)

events['Contact Email']  = pd.isnull(events['Contact Email']).values.astype(int)
events['URL']            = pd.isnull(events['URL']).values.astype(int)
events['Located at Art'] = pd.isnull(events['Located at Art']).values.astype(int)

Let's first build a classifier with simple engineered features (no TF-IDF) and see how it performs.

In [10]:
#######################
# FEATURE ENGINEERING #
#######################

def engineer_feature(series, func, normalize=True):
    feature = series.apply(func)   
    if normalize:
        feature = pd.Series(z_normalize(feature.values.reshape(-1,1)).reshape(-1,))
    feature.name = func.__name__ 
    return feature
def engineer_features(series, funclist, normalize=True):
    features = pd.DataFrame()
    for func in funclist:
        print(str(func))
        feature = engineer_feature(series, func, normalize)
        features[feature.name] = feature
    return features

##################
### Normalizer ###
##################

scaler = StandardScaler()
def z_normalize(data):
    scaler.fit(data)
    return scaler.transform(data)   
def count_words(x, words):
    count = 0
    for word in words:
        count += len(re.findall(word, str(x)))
    return count
    
################
### Features ###
################

def uppercase_freq(x):
    return len(re.findall(r'[A-Z]', x))/len(x)
def sentence_count(x):
    return len(re.findall("\n", str(x)))+1
def word_count(x):
    return len(str(x).split())
def unique_word_count(x):
    return len(set(str(x).split()))
def count_letters(x):
    return len(str(x))
def count_punctuations(x):
    return len([c for c in str(x) if c in string.punctuation])
def count_words_title(x):
    return len([w for w in str(x).split() if w.istitle()])
def count_stopwords(x):
    return len([w for w in str(x).lower().split() if w in eng_stopwords])
def mean_word_len(x):
    words = [len(w) for w in str(x).split()]
    if len(words) == 0:
        return 0
    else:
        return np.mean(words)

##################################
### Category-Specific Features ###
##################################

def count_kids_words(x):
    return count_words(x, ['kid', 'scout'])
def count_party_words(x):
    return count_words(x, ['party', 'dance', 'music', 'celebrate'])
def count_adult_words(x):
    return count_words(x, ['adult', 'massage', 'sensual', 'erotic', 'sex', 'bdsm', 'pleasure'])
def count_game_words(x):
    return count_words(x, ['game', 'play', 'prize', 'race', 'tournament'])
def count_ritual_words(x):
    return count_words(x, ['ceremony', 'ritual', 'temple', 'sacred']) 
def count_care_words(x):
    return count_words(x, ['heal', 'massage', 'help', 'body'])
def count_class_words(x):
    return count_words(x, ['learn', 'workshop', 'practice', 'class'])
def count_performance_words(x):
    return count_words(x, ['perform', 'stage', 'live', 'show', 'audience'])
def count_food_words(x):
    return count_words(x, ['coffee', 'pickle', 'food', 'serv', 'fresh', 'bacon', 'cheese', 'delicious', 'pancake', 'tast'])
def count_fire_words(x):
    return count_words(x, ['fire', 'burn', 'spin', 'fuel', 'flame', 'light', 'flow'])
def count_parade_words(x):
    return count_words(x, ['parade', 'march', 'tour'])

############################
### Sentimental Features ###
############################

sia = SIA();
def sentiment_compound(x):
    polarity = sia.polarity_scores(x)
    return polarity['compound']       
def sentiment_negative(x):
    polarity = sia.polarity_scores(x)
    return polarity['neg']       
def sentiment_neutral(x):
    polarity = sia.polarity_scores(x)
    return polarity['neu']       
def sentiment_positive(x):
    polarity = sia.polarity_scores(x)
    return polarity['pos']       
        
########################
### Derived Features ###
########################

def unique_word_ratio(x):
    wc = word_count(x)   
    if wc == 0:
        return 0
    else:
        return unique_word_count(x)/wc
def percent_ratio(x):
    wc = word_count(x)
    if wc == 0:
        return 0
    else:
        return count_punctuations(x)/wc
def words_per_sentence(x):
    sc = sentence_count(x)
    if sc == 0:
        return 0
    else:
        return word_count(x)/sc

In [11]:
feature_functions = [uppercase_freq, sentence_count, word_count, unique_word_count, count_letters, count_punctuations, 
                     count_words_title, count_stopwords, mean_word_len, count_kids_words, count_party_words, 
                     count_adult_words, count_game_words, count_ritual_words, count_care_words,
                     count_class_words, count_performance_words, count_food_words, count_fire_words, count_parade_words,
                     unique_word_ratio, percent_ratio, words_per_sentence,
                     sentiment_compound, sentiment_negative, sentiment_positive, sentiment_neutral]

features = [f.__name__ for f in feature_functions]

F_train = engineer_features(events['Description'].fillna(''), feature_functions, normalize=False)

X_handFeatures = F_train[features].values  # .as_matrix() was removed from pandas; .values returns the same array

<function uppercase_freq at 0x0000024CA7ACCC80>
<function sentence_count at 0x0000024CA7ACC0D0>
<function word_count at 0x0000024CA7ACC2F0>
<function unique_word_count at 0x0000024CA7ACC378>
<function count_letters at 0x0000024CA7ACC510>
<function count_punctuations at 0x0000024CA7ACC400>
<function count_words_title at 0x0000024CA7ACC488>
<function count_stopwords at 0x0000024CA7ACC8C8>
<function mean_word_len at 0x0000024CA7ACCBF8>
<function count_kids_words at 0x0000024CA7BD31E0>
<function count_party_words at 0x0000024CA7BD3048>
<function count_adult_words at 0x0000024CA7BD30D0>
<function count_game_words at 0x0000024CA7BD32F0>
<function count_ritual_words at 0x0000024CA79F1840>
<function count_care_words at 0x0000024CA79F1488>
<function count_class_words at 0x0000024CA79F17B8>
<function count_performance_words at 0x0000024CA79F1158>
<function count_food_words at 0x0000024CA79F1F28>
<function count_fire_words at 0x0000024CA79F1EA0>
<function count_parade_words at 0x0000024CA79F1E18>

In [12]:
basic_features = ['Contact Email', 'URL', 'Located at Art', 'Event Length']

X = sparse.csr_matrix(hstack((X_handFeatures, events[days].values, events[basic_features].values)))

print(shape(X))

(20165, 40)


In [84]:
types_test.head()

Unnamed: 0,Adult-oriented,Care/Support,Class/Workshop,Fire,Food,Game,Gathering/Party,Kid-friendly,Other,Parade,Performance,Ritual/Ceremony
0,0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, types_test, test_size=0.5, stratify=types_test)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(10082, 40)
(10083, 40)
(10082, 12)
(10083, 12)


Predicting hard 0/1 labels gives a ROC-AUC only a little better than random, because thresholded predictions collapse the ROC curve to a single operating point.

In [86]:
types_pred = y_test.copy()

classifiers = []

for name in tqdm(type_names):
    clf = LogisticRegression().fit(X_train, y_train[name])
    
    types_pred[name] = clf.predict(X_test)
    
    classifiers.append(clf)

100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:03<00:00,  3.92it/s]


In [113]:
scores = []

for name in type_names:
    score = roc_auc_score(y_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.5627325728279179
Std Dev ROC-AUC: 0.05981430712187226
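
The low score for hard labels can be made precise. Writing $\hat{y}^{+}$ and $\hat{y}^{-}$ for the predicted label of a randomly chosen positive and negative event, and TPR and TNR for the true positive and true negative rates (notation introduced here, not in the original notebook), ROC-AUC counts correctly ordered pairs plus half of the tied pairs:

$$
\begin{aligned}
\mathrm{AUC} &= P(\hat{y}^{+} > \hat{y}^{-}) + \tfrac{1}{2}\,P(\hat{y}^{+} = \hat{y}^{-}) \\
&= \mathrm{TPR}\cdot\mathrm{TNR} + \tfrac{1}{2}\left[\mathrm{TPR}\,(1-\mathrm{TNR}) + (1-\mathrm{TPR})\,\mathrm{TNR}\right] \\
&= \tfrac{1}{2}\left(\mathrm{TPR} + \mathrm{TNR}\right)
\end{aligned}
$$

So with 0/1 predictions the score reduces to balanced accuracy at one fixed threshold. For rare event types, where a default threshold yields few positive predictions, TPR is small and the score stays near 0.5.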


Predicting the probability of each class instead raises the average ROC-AUC to almost 0.8.

In [115]:
types_pred = y_test.copy()

classifiers = []

for name in tqdm(type_names):
    clf = LogisticRegression().fit(X_train, y_train[name])
    
    types_pred[name] = clf.predict_proba(X_test)[:, 1]  # column 1 is P(label == 1); columns follow clf.classes_
    
    classifiers.append(clf)

100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:03<00:00,  3.90it/s]


In [116]:
scores = []

for name in type_names:
    score = roc_auc_score(y_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.7951847309055863
Std Dev ROC-AUC: 0.06622114708620694
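
A note on `predict_proba`: it returns one column per class, ordered as in `clf.classes_`, so for a 0/1 target the first column is P(0) and the second is P(1). The positive-class probability therefore has to be picked out explicitly. The sketch below shows this on a tiny made-up dataset. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem: larger x means the label is more likely to be 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)                     # shape (n_samples, 2)
positive_column = list(clf.classes_).index(1)    # look up where class 1 lives
p_positive = proba[:, positive_column]           # P(label == 1) per sample
```

Looking the column up through `clf.classes_` avoids relying on the column order by convention.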


<a id='statsmodels'></a>

# Assessing Feature Importance with StatsModels

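We can use the StatsModels package to analyze the relative importance of the different features in an (unregularized) logistic regression fit. StatsModels reports the largest p-values for (1) whether the event occurred on the first Sunday, (2) the words per sentence, (3) whether the event was on Saturday, and (4) the overall sentence count. Note that a large p-value means a coefficient is not statistically distinguishable from zero, so these are the variables with the weakest evidence of an effect in this fit; the clearly influential variables are those with small p-values, such as `uppercase_freq` in the summary below. Several other variables also have sizeable p-values, while the majority have p-values close to zero. The fit also reports a log-likelihood of -inf, so these p-values should be read with caution.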

In [16]:
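# Names of the 40 columns in X_train, in the order they were stacked
stats_feature_names = features + days + basic_features

ystats = y_train['Class/Workshop']
Xstats = pd.DataFrame(X_train.toarray(), columns=stats_feature_names)
Xstats['target'] = ystats.values

print(np.shape(ystats))
print(np.shape(Xstats))

In [19]:
# Fit an unregularized logit on all 40 features (no intercept column is added)
lr = dm.Logit(Xstats['target'], Xstats[stats_feature_names])

result = lr.fit()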

In [22]:
print(result.summary())

  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


                           Logit Regression Results                           
Dep. Variable:                 target   No. Observations:                10082
Model:                          Logit   Df Residuals:                    10042
Method:                           MLE   Df Model:                           39
Date:                Thu, 17 May 2018   Pseudo R-squ.:                    -inf
Time:                        22:55:36   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                   -1.3385e+06
                                        LLR p-value:                     1.000
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
uppercase_freq             -5.4078      0.923     -5.858      0.000      -7.217      -3.599
sentence_count              0.0066      0.012      0.543      0.587      -0.017       0.031
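
The two source lines printed above the summary, `1/(1+np.exp(-X))` and `np.log(self.cdf(...))`, look like the tails of numerical warnings, and they suggest one plausible reason for the `-inf` log-likelihood: for a large negative linear predictor the exponential overflows, the logistic CDF rounds to 0, and its logarithm is `-inf`. The sketch below reproduces that effect with NumPy and contrasts it with a numerically stable form based on `np.logaddexp`. The values in `z` are made up, and this is an illustration of the mechanism, not a diagnosis of the actual fit.

```python
import numpy as np

# Hypothetical linear predictors for two observations; -800 is extreme on purpose.
z = np.array([-800.0, 2.0])

# Naive form, as in the printed source lines: log(1 / (1 + exp(-z))).
with np.errstate(over='ignore', divide='ignore'):
    naive_loglike = np.sum(np.log(1 / (1 + np.exp(-z))))

# Stable form: log(sigmoid(z)) = -log(1 + exp(-z)) = -logaddexp(0, -z).
stable_loglike = np.sum(-np.logaddexp(0, -z))
```

Unscaled features such as raw letter counts make extreme linear predictors more likely, so standardizing the inputs before the fit may be worth trying.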


<a id='tfidf'></a>

# With TF-IDF Vectorization

Now let's include more complex TF-IDF Bag of Words type features to see how much that improves performance.

In [13]:
count_vect_desc  = CountVectorizer(stop_words='english', min_df=40,  ngram_range=(1, 3), analyzer='word')

X = count_vect_desc.fit_transform(events['Description'].values);

transformer = TfidfTransformer()
X = transformer.fit_transform(X)

iX_desc  = X.shape[1]

print(X.shape)

(20165, 3097)
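
`CountVectorizer` followed by `TfidfTransformer` is the two-step form of what scikit-learn also offers as a single `TfidfVectorizer`. The sketch below checks that the two routes agree on a tiny made-up corpus. The three documents are illustrative, and `min_df=40` is left out because it would discard every term in such a small corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

docs = ['fire spinning workshop',
        'pancake breakfast party',
        'fire ceremony at the temple']   # made-up descriptions

# Two-step route, as in the cell above.
counts = CountVectorizer(stop_words='english', ngram_range=(1, 3)).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same settings.
one_step = TfidfVectorizer(stop_words='english', ngram_range=(1, 3)).fit_transform(docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())
```

Keeping the two steps separate is still useful when the raw counts are needed elsewhere.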


In [15]:
basic_features = ['Contact Email', 'URL', 'Located at Art', 'Event Length']

desc_length = shape(X)[1]

print(shape(X));
print(shape(events[days].values));
print(shape(events[basic_features].values));

all_feature_names = ['str('+name+')' for name in count_vect_desc.get_feature_names()] + features + days + basic_features

# Stack the sparse TF-IDF block with the dense feature blocks without densifying the TF-IDF matrix
X = sparse.hstack((X, X_handFeatures, events[days].values, events[basic_features].values)).tocsr()

print(shape(X))

(20165, 3137)


In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, types_test, test_size=0.5, stratify=types_test)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(10082, 3137)
(10083, 3137)
(10082, 12)
(10083, 12)


Including text features helps even more. With hard 0/1 predictions, the average ROC-AUC rises to about 0.62.

In [25]:
types_pred = y_test.copy()

classifiers = []

for name in tqdm(type_names):
    clf = LogisticRegression().fit(X_train, y_train[name])
    
    types_pred[name] = clf.predict(X_test)
    
    classifiers.append(clf)

100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:10<00:00,  1.19it/s]


In [26]:
scores = []

for name in type_names:
    score = roc_auc_score(y_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.62457459237169
Std Dev ROC-AUC: 0.0949513392850482


With predicted probabilities, the average ROC-AUC is now all the way up to 0.87.

In [128]:
types_pred = y_test.copy()

classifiers = []

for name in tqdm(type_names):
    clf = LogisticRegression().fit(X_train, y_train[name])
    
    types_pred[name] = clf.predict_proba(X_test)[:, 1]  # column 1 is P(label == 1); columns follow clf.classes_
    
    classifiers.append(clf)

100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:10<00:00,  1.19it/s]


In [129]:
scores = []

for name in type_names:
    score = roc_auc_score(y_test[name], types_pred[name])
    scores.append(score)
    
print('Average ROC-AUC: ' + str(np.mean(scores)))
print('Std Dev ROC-AUC: ' + str(np.std(scores)))

Average ROC-AUC: 0.8679557634221976
Std Dev ROC-AUC: 0.06612030338248001
