# OkNLP

This notebook walks through the analysis pipeline we used in our project. It shows an example of how we clustered essays using nonnegative matrix factorization (NMF). We manually inspect the NMF output to determine the best number of clusters for each group. Then, we create word clouds for specific groups and demographic splits.

## Imports and Settings

In [1]:
import random
import warnings

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.metrics.pairwise import cosine_similarity

from utils.clean_up import *
from utils.categorize_demographics import *
from utils.nonnegative_matrix_factorization import nmf_inspect, nmf_labels
from utils.distinctive_tokens import log_odds_ratio
from utils.classification import betas

warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
mpl.rc('savefig', dpi=300)
params = {'figure.dpi' : 300,
          'axes.axisbelow' : True,
          'lines.antialiased' : True}

plt.rcParams.update(params)

In [3]:
# Map each essay column name to its prompt
essay_dict = {'essay0' : 'My self summary',
              'essay1' : 'What I\'m doing with my life',
              'essay2' : 'I\'m really good at',
              'essay3' : 'The first thing people notice about me',
              'essay4' : 'Favorite books, movies, tv, food',
              'essay5' : 'The six things I could never do without',
              'essay6' : 'I spend a lot of time thinking about',
              'essay7' : 'On a typical Friday night I am',
              'essay8' : 'The most private thing I am willing to admit',
              'essay9' : 'You should message me if'}

## Data Cleaning

First, we read in the data and recategorize some of the demographic information. This gives us two separate DataFrames, one for essay0 and one for essay4.

In [4]:
df = pd.read_csv('data/profiles.20120630.csv')

essay_list = ['essay0', 'essay4']
df_0, df_4 = clean_up(df, essay_list)

df_0 = recategorize(df_0)
df_4 = recategorize(df_4)

## Clustering

For each essay, we convert the users' responses into a TF-IDF matrix and then cluster the profiles with NMF, using 25 clusters per essay. Note that the clustering cell below takes a while to run.

In [5]:
K = 25

In [6]:
count_matrix, tfidf_matrix, vocab = col_to_data_matrix(df_4, 'essay4', remove_stopwords=True)
df_4['group'] = nmf_labels(tfidf_matrix, K)
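
The project helpers `col_to_data_matrix` and `nmf_labels` are not shown in this notebook. As a rough illustration of the idea only, the sketch below builds a TF-IDF matrix with scikit-learn, factorizes it with NMF, and assigns each document to the component with the largest weight. The function name `cluster_with_nmf`, the toy essays, and the argmax labelling rule are all assumptions made for this example; the real helpers may work differently.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_with_nmf(texts, n_clusters, seed=0):
    """Return one hard cluster label per document (hypothetical helper)."""
    # Rows are documents, columns are vocabulary terms.
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(texts)

    # Factorize tfidf ~ W @ H; W holds each document's weight on each component.
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=seed)
    doc_weights = model.fit_transform(tfidf)

    # Assumption: a document belongs to its highest-weight component.
    return doc_weights.argmax(axis=1)


# Toy stand-in for the essay4 column (invented text, not project data).
toy_essays = [
    'books novels and poetry',
    'reading novels and old books',
    'pizza sushi and tacos',
    'cooking tacos and pizza at home',
]
labels = cluster_with_nmf(toy_essays, n_clusters=2)
print(labels)
```

With real profile data, the number of components would be set to `K` rather than 2.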




## Cosine Similarity

In [7]:
tfidf_0 = tfidf_matrix[np.array(df_4.drugs=='no'), :]
tfidf_1 = tfidf_matrix[np.array(df_4.drugs=='yes'), :]

In [8]:
tfidf_known = np.vstack((tfidf_0.mean(axis=0), tfidf_1.mean(axis=0)))
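
The cells above build one mean TF-IDF vector (a centroid) per known answer, and the cells that follow assign each profile to whichever centroid it is most similar to under cosine similarity. The sketch below shows that same pattern on a small dense toy matrix; the array values and the names `X` and `is_group_a` are invented for illustration and are not project data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix: four "profiles", three "terms" (invented values).
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.2],
              [0.0, 0.8, 0.3]])
is_group_a = np.array([True, True, False, False])

# One centroid per group, stacked so that row 0 is group A and row 1 is group B.
centroids = np.vstack((X[is_group_a].mean(axis=0),
                       X[~is_group_a].mean(axis=0)))

# sims[i, j] is the cosine similarity between profile i and centroid j.
sims = cosine_similarity(X, centroids)

# Index of the most similar centroid for each profile.
nearest = sims.argmax(axis=1)
print(nearest)
```

Note that, as in the notebook, the profiles being scored here also contributed to the centroids, so the agreement rates are in-sample figures.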

### No drugs

In [9]:
drugs_no = cosine_similarity(tfidf_0, tfidf_known)

In [10]:
pd.Series(np.argmax(drugs_no, axis=1)).value_counts()

0    20004
1     9393
dtype: int64

In [11]:
pd.Series(np.argmax(drugs_no, axis=1)).value_counts(normalize=True)

0    0.680478
1    0.319522
dtype: float64

### Yes drugs

In [12]:
drugs_yes = cosine_similarity(tfidf_1, tfidf_known)

In [13]:
pd.Series(np.argmax(drugs_yes, axis=1)).value_counts()

1    4542
0    2317
dtype: int64

In [14]:
pd.Series(np.argmax(drugs_yes, axis=1)).value_counts(normalize=True)

1    0.662196
0    0.337804
dtype: float64

### Unknown drugs

In [15]:
tfidf_2 = tfidf_matrix[np.array(df_4.drugs=='unknown'), :]

In [16]:
drugs_unknown = cosine_similarity(tfidf_2, tfidf_known)

In [17]:
pd.Series(np.argmax(drugs_unknown, axis=1)).value_counts()

1    6693
0    5156
dtype: int64

In [18]:
pd.Series(np.argmax(drugs_unknown, axis=1)).value_counts(normalize=True)

1    0.564858
0    0.435142
dtype: float64