This notebook walks through the workflow for the functions we have implemented. It shows an example of how we clustered essay text using Nonnegative Matrix Factorization (NMF). We manually inspect the NMF output to determine the best number of clusters for each group.
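The implementation of `nmf_inspect` is not shown in this notebook, so the following is only a minimal sketch of the general idea, assuming NumPy is available. It swaps in a plain multiplicative-update NMF (the function names `nmf_multiplicative` and `top_terms` and the toy matrix are hypothetical), factors a small document-term matrix for several group counts, and prints the top terms per group so they can be compared by eye.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (docs x terms) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared (Frobenius) reconstruction error
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def top_terms(H, vocab, n_top=3):
    """Return the n_top highest-weighted vocabulary terms for each group."""
    return [[vocab[j] for j in np.argsort(row)[::-1][:n_top]] for row in H]

# Toy document-term matrix (hypothetical data, not taken from the profiles)
vocab = ['music', 'art', 'hiking', 'camping']
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
])

# Try several group counts and inspect the top terms of each group
for k in (2, 3):
    W, H = nmf_multiplicative(V, k)
    error = float(np.linalg.norm(V - W @ H))
    print(k, 'reconstruction error:', round(error, 3))
    for g, terms in enumerate(top_terms(H, vocab)):
        print('Group %d:' % g, ' '.join(terms))

# Once a group count is chosen, label each document by its dominant group
W, H = nmf_multiplicative(V, 2)
labels = W.argmax(axis=1)
```

The reconstruction error is printed only as a rough aid. As in the rest of the notebook, the group count is chosen by reading the top terms.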

In [1]:
import pickle
import warnings
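import sys
import pandas as pd  # used below as pd.read_csv; imported explicitly rather than via the star imports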
from utils.hash import make
from utils.calculate_pmi_features import *
from utils.clean_up import *
from utils.categorize_demographics import *
from utils.reduce_dimensions import run_kmeans
from utils.nonnegative_matrix_factorization import nmf_inspect, nmf_labels
warnings.filterwarnings('ignore')

Getting the data, cleaning it, and categorizing demographic data

In [2]:
df = pd.read_csv('data/profiles.20120630.csv')

In [3]:
essay_list = ['essay0','essay4','essay5']
df_clean = clean_up(df, essay_list)


In [4]:
df_clean.fillna('', inplace=True)

In [5]:
df.columns.values

array(['username', 'age', 'body_type', 'diet', 'drinks', 'drugs',
       'education', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4',
       'essay5', 'essay6', 'essay7', 'essay8', 'essay9', 'ethnicity',
       'height', 'income', 'job', 'last_online', 'location', 'offspring',
       'orientation', 'pets', 'religion', 'sex', 'sign', 'smokes',
       'speaks', 'status', 'TotalEssays'], dtype=object)

In [6]:
df_clean['religion'] = df_clean['religion'].apply(religion_categories)
df_clean['job'] = df_clean['job'].apply(job_categories)
df_clean['drugs'] = df_clean['drugs'].apply(drug_categories)
df_clean['diet'] = df_clean['diet'].apply(diet_categories)
df_clean['body_type'] = df_clean['body_type'].apply(body_categories)
df_clean['drinks'] = df_clean['drinks'].apply(drink_categories)
df_clean['sign'] = df_clean['sign'].apply(sign_categories)
df_clean['ethnicity'] = df_clean['ethnicity'].apply(ethnicity_categories)
df_clean['pets'] = df_clean['pets'].apply(pets_categories)
df_clean['speaks'] = df_clean['speaks'].apply(language_categories)
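The `*_categories` helpers come from `utils.categorize_demographics` and are not shown here. As a rough illustration of the pattern, the sketch below collapses free-form answers into coarse categories. The function name `diet_categories_sketch`, its rule, and the example values are all hypothetical and may differ from what the real helpers do.

```python
def diet_categories_sketch(value):
    """Collapse a free-form diet answer into a coarse category (hypothetical rule)."""
    if not value:
        # Missing values were filled with '' earlier, so pass them through
        return ''
    # Keep the last word, dropping qualifiers such as 'mostly' or 'strictly'
    return value.strip().lower().split()[-1]

# Hypothetical example answers
examples = ['mostly vegetarian', 'strictly vegan', 'anything', '']
categories = [diet_categories_sketch(v) for v in examples]
print(categories)
```

A function of this shape can be passed to `Series.apply` in the same way as the helpers above, since it maps one cell value to one category.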

Splitting the dataframe by orientation, running clustering separately on each split and essay, and writing the NMF output to one text file per combination.

In [None]:
data_dict = dict()
splits = ['straight', 'gay', 'bisexual']
category = 'orientation'
essay = [0, 4, 5]

original_stdout = sys.stdout  # keep a handle so stdout can always be restored
for s in splits:
    data_dict[s] = df_clean[df_clean[category] == s]
    for e in essay:
        filename = '_'.join([category, s, str(e) + '.txt'])
        print(filename)
        count_matrix, tfidf_matrix, vocab = col_to_data_matrix(data_dict[s], 'essay' + str(e))
        # nmf_inspect prints its groups, so point stdout at the file only while it runs
        with open(filename, 'w') as fid:
            sys.stdout = fid
            try:
                tmp = nmf_inspect(tfidf_matrix, vocab)
            finally:
                sys.stdout = original_stdout
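Writing printed output to a file is safest when stdout is restored automatically. A minimal sketch of that pattern, assuming Python 3.4 or later for `contextlib.redirect_stdout`, is shown here. The `report` function is a hypothetical stand-in for any function that prints its results, and the file is written to a temporary directory rather than the working directory.

```python
import contextlib
import os
import tempfile

def report(k):
    """Stand-in for a function that prints its results (hypothetical)."""
    print('Group count:', k)

out_dir = tempfile.mkdtemp()
filename = os.path.join(out_dir, '_'.join(['orientation', 'straight', '0.txt']))

with open(filename, 'w') as fid:
    # stdout points at the file only inside this block
    with contextlib.redirect_stdout(fid):
        report(3)

print('wrote', filename)  # printed to the normal stdout again

with open(filename) as fid:
    contents = fid.read()
```

Because the redirect ends before the file is closed, later `print` calls and the kernel's own flushes never touch a closed file.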

Splitting the dataframe by gender, running clustering separately on each.

In [7]:
df_male = df_clean[df_clean['sex'] == 'm']

In [8]:
df_female = df_clean[df_clean['sex'] == 'f']

In [9]:
count_matrix_m, tfidf_matrix_m, vocab_m = col_to_data_matrix(df_male, 'essay0') #save out

In [11]:
count_matrix_f, tfidf_matrix_f, vocab_f = col_to_data_matrix(df_female, 'essay0')
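`col_to_data_matrix` lives in `utils.calculate_pmi_features` and is not shown here. The vocabulary it returns contains punctuation tokens and n-grams up to three tokens long (for example `'! ! !'` and `", i'm looking"`), so the following standard-library sketch builds matrices of that general kind. The helper names, the tokenizer, and the IDF variant are assumptions for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split lowercased text into words (keeping apostrophes) and punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text.lower())

def ngrams(tokens, max_n=3):
    """Return all 1- to max_n-grams, each joined with single spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def to_data_matrix(docs, max_n=3):
    """Build dense count and TF-IDF matrices plus a sorted vocabulary."""
    counts = [Counter(ngrams(tokenize(d), max_n)) for d in docs]
    vocab = sorted(set().union(*counts))
    count_matrix = [[c[t] for t in vocab] for c in counts]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = [sum(1 for c in counts if t in c) for t in vocab]
    # One common IDF variant (assumed here): log(N / df) + 1
    idf = [math.log(n_docs / d) + 1.0 for d in doc_freq]
    tfidf_matrix = [[tf * w for tf, w in zip(row, idf)] for row in count_matrix]
    return count_matrix, tfidf_matrix, vocab

# Hypothetical example essays
docs = ["i'm new here!", "i'm into hiking, music"]
count_matrix, tfidf_matrix, vocab = to_data_matrix(docs)
```

Dense lists are used only to keep the sketch dependency-free. For a corpus of real essays a sparse matrix would be the practical choice.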

In [10]:
vocab_m

['!',
 '! !',
 '! ! !',
 '! )',
 '! ) .',
 "! i'm",
 '! love',
 '"',
 '" "',
 '" ,',
 '" -',
 '" .',
 '%',
 '&',
 "'",
 "' s",
 '(',
 '( )',
 "( i'm",
 ')',
 ') ,',
 ') .',
 ") . i'm",
 ") i'm",
 '*',
 '* *',
 '+',
 ',',
 ', "',
 ', (',
 ', ,',
 ', .',
 ', ...',
 ', adventurous',
 ', art',
 ', art ,',
 ', believe',
 ', biking',
 ', biking ,',
 ', camping',
 ', camping ,',
 ", can't",
 ', caring',
 ', cooking',
 ', cooking ,',
 ', creative',
 ', creative ,',
 ', dancing',
 ', doing',
 ", don't",
 ', easy',
 ', eating',
 ', enjoy',
 ', especially',
 ', exploring',
 ', family',
 ', feel',
 ', food',
 ', friendly',
 ', friends',
 ', fun',
 ', fun ,',
 ', funny',
 ', funny ,',
 ', getting',
 ', going',
 ', good',
 ', great',
 ', hanging',
 ', happy',
 ', having',
 ', hiking',
 ', hiking ,',
 ', honest',
 ', honest ,',
 ", i'd",
 ", i'll",
 ", i'm",
 ", i'm looking",
 ", i've",
 ', intelligent',
 ', intelligent ,',
 ", it's",
 ', just',
 ', kind',
 ', know',
 ', learning',
 ', life',
 ', lik

In [12]:
nmf_inspect(tfidf_matrix_m, vocab_m)

3
Group 0:
, ( ) ) , ... ! music / : , i'm

Group 1:
- " - - ) ( . - . . " : ...

Group 2:
. i'm . i'm like love people life just . love . like


5
Group 0:
, ( ) ) , music / : ... , i'm music ,

Group 1:
. i'm . i'm like . like love don't life people just

Group 2:
" . " ... . ! ? ) ( , " " "

Group 3:
- - - . - ) ( : - i'm / . '

Group 4:
new ! san francisco san francisco bay moved . area years


7
Group 0:
. like love . like life . love people time don't things

Group 1:
, music music , , , , love good : , good movies hiking

Group 2:
san . new francisco san francisco bay moved area bay area years

Group 3:
" . " . , " , " " " . ? " , " -

Group 4:
- - - . - - i'm : . / , " - '

Group 5:
i'm . i'm , i'm . guy pretty looking i'm looking i'm pretty just

Group 6:
) ( ! ... ) . / ) , ? ! ! *


9
Group 0:
, music , , music , : art . , , good , . , love

Group 1:
. like . like don't life think want know time . don't

Group 2:
- - - . - - i'm : / " - ' ? ;

Group 3:
i'm . i'm , i'm . pret

In [14]:
nmf_inspect(tfidf_matrix_f, vocab_f)

3
Group 0:
, ( ) ) , music : , love ... ! good

Group 1:
. i'm love . i'm like . love people life . like just

Group 2:
- " - - ) ( ! . - ... . " :


5
Group 0:
, ( ) : ) , music , love music , . /

Group 1:
. i'm like . i'm love . love . like life don't people

Group 2:
" . " . ... , * , " ) ? (

Group 3:
- - - . - ( ) : - i'm / - love .

Group 4:
! new love ... i'm bay san area moved francisco


7
Group 0:
, ( ) ) , : music , love music , / good

Group 1:
. like love . love . like life people don't time want

Group 2:
i'm . i'm . , i'm i'm looking pretty i've i'm pretty , looking

Group 3:
! ... ! ! ! ! ! love ) ( just ? ..

Group 4:
- - - . - ( ) : / - i'm - love ) .

Group 5:
new . bay san area san francisco francisco love moved bay area

Group 6:
" . " . , " , " " * ) ( ?


9
Group 0:
, music , love music , good , good : friends , dancing friends

Group 1:
. like love . love . like life people time want don't

Group 2:
- - - . - - i'm - love : / , & ?

Group 3:
" . " . , " " " , ?