# Finding the most widely used, multi-contextual words


To build the third function, "Context Detector", I will extract word-sense associations from a BERT model. 
But I certainly do not want to search for the contextual meaning of every word that a user could possibly use. Therefore, in this notebook, I will build a dictionary of words that Enron employees use commonly and, potentially, in different contexts. To do so, I will use email topic labels, which were hand-coded by CMU students (available from https://data.world/brianray/enron-email-dataset).


* 1) Load libraries and define functions
* 2) Import data: email data and labeled data
* 3) Calculate TF-IDF scores for each word in the company-wide email corpus, to select words that occur frequently across many people's accounts
* 4) Using the topic labels data, calculate topic-level term frequency
    * Join the labeled data with TF-IDF results
* 5) Build and store a dictionary of words that are widely used (according to TF-IDF scores) and multi-contextual (that appear across all categories, quite frequently).



## 1. Loading libraries and defining functions

In [1]:
import pandas as pd
import numpy as np
from subprocess import check_output
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer 
from nltk.util import ngrams 
from nltk.probability import FreqDist
import os, re, nltk, string

def email_clean(text):
    text = re.sub(r'\n--.*?\n', '', text, flags=re.DOTALL)
    text = re.sub(r'enron.com', '', text, flags=re.DOTALL)
    text = re.sub(r'Forwarded by.*?Subject:', '', text, flags=re.DOTALL) 
    text = re.sub(r'Fwd:.*?Subject:', '', text, flags=re.DOTALL) 
    text = re.sub(r'Fw:.*?Subject:', '', text, flags=re.DOTALL)     
    text = re.sub(r'FW:.*?Subject:', '', text, flags=re.DOTALL)         
    text = re.sub(r'Forwarded:.*?Subject:', '', text, flags=re.DOTALL)         
    text = re.sub(r'From:.*?Subject:', '', text, flags=re.DOTALL)
    text = re.sub(r'PM', '', text, flags=re.DOTALL)
    text = re.sub(r'AM', '', text, flags=re.DOTALL)
    
    return text

def clean(text):
    stop = set(stopwords.words('english'))
    stop.update(("to","cc","subject","http","from","sent",
                 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 
                 "enron america corp", "enron", "etc", "na"))
    exclude = set(string.punctuation) 
    lemma = WordNetLemmatizer()
    porter= PorterStemmer()
    
    text=text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    #stem = " ".join(porter.stem(token) for token in normalized.split())
    
    return normalized
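
To make the regular expressions above more concrete, here is a small sketch on a made-up message (the sample text and the two helper names are invented for illustration). It repeats one forwarding-header pattern from `email_clean` and the letters-only step from `clean`, using only `re`, so it runs without the NLTK data.

```python
import re

# Invented sample message; not taken from the Enron corpus.
sample = (
    "Please review the attached.\n"
    "---------------------- Forwarded by Jane Doe/HOU/ECT on 03/15/2001 ---\n"
    "To: John Smith\n"
    "Subject: Q2 numbers\n"
    "The numbers look fine to me."
)

def strip_forward_header(text):
    # Non-greedy match with DOTALL removes everything from "Forwarded by"
    # up to and including the first following "Subject:".
    return re.sub(r'Forwarded by.*?Subject:', '', text, flags=re.DOTALL)

def letters_only(text):
    # Same idea as in clean(): replace every non-letter with a space,
    # then lowercase and split on whitespace.
    return re.sub(r'[^a-zA-Z]', ' ', text).lower().split()

tokens = letters_only(strip_forward_header(sample))
print(tokens)
```

Note that "Q2" is reduced to the single letter "q". Fragments like this are one plausible reason why one-letter terms such as "e" show up among the most frequent terms later in the notebook.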

## 2. Importing data 

In [2]:
path_to_email_data = 'C:/Users/Margeum/Dropbox/DS projects/05. Email data/emails_in_csv'
os.chdir(path_to_email_data)
emails_df = pd.read_csv('emails_parsed.csv')
labeled_emails_df = pd.read_csv("enron_05_17_2015_with_labels_v2.csv") # Labeled data 
address_user_df = pd.read_csv('./address_user_df.csv')



## 3. Calculating TF-IDF scores 

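Calculate word frequencies at the user-account level (document-level frequencies, where all emails in a user's account are treated as one document).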
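
As a small illustration of the per-account counting done in the cell below, here is a sketch with `collections.Counter` on invented data: all tokens of one account are pooled into a single document and counted. The account names, tokens, and the helper `account_term_frequencies` are made up; the notebook itself uses `nltk.FreqDist` over 1-grams, which is why its terms appear as one-element tuples.

```python
from collections import Counter

# Hypothetical cleaned emails per account (tokens already lowercased).
emails_by_user = {
    "user-a": [["power", "price", "power"], ["meeting", "price"]],
    "user-b": [["meeting", "power"]],
}

def account_term_frequencies(emails_by_user):
    """Pool every email of an account into one document and count its tokens."""
    tf_by_user = {}
    for user, emails in emails_by_user.items():
        pooled = [token for email in emails for token in email]
        tf_by_user[user] = Counter(pooled)
    return tf_by_user

tf_by_user = account_term_frequencies(emails_by_user)
print(tf_by_user["user-a"]["power"])  # 2
print(tf_by_user["user-b"]["price"])  # 0 (a Counter returns 0 for unseen terms)
```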

In [3]:
tf_list = []

for index, row in address_user_df.iterrows():
    
    text_cleaned_i = []

    user_i = row['user']
    print(user_i)
    
    lastname = row['address'].split('@')[0].split('.')[-1]
    firstname = str(row['address'].split('.')[0])

    text_to_clean_df_i = emails_df[emails_df["user"] == user_i][["content", "user"]].reset_index()
        
    for text in text_to_clean_df_i['content']:
        text_cleaned_i.append(clean(email_clean(text)).split())

    unlisted_text_cleaned_i = [item for sublist in text_cleaned_i for item in sublist]
    freqdist_user_i = nltk.FreqDist(ngrams(unlisted_text_cleaned_i, 1))

    tf_list.append(freqdist_user_i)    


allen-p
arnold-j
arora-h
badeer-r
bailey-s
bass-e
baughman-d
beck-s
blair-l
brawner-s
buy-r
campbell-l
carson-m
cash-m
causholli-m
corman-s
cuilla-m
dasovich-j
davis-d
dean-c
delainey-d
derrick-j
dickson-s
donoho-l
donohoe-t
dorland-c
ermis-f
farmer-d
fischer-m
forney-j
fossum-d
gang-l
gay-r
geaccone-t
germany-c
giron-d
griffith-j
grigsby-m
guzman-m
haedicke-m
hain-m
harris-s
hayslett-r
heard-m
hendrickson-s
hernandez-j
hodge-j
holst-k
horton-s
hyatt-k
hyvl-d
jones-t
kaminski-v
kean-s
keavey-p
keiser-k
king-j
kitchen-l
kuykendall-t
lavorato-j
lay-k
lenhart-m
lewis-a
lokay-m
lokey-t
love-p
lucci-p
maggi-m
mann-k
martin-t
may-l
mccarty-d
mcconnell-m
mckay-b
mckay-j
mclaughlin-e
meyers-a
motley-m
neal-s
nemec-g
panus-s
parks-j
pereira-s
perlingiere-d
pimenov-v
platter-p
presto-k
quenet-j
quigley-d
rapp-b
reitmeyer-j
richey-c
ring-a
ring-r
rogers-b
ruscitti-k
sager-e
saibi-e
salisbury-h
sanchez-m
sanders-r
scholtes-d
schoolcraft-d
schwieger-j
scott-s
semperger-c
shackleton-s
shankman-j
sha

Next, calculate word frequencies for the whole email corpus (these corpus-level counts are used for the IDF term).

In [7]:
idf_text_cleaned_i = []

for text in emails_df["content"]:
    idf_text_cleaned_i.append(clean(email_clean(text)).split())

idf_unlisted_text_cleaned_i = [item for sublist in idf_text_cleaned_i for item in sublist]
idf_list = nltk.FreqDist(ngrams(idf_unlisted_text_cleaned_i, 1))

df_idf = pd.DataFrame.from_dict(idf_list, orient='index')
df_idf.columns = ['Frequency']
df_idf.index.name = 'Term'
df_idf.sort_values(by = "Frequency", ascending = False)
df_idf.head()

Term,Frequency
"(jurek,)",2
"(wefc,)",1
"(caqigapifbgbfagecbgdzbqyaxwibagya,)",1
"(qaaadfa,)",1
"(rjizidi,)",2


Now merge the corpus-level frequencies and account-level frequencies by using word as key. Then, calculate the TF-IDF scores.
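
For reference, the textbook TF-IDF weights a term's count in a document by log(N / df), where N is the number of documents and df is the number of documents that contain the term. The cell below uses a different score (the log of the corpus-wide count divided by the vocabulary size plus one), so the following sketch is only a point of comparison on invented counts, not the method used in this notebook.

```python
import math
from collections import Counter

# Invented per-account term counts (each account is one "document").
tf_by_user = {
    "user-a": Counter({"power": 3, "please": 2}),
    "user-b": Counter({"please": 1, "venturewire": 4}),
    "user-c": Counter({"please": 5, "power": 1}),
}

def tf_idf(tf_by_user):
    """Textbook TF-IDF: count * log(N / document frequency)."""
    n_docs = float(len(tf_by_user))
    doc_freq = Counter()
    for counts in tf_by_user.values():
        doc_freq.update(counts.keys())  # each account counts once per term
    return {
        user: {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        for user, counts in tf_by_user.items()
    }

scores = tf_idf(tf_by_user)
print(scores["user-a"]["please"])       # 0.0, the term is in every account
print(scores["user-b"]["venturewire"])  # 4 * log(3)
```

Under the textbook definition, a word used by every account scores zero, which is the opposite of what a dictionary of widely used words needs. That may be why the score in this notebook rewards corpus-wide frequency instead.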

In [9]:
tf_idf_stacked = df_idf

for i in range(len(address_user_df)):
    my_td = pd.DataFrame.from_dict(tf_list[i], orient='index')
    my_td.columns = ['Frequency']
    my_td.index.name = 'Term'

    tf_idf_i = my_td.reset_index().set_index('Term').join(df_idf.reset_index().set_index('Term'), on= 'Term', how='left',
                                              lsuffix='_left', rsuffix='_right')

    tf_idf_i["idf"] =  np.log(tf_idf_i['Frequency_right'])/(len(df_idf)+1)
    tf_idf_i["tf_idf"] = tf_idf_i['Frequency_left']*tf_idf_i['idf']
    tf_idf_i.sort_values(by = "tf_idf", ascending = False)

    tf_idf_i = pd.DataFrame(tf_idf_i["tf_idf"])
    tf_idf_i.columns = tf_idf_i.columns + "_" + str(i)

    tf_idf_stacked = tf_idf_stacked.join(tf_idf_i, on = 'Term', how ='left', lsuffix='_left', rsuffix  ='right')


col_list= list(tf_idf_stacked)
col_list.remove('Frequency')


tf_idf_stacked['mean'] = tf_idf_stacked[col_list].mean(axis=1)
tf_idf_stacked['variance'] = tf_idf_stacked[col_list].var(axis=1)
tf_idf_stacked['count_notnull'] = tf_idf_stacked[col_list].count(axis=1)
tf_idf_stacked.sort_values(by = "mean", ascending = False).head(10)


Term,Frequency,tf_idf_0,tf_idf_1,tf_idf_2,tf_idf_3,tf_idf_4,tf_idf_5,tf_idf_6,tf_idf_7,tf_idf_8,...,tf_idf_130,tf_idf_131,tf_idf_132,tf_idf_133,tf_idf_134,tf_idf_135,tf_idf_136,mean,variance,count_notnull
"(com,)",664513,0.064602,0.136463,0.035238,0.040803,0.002539,0.53649,0.230609,0.045009,0.04188,...,0.064295,0.027775,0.041008,0.075271,0.031134,0.030083,0.006822,0.121391,0.054427,137
"(ect,)",595113,0.051099,0.114941,0.002162,0.004349,7.6e-05,0.283041,0.022942,0.84246,0.000153,...,0.013989,0.049141,0.000127,0.0117,0.007758,0.003001,,0.109993,0.067947,134
"(please,)",379986,0.04296,0.06943,0.014132,0.014648,0.009904,0.09349,0.05719,0.243459,0.049375,...,0.021529,0.045836,0.014574,0.029836,0.028927,0.022144,0.009462,0.065944,0.010998,137
"(e,)",378364,0.030563,0.067662,0.011867,0.027738,0.00258,0.136676,0.05555,0.097857,0.021375,...,0.029532,0.018377,0.008624,0.027591,0.021989,0.012677,0.006756,0.065836,0.022772,137
"(would,)",325382,0.035473,0.071772,0.008231,0.011096,0.003156,0.080464,0.0236,0.194751,0.042077,...,0.030666,0.018186,0.010902,0.008887,0.012868,0.019618,0.005609,0.056304,0.018686,137
"(power,)",309093,0.019926,0.049984,0.005755,0.024375,0.002442,0.010519,0.041472,0.068072,0.011486,...,0.025996,0.043382,0.009769,0.006045,0.009262,0.015549,0.005634,0.05338,0.047698,137
"(hou,)",273226,0.024497,0.050358,0.001461,0.00249,,0.157157,0.010776,0.302318,0.000623,...,0.007567,0.020162,,0.004837,0.003712,0.000479,,0.052184,0.012847,122
"(energy,)",298781,0.010515,0.073026,0.00287,0.021464,0.006077,0.010877,0.022019,0.054794,0.020162,...,0.031497,0.007524,0.00574,0.016279,0.011624,0.009888,0.003087,0.051578,0.030421,137
"(company,)",288719,0.010054,0.122422,0.005604,0.010342,0.005147,0.015994,0.022055,0.07201,0.030377,...,0.084517,0.005772,0.005243,0.041657,0.007793,0.007312,0.001756,0.049604,0.025251,137
"(venturewire,)",8506,,,1.7e-05,,,,,,,...,,,,,,,,0.048956,0.007185,3


In [20]:
print (str(len(tf_idf_stacked["mean"])) + ' unique words are in the email corpus')

522762 unique words are in the email corpus


We need to trim this complete list. First, we will keep only the top 1 percent of words by mean TF-IDF score. In other words, we will keep the words that are most important in terms of their frequency across different accounts.
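
The percentile cut can be pictured on a toy example. This is a minimal sketch with invented scores and a hand-written `quantile` helper (an assumption for illustration) that uses linear interpolation, the default interpolation of pandas' `quantile`. The toy case keeps the top 20 percent rather than the top 1 percent so that the result is visible with ten terms.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Invented mean scores for ten terms.
mean_scores = {"com": 0.12, "ect": 0.11, "please": 0.066, "power": 0.053,
               "jurek": 0.0001, "wefc": 0.0001, "lower": 0.003,
               "member": 0.0097, "taken": 0.002, "summer": 0.005}

cutoff = quantile(mean_scores.values(), 0.8)  # 80th percentile in this toy case
kept = {term for term, score in mean_scores.items() if score > cutoff}
print(sorted(kept))  # ['com', 'ect']
```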

In [24]:
mean_val_cut_99 = tf_idf_stacked["mean"].quantile(.99)
df_mean_val_cut_99 = tf_idf_stacked[tf_idf_stacked['mean'] > mean_val_cut_99]

print (str(len(df_mean_val_cut_99)) + ' unique words are in this list of top 1 percent words')
print ('Top 1 percent words account for ' + 
       '{:.1f}'.format(100 * float(df_mean_val_cut_99["Frequency"].sum())/float(tf_idf_stacked["Frequency"].sum())) +
       '% of all word occurrences')

5156 unique words are in this list of top 1 percent words
Top 1 percent words account for 79.9% of all word occurrences


In [29]:
df_mean_val_cut_99.sort_values(by = "mean", ascending = False).head(10)

Term,Frequency,tf_idf_0,tf_idf_1,tf_idf_2,tf_idf_3,tf_idf_4,tf_idf_5,tf_idf_6,tf_idf_7,tf_idf_8,...,tf_idf_130,tf_idf_131,tf_idf_132,tf_idf_133,tf_idf_134,tf_idf_135,tf_idf_136,mean,variance,count_notnull
"(com,)",664513,0.064602,0.136463,0.035238,0.040803,0.002539,0.53649,0.230609,0.045009,0.04188,...,0.064295,0.027775,0.041008,0.075271,0.031134,0.030083,0.006822,0.121391,0.054427,137
"(ect,)",595113,0.051099,0.114941,0.002162,0.004349,7.6e-05,0.283041,0.022942,0.84246,0.000153,...,0.013989,0.049141,0.000127,0.0117,0.007758,0.003001,,0.109993,0.067947,134
"(please,)",379986,0.04296,0.06943,0.014132,0.014648,0.009904,0.09349,0.05719,0.243459,0.049375,...,0.021529,0.045836,0.014574,0.029836,0.028927,0.022144,0.009462,0.065944,0.010998,137
"(e,)",378364,0.030563,0.067662,0.011867,0.027738,0.00258,0.136676,0.05555,0.097857,0.021375,...,0.029532,0.018377,0.008624,0.027591,0.021989,0.012677,0.006756,0.065836,0.022772,137
"(would,)",325382,0.035473,0.071772,0.008231,0.011096,0.003156,0.080464,0.0236,0.194751,0.042077,...,0.030666,0.018186,0.010902,0.008887,0.012868,0.019618,0.005609,0.056304,0.018686,137
"(power,)",309093,0.019926,0.049984,0.005755,0.024375,0.002442,0.010519,0.041472,0.068072,0.011486,...,0.025996,0.043382,0.009769,0.006045,0.009262,0.015549,0.005634,0.05338,0.047698,137
"(hou,)",273226,0.024497,0.050358,0.001461,0.00249,,0.157157,0.010776,0.302318,0.000623,...,0.007567,0.020162,,0.004837,0.003712,0.000479,,0.052184,0.012847,122
"(energy,)",298781,0.010515,0.073026,0.00287,0.021464,0.006077,0.010877,0.022019,0.054794,0.020162,...,0.031497,0.007524,0.00574,0.016279,0.011624,0.009888,0.003087,0.051578,0.030421,137
"(company,)",288719,0.010054,0.122422,0.005604,0.010342,0.005147,0.015994,0.022055,0.07201,0.030377,...,0.084517,0.005772,0.005243,0.041657,0.007793,0.007312,0.001756,0.049604,0.025251,137
"(venturewire,)",8506,,,1.7e-05,,,,,,,...,,,,,,,,0.048956,0.007185,3


The last column of the df_mean_val_cut_99 dataframe, count_notnull, shows that some of these top-1-percent words appear in only a few accounts (for example, "venturewire" appears in just 3). These words may be important to the users who use them, but they are less likely to be multi-context words. Therefore, I will remove them from the list. 

In [35]:
count_null_cut = 0.5
print (str(count_null_cut*100) + '% of words have more than ' + str(df_mean_val_cut_99["count_notnull"].quantile(count_null_cut)) + ' accounts')

50.0% of words have more than 105.0 accounts


Although 50% is an arbitrary cut, we will use that as our cut point for now. 
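
The account-coverage filter can be sketched on invented data as follows. `None` plays the role of the missing values in the tf_idf columns, and the helper name and threshold are assumptions chosen for this toy case only.

```python
# Invented per-account scores; None marks accounts where the term never appears.
scores_by_term = {
    "please": [0.04, 0.07, 0.01, 0.02],
    "power": [0.02, None, 0.01, 0.03],
    "venturewire": [None, None, 0.05, None],
}

def account_coverage(scores):
    """Number of accounts in which the term has a score (i.e. appears)."""
    return sum(score is not None for score in scores)

coverage = {term: account_coverage(s) for term, s in scores_by_term.items()}
min_accounts = 2  # arbitrary threshold for this toy example
widely_used = sorted(t for t, n in coverage.items() if n > min_accounts)

print(coverage)      # {'please': 4, 'power': 3, 'venturewire': 1}
print(widely_used)   # ['please', 'power']
```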

In [39]:
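# .copy() makes keywords_dictionary an independent dataframe, so columns can be added to it later without a SettingWithCopyWarning
keywords_dictionary = df_mean_val_cut_99[df_mean_val_cut_99["count_notnull"]> df_mean_val_cut_99["count_notnull"].quantile(count_null_cut)].copy()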
print ('We now have ' + str(len(keywords_dictionary)) + ' words in the list')

We now have 2571 words in the list


In [None]:
#len(keywords_dictionary)
#keywords_dictionary.to_csv('keywords_dictionary.csv', index=True)

## 4. Count topic-level word frequencies 

Now, I want to figure out whether the keywords in our dictionary appear across different topic areas. The labeled_emails_df has email topic labels, but only for a subset of emails. Using message IDs, I will merge the labels with our keywords dictionary so that we can further trim down the dictionary.  

In [47]:
labeled_emails_df.head()
#labeled_emails_df["labeled"].describe()
labeled_messages = labeled_emails_df[labeled_emails_df["labeled"]==True]
print ('The dataset provides topic labels for ' + str(len(labeled_messages)) + ' emails')
#labeled_messages.head()

The dataset provides topic labels for 1702 emails


Unnamed: 0.1,Unnamed: 0,Message-ID,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,...,Cat_10_level_1,Cat_10_level_2,Cat_10_weight,Cat_11_level_1,Cat_11_level_2,Cat_11_weight,Cat_12_level_1,Cat_12_level_2,Cat_12_weight,labeled
379,379,<9831685.1075855725804.JavaMail.evans@thyme>,2001-03-15 14:45:00,frozenset({'phillip.allen@enron.com'}),frozenset({'todd.burke@enron.com'}),Re: Confidential Employee Information/Lenhart,Phillip K Allen,Todd Burke,,,...,,,,,,,,,,True
381,381,<21041312.1075855725847.JavaMail.evans@thyme>,2001-03-15 14:11:00,frozenset({'phillip.allen@enron.com'}),frozenset({'kim.bolton@enron.com'}),RE: PERSONAL AND CONFIDENTIAL COMPENSATION INF...,Phillip K Allen,Kim Bolton,,,...,,,,,,,,,,True
2139,2139,<5907100.1075858639941.JavaMail.evans@thyme>,2001-06-20 17:04:51,frozenset({'k..allen@enron.com'}),"frozenset({'matt.smith@enron.com', 'matthew.le...",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Lenhart, Matthew </O=ENRON/OU=NA/CN=RECIPIENTS...",,,...,,,,,,,,,,True
2140,2140,<26625142.1075858639964.JavaMail.evans@thyme>,2001-06-20 17:09:00,frozenset({'k..allen@enron.com'}),"frozenset({'matt.smith@enron.com', 'matthew.le...",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Lenhart, Matthew </O=ENRON/OU=NA/CN=RECIPIENTS...",,,...,,,,,,,,,,True
2232,2232,<19730598.1075858642129.JavaMail.evans@thyme>,2001-08-09 12:30:58,frozenset({'k..allen@enron.com'}),"frozenset({'matt.smith@enron.com', 'm..tholt@e...",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Smith, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=M...",,,...,,,,,,,,,,True


From the label information, we will use "primary topics", which correspond to level_1 == 3, with level_2 ranging from 1 to 13. (A full description of the topic labels is available here: https://data.world/brianray/enron-email-dataset). I focus on primary topics because they are relevant to the company's business and strategies (as opposed to personal emails or administrative/editing/etc.).

In [53]:
# Cat_1 is the first category that a human coder identified. Our category of interest, 3, can appear in any of the 12 level-1 columns: Cat_1_level_1, Cat_2_level_1, ..., Cat_12_level_1.
# So we keep every message that has a 3 in any of those columns. The sub-category (i.e., level 2 category) then sits in the associated level_2 column.

level_1_columns = ["Cat_%d_level_1" % k for k in range(1, 13)]
is_primary = (labeled_messages[level_1_columns].fillna(0.0).astype(int) == 3).any(axis=1)
# .copy() so that later column assignments do not act on a slice of labeled_messages
labeled_messages_red = labeled_messages[is_primary].copy()
print (str(len(labeled_messages_red)) + ' messages are labeled as primary topics')

879 messages are labeled as primary topics


In [82]:
cat_columns = ["Cat_1_level_1", "Cat_1_level_2", 
              "Cat_2_level_1", "Cat_2_level_2",
              "Cat_3_level_1", "Cat_3_level_2", 
              "Cat_4_level_1", "Cat_4_level_2", 
              "Cat_5_level_1", "Cat_5_level_2", 
              "Cat_6_level_1", "Cat_6_level_2", 
              "Cat_7_level_1", "Cat_7_level_2", 
              "Cat_8_level_1", "Cat_8_level_2", 
              "Cat_9_level_1", "Cat_9_level_2", 
              "Cat_10_level_1", "Cat_10_level_2", 
              "Cat_11_level_1", "Cat_11_level_2", 
              "Cat_12_level_1", "Cat_12_level_2", 
              ]
#cat_columns
labeled_messages_red[cat_columns] = labeled_messages_red[cat_columns].fillna(value=0).astype(int)
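
The cell that builds `label_dict` is not shown in this notebook. Judging from the `label_dict_dt` output further down, it maps each Message-ID to a list of topic codes and the email text. Below is a minimal sketch of one way such a mapping could be built. The rows, the `content` field, and the helper `primary_topics` are all assumptions, and only two of the twelve Cat_k column pairs are shown.

```python
# Invented rows standing in for labeled_messages_red.
rows = [
    {"Message-ID": "<id-1>", "content": "Greetings from London.",
     "Cat_1_level_1": 3, "Cat_1_level_2": 2, "Cat_2_level_1": 0, "Cat_2_level_2": 0},
    {"Message-ID": "<id-2>", "content": "Can we terminate the contract?",
     "Cat_1_level_1": 3, "Cat_1_level_2": 2, "Cat_2_level_1": 3, "Cat_2_level_2": 8},
    {"Message-ID": "<id-3>", "content": "Lunch on Friday?",
     "Cat_1_level_1": 1, "Cat_1_level_2": 1, "Cat_2_level_1": 0, "Cat_2_level_2": 0},
]

def primary_topics(row, n_slots=2, primary_level=3):
    """Collect 'level_1.level_2' codes for every slot whose level_1 is primary."""
    topics = []
    for k in range(1, n_slots + 1):
        if row["Cat_%d_level_1" % k] == primary_level:
            topics.append("%d.%d" % (primary_level, row["Cat_%d_level_2" % k]))
    return topics

label_dict = {}
for row in rows:
    topics = primary_topics(row)
    if topics:  # skip emails without any primary-topic label
        label_dict[row["Message-ID"]] = [topics, row["content"]]

print(label_dict["<id-2>"][0])  # ['3.2', '3.8']
```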


In [88]:
label_dict_dt = pd.DataFrame.from_dict(label_dict, orient = 'index', columns = ["topic", "content"])

We will now use these labels to check, for each topic, which dictionary words occur in it. To do so, unpack the topic list of each email: if the topic column contains 3.1, add the email's content to the 3.1 group, and so on for the other topics. Then, for each topic, record whether each dictionary word occurs in the group's text and attach the result as a column (topic_bin_3_1, and so on).

Note to self: the same check ("does this vocabulary appear in any of the documents?") may be reusable with BERT.
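
The grouping step can be sketched with a `defaultdict` on invented data. The stand-in `label_dict` and the helper `contents_by_topic` are assumptions for illustration; the cell below does the same thing with one explicit list per topic.

```python
from collections import defaultdict

# Invented stand-in for label_dict: Message-ID -> [topic codes, content].
label_dict = {
    "<id-1>": [["3.2"], "Greetings from London."],
    "<id-2>": [["3.2", "3.8"], "Can we terminate the contract?"],
    "<id-3>": [["3.10"], "Trading desk update."],
}

def contents_by_topic(label_dict):
    """Group email texts under every topic code they are labeled with."""
    grouped = defaultdict(list)
    for topics, content in label_dict.values():
        for topic in topics:
            grouped[topic].append(content)
    return grouped

grouped = contents_by_topic(label_dict)
topic_text = {topic: " ".join(texts).lower() for topic, texts in grouped.items()}

print(sorted(topic_text))      # ['3.10', '3.2', '3.8']
print("3.1" in topic_text)     # False
```

This relies on the topic codes being stored as a list of strings. If they were stored as one string such as "[3.10]", a test like `'3.1' in ...` would also match 3.10, 3.11 and 3.12.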

In [89]:
label_dict_dt

Unnamed: 0,topic,content
<10469240.1075863429356.JavaMail.evans@thyme>,[3.2],Greetings from London. What do you think about...
<14585290.1075842999386.JavaMail.evans@thyme>,[3.6],***Sent on behalf of Sandi Thompson*** To All ...
<21785136.1075846160406.JavaMail.evans@thyme>,[3.4],Thanks for the update. Congratulations on your...
<14717550.1075846177238.JavaMail.evans@thyme>,[3.4],Please post the JP MOrgan doc. on our site ---...
<5369418.1075846152944.JavaMail.evans@thyme>,[3.5],Let's process this request. I think it's justi...
<16848822.1075853125247.JavaMail.evans@thyme>,[3.7],"David, You asked me to provide my opinion abou..."
<6871897.1075858732063.JavaMail.evans@thyme>,[3.1],The EPSA leg. affairs cmt. met today during th...
<20011465.1075847624589.JavaMail.evans@thyme>,[3.5],yes Linda Robertson 03/01/2001 07:47 AM To: St...
<8348919.1075844026871.JavaMail.evans@thyme>,"[3.2, 3.8]",I've looked into whether we can terminate our ...
<33228374.1075851641742.JavaMail.evans@thyme>,[3.1],Another agenda item I got a call today from Bo...


In [107]:
topic_3_1_list = []
topic_3_2_list = []
topic_3_3_list = []
topic_3_4_list = []
topic_3_5_list = []
topic_3_6_list = []
topic_3_7_list = []
topic_3_8_list = []
topic_3_9_list = []
topic_3_10_list = []
topic_3_11_list = []
topic_3_12_list = []

for i in range(len(label_dict_dt)): 
    if '3.1' in label_dict_dt.iloc[i]['topic']:
        topic_3_1_list.append(label_dict_dt.iloc[i]['content'])
    if '3.2' in label_dict_dt.iloc[i]['topic']:
        topic_3_2_list.append(label_dict_dt.iloc[i]['content'])
    if '3.3' in label_dict_dt.iloc[i]['topic']:
        topic_3_3_list.append(label_dict_dt.iloc[i]['content'])
    if '3.4' in label_dict_dt.iloc[i]['topic']:
        topic_3_4_list.append(label_dict_dt.iloc[i]['content'])
    if '3.5' in label_dict_dt.iloc[i]['topic']:
        topic_3_5_list.append(label_dict_dt.iloc[i]['content'])
    if '3.6' in label_dict_dt.iloc[i]['topic']:
        topic_3_6_list.append(label_dict_dt.iloc[i]['content'])
    if '3.7' in label_dict_dt.iloc[i]['topic']:
        topic_3_7_list.append(label_dict_dt.iloc[i]['content'])
    if '3.8' in label_dict_dt.iloc[i]['topic']:
        topic_3_8_list.append(label_dict_dt.iloc[i]['content'])
    if '3.9' in label_dict_dt.iloc[i]['topic']:
        topic_3_9_list.append(label_dict_dt.iloc[i]['content'])
    if '3.10' in label_dict_dt.iloc[i]['topic']:
        topic_3_10_list.append(label_dict_dt.iloc[i]['content'])
    if '3.11' in label_dict_dt.iloc[i]['topic']:
        topic_3_11_list.append(label_dict_dt.iloc[i]['content'])
    if '3.12' in label_dict_dt.iloc[i]['topic']:
        topic_3_12_list.append(label_dict_dt.iloc[i]['content'])

For each word in our keyword dictionary, let's record whether it occurs in each of the topic documents. 
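
One caveat: the cells below test membership with Python's `in` on one long string per topic, which is a substring test, so a short term such as "e" or "port" also matches inside longer words. The following sketch contrasts that with a whole-word test on an invented sentence. The function names and the tokenizing regex are assumptions, and this variant is not what produced the numbers reported in this notebook.

```python
import re

topic_text = "please send the energy report before the e-mail deadline"
keywords = ["energy", "port", "e", "summer"]

def substring_flags(keywords, text):
    """1 if the keyword occurs anywhere in the text, even inside another word."""
    return [1 if word in text else 0 for word in keywords]

def whole_word_flags(keywords, text):
    """1 only if the keyword matches a complete token of the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in tokens else 0 for word in keywords]

print(substring_flags(keywords, topic_text))   # [1, 1, 1, 0]
print(whole_word_flags(keywords, topic_text))  # [1, 0, 1, 0]
```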

In [181]:
topic_3_1_str = ' '.join(topic_3_1_list).lower()
topic_3_2_str = ' '.join(topic_3_2_list).lower()
topic_3_3_str = ' '.join(topic_3_3_list).lower()
topic_3_4_str = ' '.join(topic_3_4_list).lower()
topic_3_5_str = ' '.join(topic_3_5_list).lower()
topic_3_6_str = ' '.join(topic_3_6_list).lower()
topic_3_7_str = ' '.join(topic_3_7_list).lower()
topic_3_8_str = ' '.join(topic_3_8_list).lower()
topic_3_9_str = ' '.join(topic_3_9_list).lower()
topic_3_10_str = ' '.join(topic_3_10_list).lower()
topic_3_11_str = ' '.join(topic_3_11_list).lower()
topic_3_12_str = ' '.join(topic_3_12_list).lower()

In [182]:
def topic_indicator(topic_str):
    # 1 if the dictionary word occurs in the topic's text, else 0.
    # Index entries are 1-tuples such as ('power',), hence term[0].
    return [1 if term[0] in topic_str else 0 for term in keywords_dictionary.index]

topic_bin_3_1 = topic_indicator(topic_3_1_str)
topic_bin_3_2 = topic_indicator(topic_3_2_str)
topic_bin_3_3 = topic_indicator(topic_3_3_str)
topic_bin_3_4 = topic_indicator(topic_3_4_str)
topic_bin_3_5 = topic_indicator(topic_3_5_str)
topic_bin_3_6 = topic_indicator(topic_3_6_str)
topic_bin_3_7 = topic_indicator(topic_3_7_str)
topic_bin_3_8 = topic_indicator(topic_3_8_str)
topic_bin_3_9 = topic_indicator(topic_3_9_str)
topic_bin_3_10 = topic_indicator(topic_3_10_str)
topic_bin_3_11 = topic_indicator(topic_3_11_str)
topic_bin_3_12 = topic_indicator(topic_3_12_str)
          

In [188]:
print(len(topic_bin_3_11))
print(sum(topic_bin_3_11))
type(topic_bin_3_1)
#keywords_dictionary.head()

2571
1521


list

In [190]:
keywords_dictionary['topic_bin_3_1']= topic_bin_3_1
keywords_dictionary['topic_bin_3_2']= topic_bin_3_2
keywords_dictionary['topic_bin_3_3']= topic_bin_3_3
keywords_dictionary['topic_bin_3_4']= topic_bin_3_4
keywords_dictionary['topic_bin_3_5']= topic_bin_3_5
keywords_dictionary['topic_bin_3_6']= topic_bin_3_6
keywords_dictionary['topic_bin_3_7']= topic_bin_3_7
keywords_dictionary['topic_bin_3_8']= topic_bin_3_8
keywords_dictionary['topic_bin_3_9']= topic_bin_3_9
keywords_dictionary['topic_bin_3_10']= topic_bin_3_10
keywords_dictionary['topic_bin_3_11']= topic_bin_3_11
keywords_dictionary['topic_bin_3_12']= topic_bin_3_12


In [193]:
topic_cat_list = ['topic_bin_3_1', 'topic_bin_3_2', 'topic_bin_3_3', 
                 'topic_bin_3_4', 'topic_bin_3_5', 'topic_bin_3_6',
                 'topic_bin_3_7', 'topic_bin_3_8', 'topic_bin_3_9', 
                 'topic_bin_3_10', 'topic_bin_3_11', 'topic_bin_3_12']

keywords_dictionary['count_topics'] = keywords_dictionary[topic_cat_list].sum(axis=1)
keywords_dictionary['count_topics'].describe()



count    2571.000000
mean       10.508751
std         2.021341
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        12.000000
Name: count_topics, dtype: float64

In [200]:
print ('Among ' + str(len(keywords_dictionary)) +
       ' words in the keyword dictionary, ' + 
       str(len(keywords_dictionary[keywords_dictionary['count_topics'] == 12])) +
       ' appear in all of the primary category')

Among 2571 words in the keyword dictionary, 1168 appear in all of the primary category


Let's keep those 1168 words as our final keywords of interest. 

In [203]:
final_keywords_dictionary = keywords_dictionary[keywords_dictionary['count_topics'] == 12][['Frequency', 'mean']]
final_keywords_dictionary.head()

Term,Frequency,mean
"(lower,)",20720,0.002863
"(member,)",62784,0.009659
"(taken,)",16616,0.002193
"(summer,)",35574,0.005372
"(act,)",21014,0.002895


## 5. Store the key word dictionary 

In [204]:
final_keywords_dictionary.to_csv('final_keywords_dictionary.csv', index=True)