# 60% Task 2
Kian Parnis

## Introduction

This jupiter notebook deals with the first part of the task given, that being:

1. Processing the emails
2. Calculating term weights using TF/IDF
3. Extracting the highest-weighted n of the terms for each correspondent.
4. Clustering the Users
5. Clustering the Terms
6. Exportation

For the second part, a folder named 'Flask Server' shall be provided with a text file instructing on how to set up the server through command line.

In [None]:
import os
from email.parser import Parser
from tqdm import tqdm          
from nltk.tokenize import RegexpTokenizer #used to tokenize into words and remove punctuation
from nltk.corpus import stopwords
import numpy as np #matrix
from collections import Counter
import pandas as pd
import random
import json
import pickle

from tqdm import tqdm #extra function to monitor lengthy calls 
import time

### Processing the emails

In order to process each email a class was created that would store each emails individual body paragraph, the sender and each receiver.

In [2]:
class Emails:
    def __init__(self, fromSender, toReceiver, body):
        self.fromSender = fromSender
        self.toReceiver = toReceiver
        self.body = body
        
    def __str__(self):
        return '{} \n {} \n {}'.format(self.fromSender,self.toReceiver,self.body) #Self created to display contents

The following goes through each email contatined in the given 'maildir' folder with all the emails and starts parsing each email.

Emails are parsed using the 'email.parser' packaged to recieve the 'to', 'from' and body from each email, a list is maintained that will be used later on that houses each unique email address from a sender.

In [5]:
rootDir = "maildir"
allEmailAddresses = []
allEmails = []
seen = set()
for directory, subdirectory, files in os.walk(rootDir):
    for file in files:
        with open(os.path.join(directory, file), "r") as f:
            contents = f.read()
        parse = Parser().parsestr(contents)
        send = parse['from']
        
        if parse['to']:
            recieve = parse['to']
            recieve = "".join(recieve.split())
            recieve = recieve.split(",")       
        else:
            recieve = "empty"
        
        body = parse.get_payload()
        
        if recieve!= "empty":
            mail = Emails(send,recieve,body)
            if send not in seen: #storing unique emails
                seen.add(send)
                allEmailAddresses.append(send)
            #for email in recieve:
            #    if email not in seen:
            #        seen.add(email)
            #        allEmailAddresses.append(email)
            allEmails.append(mail)          

The following is the creation of each TWO distinct sender/reciever documents.
A loop will go through every email and create a dictionary that for each receiever in a email will create a pair with there respective email.<br><br>
As an example lets say we an email which has as a sender 'x', recievers 'y' 'z' with some body 'b'.<br>
These will be divided into a dict:<br>
(x,y)(b) and (x,z)(b)<br><br>
If the same pair where to be present again throughout any other email then that email will be appended to the document.

In [4]:
documents = {}
for mail in allEmails:
    for recieve in mail.toReceiver:
        firstType = (mail.fromSender, recieve)
        secondType = (recieve, mail.fromSender)
        if(firstType in documents.keys()):
            documents[mail.fromSender, recieve] += mail.body
        elif(secondType in documents.keys()):
            documents[recieve, mail.fromSender] += mail.body
        else:
            documents[mail.fromSender, recieve] = mail.body   

### Tokenization

The following functions where used in the 40% IR project and are reused to tokenise, case fold, stemm and perform stop word removal on every document created.

Due to the size of 'mirdir' being that of ~250k emails, after the process was finished the 'pickle' package was used to save the tokenised documents in an Intermidiate file 'AllEmails.json'.

In [5]:
tokenizer = RegexpTokenizer(r'\w+')

def tokenise(string): 
 tokens_document = tokenizer.tokenize(string)
 return tokens_document

In [6]:
def caseFold (list): #function which handles casefolding
 casefoldedList = [word.casefold() for word in list]
    
 return casefoldedList 

In [7]:
def stopWordRemoval(list): #stopwords is used from nltk and stopWordRemoval is called to remove stop words from the passed document
 newlist = []
 stopWords = set(stopwords.words('english'))
    
 for word in list:
    if word not in stopWords:
        newlist.append(word)

 return newlist


In [8]:
from nltk.stem import PorterStemmer
def stemm(list): #function which handles a documents stemming
  Stem = PorterStemmer() 
  stemmList = [Stem.stem(word) for word in list] 
  return stemmList

Skip this step to the next for the already processed documents.

In [9]:
LENGTH = len(documents)
pbar = tqdm(total=LENGTH) # Number of iterations required to fill pbar

for document in documents:
    documents[document] = tokenise(documents[document])
    documents[document] = caseFold(documents[document])
    documents[document] = stopWordRemoval(documents[document])
    documents[document] = stemm(documents[document])
    pbar.update(n=1) # Increments counter

100%|██████████████████████████████████████████████████████████████████████▉| 251686/251699 [3:40:36<00:00, 129.53it/s]

Saving all the unique emails externally so the functions before dont need to be reran.

In [12]:
#with open('AllEmails.json', 'w') as outfile:
#       json.dump(allEmailAddresses, outfile)

In [2]:
with open('AllEmails.json') as json_file:
    AllEmails = json.load(json_file)
    print(AllEmails)

['heather.dunton@enron.com', 'anchordesk_daily@anchordesk.zdlists.com', 'subscriptions@intelligencepress.com', 'prizemachine@feedback.iwon.com', 'louise.kitchen@enron.com', 'arsystem@mailman.enron.com', 'exclusive_offers@sportsline.com', 'hunter.williams@grandecom.com', 'richard.morgan@austinenergy.com', 'jsmith@austintx.com', 'wise.counsel@lpl.com', 'renee.ratcliff@enron.com', 'msimpkins@winstead.com', 'gthorse@keyad.com', 'monica.l.brown@accenture.com', 'david.port@enron.com', 'webmaster@earnings.com', 'delivers@amazon.com', 'james.bruce@enron.com', 'ei_editor@ftenergy.com', 'kathryn.sheppard@enron.com', 'c..gossett@enron.com', 'iwon@info.iwon.com', 'mondohed@gte.net', 'enron_update@concureworkplace.com', 'steven.matthews@ubspw.com', 'kirk.mcdaniel@enron.com', 'mery.l.brown@accenture.com', 'gthorse@about-cis.com', "ryan.o'rourke@enron.com", 'leanne@integrityrs.com', 'melissaspears@open2win.oi3.net', 'jwills3@swbell.net', 'michelle.akers@enron.com', 'greg.whalley@enron.com', 'software

To avoid overwriting the saving of the document file has been commented out.

In [16]:
#with open('tokenIntermediary.pkl', 'wb') as f: 
#    pickle.dump(documents, f)

In [3]:
with open('tokenIntermediary.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)

The following displays the content of the first document in the dict to give a better representation of how its stored.

In [4]:
dict_pairs = loaded_dict.items()
pairs_iterator = iter(dict_pairs)
first_pair = next(pairs_iterator)

print(first_pair)

(('heather.dunton@enron.com', 'k..allen@enron.com'), ['pleas', 'let', 'know', 'still', 'need', 'curv', 'shift', 'thank', 'heather', 'origin', 'messag', 'allen', 'phillip', 'k', 'sent', 'friday', 'decemb', '07', '2001', '5', '14', 'dunton', 'heather', 'subject', 'west', 'posit', 'heather', 'attach', 'file', 'email', 'origin', 'messag', 'dunton', 'heather', 'sent', 'wednesday', 'decemb', '05', '2001', '1', '43', 'pm', 'allen', 'phillip', 'k', 'belden', 'tim', 'subject', 'fw', 'west', 'posit', 'attach', 'delta', 'posit', '1', '16', '1', '30', '6', '19', '7', '13', '9', '21', 'origin', 'messag', 'allen', 'phillip', 'k', 'sent', 'wednesday', 'decemb', '05', '2001', '6', '41', 'dunton', 'heather', 'subject', 'west', 'posit', 'heather', 'exactli', 'need', 'would', 'possibl', 'add', 'prior', 'day', 'date', 'pivot', 'tabl', 'order', 'valid', 'curv', 'shift', 'date', 'also', 'need', 'prior', 'day', 'end', 'posit', 'thank', 'phillip', 'allen', 'origin', 'messag', 'dunton', 'heather', 'sent', 'tue

### TFIDF

*Major Note:* The following implementation was done on a system that has 32gb of memory and all of it was utilized for tokenisation, optimizations for how this process manages memory could be done but due to time restrictions this could not be met to it is recommended for the Tfidf intermidiary file, a system with at least 32gb is used any less will end up running out of memory.

The TFIDF function first gets all the DF values for each document present, thereafter calculates the subsequent tf values and finally the TF/IDF Values which is stored in a dict as follows (user, word)(tfidf value).

In [4]:
def DFValues(doc):
    DF_Vals = {}
    for i in range(len(doc)):
        tokens = doc[i]
        for w in tokens:
            try:
                DF_Vals[w].add(i)
            except:
                DF_Vals[w] = {i}
    for i in DF_Vals:
        DF_Vals[i] = len(DF_Vals[i])
    return(DF_Vals)

In [5]:
LENGTH = len(loaded_dict)

AllValues = list(loaded_dict.values())
AllKeys = list(loaded_dict.keys())
loaded_dict.clear() #clearning old dict to save memory
DF_Values = DFValues(AllValues) #Getting the all the DF values.

In [6]:
tfidf = {}
pbar2 = tqdm(total=LENGTH) # Number of iterations required to fill pbar

for i in range(LENGTH):
    pbar2.update(n=1) # Increments counter
    tokens = AllValues[i]
    count = Counter(tokens)
    
    for token in np.unique(tokens):
        tf = count[token]/len(token) #TF = (Number of times term t appears in a document) / (Total number of terms in the document).
        df = DF_Values[token]
        idf = np.log(LENGTH/(df)) #IDF: (Total number of documents)/(Number of documents containing the word)
        tfidf[AllKeys[i], token] = tf*idf
        
AllValues.clear()
AllKeys.clear()
DF_Values.clear()

100%|████████████████████████████████████████████████████████████████████████| 251699/251699 [14:08<00:00, 1786.94it/s]

Saving the TFIDF dict externally (commented out so no accidental rewrittens occur)

In [8]:
#import pickle
# 
#with open('data.pickle', 'wb') as f:
#    # Pickle the 'data' dictionary using the highest protocol available.
#    pickle.dump(tfidf, f, pickle.HIGHEST_PROTOCOL)

Loading TFIDF dict (*warning*: large memory storage needed)

In [10]:
with open('data.pickle', 'rb') as f:
    loaded_tf = pickle.load(f)

Filterning every unique email to only get the employee emails:

In [58]:
import re
regex = r'\b[A-Za-z0-9._%+-]+@enron.com'
EnronUsers = []
for user in AllEmails:
     if(re.fullmatch(regex, user)):
        EnronUsers.append(user)

In [60]:
print(len(EnronUsers))
print(EnronUsers)

5779
['heather.dunton@enron.com', 'louise.kitchen@enron.com', 'renee.ratcliff@enron.com', 'david.port@enron.com', 'james.bruce@enron.com', 'kathryn.sheppard@enron.com', 'c..gossett@enron.com', 'kirk.mcdaniel@enron.com', 'michelle.akers@enron.com', 'greg.whalley@enron.com', 'david.oxley@enron.com', 'critical.notice@enron.com', 'rebecca.cantrell@enron.com', 'paul.kaufman@enron.com', 'phillip.allen@enron.com', 'public.relations@enron.com', 'stephanie.miller@enron.com', 'tracy.arthur@enron.com', 'sarah.novosel@enron.com', 'tim.heizenrader@enron.com', 'frank.hayden@enron.com', 'kim.ward@enron.com', 'christi.nicolay@enron.com', 'richard.shapiro@enron.com', 'tiffany.miller@enron.com', 'tim.belden@enron.com', 'perfmgmt@enron.com', 'alyse.herasimchuk@enron.com', 'lisa.jacobson@enron.com', 'ina.rangel@enron.com', 'k..allen@enron.com', 'pam.butler@enron.com', 'colleen.koenig@enron.com', 'jeff.leath@enron.com', 'tracy.ramsey@enron.com', 'jeff.youngflesh@enron.com', 'larry.ciscon@enron.com', 'peter

### Extracting the highest-weighted n of the terms for each correspondent.

The following sets to achieves two goals,<br>
i) Representing each user as vector of features <br>
ii) Creating a keyword cloud for each user <br>

A subset of users are taken of around 150 users from the total amount of users, which would be a sample of users from the company, changing the 'LENGTH' size will correctly go through and process more users keyword clouds/vectorisations but for simplification purposes only a subset are taken. (Also works with all users).

The unique list created before hand of each enron employee is now utilised and for every employee, if a user is found to be a sender or reciever then a each word is retrieved and maintained inside a dict, every time a word is repeated the value is added to the same word dict and after 'avgDict' is used to go through each word and the average weight value is computed and stored.

A keyword cloud is created using the top 50 most frequently used words by a user (depending on a sorted weight value) and a vectorized dict is used with all the words.

In [62]:
import collections
LENGTH=150
pbar3 = tqdm(total=LENGTH)

counter = 0
keyWordCloud = collections.defaultdict(list)
Vectorize = collections.defaultdict(list) 

for user in EnronUsers:
    
    pbar3.update(n=1) # Increments counter
    counter+= 1
    dict_user = collections.defaultdict(list)
    
    for key,value in loaded_tf.items():
        if (key[0][0]==user or key[0][1]==user):
            dict_user[key].append(value)
                    
    avgDict = {}
    for key,value in dict_user.items():
        User_Word = [user, key[1]]
        q = tuple(User_Word)
      
        avgDict[q] = (sum(value)/ float(len(value)))
    
    
    MostFreq = sorted(avgDict.items(), key=lambda x:x[1], reverse=True)[:50]
    MostFreqAll = sorted(avgDict.items(), key=lambda x:x[1], reverse=True)
    
    for key,value in MostFreqAll:
        Vectorize[user].append([key[1],value])
        
    for key,value in MostFreq:
        keyWordCloud[user].append([key[1],value])
        
        
    dict_user.clear() #Cleared to maintain memory
    MostFreq.clear()
    avgDict.clear()
    if(counter==LENGTH):
        break
        
loaded_tf.clear() #frees most of the memory used   


  4%|███▎                                                                               | 2/50 [00:51<20:31, 25.66s/it][A

  1%|█                                                                                 | 2/150 [00:23<28:48, 11.68s/it][A
  2%|█▋                                                                                | 3/150 [00:48<41:57, 17.12s/it][A
  3%|██▏                                                                               | 4/150 [01:11<47:23, 19.47s/it][A
  3%|██▋                                                                               | 5/150 [01:34<50:13, 20.78s/it][A
  4%|███▎                                                                              | 6/150 [01:58<51:54, 21.63s/it][A
  5%|███▊                                                                              | 7/150 [02:20<51:59, 21.82s/it][A
  5%|████▎                                                                             | 8/150 [02:43<52:38, 22.24s/it][A
  6%|████▉    

 89%|██████████████████████████████████████████████████████████████████████▉         | 133/150 [50:40<06:36, 23.35s/it][A
 89%|███████████████████████████████████████████████████████████████████████▍        | 134/150 [51:04<06:17, 23.57s/it][A
 90%|████████████████████████████████████████████████████████████████████████        | 135/150 [51:27<05:52, 23.49s/it][A
 91%|████████████████████████████████████████████████████████████████████████▌       | 136/150 [51:50<05:24, 23.17s/it][A
 91%|█████████████████████████████████████████████████████████████████████████       | 137/150 [52:13<05:01, 23.19s/it][A
 92%|█████████████████████████████████████████████████████████████████████████▌      | 138/150 [52:35<04:35, 22.92s/it][A
 93%|██████████████████████████████████████████████████████████████████████████▏     | 139/150 [52:58<04:09, 22.73s/it][A
 93%|██████████████████████████████████████████████████████████████████████████▋     | 140/150 [53:20<03:45, 22.59s/it][A
 94%|███████████

The following is the resulting keyword cloud:

In [63]:
for key,value in keyWordCloud.items():
    print(key)
    print(value)

heather.dunton@enron.com
[['stwr', 60.317954589689826], ['ltsw', 51.397207151254186], ['gmc', 51.358656387467605], ['z', 45.951852123455566], ['1861', 44.61939338672351], ['w', 44.51432941840654], ['owa', 43.347864521524386], ['comw', 41.97243837360406], ['kate', 40.738856436754865], ['1100', 39.15020595950721], ['31861', 38.46080480431433], ['30397', 38.38872037373058], ['5e', 37.91979472106093], ['cn', 37.82044585025975], ['x5', 37.70293755886052], ['12', 37.34620492578781], ['q', 36.83723894536083], ['2pm', 34.89292185465968], ['syme', 33.86860647243371], ['ph', 31.678260840629285], ['akin', 31.673126002529266], ['outlook', 31.342970433434726], ['ect', 30.70002217979974], ['3249', 27.88449852843904], ['csu', 25.61415805124948], ['dyn', 25.39216503610725], ['ag', 25.165716290411915], ['pdx', 25.039033622765253], ['8612', 23.261573943022753], ['860', 23.194924707389386], ['r', 22.08861397789842], ['panu', 22.041119467636182], ['hain', 22.014898547675685], ['ye', 21.786241055940327], [

### Storing  the keyword cloud results and users used for part 2

In [64]:
import json
with open('keywordCloud.json', 'w') as outfile:
    json.dump(keyWordCloud, outfile)

In [65]:
import json
with open('listOfUsers.json', 'w') as outfile:
    json.dump(list(keyWordCloud.keys()), outfile)

A dataframe is created to represent each user as vector of features and words as vectors

In [94]:
seen2 = set()
allWords = []
LENGTH = len(Vectorize.values())

for i in range(LENGTH):
    for j in range(len(list(Vectorize.values())[i])):      
        if (list(Vectorize.values())[i][j][0]) not in seen2:
            seen2.add(list(Vectorize.values())[i][j][0])
            allWords.append(list(Vectorize.values())[i][j][0])
            
WordTfs = []           
for i in range(LENGTH):
    WordTf={}
    for j in range(len(list(Vectorize.values())[i])):
        WordTf[list(Vectorize.values())[i][j][0]] = list(Vectorize.values())[i][j][1]
    WordTfs.append(WordTf)
        

UserVector = pd.DataFrame(data=WordTfs, index = Vectorize.keys(), columns = allWords).fillna(0) #any missing values are correctly set to 0

In [96]:
#UserVector.to_json('UserIntermidiary.json')

The final intermidiary file was created and used repeatedly easily loop and find a fitting value of K for the K-means algorithm

In [3]:
UserVector = pd.read_json('UserIntermidiary.json')

In [4]:
UserVector

Unnamed: 0,stwr,ltsw,gmc,z,1861,w,owa,comw,kate,1100,...,rorschachher,newyorklif,republicga,atmbroadband,michael_beav,paternoimport,kevin_flack,westwindlogist,perriergroup,mmmarcantel
heather.dunton@enron.com,60.317955,51.397207,51.358656,45.951852,44.619393,44.514329,43.347865,41.972438,40.738856,39.150206,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
louise.kitchen@enron.com,0.000000,0.000000,0.000000,3.829321,1.394356,12.718380,0.000000,1.353950,2.089172,1.223444,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
renee.ratcliff@enron.com,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
david.port@enron.com,0.000000,0.000000,0.000000,3.829321,44.619393,16.957840,0.000000,1.353950,1.044586,39.150206,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
james.bruce@enron.com,0.000000,0.000000,0.000000,0.000000,0.000000,2.119730,0.000000,0.000000,1.044586,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
garrett.tripp@enron.com,0.000000,0.000000,0.000000,15.317284,44.619393,2.119730,0.000000,41.972438,0.000000,39.150206,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
d..baughman@enron.com,0.000000,0.000000,0.000000,11.487963,0.000000,2.119730,0.000000,0.000000,2.089172,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
andrew.kandolha@enron.com,0.000000,0.000000,0.000000,0.000000,0.000000,2.119730,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
jeremy.buss@enron.com,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


### Clustoring

The K chosen are based on reapeated tests determining what each clustor percentage is and chosing the most appropriate result.
<br>The Dataframe was converted into a a vector of values and the cosine similarity function was used as a distance metric.

In [20]:
DataUV = UserVector.to_numpy()
print(DataUV)

[[60.31795459 51.39720715 51.35865639 ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.77332509  0.76809278
   0.72883169]]


In [21]:
def cosSim(a, b):
    
    a = np.array(a)
    b = np.array(b)
    
    cos_sim = ((np.dot(a, b))/np.multiply((np.linalg.norm(a)),(np.linalg.norm(b))))
    return cos_sim

Step 1: Select k random points from the data as centroids.

This is done by taking three values at random from list of values and setting the points as those values.

In [22]:
iniUserVect = random.sample(range(0, len(DataUV)), 3)
iniUserVect

[136, 59, 33]

In [23]:
centroidsUV = []
for i in iniUserVect:
    centroidsUV.append(DataUV[i])
centroidsUV = np.array(centroidsUV)
centroidsUV
len(centroidsUV)

3

Step 2: Assign all the points to the closest cluster centroid using cosine sim.

In [24]:
def findClosestCentroids(cent, data):
    closestCent = []
    for i in data:
        dist = []
        for j in cent:
             dist.append(cosSim(i,j))
        closestCent.append(np.argmin(dist))
    return closestCent

Step 3: Recompute the centroids of newly formed clusters.

In [25]:
def calc_centroids(clust, Data):
    newCent = []
    new_df = pd.concat([pd.DataFrame(Data), pd.DataFrame(clust, columns=['Cluster'])],axis=1)
    for c in set(new_df['Cluster']):
        currClust = new_df[new_df['Cluster'] == c][new_df.columns[:-1]]
        clustMean = currClust.mean(axis=0)
        newCent.append(clustMean)
        
    return newCent

Step 4: Step 2 and 3 are repeated a number of times to ensure proper placements.

In [26]:
pbar3 = tqdm(total=200, position=0, leave=True)
for i in range(200):
    getUVCent = findClosestCentroids(centroidsUV, DataUV)
    UVCent = calc_centroids(getUVCent, DataUV)
    pbar3.update(n=1) # Increments counter
    
UVCent
print(len(UVCent))
print(UVCent)

100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [14:07<00:00,  4.24s/it]
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [06:40<00:00,  1.98s/it]

3
[0         1.884936
1         1.606163
2         1.604958
3         1.914661
4         1.437930
            ...   
101904    0.000000
101905    0.000000
101906    0.000000
101907    0.000000
101908    0.000000
Length: 101909, dtype: float64, 0         0.000000
1         0.000000
2         0.000000
3         1.515773
4         0.145245
            ...   
101904    0.017350
101905    0.017242
101906    0.016111
101907    0.016002
101908    0.015184
Length: 101909, dtype: float64, 0          0.000000
1          0.033375
2          0.978260
3         16.301967
4         16.552998
            ...    
101904     0.000000
101905     0.000000
101906     0.000000
101907     0.000000
101908     0.000000
Length: 101909, dtype: float64]


In [27]:
print(getUVCent)
len(getUVCent)

[0, 2, 0, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 0, 2, 0, 2, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 1, 0, 2, 0, 2, 1, 0, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 0, 1, 2, 2, 1, 0, 1]


150

In [28]:
ClustorUV_df = pd.DataFrame({"Clustors": getUVCent}, index = UserVector.index)

The following are the final user clustors with K=3

In [29]:
ClustorUV_df

Unnamed: 0,Clustors
heather.dunton@enron.com,0
louise.kitchen@enron.com,2
renee.ratcliff@enron.com,0
david.port@enron.com,2
james.bruce@enron.com,1
...,...
garrett.tripp@enron.com,2
d..baughman@enron.com,2
andrew.kandolha@enron.com,1
jeremy.buss@enron.com,0


Transposing the dataframe to represent the users as vectors and the terms as features:

The same algorithm was used and k=2 was chosen for this step.

In [30]:
UserFeature = UserVector.T
UserFeature

Unnamed: 0,heather.dunton@enron.com,louise.kitchen@enron.com,renee.ratcliff@enron.com,david.port@enron.com,james.bruce@enron.com,kathryn.sheppard@enron.com,c..gossett@enron.com,kirk.mcdaniel@enron.com,michelle.akers@enron.com,greg.whalley@enron.com,...,jae.black@enron.com,exec.jones@enron.com,steve.gim@enron.com,tosha.henderson@enron.com,carlos.alatorre@enron.com,garrett.tripp@enron.com,d..baughman@enron.com,andrew.kandolha@enron.com,jeremy.buss@enron.com,nick.hiemstra@enron.com
stwr,60.317955,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
ltsw,51.397207,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
gmc,51.358656,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
z,45.951852,3.829321,0.0,3.829321,0.0,0.0,53.610494,0.0,0.0,3.829321,...,11.487963,0.0,0.0,0.0,0.0,15.317284,11.487963,0.0,0.0,0.000000
1861,44.619393,1.394356,0.0,44.619393,0.0,0.0,44.619393,0.0,0.0,1.394356,...,44.619393,0.0,0.0,0.0,0.0,44.619393,0.000000,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
paternoimport,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.832812
kevin_flack,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.827617
westwindlogist,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.773325
perriergroup,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.768093


In [38]:
DataUF = UserFeature.to_numpy()
iniUserFeat = random.sample(range(0, len(DataUF)), 2)
iniUserFeat

[61685, 13487]

In [39]:
centroidsUF = []
for i in iniUserFeat:
    centroidsUF.append(DataUF[i])
centroidsUF = np.array(centroidsUF)
centroidsUF
len(centroidsUF)

2

In [40]:
pbar4 = tqdm(total=200, position=0, leave=True)
for i in range(200):
    getUFCent = findClosestCentroids(centroidsUF, DataUF)
    UFCent = calc_centroids(getUFCent, DataUF)
    pbar4.update(n=1) # Increments counter
    
UFCent
print(len(UFCent))
print(UFCent)

100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [57:26<00:00, 17.23s/it]
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [34:03<00:00,  7.42s/it]

2
[0      0.135992
1      0.981397
2      0.006258
3      0.230150
4      0.057840
         ...   
145    0.132954
146    0.293791
147    0.025146
148    0.003112
149    0.052537
Length: 150, dtype: float64, 0      0.033413
1      0.418652
2      0.000939
3      0.053984
4      0.015803
         ...   
145    0.035115
146    0.077363
147    0.004798
148    0.000281
149    0.010380
Length: 150, dtype: float64]


In [41]:
print(getUFCent)
len(getUFCent)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 

101909

In [42]:
ClustorUF_df = pd.DataFrame({"Clustors": getUFCent}, index = UserFeature.index)

In [43]:
ClustorUF_df

Unnamed: 0,Clustors
stwr,0
ltsw,0
gmc,0
z,0
1861,0
...,...
paternoimport,0
kevin_flack,0
westwindlogist,0
perriergroup,0


Exporting clustors to be used for part 2:

In [44]:
ClustorUF_df.to_json('WordClustor.json')

In [45]:
ClustorUV_df.to_json('userClustor.json')