# Slack Data Exploration

My first goal for this dataset is to run the K-means clustering algorithm and see whether any specific trends or topics emerge. This will also show whether groups of similar words point to topics that can be extracted automatically from the dataset.

## Data Creation

I want to have a text file where each line is a message sent by a user. For now, I will ignore the channel and user information that the exported data provides.

First import the required packages for data creation.

In [1]:
import json
import os
import re

import numpy as np
import pandas as pd

Assuming the existence of directories containing *.json* files, I will visit each directory and add each JSON object to a list.

In [42]:
json_li = []

outer_dir = "../data"
directories = ["general", "dev", "ai", "marketresearch", "random", "sales-gamification"]
for directory in directories:
    path = os.path.join(outer_dir, directory)
    for filename in os.listdir(path):
        if filename.endswith(".json"):
            # Use a context manager so each file handle is closed after loading
            with open(os.path.join(path, filename), encoding="utf-8") as fp:
                json_li.append(json.load(fp))

FileNotFoundError: [Errno 2] No such file or directory: 'data/general'
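
The hard-coded directory list can also be replaced by discovering the channel folders at run time. The following is a minimal sketch, assuming the layout described above (one sub-directory per channel, each holding *.json* files). The helper name `load_channel_exports` is hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def load_channel_exports(outer_dir):
    """Return one parsed JSON object per .json file found one level below outer_dir.

    Hypothetical helper: assumes outer_dir/<channel>/<file>.json.
    """
    root = Path(outer_dir)
    if not root.is_dir():
        # Fail early with the resolved path, which makes a wrong working directory obvious
        raise FileNotFoundError("export directory not found: {}".format(root.resolve()))

    loaded = []
    # sorted() keeps the load order stable across runs and platforms
    for json_path in sorted(root.glob("*/*.json")):
        with json_path.open(encoding="utf-8") as fp:
            loaded.append(json.load(fp))
    return loaded
```

With this in place, `json_li = load_channel_exports("../data")` would pick up every channel folder, and a wrong relative path is reported with its absolute location.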

Now I will create a new file to hold the extracted message text.

In [34]:
data_fp = open('../data/general_text.txt', 'w+')

In [35]:
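for json_obj in json_li:
    for message in json_obj:
        # Some exported events may lack a 'text' field; treat it as empty (assumption)
        text = message.get('text', '')
        # Strip Slack markup such as mentions and links; [^>]* stops at the nearest '>'
        text = re.sub(r'<[^>]*>', '', text)
        data_fp.write(text)
        data_fp.write("\n")
data_fp.close()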
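
Slack wraps mentions and links in angle brackets, so the choice of pattern matters. A greedy pattern such as `<.*>` matches from the first `<` to the last `>` on a line and can swallow the words in between, whereas `<[^>]*>` stops at the nearest `>`. A small sketch of the difference follows; the sample message and the helper name `strip_slack_markup` are made up for illustration.

```python
import re

GREEDY = re.compile(r'<.*>')        # first '<' through the last '>' on the line
BOUNDED = re.compile(r'<[^>]*>')    # one bracketed token at a time


def strip_slack_markup(text):
    """Remove <...> tokens (mentions, links) and collapse leftover whitespace."""
    without_tokens = BOUNDED.sub('', text)
    return ' '.join(without_tokens.split())


sample = "<@U12345> can you review <https://example.com/pr> today?"

greedy_result = GREEDY.sub('', sample)
bounded_result = strip_slack_markup(sample)
print(repr(greedy_result))
print(repr(bounded_result))
```

The greedy version keeps only the tail of the message, while the bounded version keeps all of the words between the two bracketed tokens.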

## Preprocessing

I want to prepare the data by lemmatizing and stemming it. The required packages are as follows:

In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk

To perform stemming and lemmatization, I need to download the 'wordnet' and 'punkt' packages through nltk. A built-in downloader is provided, but I needed to add the following code, which disables HTTPS certificate verification, to get it to work:

In [33]:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/JoeSkimmons/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/JoeSkimmons/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Stemming and lemmatization both aim to reduce the inflectional forms, and sometimes the derivationally related forms, of a word to a common base form.

(For more information, I used this resource: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
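
To make the distinction concrete: a stemmer chops suffixes with rules and may return something that is not a word, while a lemmatizer looks the word up and returns its dictionary form. The toy sketch below only illustrates that contrast. The suffix rules and the lemma table are invented for this example and are far simpler than what the Snowball stemmer or WordNet actually do.

```python
# Toy suffix rules, invented for illustration; this is not the Snowball algorithm.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

# Toy lookup table, invented for illustration; WordNet is far larger.
LEMMAS = {"studies": "study", "ran": "run", "meeting": "meet"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if it is in the table, else the word unchanged."""
    return LEMMAS.get(word, word)


for w in ["studies", "ran", "meeting"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Rule-based stripping leaves "studies" as a non-word stem and cannot relate "ran" to "run", whereas the lookup handles both but only for words it knows.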

Now I will create a "stemmer" using the nltk package and define two functions to perform the bulk of our preprocessing.

In [36]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()


def lemmatize_stemming(text):
    # Lemmatize the token as a verb first, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))


# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

These two functions convert all words to lowercase, remove stopwords, remove words of three characters or fewer, and then lemmatize and stem each remaining word.

preprocess(text) returns the reduced message as a list of tokens.

Now I can get a list of the messages in their reduced form with the following:

In [40]:
from nltk import word_tokenize

processed_msgs = []

# Read one message per line and keep only messages that still have tokens
with open('../data/general_text.txt', 'r') as f:
    for l in f:
        p = preprocess(l.strip())
        if p:
            processed_msgs.append(p)
    
# for l in open('general_text.txt', 'r'):
#     l = l.strip()
#     l = l.lower()
#     p = word_tokenize(l)
#     if p:
#         processed_msgs.append(p)

print(processed_msgs)

[]
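
An empty `processed_msgs` list makes every later step fail (the dictionary has no tokens, Word2Vec cannot build a vocabulary, and LDA has no terms), so it can help to stop here with a clear message. A minimal sketch follows; `require_nonempty_corpus` is a hypothetical helper, not part of the notebook.

```python
def require_nonempty_corpus(token_lists, source="the input file"):
    """Return the non-empty token lists, or raise if there are none at all."""
    non_empty = [tokens for tokens in token_lists if tokens]
    if not non_empty:
        raise ValueError(
            "No tokens were produced from {}; check that the file exists, "
            "is not empty, and was closed after writing.".format(source)
        )
    return non_empty
```

It could be called right after the loop, for example `processed_msgs = require_nonempty_corpus(processed_msgs, '../data/general_text.txt')`.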


## Experimentation with K Means Clustering

First I will create a dictionary mapping each preprocessed word to a numeric token.

In [31]:
dictionary = gensim.corpora.Dictionary(processed_msgs)
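
Conceptually, the dictionary assigns each distinct token an integer id, and a message can then be represented as (id, count) pairs, which is the bag-of-words form used for the LDA model later. The sketch below mimics that idea in plain Python on made-up messages. It is not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_token_ids(token_lists):
    """Assign integer ids to tokens in order of first appearance."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


toy_msgs = [["deploy", "server", "deploy"], ["server", "crash"]]
toy_ids = build_token_ids(toy_msgs)
print(toy_ids)
print(to_bag_of_words(toy_msgs[0], toy_ids))
```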

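This next line is optional. It filters out words that are too rare (appearing in fewer than 20 messages) or too common (appearing in more than 80% of messages) to be significant when considering the topics, and keeps at most the 100,000 most frequent of the remaining words.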

In [32]:
dictionary.filter_extremes(no_below=20, no_above=0.8, keep_n= 100000)

print(dictionary)

Dictionary(0 unique tokens: [])
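
The filter works on document frequency, i.e. the number of messages a token appears in. The plain-Python sketch below reproduces that logic on a toy corpus, assuming gensim's documented behavior. The function name `filter_by_doc_freq` is hypothetical, and rounding and tie-breaking details may differ from gensim.

```python
from collections import Counter


def filter_by_doc_freq(token_lists, no_below=20, no_above=0.8, keep_n=100000):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(token_lists)
    # Count each token once per message, however often it repeats inside it
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically here)
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]


toy_corpus = [["deploy", "server"], ["deploy", "crash"], ["deploy", "server"], ["lunch"]]
print(filter_by_doc_freq(toy_corpus, no_below=2, no_above=0.5))
```

Note that on a small corpus a threshold like `no_below=20` can remove every token, so the thresholds are worth revisiting once the real message count is known.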


I decided to try the Word2Vec embedding to create a vector representation of each word found in the processed messages.

More on that here: https://skymind.ai/wiki/word2vec

## Creating the Word2Vec Model

In [23]:
from gensim.models import Word2Vec

In [24]:
model = Word2Vec(processed_msgs, min_count=1)

RuntimeError: you must first build vocabulary before training the model

In [13]:
# save model
# model.save('Word2Vec_model.bin')

In [14]:
# load model
# new_model = Word2Vec.load('Word2Vec_model.bin')
# print(new_model)

Next, I will use the K-means clustering algorithm from the nltk package to see how well the word vectors group into topics so far.

http://www.pitt.edu/~naraehan/presentation/word2vec-try.html

In [15]:
from nltk.cluster import KMeansClusterer

In [16]:
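# Index through model.wv; indexing the model directly is deprecated in gensim 3.x.
# Note: gensim 4 replaced wv.vocab with wv.key_to_index.
X = model.wv[model.wv.vocab]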

NUM_CLUSTERS=5
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

NameError: name 'model' is not defined
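
For reference, K-means works by repeatedly assigning each vector to its nearest centroid and then moving each centroid to the mean of its members, with "nearest" measured here by cosine distance. Below is a small pure-Python sketch of that loop on made-up 2-D vectors. It is for illustration only, it is not nltk's implementation, and the function names are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans_cosine(vectors, k, iterations=20, seed=0):
    """Return one cluster index per input vector."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine distance
        new_assignments = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = mean_vector(members)
    return assignments


toy_vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
labels = kmeans_cosine(toy_vectors, k=2)
print(labels)
```

The result depends on the random starting centroids, which is why a single run can be unlucky on harder data; the `repeats` argument in the nltk call above asks for several randomised trials.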

In [17]:
from collections import defaultdict

In [18]:
words = list(model.wv.vocab)
cluster_dict = defaultdict(lambda: [])
for i, word in enumerate(words):
    cluster_dict[assigned_clusters[i]].append(word)

NameError: name 'model' is not defined

Getting a sense of which words are in each cluster:

In [19]:
for key, word_li in cluster_dict.items():
    print(key)
    print(word_li[:10])

NameError: name 'cluster_dict' is not defined

Attempting to show correlation with PCA. The result is not significant for now; I will return to this after trying the Google model.

In [20]:
from sklearn.decomposition import PCA
from matplotlib import pyplot

ModuleNotFoundError: No module named 'sklearn'

In [21]:
# not useful yet but could be in the future
# fit a 2d PCA model to the vectors
X = model.wv[model.wv.vocab]  # index through model.wv (gensim 3.x API)
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)

for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

NameError: name 'model' is not defined

## LDA Model

In [22]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_msgs]

lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=5,
                                       id2word=dictionary,
                                       passes=15,
                                       workers=2)

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

ValueError: cannot compute LDA over an empty collection (no terms)