# Using the Gensim Library

This notebook is an exercise in applying methods from the gensim library to a corpus of Blogspot blogs about homelessness.

The aim is to process data mined using the `blogspot_scraper` so that it can be used to build a classification system or to support statistical analysis of the text. The hope is that this could serve as a framework for predicting successful outcomes, or for evaluating sentiment scores of posts, comments or other writings by people dealing with homelessness.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('blog_spot.csv', index_col=0)
df.head()

Unnamed: 0,wanderingscribe,homelesschroniclesintampa,livinghomelessourwritetospeak,seattlehomeless,homevan,joe-anybody,thehomelessfinch
0,# Extracted from http://wanderingscribe.blogsp...,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? ...,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs \n,The Homeless Finch Has Found Her Nest: Project...
1,In case you were wondering... the paperback of...,"Gee,\n",Its been some time since I actually have writt...,# Extracted from https://seattlehomeless.blogs...,HOME VAN NEWSLETTER 1/18/16\n,A (sticker) and a good idea\n,The Homeless Finch Makes It's First Rescue\n
2,"December probably isn't the time for it, but I...","this is a great question, and before I ever wr...",I am presently having a great meal as I write ...,"Ok, it's all relative. Seattle is hot at 80 d...",HOLIDAY ANGELS DISGUISED AS HOMELESS STRANGERS\n,- Portland 2018\n,The Start of Something New for The Homeless Fi...
3,Sometimes I give in to dreams — dream that on...,"play the viola. Then, I came down with essenti...","Back when I last posted, I was running a new b...","So tonight one of our local politicians, Seatt...",HOME VAN NEWSLETTER 11/15/15\n,- WRAP\n,"Jehane Lyle, Watercolor on paper, ""Cuppa"" - de..."
4,"In the meantime though, it's hard graft and sc...",neuro-muscular disorder (my mom was afflicted ...,I actually had myself a big slip and started u...,Here's what he saw: \n,HOME VAN NEWSLETTER 10/4/15\n,On 9/28/15 in Portland Oregon I filmed this in...,This week has been a complete blast. Getting ...


## Data processing and cleaning

The text data is arranged as elements of a dataframe, one column per blog. Since this will be used as unsupervised training data, we can combine the writings of our sample into one series object to make mapping easier. I do not anticipate computational problems arising from this, though I'm open to being corrected. Standard data processing and cleaning methods are then applied.

In [3]:
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

In [4]:
# TODO: wrap in a function
# Note: astype(str) turns empty (NaN) cells into the string 'nan'
df1 = df.apply(lambda x: ','.join(x.astype(str)), axis=1)
df2 = df1.apply(remove_stopwords)
df3 = df2.apply(lambda x: simple_preprocess(x, min_len=2))
extra_stop_words = ["http", "like"]  # corpus-specific additions
stop_words = stopwords.words('english')  # needs the NLTK 'stopwords' data
stop_words.extend(extra_stop_words)

df3a = df3.apply(lambda x: [word for word in x if word not in stop_words])
df4 = df3a.to_list()

## Comparing processed data
Below is a comparison of the before and after states of the text. The output in `df4` can now be converted into a bag-of-words (`bow`).

Before:

'# Extracted from http://wanderingscribe.blogspot.com/\n,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? I CERTAINLY HOPE SO!\n,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs\xa0\n,The Homeless Finch Has Found Her Nest: Projects, Plans and Peace\n'

After:

['extracted', 'wanderingscribe', 'blogspot', 'iwsg',...


## Assign unique ID
Now that we have a list of tokens for each row of the corpus, we can count them, building a dictionary of `{word: frequency_count}` pairs.

In [5]:
from collections import defaultdict

frequency = defaultdict(int)

# Create a frequency count for words in dataframe
for text in df4:
    for token in text:
        frequency[token] += 1

# Sample frequency count
word = 'homeless'
print("The word {} appears {} times in the corpus".format(word, frequency[word]))

The word homeless appears 1971 times in the corpus


### Most common words
Using the `collections` module, we can see the top 10 words by frequency count. We can also use this to further remove non-content words from the corpus if necessary. This is how I found that the word 'http' was everywhere in the corpus. I've since removed it, but it gives you an idea of non-content words you might have missed. The top entry below, 'nan', is most likely another one: empty dataframe cells become the string 'nan' when cast to `str`.

In [6]:
from collections import Counter
c = Counter(frequency)
c.most_common(10)

[('nan', 5232),
 ('homeless', 1971),
 ('people', 1803),
 ('night', 1048),
 ('time', 934),
 ('nightwatch', 860),
 ('know', 807),
 ('going', 770),
 ('shelter', 750),
 ('got', 687)]
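
As a side note, the frequency count and the removal of artifact tokens can be done in one pass with `Counter`. This is a small sketch on made-up token lists; `toy_docs` and the `ignore` set are assumptions for illustration, not data from the corpus.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized documents, including the 'nan' artifact
toy_docs = [
    ['homeless', 'shelter', 'nan'],
    ['homeless', 'people', 'nan', 'nan'],
]
ignore = {'nan'}  # assumed set of artifact tokens to skip

# Flatten the documents and count every token that is not ignored
counts = Counter(
    tok for tok in chain.from_iterable(toy_docs) if tok not in ignore
)
top = counts.most_common(2)
print(top)
```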

### Word appearing n times
I thought it might be a good idea to find words appearing more than n times in the corpus. You can also do the exact opposite: find words that appear fewer than n times in the corpus and decide whether you want to keep those. This makes sense (as the Gensim docs point out) when you're using the library on massive datasets like the Google News dataset. In a corpus of a billion words, you might decide that words appearing fewer than 5 times can be removed, for example.

In [7]:
# keep only words appearing more than n times
n = 10
processed_corpus = [[token for token in text if frequency[token] > n] for text in df4]
print(processed_corpus[:2])

[['blogspot', 'com', 'iwsg', 'january', 'check', 'new', 'certainly', 'hope', 'another', 'posts', 'home', 'van', 'newsletter', 'house', 'keys', 'homeless', 'finch', 'found', 'projects', 'plans', 'peace'], ['case', 'wondering', 'book', 'new', 'cover', 'th', 'november', 'cover', 'looks', 'sure', 'pink', 'white', 'time', 'actually', 'written', 'post', 'blog', 'figured', 'time', 'update', 'readers', 'past', 'year', 'bit', 'blogspot', 'com', 'home', 'van', 'newsletter', 'good', 'idea', 'homeless', 'finch', 'makes', 'first', 'rescue']]


## Create dictionary keys
Now that we have a list of all the words in the corpus, we can create a dictionary to map any future comparison to. Every unique token (word) in the corpus is assigned a unique token_id. What we'll end up with is a bag-of-words with unique identifiers for the words in our entire corpus. Note the difference between `processed_corpus` and `bow_corpus`: the former is a list of token lists (one per document), while the latter stores each document as `(token_id, count)` pairs. For this, we will use the built-in `corpora` module from the `gensim` library to make things easy.

Sample output:

dictionary.token2id:
{'check': 0, 'home': 1, 'homeless': 2, 'hope': 3...

bow_corpus:
[(2, 4), (3, 1), (4, 1), (5, 2), (6, 9)...

In [8]:
from gensim import corpora

# Create (token, tokenID)
dictionary = corpora.Dictionary(processed_corpus)

# bow representation of the corpus; (tokenID, freq)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

### Apply doc2bow to example_doc
Now we can test whether words present in an `example_doc` are also in our `dictionary`.

In [9]:
# Modify for testing
example_doc = "it's a nice and sunny day outside, i'm so happy"

# Find words in example_doc that also appear in our dictionary
example_bow = dictionary.doc2bow(example_doc.lower().split())

if len(example_bow) == 0:
    print("No context word found.")
else:
    print(f"The words: {[dictionary[tokenID] for tokenID, freq in example_bow]} are in our dictionary. No biggie.")

# TODO - Output (token, token_freq) in example_doc for words in corpora.Dictionary (processed_corpus)

The words: ['day', 'happy', 'nice'] are in our dictionary. No biggie.


# Word Embeddings
## TF-IDF weight
In order to quantify the relationship of words in a document (and across documents), we'll create a word embedding using `tf-idf` weights computed from the `bow_corpus`. We'll then transform `example_doc2` into a bag-of-words. This way, we will be able to determine which words in `example_doc2` also appear in our training data, and measure both how frequent those words are within a document and how common they are across documents. This is captured by a word's tf-idf weight, which helps answer the question: "How relevant is a word in a document, given a collection of documents?" The Wikipedia article on tf-idf probably does a better job of explaining this. Later we will use a different algorithm (LSI) to calculate the similarity of documents.

In [10]:
from gensim import models

# Initialize tf-idf model using bow_corpus
tfidf = models.TfidfModel(bow_corpus)

### bow representation to tf-idf weights
This gives us a weight for each word (i.e. its assumed relevance) given a corpus of documents.

Sample tfidf[dictionary.doc2bow(sample)] output:

[(0, 0.017154431730943658), (1, 0.020673564600620076), (2, 0.033339482607518046)...
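
To see where such weights come from, here is a hand-rolled sketch of the calculation. It assumes what I understand to be gensim's default scheme: the raw term count multiplied by `log2(N / df)`, followed by normalization to unit length. The function name and the toy document frequencies are hypothetical.

```python
import math


def tfidf_weights(doc_counts, doc_freqs, num_docs):
    """Return unit-length tf-idf weights for a single document.

    doc_counts: {token: count in this document}
    doc_freqs:  {token: number of documents containing the token}
    """
    # Term frequency times inverse document frequency (log base 2)
    raw = {
        tok: count * math.log2(num_docs / doc_freqs[tok])
        for tok, count in doc_counts.items()
        if tok in doc_freqs
    }
    # Scale the vector to unit (L2) length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {}
    return {tok: w / norm for tok, w in raw.items()}


# Made-up document frequencies for a corpus of 4 documents
doc_freqs = {'homeless': 3, 'police': 1, 'help': 2}
weights = tfidf_weights(
    {'homeless': 2, 'police': 1, 'help': 1}, doc_freqs, num_docs=4
)
print(weights)
```

Even though `homeless` occurs twice in this toy document, it ends up with the lowest weight because it appears in three of the four documents.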

### Comparing with manual document
We can test how the model weighs new input, measured in terms of tf-idf weights, by feeding it some text manually and seeing how it compares to the transformed corpus.

The output gives you `(token_id, tf-idf_weight)` pairs; `dictionary[token_id]` recovers the word.

In [11]:
example_doc1 = "I am a student and I like to study natural language processing and computer science."
unrelated_doc = example_doc1.lower().split()
unrelated_doc_dict = dictionary.doc2bow(unrelated_doc)
unrelated_doc_tfidf = tfidf[unrelated_doc_dict]

example_doc2 = "As a homeless person, I need all the help I can get from the police, instead of them harrassing me."
related_doc = example_doc2.lower().split()
related_doc_dict = dictionary.doc2bow(related_doc)
related_doc_tfidf = tfidf[related_doc_dict]

print(f"Tfidf weight of unrelated_doc: \n{unrelated_doc_tfidf}\n")
print(f"Tfidf weight of related_doc: \n{related_doc_tfidf}\n")

# TODO - create clean output

Tfidf weight of unrelated_doc: 
[(494, 0.5792698834554537), (1630, 0.5662128359240098), (2361, 0.5863867550997339)]

Tfidf weight of related_doc: 
[(8, 0.15950015416535915), (191, 0.3193789464903235), (241, 0.30415827237259907), (353, 0.6306102318406809), (1234, 0.6183649975581706)]



## Latent Semantic Indexing

### Save model to disk
TODO

import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_tfidf.save(tmp.name)  # lsi_tfidf is trained in the next section; same for tfidf, lda, ...

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

### Applying the LSI (chain) model to the corpus.
An interesting observation here is the difference in results obtained when applying LSI on top of either: a bag-of-words representation of the corpus (`bow_corpus`) or a tf-idf transformation (`corpus_tfidf`) on top of the bow_corpus.

In [12]:
from gensim import models

In [15]:
# Number of topics to generate
NUM_TOPICS = 2

In [18]:
# BOW -> LSI
lsi_bow = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=NUM_TOPICS)

print(f"\nTopics generated with BOW:\n{lsi_bow.show_topics()}") # TODO - drop NAN

# BOW -> TF-IDF -> LSI 
corpus_tfidf = tfidf[bow_corpus] # bow_corpus as vector of tf-idf weights
lsi_tfidf = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=NUM_TOPICS)

print(f"\nTopics generated with TF-IDF:\n{lsi_tfidf.show_topics()}")


Topics generated with BOW:
[(0, '0.411*"homeless" + 0.341*"people" + 0.257*"nan" + 0.180*"city" + 0.151*"night" + 0.148*"time" + 0.146*"shelter" + 0.126*"know" + 0.125*"life" + 0.116*"going"'), (1, '0.805*"nan" + -0.270*"homeless" + -0.178*"city" + -0.142*"police" + -0.120*"people" + -0.107*"pam" + 0.100*"night" + 0.099*"nightwatch" + -0.084*"homelessness" + -0.082*"law"')]

Topics generated with TF-IDF:
[(0, '0.155*"homeless" + 0.149*"people" + 0.136*"night" + 0.122*"nightwatch" + 0.109*"shelter" + 0.105*"time" + 0.104*"seattle" + 0.102*"going" + 0.098*"know" + 0.098*"city"'), (1, '-0.229*"night" + -0.223*"shelter" + -0.212*"nightwatch" + -0.184*"seattle" + -0.178*"women" + -0.157*"bar" + 0.150*"life" + -0.115*"city" + -0.109*"operation" + 0.104*"justice"')]


Aside from having to do some more clean up, it's interesting to note the difference in the words the LSI model includes in each generated topic. I'd have to dig deeper as to why, but based on what I've understood, the difference could come from factoring in how many documents a term appears in (the idf part), which effectively down-weights words that occur very frequently across the corpus but carry little meaning within any one document.

Considering that the corpus consists of blogs about homelessness, for example, one has to ask whether the word `homeless` actually carries much meaning in the corpus. Remember that BOW simply counts the frequency of words in a document, and does not account for words that carry little meaning (function words) because they appear frequently across the whole collection of documents. TF-IDF weighting does exactly that: it adjusts (penalizes) words that are too common to be meaningful context words for a document.

In [None]:
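from gensim import similarities

index = similarities.MatrixSimilarity(lsi_tfidf[corpus_tfidf])  # index the tf-idf corpus in LSI space

In [None]:
texts = dictionary.doc2bow(example_doc.lower().split())

In [None]:
test_lsi = lsi_tfidf[tfidf[texts]]  # same BOW -> TF-IDF -> LSI chain used to build the index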

In [None]:
test_lsi[:10]

In [None]:
sims = index[test_lsi]

In [None]:
documents = df1.to_list()  # assumption: the raw joined text of each row stands in for the original documents
doc_sp = []
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    doc_sp.append([doc_score, documents[doc_position]])

In [None]:
(sims[:5])

## Word2Vec

In [None]:
from gensim.models import Word2Vec

In [None]:
model = Word2Vec(sentences=processed_corpus, window=5, min_count=10, workers=4) # Omit: vector_size=100

In [None]:
model.train([['hello', 'world']], total_examples=1, epochs=1)  # sentences must be lists of tokens; words outside the trained vocabulary are ignored

In [None]:
vector = model.wv['homeless']

In [None]:
(vector)

In [None]:
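print(model.wv.key_to_index['homeless'], model.wv.get_vecattr('homeless', 'count'))  # gensim 4: wv.vocab was replaced by key_to_index / get_vecattr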

In [None]:
from gensim.models import KeyedVectors

In [None]:
word_vectors = model.wv
word_vectors.save('word2vec.wordvectors')  # save the vectors so they can be reloaded below

In [None]:
wv2 = KeyedVectors.load('word2vec.wordvectors', mmap='r')

In [None]:
vector2 = wv2[['homeless']]  # a list of keys returns a 2D array, one row per key

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
sample_tf = [score for _, score in sims]  # similarity scores, in ranked order

In [None]:
plt.plot(sample_tf, 'ob')

In [None]:
plt.plot(np.cumsum(sample_tf), 'ob')

In [None]:
cos_sim = [score for _, score in sims]  # keep only the cosine similarity, not the document position
#print(cos_sim)

In [None]:
plt.plot(cos_sim, '+') # Does not mean anything! Clusters; K-nearest neighbor?
plt.show()

In [None]:
most_common_list = c.most_common(10)

In [None]:
most_common_list[0]

In [None]:
type(most_common_list[0])

In [None]:
x = [[word] for word, count in most_common_list]
y = [[count] for word, count in most_common_list]