# TF-IDF using Gensim 

This notebook applies methods from the gensim library to the Blog Spot homeless blog corpus.

The aim is to process the text into data that can feed a classification model or support statistical analysis of textual information. Depending on the theoretical framework used to examine the corpus, such a model could be applied to predicting a successful exit from homelessness or to evaluating sentiment scores for people on the brink of becoming homeless.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('blog_spot.csv', index_col=0)
df.head()

Unnamed: 0,wanderingscribe,homelesschroniclesintampa,livinghomelessourwritetospeak,seattlehomeless,homevan,joe-anybody,thehomelessfinch
0,# Extracted from http://wanderingscribe.blogsp...,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? ...,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs \n,The Homeless Finch Has Found Her Nest: Project...
1,In case you were wondering... the paperback of...,"Gee,\n",Its been some time since I actually have writt...,# Extracted from https://seattlehomeless.blogs...,HOME VAN NEWSLETTER 1/18/16\n,A (sticker) and a good idea\n,The Homeless Finch Makes It's First Rescue\n
2,"December probably isn't the time for it, but I...","this is a great question, and before I ever wr...",I am presently having a great meal as I write ...,"Ok, it's all relative. Seattle is hot at 80 d...",HOLIDAY ANGELS DISGUISED AS HOMELESS STRANGERS\n,- Portland 2018\n,The Start of Something New for The Homeless Fi...
3,Sometimes I give in to dreams — dream that on...,"play the viola. Then, I came down with essenti...","Back when I last posted, I was running a new b...","So tonight one of our local politicians, Seatt...",HOME VAN NEWSLETTER 11/15/15\n,- WRAP\n,"Jehane Lyle, Watercolor on paper, ""Cuppa"" - de..."
4,"In the meantime though, it's hard graft and sc...",neuro-muscular disorder (my mom was afflicted ...,I actually had myself a big slip and started u...,Here's what he saw: \n,HOME VAN NEWSLETTER 10/4/15\n,On 9/28/15 in Portland Oregon I filmed this in...,This week has been a complete blast. Getting ...


## Data processing and cleaning

The text data is arranged as elements of a dataframe, one column per blog. Since this will be used as training data, we join each row across blogs into a single string, giving one Series that is easier to map over. We then remove stop words, lowercase and tokenize the text, and drop tokens shorter than four characters.

In [3]:
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

In [4]:
# TODO: wrap in function
# Join each row's blog entries into one string, then strip gensim's stop words
df1 = df.apply(lambda x: ','.join(x.astype(str)), axis=1)
df2 = df1.apply(remove_stopwords)

In [5]:
df2[0]

'# Extracted http://wanderingscribe.blogspot.com/ ,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? I CERTAINLY HOPE SO! ,Another ,# Archived posts ,HOME VAN NEWSLETTER 6/12/16 ,House Keys Not Handcuffs ,The Homeless Finch Has Found Her Nest: Projects, Plans Peace'

In [6]:
df3 = df2.apply(lambda x: simple_preprocess(x, min_len=4))


In [7]:
# remove extra_stop_words using nltk

In [8]:
extra_stop_words = ["http", "like"]

In [9]:
stop_words = stopwords.words('english')

In [10]:
stop_words.extend(extra_stop_words)

In [11]:
df3a = df3.apply(lambda x: [word for word in x if word not in stop_words])

In [12]:
df4 = df3a.to_list()
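
The "wrap in function" note above suggests bundling the cleaning steps into one callable. Below is a minimal sketch of that idea. It assumes a plain regex tokenizer as a rough stand-in for `simple_preprocess` and a tiny made-up stop-word set in place of the gensim and NLTK lists. The names `clean_text` and `TOY_STOP_WORDS` are illustrative and not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses the
# gensim and NLTK lists plus extra_stop_words instead.
TOY_STOP_WORDS = {"from", "this", "that", "with", "have", "http", "like"}

def clean_text(text, stop_words=TOY_STOP_WORDS, min_len=4):
    """Lowercase, keep alphabetic tokens of at least min_len characters,
    and drop stop words. Only approximates simple_preprocess."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

sample = "# Extracted from http://wanderingscribe.blogspot.com/\n,Another one\n"
print(clean_text(sample))
```

A function like this could be applied in one pass with `df1.apply(clean_text)`, although the resulting tokens may differ slightly from the gensim pipeline used above.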

## Comparing processed data
Below is a comparison of the before and after states of the text. The output in `df4` can now be converted into a bag-of-words (`bow`).

In [13]:
df1.iloc[0]

'# Extracted from http://wanderingscribe.blogspot.com/\n,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? I CERTAINLY HOPE SO!\n,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs\xa0\n,The Homeless Finch Has Found Her Nest: Projects, Plans and Peace\n'

In [14]:
df4[0]

['extracted',
 'wanderingscribe',
 'blogspot',
 'iwsg',
 'january',
 'check',
 'certainly',
 'hope',
 'another',
 'archived',
 'posts',
 'home',
 'newsletter',
 'house',
 'keys',
 'handcuffs',
 'homeless',
 'finch',
 'found',
 'nest',
 'projects',
 'plans',
 'peace']

## Assign unique ID
Now that each document is a list of cleaned tokens (a list of lists, one per document), we can count how often every token occurs across the corpus. The result is a dictionary with the token as key and its frequency count as value, which we will use to filter the tokens before building the `bow`.

In [15]:
from collections import defaultdict

frequency = defaultdict(int)

# Create a frequency count for words in dataframe
for text in df4:
    for token in text:
        frequency[token] += 1

# Sample frequency count
word = 'http'
print("The word {} appears {} times in the corpus".format(word, frequency[word]))
print("It needs to be removed!")

The word http appears 0 times in the corpus
It needs to be removed!


### Most common words
Using the `collections` module, we can see what the top 10 words by frequency count are. Note that further removal of non-content words from the corpus is necessary. 

In [16]:
from collections import Counter
c = Counter(frequency)
c.most_common(10)

[('homeless', 1971),
 ('people', 1803),
 ('night', 1048),
 ('time', 934),
 ('nightwatch', 860),
 ('know', 807),
 ('going', 770),
 ('shelter', 750),
 ('city', 677),
 ('little', 664)]
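
The same tally can also be built in one step by handing `Counter` a flat stream of tokens, which skips the intermediate `defaultdict`. This is a small sketch on made-up data; `toy_docs` is a hypothetical stand-in for `df4`.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df4: one list of tokens per document.
toy_docs = [
    ["homeless", "people", "night"],
    ["homeless", "shelter", "night", "homeless"],
]

# Flatten the token lists and count every token in a single pass.
toy_counts = Counter(chain.from_iterable(toy_docs))
print(toy_counts.most_common(2))
```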

In [17]:
# keep only words appearing more than n times
n = 100
processed_corpus = [[token for token in text if frequency[token] > n] for text in df4]
print(processed_corpus[:2])

[['check', 'hope', 'home', 'house', 'homeless'], ['book', 'sure', 'white', 'time', 'actually', 'post', 'blog', 'time', 'past', 'year', 'home', 'good', 'idea', 'homeless', 'first']]


## Create dictionary keys
Now that we have the filtered tokens for every document, we can create a dictionary against which any future text can be mapped. It assigns a unique tokenID to every token in the corpus. We then use it to build a bag-of-words representation of the entire corpus. Note the difference between `processed_corpus` and `bow_corpus`: the former is a list of token strings per document, while the latter is a list of (tokenID, frequency) pairs per document. For this, we use the `corpora` module from the `gensim` library.

In [18]:
from gensim import corpora

# Create (token, tokenID)
dictionary = corpora.Dictionary(processed_corpus)

# bow representation of the corpus; (tokenID, freq)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

### Sample output
dictionary.token2id:
{'check': 0, 'home': 1, 'homeless': 2, 'hope': 3...

bow_corpus:
[(2, 4), (3, 1), (4, 1), (5, 2), (6, 9)...
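
To make the two structures concrete, here is a rough pure-Python sketch of the idea behind a token-to-ID mapping and a bag-of-words conversion. It is not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and IDs are assigned in order of first appearance, which may differ from the IDs gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_corpus = [["check", "hope", "home"], ["home", "homeless", "home"]]
toy_token2id = build_token2id(toy_corpus)
print(toy_token2id)
print(to_bow(["home", "home", "unknown", "homeless"], toy_token2id))
```

Tokens that are missing from the mapping are silently dropped, which is why a document with no known words produces an empty bag-of-words.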

### Apply doc2bow to example_doc
Now we can test whether we can identify words present in an `example_doc` that are also in our `bow_corpus`.

In [19]:
# Modify for testing
example_doc = "it's a nice and sunny day outside, i'm so happy"

In [20]:
# Output (token, token_freq) in example_doc for words in corpora.Dictionary (processed_corpus)
# Finds words in example_doc that appear within the context of our corpus
example_bow = dictionary.doc2bow(example_doc.lower().split())
example_bow

if len(example_bow) == 0:
    print("No context word found.")
else:
    print(f"The words: {[dictionary[tokenID] for tokenID, freq in example_bow]} are in our dictionary. No biggie.")

The words: ['happy', 'nice'] are in our dictionary. No biggie.
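
One thing to keep in mind with `str.split()` is that punctuation stays attached to the word, so a token such as "outside," can never match a dictionary entry "outside". The sketch below compares whitespace splitting with a simple regex tokenizer. `toy_vocab` is a hypothetical stand-in for the notebook's dictionary, so the matches shown here are not the notebook's results.

```python
import re

# Hypothetical vocabulary standing in for dictionary.token2id.
toy_vocab = {"happy", "nice", "outside"}

query = "it's a nice and sunny day outside, i'm so happy"

# Whitespace splitting leaves the comma glued to "outside,".
split_tokens = query.lower().split()
# Keeping only runs of letters strips the punctuation first.
regex_tokens = re.findall(r"[a-z]+", query.lower())

split_matches = sorted(t for t in split_tokens if t in toy_vocab)
regex_matches = sorted(t for t in regex_tokens if t in toy_vocab)
print(split_matches)
print(regex_matches)
```

The regex also breaks contractions such as "it's" into separate pieces. In the notebook, passing queries through `simple_preprocess` would keep them consistent with how the training text was tokenized.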


# Topic Modelling
## TF-IDF weight
To quantify how important a word is to a document relative to the rest of the collection, we'll create a `tf-idf` model using `bow_corpus` as training data. We'll then transform `example_doc2` into a bag-of-words and pass it through the model. This shows which words in `example_doc2` also appear in the training dictionary and how much weight each one carries. The tf-idf weight helps answer the question: "How relevant is a word in a document, given a collection of documents?" Later we will use a different algorithm (LSI) to calculate the similarity of documents.

In [21]:
from gensim import models

# Initialize tf-idf model using bow_corpus
tfidf = models.TfidfModel(bow_corpus)

### Transform bow representation into tf-idf weight of words
The tf-idf weight scales a word's frequency within a document by how rare the word is across the corpus, so words that appear in nearly every document receive low weights.

Sample tfidf[dictionary.doc2bow(sample)] output:

[(0, 0.017154431730943658), (1, 0.020673564600620076), (2, 0.033339482607518046)...
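
To show where weights like these come from, here is a hand-rolled sketch of the calculation. It assumes what I understand to be gensim's default settings: raw term counts, an inverse document frequency of log2(N / df), and scaling of each document vector to unit length. The toy corpus and the helper name `tfidf_weights` are hypothetical.

```python
import math

def tfidf_weights(bow, doc_freq, num_docs):
    """Weight each (token_id, count) pair by count * log2(num_docs / df),
    then scale the vector to unit Euclidean length."""
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    raw = [(tid, w) for tid, w in raw if w > 0.0]  # an idf of 0 carries no weight
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm else []

# Hypothetical toy corpus of three bag-of-words documents.
toy_bows = [[(0, 1), (1, 2), (2, 1)], [(0, 1), (2, 1)], [(0, 3), (1, 1)]]

# Document frequency: in how many documents each token id occurs.
doc_freq = {}
for bow in toy_bows:
    for tid, _ in bow:
        doc_freq[tid] = doc_freq.get(tid, 0) + 1

weights = tfidf_weights(toy_bows[0], doc_freq, len(toy_bows))
print(weights)
```

Token 0 occurs in every toy document, so its idf is zero and it drops out of the result. That is the behaviour that pushes very common words toward low weights.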

### Comparing with manual document
We can test how well the model is able to determine an input's similarity, measured in terms of term frequency weights, by feeding it some text manually and seeing how it compares to the transformed corpus.

`example_doc1` = "I am a student and like to study natural language processing"

`example_doc2` = "As a homeless person, I need all the help I can get from the police."

In [22]:
example_doc1 = "I am a student and like to study natural language processing"
unrelated_doc = example_doc1.lower().split()
unrelated_doc_dict = dictionary.doc2bow(unrelated_doc)
unrelated_doc_tfidf = tfidf[unrelated_doc_dict]

In [23]:
example_doc2 = "As a homeless person, I need all the help I can get from the police."
related_doc = example_doc2.lower().split()
related_doc_dict = dictionary.doc2bow(related_doc)
related_doc_tfidf = tfidf[related_doc_dict]

In [24]:
print(f"Tfidf weight of unrelated_doc: \n{unrelated_doc_tfidf}\n")
print(f"Tfidf weight of related_doc: \n{related_doc_tfidf}\n")

Tfidf weight of unrelated_doc: 
[]

Tfidf weight of related_doc: 
[(2, 0.34008989529272204), (65, 0.6809871315734689), (75, 0.6485332603275829)]



# Latent Semantic Indexing
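Use the tf-idf vector space to train an LSI model, chaining the two transformations: `bow_corpus` is first converted to tf-idf weights, and the LSI model is then trained on that output.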

### Save model to disk
TODO

import os
import tempfile

# Requires lsi_model, which is trained in the LsiModel cell further down
with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)  # same for tfidf, lda, ...

# Reload the saved model, then delete the temporary file
loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

print(lsi_model.print_topics(5))

### Applying LSI to test document
Similarity queries: compare a sample of unseen text related to the topic against the corpus.

### Import reddit_titles
with open("./titles.txt", "r") as f:
    data = f.read().splitlines()

data[:2]

In [None]:
from nltk.corpus import stopwords
from gensim import similarities

In [None]:
corpus_tfidf = tfidf[bow_corpus] # bow_corpus as vector of tf-idf weights
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi_model[corpus_tfidf]

In [None]:
index = similarities.MatrixSimilarity(lsi_model[corpus_tfidf]) # index the corpus in LSI space for cosine-similarity queries

In [None]:
texts = dictionary.doc2bow(example_doc.lower().split())

In [None]:
test_lsi = lsi_model[tfidf[texts]] # apply the same tf-idf transform the model was trained on

In [None]:
test_lsi[:10]

In [None]:
sims = index[test_lsi]

In [None]:
doc_sp = []
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    doc_sp.append([doc_score, df1.iloc[doc_position]]) # df1 holds the raw text of each indexed document

In [None]:
(sims[:5])
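
The scores returned by the index are cosine similarities between the query vector and each document vector. Here is a minimal pure-Python sketch of that ranking on made-up two-topic vectors. `toy_corpus_vecs` and `toy_query` are hypothetical stand-ins for `corpus_lsi` and `test_lsi`.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, value) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical 2-topic vectors standing in for corpus_lsi and test_lsi.
toy_corpus_vecs = [[(0, 1.0), (1, 0.0)], [(0, 0.6), (1, 0.8)], [(0, 0.0), (1, 1.0)]]
toy_query = [(0, 1.0), (1, 0.0)]

# Score every document against the query, then rank best match first.
toy_sims = sorted(
    ((pos, cosine(toy_query, vec)) for pos, vec in enumerate(toy_corpus_vecs)),
    key=lambda item: -item[1],
)
print(toy_sims)
```

A score near 1 means the document points in almost the same direction as the query in topic space, while a score near 0 means the two share little.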

## Word2Vec

In [None]:
from gensim.models import Word2Vec

In [None]:
model = Word2Vec(sentences=processed_corpus, window=5, min_count=10, workers=4) # Omit: vector_size=100

In [None]:
model.train([['hello', 'world']], total_examples=1, epochs=1) # sentences must be lists of tokens

In [None]:
vector = model.wv['homeless']

In [None]:
(vector)

In [None]:
print(model.wv.get_vecattr('homeless', 'count')) # assuming gensim 4.x, where wv.vocab was removed; prints the word's corpus count

In [None]:
from gensim.models import KeyedVectors

In [None]:
word_vectors = model.wv
word_vectors.save('word2vec.wordvectors') # persist only the vectors so they can be reloaded

In [None]:
wv2 = KeyedVectors.load('word2vec.wordvectors', mmap='r')

In [None]:
vector2 = wv2[['homeless']] # index with a list of keys to get a 2-D array, one row per word

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
# Collect the similarity scores from the ranked (position, score) pairs
sample_tf = [score for _, score in sims]

In [None]:
plt.plot(sample_tf, 'ob')

In [None]:
plt.plot(np.cumsum(sample_tf), 'ob')

In [None]:
cos_sim = [item for item in sims]
#print(cos_sim)

In [None]:
plt.plot(cos_sim, '+') # Does not mean anything! Clusters; K-nearest neighbor?
plt.show()

In [None]:
most_common_list = c.most_common(10)

In [None]:
most_common_list[0]

In [None]:
type(most_common_list[0])

In [None]:
# Split the (word, count) pairs into single-item rows of words and counts
x = [[word] for word, count in most_common_list]
y = [[count] for word, count in most_common_list]