# TF-IDF using Gensim 

This notebook is an exercise in applying methods from the gensim library to a collection of blog posts.

The hope is to create training data for a classifier. The classifier could then be applied to tasks such as predicting a successful exit from homelessness or evaluating sentiment scores for people on the brink of becoming homeless.

In [1]:
import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

In [2]:
df = pd.read_csv('blog_spot.csv', index_col=0)
df.head()

Unnamed: 0,wanderingscribe,homelesschroniclesintampa,livinghomelessourwritetospeak,seattlehomeless,homevan,joe-anybody,thehomelessfinch
0,# Extracted from http://wanderingscribe.blogsp...,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? ...,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs \n,The Homeless Finch Has Found Her Nest: Project...
1,In case you were wondering... the paperback of...,"Gee,\n",Its been some time since I actually have writt...,# Extracted from https://seattlehomeless.blogs...,HOME VAN NEWSLETTER 1/18/16\n,A (sticker) and a good idea\n,The Homeless Finch Makes It's First Rescue\n
2,"December probably isn't the time for it, but I...","this is a great question, and before I ever wr...",I am presently having a great meal as I write ...,"Ok, it's all relative. Seattle is hot at 80 d...",HOLIDAY ANGELS DISGUISED AS HOMELESS STRANGERS\n,- Portland 2018\n,The Start of Something New for The Homeless Fi...
3,Sometimes I give in to dreams — dream that on...,"play the viola. Then, I came down with essenti...","Back when I last posted, I was running a new b...","So tonight one of our local politicians, Seatt...",HOME VAN NEWSLETTER 11/15/15\n,- WRAP\n,"Jehane Lyle, Watercolor on paper, ""Cuppa"" - de..."
4,"In the meantime though, it's hard graft and sc...",neuro-muscular disorder (my mom was afflicted ...,I actually had myself a big slip and started u...,Here's what he saw: \n,HOME VAN NEWSLETTER 10/4/15\n,On 9/28/15 in Portland Oregon I filmed this in...,This week has been a complete blast. Getting ...


## Data processing and cleaning

The text data is stored in a dataframe with one column per blog. Since this will be used as training data, we join each row's entries across all blogs into a single string, which gives one Series object that is easier to map over later.

In [3]:
df1 = df.apply(lambda x: ','.join(x.astype(str)), axis=1)

In [4]:
df2 = df1.apply(remove_stopwords)

In [5]:
df3 = df2.apply(lambda x: simple_preprocess(x, min_len=4))

In [6]:
df4 = df3.to_list()
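One thing to check in the pipeline above is the order of the steps. As far as I understand it, `remove_stopwords` compares whitespace-separated tokens against a lower-case stopword list, so a capitalised "This" passes through and is only lower-cased afterwards by `simple_preprocess`. That would explain why words such as 'this' and 'that' still show up in the frequency counts later in this notebook. The sketch below is a stdlib-only stand-in for that behaviour; `STOPWORDS` and `drop_stopwords` are hypothetical names, not gensim's own.

```python
# Stand-in for a case-sensitive, whitespace-based stopword filter.
# STOPWORDS is a tiny hypothetical set, not gensim's full list.
STOPWORDS = frozenset({"this", "is", "a", "the", "of"})


def drop_stopwords(text, stopwords=STOPWORDS):
    """Remove whitespace-separated tokens that exactly match a stopword."""
    return " ".join(w for w in text.split() if w not in stopwords)


raw = "This is the story of a shelter"

# Filtering first misses the capitalised "This".
print(drop_stopwords(raw))          # This story shelter

# Lower-casing first lets the filter catch it.
print(drop_stopwords(raw.lower()))  # story shelter
```

If the same holds for the real function, lower-casing each string before calling `remove_stopwords` would be one way to tighten the cleaning step.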

## Comparing processed data

In [7]:
df1.iloc[0]

'# Extracted from http://wanderingscribe.blogspot.com/\n,#IWSG - JANUARY 2019 - CHECK IN - A NEW ME??? I CERTAINLY HOPE SO!\n,Another one\n,# Archived posts\n,HOME VAN NEWSLETTER 6/12/16\n,House Keys Not Handcuffs\xa0\n,The Homeless Finch Has Found Her Nest: Projects, Plans and Peace\n'

In [8]:
df4[0]

['extracted',
 'http',
 'wanderingscribe',
 'blogspot',
 'iwsg',
 'january',
 'check',
 'certainly',
 'hope',
 'another',
 'archived',
 'posts',
 'home',
 'newsletter',
 'house',
 'keys',
 'handcuffs',
 'homeless',
 'finch',
 'found',
 'nest',
 'projects',
 'plans',
 'peace']

## Assign unique ID
Now that we have a list of token lists (one list per row of the sample), we can build a dictionary that maps each token to its frequency count, that is, the number of times it appears in the corpus.

In [9]:
from collections import defaultdict

In [10]:
frequency = defaultdict(int)

In [11]:
# Frequency count of df4
for text in df4:
    for token in text:
        frequency[token] += 1
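The counting loop above can also be written with `collections.Counter`, which is used later in this notebook for `most_common`. The sketch below uses a small made-up `docs` list in place of `df4`.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df4: one token list per row.
docs = [
    ["homeless", "shelter", "night"],
    ["homeless", "city"],
    ["night", "homeless"],
]

# Flatten the nested lists and count every token in one pass.
frequency = Counter(chain.from_iterable(docs))

print(frequency.most_common(2))  # [('homeless', 3), ('night', 2)]
```

A `Counter` returns 0 for unseen tokens, like `defaultdict(int)`, but does not insert the missing key on lookup.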

In [12]:
# e.g. 'http' appears 168 times in our corpus
frequency

defaultdict(int,
            {'extracted': 3,
             'http': 168,
             'wanderingscribe': 3,
             'blogspot': 30,
             'iwsg': 26,
             'january': 31,
             'check': 113,
             'certainly': 17,
             'hope': 320,
             'another': 97,
             'archived': 1,
             'posts': 21,
             'home': 451,
             'newsletter': 16,
             'house': 311,
             'keys': 18,
             'handcuffs': 1,
             'homeless': 1971,
             'finch': 79,
             'found': 16,
             'nest': 8,
             'projects': 49,
             'plans': 26,
             'peace': 96,
             'case': 91,
             'wondering': 56,
             'paperback': 4,
             'book': 122,
             'completly': 1,
             'cover': 34,
             'hardback': 2,
             'november': 29,
             'looks': 85,
             'like': 1050,
             'this': 783,
             'sure'

In [13]:
# keep only tokens that appear more than 150 times in the whole corpus
processed_corpus = [[token for token in text if frequency[token] > 150] for text in df4]
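The filter above keeps a token only if its corpus-wide count is strictly greater than 150. A small sketch of the same idea on made-up data, using a hypothetical `MIN_COUNT` and an inclusive comparison to show the difference:

```python
from collections import Counter

# Toy stand-in for df4.
docs = [
    ["homeless", "shelter"],
    ["homeless", "finch"],
    ["homeless", "shelter", "nest"],
]
frequency = Counter(tok for doc in docs for tok in doc)

MIN_COUNT = 2  # hypothetical threshold; the notebook uses 150 with a strict '>'

# '>=' keeps tokens seen at least MIN_COUNT times; '>' would require MIN_COUNT + 1.
kept = [[tok for tok in doc if frequency[tok] >= MIN_COUNT] for doc in docs]

print(kept)  # [['homeless', 'shelter'], ['homeless'], ['homeless', 'shelter']]
```

With a high threshold, short rows can end up with very few tokens or none at all, which is worth checking before fitting a model.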

### Most common words

In [14]:
from collections import Counter

In [15]:
c = Counter(frequency)

In [16]:
c.most_common(100)

[('homeless', 1971),
 ('people', 1803),
 ('like', 1050),
 ('night', 1048),
 ('time', 934),
 ('nightwatch', 860),
 ('know', 807),
 ('this', 783),
 ('going', 770),
 ('there', 753),
 ('shelter', 750),
 ('city', 677),
 ('little', 664),
 ('that', 658),
 ('life', 656),
 ('seattle', 635),
 ('think', 619),
 ('they', 618),
 ('good', 596),
 ('place', 594),
 ('years', 582),
 ('help', 519),
 ('said', 502),
 ('here', 495),
 ('street', 492),
 ('want', 489),
 ('what', 473),
 ('year', 470),
 ('need', 470),
 ('home', 451),
 ('things', 434),
 ('work', 418),
 ('told', 391),
 ('week', 380),
 ('didn', 379),
 ('away', 376),
 ('pretty', 349),
 ('right', 348),
 ('room', 340),
 ('love', 337),
 ('come', 330),
 ('getting', 330),
 ('well', 329),
 ('maybe', 322),
 ('hope', 320),
 ('tonight', 314),
 ('great', 313),
 ('thing', 312),
 ('house', 311),
 ('sure', 311),
 ('long', 309),
 ('women', 306),
 ('person', 296),
 ('took', 294),
 ('friend', 291),
 ('friends', 290),
 ('blog', 284),
 ('family', 283),
 ('today', 279)

## Create dictionary keys

In [17]:
from gensim import corpora

In [18]:
dictionary = corpora.Dictionary(processed_corpus)

In [19]:
print(dictionary.token2id)

{'home': 0, 'homeless': 1, 'hope': 2, 'house': 3, 'http': 4, 'actually': 5, 'blog': 6, 'good': 7, 'like': 8, 'past': 9, 'sure': 10, 'this': 11, 'time': 12, 'white': 13, 'year': 14, 'again': 15, 'come': 16, 'feel': 17, 'great': 18, 'hard': 19, 'having': 20, 'knew': 21, 'looking': 22, 'month': 23, 'people': 24, 'portland': 25, 'seattle': 26, 'shelters': 27, 'things': 28, 'anyway': 29, 'back': 30, 'called': 31, 'came': 32, 'city': 33, 'friend': 34, 'happy': 35, 'know': 36, 'little': 37, 'look': 38, 'love': 39, 'maybe': 40, 'outside': 41, 'some': 42, 'story': 43, 'tell': 44, 'that': 45, 'then': 46, 'think': 47, 'tonight': 48, 'took': 49, 'what': 50, 'with': 51, 'world': 52, 'christmas': 53, 'getting': 54, 'going': 55, 'here': 56, 'months': 57, 'need': 58, 'started': 59, 'stop': 60, 'week': 61, 'center': 62, 'community': 63, 'family': 64, 'guys': 65, 'help': 66, 'life': 67, 'nightwatch': 68, 'operation': 69, 'police': 70, 'right': 71, 'room': 72, 'shelter': 73, 'thank': 74, 'water': 75, 'we

In [20]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]  # each document becomes a list of (token id, count) pairs

## Apply to new document

In [21]:
new_doc = "homeless homeless house open house open social house"

In [22]:
# Convert new_doc to a bag-of-words vector: (token id, count) pairs for tokens found in the dictionary
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(1, 2), (3, 3), (126, 1), (128, 2)]
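To make the `(id, count)` output above easier to read, here is a stdlib-only sketch of what a bag-of-words conversion does. `to_bow` is a hypothetical helper, not gensim's implementation, and `token2id` is a four-entry slice of the mapping printed earlier. It deliberately leaves out 'open' and 'social' to show that tokens missing from the mapping are skipped.

```python
from collections import Counter

# Small slice of the token -> id mapping; ids match the printout above.
token2id = {"home": 0, "homeless": 1, "hope": 2, "house": 3}


def to_bow(tokens, token2id):
    """Return sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


new_doc = "homeless homeless house open house open social house"
print(to_bow(new_doc.lower().split(), token2id))  # [(1, 2), (3, 3)]
```

The first two pairs match the notebook's result: 'homeless' (id 1) occurs twice and 'house' (id 3) three times.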

# Model

In [23]:
from gensim import models

## tf-idf weight
The next cells compute the tf-idf weights for the words found in `sample_doc`, which is built from `df['thehomelessfinch']`.

An open question: could descriptive statistics over these weights show which words are both the most common and the most relevant?

In [24]:
sample_doc = df['thehomelessfinch'].to_string()
sample = sample_doc.lower().split() # list
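Note that `sample` is tokenized with a plain `.lower().split()`, while the training corpus went through `simple_preprocess`. A whitespace split leaves punctuation attached to words, so a token like "finch," will not match the dictionary entry "finch". The sketch below illustrates the effect with a made-up `vocab` and a rough regular-expression tokenizer, which only approximates what gensim does.

```python
import re

vocab = {"homeless", "finch", "nest"}  # hypothetical vocabulary

text = "The Homeless Finch, homeless no more: a nest!"

# Whitespace split keeps punctuation glued to the words.
naive = text.lower().split()

# Rough alphabetic tokenizer; an approximation, not gensim's exact rules.
cleaned = re.findall(r"[a-z]+", text.lower())

print([t for t in naive if t in vocab])    # ['homeless', 'homeless']
print([t for t in cleaned if t in vocab])  # ['homeless', 'finch', 'homeless', 'nest']
```

Running new documents through the same preprocessing as the training corpus avoids silently losing these matches.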

In [25]:
# Fit the tf-idf model on the bag-of-words corpus
tfidf = models.TfidfModel(bow_corpus)  # bow_corpus encodes processed_corpus as (token id, count) pairs

# Calculate tf-idf weight
print(tfidf[dictionary.doc2bow(sample)])

[(0, 0.02892837216472553), (1, 0.04665170130460947), (2, 0.10214380100929987), (3, 0.037657533488909656), (5, 0.031230409099446493), (6, 0.12312075228015906), (7, 0.033765865857571256), (8, 0.0540513815248069), (9, 0.03237221350977881), (10, 0.025733626274098666), (11, 0.5601162894238296), (12, 0.1996942702710659), (13, 0.0909166243797562), (14, 0.01982129906383492), (16, 0.04573879418597442), (17, 0.03418138491734963), (18, 0.0895965482291525), (19, 0.005009865246636717), (20, 0.025727907711626946), (21, 0.011095767583409076), (22, 0.02330525973613433), (23, 0.01015306531247949), (24, 0.011443897337417061), (28, 0.023154418700351072), (30, 0.1596473600186419), (31, 0.011095767583409076), (32, 0.053543251182626664), (34, 0.017947100623754913), (35, 0.036952692201426336), (36, 0.04931165964337038), (37, 0.20389663214368137), (38, 0.07205620167997898), (39, 0.09515323895910949), (41, 0.010386047026369739), (42, 0.12502305393994584), (44, 0.051271571419495524), (45, 0.3253380160466306), (
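Each pair above is `(token id, weight)`. As I understand gensim's defaults, the weight is the raw term count multiplied by `log2(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term, and the resulting vector is then scaled to unit length. The sketch below reproduces that scheme in plain Python on made-up data; `tfidf_vector` is a hypothetical helper, and the scheme is an assumption to verify against the gensim documentation.

```python
import math
from collections import Counter

# Toy training corpus.
docs = [
    ["homeless", "shelter", "night"],
    ["homeless", "city"],
    ["homeless", "shelter", "shelter"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each token.
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf_vector(tokens):
    """Weight = count * log2(N / df), then L2-normalise; unknown tokens are ignored."""
    tf = Counter(t for t in tokens if t in doc_freq)
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in tf.items()}
    weights = {t: w for t, w in weights.items() if w > 0}  # drop zero weights
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


vec = tfidf_vector(["homeless", "shelter", "city"])
for token, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{token:<10}{weight:.3f}")
```

In this toy corpus 'homeless' occurs in every document, so its idf is zero and it drops out, while the rarer 'city' outweighs 'shelter'.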

In [26]:
# Output tf-idf weight for a specific word ('homeless')
weights = dict(tfidf[dictionary.doc2bow(sample)])
print(weights[dictionary.token2id['homeless']])

0.04665170130460947

In [27]:
type(tfidf[dictionary.doc2bow(sample)][0][1])

numpy.float64
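One way to approach the question of which words matter most is to sort the tf-idf pairs by weight and map the ids back to tokens. The sketch below does this for a handful of pairs copied from the output above, with the matching entries from the `token2id` printout inverted into a hypothetical `id2token` dict.

```python
# A few (token id, weight) pairs copied from the tf-idf output above.
weights = [
    (1, 0.04665170130460947),
    (11, 0.5601162894238296),
    (12, 0.1996942702710659),
    (37, 0.20389663214368137),
    (45, 0.3253380160466306),
]

# Matching entries from the token2id printout, inverted to id -> token.
id2token = {1: "homeless", 11: "this", 12: "time", 37: "little", 45: "that"}

# Highest weight first.
ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)

for token_id, weight in ranked[:3]:
    print(f"{id2token[token_id]:<10}{weight:.3f}")
```

Among these five pairs, 'this' and 'that' rank highest, which suggests that leftover function words are dominating the sample. With the real objects, `dictionary[token_id]` would replace the `id2token` lookup.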

### Comparing with an unrelated document

In [28]:
s = "pikachu is awesome although he could be better off if i had a room i don't care how much damage can be done it is so sad the police are suicidal but everybody else is very happy and elated"

In [29]:
# Tokenize the string and compute its tf-idf weights; words missing from the dictionary are ignored
words = s.lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(35, 0.45123655061733864), (70, 0.5437095030985055), (72, 0.37117463064094863), (82, 0.43105985595911905), (131, 0.4209303336377662)]


In [30]:
print(dictionary[72], dictionary[131], dictionary[82])

room better care


## Latent Semantic Analysis 

In [31]:
from gensim import models
tfidf = models.TfidfModel(bow_corpus)

In [32]:
bow_corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(0, 1),
  (1, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 2),
  (13, 1),
  (14, 1)],
 [(1, 3),
  (6, 1),
  (7, 1),
  (10, 3),
  (11, 2),
  (12, 2),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 2),
  (19, 1),
  (20, 2),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 2),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1)],
 [(0, 1),
  (8, 1),
  (12, 1),
  (20, 1),
  (24, 3),
  (26, 1),
  (28, 3),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 2),
  (37, 1),
  (38, 1),
  (39, 3),
  (40, 1),
  (41, 1),
  (42, 2),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 2),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 2)],
 [(0, 1),
  (5, 3),
  (9, 1),
  (11, 1),
  (12, 1),
  (19, 1),
  (25, 2),
  (28, 1),
  (37, 1),
  (44, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1)],
 [(0, 1),
  (1, 5),
  (5, 1),
  (7, 2),
  (15, 1),
  (17, 1),
  (1

In [33]:
sample_mapping = dictionary.doc2bow(sample)

In [34]:
corpus_tfidf = tfidf[bow_corpus]  # lazily transformed corpus; yields one tf-idf vector per document
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]
lsi_model.print_topics(2)

[(0,
  '0.218*"homeless" + 0.214*"people" + 0.170*"night" + 0.160*"nightwatch" + 0.153*"like" + 0.146*"time" + 0.143*"this" + 0.143*"there" + 0.138*"shelter" + 0.137*"going"'),
 (1,
  '0.288*"shelter" + 0.278*"seattle" + 0.276*"nightwatch" + 0.270*"city" + 0.261*"people" + 0.258*"homeless" + 0.210*"tent" + 0.187*"night" + -0.170*"little" + 0.169*"women"')]