# Gensim Library

### Creating a corpus

Let's move towards creating a corpus. We can do so by splitting each of our documents into their component words.  

> Now, we could use Python to do this for us.

In [45]:
texts = [[text for text in doc.split()] for doc in documents[:1]]
texts[0][:3]

['From:', 'lerxst@wam.umd.edu', "(where's"]

But instead we'll lean on the gensim library.

In [33]:
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
simple_preprocess(documents[0])[:5]

['from', 'lerxst', 'wam', 'umd', 'edu']

We refer to the list above as `tokens` because notice that not everything here is technically a word.

> Here, we want to capture information in the message itself.  Looking at the first part of each document, we can see that a lot of this information isn't related to the content of the message. 

In [128]:
documents[0][:200]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out"

Above we have the email address, posting host and organization.  In a more sophisticated algorithm, we would separate out each of these components into features and see what information we can gather from them.  But for this example, let's just focus on the content of the message, which is everything after the text `Lines:`.

We can use the `partition` method to divide each document.

In [135]:
documents[0].partition('Lines:')

("From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\n",
 'Lines:',
 ' 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n')

And then let's select the message content, which is in the last element.

In [138]:
def select_content(document):
    return document.partition('Lines:')[-1][3:]

In [139]:
select_content(documents[0])

'\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n'

In [151]:
document_tokens = [simple_preprocess(remove_stopwords(select_content(document))) for document in documents]

In [152]:
document_tokens[0][:3]

['wondering', 'enlighten', 'car']

Now we can pass this nested list into gensim's `corpora.Dictionary` method.

In [170]:
import gensim
from gensim import corpora
dictionary = corpora.Dictionary(document_tokens)

In [171]:
print(dictionary)

Dictionary(87385 unique tokens: ['addition', 'body', 'bricklin', 'brought', 'bumper']...)


So here we have 90,000 unique tokens.

In [57]:
# dictionary.token2id

And we could see that each of these words is assigned a different number.

In [287]:
def bow_vectors(document_tokens):
    return [dictionary.doc2bow(doc, allow_update=True) for doc in document_tokens]

In [288]:
vectors = bow_vectors(document_tokens)

In [286]:
vectors[0][:5]

[(6, 4), (19, 2), (0, 1), (1, 1), (2, 1)]

In [212]:
def find_top_words(vector, top = 10):
    vector.sort(key=lambda x: x[1], reverse=True)
    return dict([[idx, [dictionary.get(idx), num]] for idx, num in vector][:top])

In [213]:
find_top_words(vectors[0], 5)

{6: ['car', 4],
 19: ['it', 2],
 0: ['addition', 1],
 1: ['body', 1],
 2: ['bricklin', 1]}

So above we have the bag of words model, and this simply counts the number of words.  Looking at the top five we can see even just counting the words, our formula does a fairly good job of telling us this is a car.

In [122]:
documents[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

Let's try it on the tenth document.  

In [168]:
find_top_words(vectors[10], 5)

[['they', 2], ['bike', 2], ['irwin', 2], ['oil', 2], ['it', 1]]

Here, we can see that it does a good job of identifying about a bike and oil.

In [147]:
documents[10]

'From: irwin@cmptrc.lonestar.org (Irwin Arnstein)\nSubject: Re: Recommendation on Duc\nSummary: What\'s it worth?\nDistribution: usa\nExpires: Sat, 1 May 1993 05:00:00 GMT\nOrganization: CompuTrac Inc., Richardson TX\nKeywords: Ducati, GTS, How much? \nLines: 13\n\nI have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs\nvery well, paint is the bronze/brown/orange faded out, leaks a bit of oil\nand pops out of 1st with hard accel.  The shop will fix trans and oil \nleak.  They sold the bike to the 1 and only owner.  They want $3495, and\nI am thinking more like $3K.  Any opinions out there?  Please email me.\nThanks.  It would be a nice stable mate to the Beemer.  Then I\'ll get\na jap bike and call myself Axis Motors!\n\n-- \n-----------------------------------------------------------------------\n"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx\nirwin@cmptrc.lonestar.org    DoD #0826          (R75/6)\n-------------------------------------------