# Word Vector Spike

Inspired by [Stop Using Word2Vec](http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), we attempt to measure word similarity from a corpus using count statistics and vectors rather than neural networks.

## Acquiring a Corpus

Rob has a Twitter corpus he's been exploring in R. Let's try messing with it in Python.



In [14]:
english_news = "data/en_US/en_US.news.txt"

# Read the whole corpus into memory; the context manager closes the file afterwards
with open(english_news, 'r') as news_file:
    data = news_file.read()

type(data)

str

In [18]:
print(data[0:800])


He wasn't home alone, apparently.
The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks l


In [21]:
lines = data.splitlines()
len(lines)

1010245

In [23]:
lines[0:10]

["He wasn't home alone, apparently.",
 'The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.',
 "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.",
 'The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15.',
 "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is 

In [26]:
document = lines[1]
document

'The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.'

In [51]:
import string

def normalize(document):
    # Split on spaces, lowercase each token, and strip punctuation
    tokens = document.split(" ")
    translator = str.maketrans('', '', string.punctuation)
    return [word.lower().translate(translator) for word in tokens]

normalize(document)

['the',
 'st',
 'louis',
 'plant',
 'had',
 'to',
 'close',
 'it',
 'would',
 'die',
 'of',
 'old',
 'age',
 'workers',
 'had',
 'been',
 'making',
 'cars',
 'there',
 'since',
 'the',
 'onset',
 'of',
 'mass',
 'automotive',
 'production',
 'in',
 'the',
 '1920s']
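
Splitting on a single space and then stripping punctuation can leave empty strings behind, for example when a token is a lone dash or when two spaces occur in a row. A minimal sketch of a variant that drops them, assuming empty tokens are not wanted downstream (`normalize_nonempty` is a hypothetical name, not part of the notebook):

```python
import string

# Hypothetical variant of normalize() that discards tokens left empty
# after punctuation is stripped.
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def normalize_nonempty(document):
    # split() with no argument splits on any run of whitespace
    cleaned = (token.lower().translate(_PUNCTUATION_TABLE) for token in document.split())
    return [token for token in cleaned if token]

print(normalize_nonempty("He wasn't home  alone - apparently."))
```

Whether to drop such tokens is a judgment call; keeping them would add an empty-string entry to the unigram probabilities.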

In [52]:
from collections import Counter

def unigram_probabilities(document):
    words = normalize(document) 
    counts = Counter(words)
    probabilities = {}
    for word, count in counts.items():
        probabilities[word] = count/len(words)
    return probabilities
    
unigram_probabilities(document)

{'1920s': 0.034482758620689655,
 'age': 0.034482758620689655,
 'automotive': 0.034482758620689655,
 'been': 0.034482758620689655,
 'cars': 0.034482758620689655,
 'close': 0.034482758620689655,
 'die': 0.034482758620689655,
 'had': 0.06896551724137931,
 'in': 0.034482758620689655,
 'it': 0.034482758620689655,
 'louis': 0.034482758620689655,
 'making': 0.034482758620689655,
 'mass': 0.034482758620689655,
 'of': 0.06896551724137931,
 'old': 0.034482758620689655,
 'onset': 0.034482758620689655,
 'plant': 0.034482758620689655,
 'production': 0.034482758620689655,
 'since': 0.034482758620689655,
 'st': 0.034482758620689655,
 'the': 0.10344827586206896,
 'there': 0.034482758620689655,
 'to': 0.034482758620689655,
 'workers': 0.034482758620689655,
 'would': 0.034482758620689655}

In [47]:
sum(unigram_probabilities(document).values())

0.9999999999999994
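
The sum comes out slightly below 1 because each probability is a binary floating-point approximation and the rounding errors accumulate. A small sketch of two ways to confirm that a distribution is properly normalized, using only the standard library (`check_distribution` is a hypothetical helper, and the sample words are just the first few tokens of the example sentence):

```python
import math
from collections import Counter
from fractions import Fraction

def check_distribution(probabilities, tolerance=1e-9):
    # math.fsum tracks partial sums to reduce accumulated rounding error
    return math.isclose(math.fsum(probabilities.values()), 1.0, abs_tol=tolerance)

words = "the st louis plant had to close".split()
counts = Counter(words)

float_probs = {word: count / len(words) for word, count in counts.items()}
print(check_distribution(float_probs))

# Exact rational arithmetic avoids rounding altogether
exact_probs = {word: Fraction(count, len(words)) for word, count in counts.items()}
print(sum(exact_probs.values()) == 1)
```

Comparing with a tolerance is usually enough here; the `Fraction` version is only worth the cost when exact equality matters.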

In [65]:
WINDOW_RADIUS = 3

In [78]:
skipgram_counts = {}
input_word = normalize(document)[0]
thisgram_counts = skipgram_counts.get(input_word, {})

# First attempt: the window still includes the input word itself
window = normalize(document)[0:WINDOW_RADIUS]
for word in window:
    thisgram_counts[word] = thisgram_counts.get(word, 0) + 1

print(window)
skipgram_counts[input_word] = thisgram_counts

['the', 'st', 'louis']


In [80]:
normalized = normalize(document)

In [88]:
skipgram_counts = {}
input_index = 0
left_index = max(0, input_index - WINDOW_RADIUS)
right_index = min(input_index + WINDOW_RADIUS, len(normalized))
center_index = input_index - left_index

window = normalized[left_index:right_index]
del window[center_index]

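# thisgram_counts and input_word are not reset here; they carry over from the
# earlier cell, so the counts below accumulate across re-runs.
for word in window:
    thisgram_counts[word] = thisgram_counts.get(word, 0) + 1

skipgram_counts[input_word] = thisgram_counts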

skipgram_counts

{'the': {'louis': 4, 'st': 4, 'the': 1}}

In [91]:
skipgram_counts = {}
skipgram_counts

{}

In [92]:
skipgram_counts = {}
input_index = 0
left_index = max(0, input_index - WINDOW_RADIUS)
right_index = min(input_index + WINDOW_RADIUS, len(normalized))
center_index = input_index - left_index

window = normalized[left_index:right_index]
del window[center_index]

for word in window:
    thisgram_counts[word] = thisgram_counts.get(word, 0) + 1
    
skipgram_counts[input_word] = thisgram_counts

skipgram_counts

{'the': {'louis': 6, 'st': 6, 'the': 1}}

In [97]:
def compute_skipgrams(document):
    # Count the context words around the first word of the document
    normalized = normalize(document)
    skipgram_counts = {}
    thisgram_counts = {}
    input_index = 0
    input_word = normalized[input_index]
    left_index = max(0, input_index - WINDOW_RADIUS)
    right_index = min(input_index + WINDOW_RADIUS, len(normalized))
    center_index = input_index - left_index

    window = normalized[left_index:right_index]
    del window[center_index]  # drop the input word itself

    for word in window:
        thisgram_counts[word] = thisgram_counts.get(word, 0) + 1
    skipgram_counts[input_word] = thisgram_counts
    return skipgram_counts

In [98]:
compute_skipgrams(document)

{'the': {'louis': 1, 'st': 1}}
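
`compute_skipgrams` only counts the context of the first word. A sketch of how the same idea might be extended to every position in a document follows. `count_all_skipgrams` and its `radius` parameter are hypothetical names, and the sketch assumes the window should include `radius` words on each side of the input word, which is one more word on the right than the slice in `compute_skipgrams` takes.

```python
from collections import Counter, defaultdict

def count_all_skipgrams(tokens, radius=3):
    """Count, for every token, the words within `radius` positions of it."""
    skipgram_counts = defaultdict(Counter)
    for index, input_word in enumerate(tokens):
        left_index = max(0, index - radius)
        # +1 because the slice end is exclusive
        right_index = min(index + radius + 1, len(tokens))
        # Context is everything in the window except the input word itself
        context = tokens[left_index:index] + tokens[index + 1:right_index]
        skipgram_counts[input_word].update(context)
    return skipgram_counts

tokens = ['the', 'st', 'louis', 'plant', 'had', 'to', 'close']
counts = count_all_skipgrams(tokens, radius=2)
print(dict(counts['louis']))
```

Using `defaultdict(Counter)` means repeated words such as `the` share one counter, so their contexts from different positions are summed rather than overwritten.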