In this notebook, we will get an overview of how to generate word vectors using the various word embedding methods discussed in the lecture

### Objectives:
- Implement Count Vectors with sklearn
- Implement TF-IDF Vectors with sklearn and gensim
- Train and save word2vec model with gensim
- Load Google's pretrained word2vec model
- Load Stanford's pretrained GloVe model

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

corpus = twenty_train.data[0:50]

In [3]:
print(f"{corpus[0]}")

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [4]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(f"Dimensions of Document-term matrix: {X.toarray().shape}")

Dimensions of Document-term matrix: (50, 3075)


From the result above, we can see that the second dimension gives us the size of our vocabulary.

But why restrict ourselves to single words? We can pass an additional argument to the CountVectorizer() object to add n-grams to our vocabulary.

What is an n-gram?  It's just a collection of n consecutive words. For example:
"New", "York", "City", "subway" are all unigrams
"New York", "York City", "City subway" are bigrams
"New York City", "York City subway" are trigrams
"New York City subway" is a 4-gram

We can specify to include n-grams with the ngram_range argument.  This takes a tuple which specifies the range of n-grams that we should include (inclusively).

In [5]:
# Include unigrams, bigrams, and trigrams

vectorizer = CountVectorizer(ngram_range=(1,3))
X = vectorizer.fit_transform(corpus)

print(f"Dimensions of Document-term matrix: {X.toarray().shape}")

Dimensions of Document-term matrix: (50, 23397)


A common preprocessing step in many NLP applications is stop-word removal.
Common words like "a", "the", "and" often add a lot of noise, and don't typiccally contribute much to the task we are trying to solve.

CountVectorizer also comes equipped with a way of dealing with common English stop words!

In [6]:
vectorizer = CountVectorizer(ngram_range=(1,3), stop_words='english')
X = vectorizer.fit_transform(corpus)

print(f"Dimensions of Document-term matrix: {X.toarray().shape}")

Dimensions of Document-term matrix: (50, 14300)


# TF-IDF Vectors

Here we will demonstrate two ways to generate TF-IDF vectors with both sklearn and gensim.  It's good to be aware of both methods because depending on your specific workflow, one method might be easier than the other!

In [7]:
# For sklearn, it's VERY similar to how we did CountVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,3), stop_words='english')
X_tfidf = vectorizer.fit_transform(corpus)

In [8]:
import numpy as np
# How do these two representations compare?
# Let's look at the first 50 dimensions of the first document to gain some intuition

np.set_printoptions(precision=3) # This just makes things a little easier to read

print(f"CountVector: {X.toarray()[0,0:50]}\n\n")

print(f"TFIDF: {X_tfidf.toarray()[0,0:50]}")

CountVector: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 2 2 1 1 0 0 0 0 0 0 0]


TFIDF: [0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.    0.    0.    0.127 0.127 0.063 0.063 0.    0.    0.    0.    0.
 0.    0.   ]


Scikit-learn isn't our only option for doing TF-IDF.  Gensim is another popular library for many NLP tasks

In [9]:
import gensim

In [10]:
# Tokenize the documents
tokenized_docs = [gensim.utils.simple_preprocess(d) for d in corpus]

# Create a Gensim Dictionary.  This creates an id to word mapping for everything in our vocbulary
# It is NOT the same as the dictionary object in the Python standard library
mydict = gensim.corpora.Dictionary()

# Create a Gensim Corpus object.  This creates a list of tuples for each document.
# The first element of the tuple is the word id, the second is the number of counts
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_docs]

In [11]:
# This creates the doc-term matrix as a numpy array.
# Typically these matrices are HUGE so, it's usuall not a great idea to create the full dense doc-term matrix.
# We do it here to illustrate that you can get the same info as we obtained in scikit-learn!
doc_term_matrix = gensim.matutils.corpus2dense(mycorpus, num_terms=len(mydict))

In [12]:
doc_term_matrix

array([[2., 0., 0., ..., 2., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 3., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)

In [13]:
# Creating a tf-idf model is very simple!
tfidf = gensim.models.TfidfModel(mycorpus)
tfidf_matrix = gensim.matutils.corpus2dense(tfidf[mycorpus], num_terms=len(mydict))

In [14]:
tfidf_matrix

array([[0.146, 0.   , 0.   , ..., 0.141, 0.   , 0.   ],
       [0.093, 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
       [0.037, 0.   , 0.056, ..., 0.   , 0.   , 0.   ],
       ...,
       [0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.072],
       [0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.072],
       [0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.072]], dtype=float32)

# Word2Vec and GloVe

Word2Vec is a very powerful and useful word embedding method.  The math can get a little sticky, but luckily Gensim comes equipped with ways for us to train our own Word2Vec model, or load in a pre-trained word2vec model.  Let's check it out!

In [15]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

tokenized_docs = [gensim.utils.simple_preprocess(d) for d in documents]


In [16]:
# size refers to the desired dimension of our word vectors
# window refers to the size of our context window
# sg means that we are using the Skip-gram architecture

model = gensim.models.Word2Vec(tokenized_docs, size=10, window=2,min_count=1, sg=1)

In [17]:
print(model['human'])

[ 0.01   0.035 -0.008 -0.027  0.016 -0.041  0.009 -0.012  0.037 -0.021]


  print(model['human'])


Training our own model with word2vec is pretty cool, but it requires us to have a large corpus of data.

Fortunately, research groups at Stanford and Google have made their pre-trained word embeddings publicly available for us to use!

Google's word2vec: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

GloVe:  https://nlp.stanford.edu/projects/glove/

Just note that these model's will require ~4 GB of RAM to fit in memory

In [18]:
# Path to where the word2vec file lives
#google_vec_file = '/Users/joshuamailman/Downloads/GoogleNews-vectors-negative300.bin'
google_vec_file = '/Users/joshuamailman/Downloads/GoogleNews-vectors-negative300.bin'

In [19]:
# Load it!  This might take a few minutes...
model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)

In [50]:
model['ontology']

100

In [51]:
# We can access individual word vectors using a dictionary-like syntax
(model['fruit']

100

In [43]:
# Some cool results!

model.most_similar(positive =['ontology'], topn=8)

[('ontologies', 0.6897775530815125),
 ('epistemology', 0.6585497856140137),
 ('object-oriented', 0.6307498812675476),
 ('metaphysics', 0.588164210319519),
 ('pragmatics', 0.581379771232605),
 ('semantics', 0.5567224025726318),
 ('typology', 0.5516523122787476),
 ('scripting', 0.54522305727005)]

In [22]:
model.most_similar('president' ,topn=5)

[('President', 0.8006276488304138),
 ('chairman', 0.6708745360374451),
 ('vice_president', 0.6700224876403809),
 ('chief_executive', 0.6691275238990784),
 ('CEO', 0.6590125560760498)]

In [23]:
# Here's an analogy task!

model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]

In [24]:
model.most_similar(positive=['Paris', 'England'], negative=['London'])

[('France', 0.667637825012207),
 ('Les_Bleus', 0.5665801167488098),
 ('Stade_De', 0.5045602917671204),
 ('Marseille', 0.502022922039032),
 ('Marc_Lièvremont', 0.500834584236145),
 ('les_bleus', 0.49737778306007385),
 ('Les_Tricolores', 0.49725979566574097),
 ('Fabien_Galthié', 0.49014753103256226),
 ('French', 0.4892624020576477),
 ('les_Bleus', 0.4850963056087494)]

Using GloVe with Gensim requires a little extra leg work, but it's not too bad.
The problem is that the file format that is publicly available doesn't play nice with Gensim.
Luckily, Gensim provides a handy method of converting it!

In [25]:
#glove_file = glove_dir  = '/Users/joshuamailman/Downloads/glove.6B/glove.6B.100d.txt'
glove_file = glove_dir  = '../glove.6B/glove.6B.100d.txt'




In [27]:
w2v_output_file = '../glove.6B/glove.6B.100d.txt.w2v'


In [28]:
# The following utility converts file formats
gensim.scripts.glove2word2vec.glove2word2vec(glove_file, w2v_output_file)

(400000, 100)

In [29]:
# Now we can load it!
model = gensim.models.KeyedVectors.load_word2vec_format(w2v_output_file, binary=False)

In [30]:
# How does it compare to the previous examples we did with word2vec?
model.most_similar('meeting' ,topn=8)

[('conference', 0.8648155927658081),
 ('meetings', 0.8619149923324585),
 ('summit', 0.8164560794830322),
 ('talks', 0.814795970916748),
 ('discuss', 0.795186460018158),
 ('ministers', 0.7842442989349365),
 ('met', 0.7822972536087036),
 ('leaders', 0.7725203037261963)]

In [31]:
model.most_similar('president' ,topn=5)

[('vice', 0.828760027885437),
 ('presidency', 0.7150214910507202),
 ('former', 0.706093966960907),
 ('presidents', 0.6961984038352966),
 ('chairman', 0.6928698420524597)]

In [32]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755735874176025),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534753799438),
 ('prince', 0.6517034769058228),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.6311717629432678),
 ('emperor', 0.6106470823287964),
 ('wife', 0.6098655462265015)]

# Your Turn!

- Using either word2vec or GloVe, what interesting analogies or relationships?

- Given a short piece of text (like a tweet) what strategies can you think of to create a "tweet vector"?