# Finding The Words - Word Embeddings and the Word2Vec model
##### David Miller - March 2019 - [Link to Github][1]
---
[1]: https://github.com/millerdw/millerdw.github.io/tree/master/_notebooks/FindingTheWords_4

In a [previous post](https://millerdw.github.io/Word-Roots-and-Associations-in-Vectorised-NLP/) I looked at how stemming appeared to improve the clustering of articles. The idea was that words sharing a stem are most likely related, so stemming passes some of that knowledge on to the model.

I didn't get as far as I wanted with that train of thought, because of some limitations of clustering by word frequency, and I never reached the more interesting topics, such as building true associations between words.



So with all of that in mind, for this piece I'm going to look at the following:

- Word Associations
    + PCA as a means of building word associations
- Word Embeddings
    + Translating words into 'meaning' vectors
- Transfer Learning
    + Using the Word2Vec model

In [1]:
import DataAccess
import Preprocessing
import Vectors
import Clustering
import WordClouds
import numpy as np

In [2]:
#load the raw, unprocessed articles
rawArticles = DataAccess.getArticles()


In [3]:
#generate list of tokenised, stemmed articles with stopwords removed
articles = Preprocessing.preprocessArticles(rawArticles)
stemmedArticles = Preprocessing.stemTexts(articles)
meaningfulStemmedArticles = Preprocessing.removeStopwords(stemmedArticles)


In [7]:
#vectorise articles
vectorisedStemmedArticles, stemmedVocabulary = Vectors.vectoriseCorpus(stemmedArticles)
vectorisedArticles, vocabulary = Vectors.removeStopwordColumns(vectorisedStemmedArticles, stemmedVocabulary)


TypeError: 'int' object is not iterable

In [None]:
K=20
G=100
articleCentroidIds,centroids,performance = Clustering.kMeansCluster(vectorisedArticles,K,G)

In [None]:

np.array([[i,Clustering.countClusterArticles(meaningfulStemmedArticles,articleCentroidIds,i)] for i in range(K)])

In [None]:
Ks=np.array([[1,2,5],[6,7,9],[12,17,18]])
WordClouds.plotClusterWordCloudArray(meaningfulStemmedArticles,articleCentroidIds,Ks)

### Summary

## Word Associations
### PCA on Document Level Word Correlations

Problem with the dimensionality required in 

In [None]:
vectorisedArticles.shape

In [None]:
from scipy.sparse.linalg import svds

u, s, vt = svds(vectorisedStemmedArticles, vectorisedStemmedArticles.shape[0]-1)

U S V descriptions

In [None]:
print(u.shape)
print(s.shape)
print(vt.T.shape)
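
As a rough guide to what the three shapes above mean, here is a minimal sketch on a toy matrix. The matrix `toy_counts` and its size are made up for illustration and are not the article data. For an `M x N` input and `k` requested components, `svds` returns `u` with shape `(M, k)` (one row per document), `s` with shape `(k,)` (the singular values), and `vt` with shape `(k, N)` (one column per vocabulary term).

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy "document-term" matrix: 6 documents (rows) x 10 terms (columns).
# Hypothetical data, standing in for the vectorised articles.
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(6, 10)).astype(float)

# svds needs k to be strictly smaller than min(M, N)
k = min(toy_counts.shape) - 1
u_toy, s_toy, vt_toy = svds(toy_counts, k=k)

print(u_toy.shape)     # (6, 5)  -> documents x components
print(s_toy.shape)     # (5,)    -> one singular value per component
print(vt_toy.T.shape)  # (10, 5) -> terms x components
```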

In [None]:
import matplotlib.pyplot as plt
plt.scatter(range(s.shape[0]),s/np.sum(s))

#### Singular Values

In [None]:
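def SVD(vectorisedText) :
    # svds needs k strictly below min(rows, columns), and expects a floating-point matrix
    k = min(vectorisedText.shape)-1
    u, s, vt = svds(vectorisedText.astype(float), k)
    # find the position of the largest singular value (svds usually returns them in ascending order)
    n=np.argmax(s)
    # reverse the first n+1 columns of u, so the largest component comes first
    u_filt = u[:, n::-1]
    # reverse the first n+1 entries of s
    s_filt = s[n::-1]
    # reverse the first n+1 rows of vt
    vt_filt = vt[n::-1, :]
    
    return u_filt, s_filt, vt_filt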

u_filt, s_filt, vt_filt = SVD(vectorisedStemmedArticles)
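
The reordering in `SVD` relies on the largest singular value sitting at the end of `s`. A slightly more defensive alternative, sketched here under the assumption that the order returned by the solver should not be trusted at all, is to sort explicitly with `np.argsort`. The helper name `sort_svd_descending` is hypothetical and not part of the notebook's modules.

```python
import numpy as np

def sort_svd_descending(u, s, vt):
    """Reorder SVD factors so the singular values run from largest to smallest."""
    order = np.argsort(s)[::-1]          # indices of s, largest value first
    return u[:, order], s[order], vt[order, :]

# Demo on a random matrix whose factors are deliberately put in ascending order
rng = np.random.default_rng(1)
a = rng.normal(size=(5, 8))
u_full, s_full, vt_full = np.linalg.svd(a, full_matrices=False)
u_asc, s_asc, vt_asc = u_full[:, ::-1], s_full[::-1], vt_full[::-1, :]

u_d, s_d, vt_d = sort_svd_descending(u_asc, s_asc, vt_asc)

# The product is unchanged by the reordering, only the component order differs
reconstructed = u_d @ np.diag(s_d) @ vt_d
print(np.allclose(reconstructed, a))
```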

In [None]:
def PlotSVD(s) :
    fig, axes = plt.subplots(1,2,figsize=(12,4))
    axes[0].scatter(range(s.shape[0]),s/np.sum(s))
    axes[1].scatter(range(s.shape[0]),np.cumsum(s)/np.sum(s))
    
PlotSVD(s_filt)

#### Working with Less


In [None]:
relevantIndices = [i for i,v in enumerate(np.cumsum(s_filt)/np.sum(s_filt)) if v<=0.8]
relevantIndexMask = np.diag([v<=0.8 for v in np.cumsum(s_filt)/np.sum(s_filt)])
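
To make the 0.8 cut-off concrete, here is a small worked example on made-up singular values (`s_toy` is illustrative, not taken from the articles). Components are kept for as long as their cumulative share of the total stays at or below 80%.

```python
import numpy as np

# Hypothetical singular values, already sorted largest first
s_toy = np.array([4.0, 3.0, 2.0, 1.0])

# Running share of the total: 0.4, 0.7, 0.9, 1.0
cumulative_share = np.cumsum(s_toy) / np.sum(s_toy)

# Keep components while the running share is at most 0.8
keep = cumulative_share <= 0.8
n_kept = int(np.sum(keep))

print(keep)    # [ True  True False False]
print(n_kept)  # 2
```

Note that this thresholds the sum of the singular values themselves. The "explained variance" usually quoted for PCA is based on the squared singular values, which would give a different cut-off.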

In [None]:
print(u_filt.shape)
print(s_filt.shape)
print(relevantIndexMask.shape)
print(vt_filt.shape)

In [None]:
vectorisedNoiseFilteredArticles = np.dot(np.dot(u_filt,np.diag(s_filt)),np.dot(relevantIndexMask,vt_filt))
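
The masked product above zeroes out the discarded components. As a sketch, assuming the factors are ordered largest first, the same matrix can be built by slicing out only the kept components, which avoids constructing the square diagonal matrices. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=(6, 9))

# np.linalg.svd returns singular values in descending order
u, s, vt = np.linalg.svd(a, full_matrices=False)
keep = np.cumsum(s) / np.sum(s) <= 0.8

# Form used in the notebook: full product with a diagonal 0/1 mask
masked = u @ np.diag(s) @ (np.diag(keep.astype(float)) @ vt)

# Equivalent form: slice the kept components, scale the columns of u by s
sliced = (u[:, keep] * s[keep]) @ vt[keep, :]

print(np.allclose(masked, sliced))
print(masked.shape)  # same shape as the input, but lower rank
```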

In [None]:
K=20
G=100
nfArticleCentroidIds,nfCentroids,nfPerformance = Clustering.kMeansCluster(vectorisedNoiseFilteredArticles,K,G)
np.array([[i,Clustering.countClusterArticles(meaningfulStemmedArticles,nfArticleCentroidIds,i)] for i in range(K)])

In [None]:
Ks=np.array([[0,2,3],[5,6,10],[14,15,16]])
WordClouds.plotClusterWordCloudArray(meaningfulStemmedArticles,nfArticleCentroidIds,Ks)

### Sentence Level Decomposition

In [None]:
# build vocabulary
vocabulary, vocabToIndexMap = Vectors.buildVocabulary(stemmedArticles)

In [None]:
def sentenceCount(text) :
    return np.sum([word == "endofsen" for word in text])+1

def paragraphCount(text) :
    return np.sum([word == "endofpar" for word in text])+1

def wordCount(text) :
    return len(text)

def sentenceVectorise(text) :
    # positions of the sentence-end markers in the token list
    sentenceEnds = [i for i,word in enumerate(text) if word == "endofsen"]
    # sentences differ in length, so hold them in an object array rather than a ragged numeric one
    sentences = np.array(np.split(text, sentenceEnds), dtype=object)
    return Vectors.vectoriseTexts(vocabToIndexMap,sentences)
    
sentenceLevelVectorisedArticles = [sentenceVectorise(article) for article in stemmedArticles]
allSentences = np.vstack(sentenceLevelVectorisedArticles)
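
It is worth seeing what `np.split` does with the marker positions. It cuts *before* each index, so every piece after the first begins with the `endofsen` token. The sketch below shows this on a short, made-up token list, alongside a hypothetical helper `split_sentences` (not part of the notebook's modules) that drops the markers instead.

```python
import numpy as np

# Short, made-up token list in the same style as the stemmed articles
tokens = ["the", "relic", "endofsen", "endofpar",
          "such", "a", "discoveri", "endofsen"]

ends = [i for i, word in enumerate(tokens) if word == "endofsen"]
pieces = [p.tolist() for p in np.split(np.array(tokens), ends)]
# pieces[1] starts with the marker: ['endofsen', 'endofpar', 'such', 'a', 'discoveri']
print(pieces)

def split_sentences(tokens, end="endofsen", drop=("endofsen", "endofpar")):
    """Split a token list into sentences, discarding the marker tokens."""
    sentences, current = [], []
    for word in tokens:
        if word == end:
            if current:                  # close the sentence in progress
                sentences.append(current)
            current = []
        elif word not in drop:
            current.append(word)
    if current:                          # trailing tokens with no end marker
        sentences.append(current)
    return sentences

clean = split_sentences(tokens)
print(clean)
```

Whether the markers should be kept depends on whether they are meant to count as vocabulary terms in the sentence vectors.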


In [None]:
print(len(sentenceLevelVectorisedArticles))
print(allSentences.shape)

In [None]:
u_sen, s_sen, vt_sen = SVD(allSentences)
PlotSVD(s_sen)

In [170]:


#i=4
i+=1
print(wordCount(stemmedArticles[i]))
print(sentenceCount(stemmedArticles[i]))
print(paragraphCount(stemmedArticles[i]))
print(stemmedArticles[i])

408
15
18
['the', 'relic', 'in', 'mexico', 'citi', 'may', 'offer', 'clue', 'to', 'the', 'first', 'ever', 'discoveri', 'of', 'an', 'aztec', 'royal', 'burial', 'a', 'newli', 'discov', 'trove', 'of', 'aztec', 'sacrific', 'could', 'lead', 'archaeologist', 'to', 'an', 'elus', 'aztec', 'emperor', 'tomb', 'endofsen', 'endofpar', 'such', 'a', 'discoveri', 'would', 'mark', 'a', 'first', 'sinc', 'no', 'aztec', 'royal', 'burial', 'has', 'yet', 'been', 'found', 'despit', 'decad', 'of', 'dig', 'endofsen', 'endofpar', 'the', 'sacrifici', 'offerings,', 'includ', 'a', 'rich', 'adorn', 'jaguar', 'dress', 'as', 'a', 'warrior,', 'were', 'found', 'in', 'mexico', 'city,', 'reuter', 'report', 'endofsen', 'endofpar', '"we', 'have', 'enorm', 'expect', 'right', 'now,"', 'lead', 'archaeologist', 'leonardo', 'lopez', 'lujan', 'said', 'endofsen', 'endofpar', '"as', 'we', 'go', 'deeper', 'we', 'think', "we'll", 'continu', 'find', 'veri', 'rich', 'objects."', 'endofpar', 'discov', 'off', 'the', 'step', 'of', 'the',

### Revectorise with Word2Vec