## Distributed representation of Words

#### [Distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics) 
Distributional semantics holds that we can analyze how words are used in language to infer their meaning. It builds on the "distributional hypothesis": <i>linguistic items with similar distributions have similar meanings.</i>

The <b>distributional hypothesis</b> in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to have similar meanings.






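#### Co-occurrence matrix
A co-occurrence matrix is a terms × terms matrix in which each entry counts the number of times one term appears in the context of another term. In the example below, the context is a whole document: two terms co-occur when they appear in the same text.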

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
docs = ['Boston Celtics are leading in their conference.',
        'LA Lakers have secured top spot in Western conference table.',
        'Kobe and Lebron in 2010 were best players in their respective conference']
count_model = CountVectorizer(ngram_range=(1, 1))  # unigrams only (the default)
X = count_model.fit_transform(docs)  # sparse document-term count matrix
# X[X > 0] = 1  # uncomment to count each term at most once per document
Xc = X.T @ X  # term-term co-occurrence matrix (sparse)
Xc.setdiag(0)  # zero the diagonal to ignore a term's co-occurrence with itself
print(Xc.todense())  # print the matrix in dense format

[[0 1 0 1 0 0 1 0 2 1 0 0 0 1 1 1 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 1 0 2 1 0 0 0 1 1 1 0 0 0 1 0 1 0]
 [0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0]
 [1 1 0 0 0 0 1 0 2 1 0 0 0 1 1 1 0 0 0 1 0 1 0]
 [0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0]
 [0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0]
 [1 1 1 1 1 1 0 1 4 1 1 1 1 1 1 1 1 1 1 2 1 1 1]
 [0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1]
 [2 2 1 2 1 1 4 1 0 2 1 1 1 2 2 2 1 1 1 3 1 2 1]
 [1 1 0 1 0 0 1 0 2 0 0 0 0 1 1 1 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1]
 [0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 1]
 [0 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [1 1 0 1 0 0 1 0 2 1 0 0 0 0 1 1 0 0 0 1 0 1 0]
 [1 1 0 1 0 0 1 0 2 1 0 0 0 1 0 1 0 0 0 1 0 1 0]
 [1 1 0 1 0 0 1 0 2 1 0 0 0 1 1 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1]
 [0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1]
 [0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1]
 [1 1 1 1 1 1 2 0 3 1 0 0 1 1 1 1 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1 1 1 
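
The printed matrix has no row or column labels, so it is hard to tell which terms a given count refers to. The following is a minimal sketch, assuming the same `docs` and `CountVectorizer` setup as above; the helper `cooccurrence` and the `terms` list are hypothetical names introduced here for illustration. It uses the fitted `vocabulary_` mapping to look up a count by term.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Boston Celtics are leading in their conference.',
        'LA Lakers have secured top spot in Western conference table.',
        'Kobe and Lebron in 2010 were best players in their respective conference']

count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)

# Dense term-term counts with the diagonal zeroed, as in the cell above.
Xc = (X.T @ X).toarray()
np.fill_diagonal(Xc, 0)

# vocabulary_ maps each (lowercased) term to its column index in X,
# which is also its row/column index in Xc.
vocab = count_model.vocabulary_
terms = sorted(vocab, key=vocab.get)  # terms in matrix order


def cooccurrence(term_a, term_b):
    """Hypothetical helper: look up the count for a pair of terms."""
    return int(Xc[vocab[term_a], vocab[term_b]])


print(terms[:5])
print(cooccurrence('in', 'conference'))
print(cooccurrence('boston', 'lakers'))
```

Because the counts are products of per-document term frequencies, a term that appears twice in one text (such as "in" in the third sentence) contributes twice to every pair it takes part in there.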

### Why is a co-occurrence matrix useful?

A blog post by [Chris Moody](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/) suggests that we can directly factorize a co-occurrence matrix and get good word embeddings. In practice, we can follow these simple steps:
- Compute the probability of occurrence of each word, $p(x)$.
- Compute the probability of co-occurrence of each pair of words, $p(x,y)$.
- Divide each co-occurrence probability by the product of the two words' individual probabilities: $\frac{p(x,y)}{p(x)\,p(y)}$.
- Apply the logarithm to the ratio: $\log\left[\frac{p(x,y)}{p(x)\,p(y)}\right]$. This quantity is the pointwise mutual information (PMI) of the two words.
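
The steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the blog post's code: the function names `pmi_matrix` and `svd_embeddings` are hypothetical, the word probabilities are estimated from the row and column sums of the co-occurrence matrix, pairs that never co-occur are assigned a value of 0 instead of $-\infty$, and the factorization step uses a plain SVD scaled by the square root of the singular values, which is one common convention among several.

```python
import numpy as np


def pmi_matrix(cooc):
    """Log of p(x,y) / (p(x) p(y)) for a term-term count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_xy = cooc / total                 # joint probabilities p(x, y)
    p_x = cooc.sum(axis=1) / total      # marginals from row sums (assumption)
    p_y = cooc.sum(axis=0) / total      # marginals from column sums
    expected = np.outer(p_x, p_y)       # p(x) p(y) under independence
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_xy / expected)
    # Assumption: pairs with zero counts get 0 rather than -inf or NaN.
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi


def svd_embeddings(pmi, dim):
    """Factorize the matrix with SVD and keep the top `dim` components."""
    u, s, _ = np.linalg.svd(pmi)
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy symmetric co-occurrence counts for three terms.
cooc = np.array([[0, 2, 0],
                 [2, 0, 1],
                 [0, 1, 0]])
pmi = pmi_matrix(cooc)
emb = svd_embeddings(pmi, dim=2)
print(pmi)
print(emb.shape)
```

Each row of `emb` can then serve as a low-dimensional vector for the corresponding term. On a real corpus the co-occurrence matrix is large and sparse, so a truncated sparse solver would likely be a better fit than the dense SVD used here.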
