### Word embeddings (word vectors)

    A multi-dimensional representation of a word (a dense vector of real-valued features) learned by a neural network model
    Captures relationships between words as vector arithmetic, e.g. king - queen ≈ man - woman
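
The vector arithmetic above can be illustrated with a toy example. This is a minimal sketch: the three-dimensional vectors below are hand-made for illustration and are not learned by any model, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors (NOT learned embeddings).
# Dimensions loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman lands closest to queen in this toy space
print(analogy("king", "man", "woman", toy_vectors))
```

With real embeddings the same idea applies, but the vectors have many more dimensions and the relationship holds only approximately.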

### Creating a gensim dictionary and corpus

    Create a dictionary that maps each word (token) to an integer id and tracks its frequency
    Create a corpus from the dictionary. A corpus (plural corpora) is a set of texts used for performing NLP tasks
    A gensim corpus is a bit different from a normal corpus (a collection of documents): each document is stored as a list of (token id, frequency) pairs
    Gensim models can be easily saved, updated and reused
    The dictionary can also be updated with new documents
    The result is a more advanced and feature-rich bag of words that can be used for topic modelling

In [11]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_docs = ["The movie was about spaceships and the passengers",
          "I really liked the movie",
          "Awesome action scenes but boring characters",
          "The movie was awful! I hate alien movies",
          "Space is cool", "I liked the movie",
          "More space films, please",
          "The movie was very cool", "The animations were cool"]

In [12]:
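# Lowercase and tokenize every document
tokenized_doc = [word_tokenize(doc.lower()) for doc in my_docs]
# Map each unique token to an integer id
dictionary = Dictionary(tokenized_doc)
dictionary.token2id  # token -> id mapping (not displayed: only the last expression of a cell is shown)
# Convert each document to a bag of words: a list of (token id, count) tuples
corpus = [dictionary.doc2bow(doc) for doc in tokenized_doc]
corpus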

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(2, 1), (5, 1), (6, 1), (7, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(21, 1), (22, 1), (23, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(23, 1), (24, 1), (25, 1), (26, 1), (27, 1)],
 [(2, 1), (5, 1), (6, 1), (21, 1), (28, 1)],
 [(5, 1), (21, 1), (29, 1), (30, 1)]]

In [13]:
# A gensim corpus is a list of lists, where each inner list represents a document.
# Each inner list is a series of tuples: the first item is the token id from the dictionary
# and the second item is the token's frequency in that document.
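
To make the structure concrete, here is a minimal pure-Python sketch of how a bag-of-words corpus of this shape can be built and decoded back to tokens. It is not gensim's implementation: the regex tokenizer is a rough stand-in for `word_tokenize`, the helper names `tokenize` and `to_bow` are hypothetical, and assigning ids to new tokens in sorted order is an assumption chosen because it reproduces the first two rows of the corpus shown above.

```python
import re
from collections import Counter

docs = ["The movie was about spaceships and the passengers",
        "I really liked the movie"]

def tokenize(text):
    # Rough stand-in for word_tokenize: words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

token2id = {}  # grows as new tokens are seen

def to_bow(tokens):
    """Return a sorted list of (token id, count) tuples for one document."""
    counts = Counter(tokens)
    # Assumption: unseen tokens receive the next free ids in sorted order
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

toy_corpus = [to_bow(tokenize(doc)) for doc in docs]
id2token = {i: t for t, i in token2id.items()}

for bow in toy_corpus:
    print(bow)                                   # (token id, count) pairs
    print([(id2token[i], n) for i, n in bow])    # same pairs, decoded to tokens
```

Decoding the first document shows that the pair `(5, 2)` corresponds to the token "the", which occurs twice in that sentence.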

### Tf-Idf using gensim
    * Tf-idf stands for term frequency - inverse document frequency
    * It determines the most important words in each document
    * The idea behind tf-idf is that a corpus may share words beyond just stopwords, and these shared words should be down-weighted in importance
    * Ensures the most common words don't show up as keywords
    * Keeps document-specific frequent words weighted high and words common across the entire corpus weighted low
    
    Tf-idf formula
    
        w_{i,j} = tf_{i,j} * log(N / df_i)
        w_{i,j} = tf-idf weight for token i in document j
        tf_{i,j} = number of occurrences of token i in document j
        df_i = number of documents that contain token i
        N = total number of documents
    

In [14]:
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)
tfidf[corpus[1]]

[(2, 0.19806533562037576),
 (5, 0.13662879325982816),
 (7, 0.370197258061228),
 (8, 0.5068260513210562),
 (9, 0.740394516122456)]
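
The weights above can be reproduced by hand from the tf-idf formula. The sketch below assumes two details that the formula does not state: a base-2 logarithm, and scaling each document's weights to unit Euclidean length. These are assumed to match `TfidfModel`'s defaults because they yield the same numbers; the helper name `tfidf_weights` is hypothetical.

```python
import math

# The bag-of-words corpus from the cells above
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)],
    [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
    [(2, 1), (5, 1), (6, 1), (7, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
    [(21, 1), (22, 1), (23, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(23, 1), (24, 1), (25, 1), (26, 1), (27, 1)],
    [(2, 1), (5, 1), (6, 1), (21, 1), (28, 1)],
    [(5, 1), (21, 1), (29, 1), (30, 1)],
]

n_docs = len(corpus)  # N in the formula

# df_i: number of documents that contain token i
df = {}
for bow in corpus:
    for token_id, _ in bow:
        df[token_id] = df.get(token_id, 0) + 1

def tfidf_weights(bow):
    """tf * log2(N / df) per token, then scaled to unit L2 norm (assumed defaults)."""
    raw = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in raw]

for token_id, weight in tfidf_weights(corpus[1]):
    print(token_id, round(weight, 4))
```

Token 9 ("really") appears in only one document, so it receives the highest weight, while token 5 ("the") appears in six of the nine documents and receives the lowest.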

In [15]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(2, 1), (5, 1), (6, 1), (7, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(21, 1), (22, 1), (23, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(23, 1), (24, 1), (25, 1), (26, 1), (27, 1)],
 [(2, 1), (5, 1), (6, 1), (21, 1), (28, 1)],
 [(5, 1), (21, 1), (29, 1), (30, 1)]]