# Embeddings -- Deep Learning for Text
***

## Table of Contents
1. [Part 1: Text Preprocessing of the 'Alice in Wonderland' Text](#Part-1:-Text-Preprocessing-of-the-'Alice-in-Wonderland'-Text)<br>
    1.1. [Case Normalization](#Case-Normalization)<br>
    1.2. [Tokenization](#Tokenization)<br>
    1.3. [Stopwords/Punctuation removal](#Stopwords/Punctuation-removal)<br>
    1.4. [Stemming our data](#Stemming-our-data)<br>
2. [Part 2: Text Representation for Alice in Wonderland](#Part-2:-Text-Representation-for-Alice-in-Wonderland)<br>
    2.1. [Training Text embeddings](#Training-Text-embeddings)<br>
    2.2. [Finding similarities between the representations of the phrases](#Finding-similarities-between-the-representations-of-the-phrases)<br>
    2.3. [Using GloVe pre-trained word embeddings](#Using-GloVe-pre-trained-word-embeddings)

### Part 1: Text Preprocessing of the 'Alice in Wonderland' Text
This activity aims to apply basic text preprocessing steps in the Alice in Wonderland corpora from the NLTK module. Techniques such as case normalization, tokenization, stopwords/punctuation removal and stemming are the ones to be performed in this first part.

In [1]:
#Natural Language Toolkit module
import nltk

#Tokenizer
nltk.download('punkt')
from nltk import tokenize

#Punctuation list
from string import punctuation
punct = list(punctuation)

#Stopwords lists
nltk.download("stopwords")
from nltk.corpus import stopwords
stops_funct = list(stopwords.words("english")) #Functional Stopwords
stops_contx = ['--','said']                    #Contextual Stopwords

#Stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#For Word embeddings
from gensim.models import word2vec

from gensim.models import KeyedVectors

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Case Normalization

In [2]:
#Load entire text for Alice In Wonderland corpora from NLTK
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')

In [3]:
#Print the first 800 characters
alice_raw[:800]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit"

In [4]:
#Change the raw text to lowercase
txt_lower = alice_raw.lower()
txt_lower



#### Tokenization

In [5]:
#Tokenize the raw text into sentences
txt_sent = tokenize.sent_tokenize(txt_lower)
txt_sent[0] #Print first sentence in the text

"[alice's adventures in wonderland by lewis carroll 1865]\n\nchapter i. down the rabbit-hole\n\nalice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought alice 'without pictures or\nconversation?'"

In [6]:
#Tokenize sentences into words 
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sent]
txt_words

[['[',
  'alice',
  "'s",
  'adventures',
  'in',
  'wonderland',
  'by',
  'lewis',
  'carroll',
  '1865',
  ']',
  'chapter',
  'i.',
  'down',
  'the',
  'rabbit-hole',
  'alice',
  'was',
  'beginning',
  'to',
  'get',
  'very',
  'tired',
  'of',
  'sitting',
  'by',
  'her',
  'sister',
  'on',
  'the',
  'bank',
  ',',
  'and',
  'of',
  'having',
  'nothing',
  'to',
  'do',
  ':',
  'once',
  'or',
  'twice',
  'she',
  'had',
  'peeped',
  'into',
  'the',
  'book',
  'her',
  'sister',
  'was',
  'reading',
  ',',
  'but',
  'it',
  'had',
  'no',
  'pictures',
  'or',
  'conversations',
  'in',
  'it',
  ',',
  "'and",
  'what',
  'is',
  'the',
  'use',
  'of',
  'a',
  'book',
  ',',
  "'",
  'thought',
  'alice',
  "'without",
  'pictures',
  'or',
  'conversation',
  '?',
  "'"],
 ['so',
  'she',
  'was',
  'considering',
  'in',
  'her',
  'own',
  'mind',
  '(',
  'as',
  'well',
  'as',
  'she',
  'could',
  ',',
  'for',
  'the',
  'hot',
  'day',
  'made',
  'her'

#### Stopwords/Punctuation removal

In [7]:
#Combine 'stops_funct', 'stops_contx' and 'punct' in one list
stop_final = stops_funct + stops_contx + punct
stop_final

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [8]:
#Define a fuction to remove the the items from 'stop_final'
def drop_stop(input_tokens):
    return [token for token in input_tokens \
            if token not in stop_final]

In [9]:
#Remove the terms in stop_final
alice_nostop = [drop_stop(stops) for stops in txt_words]

In [10]:
#Print out first three sentences w/o stopswords and compare to txt_words
for i,j in zip(txt_words[:3],alice_nostop[:3]):
    print('WITH STOPS: \n', i, '\n\n'
          'W/O STOPS: \n', j, '\n')

WITH STOPS: 
 ['[', 'alice', "'s", 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', ']', 'chapter', 'i.', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'alice', "'without", 'pictures', 'or', 'conversation', '?', "'"] 

W/O STOPS: 
 ['alice', "'s", 'adventures', 'wonderland', 'lewis', 'carroll', '1865', 'chapter', 'i.', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', "'and", 'use', 'book', 'thought', 'alice', "'without", 'pictures',

>**Observations:**  
Most of the stop words and punctuations are removed from the text. The data look more clean and can be used for further analysis and NLP application.

#### Stemming our data

In [11]:
#Apply the stemmer to each term
alice_words_stem = [[stemmer.stem(token) for token in sent] \
                    for sent in alice_nostop]

In [12]:
#Print out first three sentences before and after applying the stemmer
for i,j in zip(txt_words[:3],alice_words_stem[:3]):
    print('BEFORE APPLYING THE STEMMER: \n', i, '\n\n'
          'AFTER APPLYING THE STEMMER: \n', j, '\n')

BEFORE APPLYING THE STEMMER: 
 ['[', 'alice', "'s", 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', ']', 'chapter', 'i.', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'alice', "'without", 'pictures', 'or', 'conversation', '?', "'"] 

AFTER APPLYING THE STEMMER: 
 ['alic', "'s", 'adventur', 'wonderland', 'lewi', 'carrol', '1865', 'chapter', 'i.', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', "'and", 'use', 'book', 'thought', 'alic', "'without", 'pictur', 

>**Observations:**  
From the text above we see that all the plurals of the words are removed and only the main stem is being prefered and the suffixes like -ing, -s, -es, -ed, -ies are chopped off.

### Part 2: Text Representation for Alice in Wonderland
In this section, text representation approaches will be applied. Text embeddings were created from the data to find relations. Finally, pretrained embeddings were used to represent the data in the text.

In [13]:
#Work on the result of 'alice_nostop'
print(alice_nostop)



#### Training Text embeddings

In [14]:
#Train a Word2Vec model on 'alice_nostop' with default parameters. Find the 10 terms most similar to rabbit.
model = word2vec.Word2Vec(alice_nostop, window=2)
model.wv.most_similar("rabbit", topn=10)

[('little', 0.9455776810646057),
 ("'s", 0.9422605037689209),
 ('alice', 0.9410143494606018),
 ('one', 0.9382156133651733),
 ('first', 0.9346249103546143),
 ("'and", 0.9344311952590942),
 ('voice', 0.933306097984314),
 ('went', 0.9329162240028381),
 ('see', 0.9326656460762024),
 ('like', 0.9317619204521179)]

In [15]:
#Train a Word2Vec model on 'alice_nostop' using a window size of 2. Find the 10 terms most similar to rabbit.
model_w2 = word2vec.Word2Vec(alice_nostop)
model_w2.wv.most_similar("rabbit", topn=10)

[('little', 0.9958827495574951),
 ('alice', 0.9957787394523621),
 ("'s", 0.9956453442573547),
 ('one', 0.9954135417938232),
 ('went', 0.9952766299247742),
 ("'and", 0.9951276183128357),
 ('voice', 0.9949682354927063),
 ('see', 0.9948897361755371),
 ('could', 0.9948313236236572),
 ('two', 0.9948238134384155)]

In [16]:
#Train a Word2Vec model on 'alice_nostop' using Skipgram method, Window size of 5. Find the 10 terms most similar to rabbit.
model_sg = word2vec.Word2Vec(alice_nostop, window=5, sg=1)
model_sg.wv.most_similar("rabbit", topn=10)

[('near', 0.9980894327163696),
 ('words', 0.9979833960533142),
 ('mushroom', 0.9979005455970764),
 ('voice', 0.9978873133659363),
 ('little', 0.9978621602058411),
 ('first', 0.9978477358818054),
 ('called', 0.9978348612785339),
 ('see', 0.9978142976760864),
 ('one', 0.9977234601974487),
 ('anything', 0.9977220296859741)]

**Observations:**  

|      **Default parameters**      |        **Window size = 2**       |   **Window size = 5; Skipgram**   |
|:--------------------------------|:--------------------------------|:---------------------------------|
| [('little', 0.9455731511116028), | [('little', 0.9958850741386414), | [('near', 0.9980902075767517),    |
| ("'s", 0.9422791600227356),      | ('alice', 0.9957777857780457),   | ('words', 0.9979811906814575),    |
| ('alice', 0.9410364031791687),   | ("'s", 0.9956576228141785),      | ('mushroom', 0.9979040026664734), |
| ('one', 0.9382508993148804),     | ('one', 0.9954147934913635),     | ('voice', 0.997887134552002),     |
| ('first', 0.934668242931366),    | ('went', 0.9952813386917114),    | ('little', 0.997858464717865),    |
| ("'and", 0.9343937039375305),    | ("'and", 0.995123565196991),     | ('first', 0.9978495836257935),    |
| ('voice', 0.9333181977272034),   | ('voice', 0.9949773550033569),   | ('called', 0.9978437423706055),   |
| ('went', 0.9328914880752563),    | ('see', 0.9948962330818176),     | ('see', 0.9978193044662476),      |
| ('see', 0.932692289352417),      | ('could', 0.994831919670105),    | ('anything', 0.9977285265922546), |
| ('would', 0.9318217039108276)]   | ('two', 0.9948306083679199)]     | ('one', 0.9977281093597412)]      |


We can observe the default parameters and window size=2 have quite similar results. The first word for both was little, which can be a characteristic of this particular character. On the other hand, using a window size = 5 and the skipgram model the words are not as similar nor common, though the words seem to provide some context of along the target. 

#### Finding similarities between the representations of the phrases

In [17]:
#Find the representation for the phrase 'white rabbit' by averaging the vectors for 'white' and 'rabbit' using skipgram model
v1 = model_sg.wv['white']
v2 = model_sg.wv['rabbit']
res1 = (v1+v2)/2

In [18]:
#Find the representation for the phrase 'mad hatter' by averaging the vectors for 'mad' and 'hatter' using skipgram model
v3 = model_sg.wv['mad']
v4 = model_sg.wv['hatter']
res2 = (v3+v4)/2

In [19]:
#Find the cosine simililarity between these two phrases
model_sg.wv.cosine_similarities(res1, [res2])

array([0.9982772], dtype=float32)

>**Observations:**  
The result is a cosine similarity is 0.99, very close to 1. This means that the model thinks the phrases "white rabbit" and "mad hatter" are similar in meaning or seem to appear in the book very frequently.

In [20]:
#Create a model using the pre-trained GloVe embeddings of size 100D.
glove_model = KeyedVectors.load_word2vec_format\
("glove.6B.100d.w2vformat.txt", binary=False)

#### Using GloVe pre-trained word embeddings

In [21]:
#Find representations for white rabbit
v1 = glove_model['white']
v2 = glove_model['rabbit']
res1 = (v1+v2)/2

#Find representations for madd hatter
v3 = glove_model['mad']
v4 = glove_model['hatter']
res2 = (v3+v4)/2

In [22]:
#Find the cosine similarity between the two phrases.
glove_model.cosine_similarities(res1, [res2])

array([0.45145565], dtype=float32)

>**Observations**  
When using the GloVe pre-trained embeddings the cosine similarity dropped significantly, whereas our model finds these two words almost the same. Our model was trained entirely on Alice in Wonderland corpora, on the other hand the GloVe model might not be as familiar with this text, therefore this terms might not come as frequently.