# 3-2. **Word Embedding**

In [None]:
import re
import pprint
from lxml import etree
from gensim.models import Word2Vec

import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Word2Vec

### Data Preprocessing

In [None]:
!wget https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/ted_en-20160408.zip
!unzip ted_en-20160408.zip

--2023-10-16 22:57:43--  https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/ted_en-20160408.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16595708 (16M) [application/zip]
Saving to: ‘ted_en-20160408.zip’


2023-10-16 22:57:44 (120 MB/s) - ‘ted_en-20160408.zip’ saved [16595708/16595708]

Archive:  ted_en-20160408.zip
  inflating: ted_en-20160408.xml     


In [None]:
# Parsing the XML file; the context manager closes the file handle afterwards
with open('ted_en-20160408.xml', 'r', encoding='UTF8') as targetXML:
    target_text = etree.parse(targetXML)

# Getting contents of <content> tag from the xml file
parse_text = '\n'.join(target_text.xpath('//content/text()'))

# Removing 'Sound-effect labels' using regular expression (e.g. (Audio), (Laughter))
content_text = re.sub(r'\([^)]*\)', '', parse_text)
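
To see what the regular expression above does, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption for illustration, not taken from the TED data). The pattern matches an opening parenthesis, any run of characters that are not `)`, and the closing parenthesis, so it strips every parenthesized span, not only sound-effect labels.

```python
import re

# Hypothetical sample text containing two sound-effect labels
sample = "Thank you. (Applause) And then (Laughter) we left."

# Same pattern as in the preprocessing cell
cleaned = re.sub(r'\([^)]*\)', '', sample)
print(repr(cleaned))
```

The labels are replaced by an empty string, so the spaces on both sides of each label remain; the later normalization step collapses them.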

In [None]:
content_text[:1000]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.\nTo me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.\n\nFacit did too much exploitation. But exploration can go wild, too.\nA few years back, I worked closely alongside a European bio

In [None]:
# Splitting the text into sentences with the NLTK library
sent_text = sent_tokenize(content_text)

# Lowercasing and replacing every run of non-alphanumeric characters (punctuation included) with a space
normalized_text = []
for string in sent_text:
    tokens = re.sub(r'[^a-z0-9]+', ' ', string.lower())
    normalized_text.append(tokens)

# Tokenizing each sentence into individual words
sentences = [word_tokenize(sentence) for sentence in normalized_text]

# Printing only the first 10 tokenized sentences
print(sentences[:10])

[['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along'], ['they', 'continued', 'doing', 'exactly', 'the', 'same']]
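
The output above contains tokens such as `'s'` and `'m'` because the apostrophe is replaced by a space before tokenizing. The sketch below reproduces this on one sentence. It assumes `str.split` as a stand-in for `word_tokenize` (an assumption made so the example runs without the NLTK data); the helper name `normalize` is hypothetical.

```python
import re

def normalize(sentence):
    """Lowercase a sentence and replace each run of non-alphanumeric characters with a space."""
    return re.sub(r'[^a-z0-9]+', ' ', sentence.lower())

sample = "They only do what's new."
tokens = normalize(sample).split()  # str.split stands in for word_tokenize here
print(tokens)
```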


### Word2Vec - Continuous Bag-Of-Words (CBOW)

In [None]:
wv_cbow_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4, sg=0)

In [None]:
similar_words = wv_cbow_model.wv.most_similar('man')
pprint.pprint(similar_words)

[('woman', 0.8458347916603088),
 ('guy', 0.8152571320533752),
 ('lady', 0.7929542064666748),
 ('boy', 0.7698023915290833),
 ('girl', 0.7342793345451355),
 ('soldier', 0.700160801410675),
 ('kid', 0.6862170696258545),
 ('friend', 0.6859571933746338),
 ('gentleman', 0.6858844757080078),
 ('rabbi', 0.6655703783035278)]


### Word2Vec - Skip Gram

In [None]:
wv_sg_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4, sg=1)

In [None]:
similar_words = wv_sg_model.wv.most_similar('man')
pprint.pprint(similar_words)

[('woman', 0.7723897099494934),
 ('soldier', 0.7161628603935242),
 ('guy', 0.715850293636322),
 ('rabbi', 0.6964094042778015),
 ('gentleman', 0.6705251932144165),
 ('son', 0.6702896356582642),
 ('testament', 0.669889509677887),
 ('imam', 0.6672779321670532),
 ('lady', 0.6609430313110352),
 ('pianist', 0.6547991633415222)]
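
The two models above differ only in the `sg` flag. As a rough illustration of what that flag changes, the sketch below builds the training examples each architecture would see for one sentence. It is a simplification under stated assumptions: the helper names are hypothetical, and gensim's actual training also involves details such as subsampling and negative sampling that are not shown here.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) for every position in a tokenized sentence."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield center, context


def cbow_examples(tokens, window=2):
    """CBOW: the whole context predicts the center word."""
    return [(context, center) for center, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip Gram: the center word predicts each context word separately."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]


tokens = ['facit', 'was', 'a', 'fantastic', 'company']
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW produces one example per position, while Skip Gram produces one example per (center, context word) pair.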


## Word2Vec vs FastText

Let's find out the difference between Word2Vec and FastText.

The Word2Vec Skip Gram model cannot return similar words for 'electrofishing' because the word is not in its vocabulary, so the lookup below raises a KeyError.

In [None]:
similar_words = wv_sg_model.wv.most_similar('electrofishing')
pprint.pprint(similar_words)

KeyError: ignored
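
If the error should be avoided rather than shown, one option is to check the vocabulary first. The sketch below assumes gensim 4.x, where a `KeyedVectors` object exposes the vocabulary as the `key_to_index` mapping; the helper name `most_similar_or_none` is hypothetical.

```python
def most_similar_or_none(keyed_vectors, word, topn=10):
    """Return most_similar results, or None if the word is out of vocabulary."""
    # key_to_index maps each in-vocabulary word to its row index (gensim 4.x)
    if word not in keyed_vectors.key_to_index:
        return None
    return keyed_vectors.most_similar(word, topn=topn)
```

With the model trained above, `most_similar_or_none(wv_sg_model.wv, 'electrofishing')` would return `None` instead of raising.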

### FastText - Skip Gram

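FastText builds word vectors from character n-grams, so it can still return neighbours for 'electrofishing' even though the word itself is not in the vocabulary.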

In [None]:
from gensim.models import FastText

In [None]:
ft_sg_model = FastText(sentences, vector_size=100, window=5, min_count=5, workers=4, sg=1)

In [None]:
result = ft_sg_model.wv.most_similar('electrofishing')
pprint.pprint(result)

[('electrolux', 0.87488853931427),
 ('electrolyte', 0.8718639016151428),
 ('electroshock', 0.854791522026062),
 ('electro', 0.8501136302947998),
 ('electroencephalogram', 0.8373817205429077),
 ('electrochemical', 0.828506350517273),
 ('electrogram', 0.8282065391540527),
 ('airbus', 0.8261869549751282),
 ('airbag', 0.8261586427688599),
 ('electron', 0.8198568224906921)]
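
Most of the neighbours above share the prefix 'electro'. A likely reason is that FastText represents a word through its character n-grams. The sketch below lists the n-grams two words have in common. It assumes the word is wrapped in `<` and `>` boundary markers and uses n-gram lengths 3 to 6 (gensim's default `min_n` and `max_n`); the function name `char_ngrams` is hypothetical, and the real implementation additionally hashes n-grams into buckets, which is not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    wrapped = f'<{word}>'
    return {wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)}

shared = char_ngrams('electrofishing') & char_ngrams('electron')
print(sorted(shared))
```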


### FastText - Continuous Bag-Of-Words (CBOW)

In [None]:
ft_cbow_model = FastText(sentences, vector_size=100, window=5, min_count=5, workers=4, sg=0)

In [None]:
result = ft_cbow_model.wv.most_similar('electrofishing')
pprint.pprint(result)

[('fishing', 0.9164567589759827),
 ('flushing', 0.9054871797561646),
 ('licensing', 0.9021921157836914),
 ('refreshing', 0.8997875452041626),
 ('smashing', 0.8992639183998108),
 ('flourishing', 0.8991420269012451),
 ('flashing', 0.8969458341598511),
 ('vanishing', 0.8963451385498047),
 ('transplanting', 0.895754337310791),
 ('recycling', 0.8953123688697815)]


## King + Woman - Man = ?

Try both the CBOW and Skip Gram models to calculate 'King - Man + Woman = ?'

In [None]:
result = wv_cbow_model.wv.most_similar(positive=['king' , 'woman'], negative=['man'], topn=1)
print(result)

[('president', 0.7923591732978821)]


In [None]:
result = wv_sg_model.wv.most_similar(positive=['king' , 'woman'], negative=['man'], topn=1)
print(result)

[('queen', 0.6769269108772278)]


In [None]:
result = ft_cbow_model.wv.most_similar(positive=['king' , 'woman'], negative=['man'], topn=1)
print(result)

[('kidding', 0.8963255286216736)]


In [None]:
result = ft_sg_model.wv.most_similar(positive=['king' , 'woman'], negative=['man'], topn=1)
print(result)

[('hawking', 0.6837846636772156)]


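Only the Word2Vec Skip Gram model answers 'queen'; the other three models return unrelated words ('president', 'kidding', 'hawking'). This is not what we expected. The corpus is probably too small for the analogy to hold reliably.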
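
The queries above can be read as vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on hand-made 2-d vectors. Everything in it is an assumption for illustration: the `toy_vectors` values are invented, the helper names are hypothetical, and gensim's `most_similar` works on normalized vectors rather than the raw sums used here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves, as most_similar does
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    'king': [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man': [0.0, 1.0],
    'woman': [0.0, -1.0],
    'president': [0.8, 0.6],
}
best = analogy(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(best)
```

In this toy space the answer is clean because the axes were chosen by hand; trained embeddings only approximate such structure, which is consistent with the mixed results above.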


# Play with Colab Form Fields
**The Form** supports multiple types of fields, including **input fields** and **dropdown menus**.

You can edit this section by double-clicking it.

Let's get familiar with forms: change the value in each input field (on the right) and watch the code (on the left) update, and vice versa.

In [None]:
# @title Example form fields
# @markdown please put description

string = 'examples'  # @param {type: 'string'}
slider_value = 143  # @param {type: 'slider', min: 100, max: 200}
number = 10253  # @param {type: 'number'}
date = '2020-01-05'  # @param {type: 'date'}
pick_me = 'tuesday'  # @param ['monday', 'tuesday', 'wednesday', 'thursday']
select_or_input = 'apples' # @param ['apples', 'bananas', 'oranges'] {allow-input: true}


# print the output
print('string is',string)
print('slider_value',slider_value)

string is examples
slider_value 143


# Exercise
In this exercise, you need to implement a **'Word Algebra Calculator' interface** using Word2Vec and FastText models trained on the provided TED scripts. The interface can be built with the Colab Form Fields introduced above.

Through the interface, users can:


1.   Enter a word formula in the text field, e.g. King - Man + Woman
2.   Select the word embedding model from a dropdown menu, either Word2Vec or FastText
3.   Select the training architecture from a dropdown menu, either CBOW or Skip Gram
4.   Get (print out) the resulting word for the input formula by running the form (the same as running the code section)



Note:
1. Please **do not** put the training process into your form section.
2. Please make your interface 'user-friendly' and instructional, e.g. by adding a proper explanation or guide
3. We will test your interface with formulas such as 'Word1 + Word2 + Word3 - Word4'; the number of words and the sign between each pair of words can vary.
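
As a starting point for handling formulas of varying length, here is one possible way to split a formula into positive and negative word lists. It is a sketch, not the expected solution: the function name `parse_formula` is hypothetical, words are assumed to consist of letters and digits only, and words are lowercased because the training text was lowercased during preprocessing.

```python
import re

def parse_formula(formula):
    """Split a formula such as 'King - Man + Woman' into (positive, negative) word lists."""
    positive, negative = [], []
    # Each match is an optional sign followed by a word; a missing sign counts as '+'
    for sign, word in re.findall(r'([+-]?)\s*([A-Za-z0-9]+)', formula):
        (negative if sign == '-' else positive).append(word.lower())
    return positive, negative

print(parse_formula('King - Man + Woman'))
```

The two lists can then be passed as the `positive` and `negative` arguments of `most_similar`, as in the cells above.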



## 1. Build your word embedding models

In [None]:
## Please generate four different types of word embedding models with TED data
## The parameter for all four models: vector_size=100, window=5, min_count=5, workers=4.



## 2. Build your interface

You can edit the following form elements to build your interface

In [None]:
# @title Word Algebra Calculator

# @markdown Please select the model and formula to calculate the word algebra

# Get the input


# @markdown Now you can activate the Calculator by running this section

## 1. choose the corresponding model


## 2. process the formula to extract the positive and negative word lists


## 3. calculate the most similar word for the formula using the selected model


## 4. print out the most similar word
