<a href="https://colab.research.google.com/github/kobemawu/www/blob/master/Similarity_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing and similarity calculation
This notebook walks you through calculating the similarity of documents yourself.  
At the end, you will calculate the similarity between countries using Wikipedia data and find the countries most similar to Japan.

In [None]:
# install necessary dependencies
!pip install nltk
!pip install gensim



In [None]:
import nltk
import numpy as np
import pandas as pd

In [None]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 1. Similarity calculation
First, let us consider the following three documents:  
Doc A : "I like apples and strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."   
Doc C : "I play basketball every day. I like Michael Jordan."   


Doc A seems similar to Doc B while Doc C seems different from Doc A or Doc B.  
Let us confirm our guess through similarity calculation.

There are several methods to calculate the similarity:  
* Set-based similarity
 * Jaccard index
 * Sørensen–Dice coefficient
 * Overlap coefficient  (Szymkiewicz–Simpson coefficient)
* Vector-based similarity
 * Euclidean distance
 * Cosine similarity


### Set-based similarity
Convert the words of each document to a set.   
Duplicate words are eliminated because a set keeps each element only once.   
Here is an example; preprocessing is skipped.   

Doc A : "I like apples and strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."  
Doc C : "I play basketball every day. I like Michael Jordan."  
↓    
Set A : {'a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'}  
Set B : {'an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'}  
Set C : {'basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'}  
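
The conversion above can be sketched in a few lines. This is a minimal illustration, not the notebook's preprocessing pipeline: the helper name `text_to_word_set` is hypothetical, and the sketch simply lowercases the text and keeps runs of letters. Applied to Doc C, it reproduces Set C.

```python
import re

def text_to_word_set(text):
    # Lowercase the text and keep only runs of letters, so "Jordan." becomes "jordan".
    words = re.findall(r"[a-z]+", text.lower())
    # A set keeps each distinct word once, so the repeated "i" appears a single time.
    return set(words)

doc_c = "I play basketball every day. I like Michael Jordan."
set_c_sketch = text_to_word_set(doc_c)
print(sorted(set_c_sketch))
```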

Let us consider how to represent the features of the above word sets.



### Jaccard index
The Jaccard index measures the similarity of two sets A and B as the size of their intersection divided by the size of their union:   

\begin{equation}
J(A,B)=\dfrac{|A\cap B|}{|A \cup B|}
\end{equation}

The larger the intersection relative to the union, the more similar the two documents.

In [None]:
def jaccard_similarity(set_a,set_b):
  # calculate the intersection
  num_intersection = len(set.intersection(set_a, set_b))
  # calculate the union
  num_union = len(set.union(set_a, set_b))
  # calculate the Jaccard index; return 1.0 if both sets are empty
  try:
      return float(num_intersection) / num_union
  except ZeroDivisionError:
      return 1.0 

In [None]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("jaccard(a, b) = ", jaccard_similarity(set_a, set_b)) # Get the Jaccard index of the two sets
print("jaccard(a, c) = ", jaccard_similarity(set_a, set_c))
print("jaccard(b, c) = ", jaccard_similarity(set_b, set_c))

jaccard(a, b) =  0.5714285714285714
jaccard(a, c) =  0.11764705882352941
jaccard(b, c) =  0.05555555555555555
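
To see where the value for A and B comes from, the intersection and union can be inspected directly. The sketch below uses Python's `&` and `|` set operators on the same two sets; the variable names `common` and `either` are introduced only for this illustration.

```python
set_a = {'a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'}
set_b = {'an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'}

common = set_a & set_b   # words that appear in both sets
either = set_a | set_b   # words that appear in at least one set

print(sorted(common))
print(len(common), "shared words out of", len(either), "distinct words")
print(len(common) / len(either))
```

Eight shared words out of fourteen distinct words gives 8 / 14, which is the value printed for `jaccard(a, b)`.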


The Jaccard index is also implemented in the nltk package.  
The function takes two sets as input and follows the definition above.  
Note that it returns the distance, not the similarity.

In [None]:
from nltk.metrics import jaccard_distance

set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

# Note: nltk returns a distance, so subtract it from 1 to get the similarity.
print("jaccard(a, b) = ", 1 - jaccard_distance(set_a, set_b))
print("jaccard(a, c) = ", 1 - jaccard_distance(set_a, set_c))
print("jaccard(b, c) = ", 1 - jaccard_distance(set_b, set_c))


jaccard(a, b) =  0.5714285714285714
jaccard(a, c) =  0.11764705882352944
jaccard(b, c) =  0.05555555555555558


### Sørensen–Dice coefficient
A weakness of the Jaccard index is that when one set is much larger than the other, the value stays small no matter how large the intersection is.  
To mitigate this, the Sørensen–Dice coefficient uses the average size of the two sets as the denominator.  

\begin{equation}
DSC(A,B) = \dfrac{|A\cap B|}{\dfrac{|A| + |B|}{2}} = \dfrac{2|A\cap B|}{|A| + |B|}
\end{equation}

In [None]:
def dice_similarity(set_a, set_b):
  num_intersection =  len(set.intersection(set_a, set_b))
  sum_nums = len(set_a) + len(set_b)
  try:
    return 2 * num_intersection / sum_nums
  except ZeroDivisionError:
    return 1.0 

In [None]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("dice(a, b) = ", dice_similarity(set_a, set_b))
print("dice(a, c) = ", dice_similarity(set_a, set_c))
print("dice(b, c) = ", dice_similarity(set_b, set_c))

dice(a, b) =  0.7272727272727273
dice(a, c) =  0.21052631578947367
dice(b, c) =  0.10526315789473684


### Overlap coefficient
The overlap coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.

\begin{equation}
overlap(A,B) = \dfrac{|A\cap B|}{\min(|A|, |B|)}
\end{equation}

In [1]:
def simpson_similarity(set_a, set_b):
  num_intersection = len(set.intersection(set_a, set_b))
  min_num = min(len(set_a), len(set_b))
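  try:
    return num_intersection / min_num
  except ZeroDivisionError:
    # min_num is 0 only when at least one set is empty; the intersection is then
    # empty as well, and this function treats that case as full overlap
    return 1.0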

In [2]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("simpson(a, b) = ", simpson_similarity(set_a, set_b)) 
print("simpson(a, c) = ", simpson_similarity(set_a, set_c)) 
print("simpson(b, c) = ", simpson_similarity(set_b, set_c)) 

simpson(a, b) =  0.7272727272727273
simpson(a, c) =  0.25
simpson(b, c) =  0.125
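
The three coefficients differ most when the two sets have very different sizes. The sketch below is a hypothetical example, not part of the original data: a two-word set is fully contained in a twenty-word set padded with made-up filler words. The compact helper functions assume non-empty sets and are defined here only to keep the example self-contained.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

small = {"apple", "strawberry"}
# Hypothetical filler words: the large set has 20 words and contains the small set.
large = small | {f"word{i}" for i in range(18)}

print("jaccard =", jaccard(small, large))   # 2 / 20
print("dice    =", dice(small, large))      # 4 / 22
print("overlap =", overlap(small, large))   # 2 / 2
```

The Jaccard index is lowest, the Dice coefficient is somewhat higher, and the overlap coefficient reaches 1.0 because the smaller set is a subset of the larger one.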


#### Exercise 1
Create different kinds of sets and compare the results of the set-based similarity methods.

In [None]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])
set_d = set() # Try replacing this with a large set.

print("jaccard similarity:")
print(jaccard_similarity(set_d, set_a))
print(jaccard_similarity(set_d, set_b))
print(jaccard_similarity(set_d, set_c))

print("dice similarity:")
print(dice_similarity(set_d, set_a))
print(dice_similarity(set_d, set_b))
print(dice_similarity(set_d, set_c))

print("simpson similarity:")
print(simpson_similarity(set_d, set_a))
print(simpson_similarity(set_d, set_b))
print(simpson_similarity(set_d, set_c))

jaccard similarity:
0.0
0.0
0.0
dice similarity:
0.0
0.0
0.0
simpson similarity:
1.0
1.0
1.0


### Vector-based similarity
Calculate similarity on vectorized documents.  
There are many vectorization methods; we use BoW (Bag of Words) in this tutorial.  

BoW is one of the simplest document vectorization methods.  
Suppose the vocabulary (the set of distinct words across all documents) contains N words. Each document is then represented as an N-dimensional vector: each dimension corresponds to one word, and its value is the number of occurrences of that word in the document.  

For example:  
Doc A : "I like apples and strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."  
Doc C : "I play basketball every day. I like Michael Jordan."  
↓  
The vocabulary contains 19 distinct words. Therefore, each dimension of the BoW vector holds the number of occurrences of one of the following words.   
['an', 'and', 'apple', 'apples', 'basketball', 'bought', 'buy', 'day', 'eat', 'every', 'i', 'jordan', 'like', 'michael', 'play', 'some', 'strawberries', 'tomorrow', 'will']  
↓  
BoW A : [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]  
BoW B : [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]  
BoW C : [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]  

These vectors represent features of the documents.
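
The vectors above can be reproduced with a short sketch. It assumes a simple tokenizer (lowercase, letters only) and an alphabetically sorted vocabulary; the names `simple_tokens` and `bow_vector` are hypothetical helpers for this illustration, separate from the `bow_vectorizer` function defined later in the notebook.

```python
import re
from collections import Counter

docs = {
    "A": "I like apples and strawberries. I will buy an apple tomorrow. ",
    "B": "I bought some apples and strawberries. I will eat an apple tomorrow.",
    "C": "I play basketball every day. I like Michael Jordan.",
}

def simple_tokens(text):
    # Lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct word across all documents, in alphabetical order.
vocabulary = sorted({w for text in docs.values() for w in simple_tokens(text)})

def bow_vector(text):
    # Count each word, then read the counts off in vocabulary order (0 if absent).
    counts = Counter(simple_tokens(text))
    return [counts[w] for w in vocabulary]

print(len(vocabulary), vocabulary)
for name, text in docs.items():
    print("BoW", name, ":", bow_vector(text))
```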


#### Euclidean distance
We can calculate the Euclidean distance on the vectorized documents.  
The smaller the distance between two document vectors, the more similar the documents are.   

\begin{equation}
d(v_1,v_2) =(\sum_{i=1}^n (v_{1i}-v_{2i})^2)^{\frac{1}{2}}
\end{equation}

In [None]:
def euclidean_distance(list_a, list_b):
  diff_vec = np.array(list_a) - np.array(list_b)
  return np.linalg.norm(diff_vec)

In [None]:
bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]  
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]  
bow_c = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]  

print("euclidean_distance(bow_a, bow_b) = ",euclidean_distance(bow_a, bow_b))
print("euclidean_distance(bow_a, bow_c) = ",euclidean_distance(bow_a, bow_c))
print("euclidean_distance(bow_b, bow_c) = ",euclidean_distance(bow_b, bow_c))

euclidean_distance(bow_a, bow_b) =  2.23606797749979
euclidean_distance(bow_a, bow_c) =  3.7416573867739413
euclidean_distance(bow_b, bow_c) =  4.123105625617661


#### Minkowski distance
Minkowski distance is a generalization of Euclidean distance.  
Changing the value of ```p``` yields different distances: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

\begin{equation}
d(v_1,v_2) = (\sum_{i=1}^n |v_{1i}-v_{2i}|^p)^{\frac{1}{p}}
\end{equation}

#### Exercise 2
Try to implement a function to compute the Minkowski distance.  
Try it with different values of ```p``` (e.g., p = 1, 2, 3).

In [None]:
# please refer to np.linalg.norm
def minkowski_distance(list_a, list_b, p):
  #please complete the code. 
  pass

In [None]:
# p=1
print(minkowski_distance(bow_a, bow_b, 1))
print(minkowski_distance(bow_a, bow_c, 1))
print(minkowski_distance(bow_b, bow_c, 1))

# p=2
print(minkowski_distance(bow_a, bow_b, 2))
print(minkowski_distance(bow_a, bow_c, 2))
print(minkowski_distance(bow_b, bow_c, 2))

# p=3
print(minkowski_distance(bow_a, bow_b, 3))
print(minkowski_distance(bow_a, bow_c, 3))
print(minkowski_distance(bow_b, bow_c, 3))

#### Cosine similarity
Measure the similarity of two vectors by computing the cosine of the angle between them.  

\begin{equation}
similarity(A, B)=cos(\theta)=\dfrac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}
\end{equation}

#### Exercise 3
Try to implement a function to compute cosine similarity.

In [None]:
# please refer to numpy.array and np.linalg.norm
def cosine_similarity(list_a, list_b):
  #please complete the code.
  pass

In [None]:
bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]
bow_c = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]

print("cosine_similarity(bow_a, bow_b) = ",cosine_similarity(bow_a, bow_b))
print("cosine_similarity(bow_a, bow_c) = ",cosine_similarity(bow_a, bow_c))
print("cosine_similarity(bow_b, bow_c) = ",cosine_similarity(bow_b, bow_c))

cosine_similarity(bow_a, bow_b) =  0.8153742483272114
cosine_similarity(bow_a, bow_c) =  0.41812100500354543
cosine_similarity(bow_b, bow_c) =  0.3223291856101521
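
For reference, one possible way to obtain the values above is sketched below in plain Python: the dot product divided by the product of the two vector lengths. The name `cosine_similarity_sketch` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here, not something the exercise specifies.

```python
import math

def cosine_similarity_sketch(list_a, list_b):
    # Numerator: dot product of the two vectors.
    dot = sum(a * b for a, b in zip(list_a, list_b))
    # Denominator: product of the Euclidean lengths of the two vectors.
    norm_a = math.sqrt(sum(a * a for a in list_a))
    norm_b = math.sqrt(sum(b * b for b in list_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)

bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]
bow_c = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]

print(cosine_similarity_sketch(bow_a, bow_b))
print(cosine_similarity_sketch(bow_a, bow_c))
print(cosine_similarity_sketch(bow_b, bow_c))
```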


### Compare set-based and vector-based similarity calculation methods
Set-based similarity is cheaper to compute because it works only on small sets of distinct words.  
For longer documents, vector-based methods are more useful because they also take word counts into account.  
However, the cost of vector-based methods grows with the size of the corpus, since the vocabulary (and therefore the vector dimension) grows with it.

### Exercise 4
Create some short and long document datasets, and compare the results of the Jaccard index and cosine similarity on each. 


In [None]:
short_docs = []
long_docs = []

## 2. Preprocessing
Now we can calculate the similarity of documents from the overlap of sets or the angle between vectors.  
However, if the sets and vectors do not represent the features of the documents properly, the calculated similarity may not be accurate enough.  
Therefore, data preprocessing, such as converting documents to sets or vectors, is very important. 

Let us consider how to preprocess the documents properly for similarity calculation.  
We will then focus on document vectorization.  
1. Cleaning
2. Tokenize
3. Stemming
4. Remove stop words
5. Vectorize

### 2-1. Cleaning
Although the documents used in the similarity calculations above look clean, the original documents were messy and even included HTML tags and stray symbols.

In [None]:
documents=["I like apples and strawberries. I will buy an apple tomorrow @Fresco.",
           "I bought some apples and strawberries. I will eat an apple <b>tomorrow.</b>",
           "I play basketball every day. I like Michael Jordan (born February 17, 1963)."]

We can delete these stray characters manually because there are only three documents.  
However, when a large amount of data needs to be cleaned, a cleaning program is necessary. 

#### Exercise 5
Try to implement a function that uses regular expressions to clean up text data. 

Reference: <https://www.w3schools.com/python/python_regex.asp>

In [None]:
import re

def cleaning_text(text):
    # delete @
    pattern1 = '@'
    text = re.sub(pattern1, '', text)    
    # delete <b> tag; please complete the code
    pattern2 = #
    text = re.sub(pattern2, '', text)    
    # delete the content in the (); please complete the code
    pattern3 =  # 
    text = re.sub(pattern3, '', text)
    return text
  

for text in documents:
    print(cleaning_text(text))

I like apples and strawberries. I will buy an apple tomorrow Fresco.
I bought some apples and strawberries. I will eat an apple tomorrow.
I play basketball every day. I like Michael Jordan .
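
One possible set of patterns that produces the output above is sketched below. The function name `cleaning_text_sketch` is hypothetical, and the patterns are only one choice among many: `</?b>` matches both the opening and closing tag, and `\(.*?\)` matches a pair of parentheses and its content non-greedily.

```python
import re

def cleaning_text_sketch(text):
    text = re.sub(r'@', '', text)         # delete the @ symbol
    text = re.sub(r'</?b>', '', text)     # delete <b> and </b> tags, keeping their content
    text = re.sub(r'\(.*?\)', '', text)   # delete parentheses together with their content
    return text

samples = ["I like apples and strawberries. I will buy an apple tomorrow @Fresco.",
           "I bought some apples and strawberries. I will eat an apple <b>tomorrow.</b>",
           "I play basketball every day. I like Michael Jordan (born February 17, 1963)."]

cleaned = [cleaning_text_sketch(s) for s in samples]
for line in cleaned:
    print(line)
```

Note that removing the parenthesized text leaves a space before the final period in the third document, as in the output above.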



#### Option 1

Try to clean up the following text with your implemented cleaning function.

In [None]:
text = '<p><b>Natural language processing</b> (<b>NLP</b>) is a subfield of <a href="/wiki/Computer_science" title="Computer science">computer science</a>, <a href="/wiki/Information_engineering_(field)" title="Information engineering (field)">information engineering</a>, and <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of <a href="/wiki/Natural_language" title="Natural language">natural language</a> data.</p>'

### 2-2. Tokenize
After cleaning, you need to split each document string into words (tokens).  
English words can be separated at blank spaces, while Japanese, which is written without spaces between words, is more complicated to tokenize. 

In [None]:
def tokenize_text(text):
  text = re.sub('[.,]', '', text)
  return text.split()

for text in documents:
  text = cleaning_text(text)
  print(tokenize_text(text))

['I', 'like', 'apples', 'and', 'strawberries', 'I', 'will', 'buy', 'an', 'apple', 'tomorrow', 'Fresco']
['I', 'bought', 'some', 'apples', 'and', 'strawberries', 'I', 'will', 'eat', 'an', 'apple', 'tomorrow']
['I', 'play', 'basketball', 'every', 'day', 'I', 'like', 'Michael', 'Jordan']


### 2-3. Stemming, Lemmatize
The same word may appear in multiple forms, and it would be misleading to treat those forms as different words.  
So after converting to lowercase, we use **stemming** or **lemmatization** to reduce words to a uniform form.  
Here we only show lemmatization. 

In [None]:
from nltk.corpus import wordnet as wn  # WordNet provides morphy() for lemmatization

def lemmatize_word(word):
    # convert to lowercase, e.g., Python => python
    word = word.lower()

    # lemmatize, e.g., cooked => cook; morphy() returns None if it finds no base form
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    return lemma

In [None]:
for text in documents:
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  print([lemmatize_word(word) for word in tokens])

['i', 'like', 'apple', 'and', 'strawberry', 'i', 'will', 'buy', 'an', 'apple', 'tomorrow', 'fresco']
['i', 'buy', 'some', 'apple', 'and', 'strawberry', 'i', 'will', 'eat', 'an', 'apple', 'tomorrow']
['i', 'play', 'basketball', 'every', 'day', 'i', 'like', 'michael', 'jordan']


We can see that 'strawberries' is converted to its base form 'strawberry', and 'bought' to 'buy'.

### 2-4. Remove stop words
Many words, such as 'a' and 'the', contribute little to the meaning of a document.   
These words are called stop words.  
We can use the stop word list from the nltk package, which was compiled by specialists.  
You can define your own stop words if necessary. 
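
As a minimal sketch of a custom stop word list, the example below starts from a small hand-written set and adds one domain-specific word. Both sets are hypothetical and only for illustration; the same union works on `set(nltk.corpus.stopwords.words('english'))`.

```python
# Small hand-written base set, for illustration only.
base_stopwords = {"i", "a", "an", "the", "and", "will"}
# Hypothetical domain-specific addition.
custom_stopwords = base_stopwords | {"tomorrow"}

tokens = ['i', 'like', 'apple', 'and', 'strawberry', 'i', 'will', 'buy', 'an', 'apple', 'tomorrow']
# Keep only the tokens that are not in the stop word set.
filtered = [w for w in tokens if w not in custom_stopwords]
print(filtered)
```

Using a set rather than a list keeps each membership test fast, which matters once the corpus grows.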

In [None]:
en_stop = nltk.corpus.stopwords.words('english')
print(en_stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
def remove_stopwords(word, stopwordset):
  if word in stopwordset:
    return None
  else:
    return word

In [None]:
for text in documents:
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  print([remove_stopwords(word, en_stop) for word in tokens])

[None, 'like', 'apple', None, 'strawberry', None, None, 'buy', None, 'apple', 'tomorrow', 'fresco']
[None, 'buy', None, 'apple', None, 'strawberry', None, None, 'eat', None, 'apple', 'tomorrow']
[None, 'play', 'basketball', 'every', 'day', None, 'like', 'michael', 'jordan']


This tutorial only shows simple stop word removal.  
In real applications, you may also need to remove low-frequency words or filter words by part of speech (for example, determiners). 

In [None]:
def preprocessing_text(text):
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  tokens = [remove_stopwords(word, en_stop) for word in tokens]
  tokens = [word for word in tokens if word is not None]
  return tokens


preprocessed_docs = [preprocessing_text(text) for text in documents]
preprocessed_docs

[['like', 'apple', 'strawberry', 'buy', 'apple', 'tomorrow', 'fresco'],
 ['buy', 'apple', 'strawberry', 'eat', 'apple', 'tomorrow'],
 ['play', 'basketball', 'every', 'day', 'like', 'michael', 'jordan']]

### 2-5. Vectorize
#### BoW (Bag of Words)
Convert each text to a vector of word occurrence counts.  
Doing this by hand is impractical for anything but tiny corpora, so we write a function to handle it.




In [None]:
def bow_vectorizer(docs):
  word2id = {}
  for doc in docs:
    for w in doc:
      if w not in word2id:
        word2id[w] = len(word2id)
        
  result_list = []
  for doc in docs:
    doc_vec = [0] * len(word2id)
    for w in doc:
      doc_vec[word2id[w]] += 1
    result_list.append(doc_vec)
  return result_list, word2id

In [None]:
bow_vec, word2id = bow_vectorizer(preprocessed_docs)
print(bow_vec)

[[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]


In [None]:
word2id.items()

#### TF-IDF (Term Frequency - Inverse Document Frequency)
BoW gives every word the same weight, but words differ in importance.  

TF-IDF addresses this by weighting each word.  
TF(t, d) = the frequency of the word t in document d (its count divided by the number of words in d).   
IDF(t) = the inverse document frequency of t: the logarithm of the total number of documents divided by the number of documents that contain t.  

TF-IDF(t,d) = TF(t, d) * IDF(t)  
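
A small worked example may make the weighting concrete. The sketch below assumes the same variant as the `tfidf_vectorizer` function in this notebook: TF is the count divided by the document length, and IDF is the natural logarithm of (number of documents / number of documents containing the term), with no smoothing. Libraries often use slightly different variants, so their numbers may not match. The helper names `tf` and `idf` are defined here only for the illustration.

```python
import math

docs = [['like', 'apple', 'strawberry', 'buy', 'apple', 'tomorrow', 'fresco'],
        ['buy', 'apple', 'strawberry', 'eat', 'apple', 'tomorrow'],
        ['play', 'basketball', 'every', 'day', 'like', 'michael', 'jordan']]

def tf(term, doc):
    # Share of the document's words that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Rare terms (present in few documents) get a larger value.
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

for term in ('apple', 'fresco'):
    print(term, tf(term, docs[0]) * idf(term, docs))
```

In the first document, 'apple' occurs twice but also appears in another document, while 'fresco' occurs once and nowhere else, so 'fresco' ends up with the higher TF-IDF weight.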


In [None]:
def tfidf_vectorizer(docs):
  # build the vocabulary: word -> vector index
  word2id = {}
  for doc in docs:
    for w in doc:
      if w not in word2id:
        word2id[w] = len(word2id)

  def tf(doc):
    # term frequency: count of each term divided by the document length
    term_counts = np.zeros(len(word2id))
    for term, idx in word2id.items():
      term_counts[idx] = doc.count(term)
    return term_counts / term_counts.sum()

  def idf():
    # inverse document frequency: log(N / number of documents containing the term)
    values = np.zeros(len(word2id))
    for term, idx in word2id.items():
      values[idx] = np.log(len(docs) / sum(term in doc for doc in docs))
    return values

  idf_values = idf()  # computed once and reused for every document
  return [(tf(doc) * idf_values).tolist() for doc in docs], word2id
  

In [None]:
tfidf_vector, word2id = tfidf_vectorizer(preprocessed_docs)
print(tfidf_vector)
print(word2id.items())

#### Exercise 6
Calculate the cosine similarity of documents with BoW and TF-IDF.  

#### Option 2
TF-IDF is implemented in several packages, each with different parameters.  
Calculate TF-IDF with different packages (e.g., scikit-learn, nltk, gensim) and compare the results.

## Exercise 7
Here is a dataset that contains the Wikipedia abstracts of different countries.  
Please download it from the following URL:  
https://drive.google.com/open?id=1i7tekPQRKaAwg-ze3kv5IsufMW13LkLo   

Calculate the similarity between countries and find the top-5 countries most similar to Japan. 
Please note that you should do the data preprocessing yourself.   
P.S. It is fine if the similarity values are not very high. 

## Option 3

### Word2Vec & Doc2Vec
Word2Vec and Doc2Vec are popular models that represent words and documents as vectors capturing their meaning.  
For example: King - Man + Woman ≈ Queen   
Details can be found in the slides.
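
The analogy arithmetic can be illustrated without a trained model. The sketch below uses toy 2-D vectors invented purely for this example (the two axes are loosely "royalty" and "gender"); they are not real Word2Vec output. It computes King - Man + Woman and picks the remaining word whose vector has the highest cosine similarity to the result.

```python
import math

# Toy vectors invented for illustration: (royalty, gender). Not real Word2Vec output.
toy_vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# King - Man + Woman, computed component by component.
target = tuple(k - m + w for k, m, w in zip(toy_vectors["king"],
                                             toy_vectors["man"],
                                             toy_vectors["woman"]))

# Exclude the three input words, then pick the closest remaining word.
candidates = {w: v for w, v in toy_vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(target, "->", best)
```

With a real pre-trained model the vectors have hundreds of dimensions and the match is only approximate, but the arithmetic is the same.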

Pre-trained Word2Vec models are available at the following URL:  
https://github.com/Kyubyong/wordvectors 

Try to calculate the similarity between Japan and other countries.  
Try addition or subtraction between the two countries.  
Reference: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/