<a href="https://colab.research.google.com/github/kobemawu/www/blob/master/Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing and calculating similarity

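The goal of this notebook is to learn how to compute document similarity yourself.  
In the end, we use Wikipedia data to measure the similarity between countries  
and find the countries most similar to Japan.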

In [0]:
# Install the required packages
!pip install nltk
!pip install gensim



In [0]:
import nltk
import numpy as np
import pandas as pd

In [0]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 1. Calculate similarity

Consider the following three documents:  
Doc A : "I like apples and a strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."  
Doc C : "I play basketball every day. I like Michael Jordan."  

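Doc A and Doc B look similar, while Doc C does not look similar to either of them.  
We confirm this by computing similarity scores.

There are several ways to compute similarity:

- Set-based similarity
  - Jaccard coefficient
  - Dice coefficient
  - Simpson coefficient
- Vector-based similarity
  - Euclidean distance
  - Cosine similarity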


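### Set-based

Convert each document into a set of words.  
Because it is a set, duplicate words are dropped.  
Preprocessing is skipped for now (the words are only lowercased and stripped of punctuation).   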

Doc A : "I like apples and a strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."  
Doc C : "I play basketball every day. I like Michael Jordan."  
↓    
Set A : {'a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'}  
Set B : {'an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'}  
Set C : {'basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'}  

We treat these sets as representing the features of each document.  
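
The conversion from a document to a word set can be sketched in a few lines. This is a minimal sketch, assuming that lowercasing, removing periods and commas, and splitting on whitespace is all that is needed. The helper name `to_word_set` is introduced here for illustration only and is not used elsewhere in this notebook.

```python
import re

def to_word_set(doc):
    """Lowercase, drop periods/commas, split on whitespace, and deduplicate."""
    cleaned = re.sub(r"[.,]", "", doc.lower())
    return set(cleaned.split())

doc_a = "I like apples and a strawberries. I will buy an apple tomorrow. "
doc_c = "I play basketball every day. I like Michael Jordan."

set_a = to_word_set(doc_a)
set_c = to_word_set(doc_c)
print(sorted(set_a))
print(sorted(set_c))
```

Sorting is only for display; the sets themselves are unordered.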


#### Jaccard coefficient
The Jaccard coefficient is a similarity measure defined for two sets A and B.  
It is computed as follows:

\begin{equation}
J(A,B)=\dfrac{|A\cap B|}{|A \cup B|}
\end{equation}

The larger the share of common elements, the more similar the two documents are considered to be.

In [0]:
def jaccard_similarity(set_a, set_b):
  # Number of elements in the intersection
  num_intersection = len(set.intersection(set_a, set_b))
  # Number of elements in the union
  num_union = len(set.union(set_a, set_b))
  # Jaccard coefficient; returns 1.0 when both sets are empty
  try:
    return float(num_intersection) / num_union
  except ZeroDivisionError:
    return 1.0

In [0]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("jaccard(a, b) = ", jaccard_similarity(set_a, set_b))  # compute the Jaccard coefficient
print("jaccard(a, c) = ", jaccard_similarity(set_a, set_c))
print("jaccard(b, c) = ", jaccard_similarity(set_b, set_c))

jaccard(a, b) =  0.5714285714285714
jaccard(a, c) =  0.11764705882352941
jaccard(b, c) =  0.05555555555555555



NLTK also implements this measure.  
It follows the same definition, so the inputs are sets.  
Note that it returns a distance, not a similarity.

In [0]:
from nltk.metrics import jaccard_distance

set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

# This is the Jaccard distance, so subtract it from 1 to get the similarity
print("jaccard(a, b) = ", 1 - jaccard_distance(set_a, set_b))
print("jaccard(a, c) = ", 1 - jaccard_distance(set_a, set_c))
print("jaccard(b, c) = ", 1 - jaccard_distance(set_b, set_c))


jaccard(a, b) =  0.5714285714285714
jaccard(a, c) =  0.11764705882352944
jaccard(b, c) =  0.05555555555555558


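#### Sørensen-Dice coefficient

The Jaccard coefficient uses the size of the union as its denominator,  
so when one set is much larger than the other, the coefficient stays small even if the overlap is large.  
The Sørensen-Dice coefficient mitigates this effect by using the average size of the two sets as the denominator.  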

$
DSC(A,B) = \dfrac{|A\cap B|}{\dfrac{|A| + |B|}{2}} = \dfrac{2|A\cap B|}{|A| + |B|}
$

In [0]:
def dice_similarity(set_a, set_b):
  num_intersection =  len(set.intersection(set_a, set_b))
  sum_nums = len(set_a) + len(set_b)
  try:
    return 2 * num_intersection / sum_nums
  except ZeroDivisionError:
    return 1.0 

In [0]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("dice(a, b) = ", dice_similarity(set_a, set_b))
print("dice(a, c) = ", dice_similarity(set_a, set_c))
print("dice(b, c) = ", dice_similarity(set_b, set_c))

dice(a, b) =  0.7272727272727273
dice(a, c) =  0.21052631578947367
dice(b, c) =  0.10526315789473684


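#### Szymkiewicz-Simpson coefficient

The Szymkiewicz-Simpson coefficient (overlap coefficient) divides by the size of the smaller set, which minimizes the influence of elements outside the intersection.    
$
overlap(A,B) = \dfrac{|A\cap B|}{\min(|A|, |B|)}
$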



In [0]:
def simpson_similarity(list_a, list_b):
  # Accepts lists or sets; duplicates are removed by converting to sets
  set_a, set_b = set(list_a), set(list_b)
  num_intersection = len(set_a & set_b)
  min_num = min(len(set_a), len(set_b))
  try:
    return num_intersection / min_num
  except ZeroDivisionError:
    # min_num is 0 only when one of the sets is empty; return 1.0 by convention
    return 1.0

In [0]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])

print("simpson(a, b) = ", simpson_similarity(set_a, set_b)) 
print("simpson(a, c) = ", simpson_similarity(set_a, set_c)) 
print("simpson(b, c) = ", simpson_similarity(set_b, set_c)) 

simpson(a, b) =  0.7272727272727273
simpson(a, c) =  0.25
simpson(b, c) =  0.125


#### Exercise 1
Build a variety of sets and compare the set-based methods.

In [0]:
set_a = set(['a', 'an', 'and', 'apple', 'apples', 'buy', 'i', 'like', 'strawberries', 'tomorrow', 'will'])
set_b = set(['an', 'and', 'apple', 'apples', 'bought', 'eat', 'i', 'some', 'strawberries', 'tomorrow', 'will'])
set_c = set(['basketball', 'day', 'every', 'i', 'jordan', 'like', 'michael', 'play'])
set_d = set() # try building a fairly large set here

print("jaccard similarity:")
print(jaccard_similarity(set_d, set_a))
print(jaccard_similarity(set_d, set_b))
print(jaccard_similarity(set_d, set_c))

print("dice similarity:")
print(dice_similarity(set_d, set_a))
print(dice_similarity(set_d, set_b))
print(dice_similarity(set_d, set_c))

print("simpson similarity:")
print(simpson_similarity(set_d, set_a))
print(simpson_similarity(set_d, set_b))
print(simpson_similarity(set_d, set_c))

jaccard similarity:
0.0
0.0
0.0
dice similarity:
0.0
0.0
0.0
simpson similarity:
1.0
1.0
1.0
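
As a starting point for this exercise, the sketch below compares the three coefficients when a small set is fully contained in a much larger one. The data is made up for illustration, and the short function names are local to this example. It assumes both sets are non-empty, so the empty-set handling of the functions above is left out.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

# Hypothetical data: a small set fully contained in a much larger one
small = set(range(5))
large = set(range(50))

scores = {f.__name__: f(small, large) for f in (jaccard, dice, overlap)}
for name, value in scores.items():
    print(f"{name:8s} {value:.3f}")
```

Jaccard is penalized most by the size difference, Dice somewhat less, and the overlap coefficient not at all, since the smaller set is entirely covered.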


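### Vector-based 


Represent each document as a vector and compute the similarity between vectors.  
There are many vectorization methods; here we use BoW (Bag of Words).  

BoW is one way of representing a text as a vector.  
If the vocabulary contains N words, we use an N-dimensional vector in which each dimension corresponds to one word.  
The value of each dimension is the number of times that word appears in the document.

Example:  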
Doc A : "I like apples and a strawberries. I will buy an apple tomorrow. "  
Doc B : "I bought some apples and strawberries. I will eat an apple tomorrow."  
Doc C : "I play basketball every day. I like Michael Jordan."  
↓  
We use the 19-word vocabulary below (the article 'a' in Doc A is left out); each dimension of a BoW vector counts the corresponding word.  
['an', 'and', 'apple', 'apples', 'basketball', 'bought', 'buy', 'day', 'eat', 'every', 'i', 'jordan', 'like', 'michael', 'play', 'some', 'strawberries', 'tomorrow', 'will']  
↓  
BoW A : [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]  
BoW B : [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]  
BoW C : [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]  

We treat these vectors as representing the features of each document.
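
The counting step can be sketched as follows. This is a minimal sketch, assuming the vocabulary is given up front as the 19-word list shown above and that words outside it are simply ignored. The helper name `to_bow` is introduced here for illustration.

```python
import re
from collections import Counter

# The 19-word vocabulary listed above
vocabulary = ['an', 'and', 'apple', 'apples', 'basketball', 'bought', 'buy', 'day',
              'eat', 'every', 'i', 'jordan', 'like', 'michael', 'play', 'some',
              'strawberries', 'tomorrow', 'will']

def to_bow(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc; other words are ignored."""
    counts = Counter(re.sub(r"[.,]", "", doc.lower()).split())
    return [counts[word] for word in vocabulary]

doc_a = "I like apples and a strawberries. I will buy an apple tomorrow. "
bow_a = to_bow(doc_a, vocabulary)
print(bow_a)
```

A `Counter` returns 0 for words it has not seen, so no special handling is needed for vocabulary words missing from the document.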


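#### Euclidean distance

Now that each document is represented as a vector,  
we can compute the Euclidean distance between documents.  
The smaller the distance, the more similar the documents are considered to be.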

\begin{equation}
d(v_1,v_2) =(\sum_{i=1}^n (v_{1i}-v_{2i})^2)^{\frac{1}{2}}
\end{equation}

In [0]:
def euclidean_distance(list_a, list_b):
  diff_vec = np.array(list_a) - np.array(list_b)
  return np.linalg.norm(diff_vec)

In [0]:
bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]  
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]  
bow_c = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]  

print("euclidean_distance(bow_a, bow_b) = ",euclidean_distance(bow_a, bow_b))
print("euclidean_distance(bow_a, bow_c) = ",euclidean_distance(bow_a, bow_c))
print("euclidean_distance(bow_b, bow_c) = ",euclidean_distance(bow_b, bow_c))

euclidean_distance(bow_a, bow_b) =  2.23606797749979
euclidean_distance(bow_a, bow_c) =  3.7416573867739413
euclidean_distance(bow_b, bow_c) =  4.123105625617661


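#### Minkowski distance

A generalization of the Euclidean distance.  
Changing the value of p yields different distances (p=1 is the Manhattan distance, p=2 is the Euclidean distance).  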

\begin{equation}
d(v_1,v_2) = (\sum_{i=1}^n |v_{1i}-v_{2i}|^p)^{\frac{1}{p}}
\end{equation}

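#### Exercise 2
Write a program that computes the Minkowski distance  
and compute the distances for p=1,2,3.

In [0]:
# Hint: look up np.linalg.norm
def minkowski_distance(list_a, list_b, p):
  raise NotImplementedError("Exercise 2: implement the Minkowski distance")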

In [0]:
# p=1
print(minkowski_distance(bow_a, bow_b, 1))
print(minkowski_distance(bow_a, bow_c, 1))
print(minkowski_distance(bow_b, bow_c, 1))

# p=2
print(minkowski_distance(bow_a, bow_b, 2))
print(minkowski_distance(bow_a, bow_c, 2))
print(minkowski_distance(bow_b, bow_c, 2))

# p=3
print(minkowski_distance(bow_a, bow_b, 3))
print(minkowski_distance(bow_a, bow_c, 3))
print(minkowski_distance(bow_b, bow_c, 3))


5.0
14.0
17.0
2.23606797749979
3.7416573867739413
4.123105625617661
1.7099759466766968
2.4101422641752297
2.571281590658235
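
One possible answer to Exercise 2 is sketched below. It follows the formula directly in plain Python rather than using NumPy; `np.linalg.norm(diff_vec, ord=p)` on the difference vector is an equivalent alternative in the spirit of the hint. The vectors are the BoW vectors defined earlier, repeated so the snippet runs on its own.

```python
def minkowski_distance(list_a, list_b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(list_a, list_b)) ** (1 / p)

bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]

for p in (1, 2, 3):
    print(p, minkowski_distance(bow_a, bow_b, p))
```

The two vectors differ by 1 in exactly five dimensions, so the distance is the p-th root of 5 for each p.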


#### Cosine similarity

Compute the similarity from the angle between the two vectors.  

\begin{equation}
similarity(A, B)=\cos(\theta)=\dfrac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}
\end{equation}


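#### Exercise 3
Write a program that computes the cosine similarity and run it.

In [0]:
# Hint: look up numpy.array
def cosine_similarity(list_a, list_b):
  # Sample answer (to be removed later)
  inner_prod = np.array(list_a).dot(np.array(list_b))
  norm_a = np.linalg.norm(list_a)
  norm_b = np.linalg.norm(list_b)
  # NumPy division by zero yields nan/inf instead of raising ZeroDivisionError,
  # so zero vectors are checked explicitly (1.0 is this notebook's convention)
  if norm_a == 0 or norm_b == 0:
    return 1.0
  return inner_prod / (norm_a * norm_b)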

In [0]:
bow_a = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1]
bow_b = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1]
bow_c = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0]

print("cosine_similarity(bow_a, bow_b) = ",cosine_similarity(bow_a, bow_b))
print("cosine_similarity(bow_a, bow_c) = ",cosine_similarity(bow_a, bow_c))
print("cosine_similarity(bow_b, bow_c) = ",cosine_similarity(bow_b, bow_c))

cosine_similarity(bow_a, bow_b) =  0.8153742483272114
cosine_similarity(bow_a, bow_c) =  0.41812100500354543
cosine_similarity(bow_b, bow_c) =  0.3223291856101521


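### Comparing set-based and vector-based methods

Set-based methods work well when each document is short.  
Once documents become reasonably long, vector-based methods are more useful.  
In exchange, the vocabulary grows and so does the computational cost.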
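
One concrete difference between the two families is that a set records only whether a word occurs, while a BoW vector also records how often. Repetition is common in longer documents, which is where this starts to matter. The sketch below uses made-up token lists, and its helper functions are local to this example.

```python
import math

def jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

def cosine(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Hypothetical token lists: same words, very different frequencies
tokens_x = ["apple"] * 9 + ["basketball"]
tokens_y = ["apple"] + ["basketball"] * 9

vocab = sorted(set(tokens_x) | set(tokens_y))
bow_x = [tokens_x.count(w) for w in vocab]
bow_y = [tokens_y.count(w) for w in vocab]

jaccard_xy = jaccard(set(tokens_x), set(tokens_y))
cosine_xy = cosine(bow_x, bow_y)
print(jaccard_xy, cosine_xy)
```

The Jaccard coefficient treats the two documents as identical, while the cosine similarity reflects that one is mostly about apples and the other mostly about basketball.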


### Exercise 4
Create a dataset of short texts and a dataset of long texts yourself,    
then compute and compare the Jaccard coefficient and the cosine similarity.

In [0]:
short_docs = []
long_docs = []

## 2. Preprocessing

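We were able to measure similarity using the overlap between sets and the distance or angle between vectors.  
If the sets or vectors do not capture the features of the documents well, the similarity cannot be measured well either.  
How we build sets or vectors from documents is therefore very important.  
 
With appropriate preprocessing, the similarity reflects the features of the documents.    
The second half focuses on vectorization.  

1. Cleaning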
2. Tokenize
3. Stemming
4. Remove stop words
5. Vectorize

### 2-1. Cleaning

The examples above used only clean text, but real data is much messier.   
Data taken from the Web may contain leftover HTML tags or stray symbols.  



In [0]:
documents=["I like apples and a strawberries. I will buy an apple tomorrow @Fresco.",
           "I bought some apples and strawberries. I will eat an apple <b>tomorrow.</b>",
           "I play basketball every day. I like Michael Jordan (born February 17, 1963)."]

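With only three documents we could clean them by hand,  
but with large amounts of data the cleaning has to be automated.  
Let's write a program that cleans the text.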

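#### Exercise 5

Write a program that cleans the text using regular expressions.

Reference: regular expressions (https://uxmilk.jp/41416)

In [0]:
import re

def cleaning_text(text):
    # Remove @
    pattern1 = '@'
    text = re.sub(pattern1, '', text)    
    # Remove <b> tags
    pattern2 = r''  # TODO: fill in the pattern
    text = re.sub(pattern2, '', text)    
    # Remove parentheses and their contents
    pattern3 = r''  # TODO: fill in the pattern
    text = re.sub(pattern3, '', text)
    return text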
  

for text in documents:
    print(cleaning_text(text))

I like apples and a strawberries. I will buy an apple tomorrow Fresco.
I bought some apples and strawberries. I will eat an apple tomorrow.
I play basketball every day. I like Michael Jordan .
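
One possible answer to Exercise 5 is sketched below. The patterns are one choice among several and are only meant to reproduce the output shown above for these three documents. They are not a general HTML cleaner.

```python
import re

def cleaning_text(text):
    text = re.sub(r"@", "", text)        # remove @
    text = re.sub(r"</?b>", "", text)    # remove <b> and </b> tags
    text = re.sub(r"\(.*?\)", "", text)  # remove parentheses and their contents
    return text

documents = ["I like apples and a strawberries. I will buy an apple tomorrow @Fresco.",
             "I bought some apples and strawberries. I will eat an apple <b>tomorrow.</b>",
             "I play basketball every day. I like Michael Jordan (born February 17, 1963)."]

cleaned = [cleaning_text(text) for text in documents]
for text in cleaned:
    print(text)
```

The non-greedy `.*?` stops at the first closing parenthesis, so two separate parenthesized spans on one line are removed individually instead of everything between them.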


#### Option 1

Try writing code that cleans the following text.


In [0]:
text = '<p><b>Natural language processing</b> (<b>NLP</b>) is a subfield of <a href="/wiki/Computer_science" title="Computer science">computer science</a>, <a href="/wiki/Information_engineering_(field)" title="Information engineering (field)">information engineering</a>, and <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of <a href="/wiki/Natural_language" title="Natural language">natural language</a> data.</p>'
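
A possible starting point is to strip everything that looks like a tag. This is a minimal regex-based sketch run on a shortened, made-up sample in the same style as the text above. The helper name `strip_html` is introduced here for illustration, and a real HTML parser is more robust on arbitrary pages.

```python
import re

def strip_html(text):
    """Remove anything that looks like an HTML tag, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Shortened, hypothetical sample in the same style as the text above
sample = ('<p><b>NLP</b> is a subfield of <a href="/wiki/Computer_science" '
          'title="Computer science">computer science</a>.</p>')
stripped = strip_html(sample)
print(stripped)
```

Attribute values such as `title="..."` disappear together with their tag, because the pattern removes everything between `<` and the next `>`.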


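### 2-2. Tokenize

The text is still a single string, so split it into words.  
For English, splitting on whitespace is enough, but Japanese takes more work because words are not separated by spaces.  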

In [0]:
def tokenize_text(text):
  text = re.sub('[.,]', '', text)
  return text.split()

for text in documents:
  text = cleaning_text(text)
  print(tokenize_text(text))

['I', 'like', 'apples', 'and', 'a', 'strawberries', 'I', 'will', 'buy', 'an', 'apple', 'tomorrow', 'Fresco']
['I', 'bought', 'some', 'apples', 'and', 'strawberries', 'I', 'will', 'eat', 'an', 'apple', 'tomorrow']
['I', 'play', 'basketball', 'every', 'day', 'I', 'like', 'Michael', 'Jordan']


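### 2-3. Stemming, Lemmatize

Words with the same meaning can appear in different forms.  
Counting them as different words is unnatural.  
After converting to lowercase,  
we map them to a common form with stemming or lemmatization.  
Here we use lemmatization only.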

In [0]:
from nltk.corpus import wordnet as wn  # needed for lemmatization

def lemmatize_word(word):
    # lowercase the word  example: Python => python
    word = word.lower()
    
    # lemmatize  example: cooked => cook
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

In [0]:
for text in documents:
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  print([lemmatize_word(word) for word in tokens])

['i', 'like', 'apple', 'and', 'a', 'strawberry', 'i', 'will', 'buy', 'an', 'apple', 'tomorrow', 'fresco']
['i', 'buy', 'some', 'apple', 'and', 'strawberry', 'i', 'will', 'eat', 'an', 'apple', 'tomorrow']
['i', 'play', 'basketball', 'every', 'day', 'i', 'like', 'michael', 'jordan']


The words were converted to their base form, for example strawberries → strawberry.

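### 2-4. Remove stop words

Articles, pronouns, prepositions and similar words such as a and the appear in almost any text, so they carry little information about a document.  
Such words are called stop words.  
NLTK provides a stop word list defined by experts, so we use it.  
Customize the stop word list yourself as needed.  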
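
Customizing the list can be as simple as taking the union with your own words. The sketch below uses a small hand-written list as a stand-in for an existing stop word list, and the extra words are hypothetical choices for illustration. Converting to a `set` also makes each membership test fast.

```python
# Small hand-written stand-in for an existing stop word list (hypothetical)
base_stopwords = ["i", "a", "an", "and", "will", "some"]

# Hypothetical additions: words judged uninformative for this particular corpus
custom_stopwords = set(base_stopwords) | {"tomorrow", "every"}

tokens = ["i", "play", "basketball", "every", "day", "i", "like", "michael", "jordan"]
filtered = [w for w in tokens if w not in custom_stopwords]
print(filtered)
```

Filtering with a list comprehension drops the stop words directly, so no placeholder values are left behind.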

In [0]:
#1 NLTK's stop word list
en_stop = nltk.corpus.stopwords.words('english')
print(en_stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [0]:
def remove_stopwords(word, stopwordset):
  if word in stopwordset:
    return None
  else:
    return word

In [0]:
for text in documents:
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  print([remove_stopwords(word, en_stop) for word in tokens])

[None, 'like', 'apple', None, None, 'strawberry', None, None, 'buy', None, 'apple', 'tomorrow', 'fresco']
[None, 'buy', None, 'apple', None, 'strawberry', None, None, 'eat', None, 'apple', 'tomorrow']
[None, 'play', 'basketball', 'every', 'day', None, 'like', 'michael', 'jordan']


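We stop here for now, but removing words is quite important.  
Other options include removing words with an extremely low frequency or keeping only verbs and nouns.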
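
Removing low-frequency words can be sketched with a document-frequency threshold. This is a minimal sketch, assuming a word counts as rare if it appears in fewer than `min_df` documents. The function name and the threshold are illustrative choices, not something this notebook prescribes.

```python
from collections import Counter

def remove_rare_words(docs, min_df=2):
    """Keep only words that occur in at least min_df documents."""
    # set(doc) makes each document count a word at most once
    df = Counter(word for doc in docs for word in set(doc))
    return [[w for w in doc if df[w] >= min_df] for doc in docs]

# Token lists of the kind produced by the preprocessing steps in this section
docs = [['like', 'apple', 'strawberry', 'buy', 'apple', 'tomorrow', 'fresco'],
        ['buy', 'apple', 'strawberry', 'eat', 'apple', 'tomorrow'],
        ['play', 'basketball', 'every', 'day', 'like', 'michael', 'jordan']]

filtered_docs = remove_rare_words(docs, min_df=2)
print(filtered_docs)
```

On a corpus this small the threshold removes almost everything from the third document, which shows why it should be tuned to the size of the data.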

In [0]:
def preprocessing_text(text):
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  tokens = [remove_stopwords(word, en_stop) for word in tokens]
  tokens = [word for word in tokens if word is not None]
  return tokens


preprocessed_docs = [preprocessing_text(text) for text in documents]
preprocessed_docs

[['like', 'apple', 'strawberry', 'buy', 'apple', 'tomorrow', 'fresco'],
 ['buy', 'apple', 'strawberry', 'eat', 'apple', 'tomorrow'],
 ['play', 'basketball', 'every', 'day', 'like', 'michael', 'jordan']]

### 2-5. Vectorize




#### BoW(Bag of Words)


BoW represents a text as a vector of word counts.  
Counting words by hand does not scale, so let's do the whole process in code.

In [0]:
def bow_vectorizer(docs):
  word2id = {}
  for doc in docs:
    for w in doc:
      if w not in word2id:
        word2id[w] = len(word2id)
        
  result_list = []
  for doc in docs:
    doc_vec = [0] * len(word2id)
    for w in doc:
      doc_vec[word2id[w]] += 1
    result_list.append(doc_vec)
  return result_list, word2id

In [0]:
bow_vec, word2id = bow_vectorizer(preprocessed_docs)
print(bow_vec)

[[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]


In [0]:
word2id.items()

dict_items([('like', 0), ('apple', 1), ('strawberry', 2), ('buy', 3), ('tomorrow', 4), ('fresco', 5), ('eat', 6), ('play', 7), ('basketball', 8), ('every', 9), ('day', 10), ('michael', 11), ('jordan', 12)])

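#### TF-IDF(Term Frequency - Inverse Document Frequency)

In BoW every word has the same weight, but the importance of a word varies.  
TF-IDF takes word importance into account.  

TF(t, d) = the frequency of term t in document d (its count divided by the number of terms in d)  
IDF(t) = the logarithm of the inverse of the fraction of documents in the collection D that contain term t, i.e. log(|D| / number of documents containing t)  

TF-IDF(t,d) = TF(t, d) * IDF(t)  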
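
As a worked example, take the first of the three preprocessed documents above, which has 7 tokens. 'fresco' occurs once in it and in no other document, while 'like' occurs once in it and in one other document. Assuming the natural logarithm and TF as a relative frequency:

\begin{equation}
\text{TF-IDF}(\text{fresco}, d_0) = \frac{1}{7} \cdot \ln\frac{3}{1} \approx 0.157, \qquad \text{TF-IDF}(\text{like}, d_0) = \frac{1}{7} \cdot \ln\frac{3}{2} \approx 0.058
\end{equation}

Both words occur once in the document, but the rarer one receives the larger weight.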

In [0]:
def tfidf_vectorizer(docs):
  def tf(word2id, doc):
    # Relative frequency of each vocabulary term within one document
    term_counts = np.zeros(len(word2id))
    for term in word2id.keys():
      term_counts[word2id[term]] = doc.count(term)
    tf_values = list(map(lambda x: x/sum(term_counts), term_counts))
    return tf_values
  
  def idf(word2id, docs):
    # log(number of documents / number of documents containing the term)
    idf = np.zeros(len(word2id))
    for term in word2id.keys():
      idf[word2id[term]] = np.log(len(docs) / sum([bool(term in doc) for doc in docs]))
    return idf
  
  # Assign an id to every word in order of first appearance
  word2id = {}
  for doc in docs:
    for w in doc:
      if w not in word2id:
        word2id[w] = len(word2id)
  
  # IDF does not depend on the individual document, so compute it once
  idf_values = idf(word2id, docs)
  return [[_tf*_idf for _tf, _idf in zip(tf(word2id, doc), idf_values)] for doc in docs], word2id
  

In [0]:
tfidf_vector, word2id = tfidf_vectorizer(preprocessed_docs)
print(tfidf_vector)
print(word2id.items())

[[0.05792358687259491, 0.11584717374518982, 0.05792358687259491, 0.05792358687259491, 0.05792358687259491, 0.15694461266687282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.13515503603605478, 0.06757751801802739, 0.06757751801802739, 0.06757751801802739, 0.0, 0.1831020481113516, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.05792358687259491, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.15694461266687282, 0.15694461266687282, 0.15694461266687282, 0.15694461266687282, 0.15694461266687282, 0.15694461266687282]]
dict_items([('like', 0), ('apple', 1), ('strawberry', 2), ('buy', 3), ('tomorrow', 4), ('fresco', 5), ('eat', 6), ('play', 7), ('basketball', 8), ('every', 9), ('day', 10), ('michael', 11), ('jordan', 12)])


### Exercise 6
Compute the cosine similarity with both BoW and TF-IDF.

tfidf
0-1 similarity: 0.4719198681637555
0-2 similarity: 0.03803869439363926
1-2 similarity: 0.0
bow
0-1 similarity: 0.8249579113843053
0-2 similarity: 0.1259881576697424
1-2 similarity: 0.0
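
One possible answer for the BoW half of this exercise is sketched below. To stay self-contained it uses a plain-Python cosine function and repeats the BoW vectors printed earlier. It assumes none of the vectors is all zeros. The same loop works unchanged on the TF-IDF vectors.

```python
import math
from itertools import combinations

def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# BoW vectors as printed by bow_vectorizer earlier in this section
bow_vec = [[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]

# Every unordered pair of documents
similarities = {(i, j): cosine_similarity(bow_vec[i], bow_vec[j])
                for i, j in combinations(range(len(bow_vec)), 2)}
for (i, j), value in similarities.items():
    print(f"{i}-{j} similarity: {value}")
```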


### Option 2
scikit-learn, nltk, and gensim each provide a function for computing TF-IDF.  
Try computing TF-IDF with each of them.

In [0]:
# scikit-learn tf-idf (to be removed later)


In [0]:
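# nltk tf-idf (to be removed later)
# `docs` must be a list of token lists and `word2id` the matching vocabulary;
# the output below appears to come from lowercased, whitespace-split documents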
collection = nltk.TextCollection(docs)
terms = list(set(collection))
nltk_vector = []
for doc in docs:
  tmp_vec = np.zeros(len(word2id))
  for term in word2id.keys():
    tmp_vec[word2id[term]] = collection.tf_idf(term, doc)
  nltk_vector.append(list(tmp_vec))
print(nltk_vector)

[[0.0, 0.033788759009013694, 0.033788759009013694, 0.033788759009013694, 0.0915510240556758, 0.033788759009013694, 0.033788759009013694, 0.0915510240556758, 0.033788759009013694, 0.033788759009013694, 0.033788759009013694, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.033788759009013694, 0.033788759009013694, 0.0, 0.033788759009013694, 0.033788759009013694, 0.0, 0.033788759009013694, 0.033788759009013694, 0.033788759009013694, 0.0915510240556758, 0.0915510240556758, 0.0915510240556758, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.04505167867868493, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.12206803207423442, 0.12206803207423442, 0.12206803207423442, 0.12206803207423442, 0.12206803207423442, 0.12206803207423442]]


In [0]:
# gensim tf-idf (to be removed later)
from gensim import corpora
from gensim import models

dictionary = corpora.Dictionary(docs)
print('=== word -> id dictionary ===')
print(dictionary.token2id)
print(word2id)

corpus = list(map(dictionary.doc2bow, docs))
test_model = models.TfidfModel(corpus)
corpus_tfidf = test_model[corpus]

print('=== result ===')
gensim_vector = []
for doc in corpus_tfidf:
  tmp_vec = [0] * len(word2id)
  for word in doc:
    key = dictionary[word[0]]
    tmp_vec[word2id[key]] = word[1]
  gensim_vector.append(tmp_vec)

print(gensim_vector)

=== word -> id dictionary ===
{'a': 0, 'an': 1, 'and': 2, 'apple': 3, 'apples': 4, 'buy': 5, 'i': 6, 'like': 7, 'strawberries.': 8, 'tomorrow.': 9, 'will': 10, 'bought': 11, 'eat': 12, 'some': 13, 'basketball': 14, 'day.': 15, 'every': 16, 'jordan.': 17, 'michael': 18, 'play': 19}
{'i': 0, 'like': 1, 'apples': 2, 'and': 3, 'a': 4, 'strawberries.': 5, 'will': 6, 'buy': 7, 'an': 8, 'apple': 9, 'tomorrow.': 10, 'bought': 11, 'some': 12, 'eat': 13, 'play': 14, 'basketball': 15, 'every': 16, 'day.': 17, 'michael': 18, 'jordan.': 19}
=== result ===
[[0, 0.20996682609546996, 0.20996682609546996, 0.20996682609546996, 0.5689074861149032, 0.20996682609546996, 0.20996682609546996, 0.5689074861149032, 0.20996682609546996, 0.20996682609546996, 0.20996682609546996, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0.185617413417644, 0.185617413417644, 0, 0.185617413417644, 0.185617413417644, 0, 0.185617413417644, 0.185617413417644, 0.185617413417644, 0.5029324775265576, 0.5029324775265576, 0.5029324775265576, 0, 0, 0, 0, 0, 0], [

## Exercise 7

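We prepared a dataset containing the abstracts of the Wikipedia articles on various countries.  
https://drive.google.com/open?id=1i7tekPQRKaAwg-ze3kv5IsufMW13LkLo  
Download this data and use it.  

Compute the cosine similarity and show the top 5 countries most similar to Japan.  
Come up with your own preprocessing.  
Note: the similarity values do not need to be very high.  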

In [0]:
df = pd.read_csv("./nlp_country.csv")
df

Unnamed: 0,Name,Abstract
0,Japan,Japan is an island country in East Asia. Locat...
1,United States,"The United States of America (USA), commonly k..."
2,England,England is a country that is part of the Unite...
3,China,"China, officially the People's Republic of Chi..."
4,India,"India, also known as the Republic of India,[19..."
5,Korea,Korea is a region in East Asia.[3] Since 1948 ...
6,Germany,"Germany, officially the Federal Republic of Ge..."
7,Russia,"Russia, or the Russian Federation[12], is a tr..."
8,France,"France, officially the French Republic, is a c..."
9,Italy,"Italy, officially the Italian Republic,[10][11..."


In [0]:
df.iloc[0]["Abstract"]

'Japan is an island country in East Asia. Located in the Pacific Ocean, it lies off the eastern coast of the Asian continent and stretches from the Sea of Okhotsk in the north to the East China Sea and the Philippine Sea in the south. The kanji that make up Japan\'s name mean \'sun origin\', and it is often called the "Land of the Rising Sun". Japan is a stratovolcanic archipelago consisting of about 6,852 islands. The four largest are Honshu, Hokkaido, Kyushu, and Shikoku, which make up about ninety-seven percent of Japan\'s land area and often are referred to as home islands. The country is divided into 47 prefectures in eight regions, with Hokkaido being the northernmost prefecture and Okinawa being the southernmost one. Japan is the 2nd most populous island country. The population of 127 million is the world\'s eleventh largest, of which 98.5% are ethnic Japanese. 90.7% of people live in cities, while 9.3% live in the countryside.[16] About 13.8 million people live in Tokyo,[17] th

In [0]:
# Sample answer (to be removed later)
def preprocessing_text(text):
  def cleaning_text(text):
    # Remove @ and %
    pattern1 = r'@|%'
    text = re.sub(pattern1, '', text)    
    # Remove citation markers such as [16]
    pattern2 = r'\[[0-9 ]*\]'
    text = re.sub(pattern2, '', text)    
    # Remove parenthesized lowercase text
    pattern3 = r'\([a-z ]*\)'
    text = re.sub(pattern3, '', text)    
    # Remove digits
    pattern4 = r'[0-9]'
    text = re.sub(pattern4, '', text)
    return text
  
  def tokenize_text(text):
    text = re.sub('[.,]', '', text)
    return text.split()

  def lemmatize_word(word):
    # lowercase the word  example: Python => python
    word = word.lower()
    
    # lemmatize  example: cooked => cook
    lemma = wn.morphy(word)
    if lemma is None:
      return word
    else:
      return lemma
    
  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  tokens = [remove_stopwords(word, en_stop) for word in tokens]
  tokens = [word for word in tokens if word is not None]
  return tokens
  
docs = df["Abstract"].values
pp_docs = [preprocessing_text(text) for text in docs]
tfidf_vector, word2id = tfidf_vectorizer(pp_docs)

In [0]:
word2id.items()

dict_items([('alibaba', 0), ('group', 1), ('holding', 2), ('limited', 3), ('chinese', 4), ('multinational', 5), ('conglomerate', 6), ('company', 7), ('specialize', 8), ('e-commerce', 9), ('retail', 10), ('internet', 11), ('technology', 12), ('found', 13), ('april', 14), ('provide', 15), ('consumer-to-consumer', 16), ('(cc)', 17), ('business-to-consumer', 18), ('(bc)', 19), ('business-to-business', 20), ('(bb)', 21), ('sales', 22), ('services', 23), ('via', 24), ('web', 25), ('portal', 26), ('well', 27), ('electronic', 28), ('payment', 29), ('shopping', 30), ('search', 31), ('engine', 32), ('cloud', 33), ('computing', 34), ('operate', 35), ('diverse', 36), ('array', 37), ('business', 38), ('around', 39), ('world', 40), ('numerous', 41), ('sector', 42), ('name', 43), ('one', 44), ("world's", 45), ('admire', 46), ('fortune', 47), ('closing', 48), ('time', 49), ('date', 50), ('initial', 51), ('public', 52), ('offering', 53), ('(ipo)', 54), ('–', 55), ('us$', 56), ('billion', 57), ('high', 

In [0]:
def calc_cosine(vector, vector_list):
  # Cosine similarity between `vector` and every vector in `vector_list`, keyed by index
  result = {}
  for i, x in enumerate(vector_list):
    result[i] = cosine_similarity(vector, x)
    
  return result

print("tfidf")
res = calc_cosine(tfidf_vector[0],tfidf_vector)
res

tfidf


{0: 1.0,
 1: 0.04945156965230687,
 2: 0.03550026859810149,
 3: 0.07494324927746153,
 4: 0.02200165046387345,
 5: 0.089213868005443,
 6: 0.04329186935344452,
 7: 0.04340970910393382,
 8: 0.050616794433693456,
 9: 0.05446867547327852,
 10: 0.03479541972998953,
 11: 0.03392463518350004,
 12: 0.038469390607195876,
 13: 0.05035814117836253,
 14: 0.06794378321355649,
 15: 0.029516361108928312}

In [0]:
sorted(res.items(), key=lambda x:x[1],reverse=True)

[(0, 1.0),
 (5, 0.089213868005443),
 (3, 0.07494324927746153),
 (14, 0.06794378321355649),
 (9, 0.05446867547327852),
 (8, 0.050616794433693456),
 (13, 0.05035814117836253),
 (1, 0.04945156965230687),
 (7, 0.04340970910393382),
 (6, 0.04329186935344452),
 (12, 0.038469390607195876),
 (2, 0.03550026859810149),
 (10, 0.03479541972998953),
 (11, 0.03392463518350004),
 (15, 0.029516361108928312),
 (4, 0.02200165046387345)]
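
The ranking above is keyed by row index. To report country names, the indices can be mapped back to the `Name` column. The sketch below uses made-up names and scores as stand-ins for `df["Name"]` and `res`. The helper `top_k_similar` is introduced here for illustration.

```python
# Hypothetical stand-ins for df["Name"] and the similarity dictionary `res`
names = ["Japan", "Country A", "Country B", "Country C", "Country D", "Country E", "Country F"]
res = {0: 1.0, 1: 0.05, 2: 0.09, 3: 0.02, 4: 0.07, 5: 0.03, 6: 0.08}

def top_k_similar(res, names, query_index=0, k=5):
    """Return the k (name, score) pairs most similar to the query, excluding the query itself."""
    ranked = sorted(((i, s) for i, s in res.items() if i != query_index),
                    key=lambda item: item[1], reverse=True)
    return [(names[i], s) for i, s in ranked[:k]]

top5 = top_k_similar(res, names)
for name, score in top5:
    print(name, score)
```

Excluding the query index keeps Japan itself, whose similarity is always 1.0, out of the top 5.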

## Option 3

### Word2Vec & Doc2Vec

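With Word2Vec and Doc2Vec, arithmetic on the vectors behaves as if it captured word meaning,  
for example King - Man + Woman ≈ Queen.  
See the lecture slides for details.   

Pretrained word2vec models are available on GitHub ( https://github.com/Kyubyong/wordvectors ),  
so try computing the similarity between Japan and other countries.  
The vectors can be added and subtracted, so try that as well.  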
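
The mechanics of such vector arithmetic can be sketched without any trained model. The example below uses hand-made 2-dimensional toy vectors, not real Word2Vec embeddings, and the helper `analogy` is introduced here for illustration only. With a pretrained model the same idea applies to vectors with hundreds of dimensions.

```python
import math

# Hand-made 2-dimensional toy vectors (royalty, gender); NOT trained embeddings
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the input words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result)
```

The input words are excluded from the candidates because the nearest vector to the target is otherwise often one of the inputs.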

Reference: "BOKU"のITな日常 (https://arakan-pgm-ai.hatenablog.com/entry/2019/02/08/090000)  