# 02_a. Distributional word representations

* 싸이그래머 / 어바웃 파이썬
* 김무성

# Contents
* TF-IDF
    - tf-idf with gensim
    - tf-idf with scikit-learn
* Distributional word representations

# TF-IDF
* tf-idf with gensim
* tf-idf with scikit-learn

#### vector space model

<img src="https://media.licdn.com/mpr/mpr/shrinknp_400_400/AAEAAQAAAAAAAAEDAAAAJGJlMzAwMzU0LTQ5MTEtNDk4Yi04MDVlLWMzN2MwMzZiNzQxMw.png" width=600 />

#### BOW (Bag of Words) 

<img src="https://image.slidesharecdn.com/mrkt451speciallecture1i-140307202701-phpapp01/95/introduction-to-text-mining-8-638.jpg?cb=1394224092" width=600 />

#### TF-IDF

<img src="http://images.slideplayer.com/16/5094063/slides/slide_2.jpg" width=600 />

<img src="http://www.bloter.net/wp-content/uploads/2016/09/td-idf-graphic-765x255.png" width=600 />

<img src="https://i.ytimg.com/vi/zvFGNpbAfEI/hqdefault.jpg" width=600 />

## tf-idf with gensim

To compute tf-idf with gensim, the list `documents` must first be converted into a gensim corpus: a list of bag-of-words vectors, one per document.

In [None]:
# Five toy documents
documents = [
    "a b c a",
    "c b c",
    "b b a",
    "a c c",
    "c b a",
]

In [None]:
from pprint import pprint

# Split each document into words (tokens)
texts = [doc.split() for doc in documents]

pprint(texts)

In [None]:
from gensim import corpora

# Dictionary object: assigns an integer id to each word (see the gensim tutorial for its other features)
dictionary = corpora.Dictionary(texts)

pprint(dictionary.token2id)

In [None]:
# Use the dictionary above to convert texts into a corpus.
# Each document becomes a list of (word id, count) tuples.
corpus = [dictionary.doc2bow(text) for text in texts]
pprint(corpus)

In [None]:
# Build a tf-idf model from the corpus
from gensim import models
tfidf_model = models.TfidfModel(corpus)

In [None]:
# Compute tf-idf for every document in the corpus
corpus_tfidf = tfidf_model[corpus]

for doc in corpus_tfidf:
    print(doc)
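
To see where these numbers come from, here is a minimal sketch that recomputes the weights by hand. It assumes gensim's default weighting (raw term frequency times log2(N / df), followed by L2 normalization of each document); the helper `tfidf_vector` is a hypothetical name, and it keys the result by word rather than by word id.

```python
import math
from collections import Counter

documents = ["a b c a", "c b c", "b b a", "a c c", "c b a"]
texts = [doc.split() for doc in documents]

n_docs = len(texts)
# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for text in texts for word in set(text))

def tfidf_vector(tokens):
    """Hypothetical helper: {word: weight} with tf * log2(N / df), L2-normalized."""
    counts = Counter(tokens)
    weights = {w: tf * math.log2(n_docs / doc_freq[w])
               for w, tf in counts.items() if w in doc_freq}
    # A word that occurs in every document has idf 0 and is dropped.
    weights = {w: x for w, x in weights.items() if x > 0.0}
    norm = math.sqrt(sum(x * x for x in weights.values()))
    return {w: x / norm for w, x in weights.items()} if norm else {}

manual_tfidf = [tfidf_vector(text) for text in texts]
for vec in manual_tfidf:
    print(sorted(vec.items()))
```

In this toy corpus every word appears in four of the five documents, so all idf values are equal and the normalized tf-idf vectors reduce to length-normalized counts.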

In [None]:
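# Find similar documents based on tf-idf
from gensim import similarities
# The first argument is the path prefix under which gensim stores its index shards
sims = similarities.Similarity('./', corpus_tfidf,
                               num_features=len(dictionary))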
print(sims)
print(type(sims))

In [None]:
# The query document must also be converted to tf-idf before comparison.
pprint(texts)

query_doc = ['c', 'c', 'c']
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tfidf_model[query_doc_bow]
print(query_doc_tf_idf)

sims[query_doc_tf_idf]
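
The similarity scores can be checked with a small pure-Python sketch. It assumes that the index compares documents by cosine similarity and that the tf-idf vectors are unit length (tf * log2(N / df), then L2 normalization); `unit_tfidf` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter

texts = [d.split() for d in ["a b c a", "c b c", "b b a", "a c c", "c b a"]]
doc_freq = Counter(w for t in texts for w in set(t))

def unit_tfidf(tokens):
    # tf * log2(N / df), L2-normalized (assumed to mirror gensim's defaults)
    vec = {w: tf * math.log2(len(texts) / doc_freq[w])
           for w, tf in Counter(tokens).items() if w in doc_freq}
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm else {}

def cosine_similarity(u, v):
    # Both vectors have unit length, so the cosine is just the dot product.
    return sum(x * v.get(w, 0.0) for w, x in u.items())

query = unit_tfidf(['c', 'c', 'c'])
scores = [cosine_similarity(query, unit_tfidf(t)) for t in texts]
print([round(s, 3) for s in scores])
```

The query contains only `c`, so each score is simply the normalized weight of `c` in that document; the document without `c` scores 0.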

### Exercise 1
#### 1) Convert these documents into tf-idf vectors.
```python
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
```
#### 2) Which of the documents above is most similar to the document below?
```python
query_doc = "Socks are a force for good."
```

### Exercise 2
#### 1) Convert the following documents into tf-idf vectors.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362

#### 2) Which of the documents above is most similar to the document below?
* https://gasazip.com/view.html?no=636135

## tf-idf with scikit-learn

scikit-learn's `feature_extraction` subpackage and its `feature_extraction.text` subpackage provide the following classes for document preprocessing.

* DictVectorizer:
    - Builds BOW vectors from dictionaries of word counts.
* CountVectorizer:
    - Counts the words in a document collection and builds BOW vectors.
* TfidfVectorizer:
    - Counts the words in a document collection and builds BOW vectors whose word weights are adjusted by TF-IDF.
* HashingVectorizer:
    - Uses the hashing trick to build BOW vectors quickly.
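
The hashing trick mentioned in the last item can be sketched in a few lines. This is a simplified illustration, not scikit-learn's implementation: the function `hashing_bow`, the use of `zlib.crc32`, and the tiny `n_features=8` are all assumptions made for the example.

```python
import zlib

def hashing_bow(tokens, n_features=8):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # crc32 is deterministic across runs (unlike the built-in hash for str).
        index = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[index] += 1
    return vec

print(hashing_bow("this is the first document".split()))
print(hashing_bow("something completely new".split()))
```

Because the column index comes straight from the hash, no fitting step is needed and unseen words still get a slot; the price is that different words can collide in the same column.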

In [None]:
# Build count vectors from a corpus
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',    
]
vect = CountVectorizer()
vect.fit(corpus)
vect.vocabulary_

In [None]:
vect.transform(['This is the second document.']).toarray()

In [None]:
vect.transform(['Something completely new.']).toarray()

In [None]:
vect.transform(corpus).toarray()
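
The behavior seen in the cells above (lowercasing, alphabetically ordered vocabulary, unseen words ignored) can be reproduced with a minimal sketch. It assumes CountVectorizer's default settings, including the default token pattern that keeps only tokens of two or more word characters; `tokenize` and `count_vector` are hypothetical helpers.

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]

# Default token pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# Vocabulary: each distinct token gets an index in alphabetical order.
vocabulary = {w: i for i, w in
              enumerate(sorted({w for doc in corpus for w in tokenize(doc)}))}

def count_vector(text):
    vec = [0] * len(vocabulary)
    for w in tokenize(text):
        if w in vocabulary:  # words not seen during fitting are ignored
            vec[vocabulary[w]] += 1
    return vec

print(vocabulary)
print(count_vector('This is the second document.'))
print(count_vector('Something completely new.'))
```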

In [None]:
# Apply stop words (words to ignore when building the vocabulary)
vect = CountVectorizer(stop_words=["and", "is", "the", "this"]).fit(corpus)
vect.vocabulary_

In [None]:
vect = CountVectorizer(stop_words="english").fit(corpus)
vect.vocabulary_

In [None]:
# Word-frequency analysis of a web document
import requests
from bs4 import BeautifulSoup
import json
import string
from konlpy.utils import pprint
from konlpy.tag import Hannanum
hannanum = Hannanum()

url = "https://gasazip.com/view.html?no=614736"
#url = "https://gasazip.com/view.html?no=636135"

In [None]:
# HTTP GET request
req = requests.get(url)
# Get the HTML source
html = req.text
# Convert the HTML source into a Python object with BeautifulSoup.
# The first argument is the HTML source; the second names the parser to use.
# Here we use Python's built-in html.parser.
soup = BeautifulSoup(html, 'html.parser')

In [None]:
lyrics = []
for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
    lines = txt.get_text().split('\n')
    for line in lines :
        lyrics.append(line.strip())

In [None]:
lyrics

In [None]:
docs = [w for w in hannanum.nouns(" ".join(lyrics)) if ((not w[0].isnumeric()) and (w[0] not in string.punctuation))]

In [None]:
docs

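Here each document consists of a single word. Processing this document set with CountVectorizer therefore turns each document into a vector in which exactly one element is 1 and all the others are 0. Summing these vectors gives the frequency of each word. (With CountVectorizer's default token pattern, one-character words are dropped, so such documents become all-zero vectors and are not counted.)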


In [None]:
import numpy as np
import matplotlib.pyplot as plt

vect = CountVectorizer().fit(docs)
count = vect.transform(docs).toarray().sum(axis=0)
idx = np.argsort(-count)
count = count[idx]
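# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_name = np.array(vect.get_feature_names_out())[idx]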
plt.bar(range(len(count)), count)
plt.show()

In [None]:
pprint(list(zip(feature_name, count)))
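
Summing one-hot vectors is the same as counting words directly. The sketch below shows the equivalence on a small made-up noun list (the list `nouns` is hypothetical and stands in for the scraped `docs`).

```python
from collections import Counter

# Hypothetical noun list standing in for `docs`
nouns = ["사랑", "하늘", "사랑", "바람", "하늘", "사랑"]

vocab = sorted(set(nouns))
# One one-hot vector per single-word document
one_hot = [[1 if w == v else 0 for v in vocab] for w in nouns]
# Column sums of the one-hot matrix give the frequency of each word
summed = [sum(col) for col in zip(*one_hot)]
freq_from_vectors = dict(zip(vocab, summed))

# The same frequencies, counted directly
freq_from_counter = Counter(nouns)
print(freq_from_counter.most_common())
```

For plain frequency counts `collections.Counter` is the shorter route; the vectorizer becomes useful once the same vocabulary has to be reused across several documents.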

In [None]:
# Build tf-idf vectors from a corpus
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',    
]

In [None]:
tfidv = TfidfVectorizer().fit(corpus)
corpus_tfidf = tfidv.transform(corpus).toarray()
corpus_tfidf

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(corpus_tfidf)
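
The following sketch recomputes the tf-idf matrix by hand. It assumes TfidfVectorizer's default settings: smoothed idf, ln((1 + n) / (1 + df)) + 1, applied to raw counts, followed by L2 normalization of each row. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)
doc_freq = Counter(w for toks in tokenized for w in set(toks))
vocab = sorted(doc_freq)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    row = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row

def cosine_sim(u, v):
    # Rows are unit length, so the cosine similarity is the dot product.
    return sum(a * b for a, b in zip(u, v))

manual = [tfidf_row(toks) for toks in tokenized]
print([[round(cosine_sim(u, v), 3) for v in manual] for u in manual])
```

Documents 1 and 4 contain the same words in a different order, so their BOW vectors are identical and their cosine similarity is 1.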

### Exercise 3 - Redo Exercise 2 with scikit-learn
#### 1) Convert the following documents into tf-idf vectors.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362

#### 2) Which of the documents above is most similar to the document below?
* https://gasazip.com/view.html?no=636135

# Distributional word representations
* Setup
* Distributional matrices
* Vector comparison
* Distributional neighbors
* Matrix reweighting

#### References
* [7] CS224U: Natural Language Understanding - https://web.stanford.edu/class/cs224u/
* [8] CS224U: Natural Language Understanding / Distributional word representations 
    - notebook -  http://nbviewer.jupyter.org/github/cgpotts/cs224u/blob/master/vsm.ipynb
    - Overview slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-overview.pdf
    - Vector comparison slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-veccompare.pdf
    - Reweighting slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-weighting.pdf

## Setup
* data - https://web.stanford.edu/class/cs224u/data/vsmdata.zip

In [None]:
!wget https://web.stanford.edu/class/cs224u/data/vsmdata.zip

In [None]:
!unzip vsmdata.zip

In [None]:
!ls vsmdata*

In [None]:
vsmdata_home = "vsmdata"

In [None]:
import os
import sys
import csv
import random
import itertools
from operator import itemgetter
from collections import defaultdict
import numpy as np
import scipy
import scipy.spatial.distance
from numpy.linalg import svd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import utils

## Distributional matrices
* Overview slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-overview.pdf

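The data distribution includes two pre-computed matrices of co-occurrence counts from IMDB movie reviews: a word × word matrix and a word × document matrix. The `build` function in this repository's `utils` module reads each file into a tuple whose first member is the count matrix and whose second member is the list of row names (the third holds the column names). Let's read both in now for use in later examples: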

In [None]:
ww = utils.build(os.path.join(vsmdata_home, 'imdb-wordword.csv'))
wd = utils.build(os.path.join(vsmdata_home, 'imdb-worddoc.csv'))

In [None]:
print(len(ww))

In [None]:
print(len(ww[0]), ww[0][:2])

In [None]:
print(len(ww[1]), ww[1][:10])

In [None]:
print(len(ww[2]), ww[2][:10])
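
To make the two matrix types concrete, here is a toy sketch that builds a word × word and a word × document count matrix from three made-up reviews. The reviews, the window size, and the counting scheme are assumptions for illustration only; how the IMDB matrices were actually preprocessed is not described here.

```python
# Hypothetical mini-corpus and context window
reviews = ["a superb film", "a superb cast", "a dull film"]
window = 2

tokenized = [r.split() for r in reviews]
vocab = sorted({w for toks in tokenized for w in toks})
index = {w: i for i, w in enumerate(vocab)}

# Word x word: count how often two words occur within `window` tokens of each other
ww_toy = [[0] * len(vocab) for _ in vocab]
for toks in tokenized:
    for i, w in enumerate(toks):
        lo, hi = max(0, i - window), min(len(toks), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                ww_toy[index[w]][index[toks[j]]] += 1

# Word x document: count how often each word occurs in each review
wd_toy = [[toks.count(w) for toks in tokenized] for w in vocab]

print(vocab)
print(ww_toy[index['superb']])
print(wd_toy[index['superb']])
```

Each row is the distributional representation of one word: its co-occurrence profile over context words in the first matrix, and over documents in the second.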

## Vector comparison
* Vector comparison slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-veccompare.pdf

#### Euclidean distance

<img src="https://render.githubusercontent.com/render/math?math=%5Csqrt%7B%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20%7Cu_%7Bi%7D-v_%7Bi%7D%7C%5E2%7D&mode=display"/>

In [None]:
def euclidean(u, v):    
    """Euclidean distance between 1d np.arrays `u` and `v`, which must 
    have the same dimensionality. Returns a float."""
    # Use scipy's method:
    return scipy.spatial.distance.euclidean(u, v)
    # Or define it yourself:
    # return vector_length(u - v)

#### vector length

<img src="https://render.githubusercontent.com/render/math?math=%5C%7Cu%5C%7C%20%3D%20%5Csqrt%7B%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20u_%7Bi%7D%5E%7B2%7D%7D&mode=display" />

In [None]:
def vector_length(u):
    """Length (L2 norm) of the 1d np.array `u`. Returns a float."""
    return np.sqrt(np.dot(u, u))

In [None]:
ABC = np.array([
    [ 2.0,  4.0],  # A
    [10.0, 15.0],  # B
    [14.0, 10.0]]) # C

def plot_ABC(m):
    plt.plot(m[:,0], m[:,1], marker='', linestyle='')
    plt.xlim([0,np.max(m)*1.2])
    plt.ylim([0,np.max(m)*1.2])
    for i, x in enumerate(['A','B','C']):
        plt.annotate(x, m[i,:])

plot_ABC(ABC)

In [None]:
euclidean(ABC[0], ABC[1])

In [None]:
euclidean(ABC[1], ABC[2])

#### Length normalization 

In [None]:
def length_norm(u):
    """Scale the 1d np.array `u` to unit L2 length. Returns a new 
    np.array with the same dimensions as `u`."""
    return u / vector_length(u)

In [None]:
plot_ABC(np.array([length_norm(row) for row in ABC]))

#### Cosine distance

<img src="https://render.githubusercontent.com/render/math?math=1%20-%20%5Cleft%28%5Cfrac%7B%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20u_%7Bi%7D%20%5Ccdot%20v_%7Bi%7D%7D%7B%5C%7Cu%5C%7C%5Ccdot%20%5C%7Cv%5C%7C%7D%5Cright%29&mode=display" />

In [None]:
def cosine(u, v):        
    """Cosine distance between 1d np.arrays `u` and `v`, which must have 
    the same dimensionality. Returns a float."""
    # Use scipy's method:
    return scipy.spatial.distance.cosine(u, v)
    # Or define it yourself:
    # return 1.0 - (np.dot(u, v) / (vector_length(u) * vector_length(v)))

In [None]:
for m in (euclidean, cosine):
    fmt = {'n': m.__name__,  
           'AB': m(ABC[0], ABC[1]), 
           'BC': m(ABC[1], ABC[2])}
    print('%(n)15s(A, B) = %(AB)5.2f %(n)15s(B, C) = %(BC)5.2f' % fmt)
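
The two measures disagree on this example: B is closer to C under Euclidean distance, but closer to A under cosine distance. The sketch below also checks a useful identity: for length-normalized vectors, half the squared Euclidean distance equals the cosine distance. The functions are pure-Python re-implementations written for this example, not the numpy versions defined above.

```python
import math

def vector_length(u):
    return math.sqrt(sum(x * x for x in u))

def length_norm(u):
    n = vector_length(u)
    return [x / n for x in u]

def euclidean(u, v):
    return vector_length([a - b for a, b in zip(u, v)])

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (vector_length(u) * vector_length(v))

A, B, C = [2.0, 4.0], [10.0, 15.0], [14.0, 10.0]

for name, (u, v) in {'AB': (A, B), 'BC': (B, C)}.items():
    # Euclidean distance after scaling both vectors to unit length
    d_norm = euclidean(length_norm(u), length_norm(v))
    print(name,
          'euclidean=%.2f' % euclidean(u, v),
          'cosine=%.4f' % cosine(u, v),
          'half squared normalized euclidean=%.4f' % (d_norm ** 2 / 2))
```

This is why length normalization followed by Euclidean distance ranks neighbors in the same order as cosine distance.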

## Distributional neighbors

In [None]:
def neighbors(word, mat, rownames, distfunc=cosine):    
    """Tool for finding the nearest neighbors of `word` in `mat` according 
    to `distfunc`. The comparisons are between row vectors.
    
    Parameters
    ----------
    word : str
        The anchor word. Assumed to be in `rownames`.
        
    mat : np.array
        The vector-space model.
        
    rownames : list of str
        The rownames of mat.
            
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean` 
        or any other distance measure between 1d vectors.
        
    Raises
    ------
    ValueError
        If word is not in rownames.
    
    Returns
    -------    
    list of tuples
        The list is ordered by closeness to `word`. Each member is a pair 
        (word, distance) where word is a str and distance is a float.
    
    """
    if word not in rownames:
        raise ValueError('%s is not in this VSM' % word)
    w = mat[rownames.index(word)]
    dists = [(rownames[i], distfunc(w, mat[i])) for i in range(len(mat))]
    return sorted(dists, key=itemgetter(1), reverse=False)

In [None]:
neighbors(word='superb', mat=ww[0], rownames=ww[1], distfunc=cosine)[: 5]

In [None]:
neighbors(word='superb', mat=ww[0], rownames=ww[1], distfunc=euclidean)[: 5]
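
`neighbors` calls `distfunc` once per row, which can be slow on a large matrix. A possible vectorized variant for cosine distance is sketched below; `neighbors_vectorized` is a hypothetical name, it supports cosine distance only, and it assumes that no row of the matrix is all zeros.

```python
import numpy as np

def neighbors_vectorized(word, mat, rownames):
    """Cosine-distance neighbors of `word`, computed for all rows at once."""
    if word not in rownames:
        raise ValueError('%s is not in this VSM' % word)
    mat = np.asarray(mat, dtype=float)
    i = rownames.index(word)
    # L2 length of every row (assumed non-zero)
    lengths = np.sqrt((mat ** 2).sum(axis=1))
    # Cosine distance between row i and every row
    dists = 1.0 - mat.dot(mat[i]) / (lengths * lengths[i])
    order = np.argsort(dists, kind='stable')
    return [(rownames[j], float(dists[j])) for j in order]

# Toy check on the A/B/C points used earlier
toy_mat = np.array([[2.0, 4.0], [10.0, 15.0], [14.0, 10.0]])
toy_names = ['A', 'B', 'C']
print(neighbors_vectorized('B', toy_mat, toy_names))
```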

## Matrix reweighting
* Reweighting slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-weighting.pdf

#### Normalization

In [None]:
def prob_norm(u):
    """Normalize 1d np.array `u` into a probability distribution. Assumes 
    that all the members of `u` are positive. Returns a 1d np.array of 
    the same dimensionality as `u`."""
    return u / np.sum(u)

#### TF-IDF

In [None]:
def tfidf(mat, rownames=None):
    """TF-IDF 
    
    Parameters
    ----------
    mat : 2d np.array
       The matrix to operate on.
       
    rownames : list of str or None
        Not used; it's an argument only for consistency with other methods 
        defined here.
        
    Returns
    -------
    (np.array, list of str)    
       The first member is the TF-IDF-transformed version of `mat`, and 
       the second member is `rownames` (unchanged).
    
    """
    colsums = np.sum(mat, axis=0)
    doccount = mat.shape[1]
    w = np.array([_tfidf_row_func(row, colsums, doccount) for row in mat])
    return (w, rownames)

def _tfidf_row_func(row, colsums, doccount):
    df = float(len([x for x in row if x > 0]))
    idf = 0.0
    # This ensures a defined IDF value >= 0.0:
    if df > 0.0 and df != doccount:
        idf = np.log(doccount / df)
    tfs = row/colsums
    return tfs * idf

In [None]:
wd_tfidf = tfidf(mat=wd[0], rownames=wd[1])

In [None]:
neighbors(word='superb', mat=wd_tfidf[0], rownames=wd_tfidf[1], distfunc=cosine)[: 5]
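
To check what the `tfidf` function above computes, here is a vectorized sketch applied to a tiny word × document matrix. `tfidf_vectorized` is a hypothetical name; it follows the same definition (term frequency relative to the column sum, idf = log(doccount / df), idf set to 0 when a word occurs in no document or in every document) and assumes that no column sums to zero.

```python
import numpy as np

def tfidf_vectorized(mat):
    mat = np.asarray(mat, dtype=float)
    colsums = mat.sum(axis=0)            # total count per document
    doccount = mat.shape[1]
    df = (mat > 0).sum(axis=1).astype(float)  # documents containing each word
    idf = np.zeros(mat.shape[0])
    # Leave idf at 0.0 for words in no document or in every document
    informative = (df > 0) & (df != doccount)
    idf[informative] = np.log(doccount / df[informative])
    tfs = mat / colsums                  # assumes no empty column
    return tfs * idf[:, np.newaxis]

# Rows are words, columns are documents (made-up counts)
toy = np.array([[1.0, 0.0, 2.0],
                [3.0, 4.0, 5.0],
                [0.0, 0.0, 6.0]])
print(tfidf_vectorized(toy))
```

The second word occurs in every document, so its idf is 0 and its whole row is zeroed out; the third word occurs in a single document and receives the largest idf.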

# References
* [1] gensim の tfidf で正規化（normalize）に苦しんだ話 - http://tawara.hatenablog.com/entry/2016/11/08/021408
* [2] How do I compare document similarity using Python? - https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
* [3] Scikit-Learn의 문서 전처리 기능 - https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/
* [4] 나만의 웹 크롤러 만들기 with Requests/BeautifulSoup - https://beomi.github.io/2017/01/20/HowToMakeWebCrawler/
* [5] scikit-learn: TF/IDF and cosine similarity for computer science papers - http://www.markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-for-computer-science-papers/
* [6] sklearn.metrics.pairwise.cosine_similarity - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity
* [7] CS224U: Natural Language Understanding - https://web.stanford.edu/class/cs224u/
* [8] CS224U: Natural Language Understanding / Distributional word representations 
    - notebook -  http://nbviewer.jupyter.org/github/cgpotts/cs224u/blob/master/vsm.ipynb
    - Overview slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-overview.pdf
    - Vector comparison slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-veccompare.pdf
    - Reweighting slide - https://web.stanford.edu/class/cs224u/materials/cs224u-vsm-weighting.pdf