# How do I compare document similarity using Python?
Learn how to use the gensim Python library to determine the similarity between two or more documents.

By Jonathan Mugan April 18, 2017

https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python

###### How do I find documents similar to a particular document?

We will use a library in Python called gensim.

In [1]:
import gensim
print(dir(gensim))



['NullHandler', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'corpora', 'interfaces', 'logger', 'logging', 'matutils', 'models', 'parsing', 'scripts', 'similarities', 'summarization', 'topic_coherence', 'utils']


###### Let's create some documents.

In [None]:
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
print("Number of documents:", len(raw_documents))

In [2]:
kor_documents = [
        "인공지능의 발전과 고용의 미래",
        "인공지능, 로봇, 빅데이터와 제4차 산업혁명",
        "인공지능 시대의 법적·윤리적 쟁점",
        "인공지능(AI) 시대, 교회 공동체 성립요건 연구: 예배와 설교 가능성을 중심으로",
        "게임 인공지능 연구동향",
        "파고(인공지능)의 번역 불가능성과 디지털 바벨탑의 붕괴 - 언어의 이종성異種性이 불러올 미래에 대한 가상 시놉시스 -",
        "인공지능과 심층학습의 발전사",
        "일반 비디오 게임 플레이 인공지능을 위한 GreedyUCB1기반 몬테카를로 트리 탐색",
        "포스트휴먼시대 인공지능과 미래 경제 트렌드",
        "컴퓨터 게임에서의 인공지능 기술"
    ]     
print("Number of Korean documents:", len(kor_documents))

Number of Korean documents: 10


###### We will use NLTK to tokenize.

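A document will now be a list of tokens. The English documents are tokenized with NLTK's word_tokenize; the Korean documents are split into morphemes with KoNLPy's Komoran analyzer.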

In [None]:
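from nltk.tokenize import word_tokenize
# word_tokenize may need the NLTK tokenizer data first, e.g. nltk.download('punkt')
# Lowercase every token so that "Socks" and "socks" map to the same word
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
print(gen_docs)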

In [3]:
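from konlpy.tag import Komoran
# Komoran is a Korean morphological analyzer; KoNLPy needs a Java runtime
komoran = Komoran()
# morphs() already returns a list of morphemes for each sentence
gen_kor_docs = [komoran.morphs(text) for text in kor_documents]
print(gen_kor_docs)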

[['인공지능', '의', '발전', '과', '고용', '의', '미래'], ['인공지능', ',', '로봇', ',', '빅데이터', '와', '제4차 산업', '혁명'], ['인공지능', '시대', '의', '법', '적', '·', '윤리', '적', '쟁점'], ['인공지능', '(', 'AI', ')', '시대', ',', '교회', '공동체', '성립', '요건', '연구', ':', '예배', '와', '설교', '가능', '성', '을', '중심', '으로'], ['게임', '인공지능', '연구', '동향'], ['파고', '(', '인공지능', ')', '의', '번역', '불', '가능', '성', '과', '디지털', '바벨탑', '의', '붕괴', '-', '언어', '의', '이종성', '異種性', '이', '불러오', 'ㄹ', '미래', '에', '대하', 'ㄴ', '가상', '시놉시스', '-'], ['인공지능', '과', '심층', '학습', '의', '발전', '사'], ['일반', '비디오 게임', '플레이', '인공지능', '을', '위하', 'ㄴ', 'GreedyUCB', '1', '기반', '몬테카를로', '트리', '탐색'], ['포스트', '휴먼', '시대', '인공지능', '과', '미래', '경제', '트렌드'], ['컴퓨터 게임', '에서', '의', '인공지능', '기술']]


###### We will create a dictionary from a list of documents. 
A dictionary maps every word to a number.

In [None]:
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['road'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
    print(i, dictionary[i])

In [4]:
kor_dictionary = gensim.corpora.Dictionary(gen_kor_docs)
print(kor_dictionary[5])
print(kor_dictionary.token2id['고용'])
print("Number of words in the Korean dictionary:", len(kor_dictionary))
for i in range(len(kor_dictionary)):
    print(i, kor_dictionary[i])

미래
4
Number of words in the Korean dictionary: 74
0 인공지능
1 의
2 발전
3 과
4 고용
5 미래
6 ,
7 로봇
8 빅데이터
9 와
10 제4차 산업
11 혁명
12 시대
13 법
14 적
15 ·
16 윤리
17 쟁점
18 (
19 AI
20 )
21 교회
22 공동체
23 성립
24 요건
25 연구
26 :
27 예배
28 설교
29 가능
30 성
31 을
32 중심
33 으로
34 게임
35 동향
36 파고
37 번역
38 불
39 디지털
40 바벨탑
41 붕괴
42 -
43 언어
44 이종성
45 異種性
46 이
47 불러오
48 ㄹ
49 에
50 대하
51 ㄴ
52 가상
53 시놉시스
54 심층
55 학습
56 사
57 일반
58 비디오 게임
59 플레이
60 위하
61 GreedyUCB
62 1
63 기반
64 몬테카를로
65 트리
66 탐색
67 포스트
68 휴먼
69 경제
70 트렌드
71 컴퓨터 게임
72 에서
73 기술
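
In the listing above, ids follow the order in which tokens first appear. The idea can be sketched in plain Python as follows. This is a minimal illustration, not gensim's implementation; the helper name `build_token2id` is hypothetical, and the id order gensim assigns may differ between versions. The two token lists are the first two Korean documents from the tokenizer output.

```python
def build_token2id(docs):
    """Hypothetical helper: give each new token the next free integer id."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# First two tokenized Korean documents, copied from the output above
docs = [
    ["인공지능", "의", "발전", "과", "고용", "의", "미래"],
    ["인공지능", ",", "로봇", ",", "빅데이터", "와", "제4차 산업", "혁명"],
]

token2id = build_token2id(docs)
id2token = {i: t for t, i in token2id.items()}  # reverse lookup: id -> token

print(token2id["고용"])   # id of a token
print(id2token[5])        # token for an id
print(len(token2id))      # number of distinct tokens
```

The forward map plays the role of `token2id` on the gensim dictionary, and the reverse map corresponds to indexing the dictionary by id.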


###### Now we will create a corpus. 
A corpus is a list of bags of words. A bag-of-words representation of a document records how many times each word occurs in it, ignoring word order; gensim stores it as a list of (token id, count) pairs.

In [None]:
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)

In [5]:
kor_corpus = [kor_dictionary.doc2bow(gen_kor_doc) for gen_kor_doc in gen_kor_docs]
print(kor_corpus)

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1)], [(0, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)], [(0, 1), (1, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1)], [(0, 1), (6, 1), (9, 1), (12, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(0, 1), (25, 1), (34, 1), (35, 1)], [(0, 1), (1, 3), (3, 1), (5, 1), (18, 1), (20, 1), (29, 1), (30, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1)], [(0, 1), (1, 1), (2, 1), (3, 1), (54, 1), (55, 1), (56, 1)], [(0, 1), (31, 1), (51, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1)], [(0, 1), (3, 1), (5, 1), (12, 1), (67, 1), (68, 1), (69, 1), (70, 1)], [(0, 1), (1, 1), (71, 1), (72, 1), (73, 1)]]
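
The first entry of the printed corpus can be reproduced with a few lines of plain Python. This is a sketch of the idea behind `doc2bow`, not gensim's code; the helper name `to_bow` is hypothetical, and the token ids are copied from the dictionary listing.

```python
from collections import Counter

# Token ids for the first Korean document, copied from the dictionary listing
token2id = {"인공지능": 0, "의": 1, "발전": 2, "과": 3, "고용": 4, "미래": 5}

def to_bow(tokens, token2id):
    """Hypothetical helper: count known tokens, return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

first_doc = ["인공지능", "의", "발전", "과", "고용", "의", "미래"]
bow = to_bow(first_doc, token2id)
print(bow)  # [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1)]
```

Token 1 ("의") occurs twice in the sentence, which is why its count is 2. Tokens missing from the dictionary are skipped in this sketch.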


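###### Now we create a tf-idf model from the corpus. Note that num_nnz is the number of non-zero entries in the bag-of-words matrix, that is, the number of distinct tokens summed over all documents (not the total token count).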

In [None]:
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
# Sum the number of distinct tokens per document; this should equal num_nnz
s = sum(len(doc) for doc in corpus)
print(s)

In [6]:
kor_tf_idf = gensim.models.TfidfModel(kor_corpus)
print(kor_tf_idf)
# Sum the number of distinct tokens per document; this should equal num_nnz
s = sum(len(doc) for doc in kor_corpus)
print(s)

TfidfModel(num_docs=10, num_nnz=104)
104


###### Now we will create a similarity measure object in tf-idf space.

tf-idf stands for term frequency-inverse document frequency. Term frequency is how often a word appears in a document, and inverse document frequency scales that value by how rare the word is across the corpus.

In [None]:
sims = gensim.similarities.Similarity('../note/workdir/',tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))

In [9]:
kor_sims = gensim.similarities.Similarity('../note/ko_workdir/',kor_tf_idf[kor_corpus],
                                      num_features=len(kor_dictionary))
print(kor_sims)
print(type(kor_sims))

Similarity index with 10 documents in 0 shards (stored under ../note/ko_workdir/)
<class 'gensim.similarities.docsim.Similarity'>


###### Now create a query document and convert it to tf-idf.

In [None]:
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

In [16]:
kor_query_doc = komoran.morphs("컴퓨터 게임에서의 인공지능 기술")
print(kor_query_doc)

kor_query_doc_bow = kor_dictionary.doc2bow(kor_query_doc)
print(kor_query_doc_bow)

kor_query_doc_tf_idf = kor_tf_idf[kor_query_doc_bow]
print(kor_query_doc_tf_idf)

['컴퓨터 게임', '에서', '의', '인공지능', '기술']
[(0, 1), (1, 1), (71, 1), (72, 1), (73, 1)]
[(1, 0.1712328295110301), (71, 0.5688231471197489), (72, 0.5688231471197489), (73, 0.5688231471197489)]
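
The weights printed above can be checked by hand. The sketch below assumes gensim's default weighting: raw term count times log2(N / df), followed by L2 normalization. The helper name `tfidf` is hypothetical. The document frequencies are counted from the printed Korean corpus: token 0 appears in all 10 documents, token 1 in 5, and tokens 71 to 73 in 1 each.

```python
import math

num_docs = 10
# Document frequency per token id, counted from the printed kor_corpus
doc_freq = {0: 10, 1: 5, 71: 1, 72: 1, 73: 1}
query_bow = [(0, 1), (1, 1), (71, 1), (72, 1), (73, 1)]

def tfidf(bow, doc_freq, num_docs):
    """Hypothetical helper: tf * log2(N / df), then scale to unit length."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    # A token found in every document has idf 0 and drops out
    weights = [(tid, w) for tid, w in weights if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in weights]

query_tfidf = tfidf(query_bow, doc_freq, num_docs)
for tid, weight in query_tfidf:
    print(tid, round(weight, 7))
```

Token 0 ("인공지능") is missing from the result because it occurs in every document, so it carries no weight for telling documents apart.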


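###### We show an array of similarities between the query and each document. For the English query, the second document is the most similar because it shares the words "socks" and "force". The Korean query is identical to the last Korean document, so that document scores 1.0.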

In [None]:
sims[query_doc_tf_idf]

In [17]:
kor_sims[kor_query_doc_tf_idf]

array([ 0.06823284,  0.        ,  0.01782335,  0.        ,  0.        ,
        0.03177125,  0.02666271,  0.        ,  0.        ,  1.        ], dtype=float32)
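
Each score in the array is a cosine similarity between the tf-idf vector of the query and that of one document. The plain-Python sketch below shows the computation on sparse vectors stored as dicts. The function name `cosine` and the vector `disjoint_doc` are hypothetical; the query weights are copied from the output above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {token_id: weight}."""
    dot = sum(w * v.get(tid, 0.0) for tid, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# tf-idf weights of the Korean query, copied from the printed output
query = {1: 0.1712328295110301, 71: 0.5688231471197489,
         72: 0.5688231471197489, 73: 0.5688231471197489}

same_doc = dict(query)            # the last document has the same text as the query
disjoint_doc = {6: 0.8, 7: 0.6}   # hypothetical vector sharing no token with the query

print(cosine(query, same_doc))
print(cosine(query, disjoint_doc))
```

Identical vectors score 1 (up to floating-point rounding), and vectors with no token in common score 0. This matches the pattern in the array: the last entry is 1.0, and documents that share no weighted token with the query score 0.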

In [18]:
# Compute the similarities once, then print each document index with its score
kor_scores = kor_sims[kor_query_doc_tf_idf]
for i, score in enumerate(kor_scores):
    print(i, score)

0 0.0682328
1 0.0
2 0.0178233
3 0.0
4 0.0
5 0.0317713
6 0.0266627
7 0.0
8 0.0
9 1.0
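
To answer the original question of which documents are most similar to the query, the scores can be sorted in descending order. This is a minimal sketch: the scores are copied from the output above, and the names `ranked` and `top3` are illustrative.

```python
# Similarity scores copied from the printed array, indexed by document number
scores = [0.06823284, 0.0, 0.01782335, 0.0, 0.0,
          0.03177125, 0.02666271, 0.0, 0.0, 1.0]

# Pair each score with its document index and sort from most to least similar
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]

for doc_index, score in top3:
    print(doc_index, score)
```

Document 9 ranks first because it has the same text as the query, followed by documents 0 and 5.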
