# Simple NLP for English
This notebook walks through simple tokenizing, POS tagging, and lemmatization, and then, as a practical example, builds a DocumentTermMatrix from the abstracts of text-mining papers scraped from arXiv. The terms of the DocumentTermMatrix are the 100 most frequent words among the nouns (NN) and proper nouns (NNP) that appear in the abstracts.  
  
* _Uses nltk and feature_extraction.text, a sub-module of sklearn._
* You are welcome to use this code in practice, but polishing it a bit further first is recommended.
* nltk : http://www.nltk.org/book/

Author: 김보섭

## Pos tagging for English

In [1]:
import nltk
import re
#nltk.download()  # You can also look up and download only the resources you need.

### Tokenizing

In [2]:
s1 = 'And now for something completely different'
s1_tokens = nltk.word_tokenize(s1)
print(s1,s1_tokens)

And now for something completely different ['And', 'now', 'for', 'something', 'completely', 'different']


In [3]:
s2 = 'zqq'.join(['And', 'now', 'for', 'something', 'completely', 'different'])
print(s2)

Andzqqnowzqqforzqqsomethingzqqcompletelyzqqdifferent


In [4]:
s2_tokens = nltk.regexp_tokenize(text = s2, pattern = 'zqq', gaps = True)
print(s2, s2_tokens)

Andzqqnowzqqforzqqsomethingzqqcompletelyzqqdifferent ['And', 'now', 'for', 'something', 'completely', 'different']
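The `gaps` argument decides whether the pattern describes the separators or the tokens themselves. The following is a minimal sketch of that difference, reusing a shortened version of the string above; the variable names are only illustrative.

```python
import nltk

text = 'Andzqqnowzqqfor'

# gaps=True: the pattern matches the separators, so the text between matches is returned.
by_gaps = nltk.regexp_tokenize(text, pattern='zqq', gaps=True)

# gaps=False (the default): the pattern matches the tokens themselves.
by_match = nltk.regexp_tokenize(text, pattern='zqq', gaps=False)

# With gaps=False, a word pattern on ordinary text returns the words.
words = nltk.regexp_tokenize('And now for', pattern=r'\w+')

print(by_gaps)
print(by_match)
print(words)
```

With `gaps=False` and the separator pattern, only the separators come back, which is why `gaps=True` is needed in the cell above.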


### Pos-tagging

In [5]:
s1_tags= nltk.pos_tag(s1_tokens)
print(s1_tags)

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


In [6]:
# Extract nouns only
[s1_tag for s1_tag in s1_tags if s1_tag[1] == 'NN']

[('something', 'NN')]
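The Penn Treebank noun tags all start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`), so a prefix check is one way to also keep plural and proper nouns. The sketch below uses a hand-written list of tagged tokens (hypothetical, not actual tagger output) so that it runs without any tagger model.

```python
# Hypothetical (token, tag) pairs written by hand for illustration;
# in the notebook they would come from nltk.pos_tag.
tagged = [('Topic', 'NN'), ('models', 'NNS'), ('are', 'VBP'),
          ('used', 'VBN'), ('in', 'IN'), ('NLP', 'NNP')]

# Exact match keeps singular common nouns only.
singular_nouns = [word for word, tag in tagged if tag == 'NN']

# Prefix match keeps NN, NNS, NNP and NNPS.
all_nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(singular_nouns)
print(all_nouns)
```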

### Lemmatize 

In [7]:
wnl = nltk.stem.WordNetLemmatizer()

In [8]:
wnl.lemmatize('illuminate', pos = 'v')

'illuminate'

### example
Build a DocumentTermMatrix (term frequency) from the abstracts of text-mining papers scraped from arXiv, with the terms restricted to nouns.

#### package load 

In [9]:
import pandas as pd
import re
import os, sys
from sklearn.feature_extraction.text import CountVectorizer
os.listdir()

['.ipynb_checkpoints',
 'Scrapping text mining papers in arXiv.py',
 'Simple NLP for English.ipynb',
 'text_mining_paper.csv']

In [10]:
papers = pd.read_csv('./text_mining_paper.csv', encoding = 'cp949')
papers.head()

Unnamed: 0,abstract,author,meta,subject,title
0,"The complicated, evolving landscape of cancer ...","Rocco Piazza, Daniele Ramazzotti, Roberta Spin...","Thu, 9 Mar 2017 01:24:23 GMT (948kb)",Genomics (q-bio.GN),"OncoScore: a novel, Internet-based tool to ass..."
1,"Mining textual patterns in news, tweets, paper...","Meng Jiang, Jingbo Shang, Taylor Cassidy, Xian...","Mon, 13 Mar 2017 01:06:19 GMT (1150kb,D) [v2] ...",Computation and Language (cs.CL),MetaPAD: Meta Pattern Discovery from Massive T...
2,This paper is a tutorial on Formal Concept Ana...,Dmitry I. Ignatov,"Wed, 8 Mar 2017 12:53:21 GMT (3541kb,D)",Information Retrieval (cs.IR),Introduction to Formal Concept Analysis and It...
3,Topic models have been widely used in discover...,"Jarvan Law, Hankz Hankui Zhuo, Junhua He, Erhu...","Thu, 23 Feb 2017 07:16:03 GMT (96kb,D)",Computation and Language (cs.CL),LTSG: Latent Topical Skip-Gram for Mutually Le...
4,Entity extraction is fundamental to many text ...,"Zeyi Wen, Dong Deng, Rui Zhang, Kotagiri Ramam...","Sun, 12 Feb 2017 12:46:40 GMT (89kb)",Databases (cs.DB),A Technical Report: Entity Extraction using Bo...


In [11]:
papers.shape

(168, 5)

#### Tokenizing

In [12]:
tmps = list(papers['abstract'])
tmps[0:6]

['The complicated, evolving landscape of cancer mutations poses a formidable\r\nchallenge to identify cancer genes among the large lists of mutations typically\r\ngenerated in NGS experiments. The ability to prioritize these variants is\r\ntherefore of paramount importance. To address this issue we developed\r\nOncoScore, a text-mining tool that ranks genes according to their association\r\nwith cancer, based on available biomedical literature. Receiver operating\r\ncharacteristic curve and the area under the curve (AUC) metrics on manually\r\ncurated datasets confirmed the excellent discriminating capability of OncoScore\r\n(OncoScore cut-off threshold = 21.09; AUC = 90.3%, 95% CI: 88.1-92.5%),\r\nindicating that OncoScore provides useful results in cases where an efficient\r\nprioritization of cancer-associated genes is needed.\r\n',
 'Mining textual patterns in news, tweets, papers, and many other kinds of text\r\ncorpora has been an active theme in text mining and NLP research. Pre

In [13]:
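# Tokenize by keeping only English words of 3 or more letters
# ([A-Za-z] rather than [A-z], which would also match [ \ ] ^ _ and `)
tokenized = [re.findall('[A-Za-z]{3,}', tmp) for tmp in tmps]
# POS-tag the words of 3 or more letters
tmps = [nltk.pos_tag(tokens) for tokens in tokenized]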
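When extracting alphabetic tokens with a regular expression, note that the character range `A-z` is wider than the letters: it also covers the ASCII characters between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick. A small sketch comparing the two patterns on a made-up string:

```python
import re

# Made-up string containing characters that sit between 'Z' and 'a' in ASCII.
sample = 'text_mining uses [NLP] and tf^idf'

# 'A-z' spans ASCII 65-122, so brackets, caret and underscore are matched too.
wide = re.findall('[A-z]{3,}', sample)

# 'A-Za-z' matches letters only.
letters = re.findall('[A-Za-z]{3,}', sample)

print(wide)     # ['text_mining', 'uses', '[NLP]', 'and', 'tf^idf']
print(letters)  # ['text', 'mining', 'uses', 'NLP', 'and', 'idf']
```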

In [14]:
# Build the corpus from nouns (NN) and proper nouns (NNP) only, among the words of 3 or more letters
tmps = [[word for word, tag in tmp if tag in ('NN', 'NNP')] for tmp in tmps]
# Join the remaining words of each abstract back into one string
tmps = [' '.join(tmp) for tmp in tmps]
len(tmps)

168

In [15]:
# Set the terms (the top 100 words by term frequency)
from collections import Counter
word_count = list(map(lambda x : nltk.word_tokenize(x), tmps))
word_count = sum(word_count, [])
word_count = Counter(word_count)
Vocabulary = word_count.most_common(100)
Vocabulary = [tmp[0] for tmp in Vocabulary]
len(Vocabulary)

100
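The counting step can be sketched on a small scale as follows. This is an illustration on made-up documents, not the notebook's data; it uses `itertools.chain.from_iterable` to flatten the token lists, which avoids the repeated list copying of `sum(word_count, [])`, and `str.split` in place of `nltk.word_tokenize` since the documents are already space-joined.

```python
from collections import Counter
from itertools import chain

# Made-up, already-filtered documents (space-joined nouns).
docs = ['text mining text', 'topic model text', 'mining text']

# Flatten all tokens lazily and count them in a single pass.
counts = Counter(chain.from_iterable(doc.split() for doc in docs))

# most_common(n) returns (term, count) pairs sorted by descending count.
top_terms = [term for term, _ in counts.most_common(2)]

print(counts['text'], counts['mining'])
print(top_terms)
```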

In [16]:
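# Create the DocumentTermMatrix
# lowercase=False: the default lowercasing would leave capitalized terms in Vocabulary unmatched
dtm = CountVectorizer(vocabulary=Vocabulary, lowercase=False)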

In [17]:
dtm.fit(tmps)
dtm.vocabulary_ # Terms of the DocumentTermMatrix
len(dtm.vocabulary_) # Number of terms

100

In [18]:
my_dtm= dtm.transform(tmps).toarray()

In [19]:
my_dtm.shape

(168, 100)
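`CountVectorizer` lowercases documents by default, so a fixed vocabulary that contains capitalized terms (proper nouns, for instance) interacts with the `lowercase` parameter. The following is a minimal sketch on made-up documents and a made-up two-term vocabulary, showing how the counts differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents and vocabulary for illustration only.
docs = ['NLP text text', 'text mining']
vocab = ['text', 'NLP']

# Default lowercase=True: documents become lowercase, so 'NLP' never matches.
default_counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray().tolist()

# lowercase=False: documents keep their case, so 'NLP' is counted.
cased_counts = CountVectorizer(vocabulary=vocab, lowercase=False).fit_transform(docs).toarray().tolist()

# Columns follow the order of the vocabulary list: ['text', 'NLP'].
print(default_counts)
print(cased_counts)
```

Words outside the fixed vocabulary, such as `mining` here, are simply ignored, which is how the matrix above ends up with exactly 100 columns.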