All exercises are based on [Introduction to Deep Learning for NLP](https://wikidocs.net/24949).

### LSA (Latent Semantic Analysis)

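Understanding **SVD** (singular value decomposition) and SVD-based analysis.<br>
Let the original matrix, the DTM (document-term matrix), be A.<br>
- A = U × S × V^T, where U and V are orthogonal matrices and S is the diagonal matrix of singular values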
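
The factorization above can be checked numerically. The following is a minimal sketch using NumPy on a small made-up document-term matrix (the counts are hypothetical, not taken from the newsgroups data). It rebuilds A from its three factors and then forms a rank-k approximation, which is the truncation that LSA relies on.

```python
import numpy as np

# Toy document-term matrix (hypothetical counts): 4 documents x 5 terms.
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Thin SVD: U is (4, 4), S holds the singular values, Vt is (4, 5).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(A, A_rebuilt))

# Truncation: keep only the k largest singular values (rank-k approximation).
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(A_k.shape, np.linalg.matrix_rank(A_k))
```

The approximation keeps the original shape but has rank k, so each document is effectively described by k latent directions instead of by every term.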

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
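# Note: 'headers,' (with a comma inside the string) does not match 'headers',
# so the headers are kept, as the "From:" and "Subject:" lines in the outputs below show.
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers,', 'footers', 'quotes'))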
documents = dataset.data
len(documents)

11314

In [8]:
documents[1]

"From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSubject: Re: Amusing atheists and agnostics\nLines: 66\n\n\n\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [10]:
print(dataset.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [25]:
df = pd.DataFrame(documents, columns = ['data'])
df['data']

0        From: ab4z@Virginia.EDU ("Andi Beyer")\nSubjec...
1        From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSu...
2        From: bc744@cleveland.Freenet.Edu (Mark Ira Ka...
3        From: ray@ole.cdac.com (Ray Berry)\nSubject: C...
4        From: kkeller@mail.sas.upenn.edu (Keith Keller...
                               ...                        
11309    From: adams@bellini.berkeley.edu (Adam L. Schw...
11310    From: levin@bbn.com (Joel B Levin)\nSubject: R...
11311    From: tedward@cs.cornell.edu (Edward [Ted] Fis...
11312    From: mori@volga.mfd.cs.fujitsu.co.jp (Tsuyosh...
11313    From: marc@yogi.austin.ibm.com (Marc J. Stephe...
Name: data, Length: 11314, dtype: object

### Text preprocessing

In [26]:
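# Replace every non-letter character with a space
# (regex=True is required on pandas 2.0+, where str.replace defaults to literal matching)
df['data'] = df['data'].str.replace('[^A-Za-z]', ' ', regex=True)

# Remove short words (three characters or fewer)
df['data'] = df['data'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

# Lowercase all words
df['data'] = df['data'].apply(lambda x: x.lower())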
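
To see what these three steps do to a single document, here is a small sketch that applies the same rules to one string with the standard `re` module instead of pandas. The `preprocess` helper and the sample string are made up for illustration.

```python
import re

def preprocess(text):
    # Replace every non-letter character with a space.
    text = re.sub(r'[^A-Za-z]', ' ', text)
    # Keep only words longer than three characters.
    text = ' '.join(w for w in text.split() if len(w) > 3)
    # Lowercase everything.
    return text.lower()

sample = "From: ab4z@Virginia.EDU\nSubject: Re: Hello, World 2023!"
cleaned = preprocess(sample)
print(cleaned)  # from virginia subject hello world
```

Note that the length filter runs before lowercasing, which does not matter here because neither step affects the other.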

In [27]:
df['data']

0        from virginia andi beyer subject israeli terro...
1        from timmbake ucsb bake timmons subject amusin...
2        from cleveland freenet mark kaufman subject re...
3        from cdac berry subject clipper business usual...
4        from kkeller mail upenn keith keller subject p...
                               ...                        
11309    from adams bellini berkeley adam schwartz subj...
11310    from levin joel levin subject selective placeb...
11311    from tedward cornell edward fischer subject be...
11312    from mori volga fujitsu tsuyoshi mori subject ...
11313    from marc yogi austin marc stephenson subject ...
Name: data, Length: 11314, dtype: object

In [28]:
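# Remove stop words (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # a set makes membership tests fast
df['data'] = df['data'].apply(lambda x: x.split()) # tokenize on whitespace
df['data'] = df['data'].apply(lambda x: ' '.join([w for w in x if w not in stop_words]))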

In [29]:
df['data']

0        virginia andi beyer subject israeli terrorism ...
1        timmbake ucsb bake timmons subject amusing ath...
2        cleveland freenet mark kaufman subject rejoind...
3        cdac berry subject clipper business usual arti...
4        kkeller mail upenn keith keller subject playof...
                               ...                        
11309    adams bellini berkeley adam schwartz subject d...
11310    levin joel levin subject selective placebo lin...
11311    tedward cornell edward fischer subject best ho...
11312    mori volga fujitsu tsuyoshi mori subject want ...
11313    marc yogi austin marc stephenson subject astro...
Name: data, Length: 11314, dtype: object

### Building the TF-IDF matrix
By default, TfidfVectorizer takes raw, untokenized text as input, which is why the tokens were joined back into strings above.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, max_df=0.5, smooth_idf=True)
# max_df: ignore terms that appear in more than 0.5 (half) of the documents
# (a float between 0.0 and 1.0 is interpreted as a proportion of all documents)
X = vectorizer.fit_transform(df['data'])
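
The `max_df` rule can be illustrated without scikit-learn. The sketch below is a plain-Python approximation of the document-frequency filter, not the library's implementation. `filter_by_max_df` and the toy corpus are hypothetical names and data. It assumes, in line with the scikit-learn documentation, that a term is dropped only when its document frequency is strictly higher than the threshold.

```python
from collections import Counter

def filter_by_max_df(docs, max_df=0.5):
    """Return the terms whose document frequency is at most max_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(doc.split()))
    limit = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if df <= limit)

toy_docs = ["apple banana", "apple cherry", "apple date", "banana egg"]
kept = filter_by_max_df(toy_docs)
print(kept)  # ['banana', 'cherry', 'date', 'egg']
```

Here "apple" appears in 3 of 4 documents (0.75) and is dropped, while "banana" appears in exactly half of them and is kept.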

In [32]:
X.shape

(11314, 1000)

### Topic modeling

In [33]:
# Matrix factorization (using Truncated SVD)

from sklearn.decomposition import TruncatedSVD
svd_model = TruncatedSVD(n_components=20, algorithm = 'randomized', n_iter = 100, random_state = 122)
# What is n_iter? From the scikit-learn docstring:
# n_iter : int, default=5
#    Number of iterations for randomized SVD solver. Not used by ARPACK. The
#    default is larger than the default in
#    :func:`~sklearn.utils.extmath.randomized_svd` to handle sparse
#    matrices that may have large slowly decaying spectrum.
svd_model.fit(X)
len(svd_model.components_)

20

In [36]:
import numpy as np
np.shape(svd_model.components_) # components_ corresponds to Vt in the SVD

(20, 1000)
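
The role of Vt can be made concrete with plain NumPy. This is a sketch on random toy data, not the fitted `svd_model`. `X_toy` and `components` are illustrative names, and the signs of the rows may differ from what TruncatedSVD returns. Keeping the first k rows of Vt gives a (topics × terms) matrix, and projecting the documents onto those rows gives their coordinates in topic space.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((6, 8))          # hypothetical 6 documents x 8 terms

U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)

k = 3
components = Vt[:k]                 # plays the role of components_: (k topics, 8 terms)
doc_topic = X_toy @ components.T    # each document expressed in the k topics

print(components.shape, doc_topic.shape)
# The projection equals the first k columns of U scaled by the top-k singular values.
print(np.allclose(doc_topic, U[:, :k] * S[:k]))
```

With the real data the same shapes appear at a larger scale: 20 topics by 1000 terms for the components, and 11314 documents by 20 topics for the projected documents.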

In [38]:
terms = vectorizer.get_feature_names_out() # vocabulary (1000 terms); get_feature_names() was removed in scikit-learn 1.2
len(terms)

1000

In [48]:
def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        # argsort()[:-n - 1:-1] yields the indices of the n largest weights, in descending order
        top_terms = [(str(feature_names[i]), round(float(topic[i]), 5)) for i in topic.argsort()[:-n - 1:-1]]
        print("Topic {}: ".format(idx + 1), top_terms)
get_topics(svd_model.components_, terms)

Topic 1:  [('posting', 0.2305), ('host', 0.22474), ('nntp', 0.22458), ('university', 0.19165), ('would', 0.18322)]
Topic 2:  [('nntp', 0.41933), ('host', 0.41674), ('posting', 0.41392), ('university', 0.19544), ('distribution', 0.15948)]
Topic 3:  [('windows', 0.36606), ('card', 0.17801), ('thanks', 0.16612), ('file', 0.15601), ('drive', 0.14655)]
Topic 4:  [('university', 0.46631), ('state', 0.31673), ('ohio', 0.21558), ('virginia', 0.15955), ('pitt', 0.15677)]
Topic 5:  [('pitt', 0.46765), ('gordon', 0.40777), ('banks', 0.3984), ('computer', 0.22913), ('science', 0.21493)]
Topic 6:  [('cleveland', 0.31501), ('cwru', 0.30519), ('freenet', 0.21488), ('western', 0.17831), ('reserve', 0.17686)]
Topic 7:  [('nasa', 0.30388), ('access', 0.28613), ('cleveland', 0.26006), ('ohio', 0.23071), ('state', 0.22779)]
Topic 8:  [('nasa', 0.54022), ('space', 0.31578), ('cleveland', 0.20301), ('cwru', 0.16288), ('freenet', 0.12835)]
Topic 9:  [('access', 0.27712), ('sale', 0.26642), ('drive', 0.26123)
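
The slice `[:-n - 1:-1]` used in `get_topics` is easy to misread, so here is a tiny sketch of it on a made-up weight vector. `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end for n steps.

```python
import numpy as np

weights = np.array([0.1, 0.5, 0.3, 0.9, 0.2])   # hypothetical topic weights
n = 2

ascending = weights.argsort()        # indices from smallest to largest weight
top_n = ascending[:-n - 1:-1]        # last n indices, read in reverse order

print(ascending.tolist())            # [0, 4, 2, 1, 3]
print(top_n.tolist())                # [3, 1]
print(weights[top_n].tolist())       # [0.9, 0.5]
```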
