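### TF-IDF (Term Frequency-Inverse Document Frequency)
: TF-IDF weights each term count by the product of TF and IDF, which corrects for words that are repeated often without carrying much meaning.
<br>A word gets a high weight when it appears often in a particular document but in relatively few documents overall.
 
- TF: term frequency, the number of times a term appears in a document.
- IDF: inverse document frequency, the inverse of the number of documents in which a term appears.
 
<br><br>A word that appears in many documents has a higher document frequency (df) and therefore a lower inverse document frequency (idf).
<br>When the differences between documents matter, words with a high idf are the informative ones.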
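The weighting described above can be worked through by hand on a toy corpus. The sketch below is a minimal illustration using only the standard library. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and the unit-length row scaling that scikit-learn documents as its defaults. The corpus and the helper names `smoothed_idf` and `tfidf_row` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the IMDB data).
docs = [
    "slow movie slow plot",
    "great movie",
    "great cast great plot",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    # Assumed formula (scikit-learn's documented default with smoothing):
    # ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(tokens):
    # Raw term counts times IDF, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in sorted(row.items())})
```

In the first document, "slow" occurs twice and appears in no other document, so it ends up with a larger weight than "movie", which is shared with another document.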

#### Loading the data

In [1]:
import pandas as pd

df = pd.read_excel('imdb.xlsx', index_col=0)
df.head()

Unnamed: 0,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500, stop_words='english')

In [3]:
tdm = tfidf.fit_transform(df['review'])

In [8]:
type(tdm)

scipy.sparse._csr.csr_matrix

In [7]:
print(tdm.shape); print(); print(tdm)

(748, 500)

  (0, 264)	0.43676152065842583
  (0, 499)	0.5112421488050499
  (0, 284)	0.23148088751652843
  (0, 286)	0.5112421488050499
  (0, 385)	0.48261672511123166
  (1, 185)	0.4508252485652821
  (1, 27)	0.4630757536620436
  (1, 61)	0.330043211383874
  (1, 253)	0.4775650258680224
  (1, 417)	0.4952984618525245
  (2, 293)	0.3326996842333155
  (2, 242)	0.30667834544717537
  (2, 319)	0.24142314513946203
  (2, 321)	0.3140711939059074
  (2, 5)	0.22041303500891976
  (2, 354)	0.32260561111952785
  (2, 108)	0.3140711939059074
  (2, 54)	0.2890474338764623
  (2, 71)	0.32260561111952785
  (2, 475)	0.30015739096222105
  (2, 43)	0.2943241984036562
  (2, 284)	0.15064019733663225
  (3, 394)	0.6747516922530598
  (3, 287)	0.5316526285937699
  (3, 244)	0.5119137000618044
  :	:
  (739, 146)	1.0
  (740, 9)	0.83021432552157
  (740, 182)	0.5574443234070687
  (741, 250)	0.6167137686897145
  (741, 481)	0.4797072301038725
  (741, 405)	0.4694360111750707
  (741, 239)	0.41130880407139175
  (742, 284)	1.0
  (743,
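Each printed line has the form `(row, column)  value`: the row is the review index, the column is the vocabulary index, and the value is the TF-IDF weight. Only non-zero entries are stored. `TfidfVectorizer` scales every row to unit Euclidean length by default (`norm='l2'`), which can be checked against the printed numbers. The sketch below copies the entries for reviews 0 and 740 from the output above. The helper `row_norm` is a name made up for this example.

```python
import math

# (row, column) -> weight, copied from the printed output for reviews 0 and 740.
entries = {
    (0, 264): 0.43676152065842583,
    (0, 499): 0.5112421488050499,
    (0, 284): 0.23148088751652843,
    (0, 286): 0.5112421488050499,
    (0, 385): 0.48261672511123166,
    (740, 9): 0.83021432552157,
    (740, 182): 0.5574443234070687,
}


def row_norm(entries, row):
    # Euclidean length of one row, using only its stored (non-zero) entries.
    return math.sqrt(sum(v * v for (r, _), v in entries.items() if r == row))


for row in (0, 740):
    print(row, round(row_norm(entries, row), 6))
```

This also explains why a review with a single retained word, such as row 739, shows a weight of exactly 1.0.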

#### Converting an existing TDM to TF-IDF

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 500, stop_words = 'english')
tdm2 = cv.fit_transform(df['review'])

In [16]:
word_count = pd.DataFrame({
    'word' : tfidf.get_feature_names_out(), # vocabulary of the vectorizer that produced tdm
    'tf-idf' : tdm.sum(axis = 0).flat
})

word_count.sort_values('tf-idf', ascending = False).head()

Unnamed: 0,word,tf-idf
284,movie,44.917213
153,film,40.35639
33,bad,25.258572
225,just,20.296871
178,good,18.604656


In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

trans = TfidfTransformer()
tdm3 = trans.fit_transform(tdm2)
tdm3

<748x500 sparse matrix of type '<class 'numpy.float64'>'
	with 3434 stored elements in Compressed Sparse Row format>

In [18]:
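# Compare the two ways of computing TF-IDF
import numpy as np

np.allclose(tdm.toarray(), tdm3.toarray()) # toarray() converts each sparse matrix to a dense array so the values can be compared element-wise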

True

### English POS tagging & lemmatization

In [None]:
!conda install -y -c conda-forge spacy
# !pip install -U spacy

In [None]:
!python -m spacy download en_core_web_sm

In [22]:
import spacy

nlp = spacy.load("en_core_web_sm") # load the English model

text = "Wikipedia is maintained by volunteers."
doc = nlp(text) # apply the model to the English text

In [23]:
for token in doc:
    print(token.text,
    token.lemma_, # lemma
    token.pos_, # coarse part of speech
    token.tag_, # fine-grained part of speech
    token.dep_, # syntactic dependency relation
    token.is_stop) # whether the token is a stop word

Wikipedia Wikipedia PROPN NNP nsubjpass False
is be AUX VBZ auxpass True
maintained maintain VERB VBN ROOT False
by by ADP IN agent True
volunteers volunteer NOUN NNS pobj False
. . PUNCT . punct False


In [37]:
spacy.explain('PROPN') # look up what a tag means

'proper noun'

In [40]:
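# Build a term-document matrix from the lemmas of nouns and verbs
def extract_nv(text):
    doc = nlp(text)
    words = []
    for token in doc:
        print(token.tag_) # debug output: fine-grained tag of every token
        if token.tag_[0] in 'NV': # keep lemmas only for tags starting with N (e.g. NNP, NN) or V (e.g. VBZ)
            words.append(token.lemma_.lower())
    return words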

In [41]:
extract_nv('Apple is a company')

NNP
VBZ
DT
NN


['apple', 'be', 'company']
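The first-letter test in `extract_nv` is compact, but it accepts any tag that starts with `N` or `V`. That includes `NFP`, which spaCy uses for superfluous punctuation and which does occur in the review data. One possible alternative, sketched here in plain Python, is to list the Penn Treebank noun and verb tags explicitly. The tag sets and function names below are assumptions made for illustration, not part of the original notebook.

```python
# Penn Treebank noun and verb tags, listed explicitly.
# Which tags should count is an assumption; adjust as needed.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def is_noun_or_verb(tag):
    # Explicit membership test against the two tag sets.
    return tag in NOUN_TAGS or tag in VERB_TAGS


def first_letter_rule(tag):
    # The rule used in extract_nv: keep any tag starting with N or V.
    return tag[0] in "NV"


tags = ["NNP", "VBZ", "DT", "NN", "NFP"]
for tag in tags:
    print(tag, first_letter_rule(tag), is_noun_or_verb(tag))
```

The two rules agree on the first four tags and differ only on `NFP`, which the explicit sets reject.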

In [38]:
spacy.explain('NNP')

'noun, proper singular'

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 500, tokenizer = extract_nv)

In [46]:
import pandas as pd

df = pd.read_excel('imdb.xlsx', index_col = 0)
df.head()

Unnamed: 0,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [44]:
# Build the TDM
tdm = cv.fit_transform(df['review'])

DT
RB
,
JJ
,
RB
[output truncated: extract_nv prints one fine-grained tag per token for every review]


In [47]:
# Build a word-count table to sort by frequency
wc = pd.DataFrame({
    'word': cv.get_feature_names_out(),
    'count': tdm.sum(axis=0).flat
})

In [52]:
wc.sort_values('count', ascending=False).head(10)

Unnamed: 0,word,count
28,be,845
263,movie,211
126,film,189
166,have,119
92,do,112
370,see,78
49,character,59
233,make,58
478,watch,48
446,time,48
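The lemma "be" dominates the ranking because the custom tokenizer keeps every verb and this `CountVectorizer` was created without `stop_words`. One possible follow-up is to drop such function-word lemmas before ranking. The sketch below does this in plain Python on the ten counts from the table above. The `ignore` set is a hypothetical choice made for illustration. Inside `extract_nv`, checking `token.is_stop` would be another option.

```python
# Top-10 counts copied from the table above.
counts = {
    "be": 845, "movie": 211, "film": 189, "have": 119, "do": 112,
    "see": 78, "character": 59, "make": 58, "watch": 48, "time": 48,
}

# Hypothetical set of lemmas to ignore; chosen for illustration only.
ignore = {"be", "have", "do"}

filtered = {word: count for word, count in counts.items() if word not in ignore}

# Rank the remaining lemmas by count, highest first.
ranked = sorted(filtered.items(), key=lambda item: item[1], reverse=True)
for word, count in ranked:
    print(word, count)
```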
