### Python Machine Learning
## Text Data
---
# Working with IMDb Reviews

- The whole dataset: corpus
- A single sample: document
- Natural language processing: NLP
- In Korean, a single **eojeol** (a space-delimited word unit) often consists of several units of meaning, so **morphological analysis** is required

- IMDb movie review dataset: https://github.com/rickiepark/introduction_to_ml_with_python/blob/master/data/aclImdb_v1.tar.gz
- Extract the archive and move it to the data/aclImdb folder
> ```
./data
./data/aclImdb
./data/aclImdb/test
./data/aclImdb/test/pos
./data/aclImdb/test/neg
./data/aclImdb/train
./data/aclImdb/train/pos
./data/aclImdb/train/neg
./data/aclImdb/train/unsup
```

- Move the ./data/aclImdb/train/unsup folder to ./data/aclImdb/train_unsup.
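
The move matters because `load_files` treats every subfolder as a class label, so leaving `unsup` inside `train` would add the unlabeled reviews as a third class. A minimal sketch of the move using only the standard library; the helper name `move_unsup` is an assumption for illustration, and the paths follow the layout listed above:

```python
import shutil
from pathlib import Path

def move_unsup(base="data/aclImdb"):
    """Move train/unsup out of train/ so only neg/ and pos/ remain as class folders."""
    base = Path(base)
    src = base / "train" / "unsup"
    dst = base / "train_unsup"
    if src.is_dir() and not dst.exists():
        shutil.move(str(src), str(dst))
    # Return the class folders that remain under train/
    return sorted(p.name for p in (base / "train").iterdir() if p.is_dir())
```

Calling `move_unsup()` once after extracting the archive should leave only `neg` and `pos` under `train`.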

>- Use the files in C:\khh\서울코딩학원\빅데이터분석03\3.머신러닝\data\aclImdb
>- D:\dataset\aclimdb.zip

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
'''
from sklearn.datasets import load_files

imdb_train = load_files('data/aclImdb/train/')
imdb_test = load_files('data/aclImdb/test/')

np.save('imdb.npy',[imdb_train, imdb_test])
'''

In [4]:
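# recent NumPy versions require allow_pickle=True to load arrays of Python objects
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)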
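
`np.save` falls back to pickling when it is given arbitrary Python objects such as these Bunch objects, so the cache file depends on pickle support when it is loaded. As an alternative sketch (not what this notebook does), the same cache could be written with the standard `pickle` module; the file name `imdb.pkl` and the helper names are assumptions:

```python
import pickle

def save_cache(obj, path="imdb.pkl"):
    """Write any picklable object (e.g. a list of two Bunch objects) to disk."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_cache(path="imdb.pkl"):
    """Read back the object written by save_cache."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

With these helpers, `save_cache([imdb_train, imdb_test])` followed by `imdb_train, imdb_test = load_cache()` would play the same role as the `np.save` / `np.load` pair.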

- Inspect the contents of imdb_train and imdb_test

In [5]:
type(imdb_train)

sklearn.utils.Bunch

In [6]:
dir(imdb_train)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [7]:
imdb_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [8]:
imdb_train.target

array([1, 0, 1, ..., 0, 0, 0])

In [9]:
imdb_train.target_names

['neg', 'pos']

In [3]:
len('가나다')

3

In [5]:
'가나다'.encode()

b'\xea\xb0\x80\xeb\x82\x98\xeb\x8b\xa4'

In [6]:
b'\xea\xb0\x80'.decode()

'가'

In [7]:
b'I am Tom.'.decode()

'I am Tom.'
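
The cells above show that `str` and `bytes` are different types: `len()` on a `str` counts characters, while the encoded form is measured in bytes. A short sketch, assuming UTF-8 (the default for `encode()` and `decode()` in Python 3):

```python
text = '가나다'
raw = text.encode('utf-8')       # str -> bytes

n_chars = len(text)              # number of characters in the string
n_bytes = len(raw)               # each of these Hangul syllables takes 3 bytes in UTF-8

print(n_chars, n_bytes)
print(raw.decode('utf-8') == text)   # decoding restores the original string
```

This is why the reviews, which `load_files` returns as `bytes`, are decoded before any string processing.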

In [10]:
# DESCR, data, filenames, target, target_names
display(type(imdb_train.data), len(imdb_train.data))     # list
display(imdb_train.filenames)
display(type(imdb_train.target), len(imdb_train.target)) # array
display(imdb_train.target_names)                         # list

list

25000

array(['C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/pos\\11485_10.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/neg\\6802_1.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/pos\\7641_10.txt',
       ...,
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/neg\\7611_4.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/neg\\8470_2.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/train/neg\\1245_2.txt'],
      dtype='<U65')

numpy.ndarray

25000

['neg', 'pos']

In [11]:
imdb_train.target[:20]

array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

In [12]:
imdb_train.data[0]

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

In [13]:
imdb_train.data[0].decode()

"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

- The review above is stored as bytes, and it contains `'<br />'` tags in the middle of the text.

In [14]:
s = imdb_train.data[6]
s.decode()

"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. <br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."

In [15]:
s = imdb_train.data[6]
s.decode().replace('<br />', '') # using s.replace(b'<br />', b'') instead would return bytes

"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."
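
Replacing `<br />` with an empty string glues the neighbouring sentences together, as in `happening.Anyway` in the output above. A variant that substitutes a space instead might look like the following; `clean_review` is a hypothetical helper, not part of the notebook:

```python
def clean_review(raw: bytes) -> str:
    """Decode a raw review and replace <br /> tags with a single space."""
    text = raw.decode('utf-8').replace('<br />', ' ')
    # Collapse the runs of whitespace left behind by consecutive tags
    return ' '.join(text.split())

print(clean_review(b"no idea whats happening.<br /><br />Anyway the story"))
```

The rest of the notebook keeps the empty-string replacement, so its outputs reflect that choice.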

In [8]:
text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
len(text_train)

25000

In [17]:
text_train[0]

"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. Flawed but honest with a terrible honesty."

In [79]:
y_train = imdb_train.target
display(y_train.shape, y_train)

(25000,)

array([1, 0, 1, ..., 0, 0, 0])

In [19]:
np.bincount(y_train)

array([12500, 12500], dtype=int64)

In [20]:
sum(y_train)

12500

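- The load_files() function reads folder names in alphabetical order, so the 'neg' folder is assigned target value 0 and the 'pos' folder target value 1.
- This is also the order of the values in imdb_train.target_names.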
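
The labeling rule can be sketched without scikit-learn: sort the class folder names and use each name's position as its label. This only mimics the behaviour described above; `label_map` is a hypothetical helper.

```python
def label_map(folder_names):
    """Map class folder names to integer labels in alphabetical order."""
    return {name: idx for idx, name in enumerate(sorted(folder_names))}

print(label_map(['pos', 'neg']))
```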

- Now inspect the test files in the same way

In [21]:
display(type(imdb_test.data), len(imdb_test.data))     # list
display(imdb_test.filenames)
display(type(imdb_test.target), len(imdb_test.target)) # array
display(imdb_test.target_names)                         # list

list

25000

array(['C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/pos\\11485_9.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/neg\\6802_1.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/pos\\7641_8.txt',
       ...,
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/neg\\7611_2.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/neg\\8470_1.txt',
       'C:/khh/서울코딩학원/빅데이터분석03/3.머신러닝/data/aclImdb/test/neg\\1245_2.txt'],
      dtype='<U64')

numpy.ndarray

25000

['neg', 'pos']

In [22]:
text_test = [s.decode().replace('<br />', '') for s in imdb_test.data]
display(len(text_test))

y_test = imdb_test.target
display(y_test.shape, y_test)

25000

(25000,)

array([1, 0, 1, ..., 0, 0, 0])

In [23]:
np.bincount(y_test)

array([12500, 12500], dtype=int64)

> **Summary**
> - text_train => list, 25000
> - y_train => array, 25000
> - text_test => list, 25000
> - y_test => array, 25000

In [11]:
s = text_train[0]
s

"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. Flawed but honest with a terrible honesty."

In [13]:
s = s.lower()

In [21]:
s2 = ''.join([ (' ' if c in ".,/-?!&*()'" else c) for c in s])

'zero day leads you to think  even re think why two boys young men would do what they did   commit mutual suicide via slaughtering their classmates  it captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own mutual world via coupled destruction it is not a perfect movie but given what money time the filmmaker and actors had   it is a remarkable product  in terms of explaining the motives and actions of the two young suicide murderers it is better than  elephant    in terms of being a film that gets under our  rationalistic  skin it is a far  far better film than almost anything you are likely to see  flawed but honest with a terrible honesty abcabc'

In [24]:
s3 = [i for i in s2.split(' ') if i!='']
s3

['zero',
 'day',
 'leads',
 'you',
 'to',
 'think',
 'even',
 're',
 'think',
 'why',
 'two',
 'boys',
 'young',
 'men',
 'would',
 'do',
 'what',
 'they',
 'did',
 'commit',
 'mutual',
 'suicide',
 'via',
 'slaughtering',
 'their',
 'classmates',
 'it',
 'captures',
 'what',
 'must',
 'be',
 'beyond',
 'a',
 'bizarre',
 'mode',
 'of',
 'being',
 'for',
 'two',
 'humans',
 'who',
 'have',
 'decided',
 'to',
 'withdraw',
 'from',
 'common',
 'civility',
 'in',
 'order',
 'to',
 'define',
 'their',
 'own',
 'mutual',
 'world',
 'via',
 'coupled',
 'destruction',
 'it',
 'is',
 'not',
 'a',
 'perfect',
 'movie',
 'but',
 'given',
 'what',
 'money',
 'time',
 'the',
 'filmmaker',
 'and',
 'actors',
 'had',
 'it',
 'is',
 'a',
 'remarkable',
 'product',
 'in',
 'terms',
 'of',
 'explaining',
 'the',
 'motives',
 'and',
 'actions',
 'of',
 'the',
 'two',
 'young',
 'suicide',
 'murderers',
 'it',
 'is',
 'better',
 'than',
 'elephant',
 'in',
 'terms',
 'of',
 'being',
 'a',
 'film',
 'that'

In [26]:
s4 = [i for i in s3 if len(i)!=1]
s4

['zero',
 'day',
 'leads',
 'you',
 'to',
 'think',
 'even',
 're',
 'think',
 'why',
 'two',
 'boys',
 'young',
 'men',
 'would',
 'do',
 'what',
 'they',
 'did',
 'commit',
 'mutual',
 'suicide',
 'via',
 'slaughtering',
 'their',
 'classmates',
 'it',
 'captures',
 'what',
 'must',
 'be',
 'beyond',
 'bizarre',
 'mode',
 'of',
 'being',
 'for',
 'two',
 'humans',
 'who',
 'have',
 'decided',
 'to',
 'withdraw',
 'from',
 'common',
 'civility',
 'in',
 'order',
 'to',
 'define',
 'their',
 'own',
 'mutual',
 'world',
 'via',
 'coupled',
 'destruction',
 'it',
 'is',
 'not',
 'perfect',
 'movie',
 'but',
 'given',
 'what',
 'money',
 'time',
 'the',
 'filmmaker',
 'and',
 'actors',
 'had',
 'it',
 'is',
 'remarkable',
 'product',
 'in',
 'terms',
 'of',
 'explaining',
 'the',
 'motives',
 'and',
 'actions',
 'of',
 'the',
 'two',
 'young',
 'suicide',
 'murderers',
 'it',
 'is',
 'better',
 'than',
 'elephant',
 'in',
 'terms',
 'of',
 'being',
 'film',
 'that',
 'gets',
 'under',
 'o

In [34]:
s5 = sorted(list(set(s4)))
s5

['abcabc',
 'actions',
 'actors',
 'almost',
 'and',
 'anything',
 'are',
 'be',
 'being',
 'better',
 'beyond',
 'bizarre',
 'boys',
 'but',
 'captures',
 'civility',
 'classmates',
 'commit',
 'common',
 'coupled',
 'day',
 'decided',
 'define',
 'destruction',
 'did',
 'do',
 'elephant',
 'even',
 'explaining',
 'far',
 'film',
 'filmmaker',
 'flawed',
 'for',
 'from',
 'gets',
 'given',
 'had',
 'have',
 'honest',
 'honesty',
 'humans',
 'in',
 'is',
 'it',
 'leads',
 'likely',
 'men',
 'mode',
 'money',
 'motives',
 'movie',
 'murderers',
 'must',
 'mutual',
 'not',
 'of',
 'order',
 'our',
 'own',
 'perfect',
 'product',
 'rationalistic',
 're',
 'remarkable',
 'see',
 'skin',
 'slaughtering',
 'suicide',
 'terms',
 'terrible',
 'than',
 'that',
 'the',
 'their',
 'they',
 'think',
 'time',
 'to',
 'two',
 'under',
 'via',
 'what',
 'who',
 'why',
 'with',
 'withdraw',
 'world',
 'would',
 'you',
 'young',
 'zero']

In [37]:
itow = {i:v for i,v in enumerate(s5)}
wtoi = {v:i for i,v in enumerate(s5)}
wtoi

{'abcabc': 0,
 'actions': 1,
 'actors': 2,
 'almost': 3,
 'and': 4,
 'anything': 5,
 'are': 6,
 'be': 7,
 'being': 8,
 'better': 9,
 'beyond': 10,
 'bizarre': 11,
 'boys': 12,
 'but': 13,
 'captures': 14,
 'civility': 15,
 'classmates': 16,
 'commit': 17,
 'common': 18,
 'coupled': 19,
 'day': 20,
 'decided': 21,
 'define': 22,
 'destruction': 23,
 'did': 24,
 'do': 25,
 'elephant': 26,
 'even': 27,
 'explaining': 28,
 'far': 29,
 'film': 30,
 'filmmaker': 31,
 'flawed': 32,
 'for': 33,
 'from': 34,
 'gets': 35,
 'given': 36,
 'had': 37,
 'have': 38,
 'honest': 39,
 'honesty': 40,
 'humans': 41,
 'in': 42,
 'is': 43,
 'it': 44,
 'leads': 45,
 'likely': 46,
 'men': 47,
 'mode': 48,
 'money': 49,
 'motives': 50,
 'movie': 51,
 'murderers': 52,
 'must': 53,
 'mutual': 54,
 'not': 55,
 'of': 56,
 'order': 57,
 'our': 58,
 'own': 59,
 'perfect': 60,
 'product': 61,
 'rationalistic': 62,
 're': 63,
 'remarkable': 64,
 'see': 65,
 'skin': 66,
 'slaughtering': 67,
 'suicide': 68,
 'terms': 69,
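
The manual steps above (lowercase, replace punctuation with spaces, split, drop one-character tokens, then index the sorted unique words) can be collected into two small functions. This is a sketch that reuses the same punctuation list; the function names and the sample sentence are assumptions.

```python
PUNCT = ".,/-?!&*()'"

def tokenize(text, punct=PUNCT):
    """Lowercase, blank out punctuation, split on whitespace, drop 1-char tokens."""
    table = str.maketrans({c: ' ' for c in punct})
    words = text.lower().translate(table).split()
    return [w for w in words if len(w) > 1]

def build_vocab(tokens):
    """Return word->index and index->word maps over the sorted unique tokens."""
    words = sorted(set(tokens))
    wtoi = {w: i for i, w in enumerate(words)}
    itow = {i: w for w, i in wtoi.items()}
    return wtoi, itow

tokens = tokenize("Re-think, a film!")
wtoi, itow = build_vocab(tokens)
print(tokens)
print(wtoi)
```

Using `split()` without an argument removes the need to filter out empty strings afterwards.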

### BOW (Bag of Words): Vocabulary
Ways to build a vocabulary from the whole text corpus:
- CountVectorizer
- TfidfVectorizer
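
`CountVectorizer` stores raw counts, while `TfidfVectorizer` rescales them so that words appearing in many documents weigh less. The sketch below computes that weighting in plain Python, assuming scikit-learn's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The two sentences are illustrative data.

```python
import math
import re
from collections import Counter

docs = ['I am Tom. Tom is me!', 'He is Tom. He is a man.']
# Lowercase and keep tokens of two or more word characters
token_lists = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
vocab = sorted(set(t for toks in token_lists for t in toks))

n = len(docs)
# Document frequency: in how many documents each word appears
df = Counter(t for toks in token_lists for t in set(toks))
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    row = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

rows = [tfidf_row(toks) for toks in token_lists]
```

Words that occur in every document (here `is` and `tom`) get the minimum idf of 1, while rarer words get a larger weight.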

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man.']
vect = CountVectorizer()
vect.fit(ss)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [25]:
vect.vocabulary_ # single-character words and punctuation are excluded (text is lowercased)

{'am': 0, 'tom': 5, 'is': 2, 'me': 4, 'he': 1, 'man': 3}

In [26]:
voca = vect.vocabulary_
sorted([(v,k) for k,v in voca.items()])

[(0, 'am'), (1, 'he'), (2, 'is'), (3, 'man'), (4, 'me'), (5, 'tom')]

In [27]:
vect.transform(ss)

<2x6 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [28]:
vect.transform(ss).toarray()

array([[1, 0, 1, 0, 1, 2],
       [0, 2, 2, 1, 0, 1]], dtype=int64)

In [29]:
vect.get_feature_names()

['am', 'he', 'is', 'man', 'me', 'tom']

In [31]:
print(*list(enumerate(vect.get_feature_names())), sep=', ')

(0, 'am'), (1, 'he'), (2, 'is'), (3, 'man'), (4, 'me'), (5, 'tom')
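
To see where the numbers in the count matrix come from, the toy example can be reproduced in plain Python. This is a sketch that imitates the default settings shown in the `CountVectorizer` repr above (lowercasing and the `token_pattern` regular expression); the helper names are assumptions. In recent scikit-learn releases the feature names are obtained with `get_feature_names_out()` instead of `get_feature_names()`.

```python
import re
from collections import Counter

TOKEN_PATTERN = r'(?u)\b\w\w+\b'   # tokens of two or more word characters

def fit_vocabulary(docs):
    """Assign an index to every distinct token, in alphabetical order."""
    tokens = {t for d in docs for t in re.findall(TOKEN_PATTERN, d.lower())}
    return {t: i for i, t in enumerate(sorted(tokens))}

def count_matrix(docs, vocabulary):
    """Build one row of counts per document, one column per vocabulary word."""
    columns = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for d in docs:
        counts = Counter(re.findall(TOKEN_PATTERN, d.lower()))
        rows.append([counts[t] for t in columns])
    return rows

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man.']
vocabulary = fit_vocabulary(ss)
matrix = count_matrix(ss, vocabulary)
print(vocabulary)
print(matrix)
```

The result matches the `toarray()` output above: `tom` appears twice in the first sentence, and `he` and `is` twice each in the second.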


- Build the BOW for the IMDb data

In [40]:
vect = CountVectorizer()
vect.fit(text_train) # builds the vocabulary
X_train = vect.transform(text_train) # sparse matrix

In [41]:
type(vect.vocabulary_), len(vect.vocabulary_)

(dict, 75911)

In [54]:
vect.vocabulary_['arirang']

4336

In [57]:
vect.get_feature_names()[::1000]

['00',
 '8700',
 'adultery',
 'alvarez',
 'appearence',
 'atrendants',
 'bang',
 'benches',
 'blissfully',
 'brainwhy',
 'burress',
 'carpathia',
 'chaulk',
 'clea',
 'compensations',
 'coorain',
 'crossfire',
 'daysthis',
 'derboiler',
 'discharge',
 'dop',
 'débutant',
 'empty',
 'eurocult',
 'falling',
 'finances',
 'formats',
 'gainsay',
 'gisbourne',
 'greenaway',
 'hallucinogenics',
 'helming',
 'honore',
 'ids',
 'inferenced',
 'ireland',
 'johnston',
 'khang',
 'laconic',
 'leroi',
 'looping',
 'majkowski',
 'matlin',
 'messing',
 'modelling',
 'multiculturalism',
 'nerdish',
 'nva',
 'orion',
 'panzerkreuzer',
 'periphery',
 'plasterboard',
 'preached',
 'prowl',
 'raha',
 'redman',
 'resolving',
 'roeves',
 'salli',
 'scouser',
 'shaffer',
 'significance',
 'smurfs',
 'spenser',
 'stereotypic',
 'suck3d',
 'synchronized',
 'tepper',
 'tirard',
 'treebeard',
 'una',
 'unprovoked',
 'venantini',
 'waling',
 'whycome',
 'xdbut']

In [34]:
X_train

<25000x75911 sparse matrix of type '<class 'numpy.int64'>'
	with 3431163 stored elements in Compressed Sparse Row format>

In [35]:
X_train.shape

(25000, 75911)
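
The sparse format pays off because almost every entry of this matrix is zero. A small arithmetic sketch using only the numbers reported above (shape 25000 x 75911 with 3431163 stored elements):

```python
n_docs, n_features = 25000, 75911
stored = 3431163                       # nonzero entries reported for X_train

density = stored / (n_docs * n_features)
avg_distinct_words = stored / n_docs   # nonzero columns per review, on average
dense_bytes = n_docs * n_features * 8  # int64 uses 8 bytes per entry

print(f'density: {density:.4%}')
print(f'average distinct words per review: {avg_distinct_words:.1f}')
print(f'dense int64 array: {dense_bytes / 1e9:.1f} GB')
```

Roughly 0.18% of the entries are nonzero, and a dense int64 array of the same shape would need about 15 GB, which is why `toarray()` is only called on single rows in this notebook.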

In [75]:
vect.vocabulary_

{'zero': 75669,
 'day': 16986,
 'leads': 38653,
 'you': 75381,
 'to': 68091,
 'think': 67468,
 'even': 23059,
 're': 54503,
 'why': 73998,
 'two': 69757,
 'boys': 8922,
 'young': 75392,
 'men': 42764,
 'would': 74762,
 'do': 19634,
 'what': 73731,
 'they': 67409,
 'did': 18588,
 'commit': 13888,
 'mutual': 45268,
 'suicide': 65104,
 'via': 72211,
 'slaughtering': 61588,
 'their': 67280,
 'classmates': 12958,
 'it': 35211,
 'captures': 10809,
 'must': 45209,
 'be': 6512,
 'beyond': 7341,
 'bizarre': 7716,
 'mode': 43993,
 'of': 47352,
 'being': 6852,
 'for': 25839,
 'humans': 32540,
 'who': 73935,
 'have': 30570,
 'decided': 17219,
 'withdraw': 74379,
 'from': 26582,
 'common': 13907,
 'civility': 12845,
 'in': 33505,
 'order': 47900,
 'define': 17460,
 'own': 48610,
 'world': 74699,
 'coupled': 15414,
 'destruction': 18214,
 'is': 35099,
 'not': 46714,
 'perfect': 49947,
 'movie': 44779,
 'but': 10096,
 'given': 28034,
 'money': 44193,
 'time': 67883,
 'the': 67244,
 'filmmaker': 24942

In [36]:
i=0
for v in vect.vocabulary_:
    print(v, '=>', vect.vocabulary_[v])
    i+=1
    if i==10: break

zero => 75669
day => 16986
leads => 38653
you => 75381
to => 68091
think => 67468
even => 23059
re => 54503
why => 73998
two => 69757


In [37]:
feature_names = vect.get_feature_names()
display(type(feature_names), len(feature_names))
display(feature_names[:5], feature_names[20010:20015], feature_names[::10000], feature_names[-5:])

list

75911

['00', '000', '0000000000001', '00001', '00015']

['doppelgangers', 'doppelgänger', 'dopplebangers', 'doppleganger', 'doppler']

['00',
 'burress',
 'dop',
 'hallucinogenics',
 'looping',
 'periphery',
 'shaffer',
 'una']

['ís', 'ísnt', 'østbye', 'über', 'üvegtigris']

In [38]:
feature_names[1200]

'aaliyah'

- Inspect the contents of X_train[0]

In [58]:
X_train[0]

<1x75911 sparse matrix of type '<class 'numpy.int64'>'
	with 91 stored elements in Compressed Sparse Row format>

In [59]:
X_train[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [66]:
text_train[0]

"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. Flawed but honest with a terrible honesty."

- Print the words whose count is nonzero in the first document.

In [64]:
fn = vect.get_feature_names()

for i in range(X_train[0].shape[1]): # len(X_train[0]) : error! => 75911
    if X_train[0,i]>0:
        print(i,fn[i],'=>',X_train[0,i])

1723 actions => 1
1741 actors => 1
2880 almost => 1
3375 and => 2
3859 anything => 1
4269 are => 1
6512 be => 1
6852 being => 2
7288 better => 2
7341 beyond => 1
7716 bizarre => 1
8922 boys => 1
10096 but => 2
10809 captures => 1
12845 civility => 1
12958 classmates => 1
13888 commit => 1
13907 common => 1
15414 coupled => 1
16986 day => 1
17219 decided => 1
17460 define => 1
18214 destruction => 1
18588 did => 1
19634 do => 1
21607 elephant => 1
23059 even => 1
23541 explaining => 1
24147 far => 2
24904 film => 2
24942 filmmaker => 1
25360 flawed => 1
25839 for => 1
26582 from => 1
27726 gets => 1
28034 given => 1
29807 had => 1
30570 have => 1
31970 honest => 1
31972 honesty => 1
32540 humans => 1
33505 in => 3
35099 is => 4
35211 it => 5
38653 leads => 1
39336 likely => 1
42764 men => 1
43993 mode => 1
44193 money => 1
44676 motives => 1
44779 movie => 1
45110 murderers => 1
45209 must => 1
45268 mutual => 2
46714 not => 1
47352 of => 4
47900 order => 1
48156 our => 1
48610 own => 1

- Print the number of words contained in each document of X_train (25,000 documents in total).

In [69]:
(X_train>0).sum(axis=1)

matrix([[ 91],
        [120],
        [ 62],
        ...,
        [117],
        [121],
        [211]])

In [75]:
X_train.sum(axis=1)

matrix([[127],
        [192],
        [ 79],
        ...,
        [184],
        [178],
        [361]], dtype=int64)

In [39]:
for i in range(10):
    # a = X_train[i].toarray()
    a = X_train[i]
    print('%4d, %4d, %s' % ((a > 0).sum(), a.sum(), a.shape))

  91,  127, (1, 75911)
 120,  192, (1, 75911)
  62,   79, (1, 75911)
  95,  139, (1, 75911)
 166,  295, (1, 75911)
  47,   50, (1, 75911)
  78,  104, (1, 75911)
 151,  262, (1, 75911)
 220,  435, (1, 75911)
  43,   51, (1, 75911)
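
The two numbers printed per row differ in what they count: `(a > 0).sum()` is the number of distinct vocabulary words in the document, while `a.sum()` is the total number of word occurrences. The same distinction on a plain token list, using an illustrative sentence:

```python
import re
from collections import Counter

tokens = re.findall(r'\b\w\w+\b', 'I am Tom. Tom is me!'.lower())
counts = Counter(tokens)

n_distinct = len(counts)          # analogous to (a > 0).sum()
n_total = sum(counts.values())    # analogous to a.sum()
print(n_distinct, n_total)
```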


In [40]:
doc_0 = X_train[0]
doc_0[doc_0>0]

#X_train[0][X_train[0]>0]

matrix([[1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3,
         4, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 2, 2, 1, 2, 1, 3, 2, 1, 2, 1, 4, 3, 1, 2, 3, 1, 1,
         1, 1, 1, 1, 2, 2, 1]], dtype=int64)

In [41]:
doc_0.shape

(1, 75911)

In [85]:
for i in range(75911):
    n = doc_0[0,i]
    if n>0: print(i,feature_names[i],n)

1723 actions 1
1741 actors 1
2880 almost 1
3375 and 2
3859 anything 1
4269 are 1
6512 be 1
6852 being 2
7288 better 2
7341 beyond 1
7716 bizarre 1
8922 boys 1
10096 but 2
10809 captures 1
12845 civility 1
12958 classmates 1
13888 commit 1
13907 common 1
15414 coupled 1
16986 day 1
17219 decided 1
17460 define 1
18214 destruction 1
18588 did 1
19634 do 1
21607 elephant 1
23059 even 1
23541 explaining 1
24147 far 2
24904 film 2
24942 filmmaker 1
25360 flawed 1
25839 for 1
26582 from 1
27726 gets 1
28034 given 1
29807 had 1
30570 have 1
31970 honest 1
31972 honesty 1
32540 humans 1
33505 in 3
35099 is 4
35211 it 5
38653 leads 1
39336 likely 1
42764 men 1
43993 mode 1
44193 money 1
44676 motives 1
44779 movie 1
45110 murderers 1
45209 must 1
45268 mutual 2
46714 not 1
47352 of 4
47900 order 1
48156 our 1
48610 own 1
49947 perfect 1
52605 product 1
54367 rationalistic 1
54503 re 1
55513 remarkable 1
59385 see 1
61440 skin 1
61588 slaughtering 1
65104 suicide 2
67035 terms 2
67049 terrible 1
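
Looping over all 75,911 columns works, but it touches mostly zeros. A CSR row already stores the column indices and values of its nonzero entries in its `indices` and `data` attributes, so the same listing can be produced from those directly. The sketch below uses a tiny hand-made matrix and made-up feature names rather than `X_train`, and assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for X_train and feature_names
X = csr_matrix(np.array([[0, 2, 0, 1],
                         [1, 0, 0, 0]]))
names = ['am', 'he', 'is', 'tom']

row = X[0]   # a 1 x 4 CSR matrix holding only the nonzero entries of row 0
nonzero = [(int(j), names[j], int(v)) for j, v in zip(row.indices, row.data)]
for j, word, v in nonzero:
    print(j, word, v)
```

Applied to the notebook's data, the same pattern would iterate over the 91 stored entries of `X_train[0]` instead of every column.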

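### Applying a Classification Model
- Apply LogisticRegression with cross-validation (using the training data only)
- LogisticRegression should be tried with several values of its setting C (the inverse of the regularization strength).

> **Summary**
> - X_train => sparse matrix, 25000
> - y_train => array, 25000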

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [80]:
# cv=3 was the default before scikit-learn 0.22; newer versions default to cv=5
scores = cross_val_score(LogisticRegression(C=1), X_train, y_train, cv=3)
scores

array([0.87724982, 0.8737701 , 0.87662026])
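
The cell above fixes C at 1. Since the notes call for trying several values of C, one way to do that is a simple loop over candidate values with `cross_val_score`. The sketch below runs on a tiny made-up corpus so that it finishes instantly; the documents, labels, candidate values, and `max_iter` setting are all assumptions, and the scores it produces say nothing about the IMDb data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up miniature corpus: 1 = positive, 0 = negative
docs = ["great film", "wonderful acting", "loved this movie", "great story",
        "terrible film", "awful acting", "hated this movie", "boring story"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

X = CountVectorizer().fit_transform(docs)

results = {}
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, labels, cv=3)
    results[C] = scores.mean()      # mean accuracy across the 3 folds

best_C = max(results, key=results.get)
print(results)
print('best C:', best_C)
```

With the real data, `X` and `labels` would be replaced by `X_train` and `y_train`. `GridSearchCV` from `sklearn.model_selection` performs the same kind of search and refits the best model.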

In [81]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(X_train, y_train) 

In [87]:
model = LogisticRegression(C=1)
model.fit(train_data, train_target)
model.score(test_data, test_target)

0.884

- Apply the model to the test set and check the score
- Use C = 0.1
> **Caution**: the vocabulary was built from text_train, so text_test may contain words that are not in the vocabulary
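
A sketch of what the caution means in practice: when a document is transformed with a vocabulary fitted on other text, words missing from that vocabulary have no column and are simply not counted, which is also how `CountVectorizer.transform` treats unseen words. The vocabulary below is the one from the toy example earlier; the test sentence and the helper name are made up.

```python
import re

vocabulary = {'am': 0, 'he': 1, 'is': 2, 'man': 3, 'me': 4, 'tom': 5}

def transform(doc, vocabulary):
    """Count only the tokens that already have a column in the vocabulary."""
    row = [0] * len(vocabulary)
    for token in re.findall(r'\b\w\w+\b', doc.lower()):
        idx = vocabulary.get(token)
        if idx is not None:        # unseen words such as 'robot' are skipped
            row[idx] += 1
    return row

print(transform('Tom is a robot', vocabulary))
```

For the same reason the test reviews should be passed through `vect.transform` with the vectorizer fitted on `text_train`, not through a newly fitted vectorizer, so that the columns of the test matrix line up with those of `X_train`.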