<a href="https://colab.research.google.com/github/k-washi/python-cookbook/blob/master/ml/2/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Using the IMDb movie review dataset (50,000 reviews), we classify each review as positive or negative.

In [36]:
!pip install nltk



In [7]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
!tar -zxf "/content/gdrive/My Drive/data/aclImdb_v1.tar.gz"

In [11]:
!pip install pyprind

Collecting pyprind
  Downloading https://files.pythonhosted.org/packages/1e/30/e76fb0c45da8aef49ea8d2a90d4e7a6877b45894c25f12fb961f009a891e/PyPrind-2.11.2-py3-none-any.whl
Installing collected packages: pyprind
Successfully installed pyprind-2.11.2


In [15]:
import os

import pandas as pd
import pyprind

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0

for s in ('test', 'train'):
  for l in ('pos', 'neg'):
    path = os.path.join(basepath, s, l)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
        txt = infile.read()
      rows.append([txt, labels[l]])
      pbar.update()
df = pd.DataFrame(rows)  # columns are named 0 and 1 until renamed below
df.head()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:34


|   | 0 | 1 |
|---|---|---|
| 0 | I went and saw this movie last night after bei... | 1 |
| 1 | Actor turned director Bill Paxton follows up h... | 1 |
| 2 | As a recreational golfer with some knowledge o... | 1 |
| 3 | I saw this film in a sneak preview, and it is ... | 1 |
| 4 | Bill Paxton has taken the true story of the 19... | 1 |


In [0]:
df.columns = ['review', 'sentiment']

In [0]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('aclImdb/movie_data.csv', index=False, encoding='utf-8')

In [18]:
# Assumes movie_data.csv was copied to this folder on the mounted Drive
df = pd.read_csv('/content/gdrive/My Drive/data/movie_data.csv', encoding='utf-8')
df.head(3)

|   | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [19]:
df.shape

(50000, 2)

Convert the words into feature vectors (bag-of-words model).

In [0]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [23]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [24]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
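
What `CountVectorizer` computes here can be reproduced by hand. The following is a minimal sketch, not scikit-learn's implementation: it assumes the default behaviour of lowercasing the text and keeping tokens of two or more word characters, and the helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Lowercase and keep tokens of 2+ word characters (assumed default pattern).
    tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in documents]
    # Map each distinct token to a column index, in alphabetical order.
    distinct = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: i for i, tok in enumerate(distinct)}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary entry; absent tokens count as 0.
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows

vocab, bag = bag_of_words(docs)
print(vocab)
for row in bag:
    print(row)
```

Each row is one document and each column is one vocabulary entry, so the third row counts `and` twice and `is` three times, matching the array printed above.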


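Weight the raw term counts with TF-IDF (term frequency-inverse document frequency), which downweights words that occur in many documents.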

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer 

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)  # smooth_idf: idf = ln((1 + n_d) / (1 + df)) + 1
print(tfidf)
np.set_printoptions(precision=2)  # print floats with two decimal places
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
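
The matrix above can be checked by hand. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, multiplies it by the raw term count, and then L2-normalizes each row. The sketch below re-derives the values from the count matrix printed earlier, using only the standard library; the names `counts`, `df_t`, `idf`, and `tfidf_rows` are introduced here for illustration.

```python
import math

# Count matrix from the bag-of-words example (rows: documents, columns: terms).
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term occurs.
df_t = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df_t]

tfidf_rows = []
for row in counts:
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    tfidf_rows.append([v / norm for v in raw])

for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

A term such as `is`, which occurs in all three documents, gets the minimum idf of 1, which is why it ends up with a smaller weight than `and` in the third row even though its raw count is higher.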


In [28]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [0]:
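import re
def preprocessor(text):
  # Strip HTML markup such as <br />
  text = re.sub(r'<[^>]*>', '', text)
  # Collect emoticons such as :) ;-( =D before punctuation is removed
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
  # Lowercase, replace non-word characters with spaces, then append the
  # emoticons with the "nose" character (-) removed
  text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
  return text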

In [31]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [32]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [0]:
df['review'] = df['review'].apply(preprocessor)

In [34]:
df.head(3)

|   | review | sentiment |
|---|---|---|
| 0 | in 1974 the teenager martha moxley maggie grac... | 1 |
| 1 | ok so i really like kris kristofferson and his... | 0 |
| 2 | spoiler do not read this if you think about w... | 0 |


In [35]:
def tokenizer(text):
  return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Word stemming with the Natural Language Toolkit for Python (NLTK, http://www.nltk.org).

In [37]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Removing stop words

In [40]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [42]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

Apply the processing above to the data.

In [0]:
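# Positional slices: .loc[:25000] is label-based and inclusive, so row 25000 would land in both sets
X_train = df['review'].iloc[:25000].values
y_train = df['sentiment'].iloc[:25000].values
X_test = df['review'].iloc[25000:].values
y_test = df['sentiment'].iloc[25000:].values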

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [50]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
# First grid: TF-IDF features; second grid: raw term frequencies (no IDF, no normalization)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy',
                           cv=5, verbose=2, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


KeyboardInterrupt: ignored
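
The candidate count reported in the log follows directly from the sizes of the parameter lists, and working it out beforehand helps estimate the cost of a search. The sketch below enumerates the combinations with `itertools.product`; the dictionary `grid_sizes` and the helper `n_candidates` are hypothetical names that only record how many values each parameter takes in the two grids above.

```python
from itertools import product

# Number of values per parameter in each of the two grids above.
grid_sizes = [
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2, 'penalty': 2, 'C': 3},
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2, 'use_idf': 1,
     'norm': 1, 'penalty': 2, 'C': 3},
]

def n_candidates(sizes):
    """Count the parameter combinations in one grid by enumerating them."""
    return len(list(product(*(range(n) for n in sizes.values()))))

candidates = sum(n_candidates(g) for g in grid_sizes)
cv_folds = 5
total_fits = candidates * cv_folds
print(candidates, total_fits)
```

Every additional value in any list multiplies the number of fits for that grid, and each fit here vectorizes the training reviews again.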

With a dataset of this size, the processing above often does not fit in memory, because CountVectorizer and TfidfVectorizer need to hold all feature vectors in memory at once.
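
One common way around this limit is out-of-core learning: read the reviews from disk in mini-batches and map tokens to a fixed number of columns with the hashing trick, so that no vocabulary has to be stored. scikit-learn provides `HashingVectorizer` for this purpose. The sketch below is only a pure-Python illustration of the idea, not that class's implementation; the names `stream_docs`, `get_minibatch`, and `hash_features`, the use of `zlib.crc32` as the hash function, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import zlib

def stream_docs(handle):
    """Yield (review, sentiment) pairs one at a time from an open CSV file."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for review, sentiment in reader:
        yield review, int(sentiment)

def get_minibatch(doc_stream, size):
    """Collect up to `size` documents from the stream."""
    docs, labels = [], []
    for _ in range(size):
        try:
            text, label = next(doc_stream)
        except StopIteration:
            break  # stream exhausted; return a short (possibly empty) batch
        docs.append(text)
        labels.append(label)
    return docs, labels

def hash_features(text, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for token in text.split():
        vec[zlib.crc32(token.encode('utf-8')) % n_features] += 1
    return vec

# Tiny in-memory stand-in for movie_data.csv (hypothetical rows).
sample = io.StringIO('review,sentiment\n"good movie",1\n"bad movie",0\n"good fun",1\n')
stream = stream_docs(sample)
docs, labels = get_minibatch(stream, size=2)
X = [hash_features(d) for d in docs]
print(labels, [sum(row) for row in X])
```

Only one mini-batch is in memory at a time. In a real pipeline each batch would be passed to a classifier that supports incremental fitting, for example through the `partial_fit` method of scikit-learn's `SGDClassifier`.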
