# Sentiment analysis
> A well-known task in sentiment analysis is classifying documents based on the opinions or emotions the writer expresses about a topic.

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
import os
import re
import sys


pwd = os.getcwd()
path = re.search('.+/自然言語処理', pwd).group(0)

sys.path.append(path)
# print(sys.path)
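The regex lookup above raises `AttributeError` when the working directory is not inside the project folder, because `re.search` returns `None`. As a minimal sketch of an alternative, the ancestor directory can be found by walking up the path with `pathlib`. The helper name `find_ancestor` and the example path are hypothetical, and unlike the regex it matches a directory name exactly instead of by prefix.

```python
from pathlib import PurePosixPath


def find_ancestor(start, name):
    """Return `start` or its nearest ancestor whose final component is `name`."""
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    # No matching directory: the caller decides how to handle this
    return None


# Hypothetical working directory, used only for illustration
root = find_ancestor(PurePosixPath("/home/user/自然言語処理/notebooks"), "自然言語処理")
print(root)
```

In the notebook itself one would presumably pass `Path.cwd()` as `start` and append `str(root)` to `sys.path` after checking that it is not `None`.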

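## Obtaining the dataset

 - The IMDb movie review dataset
 - It consists of 50,000 movie reviews, each labeled as either "positive" or "negative"
     - "Positive" means that **the review gave the movie 7 or more stars out of 10** on IMDb
     - "Negative" means that **the review gave the movie 4 or fewer stars**; reviews with neutral ratings are not included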

http://ai.stanford.edu/~amaas/data/sentiment/
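The archive already sorts reviews into `pos` and `neg` folders, so the notebook never needs the star rating itself. For reference, here is a minimal sketch of how a label could be derived from a file name, assuming the `[id]_[rating].txt` naming convention and the thresholds (7 or higher positive, 4 or lower negative) described in the dataset's README. The function name `label_from_filename` and its parameters are hypothetical.

```python
import re


def label_from_filename(filename, pos_min=7, neg_max=4):
    """Map a name such as '123_8.txt' to 1 (positive), 0 (negative) or None."""
    match = re.fullmatch(r"(\d+)_(\d+)\.txt", filename)
    if match is None:
        return None  # not a review file
    rating = int(match.group(2))
    if rating >= pos_min:
        return 1
    if rating <= neg_max:
        return 0
    return None  # neutral rating, not part of the dataset


print(label_from_filename("123_8.txt"))
print(label_from_filename("456_2.txt"))
```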

## Converting the dataset into a more convenient format

Combine the text documents contained in the downloaded archive into a single CSV file.

### Storing the reviews in a DataFrame

In [5]:
import pyprind
import os
from tqdm import tqdm

aclImdb_dir = '../data/raw/aclImdb'
labels = {'pos': 1, 'neg': 0}
rows = []

# 50,000 reviews in total: 2 splits x 2 labels x 12,500 files
pbar = pyprind.ProgBar(50000)
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        file_dir = os.path.join(aclImdb_dir, s, l)
        for file in tqdm(os.listdir(file_dir)):
            # Read from file_dir, not from the project root stored in `path`
            with open(os.path.join(file_dir, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()

            rows.append([txt, labels[l]])
            pbar.update()

# DataFrame.append was removed in pandas 2.0, so build the frame once
review_df = pd.DataFrame(rows)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:19
100%|██████████| 12500/12500 [00:48<00:00, 257.93it/s]
100%|██████████| 12500/12500 [00:57<00:00, 217.55it/s]
100%|██████████| 12500/12500 [01:02<00:00, 199.74it/s]
100%|██████████| 12500/12500 [01:11<00:00, 174.89it/s]


In [6]:
review_df.columns = ['review', 'sentiment']

In [7]:
review_df.head()

| | review | sentiment |
|---|---|---|
| 0 | Based on an actual story, John Boorman shows t... | 1 |
| 1 | This is a gem. As a Film Four production - the... | 1 |
| 2 | I really like this show. It has drama, romance... | 1 |
| 3 | This is the best 3-D experience Disney has at ... | 1 |
| 4 | Of the Korean movies I've seen, only three had... | 1 |


### Writing to a CSV file

In [9]:
np.random.seed(0)

review_df = review_df.reindex(np.random.permutation(review_df.index))
review_df.to_csv('../data/processed/movie_data.csv', index=False, encoding='utf-8')

In [6]:
# Verify that the file can be read back
review_df = pd.read_csv('../data/processed/movie_data.csv', encoding='utf-8')
review_df.head()

| | review | sentiment |
|---|---|---|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


## Cleaning the text

In [7]:
review_df.loc[0, 'review']

'My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.<br /><br />The trailer of "Nasaan ka man" caught my attention, my daughter in law\'s and daughter\'s so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so\'s Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!!'

### Removing HTML tags

In [8]:
from src.preprocessings.cleaning import clean_html_tags


review_df.loc[:, 'review'] = review_df['review'].apply(clean_html_tags)

In [9]:
review_df.loc[0, 'review']

'My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.The trailer of "Nasaan ka man" caught my attention, my daughter in law\'s and daughter\'s so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so\'s Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!!'
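The project-local `clean_html_tags` helper is not shown in this notebook. A minimal sketch of what such a function might look like, assuming it simply deletes anything tag-shaped with a regular expression, is given here. The name `strip_html_tags` is hypothetical and this is not the actual implementation.

```python
import re


def strip_html_tags(text):
    # Delete every substring that looks like an HTML tag, e.g. <br />
    return re.sub(r"<[^>]*>", "", text)


sample = "not worth our time.<br /><br />The trailer caught my attention."
cleaned = strip_html_tags(sample)
print(cleaned)
```

Replacing tags with an empty string joins the neighbouring sentences without a space, which matches the "our time.The trailer" seen in the output above. Substituting a single space instead would keep the words on either side of a tag as separate tokens.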

## Tokenizing the documents

 1. Word splitting
 2. Word stemming
 3. Stop-word removal
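The three steps above can be sketched with the standard library alone. The stop-word set is a tiny hypothetical sample, and `toy_stem` is a crude suffix stripper used as a stand-in; it is not the Porter algorithm that the notebook uses through NLTK. In this sketch stop words are removed before stemming, so the unstemmed stop-word list still matches the tokens.

```python
import re

# Hypothetical, tiny stop-word set for illustration only
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}


def toy_stem(word):
    """Strip a few common suffixes. A crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # 1. Word splitting: lowercase, then keep runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    # 3. Stop-word removal (done first so the list matches unstemmed words)
    kept = [w for w in words if w not in STOP_WORDS]
    # 2. Stemming
    return [toy_stem(w) for w in kept]


tokens = tokenize("This movie is amazing and the actors played well")
print(tokens)  # ['movie', 'amaz', 'actor', 'play', 'well']
```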

In [10]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer_with_stemming(text):
    return [porter.stem(word) for word in text.split()]

In [11]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

In [12]:
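# .loc slices by label and includes both endpoints, so the training
# set is rows 0..2500 and the test set must start at 2501 to avoid overlap
X_train = review_df.loc[:2500, 'review'].values
y_train = review_df.loc[:2500, 'sentiment'].values
X_test = review_df.loc[2501:, 'review'].values
y_test = review_df.loc[2501:, 'sentiment'].values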

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    strip_accents=None, lowercase=False, preprocessor=None, 
    tokenizer=tokenizer_with_stemming,
    stop_words=stop_words,
)
tfidf_vectorizer.fit(X_train)

  'stop_words.' % sorted(inconsistent))


TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenizer_with_stemming at 0x1a1aa7ef28>,
        use_idf=True, vocabulary=None)

In [14]:
x = tfidf_vectorizer.transform(X_train)

In [15]:
np.sum(x.toarray()[:5], axis=1)

array([ 7.65602289,  4.62332634, 16.81611149,  6.02967574, 14.35600666])
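The row sums above are larger than 1 because each TF-IDF row is scaled to unit L2 norm, not to unit sum. The following is a minimal pure-Python sketch of the weighting, assuming the settings shown in the repr above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) and the smoothed idf formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy corpus and the names `docs` and `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (hypothetical data)
docs = [
    ["good", "movie", "good"],
    ["bad", "movie"],
    ["good", "act"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    tf = Counter(doc)
    # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Scale the row to unit L2 norm
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(doc) for doc in docs]
row_sums = [sum(row.values()) for row in rows]
print(row_sums)
```

For a row with k non-zero entries, the sum lies between 1 and sqrt(k), so a longer review with many distinct terms tends to produce a larger sum.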