## 11/11 Sentiment Analysis on Movie Reviews

### Overview

"There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. [2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This competition was inspired by the work of Socher et al. [2]. We encourage participants to explore the (dare we say, fantastic) website that accompanies the paper:

http://nlp.stanford.edu/sentiment/

There you will find source code, a live demo, and even an online interface to help train the model.

[1] B. Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124.

[2] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).



### Data description

The dataset consists of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId, and each sentence has a SentenceId. Phrases that are repeated (such as short, common words) are included only once in the data.

- train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
- test.tsv contains just phrases. You must assign a sentiment label to each phrase.

The sentiment labels are:

0 - negative 
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
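
As a small illustration of the label scheme above, the mapping can be kept in a dictionary and applied to integer predictions. The names `SENTIMENT_LABELS` and `describe` are hypothetical helpers introduced here; they are not part of the competition files.

```python
# Hypothetical helper: map the integer labels listed above to their names.
SENTIMENT_LABELS = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

def describe(labels):
    """Return the label name for each integer sentiment value."""
    return [SENTIMENT_LABELS[label] for label in labels]

print(describe([1, 2, 4]))
```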

In [1]:
import pandas as pd

## Load Dataset

In [2]:
# A .tsv file is tab-separated, so the separator (sep) is set to a tab character.
train = pd.read_csv("train.tsv", sep="\t", index_col="PhraseId")

print(train.shape)
train.head()

(156060, 3)


PhraseId,SentenceId,Phrase,Sentiment
1,1,A series of escapades demonstrating the adage ...,1
2,1,A series of escapades demonstrating the adage ...,2
3,1,A series,2
4,1,A,2
5,1,series,2


In [3]:
test = pd.read_csv("test.tsv", sep="\t", index_col="PhraseId")

print(test.shape)

test.head()

(66292, 2)


PhraseId,SentenceId,Phrase
156061,8545,An intermittently pleasing but mostly routine ...
156062,8545,An intermittently pleasing but mostly routine ...
156063,8545,An
156064,8545,intermittently pleasing but mostly routine effort
156065,8545,intermittently pleasing but mostly routine


## Preprocessing

#### Clean Text

In [4]:
train["Phrase(Origin)"] = train["Phrase"].copy()

print(train.shape)
train[["Phrase", "Phrase(Origin)"]].head()

(156060, 4)


PhraseId,Phrase,Phrase(Origin)
1,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
2,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
3,A series,A series
4,A,A
5,series,series


In [5]:
test["Phrase(Origin)"] = test["Phrase"].copy()

print(test.shape)
test[["Phrase", "Phrase(Origin)"]].head()

(66292, 3)


PhraseId,Phrase,Phrase(Origin)
156061,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156062,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156063,An,An
156064,intermittently pleasing but mostly routine effort,intermittently pleasing but mostly routine effort
156065,intermittently pleasing but mostly routine,intermittently pleasing but mostly routine


In [6]:
def clean_text(phrase):
    phrase = phrase.replace("ca n't", "can not")
    phrase = phrase.replace("n't", "not")
    
    return phrase

phrase = "ca n't recommend it"

clean_text(phrase)

'can not recommend it'
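
The same ordered-replacement idea can be extended to other tokenized contractions. The sketch below is an assumption about how one might do that, not the notebook's method: `clean_text_extended` and its replacement list are hypothetical, and the specific rules are applied first so that the generic `n't` rule does not turn `wo n't` into `wo not`.

```python
# Hypothetical extension of clean_text: specific contractions come before the generic rule.
REPLACEMENTS = [
    ("ca n't", "can not"),
    ("wo n't", "will not"),
    ("n't", "not"),
]

def clean_text_extended(phrase):
    """Apply each (old, new) replacement in order and return the cleaned phrase."""
    for old, new in REPLACEMENTS:
        phrase = phrase.replace(old, new)
    return phrase

print(clean_text_extended("wo n't work"))
print(clean_text_extended("did n't like it"))
```

Because the rules are plain substring replacements, their order matters; reordering the list would change the result for `ca n't` and `wo n't`.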

In [7]:
train["Phrase"] = train["Phrase"].apply(clean_text)

print(train.shape)
train[["Phrase", "Phrase(Origin)"]].head()

(156060, 4)


PhraseId,Phrase,Phrase(Origin)
1,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
2,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
3,A series,A series
4,A,A
5,series,series


In [8]:
test["Phrase"] = test["Phrase"].apply(clean_text)

print(test.shape)
test[["Phrase", "Phrase(Origin)"]].head()

(66292, 3)


PhraseId,Phrase,Phrase(Origin)
156061,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156062,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156063,An,An
156064,intermittently pleasing but mostly routine effort,intermittently pleasing but mostly routine effort
156065,intermittently pleasing but mostly routine,intermittently pleasing but mostly routine


#### Stemmer

In [9]:
# import nltk

# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

# from nltk.stem.snowball import SnowballStemmer

# stemmer = SnowballStemmer('english')
# stemmer

# from tqdm import tqdm

In [10]:
# print(train.shape)

In [11]:
# def stem_phrase(phrase):
#     words = phrase.split(" ")
#     stemmed_phrase = [stemmer.stem(w) for w in words]
#     return " ".join(stemmed_phrase)

# tqdm.pandas(desc="Stemming...")
# train["Phrase"] = train["Phrase"].progress_apply(stem_phrase)

# print(train.shape)
# train[["Phrase", "Phrase(Origin)"]].head()

In [12]:
# test["Phrase"] = test["Phrase"].progress_apply(stem_phrase)

#### Lemmatize

In [13]:
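# Requires the NLTK resources 'wordnet', 'punkt' and 'averaged_perceptron_tagger'
# (see the commented-out nltk.download calls in the Stemmer cell).
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def transform_pos(pos):
    # Map a Penn Treebank tag to a WordNet POS: 'v' for verb tags, 'n' otherwise.
    if 'V' in pos:
        return 'v'
    else:
        return 'n'

def lemmatizing(phrase):
    words = word_tokenize(phrase)
    word_pos = pos_tag(words)  # list of (token, tag) pairs

    # Caution: transform_pos is given the whole tagged list, not a single tag,
    # so 'V' in pos is always False and every token is lemmatized as a noun.
    filtered = (lemmatizer.lemmatize(word, transform_pos(word_pos)) for word in words)

    return " ".join(filtered)

# Only the training phrases are lemmatized; test["Phrase"] is left as cleaned text.
train["Phrase"] = train["Phrase"].apply(lemmatizing)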
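
In the cell above, `pos_tag` returns a list of (token, tag) pairs, and that whole list is handed to the POS helper, so the `'V' in pos` check never sees a tag string. The sketch below shows the difference on a hand-written sample and a per-token alternative. `to_wordnet_pos` is a hypothetical helper, the tags are typed in by hand and are not real tagger output, and the results recorded in this notebook were not produced with this variant.

```python
# Hand-written sample in the shape pos_tag returns: (token, Penn Treebank tag) pairs.
tagged = [("images", "NNS"), ("are", "VBP"), ("stunning", "JJ")]

def to_wordnet_pos(tag):
    """Hypothetical helper: 'v' for verb tags (VB, VBD, VBP, ...), 'n' otherwise."""
    return "v" if tag.startswith("V") else "n"

# Membership test on the whole list compares 'V' with each tuple, so it is False.
whole_list_check = "V" in tagged

# Per-token mapping: each word is paired with the POS derived from its own tag.
per_token = [(word, to_wordnet_pos(tag)) for word, tag in tagged]

print(whole_list_check)
print(per_token)
# Each pair could then be passed on as lemmatizer.lemmatize(word, pos).
```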

In [14]:
train.head()

PhraseId,SentenceId,Phrase,Sentiment,Phrase(Origin)
1,1,A series of escapade demonstrating the adage t...,1,A series of escapades demonstrating the adage ...
2,1,A series of escapade demonstrating the adage t...,2,A series of escapades demonstrating the adage ...
3,1,A series,2,A series
4,1,A,2,A
5,1,series,2,series


#### Vectorize Phrases

In [15]:
# Bag of words, vectorization

from sklearn.feature_extraction.text import CountVectorizer

# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

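# max_features: caps the number of vocabulary terms.
# The most frequent terms are kept; rare terms are dropped.
# Here at most 15,000 terms are kept.
# The vectorizer splits each phrase into words (and word pairs, via ngram_range).
# binary=True: record only whether a term is present, not how often it occurs.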

stop_words = ['disappointments']

vectorizer = CountVectorizer(max_features=15000, ngram_range=(1, 2), binary=True, stop_words=stop_words)
vectorizer

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=15000, min_df=1,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=['disappointments'], strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

CountVectorizer
- fit
- transform
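
The two CountVectorizer steps listed above can be pictured with a pure-Python sketch: fitting learns a vocabulary of unigrams and bigrams, and transforming marks which vocabulary terms each phrase contains. This is a simplified approximation of `binary=True, ngram_range=(1, 2)` and not scikit-learn's implementation; it ignores `max_features` and `stop_words`, and the function names are hypothetical.

```python
import re

# Same token pattern as in the CountVectorizer repr: words of two or more characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def ngrams(phrase, n_max=2):
    """Lowercase, tokenize, and return all 1..n_max word n-grams of a phrase."""
    tokens = TOKEN.findall(phrase.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

def fit(corpus):
    """Learn a sorted vocabulary from every n-gram seen in the corpus."""
    return sorted({gram for phrase in corpus for gram in ngrams(phrase)})

def transform(corpus, vocabulary):
    """Return one binary row per phrase: 1 if the term occurs, else 0."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for phrase in corpus:
        row = [0] * len(vocabulary)
        for gram in ngrams(phrase):
            if gram in index:  # unseen terms are ignored at transform time
                row[index[gram]] = 1
        rows.append(row)
    return rows

toy_train = ["A series", "A good series , good"]  # made-up phrases
vocabulary = fit(toy_train)
print(vocabulary)
print(transform(toy_train, vocabulary))
print(transform(["good movie"], vocabulary))
```

Note that the single-character token "A" is dropped by the token pattern, and that terms absent from the fitted vocabulary (such as "movie") contribute nothing to the test vector.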

In [16]:
vectorizer.fit(train["Phrase"])

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=15000, min_df=1,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=['disappointments'], strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [17]:
## Convert each phrase into a vector over the vocabulary (term presence, since binary=True)

X_train = vectorizer.transform(train["Phrase"])

print(X_train.shape)
X_train

(156060, 15000)


<156060x15000 sparse matrix of type '<class 'numpy.int64'>'
	with 1257306 stored elements in Compressed Sparse Row format>

In [18]:
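# get_feature_names() matches the scikit-learn version used here;
# it was removed in scikit-learn 1.2 in favor of get_feature_names_out().
vocabulary = vectorizer.get_feature_names()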

print(len(vocabulary))
vocabulary[0:5]

15000


['000', '10', '10 minute', '10 or', '10 year']

In [19]:
pd.DataFrame(X_train[0:100].toarray(), columns=vocabulary).head()

## rows: the data (one row per phrase)
## columns: vocabulary terms

,000,10,10 minute,10 or,10 year,100,100 minute,100 year,101,11,...,yu,zany,zeal,zealand,zero,zhang,zinger,zippy,zombie,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
X_test = vectorizer.transform(test["Phrase"])

print(X_test.shape)
X_test

(66292, 15000)


<66292x15000 sparse matrix of type '<class 'numpy.int64'>'
	with 416627 stored elements in Compressed Sparse Row format>

In [21]:
label_name = "Sentiment"

y_train = train[label_name]

print(y_train.shape)
y_train.head()

(156060,)


PhraseId
1    1
2    2
3    2
4    2
5    2
Name: Sentiment, dtype: int64

## Score

In [22]:
from sklearn.linear_model import SGDClassifier

seed = 43

model = SGDClassifier(random_state=seed)
model

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=43, shuffle=True,
       tol=None, verbose=0, warm_start=False)

In [23]:
# from sklearn.model_selection import cross_val_score

# score = cross_val_score(model, X_train, y_train, cv=5).mean()

# print("Score = {0:5f}".format(score))

## The earlier cross_val_score call, split into two steps (predict, then score)

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GroupKFold

# Grouping by SentenceId gives a score closer to the Kaggle score than plain cv did
# => in effect, the cross-validation estimate becomes somewhat more reliable
kfold = GroupKFold(n_splits=5)
y_predict = cross_val_predict(model, X_train, y_train,
                              cv=kfold, groups=train["SentenceId"])

print(y_predict.shape)
y_predict[0:10]



(156060,)


array([1, 2, 2, 2, 2, 2, 2, 2, 2, 2])
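
The point of grouping by `SentenceId` is that all phrases from one sentence land in the same fold, so a model is never scored on phrases whose parent sentence it saw during training. The sketch below illustrates that constraint in pure Python with a simple greedy assignment. It is an illustration under that assumption and not scikit-learn's `GroupKFold` code; `grouped_folds` and the sample IDs are hypothetical.

```python
from collections import defaultdict

def grouped_folds(groups, n_splits):
    """Assign sample indices to folds so that no group is split across folds."""
    by_group = defaultdict(list)
    for i, group in enumerate(groups):
        by_group[group].append(i)

    folds = [[] for _ in range(n_splits)]
    # Place the largest groups first, each into the currently smallest fold.
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[group])
    return folds

sentence_ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]  # made-up SentenceId column
folds = grouped_folds(sentence_ids, n_splits=2)

for k, test_idx in enumerate(folds):
    test_groups = {sentence_ids[i] for i in test_idx}
    print(k, sorted(test_idx), sorted(test_groups))
```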

In [24]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_train, y_predict)

print("Score = {0:.5f}".format(score))

Score = 0.58685


In [25]:
import numpy as np

# Work on a copy so the original DataFrame is not modified
result = train.copy()
result["Sentiment(predict)"] = y_predict
# The absolute difference between the true and predicted Sentiment shows how far off each prediction is
result["Distance"] = np.abs(result["Sentiment"] - result["Sentiment(predict)"])
result = result.sort_values(by="Distance", ascending=False)

print(result.shape)
result[["Phrase", "Sentiment(predict)", "Distance"]].head(100)

(156060, 6)


PhraseId,Phrase,Sentiment(predict),Distance
142370,"The performance are so overstated , the effect...",4,4
53284,attempt to mine laugh from a genre -- the gang...,4,4
37375,is the best Star Trek movie in a long time .,0,4
73472,"is -RRB- a fascinating character , and deserve...",4,4
37374,This is the best Star Trek movie in a long time .,0,4
117415,"'s unfortunate that Wallace , who wrote Gibson...",4,4
143437,seems altogether too slight to be called any k...,4,4
66532,one of the saddest action hero performance eve...,4,4
130641,The gag that fly at such a furiously funny pac...,0,4
72813,One of the most unpleasant thing the studio ha...,4,4
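
Beyond listing the worst cases, the same distance column can be summarized as a count per distance, which shows how many predictions are off by one class versus badly wrong. The sketch below uses made-up labels, not the notebook's predictions, and the helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def distance_counts(y_true, y_pred):
    """Count predictions by absolute distance between true and predicted label."""
    return Counter(abs(t - p) for t, p in zip(y_true, y_pred))

y_true = [1, 2, 2, 4, 0, 3]  # made-up true labels
y_pred = [1, 2, 3, 0, 0, 3]  # made-up predictions

print(accuracy(y_true, y_pred))
print(sorted(distance_counts(y_true, y_pred).items()))
```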


In [26]:
result.to_csv("result.csv")

In [27]:
pd.DataFrame(vocabulary).to_csv("vocabulary.csv")

In [28]:
# Check the improvement for the word "stunning"

result.loc[132656]

SentenceId                                                         7153
Phrase                Miyazaki 's nonstop image are so stunning , an...
Sentiment                                                             4
Phrase(Origin)        Miyazaki 's nonstop images are so stunning , a...
Sentiment(predict)                                                    4
Distance                                                              0
Name: 132656, dtype: object

## Predict

In [29]:
model.fit(X_train, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=43, shuffle=True,
       tol=None, verbose=0, warm_start=False)

In [30]:
predictions = model.predict(X_test)

print(predictions.shape)
predictions[0:10]

(66292,)


array([3, 3, 2, 3, 3, 3, 3, 2, 3, 2])

## Submit

In [31]:
submission = pd.read_csv("SampleSubmission.csv", index_col="PhraseId")

submission["Sentiment"] = predictions

print(submission.shape)
submission.head()

(66292, 1)


PhraseId,Sentiment
156061,3
156062,3
156063,2
156064,3
156065,3


In [32]:
submission.to_csv("baseline-script.csv")

## SGDClassifier

Machine Learning
- Supervised Learning(SL)
- Unsupervised Learning(UL)
- Reinforcement Learning

SL: features present, labels present
UL: features present, labels absent

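Example: predicting Samsung Electronics stock prices
- With plain cross-validation, the time order of the data is not taken into account
- e.g., 2016 prices could end up being used to predict 2015 prices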
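
One way to respect time order in the stock-price example is an expanding-window split, where each test block comes strictly after the data used for training. The sketch below is a minimal illustration under that assumption; `expanding_window_splits` is a hypothetical helper, and scikit-learn's `TimeSeriesSplit` serves the same purpose.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where the test block always follows train."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last test block absorbs any leftover samples.
        test_end = train_end + fold_size if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

# Samples are assumed to be sorted from oldest to newest.
splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```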

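For sentences
- The train and test sets are unrelated to each other, and their vocabularies differ.
- Split by SentenceId so that the same sentence does not appear on both sides of a fold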

Above 0.60