## Improving toward a Kaggle score of 0.6 or higher
### CountVectorizer
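- stop_words : string {'english'}, list, or None (default). The stop-word list; if 'english', the built-in English stop words are used.
- analyzer : string {'word', 'char', 'char_wb'} or callable. Word n-grams, character n-grams, or character n-grams inside word boundaries.
- tokenizer : callable or None (default). Function that generates tokens.
- token_pattern : string. Regular expression that defines a token.
- ngram_range : tuple (min_n, max_n). Range of n-gram sizes.
- max_df : integer, or float in [0.0, 1.0]; default 1.0. Maximum document frequency for a term to be included in the vocabulary.
- min_df : integer, or float in [0.0, 1.0]; default 1. Minimum document frequency for a term to be included in the vocabulary.
- vocabulary : dict or list. The vocabulary to use.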
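As a rough illustration of how a few of these parameters interact, here is a minimal sketch. The three toy phrases are made up for this example and are not taken from the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases (hypothetical, not from the competition data).
phrases = ["a good movie", "a bad movie", "not a good movie"]

# Word unigrams and bigrams that appear in at least two phrases.
# binary=True records presence (0/1) instead of raw counts.
# The default token_pattern only keeps tokens of two or more
# characters, so the single-letter word "a" is dropped.
vectorizer = CountVectorizer(
    analyzer="word",
    ngram_range=(1, 2),
    min_df=2,
    binary=True,
)
X = vectorizer.fit_transform(phrases)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['good', 'good movie', 'movie']
print(X.toarray())
```

Terms such as "bad" or "not good" occur in only one phrase, so min_df=2 removes them from the vocabulary.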


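#### Time-series data
* Suppose there is time-series data covering 2010, 2011, 2012, and 2013. In that case, predict the future from past data.
* Use hold-out validation rather than cross-validation: past data is used to predict future data.
* Never use future data to predict the past.

* The train data and the test data have nothing in common.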
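The hold-out idea for time-ordered data can be sketched in plain Python. The records and the helper name `time_holdout_split` are hypothetical and only serve to show that every training row precedes every validation row.

```python
# Hypothetical yearly records: (year, value).
records = [(2010, 1.0), (2011, 1.5), (2012, 1.7), (2013, 2.1)]


def time_holdout_split(rows, cutoff_year):
    """Split rows so that training data strictly precedes validation data."""
    train = [r for r in rows if r[0] < cutoff_year]
    valid = [r for r in rows if r[0] >= cutoff_year]
    return train, valid


# Train on 2010-2012, validate on 2013.
train_rows, valid_rows = time_holdout_split(records, cutoff_year=2013)
print(train_rows)
print(valid_rows)
```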

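### More to try
* Split the data by sentence.
* What we were wrestling with is already implemented in scikit-learn.
* cross_val_predict can split the data into several folds.
* Specify a column and run group k-fold validation on it.
* Choose the number of folds and make sure group values do not overlap between folds.
* For time-related problems, something like GroupKFold is important.
* Try splitting by character and by word, then combining both into X_train and X_test.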
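The last item, combining word-level and character-level features, might look roughly like the sketch below. The phrases, the n-gram ranges, and the variable names are illustrative assumptions, not tuned settings from this notebook.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases (hypothetical).
train_phrases = ["a good movie", "one bad movie"]
test_phrases = ["good one"]

word_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2))
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))

# Fit on the training phrases only, then reuse the vocabularies for test.
X_train_word = word_vectorizer.fit_transform(train_phrases)
X_train_char = char_vectorizer.fit_transform(train_phrases)
X_test_word = word_vectorizer.transform(test_phrases)
X_test_char = char_vectorizer.transform(test_phrases)

# Stack the two feature blocks side by side (column-wise).
X_train_combined = hstack([X_train_word, X_train_char]).tocsr()
X_test_combined = hstack([X_test_word, X_test_char]).tocsr()
print(X_train_combined.shape, X_test_combined.shape)
```

Because both matrices are stacked in the same order, train and test end up with the same number of columns.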

In [174]:
import pandas as pd

### Load Dataset

In [175]:
train = pd.read_csv("data/train.tsv", sep="\t", index_col="PhraseId")

print(train.shape)
train.head()

(156060, 3)


PhraseId,SentenceId,Phrase,Sentiment
1,1,A series of escapades demonstrating the adage ...,1
2,1,A series of escapades demonstrating the adage ...,2
3,1,A series,2
4,1,A,2
5,1,series,2


In [176]:
test = pd.read_csv("data/test.tsv", sep="\t", index_col="PhraseId")

print(test.shape)
test.head()

(66292, 2)


PhraseId,SentenceId,Phrase
156061,8545,An intermittently pleasing but mostly routine ...
156062,8545,An intermittently pleasing but mostly routine ...
156063,8545,An
156064,8545,intermittently pleasing but mostly routine effort
156065,8545,intermittently pleasing but mostly routine


## Preprocessing

In [177]:
train["Phrase(origin)"] = train["Phrase"].copy()

print(train.shape)
train[["Phrase", "Phrase(origin)"]].head()

(156060, 4)


PhraseId,Phrase,Phrase(origin)
1,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
2,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
3,A series,A series
4,A,A
5,series,series


In [178]:
test["Phrase(origin)"] = test["Phrase"].copy()

print(test.shape)
test[["Phrase", "Phrase(origin)"]].head()

(66292, 3)


PhraseId,Phrase,Phrase(origin)
156061,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156062,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156063,An,An
156064,intermittently pleasing but mostly routine effort,intermittently pleasing but mostly routine effort
156065,intermittently pleasing but mostly routine,intermittently pleasing but mostly routine


### Clean Text

In [179]:
import re

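def clean_text(phrase):
    # Expand contractions; the dataset tokenizes them as e.g. "ca n't".
    phrase = phrase.replace("doesn't ", "does not ")
    phrase = phrase.replace("ca n't ", "can not ")
    phrase = phrase.replace(" n't ", " not ")
    # Collapse elongated vowels, e.g. "baaaaaaaaad" -> "bad".
    phrase = re.sub(r"a{2,15}", "a", phrase)
    # Runs of "o" are reduced to "oo".
    phrase = re.sub(r"o{2,15}", "oo", phrase)

    return phrase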

train["Phrase"] = train["Phrase"].apply(clean_text)

print(train.shape)
train[["Phrase", "Phrase(origin)"]].head()

(156060, 4)


PhraseId,Phrase,Phrase(origin)
1,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
2,A series of escapades demonstrating the adage ...,A series of escapades demonstrating the adage ...
3,A series,A series
4,A,A
5,series,series


In [180]:
test["Phrase"] = test["Phrase"].apply(clean_text)

print(test.shape)
test[["Phrase", "Phrase(origin)"]].head()

(66292, 3)


PhraseId,Phrase,Phrase(origin)
156061,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156062,An intermittently pleasing but mostly routine ...,An intermittently pleasing but mostly routine ...
156063,An,An
156064,intermittently pleasing but mostly routine effort,intermittently pleasing but mostly routine effort
156065,intermittently pleasing but mostly routine,intermittently pleasing but mostly routine


In [181]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmer

def stem_phrase(phrase):
    stemmed_words = [stemmer.stem(w) for w in phrase.split(" ")]
    stemmed_phrase = " ".join(stemmed_words)
    
    return stemmed_phrase


train["Phrase"] = train["Phrase"].apply(stem_phrase)

print(train.shape)
train[["Phrase", "Phrase(origin)"]].head()

test["Phrase"] = test["Phrase"].apply(stem_phrase)

(156060, 4)


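Stemming is a reliable way to keep improving the score.

* CountVectorizer counts how many times each word appears in a phrase.
* Counting occurrences is not always meaningful.
* With binary=True, each word is recorded only as 0 or 1 (absent or present).
* Frequently occurring words such as a and the may be unimportant.
* Giving priority to words that occur rarely can be better, and often gives better results in the end.

TF-IDF is this kind of method: the more unique a word is, the higher its score.

Suppose there are two phrases. Counted over the whole phrase, this is one of five words in total.
[TF-IDF - Korean Wikipedia](https://ko.wikipedia.org/wiki/TF-IDF)
Words such as this, is, and the occur so often that they carry little importance, so they are penalized and given a lower score.

IDF inverts the document frequency over all documents, that is, it swaps the numerator and the denominator.

The infrequent word example scores higher than the frequent word this.

One of Python's strengths is that it is easy.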
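The TF-IDF reasoning above can be checked by hand with a few lines of plain Python. The two documents below are an assumed example chosen to match the description (the first has five words, one of which is this); the helper names are hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers would differ from this textbook formula.

```python
import math

# Assumed two-document example; the first document has five words.
documents = {
    "d1": "this is a a sample".split(),
    "d2": "this is another another example example example".split(),
}


def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log of (total docs / docs containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs.values() if term in doc)
    return math.log10(len(docs) / containing)


def tf_idf(term, doc_name, docs):
    return tf(term, docs[doc_name]) * idf(term, docs)


score_this = tf_idf("this", "d1", documents)
score_example = tf_idf("example", "d2", documents)
print(score_this, round(score_example, 3))  # 0.0 0.129
```

Since this appears in both documents, its IDF is log10(2/2) = 0 and the score vanishes, while example appears in only one document and keeps a positive score.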


### Vectorize phrases

In [226]:
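# Try switching to a TF-IDF vectorizer.
# Splitting by word and by character and combining both gives a better score.

# When the n-gram range grows, increase max_features as well; find the value by hyperparameter tuning.
# (1, 1) gives 24
# SciPy is built on top of NumPy; use it when you need mathematical operations.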


from sklearn.feature_extraction.text import CountVectorizer
import nltk

# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# binary=True/False
# lowercase=True/False
# ngram_range=(1, 1)
# stop_words=None

# vectorizer = CountVectorizer(max_features=1000)
vectorizer = CountVectorizer(max_features=100000, min_df=2, ngram_range=(1, 3), tokenizer=nltk.word_tokenize)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100000, min_df=2,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function word_tokenize at 0x1268abd08>, vocabulary=None)

In [227]:
vectorizer.fit(train["Phrase"])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100000, min_df=2,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function word_tokenize at 0x1268abd08>, vocabulary=None)

In [228]:
X_train = vectorizer.transform(train["Phrase"])
X_train

<156060x100000 sparse matrix of type '<class 'numpy.int64'>'
	with 2398373 stored elements in Compressed Sparse Row format>

In [229]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
columns = vectorizer.get_feature_names()
pd.DataFrame(X_train[:100].toarray(), columns=columns).head()

Unnamed: 0,!,! ',! '',! -rrb-,! ?,! ? ',! ? -rrb-,#,# 9,$,...,"zone , and",zone arm,zone arm with,zone of,zone of sympath,zooland,zucker,zucker brothers\/abraham,zucker brothers\/abraham film,zwick
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [230]:
X_test = vectorizer.transform(test["Phrase"])
X_test

<66292x100000 sparse matrix of type '<class 'numpy.int64'>'
	with 629635 stored elements in Compressed Sparse Row format>

In [231]:
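# Labels are also often imbalanced. In cancer prediction, for example, a split can end up
# with test data that contains no cancer patients at all, so split in a way that keeps the labels matched.
# Let this be handled automatically.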


sentence_ids = train["SentenceId"]

print(sentence_ids.shape)


(156060,)


In [232]:
y_train = train["Sentiment"]

print(y_train.shape)
y_train.head()

(156060,)


PhraseId
1    1
2    2
3    2
4    2
5    2
Name: Sentiment, dtype: int64

## Score

In [251]:
from sklearn.linear_model import SGDClassifier

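# alpha can be found by trying random values.
# TF-IDF does not work well everywhere; it pairs well with regression models.
# Results depend on the sentence and on word combinations; neighboring words matter.
# In most cases tree models work well.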
model = SGDClassifier(random_state=37)
model



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=37, shuffle=True,
       tol=None, verbose=0, warm_start=False)
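The note about finding alpha from random values could be sketched as follows. The synthetic data from make_classification, the search range, and the number of trials are all assumptions made for illustration; with the real data you would score against X_train and y_train instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, not the competition data).
X, y = make_classification(n_samples=300, n_features=20, random_state=37)
rng = np.random.RandomState(37)

results = []
for _ in range(10):
    # Sample alpha log-uniformly between 1e-6 and 1e-1.
    alpha = 10 ** rng.uniform(-6, -1)
    model = SGDClassifier(alpha=alpha, random_state=37)
    score = cross_val_score(model, X, y, cv=3).mean()
    results.append((score, alpha))

# Keep the alpha with the best mean cross-validation accuracy.
best_score, best_alpha = max(results)
print(best_alpha, best_score)
```

Sampling on a log scale is a common choice for regularization strengths because useful values span several orders of magnitude.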

In [237]:
# With a lot of data, use hold-out validation instead.
# from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GroupKFold

kfold = GroupKFold(n_splits=5)

y_predict = cross_val_predict(model, X_train, y_train, cv=kfold, groups=sentence_ids)

print(y_predict.shape)
y_predict[0:10]

(156060,)


array([3, 3, 2, 2, 2, 3, 2, 3, 2, 3])

In [238]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_train, y_predict)
print("Score = {0:.5f}".format(score))

Score = 0.59018


* Score = 0.59018
* Score = 0.58511 with the nltk tokenizer specified
* Score = 0.58648 
* Score = 0.58386
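These scores rely on GroupKFold keeping all phrases of one sentence in the same fold. A small sketch of that guarantee, using made-up sentence ids rather than the real SentenceId column:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical sentence ids: several phrases share one sentence.
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
X = np.arange(len(groups)).reshape(-1, 1)
y = np.zeros(len(groups), dtype=int)

overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=2).split(X, y, groups=groups):
    # Sentence ids present on both sides of the split (should be none).
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(shared)
    print(sorted(set(groups[valid_idx])), shared)
```

Each fold's validation sentences never appear in its training part, which avoids scoring a phrase against a model that already saw its parent sentence.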

In [239]:
import numpy as np

result = train.copy()
result["Sentiment(predict)"] = y_predict
result["Difference(Phrase)"] = np.abs(y_train - y_predict)

print(result.shape)
result.head()

(156060, 6)


PhraseId,SentenceId,Phrase,Sentiment,Phrase(origin),Sentiment(predict),Difference(Phrase)
1,1,a seri of escapad demonstr the adag that what ...,1,A series of escapades demonstrating the adage ...,3,2
2,1,a seri of escapad demonstr the adag that what ...,2,A series of escapades demonstrating the adage ...,3,1
3,1,a seri,2,A series,2,0
4,1,a,2,A,2,0
5,1,seri,2,series,2,0


In [240]:
sentiment = result.groupby("SentenceId")["Difference(Phrase)"].mean()
print(sentiment.shape)
sentiment.head()

(8529,)


SentenceId
1    0.317460
2    0.500000
3    0.171429
4    0.625000
5    0.800000
Name: Difference(Phrase), dtype: float64

In [252]:
train.head()
train.loc[train["SentenceId"] == 3025]


PhraseId,SentenceId,Phrase,Sentiment,Phrase(origin)
59963,3025,this is one bad movi .,0,This is one baaaaaaaaad movie .
59964,3025,is one bad movi .,0,is one baaaaaaaaad movie .
59965,3025,is one bad movi,0,is one baaaaaaaaad movie
59966,3025,one bad movi,0,one baaaaaaaaad movie
59967,3025,bad movi,0,baaaaaaaaad movie
59968,3025,bad,0,baaaaaaaaad


In [242]:
# Look up each phrase's sentence-level mean error by SentenceId.
result["Difference(Sentence)"] = result["SentenceId"].map(sentiment)
result = result.sort_values(by="Difference(Sentence)", ascending=False)

print(result.shape)
result.head(30)

(156060, 7)


PhraseId,SentenceId,Phrase,Sentiment,Phrase(origin),Sentiment(predict),Difference(Phrase),Difference(Sentence)
79350,4087,can not recommend it .,0,ca n't recommend it .,4,4,4.0
79349,4087,i can not recommend it .,0,I ca n't recommend it .,4,4,4.0
113468,6031,is well below expect,1,is well below expectations,2,1,2.5
152334,8312,is a big time stinker,0,is a big time stinker,3,3,2.5
152332,8312,'' is a big time stinker .,0,'' is a big time stinker .,3,3,2.5
152331,8312,the adventur of pluto nash '' is a big time st...,0,The Adventures of Pluto Nash '' is a big time ...,3,3,2.5
152330,8312,`` the adventur of pluto nash '' is a big time...,0,`` The Adventures of Pluto Nash '' is a big ti...,3,3,2.5
152335,8312,a big time stinker,1,a big time stinker,2,1,2.5
152336,8312,big time stinker,0,big time stinker,2,2,2.5
152337,8312,time stinker,0,time stinker,2,2,2.5


In [243]:
result[0:1000].to_csv("result.csv")

In [244]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
vocabulary = vectorizer.get_feature_names()
vocabulary[0:3]

['!', "! '", "! ''"]

In [245]:
pd.DataFrame(vocabulary, columns=["word"]).to_csv("vocabulary.csv")

In [246]:
result[result["Phrase"].str.contains("can not recommend")]

PhraseId,SentenceId,Phrase,Sentiment,Phrase(origin),Sentiment(predict),Difference(Phrase),Difference(Sentence)
79350,4087,can not recommend it .,0,ca n't recommend it .,4,4,4.0
79349,4087,i can not recommend it .,0,I ca n't recommend it .,4,4,4.0
80730,4158,can not recommend it enough,4,ca n't recommend it enough,2,2,2.0
80729,4158,can not recommend it enough .,4,ca n't recommend it enough .,2,2,2.0
80728,4158,simpli can not recommend it enough .,4,simply ca n't recommend it enough .,0,4,2.0
80727,4158,i simpli can not recommend it enough .,3,I simply ca n't recommend it enough .,0,3,2.0
22229,998,"yet can not recommend it , becaus it overstay ...",1,"yet can not recommend it , because it overstay...",4,3,1.294118
22230,998,"can not recommend it , becaus it overstay it n...",1,"can not recommend it , because it overstays it...",4,3,1.294118
22226,998,"admir it and yet can not recommend it , becaus...",2,"admire it and yet can not recommend it , becau...",4,2,1.294118
22225,998,"admir it and yet can not recommend it , becaus...",1,"admire it and yet can not recommend it , becau...",3,2,1.294118


## Train

In [247]:
model.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=37, shuffle=True,
       tol=None, verbose=0, warm_start=False)

In [248]:
predictions = model.predict(X_test)

print(predictions.shape)
predictions[0:10]

(66292,)


array([3, 3, 2, 3, 3, 2, 3, 2, 2, 2])

## Submit

In [249]:
submission = pd.read_csv("data/sampleSubmission.csv", index_col="PhraseId")

submission["Sentiment"] = predictions

print(submission.shape)
submission.head()

(66292, 1)


PhraseId,Sentiment
156061,3
156062,3
156063,2
156064,3
156065,3


In [250]:
# Save the submission file
submission.to_csv("data/baseline-script-2nd.csv")

* Next week's preview: regression with XGBoost and how to use it.
* LightGBM (LGBM) is also available, but it is harder to install. XGBoost is somewhat more widely used.