<a name="top"></a>
<h1>Bag of Words tutorial, part 1<span class="tocSkip"></span></h1>
    
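    - Natural Language Processing (NLP)
    - Bag of Words (BOW): one approach to NLP
    
    Using the BOW approach, we clean the review data from
    the IMDB site, build feature vectors from it,
    and train a random forest on them.
    
    1. Load the training data
    2. Preprocess the data
        - Remove HTML tags with BeautifulSoup
        - Remove punctuation with regular expressions (re)
        - Convert to lowercase
        - Split each review into a list of words (tokenization)
        - Remove stopwords from the word list using nltk's stopwords
        - (Optional) stemming, lemmatizing
        - Join the list back into a single string with ' '.join()
    3. Build the BOW feature vectors
        - Use scikit-learn's CountVectorizer
        - The 5,000 most frequent words across the whole cleaned
          dataset form the vocabulary, and a vector is built for
          each review against it.
        - For each vocabulary word, the vector records how many
          times that word occurs in the review.
    4. Train the Random Forest model
        - Use scikit-learn's RandomForestClassifier.
        - Pass in the training features and labels.
    5. Clean the test dataset and build its features (same steps as for the training data)
    6. Predict with the model (forest.predict)
    7. Save the results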
    
resource: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words

# Loading the training data
<br><a href="#top">go to top</a>

In [1]:
import pandas as pd

In [19]:
train = pd.read_csv("./data/bow/labeledTrainData.tsv",sep='\t')

In [21]:
train.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [22]:
train.shape

(25000, 3)

In [23]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

In [24]:
print(train["review"][0])

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally star

# Data preprocessing
<br><a href="#top">go to top</a>

## Removing HTML tags
<br><a href="#top">go to top</a>

In [26]:
from bs4 import BeautifulSoup

In [31]:
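# BeautifulSoup's .get_text() method returns the text with the HTML tags stripped.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.

example1 = BeautifulSoup(train["review"][0], "html.parser")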
text_example1 = example1.get_text()
print(text_example1)

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 min

## Removing punctuation with regex
<br><a href="#top">go to top</a>

In [30]:
import re

In [32]:
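# Keep only the letters a-z and A-Z;
# everything else (punctuation, digits) is replaced with " ".
letters_only = re.sub("[^a-zA-Z]", " ", text_example1)
# A ^ placed first inside [ ] negates the class: it matches any character NOT listed.
# Careful: outside the brackets, ^ anchors to the start of the string instead.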
print(letters_only)

With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    min
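The position of `^` changes the meaning of the pattern, which is easy to get wrong. The following is a minimal sketch of the difference; the `sample` string is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample string, chosen to contain letters, digits and punctuation.
sample = "MJ's 20 minutes, m'kay."

# [^a-zA-Z]: the caret is INSIDE the brackets, so the class is negated
# and every non-letter character is replaced with a space.
negated = re.sub("[^a-zA-Z]", " ", sample)

# ^[a-zA-Z]: the caret is OUTSIDE the brackets, so it anchors the match to the
# start of the string and only the single leading letter is replaced.
anchored = re.sub("^[a-zA-Z]", " ", sample)

print(repr(negated))
print(repr(anchored))
```

Only the first pattern does what the preprocessing step needs; the second leaves digits and punctuation in place.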

## Lowercasing and building the word list
<br><a href="#top">go to top</a>

In [33]:
# Convert everything to lowercase.
lower_case = letters_only.lower()
# Split on whitespace to get a list of individual words.
words = lower_case.split()
print(words)

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 

## Removing stopwords
<br><a href="#top">go to top</a>

In [34]:
from nltk.corpus import stopwords

In [36]:
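# Load the English stopwords from the nltk corpus.
# Stopwords are very common words (such as "the" or "is") that carry little meaning on their own.
# If the corpus is not installed yet, run nltk.download("stopwords") once.
s_words = stopwords.words("english")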
print(s_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [37]:
filtered_words = [w for w in words if w not in s_words]
print(filtered_words)

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

## Wrapping the steps in a function
<br><a href="#top">go to top</a>

In [38]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_words(raw_review):
    # 1. Take a single raw review from the training set.
    # 2. Strip the HTML tags with BeautifulSoup.
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 3. Replace everything that is not a letter with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 4. Lowercase the text and split it into a list of words.
    words = letters_only.lower().split()
    # 5. Load the stopwords.
    # Note: converting the list to a set makes this faster, because a
    # membership test on a set is a hash lookup instead of a scan of the list.
    stops = set(stopwords.words("english"))
    # 6. Keep only the words that are not stopwords.
    meaningful_words = [w for w in words if w not in stops]
    # 7. Join the filtered words back into a single string and return it.
    return " ".join(meaningful_words)

In [39]:
raw_review_1 = train["review"][0]
review_1_data = review_to_words(raw_review_1)
print(review_1_data)

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working
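The function converts the stopword list to a set before filtering. A rough sketch of why that helps is shown here: both containers give the same filtered result, but membership tests on a list scan its elements one by one. The word lists are hypothetical stand-ins, not the real nltk stopwords, and the measured times will vary by machine.

```python
import timeit

# Hypothetical stand-ins for the stopword list and one tokenized review.
stop_list = ["word{}".format(i) for i in range(200)]
stop_set = set(stop_list)
words = ["token{}".format(i) for i in range(1000)] + stop_list

# Both containers produce the same filtered result...
kept_with_list = [w for w in words if w not in stop_list]
kept_with_set = [w for w in words if w not in stop_set]

# ...but each "in" test on the list scans it, while the set uses a hash lookup.
list_time = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=20)
set_time = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=20)
print("list: {:.4f}s, set: {:.4f}s".format(list_time, set_time))
```

Over 25,000 reviews this difference adds up, which is presumably why the set conversion is worth doing.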

In [42]:
# Use the function above to clean every review.
num_reviews = train["review"].size
clean_train_reviews = []
for i in range(0, num_reviews):
    clean_train_reviews.append(review_to_words(train["review"][i]))
    if ((i+1)%1000 == 0):
        print("Review {} of {}".format(i+1, num_reviews))

Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


# Extracting feature vectors
    1. Here, a feature is the number of times each vocabulary word occurs in a given sentence.
        - sent1 = "The cat sat on the hat"
        - sent2 = "The dog ate the cat and the hat"
        - vocabulary = {the, cat, sat, on, hat, dog, ate, and}
        - sent1_feature : {2,1,1,1,1,0,0,0}
            - "the" occurs twice in the sentence.
        - sent2_feature : {3,1,0,0,1,1,1,1}
            - "the" occurs three times in the sentence.
    2. scikit-learn makes this easy to do.
    
<br><a href="#top">go to top</a>
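The two example vectors above can be reproduced by hand. This is a minimal sketch in plain Python, not how scikit-learn implements it; the helper name `bow_vector` is made up for illustration.

```python
from collections import Counter

# Vocabulary in the same order as listed above.
vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    # A Counter returns 0 for words that never occur in the sentence.
    return [counts[word] for word in vocabulary]

sent1_feature = bow_vector("The cat sat on the hat", vocabulary)
sent2_feature = bow_vector("The dog ate the cat and the hat", vocabulary)
print(sent1_feature)  # [2, 1, 1, 1, 1, 0, 0, 0]
print(sent2_feature)  # [3, 1, 0, 0, 1, 1, 1, 1]
```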

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

In [44]:
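# To extract feature vectors, first create a vectorizer.
# The vocabulary should not grow too large, so keep only the 5,000 most frequent words.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor= None,
                             stop_words= None,
                             max_features = 5000)
# In hindsight, part of the manual preprocessing above could probably have been left
# to CountVectorizer, which can lowercase, tokenize, and drop stop words itself.
# It does not strip HTML tags, though.

# fit_transform learns the vocabulary (fit) and then encodes each review as a row
# of word counts (transform). The result is a sparse matrix, so .toarray()
# converts it to a dense NumPy array.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()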

In [50]:
# 25,000 reviews, each represented by 5,000 features.
print(train_data_features.shape)

(25000, 5000)
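As a rough illustration of what `max_features` does, the sketch below counts tokens over a tiny corpus and keeps only the most frequent ones. It uses plain Python rather than CountVectorizer, the three documents and the cutoff of 2 are made up for the example, and tie-breaking between equally frequent words may differ from scikit-learn.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the cleaned reviews.
docs = ["cat sat hat", "dog ate cat hat", "cat hat hat"]

# Count every token across the whole corpus.
corpus_counts = Counter(token for doc in docs for token in doc.split())

# Keep only the max_features most frequent tokens, then sort them alphabetically.
max_features = 2
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# Each document becomes one row with one count per vocabulary word.
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['cat', 'hat']
print(vectors)     # [[1, 1], [1, 1], [1, 2]]
```

The number of columns equals the cutoff, which matches the second dimension of the shape printed above.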


In [53]:
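# train_data_features does not show which word each feature is.
# The vectorizer created earlier keeps that information.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call get_feature_names_out() instead.
vocab = vectorizer.get_feature_names()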
print(vocab[:5])

['abandoned', 'abc', 'abilities', 'ability', 'able']


# Cleaning the test data
<br><a href="#top">go to top</a>

In [55]:
test = pd.read_csv("./data/bow/testData.tsv", sep="\t")

In [56]:
test.shape

(25000, 2)

In [58]:
num_reviews = len(test["review"])
clean_test_reviews = []
for i in range(0, num_reviews):
    if ((i+1)%1000==0):
        print("Review {} of {}".format(i+1, num_reviews))
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)

Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


# Training the Random Forest model
<br><a href="#top">go to top</a>

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, train["sentiment"])

# Predicting with the model
<br><a href="#top">go to top</a>

In [59]:
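# Use transform (not fit_transform) so the test reviews are encoded
# with the vocabulary learned from the training data.
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
# Predict a sentiment label for each test review.
result = forest.predict(test_data_features)
# Pair each review id with its predicted sentiment.
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})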
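The test reviews are encoded with `transform` only, so the columns stay aligned with what the model saw during training. The sketch below imitates that behaviour in plain Python under simplified assumptions: the `transform` helper and the three-word vocabulary are hypothetical and are not scikit-learn's implementation.

```python
from collections import Counter

# Hypothetical vocabulary, standing in for the one learned from the training reviews.
vocabulary = ["boring", "great", "movie"]

def transform(docs, vocabulary):
    """Encode each document as counts of the vocabulary words only."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# "plot" never appeared in the vocabulary, so it is silently ignored.
test_docs = ["great plot", "boring boring movie"]
test_rows = transform(test_docs, vocabulary)
print(test_rows)  # [[0, 1, 0], [2, 0, 1]]
```

Refitting on the test data would instead produce a different vocabulary, and the feature columns would no longer mean the same thing to the trained forest.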

# Saving the predictions
<br><a href="#top">go to top</a>

In [60]:
# quoting=3 is csv.QUOTE_NONE: fields are written without quote characters.
output.to_csv("./data/BOW_model_01.csv", index=False, quoting=3)
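The literal `3` passed as `quoting` corresponds to the `csv.QUOTE_NONE` constant from the standard library. A small sketch of what that setting produces, using the `csv` module directly instead of pandas; the two id values are hypothetical examples, not actual predictions.

```python
import csv
import io

# The named constant behind the literal 3.
print(csv.QUOTE_NONE)  # 3

# Hypothetical rows in the same shape as the submission file.
rows = [["id", "sentiment"], ["12311_10", 1], ["8348_2", 0]]

# Write the rows to an in-memory buffer without any quote characters.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing `quoting=csv.QUOTE_NONE` instead of `quoting=3` in the `to_csv` call would make the intent easier to read.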