# Sentiment Analysis with IMDB dataset
## 4. Modeling

I am going to build four models:
- Logistic regression
- Random forest
- RNN
- CNN

### 4.1 Logistic regression

Logistic regression is a variation of linear regression. Linear regression predicts a value from a weighted sum of the inputs x (the weights are the model's parameters), which can be thought of as drawing the line that minimizes the error. Logistic regression passes that predicted value through the logistic function to obtain a value between 0 and 1, which can be interpreted as the probability that input x belongs to a specific class.
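
As a minimal sketch of the idea above, the snippet below computes a linear score and squashes it with the logistic function. The weights, bias, and input are made-up illustrative values, and `sigmoid` and `predict_proba` are hypothetical helper names, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Linear score w.x + b passed through the logistic function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical parameters for a 2-feature input (illustrative values only).
weights = [1.5, -2.0]
bias = 0.25

p = predict_proba(weights, bias, [1.0, 0.5])
print("P(class = 1 | x) = {:.3f}".format(p))

# A common convention is to predict class 1 when the probability is at least 0.5.
label = 1 if p >= 0.5 else 0
```

In practice the weights are learned from the training data rather than set by hand.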

In this part, I'll use word2vec and TF-IDF to build the input embedding vectors.

### 4.1.1 TF-IDF - logistic regression
### Data load

In [1]:
import pandas as pd

DATA_OUT_PATH = 'C:/python/NLP/chap_4/data_for_modeling/'
TRAIN_CLEAN_DATA = 'train_clean.csv'

train_data = pd.read_csv(DATA_OUT_PATH + TRAIN_CLEAN_DATA, header=0, quoting=3)

In [2]:
reviews = list(train_data['review'])
sentiments = list(train_data['sentiment'])

### TF-IDF Vectorizing

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

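# Character-level 1- to 3-grams with sublinear TF scaling, limited to 5000 features.
# See the scikit-learn TfidfVectorizer documentation for each parameter.
vectorizer = TfidfVectorizer(min_df=0.0, analyzer='char', sublinear_tf=True, ngram_range=(1, 3), max_features=5000)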

X = vectorizer.fit_transform(reviews)
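
To get a feel for what the character-level setting does, here is a small sketch on a made-up two-review corpus (not the IMDB data). The names `toy_reviews`, `toy_vectorizer`, and `toy_X` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration; not taken from the IMDB data).
toy_reviews = ["good movie", "bad movie"]

toy_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, ngram_range=(1, 3))
toy_X = toy_vectorizer.fit_transform(toy_reviews)

# The analyzer shows which character n-grams a single string is broken into.
analyzer = toy_vectorizer.build_analyzer()
print(analyzer("good"))

# One row per review, one column per distinct n-gram seen during fitting.
print(toy_X.shape)
```

With `analyzer='char'` the features are short character sequences rather than whole words, so the vocabulary stays small even with `ngram_range=(1, 3)`.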

### Split data

In [4]:
from sklearn.model_selection import train_test_split
import numpy as np

RANDOM_SEED = 42
TEST_SPLIT = 0.2

y = np.array(sentiments)

X_train, X_dev, Y_train, Y_dev = train_test_split(X, y, test_size=TEST_SPLIT, random_state=RANDOM_SEED)

### Modeling

In [5]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train, Y_train)



LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

### Evaluation

In [6]:
print("Accuracy: {:5f}".format(lr.score(X_dev, Y_dev)))

Accuracy: 0.859600


### Submit

In [7]:
TEST_CLEAN_DATA = 'test_clean.csv'
test_data = pd.read_csv(DATA_OUT_PATH + TEST_CLEAN_DATA, header=0, quoting=3)

testDataVecs = vectorizer.transform(test_data['review'])
test_predicted = lr.predict(testDataVecs)
print(test_predicted)

[1 0 1 ... 0 1 0]


In [8]:
import os
SUBMIT_DATA_PATH = "C:/python/NLP/chap_4/submit/"

if not os.path.exists(SUBMIT_DATA_PATH):
    os.mkdir(SUBMIT_DATA_PATH)

ids = list(test_data['id'])
answer_lr = pd.DataFrame({'id': ids, 'sentiment': test_predicted})
# index=False keeps the DataFrame index out of the submission file
answer_lr.to_csv(SUBMIT_DATA_PATH + 'lgs_tfidf_answer.csv', index=False)

### 4.1.2 word2vec - logistic regression

To use word2vec, every review has to be split into a list of words.

In [10]:
sentences = []

for review in reviews:
    sentences.append(review.split())

In [13]:
sentences[0]

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'mj',
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 'kay',
 'visually',
 'impressive',
 'course',
 'michael',
 'jackson',
 'unless',
 'remotely',
 'like',
 'mj',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'may',
 'call',
 'mj',
 'egotist',
 'consenting',
 'making',
 'movie',
 'mj',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'actual',
 'feature',
 'film',
 'bit',
 'finally',
 'starts',
 'minutes',
 'excluding',
 'smooth',
 'crim

### word2vec vectorizing
##### Hyperparameter for word2vec
I am going to use the gensim package to train the word2vec vectors; the hyperparameters below are passed to gensim.

In [16]:
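num_features = 300    # dimensionality of the word vectors
min_word_count = 40   # ignore words that appear fewer than this many times
num_workers = 4       # number of worker threads used for training
context = 10          # size of the context window
downsampling = 1e-3   # downsampling threshold for frequent words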

We can track the training progress of word2vec with Python's logging module.

In [18]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level = logging.INFO)

In [19]:
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

2019-06-15 07:57:37,184 : INFO : 'pattern' package not found; tag filters are not available for English
  "C extension not loaded, training will be slow. "
2019-06-15 07:57:37,210 : INFO : collecting all words and their counts
2019-06-15 07:57:37,214 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Training model...


2019-06-15 07:57:37,713 : INFO : PROGRESS: at sentence #10000, processed 1205223 words, keeping 51374 word types
2019-06-15 07:57:38,193 : INFO : PROGRESS: at sentence #20000, processed 2396605 words, keeping 67660 word types
2019-06-15 07:57:38,431 : INFO : collected 74065 word types from a corpus of 2988089 raw words and 25000 sentences
2019-06-15 07:57:38,433 : INFO : Loading a fresh vocabulary
2019-06-15 07:57:38,530 : INFO : effective_min_count=40 retains 8160 unique words (11% of original 74065, drops 65905)
2019-06-15 07:57:38,531 : INFO : effective_min_count=40 leaves 2627273 word corpus (87% of original 2988089, drops 360816)
2019-06-15 07:57:38,581 : INFO : deleting the raw counts dictionary of 74065 items
2019-06-15 07:57:38,586 : INFO : sample=0.001 downsamples 30 most-common words
2019-06-15 07:57:38,587 : INFO : downsampling leaves estimated 2494384 word corpus (94.9% of prior 2627273)
2019-06-15 07:57:38,679 : INFO : estimated required memory for 8160 words and 300 dimen

2019-06-15 08:04:25,449 : INFO : EPOCH 1 - PROGRESS: at 23.18% examples, 1443 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:04:31,865 : INFO : EPOCH 1 - PROGRESS: at 23.56% examples, 1440 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:04:32,960 : INFO : EPOCH 1 - PROGRESS: at 23.96% examples, 1456 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:04:37,410 : INFO : EPOCH 1 - PROGRESS: at 24.28% examples, 1460 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:04:47,521 : INFO : EPOCH 1 - PROGRESS: at 24.64% examples, 1445 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:04:53,255 : INFO : EPOCH 1 - PROGRESS: at 24.98% examples, 1445 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:04:55,132 : INFO : EPOCH 1 - PROGRESS: at 25.30% examples, 1458 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:04:59,441 : INFO : EPOCH 1 - PROGRESS: at 25.60% examples, 1463 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:05:10,098 : INFO : EPOCH 1 - PROGRESS: at 25.94% examples, 1447 words/s, in_qsize 8, out_qsize 0
2

2019-06-15 08:12:06,634 : INFO : EPOCH 1 - PROGRESS: at 50.69% examples, 1469 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:12:13,032 : INFO : EPOCH 1 - PROGRESS: at 51.03% examples, 1467 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:18,949 : INFO : EPOCH 1 - PROGRESS: at 51.37% examples, 1467 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:21,767 : INFO : EPOCH 1 - PROGRESS: at 51.68% examples, 1471 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:28,532 : INFO : EPOCH 1 - PROGRESS: at 52.03% examples, 1470 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:12:36,043 : INFO : EPOCH 1 - PROGRESS: at 52.35% examples, 1466 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:40,951 : INFO : EPOCH 1 - PROGRESS: at 52.60% examples, 1468 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:43,915 : INFO : EPOCH 1 - PROGRESS: at 52.93% examples, 1472 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:12:50,287 : INFO : EPOCH 1 - PROGRESS: at 53.26% examples, 1471 words/s, in_qsize 7, out_qsize 0
2

2019-06-15 08:19:41,538 : INFO : EPOCH 1 - PROGRESS: at 77.05% examples, 1459 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:19:44,049 : INFO : EPOCH 1 - PROGRESS: at 77.34% examples, 1462 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:19:50,737 : INFO : EPOCH 1 - PROGRESS: at 77.68% examples, 1461 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:19:54,484 : INFO : EPOCH 1 - PROGRESS: at 78.01% examples, 1463 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:20:03,280 : INFO : EPOCH 1 - PROGRESS: at 78.35% examples, 1460 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:20:05,101 : INFO : EPOCH 1 - PROGRESS: at 78.70% examples, 1464 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:20:12,023 : INFO : EPOCH 1 - PROGRESS: at 79.04% examples, 1462 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:20:16,147 : INFO : EPOCH 1 - PROGRESS: at 79.41% examples, 1464 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:20:26,195 : INFO : EPOCH 1 - PROGRESS: at 79.77% examples, 1459 words/s, in_qsize 8, out_qsize 0
2

2019-06-15 08:28:26,011 : INFO : EPOCH 2 - PROGRESS: at 9.21% examples, 1392 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:28:41,059 : INFO : EPOCH 2 - PROGRESS: at 9.51% examples, 1322 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:28:42,233 : INFO : EPOCH 2 - PROGRESS: at 9.85% examples, 1358 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:28:46,982 : INFO : EPOCH 2 - PROGRESS: at 10.16% examples, 1368 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:28:49,862 : INFO : EPOCH 2 - PROGRESS: at 10.53% examples, 1391 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:29:03,738 : INFO : EPOCH 2 - PROGRESS: at 10.90% examples, 1337 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:29:04,858 : INFO : EPOCH 2 - PROGRESS: at 11.23% examples, 1370 words/s, in_qsize 7, out_qsize 0
2019-06-15 08:29:11,720 : INFO : EPOCH 2 - PROGRESS: at 11.52% examples, 1365 words/s, in_qsize 8, out_qsize 0
2019-06-15 08:29:15,346 : INFO : EPOCH 2 - PROGRESS: at 11.87% examples, 1380 words/s, in_qsize 7, out_qsize 0
2019

KeyboardInterrupt: 

##### word2vec model save

In [None]:
model_name = "300features_40minwords_10context"
model.save(model_name)

### word2vec -> Input

Now the word2vec vectors need to be put into a proper shape for model input. Each review has a different number of words, so we need to give every review the same shape.

One simple way to deal with this is to average the word vectors of each review.
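
The averaging idea can be sketched on toy data before applying it to the real model. In the snippet below, `toy_vectors` is a hypothetical dictionary of 3-dimensional vectors standing in for a trained word2vec model, and `average_vector` is an illustrative helper rather than a gensim function.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained word2vec model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "movie": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def average_vector(words, vectors, num_features):
    """Mean of the vectors of in-vocabulary words; zeros if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features, dtype=np.float32)
    return np.mean(known, axis=0)

short_review = ["good", "movie"]
long_review = ["good", "good", "movie", "unknownword"]

short_vec = average_vector(short_review, toy_vectors, 3)
long_vec = average_vector(long_review, toy_vectors, 3)

# Both reviews end up with the same shape regardless of their length.
print(short_vec.shape, long_vec.shape)
```

Averaging discards word order, which is the price paid for getting a fixed-size input out of variable-length reviews.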

In [None]:
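def get_features(words, model, num_features):
    # Accumulator for the sum of the word vectors of one review
    feature_vector = np.zeros((num_features), dtype=np.float32)

    num_words = 0

    # Vocabulary of the trained model, as a set for fast membership tests
    index2word_set = set(model.wv.index2word)

    for w in words:
        if w in index2word_set:
            num_words += 1
            # Look up the vector through model.wv instead of indexing the model directly
            feature_vector = np.add(feature_vector, model.wv[w])

    # Average the summed vectors; skip the division when no word is in the vocabulary
    if num_words > 0:
        feature_vector = np.divide(feature_vector, num_words)
    return feature_vector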

In [None]:
def get_dataset(reviews, model, num_features):
    dataset = list()
    
    for s in reviews:
        dataset.append(get_features(s, model, num_features))
        
    reviewFeatureVecs = np.stack(dataset)
    
    return reviewFeatureVecs

In [None]:
train_data_vecs = get_dataset(sentences, model, num_features)

### Training, evaluation and submit

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
import os

# Use the averaged word2vec vectors of the training reviews as features
X = train_data_vecs
Y = np.array(sentiments)

RANDOM_SEED = 42
TEST_SPLIT = 0.2

X_train, X_dev, Y_train, Y_dev = train_test_split(X, Y, test_size=TEST_SPLIT, random_state=RANDOM_SEED)

lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train, Y_train)

print("Accuracy: %f" % lr.score(X_dev, Y_dev))

# Load the test reviews and split each one into a list of words
TEST_CLEAN_DATA = 'test_clean.csv'
test_data = pd.read_csv(DATA_OUT_PATH + TEST_CLEAN_DATA, header=0, quoting=3)
test_reviews = list(test_data['review'])
test_sentences = []
for review in test_reviews:
    test_sentences.append(review.split())

test_data_vecs = get_dataset(test_sentences, model, num_features)

DATA_OUT_PATH = 'C:/python/NLP/Chap_4/submit/'
test_predicted = lr.predict(test_data_vecs)

if not os.path.exists(DATA_OUT_PATH):
    os.makedirs(DATA_OUT_PATH)

ids = list(test_data['id'])
answer_dataset = pd.DataFrame({'id': ids, 'sentiment': test_predicted})
answer_dataset.to_csv(DATA_OUT_PATH + 'lgs_w2v_answer.csv', index=False)