<a href="https://colab.research.google.com/github/hwangtaemin/word2vec-with-movie-review/blob/main/movie_word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import warnings
import os
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
SEED = 33

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data loading

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Kaggle/movie/labeledTrainData.tsv', delimiter='\t')
test = pd.read_csv('/content/drive/MyDrive/Kaggle/movie/testData.tsv', delimiter='\t')
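# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them; it replaces the removed error_bad_lines=False
unlabeled_train = pd.read_csv('/content/drive/MyDrive/Kaggle/movie/unlabeledTrainData.tsv', delimiter='\t', on_bad_lines='warn')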

b'Skipping line 43043: expected 2 fields, saw 3\n'


In [4]:
print(train.shape)
train.head()

(25000, 3)


| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [5]:
print(test.shape)
test.head()

(25000, 2)


| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


In [6]:
print(unlabeled_train.shape)
unlabeled_train.head()

(49998, 2)


| | id | review |
|---|---|---|
| 0 | 9999_0 | Watching Time Chasers, it obvious that it was ... |
| 1 | 45057_0 | I saw this film about 20 years ago and remembe... |
| 2 | 15561_0 | Minor Spoilers<br /><br />In New York, Joan Ba... |
| 3 | 7161_0 | I went to see this film with a great deal of e... |
| 4 | 43971_0 | Yes, I agree with everyone on this site this m... |


### Preprocessing

In [7]:
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [8]:
sample = train['review'][0]

In [9]:
sample

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [10]:
soup = BeautifulSoup(sample, 'html.parser')

In [11]:
soup.text

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

In [12]:
import re

In [13]:
cleaned = re.sub('[^a-zA-Z]', ' ', soup.text)

In [14]:
cleaned

'With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

In [15]:
cleaned.lower()

'with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    mi

In [16]:
import nltk
nltk.download('stopwords')
eng_stopwords = set(stopwords.words('english'))  # a set makes the membership test in preprocessing() fast

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [18]:
lemmatizer = WordNetLemmatizer()

In [19]:
def process_lemma(sentence):
  # sentence is a list of tokens; 'v' lemmatizes each token as a verb (e.g. 'watched' -> 'watch')
  return [lemmatizer.lemmatize(word, 'v') for word in sentence]

In [20]:
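def preprocessing(sentence):
  # Strip HTML markup such as <br /> tags
  soup = BeautifulSoup(sentence, 'html.parser')
  # Replace everything except ASCII letters with a space, then lowercase
  cleaned = re.sub('[^a-zA-Z]', ' ', soup.text)
  cleaned = cleaned.lower()
  # Drop English stopwords, then lemmatize the remaining tokens as verbs
  cleaned = [word for word in cleaned.split() if word not in eng_stopwords]
  cleaned = process_lemma(cleaned)
  return ' '.join(cleaned)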

In [21]:
preprocessing(sample)

'stuff go moment mj start listen music watch odd documentary watch wiz watch moonwalker maybe want get certain insight guy think really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember go see cinema originally release subtle message mj feel towards press also obvious message drug bad kay visually impressive course michael jackson unless remotely like mj anyway go hate find bore may call mj egotist consent make movie mj fan would say make fan true really nice actual feature film bite finally start minutes exclude smooth criminal sequence joe pesci convince psychopathic powerful drug lord want mj dead bad beyond mj overhear plan nah joe pesci character rant want people know supply drug etc dunno maybe hat mj music lot cool things like mj turn car robot whole speed demon sequence also director must patience saint come film kiddy bad sequence usually directors hate work one kid let alone whole bunch perform complex dance scene botto
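The cleaning steps above can be tried without the NLTK downloads. The following is a minimal sketch, assuming a tiny hand-picked stopword set (`DEMO_STOPWORDS`, hypothetical, not the NLTK list) and a plain regex that replaces tags with spaces instead of BeautifulSoup. `simple_clean` is a hypothetical helper name, and lemmatization is skipped.

```python
import re

# Hypothetical, tiny stopword set for illustration only (not the NLTK list).
DEMO_STOPWORDS = {"the", "is", "a", "it", "of", "and", "this"}

def simple_clean(text):
    """Strip tags, keep letters only, lowercase, and drop the demo stopwords."""
    no_tags = re.sub(r"<[^>]+>", " ", text)        # crude stand-in for BeautifulSoup
    letters = re.sub(r"[^a-zA-Z]", " ", no_tags)   # digits and punctuation become spaces
    tokens = letters.lower().split()
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

print(simple_clean("This movie is GREAT!<br /><br />It's 10/10."))  # movie great s
```

The leftover `s` shows how contractions are split once apostrophes are replaced by spaces, so the stopword list has to cover such fragments.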

In [22]:
all_review = pd.concat([train['review'], unlabeled_train['review'], test['review']])

In [23]:
all_review_clean = all_review.apply(preprocessing)

In [24]:
all_review_clean.head()

0    stuff go moment mj start listen music watch od...
1    classic war worlds timothy hines entertain fil...
2    film start manager nicholas bell give welcome ...
3    must assume praise film greatest film opera ev...
4    superbly trashy wondrously unpretentious explo...
Name: review, dtype: object

### CountVectorizer

In [25]:
#from sklearn.feature_extraction.text import CountVectorizer

In [26]:
#cv = CountVectorizer(analyzer='word', max_features=5000)

In [27]:
#all_review_cv = cv.fit_transform(all_review_clean)

In [28]:
#all_review_cv.shape

### Tokenizer

In [29]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [30]:
tokenizer = Tokenizer(oov_token='<OOV>')

In [31]:
tokenizer.fit_on_texts(all_review_clean)

In [32]:
len(tokenizer.word_index)

126312
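As a rough illustration of what the word index holds, here is a pure-Python sketch of a frequency-ranked index with an out-of-vocabulary token. It is not the Keras implementation; `build_word_index` and `to_sequence` are hypothetical helper names, and the sketch assumes id 0 is left free for padding and id 1 is taken by the OOV token.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids by descending frequency; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}  # id 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Replace each word by its id, falling back to the OOV id for unseen words."""
    return [word_index.get(w, word_index[oov_token]) for w in text.split()]

index = build_word_index(["film watch film", "watch good film"])
print(index)                          # {'<OOV>': 1, 'film': 2, 'watch': 3, 'good': 4}
print(to_sequence("film bad", index))  # [2, 1]
```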

In [33]:
train_sentences = all_review_clean[:len(train)]
test_sentences = all_review_clean[-len(test):]

In [34]:
train_sentences.shape, test_sentences.shape

((25000,), (25000,))

In [35]:
train_sequences = tokenizer.texts_to_sequences(train_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)

In [36]:
train_sequences[0]

[397,
 12,
 463,
 11594,
 83,
 931,
 127,
 13,
 895,
 507,
 13,
 21106,
 13,
 19437,
 179,
 46,
 8,
 639,
 2250,
 66,
 16,
 18,
 469,
 3273,
 179,
 5,
 188,
 643,
 2110,
 1155,
 19437,
 58,
 4431,
 58,
 258,
 2,
 240,
 12,
 7,
 349,
 1643,
 255,
 1145,
 550,
 11594,
 59,
 773,
 2039,
 29,
 471,
 550,
 593,
 26,
 4231,
 1924,
 1032,
 175,
 420,
 1453,
 782,
 2209,
 6,
 11594,
 459,
 12,
 613,
 37,
 170,
 116,
 146,
 11594,
 34889,
 9296,
 5,
 3,
 11594,
 109,
 15,
 25,
 5,
 109,
 198,
 18,
 253,
 727,
 258,
 2,
 114,
 339,
 83,
 141,
 7788,
 3475,
 1502,
 311,
 781,
 6909,
 526,
 9123,
 785,
 593,
 1370,
 46,
 11594,
 242,
 26,
 558,
 11594,
 9785,
 505,
 12451,
 781,
 6909,
 11,
 3763,
 46,
 27,
 24,
 2666,
 593,
 413,
 8743,
 179,
 724,
 11594,
 127,
 64,
 469,
 94,
 6,
 11594,
 90,
 419,
 1905,
 130,
 1523,
 2147,
 311,
 29,
 68,
 113,
 3929,
 3388,
 36,
 2,
 22089,
 26,
 311,
 516,
 843,
 613,
 43,
 4,
 129,
 152,
 518,
 130,
 630,
 890,
 1120,
 423,
 55,
 1131,
 107,
 3,
 27,
 6,
 

In [37]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [38]:
MAX_LENGTH = 150

In [39]:
train_padded = pad_sequences(train_sequences, maxlen=MAX_LENGTH, truncating='post', padding='post')
test_padded = pad_sequences(test_sequences, maxlen=MAX_LENGTH, truncating='post', padding='post')

In [40]:
train_padded.shape, test_padded.shape

((25000, 150), (25000, 150))
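The effect of `truncating='post'` and `padding='post'` can be sketched in plain Python. This is an illustrative re-implementation under the assumption that the padding value is 0; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Keep the first maxlen ids of each sequence, then right-pad with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # truncating='post' drops ids from the end
        kept += [value] * (maxlen - len(kept))   # padding='post' appends the pad value
        padded.append(kept)
    return padded

print(pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4))
# [[5, 8, 2, 0], [7, 1, 9, 4]]
```

With `MAX_LENGTH = 150`, any review longer than 150 tokens loses its tail, and shorter reviews end in zeros.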

In [41]:
train_labels = train['sentiment']

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
x_train, x_valid, y_train, y_valid = train_test_split(train_padded, train_labels, stratify=train_labels, test_size=0.1, random_state=SEED)

### Word2Vec

In [44]:
from gensim.models import KeyedVectors

In [45]:
word2vec = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Kaggle/movie/GoogleNews-vectors-negative300.bin', binary=True)

In [46]:
EMBEDDING_DIM = 300
VOCAB_SIZE = len(tokenizer.word_index) + 1

embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))

In [49]:
for word, idx in tokenizer.word_index.items():
  embedding_vector = word2vec[word] if word in word2vec else None
  if embedding_vector is not None:
    embedding_matrix[idx] = embedding_vector

In [50]:
embedding_matrix.shape

(126313, 300)
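Words without a pretrained vector keep an all-zero row in `embedding_matrix`, so it may be worth checking how much of the vocabulary was actually matched. The sketch below uses toy stand-ins; `embedding_coverage`, `demo_index` and `demo_vocab` are hypothetical names. In the notebook the arguments would presumably be `tokenizer.word_index` and `word2vec`, which supports the same `in` test used in the loop above.

```python
def embedding_coverage(word_index, pretrained_vocab):
    """Return (found, total, fraction) of indexed words that have a pretrained vector."""
    total = len(word_index)
    found = sum(1 for word in word_index if word in pretrained_vocab)
    fraction = found / total if total else 0.0
    return found, total, fraction

# Toy stand-ins for tokenizer.word_index and the pretrained vectors.
demo_index = {"<OOV>": 1, "film": 2, "watch": 3, "moonwalker": 4}
demo_vocab = {"film", "watch"}
print(embedding_coverage(demo_index, demo_vocab))  # (2, 4, 0.5)
```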

### Model

In [51]:
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint

In [52]:
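model = Sequential([
  # Frozen embedding layer initialised with the pretrained word2vec matrix
  Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH,
            weights=[embedding_matrix],
            trainable=False),
  # Two stacked bidirectional LSTMs; the first returns the full sequence for the second
  Bidirectional(LSTM(128, return_sequences=True)),
  Bidirectional(LSTM(128)),
  Dropout(0.25),
  Dense(32, activation='relu'),
  Dense(1, activation='sigmoid')
])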

In [53]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 150, 300)          37893900  
_________________________________________________________________
bidirectional (Bidirectional (None, 150, 256)          439296    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 32)                8224      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 38,735,693
Trainable params: 841,793
Non-trainable params: 37,893,900
_________________________________________________________________
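The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

embedding = 126313 * 300               # VOCAB_SIZE * EMBEDDING_DIM, frozen
bilstm_1 = 2 * lstm_params(300, 128)   # two directions over 300-d embeddings
bilstm_2 = 2 * lstm_params(256, 128)   # input is the 256-d output of the first layer
dense_1 = dense_params(256, 32)
dense_2 = dense_params(32, 1)

trainable = bilstm_1 + bilstm_2 + dense_1 + dense_2
print(embedding, bilstm_1, bilstm_2, dense_1, dense_2)  # 37893900 439296 394240 8224 33
print(trainable, trainable + embedding)                 # 841793 38735693
```

Almost all of the 38.7M parameters sit in the frozen embedding; only about 0.84M are trained.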

In [54]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [55]:
checkpoint_path = 'tmp/checkpoint.ckpt'
checkpoint = ModelCheckpoint(filepath=checkpoint_path,
                             save_best_only=True,
                             save_weights_only=True,
                             monitor='val_loss',
                             verbose=1)

In [56]:
model.fit(x_train, y_train,
          validation_data=(x_valid, y_valid),
          batch_size=128,
          epochs=20,
          callbacks=[checkpoint])

Epoch 1/20

Epoch 00001: val_loss improved from inf to 0.41242, saving model to tmp/checkpoint.ckpt
Epoch 2/20

Epoch 00002: val_loss improved from 0.41242 to 0.39713, saving model to tmp/checkpoint.ckpt
Epoch 3/20

Epoch 00003: val_loss did not improve from 0.39713
Epoch 4/20

Epoch 00004: val_loss did not improve from 0.39713
Epoch 5/20

Epoch 00005: val_loss improved from 0.39713 to 0.36004, saving model to tmp/checkpoint.ckpt
Epoch 6/20

Epoch 00006: val_loss improved from 0.36004 to 0.35803, saving model to tmp/checkpoint.ckpt
Epoch 7/20

Epoch 00007: val_loss improved from 0.35803 to 0.32013, saving model to tmp/checkpoint.ckpt
Epoch 8/20

Epoch 00008: val_loss did not improve from 0.32013
Epoch 9/20

Epoch 00009: val_loss did not improve from 0.32013
Epoch 10/20

Epoch 00010: val_loss did not improve from 0.32013
Epoch 11/20

Epoch 00011: val_loss did not improve from 0.32013
Epoch 12/20

Epoch 00012: val_loss did not improve from 0.32013
Epoch 13/20

Epoch 00013: val_loss did n

<tensorflow.python.keras.callbacks.History at 0x7f076d5cb790>

In [57]:
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f0769c4cd10>

In [58]:
model.evaluate(x_valid, y_valid)



[0.3201339542865753, 0.8651999831199646]

In [59]:
prediction = model.predict(test_padded)

In [60]:
prediction[prediction >= 0.5] = 1
prediction[prediction < 0.5] = 0

In [61]:
prediction

array([[1.],
       [0.],
       [1.],
       ...,
       [0.],
       [1.],
       [0.]], dtype=float32)
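The thresholding above overwrites the predicted probabilities in place. If the raw scores are still needed later, a non-mutating variant is an option; the following is a small pure-Python sketch with a hypothetical helper `to_labels`, assuming the same 0.5 cut-off.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 0/1 labels without modifying the input."""
    return [1 if p >= threshold else 0 for p in probabilities]

scores = [0.91, 0.07, 0.5, 0.49]
print(to_labels(scores))  # [1, 0, 1, 0]
print(scores)             # [0.91, 0.07, 0.5, 0.49] (unchanged)
```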

In [62]:
submission = pd.read_csv('/content/drive/MyDrive/Kaggle/movie/sampleSubmission.csv')

In [63]:
submission['sentiment'] = prediction.ravel()  # flatten the (25000, 1) prediction array to one value per row

In [64]:
submission['sentiment'] = submission['sentiment'].astype('int')

In [65]:
submission.to_csv('movie_word2vec.csv', index=False)