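Problem Overview
1. OTT platforms such as Netflix and Watcha store a large volume of user review data for each title (movies, dramas, and so on).
2. Each review consists of text and a rating, and the rating shown for a title is presumably an aggregate value: the mean rating over its reviews.
3. When a review consists of text only, with no rating, it is hard to estimate the expected rating of the title from it.
4. Therefore, a model that is trained on reviews containing both text and a rating, and that predicts the rating from the text, is expected to create new value such as the following:
    * Text scattered across the web can be collected and used to estimate a title's rating.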

Approach
1. Regression: the rating is a discrete, ordered value (1, 2, 3, 4, 5)
2. Algorithms
    * FastText Embedding
    * Bidirectional LSTM
3. Metric for model performance: Mean Absolute Error

Goal
* Build a regression model that minimizes the Mean Absolute Error.
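
For reference, the Mean Absolute Error is the average of the absolute differences between the true and the predicted ratings. A minimal NumPy sketch follows; the function name and the sample values are illustrative and are not taken from the notebook.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values only: the absolute errors are 0.5, 0.0 and 1.0
example_mae = mean_absolute_error([5, 3, 1], [4.5, 3.0, 2.0])
print(example_mae)
```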

# 0. System Settings

In [1]:
# Set up the Kaggle API credentials used later to submit the model's predictions
! pip install kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # restrict file permissions to avoid the Kaggle permission warning

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive/')



Saving kaggle.json to kaggle.json
Mounted at /content/drive/


# 1. Import All the Required Libraries

In [2]:
# Basic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import joblib

# Text Data Handling
import re
from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Data Splitting
from sklearn.model_selection import train_test_split

# Models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN, LSTM, GRU, Bidirectional, Conv1D, MaxPooling1D, GlobalMaxPooling1D, GlobalAveragePooling1D, Flatten
from tensorflow.keras.layers import Dropout, SpatialDropout1D

# Activation Functions
from tensorflow.keras.activations import sigmoid

# Losses
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Optimizers & Metrics
from tensorflow.keras.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam, Nadam
from tensorflow.keras.metrics import mae, mse

# Callbacks
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# 2. Load Data

In [3]:
FolderPath = '/content/drive/MyDrive/03. Kookmin AI Big Data MBA/Semester 3_032021-062021/2. Deep Learning/Jupyter Notebook/Final Exam/data'

FName_train = 'train_set.xlsx'
FName_test = 'test_set.xlsx'

FPath_train = FolderPath + '/' + FName_train
FPath_test = FolderPath + '/' + FName_test

df_train = pd.read_excel(FPath_train)
df_test = pd.read_excel(FPath_test)

print('Train Set:', df_train.shape)
print('Test Set:', df_test.shape)

Train Set: (26320, 6)
Test Set: (11283, 5)


In [4]:
# Inspect the train set
df_train.head()

Unnamed: 0,rid,user_id,region_id,review_date,rating,text
0,R29239,U12528,P00274,2017-12-07,5,"If you want to try American pizza, this is the..."
1,R28062,U04925,P01295,2016-01-28,5,I was worried because it was a famous Wargnac ...
2,R33335,U12241,P01702,2015-10-16,5,I've tried both the hotel breakfast buffet and...
3,R12178,U00501,P01122,2016-02-18,5,Reservation Required There are occasions when ...
4,R06151,U23143,P01652,2017-03-08,5,The soup was really cold and the food we ate w...


In [5]:
# Inspect the test set (it has no rating column)
df_test.head()

Unnamed: 0,rid,user_id,region_id,review_date,text
0,R05976,U18517,P00793,2016-07-01,This is a room-type bar located in Dunsan-dong...
1,R29314,U12447,P01856,2017-05-29,Tansuyuk was great. They cut the pork thicker ...
2,R26743,U20023,P01924,2020-01-09,"The identity of the pork cutlet, which turns p..."
3,R03659,U19996,P01924,2018-10-21,"The sirloin is also delicious, but the tenderl..."
4,R27959,U10815,P01924,2018-10-07,"I tasted the new menu, Katsu Sando. It is prob..."


In [6]:
# Extract only the columns that are needed

train_doc = df_train['text']
test_doc = df_test['text']
train_target = df_train['rating']

print(train_doc.shape, train_target.shape)
print(test_doc.shape)

(26320,) (26320,)
(11283,)


In [7]:
# Inspect a few samples

for idx, (text, target) in enumerate(zip(train_doc, train_target)):
    if idx == 3: break
    print(text)
    print(target)

If you want to try American pizza, this is the place to go. Cheese pizza and Hawaiian pizza are really good. It goes really well with beer. It's not greasy and it's sweet.
5
I was worried because it was a famous Wargnac house, but the day I went there was a huge cold wave, so I didn't have to wait at all and entered right away! The feeling of the udon noodles made right on the spot was impressive. The broth had a strong flavor of katsuobushi, and it felt like authentic Japanese udon noodles. It was the best quality of the udon noodles I had in Korea. I'm salivating as I write this review... A really famous house has a reason
5
I've tried both the hotel breakfast buffet and the dinner buffet, but the dinner is still good. It's expensive, but it's a good memory for the anniversary. My husband said he liked lobster and I liked the steak. I recommend it.
5


# 3. FastText Embedding

In [8]:
# User-defined function (1): lowercase a document and return its runs of English letters as a list
def GetEnglishIntoList(doc):
    doc = doc.lower()  # np.str was a deprecated alias of the built-in str (removed in NumPy 1.24)
    return re.findall(
        r'[a-z]+',
        doc
    )

# User-defined function (2): join the English tokens back into one space-separated string
def ConcatenateEnglish(doc):
    return ' '.join(GetEnglishIntoList(doc))

# Keep only the English tokens of each document. (The FastText embedding is fit on the train set only, so that no information leaks from the test set.)
trainEngList_input = train_doc[train_doc.notnull()].map(GetEnglishIntoList)
trainEngConcat_input = train_doc[train_doc.notnull()].map(ConcatenateEnglish)
testEngList_input = test_doc[test_doc.notnull()].map(GetEnglishIntoList)
testEngConcat_input = test_doc[test_doc.notnull()].map(ConcatenateEnglish)

print(trainEngList_input.shape, trainEngConcat_input.shape)
print(testEngList_input.shape, testEngConcat_input.shape)

(26320,) (26320,)
(11283,) (11283,)
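
To make the filtering step concrete, the following standalone sketch applies the same lowercase-then-regex idea to a single sample string. The helper name `english_tokens` and the sample text are hypothetical. Digits and punctuation are dropped, so an apostrophe splits "It's" into two tokens.

```python
import re

def english_tokens(doc):
    """Lowercase the text and return every run of ASCII letters."""
    return re.findall(r'[a-z]+', doc.lower())

sample = "It's not greasy, 5/5!"
tokens = english_tokens(sample)
print(tokens)            # token list, as fed to FastText
print(' '.join(tokens))  # joined string, as fed to the Keras Tokenizer
```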


In [11]:
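# Fit the FastText model (on the train set only).
# Note: `size` is the gensim 3.x argument name. gensim 4.x renamed it to `vector_size`
# and renamed `sentences` in build_vocab()/train() to `corpus_iterable`.

model = FastText(
    size=128
)

model.build_vocab(
    sentences=trainEngList_input
)

model.train(
    sentences=trainEngList_input,
    epochs=10,
    total_examples=model.corpus_count,
    total_words=model.corpus_total_words
)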
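
FastText represents a word through its character n-grams, which is what lets it produce a vector even for a word that is missing from the training vocabulary. The sketch below only illustrates how such n-grams are extracted, with `<` and `>` as word-boundary markers. It is not gensim's implementation, and the function name and the n-gram range are assumptions chosen for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)
```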

In [12]:
# Sanity-check the FastText model with a few word-pair similarities.
print('Similarity between Woman and Pretty:', model.wv.similarity('woman', 'pretty'))
print('Similarity between Love and Romance:', model.wv.similarity('love', 'romance'))
print('Similarity between War and Battle:', model.wv.similarity('war', 'battle'))

Similarity between Woman and Pretty: 0.19724837
Similarity between Love and Romance: 0.019088026
Similarity between War and Battle: 0.5142833
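
The scores above are cosine similarities between word vectors. A small NumPy sketch with made-up 3-dimensional vectors shows the computation (the FastText vectors in this notebook have 128 dimensions); the function name is hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_similarity([1, 0, 0], [2, 0, 0])
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])
opposite = cosine_similarity([1, 0, 0], [-1, 0, 0])
print(same_direction, orthogonal, opposite)
```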


# 4. Text Preprocessing

In [13]:
MaxFeatures = 5000

tokenizer = Tokenizer(num_words=MaxFeatures)
tokenizer.fit_on_texts(trainEngConcat_input)

train_input = tokenizer.texts_to_sequences(trainEngConcat_input)
test_input = tokenizer.texts_to_sequences(testEngConcat_input)

In [14]:
print(len(train_input))
print(len(test_input))
print(train_input[0])
print(test_input[0])

26320
11283
[37, 12, 90, 7, 159, 725, 162, 19, 5, 1, 27, 7, 41, 291, 162, 2, 2830, 162, 16, 42, 13, 4, 610, 42, 89, 18, 145, 4, 14, 32, 817, 2, 4, 14, 245]
[19, 5, 3, 383, 898, 183, 134, 10, 1361, 283, 2312, 1172, 376, 3, 199, 8, 421, 2, 206, 16, 341, 2, 4, 14, 3, 74, 27, 7, 38, 3, 191, 18, 184]
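
The sequences above come from a frequency-ranked vocabulary: the most frequent word gets index 1, and with `num_words=MaxFeatures` only words whose index is below that limit are kept when text is converted to sequences. The pure-Python sketch below mimics that behaviour on a toy corpus. It is a simplification (whitespace splitting only, no out-of-vocabulary token), and the function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency, starting at index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_corpus = ["good pizza good beer", "good beer"]
word_index = fit_word_index(toy_corpus)
sequences = texts_to_sequences(["good pizza good beer", "bad beer"],
                               word_index, num_words=3)
print(word_index)
print(sequences)
```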


In [15]:
word2idx_dict = {k:v for k, v in tokenizer.word_index.items()}
idx2word_dict = {v:k for k, v in tokenizer.word_index.items()}

In [16]:
# Padding

MaxLen = 50

train_input = pad_sequences(
    sequences=train_input,
    maxlen=MaxLen
)

test_input = pad_sequences(
    sequences=test_input,
    maxlen=MaxLen
)

print(train_input.shape)
print(test_input.shape)
print(train_input[0])
print(test_input[0])

(26320, 50)
(11283, 50)
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0   37   12   90    7  159  725  162   19    5    1   27    7   41
  291  162    2 2830  162   16   42   13    4  610   42   89   18  145
    4   14   32  817    2    4   14  245]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0   19    5    3  383  898  183  134   10 1361  283 2312
 1172  376    3  199    8  421    2  206   16  341    2    4   14    3
   74   27    7   38    3  191   18  184]
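
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, also drops tokens from the front, which is why the zeros appear on the left in the output above. A NumPy sketch of that behaviour follows; the function name `pad_pre` is hypothetical.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]  # truncate from the front
        if kept:
            padded[row, maxlen - len(kept):] = kept
    return padded

toy_padded = pad_pre([[1, 2], [1, 2, 3, 4, 5]], maxlen=4)
print(toy_padded)
```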


In [17]:
# The targets are sorted in descending order (5, 4, 3, 2, 1), so shuffle the rows before fitting the model in batches.

index_arr = np.arange(train_input.shape[0])
np.random.shuffle(index_arr)

train_input = train_input[index_arr]
train_target = train_target.to_numpy()[index_arr]  # positional indexing, independent of the pandas index labels

print(train_input.shape, train_target.shape)
print(test_input.shape)

(26320, 50) (26320,)
(11283, 50)
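
Shuffling also matters because `validation_split` in Keras takes the validation samples from the end of the arrays, before any shuffling done by `fit`. The sketch below shows the same index-permutation idea with a seeded generator so that the shuffle is reproducible. The seed value, the function name, and the toy arrays are illustrative assumptions.

```python
import numpy as np

def shuffle_together(inputs, targets, seed=42):
    """Apply one random permutation to inputs and targets so rows stay aligned."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    return inputs[order], targets[order]

toy_inputs = np.arange(10).reshape(5, 2)   # row i is [2i, 2i + 1]
toy_targets = np.array([5, 4, 3, 2, 1])    # sorted, like the original targets
shuffled_inputs, shuffled_targets = shuffle_together(toy_inputs, toy_targets)
print(shuffled_inputs)
print(shuffled_targets)
```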


In [18]:
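# Load the FastText embedding trained above into an embedding matrix.
# Row 0 stays all zeros because index 0 is reserved for padding.

word2coef_arr = np.zeros((MaxFeatures, 128))

for word, idx in word2idx_dict.items():
    if idx >= MaxFeatures: break  # word_index is ordered by frequency, so every later index is out of range too
    word2coef_arr[idx] = model.wv[word]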

print(word2coef_arr.shape)
print(word2coef_arr[0])
print(word2coef_arr[1])

(5000, 128)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
[-2.43533087e+00 -1.98359990e+00  4.89710271e-01 -1.17726576e+00
 -5.79237603e-02 -1.94636810e+00  2.40329552e+00  1.81767249e+00
  1.85885251e+00 -1.52091396e+00 -1.51907718e+00  1.92535329e+00
 -1.11377776e+00  1.19633186e+00  1.16249835e+00 -1.06277382e+00
 -2.19797063e+00  1.28341877e+00 -3.02105337e-01 -1.69168448e+00
 -9.12556529e-01  1.41161621e+00 -1.00257702e-01 -7.29845405e-01
  5.03498912e-01  1.18165386e+00 -9.97242153e-01 -2.00711322e+00
  3.60299468e+00  2.46818995e+00  7.94178322e-02 -1.32510290e-01
 -6.45863473e-01 -2.33771515e+00 -6.35422841e-02  3.56005043e-01
 -6.64089501

# 5. Create Model

In [31]:
model = Sequential()

model.add(Embedding(
    input_dim=MaxFeatures,
    output_dim=128,
    input_length=MaxLen,
    mask_zero=True
))
model.add(SpatialDropout1D(0.4))
model.add(Bidirectional(LSTM(196, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(1))

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 128)           640000    
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 50, 128)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 392)               509600    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 393       
Total params: 1,149,993
Trainable params: 1,149,993
Non-trainable params: 0
_________________________________________________________________


In [32]:
# Load the FastText embedding into the model and freeze the embedding layer

model.layers[0].set_weights([word2coef_arr])
model.layers[0].trainable = False

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 128)           640000    
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 50, 128)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 392)               509600    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 393       
Total params: 1,149,993
Trainable params: 509,993
Non-trainable params: 640,000
_________________________________________________________________
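
The parameter counts in the summary can be checked by hand. An LSTM has four gates, each with weights over the concatenated input and hidden state plus a bias, and the bidirectional wrapper doubles that count. A short arithmetic check using the layer sizes from the model above (the variable names are chosen for illustration):

```python
vocab_size = 5000      # MaxFeatures
embedding_dim = 128
lstm_units = 196

embedding_params = vocab_size * embedding_dim
# Four gates, each with (input + hidden + bias) weights per unit
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bilstm_params = 2 * lstm_params_one_direction
# Dense(1) on the concatenated forward and backward states, plus one bias
dense_params = 2 * lstm_units + 1

total_params = embedding_params + bilstm_params + dense_params
trainable_after_freezing = bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params)
print(total_params, trainable_after_freezing)
```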


# 6. Compile and Fit Model

In [33]:
model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss='mse',
    metrics=['mae']
)

In [34]:
STP = EarlyStopping(
    monitor='val_mae',
    patience=4,
    restore_best_weights=True
)

# Submit the predicted values of Test Set.

TestResultAll_list = []

for iRepeat in range(1):

    history = model.fit(
        train_input, train_target,
        epochs=50,
        batch_size=128,
        validation_split=0.2,
        callbacks=[STP]
    )

    TestResultAll_list.append(model.predict(test_input).flatten())

test = df_test[['rid']].copy()  # copy the slice so that adding a column does not trigger pandas' SettingWithCopyWarning
test['rating'] = np.mean(TestResultAll_list, axis=0)
test.to_csv(FolderPath + '/' + 'submission.csv', index=False)

! kaggle competitions submit -c 2021-1-deeplearning -f '/content/drive/MyDrive/03. Kookmin AI Big Data MBA/Semester 3_032021-062021/2. Deep Learning/Jupyter Notebook/Final Exam/data/submission.csv' -m "From Google Colab"

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50




100% 182k/182k [00:01<00:00, 183kB/s]
Successfully submitted to 2021-1 DeepLearning
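
The early-stopping rule used above (`patience=4`, `restore_best_weights=True`) can be sketched in plain Python: training stops once the monitored value has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are the ones kept. The validation values below are made up, and this is a simplified illustration rather than the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best value)."""
    best_value = float('inf')
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_losses, start=1):
        if value < best_value:
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical val_mae curve: the best value is at epoch 3
fake_val_mae = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.74, 0.60]
stopped_epoch, best_epoch = simulate_early_stopping(fake_val_mae, patience=4)
print(stopped_epoch, best_epoch)
```

If the run above ended because of early stopping, finishing after epoch 34 with a patience of 4 would mean that the best `val_mae` was reached at epoch 30.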

# 7. Create the Final Predictor

In [43]:
def PredictRating(doc):
    # Predict ratings for one review string (or a list of review strings),
    # applying the same preprocessing that was used for training.

    doc = pd.Series(doc)

    docEngConcat_input = doc.map(ConcatenateEnglish)

    doc_input = tokenizer.texts_to_sequences(docEngConcat_input)

    doc_input = pad_sequences(
        sequences=doc_input,
        maxlen=MaxLen
    )

    return model.predict(doc_input)

In [44]:
doc = 'I love this movie! This was unbelievable and I would love to see this once again soon!'
PredictRating(doc)

array([[5.3520536]], dtype=float32)

In [47]:
doc = 'That was terrible. I should have seen another movie.'
PredictRating(doc)

array([[3.2476711]], dtype=float32)

In [46]:
doc = 'The plot was not bad, but it is not worth watching twice. recommend this movie to kill your time.'
PredictRating(doc)

array([[3.13997]], dtype=float32)
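
Because the model is a regressor with an unbounded linear output, a prediction can fall outside the 1 to 5 rating scale, as the 5.35 above shows. One possible post-processing step, which is not part of the original notebook, is to clip predictions to the valid range. The function name and the third sample value are hypothetical.

```python
import numpy as np

def clip_to_rating_scale(predictions, low=1.0, high=5.0):
    """Clamp regression outputs to the valid rating range."""
    return np.clip(np.asarray(predictions, dtype=float), low, high)

# The first two values are predictions shown above; 0.2 is a made-up example
clipped = clip_to_rating_scale([5.3520536, 3.2476711, 0.2])
print(clipped)
```

Since every true rating lies between 1 and 5, clipping can only move a prediction closer to its target, so it cannot increase the Mean Absolute Error.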