<img src='https://user-images.githubusercontent.com/6457691/90080969-0f758d00-dd47-11ea-8191-fa12fd2054a7.png' width = '200' align = 'right'>

## *DATA SCIENCE / SECTION 4 / SPRINT 2 / Assignment 3*

--- 

# Language Modeling with RNN



## Code

The following link is a Kaggle notebook that classifies spam messages with an LSTM. => [Link](https://www.kaggle.com/kredy10/simple-lstm-for-text-classification) <br/>

Using the code in that notebook as a reference,<br/>
classify the Kaggle dataset [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).

- Text data used for classification: the **`Review Text`** column.
- Label data: the **`Recommended IND`** column.

In [20]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
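# Import every Keras symbol from tensorflow.keras so that layers, optimizer and callbacks come from the same package.
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import EarlyStopping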
%matplotlib inline

In [3]:
from google.colab import files

file = files.upload()

Saving Womens Clothing E-Commerce Reviews.csv to Womens Clothing E-Commerce Reviews.csv


### 1) Data preprocessing

- Read the dataset into a DataFrame.
- Drop the columns that are not needed.

In [4]:
np.random.seed(42)
tf.random.set_seed(42)

In [5]:
### Complete the assignment here ###
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [7]:
df = df[['Review Text', 'Recommended IND']]
df.head()

| | Review Text | Recommended IND |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 1 |
| 1 | Love this dress! it's sooo pretty. i happene... | 1 |
| 2 | I had such high hopes for this dress and reall... | 0 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 1 |
| 4 | This shirt is very flattering to all due to th... | 1 |


In [9]:
df.isnull().sum()

Review Text        845
Recommended IND      0
dtype: int64

In [10]:
# `df` was created by selecting columns from another frame, so dropna(inplace=True)
# raises SettingWithCopyWarning; reassigning the result avoids the warning.
df = df.dropna()
df.isnull().sum()


Review Text        0
Recommended IND    0
dtype: int64

In [12]:
df_copy = df.copy()
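# Replace every character that is not a letter or a space with a space.
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.
df_copy['Review Text'] = df_copy['Review Text'].str.replace("[^a-zA-Z ]", " ", regex=True)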
df_copy.head()

  


| | Review Text | Recommended IND |
|---|---|---|
| 0 | Absolutely wonderful   silky and sexy and comf... | 1 |
| 1 | Love this dress  it s sooo pretty   i happene... | 1 |
| 2 | I had such high hopes for this dress and reall... | 0 |
| 3 | I love  love  love this jumpsuit  it s fun  fl... | 1 |
| 4 | This shirt is very flattering to all due to th... | 1 |
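
The effect of the cleanup pattern can be checked on a single string. This is a minimal sketch using the standard `re` module instead of pandas; `keep_letters` and the sample sentence are names and data introduced here for illustration only.

```python
import re


def keep_letters(text):
    """Replace every character that is not an ASCII letter or a space with a space."""
    return re.sub(r"[^a-zA-Z ]", " ", text)


sample = "Love this dress! it's sooo pretty."
cleaned = keep_letters(sample)

print(repr(cleaned))
print(cleaned.split())
```

Note that apostrophes are replaced as well, so a contraction such as `it's` becomes the two tokens `it` and `s`.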


### 2) Perform text classification

- When splitting the dataset, set `test_size` to 20% and `random_state = 42`.
- Set the Tokenizer's `num_words=3000`.
- Set `maxlen=400` for `pad_sequences`.
- For training, use `batch_size=128, epochs=10, validation_split=0.2`.
- Apply EarlyStopping with `monitor='val_loss', min_delta=0.0001, patience=3`.
- Report the loss and accuracy returned by `evaluate` in the form [loss, acc]. Ex) [0.4321, 0.8765]

In [27]:
### Complete the assignment here ###
# stratify keeps the label ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    df_copy['Review Text'], df_copy['Recommended IND'],
    test_size=0.2, random_state=42, stratify=df_copy['Recommended IND'])
X_train.shape, X_test.shape

((18112,), (4529,))

In [28]:
num_words = 3000
token = Tokenizer(num_words)
token.fit_on_texts(X_train)


X_train_enc = token.texts_to_sequences(X_train)
X_test_enc = token.texts_to_sequences(X_test)

In [29]:
vocab_size = len(token.word_index) + 1
vocab_size

12392
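
The vocabulary size printed above counts every distinct word seen while fitting, whereas `num_words` only limits which indices `texts_to_sequences` emits. The following is a simplified pure-Python sketch of that behavior, not the Keras implementation; `build_word_index`, `encode`, and the toy reviews are hypothetical and exist only for illustration.

```python
from collections import Counter


def build_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index, num_words):
    """Keep only words whose index is below num_words; rarer words are dropped."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_reviews = ["good good good fit", "good fit bad"]  # made-up data
word_index = build_word_index(toy_reviews)
encoded = encode(toy_reviews, word_index, num_words=3)

print(len(word_index) + 1)  # size of the full vocabulary plus the padding index
print(encoded)              # only indices 1 and 2 survive
```

Because no emitted index reaches `num_words`, an embedding table with `num_words` rows, as used in the model below, is large enough even though the full vocabulary is bigger.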

In [30]:
X_train = sequence.pad_sequences(X_train_enc, maxlen=400, padding='post')
X_test = sequence.pad_sequences(X_test_enc, maxlen=400, padding='post')
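
What `maxlen` and `padding='post'` do to each row can be sketched with NumPy alone. This is a simplified stand-in for `pad_sequences`, assuming its default `truncating='pre'`; `pad_post` and the toy token ids are introduced here for illustration.

```python
import numpy as np


def pad_post(sequences, maxlen):
    """Simplified stand-in for pad_sequences(..., maxlen=maxlen, padding='post')."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]           # truncating='pre': keep the last maxlen tokens
        out[row, :len(kept)] = kept    # padding='post': zeros remain on the right
    return out


toy = [[5, 2, 9], [7, 1, 4, 8, 3, 6]]  # made-up token ids
padded = pad_post(toy, maxlen=4)
print(padded)
```

With post-padding, the zeros come after the text, so the LSTM reads them last unless padding is masked. Setting `mask_zero=True` on the `Embedding` layer or switching to `padding='pre'` are options that could be tried; neither is used in this notebook.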

In [31]:
model = Sequential()
model.add(Embedding(num_words, 128)) # Output shape after the Embedding layer: (batch_size, maxlen, embedding_size=128)
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) # Output shape after the LSTM layer: (batch_size, hidden_size=128), because only the last hidden state is returned
model.add(Dense(1, activation='sigmoid'))

In [32]:
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['acc'])

In [33]:
# Note: the assignment text specifies patience=3; this run uses patience=5.
early_stop = EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5, verbose=1)

In [34]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 128)         384000    
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 515,713
Trainable params: 515,713
Non-trainable params: 0
_________________________________________________________________
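
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes from the model above; the helper function names are introduced here for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary index."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


counts = [
    embedding_params(3000, 128),  # Embedding(num_words=3000, 128)
    lstm_params(128, 128),        # LSTM(128) on 128-dimensional inputs
    dense_params(128, 1),         # Dense(1)
]
print(counts, sum(counts))  # [384000, 131584, 129] 515713
```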


In [35]:
model.fit(X_train, y_train, batch_size=128, epochs= 10,
          validation_split= 0.2, callbacks= [early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 6: early stopping
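
The stopping point can be reasoned about with a simplified version of the early-stopping rule: training stops once `patience` consecutive epochs fail to improve the best `val_loss` by more than `min_delta`. The per-epoch losses are not shown in the log above, so the values below are hypothetical, and `early_stopping_epoch` is an illustrative helper rather than the Keras callback.

```python
def early_stopping_epoch(val_losses, min_delta=1e-4, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses that are best at epoch 1 and get worse afterwards.
hypothetical = [0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46]
stop_p5 = early_stopping_epoch(hypothetical, patience=5)
stop_p3 = early_stopping_epoch(hypothetical, patience=3)
print(stop_p5, stop_p3)  # 6 4
```

Under this rule, stopping at epoch 6 with `patience=5` would be consistent with `val_loss` not improving after the first epoch. With the same losses, `patience=3` would stop at epoch 4.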


<keras.callbacks.History at 0x7f27f1f53810>

In [36]:
model.evaluate(X_test, y_test)



[0.4772103428840637, 0.8189445734024048]
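
An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent label. The following is a minimal sketch of that check; `majority_baseline_accuracy` is a name introduced here, and the toy labels are made up. In the notebook, `y_test` could be passed in instead.

```python
import numpy as np


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size


toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]  # made-up labels: six 1s and two 0s
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)  # 0.75
```

If the model's test accuracy is close to this baseline, the model may be doing little more than predicting the majority class.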