# SC42x 
## Natural Language Processing

# Part 1 : Concept Summary

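> Summarize each of the following keywords in **one line**. (You may refer to the session notes.)<br/>
> **Tip : Work on the problems below first, then fill in this section while long-running cells such as model training are executing to save time.**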

**N421**
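- Stopwords : Words that appear frequently but contribute little to the analysis.
- Stemming and Lemmatization : Stemming considers only the word itself (cutting it down to a stem by rule), whereas lemmatization also determines which part of speech the word takes in the sentence.
- Bag-of-Words : A numerical representation of text that ignores word order entirely and focuses only on how often each word occurs (frequency).
- TF-IDF : The product of TF and IDF; a word with a high score appears frequently in the given document but rarely in other documents.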
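
The Bag-of-Words and TF-IDF definitions above can be sketched in plain Python. This is a minimal illustration on a made-up three-document corpus; the helper names `tf`, `idf`, and `tfidf` are hypothetical, and the unsmoothed `log(N / df)` form is an assumption chosen for readability. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already lowercased and whitespace-tokenized.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# Bag-of-Words: per-document term counts, word order discarded.
bow = [Counter(tokens) for tokens in tokenized]

def tf(term, counts):
    """Term frequency: share of the document's tokens that are `term`."""
    return counts[term] / sum(counts.values())

def idf(term, all_counts):
    """Inverse document frequency, plain log(N / df) with no smoothing.
    Assumes the term occurs in at least one document."""
    df = sum(1 for counts in all_counts if term in counts)
    return math.log(len(all_counts) / df)

def tfidf(term, counts, all_counts):
    return tf(term, counts) * idf(term, all_counts)

print(bow[0])
print(round(tfidf("cat", bow[0], bow), 4))
print(round(tfidf("the", bow[0], bow), 4))
```

In the first document "the" occurs twice and "cat" once, yet "cat" gets the higher score because it appears in no other document.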

**N422**
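- Word2Vec : Word2Vec learns from the relationship between a target word and the words around it (e.g. the two words on each side when window size = 2), so it reflects the distributional hypothesis well.
- fastText : A method that uses character-level (subword) embeddings as auxiliary information so that embedding vectors can be produced even for words that do not appear in the corpus.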
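
As a rough illustration of the two ideas above, the sketch below builds (center, context) pairs for a window of 2, which is the raw material a Word2Vec-style model trains on, and the character n-grams a fastText-style model would embed. The function names are hypothetical, and the n-gram range of 3 to 6 is an assumption that mirrors commonly cited fastText defaults.

```python
def context_pairs(tokens, window=2):
    """(center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(context_pairs("the quick brown fox".split(), window=2)[:4])
print(char_ngrams("where", n_min=3, n_max=3))
```

Because an unseen word still shares n-grams with known words, its vector can be assembled from those pieces.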

**N423**
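- RNN : Recurrent neural network. A neural network designed to pass order information along a sequence.
- LSTM, GRU : Models that mitigate problems of recurrent neural networks such as vanishing gradients.
- Attention : Even with LSTM or GRU, a very long sentence is hard to fit into a single hidden-state vector, so in an encoder-decoder model the vectors produced at every time-step are all passed to the decoder, which weights them at each decoding step.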
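
The attention idea above can be sketched with NumPy. This is a minimal dot-product variant under toy assumptions (random states, hypothetical function name), not the specific formulation from the session notes: the decoder state scores every encoder hidden state, a softmax turns the scores into weights, and the weighted sum becomes the context vector.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Weight every encoder hidden state by its similarity to the decoder state."""
    scores = encoder_states @ decoder_state          # shape (T,)
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time-steps
    context = weights @ encoder_states               # shape (H,)
    return context, weights

rng = np.random.default_rng(42)
encoder_states = rng.normal(size=(6, 4))   # 6 time-steps, hidden size 4 (toy values)
decoder_state = rng.normal(size=4)
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```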

# Part 2 : Fake/Real News Dataset

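Over the past week of studying natural language processing, you have encountered a variety of techniques.<br/>
You learned how to handle text data, how to vectorize text, how to model topics in documents, and other NLP techniques.<br/>
In this sprint challenge, we will review what you learned using the [Fake/Real News Dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset).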

**Note : Focus on getting the models to run rather than on maximizing their performance.<br/>
Only if you "still have time" after completing every problem do we recommend trying to improve the accuracy.**

In [1]:
# Set the seeds before running any code.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

## 2.0 Load the dataset.

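- Download the dataset from the Kaggle link above and upload it.<br/>
(Uploading it directly takes quite a while, so we recommend **drive_mount** or the **Kaggle integration**.)

- Create a 'label' column and label Fake = 1, True = 0.
- Combine the two files into a single dataframe, then shuffle the data.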

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
fake = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/fake_and_real_news_dataset/Fake.csv')
real = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/fake_and_real_news_dataset/True.csv')


In [4]:
fake['label'] = 1
real['label'] = 0

In [5]:
news = pd.concat([real, fake])
news = news.sample(frac=1, random_state=42).reset_index(drop=True)

In [6]:
news.head(5)

Unnamed: 0,title,text,subject,date,label
0,BREAKING: GOP Chairman Grassley Has Had Enoug...,"Donald Trump s White House is in chaos, and th...",News,"July 21, 2017",1
1,Failed GOP Candidates Remembered In Hilarious...,Now that Donald Trump is the presumptive GOP n...,News,"May 7, 2016",1
2,Mike Pence’s New DC Neighbors Are HILARIOUSLY...,Mike Pence is a huge homophobe. He supports ex...,News,"December 3, 2016",1
3,California AG pledges to defend birth control ...,SAN FRANCISCO (Reuters) - California Attorney ...,politicsNews,"October 6, 2017",0
4,AZ RANCHERS Living On US-Mexico Border Destroy...,Twisted reasoning is all that comes from Pelos...,politics,"Apr 25, 2017",1


## 2.1 Searching for news similar to a given article using TF-IDF

To save time, we will perform the tasks below **without any special preprocessing**.

### 2.1.1 Build a Document-Term Matrix using TfidfVectorizer

In [7]:
# Write your answer here.
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vect = TfidfVectorizer(stop_words='english', max_features=300)

# Fit, then build the DTM (a tf-idf value is computed for every document-term pair).
dtm_news = tfidf_vect.fit_transform(news['text'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dtm_news = pd.DataFrame(dtm_news.toarray(), columns=tfidf_vect.get_feature_names_out())
dtm_news



Unnamed: 0,000,10,20,2015,2016,2017,according,act,action,actually,added,administration,agency,al,america,american,americans,anti,asked,attack,attacks,attorney,away,barack,based,believe,big,billion,black,border,business,called,came,campaign,candidate,care,case,change,chief,children,...,thursday,time,times,today,told,took,trade,trump,trying,tuesday,twitter,union,united,use,used,ve,video,violence,vote,voters,wall,want,war,washington,watch,way,wednesday,week,went,white,win,woman,women,won,work,working,world,year,years,york
0,0.000000,0.000000,0.000000,0.000000,0.203336,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.086631,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.075157,0.0,0.000000,0.000000,0.000000,0.0,0.533455,0.112736,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.253197,0.118695,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
1,0.000000,0.000000,0.000000,0.000000,0.098637,0.0,0.000000,0.0,0.000000,0.112444,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.084789,0.106879,0.000000,0.105458,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.145833,0.0,0.109199,0.000000,0.000000,0.0,0.172518,0.000000,0.0,0.094774,0.121452,0.000000,0.000000,0.0,0.000000,0.416868,0.247054,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.103932,0.259000,0.000000,0.0,0.0,0.081883,0.115157,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.224346,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.140103,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.075345,0.000000,0.0,0.372523,0.000000,0.000000,0.000000,0.0,0.138992,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.202920,0.000000,0.113115,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.095126,0.000000,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.193397,0.000000,0.0,0.616881,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.20662,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.176198,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.115945,0.000000,0.0,0.284269,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.18792,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.1496,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.593382,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.201387,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,0.000000,0.111792,0.116769,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.095437,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.116317,0.0,0.00000,0.0,0.107658,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.071751,0.000000,0.0,0.058639,0.000000,0.0,0.000000,0.000000,0.079620,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.078963,0.000000,0.000000,0.096360,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
44894,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.366591,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
44895,0.000000,0.000000,0.000000,0.282364,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.129099,0.0,0.0,0.0,...,0.115771,0.000000,0.0,0.000000,0.000000,0.119929,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
44896,0.192271,0.070876,0.000000,0.000000,0.000000,0.0,0.110696,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.055709,0.0,0.0,0.0000,0.000000,0.0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.321241,0.0,0.000000,0.070964,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.061993,0.000000,0.0,0.000000,0.045490,0.000000,0.0,0.037177,0.000000,0.0,0.000000,0.000000,0.000000,0.134008,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.050062,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.140810,0.052650,0.0


### 2.1.2 Search for similar documents using the KNN algorithm

- Show the **indices of the 5 documents** most similar to **the document at index 42** (including index 42 itself) and **the labels at those indices**.
- Set the NN model parameter `algorithm = 'kd_tree'`.

In [29]:
# Write your answer here.
from sklearn.neighbors import NearestNeighbors

# Fit the NN model on the DTM, using the 5 nearest neighbors (the default).
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_news)

# Query with a one-row DataFrame so the feature names match those seen in fit.
_, neighbor_idx = nn.kneighbors(dtm_news.iloc[[42]])
print('most relevant news: ', neighbor_idx)

news.loc[neighbor_idx.ravel(), :]

most relevant news:  [[   42 29927 22109   602 11519]]


Unnamed: 0,title,text,subject,date,label
42,Iraqi Kurds face more sanctions after calling ...,"ERBIL, Iraq (Reuters) - Iraq s autonomous Kurd...",worldnews,"October 3, 2017",0
29927,France offers to mediate between Baghdad and K...,PARIS (Reuters) - France offered on Thursday t...,worldnews,"October 5, 2017",0
22109,Iraq steps up retaliation against Kurdish inde...,BAGHDAD (Reuters) - Iraq stopped selling dolla...,worldnews,"October 3, 2017",0
602,Hundreds of suspected Islamic State militants ...,BAGHDAD (Reuters) - Hundreds of suspected Isla...,worldnews,"October 10, 2017",0
11519,"Iraq to pay Kurdish Peshmerga, civil servants,...",BAGHDAD (Reuters) - The Iraqi government plans...,worldnews,"October 31, 2017",0
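
For intuition about what the neighbor search above computes: with its default settings `NearestNeighbors` ranks rows by Euclidean distance, and `kd_tree` is an exact index structure, so a brute-force scan should return the same neighbors, only more slowly. The sketch below is a minimal NumPy version on a made-up 5x3 matrix; `nearest_neighbors` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def nearest_neighbors(matrix, query_idx, k=5):
    """Indices of the k rows closest (Euclidean) to row `query_idx`, itself included."""
    distances = np.linalg.norm(matrix - matrix[query_idx], axis=1)
    return np.argsort(distances, kind="stable")[:k]

# Toy stand-in for a document-term matrix (values are made up).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
])
print(nearest_neighbors(toy, query_idx=0, k=2))
```

The query row always comes first because its distance to itself is zero, which is why index 42 leads the result above.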


## 2.2 Classification using Keras Embedding

### 2.2.0 Dataset split

- Split the data into Train and Test datasets.

In [30]:
# Write your answer here.
from sklearn.model_selection import train_test_split

X = news[news.columns[:-1]]
y = news[news.columns[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((35918, 4), (8980, 4), (35918,), (8980,))

### 2.2.1 Classify using the average of word vectors

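Let's classify sentences using the average of word embedding vectors, as we did in N422.<br/>
Each instance contains long text and training takes a while, so we recommend keeping the number of epochs at **10 or fewer**.<br/>
Since the goal is only to get the model running, keep the embedding dimension small (50 or less).<br/>
**Recommendation : set `max_len` higher than the average text length.**<br/>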

> **Tip : You can save time by writing up section 2.2.3 while the model is training.**


In [40]:
text_len_df = X_train['text'].apply(lambda x:len(x.split()))
text_len_df.mean(), text_len_df.max()

(405.2764908959296, 8135)

In [31]:
X_train.head(3)

Unnamed: 0,title,text,subject,date
36335,Kellyanne Conway’s Husband Just Publicly Bash...,So the Conway marriage just took a turn and ...,News,"June 5, 2017"
12384,JEB BUSH WANTS CONGRESS TO APPROVE AMNESTY And...,Jeb Bush just unofficially placed himself on t...,politics,"Apr 17, 2015"
24419,"Henningsen on Trump’s Foreign Policy: Russia, ...",21st Century Wire says While the US media con...,Middle-east,"November 21, 2016"


In [46]:
X_train_tk

Unnamed: 0,text
36335,So the Conway marriage just took a turn and ...
12384,Jeb Bush just unofficially placed himself on t...
24419,21st Century Wire says While the US media con...
24740,JERUSALEM (Reuters) - Israeli Prime Minister B...
27039,Here s a compilation of President Trump s most...
...,...
11284,WASHINGTON (Reuters) - U.S. House Republican S...
44732,Two U.S. Marines are reportedly under investig...
38158,WASHINGTON (Reuters) - The Trump administratio...
860,BEIJING (Reuters) - A senior Chinese diplomat ...


In [49]:
# Write your answer here
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D

X_train_tk = X_train['text']

tokenizer = Tokenizer(num_words=8000)
tokenizer.fit_on_texts(X_train_tk)

max_len = 500

X_encoded = tokenizer.texts_to_sequences(X_train_tk)

X_train_pad = pad_sequences(X_encoded, maxlen=max_len, padding='post')

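# Embedding weight matrix. Note that it is all zeros: combined with trainable=False
# below, every token maps to the zero vector and only the Dense bias can be learned.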
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 30))

# create model
model = Sequential()
model.add(Embedding(vocab_size, 30, weights=[embedding_matrix],
                    input_length=max_len, trainable=False))
model.add(GlobalAveragePooling1D())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])


# train model
model.fit(X_train_pad, y_train, batch_size=64, epochs=5, validation_split=0.2)

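# evaluate model (reuse the tokenizer fitted on the training set; refitting it on the
# test texts can change the word indices the model was trained with)
X_test_tk = X_test['text']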
X_test_encoded = tokenizer.texts_to_sequences(X_test_tk)
X_test_pad = pad_sequences(X_test_encoded, maxlen=max_len, padding='post')

loss, acc = model.evaluate(X_test_pad, y_test)
print(loss, acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.6923892498016357 0.5199331641197205
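
The accuracy of about 0.52 is what one would expect from the model as built: the embedding matrix is all zeros and frozen, so the averaged vector is zero for every input and the output depends only on the `Dense` bias. The NumPy sketch below reproduces that forward pass with toy sizes and a made-up bias value (both are assumptions, not values from the notebook) and shows that two different inputs get the same prediction. A constant prediction can at best match the share of the majority class, which is likely what the reported accuracy reflects.

```python
import numpy as np

vocab_size, dim = 50, 30                      # toy sizes, not the notebook's
embedding = np.zeros((vocab_size, dim))       # frozen all-zero weights
dense_w = np.random.default_rng(0).normal(size=dim)
dense_b = 0.08                                # hypothetical learned bias

def predict(token_ids):
    pooled = embedding[token_ids].mean(axis=0)               # GlobalAveragePooling1D
    return 1 / (1 + np.exp(-(pooled @ dense_w + dense_b)))   # Dense(1, sigmoid)

a = predict(np.array([1, 2, 3, 4, 0, 0, 0, 0]))
b = predict(np.array([9, 8, 7, 6, 5, 4, 3, 2]))
print(a, b)
```

Leaving the weights at their default initialization and setting `trainable=True` would let the embedding actually learn.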


### 2.2.2 Perform text classification using an LSTM

Let's classify sentences using an LSTM, as we did in N423.<br/>
Each instance contains long text, so training takes a very long time; <br/>
we recommend **stacking as few layers as possible** and keeping the number of epochs at **3 or fewer**.<br/>

> **Tip : You can save time by writing up section 2.2.3 while the model is training.**


In [62]:
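# Write your answer here
# Import layers from tensorflow.keras only, to avoid mixing it with the standalone keras package.
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Embedding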
from tensorflow.keras.optimizers import RMSprop

def RNN():
    model = Sequential()
    model.add(Embedding(vocab_size, 5,
                    input_length=max_len, trainable=False))
    model.add(LSTM(3,dropout=0.2,recurrent_dropout=0.2, name='LSTM'))
    model.add(Dense(1,name='out_layer'))
    model.add(Activation('sigmoid'))

    

    return model

RNN().summary()


Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_14 (Embedding)    (None, 500, 5)            626680    
                                                                 
 LSTM (LSTM)                 (None, 3)                 108       
                                                                 
 out_layer (Dense)           (None, 1)                 4         
                                                                 
 activation_18 (Activation)  (None, 1)                 0         
                                                                 
Total params: 626,792
Trainable params: 112
Non-trainable params: 626,680
_________________________________________________________________
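
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for each layer; the vocabulary size of 125336 is inferred from 626680 / 5 rather than stated in the notebook, and the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate), each with
    # input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

print(embedding_params(125336, 5))  # 626680
print(lstm_params(5, 3))            # 108
print(dense_params(3, 1))           # 4
```

Only the 108 + 4 = 112 LSTM and Dense parameters are trainable, because the embedding layer is created with `trainable=False`.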


In [63]:
model_lstm = RNN()
model_lstm.compile(loss='binary_crossentropy',
              optimizer=RMSprop(), 
              metrics=['accuracy'])

model_lstm.fit(X_train_pad, y_train, batch_size=64, epochs=2, validation_split=0.2)

# evaluate model
loss, acc = model_lstm.evaluate(X_test_pad, y_test)
print(loss, acc)

Epoch 1/2
Epoch 2/2
0.692578911781311 0.5199331641197205


### 2.2.3 Let's revisit what we ran above.

#### a) Explain the `pad_sequences` function used when training on the dataset.<br/>What does it do? Why is it needed when training the model?

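Each text contains a different number of words, so any sequence shorter than max_len must be padded to fill the gap (and longer ones truncated) so that the data forms a matrix of uniform shape. `pad_sequences` is used for this.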
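
A minimal pure-Python sketch of that behavior, assuming the settings used in this notebook (`padding='post'`) together with what I understand to be the Keras default of truncating from the front (`truncating='pre'`). The helper `pad_post` is hypothetical and handles only lists of integers with `maxlen` of at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; keep the last `maxlen` items of longer sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # drop items from the front
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 3], [7, 1, 4, 9, 2], [8]], maxlen=4))
```

Every row now has length 4, so the batch can be stacked into a single array for the model.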

#### b) How did the evaluation performance of the models in 2.2.1 and 2.2.2 turn out?<br/>What do you think are the strengths and weaknesses of each model?

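Both models reached a test accuracy of about 0.52 in this run.  
The model in 2.2.1 classifies using a simple average of the embedding vectors, so it trains quickly, but it can hardly capture properties of a sentence that depend on word order.  
The model in 2.2.2 is an LSTM that carries order information forward, so it can perform somewhat better, but it cannot be parallelized across time steps and therefore takes much longer.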
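
The order-blindness of averaging can be shown directly. In this sketch the vocabulary and the random 4-dimensional vectors are made up for illustration; two sentences with opposite meanings but the same words end up with the same averaged vector.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
embedding = rng.normal(size=(len(vocab), 4))   # random toy vectors

def average_embedding(sentence):
    ids = [vocab[w] for w in sentence.split()]
    return embedding[ids].mean(axis=0)

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")
print(np.allclose(a, b))
```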

#### c) Why is LSTM (Long Short-Term Memory) used instead of a conventional RNN (Recurrent Neural Network)?<br/>(i.e. explain the advantages of LSTM over RNN.)

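A plain RNN often suffers from vanishing gradients, so information from early in the sequence may not be properly reflected in training.
LSTM addresses this by adding three gates (forget, input, and output), which lets it learn without losing information from the early part of the sequence.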
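
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM time-step. The stacking order of the gates inside `W`, `U`, and `b` is an arbitrary choice for this example and is not claimed to match Keras internals; the sizes (input 5, hidden 3) mirror the small model in 2.2.2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time-step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])          # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:])       # candidate cell values
    c = f * c_prev + i * g       # additive cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
D, H = 5, 3
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(10, D)):   # run over a toy sequence of 10 steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state is updated by addition rather than by repeated matrix multiplication, which is the usual explanation for why gradients survive over more steps than in a plain RNN.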

#### d) Give at least **3** examples of where LSTM or RNN is used, and briefly explain why LSTM or RNN is appropriate in each case.

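1. Image Captioning  
Learns descriptions paired with images and generates a caption suited to each image. Because the caption is text produced one word at a time, sequential models such as RNN and LSTM are a good fit.

2. Text Sentiment Analysis  
Learns texts together with their sentiment (positive or negative). The meaning can change with word order, so a model that takes order into account is needed.

3. Machine Translation  
A many-to-many task; since word order differs between languages, training that accounts for order is essential.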



#### e) List at least 3 other keywords related to the NLP models you learned about in N424. <br/> (Explaining the keywords is optional.)

Transformer, GPT, BERT

# Advanced Goals: To earn 3 points, you must satisfy at least one of the conditions below
 
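- Perform the similarity search in 2.1 using a method other than TF-IDF (`TfidfVectorizer`).<br/>
Explain the difference between TF-IDF and that method. 
- Reuse the approach from 2.2, but tune the hyperparameters or change the model architecture to improve performance.<br/>**(Note : You may use GridSearch, RandomSearch, and similar methods, but they take a long time, so choose the search range carefully.)**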

In [65]:
# Write your answer here

# Added a hidden layer and applied dropout.

def RNN2():
    model = Sequential()
    model.add(Embedding(vocab_size, 5,
                    input_length=max_len, trainable=False))
    model.add(LSTM(3,dropout=0.2,recurrent_dropout=0.2, name='LSTM'))
    model.add(Dense(10,name='FC1'))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1,name='out_layer'))
    model.add(Activation('sigmoid'))

    

    return model

model_lstm = RNN2()
model_lstm.compile(loss='binary_crossentropy',
              optimizer=RMSprop(), 
              metrics=['accuracy'])

model_lstm.fit(X_train_pad, y_train, batch_size=64, epochs=2, validation_split=0.2)

# evaluate model
loss, acc = model_lstm.evaluate(X_test_pad, y_test)
print(loss, acc)

Epoch 1/2
Epoch 2/2
0.6923375725746155 0.5210467576980591
