# [Exp10]WordTranslator

## Overview

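* Move from a character-level model to a word-level model
* Same dataset as before
* Preprocessing differs from the character-level version
* An embedding layer is added
* The vocabulary becomes larger, so training is somewhat slower.
* Only the first 33,000 samples of the data are used
* Of these, 3,000 are held out as test data and used to evaluate the model after training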
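
The 33,000 / 3,000 split described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper name `split_train_test`, the fixed seed, the shuffle before splitting, and the dummy sentence pairs are all assumptions made for the example.

```python
import random

def split_train_test(pairs, n_test=3000, seed=42):
    """Shuffle a copy of `pairs` and hold out `n_test` items for testing.

    Hypothetical helper: the notebook does not say how the 3,000 test
    samples are chosen, so a seeded shuffle is assumed here.
    """
    if n_test > len(pairs):
        raise ValueError("n_test cannot exceed the number of samples")
    shuffled = list(pairs)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in data: 33,000 dummy (eng, fra) pairs instead of the real corpus.
pairs = [(f"eng {i}", f"fra {i}") for i in range(33000)]
train_pairs, test_pairs = split_train_test(pairs)
print(len(train_pairs), len(test_pairs))  # 30000 3000
```

Shuffling first is a design choice: if the corpus happens to be ordered (for example by sentence length), taking the last 3,000 rows unshuffled would give a test set that looks different from the training set.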

## Importing libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

## Loading the data

### Load the data.

In [2]:
file_path = 'data/fra.txt'
df_data = pd.read_csv(file_path, names=['eng', 'fra', 'cc'], sep='\t')
len(df_data)
df_data.sample(5)

Unnamed: 0,eng,fra,cc
106347,Your sweater is on backwards.,Ton chandail est à l'envers.,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
68716,Can we speak in the hall?,Pouvons-nous parler dans le hall ?,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
164006,My car was badly damaged in the accident.,Ma voiture a été sérieusement endommagée dans ...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
169124,She believed him when he said he loved her.,Elle le crut lorsqu'il lui dit qu'il l'aimait.,CC-BY 2.0 (France) Attribution: tatoeba.org #8...
118102,There wasn't any special hurry.,Il n'y avait pas le feu au lac.,CC-BY 2.0 (France) Attribution: tatoeba.org #3...


### Drop the CC column and keep the first 33,000 rows

In [3]:
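# Keep only the sentence columns and the first 33,000 rows.
# .copy() makes later column assignments act on an independent DataFrame, not a slice.
df_data = df_data[['eng', 'fra']][:33000].copy()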
df_data.sample(5)

Unnamed: 0,eng,fra
18181,How good are you?,Jusqu'à quel point êtes-vous bonne ?
19192,I'm by your side.,Je suis à tes côtés.
11176,It's all wrong.,C'est complètement faux.
30649,It was all a dream.,Ce n'était qu'un rêve.
6761,Do you see it?,Est-ce que vous le voyez ?


### Add start and end tokens

In [4]:
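sos_token = '\t'
eos_token = '\n'
# Wrap each French sentence in the start and end tokens, separated by spaces.
df_data['fra'] = df_data['fra'].apply(lambda x: sos_token + ' ' + x + ' ' + eos_token)
print('Total number of samples:', len(df_data))
df_data.sample(5)

Total number of samples: 33000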


Unnamed: 0,eng,fra
25477,She went shopping.,\t Elle est allée faire des courses. \n
16204,Tom looks dazed.,\t Tom a l'air hébété. \n
20142,She worships him.,\t Elle le vénère. \n
22012,You need a drink.,\t Il vous faut boire quelque chose. \n
18142,Hi! I'm new here.,"\t Salut ! Je suis nouveau, ici. \n"
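
As a hedged illustration of why the target sentences carry both markers: in a typical seq2seq setup trained with teacher forcing, the decoder input is the sequence without its end token and the decoder target is the same sequence without its start token. The sketch below assumes that setup. The function name `decoder_input_and_target` is hypothetical and does not appear in the notebook.

```python
SOS, EOS = '\t', '\n'   # same markers as sos_token / eos_token above

def decoder_input_and_target(tagged_sentence):
    """Return (decoder_input, decoder_target) token lists for one tagged sentence.

    Hypothetical helper for illustration only.
    """
    # Split on single spaces: a bare .split() would treat '\t' and '\n'
    # as whitespace and silently drop both markers.
    tokens = tagged_sentence.split(' ')
    if tokens[0] != SOS or tokens[-1] != EOS:
        raise ValueError('sentence must start with SOS and end with EOS')
    return tokens[:-1], tokens[1:]

dec_in, dec_out = decoder_input_and_target('\t Va ! \n')
print(dec_in)   # ['\t', 'Va', '!']
print(dec_out)  # ['Va', '!', '\n']
```

At each position the decoder sees `dec_in[i]` and is trained to predict `dec_out[i]`, so the start token prompts the first word and the end token teaches the model when to stop.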


## Step 1. Cleaning, normalization, and preprocessing (both English and French!)

### 1. Separate punctuation from words.

In [21]:
df_data.head(10)

Unnamed: 0,eng,fra
0,Go.,\t Va ! \n
1,Go.,\t Marche. \n
2,Go.,\t Bouge ! \n
3,Hi.,\t Salut ! \n
4,Hi.,\t Salut. \n
5,Run!,\t Cours ! \n
6,Run!,\t Courez ! \n
7,Run!,\t Prenez vos jambes à vos cous ! \n
8,Run!,\t File ! \n
9,Run!,\t Filez ! \n


In [23]:
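# Replace every character that is neither a word character nor whitespace with a space.
# Note: this removes punctuation (including apostrophes) rather than keeping it as a separate token.
# '\t' and '\n' count as whitespace, so the start and end tokens are preserved.
df_updated = df_data.replace(to_replace=r'[^\w\s]', value=' ', regex=True)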

In [24]:
df_updated.head(10)

Unnamed: 0,eng,fra
0,Go,\t Va \n
1,Go,\t Marche \n
2,Go,\t Bouge \n
3,Hi,\t Salut \n
4,Hi,\t Salut \n
5,Run,\t Cours \n
6,Run,\t Courez \n
7,Run,\t Prenez vos jambes à vos cous \n
8,Run,\t File \n
9,Run,\t Filez \n
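
The output above shows punctuation being dropped. If the goal is instead to keep each punctuation mark as its own token, one possible approach is to pad it with spaces using a capturing group. The sketch below relies only on the standard `re` module. The function name `separate_punctuation` is an assumption for illustration and not part of the notebook.

```python
import re

def separate_punctuation(sentence):
    """Put spaces around each punctuation character so it becomes its own token.

    Hypothetical helper; an alternative to replacing punctuation with a space.
    """
    # Surround every character that is neither a word character nor whitespace with spaces.
    spaced = re.sub(r'([^\w\s])', r' \1 ', sentence)
    # Collapse runs of spaces only, so the '\t' and '\n' markers are left alone.
    return re.sub(r' +', ' ', spaced).strip(' ')

print(separate_punctuation('Go.'))                  # Go .
print(separate_punctuation("Tom a l'air hébété."))  # Tom a l ' air hébété .
```

Applied to the DataFrame, this could be used column by column, for example `df_data['eng'].apply(separate_punctuation)`. Sentences that are already tagged and spaced, such as `'\t Va ! \n'`, pass through unchanged.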
