# Text Generation using LSTM Network
#### Text generation with deep learning

## What is LSTM?
  
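Text generation is a kind of language modeling problem.  
Language modeling is a core problem in many natural language processing tasks, such as text-to-speech, conversational systems, and text summarization.  
A trained language model learns how likely a word is to occur, given the sequence of words that precedes it in the text.  
Language models can operate at the character level, the n-gram level, the sentence level, or even the paragraph level.  
This notebook explains how to build a language model for generating natural language text by implementing and training a recurrent neural network (an LSTM, with a GRU for comparison).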
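
To make "the likelihood of a word given the preceding words" concrete, here is a minimal sketch that estimates next-word probabilities from bigram counts. It is only an illustration of the idea, not the neural model built later in this notebook; the toy corpus and the function names (`train_bigram_model`, `next_word_probs`) are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimate P(next | prev) from relative frequencies."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts[prev].items()}

# Toy corpus (hypothetical, for illustration only)
toy_corpus = [
    "what to watch on hulu",
    "what to watch on amazon",
    "what to read this week",
]
bigram_counts = train_bigram_model(toy_corpus)
print(next_word_probs(bigram_counts, "to"))   # "watch" is twice as likely as "read"
print(next_word_probs(bigram_counts, "on"))   # "hulu" and "amazon" are equally likely
```

A recurrent network plays the same role as this lookup table, but it conditions on the whole preceding sequence rather than on a single previous word.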

## This time we automatically generate news headlines from a few seed words  
  
#### Process  
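1. Prepare the data  
2. Clean the text (remove punctuation, convert to lowercase)  
3. Split the text into words  
4. Tokenize, i.e. convert the words to integers  
5. Pad the sequences to a uniform length  
6. Implement the LSTM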

## 1. Import the libraries

In [2]:
import pandas as pd
import numpy as np

## 2. Load the dataset

In [23]:
import os
path = os.getcwd()
print(path)

C:\MyWorks\PythonWorkspace\seq2seqHome\Text_Generation_using_GRU


In [4]:
ls

 Volume in drive C is Windows
 Volume Serial Number is E050-A3EE

 Directory of C:\MyWorks\PythonWorkspace\seq2seqHome\Text_Generation_using_GRU

2020-02-07  09:31 PM    <DIR>          .
2020-02-07  09:31 PM    <DIR>          ..
2020-02-07  09:23 PM    <DIR>          .ipynb_checkpoints
2020-02-07  09:31 PM            16,100 0315_Text_Generation_with_GRU.ipynb
2020-02-07  09:18 PM               742 attention.py
2020-02-07  09:18 PM             4,632 lstm.py
2020-02-07  09:18 PM             4,626 my_lstm.py
2020-02-07  09:24 PM               555 Neural-Text-Generation.ipynb
2019-10-01  06:05 PM        83,917,554 News_Category_Dataset_v2.json
2020-02-07  09:25 PM        26,677,036 News_Category_Dataset_v2.json.zip
2020-02-07  09:18 PM             2,718 README.md
2020-02-07  09:18 PM            23,413 RECENT_Notenool.ipynb
2020-02-07  09:18 PM               744 seq2seq_attention.py
2020-02-07  09:18 PM            14,238 slack_model.py
2020-02-07  09:22 PM           179,696 Text_Generation_

import json

# Load the small sample file (JSON Lines: one JSON object per line)
my_json_file = path + "/news-category-small.json"
with open(my_json_file, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]
print(data[1])



## Reading the headlines
# Assumption: the full dataset is in JSON Lines format (one JSON object per
# line), like the small sample file, so lines=True is needed.
df = pd.read_json(path + "/News_Category_Dataset_v2.json", lines=True)
print(df["headline"].head())




In [38]:
import json

my_json_file = path + "/news-category-small.json"

# Each line of the file is one JSON object describing an article
with open(my_json_file, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

# Keep only the headline of each article
headlines = [row["headline"] for row in data]

print(len(headlines))
print(headlines)


30
['There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV', "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", 'Hugh Grant Marries For The First Time At Age 57', "Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork", 'Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog', "Morgan Freeman 'Devastated' That Sexual Harassment Claims Could Undermine Legacy", "Donald Trump Is Lovin' New McDonald's Jingle In 'Tonight Show' Bit", 'What To Watch On Amazon Prime That’s New This Week', "Mike Myers Reveals He'd 'Like To' Do A Fourth Austin Powers Film", 'What To Watch On Hulu That’s New This Week', 'Justin Timberlake Visits Texas School Shooting Victims', "South Korean President Meets North Korea's Kim Jong Un To Talk Trump Summit", 'With Its Way Of Life At Risk, This Remote Oyster-Growing Region Called In Robots', "Trump's Crackdown On Immigrant Parents Puts More Kids In An Already Strained System", "'Trump's Son Should

In [18]:
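# Alternative loader for CSV files whose names contain "Articles".
# Note: this reassigns `headlines`; no matching files were found in this run,
# so the result is an empty list.
nyt_dir = os.path.join(path, "New York Times")
headlines = []
if os.path.isdir(nyt_dir):
    for filename in os.listdir(nyt_dir):
        if "Articles" in filename:
            article_df = pd.read_csv(os.path.join(nyt_dir, filename))
            headlines.extend(list(article_df["headline"].values))
            break  # only the first matching file is used

headlines = [h for h in headlines if h != "Unknown"]
print("The number of headlines is:", len(headlines))

The number of headlines is: 0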


In [16]:
headlines

[]

## 3. Dataset preparation

### 3.1 Dataset cleaning
  
Remove punctuation so that only letters and digits remain, and convert everything to lowercase.

In [39]:
import string
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

def clean_text(headline):
    # Drop punctuation characters and lowercase the rest
    text = "".join(ch for ch in headline if ch not in string.punctuation).lower()
    # Drop any non-ASCII characters (e.g. curly apostrophes)
    text = text.encode("utf8").decode("ascii", "ignore")
    return text

# Build the corpus from the cleaned headlines
corpus = [clean_text(headline) for headline in headlines]

In [40]:
corpus[:5]

['there were 2 mass shootings in texas last week but only 1 on tv',
 'will smith joins diplo and nicky jam for the 2018 world cups official song',
 'hugh grant marries for the first time at age 57',
 'jim carrey blasts castrato adam schiff and democrats in new artwork',
 'julianna margulies uses donald trump poop bags to pick up after her dog']

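### 3.2 Generating Sequences of N-gram Tokens

- In natural language processing, the usual approach is to split text into words and then vectorize them.  
- Alongside morphological analysis, N-grams are one of the standard ways of segmenting text.  
- Specifically, the N-gram method slices natural language text into runs of N consecutive characters or N consecutive words.  
- Its strength is that no corpus (dictionary) is needed in advance; its weakness is that the number of extracted tokens tends to balloon.  
- e.g. "I voted for Trump." with n=2 => "I voted", "voted for", "for Trump"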
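
The N-gram slicing described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used later in the notebook; the helper names `word_ngrams` and `char_ngrams` are made up for this example.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "i voted for trump"
print(word_ngrams(sentence, 2))  # ['i voted', 'voted for', 'for trump']
print(char_ngrams("voted", 3))   # ['vot', 'ote', 'ted']
```

The windows overlap, so a sentence of k words yields k - n + 1 word N-grams. That overlap is why the number of extracted tokens grows quickly.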

In [59]:
# Collect every word that appears in the corpus
vocab = []
for line in corpus:
    vocab.extend(line.split())

vocabulary = set(vocab)

In [60]:
len(vocabulary)

265

In [43]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=2000)  # cap the vocabulary by frequency (word_index still lists every word)
tokenizer.fit_on_texts(corpus)
word2index = tokenizer.word_index
len(word2index)

Using TensorFlow backend.


265

In [44]:
word2index

{'to': 1,
 'in': 2,
 'on': 3,
 'trump': 4,
 'new': 5,
 'trumps': 6,
 'week': 7,
 'and': 8,
 'the': 9,
 'at': 10,
 'this': 11,
 'with': 12,
 'abortion': 13,
 '2': 14,
 'texas': 15,
 'but': 16,
 'for': 17,
 'first': 18,
 'donald': 19,
 'after': 20,
 'what': 21,
 'watch': 22,
 'prime': 23,
 'thats': 24,
 'north': 25,
 'summit': 26,
 'of': 27,
 'more': 28,
 'putin': 29,
 'than': 30,
 'ireland': 31,
 'landslide': 32,
 'men': 33,
 'he': 34,
 'there': 35,
 'were': 36,
 'mass': 37,
 'shootings': 38,
 'last': 39,
 'only': 40,
 '1': 41,
 'tv': 42,
 'will': 43,
 'smith': 44,
 'joins': 45,
 'diplo': 46,
 'nicky': 47,
 'jam': 48,
 '2018': 49,
 'world': 50,
 'cups': 51,
 'official': 52,
 'song': 53,
 'hugh': 54,
 'grant': 55,
 'marries': 56,
 'time': 57,
 'age': 58,
 '57': 59,
 'jim': 60,
 'carrey': 61,
 'blasts': 62,
 'castrato': 63,
 'adam': 64,
 'schiff': 65,
 'democrats': 66,
 'artwork': 67,
 'julianna': 68,
 'margulies': 69,
 'uses': 70,
 'poop': 71,
 'bags': 72,
 'pick': 73,
 'up': 74,
 'her':

In [45]:
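# Word <-> index lookup tables. The cutoff keeps only the 1,406 most frequent
# words; with 265 words in this corpus it filters nothing.
MAX_WORD_INDEX = 1406
dictionary = {}
rev_dictionary = {}
for word, idx in word2index.items():
    if idx > MAX_WORD_INDEX:
        continue
    dictionary[word] = idx
    rev_dictionary[idx] = word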

In [46]:
max(rev_dictionary.keys())

265

In [47]:
input_seqences = tokenizer.texts_to_sequences(corpus)

In [48]:
input_seqences[:5]

[[35, 36, 14, 37, 38, 2, 15, 39, 7, 16, 40, 41, 3, 42],
 [43, 44, 45, 46, 8, 47, 48, 17, 9, 49, 50, 51, 52, 53],
 [54, 55, 56, 17, 9, 18, 57, 10, 58, 59],
 [60, 61, 62, 63, 64, 65, 8, 66, 2, 5, 67],
 [68, 69, 70, 19, 4, 71, 72, 1, 73, 74, 20, 75, 76]]

In [49]:
len(input_seqences)

30

### 3.3 Padding the Sequences and Obtaining the Variables
#### Create fixed-length data by padding and obtain the explanatory variables (model inputs)
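
Before the real cells, here is a minimal sketch of the general idea in plain Python. It is not the notebook's exact loop: each prefix of a token sequence is paired with the token that immediately follows it, then every prefix is right-padded with zeros to a common length. The helper names (`make_training_pairs`, `pad_post`) and the token ids are made up for illustration.

```python
def make_training_pairs(sequence):
    """Split one token sequence into (prefix, next_token) pairs."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

def pad_post(sequence, max_len, pad_value=0):
    """Right-pad with pad_value, or cut off the tail, to reach exactly max_len."""
    return (sequence + [pad_value] * max_len)[:max_len]

tokens = [54, 55, 56, 17]  # hypothetical token ids for one headline
pairs = make_training_pairs(tokens)
max_len = max(len(prefix) for prefix, _ in pairs)

for prefix, target in pairs:
    print(pad_post(prefix, max_len), "->", target)
# [54, 0, 0] -> 55
# [54, 55, 0] -> 56
# [54, 55, 56] -> 17
```

`pad_post` mimics what `pad_sequences(..., padding="post", truncating="post")` does for a single sequence. The value 0 is safe as padding because the tokenizer's word indices start at 1.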

In [61]:
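input_data = []
target = []
for line in input_seqences:
    # Pair each prefix line[:i] with the word that follows it, line[i].
    # The range stops at len(line)-1, so the last word of a headline is never a target.
    for i in range(1, len(line)-1):
        input_data.append(line[:i])
        target.append(line[i])

In [62]:
input_data[:5]

[[35], [35, 36], [35, 36, 14], [35, 36, 14, 37], [35, 36, 14, 37, 38]]

In [63]:
target[:5]

[36, 14, 37, 38, 2]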

In [64]:
MAX_LEN = 0
for seq in input_data:
    if len(seq) > MAX_LEN:
        MAX_LEN = len(seq)
MAX_LEN

14

In [65]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# +1 because word indices start at 1; index 0 is reserved for padding
total_words = len(dictionary) + 1

input_data = pad_sequences(input_data, maxlen=MAX_LEN, padding="post", truncating="post")

In [66]:
len(input_data[0])

14

In [67]:
input_data

array([[ 35,   0,   0, ...,   0,   0,   0],
       [ 35,  36,   0, ...,   0,   0,   0],
       [ 35,  36,  14, ...,   0,   0,   0],
       ...,
       [255, 256,  10, ...,   0,   0,   0],
       [255, 256,  10, ...,   0,   0,   0],
       [255, 256,  10, ...,   0,   0,   0]])

In [58]:
input_data.shape

(275, 14)

In [71]:
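from keras.utils import to_categorical
# One-hot encode the targets. total_words comes from the padding cell above;
# the NameError below means that cell had not been run in the current session.
target = to_categorical(target, num_classes=total_words)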

NameError: name 'total_words' is not defined

In [None]:
target

In [None]:
target.shape

In [None]:
VOCAB_SIZE = 2001
VOCAB_SIZE

In [None]:
MAX_LEN

## 4. LSTMs for Text Generation

### 4.1 LSTM (Long Short-Term Memory)
  
1. Input Layer: takes the sequence of word indices as input and maps each index to an embedding vector.  
2. LSTM Layer: computes the output using LSTM units. I have used 100 units, but this number can be tuned later.  
3. Dropout Layer: a regularisation layer that randomly switches off some of the LSTM layer's activations during training.  
4. Output Layer: computes, for every word in the vocabulary, the probability that it is the next word.  

![title](https://cdn-images-1.medium.com/max/1600/1*yBXV9o5q7L_CvY7quJt3WQ.png)
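
To illustrate what the output layer does, the sketch below applies a softmax to raw scores and picks the most probable word. The scores and the four-word vocabulary are made up for illustration; in the real model the scores come from the `Dense` layer, and there is one per vocabulary entry.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and raw scores (index 0 is the padding slot)
vocab = ["<pad>", "trump", "week", "new"]
scores = [0.1, 2.0, 0.5, 1.0]

probs = softmax(scores)
best = max(range(len(probs)), key=lambda i: probs[i])
print([round(p, 3) for p in probs], "->", vocab[best])
```

Choosing the highest-probability index at every step is greedy decoding, which is how the generation function in section 5 selects each next word.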

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Dense, Dropout
from keras.callbacks import EarlyStopping

In [None]:
model = Sequential()

# Input layer: map each word index to a 100-dimensional embedding
model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=100, input_length=MAX_LEN))

# Hidden layer
model.add(LSTM(units=100))
model.add(Dropout(rate=0.1))

# Output layer: softmax activation for multi-class prediction
model.add(Dense(units=target.shape[1], activation="softmax"))

In [None]:
# Set the loss function and the optimizer
model.compile(loss="categorical_crossentropy", optimizer="adam")

In [None]:
model.summary()

#### Train the LSTM model on the training data

In [None]:
# Fit the model on the padded inputs and their one-hot targets
model.fit(input_data, target, batch_size=32, epochs=5, verbose=1)

### 4.2 GRU (Gated Recurrent Unit)

In [None]:
gru_model = Sequential()
gru_model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=100, input_length=MAX_LEN))
gru_model.add(GRU(units=100))
gru_model.add(Dropout(rate=0.1))
gru_model.add(Dense(units=target.shape[1], activation="softmax"))

In [None]:
gru_model.compile(loss="categorical_crossentropy", optimizer="adam")

In [None]:
gru_model.summary()

#### Train the GRU model on the training data

In [None]:
gru_model.fit(input_data, target, batch_size=32, epochs=5, verbose=1)

## 5. Generating the text

In [None]:
# TensorFlow 1.x API; in TensorFlow 2.x the equivalent is tf.random.set_seed(2)
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

In [None]:
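def text_generater(seed_text, next_words, model, max_sequence_len):
    """Append `next_words` greedily predicted words to `seed_text`."""
    for _ in range(next_words):
        # Encode the current text and pad it the same way as the training data
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len, padding="post")
        # Take the argmax of the softmax output (works where predict_classes is unavailable)
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Index 0 is padding and has no word, so fall back to an empty string
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text.title()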

In [None]:
text1 = "Trump decided"
text_generater(text1, 5, model, MAX_LEN)

In [None]:
text_generater(text1, 5, gru_model, MAX_LEN)

## I need more training data, I guess... (the model is undertrained)