<a href="https://colab.research.google.com/github/pkm29/Philosophy_Analysis/blob/master/Text_Generation_(LSTM).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation by Analytics Vidhya
source: https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import os
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils
import tensorflow as tf
from tensorflow import keras
#!pip install -U -q PyDrive


## Loading Data

In [2]:
try:
  sent_df = pd.read_csv("sentiment_dataframe.csv")
except FileNotFoundError:
  # no local copy found, so fall back to the file hosted on GitHub
  sent_df = pd.read_csv("https://github.com/pkm29/Philosophy_Analysis/raw/master/sentiment_dataframe.csv")
sent_df.head()

| Unnamed: 0 | index | sentence | chapter | neg | neu | pos | compound | chapter_text | moving_avg_comp | chapter_author | chapter_text_num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | PART ONE CONTAINING THE PAPERS OF A Are pas... | 0 | 0.0 | 0.814 | 0.186 | 0.4939 | Preface | 0.080093 | Neutral | NaN |
| 1 | 1 | Reason alone baptized? | 0 | 0.5 | 0.5 | 0.0 | -0.25 | Preface | 0.084089 | Neutral | NaN |
| 2 | 2 | Edward Young1 PREFACE PERHAPS it has sometime... | 0 | 0.083 | 0.773 | 0.144 | 0.6486 | Preface | 0.084089 | Neutral | NaN |
| 3 | 3 | Your life has perhaps brought you into touch w... | 0 | 0.054 | 0.85 | 0.096 | 0.3612 | Preface | 0.084089 | Neutral | NaN |
| 4 | 4 | Perhaps neither case applies to you and your l... | 0 | 0.071 | 0.857 | 0.071 | 0.0 | Preface | 0.089234 | Neutral | NaN |


chapter_text_num is NaN only for the Preface. This is fine, because the column was created only for graphical visualizations: it simply maps the chapter names to short numbers.

## Creating Word/Character Mappings

Here are some interesting pros and cons of character mapping compared with word mapping, as highlighted by this [lighttag.io blogpost](https://www.lighttag.io/blog/character-level-NLP/). <br>
### Advantages
- Character mapping is more general when discerning syntax. With word mapping, tokens like 3/4", csv w/, or 50k need to be in the vocabulary explicitly. A regular model usually has only the 10-30k most frequent words in its vocabulary, so "unusual" looking words are interpreted poorly. Character mapping sees the input "as-is", and each word is equally strange. This makes the approach more general and robust, which helps with poorly spelled, user-generated text. <br>
- Due to the smaller vocabulary of character level models, pretraining avoids the softmax bottleneck. Language modeling with a softmax output can be viewed as a matrix factorization problem, and the matrix of next-token log-probabilities for natural language is high-rank because the context of a word is incredibly important. A softmax over hidden states of dimension d can only express a matrix of rank about d, so when the vocabulary is far larger than d the model does not have enough capacity to represent the true distribution. This is a problem faced by softmax-based language models in general, including those built on recurrent neural networks (RNNs). A well-known 2018 paper by Zhilin Yang et al. explores this problem and a solution [further](https://arxiv.org/pdf/1711.03953.pdf). 
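
The rank argument behind the softmax bottleneck can be checked numerically. The sketch below is only an illustration with random matrices, not a trained model: the sizes (500 contexts, a hidden size of 32, vocabularies of 300 and 20) and the helper name logit_rank are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_rank(num_contexts, vocab_size, hidden_dim):
    """Rank of the logit matrix H @ W.T for random hidden states and output embeddings."""
    H = rng.standard_normal((num_contexts, hidden_dim))  # one hidden state per context
    W = rng.standard_normal((vocab_size, hidden_dim))    # one output embedding per token
    return int(np.linalg.matrix_rank(H @ W.T))

hidden_dim = 32
# "word-like" case: vocabulary much larger than the hidden size
word_rank = logit_rank(num_contexts=500, vocab_size=300, hidden_dim=hidden_dim)
# "character-like" case: vocabulary smaller than the hidden size
char_rank = logit_rank(num_contexts=500, vocab_size=20, hidden_dim=hidden_dim)

print("word-like rank:", word_rank, "of a possible 300")
print("char-like rank:", char_rank, "of a possible 20")
```

In the word-like case the rank cannot exceed the hidden size, however many tokens there are. In the character-like case the vocabulary itself is the limit, which is the situation in this notebook: a few dozen characters against LSTM layers with hundreds of units.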

### Disadvantages
- Character mapping loses the semantic content of words, which often helps accuracy. 
- Character mapping is sometimes more computationally expensive. While it generally has lower computational costs than word mapping, working at the character level multiplies the length of our sequence by the average number of characters per word. As a result, certain NLP architectures may be necessary to keep computational costs low. (The lighttag article mentions convolutions or transformers as ways to offset the cost of long sequences.)
- The output of character level models requires more work from the user to convert it into words and meaningful insights. More work is also needed to account for tokenization errors: character level models split words into individual characters and, in the process, may interpret words differently than we would like. 
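
To make the trade-off concrete, here is a small sketch that tokenizes one sentence both ways. The sentence and the tiny word vocabulary are made up for illustration; they reuse the "unusual" tokens mentioned above and are not taken from the dataset.

```python
# Hypothetical example sentence containing the "unusual" tokens from the list above
sentence = 'cut the 3/4" board, save as csv w/ 50k rows'

word_tokens = sentence.split()   # word mapping: split on whitespace
char_tokens = list(sentence)     # character mapping: every character is a token

# A made-up, tiny word vocabulary: anything outside it becomes <unk>
word_vocab = {"cut", "the", "board,", "save", "as", "rows"}
word_view = [tok if tok in word_vocab else "<unk>" for tok in word_tokens]

print(word_view)
print("word sequence length:", len(word_tokens))
print("char sequence length:", len(char_tokens))
```

The word view loses every token outside the vocabulary, while the character view keeps everything but is several times longer.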

## Character Mapping Approach

Below we can see what is inside our character mapping. I noticed some odd sentences composed only of numbers and went back to filter them out. Uncomment the line "text" to display the entire text; I have kept it commented out because the notebook output would otherwise be far too large. The last line displays char_to_n, the character mapping. <br>
As written, the cell below keeps only the sentences whose chapter_author is "Aesthetic"; removing that filter uses the entire book's text. Later on I will try variations such as training separately on the two authors.

In [3]:
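#Convert sentences back into book
join_with = " "
#note: this keeps only the sentences whose chapter_author is "Aesthetic"
text = join_with.join(sent_df["sentence"][sent_df["chapter_author"] == "Aesthetic"])
text = text.lower()
#sorted list of the unique characters in the text
characters = sorted(set(text))
#lookup tables in both directions: index -> character and character -> index
n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
#text
char_to_n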

{' ': 0,
 '!': 1,
 '&': 2,
 '(': 3,
 ')': 4,
 '*': 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '/': 9,
 '0': 10,
 '1': 11,
 '2': 12,
 '3': 13,
 '4': 14,
 '5': 15,
 '6': 16,
 '7': 17,
 '8': 18,
 '9': 19,
 ':': 20,
 ';': 21,
 '?': 22,
 '[': 23,
 ']': 24,
 'a': 25,
 'b': 26,
 'c': 27,
 'd': 28,
 'e': 29,
 'f': 30,
 'g': 31,
 'h': 32,
 'i': 33,
 'j': 34,
 'k': 35,
 'l': 36,
 'm': 37,
 'n': 38,
 'o': 39,
 'p': 40,
 'q': 41,
 'r': 42,
 's': 43,
 't': 44,
 'u': 45,
 'v': 46,
 'w': 47,
 'x': 48,
 'y': 49,
 'z': 50,
 '\xa0': 51,
 'à': 52,
 'ä': 53,
 'æ': 54,
 'è': 55,
 'é': 56,
 'ö': 57,
 'ø': 58,
 'ü': 59,
 '–': 60,
 '‘': 61,
 '’': 62,
 '“': 63,
 '”': 64,
 '…': 65}
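
As a quick sanity check of the two lookup tables, the same construction can be run on a short string and round-tripped. The sample string and the names to_n and to_char are made up for this sketch, so it runs without the dataset.

```python
sample = "either/or"                      # hypothetical stand-in for the book text

chars = sorted(set(sample))               # unique characters, in sorted order
to_n = {c: n for n, c in enumerate(chars)}     # character -> index
to_char = {n: c for n, c in enumerate(chars)}  # index -> character

encoded = [to_n[c] for c in sample]             # text -> list of integers
decoded = "".join(to_char[n] for n in encoded)  # list of integers -> text

print(to_n)
print(encoded)
print(decoded)
```

Encoding followed by decoding should always return the original string, since the two dictionaries are inverses of each other.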

For the rest of this code I will be following Analytics Vidhya's blogpost by [Pranjal Srivastava](https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/) without including much explanatory text. His descriptions are great and I don't see a reason to copy and paste his explanations if they are already in the article. There are a couple of technical terms that I think are worth explaining further though.

## Data Preprocessing for LSTM Training Format

In [4]:
X = []
Y = []
length = len(text)
#seq_length = length of the sequence of characters that we want to consider before predicting a particular character.
seq_length = 100
for i in range(0, length-seq_length, 1):
    sequence = text[i:i + seq_length]
    label = text[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
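
To see what the loop above produces, here is the same sliding window on a toy string with a window of 4 characters instead of 100. The toy text and variable names are made up for the sketch, and the windows are kept as strings for readability rather than mapped to integers.

```python
toy_text = "hello world"   # hypothetical stand-in for the book text
toy_seq_length = 4         # much shorter than the 100 used above

windows, labels = [], []
for i in range(len(toy_text) - toy_seq_length):
    windows.append(toy_text[i:i + toy_seq_length])   # input: 4 consecutive characters
    labels.append(toy_text[i + toy_seq_length])      # target: the character that follows

for window, label in zip(windows, labels):
    print(repr(window), "->", repr(label))

print("number of training pairs:", len(windows))
```

Each step moves the window forward by one character, so the number of pairs is the text length minus the window length.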

In [5]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)
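
The shapes produced by this cell are easier to follow on a tiny example. The sketch below uses made-up toy data and np.eye as a plain NumPy stand-in for to_categorical, which is an assumption of the example rather than what the cell above calls.

```python
import numpy as np

vocab_size = 5                              # hypothetical toy vocabulary
X_toy = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]   # three windows of length 3
Y_toy = [3, 4, 0]                           # id of the character after each window

# (samples, time steps, features), then scale the ids into the range [0, 1)
X_arr = np.reshape(X_toy, (len(X_toy), 3, 1)) / float(vocab_size)

# one-hot targets: row i has a single 1 in column Y_toy[i]
Y_arr = np.eye(vocab_size)[Y_toy]

print(X_arr.shape, Y_arr.shape)
print(Y_arr)
```

The LSTM sees one scaled number per time step, while the targets get one column per character so they can be compared with the softmax output.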

## Modelling

In [6]:
#sequential model with three LSTM layers having 700 units each
model = Sequential()
model.add(LSTM(700, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
#20% dropout after each LSTM layer to reduce overfitting
model.add(Dropout(0.2))
model.add(LSTM(700, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(700))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
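
For a sense of how large this network is, the trainable parameters can be counted by hand. This is a back-of-the-envelope sketch using the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias). The output size of 66 is an assumption read off the char_to_n output above (ids 0 to 65); the real value is Y_modified.shape[1].

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

units = 700
vocab_size = 66   # assumption: one output per character id shown in char_to_n

layer_counts = [
    lstm_params(1, units),             # first LSTM: one feature per time step
    lstm_params(units, units),         # second LSTM
    lstm_params(units, units),         # third LSTM
    dense_params(units, vocab_size),   # softmax output layer
]
total = sum(layer_counts)              # dropout layers add no parameters

print(layer_counts)
print("total trainable parameters:", total)
```

Under these assumptions the model has close to ten million parameters, nearly all of them in the second and third LSTM layers.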

In [None]:
model.fit(X_modified, Y_modified, epochs=24, batch_size=64)
#24 epochs, batch size 64, and 3 LSTM layers of 700 units, each followed by a 20% dropout layer
model.save_weights('text_generator_e24_bs64_700_0.2_x3.h5')

from google.colab import files
files.download('text_generator_e24_bs64_700_0.2_x3.h5') 

Epoch 1/24

In [None]:
files.download('text_generator_e24_bs64_700_0.2_x3.h5') 

## Generating Text

In [None]:
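# copy the seed window so that X[99] itself is not modified by append()
string_mapped = list(X[99])
full_string = [n_to_char[value] for value in string_mapped]
# generating characters
for i in range(seq_length):
    # same shape and scaling as the training inputs
    x = np.reshape(string_mapped,(1,len(string_mapped), 1))
    x = x / float(len(characters))
    # greedy choice: take the most probable next character
    pred_index = int(np.argmax(model.predict(x, verbose=0)))
    full_string.append(n_to_char[pred_index])
    # slide the window: add the prediction, drop the oldest character
    string_mapped.append(pred_index)
    string_mapped = string_mapped[1:len(string_mapped)]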
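
The window bookkeeping in this loop can be tried out without a trained model. In the sketch below, generate and stub_predict are hypothetical names: the stub simply returns the last id plus one (modulo 5) and stands in for the argmax over model.predict.

```python
def generate(seed_ids, predict_next, n_chars):
    """Greedy generation: predict one id, append it, slide the window by one."""
    window = list(seed_ids)        # copy so the caller's seed is left untouched
    generated = []
    for _ in range(n_chars):
        next_id = predict_next(window)
        generated.append(next_id)
        window = window[1:] + [next_id]   # drop the oldest id, add the newest
    return generated

def stub_predict(window):
    # hypothetical stand-in for the trained model
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
out = generate(seed, stub_predict, 6)
print(out)
print(seed)
```

The window always keeps the same length, so every prediction is made from the most recent characters, including those the model has generated itself.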

In [None]:
# join the seed and generated characters into one string
txt = "".join(full_string)
txt