# Auto Text Generation using TensorFlow
## Generating Text with an RNN (Recurrent Neural Network)

Recurrent Neural Networks (RNNs) are a class of neural networks well suited to modeling sequence data such as time series or natural language.

An RNN works like this: first, words are transformed into machine-readable vectors. Then the RNN processes the sequence of vectors one at a time, carrying a hidden state forward from each step to the next.
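As a rough illustration of processing vectors one at a time, the sketch below runs a single recurrent unit over a short sequence in plain Python. The weights and the two-dimensional "word vectors" are made-up values chosen for illustration; nothing here is learned, and it is not the model trained later in this notebook.

```python
import math

def rnn_forward(vectors, w_in, w_rec, bias):
    """Run one recurrent unit over a sequence of input vectors.

    At every step: h_t = tanh(w_in . x_t + w_rec * h_{t-1} + bias).
    Returns the hidden state after each step.
    """
    hidden = 0.0  # initial hidden state
    states = []
    for x in vectors:
        pre_activation = sum(w * xi for w, xi in zip(w_in, x)) + w_rec * hidden + bias
        hidden = math.tanh(pre_activation)
        states.append(hidden)
    return states

# Hypothetical 2-dimensional vectors standing in for a three-word sequence
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(sequence, w_in=[0.5, -0.5], w_rec=0.8, bias=0.0)
print(states)
```

Each hidden state depends on the current vector and on the previous hidden state, which is how earlier words can influence later predictions.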



<div class="alert alert-box alert-warning">
Use the following links to jump to the different parts of this notebook.<br>

Jump to:
- [Dataset Preparation](#section1)
- [Process the text](#section2)
- [Generating Sequence of N-gram Tokens (Word Embeddings)](#section4)
- [Padding the Sequences and obtain Variables: Predictors and Target](#section5)
- [LSTM for Text Generation](#Section6)
- [Generating the text](#section7)
</div>




![alt text](https://miro.medium.com/max/1400/1*AQ52bwW55GsJt6HTxPDuMA.gif)

Enable GPU

In [0]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


Check GPU Performance

In [0]:
%tensorflow_version 2.x
import tensorflow as tf
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
3.667116954999983
GPU (s):
0.05741041599998198
GPU speedup over CPU: 63x


Import TensorFlow and other libraries

In [0]:
# keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

# set seeds for reproducibility
import tensorflow as tf
tf.random.set_seed(1234) 
from numpy.random import seed
seed(1)

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

Using TensorFlow backend.


Load the Dataset

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
pwd

'/content'

In [0]:
cd /content/drive

/content/drive


In [0]:
ls

[0m[01;34m'My Drive'[0m/


In [0]:
cd /content/drive/My Drive/ML

/content/drive/My Drive/ML


In [0]:
ls

ArticlesApril2017.csv  ArticlesJan2017.csv    ArticlesMay2017.csv
ArticlesApril2018.csv  ArticlesJan2018.csv    Headline.ipynb
ArticlesFeb2017.csv    ArticlesMarch2017.csv  TextGenerator.ipynb
ArticlesFeb2018.csv    ArticlesMarch2018.csv


In [0]:
ArticleJan17 = pd.read_csv('ArticlesJan2017.csv')
ArticleFeb17 = pd.read_csv('ArticlesFeb2017.csv')
ArticleMar17 = pd.read_csv('ArticlesMarch2017.csv')
ArticleApr17 = pd.read_csv('ArticlesApril2017.csv')
ArticleMay17 = pd.read_csv('ArticlesMay2017.csv')
ArticleJan18 = pd.read_csv('ArticlesJan2018.csv')
ArticleFeb18 = pd.read_csv('ArticlesFeb2018.csv')
ArticleMar18 = pd.read_csv('ArticlesMarch2018.csv')
ArticleApr18 = pd.read_csv('ArticlesApril2018.csv')

In [0]:
article_df = pd.concat([ArticleJan17, ArticleFeb17, ArticleMar17, ArticleApr17, ArticleMay17, ArticleJan18, ArticleFeb18, ArticleMar18, ArticleApr18], sort=False)

In [0]:
article_df.head()

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445
3,5869911a95d0e0392607894e,,By JOCHEN BITTNER,article,"Angela Merkel, Russia’s Next Target","['Cyberwarfare and Defense', 'Presidential Ele...",1,OpEd,15,2017-01-01 23:30:27,Unknown,"With a friend entering the White House, Vladim...",The New York Times,Op-Ed,https://www.nytimes.com/2017/01/01/opinion/ang...,864
4,5869a61795d0e03926078962,,By JIAYIN SHEN,article,Boots for a Stranger on a Bus,"['Shoes and Boots', 'Buses', 'New York City']",0,Metro,12,2017-01-02 01:00:02,Unknown,Witnessing an act of generosity on a rainy day.,The New York Times,Brief,https://www.nytimes.com/2017/01/01/nyregion/me...,309


In [0]:
article_df.shape

(9335, 16)

In [0]:
import os
cwd = os.getcwd()
print(os.getcwd())

/content/drive/My Drive/ML


# **Dataset preparation**
## Dataset Cleaning

1. Remove punctuation
2. Lowercase all words
3. Drop non-ASCII characters (for example curly quotes)

In [0]:
# article_df already combines all nine monthly files, so take the
# headlines from it directly
headline = list(article_df.headline.values)

# drop placeholder headlines
headline = [h for h in headline if h != "Unknown"]
len(headline)

8603

**A corpus** is a large collection of text, and in the machine learning sense a corpus can be thought of as your model's input data. 

In [0]:
def clean_text(text):
    text = "".join(v for v in text if v not in string.punctuation).lower()
    text = text.encode("utf8").decode("ascii",'ignore')
    return text 

corpus = [clean_text(x) for x in headline]
corpus[:10]

[' gop leadership poised to topple obamas pillars',
 'fractured world tested the hope of a young president',
 'little troublemakers',
 'angela merkel russias next target',
 'boots for a stranger on a bus',
 'molder of navajo youth where a game is sacred',
 'the affair season 3 episode 6 noah goes home',
 'sprint and mr trumps fictional jobs',
 'america  becomes a stan',
 'fighting diabetes and leading by example']

In [0]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Look into the Text

After `clean_text`, every headline is lower case, with punctuation and non-ASCII characters removed.

In [0]:
# corpus is a list of headlines, so len() counts headlines, not characters
print('Number of headlines: {}'.format(len(corpus)))

Number of headlines: 8603


Take a look at the first 250 headlines in the corpus

In [0]:
print(corpus[:250])

[' gop leadership poised to topple obamas pillars', 'fractured world tested the hope of a young president', 'little troublemakers', 'angela merkel russias next target', 'boots for a stranger on a bus', 'molder of navajo youth where a game is sacred', 'the affair season 3 episode 6 noah goes home', 'sprint and mr trumps fictional jobs', 'america  becomes a stan', 'fighting diabetes and leading by example', 'chinese court says mr c was fired unjustifiably', 'cold therapy maybe better save your money', 'shunned stars of steroid era are on deck for cooperstown', 'picking up a personal thread at an office party', 'health reform could outlast repeal efforts', 'mr trump bureaucracy apprentice', 'house gop votes to gut an office reviewing ethics', 'right to disconnect from work email and other laws go into effect in france', 'lessons from the tea party', 'all talk', 'winter comforts', 'the snapchat presidency', 'the house at the end of the world', 'power down', 'fraud culture rises in india ai

Unique headlines in the corpus

In [0]:
# corpus is a list of headlines, so the set holds distinct headlines
vocab = sorted(set(corpus))
print('{} unique headlines'.format(len(vocab)))

8309 unique headlines


# **Process the text**
## **Generating Sequence of N-gram Tokens**


Tokenization

Tokenization is the process of extracting tokens (here, words) from a corpus. After this step, every headline in the dataset is converted into a sequence of integer tokens.

Function to build the n-gram training sequences:

- Fit the tokenizer on the corpus to assign an integer to every word
- Convert each headline into its list of integers
- Emit every prefix of at least two tokens as one training sequence

In [0]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    seq = []
    for row in corpus:
        tokens = tokenizer.texts_to_sequences([row])[0]
        for i in range(1, len(tokens)):
            n_gram_sequence = tokens[:i+1]
            seq.append(n_gram_sequence)
    return seq, total_words

seq, total_words = get_sequence_of_tokens(corpus)
seq[:10]

[[77, 1951],
 [77, 1951, 1360],
 [77, 1951, 1360, 3],
 [77, 1951, 1360, 3, 3366],
 [77, 1951, 1360, 3, 3366, 1601],
 [77, 1951, 1360, 3, 3366, 1601, 1952],
 [5166, 86],
 [5166, 86, 3367],
 [5166, 86, 3367, 1],
 [5166, 86, 3367, 1, 349]]
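To make the n-gram expansion concrete without Keras, here is a simplified sketch on a two-headline toy corpus. `build_word_index` and `ngram_sequences` are hypothetical helpers written for this illustration; they only split on whitespace, whereas the Keras `Tokenizer` also lowercases and filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Give every word an integer id, most frequent first, starting at 1.

    Id 0 is left unused so it can serve as the padding value later.
    """
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def ngram_sequences(texts, word_index):
    """Emit every prefix of length >= 2 of every tokenized text."""
    sequences = []
    for text in texts:
        tokens = [word_index[w] for w in text.split()]
        for i in range(1, len(tokens)):
            sequences.append(tokens[:i + 1])
    return sequences

toy_corpus = ["boots for a stranger", "a stranger on a bus"]
word_index = build_word_index(toy_corpus)
toy_seq = ngram_sequences(toy_corpus, word_index)
print(word_index)
print(toy_seq)
```

A headline of n words yields n - 1 training sequences, which is why the two toy headlines (four and five words) produce seven sequences in total.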

In [0]:
# Show the first 10 headlines and the first 10 n-gram sequences (which cover only the first two headlines)
print ('{} ---- characters mapped to int ---- > {}'.format(repr(corpus[:10]), seq[:10]))

[' gop leadership poised to topple obamas pillars', 'fractured world tested the hope of a young president', 'little troublemakers', 'angela merkel russias next target', 'boots for a stranger on a bus', 'molder of navajo youth where a game is sacred', 'the affair season 3 episode 6 noah goes home', 'sprint and mr trumps fictional jobs', 'america  becomes a stan', 'fighting diabetes and leading by example'] ---- characters mapped to int ---- > [[77, 1951], [77, 1951, 1360], [77, 1951, 1360, 3], [77, 1951, 1360, 3, 3366], [77, 1951, 1360, 3, 3366, 1601], [77, 1951, 1360, 3, 3366, 1601, 1952], [5166, 86], [5166, 86, 3367], [5166, 86, 3367, 1], [5166, 86, 3367, 1, 349]]


## **Padding the Sequences and obtain Variables: Predictors and Target**

- A dataset of token sequences has now been generated
- These sequences have different lengths
- Pad the sequences before training the model so that they all have equal length
- Keras provides the built-in `pad_sequences` function for this purpose
- Predictors and a label are created to feed the data into the learning model
- Each N-gram sequence minus its last word is the predictor, and that last word is the label

In [0]:
def generate_padded_sequences(seq):
    max_length = max([len(x) for x in seq])
    seq = np.array(pad_sequences(seq, maxlen=max_length, padding='pre'))
    
    predictors, label = seq[:,:-1],seq[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_length

predictors, label, max_length = generate_padded_sequences(seq)

This gives us the input matrix X (`predictors`) and the one-hot label matrix Y (`label`), which are used for training.
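The padding and the predictor/label split can be traced by hand on three short sequences. This is a plain-Python sketch of the same idea, not the Keras implementation; `pad_pre` and `split_predictors_label` are hypothetical names, and the token ids are arbitrary.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in sequences)
    padded = [[pad_value] * (max_len - len(s)) + s for s in sequences]
    return padded, max_len

def split_predictors_label(padded, num_classes):
    """Use all but the last token as input and one-hot encode the last token."""
    predictors = [row[:-1] for row in padded]
    labels = []
    for row in padded:
        one_hot = [0] * num_classes
        one_hot[row[-1]] = 1
        labels.append(one_hot)
    return predictors, labels

toy_sequences = [[3, 4], [3, 4, 1], [3, 4, 1, 2]]
padded, toy_max_length = pad_pre(toy_sequences)
toy_predictors, toy_label = split_predictors_label(padded, num_classes=5)
print(padded)
print(toy_predictors)
print(toy_label)
```

Padding on the left keeps the word to be predicted in the last column of every row, so a single slice separates inputs from targets.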

## LSTM for Text Generation
Quick Background

- LSTM: Long short-term memory units are building blocks for the layers of a recurrent neural network (RNN).
- An RNN composed of LSTM units is often called an LSTM network.
- A common LSTM unit is composed of
  - a cell,
  - an input gate,
  - an output gate, and
  - a forget gate.

1) Cell State: The long-term memory is usually called the cell state. The looping arrows indicate the recursive nature of the cell.

2) Forget Gate: The remember vector is usually called the forget gate. Its output is multiplied element-wise with the cell state: a value of 0 at a position erases that information, and a value of 1 keeps it.

3) Input Gate: The save vector is usually called the input gate. It determines which new information should enter the cell state (the long-term memory).

4) Output Gate: The focus vector is usually called the output gate. It determines which parts of the cell state are exposed as the hidden state at the current step.

The working memory is usually called the hidden state.
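The interplay of the gates is easier to follow in a scalar example. The sketch below implements one step of a single-unit LSTM using the standard gate equations; the parameter values are invented so that the forget gate is almost fully open and the input gate almost fully closed. It is an illustration, not the Keras `LSTM` layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM with scalar input and state.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # near 0 erases the cell state, near 1 keeps it
    i = sigmoid(pre("input"))        # how much new information is let in
    g = math.tanh(pre("candidate"))  # the new information itself
    o = sigmoid(pre("output"))       # how much of the cell state is exposed

    c = f * c_prev + i * g           # updated cell state (long-term memory)
    h = o * math.tanh(c)             # updated hidden state (working memory)
    return h, c

# Hypothetical parameters: keep the old memory, block new input
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
print(c)  # stays very close to the previous cell state, 0.7
```

Swapping the two biases (forget at -20, input at +20) makes the unit discard the old cell state and store the candidate value instead.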

![alt text](https://miro.medium.com/max/542/1*ULozye1lfd-dS9RSwndZdw.png)

Explanation of the algorithm
- The idea is to train the RNN on many sequences of words, each paired with its target next word.
- For example, if each input sequence is a list of five words, then the target is a single element: the word that follows them in the original text.

![alt text](https://miro.medium.com/max/1400/1*n-IgHZM5baBUjq0T7RYDBw.gif)

Run this model for 100 epochs

In [0]:
def create_model(max_length, total_words):
    length = max_length - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=length))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_length, total_words)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 23, 10)            112650    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 11265)             1137765   
Total params: 1,294,815
Trainable params: 1,294,815
Non-trainable params: 0
_________________________________________________________________
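The parameter counts in the summary can be checked by hand. The short calculation below uses the sizes visible in the summary (vocabulary 11265, embedding dimension 10, 100 LSTM units) and the usual formulas for embedding, LSTM, and dense layers.

```python
vocab_size = 11265     # total_words, read off the Dense output shape
embedding_dim = 10
lstm_units = 100

# one embedding vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# four weight sets (forget, input, candidate, output), each with
# input weights, recurrent weights, and a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# one weight per (unit, word) pair plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# → 112650 44400 1137765 1294815
```

Almost 90% of the parameters sit in the output layer, because it needs one row of weights for every word in the vocabulary.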


## **Train the Model**

In [0]:
model.fit(predictors, label, epochs=100, verbose=5)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.callbacks.History at 0x7fc83bc50358>

## Generating the text

Function to predict the next word based on the input words:
1. Tokenize the text
2. Pad the sequences
3. Pass them into the trained model to get the predicted word

In [0]:
def generate_text(seed_text, next_words, model, max_length):
    for _ in range(next_words):
        # tokenize the text
        tokens = tokenizer.texts_to_sequences([seed_text])[0]
        # pad the sequence to the input length the model was trained on
        tokens = pad_sequences([tokens], maxlen=max_length-1, padding='pre')
        # take the argmax of the softmax output as the predicted word index;
        # this replaces predict_classes, which newer Keras releases no longer provide
        predicted = int(np.argmax(model.predict(tokens, verbose=0), axis=-1)[0])

        # map the index back to its word (index 0 is padding and has no word)
        result = tokenizer.index_word.get(predicted, "")
        seed_text += " " + result
    return seed_text.title()
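The generation loop is a greedy decoder: at every step it appends the single most likely next word and feeds the longer text back in. The sketch below isolates that loop, with a hand-written lookup table standing in for the trained model; `greedy_generate`, `toy_table`, and `toy_predict` are hypothetical names and the scores are made up.

```python
def greedy_generate(seed_text, next_words, predict_next):
    """Append the highest-scoring next word, up to next_words times."""
    words = seed_text.lower().split()
    for _ in range(next_words):
        scores = predict_next(words)  # dict: candidate word -> score
        if not scores:
            break                     # nothing to suggest, stop early
        words.append(max(scores, key=scores.get))
    return " ".join(words).title()

# Hypothetical stand-in for the model: scores depend only on the last word
toy_table = {
    "boots": {"for": 0.9, "on": 0.1},
    "for": {"a": 0.7, "the": 0.3},
    "a": {"stranger": 0.6, "bus": 0.4},
}

def toy_predict(words):
    return toy_table.get(words[-1], {})

print(greedy_generate("boots", 3, toy_predict))
# → Boots For A Stranger
```

Because the most likely word is always chosen, the same seed always produces the same continuation.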

Some Results

In [0]:
print (generate_text("united states", 5, model, max_length))
print (generate_text("president trump", 4, model, max_length))
print (generate_text("donald trump", 4, model, max_length))
print (generate_text("india and china", 8, model, max_length))
print (generate_text("new york", 4, model, max_length))
print (generate_text("science and technology", 5, model, max_length))

In [0]:
generate_text("world", 4, model, max_length)

'World Capital Inquiry To Win'

In [0]:
generate_text("population", 4, model, max_length)

'Population The Walking Dead Season'

In [0]:
generate_text("strange", 4, model, max_length)

'Strange Pension Math Leaves States'

In [0]:
generate_text("boots", 6, model, max_length)

'Boots For A Stranger In A Bus'

In [0]:
generate_text("how", 4, model, max_length)

'How To Be Mindful While'

In [0]:
generate_text("silicon valley", 4, model, max_length)

'Silicon Valley Wary Of Trump Warms'

In [0]:
generate_text("sound", 6, model, max_length)

'Sound Barriers The Shape Of The Hacks'

In [0]:
generate_text("Pie", 4, model, max_length)

'Pie Forges Is A Mountain'

In [0]:
generate_text("When do you", 5, model, max_length)

'When Do You Want To See The Way'

# Part 2: Training Markov Chain Model on NYT Comments using markovify and spaCy

### This is an attempt to make a bot comment meaningfully by generating comments similar to those on the NYT articles.

1. `markovify` provides the Markov chain generator used for automated text generation.
2. The NLP package spaCy is used for part-of-speech (POS) tagging.

In [0]:
pip install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.2.4)


In [0]:
pip install markovify

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/de/c3/2e017f687e47e88eb9d8adf970527e2299fb566eba62112c2851ebb7ab93/markovify-0.8.0.tar.gz
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.8.0-cp36-none-any.whl size=10694 sha256=fd54a72f4dceeff570ac1fa128fba206ecdaf96ddd37d0e9e38df548871ba543
  Stored in directory: /root/.cache/pip/wheels/5d/a8/92/35e2df870ff15a65657679dca105d190ec3c854a9f75435e40
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.8.0 unidecode-1.1.1


`markovify` builds Markov models from large corpora of text and generates random sentences from them; spaCy supplies the part-of-speech tags used later.

Loading required packages and data

In [0]:
import pandas as pd
import markovify 
import spacy
import re
from time import time
import gc
import warnings
warnings.filterwarnings('ignore')

In [0]:
pwd

'/content'

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
pwd

'/content'

In [0]:
cd /content/drive/My Drive/MLFINAL

/content/drive/My Drive/MLFINAL


In [0]:
ls

bot.ipynb              CommentsFeb2018.csv    CommentsMarch2018.csv
CommentsApril2017.csv  CommentsJan2017.csv    CommentsMay2017.csv
CommentsApril2018.csv  CommentsJan2018.csv    stackText.ipynb
CommentsFeb2017.csv    CommentsMarch2017.csv  Text.ipynb


## Steps to perform:
### 1. Prepare text from comments for training the generator.
### 2. Training a simple Markov chain generator using the comments' text and using it to generate comments.
### 3. Training an improved Markov chain generator with POS-Tagged text and using it to generate more comments.

In [0]:
curr_dir = '../MLFINAL/'
df1 = pd.read_csv(curr_dir + 'CommentsJan2017.csv')
df2 = pd.read_csv(curr_dir + 'CommentsFeb2017.csv')
df3 = pd.read_csv(curr_dir + 'CommentsMarch2017.csv')
df4 = pd.read_csv(curr_dir + 'CommentsApril2017.csv')
df5 = pd.read_csv(curr_dir + 'CommentsMay2017.csv')
df6 = pd.read_csv(curr_dir + 'CommentsJan2018.csv')
df7 = pd.read_csv(curr_dir + 'CommentsFeb2018.csv')
df8 = pd.read_csv(curr_dir + 'CommentsMarch2018.csv')
df9 = pd.read_csv(curr_dir + 'CommentsApril2018.csv')
comments = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9])
comments.drop_duplicates(subset='commentID', inplace=True)
comments.head(3)

Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,inReplyTo,newDesk,parentID,parentUserDisplayName,permID,picURL,printPage,recommendations,recommendedFlag,replyCount,reportAbuseFlag,sectionName,sharing,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,typeOfMaterial
0,1483455908,58691a5795d0e039260788b9,1324.0,For all you Americans out there --- still rejo...,20969730.0,20969730.0,<br/>,comment,1483426000.0,1.0,0,0.0,National,0.0,,20969730,https://graphics8.nytimes.com/images/apps/time...,1.0,5.0,,0.0,,Politics,0,approved,1.0,0.0,1483455908,N. Smith,64679318.0,New York City,,,News
1,1483455656,58691a5795d0e039260788b9,1324.0,Obamas policies may prove to be the least of t...,20969325.0,20969325.0,<br/>,comment,1483417000.0,1.0,0,0.0,National,0.0,,20969325,https://graphics8.nytimes.com/images/apps/time...,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455656,Kilocharlie,69254188.0,Phoenix,,,News
2,1483455655,58691a5795d0e039260788b9,1324.0,Democrats are comprised of malcontents who gen...,20969855.0,20969855.0,<br/>,comment,1483431000.0,1.0,0,0.0,National,0.0,,20969855,https://graphics8.nytimes.com/images/apps/time...,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455655,Frank Fryer,76788711.0,Florida,,,News


In [0]:
comments.shape

(2118617, 34)

In [0]:
comments.sectionName.value_counts()[:5]

Unknown          1096761
Politics          479701
Sunday Review     143849
Europe             46844
Middle East        32385
Name: sectionName, dtype: int64

Apart from "Unknown", Politics has the largest number of comments, so let's select "Politics" as our section name

In [0]:
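def preprocess(comments):
    # keep only the comments posted on Politics articles
    commentBody = comments.loc[comments.sectionName=='Politics', 'commentBody']
    # the patterns are regular expressions; regex=True makes that explicit,
    # since newer pandas versions treat the pattern as a literal string by default
    commentBody = commentBody.str.replace("(<br/>)", "", regex=True)
    commentBody = commentBody.str.replace('(<a).*(>).*(</a>)', '', regex=True)
    commentBody = commentBody.str.replace('(&amp)', '', regex=True)
    commentBody = commentBody.str.replace('(&gt)', '', regex=True)
    commentBody = commentBody.str.replace('(&lt)', '', regex=True)
    commentBody = commentBody.str.replace('(\xa0)', ' ', regex=True)
    return commentBody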

In [0]:
commentBody = preprocess(comments)
commentBody.shape

(479701,)
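To see what the substitutions in `preprocess` do to an individual comment, the sketch below applies the same patterns with the standard `re` module to a made-up sample string. `clean_comment` is a hypothetical helper for this illustration; the notebook itself applies the patterns through pandas.

```python
import re

# Patterns copied from preprocess(); they are applied in this order
PATTERNS = [
    (r"(<br/>)", ""),
    (r"(<a).*(>).*(</a>)", ""),
    (r"(&amp)", ""),
    (r"(&gt)", ""),
    (r"(&lt)", ""),
    ("(\xa0)", " "),
]

def clean_comment(text):
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

# Made-up comment containing a line break, a link, an entity and a non-breaking space
sample = 'I agree &amp; so does <a href="https://example.com">this article</a>.<br/>Thanks\xa0all'
cleaned = clean_comment(sample)
print(repr(cleaned))
```

The result still contains a stray `;` because the entity patterns stop before the trailing semicolon, and the two sentences run together where `<br/>` was removed without a replacement space.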

In [0]:
del comments, df1, df2, df3, df4, df5, df6, df7, df8
gc.collect()

200

A sample comment from the dataset

In [0]:
commentBody.sample().values[0]

"Both parties need to come to an agreement to limit budget talks to one or at most, two C.R.'s. Continuing to kick the can down the road plays havoc with the lives of millions of families and effectively limits discussion on any other issues because budget talks are always in a crisis mode."

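### How does it work?
A Markov chain generator looks only at the current state (here the last `state_size` words) and picks the next word at random, weighted by how often each word followed that state in the training text.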
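The idea fits in a few lines of plain Python. The sketch below builds a word-level chain from three toy sentences and samples from it; it is a simplified illustration and not markovify's actual implementation. `build_chain` and `generate` are hypothetical names, and `<s>` / `</s>` are start and end markers chosen for this example.

```python
import random
from collections import defaultdict

def build_chain(sentences, state_size=2):
    """Map each run of state_size words to the list of words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random, max_words=50):
    """Walk the chain from the start state until the end marker is drawn."""
    state = ("<s>",) * state_size
    output = []
    while len(output) < max_words:
        # followers are stored with repeats, so frequent ones are drawn more often
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            break
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

toy_sentences = [
    "the senate votes on the budget",
    "the senate votes on the nominee",
    "the house votes on the budget",
]
chain = build_chain(toy_sentences, state_size=2)
print(generate(chain, state_size=2, rng=random.Random(0)))
```

With these sentences the chain can also produce "the house votes on the nominee", which never occurs in the training text. A larger `state_size` makes such recombinations rarer and the output closer to the original sentences.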

In [0]:
start_time = time()
comments_generator = markovify.Text(commentBody, state_size = 5)
print("Run time for training the generator : {} seconds".format(round(time()-start_time, 2)))

Run time for training the generator : 79.15 seconds


Print randomly-generated comments using the built model

In [0]:
def generate_comments(generator, number=10, short=False):
    count = 0
    while count < number:
        if short:
            comment = generator.make_short_sentence(140)
        else:
            comment = generator.make_sentence()
        if comment:
            count += 1
            print("Comment {}".format(count))
            print(comment)
            print()

### Comments generated by Bot

In [0]:
generate_comments(comments_generator)

Comment 1
I don't know how to tell the truth.

Comment 2
Thanks to all of you and to the The New York Times for its slanted coverage favoring Hillary over Trump.

Comment 3
Dems and sane republicans should get to the bottom of this... just like the republicans did for Benghazi.

Comment 4
Some day these Republican members of Congress will do nothing as long as they have been properly licensed and register the weapon.

Comment 5
Donald Trump has not hidden the fact that he is the most persecuted innocent man in the world since Job.

Comment 6
Again, it is obvious Trump does not understand he is a public servant, not the dictator he believes he is.

Comment 7
Mind you, it will take a generation of young people to begin by saying NO! to the current system.

Comment 8
Of course Trump would have to still be president to be able to vent and rant behind closed doors with his aides.

Comment 9
The corruption is so deep, someone needs to step up to the plate now and put country before party.

C

### Improving Markov chain generator using spaCy for POS-Tagging:

Improving sentence structure by using parts of speech tagging

In [0]:
nlp = spacy.load("en")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

- POS tagging slows down the training of the generator model considerably
- We therefore use a smaller training set: the Politics comments from April 2018 (`df9`) only
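To see why the `word::POS` encoding used by `POSifiedText` can help, consider the word "vote" used once as a verb and once as a noun. The sketch below mirrors the join and split logic with hand-written tags; in the notebook the tags come from spaCy, and `encode` / `decode` are hypothetical helper names.

```python
def encode(tagged_words):
    """Join each (word, tag) pair into one 'word::TAG' token, as word_split does."""
    return ["::".join((word, tag)) for word, tag in tagged_words]

def decode(tokens):
    """Drop the tags again, as word_join does."""
    return " ".join(token.split("::")[0] for token in tokens)

# Hand-written tags for illustration only
tagged_a = [("They", "PRON"), ("vote", "VERB"), ("today", "NOUN")]
tagged_b = [("The", "DET"), ("vote", "NOUN"), ("passed", "VERB")]

tokens_a = encode(tagged_a)
tokens_b = encode(tagged_b)
print(tokens_a)
print(tokens_a[1] == tokens_b[1])
print(decode(tokens_b))
```

Because `vote::VERB` and `vote::NOUN` are different tokens, the chain keeps separate statistics for each and is less likely to continue a noun with words that only ever followed the verb.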

In [0]:
commentBody = preprocess(df9)
commentBody.shape

(58818,)

In [0]:
del comments_generator, df9
gc.collect()

66

In [0]:
start_time = time()
comments_generator_POSified = POSifiedText(commentBody, state_size = 2)
print("Run time for training the generator : {} seconds".format(round(time()-start_time, 2)))

Run time for training the generator : 1325.71 seconds


## Improved Comments generated by AutoBot using POS-Tagging

In [0]:
generate_comments(comments_generator_POSified)

Comment 1
It is so scared of dying in the Oval Office

Comment 2
Let 's see if there is plenty of voter suppression efforts like Crosscheck , the sick , and to him .

Comment 3
This is a walking void of   national tragedy is the dumbest doctor in the middle of the ferocity of counter attacks by the Mueller probe .

Comment 4
Amy , if Stormy could be up in arms about this ?

Comment 5
I note one more sentimental commentator portraying Ryan as Vice President Mike Pence in that regard .

Comment 6
Just register Democrats and Republicans .

Comment 7
Germany is having difficulty understanding exactly what attorney client privilege with regard to politics .

Comment 8
Vote in November , and the public to special interests .

Comment 9
One can only imagine how they knew that these were the pawns in a very expensive health care free for all the money he raised on Social Security and Medicare by manufacturing a plethora of philippics , thru his stupidity , and preserve a treasonous president .