# Learn and Run
This notebook trains all the models and then uses them to generate a reply to a comment. With a reasonably large number of comments and modest computing power, the models take a long time to train.

First, we will need nltk again.

In [3]:
!pip install nltk

Collecting nltk
Installing collected packages: nltk
Successfully installed nltk-3.3
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [4]:
import random
from operator import itemgetter

import numpy as np
import pickle
import keras

import nltk
nltk.download('vader_lexicon')

from keras.preprocessing import sequence
from nltk.sentiment import SentimentIntensityAnalyzer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23



## Training
To keep things simple, the following scripts load the data from the correct locations, train the models, and then save the models to the models directory. See the individual files for information on each model. Hopefully a model summary will be implemented soon.

In [None]:
# Train all the models needed for generating comments and save them.
# These take a while, so avoid rerunning them if possible.
!python3 scripts/train/next_word_markov_model.py
!python3 scripts/train/predict_first_word_model.py
!python3 scripts/train/predict_upvotes_model.py
!python3 scripts/train/cluster_comments.py
!python3 scripts/train/cluster_markov_model.py
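
Because training is slow, one way to avoid rerunning it is to check which saved model files already exist. The sketch below is hypothetical: `scripts_to_run` is an illustrative helper, not part of this project, and the mapping from each script to its output file is an assumption inferred from the file names loaded later in this notebook.

```python
import os

# Assumed mapping from training script to the file it writes; the real
# scripts may save their models under different names.
SCRIPT_OUTPUTS = {
    "scripts/train/next_word_markov_model.py": "generator_probability_chain.pkl",
    "scripts/train/predict_first_word_model.py": "predict_first_word.h5",
    "scripts/train/predict_upvotes_model.py": "predict_upvotes.h5",
    "scripts/train/cluster_comments.py": "cluster_comments.pkl",
    "scripts/train/cluster_markov_model.py": "cluster_markov_model.pkl",
}


def scripts_to_run(model_dir="models", outputs=SCRIPT_OUTPUTS):
    """Return the training scripts whose output file is not in model_dir yet."""
    return [script for script, filename in outputs.items()
            if not os.path.exists(os.path.join(model_dir, filename))]
```

Each script returned by `scripts_to_run()` could then be launched in order, for example with `subprocess.run(["python3", script], check=True)`.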

## Loading the Models
With the models trained, we can load each of them from the models directory for use in the comment generator. We then load some additional files that make it possible to generate readable comments. Each of the models does the following:

    predict_upvotes                 ====>        Predicts the upvotes of a comment
    predict_first_word              ====>        Predicts the first word of the comment's child (the reply)
    cluster_comments                ====>        Clusters the comments into 256 different clusters
    cluster_markov                  ====>        Given the cluster of the parent, generates a probability
                                                   distribution for the words in the child
    generator_probability_chain     ====>        Generates a bigram-based probability distribution for the next word
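
To make the bigram-based row concrete, here is a minimal sketch of a bigram probability chain: it counts which word follows which and normalizes the counts into a next-word distribution. It is an illustration only. The function names and the toy corpus are assumptions, and the structure actually stored in `generator_probability_chain.pkl` may differ.

```python
import random
from collections import Counter, defaultdict


def build_bigram_chain(sentences):
    """Map each word to a {next_word: probability} dict estimated from bigram counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Pair every token with the token that follows it.
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    chain = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        chain[word] = {nxt: n / total for nxt, n in followers.items()}
    return chain


def sample_next(chain, word, rng=random):
    """Draw a next word from the distribution for `word`; return None if it has no followers."""
    dist = chain.get(word)
    if not dist:
        return None
    candidates = list(dist)
    return rng.choices(candidates, weights=[dist[c] for c in candidates], k=1)[0]


# Toy corpus, for illustration only.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
chain = build_bigram_chain(toy_corpus)
print(chain["the"])
```

Repeatedly calling `sample_next` with the previously sampled word produces a word sequence, stopping when it returns `None`.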

In [6]:
# Load all models, data, words, and tokenizer
predict_upvote_model = keras.models.load_model("models/predict_upvotes.h5")
predict_first_word_model = keras.models.load_model("models/predict_first_word.h5")
cluster_comments_model = joblib.load("models/cluster_comments.pkl")
with open("models/cluster_markov_model.pkl", "rb") as f:
    cluster_markov_model = pickle.load(f)
with open("models/generator_probability_chain.pkl", "rb") as f:
    p_chain = pickle.load(f)

comment_texts = np.load("cleaned/comment_texts.npy")
comment_sentiment_scores = np.load("cleaned/comment_sentiment_scores.npy")
with open("cleaned/words.pkl", "rb") as f:
    words = pickle.load(f)
# Invert the word -> index mapping so indices can be turned back into words.
inv_map = {v: k for k, v in words.items()}
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

## Comment Generation
Now we can use all the models to generate a comment in an upvote-maximizing way. To see the procedure for generating a comment, see the presentation in the main directory. We show both the parent comment and the generated reply.
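
The presentation describes the actual procedure. As rough intuition for what "upvote-maximizing" can mean, one common pattern is to generate several candidate replies and keep the one a scoring model rates highest. The sketch below illustrates only that general pattern and is an assumption, not a description of `generate_comment`. The generator and scorer are toy stand-ins, and `pick_best_reply` is a hypothetical name.

```python
import random


def pick_best_reply(generate_candidate, score, n_candidates=10):
    """Generate n_candidates replies and return the one with the highest score."""
    candidates = [generate_candidate() for _ in range(n_candidates)]
    return max(candidates, key=score)


# Stand-ins for the real models: a random word-sequence generator and a
# scorer that simply prefers longer replies.
rng = random.Random(0)
vocabulary = ["this", "is", "a", "toy", "reply"]


def toy_generator():
    """Return a random list of 1 to 5 words from the toy vocabulary."""
    return [rng.choice(vocabulary) for _ in range(rng.randint(1, 5))]


def toy_score(reply):
    """Score a reply by its length (a placeholder for a predicted upvote count)."""
    return len(reply)


best = pick_best_reply(toy_generator, toy_score)
print(" ".join(best))
```

In a real pipeline, `score` would wrap a trained predictor such as the upvote model, and `generate_candidate` would wrap the word-by-word generator.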

In [17]:
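# Use all the previously described models and data to generate a reply to a randomly chosen comment.
# randrange excludes the upper bound, so the index is always valid (randint would include it).
cmt = random.randrange(comment_texts.shape[0])
# Decode the parent comment, skipping index 0 (presumably padding).
print("comment = {}\n".format(
            " ".join(inv_map[ind] for ind in comment_texts[cmt] if ind != 0)))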

print("reply = {}".format(generate_comment(comment_texts[cmt], comment_sentiment_scores[cmt],
                                           inv_map, p_chain,
                                           upvote_model=predict_upvote_model,
                                           first_word_model=predict_first_word_model,
                                           comment_cluster_model=cluster_comments_model,
                                           cluster_markov_model=cluster_markov_model,
                                           tokenizer=tokenizer
                                           )
                         )
     )

comment = damn even here you're everywhere

reply = grep the term spyware like saying plagiarism is okay
