# Get Reddit Data
This notebook pulls and cleans the Reddit comments that will be used to train the various models.


We will need praw (the Python Reddit API Wrapper), nltk, and Keras (which needs tensorflow) for data collection and cleaning. As these libraries are not included in conda, here are the four installs.

In [None]:
!pip install praw
!pip install nltk
!pip install tensorflow
!pip install keras

In [None]:
# Constants
# Minimum number of comments to collect before scraping stops.
THRESHOLD = 100000

In [None]:
import praw
import nltk
nltk.download('vader_lexicon')
import pickle
import numpy as np

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

from nltk.sentiment import SentimentIntensityAnalyzer
from get_data import *

## Establishing a Reddit instance and scraping comments
Having installed and imported all the necessary libraries, we request at least THRESHOLD comments and their parents from Reddit. The raw data is stored in six pickle files in the preclean directory:

    comment_texts               ====>        Plaintext of each comment requested from Reddit
    comment_sentiment_scores    ====>        List of the scores returned by the VADER sentiment analyzer for each comment
    comment_upvotes             ====>        The number of upvotes for each comment
    parent_texts                ====>        Plaintext of the parent from each parent-child pair
    parent_sentiment_scores     ====>        List of the scores returned by the VADER sentiment analyzer for each parent
    children_texts              ====>        Plaintext of the child from each parent-child pair
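
As a minimal sketch of the saving step (not the notebook's own code), the helper below checks that a group of lists is index-aligned and then pickles each one, using a context manager so every file handle is closed. The name `save_parallel_lists`, the sample data, and the temporary directory are assumptions for illustration; that the lists in a group should have equal length is inferred from the descriptions above rather than stated by get_data.py.

```python
import os
import pickle
import tempfile


def save_parallel_lists(directory, named_lists):
    """Pickle each list in named_lists to <directory>/<name>.pkl.

    Raises ValueError if the lists differ in length, since entry i of
    every list is assumed to describe the same comment.
    """
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"lists are not aligned: {lengths}")
    os.makedirs(directory, exist_ok=True)
    for name, values in named_lists.items():
        path = os.path.join(directory, name + ".pkl")
        with open(path, "wb") as handle:
            pickle.dump(values, handle)


# Hypothetical sample data standing in for scraped comments.
sample = {
    "comment_texts": ["it works on my machine", "have you tried rebooting?"],
    "comment_upvotes": [12, 3],
}

with tempfile.TemporaryDirectory() as tmp:
    save_parallel_lists(tmp, sample)
    # Read one file back to confirm the round trip.
    with open(os.path.join(tmp, "comment_texts.pkl"), "rb") as handle:
        restored_texts = pickle.load(handle)
```

Failing early on mismatched lengths is a design choice: a misaligned list would otherwise only surface later, when texts are paired with the wrong scores during training.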

In [None]:
# Fill in your own Reddit API credentials before running this cell.
reddit = praw.Reddit(client_id="",
                     client_secret="",
                     user_agent='Python: Comment Scraper: v0.1(by /u/josmfred)')

comment_texts, comment_sentiment_scores, comment_upvotes = [], [], []
parent_texts, parent_sentiment_scores, children_texts = [], [], []

while len(comment_texts) < THRESHOLD:
    score_predict_data, word_predict_data = (
            get_comments_and_parents(get_random_submission("ProgrammerHumor", reddit))
    )
    comment_texts.extend(score_predict_data[0])
    comment_sentiment_scores.extend(score_predict_data[1])
    comment_upvotes.extend(score_predict_data[2])
    parent_texts.extend(word_predict_data[0])
    parent_sentiment_scores.extend(word_predict_data[1])
    children_texts.extend(word_predict_data[2])

# Save each raw list to its own pickle file, closing every file after writing.
raw_data = {
    "comment_texts": comment_texts,
    "comment_sentiment_scores": comment_sentiment_scores,
    "comment_upvotes": comment_upvotes,
    "parent_texts": parent_texts,
    "parents_sentiment_scores": parent_sentiment_scores,
    "children_texts": children_texts,
}
for name, values in raw_data.items():
    with open(f"preclean/{name}.pkl", "wb") as f:
        pickle.dump(values, f)


## Data Cleaning
Having finished acquiring and saving the data, we now clean it, assuming the raw pickles are in the preclean directory. Apart from the imports, this portion of the notebook does not require any of the previous cells to have been run; it only requires that the correct pickles are in the preclean directory. First we load all the data from that location, along with the tokenizer if it exists. We then run the cleaning functions in get_data.py to prepare the data for the learn notebook. The cleaned data is saved as follows:

    parent_texts               ====>        Padded, word-tokenized vectors of each parent comment
    parent_sentiment_scores    ====>        Numpy array of the VADER sentiment scores for parent comments
    child_first_word           ====>        The tokenized first word in each child comment
    comment_texts              ====>        Padded, word-tokenized vectors of each comment
    comment_sentiment_scores   ====>        Numpy array of the VADER sentiment scores for every comment
    comment_upvotes            ====>        The number of upvotes for every comment
    words                      ====>        The tokenizer's word -> index dictionary (tokenizer.word_index)
    tokenizer                  ====>        The tokenizer to reuse
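
"Padded, word-tokenized" means that each text becomes a fixed-length row of integer word IDs. The pure-Python sketch below illustrates that idea only: it is neither the code in get_data.py nor the Keras implementation, and the function names are hypothetical. It mirrors what we understand to be the Keras defaults (word indices start at 1 in order of descending frequency, and short sequences are left-padded with 0), which is worth confirming against the Keras documentation. Unlike Keras, it splits on whitespace only and does not strip punctuation.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to rows of exactly maxlen integers, dropping unknown words."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # truncate from the front, keeping the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return rows


texts = ["the cake is a lie", "the build is broken"]
word_index = fit_word_index(texts)
padded = texts_to_padded(texts, word_index, maxlen=6)
```

Every row in `padded` has length 6, so the result can be converted to a rectangular numpy array before being saved.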

In [None]:
comment_texts = pickle.load(open("preclean/comment_texts.pkl", "rb"))
comment_sentiment_scores = pickle.load(open("preclean/comment_sentiment_scores.pkl", "rb"))
comment_upvotes = pickle.load(open("preclean/comment_upvotes.pkl", "rb"))
parent_texts = pickle.load(open("preclean/parent_texts.pkl", "rb"))
parent_sentiment_scores = pickle.load(open("preclean/parents_sentiment_scores.pkl", "rb"))
children_texts = pickle.load(open("preclean/children_texts.pkl", "rb"))
# The tokenizer might not exist. If it does not, we assume that the data
# being processed should be used to fit a new tokenizer, which is then
# saved for use in later data processing. If the tokenizer does exist,
# we use the existing tokenizer on the new data.
try:
    tokenizer = pickle.load(open("tokenizer.pkl", "rb"))
except FileNotFoundError:
    # No saved tokenizer yet; the prepare functions will fit a new one.
    tokenizer = None

pad_parents, parent_scores, first_word, tokenizer = prepare_next_word_data(
    parent_texts,
    parent_sentiment_scores,
    children_texts,
    comment_texts,
    tokenizer=tokenizer,
)
texts_pad, sentiment_scores, upvotes, tokenizer = prepare_score_data(
    comment_texts,
    comment_sentiment_scores,
    comment_upvotes,
    150,
    tokenizer=tokenizer,
)

# Save all the processed data, the word -> index dictionary of the
# tokenizer, and the tokenizer itself.
np.save("cleaned/parent_texts.npy", pad_parents)
np.save("cleaned/parent_sentiment_scores.npy", parent_scores)
np.save("cleaned/child_first_word.npy", first_word)
np.save("cleaned/comment_texts.npy", texts_pad)
np.save("cleaned/comment_sentiment_scores.npy", sentiment_scores)
np.save("cleaned/comment_upvotes.npy", upvotes)
pickle.dump(tokenizer.word_index, open("cleaned/words.pkl", "wb"))
pickle.dump(tokenizer, open("tokenizer.pkl", "wb"))
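
The saved `cleaned/words.pkl` holds the tokenizer's word -> index dictionary. As a sketch of how it might be used later, the hypothetical helper below inverts such a dictionary to turn a padded row of token IDs back into readable text. The sample mapping and the assumption that 0 is the padding value are for illustration; check both against the data that get_data.py actually produces.

```python
def decode_row(row, word_index, pad_value=0):
    """Turn a row of integer token IDs back into a space-joined string.

    Padding entries and IDs missing from word_index are skipped.
    """
    index_word = {index: word for word, index in word_index.items()}
    return " ".join(
        index_word[i] for i in row if i != pad_value and i in index_word
    )


# Hypothetical mapping in the same shape as tokenizer.word_index.
word_index = {"the": 1, "is": 2, "build": 3, "broken": 4}
decoded = decode_row([0, 0, 1, 3, 2, 4], word_index)
```

Decoding a few rows this way is a quick sanity check that the cleaned arrays and the saved dictionary belong to the same tokenizer.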