RedditNLProcessor.py is the main script that handles the pre-processing of the scrapped Reddit posts.  As with RedditScrapper.py, RedditNLProcessor.py can be run as a background thread or task pre-processing the posts at defined intervals.

#### 1. Retrieve all Reddit posts that requires pre-processing

Reddit posts that are pending pre-processing have the batch_id = 0.  A query to the database via the AWSDB interface, retrieves all Reddit posts with batch_id = 0.

#### 2. Pre-process the posts 

Each Reddit post undergodes a series of steps for pre-processing. For each Reddit post, a NLPPost is created and instantiated from it.  On initialization, the Reddit post's contents undergoes the following cleasning steps with the help of the NLPUtils.py module.

    2.1 Combined the contents of both the title and text within the post
    2.2 Using BeautifulSoup, remove all HTML tags
    2.3 Using regex, remove all digits, special characters and symbols
    2.4 All words are lower cased
    2.5 Stop words in the NLTK module are removed

In [None]:
def post_to_words(text, my_stop_words=None):

    # Remove HTML tags
    text = BeautifulSoup(text, features='html5lib').get_text()

    # Remove digits and special characters
    letters_only = re.sub("[^a-zA-Z]", " ", text)

    # Convert to lower case and split into individual words
    words = letters_only.lower().split()

    # Get NLTK stop words
    stops = set(stopwords.words('english'))

    # Add new stop words using the my_stop_words parameter
    if my_stop_words is not None:
        stops.update(my_stop_words)

    # Remove stop words
    meaningful_words = [word for word in words if not word in stops]

    return meaningful_words

#### 3. Tokenization, Lemmatization, Stemming
    
At this point, the Reddit post has been processed to a collection of words meaningful for NLP. Next, I'll tokenized these words by lemmatization or stemming, or both.  RedditNLProcessor.py allows both tokenization methods via arguments 'lemmatize' and 'stem'.  I can also control the number of post to pre-process through the 'limit' parameter. Results of the tokenization is stored in the NLPPost object and later persist in the AWS database.

In [None]:
# Check whether to lemmatize posts to tokens
if lemmatize:
    for post in nlp_posts:
        post.lemma_tokens = NLPUtils.lemmatize_words(post.words)

# Check whether to stem posts to tokens
if stem:
    for post in nlp_posts:
        post.stem_tokens = NLPUtils.stem_words(post.words)

#### 4. Store pre-processing results in database

The results of the cleansing and tokenization in the NLPPost instances are persisted in the posts_nlp table in the AWS MySQL database via the AWSDB interface. We can then use these results in future for count vectorization, TFID

In [1]:
import numpy as np
import pandas as pd

from nltk.tokenize import RegexpTokenizer

from nltk.stem import WordNetLemmatizer

from nltk.corpus import stopwords # Import the stop word list

from sklearn.model_selection import train_test_split

from utils.aws import AWSDB 

In [2]:
posts = AWSDB.get_reddit_posts_for_nlp()

df = pd.DataFrame([vars(post) for post in posts])

df.drop(columns=['batch_id', 'created_utc'], inplace=True)

df.sort_values(by='created').head()

Unnamed: 0,created,name,subreddit,text,title
0,2019-12-05 23:58:39,t3_e6q4c5,StarWars,,Baby Yoda in real life
1,2019-12-06 00:01:05,t3_e6q5fw,StarWars,Who would be the best and who would be the wor...,Best and Worst of Baby Yoda’s Voice
2,2019-12-06 00:02:34,t3_e6q64h,StarWars,Hello there. I’m getting more hyped for Ep IX ...,Casual fan in need of some background info
3,2019-12-06 00:12:15,t3_e6qaxu,StarWars,,Ready to lay low and stretch your legs for a c...
4,2019-12-06 00:16:04,t3_e6qcqq,StarWars,,3D printed Baby-Yoda!


In [49]:

tokenizer = RegexpTokenizer(r'\w+')

lemmatizer = WordNetLemmatizer()

df['content'] = df['title'] + ' ' + df['text']

df['tokens'] = [ ' '.join(tokenizer.tokenize(content.lower())) for content in df['content']]

df['lemma'] = [(lemmatizer.lemmatize(word) for word in tokens.split(' ')) for tokens in df['tokens']]




In [50]:
df

Unnamed: 0,after,before,created_utc,name,text,title,created,content,tokens,lemma
0,t3_e49tiy,,1.575092e+09,t3_e3r9eg,We want to remind everyone about the above rul...,Reminder: We don't allow memes or image macros,2019-11-30 05:26:30,Reminder: We don't allow memes or image macros...,reminder we don t allow memes or image macros ...,<generator object <listcomp>.<genexpr> at 0x17...
1,t3_e49tiy,,1.575015e+09,t3_e3bxpp,Join us for a discussion on the latest episode...,The Mandalorian - Discussion Thread - S1E4,2019-11-29 08:13:20,The Mandalorian - Discussion Thread - S1E4 Joi...,the mandalorian discussion thread s1e4 join us...,<generator object <listcomp>.<genexpr> at 0x17...
2,t3_e49tiy,,1.575151e+09,t3_e447b4,,Dying Star Wars Fan Sees The Rise Of Skywalker...,2019-11-30 21:57:56,Dying Star Wars Fan Sees The Rise Of Skywalker...,dying star wars fan sees the rise of skywalker...,<generator object <listcomp>.<genexpr> at 0x17...
3,t3_e49tiy,,1.575152e+09,t3_e44f3f,,Our 3d printed Baby Yoda Christmas tree topper!,2019-11-30 22:11:41,Our 3d printed Baby Yoda Christmas tree topper!,our 3d printed baby yoda christmas tree topper,<generator object <listcomp>.<genexpr> at 0x17...
4,t3_e49tiy,,1.575171e+09,t3_e49du5,,My husbands name is Luke so we thought this wa...,2019-12-01 03:24:10,My husbands name is Luke so we thought this wa...,my husbands name is luke so we thought this wa...,<generator object <listcomp>.<genexpr> at 0x17...
5,t3_e49tiy,,1.575174e+09,t3_e4aebt,,"I've been absolutely loving The Mandalorian, s...",2019-12-01 04:26:14,"I've been absolutely loving The Mandalorian, s...",i ve been absolutely loving the mandalorian so...,<generator object <listcomp>.<genexpr> at 0x17...
6,t3_e49tiy,,1.575177e+09,t3_e4azx7,,Got a Bday surprise from my daughter. Hand pai...,2019-12-01 05:04:01,Got a Bday surprise from my daughter. Hand pai...,got a bday surprise from my daughter hand pain...,<generator object <listcomp>.<genexpr> at 0x17...
7,t3_e49tiy,,1.575162e+09,t3_e46ws0,,Rancor comic,2019-12-01 00:57:06,Rancor comic,rancor comic,<generator object <listcomp>.<genexpr> at 0x17...
8,t3_e49tiy,,1.575126e+09,t3_e3xms5,,I repainted some cheap Droid toys and made the...,2019-11-30 15:01:35,I repainted some cheap Droid toys and made the...,i repainted some cheap droid toys and made the...,<generator object <listcomp>.<genexpr> at 0x17...
9,t3_e49tiy,,1.575210e+09,t3_e4hxgf,HDMI\n\nEdit: I’m getting a ton of flack for ...,What did Master Yoda say when he first saw him...,2019-12-01 14:14:43,What did Master Yoda say when he first saw him...,what did master yoda say when he first saw him...,<generator object <listcomp>.<genexpr> at 0x17...


In [None]:
X = df[['text_feature']]
y = df['twitter']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19, stratify=y)

