<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# # Project 3 : Web API and NLP - Section 2: NLP Processing

## Part 1: Imports and Read

References: https://stackoverflow.com/questions/39782418/remove-punctuations-in-pandas/39782973

In [34]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from nltk import sent_tokenize, word_tokenize, RegexpTokenizer, WordNetLemmatizer


pd.set_option('display.max_colwidth', None)

In [4]:
#read the csv created
reddit_df = pd.read_csv("./datasets/reddit4500.csv")
#dropping unnamed column
reddit_df.drop(columns = 'Unnamed: 0', inplace=True)
reddit_df.head(2)

Unnamed: 0,subreddit,selftext,title
0,boardgames,,codenames online
1,boardgames,"I was reading the comments on a Brothers Murph video. Someone posted ""there are mechanics I love and some I haaaaaate"" . That got me wondering if I hated any mechanics.\n\nThere are some I like more than others but I couldn't think of any I hate. The only one I had a problem with multiple times was trading resources. There have been times when a ""couple"" would make really unfair trades with each other. It was obvious the female was trading to make her husband happy. Fortunately I am good enough friends with them to point this out. I even found a short podcast on this subject and played it for them. The term they used was ""lioning"".\n\nBut I believe this was instances of people abusing/not playing as intended not the mechanic.\n\nI have a friends wife that refuses to play auction games. I just don't get it. Why? I'm guessing it's because she's not good at it. \n\nAre there mechanics you hate and why?",What mechanics do you hate and why?


## Part 2: EDA and Cleanup

In [5]:
#Get value counts on our y value.
reddit_df['subreddit'].value_counts()

Fallout         1600
RocketLeague    1500
boardgames      1400
Name: subreddit, dtype: int64

In [6]:
#check the value counts normalize
reddit_df['subreddit'].value_counts(normalize=True)

Fallout         0.355556
RocketLeague    0.333333
boardgames      0.311111
Name: subreddit, dtype: float64

In [7]:
#look for nulls in all the columns
reddit_df.isnull().sum()

subreddit       0
selftext     1456
title           0
dtype: int64

In [8]:
reddit_df.shape

(4500, 3)

In [9]:
#replacing nulls in selftext with do instead of adding blank as it is stop word
reddit_df['selftext'].fillna("do ", inplace=True)

In [30]:
#combining selftext and title so we can consider title as part of the features
reddit_df["submission_text"] = reddit_df["selftext"] + ' ' + reddit_df["title"]

In [24]:
reddit_df.head(3)

Unnamed: 0,subreddit,selftext,title,submission_text
0,boardgames,do,codenames online,do codenames online
1,boardgames,"I was reading the comments on a Brothers Murph video. Someone posted ""there are mechanics I love and some I haaaaaate"" . That got me wondering if I hated any mechanics.\n\nThere are some I like more than others but I couldn't think of any I hate. The only one I had a problem with multiple times was trading resources. There have been times when a ""couple"" would make really unfair trades with each other. It was obvious the female was trading to make her husband happy. Fortunately I am good enough friends with them to point this out. I even found a short podcast on this subject and played it for them. The term they used was ""lioning"".\n\nBut I believe this was instances of people abusing/not playing as intended not the mechanic.\n\nI have a friends wife that refuses to play auction games. I just don't get it. Why? I'm guessing it's because she's not good at it. \n\nAre there mechanics you hate and why?",What mechanics do you hate and why?,"I was reading the comments on a Brothers Murph video. Someone posted ""there are mechanics I love and some I haaaaaate"" . That got me wondering if I hated any mechanics.\n\nThere are some I like more than others but I couldn't think of any I hate. The only one I had a problem with multiple times was trading resources. There have been times when a ""couple"" would make really unfair trades with each other. It was obvious the female was trading to make her husband happy. Fortunately I am good enough friends with them to point this out. I even found a short podcast on this subject and played it for them. The term they used was ""lioning"".\n\nBut I believe this was instances of people abusing/not playing as intended not the mechanic.\n\nI have a friends wife that refuses to play auction games. I just don't get it. Why? I'm guessing it's because she's not good at it. \n\nAre there mechanics you hate and why? What mechanics do you hate and why?"
2,boardgames,[removed],Golfer Joaquin Niemann helps raise $2.1 million to save his infant cousin's life,[removed] Golfer Joaquin Niemann helps raise $2.1 million to save his infant cousin's life


In [25]:
#test a row
reddit_df["submission_text"][2]

"[removed] Golfer Joaquin Niemann helps raise $2.1 million to save his infant cousin's life"

## Part 3: NLP Processing 


In [31]:
reddit_df['submission_text'][:15]

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [27]:
## we can use tokenizer to just use words
#create function to get words only using RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') 

def gettoken(x):
    return tokenizer.tokenize(x.lower())

def lematize_reddit(col):
    lemmatizer = WordNetLemmatizer()
    tokens= [word_tokenize(text) for text in reddit_df[f'{col}']]
    for i , token in enumerate(tokens):
        for word in token:
            words = []
            words.append(lemmatizer.lemmatize(word))
            token = ' '.join(tokens[i])
    for i , token in enumerate(tokens):
        tokens[i] = ' '.join(tokens[i])
        reddit_df[f'{col}'][i]= tokens[i]
    return 'Reddit Submission Text Data Lemmatized'

In [28]:
#replace resource link = https://stackoverflow.com/questions/39782418/remove-punctuations-in-pandas/39782973
# removing the instances of [removed] as the actual post has been removed
# reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('[removed]', '',regex = True)

In [21]:
# replacing all url links to blank in submissions_text as urls al
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('http.*.com', '',regex = True)


In [21]:

# replace \n\n string to blank in submissions_text
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('\n\n', ' ',regex = True)


In [21]:

# replace 'TT' to blank in submissions_text
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('TT', '',regex = True)

# replace punchuations to blank in submissions_text
# reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('[^\w\s]', ' ',regex = True)

In [22]:
#check for few records
reddit_df['submission_text'][:15]

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [38]:
#tokenize the words using function above
reddit_df['submission_text']=reddit_df['submission_text'].map(gettoken)

In [36]:
lematize_reddit ('submission_text')

'Reddit Submission Text Data Lemmatized'

In [39]:
#check sample records again
reddit_df['submission_text'][:15]

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                [do, codenames, online]
1                                                                                    [i, was, reading, the, comments, on, a, brothers, murph, video, someone, posted, there, are, mechanics, i, love, and, some, i, haaaaaate, that, got, me, wondering, if, i, hated, any, mechanics, there, a

In [40]:
#drop the columns selftext and title that we no longer need
reddit_df.drop(columns=['selftext', 'title'], inplace=True)

In [41]:
reddit_df.to_csv('./datasets/redditready.csv', index=False)