<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# Project 3: Web API and NLP - Section 2: NLP Processing

## Part 1: Imports and Read

References: https://stackoverflow.com/questions/39782418/remove-punctuations-in-pandas/39782973

In [2]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from nltk import sent_tokenize, word_tokenize, RegexpTokenizer


pd.set_option('display.max_colwidth', None)

In [3]:
#read the csv created
reddit_df = pd.read_csv("./datasets/reddit4500.csv")
#dropping unnamed column
reddit_df.drop(columns = 'Unnamed: 0', inplace=True)
reddit_df.head()

Unnamed: 0,subreddit,selftext,title
0,boardgames,"Hi all, really want to get into board games that put an emphasis on diplomacy and trade on a medium to large scale, What are some of your favorites/recommendations, thanks",Recommendations for board games
1,boardgames,My husband and I are looking for fun 2 player games to play on days when the weather isn’t favorable to spend time outside! We’ve been watching far too much TV lately.\n\nThanks in advance!,Games for 2 people?
2,boardgames,,Apocalypse 5E Kickstarter
3,boardgames,I've been looking into the counter insurgency (COIN) series of games from GMT and they look really interesting. My problem is that there are a lot of games in the series and I want all of your opinions on which are the best (and which to avoid).,Which COIN games are the best?
4,boardgames,[removed],Making Space For Disappointment


## Part 2: EDA and Cleanup

In [4]:
#Get value counts on our y value.
reddit_df['subreddit'].value_counts()

Fallout         1600
RocketLeague    1500
boardgames      1400
Name: subreddit, dtype: int64

In [5]:
#check the class proportions (normalized value counts)
reddit_df['subreddit'].value_counts(normalize=True)

Fallout         0.355556
RocketLeague    0.333333
boardgames      0.311111
Name: subreddit, dtype: float64

In [6]:
#look for nulls in all the columns
reddit_df.isnull().sum()

subreddit       0
selftext     1469
title           0
dtype: int64

In [7]:
reddit_df.shape

(4500, 3)

In [8]:
#fill nulls in selftext with "do " rather than an empty string; "do" is a stop word, so it drops out once stop words are removed
reddit_df['selftext'] = reddit_df['selftext'].fillna("do ")

In [18]:
#combining selftext and title so we can consider title as part of the features
reddit_df["submission_text"] = reddit_df["selftext"] + ' ' + reddit_df["title"]
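Filling the nulls before this concatenation matters, because adding a string to a missing value yields a missing value. A minimal sketch on a two-row toy frame (the `toy` frame and its rows are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical two-row frame: the second post has no selftext.
toy = pd.DataFrame({
    "selftext": ["Looking for 2 player games", None],
    "title": ["Games for 2 people?", "Apocalypse 5E Kickstarter"],
})

# Without filling, a missing selftext makes the combined text missing too.
unfilled = toy["selftext"] + " " + toy["title"]

# Fill first, then combine; assigning the result back avoids inplace=True
# on a column selection.
toy["selftext"] = toy["selftext"].fillna("do ")
toy["submission_text"] = toy["selftext"] + " " + toy["title"]
```

In the sketch, `unfilled` is missing for the second row, while `submission_text` keeps the title for every row.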

In [19]:
reddit_df.head()

Unnamed: 0,subreddit,selftext,title,submission_text
0,boardgames,"Hi all, really want to get into board games that put an emphasis on diplomacy and trade on a medium to large scale, What are some of your favorites/recommendations, thanks",Recommendations for board games,"Hi all, really want to get into board games that put an emphasis on diplomacy and trade on a medium to large scale, What are some of your favorites/recommendations, thanks Recommendations for board games"
1,boardgames,My husband and I are looking for fun 2 player games to play on days when the weather isn’t favorable to spend time outside! We’ve been watching far too much TV lately.\n\nThanks in advance!,Games for 2 people?,My husband and I are looking for fun 2 player games to play on days when the weather isn’t favorable to spend time outside! We’ve been watching far too much TV lately.\n\nThanks in advance! Games for 2 people?
2,boardgames,do,Apocalypse 5E Kickstarter,do Apocalypse 5E Kickstarter
3,boardgames,I've been looking into the counter insurgency (COIN) series of games from GMT and they look really interesting. My problem is that there are a lot of games in the series and I want all of your opinions on which are the best (and which to avoid).,Which COIN games are the best?,I've been looking into the counter insurgency (COIN) series of games from GMT and they look really interesting. My problem is that there are a lot of games in the series and I want all of your opinions on which are the best (and which to avoid). Which COIN games are the best?
4,boardgames,[removed],Making Space For Disappointment,[removed] Making Space For Disappointment


In [20]:
#test a row
reddit_df["submission_text"][2]

'do  Apocalypse 5E Kickstarter'

## Part 3: NLP Processing 


In [21]:
reddit_df['submission_text'][:15]


In [22]:
## we can use tokenizer to just use words
#create function to get words only using RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') 

def gettoken(x):
    return tokenizer.tokenize(x.lower())
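To see what the `\w+` pattern keeps and drops, the same idea can be sketched with the standard-library `re` module. The helper name `gettoken_sketch` is hypothetical; for this pattern, `re.findall` should give the same tokens as the tokenizer above.

```python
import re

def gettoken_sketch(text):
    # Hypothetical stand-in for gettoken: lowercase, then keep runs of word characters.
    return re.findall(r'\w+', text.lower())

sample = "We’ve been watching far too much TV lately.\n\nThanks in advance!"
tokens = gettoken_sketch(sample)
print(tokens)
```

Note that the apostrophe is not a word character, so a contraction such as "We’ve" is split into two tokens.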



In [23]:
#replace resource link = https://stackoverflow.com/questions/39782418/remove-punctuations-in-pandas/39782973
# remove url links from submission_text (match up to the next whitespace so text between two links is kept)
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace(r'http\S+', '', regex=True)

# replace double newlines with a space in submission_text
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('\n\n', ' ', regex=True)

# remove 'TT' from submission_text
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace('TT', '', regex=True)

# replace punctuation with a space in submission_text
reddit_df['submission_text'] = reddit_df['submission_text'].str.replace(r'[^\w\s]', ' ', regex=True)
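URL patterns are easy to get wrong, so it can help to try them on a single string before applying them to the whole column. The sketch below uses the standard-library `re` module on a made-up sentence and compares a greedy pattern, `http.*.com`, with a narrower one, `http\S+`, that stops at the next whitespace. Both the sample text and the variable names are illustrative assumptions.

```python
import re

# Made-up sample with two links and ordinary text between them.
text = "see http://a.com and my comment then http://b.com/page ok"

# Greedy: '.*' runs to the last place where any character is followed by 'com',
# so the words between the two links are removed as well.
greedy = re.sub(r'http.*.com', '', text)

# Narrower: each link is removed up to the next whitespace character.
narrow = re.sub(r'http\S+', '', text)

print(repr(greedy))
print(repr(narrow))
```

The narrower pattern leaves the surrounding words intact, at the cost of some doubled spaces that the tokenizer ignores anyway.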

In [24]:
#check a few records
reddit_df['submission_text'][:15]


In [25]:
#tokenize the words using function above
reddit_df['submission_text']=reddit_df['submission_text'].map(gettoken)

In [26]:
#check sample records again
reddit_df['submission_text'][:15]

0                                                                                                                                                                                                                                                                                                                                                                           [hi, all, really, want, to, get, into, board, games, that, put, an, emphasis, on, diplomacy, and, trade, on, a, medium, to, large, scale, what, are, some, of, your, favorites, recommendations, thanks, recommendations, for, board, games]
1                                                                                                                                                                                                                                                                                                                                                                   [my, husband, and, i, are, looking, for, fu

In [27]:
#drop the columns selftext and title that we no longer need
reddit_df.drop(columns=['selftext', 'title'], inplace=True)

In [28]:
reddit_df.to_csv('./datasets/redditready.csv', index=False)
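One thing to keep in mind when this file is read back: a CSV cell can only hold text, so a column of token lists is stored as the string form of each list. A minimal sketch of recovering the list with the standard-library `ast` module (the `tokens` list here is a shortened, illustrative example):

```python
import ast

# Illustrative token list, like one row of submission_text after tokenizing.
tokens = ['recommendations', 'for', 'board', 'games']

# What ends up in the CSV cell is the string form of the list.
cell = str(tokens)

# literal_eval safely parses that string back into a Python list.
restored = ast.literal_eval(cell)
```

After `pd.read_csv`, the same conversion could be applied to the column with `.map(ast.literal_eval)`.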

In [None]:
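from nltk.stem import WordNetLemmatizer

def _lem_(col, df=reddit_df):
    #lemmatize every word in df[col] and store each row back as one space-joined string
    #needs the NLTK 'wordnet' corpus (and 'punkt' when rows are raw strings)
    lemmatizer = WordNetLemmatizer()

    def lemmatize_row(row):
        #rows are token lists after gettoken; raw strings are tokenized first
        words = row if isinstance(row, list) else word_tokenize(row)
        return ' '.join(lemmatizer.lemmatize(word) for word in words)

    df[col] = df[col].map(lemmatize_row)
    return 'Words Lemmatized'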