# Personal Finance Subreddit Capstone Project

## Data Wrangling

The goal of this notebook is to extract data about the personal finance subreddit from the pushshift.io Reddit API, select the features that are relevant to the project, and clean the dataset so that it is ready for data exploration.

- **Import the necessary packages**

In [1]:
import praw
import pandas as pd
import numpy as np
import datetime
import json
import requests
import string
import time
import sqlite3
import matplotlib.pyplot as plt
%matplotlib inline

### Data Acquisition

- **Extract the data from reddit using pushshift.io's API**

In [None]:
# Variables
sub    = 'personalfinance'     # name of the subreddit you would like to scrape
after  = '2018-08-01'    # earliest date that will be scraped
before = '2018-09-19'    # latest date that will be scraped
fast   = True           # True will be faster, won't pull upvote ratio

In [None]:
# Initiate sqlite
sql = sqlite3.connect('personalfinance_.db')
cur = sql.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS posts (name TEXT, title TEXT, readable_utc TEXT, permalink TEXT, domain TEXT, url TEXT, author TEXT, score TEXT, upvote_ratio TEXT, num_comments TEXT)')
sql.commit()
print('Loaded SQL Database and Tables')

# Convert the specified dates to Unix timestamps (local time); `before` is extended to the last second of its day
after = time.mktime(datetime.datetime.strptime(after, '%Y-%m-%d').timetuple())
after = int(after)
readable_after = time.strftime('%d %b %Y %I:%M %p', time.localtime(after))
before = time.mktime(datetime.datetime.strptime(before, '%Y-%m-%d').timetuple())
before = int(before) + 86399
readable_before = time.strftime('%d %b %Y %I:%M %p', time.localtime(before))
print('Searching for posts between ' + readable_after + ' and ' + readable_before + '.')
currentDate = before
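
The date handling above can be checked in isolation. The following is a minimal sketch, not the notebook's code: `day_bounds` is a hypothetical helper, and it treats the dates as UTC, whereas the cell above uses `time.mktime`, which follows the machine's local timezone. The numbers will therefore only match the notebook's on a machine set to UTC.

```python
import calendar
import datetime

def day_bounds(day):
    """Return (start, end) Unix timestamps covering one calendar day in UTC.

    Hypothetical helper for illustration; not part of the notebook.
    """
    parsed = datetime.datetime.strptime(day, '%Y-%m-%d')
    start = calendar.timegm(parsed.timetuple())  # 00:00:00 UTC
    end = start + 86399                          # 23:59:59 on the same day
    return start, end

print(day_bounds('2018-08-01'))
```

Adding 86399 seconds (one day minus one second) is what makes the `before` date inclusive.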

Using pushshift will allow us to retrieve valuable information from reddit submissions including:
- Submission ID
- Submission title
- Submission date
- Submission permalink
- Submission domain
- Submission url
- Submission author
- Submission score (upvotes)
- Submission upvote_ratio (fraction of all votes that are upvotes)
- Submission number of comments

However, it does not provide us with the flair information.

NOTE: This process is very slow, because the code below sends a separate HTTP request to Reddit for every submission.
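
The fields listed above sit inside a nested "listing" structure in the JSON that Reddit returns. The sketch below shows how that structure is navigated, using a hand-written stand-in payload (the values are invented, and `extract_fields` is a hypothetical helper, not something the notebook defines).

```python
# Hand-written stand-in for the JSON returned for one submission;
# the values are invented for illustration.
sample = {
    'data': {
        'children': [
            {'data': {
                'id': 'abc123',
                'title': 'Example title',
                'created_utc': 1537000000.0,
                'score': 12,
                'num_comments': 3,
            }}
        ]
    }
}

def extract_fields(listing, keys):
    """Pull selected keys from the first submission in a listing (hypothetical helper)."""
    post = listing['data']['children'][0]['data']
    # .get() returns None for a missing key instead of raising KeyError
    return {key: post.get(key) for key in keys}

fields = extract_fields(sample, ['id', 'title', 'score', 'upvote_ratio'])
print(fields)
```

Using `.get()` is one way to tolerate a field that is absent from a given response, such as `upvote_ratio` in this sample.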

In [None]:
# Perform a new full pull from Pushshift
def newpull(thisBefore):
    global currentDate
    readable_thisBefore = time.strftime('%d %b %Y %I:%M %p', time.localtime(thisBefore))
    print('Searching posts before ' + str(readable_thisBefore))
    url = 'http://api.pushshift.io/reddit/search/submission/?subreddit=' + sub + '&size=500&before=' + str(thisBefore)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print('    Discussion: HTTP error - ', response.status_code)
        time.sleep(60)
        return
    curJSON = response.json()

    # Update each Pushshift result with Reddit data
    for child in curJSON['data']:

        # Check to see if already added
        name = str(child['id'])
        cur.execute('SELECT * FROM posts WHERE name == ?', [name])
        if cur.fetchone():
            print(str(child['id']) + ' skipped (already in database)')
            continue

        # If not, get more data
        if fast is True:
            searchURL = 'http://reddit.com/by_id/t3_'
        else:
            searchURL = 'http://reddit.com/'
        url = searchURL + str(name) + '.json'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print('    Discussion: HTTP error - ', response.status_code)
            time.sleep(60)
            break
        postJSON = response.json()
        if fast is True:
            jsonStart = postJSON
        else:
            jsonStart = postJSON[0]

        # Check to see if Date has passed (currentDate is already declared global at the top of the function)
        post = jsonStart['data']['children'][0]['data']
        created_utc = post['created_utc']
        currentDate = int(created_utc)
        if currentDate <= after:
            break

        # If not, process remaining data
        try:
            title = str(post['title'])
        except UnicodeEncodeError:
            # Fall back to printable characters only (drops emojis and other non-printable characters)
            title = ''.join(c for c in str(post['title']) if c in string.printable)
        readable_utc = time.strftime('%d %b %Y %I:%M %p', time.localtime(created_utc))
        permalink    = str(post['permalink'])
        domain       = str(post['domain'])
        url          = str(post['url'])
        author       = str(post['author'])
        score        = str(post['score'])
        num_comments = str(post['num_comments'])
        if fast is True:
            upvote_ratio = 0
        else:
            upvote_ratio = str(post['upvote_ratio'])

        # Write it to SQL Database
        cur.execute('INSERT INTO posts VALUES(?,?,?,?,?,?,?,?,?,?)', [name, title, readable_utc, permalink, domain, url, author, score, upvote_ratio, num_comments])
        sql.commit()

# Run the newpull
while currentDate >= after:
    newpull(currentDate)
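
The loop above walks backward in time: each call asks for posts older than a cursor, and the cursor then moves to the oldest post seen. The sketch below isolates that idea with a fake data source instead of HTTP requests. `paginate_backwards` and `fake_fetch` are hypothetical names, and the extra check that the cursor actually moved is an addition of this sketch, meant to avoid repeating the same request forever.

```python
def paginate_backwards(fetch_page, before, after):
    """Collect (id, created_utc) pairs newer than `after`, walking backward from `before`.

    `fetch_page(before)` is a hypothetical callable that returns posts older
    than `before`, newest first; it stands in for the HTTP request.
    """
    collected = []
    cursor = before
    while cursor > after:
        page = fetch_page(cursor)
        if not page:
            break  # nothing older is available
        for post_id, created in page:
            if created <= after:
                return collected  # reached the start of the window
            collected.append((post_id, created))
        oldest = page[-1][1]
        if oldest >= cursor:
            break  # the cursor did not move; stop instead of looping forever
        cursor = oldest
    return collected

# Fake data source: ten posts with timestamps 100, 90, ..., 10, served two at a time
POSTS = [('p%d' % t, t) for t in range(100, 0, -10)]

def fake_fetch(before, size=2):
    older = [p for p in POSTS if p[1] < before]
    return older[:size]

result = paginate_backwards(fake_fetch, before=95, after=50)
print(result)
```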

Now that we have pulled the data from pushshift, we will need to create a dataframe which will store the relevant information (title, date, time, upvotes, id).

In [None]:
# Reconnect to sqlite
connection = sqlite3.connect("personalfinance_.db") 
  
# Cursor object 
crsr = connection.cursor() 
  
# Execute the command to fetch all the data from the table posts 
crsr.execute("SELECT * FROM posts")  
  
# Store all the fetched data in the ans variable 
ans= crsr.fetchall()  

# Build the dataframe directly from the fetched rows
# (posts columns: name, title, readable_utc, permalink, domain, url, author, score, ...)
df = pd.DataFrame({
    'title':   [row[1] for row in ans],
    'date':    [row[2][:-8] for row in ans],   # e.g. '19 Sep 2018 ' (keeps the trailing space)
    'time':    [row[2][-8:] for row in ans],   # e.g. '11:33 AM'
    'upvotes': [row[7] for row in ans],
    'id':      [row[0] for row in ans],
})

- **Retrieve flair information from Reddit's API**

As mentioned before, we still need to extract the flair (which indicates the topic of each submission) for each post. To do this, we will create a Reddit instance with PRAW, which gives access to Reddit's API.

In [None]:
# Create new Reddit instance (fill in your own API credentials)
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     user_agent='',
                     username='',
                     password='')

Loop over the submissions, look each one up by its ID, and store the flair it returns.

In [None]:
for a, b in enumerate(df.id):
    # Reading link_flair_text makes PRAW fetch the submission, so this is one request per row
    df.loc[a, 'topic'] = reddit.submission(id=str(b)).link_flair_text
    try:
        print(a, ',', df.loc[a, 'topic'])
    except Exception:
        print('None')
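
Because each lookup is a network call that can fail, it can help to keep one failed lookup from stopping the whole loop. The sketch below shows one way to do that, with a stand-in function in place of the Reddit API. `collect_flairs` and `fake_lookup` are hypothetical names and do not come from PRAW.

```python
def collect_flairs(ids, lookup):
    """Map each submission ID to its flair, or None if the lookup fails.

    `lookup` is a hypothetical callable standing in for the Reddit API call.
    """
    flairs = {}
    for submission_id in ids:
        try:
            flairs[submission_id] = lookup(submission_id)
        except Exception:
            flairs[submission_id] = None
    return flairs

# Stand-in for the API: one post has a flair, one has none, one lookup fails
FAKE_FLAIRS = {'aaa111': 'Debt', 'bbb222': None}

def fake_lookup(submission_id):
    return FAKE_FLAIRS[submission_id]  # raises KeyError for unknown IDs

flairs = collect_flairs(['aaa111', 'bbb222', 'ccc333'], fake_lookup)
print(flairs)
```

Note that a missing flair and a failed lookup both end up as `None` here; recording them separately would need an extra marker.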

- **Convert Dates into datetime format**

Since the date is stored as a string, we will need to convert it into datetime format.

In [None]:
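# Parse the date strings (which carry a trailing space) into datetime.date objects.
# Using pandas here avoids `from datetime import datetime`, which would shadow the
# `datetime` module imported at the top of the notebook.
df['date'] = pd.to_datetime(df['date'].str.strip(), format='%d %b %Y').dt.date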

In [None]:
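# Sort newest first. 'time' is still a 12-hour string such as '11:33 AM',
# so rows within the same day are ordered as text, not chronologically.
df = df.sort_values(by=['date', 'time'], ascending=False).reset_index(drop=True)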
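
Sorting on a 12-hour time string compares text, so '11:33 AM' ranks above '01:15 PM'. The sketch below contrasts that with sorting on parsed datetimes. The timestamps are made-up examples in the same `'%d %b %Y %I:%M %p'` format the scraper writes, and `parse_readable` is a hypothetical helper.

```python
import datetime

def parse_readable(stamp):
    """Parse a 'DD Mon YYYY HH:MM AM/PM' string into a datetime."""
    return datetime.datetime.strptime(stamp, '%d %b %Y %I:%M %p')

# Three made-up timestamps on the same day
stamps = ['14 Sep 2018 01:15 PM', '14 Sep 2018 11:33 AM', '14 Sep 2018 09:05 AM']

# Text order on the time portion: '11:33 AM' sorts above '01:15 PM'
by_string = sorted(stamps, key=lambda s: s[-8:], reverse=True)
# Parsing first orders the rows chronologically
by_datetime = sorted(stamps, key=parse_readable, reverse=True)

print(by_string)
print(by_datetime)
```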

- **Examine basic information**

Let's begin by taking a peek at the dataframe's contents.

In [3]:
print('Dimensions of the dataframe: {}'.format(df.shape))
print(20*'-')
df.info()

Dimensions of the dataframe: (10183, 6)
--------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10183 entries, 0 to 10182
Data columns (total 6 columns):
title      10183 non-null object
date       10183 non-null object
time       10183 non-null object
upvotes    10183 non-null int64
id         10183 non-null object
topic      9532 non-null object
dtypes: int64(1), object(5)
memory usage: 556.9+ KB


The only column with missing data is 'topic': some posts have been removed, so their flairs no longer show up during extraction. To deal with this, we can simply fill in 'unknown'.

In [4]:
df['topic'] = df['topic'].fillna('unknown')

In [5]:
df.topic.value_counts()

Debt                 1256
Other                1205
Credit               1048
Investing             859
Retirement            824
Employment            724
Housing               709
unknown               651
Auto                  560
Planning              526
Saving                524
Taxes                 509
Budgeting             405
Insurance             380
Meta                    2
THIS IS A SPAMMER       1
Name: topic, dtype: int64

Fold the two rare topics ('Meta' and 'THIS IS A SPAMMER') into the topic 'unknown'.

In [6]:
# Select the rows whose topic is one of the two rare labels and overwrite them
rare_topics = ['Meta', 'THIS IS A SPAMMER']
df.loc[df['topic'].isin(rare_topics), 'topic'] = 'unknown'
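
The notebook names the two rare topics explicitly. A count-based rule is a possible alternative when the rare labels are not known in advance. The sketch below uses only the standard library; `collapse_rare` is a hypothetical helper, and the threshold of 3 is an arbitrary choice for this toy list, not a value taken from the project.

```python
from collections import Counter

def collapse_rare(labels, min_count, fallback='unknown'):
    """Replace labels seen fewer than `min_count` times with `fallback` (hypothetical helper)."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else fallback for label in labels]

# Toy list of topics, not the real data
topics = ['Debt', 'Debt', 'Debt', 'Meta', 'Credit', 'Credit', 'Credit', 'THIS IS A SPAMMER']
cleaned = collapse_rare(topics, min_count=3)
print(Counter(cleaned))
```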

In [91]:
#df = pd.read_csv(r'C:\Users\joshu\Downloads\Data\reddit\reddit_pf1.csv', engine='python', index_col=[0])

### Data Pre-processing

Pre-processing is an important part of machine learning and matters even more for Natural Language Processing tasks, because small inconsistencies in the raw text (capitalization, punctuation, stray characters) carry through to every later step.

There are 3 major components in data pre-processing:

**1) Data-cleaning**: We will need to first clean up the text through various steps:

- **Lowercase** the words so that the model will not treat capitalized words as different from their lowercase forms.

- **Remove numbers/digits** since the model is interpreting *text*, not numbers.

- **Remove punctuation** since it carries little information about a title's topic.

- **Strip white space** so that extra spaces do not produce empty tokens.

- **Remove stopwords**, which are general words that are very frequent in English (ex. because, such, so). Here is a list of some common stopwords: https://www.ranks.nl/stopwords

- **Remove noise** that is not picked up through the other cleaning methods (ex. dropping words that are less than 2 characters long). This step can come before tokenization and normalization, after them, or both.

**2) Tokenization**: In order to better analyze individual words, we will need to *tokenize* the documents (in this case, the submission titles) into individual words. By doing so, we will be able to use the various NLP libraries to further dissect the tokens.

**3) Normalization**: After tokenizing the data, we will need to normalize the text through lemmatization or stemming. Lemmatization typically gives better results since it maps each word to its dictionary form (its lemma). However, it takes much more time than stemming, which simply strips the affixes from a given word.
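
To make the difference between the two normalization methods concrete, here is a deliberately crude, standard-library-only sketch. `toy_stem` strips a few suffixes and `toy_lemmatize` looks words up in a tiny hand-written table. Both are hypothetical toys for illustration; they are not the algorithms that NLTK or spaCy implement.

```python
SUFFIXES = ('ing', 'ed', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Tiny hand-written lookup table standing in for a real lemma dictionary
TOY_LEMMAS = {'walked': 'walk', 'walking': 'walk', 'walks': 'walk',
              'studies': 'study', 'better': 'good'}

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ['walked', 'walking', 'studies', 'better']:
    print(word, '->', toy_stem(word), '/', toy_lemmatize(word))
```

The stemmer is fast but can return non-words ('studie'), while the lookup returns real words but needs a dictionary, which is the trade-off described above.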

In [31]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [32]:
import string
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

def preprocess(text):
    # Replace the forward slash with space
    text = text.replace('/', ' ')
    # Remove all other punctuation without replacement
    text = ''.join([char for char in text if char not in string.punctuation])
    # Remove every digit, including digits inside mixed tokens such as '401k'
    text = ''.join([char for char in text if char not in string.digits])
    # Strip whitespaces
    text = ' '.join(text.split())
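    # Remove stopwords (the match is case-sensitive and runs before lowercasing,
    # so capitalized stopwords are kept; this would explain 'with' in row 5 of the output below)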
    text = ' '.join([word for word in text.split() if word not in stopWords])
    # Lowercase all words
    text = text.lower()
    
    return text
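
In the function above, stopwords are removed before the text is lowercased, so a capitalized stopword such as 'With' does not match NLTK's lowercase stopword list and survives. The sketch below is one possible variant that lowercases first. It uses a tiny hand-picked stopword set (an assumption, so that it runs without NLTK data), and `preprocess_lower_first` is a hypothetical name.

```python
import string

# Tiny hand-picked stopword set so the sketch runs without NLTK data (an assumption)
TOY_STOPWORDS = {'with', 'the', 'a', 'to', 'my', 'is'}

def preprocess_lower_first(text, stop_words=TOY_STOPWORDS):
    """Variant of the cleaning steps that lowercases before removing stopwords."""
    text = text.lower()
    # Replace the forward slash with a space, then drop other punctuation and digits
    text = text.replace('/', ' ')
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = ''.join(ch for ch in text if ch not in string.digits)
    # split() also collapses repeated whitespace
    return ' '.join(word for word in text.split() if word not in stop_words)

print(preprocess_lower_first('With resources like Reddit, 50/20/30!'))
```

Applying this variant to the real data would change the cleaned titles shown in the outputs of this notebook.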

In [33]:
df['clean_title'] = df['title'].apply(lambda x: preprocess(x))
df['clean_title'].head(10)

0                           ways make extra side money
1    year update legally blind going homeless one i...
2                                               kicked
3                               online savings account
4                      tools managing incomes expenses
5    with resources like reddit financial consultin...
6    credit hit late payment fee waiver score refle...
7                    need help budgetting getting debt
8                                                 yo m
9        debt collector gave hours pay yelled said ssn
Name: clean_title, dtype: object

In [34]:
# Check to see all clean titles are not empty
for num, x in enumerate(df.clean_title):
    if not x:
        print('Row number {} has an empty entry'.format(num))

Row number 2167 has an empty entry
Row number 9597 has an empty entry


In [35]:
# See what the issue is and correct it
df.loc[[2167, 9597]]

| | title | date | time | upvotes | id | topic | clean_title |
|---|---|---|---|---|---|---|---|
| 2167 | ????????? 1 ??? ? ?????? ??? ???????? ? 2018 | 2018-09-14 | 11:33 AM | 1 | 9fszwz | | |
| 9597 | 50/20/30 | 2018-08-27 | 11:27 PM | 1 | 9aviyl | Budgeting | |


In [36]:
# Both titles are empty after cleaning, so drop the rows and reset the index
df = df.drop([2167, 9597]).reset_index(drop=True)

Finally, let's save the data as a .csv file for later use.

In [95]:
# Save as .csv file
df.to_csv(r'C:\Users\joshu\Downloads\Data\reddit\reddit_pf1.csv')

In [60]:
#df = pd.read_csv(r'C:\Users\joshu\Downloads\Data\reddit\reddit_pf1.csv', engine='python', index_col=[0])

- **Tokenization**

Tokenization is essentially the process of segmenting a text into pieces, such as words, phrases, symbols, etc. 

Let's create a list of the tokens for each submission title.

In [44]:
import spacy
nlp = spacy.load('en_core_web_sm')

def create_tokens(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    return tokens

df['tokenized_title'] = df['clean_title'].apply(lambda x: create_tokens(x))

- **Lemmatization**

A lemma is the "base form" of a word.

Ex. walk, walked, walking, and walks all share the lemma 'walk'.

Using the tokens that we generated in the column 'tokenized_title', let's next lemmatize the tokens.

In [48]:
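def lemmatize(text):
    # Drop the pronoun placeholder lemma '-PRON-' (used by spaCy 2.x) before returning the lemmas.
    # Compare with != rather than `not in`, which would be a substring test on the string.
    lemmas = [token.lemma_ for token in text if token.lemma_ != '-PRON-']
    return lemmas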

df['lemmatized_title'] = df['tokenized_title'].apply(lambda x: lemmatize(x))
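
When filtering against a sentinel string such as '-PRON-', it is worth remembering that `in` applied to a string tests for substrings, while `!=` compares whole values. The short sketch below shows the difference on a made-up list of lemmas.

```python
SENTINEL = '-PRON-'
# Made-up lemmas, including short strings that happen to be substrings of the sentinel
lemmas = ['-', 'pay', 'N', '-PRON-', 'debt']

# `x not in SENTINEL` is a substring test, so '-' and 'N' are dropped too
substring_filter = [x for x in lemmas if x not in SENTINEL]
# Equality removes only the sentinel itself
equality_filter = [x for x in lemmas if x != SENTINEL]

print(substring_filter)
print(equality_filter)
```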

- **Named Entity Recognition**

Named entities are real-world objects that have a name, such as a person, country, or company. spaCy is able to recognize different types of named entities in a document and can return features such as the label (ex. ORG - organization, GPE - geopolitical entity).

In [51]:
def create_NER(text, label = False):
    doc = nlp(text)
    if label is False:
        NER_list = [(ent.text) for ent in doc.ents]
    else:
        NER_list = [(ent.label_) for ent in doc.ents]
    return NER_list    

df['named_entities'] = df['clean_title'].apply(lambda x: create_NER(x))
df['entity_labels'] = df['clean_title'].apply(lambda x: create_NER(x, label = True))