In [22]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

 Joshua Kim 
 
 SpringBoard - Data Science Career Track

# Capstone Project 2: Categorizing /r/PersonalFinance Submissions

***

## Introduction

>### The Problem

One of the most common problems that everyday Americans face is their financial situation. There are many different reasons for this, such as debt, student loans, and the passing of a parent. According to a study of 1000 Americans over the age of 30 (conducted by GuideVine), fewer than half of Americans understood common financial concepts such as interest, bankruptcy, and inflation. More importantly, more than half felt "lost" when it came to having a secure long-term plan for their financial future. This highlights a recurring theme in our society: many people have trouble managing their finances, and naturally some of them seek help online. One of these resources is a subreddit called 'Personal Finance', which is hosted on the popular discussion-board website **Reddit**.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Subreddit%20Screenshot.png?raw=true)

**/r/personalfinance** has become a place where many people are able to ask for help regarding their financial situations. A few common topics are savings, budgeting, taxes and debt. There are hundreds of different posts every day from people of different ages, genders, locations and occupations. Frequent users of the subreddit often provide advice through comments and are able to upvote posts to help them gain more visibility on the front page of /r/personalfinance. However, the main feature we will be focusing on in this project is the "flair" feature, which enables users to tag their submissions with a topic. For example, if I wrote a post asking for advice on paying off my student loans, I would use the 'Debt' flair. This helps organize the hundreds of daily posts into different categories so that people will be able to search them more easily.

There are 3 main problems with this:

<div class="alert alert-block alert-danger">
1) <b>Not all users choose a flair for their post.</b> 
</div>

In this scenario, the system will automatically tag the post with 'Other', which is used when the user either doesn't know what topic the post should fall under or does not assign a topic at all. This becomes a problem because many posts end up without their proper flair, which ruins the organized structure of searching by topic. The 'Other' flair was originally designed to contain posts that did not fall under the other topics, but due to the popularity of /r/PersonalFinance (over 13 million subscribers), there are often many new users who do not know how to either assign a topic (an issue with the interface) or choose a topic (unfamiliarity with financial categories). Because of this, the 'Other' flair is usually the most popular and common topic on the subreddit despite the fact that most of these posts actually do fall under another topic. 

It would be like someone throwing their plastic trash into a regular garbage bin instead of the recycle bin because they were unaware of what the recycle bin is for, they did not see it, or they did not care to throw it into the proper bin. If this were simply a rarity, it could be overlooked, but if it becomes a common occurrence, it creates a larger problem that needs to be remedied quickly. The same logic applies to properly flairing a post rather than leaving it as 'Other'. 

![s](http://ahs.mcnairycountyschools.com/uploads/5/2/7/9/52798193/9759366_orig.gif)

<div class="alert alert-block alert-danger">
2) <b>There are posts which fit under multiple topics.</b>
</div>

Each post can only have a single flair, which forces users to decide which topic best represents their question or idea. However, this can have several repercussions. First, the way a user decides on the topic is very subjective, so the chosen topic may not be the best fit. Second, if they don't know which topic to pick for their post, they may decide to simply classify it under 'Other', which goes back to the first problem. 

<div class="alert alert-block alert-danger">
3) <b>The current set of available topics may not contain the best variety or be most efficient.</b> 
</div>

This is a more interesting problem and could be an analytical problem for another project, because it is difficult to ascertain how efficient the current batch of flairs is. One question is whether we should add more topics in order to reduce the number of 'Other' posts and to make it easier for users to identify the proper flair for their submissions. For example, should we include a new 'Student' topic for questions pertaining to student loans, textbook savings, getting a campus job, etc.? Another question is whether or not we can *remove* any of the current topics and replace them with something else. This is not to say that the current topics are bad; in fact, an eye test would be enough for most people to be satisfied with them. But how do we know whether or not they are the most efficient set of topics? 

To sum up these 3 main problems into a single question: 

<div class="alert alert-block alert-info">
<b>How can we create an efficient set of topics and automatically assign a single one for each post?</b>
</div>

---------------------------------

>### The Project

<div class="alert alert-block alert-success">
<b>The main goal of this project is to create a topic-modeling algorithm that will generate a new set of topics for /r/personalfinance posts.</b>
</div>

The project consists of 4 stages.

**1) Data Acquisition**: This is often the most important step in any data science project. We will first need to acquire the relevant data from /r/personalfinance using pushshift.io's API as well as Reddit's official API. 

**2) Data Wrangling**: After we extract the data, we will need to clean it and prepare it for analysis. Natural Language Processing projects require pre-processing the text so that the machine is readily able to interpret it after vectorization. Stray symbols or punctuation will make the model perform worse, so this step is crucial before creating the model.

**3) Exploratory Data Analysis**: We will begin to explore the data and observe any interesting trends that can provide a direction or foundation for how we proceed with the project. Natural Language Processing projects are typically limited in how much EDA can be performed because of the lack of features. After all, we are mainly dealing with text!

**4) Machine Learning**: Let us split the problem statement into 2 parts: 

<div class="alert alert-block alert-danger">
<b>First, how can we create topics based on the posts' text and assign one to each post?</b>
</div>

- The main type of algorithm we seek to use is a topic model, which discovers topics based on hidden semantic structures in the various documents of a corpus (which is a large unstructured set of text). More specifically, this will require the implementation of Latent Dirichlet Allocation, a generative statistical model that works well in topic modeling by mapping each document to a number of different topics and each topic to a number of different words. To evaluate how well a generated set of topics does in clustering the documents, we will be using 2 metrics: Perplexity and Coherence (this will be discussed in greater detail in the machine learning section).
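
For reference, the textbook formulation of LDA can be summarized by the generative process below. The symbols ($K$ topics, $D$ documents, Dirichlet priors $\alpha$ and $\beta$, per-document topic mixture $\theta_d$, per-topic word distribution $\phi_k$) are the usual notation from the literature and are assumptions here, not names taken from this project's code.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d), & w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$ and $w_{d,n}$ is the word itself. Under the same notation, perplexity on a held-out set of documents is commonly defined as

$$
\mathrm{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d}\right),
$$

where $N_d$ is the number of words in document $d$. Lower perplexity means the model is less "surprised" by unseen text.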

<div class="alert alert-block alert-danger">
<b>Second, how do we evaluate whether or not these topics are efficient?</b>
</div>

- To answer this question, we must think about how we can define an "efficient" topic. Remember that we want this model not only to do well in creating topics for the fitted data, but also to do a good job of assigning topics to *new* posts. The problem we face is that there are many instances where a user does not select a topic for their own post. This model should help resolve the issue by automatically assigning a topic to these new posts. 


- The best way to evaluate the efficiency of the new topics that we generate from our topic model is by measuring the accuracy of predicting the labels. In other words, we will be creating a new classification problem where the feature variables are the text for each post and the response variable is the generated topic for each post. We then use different machine learning algorithms (such as Naive Bayes and Stochastic Gradient Descent) to create a good model that can predict the topics for a given row. Finally, we use the Accuracy, Precision and Recall metrics to evaluate how well the model performed.
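
As a minimal sketch of the three metrics named above, the following computes overall accuracy plus per-topic precision and recall from two lists of labels. The function name `evaluate_predictions` and the toy labels are illustrative assumptions, not taken from the project's code; scikit-learn's `sklearn.metrics` module provides equivalent functions.

```python
from collections import Counter


def evaluate_predictions(y_true, y_pred):
    """Return (accuracy, {label: {"precision": ..., "recall": ...}})."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    true_positives = Counter(t for t, p in pairs if t == p)
    predicted_counts = Counter(y_pred)  # how often each label was predicted
    actual_counts = Counter(y_true)     # how often each label really occurs

    per_label = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = true_positives[label]
        # Precision: of the posts predicted as `label`, how many were right?
        precision = tp / predicted_counts[label] if predicted_counts[label] else 0.0
        # Recall: of the posts that really are `label`, how many were found?
        recall = tp / actual_counts[label] if actual_counts[label] else 0.0
        per_label[label] = {"precision": precision, "recall": recall}
    return accuracy, per_label


# Toy labels, made up for illustration only.
y_true = ["Debt", "Debt", "Saving", "Taxes", "Saving"]
y_pred = ["Debt", "Saving", "Saving", "Taxes", "Saving"]
accuracy, per_label = evaluate_predictions(y_true, y_pred)
print(accuracy)
print(per_label)
```

Reporting precision and recall per topic, rather than accuracy alone, shows whether a model is simply over-predicting the largest topics.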

***
<div class="alert alert-block alert-warning">
Note: We did not use inferential statistics to do hypothesis testing or to observe relationships between the feature and response variables. This is because our feature variables are the vectorized form of words and there are thousands of different words in our usable dictionary, which would make any statistical insight practically meaningless.
</div>

***

>### The Clients

1) **/r/PersonalFinance Users** will be interested in having a new system that could automatically determine a proper topic for their submission so that they don't have to choose one themselves. This helps whether they are too lazy to pick a topic or don't know which topic to select. It will also help other people who search posts by topic. For example, if someone was interested in reading advice for people struggling with low income, they could simply navigate through the 'Planning' topic. 
 
 
2) **/r/PersonalFinance Moderators** will benefit from having a clearer structure for all submissions. Many of the popular subreddits insist on their members tagging their posts with flairs because of the convenience and organization it provides to the subreddit as a whole (sometimes this is even a requirement). One of the problems that moderators face with their subreddits is the influx of recurring questions. Many subreddits ask users to first search for similar posts before submitting their own because it helps to reduce the number of duplicate questions. A better topic system will help the search feature because there will be fewer 'Other' posts and more posts under the specific topics. 
 
 ***

## 1. Data Acquisition

>**Obtaining the data**

For this report, I created my own dataset through 2 steps:

**First**, we extract the submission information using the __[pushshift.io Reddit API](https://github.com/pushshift/api)__. It gives us the ability to search through Reddit data and create data aggregations. Using sqlite was necessary as there are many different rows of data to collect.

We first want to specify the date ranges from which to extract the data, since the date range is attached to the pushshift url. For example, if we wanted to extract the newest post submissions as of August 25th 2018 for /r/personalfinance, we would first convert August 25th 2018 11:59:59 PM into datetime format and then into a Unix timestamp using the .timetuple() method. We then add the subreddit and timestamp as strings in the url. 

This would be the resulting url: http://api.pushshift.io/reddit/search/submission/?subreddit=personalfinance&size=500&before=1535255999

Take a quick look!

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/Pushshift%20IO%20Example.png?raw=true)

Notice that this will provide the 500 most recent posts as of August 25th 11:59PM because we specified size = 500. We then use the requests module and its built-in JSON decoder to extract the relevant data from each pull. To get the next 500 most recent posts, we simply do another pull after replacing the timestamp with the submission time of the 500th (oldest) post from the current pull. 
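
The steps above can be sketched as follows. This is a minimal illustration, not the code from the Data Acquisition notebook: the helper names (`to_timestamp`, `build_url`, `collect_submissions`) are hypothetical, an explicit UTC offset is used instead of the local-time `.timetuple()` conversion so the result does not depend on the machine's timezone, and the HTTP call is replaced by a stand-in function so the example runs offline. It also assumes each returned post carries a `created_utc` field.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "http://api.pushshift.io/reddit/search/submission/"


def to_timestamp(moment, utc_offset_hours=0):
    """Read a naive datetime at the given UTC offset and return a Unix timestamp."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(moment.replace(tzinfo=tz).timestamp())


def build_url(subreddit, before, size=500):
    """Attach the subreddit, page size and cutoff timestamp to the base url."""
    query = urlencode({"subreddit": subreddit, "size": size, "before": before})
    return f"{BASE_URL}?{query}"


def collect_submissions(fetch_page, subreddit, before, max_pages):
    """Page backwards in time. `fetch_page(url)` must return a list of posts."""
    submissions = []
    for _ in range(max_pages):
        page = fetch_page(build_url(subreddit, before))
        if not page:
            break
        submissions.extend(page)
        # The next pull starts from the oldest post seen so far.
        before = min(post["created_utc"] for post in page)
    return submissions


# An offset of UTC-4 reproduces the timestamp in the example url above.
cutoff = to_timestamp(datetime(2018, 8, 25, 23, 59, 59), utc_offset_hours=-4)
url = build_url("personalfinance", cutoff)
print(url)

# Stand-in for a real HTTP request (made-up posts, page size of 2).
FAKE_POSTS = [{"id": n, "created_utc": 100 - n} for n in range(5)]


def fake_fetch(page_url):
    before = int(page_url.rsplit("before=", 1)[1])
    older = [post for post in FAKE_POSTS if post["created_utc"] < before]
    return older[:2]


collected = collect_submissions(fake_fetch, "personalfinance", before=101, max_pages=10)
print([post["id"] for post in collected])
```

In a real run, `fetch_page` would wrap the requests call and return the list of posts from the decoded JSON. Posts that share the exact timestamp of the oldest post in a page could be skipped by this simple scheme.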

**Second**, I used the __[official Reddit API](https://praw.readthedocs.io/en/latest/)__ to extract information that is not provided through the pushshift.io Reddit API. In this case, we extracted the flair (which will act as the "topic" of each submission) and the self-text (the body of text for each submission). We cannot simply use the official Reddit API to extract all the submission data because it limits the number of submissions you are allowed to extract (1000) and does not let you specify a date range. For example, after extracting the 1000 most recent posts on a subreddit, we cannot extract the next 1000 most recent posts.

The resulting dataset contains approximately 10,000 rows of data. Each row represents a submission in the Personal Finance subreddit. The features include:

- title

- date (the date at which the post was submitted)

- time (the time at which the post was submitted)

- upvotes (number of upvotes for each submission)

- id (submission ID)

- topic (flair)

- self-text (the text information within each post)

To see the code and more annotation, check out the Data Acquisition notebook.

<div class="alert alert-block alert-warning">
Note: This process of scraping data is very computationally expensive. Consider using PaperSpace or Google Cloud to speed up the process.
</div>

***

## 2. Data Wrangling

>**Inspecting the Dataset**

Let's begin by observing the dataset.

In [4]:
import pandas as pd

df = pd.read_pickle(r'C:\Users\Joshua\Pickle_files\pf_df')
first_df = df[['title','date','time','upvotes','id','topic','self_text']]

first_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10177 entries, 0 to 10176
Data columns (total 7 columns):
title        10177 non-null object
date         10177 non-null object
time         10177 non-null object
upvotes      10177 non-null int64
id           10177 non-null object
topic        10177 non-null object
self_text    6894 non-null object
dtypes: int64(1), object(6)
memory usage: 636.1+ KB


In [19]:
first_df.fillna('').head()

| | title | date | time | upvotes | id | topic | self_text |
|---|---|---|---|---|---|---|---|
| 0 | Ways to make extra side money? | 2018-09-19 | 12:57 PM | 1 | 9h6whn | unknown | |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | 2018-09-19 | 12:56 AM | 16 | 9h29g7 | Other | |
| 2 | 19, being kicked out | 2018-09-19 | 12:55 PM | 2 | 9h6vyv | Other | So i just found out last night the home ive be... |
| 3 | Online Savings Account? | 2018-09-19 | 12:54 PM | 1 | 9h6vs4 | Saving | Hello! Looking for recommendations for an onli... |
| 4 | Tools for Managing Incomes and Expenses | 2018-09-19 | 12:52 PM | 0 | 9h6v48 | Other | |


>**Creating the Text column**

Rather than having the title and the self-text in separate columns, it would be more convenient to have both in a single column. We do so by concatenating the 2 columns into a new column called 'text'.
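
A minimal sketch of the concatenation, assuming that a missing self-text (`None` or `NaN`) should simply be skipped. The function name `build_text` is hypothetical, and the notebook's own implementation is not shown here.

```python
def build_text(title, self_text):
    """Join title and self-text with a space, ignoring missing or empty values."""
    parts = [
        part.strip()
        for part in (title, self_text)
        if isinstance(part, str) and part.strip()  # skips None, NaN and blanks
    ]
    return " ".join(parts)


print(build_text("Online Savings Account?", "Hello! Looking for recommendations"))
print(build_text("Ways to make extra side money?", None))
```

Checking `isinstance(part, str)` also covers `NaN`, which is how pandas represents missing values in an object column.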

In [11]:
df[['title','self_text','text']].fillna('').head()

| | title | self_text | text |
|---|---|---|---|
| 0 | Ways to make extra side money? | | Ways to make extra side money? |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | | (Year UPDATE) Legally blind, going homeless, h... |
| 2 | 19, being kicked out | So i just found out last night the home ive be... | 19, being kicked out So i just found out last ... |
| 3 | Online Savings Account? | Hello! Looking for recommendations for an onli... | Online Savings Account? Hello! Looking for rec... |
| 4 | Tools for Managing Incomes and Expenses | | Tools for Managing Incomes and Expenses |


>**Convert Time and Dates into datetime format**

Since the date and time are given as strings, we will need to convert them into datetime format.

In [24]:
df[['date','time']].head()

| | date | time |
|---|---|---|
| 0 | 2018-09-19 | 12:57 PM |
| 1 | 2018-09-19 | 12:56 AM |
| 2 | 2018-09-19 | 12:55 PM |
| 3 | 2018-09-19 | 12:54 PM |
| 4 | 2018-09-19 | 12:52 PM |
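
A minimal sketch of the conversion for the string formats shown above, using only the standard library. The helper name `parse_submission_time` is hypothetical, and the format string assumes dates like `2018-09-19` and 12-hour times like `12:57 PM`.

```python
from datetime import datetime


def parse_submission_time(date_str, time_str):
    """Combine the date and time strings into a single datetime object."""
    # %I is the 12-hour clock hour and %p is the AM/PM marker.
    return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %I:%M %p")


parsed = parse_submission_time("2018-09-19", "12:57 PM")
print(parsed)
```

Once the values are datetime objects, they can be sorted and grouped by hour or weekday, which is not reliable with the raw strings.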


>**Replace missing topics**

Some posts have missing topics and self-text. This is because the post was either deleted by the user or removed by a moderator. Since we plan on creating our own topics, it is not a big deal if we are missing the topic. Let's fill missing topics by categorizing them under 'unknown'. 

In [32]:
df1 = pd.read_csv(r'C:\Users\Joshua\Downloads\Data\reddit\reddit_pf.csv', engine = 'python', index_col=0)

df1.fillna('unknown').topic.value_counts()

Debt                 1256
Other                1205
Credit               1048
Investing             859
Retirement            824
Employment            724
Housing               709
unknown               651
Auto                  560
Planning              526
Saving                524
Taxes                 509
Budgeting             405
Insurance             380
Meta                    2
THIS IS A SPAMMER       1
Name: topic, dtype: int64

>**Replace outlier topics**

As we can see from the previous output, there are two outlier topics: 2 instances of 'Meta' and 1 instance of 'THIS IS A SPAMMER'. Since these are clearly not intended to be part of the set of topics, we should replace them with 'unknown'.
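
Both replacements (missing topics and outlier topics) can be expressed as one rule: keep a topic only if it belongs to the expected set. The sketch below assumes the expected set is the 13 flairs listed in the counts above; `KNOWN_TOPICS` and `normalize_topic` are hypothetical names, not the notebook's own code.

```python
KNOWN_TOPICS = {
    "Debt", "Other", "Credit", "Investing", "Retirement", "Employment",
    "Housing", "Auto", "Planning", "Saving", "Taxes", "Budgeting", "Insurance",
}


def normalize_topic(topic):
    """Return the topic if it is an expected flair, otherwise 'unknown'."""
    if isinstance(topic, str) and topic in KNOWN_TOPICS:
        return topic
    # Covers missing values (None, NaN) and outliers such as 'Meta'.
    return "unknown"


for raw in ["Debt", None, "Meta", "THIS IS A SPAMMER"]:
    print(repr(raw), "->", normalize_topic(raw))
```

Using an allow-list rather than naming the outliers one by one means any unexpected flair that appears in a later pull is handled the same way.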

In [33]:
df.topic.value_counts()

Debt          1256
Other         1205
Credit        1048
Investing      859
Retirement     823
Employment     724
Housing        709
unknown        651
Auto           560
Planning       526
Saving         524
Taxes          508
Budgeting      404
Insurance      380
Name: topic, dtype: int64

>**Text pre-processing**

As the saying goes: __["Garbage In, Garbage Out"](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)__. Before tokenizing the text and feeding it into the machine, we will need to clean up the data. To do so, we will create a function that will perform the following steps:

- **Lowercase** the words so that the model will not differentiate capitalized words from other words.

- **Remove numbers/digits** since the model is interpreting *text* not numbers.

- **Remove punctuation** since it is not important for the context.

- **Strip white space** since empty strings could be interpreted as text and we want to avoid that.

- **Remove stopwords**, which are general words that are very frequent in the English dictionary (ex. because, such, so). Here is a __[list of some common stopwords](https://www.ranks.nl/stopwords)__.

- **Remove noise** that is not picked up through the other cleaning methods. This step can come either before or after tokenization and normalization, or both (ex. dropping words that are less than 2 characters long).
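
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the stopword set is a tiny hand-picked sample rather than a full list, punctuation is replaced by spaces, the minimum word length follows the "less than 2 characters" example, and the name `preprocess` is hypothetical. The notebook's own function may differ in these details.

```python
import re
import string

# Tiny illustrative sample; a real stopword list is much longer.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "because", "but", "by", "for",
    "i", "in", "is", "it", "of", "on", "or", "so", "such", "that", "the",
    "to", "was", "with",
}

# Map every punctuation character to a space.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, min_length=2):
    """Apply the cleaning steps in order and return the cleaned string."""
    text = text.lower()                        # lowercase
    text = re.sub(r"\d+", "", text)            # remove numbers/digits
    text = text.translate(PUNCTUATION_TABLE)   # remove punctuation
    words = text.split()                       # split() also strips extra white space
    words = [
        word for word in words
        if word not in STOPWORDS               # remove stopwords
        and len(word) >= min_length            # remove noise (very short words)
    ]
    return " ".join(words)


print(preprocess("Ways to make extra side money?"))
print(preprocess("Tools for Managing Incomes and Expenses"))
print(repr(preprocess("401k")))
```

Note that a post consisting only of digits, stopwords and very short words, such as "401k", is reduced to an empty string by these rules.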

In [8]:
df[['text','clean_text']].head()

| | text | clean_text |
|---|---|---|
| 0 | Ways to make extra side money? | ways make extra side money |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | year update legally blind going homeless one j... |
| 2 | 19, being kicked out So i just found out last ... | kicked found last night home ive staying going... |
| 3 | Online Savings Account? Hello! Looking for rec... | online savings account hello looking recommend... |
| 4 | Tools for Managing Incomes and Expenses | tools managing incomes expenses |


>**Check for missing text**

In some cases, pre-processing removes all of the text in a post, leaving an empty string. We will remove these rows from the dataset.

In [21]:
df1 = pd.read_pickle(r'C:\Users\Joshua\Pickle_files\df')

df1.loc[[2845, 3646, 4592, 4705], ['text', 'clean_text']]

| | text | clean_text |
|---|---|---|
| 2845 | 1099 or W2? | |
| 3646 | 401Ks? | |
| 4592 | 401k | |
| 4705 | T | |


>**Setting a word limit**

We also limited the number of words in the text since the longer the text, the longer the computation time. We did consider using the number of characters but the number of words seemed to do a better job. 150 was used as the word limit for each row.
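
The text does not say whether posts longer than the limit were truncated or dropped. The sketch below assumes truncation to the first 150 words; the function name `truncate_words` is hypothetical.

```python
def truncate_words(text, max_words=150):
    """Keep at most `max_words` whitespace-separated words of the text."""
    words = text.split()
    return " ".join(words[:max_words])


long_post = " ".join(f"word{n}" for n in range(200))
print(len(truncate_words(long_post).split()))
```

Counting words rather than characters avoids cutting a post off in the middle of a word.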

>**Tokenization**

In order to better analyze individual words, we will need to *tokenize* the documents (in this case, the submission text) into individual words, or tokens. By doing so, we will be able to use the various NLP libraries to further dissect the tokens.

>**Lemmatization**

After tokenizing the data, we will need to normalize the text through lemmatization or stemming. Lemmatization is typically the better method since it returns the canonical dictionary form (the lemma) of each word. However, this process takes much more time than stemming, which simply removes the affixes of a given word.
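
To make the difference concrete, here is a deliberately crude toy comparison. It is not the stemmer or lemmatizer used in this project: `naive_stem` only strips a few hard-coded suffixes, and `naive_lemma` relies on a small hand-written lookup table, whereas real tools use full rule sets and dictionaries.

```python
SUFFIXES = ("ing", "ed", "es", "s")

# Hand-written examples; a real lemmatizer uses a full vocabulary.
LEMMA_LOOKUP = {"found": "find", "expenses": "expense", "staying": "stay"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemma(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_LOOKUP.get(word, word)


for token in ["kicked", "staying", "expenses", "found"]:
    print(token, "->", naive_stem(token), "/", naive_lemma(token))
```

The stemmer is fast but can produce non-words ("expenses" becomes "expens") and misses irregular forms ("found" is left unchanged), while the lookup-based approach returns real dictionary forms at the cost of needing linguistic knowledge.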

>**Named Entity Recognition**

Named entities are real-world objects that have a name, such as a person, country, or company. spaCy is able to recognize different types of named entities in a document and can return features such as the label (ex. ORG - organization, GPE - geopolitical entity). While this is not a feature we plan to incorporate in the topic model, it will be interesting for observation in the exploratory data analysis.

## 3. Exploratory Data Analysis

The goal of this notebook is to explore the data extracted from Reddit's Personal Finance subreddit, including information such as:

- Text of submission

- Date of submission

- Topic

- Number of Upvotes

- Length of text

Using Natural Language Processing tools and data visualization, we aim to uncover both obvious and underlying trends in the submission information. However, it is worth mentioning that there won't be as many details to uncover as with other types of datasets, because the model's only input will be text. Therefore many of the following findings may not be relevant to creating the topics, but they are still interesting to see.