**Joshua Kim**
 
**SpringBoard - Data Science Career Track**

# Capstone Project 2: Topic-Modeling with /r/PersonalFinance

***

## Table of Contents

0. [Introduction](#0.-Introduction)


1. [Data Acquisition](#1.-Data-Acquisition)


2. [Data Wrangling](#2.-Data-Wrangling)


3. [Exploratory Data Analysis](#3.-Exploratory-Data-Analysis)


4. [Machine Learning](#4.-Machine-Learning)

    a) [Part 1: Creating the Topics](#Part-1:-Creating-the-Topics)
    
    b) [Part 2: Evaluating/Predicting the Topics](#Part-2:-Evaluating/Predicting-the-Topics)


5. [Conclusion](#5.-Conclusion)


6. [Resources](#6.-Resources)

***

## 0. Introduction

[Return to Table of Contents](#Table-of-Contents)

>### The Problem

One of the most common problems that everyday Americans face is their financial situation. There are many different reasons for this, such as debt, student loans, or the passing of a parent. According to a study of 1000 Americans over the age of 30 (conducted by [GuideVine](https://www.guidevine.com/newsroom/half-americans-age-30-cant-explain-401k/)), fewer than half of Americans understood common financial concepts such as interest, bankruptcy, and inflation. More importantly, more than half felt "lost" when it came to having a secure long-term plan for their financial future. This highlights a recurring theme in our society: many people have trouble getting by with their finances, and naturally some of them seek help online. One of these resources is a subreddit called 'Personal Finance', which is hosted on the popular discussion-board website **[Reddit](https://www.reddit.com/)**.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Subreddit%20Screenshot.png?raw=true)

**[/r/personalfinance](https://www.reddit.com/r/personalfinance/)** has become a place where many people are able to ask for help regarding their financial situations. A few common topics are savings, budgeting, taxes and debt. There are hundreds of different posts every day from people of different ages, genders, locations and occupations. Frequent users of the subreddit often provide advice through comments and are able to upvote posts to help them gain more visibility on the front page of /r/personalfinance. However, the main feature we will be focusing on in this project is the "flair" feature, which enables users to tag their submissions with a topic. For example, if I wrote a post asking for advice on paying off my student loans, I would use the 'Debt' flair. This helps organize the hundreds of daily posts into different categories so that people will be able to search them more easily.

There are 3 main problems with this:

<div class="alert alert-block alert-danger">
1) <b>Not all users choose a flair for their post.</b> 
</div>

In this scenario, the system will automatically tag the post with 'Other', which is used when the user either doesn't know what topic the post should fall under or does not assign a topic at all. This becomes a problem because many posts end up without their proper flair, which ruins the organized structure of searching by topic. The 'Other' flair was originally designed to contain posts that did not fall under the other topics, but due to the popularity of /r/PersonalFinance (over 13 million subscribers), there are often many new users who do not know how to either assign a topic (an issue with the interface) or choose a topic (unfamiliarity with financial categories). Because of this, the 'Other' flair is usually the most popular and common topic on the subreddit despite the fact that most of the posts actually do fall under another topic.

It would be like someone throwing their plastic trash into a regular garbage bin instead of the recycle bin because they were unaware of what the recycle bin is for, they did not see it, or they did not care to throw it into the proper bin. If this were simply a rarity, it could be overlooked, but if it becomes a common occurrence, it creates a larger problem that needs to be remedied quickly. The same logic applies to properly flairing a post rather than leaving it as 'Other'.

![s](http://ahs.mcnairycountyschools.com/uploads/5/2/7/9/52798193/9759366_orig.gif)

<div class="alert alert-block alert-danger">
2) <b>There are posts which fit under multiple topics.</b>
</div>

Each post can only have a single flair thus forcing users to decide which topic best represents their question or idea. However this can have several repercussions. First, the way a user decides on the topic is very subjective and may not be the best fit. Second, if they don't know which topic to pick for their post, they may decide to simply classify it under 'Other', which goes back to the first problem. 

<div class="alert alert-block alert-danger">
3) <b>The current set of available topics may not contain the best variety or be most efficient.</b> 
</div>

This is a more interesting problem, and it could be an analytical problem for another project, because it is difficult to ascertain how efficient the current batch of flairs is. One question is whether we should add more topics in order to reduce the number of 'Other' posts and make it easier for users to identify the proper flair for their submissions. For example, should we include a new 'Student' topic for questions pertaining to student loans, textbook savings, getting a campus job, etc.? Another question is whether we can *remove* any of the current topics and replace them with something else. This is not to say that the current topics are bad; in fact, an eye test would be enough for most people to be satisfied with them. But how do we know whether they are the most efficient set of topics?

To sum up these 3 main problems into a single question: 

<div class="alert alert-block alert-info">
<b>How can we create an efficient set of topics and automatically assign a single one for each post?</b>
</div>

---------------------------------

>### The Project

<div class="alert alert-block alert-success">
<b>The main goal of this project is to create a topic-modeling algorithm that will generate a new set of topics for /r/personalfinance posts.</b>
</div>

The project consists of 4 stages.

**1) [Data Acquisition](#1.-Data-Acquisition)**: This is often the most important step in any data science project (along with cleaning the data). We will first need to acquire the relevant data from /r/personalfinance using pushshift.io's API as well as Reddit's official API. 

**2) [Data Wrangling](#2.-Data-Wrangling)**: After we extract the data, we will need to clean it and prepare it for analysis. Natural Language Processing projects require pre-processing the text so that the machine is readily able to interpret it after vectorization. Any incoherent symbols or punctuation will worsen the model's performance, so this step is crucial before creating the model.

**3) [Exploratory Data Analysis](#3.-Exploratory-Data-Analysis)**: We will begin to explore the data and observe any interesting trends that can provide a direction or foundation for how we proceed with the project. Natural Language Processing projects are typically limited in how much EDA can be performed because of the lack of features. After all, we are mainly dealing with text!

**4) [Machine Learning](#4.-Machine-Learning)**: Let us split the problem statement into 2 parts: 

<div class="alert alert-block alert-danger">
<b>First, how can we create topics based on the posts' text and assign one to each post?</b>
</div>

- The main type of algorithm we seek to use is a topic model, which discovers topics based on hidden semantic structures in the various documents of a corpus (which is a large unstructured set of text). More specifically, this will require the implementation of Latent Dirichlet Allocation, a generative statistical model that works well in topic modeling by mapping each document to a number of different topics and each topic to a number of different words. To evaluate how well a generated set of topics does in clustering the documents, we will be using 2 metrics: Perplexity and Coherence (this will be discussed in greater detail in the machine learning section).

<div class="alert alert-block alert-danger">
<b>Second, how do we evaluate whether or not these topics are efficient?</b>
</div>

- To answer this question, we must think about how we can define an "efficient" topic. Remember that we want this model not only to do well in creating topics for the fitted data, but also to do a good job of assigning topics to *new* posts. The problem we face is that there are many cases where a user does not select a topic for their own post. This model should therefore help resolve the issue by automatically assigning a topic to these new posts.


- The best way to evaluate the efficiency of the new topics that we generate from our topic model is by measuring the accuracy of predicting the labels. In other words, we will be creating a new classification problem where the feature variables are the text for each post and the response variable is the generated topic for each post. We then use different machine learning algorithms (such as Naive Bayes and Stochastic Gradient Descent) to create a good model that can predict the topics for a given row. Finally, we use the Accuracy, Precision and Recall metrics to evaluate how well the model performed.

***
<div class="alert alert-block alert-warning">
Note: We did not use inferential statistics to do hypothesis testing or to observe relationships between the feature and response variables. This is because our feature variables are the vectorized form of words and there are thousands of different words in our usable dictionary, which would make any statistical insight practically meaningless.
</div>

***

>### The Clients

1) **/r/PersonalFinance Users** will be interested in having a new system that could automatically determine a proper topic for their submission so that they don't have to choose one themselves, whether they are too lazy or don't know which topic to select. This will also help other people who search posts by topic. For example, if someone was interested in reading advice for people struggling with low income, they could simply navigate through the 'Planning' topic.

![s](https://watermarked.cutcaster.com/cutcaster-photo-100747197-Cartoon-boy-working-with-computer.jpg)
 
 
2) **/r/PersonalFinance Moderators** will benefit from having a clearer structure for all submissions. Many of the popular subreddits insist on their members tagging their posts with flairs because of the convenience and organization it provides to the subreddit as a whole (sometimes this is even a requirement). One of the problems that moderators face with their subreddits is the influx of recurring questions. Many subreddits ask users to first search for similar posts before submitting their own because it helps to reduce the number of duplicate questions. A better topic system will help the search feature because there will be fewer 'Other' posts and more posts under the specific topics.
 
 ***

## 1. Data Acquisition

[Return to Table of Contents](#Table-of-Contents)

>**Obtaining the data**

For this report, I created my own dataset through 2 steps:

**First**, we extract the submission information using the __[pushshift.io Reddit API](https://github.com/pushshift/api)__. It gives us the ability to search through Reddit data and create data aggregations. Using sqlite was necessary as there are many different rows of data to collect.

We first specify the date range from which to extract the data, since the range is attached to the pushshift URL. For example, if we wanted to extract the newest post submissions as of August 25th 2018 for /r/personalfinance, we would first convert August 25th 2018 11:59:59 PM into datetime format and then into a timestamp using the .timetuple() method. We then add the subreddit and timestamp as strings in the URL.

This would be the resulting url: http://api.pushshift.io/reddit/search/submission/?subreddit=personalfinance&size=500&before=1535255999

Take a quick look!

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/Pushshift%20IO%20Example.png?raw=true)

Notice that this will provide the 500 most recent posts as of August 25th 11:59 PM because we specified size=500. We then use the requests module and its built-in JSON decoder to extract the relevant data from each pull. To get the next 500 most recent posts, we simply do another pull, replacing the timestamp with the submission time of the 500th (oldest) post from the previous pull.
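
As a rough illustration of the URL construction and the repeated pulls described above, here is a minimal sketch. The helper names `build_url` and `collect_posts` are hypothetical and are not taken from the project notebook. The UTC-4 offset is an assumption: it is the offset under which August 25th 2018 11:59:59 PM maps to the timestamp in the URL above. The sketch also assumes each returned post carries a `created_utc` field.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "http://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, before, size=500):
    """Return a search URL for posts submitted before `before` (epoch seconds)."""
    query = urlencode({"subreddit": subreddit, "size": size, "before": int(before)})
    return f"{BASE_URL}?{query}"


def collect_posts(fetch, subreddit, before, max_pulls):
    """Pull successive batches, moving `before` back to the oldest post seen.

    `fetch` is any callable that maps a URL to a list of post dicts, for
    example: lambda url: requests.get(url).json()["data"]
    """
    posts = []
    for _ in range(max_pulls):
        batch = fetch(build_url(subreddit, before))
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        # The next pull starts from the oldest submission time in this batch
        before = min(post["created_utc"] for post in batch)
    return posts


# Assumption: a fixed UTC-4 offset for the cutoff date
eastern = timezone(timedelta(hours=-4))
cutoff = int(datetime(2018, 8, 25, 23, 59, 59, tzinfo=eastern).timestamp())
url = build_url("personalfinance", cutoff)
```

Passing the fetch function in as an argument keeps the paging logic separate from the network call, which makes it easy to try the loop on a small fake response before running a long extraction.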

**Second**, I used the __[official Reddit API](https://praw.readthedocs.io/en/latest/)__ to extract information that is not provided through the pushshift.io Reddit API. In this case, we extracted the flair (which will act as the "topic" of each submission) and the self-text (the body of text for each submission). We cannot simply use the official Reddit API to extract all the submission data because it limits both the number of submissions you are allowed to extract (1000) and the date range. For example, if we extract the 1000 most recent posts on a subreddit, we cannot extract the next 1000 most recent posts since we cannot specify the date range.

We will have a dataset that contains approximately 100,000 rows of data. Each row represents a submission in the Personal Finance subreddit. The features will include:

- title

- date (the date at which the post was submitted)

- time (the time at which the post was submitted)

- upvotes (number of upvotes for each submission)

- id (submission ID)

- topic (flair)

- self-text (the text information within each post)

To see the code and more annotation, check out the [Data Acquisition](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/01_pf_data_wrangling.ipynb) notebook.

<div class="alert alert-block alert-warning">
Note: This process of scraping data is very computationally expensive. Consider using PaperSpace or Google Cloud to speed up the process.
</div>

***

## 2. Data Wrangling

[Return to Table of Contents](#Table-of-Contents)

>**Inspecting the Dataset**

Let's begin by observing the dataset.

In [6]:
import pandas as pd
from pf_files import *

df = pd.read_pickle(r'C:\Users\Joshua\Pickle_files\df')
first_df = df[['title','date','time','upvotes','id','topic','self_text']]

first_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23182 entries, 0 to 23181
Data columns (total 7 columns):
title        23182 non-null object
date         23182 non-null object
time         23182 non-null object
upvotes      23182 non-null int64
id           23182 non-null object
topic        23182 non-null object
self_text    23182 non-null object
dtypes: int64(1), object(6)
memory usage: 1.2+ MB


In [27]:
first_df.head()

| | title | date | time | upvotes | id | topic | self_text |
|---|---|---|---|---|---|---|---|
| 0 | Ways to make extra side money? | 2018-09-19 | 12:57 PM | 1 | 9h6whn | unknown | |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | 2018-09-19 | 12:56 AM | 16 | 9h29g7 | Other | |
| 2 | 19, being kicked out | 2018-09-19 | 12:55 PM | 2 | 9h6vyv | Other | So i just found out last night the home ive be... |
| 3 | Online Savings Account? | 2018-09-19 | 12:54 PM | 1 | 9h6vs4 | Saving | Hello! Looking for recommendations for an onli... |
| 4 | Tools for Managing Incomes and Expenses | 2018-09-19 | 12:52 PM | 0 | 9h6v48 | Other | |


>**Creating the Text column**

Rather than having the title and the self-text in separate columns, it would be more convenient to have both in a single column. We do so by concatenating the 2 columns into a new column called 'text'.

In [28]:
df[['title','self_text','text']].head()

| | title | self_text | text |
|---|---|---|---|
| 0 | Ways to make extra side money? | | Ways to make extra side money? |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | | (Year UPDATE) Legally blind, going homeless, h... |
| 2 | 19, being kicked out | So i just found out last night the home ive be... | 19, being kicked out So i just found out last ... |
| 3 | Online Savings Account? | Hello! Looking for recommendations for an onli... | Online Savings Account? Hello! Looking for rec... |
| 4 | Tools for Managing Incomes and Expenses | | Tools for Managing Incomes and Expenses |
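
The concatenation described above could be done along these lines. This is a sketch on a small made-up frame rather than the notebook's actual code; it assumes a missing self-text may appear as either `None` or an empty string, and it joins the two columns with a single space.

```python
import pandas as pd

# Small illustrative frame; the real dataset has one row per submission
posts = pd.DataFrame({
    "title": ["Ways to make extra side money?", "Online Savings Account?"],
    "self_text": [None, "Hello! Looking for recommendations"],
})

# Join title and self-text with a space, then strip the trailing space
# that is left behind when the self-text is missing
posts["text"] = (
    posts["title"].fillna("") + " " + posts["self_text"].fillna("")
).str.strip()
```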


>**Convert Time and Dates into datetime format**

Since the data is given as a string, we will need to convert the time and date into datetime format.

In [29]:
df[['date','time']].head()

| | date | time |
|---|---|---|
| 0 | 2018-09-19 | 12:57 PM |
| 1 | 2018-09-19 | 12:56 AM |
| 2 | 2018-09-19 | 12:55 PM |
| 3 | 2018-09-19 | 12:54 PM |
| 4 | 2018-09-19 | 12:52 PM |
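
One way to perform this conversion is with `pd.to_datetime` and an explicit format string. The sketch below is an assumption about how it could be done, not the notebook's own code, and the combined `datetime` column name is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "date": ["2018-09-19", "2018-09-19"],
    "time": ["12:57 PM", "12:56 AM"],
})

# %I is the 12-hour clock and %p is the AM/PM marker
sample["datetime"] = pd.to_datetime(
    sample["date"] + " " + sample["time"],
    format="%Y-%m-%d %I:%M %p",
)
```

Spelling out the format avoids ambiguity with the 12-hour times and is typically faster than letting pandas infer it row by row.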


>**Replace missing topics**

Some posts have missing topics and self-text. This is because the post was either deleted by the user or removed by a moderator. Since we plan on creating our own topics, it is not a big deal if we are missing the topic. Let's fill missing topics by categorizing them under 'unknown'. 

In [33]:
df1 = pd.read_csv(r'C:\Users\Joshua\Downloads\Data\reddit\reddit_pf.csv', engine = 'python', index_col=0)

df1.fillna('unknown').topic.value_counts()

Debt                 1256
Other                1205
Credit               1048
Investing             859
Retirement            824
Employment            724
Housing               709
unknown               651
Auto                  560
Planning              526
Saving                524
Taxes                 509
Budgeting             405
Insurance             380
Meta                    2
THIS IS A SPAMMER       1
Name: topic, dtype: int64

>**Replace outlier topics**

As we can see from the previous point, there are two outlier topics: 2 instances of 'Meta' and 1 instance of 'THIS IS A SPAMMER'. Since these are clearly not intended to be part of the set of topics, we should replace these with 'unknown'.

In [31]:
df.topic.value_counts()

Debt          2819
Other         2749
Credit        2447
Investing     1910
Retirement    1814
Employment    1688
Housing       1574
unknown       1396
Auto          1303
Planning      1229
Saving        1200
Taxes         1164
Budgeting     1012
Insurance      877
Name: topic, dtype: int64
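
The two replacement steps above (missing topics and outlier topics) can be sketched as follows. The sample values are made up for illustration and the `OUTLIER_TOPICS` name is hypothetical.

```python
import pandas as pd

topics = pd.Series(
    ["Debt", None, "Meta", "Other", "THIS IS A SPAMMER", "Debt"],
    name="topic",
)

# Flairs that are clearly not part of the intended topic set
OUTLIER_TOPICS = ["Meta", "THIS IS A SPAMMER"]

# Missing topics and outlier topics are both folded into 'unknown'
cleaned = topics.fillna("unknown").replace(OUTLIER_TOPICS, "unknown")
counts = cleaned.value_counts()
```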

>**Text pre-processing**

As the saying goes: __["Garbage In, Garbage Out"](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)__. Before tokenizing the text and feeding it into the machine, we will need to clean up the data. To do so, we will create a function that will perform the following steps:

- **Lowercase** the words so that the model will not differentiate capitalized words from other words.

- **Remove numbers/digits** since the model is interpreting *text* not numbers.

- **Remove punctuation** since it is not important for the context.

- **Strip white space** since empty strings could be interpreted as text and we want to avoid that.

- **Remove stopwords**, which are general words that are very frequent in the English dictionary (ex. because, such, so). Here is a __[list of some common stopwords](https://www.ranks.nl/stopwords)__.

- **Remove noise** that is not picked up through the other cleaning methods. This step can come either before or after tokenization and normalization, or both (ex. dropping words that are less than 2 characters long).

In [34]:
df[['text','clean_text']].head()

| | text | clean_text |
|---|---|---|
| 0 | Ways to make extra side money? | ways make extra side money |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | year update legally blind going homeless one j... |
| 2 | 19, being kicked out So i just found out last ... | kicked found last night home ive staying going... |
| 3 | Online Savings Account? Hello! Looking for rec... | online savings account hello looking recommend... |
| 4 | Tools for Managing Incomes and Expenses | tools managing incomes expenses |
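
The cleaning steps listed above can be sketched as a single function. This is an illustrative version, not the function used in the notebook: the `STOPWORDS` set is a tiny made-up sample (a real run would use a full list such as the one linked above), and the `min_length` threshold is an assumption based on the "less than 2 characters" example.

```python
import re
import string

# Assumption: a tiny illustrative stopword list, not the one used in the project
STOPWORDS = {"a", "an", "and", "for", "i", "is", "of", "or", "so", "the", "to"}


def clean_text(text, min_length=2):
    """Lowercase, drop digits and punctuation, then drop stopwords and short tokens."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove numbers/digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() also discards extra white space
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]
    return " ".join(tokens)
```

With this sample stopword list, `clean_text("Ways to make extra side money?")` returns `'ways make extra side money'`, while a title such as `"1099 or W2?"` is reduced to an empty string because nothing survives the filters.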


>**Check for missing text**

After pre-processing the text, there are some cases where it completely removes all text. We will remove these rows from the dataset.

In [36]:
df.loc[[2845, 3646, 4592, 4705], ['text', 'clean_text']]

| | text | clean_text |
|---|---|---|
| 2845 | 1099 or W2? | |
| 3646 | 401Ks? | |
| 4592 | 401k | |
| 4705 | T | |


>**Setting a word limit**

We also limited the number of words in the text since the longer the text, the longer the computation time. We did consider using the number of characters but the number of words seemed to do a better job. 150 was used as the word limit for each row.

In [40]:
# Number of words in each cleaned document, computed once and reused
word_counts = df['clean_text'].str.split().str.len()
txt_wrd_mean = word_counts.mean()
txt_wrd_median = word_counts.median()

print('Mean Word Length: {}\n\nMedian Word length: {}\n'.format(txt_wrd_mean, txt_wrd_median))

# Number of documents that fall under the 150-word limit
word_count_ = (word_counts < 150).sum()
total_len = len(df)

print('Percentage of texts with word count less than 150: {0:.2f}%'.format((word_count_/total_len)*100))

Mean Word Length: 66.79087222845311

Median Word length: 49.0

Percentage of texts with word count less than 150: 89.47%


>**Tokenization**

In order to better analyze individual words, we will need to *tokenize* the documents (or in this case, the submission titles) into individual words, or tokens. By doing so, we will be able to use the various NLP libraries (ex. spaCy) to further dissect the tokens.

>**Lemmatization**

After tokenizing the data, we will need to normalize the text through lemmatization or stemming. Lemmatization is typically the better method since it returns each word's canonical dictionary form (its lemma). However, this process takes much more time than stemming, which simply removes the affixes of a given word.

In [43]:
df[['text','clean_text', 'lemmatized_text']].head()

| | text | clean_text | lemmatized_text |
|---|---|---|---|
| 0 | Ways to make extra side money? | ways make extra side money | [way, make, extra, side, money] |
| 1 | (Year UPDATE) Legally blind, going homeless, h... | year update legally blind going homeless one j... | [year, update, legally, blind, go, homeless, o... |
| 2 | 19, being kicked out So i just found out last ... | kicked found last night home ive staying going... | [kick, find, last, night, home, have, stay, go... |
| 3 | Online Savings Account? Hello! Looking for rec... | online savings account hello looking recommend... | [online, saving, account, hello, look, recomme... |
| 4 | Tools for Managing Incomes and Expenses | tools managing incomes expenses | [tool, manage, income, expense] |


>**Named Entity Recognition**

Named entities are real-world objects that have a name, such as a person, country, or company. spaCy is able to recognize different types of named entities in a document and can return features such as the label (ex. ORG - organization, GPE - geopolitical entity). Not all documents will contain named entities since they may not reference any. While this is not a feature we plan to incorporate in the topic model, it will be interesting for observation in the exploratory data analysis.

In [46]:
df[['named_entities', 'entity_labels']].head()

| | named_entities | entity_labels |
|---|---|---|
| 0 | [] | [] |
| 1 | [one] | [CARDINAL] |
| 2 | [last night, returning make week] | [TIME, DATE] |
| 3 | [first] | [ORDINAL] |
| 4 | [] | [] |


## 3. Exploratory Data Analysis

[Return to Table of Contents](#Table-of-Contents)

The goal of this notebook is to explore the data extracted from Reddit's Personal Finance subreddit, including information such as:

- Text of submission

- Date of submission

- Topic

- Number of Upvotes

- Length of text

Using Natural Language Processing tools and data visualization, we aim to learn any obvious and underlying trends from the submission information. However, it is worth mentioning that there won't be that many details to uncover compared to other types of datasets, because the model's only input will be text. Many of the following findings may therefore not be relevant to creating the topics, but they are nonetheless interesting to see.

>**Which are the most common topics?**

Debt is the most common topic in the personal finance subreddit, followed by Other and Credit. This indicates that debt is a major concern for many of the redditors (users) and they make submissions in order to seek advice.

As discussed before, Other would include posts that either haven't been given a topic or didn't match the main topics. With more data, it could be possible that Other is the most frequent topic instead of Debt.

Investing and Retirement also have similar counts and are very similar in their functions.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Topic%20Count.png?raw=true)

> **Which are the top 10 most popular submissions?**

By popular, we mean which submissions had the most user upvotes. 

There are a variety of different subjects in the top 10 most popular posts, ranging from scams to dealing with the loss of a family member. However, 3 of the top 10 include the word "scam", showing that this may be a popular subject in /r/personalfinance and could happen often. This could be a common issue among many people living in the US based on the number of upvotes.

While some posts are seeking advice, others provide advice such as '"Hidden" costs of buying a home and how to prepare for them.'

In [51]:
# Top 10 most popular headlines
[title for title in df.sort_values(by='upvotes', ascending=False)['title'].values[:10]]

["IRS will allow employers to match their employees' student loan repayments",
 "If you can't get your emergency fund to grow because of emergencies that keep coming up, you're still doing a good job.",
 'Your amazon store card is probably scamming you',
 'A story of how I just got out of paying a $1500 bill, why you should NEVER blindly trust a Debt collection agency, and ALWAYS request proof of a Debt owed. 😂😂😂',
 'Girlfriend had some standard medical tests done. The clinic apparently waited too long to bill her insurance company and insurance is declining to pay them. Now, the clinic is trying to bill girlfriend for full amount and threatening to go to collections. What are her options?',
 "If the only reason you pay for Amazon Prime is because of 2-day shipping, there is a good chance it's a smarter financial move for you to cancel it than continue.",
 'So I fell for a scam yesterday and it still angers me.',
 'Credit freezes are now free. Starting today.',
 "I'm 18 and I 

> **Which are the most common tokenized words?**

In the plot below, we can see the frequency of the top 100 most common tokens with ticks on intervals of 10.

The word 'pay' has been used over 20,000 times in the corpus of text. After the first 10 tokens, the frequency that a word is used decreases very quickly. The words 'account' and 'payment' are also fairly common and transferable among many different topics which is why it is logical for them to have high counts. Towards the end of the graph, we can start to see more topic-specific words such as 'rent' and 'invest'.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Token%20Count.png?raw=true)
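
Token frequencies like the ones plotted above can be computed with `collections.Counter`. The sketch below uses a few made-up lemmatized documents rather than the real corpus, so the counts are purely illustrative.

```python
from collections import Counter

# Made-up lemmatized documents standing in for the real corpus
lemmatized_docs = [
    ["way", "make", "extra", "side", "money"],
    ["pay", "debt", "pay", "loan"],
    ["pay", "account", "money"],
]

# Flatten all documents into one stream of tokens and count each token
token_counts = Counter(token for doc in lemmatized_docs for token in doc)

# The n most frequent tokens, as (token, count) pairs
top_tokens = token_counts.most_common(2)
```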

> **What does the distribution of the text length look like?**

As we can see, the distribution of the text length is strongly right-skewed with high kurtosis. While the graph peaks around 75-100 characters, it is difficult to tell where the mean of the data is if we only look at the graph. In reality, the mean number of characters for the posts is 602, which is far from the peak of the data. There is a strong peak, but after about 150 words the distribution flattens out, gradually increasing up until 600 and then beginning to decline.

One interpretation of this graph can be that there are many new users who simply want to ask a quick question without going into much detail. For example, a person may want to ask about whether they should get a Visa or MasterCard credit card. These types of posts are very common where people do not have to adequately describe their financial situation to get advice.

Another interpretation is that many of the posts get deleted, resulting in the self-text disappearing and significantly reducing the number of characters in the total text. This is certainly plausible because of how many subscribers /r/personalfinance has. The more popular a subreddit becomes, the more common it is for new users to disregard the subreddit's rules and consequently, a higher chance that a moderator removes their post. In fact, the mean number of characters for posts under the 'unknown' topic (which is for posts that have been deleted or removed) is 130, which is well below the average of 602.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Text%20Length%20Distribution.png?raw=true)

While this plot describes what the distribution looks like for all posts, let's also take a look at the distribution by topic.

We can see that, depending on the topic, the kurtosis of the histogram can vary greatly. For example, the histograms for 'Other' and 'unknown' have much heavier weight on their tails compared to other topics such as 'Budgeting', which appears very flat. Besides their peaks, the histograms look very similar to one another, which means that outside of 'unknown', the topics have very similar distributions.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Text%20Length%20Distribution1.png?raw=true)

>**Most popular tokens by POS (Part-of-Speech) Tagging**

Part of Speech tagging is marking a word based on its __[part of speech](https://en.wikipedia.org/wiki/Part_of_speech)__. Examples can include noun, verb, and pronoun. It was initially done by hand (ex. [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus)) but there are now many different algorithms that can be used to decipher a POS tag for any given document. spaCy is a python library that has one such algorithm that allows us to obtain the POS for a given word. Let's take advantage of this by extracting singular and plural nouns from each document. 

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Noun%20Token%20Count.png?raw=true)

We see that the most common token is 'credit', not 'pay' like we had seen before. This hints at the fact that 'pay' is mainly used as a verb rather than a noun. Another interesting finding is that there are many date-based words such as 'year' and 'month'. Most likely this is because many people are describing their financial circumstances in the context of a year or several months. Other common tokens strongly suggest which topics they're part of. For example: 

- 'car' → Auto

- 'debt' → Debt

- 'job' → Employment

- 'loan' → Debt

>**What are the most common named entities?**

By far the most common types of named entity are time and date entities. This indicates that most users place a great amount of detail on the context of time when they are describing their financial issues. 

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20NER.png?raw=true)

>**Sentiment Analysis**

While observing the relative sentiments of tokens is not particularly useful in determining the topics of the posts, it is nonetheless interesting to observe due to some surprising (and humorous) results. We can apply it to our text through the SentimentIntensityAnalyzer class from nltk's sentiment module.

'Credit' is easily the most common positive word in the dataset, but this does raise the question of whether or not it's truly positive in the context of the text. For example, what if they are referring to credit cards; would the word 'credit' still be considered positive? Similarly, the word 'interest' is the second most common positive word, but interest in the financial context is probably negative!

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Positive.png?raw=true)

On the flip side, we see that 'debt' is the most common negative word, which isn't surprising since it's also the most popular topic on /r/personalfinance. Whether the word 'pay' is necessarily negative is debatable in this context since it could simply be neutral. For example, if someone has to "pay" for a cup of coffee, it's not necessarily negative since they are simply making a transaction to buy a drink.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Negative.png?raw=true)

***

## 4. Machine Learning

[Return to Table of Contents](#Table-of-Contents)

As discussed before, there are two main parts for this topic-modeling project:

1) The first part will be to generate the topics using Latent Dirichlet Allocation and then label each post with a main topic based on its topic distribution. This is an unsupervised task.

2) The second part will be to predict the topic for each post using various algorithms and then evaluate their performance through the accuracy, precision and recall metrics. This is a supervised classification task.

Without further ado, let's jump right into it!

> ### Part 1: Creating the Topics

Like any other machine learning project, we begin by first creating features that we will be inputting into the LDA model. We do this by using a bag-of-words model which assigns each token an id number and also counts the frequency of a token in each text (document). To implement this feature with gensim (a popular library for NLP purposes), we fit the corpus of text into a dictionary. This dictionary then converts all of the tokens into a bag-of-words model.

For example, take a look at the first 10 tokens in our dictionary:

In [60]:
count = 0
for x in dictionary.token2id.items():
    count +=1
    print(x)
    if count == 10:
        break

('extra', 0)
('make', 1)
('money', 2)
('side', 3)
('way', 4)
('blind', 5)
('go', 6)
('help', 7)
('homeless', 8)
('job', 9)


Notice how each word is assigned a number next to it. That is its token ID for vector representation. This is done because the machine doesn't interpret documents as a collection of words since it is unable to understand the definition of a word. However, it **is** able to interpret words as a vector. Now let's take a look at the bag-of-words representation of the first document in the corpus: "way make extra side money".

In [61]:
# Create a bag of words
corpus = [dictionary.doc2bow(text) for text in data_bigrams]
# with open('pf_corpus.pickle', 'wb') as f:
#     pickle.dump(corpus, f)

# View sample text and its respective bag of words transformation
print(data_bigrams[0])
print(corpus[0])

['way', 'make', 'extra', 'side', 'money']
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]


Each tuple in the second line pairs a token ID with a second number, which is the frequency of that token in the document. Note that the tuples are ordered by token ID rather than by position in the text, so they do not line up one-to-one with the tokens printed above them. For example, the token 'way' has the ID 4, so its tuple is (4, 1): there is only one instance of the token 'way' in the entire document.
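To make the ID-and-count mechanics concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words conversion do. It is not gensim's implementation: `build_token2id` and `doc2bow` are hypothetical helper names, the second document is made up, and numbering unseen tokens in sorted order within each document is an assumption chosen so that the toy IDs match the ones printed above.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer ID to every unique token (hypothetical helper)."""
    token2id = {}
    for doc in documents:
        # Assumption: unseen tokens are numbered in sorted order per document.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Return (token_id, count) pairs for one tokenized document, sorted by ID."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [
    ["way", "make", "extra", "side", "money"],  # first post from the corpus
    ["money", "make", "money"],                 # made-up second document
]
token2id = build_token2id(docs)
bow_first = doc2bow(docs[0], token2id)
bow_second = doc2bow(docs[1], token2id)

print(token2id)
print(bow_first)
print(bow_second)
```

The second document shows what happens when a token repeats: 'money' (ID 2) appears twice, so its tuple carries a count of 2.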

Now that we have created a dictionary that assigned a number to each token and created a bag-of-words model for each document, let's begin creating the Latent Dirichlet Allocation model. 

Here's how it works (click **[HERE](#Creating-the-Model)** if you want to skip the explanation of the model).

> #### How LDA Works

A corpus can be represented in vector space by a document-word matrix which contains the frequency of every word in a corpus against every document in the corpus. Here is an example:

![s](https://www.researchgate.net/profile/Mark_Cutkosky/publication/220677827/figure/fig2/AS:394013053079585@1470951440970/ector-space-representation-of-a-document-collection-or-corpus-matrix.png)

The value $c_{11}$ represents the number of times the word 'schedule' appears in the document 'Memo'. So how does this relate to creating the topics? What LDA does is essentially split this $M \times N$ matrix into 2 lower-dimensional matrices, which we will call $A_1$ and $A_2$. $A_1$ is a document-topic matrix whereas $A_2$ is a topic-word matrix. Let the number of topics be $K$; $A_1$ is therefore an $M \times K$ matrix while $A_2$ is a $K \times N$ matrix.
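The shapes are easier to see with numbers. The sketch below uses made-up values for the two matrices (2 documents, 2 topics, 3 words) and multiplies them to recover a document-by-word matrix of probabilities. The values and the `matmul` helper are assumptions for illustration only; this is not how LDA is actually fitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy values chosen only for illustration (M=2 documents, K=2 topics, N=3 words).
A1 = [[0.9, 0.1],        # document 0 is mostly topic 0
      [0.2, 0.8]]        # document 1 is mostly topic 1
A2 = [[0.7, 0.2, 0.1],   # topic 0 favours word 0
      [0.1, 0.3, 0.6]]   # topic 1 favours word 2

doc_word = matmul(A1, A2)  # M x N: word probabilities implied for each document
shape = (len(doc_word), len(doc_word[0]))

print(shape)
for row in doc_word:
    print([round(p, 3) for p in row], "row sum:", round(sum(row), 3))
```

Because every row of both toy matrices sums to 1, every row of the product does too: each document ends up with a proper probability distribution over words.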

There are now 2 major steps in how it assigns a topic to each document:

1) The model first *randomly* assigns every word in every document to one of the $K$ topics. In other words, these initial topics may represent the clusters of documents very poorly. You can think of them as dummy topics at this point in time.

2) Each topic generates words based on its probability distribution. For example, if 'dog' appears more frequently than other words among the words assigned to topic 1, then 'dog' will have a strong weight in topic 1. By repeatedly iterating through the words in each document, the model adjusts itself by reassigning each word to a topic. If the model discovers that the word 'dog' is infrequent in topic 2, it will tend to correct itself by moving that occurrence of 'dog' to a topic with a higher probability $P$.

The way we calculate $P$ is through multiplying 2 smaller probabilities:

1) $P(topic\;t|document\;d)$ - the proportion of words in document $d$ that are currently assigned to topic $t$.

2) $P(word\;w|topic\;t)$ - the proportion of assignments to topic $t$ over all documents that come from the word $w$.

In other words, $P$ is the probability that topic $t$ generated word $w$. 
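The two proportions can be computed directly from the current word-to-topic assignments. The sketch below does this for a tiny made-up corpus; the data and the function names are hypothetical. Samplers used in practice typically also add smoothing terms and leave out the occurrence being reassigned, both of which are omitted here to keep the arithmetic visible.

```python
# Current topic assignment of every word occurrence, per document (toy data).
assignments = {
    "doc0": [("dog", 0), ("bone", 0), ("loan", 1)],
    "doc1": [("loan", 1), ("debt", 1), ("dog", 1)],
}

def p_topic_given_doc(topic, doc):
    """Proportion of words in `doc` currently assigned to `topic`."""
    pairs = assignments[doc]
    return sum(1 for _, t in pairs if t == topic) / len(pairs)

def p_word_given_topic(word, topic):
    """Proportion of all assignments to `topic` that come from `word`."""
    in_topic = [w for pairs in assignments.values() for w, t in pairs if t == topic]
    return in_topic.count(word) / len(in_topic)

def score(word, topic, doc):
    """Unsmoothed P(topic | doc) * P(word | topic)."""
    return p_topic_given_doc(topic, doc) * p_word_given_topic(word, topic)

# Which topic is the better home for the word 'dog' in doc0?
scores = {topic: score("dog", topic, "doc0") for topic in (0, 1)}
print(scores)
```

Here topic 0 scores 2/3 × 1/2 = 1/3 while topic 1 scores 1/3 × 1/4 = 1/12, so 'dog' in `doc0` would most likely stay with topic 0.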

The figure below describes how LDA functions in a nutshell. Each topic has a discrete probability distribution of words and each document is composed of a mixture of topics. 

![s](http://bigdata.ices.utexas.edu/wp-content/uploads/2015/01/LDA-concept.png)

> #### Creating the Model

Creating the model with gensim is a relatively straightforward process because we only need to pass 2 parameters to gensim's LDA model: the corpus of text and the dictionary. However, in order to optimize the model there are additional parameters that we need to focus on:

1) **number of topics (num_topics)** - The number of topics that are created from the bag-of-words

2) **alpha value (alpha)** - The alpha value controls the mixture of topics for any given document. The greater the alpha, the greater the variety of topics will appear for any given document. The lower the alpha, the lower the variety of topics.

According to Griffiths and Steyvers (2004), it is suggested to use a value of 50/k (where k is number of topics) for α. However, gensim allows us to specify 'auto' for the alpha parameter, which causes the machine to automatically estimate alpha based off of the corpus we train with. This conveniently means that we won't have to test out different alpha values.
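For reference, the 50/k heuristic is simple to compute by hand. The helper below is a hypothetical illustration (it is not part of gensim) that returns one alpha value per topic, using the 20 topics chosen later in this project as the example.

```python
def symmetric_alpha(num_topics, numerator=50.0):
    """Return the 50/k heuristic as one alpha value per topic."""
    return [numerator / num_topics] * num_topics

alpha = symmetric_alpha(20)
print(alpha[0], len(alpha))
```

With 20 topics every entry is 50/20 = 2.5.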

To determine the best hyper-parameters, we will test different numbers of topics and evaluate each iteration of the model by its **coherence value**. Coherence measures how semantically related the highest-weighted words within each topic are, i.e. how much a topic makes sense given the words that define it. For example, a coherence score of 0.10 would indicate that the model did a poor job at creating topics for similar documents.

Let's take a look at the results when we use gensim's built-in LDA model.

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Coherence%20Graph.png?raw=true)

Coherence score peaks when the number of topics is equal to 20. This means that the LDA model produces the most coherent and best variety of clustered documents when there are 20 topics. Therefore this would be the most "optimal" number of topics. Let's also use the __[pyLDAvis](https://github.com/cpsievert/LDAvis)__ package to display how the topics vary in relation to each other.

In [8]:
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim 
pyLDAvis.enable_notebook()
# Load the previously saved pyLDAvis visualization
pd.read_pickle('pf_pyldavis.pickle')

While there are noticeably distinct topics such as topic 16 (investing), topic 8 (401k), topic 11 (employment), and topic 2 (credit card), you can see that many topics cluster around topic 1. A healthy set of topics will produce a plot with medium-sized topic bubbles that are fairly distant from one another. Therefore this set of topics could definitely use some improvement.

To do this, we will also use the __[MALLET](http://mallet.cs.umass.edu/)__ package because it often does a better job at generating topics for this very purpose.

<div class="alert alert-block alert-warning">
Note: There is currently an error when using the pyLDAvis package with LDA Mallet wrappers so we are unable to visualize the Mallet topics. 
</div>

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Coherence%20Graph1.png?raw=true)

As we can see, the coherence value peaks when the number of topics $K = 8$. However, when looking at the topics and doing an eye-test, it was determined that using $K = 20$ led to a better spread and variety of topics. Let's take a look below.

In [63]:
from pprint import pprint
MALLET_optimal_model = MALLET_model_list[3]
pprint(MALLET_optimal_model.print_topics(num_words=10))

[(0,
  '0.082*"year" + 0.066*"job" + 0.060*"school" + 0.053*"college" + '
  '0.052*"parent" + 0.031*"time" + 0.029*"work" + 0.028*"student" + '
  '0.024*"graduate" + 0.024*"start"'),
 (1,
  '0.140*"year" + 0.052*"high" + 0.045*"option" + 0.036*"low" + 0.034*"good" + '
  '0.029*"worth" + 0.022*"large" + 0.020*"share" + 0.020*"increase" + '
  '0.018*"point"'),
 (2,
  '0.102*"month" + 0.081*"move" + 0.061*"rent" + 0.052*"live" + '
  '0.042*"expense" + 0.032*"saving" + 0.024*"apartment" + 0.019*"monthly" + '
  '0.019*"food" + 0.016*"gas"'),
 (3,
  '0.197*"money" + 0.090*"put" + 0.089*"save" + 0.068*"good" + 0.062*"start" + '
  '0.044*"’" + 0.043*"make" + 0.030*"idea" + 0.026*"advice" + 0.019*"dollar"'),
 (4,
  '0.098*"job" + 0.095*"company" + 0.057*"offer" + 0.040*"salary" + '
  '0.036*"work" + 0.024*"position" + 0.024*"current" + 0.021*"raise" + '
  '0.019*"experience" + 0.016*"bonus"'),
 (5,
  '0.106*"house" + 0.086*"home" + 0.061*"buy" + 0.053*"mortgage" + '
  '0.037*"sell" + 0.029*"liv

Great! It appears as though we have generated a good variety of topics, but each one is made up of a distribution of words. There's no one-size-fits-all method for deciding which theme best describes each LDA-generated topic, but for the purposes of this project we decided to do an eye-test and use the words in each topic to come up with a topic name.

In [10]:
new_topics = {
    0: 'school',
    1: 'other',
    2: 'expenses',
    3: 'saving',
    4: 'employment',
    5: 'housing',
    6: 'other',
    7: 'other',
    8: 'bills',
    9: 'investment',
    10: 'collection call',
    11: 'insurance',
    12: 'credit card',
    13: 'debt',
    14: 'loans',
    15: 'bank account',
    16: 'other',
    17: 'taxes',
    18: 'auto',
    19: 'other' 
}

new_topics

{0: 'school',
 1: 'other',
 2: 'expenses',
 3: 'saving',
 4: 'employment',
 5: 'housing',
 6: 'other',
 7: 'other',
 8: 'bills',
 9: 'investment',
 10: 'collection call',
 11: 'insurance',
 12: 'credit card',
 13: 'debt',
 14: 'loans',
 15: 'bank account',
 16: 'other',
 17: 'taxes',
 18: 'auto',
 19: 'other'}

<div class="alert alert-block alert-warning">
Note: We named topics 1, 6, 7, 16, and 19 with the name 'Other' since they did not fit into any recognizable topic. 
</div>
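Once the names exist, labeling a post comes down to picking its highest-weighted topic and looking up the name. The sketch below assumes each post's topic distribution is available as a list of `(topic_id, proportion)` pairs; the example distribution and the `main_topic` helper are hypothetical.

```python
new_topics = {
    0: 'school', 1: 'other', 2: 'expenses', 3: 'saving', 4: 'employment',
    5: 'housing', 6: 'other', 7: 'other', 8: 'bills', 9: 'investment',
    10: 'collection call', 11: 'insurance', 12: 'credit card', 13: 'debt',
    14: 'loans', 15: 'bank account', 16: 'other', 17: 'taxes', 18: 'auto',
    19: 'other',
}

def main_topic(topic_distribution, names):
    """Return (topic_id, name, proportion) for the highest-weighted topic."""
    topic_id, proportion = max(topic_distribution, key=lambda pair: pair[1])
    return topic_id, names[topic_id], proportion

# Hypothetical (topic_id, proportion) pairs for a single post.
example_distribution = [(2, 0.05), (12, 0.17), (13, 0.07), (14, 0.09)]
result = main_topic(example_distribution, new_topics)
print(result)
```

Because five topic IDs share the name 'other', posts whose main topic is any of 1, 6, 7, 16, or 19 all end up under the same label.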

Now let's take a look at how the newly generated topics stack up!

In [71]:
df_main_topic.drop(['Doc_Number','Main_Topic'], axis=1)[:10]

| | Topic_Proportion | Keywords | Text | Topic_Names |
|---|---|---|---|---|
| 0 | 0.073 | pay, debt, month, owe, make, year, extra, end,... | Ways to make extra side money? | debt |
| 1 | 0.0699 | bill, back, pay, due, lose, ago, mom, dad, tim... | (Year UPDATE) Legally blind, going homeless, h... | bills |
| 2 | 0.1245 | work, time, week, day, hour, give, business, f... | 19, being kicked out So i just found out last ... | other |
| 3 | 0.1172 | account, bank, saving, check, open, money, fee... | Online Savings Account? Hello! Looking for rec... | bank account |
| 4 | 0.0813 | make, feel, thing, spend, lot, pretty, bad, ti... | Tools for Managing Incomes and Expenses | other |
| 5 | 0.0842 | good, advice, finance, people, post, personal,... | With resources like reddit, is financial consu... | other |
| 6 | 0.1695 | credit, card, score, balance, limit, apply, bu... | Credit hit from late payment, fee was waiver, ... | credit card |
| 7 | 0.1575 | loan, payment, interest, student, rate, pay, m... | Need Help budgetting/getting out of debt Long ... | loans |
| 8 | 0.1211 | work, time, week, day, hour, give, business, f... | 22yo M I'm a 22 year old male, with an almost ... | other |
| 9 | 0.0984 | call, ", charge, send, number, receive, report... | Debt collector gave 2 hours to pay, yelled, sa... | collections |


![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Generated%20Topic%20Count.png?raw=true)

Other, Investments and Credit Cards are the 3 most common topics in terms of number of posts. An astounding 14% of all posts are classified under 'other', which could imply that these were bad classifications. With a coherence value of 0.4867, it certainly makes sense that certain topics are more "generalized" than others. In the original topics, 'other' was the 2nd largest topic by number of posts so it is not too far off. 

On the flip side, Saving, Bills, and Debt are the 3 least common topics. This is unusual since debt was THE most common topic in the original topics as dictated by /r/personalfinance. This could be due to a few different reasons:

- 1) Redditors classified their posts as 'Debt' without actually being about debt. In other words, it could be the case that there are many false positives in the original classification.

- 2) Redditors classified their posts as 'Debt' but the posts had stronger semblance to other topics. For example the new generated topic 'school' could also include debt as part of student loans.

- 3) Some of the new generated topics could also contain debt as a significant contributor while being labeled under a different topic. For example, the 'other' topics could very well have had many of the original 'Debt' posts.

- 4) Neither the original topics nor the LDA MALLET model have done a good job of determining which posts are about the topic 'Debt'. 

With these questions in mind, let's now move on to evaluating how effective our topics are.

> ### Part 2: Evaluating/Predicting the Topics

By using our new generated topics as the target variables, we will use the text as the feature variables and try to predict the labels for new data. By measuring the accuracy, precision and recall of the predictions we will be able to determine how reasonable and logical the topics are. For example, if it turns out that many of the posts labeled as 'school' were indeed predicted as 'school', our model would have done a good job of capturing most of the school-related posts into a single cluster.

We begin by defining our features (the cleaned text) and our response variables (our generated topics).

In [77]:
df_eval[['clean_text','new_topic']].head()

| | clean_text | new_topic |
|---|---|---|
| 0 | ways make extra side money | debt |
| 1 | year update legally blind going homeless one j... | bills |
| 2 | kicked found last night home ive staying going... | other |
| 3 | online savings account hello looking recommend... | bank account |
| 4 | tools managing incomes expenses | other |


Just like we created a bag-of-words model before applying the topic-modeling techniques, we once again have to vectorize the text. Only this time, we will use the CountVectorizer class from scikit-learn. While a bag-of-words model does the job of counting the frequency of each word per document, we want to give more weight to signature words that appear less frequently than common words. For this reason, we also introduce a **term frequency - inverse document frequency (tf-idf) model**. We apply the tf-idf transformer, which converts the word-count matrix into a tf-idf representation. Finally, we apply the classifier to create the machine learning model.
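To see why tf-idf favours signature words, here is a small sketch of the textbook formula, tf × ln(N / df), on three made-up posts. This is an illustration only: scikit-learn's transformer uses a smoothed idf and normalizes each row by default, so its numbers will differ, and the `tf_idf` helper is hypothetical.

```python
import math
from collections import Counter

# Three made-up, already-tokenized posts.
docs = [
    ["pay", "debt", "pay", "loan"],
    ["pay", "rent", "apartment"],
    ["pay", "tax", "refund"],
]

def tf_idf(documents):
    """Textbook tf-idf: (count / doc length) * ln(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
print(weights[0])
```

'pay' is the most frequent word in the first post, yet it appears in every post, so its idf is ln(3/3) = 0 and its weight vanishes, while 'debt' and 'loan' keep a positive weight.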

**Models**: 

5 different types of machine learning algorithms were implemented:

- __[Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)__


- __[Linear SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)__


- __[Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)__


- __[Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)__


- __[Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)__

We will be covering our top 2 performing models: the Logistic Regression model and the Linear Support Vector Classification model, but we will also discuss the results of all 5 models.

**Evaluation Metrics**: 

The way we will be evaluating the effectiveness of a model is through 3 metrics:

- Accuracy: Correctly labeled predictions over total predictions

- Precision: Correct predictions of a topic over all predictions of that topic

- Recall: Correct predictions of a topic over all posts that actually belong to that topic

For this project, **accuracy** is the most important of the 3 because we are mainly interested in correctly labeled predictions regardless of which are the "positives" and "negatives". If it were a binary classification problem, precision or recall might be the better choice, but this is a multi-class classification problem.
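The three metrics can be computed by hand on a handful of labels, which helps when reading the classification reports later on. The labels below are made up and the helper functions are hypothetical; in the project itself these numbers come from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Per-topic precision and recall, treating `label` as the positive class."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # everything predicted as `label`
    actual = sum(t == label for t in y_true)     # everything that truly is `label`
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Made-up labels for six posts.
y_true = ["debt", "debt", "auto", "other", "other", "taxes"]
y_pred = ["debt", "other", "auto", "other", "debt", "taxes"]

acc = accuracy(y_true, y_pred)
debt_precision, debt_recall = precision_recall(y_true, y_pred, "debt")
print(round(acc, 4), debt_precision, debt_recall)
```

Four of the six predictions are correct, so accuracy is 4/6. For 'debt', one of the two 'debt' predictions is right (precision 0.5) and one of the two true 'debt' posts is found (recall 0.5).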

**Baseline**

Since metrics like accuracy only have meaning relative to some reference point, a baseline should be established as a comparison to gauge how effective each model is at predicting the topic. The baseline for the classifiers was created with scikit-learn's DummyClassifier, using 'stratified' as the strategy, which guesses labels at random in proportion to how often each class appears in the training data.
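It helps to know roughly what score to expect from such a baseline. If labels are guessed at random in proportion to class frequency, and the test set has the same class mix as the training set, the expected accuracy is the sum of the squared class proportions. The sketch below works this out for hypothetical class counts; the function name and the counts are assumptions for illustration.

```python
def expected_stratified_accuracy(class_counts):
    """Expected accuracy of frequency-proportional random guessing."""
    total = sum(class_counts.values())
    return sum((count / total) ** 2 for count in class_counts.values())

# 16 equally sized classes (there are 16 distinct topic names in this project).
balanced = {f"topic_{i}": 100 for i in range(16)}
# A made-up, heavily skewed three-class problem for contrast.
skewed = {"other": 400, "debt": 100, "taxes": 100}

balanced_score = expected_stratified_accuracy(balanced)
skewed_score = expected_stratified_accuracy(skewed)
print(balanced_score, skewed_score)
```

With 16 balanced classes the expected accuracy is only 1/16 = 0.0625, so a baseline accuracy well under 0.1 is what we should anticipate for a problem with this many topics.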

In [83]:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
# Next let's select our features
X = df_eval.clean_text # The cleaned text will be the feature variables
y = df_eval.new_topic # The generated topic labels will be the response variables
topic_labels = y.unique() # List of unique topic labels

# Split the data by using 30% as the testing data 
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, df_eval.index, test_size=0.3, random_state=42)

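# Fit the stratified baseline described above (it ignores the feature values)
dummy = DummyClassifier(strategy='stratified')
dummy.fit(X_train, y_train)

# Print the accuracy 
y_pred = dummy.predict(X_test)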
print('Accuracy: %s' % round(dummy.score(X_test, y_test), 4)) # You can use either the .score method or the accuracy_score function from sklearn.metrics
dummy_cv = cross_val_score(dummy,X,y,cv=10, scoring='accuracy') # Using cross-validation will provide a more accurate score
print('Average 10-Fold Cross-Validation Accuracy Score: %s' % round(np.mean(dummy_cv), 4))
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred, target_names=dummy.classes_))

Accuracy: 0.0783
Average 10-Fold Cross-Validation Accuracy Score: 0.0735

Classification Report:

              precision    recall  f1-score   support

        auto       0.06      0.05      0.06       202
bank account       0.07      0.07      0.07       184
       bills       0.01      0.01      0.01        96
 collections       0.04      0.04      0.04       193
 credit card       0.13      0.12      0.12       298
        debt       0.04      0.05      0.05        80
  employment       0.08      0.08      0.08       162
    expenses       0.07      0.07      0.07       149
     housing       0.07      0.07      0.07       219
   insurance       0.05      0.04      0.04       122
  investment       0.08      0.09      0.08       296
       loans       0.09      0.09      0.09       202
       other       0.12      0.14      0.13       392
      saving       0.05      0.04      0.04       126
      school       0.07      0.07      0.07       204
       taxes       0.06      0.05    

So basically as long as we get any results better than 0.0735 accuracy, 0.08 precision and 0.08 for recall, we will have outperformed the baseline! (sarcasm)

Let's take a look below at the Linear SVC and Logistic Regression results in particular.

**Linear SVC**

In [86]:
# Extract the model with the best parameters
best_svc = gs_svc.best_estimator_
# Predict the labels
y_pred = best_svc.predict(X_test)
# Extract an array of the labels
x = best_svc.classes_
# Obtain the cross-validation scores for 10-folds using accuracy as the scoring metric
svc_cv = cross_val_score(best_svc,X,y,cv=10, scoring='accuracy')

# Print the accuracy
print('Accuracy: {0:.4f}'.format(best_svc.score(X_test, y_test)))
print('Average 10-Fold Cross-Validation Accuracy Score: %s' % round(np.mean(svc_cv), 4))
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred, target_names=x))

Accuracy: 0.7603
Average 10-Fold Cross-Validation Accuracy Score: 0.7585

Classification Report:

              precision    recall  f1-score   support

        auto       0.84      0.81      0.82       202
bank account       0.80      0.83      0.82       184
       bills       0.65      0.62      0.63        96
 collections       0.78      0.76      0.77       193
 credit card       0.89      0.92      0.91       298
        debt       0.75      0.57      0.65        80
  employment       0.80      0.78      0.79       162
    expenses       0.75      0.77      0.76       149
     housing       0.81      0.81      0.81       219
   insurance       0.78      0.75      0.76       122
  investment       0.77      0.89      0.82       296
       loans       0.85      0.82      0.83       202
       other       0.56      0.62      0.59       392
      saving       0.74      0.48      0.59       126
      school       0.70      0.75      0.73       204
       taxes       0.79      0.64    

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Linear%20SVC.png?raw=true)

Using Linear Support Vector Classification resulted in a test accuracy of 0.7603 (0.7585 on average across 10-fold cross-validation), precision of 0.76 and recall of 0.76.

On paper this looks pretty good, especially compared to our baseline. If we take a look at the classification report, it's worth noting that while some topics like credit card and loans had high precision, topics like other and bills did considerably worse. Looking at the confusion matrix we can see why this is the case. There were many instances where 'other' posts were incorrectly labeled as investment, school, etc. On the flip side, there were many non-'other' posts, such as school, saving, and taxes posts, that were incorrectly labeled as 'other'.

This could have been due to the fact that the other topic had many overlapping words and phrases with other topics. Another reason could have been due to the fact that more than 1 topic was classified under 'other', resulting in a mixture of differing words that are not quite coherent when used together.

**Logistic Regression**

In [85]:
best_logreg = gs_logreg.best_estimator_
y_pred = best_logreg.predict(X_test)
x = best_logreg.classes_
logreg_cv = cross_val_score(best_logreg,X,y,cv=10, scoring='accuracy')

print('Accuracy: {0:.4f}'.format(best_logreg.score(X_test, y_test))) 
print('Average 10-Fold Cross-Validation Accuracy Score: %s' % round(np.mean(logreg_cv), 4))
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred, target_names=x))

Accuracy: 0.7656
Average 10-Fold Cross-Validation Accuracy Score: 0.7651

Classification Report:

              precision    recall  f1-score   support

        auto       0.83      0.81      0.82       202
bank account       0.83      0.80      0.81       184
       bills       0.72      0.57      0.64        96
 collections       0.80      0.77      0.78       193
 credit card       0.89      0.92      0.90       298
        debt       0.83      0.55      0.66        80
  employment       0.84      0.80      0.82       162
    expenses       0.78      0.72      0.75       149
     housing       0.82      0.82      0.82       219
   insurance       0.80      0.70      0.75       122
  investment       0.76      0.89      0.82       296
       loans       0.86      0.86      0.86       202
       other       0.54      0.69      0.61       392
      saving       0.78      0.44      0.57       126
      school       0.73      0.75      0.74       204
       taxes       0.84      0.64    

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Logistic%20Regression.png?raw=true)

The Logistic Regression model performed slightly better with a test accuracy of 0.7656, precision of 0.78, and recall of 0.77.

The 'other' topic did even worse in this model with a precision of 0.54 but there are more topics with higher precision.

> #### Model Results

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Results%20Boxplot.png?raw=true)

![s](https://github.com/nysportsfan/Personal_Finance_Subreddit/blob/master/Images/PersonalFinance%20Results%20Bar%20Graph.png?raw=true)

Evidently, the Logistic Regression model was the best both in the mean 10-fold CV (0.7654) and the peak accuracy score (0.7845). However, the Linear SVC model is very close in accuracy as well. Multinomial Naive Bayes surprisingly did not perform that well considering its reputation as being one of the best for text classification purposes. 

When looking at the precision and recall scores for the Logistic Regression model, we can see that many of the topics did very well, with scores in the high 0.7s and above 0.8. With the exception of the 'other' topic (which we will discuss in the limitations section), it seems as though the LDA MALLET model did a fine job of creating the topics and that choosing 20 as the number of topics was the right choice.

## 5. Conclusion

[Return to Table of Contents](#Table-of-Contents)

**Recommendations**

There are a few important takeaways from this project:

1) 

**Limitations of the Model**

One pattern we noticed throughout all the models is that the 'other' topic performed very poorly in both precision and recall. In other words, the model often falsely labeled non-'other' posts as 'other' and also failed to label many 'other' posts as 'other'. Treating 'other' as the positive class, if we had to choose between avoiding a Type II error (a false negative: misclassifying 'other' posts as not 'other') and a Type I error (a false positive: misclassifying non-'other' posts as 'other'), avoiding the former would be preferred. This is because the topic 'other' is a mixture of several topics whereas the other topics have a stricter definition of what kinds of posts and words belong to that category.

For example, if an 'auto' post about car insurance was labeled as 'other', it would not be as damaging as if an 'other' post about a grandmother leaving behind a will was labeled as 'school'. There are also cases where the author of the post does not know what the topic should be; in this situation, they would most likely categorize the post under "other" rather than falsely labeling it as something definitive in nature. In a sense, the 'other' topic could be seen as a wildcard when 

Another problem has less to do with the classification model than with the topic-modeling. There are certain cases where a post was labeled under a topic that does not make sense to a human observer. Because the machine makes its decisions using vectorized inputs, it is unable to accurately classify each post 100% of the time. As a result, there are some exceptions where the classification model actually "corrects" the topic-modeling mistakes and labels some posts with the "right" topic. One explanation for this limitation is that each document (or post) is assigned a mixture of different topics and we simply chose the topic with the highest proportion as the label. This may work well for most posts but there are scenarios where this line of thinking can backfire.

For example, what if a post about college loans had a 34% contribution from 'loans' but 33% contribution from 'school'? In this case, the model would categorize this post under the 'loans' topic but the classification model may predict it as 'school'. 
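One way to spot such borderline posts is to look at the gap between the two largest topic proportions. The sketch below uses the 34% / 33% example from above; the remaining proportions, the 0.05 threshold, and the `top_two_margin` helper are all hypothetical choices for illustration.

```python
def top_two_margin(topic_proportions):
    """Return the winning topic, the runner-up, and the gap between them."""
    ranked = sorted(topic_proportions.items(), key=lambda item: item[1], reverse=True)
    (best, best_p), (second, second_p) = ranked[0], ranked[1]
    return best, second, round(best_p - second_p, 4)

# The college-loan example; 'saving' and 'other' are made-up filler so the
# proportions sum to 1.
post = {"loans": 0.34, "school": 0.33, "saving": 0.20, "other": 0.13}

best, runner_up, margin = top_two_margin(post)
AMBIGUITY_THRESHOLD = 0.05  # hypothetical cut-off, not tuned on any data
is_ambiguous = margin < AMBIGUITY_THRESHOLD
print(best, runner_up, margin, is_ambiguous)
```

A post flagged this way is a candidate for manual review, since a gap of a single percentage point means the 'loans' label is only marginally better supported than 'school'.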

**Final Thoughts**

Topic-modeling is still very much a work-in-progress due to the fact that technology continues to adapt and evolve at a rapid pace. It especially requires an "eye-test" approach to refining and improving the models because text is not as easily interpreted by the machine compared to numbers. Therefore the user will need to verify whether or not the generated topics "work" by looking at several factors such as how much the topics make sense given the type of words that surround the topic, how many documents fit under each topic (if there are too many documents under 1 topic and too few under another, it is practically useless), etc. 

All in all, these results looked good considering the fact that the topics were generated by the machine rather than by humans.

