# Content Analysis on People's Personal Finance Concerns Based on Reddit Data




## Research Questions

What's the most frequent concern of personal finance? Is there a heterogeineity among different groups? Does the topics of these concerns change over time?

Corpora:

* Reddit Articles in subreddit [Personal Finance](https://www.reddit.com/r/personalfinance/)

* Investing subreddit: [Investing](https://www.reddit.com/r/investing/)

* Wall Street Bets subreddit: [Wall Street Bets](https://www.reddit.com/r/wallstreetbets/)


Social Game:

Consumption and investment are two import social indicators in economics. So, I would like to study people's consumption and behavior by their online postings.


Actors:

Most people who post articles on Reddit are young people, many of them are 20-30 (many of them reveal their age in posts) and it's interesting to learn the consumption and investment patterns of these young people.


World:

A large group of anxious young people--we can find students who just got their first job start to consider paying back student loan, buying houses/cars, taking care of aging parents, for the first time in their life. They ask advice from others on online platform to make finance-wise decisions.

What's people's biggest concerns in personal finance? Do they Change over time?

## Why my research important?

In the most widely used fomula in Macroeconomics: Y = C + I + G + NX 
(Total economic output = Consumption + Investment + Government spending + Net Export), consuption and investment are individual activities that constitute of our society. 

A [Federal Reserve survey](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiB6O2fhs3uAhXQXc0KHbeLAXUQFjAAegQIARAC&url=https%3A%2F%2Fwww.federalreserve.gov%2Fpublications%2Ffiles%2F2017-report-economic-well-being-us-households-201805.pdf&usg=AOvVaw33ULJILWvmE0JU8Dweye4R)  finds almost 40% of American adults wouldn't be able to cover a $400 emergency with cash, savings or a credit-card charge that they could quickly pay off. Why do people in the United States, the most powerful country in the world, face this problem? What's the heaviest financial burden on people? Where is the money going? What are the topics that people who seek financial security talks every day? To answer these questions, we can analyze people's posting online.

**The benefits people can get after they learn the results of my study**

My study will report the most common financial burden on people, and the time trend of the changes most-discussed topics. So people can know what bothers us and if the things that bother us change over time.



## My sample

**The rationale behind my proposed sample design**
Collect data from online [financial discussion forums](https://www.doughroller.net/personal-finance/8-awesome-online-forums-personal-finance-investing/): Reddit-Personal Finance, myFICO Forums, YNAB Forums, Morningstar Forums, Reddit–Investing, and Bogleheads Forum. My sample will include the first four datasets.

**Social Game:**
People's income and financial concern.

**Social Actors:**
Online financial websites users: people who post their concern, seek for advice, or share personal experience.

**Its virtues with respect to my research questions:**
People's online discussion is a reflect of their real-life concern

**Limitations:**
Generalization Bias--most users of online platforms are young people who are used to the internet. Middle-age people may not be willing to disclose their financial concerns online.

**Alternatives:**
Other discussion websites.

**Methods to scale up my sample:**
I can boarden my dataset by scale up the time peorid to include more aticles from myFICO Forums, YNAB Forums, Morningstar Forums, Reddit–Investing, Bogleheads Forum, Fat Wallet Forums, and Bigger Pockets Forum.


# Scrape Corpora from Reddit with PRAW

In [7]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
import lucem_illud #pip install git+git://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git
import praw

#All these packages need to be installed from pip
import requests #for http requests
import bs4 #called `beautifulsoup4`, an html parser
import pandas as pd#gives us DataFrames
import docx #reading MS doc files, install as `python-docx`

import re #for regexs
import urllib.parse #For joining urls
import io #for making http requests look like files
import json #For Tumblr API responses
import os.path #For checking if files exist
import os #For making directories

For privacy issue, API key is in json file (which is in .gitignore) because it contains my password and personal tokens and my GitHub repository is public.
I will save the data for later use, so you don't need to scrape the data by yourself. However, if you want to replicate this process, you can create your own Reddit API conveniently by following [this guidance](https://praw.readthedocs.io/en/latest/getting_started/authentication.html).

In [4]:
with open('Reddit_API_info.json') as f:
    api_info = json.load(f)
    f.close()

In [9]:
reddit = praw.Reddit(client_id = api_info['client_id'],
                     client_secret = api_info['client_secret'],
                    user_agent = api_info['user_agent'],
                    username = api_info['username'],
                    password= api_info['password'])

### Scrape from Subreddit 

In [21]:
subred = reddit.subreddit("personalfinance")

# several attribute helps us to sort articles in subreddit
hot = subred.hot(limit = 10000) # sort all the subreddit articles by hot
new = subred.new(limit =10000)
controv = subred.controversial(limit = 10000)
top = subred.top(limit=10000)
gilded = subred.gilded(limit=10000)

In [22]:
type(hot)

praw.models.listing.generator.ListingGenerator

As shown above, this can help us get a generator object, then we can use a loop to retrieve the data. However, Reddit has a limitation that it will only have at most ~1000 posts in each category (which is far more than a real person's reading ability) But it's not enough for computational analysis, especially for dynamic topic analysis in the later part. There are archived Reddit data on Google's Big Query platform, but it has stopped updated since 2019. So I decide to study the latest reddit post for the most part of this project, and use the archived historical reddit data in the dynamic topic modeling analysis.

In [23]:
pf_hot = []
for i in hot:
    pf_hot.append({'title': i.title,
                        'text': i.selftext,
                        'url':i.url,
                        'created_utc':i.created_utc,
                        'score':i.score,
                        'up':i.ups,
                        'down':i.downs
                       })
    print(i.title)
pf_hot = pd.DataFrame(pf_hot)

Coronavirus Megathread Update (January, 2021)
Weekday Help and Victory Thread for the week of March 08, 2021
Is saving money worth missing out on the "college experience"
Don't Pay H&R Block From Your Refund
Question about a utility company screwing up, putting me into collections, realizing they screwed up and what happens if I agree to pay a small amount to end it.
Job salary negotiations
Developer of my condo owns just over 25% of my building. I am worried that will make it hard to sell my unit
Moved out of state in 2018 but realized I've been paying California taxes for the last two years. Can I ask for a refund?
I just received my residency in the US.
My mom is in training to work for Primerica and I'm worried it's a scam
My MIL wants to add us on the deed to her house
Loan Officer trying to change our interest rate after we signed off on the 'locked in' rate
At what rate is collision insurance worth it for a 2007 car with ~130k miles?
Is paying mortgage down early too good to be 

Missing my tax form!
$6k better in IRA or Tues a house down payment
What to do with vested teacher pension? (CA -> WA)
Just started a Charles Schwab Roth IRA account, how do I go about investing in Target Date Fund? Looking for a real “set it and forget it” kind of vibe.
How to stop stressing about money?
Advice needed on Universal Life Insurance that I was given as a kid
How to manage/compare for purchases
How Best to Handle Investment Property
I want my home office deduction back, can this accomplish it?
Stuck in a toxic job while expecting a baby
How do you navigate bringing a potential job offer to your current employer asking them to meet/exceed it so you can stay? Interviewing today, help!!!
Lenders for Rental / Investment Property?
Diversification of separate investing accounts?
Refinancing home - Missed Opportunity - Please help.
Am I eligible for the Recovery Rebate Credit if I qualified as a dependent in 2020, but no one claimed me as one?
Restitution for being injured at wor

Investing HSA Funds
Question about capital one and leasing a car
For a Roth IRA, is it better to invest 6k on January 1st or incrementally throughout the year?
Transfermoney from Russia
Question about my first auto loan.
Process of renting a home for college (Asking for help)
FreeTaxUsa Bug: NIIT Tax not including state income tax attributable to net investment income
Pay off debt or put it in high yield savings
Ideal length of elimination period for long-term disability?
Looking for feedback on my overall financial picture
How much money can my brother gift me?
Car Loan Advice
Debt Collectors
Fidelity 401K Advice
Feeling rudderless
The Verge loan experience
How to Manage Credit Account Once Paid
Help getting my account number
Can I cash a check at a bank with a passport that has been holepunched (2 holes) but is still valid? (not expired)
Dental insurance: copays
Banks and Credit Unions that allow multiple checking and savings accounts
If I received a check for a small amount of retir

Where to place equity $$$ while waiting to purchase a house?
Spousal RSP Contributions with TD Webbroker
Best Credit Card for investment properties?
Best way to get ahold of someone at BOA that will actually help me??
Health Insurance not covering "consulting radiologist" on ER visit. What?
Best Home Renovation Option
529 vs Roth IRA
Switching Jobs to a Different Retirement Plan, Need advice and have questions
Never followed up on my stock options bump from last year, what are the possible scenarios?
Best place for a Checking Account?
Hospital threatening to send me to collection over a charge I never agreed to
Doctor filed claim late, now I'm stuck with the bill?
Vehicle lease problem
Recovery rebate credit child payments
When Can You Mix W2 Income and Self-Employed Income?
In a situation with a financed car
Are There Any Tax Rules Limited To Only Self-Employed Individuals?
Are work computers considered taxable income or anything like that?
20 y/o college student, switching IRA from B

Legal Order Debit - Contact Franchise Tax Board?
Will becoming an authorized user temporarily ding my credit?
“Smarter” emergency fund – do I need it all in cash?
Question about form 1099-R (Traditional to Roth conversion)
Ways to support my Autistic Younger Brother for life?
Emergency Department (Fast Track) Medical Bill
Accounts with 1+ free outdoing domestic wires /month?
Will get married in a community property state (US), are there any advantages to keeping assets (bank accounts) before marriage separate?
Savings/investing options for an international student on an F1 visa in the US
BofA underwriter wants HOA special assessment documents...realtor and HOA dragging their feet...
I am 31, worked as a waiter, now pursuing a career in photography, getting kicked out from my flat and have no money, what to do?
Advice on whether or not I should leave my car in Mississippi.
Invalid Collections notice, how to handle
Tax filing questions
New Home loan, issues with builder.
My Dad got a job

In [24]:
len(pf_hot)

894

We only want got 894 posts here, since Reddit restricts it to less than 1000. But we can still analysze it! We can accquire posts in this subreddit via archive later.

In [25]:
subred = reddit.subreddit("wallstreetbets")

# several attribute helps us to sort articles in subreddit
hot = subred.hot(limit = 10000) # sort all the subreddit articles by hot
new = subred.new(limit =10000)
controv = subred.controversial(limit = 10000)
top = subred.top(limit=10000)
gilded = subred.gilded(limit=10000)

wsb_hot = []

for i in hot:
    wsb_hot.append({'title': i.title,
                        'text': i.selftext,
                        'url':i.url,
                        'created_utc':i.created_utc,
                        'score':i.score,
                        'up':i.ups,
                        'down':i.downs
                       })
    print(i.title)
wsb_hot = pd.DataFrame(wsb_hot)

What Are Your Moves Tomorrow, March 10, 2021
WSB Rules - Please Read Before Posting
Why I'm selling GME
OUR SAVIOR KEEPS MAKING THE NEWS! DFV INSPIRES US ALL CUZ WE JUST LIKE THE STOCK! 🚀🙌🏼🐈
Bought GME in 2014. Held at $4. Held at $400. Held at $40. Still in.
You heard the man. YES
Dimitri finds out about GME
GME Hype Trailer
True Short interest in GEE EM EE could be anywhere from 250% to 967% of the float. Yes short sellers are that fucking retarded.
Can’t believe it’s already been 10 years since January 27th 2021!
I see your LED desk, I give you my lighting equipment moving to the GME stonk
GME Megathread for March 09, 2021
Nintendo goes all-in on GME: 3 new amiibo will be GameStop exclusives
The Big Squeeze - Full Trailer [HD]
All In! 🚀🚀🌔🌔
😂😂
You guys actually came back for me!! Lets bring this rocket to $1000 guys!
Get ready boys.. we’re coming back fo y’all , no retard left behind!
🚀🚀 $GME Premarket be like 🚀🚀
This guy f***s
GME and DFV mentioned on FOX Business!
Vibe at work whil

Here you go APES & RETARDS, I might not be a super ape with shit load of 💵s but a small and retarded 🦍 with $7k joined the party this morning.
Hold strong apes!!
On the Launch 🚀pad! HOLDING STEADY AND STRONG!!
HODL 🙌💎🙌 I don’t know much about call options but I’m buying ITM call options for AMC and GME! I like these freaking stocks. Sue me 🤷🏻‍♀️
PT for Friday 3/12/2021 $388
The zoo is open, and the cages too! We're launching thanks to you Diamond Handed, Smooth Brained, Steel Balled Apes and Aperinas out there. Keep strong and enjoy the ride
Why brokers are not accepting limit orders at prices far above last traded price.
Another $AMC YOLO trade! Bought 200 call contracts as well as 1200 shares! Lets get it!!
GameStop Stock Is Flying Again. The Transformation Is on the Way.
The power of hodling. IT AINT OVER TILL ITS OVER 💎🙌
$12k GME yolo -> 100%+ in a week
Was that the dip? Bought 2 more shares of AMC to HOLD just in case. 800 seemed boring anyway. Hold. Hold. Hold.
I Can't Stand $GME

Really praying for you Apes who made $800 calls
UNFI YOLO Pre-Earnings Update - March 9, 2021 Last Day before Q2 Earnings
Averaged down from 179 per contract. Thanks apes. Can we hit 320 by end of this week?
This is why you buy the dip, f***ts - Part 2 (only regret is not YOLOing enough)
Crayon say: Mid-May big flop or pop
SPY options are my new fav money maker
If you could go back in time to 03Apr20...
When is $NOK time I’m getting decimated on this side :/
I love you Apes
Posting my WSB-inspired portfolio gains. GME outperforms everything so far😅
Half my net worth in GME calls; gonna make up for a lifetime of poor money habits these next ten days
How much are hedgies losing daily with all of this GME madness?
Round 2
LETS GOOOO ROBLOX
GME YOLO Update. Y’all came back for me, so I bought 15 more shares.
The Fed’s dual mandates and what that means for markets (and how the potentially negative effects won’t matter any time soon)
Welp here goes nothing. $6,000 on $500.00 GME 3/12 calls
B

Small fry compared to the big dogs, but I’m up nearly 60% thanks to my fellow apes! 🦍
Up 30%... good day for small fry
Don’t forget to pick me up :)
$DASH puts $85k gain porn for the day. I tried to tell y’all. Should be just as fun with their lockup expiry due tomorrow.
Buy the dip and HODL we going to the moon
$AMD YOLO Update - March 8th
Sold all my ETFs this morning. Don’t Fomo, just YOLO 🚀🚀
going for 162,000,000$ or bust🚀🚀🚀🚀🚀🚀 🤲💎
My life is a dumpster fire of FD's
It’s not my entire portfolio or my biggest gainer but it’s fun! Don’t even want to think about the capital gains.
🚀🚀🚀Latest GME Short Interest data from @Ihors3 at S3 Partners.......$GME🚀🚀🚀
$ASO YOLO Update
Hold apes hold 💎🙌
I got a sign from above....all the DD I needed.
May the force be with you. AMC management is at war and are telling us that they hear us with the vote date.
$uwmc most naked short
AER making me 540k today! GE AER deal is 🚀🚀🚀
An Obligatory GME Post
Betting LLY Has Found a Treatment for Alzheimer's
You

In [26]:
len(wsb_hot)

624

In [33]:
subred = reddit.subreddit("investing")

# several attribute helps us to sort articles in subreddit
hot = subred.hot(limit = 10000) # sort all the subreddit articles by hot
new = subred.new(limit =10000)
controv = subred.controversial(limit = 10000)
top = subred.top(limit=10000)
gilded = subred.gilded(limit=10000)

invt_hot = []

for i in hot:
    invt_hot.append({'title': i.title,
                        'text': i.selftext,
                        'url':i.url,
                        'created_utc':i.created_utc,
                        'score':i.score,
                        'up':i.ups,
                        'down':i.downs
                       })
    print(i.title)
invt_hot = pd.DataFrame(wsb_hot)

Daily General Discussion and spitballin thread
r/investing's Discord can be found in the sidebar, feel free to drop in and say hello!
The CPI report didn't just reduce fears of inflation - it's actually a massive bullish signal
Live thread for the 10-year bond auction
Cybersecurity Sector Play
Walmart(WMT): The upcoming $1T market cap stock
Spotify (SPOT) DD. Established company on the brink of profitability, trading at a low P/S relative to the industry.
The case for Palantir Technologies!
Here's How Much $10,000 Invested in Berkshire Hathaway Stock in 1964 Is Worth Now
My algorithm tracks chatter and sentiment of stocks on social media. it has picked up increased sentiment and chatter around AAPL. Here is my research.
Stimulus bill treasury bond sale
ROBLOX IPO $RBLX In Depth Analysis
GME Thread - Wednesday March 10th, 2021
HEAR Turtle Beach Valuation
Roth Capital Analyst rationalizes $28 PT for KMPH (KemPharm) after FDA approval of new ADHD drug
Coupang (The Amazon of Korea) - Full 

In [34]:
len(invt_hot)

624

In [35]:
pf_hot.to_csv('data\pf_hot.csv')
wsb_hot.to_csv('data\wsb_hot.csv')
invt_hot.to_csv('data\invt_hot.csv')

# Counting Words & Phrases

## Read the data

In [37]:
pf_df = pd.read_csv('data\pf_hot.csv', index_col = 0)
wsb_df = pd.read_csv('data\wsb_hot.csv', index_col = 0)
invt_df = pd.read_csv('data\invt_hot.csv', index_col = 0)

In [32]:
import nltk

In [None]:
def word_tokenize(word_list):
    tokenized = []
    # pass word list through language model.
    doc = nlp(word_list)
    for token in doc:
        if not token.is_punct and len(token.text.strip()) > 0:
            tokenized.append(token.text)
    return tokenized
