# Reddit Posts Analysis with Natural Language Processing

# Part 1

## Executive Summary

Pet ownership is often portrayed positively in mainstream media, especially during the COVID-19 pandemic. However, this portrayal can be overstated and biased, creating unrealistic expectations for pet owners. In 2022, a US-based survey revealed that pets were a major source of stress for a majority of owners during the pandemic ([source](https://www.avma.org/news/study-explores-pandemic-specific-challenges-pet-ownership)). Hence, a follow-up analysis is needed to understand pet owners' pain points so that the right measures can be taken to reduce stress and improve their experience.

In this project, we assume the role of pro bono data analysts for an animal shelter to address two questions:
1. What are the top concerns for pet owners, specifically dog and cat owners?
2. Which classification model, combined with natural language processing, best predicts whether a problem is cat- or dog-specific?

To answer these questions, we'll mine user-generated content from Reddit and analyze it using natural language processing techniques. We'll then provide the animal shelter with recommendations on how it can better support potential pet owners and ensure a positive experience for both owners and pets.


*Content:*

1. [Executive Summary](#Executive-Summary)
2. [Data Collection & Methodology](#Data-Collection-&-Methodology)
3. [Web Scraping](#Web-Scraping)

*For the remaining content, please refer to the [next notebook](Part_2_Reddit_Analysis_NLP.ipynb).*

## Data Collection & Methodology

### Data Collection

We collected 6,000 posts from two popular subreddits:

1. [r/CatAdvice](https://www.reddit.com/r/CatAdvice/): This subreddit has over 118,000* cat enthusiasts globally, who actively discuss cat-care topics and exchange tips on caring for their cats. A total of 3,000 posts were scraped from this subreddit, with 72 potential feature variables.
2. [r/DogAdvice](https://www.reddit.com/r/DogAdvice/): With over 65,900* members globally, this subreddit is dedicated to discussions on dog training, health, nutrition and other dog-care topics. A total of 3,000 posts were scraped from this subreddit, with 80 potential feature variables.

*Figures are as of November 2022.

### Data Dictionary

After cleaning the data, we narrowed the dataset down to 2,746 posts from r/CatAdvice and 1,353 posts from r/DogAdvice.

From the 72 to 80 variables pulled, we also narrowed the data down to a few key features pertinent to our investigation:

| Selected Feature | Type    | Dataset                     | Description                                         |
|------------------|---------|-----------------------------|-----------------------------------------------------|
| subreddit        | object  | r/CatAdvice and r/DogAdvice | Subreddit the post belongs to                       |
| title            | object  | r/CatAdvice and r/DogAdvice | Title of the post                                   |
| selftext         | object  | r/CatAdvice and r/DogAdvice | Body text of the post                               |
| author           | object  | r/CatAdvice and r/DogAdvice | Username of the post's author                       |
| created_utc      | integer | r/CatAdvice and r/DogAdvice | Time the post was created, as a Unix timestamp (UTC) |
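
Because `created_utc` is stored as an integer, it helps to convert it before any time-based analysis. The following is a minimal sketch, assuming the value is a Unix timestamp in seconds, as Reddit's API returns it. The helper name `to_utc_datetime` and the sample value are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

# Illustrative value only: 1667260800 is 2022-11-01 00:00:00 UTC
print(to_utc_datetime(1667260800).isoformat())
```

On a dataframe, `pd.to_datetime(df['created_utc'], unit='s', utc=True)` performs the same conversion column-wise.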

### Methodology

The subreddit posts were processed and analysed through the following steps:

1. **Web Scraping**: Using the Pushshift API, we requested up to 100 posts per subreddit for each of 30 daily cutoffs, giving 3,000 posts per subreddit.
2. **Data Cleaning**: We assessed the data by checking the relevance of the variables, removing outliers, and handling missing values in the fields that were key to the investigation.
3. **Exploratory Data Analysis**: We visualized the cleaned data through a series of graphs and plots to better understand the dataset and identify potential keywords for the modeling process.
4. **Data Modeling & Evaluation**: Following the selection of variables, we processed the text data further and fitted these models: Logistic Regression, Naive Bayes and XXX. The best model is then used to identify the keywords that best distinguish cat issues from dog issues.
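
The data-cleaning step can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's actual cleaning code. It assumes each post is a dict with `id` and `selftext` keys, and that posts with an empty body or Reddit's `[removed]` / `[deleted]` placeholders are the ones to drop. The function name `clean_posts` and the sample records are hypothetical.

```python
PLACEHOLDERS = {"", "[removed]", "[deleted]"}  # assumed markers of unusable bodies

def clean_posts(posts):
    """Drop duplicate posts (by id) and posts without a usable body."""
    seen_ids = set()
    cleaned = []
    for post in posts:
        body = (post.get("selftext") or "").strip()
        if body in PLACEHOLDERS:
            continue  # missing, removed, or deleted body
        if post["id"] in seen_ids:
            continue  # same post pulled in more than one request
        seen_ids.add(post["id"])
        cleaned.append(post)
    return cleaned

# Hypothetical records for illustration
sample = [
    {"id": "a1", "selftext": "My cat keeps scratching the sofa."},
    {"id": "a1", "selftext": "My cat keeps scratching the sofa."},  # duplicate
    {"id": "b2", "selftext": "[removed]"},
    {"id": "c3", "selftext": None},
    {"id": "d4", "selftext": "Puppy will not stop barking at night."},
]
print([p["id"] for p in clean_posts(sample)])
```

Filtering before deduplicating means a removed copy of a post never blocks a usable copy with the same `id`.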

## Web Scraping

### Import packages

In [1]:
import pandas as pd
import time
import random
import requests

### Set up requests library

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Pull up to 100 posts created before each of 30 daily cutoffs (0 to 29 days ago)
# from r/CatAdvice and r/DogAdvice and save into dataframes

cat_frames = []
dog_frames = []

for i in range(30): # pull from API 30 times
    before = f"{i}d" # cutoff: only posts created more than i days ago
    cat_dict = requests.get(url, params={'subreddit': 'CatAdvice', 'size': 100, 'before': before}).json()
    dog_dict = requests.get(url, params={'subreddit': 'DogAdvice', 'size': 100, 'before': before}).json()
    print(f"Iteration {i} completed")

    cat_frames.append(pd.DataFrame(cat_dict['data']))
    dog_frames.append(pd.DataFrame(dog_dict['data']))

    # Pause 5 to 10 seconds between iterations to avoid overloading the API
    time.sleep(random.uniform(5, 10))

# Combine the per-request frames once, after the loop
cat_df = pd.concat(cat_frames, ignore_index=True, axis=0)
dog_df = pd.concat(dog_frames, ignore_index=True, axis=0)

Iteration 0 completed
Iteration 1 completed
Iteration 2 completed
Iteration 3 completed
Iteration 4 completed
Iteration 5 completed
Iteration 6 completed
Iteration 7 completed
Iteration 8 completed
Iteration 9 completed
Iteration 10 completed
Iteration 11 completed
Iteration 12 completed
Iteration 13 completed
Iteration 14 completed
Iteration 15 completed
Iteration 16 completed
Iteration 17 completed
Iteration 18 completed
Iteration 19 completed
Iteration 20 completed
Iteration 21 completed
Iteration 22 completed
Iteration 23 completed
Iteration 24 completed
Iteration 25 completed
Iteration 26 completed
Iteration 27 completed
Iteration 28 completed
Iteration 29 completed
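
The loop above assumes every request succeeds and returns a `data` key. A slightly more defensive variant is sketched below. It is an illustration, not the code that produced the output above. The helper `fetch_posts` and its `get_json` parameter are hypothetical: `get_json` is any callable that takes the query parameters and returns the parsed JSON, which keeps the logic runnable without network access.

```python
def fetch_posts(get_json, subreddit, days=30, size=100):
    """Collect posts for `subreddit` across `days` daily cutoffs.

    Skips responses without a 'data' list instead of raising KeyError.
    """
    posts = []
    for day in range(days):
        params = {"subreddit": subreddit, "size": size, "before": f"{day}d"}
        payload = get_json(params)
        batch = payload.get("data") if isinstance(payload, dict) else None
        if not batch:
            print(f"Iteration {day}: no data returned, skipping")
            continue
        posts.extend(batch)
    return posts

# Stand-in for the real API so the sketch runs offline (hypothetical data)
def fake_get_json(params):
    if params["before"] == "1d":
        return {"error": "timeout"}  # simulate a failed request
    return {"data": [{"id": f"{params['subreddit']}-{params['before']}"}]}

posts = fetch_posts(fake_get_json, "CatAdvice", days=3)
print(len(posts))
```

With the real endpoint, `get_json` could be `lambda params: requests.get(url, params=params).json()`, with the `time.sleep` pause kept between iterations as in the original loop.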


In [4]:
# Check for the size of data pulled from CatAdvice

cat_df.shape

(3000, 72)

In [5]:
# Check for the size of data pulled from DogAdvice

dog_df.shape

(3000, 80)
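
The two dataframes have different column counts (72 vs 80), presumably because some fields appear in only some of the returned posts. A quick way to see which columns differ is a set comparison. The sketch below uses hypothetical column lists and a hypothetical helper `column_diff`; `cat_df.columns` and `dog_df.columns` can be passed in instead.

```python
def column_diff(left_cols, right_cols):
    """Return (only_in_left, only_in_right, shared) as sorted lists."""
    left, right = set(left_cols), set(right_cols)
    return sorted(left - right), sorted(right - left), sorted(left & right)

# Hypothetical column names for illustration
cat_cols = ["subreddit", "title", "selftext", "author", "created_utc"]
dog_cols = cat_cols + ["thumbnail_height", "thumbnail_width"]

only_cat, only_dog, shared = column_diff(cat_cols, dog_cols)
print(only_dog)
```

Restricting both dataframes to the `shared` columns before combining them avoids columns that are empty for one subreddit.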

In [6]:
# Save CatAdvice raw data into csv

cat_df.to_csv('../datasets/cat_advice_raw.csv', index=False)

In [7]:
# Save DogAdvice raw data into csv

dog_df.to_csv('../datasets/dog_advice_raw.csv', index=False)