# Gift Recommender Engine: Reddit Data Collection

This notebook summarizes the steps I took to collect Reddit data to train and test a topic classifier, which will later predict categories from tweets. Each subreddit gathers many posts about a single topic, which makes Reddit a convenient source of labeled training data. I chose 16 categories and researched the top subreddits related to each one. The categories and subreddits I used are listed below.

1. <b> Electronics/Gadgets </b>: r/tech, r/gadgets, r/techsupport, r/apple, r/android, r/learnprogramming
2. <b> Sports </b>: r/sports, r/nba, r/soccer, r/nfl, r/baseball, r/hockey
3. <b> Gamers </b>: r/gaming, r/games, r/nintendoswitch, r/ps4, r/xboxone, r/pcgaming
4. <b> Animals/Nature </b>: r/natureisfuckinglit, r/nature, r/gardening, r/sustainability, r/pets, r/animals
5. <b> Travel </b>: r/shoestring, r/travel, r/wanderlust, r/solotravel, r/camping, r/hiking
6. <b> Art/Photography </b>: r/art, r/sketchpad, r/arttools, r/photography, r/photomarket, r/artfundamentals
7. <b> Music </b>: r/vinyl, r/music, r/electronicmusic, r/hiphopheads, r/indieheads, r/aves
8. <b> Books </b>: r/books, r/booksuggestions, r/bookclub, r/writing, r/bookexchange, r/audiobooks
9. <b> Movies/TV-Shows </b>: r/actors, r/moviesuggestions, r/movieclub, r/television, r/realitytv, r/oscars
10. <b> Food </b>: r/cooking, r/recipes, r/wewantplates, r/culinaryplating, r/food, r/foodporn
11. <b> Alcohol </b>: r/alcohol, r/beer, r/wine, r/liquor, r/tequila, r/vodka
12. <b> Coffee </b>: r/starbucks, r/coffee, r/barista, r/coffeelovers, r/coffeestations, r/cafe
13. <b> Self-Care </b>: r/zenhabits, r/skincareaddiction, r/meditation, r/yessleep, r/massage, r/quoteporn
14. <b> Work/Education/Productivity </b>: r/productivity, r/workspace, r/jobs, r/worklifebalance, r/iwanttolearn
15. <b> Household/Decor </b>: r/roomporn, r/homedecorating, r/decor, r/apartmentdesign, r/designmyroom
16. <b> Business </b>: r/personalfinance, r/investing, r/cryptocurrency, r/bitcoin, r/entrepreneur


## Import Libraries

I will use the PRAW library to scrape post titles and body text from subreddits. The API keys and IDs are imported from a separate `.py` file so that credentials stay out of the notebook.

In [2]:
import numpy as np
import pandas as pd
import praw
from keys import Keys
import warnings
warnings.filterwarnings('ignore')

## Data Collection: Scraping Reddit Posts

In [3]:
key = Keys()
client_id = key.client_id
secret_key = key.secret_key
user_agent = key.user_agent
user_name = key.user_name
password = key.password

class Reddit:
    
    '''
    Class used to scrape hot posts from specified subreddits.
    '''
    
    def __init__(self, username, password, client_id, secret_key, user_agent):
        
        '''
        -Initialize username, password, client-id, api key, and user agent
        -Generate Reddit instance using PRAW
        -Setup Dataframe to store scraped Reddit information
        -Create list to store comments 
        '''
        
        self.username = username
        self.password = password
        self.client_id = client_id
        self.secret_key = secret_key
        self.user_agent = user_agent
        
        self.reddit = praw.Reddit(username=self.username, password=self.password, client_id=self.client_id,
                                 client_secret= self.secret_key, user_agent = self.user_agent)
        
        self.posts = pd.DataFrame(columns=['title', 'score', 'id', 'subreddit', 'url',
                                     'num_comments', 'body', 'created', 'category'])
        
        self.comments = list()
        
    def get_posts(self, category, subreddit_list):
        
        '''
        Input: 
            - category: name of category/topic (ex. electronics, nature, travel, etc.)
            - subreddit_list: list of subreddits
        Output:
            - stores title, score, id, subreddit name, url, number of comments, text, and date of post from 
            specified subreddit into self.posts Dataframe
        '''
        
        posts = list()
        
        for sub in subreddit_list:
            try:
                # PRAW fetches lazily, so request errors surface while iterating
                for post in self.reddit.subreddit(sub).hot(limit=500):
                    posts.append([post.title, post.score, post.id, post.subreddit,
                                 post.url, post.num_comments, post.selftext, post.created, category])
            except Exception as e:
                print(f'Error in subreddit search for r/{sub}: {e}')
        
        # Build the frame once, after the loop, so earlier subreddits are not duplicated
        df = pd.DataFrame(posts, columns=self.posts.columns)
        self.posts = pd.concat([self.posts, df], ignore_index=True)
    
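    def get_comments(self, df):
        
        '''
        Adds a 'comments' column holding one list of comment bodies per post in df
        '''
        
        self.comments = list()
        for _, row in df.iterrows():
            submission = self.reddit.submission(id=row['id'])
            # Expand every "load more comments" stub (slow on large threads)
            submission.comments.replace_more(limit=None)
            # One list per post keeps the column the same length as df
            self.comments.append([comment.body for comment in submission.comments.list()])
                
        df['comments'] = self.comments
        return df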

### Constructing Training Dataset

In [8]:
# 1. Electronics and Gadgets

elec_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
elec_reddit.get_posts('Electronics/Gadgets', ['tech', 'gadgets', 'techsupport', 'apple', 'android', 'learnprogramming'])
elec_df = elec_reddit.posts

In [9]:
# 2. Sports

sports_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
sports_reddit.get_posts('Sports', ['sports', 'nba', 'soccer', 'nfl', 'baseball', 'hockey'])
sorts_df = sports_reddit.posts

In [10]:
# 3. Gaming

games_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
games_reddit.get_posts('Gamers', ['gaming', 'games', 'nintendoswitch', 'ps4', 'xboxone', 'pcgaming'])
games_df = games_reddit.posts

In [11]:
# 4. Animals/Nature

nature_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
nature_reddit.get_posts('Nature', ['natureisfuckinglit', 'nature', 'gardening', 'sustainability', 'pets', 'animals'])
nature_df = nature_reddit.posts

In [12]:
# 5. Travel

travel_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
travel_reddit.get_posts('Travel', ['shoestring', 'travel', 'wanderlust', 'solotravel', 'camping', 'hiking'])
travel_df = travel_reddit.posts

In [13]:
# 6. Art/Photography

art_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
art_reddit.get_posts('Art', ['art', 'sketchpad', 'arttools', 'photography', 'photomarket', 'artfundamentals'])
art_df = art_reddit.posts

In [14]:
# 7. Music

music_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
music_reddit.get_posts('Music', ['vinyl', 'music', 'electronicmusic', 'hiphopheads', 'indieheads', 'aves'])
music_df = music_reddit.posts

In [15]:
# 8. Books

books_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
books_reddit.get_posts('Books', ['books', 'booksuggestions', 'bookclub', 'writing', 'bookexchange', 'audiobooks'])
books_df = books_reddit.posts

In [16]:
# 9. Movies/TV-Show

movies_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
movies_reddit.get_posts('Movies', ['actors', 'moviesuggestions', 'movieclub', 'television', 'realitytv', 'oscars'])
movies_df = movies_reddit.posts

In [17]:
# 10. Food

food_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
food_reddit.get_posts('Food', ['cooking', 'recipes', 'wewantplates', 'culinaryplating', 'food', 'foodporn'])
food_df = food_reddit.posts

In [18]:
# 11. Alcohol

alcohol_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
alcohol_reddit.get_posts('Alcohol', ['alcohol', 'beer', 'wine', 'liquor', 'tequila', 'vodka'])
alcohol_df = alcohol_reddit.posts

In [20]:
# 12. Coffee

coffee_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
coffee_reddit.get_posts('Coffee', ['starbucks', 'coffee', 'barista', 'coffeelovers', 'coffeestations', 'cafe'])
coffee_df = coffee_reddit.posts

In [21]:
# 13. Self-Care

care_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
care_reddit.get_posts('Self-care', ['zenhabits', 'skincareaddiction', 'meditation', 'yessleep', 'massage', 'quoteporn'])
care_df = care_reddit.posts

In [22]:
# 14. Work/Education/Productivity

work_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
work_reddit.get_posts('Work', ['productivity', 'workspace', 'jobs', 'worklifebalance', 'iwanttolearn'])
work_df = work_reddit.posts

In [23]:
# 15. Household

household_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
household_reddit.get_posts('Household', ['roomporn', 'homedecorating', 'decor', 'apartmentdesign', 'designmyroom'])
household_df = household_reddit.posts

In [24]:
# 16. Business

bus_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
bus_reddit.get_posts('Business', ['personalfinance', 'investing', 'cryptocurrency', 'bitcoin', 'entrepreneur'])
bus_df = bus_reddit.posts
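
The sixteen cells above repeat the same three lines with different arguments. One possible way to condense them is to keep the categories in a dictionary and loop over it. The sketch below shows the idea with only two of the categories; `collect`, `fake_fetch`, `CATEGORIES`, and `COLUMNS` are hypothetical names, and `fake_fetch` is a stand-in that returns dummy rows so the example runs without Reddit credentials.

```python
import pandas as pd

COLUMNS = ['title', 'score', 'id', 'subreddit', 'url',
           'num_comments', 'body', 'created', 'category']

# Two of the categories from the list at the top; the others follow the same pattern.
CATEGORIES = {
    'Sports': ['sports', 'nba', 'soccer', 'nfl', 'baseball', 'hockey'],
    'Travel': ['shoestring', 'travel', 'wanderlust', 'solotravel', 'camping', 'hiking'],
}


def collect(categories, fetch):
    """Call fetch(category, subreddits) once per category and stack the results."""
    frames = [fetch(category, subs) for category, subs in categories.items()]
    return pd.concat(frames, ignore_index=True)


def fake_fetch(category, subreddit_list):
    # Offline stand-in for the scraper: one dummy row per subreddit.
    rows = [[f'post from r/{sub}', 0, f'{sub}-0', sub, '', 0, '', 0.0, category]
            for sub in subreddit_list]
    return pd.DataFrame(rows, columns=COLUMNS)


df_all = collect(CATEGORIES, fake_fetch)
print(df_all[['subreddit', 'category']])
```

With the `Reddit` class from this notebook, `fetch` could be a small function that creates an instance, calls `get_posts(category, subreddit_list)`, and returns the instance's `posts` attribute.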

In [35]:
# Merge Dataframes

df = pd.concat([elec_df, sorts_df, games_df, nature_df,
               travel_df, art_df, music_df, books_df,
               movies_df, food_df, alcohol_df, coffee_df,
               care_df, work_df, household_df, bus_df])

df.to_csv('datasets/reddit-categories2.csv')
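
Before training on the merged file, it may be worth checking for repeated posts and for uneven category sizes. The sketch below runs those two checks on a small made-up frame (`merged` is a hypothetical stand-in with only the columns the checks need), so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the merged frame; the rows are invented for illustration.
merged = pd.DataFrame({
    'id': ['a1', 'a2', 'a2', 'b1'],
    'title': ['New phone', 'Game recap', 'Game recap', 'Trail tips'],
    'category': ['Electronics/Gadgets', 'Sports', 'Sports', 'Travel'],
})

# Drop rows that share a post id, then renumber the index from 0.
deduped = merged.drop_duplicates(subset='id').reset_index(drop=True)

# Count posts per category to spot under-represented classes.
counts = deduped['category'].value_counts()
print(counts)
```

Note that `pd.concat` keeps each input frame's original index unless `ignore_index=True` is passed, and `to_csv` writes that index as an extra unnamed column unless `index=False` is passed.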

### Constructing Testing Dataset

In [4]:
# 1. Electronics

elec_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
elec_reddit.get_posts('Electronics/Gadgets', ['robotics'])
elec_df = elec_reddit.posts

In [5]:
# 2. Travel

travel_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
travel_reddit.get_posts('Travel', ['backpacking'])
travel_df = travel_reddit.posts

In [6]:
# 3. Sports 

sports_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
sports_reddit.get_posts('Sports', ['sportsbook'])
sorts_df = sports_reddit.posts

In [7]:
# 4. Music

music_reddit = Reddit(user_name, password, client_id, secret_key, user_agent)
music_reddit.get_posts('Music', ['musicsuggestions'])
music_df = music_reddit.posts

In [8]:
df = pd.concat([elec_df, travel_df, sorts_df, music_df])

df.to_csv('datasets/reddit-test-data.csv')
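
The test set draws on subreddits that were not used for training (r/robotics, r/backpacking, r/sportsbook, r/musicsuggestions). A quick way to confirm that the two files stay separate is to compare post ids and subreddit names. The sketch below does this on two small made-up frames; `train` and `test` are hypothetical stand-ins for the two CSV files.

```python
import pandas as pd

# Made-up rows standing in for the training and testing files.
train = pd.DataFrame({'id': ['a1', 'a2', 'b1'],
                      'subreddit': ['tech', 'nba', 'travel']})
test = pd.DataFrame({'id': ['c1', 'c2'],
                     'subreddit': ['robotics', 'backpacking']})

# Post ids that appear in both sets.
shared_ids = set(train['id']) & set(test['id'])

# Subreddit names that appear in both sets, compared case-insensitively.
train_subs = set(train['subreddit'].astype(str).str.lower())
test_subs = set(test['subreddit'].astype(str).str.lower())
shared_subs = train_subs & test_subs

print('shared ids:', shared_ids)
print('shared subreddits:', shared_subs)
```

Empty sets here only show that ids and subreddit names do not overlap; a crossposted submission has its own id, so identical text could still appear in both files.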