# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP

Reddit is an online platform for sharing news, content, and discussions. It features user-created sections called 'subreddits' that cater to a wide range of topics and interests. Members can contribute various types of content, including images, text, and links, to these subreddits, and other members can express their approval ('upvote') or disapproval ('downvote') of the content.

### **Problem Statement:**
The goal is to build a binary classification model that distinguishes cryptocurrency-related posts from stock-related posts, using posts from the CryptoMoonShots and wallstreetbets subreddits respectively. The model will be trained with Natural Language Processing (NLP) techniques on text collected from these subreddits via PRAW (the Python Reddit API Wrapper). Such a classifier could help Reddit users tell whether a post concerns cryptocurrency or stocks, which may be useful context when weighing trading or investment opportunities.

#### **Objectives**:
1. Gather and prepare data from the wallstreetbets and CryptoMoonShots subreddits using PRAW.
2. Preprocess the text data by cleaning, tokenizing, and vectorizing the posts for NLP analysis.
3. Train and compare two classification models, a random forest and logistic regression, to predict whether a post is related to cryptocurrency or stocks.
4. Evaluate the performance of the models using appropriate metrics such as accuracy, precision, recall, and F1 score.
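
As a rough illustration of the metrics named in objective 4, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels in plain Python. The function name `classification_metrics`, the toy labels, and the choice of CryptoMoonShots as the positive class are all assumptions made for this example; the notebook does not fix any of them.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells relative to the positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration:
# 1 = CryptoMoonShots (assumed positive class), 0 = wallstreetbets
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With two fairly balanced classes, accuracy is a reasonable headline number, while precision and recall show which subreddit the model tends to confuse with the other.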

In [2]:
import time, warnings
import pandas as pd
import numpy as np

# warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# %matplotlib inline

In [3]:
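import os
import praw

# create a PRAW Reddit instance using OAuth credentials
# secrets are read from environment variables instead of being hardcoded;
# the variable names below are placeholders, set them to match your setup
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='sentiment-analysis-lawdhavmerccy',
    username=os.environ['REDDIT_USERNAME'],
    password=os.environ['REDDIT_PASSWORD'])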

### Scrape New Posts in Each Subreddit

In [5]:
# define custom scraping function for new posts
def scrape_subreddit_new(subreddit_name, postlimit=1000):
    
    subreddit = reddit.subreddit(subreddit_name)
    post_id = []
    date_utc = []
    title = []
    text = []
    score = []
    upvote_ratio = []
    sub_name = []
    url = []
    


    # collect from posts sorted by new posts
    for post in subreddit.new(limit = postlimit):
        # collect information on post
        post_id.append(post.id)
        date_utc.append(post.created_utc)
        title.append(post.title)
        text.append(post.selftext)
        score.append(post.score)
        upvote_ratio.append(post.upvote_ratio)
        sub_name.append(post.subreddit)
        url.append(post.url)
        
        
        
    # transform new posts list into a df
    df_post = pd.DataFrame({'id': post_id,
                            'datetime': date_utc,
                            'title': title,
                            'text': text,
                            'score': score,
                            'upvote_ratio': upvote_ratio,
                            'url': url,
                            'subreddit': sub_name,})
    
    df_post['datetime'] = pd.to_datetime(df_post['datetime'], unit = 's')
    
    return df_post
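
The `created_utc` attribute holds seconds since the Unix epoch, which is why the function passes `unit = 's'` to `pd.to_datetime`. The sketch below shows the same conversion with the standard library only. The timestamp is an illustrative value back-computed from the first wallstreetbets row displayed in this notebook, not a value read from the API.

```python
from datetime import datetime, timezone

# Illustrative epoch seconds, back-computed from the datetime
# 2024-07-09 21:00:53 shown for the first wallstreetbets post
created_utc = 1720558853.0

# Interpret the value as UTC, as the attribute name suggests
as_datetime = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(as_datetime.isoformat())
```

Note that `pd.to_datetime(..., unit='s')` returns timezone-naive values, so the `datetime` column in the DataFrames here should be read as UTC.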

In [6]:
%%time
# scrape from subreddit wallstreetbets
wsb_new_posts = scrape_subreddit_new('wallstreetbets')

CPU times: total: 1.06 s
Wall time: 12.5 s


In [7]:
wsb_new_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,1dzdf1j,2024-07-09 21:00:53,$Sofi is coming back .,I suggest you buy and hold .,3,0.71,https://i.redd.it/ffwh4k2u4kbd1.jpeg,wallstreetbets
1,1dzdebj,2024-07-09 21:00:11,What’s going on with BABA?,The stock has performed really poorly over the...,2,0.67,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets
2,1dzd0u6,2024-07-09 20:44:37,ZI Yolo better late than early,DD was some guy posted about seeing a high vol...,2,0.75,https://i.redd.it/a2n775jx1kbd1.jpeg,wallstreetbets
3,1dzckvv,2024-07-09 20:26:14,We are already in a recession,,3,0.57,https://i.redd.it/1uao7vklyjbd1.png,wallstreetbets
4,1dzcjix,2024-07-09 20:24:41,🍔McDonalds🍔: At ATL in 52 weeks. How to make y...,"Hello gentleman. \n\nFirst of all, the only DD...",1,0.55,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets


In [8]:
wsb_new_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            944 non-null    object        
 1   datetime      944 non-null    datetime64[ns]
 2   title         944 non-null    object        
 3   text          944 non-null    object        
 4   score         944 non-null    int64         
 5   upvote_ratio  944 non-null    float64       
 6   url           944 non-null    object        
 7   subreddit     944 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 59.1+ KB


In [9]:
%%time
# scrape from subreddit CryptoMoonShots
cms_new_posts = scrape_subreddit_new('CryptoMoonShots')

CPU times: total: 297 ms
Wall time: 18.3 s


In [10]:
cms_new_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,1dzdgyf,2024-07-09 21:03:03,$Lndry | Major Updates for coming up – Excitin...,Major Updates for LNDRY – Exciting Times Ahead...,55,1.0,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
1,1dzcnf8,2024-07-09 20:29:06,Nancy coin Daily Update!,"Update time!\n\nIt's day 4 for Nancy coin, and...",5,1.0,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
2,1dzb3et,2024-07-09 19:25:17,I know you're a JEET and so am I,How many of you have ever jeeted a coin? I'm n...,9,0.91,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
3,1dz7m7a,2024-07-09 17:03:20,$HEGE Secures Dual Exchange Listings on MEXC a...,I just got some electrifying news that I had t...,12,0.93,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
4,1dz7ibw,2024-07-09 16:59:18,$CEEZUR keep growing fast and looking really b...,$CEEZUR is the first of the ancient politifi t...,70,0.95,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots


In [11]:
cms_new_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 934 entries, 0 to 933
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            934 non-null    object        
 1   datetime      934 non-null    datetime64[ns]
 2   title         934 non-null    object        
 3   text          934 non-null    object        
 4   score         934 non-null    int64         
 5   upvote_ratio  934 non-null    float64       
 6   url           934 non-null    object        
 7   subreddit     934 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 58.5+ KB


#### Export new posts to csv

In [13]:
wsb_new_posts.to_csv('data/wsb_new_posts.csv', index=False) 
cms_new_posts.to_csv('data/cms_new_posts.csv', index=False)

### Repeat process for Top Posts

In [15]:
# define custom scraping function for top posts
def scrape_subreddit_top(subreddit_name, postlimit=1000):
    
    subreddit = reddit.subreddit(subreddit_name)
    post_id = []
    date_utc = []
    title = []
    text = []
    score = []
    upvote_ratio = []
    sub_name = []
    url = []
    

    # collect from posts sorted by top
    for post in subreddit.top(limit = postlimit):
        # collect information on post
        post_id.append(post.id)
        date_utc.append(post.created_utc)
        title.append(post.title)
        text.append(post.selftext)
        score.append(post.score)
        upvote_ratio.append(post.upvote_ratio)
        sub_name.append(post.subreddit)
        url.append(post.url)
        
           
    # put posts into a df
    df_post = pd.DataFrame({'id': post_id,
                            'datetime': date_utc,
                            'title': title,
                            'text': text,
                            'score': score,
                            'upvote_ratio': upvote_ratio,
                            'url': url,
                            'subreddit': sub_name,})
    
    df_post['datetime'] = pd.to_datetime(df_post['datetime'], unit = 's')
    
    return df_post

In [16]:
%%time
# scrape from subreddit wallstreetbets
wsb_top_posts = scrape_subreddit_top('wallstreetbets')

CPU times: total: 250 ms
Wall time: 12.5 s


In [17]:
wsb_top_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,l8rf4k,2021-01-30 18:00:38,Times Square right now,,486588,0.99,https://v.redd.it/x64z70f7eie61,wallstreetbets
1,l6wu59,2021-01-28 13:40:34,UPVOTE so everyone sees we got SUPPORT,,337794,0.98,https://i.redd.it/sgoqy8nyt2e61.png,wallstreetbets
2,l78uct,2021-01-28 21:06:23,GME YOLO update — Jan 28 2021,,300125,0.98,https://i.redd.it/opzucppb15e61.png,wallstreetbets
3,l846a1,2021-01-29 21:04:45,GME YOLO month-end update — Jan 2021,,264428,0.98,https://i.redd.it/r557em3t5ce61.png,wallstreetbets
4,l881ia,2021-01-29 23:40:59,It’s treason then,,246576,0.98,https://i.redd.it/d3t66lv1yce61.jpg,wallstreetbets


In [18]:
wsb_top_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            954 non-null    object        
 1   datetime      954 non-null    datetime64[ns]
 2   title         954 non-null    object        
 3   text          954 non-null    object        
 4   score         954 non-null    int64         
 5   upvote_ratio  954 non-null    float64       
 6   url           954 non-null    object        
 7   subreddit     954 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 59.8+ KB


In [19]:
%%time
# scrape from subreddit CryptoMoonShots
cms_top_posts = scrape_subreddit_top('CryptoMoonShots')

CPU times: total: 328 ms
Wall time: 22.5 s


In [20]:
cms_top_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,lp96mm,2021-02-21 23:10:12,Introducing the official CryptoMoonShots premi...,Introducing the official [**CryptoMoonShots Di...,7540,0.98,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
1,pxs0q9,2021-09-29 09:47:51,Sphynxswap now live - CERTIK Audit NowCommplete,🔥 Sphynx — the all-in-one platform that is rev...,6022,0.97,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
2,oevhep,2021-07-06 13:59:33,Sonar ($PING) | 💻 By Investors For Investors |...,"**Game Changing Dex Tool, Made By Investors Fo...",5778,0.97,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
3,prcd8x,2021-09-19 17:28:40,Introduction SLAMCHAT an online social marketp...,SLAMCHAT TOKEN 👀 THE NEXT X100?! 💎\n\n&#x200B;...,5267,0.99,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
4,syqp6g,2022-02-22 15:48:03,JUMPTOKEN | gig economy marketplace | Active P...,JumpToken (JMPT) is a utility token for JumpTa...,5160,0.85,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots


In [21]:
cms_top_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 968 entries, 0 to 967
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            968 non-null    object        
 1   datetime      968 non-null    datetime64[ns]
 2   title         968 non-null    object        
 3   text          968 non-null    object        
 4   score         968 non-null    int64         
 5   upvote_ratio  968 non-null    float64       
 6   url           968 non-null    object        
 7   subreddit     968 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 60.6+ KB


#### Export top posts to csv

In [23]:
wsb_top_posts.to_csv('data/wsb_top_posts.csv', index=False)
cms_top_posts.to_csv('data/cms_top_posts.csv', index=False)

### Repeat process for Controversial Posts

In [25]:
# define custom scraping function for controversial posts
def scrape_subreddit_controversial(subreddit_name, postlimit=1000):
    
    subreddit = reddit.subreddit(subreddit_name)
    post_id = []
    date_utc = []
    title = []
    text = []
    score = []
    upvote_ratio = []
    sub_name = []
    url = []
    

    # collect from posts sorted by controversial
    for post in subreddit.controversial(limit = postlimit):
        # collect information on post
        post_id.append(post.id)
        date_utc.append(post.created_utc)
        title.append(post.title)
        text.append(post.selftext)
        score.append(post.score)
        upvote_ratio.append(post.upvote_ratio)
        sub_name.append(post.subreddit)
        url.append(post.url)
           
    # put posts into a df
    df_post = pd.DataFrame({'id': post_id,
                            'datetime': date_utc,
                            'title': title,
                            'text': text,
                            'score': score,
                            'upvote_ratio': upvote_ratio,
                            'url': url,
                            'subreddit': sub_name,})
    
    df_post['datetime'] = pd.to_datetime(df_post['datetime'], unit = 's')
    
    return df_post
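
The three scraping functions differ only in which listing they call (`new`, `top`, or `controversial`). One possible consolidation, sketched below, selects the listing by name. The names `collect_posts` and `FIELDS` are hypothetical, and the stand-in objects exist only so the sketch runs offline; with PRAW you would pass `reddit.subreddit('wallstreetbets')` instead.

```python
from types import SimpleNamespace

# Post attributes copied as-is; these match the ones collected above
FIELDS = ("id", "created_utc", "title", "selftext",
          "score", "upvote_ratio", "url")


def collect_posts(subreddit, sort="new", postlimit=1000):
    """Return one dict per post from the listing named by `sort`."""
    if sort not in ("new", "top", "controversial"):
        raise ValueError(f"unsupported sort: {sort!r}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.new
    rows = []
    for post in listing(limit=postlimit):
        row = {field: getattr(post, field) for field in FIELDS}
        # store the subreddit as plain text rather than as an object
        row["subreddit"] = str(post.subreddit)
        rows.append(row)
    return rows


# Stand-in objects (invented values) that mimic the attributes used above
fake_post = SimpleNamespace(
    id="abc123", created_utc=0.0, title="example title", selftext="",
    score=1, upvote_ratio=1.0, url="https://example.com",
    subreddit="examplesub")
fake_subreddit = SimpleNamespace(new=lambda limit: [fake_post][:limit])

rows = collect_posts(fake_subreddit, sort="new", postlimit=10)
print(rows[0]["id"], rows[0]["subreddit"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, after which the `created_utc` column can be converted with `pd.to_datetime(..., unit='s')` as in the functions above.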

In [26]:
%%time
# scrape from subreddit wallstreetbets
wsb_controversial_posts = scrape_subreddit_controversial('wallstreetbets')

CPU times: total: 344 ms
Wall time: 14.2 s


In [27]:
wsb_controversial_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,lw6hoh,2021-03-02 16:47:04,"RKT Megathread for March 2nd, 2021","RKT was highly discussed yesterday, so please ...",1648,0.54,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets
1,u2rhsa,2022-04-13 14:24:01,$WEBR: 40% Short Interest - Grilled Tendies & ...,Weber is the world's premier grill manufactur...,167,0.52,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets
2,s72srt,2022-01-18 17:28:45,AMC & GME Technical Analysis for retards,,700,0.57,https://i.redd.it/ri9ss107ehc81.png,wallstreetbets
3,181snlr,2023-11-23 04:15:56,NVDA is cooking their books. And so are others.,Here is their latest earnings report: https://...,0,0.49,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets
4,w1yroo,2022-07-18 13:00:07,GME NFT Marketplace : New Low Score! 07.18,,0,0.48,https://i.redd.it/8przdetypbc91.png,wallstreetbets


In [28]:
%%time
# scrape from subreddit CryptoMoonShots
cms_controversial_posts = scrape_subreddit_controversial('CryptoMoonShots')

CPU times: total: 203 ms
Wall time: 16.4 s


In [29]:
cms_controversial_posts.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,o280zc,2021-06-17 21:13:08,⬆️ EverRise taking over Safemoon? 4000 BNB pre...,BRAND NEW TOKENOMICS \n\n100k Buys on ALL TIME...,106,0.5,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
1,o2cym1,2021-06-18 01:11:13,🐻 Kodiak Coin ($KODIAK) launched just now! It ...,"Let me reiterate what I said in the title: **""...",56,0.52,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
2,n7rfa6,2021-05-08 15:20:43,🚀 $PXL [PIXL] $8m+ mcap just got listed on cnc...,They're creating their very own Arcade to link...,0,0.49,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
3,qbki61,2021-10-19 20:44:04,Afrostar | PreSale Coming Soon | Push to Becom...,I’m not here to pitch you on the usual s\*\*tc...,109,0.85,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
4,ofp7z8,2021-07-07 18:42:25,Uncle Doge is best variant of Doge-Themed coin...,The newest member of the extended doge family ...,255,0.5,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots


In [30]:
wsb_controversial_posts.to_csv('data/wsb_controversial_posts.csv', index=False)
cms_controversial_posts.to_csv('data/cms_controversial_posts.csv', index=False)

### Merging

In [32]:
# read wsb csv files into DataFrames
wsb_new = pd.read_csv('data/wsb_new_posts.csv')
wsb_top = pd.read_csv('data/wsb_top_posts.csv')
wsb_controversial = pd.read_csv('data/wsb_controversial_posts.csv')

# concatenate the DataFrames and drop posts that appear in more than one listing, using 'id' as the key
wsb_merged = pd.concat([wsb_new, wsb_top, wsb_controversial]).drop_duplicates(subset='id')
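
To make the deduplication step concrete, here is a small sketch with toy frames (the ids and scores are invented). `drop_duplicates(subset='id')` keeps the first occurrence by default, so when a post appears in several listings the row from the earliest frame in the list survives. The sketch also passes `ignore_index=True` and resets the index, which the cell above does not do; without them the merged frame keeps repeated index labels from the source frames.

```python
import pandas as pd

# Toy frames with invented values; post "b2" appears in both listings
new_posts = pd.DataFrame({"id": ["a1", "b2"], "score": [3, 5]})
top_posts = pd.DataFrame({"id": ["b2", "c3"], "score": [6, 900]})

# Stack the frames, then keep the first row seen for each id
combined = pd.concat([new_posts, top_posts], ignore_index=True)
deduped = combined.drop_duplicates(subset="id").reset_index(drop=True)

print(deduped)
```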

In [33]:
wsb_merged.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,1dzdf1j,2024-07-09 21:00:53,$Sofi is coming back .,I suggest you buy and hold .,3,0.71,https://i.redd.it/ffwh4k2u4kbd1.jpeg,wallstreetbets
1,1dzdebj,2024-07-09 21:00:11,What’s going on with BABA?,The stock has performed really poorly over the...,2,0.67,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets
2,1dzd0u6,2024-07-09 20:44:37,ZI Yolo better late than early,DD was some guy posted about seeing a high vol...,2,0.75,https://i.redd.it/a2n775jx1kbd1.jpeg,wallstreetbets
3,1dzckvv,2024-07-09 20:26:14,We are already in a recession,,3,0.57,https://i.redd.it/1uao7vklyjbd1.png,wallstreetbets
4,1dzcjix,2024-07-09 20:24:41,🍔McDonalds🍔: At ATL in 52 weeks. How to make y...,"Hello gentleman. \n\nFirst of all, the only DD...",1,0.55,https://www.reddit.com/r/wallstreetbets/commen...,wallstreetbets


In [34]:
wsb_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2823 entries, 0 to 930
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            2823 non-null   object 
 1   datetime      2823 non-null   object 
 2   title         2823 non-null   object 
 3   text          1375 non-null   object 
 4   score         2823 non-null   int64  
 5   upvote_ratio  2823 non-null   float64
 6   url           2823 non-null   object 
 7   subreddit     2823 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 198.5+ KB
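
The summary above reports only 1375 non-null `text` values and an `object` dtype for `datetime`, although the scraped frames had no nulls and a `datetime64[ns]` column. A likely explanation is the CSV round trip: posts with an empty `selftext` (image and link posts, for example) are written as empty fields, which `read_csv` reads back as NaN by default, and dates are read back as strings unless parsed. The sketch below reproduces this on two rows taken from the wallstreetbets output and shows one way to restore both columns; whether empty text should become `''` or stay missing is a modelling choice left open here.

```python
import io

import pandas as pd

# Two rows from the wallstreetbets output; the second has empty text
original = pd.DataFrame({
    "id": ["1dzdf1j", "1dzckvv"],
    "datetime": pd.to_datetime(["2024-07-09 21:00:53",
                                "2024-07-09 20:26:14"]),
    "text": ["I suggest you buy and hold .", ""],
})

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default read: the empty string comes back as NaN, dates as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# Parse the dates on read and turn missing text back into ''
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["datetime"])
restored["text"] = restored["text"].fillna("")

print(plain["text"].isna().sum(), restored["text"].isna().sum())
```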


In [35]:
# save merged DataFrame to a new csv file
wsb_merged.to_csv('data/wsb_merged_posts.csv', index=False)

In [36]:
# read cms csv files into DataFrames
cms_new = pd.read_csv('data/cms_new_posts.csv')
cms_top = pd.read_csv('data/cms_top_posts.csv')
cms_controversial = pd.read_csv('data/cms_controversial_posts.csv')

# concatenate the DataFrames and drop posts that appear in more than one listing, using 'id' as the key
cms_merged = pd.concat([cms_new, cms_top, cms_controversial]).drop_duplicates(subset='id')

In [37]:
cms_merged.head()

Unnamed: 0,id,datetime,title,text,score,upvote_ratio,url,subreddit
0,1dzdgyf,2024-07-09 21:03:03,$Lndry | Major Updates for coming up – Excitin...,Major Updates for LNDRY – Exciting Times Ahead...,55,1.0,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
1,1dzcnf8,2024-07-09 20:29:06,Nancy coin Daily Update!,"Update time!\n\nIt's day 4 for Nancy coin, and...",5,1.0,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
2,1dzb3et,2024-07-09 19:25:17,I know you're a JEET and so am I,How many of you have ever jeeted a coin? I'm n...,9,0.91,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
3,1dz7m7a,2024-07-09 17:03:20,$HEGE Secures Dual Exchange Listings on MEXC a...,I just got some electrifying news that I had t...,12,0.93,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots
4,1dz7ibw,2024-07-09 16:59:18,$CEEZUR keep growing fast and looking really b...,$CEEZUR is the first of the ancient politifi t...,70,0.95,https://www.reddit.com/r/CryptoMoonShots/comme...,CryptoMoonShots


In [38]:
cms_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2722 entries, 0 to 964
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            2722 non-null   object 
 1   datetime      2722 non-null   object 
 2   title         2722 non-null   object 
 3   text          2705 non-null   object 
 4   score         2722 non-null   int64  
 5   upvote_ratio  2722 non-null   float64
 6   url           2722 non-null   object 
 7   subreddit     2722 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 191.4+ KB


In [39]:
# save merged DataFrame to a new csv file
cms_merged.to_csv('data/cms_merged_posts.csv', index=False)