Problem statement:

The objective is to classify which subreddit a post comes from, so that posts can be sorted into categories by type. The model will be used in an upcoming news aggregator app in which users pick the categories or topics they want to follow, and the app pushes all news related to those topics to them (similar to Google News).

1. Find subreddits
2. Use the API to extract data
   * Consider looking at the number of comments per post
3. Preprocessing: stop-word removal, lowercasing, etc.
4. Vectorization with CountVectorizer, TfidfVectorizer, or word vectors via embeddings?
5. Modeling with Random Forest, Naive Bayes, and Logistic Regression
6. Conclusion
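
As a rough illustration of steps 3 and 4, the sketch below lowercases text, drops stop words, and builds a bag-of-words count matrix using only the standard library. It mimics the idea behind scikit-learn's `CountVectorizer` and is not its implementation. The stop-word list, the helper names `preprocess` and `count_vectorize`, and the two sample titles are assumptions made up for this example.

```python
import re
from collections import Counter
from typing import List, Tuple

# A tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {'the', 'a', 'an', 'is', 'in', 'of', 'to', 'and'}


def preprocess(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def count_vectorize(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one row of term counts per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        rows.append([counts[term] for term in vocab])
    return vocab, rows


# Made-up post titles, used only to show the shape of the output.
titles = [
    'The Avengers assemble in the new trailer',
    'Batman and the new Batman trailer',
]
vocab, matrix = count_vectorize(titles)
print(vocab)
print(matrix)
```

Each row of `matrix` lines up with `vocab`, so a classifier in step 5 would see one fixed-length count vector per post.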

## Import Libraries

In [1]:
import requests
import pandas as pd

In [2]:
def get_data(subreddit, loops):
    """Fetch `loops` pages of up to 100 submissions each from a subreddit."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100
    }
    df = None

    for loop in range(loops):
        res = requests.get(url, params=params)
        page = pd.DataFrame(res.json()['data'])
        # Append this page; concat keeps the union of columns across pages.
        df = page if df is None else pd.concat([df, page], ignore_index=True)
        # Page backwards in time: the next request asks only for posts
        # older than the last (oldest) post collected so far.
        params['before'] = df['created_utc'].iloc[-1]

        print(f'DataFrame shape: {df.shape}')
        print(f'Scraping data from subreddit/{subreddit}')
        print(f'Status code: {res.status_code}')
        print(f'Iteration: {loop}')
    return df
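
The paging pattern used in `get_data` (request a page, then pass the oldest `created_utc` as `before` on the next request) can be tried out without touching the network. The sketch below is a minimal illustration under stated assumptions: `paginate`, `make_fake_fetcher`, and the in-memory timestamps are hypothetical stand-ins for the API, not part of the notebook. It also stops early when a page comes back empty, which `get_data` does not do.

```python
from typing import Callable, Dict, List, Optional

Post = Dict[str, int]
Fetcher = Callable[[Optional[int]], List[Post]]


def paginate(fetch_page: Fetcher, max_pages: int) -> List[Post]:
    """Collect posts page by page, walking backwards in time.

    fetch_page(before) is a hypothetical callable that returns posts created
    strictly before `before` (newest first), or the newest posts when
    `before` is None.
    """
    posts: List[Post] = []
    before: Optional[int] = None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        posts.extend(page)
        # The last post on the page is the oldest; use it as the next cursor.
        before = page[-1]['created_utc']
    return posts


def make_fake_fetcher(timestamps: List[int], size: int) -> Fetcher:
    """Build an in-memory stand-in for the API (an assumption for this demo)."""
    ordered = sorted(timestamps, reverse=True)

    def fetch_page(before: Optional[int]) -> List[Post]:
        older = [t for t in ordered if before is None or t < before]
        return [{'created_utc': t} for t in older[:size]]

    return fetch_page


fake_fetch = make_fake_fetcher(list(range(1, 8)), size=3)
collected = paginate(fake_fetch, max_pages=10)
print([p['created_utc'] for p in collected])
```

With seven fake posts and a page size of three, the loop makes four requests: two full pages, one partial page, and one empty page that ends the loop.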


In [3]:
iteration = 100

In [4]:
raw_data_marvel = get_data('marvelstudios',iteration)

DataFrame shape: (100, 80)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 0
DataFrame shape: (200, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 1
DataFrame shape: (300, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 2
DataFrame shape: (400, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 3
DataFrame shape: (500, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 4
DataFrame shape: (600, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 5
DataFrame shape: (700, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 6
DataFrame shape: (800, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 7
DataFrame shape: (900, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 8
DataFrame shape: (1000, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 9

In [10]:
# raw_data = pd.concat([raw_data,get_data('DC_Cinematic',iteration)],ignore_index=True)
raw_data_dc = get_data('DC_Cinematic',iteration)

DataFrame shape: (100, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 0
DataFrame shape: (200, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 1
DataFrame shape: (300, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 2
DataFrame shape: (400, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 3
DataFrame shape: (500, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 4
DataFrame shape: (600, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 5
DataFrame shape: (700, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 6
DataFrame shape: (800, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 7
DataFrame shape: (900, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 8
DataFrame shape: (1000, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 9

In [6]:
raw_data_marvel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9993 entries, 0 to 9992
Data columns (total 85 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  9993 non-null   object 
 1   allow_live_comments            9993 non-null   bool   
 2   author                         9993 non-null   object 
 3   author_flair_css_class         1261 non-null   object 
 4   author_flair_richtext          9964 non-null   object 
 5   author_flair_text              1242 non-null   object 
 6   author_flair_type              9964 non-null   object 
 7   author_fullname                9964 non-null   object 
 8   author_is_blocked              9993 non-null   bool   
 9   author_patreon_flair           9964 non-null   object 
 10  author_premium                 9964 non-null   object 
 11  awarders                       9993 non-null   object 
 12  can_mod_post                   9993 non-null   b

In [7]:
raw_data_dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9993 entries, 0 to 9992
Data columns (total 85 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  9993 non-null   object 
 1   allow_live_comments            9993 non-null   bool   
 2   author                         9993 non-null   object 
 3   author_flair_css_class         1261 non-null   object 
 4   author_flair_richtext          9964 non-null   object 
 5   author_flair_text              1242 non-null   object 
 6   author_flair_type              9964 non-null   object 
 7   author_fullname                9964 non-null   object 
 8   author_is_blocked              9993 non-null   bool   
 9   author_patreon_flair           9964 non-null   object 
 10  author_premium                 9964 non-null   object 
 11  awarders                       9993 non-null   object 
 12  can_mod_post                   9993 non-null   b

In [11]:
# Save the raw scrapes, keeping the row index as the first column.
# (index_label=True would name that column literally "True".)
raw_data_marvel.to_csv('./data/raw_data_marvel.csv', index=True)
raw_data_dc.to_csv('./data/raw_data_dc.csv', index=True)