<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Classification Model Analysis <br>
**Notebook 1: Web Scraping**

# EXECUTIVE SUMMARY

# 00. INTRODUCTION

## BACKGROUND

Is science a part of philosophy, or are they two entirely different subjects? Many people today assume that science and philosophy contradict each other, but the two share a cooperative relationship rather than an antagonistic one. In fact, for roughly 98% of the last 2,500 years of Western intellectual history, science was a part of philosophy. It was then called natural philosophy; science diverged from philosophy in the 17th century and emerged as a separate field of study ([*source*](https://archive.nytimes.com/opinionator.blogs.nytimes.com/2012/04/05/philosophy-is-not-a-science/)).

Science and philosophy can be contrasted as follows ([*source*](https://1000wordphilosophy.com/2018/02/13/philosophy-and-its-contrast-with-science/#:~:text=Science%20is%20about%20descriptive%20facts,objects%20(if%20they%20exist))):
- Science is about empirical knowledge; philosophy is often about that but is also about a priori knowledge (if it exists).
- Science is about contingent facts or truths; philosophy is often about that but is also about necessary truths (if they exist).
- Science is about descriptive facts; philosophy is often about that but is also about normative and evaluative truths (if such truths exist).
- Science is about physical objects; philosophy is often about that but is also about abstract objects (if they exist).

## PROBLEM STATEMENT

As moderators of the Science and Philosophy subreddits, which have a substantial number of members (28.4 million and 16.9 million respectively), our goals are to:

1. Develop a classification model that predicts which subreddit a post belongs to. This will help us make sure that topics are posted in the correct subreddit and improve the user experience when reading posts.
2. Conduct sentiment analysis to evaluate users' posts. As Science and Philosophy are both fact-based subreddits, neutral and unopinionated posts are to be expected.
3. Identify trending topics for each subreddit so that we can pin them at the top of our landing page.

The baseline classification model will use Logistic Regression with CountVectorizer and TfidfVectorizer. Based on the baseline performance, Multinomial Naive Bayes, Logistic Regression, Random Forest, and SVM models will then be developed with hyperparameter tuning. The model with the highest score will be selected as the final model.
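
As a rough illustration of the baseline idea only (the actual modelling is done in a later notebook), the sketch below fits Logistic Regression on top of both vectorizers using scikit-learn. The example titles are made up for illustration and are not scraped data, and `max_iter=1000` is an assumed setting rather than a tuned one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up example titles (assumption: NOT taken from the scraped data)
titles = [
    "New study links sleep duration to memory consolidation",
    "Researchers measure ocean warming with satellite data",
    "Clinical trial finds vaccine reduces infection risk",
    "Is free will compatible with determinism?",
    "Kant and the limits of pure reason",
    "What makes an action morally right?",
]
labels = ["science"] * 3 + ["philosophy"] * 3

# One baseline per vectorizer; each pipeline turns text into features, then classifies
baselines = {
    "count": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

for name, model in baselines.items():
    model.fit(titles, labels)
    prediction = model.predict(["A study of moral reasoning"])[0]
    print(f"{name}: predicted subreddit = {prediction}")
```

With real data, the two baselines would be compared on a held-out test set rather than on the training titles.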

Sentiment analysis of the Science and Philosophy subreddits overall will be done using VADER, and further sentiment analysis of the trending topics from each subreddit will be done using Hugging Face.

## DATA COLLECTION

The data is taken from the following subreddits:
1. Science ([*source*](https://www.reddit.com/r/science/))
2. Philosophy ([*source*](https://www.reddit.com/r/philosophy/))

The [*Pushshift API*](https://github.com/pushshift/api) is used to scrape about 25,000 posts from each subreddit, working backwards from 4 October 2022 0:00:00 SGT.
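
The cutoff is passed to the API as a Unix timestamp. A minimal sketch of how that value can be derived from the SGT date, assuming a fixed UTC+8 offset for Singapore Time (the names `SGT`, `cutoff`, and `before` are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Singapore Time is UTC+8 and does not observe daylight saving
SGT = timezone(timedelta(hours=8))

# Midnight at the start of 4 October 2022 in SGT
cutoff = datetime(2022, 10, 4, 0, 0, 0, tzinfo=SGT)

# Convert to whole seconds since the Unix epoch
before = int(cutoff.timestamp())
print(before)
```

This gives the same value that is hard-coded as `'before'` in the scraping function.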

## TABLE OF CONTENTS

**1. Web Scraping (This Notebook)** <br>
- [01. Library](#01.-LIBRARY) <br>
- [02. Function](#02.-FUNCTION) <br>
- [03. Webscraping](#03.-WEBSCRAPING) <br>
- [04. Export Scraped Data](#04.-EXPORT-SCRAPPED-DATA) <br>

**2. Data Cleaning & EDA** <br>
**3. Modelling, Hyper-parameter tuning, Model Selection** <br>
**4. Sentiment Analysis, Conclusion & Recommendation** <br>

# 01. LIBRARY

In [11]:
import requests
import pandas as pd
import time
import random

# 02. FUNCTION

In [2]:
# Define a function to scrape submissions from a subreddit via the Pushshift API

def get_post(subreddit, batch):
    url = 'https://api.pushshift.io/reddit/search/submission/'

    params = {
        'subreddit': subreddit,
        'size': 201,
        'before': 1664812800,  # October 4, 2022 0:00:00 SGT
        'sort_type': 'created_utc',
        'sort': 'desc'
    }

    reddit_subs = []

    for i in range(batch):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            print(f"Batch {i+1}/{batch} failed with status {res.status_code}")
        else:
            data = res.json()['data']
            if not data:
                # No older posts left to fetch
                break
            reddit_subs += data

            print(f"Batch {i+1}/{batch} completed - {len(reddit_subs)} total posts")

            # Page backwards: the next request starts before the oldest post fetched so far
            params['before'] = reddit_subs[-1]['created_utc']

        # Pause between requests (also after a failure) to avoid hammering the API
        time.sleep(random.randint(5, 10))

    return reddit_subs
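
The function pages backwards through time by moving the `before` cursor to the timestamp of the oldest post collected so far. The sketch below isolates that logic so it can be checked without calling the API. `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical stand-ins, not part of Pushshift or of this notebook.

```python
def paginate(fetch_page, before, batches):
    """Collect posts newest-first, moving `before` to the oldest post seen."""
    posts = []
    for _ in range(batches):
        page = fetch_page(before)
        if not page:
            # Nothing older than the cursor: stop early
            break
        posts += page
        before = posts[-1]['created_utc']
    return posts


# Hypothetical in-memory data: ten posts with timestamps 10 down to 1
FAKE_POSTS = [{'id': n, 'created_utc': n} for n in range(10, 0, -1)]


def fake_fetch(before, size=3):
    """Stand-in for the API call: newest posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]


collected = paginate(fake_fetch, before=11, batches=5)
print([p['id'] for p in collected])
```

One consequence of this design is that posts sharing the exact timestamp of the last post in a batch could be skipped if the API treats `before` as a strict bound, as the stand-in here does.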

# 03. WEBSCRAPING

## (i) Subreddit: r/science 

In [3]:
# Get ~25000 posts from science subreddit
science = get_post('science', 125)

Batch 1/125 completed - 200 total posts
Batch 2/125 completed - 401 total posts
Batch 3/125 completed - 602 total posts
Batch 4/125 completed - 803 total posts
Batch 5/125 completed - 1004 total posts
Batch 6/125 completed - 1205 total posts
Batch 7/125 completed - 1406 total posts
Batch 8/125 completed - 1607 total posts
Batch 9/125 completed - 1807 total posts
Batch 10/125 completed - 2008 total posts
Batch 11/125 completed - 2209 total posts
Batch 12/125 completed - 2410 total posts
Batch 13/125 completed - 2611 total posts
Batch 14/125 completed - 2812 total posts
Batch 15/125 completed - 3013 total posts
Batch 16/125 completed - 3213 total posts
Batch 17/125 completed - 3414 total posts
Batch 18/125 completed - 3615 total posts
Batch 19/125 completed - 3816 total posts
Batch 20/125 completed - 4016 total posts
Batch 21/125 completed - 4217 total posts
Batch 22/125 completed - 4418 total posts
Batch 23/125 completed - 4619 total posts
Batch 24/125 completed - 4820 total posts
Batch

In [4]:
# Convert to DataFrame
science_df = pd.DataFrame(science)

In [5]:
science_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25086 entries, 0 to 25085
Data columns (total 83 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  25086 non-null  object 
 1   allow_live_comments            25086 non-null  bool   
 2   author                         25086 non-null  object 
 3   author_flair_css_class         1597 non-null   object 
 4   author_flair_richtext          24858 non-null  object 
 5   author_flair_text              511 non-null    object 
 6   author_flair_type              24858 non-null  object 
 7   author_fullname                24858 non-null  object 
 8   author_is_blocked              25086 non-null  bool   
 9   author_patreon_flair           24858 non-null  object 
 10  author_premium                 24858 non-null  object 
 11  awarders                       25086 non-null  object 
 12  can_mod_post                   25086 non-null 

In [6]:
science_df['created_utc']

0        1664811669
1        1664809032
2        1664807875
3        1664807763
4        1664807340
            ...    
25081    1636822314
25082    1636819215
25083    1636818580
25084    1636817924
25085    1636817159
Name: created_utc, Length: 25086, dtype: int64

From: 1664811669 - Monday, October 3, 2022 11:41:09 PM (SGT) <br>
To: 1636817159 - Saturday, November 13, 2021 11:25:59 PM (SGT) <br>
Total = 25086 posts
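
The readable dates above can be reproduced from the raw `created_utc` values. A minimal sketch using only the standard library, again assuming a fixed UTC+8 offset for SGT (`to_sgt` is an illustrative helper name):

```python
from datetime import datetime, timedelta, timezone

# Singapore Time is UTC+8 and does not observe daylight saving
SGT = timezone(timedelta(hours=8))


def to_sgt(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware SGT datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=SGT)


newest = to_sgt(1664811669)  # first row of science_df['created_utc']
oldest = to_sgt(1636817159)  # last row of science_df['created_utc']

print(newest.isoformat())
print(oldest.isoformat())
```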

## (ii) Subreddit: r/philosophy

In [7]:
# Get ~25000 posts from philosophy subreddit
philosophy = get_post('philosophy', 125)

Batch 1/125 completed - 201 total posts
Batch 2/125 completed - 402 total posts
Batch 3/125 completed - 603 total posts
Batch 4/125 completed - 804 total posts
Batch 5/125 completed - 1005 total posts
Batch 6/125 completed - 1206 total posts
Batch 7/125 completed - 1407 total posts
Batch 8/125 completed - 1608 total posts
Batch 9/125 completed - 1809 total posts
Batch 10/125 completed - 2010 total posts
Batch 11/125 completed - 2210 total posts
Batch 12/125 completed - 2411 total posts
Batch 13/125 completed - 2612 total posts
Batch 14/125 completed - 2813 total posts
Batch 15/125 completed - 3014 total posts
Batch 16/125 completed - 3215 total posts
Batch 17/125 completed - 3416 total posts
Batch 18/125 completed - 3617 total posts
Batch 19/125 completed - 3818 total posts
Batch 20/125 completed - 4019 total posts
Batch 21/125 completed - 4220 total posts
Batch 22/125 completed - 4421 total posts
Batch 23/125 completed - 4622 total posts
Batch 24/125 completed - 4823 total posts
Batch

In [8]:
# Convert to DataFrame
philosophy_df = pd.DataFrame(philosophy)

In [9]:
philosophy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25113 entries, 0 to 25112
Data columns (total 83 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  25113 non-null  object 
 1   allow_live_comments            25113 non-null  bool   
 2   author                         25113 non-null  object 
 3   author_flair_css_class         1 non-null      object 
 4   author_flair_richtext          24941 non-null  object 
 5   author_flair_text              496 non-null    object 
 6   author_flair_type              24941 non-null  object 
 7   author_fullname                24941 non-null  object 
 8   author_is_blocked              22307 non-null  object 
 9   author_patreon_flair           24941 non-null  object 
 10  author_premium                 24941 non-null  object 
 11  awarders                       25113 non-null  object 
 12  can_mod_post                   25113 non-null 

In [10]:
philosophy_df['created_utc']

0        1664810634
1        1664810509
2        1664808499
3        1664808211
4        1664807836
            ...    
25108    1621631144
25109    1621628366
25110    1621625939
25111    1621623512
25112    1621621244
Name: created_utc, Length: 25113, dtype: int64

From: 1664810634 - Monday, October 3, 2022 11:23:54 PM (SGT) <br>
To: 1621621244 - Saturday, May 22, 2021 2:20:44 AM (SGT) <br>
Total = 25113 posts
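
Because both scrapes collect a similar number of posts, the time window each one covers reflects how active the subreddit is. The sketch below computes the window and a rough posting rate from the `created_utc` endpoints and the row counts reported by `info()`. The dictionary layout and variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# (newest created_utc, oldest created_utc, number of rows) from the outputs above
scrapes = {
    'science': (1664811669, 1636817159, 25086),
    'philosophy': (1664810634, 1621621244, 25113),
}

summary = {}
for name, (newest, oldest, n_posts) in scrapes.items():
    span_days = (newest - oldest) / SECONDS_PER_DAY
    posts_per_day = n_posts / span_days
    summary[name] = (span_days, posts_per_day)
    print(f"{name}: {span_days:.0f} days, {posts_per_day:.1f} posts per day")
```

The science scrape spans about 324 days and the philosophy scrape about 500 days, so the two datasets do not cover the same period. This may be worth keeping in mind when comparing trending topics between them.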

# 04. EXPORT SCRAPPED DATA

In [11]:
science_df.to_csv('../datasets/science.csv', index=False)
philosophy_df.to_csv('../datasets/philosophy.csv', index=False)