<img src='http://imgur.com/1ZcRyrc.png' style='float: left; margin: 20px; height: 55px'>

# Project 3: Reddit Web Scraping

## Part 1 - Web Scraping

## 1. Introduction 

### 1.1 Background

The trends in intermittent fasting and anorexia have raised concerns due to their potential detrimental effects on physical and mental health. 

Intermittent fasting is a dietary approach that cycles between periods of eating and fasting, and it has become a trend among Singaporeans trying to lose weight (SingHealth, 2021). However, when practised without proper guidance or medical supervision, it can lead to nutrient deficiencies, muscle loss, and an unhealthy fixation on food and body image. Anorexia nervosa, by contrast, is an eating disorder characterized by extreme restriction of food intake, and it can have severe consequences, including malnutrition, organ damage, and even death (SGH, 2019). The following links provide more information:

* <a href='https://www.healthxchange.sg/food-nutrition/weight-management/intermittent-fasting-how-to-do-safely'> Intermittent fasting </a>
* <a href='https://scc.sg/e/anorexia-nervosa/'> Anorexia nervosa </a>

The troubling connection lies in the blurred lines between health-conscious fasting and disordered eating behaviors. Promoting unrealistic body ideals and glorifying extreme fasting methods can inadvertently contribute to the development of anorexic tendencies among vulnerable individuals. It is crucial to prioritize balanced, sustainable nutrition and seek professional guidance when considering any dietary changes to ensure both physical and mental well-being.

### 1.2 Problem Statement

39 SIR, one of the leading healthcare groups in Singapore, seeks to determine whether individuals practising intermittent fasting may be exhibiting signs of anorexia nervosa. Our objective is either to offer guidance that keeps their fasting practices healthy, or to provide resources on seeking help if they show symptoms of anorexia.

<img src='internet_use_sg.png' width='650' height='350'>
<center> (Source: Meltwater, 2023) </center>

In a country where 96.9% of the population are active internet users, 39 SIR is committed to connecting with Singaporeans through digital channels. Our dedicated Data Science team has conducted extensive research to develop an application that helps individuals and healthcare facilities identify people who may be at risk of anorexia nervosa. This application is designed to facilitate timely intervention and support for those who need assistance.

This application was developed through the following steps:

1. Web scraping of subreddits
2. Data cleaning
3. Exploratory Data Analysis (EDA)
4. Pre-processing and modelling
5. Application development

The data dictionary for the scraped and processed posts is as follows:

| column                           | datatype | explanation                                       |
|----------------------------------|-----------|---------------------------------------------------|
| title                            | object    | Title of the post                                 |
| post_text                        | object    | Text content of the post                          |
| id                               | object    | Unique identifier for the post                    |
| score                            | int64     | Score or upvotes of the post                      |
| total_comments                   | int64     | Total number of comments on the post              |
| post_url                         | object    | URL of the post                                    |
| subreddit                        | object    | Subreddit where the post was made                  |
| post_type                        | object    | Type or format of the post                         |
| time_uploaded                    | object    | Timestamp when the post was uploaded               |
| punctuation_removed_title_and_text  | object    | Title and text content with punctuation removed     |
| title_text_stemmed               | object    | Title and text content after stemming              |
| title_text_lemmatized            | object    | Title and text content after lemmatization         |
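
The cleaning notebook is not shown in this part, so the following is only a guess at how a column such as `punctuation_removed_title_and_text` might be derived. It is a minimal sketch that assumes the title and post text are joined with a space and that only ASCII punctuation (`string.punctuation`) is stripped; the helper names `remove_punctuation` and `combine_title_and_text` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: typographic characters such as the curly apostrophe are left alone.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation(text):
    """Return text with ASCII punctuation stripped; other characters are kept."""
    return text.translate(_PUNCT_TABLE)


def combine_title_and_text(title, post_text):
    """Join title and body with a space (hypothetical choice), then strip punctuation."""
    return remove_punctuation(f'{title} {post_text}'.strip())


# An image post has an empty body, so only the title survives
print(combine_title_and_text('Does fasting out of spite work?', ''))
# prints: Does fasting out of spite work
```

Because `string.punctuation` covers ASCII only, typographic characters common in Reddit posts would need separate handling under this approach.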

## 2. Import

### 2.1 Libraries

In [1]:
import pandas as pd
import re
import datetime
import requests
from bs4 import BeautifulSoup
import praw
import nltk
import numpy as np
import matplotlib.pyplot as plt
import concurrent.futures
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import time
import itertools
from collections import defaultdict, Counter
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.util import bigrams
from sklearn.feature_extraction.text import CountVectorizer
import string

In [3]:
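# Initialize the Reddit API client.
# Credentials are read from environment variables rather than hard-coded;
# the variable names below are placeholders, not part of the original notebook.
import os

redditscrapper = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='(REDACTED NAME HERE)'
)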

In [4]:
# List of subreddit names to scrape
subreddit_names = ['intermittentfasting', 'AnorexiaNervosa']

In [7]:
# Iterate through the subreddit names to test accessibility
for subreddit_name in subreddit_names:
    reddit_url = f'https://www.reddit.com/r/{subreddit_name}'
    response = requests.get(reddit_url)

    if response.status_code == 200:
        print(f'Success! The subreddit at {reddit_url} is accessible.')
    else:
        print(f'Error! The subreddit at {reddit_url} returned a status code of {response.status_code}.')

Success! The subreddit at https://www.reddit.com/r/intermittentfasting is accessible.
Success! The subreddit at https://www.reddit.com/r/AnorexiaNervosa is accessible.


In [5]:
# Dictionary to store post data
posts_dict = {'Title': [],
              'Post Text': [],
              'ID': [],
              'Score': [],
              'Total Comments': [],
              'Post URL': [],
              'Subreddit': [],
              'Post Type': [],
              'Time uploaded': []  # key must match the one used in the scraping loop
             }

# Set to keep track of collected post IDs
post_ids = set()

# Map each post type to the name of the PRAW listing method used to fetch it
post_type_mapping = {'new': 'new',
                     'hot': 'hot',
                     'top': 'top',
                     'rising': 'rising'
                    }

In [8]:
# Iterate through the subreddit names and post types to fetch and collect posts
for subreddit_name in subreddit_names:
    subreddit = redditscrapper.subreddit(subreddit_name)
    
    # Fetch posts using the mapping
    for post_type, fetch_function in post_type_mapping.items():
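        # limit=None asks PRAW for as many posts as the listing exposes
        # (Reddit caps most listings at roughly 1,000 items)
        posts = getattr(subreddit, fetch_function)(limit=None)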
        
        for post in posts:
            
            if post.id not in post_ids:
                post_ids.add(post.id)
                
                # Only append posts that are not deleted nor removed
                if post.selftext != '[deleted]' and post.selftext != '[removed]':
                    posts_dict['Title'].append(post.title)
                    posts_dict['Post Text'].append(post.selftext)
                    posts_dict['ID'].append(post.id)
                    posts_dict['Score'].append(post.score)
                    posts_dict['Total Comments'].append(post.num_comments)
                    posts_dict['Post URL'].append(post.url)
                    posts_dict['Subreddit'].append(subreddit_name)
                    posts_dict['Post Type'].append(post_type)
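                    # Record when the post was created (UTC), not when it was scraped
                    time_uploaded = datetime.datetime.fromtimestamp(
                        post.created_utc, tz=datetime.timezone.utc
                    ).strftime('%Y-%m-%d %H:%M:%S')
                    posts_dict['Time uploaded'].append(time_uploaded)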

# Create a DataFrame from the collected data
all_posts = pd.DataFrame(posts_dict)

# Print a summary of the collected data
print(f'Total number of non-deleted/non-removed posts collected: {len(all_posts)}')
all_posts.head()

Total number of non-deleted/non-removed posts collected: 3960


|   | Title | Post Text | ID | Score | Total Comments | Post URL | Subreddit | Post Type | Time uploaded |
|---|-------|-----------|----|-------|----------------|----------|-----------|-----------|---------------|
| 0 | Does taking flavoured creatine break a fast? | Taking one scoop, roughly 3g. It has sucralose... | 16shh83 | 1 | 0 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 07:57:13 |
| 1 | I lost 120 lbs.......she lost 80. One meal a d... |  | 16shbmz | 6 | 1 | https://i.redd.it/cft42u8lso151.jpg | intermittentfasting | new | 2023-09-26 07:46:54 |
| 2 | Does fasting out of spite work? | We’ll see in 4 weeks when I go to a wedding wh... | 16sfrlc | 0 | 2 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 06:10:27 |
| 3 | Daily Fasting Check-in! | \* \*\*Type\*\* of fast (water, juice, smoking, etc... | 16sfl07 | 1 | 0 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 06:00:31 |
| 4 | 90 Days of Intermittent Fasting - IT WORKS! | Hi Everyone, \n\nToday was the 90th day of my ... | 16sdl2e | 17 | 8 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 04:10:24 |
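
The per-post logic in the scraping loop can be checked without calling the Reddit API if it is isolated in small functions. The sketch below is one possible arrangement, not the notebook's actual code: `post_to_row` and `collect_rows` are hypothetical helpers, and the stand-in posts are made-up values that only mimic the submission attributes used above, plus `created_utc`, which PRAW submissions provide as a Unix timestamp.

```python
import datetime
from types import SimpleNamespace

# Bodies that mark a post as deleted by its author or removed by moderators
SKIPPED_BODIES = {'[deleted]', '[removed]'}


def post_to_row(post, subreddit_name, post_type):
    """Map one submission-like object to a flat record, or None if it should be skipped."""
    if post.selftext in SKIPPED_BODIES:
        return None
    created = datetime.datetime.fromtimestamp(post.created_utc, tz=datetime.timezone.utc)
    return {
        'Title': post.title,
        'Post Text': post.selftext,
        'ID': post.id,
        'Score': post.score,
        'Total Comments': post.num_comments,
        'Post URL': post.url,
        'Subreddit': subreddit_name,
        'Post Type': post_type,
        'Time uploaded': created.strftime('%Y-%m-%d %H:%M:%S'),
    }


def collect_rows(listings, subreddit_name, seen_ids=None):
    """Collect rows from (post_type, posts) pairs; the first listing to yield an ID wins."""
    seen_ids = set() if seen_ids is None else seen_ids
    rows = []
    for post_type, posts in listings:
        for post in posts:
            if post.id in seen_ids:
                continue  # same post already seen in an earlier listing
            seen_ids.add(post.id)
            row = post_to_row(post, subreddit_name, post_type)
            if row is not None:
                rows.append(row)
    return rows


# Stand-in posts with made-up values (not real Reddit data)
kept = SimpleNamespace(title='Example', selftext='Body', id='abc123', score=1,
                       num_comments=0, url='https://example.com', created_utc=0)
removed = SimpleNamespace(title='Gone', selftext='[removed]', id='def456', score=0,
                          num_comments=0, url='https://example.com', created_utc=0)

rows = collect_rows([('new', [kept, removed]), ('hot', [kept])], 'intermittentfasting')
print(len(rows), rows[0]['Post Type'], rows[0]['Time uploaded'])
# prints: 1 new 1970-01-01 00:00:00
```

A list of such rows can be passed directly to `pd.DataFrame`, and a post that appears in several listings keeps the post type of the listing that returned it first.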


In [11]:
# Save the raw data in 'reddit_raw.csv'
all_posts.to_csv('reddit_raw.csv', index=False)
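
Since the two subreddits later serve as the classes to be told apart, it can be worth checking how many posts each contributed to the saved file. The following is a minimal sketch using only the standard library; the helper name `posts_per_subreddit` is hypothetical, and it assumes the CSV has a `Subreddit` column as written above. With pandas, `pd.read_csv(...)['Subreddit'].value_counts()` would give the same counts.

```python
import csv
from collections import Counter


def posts_per_subreddit(csv_path):
    """Count rows per value of the 'Subreddit' column in a saved CSV file."""
    # newline='' lets the csv module handle line breaks inside quoted post text
    with open(csv_path, newline='', encoding='utf-8') as handle:
        return Counter(row['Subreddit'] for row in csv.DictReader(handle))
```

Calling `posts_per_subreddit('reddit_raw.csv')` after the save step returns a `Counter` keyed by subreddit name; a heavily lopsided result would suggest paying attention to class balance during modelling.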

[Click here to see the notebook used to clean the scraped data](Project%203%20Cleaning%20%28caa%20250923%202057%29.ipynb)