# Web Scraping for Reddit Posts

This notebook scrapes Reddit posts collected in December 2021 and early January 2022 to identify which characteristics of a post are associated with high engagement, measured by the number of comments on the thread. The following features are extracted from each post:

* Title of the thread
* Time since the thread was posted
* Number of comments
* Subreddit
* An indicator for whether a post contains a video
* An indicator for whether a post contains an image 

### Import the required libraries and set up a webdriver to scrape the data using Selenium

In [None]:
import schedule
from datetime import datetime, timedelta, time
from time import sleep
import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

import pandas as pd
import os

In [None]:
ser = Service('chromedriver/chromedriver')
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

In [None]:
URL = "http://www.reddit.com"

In [None]:
driver.get(URL)

In [None]:
def web_scrapping():
    """Scroll repeatedly toward the bottom of the Reddit page opened by the webdriver, parse the
       loaded HTML with BeautifulSoup, and extract each thread's title, time since posting,
       number of comments, subreddit, and indicators for whether the thread contains an image
       or a video. The results are combined into a dataframe and exported to a timestamped
       CSV file for use in the analysis.
    """
    for i in range(1,301):
        if i % 50 == 0:
            print("sleep repetition:", i)
        sleep(3)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    print(driver.title)
    assert 'Reddit' in driver.title
    
    html = driver.page_source
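    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(html, 'html.parser')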
    
    # Each post sits in one container div; walk the containers once and collect every field.
    posts = soup.find_all('div', {'class': '_1poyrkZ7g36PawDueRza-J'})

    def text_or_default(tag):
        """Return the tag's text, or 'No value' when the element was not found."""
        return tag.text if tag is not None else 'No value'

    title, timestamps, comments, subreddit, image, video = [], [], [], [], [], []
    for entry in posts:
        title.append(text_or_default(entry.find('h3')))
        timestamps.append(text_or_default(entry.find('a', {'class': '_3jOxDPIQ0KaOWpzvSQo-1s'})))
        comments.append(text_or_default(entry.find('span', {'class': 'FHCV02u6Cp2zYL0fhQPsO'})))
        subreddit.append(text_or_default(entry.find('div', {'class': '_2mHuuvyV9doV3zwbZPtIPG'})))
        # Record the opening tag name when the media element exists; it is mapped to 0/1 below.
        has_image = entry.find('img', {'class': '_2_tDEnGMLxpM6uOa2kaDB3'}) is not None
        has_video = entry.find('video', {'class': '_1EQJpXY7ExS04odI1YBBlj'}) is not None
        image.append('<img' if has_image else 'No value')
        video.append('<video' if has_video else 'No value')
    
    assert len(title) == len(timestamps) == len(comments) == len(subreddit) == len(image) == len(video)
    
    df = pd.DataFrame({'title':title,'time':timestamps,'subreddit':subreddit,'number_comments':comments,
                  'image':image, 'video':video})
    
    df['image'] = df['image'].apply(lambda x: 1 if x == '<img' else 0)
    df['video'] = df['video'].apply(lambda x: 1 if x == '<video' else 0)
    
    df = df.drop_duplicates()
    # pd.datetime is not available in recent pandas releases; use the datetime class imported above.
    return df.to_csv('./data/web-scrapping-{}.csv'.format(datetime.now().strftime("%Y-%m-%d %H%M%S")),index=False)

### Run the web-scraping function every 5 hours.
The following resources were helpful:

* [The schedule documentation](https://schedule.readthedocs.io/en/stable/examples.html) and [this YouTube video](https://www.youtube.com/watch?v=zwIGxcDxS5o), for automating the execution of the web-scraping function.
* [This resource](https://www.programiz.com/python-programming/datetime/current-time), for setting up the datetime object that lets the while loop run until a predetermined time.
* [The Selenium FAQ](http://selenium-python.readthedocs.io/faq.html), for scrolling down to the bottom of the page.
* [This Stack Overflow question](https://stackoverflow.com/questions/67505537/add-timestamp-to-file-name-during-saving-data-frame-in-csv), for adding a timestamp to each CSV file generated from the extracted data.

In [None]:
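# Run the scraper every 5 hours until the end of January 3, 2022.
END = datetime(2022, 1, 3, 23, 59)
schedule.every(5).hours.until(END).do(web_scrapping)
# Compare full datetimes: a time-of-day string comparison would end the loop at 23:59 on the first day.
while datetime.now() < END:
    schedule.run_pending()
    sleep(4)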
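The polling pattern used in the cell above can be illustrated without the `schedule` package. The sketch below is a simplified stand-in, not how `schedule` is implemented: `run_every` and `FakeClock` are hypothetical names introduced here, and the simulated clock exists only so the example finishes instantly instead of waiting hours.

```python
import time
from datetime import datetime, timedelta


def run_every(job, interval, end, now=datetime.now, sleep=time.sleep, poll_seconds=4):
    """Call job() once per `interval` until `end`; return how many times it ran."""
    runs = 0
    next_run = now() + interval  # like the notebook, the first run happens one interval after start
    while now() < end:           # compare full datetimes, not time-of-day strings
        if now() >= next_run:
            job()
            runs += 1
            next_run += interval
        sleep(poll_seconds)
    return runs


class FakeClock:
    """Simulated clock (hypothetical helper) so the demonstration needs no real waiting."""

    def __init__(self, start):
        self.current = start

    def now(self):
        return self.current

    def sleep(self, seconds):
        self.current += timedelta(seconds=seconds)


clock = FakeClock(datetime(2022, 1, 1, 0, 0))
calls = []
runs = run_every(
    job=lambda: calls.append(clock.now()),
    interval=timedelta(hours=5),
    end=datetime(2022, 1, 1, 12, 0),
    now=clock.now,
    sleep=clock.sleep,
    poll_seconds=3600,
)
print(runs, calls)
```

With the real defaults (`datetime.now` and `time.sleep`), the same function would poll every 4 seconds, as the notebook's loop does.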

### Generate a file of scraped data for use in the analysis.


#### Add the timestamp of when the data was scraped, which is used to estimate when each thread was posted.

In [None]:
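for file in sorted(os.listdir('data'), reverse=True):
    # Skip anything that is not a per-scrape CSV file (e.g., the combined results file).
    if not (file.startswith('web-scrapping-') and file.endswith('.csv')):
        continue
    df = pd.read_csv(f'./data/{file}')
    # The scrape timestamp is the file name without its prefix and extension.
    df['date_scrapped'] = file[len('web-scrapping-'):-len('.csv')]
    df.to_csv(f'./data/{file}', index=False)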
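As a minimal sketch of the file-name convention, the helper below recovers the scrape time from a name such as `web-scrapping-2021-12-28 063123.csv` and returns `None` for anything else in the folder. The function name `scrape_time_from_filename` is hypothetical and not part of the notebook; only the prefix and timestamp format are taken from it.

```python
from datetime import datetime

PREFIX, SUFFIX = "web-scrapping-", ".csv"
STAMP_FORMAT = "%Y-%m-%d %H%M%S"  # same format used when the CSV files are written


def scrape_time_from_filename(filename):
    """Return the datetime embedded in a scrape file name, or None if it does not match."""
    if not (filename.startswith(PREFIX) and filename.endswith(SUFFIX)):
        return None
    stamp = filename[len(PREFIX):-len(SUFFIX)]
    try:
        return datetime.strptime(stamp, STAMP_FORMAT)
    except ValueError:
        # The name has the right prefix and extension but not a valid timestamp.
        return None


print(scrape_time_from_filename("web-scrapping-2021-12-28 063123.csv"))
print(scrape_time_from_filename("web_scrapping_results.csv"))
```

Returning `None` rather than raising makes it easy to skip stray files when looping over the directory.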

#### Test that the date when the data was scraped was added correctly.


In [None]:
pd.read_csv('./data/web-scrapping-2021-12-28 063123.csv').head(1)

#### Append all scraped data into one data frame.


In [None]:
scrape_files = sorted((f for f in os.listdir('data') if f.startswith('web-scrapping-')), reverse=True)
df = pd.concat([pd.read_csv('./data/' + f) for f in scrape_files])

#### Drop columns that will not be used in the analysis.


In [None]:
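# errors='ignore' skips any listed column that is absent instead of raising a KeyError.
df.drop(columns=['Unnamed: 0', 'timestamps'], inplace=True, errors='ignore')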

In [None]:
df.sort_values(['title','date_scrapped']).head(3)

In [None]:
df.reset_index(drop=True,inplace=True)

#### Convert the scrape timestamp to a datetime, which is used to estimate when each thread was posted.


In [None]:
# Parse the 'YYYY-MM-DD HHMMSS' stamps taken from the file names in one vectorized call.
df['date_scrapped'] = pd.to_datetime(df['date_scrapped'], format='%Y-%m-%d %H%M%S')
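Once the scrape time is a datetime, the posting time can be estimated by subtracting the thread's age. The sketch below assumes the `time` column holds relative strings such as `5 hours ago`; that format is an assumption, since the notebook does not show the raw values, and `estimate_posted_at` is a hypothetical helper. Ages in other units, and the `No value` placeholder, yield `None`.

```python
import re
from datetime import datetime, timedelta

# Assumed format: "<number> <minute|hour|day>[s] ago".
AGE_PATTERN = re.compile(r"(\d+)\s+(minute|hour|day)s?\s+ago")


def estimate_posted_at(scraped_at, age_text):
    """Subtract a relative age such as '5 hours ago' from the scrape time."""
    match = AGE_PATTERN.search(age_text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    # timedelta takes plural keyword names: minutes, hours, days.
    return scraped_at - timedelta(**{unit + "s": amount})


scraped_at = datetime(2021, 12, 28, 6, 31, 23)
print(estimate_posted_at(scraped_at, "5 hours ago"))
print(estimate_posted_at(scraped_at, "No value"))
```

Because the age is rounded by Reddit to whole units, the result is only an estimate of the true posting time.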

In [None]:
df.head()

#### Drop posts with zero comments, a missing comment count or time value, or missing image and video indicators.

In [None]:
df = df[(df['number_comments'] != '0 comments') & (df['number_comments'] != 'No value')]

In [None]:
df = df[df['time'] != 'No value']

In [None]:
df = df[(~df['image'].isna()) & (~df['video'].isna())]
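After these filters, `number_comments` still holds text such as `0 comments` rather than a number. A minimal sketch of converting it to an integer follows; the abbreviated form `1.2k comments` is an assumption about how large counts are displayed, and `parse_comment_count` is a hypothetical helper that is not part of the notebook.

```python
import re

# Assumed formats: "0 comments", "1 comment", "1.2k comments".
COUNT_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([km]?)\s*comments?\s*", re.IGNORECASE)
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_comment_count(text):
    """Convert a comment label to an integer, or return None if it is not recognized."""
    match = COUNT_PATTERN.fullmatch(text)
    if match is None:
        return None
    number = float(match.group(1))
    suffix = match.group(2).lower()
    return int(round(number * MULTIPLIERS[suffix]))


for label in ["0 comments", "1 comment", "1.2k comments", "No value"]:
    print(label, "->", parse_comment_count(label))
```

Abbreviated counts lose precision, so values parsed from a `k` suffix are approximate.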

#### Reset the index and export the file that will be used as input data for the analysis.

In [None]:
df.reset_index(drop=True,inplace=True)

In [None]:
df.shape

In [None]:
df.to_csv('./data/web_scrapping_results.csv', index=False)