# Project 3 - Part 1 Web Scraping
by: Nah Wei Jie

## Table of Contents
This project is broken down into 3 different parts, each with its accompanying notebook.


**Part 1**

- Web scraping  

**Part 2** 

- Exploratory Data Analysis
- Data Cleaning
- Visualizations

**Part 3**
- Pre-processing
- Model Fit and Testing
- Model Iteration
- Model Evaluation
    

---

## Library Imports

In [1]:
# Basic Imports
import random
import time

# Processing imports
import pandas as pd # Version 1.2.4

#Scraping Imports
import requests # Version 2.25.1

----

## Web Scraping

The data was retrieved from two different subreddits, namely [**r/FanFiction**](https://www.reddit.com/r/FanFiction) and [**r/LifeAdvice**](https://www.reddit.com/r/LifeAdvice). The posts are returned as JSON, which parses directly into Python dictionaries and lists, so by relying on [**Pushshift's API**](https://github.com/pushshift/api) and the ```requests``` library we can easily extract the information from Reddit, retrieving up to 100 posts in each request.

To give us enough data in our corpus, the loop ran 10 times for each subreddit. The requests returned 998 posts for the FanFiction subreddit rather than the full 1000, as some requests came back with 99 posts.
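
As a rough illustration of that structure, the sketch below parses a hand-written JSON string instead of a live response. The payload and its field values are made up for this example; a real response carries many more keys per post.

```python
import json

# Illustrative payload only: these values are invented for the example.
sample_response = (
    '{"data": [{"id": "abc123", "title": "Example post", '
    '"created_utc": 1620000000}]}'
)

payload = json.loads(sample_response)  # in the notebook this step is res.json()
posts = payload["data"]                # a list with one dictionary per post
first_post = posts[0]

# Each post behaves like an ordinary Python dictionary
print(type(payload).__name__, type(posts).__name__, first_post["title"])
```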

For this project, we scraped about 2000 posts with 20 ```HTTP``` requests made to Reddit via the API. As the API returns at most 100 posts per request, collecting this many posts requires repeated requests, and it is considered [**good practice**](https://towardsdatascience.com/web-scraping-basics-82f8b5acd45c) to wait for a random interval before the start of each loop (see lines ```14/15``` in code block ```2```). This avoids visiting the API/Reddit too frequently, which may result in us getting blocked or banned, and it eases the traffic for normal visitors of Reddit.
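
The random interval can be looked at in isolation. The sketch below wraps the same `max(random.gauss(3, 2), 2)` expression used in the scraping loop; the function name `polite_delay` and its parameters are hypothetical names chosen for this example.

```python
import random

def polite_delay(mean=3.0, sd=2.0, floor=2.0, rng=random):
    """Return a random sleep duration in seconds, never shorter than `floor`."""
    # Draw from a normal distribution, then clamp anything below the floor
    return max(rng.gauss(mean, sd), floor)

# Preview a few delays without actually sleeping
delays = [round(polite_delay(), 2) for _ in range(5)]
print(delays)
```

With a mean of 3 and a standard deviation of 2, roughly 31% of draws fall below the floor and collapse to exactly 2 seconds, so the delays are less varied than the distribution alone suggests.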

For building larger datasets or scraping across a higher number of different subreddits in future, scraping with consideration for the target API/website may be of even higher importance, especially when we are doing it programmatically.


As we have already scraped both subreddits, some code blocks below have been commented out to prevent overwriting the existing data, which would affect our analysis. The output shown below each code block is the result of the run I made on my end. It is included here to show what the code does and that it works.

Viewers may uncomment the code blocks and run them, but be warned that this may take quite a while to finish depending on the viewer's network and computing resources.

### Scraping from FanFiction subreddit

In [2]:
# # Initializing our URL, params and empty lists to capture posts
# url = 'https://api.pushshift.io/reddit/search/submission'
# posts_ff = [] # Initialize an empty list to facilitate storing of results

# # Below we set the following parameters for Pushshift's API
# params = {
#     'subreddit': 'FanFiction', # We use this to inform the API which subreddit to scrape from
#     'size': 100, # We use this to inform the API how many posts to retrieve from Reddit (Hard Limited at 100)
#     'before': None, # We use this to provide an 'index' for the API to know where to continue from where it has previously stopped
# }

# # Below we use a for loop to execute the command 10 times to get ~ 1000 posts
# for i in range(10): # Loop 10 times
#     sleep_dur = max(random.gauss(3,2),2) # Generate a random sleep interval with minimal value being 2 seconds as sleep_dur
#     time.sleep(sleep_dur) # Sleep for duration to prevent too frequent request
#     res = requests.get(url, params) # Request data from Reddit
#     if res.status_code != 200:   #Print error if request fails
#         print('Status error', res.status_code)
#         break
#     else:
#         print(f'Request iteration: {i+1}\n Status code: {res.status_code}') # Print iteration count and corresponding status code
#     data = res.json() # Store results in data variable
#     posts = data['data'] # Store results into list
#     if len(posts) > 0: # Condition to check if our results are returning nothing
#         newbefore = posts[-1]['created_utc']
#         params['before'] = newbefore # Replace the value of 'created_utc' of last post in current request iteration as the new 'before' value in our parameters
#         posts_ff.extend(posts) # Add the 100 results of the current request loop into the posts list for FanFiction
#         print('Number of posts scraped: ', len(posts_ff))
#     else:
#         print('Request did not fetch any results')
# print('**End of scrape**')        
# ## Click on the ... bubble to see the output of each iterative scrape

Request iteration: 1
 Status code: 200
Number of posts scraped:  100
Request iteration: 2
 Status code: 200
Number of posts scraped:  200
Request iteration: 3
 Status code: 200
Number of posts scraped:  300
Request iteration: 4
 Status code: 200
Number of posts scraped:  400
Request iteration: 5
 Status code: 200
Number of posts scraped:  499
Request iteration: 6
 Status code: 200
Number of posts scraped:  599
Request iteration: 7
 Status code: 200
Number of posts scraped:  699
Request iteration: 8
 Status code: 200
Number of posts scraped:  799
Request iteration: 9
 Status code: 200
Number of posts scraped:  898
Request iteration: 10
 Status code: 200
Number of posts scraped:  998
**End of scrape**


We will now repeat the same steps for the LifeAdvice subreddit.

### Scraping from LifeAdvice subreddit

In [3]:
# # Initializing our URL, params and empty lists to capture posts
# url = 'https://api.pushshift.io/reddit/search/submission'
# posts_la = [] # Initialize an empty list to facilitate storing of results

# # Below we set the following parameters for Pushshift's API
# params = {
#     'subreddit': 'LifeAdvice', # We use this to inform the API which subreddit to scrape from
#     'size': 100, # We use this to inform the API how many posts to retrieve from Reddit (Hard Limited at 100)
#     'before': None, # We use this to provide an 'index' for the API to know where to continue from where it has previously stopped
# }

# # Below we use a for loop to execute the command 10 times to get ~ 1000 posts
# for i in range(10): # Loop 10 times
#     sleep_dur = max(random.gauss(3,2),2) # Generate a random sleep interval with minimal value being 2 seconds as sleep_dur
#     time.sleep(sleep_dur) # Sleep for duration to prevent too frequent request
#     res = requests.get(url, params) # Request data from Reddit
#     if res.status_code != 200:   #Print error if request fails
#         print('Status error', res.status_code)
#         break
#     else:
#         print(f'Request iteration: {i+1}\n Status code: {res.status_code}') # Print iteration count and corresponding status code
#     data = res.json() # Store results in data variable
#     posts = data['data'] # Store results into list
#     if len(posts) > 0: # Condition to check if our results are returning nothing
#         newbefore = posts[-1]['created_utc']
#         params['before'] = newbefore # Replace the value of 'created_utc' of last post in current request iteration as the new 'before' value in our parameters
#         posts_la.extend(posts) # Add the 100 results of the current request loop into the posts list for LifeAdvice
#         print('Number of posts scraped: ', len(posts_la))
#     else:
#         print('Request did not fetch any results')
# print('**End of scrape**')
# ## Click on the ... bubble to see output of each iterative scrape of 100 posts

Request iteration: 1
 Status code: 200
Number of posts scraped:  100
Request iteration: 2
 Status code: 200
Number of posts scraped:  200
Request iteration: 3
 Status code: 200
Number of posts scraped:  300
Request iteration: 4
 Status code: 200
Number of posts scraped:  400
Request iteration: 5
 Status code: 200
Number of posts scraped:  500
Request iteration: 6
 Status code: 200
Number of posts scraped:  600
Request iteration: 7
 Status code: 200
Number of posts scraped:  700
Request iteration: 8
 Status code: 200
Number of posts scraped:  800
Request iteration: 9
 Status code: 200
Number of posts scraped:  900
Request iteration: 10
 Status code: 200
Number of posts scraped:  1000
**End of scrape**
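
Since both subreddits are scraped with identical steps, the loop could be factored into one reusable function. The following is a minimal sketch, not the code that produced the output above: `fetch_page`, `scrape_subreddit`, and the `fetch` and `pause` parameters are hypothetical names, and failed requests raise an exception via `raise_for_status()` instead of printing the status code and breaking.

```python
import random
import time

PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission'

def fetch_page(params):
    """Request one page of posts and return the list under the 'data' key."""
    # Imported here so the pagination logic can be tried offline with a stand-in fetcher
    import requests
    res = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
    res.raise_for_status()  # Raise on any 4xx/5xx response
    return res.json()['data']

def scrape_subreddit(subreddit, n_requests=10, size=100, fetch=fetch_page, pause=True):
    """Collect up to n_requests * size posts, paging backwards in time."""
    params = {'subreddit': subreddit, 'size': size, 'before': None}
    posts = []
    for _ in range(n_requests):
        if pause:
            time.sleep(max(random.gauss(3, 2), 2))  # Same random delay as above
        page = fetch(params)
        if not page:  # Stop once the API returns nothing
            break
        # The oldest post on this page becomes the cutoff for the next request
        params['before'] = page[-1]['created_utc']
        posts.extend(page)
    return posts
```

Under these assumptions, `posts_ff = scrape_subreddit('FanFiction')` and `posts_la = scrape_subreddit('LifeAdvice')` would replace the two near-identical code blocks.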


In [4]:
# Reading our posts from each subreddit as dataframes
df_fanfic = pd.DataFrame(posts_ff)
df_lifeadv = pd.DataFrame(posts_la)

### Preliminary Observations

In [5]:
# Display number of rows, columns for fanfic dataframe
df_fanfic.shape

(998, 76)

In [6]:
# Display number of rows, columns for lifeadvice dataframe
df_lifeadv.shape

(1000, 69)

We have slightly more entries from the LifeAdvice subreddit (1000 versus 998), and the number of columns differs as well (69 versus 76). As different subreddits may have different rules and different features for their posts, these differences between the two are expected. In the next notebook, we will identify the columns that are of interest to us and ensure that both datasets contain them.
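
One way to see where the two sets of columns diverge is a plain set comparison. The sketch below is only an illustration: `compare_columns` is a hypothetical helper, and the short column lists stand in for the real dataframe columns.

```python
def compare_columns(cols_a, cols_b):
    """Return (shared, only_in_a, only_in_b) as sorted lists of column names."""
    set_a, set_b = set(cols_a), set(cols_b)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)

# Toy column lists for illustration, not the actual 76 and 69 columns
toy_fanfic = ['id', 'title', 'selftext', 'link_flair_text']
toy_lifeadv = ['id', 'title', 'selftext']

shared, only_fanfic, only_lifeadv = compare_columns(toy_fanfic, toy_lifeadv)
print('Shared:', shared)
print('Only in first:', only_fanfic)
print('Only in second:', only_lifeadv)
```

With the real data, the call would be `compare_columns(df_fanfic.columns, df_lifeadv.columns)`.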

To keep to the scope of this notebook, we will save the dataframes as they are now and leave the detailed cleaning and processing steps to the subsequent notebook.

### Exporting Data

In [7]:
## This code block has been commented out to prevent overwriting the .csv files in case the scraping code block has been run by the viewer

## Saving the dataframes into their corresponding named .csv files
# df_fanfic.to_csv('../datasets/fanfic.csv', index = False)
# df_lifeadv.to_csv('../datasets/lifeadv.csv', index = False)
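
An alternative to commenting out the export is to skip the write whenever the target file already exists. This is a sketch under that assumption; `save_if_absent` is a hypothetical helper and was not used to produce the saved datasets.

```python
from pathlib import Path

def save_if_absent(df, path):
    """Write df to path as CSV unless the file already exists.

    Returns True if the file was written and False if it was skipped.
    """
    path = Path(path)
    if path.exists():
        print(f'Skipped: {path} already exists')
        return False
    path.parent.mkdir(parents=True, exist_ok=True)  # Create the folder if needed
    df.to_csv(path, index=False)
    return True
```

For example, `save_if_absent(df_fanfic, '../datasets/fanfic.csv')` would leave an existing `fanfic.csv` untouched.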

--- End of Notebook ---

----