# Project 3: Reddit NLP - Data Collection

### Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It offers higher-level search and query functionality for Reddit comments and submissions, which simplifies data collection for analysis and modeling. In this notebook the API is queried with the `requests` library; each call returns a JSON response that can then be parsed for the data of interest.

Resources:

* Pushshift Endpoints: https://pushshift.io/
* Pushshift Documentation: https://github.com/pushshift/api
* Pushshift Subreddit (thread on pulling large amounts of data): https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/

In [2]:
import pandas as pd
import csv
import requests, json, time, re

## Setting the base query syntax:
Setting the query URLs for the Pushshift API, one per subreddit.


In [10]:
url_1= 'https://api.pushshift.io/reddit/search/comment/?subreddit=BlackPink'
url_2= 'https://api.pushshift.io/reddit/search/comment/?subreddit=bangtan'

In [11]:
params = {
          'sort':'desc',
          'size':10,
         }

response_1 = requests.get(url_1, params=params)
response_2 = requests.get(url_2, params=params)
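
For reference, `requests` appends the `params` dictionary to the query string already present in the URL. The sketch below reproduces that merge with the standard library's `urllib.parse`, so the final URL can be inspected without sending a request. `build_url` is a hypothetical helper, not part of `requests` or Pushshift, and it only approximates how `requests` encodes parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_url(base_url, params):
    """Merge extra query parameters into a URL that may already have a query string."""
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query)
    # Like requests, skip parameters whose value is None.
    query.extend((key, value) for key, value in params.items() if value is not None)
    return urlunsplit(parts._replace(query=urlencode(query)))

full_url = build_url(
    'https://api.pushshift.io/reddit/search/comment/?subreddit=BlackPink',
    {'sort': 'desc', 'size': 10, 'after': None},
)
print(full_url)
```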

With both requests made, checking the status code of each response.

In [13]:
response_1.status_code

200

In [14]:
response_2.status_code

200

Both requests returned status code 200, meaning the queries succeeded and there were no connection issues. Next, checking the number of records in the `data` field of the JSON response.

In [15]:
len(response_1.json()['data'])

10

As expected, the length is 10: these are the 10 comments pulled from the subreddit. Below is the first comment returned from the 'BlackPink' subreddit.

In [20]:
response_1.json()['data'][0]

{'author': 'iFappa',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_qb71o',
 'author_patreon_flair': False,
 'body': "You can expect some of them to have a more mellowed out Rock that leans more towards RnB. There are a ton of ways to do rock. Nonetheless, I'm very excited for the release. ",
 'created_utc': 1554339044,
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'ek2gz36',
 'is_submitter': False,
 'link_id': 't3_b95tgv',
 'no_follow': True,
 'parent_id': 't1_ek2exc0',
 'permalink': '/r/BlackPink/comments/b95tgv/190403_blackpink_kill_this_love_album_song/ek2gz36/',
 'retrieved_on': 1554339045,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'BlackPink',
 'subreddit_id': 't5_3f8po'}
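
Two kinds of fields in this record are easier to read after decoding. `created_utc` is a Unix timestamp in seconds, and the `t1_`, `t2_`, `t3_`, and `t5_` prefixes are Reddit's type codes for comments, accounts, links (submissions), and subreddits. A minimal sketch using values copied from the record above; `KIND_NAMES` and `describe_fullname` are hypothetical names introduced for illustration:

```python
from datetime import datetime, timezone

# Reddit "fullname" prefixes and the kind of object they identify.
KIND_NAMES = {'t1': 'comment', 't2': 'account', 't3': 'link', 't5': 'subreddit'}

def describe_fullname(fullname):
    """Split a fullname such as 't3_b95tgv' into (kind, id)."""
    prefix, _, item_id = fullname.partition('_')
    return KIND_NAMES.get(prefix, 'unknown'), item_id

created = datetime.fromtimestamp(1554339044, tz=timezone.utc)
print(created.isoformat())              # 2019-04-04T00:50:44+00:00
print(describe_fullname('t3_b95tgv'))   # ('link', 'b95tgv')
print(describe_fullname('t1_ek2exc0'))  # ('comment', 'ek2exc0')
```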

## Querying Reddit and saving raw data in .json format:
Writing a function that stamps file names with a unique Unix timestamp and records each stamped name in a log file.


In [23]:
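def filename_format_log(file_path,
                        logfile='./assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Resolve the timestamp at call time. A default of round(time.time())
    # would be evaluated only once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension: a '.' that is not part of a leading './' or '..'.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise ValueError('Please enter a relative path with a file extension.')

    # The stamp goes in front of the lowercase 'word_word' token that precedes a '.'.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError("File name must end in a lowercase 'word_word' token.")
    stamp = match.start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description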
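
To make the naming rule concrete, the sketch below isolates just the stamping step of `filename_format_log`, without writing to the log file. `stamp_name` is a hypothetical helper that reuses the same regular expression; the timestamp is passed in explicitly.

```python
import re

def stamp_name(file_path, now):
    """Insert `now` in front of the 'word_word' token that precedes a '.'."""
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a lowercase word_word token before the extension.')
    stamp = match.start()
    return f'{file_path[:stamp]}{now}_{file_path[stamp:]}'

print(stamp_name('./data/raw_submissions.json', 1549943009))
# ./data/1549943009_raw_submissions.json
print(stamp_name('./data/bp_bts_df.csv', 1549943009))
# ./data/bp_1549943009_bts_df.csv
```

Note that for a name containing more than one underscore, the stamp lands in the middle of the name rather than at its start.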

Writing a function that collects comments in batches and saves the raw data from each pull as JSON; parsing into a dataframe with the features of interest happens in a later step. The request loop was inspired by this thread: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [28]:
def reddit_query(url, n_samples, before=None, after=None):
    # Page backwards in time, starting from `before` (default: now).
    last_comment = before if before is not None else round(time.time())
    comment_list = []
    timestamp = None

    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')

        params = {
          'sort':'desc',
          'size':1000,
          'before':last_comment-1,
          'after':after,
         }

        try:
            response = requests.get(url, params=params)
        except requests.RequestException:
            return 'Error. Pull not completed.'
        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'

        posts = response.json()['data']
        if len(posts) == 0:
            # No older comments are available; stop instead of looping forever.
            break

        comment_list.extend(posts)
        # Results are sorted newest first, so the last item is the oldest.
        last_comment = timestamp = posts[-1]['created_utc']
        time.sleep(1)  # pause between requests
        run += 1

    if timestamp is None:
        return 'Error. No comments returned.'

    # The file name says "submissions", but the file holds comments;
    # the name is kept because later cells load it.
    formatted_name, now, file_description = filename_format_log(file_path='./data/raw_submissions.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)

    print(f'Saved and completed query and returned {len(comment_list)} comments.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
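
The paging logic is easier to check offline. The sketch below runs the same "walk backwards with `before`" loop against an in-memory list instead of the API. `fake_fetch` and `collect` are hypothetical stand-ins, and the page size of 3 is arbitrary.

```python
def fake_fetch(comments, before, size):
    """Stand-in for one API call: newest-first comments strictly older than `before`."""
    older = [c for c in comments if c['created_utc'] < before]
    older.sort(key=lambda c: c['created_utc'], reverse=True)
    return older[:size]

def collect(comments, n_samples, start, size=3):
    """Page backwards in time until n_samples are collected or the data runs out."""
    collected = []
    cursor = start
    while len(collected) < n_samples:
        page = fake_fetch(comments, before=cursor, size=size)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        cursor = page[-1]['created_utc']  # oldest timestamp on this page
    return collected

# Ten fake comments with timestamps 100, 101, ..., 109.
archive = [{'id': i, 'created_utc': 100 + i} for i in range(10)]
pulled = collect(archive, n_samples=7, start=200)
print([c['created_utc'] for c in pulled])
```

Because whole pages are appended, the loop can return more than `n_samples` items (nine here for a request of seven). A timestamp cursor like this can also miss comments that share a timestamp with the last item on a page.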

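Using the query function, I will collect 10k comments from the 'BlackPink' subreddit. The call is commented out so that rerunning the notebook does not trigger a new pull; the output shown is from the original run.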

In [29]:
# reddit_query(url=url_1,
#              n_samples=10000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 comments.
Reddit text is ready for processing.
Last timestamp was 1549943009.


Using the query function, I will collect 10k comments from the 'bangtan' subreddit.

In [30]:
# reddit_query(url=url_2,
#              n_samples=10000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 comments.
Reddit text is ready for processing.
Last timestamp was 1553656065.


Loading in the BlackPink samples from the saved .json file and confirming the number of comments.

In [3]:
with open('./data/1549943009_raw_submissions.json', 'r') as f:
    bp_sample_list = json.load(f)

In [4]:
len(bp_sample_list)

10000

Loading in the BTS (bangtan) samples from the saved .json file and confirming the number of comments.

In [6]:
with open('./data/1553656065_raw_submissions.json', 'r') as f:
    bts_sample_list = json.load(f)

In [7]:
len(bts_sample_list)

10000

Parsing the loaded JSON records into a dataframe containing the features of interest. The two keys of interest are 'body', which contains the text of each comment, and 'subreddit', which names the subreddit the comment came from.
I will also rename the 'subreddit' column to 'blackpink' and encode it as 1 for 'BlackPink' and 0 for 'bangtan'.

In [9]:
def reddit_parse(sample):
    
    col_list = [
                'body',
                'subreddit'
                ]
    
    comments_df = pd.DataFrame(sample)
    comments_df = comments_df[col_list]
    
    comments_df.rename(columns={'subreddit':'blackpink'}, inplace=True)
    comments_df['blackpink'] = comments_df['blackpink'].map({'bangtan':0, 'BlackPink':1})
    
    col_order = [
                 'body',
                 'blackpink'
                ]

    return comments_df[col_order]
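
A quick check of the parsing step on three hand-made records, assuming pandas is installed. It repeats the select, rename, and map steps from `reddit_parse` inline; the sample records are invented for illustration. Note that `Series.map` with a dictionary yields `NaN` for any subreddit name missing from the mapping.

```python
import pandas as pd

sample = [
    {'body': 'first comment', 'subreddit': 'BlackPink', 'score': 1},
    {'body': 'second comment', 'subreddit': 'bangtan', 'score': 3},
    {'body': 'third comment', 'subreddit': 'kpop', 'score': 2},  # not in the mapping
]

df = pd.DataFrame(sample)[['body', 'subreddit']]
df = df.rename(columns={'subreddit': 'blackpink'})
df['blackpink'] = df['blackpink'].map({'bangtan': 0, 'BlackPink': 1})

print(df)
print(df['blackpink'].isnull().sum())  # rows whose subreddit was not mapped
```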

In [10]:
bts_comments = reddit_parse(bts_sample_list)

In [11]:
bts_comments.head()

|   | body | blackpink |
|---|---|---|
| 0 | Sorry, Photoshop crashed before I could make t... | 0 |
| 1 | What's this? Smiles? I can finally go to bed h... | 0 |
| 2 | I get my DVD tomorrow so I can let you know if... | 0 |
| 3 | Just read Ep 15 of Webtoon. I AM NOT OKAY!!!!!... | 0 |
| 4 | That is amazing 😂 your manager is awesome! | 0 |


Reviewing the shape of the dataframe to ensure correct transformation. I will also delete duplicate comments from the dataframe.

In [12]:
bts_comments.shape

(10000, 2)

In [13]:
bts_comments.drop_duplicates(inplace=True)
bts_comments.shape

----

In [15]:
bp_comments = reddit_parse(bp_sample_list)

In [17]:
bp_comments.tail()

|   | body | blackpink |
|---|---|---|
| 9995 | Tickets when? | 1 |
| 9996 | I would have to go about 10 miles. | 1 |
| 9997 | EU they had 4 ticket max I think. | 1 |
| 9998 | Pretty much its if you own a car you go. I was... | 1 |
| 9999 | ps y’all ever went to a concert by urself ? i ... | 1 |


In [16]:
bp_comments.shape

(10000, 2)

In [18]:
bp_comments.drop_duplicates(inplace=True)
bp_comments.shape

----

Now that we have parsed both JSON files into their respective dataframes, we will concatenate the two separate dataframes into a single combined dataframe.

In [20]:
# ignore_index=True already assigns a fresh 0..n-1 index, so no reset_index is needed.
bp_bts_df = pd.concat([bp_comments, bts_comments], ignore_index=True, sort=False)

In [21]:
bp_bts_df.head()

|   | body | blackpink |
|---|---|---|
| 0 | This is something I can get behind and appreci... | 1 |
| 1 | Hold the fuck up?! Rock songs? I'm considering... | 1 |
| 2 | What time and date is this in PDT?\n\n0 AM is ... | 1 |
| 3 | Is there a list? Last year I remember they str... | 1 |
| 4 | As a blink I'll wait till mv dropped. Then I'l... | 1 |


Of the original 20k comments pulled from the two subreddits, 19,451 remain; 549 were removed as duplicates. The checks below show that the classes are still roughly balanced (about 50.5% 'bangtan' and 49.5% 'BlackPink') and that there are no nulls, so we can proceed with our EDA.

In [59]:
bp_bts_df.shape

(19451, 2)

In [64]:
bp_bts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19451 entries, 0 to 19450
Data columns (total 2 columns):
body         19451 non-null object
blackpink    19451 non-null int64
dtypes: int64(1), object(1)
memory usage: 304.0+ KB


In [61]:
bp_bts_df.isnull().sum()

body         0
blackpink    0
dtype: int64

In [65]:
bp_bts_df['blackpink'].value_counts(normalize=True)

0    0.504807
1    0.495193
Name: blackpink, dtype: float64
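
The proportions above can be converted back into row counts as a sanity check. This is plain arithmetic on the numbers printed in this notebook (20,000 comments pulled, 19,451 rows kept); the variable names are illustrative.

```python
total_pulled = 20_000       # 10,000 comments per subreddit
rows_kept = 19_451          # bp_bts_df.shape[0]
share_bangtan = 0.504807    # value_counts(normalize=True) for label 0

removed = total_pulled - rows_kept
bangtan_rows = round(share_bangtan * rows_kept)
blackpink_rows = rows_kept - bangtan_rows

print(removed)                       # 549
print(bangtan_rows, blackpink_rows)  # 9819 9632
# Duplicates dropped per subreddit, given 10,000 pulled from each.
print(10_000 - bangtan_rows, 10_000 - blackpink_rows)  # 181 368
```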

Using a regex to collapse any run of one or more whitespace characters into a single space.

In [66]:
bp_bts_df['body'] = bp_bts_df.body.map(lambda x: re.sub(r'\s+', ' ', x))
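
A small illustration of what this substitution does, using a made-up string. `\s+` matches runs of spaces, tabs, and newlines, so multi-line comments collapse onto one line; trailing whitespace becomes a single space rather than being removed.

```python
import re

raw = "Tickets   when?\n\nSoon,\tI hope. "
clean = re.sub(r'\s+', ' ', raw)

print(repr(clean))
# 'Tickets when? Soon, I hope. '
print(repr(clean.strip()))  # strip() would also drop the trailing space
# 'Tickets when? Soon, I hope.'
```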

Saving out the combined dataframe as a csv for data cleaning and exploratory data analysis in the next notebook.

In [69]:
formatted_name, now, file_description = filename_format_log(file_path ='./data/bp_bts_df.csv')
bp_bts_df.to_csv(formatted_name, index=False)

In [26]:
bp_bts_df[bp_bts_df.body.str.contains("cute concepts wouldn't do well here")]

|   | body | blackpink |
|---|---|---|
| 8834 | Totally agree with this. DDDD/PWF/Whistle or e... | 1 |


In [27]:
bp_bts_df.body[8834]

"Totally agree with this. DDDD/PWF/Whistle or even Really are definitely the most Western-influenced and would've been the best choices to introduce BP to the US. Also agree that cute concepts wouldn't do well here. "

## **Please continue to Notebook 02: Data Cleaning & EDA**