**Don't Forget:** Import the necessary libraries with our usual import statements:

In [1]:
import requests
import pandas as pd
import numpy as np
import time

Since our target is finding posts that correlate with the ESFP subreddit, let's begin by scraping it. The function below takes the name of a subreddit and returns a collection of up to 2,200 posts: one initial pull of 100 posts, followed by 21 further pulls of 100 each. To keep the data consistent when the Jupyter notebook is run on different days, the initial pull uses a hardcoded 'before' benchmark, taken from the UTC date of the first post the first time I pulled. After that, each pull inside the for loop sets its 'before' parameter to the UTC date of the last post collected so far, which is the last of the most recently pulled 100 posts.

**Important!** Do not forget to incorporate time.sleep into the loop. This keeps the scraping respectful of the site's limits, and without it the function runs the risk of returning a JSON error.

In [2]:
def get_jumbo_posts(subreddit, n=21):
    
    #The first part of the function is our initial pull of 100 posts. As mentioned, 'before' is set to a
    #hardcoded UTC timestamp so the data stays the same from day to day. The loop below updates its own
    #'before' parameter and continues the scraping, since the API only allows 100 posts per request
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
    'subreddit': subreddit,
    'size': 100,
    'before': 1611890622}
    first_res = requests.get(url, params)
    first_data =first_res.json()
    first_posts = first_data['data']
    
    #Start of the for loop over the n parameter above. Each pull updates its parameters
    #accordingly, letting us keep pulling 100 posts and EXTENDING (not appending) our post list
    for i in range(n): 
        parammies = {
        'subreddit':subreddit,
        'size': 100,
        'before': first_posts[-1].get('created_utc')    
        }
        res = requests.get(url, parammies)
        data = res.json()
        posts = data['data']
        first_posts.extend(posts)
        
        #as mentioned above, don't forget to include time.sleep, I have set it for intervals of 3 seconds here
        time.sleep(3)
   
    #finally returning the full package of 2200 posts from the subreddit:    
    return first_posts
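
The 'before' cursor logic described above can be sketched on its own, without any network access. This is a minimal illustration and not the function used in this notebook: `collect_posts` and `fake_fetch` are hypothetical names, the fake fetcher stands in for the real HTTP request, and pages are assumed to arrive newest first (the same assumption the loop above makes when it reads the last post's `created_utc`). Unlike the original loop, this sketch stops early when a page comes back empty.

```python
from typing import Callable, Dict, List


def collect_posts(fetch_page: Callable[[int], List[Dict]],
                  before: int,
                  max_pages: int = 22) -> List[Dict]:
    """Walk backwards in time one page at a time.

    fetch_page(before) is assumed to return posts older than `before`,
    newest first. Stops after max_pages or at the first empty page.
    """
    posts: List[Dict] = []
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            # nothing older is left, so further requests would be wasted
            break
        posts.extend(page)
        # the oldest post on this page becomes the next cursor
        before = page[-1]["created_utc"]
    return posts


def fake_fetch(before: int, page_size: int = 3) -> List[Dict]:
    """Hypothetical stand-in for the API: posts exist at timestamps 1..10."""
    stamps = [t for t in range(10, 0, -1) if t < before]
    return [{"created_utc": t} for t in stamps[:page_size]]


demo = collect_posts(fake_fetch, before=11)
print(len(demo), [p["created_utc"] for p in demo])
```

In the notebook itself, the fetcher would wrap the requests.get call and the time.sleep pause; keeping it as a separate argument simply makes the paging logic easy to check in isolation.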

Now let's run the function for our first subreddit: ESFP. Simply input the corresponding subreddit tag (found at the end of the corresponding url). This will let the function know what subreddit to pull from as seen above:

In [3]:
posts_chunk = get_jumbo_posts('ESFP')

Let's save our scraped data into an easier to view dataframe. We can do this by using the pd.DataFrame method shown below:

In [4]:
df=pd.DataFrame(posts_chunk)

In [5]:
#Investigating our initial dataframe and observing the shape (shown below):
df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_id,brand_safe,approved_at_utc,banned_at_utc,view_count,author_created_utc,banned_by,distinguished,mod_reports,user_reports
0,[],False,ToegapBananaboat,,[],,text,t2_1jr8a7v1,False,False,...,,,,,,,,,,
1,[],False,TemporaryParking2868,,[],,text,t2_9iy8h07m,False,False,...,,,,,,,,,,
2,[],False,tomatoesinthailand,,[],,text,t2_9u0w35h2,False,False,...,,,,,,,,,,
3,[],False,Ilnurno,,[],,text,t2_5fhdr3i6,False,False,...,,,,,,,,,,
4,[],False,jrhernandez,,[],,text,t2_13ylwk,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1973,,,[deleted],,,,,,,,...,,,,,,,,,[],[]
1974,,,[deleted],,,,,,,,...,,,,,,,,,[],[]
1975,,,[deleted],,,,,,,,...,,,,,,,,,[],[]
1976,,,[deleted],,,,,,,,...,,,,,,,,,[],[]


**Filtering Down to the Columns of Interest:** As we can see from our first look, quite a few columns are "useless" to us, in the sense that they don't hold any text worth analyzing. From an NLP standpoint, the three columns we are interested in are subreddit, selftext, and title. Let's grab the subreddit column so we know where each post comes from, selftext since it holds the bulk of the information, and title in case it holds anything relevant, since some users have a tendency to write their whole post in the title.

In [6]:
features = ['subreddit', 'selftext', 'title']

In [7]:
#setting a new dataframe with our filtered features:
df_filter = df[features]

In [8]:
#checking for null values:
df_filter.isnull().sum()

subreddit    0
selftext     1
title        0
dtype: int64

In [9]:
#dropping nulls with .dropna, and then confirming nulls have been dropped:
df_filter.dropna(inplace=True)
df_filter.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filter.dropna(inplace=True)


subreddit    0
selftext     0
title        0
dtype: int64
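
The warning printed above appears because df_filter was created by selecting columns from df, and dropna(inplace=True) then modifies that selection in place. One common way to sidestep it, sketched here on a small made-up frame (`toy` and its values are illustrative, not scraped data), is to take an explicit copy and reassign the result of dropna instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# small stand-in for the scraped dataframe (values are made up)
toy = pd.DataFrame({
    "subreddit": ["ESFP", "ESFP", "ESFP"],
    "selftext": ["hello", np.nan, "world"],
    "title": ["a", "b", "c"],
    "score": [1, 2, 3],
})

features = ["subreddit", "selftext", "title"]

# .copy() makes the filtered frame independent of the original
toy_filter = toy[features].copy()

# reassigning avoids modifying a possible view in place
toy_filter = toy_filter.dropna()

null_counts = toy_filter.isnull().sum()
print(null_counts)
```

The original frame is left untouched, so the row with the missing selftext is still available there if it is needed later.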

While we have dropped nulls, we haven't addressed posts that have been **removed** or **deleted** by the user themselves. In these cases the cells do not hold NaN values; instead they show up as removed or deleted in text format. As with the nulls, we can drop these rows, since they don't hold anything of value for our analysis and will just clutter up our models and data.

In [10]:
#checking for rows with remove:
df_filter[df_filter['selftext'].str.contains('remove')].shape

(14, 3)

One easy way to remove these rows is to overwrite the dataframe with a filtered version of itself, using .str.contains('remove') together with the ~ operator to exclude the matches:

In [11]:
#remove 'removed' rows
df_filter=df_filter[~df_filter.selftext.str.contains('remove')]

In [12]:
#check to see if it worked (# of rows should decrease accordingly):
df_filter

Unnamed: 0,subreddit,selftext,title
0,ESFP,,[survey] MBTI—Enneagram Tritype Correlation
1,ESFP,just for fun\n\n[View Poll](https://www.reddit...,Zodiac Sign
2,ESFP,We currently have a very active MBTI discord s...,Any ESFPs (or others) here would like to join ...
3,ESFP,I have seen someone in r/ESTP saying that(i gu...,Are there anyone in romantic relatioship with ...
4,ESFP,"Like, a lot of people see (or refer to) the ES...","What is your idea of the ""party girl""?"
...,...,...,...
1973,ESFP,Are we ESFPs that much of a minority on Reddit...,Why is this subreddit so underpopulated?
1974,ESFP,,A study found that the only leisure pastime fo...
1975,ESFP,,Asking the ESFP if the water is half-full or h...
1976,ESFP,,Famous ESFPs - CelebrityTypes.com


In [13]:
#same thing for deleted posts denoted as [deleted]:
df_filter[df_filter['selftext'].str.contains('deleted')].shape

(65, 3)

In [14]:
#filtering out with ~ exclusion:
df_filter=df_filter[~df_filter.selftext.str.contains('deleted')]
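
One thing to keep in mind with the substring filters above is that .str.contains('remove') also matches ordinary posts that happen to use the word. A stricter alternative, sketched below on made-up rows, is to match the placeholder text exactly with isin. The exact placeholder strings '[removed]' and '[deleted]' are an assumption here and worth confirming against the scraped data before relying on them.

```python
import pandas as pd

# made-up rows: two placeholders, one genuine post that mentions "removed"
toy = pd.DataFrame({
    "selftext": [
        "[removed]",
        "[deleted]",
        "I removed the poll, sorry",
        "just for fun",
        "",
    ],
})

# assumed exact placeholder strings
placeholders = ["[removed]", "[deleted]"]

# substring match: also flags the genuine post in row 2
substring_mask = toy["selftext"].str.contains("remove")

# exact match: flags only the placeholder rows
exact_mask = toy["selftext"].isin(placeholders)

kept = toy[~exact_mask]
print(kept)
```

Both placeholders are handled in one pass, and the genuine post survives the filter.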

In [15]:
#sanity check:
df_filter

Unnamed: 0,subreddit,selftext,title
0,ESFP,,[survey] MBTI—Enneagram Tritype Correlation
1,ESFP,just for fun\n\n[View Poll](https://www.reddit...,Zodiac Sign
2,ESFP,We currently have a very active MBTI discord s...,Any ESFPs (or others) here would like to join ...
3,ESFP,I have seen someone in r/ESTP saying that(i gu...,Are there anyone in romantic relatioship with ...
4,ESFP,"Like, a lot of people see (or refer to) the ES...","What is your idea of the ""party girl""?"
...,...,...,...
1973,ESFP,Are we ESFPs that much of a minority on Reddit...,Why is this subreddit so underpopulated?
1974,ESFP,,A study found that the only leisure pastime fo...
1975,ESFP,,Asking the ESFP if the water is half-full or h...
1976,ESFP,,Famous ESFPs - CelebrityTypes.com


In [16]:
#saving our new dataframe to something more meaningful in the name:
df_esfp=df_filter

Since we are analyzing text, we don't want any unnecessary characters in it. This includes the 'white spaces' denoted as \n, which are newline characters. We can see below that quite a few of our rows contain them:

In [17]:
df_esfp[df_esfp['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title
1,ESFP,just for fun\n\n[View Poll](https://www.reddit...,Zodiac Sign
9,ESFP,Hey ESFP fam. Feel like mixing it up today and...,Silly ESFP Trivial/Random Pet Peeves
10,ESFP,"So, as we know, ESFP's are kind of known for l...",Living in the Moment
11,ESFP,Hey there! Looking for an MBTI Discord to be a...,MBTI Central Discord
14,ESFP,"Hello, I (infj) like to watch movies with my ...",What are your favorite movies?
...,...,...,...
1954,ESFP,"I'm a first year high school teacher, and whil...",Any ESFP Teachers out there?
1955,ESFP,~Do you have any advice on how to be more soci...,How to be more social?
1962,ESFP,Hi everyone.\n\n,Hello everyone! Let's get to know each other!
1963,ESFP,It's the morning and I'm wondering are ESFP up...,Good morning


This can be fixed quickly by using the string replace method to replace every \n with an empty string. We deliberately do not insert a space here; note that this joins the text on either side of each line break directly.

In [18]:
#replacing white space:
df_esfp['selftext']=df_esfp['selftext'].str.replace('\n', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_esfp['selftext']=df_esfp['selftext'].str.replace('\n', '')


In [19]:
#check to confirm
df_esfp[df_esfp['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title
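
To see what the replacement does to word boundaries, here is a small comparison on two made-up strings shaped like the rows above. It contrasts replacing \n with an empty string against an alternative that collapses any run of whitespace into a single space. The alternative is only a sketch of another option, not what this notebook uses.

```python
import pandas as pd

# made-up strings shaped like the selftext values above
text = pd.Series(["just for fun\n\n[View Poll]", "Hi everyone.\n\n"])

# approach used in this notebook: drop the newline characters entirely
joined = text.str.replace("\n", "", regex=False)

# alternative: collapse whitespace runs to one space, then trim the ends
spaced = text.str.replace(r"\s+", " ", regex=True).str.strip()

print(joined.tolist())
print(spaced.tolist())
```

Which version is preferable depends on the tokenizer used later: with the empty-string approach, "fun" and "[View" end up as one run of characters, while the whitespace-collapsing version keeps them apart.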


Since we have lost some rows during cleaning, our index no longer lines up with our shape. To be practical, let's reset the index, using drop=True to keep an unwanted additional index column from making its way into our final dataframe.

In [20]:
#reset our frame:
df_esfp.reset_index(drop=True, inplace=True)
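
The effect of the reset is easy to see on a tiny made-up frame. This sketch uses reassignment rather than inplace=True, which is a stylistic choice and gives the same result.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b", "c", "d"]})

# dropping a row leaves a gap in the index
toy = toy[toy["title"] != "b"]
before_reset = list(toy.index)

# drop=True discards the old index instead of keeping it as a column
toy = toy.reset_index(drop=True)

print(before_reset, list(toy.index), list(toy.columns))
```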

### CSV EXPORT BELOW

Now let's save our initially cleaned dataframe for EDA (and further cleaning). This way we can simply import it in a different notebook instead of running the above over and over each time. Now it's onto the next subreddit!

In [21]:
# df_esfp.to_csv('esfp.csv')
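
One detail worth knowing before importing the file in another notebook: to_csv writes the index as an extra, unnamed column by default, which pandas labels 'Unnamed: 0' when the file is read back. The sketch below shows the difference on a one-row made-up frame, writing to an in-memory buffer instead of a real file; passing index=False is one way to avoid the extra column.

```python
import io

import pandas as pd

toy = pd.DataFrame({"subreddit": ["ESFP"], "selftext": ["hi"], "title": ["t"]})

# default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```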