**Import the same libraries as in the last notebook**

In [1]:
import requests
import pandas as pd
import numpy as np
import time

Since our classifier needs posts from two subreddits, we repeat the scraping steps from the ESFP notebook. We can reuse the same function from before and pass in a different subreddit:

In [2]:
def get_big_posts(subreddit, n=19):
    #The first part of the function is our initial pull of 100 posts. As mentioned, `before` is set to a
    #hardcoded UTC timestamp so the data stays the same no matter which day the scrape runs. The loop below updates
    #the `before` parameter on each pass and continues scraping, since the API only allows 100 posts per request
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
    'subreddit': subreddit,
    'size': 100,
    'before': 1611890622}
    first_res = requests.get(url, params)
    first_data =first_res.json()
    first_posts = first_data['data']
    
    #Start of the for loop over the n parameter above. Each pass sets `before` to the timestamp of the oldest post
    #collected so far, pulls another 100 posts, and EXTENDS (not appends) our post list with them
    for i in range(n): 
        parammies = {
        'subreddit':subreddit,
        'size': 100,
        'before': first_posts[-1].get('created_utc')    
        }
        res = requests.get(url, parammies)
        data = res.json()
        posts = data['data']
        first_posts.extend(posts)
        
        #as mentioned above, don't forget to include time.sleep; I have set it to pause 3 seconds between requests
        time.sleep(3)
   
    #finally returning the full package of up to 2000 posts (100 + 19 * 100) from the subreddit:
    return first_posts
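
The `before` cursor logic is easier to see without the network calls. Below is a minimal sketch, not the scraper itself: `paginate` and `fake_fetch` are hypothetical names introduced for illustration, and `fake_fetch` stands in for the API under the assumption that each page comes back newest first (which is what taking `first_posts[-1]` as the oldest post relies on). The sketch also stops early if a page comes back empty, which the function above does not do.

```python
def paginate(fetch_page, n=19, start_before=1611890622):
    """Collect up to (n + 1) pages by walking the `before` cursor backwards in time."""
    posts = fetch_page(start_before)
    for _ in range(n):
        if not posts:
            break
        # the last post of the newest-first list is the oldest one seen so far
        page = fetch_page(posts[-1]['created_utc'])
        if not page:
            break  # nothing older left to pull
        posts.extend(page)
    return posts


def fake_fetch(before, size=100):
    """Hypothetical stand-in for the API: one post per second, newest first."""
    oldest = max(before - 1 - size, 0)
    return [{'created_utc': t} for t in range(before - 1, oldest, -1)]


posts = paginate(fake_fetch)
print(len(posts))  # 1 initial pull + 19 loop pulls, 100 posts each
```

With the default `n=19` this gives 20 pulls of 100 posts, so at most 2000 posts in total.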
        
        

**Disclaimer: The remaining cleaning steps in this notebook are the same as in the ESFP notebook, so for a detailed step by step, please refer to notebook 01-ESFP-Scraping:**

In [3]:
posts_chunk = get_big_posts('intj')

In [4]:
df=pd.DataFrame(posts_chunk)

In [5]:
#View our scrape:
df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,media,media_embed,secure_media,secure_media_embed,banned_by,link_flair_css_class,gallery_data,is_gallery,media_metadata,author_cakeday
0,[],False,Whiskey_Delta20,,[],ed32d792-b05b-11ea-a6be-0edb91bbfad7,INTJ - Male,dark,text,t2_9yis0yq1,...,,,,,,,,,,
1,[],False,Lost-Championship-74,,[],f58c4db0-b05b-11ea-ad84-0e8746feb23b,INTJ - Female,dark,text,t2_7xevvfgw,...,,,,,,,,,,
2,[],False,rootseat,,[],,,,text,t2_5nlnmotk,...,,,,,,,,,,
3,[],False,mr_freeman,,[],,,,text,t2_3trq8,...,,,,,,,,,,
4,[],False,potentialpotato,,[],,,,text,t2_fwuf2,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,[],False,manicpixieleangirl,,[],,,,text,t2_7yi9sspw,...,,,,,,,,,,
1996,[],False,Bravo-Panda,,[],,,,text,t2_7lxzx0yl,...,,,,,,,,,,
1997,[],False,dreaming17,,[],,,,text,t2_9ei0hzuc,...,,,,,,,,,,
1998,[],False,iwantansi,,[],,INTJ,dark,text,t2_685kd,...,,,,,,,,,,


In [6]:
#features to filter:
features = ['subreddit', 'selftext', 'title']

In [7]:
#filter to new dataframe (.copy() makes it independent of df, so the in-place dropna below does not raise a SettingWithCopyWarning)
df_filter = df[features].copy()

In [8]:
#dropping all nulls:
df_filter.dropna(inplace=True)


In [9]:
#confirming dropped nulls:
df_filter.isnull().sum()

subreddit    0
selftext     0
title        0
dtype: int64

In [10]:
#check for user removed posts:
df_filter[df_filter['selftext'].str.contains('remove')].shape

(128, 3)

In [11]:
#check for user deleted posts:
df_filter[df_filter['selftext'].str.contains('deleted')].shape

(53, 3)
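
One thing to keep in mind with these checks: `str.contains('remove')` matches any post whose body mentions the word, not only placeholder posts. The sketch below uses a toy dataframe (made-up rows, not the scrape) to show an exact-match alternative. It assumes the placeholders are stored literally as `[removed]` and `[deleted]`, which should be verified against the real data before relying on it.

```python
import pandas as pd

# toy data for illustration only
toy = pd.DataFrame({
    'selftext': ['[removed]', '[deleted]', 'How do I remove a splinter?',
                 'I deleted my account', 'normal post'],
    'title': list('abcde'),
})

# substring match: also flags the genuine post about a splinter
substring_hits = toy['selftext'].str.contains('remove').sum()

# exact match on the assumed placeholder strings
placeholders = ['[removed]', '[deleted]']
is_placeholder = toy['selftext'].str.strip().isin(placeholders)
kept = toy[~is_placeholder]

print(substring_hits, is_placeholder.sum(), len(kept))
```

Here the exact match keeps the two genuine posts that merely mention removing or deleting something.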

In [12]:
#remove 'removed' posts:
df_filter=df_filter[~df_filter.selftext.str.contains('remove')]

In [13]:
#should reduce accordingly:
df_filter.shape

(1868, 3)

In [14]:
#remove 'deleted' posts:
df_filter=df_filter[~df_filter.selftext.str.contains('deleted')]

In [15]:
#filtered dataframe with deleted, removed, null values removed:
df_filter

Unnamed: 0,subreddit,selftext,title
0,intj,"Every time I use the word, “just”, that post i...","Someone posted about the words, just &amp; act..."
1,intj,My ex boyfriend (we dated 7 months) said that ...,"This is definitely an overreaction, right?"
2,intj,"It's a common enough aphorism, but I need a bi...","What does the phrase ""Do what makes you happy""..."
3,intj,"My wife asked me last night, just as a hypothe...",Three wishes: a hypothetical question
4,intj,"Hello! I am writing this for my own purposes, ...",What kind of gifts and gestures do you appreci...
...,...,...,...
1994,intj,I am so sick and tired of my explosive Ne that...,How do I become you? (INTP)
1996,intj,,I’m in this picture and I don’t like it!
1997,intj,anyone else paranoid about running into someon...,intj paranoia?
1998,intj,,Interesting new app to go beyond MBTI - im not...
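
Rows 1996 and 1998 above still have an empty `selftext` even after `dropna`, because an empty string is not a null value. A minimal sketch on toy data (made-up rows) of how such blank posts could be counted and filtered, should they need to go as well:

```python
import pandas as pd

# toy data for illustration only: one real post, one empty, one whitespace-only, one null
toy = pd.DataFrame({
    'selftext': ['some text', '', '   ', None],
    'title': ['a', 'b', 'c', 'd'],
})

after_dropna = toy.dropna()                         # removes only the None row
blank = after_dropna['selftext'].str.strip().eq('') # True for '' and '   '
nonblank = after_dropna[~blank]

print(len(after_dropna), blank.sum(), len(nonblank))
```

Whether to drop these rows is a modelling choice, since the title may still carry useful text.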


In [16]:
#save to sensible name:
df_intj = df_filter

In [17]:
#checking for newline characters:
df_intj[df_intj['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title
0,intj,"Every time I use the word, “just”, that post i...","Someone posted about the words, just &amp; act..."
1,intj,My ex boyfriend (we dated 7 months) said that ...,"This is definitely an overreaction, right?"
2,intj,"It's a common enough aphorism, but I need a bi...","What does the phrase ""Do what makes you happy""..."
3,intj,"My wife asked me last night, just as a hypothe...",Three wishes: a hypothetical question
4,intj,"Hello! I am writing this for my own purposes, ...",What kind of gifts and gestures do you appreci...
...,...,...,...
1991,intj,Hear me out- I have been trying to decide this...,Should I (24F) go back to school.. or not?
1992,intj,So my family never understood me ever since I ...,My POV as a kid
1994,intj,I am so sick and tired of my explosive Ne that...,How do I become you? (INTP)
1997,intj,anyone else paranoid about running into someon...,intj paranoia?


In [18]:
#no newline characters in title
df_intj[df_intj['title'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title


In [19]:
#replacing newline characters with empty strings:
df_intj['selftext']=df_intj['selftext'].str.replace('\n', '')

In [20]:
#check to make sure
df_intj[df_intj['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title
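
A side effect of replacing `'\n'` with an empty string is that the last word of one line gets glued to the first word of the next. The sketch below, on a toy series (made-up text), compares that with replacing runs of whitespace by a single space. This is an alternative to consider, not what the cell above does.

```python
import pandas as pd

# toy data for illustration only
s = pd.Series(['first line\nsecond line', 'para one\n\npara two', 'no newline'])

# same idea as the cell above: newlines removed, words run together
glued = s.str.replace('\n', '', regex=False)

# alternative: collapse any run of whitespace (including newlines) into one space
spaced = s.str.replace(r'\s+', ' ', regex=True).str.strip()

print(glued[0])
print(spaced[0])
```

Keeping word boundaries matters later if the text is tokenized on whitespace.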


In [21]:
#reset our index, same as before:
df_intj.reset_index(drop=True, inplace=True)

### CSV EXPORT BELOW

In [22]:
# Just for overwriting purposes, I have commented out the csv export:
# df_intj.to_csv('intj.csv')
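
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch on a toy dataframe, using an in-memory buffer instead of a real file so nothing gets overwritten:

```python
import io
import pandas as pd

# toy data for illustration only
toy = pd.DataFrame({'subreddit': ['intj', 'intj'],
                    'selftext': ['a', 'b'],
                    'title': ['t1', 't2']})

# default export: the index is written as an extra column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with = list(pd.read_csv(buf_with_index).columns)

# index=False: only the three feature columns are written
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
cols_without = list(pd.read_csv(buf).columns)

print(cols_with)
print(cols_without)
```

Since the index was already reset above, passing `index=False` on export loses no information.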