# Project 3: 1. Webscraping

### Classification models using Natural Language Processing (NLP)

1. Webscraping
2. Data Cleaning
3. Modelling

Description: Two subreddits have been selected for classification: <br>
<br>
1) Tales From Retail <br>
2) Tales From Tech Support <br>
<br>
<b> Both subreddits consist of posts where people rant about bad experiences while working in these respective sectors. </b>

### Webscraping using Reddit API

Importing relevant libraries for webscraping:

In [1]:
import pandas as pd
import requests
import numpy as np
import time

Setting the URLs of the subreddits to feed into the function:

In [2]:
TFR = 'https://www.reddit.com/r/TalesFromRetail.json'
TFTS = 'https://www.reddit.com/r/TalesFromTechSupport.json'

Defining the function to scrape our data from the subreddits we want to use:

In [3]:
def get_posts(url,headers = {'User-agent':'Bleep borp bot 1.0'},loops=2):
    posts = []
    names = []
    titles = []
    subreddit = []
    aft_name=None

    for i in range(loops):
        if aft_name is None:
            params={}
        else:
            params={'after':aft_name}

        req = requests.get(url,params=params,headers=headers)

        if req.status_code == 200:
            the_json = req.json()
            # Each child of the listing wraps one post under its 'data' key
            for child in the_json['data']['children']:
                post = child['data']
                names.append(post['name'])
                titles.append(post['title'])
                posts.append(post['selftext'])
                subreddit.append(post['subreddit'])
            # Token for the next page, read once per response
            aft_name = the_json['data']['after']
        else:
            print(req.status_code)
            break
            
        time.sleep(np.random.randint(1,5))
    
    posts_df = pd.DataFrame({'names':names,
                         'titles':titles,
                         'posts':posts,
                         'subreddit':subreddit},columns = ['names','titles','posts','subreddit'])
    
    return posts_df
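
The pagination in `get_posts` can be tried without touching the network. The sketch below is illustrative only: `parse_listing`, `collect_pages`, `fake_fetch` and the two-page `FAKE_PAGES` data are hypothetical names and values, built to mimic the `data` / `children` / `after` layout that `get_posts` reads. Unlike `get_posts`, this sketch stops as soon as the `after` token is `None`.

```python
def parse_listing(listing):
    """Pull the four fields kept by get_posts out of one listing page."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        rows.append({'names': post['name'],
                     'titles': post['title'],
                     'posts': post['selftext'],
                     'subreddit': post['subreddit']})
    # The 'after' token identifies where the next page starts
    return rows, listing['data']['after']


def collect_pages(fetch_page, loops=2):
    """Follow the 'after' token for at most `loops` pages."""
    rows, after = [], None
    for _ in range(loops):
        page_rows, after = parse_listing(fetch_page(after))
        rows.extend(page_rows)
        if after is None:  # no further page to request
            break
    return rows


def make_child(name):
    """Build one made-up post in the nested shape used by the listing."""
    return {'data': {'name': name,
                     'title': 'Title of ' + name,
                     'selftext': 'Text of ' + name,
                     'subreddit': 'TalesFromRetail'}}


# Hypothetical two-page listing, keyed by the 'after' token that requests it
FAKE_PAGES = {
    None: {'data': {'children': [make_child('t3_a'), make_child('t3_b')],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [make_child('t3_c')],
                      'after': None}},
}


def fake_fetch(after):
    """Stand-in for requests.get(...).json(); returns a canned page."""
    return FAKE_PAGES[after]


rows = collect_pages(fake_fetch, loops=5)
print([row['names'] for row in rows])  # ['t3_a', 't3_b', 't3_c']
```

Passing the fetching step in as a function keeps the paging logic testable on its own; in real use it would wrap the `requests.get` call shown above.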

Scraping 40 pages from each subreddit and checking the length of the freshly scraped DataFrames:

In [4]:
TFR_df = get_posts(TFR,loops=40)

In [5]:
TFTS_df = get_posts(TFTS,loops=40)

In [6]:
len(TFR_df)

999

In [7]:
len(TFTS_df)

1003

Removing duplicated posts, since the Reddit API limits a listing to 1000 posts, dropping posts with empty text, and checking how much data remains:

In [10]:
TFR_df = TFR_df.drop_duplicates(subset='posts',keep='first')
TFTS_df = TFTS_df.drop_duplicates(subset='posts',keep='first')
TFR_df = TFR_df[TFR_df['posts']!='']
TFTS_df = TFTS_df[TFTS_df['posts']!='']
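
To make the effect of the two cleaning steps concrete, here is a minimal sketch on a made-up four-row DataFrame. `toy_df` and its contents are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical stand-in for a scraped DataFrame: one repeated post, one empty post
toy_df = pd.DataFrame({
    'names': ['t3_a', 't3_b', 't3_a', 't3_c'],
    'titles': ['First', 'Second', 'First', 'Link only'],
    'posts': ['A story.', 'Another story.', 'A story.', ''],
    'subreddit': ['TalesFromRetail'] * 4,
})

# Keep the first occurrence of each distinct post text
deduped = toy_df.drop_duplicates(subset='posts', keep='first')

# Drop rows whose post body is an empty string
cleaned = deduped[deduped['posts'] != '']

print(len(toy_df), len(deduped), len(cleaned))  # 4 3 2
```

Deduplicating on `posts` rather than `names` also collapses distinct submissions that happen to share identical text, which may or may not be desirable depending on the data.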

In [11]:
len(TFR_df)

442

In [12]:
len(TFTS_df)

976

In [13]:
TFR_df.head()

| | names | titles | posts | subreddit |
|---|---|---|---|---|
| 0 | t3_dhruue | Weekly TFR Express Lane Checkout - Post your v... | Welcome to /r/TalesFromRetail's Express Lane! ... | TalesFromRetail |
| 1 | t3_dj8mxt | "The doctors have to take my face off because ... | First-time poster on this sub. This event happ... | TalesFromRetail |
| 2 | t3_djb3c9 | She thought we worked 24/7 | There was a woman who came in to the store I w... | TalesFromRetail |
| 3 | t3_dizq6s | I Want To Hand Pick 60 Cupcakes | Mobile warning. \n\nThis happened 3 weeks back... | TalesFromRetail |
| 4 | t3_dj5k7s | The first time ever someone made a complaint a... | I work at a privately owned builder's store th... | TalesFromRetail |


In [14]:
TFTS_df.head()

| | names | titles | posts | subreddit |
|---|---|---|---|---|
| 0 | t3_3dqlny | TFTS POSTING RULES (MOBILE USERS PLEASE READ!) | Hey, we can have two stickies now!\n\n---\n\nS... | talesfromtechsupport |
| 1 | t3_dfzukz | TFTS Top Tales - September 2019 | Hi Everybody!\n\nHere's another month of Top T... | talesfromtechsupport |
| 2 | t3_dj7kgc | Woman just doesn't understand how this works | Bit of a lurker here, come from a friend group... | talesfromtechsupport |
| 3 | t3_djd5xs | We tried to help an old lady, she brought in h... | Older story. Been debating posting this for aw... | talesfromtechsupport |
| 4 | t3_djeew5 | Encyclopædia Moronica: T is for Three Little W... | Two years ago:\n\n&gt; ME: Ma'am, I understand... | talesfromtechsupport |


Saving the DataFrames to CSV files for use in the other notebooks, so the data does not have to be scraped again:

In [15]:
TFR_df.to_csv('./datasets/TFR_df.csv',index=False)
TFTS_df.to_csv('./datasets/TFTS_df.csv',index=False)
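
As a quick check of what a later notebook would read back, the sketch below round-trips a small made-up DataFrame through a temporary CSV file. `toy_df`, `csv_path` and the temporary directory are hypothetical; the real files live under `./datasets/`, which is assumed to exist already.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with the same columns as the scraped data
toy_df = pd.DataFrame({
    'names': ['t3_a', 't3_b'],
    'titles': ['First', 'Second'],
    'posts': ['Line one.\n\nLine two.', 'Short post.'],
    'subreddit': ['TalesFromRetail', 'TalesFromRetail'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_df.csv')
    # index=False keeps the row index out of the file
    toy_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
print(reloaded['posts'].tolist() == toy_df['posts'].tolist())
```

Writing with `index=False` matters here: without it, the index is stored as an extra unnamed column, which `read_csv` loads back as `Unnamed: 0`.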