# Project 3: Web API and NLP Data Importing (Disneyland)

## Problem Statement

- The objective is to build the best-performing text classification model for distinguishing posts from the Disneyland and Universal subreddits. KKDays (a travel booking agency) will implement the model in its algorithm to classify customers' comments on these two theme parks. The company also wants to understand which brand is more popular and to gauge customers' tone/sentiment from the comments. The results will inform promotional tactics targeted toward the more popular brand.

## Libraries Importing 

In [9]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
from time import sleep
import json, os


## Instantiate webdriver

In [10]:
## instantiate driver
## check the version of Google Chrome and download correct version of chromedriver
driver = webdriver.Chrome()

In [11]:
## get the "social grep" page, which gives old posts of a subreddit
## original reddit url = 'https://www.reddit.com/r/Disneyland/'

subreddit = 'disneyland' # choose by yourself
start_date = '2010-01-01' # choose by yourself

url = f'https://socialgrep.com/search?query=%2Fr%2F{subreddit}%2Cafter%3A{start_date}&order_by=oldest'

driver.get(url)
repeat_time, waiting_time = 4, 2

## scroll to the bottom of the page and wait
for i in range(repeat_time):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(waiting_time)
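
The hand-encoded query string in `url` decodes to `/r/disneyland,after:2010-01-01`. As a minimal sketch, the same URL can be built with `urllib.parse.quote` from the standard library, so the subreddit and date are escaped automatically. The helper name `build_search_url` is illustrative and not part of the original notebook; the URL pattern is copied from the cell above.

```python
from urllib.parse import quote


def build_search_url(subreddit: str, after_date: str) -> str:
    """Build the socialgrep search URL used above (helper name is illustrative)."""
    # Raw query is "/r/<subreddit>,after:<date>"; safe="" also escapes "/".
    query = quote(f"/r/{subreddit},after:{after_date}", safe="")
    return f"https://socialgrep.com/search?query={query}&order_by=oldest"


url = build_search_url("disneyland", "2010-01-01")
print(url)
```

Keeping the pattern in one function would also avoid repeating the f-string each time the start date changes.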

## Example of one post

In [10]:
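## function to scrape one post card
def get_content(post, subreddit):
    ## vote count; default to 0 when the element is missing or not a number
    try:
        vote = int(post.select_one('span.text-info').text)
    except (AttributeError, ValueError):
        vote = 0
    ## title; skip cards that have no link
    try:
        title = post.a.text
    except AttributeError:
        return None
    ## body text; None when the element is missing or empty
    try:
        text = post.select_one('div.post_content').get_text(separator='\n').strip()
        if text == '':
            text = None
    except AttributeError:
        text = None
    date = post.select_one('h6.card-subtitle').text.split(',')[1].strip()

    ## skip cards that have no text and only link to the subreddit itself
    if text is None and title == f"/r/{subreddit.lower()}":
        return None
    return {
        "vote" : vote,
        "title" : title,
        "text" : text,
        "date" : date
    }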

In [11]:
soup = BeautifulSoup(driver.page_source, 'html.parser') # name the parser explicitly
posts = soup.select('div.card-body') # content is under here

get_content(posts[1], subreddit) # show one example

{'vote': 1,
 'title': 'Disneyland Time Lapse',
 'text': '[deleted]',
 'date': '2010-12-16'}
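
The `date` line in `get_content` assumes the subtitle always contains a comma followed by a date. The sketch below shows one way to make that step defensive, assuming the subtitle has the form `<something>, YYYY-MM-DD`. The subtitle string in the usage lines is made up for illustration (the output above shows only the resulting date), and `parse_post_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date
from typing import Optional


def parse_post_date(subtitle: str) -> Optional[str]:
    """Return the ISO date found after the first comma, or None if absent/invalid."""
    parts = subtitle.split(",")
    if len(parts) < 2:
        return None
    candidate = parts[1].strip()
    try:
        date.fromisoformat(candidate)  # raises ValueError for non-ISO strings
    except ValueError:
        return None
    return candidate


print(parse_post_date("/r/Disneyland, 2010-12-16"))  # hypothetical subtitle text
print(parse_post_date("no date here"))
```

Returning `None` instead of raising lets the caller skip a malformed card without stopping a long scraping run.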

## Scraping loop with date-based resume

In [12]:
if os.path.exists(f'{subreddit}.json'):
    ## resume scraping from the last date in the json file
    with open(f'{subreddit}.json', 'r') as f:
        scraped_data = json.load(f)
    new_date = scraped_data[-1]['date']
    url = f'https://socialgrep.com/search?query=%2Fr%2F{subreddit}%2Cafter%3A{new_date}&order_by=oldest'
else:
    ## if the file does not exist, start with an empty list
    scraped_data = []
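
The resume logic above can be wrapped in a pair of small helpers. This is a minimal sketch, not the notebook's own code: `save_checkpoint` and `load_checkpoint` are hypothetical names, the explicit UTF-8 encoding is an added choice, and writing to a temporary file before `os.replace` is one way to avoid leaving a half-written JSON file if the run is interrupted.

```python
import json
import os
import tempfile


def save_checkpoint(path, records):
    """Write records to a temp file, then swap it into place."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path, default_date):
    """Return (records, resume_date); use default_date when nothing is saved yet."""
    if not os.path.exists(path):
        return [], default_date
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    resume_date = records[-1]["date"] if records else default_date
    return records, resume_date


# Usage with a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.json")
    first = load_checkpoint(path, "2010-01-01")
    save_checkpoint(path, [{"title": "t", "date": "2011-10-03"}])
    second = load_checkpoint(path, "2010-01-01")

print(first)
print(second)
```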

In [24]:
## scrape and append to `scraped_data`
## RUN THIS CELL AGAIN AND AGAIN until the latest post is reached

for _ in tqdm(range(400)): # number of page loads per run

    ## load the page, then scroll to the bottom and wait for more posts
    driver.get(url)
    for i in range(4):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(2)

    ## get HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    posts = soup.select('div.card-body')

    ## iterate over each post
    for post in posts:
        one_post_dict = get_content(post, subreddit)
        if one_post_dict is not None:
            scraped_data.append(one_post_dict)

    ## save to json
    with open(f'{subreddit}.json', 'w') as f:
        json.dump(scraped_data, f, indent=0, ensure_ascii=False)

    ## restart the search from the date of the last scraped post
    if scraped_data:
        new_date = scraped_data[-1]['date']
        url = f'https://socialgrep.com/search?query=%2Fr%2F{subreddit}%2Cafter%3A{new_date}&order_by=oldest'


100%|██████████| 400/400 [1:21:03<00:00, 12.16s/it]
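
Because each new search starts from the date of the last scraped post, consecutive pages may return some of the same posts again (an assumption about the site's `after:` filter, consistent with the duplicates dropped later in this notebook). The sketch below shows one way to skip repeats while appending. `append_new` is a hypothetical helper and the sample posts are made up.

```python
def append_new(scraped, batch, seen):
    """Append posts from batch that are not in seen; return how many were added."""
    added = 0
    for post in batch:
        key = (post["title"], post["text"], post["date"])
        if key not in seen:
            seen.add(key)
            scraped.append(post)
            added += 1
    return added


scraped, seen = [], set()
page_1 = [
    {"vote": 1, "title": "A", "text": None, "date": "2011-10-02"},
    {"vote": 16, "title": "B", "text": None, "date": "2011-10-03"},
]
# The next search restarts at the last date, so "B" is returned again.
page_2 = [
    {"vote": 16, "title": "B", "text": None, "date": "2011-10-03"},
    {"vote": 4, "title": "C", "text": "hi", "date": "2011-10-04"},
]

print(append_new(scraped, page_1, seen))
print(append_new(scraped, page_2, seen))
print(append_new(scraped, page_2, seen))
```

A return value of 0 means the page contained nothing new, which could serve as a signal to stop the loop early instead of reloading the same page.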


## Convert to DataFrame and drop duplicates

In [25]:
df = pd.read_json(f'{subreddit}.json').drop_duplicates()
df

| | vote | title | text | date |
|---|---|---|---|---|
| 0 | 4 | Disney Parks Live Stream! Big Announcement! | | 2010-09-23 |
| 1 | 1 | Disneyland Time Lapse | [deleted] | 2010-12-16 |
| 2 | 1 | The sound workshop of Disneyland Imagineers... | [deleted] | 2011-02-25 |
| 3 | 2 | The skunk. Every Freaking Time! | | 2011-10-02 |
| 4 | 16 | Disneyland Entrance | | 2011-10-03 |
| ... | ... | ... | ... | ... |
| 12570 | 196 | Got engaged last Friday! | [deleted] | 2017-09-28 |
| 12571 | 5 | Plaza inn character buffet | Hi all! I have a question. So on Saturday morn... | 2017-09-29 |
| 12572 | 1 | How late can you pay for your visit? | Looking through the FAQ, I didn't see anything... | 2017-09-29 |
| 12573 | 12 | Visiting Disneyland during Christmas! | So, my family and I (wife, 8 y/o boy, 6 y/o bo... | 2017-09-29 |


In [26]:
## count missing values per column (only `text` has any)
df.isna().sum()

vote        0
title       0
text     3673
date        0
dtype: int64

In [27]:
## rows whose text is only a [removed] or [deleted] placeholder
df[df['text'].isin(['[removed]', '[deleted]'])]

| | vote | title | text | date |
|---|---|---|---|---|
| 1 | 1 | Disneyland Time Lapse | [deleted] | 2010-12-16 |
| 2 | 1 | The sound workshop of Disneyland Imagineers... | [deleted] | 2011-02-25 |
| 23 | 1 | Mini Lego Sleeping Beauty Castle | [deleted] | 2011-12-08 |
| 28 | 1 | Best places to eat in or near Disneyland, Cali... | [deleted] | 2012-01-02 |
| 33 | 3 | Disneyland Hotels which one to stay at? | [deleted] | 2012-01-25 |
| ... | ... | ... | ... | ... |
| 12551 | 1 | so what do you guys think about pda at the par... | [removed] | 2017-09-27 |
| 12554 | 1 | Dress code for Mickeys Halloween Party | [removed] | 2017-09-27 |
| 12563 | 5 | Oogie Boogie Popcorn! | [deleted] | 2017-09-28 |
| 12564 | 56 | Just popped up on my disney app. No idea who t... | [deleted] | 2017-09-28 |


In [30]:
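## keep posts with real text: drop missing text, [removed]/[deleted] placeholders, and duplicate texts
disney_df = df[(~df['text'].isin(['[removed]', '[deleted]'])) & (df['text'].notna())].drop_duplicates(subset=['text'])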
disney_df.to_csv('disneyland.csv', index=False)
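
To make the filter above easier to follow, here is a minimal sketch that applies the same three conditions to a toy DataFrame. The data is made up for illustration, and the variable names `toy`, `placeholders`, `mask`, and `clean` are not from the notebook.

```python
import pandas as pd

# Toy data covering each case: real text, placeholders, missing text, duplicate text.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "text": ["hello", "[deleted]", None, "hello", "[removed]"],
})

placeholders = ["[removed]", "[deleted]"]

# Keep rows whose text is present and is not a placeholder.
mask = toy["text"].notna() & ~toy["text"].isin(placeholders)

# Among the remaining rows, keep only the first occurrence of each text.
clean = toy[mask].drop_duplicates(subset=["text"])
print(clean)
```

Only row `a` survives here: `b` and `e` are placeholders, `c` has no text, and `d` repeats the text of `a`.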

In [7]:
## reload the cleaned posts from the saved JSON file
disney_df = pd.read_json('disneyland_dropped.json')
disney_df

| | vote | title | text | date |
|---|---|---|---|---|
| 0 | 12 | [Poll] Is this ok? | Hey, /r/disneyland. I think I'll try and post ... | 2011-10-04 |
| 1 | 8 | Favorite Land? | So, what's YOUR favorite land in Disneyland? | 2011-10-09 |
| 2 | 4 | Does anyone have any pictures from inside club... | Hey everyone! I don't know if you're allowed t... | 2011-10-10 |
| 3 | 4 | Who is your favorite face character? | I love Bert and Mary. I think it's something a... | 2011-10-10 |
| 4 | 7 | Does Disneyland drug test employees? (x-post f... | I have an interview Friday, and I'd like to kn... | 2011-11-10 |
| ... | ... | ... | ... | ... |
| 3852 | 3 | What day to get Maxpass | I'm going to Disneyland/DCA in October for a w... | 2017-09-28 |
| 3853 | 4 | Disneyland questions | As it turns out as I’m preparing for my WDW Ch... | 2017-09-28 |
| 3854 | 5 | Plaza inn character buffet | Hi all! I have a question. So on Saturday morn... | 2017-09-29 |
| 3855 | 1 | How late can you pay for your visit? | Looking through the FAQ, I didn't see anything... | 2017-09-29 |
