<img src="http://imgur.com/1ZcRyrc.png" style="float: centre; margin: 20px; height: 55px"> 

# Project 3: Web API & NLP, Part 1: Web Scraping and API

## Problem Statement

We are tasked with training a classifier on a binary classification problem: predicting whether a given post comes from the Bali or the Phuket subreddit (our chosen pair). The data is gathered from Reddit through an API.

Our solution is twofold:
- Use Pushshift's API to scrape Bali and Phuket posts.
- Use NLP to train two classifiers, (1) a RandomForestClassifier and (2) a logistic regression model, to predict which subreddit a given post came from.

## Executive Summary

This project is a binary classification problem: it uses NLP to train a classifier on data scraped from two subreddits through an API, so that the classifier can predict which subreddit a given post came from.

We selected two tourist beach destinations as topics: Bali and Phuket. Both are familiar to beach holiday-goers from the Southeast Asian region and offer similar kinds of tourist activities. Given the current pandemic situation, both islands have had limited numbers of foreign visitors. It will be interesting to see what results the analysis yields.

Using the API, we scraped 8 loops of posts for each of 4 datasets, namely the submissions and the comments for each class. We then cleaned the data and merged it into a single set. We concatenated the comments to the submissions because we would like to know which words are used in each class.
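
The merge described above can be sketched roughly as follows. This is a minimal illustration on toy rows, not the project's actual cleaning code: the `combine` helper and the `text` column name are assumptions, while `subreddit`, `title`, `selftext` and `body` are the columns used in this notebook.

```python
import pandas as pd

def combine(submissions, comments):
    """Stack submissions and comments into one frame with a single text column.

    `combine` and the `text` column are hypothetical names for illustration.
    """
    subs = submissions.copy()
    # A submission's text is its title plus its (possibly missing) selftext.
    subs['text'] = subs['title'].fillna('') + ' ' + subs['selftext'].fillna('')
    comms = comments.rename(columns={'body': 'text'})
    # Keep only the label and the text, then stack the two sources.
    merged = pd.concat([subs[['subreddit', 'text']], comms[['subreddit', 'text']]],
                       ignore_index=True)
    merged['text'] = merged['text'].str.strip()
    return merged

# Toy rows standing in for the scraped data.
toy_subs = pd.DataFrame({'subreddit': ['bali', 'phuket'],
                         'title': ['Visa question', 'Drinks in Patong?'],
                         'selftext': [None, 'Anyone around on Saturday?']})
toy_comms = pd.DataFrame({'subreddit': ['bali'],
                          'body': ['Check the embassy site.']})

combined = combine(toy_subs, toy_comms)
print(combined)
```

Keeping the subreddit label next to every piece of text means submissions and comments can be treated identically by the classifier later on.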

For modelling, we chose Logistic Regression and RandomForestClassifier, using pipelines to conduct feature engineering and GridSearchCV to find the best parameters for our models.

Logistic Regression turned out to be the model with the better, more consistent scores. We therefore recommend the LogisticRegression model for our binary classifier.
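
As a rough idea of what a pipeline tuned with GridSearchCV looks like, here is a minimal sketch on toy text. The step names (`cvec`, `lr`), the use of `CountVectorizer` as the feature step, the parameter grid and the toy posts are all assumptions made for illustration; the actual pipelines and grids belong to Part 2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy posts standing in for the scraped text (illustrative only).
texts = ['rice paddies near ubud', 'surf lesson in kuta',
         'temple visit in uluwatu', 'yoga retreat in ubud',
         'nightlife in patong', 'beach day at karon',
         'sunset at laem phromthep', 'ferry from patong pier']
labels = ['bali'] * 4 + ['phuket'] * 4

# Vectorise the text, then classify, as one estimator.
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: '<step>__<parameter>' addresses a step's parameter.
param_grid = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.1, 1.0],
}

gs = GridSearchCV(pipe, param_grid, cv=2)
gs.fit(texts, labels)
print(gs.best_params_)
print(gs.predict(['drinks in patong']))
```

Wrapping the vectoriser inside the pipeline means it is refitted on each training fold, so the cross-validation scores are not inflated by vocabulary from the held-out fold.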


The following are brief descriptions of the two islands, for a common understanding. This information will come in handy during our analysis.

1. Bali: an Indonesian island known for its forested volcanic mountains, iconic rice paddies, beaches and coral reefs. The island is home to religious sites such as cliffside Uluwatu Temple. To the south, the beachside city of Kuta has lively bars, while Seminyak, Sanur and Nusa Dua are popular resort towns. The island is also known for its yoga and meditation retreats.
ref: https://en.wikipedia.org/wiki/Bali

2. Phuket: A Thai province located in the south of Thailand. It is the biggest island in Thailand and sits on the Andaman Sea. The nearest province to the north is Phang-nga and the nearest provinces to the east are Phang-nga and Krabi. Phuket has a large Chinese influence, so you will find many Chinese shrines and Chinese restaurants around the city. Being a big island, Phuket is surrounded by many magnificent beaches such as Rawai, Patong, Karon, Kamala, Kata Yai, Kata Noi, and Mai Khao. Laem Phromthep viewpoint is said to feature the most beautiful sunsets in Thailand. It isn’t all just beaches though: there is also the famous Phuket nightlife, and the island is a hotspot for tourists in Thailand.
ref: https://www.tourismthailand.org/Destinations/Provinces/Phuket/350


## Contents:

### Part 1:
- Import Library
- Data Collection: Web Scraping using API
- Data Dictionary


### Part 2:
- Data Cleaning
- Import Library and Datasets
- Preprocessing
- EDA
- Modelling
- Evaluation
- Conclusion & Summary

### Import Library

In [53]:
import pandas as pd
import requests
import re
import time
import string

## Data Collection: Web Scraping

In [2]:
# create function for web scraping submissions:

def scrapping(url, loops, subreddit):
    """Scrape `loops` pages of up to 100 submissions each and save them to CSV."""
    df = []
    start_time = time.time()  # time in seconds since the epoch
    params = {'subreddit': subreddit,
              'size': 100,
              'before': round(start_time)}

    for i in range(loops):
        # request one page of posts created before params['before']
        res = requests.get(url, params)
        print(f'res status {i+1}: ', res.status_code)

        data = res.json()
        posts = data['data']
        if not posts:  # no older posts left, so stop early
            break
        post_df = pd.DataFrame(posts)
        df.append(post_df)
        # get oldest post time and use as before parameter in next request
        params['before'] = post_df['created_utc'].min()
        time.sleep(1)  # pause between requests

    # combine all pages once, after the loop, and write a single CSV
    reddit_posts = pd.concat(df)
    filename = subreddit + '_submission.csv'
    reddit_posts.to_csv('./datasets/' + filename, index=False)
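
The `before` cursor logic in the function above can be exercised without touching the network. The following is a sketch under stated assumptions: `paginate` and `fake_fetch` are hypothetical names, and the fake fetcher merely imitates an endpoint that returns the newest posts older than `before`.

```python
def paginate(fetch, loops, before, size=100):
    """Collect up to `loops` pages, moving the `before` cursor back each time.

    `fetch(before, size)` is any callable returning a list of dicts with a
    'created_utc' key; in the notebook that role is played by requests.get.
    """
    pages = []
    for _ in range(loops):
        posts = fetch(before, size)
        if not posts:  # nothing older is left
            break
        pages.extend(posts)
        # The oldest timestamp on this page becomes the next cursor.
        before = min(p['created_utc'] for p in posts)
    return pages

# Hypothetical stand-in for the API: 250 posts with timestamps 0..249.
ARCHIVE = [{'created_utc': t} for t in range(250)]

def fake_fetch(before, size):
    """Return the `size` newest archive posts strictly older than `before`."""
    older = [p for p in ARCHIVE if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]

posts = paginate(fake_fetch, loops=8, before=1000)
print(len(posts))  # 250: three pages, then an empty page stops the loop
```

Because each request asks only for posts strictly older than the previous page's oldest timestamp, no post is collected twice, and 8 loops of 100 posts yield at most 800 rows.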

In [19]:
# scrape for bali submission data set from Reddit.
url = 'https://api.pushshift.io/reddit/submission/search'
loops = 8 # no. of loops to scrape
subreddit = 'bali' # subreddit topic

scrapping(url, loops, subreddit)

res status 1:  200
res status 2:  200
res status 3:  200
res status 4:  200
res status 5:  200
res status 6:  200
res status 7:  200
res status 8:  200


In [15]:
bali_submission = pd.read_csv('./datasets/bali_submission.csv')

In [16]:
bali_submission.shape

(800, 84)

In [17]:
# extract out the necessary features of the df
bali_df = bali_submission[['subreddit', 'selftext', 'title']]
bali_df.head()

|  | subreddit | selftext | title |
|---|---|---|---|
| 0 | bali |  | PARASHAKTI AKASHIC READING \|\| 22 SEPT 2021 |
| 1 | bali | [removed] | SBOBET PUSAT JUDI ONLINE |
| 2 | bali | [removed] | North Bali beachfront resort offered for sale ... |
| 3 | bali | [removed] | Coming to Bali from Australia |
| 4 | bali | [removed] | Getting into bali with a business visa |
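
Several `selftext` values in the preview above are missing or `[removed]`. One possible way to flag such rows before modelling is sketched below; the cleaning actually applied in Part 2 may differ, and the helper name `usable_text` is an assumption.

```python
import pandas as pd

def usable_text(df, column='selftext'):
    """Return a boolean mask that is True where `column` holds real text.

    `usable_text` is a hypothetical helper; '[removed]' and '[deleted]' are
    the placeholders Reddit shows for removed or deleted content.
    """
    placeholders = {'[removed]', '[deleted]', ''}
    text = df[column].fillna('').str.strip()
    return ~text.isin(placeholders)

# Toy rows mimicking the three cases: missing, removed, and real text.
toy = pd.DataFrame({
    'subreddit': ['bali', 'bali', 'bali'],
    'selftext': [None, '[removed]', 'Is the visa on arrival back?'],
    'title': ['Reading', 'Resort for sale', 'Visa question'],
})

mask = usable_text(toy)
print(mask.tolist())  # [False, False, True]
print(toy[mask])
```

Rows that fail the mask still carry a usable `title`, so blanking the placeholder may be preferable to dropping the row.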


In [23]:
# scrape for phuket submission data set from Reddit.
url = 'https://api.pushshift.io/reddit/submission/search'
loops = 8 # no. of loops to scrape
subreddit = 'phuket' # subreddit topic

scrapping(url, loops, subreddit)

res status 1:  200
res status 2:  200
res status 3:  200
res status 4:  200
res status 5:  200
res status 6:  200
res status 7:  200
res status 8:  200


In [18]:
phuket_submi = pd.read_csv('./datasets/phuket_submission.csv')

In [19]:
# extract out the necessary features of the df
phuket_df = phuket_submi[['subreddit', 'selftext', 'title']]
phuket_df.head()

|  | subreddit | selftext | title |
|---|---|---|---|
| 0 | phuket | Just curious if anyone knows the rules, hard f... | Travel from Phuket to Bangkok after 14 day san... |
| 1 | phuket | Hi, just wondering if anyone wanted to grab so... | Drinks Saturday in Patong? |
| 2 | phuket | With current restrictions, do you think Phuket... | Is Phuket good destination for honeymoon? |
| 3 | phuket | Hi, is an unvaccinated Thai national still abl... | Unvaccinated domestic travel. |
| 4 | phuket | We've visited the 5 major beach towns on the w... | Review of Phuket's beach towns since beginning... |


In [20]:
phuket_df.shape

(800, 3)

In [44]:
# create function for web scraping comments
# (redefines scrapping so that the output goes to '<subreddit>_comment.csv'):

def scrapping(url, loops, subreddit):
    """Scrape `loops` pages of up to 100 comments each and save them to CSV."""
    df = []
    start_time = time.time()  # time in seconds since the epoch
    params = {'subreddit': subreddit,
              'size': 100,
              'before': round(start_time)}

    for i in range(loops):
        # request one page of comments created before params['before']
        res = requests.get(url, params)
        print(f'res status {i+1}: ', res.status_code)

        data = res.json()
        posts = data['data']
        if not posts:  # no older comments left, so stop early
            break
        post_df = pd.DataFrame(posts)
        df.append(post_df)
        # get oldest comment time and use as before parameter in next request
        params['before'] = post_df['created_utc'].min()
        time.sleep(1)  # pause between requests

    # combine all pages once, after the loop, and write a single CSV
    reddit_posts = pd.concat(df)
    filename = subreddit + '_comment.csv'
    reddit_posts.to_csv('./datasets/' + filename, index=False)
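
The two `scrapping` definitions differ only in the output filename, and the second one overwrites the first. One way to avoid the redefinition is to derive the filename from the endpoint URL. The helper below is a sketch: `output_filename` is a hypothetical name, and it assumes the URL path ends in `/<kind>/search`, as both URLs used in this notebook do.

```python
def output_filename(url, subreddit):
    """Build '<subreddit>_<kind>.csv' from a Pushshift-style search URL.

    Assumes the URL path ends in '/<kind>/search'; `output_filename` is a
    hypothetical helper, not part of the original notebook.
    """
    parts = url.rstrip('/').split('/')
    if len(parts) < 2 or parts[-1] != 'search':
        raise ValueError(f'unexpected endpoint: {url}')
    kind = parts[-2]  # 'submission' or 'comment'
    return f'{subreddit}_{kind}.csv'

print(output_filename('https://api.pushshift.io/reddit/submission/search', 'bali'))
print(output_filename('https://api.pushshift.io/reddit/comment/search', 'phuket'))
```

With such a helper, a single scraping function could serve both the submission and the comment endpoints.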

In [28]:
# scrape for bali comments data set from Reddit.
url = 'https://api.pushshift.io/reddit/comment/search'
loops = 8 # no. of loops to scrape
subreddit = 'bali' # subreddit topic

scrapping(url, loops, subreddit)

res status 1:  200
res status 2:  200
res status 3:  200
res status 4:  200
res status 5:  200
res status 6:  200
res status 7:  200
res status 8:  200


In [21]:
bali_comm = pd.read_csv('./datasets/bali_comment.csv')

In [22]:
# extract out the necessary features of the df
bali_comments = bali_comm[['subreddit', 'body']]
bali_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  800 non-null    object
 1   body       800 non-null    object
dtypes: object(2)
memory usage: 12.6+ KB


In [23]:
#check 
bali_comments.head()

|  | subreddit | body |
|---|---|---|
| 0 | bali | I have no idea about the official visa website... |
| 1 | bali | Your submission has been removed for suspected... |
| 2 | bali | You need a specialised Covid19 test for overse... |
| 3 | bali | Your submission has been removed for suspected... |
| 4 | bali | It's entirely possible I posted this in anothe... |


In [45]:
# scrape for phuket comments data set from Reddit.
url = 'https://api.pushshift.io/reddit/comment/search'
loops = 8 # no. of loops to scrape
subreddit = 'phuket' # subreddit topic

scrapping(url, loops, subreddit)

res status 1:  200
res status 2:  200
res status 3:  200
res status 4:  200
res status 5:  200
res status 6:  200
res status 7:  200
res status 8:  200


In [24]:
phuket_comm = pd.read_csv('./datasets/phuket_comment.csv')

In [25]:
# extract out the necessary features of the df
phuket_comments = phuket_comm[['subreddit', 'body']]
phuket_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  800 non-null    object
 1   body       800 non-null    object
dtypes: object(2)
memory usage: 12.6+ KB


In [26]:
#check
phuket_comments.head()

|  | subreddit | body |
|---|---|---|
| 0 | phuket | You'll need an ART test I believe available at... |
| 1 | phuket | Perfect. |
| 2 | phuket | Yesyes, sound good. \nIf im still i’ll DM you,... |
| 3 | phuket | Anything I'm easy going just some drinks to st... |
| 4 | phuket | Thank you very much! |


In [27]:
phuket_comments.shape

(800, 2)

We have now scraped 4 sets of data, each containing 800 rows:
1. Bali submissions
2. Bali comments
3. Phuket submissions
4. Phuket comments

### Data Dictionary

| S/N | Data | Description | File |
|:---:|:---:|:---:|:---:|
| 1 | phuket_df | scraped submission posts of phuket from reddit | phuket_submission.csv |
| 2 | phuket_comments | scraped comments of phuket from reddit | phuket_comment.csv |
| 3 | bali_df | scraped submission posts of bali from reddit | bali_submission.csv |
| 4 | bali_comments | scraped comments of bali from reddit | bali_comment.csv |
| 5 | df | combined cleaned dataset | df_clean.csv |