<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP SubReddit Classification (Walmart & Costco)

## Executive Summary

In the 2021 American Customer Satisfaction Index Retail and Consumer Shipping Report, Walmart was ranked **LAST** among supermarkets. As Walmart's Data Science Team, we have been tasked by top management with understanding what users (customers and Walmart employees) think of Walmart and its competitors, so that practices earning positive feedback can be reinforced and adopted while the causes of negative feedback can be addressed and prevented. Using a data science approach, we will recommend actions to improve Walmart's social media image and, ultimately, customer satisfaction.

Social media presence is vital today. Reddit is an American social news aggregation, web content rating, and discussion website featuring user-posted stories. As of February 2021, Reddit was ranked the 18th most visited website in the world and the 7th most visited website in the U.S.

As a start, we will look into the posts on Walmart's subreddit, alongside those of a chosen competitor, **Costco** (which ranked in the top 2 in the survey mentioned above).

**Primary Objective**: 
To enhance our understanding of Walmart's social media image on Reddit so as to propose strategies for improvement.

**Secondary Objective**: 
To identify what subreddit users think of both supermarkets, where positive feedback will continue to be reinforced and adopted while negative feedback can be addressed and prevented.

The __Primary stakeholder is:__ Walmart Corporate, and the __Secondary Stakeholder is:__ Walmart Consumers.

The success of this project will be evaluated on whether the classification model achieves a higher prediction accuracy and F1 score than the baseline score of 50%.

The classification model, built with natural language processing, will aid us in "understanding" the contents of the subreddits.
Posts from the Walmart and Costco subreddits are scraped, analyzed, cleaned, processed and run through several classification models (e.g. Logistic Regression, K-Nearest Neighbors, Multinomial Naive Bayes, Random Tree Classifier).
The final production model that we chose was __Logistic Regression with TfidfVectorizer__, which achieved a classification prediction rate of __93.3%__.

- Reference: 
    - https://www.supermarketnews.com/issues-trends/customer-satisfaction-fell-supermarkets-2020
    - https://www.barrons.com/articles/walmart-stock-is-falling-on-earnings-miss-muted-outlook-51613657909
    - https://www.vox.com/recode/22423706/walmart-memo-retail-amazon-target-instacart

---
Project notebook organisation:<br>
**1 - SubReddit Web Scrapping** (current notebook)<br>
[2 - Exploratory Data Analysis and Preprocessing](./2_exploratory_data_analysis_and_preprocessing.ipynb)<br>
[3 - Classification Model and Recommendation](./3_Classification_Model_and_Recommendation.ipynb)<br>
<br>

# Part 1: SubReddit Web Scrapping

### Contents:
- [1. Import Libraries](#1.-Import-Libraries)
- [2. Web Posts Scrapping](#2.-Web-Posts-Scrapping)

In Part 1, the main task is scraping posts from the 2 subreddits, Walmart and Costco (roughly 1,500 posts are scraped for each subreddit). The data is then merged into one dataframe and exported for Part 2 (EDA and preprocessing).

---

# 1. Import Libraries

---

In [1]:
import pandas as pd
import requests
import time
import random
import datetime

# psaw is needed for data scraping; install it with: !pip install psaw
from psaw import PushshiftAPI

---

# 2. Web Posts Scrapping

---

## 2.1 Define Public Functions 

In [2]:
url ="https://api.pushshift.io/reddit/search/submission"

In [3]:
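# Define the function that scrapes posts from a subreddit via the Pushshift REST endpoint
def subreddit_scrapping(subreddit, n_iter):
    
    # fields that are relevant to this NLP project
    keep_cols = ['subreddit', "created_utc", "id", "title", "selftext",
                 "is_self", "upvote_ratio", "score", "num_comments"]
    
    # list of per-request dataframes, concatenated at the end
    df_posts = []
    
    # Upper bound of the search window as a UTC epoch integer (from EpochConverter)
    # 1623345408 = Thursday, June 10, 2021 5:16:48 PM UTC
    current_time = 1623345408
    
    # only keep posts whose score is above this threshold
    score_range = ">10"
    
    for i in range(n_iter):
        
        print(f'Scraping batch {i+1} (up to 100 posts) in progress...')
        
        res = requests.get(
            url,
            params = {
                "subreddit": subreddit,
                "size": 100,            # every request retrieves up to 100 posts
                "is_self": True,        # only scrape text posts
                "sort_type": "created_utc",
                "sort": "desc",
                "score": score_range,
                "before": current_time  # only posts older than the current cursor
            }
        )
        
        if res.status_code != 200:
            # request failed: report the status code and stop
            print(res.status_code)
            break
        
        df = pd.DataFrame(res.json()['data'])
        if df.empty:
            # no older posts match the filters, so stop paging
            break
        
        # many columns are returned but most are not relevant to this NLP project;
        # reindex keeps the useful ones and fills a missing column with NaN
        df = df.reindex(columns=keep_cols)
        df_posts.append(df)
        
        # move the cursor to the oldest post seen so the next request goes further back
        current_time = int(df.created_utc.min())
        
        time.sleep(random.random()*5)  # random sleep time before the next request
            
    print('Scraping all posts completed!')
    
    # pd.concat raises on an empty list, so return an empty frame in that case
    if not df_posts:
        return pd.DataFrame(columns=keep_cols)
    return pd.concat(df_posts, axis=0)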
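
The paging idea used by the function above (request the newest posts, then move the `before` cursor to the oldest `created_utc` seen) can be tried offline. The sketch below is only an illustration: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names, and the in-memory list stands in for the real Pushshift endpoint.

```python
from typing import Callable, Dict, List

Post = Dict[str, int]


def paginate_before(fetch_page: Callable[[int, int], List[Post]],
                    start_before: int, page_size: int, max_pages: int) -> List[Post]:
    """Collect posts newest-first by moving the `before` cursor back after each page."""
    collected: List[Post] = []
    before = start_before
    for _ in range(max_pages):
        page = fetch_page(before, page_size)
        if not page:  # no older posts left, stop early
            break
        collected.extend(page)
        # the oldest timestamp on this page becomes the next upper bound
        before = min(post["created_utc"] for post in page)
    return collected


# Hypothetical stand-in for the API: ten posts with created_utc 1..10
FAKE_POSTS = [{"id": i, "created_utc": i} for i in range(1, 11)]


def fake_fetch(before: int, size: int) -> List[Post]:
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


posts = paginate_before(fake_fetch, start_before=11, page_size=4, max_pages=5)
print([p["id"] for p in posts])  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Stopping on an empty page means that a large `max_pages` does no harm once the filters stop matching older posts.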

### Comments:

I tried to scrape Walmart and Costco data using the function defined above. Without the score parameter, the returned posts' scores only range from 0 to 8; after setting the score parameter (score > 10), I could not get close to 1k posts.

Since our business objective is to analyze and model the hottest posts, I will use another method and pull the posts with the `PushshiftAPI` wrapper from psaw directly.

## 2.2 Walmart & Costco Data Collection

In [4]:
api = PushshiftAPI()

### 2.2.1  Scrapping Walmart data

In [5]:
# Collect text (self) posts created between the `after` and `before` epoch timestamps
# (2020-06-01 to 2021-06-12 UTC); limit caps the number of submissions returned at 1500
api_request_generator = api.search_submissions(subreddit='walmart', 
                                               before = 1623489389, 
                                               after = 1590981341, 
                                               is_self = True, 
                                               limit =1500)
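
The `before` and `after` arguments are Unix epoch seconds, which are hard to read at a glance. As a small sketch using only the standard library (the helper names `to_epoch` and `from_epoch` are made up for this example), calendar dates can be converted to epoch values and back in UTC:

```python
from datetime import datetime, timezone


def to_epoch(year: int, month: int, day: int) -> int:
    """Midnight UTC of the given date as integer epoch seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def from_epoch(ts: int) -> datetime:
    """Epoch seconds as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


print(to_epoch(2020, 6, 1))                # 1590969600
print(from_epoch(1590981341).isoformat())  # 2020-06-01T03:15:41+00:00
print(from_epoch(1623489389).isoformat())  # 2021-06-12T09:16:29+00:00
```

Decoding the two values used in the Walmart request shows that the search window runs from 2020-06-01 to 2021-06-12 (UTC).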

In [6]:
%%time
aita_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])

Wall time: 33.2 s


In [7]:
# We mainly focus on the text of each post: title & selftext
# So remove rows that duplicate both of these fields
df_walmart = aita_submissions.drop_duplicates(subset=['title', 'selftext'], keep='last').reset_index(drop=True)
df_walmart.shape

(1490, 75)

In [8]:
df_walmart.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_text', 'author_flair_text_color',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_t

In [9]:
# there are many columns but not really relevant to this NLP project
# I will only use these useful fields
df_walmart = df_walmart[['subreddit', 'created_utc', 'id', 'title', 'selftext',
       'upvote_ratio', 'score', 'num_comments']]

In [10]:
df_walmart.shape

(1490, 8)

In [11]:
df_walmart.head()

| | subreddit | created_utc | id | title | selftext | upvote_ratio | score | num_comments |
|---|---|---|---|---|---|---|---|---|
| 0 | walmart | 1623481278 | ny0t2k | The whole meat wall, in one night? | Is that even possible? | 1.0 | 1 | 15 |
| 1 | walmart | 1623479940 | ny0hk4 | Early Lunches | [removed] | 1.0 | 1 | 0 |
| 2 | walmart | 1623477525 | nxzwvb | Cap 2/Overnight Team leads | Due to unforseen circumstances, I was not able... | 1.0 | 1 | 14 |
| 3 | walmart | 1623476491 | nxzndf | Finally promoted MYSELF to customer | This is going to be a long one so buckle up, f... | 1.0 | 1 | 9 |
| 4 | walmart | 1623476227 | nxzkvz | Pointing out after putting in your two week no... | So I submitted my 2 week notice yesterday and ... | 1.0 | 1 | 11 |


In [12]:
min_posts_datetime = datetime.datetime.fromtimestamp(df_walmart['created_utc'].min())
max_posts_datetime =datetime.datetime.fromtimestamp(df_walmart['created_utc'].max())

print (f'The Walmart posts are from {min_posts_datetime} to {max_posts_datetime}')

The Walmart posts are from 2021-05-29 01:59:22 to 2021-06-12 15:01:18
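
Note that `datetime.datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the printed range is in local time rather than UTC. The sketch below uses the `created_utc` of the first row shown by `df_walmart.head()`; the UTC+8 offset is an assumption, chosen because it reproduces the printed maximum.

```python
from datetime import datetime, timedelta, timezone

# created_utc of the first row shown by df_walmart.head()
newest_created_utc = 1623481278

# timezone-aware conversion, independent of the machine's local settings
as_utc = datetime.fromtimestamp(newest_created_utc, tz=timezone.utc)

# assumed local offset (UTC+8), used only to compare with the printed output
as_utc_plus_8 = as_utc.astimezone(timezone(timedelta(hours=8)))

print(as_utc.isoformat())         # 2021-06-12T07:01:18+00:00
print(as_utc_plus_8.isoformat())  # 2021-06-12T15:01:18+08:00
```

Passing `tz=timezone.utc` makes the reported range reproducible on any machine.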


### Comments:
I removed duplicated rows (same "title" and "selftext") and was able to get 1490 posts for the Walmart subreddit.
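
To make the effect of `drop_duplicates(subset=[...], keep='last')` concrete, here is a minimal sketch on a toy dataframe. The rows are made up for illustration and are not scraped data.

```python
import pandas as pd

# Made-up rows: a1 and a2 share both title and selftext; a3 shares only the title
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "title": ["Early Lunches", "Early Lunches", "Early Lunches", "Schedule question"],
    "selftext": ["[removed]", "[removed]", "Can I take lunch early?", "[removed]"],
})

# A row is a duplicate only if BOTH title and selftext match;
# keep='last' keeps the final occurrence of each duplicated pair
deduped = toy.drop_duplicates(subset=["title", "selftext"], keep="last").reset_index(drop=True)

print(deduped["id"].tolist())  # ['a2', 'a3', 'a4']
```

Rows whose `selftext` is `[removed]` are kept as long as their titles differ, so such posts still need handling in a later cleaning step.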

### 2.2.2 Scrap Costco Data

In [13]:
# Collect text (self) posts created between the `after` and `before` epoch timestamps
# (2020-06-13 to 2021-06-12 UTC); limit caps the number of submissions returned at 1500
api_request_generator = api.search_submissions(subreddit='costco', 
                                               before = 1623489389, 
                                               after = 1592069466, 
                                               is_self = True, 
                                               limit =1500)

In [14]:
%%time
aita_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])

Wall time: 33.9 s


In [15]:
# We mainly focus on the text of each post: title & selftext
# So remove rows that duplicate both of these fields
df_costco = aita_submissions.drop_duplicates(subset=['title', 'selftext'], keep='last').reset_index(drop=True)
df_costco.shape

(1497, 74)

### Comments:

As with the Walmart posts, I removed duplicated rows (same "title" and "selftext") and was able to get 1497 posts for the Costco subreddit.

In [16]:
# there are many columns but not really relevant to this NLP project
# I will only use these useful fields

df_costco = df_costco[['subreddit', 'created_utc', 'id', 'title', 'selftext',
       'upvote_ratio', 'score', 'num_comments']]

In [17]:
df_costco.columns

Index(['subreddit', 'created_utc', 'id', 'title', 'selftext', 'upvote_ratio',
       'score', 'num_comments'],
      dtype='object')

In [18]:
df_costco.shape

(1497, 8)

In [19]:
df_costco.head()

| | subreddit | created_utc | id | title | selftext | upvote_ratio | score | num_comments |
|---|---|---|---|---|---|---|---|---|
| 0 | Costco | 1623484046 | ny1g1y | Teenage girl harassed by a male employee | I was too shaken up to talk to a manager while... | 1.0 | 1 | 24 |
| 1 | Costco | 1623468009 | nxxc5b | Does Costco sell hanging baskets of flowers? R... | [removed] | 1.0 | 1 | 2 |
| 2 | Costco | 1623467264 | nxx4b3 | 18 year anniversary | Would be my 18th year anniversary had i stayed... | 1.0 | 1 | 1 |
| 3 | Costco | 1623463893 | nxw4nm | PLEASE include the Costco location (City/State... | I asked once "Where did you find this item?" -... | 1.0 | 1 | 2 |
| 4 | Costco | 1623462565 | nxvqeg | Add burgers to food court! | | 1.0 | 1 | 28 |


In [20]:
min_posts_datetime = datetime.datetime.fromtimestamp(df_costco['created_utc'].min())
max_posts_datetime =datetime.datetime.fromtimestamp(df_costco['created_utc'].max())

print (f'The Costco posts are from {min_posts_datetime} to {max_posts_datetime}')

The Costco posts are from 2021-04-08 09:23:11 to 2021-06-12 15:47:26


### Comments:

We can see that about 15 days of scraped Walmart posts amount to roughly as many posts as about 2 months of Costco posts. In other words, the Walmart subreddit is considerably more active than the Costco subreddit over this period, which suggests that more people are discussing Walmart on Reddit.

This might also be because Walmart's business is more diverse, mainly comprising Walmart U.S., Walmart Supercenter, Walmart Discount Store, Walmart Neighborhood Market, Walmart Express, Walmart International and Sam's Club (7 segments in total). Besides, Walmart also does charity work and contributes to COVID-19 (coronavirus) vaccination. More businesses and operating models lead to more comments from customers and employees.

From our research, with over 2.3 million employees worldwide, Walmart has faced a torrent of lawsuits and issues with regard to its workforce. These issues involve low wages, poor working conditions, inadequate health care, and the company's strong anti-union policies. In October 2020, Walmart launched a new employee structure, which might have led to a lot of discussion and comments as well.

We will leverage the already established subreddit pages of Costco and Walmart to extract posts, look at the most talked-about subjects, and finally explore the key similarities and differences between the 2 subreddits.

### 2.2.3 Merge Walmart & Costco Data

In [21]:
df_merge = pd.concat([df_walmart, df_costco], ignore_index=True)

In [22]:
# checking that the data were combined correctly
df_merge['subreddit'].value_counts()

Costco     1497
walmart    1490
Name: subreddit, dtype: int64

### Comments:

In total, I get 1497 Costco posts and 1490 Walmart posts.
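
These class counts also pin down the baseline that a classifier has to beat. As a rough sketch, assuming the baseline is the accuracy of always predicting the majority class, it can be computed directly from the counts above:

```python
from collections import Counter

# Class counts taken from the value_counts() output above
labels = ["Costco"] * 1497 + ["walmart"] * 1490

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_label, round(baseline_accuracy, 4))  # Costco 0.5012
```

Because the two classes are almost perfectly balanced, this baseline sits at about 50%, which is why plain accuracy is a reasonable metric alongside F1 here.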

## 2.3 Export Data to CSV

In [27]:
data_path = "../datasets/01_SubReddit_Web_Scrapping/"

In [28]:
df_walmart.to_csv(data_path + 'walmart_scrapped_posts.csv', index= False)

In [29]:
df_costco.to_csv(data_path + 'costco_scrapped_posts.csv', index= False)

In [30]:
df_merge.to_csv(data_path + 'combined_posts.csv', index= False)

# In case rerunning this notebook overwrites the data,
# I have also manually copied the combined posts csv file to the dataset folder "02_Exploratory_Data_Analysis_and_Preprocessing"
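
One possible alternative to copying the file by hand is to write each export under a timestamped name, so a rerun never overwrites an earlier scrape. The sketch below is only a suggestion: `versioned_path` is a hypothetical helper, and a temporary directory stands in for the real `../datasets` folder.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def versioned_path(folder, stem, suffix=".csv", now=None):
    """Build folder/stem_<UTC timestamp>suffix, creating the folder if needed."""
    now = now or datetime.now(timezone.utc)
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{stem}_{now:%Y%m%dT%H%M%SZ}{suffix}"


with tempfile.TemporaryDirectory() as tmp:
    # fixed timestamp so the example output is reproducible
    fixed = datetime(2021, 6, 12, 9, 16, 29, tzinfo=timezone.utc)
    path = versioned_path(Path(tmp) / "01_SubReddit_Web_Scrapping", "combined_posts", now=fixed)
    print(path.name)  # combined_posts_20210612T091629Z.csv
```

In the notebook, the returned path could be passed to `df_merge.to_csv(path, index=False)` in place of the fixed file name.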

#### __Please go to Notebook 02_Exploratory Data Analysis and Preprocessing.ipynb__