<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Classifying Grab's Customer Feedback into Ride Hailing and Delivery

> Authors: Irfan Muzafar, Ng Wei, Lim Zheng Gang
---

**Problem Statement:**  
How can we quickly and accurately distinguish customer feedback about Grab's ride-hailing service from feedback about its delivery service?

**Target Audience:**
Grab Product Team

**Summary:**
- The product manager is overwhelmed by the volume of user reviews and struggles to classify them as delivery or ride-hailing.
- Develop an NLP model to separate ride-hailing comments from delivery comments in Grab's Google Play Store reviews.
- Training data: the Uber and Uber Eats subreddits.
- Other data sources: app store reviews for the Grab and Grab Driver applications, as well as those of competitors:  
  - foodpanda  
  - Deliveroo  
  - Gojek  

There are a total of three notebooks for this project:  
 1. `01_Data_Collection.ipynb`   
 2. `02_Data_Cleaning_EDA.ipynb`   
 3. `03_Modelling_Evaluation_Conclusion.ipynb`   

---
**This Notebook**
- Scrape Reddit for the data needed to build the model, and scrape the app reviews from the various app stores.
- Save the results as CSV files in the `datasets` folder for analysis.

#### 1. Import libraries and authenticate Reddit access

Import the libraries needed for web scraping and data manipulation: `praw` for accessing Reddit data, `requests` for making HTTP requests, `google_play_scraper` for scraping reviews from the Google Play Store, and `numpy` and `pandas` for handling the results.

In [1]:
import praw
import requests
import numpy as np
import pandas as pd
from google_play_scraper import app, Sort, reviews_all

Initialize a Reddit instance using PRAW. Replace the values of `client_id`, `client_secret`, and `user_agent` with your own Reddit API credentials, which you can obtain by creating a Reddit app on the Reddit developer website.

In [2]:
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # replace with your own client ID
    client_secret="YOUR_CLIENT_SECRET",  # replace with your own client secret
    user_agent="Project 3 GA",
)
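
Keeping credentials out of the notebook avoids leaking them when the file is shared. The following is a minimal sketch of one way to do that, assuming the credentials are stored in environment variables; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical variable names; pick whatever fits your environment.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=os.environ):
    """Return keyword arguments for praw.Reddit, read from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

With the variables set, the instance could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.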

Check that the credentials are valid. Because the instance is read-only (no username or password was supplied), the expected response is `None`.

In [3]:
print(reddit.user.me())

None


#### 2. Scraping for Reddit Data and save into CSV

Iterate through the subreddit names, 'uber' and 'UberEATS', and send an HTTP GET request to each subreddit URL with the `requests` library to retrieve the page content.

In [4]:
subreddit_names = ['uber', 'UberEATS']
for subreddit_name in subreddit_names:
    reddit_url = f'https://www.reddit.com/r/{subreddit_name}'
    response = requests.get(reddit_url)

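Check the number of posts in each subreddit. Reddit listings typically return at most about 1,000 items, so these counts reflect what the API exposes rather than each subreddit's full history.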

In [5]:

subreddit_uber = reddit.subreddit("uber")

count_submission_uber = 0 
for submission in subreddit_uber.new(limit = None):
    count_submission_uber += 1

subreddit_ubereats = reddit.subreddit("UberEATS")

count_submission_ubereats = 0 
for submission in subreddit_ubereats.new(limit = None):
    count_submission_ubereats += 1

print(f"Uber subreddit has {count_submission_uber} posts")
print(f"UberEATS subreddit has {count_submission_ubereats} posts")

Uber subreddit has 976 posts
UberEATS subreddit has 988 posts
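
The two counting loops follow the same pattern, so they could be folded into one helper. This is a sketch only: `count_items` and `count_new_submissions` are hypothetical names, and `reddit` is assumed to be the `praw.Reddit` instance created earlier.

```python
def count_items(listing):
    """Count the items in any iterable without storing them."""
    return sum(1 for _ in listing)


def count_new_submissions(reddit, subreddit_names):
    """Map each subreddit name to the number of submissions in its 'new' listing."""
    return {
        name: count_items(reddit.subreddit(name).new(limit=None))
        for name in subreddit_names
    }
```

For example, `count_new_submissions(reddit, ["uber", "UberEATS"])` would return both counts in a single dictionary.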


Scrape the title, body, and comments of the top 100 submissions in each subreddit. Limiting the scope to the top submissions keeps processing time manageable while favoring high-engagement content. Each subreddit's data is saved to a CSV file once scraping finishes.

In [9]:
data_ubereats = []

subreddit_ubereats = reddit.subreddit("UberEATS")

for submission in subreddit_ubereats.top(limit= 100):
    data_ubereats.append({
    "submission/comment id": submission.id,
    "submission or comment": "submission.title",
    "body": submission.title,
    "up_votes": submission.ups,
    })

    data_ubereats.append({
    "submission/comment id": submission.id,
    "submission or comment": "submission",
    "body": submission.selftext,
    "up_votes": submission.ups,
    })

    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        data_ubereats.append({
            "submission/comment id": comment.id,
            "submission or comment": "comment",
            "body": comment.body,
            "up_votes": comment.ups,
        })

df_ubereats = pd.DataFrame(data_ubereats)

# Save the DataFrame to a CSV file
output_path = '../datasets/reddit_ubereats_data_final.csv'
df_ubereats.to_csv(output_path, index=False)

print(f"data_ubereats saved to {output_path}")

data_ubereats saved to ../datasets/reddit_ubereats_data_final.csv


In [11]:
# List to store the scraped rows
data_uber = []

subreddit_uber = reddit.subreddit("uber")


for submission in subreddit_uber.top(limit= 100):
    data_uber.append({
    "submission/comment id": submission.id,
    "submission or comment": "submission.title",
    "body": submission.title,
    "up_votes": submission.ups,
    })

    data_uber.append({
    "submission/comment id": submission.id,
    "submission or comment": "submission",
    "body": submission.selftext,
    "up_votes": submission.ups,
    })

    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        data_uber.append({
            "submission/comment id": comment.id,
            "submission or comment": "comment",
            "body": comment.body,
            "up_votes": comment.ups,
        })


# Convert the list of rows to a DataFrame
df_uber = pd.DataFrame(data_uber)

# Save the DataFrame to a CSV file
output_path = '../datasets/reddit_uber_data_final.csv'
df_uber.to_csv(output_path, index=False)

print(f"data_uber saved to {output_path}")

data_uber saved to ../datasets/reddit_uber_data_final.csv
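
The two scraping cells differ only in the subreddit name, so the shared logic could live in one function. The sketch below keeps the same row layout as the cells above; the helper names `make_row`, `submission_to_rows`, and `scrape_subreddit` are hypothetical, and `reddit` is assumed to be the `praw.Reddit` instance created earlier.

```python
def make_row(item_id, kind, body, ups):
    """Build one row using the same column names as the cells above."""
    return {
        "submission/comment id": item_id,
        "submission or comment": kind,
        "body": body,
        "up_votes": ups,
    }


def submission_to_rows(submission):
    """Flatten one submission into rows: title, self-text, then every loaded comment."""
    rows = [
        make_row(submission.id, "submission.title", submission.title, submission.ups),
        make_row(submission.id, "submission", submission.selftext, submission.ups),
    ]
    # Drop "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        rows.append(make_row(comment.id, "comment", comment.body, comment.ups))
    return rows


def scrape_subreddit(reddit, name, limit=100):
    """Collect rows for the top `limit` submissions of one subreddit."""
    rows = []
    for submission in reddit.subreddit(name).top(limit=limit):
        rows.extend(submission_to_rows(submission))
    return rows
```

The result could then be saved as before, for example with `pd.DataFrame(scrape_subreddit(reddit, "uber")).to_csv(output_path, index=False)`.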


#### 3. Scrape app store data

For our analysis, we gather Google Play Store data for Grab and its competitors foodpanda, Deliveroo, and Gojek, covering both the customer apps and the driver or rider apps. The scraped data includes user reviews, ratings, and other relevant metadata. It lets us analyze the competitive landscape, identify trends, and understand user preferences and experiences across ride-hailing and food delivery services.

##### Grab

In [12]:
grab_reviews = reviews_all(
    'com.grabtaxi.passenger',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_grab_review = pd.DataFrame(np.array(grab_reviews),columns=['review'])
df_grab_review = df_grab_review.join(pd.DataFrame(df_grab_review.pop('review').tolist()))

output_path = '../datasets/grab_data_playstore.csv'
df_grab_review.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/grab_data_playstore.csv


In [14]:
grab_driver_reviews = reviews_all(
    'com.grabtaxi.driver2',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_grab_driver_reviews = pd.DataFrame(np.array(grab_driver_reviews),columns=['review'])
df_grab_driver_reviews = df_grab_driver_reviews.join(pd.DataFrame(df_grab_driver_reviews.pop('review').tolist()))

output_path = '../datasets/grab_driver_data_playstore.csv'
df_grab_driver_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/grab_driver_data_playstore.csv


##### Gojek 

In [15]:
gojek_reviews = reviews_all(
    'com.gojek.app',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_gojek_review = pd.DataFrame(np.array(gojek_reviews),columns=['review'])
df_gojek_review = df_gojek_review.join(pd.DataFrame(df_gojek_review.pop('review').tolist()))

output_path = '../datasets/gojek_data_playstore.csv'
df_gojek_review.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/gojek_data_playstore.csv


In [21]:
gojek_driver_reviews = reviews_all(
    'com.gojek.partner',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_gojek_driver_reviews = pd.DataFrame(np.array(gojek_driver_reviews),columns=['review'])
df_gojek_driver_reviews = df_gojek_driver_reviews.join(pd.DataFrame(df_gojek_driver_reviews.pop('review').tolist()))

output_path = '../datasets/gojek_driver_data_playstore.csv'
df_gojek_driver_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/gojek_driver_data_playstore.csv


##### Deliveroo

In [17]:
droo_driver_reviews = reviews_all(
    'com.deliveroo.driverapp',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_droo_driver_reviews = pd.DataFrame(np.array(droo_driver_reviews),columns=['review'])
df_droo_driver_reviews = df_droo_driver_reviews.join(pd.DataFrame(df_droo_driver_reviews.pop('review').tolist()))

output_path = '../datasets/droo_driver_data_playstore.csv'
df_droo_driver_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/droo_driver_data_playstore.csv


In [18]:
droo_reviews = reviews_all(
    'com.deliveroo.orderapp',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_droo_reviews = pd.DataFrame(np.array(droo_reviews),columns=['review'])
df_droo_reviews = df_droo_reviews.join(pd.DataFrame(df_droo_reviews.pop('review').tolist()))

output_path = '../datasets/droo_data_playstore.csv'
df_droo_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/droo_data_playstore.csv


##### foodpanda

In [19]:
fp_reviews = reviews_all(
    'com.global.foodpanda.android',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_fp_reviews = pd.DataFrame(np.array(fp_reviews),columns=['review'])
df_fp_reviews = df_fp_reviews.join(pd.DataFrame(df_fp_reviews.pop('review').tolist()))

output_path = '../datasets/fp_data_playstore.csv'
df_fp_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/fp_data_playstore.csv


In [20]:
fp_driver_reviews = reviews_all(
    'com.logistics.rider.foodpanda',
    sleep_milliseconds=0,
    lang='en',
    country='US',
    sort=Sort.NEWEST,
)

# convert to dataframe
df_fp_driver_reviews = pd.DataFrame(np.array(fp_driver_reviews),columns=['review'])
df_fp_driver_reviews = df_fp_driver_reviews.join(pd.DataFrame(df_fp_driver_reviews.pop('review').tolist()))

output_path = '../datasets/fp_driver_data_playstore.csv'
df_fp_driver_reviews.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

Data saved to ../datasets/fp_driver_data_playstore.csv
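
The eight cells above repeat the same fetch-and-save pattern with a different package ID and file prefix each time. One way to express that as a loop is sketched below. The package IDs and file prefixes are taken from the cells above; the names `APPS`, `playstore_output_path`, and `scrape_apps` are hypothetical, and the fetching and saving steps are passed in as callables so the loop itself has no third-party dependencies.

```python
# File prefix -> Google Play package ID, as used in the cells above.
APPS = {
    "grab": "com.grabtaxi.passenger",
    "grab_driver": "com.grabtaxi.driver2",
    "gojek": "com.gojek.app",
    "gojek_driver": "com.gojek.partner",
    "droo_driver": "com.deliveroo.driverapp",
    "droo": "com.deliveroo.orderapp",
    "fp": "com.global.foodpanda.android",
    "fp_driver": "com.logistics.rider.foodpanda",
}


def playstore_output_path(prefix, folder="../datasets"):
    """Build the CSV path for one app, matching the naming used above."""
    return f"{folder}/{prefix}_data_playstore.csv"


def scrape_apps(apps, fetch_reviews, save_rows, folder="../datasets"):
    """Fetch and save reviews for every app; return the review count per prefix.

    `fetch_reviews(package_id)` returns a list of review dicts.
    `save_rows(rows, path)` writes those rows to `path`.
    """
    counts = {}
    for prefix, package_id in apps.items():
        rows = fetch_reviews(package_id)
        save_rows(rows, playstore_output_path(prefix, folder))
        counts[prefix] = len(rows)
    return counts
```

In this notebook the callables could be `lambda pkg: reviews_all(pkg, sleep_milliseconds=0, lang='en', country='US', sort=Sort.NEWEST)` for fetching and `lambda rows, path: pd.DataFrame(rows).to_csv(path, index=False)` for saving, which builds the DataFrame directly from the list of review dictionaries.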


#### 4. Expected output files
At the end of this notebook, we expect to have the following files in the `datasets` folder, ready for analysis:

- `reddit_uber_data_final.csv`
- `reddit_ubereats_data_final.csv`
- `grab_data_playstore.csv`
- `grab_driver_data_playstore.csv`
- `fp_data_playstore.csv`
- `fp_driver_data_playstore.csv`
- `droo_data_playstore.csv`
- `droo_driver_data_playstore.csv`
- `gojek_data_playstore.csv`
- `gojek_driver_data_playstore.csv`
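
A quick check that every file in the list above was actually written can catch a failed scrape before moving on to the next notebook. The following is a minimal sketch using only the standard library; `EXPECTED_FILES` and `missing_files` are hypothetical names.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = [
    "reddit_uber_data_final.csv",
    "reddit_ubereats_data_final.csv",
    "grab_data_playstore.csv",
    "grab_driver_data_playstore.csv",
    "fp_data_playstore.csv",
    "fp_driver_data_playstore.csv",
    "droo_data_playstore.csv",
    "droo_driver_data_playstore.csv",
    "gojek_data_playstore.csv",
    "gojek_driver_data_playstore.csv",
]


def missing_files(folder, names=EXPECTED_FILES):
    """Return the expected file names that do not exist in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]
```

Running `missing_files("../datasets")` should return an empty list when all ten files are present.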