## <font color = 'purple'>Notebook 1: Web-scraping Google Play Store Reviews</font>

<font color = 'purple'>

This notebook focuses on web scraping user reviews from the Google Play Store to gather data for a sentiment analysis and feedback categorisation project. It uses libraries such as `requests`, `BeautifulSoup`, and `pandas` to send HTTP requests, parse HTML content, and organise the extracted data into a structured format. The reviews collected include key details like the review text, rating, and date, which are then preprocessed to remove duplicates and handle missing values. The scraped data will be used for further analysis, including feature engineering, model building, and dashboard development. 

</font>

In [108]:
#!pip install google-play-scraper

In [109]:
import google_play_scraper

<font color = 'purple'>

Get the app id of the Application from Playstore you want to fetch the reviews of. 

* e.g. for Facebook app, the link to the Google Play Store is: 
https://play.google.com/store/apps/details/Facebook?id=com.facebook.katana&hl=en_ZA&pli=1|

* So the app id for Facebook is `com.facebook.katana`

</font>

In [110]:
app_id = 'com.ideashower.readitlater.pro' # changed this to any app on Google Play

In [111]:
# Importing sorting options for reviews (e.g., most relevant, newest)
from google_play_scraper import Sort

# Importing constants that specify elements to be extracted from the scraped data
from google_play_scraper.constants.element import ElementSpecs

# Importing regular expressions used for data extraction and validation 
from google_play_scraper.constants.regex import Regex

# Importing constants related to request formatting options (e.g., handling payload structure) 
from google_play_scraper.constants.request import Formats

# Importing a utility function to send POST requests for retrieving data from the Google Play Store
from google_play_scraper.utils.request import post

In [112]:
import pandas as pd 
from datetime import datetime
import os

# Useful for displaying progress bars during loops, useful for long-running tasks
from tqdm import tqdm 
import time

# Handling JSON data, particularly for storing and loading data in JSON format 
import json 
from time import sleep
from typing import List, Optional, Tuple

<font color = 'purple'>

The below code defines a process for extracting reviews from the Google Play Store using pagination to handle large datasets. 

The `ReviewContinuationToken` class sstores details such as language, region, sorting order, and filters for score and device, allowing the code to maintain state between requests. 

The `fetch_review_batch` function handles a single request to the Google Play Store's review endpoint, using the app ID, sort order, and optional filters, then parses the response. 

The `get_reviews` function utilises this to fetch reviews in chunks, handling pagination with continuation tokens and compiling reviews until the requested count is reached. 

The `get_all_reviews` function wraps this process to fetch all available reviews by continuously calling `get_reviews` until all pages are retrieved, adding an optional delay to avoid rate-limiting.

</font>

In [113]:
# Maximum number of reviews fetched in one request
MAX_REVIEWS_PER_REQUEST = 199

# Class to manage continuation token and associated parameters
class ReviewContinuationToken:
    __slots__ = (
        "continuation_token",
        "language",
        "region",
        "order_by",
        "num_reviews",
        "score_filter",
        "device_filter",
    )

    def __init__(
        self, continuation_token, language, region, order_by, num_reviews, score_filter, device_filter
    ):
        self.continuation_token = continuation_token
        self.language = language
        self.region = region
        self.order_by = order_by
        self.num_reviews = num_reviews
        self.score_filter = score_filter
        self.device_filter = device_filter

# Function to retrieve a batch of reviews from the server
def fetch_review_batch(
    endpoint: str,
    app_identifier: str,
    sort_option: int,
    num_reviews: int,
    score_filter: Optional[int],
    device_filter: Optional[int],
    next_page_token: Optional[str],
):
    response = post(
        endpoint,
        Formats.Reviews.build_body(
            app_identifier,
            sort_option,
            num_reviews,
            "null" if score_filter is None else score_filter,
            "null" if device_filter is None else device_filter,
            next_page_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    parsed_data = json.loads(Regex.REVIEWS.findall(response)[0])

    return json.loads(parsed_data[0][2])[0], json.loads(parsed_data[0][2])[-2][-1]

# Function to retrieve multiple reviews with pagination handling
def get_reviews(
    app_identifier: str,
    language: str = "en",
    region: str = "us",
    order_by: Sort = Sort.MOST_RELEVANT,
    num_reviews: int = 100,
    score_filter: int = None,
    device_filter: int = None,
    token: ReviewContinuationToken = None,
) -> Tuple[List[dict], ReviewContinuationToken]:
    order_by = order_by.value

    # Use continuation token if available, otherwise start a new request
    if token is not None:
        next_page_token = token.continuation_token
        if next_page_token is None:
            return ([], token)

        language = token.language
        region = token.region
        order_by = token.order_by
        num_reviews = token.num_reviews
        score_filter = token.score_filter
        device_filter = token.device_filter
    else:
        next_page_token = None

    # Construct the request URL based on language and region
    request_url = Formats.Reviews.build(lang=language, country=region)

    remaining_reviews = num_reviews
    review_list = []

    while True:
        if remaining_reviews == 0:
            break

        # Limit the number of reviews fetched in one request
        if remaining_reviews > MAX_REVIEWS_PER_REQUEST:
            remaining_reviews = MAX_REVIEWS_PER_REQUEST

        try:
            # Fetch a batch of reviews and update the continuation token
            reviews_data, next_page_token = fetch_review_batch(
                request_url,
                app_identifier,
                order_by,
                remaining_reviews,
                score_filter,
                device_filter,
                next_page_token,
            )
        except (TypeError, IndexError):
            # Handling any errors that occur during review retrieval
            next_page_token = token.continuation_token
            continue

        # Process the reviews and append them to the result list
        for review in reviews_data:
            review_list.append(
                {
                    key: spec.extract_content(review)
                    for key, spec in ElementSpecs.Review.items()
                }
            )

        remaining_reviews = num_reviews - len(review_list)

        # Stop fetching if there is no continuation token left
        if isinstance(next_page_token, list):
            next_page_token = None
            break

    # Return the list of reviews and the updated continuation token
    return (
        review_list,
        ReviewContinuationToken(
            next_page_token, language, region, order_by, num_reviews, score_filter, device_filter
        ),
    )

# Function to retrieve all reviews by continuously calling the review fetcher
def get_all_reviews(app_identifier: str, delay_ms: int = 0, **kwargs) -> list:
    kwargs.pop("num_reviews", None)
    kwargs.pop("token", None)

    next_token = None
    all_reviews = []

    while True:
        # Fetch reviews using the maximum count per request
        batch_reviews, next_token = get_reviews(
            app_identifier,
            num_reviews=MAX_REVIEWS_PER_REQUEST,
            token=next_token,
            **kwargs
        )

        all_reviews += batch_reviews

        # Stop if there are no more reviews to fetch
        if next_token.continuation_token is None:
            break

        # Optional delay between requests to avoid server rate-limiting
        if delay_ms:
            sleep(delay_ms / 1000)

    return all_reviews

In [114]:
# Set the number of reviews you want to scrape
reviews_count = 25000

In [115]:
reviews_data = []
next_token = None

# Progress bar setup for tracking review collection
with tqdm(total=desired_review_count, position=0, leave=True) as progress_bar:
    while len(reviews_data) < desired_review_count:
        batch_reviews, next_token = get_reviews(
            app_identifier,
            continuation_token=next_token,
            language='en',  # Language for the reviews
            region='au',    # Country code for the reviews
            order_by=Sort.NEWEST,
            score_filter=None,
            num_reviews=199  # Keep this unchanged for max batch size
        )
        
        if not batch_reviews:
            break  # Exit loop if no more reviews are available
        
        reviews_data.extend(batch_reviews)
        progress_bar.update(len(batch_reviews))

25074it [01:12, 344.07it/s]                                                     


In [96]:
df = pd.DataFrame(result) 

df.head(5).style.set_caption("Styled DataFrame")

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion
0,d3331d3b-6170-43e2-a105-d42868bd1495,L Archie-Barnes,https://play-lh.googleusercontent.com/a/ACg8ocKqcEnahVzpRubHYnJepNyCUUtpuz8L5dlsJMysTZ-4usaK-g=mo,we like the new delivery windows,5,0,24.31.1,2024-08-14 22:56:15,,NaT,24.31.1
1,d7f03650-cf63-4665-8efc-b38501ce7bce,Ariel Lewis,https://play-lh.googleusercontent.com/a/ACg8ocKm_tqnKAmA6VOJm00I9zPsfCrDr3QtNCz94uuKd-X7hEKtwKo=mo,"This app is a glichy mess. I could not even see the map and I could not get back to my dashboard after my order was delivered. Will the UX/UI designers please stand up!? This is how you lose downloads. Walmart in Google Chrome works much better. Also, please pass along to whoever it may concern that every item that's delivered does not need to be in its own bag. Food with food and non-food with non-food. I do not think that it difficult.",1,1,24.30.1,2024-08-14 22:51:19,"We’re here for you. Please reach out at https://www.walmart.com/help, so we can help resolve this.",2024-08-14 23:28:27,24.30.1
2,c670f36a-48a5-4394-8c24-12ad525a6aa0,Edith Hernandez,https://play-lh.googleusercontent.com/a/ACg8ocLuAbA8mMeRw6fLy6GyJFVCxpEL7Y2eAQuIeLySEATDqjl9nA=mo,too slow,2,0,24.30.1,2024-08-14 22:48:55,,NaT,24.30.1
3,e5a227d9-23ee-4f47-a11e-f0d97cac2a87,Edward TRAINOR,https://play-lh.googleusercontent.com/a/ACg8ocIBKHV_GrnMtvDABwyjZ-EOGxvVfMTXOLsB62fPNpvns6YVgQ=mo,Love this app,5,0,24.31.1,2024-08-14 22:48:52,,NaT,24.31.1
4,f0481530-8dfe-4993-b25c-4e56c5b213a5,Vickey Lockvis,https://play-lh.googleusercontent.com/a-/ALV-UjVVFDuaE-ybW3-9KU1fmvOeXdKc_vQidjA-vX8nJ7u-SsSCvCKB,I love being able to order groceries. I've had the best delivery drivers.,5,0,24.31.1,2024-08-14 22:35:25,,NaT,24.31.1


In [97]:
df.columns

Index(['reviewId', 'userName', 'userImage', 'content', 'score',
       'thumbsUpCount', 'reviewCreatedVersion', 'at', 'replyContent',
       'repliedAt', 'appVersion'],
      dtype='object')

In [98]:
df = df[['reviewId', 'userName', 'content', 'score',
       'thumbsUpCount', 'reviewCreatedVersion', 'at', 'appVersion']]

Scrapes the reviews only up to yesterday 

In [100]:
today = datetime.date.today()

yesterday = today - datetime.timedelta(days = 1) 

print(yesterday) 

2024-08-14


In [101]:
df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,d3331d3b-6170-43e2-a105-d42868bd1495,L Archie-Barnes,we like the new delivery windows,5,0,24.31.1,2024-08-14 22:56:15,24.31.1
1,d7f03650-cf63-4665-8efc-b38501ce7bce,Ariel Lewis,This app is a glichy mess. I could not even se...,1,1,24.30.1,2024-08-14 22:51:19,24.30.1
2,c670f36a-48a5-4394-8c24-12ad525a6aa0,Edith Hernandez,too slow,2,0,24.30.1,2024-08-14 22:48:55,24.30.1
3,e5a227d9-23ee-4f47-a11e-f0d97cac2a87,Edward TRAINOR,Love this app,5,0,24.31.1,2024-08-14 22:48:52,24.31.1
4,f0481530-8dfe-4993-b25c-4e56c5b213a5,Vickey Lockvis,I love being able to order groceries. I've had...,5,0,24.31.1,2024-08-14 22:35:25,24.31.1


In [102]:
df['at'].iloc[0].date()

datetime.date(2024, 8, 14)

In [103]:
df

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,d3331d3b-6170-43e2-a105-d42868bd1495,L Archie-Barnes,we like the new delivery windows,5,0,24.31.1,2024-08-14 22:56:15,24.31.1
1,d7f03650-cf63-4665-8efc-b38501ce7bce,Ariel Lewis,This app is a glichy mess. I could not even se...,1,1,24.30.1,2024-08-14 22:51:19,24.30.1
2,c670f36a-48a5-4394-8c24-12ad525a6aa0,Edith Hernandez,too slow,2,0,24.30.1,2024-08-14 22:48:55,24.30.1
3,e5a227d9-23ee-4f47-a11e-f0d97cac2a87,Edward TRAINOR,Love this app,5,0,24.31.1,2024-08-14 22:48:52,24.31.1
4,f0481530-8dfe-4993-b25c-4e56c5b213a5,Vickey Lockvis,I love being able to order groceries. I've had...,5,0,24.31.1,2024-08-14 22:35:25,24.31.1
...,...,...,...,...,...,...,...,...
25069,95e796ac-447f-4312-bc6c-58a573277c7d,Sandra A QUEEN,I ENJOYYYY the savings on every purchase.,5,0,24.9.1,2024-03-15 04:58:10,24.9.1
25070,d4ed91dc-0b84-41a5-a397-4005332b4d44,Carol Bailey,It does not allow you to change from pickup to...,3,0,24.9.1,2024-03-15 04:57:26,24.9.1
25071,f844c142-fe42-4cad-bb39-83210fadc242,Nancy Tweed,"Great delivery service, reasonable prices.",5,0,24.9.1,2024-03-15 04:57:20,24.9.1
25072,75238666-6492-41e4-8d1e-3c098ff626f5,Sharon,"live this. to order and have it delivered, it ...",5,0,24.10.1,2024-03-15 04:30:11,24.10.1


In [104]:
new_df = df[df['at'].dt.date == yesterday]

In [105]:
new_df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,d3331d3b-6170-43e2-a105-d42868bd1495,L Archie-Barnes,we like the new delivery windows,5,0,24.31.1,2024-08-14 22:56:15,24.31.1
1,d7f03650-cf63-4665-8efc-b38501ce7bce,Ariel Lewis,This app is a glichy mess. I could not even se...,1,1,24.30.1,2024-08-14 22:51:19,24.30.1
2,c670f36a-48a5-4394-8c24-12ad525a6aa0,Edith Hernandez,too slow,2,0,24.30.1,2024-08-14 22:48:55,24.30.1
3,e5a227d9-23ee-4f47-a11e-f0d97cac2a87,Edward TRAINOR,Love this app,5,0,24.31.1,2024-08-14 22:48:52,24.31.1
4,f0481530-8dfe-4993-b25c-4e56c5b213a5,Vickey Lockvis,I love being able to order groceries. I've had...,5,0,24.31.1,2024-08-14 22:35:25,24.31.1


In [107]:
new_df.to_csv('Walmart-reviews.csv')

<font color = 'purple'>
    
We extract reviews for various Google Play Store apps, and then append them which the below code provides us with.
    
</font>

In [None]:
# This is the directory where the csv files are located
directory = '/Users/iamfaiyam/Downloads'

# Empty list to store the dataframe
dataframes = []

# Loops through all the csv files in the directory
for filename in os.listdir(directory):
    if filename.endswith(".csv"):
        file_path = os.path.join(directory, filename)
        # Read each csv file into a dataframe
        df = pd.read_csv(file_path, chunksize=10000)
        for chunk in df: 
            dataframes.append(chunk)

# Concatenate all dataframes in the list 
combined_df = pd.concat(dataframes, ignore_index = True) 

# Drop the first column (index column) 
combined_df.drop(combined_df.columns[0], axis = 1, inplace = True)

# Saving the combined dataframe to a new csv file
combined_df.to_csv('combined_reviews.csv', index = False) 