# **SpooderApp™**
#### *Leveraging business reviews to gain insights for potential improvements.*

## Hypothesis


Consumer reviews are critical to the success of any business, yet many businesses lack the resources to effectively analyze and act on this feedback. We hypothesize that by leveraging advanced Natural Language Processing (NLP) models, specifically HuggingFace Transformers and LangChain with OpenAI models, we can accurately classify customer sentiment and generate actionable recommendations. Our goal is to empower businesses with detailed sentiment analysis and dynamic feedback, enabling them to enhance consumer satisfaction and overall performance.

# **Initialization**

Before executing the following code, uncomment the install line for any package not yet installed in your environment.

## Installations

In [None]:
# Installing necessary libraries 
# NOTE: Uncomment any libraries not currently present in your environment for
#       initial execution of this notebook

# General utilities
# %pip install pandas --quiet                  # Data manipulation and analysis
# %pip install numpy --quiet                   # Numerical computations
# %pip install scipy --quiet                   # Scientific computing
# %pip install matplotlib --quiet              # Plotting and visualization
# %pip install seaborn --quiet                 # Statistical data visualization
# %pip install tqdm --quiet                    # Progress bar for loops
# %pip install gdown --quiet                   # Downloading files from Google Drive
# NOTE: `zipfile` and `json` ship with Python's standard library and need no installation

# Machine Learning & NLP
# %pip install torch --quiet                   # PyTorch for deep learning
# %pip install transformers --quiet            # HuggingFace Transformers
# %pip install datasets --quiet                # HuggingFace Datasets
# %pip install scikit-learn --quiet            # Machine learning tools
# %pip install nltk --quiet                    # Natural Language Toolkit for text processing
# %pip install accelerate --quiet              # Accelerate training
# %pip install evaluate --quiet                # Metric evaluation

# Web scraping
# %pip install selenium --quiet                # Browser automation
# %pip install webdriver-manager --quiet       # Manage WebDriver binaries
# %pip install beautifulsoup4 --quiet          # Parsing HTML and XML

# Environment & API
# %pip install python-dotenv --quiet           # Load environment variables
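# %pip install langchain --quiet               # LangChain framework for LLM applications
# %pip install langchain-openai --quiet        # OpenAI integration for LangChain (provides `langchain_openai`)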

# Dash (Web App Framework)
# %pip install dash --quiet                       # Dash core components
# %pip install dash-bootstrap-components --quiet  # Dash Bootstrap components

# Plotting & Visualization
# %pip install plotly --quiet                  # Interactive graphing library

# Image Handling
# %pip install opencv-python-headless --quiet  # OpenCV for image processing

## Imports and Dependencies

In [None]:
# General Utilities
import pandas as pd               # Data manipulation and analysis
import os                         # Operating system interfaces
import re                         # Regular expressions
import json                       # JSON handling
import time                       # Time management
import zipfile                    # Working with zip files
import unicodedata                # Unicode character handling
import numpy as np                # Numerical computations
import scipy as sp                # Scientific computing
import gdown                      # Google Drive file download
from tqdm import tqdm             # Progress bar for loops
import base64                     # Encoding and decoding binary data
from io import BytesIO            # Handling binary data in memory

# Image Handling
import cv2                        # OpenCV for image processing
from PIL import Image             # Image processing via PIL (for handling image conversion)

# Plotting and Visualization
import matplotlib.pyplot as plt   # Plotting and visualization
import matplotlib.ticker as mtick # Setting ticks to larger numbers
import seaborn as sns             # Statistical data visualization
import plotly.express as px       # Simple interactive plots
import plotly.graph_objects as go # Detailed interactive plots

# Machine Learning & NLP
import torch                                          # PyTorch for deep learning
from sklearn.model_selection import train_test_split  # Data splitting for training and testing
# from datasets import load_metric                    # Unavailable in recent `datasets` releases; `evaluate.load` (imported below) is used instead
import nltk                                           # Natural Language Toolkit for text processing
from nltk.corpus import stopwords                     # Stop words for text preprocessing
from nltk.tokenize import word_tokenize               # Tokenization of text
import transformers                                   # HuggingFace Transformers

# Pretrained Model and Tokenization
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer  # DistilBERT model and tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification         # Auto-tokenizer and model for sequence classification
from transformers import DataCollatorWithPadding                                   # Dynamic padding for batched data
from transformers import TrainingArguments, Trainer                                # Training arguments and trainer
from transformers import pipeline                                                  # Inference pipeline

# Dataset Formatting
import accelerate                           # Accelerate training
from datasets import Dataset                # Dataset handling
from evaluate import load                   # Metric evaluation

# Web Scraping
from selenium import webdriver                                          # Browser automation
from selenium.webdriver.chrome.service import Service as ChromeService  # WebDriver service for Chrome
from selenium.webdriver.support.ui import WebDriverWait                 # WebDriver wait
from selenium.webdriver.common.by import By                             # Locating elements by attributes
from selenium.webdriver.support import expected_conditions as EC        # Expected conditions for WebDriver waits
from selenium.common.exceptions import NoSuchElementException           # Exception handler when elements are not found
from webdriver_manager.chrome import ChromeDriverManager                # Manage WebDriver binaries
from bs4 import BeautifulSoup                                           # Parsing HTML and XML

# Environment & API
from dotenv import load_dotenv              # Load environment variables
from langchain_openai import ChatOpenAI      # OpenAI API for LangChain

# Prompt Template and LLM Chain
from langchain import PromptTemplate        # Prompt template for LangChain
from langchain.chains import LLMChain       # LLM Chain for linking models

# Dash (Web App Framework)
from dash import Dash, dcc, html, callback, callback_context  # Dash core components and callbacks
from dash.dependencies import Input, Output, State            # Dash dependencies for callbacks
from dash.exceptions import PreventUpdate                     # Prevent updates in callbacks
import dash_bootstrap_components as dbc                       # Dash Bootstrap components

# Other
import math                       # Mathematical functions

# **Data**

## Dataset

To prepare our sentiment analysis model, we leveraged the __[Yelp Open Dataset](https://www.yelp.com/dataset)__ to harness existing reviews and ratings. We also explored other available metrics during our EDA, before proceeding to preprocessing and model training.

The [Yelp Open Dataset](#Yelp-Open-Dataset) provides `.json` files containing business, review, and user data. Because our project does not involve image classification, we opted to forgo the photo dataset.

#### Retrieval

Due to the large size of the provided files, direct pushes to GitHub were not a viable option. Instead, the files were converted to `.csv` format (outlined in `Resources/json_convesion_for_gdown.ipynb`) and uploaded to Google Drive for retrieval through `gdown`.
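The conversion notebook itself is not reproduced here. As a rough, standard-library-only sketch of what such a conversion can look like, the function below assumes the source files hold one JSON object per line; `json_lines_to_csv` and its arguments are hypothetical names used for illustration only.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a newline-delimited JSON file to CSV, one record at a time."""
    with open(json_path, encoding="utf-8") as src:
        # First pass: collect the union of keys so every column gets a header
        fieldnames = []
        for line in src:
            if not line.strip():
                continue
            for key in json.loads(line):
                if key not in fieldnames:
                    fieldnames.append(key)

        # Second pass: stream the records into the CSV file
        src.seek(0)
        rows = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            for line in src:
                if line.strip():
                    # Keys missing from a record are written as empty cells
                    writer.writerow(json.loads(line))
                    rows += 1
    return rows
```

Reading the file twice keeps memory use low, which matters for files too large to load at once. Nested values (for example, a dictionary of business attributes) would be written as their Python string representation in this sketch.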

In [None]:
# Defining a function to access datasets through `gdown`
def fetch_data(dataset):
    '''
    Fetches a specific dataset from Google Drive using gdown and loads it into a DataFrame.

    Args:
        dataset (str):  A string representing the dataset to be retrieved. Must be one of the following:
                        'business', 'checkin', 'reviews', 'tip', or 'user'
    
    Returns:
        df (DataFrame): A DataFrame with the retrieved Yelp dataset.

    Raises:
        ValueError:     If an invalid dataset identifier is provided.
        OSError:        If there is an issue with downloading the file or reading the CSV file.
        Exception:      If any other unexpected error occurs during the download or file reading process.
    '''
    # Declaring `url` and `output` for dataset (named `dataset` to avoid shadowing the built-in `set`)
    match dataset:
        case 'business':
            url = 'https://drive.google.com/file/d/1t-_rOjZ8oMqPcMJunVaMgY3OEbhnuSCv/view?usp=sharing'
            output = 'Resources/business_dataset.csv'
        case 'checkin':
            url = 'https://drive.google.com/file/d/1_AVWp31ymfvf4QgTiMN_WLAeapfr0omf/view?usp=sharing'
            output = 'Resources/checkin_dataset.csv'
        case 'reviews':
            url = 'https://drive.google.com/file/d/1L8rFjhOQyU90Ycr9t_OLA70vCYM0e7ck/view?usp=sharing'
            output = 'Resources/reviews_dataset.csv'
        case 'tip':
            url = 'https://drive.google.com/file/d/1LMkCi5AFC_58_m7ELmn1hR8YDykuXwqq/view?usp=sharing'
            output = 'Resources/tip_dataset.csv'
        case 'user':
            url = 'https://drive.google.com/file/d/1kQ522qcod7AjD5DO9vj8qFcSKxwJCDrO/view?usp=sharing'
            output = 'Resources/user_dataset.csv'
        case _:
            # Raises
            raise ValueError('Invalid dataset selected, please try again')
    
    # Attempting to fetch dataset
    try:
        # Downloading dataset
        gdown.download(url, output, fuzzy=True, quiet=True)

        # Reading in the dataset
        df = pd.read_csv(output, low_memory=False)

    # Raises
    except ImportError as e:
        raise ImportError(f"Required module not found: {e}")
    except OSError as e:
        raise OSError(f"Error occurred during file operation: {e}")
    except Exception as e:
        raise Exception(f"An unexpected error occurred: {e}")
    
    # Returning the dataset
    return df

#### Fetching and Reading In

*Note: Once the* `fetch_data()` *function has been run for all five (5) datasets, you may comment out those lines of code (annotated in the cell, as well). For any additional executions, use the* `pd.read_csv()` *lines instead.*
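The fetch-once, read-afterwards workflow described in this note could also be automated instead of toggling comments. The helper below is a minimal sketch, not part of the original notebook: `load_dataset` is a hypothetical name, and it assumes the `Resources/<name>_dataset.csv` naming used by `fetch_data()`.

```python
from pathlib import Path


def load_dataset(name, fetch, read, resources_dir="Resources"):
    """Read the local CSV if it exists; otherwise fall back to fetching it."""
    # Assumed naming convention: Resources/<name>_dataset.csv
    path = Path(resources_dir) / f"{name}_dataset.csv"
    if path.exists():
        # The file was downloaded on an earlier run, so read it locally
        return read(path)
    # First run: delegate to the downloader, which also returns the data
    return fetch(name)
```

In this notebook the call would look like `business_df = load_dataset('business', fetch_data, pd.read_csv)`. Passing the fetch and read functions as arguments keeps the helper free of third-party imports.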

In [None]:
# Fetching all datasets (uncomment for first run of code)
# business_df = fetch_data('business')
# checkin_df = fetch_data('checkin')
# reviews_df = fetch_data('reviews')
# tips_df = fetch_data('tip')
# user_df = fetch_data('user')

# Reading in all datasets (comment out if data is not already fetched)
business_df = pd.read_csv('./Resources/business_dataset.csv')
checkin_df = pd.read_csv('./Resources/checkin_dataset.csv')
reviews_df = pd.read_csv('./Resources/reviews_dataset.csv')
tips_df = pd.read_csv('./Resources/tip_dataset.csv')
user_df = pd.read_csv('./Resources/user_dataset.csv')

## EDA

Each dataset was explored individually before final feature selection and concatenation.

### Business dataset

Contains business data including location data, attributes, and categories.

#### Overview

In [None]:
# Previewing the data
business_df.head()

In [None]:
# Confirming additional data details
business_df.info()

### Checkin dataset

Contains checkins on a business.

#### Overview

In [None]:
# Previewing the data
checkin_df.head()

In [None]:
# Confirming additional data details
checkin_df.info()

*Note: We determined this dataset would not add any value to our training data.*

### Reviews dataset

Contains full review text data including the user_id that wrote the review and the business_id the review is written for.

#### Overview

In [None]:
# Previewing the data
reviews_df.head()

In [None]:
# Confirming additional data details
reviews_df.info()

#### Na count

In [None]:
# Verifying missing records
reviews_df.isna().sum()

#### Dropping features:
- **review_id**
- **useful**
- **funny**
- **cool**

In [None]:
# Dropping features
reviews_df.drop(columns = ['review_id','useful','funny','cool'],
                inplace = True)

#### Renaming features:
- **text** to **review**

*Note: This change was reverted before model training.*

In [None]:
# Renaming features
reviews_df.rename(columns = {'text':'review'},inplace = True)

# Confirming columns renamed
reviews_df.head()

*Dropped:*
- **review_id:** *Eliminated due to low informational value.*
- **useful:** *Eliminated due to low relevance.*
- **funny:** *Eliminated due to low relevance.*
- **cool:** *Eliminated due to low relevance.*

*Essential:*
- **business_id:** *Used as an identifier for concatenation.*
- **stars:** *Used as eventual model target.*
- **review:** *Used as feature for multiple models.*

*Retained:*
- **date:** *Retained for potential time series analysis.*

### Tips dataset

Contains tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.

#### Overview

In [None]:
# Previewing the data
tips_df.head()

In [None]:
# Confirming additional data details
tips_df.info()

#### Dropping features:
- **compliment_count**

In [None]:
# Dropping features
tips_df.drop(columns = ['compliment_count'],
             inplace =True)

#### Renaming features:
- **text** to **recommendations**

In [None]:
# Renaming columns
tips_df.rename(columns = {'text':'recommendations'},inplace = True)

# Confirming columns renamed
tips_df.head()

*Dropped:*
- **compliment_count:** *Eliminated due to low informational value.*

*Retained:*
- **recommendations:** *Retained for potential use as a target variable, since the dataset has similar potential for insights to improve the customer experience.*

### User dataset

Contains user data including the user's friend mapping and all the metadata associated with the user.

#### Overview

In [None]:
# Previewing the data
user_df.head()

#### Info

In [None]:
# Confirming additional data details
user_df.info()

*Note: We decided not to include this dataset in the training data, in order to preserve user anonymity.*

### Concatenation

Merging the `reviews` and `business` data sets to create a single DataFrame to train our model.

In [None]:
# Declaring `data_df` as the merge of `reviews_df` and `business_df`
data_df = reviews_df.merge(business_df,how='left',on = 'business_id')

#### Overview

In [None]:
# Confirming additional data details
data_df.info()

In [None]:
# Previewing the data
data_df.head()

#### Na count

In [None]:
# Verifying missing records
data_df.isna().sum()

In [None]:
# Verifying missing records as percentages for specific features
# Calculating percentages
na_prcnt = data_df[['attributes','categories','hours']].isna().sum()/data_df.shape[0]*100

# Converting to a DataFrame
nas_df = pd.DataFrame(na_prcnt, columns=['percentage'])

# Transposing values
nas_df = nas_df.transpose()

# Rounding values
nas_df.round(4)

#### Na Count Visualization

In [None]:
# Generating a bar plot of NA records
sns.barplot(data = nas_df).set_title('NA percentage')

# Saving the figure (commented out after initial save)
# plt.savefig('Images/NA_Percentage_Plot.png')

# Displaying the figure
plt.show()

*Note: Ultimately, we decided to drop all three columns.*

#### Dropping features with NA records:
- **attributes**
- **categories**
- **hours**

In [None]:
# Dropping features
data_df.drop(columns = ['attributes','categories','hours'],inplace=True)

In [None]:
# Verifying missing records
data_df.isna().sum()

#### Comparing similar features

Exploring the features `stars_x` and `stars_y`.

In [None]:
# Previewing data with different values in both features
data_df.loc[data_df['stars_x'] != data_df['stars_y']][['stars_x','stars_y']].head()

In [None]:
# Previewing values for the same business
data_df.loc[data_df['business_id']=='XQfwVwDr-v0ZS3_CbbE5Xw'][['stars_x','stars_y']].head()

In [None]:
# Finding the average value of `stars_x` for the same business
round(data_df.loc[data_df['business_id']=='XQfwVwDr-v0ZS3_CbbE5Xw']['stars_x'].mean(),2)

`stars_y` seems to represent a business's average rating.
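The single-business check could be extended to every business at once. The sketch below uses a small made-up DataFrame in place of `data_df`, and it assumes the business-level rating is the review average rounded to the nearest half star; on the real data, mismatches are possible if the dataset holds only a subset of each business's reviews.

```python
import pandas as pd

# Toy stand-in for `data_df` before renaming (values are made up)
toy_df = pd.DataFrame({
    "business_id": ["a", "a", "a", "b", "b"],
    "stars_x": [5, 4, 4, 2, 3],             # individual review ratings
    "stars_y": [4.5, 4.5, 4.5, 2.5, 2.5],   # business-level rating
})

# Averaging review ratings per business and keeping the listed rating
per_business = toy_df.groupby("business_id").agg(
    review_mean=("stars_x", "mean"),
    listed=("stars_y", "first"),
)

# Assumption: the listed rating is the mean rounded to the nearest half star
per_business["rounded_mean"] = (per_business["review_mean"] * 2).round() / 2
per_business["matches"] = per_business["rounded_mean"] == per_business["listed"]

print(per_business)
```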

#### Renaming features
- **stars_y** to **stars_avg**
- **stars_x** to **stars**

In [None]:
# Renaming features
data_df.rename(columns={'stars_y':'stars_avg','stars_x':'stars'},inplace = True)

#### Visualization

In [None]:
# Declaring plots
fig,ax = plt.subplots()

# Plotting figure
sns.countplot(data_df,
             x='is_open',
             hue = 'is_open',
             ax = ax).set_title('`is_open` Feature Count')

# Saving the figure (commented out after initial save)
# plt.savefig('Images/is_open_Feature_Count.png')

# Displaying the figure
plt.show()

#### Dropping features:
- **is_open**

In [None]:
# Dropping features
data_df.drop(columns = ['is_open'],inplace = True)

*Note: We decided to drop this feature due to its low informational value and class imbalance.*

### Merging with tips dataset

Exploring the potential gain from merging the tips dataset, since it, too, contains customer recommendations to improve the experience.

#### Overview

In [None]:
# Previewing the data
tips_df.head()

In [None]:
# Confirming additional data details
tips_df.info()

In [None]:
# Finding the quantity of unique values in `business_id` in `tips_df`
display(tips_df['business_id'].unique().shape[0])

In [None]:
# Finding the quantity of unique values in `business_id` in `data_df`
data_df['business_id'].unique().shape[0]

Subset of **business_id** in `data_df` not found in `tips_df`

In [None]:
# Declaring `no_tips_df` as a subset of data not found in `tips_df`
no_tips_df = data_df[~data_df['business_id'].isin(tips_df['business_id'])]

# Previewing the data
no_tips_df.head()

Quantity of **business_id** in `data_df` not found in `tips_df`

In [None]:
# Counting the unique `business_id`s in `no_tips_df`
not_found = no_tips_df['business_id'].unique().shape[0]

# Printing the result
print(f'Number of business_ids in data_df not found in tips_df: {not_found}')

Evidence

In [None]:
# Looking up a specific `business_id` from `no_tips_df` in `tips_df` (expected to return no rows)
tips_df.loc[tips_df['business_id'] == no_tips_df['business_id'].iloc[33]]

### Merge

In [None]:
# Merging `tips_df` and `data_df` as `test_df`
test_df = pd.merge(tips_df,data_df,
                   on = ['business_id','user_id'],
                   how = 'inner')
                         

#### Overview

In [None]:
# Previewing the data
test_df.head()

In [None]:
# Confirming additional data details
test_df.info()

#### Comparison

Comparing **review** against **recommendations**

In [None]:
# Previewing the data for the two selected features
test_df[['review','recommendations']].head()

*Note: The* `data_df` *DataFrame has approximately* ***7 million*** *entries, and the* `tips_df` *DataFrame has about* ***1 million***. *After merging them, we end up with a little under* ***500 thousand***. *The comparison above shows little difference between a* ***review*** *from the reviews dataset and a* ***recommendation*** *from the tips dataset. Because a merge would discard a significant amount of data, we decided against it.*
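One way to quantify how many review rows an inner merge would discard, without actually discarding them, is a left merge with pandas' `indicator=True` option. The sketch below uses small made-up frames in place of `data_df` and `tips_df`.

```python
import pandas as pd

# Toy stand-ins for `data_df` and `tips_df` (values are made up)
reviews = pd.DataFrame({
    "business_id": ["a", "b", "d", "d"],
    "user_id": ["u1", "u9", "u4", "u5"],
    "review": ["great", "fine", "slow", "cold"],
})
tips = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "user_id": ["u1", "u2", "u3"],
    "recommendations": ["try the pie", "go early", "cash only"],
})

# A left merge keeps every review; `_merge` records whether a tip matched
merged = reviews.merge(
    tips, on=["business_id", "user_id"], how="left", indicator=True
)

kept = int((merged["_merge"] == "both").sum())       # rows an inner merge keeps
lost = int((merged["_merge"] == "left_only").sum())  # rows an inner merge drops

print(f"kept: {kept}, lost: {lost}")
```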

### Finalizing data

#### Dropping features:
- **user_id** *(to preserve user anonymity)*

In [None]:
# Dropping features
data_df.drop(columns = ['user_id'],inplace = True)

#### Overview

In [None]:
# Previewing the data
data_df.head()

In [None]:
# Confirming additional data details
data_df.info()

#### Na count

In [None]:
# Verifying missing records
data_df.isna().sum()

#### Visualizations

Preparing some visualizations of the final dataset

In [None]:

# Ratings distribution

# Building a bar graph of the distribution of star ratings
plt.figure(figsize=(10, 6))
data_df['stars'].value_counts().sort_index().plot(kind='bar', color='brown')

# Setting plot features
plt.title('Distribution of Star Ratings')
plt.xlabel('Star Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)

# Formatting the y-axis to show full numbers
plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f'{int(x):,}'))

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Disribution_of_Star_Ratings.png')

# Displaying the figure
plt.show()

In [None]:
# Review counts per business

# Grouping data by business name and counting the number of reviews per business
review_counts = data_df.groupby('name')['review'].count().reset_index()
review_counts.columns = ['business', 'review_count']

# Sorting the data by review count and selecting the top 10 businesses
top_10_review_counts = review_counts.sort_values(by='review_count', ascending=False).head(10)

# Building a bar graph of review counts per business
plt.figure(figsize=(14, 8))
sns.barplot(data=top_10_review_counts, x='business', y='review_count', palette='viridis')

plt.title('Top 10 Businesses by Review Count')
plt.xlabel('Business')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Disribution_of_Star_Ratings_Top_10_Business.png')

# Displaying the figure
plt.show()

In [None]:
# Business category distribution

# Extracting categories
all_categories = business_df['categories'].str.split(',').explode().str.strip()

# Counting the occurrences of each category
category_counts = all_categories.value_counts().head(20)

# Building the bar plot
plt.figure(figsize=(12, 8))
sns.barplot(x=category_counts.index, y=category_counts.values, palette='summer')

# Setting plot features
plt.title('Distribution of Top 20 Business Categories')
plt.xlabel('Category')
plt.ylabel('Number of Businesses')
plt.xticks(rotation=67)

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Distribution_of_Top_20_Business_Categories.png')

# Displaying the figure
plt.show()

# Functions (Pt 1)

The following user-defined functions are used throughout the sampling, preprocessing, and modeling of our data, each developed with its annotated purpose in mind:

| **Function** | **Notes** |
| :--- | :--- |
| `sample_stars()` | Selects subsets of a DataFrame based on user rating value thresholds |
| `remove_accented_chars()` | Removes accented characters from text |
| `clean_text()` | Removes web formatting from text |
| `pre_process_reviews()` | Removes stop words from text |
| `tokenizer_function()` | Tokenizes text |
| `compute_metrics()` | Computes metrics to assist with evaluating model performance |

Functions outlined in more detail below require a DataFrame with the following features:

| **Feature** | **Notes** |
| :--- | :--- |
| `rating` | An int or float column with submitted user review scores |
| `rev_col` | A text column with available reviews |

In [None]:
# Function to select various subsets of data
def sample_stars(df, val):
    '''
    Samples a specific subset of a DataFrame based on the value of the 'stars' column.

    Args:
        df (DataFrame): Any DataFrame with sufficient data.
        val (int):      An integer representing the specific star rating to filter and sample from the DataFrame.

    Returns:
        df (DataFrame): A subset of the input DataFrame filtered by the specified star rating.

    Raises:
        KeyError:       If 'stars' is not a valid column name in the DataFrame.
        TypeError:      If `df` is not a DataFrame or if `val` is not an integer.
        ValueError:     If `val` is not within the acceptable range (1 to 5).
        ValueError:     If there are not enough records in the DataFrame to sample the requested amount.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    if 'stars' not in df.columns:
        raise KeyError("Column 'stars' not found in DataFrame.")
    if not isinstance(val, int):
        raise TypeError('The `val` parameter must be passed as an integer.')
    if val < 1 or val > 5:
        raise ValueError('The `val` parameter must be an integer between 1 and 5.')
    
    # Filtering and sampling the DataFrame
    df = df[df['stars'] == val].copy()

    # Sampling the DataFrame based on specified star rating to balance end dataset
    if val >= 4:
        df = df.sample(1000)
    elif val <= 2:
        df = df.sample(1000)
    else:
        df = df.sample(2000)
    df.reset_index(drop=True, inplace=True)
    
    # Returning the DataFrame
    return df

In [None]:
# Function to remove accented characters
def remove_accented_chars(text):
    '''
    Removes accented characters.

    Args:
        text (str):     A corpus of text.

    Returns:
        text (str):     A processed corpus of text.
    '''
    # Removing accented characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    # Returning `text`
    return text

In [None]:
# Function to remove web formatting
def clean_text(text):
    '''
    Removes URLs, mentions, hashtags, and repeated whitespace from text.

    Args:
        text (str):     A corpus of text.
    
    Returns:
        text (str):     A processed corpus of text.
    '''
    # Removing URLs
    text = re.sub(r"http\S+", "", text)
    # Removing mentions
    text = re.sub(r"@\S+", "", text)
    # Removing hashtags
    text = re.sub(r"#\S+", "", text)
    # Replacing runs of whitespace with a single space
    text = re.sub(r'\s+', ' ', text).strip()

    # Returning `text`
    return text

In [None]:
# Function to normalize a collection of reviews and remove stop words
def pre_process_reviews(reviews):
    '''
    Cleans and normalizes each review, then removes stop words.

    Args:
        reviews (iterable of str):  The reviews to process.
    
    Returns:
        norm_reviews (list of str): The processed reviews, in the original order.
    '''
    # Setting stop words
    stop_words = set(stopwords.words('english'))

    # Declaring an empty list for processed reviews
    norm_reviews = []

    # Looping through each review
    for review in tqdm(reviews):
        # Removing URLs, mentions, and hashtags
        review = clean_text(review)
        # Converting newlines, tabs, and carriage returns to spaces
        review = review.translate(review.maketrans("\n\t\r", "   "))
        # Lowercasing
        review = review.lower()
        # Removing accents
        review = remove_accented_chars(review)
        # Removing special characters
        review = re.sub(r'[^a-zA-Z0-9\s]', '', review, flags=re.I|re.A)
        # Collapsing repeated spaces
        review = re.sub(' +', ' ', review)
        # Removing leading and trailing whitespace
        review = review.strip()

        # Tokenizing and dropping stop words
        review_tokens = word_tokenize(review)
        review = [w for w in review_tokens if w not in stop_words]
        review = ' '.join(review)

        # Appending to `norm_reviews`
        norm_reviews.append(review)
    
    # Returning `norm_reviews`
    return norm_reviews

In [None]:
# Function to tokenize a corpus
def tokenizer_function(review):
    '''
    Tokenizes the 'text' field of a record or batch of records.
    
    Args:
        review (dict):                      A record (or batch) with a 'text' field.
    
    Returns:
        tokenized_inputs (BatchEncoding):   The tokenized text as PyTorch tensors.
    '''
    # Extracting text
    text = review['text']

    # Tokenize text with truncation and padding
    tokenized_inputs = tokenizer(
        text,
        # Truncate to max_length from the right by default
        truncation=True,
        # Pad to the maximum length
        padding="max_length",
        # Maximum sequence length for BERT models
        max_length=512,
        # Assuming you are using PyTorch; change to 'np' if necessary
        return_tensors='pt'
    )

    # Returning `tokenized_inputs`
    return tokenized_inputs

In [None]:
# Load multiple metrics
accuracy_metric = load("accuracy")
precision_metric = load("precision")
recall_metric = load("recall")
f1_metric = load("f1")

# Function to compute the metrics of our model
def compute_metrics(pred):
    '''
    Computes metrics to assist in evaluating model performance.

    Args:
        pred (tuple):   A pair of arrays: the model's predicted logits and the true labels.
    
    Returns:
        return (dict):  A dictionary containing model performance metrics.
    
    Raises:
        TypeError:      If `pred` is not a tuple or if the elements of `pred` are not numpy arrays.
        ValueError:     If the shape of the predictions does not match the shape of the labels.
    '''
    # Raises
    if not isinstance(pred, tuple):
        raise TypeError("The `pred` argument must be a tuple.")
    if not isinstance(pred[0], np.ndarray) or not isinstance(pred[1], np.ndarray):
        raise TypeError("The elements of `pred` must be numpy arrays.")
    if pred[0].shape[0] != pred[1].shape[0]:
        raise ValueError("The number of predictions must match the number of labels.")
    
    # Unpacked tuple
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=1)

    # Metrics
    # Accuracy
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    # Precision
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    # Recall
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')
    # F1 score
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')

    # Returning a dictionary containing all metrics
    return {
        "accuracy": accuracy["accuracy"],
        "precision": precision["precision"],
        "recall": recall["recall"],
        "f1": f1["f1"],
    }

# **Modeling**

## Preparing a subset of data

Due to the imbalance of submitted user ratings, we decided to narrow our training data to reflect a balanced spread of "positive", "neutral", and "negative" reviews. To accomplish this, we selected an equal number of records with 5 or 4 star ratings, 3 star ratings, and 2 or 1 star ratings.
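The arithmetic behind this balance can be checked quickly. The sketch below restates the per-star sample sizes hard-coded in `sample_stars()` and the star-to-sentiment grouping described above; the dictionary names are illustrative only.

```python
from collections import Counter

# Per-star sample sizes, as hard-coded in `sample_stars()`
SAMPLES_PER_STAR = {1: 1000, 2: 1000, 3: 2000, 4: 1000, 5: 1000}

# Star-to-sentiment grouping described in the paragraph above
SENTIMENT_OF_STAR = {
    1: "negative", 2: "negative",
    3: "neutral",
    4: "positive", 5: "positive",
}

# Summing the sampled records per sentiment class
totals = Counter()
for star, n in SAMPLES_PER_STAR.items():
    totals[SENTIMENT_OF_STAR[star]] += n

print(dict(totals))  # {'negative': 2000, 'neutral': 2000, 'positive': 2000}
```

Sampling twice as many 3-star reviews as any other single rating is what makes the three sentiment classes equal in size, for 6,000 records in total.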

In [None]:
# Declaring sample subsets of data
sample_5 = sample_stars(data_df,5)
sample_4 = sample_stars(data_df,4)
sample_3 = sample_stars(data_df,3)
sample_2 = sample_stars(data_df,2)
sample_1 = sample_stars(data_df,1)

### Concatenating data samples

In [None]:
# Concatenating sample sets into a single DataFrame
sample_data_df = pd.concat(
    [
        sample_1,
        sample_2,
        sample_3,
        sample_4,
        sample_5
    ], axis=0, ignore_index=True
)

# Confirming total records in `sample_data_df`
sample_data_df.shape

In [None]:
# Confirming appropriate quantities of each star values were pulled
sample_data_df['stars'].value_counts()

### Manual encoding

Our goal was to break down reviews into "positive", "neutral", or "negative" sentiments. As such, we manually encoded the value of `stars` as follows:

| **Stars** | **Label** | **Sentiment Meaning** |
| :--- | :--- | :--- |
| **5** | **2** | "Positive" |
| **4** | **2** | "Positive" |
| **3** | **1** | "Neutral" |
| **2** | **0** | "Negative" |
| **1** | **0** | "Negative" |

In [None]:
# Manually encoding `stars` value to create sentiment labels
sample_data_df['stars']= sample_data_df['stars'].replace(to_replace=[1,2], value=0)
sample_data_df['stars']= sample_data_df['stars'].replace(to_replace=3, value=1)
sample_data_df['stars']= sample_data_df['stars'].replace(to_replace=[4,5], value=2)

In [None]:
# Confirming equal label class counts
sample_data_df['stars'].value_counts()

#### Na count

In [None]:
# Verifying missing records
sample_data_df.isna().sum()

*Note: Since this dataset is only for training the sentiment analysis model, NA values in fields other than* `stars` *and* `review` *are irrelevant to model training. As such, no records needed to be dropped.*
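To make this check explicit, the NA count can be restricted to the two columns the model actually uses. The sketch below runs on a small made-up frame rather than `sample_data_df`; the `address` column stands in for any unused field.

```python
import pandas as pd

# Toy stand-in for `sample_data_df` (values are made up)
toy_df = pd.DataFrame({
    "stars": [0, 1, 2],
    "review": ["bad", "okay", "great"],
    "address": [None, "1 Main St", None],  # NA here does not affect training
})

# Counting missing values only in the columns used for training
model_columns = ["stars", "review"]
na_in_model_columns = toy_df[model_columns].isna().sum()

print(na_in_model_columns)
```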

#### Renaming features:
- **review** to **text**
- **stars** to **label**

In [None]:
# Renaming features
sample_data_df.rename(columns={'review':'text','stars':'label'},inplace = True)

### Train Test Split

Our sample set was prepared for training with `train_test_split()` set to a `random_state=` of **42**, because life, the universe, and everything.

In [None]:
# Declaring `X` as features
X = sample_data_df['text']
# Declaring `y` as target
y = sample_data_df['label']

# Splitting the data into training and testing sets with a `test_size` of 0.3
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state=42)
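To show what the fixed `random_state` buys us, here is a conceptual sketch of a seeded split written with the standard library only. It is not how scikit-learn implements `train_test_split()`, and `seeded_split` is a hypothetical helper; the point is simply that the same seed reproduces the same split.

```python
import random

def seeded_split(items, test_size=0.3, seed=42):
    '''Shuffles a copy of `items` with a fixed seed, then cuts it into train/test.'''
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up stand-ins for review texts
reviews = [f'review {i}' for i in range(10)]
train_a, test_a = seeded_split(reviews)
train_b, test_b = seeded_split(reviews)

# Same seed, same split
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```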

## Building the pipeline

Our approach to sentiment analysis centered around utilizing a BERT-based Transformer. As such, steps were taken to establish, then tune the hyperparameters of, our pipeline.

### Preprocessing

Once split, each dataset was passed through our established preprocessing functions to prepare for model training.

In [None]:
# Preprocessing training reviews
norm_train_reviews = pre_process_reviews(X_train)

# Preprocessing testing reviews
norm_test_reviews = pre_process_reviews(X_test)

In [None]:
# Formatting training target data as a Dataset
train_dataset = Dataset.from_dict({'label':y_train.to_list(),'text':norm_train_reviews})
# Formatting testing target data as a Dataset
test_dataset = Dataset.from_dict({'label':y_test.to_list(),'text':norm_test_reviews})

### Generating the model

Each step below brought us one step closer to building our model, now lovingly referred to as `roberto`.

#### Building the model

Here we set `roberto`'s initial definition.

In [None]:
#Pretrained model
model_checkpoint = 'distilbert-base-uncased'

#Defining label classes
id_to_label = {0:'Negative', 1:'Neutral', 2:'Positive'}
label_to_id = {'Negative':0, 'Neutral': 1,'Positive':2}

#Model definition
model = DistilBertForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 3,
    id2label = id_to_label,
    label2id =label_to_id 
)
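Since the two label dictionaries must stay exact inverses of each other, one option is to derive the second from the first. This is a small sketch of that idea, not a change the notebook depends on.

```python
# Single source of truth for the label names
id_to_label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

# Derived mapping, so the two dictionaries cannot drift apart
label_to_id = {label: idx for idx, label in id_to_label.items()}

# The round trip should return every id unchanged
assert all(label_to_id[id_to_label[idx]] == idx for idx in id_to_label)
print(label_to_id)
```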

#### Tokenizing

And here we prepared the necessary steps for `roberto`'s tokenization.

In [None]:
#Model tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)
#Padding token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# Mapping preprocessed datasets to tokenized data
tokenized_train_dataset = train_dataset.map(tokenizer_function, batched=True)
# The evaluation set is built from `test_dataset`, not the training split
tokenized_test_dataset = test_dataset.map(tokenizer_function, batched=True)

#### Collator

`roberto` needed a collator, too.

In [None]:
#Dynamic padding
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
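For anyone curious what "dynamic padding" means in practice, the sketch below pads each batch only to the length of its own longest sequence. It is a conceptual illustration in plain Python, not the implementation of `DataCollatorWithPadding`; `pad_batch`, the token ids, and the padding id of `0` are assumptions for the example.

```python
def pad_batch(batch, pad_id=0):
    '''Pads every sequence in `batch` to the length of the longest one.'''
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Made-up token ids for two batches of different lengths
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 2307, 2833, 1998, 2326, 102], [101, 102]])

# Each batch is only as wide as its own longest review
print(len(short_batch['input_ids'][0]), len(long_batch['input_ids'][0]))
```

Padding per batch rather than to one global maximum means short reviews do not pay for the longest review in the dataset.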

#### Training arguments

While not argumentative by nature, `roberto` needed controllable hyperparameters to tune through the training process. Here we did that.

In [None]:
## Training argument hyperparameters, which will be used to train the model

#Output directory
output_dir = 'model_sentiment'
#Learning Rate
lr = 2e-5
#Batch size
batch_size = 32
#Epochs
EPOCHS = 3

#Training arguments
training_args = TrainingArguments(
    output_dir = output_dir,
    learning_rate = lr,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size*2,
    num_train_epochs = EPOCHS,
    weight_decay = 0.01,
    save_strategy = 'epoch',
    evaluation_strategy = 'epoch',
    logging_steps = 10,
    load_best_model_at_end = True,
    # Enable mixed precision
    fp16=True
)
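As a rough guide to what these settings imply, the arithmetic below estimates the number of optimizer steps. It assumes a single device, no gradient accumulation, and a training split of 4,200 records (70% of the 6,000 record sample described later in this notebook); `training_steps` is a hypothetical helper.

```python
import math

def training_steps(n_train, batch_size, epochs):
    '''Optimizer steps per epoch and in total, for one device and no gradient accumulation.'''
    per_epoch = math.ceil(n_train / batch_size)
    return per_epoch, per_epoch * epochs

# Assumed training split size: 70% of 6,000 records
n_train = 4200
per_epoch, total = training_steps(n_train, batch_size=32, epochs=3)

print(per_epoch, total)
```

With `logging_steps = 10`, a run of that length would log a few dozen times, which is enough to watch the loss trend without flooding the output.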

#### Trainer

As we were trained to do, we established a proper trainer for `roberto` below.

In [None]:
#Trainer instantiation
trainer = Trainer(
                  model = model,
                  args = training_args,
                  train_dataset = tokenized_train_dataset,
                  eval_dataset = tokenized_test_dataset,
                  tokenizer = tokenizer,
                  compute_metrics = compute_metrics,
                  data_collator = data_collator
)

#### Training

It all came down to this. `roberto`'s moment to make us proud!

*Note: We commented this next cell out seeing as it took six (6) hours to train* `roberto`.

In [None]:
# Fine-tuning the model with sample set of balanced data
# (commented out to prevent re-training)

# trained_model_results = trainer.train()

`roberto`... did not make us proud at first...

#### Training and Hypertuning

Here are the initial and final accuracy values during `roberto`'s development:\
Initial fine-tuning of the pretrained model yielded accuracy values of ~45%.\
The final training run yielded accuracy values of ~82%.

Steps taken to improve accuracy (in something resembling the order of application):
* Changed pre-trained model from `distilbert-base-uncased` to `MarieAngeA13/Sentiment-Analysis-BERT`
* Adjusted sample sizes of data <br> (from ~100 records total to a balanced sample set of 1000 with equal representation for all ratings) <br> (it would be another iteration before that sample set would be a balanced representation of the *labels*, though)
* Updates to text cleaning to include more web-present syntax <br> (e.g., mentions, multiple spaces, hashtags, and web address elements) <br> (because reviews aren't literary works, typically)
* Adjusted syntax and arguments of tokenizer function and the application of it
* Adjusted training arguments to better align with our BERT-based model <br> (*spoiler: this gets undone pretty soon after*)
* Added additional metrics for better understanding of necessary optimization
* Increased sample data size, again, and removed subset step entirely <br> (started at 10,000 only to then decrease that sample size to 600 because of time, but it was still a larger sample than where it started)
* Adjusted batch size and epochs <br> (twice)
* Moved back to `distilbert-base-uncased` and adjusted the tokenizer, learning rate, logging steps, and similar hyperparameters accordingly <br> (because sometimes less Bert is better Bert)
* Bargained with Eldritch beings in the hopes of a single soul buying even just a 10% boost to accuracy <br> (which is to say the sample size was changed to 3,000) <br> (also added an evaluation step to get a better idea of performance)
* Exchanged soul because the deal was pretty tempting <br> (3,000 records had an accuracy of ~78%, so set the model to train overnight with 6,000 records in the hopes of an above 80% result)

In the end, though? Yeah. `roberto` made us ***VERY*** proud!

Copied output from final evaluation:

> {'eval_loss': 0.4815390408039093, 'eval_accuracy': 0.8169047619047619, 'eval_precision': 0.8175331785953032, 'eval_recall': 0.8169047619047619, 'eval_f1': 0.8171466024398222, 'eval_runtime': 1743.5234, 'eval_samples_per_second': 2.409, 'eval_steps_per_second': 0.038, 'epoch': 3.0}
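A quick sanity check on the copied output: multiplying throughput by runtime gives an approximate count of evaluated samples, and multiplying that by accuracy gives an approximate count of correct predictions. The sketch below uses only the numbers shown above; the variable names `n_evaluated` and `n_correct` are ours.

```python
# Values copied from the evaluation output above
eval_accuracy = 0.8169047619047619
eval_runtime = 1743.5234
eval_samples_per_second = 2.409

# Throughput multiplied by runtime approximates the number of evaluated samples
n_evaluated = round(eval_runtime * eval_samples_per_second)

# Accuracy multiplied by the sample count approximates the number of correct predictions
n_correct = round(eval_accuracy * n_evaluated)

print(n_evaluated, n_correct)
```

The sample count works out to roughly 4,200, which equals a 70% share of 6,000 records, so it is worth confirming which split the trainer evaluated on before comparing this accuracy with other runs.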


#### Evaluation

How we tested `roberto`'s performance. Don't worry, we saved the best results just above here.

*Note: We commented this out because there are no model metrics to review if the training cell is not run.*

In [None]:
# Evaluate the model (commented out due to trainer already being trained)
# evaluation_metrics = trainer.evaluate()

# Print the final score (commented out due to trainer already being trained)
# print(evaluation_metrics)

#### Saving Model & Tokenizer

Once we had an adequately trained `roberto`, we saved our model off for later fetching.

*Note: We commented this out because there is no model to save if the training cell is not run.*

In [None]:
# Saving the model and tokenizer
# (commented out to prevent overwriting; fetching handled through `gdown` and `zipfile`)
# trainer.save_model(model_path)

# tokenizer.save_pretrained(tokenizer_path)

#### Fetching and unzipping model

So... Long story short, training a model on six-thousand (**6,000**) records makes for a sizable directory. Much as we circumvented Git's file size push restrictions with the datasets, we utilized `gdown` to fetch our model from where it is stored on Google Drive.

*Note: The cell below is only necessary to be run during this notebook's first execution. It may be commented out for any subsequent execution.*

In [None]:
# Fetching model through `gdown` (uncomment for first run of code)
# url = 'https://drive.google.com/file/d/1tzYRkjv3wWpfg21pJ02SEYNXEcj-TVH3/view?usp=sharing'
# output = 'Resources/Sentiment_Analysis.zip'
# Download model (uncomment for first run of code)
# gdown.download(url, output, fuzzy=True, quiet=False)

# Extracting model (uncomment for first run of code)
# with zipfile.ZipFile(output, 'r') as zip_ref:
#     zip_ref.extractall('./Sentiment_Analysis')

#### Paths

Once saved or fetched, pathing to `roberto` is necessary for later use. Here we set our paths.

*Note: If you unzip a file, it puts itself in a folder named for its zip, so... Yeah, the redundant filepath is a byproduct of default zip behavior.*

In [None]:
# Declaring the model path
model_path = 'Sentiment_Analysis/Sentiment_Analysis/model'

# Declaring the tokenizer path
tokenizer_path =  'Sentiment_Analysis/Sentiment_Analysis/tokenizer'
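The doubled folder name is easy to reproduce. The sketch below builds a tiny stand-in archive in a temporary directory and extracts it the same way; it assumes the real archive stores its files under a top-level `Sentiment_Analysis/` folder, which is our reading of the paths above rather than something verified against the actual zip.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / 'Sentiment_Analysis.zip'

    # Building a tiny stand-in archive whose entries sit under a top-level folder
    with zipfile.ZipFile(archive, 'w') as zip_ref:
        zip_ref.writestr('Sentiment_Analysis/model/config.json', '{}')

    # Extracting into a folder that carries the same name as the archive
    target = tmp / 'Sentiment_Analysis'
    with zipfile.ZipFile(archive, 'r') as zip_ref:
        zip_ref.extractall(target)

    # The archive's own top-level folder lands inside the extraction folder
    nested = target / 'Sentiment_Analysis' / 'model' / 'config.json'
    nested_exists = nested.exists()

print(nested_exists)
```

Extracting to the parent directory instead would avoid the repetition, at the cost of changing the paths declared above.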

### The end result

Finally complete, `roberto` comes together with all the necessary pieces of its pipeline in place. **Behold!** `roberto`!

In [None]:
#Loading Model
model = DistilBertForSequenceClassification.from_pretrained(model_path)
#Loading Tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_path)

# Because of his dedication, and our exhaustion (which led one of us, Vanessa,
# to mispronounce his name), we named our pipeline `roberto` in honor of our
# TA, Alberto Aigner
roberto = pipeline('sentiment-analysis',model=model,tokenizer=tokenizer)

# Functions (Pt 2)

While trained on Yelp! data, and developed for Google Reviews, the goal of the application is to be as universally applicable to business reviews as possible, regardless of the source. The following functions were developed with their annotated purposes in mind:

| **Function** | **Notes** |
| :--- | :---|
| `apply_roberto()` | Generates sentiment analysis for reviews in a given dataset, and a confidence in that sentiment |
| `business_names_list()` | Generates a list of unique business names from a given dataset |
| `reviews_list()` | Generates a list of all reviews submitted to a business for all its locations |
| `general_sentiment()` | Classifies the general sentiment for a business' reviews and provides a mean confidence in that sentiment <br> *Note: To be run after a DataFrame has been passed through* `apply_roberto()` |
| `get_business_overview()` | Extracts business overview details from a webpage using BeautifulSoup and Selenium. |
| `read_csv_with_error_handling()` | Reads a CSV file into a pandas DataFrame with error handling for common file-related issues. |
| `get_review_summary()` | Gathers and summarizes review data from a set of review elements parsed from HTML. |
| `web_Scraper()` | Scrapes Google Maps via web driver and gathers business information and reviews for the list of businesses in the imported file. |

Functions outlined in more detail below require a DataFrame with the following features:

| **Feature** | **Notes** |
| :--- | :--- |
| `bus_name_col` | A text column with the name of a business |
| `bus_add` | A text column with the street address of a business' location |
| `rev_col` | A text column with available reviews |
| `sent_lbl` | A text column with the generated sentiment classification <br> *Note: Generated through* `apply_roberto()` |
| `sent_scr` | A numeric column with the model's confidence score for the generated sentiment <br> *Note: Generated through* `apply_roberto()` |

In [None]:
# Function to apply `roberto` to any DF
def apply_roberto(df,rev_col):
    '''
    Applies the `roberto` model to generate sentiment analysis for the reviews in a DataFrame.

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        rev_col (str):      A string with the feature name that contains the review text.
    
    Returns:
        df (DataFrame):     The same DataFrame with the appended sentiments and confidence scores.
    
    Raises:
        KeyError:  If `rev_col` is not a valid column name in the DataFrame.
        TypeError: If `df` is not a DataFrame or if `rev_col` is not a string.
    '''
    #Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    if not isinstance(rev_col, str):
        raise TypeError('The `rev_col` parameter must be passed as a string.')
    if rev_col not in df.columns:
        raise KeyError(f"Column '{rev_col}' not found in DataFrame.")
    
    # Initializing features for results
    df['sent_label'] = ''
    df['sent_score'] = 0.0

    # Iterating through `df`
    for index,row in df.iterrows():
        # Setting review text as `text`
        text = row[rev_col]
        # Generating results for a given review
        result = roberto(text, truncation=True)[0]
        # Appending the sentiment label
        df.at[index, 'sent_label'] = result['label']
        #Appending the sentiment score
        df.at[index, 'sent_score'] = result['score']
    
    # Returning `df`
    return df
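Row-by-row iteration is the simplest thing that works, but text-classification pipelines also accept a list of strings and return one result per input. The sketch below shows that shape with a stand-in classifier so it runs without the model; `fake_roberto` and `apply_classifier` are hypothetical names, and the keyword rule inside the stand-in is made up purely for the demo.

```python
import pandas as pd

def fake_roberto(texts, truncation=True):
    '''Hypothetical stand-in that mimics the list-of-dicts output of a text-classification pipeline.'''
    return [
        {'label': 'Positive' if 'good' in text.lower() else 'Negative', 'score': 0.9}
        for text in texts
    ]

def apply_classifier(df, rev_col, classifier):
    '''Scores every review in one call and attaches the results as two new columns.'''
    results = classifier(df[rev_col].tolist(), truncation=True)
    # Working on a copy so the caller's DataFrame is left untouched
    out = df.copy()
    out['sent_label'] = [result['label'] for result in results]
    out['sent_score'] = [result['score'] for result in results]
    return out

# Made-up reviews
toy_df = pd.DataFrame({'review': ['Good food', 'Slow service']})
scored_df = apply_classifier(toy_df, 'review', fake_roberto)
print(scored_df[['sent_label', 'sent_score']])
```

Passing the classifier in as an argument also makes the function easy to test without loading the real model.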

In [None]:
# Function to retrieve unique business names
def business_names_list(df, bus_name_col):
    '''
    Places unique names from a list of businesses into a list.

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        bus_name_col (str): A string with the feature name that contains the business name.

    Returns:
        names (list):       A list of strings with only unique values.

    Raises:
        KeyError:           If `bus_name_col` is not a valid column name in the DataFrame.
        TypeError:          If `df` is not a DataFrame or if `bus_name_col` is not a string.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    if not isinstance(bus_name_col, str):
        raise TypeError('The `bus_name_col` parameter must be passed as a string.')
    if bus_name_col not in df.columns:
        raise KeyError(f"Column '{bus_name_col}' not found in DataFrame.")

    # Generating a list of business names
    names = df[bus_name_col].unique().tolist()
    
    # Returning the list
    return names

In [None]:
# Function to retrieve all reviews
def reviews_list(df, bus_name_col, bus_name, bus_add, rev_col):
    '''
    Places all reviews for a given business into a list, attributing each review to its specific location.

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        bus_name_col (str): A string with the feature name that contains the business name.
        bus_name (str):     A string of a specific business' name for which to map the locations.
        bus_add (str):      A string with the feature name that contains the business street address.
        rev_col (str):      A string with the feature name that contains the review text.

    Returns:
        reviews (list):     A list of strings with all the reviews for a given business.

    Raises:
        KeyError:           If any passed str is not a valid column name in the DataFrame, or if `bus_name` is not a value in `bus_name_col`.
        TypeError:          If `df` is not a DataFrame, if any feature is not a string, or if `bus_name` is not a string.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    for param, name in zip(
        [bus_name_col, bus_add, rev_col],
        ['bus_name_col', 'bus_add', 'rev_col']
    ):
        if not isinstance(param, str):
            raise TypeError(f"The '{name}' parameter must be passed as a string.")
        if param not in df.columns:
            raise KeyError(f"Column '{param}' not found in DataFrame.")
    if not isinstance(bus_name, str):
        raise TypeError('The `bus_name` parameter must be passed as a string.')
    if bus_name not in df[bus_name_col].values:
        raise KeyError(f"Value '{bus_name}' not found in column '{bus_name_col}'.")
    
    # Filtering `df`
    filtered_df = df[[bus_add, rev_col]][df[bus_name_col] == bus_name].copy()

    # Handling missing or empty reviews
    filtered_df[rev_col] = filtered_df[rev_col].fillna('No review provided.')

    # Creating a list of all reviews
    reviews = filtered_df[bus_add] + ':\n' + filtered_df[rev_col] + '\n\n'

    # Converting to a list
    reviews = reviews.to_list()

    # Returning reviews
    return reviews

In [None]:
# Function to generalize the overall sentiment
def general_sentiment(df, bus_name_col, bus_name, sent_lbl, sent_scr):
    '''
    Compares the total positive, negative, and neutral reviews to classify an overall sentiment.

    Note:
        To be run after passing a DataFrame through `apply_roberto()`.

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        bus_name_col (str): A string with the feature name that contains the business name.
        bus_name (str):     A string of a specific business' name for which to map the locations.
        sent_lbl (str):     A string with the feature name that contains the modeled sentiment label.
        sent_scr (str):     A string with the feature name that contains the modeled sentiment confidence.

    Returns:
        gen_sent (str):     A string with the overall sentiment, and the model's mean confidence in that classification.
    
    Raises:
        KeyError:           If any passed str is not a valid column name in the DataFrame, or if `bus_name` is not a value in `bus_name_col`.
        TypeError:          If `df` is not a DataFrame, if any feature is not a string, or if `bus_name` is not a string.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    for param, name in zip(
        [bus_name_col, sent_lbl, sent_scr],
        ['bus_name_col', 'sent_lbl', 'sent_scr']
    ):
        if not isinstance(param, str):
            raise TypeError(f"The '{name}' parameter must be passed as a string.")
        if param not in df.columns:
            raise KeyError(f"Column '{param}' not found in DataFrame.")
    if not isinstance(bus_name, str):
        raise TypeError('The `bus_name` parameter must be passed as a string.')
    if bus_name not in df[bus_name_col].values:
        raise KeyError(f"Value '{bus_name}' not found in column '{bus_name_col}'.")
    
    # Filtering `df`
    filtered_df = df[[sent_lbl, sent_scr]][df[bus_name_col] == bus_name].copy()

    # Converting `sentiment` to lower case
    filtered_df[sent_lbl] = filtered_df[sent_lbl].str.lower()

    # Calculating total `positive` sentiment
    pos = filtered_df.loc[filtered_df[sent_lbl] == 'positive'].shape[0]
    # Calculating total `neutral` sentiment
    ntrl = filtered_df.loc[filtered_df[sent_lbl] == 'neutral'].shape[0]
    # Calculating total `negative` sentiment
    neg = filtered_df.loc[filtered_df[sent_lbl] == 'negative'].shape[0]

    # Match case to generate general sentiment
    # NOTE: Cases are checked in order, so the narrower conditions come first;
    #       otherwise the broad `p + n > ng` check would shadow the neutral cases
    match (pos, ntrl, neg):
        # All three counts are equal
        case (p, n, ng) if p == n == ng:
            sent = 'perfectly neutral'
        # Neutral reviews strictly outnumber both other labels
        case (p, n, ng) if p < n > ng:
            sent = 'generally neutral'
        case (p, n, ng) if p > n > ng:
            sent = 'highly positive'
        case (p, n, ng) if p > n + ng:
            sent = 'strongly positive'
        case (p, n, ng) if ng > n > p:
            sent = 'highly negative'
        case (p, n, ng) if ng > p + n:
            sent = 'strongly negative'
        case (p, n, ng) if p + n > ng:
            sent = 'moderately positive'
        # Mirror of the 'moderately positive' check
        case (p, n, ng) if n + ng > p:
            sent = 'moderately negative'
        case _:
            sent = 'undetermined'
    
    # Calculate the mean confidence
    conf = filtered_df[sent_scr].mean() * 100

    # Generating the final sentiment
    if pos + ntrl + neg != 0:
        # Concatenating sentiment and confidence
        gen_sent = f'The general sentiment is {sent}, with an average confidence of {conf:.1f}%.'
    else:
        # When no sentiment available due to no reviews
        gen_sent = 'Cannot confirm sentiment due to a lack of reviews.'

    # Returning sentiment
    return gen_sent

In [None]:
# define function to scrape each business info and reviews
def get_business_overview(driver,lat,long,i):
    """
    Extracts business overview details from a webpage using BeautifulSoup and Selenium.

    Args:
        driver (selenium.webdriver): The Selenium WebDriver instance.
        lat (list): List of latitude values.
        long (list): List of longitude values.
        i (int): Index to access latitude and longitude.

    Returns:
        dataframe: A dataframe containing business overview details or an error message.
    """
    try:
        # Get page source and parse it with BeautifulSoup
        response = BeautifulSoup(driver.page_source, 'html.parser')
        
        # Extract business details
        business_name = response.find('h1', class_='DUwDvf lfPIob')
        avg_rating = response.find('div', class_='fontDisplayLarge')
        address = response.find('div', class_='rogA2c')
        

        # Check if elements are found and extract text
        business_name_text = business_name.text if business_name else "Not available"
        avg_rating_text = avg_rating.text if avg_rating else "Not available"
        address_text = address.text if address else "Not available"
        
        # Get latitude and longitude
        lat_value = lat[i] if i < len(lat) else "Not available"
        long_value = long[i] if i < len(long) else "Not available"
        
        # # parse out bus_add and bus_city from address_text
        # address_list = address_text.split(',')
        # bus_add = address_list[0]
        # bus_city = address_list[1]
        
        #create dictionary with business attributes
        business_dic = {
                'bus_id': business_name_text,
                'avg_rating': avg_rating_text,
                'bus_add': address_text,
                'lat': lat_value,
                'lon': long_value
                }
        
        # generate dataframe
        business_df = pd.DataFrame([business_dic])
        return business_df

    except IndexError as e:
        print(f"Error: Index out of range - {e}")
        return {'error': 'Index out of range'}
    except NoSuchElementException as e:
        print(f"Error: Element not found - {e}")
        return {'error': 'Element not found'}
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return {'error': 'Unexpected error'}

In [None]:
# error handling file_path function 
def read_csv_with_error_handling(file_path):
    """
    Reads a CSV file into a pandas DataFrame with error handling for common file-related issues.

    This function attempts to read a CSV file from the specified `file_path` into a pandas DataFrame.
    It includes error handling for common issues such as file not found, empty files, parsing errors,
    and permission errors. If an error occurs, it prints an appropriate error message.

    Args:
        file_path (str): The path to the CSV file to be read. It should be a valid file path string.

    Returns:
        pandas.DataFrame: The DataFrame created from the CSV file, or `None` if an error occurs.
            - If the file is successfully read, the DataFrame is returned.
            - If an error occurs, the function prints an error message and returns `None`.

    Raises:
        None: The function handles exceptions internally and does not raise them further.

    """
    try:
        # Attempt to read the CSV file
        url_df = pd.read_csv(file_path)
        return url_df
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except pd.errors.EmptyDataError:
        print("Error: The file is empty.")
    except pd.errors.ParserError:
        print("Error: The file could not be parsed.")
    except PermissionError:
        print(f"Error: Permission denied when trying to read '{file_path}'.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

In [None]:
# define function to gather relevant data from the reviews result set obtained by parsing through HTML
def get_review_summary(result_set):

    """
    Gathers and summarizes review data from a set of review elements parsed from HTML.

    This function extracts review text and ratings from a parsed HTML result set.
    It compiles this information into a pandas DataFrame. If any issues are encountered during
    extraction, such as missing elements, appropriate error handling is performed.

    Args:
        result_set (iterable): An iterable of BeautifulSoup elements containing review data.
            Each element should represent an individual review with a `span` for review text
            and an element with a class of 'kvMYJc' for the rating.

    Returns:
        pandas.DataFrame: A DataFrame with two columns:
            - 'review': A list of review texts.
            - 'rating': A list of review ratings (typically a single character representing the rating).
        
        dict: If an exception is caught during processing, a dictionary with an error message is returned.
            - For missing elements: {'error': 'Element not found'}
            - For unexpected errors: {'error': 'Unexpected error'}

    Raises:
        None: Exceptions are handled internally. BeautifulSoup returns `None` for a missing
            element, so a missing review text or rating surfaces as an `AttributeError` or
            `TypeError` and is caught by the generic handler.
    """
        
    # create empty review dictionary to add each review
    rev_dict = {
        'review' : [],
        'rating' : []
    }

    # scrape the necessary review elements such as review text and rating
    for result in result_set:
        try:
            review_text = result.find('span',class_='wiI7pd').text
            review_rating = result.find(class_='kvMYJc')['aria-label']
            review_rating = review_rating[0]
            rev_dict['review'].append(review_text)
            rev_dict['rating'].append(review_rating)
        
        # exception handling
        except NoSuchElementException as e:
            print(f"Error: Element not found - {e}")
            return {'error': 'Element not found'}
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return {'error': 'Unexpected error'}

    reviews_summary = pd.DataFrame(rev_dict)
    return reviews_summary
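Taking the first character of the `aria-label` works as long as the label starts with a single digit. A slightly more defensive variant, sketched below, pulls the leading number with a regular expression and returns `None` when nothing matches. The label format (`"5 stars"`) is an assumption about the scraped page, and `parse_star_rating` is a hypothetical helper.

```python
import re

def parse_star_rating(aria_label):
    '''Returns the leading number in a label such as "5 stars", or None if there is none.'''
    if not aria_label:
        return None
    match = re.match(r'\s*(\d+(?:\.\d+)?)', aria_label)
    return float(match.group(1)) if match else None

# Assumed label formats, for illustration only
print(parse_star_rating('5 stars'))
print(parse_star_rating('4.5 stars'))
print(parse_star_rating(None))
```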

In [None]:
# define web scraper function
def web_Scraper(file_path):

    '''
    Scrapes Google Maps via web driver and gathers business information and reviews for the list of businesses in the imported file.

    Note:
        The following functions must be defined before this one, since it calls them:
        read_csv_with_error_handling
        get_business_overview 
        get_review_summary

    Args:

        file_path (str):    A string of the file path that points to the business URL CSV.

    Returns:
        df_list (list):     A list of dataframes each containing the business info and reviews data. Each dataframe represents a location.
    
    '''

    #run csv file with error handling function
    url_df = read_csv_with_error_handling(file_path)

    # keep only the first 11 businesses from the business_urls file for performance purposes
    url_df = url_df.head(11)

    # append the Google maps business urls and their corresponding lat and long coordinates to their individual lists
    url = url_df['url'].tolist()
    lat = url_df['lat'].astype(str).tolist()
    long = url_df['long'].astype(str).tolist()

    # initiate driver
    driver = webdriver.Chrome(service = ChromeService(ChromeDriverManager().install()))

    # create for loop to parse through the each business location in the url list above
    c = 0
    df_list = []

    # loop for the length of the url list
    for i in range(0,len(url)):
        c += 1
        driver.get(url[i])
        time.sleep(5)
    
        try:
            # execute get_business_overview function to get main business info
            business_df = get_business_overview(driver, lat, long, i)
            
            # try:
            # navigate to Reviews tab
            driver.find_element(By.CLASS_NAME, "RWPxGd").click()
            time.sleep(3)

            # Find the total number of reviews
            total_number_of_reviews = driver.find_element('xpath','//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[2]/div/div[2]/div[3]').text.split(" ")[0]
            total_number_of_reviews = int(total_number_of_reviews.replace(',','')) if ',' in total_number_of_reviews else int(total_number_of_reviews)

            # scrape only the first 50 reviews for efficiency; remove this override to use the total parsed above and extract ALL reviews
            total_number_of_reviews = 50

            #Find scroll layout
            scrollable_div = driver.find_element('xpath','//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]')

            #Scroll as many times as necessary to load all reviews - 10 reviews shown at a time
            for _ in range(round(total_number_of_reviews/10 - 1)):
                driver.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', scrollable_div)
                time.sleep(1)

            ## parse HTML and Data Extraction
            # loop over the number of reviews 
            next_item = driver.find_elements('xpath','//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[9]/div[1]/div/div')
            time.sleep(3)

            #expand review by click on 'more' button
            for item in next_item:
                buttons = item.find_elements(By.TAG_NAME,'button')
                for m in buttons:
                    if m.text == "More":
                        m.click()
                time.sleep(5)

            # parse through the HTML 
            response = BeautifulSoup(driver.page_source, 'html.parser')
            reviews = response.find_all('div',class_ = 'jftiEf')

            reviews_summary_df = get_review_summary(reviews)

            #repeat business_df rows to then be concatenated with reviews summary df
            bus_df_repeated = pd.concat([business_df] * len(reviews_summary_df), ignore_index=True)

            #combine business summary and reviews summary df
            business_summary_df = pd.concat([bus_df_repeated,reviews_summary_df],axis=1, ignore_index=True)
            
            # set column names
            business_summary_df.columns = ['bus_id','avg_rating','bus_add','lat','lon','review','rating']
            
            #append each business_summary_df to df_list
            df_list.append(business_summary_df)

            #concat list of dfs to create master df with all locations
            spooder_df = pd.concat(df_list, ignore_index=True)
        
        except NoSuchElementException as e:
            print(f"Error: Element not found - {e}")
            return {'error': 'Element not found'}
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return {'error': 'Unexpected error'}

        
    return spooder_df    
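Two small pieces of arithmetic inside `web_Scraper()` can be checked in isolation: turning the scraped review count text into an integer, and working out how many scroll passes are needed. The sketch below assumes the count text looks like `"1,234 reviews"` and that ten reviews load per scroll, as the comments in the function state; both helper names are hypothetical.

```python
import math

def parse_review_count(text):
    '''Turns text such as "1,234 reviews" into an integer.'''
    first_token = text.split(' ')[0]
    # `replace` is a no-op when there is no comma, so no conditional is needed
    return int(first_token.replace(',', ''))

def scroll_passes(n_reviews, per_scroll=10):
    '''Scrolls needed after the first screen, assuming `per_scroll` reviews load each time.'''
    return max(math.ceil(n_reviews / per_scroll) - 1, 0)

print(parse_review_count('1,234 reviews'))
print(scroll_passes(50))
```

Using `math.ceil` rather than `round` means a count such as 55 still gets the extra pass needed to reach the last few reviews.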


#### Sample set trial

After `roberto` was developed, we used the same dataset that it trained on to test its application. Knowing that the initial distribution of `stars` (or `label` in the case of this set) was controlled, we explored other elements of the sample set to see what `roberto` trained on.

In [None]:
# Reading in the previously saved sample set
# NOTE: This set was saved off following the finalization of `roberto`
sample_set_df = pd.read_csv('Resources/sample_set.csv')

In [None]:
# Sentiment Distribution

# Building the bar plot
plt.figure(figsize=(10, 6))
sns.countplot(data=sample_set_df, x='sent_label', palette='viridis')

# Setting plot features
plt.title('Distribution of Sentiment Labels')
plt.xlabel('Sentiment Label')
plt.ylabel('Count')

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Distribution_of_Sentiment_Labels.png')

# Displaying the figure
plt.show()

*Note: This makes sense, since an even distribution of ratings was selected to create a balanced training dataset.*

In [None]:
# Review Count Distribution

# Building the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=sample_set_df, x='review_count', bins=20, kde=True, color='skyblue')

# Setting plot features
plt.title('Distribution of Review Counts')
plt.xlabel('Review Count')
plt.ylabel('Number of Businesses')

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Distribution_of_Review_Counts.png')

# Displaying the figure
plt.show()

*Note: This illustrates that few businesses receive the highest numbers of reviews, which emphasizes the importance of harnessing the reviews that are available.*

In [None]:
# Sentiment Score Distribution

# Building the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=sample_set_df, x='sent_score', bins=20, kde=True, color='lightgreen')

# Setting plot features
plt.title('Distribution of Sentiment Scores\n(Model Confidence)')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')

# Saving the figure (commented out after initial save)
# plt.savefig('Images/Distribution_of_Sentiment_Scores.png')

# Displaying the figure
plt.show()

*Note: This was to be expected. We knew* `roberto` *felt less confident with "neutral" and "negative" reviews, which comprised two-thirds (2/3) of our training dataset.*
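
One way to check this note against the data is to average `sent_score` within each `sent_label`. The sketch below is dependency-free and uses made-up rows purely for illustration (the label names and scores are assumptions, not values from `sample_set.csv`). On the real set, the pandas equivalent would be `sample_set_df.groupby('sent_label')['sent_score'].mean()`.

```python
from collections import defaultdict
from statistics import mean

# Illustrative (label, confidence) pairs (NOT taken from the real sample set)
toy_rows = [
    ('positive', 0.95), ('positive', 0.90),
    ('neutral', 0.60), ('neutral', 0.70),
    ('negative', 0.80), ('negative', 0.70),
]

def mean_score_by_label(rows):
    """Average the confidence score within each sentiment label."""
    grouped = defaultdict(list)
    for label, score in rows:
        grouped[label].append(score)
    return {label: round(mean(scores), 3) for label, scores in grouped.items()}

avg_conf = mean_score_by_label(toy_rows)
print(avg_conf)
```

If the per-label averages on the real data show lower values for "neutral" and "negative" than for "positive", that would support the note above.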

# Web Scraping

The web scraping process uses Selenium and BeautifulSoup to extract business and review data from Google Maps. Selenium navigates to business URLs from a CSV file, and BeautifulSoup parses details like names, ratings, and addresses after accessing the Reviews tab. The script also scrolls through and expands reviews to capture complete data, which is then compiled into Pandas DataFrames for each business and combined into a single structured dataset for frontend development.
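
The final combining step can be pictured without any scraping at all: each business record is repeated once per review, so that every row carries both the business details and one review. Below is a minimal, pandas-free sketch of that idea. The helper name `flatten_business_reviews()` and the sample values are hypothetical, while the field names mirror the columns assigned in `web_Scraper()`.

```python
def flatten_business_reviews(business, reviews):
    """Pair one business record with each of its reviews (one row per review)."""
    return [{**business, **review} for review in reviews]

# Hypothetical scraped values, for illustration only
business = {
    'bus_id': 'Example Cafe',
    'avg_rating': 4.5,
    'bus_add': '123 Main St',
    'lat': 40.0,
    'lon': -75.0,
}
reviews = [
    {'review': 'Great coffee.', 'rating': 5},
    {'review': 'Slow service.', 'rating': 2},
]

rows = flatten_business_reviews(business, reviews)
for row in rows:
    print(row['bus_id'], '|', row['rating'], '|', row['review'])
```

In the notebook itself, the same result is reached with `pd.concat`: first to repeat the business row, then to join it column-wise with the reviews summary.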

*Note: A Chrome window will open to proceed through the scraping process. Do* ***NOT*** *close or click on any links within that window, as it will interrupt the web scraper!*


In [None]:
# Run the `web_Scraper` function with the business URLs csv file as input,
# keeping the returned DataFrame so it can be exported afterwards
spooder_df = web_Scraper('Resources/business_urls.csv')

In [None]:
# Exporting spooder_df for later use (commented out because the `.csv` already exists in `Resources/`)
# spooder_df.to_csv('./Resources/spooder.csv')

In [None]:
# Importing `spooder_df`
spooder_df = pd.read_csv('Resources/spooder.csv')

## Apply the Sentiment Analysis Model to the Web-Scraped Data

In [None]:
# Apply `roberto` to the web-scraped DataFrame of Google Reviews
# (commented out due to already applying to saved `.csv`)
# roberto_df = apply_roberto(spooder_df,'review')
# roberto_df

In [None]:
# Exporting roberto_df for later use (commented out because the `.csv` already exists in `Resources/`)
# roberto_df.to_csv('./Resources/roberto.csv')

In [None]:
# Importing `roberto_df`
roberto_df = pd.read_csv('Resources/roberto.csv')

In [None]:
# Run function to get the general sentiment per business, to be used as input to the ChatGPT model
general_sentiment_web_scrapping = general_sentiment(roberto_df, 'bus_id', 'Dulce de Leche Bakery', 'sent_label', 'sent_score')
general_sentiment_web_scrapping

In [None]:
# run function to add all reviews of a given business as a list
review_list = reviews_list(roberto_df, 'bus_id', 'Dulce de Leche Bakery', 'bus_add', 'review')
review_list[:10]

## Use Reviews from Selected Business to run ChatGPT Model

In [None]:
# Load environment variables.
load_dotenv()

# Set the model name for our LLMs.
OPENAI_MODEL = "gpt-3.5-turbo"
# Store the API key in a variable.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [None]:
# Initialize the model.
llm=ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name=OPENAI_MODEL, temperature=0.3)

# Define the format for the template.
format = """

Provide a summary of the given reviews:{review_list} and three recommendations for ways in which to improve the business. The summary should capture the main points and key details of the text 
while conveying the author's intended meaning accurately. The recommendations should be actionable, clear and conscise, and there should always be three of them. Please ensure that the summary is well-organized and easy to read, 
with clear headings and subheadings to guide the reader through each section. The length of the summary should be appropriate to capture the main points and key details of the text, 
without including unnecessary information or becoming overly long. Please also ensure that there are always three recommendations for ways to improve the business.

reviews = {review_list}

"""

# Construct the prompt template.
prompt_template = PromptTemplate(
    input_variables=["review_list"],
    template=format
)

# Construct a chain using this template.
davidlingo = LLMChain(llm=llm, prompt=prompt_template)
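
To see what the model actually receives, it can help to render the template by hand. The sketch below assumes the default f-string style of `PromptTemplate`, where placeholders are filled in the same way as Python's `str.format`. The shortened template and the sample reviews are stand-ins, not the notebook's real prompt or data.

```python
# Shortened stand-in for the notebook's template (same placeholder, used twice)
demo_template = (
    "Provide a summary of the given reviews:{review_list} "
    "and three recommendations.\n\n"
    "reviews = {review_list}\n"
)

# Hypothetical reviews, for illustration only
demo_reviews = ["Great bread!", "Long wait on weekends."]

# A list is inserted using its string representation, brackets and quotes included
rendered = demo_template.format(review_list=demo_reviews)
print(rendered)

# Because the placeholder appears twice, the reviews are embedded twice
occurrences = rendered.count(str(demo_reviews))
print(f"Review list appears {occurrences} time(s) in the prompt.")
```

Since the full review list is embedded at both placeholders, dropping one of them would roughly halve the review portion of the prompt. Whether that changes the quality of the output would need to be tested.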

In [None]:
# Define the input variable as a dictionary
review_list = {"review_list": review_list}

# Run the chain using the query as input and get the result.
result = davidlingo.invoke(review_list)
results = result["text"]

# split the results by new lines to extract review summary and business recommendations
results_list = results.split('\n')
reviews_summary = results_list[1]
recommendations = results_list[4]


print(result["text"])

# Functions (Pt 3)

While trained on Yelp! data and developed for Google Reviews, the application is meant to be as universally applicable to business reviews as possible, regardless of the source. The following functions were developed with their annotated purposes in mind:

| **Function** | **Notes** |
| :--- | :---|
| `unique_locs_df()` | Creates a DataFrame with all unique locations in a given dataset |
| `location_details()` | Generates a dictionary with geographic coordinates for all locations of a given business <br> *Note: To be run on the DataFrame generated by* `unique_locs_df()` |
| `build_map()` | Constructs a Scattermapbox based on the locations from `location_details()` |
| `apply_davidlingo()` | Generates the final summary of a business' reviews, or recommendations for improvement based on the reviews and overall sentiment <br> *Note: To be used with the outputs of* `reviews_list()` *and* `general_sentiment()` |

Each function outlined in more detail below requires a DataFrame with the following features:

| **Feature** | **Notes** |
| :--- | :--- |
| `bus_name_col` | A text column with the name of a business |
| `bus_add` | A text column with the street address of a business' location |
| `bus_lat` | A float column with the latitude coordinate for a business' location |
| `bus_lon` | A float column with the longitude coordinate for a business' location |
| `rev_col` | A text column with available reviews |
| `sent_lbl` | A text column with the generated sentiment classification <br> *Note: Generated through* `apply_roberto()` |
| `sent_scr` | A float column with the generated sentiment score (model confidence) <br> *Note: Generated through* `apply_roberto()` |

In [None]:
# Function to retrieve unique business locations
def unique_locs_df(df, bus_name_col, bus_add, bus_lat, bus_lon):
    '''
    Gathers unique addresses and coordinates for unique locations into a DataFrame.

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        bus_name_col (str): A string with the feature name that contains the business name.
        bus_add (str):      A string with the feature name that contains the business street address.
        bus_lat (str):      A string with the feature name that contains the latitude of a location.
        bus_lon (str):      A string with the feature name that contains the longitude of a location.
    
    Returns:
        locs (DataFrame):   A DataFrame with only unique locations.

    Raises:
        KeyError:           If any passed str is not a valid column name in the DataFrame.
        TypeError:          If `df` is not a DataFrame or if any feature is not a string.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    for param, name in zip(
        [bus_name_col, bus_add, bus_lat, bus_lon],
        ['bus_name_col', 'bus_add', 'bus_lat', 'bus_lon']
    ):
        if not isinstance(param, str):
            raise TypeError(f"The '{name}' parameter must be passed as a string.")
        if param not in df.columns:
            raise KeyError(f"Column '{param}' not found in DataFrame.")
    
    # Generating a DataFrame of unique locations
    locs = df[[bus_name_col, bus_add, bus_lat, bus_lon]].drop_duplicates()
    
    # Returning the DataFrame
    return locs

In [None]:
# Function to generate a `locations` list of dictionaries
def location_details(df, bus_name_col, bus_name, bus_add, bus_lat, bus_lon):
    '''
    Transfers the latitude, longitude, and a concatenated identifier into a dictionary for later use in generating a Scattermapbox figure.

    Note:
        Advised to run `location_details()` on the DataFrame generated by `unique_locs_df()`

    Args:
        df (DataFrame):     Any DataFrame with sufficient data.
        bus_name_col (str): A string with the feature name that contains the business name.
        bus_name (str):     A string of a specific business' name for which to map the locations.
        bus_add (str):      A string with the feature name that contains the business street address.
        bus_lat (str):      A string with the feature name that contains the latitude of a location.
        bus_lon (str):      A string with the feature name that contains the longitude of a location.

    Returns:
        locs (list):        A list of dictionaries with the necessary details for building a figure.
    
    Raises:
        KeyError:           If any passed str is not a valid column name in the DataFrame.
        TypeError:          If `df` is not a DataFrame or if any feature is not a string.
        ValueError:         If `bus_name` is not a value in `bus_name_col`.
    '''
    # Raises
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input `df` must be a pandas DataFrame.')
    for param, name in zip(
        [bus_name_col, bus_add, bus_lat, bus_lon],
        ['bus_name_col', 'bus_add', 'bus_lat', 'bus_lon']
    ):
        if not isinstance(param, str):
            raise TypeError(f"The '{name}' parameter must be passed as a string.")
        if param not in df.columns:
            raise KeyError(f"Column '{param}' not found in DataFrame.")
    if bus_name not in df[bus_name_col].values:
        raise ValueError(f"'{bus_name}' not found in column '{bus_name_col}'.")
    
    # Creating a list of features to retain
    retain = [bus_name_col, bus_add, bus_lat, bus_lon]

    # Filtering `df` to the selected business and the retained features
    filtered_df = df.loc[df[bus_name_col] == bus_name, retain].copy()

    # Creating a concatenated `loc_name` feature with a business' name and location address
    filtered_df['loc_name'] = filtered_df[bus_name_col] + ' - ' + filtered_df[bus_add]

    # Renaming features
    filtered_df.rename(columns={bus_lat: 'lat', bus_lon: 'lon'}, inplace=True)

    # Dropping features
    filtered_df.drop([bus_name_col, bus_add], axis=1, inplace=True)

    # Converting `filtered_df` to a list of dictionaries (one per location)
    locs = filtered_df.to_dict('records')

    # Returning list of dictionaries
    return locs

In [None]:
# Function to build a map
def build_map(locs):
    '''
    Generates and updates a Scattermapbox figure based on the location details previously generated.

    Note:
        To be run on the list of dictionaries returned by `location_details()`.

    Args:
        locs (list):    A list of dictionaries, each containing the latitude and longitude coordinates and the combined business name and street address (`loc_name`) of one location for that business.
    
    Returns:
        fig (Figure):   A Scattermapbox figure formatted to an appropriate zoom level and centered on all given locations for a business.
    
    Raises:
        TypeError:      If `locs` is not a list of dictionaries.
        KeyError:       If any of the expected keys are missing from the dictionaries.
        ValueError:     If `locs` is empty.
    '''
    # Raises
    if not isinstance(locs, list) or not all(isinstance(loc, dict) for loc in locs):
        raise TypeError("`locs` must be a list of dictionaries.")
    required_keys = {'lat', 'lon', 'loc_name'}
    for loc in locs:
        if not required_keys.issubset(loc):
            raise KeyError(f"Each dictionary in `locs` must contain the keys: {required_keys}.")
    if not locs:
        raise ValueError("`locs` cannot be an empty list.")

    # Generating location text
    hover_text = [loc['loc_name'] for loc in locs]

    # Generating location lat and lon
    lat_loc = [loc['lat'] for loc in locs]
    lon_loc = [loc['lon'] for loc in locs]

    # Calculating middle point for lat and lon
    lat_mean = sum(lat_loc)/len(lat_loc)
    lon_mean = sum(lon_loc)/len(lon_loc)

    # Calculating borders of locations
    lat_min, lat_max = min(lat_loc), max(lat_loc)
    lon_min, lon_max = min(lon_loc), max(lon_loc)

    # Calculating size of borders
    lat_diff = lat_max - lat_min
    lon_diff = lon_max - lon_min

    # Using `log()` to scale zoom based on distances at slower rates for larger geographic areas
    zoom = min(7 - math.log(lat_diff + 0.1), 7 - math.log(lon_diff + 0.1))

    # Creating the map figure
    fig = go.Figure(go.Scattermapbox(
        lat=lat_loc,
        lon=lon_loc,
        mode='markers',
        hovertext=hover_text,
        marker=dict(size=10)
    ))

    # Updating layout with map style and properties
    fig.update_layout(
        mapbox={
            'style': 'open-street-map',
            'center': {'lon': lon_mean, 'lat': lat_mean},
            'zoom': zoom
        },
        margin={"r":0,"t":0,"l":0,"b":0},
        height=500
    )

    # Returning figure
    return fig
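
To get a feel for the zoom heuristic, the calculation can be pulled out and run on a few coordinate spreads. This sketch restates the same formula in a standalone helper (the name `estimate_zoom()` and the sample coordinates are illustrative, not part of the notebook). Tightly clustered locations give a higher zoom, and widely spread ones give a lower zoom.

```python
import math

def estimate_zoom(lats, lons):
    """Restates the zoom formula from `build_map()` for lists of coordinates."""
    lat_diff = max(lats) - min(lats)
    lon_diff = max(lons) - min(lons)
    # The +0.1 offset keeps log() defined when there is only one location
    return min(7 - math.log(lat_diff + 0.1), 7 - math.log(lon_diff + 0.1))

# Single location: both spreads are 0, so zoom = 7 - log(0.1)
single = estimate_zoom([40.0], [-75.0])

# Locations spread across a metro area (about 0.4 degrees)
metro = estimate_zoom([40.0, 40.4], [-75.0, -74.6])

# Locations spread across a continent (tens of degrees)
wide = estimate_zoom([30.0, 45.0], [-120.0, -75.0])

print(f"single: {single:.2f}, metro: {metro:.2f}, wide: {wide:.2f}")
```

Taking the `min()` of the two candidate zooms means the wider of the two spreads (latitude or longitude) decides the final zoom level.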

In [None]:
# Function to generate OpenAI summary or recommendations
def apply_davidlingo(reviews):
    '''
    Assesses the reviews for a given business, then outputs a summary of those reviews and generates actionable feedback.

    Note:
        To be run after `reviews_list()`.
    
    Args:
        reviews (list):     A list of reviews for a given business, as generated by `reviews_list()`
    
    Returns:
        reviews_summary (str):  A string with a summary of the provided reviews.
        feedback (list):        A list of strings with the feedback based on the reviews, led by a "Recommendations for Improvement:" title.
    
    Raises:
        ValueError:         If `reviews` is not a list or is empty.
        RuntimeError:       If the `davidlingo.invoke()` call fails or returns unexpected results.
    '''
    # Raises
    if not isinstance(reviews, list) or len(reviews) == 0:
        raise ValueError("The `reviews` argument must be a non-empty list.")
    
    # Placing the list of reviews into a dictionary
    review_list = {'review_list': reviews}

    try:
        # Running the `davidlingo` chain
        result = davidlingo.invoke(review_list)
    except Exception as e:
        # Raises (chaining the original exception for easier debugging)
        raise RuntimeError(f"Failed to invoke `davidlingo`: {e}") from e

    # Check for expected result format
    if 'text' not in result:
        # Raises
        raise RuntimeError("The `davidlingo` response did not contain the expected `text` key.")
    
    # Storing the results
    results = result['text']

    # Initializing summary and feedback
    reviews_summary = ""
    feedback = []
    
    # Split the text into lines
    results_list = results.split('\n')

    # Identify the starting point of recommendations
    recommendations_title = "Recommendations for Improvement:"
    for idx, line in enumerate(results_list):
        line = line.strip()
        if line.startswith("1.") or line.startswith("2.") or line.startswith("3."):
            # All lines from here are feedback
            feedback = results_list[idx:]  
            break
        if recommendations_title in line:
            # Remove from summary line
            line = line.replace(recommendations_title, "").strip()  
        # Collect summary lines
        reviews_summary += line + " "  

    # Cleaning unwanted characters from summary and recommendations
    reviews_summary = re.sub(r'[#*]', '', reviews_summary).strip()
    feedback = [re.sub(r'[#*]', '', item).strip() for item in feedback]

    # Adding the "Recommendations for Improvement:" title as the first feedback item
    feedback.insert(0, recommendations_title)

    # Returning summary and feedback
    return reviews_summary, feedback
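
The parsing step is easiest to follow on a concrete response. The sketch below reproduces the same line-by-line split in a standalone helper, so it can be run without calling OpenAI. It makes two small additions to the logic above: extra whitespace is collapsed and empty feedback lines are dropped. The helper name `split_summary_and_feedback()` and the sample response text are made up for illustration, and real model output may be formatted differently.

```python
import re

def split_summary_and_feedback(text, title="Recommendations for Improvement:"):
    """Split a response into summary text and numbered feedback lines."""
    summary_parts, feedback = [], []
    lines = text.split('\n')
    for idx, line in enumerate(lines):
        line = line.strip()
        # The first numbered line marks the start of the recommendations
        if re.match(r'[123]\.', line):
            feedback = lines[idx:]
            break
        # Keep everything before that as summary, minus the section title
        summary_parts.append(line.replace(title, '').strip())
    # Strip markdown symbols, then collapse runs of whitespace
    summary = re.sub(r'[#*]', '', ' '.join(summary_parts))
    summary = ' '.join(summary.split())
    # Clean each feedback line and drop any left empty
    feedback = [re.sub(r'[#*]', '', item).strip() for item in feedback]
    feedback = [item for item in feedback if item]
    return summary, [title] + feedback

# Hypothetical model response, for illustration only
sample = (
    "## Summary\n"
    "Customers love the pastries but mention long waits.\n"
    "\n"
    "## Recommendations for Improvement:\n"
    "1. Add a second register.\n"
    "2. Offer online ordering.\n"
    "3. Post wait times."
)

summary, feedback = split_summary_and_feedback(sample)
print(summary)
print(feedback)
```

Note that this approach depends on the model numbering its recommendations `1.`, `2.`, `3.`; a response that uses bullets instead would leave the feedback list with only its title.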

# **SpooderApp™**

With `roberto` and `davidlingo` at the ready, it was time to build them a home. It was time... for `SpooderApp™`!

`SpooderApp™`, developed in __[Dash](https://dash.plotly.com/)__, is the interactive medium through which users will interact with our models directly. With data scraped from our curated list of businesses, we set out to build `SpooderApp™` as a demonstrative tool to prove our hypothesis correct. That yes, we *can* leverage reviews into accurate sentiment analysis and actionable feedback for businesses.

## Temporary Components;

To prepare `SpooderApp™`'s loading state, a few components needed to be established prior to initializing the interface.

#### Map

*Loading the map centered on the US because, well... We're largely American students, so it made sense?*

In [None]:
# Create a default map centered on the US
fig_placeholder = go.Figure(go.Scattermapbox())
fig_placeholder.update_layout(
    mapbox={
        'style': "open-street-map",
        'center': {'lon': -98.583, 'lat': 39.833},
        'zoom': 2.5
    },
    margin={"r":0,"t":0,"l":0,"b":0},
    height=400
)

#### DataFrame

*Making the interchanging of datasets simple without needing to refactor the app itself*

In [None]:
# Declare a DataFrame to be used for the app
app_df = roberto_df.copy()

*Note: During testing, it was discovered that a difference of 'de' and 'De' in 'Dulce De Leche Bakery' records was causing the business to populate twice in our dropdown menu. With* `davidlingo` *currently limited to processing only so many reviews per business, we made the decision to drop the less populated instance of the business (lowercase 'de') for a cleaner proof of concept.*

In [None]:
# Drop the rows where 'bus_id' is 'Dulce de Leche Bakery' (lowercase 'de')
app_df = app_df[app_df['bus_id'] != 'Dulce de Leche Bakery']

# Verify the rows were dropped (should return an empty DataFrame)
app_df.loc[app_df['bus_id'] == 'Dulce de Leche Bakery']
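
Duplicates like this can be found before they reach the dropdown. The sketch below groups names that differ only by letter case. The helper name `find_case_variants()` is hypothetical and the sample list is illustrative, built around the one variant pair mentioned in the note above.

```python
from collections import defaultdict

def find_case_variants(names):
    """Return groups of distinct names that are identical once case is ignored."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # Keep only the keys that map to more than one spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Illustrative list of business names
sample_names = [
    'Dulce De Leche Bakery',
    'Dulce de Leche Bakery',
    'Dulce De Leche Bakery',
    'Example Cafe',
]

variants = find_case_variants(sample_names)
print(variants)
```

On the app data, this could be run as `find_case_variants(app_df['bus_id'])`, since iterating over a pandas Series yields its values.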


#### List of business names

*For user input selection*

In [None]:
# Creating list of business names
drop_opts = business_names_list(app_df, 'bus_id')

#### Location details

*For making the rest of the code below easier*

In [None]:
# Creating a DataFrame of unique locations
uniq_locs = unique_locs_df(app_df, 'bus_id', 'bus_add', 'lat', 'lon')

#### Markdown guide

*For on-screen, in-app assistance*

In [None]:
# Loading markdown content for guide
with open('Resources/SpooderApp_Guide.md', 'r') as file:
    guide_content = file.read()

#### Logo

`SpooderApp™` *is too important to NOT have a logo*

In [None]:
# Declaring logo path
logo_path = 'Images/SpooderApp_Logo_Inverted_Color.png'

# Reading in logo using OpenCV (with unchanged flag to keep transparency)
logo_read = cv2.imread(logo_path, cv2.IMREAD_UNCHANGED)

# Converting to RGBA
# OpenCV loads 4-channel images as BGRA and 3-channel images as BGR
if logo_read.shape[2] == 4:  
    rgba_logo = cv2.cvtColor(logo_read, cv2.COLOR_BGRA2RGBA)
else:
    # No alpha channel present, so one is added during conversion
    rgba_logo = cv2.cvtColor(logo_read, cv2.COLOR_BGR2RGBA) 

# Converting the image to a PIL Image to handle base64 conversion
rgba_pil = Image.fromarray(rgba_logo)

# Converting image to base64
buffered = BytesIO()
rgba_pil.save(buffered, format="PNG")  # Save the PIL image to the buffer
spooder_str = base64.b64encode(buffered.getvalue()).decode("utf-8")

# Preparing logo source for `html.Img()` component of app
spooder_logo = f"data:image/png;base64,{spooder_str}"
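
The last two steps (base64-encoding the PNG bytes and prefixing them with a data URI header) can be checked in isolation. The sketch below uses a short stand-in byte string rather than a real image, and the helper name `to_data_uri()` is an assumption made for this example. It only demonstrates that the encoded payload round-trips back to the original bytes.

```python
import base64

def to_data_uri(raw_bytes, mime='image/png'):
    """Encode raw bytes as a base64 data URI usable as an image `src`."""
    encoded = base64.b64encode(raw_bytes).decode('utf-8')
    return f"data:{mime};base64,{encoded}"

# Stand-in bytes: the 8-byte PNG signature, not a complete image
png_signature = b'\x89PNG\r\n\x1a\n'

uri = to_data_uri(png_signature)
print(uri)

# Decoding the payload after the comma recovers the original bytes
payload = uri.split(',', 1)[1]
assert base64.b64decode(payload) == png_signature
```

Embedding the logo this way keeps the app self-contained, at the cost of a payload roughly a third larger than the original file.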

## App Development;

The stage was set, the players at the ready... It was time for `SpooderApp™` to truly shine.

*Note: While a monstrous cell, all components of* `SpooderApp™` *were developed in one cell for simplicity and consistency's sake.*

In [None]:
# Initialize app
app = Dash(external_stylesheets=[dbc.themes.QUARTZ])

# App layout
app.layout = html.Div([
    # Wrapping the whole GUI in a stack for uniform formatting
    dbc.Stack(
        [
            # Blank col for spacing (1/12 of parent container)
            dbc.Col('', width=1),
            # Col with all of GUI
            dbc.Col(
                [
                    # Row for header
                    dbc.Row(
                        dbc.Col(
                            # Header (a logo `Img` rather than an `H1`)
                            html.Img(
                                # Does whatever a SpooderApp™ can
                                src=spooder_logo, 
                                # Sizing the logo (centering is handled by the parent `Col`)
                                style={'height': '185px', 'width': 'auto'}
                            ),
                            # Get in the middle
                            className='text-center' 
                        ),
                        # Give us some room, please
                        style={'margin-top': '20px', 'margin-bottom': '20px'}
                    ),
                    # Row for subheader
                    dbc.Row(
                        # Subheader (as is less evident by the `H3` method)
                        html.H3(
                            # Taglines are important
                            'Leveraging business reviews to gain insights for potential improvements.',
                            # Placing in the middle of the page
                            style={'textAlign':'center'}
                        ),
                        # Buffer space
                        style={'margin-bottom': '20px'}
                    ),
                    # Row for business name and ratings
                    dbc.Row(
                        [
                            # Col for user input
                            dbc.Col(
                                # InputGroup for user input
                                dbc.InputGroup(
                                    [
                                        # Dropdown menu for user input
                                        dbc.DropdownMenu(
                                            # Instructions for user input
                                            label = 'Select a business',
                                            # To know it's user input
                                            id = 'business_dropdown',
                                            # Selections for user input
                                            children = [
                                                dbc.DropdownMenuItem(
                                                    name,
                                                    id=f"menu_item_{i}",
                                                    style={'color': 'grey'}
                                                ) for i, name in enumerate(drop_opts)
                                            ],
                                            # Making it pretty ([insert sparkles here])
                                            class_name='btn-info'
                                        ),
                                        # Not actually user input, but reflects it
                                        dbc.InputGroupText(
                                            # Blank until user input selected
                                            children='',
                                            # To know where to put user input
                                            id='chld_nm',
                                            # Making it pretty, but not AS pretty
                                            class_name='form-control'
                                        )
                                    ],
                                    # Be tall, but only so tall, please
                                    style={'width': '100%', 'height': '60px'}
                                ),
                                # 6/12 of parent container, because math
                                width=6
                            ),
                            # Col for average rating information
                            dbc.Col(
                                # Card display for average rating information
                                dbc.Card(
                                    # Blank until user input selected
                                    children='',
                                    # To know where to put average rating information
                                    id='avg_rtng',
                                    # Making it pretty-ish
                                    body=True,
                                    # Be no taller than the column to your left
                                    style={
                                        'width': '100%',
                                        'height': '60px',
                                        'display': 'flex',
                                        'align-items': 'left',
                                        'justify-content': 'center'
                                    }
                                ),
                                # 3/12 of parent container, or 1/4 but HTML/CSS doesn't like quarters as much
                                width=3
                            ),
                            # Column for total reviews information
                            dbc.Col(
                                # Card display for total reviews information
                                dbc.Card(
                                    # Blank until user input selected
                                    children='',
                                    # To know where to put total reviews information
                                    id='tot_rvws',
                                    # Making it pretty-ish like its sibling to the left
                                    body=True,
                                    # You must be this short to display
                                    style={
                                        'width': '100%',
                                        'height': '60px',
                                        'display': 'flex',
                                        'align-items': 'left',
                                        'justify-content': 'center'
                                    }
                                ),
                                # 3/12 of parent container, because 12 - 6 - 3 leaves 3
                                width=3
                            )
                        ],
                        # Usually a good place to begin - The beginning
                        justify='start',
                        # Matching buffer space for that glossy, uniform look ([more sparkles])
                        style={'margin-bottom': '20px'},
                    ),
                    # "Row" for map and accordion
                    dbc.Stack(
                        [
                            # Col for map
                            dbc.Col(
                                # Map
                                dcc.Graph(figure=fig_placeholder, id='bus_map'),
                                # 5/12 of parent container, because the map wanted to be special
                                width=5
                            ),
                            # Col for accordion
                            dbc.Col(
                                # Unfortunately, an accordion menu, not a Weird Al cameo
                                dbc.Accordion(
                                    [
                                        # Menu item for reviews
                                        dbc.AccordionItem(
                                            # Paragraph - in the loosest sense - for reviews
                                            html.P(
                                                # To know where to put the reviews
                                                id='reviews',
                                                # Blank until user input selected
                                                children='',
                                                # Only be so tall, and scroll if longer
                                                style={'max-height': '295px', 'overflow-y': 'auto'}
                                            ),
                                            # So you know it's got the reviews in it
                                            title='Reviews'
                                        ),
                                        # Menu item for sentiment analysis
                                        dbc.AccordionItem(
                                            html.P(
                                                # To know where to put the sentiment analysis
                                                id='sentiment',
                                                # Blank until user input selected
                                                children='',
                                                # Overkill, since this will only ever be a single line of text
                                                style={'max-height': '295px', 'overflow-y': 'auto'}
                                            ),
                                            # To identify it as the container for the sentiment analysis
                                            title='Sentiment Analysis'
                                        ),
                                        # Menu item for recommendations
                                        dbc.AccordionItem(
                                            html.P(
                                                # To know where to put the OpenAI feedback
                                                id='feedback',
                                                # Blank until user input selected
                                                children='',
                                                # Only be so tall, and scroll if longer
                                                style={'max-height': '295px', 'overflow-y': 'auto'}
                                            ),
                                            # For the purposes of labeling it as the receptacle for feedback
                                            title='Feedback'
                                        )
                                    ]
                                ),
                                # Again, a good place to begin
                                align='start',
                                # 7/12 of parent container, because that's what was left and it looks good
                                width=7
                            )
                        ],
                        # That's what was meant by "row", earlier - go this way <-->
                        direction='horizontal',
                        # Little bit of breathing room in there, too, please
                        gap=1
                    ),
                    # Row for markdown guide
                    dbc.Row(
                        # Markdown guide
                        dcc.Markdown(
                            # Content for the markdown guide
                            guide_content,
                            # Making the markdown guide pretty
                            style={
                                'margin-top': '50px',
                                'padding': '20px',
                                'background-color': 'rgba(255, 255, 255, 0.35)', # This one is super important!
                                'border-radius': '10px'
                            }
                        )
                    )
                ],
                # 10/12 of parent container, because this really is the star of the show, right here
                width=10
            ),
            # Blank col for spacing (1/12 of parent container)
            dbc.Col('', width=1),
        ],
        # Another go this way <--> bit
        direction='horizontal',
        # We like negative space, let's have more of that between things
        gap=1
    )
])

# Callback to update the page content when a `DropdownMenu` item is clicked
@callback(
    Output('chld_nm', 'children'),
    Output('avg_rtng', 'children'),
    Output('tot_rvws', 'children'),
    Output('bus_map', 'figure'),
    Output('reviews', 'children'),
    Output('sentiment', 'children'),
    Output('feedback', 'children'),
    [Input(f"menu_item_{i}", "n_clicks") for i in range(len(drop_opts))],
    [State(f"menu_item_{i}", "children") for i in range(len(drop_opts))]
)
def update_content(*args):
    # Default states for elements
    load_input = 'Use the dropdown menu on the left'
    load_avg_rtng = 'Average Rating: '
    load_tot_rvws = 'Total Available Reviews: '
    load_fig = fig_placeholder
    load_revs = 'Select a business to see reviews.'
    load_sent = 'Select a business to generate sentiment analysis.'
    load_feedback = 'Select a business to generate dynamic feedback.'

    # Confirming a dropdown selection has been made
    ctx = callback_context
    if ctx.triggered:
        # Finding which business was clicked
        selected_item_id = ctx.triggered[0]['prop_id'].split('.')[0]
        # Finding the index of the clicked business
        selected_index = int(selected_item_id.split('_')[-1])
        # Getting the selected business name
        selected_business = args[len(drop_opts) + selected_index]

        # Getting the average rating for the selected business
        # NOTE: Since `avg_rating` is stored on a per-business basis,
        # not a per-record basis, the first record's value will suffice
        rvw_avg = app_df.loc[app_df['bus_id'] == selected_business, 'avg_rating'].iloc[0]
        # Returning the average rating value
        avg_rtng = f'Average Rating: {rvw_avg:.1f}'

        # Gathering the reviews for the selected business
        reviews = reviews_list(app_df, 'bus_id', selected_business, 'bus_add', 'review')
        # Calculating the total number of reviews
        if len(reviews) >= 1:
            # If 1 or more, returning a count of available reviews
            rev_tot = f'Total Available Reviews: {len(reviews)}'
            # Preparing an empty list
            rev_list = []
            # Appending each review into `rev_list` with HTML formatting
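            for rev in reviews:
                # Splitting on the first ':\n' only, so review text that itself
                # contains ':\n' is kept whole and a missing separator cannot raise
                header, _, body = rev.partition(':\n')
                rev_list.append(html.P([header, ':', html.Br(), body, html.Br()]))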
        else:
            rev_tot = 'Total Available Reviews: 0'
            rev_list = 'Too few reviews available to display.'

        # Preparing location details for the selected business
        locations = location_details(uniq_locs,'bus_id', selected_business, 'bus_add','lat', 'lon')
        # Building map based on locations for the selected business
        fig = build_map(locations)

        # Gather the general sentiment for the selected business
        gen_sent = general_sentiment(app_df, 'bus_id', selected_business, 'sent_label', 'sent_score')

        # Generate OpenAI response
        reviews_summary, feedback = apply_davidlingo(reviews)
        
        # Creating an empty list to hold the formatted content
        formatted_content = []

        # Appending content
        # Appending summary of reviews
        formatted_content.append(html.P(reviews_summary))
        # Adding a line break for readability
        formatted_content.append(html.Br())
        # Adding each recommendation as a separate paragraph
        for recommendation in feedback:
            # Ensuring not an empty line
            if recommendation.strip():
                # Adding recommendation line
                formatted_content.append(html.P(recommendation))
                # Adding a line break for readability
                formatted_content.append(html.Br())
        # Combining list into a single Div
        feedback_component = html.Div(formatted_content)

        # Returning the updated content for the clicked item
        return selected_business, avg_rtng, rev_tot, fig, rev_list, gen_sent, feedback_component
    # Returning original placeholder text if none selected
    return load_input, load_avg_rtng, load_tot_rvws, load_fig, load_revs, load_sent, load_feedback

# Launch app (in browser tab) (comment out if running in notebook)
app.run(jupyter_mode='tab')
# Launch app (in notebook) (uncomment to run)
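# NOTE: `app.run_server` is deprecated in favor of `app.run` in recent Dash releases
# app.run(jupyter_mode='inline', debug=True)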
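
The callback above works out which business was clicked by parsing the triggering component's ID and then reading the matching `State` value from `*args`. The following is a minimal sketch of that lookup in isolation, so it can be checked without running the app. The helper name `resolve_selection` and the sample labels are hypothetical and not part of the notebook; the sketch assumes IDs of the form `menu_item_<i>` and that all `Input` values precede all `State` values in `args`, as in `update_content`.

```python
def resolve_selection(prop_id, args, n_items):
    """Return the label of the clicked menu item.

    prop_id : triggering property, e.g. 'menu_item_2.n_clicks'
    args    : callback arguments (n_items Input values, then n_items State values)
    n_items : number of dropdown items (hypothetical stand-in for len(drop_opts))
    """
    # Drop the property name, keeping only the component ID
    component_id = prop_id.split('.')[0]
    # The index is the part after the last underscore
    index = int(component_id.rsplit('_', 1)[-1])
    # Inputs come first in `args`, so the States start at offset n_items
    return args[n_items + index]


# Hypothetical sample data: three menu items, the second one clicked once
labels = ['Cafe A', 'Diner B', 'Bistro C']
args = (None, 1, None, *labels)

print(resolve_selection('menu_item_1.n_clicks', args, len(labels)))  # prints Diner B
```

Keeping this step in a small function would also make it straightforward to unit test the index arithmetic separately from the Dash layout.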

# **Final Thoughts**

`SpooderApp™` was designed to give businesses a practical way to track and improve their consumer experience by focusing solely on customer feedback. With a sentiment analysis model in `roberto` that is accurate enough for practical use, the versatility of our web scraping tools, and the power of our LangChain integration with OpenAI in `davidlingo`, we feel that we met, if not exceeded, our goals. There is still room for improvement and further development, such as handling multiple languages, allowing for ad hoc searches, and training `roberto` on a larger sample. Even so, the app lays the foundation for a marketable, practical utility thanks to its simple and intuitive display, its integration of multiple NLP technologies, and its robust capabilities.

# **Citations**

#### Yelp Open Dataset

Yelp Inc. (2021). *Yelp Open Dataset*. Retrieved from __[https://www.yelp.com/dataset](https://www.yelp.com/dataset)__.

*Note: The findings and applications of this study are those of the authors and do not necessarily reflect the views or opinions of Yelp Inc.*