In [1]:
import pandas as pd
import numpy as np
import re
import jellyfish
from tabulate import tabulate

# Task 2: Consultancy for opening a restaurant in the city of Philadelphia

<!-- Task 2 is more open in nature as there is no specific target. We don’t expect you to analyse all aspects of the problem. You can decide yourself on which approaches, summary statistics or analysis procedures you focus on. Grading will be based on scientific correctness, originality and presentation. -->

## Introduction

The goal of this report is to analyse customer review data to better understand the regional market, so that the business is more likely to succeed when opening its restaurant in Philadelphia. In particular, we aim to answer the following questions:

1. An insight on what restaurant consumers generally seem to like (for example in terms of food, service, location, etc…).
2. An analysis of the evolution of food trends in the area over time, in terms of consumer preferences. Do the preferences evolve over time, or do they seem stable?
3. Imagine you have to present your findings to the business owner and his investors. What advice would you give to the new business, based on your findings?

Given the nature of the task, we'll focus on customer reviews which concern restaurants located in the city of Philadelphia. First, let's take a look at the data to see how we can approach this task.

## Extracting restaurant reviews in Philadelphia
Before starting the analysis, we need to obtain the relevant review data. Let's start by importing the datasets and inspecting their structure. Note that the reviews in the test data (`ATML2024_reviews_test.csv`) are NOT used in this analysis: they do not contain customer ratings, so they cannot tell us how customers rate restaurants in Philadelphia.

In [2]:
# Importing the dataset
reviews_df = pd.read_csv("datasets/ATML2024_reviews_train.csv")
users_df = pd.read_csv("datasets/ATML2024_users.csv")
business_df = pd.read_csv("datasets/ATML2024_businesses.csv")  # Forward slash avoids the invalid "\A" escape sequence


### Glimpsing at the data
The cells below show the data types of the columns as well as the first 5 rows of each dataset. Based on the output, to extract reviews about restaurants in Philadelphia we can first select the businesses whose `city` column is Philadelphia, and then keep those whose categories include restaurants. One crucial caveat, however, is that because the data are free text, there is no guarantee that the `city` column is free of typos or uses a standardised name for Philadelphia. For example, the city is sometimes referred to as Philly. We shall inspect this column in more detail to ensure that we include all the restaurant reviews in Philadelphia (or at least that we do not miss too many because of typos).

We can also see that some columns do not have the correct data types and will need to be converted if they are to be used in the following analysis. For instance, the date-related columns (`date` in `reviews_df` and `user_since` in `users_df`) are stored as `object`, and the `premium_account` column is just a string of years concatenated together, which would cause trouble if we wanted to look at the number of premium users by year. But for now let's focus on filtering restaurant reviews in Philadelphia.
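As a rough sketch of the conversions mentioned above, the snippet below parses a date column and counts premium users per year on a small made-up frame. The values are invented, and the comma separator in `premium_account` is an assumption that should be checked against the real data.

```python
import pandas as pd

# Toy frame mimicking the relevant columns; values are made up for illustration.
users_toy = pd.DataFrame({
    "user_since": ["2007-01-25 16:47:26", "2012-06-03 09:15:00"],
    # ASSUMPTION: years are comma-separated; adjust the separator to the real data.
    "premium_account": ["2007,2008,2010", None],
})

# Parse the timestamp strings into datetime64 values.
users_toy["user_since"] = pd.to_datetime(users_toy["user_since"])

# Split the year string into a list per user, then put one year on each row.
premium_years = (
    users_toy["premium_account"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .astype(int)
)

# Number of premium users per year.
premium_per_year = premium_years.value_counts().sort_index()
print(premium_per_year)
```

The same pattern should carry over to `date` in `reviews_df`, since it is stored as a string in the same way.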

In [3]:
print(reviews_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 9 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   id           1050000 non-null  int64 
 1   user_id      1050000 non-null  object
 2   business_id  1050000 non-null  object
 3   rating       1050000 non-null  int64 
 4   useful       1050000 non-null  int64 
 5   funny        1050000 non-null  int64 
 6   cool         1050000 non-null  int64 
 7   text         1050000 non-null  object
 8   date         1050000 non-null  object
dtypes: int64(5), object(4)
memory usage: 72.1+ MB
None


In [4]:
print(tabulate(reviews_df.head(), headers = "keys", tablefmt='orgtbl', showindex=False))

|   id | user_id                | business_id            |   rating |   useful |   funny |   cool | text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | date    

In [5]:
print(users_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 747468 entries, 0 to 747467
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   user_id             747468 non-null  object 
 1   name                747457 non-null  object 
 2   user_since          747468 non-null  object 
 3   useful              747468 non-null  float64
 4   funny               747468 non-null  float64
 5   cool                747468 non-null  float64
 6   premium_account     57420 non-null   object 
 7   friends             747468 non-null  float64
 8   fans                747468 non-null  float64
 9   compliment_hot      747468 non-null  float64
 10  compliment_more     747468 non-null  float64
 11  compliment_profile  747468 non-null  float64
 12  compliment_cute     747468 non-null  float64
 13  compliment_list     747468 non-null  float64
 14  compliment_note     747468 non-null  float64
 15  compliment_plain    747468 non-nul

In [6]:
print(tabulate(users_df.head(), headers='keys', tablefmt='orgtbl', showindex=False))

| user_id                | name   | user_since          |   useful |   funny |   cool | premium_account                                                   |   friends |   fans |   compliment_hot |   compliment_more |   compliment_profile |   compliment_cute |   compliment_list |   compliment_note |   compliment_plain |   compliment_cool |   compliment_funny |   compliment_writer |
|------------------------+--------+---------------------+----------+---------+--------+-------------------------------------------------------------------+-----------+--------+------------------+-------------------+----------------------+-------------------+-------------------+-------------------+--------------------+-------------------+--------------------+---------------------|
| w7IdXgBVXKjZS5UYDO8cVq | Walker | 2007-01-25 16:47:26 |     7217 |    1259 |   5994 | 2007                                                              |     14995 |    267 |              250 |                65 |                   

In [7]:
print(business_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138210 entries, 0 to 138209
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   business_id  138210 non-null  object 
 1   name         138210 non-null  object 
 2   address      133772 non-null  object 
 3   city         138210 non-null  object 
 4   state        138210 non-null  object 
 5   postal_code  138145 non-null  object 
 6   latitude     138210 non-null  float64
 7   longitude    138210 non-null  float64
 8   attributes   126589 non-null  object 
 9   categories   138136 non-null  object 
 10  hours        117852 non-null  object 
dtypes: float64(2), object(9)
memory usage: 11.6+ MB
None


In [8]:
print(tabulate(business_df.head(), headers = "keys", tablefmt='orgtbl', showindex = False))

| business_id            | name                     | address                         | city         | state   |   postal_code |   latitude |   longitude | attributes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | categories                                                                         | hours                                                                                                                                     

### Is it Philadelphia, or ...?
We now take a closer look at how Philadelphia is referred to in the dataset. The regex search below shows that, even when matching only the full name of the city, there are already 6 variants. Moreover, this list includes neither typos nor nicknames of the city. We therefore need to normalise the different ways Philadelphia is written in the `business_df` dataset, and we need a method that can cope with a non-exhaustive list of aliases.

In [9]:
# Checking how Philadelphia might be referred to
business_pa = business_df.query("state == 'PA'").copy()  # Philadelphia is in Pennsylvania; copy to avoid SettingWithCopyWarning
business_pa.loc[:, 'city'] = business_pa['city'].str.lower()  # Avoid case-sensitivity issues in string matching later
unique_cities = business_pa['city'].unique()
philly_matches = [re.search(r"philadelphia", city) is not None for city in unique_cities]
print(unique_cities[philly_matches])

['philadelphia' 'southwest philadelphia' 'philadelphia pa'
 'west philadelphia' 'philadelphia (northeast philly)' 'philadelphia ']
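
Some of the variants above differ only in whitespace (for instance the trailing space in `'philadelphia '`). A cheap pre-step, sketched below, removes such variants before any fuzzy matching. The helper `normalise_city` is hypothetical and is not part of the pipeline used in this report.

```python
import re

def normalise_city(raw: str) -> str:
    """Lowercase, trim, and collapse repeated whitespace in a city name."""
    return re.sub(r"\s+", " ", raw).strip().lower()

# Made-up spellings for illustration.
variants = ["Philadelphia", "philadelphia ", "  PHILADELPHIA", "West  Philadelphia"]
print(sorted({normalise_city(v) for v in variants}))
# ['philadelphia', 'west philadelphia']
```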


To compare a string with "Philadelphia", we can use the Jaro similarity, which ranges from 0 (totally dissimilar) to 1 (exact match). Mathematically, the Jaro similarity $sim_j$ between two strings $s_1$ and $s_2$ is defined as follows (a more detailed discussion of the Jaro similarity and its variants can be found [here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)):

\begin{align}
sim_j = 
\begin{cases}
0 \quad &\text{if} \; m = 0 \\
\frac{1}{3}(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}) \; &\text{otherwise}
\end{cases},
\end{align}

where $|s_i|$ is the length of the string $s_i$, $m$ is the number of matching characters (a character in $s_1$ and a character in $s_2$ match only if they are identical and at most $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ positions apart), and $t$ is the number of transpositions (i.e. swaps of the positions of two characters), calculated as the number of matching characters that are not in the same order in both strings, divided by two.
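
To make the definition concrete, here is a minimal pure-Python sketch that follows the formula term by term. It is only an illustration and is not the implementation used by `jellyfish`, which may differ in edge cases.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity following the definition above (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters may match only within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 can be matched at most once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Compare matched characters in order; t is half the number of positions that disagree.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(f"{jaro_similarity('philly', 'philadelphia'):.2f}")  # 0.75
print(f"{jaro_similarity('phila', 'philadelphia'):.2f}")   # 0.81
```

For "philly", five characters match with no transpositions, so the score is $\frac{1}{3}\left(\frac{5}{6} + \frac{5}{12} + 1\right) = 0.75$.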

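Admittedly, there exist other string distance measures, such as the Levenshtein distance briefly mentioned in class. Nevertheless, we will use the Jaro similarity because its output always lies between 0 and 1 regardless of the lengths of the two strings under comparison, which makes thresholding easier. Note that neither the Jaro similarity nor its Jaro-Winkler variant induces a distance metric in the strict mathematical sense, since the corresponding distance $1 - sim_j$ can violate the triangle inequality (for example, the distance between "ab" and "cd" is 1, yet the distances from each of them to "acbd" sum to only about 0.58). This is not a problem here, as we only need a bounded score to threshold on.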

We now test how many city names are replaced by "philadelphia" depending on the threshold applied to the Jaro similarity score. The Python package `jellyfish` already implements a function for calculating the Jaro similarity between two strings, and we define a function for replacing city names in the dataset based on it. Note that we have converted the unique city names in the dataset to lowercase, since the implementation in `jellyfish` is case-sensitive, i.e. the same letter in different cases is not counted as a matching character.

In [10]:
# Replacing the city name to Philadelphia
def replace_city_name(old_name, threshold, new_name="philadelphia", return_score=False):
    """Return `new_name` if `old_name` is similar enough to it, otherwise `old_name`."""
    jaro_score = jellyfish.jaro_similarity(old_name, new_name)

    # Replace the old name with the new one if the similarity score reaches the threshold.
    name = new_name if jaro_score >= threshold else old_name

    # Optionally return the Jaro score alongside the name
    if return_score:
        return name, jaro_score
    return name

Starting with an arbitrary threshold of 0.6, we notice that names with scores below 0.7 are wrongly replaced with 'philadelphia' even though they genuinely refer to other localities in the state of Pennsylvania. We should therefore not include these matches in our final dataset. On the other hand, once the Jaro score is at least 0.7, the algorithm captures the 11 different aliases of Philadelphia and replaces them accordingly. Indeed, all matches at or above this score appear to refer to the city of Philadelphia, whether they are typos or nicknames of the city.

In [11]:
# Checking what city names get replaced with a given threshold for the Jaro similarity score
cities_replaced = []
city_scores = []
threshold = 0.6

for city in unique_cities:
    # replace_city_name returns the name AFTER replacement, not the original one
    new_name, score = replace_city_name(city, threshold, return_score=True)
    cities_replaced.append(new_name)
    city_scores.append(score)

# Sorting the replacement from the highest Jaro score to the lowest
score_sort_idx = np.argsort(city_scores)[::-1]
cities_replaced = np.array(cities_replaced)
city_scores = np.array(city_scores)

# Printing the city names before and after replacement
for old_name, city, score in zip(unique_cities[score_sort_idx], cities_replaced[score_sort_idx], city_scores[score_sort_idx]):
    if score >= threshold:  # Same condition as in replace_city_name
        print(f"{old_name} -> {city}, score: {score:.2f}")

philadelphia -> philadelphia, score: 1.00
philadelphia  -> philadelphia, score: 0.97
philiadelphia -> philadelphia, score: 0.97
philadephia -> philadelphia, score: 0.97
philadelphila  -> philadelphia, score: 0.95
philadelphia pa -> philadelphia, score: 0.93
philiidelphia -> philadelphia, score: 0.83
west philadelphia -> philadelphia, score: 0.82
phila -> philadelphia, score: 0.81
philadelphia (northeast philly) -> philadelphia, score: 0.80
philly -> philadelphia, score: 0.75
southwest philadelphia -> philadelphia, score: 0.71
phonixville -> philadelphia, score: 0.69
upland -> philadelphia, score: 0.67
pineville -> philadelphia, score: 0.67
landsdale -> philadelphia, score: 0.67
silverdale -> philadelphia, score: 0.64
holland -> philadelphia, score: 0.64
red hill -> philadelphia, score: 0.64
hatfield -> philadelphia, score: 0.64
ivyland -> philadelphia, score: 0.63
paoli -> philadelphia, score: 0.63
pipersville -> philadelphia, score: 0.63
springfield -> philadelphia, score: 0.63
lansda

Let's use 0.7 as the Jaro score threshold for replacing the aliases of Philadelphia, so that we can filter the business dataset later. The replacement appears to have worked as intended: the `assert` statement below, which checks that exactly 11 distinct city names were merged into "philadelphia", did not raise an error.

In [12]:
# Replacing the alias of Philadelphia in the dataset
old_unique_cities_len = len(unique_cities)
business_pa.loc[:, "city"] = business_pa['city'].apply(replace_city_name, threshold = 0.7)
new_unique_cities_len = len(business_pa['city'].unique())

# Checking we've normalised Philadelphia correctly
assert old_unique_cities_len - new_unique_cities_len == 11  # There were 11 aliases of Philadelphia in the original business dataset

### Pre-processing business categories 
We also pre-process the `categories` column to ensure that we do not miss restaurant reviews because of typos.
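
One possible way to catch misspelt restaurant categories is sketched below using `difflib.get_close_matches` from the standard library. This swaps in `difflib`'s `SequenceMatcher` ratio rather than the Jaro similarity used above, and the category labels and the 0.8 cutoff are made up for illustration.

```python
from difflib import get_close_matches

# Made-up category labels; the real ones come from business_df["categories"].
toy_categories = ["Restaurants", "Restaurant", "Resturants", "Bars", "Food", "Nightlife"]

# Map lowercase forms back to the original labels so matching is case-insensitive.
lowered = {cat.lower(): cat for cat in toy_categories}

# cutoff is a SequenceMatcher ratio in [0, 1]; 0.8 is an arbitrary starting point.
close = get_close_matches("restaurants", list(lowered), n=10, cutoff=0.8)
restaurant_like = [lowered[c] for c in close]
print(restaurant_like)
```

As with the city names, the cutoff would need to be checked by eye against the real category list before being relied upon.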

In [13]:
# Drop missing categories first: NaN values are floats and cannot be iterated over
categories_split = business_df["categories"].dropna().str.split(r",\s*", regex=True)

unique_categories = set()

for cat_list in categories_split:
    for cat in cat_list:
        unique_categories.add(cat.strip())

print(sorted(unique_categories))

### Combining everything together