# An Evaluation of Strategies Using Natural Language Processing and Yelp for Detecting and Mitigating Crisis-Related Reputational Damage for Restaurants

## Authors:  
Roberto Cancel  
Nima Amin Taghavi  
Martin Zagari  

## Abstract  
  
Social media and review sites like Yelp have the potential to quickly disseminate reputation-harming information, including, but not limited to, foodborne illness and food safety claims. Chipotle found itself at the center of multiple foodborne illness outbreaks in 2015-2016 that sickened more than 1,100, tarnished its reputation, and resulted in severe financial losses, including a $25 million federal fine. This study aimed to leverage Yelp review data, which has been widely deemed less useful for chains, to detect and mitigate crisis-related reputational damage. During exploratory data analysis (EDA), foodborne illness outbreaks were successfully detected by evaluating the distribution changes of star ratings over time. After hand labeling a subset of reviews, a foodborne illness-related lexicon was developed and validated using Naïve Bayes. Topic modeling via Latent Dirichlet Allocation (LDA) over time suggested that reputational damage can be measured by the number of topics required to generate a separate foodborne illness topic. Based on our findings, we believe restaurant chains would be well-served by developing a system to submit all Yelp reviews to Natural Language Processing analysis to identify topics of negative reviews and determine whether critical topics such as foodborne illness are rising in importance.

## 1. Introduction
  
Negative news is passed along at much higher rates than positive or accurate news items (Fang & Ben-Miled, 2017). With the emergence of real-time reviews and consumer feedback on social media platforms such as Yelp, crisis-related reputational damage can now be empirically evaluated.  

## 2. Background

Chipotle, an American fast-food chain, was at the center of several foodborne illness outbreaks, including an E. coli outbreak that sickened five people in Seattle, an E. coli outbreak in California with over 200 victims, a salmonella outbreak in Minnesota with over 100 victims, and a Norovirus outbreak in Boston.

### 2.1 Problem Identification and Motivation

Traditional approaches to food-borne illness involve medical attention, a local health department report, and a state health department report. These approaches are opaque and inefficient, and do not allow the restaurant to proactively mitigate and potentially avoid a larger crisis.

Third-party services such as IWasPoisoned.com have surfaced that allow individuals to crowdsource the reporting of food-poisoning incidents. However, the reach of these types of websites is minuscule compared with social media sites such as Yelp. Social media platforms such as Yelp are loosely managed and monitored sources of crowdsourced customer experiences at a restaurant, and not only for illness. These reviews potentially impact business reputation.

Some analysts have suggested that Yelp has limited impact on chain restaurants, presumably because of the design and emphasis of the platform. Estes (2011) discussed the impact of Yelp reviews on chain restaurants, and Schlosser (2012) concluded that national restaurant chains seek to create a common experience rather than unique, regional experiences.

Chain restaurants should better understand Yelp as a review engine because it may contain patterns that help them manage their locations and overall business, such as identifying system-wide issues, focusing quality improvement efforts, and perhaps mitigating the spreading impact of a stock-moving event.

A business-centric view of Yelp is necessary to monitor social media over time and to manage reputational damage from adverse company events. This involves channeling changing information to key executives in marketing, risk management, operations, finance, and so on.

### 2.2 Objectives

This study evaluated the use of Natural Language Processing (NLP) and machine learning to link Yelp reviews with food-borne illness and to potentially mitigate crisis-related reputational damage for fast-food chains.

A Yelp review data analysis was conducted to help pinpoint and manage crisis-level events. A sentiment/polarity analysis was implemented to evaluate sentiment distribution changes over time.

To replace subjective assessment of social media content with consistent and replicable science for better decision-making, we need to develop standardized, simplified and understandable tools.

## 3. Literature Review

Although there have been several studies using NLP with Yelp reviews, few have taken a time-centered NLP approach to examine the effects of social media sentiments on corporate reputation before, during, and after a crisis.

Liu (2020) compared different machine learning techniques for analyzing Yelp reviews and found that simpler models outperformed more complex models at predicting star ratings while also being easier to interpret.

Li & Hecht (2020) focused on understanding social media dynamics at the chain versus non-chain level and whether platform effects might differ. They found that Google ratings were significantly higher than Yelp ratings for the same establishments, and that there was considerable heterogeneity in the top-rated restaurants.

Chung et al. (2019) used Twitter data to examine the evolution of corporate social media sentiment during an evolving controversy. They noted that legitimate news sources are deemed neutral or lacking in sentiment. The authors developed a sentiment analysis model that used a variety of inputs to calculate sentiment probability. They determined that negative sentiment tweets outnumbered positive sentiment tweets, and that corporate apologies had a short-term positive effect on sentiment.

Opuszko et al. (2019) conducted a sentiment analysis on YouTube to detect the influence of public scandals on reputation and the relative interest in joining the German Army. Overall sentiment decreased, based on the change in the distributions of positive, negative, and neutral comments. Positive and neutral comments slightly outnumbered negative comments for approximately 14 weeks (about 3 months) after the scandal, after which a similar decrease occurred but with a significant increase in positive comments.

Shama & Ramathmika (2019) used machine learning to build a classifier capable of predicting sentiment using the Yelp Challenge Dataset available on Kaggle. The model was not capable of accurately classifying sarcasm.

## 4. Methodology

### 4.1 Data Ingestion
Data were downloaded, preprocessed, and merged using the Python programming language for exploratory data analysis (EDA) purposes. AWS SageMaker was utilized for data wrangling and model runs when local model runs became impractical. The Yelp Academic Dataset (n.d.) was converted to CSV and Pickle to reduce storage requirements, and the two formats were used interchangeably at start-up. This study focused on the business directory and review files; future work should link these to the additional information available for each user on the Yelp platform (see Figure 1).

![Figure1.JPG](attachment:60a311be-d053-43f3-8e56-d48902b79357.JPG)

Descriptive analyses were performed on Yelp reviews for national fast-food chain restaurants that met the inclusion criteria (at least 30 locations and at least 2,500 reviews available) shown in Figure 2.

![Figure 2.JPG](attachment:ca554b9b-987c-42a7-a3f5-793cf6c3dbdb.JPG)
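The inclusion criteria can be expressed as a simple filter. The following is a minimal sketch, not the study's actual code: it assumes that locations are grouped into chains by business name, and the function name `eligible_chains` and the input shapes are hypothetical. Only the two thresholds come from the text.

```python
from collections import defaultdict

MIN_LOCATIONS = 30    # inclusion threshold stated in the text
MIN_REVIEWS = 2500    # inclusion threshold stated in the text


def eligible_chains(businesses, reviews):
    """Return the sorted names of chains that meet both inclusion thresholds.

    businesses: iterable of (business_id, chain_name) pairs (assumed shape)
    reviews:    iterable of business_id values, one entry per review
    """
    chain_of = dict(businesses)

    # Count distinct locations per chain name.
    locations = defaultdict(set)
    for business_id, name in chain_of.items():
        locations[name].add(business_id)

    # Count reviews per chain name, ignoring reviews for unknown businesses.
    review_counts = defaultdict(int)
    for business_id in reviews:
        name = chain_of.get(business_id)
        if name is not None:
            review_counts[name] += 1

    return sorted(
        name
        for name, ids in locations.items()
        if len(ids) >= MIN_LOCATIONS and review_counts[name] >= MIN_REVIEWS
    )
```

Grouping by exact name is a simplification; real chain names in review data often vary in spelling and would need normalizing first.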

### 4.2 Data Cleaning and Sufficiency

During the initial preprocessing steps, missing values for address, postal code, attributes, categories, and hours were identified. Because latitude, longitude, and state served as adequate location data, the remaining geographical features were dropped. Attributes, categories, and hours were of no interest for this study and were also dropped.

A total of 157 Chipotle locations were identified, with 9,771 reviews written by 8,572 unique reviewers and a mean star rating of 2.48, spanning July 4, 2007 to January 19, 2022.

#### 4.2.2 Geospatial Analysis

The Yelp Academic Dataset did not contain Chipotle reviews for every impacted state, but reviews from adjacent states existed for every impacted state except WA and MN (see Figure 3).

![Figure3.JPG](attachment:1f84ccd8-4ca7-4e8e-9d89-3bfc9306d8e6.JPG)

### 4.3 Star Rating and Sentiment Analysis

To focus on the 2015-2016 foodborne illness outbreaks and subsequent reputational damage, the analysis used 2,797 Chipotle reviews from 124 of the 157 restaurants open during this time, written by 2,571 unique reviewers with a mean star rating of 2.72.

The Valence Aware Dictionary and sEntiment Reasoner (VADER) model, developed by Hutto and Gilbert (2014), calculates a compound score for a text that accounts for sentiment intensity and basic context.

### 4.4 Text-based Exploratory Data Analysis

#### 4.4.1 Manual Human Labeling of Reviews

Of the 9,771 Chipotle reviews by Yelp users, 2,000 were manually labeled by a single human reviewer in 7-8 hours. It was not feasible to perform a dual review by independent reviewers followed by "tiebreaking" for conflicting labels between the two reviewers. The labeling step was defined with three label types: 1) foodborne illness mentioned in the review, 2) foodborne illness noted without the reviewer linking the concern to direct illness, and 3) food directly linked to illness.

#### 4.4.2 Exploration of Keywords for Food-borne Illness

Once the 2,000 reviews were labeled, a second review was performed for label types 2 and 3. This review flagged the 1-3 keywords that most influenced the label. During the manual labeling process, the reviewer observed that many Chipotle reviews compared store locations. This was quantitatively confirmed: 3,270 reviews used the word "location". The reviewer also observed that many restaurants, such as Panda Express and Golden Corral, have an open-viewing food prep area. In addition, several full-service casual-dining chains were included for comparison with "fast-food" chains.

#### 4.4.3 Visual Examination of Time Series and Descriptive Analysis

A visual examination of the time series and a descriptive analysis were conducted on chain reviews beginning in approximately 2014, and the results are reported for these chains (see Figure 4).


![Figure4.JPG](attachment:60c8c72b-4c77-4a34-932d-99b911e2ea0e.JPG)

### 4.5 NLP and Topic Modeling

The manual labeling of the reviews suggested that there are numerous distinct areas of customer feedback in each review, and that topic modeling could be useful to chain restaurant businesses to flag topics based on their frequency or geography of reporting.

#### 4.5.1 Data Preparation

Data preparation included removing emoji characters, editing punctuation, filtering stop words, and stemming each word to its root. Custom lists of foodborne illness terms were also used independently (see Figure 5).


![Figure5.JPG](attachment:34ba39d4-79a4-493b-b4e6-0ec6f935c141.JPG)
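The preparation steps described above can be sketched using only the Python standard library. This is an illustrative outline, not the study's pipeline: the stop word list is a deliberately tiny hypothetical one, and `crude_stem` is a simple suffix stripper standing in for a proper stemmer such as Porter's.

```python
import string

# Hypothetical, deliberately tiny stop word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "it", "was", "is", "to", "of", "my", "after"}

# Crude suffix stripper used as a stand-in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop emoji, replace punctuation, remove stop words, and stem."""
    text = text.lower()
    # Dropping all non-ASCII characters removes emoji, but also accented letters.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TO_SPACE)
    return [crude_stem(token) for token in text.split() if token not in STOP_WORDS]
```

For example, `preprocess("I got sick after eating the burrito 🤢!!")` yields `["got", "sick", "eat", "burrito"]`. The ASCII filter is a blunt way to remove emoji; a production version would target emoji code point ranges so that words like "jalapeño" survive intact.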

Analyses were developed to explore relationships between NLP topics and the geographies having reported outbreaks of foodborne illnesses for named restaurant chains. For example, the words "chicken" and "cheese" appear far more often than other food types (see Figure 6).

![Figure6.JPG](attachment:544acf6b-fdd5-4816-a790-e70dcc15d845.JPG)

## 5. Results and Findings

### 5.1 Star Ratings by Affected States Over Time and Relation to Sentiment Analysis

#### 5.1.1 Star-rating analysis by affected states

Figure 7 shows that star ratings were significantly lower in all directly impacted states compared to adjacent and non-impacted states, except Illinois, where reviews were from a single location.

![Figure7.JPG](attachment:cc4b0775-b5cb-4daa-8fed-586c9d810144.JPG)

Interestingly, all Delaware, New Jersey, and Pennsylvania locations in the dataset lie within the greater Philadelphia metropolitan area (see Figure 8), yet stark differences in mean ratings exist. This finding helped emphasize the location-specific nature of Yelp reviews, which was further evaluated during the text-based EDA.

![Figure8.JPG](attachment:e1dad445-c217-4a20-a968-a119e73158f5.JPG)

#### 5.1.2 Star-rating Analysis Over Time

The frequency distribution of star ratings was difficult to interpret due to the daily time component, but a spike of one-star ratings was observed in May 2015, followed by a period of elevated one-star ratings, a subsequent spike in August and September 2015 (the outbreak), and then continued elevated one-star ratings (see Figure 9).

![Figure9.JPG](attachment:1f759777-4cb6-45f6-af5b-b8c42eb1c15d.JPG)
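One way to make such spikes easier to see is to aggregate reviews by month and track the share of one-star ratings. The sketch below illustrates that idea under assumed inputs (a list of date and star-rating pairs); the monthly granularity and the function name `monthly_one_star_share` are illustrative choices, not the study's documented procedure.

```python
from collections import Counter
from datetime import date


def monthly_one_star_share(reviews):
    """Return {(year, month): share of one-star reviews} in chronological order.

    reviews: iterable of (review_date, stars) pairs (assumed shape)
    """
    totals = Counter()
    one_star = Counter()
    for review_date, stars in reviews:
        key = (review_date.year, review_date.month)
        totals[key] += 1
        if stars == 1:
            one_star[key] += 1
    # Only months with at least one review appear, so the division is safe.
    return {key: one_star[key] / totals[key] for key in sorted(totals)}


sample = [
    (date(2015, 5, 1), 1),
    (date(2015, 5, 20), 5),
    (date(2015, 8, 3), 1),
    (date(2015, 8, 9), 1),
]
shares = monthly_one_star_share(sample)
```

Using a share rather than a raw count keeps months with many reviews from dominating the picture, although months with very few reviews will still produce noisy values.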

#### 5.1.3 Star-based Sentiment Analysis

Chipotle's July Outbreak was not widely publicized as only five were sickened, and the local health department did not become involved. The study noted increasing negative sentiment frequency over time, which might indicate either reputational (consequential) damage or ongoing issues (see Figure 10).

![Figure10.JPG](attachment:f6019503-2124-46a7-8496-b59906ed71a9.JPG)

#### 5.1.4 Text-based Sentiment Analysis

While the crude (star-based) sentiment analysis during EDA helped reduce granularity and identify patterns over time, a text-based (VADER) sentiment analysis highlighted the difficulties of working with social media and review data, such as sarcasm. Figure 11 shows the VADER-derived review sentiment with respect to star rating for the top 25 chain restaurants.

![Figure11.JPG](attachment:bf777b3a-8885-4a57-a14a-347693998368.JPG)

When the corpus was limited to Chipotle reviews, VADER sentiment results more closely resembled the star rating distributions, but conflicts were identified. Figure 12 shows two examples of five-star Chipotle reviews with negative VADER-derived sentiment and two examples of one-star Chipotle reviews with positive VADER-derived sentiment. From these examples, it is evident that the nuances and complexity of review text, such as sarcasm, make text-based sentiment analysis difficult and likely less reflective of the expected sentiment than the star-based approach.

![Figure12.JPG](attachment:e01f9dac-4f97-42c0-830a-c761905d07c4.JPG)
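Conflicts of this kind can be surfaced automatically once each review has a VADER compound score. The sketch below assumes the scores have already been computed elsewhere and uses the conventional VADER cutoffs of ±0.05 for positive and negative sentiment; the text does not state which cutoffs the study applied, so these values and the function name `sentiment_conflicts` are assumptions.

```python
# Conventional VADER compound cutoffs; assumed here, not stated in the text.
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05


def sentiment_conflicts(reviews):
    """Return ids of reviews whose star rating and text sentiment disagree.

    reviews: iterable of (review_id, stars, compound) triples, where compound
             is a precomputed VADER compound score in [-1, 1] (assumed shape)
    """
    conflicts = []
    for review_id, stars, compound in reviews:
        if stars == 5 and compound <= NEGATIVE_CUTOFF:
            conflicts.append(review_id)   # glowing rating, negative text
        elif stars == 1 and compound >= POSITIVE_CUTOFF:
            conflicts.append(review_id)   # worst rating, positive text
    return conflicts


sample = [("r1", 5, -0.6), ("r2", 5, 0.9), ("r3", 1, 0.4), ("r4", 1, -0.8), ("r5", 3, -0.9)]
flagged = sentiment_conflicts(sample)
```

Only the extreme ratings are checked because those are the cases where a disagreement is least ambiguous; reviews flagged this way are natural candidates for manual inspection for sarcasm.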

### 5.2 Text-based Analysis

#### 5.2.1 Dictionary-Labeled vs. Human-Labeled Review Texts

The proportion of reviews mentioning food-borne illness anywhere in the text was 3.27% for all Chipotle reviews when flagged with a custom dictionary, and 1.44% for the human-labeled review texts. Human-labeled Chipotle reviews demonstrate a stronger relationship between star ratings and food-borne illness than dictionary-labeled reviews; dictionary-based labels were less precise with respect to both sentiment and star rating than human review and labeling.

![Figure13.JPG](attachment:4be253b1-832e-480f-81a3-2dcae1213d70.JPG)
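Dictionary labeling of this kind amounts to checking each review against a term list and reporting the flagged share. The following is a minimal sketch: the terms in `ILLNESS_TERMS` are hypothetical examples rather than the study's lexicon, and matching on word prefixes is an assumed design choice.

```python
import re

# Hypothetical lexicon for illustration; the study's actual term list is not reproduced here.
ILLNESS_TERMS = ("sick", "food poisoning", "vomit", "diarrhea", "e. coli", "norovirus", "salmonella")


def mentions_illness(text, terms=ILLNESS_TERMS):
    """Return True if any term starts a word (or phrase) in the text."""
    lowered = text.lower()
    # A leading word boundary only, so "vomit" also matches "vomiting",
    # while "sick" does not match inside "homesick".
    return any(re.search(r"\b" + re.escape(term), lowered) for term in terms)


def dictionary_proportion(reviews, terms=ILLNESS_TERMS):
    """Return the share of reviews flagged by the dictionary (0.0 if empty)."""
    reviews = list(reviews)
    if not reviews:
        return 0.0
    return sum(mentions_illness(text, terms) for text in reviews) / len(reviews)
```

Prefix matching raises recall but also flags figurative uses, which is consistent with dictionary labels being less precise than human labels.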

#### 5.2.3 Initial Dictionary Results

The heat map presented in Figure 14 shows that words related to food sickness were used significantly more often in negative reviews from mid-2015 to mid-2016.

![Figure14.JPG](attachment:72dfc918-dc33-4e35-a565-46e66a0125aa.JPG)

#### 5.2.4 Naïve Bayes Modeling for Text of Foodborne Illness Reviews

To determine which words most strongly distinguish hand-labeled Chipotle reviews that reference foodborne illness from those that do not, a Naïve Bayes model was developed, and the most prominent sickness-related words were listed in descending order of relative frequency.

Figure 15 shows the list of most informative features (words) for illness-related reviews. The highest spike in the proportion of illness-related reviews for Chipotle occurred during the 2015 outbreak, with a smaller peak at the time of the smaller 2017 outbreak (see Figure 16).

![Figure15.JPG](attachment:3aa520a6-07ef-4a2a-9270-1273cd656fef.JPG)

![Figure16.JPG](attachment:0d402c73-4f31-4731-aae5-e496e5e69f3b.JPG)
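To illustrate how a list of most informative words can be derived, the sketch below implements a small Bernoulli Naïve Bayes model with Laplace smoothing and ranks words by the ratio of their presence probabilities in the two classes. It is written from scratch for illustration and is not the study's code or any library's API; all function names are hypothetical, and both classes are assumed to contain at least one review.

```python
import math
from collections import Counter


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit word-presence probabilities per class with Laplace smoothing.

    docs:   list of token lists
    labels: list of booleans, True for illness-related reviews
    """
    n = {True: sum(labels), False: len(labels) - sum(labels)}
    present = {True: Counter(), False: Counter()}
    for tokens, is_ill in zip(docs, labels):
        present[bool(is_ill)].update(set(tokens))   # count each word once per review
    vocab = set(present[True]) | set(present[False])
    prob = {
        cls: {w: (present[cls][w] + alpha) / (n[cls] + 2 * alpha) for w in vocab}
        for cls in (True, False)
    }
    prior = {cls: n[cls] / len(labels) for cls in (True, False)}
    return {"prob": prob, "prior": prior, "vocab": vocab}


def predict_ill(model, tokens):
    """Return True if the illness class has the higher log posterior."""
    seen = set(tokens)
    score = {}
    for cls in (True, False):
        total = math.log(model["prior"][cls])
        for w in model["vocab"]:
            p = model["prob"][cls][w]
            total += math.log(p if w in seen else 1.0 - p)
        score[cls] = total
    return score[True] > score[False]


def most_informative(model, top_n=5):
    """Rank words by P(present | ill) / P(present | not ill), highest first."""
    ratio = {w: model["prob"][True][w] / model["prob"][False][w] for w in model["vocab"]}
    return sorted(ratio, key=lambda w: (-ratio[w], w))[:top_n]


docs = [["sick", "burrito"], ["sick", "night"], ["great", "burrito"],
        ["fast", "service"], ["great", "service"]]
labels = [True, True, False, False, False]
model = train_bernoulli_nb(docs, labels)
```

With this toy data, `most_informative(model, 2)` returns `["sick", "night"]`. The smoothing term keeps words seen in only one class from producing infinite ratios, which matters with a small, imbalanced labeled set.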

### 5.3 Topic-Based Analysis

#### 5.3.1 Low Topic Number Modeling

We undertook a variety of topic modeling analyses for both Chipotle reviews and the broad group of chain restaurants, but found no distinct topics containing predominantly food-borne illness terminology. This was particularly true when the number of topics was restricted to smaller numbers (<10). This finding was also consistent with the low rate of reviews that mentioned these topics, about 3.27% in the case of Chipotle reviews.

#### 5.3.2 Topic Number Hyperparameter Threshold Analysis

To increase the sensitivity of topic analysis, we performed a time-based topic analysis on Chipotle reviews. We found that 10% of reviews mentioned foodborne illness during the outbreak period compared to 2-3% during the non-outbreak period.

The results show that a topic number threshold of 31 was required before the LDA model produced a separate topic containing the keywords in the pre-outbreak period, and that this threshold decreased to 17 topics in the post-outbreak period (see Figure 17).

![Figure17.JPG](attachment:d085c935-9325-4242-954d-85b41fdb1cfe.JPG)

This analysis suggests that a more systematic way to capture "reviewer interest" in each topic may be to use topic number thresholds rather than simply counting proportions of reviews containing certain words.
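The threshold search can be sketched as a loop over candidate topic counts that stops at the first model containing a topic dominated by the target keywords. In the sketch below, `fit_topics` is a hypothetical callable wrapping whatever LDA implementation is used, and the dominance rule (at least `min_hits` keywords among a topic's `top_n` words) is an assumption, since the exact criterion is not stated in the text.

```python
def topic_number_threshold(fit_topics, keywords, k_values, top_n=10, min_hits=3):
    """Return the smallest topic count that yields a keyword-dominated topic.

    fit_topics: callable k -> list of topics, each a list of words ranked by
                weight (hypothetical wrapper around an LDA implementation)
    keywords:   target terms, for example foodborne illness words
    k_values:   candidate topic counts to try
    Returns None if no candidate produces such a topic.
    """
    keywords = set(keywords)
    for k in sorted(k_values):
        for topic in fit_topics(k):
            # Assumed rule: enough keywords among the topic's top-ranked words.
            if len(keywords.intersection(topic[:top_n])) >= min_hits:
                return k
    return None


def fake_fit(k):
    """Stand-in fitter for demonstration; a real one would train LDA with k topics."""
    if k >= 17:
        return [["chicken", "rice"], ["sick", "poison", "stomach", "food"]]
    return [["chicken", "rice", "sick"]]


threshold = topic_number_threshold(fake_fit, {"sick", "poison", "stomach"}, range(2, 40))
```

Because LDA fits vary with random initialization, a real run would likely fix the seed or repeat each fit several times before comparing thresholds across time periods.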

## 6. Discussion

### 6.1 Explanation of Results and Implications

This study used online reviews to better understand adverse user experiences relating to food safety and foodborne illness. It may help companies manage health and safety policies and prevent further illness.

One of the more interesting findings was that food-borne illness is not commonly mentioned in Yelp reviews, and that employee service level, restaurant cleanliness, wait times, or even parking are more common themes in negative reviews.

With a low proportion of reviews mentioning illness, methods using only a star-based approach across all reviews were less sensitive than more specific approaches using NLP. As a result, managers should not expect to see obvious evidence of an emerging negative event in star ratings alone.

The tradeoff of complexity for higher fidelity in pinpointing the most relevant reviews is worth it in this situation. Managers should consider incorporating NLP analysis for all social media data available to them, and not only for health and safety issues.

Topic analysis appeared to be of even greater utility from a time-based perspective, suggesting a more systematic and quantitative method to "follow" the relative importance of target issues over time, including long term effects of issues and time to recovery.

We believe NLP could be used to monitor corporate performance in different areas of consumer sentiment, such as food-borne illness. We have tested a prototype that would live in the cloud and would identify reviews that are flagged as being important for certain topics.

### 6.2 Findings Compared to Previous Studies

#### 6.2.1 Literature Review Comparison

VADER sentiment analysis results confirmed previous studies' findings that suggested star ratings and review text are not perfectly correlated and that sentiment analysis of review text is complex. Liu (2020) suggested text normalization did not improve classification performance. We, similarly, found that text normalization was not necessary for our limited classification modeling purposes; however, normalization (such as lemmatization) improved topic modeling performance given the complexity and lexical diversity of the Yelp text review corpus. Chung et al. (2019) and Opuszko et al. (2019) found that, although the overall sentiment during the crisis was neutral, changes in sentiment distribution over time were very informative. Our crude sentiment (star-based) and VADER analysis confirmed these findings and helped identify topic modeling as an appropriate approach for reputational damage evaluation.

#### 6.2.2 Topic Modeling for Reputational Damage Comparison

Topic modeling can be used to evaluate reputational damage, but the use of reverse-engineered LDA topic modeling for measuring sustained crisis-related reputational damage appears to be novel.

### 6.3 Limitations and Challenges

The Yelp Academic Dataset contains nearly seven million reviews for ten distinct metro areas in the United States. It would be useful to replicate the current study using all the data available for a given chain restaurant.

The limited number of metro areas in the data prevented the incorporation of demographic data into the models. It would have been useful to develop features for each restaurant location, something that chain restaurants could easily do.

The study was limited by the small number of labeled samples and the high class imbalance, which made it difficult to perform detailed analyses of the "positive" (allegedly ill) class of reviews by length, Yelp user characteristics, and so on.

The limited time and budget for this project prevented us from ingesting any results from state and county-based health inspection records for each restaurant location. However, we feel that standardization of state and county reporting tools and metrics would be an invaluable tool.

### 6.4 Conclusion

Companies spend billions of dollars annually on traditional consumer-based market research and analysis. By contrast, comparatively few studies have used a time-based NLP approach to examine and potentially harness social media sentiments for corporate crisis management, mitigation, and monitoring. This is perhaps the result of an existing bias that social media has less of an impact on and is less relevant for chain businesses. Using Chipotle as a documented crisis management subject area, this study examined the results of various common strategies (rating analysis, sentiment analysis, text-based classification, lexicon/dictionary extraction, and topic modeling) to extract potential managerial insights from store reviews. The results, using a limited but publicly available sample of Yelp data, suggest that a review-centered, NLP-based analysis of thousands of written reviews has the potential to allow for better detection, monitoring, management, and mitigation of events such as food-borne illness and could be extended to other types of crises as well.

### 6.5 Recommended Next Steps and Future Studies

Future research should use the relevant subset(s) from the entire, unabridged Yelp dataset for the chain(s) being analyzed and compared, including an enriched sample from the positive class. Next, we would expand the research to include many more restaurant chains and compare results among the different businesses. We would also explore potential issues beyond food-borne illness. We have previously mentioned several ways to improve future studies, such as standardizing statistical metrics, using demographics, and linking to state and local health department records. We suggest that NLP is well suited for real-time review monitoring from a variety of social media channels.

## Acknowledgements:

We would like to acknowledge our families for their patience as we complete this program. We would also like to acknowledge the professors throughout the program including Erin Cooke, M.S. and Anna Marbut, PhD(c) for their guidance throughout the capstone project. Additionally, we would like to acknowledge the SOLES Writing Center at the University of San Diego for their thorough review. 

## References:

Chung, S., Chong, M., Chua, J. S., & Na, J. C. (2019). Evolution of corporate reputation during an evolving controversy. Journal of Communication Management, 23(1), 52–71. https://doi.org/10.1108/jcom-08-2018-0072

Estes, A. C. (2011, October 3). How Yelp helps steer people away from fast food chains. The Atlantic. https://www.theatlantic.com/business/archive/2011/10/how-yelp-helps-steer-people-away-fast-food-chains/337181/

Fang, A., & Ben-Miled, Z. (2017). Does bad news spread faster? International Conference on Computing, Networking and Communications (ICNC), 793–797. https://doi.org/10.1109/ICCNC.2017.7876232

Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216–225. https://doi.org/10.1609/icwsm.v8i1.14550

Li, H., & Hecht, B. (2020). 3 stars on Yelp, 4 stars on Google maps: A cross-platform examination of restaurant ratings. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW3), Article 254. https://doi.org/10.1145/3432953

Liu, S. (2020). Sentiment analysis of Yelp reviews: A comparison of techniques and models. ArXiv, abs/2004.13851. https://doi.org/10.48550/arXiv.2004.13851

Luca, M. (2016). Reviews, reputation, and revenue: The case of Yelp.com. [NOM unit working paper 12-016]. Harvard Business School. https://dx.doi.org/10.2139/ssrn.1928601

Opuszko, M., Berger, G., & Ruhland, J. (2019). The impact of public scandals on social media: A sentiment analysis on YouTube to detect the influence on reputation. In W. Popma & S. Francis (Eds.), 6th European Conference on Social Media (ECSM 2019; pp. 36–43). Curran Associates.

Schlosser, E. (2012). Fast food nation: The dark side of the all-American meal. Mariner Books/Houghton Mifflin Harcourt.

Shama, H., & Ramathmika, R. (2019). Sentiment Analysis of Yelp reviews by machine learning. 2019 International Conference on Intelligent Computing and Control Systems (ICCS), 700–704. https://doi.org/10.1109/iccs45141.2019.9065812 

Yelp. (2022, February). Investor presentation. https://s24.q4cdn.com/521204325/files/doc_presentations/2022/Yelp-Investor-Presentation-Feb-2022.pdf

Yelp Academic Dataset. (n.d.). https://www.yelp.com/dataset
