# Introduction to NLP with Python spaCy: analyzing restaurant reviews


_This work is licensed under a [Creative Commons BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/)_

<br><br><br><br>
__Daniel Kapitan__<br>
`e. d.kapitan@jads.nl`<br>
`w. https://kapitan.net`<br>



<img style="float: left" src="https://github.com/jads-nl/public-lectures/blob/main/logos/jads-gold-250x60.png?raw=true">

## The challenge: predict the next Michelin star

Thanks to [the people at analyticslab.nl](https://www.theanalyticslab.nl/about-us/) we will use a restaurant review dataset with nearly 370,000 reviews collected over an eight-year period. Using the dataset they scraped, we will follow along with [their blog post series](https://www.theanalyticslab.nl/nlpblogs_0_preparing_restaurant_review_data_for_nlp_and_predictive_modeling/), replacing their R code with a workflow in Python spaCy.

In this notebook we compare different NLP techniques to show how to extract valuable information from unstructured text. Given the restaurant reviews, the challenge is whether 'the wisdom of the crowd' (reviews from restaurant visitors) can be used to predict which restaurants are most likely to receive a new Michelin star. We will see how that worked out. The following tools and techniques will be demonstrated:

- How to set up a reproducible text pipeline in Python spaCy for text analysis;
- How to apply [topic modeling](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) as the primary tool to extract information from the review texts, to be combined with predictive modeling techniques to arrive at our predictions;
- How two more novel NLP techniques, word embeddings and BERT, compare to this topic model baseline.

To answer these questions, we will explain our approach in more detail in the coming articles. We did not stop exploring NLP techniques after our publication, and we would also like to share insights from adding more novel NLP techniques. More specifically, we will use two types of word embeddings, a classic [Word2Vec model](https://arxiv.org/abs/1301.3781) and a GloVe embedding model, apply transfer learning with pretrained word embeddings, and use BERT. We compare the added value of these advanced NLP techniques to our baseline topic model on the same dataset. By showing what we did and how we did it, we hope to guide others who are keen to use textual data for their own data science endeavours.


## Data preparation
Before we delve into the analytical side of things, we need some prepared textual data. As all true data scientists know, proper data preparation takes most of your time and is most decisive for the quality of the analysis results you end up with. Since preparing textual data is another cup of tea compared to preparing structured numeric or categorical data, and our goal is to show you how to do text analytics, we also want to show you how we cleaned and prepared the data we gathered. Therefore, in this notebook we start with the data dump with all reviews and explore and prepare this data in a number of steps:

![](https://bhciaaablob.blob.core.windows.net/thefork/Text%20preprocessing%20pipeline_noheader.png)

As a result of these steps, we end up with - aside from building insights in our data and some cleaning - a number of flat files we can use as source files throughout the rest of the articles:

- __reviews.csv__: a csv file with review texts (original and cleaned) - the fuel for our NLP analyses. (included key: restoreviewid, hence the unique identifier for a review)
- __labels.csv__: a csv file with 1 / 0 values, indicating whether the review is a review for a Michelin restaurant or not (included key: restoreviewid)
- __restoid.csv__: a csv file with restaurant IDs, to be able to determine which reviews belong to which restaurant (included key: restoreviewid)
- __trainids.csv__: a csv file with 1 / 0 values, indicating whether the review should be used for training or testing - we already split the reviews in train/test to enable reuse of the same samples for fair comparisons between techniques (included key: restoreviewid)
- __features.csv__: a csv file with other features regarding the reviews (included key: restoreviewid)

These files with the cleaned and relevant data for NLP techniques are made available to you via public blob storage so that you can run all code we present yourself and see how things work in more detail.
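
As a rough sketch of how these flat files fit together, the snippet below joins small in-memory stand-ins on the shared `restoreviewid` key. Only the key name comes from the description above; the other column names (`reviewTextClean`, `ind_michelin`, `train`) and all values are assumptions made up for illustration, not the actual file schema.

```python
import pandas as pd

# Toy stand-ins for reviews.csv, labels.csv and trainids.csv.
# Every column name except `restoreviewid` is hypothetical.
reviews_df = pd.DataFrame(
    {
        "restoreviewid": ["a_1", "a_2", "b_1"],
        "reviewTextClean": ["heerlijk gegeten", "prima service", "top restaurant"],
    }
)
labels_df = pd.DataFrame(
    {"restoreviewid": ["a_1", "a_2", "b_1"], "ind_michelin": [0, 0, 1]}
)
trainids_df = pd.DataFrame(
    {"restoreviewid": ["a_1", "a_2", "b_1"], "train": [1, 0, 1]}
)

# Join on the shared key; validate= raises if a key is unexpectedly duplicated.
combined = reviews_df.merge(
    labels_df, on="restoreviewid", how="inner", validate="one_to_one"
).merge(trainids_df, on="restoreviewid", how="inner", validate="one_to_one")

# Reuse the predefined split so every technique sees the same samples.
train_set = combined[combined["train"] == 1]
test_set = combined[combined["train"] == 0]
```

Keeping the split in a separate file and joining it in, rather than re-sampling, is what makes later comparisons between techniques fair.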

In [1]:
import re

import pandas as pd
import pendulum


# # not needed for this notebook, required for uploading data to GitHub in smaller files < 25 MB
# REVIEWS = (
#     "https://bhciaaablob.blob.core.windows.net/cmotionsnlpblogs/RestoReviewRawdata.csv"
# )
# resto = pd.read_csv(REVIEWS, decimal=",")
# resto['reviewYear'] = resto.reviewDate.str[-4:].astype('float').astype('Int64')
# resto.to_parquet('data/restaurant-reviews', partition_cols=['reviewYear'])

raw_reviews = pd.read_parquet("data/restaurant-reviews")
raw_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 368529 entries, 0 to 368528
Data columns (total 25 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   restoId              368529 non-null  int64   
 1   restoName            368529 non-null  object  
 2   tags                 356517 non-null  object  
 3   address              368529 non-null  object  
 4   scoreTotal           348081 non-null  float64 
 5   avgPrice             336224 non-null  object  
 6   numReviews           368529 non-null  int64   
 7   scoreFood            347984 non-null  float64 
 8   scoreService         347984 non-null  float64 
 9   scoreDecor           348037 non-null  float64 
 10  review_id            368529 non-null  float64 
 11  numreviews2          368529 non-null  float64 
 12  valueForPriceScore   296774 non-null  object  
 13  noiseLevelScore      296902 non-null  object  
 14  waitingTimeScore     296902 non-null  object  
 15  

Let's look at some `reviewText`.

In [2]:
raw_reviews.reviewText.head()

0    b'We komen al meer dan 8 jaar in dit restauran...
1    b'Een werkelijk prachtige ijssalon,blinkende u...
2    b'Naast dat men hier heerlijk grieks eten heef...
3    b'Via de Sweetdeal genoten van het 3 gangenkeu...
4    b'Vakantieveiling is een leuk ding om restaura...
Name: reviewText, dtype: object

Ok, there's clearly some cleaning to be done here.

First of all, the available texts are all wrapped in `b'...'`, indicating that the texts are byte literals. You might also spot some strange character sequences, as in 'ingredi\xc3\xabnten', indicating that our texts include UTF-8 encoded characters (here, the character ë, which is encoded as \xc3\xab in UTF-8). This combination of byte-literal wrapping and UTF-8 escape codes shows that the encoding got messed up a bit when the source data was created, making it difficult to obtain the review texts from the encoded data. We won't go into too much detail here (if you want, read [this](https://diveintopython3.net/strings.html)), but you might run into similar issues when you start working with textual data. In short, there are different encoding types and you need to know which one you are working with. We need to make sure we use the right encoding, and we should get rid of the `b'...'` wrapper in the strings.

We could spend some time figuring out how to undo this encoding mix-up as well as possible. However, in order not to lose too much time and effort on this, we can take a shortcut with minimal loss of data by cleaning the texts with some regular expressions. Depending on your goal, you might want to go the extra mile and try to restore the texts to their original UTF-8 encoding, though! As so often in data science projects, we're struggling with available time and resources: you need to pick your battles, and pick them wisely!
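
For readers who do want to go the extra mile, here is a minimal sketch of one possible way to restore the original characters. It is not the approach used in this notebook. It assumes each value is the Python `repr` of a `bytes` object (single- or double-quoted); the helper name `restore_utf8` is hypothetical.

```python
import ast


def restore_utf8(text):
    """Parse a stringified byte literal and decode it as UTF-8."""
    try:
        # literal_eval safely evaluates the literal without running code
        raw = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return ""
    if not isinstance(raw, bytes):
        return ""
    # invalid byte sequences are replaced instead of raising an error
    return raw.decode("utf-8", errors="replace")


example = "b'ingredi\\xc3\\xabnten'"
print(restore_utf8(example))
```

Because the literal is parsed rather than pattern-matched, this sketch also handles texts that were written out as `b"..."` because they contain an apostrophe.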

Do we have other things to cover? To get a better understanding of our data, let's check the most frequent, identical review texts:

In [3]:
raw_reviews.reviewText.value_counts(normalize=True).head()

b'- Recensie is momenteel in behandeling -'    0.003937
b'Heerlijk gegeten!'                           0.001037
b'Heerlijk gegeten'                            0.000795
b'Heerlijk gegeten.'                           0.000448
b'Top'                                         0.000293
Name: reviewText, dtype: float64

Ok, several things to solve here:

- About 3% of all reviews have no review text so they are not useful and we can delete those.
- Another 0.4% has the value `b'- Recensie is momenteel in behandeling -'` (in English: the review is currently being processed), so the actual review text has not been published yet. As with empty reviews, we can delete these reviews.
- Several reviews are very short and not that helpful when trying to learn from the review text. Although this is very context dependent (in sentiment analysis, short reviews like 'Top!' (English: Great!), 'Prima' (English: Fine/OK) and 'Heerlijk gegeten' (English: Had a nice meal) might still carry much value), we will set a minimum length for reviews.

We will deal with punctuation later in spaCy.

### Pattern matching with regex

In [4]:
import re


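def fix_bytestring(string):
    """Strip the b'...' wrapper from a stringified byte literal"""

    # Only single-quoted literals match; anything else (for example a
    # double-quoted b"..." literal) falls through and becomes an empty string.
    pattern = re.compile(r"^b'(.*)'")
    match = re.search(pattern, string)
    if match:
        # Escape sequences such as \xc3\xab are left as-is; an
        # encode/decode round trip through UTF-8 would not change them.
        return match[1]
    else:
        return ""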

In [5]:
reviews = raw_reviews.loc[:, ['restoId', 'reviewerId', 'review_id', 'reviewerFame', 'reviewerNumReviews']].copy()
reviews['reviewText'] = raw_reviews.reviewText.apply(fix_bytestring)
reviews.reviewText.head()

0    We komen al meer dan 8 jaar in dit restaurant ...
1    Een werkelijk prachtige ijssalon,blinkende uit...
2    Naast dat men hier heerlijk grieks eten heeft,...
3    Via de Sweetdeal genoten van het 3 gangenkeuze...
4    Vakantieveiling is een leuk ding om restaurant...
Name: reviewText, dtype: object

In [6]:
def validate_review(review):
    if review == '- Recensie is momenteel in behandeling -' or len(review) < 4:
        return 0
    else:
        return 1
    
reviews['is_valid'] = reviews.reviewText.apply(validate_review)
reviews[reviews.is_valid==0]['reviewText'].value_counts(normalize=True).head(10)

                                            0.956331
- Recensie is momenteel in behandeling -    0.035183
Top                                         0.002619
.                                           0.000970
Nvt                                         0.000679
-                                           0.000485
..                                          0.000291
Kip                                         0.000218
Ok                                          0.000218
nvt                                         0.000218
Name: reviewText, dtype: float64

So that looks OK, we can safely delete `is_valid == 0` reviews later. Let's do some more data prep.
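
As a small illustration of that later clean-up step, the sketch below applies the same two rules (placeholder text, fewer than 4 characters) in vectorized form and then drops the invalid rows. The toy DataFrame is made up for the example; only the rules and the `is_valid` column name come from the cell above.

```python
import pandas as pd

PLACEHOLDER = "- Recensie is momenteel in behandeling -"

# Toy data, not taken from the real dataset
toy = pd.DataFrame(
    {"reviewText": ["Heerlijk gegeten!", "", "Top", PLACEHOLDER]}
)

# Same rules as validate_review, expressed as boolean masks
too_short = toy["reviewText"].str.len() < 4
is_placeholder = toy["reviewText"] == PLACEHOLDER
toy["is_valid"] = (~(too_short | is_placeholder)).astype(int)

# Keep only the valid reviews and renumber the index
valid_reviews = toy[toy["is_valid"] == 1].reset_index(drop=True)
```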

### Parse localized datestrings with `pendulum`

In [7]:
import pendulum


pendulum.set_locale('nl')
pendulum.date(1973, 9, 9).format('D MMM YYYY')  # example

'9 sep. 1973'

In [8]:
def parse_date(date):
    return pendulum.from_format(date, fmt='D MMM YYYY', locale='nl')

reviews['reviewDate'] = raw_reviews.reviewDate.apply(parse_date).dt.date

In [9]:
reviews.reviewDate.head()

0    2012-09-19
1    2012-07-12
2    2012-11-29
3    2012-12-13
4    2012-10-19
Name: reviewDate, dtype: object
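
If `pendulum` is not available, a similar result can be sketched with the standard library alone. This is a swapped-in alternative, not what the notebook uses. It assumes the dates look like the `'9 sep. 1973'` example above, with abbreviated Dutch month names; `DUTCH_MONTHS` and `parse_dutch_date` are hypothetical helpers.

```python
from datetime import date

# Abbreviated Dutch month names, without the trailing period
DUTCH_MONTHS = {
    "jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12,
}


def parse_dutch_date(text):
    """Parse a 'D MMM YYYY' string with a Dutch month abbreviation."""
    day, month, year = text.split()
    # drop the optional trailing period and normalize case
    month_key = month.rstrip(".").lower()
    return date(int(year), DUTCH_MONTHS[month_key], int(day))


print(parse_dutch_date("9 sep. 1973"))
```

An unknown month name raises a `KeyError`, which is arguably preferable to silently producing a wrong date.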

### Format numerical columns

In [10]:
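# avgPrice has whitespace and euro character; keep only the last token
def clean_price(string):
    # missing values (None or NaN) are passed through as None
    if isinstance(string, str) and string:
        return string.split(" ")[-1]
    else:
        return None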


reviews["avgPrice"] = raw_reviews["avgPrice"].apply(clean_price)

In [11]:
# turn categorical columns into ordinal values, lower is better
# note to Dutch audience: do you think the ordinal order is sensible and correct?
map_scores = {
    "waitingTimeScore": {
        "Hoog tempo": 1,
        "Kort": 2,
        "Redelijk": 3,
        "Kan beter": 4,
        "Lang": 5,
    },
    "valueForPriceScore": {
        "Erg gunstig": 1,
        "Gunstig": 2,
        "Redelijk": 3,
        "Precies goed": 4,
        "Kan beter": 5,
    },
    "noiseLevelScore": {
        "Erg rustig": 1,
        "Rustig": 2,
        "Precies goed": 3,
        "Rumoerig": 4,
    },
}

for col, mapping in map_scores.items():
    # labels missing from the mapping (and missing values) become <NA>
    reviews[col] = raw_reviews[col].map(mapping).astype("Int64")
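
Because labels that are missing from a mapping silently turn into missing values, it can be worth checking for them first. The sketch below does this on a toy Series; the label `"Erg rumoerig"` is invented for the example and is not known to occur in the data.

```python
import pandas as pd

mapping = {"Erg rustig": 1, "Rustig": 2, "Precies goed": 3, "Rumoerig": 4}

# Toy column with one missing value and one hypothetical unmapped label
raw = pd.Series(["Rustig", "Rumoerig", None, "Erg rumoerig"])

# Labels present in the data but absent from the mapping
unmapped = sorted(set(raw.dropna().unique()) - set(mapping))
print(unmapped)

# Unmapped labels and missing values both end up as <NA>
encoded = raw.map(mapping).astype("Int64")
```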

In [12]:
# numerical columns have comma as decimal separator --> cast to floats
numerical_cols = [
    "scoreFood",
    "scoreService",
    "scoreDecor",
    "reviewScoreOverall",
    "scoreTotal",
]
for col in numerical_cols:
    reviews[col] = pd.to_numeric(raw_reviews[col])
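
If a column still arrives as text with decimal commas (for instance when the CSV is read without `decimal=","`), the conversion could look like the sketch below. The toy values are made up for illustration.

```python
import pandas as pd

# Toy scores as they might appear in a Dutch-formatted CSV
raw_scores = pd.Series(["8,5", "10", None, "7,25"])

# Swap the decimal comma for a point, then cast; unparseable values become NaN
scores = pd.to_numeric(
    raw_scores.str.replace(",", ".", regex=False), errors="coerce"
)
```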

In [16]:
reviews.head()

| | restoId | reviewerId | review_id | reviewerFame | reviewerNumReviews | reviewText | is_valid | reviewDate | avgPrice | waitingTimeScore | valueForPriceScore | noiseLevelScore | scoreFood | scoreService | scoreDecor | reviewScoreOverall | scoreTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 236127 | 111373143.0 | 20.0 | Fijnproever | 4.0 | We komen al meer dan 8 jaar in dit restaurant ... | 1 | 2012-09-19 | 35.0 | | | | 8.6 | 8.4 | 7.2 | 8.5 | 8.4 |
| 1 | 246631 | 111355027.0 | 11.0 | Meesterproever | 21.0 | Een werkelijk prachtige ijssalon,blinkende uit... | 1 | 2012-07-12 | | | | | 8.2 | 7.6 | 8.0 | 10.0 | 8.0 |
| 2 | 243427 | 112961389.0 | 3.0 | Expertproever | 9.0 | Naast dat men hier heerlijk grieks eten heeft,... | 1 | 2012-11-29 | | | | | | | | 8.0 | |
| 3 | 234077 | 111347867.0 | 107.0 | Meesterproever | 97.0 | Via de Sweetdeal genoten van het 3 gangenkeuze... | 1 | 2012-12-13 | 45.0 | | | | 8.0 | 8.0 | 7.6 | 7.0 | 7.9 |
| 4 | 240845 | 112167929.0 | 14.0 | Meesterproever | 40.0 | Vakantieveiling is een leuk ding om restaurant... | 1 | 2012-10-19 | 43.0 | | | | 7.3 | 7.6 | 7.4 | 8.5 | 7.4 |


### Exercise: perform exploratory data analysis

Prior to diving into NLP with spaCy, perform an EDA to explore possible correlations:
- reviewer type vs. given scores
- length of reviews vs. scores
- value-for-money vs

Learning objective:
- Lest you forget to always do a short EDA, before getting lost in details ...
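
As a possible starting point for the exercise, the sketch below shows the mechanics on a made-up toy DataFrame. The column names match the `reviews` DataFrame above, but the values are invented, so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy data with the same column names as `reviews`; values are invented
toy = pd.DataFrame(
    {
        "reviewerFame": ["Fijnproever", "Meesterproever", "Meesterproever", "Expertproever"],
        "reviewText": [
            "Heerlijk gegeten!",
            "Prima, maar het duurde lang.",
            "Top avond gehad met vrienden.",
            "Matig.",
        ],
        "reviewScoreOverall": [8.5, 7.0, 9.0, 5.5],
    }
)

# Reviewer type vs. given scores: mean score per reviewer type
mean_score_by_fame = toy.groupby("reviewerFame")["reviewScoreOverall"].mean()

# Length of reviews vs. scores: Pearson correlation with the character count
toy["reviewLength"] = toy["reviewText"].str.len()
length_score_corr = toy["reviewLength"].corr(toy["reviewScoreOverall"])
```

On the real data, swap `toy` for `reviews` (filtered on `is_valid == 1`) and consider plotting the distributions rather than relying on a single correlation coefficient.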

## Tokenizing in spaCy

To develop reproducible pipelines, we will follow the recommended workflow from spaCy.

![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps – this is also referred to as the __processing pipeline__. The pipeline used by the [trained pipelines](https://spacy.io/models) typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in `nlp.pipe_names`. The reason is that there can only really be one tokenizer, and while all other pipeline components take a `Doc` and return it, the tokenizer takes a __string of text__ and turns it into a `Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is writable, so you can either create your own [`Tokenizer` class from scratch](https://spacy.io/usage/linguistic-features#native-tokenizers), or even replace it with an [entirely custom function](https://spacy.io/usage/linguistic-features#custom-tokenizer).
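
To see the tokenizer's special status in isolation, the following minimal sketch uses a blank Dutch pipeline, which needs no model download. The variable name `blank_nl` and the example sentence are chosen only for this illustration.

```python
import spacy

# A blank Dutch pipeline: language-specific tokenizer, no trained components
blank_nl = spacy.blank("nl")

# The tokenizer turns a plain string into a Doc
doc = blank_nl("Heerlijk gegeten!")
tokens = [token.text for token in doc]
print(tokens)

# The tokenizer is not listed among the pipeline components
print(blank_nl.pipe_names)

# It is available (and replaceable) as an attribute instead
print(type(blank_nl.tokenizer))
```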

We will use the large Dutch model which is 546 MB in size. The download command needs to be run once on your system. You may want to restart your Jupyter Notebook kernel to ensure spaCy is loaded properly with the newly downloaded model.

In [14]:
!python -m spacy download nl_core_news_lg

✔ Download and installation successful
You can now load the model via spacy.load('nl_core_news_lg')


In [15]:
import spacy


nlp = spacy.load("nl_core_news_lg")