Title: Text Processing Techniques\
Name: Kuan-Hung Liu\
ASU ID: 1230540209\
File creation date: Jan/19/2024

# Text Processing Pipeline

## 1. Library and Data Import

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import packages
import spacy
import nltk
import spacy.cli
import pandas as pd
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

In [None]:
# Download packages
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

In [None]:
path = '/content/drive/MyDrive/CIS_509/restaurant_reviews_az.csv'

In [None]:
df = pd.read_csv(path) # import csv file

In [None]:
df.info() # summary of input data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48147 entries, 0 to 48146
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    48147 non-null  object
 1   user_id      48147 non-null  object
 2   business_id  48147 non-null  object
 3   stars        48147 non-null  int64 
 4   useful       48147 non-null  int64 
 5   funny        48147 non-null  int64 
 6   cool         48147 non-null  int64 
 7   text         48147 non-null  object
 8   date         48147 non-null  object
 9   Sentiment    48147 non-null  int64 
dtypes: int64(5), object(5)
memory usage: 3.7+ MB


## 2. Select Dataset
Select the 1 star reviews and 5 star reviews from the dataset

In [None]:
# Select the 1 star reviews and 5 star reviews from the dataset
reviews_1_star_df = df[df['stars'] == 1]
reviews_5_star_df = df[df['stars'] == 5]
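
As a quick sanity check on this selection, the sketch below applies the same boolean masks to a tiny made-up DataFrame and confirms that each subset holds only the requested rating. The rows and the `toy_` names are hypothetical and are not taken from `restaurant_reviews_az.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the review file.
toy_df = pd.DataFrame({
    "stars": [1, 5, 3, 5, 1, 4],
    "text": ["cold food", "great tacos", "okay", "loved it", "rude staff", "good"],
})

# Same boolean-mask selection as in the cell above.
toy_1_star = toy_df[toy_df["stars"] == 1]
toy_5_star = toy_df[toy_df["stars"] == 5]

# Each subset should contain a single rating value.
print(toy_1_star["stars"].unique(), len(toy_1_star))
print(toy_5_star["stars"].unique(), len(toy_5_star))
```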

In [None]:
reviews_1_star_df  # not rendered: a cell displays only its last expression
reviews_5_star_df  # rendered below

|  | review_id | user_id | business_id | stars | useful | funny | cool | text | date | Sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | QP2pSzSqpJTMWOCuUuyXkQ | JBLWSXBTKFvJYYiM-FnCOQ | 3w7NRntdQ9h0KwDsksIt5Q | 5 | 1 | 1 | 1 | Pandemic pit stop to have an ice cream.... onl... | 4/19/2020 5:33 | 1 |
| 2 | oK0cGYStgDOusZKz9B1qug | 2_9fKnXChUjC5xArfF8BLg | OMnPtRGmbY8qH_wIILfYKA | 5 | 1 | 0 | 0 | I was lucky enough to go to the soft opening a... | 2/29/2020 19:43 | 1 |
| 3 | E_ABvFCNVLbfOgRg3Pv1KQ | 9MExTQ76GSKhxSWnTS901g | V9XlikTxq0My4gE8LULsjw | 5 | 0 | 0 | 0 | I've gone to claim Jumpers all over the US and... | 3/14/2020 21:47 | 1 |
| 6 | DblKoOM1O6Bug_0b6YcpIQ | 8o2iLbpduMiPefS2Gy_28g | wJmyu7W1K9A_gE8Ed4Bc9w | 5 | 0 | 0 | 0 | In town after a long weekend of hiking and cam... | 1/20/2020 4:55 | 1 |
| 7 | vW2w4F27XNIkD2toYu0PKg | t9LqNtCGuNUqBeFKWoFOPg | u4P6hqDz6-QG9PR2Pj5KIw | 5 | 0 | 0 | 0 | This is the definition of a great family-run b... | 1/16/2020 4:58 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48138 | XtRsx73YCHC3WiDh_oZrDA | 2n1VIiuJueBRHDytgTq9Zg | wIXYreqGaO5AEVjNQulbKQ | 5 | 2 | 0 | 2 | We had heard of this place from my daughter wh... | 2/21/2021 22:03 | 1 |
| 48141 | VkXr54yJMN4Qu4dRznAN8Q | 4jEdEPDNAAa3aS7rYhQ60w | MK0OMY_u9unl8xSqjPLtMw | 5 | 3 | 0 | 0 | ALWAYS love this place! And it's always busy, ... | 1/17/2020 20:58 | 1 |
| 48143 | Zu4ng3tjf_2oa9LlnEvUmQ | ItQzeC91hJF6qvvE7-OZmQ | Yzh7Xo1_JBDWUl2BzRiYaQ | 5 | 0 | 0 | 0 | Fresh and delicious food, served fast. What mo... | 10/30/2021 20:17 | 1 |
| 48145 | dkGbETTcSQZTwHSnAMnLUw | _RmG_5kxRPgTWP7RptaFgQ | Bq0CQcwk5R8yhm-MGfHxCA | 5 | 6 | 2 | 4 | Bosnian food?? \n\n--- location. This is a HID... | 1/5/2020 4:20 | 1 |


## 3. Find Nouns
Process the review text and find the 20 most frequently used nouns in 1-star reviews and 5-star reviews, respectively.

In [None]:
# Function to get nouns from a list of reviews
def get_nouns(reviews):
    nouns = []
    stop_words = set(stopwords.words("english"))

    for review in reviews:
        words = word_tokenize(review)
        words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        tagged_words = pos_tag(words)
        nouns.extend([word for word, pos in tagged_words if pos.startswith('NN')])

    return nouns

# Get nouns for 1-star and 5-star reviews
reviews_1_star = reviews_1_star_df['text'].tolist()
reviews_5_star = reviews_5_star_df['text'].tolist()

nouns_1_star = get_nouns(reviews_1_star)
nouns_5_star = get_nouns(reviews_5_star)

# Get top 20 nouns for each category
top_nouns_1_star = Counter(nouns_1_star).most_common(20)
top_nouns_5_star = Counter(nouns_5_star).most_common(20)

print("Top 20 Nouns in 1-star reviews:", top_nouns_1_star)
print("Top 20 Nouns in 5-star reviews:", top_nouns_5_star)

Top 20 Nouns in 1-star reviews: [('food', 6575), ('order', 4744), ('time', 3407), ('service', 3043), ('place', 2975), ('minutes', 2388), ('manager', 1591), ('restaurant', 1571), ('people', 1440), ('customer', 1341), ('way', 1109), ('location', 1097), ('staff', 1044), ('experience', 1023), ('times', 925), ('customers', 871), ('money', 857), ('hour', 838), ('chicken', 822), ('pizza', 816)]
Top 20 Nouns in 5-star reviews: [('food', 15846), ('place', 9499), ('service', 7435), ('time', 5579), ('staff', 4177), ('tucson', 3812), ('order', 3607), ('restaurant', 3233), ('everything', 2944), ('menu', 2277), ('pizza', 2127), ('experience', 2101), ('chicken', 2077), ('try', 1965), ('family', 1700), ('spot', 1693), ('breakfast', 1655), ('town', 1653), ('love', 1606), ('day', 1520)]


## 4. Find Adjectives
Process the review text and find the 20 most frequently used adjectives in 1-star reviews and 5-star reviews, respectively.

In [None]:
# Function to get adjectives from a list of reviews
def get_adjectives(reviews):
    adjectives = []
    stop_words = set(stopwords.words("english"))

    for review in reviews:
        words = word_tokenize(review)
        words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        tagged_words = pos_tag(words)
        adjectives.extend([word for word, pos in tagged_words if pos.startswith('JJ')])

    return adjectives

# Get review texts for 1-star and 5-star reviews
reviews_1_star = reviews_1_star_df['text'].tolist()
reviews_5_star = reviews_5_star_df['text'].tolist()

adjectives_1_star = get_adjectives(reviews_1_star)
adjectives_5_star = get_adjectives(reviews_5_star)

# Get top 20 adjectives for each category
top_adjectives_1_star = Counter(adjectives_1_star).most_common(20)
top_adjectives_5_star = Counter(adjectives_5_star).most_common(20)

print("Top 20 Adjectives in 1-star reviews:", top_adjectives_1_star)
print("Top 20 Adjectives in 5-star reviews:", top_adjectives_5_star)

Top 20 Adjectives in 1-star reviews: [('good', 1814), ('bad', 1234), ('last', 944), ('table', 806), ('terrible', 774), ('horrible', 765), ('worst', 760), ('first', 740), ('great', 727), ('wrong', 687), ('new', 632), ('sure', 594), ('many', 556), ('cheese', 554), ('cold', 525), ('much', 523), ('small', 511), ('disappointed', 504), ('old', 498), ('little', 481)]
Top 20 Adjectives in 5-star reviews: [('great', 12434), ('good', 9454), ('delicious', 7360), ('best', 4728), ('fresh', 3246), ('nice', 2892), ('friendly', 2605), ('favorite', 2224), ('excellent', 2102), ('little', 1897), ('first', 1880), ('new', 1859), ('hot', 1784), ('happy', 1576), ('wonderful', 1485), ('many', 1426), ('perfect', 1412), ('amazing', 1408), ('sure', 1407), ('cheese', 1382)]
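
The adjective lists above include 'table' and 'cheese', which are normally nouns. One possible cause, offered here as an assumption and not a confirmed diagnosis, is that `pos_tag` receives lowercased text with stop words already removed, so the tagger loses some of the sentence context it relies on. A minimal sketch of the alternative order (tag the original tokens first, filter afterwards) follows. To keep it runnable without NLTK data, it works on already tagged `(word, tag)` pairs; the helper `filter_tagged`, the sample sentence, and its hand-written tags are hypothetical.

```python
def filter_tagged(tagged_tokens, tag_prefix, stop_words):
    """Keep lowercased alphabetic words whose POS tag starts with tag_prefix.

    tagged_tokens: (word, tag) pairs for the unmodified token sequence,
    e.g. the result of nltk.pos_tag(nltk.word_tokenize(review)).
    """
    kept = []
    for word, tag in tagged_tokens:
        lowered = word.lower()
        # Filtering happens after tagging, so the tagger saw the full sentence.
        if tag.startswith(tag_prefix) and word.isalpha() and lowered not in stop_words:
            kept.append(lowered)
    return kept


# Hand-written Penn Treebank tags for "The cheese on our table was cold."
sample = [("The", "DT"), ("cheese", "NN"), ("on", "IN"), ("our", "PRP$"),
          ("table", "NN"), ("was", "VBD"), ("cold", "JJ"), (".", ".")]
sample_stop_words = {"the", "on", "our", "was"}

print(filter_tagged(sample, "JJ", sample_stop_words))  # ['cold']
print(filter_tagged(sample, "NN", sample_stop_words))  # ['cheese', 'table']
```

Whether this order changes the top-20 lists for this dataset would need to be checked by rerunning the cells.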


## 5. Find Verbs
Process the review text and find the 20 most frequently used verbs in 1-star reviews and 5-star reviews, respectively.

In [None]:
# Function to get verbs from a list of reviews
def get_verbs(reviews):
    verbs = []
    stop_words = set(stopwords.words("english"))

    for review in reviews:
        words = word_tokenize(review)
        words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        tagged_words = pos_tag(words)
        verbs.extend([word for word, pos in tagged_words if pos.startswith('VB')])

    return verbs

# Get review texts for 1-star and 5-star reviews
reviews_1_star = reviews_1_star_df['text'].tolist()
reviews_5_star = reviews_5_star_df['text'].tolist()

verbs_1_star = get_verbs(reviews_1_star)
verbs_5_star = get_verbs(reviews_5_star)

# Get top 20 verbs for each category
top_verbs_1_star = Counter(verbs_1_star).most_common(20)
top_verbs_5_star = Counter(verbs_5_star).most_common(20)

print("Top 20 Verbs in 1-star reviews:", top_verbs_1_star)
print("Top 20 Verbs in 5-star reviews:", top_verbs_5_star)

Top 20 Verbs in 1-star reviews: [('ordered', 2529), ('go', 2483), ('get', 2425), ('said', 2391), ('got', 2359), ('asked', 1938), ('went', 1700), ('came', 1670), ('told', 1647), ('going', 1320), ('took', 1117), ('know', 1012), ('called', 976), ('come', 955), ('waiting', 945), ('made', 935), ('take', 921), ('waited', 905), ('make', 891), ('say', 837)]
Top 20 Verbs in 5-star reviews: [('go', 4400), ('amazing', 4040), ('got', 3648), ('ordered', 3485), ('get', 3065), ('made', 2622), ('love', 2597), ('recommend', 2360), ('came', 2202), ('come', 1986), ('went', 1805), ('loved', 1625), ('make', 1602), ('try', 1592), ('take', 1516), ('going', 1444), ('tried', 1434), ('coming', 1295), ('say', 1202), ('wait', 1136)]
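
The functions `get_nouns`, `get_adjectives`, and `get_verbs` differ only in the tag prefix they test ('NN', 'JJ', 'VB'). One way to remove the duplication, sketched below, is a single helper that takes the prefix as a parameter. The helper name `top_words_by_pos` and the toy tokenizer and tagger are hypothetical stand-ins so the sketch runs without NLTK data; in the notebook, `word_tokenize`, `pos_tag`, and `set(stopwords.words("english"))` would be passed instead.

```python
from collections import Counter


def top_words_by_pos(reviews, tag_prefix, tokenize, tag, stop_words, n=20):
    """Return the n most common words whose POS tag starts with tag_prefix.

    Mirrors the notebook's order: clean the tokens first, then tag them.
    """
    counts = Counter()
    for review in reviews:
        words = [w.lower() for w in tokenize(review)
                 if w.isalpha() and w.lower() not in stop_words]
        counts.update(w for w, pos in tag(words) if pos.startswith(tag_prefix))
    return counts.most_common(n)


# Toy stand-ins (hypothetical) for word_tokenize and pos_tag.
TOY_TAGS = {"food": "NN", "pizza": "NN", "ordered": "VBD", "cold": "JJ", "great": "JJ"}


def toy_tokenize(text):
    return text.split()


def toy_tag(words):
    return [(w, TOY_TAGS.get(w, "NN")) for w in words]


toy_reviews = ["I ordered pizza", "The pizza was cold", "great food great pizza"]
toy_stop_words = {"i", "the", "was"}

print(top_words_by_pos(toy_reviews, "NN", toy_tokenize, toy_tag, toy_stop_words, n=2))
print(top_words_by_pos(toy_reviews, "JJ", toy_tokenize, toy_tag, toy_stop_words))
print(top_words_by_pos(toy_reviews, "VB", toy_tokenize, toy_tag, toy_stop_words))
```

With the real NLTK functions passed in, a call such as `top_words_by_pos(reviews_1_star, "VB", word_tokenize, pos_tag, set(stopwords.words("english")))` should reproduce the verb ranking above.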


## 6. Find Named Entities
Process the review text and find the 20 most frequently used named entities in 1-star reviews and 5-star reviews, respectively.

In [None]:
# Function to get named entities from a list of reviews
def get_named_entities(reviews):
    named_entities = []

    for review in reviews:
        doc = nlp(review)
        named_entities.extend([ent.text for ent in doc.ents])

    return named_entities

reviews_1_star = reviews_1_star_df['text'].tolist()
reviews_5_star = reviews_5_star_df['text'].tolist()

# Get named entities from selected reviews
named_entities_list_1_star = get_named_entities(reviews_1_star)
named_entities_list_5_star = get_named_entities(reviews_5_star)

# Get the top 20 named entities
top_named_entities_1_star = Counter(named_entities_list_1_star).most_common(20)
top_named_entities_5_star = Counter(named_entities_list_5_star).most_common(20)

print("Top 20 Named Entities in 1-star reviews:", top_named_entities_1_star)
print("Top 20 Named Entities in 5-star reviews:", top_named_entities_5_star)

Top 20 Named Entities in 1-star reviews: [('one', 1084), ('first', 935), ('two', 857), ('2', 765), ('3', 568), ('Tucson', 562), ('today', 465), ('1', 366), ('4', 360), ('second', 287), ('three', 284), ('half', 281), ('First', 273), ('5', 268), ('Mexican', 260), ('tonight', 258), ('zero', 211), ('an hour', 211), ('One', 205), ('10', 184)]
Top 20 Named Entities in 5-star reviews: [('Tucson', 4843), ('first', 1932), ('one', 1284), ('two', 1243), ('Mexican', 1169), ('2', 742), ('5', 736), ('today', 677), ('One', 644), ('First', 571), ('3', 515), ('Italian', 514), ('Chinese', 432), ('Thai', 402), ('Indian', 376), ('4', 356), ('three', 340), ('half', 326), ('French', 321), ('second', 317)]
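
Most of the top entities above are numbers and ordinals ('one', 'first', '2'), and case variants such as 'first' and 'First' are counted separately. One way to surface more informative entities, sketched below, is to keep the entity label alongside the text, drop numeric and temporal labels, and lowercase the text before counting. The sketch works on `(text, label)` pairs, which in the notebook could be collected as `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `top_entities`, the choice of excluded labels, and the sample pairs are assumptions for illustration.

```python
from collections import Counter

# spaCy entity labels treated as uninformative here (an assumption; adjust as needed).
EXCLUDED_LABELS = {"CARDINAL", "ORDINAL", "DATE", "TIME", "QUANTITY", "PERCENT", "MONEY"}


def top_entities(labeled_entities, n=20, excluded=EXCLUDED_LABELS):
    """Count case-folded entity texts, skipping entities with excluded labels."""
    counts = Counter(
        text.lower() for text, label in labeled_entities if label not in excluded
    )
    return counts.most_common(n)


# Hypothetical (text, label) pairs in the shape spaCy would produce.
sample_entities = [
    ("Tucson", "GPE"), ("first", "ORDINAL"), ("First", "ORDINAL"),
    ("two", "CARDINAL"), ("Mexican", "NORP"), ("tucson", "GPE"),
    ("today", "DATE"), ("an hour", "TIME"),
]

print(top_entities(sample_entities))  # [('tucson', 2), ('mexican', 1)]
```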


## 7. Observation
What is the key to a good restaurant experience?


1. Words that appear frequently in 5-star reviews:
* Nouns: service, staff, experience, love, family
* Adjectives: delicious, fresh, friendly
* Verbs: love, recommend

\
2. Words that appear frequently in 1-star reviews:
* Nouns: food, order, time, service, minutes, manager, staff, experience, hour
* Adjectives: disappointed, old, little
* Verbs: waiting, waited

\
3. Conclusion:
From these word frequencies, I found that reviewers usually care about:
* The quality of the restaurant's food.
* The level of service provided by staff and management.
* The duration of waiting times.

\
These aspects significantly impact customers' experiences and shape their overall perceptions of the restaurants. Consequently, restaurants should prioritize these key elements to enhance customer satisfaction.
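
The raw counts for the two groups are not directly comparable, because the 5-star lists carry much larger counts overall. A rough way to compare how prominent a word is within each group is to divide its count by the group's total. The sketch below does this with the top-20 noun counts printed in Section 3. Using only those 20 counts as the denominator is a simplifying assumption (the full noun count per group would be the better denominator), and the names `top20_nouns_1_star`, `top20_nouns_5_star`, and `share` are hypothetical.

```python
# Top-20 noun counts copied from the Section 3 output.
top20_nouns_1_star = {
    'food': 6575, 'order': 4744, 'time': 3407, 'service': 3043, 'place': 2975,
    'minutes': 2388, 'manager': 1591, 'restaurant': 1571, 'people': 1440,
    'customer': 1341, 'way': 1109, 'location': 1097, 'staff': 1044,
    'experience': 1023, 'times': 925, 'customers': 871, 'money': 857,
    'hour': 838, 'chicken': 822, 'pizza': 816,
}
top20_nouns_5_star = {
    'food': 15846, 'place': 9499, 'service': 7435, 'time': 5579, 'staff': 4177,
    'tucson': 3812, 'order': 3607, 'restaurant': 3233, 'everything': 2944,
    'menu': 2277, 'pizza': 2127, 'experience': 2101, 'chicken': 2077,
    'try': 1965, 'family': 1700, 'spot': 1693, 'breakfast': 1655,
    'town': 1653, 'love': 1606, 'day': 1520,
}


def share(counts, word):
    """Fraction of the listed counts taken up by word (0.0 if the word is absent)."""
    total = sum(counts.values())
    return counts.get(word, 0) / total


# Compare a few words tied to service and waiting time across the two groups.
for word in ("service", "staff", "order", "minutes", "manager"):
    print(f"{word:10s} 1-star: {share(top20_nouns_1_star, word):.3f}  "
          f"5-star: {share(top20_nouns_5_star, word):.3f}")
```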




## 8. Acknowledgment of GenAI Tool Usage

GenAI helped me generate most of the code. The generated code typically uses functions to process the data and extract different types of words.