<i>## Comments will be provided using this format. Key takeaway: groups are encouraged to change the formatting, but not the structure. Groups may also create additional notebooks (for instance, one notebook for data exploration and one notebook for each preprocessing-modelling-evaluation pipeline), but must strive to keep a unified style across notebooks.</i>

#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Notebook Title`

#### Group:
- `Group member #1`
- `(...)`
- `Group member #5`

#### <font color='#BFD72'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [1. Data Understanding](#P1)
- [2. General Data Preparation](#P2) 
- [3. Multilabel Classification (Information Requirement 3311)](#P3)
    - [3.1 Specific Data Preparation](#P31)
    - [3.2 Model Implementation](#P32)
    - [3.3 Model Evaluation](#P33)
- [4. Sentiment Analysis (Information Requirement 3312)](#P4)
    - [4.1 Specific Data Preparation](#P41)
    - [4.2 Model Implementation](#P42)
    - [4.3 Model Evaluation](#P43)
- [...]
- [N. Additional Tasks (Information Requirements 332n)](#Pn)
    - [N.1 Specific Data Preparation](#Pn1)
    - [N.2 Model Implementation](#Pn2)
    - [N.3 Model Evaluation](#Pn3)

<i>## Note that the notebook structure differs from the report: instead of following the CRISP-DM phases and then specifying the different problems inside the phases, the notebook is structured by problem, with the CRISP-DM phases being defined for each specific problem.</i>

In [55]:
## All imports must be concentrated in a cell that immediately follows the table of contents
import time
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import emoji

In [69]:
restaurants_raw = pd.read_csv(r"data_hyderabad/105_restaurants.csv")
reviews_raw = pd.read_csv(r"data_hyderabad/10k_reviews.csv")

restaurants_raw.head(5)

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Paradise,https://www.zomato.com/hyderabad/paradise-gach...,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM
2,Flechazo,https://www.zomato.com/hyderabad/flechazo-gach...,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM"
3,Shah Ghouse Hotel & Restaurant,https://www.zomato.com/hyderabad/shah-ghouse-h...,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Bever...",12 Noon to 2 AM
4,Over The Moon Brew Company,https://www.zomato.com/hyderabad/over-the-moon...,1200,"Best Bars & Pubs, Food Hygiene Rated Restauran...","Asian, Continental, North Indian, Chinese, Med...","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no..."


In [70]:
restaurants_raw.describe()

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
count,105,105,105,51,105,104
unique,105,105,29,42,92,77
top,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,500,Food Hygiene Rated Restaurants in Hyderabad,"North Indian, Chinese",11 AM to 11 PM
freq,1,1,13,4,4,6


In [71]:
restaurants_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         105 non-null    object
 1   Links        105 non-null    object
 2   Cost         105 non-null    object
 3   Collections  51 non-null     object
 4   Cuisines     105 non-null    object
 5   Timings      104 non-null    object
dtypes: object(6)
memory usage: 5.0+ KB


In [72]:
reviews_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Restaurant  10000 non-null  object
 1   Reviewer    9962 non-null   object
 2   Review      9955 non-null   object
 3   Rating      9962 non-null   object
 4   Metadata    9962 non-null   object
 5   Time        9962 non-null   object
 6   Pictures    10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


In [73]:
reviews_raw.sample(10)

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
8081,Tandoori Food Works,Sai Charan,never eat,1,"7 Reviews , 1 Follower",8/1/2018 11:52,0
1545,KFC,Richa,burger quality has decreased,2,1 Review,9/18/2018 19:43,0
9307,Zing's Northeast Kitchen,Soumith,This place has been a bookmark for me since ve...,4,"25 Reviews , 19 Followers",5/5/2019 12:17,1
3936,Deli 9 Bistro,Nishant Srivastava,Good ambience and service. If you are looking ...,5,"3 Reviews , 1 Follower",1/25/2019 0:32,0
6642,Aromas@11SIX,Harshitha.guguloth9,I ordered the aroma’s special chicken biryani ...,1,2 Reviews,10/10/2018 13:27,0
6429,Hyderabad Chefs,Hemanth Sai,packing was not at all good,3,"1 Review , 4 Followers",8/6/2018 17:10,0
765,Shah Ghouse Spl Shawarma,Ashok Prabhakaran,A very popular shop nearby the hotel I was sta...,4,"119 Reviews , 469 Followers",12/4/2018 10:52,0
4065,Frio Bistro,Nom.Nom.Foodie (Ruthwikkumar Durgam),I was here with my pals for a lunch. We have t...,4,"109 Reviews , 2768 Followers",11/30/2018 23:47,0
8530,Momos Delight,Satyajit Mahapatra,5 Tiny Veg Momos is what you get for Rs.75. Hy...,1,"1 Review , 1 Follower",11/21/2018 22:46,1
7900,Olive Garden,Smrati Saxena,I ordered noodles and manchurian combo from th...,3,"18 Reviews , 41 Followers",5/23/2019 21:02,0


In [74]:
reviews_raw.dtypes

Restaurant    object
Reviewer      object
Review        object
Rating        object
Metadata      object
Time          object
Pictures       int64
dtype: object
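
`Rating` and `Time` are both stored as `object`. For `Rating` this suggests at least one non-numeric entry, since a purely numeric column with missing values would have been read as `float64`. The sketch below is a minimal, standard-library illustration of how both columns could be parsed defensively. The helper names `parse_rating` and `parse_time` are hypothetical, and the month/day/year format is an assumption inferred from the sampled timestamps. With pandas, `pd.to_numeric(..., errors="coerce")` and `pd.to_datetime(..., format=...)` would be the analogous calls.

```python
from datetime import datetime

def parse_rating(value):
    """Return the rating as a float, or None when it is missing or not numeric."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        return None
    # float("nan") parses successfully, so treat NaN as missing as well
    return None if rating != rating else rating

def parse_time(value, fmt="%m/%d/%Y %H:%M"):
    """Parse timestamps such as '5/25/2019 15:54'; return None on failure."""
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return None

print(parse_rating("4.5"))
print(parse_time("5/25/2019 15:54"))
```

Applied column-wise (for example with `.apply`), entries that come back as `None` can then be inspected before deciding whether to drop or recode them.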

In [75]:
reviews_raw.isna().sum()

Restaurant     0
Reviewer      38
Review        45
Rating        38
Metadata      38
Time          38
Pictures       0
dtype: int64

In [76]:
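# Split the metadata column into two columns: number of reviews and number of followers
reviews_raw[["Reviews", "Followers"]] = reviews_raw["Metadata"].str.split(",", expand=True)

# Keep only the digits (raw strings avoid invalid-escape warnings) and store them as integers
reviews_raw["Reviews"] = reviews_raw["Reviews"].str.extract(r'(\d+)', expand=False).fillna(0).astype(int)
reviews_raw["Followers"] = reviews_raw["Followers"].str.extract(r'(\d+)', expand=False).fillna(0).astype(int)
reviews_data = reviews_raw.drop("Metadata", axis=1)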
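
The split above relies on the position of the comma, so a metadata string that only mentions followers would end up in the `Reviews` column. As a hedged alternative, the sketch below looks for each count by its label instead. `parse_metadata` is a hypothetical helper, and the label spellings (`Review(s)`, `Follower(s)`) are assumed from the sampled rows.

```python
import re

_REVIEWS_RE = re.compile(r"(\d+)\s+Reviews?")
_FOLLOWERS_RE = re.compile(r"(\d+)\s+Followers?")

def parse_metadata(meta):
    """Return (reviews, followers) as integers; missing parts count as 0."""
    if not isinstance(meta, str):
        return 0, 0  # NaN or other non-string input
    reviews = _REVIEWS_RE.search(meta)
    followers = _FOLLOWERS_RE.search(meta)
    return (
        int(reviews.group(1)) if reviews else 0,
        int(followers.group(1)) if followers else 0,
    )

print(parse_metadata("7 Reviews , 1 Follower"))
print(parse_metadata("1 Review"))
```

The resulting tuples could be expanded into two columns, for instance by applying the helper to `Metadata` and unpacking the result.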

In [77]:
# Drop rows with a missing review or a missing rating, because there are only 45 of them in the dataset

reviews_data = reviews_data[reviews_data["Rating"].notna() & reviews_data["Review"].notna()]
reviews_data.isna().sum()

Restaurant    0
Reviewer      0
Review        0
Rating        0
Time          0
Pictures      0
Reviews       0
Followers     0
dtype: int64

In [78]:
# Encode emojis
def replace_emojis_with_text(text):
    # Check if the value is NaN or float
    if isinstance(text, float):
        return ""
    return emoji.demojize(text)

reviews_data['Review_With_Emoji_Text'] = reviews_data['Review'].apply(replace_emojis_with_text)

TODO: modify the encoding so that emojis appear in the text as `<emoji>` rather than `:emoji:`.
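
The note above asks for angle-bracket delimiters instead of colons. One possible route is a post-processing step over the already demojized text, sketched below with the standard library only. `recode_emoji_aliases` is a hypothetical helper, and its pattern assumes that an alias contains no whitespace or colon and at least one letter or underscore. It can still misfire on text such as `a:b:c`. The `emoji` package's `demojize` also accepts a `delimiters` argument, which may be the more direct option.

```python
import re

# ":alias:" tokens: no whitespace or colon inside, and at least one letter or
# underscore, so clock times such as "10:30:45" are left alone.
_ALIAS_RE = re.compile(r":(?=[^:\s]*[A-Za-z_])([^:\s]+):")

def recode_emoji_aliases(text):
    """Rewrite ':alias:' tokens as '<alias>'; non-string input becomes ''."""
    if not isinstance(text, str):
        return ""
    return _ALIAS_RE.sub(r"<\1>", text)

print(recode_emoji_aliases(":thumbs_up: great food at 10:30:45"))
```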

In [79]:
# Split camelCase words at lowercase-to-uppercase boundaries
def splitting_words_process(word):
    # only upper case letters: keep as is (e.g. "KFC")
    if word.isupper(): 
        return word
    
    # an upper case letter right after a lower case one: insert a space there;
    # leading lower case letters, digits and punctuation are preserved
    # (e.g. "goodFood" -> "good Food"), and words without such a boundary are unchanged
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', word)
    
# e.g.
words = ["EATCHICKEN", "KFC", "Food", "CoolMan"]
processed_words = [splitting_words_process(word) for word in words]

print(processed_words)

['EATCHICKEN', 'KFC', 'Food', 'Cool Man']
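
The function above works on single words, while the reviews are full sentences. As a minimal sketch, the same idea can be applied to a whole string by inserting a space at every lowercase-to-uppercase boundary. `split_camel_case` is a hypothetical name, and the approach assumes that such boundaries always mark two glued words, which is not true for names like "McDonald".

```python
import re

# Zero-width position between a lowercase letter and an uppercase letter
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_camel_case(text):
    """Insert a space at each lowercase-to-uppercase boundary in the text."""
    if not isinstance(text, str):
        return text
    return _CAMEL_BOUNDARY.sub(" ", text)

print(split_camel_case("GreatFood at KFC, goodService"))
```

All-uppercase tokens such as "KFC" contain no such boundary, so they pass through unchanged without a separate check.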


In [80]:
# Function to detect emojis
def contains_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "]+",
        flags=re.UNICODE)
    return bool(emoji_pattern.search(text))

# Function to filter rows based on review length, excluding valid cases
def filter_reviews(row):
    if isinstance(row, str):
        # Check if it's "bad" or "ok"
        if re.match(r'^(bad|ok)$', row, re.IGNORECASE):
            return True
        # Check if it contains an emoji
        if contains_emoji(row):
            return True
        # Otherwise, drop if it's shorter than 3 characters
        if len(row) < 3:
            return False
    return True

# Apply the function to filter the DataFrame
reviews_data = reviews_data[reviews_data['Review'].apply(filter_reviews)]

# TODO: convert emoticons to emojis
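
For the pending emoticon step, one option is a small hand-written lookup applied with a single regular expression. The sketch below is illustrative only: `EMOTICON_TO_ALIAS` is a hypothetical and deliberately short mapping, and the alias names follow the `:alias:` style produced by `demojize` but have not been checked against the `emoji` package.

```python
import re

# Hypothetical mapping; extend it after inspecting the emoticons in the data
EMOTICON_TO_ALIAS = {
    ":)": ":slightly_smiling_face:",
    ":-)": ":slightly_smiling_face:",
    ":(": ":frowning_face:",
    ":-(": ":frowning_face:",
    ":D": ":grinning_face:",
    ";)": ":winking_face:",
}

def _token_pattern(emoticon):
    # Emoticons ending in a letter (":D") need a word boundary so that
    # text like "time:Dinner" is not treated as an emoticon.
    escaped = re.escape(emoticon)
    return escaped + r"\b" if emoticon[-1].isalnum() else escaped

# Longest emoticons first, so ":-)" wins over any shorter alternative
_EMOTICON_RE = re.compile(
    "|".join(_token_pattern(e) for e in sorted(EMOTICON_TO_ALIAS, key=len, reverse=True))
)

def replace_emoticons(text):
    """Replace known emoticons with their alias; other input is returned as is."""
    if not isinstance(text, str):
        return text
    return _EMOTICON_RE.sub(lambda m: EMOTICON_TO_ALIAS[m.group()], text)

print(replace_emoticons("Loved it :) but slow service :-("))
```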

Many reviews contain misspelled variants of 'good' and 'bad'. We map these variants by hand and remove the ones that are not in the list.

In [81]:
# Function to replace 'gud', 'goo', 'gd' with the appropriate 'good'
def replace_gud_with_good(text):
    if isinstance(text, str):
        # Define the regex pattern to match 'gud', 'goo', 'gd' in various capitalizations
        pattern = re.compile(r'\b([Gg][Uu][Dd]|[Gg][Oo][Oo]|[Gg][Dd])\b')

        # Replacement function to check the case of the first letter
        def replacement(match):
            word = match.group()
            # Check if the first letter is uppercase, then return 'Good', else 'good'
            if word[0].isupper():
                return 'Good'
            else:
                return 'good'
        
        # Use re.sub to apply the replacement function
        return pattern.sub(replacement, text)
    
    return text

# Apply the function to the 'Review' column to replace the variants of 'good'
reviews_data['Review'] = reviews_data['Review'].apply(replace_gud_with_good)


In [82]:
# Function to replace 'kk', 'Oke', 'k', 'Ok' with 'ok'
def replace_to_ok(text):
    if isinstance(text, str):
        # Define the regex pattern to match the variants of 'ok'
        pattern = re.compile(r'\b(k|kk|Ok|Oke)\b', re.IGNORECASE)

        # Replacement function to return 'ok' for all matched words
        def replacement(match):
            return 'ok'
        
        # Use re.sub to apply the replacement function
        return pattern.sub(replacement, text)
    
    return text

# Apply the function to the 'Review' column to replace the variants of 'ok'
reviews_data['Review'] = reviews_data['Review'].apply(replace_to_ok)
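
The two cells above follow the same pattern, so further hand-mapped variants could be handled by one dictionary-driven function. The sketch below is one way to do that under stated assumptions: `CANONICAL_FORMS` and `normalize_variants` are hypothetical names, and the variant lists are taken from the two cells above. Unlike `replace_to_ok`, this version keeps a leading capital ("K" becomes "Ok"), which is a design choice rather than the notebook's current behaviour.

```python
import re

# Canonical word -> hand-picked misspellings (lowercase)
CANONICAL_FORMS = {
    "good": ["gud", "goo", "gd"],
    "ok": ["k", "kk", "oke"],
}

_VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in CANONICAL_FORMS.items()
    for variant in variants
}

# Whole-word, case-insensitive match on any known variant
_VARIANT_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, _VARIANT_TO_CANONICAL), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def normalize_variants(text):
    """Replace known misspellings, preserving a capitalized first letter."""
    if not isinstance(text, str):
        return text

    def replacement(match):
        word = match.group()
        canonical = _VARIANT_TO_CANONICAL[word.lower()]
        return canonical.capitalize() if word[0].isupper() else canonical

    return _VARIANT_RE.sub(replacement, text)

print(normalize_variants("Gud food, service was kk"))
```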

In [83]:
# Add a space after ! " # $ % & ( ) * + , . : ; < = > ? when immediately followed by a word character.
# Note: this also splits decimals and abbreviations such as "4.5" or "Rs.75".
def add_space_after_punctuation(df):
    # Operate on the review text ('Review'), not on the review count ('Reviews')
    df['Review'] = df['Review'].apply(lambda text: re.sub(r'([\u0021-\u0026\u0028-\u002C\u002E\u003A-\u003F]+(?=\w))', r'\1 ', text) if isinstance(text, str) else text)
    return df

# Example usage:
reviews_data = add_space_after_punctuation(reviews_data)
reviews_data.head()

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Time,Pictures,Reviews,Followers,Review_With_Emoji_Text
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,5/25/2019 15:54,0,1,2,"The ambience was good, food was quite good . h..."
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,5/25/2019 14:20,0,3,2,Ambience is too good for a pleasant evening. S...
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,5/24/2019 22:54,0,2,3,A must try.. great food great ambience. Thnx f...
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,5/24/2019 22:11,0,1,1,Soumen das and Arun was a great guy. Only beca...
4,Beyond Flavours,Dileep,Food is good. we ordered Kodi drumsticks and b...,5,5/24/2019 21:37,0,3,2,Food is good.we ordered Kodi drumsticks and ba...


In [84]:
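# Function to flag reviews with characters outside the Basic Multilingual Plane (BMP).
# Every Python string is Unicode, so this does not detect "non-Unicode" text: it flags
# code points above U+FFFF, which are typically emojis.
def find_non_unicode_reviews(row):
    try:
        # True if at least one character falls outside U+0000-U+FFFF
        return not bool(re.match(r'^[\u0000-\uFFFF]*$', row))
    except TypeError:
        return False  # In case the row is not a string (e.g., NaN)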

non_unicode_reviews = reviews_raw[reviews_raw['Review'].apply(find_non_unicode_reviews)]
non_unicode_reviews

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures,Reviews,Followers
21,Beyond Flavours,Sneha Munigela,please was good but it was quite expensive and...,4,"6 Reviews , 1 Follower",5/21/2019 20:17,1,6,1
25,Beyond Flavours,Imteja7,The place is very good.. 5* to the live music....,5,3 Reviews,5/20/2019 14:17,0,3,0
26,Beyond Flavours,Nisha Gahlawat,Sonalin has a great voice.. 😍 must visit the p...,5,"2 Reviews , 1 Follower",5/20/2019 12:45,0,2,1
27,Beyond Flavours,Dharini Hombal,I heard her voice..she is too beautiful with a...,5,"1 Review , 26 Followers",5/20/2019 12:30,0,1,26
28,Beyond Flavours,Ankita Sinha,Sonalin is a very good singer in the city.. be...,5,"9 Reviews , 5 Followers",5/20/2019 12:25,0,9,5
...,...,...,...,...,...,...,...,...,...
9875,Triptify,Pyla Deepti,paratha's where too tasty. ordered chicken par...,5,"7 Reviews , 33 Followers",7/27/2018 13:46,0,7,33
9935,Chinese Pavilion,Ch.ramya Krishna,Been here for four times.. still didn't get bo...,4,"14 Reviews , 33 Followers",5/11/2018 17:16,1,14,33
9965,Chinese Pavilion,Kabir Kashyap,I was told by one of my colleague about this p...,4.5,"48 Reviews , 111 Followers",3/12/2017 19:34,2,48,111
9966,Chinese Pavilion,Shilpa,One of the finest in chinese restaurants. This...,4.5,"1 Review , 56 Followers",2/26/2017 0:57,0,1,56
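
The check above amounts to asking whether a review contains any code point above U+FFFF. A minimal sketch of that test without a regular expression is shown below, together with a helper that lists which characters triggered the flag. Both function names are hypothetical. Note that some symbols commonly used as emojis, such as U+2764 (heavy black heart), sit inside the BMP and are therefore not flagged by either version.

```python
def has_non_bmp_char(text):
    """True when the text contains a code point above U+FFFF (e.g. most emojis)."""
    return isinstance(text, str) and any(ord(ch) > 0xFFFF for ch in text)

def non_bmp_chars(text):
    """Return the distinct non-BMP characters, in order of first appearance."""
    if not isinstance(text, str):
        return []
    return list(dict.fromkeys(ch for ch in text if ord(ch) > 0xFFFF))

print(has_non_bmp_char("great voice \U0001F60D"))
print(non_bmp_chars("\U0001F60D\U0001F60D ok \U0001F525"))
```

Listing the offending characters makes it easier to confirm that the flagged rows really contain emojis rather than some other rare symbol.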


In [85]:
reviews_data.shape

(9931, 9)