<i>## Comments will be provided using this format. Key takeaway: groups are encouraged to change the formatting, but not the structure. Groups are also allowed to create additional notebooks (for instance, one notebook for data exploration and one notebook for each preprocessing-modelling-evaluation pipeline), but must strive to keep a unified style across notebooks.</i>

#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Notebook Title`

#### Group:
- `Group member #1`
- `(...)`
- `Group member #5`

#### <font color='#BFD72'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [1. Data Understanding](#P1)
- [2. General Data Preparation](#P2) 
- [3. Multilabel Classification (Information Requirement 3311)](#P3)
    - [3.1 Specific Data Preparation](#P31)
    - [3.2 Model Implementation](#P32)
    - [3.3 Model Evaluation](#P33)
- [4. Sentiment Analysis (Information Requirement 3312)](#P4)
    - [4.1 Specific Data Preparation](#P41)
    - [4.2 Model Implementation](#P42)
    - [4.3 Model Evaluation](#P43)
- [...]
- [N. Additional Tasks (Information Requirements 332n)](#Pn)
    - [N.1 Specific Data Preparation](#Pn1)
    - [N.2 Model Implementation](#Pn2)
    - [N.3 Model Evaluation](#Pn3)

<i>## Note that the notebook structure differs from the report: instead of following the CRISP-DM phases and then specifying the different problems inside the phases, the notebook is structured by problem, with the CRISP-DM phases being defined for each specific problem.</i>

In [50]:
## All imports must be concentrated in a cell that immediately follows the table of contents
import time
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt

In [65]:
restaurants_raw = pd.read_csv(r"data_hyderabad/105_restaurants.csv")
reviews_raw = pd.read_csv(r"data_hyderabad/10k_reviews.csv")

restaurants_raw.head(5)

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Paradise,https://www.zomato.com/hyderabad/paradise-gach...,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM
2,Flechazo,https://www.zomato.com/hyderabad/flechazo-gach...,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM"
3,Shah Ghouse Hotel & Restaurant,https://www.zomato.com/hyderabad/shah-ghouse-h...,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Bever...",12 Noon to 2 AM
4,Over The Moon Brew Company,https://www.zomato.com/hyderabad/over-the-moon...,1200,"Best Bars & Pubs, Food Hygiene Rated Restauran...","Asian, Continental, North Indian, Chinese, Med...","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no..."


In [52]:
restaurants_raw.describe()

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
count,105,105,105,51,105,104
unique,105,105,29,42,92,77
top,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,500,Food Hygiene Rated Restaurants in Hyderabad,"North Indian, Chinese",11 AM to 11 PM
freq,1,1,13,4,4,6


In [53]:
restaurants_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         105 non-null    object
 1   Links        105 non-null    object
 2   Cost         105 non-null    object
 3   Collections  51 non-null     object
 4   Cuisines     105 non-null    object
 5   Timings      104 non-null    object
dtypes: object(6)
memory usage: 5.0+ KB


In [54]:
reviews_raw.head(5)

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0


In [55]:
reviews_raw.describe()

Unnamed: 0,Pictures
count,10000.0
mean,0.7486
std,2.570381
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,64.0


In [56]:
reviews_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Restaurant  10000 non-null  object
 1   Reviewer    9962 non-null   object
 2   Review      9955 non-null   object
 3   Rating      9962 non-null   object
 4   Metadata    9962 non-null   object
 5   Time        9962 non-null   object
 6   Pictures    10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


In [57]:
reviews_raw['Review'].sample(10)

6019    Had good food and drinks. Good ambience. Good ...
468     Restaurant Ambience is Nice.\nFood is TastY an...
5053    As they promise best butter chicken its not tr...
616     Great food. No oil, soft rotis, tasty gravy an...
1184    It’s a hell of party when you land with your c...
215     I have given my b'day treat to my friends. Guy...
9439    You should close the store. I had worst experi...
8827                                            very nice
3408    It's a romantic place. Pool side wine on 10th ...
1993    Been here only for tea and snacks, the guy ove...
Name: Review, dtype: object

In [58]:
reviews_raw.dtypes

Restaurant    object
Reviewer      object
Review        object
Rating        object
Metadata      object
Time          object
Pictures       int64
dtype: object
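
The dtypes above show `Rating` and `Time` stored as `object`, so numeric or date operations on them would need a conversion first. The following is a minimal sketch of one way to do that, not the method used in this notebook. It runs on a small hypothetical sample: the value `"Like"` is an assumed example of a non-numeric rating, and the date format is inferred from the `Time` values shown in `reviews_raw.head(5)`.

```python
import pandas as pd

# Hypothetical sample standing in for reviews_raw; "Like" is an assumed
# example of a non-numeric rating, not a value confirmed in the data.
toy = pd.DataFrame({
    "Rating": ["5", "4.5", "Like", None],
    "Time": ["5/25/2019 15:54", "5/24/2019 22:54", None, None],
})

# errors="coerce" turns anything unparseable into NaN instead of raising
rating_numeric = pd.to_numeric(toy["Rating"], errors="coerce")

# Entries that were present but could not be parsed deserve a manual look
unparsed = toy.loc[rating_numeric.isna() & toy["Rating"].notna(), "Rating"]

# Format inferred from the displayed sample (month/day/year hour:minute)
time_parsed = pd.to_datetime(toy["Time"], format="%m/%d/%Y %H:%M", errors="coerce")

print(rating_numeric.tolist())
print(unparsed.tolist())
print(time_parsed.iloc[0])
```

Inspecting `unparsed` before dropping or imputing anything keeps the decision about odd rating values explicit.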

In [66]:
reviews_raw.isna().sum()

Restaurant     0
Reviewer      38
Review        45
Rating        38
Metadata      38
Time          38
Pictures       0
dtype: int64
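
The counts above show 38 missing values in each of `Reviewer`, `Rating`, `Metadata` and `Time`, and 45 in `Review`. This suggests, but does not prove, that the same 38 rows are empty across all of these columns and that a few additional rows lack only the review text. A minimal sketch of how that could be checked follows. The helper `count_fully_missing` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in for reviews_raw (hypothetical values)
toy_reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B"],
    "Reviewer": ["r1", np.nan, "r3", np.nan],
    "Review": ["good", np.nan, np.nan, np.nan],
    "Rating": ["5", np.nan, "4", np.nan],
})

def count_fully_missing(df, columns):
    """Count rows in which every column in `columns` is NaN."""
    return int(df[columns].isna().all(axis=1).sum())

# Rows that are empty across all listed columns
n_empty = count_fully_missing(toy_reviews, ["Reviewer", "Review", "Rating"])

# Rows where only the review text is missing but a reviewer is recorded
n_review_only = int(
    (toy_reviews["Review"].isna() & toy_reviews["Reviewer"].notna()).sum()
)

print(n_empty, n_review_only)
```

On the real data, the same call with `["Reviewer", "Review", "Rating", "Metadata", "Time"]` would show whether the 38 rows coincide.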

In [67]:
# Flag reviews that are missing, blank, or at most three word characters long.
# `re` and `pd` are already imported in the imports cell at the top of the notebook.
def filter_short_reviews(row):
    """Return True for reviews that are NaN, empty/whitespace-only, or 1-3 word characters long."""
    empty_regex = r'^\s*$'
    up_to_three_chars_pattern = r'^\w{1,3}$'
    if pd.isna(row):
        return True  # Include NaN values in the flagged DataFrame
    elif isinstance(row, str):
        # Check if the row matches the pattern for empty or up to 3 characters long
        return bool(re.match(empty_regex, row)) or bool(re.match(up_to_three_chars_pattern, row))
    return False

# Despite the name, this DataFrame holds the flagged (short or missing) reviews
filtered_reviews_raw = reviews_raw[reviews_raw['Review'].apply(filter_short_reviews)]

filtered_reviews_raw


Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
1568,KFC,Ganessh,gud,4,1 Review,8/1/2018 13:39,0
1577,KFC,Kalyanachakravarthy Chiruvella,ok,4,0 Reviews,7/29/2018 10:45,0
1584,KFC,Deepali Goel,nyc,5,"1 Review , 2 Followers",7/26/2018 21:12,0
1776,Hotel Zara Hi-Fi,Abhijit Mukherjee,yup,5,"1 Review , 1 Follower",9/22/2018 22:17,0
2090,13 Dhaba,Medhavi,D,5,1 Review,8/21/2018 15:30,0
...,...,...,...,...,...,...,...
9096,Arena Eleven,,,,,,0
9097,Arena Eleven,,,,,,0
9098,Arena Eleven,,,,,,0
9099,Arena Eleven,,,,,,0


In [71]:
# Function to flag reviews containing characters outside the Basic Multilingual Plane
# (code points above U+FFFF, e.g. most emoji). Every Python str is Unicode, so
# "non_unicode" in the names below is shorthand for "non-BMP".
def find_non_unicode_reviews(row):
    try:
        # The pattern accepts only code points U+0000 to U+FFFF; anything else is flagged
        return not bool(re.match(r'^[\u0000-\uFFFF]*$', row))
    except TypeError:
        return False  # In case the row is not a string (e.g., NaN)

non_unicode_reviews = reviews_raw[reviews_raw['Review'].apply(find_non_unicode_reviews)]
non_unicode_reviews.to_csv('non_unicode_reviews.csv', encoding='utf-8')
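
The regular expression in the cell above accepts only code points from U+0000 to U+FFFF, so what it flags are strings with characters outside the Basic Multilingual Plane (most emoji, for example), not strings that are "not Unicode". The sketch below shows an equivalent check without a regex. The function name `has_non_bmp` is hypothetical and the sample strings are illustrative.

```python
def has_non_bmp(text):
    """Return True if `text` is a str containing a code point above U+FFFF."""
    if not isinstance(text, str):
        return False  # NaN and other non-string values are not flagged
    return any(ord(ch) > 0xFFFF for ch in text)

# Illustrative inputs: an emoji (outside the BMP), a curly apostrophe
# (U+2019, inside the BMP), and a missing value.
samples = ["tasty \U0001F60B", "It\u2019s a hell of a party", float("nan")]
flags = [has_non_bmp(s) for s in samples]
print(flags)
```

Characters such as the curly apostrophe are non-ASCII but still inside the BMP, so neither this check nor the regex flags them.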

In [63]:
reviews_raw.to_csv('reviews_change_encoding.csv', index=False)

In [72]:
# The file was written with its index, so read the first column back as the index
test = pd.read_csv(r'non_unicode_reviews.csv', encoding='utf-8', index_col=0)