<i>## Comments will be provided using this format. Key takeaway: groups are encouraged to change the formatting, but not the structure. Groups are also allowed to create additional notebooks - for instance, create one notebook for data exploration, and one notebook for each preprocessing-modelling-evaluation pipeline -, but must strive to keep an unified style across notebooks.</i>

#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Notebook Title`

#### Group:
- `Group member #1`
- `(...)`
- `Group member #5`

#### <font color='#BFD72'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [1. Data Understanding](#P1)
- [2. General Data Preparation](#P2) 
- [3. Multilabel Classification (Information Requirement 3311)](#P3)
    - [3.1 Specific Data Preparation](#P31)
    - [3.2 Model Implementation](#P32)
    - [3.3 Model Evaluation](#P3n)
- [4. Sentiment Analysis (Information Requirement 3312)](#P4)
    - [4.1 Specific Data Preparation](#P41)
    - [4.2 Model Implementation](#P42)
    - [4.3 Model Evaluation](#P43)
- [...]
- [N. Additional Tasks (Information Requirements 332n)](#Pn)
    - [N.1 Specific Data Preparation](#Pn1)
    - [N.2 Model Implementation](#Pn2)
    - [N.3 Model Evaluation](#Pn3)

<i>## Note that the notebook structure differs from the report: instead of following the CRISP-DM phases and then specifying the different problems inside the phases, the notebook is structured by problem, with the CRISP-DM phases being defined for each specific problem.

In [23]:
## All imports must be concentrated on a cell that immediately follow the table of contents
import math
import time
import re
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from nltk.tokenize import PunktSentenceTokenizer
sent_tokenizer = PunktSentenceTokenizer()
from  Text_preprocessing_functions import *

#without truncation
pd.set_option('display.max_colwidth', None)

<font color='#BFD72F' size=5>1. Data Understanding</font> <a class="anchor" id="P1"></a>
  
[Back to TOC](#toc)

<i>## Imports.</i>

In [2]:
reviews=pd.read_csv('data/reviews.csv')
restaurants=pd.read_csv('data/restaurants.csv') 
restaurants = restaurants.drop(columns=['Links'])

## Restaurant data initial preprocessing


### Turning cost into a int collumn

In [3]:
#Turning collunm cost to int
restaurants['Cost'] = restaurants['Cost'].str.replace(',', '').astype(int)

### solving null value in  timming


In [4]:
restaurants[restaurants['Timings'].isnull()] #there is a missing value in timings
#After visiting zomato website we took the timetable of this restaurant
restaurants.loc[restaurants['Timings'].isnull(), 'Timings'] = '12AM to 3:30pm, 7pm to 11pm (Mon-Sun)'

### Collections and Cuisines regex alteration


In [5]:
# Putting this 2 collumns in a list
restaurants['Collections'] = restaurants['Collections'].str.replace(r',\s+', ',', regex=True).str.split(',')
restaurants['Cuisines']=restaurants['Cuisines'].str.replace(r',\s+', ',', regex=True).str.split(',')

In [6]:
restaurants['N_collections'] = restaurants['Collections'].apply(lambda x: len(x) if type(x)==list else 0)

### Creating a Open Close collumn

In [7]:
def capture_open_close_times(numbers_list):
    result = []
    for string in numbers_list:
        # Find all numbers in the string
        numbers = re.findall(r'\d+', string)
        if numbers:   
            # Capture the first (opening time) and determine closing time
            opening_time = numbers[0]
            if numbers[-1] in ('15', '30', '40'):
                closing_time = numbers[-2] if len(numbers) > 1 else None  # Use second-to-last if available
            else:
                closing_time = numbers[-1]  # Use last if not 15, 30, or 40

            # Transform closing time if it's 10, 11, or 12
            if closing_time == '10':
                closing_time = '22'
            elif closing_time == '11':
                closing_time = '23'
            elif closing_time == '12':
                closing_time = '24'

            # Transform opening time if it's 12, 1, 4, or 5
            elif opening_time == '1':
                opening_time = '13'
            elif opening_time == '4':
                opening_time = '16'
            elif opening_time == '5':
                opening_time = '17'

            # Append the transformed opening and closing times to the result
            result.append((opening_time, closing_time))

    return result

# Get a list of tuples (opening time, closing time)
open_close_times = capture_open_close_times(restaurants['Timings'])

# Unpack opening and closing times into separate lists
opening_times = [time[0] for time in open_close_times]
closing_times = [time[1] for time in open_close_times]

# Assign the lists to new columns in the DataFrame
restaurants['open time'] = opening_times
restaurants['closing time'] = closing_times
restaurants.drop('Timings', axis=1, inplace=True)

## Reviews data initial preprocessing

### Removing rows that contains no information

In [8]:
reviews.drop_duplicates(inplace=True)

In [9]:
reviews[reviews.isna().any(axis=1)]

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
2360,Amul,Lakshmi Narayana,,5.0,0 Reviews,7/29/2018 18:00,0
5799,Being Hungry,Surya,,5.0,"4 Reviews , 4 Followers",7/19/2018 23:55,0
6449,Hyderabad Chefs,Madhurimanne97,,5.0,1 Review,7/23/2018 16:29,0
6489,Hyderabad Chefs,Harsha,,5.0,1 Review,7/8/2018 21:19,0
7954,Olive Garden,ARUGULLA PRAVEEN KUMAR,,3.0,"1 Review , 1 Follower",8/9/2018 23:25,0
8228,Al Saba Restaurant,Suresh,,5.0,1 Review,7/20/2018 22:42,0
8777,American Wild Wings,,,,,,0
8844,Domino's Pizza,Sayan Gupta,,5.0,"2 Reviews , 2 Followers",8/9/2018 21:41,0
9085,Arena Eleven,,,,,,0


In [10]:
reviews.drop([8777, 9085], axis=0, inplace=True)

### Rating fix LIKE

In [11]:
reviews.Rating.value_counts()

5       3832
4       2373
1       1735
3       1193
2        684
4.5       69
3.5       47
2.5       19
1.5        9
Like       1
Name: Rating, dtype: int64

In [12]:
reviews[reviews['Rating']=='Like']

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
7601,The Old Madras Baking Company,Dhanasekar Kannan,One of the best pizzas to try. It served with the fresh crust and the topping of veggies are fresh and the taste of the ingredients was awesome and it is fully overloaded with Cheese. I would like to recommend to try every Time I wager for pizza,Like,"12 Reviews , 21 Followers",5/18/2019 12:31,1


In [13]:
reviews.at[7601, 'Rating'] = 5

In [14]:
reviews['Rating'] = reviews['Rating'].astype(float)

### Extracting number of reviews and followers

In [16]:
# Extract the number of reviews, followers
reviews['N_reviews'] = reviews['Metadata'].str.extract(r'(\d+)\s+Review')
reviews['Followers'] = reviews['Metadata'].str.extract(r'(\d+)\s+Follower')

reviews['N_reviews'] = reviews['N_reviews'].astype('Int64')
reviews['Followers'] = reviews['Followers'].astype('Int64')
reviews=reviews.drop('Metadata', axis=1)

KeyError: 'Metadata'

### Extracting Date information

In [17]:
reviews['Time'] = pd.to_datetime(reviews['Time'])

reviews['Month'] = reviews['Time'].dt.month.astype(int)
reviews['Year'] = reviews['Time'].dt.year.astype(int)

reviews['Weekend'] = reviews['Time'].dt.weekday.apply(lambda x: 1 if x >= 5 else 0)

### Creating post meal collumn

In [19]:
reviews['Hour'] = reviews['Time'].dt.hour

# Create the 'Post_Meal' column based on the hour ranges for lunch and dinner
# Lunch: 13-15, Dinner: 20-23
reviews['Post_Meal'] = reviews['Hour'].apply(lambda x: 1 if (13 <= x <= 15) or (20 <= x <= 23) else 0)
reviews.drop('Hour', axis=1, inplace=True)

### Creating exploration collumns

In [20]:
reviews["msg_len"] = reviews["Review"].map(lambda content : len(str(content)))
reviews["sents"] = reviews["Review"].map(lambda content :sent_tokenizer.tokenize(str(content)))
reviews["nr_sents"] = reviews["sents"].map(lambda content : len(content))
reviews=reviews.drop('sents',axis=1)

### fill the missing values of reviews with ' '

In [21]:
reviews['Review'].fillna('', inplace=True)

### Remove Giberish

In [71]:
reviews["giberish_text"] =\
      reviews["Review"].map(lambda content : Text_preprocessing_functions.regex_cleaner(content, 
            no_emojis = True, 
            no_hashtags = True,
            hashtag_retain_words = True,
            no_newlines = True,
            no_urls = True,
            no_punctuation = False))

In [72]:
reviews['Gibberish Score'] = reviews['giberish_text'].apply(lambda x: classify(str(x)))

In [73]:
reviews['Gibberish Score'].describe()

count    9962.000000
mean       15.857004
std        22.474989
min         0.000000
25%         1.000000
50%         1.000000
75%        25.175048
max       100.000000
Name: Gibberish Score, dtype: float64

In [78]:
print(reviews[reviews['Gibberish Score']>96]['Review'])

245     Good Service...Hygenic Food...JAYANTA....KUSHAL...HASEBUL GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGYGGGYGGYYYYGGGGGGGGGGHH
1583                                                                                                                                                😄
1584                                                                                                                                              nyc
2090                                                                                                                                                D
3576                                                                                                                                               gd
3583                                                                                                                                                5
3736                                                                                                

In [None]:
reviews[reviews['Review'].loc[9597,8830,8379,8281,8166,7552,6281,6274,6260,5280,4281 ]]

Restaurant                         KFC
Reviewer                  Deepali Goel
Review                             nyc
Rating                             5.0
Time               2018-07-26 21:12:00
Pictures                             0
N_reviews                            1
Followers                            2
Month                                7
Year                              2018
Weekend                              0
Post_Meal                            1
msg_len                              3
nr_sents                             1
Gibberish Score              96.194492
Name: 1584, dtype: object

In [None]:
#restaurants.to_csv('data/restaurants_initial_preproc.csv', index=False)
#reviews.to_csv('data/reviews_initial_preproc.csv', index=False)