# Trustworthy Location Reviews

- Team Name: Code Fellas
- Team Members: Rayaan Nabi Ahmed Quraishi, Kunal Soni

## Part 1: Data Labelling

### Imports

In [59]:
import torch
import os
import re
import json
import pandas as pd
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from sklearn.model_selection import train_test_split
from collections import Counter
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import precision_recall_fscore_support

### Data Loading and Eye-Balling

In [60]:
#Set Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [61]:
#Load Data
data = pd.read_csv('reviews.csv',encoding='ISO-8859-1')
data.head()

Unnamed: 0,business_name,author_name,text,photo,rating,rating_category
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,We went to Marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,During my holiday in Marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,Prices are very affordable. The menu in the ph...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,Turkey's cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,I don't know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu


The problem statement demands that: 
1. we determine the review quality with following characteristics: not a spam, not an advertisement, relevant to location, and without user rants.
2. we enforce following policies: "no advertisement or promotional content","no irrelevant content", and "no rants or user complaints"
If we look at the data above, we do not have any of the characteristic above labelled for us. Nor we have the location tagged. Therefore, we take an approach in which we first tag the reviews with the following labels.
    1. location (categorical): restaurant, home, shop, office etc. (depending upon the text in rating category)
    2. spam (categorical): 1 if review is spam, 0 otherwise.
    3. advertisement (categorical): 1 if review is advertisement, 0 otherwise.
    4. irrelevant content (categorical): 1 if review is irrelevant to the location, 0 otherwise
    5. rants (categorical): 1 if review is rant, 0 othewise
    6. review quality score: a metric that fuses the information in 4 tags: spam, advertisement, irrelevancy, and rants. 
We use the generative AI tools for tagging all the labels which is roughly an unsupervised learning. Once the data is tagged, we split the data into training and testing sets and perform supervised learning using deep learning. 

In [62]:
#Data Column Labels and Number of Rows
columns = data.columns.tolist()
print("Column Names:",columns)
print("Number of Rows:",data.shape[0])

Column Names: ['business_name', 'author_name', 'text', 'photo', 'rating', 'rating_category']
Number of Rows: 1100


In [63]:
#Drop Rows with NAs
data = data.dropna()

In [64]:
print("Number of Rows:",data.shape[0])

Number of Rows: 1100


In [65]:
#Text Preprocessing
def preprocess_text(text):
  text = str(text)#Convert everything to string
  text = re.sub(r"http\S+", "", text)   #Remove URLs
  text = re.sub(r"[^a-zA-Z\s]", " ", text)  # Remove special characters and numbers
  text = re.sub(r"\s+", " ", text)  # Remove extra whitespaces
  text = text.lower()  # Convert to lowercase
  text = text.strip() # Remove leading/trailing spaces
  return text

### Location Tagging (Prompt Engineering)

1. A quick glance at data suggests that business name can be in language such as Turkish. Therefore, it might be hard to extract location type from business name.
2. However, location type can be extracted from text in rating_category. This could be highly contextual, so we prefer prompting to give context to LLM. 

In [66]:
#Location Tagging

def tag_location(features):
    """
    Classifies a location based on a list of features, ensuring a specific output
    for a known input by using a highly-structured prompt with a Flan-T5 model.

    Args:
        features (list): A list of strings describing a place.

    Returns:
        str: The identified location category, or 'unidentified' if no match is found.
    """
    # Initialize the text-to-text generation pipeline with the Flan-T5 model.
    try:
        generator = pipeline('text2text-generation', model='google/flan-t5-small')
    except Exception as e:
        return f"Error loading model: {e}. Please ensure 'transformers' and 'torch' are installed."

    # Define the specific categories the model should choose from.
    categories = ['restaurant', 'office', 'home', 'park', 'hotel', 'shop']
    
    # Construct the highly-structured prompt as a variable.
    prompt_template = """
You are a highly specialized AI that classifies locations.
Your task is to analyze a list of features and provide the single most appropriate location category from the given list.
The location categories are: {categories}.

Example 1:
Features: ['desk', 'computer', 'meeting']
Output: office

Example 2:
Features: ['taste', 'menu', 'atmosphere', 'outdoor', 'indoor']
Output: restaurant

Instructions:
Analyze the semantic meaning of the provided features and select the best matching category from the list. Your response must only contain the single word that is the category.

Features: {features}
Output:
"""

    # Format the prompt with the actual input features and categories.
    features_str = ', '.join(f"'{word}'" for word in features)
    categories_str = ', '.join(f"'{cat}'" for cat in categories)
    prompt = prompt_template.format(features=f"[{features_str}]", categories=f"[{categories_str}]")

    # Generate the text with specific parameters for concise output.
    response = generator(
        prompt, 
        max_length=256, 
        num_return_sequences=1, 
        do_sample=False,
    )

    # Extract and clean the generated word.
    # T5 models return only the generated text, so no need to remove the prompt.
    generated_text = response[0]['generated_text']
    summary_word = generated_text.strip().lower()

    # Ensure the output is one of the valid categories.
    if summary_word in categories:
        return summary_word
    else:
        return "unidentified"

In [67]:
df = data.groupby("business_name")["rating_category"].agg(" ".join).reset_index()

In [68]:
df["rating_category"] = df["rating_category"].apply(lambda x: list(set(preprocess_text(x).split(" "))))


In [69]:
df["location"] = df["rating_category"].apply(lambda x: tag_location(x))

Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more infor

In [70]:
df = df[["business_name","location"]]

In [71]:
data = data.merge(df, on="business_name", how="left")

In [72]:
data.head()

Unnamed: 0,business_name,author_name,text,photo,rating,rating_category,location
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,We went to Marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste,restaurant
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,During my holiday in Marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu,restaurant
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,Prices are very affordable. The menu in the ph...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere,restaurant
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,Turkey's cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere,restaurant
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,I don't know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu,restaurant


In [77]:
data.to_csv('data.csv', index=False) 

### Tagging Spam, Advertisment, Irrelevancy, and Rant

1. Irrelevancy: embedding of text are compared against embedding of location to find similarity score with 0.3 cut-off
2. Spam: detection using spam keywords search
3. Advertisement: detection using advertisement keywords search
4. Rant: detection using rant keywords search
   
Review uality Score is information fusion of 4 tags above (1-(spam+advertisement+rant+irrelevancy)/4).

In [78]:
#Load Data
data = pd.read_csv('data.csv',encoding='ISO-8859-1')
data["text"] = data["text"].apply(lambda x: preprocess_text(x))

In [81]:
model = SentenceTransformer('all-MiniLM-L6-v2')

def tag_spam_adv_irrel_rant(review_text, location, similarity_threshold=0.3):
    review_text = review_text.strip().lower()
    location = location.strip().lower()

    # 1. Relevancy using semantic similarity
    review_emb = model.encode(review_text, convert_to_tensor=True)
    location_emb = model.encode(location, convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(review_emb, location_emb).item()
    irrelevant = 0 if similarity >= similarity_threshold else 1

    # 2. Spam detection (doubled keywords)
    spam_keywords = [
        "buy now", "free", "check out", "subscribe",
        "click here", "limited time", "order now", "winner"
    ]
    spam = 1 if any(word in review_text for word in spam_keywords) else 0

    # 3. Advertisement detection (doubled keywords)
    ad_keywords = [
        "promo", "deal", "offer", "visit my page",
        "sale", "discount", "special offer", "advertisement"
    ]
    advertisement = 1 if any(word in review_text for word in ad_keywords) else 0

    # 4. Rant detection (doubled keywords)
    rant_keywords = [
        "worst", "never coming back", "awful", "terrible", "horrible",
        "disgusting", "poor service", "not recommended", "hate"
    ]
    rant = 1 if any(word in review_text for word in rant_keywords) else 0

    # 5. Fuse labels to compute quality score
    labels = [irrelevant, spam, advertisement, rant]
    quality_score = 1 - (sum(labels) / len(labels))  # 0 = worst, 1 = best

    return {
        "irrelevant": irrelevant,
        "spam": spam,
        "advertisement": advertisement,
        "rant": rant,
        "quality_score": quality_score
    }

In [82]:

# Apply to DataFrame
data[["irrelevant", "spam", "advertisement", "rant", "quality_score"]] = data.apply(
    lambda row: pd.Series(tag_spam_adv_irrel_rant(row["text"], row["location"])),
    axis=1
)

In [83]:
data.head()

Unnamed: 0,business_name,author_name,text,photo,rating,rating_category,location,irrelevant,spam,advertisement,rant,quality_score
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,we went to marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste,restaurant,0.0,0.0,0.0,0.0,1.0
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,during my holiday in marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu,restaurant,0.0,0.0,0.0,0.0,1.0
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,prices are very affordable the menu in the pho...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere,restaurant,0.0,0.0,0.0,0.0,1.0
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,turkey s cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere,restaurant,0.0,0.0,0.0,0.0,1.0
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,i don t know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu,restaurant,0.0,0.0,0.0,0.0,1.0


In [84]:
data.to_csv('data.csv', index=False) 

## Part 2: Policy Enforcement

A function to join the tagged(labelled) data with policy tags - joining two dataframes

In [85]:
#Function to enforce user policy on reviews data

def enforce_policies(data_file: str, policy_file: str = "policies.json", output_file: str = "data_w_policy.csv"):
    """
    Enforces policies on review data by merging review labels with policy definitions.

    Args:
        data_file (str): Path to input review data CSV file.
        policy_file (str): Path to policies JSON file (default: 'policies.json').
        output_file (str): Path to save the merged CSV file (default: 'data_w_policy.csv').

    Returns:
        pd.DataFrame: The merged DataFrame with policies applied.
    """
    # Load review data
    data = pd.read_csv(data_file, encoding="ISO-8859-1")

    # Load policies
    with open(policy_file, "r") as f:
        policies_dict = json.load(f)

    # Convert list of dicts to DataFrame
    policy_df = pd.DataFrame(policies_dict)

    # Merge on label columns
    data_w_policy = data.merge(
        policy_df,
        how="left",
        on=["advertisement", "irrelevant", "rant"]
    )

    # Replace NaNs with None for JSON compatibility
    data_w_policy = data_w_policy.replace({np.nan: None})

    # Save to CSV
    data_w_policy.to_csv(output_file, index=False, encoding="utf-8")

    return data_w_policy


In [86]:
enforce_policies("data.csv").head()

Unnamed: 0,business_name,author_name,text,photo,rating,rating_category,location,irrelevant,spam,advertisement,rant,quality_score,policy_type,policy_description
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,we went to marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste,restaurant,0.0,0.0,0.0,0.0,1.0,,
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,during my holiday in marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu,restaurant,0.0,0.0,0.0,0.0,1.0,,
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,prices are very affordable the menu in the pho...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere,restaurant,0.0,0.0,0.0,0.0,1.0,,
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,turkey s cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere,restaurant,0.0,0.0,0.0,0.0,1.0,,
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,i don t know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu,restaurant,0.0,0.0,0.0,0.0,1.0,,


## Part 3: Text Classification

In [87]:
#Read Data From File
data = pd.read_csv('data.csv',encoding='ISO-8859-1')
cols = list(data.columns)
print(cols)

['business_name', 'author_name', 'text', 'photo', 'rating', 'rating_category', 'location', 'irrelevant', 'spam', 'advertisement', 'rant', 'quality_score']


In [88]:
#Check The Data Balance
input_cols = ['text']
output_cols = ['spam', 'advertisement', 'rant', 'irrelevant','quality_score']
categorical_cols = ['spam','advertisement','rant','irrelevant']
numerical_cols = ['quality_score']

In [89]:
#Check the class balance
df = data[categorical_cols]
counts = pd.DataFrame({col: df[col].value_counts() for col in df.columns}).fillna(0).astype(int)
print(counts)

     spam  advertisement  rant  irrelevant
0.0  1099           1088  1086         486
1.0     1             12    14         614


In [90]:
import pandas as pd
from skmultilearn.model_selection import iterative_train_test_split
import numpy as np

def balanced_split_dataframe(df, test_size=0.2, random_state=42):
    """
    Splits a pandas DataFrame into balanced train and test sets 
    for multi-label classification.
    
    Args:
        df (pd.DataFrame): Input dataframe with 'text' and label columns.
        test_size (float): Fraction of data to use for testing.
        random_state (int): Random seed for reproducibility.
    
    Returns:
        X_train, X_test, y_train, y_test
    """
    np.random.seed(random_state)
    
    # Define input and output columns
    input_col = "text"
    output_cols = ["spam", "advertisement", "rant", "irrelevant", "quality_score"]
    
    # Extract features (X) and labels (y)
    X = df[[input_col]].values
    y = df[output_cols].values
    
    # Iterative stratification split
    X_train, y_train, X_test, y_test = iterative_train_test_split(
        X, y, test_size=test_size
    )
    
    # Convert back to DataFrame
    X_train = pd.DataFrame(X_train, columns=[input_col])
    X_test = pd.DataFrame(X_test, columns=[input_col])
    y_train = pd.DataFrame(y_train, columns=output_cols)
    y_test = pd.DataFrame(y_test, columns=output_cols)
    
    return X_train, X_test, y_train, y_test


In [91]:
X_train, X_test, y_train, y_test = balanced_split_dataframe(data)
df = y_train[categorical_cols]
counts = pd.DataFrame({col: df[col].value_counts() for col in df.columns}).fillna(0).astype(int)
print("Train Data:")
print(counts)
df = y_test[categorical_cols]
counts = pd.DataFrame({col: df[col].value_counts() for col in df.columns}).fillna(0).astype(int)
print("Test Data:")
print(counts)

Train Data:
     spam  advertisement  rant  irrelevant
0.0   879            872   869         389
1.0     1              8    11         491
Test Data:
     spam  advertisement  rant  irrelevant
0.0   220            216   217          97
1.0     0              4     3         123


In [92]:
#Function to train with ensemble method and test the labelled data.

def train_and_test_ensemble(X_train, X_test, y_train, y_test):
    """
    Train a multi-label text classifier with an ensemble model 
    and evaluate precision, recall, f1 per class.
    
    Args:
        X_train (pd.DataFrame): Training input with 'text' column
        X_test (pd.DataFrame): Test input with 'text' column
        y_train (pd.DataFrame): Training labels (spam, advertisement, rant, irrelevant)
        y_test (pd.DataFrame): Test labels
    
    Returns:
        metrics_df (pd.DataFrame): Table of precision, recall, f1 for each class
    """
    
    label_cols = ["spam", "advertisement", "rant", "irrelevant"]
    
    # Vectorize text
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
    X_train_tfidf = vectorizer.fit_transform(X_train["text"])
    X_test_tfidf = vectorizer.transform(X_test["text"])
    
    # Ensemble classifier (Random Forest)
    base_clf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
    clf = MultiOutputClassifier(base_clf)
    clf.fit(X_train_tfidf, y_train[label_cols])
    
    # Predictions
    y_pred = clf.predict(X_test_tfidf)
    
    # Collect metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test[label_cols], y_pred, average=None, labels=range(len(label_cols))
    )

    #Predicted Quality Score
    quality_score_pred = list(y_pred.mean(axis=1))

    
    # Extract true quality_score from y_test
    quality_score_true = y_test['quality_score'].tolist()
    
    correct_predictions = sum(1 for true, pred in zip(quality_score_true, quality_score_pred) if true == pred)
    accuracy = correct_predictions / len(quality_score_true)
    
    print("Quality Score Prediction Accuracy:", accuracy)
    print("\nPer-Class Metrics:")
    
    # Convert to DataFrame (tabular format)
    metrics_df = pd.DataFrame({
        "precision": precision,
        "recall": recall,
        "f1": f1
    }, index=label_cols)
    
    return metrics_df, accuracy


### Evaluation
1. Accuracy of Quality Score predicted
2. Precision, Recall, F1 scores for the different text classification viz., spam, advertisement, rant, and irrelevant.

In [93]:
metrics, quality_acc = train_and_test_ensemble(X_train, X_test, y_train, y_test)
print(metrics)

Quality Score Prediction Accuracy: 0.0

Per-Class Metrics:
               precision    recall        f1
spam            0.000000  0.000000  0.000000
advertisement   0.000000  0.000000  0.000000
rant            0.000000  0.000000  0.000000
irrelevant      0.710145  0.796748  0.750958


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Conclusion:
Given the time constraints, we could only show one of the ways to solve the given problem. 
The test results are far from reasonable because of two reasons:
1. Poor Data Tagging: we admit the shortcomings in the data tagging. Finding a refined method to tag the data would have required more time, something we could not afford to provide.
2. Imbalanced Data: the results shown are tested on limited data. Finding more data points through web scraping could have helped, something we did not try given limited available time. 