[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lindsayalexandra14/ds_portfolio/blob/main/1_projects/restaurant_prediction_nlp/restaurant_prediction_nlp.ipynb)


In [None]:
# @title
# !pip install pipreqs

In [None]:
try:
    import gensim
except ImportError:
    !pip install gensim
    import gensim

📦 This notebook requires `gensim`.  
Run the install cell above first. If Colab asks you to restart the runtime, do that, then re-run the notebook from the top.


![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Restaurant%20Type%20Prediction%20NLP.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Introduction.png)



*  Author: Lindsay McFarlane
*  Date: May 2025
*  Objective: Predict restaurant type from restaurant data using any type of NLP model
* Goal: Accuracy above ~76%, outperforming given baseline model
* Result: 82% accuracy (vs. 76% baseline), an improvement of 6 percentage points (~8% relative) over baseline
* Data: 3rd party data (from UCSD NLP Class) on restaurant reviews & other restaurant features
* Models:

    1.   Avg. word embeddings of restaurant reviews (baseline)
    2.   Avg. Word Embeddings (review + features)
    3.   BERT Transformer (on reviews)
    4.   BERT Transformer on reviews + features
    5.   RoBERTa Transformer on reviews & features



>>[Import Data](#scrollTo=Rhq9-i6WaWOO)

>>[View Data](#scrollTo=yOWoCaNkYtS-)

>[Data Preparation](#scrollTo=cK4PnORqYp5Y)

>[Initial Dimension Reduction](#scrollTo=WAD9194iRmst)

>[EDA](#scrollTo=K6D3CQoVZGPb)

>[Dimension Reduction](#scrollTo=sNI3285xZLw3)

>[Average Embedding Models](#scrollTo=h99-Cxve7Bv3)

>>[Data Pre-Processing](#scrollTo=BQrAAU1a1J3b)

>>[Model 1](#scrollTo=ir7d2aDE6Nk0)

>>[Model 2](#scrollTo=sfRa7_EW6Bw6)

>[BERT Transformer Models](#scrollTo=dmyCqwti0l9y)

>>[Model 3](#scrollTo=LRmbbe9-NRXH)

>>[Model 4](#scrollTo=JL5WNzgt2sP8)

>[RoBERTa Transformer Model](#scrollTo=_VUOlEPd0uiM)

>>[Model 5](#scrollTo=qy-UWADF18FP)

>[Model Comparison](#scrollTo=NSljNK9KRc7i)

>[Misc](#scrollTo=oFuYjg7IPH64)



![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/tl;dr.png)

This project predicts restaurant type from restaurant reviews alone, or from reviews combined with other restaurant features, using the models listed above. This was the final performance:

INPUT CODE FOR CHARTS IMAGE

##Import Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Import%20Data.png)

In [None]:
!wget "https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/refs/heads/main/1_projects/restaurant_prediction_nlp/train.csv"
!wget "https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/refs/heads/main/1_projects/restaurant_prediction_nlp/test.csv"


In [None]:
import pandas as pd
import numpy as np

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

##View Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/View%20Data.png)

View shape of data:

*   13k rows and 62 features for training data
*   10k rows and 61 features for test data (same features minus 'label', hence one fewer column)



In [None]:
print(df_train.shape)
print(df_test.shape)

*Note: The given "test set" is unlabeled, so it is not used for evaluation here; it was only used for Kaggle scoring on the platform.*

View names of all features/columns:

In [None]:
print(df_train.columns)

#Data Preparation

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Data%20Preparation.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Missing%20Values.png)

Create dataframe of features with their count and % of missing values (if above 0 -> they had missing values):

In [None]:
missing_df = pd.DataFrame({
    'Missing Count': df_train.isnull().sum(),
    'Missing %': (df_train.isnull().sum() / len(df_train)) * 100
})
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %', ascending=False)

Drop columns with 70% or more missing values:

In [None]:
# Keep columns where more than 30% of values are non-null
threshold = 0.30

original_cols = df_train.columns

valid_cols = df_train.columns[df_train.isnull().mean() < (1 - threshold)]

# Except keep the bitcoin feature despite its missingness %
if "attributes.BusinessAcceptsBitcoin" in df_train.columns and \
   "attributes.BusinessAcceptsBitcoin" not in valid_cols:
    valid_cols = valid_cols.append(pd.Index(["attributes.BusinessAcceptsBitcoin"]))

# Drop columns in original_cols but NOT in valid_cols
dropped_cols = original_cols.difference(valid_cols)

print(f"Columns dropped ({len(dropped_cols)}):")
print(dropped_cols.tolist())

# Set dataframes to only include valid columns for training and test
df_train = df_train[valid_cols]
df_test = df_test[valid_cols.intersection(df_test.columns)]
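
To make the threshold rule concrete, here is a minimal plain-Python sketch on a made-up table. The column names and values are hypothetical (not taken from the dataset), and `None` stands in for a missing value; it only illustrates the comparison `missing share < 1 - threshold` used in the cell above.

```python
# Toy columns (hypothetical data); None marks a missing value
toy = {
    "name": ["A", "B", "C", "D"],                      # 0% missing
    "mostly_missing": [None, None, None, "True"],      # 75% missing
    "mostly_present": ["free", None, "no", "paid"],    # 25% missing
}

threshold = 0.30  # same meaning as in the cell above


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Keep a column only if its missing share is below 1 - threshold (0.70)
kept = [col for col, vals in toy.items() if missing_fraction(vals) < (1 - threshold)]
dropped = [col for col in toy if col not in kept]

print("kept:", kept)
print("dropped:", dropped)
```

A column that is exactly 70% missing would also be dropped, because the comparison is strict.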


View updated data shape and features:

In [None]:
print(df_train.shape)
print(df_test.shape)
print(df_train.columns)
print(df_test.columns)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Removing%20Characters.png)

View state of formatting:

In [None]:
print(df_train.head())

Fix string formatting, first combining the training and test to apply formatting to both datasets at once. Then remove byte-string (b'), unicode-string (u'), and quote characters from the data:

In [None]:
df_train["source"] = "train"
df_test["source"] = "test"
combined = pd.concat([df_train, df_test], axis=0)

combined = combined.apply(lambda col: col.astype(str).str.replace(r"^b'|^b\"|\"|'|u'", '', regex=True))
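
As a quick illustration of what this pattern does, the sketch below applies the same regular expression with Python's `re.sub` to a few made-up strings (pandas' `str.replace(..., regex=True)` follows the same matching rules). The sample strings are hypothetical. One side effect worth knowing about: the unanchored `u'` alternative also matches inside contractions such as "you're".

```python
import re

# Same pattern as in the cell above
pattern = r"^b'|^b\"|\"|'|u'"

# Hypothetical sample values
samples = ["b'Great tacos'", "u'free'", "you're welcome"]

cleaned = [re.sub(pattern, "", s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

If losing the "u" in contractions is not wanted, anchoring that alternative (for example `^u'`) would be one option; whether it changes model performance was not tested here.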

Split data back into training and test sets:

In [None]:
df_train = combined[combined["source"] == "train"].drop("source", axis=1)
df_test = combined[combined["source"] == "test"].drop("source", axis=1)

View data to check that characters were removed and formatting looks usable:

In [None]:
print(df_train.head())

Note: combining and then separating the train and test sets above gave the test set a 'label' column containing only missing values (the string 'nan' after the cast to `str`). It won't be pulled in later, but it is good to know:

In [None]:
print(df_test['label'])

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Handle%20Null%20Values.png)

Make 'None's in a consistent format:
Replace actual NaNs with the string 'None'.
Normalize anything that *means* 'None' (case-insensitive):

In [None]:
df_train = df_train.fillna('None')
df_train = df_train.applymap(lambda x: 'None' if str(x).strip().lower() in ['none', 'nan', 'b\'none\'', 'b"none"'] else x)


Apply same formatting to the test data:

In [None]:
df_test = df_test.fillna('None')

df_test = df_test.applymap(lambda x: 'None' if str(x).strip().lower() in ['none', 'nan', 'b\'none\'', 'b"none"'] else x)
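
The normalization rule can be checked on a few hand-picked values. This is a small sketch using the same condition as the lambda above; the helper name `normalize_none` and the sample values are made up for illustration.

```python
# Same "means None" list as in the cells above
NONE_LIKE = ['none', 'nan', "b'none'", 'b"none"']


def normalize_none(x):
    """Return 'None' for anything that means 'None' (case-insensitive), else x unchanged."""
    return 'None' if str(x).strip().lower() in NONE_LIKE else x


# Hypothetical sample values, including a real float NaN
values = ["nan", " NONE ", "None", float("nan"), "casual", 3]
normalized = [normalize_none(v) for v in values]
print(normalized)
```

Non-string values that do not mean 'None' (such as the integer 3) pass through unchanged.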

#Initial Dimension Reduction

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Initial%20Dimension%20Reduction.png)

Drop features:

*   id, business_id - not informative
*   latitude, longitude, address, postal code, city, hours - too granular to use
*   is_open, review count - not indicative of type
*  parking (bike, business) - not indicative of type, one field is too complex for low value, mostly dependent on the city/location
*   attributes is a field of all the attributes combined - don't want all


Drop columns that will definitely not be used:

In [None]:
columns_to_drop = ['id','business_id','latitude','longitude','hours','address', 'postal_code','city',
                   'hours.Monday','hours.Tuesday','hours.Wednesday','hours.Thursday',
                   'hours.Friday','hours.Saturday', 'hours.Sunday',
                   'attributes.BusinessParking','attributes.BikeParking',
                   'is_open','review_count','attributes']
df_train = df_train.drop(columns_to_drop, axis=1)
df_test = df_test.drop(columns_to_drop, axis=1)

In [None]:
print(df_train.columns)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Feature%20Engineering.png)

The feature engineering below aims to give the model concise, text-rich information phrased in natural language, which BERT should interpret better than the original forms such as "AZ" (for a state) or "PriceRange: 2".

First, copy the review field to a new "text" column, since "text" is the more intuitive and standard input name for the NLP tokenizing functions. The original "review" column is dropped later:

In [None]:
df_train["text"] = df_train["review"]
df_test["text"] = df_test["review"]

I viewed the unique values in the 'state' column in order to build a map from each abbreviation to its full name, which gives the natural language model more specific and useful text.

This is important, for example, for identifying the Canadian provinces (e.g., for 'Quebec Canada' the model can pick up the word 'Canada', which helps predict the 'Canadian restaurant' type).

The states could also have been grouped into regions, but there were not many of them, so I did not group them. The full names should still provide clearer information for the word embeddings, in addition to the Canadian tag:

In [None]:
print(df_train["state"].unique())

In [None]:
state_map = {
    "QC": "Quebec Canada",
    "ON": "Ontario Canada",
    "BC": "British Columbia Canada",
    "AZ": "Arizona",
    "OH": "Ohio",
    "NV": "Nevada",
    "SC": "South Carolina",
    "WI": "Wisconsin",
    "PA": "Pennsylvania",  # PA is the state of Pennsylvania (Philadelphia is a city)
    "NC": "North Carolina",
    "IL": "Illinois",
    "VA": "Virginia",
    "AB": "Alberta Canada"
}

df_train["state"] = df_train["state"].map(state_map).fillna(df_train["state"])
df_test["state"] = df_test["state"].map(state_map).fillna(df_test["state"])
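
The `.map(state_map).fillna(...)` pattern replaces known abbreviations and leaves anything without an entry untouched. A plain-Python sketch of the same fallback behavior, using a two-entry subset of the map and a made-up code `"XX"`:

```python
# Subset of the mapping, for illustration only
state_map_demo = {"QC": "Quebec Canada", "AZ": "Arizona"}

# "XX" is a made-up code that has no entry in the map
codes = ["QC", "AZ", "XX"]

# dict.get(code, code) mirrors map(...).fillna(original): unmapped codes stay as they are
mapped = [state_map_demo.get(code, code) for code in codes]
print(mapped)
```

This is why an abbreviation missing from `state_map` would pass through silently rather than raise an error, so checking the unique values afterwards is a useful confirmation step.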


Confirm mapping and show new 'state' values:

In [None]:
print(df_train["state"].unique())

Inspect the 'attributes.' columns and their unique values, since they share a similar, more complex structure: some are True/False/None and some are dictionaries stored as strings. They will need to be reformatted:

In [None]:
for col in df_train.columns:
    if col.startswith("attributes."):
        print(f"\nColumn: {col}")
        print(df_train[col].dropna().astype(str).unique().tolist())

In [None]:
print(df_train.head())

I viewed the 'Attire' values. To keep the word "attire" available to the model and retain only values that carry useful text, I reformatted them and replaced 'None' with an empty string:

In [None]:
print(df_train["attributes.RestaurantsAttire"].unique())

In [None]:
def map_attire(value):
    if value == "casual":
        return "Casual attire"
    elif value == "dressy":
        return "Dressy attire"
    elif value == "formal":
        return "Formal attire"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsAttire"] = df_train["attributes.RestaurantsAttire"].apply(map_attire)
df_test["attributes.RestaurantsAttire"] = df_test["attributes.RestaurantsAttire"].apply(map_attire)


For True/False features, keeping only the positive phrasing (e.g., "Kid-friendly", without also adding "Not Kid-Friendly") performed better, which suggests the negative versions were adding noise. My strategy was therefore to keep only the positive values and make them as concise and word-embedding-friendly as possible, to give a cleaner signal. (The string "False" is kept for now so it can be shown in the EDA; it is filtered out before modeling.)

Similarly, fields such as WiFi, Noise Level, and "Good for Kids" (mapped to "Kid-friendly") were adjusted so that the text information sits in the values themselves rather than in a bare True/False, while staying concise enough to be well interpreted by the word embeddings in the model:

In [None]:
def map_wifi(value):
    # Observed values: ['no', 'None', 'free', 'paid']
    if value == "free":
        return "Free WiFi"
    elif value == "paid":
        return "Paid WiFi"
    else:
        return ""  # For 'no', None, or anything else

df_train["attributes.WiFi"] = df_train["attributes.WiFi"].apply(map_wifi)
df_test["attributes.WiFi"] = df_test["attributes.WiFi"].apply(map_wifi)


In [None]:
def map_noise(value):
    # Observed values: ['loud', 'quiet', 'None', 'average', 'very_loud']
    if value == "loud":
        return "Noisy"
    elif value == "quiet":
        return "Quiet"
    elif value == "average":
        return "Average noise"
    elif value == "very_loud":
        return "Very noisy"
    else:
        return ""  # For None or anything else

df_train["attributes.NoiseLevel"] = df_train["attributes.NoiseLevel"].apply(map_noise)
df_test["attributes.NoiseLevel"] = df_test["attributes.NoiseLevel"].apply(map_noise)


In [None]:
def map_good_for_kids(value):
    if value is True or value == "True":
        return "Kid-friendly"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.GoodForKids"] = df_train["attributes.GoodForKids"].apply(map_good_for_kids)
df_test["attributes.GoodForKids"] = df_test["attributes.GoodForKids"].apply(map_good_for_kids)


In [None]:
print(df_train["attributes.GoodForKids"].head())

I performed the same formatting adjustment on the fields below so that they also include as much text information as possible to feed the natural language model.

In [None]:
def map_delivery(value):
    if value is True or value == "True":
        return "Delivery"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsDelivery"] = df_train["attributes.RestaurantsDelivery"].apply(map_delivery)
df_test["attributes.RestaurantsDelivery"] = df_test["attributes.RestaurantsDelivery"].apply(map_delivery)


For 'Price', I mapped the values to low, moderate, high, and very high price for the same reason:

In [None]:
print(df_train["attributes.RestaurantsPriceRange2"].unique())

In [None]:
def map_pricerange(value):
    if value == '1':
        return "Low Price"
    elif value == '2':
        return "Moderate Price"
    elif value == '3':
        return "High Price"
    elif value == '4':
        return "Very High Price"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsPriceRange2"] = df_train["attributes.RestaurantsPriceRange2"].apply(map_pricerange)
df_test["attributes.RestaurantsPriceRange2"] = df_test["attributes.RestaurantsPriceRange2"].apply(map_pricerange)


In [None]:
print(df_train['attributes.RestaurantsPriceRange2'].head())

In [None]:
def map_cc(value):
    if value is True or value == "True":
        return "Accepts Credit Cards"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.BusinessAcceptsCreditCards"] = df_train["attributes.BusinessAcceptsCreditCards"].apply(map_cc)
df_test["attributes.BusinessAcceptsCreditCards"] = df_test["attributes.BusinessAcceptsCreditCards"].apply(map_cc)


In [None]:
def map_bitcoin(value):
    if value is True or value == "True":
        return "Accepts Bitcoin"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.BusinessAcceptsBitcoin"] = df_train["attributes.BusinessAcceptsBitcoin"].apply(map_bitcoin)
df_test["attributes.BusinessAcceptsBitcoin"] = df_test["attributes.BusinessAcceptsBitcoin"].apply(map_bitcoin)


In [None]:
def map_outdoor_seating(value):
    if value is True or value == "True":
        return "Outdoor Seating"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.OutdoorSeating"] = df_train["attributes.OutdoorSeating"].apply(map_outdoor_seating)
df_test["attributes.OutdoorSeating"] = df_test["attributes.OutdoorSeating"].apply(map_outdoor_seating)


In [None]:
def map_tv(value):
    if value is True or value == "True":
        return "Has TV"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.HasTV"] = df_train["attributes.HasTV"].apply(map_tv)
df_test["attributes.HasTV"] = df_test["attributes.HasTV"].apply(map_tv)


In [None]:
def map_groups(value):
    if value is True or value == "True":
        return "Good for Groups"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsGoodForGroups"] = df_train["attributes.RestaurantsGoodForGroups"].apply(map_groups)
df_test["attributes.RestaurantsGoodForGroups"] = df_test["attributes.RestaurantsGoodForGroups"].apply(map_groups)


In [None]:
def map_tableservice(value):
    if value is True or value == "True":
        return "Table Service"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsTableService"] = df_train["attributes.RestaurantsTableService"].apply(map_tableservice)
df_test["attributes.RestaurantsTableService"] = df_test["attributes.RestaurantsTableService"].apply(map_tableservice)


In [None]:
def map_reservations(value):
    if value is True or value == "True":
        return "Takes reservations"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsReservations"] = df_train["attributes.RestaurantsReservations"].apply(map_reservations)
df_test["attributes.RestaurantsReservations"] = df_test["attributes.RestaurantsReservations"].apply(map_reservations)


In [None]:
def map_takeout(value):
    if value is True or value == "True":
        return "Takeout"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.RestaurantsTakeOut"] = df_train["attributes.RestaurantsTakeOut"].apply(map_takeout)
df_test["attributes.RestaurantsTakeOut"] = df_test["attributes.RestaurantsTakeOut"].apply(map_takeout)


In [None]:
def map_caters(value):
    if value is True or value == "True":
        return "Caters"
    elif value is False or value == "False":
        return "False"
    else:
        return ""  # For None or anything else

df_train["attributes.Caters"] = df_train["attributes.Caters"].apply(map_caters)
df_test["attributes.Caters"] = df_test["attributes.Caters"].apply(map_caters)
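
The True/False mappers above all follow one pattern, so they could be generated from a single helper. The following is only a sketch of that idea, not the code used for the reported results; `make_bool_mapper`, `bool_labels`, and `map_tv_demo` are hypothetical names introduced here.

```python
def make_bool_mapper(positive_label):
    """Build a mapper: True -> positive_label, False -> "False", anything else -> ""."""
    def mapper(value):
        if value is True or value == "True":
            return positive_label
        if value is False or value == "False":
            return "False"
        return ""  # For None or anything else
    return mapper


# Hypothetical subset of the column -> label pairs used in the cells above
bool_labels = {
    "attributes.HasTV": "Has TV",
    "attributes.Caters": "Caters",
}

map_tv_demo = make_bool_mapper(bool_labels["attributes.HasTV"])
demo_output = [map_tv_demo(v) for v in ["True", "False", "None", True]]
print(demo_output)

# With the notebook's dataframes this could be applied in a loop, e.g.:
# for col, label in bool_labels.items():
#     df_train[col] = df_train[col].apply(make_bool_mapper(label))
#     df_test[col] = df_test[col].apply(make_bool_mapper(label))
```

Keeping the labels in one dictionary also makes it easier to review the exact wording fed to the model in a single place.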


In [None]:
def map_alcohol(value):
    if value == "beer_and_wine":
        return "Beer and Wine"
    elif value == "full_bar":
        return "Full bar"
    else:
        return ""  # For None or anything else

df_train["attributes.Alcohol"] = df_train["attributes.Alcohol"].apply(map_alcohol)
df_test["attributes.Alcohol"] = df_test["attributes.Alcohol"].apply(map_alcohol)


In [None]:
type(df_train["attributes.GoodForMeal"].iloc[0])


'Good for Meal' and 'Ambience' are dictionaries stored as strings, which needed more substantial reformatting: for example, 'casual: True, romantic: False' for Ambience becomes just 'casual', and 'breakfast: False, dinner: True' for Good for Meal becomes 'Good for dinner'. This removes unnecessary words for the False entries, since the model can only take in so many words.


In [None]:
import re
import json

# Step 1: Convert the string into a usable dictionary
def parse_good_for_meal(row):
    if not isinstance(row, str) or row.strip() == "":
        return {}
    try:
        # Add quotes around keys and convert True/False to JSON compatible format
        fixed = re.sub(r'(\w+):', r'"\1":', row)  # keys
        fixed = fixed.replace("True", "true").replace("False", "false")  # booleans
        return json.loads(fixed)
    except Exception as e:
        # print(f"Failed to parse: {row}")
        return {}

# Step 2: Create readable label from True values
def label_good_meals(meal_dict):
    return ', '.join([f"Good for {k}" for k, v in meal_dict.items() if v is True])

# Step 3: Apply both steps
df_train["GoodForMeal_dict"] = df_train["attributes.GoodForMeal"].apply(parse_good_for_meal)
df_train["attributes.GoodForMeal"] = df_train["GoodForMeal_dict"].apply(label_good_meals)

df_test["GoodForMeal_dict"] = df_test["attributes.GoodForMeal"].apply(parse_good_for_meal)
df_test["attributes.GoodForMeal"] = df_test["GoodForMeal_dict"].apply(label_good_meals)
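
To show the two parsing steps on a single value, here is a short sketch on a made-up string. It assumes the quotes around keys were already removed by the earlier character-stripping step, so the raw value looks like `{lunch: True, ...}`; the helper name `parse_flag_dict` and the sample string are hypothetical.

```python
import json
import re


def parse_flag_dict(row):
    """Same steps as the parser above: quote the keys, lowercase the booleans, load as JSON."""
    fixed = re.sub(r'(\w+):', r'"\1":', row)
    fixed = fixed.replace("True", "true").replace("False", "false")
    return json.loads(fixed)


# Hypothetical raw value (quotes already stripped)
raw = "{dessert: False, latenight: False, lunch: True, dinner: True}"
parsed = parse_flag_dict(raw)
label = ', '.join(f"Good for {k}" for k, v in parsed.items() if v is True)
print(label)

# A value of None is not valid JSON, so such a string fails to parse
try:
    parse_flag_dict("{lunch: None}")
    none_fails = False
except json.JSONDecodeError:
    none_fails = True
print("None value fails to parse:", none_fails)
```

Because the notebook's parsers return `{}` on any exception, a dictionary string containing a `None` value would end up as an empty label. If that occurs in the data, also replacing `None` with `null` before `json.loads` would be one way to handle it.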



In [None]:
import re
import json

# Step 1: Convert the string into a usable dictionary
def parse_ambience(row):
    if not isinstance(row, str) or row.strip() == "":
        return {}
    try:
        # Add quotes around keys and convert True/False to JSON compatible format
        fixed = re.sub(r'(\w+):', r'"\1":', row)  # keys
        fixed = fixed.replace("True", "true").replace("False", "false")  # booleans
        return json.loads(fixed)
    except Exception as e:
        # print(f"Failed to parse: {row}")
        return {}

# Step 2: Create readable label from True values
def label_ambience(ambience_dict):
    return ', '.join([f"{k}" for k, v in ambience_dict.items() if v is True])

# Step 3: Apply both steps
df_train["Ambience_dict"] = df_train["attributes.Ambience"].apply(parse_ambience)
df_train["attributes.Ambience"] = df_train["Ambience_dict"].apply(label_ambience)

df_test["Ambience_dict"] = df_test["attributes.Ambience"].apply(parse_ambience)
df_test["attributes.Ambience"] = df_test["Ambience_dict"].apply(label_ambience)


In [None]:
df_train.drop(["GoodForMeal_dict", "Ambience_dict"], axis=1, inplace=True)
df_test.drop(["GoodForMeal_dict", "Ambience_dict"], axis=1, inplace=True)

Check examples for Good For Meal and Ambience. They have many unique values, so I check a sample row here rather than listing unique values as for the other attributes below:

In [None]:
print(df_train["attributes.GoodForMeal"].head(1))

In [None]:
print(df_train["attributes.Ambience"].head(1))

Check new unique values for attribute features:

In [None]:
for col in df_train.columns:
    if col.startswith("attributes.") and col not in ["attributes.GoodForMeal", "attributes.Ambience"]:
        print(f"\nColumn: {col}")
        print(df_train[col].dropna().astype(str).unique().tolist())


Although some restaurant-specific features were too granular or not useful as model inputs, I hypothesized that 'name' would be very important. The words in a restaurant name, whether they are French (suggesting Canadian) or directly related to the restaurant type (e.g., 'Tacos'), add useful information to the model.

In [None]:
print(df_train['name'].head(10))

#EDA

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/EDA.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Bar%20Plots.png)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a custom color palette
custom_palette = [
    '#8dd3c7',
    '#bebada',
    '#ab8072',
    '#80b1d3',
    '#fdb462',
    '#b3de69',
    '#fccde5',
    '#d9d9d9',
    '#bc80bd',
    '#17becf',
    '#aec7e8',
    '#bbb005']

sns.palplot(custom_palette)
plt.show()

Function for creating bar plots of Features (Restaurant Features) vs. Labels (Restaurant Type):

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

label_col = 'label'
feature_cols = [
    'attributes.OutdoorSeating',
    'attributes.BusinessAcceptsCreditCards',
    'attributes.RestaurantsTableService',
    'attributes.RestaurantsReservations',
    'attributes.RestaurantsPriceRange2',
    'attributes.WiFi',
    'attributes.NoiseLevel',
    'state',
    'attributes.Alcohol',
    'attributes.HasTV',
    'attributes.RestaurantsGoodForGroups',
    'attributes.Caters',
    'attributes.RestaurantsTakeOut',
    'attributes.RestaurantsAttire',
    'attributes.RestaurantsDelivery',
    'attributes.GoodForKids',
    'attributes.BusinessAcceptsBitcoin'
]

# In case want to view unknowns
df_plot = df_train.copy()
df_plot[feature_cols] = df_plot[feature_cols].replace('', 'Unknown').fillna("Unknown")
df_plot[label_col] = df_plot[label_col].replace('', 'Unknown').fillna("Unknown")

# Plotting function
def plot_categorical_barplots(
    data,
    features,
    x_axis='feature',
    hue_axis='label',
    percentage=False,
    drop_unknown=False,
    title_suffix=""
):
    n_cols = 3
    n_rows = -(-len(features) // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows))
    axes = axes.flatten()
    custom_palette = [
    '#8dd3c7',
    '#bebada',
    '#ab8072',
    '#80b1d3',
    '#fdb462',
    '#b3de69',
    '#fccde5',
    '#d9d9d9',
    '#bc80bd',
    '#17becf',
    '#aec7e8',
    '#bbb005']

# sns.palplot(custom_palette)
# plt.show()

    for i, col in enumerate(features):
        ax = axes[i]

        x_col = col if x_axis == 'feature' else label_col
        hue_col = label_col if x_axis == 'feature' else col

        plot_data = data.copy()
        if drop_unknown:
            plot_data = plot_data[
                (plot_data[x_col] != 'Unknown') & (plot_data[hue_col] != 'Unknown')
            ]

        # Build custom palette
        unique_values = plot_data[hue_col].unique()
        has_false = 'False' in unique_values

        if has_false:
            palette = {'False': 'gray'} #gray for Falses
            other_values = sorted([val for val in unique_values if val != 'False'])
            other_colors = sns.color_palette('Set2', len(other_values))
            palette.update({val: color for val, color in zip(other_values, other_colors)})

        else: #custom palette
            other_values = sorted(unique_values)
            if len(other_values) > len(custom_palette):
                raise ValueError("Not enough colors in custom_palette for the number of unique values.")
            palette = {val: custom_palette[i] for i, val in enumerate(other_values)}


        if percentage: #for % of total (vs. count)
            count_df = plot_data.groupby([x_col, hue_col]).size().reset_index(name='count')
            total_per_x = count_df.groupby(x_col)['count'].transform('sum')
            count_df['percent'] = (count_df['count'] / total_per_x) * 100

            sns.barplot(
                data=count_df,
                x=x_col,
                y='percent',
                hue=hue_col,
                palette=palette,
                ax=ax
            )
            ax.set_ylabel('Percent (%)')
        else: # by raw counts
            sns.countplot(
                data=plot_data,
                x=x_col,
                hue=hue_col,
                palette=palette,
                ax=ax
            )
            ax.set_ylabel('Count')

        ax.set_title(f"{col.replace('attributes.', '').replace('_', ' ')} {title_suffix}", fontsize=11)
        ax.set_xlabel('')
        ax.tick_params(axis='x', rotation=30)
        ax.legend(title=hue_col, fontsize=8)

    # Remove unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()


Bar plots for each feature, with restaurant type (label) on the x-axis and bars color-coded by feature value, in terms of raw counts to see volume:

In [None]:
#Count
plot_categorical_barplots(
    df_plot,
    features=feature_cols,
    x_axis='label',
    hue_axis='feature',
    percentage=False,
    # palette="Set2",
    drop_unknown=True,
    title_suffix='(Feature)'
)

The same bar plots in terms of % of total within each restaurant type:

In [None]:

# Percent of Total (excluding unknowns)
plot_categorical_barplots(
    df_plot,
    features=feature_cols,
    x_axis='label',
    hue_axis='feature',
    percentage=True,
    # palette="Set2",
    drop_unknown=True,
    title_suffix='(Feature % of Total)'
)
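
The percentage view divides each (label, feature value) count by the total for that label, so the bars within one restaurant type sum to 100%. A minimal plain-Python sketch of that calculation, on made-up rows (the labels and values below are hypothetical):

```python
from collections import Counter

# Hypothetical (label, feature value) rows
rows = [
    ("mexican", "Full bar"),
    ("mexican", "Full bar"),
    ("mexican", "Beer and Wine"),
    ("italian", "Full bar"),
]

pair_counts = Counter(rows)                          # count per (label, value)
label_totals = Counter(label for label, _ in rows)   # total per label (the x-axis group)

# Percent of each value within its label
percent = {
    pair: 100 * count / label_totals[pair[0]]
    for pair, count in pair_counts.items()
}

for (label, value), pct in percent.items():
    print(f"{label:8s} {value:14s} {pct:.1f}%")
```

This mirrors the `groupby(...).size()` followed by `transform('sum')` in the plotting function when `x_axis='label'`.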

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Chi-Squared%20Tests.png)

Perform chi-squared tests to isolate the most relevant features for predicting Restaurant Type. This isn't an exact science: NLP models use word embeddings, where the words of a feature matter in relation to other words rather than in isolation, so this is just a way to get good guesses at what might help to include.

Make a copy of the dataframe for the chi-squared to make sure to keep df_train intact:

In [None]:
df_train_chi=df_train.copy()
df_train_chi.columns

Ambience, Name, Good For Meal, and the review text would be too lengthy to convert to dummies, so the chi-squared dataframe excludes them and the remaining features are converted to dummies:

In [None]:
df_train_chi = df_train_chi.drop(['attributes.Ambience','name', 'review', 'text', 'attributes.GoodForMeal'], axis=1)

Make dummy variables for chi squared:

In [None]:
# Select the categorical columns to encode (attribute and state columns)
space_cols = [
    col for col in df_train_chi.columns
    if (col.startswith("attributes.") or col.startswith("state")) and col != 'label']

# Make dummies for chi squared
df_train_chi = pd.get_dummies(df_train_chi, columns=space_cols, dummy_na=False)

# Clean up naming
df_train_chi.columns = df_train_chi.columns.str.strip().str.replace("attributes.", "", regex=False)
df_train_chi.columns = df_train_chi.columns.str.replace('2_', '', regex=False)

print(df_train_chi.columns)


Run chi-squared test of features vs. label as a whole:

For each feature, it tests whether the distribution of that feature differs across all classes (i.e., is dependent on the label).

High chi-squared -> feature distribution is very different across labels → good at distinguishing between multiple classes.

In [None]:
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# Ensure X and y are set correctly
X_chi = df_train_chi.drop('label', axis=1).values  # Feature matrix
y_chi = df_train_chi['label'].values               # Target labels

# If y is not already numeric, encode it
le = LabelEncoder()
y_chi_encoded = le.fit_transform(y_chi)

# Run Chi-squared test
chi2_stats, p_values = chi2(X_chi, y_chi_encoded)

# Create a DataFrame of results
feature_names = df_train_chi.drop('label', axis=1).columns
chi2_df = pd.DataFrame({
    'feature': feature_names,
    'chi2_stat': chi2_stats,
    'p_value': p_values
})

# Sort by chi2_stat to get top features
chi2_df_sorted = chi2_df.sort_values(by='chi2_stat', ascending=False)
print(chi2_df_sorted.head(20))  # Top 20 features overall
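
For intuition about the score, the sketch below computes a chi-squared statistic by hand for one dummy feature on made-up data. It follows my understanding of how `sklearn.feature_selection.chi2` scores a non-negative feature: the observed per-class sum of the feature is compared with the sum expected if the feature were independent of the class. Treat it as an illustration under that assumption, not a reimplementation of the library; the labels and values are hypothetical.

```python
# Hypothetical labels and one dummy (0/1) feature column
labels = ["mexican", "mexican", "italian", "italian", "italian", "mexican"]
feature = [1, 1, 0, 0, 1, 0]

n = len(labels)
feature_total = sum(feature)

chi2_stat = 0.0
for cls in sorted(set(labels)):
    class_prob = labels.count(cls) / n                              # share of rows in this class
    observed = sum(x for x, y in zip(feature, labels) if y == cls)  # feature sum within the class
    expected = class_prob * feature_total                           # sum expected under independence
    chi2_stat += (observed - expected) ** 2 / expected

print(round(chi2_stat, 4))
```

Here each class is expected to hold 1.5 of the 3 positive rows, and the observed sums are 1 and 2, so the statistic is small: the feature barely separates the two classes.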


For each class (Restaurant type) individually, tests whether the feature distinguishes that class vs all others (for optimization of individual class prediction if one or more in particular aren't performing well):

In [None]:
from sklearn.feature_selection import chi2

results_per_class = {}

for class_label in np.unique(y_chi_encoded):
    # Create binary target: 1 if current class, 0 otherwise
    y_binary = (y_chi_encoded == class_label).astype(int)

    chi2_stats, p_values = chi2(X_chi, y_binary)

    results_df = pd.DataFrame({
        'feature': feature_names,
        'chi2_stat': chi2_stats,
        'p_value': p_values
    }).sort_values(by='chi2_stat', ascending=False)

    results_per_class[le.inverse_transform([class_label])[0]] = results_df

# Top 5 features for each class
print("Top features for each class:")
for label in le.classes_:
    print(f"\nTop features for class '{label}':")
    print(results_per_class[label].head(5))


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Build DataFrame (top N features per class)
combined_rows = []
top_n = 5

for class_label in le.classes_:
    top_features = results_per_class[class_label].head(top_n).copy()
    top_features['class_label'] = class_label

    # Add a unique ID to feature names to avoid duplicates
    top_features['feature_display'] = top_features['feature'] + f" ({class_label})"

    combined_rows.append(top_features)

combined_df = pd.concat(combined_rows, ignore_index=True)

# Plot using catplot
g = sns.catplot(
    data=combined_df,
    x='chi2_stat',
    y='feature',
    col='class_label',
    kind='bar',
    col_wrap=2,
    height=2,
    aspect=3.5,
    sharey=False,
    color='#C8A2C8'
)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
palette=sns.diverging_palette(280, 40, s=70, l=50, n=20, as_cmap=False)
sns.barplot(data=chi2_df_sorted.head(20), x='chi2_stat', y='feature',hue='feature', palette=palette, legend=False)


plt.title('Top Features Associated with Label (Chi-Squared Test)')
plt.tight_layout()
plt.show()

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Finalize%20Data.png)

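Final edit: replace the string "False" with a missing value in the attribute columns (all except `GoodForMeal` and `Ambience`) so that it is left out of the model input: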

In [None]:
for col in df_train.columns:
    if col.startswith("attributes.") and col not in ["attributes.GoodForMeal", "attributes.Ambience"]:
        print(f"\nColumn: {col}")

        # Replace string "False" with NaN
        df_train[col] = df_train[col].replace("False", pd.NA)

        unique_vals = df_train[col].dropna().astype(str)
        print(unique_vals.unique().tolist())


#Dimension Reduction

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Dimension%20Reduction.png)

Drop additional features:

*   stars - not in top features
*   review - dropped because it was renamed to 'text', the field name the tokenizing function expects later
*   table service - not in top features, likely correlated with price range
*   wifi, accepts reservations - included originally, but the model did better without them
*   accepts credit cards - its null value kept ranking near the top, so I think it was adding noise

I tested a handful of combinations, but did not have the compute resources for an exhaustive search.

In [None]:
columns_to_drop = [
                  # 'name',
                  # 'state',
                  'stars',
                  'review', #bc renamed to 'text'
                  'attributes.RestaurantsTableService',
                  # 'attributes.RestaurantsAttire',
                  # 'attributes.RestaurantsDelivery',
                  # 'attributes.GoodForKids',
                  # 'attributes.GoodForMeal'
                  # 'attributes.RestaurantsGoodForGroups',
                  # 'attributes.HasTV',
                  # 'attributes.BusinessAcceptsBitcoin',
                  # 'attributes.RestaurantsTakeOut',
                  # 'attributes.Ambience',
                  # 'attributes.OutdoorSeating',
                  'attributes.BusinessAcceptsCreditCards',
                  'attributes.RestaurantsReservations',
                  # 'attributes.RestaurantsPriceRange2',
                  'attributes.WiFi',
                  # 'attributes.NoiseLevel',
                  # 'attributes.Alcohol',
                  # 'attributes.Caters'
                  ]

df_train = df_train.drop(columns_to_drop, axis=1)
df_test = df_test.drop(columns_to_drop, axis=1)

In [None]:
df_train.columns

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Concatenating%20Features.png)

This step finalizes the data that goes into the model. The restaurant review (called 'text' in the data) is the primary input for predicting restaurant type. The other features I formatted and kept are supplements to the review, so I wanted them to be as succinct and meaningful as possible. The model therefore predicts the type from 'review + other features'. To get a single input field for the natural language model, the words of the additional features are concatenated to the review to create one long piece of text. Note that a separator between the features (a period here) helps the model keep unrelated pieces of text apart.

View restaurant review field separately first before concatenating with other features:

One example of "text" value (one restaurant review) before adding features to it:

In [None]:
print(df_train['text'].iloc[0])

Concatenate 'text' (review) field with other features:

In [None]:
# Get all column names except 'label'
fields_to_concat = [col for col in df_train.columns if col != 'label']

# Concatenate with periods between values
df_train["concatenated"] = df_train[fields_to_concat]\
    .fillna("").astype(str)\
    .apply(lambda row: ".".join(row.values), axis=1)


In [None]:
# Ensure all fields are strings and fill NaNs with empty string
df_train[fields_to_concat] = df_train[fields_to_concat].fillna('').astype(str)

# Concatenate only non-empty fields, separated by '. '
df_train["text_review_and_features"] = df_train[fields_to_concat].apply(
    lambda row: '. '.join([val for val in row if val.strip() != '']),
    axis=1
)


In [None]:
# Ensure all fields are strings and fill NaNs with empty string
df_test[fields_to_concat] = df_test[fields_to_concat].fillna('').astype(str)

# Concatenate only non-empty fields, separated by '. '
df_test["text_review_and_features"] = df_test[fields_to_concat].apply(
    lambda row: '. '.join([val for val in row if val.strip() != '']),
    axis=1
)
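
The cells above build the combined field in two ways: a plain `".".join(...)` over every value, and a join that skips empty values. The sketch below shows the difference on one made-up row; the helper names `join_all` and `join_non_empty` and the row values are hypothetical, and `None` stands in for a missing value.

```python
def join_all(values):
    # Mirrors the first approach: every field is kept, even if empty
    return ".".join("" if v is None else str(v) for v in values)


def join_non_empty(values):
    # Mirrors the second approach: blank fields are dropped before joining
    cleaned = ["" if v is None else str(v) for v in values]
    return ". ".join(v for v in cleaned if v.strip() != "")


toy_row = ["Great tacos", None, "casual", "", "full_bar"]

print(join_all(toy_row))        # Great tacos..casual..full_bar
print(join_non_empty(toy_row))  # Great tacos. casual. full_bar
```

Skipping empty values avoids runs of separators, which would otherwise add punctuation tokens that carry no information.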


This is the final text field that will be fed into the natural language model. It is a concatenation of features (as text) joined to the restaurant review text, which will be used as one big text field to predict the restaurant type:

One example of new "text" value (restaurant review + additional features):

In [None]:
print(df_train['text_review_and_features'].iloc[0])

In [None]:
print(df_test['text_review_and_features'].iloc[0])

#Average Embedding Models

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Average%20Embedding%20Models.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Load%20Libraries.png)

Load libraries for Average Embedding Models:

In [None]:
import pickle
import pandas as pd
import itertools
from collections import Counter
import numpy as np
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from gensim.models import word2vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

import os
import string


# NLTK library downloads
nltk.download('stopwords')
nltk.download('punkt_tab')

##Data Pre-Processing

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Data%20Pre-Processing.png)

Data Pre-Processing for Average Word Embeddings:


1.   TEXT PRE-PROCESSING | preprocess_df: Clean the data of stopwords, punctuation and single-character words
2.   TOKENIZATION | tokenize_text_column: Tokenize the training and test sentences (with the NLTK tokenizer)
3.   BUILD VOCABULARY | build_vocab: Build the vocabulary from the tokenized data
4.   INPUT DATA | inp_data: Map each tokenized sentence to a list of word indices using the vocabulary
5.   WORD EMBEDDINGS | get_embeddings: Train a Word2Vec skip-gram model to get the word embedding weight matrix (W) from the vocabulary and input data
6.   AVERAGE EMBEDDING VECTORS | average_embedding_vectors: Build average word embedding vectors for the training and test sets (1 vector per review)



![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Functions.png)

Functions for Steps 1-3 and 5-6 (Step 4 is done inline when the models are run):

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Text%20Pre-Processing.png)

Function to pre-process raw (English) text:

In [None]:
def preprocess_df(df):
    # get English stopwords
    stop_words = set(stopwords.words('english'))
    stop_words.add('would')
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    preprocessed_sentences = []
    for i, row in df.iterrows():
        sent = row["text"]
        sent_nopuncts = sent.translate(translator)
        words_list = sent_nopuncts.strip().split()
        filtered_words = [word for word in words_list if word not in stop_words and len(word) != 1]
        preprocessed_sentences.append(" ".join(filtered_words))
    df["text"] = preprocessed_sentences
    return df
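
A small sketch of what the cleaning step does to a single sentence, using the same `str.translate` and filter logic as `preprocess_df`. The tiny `STOP_WORDS` set and the `clean` helper are hypothetical stand-ins (the notebook uses NLTK's full English list), chosen so the example runs without downloads.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "was", "and", "it", "would"}

# Replace every punctuation character with a space
translator = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(sentence, stop_words=STOP_WORDS):
    words = sentence.translate(translator).strip().split()
    # Drop stop words and single-character leftovers such as "I" or "d"
    return " ".join(w for w in words if w not in stop_words and len(w) != 1)


cleaned = clean("The pizza was great, and I'd order it again!")
print(cleaned)  # The pizza great order again
```

Note that "The" survives because the membership test is case-sensitive and the stop words are lowercase. Lowercasing the text first would remove it, but that would be a change to the original pipeline rather than a description of it.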

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Build%20Vocabulary.png)

Function to build a vocabulary based on descending word frequencies:

In [None]:
def build_vocab(sentences):
    word_counts = Counter(itertools.chain(*sentences))
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return word_counts, vocabulary, vocabulary_inv
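
A toy run of the same vocabulary logic, followed by the index mapping used later as Step 4. The sentences are made up for illustration.

```python
from collections import Counter
import itertools

toy_sentences = [["pizza", "pasta", "pizza"], ["sushi", "pizza", "pasta"]]

# Count every token, then order words from most to least frequent
toy_counts = Counter(itertools.chain(*toy_sentences))
toy_vocab_inv = [word for word, _ in toy_counts.most_common()]
toy_vocab = {word: i for i, word in enumerate(toy_vocab_inv)}

# Step 4: rewrite each sentence as a list of vocabulary indices
toy_inp_data = [[toy_vocab[w] for w in s] for s in toy_sentences]

print(toy_vocab_inv)  # ['pizza', 'pasta', 'sushi']
print(toy_inp_data)   # [[0, 1, 0], [2, 0, 1]]
```

Because indices follow descending frequency, index 0 is always the most common word in the training text.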

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Word%20Embeddings.png)

Function to learn word embeddings with gensim's Word2Vec in skip-gram mode (builds the weight matrix W):

In [None]:

def get_embeddings(inp_data, vocabulary_inv, size_features=100, #100 features originally
                   mode='skipgram',
                   min_word_count=2, #2 originally
                   context=5): #5 originally
    num_workers = 15  #15 originally; gensim is only fully deterministic with workers=1
    downsampling = 1e-3  # Downsample setting for frequent words
    print('Training Word2Vec model...')
    # Map the index sentences back to words for gensim
    sentences = [[vocabulary_inv[w] for w in s] for s in inp_data]
    if mode == 'skipgram':
        sg = 1
        print('Model: skip-gram')
    elif mode == 'cbow':
        sg = 0
        print('Model: CBOW')
    else:
        raise ValueError(f"mode must be 'skipgram' or 'cbow', got {mode!r}")
    embedding_model = word2vec.Word2Vec(sentences, workers=num_workers,
                                        sg=sg,
                                        vector_size=size_features,
                                        min_count=min_word_count,
                                        window=context,
                                        sample=downsampling,
                                        seed=42)
    print('Word2Vec training complete (model kept in memory, not saved to disk)')
    embedding_weights = np.zeros((len(vocabulary_inv), size_features)) #W
    for i in range(len(vocabulary_inv)):
        word = vocabulary_inv[i]
        if word in embedding_model.wv:
            embedding_weights[i] = embedding_model.wv[word]
        else:
            # Words below min_count have no learned vector; use a small random one
            embedding_weights[i] = np.random.uniform(-0.25, 0.25,
                                                     embedding_model.vector_size)
    return embedding_weights
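
For intuition about the `window` (here `context`) parameter, the sketch below lists the (center, context) word pairs a skip-gram model learns from. This is a simplified illustration with a hypothetical helper, `skipgram_pairs`. As far as I know, gensim's actual training also randomly shrinks the window and subsamples frequent words, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window=2):
    """Return every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


toy_pairs = skipgram_pairs(["best", "wood", "fired", "pizza"], window=1)
for pair in toy_pairs:
    print(pair)
```

A larger window pairs each word with more distant neighbours, which tends to capture topic more than local word order.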

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Tokenization.png)

In [None]:
from nltk.tokenize import word_tokenize

def tokenize_text_column(df, column="text"):
    """
    Tokenizes each row in the specified text column using NLTK's word_tokenize.
    """
    return [word_tokenize(str(text)) for text in df[column]]

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Average%20Embedding%20Vectors.png)

In [None]:
def average_embedding_vectors(tokenized_docs, embedding_weights, vocabulary):
    vectors = []
    embedding_dim = embedding_weights.shape[1]

    for doc in tokenized_docs:
        vec = np.zeros(embedding_dim)
        count = 0
        for word in doc:
            if word in vocabulary:
                vec += embedding_weights[vocabulary[word]]
                count += 1
        if count > 0:
            vec /= count
        vectors.append(vec)

    return vectors
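
A plain-Python version of the same averaging idea on two-dimensional toy embeddings. The embedding values and the `average_vector` helper are made up for illustration.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        if token in embeddings:
            total = [a + b for a, b in zip(total, embeddings[token])]
            count += 1
    # A review with no known words stays as the zero vector
    return [v / count for v in total] if count else total


toy_embeddings = {"pizza": [1.0, 0.0], "pasta": [0.0, 1.0]}

review_vec = average_vector(["pizza", "pasta", "gelato"], toy_embeddings, dim=2)
empty_vec = average_vector(["gelato"], toy_embeddings, dim=2)
print(review_vec)  # [0.5, 0.5]
print(empty_vec)   # [0.0, 0.0]
```

Averaging discards word order, so "pizza pasta" and "pasta pizza" map to the same vector. That is one reason the transformer models later in the notebook can do better.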

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%201.png)

##Model 1

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Run%20Pre-Processing.png)

In [None]:
import random
import numpy as np

random.seed(42)
np.random.seed(42)

In [None]:
# Step 1 Pre-process/tokenize "text"
df_train_1 = preprocess_df(df_train)
df_test_1 = preprocess_df(df_test)

In [None]:
# Step 2 Tokenize sentences
train_1_tokenized = tokenize_text_column(df_train_1, column="text")
test_1_tokenized  = tokenize_text_column(df_test_1, column="text")

In [None]:
# Step 3 Build Vocabulary of "text"
word_counts, vocabulary, vocabulary_inv = build_vocab(train_1_tokenized)

In [None]:
# use the above mapping to create input data (indices only) to build sentences of indices
from datetime import datetime

model_1_start = datetime.now()

# Step 4 Create Input Data
inp_data = [[vocabulary[word] for word in text] for text in train_1_tokenized] #tagged data is in sentences
# get embedding vector

# Step 5 Word Embeddings
embedding_weights = get_embeddings(inp_data, vocabulary_inv)

model_1_end = datetime.now()

model_1_time = model_1_end - model_1_start
print(f"Model 1 training time: {model_1_time.total_seconds():.2f} seconds")

#Trains Word2Vec model -> not using pre-trained embeddings, training them here

In [None]:
print(embedding_weights[0])

In [None]:
#Step 6 Create Average Word Embedding Vectors
train_vec_1 = average_embedding_vectors(train_1_tokenized, embedding_weights, vocabulary)
test_vec_1  = average_embedding_vectors(test_1_tokenized, embedding_weights, vocabulary)


In [None]:
print(train_1_tokenized[0]) #tagged data is in sentences
# print(vocabulary)

In [None]:
print(inp_data[0])

In [None]:
print(f"{'word_counts':<15} {'vocabulary':<15} {'vocabulary_inv':<15}")
print("-" * 45)

for i, word in enumerate(vocabulary_inv[:10]):  # First 10 entries
    count = word_counts[word]
    index = vocabulary[word]
    print(f"{str(count):<15} {str({word: index}):<15} {word:<15}")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Split%20Data.png)

Split data into 80/10/10 Train, Validation, Test split:

In [None]:
X_1 = train_vec_1
y_1 = df_train_1['label']

# Train (80%), Temp (20%) split
X_train_1, X_temp_1, y_train_1, y_temp_1 = train_test_split(
    X_1,y_1,
    test_size=0.2,
    random_state=42,
    stratify=y_1
)

# Temp → Validation (10%), Test (10%) split
X_val_1, X_test_1, y_val_1, y_test_1 = train_test_split(
    X_temp_1,
    y_temp_1,
    test_size=0.5,
    random_state=42,
    stratify=y_temp_1
)

# Delete temp vars and print shapes with % of total
del X_temp_1, y_temp_1
total_samples = len(df_train_1)

X_train_1_for_shape = np.array(X_train_1)
y_train_1_for_shape = np.array(y_train_1)
X_val_1_for_shape = np.array(X_val_1)
y_val_1_for_shape = np.array(y_val_1)
X_test_1_for_shape = np.array(X_test_1)
y_test_1_for_shape = np.array(y_test_1)


print(f"the shape of the training set (input) is: {X_train_1_for_shape.shape} ({len(X_train_1_for_shape)/total_samples:.1%} of total)")
print(f"the shape of the training set (target) is: {y_train_1_for_shape.shape} ({len(y_train_1_for_shape)/total_samples:.1%} of total)\n")

print(f"the shape of the validation set (input) is: {X_val_1_for_shape.shape} ({len(X_val_1_for_shape)/total_samples:.1%} of total)")
print(f"the shape of the validation set (target) is: {y_val_1_for_shape.shape} ({len(y_val_1_for_shape)/total_samples:.1%} of total)\n")

print(f"the shape of the test set (input) is: {X_test_1_for_shape.shape} ({len(X_test_1_for_shape)/total_samples:.1%} of total)")
print(f"the shape of the test set (target) is: {y_test_1_for_shape.shape} ({len(y_test_1_for_shape)/total_samples:.1%} of total)")
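
To show what `stratify` buys in the split above, here is a minimal pure-Python sketch of a stratified 80/10/10 split. The helper `stratified_three_way` and the toy labels are hypothetical, and scikit-learn's own allocation differs in its details. The point is only that each class is split separately, so class proportions carry over to every subset.

```python
import random
from collections import defaultdict


def stratified_three_way(labels, train=0.8, val=0.1, seed=42):
    """Split row indices into train/val/test while keeping class proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * train)
        n_val = round(len(idxs) * val)
        splits["train"] += idxs[:n_train]
        splits["val"] += idxs[n_train:n_train + n_val]
        splits["test"] += idxs[n_train + n_val:]
    return splits


toy_labels = ["a"] * 20 + ["b"] * 10
toy_splits = stratified_three_way(toy_labels)
print({name: len(idxs) for name, idxs in toy_splits.items()})
# {'train': 24, 'val': 3, 'test': 3}
```

The notebook gets the same 80/10/10 proportions in two steps: `test_size=0.2` holds out 20%, and `test_size=0.5` then halves that hold-out into validation and test.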


![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Train%20Model.png)

Fit a Logistic Regression classifier on the average word embedding vectors to predict the labels:

In [None]:
clf_val_1 = LogisticRegression(max_iter=1000).fit(X_train_1, y_train_1)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Accuracy_%20Validation%20Set.png)

Evaluate Accuracy of model's Restaurant Predictions:

In [None]:
preds_val_1 = clf_val_1.predict(X_val_1)

val_accuracy_1 = accuracy_score(y_val_1, preds_val_1)
print(f"Validation Accuracy: {val_accuracy_1:.4f}")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Classification%20Report_%20Validation%20Set.png)

Evaluate accuracy on individual Restaurant Types:

*Model improvements were optimized for the F1 score during modeling*

In [None]:
classification_report_val_1 = classification_report(y_val_1, preds_val_1)
print(classification_report_val_1)
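
For reference, the per-class numbers in the report can be derived by hand from the true and predicted labels. The sketch below does this for a made-up example; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 for each class, computed from raw counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


toy_true = ["pizza", "pizza", "sushi", "sushi", "sushi"]
toy_pred = ["pizza", "sushi", "sushi", "sushi", "pizza"]

toy_report = per_class_scores(toy_true, toy_pred)
for label, row in toy_report.items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low. That makes it a stricter target than accuracy when classes are imbalanced.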

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Confusion%20Matrix_%20Validation%20Set.png)

Run confusion matrix to see where exactly predictions are going wrong:

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np
import matplotlib.pyplot as plt

cm = confusion_matrix(y_val_1, preds_val_1, labels=clf_val_1.classes_)
cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # row-wise %

disp = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=clf_val_1.classes_)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Pastel2_r, values_format=".2f", ax=ax)
# Improve readability
for text in ax.texts:
    text.set_color("black")

plt.setp(ax.get_xticklabels(), rotation=45, ha='right')

plt.title("Confusion Matrix (Percentages)")
plt.tight_layout()
plt.show()
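
The row-wise normalization used in the plot divides each row of counts by that row's total, so every row sums to 1 and the diagonal reads as per-class recall. A small sketch on made-up labels (both helper names are hypothetical):

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total; an all-zero row is left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


toy_labels_cm = ["pizza", "sushi"]
toy_true_cm = ["pizza", "pizza", "pizza", "pizza", "sushi", "sushi"]
toy_pred_cm = ["pizza", "pizza", "pizza", "sushi", "sushi", "pizza"]

toy_cm = confusion_counts(toy_true_cm, toy_pred_cm, toy_labels_cm)
toy_cm_percent = normalize_rows(toy_cm)
print(toy_cm)          # [[3, 1], [1, 1]]
print(toy_cm_percent)  # [[0.75, 0.25], [0.5, 0.5]]
```

In this toy case the raw counts make both classes look similar off the diagonal (one error each), while the normalized rows show that "sushi" is misclassified half of the time.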

##Model 2

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%202.png)

Run the pre-processing steps on the Restaurant Review + Other Features field this time (instead of the review only, as in Model 1):

In [None]:
import random
import numpy as np

random.seed(42)
np.random.seed(42)

In [None]:
model_2_start = datetime.now()

df_train_2=df_train.copy()
df_train_2["text"]=df_train_2["text_review_and_features"]

df_test_2=df_test.copy()
df_test_2["text"]=df_test_2["text_review_and_features"]

# Step 1 Pre-process/tokenize "text"
df_train_2 = preprocess_df(df_train_2)
df_test_2 = preprocess_df(df_test_2)

# Step 2 Tokenize sentences
train_2_tokenized = tokenize_text_column(df_train_2, column="text")
test_2_tokenized  = tokenize_text_column(df_test_2, column="text")

# Step 3 Build Vocabulary of "text"
word_counts, vocabulary, vocabulary_inv = build_vocab(train_2_tokenized)

# Step 4 Create Input Data
inp_data = [[vocabulary[word] for word in text] for text in train_2_tokenized] #tagged data is in sentences

# Step 5 Word Embeddings
embedding_weights = get_embeddings(inp_data, vocabulary_inv)

#Step 6 Create Average Word Embedding Vectors
train_vec_2 = average_embedding_vectors(train_2_tokenized, embedding_weights, vocabulary)
test_vec_2  = average_embedding_vectors(test_2_tokenized, embedding_weights, vocabulary)

model_2_end = datetime.now()

model_2_time = model_2_end - model_2_start
print(f"Model 2 training time: {model_2_time.total_seconds():.2f} seconds")

In [None]:
print(train_2_tokenized[0]) #tagged data is in sentences

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Split%20Data.png)

Split data into 80/10/10 Train, Validation, Test split:

In [None]:
X_2=train_vec_2
y_2=df_train_2['label']

# Train (80%), Temp (20%) split
X_train_2, X_temp_2, y_train_2, y_temp_2 = train_test_split(
    X_2,y_2,
    test_size=0.2,
    random_state=42,
    stratify=y_2
)

# Temp → Validation (10%), Test (10%) split
X_val_2, X_test_2, y_val_2, y_test_2 = train_test_split(
    X_temp_2,
    y_temp_2,
    test_size=0.5,
    random_state=42,
    stratify=y_temp_2
)

# Delete temp vars and print shapes with % of total
del X_temp_2, y_temp_2
total_samples_2 = len(df_train_2)

X_train_2_for_shape = np.array(X_train_2)
y_train_2_for_shape = np.array(y_train_2)
X_val_2_for_shape = np.array(X_val_2)
y_val_2_for_shape = np.array(y_val_2)
X_test_2_for_shape = np.array(X_test_2)
y_test_2_for_shape = np.array(y_test_2)


print(f"the shape of the training set (input) is: {X_train_2_for_shape.shape} ({len(X_train_2_for_shape)/total_samples_2:.1%} of total)")
print(f"the shape of the training set (target) is: {y_train_2_for_shape.shape} ({len(y_train_2_for_shape)/total_samples_2:.1%} of total)\n")

print(f"the shape of the validation set (input) is: {X_val_2_for_shape.shape} ({len(X_val_2_for_shape)/total_samples_2:.1%} of total)")
print(f"the shape of the validation set (target) is: {y_val_2_for_shape.shape} ({len(y_val_2_for_shape)/total_samples_2:.1%} of total)\n")

print(f"the shape of the test set (input) is: {X_test_2_for_shape.shape} ({len(X_test_2_for_shape)/total_samples_2:.1%} of total)")
print(f"the shape of the test set (target) is: {y_test_2_for_shape.shape} ({len(y_test_2_for_shape)/total_samples_2:.1%} of total)")


![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Train%20Model.png)

Train a Logistic Regression model:

In [None]:
clf_val_2 = LogisticRegression(max_iter=1000).fit(X_train_2, y_train_2)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Accuracy_%20Validation%20Set.png)

Evaluate Accuracy:

In [None]:
preds_val_2 = clf_val_2.predict(X_val_2)

val_accuracy_2 = accuracy_score(y_val_2, preds_val_2)
print(f"Validation Accuracy: {val_accuracy_2:.4f}")

Evaluate Accuracy of individual Restaurant Type predictions:

In [None]:
classification_report_val_2 = classification_report(y_val_2, preds_val_2)
print(classification_report_val_2)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Confusion%20Matrix_%20Test%20Set.png)

Run confusion matrix to see where predictions are going wrong:


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np
import matplotlib.pyplot as plt

cm = confusion_matrix(y_val_2, preds_val_2, labels=clf_val_2.classes_)
cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # row-wise %

disp = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=clf_val_2.classes_)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Pastel2_r, values_format=".2f", ax=ax)

# Improve readability
for text in ax.texts:
    text.set_color("black")

plt.setp(ax.get_xticklabels(), rotation=45, ha='right')

plt.title("Confusion Matrix (Percentages)")
plt.tight_layout()
plt.show()

#BERT Transformer Models

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/BERT%20Transformer%20Models.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Load%20Libraries.png)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import Dataset
import torch
import random
import os

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Set%20Seed.png)

Set seeds for reproducibility. Note that results are still not 100% reproducible, as some randomness remains when training Hugging Face Transformer models (for example, from non-deterministic GPU operations):

In [None]:
def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # when using GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Encode%20Labels.png)

Encode labels as numbers (e.g., "Italian" -> class "1"):

In [None]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_train["label"].values)
num_classes = len(label_encoder.classes_)

In [None]:
print(df_train["label"].values)
print(y)
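
A plain-Python sketch of the same encoding on made-up labels. Like `LabelEncoder`, it assigns ids in sorted class order, so the id a class receives depends on which other classes are present.

```python
toy_restaurant_labels = ["Mexican", "Italian", "Japanese", "Italian"]

# Sorted unique classes define the id of each class
toy_classes = sorted(set(toy_restaurant_labels))
to_id = {name: i for i, name in enumerate(toy_classes)}

toy_encoded = [to_id[name] for name in toy_restaurant_labels]
toy_decoded = [toy_classes[i] for i in toy_encoded]  # inverse transform

print(toy_classes)  # ['Italian', 'Japanese', 'Mexican']
print(toy_encoded)  # [2, 0, 1, 0]
```

Keeping the class list around (as `label_encoder.classes_` does in the notebook) is what allows numeric predictions to be turned back into restaurant types for the reports.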

##Model 3

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%203.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Split%20Data.png)

Split the labeled training data into training (80%), validation (10%) and test (10%) sets. This is a common split for NLP datasets of this size (13k rows), and it performed better than a 70%/15%/15% split.

In [None]:
X_3=df_train["text"].values
y_3=y

# Train (80%), Temp (20%) split
X_train_3, X_temp_3, y_train_3, y_temp_3 = train_test_split(
    X_3,
    y_3,
    test_size=0.2,
    random_state=42,
    stratify=y_3
)

# Temp → Validation (10%), Test (10%) split
X_val_3, X_test_3, y_val_3, y_test_3 = train_test_split(
    X_temp_3,
    y_temp_3,
    test_size=0.5,
    random_state=42,
    stratify=y_temp_3
)

# Delete temp vars and print shapes with % of total
del X_temp_3, y_temp_3
total_samples_3 = len(df_train)

print(f"the shape of the training set (input) is: {X_train_3.shape} ({len(X_train_3)/total_samples_3:.1%} of total)")
print(f"the shape of the training set (target) is: {y_train_3.shape} ({len(y_train_3)/total_samples_3:.1%} of total)\n")

print(f"the shape of the validation set (input) is: {X_val_3.shape} ({len(X_val_3)/total_samples_3:.1%} of total)")
print(f"the shape of the validation set (target) is: {y_val_3.shape} ({len(y_val_3)/total_samples_3:.1%} of total)\n")

print(f"the shape of the test set (input) is: {X_test_3.shape} ({len(X_test_3)/total_samples_3:.1%} of total)")
print(f"the shape of the test set (target) is: {y_test_3.shape} ({len(y_test_3)/total_samples_3:.1%} of total)")


![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Tokenize%20Data.png)

Tokenize the text using the BERT tokenizer, capping each observation at 512 tokens:

*(I tested different lengths; 512 is the maximum the model accepts, and it performed best when it captured as much of the review as possible)*

In [None]:
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize_texts_with_bert(texts, max_length=512):
    # truncates (and pads) each row to max_length tokens, 512 by default
    return bert_tokenizer(
        list(texts),
        max_length=max_length,
        padding='max_length',
        truncation=True
    )

train_encodings_3 = tokenize_texts_with_bert(X_train_3)
val_encodings_3 = tokenize_texts_with_bert(X_val_3)
test_encodings_3 = tokenize_texts_with_bert(X_test_3)
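
What `padding='max_length'` and `truncation=True` do to each row can be sketched on plain lists of token ids. The helper `pad_or_truncate`, the token ids and the short `max_length` are made up for readability. As far as I know, the real tokenizer also keeps its special end-of-sequence token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length=6, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad      # padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([11, 22, 33, 44])
long_ids, long_mask = pad_or_truncate(list(range(1, 11)))

print(short_ids, short_mask)  # [11, 22, 33, 44, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0). Anything past the cap is simply not seen by the model.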

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Create%20Dataset%20Objects.png)

Create Hugging Face Dataset objects from Python dictionaries holding the token ids (`input_ids`), attention masks and labels, for use in training:

In [None]:
train_dataset_3 = Dataset.from_dict({
    'input_ids': train_encodings_3['input_ids'],
    'attention_mask': train_encodings_3['attention_mask'],
    'labels': list(y_train_3)
})

val_dataset_3 = Dataset.from_dict({
    'input_ids': val_encodings_3['input_ids'],
    'attention_mask': val_encodings_3['attention_mask'],
    'labels': list(y_val_3)
})

test_dataset_3 = Dataset.from_dict({
    'input_ids': test_encodings_3['input_ids'],
    'attention_mask': test_encodings_3['attention_mask'],
    'labels': list(y_test_3)
})


![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Train%20Model.png)

Load the pre-trained BERT model with a classification head:

In [None]:
model_3 = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_classes,
)
#default dropout is 0.1

Train the BERT model with the following arguments:

In [None]:
training_args_bert = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4, #3-5 recommended, lower prevents overfitting
    learning_rate=2e-5, #want lower if data is noisy/smaller
    per_device_train_batch_size=16, #lower (8,16) uses less memory
    per_device_eval_batch_size=16,
    warmup_steps=100,
    #gradually increases learning rate at start of training to prevent destabilizing of gradient descent
    weight_decay=0.009, #regularizes to prevent overfitting
    logging_dir='./logs',
    eval_strategy='epoch', #epochs vs. early stopping
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to="none",  # disables W&B logging
    run_name='bert-review-classifier',
    seed=42
)
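
To picture what `warmup_steps=100` does, here is a sketch of a linear warmup followed by linear decay, which to my knowledge is the Trainer's default schedule. The function `linear_warmup_lr` is hypothetical, and `total_steps=2600` is an assumption based on roughly 13k rows × 80% ÷ batch size 16 × 4 epochs on a single device. The real step count depends on the actual data and hardware.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=2600):
    """Learning rate at a given optimizer step (linear warmup, linear decay)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly to 0 at the end of training
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 50, 100, 1350, 2600):
    print(step, linear_warmup_lr(step))
```

Starting from a very small learning rate keeps the first updates to the pre-trained weights gentle while the freshly initialized classification head is still producing large, noisy gradients.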

Build function to calculate accuracy:

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, predictions)
    }
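
The metric above turns each row of logits into a predicted class by taking the position of the largest value, then compares that with the true label. A plain-Python sketch on made-up logits (the `argmax` helper is hypothetical):

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


toy_logits = [
    [0.1, 2.3, -1.0],  # highest score at index 1
    [1.5, 0.2, 0.3],   # highest score at index 0
    [0.0, 0.1, 0.9],   # highest score at index 2
]
toy_true_ids = [1, 0, 1]

toy_pred_ids = [argmax(row) for row in toy_logits]
toy_accuracy = sum(p == t for p, t in zip(toy_pred_ids, toy_true_ids)) / len(toy_true_ids)

print(toy_pred_ids)  # [1, 0, 2]
print(round(toy_accuracy, 4))
```

Logits do not need to be converted to probabilities first: softmax preserves their order, so the argmax is the same either way.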

Train the model:

In [None]:
model_3_start = datetime.now()

bert_trainer_3 = Trainer(
    model=model_3,
    args=training_args_bert,
    train_dataset=train_dataset_3,
    eval_dataset=val_dataset_3,  # val set used during training
    compute_metrics=compute_metrics
)

bert_trainer_3.train()

model_3_end = datetime.now()

model_3_time = model_3_end - model_3_start
print(f"Model 3 training time: {model_3_time.total_seconds():.2f} seconds")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Accuracy_%20Validation%20Set.png)

Evaluate the accuracy of the predictions:

In [None]:
val_results_3 = bert_trainer_3.evaluate(val_dataset_3)
print(f"\n Validation Accuracy: {val_results_3['eval_accuracy']:.4f}")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Classification%20Report_%20Validation%20Set.png)

View performance by class (restaurant type) and by metric (precision vs. recall vs. F1):

Model improvements were optimized for the F1 score during modeling:

In [None]:
val_predictions_3 = bert_trainer_3.predict(val_dataset_3)
y_val_pred_3 = np.argmax(val_predictions_3.predictions, axis=1)
y_val_true_3 = val_predictions_3.label_ids

val_report_3 = classification_report(y_val_true_3, y_val_pred_3, target_names=label_encoder.classes_)
print("\n Validation Classification Report:\n", val_report_3)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Confusion%20Matrix_%20Validation%20Set.png)

Run confusion matrix to see where exactly predictions are going wrong:

In [None]:
# This gives the original labels in the correct encoded order
label_names = label_encoder.classes_

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# Compute confusion matrix and percentages
cm = confusion_matrix(y_val_true_3, y_val_pred_3, labels=range(len(label_names)))
cm_percent = cm.astype(float) / cm.sum(axis=1, keepdims=True)

# Plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=label_names)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Pastel2_r, values_format=".2f", ax=ax)

# Improve readability
for text in ax.texts:
    text.set_color("black")

plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.title("Confusion Matrix (Percentages)")
plt.tight_layout()
plt.show()

##Model 4

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%204.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Split%20Data.png)

Perform same steps as Model 3:

Split data into 80/10/10 Train, Validation, Test split:

In [None]:
X_4=df_train["text_review_and_features"].values
y_4=y

# Train (80%), Temp (20%) split
X_train_4, X_temp_4, y_train_4, y_temp_4 = train_test_split(
    X_4,
    y_4,
    test_size=0.2,
    random_state=42,
    stratify=y_4
)

# Temp → Validation (10%), Test (10%) split
X_val_4, X_test_4, y_val_4, y_test_4 = train_test_split(
    X_temp_4,
    y_temp_4,
    test_size=0.5,
    random_state=42,
    stratify=y_temp_4
)

# Delete temp vars and print shapes with % of total
del X_temp_4, y_temp_4
total_samples_4 = len(df_train)

print(f"the shape of the training set (input) is: {X_train_4.shape} ({len(X_train_4)/total_samples_4:.1%} of total)")
print(f"the shape of the training set (target) is: {y_train_4.shape} ({len(y_train_4)/total_samples_4:.1%} of total)\n")

print(f"the shape of the validation set (input) is: {X_val_4.shape} ({len(X_val_4)/total_samples_4:.1%} of total)")
print(f"the shape of the validation set (target) is: {y_val_4.shape} ({len(y_val_4)/total_samples_4:.1%} of total)\n")

print(f"the shape of the test set (input) is: {X_test_4.shape} ({len(X_test_4)/total_samples_4:.1%} of total)")
print(f"the shape of the test set (target) is: {y_test_4.shape} ({len(y_test_4)/total_samples_4:.1%} of total)")


In [None]:
train_encodings_4 = tokenize_texts_with_bert(X_train_4)
val_encodings_4 = tokenize_texts_with_bert(X_val_4)
test_encodings_4 = tokenize_texts_with_bert(X_test_4)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Train%20Model.png)

Create Hugging Face Datasets for Train, Val, Test:

In [None]:
train_dataset_4 = Dataset.from_dict({
    'input_ids': train_encodings_4['input_ids'],
    'attention_mask': train_encodings_4['attention_mask'],
    'labels': list(y_train_4)
})

val_dataset_4 = Dataset.from_dict({
    'input_ids': val_encodings_4['input_ids'],
    'attention_mask': val_encodings_4['attention_mask'],
    'labels': list(y_val_4)
})

test_dataset_4 = Dataset.from_dict({
    'input_ids': test_encodings_4['input_ids'],
    'attention_mask': test_encodings_4['attention_mask'],
    'labels': list(y_test_4)
})

Call pre-trained BERT model:

In [None]:
model_4 = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_classes,
)
#default dropout is 0.1

Train model:

In [None]:
model_4_start = datetime.now()

bert_trainer_4 = Trainer(
    model=model_4,
    args=training_args_bert, #used same training args as model 3
    train_dataset=train_dataset_4,
    eval_dataset=val_dataset_4,  # val set used during training
    compute_metrics=compute_metrics
)

bert_trainer_4.train()

model_4_end = datetime.now()

model_4_time = model_4_end - model_4_start
print(f"Model 4 training time: {model_4_time.total_seconds():.2f} seconds")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Accuracy_%20Validation%20Set.png)

Evaluate accuracy:

In [None]:
val_results_4 = bert_trainer_4.evaluate(val_dataset_4)
print(f"\n Validation Accuracy: {val_results_4['eval_accuracy']:.4f}")

Run a classification report to view precision, recall, and F1 by restaurant type (class):

In [None]:
val_predictions_4 = bert_trainer_4.predict(val_dataset_4)
y_val_pred_4 = np.argmax(val_predictions_4.predictions, axis=1)
y_val_true_4 = val_predictions_4.label_ids

val_report_4 = classification_report(y_val_true_4, y_val_pred_4, target_names=label_encoder.classes_)
print("\n Validation Classification Report:\n", val_report_4)
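
For reference, the per-class numbers in a classification report can be reproduced by hand. This is a minimal sketch on made-up labels (not the restaurant data). `per_class_report` is a hypothetical helper, and undefined ratios are set to 0, which mirrors `zero_division=0`.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = (precision, recall, f1, tp[c] + fn[c])
    return report

# Hypothetical labels, not taken from the restaurant data
y_true = ["thai", "thai", "thai", "pizza", "pizza", "sushi"]
y_pred = ["thai", "thai", "pizza", "pizza", "pizza", "thai"]

report = per_class_report(y_true, y_pred)
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:>6}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Recall answers "of the true thai restaurants, how many were found?", while precision answers "of the restaurants predicted thai, how many really are?".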

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Confusion%20Matrix_%20Validation%20Set.png)

Run Confusion Matrix to see where predictions are going wrong:

In [None]:
# This gives the original labels in the correct encoded order
label_names = label_encoder.classes_

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# Compute confusion matrix and percentages
cm = confusion_matrix(y_val_true_4, y_val_pred_4, labels=range(len(label_names)))
cm_percent = cm.astype(float) / cm.sum(axis=1, keepdims=True)

# Plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=label_names)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Pastel2_r, values_format=".2f", ax=ax)

# Improve readability
for text in ax.texts:
    text.set_color("black")

plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.title("Confusion Matrix (Percentages)")
plt.tight_layout()
plt.show()
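
The plotted values are row-normalized: each row (true class) is divided by its total, so the diagonal holds per-class recall as a share between 0 and 1. A minimal plain-Python sketch on hypothetical counts follows. `row_normalize` is an illustrative helper that also guards against empty rows, which the NumPy division above does not.

```python
def row_normalize(cm):
    """Divide each row by its total; rows with no samples stay all zeros."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Hypothetical 3-class counts: rows are true classes, columns are predictions
cm_counts = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 0],  # a class with no validation samples
]
cm_share = row_normalize(cm_counts)
per_class_recall = [cm_share[i][i] for i in range(len(cm_counts))]

for row in cm_share:
    print([f"{value:.2f}" for value in row])
print("per-class recall:", per_class_recall)
```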


#RoBERTa Transformer Model

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/RoBERTa%20Transformer%20Model.png)

##Model 5

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%205.png)

Model 5 uses a different transformer, RoBERTa, on the same review + other features text as Model 4.

Differences between RoBERTa and BERT:


*   RoBERTa is pre-trained on more data, for longer, and with larger batches
*   It drops the Next Sentence Prediction objective and trains on masked language modeling only
*   It uses dynamic masking (the masked positions are re-sampled during training) instead of static masking
*   Its hyperparameters and training setup were re-tuned for better performance
*   On the benchmarks reported by its authors, RoBERTa scores higher than BERT; whether that carries over to this dataset is what Model 5 tests
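
To make the masking difference concrete, here is a toy sketch. It is deliberately simplified: real masked language modeling works on token ids and uses a more involved replacement rule, and the sentence, the 30% rate, and the `mask_tokens` helper are all made up for illustration.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the tacos at this place are always fresh and cheap".split()

# Static masking: the mask pattern is drawn once and reused every epoch.
static_rng = random.Random(0)
static_view = mask_tokens(tokens, 0.3, static_rng)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a new mask pattern is drawn each time the sequence is seen.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 0.3, dynamic_rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees the same blanks every epoch; with dynamic masking it has to predict different words from the same sentence over time.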

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Load%20Libraries.png)

Load libraries for RoBERTa model:

In [None]:
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments
)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Split%20Data.png)

Split data into 80/10/10 Train, Validation, Test:

In [None]:
X_5=df_train["text_review_and_features"].values
y_5=y

# Train (80%), Temp (20%) split
X_train_5, X_temp_5, y_train_5, y_temp_5 = train_test_split(
    X_5,
    y_5,
    test_size=0.2,
    random_state=42,
    stratify=y_5
)

# Temp → Validation (10%), Test (10%) split
X_val_5, X_test_5, y_val_5, y_test_5 = train_test_split(
    X_temp_5,
    y_temp_5,
    test_size=0.5,
    random_state=42,
    stratify=y_temp_5
)

# Delete temp vars and print shapes with % of total
del X_temp_5, y_temp_5
total_samples_5 = len(df_train)

print(f"the shape of the training set (input) is: {X_train_5.shape} ({len(X_train_5)/total_samples_5:.1%} of total)")
print(f"the shape of the training set (target) is: {y_train_5.shape} ({len(y_train_5)/total_samples_5:.1%} of total)\n")

print(f"the shape of the validation set (input) is: {X_val_5.shape} ({len(X_val_5)/total_samples_5:.1%} of total)")
print(f"the shape of the validation set (target) is: {y_val_5.shape} ({len(y_val_5)/total_samples_5:.1%} of total)\n")

print(f"the shape of the test set (input) is: {X_test_5.shape} ({len(X_test_5)/total_samples_5:.1%} of total)")
print(f"the shape of the test set (target) is: {y_test_5.shape} ({len(y_test_5)/total_samples_5:.1%} of total)")


Tokenize the data with the pre-trained RoBERTa tokenizer:

In [None]:
roberta_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

def tokenize_texts_with_roberta(texts, max_length=512):
    # Truncate each row to at most 512 tokens and pad shorter rows to that length
    return roberta_tokenizer(
        list(texts),
        max_length=max_length,
        padding='max_length',
        truncation=True
    )
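
What `padding='max_length'` and `truncation=True` do to a single row can be sketched without the tokenizer. The helper below is hypothetical and works on made-up token ids; the special-token ids (0 for `<s>`, 2 for `</s>`, 1 for padding) are assumed to match `roberta-base`, and the real tokenizer also handles the text-to-id step that is skipped here.

```python
def pad_and_truncate(token_ids, max_length, bos_id=0, eos_id=2, pad_id=1):
    """Mimic padding='max_length' + truncation=True for one sequence.

    The default special-token ids are assumptions, not read from the tokenizer.
    """
    body = token_ids[: max_length - 2]       # leave room for <s> and </s>
    input_ids = [bos_id] + body + [eos_id]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad            # pad on the right
    attention_mask += [0] * n_pad            # 0 = padding, ignored by attention
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_row = pad_and_truncate([100, 101, 102], max_length=8)
long_row = pad_and_truncate(list(range(100, 120)), max_length=8)
print(short_row)
print(long_row)
```

Every row comes out the same length, and the attention mask tells the model which positions hold real tokens.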

In [None]:
train_encodings_5 = tokenize_texts_with_roberta(X_train_5)
val_encodings_5 = tokenize_texts_with_roberta(X_val_5)
test_encodings_5 = tokenize_texts_with_roberta(X_test_5)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Train%20Model.png)

Create Hugging Face datasets for Train, Val, Test:

In [None]:
train_dataset_5 = Dataset.from_dict({
    'input_ids': train_encodings_5['input_ids'],
    'attention_mask': train_encodings_5['attention_mask'],
    'labels': list(y_train_5)
})

val_dataset_5 = Dataset.from_dict({
    'input_ids': val_encodings_5['input_ids'],
    'attention_mask': val_encodings_5['attention_mask'],
    'labels': list(y_val_5)
})

test_dataset_5 = Dataset.from_dict({
    'input_ids': test_encodings_5['input_ids'],
    'attention_mask': test_encodings_5['attention_mask'],
    'labels': list(y_test_5)
})

Load the pre-trained RoBERTa model. Like the BERT models above, it uses the default dropout probability of 0.1.
During training, dropout randomly sets that fraction of a layer’s outputs to 0,
which keeps the model from relying too heavily on any single neuron.
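
A minimal sketch of the dropout idea, in plain Python rather than PyTorch. It assumes the common "inverted" form, where surviving values are scaled by 1/(1-p) so the expected activation is unchanged; the `dropout` function here is illustrative, not the library's code.

```python
import random

def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)  # dropout is switched off at evaluation time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.1, rng=rng)

zero_fraction = sum(1 for v in dropped if v == 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"mean activation after dropout: {mean_after:.3f}")
```

About 10% of the outputs are zeroed, while the average activation stays close to its original value.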

In [None]:
# Load model with dropout regularization (default dropout for RoBERTa is 0.1)
model_5 = RobertaForSequenceClassification.from_pretrained(
    'roberta-base',
    num_labels=num_classes,
)

Set training arguments:

In [None]:
#Training arguments
training_args_roberta = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,  # fewer epochs reduce the risk of overfitting
    learning_rate=2e-5,  # a lower rate suits small or noisy datasets
    per_device_train_batch_size=16,  # smaller batches (8, 16) use less memory
    per_device_eval_batch_size=16,
    warmup_steps=100,  # ramp the learning rate up gradually so early updates stay stable
    weight_decay=0.009,  # regularization to reduce overfitting
    logging_dir='./logs',
    eval_strategy='epoch',  # evaluate on the validation set after every epoch
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to="none",  # disables W&B logging
    run_name='roberta-review-classifier'
)
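
The effect of `warmup_steps` on the learning rate can be sketched as follows. This assumes the Trainer's default linear schedule (no `lr_scheduler_type` is set above): the rate climbs linearly to `learning_rate` over the warmup steps, then decays linearly to 0. The function name and `total_steps=2000` are hypothetical; the real step count depends on dataset size, batch size, and epochs.

```python
def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 2e-5       # learning_rate from the training arguments
warmup_steps = 100   # warmup_steps from the training arguments
total_steps = 2000   # hypothetical total number of optimizer steps

for step in (0, 50, 100, 1050, 2000):
    lr = linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```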

Train the model:

In [None]:
model_5_start = datetime.now()

roberta_trainer_5 = Trainer(
    model=model_5,
    args=training_args_roberta,
    train_dataset=train_dataset_5,
    eval_dataset=val_dataset_5,  # val set used during training
    compute_metrics=compute_metrics
)

roberta_trainer_5.train()

model_5_end = datetime.now()

model_5_time = model_5_end - model_5_start
print(f"Model 5 training time: {model_5_time.total_seconds():.2f} seconds")

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Accuracy_%20Validation%20Set.png)

Evaluate accuracy of predictions:

In [None]:
val_results_5 = roberta_trainer_5.evaluate(val_dataset_5)
print(f"\n Validation Accuracy: {val_results_5['eval_accuracy']:.4f}")

Run a classification report to show precision, recall, and F1 at the restaurant type (class) level:

In [None]:
val_predictions_5 = roberta_trainer_5.predict(val_dataset_5)
y_val_pred_5 = np.argmax(val_predictions_5.predictions, axis=1)
y_val_true_5 = val_predictions_5.label_ids

val_report_5 = classification_report(y_val_true_5, y_val_pred_5, target_names=label_encoder.classes_)
print("\n Validation Classification Report:\n", val_report_5)

Run confusion matrix to see where predictions are going wrong:

In [None]:
# Original labels in the encoded order (redefined here so this cell runs on its own)
label_names = label_encoder.classes_

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# Compute confusion matrix and percentages
cm = confusion_matrix(y_val_true_5, y_val_pred_5, labels=range(len(label_names)))
cm_percent = cm.astype(float) / cm.sum(axis=1, keepdims=True)

# Plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=label_names)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Pastel2_r, values_format=".2f", ax=ax)

# Improve readability
for text in ax.texts:
    text.set_color("black")

plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.title("Confusion Matrix (Percentages)")
plt.tight_layout()
plt.show()

#Model Comparison

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Model%20Comparison.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Test%20Set%20Accuracies%20All%20Models.png)

Model 1 Test Accuracy:

In [None]:
clf_test_1 = LogisticRegression(max_iter=1000).fit(train_vec_1, df_train_1["label"])

preds_test_1 = clf_test_1.predict(X_test_1)

test_accuracy_1 = accuracy_score(y_test_1, preds_test_1)
print(f"Model 1 Test Accuracy: {test_accuracy_1:.4f}")

In [None]:
classification_report_test_1=classification_report(y_test_1, preds_test_1, zero_division=0)
print(classification_report_test_1)

Model 2 Test Accuracy:

In [None]:
clf_test_2 = LogisticRegression(max_iter=1000).fit(train_vec_2, df_train_2["label"])

preds_test_2 = clf_test_2.predict(X_test_2)

test_accuracy_2 = accuracy_score(y_test_2, preds_test_2)
print(f"Model 2 Test Accuracy: {test_accuracy_2:.4f}")

In [None]:
classification_report_test_2 = classification_report(y_test_2, preds_test_2, zero_division=0)
print(classification_report_test_2)

Model 3 Test Accuracy:

In [None]:
test_predictions_3 = bert_trainer_3.predict(test_dataset_3)
y_test_pred_3 = np.argmax(test_predictions_3.predictions, axis=1)
y_test_true_3 = test_predictions_3.label_ids

test_results_3 = bert_trainer_3.evaluate(test_dataset_3)
test_accuracy_3 = test_results_3['eval_accuracy']
print(f"\n Model 3 Test Accuracy: {test_accuracy_3:.4f}")

In [None]:
classification_report_test_3 = classification_report(y_test_true_3, y_test_pred_3, target_names=label_encoder.classes_)
print("\n Model 3 Test Classification Report:\n", classification_report_test_3)

Model 4 Test Accuracy:

In [None]:
test_predictions_4 = bert_trainer_4.predict(test_dataset_4)
y_test_pred_4 = np.argmax(test_predictions_4.predictions, axis=1)
y_test_true_4 = test_predictions_4.label_ids

test_results_4 = bert_trainer_4.evaluate(test_dataset_4)
test_accuracy_4 = test_results_4['eval_accuracy']
print(f"\n Model 4 Test Accuracy: {test_accuracy_4:.4f}")

In [None]:
classification_report_test_4 = classification_report(y_test_true_4, y_test_pred_4, target_names=label_encoder.classes_)
print("\n Model 4 Test Classification Report:\n", classification_report_test_4)

Model 5 Test Accuracy:

In [None]:
test_predictions_5 = roberta_trainer_5.predict(test_dataset_5)
y_test_pred_5 = np.argmax(test_predictions_5.predictions, axis=1)
y_test_true_5 = test_predictions_5.label_ids

test_results_5 = roberta_trainer_5.evaluate(test_dataset_5)
test_accuracy_5 = test_results_5['eval_accuracy']
print(f"\n Model 5 Test Accuracy: {test_accuracy_5:.4f}")

In [None]:
classification_report_test_5 = classification_report(y_test_true_5, y_test_pred_5, target_names=label_encoder.classes_)
print("\n Model 5 Test Classification Report:\n", classification_report_test_5)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Final%20Comparison%20Charts.png)

Comparison Chart for Accuracy:

In [None]:
import pandas as pd

# Round and convert to percent strings
acc_1 = f"{round(test_accuracy_1 * 100, 1)}%"
acc_2 = f"{round(test_accuracy_2 * 100, 1)}%"
acc_3 = f"{round(test_accuracy_3 * 100, 1)}%"
acc_4 = f"{round(test_accuracy_4 * 100, 1)}%"
acc_5 = f"{round(test_accuracy_5 * 100, 1)}%"


# Create the DataFrame
accuracy_df = pd.DataFrame({
    'Model #': ['1', '2','3','4','5'],
    'Model': ['Logistic Regression', 'Logistic Regression','BERT','BERT','RoBERTa'],
    'Model Description': ['Avg. Word2Vec Embeddings', 'Avg. Word2Vec Embeddings','Transformer','Transformer','Transformer'],
    'Accuracy': [acc_1, acc_2, acc_3, acc_4, acc_5],
    'Type': ['Baseline', 'Candidate', 'Candidate','Final','Candidate',]
})

accuracy_df
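
When reading the table, it helps to separate percentage-point gain from relative gain. The sketch below uses the rounded figures quoted in the introduction (82% vs. the 76% baseline) rather than the exact values in `accuracy_df`; `improvement` is a hypothetical helper.

```python
def improvement(baseline_acc, model_acc):
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute_pp = (model_acc - baseline_acc) * 100
    relative = (model_acc - baseline_acc) / baseline_acc
    return absolute_pp, relative

# Rounded accuracies from the introduction, not recomputed from the test set
abs_pp, rel = improvement(0.76, 0.82)
print(f"+{abs_pp:.1f} percentage points, {rel:.1%} relative improvement")
```

A gain of about 6 percentage points on a 76% baseline is roughly an 8% relative improvement.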


Comparison chart for classification report:

In [None]:
from sklearn.metrics import classification_report
import pandas as pd

# Generate reports
report_1 = classification_report(y_test_1, preds_test_1, zero_division=0, output_dict=True)
report_2 = classification_report(y_test_2, preds_test_2, zero_division=0, output_dict=True)
report_3 = classification_report(y_test_true_3, y_test_pred_3, zero_division=0, output_dict=True, target_names=label_encoder.classes_)
report_4 = classification_report(y_test_true_4, y_test_pred_4, zero_division=0, output_dict=True, target_names=label_encoder.classes_)
report_5 = classification_report(y_test_true_5, y_test_pred_5, zero_division=0, output_dict=True, target_names=label_encoder.classes_)

# Convert to DataFrames
df1 = pd.DataFrame(report_1).T[['f1-score']].rename(columns={'f1-score': 'Model 1: Avg. Embeddings Review Only'})
df2 = pd.DataFrame(report_2).T[['f1-score']].rename(columns={'f1-score': 'Model 2: Avg. Embeddings Review+Features'})
df3 = pd.DataFrame(report_3).T[['f1-score']].rename(columns={'f1-score': 'Model 3: BERT Review Only'})
df4 = pd.DataFrame(report_4).T[['f1-score']].rename(columns={'f1-score': 'Model 4: BERT Review + Features'})
df5 = pd.DataFrame(report_5).T[['f1-score']].rename(columns={'f1-score': 'Model 5: RoBERTa Review + Features'})

# Combine
comparison_df = pd.concat([df1, df2, df3, df4, df5], axis=1)

# Separate summary rows and class rows
summary_rows = ['accuracy', 'macro avg', 'weighted avg']
summary_df = comparison_df.loc[summary_rows]
class_df = comparison_df.drop(index=summary_rows)

# Combine summary first, then classes
final_df = pd.concat([summary_df, class_df])
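
The `macro avg` and `weighted avg` summary rows can diverge when classes are imbalanced. A minimal sketch with made-up F1 scores and supports (the class names and numbers are hypothetical, as is the helper):

```python
def macro_and_weighted_avg(f1_by_class, support_by_class):
    """Return (macro average, support-weighted average) of per-class F1 scores."""
    classes = list(f1_by_class)
    macro = sum(f1_by_class[c] for c in classes) / len(classes)
    total = sum(support_by_class[c] for c in classes)
    weighted = sum(f1_by_class[c] * support_by_class[c] for c in classes) / total
    return macro, weighted

# Hypothetical scores for three classes of very different size
f1 = {"american": 0.90, "thai": 0.60, "ethiopian": 0.30}
support = {"american": 80, "thai": 15, "ethiopian": 5}

macro, weighted = macro_and_weighted_avg(f1, support)
print(f"macro avg F1:    {macro:.3f}")
print(f"weighted avg F1: {weighted:.3f}")
```

The macro average treats every class equally, so weak small classes pull it down; the weighted average follows the large classes.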


Formatted/Final Comparison Chart:

In [None]:
import plotly.graph_objects as go

# Format as percentage where applicable
def format_percent(df):
    return df.copy().apply(lambda col: col.map(lambda x: f"{x*100:.1f}%" if isinstance(x, float) else x))

# Split summary and class rows
summary_rows = ['accuracy', 'macro avg', 'weighted avg']
summary_df = format_percent(comparison_df.loc[summary_rows].reset_index())
summary_df.columns = ['Metric'] + list(summary_df.columns[1:])

class_df = format_percent(comparison_df.drop(index=summary_rows, errors='ignore').reset_index())
class_df.columns = ['Class'] + list(class_df.columns[1:])

# Colors
body = '#fdfdfe'
header = '#d3b4e5'

def make_table(data, height):
    fig = go.Figure(data=[go.Table(
        header=dict(
            values=list(data.columns),
            fill_color=header,
            font=dict(color='black', size=12),
            align='center'
        ),
        cells=dict(
            values=[data[col] for col in data.columns],
            fill_color=body,
            font=dict(color='gray', size=12),
            align='center'
        )
    )])
    fig.update_layout(width=950, height=height, margin=dict(l=10, r=10, t=20, b=10))
    return fig

# Show both tables
make_table(summary_df, 300).show()
make_table(class_df, 300).show()


#Misc

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/Freeze%20Library%20Requirements.png)

In [None]:
# !pip freeze > full_requirements.txt

In [None]:
# !pipreqs /content/ --force --savepath used_requirements.txt


In [None]:
!wget https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/main/1_projects/restaurant_prediction_nlp/restaurant_prediction_nlp.ipynb


In [None]:
from nbformat import read, write, NO_CONVERT

# Load this notebook file
notebook_path = '/content/restaurant_prediction_nlp.ipynb'  # use correct name if different

with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = read(f, as_version=NO_CONVERT)

# Remove any broken widget metadata
for cell in nb.get('cells', []):
    if 'metadata' in cell and 'widget' in cell['metadata']:
        del cell['metadata']['widget']
    if 'metadata' in cell and 'widget_state' in cell['metadata']:
        del cell['metadata']['widget_state']

# Save cleaned notebook to a new file
clean_path = 'restaurant_prediction_nlp_final.ipynb'

with open(clean_path, 'w', encoding='utf-8') as f:
    write(nb, f)

print("Cleaned notebook saved as:", clean_path)


In [None]:
!jupyter nbconvert --to python restaurant_prediction_nlp_final.ipynb

In [None]:
# from google.colab import files
# files.download("restaurant_prediction_nlp_final.py")

In [None]:
!apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic -y


In [None]:
!apt-get install -y pandoc

In [None]:
!jupyter nbconvert --to pdf restaurant_prediction_nlp_final.ipynb

In [None]:
from google.colab import files
files.download("restaurant_prediction_nlp_final.pdf")

In [None]:
!jupyter nbconvert --to html restaurant_prediction_nlp_final.ipynb

In [None]:
from google.colab import files
files.download("restaurant_prediction_nlp_final.html")