# Clean and Explore

**Setup & Import**

In [1]:
import sys
current_dir = "/Users/josephtolsma/Documents/dev/yelp_rag"
sys.path.insert(0,current_dir)

In [2]:
from src.config import DATA_DIR_SAMP,COL_TEXT,MIN_LENGTH
import pandas as pd
import re
import os
import unicodedata

In [10]:
pd.set_option("display.max_columns",None)
pd.set_option("display.max_colwidth",None)

In [3]:
reviews_df = pd.read_csv(os.path.join(DATA_DIR_SAMP,"reviews_df.csv"))

**Cleaning Functions**

In [4]:
def clean_review_text(df):
    "for dataframe df, replace nonstandard characters and drop unusable (missing or too short) values"

    before = len(df)
    # work on a copy so the caller's dataframe is not modified in place
    df = df.copy()

    # basic cleaning
    df[COL_TEXT] = df[COL_TEXT].astype("string").str.strip()
    df = df.dropna(subset = [COL_TEXT])

    # remove ultra-short reviews (copy so the assignments below do not hit a filtered view)
    df = df[df[COL_TEXT].str.len() >= MIN_LENGTH].copy()

    # replace nonstandard characters
    invalid_chars = {
        "\u00a0":" ",  # non-breaking space (the same character as "\xa0")
        "\u002b":"",   # plus sign
        "\x0b":" ",    # vertical tab
        "“":'"',
        "”":'"',
        "‘":"'",
        "’":"'",
    }

    for char,rep_str in invalid_chars.items():
        df[COL_TEXT] = df[COL_TEXT].str.replace(char,rep_str,regex = False)

    # collapse runs of whitespace into a single space
    df[COL_TEXT] = df[COL_TEXT].str.replace(r"\s+", " ", regex=True).str.strip()

    print(f"{before - len(df)} reviews dropped in cleaning step.")
    return df

In [5]:
def normalize_unicode(text):
    "apply NFKC normalization: fold compatibility characters (ligatures, full-width forms, non-breaking spaces) into their standard equivalents"
    return unicodedata.normalize("NFKC",text)
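
A quick check of what NFKC normalization changes and what it leaves alone may help here. The sample strings below are made up for illustration and do not come from the dataset. Note that curly quotes have no compatibility mapping, so they pass through unchanged and need the explicit replacement done in `clean_review_text`.

```python
import unicodedata


def normalize_unicode(text):
    "apply NFKC normalization to a single string"
    return unicodedata.normalize("NFKC", text)


# Illustrative input: an "fi" ligature (U+FB01), a non-breaking space (U+00A0),
# and full-width digits (U+FF11, U+FF10).
folded = normalize_unicode("\ufb01ne\u00a0dining \uff11\uff10/\uff11\uff10")

# Curly double quotes (U+201C, U+201D) are not compatibility characters.
unchanged = normalize_unicode("\u201cgreat\u201d")

print(repr(folded))
print(repr(unchanged))
```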

In [6]:
def deduplicate_reviews(df):
    "remove duplicated review texts from the dataset"
    before = len(df)
    df = df.drop_duplicates(subset = [COL_TEXT])
    print(f"{before - len(df)} reviews dropped in deduplicating step.")
    return df

In [7]:
reviews_df = clean_review_text(reviews_df)
reviews_df[COL_TEXT] = reviews_df[COL_TEXT].apply(normalize_unicode)
reviews_df = deduplicate_reviews(reviews_df)

0 reviews dropped in cleaning step.
3 reviews dropped in deduplicating step.


**Chunking Functions**

In [None]:
# reviews_df has restaurant_id, restaurant_name, text

# Step 1: ensure review_id exists
# Step 2: set chunk params
# Step 3: define chunk_text(text) -> list[str]
# Step 4: iterate rows and expand
# Step 5: add chunk_index + chunk_id
# Step 6: sanity checks
# Step 7: save chunks_df
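
The plan above could be sketched roughly as follows. This is only one possible approach, not the notebook's final implementation: it assumes word-based windows with overlap, and the names `chunk_text`, `expand_to_chunks`, `chunk_size`, and `overlap`, along with their default values, are hypothetical. It works on plain dictionaries so that it runs without pandas; `reviews_df.to_dict("records")` would produce a suitable input, and `pd.DataFrame(rows)` would turn the result back into a `chunks_df`.

```python
def chunk_text(text, chunk_size=60, overlap=10):
    "split text into overlapping windows of whitespace-delimited words (sizes are assumptions)"
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def expand_to_chunks(records, text_key="text", chunk_size=60, overlap=10):
    "turn one record per review into one record per chunk"
    rows = []
    for position, record in enumerate(records):
        # Step 1: fall back to the row position when no review_id is present
        review_id = record.get("review_id", position)
        pieces = chunk_text(record[text_key], chunk_size, overlap)
        for chunk_index, piece in enumerate(pieces):
            # Steps 4 and 5: copy the metadata, then add the chunk identifiers
            row = {k: v for k, v in record.items() if k != text_key}
            row["review_id"] = review_id
            row["chunk_index"] = chunk_index
            row["chunk_id"] = f"{review_id}_{chunk_index}"
            row[text_key] = piece
            rows.append(row)
    return rows


# Step 6: sanity checks on a tiny made-up record
demo = [{"name": "Example Cafe", "text": "a b c d e f g"}]
rows = expand_to_chunks(demo, chunk_size=4, overlap=1)
assert [r["chunk_id"] for r in rows] == ["0_0", "0_1"]
assert all(len(r["text"].split()) <= 4 for r in rows)
```

Word windows are simple to reason about, but a token-based or sentence-aware splitter may suit the downstream embedding model better; the window size and overlap would need tuning against the actual review lengths.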


Unnamed: 0,name,text
1538,Backspace Bar & Kitchen,"This place is at the bottom of the list for me, out of all the places we visited during our NOLA weekend. The offering of cuban sandwiches is what made me want to try this place, and I think the interior is really interesting, with all the literary references and homages to the talented writers of the past. I found the bartender, however, to be extremely rude, refusing to leave his post behind the bar to interact with us, instead opting to shout at us over the unnecessarily loud nineties music. This would have been okay if he wasn't the only person on staff; but he was and the restaurant was void of customers when we arrived. As a result of his piss poor attempt at serving us, he mistook my ordering of a cuban as gumbo, which he promptly spilled on my friend during his maiden voyage to our table. Nice going, dipshit. Maybe this place is better during the day, but I won't be back to find out."
1583,Cheddar's Scratch Kitchen,"First time dining at this location and the service at Cheddar's was absolutely terrible. We saw our waitress a total of three times during our dining experience, once to take our order, once to deliver the order (almost an hour after placing the order), and once to hand us the check - with not so much as a ""thank you for coming in,"" or ""sorry about your wait"" with the check delivery. She didn't say a word to us, which we found extremely unprofessional and rude. Another waitress and a manager visited our table between when our order was taken and when the food was delivered, which was a bad sign, given that the manager came to the table to apologize before we had even received our dinner. She gave us two coupons for free chips and queso at our next visit, which we will not be using, because we do not plan to return to Cheddar's anytime soon. The food arrived late and cold after what they claimed was a ""mix up"" in the kitchen. We boxed up our sides and steaks to take home because they were not even appetizing at the lukewarm temperature that they had been served to us. Getting the box was another fiasco, however, which involved walking out into the main dining room and peeking into the kitchen to be able to locate someone who could help us locate our check and get a to-go box. The chicken tenders and the baked potato were served at an appropriate warmness, and we were able to enjoy those while at the restaurant. However, those items alone were not convincing enough to persuade us to choose Cheddar's over one of the MANY other area restaurants after our next night at the mall. The table behind us also had issues with the service, and ended up getting part of their meal comped due to the lack of delivery by the staff. Disappointed in the staff and management for the experience we had last night in the bar room at Cheddar's."
770,Aldertos Fresh Mexican Food,The food was ok. Not worth the $19.00 I spent. They messed up my order so I ended up with a bean burrito I ordered beef. I won't return.
2409,Backspace Bar & Kitchen,"Nice space, quiet music, outdoor sitting. The daily special Paper Plane for 5 bucks is refreshing, not too sweet, slightly sour. Would come again."
814,Backspace Bar & Kitchen,"I'm feeling this place. It's divey (ish) with good bar food. Given all of the options nearby, I would definitely choose this over any of them. I also got a $4 glass of decent wine. Oh, and you can smoke inside. I'll be back."
