# Building a Crowdsourced Recommender System

### Group Members: Jose Currea, Jenna Ferguson, Evan Hadd, Ramzi Kattan, Hadley Krummel, Jennifer Gonzales, Ibrahim Muhammad
### Class Section: Afternoon 1 - 3pm

The recommender system should accept user input in the form of desired product attributes and return 3 recommendations.

**Your Python Notebook should include the following:**
- All scripts 
- The sentiment and similarity scores for the three products you recommended in task E.
- Your analyses for and answer to task F. Make sure you show the ratings, similarity scores and sentiments for the products you recommend in tasks E and F. Use tables whenever possible.  
- Show the logic you are using in addition to finding the most similar product. 

## Imports

In [4]:
import pandas as pd

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent wrapping to multiple lines

## Task A

Extract about 5-6k reviews. Many reviews have no text and will be discarded, so you may end up with 1700-2000 reviews with text.
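A minimal sketch of the discard step, assuming the scraped reviews sit in a pandas DataFrame. The toy data and the column names (`Beer Name`, `Rating`, `Review`) are illustrative assumptions, not the notebook's actual file.

```python
import pandas as pd

# Toy stand-in for scraped data; column names are assumptions for illustration.
raw = pd.DataFrame({
    "Beer Name": ["A", "A", "B", "B"],
    "Rating": ["4.1", "3.9", "4.5", "4.0"],
    "Review": ["Rich and malty.", "", None, "   "],
})

# Treat missing values and whitespace-only strings as "no text".
has_text = raw["Review"].fillna("").str.strip().ne("")
with_text = raw[has_text].reset_index(drop=True)

print(f"Kept {len(with_text)} of {len(raw)} reviews")
```

Filtering before translation and sentiment scoring also avoids spending API calls on empty rows.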

In [5]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

In [None]:
# Set up ChromeDriver path
driver_path = "/Users/ramzikattan/Downloads/chromedriver-mac-arm64/chromedriver"
chrome_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

# Configure Chrome options
chrome_options = Options()
chrome_options.binary_location = chrome_path

# Set up the Chrome WebDriver service
service = Service(driver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)

In [None]:
# Open the webpage
url = "https://www.ratebeer.com/top-beers?time=all"
driver.get(url)
time.sleep(5)  # Wait for the page to load

# Parse the page with BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Close the browser after fetching the page
driver.quit()

# Find all <a> tags with the class containing the beer name and link
beer_links = soup.find_all('a', class_="MuiTypography-root")

# Extract the name and URL for each beer
beers = []
for beer in beer_links:
    name = beer.get_text(strip=True)  # Get the text (beer name)
    link = beer['href']  # Get the URL
    full_link = "https://www.ratebeer.com" + link  # Construct full URL
    beers.append({'name': name, 'link': full_link})

# Print the extracted beer names and their URLs
#for beer in beers:
 #   print(f"Beer: {beer['name']} - URL: {beer['link']}")

In [None]:
# Keep only the slice of the list that holds the beer links
cleaned_beers = [beer for beer in beers[28:128] if beer['name']]  # Only keep non-empty names

In [None]:
# Rescrape Using Nico's Links
cleaned_beers = pd.read_csv('beer_url.csv')

# Drop rows with duplicate beer names, keeping the first occurrence
cleaned_beers = cleaned_beers.drop_duplicates(subset=['Beer Name'], keep='first')

# Display the first few rows of the updated DataFrame
cleaned_beers.head()

Unnamed: 0,Beer Name,Beer Rating,Beer URL
0,Aecht Schlenkerla Rauchbier Urbock,4.0,https://www.ratebeer.com/beer/aecht-schlenkerl...
1,Duckpond Duckpond Darkwing De Luxe,3.8,https://www.ratebeer.com/beer/duckpond-duckpon...
2,Aecht Schlenkerla Weichsel Rotbier,3.8,https://www.ratebeer.com/beer/aecht-schlenkerl...
3,Ayinger Altbairisch Dunkel,3.6,https://www.ratebeer.com/beer/ayinger-altbairi...
4,Russian River Beatification (Batch 002+),4.2,https://www.ratebeer.com/beer/russian-river-be...


In [None]:
# Configure Chrome options for headless mode
chrome_options = Options()
chrome_options.binary_location = chrome_path  # Path to your Chrome browser
chrome_options.add_argument('--headless')  # Enable headless mode to run faster
chrome_options.add_argument('--disable-gpu')  # Disable GPU acceleration (for better performance in headless mode)
chrome_options.add_argument('--no-sandbox')  # Added for safe execution in certain environments
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid issues with shared memory

# Set up the Chrome WebDriver service
service = Service(driver_path)  # Path to your ChromeDriver

# Create a new instance of the Chrome driver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Function to click "Show More" buttons on the current page
def click_show_more_buttons():
    # Locate all "Show More" buttons on the page
    show_more_xpath = '//span[contains(@class, "MuiButton-label") and text()="Show more"]'
    show_more_buttons = driver.find_elements(By.XPATH, show_more_xpath)

    # Click each "Show More" button
    for button in show_more_buttons:
        try:
            if button.is_displayed():
                driver.execute_script("arguments[0].click();", button)
                time.sleep(1)  # Wait for the content to expand
        except Exception:
            pass  # Skip if there's an issue clicking the button

# Function to scrape reviews on the current page
def scrape_reviews():
    reviews = []
    ratings_xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "bRPQdN", " " )) and contains(concat( " ", @class, " " ), concat( " ", "MuiTypography-subtitle1", " " ))]'  # Adjust this based on your page's structure for ratings
    messages_xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "pre-wrap", " " )) and contains(concat( " ", @class, " " ), concat( " ", "MuiTypography-body1", " " ))]'  # Adjust this based on your page's structure for review messages

    # Find all the review ratings and messages on the current page
    ratings = driver.find_elements(By.XPATH, ratings_xpath)
    messages = driver.find_elements(By.XPATH, messages_xpath)

    for i, message in enumerate(messages):
        reviews.append({
            'Rating': ratings[i].text if i < len(ratings) else None,  # Guard against a ratings/messages length mismatch
            'Review': message.text
        })

    return reviews

# Function to scrape reviews for a single beer URL, up to review_limit reviews
def scrape_beer_reviews(beer_name, url, review_limit):
    all_reviews = []
    
    driver.get(url)

    # Handle the cookies banner by accepting it
    try:
        accept_cookies_id = 'onetrust-accept-btn-handler'
        accept_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, accept_cookies_id)))
        accept_button.click()
    except Exception:
        print("No cookies banner found or failed to dismiss.")

    # Loop through pages and scrape reviews until the limit is reached
    while len(all_reviews) < review_limit:
        # Click all "Show More" buttons to expand reviews
        click_show_more_buttons()

        # Scrape reviews on the current page
        page_reviews = scrape_reviews()
        all_reviews.extend(page_reviews)

        # Check if we've hit the review limit
        if len(all_reviews) >= review_limit:
            all_reviews = all_reviews[:review_limit]  # Truncate to exactly the review limit
            break

        # Find the "Next" button and move to the next page
        try:
            next_button_xpath = '//button[@aria-label="Next page" and contains(@class, "MuiIconButton-root")]'
            next_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, next_button_xpath)))
            next_button.click()

            # Wait for the next page to load
            time.sleep(2)
        except Exception:
            # If there's no "Next" button, break the loop
            print(f"Finished scraping {beer_name}.")
            break

    return all_reviews

# Initialize an empty DataFrame to store all the reviews
df = pd.DataFrame(columns=['Beer Name', 'URL', 'Rating', 'Review'])

# Iterate through each beer in the cleaned_beers DataFrame (loaded from beer_url.csv)
for _, beer in cleaned_beers.iterrows():
    beer_name = beer['Beer Name']
    beer_url = beer['Beer URL']

    print(f"Scraping reviews for {beer_name} at {beer_url}...")

    # Scrape reviews for the current beer with a limit of 100 reviews
    reviews = scrape_beer_reviews(beer_name, beer_url, review_limit=100)

    # Create a DataFrame for the reviews of the current beer
    beer_df = pd.DataFrame(reviews)
    beer_df['Beer Name'] = beer_name
    beer_df['URL'] = beer_url

    # Append the DataFrame for this beer to the overall DataFrame
    df = pd.concat([df, beer_df], ignore_index=True)

# Save the DataFrame to a CSV file
df.to_csv('beer_reviews.csv', index=False, encoding='utf-8')

# Close the browser after scraping all URLs
driver.quit()

print("Scraping completed. Data saved to 'beer_reviews.csv'")

## Task B

Assume that a customer, who will be using this recommender system, has specified 3 attributes in a product. E.g., one website describes multiple attributes of beer (but you should choose attributes from the actual data like you did for the first assignment)

https://www.dummies.com/food-drink/drinks/beer/beer-for-dummies-cheat-sheet/
- Aggressive (Boldly assertive aroma and/or taste) 
- Balanced: Malt and hops in similar proportions; equal representation of malt sweetness and hop bitterness in the flavor — especially at the finish
- Complex: Multidimensional; many flavors and sensations on the palate
- Crisp: Highly carbonated; effervescent
- Fruity: Flavors reminiscent of various fruits or Hoppy: Herbal, earthy, spicy, or citric aromas and flavors of hops or Malty: Grainy, caramel-like; can be sweet or dry
- Robust: Rich and full-bodied


In [None]:
import pandas as pd
from collections import Counter
import re
import nltk 
from nltk.corpus import stopwords
from predictionguard import PredictionGuard
import os
import itertools

In [None]:
# Input File Name
inputfile = 'translated_messages.csv'

# Translated Column Messages
messagecolumn = 'translated_messages'

# Import Data
data = pd.read_csv(inputfile)
data = data.drop(['Rating', 'Review'], axis = 1)

In [None]:
# Join All Messages
all_messages = ' '.join(data[messagecolumn].tolist())

# Find All Words
words = re.findall(r'\b\w+\b', all_messages.lower())

# Remove Stop Words
stop_words = set(stopwords.words('english'))
filtered = [word for word in words if word not in stop_words]

# Count
counts = Counter(filtered)
countsdf = pd.DataFrame(counts.items(), columns = ['words', 'frequency'])


In [None]:
# Set API key (read from the environment; never hard-code or print secrets)
api_key = os.getenv("PREDICTIONGUARD_API_KEY")

# Check that the API key is set without printing it
if not api_key:
    raise RuntimeError("Set the PREDICTIONGUARD_API_KEY environment variable.")

# Initialize the PredictionGuard client
client = PredictionGuard(api_key=api_key)

# System Behavior
system_message = {
    "role": "system",
    "content": (
        "You are a beer enthusiast. Your task is to look through words and pick which of them you'd consider to be an attribute of beer. \n"
        "Do not provide any explanations or contextual information. Only return the word if it's an attribute of beer."
    )
}

# Only consider words that are present more than 100 times
# (.copy() so a new column can be added without a SettingWithCopyWarning)
countsdf_100 = countsdf[countsdf['frequency'] > 100].copy()

# Define the list to store words that are considered attributes
attributes = []

def process_message(row):
    try:
        user_message = row['words']

        # Prepare the messages list for the chatbot
        messages = [
            system_message,
            {
                "role": "user",
                "content": f"{user_message}"
            }
        ]

        # Send the message to the PredictionGuard API
        result = client.chat.completions.create(
            model="Hermes-3-Llama-3.1-8B",
            messages=messages
        )

        # Extract the chatbot's response
        response = result['choices'][0]['message']['content'].strip().lower()

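        # The prompt asks the model to echo the word when it is an attribute;
        # also accept a plain 'yes' in case the model answers that way
        if response in ('yes', str(user_message).lower()):
            attributes.append(user_message)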
        
        return response
    except Exception as e:
        print(f"Error processing word {user_message}: {str(e)}")
        return f"Error: {str(e)}"
    
# Apply the process_message function to each row
countsdf_100['response'] = countsdf_100.apply(process_message, axis=1)

# Debug print the list of attributes
print("Attributes considered as beer-related:", attributes)

In [None]:
# First 20 attributes, in the order the words first appear in the reviews (not ranked by frequency)
attributes_20 = attributes[:20]

In [None]:
# Lift Analysis of Attributes

# Initializations
attribute_counter = Counter()
co_occurrence_counter = Counter()
lift_results = []
total_messages = len(data)

# Function to find attributes in a message
def find_attributes(message, attribute_set):
    words = re.findall(r'\w+', message.lower())

    # Keep only the non-stop-words that are in the attribute set
    filtered_attributes = set([word for word in words if word not in stop_words and word in attribute_set])

    return filtered_attributes

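# Function to find co-occurrences
# Note: this tokenizes with a plain whitespace split and does not lowercase, unlike
# find_attributes, so capitalized or punctuated mentions (e.g. "Malt,") are not matched here.
def find_co_occurrences(message, attributes_20, distance):
    words = message.split()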
    found_attributes = []
    
    for i, word in enumerate(words):
        if word in attributes_20:
            found_attributes.append((word, i)) 
    
    co_occurrences = set()
    for (attribute1, idx1), (attribute2, idx2) in itertools.combinations(found_attributes, 2):
        if abs(idx1 - idx2) <= distance: 
            co_occurrences.add(tuple(sorted((attribute1, attribute2)))) 
    return co_occurrences

def calculate_lift(attribute1, attribute2, attribute_counter, co_occurrence_counter, total_messages):
    P_A = attribute_counter[attribute1] / total_messages 
    P_B = attribute_counter[attribute2] / total_messages  
    
    # Combine counts for both (attribute1, attribute2) and (attribute2, attribute1)
    P_AB = (co_occurrence_counter[(attribute1, attribute2)] + co_occurrence_counter[(attribute2, attribute1)]) / total_messages
    
    if P_A * P_B == 0: 
        return 0
    return P_AB / (P_A * P_B)

# Loop through all messages to update counters
for message in data[messagecolumn]:
    filtered_attributes = find_attributes(message, attributes_20)
    
    # Update the attribute counter with the attributes found in this message
    attribute_counter.update(filtered_attributes)
    
    # Find co-occurring attribute pairs within the same message
    co_occurrences = find_co_occurrences(message, attributes_20, distance=10e6)
    co_occurrence_counter.update(co_occurrences)

# Calculate lifts
for (attribute1, attribute2) in itertools.combinations(attributes_20, 2):
    lift = calculate_lift(attribute1, attribute2, attribute_counter, co_occurrence_counter, total_messages)
    lift_results.append((attribute1, attribute2, lift))

# Create Lift Dataframe
lift_df = pd.DataFrame(lift_results, columns=['Attribute1', 'Attribute2', 'Lift'])

# Create Lift Matrix
lift_matrix = lift_df.pivot(index='Attribute1', columns='Attribute2', values='Lift')
lift_matrix = lift_matrix.combine_first(lift_matrix.T)
lift_matrix.fillna(0, inplace=True)

# Print Lift Matrix
print(lift_matrix)

              aroma   balance  balanced    barrel      beer    bodied  \
aroma      0.000000  0.354258  0.236642  0.290704  0.264807  0.151313   
balance    0.354258  0.000000  0.554760  0.719619  0.573867  0.367685   
balanced   0.236642  0.554760  0.000000  0.523259  0.497568  0.296343   
barrel     0.290704  0.719619  0.523259  0.000000  0.609843  0.243495   
beer       0.264807  0.573867  0.497568  0.609843  0.000000  0.208151   
bodied     0.151313  0.367685  0.296343  0.243495  0.208151  0.000000   
bottle     0.121661  0.255030  0.215697  0.303513  0.238001  0.123220   
brown      0.317886  0.586688  0.475355  0.710690  0.483532  0.341422   
caramel    0.140644  0.336322  0.309968  0.398645  0.295979  0.152764   
colour     0.234351  0.271367  0.307819  0.237529  0.265351  0.280255   
dark       0.331042  0.587629  0.505983  0.762764  0.493112  0.346348   
flavours   0.189706  0.507986  0.460978  0.666964  0.357642  0.116583   
malt       0.206353  0.544063  0.395360  0.511566  
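
For reference, lift compares how often two attributes appear together with how often they would if they were independent: lift(A, B) = P(A and B) / (P(A) * P(B)). Values above 1 suggest the attributes tend to co-occur. The following is a small worked sketch on made-up reviews; the reviews, the attribute set, and the `lift` helper are hypothetical and are not the notebook's data or code.

```python
from collections import Counter
from itertools import combinations
import re

# Made-up reviews and attributes, for illustration only.
reviews = [
    "Dark and malty with a caramel finish",
    "Caramel aroma, dark colour",
    "Light and crisp",
    "Malty, crisp, balanced",
]
attributes = {"dark", "caramel", "malty", "crisp"}

single = Counter()  # number of reviews mentioning each attribute
pair = Counter()    # number of reviews mentioning both attributes of a pair
for text in reviews:
    found = set(re.findall(r"\w+", text.lower())) & attributes
    single.update(found)
    pair.update(combinations(sorted(found), 2))

n = len(reviews)

def lift(a, b):
    """P(a and b) / (P(a) * P(b)), or 0.0 if either attribute never appears."""
    key = tuple(sorted((a, b)))
    p_a, p_b, p_ab = single[a] / n, single[b] / n, pair[key] / n
    return p_ab / (p_a * p_b) if p_a and p_b else 0.0

print(lift("dark", "caramel"))
print(lift("dark", "crisp"))
```

Here both tokenization steps lowercase and strip punctuation the same way, so the single and pair counts are measured on the same footing.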

## Task C

Perform a similarity analysis using cosine similarity (without word embeddings – i.e., using the bag-of-words model) with the 3 attributes specified by the customer and the reviews. 
The similarity script should accept as input a file with the product attributes, and calculate similarity scores (between 0 and 1) between these attributes and each review. That is, the output file should have 3 columns – product_name (for each product, the product_name will repeat as many times as there are reviews of the product), product_review and similarity_score. 
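
As a rough illustration of what the bag-of-words score measures, the sketch below computes cosine similarity between raw term-count vectors using only the standard library. The `bow_cosine` helper and the sample reviews are hypothetical; unlike the scikit-learn vectorizers used in this notebook, it does not remove stop words.

```python
import math
import re
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = "thick rich bodied"
print(bow_cosine(query, "Rich and thick, full bodied stout"))  # shares all three words
print(bow_cosine(query, "Light, crisp and refreshing"))        # shares no words
```

Because counts are never negative, the score always falls between 0 and 1, and it is exactly 0 whenever a review shares no word with the attribute list.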


In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [7]:
# Input File Name
inputfile = 'translated_messages.csv'

# Translated Column Messages
messagecolumn = 'translated_messages'

# Import Data
data = pd.read_csv(inputfile)
data = data.drop(['Rating', 'Review'], axis = 1)

# Output File 
outputfile = 'CDE.bagofwords.csv'

In [8]:
# User Attributes
userattributes = ['thick', 'rich', 'bodied']

In [12]:
# Initializing Dataframe
bagofwords = pd.DataFrame()
bagofwords['Product Name'] = data['Beer Name']
bagofwords['Product Review'] = data['translated_messages']


# Joining the Attributes
text1 = ' '.join(userattributes)

# Initializing the Lists
similarity_scores = []
similarity_scorestfidf = []

# Calculating Similarity
for text2 in data[messagecolumn]:
    documents =[text1, text2]
    
    # Non-Normalized
    count_vectorizer = CountVectorizer(stop_words='english')
    sparse_matrix = count_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    df = pd.DataFrame(doc_term_matrix,
        columns=count_vectorizer.get_feature_names_out(),
        index=['x', 'y'])
    similarity_scores.append(cosine_similarity(df, df)[0,1])

    # Normalized
    tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
    sparse_matrixtfidf = tfidf_vectorizer.fit_transform(documents)
    doc_term_matrixtfidf = sparse_matrixtfidf.todense()
    dftfidf = pd.DataFrame(doc_term_matrixtfidf, 
        columns = tfidf_vectorizer.get_feature_names_out(),
        index = ['x', 'y'])
    similarity_scorestfidf.append(cosine_similarity(dftfidf, dftfidf)[0,1])

# Saving to Dataframe
bagofwords['Cosine Similarity'] = similarity_scores
bagofwords['Cosine Similarity TFIDF'] = similarity_scorestfidf
bagofwords

| | Product Name | Product Review | Cosine Similarity | Cosine Similarity TFIDF |
|---|---|---|---|---|
| 0 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | You need personal informations from companies,... | 0.000000 | 0.000000 |
| 1 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Bottle after MBCC 2024. Black colour, malty ar... | 0.000000 | 0.000000 |
| 2 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Thank you for sharing this Chris - Black with ... | 0.000000 | 0.000000 |
| 3 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Boxed beer at home, proper glassware. Pitch bl... | 0.187317 | 0.134447 |
| 4 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | From backlog. (As 2018 Vintage) 0,3 litre Bott... | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 8226 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Bottle 82/100. Pours a deep, inky purple. More... | 0.102062 | 0.059846 |
| 8227 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Pours a deep berry hue. Aroma of rich blueberr... | 0.154303 | 0.091090 |
| 8228 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | If you look at my reviews, you will see how hi... | 0.000000 | 0.000000 |
| 8229 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Huge thank you to Dakine for sharing this with... | 0.121268 | 0.071261 |


In [13]:
# Outputting CSV
bagofwords.to_csv(outputfile, index = False)

## Task D

For every review, perform a sentiment analysis (using VADER or any LLM). In case you have to change the default values of words in the VADER lexicon, use this article: https://medium.com/swlh/adding-context-to-unsupervised-sentiment-analysis-7b6693d2c9f8 

In [None]:
from predictionguard import PredictionGuard
import os
import pandas as pd
import concurrent.futures

In [None]:
# Input File Name
inputfile = 'CDE.bagofwords.csv'

# Output File Name
outputfile = 'CDE.bagofwords.csv'

# Translated Column Messages
messagecolumn = 'Product Review'

# Import Data
data = pd.read_csv(inputfile)

In [None]:
# Set API key (read from the environment rather than hard-coding it)
api_key = os.getenv("PREDICTIONGUARD_API_KEY")

# Initialize the PredictionGuard client
client = PredictionGuard(api_key=api_key)

# System Behavior
system_message = {
    "role": "system",
    "content": (
        "You are our personal sentiment score evaluator for beers. Your task is to assess the product reviews and assign a sentiment score value between -1 and 1. \n"
        "Please account for potential sarcasm and slang terms. \n"
        "The context is beer, so please keep in mind how word meanings and their sentiments may change in this context. \n"
        "Do not provide any explanations or contextual information. Only return the sentiment score of the message."
    )
}

def process_message(row):
    try:
        user_message = row
        # Prepare the messages list for the chatbot
        messages = [
            system_message,
            {
                "role": "user",
                "content": f"{user_message}"
            }
        ]
        
        # Send the message to the PredictionGuard API
        result = client.chat.completions.create(
            model="Hermes-3-Llama-3.1-8B",
            messages=messages
        )
        
        # Extract the chatbot's response
        response = result['choices'][0]['message']['content']
        return response
    except Exception as e:
        return f"Error: {str(e)}"
    
# Process in Parallel 
def process_in_parallel(data_column, max_workers=5):
    # Counter for processed messages
    count = 0
    
    # Function to update the counter and print progress
    def process_message_with_checker(row):
        nonlocal count
        count += 1
        if count % 100 == 0:
            print(f"Processed {count} messages")
        return process_message(row)

    # Use ThreadPoolExecutor for parallel processing
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        responses = list(executor.map(process_message_with_checker, data_column))
    return responses
    

In [None]:
data_subset = data[messagecolumn]

# Run sentiment analysis in parallel with 5 threads
response = process_in_parallel(data_subset, max_workers=5)

# The result is a list of responses
print(response)

Processed 100 messages
Processed 200 messages
...
Processed 4200 messages
...

In [23]:
data['Sentiment Scores'] = response

| | Product Name | Product Review | Cosine Similarity | Cosine Similarity TFIDF | Sentiment Scores |
|---|---|---|---|---|---|
| 0 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | You need personal informations from companies,... | 0.000000 | 0.000000 | -0.7 |
| 1 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Bottle after MBCC 2024. Black colour, malty ar... | 0.000000 | 0.000000 | 0.8 |
| 2 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Thank you for sharing this Chris - Black with ... | 0.000000 | 0.000000 | 0.7 |
| 3 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Boxed beer at home, proper glassware. Pitch bl... | 0.187317 | 0.134447 | 0.9 |
| 4 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | From backlog. (As 2018 Vintage) 0,3 litre Bott... | 0.000000 | 0.000000 | 0.8 |
| ... | ... | ... | ... | ... | ... |
| 8226 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Bottle 82/100. Pours a deep, inky purple. More... | 0.102062 | 0.059846 | 0.8 |
| 8227 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Pours a deep berry hue. Aroma of rich blueberr... | 0.154303 | 0.091090 | 0.8 |
| 8228 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | If you look at my reviews, you will see how hi... | 0.000000 | 0.000000 | 0.8 |
| 8229 | Superstition Blue Berry White🇺🇸Mead - Melomel ... | Huge thank you to Dakine for sharing this with... | 0.121268 | 0.071261 | 0.8 |


In [24]:
# Output to csv
data.to_csv(outputfile, index = False)

## Task E

Create an evaluation score for each beer that uses both similarity and sentiment scores. 
Now recommend 3 products to the customer. 


In [25]:
import pandas as pd

In [26]:
# Input File
inputfile = 'CDE.bagofwords.csv'

# Output File
outputfile = 'CDE.bagofwords.csv'

In [27]:
# Calculate Evaluation Score Function
def evaluation(cosine_sim, sentiment):
    try:
        norm_sentiment = (float(sentiment) + 1) / 2
        return ((cosine_sim * 0.8) + (norm_sentiment * 0.2))
    except Exception as e:
        return(0)

In [28]:
# Import Data
data = pd.read_csv(inputfile)
cosine_sim = data['Cosine Similarity TFIDF']
sentiment = data['Sentiment Scores']

In [29]:
# Calculation of Evaluation Score
evaluation_list = []
for i in range(len(data)):
    evaluation_list.append(evaluation(cosine_sim[i], sentiment[i]))

In [30]:
# Save Evaluation Score 
data['Evaluation Score'] = evaluation_list

In [36]:
# Turning Sentiment into Floats
sentiment = []
for i in data['Sentiment Scores']:
    try:
            sentiment.append(float(i))
    except Exception as e:
            sentiment.append(0)
data['Sentiment Scores'] = sentiment
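
LLM replies are not guaranteed to be a bare number, so a slightly more forgiving parser can recover scores from replies such as "Sentiment score: 0.8". The sketch below is one possible approach, not the notebook's method: `parse_sentiment` is a hypothetical helper, and it assumes failed calls come back as strings starting with "Error", as `process_message` returns them.

```python
import re

def parse_sentiment(response, default=0.0):
    """Extract the first number from an LLM reply and clamp it to [-1, 1].

    Replies that start with "Error" (failed API calls) or contain no
    number fall back to `default`.
    """
    text = str(response).strip()
    if text.lower().startswith("error"):
        return default
    match = re.search(r"[-+]?\d*\.?\d+", text)
    if match is None:
        return default
    return max(-1.0, min(1.0, float(match.group())))

for reply in ["0.8", "Sentiment score: -0.7", "1.5", "Error: 429 Too Many Requests"]:
    print(repr(reply), "->", parse_sentiment(reply))
```

Defaulting failures to 0 treats them as neutral reviews; dropping those rows instead is an equally defensible choice.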

In [50]:
# Aggregate Evaluation Scores per Beer
beers = data['Product Name'].drop_duplicates()
beer_score = data[['Product Name', 'Evaluation Score', 'Cosine Similarity TFIDF', 'Sentiment Scores']].groupby('Product Name').mean(numeric_only=True)
beer_score = beer_score.sort_values('Evaluation Score', ascending = False)

In [51]:
# Output to CSV
data.to_csv(outputfile, index = False)

beer_score.to_csv('E.beer_recommendations.csv')

In [52]:
# Recommendation
beer_score[0:3]

| Product Name | Evaluation Score | Cosine Similarity TFIDF | Sentiment Scores |
|---|---|---|---|
| Superstition Grand Cru Berry - F.O. Barrel Aged🇺🇸Mead - Melomel / Fruited | 0.202101 | 0.034167 | 0.747674 |
| B. Nektar Ken Schramm Signature Series - The Heart of Darkness🇺🇸Mead - Melomel / Fruited | 0.200601 | 0.024627 | 0.809 |
| Toppling Goliath SR-71 Blackbird (2015 Bottling / Draft)🇺🇸Stout - Imperial | 0.198969 | 0.0377 | 0.68809 |


### Results 
$$
\text{Evaluation Score} = 0.8 \times \text{Similarity} + 0.2 \times \frac{\text{Sentiment} + 1}{2}
$$
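
A small worked example of the score, mirroring the `evaluation` function above, which rescales sentiment from [-1, 1] to [0, 1] before weighting. The `evaluation_score` name, the `w_sim` parameter, and the input values are illustrative assumptions.

```python
def evaluation_score(similarity, sentiment, w_sim=0.8):
    """Weighted blend of similarity and sentiment rescaled to [0, 1]."""
    norm_sentiment = (sentiment + 1) / 2
    return w_sim * similarity + (1 - w_sim) * norm_sentiment

# A review with similarity 0.10 and sentiment 0.8:
# 0.8 * 0.10 + 0.2 * 0.9
print(round(evaluation_score(0.10, 0.8), 3))

# A glowing review that never mentions the attributes can contribute at most 0.2
print(round(evaluation_score(0.0, 1.0), 3))
```

With these weights, sentiment alone caps a review's score at 0.2, so similarity does most of the work in separating beers whenever it is nonzero.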


## Task F 

How would your recommendations change if you use word vectors (e.g., the spaCy package with medium sized pretrained word vectors) instead of the plain vanilla bag-of-words cosine similarity? One way to analyze the difference would be to consider the % of reviews that mention a preferred attribute. E.g., if you recommend a product, what % of its reviews mention an attribute specified by the customer? Do you see any difference across bag-of-words and word vector approaches? Explain. This article may be useful: https://medium.com/swlh/word-embeddings-versus-bag-of-words-the-curious-case-of-recommender-systems-6ac1604d4424?source=friends_link&sk=d746da9f094d1222a35519387afc6338


Note that the article doesn’t claim that bag-of-words will always be better than word embeddings for recommender systems. It lays out conditions under which it is likely to be the case. That is, depending on the attributes you use, you may or may not see the same effect. 


In [42]:
import pandas as pd 
import spacy
nlp = spacy.load('en_core_web_md')

In [43]:
# Input File Name
inputfile = 'CDE.bagofwords.csv'

# Output File Name
outputfile = 'CDE.bagofwords.csv'

# Translated Column Messages
messagecolumn = 'Product Review'

# Import Data
data = pd.read_csv(inputfile)

# User Attributes
userattributes = ['thick', 'rich', 'bodied']

In [44]:
# Joining the Attributes
text1 = ' '.join(userattributes)
doc1 = nlp(text1)

# Initializing the Lists
spacy_scores = []

# Calculating Similarity
for text2 in data[messagecolumn]:
    doc2 = nlp(text2)
    spacy_scores.append(doc1.similarity(doc2))

# Saving Results to DataFrame
data['Spacy Similarity'] = spacy_scores

# Outputting CSV
data.to_csv(outputfile, index = False)

# Inputting Recommendations
recommendations = pd.read_csv("E.beer_recommendations.csv")
recommendation = recommendations['Product Name'][0:3].values.tolist()

In [45]:
# Percent of Each Beer That Contains the Attributes
percent_TFIDF = []
percent_spacy = []

for name in recommendation:
    total_count = int(data[data['Product Name'] == name]['Product Name'].count())
    non_zero_TFIDF = int((data[data['Product Name'] == name]['Cosine Similarity TFIDF'] != 0).sum())
    non_zero_spacy = int((data[data['Product Name'] == name]['Spacy Similarity'] != 0).sum())
    percent_TFIDF.append((non_zero_TFIDF / total_count) * 100)
    percent_spacy.append((non_zero_spacy / total_count) * 100)

In [46]:
# Comparing Results 
comparison = pd.DataFrame()
comparison['Product Name'] = recommendation
comparison['Bag of Words Similarity'] = percent_TFIDF
comparison['Vector Similarity'] = percent_spacy
comparison

| | Product Name | Bag of Words Similarity | Vector Similarity |
|---|---|---|---|
| 0 | Superstition Grand Cru Berry - F.O. Barrel Age... | 27.906977 | 100.0 |
| 1 | B. Nektar Ken Schramm Signature Series - The H... | 28.0 | 100.0 |
| 2 | Toppling Goliath SR-71 Blackbird (2015 Bottlin... | 42.696629 | 100.0 |
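
The percentages above count a review as a match whenever its similarity score is nonzero, which for dense word vectors is almost always the case. A stricter alternative, sketched here under the assumption that "mentions an attribute" means the attribute appears as a whole word, is to check the review text directly. The `mention_rate` helper and the sample reviews are hypothetical.

```python
import re

def mention_rate(reviews, attributes):
    """Percentage of reviews containing at least one attribute as a whole word."""
    wanted = {a.lower() for a in attributes}
    if not reviews:
        return 0.0
    hits = sum(
        1 for review in reviews
        if wanted & set(re.findall(r"\w+", str(review).lower()))
    )
    return 100.0 * hits / len(reviews)

sample = [
    "Thick and rich, almost syrupy.",
    "Thin body, very sweet.",
    "Full-bodied with dark fruit.",
    "Great bottle share!",
]
print(mention_rate(sample, ["thick", "rich", "bodied"]))
```

Applying the same function to the reviews of each recommended beer gives one mention rate per beer that does not depend on which similarity model produced the recommendation, so the two recommendation lists can be compared on equal terms.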


In [47]:
# Finding New Evaluation Scores Using Spacy
# Calculate Evaluation Score Function
def evaluation(cosine_sim, sentiment):
    try:
        norm_sentiment = (float(sentiment) + 1) / 2
        return ((cosine_sim * 0.8) + (norm_sentiment * 0.2))
    except Exception as e:
        return(0)
# Import Data
cosine_sim = data['Spacy Similarity']
sentiment = data['Sentiment Scores']

# Calculation Evaluation Score
evaluation_list = []
for i in range(len(data)):
    evaluation_list.append(evaluation(cosine_sim[i], sentiment[i]))

# Save Evaluation Score 
data['Spacy Evaluation Score'] = evaluation_list

# Aggregate Evaluation Scores per Beer
beers = data['Product Name'].drop_duplicates()
beer_score = data[['Product Name', 'Spacy Evaluation Score', 'Spacy Similarity', 'Sentiment Scores']].groupby('Product Name').mean(numeric_only=True)
beer_score = beer_score.sort_values('Spacy Evaluation Score', ascending = False)

# Output to CSV
data.to_csv(outputfile, index = False)

beer_score.to_csv('F.new_recommendations.csv')

In [49]:
# New Recommendations
beer_score[0:3]

| Product Name | Spacy Evaluation Score | Spacy Similarity | Sentiment Scores |
|---|---|---|---|
| Anchorage A Deal With The Devil - Double Oaked 2017🇺🇸Barley Wine / Wheat Wine / Rye Wine | 0.634633 | 0.581589 | 0.693617 |
| Sahtipaja MeadMe Batch #2 - Bourbon Vanilla🇸🇪Mead - Melomel / Fruited | 0.625957 | 0.564998 | 0.739583 |
| Superstition Grand Cru Berry - F.O. Barrel Aged🇺🇸Mead - Melomel / Fruited | 0.61253 | 0.547203 | 0.747674 |


### Results
There is a clear difference between the bag-of-words and spaCy approaches. spaCy produces higher cosine similarities than the bag-of-words model, likely because word embeddings let related words contribute to the score, whereas bag of words only counts exact word matches. In a niche context such as beer attributes, bag of words is the safer choice: the words have very specific meanings, and crediting loosely related words is not beneficial. The set of recommended beers also changed from the bag-of-words model. spaCy found some similarity between the user's attributes and every single review of every beer.

In this scenario, word embeddings are not helpful. Looking through the data, some reviews are spam with no connection to beer or to the attributes described by the user, yet spaCy still assigned them a noticeable similarity score. In this context, therefore, spaCy is inflating the similarity scores.
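
To make the contrast concrete, here is a minimal sketch of bag-of-words cosine similarity using only the standard library. The attribute query and the two review strings are hypothetical stand-ins, not rows from our dataset, and the whitespace tokenizer is a simplification of what the vectorizers in this notebook do. It shows why an off-topic message scores exactly zero under bag of words: with no shared words, the dot product is zero.

```python
import math
from collections import Counter


def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical attribute query and reviews, for illustration only.
user_attributes = "crisp fruity balanced"
beer_review = "a crisp and fruity ale"
spam_review = "contact nick to fix your credit score"

review_similarity = bag_of_words_cosine(user_attributes, beer_review)
spam_similarity = bag_of_words_cosine(user_attributes, spam_review)
print(f"beer review: {review_similarity:.2f}, spam review: {spam_similarity:.2f}")
```

An embedding-based score has no such hard floor, because every word vector has some nonzero relationship to every other, which is consistent with the nonzero spaCy scores we observed on spam.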

In [35]:
import textwrap
data = pd.read_csv('CDE.bagofwords.csv')

data[['Product Review', 'Spacy Similarity']].head(1)

print(textwrap.fill(data['Product Review'][0], width = 140))
result = data['Spacy Similarity'][0]
print(f'\nThe calculated spacy similarity was: {result:.2f}')

You need personal informations from companies,family and friends that will better your life and you need easy access without them noticing
just contact nick or you’re financially unstable or you have a bad credit score, he will solve that without stress,he and his team can clear
criminal records without leaving a trace and can also anonymously credit your empty credit cards with funds you need,all these are not done
free obviously but I like working with nick and his team cause they keep you updated on every step taken in order to achieve the goal and
they also deliver on time,I tested and confirmed this I’m still happy on how my life is improving after my encounter with him ,you can send
a mail to Premiumhackservices AT gmail DOT com, Whatsapp: +14106350697.

The calculated spacy similarity was: 0.18


## Task G

How would your recommendations differ if you ignored the similarity and feature sentiment scores and simply chose the 3 highest rated products from your entire dataset? Would these products meet the requirements of the user looking for recommendations? Why or why not? Justify your answer with analysis. Use the similarity and sentiment scores as well as overall ratings to answer this question. 

In [None]:
import pandas as pd

In [None]:
# Input File Names
inputfile = 'CDE.bagofwords.csv'
inputfile2 = 'translated_messages.csv'

# Translated Column Messages
messagecolumn = 'Product Review'
ratingcolumn = 'rating'

# Import Data
data = pd.read_csv(inputfile)
data2 = pd.read_csv(inputfile2)

# Turning Sentiment into Floats
sentiment = []
for i in data['Sentiment Scores']:
    try:
        sentiment.append(float(i))
    except (TypeError, ValueError):
        # Values that cannot be parsed as numbers default to 0
        sentiment.append(0)
data['Sentiment Scores'] = sentiment

# Find Top 3 Highest Rated Beers
# Assumes both files list the same reviews in the same row order
data['Ratings'] = data2[ratingcolumn]

# Average every score per beer once, then rank by each criterion
score_columns = ['Product Name', 'Evaluation Score', 'Cosine Similarity TFIDF', 'Sentiment Scores', 'Ratings']
beer_means = data[score_columns].groupby('Product Name').mean(numeric_only=True)
highest_rated = beer_means.sort_values('Ratings', ascending=False)[0:3]
recommendations = beer_means.sort_values('Evaluation Score', ascending=False)[0:3]

In [None]:
# Compare 
pd.set_option('display.expand_frame_repr', False)  # Do not wrap the output
print("Based on ratings alone:")
print(highest_rated)
print("\n\nBased on bag of words similarity and sentiment:")
print(recommendations)

Based on ratings alone:
                                                    Evaluation Score  Cosine Similarity TFIDF  Sentiment Scores   Ratings
Product Name                                                                                                             
Toppling Goliath Kentucky Brunch🇺🇸Stout - Imper...          0.192930                 0.023483          0.741436  4.563536
Cigar City Pilot Series Miami Madness🇺🇸Berliner...          0.189680                 0.006743          0.842857  4.561905
Side Project Beer : Barrel : Time - 2018🇺🇸Stout...          0.186155                 0.015031          0.741304  4.560870


Based on bag of words similarity and sentiment:
                                                    Evaluation Score  Cosine Similarity TFIDF  Sentiment Scores   Ratings
Product Name                                                                                                             
Superstition Grand Cru Berry - F.O. Barrel Aged...          0.202101    

### Results
The ratings-only results would not satisfy the user because these beers do not have the attributes the user is looking for. The three highest rated beers have TF-IDF cosine similarities of only 0.007 to 0.023 and evaluation scores (which combine sentiment and cosine similarity) of 0.186 to 0.193, below the 0.202 of the top similarity-based recommendation. In addition, the recommended beers still have high ratings relative to the highest rated beers, so their quality should still be acceptable while fitting the user's taste better.

## Task H

Choose any 10 beers in your data. Now choose any one of them, and find the most similar beer (among the remaining 9). Explain your method and logic. https://medium.datadriveninvestor.com/who-is-your-competitor-in-the-era-of-the-long-tail-d0ac24fedde8

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Input File Names
inputfile = 'CDE.bagofwords.csv'
inputfile2 = 'translated_messages.csv'

# Import Data
data = pd.read_csv(inputfile)
data2 = pd.read_csv(inputfile2)

In [None]:
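# Find Top 10 Beers
# Assumes both files list the same reviews in the same row order
data['Ratings'] = data2['rating']
# Average the rating per beer and keep the names of the 10 highest rated
highest_rated = data[['Product Name', 'Ratings']].groupby('Product Name').mean(numeric_only=True).sort_values('Ratings', ascending=False).index[0:10].tolist()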

In [None]:
# Represent Each Beer As a Single Document
beer_messages = pd.DataFrame()
beer_messages['Beer'] = highest_rated

fullmessages = []
for beer in highest_rated:
    fullmessages.append(data[data['Product Name'] == beer]['Product Review'].str.cat(sep = ' '))

beer_messages['Combined Reviews'] = fullmessages

In [None]:
# Find Cosine Similarity Between Highest Rated Beer and Rest of Top 10 Using Bag of Words Model

# Setting Target Beer
text1 = beer_messages['Combined Reviews'][0]

# Initializing the Lists
similarity_scores = []
similarity_scorestfidf = []

# Calculating Similarity
for text2 in beer_messages['Combined Reviews'][1:10]:
    documents =[text1, text2]
    
    # Non-Normalized
    count_vectorizer = CountVectorizer(stop_words='english')
    sparse_matrix = count_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    df = pd.DataFrame(doc_term_matrix,
        columns=count_vectorizer.get_feature_names_out(),
        index=['x', 'y'])
    similarity_scores.append(cosine_similarity(df, df)[0,1])

    # Normalized
    tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
    sparse_matrixtfidf = tfidf_vectorizer.fit_transform(documents)
    doc_term_matrixtfidf = sparse_matrixtfidf.todense()
    dftfidf = pd.DataFrame(doc_term_matrixtfidf, 
        columns = tfidf_vectorizer.get_feature_names_out(),
        index = ['x', 'y'])
    similarity_scorestfidf.append(cosine_similarity(dftfidf, dftfidf)[0,1])


In [None]:
# Saving to Dataframe
beersimilarities = pd.DataFrame()
beersimilarities['Beer 1'] = highest_rated[0:1] * (len(highest_rated) - 1)
beersimilarities['Beer 2'] = highest_rated[1:10]
beersimilarities['Cosine Similarity'] = similarity_scores
beersimilarities['Normalized Similarity'] = similarity_scorestfidf

beersimilarities.sort_values('Cosine Similarity', ascending=False)

|  | Beer 1 | Beer 2 | Cosine Similarity | Normalized Similarity |
|---|---|---|---|---|
| 7 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Goose Island Bourbon County Stout - Rare 2010🇺... | 0.842742 | 0.835625 |
| 1 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Side Project Beer : Barrel : Time - 2018🇺🇸Stou... | 0.774314 | 0.715586 |
| 5 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Cycle / 3 Sons Rare Scooop🇺🇸Stout - Imperial F... | 0.652199 | 0.567901 |
| 8 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Superstition Grand Cru Berry - F.O. Barrel Age... | 0.485763 | 0.440068 |
| 4 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Sahtipaja MeadMe Batch #2 - Bourbon Vanilla🇸🇪M... | 0.47852 | 0.408192 |
| 3 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | B. Nektar Ken Schramm Signature Series - The H... | 0.474992 | 0.380317 |
| 6 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Schramm's The Heart of Darkness🇺🇸Mead - Melome... | 0.471103 | 0.375422 |
| 2 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Cigar City Pilot Series Dragonfruit Passion Fr... | 0.318535 | 0.231672 |
| 0 | Toppling Goliath Kentucky Brunch🇺🇸Stout - Impe... | Cigar City Pilot Series Miami Madness🇺🇸Berline... | 0.276189 | 0.187509 |


### Method & Logic
I chose the 10 highest rated beers for this analysis. The target beer is the highest rated of the ten, and I computed its cosine similarity with each of the other nine.

For the similarity calculations, I joined every review of each beer into one long document. This made it easy to compare beers using bag-of-words cosine similarity.

I chose bag-of-words cosine similarity because, as we saw before, the spaCy similarity was not appropriate for this scenario and inflated similarities greatly.

I used both the count vectorizer and the TF-IDF vectorizer, each with English stop words removed. The count vectorizer uses raw word frequencies, while TF-IDF gives less weight to words that appear in both documents, so vocabulary shared by the two beers contributes less to the score.

Both vectorizers give the same ranking. The most similar beer to Toppling Goliath Kentucky Brunch is Goose Island Bourbon County Stout - Rare 2010, with a cosine similarity of 0.84 under both count and TF-IDF weighting.
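
The normalized scores in the table are lower than the raw count scores in every row. The sketch below suggests why, using only the standard library. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`, and a plain whitespace tokenizer; the two short documents are hypothetical, not our combined reviews. With only two documents per fit, words unique to one beer get a larger weight than shared words, which lengthens each vector without increasing the dot product.

```python
import math
from collections import Counter


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b)


def count_and_tfidf_cosine(doc_a, doc_b):
    """Return (count cosine, TF-IDF cosine) for a pair of documents."""
    counts = [Counter(doc_a.lower().split()), Counter(doc_b.lower().split())]
    n_docs = len(counts)
    vocabulary = set(counts[0]) | set(counts[1])
    # Smoothed IDF: terms found in both documents get weight 1.0,
    # terms found in only one document get ln(3 / 2) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in c for c in counts))) + 1
        for term in vocabulary
    }
    weighted = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]
    return cosine(counts[0], counts[1]), cosine(weighted[0], weighted[1])


# Hypothetical combined reviews for two beers, for illustration only.
count_sim, tfidf_sim = count_and_tfidf_cosine(
    "bourbon barrel stout coffee",
    "bourbon barrel stout maple vanilla",
)
print(f"count: {count_sim:.3f}, tf-idf: {tfidf_sim:.3f}")
```

Because the weighting only rescales terms, it tends to preserve the ordering of the beers even though it lowers every score, which matches what the table shows.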