# Building a Crowdsourced Recommender System (WORKING FILE)

### Group Members: Jose Currea, Jenna Ferguson, Evan Hadd, Ramzi Kattan, Hadley Krummel, Jennifer Gonzales, Ibrahim Muhammad
### Class Section: Afternoon 1 - 3pm

The recommender system should accept user input in the form of desired product attributes and return 3 recommendations.

**Your Python Notebook should include the following:**
- All scripts 
- The sentiment and similarity scores for the three products you recommended in task E.
- Your analyses for and answer to task F. Make sure you show the ratings, similarity scores and sentiments for the products you recommend in tasks E and F. Use tables whenever possible.  
- Show the logic you are using in addition to finding the most similar product. 

## Imports

In [1]:
# Standard library
import itertools
import re
import time
from collections import Counter  # For counting word occurrences

# Data handling and plotting
import numpy as np               # For numerical operations like log transformations
import pandas as pd
import matplotlib.pyplot as plt  # For plotting

# Scraping
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Text processing
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # Uncomment on first run

# Modeling and statistics
import statsmodels.api as sm     # For the OLS regression
from scipy import stats          # For t-statistic and p-value calculations
from sklearn import manifold
from sklearn.linear_model import LinearRegression
from sklearn.manifold import MDS

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent wrapping to multiple lines

## Task A

Extract about 5-6k reviews. Many reviews may not have any text and will therefore be discarded, so you may end up with roughly 1700-2000 reviews with text.

In [None]:
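def scrape_page(url):
    """Return a list of {"Date", "Message"} dicts for one page, or [] on failure."""
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        # Return an empty list so the caller's extend() does not fail on None.
        print(f"Skipping {url}: HTTP {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    messages = soup.find_all("div", class_="Message userContent")
    dates = soup.find_all("time")

    data = []
    # zip() pairs posts with timestamps by position and stops at the shorter list.
    for message, date in zip(messages, dates):
        message_text = message.get_text(strip=True)
        date_text = date.get("title")
        data.append({"Date": date_text, "Message": message_text})
    return data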


def scrape_forum(base_url, total_pages):
    all_data = []

    for page_num in range(1, total_pages + 1):
        # Strip any trailing slash so the page URL does not contain "//p1".
        page_url = f"{base_url.rstrip('/')}/p{page_num}"
        print(f"Scraping page {page_num}: {page_url}")
        page_data = scrape_page(page_url)
        all_data.extend(page_data)
    return all_data

In [None]:
base_url = "https://www.beeradvocate.com/beer/top-rated/"
total_pages = 300
forum_data = scrape_forum(base_url, total_pages)
messagedata = pd.DataFrame(forum_data)
messagedata.to_csv("messagedata.csv", index = False)
len(messagedata)

## Task B

Assume that a customer using this recommender system has specified 3 desired attributes of a product. For example, one website describes multiple attributes of beer (but you should choose attributes from the actual data, as you did for the first assignment):

https://www.dummies.com/food-drink/drinks/beer/beer-for-dummies-cheat-sheet/
- Aggressive: Boldly assertive aroma and/or taste
- Balanced: Malt and hops in similar proportions; equal representation of malt sweetness and hop bitterness in the flavor — especially at the finish
- Complex: Multidimensional; many flavors and sensations on the palate
- Crisp: Highly carbonated; effervescent
- Fruity: Flavors reminiscent of various fruits or Hoppy: Herbal, earthy, spicy, or citric aromas and flavors of hops or Malty: Grainy, caramel-like; can be sweet or dry
- Robust: Rich and full-bodied
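
One way to choose attributes from the actual data is to count how many reviews mention each word and pick descriptors from the top of that list. The sketch below is only an illustration: the `reviews` list, the `top_words` helper, and the small `STOPWORDS` set (a stand-in for `nltk.corpus.stopwords`) are all assumptions, not part of the assignment.

```python
import re
from collections import Counter

# Tiny stand-in for nltk's English stopword list (assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "with", "of", "this", "but", "very", "to"}


def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(reviews, n=10):
    """Return the n words that appear in the most reviews (document frequency)."""
    counts = Counter()
    for review in reviews:
        # set() counts a word once per review, however often it is repeated.
        counts.update(set(tokenize(review)) - STOPWORDS)
    return counts.most_common(n)


# Hypothetical reviews, used only to illustrate the call.
reviews = [
    "Crisp and fruity with a balanced finish.",
    "Very hoppy, crisp, and a bit aggressive.",
    "Robust malty body, but the finish is crisp.",
]
print(top_words(reviews, n=2))
```

Counting reviews rather than raw occurrences keeps one long, repetitive review from dominating the list.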


## Task C

Perform a similarity analysis using cosine similarity (without word embeddings – i.e., using the bag-of-words model) with the 3 attributes specified by the customer and the reviews. 
The similarity script should accept as input a file with the product attributes, and calculate similarity scores (between 0 and 1) between these attributes and each review. That is, the output file should have 3 columns – product_name (for each product, the product_name will repeat as many times as there are reviews of the product), product_review and similarity_score. 


## Task D

For every review, perform a sentiment analysis (using VADER or any LLM). If you need to change the default values of words in the VADER lexicon, use this article: https://medium.com/swlh/adding-context-to-unsupervised-sentiment-analysis-7b6693d2c9f8 
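
A hedged sketch of the VADER step using NLTK's `SentimentIntensityAnalyzer`. The row layout (`product_review` key), the helper names, and the example lexicon override are assumptions for illustration; the valence you assign to a word is a judgment call for your own data.

```python
def build_vader(extra_lexicon=None):
    """Create a VADER analyzer, optionally overriding word valences."""
    # Imported lazily so the rest of the sketch can be used without nltk.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    analyzer = SentimentIntensityAnalyzer()
    if extra_lexicon:
        # analyzer.lexicon is a plain dict of word -> valence (roughly -4 to 4).
        analyzer.lexicon.update(extra_lexicon)
    return analyzer


def add_sentiment(rows, analyzer):
    """Attach VADER's compound score (-1 to 1) to each review row."""
    return [
        {**row, "sentiment_score": analyzer.polarity_scores(row["product_review"])["compound"]}
        for row in rows
    ]


# Example usage (requires nltk and the VADER lexicon):
# analyzer = build_vader({"bitter": 1.0})  # hypothetical override for beer reviews
# scored = add_sentiment(rows, analyzer)
```

Overriding the lexicon matters for domain words: a term such as "bitter" may read as negative in general text yet be neutral or positive in a beer review.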

## Task E

Create an evaluation score for each beer that uses both similarity and sentiment scores. 
Now recommend 3 products to the customer. 
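
One possible evaluation score, sketched under stated assumptions: average the similarity and sentiment of each beer's reviews, rescale sentiment from [-1, 1] to [0, 1], and take a weighted sum. The equal weighting and the function names are hypothetical choices, not a prescribed formula.

```python
from collections import defaultdict


def evaluation_scores(rows, weight_similarity=0.5):
    """Combine mean similarity and mean sentiment into one score per product."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["product_name"]].append(row)

    scores = {}
    for name, group in grouped.items():
        mean_similarity = sum(r["similarity_score"] for r in group) / len(group)
        mean_sentiment = sum(r["sentiment_score"] for r in group) / len(group)
        # Rescale sentiment so both components live on the same 0-1 scale.
        sentiment_01 = (mean_sentiment + 1) / 2
        scores[name] = (
            weight_similarity * mean_similarity
            + (1 - weight_similarity) * sentiment_01
        )
    return scores


def recommend(rows, k=3, weight_similarity=0.5):
    """Return the k products with the highest evaluation score."""
    scores = evaluation_scores(rows, weight_similarity)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-review rows (one review per beer to keep the example short).
rows = [
    {"product_name": "A", "similarity_score": 0.8, "sentiment_score": 0.6},
    {"product_name": "B", "similarity_score": 0.2, "sentiment_score": 0.0},
    {"product_name": "C", "similarity_score": 0.5, "sentiment_score": -1.0},
    {"product_name": "D", "similarity_score": 0.4, "sentiment_score": 0.8},
]
for name, score in recommend(rows):
    print(name, round(score, 2))
```

A weighted sum lets a beer with modest similarity but very positive reviews still rank well; a product of the two components would instead penalize any beer that is weak on either one.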


## Task F 

How would your recommendations change if you use word vectors (e.g., the spaCy package with medium sized pretrained word vectors) instead of the plain vanilla bag-of-words cosine similarity? One way to analyze the difference would be to consider the % of reviews that mention a preferred attribute. E.g., if you recommend a product, what % of its reviews mention an attribute specified by the customer? Do you see any difference across bag-of-words and word vector approaches? Explain. This article may be useful: https://medium.com/swlh/word-embeddings-versus-bag-of-words-the-curious-case-of-recommender-systems-6ac1604d4424?source=friends_link&sk=d746da9f094d1222a35519387afc6338


Note that the article doesn’t claim that bag-of-words will always be better than word embeddings for recommender systems. It lays out conditions under which it is likely to be the case. That is, depending on the attributes you use, you may or may not see the same effect. 
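
The comparison suggested above can be sketched as follows. `mention_rate` computes the % of a product's reviews that literally mention a requested attribute, and `spacy_similarity` shows the word-vector counterpart of the similarity score. Both function names are hypothetical, and `nlp` is assumed to be a spaCy pipeline with vectors, such as the one returned by `spacy.load("en_core_web_md")`.

```python
import re


def mention_rate(reviews, attributes):
    """Percentage of reviews that contain at least one attribute word."""
    if not reviews:
        return 0.0
    wanted = {attribute.lower() for attribute in attributes}
    hits = sum(
        1 for review in reviews
        if wanted & set(re.findall(r"[a-z]+", review.lower()))
    )
    return 100.0 * hits / len(reviews)


def spacy_similarity(review, attributes, nlp):
    """Word-vector similarity between a review and the attribute list.

    `nlp` must be a loaded spaCy pipeline that ships word vectors.
    """
    return nlp(review).similarity(nlp(" ".join(attributes)))


# Hypothetical reviews for one recommended beer.
attributes = ["crisp", "fruity", "balanced"]
reviews = ["Crisp and clean", "Refreshing and light", "Fruity nose", "Dark roast"]
print(mention_rate(reviews, attributes))
```

Word vectors may score a review such as "Refreshing and light" as close to "crisp" even though it never uses the word, so a product recommended by the embedding approach can show a lower mention rate than one recommended by bag-of-words.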


## Task G

How would your recommendations differ if you ignored the similarity and feature sentiment scores and simply chose the 3 highest rated products from your entire dataset? Would these products meet the requirements of the user looking for recommendations? Why or why not? Justify your answer with analysis. Use the similarity and sentiment scores as well as overall ratings to answer this question. 
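
A small sketch of this comparison, assuming a per-product table that already holds an average `rating` and the `evaluation_score` from Task E. The field names and the sample values are hypothetical.

```python
def top_by(products, key, k=3):
    """Return the k products with the largest value for the given field."""
    return sorted(products, key=lambda product: product[key], reverse=True)[:k]


def compare_rankings(products, k=3):
    """Compare the rating-only ranking with the attribute-aware ranking."""
    by_rating = top_by(products, "rating", k)
    by_score = top_by(products, "evaluation_score", k)
    overlap = (
        {product["product_name"] for product in by_rating}
        & {product["product_name"] for product in by_score}
    )
    return by_rating, by_score, overlap


# Hypothetical per-product summary rows.
products = [
    {"product_name": "A", "rating": 4.9, "evaluation_score": 0.30},
    {"product_name": "B", "rating": 4.8, "evaluation_score": 0.75},
    {"product_name": "C", "rating": 4.7, "evaluation_score": 0.20},
    {"product_name": "D", "rating": 4.2, "evaluation_score": 0.80},
    {"product_name": "E", "rating": 4.0, "evaluation_score": 0.60},
]
by_rating, by_score, overlap = compare_rankings(products)
print([p["product_name"] for p in by_rating], [p["product_name"] for p in by_score], overlap)
```

A small overlap between the two lists would suggest that the highest rated beers are well liked in general but do not necessarily match the attributes the customer asked for.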

## Task H

Choose any 10 beers in your data. Now choose any one of them, and find the most similar beer (among the remaining 9). Explain your method and logic. https://medium.datadriveninvestor.com/who-is-your-competitor-in-the-era-of-the-long-tail-d0ac24fedde8
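
One possible method, sketched under assumptions: describe each beer by the share of its reviews that mention each descriptor, then pick the beer whose profile has the highest cosine similarity to the chosen one. The `DESCRIPTORS` list, the function names, and the three sample beers (standing in for the 10) are hypothetical.

```python
import math
import re

# Hypothetical descriptor vocabulary; in practice derive it from the reviews.
DESCRIPTORS = ["crisp", "fruity", "hoppy", "malty", "robust", "balanced"]


def profile(reviews, descriptors=DESCRIPTORS):
    """Share of reviews mentioning each descriptor, as a list of floats."""
    if not reviews:
        return [0.0] * len(descriptors)
    token_sets = [set(re.findall(r"[a-z]+", review.lower())) for review in reviews]
    return [
        sum(descriptor in tokens for tokens in token_sets) / len(token_sets)
        for descriptor in descriptors
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, beers):
    """Return (name, similarity) of the beer closest to the target beer."""
    target_profile = profile(beers[target])
    candidates = {
        name: cosine(target_profile, profile(reviews))
        for name, reviews in beers.items()
        if name != target
    }
    return max(candidates.items(), key=lambda item: item[1])


# Hypothetical reviews grouped by beer.
beers = {
    "IPA A": ["Hoppy and fruity", "Very hoppy"],
    "IPA B": ["Hoppy citrus", "Fruity, hoppy"],
    "Stout C": ["Robust and malty", "Malty roast"],
}
name, similarity = most_similar("IPA A", beers)
print(name, round(similarity, 2))
```

Using shares rather than raw counts keeps a beer with many reviews from looking different only because it has more text.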