<a href="https://colab.research.google.com/github/kevinawongw/CSCI_4802_Wine/blob/main/CSCI_4802_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSCI 4022 - Final Project 
<b>Kevina Wong</b> - kevina.wong@colorado.edu

<b>Vienna Wong</b> vienna.wong@colraodo.edu

<b>Yubin (Sally) Go </b>

## Question of Interest:

Can we use algorithms to classify wine type based on flavor profile and descriptions?


## Data

The link to the dataset can be found [here](https://docs.google.com/spreadsheets/d/1DuHnmSNEWDIcu_QoBHBpybO-qS1FfiDxftMsY33kfUw/edit?usp=sharing
).


This dataset consists of roughly 151,000 entries of wines. The columns included are: country, description, designation, points, price, province	region_1, region_2, variety, winery.


## Organization of Data  

<ol>
<li> <b>Data Wrangline and Organization</b> </li>

We only intend on using the columns 'description' and 'variety'.

Descriptions include a description of the flavor profile for each wine. Viariety tells us what type of wine it is (Merlot, Pinot Noir, etc.). These two will be our only relevant columns since we are trying to classify wine variety using its written description.

All cells are converted to lowercase without punctuation. Lastly, we also created subsets of the dataframe grouped by wine variety. We have also eliminated common articles, prepositions, and conjunctions so our frequent items aren't skewed.

<li><b>Creating Item Baskets </b>

We created dictionaries to hold our items acting as an item basket.
The structure of the dictionary will look like like:

variety -> {[array of words in description]}

</ol>


In [154]:
# Imports & Packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import itertools
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [173]:
# Read DataFrame

df = pd.read_csv('wine.csv', on_bad_lines='skip')
df = df[["description", "variety"]]
df.tail()

  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,description,variety
286048,Many people feel Fiano represents southern Ita...,White Blend
286049,"Offers an intriguing nose with ginger, lime an...",Champagne Blend
286050,This classic example comes from a cru vineyard...,White Blend
286051,"A perfect salmon shade, with scents of peaches...",Champagne Blend
286052,More Pinot Grigios should taste like this. A r...,Pinot Grigio


In [None]:
# Clean Dataframe
df['variety'] = df['variety'].astype('str') 
df['description'] = df['description'].astype('str')
clean_df = df.dropna()
clean_df = clean_df.drop_duplicates().reset_index(drop=True)

clean_df.tail()

Unnamed: 0,description,variety
97867,"Smooth in the mouth, this Chard starts off wit...",Chardonnay
97868,"Seems mature and ready to drink now, with eart...",Chardonnay
97869,While we'd suggest waiting for the 2007 to hit...,Pinot Noir
97870,"Fresh and fragrant, with a jumble of cherry, p...",Nero d'Avola
97871,Signorello's proprietary Bordeaux blend is lar...,Cabernet Blend


In [None]:
unique_varieties = np.unique(df['variety'])

print("+=== Wine Varieties ===+")
print(clean_df['variety'].value_counts()[:15])

+=== Wine Varieties ===+
Pinot Noir                  9287
Chardonnay                  9164
Cabernet Sauvignon          8277
Red Blend                   6486
Bordeaux-style Red Blend    5173
Sauvignon Blanc             4035
Syrah                       3664
Riesling                    3582
Merlot                      3179
Zinfandel                   2408
Sangiovese                  2154
Malbec                      1968
Rosé                        1910
White Blend                 1859
Tempranillo                 1622
Name: variety, dtype: int64


In [None]:
# Remove Unwanted / Weird Genres
keep = ['Pinot Noir', 'Chardonnay', 'Cabernet Sauvignon', 'Red Blend', 'Bordeaux-style Red Blend', 'Sauvignon Blanc', 'Syrah', 'Riesling', 'Merlot', 'Zinfandelranillo', 'Sangiovese', 'Malbec', 'Rosé', 'White Blend', 'Temp'] 
clean_df = clean_df.loc[clean_df['variety'].isin(keep)].copy()

clean_df.tail()


Unnamed: 0,description,variety
97865,"Outside of the vineyard, wines like this are w...",Merlot
97866,"Heavy and basic, with melon and pineapple arom...",Sauvignon Blanc
97867,"Smooth in the mouth, this Chard starts off wit...",Chardonnay
97868,"Seems mature and ready to drink now, with eart...",Chardonnay
97869,While we'd suggest waiting for the 2007 to hit...,Pinot Noir


In [127]:
# Put wine varieties into their own dataframe

pinotNoir = clean_df.loc[(clean_df['variety'] == 'Pinot Noir')].copy().drop_duplicates().reset_index(drop=True)
chardonnay = clean_df.loc[(clean_df['variety'] == 'Chardonnay')].copy().drop_duplicates().reset_index(drop=True)
cabernetSauvignon = clean_df.loc[(clean_df['variety'] == 'Cabernet Sauvignon')].copy().drop_duplicates().reset_index(drop=True)
redBlend = clean_df.loc[(clean_df['variety'] == 'Red Blend')].copy().drop_duplicates().reset_index(drop=True)
bordeauxStyleRedBlend = clean_df.loc[(clean_df['variety'] == 'Bordeaux-style Red Blend')].copy().drop_duplicates().reset_index(drop=True)
sauvignonBlanc= clean_df.loc[(clean_df['variety'] == 'Sauvignon Blanc')].copy().drop_duplicates().reset_index(drop=True)
syrah= clean_df.loc[(clean_df['variety'] == 'Syrah')].copy().drop_duplicates().reset_index(drop=True)
riesling= clean_df.loc[(clean_df['variety'] == 'Riesling')].copy().drop_duplicates().reset_index(drop=True)
merlot= clean_df.loc[(clean_df['variety'] == 'Merlot')].copy().drop_duplicates().reset_index(drop=True)
zinfandel= clean_df.loc[(clean_df['variety'] == 'Zinfandel')].copy().drop_duplicates().reset_index(drop=True)
sangiovese = clean_df.loc[(clean_df['variety'] == "Sangiovese")].copy().drop_duplicates().reset_index(drop=True)
malbec= clean_df.loc[(clean_df['variety'] == 'Malbec')].copy().drop_duplicates().reset_index(drop=True)
rose = clean_df.loc[(clean_df['variety'] == 'Rosé')].copy().drop_duplicates().reset_index(drop=True)
whiteBlend= clean_df.loc[(clean_df['variety'] == 'White Blend')].copy().drop_duplicates().reset_index(drop=True)
tempranillo = clean_df.loc[(clean_df['variety'] == 'Tempranillo')].copy().drop_duplicates().reset_index(drop=True)

myDfs = [pinotNoir,chardonnay,cabernetSauvignon,redBlend,bordeauxStyleRedBlend,sauvignonBlanc,syrah,riesling,merlot,zinfandel,sangiovese,malbec,rose,whiteBlend,tempranillo]

myNames = ["Pinot Noir" , "Chardonnay","Cabernet Sauvignon" , "Red Blend", "Bordeaux-style Red Blend", "Sauvignon Blanc" , "Syrah" ,"Riesling", "Merlot" , "Zinfandel" , "Sangiovese" ,"Malbec","Rosé","White Blend","Tempranillo"]

In [151]:
removedWords = ["include", "remain", "remains", "scented", "scent", "depth", "powerful", "mix", "through", "fresh", "fully", "developing", "age", "nose",
                "both", "finishes", "drink", "ready", "acidity", "glass", "style", "suggestion", "suggestions", "stays", "offer", "layered", "layer",
                "together", "combination", "bite", "start", "give", "topped", "delivers", "brings", "structure","100%", "new", "accented", "light", "offers",
                "blend", "blended", "palate", "body", "paired", "carry", "carries", "taste", "tastes", "young", "old", "slightly", "good", "wines",
                "along", "note", "notes", "develop","wine", "hint", "flavors", "flavor", "accent", "accents", "but", "vineyard", "local", "texture", "textures",
                "create", "a", "an", "the", "my", "your", "his", "her", "its", "our", "their", "whose", "that", "this", "these", "those", "one", "first",
                "many", "few", "any", "is", "to", "by", "of", "on", "are", "in","for", "and", "as", "be","he","she","from","him","with","after","up","over",
                "now","when","at","who","if","they","them","but","yet","so","only","out", "have", "it", "not", "theyre", "into", "which", "of", "will","where", "best", 'better', 'because', 'than', 'most']

In [166]:
myPunc = []
for punc in string.punctuation:
  myPunc.append(punc)

In [175]:
# Helper to Clean Description

def cleanDesc(desc, removedWords, punc):
    desc.replace("'", "")
    desc = desc.lower()
    keepWords = []
    token = word_tokenize(desc)
    for t in token:
      # remove ' and numvers
        if (t not in removedWords) and (t not in myPunc):
          keepWords.append(t)

    return keepWords      

## Part 1: Item Baskets

In [176]:
# Initialize Basket Dictionaries 

pinotNoirBasket = {}
chardonnayBasket = {}
cabernetSauvignonBasket = {}
redBlendBasket = {}
bordeauxStyleRedBlendBasket = {}
sauvignonBlancBasket = {}
syrahBasket = {}
rieslingBasket = {}
merlotBasket= {}
zinfandelBasket = {}
sangioveseBasket = {}
malbecBasket = {}
roseBasket = {}
whiteBlendBasket= {}
tempranilloBasket= {}
 
myBaskets = [pinotNoirBasket,chardonnayBasket,cabernetSauvignonBasket,redBlendBasket,bordeauxStyleRedBlendBasket,sauvignonBlancBasket,syrahBasket,rieslingBasket,merlotBasket,zinfandelBasket,sangioveseBasket,malbecBasket,roseBasket,whiteBlendBasket,tempranilloBasket]


In [177]:
# Clean descriptions in the dataframe
for df, basket in zip(myDfs,myBaskets):
    for index,row in df.iterrows():
        temp = cleanDesc(row['description'], removedWords, myPunc)
        basket[row['variety']] = temp


In [178]:
for i in myBaskets:
  print(i)

{'Pinot Noir': ['while', 'we', "'d", 'suggest', 'waiting', '2007', 'hit', 'market', 'softer-styled', 'showing', 'mature', 'melon', 'tropical', 'fruit', 'bee', "'s", 'wax', 'aromas', 'front', 'mild', 'citrus', 'lemon', 'pineapple', 'has', 'medium-weight', 'almoance']}
{'Chardonnay': ['seems', 'mature', 'earth', 'mushroom', 'black', 'cherries', 'finishing', 'touch', 'anise', 'easy-drinking', 'pinot', "'s", 'heavy', 'or', 'extracted', 'displays', 'complexity', 'balear', 'followed', 'creamy', 'peach', 'apricot', 'yellow', 'rose', 'tonic', 'density', 'long', 'persistency', 'close']}
{'Cabernet Sauvignon': ['big', 'dark', 'smoked', 'meat', 'bacon', 'aromas', 'sitting', 'side', 'side', 'damp', 'soupy', 'has', 'strong', 'points', 'such', 'dark', 'plum', 'bacon', 'wideness', 'finish', 'strikes', 'against', 'extra', 'hard', 'tannins', 'result', 'scouring', 'lasting', 'feel', 'imported', 'ecosur', 'group', 'llc']}
{'Red Blend': ['half', 'merlot', 'rest', 'cabernet', 'sauvignon', 'cabernet', 'fran