# Backup for project 

reference 1: 
According to wine writer and competition organizer Dan Berger, one of the most important criteria in evaluating wine competitions is the quality of the judges. Before agreeing to enter, he advises, be sure to check the competition's website for a list of participating judges. "They should be skilled tasters--professionals, not wine collectors," he says.

"Both the high-priced panel and the low-priced panel are prejudiced against their own wines. If you know you're judging the low-priced wines, you say to yourself, 'Well, there can't be anything worth a gold in here.' If you're judging the \\$30-and-above wines, that panel's going to say, 'I wouldn't pay \\$30 for any of this crap.' So the end result is that you get a smaller percentage of gold medals for both high- and low-priced wines." (ref: https://winesvinesanalytics.com/columns/section/23/article/50637/Wine-Competitions-That-Help-You-Sell, access: May 27 2020)

reference 2: 
The first experiment took place in 2005. The last was in Sacramento earlier this month. Hodgson's findings have stunned the wine industry. Over the years he has shown again and again that even trained, professional palates are terrible at judging wine.

"Only about 10% of judges are consistent and those judges who were consistent one year were ordinary the next year," says Hodgson. "Chance has a great deal to do with the awards that wines win."

These judges are not amateurs either. They read like a who's who of the American wine industry from winemakers, sommeliers, critics and buyers to wine consultants and academics. In Hodgson's tests, judges rated wines on a scale running from 50 to 100. In practice, most wines scored in the 70s, 80s and low 90s.
(ref:https://www.theguardian.com/lifeandstyle/2013/jun/23/wine-tasting-junk-science-analysis, access: May 27 2020)

reference 3: 
how a wine competition works (ref: https://www.newyorkwines.org/awards-how-a-competition-works, access: May 27 2020)

reference 4: 
Top 9 winery management software packages. Only one mentions increasing sales, so this is a potentially untapped market, and there is no guarantee that any of them use statistics on known reviewers. (ref: https://www.predictiveanalyticstoday.com/top-winery-management-software/)

reference 5: 
Scales of wine reviews: 80 is a barely acceptable wine, 90 becomes the middle ground, 99 becomes the best mark, and 100 is theoretical.
https://www.delongwine.com/blogs/de-long-wine-moment/14610147-how-we-rate-wines-and-other-things
https://cdn.shopify.com/s/files/1/0527/6177/files/how_we_rate_wines.pdf?2489


# Quick Look at the wine_reviews_130k.csv 

Import the data and graph some basic aspects of the wine reviews to check whether: 
- there is enough data per wine taster (at least 50 reviews covering a variety of wines) 
- a few metrics are worth plotting per wine taster: 
    - number of whites/reds reviewed  
    - average score, with the highest and lowest marks
    - tags for the taste of the wine (clusters/counting?)
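
The checks in the list above can be sketched with a pandas `groupby`. This is a minimal illustration on an invented frame, assuming the columns `taster_name`, `points` and `variety` that appear in the review data; the threshold `min_reviews` and the toy rows are placeholders, not values from the project.

```python
import pandas as pd

# toy stand-in for the review table (values invented for illustration)
reviews = pd.DataFrame({
    "taster_name": ["A", "A", "A", "B", "B", None],
    "variety": ["Riesling", "Pinot Noir", "Riesling", "Merlot", "Merlot", "Syrah"],
    "points": [87, 90, 92, 85, 88, 91],
})

min_reviews = 3  # hypothetical threshold; the notes above suggest 50 for the real data

# one row per taster: review count, mean/lowest/highest score, distinct varieties
# (rows with a missing taster name are dropped by groupby by default)
summary = reviews.groupby("taster_name").agg(
    n_reviews=("points", "size"),
    mean_points=("points", "mean"),
    min_points=("points", "min"),
    max_points=("points", "max"),
    n_varieties=("variety", "nunique"),
)

# keep only tasters with enough reviews to say anything about them
enough = summary[summary["n_reviews"] >= min_reviews]
print(enough)
```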

In [1]:
# graph information
%matplotlib notebook
import matplotlib.pyplot as plt #matlab plots
import seaborn as sns 
sns.set_style('whitegrid') # style preference on graphs

#useful packages for math, statistics and dictionaries 
from scipy import stats 
import numpy as np  
import collections #ordered dictionary

#importing,cleaning and managing datasets 
import pandas as pd 
from pandas import Series,DataFrame

#machine learning packages
from sklearn import ensemble, tree, model_selection 
# ensemble = random forest
# tree = decision tree 

# for generating random seeds in the game 
from random import seed, random
# seed random number generator
seed(1)

from ansimarkup import parse, ansiprint # colour print statements
import time # adds delays for gamer to read script

In [2]:
# load data set 
#wine_150 = pd.read_csv("./data/wine_reviews_150k.csv")
wine_130 = pd.read_csv("../data/wine_reviews_130k.csv")

In [3]:
#wine_150 = wine_150.loc[:, ~wine_150.columns.str.contains('^Unnamed')]
wine_130 = wine_130.loc[:, ~wine_130.columns.str.contains('^Unnamed')]
#print(wine_150.head(),wine_150.info(), sep='\n') 
wine_130.info()  # info() prints its summary itself and returns None
print(wine_130.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB
    country                   
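
An `Unnamed: 0` column typically appears when a frame was saved to CSV together with its index. Besides filtering it out after loading, it can be absorbed while reading. The sketch below shows both routes on a tiny invented CSV held in memory; `csv_text` is illustrative and not part of the project data.

```python
import io
import pandas as pd

# tiny CSV written with its index, which produces an unnamed first column
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# option 1: treat the first column as the index while reading
by_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# option 2: read everything, then drop columns whose name starts with 'Unnamed'
raw = pd.read_csv(io.StringIO(csv_text))
dropped = raw.loc[:, ~raw.columns.str.contains("^Unnamed")]

print(list(by_index.columns))
print(list(dropped.columns))
```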

In [4]:
wine_130.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [76]:
taster_dic = {}

for x in range(len(wine_130)): 
    name = str(wine_130.loc[x, 'taster_name']).capitalize().strip()
    # missing names (NaN) become the string 'Nan' here and are counted as one group
    
    if name not in taster_dic: 
        taster_dic[name] = 0
    
    taster_dic[name] += 1

#print(taster_dic)

x = np.array(range(len(taster_dic)))
print(x)
y = list(taster_dic.values())
y_name = list(taster_dic.keys())
print(y)
width = 0.1
fig = plt.figure(figsize=(10,10))
ax= fig.add_subplot(211)
plt.bar(x, y) #taster_dic.values()
plt.yscale('log')
plt.title('Number of Reviews per Reviewer')
plt.xlabel('Wine Tasters')
plt.ylabel('Number of Reviews')
plt.xticks(x-0.5,y_name, rotation=60)

#ax= fig.add_subplot(212)
#sns.distplot(y, bins=18)

plt.show()

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[10776, 25514, 9532, 415, 15134, 4415, 9537, 6332, 26244, 4966, 4177, 5147, 3685, 1835, 514, 491, 1085, 139, 27, 6]
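
The same per-taster counts can be obtained without an explicit loop. The sketch below uses `Series.value_counts` on a small invented Series; `dropna=False` keeps the reviews with no taster name as their own group, which the loop above counts under the key `'Nan'`.

```python
import pandas as pd

# invented sample; the real column has many missing taster names
names = pd.Series(["Roger Voss", "Paul Gregutt", None, "Roger Voss", None, "Roger Voss"])

# count reviews per taster, keeping missing names as their own group
counts = names.value_counts(dropna=False)
print(counts)

# number of reviews with no named taster
n_missing = int(names.isna().sum())
print(n_missing)
```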



In [8]:

print(wine_130.iloc[0:6] )
#wine_130['taster_name'] = wine_130['taster_name'].astype(str)
#print(wine_130[wine_130["taster_name"].str.match('Anna lee')])
#print(wine_130[wine_130["taster_name"].str.match('Anne krebiehl')])

    country                                        description  \
0     Italy  Aromas include tropical fruit, broom, brimston...   
1  Portugal  This is ripe and fruity, a wine that is smooth...   
2        US  Tart and snappy, the flavors of lime flesh and...   
3        US  Pineapple rind, lemon pith and orange blossom ...   
4        US  Much like the regular bottling from 2012, this...   
5     Spain  Blackberry and raspberry aromas show a typical...   

                          designation  points  price           province  \
0                        Vulkà Bianco      87    NaN  Sicily & Sardinia   
1                            Avidagos      87   15.0              Douro   
2                                 NaN      87   14.0             Oregon   
3                Reserve Late Harvest      87   13.0           Michigan   
4  Vintner's Reserve Wild Child Block      87   65.0             Oregon   
5                        Ars In Vitro      87   15.0     Northern Spain   

           

In [44]:
cols = [0,1,3,8]
#df = df[df.columns[cols]]
print(wine_130[wine_130.columns[cols]])

         country                                        description  points  \
0          Italy  Aromas include tropical fruit, broom, brimston...      87   
1       Portugal  This is ripe and fruity, a wine that is smooth...      87   
2             US  Tart and snappy, the flavors of lime flesh and...      87   
3             US  Pineapple rind, lemon pith and orange blossom ...      87   
4             US  Much like the regular bottling from 2012, this...      87   
...          ...                                                ...     ...   
129966   Germany  Notes of honeysuckle and cantaloupe sweeten th...      90   
129967        US  Citation is given as much as a decade of bottl...      90   
129968    France  Well-drained gravel soil gives this wine its c...      90   
129969    France  A dry style of Pinot Gris, this is crisp with ...      90   
129970    France  Big, rich and off-dry, this is powered by inte...      90   

               taster_name  
0            Kerin O’K

In [10]:
print(wine_130.iloc[1120,1])


Chardonay and Pinot Noir are both present in this apple and pear flavored wine. A lively mousse brings out the fruit and acidity, giving a tangy, bright wine that's ready to drink now.


In [11]:
print(min(wine_130['points']))
print(max(wine_130['points']))
counts_89 = wine_130["points"][wine_130["points"] == 89].value_counts()
print(counts_89)
fig = plt.figure(figsize=(8,8))
ax= fig.add_subplot(111)

plt.hist(wine_130['points'], bins=20)
plt.title('Number of Occurrences of Each Score')
plt.xlabel('Score')
plt.ylabel('Number of Occurrences')

plt.show()

80
100
89    12226
Name: points, dtype: int64
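
Because the scores are integers from 80 to 100, there are 21 possible values, so 20 equal-width bins over that range put 99 and 100 into the same final bar. One way around this is to centre one bin on each integer. A sketch with NumPy on invented scores (the `scores` array is illustrative only):

```python
import numpy as np

# made-up integer scores in the 80-100 range
scores = np.array([80, 85, 87, 87, 89, 89, 89, 99, 100, 100])

# edges at 79.5, 80.5, ..., 100.5 give one bin per integer score
edges = np.arange(79.5, 101.5, 1.0)
counts, _ = np.histogram(scores, bins=edges)

# map each score to its count
per_score = dict(zip(range(80, 101), counts.tolist()))
print(per_score[89], per_score[99], per_score[100])
```

The same `edges` array can be passed as `bins=` to `plt.hist`.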



In [4]:
# first run at NLP with NLTK 
import os 
import nltk, re, pprint 
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import RegexpTokenizer
import nltk.corpus
#nltk.download('punkt')
#nltk.download('stopwords')

first_10 = wine_130[wine_130['taster_name'] == 'Kerin O’Keefe']
#print(first_10)

usr_defined_stop = ['.', ',',"'s", 'is', "n't", '%', 'aromas', 'include', 'wine', 'opens',\
                    'carry', 'note','offers','alongside', 'drink', 'hint', 'dried','delivers','finish','lead', \
                    'firm', 'nose','palate', 'made', 'glass', 'along']
stop = nltk.corpus.stopwords.words('english')
#i = nltk.corpus.stopwords.words('english')
stopwords= set(stop).union(usr_defined_stop)
#print(stopwords)

tokens = first_10['description'].str.lower().apply(word_tokenize) # word tokenization of each description

def stopword_clean(token_pd): 
    # build a new list per review instead of removing items while iterating,
    # which would skip the token that follows each removed one
    return token_pd.apply(lambda row: [y for y in row if y not in stopwords])
tokens = stopword_clean(tokens)
            
print('done')

done


In [5]:
tokens = stopword_clean(tokens)
tokens = stopword_clean(tokens)
tokens = stopword_clean(tokens)
tokens = stopword_clean(tokens)
#print(tokens)
print('the' in stopwords)
print(tokens[0:50])

True
0      [tropical, fruit, broom, brimstone, herb, over...
6      [bright, informal, red, candied, berry, white,...
13     [dominated, oak, oak-driven, roasted, coffee, ...
22     [delicate, recall, white, flower, citrus, pass...
24     [prune, blackcurrant, toast, oak, extracted, f...
26     [pretty, yellow, flower, stone, fruit, bright,...
27     [recall, ripe, dark, berry, toast, whiff, cake...
28     [suggest, mature, berry, scorched, earth, anim...
61     [densely, hued, black, plum, vanilla, simple, ...
72     [black-skinned, fruit, leather, underbrush, ga...
88     [subdued, french, oak, toast, acacia, waft, ra...
89     [primarily, sangiovese, malvasia, white, grape...
98     [forest, floor, menthol, espresso, cranberry, ...
104    [65, sangiovese, 20, merlot, 15, cabernet, sau...
105    [predominantly, trebbiano, malvasia, pinot, bi...
106    [blend, cabernet, sauvignon, merlot, cabernet,...
107    [yellow, stone, fruit, white, spring, flower, ...
109    [easy-drinking, ble
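
Removing items from a list while looping over that same list skips the element after each removal, so consecutive stop words can survive a cleaning pass. A small sketch of the effect, next to a filter that builds a new list instead (the word list and `stop_set` are invented for the example):

```python
stop_set = {"the", "is", "a"}
words = ["the", "is", "a", "bright", "wine"]

# in-place removal while iterating: the iterator skips the item after each removal
in_place = list(words)
for w in in_place:
    if w in stop_set:
        in_place.remove(w)

# building a new list checks every item exactly once
filtered = [w for w in words if w not in stop_set]

print(in_place)
print(filtered)
```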

In [12]:
#nltk.download('brown')
brown = nltk.corpus.brown.words()
print(brown)
'note' in brown

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]


True

In [20]:
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /Users/risa/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [6]:
full_text = []
test = []
token_list = tokens.tolist()

# flatten the per-review token lists into one list of words
for row in token_list: 
    full_text.extend(row)

        
from nltk.stem import WordNetLemmatizer
lemmzr = WordNetLemmatizer() 

processed_words = []
for word in full_text:
    processed_words.append(lemmzr.lemmatize(word))
#print(processed_words)

freq = nltk.FreqDist(processed_words)
print(len(freq))
print(len(test))
print(freq.most_common(40))

3860
0
[('tannin', 6413), ('cherry', 5908), ('black', 5055), ('acidity', 3913), ('fruit', 3465), ('berry', 3390), ('white', 3283), ('spice', 3150), ('red', 3014), ('ripe', 2856), ('pepper', 2493), ('herb', 2450), ('flower', 2285), ('whiff', 2008), ('juicy', 1982), ('licorice', 1956), ('plum', 1854), ('apple', 1813), ('flavor', 1757), ('bright', 1744), ('raspberry', 1729), ('oak', 1702), ('clove', 1664), ('fresh', 1656), ('leather', 1563), ('note', 1416), ('espresso', 1329), ('wild', 1322), ('mature', 1313), ('vanilla', 1307), ('crushed', 1303), ('citrus', 1281), ('mineral', 1275), ('peach', 1268), ('yellow', 1230), ('tobacco', 1208), ('underbrush', 1194), ('blackberry', 1166), ('dark', 1107), ('anise', 1065)]
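
For reference, the flatten-and-count step can also be written with the standard library alone. `nltk.FreqDist` is a subclass of `collections.Counter`, so `most_common` behaves the same way. The token lists below are invented, and lemmatization is left out to keep the sketch free of downloads.

```python
from collections import Counter
from itertools import chain

# invented token lists, one per review
token_lists = [
    ["cherry", "tannin", "spice"],
    ["tannin", "oak"],
    ["cherry", "tannin"],
]

# flatten all reviews into one stream of words and count them
word_counts = Counter(chain.from_iterable(token_lists))

print(len(word_counts))            # number of distinct words
print(word_counts.most_common(2))  # most frequent words first
```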


In [37]:
#test_freq = nltk.FreqDist(test)
#print(len(test_freq))

fig = plt.figure(figsize=(8,8))
ax= fig.add_subplot(111)
freq.plot(40,cumulative=False)

#ax=fig.add_subplot(122)
#test_freq.plot(40,cumulative=False)

#print(test_freq==freq)
#print(full_text==test)


<matplotlib.axes._subplots.AxesSubplot at 0x1a2a8eb750>

In [7]:
first_10 = wine_130[wine_130['taster_name'] == 'Kerin O’Keefe']

test_tag_words = freq.most_common(30)

#test_tag_words = ['fruit', 'oak', 'dry', 'sweet', 'light', 'full bodied', 'toasted', 'red', 'rose', 'white',\
 #                 'sparkling', 'herb', 'tannin', 'berry', 'acidity', 'citrus', 'pepper' ]
first_10 = first_10.drop(['taster_twitter_handle', 'title', 'region_2'], axis=1)

for x in test_tag_words: 
    first_10[x[0]] = 0 
    
print(first_10.iloc[0:6, 10:26])

    tannin  cherry  black  acidity  fruit  berry  white  spice  red  ripe  \
0        0       0      0        0      0      0      0      0    0     0   
6        0       0      0        0      0      0      0      0    0     0   
13       0       0      0        0      0      0      0      0    0     0   
22       0       0      0        0      0      0      0      0    0     0   
24       0       0      0        0      0      0      0      0    0     0   
26       0       0      0        0      0      0      0      0    0     0   

    pepper  herb  flower  whiff  juicy  licorice  
0        0     0       0      0      0         0  
6        0     0       0      0      0         0  
13       0     0       0      0      0         0  
22       0     0       0      0      0         0  
24       0     0       0      0      0         0  
26       0     0       0      0      0         0  


In [8]:
first_10.iloc[0:6, 10:26]

Unnamed: 0,tannin,cherry,black,acidity,fruit,berry,white,spice,red,ripe,pepper,herb,flower,whiff,juicy,licorice
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
#print(type(tokens))
#print(len(tokens[0]))
#print(tokens[0],tokens[0][5])
#print(test_tag_words[0:5])
clean_numbers = []
for y in test_tag_words:
    clean_numbers.append(y[0])
test_tag_words = clean_numbers
'tannin' in test_tag_words

True

In [14]:
print(test_tag_words)
# keep only the word from each (word, count) pair; plain strings pass through unchanged
new_tag_words = [x[0] if isinstance(x, tuple) else x for x in test_tag_words]
print(new_tag_words)    

for index, row in tokens.items():  # Series.iteritems() was removed in pandas 2.0
    for y in row: 
        if y in new_tag_words: 
            print('in:', y, index)
            first_10.at[index, y] = 1
            
print(first_10.head())

[('tannin', 6413), ('cherry', 5908), ('black', 5055), ('acidity', 3913), ('fruit', 3465), ('berry', 3390), ('white', 3283), ('spice', 3150), ('red', 3014), ('ripe', 2856), ('pepper', 2493), ('herb', 2450), ('flower', 2285), ('whiff', 2008), ('juicy', 1982), ('licorice', 1956), ('plum', 1854), ('apple', 1813), ('flavor', 1757), ('bright', 1744), ('raspberry', 1729), ('oak', 1702), ('clove', 1664), ('fresh', 1656), ('leather', 1563), ('note', 1416), ('espresso', 1329), ('wild', 1322), ('mature', 1313), ('vanilla', 1307)]


NameError: name 'new_tag_word' is not defined

In [10]:
first_10.iloc[0:6, 10:26]

Unnamed: 0,tannin,cherry,black,acidity,fruit,berry,white,spice,red,ripe,pepper,herb,flower,whiff,juicy,licorice
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
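
The indicator columns are meant to hold 1 where a tag word occurs in a review. A compact way to build them, shown here on invented data, is one comprehension per tag; `tags` and `token_series` are hypothetical stand-ins for the tag list and the token Series used above.

```python
import pandas as pd

# invented tokens, indexed like the filtered review frame (non-contiguous index)
token_series = pd.Series(
    [["cherry", "tannin"], ["oak"], ["tannin", "spice"]],
    index=[0, 6, 13],
)
tags = ["tannin", "cherry", "oak"]  # hypothetical tag words

# one 0/1 column per tag, aligned on the original index
indicators = pd.DataFrame(
    {tag: [int(tag in row) for row in token_series] for tag in tags},
    index=token_series.index,
)
print(indicators)
```

Since the result shares its index with the tokens, it can be attached to the review frame with `DataFrame.join`.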


In [25]:
print(first_10.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10776 entries, 0 to 129962
Data columns (total 40 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      10776 non-null  object 
 1   description  10776 non-null  object 
 2   designation  7549 non-null   object 
 3   points       10776 non-null  int64  
 4   price        9874 non-null   float64
 5   province     10776 non-null  object 
 6   region_1     10749 non-null  object 
 7   taster_name  10776 non-null  object 
 8   variety      10776 non-null  object 
 9   winery       10776 non-null  object 
 10  tannin       10776 non-null  int64  
 11  cherry       10776 non-null  int64  
 12  black        10776 non-null  int64  
 13  acidity      10776 non-null  int64  
 14  fruit        10776 non-null  int64  
 15  berry        10776 non-null  int64  
 16  white        10776 non-null  int64  
 17  spice        10776 non-null  int64  
 18  red          10776 non-null  int64  
 19  rip

In [172]:
def fill_dic(dic, pd_data, column): 
    # count how often each value of `column` occurs, normalising case and whitespace
    for index,row in pd_data.iterrows(): 
        cell = str(row[column]).capitalize().strip()
        if cell not in dic: 
            dic[cell] = 0 
        dic[cell] += 1
    
    return dic

region_dic2 = {}

region_dic2 = fill_dic(region_dic2, first_10, 'region_1')

print(region_dic2)

{'Etna': 200, 'Vittoria': 19, 'Sicilia': 281, 'Terre siciliane': 213, 'Cerasuolo di vittoria': 9, 'Romagna': 63, 'Aglianico del vulture': 54, 'Vernaccia di san gimignano': 104, 'Toscana': 544, 'Morellino di scansano': 140, 'Chianti classico': 768, 'Brunello di montalcino': 783, 'Alto adige': 408, 'Sagrantino di montefalco': 21, 'Barolo': 1127, 'Gavi': 28, 'Franciacorta': 174, 'Vino nobile di montepulciano': 212, 'Amarone della valpolicella': 114, 'Alto adige valle isarco': 40, "Dolcetto d'alba": 33, 'Collio': 282, 'Dogliani superiore': 6, "Barbera d'alba": 80, 'Cortona': 32, 'Piemonte': 7, 'Colli della toscana centrale': 10, 'Barbaresco': 559, 'Verdicchio dei castelli di jesi classico': 48, 'Rosso di montalcino': 233, "Nebbiolo d'alba": 33, 'Sannio': 8, 'Coste della sesia': 13, 'Greco di tufo': 71, 'Langhe': 61, 'Rosso di montepulciano': 105, 'Basilicata': 7, 'Campania': 33, 'Nan': 27, 'Trentino': 53, "Valle d'aosta": 5, 'Alto adige terlano': 10, 'Amarone della valpolicella classico': 

In [175]:
region_dic = {}

for index,row in first_10.iterrows():
    #print(index, row)
    region = str(first_10.at[index, 'region_1']).capitalize().strip()
    
    if region not in region_dic: 
        region_dic[region] = 0
    
    region_dic[region] += 1
    
    
if region_dic == region_dic2: 
    print('yes')