## Data Sampling and Initial Impressions
This notebook merges and samples our .csv files. They are pretty large, and we don't want to keep loading large files into memory while we iterate. So I'm going to start by making a couple of sample files to use while we test our code. Once our model is ready to go, we'll run it on the full files.

I decided to profile the data at load time to get a sense of what I'm dealing with.

In [36]:
import os
import random
import pandas as pd
import pandas_profiling

In [37]:
os.chdir('/Users/patrick/Documents/portfolio/Wine Classification/data')
os.listdir() #let's see what we're working with

['.DS_Store',
 'wine_sample_7k.csv',
 'wine_sample.csv',
 'winemag-data-130k-v2.csv',
 'winemag-data-130k-v2.json',
 'winemag-data_first150k.csv']

In [38]:
#Sample 5% of our dataset, or a little under 7,000 rows in this case.
#Set sample_size directly to draw a fixed number of rows instead of a percentage.
file_nm = 'winemag-data-130k-v2.csv'
with open(file_nm) as f: #close the file once the lines are counted
    num_lines = sum(1 for l in f) #line count includes the header row
sample_frac = .05
sample_size = int(num_lines * sample_frac) #truncate down, to be safe
n = int(num_lines / sample_size) #roughly one row kept out of every n
n #sanity check this number

20

In [39]:
#I don't know whether the data is ordered, so pick the rows to skip at random.
#The range starts at 1, so line 0 (the header) is never skipped.
skip_index = random.sample(range(1, num_lines), num_lines - sample_size)
data = pd.read_csv(file_nm, skiprows=skip_index)
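
The sampling step above can be wrapped in a small helper so the draw is reproducible and can take either a fraction or a fixed row count. This is only a sketch: `sample_csv` and its parameters are hypothetical names, not part of the original notebook, and it assumes one record per physical line (no line breaks inside quoted fields).

```python
import random

import pandas as pd


def sample_csv(path, frac=None, n=None, seed=None):
    """Read a random subset of the data rows of a CSV file.

    Hypothetical helper: pass exactly one of `frac` (share of rows)
    or `n` (exact number of rows). Assumes one record per physical line.
    """
    if (frac is None) == (n is None):
        raise ValueError("pass exactly one of frac or n")

    # Count the data rows; the first line is the header.
    with open(path, encoding="utf-8") as fh:
        num_rows = sum(1 for _ in fh) - 1

    if n is None:
        n = int(num_rows * frac)  # truncate down
    if not 0 <= n <= num_rows:
        raise ValueError("sample size must be between 0 and the row count")

    # Pick the line numbers to keep, then skip every other data line.
    rng = random.Random(seed)  # a fixed seed gives a repeatable sample
    keep = set(rng.sample(range(1, num_rows + 1), n))
    skip = [i for i in range(1, num_rows + 1) if i not in keep]
    return pd.read_csv(path, skiprows=skip)
```

With this sketch, a call such as `sample_csv('winemag-data-130k-v2.csv', frac=0.05, seed=42)` would return the same 5% sample on every run, and `n=7000` would request a fixed number of rows instead.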

In [40]:
data.to_csv('wine_sample.csv')

In [41]:
#Just one column name to fix.
data = data.rename(columns={'Unnamed: 0': 'original_index'})

In [42]:
#let's take a peek
data.head(5)

| | original_index | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | US | Oak and earth intermingle around robust aromas... | King Ridge Vineyard | 87 | 69.0 | California | Sonoma Coast | Sonoma | Virginie Boone | @vboone | Castello di Amorosa 2011 King Ridge Vineyard P... | Pinot Noir | Castello di Amorosa |
| 1 | 26 | Italy | Pretty aromas of yellow flower and stone fruit... | Dalila | 87 | 13.0 | Sicily & Sardinia | Terre Siciliane | | Kerin O’Keefe | @kerinokeefe | Stemmari 2013 Dalila White (Terre Siciliane) | White Blend | Stemmari |
| 2 | 37 | Italy | This concentrated Cabernet offers aromas of cu... | Missoni | 86 | 21.0 | Sicily & Sardinia | Sicilia | | | | Feudi del Pisciotto 2010 Missoni Cabernet Sauv... | Cabernet Sauvignon | Feudi del Pisciotto |
| 3 | 39 | Italy | Part of the natural wine movement, this wine i... | Purato Made With Organic Grapes | 86 | 12.0 | Sicily & Sardinia | Sicilia | | | | Feudo di Santa Tresa 2011 Purato Made With Org... | Nero d'Avola | Feudo di Santa Tresa |
| 4 | 47 | US | This is a sweet wine with flavors of white sug... | | 86 | 13.0 | California | Lake County | | | | The White Knight 2011 Riesling (Lake County) | Riesling | The White Knight |


In [43]:
pandas_profiling.ProfileReport(data)

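Profile report overview:

| Dataset info | Value |
|---|---|
| Number of variables | 14 |
| Number of observations | 6497 |
| Total Missing (%) | 11.3% |
| Total size in memory | 710.7 KiB |
| Average record size in memory | 112.0 B |

| Variable type | Count |
|---|---|
| Numeric | 3 |
| Categorical | 11 |
| Boolean | 0 |
| Date | 0 |
| Text (Unique) | 0 |
| Rejected | 0 |
| Unsupported | 0 |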

Variable: country (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 37 |
| Unique (%) | 0.6% |
| Missing (%) | 0.0% |
| Missing (n) | 1 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| US | 2733 | 42.1% |
| France | 1092 | 16.8% |
| Italy | 1010 | 15.5% |
| Spain | 312 | 4.8% |
| Portugal | 285 | 4.4% |
| Chile | 235 | 3.6% |
| Argentina | 181 | 2.8% |
| Austria | 158 | 2.4% |
| Australia | 109 | 1.7% |
| Germany | 101 | 1.6% |

Variable: description (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 6477 |
| Unique (%) | 99.7% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| A very good value-priced Rioja with rubber and hickory on the bouquet along with light touches of cassis and red plum. Feels smart, fresh and lively, with a good blend of tomato, herb and red berry flavors. Finishes spicy and crisp, with an herbal, peppery accent. | 2 | 0.0% |
| This is a rich, spicy and lush-fruited rosé. Its palate-pleasing tangle of citrus, peach and light tropical fruit is fresh, forward and inviting. It spent five months in neutral oak and is ready for immediate enjoyment. | 2 | 0.0% |
| The fantastic nose pitches French toast together with Meyer lemon pith, crushed rock, struck match and a slight sour-cream element. This wine is immediately lively on the tongue, with lots of but not overdone acidity. Yeasty toast flavors are cut by lemon and poached pear on a racy and focused palate. | 2 | 0.0% |
| Merlot (59%) takes the lead on this blend, which is balanced out by Cabernet Franc. Intriguing aromas of herbs, flowers, spice raspberries, café au lait and barrel spices lead to silky, polished fruit and barrel flavors that linger. | 2 | 0.0% |
| A rather ordinary Merlot, dry and rugged, with jammy-fruity flavors of cherries, herbs and oak. Blended with Beckstoffer Cabernet Sauvignon, which gives it needed body and power. | 2 | 0.0% |
| Gritty aromas of asphalt, green peppercorn, saucy berry and hard spices precede a tight palate with grating tannins. This is oak-heavy, with sweet-and-saucy red-berry flavors. An oaky, tomatoey finish is rubbery in feel. | 2 | 0.0% |
| This unoaked Chardonnay is so ripe and rich, you won't miss the buttered-toast flavor of oak. It's incredibly potent with limes, mangoes, oranges and pineapples, uplifted with brisk acidity. A weekend brunch of scrambled eggs and smoked salmon will be the perfect occasion. | 2 | 0.0% |
| Rich in black cherry, licorice, mocha and oak flavors, this Pinot is showing its best now. Give it a few hours of decanting and pair with steak, lamb and salmon. | 2 | 0.0% |
| Complexity and varietal character come through in this concentrated, well-balanced and smoothly polished wine. It is full bodied and lightly tannic, but so oozing with flavors that it's still easy to sip. The array of black plum, light sage, and black olive flavors is accented with very light oak. | 2 | 0.0% |
| Despite it's modest reported alcohol level, this is a plush, rounded example of Chardonnay, loaded with fruit. A whiff of woodsmoke frames pineapple and mango, adding complexity while still allowing the tropical notes to shine through. Drink now–2017. | 2 | 0.0% |

Variable: designation (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 3685 |
| Unique (%) | 56.7% |
| Missing (%) | 28.1% |
| Missing (n) | 1824 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Reserve | 107 | 1.6% |
| Estate | 70 | 1.1% |
| Reserva | 62 | 1.0% |
| Riserva | 40 | 0.6% |
| Estate Grown | 31 | 0.5% |
| Dry | 26 | 0.4% |
| Estate Bottled | 23 | 0.4% |
| Brut | 23 | 0.4% |
| Barrel sample | 22 | 0.3% |
| Extra Dry | 16 | 0.2% |

Variable: original_index (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 6497 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 64545 |
| Minimum | 25 |
| 5-th percentile | 6548 |
| Q1 | 32060 |
| Median | 63905 |
| Q3 | 96929 |
| 95-th percentile | 123380 |
| Maximum | 129959 |
| Range | 129934 |
| Interquartile range | 64869 |
| Standard deviation | 37457 |
| Coef of variation | 0.58033 |
| Kurtosis | -1.198 |
| MAD | 32431 |
| Skewness | 0.020281 |
| Sum | 419347672 |
| Variance | 1403000000 |
| Memory size | 50.8 KiB |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| 27662 | 1 | 0.0% |
| 118139 | 1 | 0.0% |
| 83306 | 1 | 0.0% |
| 52587 | 1 | 0.0% |
| 126319 | 1 | 0.0% |
| 21872 | 1 | 0.0% |
| 73073 | 1 | 0.0% |
| 49960 | 1 | 0.0% |
| 13684 | 1 | 0.0% |
| 91510 | 1 | 0.0% |

Minimum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 25 | 1 | 0.0% |
| 26 | 1 | 0.0% |
| 37 | 1 | 0.0% |
| 39 | 1 | 0.0% |
| 47 | 1 | 0.0% |

Maximum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 129723 | 1 | 0.0% |
| 129807 | 1 | 0.0% |
| 129808 | 1 | 0.0% |
| 129843 | 1 | 0.0% |
| 129959 | 1 | 0.0% |

Variable: points (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 21 |
| Unique (%) | 0.3% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 88.428 |
| Minimum | 80 |
| 5-th percentile | 84 |
| Q1 | 86 |
| Median | 88 |
| Q3 | 91 |
| 95-th percentile | 93 |
| Maximum | 100 |
| Range | 20 |
| Interquartile range | 5 |
| Standard deviation | 3.0273 |
| Coef of variation | 0.034235 |
| Kurtosis | -0.26714 |
| MAD | 2.47 |
| Skewness | 0.11437 |
| Sum | 574517 |
| Variance | 9.1645 |
| Memory size | 50.8 KiB |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| 88 | 902 | 13.9% |
| 87 | 842 | 13.0% |
| 90 | 751 | 11.6% |
| 86 | 673 | 10.4% |
| 89 | 599 | 9.2% |
| 91 | 526 | 8.1% |
| 85 | 492 | 7.6% |
| 92 | 478 | 7.4% |
| 84 | 325 | 5.0% |
| 93 | 325 | 5.0% |

Minimum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 80 | 17 | 0.3% |
| 81 | 27 | 0.4% |
| 82 | 94 | 1.4% |
| 83 | 134 | 2.1% |
| 84 | 325 | 5.0% |

Maximum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 96 | 26 | 0.4% |
| 97 | 12 | 0.2% |
| 98 | 6 | 0.1% |
| 99 | 2 | 0.0% |
| 100 | 1 | 0.0% |

Variable: price (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 162 |
| Unique (%) | 2.5% |
| Missing (%) | 7.3% |
| Missing (n) | 473 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 34.97 |
| Minimum | 5 |
| 5-th percentile | 11 |
| Q1 | 17 |
| Median | 25 |
| Q3 | 42 |
| 95-th percentile | 85 |
| Maximum | 660 |
| Range | 655 |
| Interquartile range | 25 |
| Standard deviation | 33.584 |
| Coef of variation | 0.96037 |
| Kurtosis | 66.091 |
| MAD | 19.506 |
| Skewness | 6.011 |
| Sum | 210660 |
| Variance | 1127.9 |
| Memory size | 50.8 KiB |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| 20.0 | 342 | 5.3% |
| 15.0 | 307 | 4.7% |
| 18.0 | 270 | 4.2% |
| 30.0 | 268 | 4.1% |
| 25.0 | 259 | 4.0% |
| 40.0 | 221 | 3.4% |
| 12.0 | 217 | 3.3% |
| 16.0 | 190 | 2.9% |
| 35.0 | 186 | 2.9% |
| 13.0 | 175 | 2.7% |

Minimum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 5.0 | 1 | 0.0% |
| 6.0 | 6 | 0.1% |
| 7.0 | 15 | 0.2% |
| 8.0 | 53 | 0.8% |
| 9.0 | 54 | 0.8% |

Maximum 5 values:

| Value | Count | Frequency (%) |
|---|---|---|
| 440.0 | 2 | 0.0% |
| 495.0 | 1 | 0.0% |
| 525.0 | 1 | 0.0% |
| 550.0 | 1 | 0.0% |
| 660.0 | 1 | 0.0% |

Variable: province (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 223 |
| Unique (%) | 3.4% |
| Missing (%) | 0.0% |
| Missing (n) | 1 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| California | 1840 | 28.3% |
| Washington | 416 | 6.4% |
| Tuscany | 306 | 4.7% |
| Bordeaux | 282 | 4.3% |
| Oregon | 257 | 4.0% |
| Northern Spain | 205 | 3.2% |
| Piedmont | 185 | 2.8% |
| Burgundy | 179 | 2.8% |
| Mendoza Province | 157 | 2.4% |
| Veneto | 150 | 2.3% |

Variable: region_1 (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 639 |
| Unique (%) | 9.8% |
| Missing (%) | 16.5% |
| Missing (n) | 1070 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Napa Valley | 232 | 3.6% |
| Columbia Valley (WA) | 197 | 3.0% |
| Russian River Valley | 163 | 2.5% |
| California | 137 | 2.1% |
| Willamette Valley | 118 | 1.8% |
| Mendoza | 117 | 1.8% |
| Paso Robles | 116 | 1.8% |
| Alsace | 113 | 1.7% |
| Champagne | 97 | 1.5% |
| Rioja | 91 | 1.4% |

Variable: region_2 (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 18 |
| Unique (%) | 0.3% |
| Missing (%) | 61.2% |
| Missing (n) | 3976 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Central Coast | 573 | 8.8% |
| Sonoma | 451 | 6.9% |
| Columbia Valley | 390 | 6.0% |
| Napa | 342 | 5.3% |
| Willamette Valley | 171 | 2.6% |
| California Other | 137 | 2.1% |
| Finger Lakes | 87 | 1.3% |
| Central Valley | 58 | 0.9% |
| Sierra Foothills | 55 | 0.8% |
| Napa-Sonoma | 53 | 0.8% |

Variable: taster_name (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 18 |
| Unique (%) | 0.3% |
| Missing (%) | 20.5% |
| Missing (n) | 1333 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Roger Voss | 1266 | 19.5% |
| Michael Schachner | 736 | 11.3% |
| Kerin O’Keefe | 552 | 8.5% |
| Virginie Boone | 475 | 7.3% |
| Paul Gregutt | 461 | 7.1% |
| Matt Kettmann | 339 | 5.2% |
| Sean P. Sullivan | 240 | 3.7% |
| Joe Czerwinski | 237 | 3.6% |
| Jim Gordon | 215 | 3.3% |
| Anna Lee C. Iijima | 214 | 3.3% |

Variable: taster_twitter_handle (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 14 |
| Unique (%) | 0.2% |
| Missing (%) | 24.3% |
| Missing (n) | 1577 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| @vossroger | 1266 | 19.5% |
| @wineschach | 736 | 11.3% |
| @kerinokeefe | 552 | 8.5% |
| @vboone | 475 | 7.3% |
| @paulgwine | 461 | 7.1% |
| @mattkettmann | 339 | 5.2% |
| @wawinereport | 240 | 3.7% |
| @JoeCz | 237 | 3.6% |
| @gordone_cellars | 215 | 3.3% |
| @AnneInVino | 186 | 2.9% |

Variable: title (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 6472 |
| Unique (%) | 99.6% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Covila 2008 II Gran Reserva (Rioja) | 2 | 0.0% |
| Testarossa 2013 La Rinconada Vineyard Chardonnay (Sta. Rita Hills) | 2 | 0.0% |
| Mazzei 2013 Belguardo Red (Toscana) | 2 | 0.0% |
| Le Manzane NV Extra Dry (Conegliano Valdobbiadene Prosecco Superiore) | 2 | 0.0% |
| Marilyn 2008 Merlot (Napa Valley) | 2 | 0.0% |
| Giant Steps 2011 Tarraford Vineyard Chardonnay (Yarra Valley) | 2 | 0.0% |
| Pfendler 2012 Pinot Noir (Sonoma Coast) | 2 | 0.0% |
| Vignerons de Bel Air 2014 Eté Fleuri (Chiroubles) | 2 | 0.0% |
| Easton 2013 Monarch Mine Vineyard Sauvignon Blanc (Sierra Foothills) | 2 | 0.0% |
| Viña Tarapacá 2011 Gran Reserva Cabernet Sauvignon (Maipo Valley) | 2 | 0.0% |

Variable: variety (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 306 |
| Unique (%) | 4.7% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Pinot Noir | 666 | 10.3% |
| Chardonnay | 603 | 9.3% |
| Cabernet Sauvignon | 500 | 7.7% |
| Red Blend | 431 | 6.6% |
| Bordeaux-style Red Blend | 334 | 5.1% |
| Riesling | 249 | 3.8% |
| Sauvignon Blanc | 229 | 3.5% |
| Syrah | 191 | 2.9% |
| Rosé | 176 | 2.7% |
| Nebbiolo | 150 | 2.3% |

Variable: winery (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 4214 |
| Unique (%) | 64.9% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

Most frequent values:

| Value | Count | Frequency (%) |
|---|---|---|
| Testarossa | 20 | 0.3% |
| Williams Selyem | 13 | 0.2% |
| Wines & Winemakers | 13 | 0.2% |
| DFJ Vinhos | 12 | 0.2% |
| Concha y Toro | 12 | 0.2% |
| Gary Farrell | 10 | 0.2% |
| Henri de Villamont | 10 | 0.2% |
| Chehalem | 10 | 0.2% |
| Kenwood | 9 | 0.1% |
| Martin Ray | 9 | 0.1% |



## Discussion

Let's talk briefly about what we're seeing here before taking a deeper dive into the data. My first impression is that the data quality is quite good. There are a few apparent duplicates (a handful of descriptions and titles appear twice), but they should be quite easy to clean up. I'm not too concerned about most of the missing values: the largest gaps are in region_2 (61.2% missing) and designation (28.1% missing), and I wasn't expecting the designation (e.g. "reserva") to be especially meaningful.
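
As a rough illustration of the duplicate and missing-value checks mentioned above, here is a minimal sketch on a tiny made-up frame. The rows in `reviews` are invented for the example and are not taken from the dataset; treating a repeated `title` plus `description` as a duplicate is an assumption, not a rule the data guarantees.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine sample (values are made up).
reviews = pd.DataFrame({
    "title": ["Wine A 2011", "Wine A 2011", "Wine B 2013", "Wine C 2010"],
    "description": ["Ripe and oaky.", "Ripe and oaky.", "Crisp and dry.", "Soft tannins."],
    "region_2": ["Sonoma", "Sonoma", None, None],
    "points": [87, 87, 90, 86],
})

# Flag rows whose title and description repeat an earlier row.
dup_mask = reviews.duplicated(subset=["title", "description"], keep="first")

# Keep only the first occurrence of each review.
deduped = reviews.loc[~dup_mask].reset_index(drop=True)

# Share of missing values per column, largest first.
missing_share = reviews.isna().mean().sort_values(ascending=False)

print(dup_mask.sum(), "duplicate row(s) dropped")
print(missing_share)
```

On the real sample, the same calls on `data` would show how many rows the repeated descriptions and titles account for before deciding how to clean them.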

On a personal note, I find the summary statistics and histogram of the points *very* interesting, and a little surprising. I had not read anything about the wine magazine's point system beforehand, but it's very apparent that the scores range from 80 to 100. I'm glad I now know that 93 points is the 95th percentile of wine points. What's maybe more surprising is that 88 points is the 50th percentile, and the minimum (0th percentile) is 80.
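
The percentile figures quoted above come from the profile report; the same kind of summary can be computed directly with pandas. The sketch below uses a short made-up series of scores, so its numbers will not match the report. On the real sample the call would be `data['points'].quantile(...)`.

```python
import pandas as pd

# Hypothetical scores for illustration only (not the real distribution).
points = pd.Series([80, 84, 86, 87, 88, 88, 89, 91, 93, 100], name="points")

# Minimum (0th percentile), median, and 95th percentile in one call.
summary = points.quantile([0.0, 0.5, 0.95])

# Share of wines scoring at or below a given number of points.
share_at_or_below_93 = (points <= 93).mean()

print(summary)
print(share_at_or_below_93)
```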