In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("/Users/oktavianu/Documents/my-pandas/winemag-data-130k-v2.csv", index_col=0)

In [3]:
# Summary functions
# Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, 
# consider the describe() method:

data.points.describe()

count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

In [5]:
# This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on
# the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

data.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
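
# A minimal sketch of the type-aware behaviour described above. The `toy` frame and its values are made up for illustration
# and are not taken from the wine dataset; only the shape of the two summaries matters here.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({
    "points": [87, 90, 90, 92],
    "taster_name": ["Ann", "Bob", "Ann", None],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy.points.describe()

# String column: count (non-missing), unique, top (most frequent), freq
string_summary = toy.taster_name.describe()

print(numeric_summary)
print(string_summary)
```

# Note that count only covers non-missing entries, which is why the taster_name count above is lower than the points count.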

In [6]:
# If you want a particular summary statistic for a column of a DataFrame or for a Series, pandas usually has a method that 
# computes it.

# For example, to see the mean of the points awarded (that is, the score of an average wine), we can use the mean() method:
data.points.mean()

88.44713820775404

In [7]:
# To see the unique values in a column, we can use the unique() method:
data.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)
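
# The array above has 20 entries while describe() reported 19 unique tasters, because unique() lists the missing value (nan) as
# its own entry. A small sketch of the difference, using a made-up Series (the names are hypothetical):

```python
import pandas as pd

# Hypothetical values with one missing entry
names = pd.Series(["Ann", "Bob", None, "Ann"])

distinct = names.unique()                      # the missing value appears as an entry
n_distinct = names.nunique()                   # counts distinct non-missing values only
n_with_missing = names.nunique(dropna=False)   # counts the missing value as well

print(distinct)
print(n_distinct, n_with_missing)
```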

In [8]:
# To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:
data.taster_name.value_counts()

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, dtype: int64
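
# By default value_counts() leaves out missing values, so the counts above sum to 103727 rather than 129971. A sketch of two
# optional parameters that are often useful, shown on a made-up Series (the names are hypothetical):

```python
import pandas as pd

# Hypothetical values with one missing entry
names = pd.Series(["Ann", "Bob", None, "Ann"])

counts = names.value_counts()                    # missing values excluded by default
shares = names.value_counts(normalize=True)      # fractions of the non-missing entries
with_missing = names.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```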

In [10]:
# map() is the first, and slightly simpler, of the two mapping methods covered here (the other is apply()). For example, suppose that we 
# want to re-center the scores the wines received so that their mean is 0. We can do this as follows:
data_points_mean = data.points.mean()
data.points.map(lambda p: p - data_points_mean)

0        -1.447138
1        -1.447138
2        -1.447138
3        -1.447138
4        -1.447138
            ...   
129966    1.552862
129967    1.552862
129968    1.552862
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64
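
# For simple arithmetic like this, subtracting the mean from the Series directly should give the same result without a lambda.
# A minimal sketch on a made-up Series (the scores are hypothetical, not from the wine dataset):

```python
import pandas as pd

# Hypothetical scores
points = pd.Series([87, 87, 90, 94])
points_mean = points.mean()

# Element-by-element transformation with map()
centered_map = points.map(lambda p: p - points_mean)

# Equivalent vectorized form: the scalar is subtracted from every element
centered_vec = points - points_mean

print(centered_map.equals(centered_vec))
```

# map() remains the more general tool when the transformation cannot be written as a plain arithmetic expression.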

In [None]:
# The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed 
# version of that value. map() returns a new Series where all the values have been transformed by your function.

# apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row 
# (pass axis='columns' so that the function receives rows rather than columns).
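
# A minimal sketch of apply() doing the same re-centering row by row. The `toy` frame and the helper name `center_points` are
# hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({"points": [87, 90, 94], "price": [15.0, 30.0, 60.0]})
points_mean = toy.points.mean()

def center_points(row):
    # `row` is a Series holding one row of the DataFrame;
    # work on a copy so the original frame is left untouched
    row = row.copy()
    row["points"] = row["points"] - points_mean
    return row

# axis="columns" passes each row to the function (the default passes each column)
centered = toy.apply(center_points, axis="columns")

print(centered)
```

# Like map(), apply() returns a new object here; `toy` itself still holds the original scores.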