<a href="https://colab.research.google.com/github/nachoacev/practice-data-science/blob/main/PandasTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Tutorial

This notebook collects what is needed from `pandas`, the most popular Python library for data analysis, to manipulate data correctly.

In [None]:
import pandas as pd

## Creating data

There are two core objects in pandas: the **DataFrame** and the **Series**.

**DataFrame**
A DataFrame is a table. It contains an array of individual *entries*, each of which has a certain *value*. Each entry corresponds to a row (or record) and a column.

In [None]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

Unnamed: 0,Yes,No
0,50,131
1,21,2


In [None]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

Unnamed: 0,Bob,Sue
0,I liked it.,Pretty good.
1,It was awful.,Bland.


We are using the `pd.DataFrame()` constructor to generate these DataFrame objects. The syntax for declaring a new one is a **dictionary** whose keys are the column names (Bob and Sue in this example), and whose values are a list of entries.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, _but oftentimes we will want to assign these labels ourselves_.

The list of row labels used in a DataFrame is known as an __Index__. We can assign values to it by using an index parameter in our constructor:

In [None]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

Unnamed: 0,Bob,Sue
Product A,I liked it.,Pretty good.
Product B,It was awful.,Bland.


**Series**

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one with nothing more than a list:

In [None]:
pd.Series([1, 2, 3, 4, 5])

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall `name`:

In [None]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Unnamed: 0,Product A
2015 Sales,30
2016 Sales,35
2017 Sales,40


## Reading data files

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble `CSV file`. When you open a CSV file you get something that looks like this:
```
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
```
So a CSV file is a table of values separated by commas. Hence the name: **"Comma-Separated Values"**, or CSV.
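To make the format concrete, here is a minimal sketch that parses a small CSV held in a string. The content of `csv_text` is hypothetical (modelled on the sample above, without the trailing commas), and `io.StringIO` is used only so that no file on disk is needed.

```python
import io
import pandas as pd

# Hypothetical CSV content, modelled on the sample above.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts a path or a file-like object; StringIO wraps the string as a file.
products = pd.read_csv(io.StringIO(csv_text))

print(products.shape)  # (3, 3)
print(products)
```

The first line of the CSV becomes the column labels, and each following line becomes one row.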

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the `pd.read_csv()` function to read the data into a DataFrame.

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("zynicide/wine-reviews") + "/winemag-data_first150k.csv"

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/zynicide/wine-reviews?dataset_version_number=4...


100%|██████████| 50.9M/50.9M [00:03<00:00, 16.6MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/zynicide/wine-reviews/versions/4/winemag-data_first150k.csv


In [None]:
wine_reviews = pd.read_csv(path)

wine_reviews.shape

(150930, 11)

In [None]:
wine_reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


The `pd.read_csv()` function is well-endowed, with over 30 optional parameters you can specify. For example, we can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an `index_col`.

In [None]:
wine_reviews = pd.read_csv(path, index_col=0)

wine_reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [None]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals

Unnamed: 0,Cows,Goats
Year 1,12,22
Year 2,20,19


In [None]:
animals.to_csv('animals.csv')
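`to_csv()` is the counterpart of `read_csv()`: it writes the DataFrame, including its index, as comma-separated text. The sketch below shows a round trip under one assumption: it writes to an in-memory buffer rather than to `animals.csv`, so the example leaves no file behind.

```python
import io
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

# Write the table to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
animals.to_csv(buffer)

# Rewind and read it back; index_col=0 restores the row labels.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the row labels would come back as an ordinary unnamed column, which is the situation we saw with the wine dataset.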

# Indexing, Selecting & Assigning

In `Python`, we can access the property of an object by accessing it as an `attribute`. A book object, for example, might have a title property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way.

In [None]:
# Select column called country as an attribute
wine_reviews.country

Unnamed: 0,country
0,US
1,Spain
2,US
3,US
4,France
...,...
150925,Italy
150926,France
150927,Italy
150928,France


If we have a Python dictionary, we can access its values using the indexing `([])` operator. We can do the same with columns in a DataFrame:

In [None]:
# Select the column called description with the indexing operator
desc = wine_reviews['description']
desc

Unnamed: 0,description
0,This tremendous 100% varietal wine hails from ...
1,"Ripe aromas of fig, blackberry and cassis are ..."
2,Mac Watson honors the memory of a wine once ma...
3,"This spent 20 months in 30% new French oak, an..."
4,"This is the top wine from La Bégude, named aft..."
...,...
150925,Many people feel Fiano represents southern Ita...
150926,"Offers an intriguing nose with ginger, lime an..."
150927,This classic example comes from a cru vineyard...
150928,"A perfect salmon shade, with scents of peaches..."


In [None]:
type(desc)

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator `[]` does have the advantage that *it can handle column names with reserved characters in them* (e.g. if we had a `country providence` column, `wine_reviews.country providence` wouldn't work).
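A small sketch of that difference, using a hypothetical two-row DataFrame `df` that has a column name containing a space:

```python
import pandas as pd

# Hypothetical frame; 'country providence' contains a space.
df = pd.DataFrame({'country': ['US', 'Spain'],
                   'country providence': ['California', 'Northern Spain']})

# Both forms work for a simple column name.
print(df.country.tolist())
print(df['country'].tolist())

# Only the indexing operator works when the name contains a space.
# Writing df.country providence would be a SyntaxError.
print(df['country providence'].tolist())
```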

A pandas Series is like a vector, so it is no surprise that, to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [None]:
wine_reviews['country'][0]

'US'

## Indexing with `loc` and `iloc`

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.

- `iloc`: It treats the DataFrame **as a matrix**.
- `loc`: It treats it __as a table__ with its indices.

### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select **the first row** of data in a DataFrame, we may use the following:

In [None]:
# Select first row
wine_reviews.iloc[0]

Unnamed: 0,0
country,US
description,This tremendous 100% varietal wine hails from ...
designation,Martha's Vineyard
points,96
price,235.0
province,California
region_1,Napa Valley
region_2,Napa
variety,Cabernet Sauvignon
winery,Heitz


Both `loc` and `iloc` are **row-first, column-second**. *This is the opposite of what we do in native Python*, which is column-first, row-second.

The big advantage of these two ways is that **they work as matrix notation.**

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [None]:
wine_reviews.iloc[:, 0]

Unnamed: 0,country
0,US
1,Spain
2,US
3,US
4,France
...,...
150925,Italy
150926,France
150927,Italy
150928,France


In [None]:
wine_reviews.iloc[[0, 1, 2], 0]

Unnamed: 0,country
0,US
1,Spain
2,US


In [None]:
wine_reviews.iloc[:3, 0]

Unnamed: 0,country
0,US
1,Spain
2,US


In [None]:
wine_reviews.iloc[1:3, 0]

Unnamed: 0,country
1,Spain
2,US


Finally, it's worth knowing that **negative numbers** can be used in selection. They *count backwards from the end of the values*. So, for example, here are the last five rows of the dataset.

In [None]:
wine_reviews.iloc[-5:]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


### Label-based selection

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, **it's the data index value**, not its position, which matters.

For example, to get the first entry in `wine_reviews`, we would now do the following:

In [None]:
wine_reviews.loc[0, 'country']

'US'

In [None]:
wine_reviews.loc[:, ['province', 'variety', 'points']]

Unnamed: 0,province,variety,points
0,California,Cabernet Sauvignon,96
1,Northern Spain,Tinta de Toro,96
2,California,Sauvignon Blanc,96
3,Oregon,Pinot Noir,96
4,Provence,Provence red blend,95
...,...,...,...
150925,Southern Italy,White Blend,91
150926,Champagne,Champagne Blend,91
150927,Southern Italy,White Blend,91
150928,Champagne,Champagne Blend,90
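With the default `0, 1, 2, ...` index, labels and positions happen to coincide, which hides the difference between the two paradigms. The sketch below uses a hypothetical `sales` DataFrame with string labels to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings, not positions.
sales = pd.DataFrame({'units': [30, 35, 40]},
                     index=['2015 Sales', '2016 Sales', '2017 Sales'])

by_position = sales.iloc[0, 0]               # first row, first column
by_label = sales.loc['2015 Sales', 'units']  # row and column named explicitly
print(by_position, by_label)

# After filtering, position 0 no longer corresponds to the original first row.
recent = sales.loc[sales.units > 30]
first_recent_label = recent.index[0]
first_recent_units = recent.iloc[0, 0]
print(first_recent_label, first_recent_units)
```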


### Choosing between loc and iloc

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that **the two methods use slightly different indexing schemes**.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values `Apples, ..., Potatoes, ...`, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.
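The counts can be checked directly. This is a minimal sketch on two hypothetical frames: `df` with index labels `0,...,1000`, and `fruit` with a string index.

```python
import pandas as pd

df = pd.DataFrame({'value': range(1001)})  # index labels 0..1000

n_iloc = len(df.iloc[0:1000])   # end position excluded
n_loc = len(df.loc[0:1000])     # end label included
n_loc_999 = len(df.loc[0:999])
print(n_iloc, n_loc, n_loc_999)  # 1000 1001 1000

# Label slices on a string index also include both endpoints.
fruit = pd.DataFrame({'stock': [5, 2, 7, 1]},
                     index=['Apples', 'Bananas', 'Potatoes', 'Zucchini'])
selected = fruit.loc['Apples':'Potatoes']
print(list(selected.index))  # ['Apples', 'Bananas', 'Potatoes']
```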

Otherwise, the semantics of using loc are the same as those for iloc.

In [None]:
wine_reviews.loc[0:99, ['country', 'province', 'region_1', 'region_2']]

Unnamed: 0,country,province,region_1,region_2
0,US,California,Napa Valley,Napa
1,Spain,Northern Spain,Toro,
2,US,California,Knights Valley,Sonoma
3,US,Oregon,Willamette Valley,Willamette Valley
4,France,Provence,Bandol,
...,...,...,...,...
95,France,Southwest France,Cahors,
96,US,California,California,California Other
97,US,California,Sonoma Valley,Sonoma
98,France,Southwest France,Buzet,


## Conditional selection

- and: `&`
- or: `|`

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in Italy.

We can start by checking if each wine is Italian or not:

In [None]:
wine_reviews.country == 'Italy'

Unnamed: 0,country
0,False
1,False
2,False
3,False
4,False
...,...
150925,True
150926,False
150927,True
150928,False


This operation produced a Series of `True`/`False` booleans based on the country of each record. This result can then be used inside `loc` to select the matching rows, since `loc` accepts a boolean Series aligned with the row index. The next cell applies the same idea to wines from Chile:

In [None]:
wine_reviews.loc[wine_reviews.country == 'Chile']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
155,Chile,"Lightly herbal, horsey aromas of blueberry, bl...",Ecos de Rulo Single Vineyard El Chequén Estate,89,20.0,Marchigue,,,Carmenère,Viña Bisquertt
159,Chile,"Staunch berry, cassis and spice aromas are fri...",Fina Reserva Ensamblaje Malbec-Cabernet Sauvig...,89,19.0,Colchagua Valley,,,Red Blend,Estampa
171,Chile,"Dark, charred, lemony aromas of graphite, lico...",Family Collection,89,30.0,Curicó Valley,,,Cabernet Sauvignon-Cabernet Franc,Santa Alba
179,Chile,"Aromas of latex, tire rubber, spice, cured mea...",Perla Negra,89,61.0,Maule Valley,,,Red Blend,Casa Donoso
502,Chile,"Blackberry, cassis, herb and spice aromas are ...",Fina Reserva Ensamblaje Cabernet Sauvignon-Mal...,89,19.0,Colchagua Valley,,,Red Blend,Estampa
...,...,...,...,...,...,...,...,...,...,...
150901,Chile,"Lavishly oaked, the fruit here struggles to ma...",Reserva,81,12.0,Maipo Valley,,,Merlot,Undurraga
150902,Chile,This medium weight Chardonnay offered aromas o...,Estate Bottled,81,10.0,Maipo Valley,,,Chardonnay,De Martino
150903,Chile,Very light berry and mint aromas open this aus...,120,81,7.0,Rapel Valley,,,Cabernet Sauvignon,Santa Rita
150904,Chile,A lot of Chilean Cabernets seem to have a dist...,,81,10.0,Maipo Valley,,,Cabernet Sauvignon,De Martino


We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (`&`) to bring the two questions together:

In [None]:
wine_reviews.loc[(wine_reviews.country == 'Chile') & (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
537,Chile,"Deep, pure cassis and red berry aromas are int...",Edición Limitada B,92,30.0,Colchagua Valley,,,Bordeaux-style Red Blend,Caliterra
833,Chile,"Spicy, herbal plum and cassis aromas mix in no...",Amplus One,90,24.0,Peumo,,,Red Blend,Santa Ema
873,Chile,"Deep, pure cassis and red berry aromas are int...",Edición Limitada B,92,30.0,Colchagua Valley,,,Bordeaux-style Red Blend,Caliterra
892,Chile,"Reedy, berry aromas come with a shot of eucaly...",Tralca,92,100.0,Colchagua Valley,,,Red Blend,Viña Bisquertt
1037,Chile,"Dark berry, shoe polish and lemony oak are at ...",Auma Los Lingues,91,100.0,Colchagua Valley,,,Red Blend,Koyle
...,...,...,...,...,...,...,...,...,...,...
150775,Chile,"A blend of 50% Cabernet Sauvignon, 20% Grande ...",Winemaker's Reserve,91,40.0,Maipo Valley,,,Red Blend,Carmen
150781,Chile,"A blend of old-vine Cariñena (60%), Syrah (25%...",Cordillera,91,26.0,Curicó Valley,,,Red Blend,Miguel Torres
150783,Chile,Makes a strong statement for the potential of ...,Maquehua,90,19.0,Curicó Valley,,,Chardonnay,Miguel Torres
150785,Chile,"Rich and complex from the start, the nose and ...",Reserva de la Familia,90,15.0,Maipo Valley,,,Chardonnay,Santa Carolina
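The parentheses around each condition are required: in Python, `&` and `|` bind more tightly than comparisons such as `==` and `>=`, so leaving them out changes the meaning of the expression. A minimal sketch on a hypothetical four-row `wines` frame, which also shows `~` for negating a mask:

```python
import pandas as pd

# Hypothetical data, small enough to check by eye.
wines = pd.DataFrame({
    'country': ['Chile', 'Chile', 'Italy', 'US'],
    'points': [92, 85, 95, 88],
})

# Naming the masks first keeps the combined expression readable.
is_chilean = wines.country == 'Chile'
is_high = wines.points >= 90

both = wines.loc[is_chilean & is_high]        # Chilean AND at least 90 points
either = wines.loc[is_chilean | is_high]      # Chilean OR at least 90 points
neither = wines.loc[~(is_chilean | is_high)]  # ~ negates a boolean mask

print(len(both), len(either), len(neither))  # 1 3 1
```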


Suppose we'll buy any wine that's made in Chile or which is rated above average. For this we use a pipe (`|`):

In [None]:
wine_reviews.loc[(wine_reviews.country == 'Chile') | (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


Pandas comes with a few **built-in conditional selector methods**, two of which we will highlight here.

The first is `isin`. `isin` lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:

In [None]:
wine_reviews.loc[wine_reviews.country.isin(['France', 'Italy'])]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
10,Italy,"Elegance, complexity and structure come togeth...",Ronco della Chiesa,95,80.0,Northeastern Italy,Collio,,Friulano,Borgo del Tiglio
13,France,This wine is in peak condition. The tannins an...,Château Montus Prestige,95,90.0,Southwest France,Madiran,,Tannat,Vignobles Brumont
18,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,,Malbec,Château Lagrézette
32,Italy,"Underbrush, scorched earth, menthol and plum s...",Vigna Piaggia,90,,Tuscany,Brunello di Montalcino,,Sangiovese,Abbadia Ardenga
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


The second is `isnull` (and its companion `notnull`). These methods let you highlight values which are (or are not) empty (`NaN`). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:

In [None]:
wine_reviews.loc[wine_reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
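On a full dataset it is hard to see which rows were dropped, so here is the same filter on a hypothetical three-row frame with one missing price:

```python
import numpy as np
import pandas as pd

# Hypothetical data: winery 'B' has no price.
wines = pd.DataFrame({'winery': ['A', 'B', 'C'],
                      'price': [20.0, np.nan, 15.0]})

missing = wines.loc[wines.price.isnull()]   # rows without a price
priced = wines.loc[wines.price.notnull()]   # rows with a price
n_missing = wines.price.isnull().sum()      # True counts as 1 when summed

print(missing.winery.tolist(), priced.winery.tolist(), n_missing)
```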


In [None]:
top_oceania_wines = wine_reviews.loc[wine_reviews.country.isin(['Australia', 'New Zealand']) & (wine_reviews.points >= 95)]

top_oceania_wines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2148,Australia,Full-bodied and plush yet vibrant and imbued w...,The Factor,98,125.0,South Australia,Barossa Valley,,Shiraz,Torbreck
2458,Australia,This is a top example of the classic Australia...,The Peake,96,150.0,South Australia,McLaren Vale,,Cabernet-Shiraz,Hickinbotham
3033,Australia,This Cabernet equivalent to Grange has explode...,Bin 707,95,500.0,South Australia,South Australia,,Cabernet Sauvignon,Penfolds
3044,Australia,"From vines planted in 1912, this has been an i...",Mount Edelstone Vineyard,95,200.0,South Australia,Eden Valley,,Shiraz,Henschke
3047,Australia,"This is a throwback to those brash, flavor-exu...",One,95,95.0,South Australia,Langhorne Creek,,Red Blend,Heartland
...,...,...,...,...,...,...,...,...,...,...
122779,Australia,If Standish's Relic is the feminine side of Sh...,The Standish Single Vineyard,96,135.0,South Australia,Barossa Valley,,Shiraz,Standish
127614,Australia,This stellar wine takes a little time in the g...,Hill of Grace,95,625.0,South Australia,Eden Valley,,Shiraz,Henschke
137383,Australia,The 2007 Astralis impresses for its combinatio...,Astralis,95,225.0,South Australia,Clarendon,,Syrah,Clarendon Hills
150562,Australia,"As unevolved as they are, the dense and multil...",Grange,96,185.0,South Australia,South Australia,,Shiraz,Penfolds


## Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign:

- A constant value.
- An iterable of values.

In [None]:
wine_reviews['critic'] = 'everyone'
wine_reviews['index_backwards'] = range(len(wine_reviews), 0, -1)

In [None]:
wine_reviews.loc[:, ['critic', 'index_backwards']]

Unnamed: 0,critic,index_backwards
0,everyone,150930
1,everyone,150929
2,everyone,150928
3,everyone,150927
4,everyone,150926
...,...,...
150925,everyone,5
150926,everyone,4
150927,everyone,3
150928,everyone,2
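Assignment also combines with the selection tools from the previous sections. The sketch below repeats the two assignments on a hypothetical three-row frame, then uses `loc` with a boolean mask to overwrite only the matching rows; the value `'local panel'` is made up for illustration.

```python
import pandas as pd

wines = pd.DataFrame({'country': ['US', 'Chile', 'US'],
                      'points': [96, 89, 91]})

wines['critic'] = 'everyone'                         # constant, repeated on every row
wines['index_backwards'] = range(len(wines), 0, -1)  # iterable, one value per row

# Overwrite the new column only where the mask is True.
wines.loc[wines.country == 'Chile', 'critic'] = 'local panel'
print(wines)
```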


# Summary Functions and Maps

In this part we will extract useful insights from the data.

## Summary functions

Pandas provides many simple "**summary functions**" (not an official name) which restructure the data in some useful way.

- `describe()` method: generates *a high-level summary* of the attributes of the given column.  **It is type-aware**, meaning that its output changes based on the data type of the input.

In [64]:
wine_reviews.points.describe()

Unnamed: 0,points
count,150930.0
mean,87.888418
std,3.222392
min,80.0
25%,86.0
50%,88.0
75%,90.0
max,100.0


In [65]:
wine_reviews.country.describe()

Unnamed: 0,country
count,150925
unique,48
top,US
freq,62397


If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, **there is usually a helpful pandas function that makes it happen**.

For example, to see the mean of the points allotted (i.e. how well an averagely rated wine does), we can use the `mean()` function:

In [66]:
wine_reviews.points.mean()

87.8884184721394

To see a list of unique values we can use the `unique()` function:

In [69]:
wine_reviews.country.unique()

array(['US', 'Spain', 'France', 'Italy', 'New Zealand', 'Bulgaria',
       'Argentina', 'Australia', 'Portugal', 'Israel', 'South Africa',
       'Greece', 'Chile', 'Morocco', 'Romania', 'Germany', 'Canada',
       'Moldova', 'Hungary', 'Austria', 'Croatia', 'Slovenia', nan,
       'India', 'Turkey', 'Macedonia', 'Lebanon', 'Serbia', 'Uruguay',
       'Switzerland', 'Albania', 'Bosnia and Herzegovina', 'Brazil',
       'Cyprus', 'Lithuania', 'Japan', 'China', 'South Korea', 'Ukraine',
       'England', 'Mexico', 'Georgia', 'Montenegro', 'Luxembourg',
       'Slovakia', 'Czech Republic', 'Egypt', 'Tunisia', 'US-France'],
      dtype=object)

To see a list of unique values and how often they occur in the dataset, we can use the `value_counts()` method:

In [73]:
wine_reviews.country.value_counts().head()

Unnamed: 0_level_0,count
country,Unnamed: 1_level_1
US,62397
Italy,23478
France,21098
Spain,8268
Chile,5816


## Maps

A `map` is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for **creating new representations from existing data**, or for **transforming data from the format it is in now to the format that we want it to be in later**. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

- `map()` is the first, and slightly simpler one. The function you pass to `map()` should expect a single value from the Series (a point value, in the example below), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.

For example, suppose that we wanted to remean the scores the wines received to 0 (centering the average to 0). We can do this as follows:

In [77]:
review_points_mean = wine_reviews.points.mean()

wine_reviews.points.map(lambda p: p - review_points_mean)


Unnamed: 0,points
0,8.111582
1,8.111582
2,8.111582
3,8.111582
4,7.111582
...,...
150925,3.111582
150926,3.111582
150927,3.111582
150928,2.111582
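Besides a function, `Series.map()` also accepts a dictionary, which is convenient for lookups. A minimal sketch, where `continent_of` is a hypothetical lookup table that is deliberately incomplete:

```python
import pandas as pd

countries = pd.Series(['US', 'France', 'Chile', 'US'])

# Hypothetical lookup table; 'Chile' is left out on purpose.
continent_of = {'US': 'North America', 'France': 'Europe'}

# Each value is replaced by its dictionary entry; keys that are absent become NaN.
continents = countries.map(continent_of)
print(continents)
```

Values without an entry in the dictionary come back as missing, so it is worth checking the result with `isnull()` after a dictionary-based map.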


- `apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom method **on each row** or **on each column**.

In [79]:
def remean_points(row):
  row.points = row.points - review_points_mean
  return row

wine_reviews.iloc[:1000].apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,8.111582,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,8.111582,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,8.111582,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,everyone,150928
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,8.111582,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,everyone,150927
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,7.111582,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,everyone,150926
...,...,...,...,...,...,...,...,...,...,...,...,...
995,Portugal,"This is a dry sparkling wine with tannins, spi...",Plexus Tinto,-2.888418,7.0,Tejo,,,Alicante Bouschet,Adega Cooperativa do Cartaxo,everyone,149935
996,Portugal,"This is a soft, creamy wine with fruitiness an...",Blanc de Blancs Bruto,-2.888418,12.0,Beira Atlantico,,,Portuguese Sparkling,Adega de Cantanhede,everyone,149934
997,France,This is a soft wine whose herbal character is ...,La Galope,-2.888418,15.0,Southwest France,Côtes de Gascogne,,Sauvignon Blanc,Domaine de l'Herré,everyone,149933
998,France,This white wine is fruity and crisp with a del...,,-2.888418,9.0,Southwest France,Côtes de Gascogne,,White Blend,Domaine du Touja,everyone,149932
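The `axis` argument decides what the function receives. The sketch below, on a hypothetical numeric frame `scores`, calls `apply()` once per column (the default, `axis='index'`) and once per row (`axis='columns'`).

```python
import pandas as pd

scores = pd.DataFrame({'points': [96, 90, 85],
                       'price': [235.0, 30.0, 12.0]})

# Default axis='index': the function is called once per column.
col_ranges = scores.apply(lambda col: col.max() - col.min())

# axis='columns': the function is called once per row.
ratio = scores.apply(lambda row: row.points / row.price, axis='columns')

print(col_ranges)
print(ratio)
```

When the function returns a single value, as here, `apply()` returns a Series instead of a DataFrame.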


Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of `wine_reviews`, we can see that it still has its original points value.

In [89]:
wine_reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,everyone,150928
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,everyone,150927
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,everyone,150926


Now, the above isn't the best method to remean the variable points. **Pandas provides many common mapping operations as built-ins**. For example, here's a faster way of remeaning our points column:

In [90]:
review_points_mean = wine_reviews.points.mean()
wine_reviews.points - review_points_mean

Unnamed: 0,points
0,8.111582
1,8.111582
2,8.111582
3,8.111582
4,7.111582
...,...
150925,3.111582
150926,3.111582
150927,3.111582
150928,2.111582


In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

In [92]:
wine_reviews.country + ' - ' + wine_reviews.region_1

Unnamed: 0,0
0,US - Napa Valley
1,Spain - Toro
2,US - Knights Valley
3,US - Willamette Valley
4,France - Bandol
...,...
150925,Italy - Fiano di Avellino
150926,France - Champagne
150927,Italy - Fiano di Avellino
150928,France - Champagne


**These operators are faster than** `map()` or `apply()` because **they use speed ups built into pandas**. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
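As an illustration of that flexibility, here is a minimal sketch that applies conditional logic with `map()`. The labels `'high'` and `'regular'` and the threshold of 90 are assumptions chosen for the example.

```python
import pandas as pd

points = pd.Series([96, 88, 79, 91])

# The lambda can contain arbitrary Python logic, here a conditional expression.
labels = points.map(lambda p: 'high' if p >= 90 else 'regular')
print(labels.tolist())  # ['high', 'regular', 'regular', 'high']
```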

## Exercises

## 1.
Select the best bargain: find the winery of the wine with the highest points-to-price ratio.

In [106]:
# We create a Series of the points-to-price ratio
reviews = wine_reviews.loc[wine_reviews.price.notnull()]
ratio_points_price = reviews.points / reviews.price

In [108]:
# We select the best bargain using the 'idxmax()' method
bargain_wine = reviews.loc[ratio_points_price.idxmax(), 'winery']

bargain_wine

'Bandit'
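`idxmax()` returns the index *label* of the largest value, not its position, which is why the result is passed to `loc` rather than `iloc`. A small sketch on hypothetical data with a non-default index makes this visible:

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately not 0, 1, 2.
wines = pd.DataFrame({'winery': ['A', 'B', 'C'],
                      'points': [90, 86, 88],
                      'price': [45.0, 4.0, 20.0]},
                     index=[10, 20, 30])

ratio = wines.points / wines.price       # 2.0, 21.5, 4.4
best_label = ratio.idxmax()              # label of the largest ratio
best = wines.loc[best_label, 'winery']   # look the row up by label
print(best_label, best)  # 20 B
```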

## 2.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"?

Create a Series `descriptor_counts` counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

In [115]:
n_trop = wine_reviews.description.map(lambda p: 'tropical' in p).sum()
n_fruity = wine_reviews.description.map(lambda p: 'fruity' in p).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts

Unnamed: 0,0
tropical,4135
fruity,8669
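An alternative sketch uses the vectorized string method `str.contains()`. Unlike the `lambda` version, which would raise a `TypeError` on a missing description, passing `na=False` treats missing values as non-matches. The `descriptions` Series here is hypothetical.

```python
import pandas as pd

# Hypothetical descriptions, including one missing value.
descriptions = pd.Series(['A tropical, fruity white.',
                          'Dry and fruity.',
                          'Earthy red.',
                          None])

# regex=False does a plain substring match; na=False counts missing values as False.
n_trop = descriptions.str.contains('tropical', regex=False, na=False).sum()
n_fruity = descriptions.str.contains('fruity', regex=False, na=False).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)
```

Like the solution above, this counts the descriptions that contain the word, not the total number of occurrences.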


## 3.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. **A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.**

Also, the Canadian Vintners Association bought a lot of ads on the site, so any **wines from Canada should automatically get 3 stars, regardless of points**.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [129]:
def get_stars(row):
  if row.country == 'Canada' or row.points >= 95:
    row.star_ratings = 3
    return row
  elif 85 <= row.points < 95:
    row.star_ratings = 2
    return row
  else:
    row.star_ratings = 1
    return row


In [132]:
wine_reviews['star_ratings'] = 0

star_ratings = wine_reviews.apply(get_stars, axis='columns').star_ratings

star_ratings

Unnamed: 0,star_ratings
0,3
1,3
2,3
3,3
4,3
...,...
150925,2
150926,2
150927,2
150928,2
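Calling `apply()` row by row on about 150,000 rows is slow. One possible vectorized alternative, shown here on a hypothetical four-row frame, is `numpy.select`, which takes a list of conditions and picks the choice belonging to the first condition that holds. This is a sketch of a different technique, not the approach used in the solution above.

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame({'country': ['US', 'Canada', 'Italy', 'Chile'],
                      'points': [96, 82, 88, 80]})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (wines.country == 'Canada') | (wines.points >= 95),  # 3 stars
    wines.points >= 85,                                  # 2 stars
]
choices = [3, 2]

star_ratings = pd.Series(np.select(conditions, choices, default=1),
                         index=wines.index, name='star_ratings')
print(star_ratings.tolist())  # [3, 3, 2, 1]
```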


