# INTRODUCTION TO PANDAS
Pandas is a popular Python library for data analysis.



In [1]:
# import pandas 
import pandas as pd

### Dataframe
There are two core objects in pandas: **DataFrame** and **Series**

**DataFrame**: a table whose entries are arranged in rows and columns;

- Each column has a name/label
- By default, the row labels are the sequence of integers 0, 1, 2, ...
- All columns must have the same length
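
The equal-length rule can be checked directly. The following is a minimal sketch with made-up values; it only assumes that the constructor raises a `ValueError` when the column lists differ in length.

```python
import pandas as pd

# Columns of equal length build a table without complaint.
ok = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 9]})

# Columns of different lengths are rejected by the constructor.
try:
    pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(ok.shape)           # (rows, columns)
print(mismatch_rejected)
```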

In [31]:
# create a Dataframe

pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 9], 'C': [1, 8, 27]})

Unnamed: 0,A,B,C
0,1,1,1
1,2,4,8
2,3,9,27


In [32]:
# the row labels can be set by passing the index parameter to the DataFrame constructor

pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 9], 'C': [1, 8, 27]}, index=['First', 'Second', 'Third'])

Unnamed: 0,A,B,C
First,1,1,1
Second,2,4,8
Third,3,9,27


### Series
The Series object holds a single *unnamed* column of values

- row labels can be assigned to a Series object using the index parameter
- a Series can be given an overall name using the name parameter
- a Series does not have a column name

In [33]:
# creating a Series object using its constructor

pd.Series([1, 2, 3, 4, 5, 6], index=['A', 'B', 'C', 'D', 'E', 'F'])

A    1
B    2
    ..
E    5
F    6
Length: 6, dtype: int64

In [53]:
# adding an overall name to a series object by passing the name argument

pd.Series([1, 2, 3, 4, 5, 6], index=['A', 'B', 'C', 'D', 'E', 'F'], name='Number Series')


A    1
B    2
C    3
D    4
E    5
F    6
Name: Number Series, dtype: int64

### Reading Data Files

It is convenient to read the data in a file into table form. This is done with the pandas reader function for the file type, for example pd.read_csv or pd.read_json

- **Using the WINE REVIEW DATASETS as an example**

In [16]:
# reading a csv formatted file and examining its content using the head method
wine_review = pd.read_csv("winemag-data_first150k.csv")

wine_review.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [9]:
# reading a json formatted file and examining its content using the head method

wine_review_json = pd.read_json("winemag-data-130k-v2.json")

wine_review_json.head()

Unnamed: 0,points,title,description,taster_name,taster_twitter_handle,price,designation,variety,region_1,region_2,province,country,winery
0,87,Nicosia 2013 Vulkà Bianco (Etna),"Aromas include tropical fruit, broom, brimston...",Kerin O’Keefe,@kerinokeefe,,Vulkà Bianco,White Blend,Etna,,Sicily & Sardinia,Italy,Nicosia
1,87,Quinta dos Avidagos 2011 Avidagos Red (Douro),"This is ripe and fruity, a wine that is smooth...",Roger Voss,@vossroger,15.0,Avidagos,Portuguese Red,,,Douro,Portugal,Quinta dos Avidagos
2,87,Rainstorm 2013 Pinot Gris (Willamette Valley),"Tart and snappy, the flavors of lime flesh and...",Paul Gregutt,@paulgwine,14.0,,Pinot Gris,Willamette Valley,Willamette Valley,Oregon,US,Rainstorm
3,87,St. Julian 2013 Reserve Late Harvest Riesling ...,"Pineapple rind, lemon pith and orange blossom ...",Alexander Peartree,,13.0,Reserve Late Harvest,Riesling,Lake Michigan Shore,,Michigan,US,St. Julian
4,87,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,"Much like the regular bottling from 2012, this...",Paul Gregutt,@paulgwine,65.0,Vintner's Reserve Wild Child Block,Pinot Noir,Willamette Valley,Willamette Valley,Oregon,US,Sweet Cheeks


### How large is a Dataframe object?

The shape attribute of a DataFrame gives its size as a tuple - (number of rows, number of columns)


In [5]:
# check the size of the csv file

wine_review.shape

(150930, 11)

In [6]:
# check the size of the json formatted file

wine_review_json.shape

(129971, 13)

- The csv file contains 150,930 rows and 11 columns while
- the json formatted file contains 129,971 rows and 13 columns

#### A csv formatted file can be indexed by a specific column

This means that the column used as the index is displayed first, in place of the default row labels


In [28]:
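# use the first column of the file (position 0, the unnamed row-number column) as the index
# NOTE: despite the variable name, this does not index by province; index_col='province' would do that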

wine_review_indexed_by_province = pd.read_csv("winemag-data_first150k.csv", index_col=[0])

wine_review_indexed_by_province.head()


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


### REPEATING SCALAR VALUES USING THE INDEX

The number of rows is determined by the length of the index argument.
If a scalar value is passed for a column, the value is repeated once for each entry of the index

In [34]:
simple_df = pd.DataFrame({'A': 1, 'B': 4}, index=[0,1,2])
simple_df

Unnamed: 0,A,B
0,1,4
1,1,4
2,1,4


### Saving a DataFrame Object 
A DataFrame object can be saved in a specific file format by calling the matching to_ method on it, for example to_csv or to_json

The file path must be passed as a parameter

- **NOTE**: The file name must include the extension, since pandas writes to exactly the path it is given

In [None]:
# to save the simple_df in csv and json format

# csv
simple_df.to_csv('simple_df.csv')

# json

simple_df.to_json('simple_df.json')
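
A quick way to check a save is to read the file back. This sketch rebuilds `simple_df`, writes it to a temporary directory (the use of `tempfile` is an assumption made so that no files are left behind), and reloads it with the first column as the index.

```python
import os
import tempfile

import pandas as pd

simple_df = pd.DataFrame({'A': 1, 'B': 4}, index=[0, 1, 2])

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'simple_df.csv')
    simple_df.to_csv(csv_path)

    # to_csv wrote the row labels as the first column; read them back as the index.
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
```

Without `index_col=0`, the saved row labels would come back as an extra unnamed data column.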

## SELECTIONS, INDEXING AND ASSIGNING


- Selecting specific values of a DataFrame or Series 
	- A column can be accessed as an attribute of the DataFrame
	- Another way is to index the DataFrame with the column name
	- Either way returns a Series object

> NOTE: with these native accessors, the order of access is COLUMN-FIRST and ROW-SECOND

In [55]:
wine_review_indexed_by_province

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


In [42]:
# access the country column using attributes

wine_review_indexed_by_province.country

0             US
1          Spain
           ...  
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

In [56]:
# access the country column using index

wine_review_indexed_by_province['country']

0             US
1          Spain
           ...  
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

The advantage of index access over attribute access lies in the column names it can handle.

If the column name is a string containing a space, then it is impossible to use attribute access.

- **Index access is more general than attribute access**
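
A small illustration of the point, using a hypothetical table that is not part of the wine data. One column label contains a space and another has the same name as a DataFrame method, so attribute access fails to reach either of them.

```python
import pandas as pd

# Hypothetical columns: one label contains a space, the other shadows a DataFrame method.
sales = pd.DataFrame({'unit price': [2.5, 4.0], 'count': [10, 3]})

# Index access works for any column label.
prices = sales['unit price']
counts = sales['count']

# Attribute access resolves to the DataFrame.count method, not to the column.
print(callable(sales.count))
print(counts.tolist())
```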

## Pandas Indexing
For more advanced operations, instead of using the attribute/index access, there are two *accessor operators* for indexing in pandas:

- *loc*
- *iloc*

It is important to note that while the native index/attribute accessors retrieve an entire column of data first and then rows, it is the reverse with the pandas accessors loc and iloc: rows first, then columns

#### ILOC
It is used to select data based on its numerical position.

Instead of returning a column as a Series object like the native accessors, a single integer passed to iloc returns a row as a Series object

- iloc is a ROW-FIRST and COLUMN-SECOND accessor

In [60]:
# remove the unnamed column from the data by using it as the index of the DataFrame
wine_review = pd.read_csv('winemag-data_first150k.csv', index_col=[0])

In [62]:
wine_review

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


In [79]:
# return the first row-based Series

wine_review.iloc[0]

country                                                       US
description    This tremendous 100% varietal wine hails from ...
                                     ...                        
variety                                       Cabernet Sauvignon
winery                                                     Heitz
Name: 0, Length: 10, dtype: object

#### Creating a new DataFrame from another DataFrame using ILOC

A new DataFrame can be created from the original DataFrame by indexing with a range (slice)

- range indexing returns a row-based DataFrame object whose entries depend on the range passed

In [70]:
# a new Dataframe object containing the first three rows

wine_review.iloc[0:3]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley


#### Creating a column-based Series using ILOC

A column-based Series can be returned by passing iloc a second key, the integer position of the column

- ROW-FIRST then COLUMN-SECOND

In [84]:
# access the entire province column

# the ':' selects every row of the dataset, 
# then the province column is picked by its integer position - 5


wine_review.iloc[:, 5]

0                 California
1             Northern Spain
                 ...        
150928             Champagne
150929    Northeastern Italy
Name: province, Length: 150930, dtype: object

In [87]:
# return a Series object containing the first five rows of the description column

# the description column has integer position 1
wine_review.iloc[0:5, 1]

0    This tremendous 100% varietal wine hails from ...
1    Ripe aromas of fig, blackberry and cassis are ...
2    Mac Watson honors the memory of a wine once ma...
3    This spent 20 months in 30% new French oak, an...
4    This is the top wine from La Bégude, named aft...
Name: description, dtype: object

#### Using a list instead of a range to create a DataFrame

Instead of passing a range of values as the first element to the iloc accessor, a list of the needed rows can be used

- **This is more flexible than range-based, since one can pass in the specific row numbers of interest**

In [97]:
# return a DataFrame containing the rows at positions 0, 6, 2 and 10 (the 1st, 7th, 3rd and 11th rows) - in that order

wine_review.iloc[[0, 6, 2, 10]]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
10,Italy,"Elegance, complexity and structure come togeth...",Ronco della Chiesa,95,80.0,Northeastern Italy,Collio,,Friulano,Borgo del Tiglio


In [98]:
# get the designation column (integer position 2) of the rows selected above

wine_review.iloc[[0, 6, 2, 10], 2]

0                 Martha's Vineyard
6                         San Román
2     Special Selected Late Harvest
10               Ronco della Chiesa
Name: designation, dtype: object

#### Returning rows counted from the end using negative values

A negative start in the range passed to iloc counts positions from the end of the DataFrame. The rows returned keep their original order; they are not reversed

In [110]:
# get the last 10 rows, in their original order

wine_review.iloc[-10:]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150920,Italy,"Rich and mature aromas of smoke, earth and her...",Brut Riserva,91,19.0,Northeastern Italy,Trento,,Champagne Blend,Letrari
150921,France,Shows some older notes: a bouquet of toasted w...,Blanc de Blancs Brut Mosaïque,91,38.0,Champagne,Champagne,,Champagne Blend,Jacquart
150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano
150923,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,91,30.0,Champagne,Champagne,,Champagne Blend,Jacquart
150924,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,91,70.0,Champagne,Champagne,,Champagne Blend,Heidsieck & Co Monopole
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder
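
The difference between counting from the end and reversing can be seen on a small stand-in table (made-up values). A negative start keeps the original order, while a negative step is what reverses the rows.

```python
import pandas as pd

numbers = pd.DataFrame({'n': [10, 20, 30, 40, 50]})

last_two = numbers.iloc[-2:]         # last two rows, still in their original order
reversed_rows = numbers.iloc[::-1]   # a negative step reverses the row order

print(last_two['n'].tolist())
print(reversed_rows['n'].tolist())
```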


### SUMMARY of ILOC ATTRIBUTE SELECTION

- It generally allows indexing into a DataFrame by integer position
- It takes two positional elements - 
	- the first determines the rows of interest, and
	- the second determines the columns of the indexed rows

> **The first positional element selects the rows of interest**, and **the second selects the columns of interest** within the rows selected by the first positional element

the first positional element can be a **single integer**, a **range**, or a **list of the rows (in integers)** of interest
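
The summary can be condensed into one sketch. The table below is a small stand-in whose column names mirror a few of the wine review columns; each line shows one form of the positional elements and the kind of object it returns.

```python
import pandas as pd

# Small stand-in table; the values are copied from the first rows shown earlier.
wines = pd.DataFrame({
    'country': ['US', 'Spain', 'US', 'France'],
    'points': [96, 96, 96, 95],
    'price': [235.0, 110.0, 90.0, 66.0],
})

first_row = wines.iloc[0]        # single integer -> Series holding one row
first_two = wines.iloc[0:2]      # range -> DataFrame, end position excluded
picked = wines.iloc[[3, 0]]      # list -> DataFrame in the requested order
points_col = wines.iloc[:, 1]    # all rows, column at position 1 -> Series
cell = wines.iloc[2, 2]          # one row, one column -> single value
```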

	

---

### **LABEL-BASED SELECTION - LOC**

It also takes two positional arguments - *the second is optional*. The first selects rows by their index label:

- scalar label - this will return a Series holding the row with that label


- range of labels - this will return a DataFrame object containing the rows in that range, including the end label


- list of labels - this does the same thing as a range, although it has the added advantage of being flexible
	- it will return a DataFrame object containing only the rows whose labels are in the list

> **NOTE: with the default row labels (0, 1, 2, ...), labels and positions coincide, so the first element of the *loc* accessor selects the same rows as the first element of the *iloc* accessor, except that a *loc* range includes its end point**
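
Labels and positions stop coinciding as soon as the index is not the default 0, 1, 2, .... In this sketch the index is deliberately shuffled (hypothetical values), so the same number selects different rows in loc and iloc.

```python
import pandas as pd

# The index labels are integers, but deliberately not in the order 0, 1, 2.
scores = pd.DataFrame({'points': [96, 95, 90]}, index=[2, 0, 1])

by_label = scores.loc[0, 'points']    # row whose label is 0
by_position = scores.iloc[0, 0]       # row at position 0

print(by_label, by_position)
```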

#### SCALAR-BASED INDEXING

In [125]:
# return a Series object containing the third row of the DataFrame

wine_review.loc[2]

country                                                       US
description    Mac Watson honors the memory of a wine once ma...
designation                        Special Selected Late Harvest
points                                                        96
price                                                       90.0
province                                              California
region_1                                          Knights Valley
region_2                                                  Sonoma
variety                                          Sauvignon Blanc
winery                                                  Macauley
Name: 2, dtype: object

#### RANGE-BASED INDEXING

In [165]:
# get the fifth to the tenth rows of the DataFrame (labels 4 to 9, end label included)

wine_review.loc[4:9]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
7,Spain,Lush cedary black-fruit aromas are luxe and of...,Carodorum Único Crianza,95,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
8,US,This re-named vineyard was formerly bottled as...,Silice,95,65.0,Oregon,Chehalem Mountains,Willamette Valley,Pinot Noir,Bergström
9,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm


#### LIST-BASED INDEXING

In [137]:
# return the 5th, 10th, 200th and 1001st rows of the DataFrame

wine_review.loc[[4, 9, 199, 1000]]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
9,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm
199,US,Smoke and violet perfume are seductive on this...,Semi-Dry,87,14.0,New York,Finger Lakes,Finger Lakes,Riesling,Heron Hill
1000,Portugal,"This yeasty wine is bone-dry, mature and ready...",Brut Nature Reserva Arinto,85,28.0,Tejo,,,Portuguese Sparkling,Quinta da Lapa


In [134]:
wine_review.loc[0, 'country'] == wine_review.iloc[0, 0]

True

### LOC SECOND POSITIONAL ELEMENT

While the first positional element returns a row-based Series/DataFrame object, the second element narrows it to a column-based value/Series/DataFrame object, depending on how the second positional element is passed

> The first positional element can also be a boolean Series

---

Unlike iloc, where the second position is an integer, the second position of loc is a column label (data index value or attribute), which can come in two forms

- a scalar column label: this will return 
	- a single value, if the first positional element selects one row (a Series object)


	- a Series object holding that column for the selected rows, if the first positional element selects several rows (a DataFrame object)

- a list of column labels: this will return

	- a Series object holding the listed columns of that row, IF the first element selects one row (a Series object)

	- a DataFrame object holding the listed columns of the selected rows, IF the first element selects several rows (a DataFrame object)


---


> A value/Series/DataFrame object is returned depending on the result of the first positional element
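
The four combinations can be tried on a small stand-in table (three rows copied from the data shown earlier). The sketch prints the type of each result to show how the two positional elements together decide what comes back.

```python
import pandas as pd

wines = pd.DataFrame({
    'country': ['US', 'Spain', 'France'],
    'points': [96, 96, 95],
    'province': ['California', 'Northern Spain', 'Provence'],
})

value = wines.loc[0, 'country']                          # scalar row, scalar column
column_part = wines.loc[0:1, 'country']                  # several rows, scalar column
row_part = wines.loc[0, ['country', 'province']]         # scalar row, list of columns
table_part = wines.loc[[0, 2], ['country', 'province']]  # list of rows, list of columns

for result in (value, column_part, row_part, table_part):
    print(type(result).__name__)
```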

#### VALUE, IF the first element returns a Series Object

In [153]:
# get the value of the description column (data index) for the row labelled 147, i.e. the 148th row

wine_review.loc[147, 'description']

'Aromas of white flowers, stone fruits and lees are typical of well-made Albariño. A juicy, elegant palate with a mix of citrus and stone-fruit flavors is lasting, with a hint of pithy character coming up on the finish.'

#### SERIES if the first element returned a DataFrame

In [154]:
# get the series containing the description data index of the first five rows

wine_review.loc[0:4, 'description']

0    This tremendous 100% varietal wine hails from ...
1    Ripe aromas of fig, blackberry and cassis are ...
2    Mac Watson honors the memory of a wine once ma...
3    This spent 20 months in 30% new French oak, an...
4    This is the top wine from La Bégude, named aft...
Name: description, dtype: object

#### SERIES if the first element returns a Series and the second element is a list

In [159]:
# get the series containing the description and province data index of the first row

wine_review.loc[0, ['description', 'province']]

description    This tremendous 100% varietal wine hails from ...
province                                              California
Name: 0, dtype: object

#### DataFrame if the first element returns a DataFrame and the second element is a list

In [None]:
# get the DataFrame containing the description, province and price data index of the 12th, 40th, 144th and 1201st rows

wine_review.loc[[11, 39, 143, 1200], ['description', 'province', 'price']]

Unnamed: 0,description,province,price
11,"From 18-year-old vines, this supple well-balan...",Oregon,48.0
39,"This bright, savory wine delivers aromas and f...",Tuscany,29.0
143,"Estate grown and aged in Stainless steel, this...",Oregon,26.0
1200,"Firm and focused, this lightly spiced effort s...",Oregon,50.0


#### Using the ILOC expression below, we obtain the same result as above

We note how flexible the LOC expression is with respect to the second positional argument. 

Instead of using the integer position of each column as in iloc, the column name is used in loc

In [158]:
wine_review.iloc[[11, 39, 143, 1200], [1, 5, 4]]

Unnamed: 0,description,province,price
11,"From 18-year-old vines, this supple well-balan...",Oregon,48.0
39,"This bright, savory wine delivers aromas and f...",Tuscany,29.0
143,"Estate grown and aged in Stainless steel, this...",Oregon,26.0
1200,"Firm and focused, this lightly spiced effort s...",Oregon,50.0


In [179]:
wine_review[1:5] == wine_review.iloc[1:5]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
1,True,True,True,True,True,True,True,False,True,True
2,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,False,True,True
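
The `False` entries in the comparison above all fall in region_2, on rows where that value is missing. A missing value (NaN) never compares equal to anything, itself included, so `==` reports `False` there even though both sides hold the same data. The sketch below reproduces this on a made-up float column and shows `DataFrame.equals`, which treats NaNs in the same location as equal.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({'price': [235.0, np.nan, 90.0]})

elementwise = prices[0:3] == prices.iloc[0:3]          # NaN == NaN evaluates to False
same_content = prices[0:3].equals(prices.iloc[0:3])    # matching NaNs count as equal

print(elementwise['price'].tolist())
print(same_content)
```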


#### Range-Based Indexing using the second positional element of LOC

It is possible to perform range-based indexing using the second positional argument. 

> the order of the columns in the range is preserved and both endpoints are included

In [190]:
wine_review.loc[10, 'description':'province']

description    Elegance, complexity and structure come togeth...
designation                                   Ronco della Chiesa
points                                                        95
price                                                       80.0
province                                      Northeastern Italy
Name: 10, dtype: object

### Difference Between ILOC and LOC

The indexing scheme used by ILOC is the same as native Python: the end of a range is excluded. LOC on the other hand will include the last label of the range

USING ILOC

In [191]:
# this will return a DataFrame from integer position 0 up to position 4 - position 5 is excluded

wine_review.iloc[0:5]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


USING LOC

In [None]:
# this will return a DataFrame from label 0 up to label 5 - label 5 is NOT excluded

wine_review.loc[0:5]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
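
With the default integer labels the two accessors look alike, which can hide the difference. A sketch with string labels (a hypothetical four-row table) makes the end-point behaviour easier to see.

```python
import pandas as pd

letters = pd.DataFrame({'value': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

by_position = letters.iloc[0:2]    # positions 0 and 1; position 2 is excluded
by_label = letters.loc['a':'c']    # labels 'a' through 'c'; 'c' is included

print(list(by_position.index))
print(list(by_label.index))
```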


### INDEX MANIPULATION
The index can be replaced by any of the columns in the DataFrame

This is done by calling the set_index(|index attribute|) method on the DataFrame, which returns a new DataFrame. It is equivalent to passing the |index_col| argument to the read_|file format| function of pandas.

> the |index_col| argument is only available in some readers, such as read_csv, while set_index(|index attribute|) works on any DataFrame, whatever file format it was loaded from
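
The equivalence can be sketched without a file on disk. Here `io.StringIO` stands in for a CSV file, and the two rows and three column names are illustrative only.

```python
import io

import pandas as pd

# Inline CSV text standing in for a file on disk.
csv_text = "region,country,points\nToro,Spain,96\nBandol,France,95\n"

at_read_time = pd.read_csv(io.StringIO(csv_text), index_col='region')
after_loading = pd.read_csv(io.StringIO(csv_text)).set_index('region')

# set_index returned a new DataFrame; reset_index turns the index back into a column.
restored = after_loading.reset_index()

print(at_read_time.equals(after_loading))
```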

In [193]:
wine_review

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


In [195]:
# index the DataFrame using the region_1 attribute

wine_review.set_index('region_1')

region_1,country,description,designation,points,price,province,region_2,variety,winery
Napa Valley,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa,Cabernet Sauvignon,Heitz
Toro,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,,Tinta de Toro,Bodega Carmen Rodríguez
Knights Valley,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Sonoma,Sauvignon Blanc,Macauley
Willamette Valley,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Pinot Noir,Ponzi
Bandol,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...
Fiano di Avellino,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,,White Blend,Feudi di San Gregorio
Champagne,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,,Champagne Blend,H.Germain
Fiano di Avellino,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,,White Blend,Terredora
Champagne,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,,Champagne Blend,Gosset


## CONDITIONAL SELECTION
While all the previous expressions indexed the DataFrame using the **structural properties** of the DataFrame itself (labels and positions), it is possible to index the DataFrame based on conditions that its values satisfy
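
Behind a conditional selection is a boolean Series, often called a mask. The sketch below uses a small stand-in table with made-up rows: the comparison produces one True/False entry per row, and loc keeps the rows where the entry is True.

```python
import pandas as pd

wines = pd.DataFrame({
    'country': ['US', 'Spain', 'France', 'Spain'],
    'price': [235.0, 110.0, 66.0, 7.0],
})

is_spain = wines.country == 'Spain'   # boolean Series, one entry per row
spanish = wines.loc[is_spain]         # keeps only the rows where the mask is True

print(is_spain.tolist())
print(list(spanish.index))
```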

In [206]:
wine_review.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [255]:
# return the rows of the DataFrame where the country is Spain

wine_review.loc[wine_review.country == 'Spain']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
7,Spain,Lush cedary black-fruit aromas are luxe and of...,Carodorum Único Crianza,95,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
17,Spain,"Nicely oaked blackberry, licorice, vanilla and...",6 Años Reserva Premium,95,80.0,Northern Spain,Ribera del Duero,,Tempranillo,Valduero
...,...,...,...,...,...,...,...,...,...,...
149601,Spain,Toasty oak and tobacco shadings are wrapped ar...,Rioja Crianza,88,15.0,Northern Spain,Rioja,,Tempranillo Blend,Bodegas Faustino
149819,Spain,"This pleasant wine shows clean apple, hay and ...",,86,7.0,Central Spain,Tierra Manchuela,,Viura,Santana
149858,Spain,Dried cherry and spice aromas and flavors acce...,Faustino VII,86,12.0,Northern Spain,Rioja,,Tempranillo,Bodegas Faustino
149936,Spain,"A straightforward basic red, with dried cherry...",,84,7.0,Northern Spain,Rioja,,Tempranillo,Santana


In [258]:
# return the Series containing the provinces of all the rows where the country is Spain

wine_review.loc[wine_review.country == 'Spain', 'province']

1         Northern Spain
5         Northern Spain
6         Northern Spain
7         Northern Spain
17        Northern Spain
               ...      
149601    Northern Spain
149819     Central Spain
149858    Northern Spain
149936    Northern Spain
149993    Northern Spain
Name: province, Length: 8268, dtype: object

In [256]:
# return the DataFrame containing the province and description of every row where the country is Spain

wine_review.loc[wine_review.country == 'Spain', ['province', 'description']]

Unnamed: 0,province,description
1,Northern Spain,"Ripe aromas of fig, blackberry and cassis are ..."
5,Northern Spain,"Deep, dense and pure from the opening bell, th..."
6,Northern Spain,Slightly gritty black-fruit aromas include a s...
7,Northern Spain,Lush cedary black-fruit aromas are luxe and of...
17,Northern Spain,"Nicely oaked blackberry, licorice, vanilla and..."
...,...,...
149601,Northern Spain,Toasty oak and tobacco shadings are wrapped ar...
149819,Central Spain,"This pleasant wine shows clean apple, hay and ..."
149858,Northern Spain,Dried cherry and spice aromas and flavors acce...
149936,Northern Spain,"A straightforward basic red, with dried cherry..."


In [268]:
# return the DataFrame containing province and region_1 of all the rows IF the country is Spain AND the region_1 is NOT Toro AND the price is less than 10 dollars

wine_review.loc[(wine_review.country == 'Spain') & (wine_review.region_1 != 'Toro') & (wine_review.price < 10), ['province', 'region_1']]

Unnamed: 0,province,region_1
1400,Northern Spain,Campo de Borja
1546,Northern Spain,Campo de Borja
2130,Northern Spain,Campo de Borja
3371,Catalonia,Catalunya
3601,Central Spain,Valdepeñas
...,...,...
147983,Northern Spain,Campo de Borja
148039,Levante,Utiel-Requena
149059,Catalonia,Penedès
149819,Central Spain,Tierra Manchuela


In [276]:
# return the rows at positions 1, 10 and 20 of the result above (iloc is zero-based, so these are the 2nd, 11th and 21st rows)

(wine_review.loc[(wine_review.country == 'Spain') & (wine_review.region_1 != 'Toro') & (wine_review.price < 10), ['province', 'region_1']]).iloc[[1, 10, 20]]

Unnamed: 0,province,region_1
1546,Northern Spain,Campo de Borja
4575,Central Spain,Valdepeñas
8330,Northern Spain,Campo de Borja
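
The difference between row labels and row positions is easy to miss after a filter, because the surviving rows keep their original labels. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not part of the wine dataset) shows how `loc` and `iloc` address the same rows:

```python
import pandas as pd

# Toy data with non-consecutive row labels, as a filter would leave behind
toy = pd.DataFrame(
    {"province": ["Northern Spain", "Catalonia", "Levante", "Central Spain"]},
    index=[1400, 1546, 3371, 3601],
)

first_by_position = toy.iloc[0]    # position 0 is the 1st row (label 1400)
second_by_position = toy.iloc[1]   # position 1 is the 2nd row (label 1546)
by_label = toy.loc[1546]           # label 1546 is the same row as position 1
picked = toy.iloc[[1, 3]]          # 2nd and 4th rows; their labels are kept

print(picked)
```

`iloc` counts from zero and ignores the labels, while `loc` looks the labels up, which is why the output above starts at label 1546 rather than 1400.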


In [292]:
# return the DataFrame containing country, description, province, points and region_1 IF the wine is made in Spain AND (its region_1 is Toro OR its rating (points) is above average (>= 90))

wine_review.loc[
  (wine_review.country == 'Spain') & 
  ((wine_review.region_1 == 'Toro') |
  (wine_review.points >= 90)), 
  ['country', 'description', 'province', 'points', 'region_1']
  ]

Unnamed: 0,country,description,province,points,region_1
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Northern Spain,96,Toro
5,Spain,"Deep, dense and pure from the opening bell, th...",Northern Spain,95,Toro
6,Spain,Slightly gritty black-fruit aromas include a s...,Northern Spain,95,Toro
7,Spain,Lush cedary black-fruit aromas are luxe and of...,Northern Spain,95,Toro
17,Spain,"Nicely oaked blackberry, licorice, vanilla and...",Northern Spain,95,Ribera del Duero
...,...,...,...,...,...
149240,Spain,With its dried-cherry and rose-petal aromas an...,Northern Spain,94,Rioja
149242,Spain,"The bouquet of dried cherries, plums, prunes a...",Northern Spain,92,Rioja
149245,Spain,"Deep, sweet dark-cherry fruit forms a solid fo...",Northern Spain,92,Rioja
149247,Spain,"Beautifully balanced, this shows all the posit...",Northern Spain,92,Rioja


### ISIN CONDITIONAL SELECTOR METHOD

Selects rows whose values **are in** a given list of values

In [298]:
# select wines from Spain and US, and return their country, province, price and rating (points)

wine_review.loc[
  wine_review.country.isin(['Spain', 'US']), 
  ['country', 'province', 'price', 'points']]

Unnamed: 0,country,province,price,points
0,US,California,235.0,96
1,Spain,Northern Spain,110.0,96
2,US,California,90.0,96
3,US,Oregon,65.0,96
5,Spain,Northern Spain,73.0,95
...,...,...,...,...
150892,US,California,10.0,82
150896,US,California,10.0,82
150914,US,California,25.0,94
150915,US,California,30.0,93


In [308]:
# select wines from Spain or the US and return country, province, price and points IF the price is above 4 and below 10 dollars (whole-dollar prices 5 to 9) AND the wine is rated above average (>= 90)

wine_review.loc[
  (wine_review.country.isin(['Spain', 'US'])) & 
  (wine_review.price.isin([5, 6, 7, 8, 9])) &
  (wine_review.points >= 90),
  
  ['country', 'province', 'price', 'points']
  ]

Unnamed: 0,country,province,price,points
12347,Spain,Central Spain,9.0,90
15337,US,Washington,9.0,90
15428,US,California,6.0,90
15429,US,California,5.0,90
15438,US,California,8.0,90
...,...,...,...,...
144727,US,Washington,9.0,90
145388,US,California,6.0,90
145389,US,California,5.0,90
145398,US,California,8.0,90


In [303]:
# the same selection can be written without the isin method

wine_review.loc[
  ((wine_review.country == 'Spain') |
   (wine_review.country == 'US')) &
  (wine_review.price > 4) &
  (wine_review.price < 10) &
  (wine_review.points >= 90),
  
  ['country', 'province', 'price', 'points']
]

Unnamed: 0,country,province,price,points
12347,Spain,Central Spain,9.0,90
15337,US,Washington,9.0,90
15428,US,California,6.0,90
15429,US,California,5.0,90
15438,US,California,8.0,90
...,...,...,...,...
144727,US,Washington,9.0,90
145388,US,California,6.0,90
145389,US,California,5.0,90
145398,US,California,8.0,90


Using the **isin** method takes less code than chaining comparisons, as shown above
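
One caveat is worth spelling out: `isin` matches exact values only, so listing `[5, 6, 7, 8, 9]` is interchangeable with a range comparison only when every price is a whole number. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration) to show where the two approaches diverge:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["Spain", "US", "France", "US", "Spain"],
        "price": [9.0, 7.5, 6.0, 12.0, 5.0],
    }
)

in_countries = toy.country.isin(["Spain", "US"])

# isin compares against exact values, so a price of 7.5 is not matched
exact_prices = toy.price.isin([5, 6, 7, 8, 9])

# a range comparison keeps every price strictly between 4 and 10
price_range = (toy.price > 4) & (toy.price < 10)

by_isin = toy.loc[in_countries & exact_prices]
by_range = toy.loc[in_countries & price_range]

print(by_isin)
print(by_range)
```

For membership in a set of categories such as countries, `isin` is the natural fit; for numeric ranges, comparisons state the intent more directly.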

---

#### ISNULL or NOTNULL CONDITIONAL SELECTOR METHOD

These methods select values that are (isnull) or are not (notnull) empty (NaN)
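
Before applying these methods to the wine data, a small self-contained sketch may help. The `toy` frame below is made up for illustration; it shows that `isnull` and `notnull` return complementary boolean masks, which can be counted or passed to `loc`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "winery": ["A", "B", "C", "D"],
        "price": [12.0, np.nan, 30.0, np.nan],
    }
)

missing = toy.price.isnull()    # True where the price is NaN
present = toy.price.notnull()   # the exact opposite mask

n_missing = int(missing.sum())  # True counts as 1 when summed
priced = toy.loc[present]       # keep only the rows that have a price

print(n_missing)
print(priced)
```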

In [306]:
# select the rows whose price is null

wine_review.loc[wine_review.price.isnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
32,Italy,"Underbrush, scorched earth, menthol and plum s...",Vigna Piaggia,90,,Tuscany,Brunello di Montalcino,,Sangiovese,Abbadia Ardenga
56,France,"Delicious while also young and textured, this ...",Le Pavé,90,,Loire Valley,Sancerre,,Sauvignon Blanc,Domaine Vacheron
72,Italy,"This offers aromas of red rose, wild berry, da...",Bussia Riserva,91,,Piedmont,Barolo,,Nebbiolo,Silvano Bolmida
82,Italy,"Berry, baking spice, dried iris, mint and a hi...",Palliano Riserva,91,,Piedmont,Roero,,Nebbiolo,Ceste
116,Spain,Aromas of brandied cherry and crème de cassis ...,Dulce Tinto,86,,Levante,Jumilla,,Monastrell,Casa de la Ermita
...,...,...,...,...,...,...,...,...,...,...
150377,New Zealand,"Light and a bit herbal, like a pleasant St.-Jo...",Matheson,84,,Hawke's Bay,,,Syrah,Matua Valley
150378,New Zealand,"Impressive purple color, but less intense on t...",,84,,Martinborough,,,Syrah,Kusuda
150587,Canada,"Shows pronounced oily, earthy, almost tobacco-...",Icewine,90,,Ontario,Lake Erie North Shore,,Riesling,Colio
150673,US,"Cherry-scented, clean and fruity. Good concent...",,87,,California,Dry Creek Valley,Sonoma,Zinfandel,Taft Street


In [309]:
# select the rows whose province is null
wine_review.loc[wine_review['province'].isnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
1133,,Delicate white flowers and a spin of lemon pee...,Askitikos,90,17.0,,,,Assyrtiko,Tsililis
1440,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Red Blend,Büyülübağ
68226,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas
113016,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas
135696,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas


In [310]:
# select the rows whose region_2 is not null

wine_review.loc[wine_review.region_2.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
8,US,This re-named vineyard was formerly bottled as...,Silice,95,65.0,Oregon,Chehalem Mountains,Willamette Valley,Pinot Noir,Bergström
9,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm
...,...,...,...,...,...,...,...,...,...,...
150892,US,"A light, earthy wine, with violet, berry and t...",Coastal,82,10.0,California,California,California Other,Merlot,Callaway
150896,US,"Some raspberry fruit in the aroma, but things ...",,82,10.0,California,California,California Other,Pinot Noir,Camelot
150914,US,"Old-gold in color, and thick and syrupy. The a...",Late Harvest Cluster Select,94,25.0,California,Anderson Valley,Mendocino/Lake Counties,White Riesling,Navarro
150915,US,"Decades ago, Beringer’s then-winemaker Myron N...",Nightingale,93,30.0,California,North Coast,North Coast,White Blend,Beringer


### ASSIGNING DATA

Data can be assigned to a DataFrame with dictionary-style key/value assignment, as in `df['key'] = value`.
- the key is the label of the column being assigned

- the value supplies the element for each row of that column
---
A value comes in two forms:
- a constant, which is assigned to every row of that column (key)

- an iterable, whose length must match the number of rows of the DataFrame
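
Both forms can be sketched on a tiny made-up frame (`toy`, `reviewed` and the values are assumptions for illustration). The last step shows what happens when a list does not match the number of rows:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["US", "Spain", "France"]})

# a constant is broadcast to every row of the new column
toy["reviewed"] = True

# a list supplies one value per row, in row order
toy["price"] = [235.0, 110.0, 66.0]

# a list of the wrong length is rejected with a ValueError
try:
    toy["points"] = [96, 95]
    length_error = None
except ValueError as err:
    length_error = str(err)

print(toy)
print(length_error)
```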

In [334]:
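# create a new DataFrame from wine_review
# containing the first 10 rows (.loc slices by label and includes the end label 9)
# and the following columns: country, province, price
# .copy() makes wr independent of wine_review, so later assignments do not touch the original

wr = wine_review.loc[0:9, ['country', 'province', 'price']].copy()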
wr

Unnamed: 0,country,province,price
0,US,California,235.0
1,Spain,Northern Spain,110.0
2,US,California,90.0
3,US,Oregon,65.0
4,France,Provence,66.0
5,Spain,Northern Spain,73.0
6,Spain,Northern Spain,65.0
7,Spain,Northern Spain,110.0
8,US,Oregon,65.0
9,US,California,60.0


In [339]:
# assign a new column named continent
# the values of the new column should be
# North America if the country is US, or Europe otherwise
# (every other country in these 10 rows is European)

# a constant is assigned to every row; 'West' is only a placeholder
wr['continent'] = 'West'

# items() yields (row label, value) pairs, so i is always a valid label for .loc
for i, ctry in wr.country.items():
  if ctry == 'US':
    # chained indexing runs as two separate operations (get, then set)
    # and may write to a temporary copy:
    # wr['continent'][i] = 'North America'

    # .loc selects and assigns in a single operation, hence better
    wr.loc[i, 'continent'] = 'North America'
  else:
    wr.loc[i, 'continent'] = 'Europe'

wr

Unnamed: 0,country,province,price,continent
0,US,California,235.0,North America
1,Spain,Northern Spain,110.0,Europe
2,US,California,90.0,North America
3,US,Oregon,65.0,North America
4,France,Provence,66.0,Europe
5,Spain,Northern Spain,73.0,Europe
6,Spain,Northern Spain,65.0,Europe
7,Spain,Northern Spain,110.0,Europe
8,US,Oregon,65.0,North America
9,US,California,60.0,North America
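
The loop above works, but the same column can usually be built without iterating row by row. One possible alternative, sketched on a made-up frame, is to look each country up in a dictionary with `Series.map`. The names `wr_toy` and `continent_of` are hypothetical and the lookup table only covers the countries shown here:

```python
import pandas as pd

wr_toy = pd.DataFrame({"country": ["US", "Spain", "US", "France"]})

# hypothetical lookup table; extend it for other countries as needed
continent_of = {
    "US": "North America",
    "Spain": "Europe",
    "France": "Europe",
}

# map replaces each country with its dictionary value;
# a country missing from the dictionary would become NaN
wr_toy["continent"] = wr_toy["country"].map(continent_of)

print(wr_toy)
```

Because the whole column is assigned at once, this form involves no chained indexing and no placeholder value.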
