# Pandas
- Solve short hands-on challenges to perfect your data manipulation skills.
- https://www.kaggle.com/learn/pandas

## Indexing, Selecting & Assigning

In [1]:
import pandas as pd
reviews = pd.read_csv('Red.csv')
print('Default display.max_rows:', pd.get_option('display.max_rows'))
# Default display.max_rows: 60


Default display.max_rows: 60


In [2]:
pd.set_option('display.max_rows', 7)
print('Current display.max_rows:', pd.get_option('display.max_rows'))

Current display.max_rows: 7


## Native accessors
Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

In [3]:
reviews

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
...,...,...,...,...,...,...,...,...
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019
8665,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016


In [4]:
# In Python, we can access the property of an object by accessing it as an attribute
reviews.Country

0          France
1          France
2           Italy
          ...    
8663       France
8664    Australia
8665    Argentina
Name: Country, Length: 8666, dtype: object

In [5]:
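# A DataFrame column can also be accessed like a value in a Python dictionary,
# using the indexing operator ([]):
reviews['Country']
# Brackets are the only option when a column name contains spaces or other
# reserved characters: for a hypothetical column, reviews['Country Province']
# works, while reviews.Country Province is a syntax error.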

0          France
1          France
2           Italy
          ...    
8663       France
8664    Australia
8665    Argentina
Name: Country, Length: 8666, dtype: object

In [6]:
# A pandas Series behaves much like a dictionary {key: value}, so indexing once more returns a single value:
reviews['Country'][0]

'France'
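
The three access styles above can be compared on a tiny hand-made DataFrame. This is a minimal sketch: the `wines` frame and its `Country Province` column are made up for illustration and are not part of Red.csv.

```python
import pandas as pd

# Hypothetical toy data; 'Country Province' exists only to show a name with a space.
wines = pd.DataFrame({
    'Country': ['France', 'Italy'],
    'Country Province': ['Bordeaux', 'Toscana'],
})

by_attribute = wines.Country            # attribute access
by_brackets = wines['Country']          # dictionary-style access
province = wines['Country Province']    # a name with a space needs brackets
first_country = wines['Country'][0]     # index the resulting Series once more

print(by_attribute.equals(by_brackets))  # True
print(first_country)                     # France
```

Attribute and bracket access return the same Series; brackets are simply the more general of the two.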

## Intrinsic Indexing in pandas
Pandas has its own accessor operators, iloc and loc. For more advanced operations, these are the ones you're supposed to be using.

In [7]:
# Index-based selection: select the first row by its position
#type(reviews.iloc[0])   # pandas.core.series.Series
reviews.iloc[0]


Name               Pomerol 2011
Country                  France
Region                  Pomerol
                       ...     
NumberOfRatings             100
Price                      95.0
Year                       2011
Name: 0, Length: 8, dtype: object

In [8]:
reviews.head(1)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011


In [9]:
# Native Python
from pprint import pprint
lst_3x2 = [[1, 2], [4, 5], [7, 8]]
print(lst_3x2)
pprint(lst_3x2)
for row in lst_3x2:
    print(row)
print('last_1stRow:', lst_3x2[0][1])
print('first_3rdRow:', lst_3x2[2][0])

[[1, 2], [4, 5], [7, 8]]
[[1, 2], [4, 5], [7, 8]]
[1, 2]
[4, 5]
[7, 8]
last_1stRow: 2
first_3rdRow: 7


In [10]:
# iloc is row-first, column-second, so it is marginally easier to retrieve rows
# and marginally harder to retrieve columns. To get the second column (Country)
# with iloc, we can do the following:
reviews.iloc[:, 1]

0          France
1          France
2           Italy
          ...    
8663       France
8664    Australia
8665    Argentina
Name: Country, Length: 8666, dtype: object

In [11]:
# To select the Country column from just the first, second, and third rows:
reviews.iloc[:3, 1]

0    France
1    France
2     Italy
Name: Country, dtype: object

In [12]:
# Or to select the rows at positions 2, 3, and 4 of the Country column:
reviews.iloc[2:5, 1]

2      Italy
3      Italy
4    Austria
Name: Country, dtype: object

In [13]:
# It's also possible to pass a list of positions:
display(reviews.iloc[[2, 3, 4], 1])
# A list is useful for selecting non-consecutive rows (2, 5, 12):
reviews.iloc[[2, 5, 12], 1]


2      Italy
3      Italy
4    Austria
Name: Country, dtype: object

2      Italy
5     France
12    France
Name: Country, dtype: object

In [14]:
# Finally, it's worth knowing that negative numbers can be used in selection.
# They count backwards from the end of the values.
# The last five rows of the dataset:
reviews.iloc[-5:]

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
8661,6th Sense Syrah 2016,United States,Lodi,Michael David Winery,3.8,994,16.47,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019
8665,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016


In [15]:
# The last five entries of the Country column
reviews.iloc[-5:, 1]

8661    United States
8662            Italy
8663           France
8664        Australia
8665        Argentina
Name: Country, dtype: object
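
The iloc patterns above (single row, slice, list of positions, negative positions) can be collected in one place. This is a minimal sketch on a made-up six-row frame; the `toy` name and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from Red.csv.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Country': ['France', 'France', 'Italy', 'Italy', 'Austria', 'Spain'],
})

first_row = toy.iloc[0]            # Series holding the first row
first_three = toy.iloc[:3, 1]      # positions 0, 1, 2 of column 1 (Country)
picked = toy.iloc[[0, 2, 5], 1]    # arbitrary positions via a list
last_two = toy.iloc[-2:, 1]        # negative positions count from the end

print(list(first_three))  # ['France', 'France', 'Italy']
print(list(picked))       # ['France', 'Italy', 'Spain']
print(list(last_two))     # ['Austria', 'Spain']
```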

## Label-based selection
The second paradigm is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

In [16]:
# First entry in the Country column
reviews.loc[0, 'Country']

'France'


iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead.
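
The difference is easiest to see when the index is not the default 0, 1, 2, ... sequence. The sketch below builds a small frame indexed by wine name (the `toy` frame is an assumption for illustration; its three names and ratings are the first rows shown earlier) and reads the same cell both ways.

```python
import pandas as pd

# Toy frame with meaningful index labels instead of 0, 1, 2.
toy = pd.DataFrame(
    {'Rating': [4.2, 4.3, 3.9]},
    index=['Pomerol 2011', 'Lirac 2017', 'Erta e China Rosso di Toscana 2015'],
)

by_position = toy.iloc[1, 0]                # second row, first column
by_label = toy.loc['Lirac 2017', 'Rating']  # row and column picked by label

print(by_position, by_label)  # 4.3 4.3
```

With iloc the reader has to know that "Lirac 2017" sits at position 1; with loc the label says so directly.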

In [17]:
# All elements of 3 columns
reviews.loc[:, ['Name', 'Rating', 'Price']]

Unnamed: 0,Name,Rating,Price
0,Pomerol 2011,4.2,95.00
1,Lirac 2017,4.3,15.50
2,Erta e China Rosso di Toscana 2015,3.9,7.45
...,...,...,...
8663,Haut-Médoc 2010,3.7,23.95
8664,Shiraz 2019,3.5,6.21
8665,Portillo Cabernet Sauvignon 2016,3.4,7.88


In [18]:
reviews[['Name', 'Rating', 'Price']]

Unnamed: 0,Name,Rating,Price
0,Pomerol 2011,4.2,95.00
1,Lirac 2017,4.3,15.50
2,Erta e China Rosso di Toscana 2015,3.9,7.45
...,...,...,...
8663,Haut-Médoc 2010,3.7,23.95
8664,Shiraz 2019,3.5,6.21
8665,Portillo Cabernet Sauvignon 2016,3.4,7.88



#### Choosing between loc and iloc
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind: the two operators use slightly different indexing schemes.
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
Otherwise, the semantics of using loc are the same as those for iloc.
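
The counts above can be checked directly. This is a small sketch using two made-up frames (`numbers` and `produce` are hypothetical names, not part of the wine dataset): one with the integer index 0,...,1000 and one with a sorted string index.

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(1001)})   # index labels 0..1000

n_iloc = len(numbers.iloc[0:1000])        # end position excluded
n_loc = len(numbers.loc[0:1000])          # end label included
n_loc_trimmed = len(numbers.loc[0:999])   # one lower to get 1000 rows

# Hypothetical string-indexed frame for the Apples/Potatoes case.
produce = pd.DataFrame(
    {'stock': [3, 5, 2, 7]},
    index=['Apples', 'Bananas', 'Potatoes', 'Radishes'],
)
up_to_potatoes = produce.loc['Apples':'Potatoes']

print(n_iloc, n_loc, n_loc_trimmed)   # 1000 1001 1000
print(list(up_to_potatoes.index))     # ['Apples', 'Bananas', 'Potatoes']
```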


In [21]:
### Manipulating the index
display(reviews)
reviews.set_index('Name')
# set_index returns a new DataFrame; reviews itself is unchanged unless we
# reassign the result or pass the inplace=True parameter

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
...,...,...,...,...,...,...,...,...
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019
8665,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016


Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
...,...,...,...,...,...,...,...
Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019
Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016
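
That set_index leaves the original frame untouched can be verified on a small example. This is a sketch under the assumption of a two-row toy frame (`toy` and `indexed` are illustrative names); once the result is stored, loc can look rows up by wine name.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Pomerol 2011', 'Lirac 2017'],
    'Rating': [4.2, 4.3],
})

indexed = toy.set_index('Name')   # returns a new DataFrame

print(list(toy.columns))                     # ['Name', 'Rating'], original unchanged
print(indexed.index.name)                    # Name
print(indexed.loc['Lirac 2017', 'Rating'])   # 4.3
```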


### Conditional selection
- To do interesting things with the data, however, we often need to ask questions based on conditions. For example, suppose that we're interested specifically in better-than-average wines produced in Italy.

In [23]:
# We can start by checking if each wine is Italian or not:
reviews.Country == 'Italy'

0       False
1       False
2        True
        ...  
8663    False
8664    False
8665    False
Name: Country, Length: 8666, dtype: bool

In [26]:
# This boolean Series can be used inside loc to select the relevant data:
reviews.loc[reviews.Country == 'Italy']
# Selects all the rows that have 'Italy' in the Country column


Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
3,Bardolino 2019,Italy,Bardolino,Cavalchina,3.5,100,8.72,2019
8,Chianti 2015,Italy,Chianti,Castello Montaùto,3.6,100,10.75,2015
...,...,...,...,...,...,...,...,...
8654,Scipio 2013,Italy,Toscana,Tenuta dei Sette Cieli,4.3,99,63.39,2013
8658,Bricco dell'Uccellone Barbera d'Asti 2016,Italy,Barbera d'Asti,Braida,4.3,990,52.00,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016


In [30]:
reviews.Rating.describe()
print(reviews['Rating'].min(), reviews.Rating.mean(), reviews.Rating.max())

2.5 3.8903415647357487 4.8


In [31]:
# We also want to know which Italian wines are rated at or above the average:
reviews.loc[(reviews.Country == 'Italy') &
            (reviews['Rating'] >= reviews.Rating.mean())]

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
10,Chianti Riserva 2013,Italy,Chianti,Poggiotondo,3.9,100,20.95,2013
19,Fulgeo Negroamaro Salento 2016,Italy,Salento,San Donaci,4.0,100,12.90,2016
...,...,...,...,...,...,...,...,...
8654,Scipio 2013,Italy,Toscana,Tenuta dei Sette Cieli,4.3,99,63.39,2013
8658,Bricco dell'Uccellone Barbera d'Asti 2016,Italy,Barbera d'Asti,Braida,4.3,990,52.00,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016


In [32]:
# Wines that are from Italy OR rated at or above the average (more rows than before):
reviews.loc[(reviews['Country'] == 'Italy') |
            (reviews['Rating'] >= reviews.Rating.mean())]

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
...,...,...,...,...,...,...,...,...
8658,Bricco dell'Uccellone Barbera d'Asti 2016,Italy,Barbera d'Asti,Braida,4.3,990,52.00,2016
8660,Bishop Shiraz 2016,Australia,Barossa Valley,Glaetzer,4.1,994,30.39,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016
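
The same AND/OR logic is easier to follow on a few rows. The sketch below uses a made-up four-row frame (`toy` and the mask names are assumptions for illustration) and stores each condition in its own variable. When conditions are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Hypothetical toy data, not taken from Red.csv.
toy = pd.DataFrame({
    'Country': ['France', 'Italy', 'Italy', 'Spain'],
    'Rating': [4.2, 3.9, 3.5, 3.4],
})

is_italian = toy['Country'] == 'Italy'
above_mean = toy['Rating'] >= toy['Rating'].mean()   # mean is 3.75

both = toy.loc[is_italian & above_mean]     # Italian AND at or above average
either = toy.loc[is_italian | above_mean]   # Italian OR at or above average

print(list(both.index))    # [1]
print(list(either.index))  # [0, 1, 2]
```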


Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin. isin lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:

In [33]:
reviews.loc[reviews.Country.isin(['Italy', 'France'])]

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
...,...,...,...,...,...,...,...,...
8659,Bordeaux Rouge 2016,France,Bordeaux,Mouton Cadet,3.5,9926,9.25,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
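
isin can be read as shorthand for a chain of equality checks joined with `|`. A minimal sketch on made-up data (the `toy` frame is hypothetical) showing that both forms select the same rows:

```python
import pandas as pd

toy = pd.DataFrame({'Country': ['France', 'Italy', 'Spain', 'Italy']})

with_isin = toy.loc[toy['Country'].isin(['Italy', 'France'])]
with_or = toy.loc[(toy['Country'] == 'Italy') | (toy['Country'] == 'France')]

print(with_isin.equals(with_or))  # True
print(list(with_isin.index))      # [0, 1, 3]
```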


The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN). For example, to find wines lacking a price tag in the dataset, here's what we would do (this dataset has none, so the result is empty):

In [36]:
reviews.loc[reviews.Price.isnull()]

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
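
Because Red.csv has no missing prices, the result above is empty. To see isnull and notnull select something, here is a sketch on a made-up frame with one missing price (the `toy` frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Price': [95.0, np.nan, 7.45],
})

missing = toy.loc[toy['Price'].isnull()]    # rows without a price
priced = toy.loc[toy['Price'].notnull()]    # rows with a price

print(list(missing['Name']))  # ['b']
print(list(priced['Name']))   # ['a', 'c']
```

isnull keeps the rows with a missing value, while notnull is the one to use for filtering them out.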


### Assigning data
- Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

In [38]:
reviews.NumberOfRatings = -99
reviews['NumberOfRatings']

0      -99
1      -99
2      -99
        ..
8663   -99
8664   -99
8665   -99
Name: NumberOfRatings, Length: 8666, dtype: int64

In [39]:
reviews

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,-99,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,-99,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,-99,7.45,2015
...,...,...,...,...,...,...,...,...
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,-99,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,-99,6.21,2019
8665,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,-99,7.88,2016


In [42]:
# Or assign an iterable of values, one per row:
print(len(reviews), reviews.shape[0])
reviews['NumberOfRatings'] = range(len(reviews), 0, -1)
reviews.NumberOfRatings

8666 8666


0       8666
1       8665
2       8664
        ... 
8663       3
8664       2
8665       1
Name: NumberOfRatings, Length: 8666, dtype: int64
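
Both kinds of assignment can be tried on a small frame without overwriting real data. This is a minimal sketch with made-up columns (`Flag` and `Countdown` are hypothetical names). It uses bracket assignment because attribute assignment such as `reviews.NumberOfRatings = -99` only updates a column that already exists and does not create a new one.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['a', 'b', 'c']})

toy['Flag'] = -99                           # constant, broadcast to every row
toy['Countdown'] = range(len(toy), 0, -1)   # iterable, one value per row

print(list(toy['Flag']))       # [-99, -99, -99]
print(list(toy['Countdown']))  # [3, 2, 1]
```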