---
title: Indexing, Selecting and Assigning Values
tags: [jupyter]
keywords: pandas
summary: "Indexing, selecting and assigning in pandas."
mlType: dataFrame
infoType: pandas
sidebar: pandas_sidebar
permalink: __AutoGenThis__
notebookfilename:  __AutoGenThis__
---

In [2]:
import sys

sys.path.append("../")

In [3]:
import pandas as pd
from pprint import pprint

# Pandas Options

In [15]:
pd.set_option('display.max_rows', 5)

# I/O

In [14]:
reviews = pd.read_csv("../data/winemag-data-130k-v2.csv", index_col=0)

In [16]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Naive Accessors

## Accessing Columns

You can access a column either with the **dot** operator or with **dictionary-style** bracket access.

In [17]:
reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In [18]:
reviews["country"]

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

You can treat a pandas DataFrame like a **fancy dictionary** and use many of the dictionary operations, such as accessing a particular element in a column.

In [19]:
reviews["country"][1]

'Portugal'
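
The two access styles are not always interchangeable. As a minimal sketch on a hypothetical toy frame (the column names `taster name` and `count` are made up for illustration), dot access fails when the column name is not a valid Python identifier and is shadowed when the name clashes with an existing DataFrame attribute, while bracket access keeps working:

```python
import pandas as pd

# Hypothetical toy frame; these column names are illustrative only
wines = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "taster name": ["Kerin", "Roger"],   # contains a space
    "count": [1, 2],                     # clashes with DataFrame.count()
})

# Bracket access works for any column name
names = wines["taster name"]
counts = wines["count"]

# Dot access returns the DataFrame.count method here, not the column
shadowed = wines.count
print(callable(shadowed))  # True
```

For that reason bracket access is the safer default in code that has to work with arbitrary column names.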

# Indexing in Pandas

**loc** and **iloc** are the accessors you are supposed to use to access data in pandas.

## Index-based selection

Pandas indexing works in one of two paradigms.  The first is **index-based selection**, where you select data based on its numerical position in the data.  This is **iloc**.

For instance, to select the first row of data in a DataFrame we do the following:

In [20]:
row = 0
reviews.iloc[row]

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

Both **loc** and **iloc** take the row first and the column second, which is the opposite of the plain bracket access used earlier (`reviews["country"][1]`), where the column comes first.  This makes it easy to retrieve a row but slightly harder to retrieve a column.  To access an entire column we use a MATLAB-like `:` in the row position.

In [22]:
# note that 0 is the country column
col=0
reviews.iloc[:,col]

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

Also as in MATLAB, we can take a slice of rows from that column using the same notation:

In [23]:
reviews.iloc[3:6,col]

3       US
4       US
5    Spain
Name: country, dtype: object

We can also select a **list of rows** like this

In [24]:
rows=[2,4,6]
reviews.iloc[rows,col]

2       US
4       US
6    Italy
Name: country, dtype: object

## Label-based selection

The second paradigm for attribute selection is the one followed by the **loc** operator: **label-based selection**.  Here it is the value of the data index (the label) that matters, not its position.

In [25]:
col = 'country'
reviews.loc[0,col]

'Italy'

Notice that we did not get a row but rather just a single value.  **iloc** is conceptually simpler than **loc** because it ignores the dataset indices.  When we use **iloc** we treat the dataset like a **big matrix**, one that we have to index into by position.  **loc**, by contrast, uses the information in the indices to do its work.  Since your dataset usually has meaningful indices, it's usually easier to do things using **loc**.

For instance, suppose we do not know the column positions of:

- taster_name
- taster_twitter_handle
- points

Instead of looping through the columns to find their positions and then using **iloc**, you can just use **loc** with the column names.

In [29]:
colList = ['taster_name', 'taster_twitter_handle', 'points']
reviews.loc[1:3, colList]

Unnamed: 0,taster_name,taster_twitter_handle,points
1,Roger Voss,@vossroger,87
2,Paul Gregutt,@paulgwine,87
3,Alexander Peartree,,87


## WARNING

There is one "gotcha" worth keeping in mind: the two methods use slightly different indexing schemes.

- **indexing**
    - **iloc** uses the Python stdlib indexing scheme, where the first element of the range is included and the last one is excluded. So **0:10** will select **0,1,...,9**, while
    - **loc** indexes inclusively, so **0:10** will select **0,1,...,10**
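
The difference is easiest to see on a small frame. This is a minimal sketch on a hypothetical five-row DataFrame (not the wine data), showing that the same slice returns a different number of rows under **iloc** and **loc**, and that the inclusive behavior of **loc** also applies to string labels:

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
df = pd.DataFrame({"points": [87, 88, 90, 91, 85]})

by_position = df.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position))  # 3
print(len(by_label))     # 4

# With string labels, loc still includes the end label
df_str = pd.DataFrame({"points": [87, 88, 90, 91, 85]}, index=list("abcde"))
a_to_c = df_str.loc["a":"c"]  # rows a, b and c
```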

# Manipulating Indices

We can set the index to whatever we want, even to the values of another column.

In [31]:
reviews.set_index("title")

Unnamed: 0_level_0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,variety,winery
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Nicosia 2013 Vulkà Bianco (Etna),Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,White Blend,Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro),Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace),France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Pinot Gris,Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace),France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Gewürztraminer,Domaine Schoffit
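
Note that `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on a hypothetical toy frame (the wine titles here are made up) shows this, together with label lookup on the new index and `reset_index` to undo it:

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews data
wines = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [87, 90, 92],
})

# set_index returns a new DataFrame; wines itself is unchanged
indexed = wines.set_index("title")
print(list(wines.columns))               # ['title', 'points']

# The title values can now be used as labels with loc
print(indexed.loc["Wine B", "points"])   # 90

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
```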


# Conditional Selection

We often need to ask questions based on conditions. For example, suppose that we're interested specifically in better-than-average wines produced in Italy.  We first need to identify the wines made in Italy.

In [32]:
reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

This produces a boolean Series, which we can pass to **loc** to select all the matching rows.

In [33]:
reviews.loc[reviews.country=='Italy']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano


Since we want wines that are better than average, and since the reviews range from 80 to 100 points, we want the rows with 90 points or more.  We can combine conditions in our **loc** search.

In [35]:
reviews.loc[(reviews.country=='Italy') & (reviews.points >=90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano
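
Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `==` or `>=`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
wines = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 95, 85, 88],
})

is_italian = wines.country == "Italy"
is_high = wines.points >= 90

both = wines.loc[is_italian & is_high]        # Italian AND 90+
either = wines.loc[is_italian | is_high]      # Italian OR 90+
neither = wines.loc[~(is_italian | is_high)]  # everything else

print(list(both.index))     # [0]
print(list(either.index))   # [0, 1, 2]
print(list(neither.index))  # [3]
```

Naming the intermediate masks, as done here, tends to keep longer filters readable.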


## Built-in conditional selectors (isin, isnull)

### isin

This function lets you select data whose value **is in** a list of values.  For example, here's how we can use it to select only the wines from Italy or France:

In [37]:
listOfCountries = ['Italy','France']
reviews.loc[reviews.country.isin(listOfCountries)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


### isnull

The second is **isnull** (and its companion **notnull**). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do

In [38]:
reviews.loc[reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
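
The two methods are complements of each other. As a minimal sketch on a hypothetical three-row frame with one missing price, `notnull` keeps the priced wines and `isnull` keeps the unpriced ones:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the first wine has no price
wines = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "price": [np.nan, 15.0, 14.0],
})

priced = wines.loc[wines.price.notnull()]    # rows with a price
unpriced = wines.loc[wines.price.isnull()]   # rows where price is NaN

print(len(priced))             # 2
print(list(unpriced.title))    # ['Wine A']
```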


# Multi-Layer Indexing

I followed [this](https://www.youtube.com/watch?v=tcRGa2soc-c) YouTube video for examples of multi-layer indexing.

In [4]:
stocks = pd.read_csv('http://bit.ly/smallstocks')

In [5]:
stocks.head()

Unnamed: 0,Date,Close,Volume,Symbol
0,2016-10-03,31.5,14070500,CSCO
1,2016-10-03,112.52,21701800,AAPL
2,2016-10-03,57.42,19189500,MSFT
3,2016-10-04,113.0,29736800,AAPL
4,2016-10-04,57.24,20085900,MSFT


## Pivot table to create multi-layer indexing

In [11]:
stocks.pivot_table(values=['Close','Volume'],index=['Symbol','Date'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Close,Volume
Symbol,Date,Unnamed: 2_level_1,Unnamed: 3_level_1
AAPL,2016-10-03,112.52,21701800
AAPL,2016-10-04,113.0,29736800
AAPL,2016-10-05,113.05,21453100
CSCO,2016-10-03,31.5,14070500
CSCO,2016-10-04,31.35,18460400
CSCO,2016-10-05,31.59,11808600
MSFT,2016-10-03,57.42,19189500
MSFT,2016-10-04,57.24,20085900
MSFT,2016-10-05,57.64,16726400


We need to use ```loc``` to access both the index and column.  Notice that it will be in this form

.loc[**(Tuple outer then inner of index)**,**(Tuple outer then inner of columns)**]

OR

.loc[**(List outer, list inner)**,**column list**]

In [12]:
stocksNew = stocks.pivot_table(values=['Close','Volume'],index=['Symbol','Date'])
stocksNew.loc[(['AAPL','MSFT'],'2016-10-03'),'Close']

Symbol  Date      
AAPL    2016-10-03    112.52
MSFT    2016-10-03     57.42
Name: Close, dtype: float64

In [13]:
stocksNew.loc[(['AAPL','MSFT'],['2016-10-03','2016-10-04']),'Close']

Symbol  Date      
AAPL    2016-10-03    112.52
        2016-10-04    113.00
MSFT    2016-10-03     57.42
        2016-10-04     57.24
Name: Close, dtype: float64

What happens when we want to look at all columns, or at every value of one index level?

**WE CAN'T USE ```:``` INSIDE THE TUPLE**

In [14]:
stocksNew.loc[(:,['2016-10-03','2016-10-04']),:]

SyntaxError: invalid syntax (<ipython-input-14-9f95907fc13f>, line 1)

Instead you have to use ```slice(None)```
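
A minimal sketch of that fix, using a hypothetical four-row subset of the stocks data built inline rather than downloaded. `slice(None)` stands in for `:` inside the tuple, and `pd.IndexSlice` is an alternative that allows the `:` syntax again:

```python
import pandas as pd

# Hypothetical inline subset of the stocks data shown above
stocks = pd.DataFrame({
    "Date": ["2016-10-03", "2016-10-03", "2016-10-04", "2016-10-04"],
    "Close": [112.52, 57.42, 113.00, 57.24],
    "Symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
})
stocks_new = stocks.pivot_table(values=["Close"], index=["Symbol", "Date"])

# slice(None) selects every Symbol; the date picks one inner label
first_day = stocks_new.loc[(slice(None), "2016-10-03"), :]
print(list(first_day["Close"]))  # [112.52, 57.42]

# pd.IndexSlice builds the same selection with ':' syntax
idx = pd.IndexSlice
first_day_idx = stocks_new.loc[idx[:, "2016-10-03"], :]
```

Both forms rely on the index being sorted, which `pivot_table` takes care of here.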