# **Indexing, Selecting & Assigning**

### Introduction
Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn when working with data in Python is how to select the data points relevant to you quickly and effectively.

In [1]:
import pandas as pd
reviews = pd.read_csv("../data/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

### Native Accessors
Native Python objects provide good ways of indexing data.

In [2]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


In Python, we can access a property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way.

Hence, to access the `country` property of `reviews`, we can use:

In [3]:
reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

If we have a Python dictionary, we can access its values using the indexing (`[]`) operator. We can do the same with the columns in a DataFrame.

In [4]:
reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

These are two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator `[]` does have the advantage that it can handle column names with reserved characters in them (if we had a `country providence` column, `reviews.country providence` would not work).
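As a minimal sketch of that limitation, the snippet below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the wine dataset). It checks that both spellings return the same Series for a simple name, and that the dotted form with a space is not even valid Python syntax.

```python
import pandas as pd

# Hypothetical toy data; not part of the wine reviews dataset.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "country providence": ["Sicily", "Douro"],
})

# For a simple column name, both spellings return the same Series.
same_series = toy.country.equals(toy["country"])

# A column name containing a space is reachable through [].
providence = toy["country providence"]

# The dotted spelling cannot be parsed at all: Python rejects the
# expression before pandas ever sees it.
try:
    compile("toy.country providence", "<example>", "eval")
    dotted_form_parses = True
except SyntaxError:
    dotted_form_parses = False
```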

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [5]:
reviews['country'][0]

'Italy'
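The dictionary analogy can be sketched on a small made-up Series (the `countries` Series below is an assumption standing in for `reviews['country']`). With the default integer index, the index labels play the role of dictionary keys.

```python
import pandas as pd

# Hypothetical toy Series standing in for reviews['country'].
countries = pd.Series(["Italy", "Portugal", "US"], name="country")

first = countries[0]               # look up by index label, like a dict key
has_key = 0 in countries           # `in` tests the index labels...
has_value = "Italy" in countries   # ...not the stored values
as_dict = countries.to_dict()      # index labels become the dict keys
```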

### Indexing in Pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using. 

#### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm. 

In [15]:
reviews.iloc[1] # retrieving the second row (position 1)

country                                                 Portugal
description    This is ripe and fruity, a wine that is smooth...
                                     ...                        
variety                                           Portuguese Red
winery                                       Quinta dos Avidagos
Name: 1, Length: 13, dtype: object

**Both `loc` and `iloc` are row-first, column-second**. This is the opposite of what we do in native Python, which is column-first, row-second. 
This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [16]:
reviews.iloc[:,0] # retrieving all rows of the first [0] column

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object
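The row-first, column-second convention can be tried out on a small made-up frame (the `toy` DataFrame here is an assumption for illustration, not the wine dataset):

```python
import pandas as pd

# Hypothetical toy data with two columns and three rows.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 86],
})

second_row = toy.iloc[1]        # one row: a Series indexed by column name
first_column = toy.iloc[:, 0]   # one column: a Series indexed by row label
single_value = toy.iloc[1, 0]   # row position 1, then column position 0
```

A single integer selects a row; a column needs the leading `:` to say "all rows" first.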

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the `country` column from just the first, second, and third rows, we would do:

In [17]:
reviews.iloc[:3,0] # retrieving the first 3 rows of the first [0] column

0       Italy
1    Portugal
2          US
Name: country, dtype: object

In [18]:
reviews.iloc[1:3, 0] # retrieving 2nd and 3rd [1, 2] rows of the first column.

1    Portugal
2          US
Name: country, dtype: object

In [19]:
reviews.iloc[[0,1,2],0] # another way to retrieve the 1st, 2nd, and 3rd rows of the first [0] column

0       Italy
1    Portugal
2          US
Name: country, dtype: object
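To check that a slice and an explicit list of positions select the same rows, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data; only the first column matters here.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "France"],
    "points": [87, 87, 86, 90],
})

by_slice = toy.iloc[:3, 0]         # positions 0, 1, 2 (3 is excluded)
by_list = toy.iloc[[0, 1, 2], 0]   # the same positions, spelled out
middle = toy.iloc[1:3, 0]          # positions 1 and 2

slice_matches_list = by_slice.equals(by_list)
```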

Finally, it's worth knowing that negative numbers can be used in selection. A negative position counts backwards from the end of the values, so `-5:` starts five rows from the end and runs forwards to the last row.
So, for example, here are the last five elements of the dataset.

In [26]:
reviews.iloc[-5:,0] # retrieving last 5 elements from the first [0] column

129966    Germany
129967         US
129968     France
129969     France
129970     France
Name: country, dtype: object
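The same negative-position behaviour can be seen on a small made-up frame (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data with five rows.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "France", "Germany"],
})

last_two = toy.iloc[-2:, 0]   # start two rows from the end, run to the end
last_row = toy.iloc[-1]       # the final row on its own
```

Note that the result keeps the original row labels (here 3 and 4); only the selection is done by position.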

#### Label based selection
The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**.
In this paradigm, it's the data index value, not its position, which matters.

In [27]:
reviews.loc[0, 'country']

'Italy'

`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we use `iloc`, we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. `loc`, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead.

In [28]:
reviews.loc[:, ['taster_name','taster_twitter_handle', 'points']]

Unnamed: 0,taster_name,taster_twitter_handle,points
0,Kerin O’Keefe,@kerinokeefe,87
1,Roger Voss,@vossroger,87
...,...,...,...
129969,Roger Voss,@vossroger,90
129970,Roger Voss,@vossroger,90
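The difference between position and label is easiest to see when the two disagree. The sketch below assumes a made-up frame whose index labels (10, 20, 30) are deliberately different from the row positions (0, 1, 2):

```python
import pandas as pd

# Hypothetical toy data; the index labels are chosen to differ from positions.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 86]},
    index=[10, 20, 30],
)

by_position = toy.iloc[0, 0]        # first row, first column
by_label = toy.loc[10, "country"]   # row labelled 10, column "country"

# No row is labelled 0, so loc raises KeyError where iloc[0] succeeds.
try:
    toy.loc[0, "country"]
    label_zero_found = True
except KeyError:
    label_zero_found = False
```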


#### Choosing between `loc` and `iloc`
When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that the two accessors use slightly different indexing schemes.

`iloc` uses the Python `stdlib` indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.
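A quick way to see the off-by-one difference is to count what each slice returns. The sketch below assumes a made-up Series with the default integer index, so labels and positions coincide:

```python
import pandas as pd

# Hypothetical toy Series with default index labels 0..19.
numbers = pd.Series(range(100, 120))

by_position = numbers.iloc[0:10]   # positions 0..9, end excluded
by_label = numbers.loc[0:10]       # labels 0..10, end included

n_position = len(by_position)
n_label = len(by_label)
```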

Why the change? Remember that `loc` can index any `stdlib` type: strings, for example. If we have a DataFrame with index values `Apples,...,Potatoes,...`, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index:
```python
df.loc['Apples':'Potatoes']
```
than it is to index something like
```python
df.loc['Apples':'Potatoet'] # t comes after s in the alphabet
```
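As a runnable sketch of that comparison, the frame below is made up for illustration (the `produce` name, its prices, and its labels are assumptions). Its index is sorted alphabetically, which is what lets a label that is not present, such as `'Potatoet'`, still act as a slice bound.

```python
import pandas as pd

# Hypothetical toy data with an alphabetically sorted string index.
produce = pd.DataFrame(
    {"price": [1.2, 0.5, 2.0, 0.8, 3.1]},
    index=["Apples", "Bananas", "Oranges", "Potatoes", "Tomatoes"],
)

# Inclusive label slice: 'Potatoes' itself is part of the result.
inclusive = produce.loc["Apples":"Potatoes"]

# The awkward alternative: a made-up bound just past 'Potatoes'.
workaround = produce.loc["Apples":"Potatoet"]

same_rows = inclusive.equals(workaround)
```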