# Selecting Data In A DataFrame
## Notebook Outline:

* <a href='#IntroToIndexing'>Introduction To Indexing</a>
* <a href='#IntroducingILoc'>Introduction To iLoc</a>
* <a href='#IntroducingLoc'>Introduction To Loc</a>
* <a href='#UsingLocWithCondition'>Using Loc With A Condition</a>
* <a href='#UsingLocWithMultipleConditions'>Using Loc With Multiple Conditions</a>

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then apply what you learn to your own data files, or (if you do not have any data of your own) to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining! It would be easy to just skim this notebook, but you will learn more by doing!

<a name="IntroToIndexing"></a>
#  Introduction to Indexing
Indexing and slicing refer to ways of grabbing specific rows and columns from a dataset.  Maybe you want the value at the 100th row and 11th column, or maybe you want all the rows of data for the month of January in 2017. Maybe you want all the rows where a certain value is greater than 0.  These are all examples where you will want to use indexing.

We are going to cover the two main methods of indexing, .iloc and .loc, and then start using them in examples. First, we need to get a handle on the basics of these methods!

<a name='IntroducingILoc'></a>
# Introducing the .iloc[] method
The .iloc method allows us to select rows and columns based on the _number_ (position) of the row and column. For example, we can select the 10th row and 3rd column, or we can select all the values in the 17th row, and so on.  Let's learn about .iloc[] via the examples below.

We need a dataset to practice on, so let's load the Illinois Boy Baby names dataset that we saw in a previous lecture.

In [1]:
# In this cell we import pandas and load the datafile.
import pandas as pd
import os

filepath = os.path.join(os.getcwd(), 'data', 'Most_Popular_Baby_Boy_Names__1980-2013.csv')
nameData = pd.read_csv(filepath)

#### Let's use the .head() method to get a quick look at the data

In [2]:
nameData.head(3)

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273


#### Now let's use the .iloc[] method
The .iloc method allows us to index a dataframe in a very similar way to how we would index a list. It is important to remember that .iloc does not use any of the row or column labels (names), but it uses the numerical position of each row and column.

While it is important to know about .iloc, I actually don't use it very often; I mostly use a similar method, .loc, that we will learn about next!

To use .iloc, simply write .iloc after the dataframe variable name and then use _square_ brackets to grab the row and column you want. The first value is the row number, and the second is the column number.  For example, nameData.iloc[0, 3] will get the value from the first row and the fourth column.

In the next cell we grab the value in the first row and first column of the dataframe (remember that Python is zero-indexed).

The value returned is 1. This is the value in the first row of the Rank column in the output above. The leftmost column of that output is the row index, not a data column, so .iloc does not count it.

In [3]:
# We grab the value from the first row and the first column.
nameData.iloc[0, 0]

1

In [4]:
nameData.iloc[0,2]

'Michael'
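
To see that .iloc really works by position and ignores labels, here is a minimal sketch on a made-up mini-dataset. The frame `toy` and its string row labels are hypothetical (the values are copied from the first three rows shown above), so you can run it without the CSV file.

```python
import pandas as pd

# Hypothetical mini-dataset: values copied from the first rows shown above,
# but with made-up string row labels instead of the default 0, 1, 2.
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Year": [1980, 1980, 1980],
        "Name": ["Michael", "Jason", "Christopher"],
        "Frequency": [3886, 2389, 2273],
    },
    index=["a", "b", "c"],
)

first_rank = toy.iloc[0, 0]   # first row, first column
first_name = toy.iloc[0, 2]   # first row, third column
last_freq = toy.iloc[-1, -1]  # negative positions count from the end, as with lists

print(first_rank, first_name, last_freq)
```

Even though the rows are labelled 'a', 'b', and 'c', the positions 0, 1, and 2 still work with .iloc.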

#### Now, let's use .iloc[]  to get the 2nd row.
Remember, Python is zero-indexed, so the 2nd row is at index 1.

Note how we use ':' to get all the columns.

In [None]:
nameData.iloc[1, :]

#### You don't actually need to use the ':' to get all the columns. But you do need to use ':' to get all the rows (see a few cells below). So, as a practice, it's easier to remember to just use ':'.

In [None]:
nameData.iloc[1, ]

#### Let's use .iloc[] to get all the values in the 3rd column
Note how we use ':' to get all the rows.

In [None]:
nameData.iloc[:, 2]

#### Now let's use .iloc[] to get the first 10 rows.
When getting a range of rows, we can type the range as < first row number >: < last row number + 1>. For example nameData.iloc[0:10, :] will get all the rows from row 0 through row 9.

In [None]:
nameData.iloc[0:10, :]
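
The "last row number + 1" rule is the same one Python lists use. The sketch below compares the two on a made-up 12-row frame (`toy` and `plain_list` are hypothetical names used only for illustration).

```python
import pandas as pd

# Hypothetical frame with 12 rows, used only to illustrate slicing.
toy = pd.DataFrame({"value": range(100, 112)})
plain_list = list(range(100, 112))

first_ten = toy.iloc[0:10, :]

print(len(plain_list[0:10]))  # list slice: the end position is excluded
print(len(first_ten))         # .iloc slice: same rule, so 10 rows
print(first_ten.index[-1])    # the last row included is row 9, not row 10
```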

#### Let's use iloc[] to get the first 10 rows and the first 2 columns

In [None]:
nameData.iloc[0:10, 0:2]

#### If you are starting your selection at 0, you don't actually need to type the 0. For example:

In [None]:
nameData.iloc[:10, :2]

#### Now let's get every other row. The notation is dataframe.iloc[< first row index > : < last row index + 1> : < step size >, :]

## This is called a stride

In [5]:
nameData.iloc[0:10:2, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
2,3,1980,Christopher,2273
4,5,1980,David,2088
6,7,1980,Robert,1763
8,9,1980,John,1722


#### You can also select specific rows and columns:

In [None]:
nameData.iloc[[1, 4, 7], :]

In [None]:
nameData.iloc[[1, 4, 7], [1, 3]]

## In Class Exercise
Please create a cell below and use the .iloc[] method to explore the dataset.

<a name='IntroducingLoc'></a>
# Introducing the .loc method
.loc[] lets us select rows and columns by their labels or by boolean value (true/false tests). I use .loc _much_ more often than I use .iloc.

For example, we use the .loc method below to select the 'Name' column.

In [None]:
nameData.loc[:, 'Name']

#### Let's use .loc to grab just the first row of the name column.
Notice that our row labels also happen to be the number of the row. This is not always the case but it is here.

In [None]:
nameData.loc[0 , 'Name']
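
Here is a small sketch of a case where the row labels are not row numbers. It assumes a made-up three-row frame and uses `set_index` to turn the 'Name' column into the row labels. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical mini-dataset with values copied from the first rows shown above.
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Name": ["Michael", "Jason", "Christopher"],
        "Frequency": [3886, 2389, 2273],
    }
)

# Make the names the row labels (set_index returns a new dataframe).
by_name = toy.set_index("Name")

jason_rank = by_name.loc["Jason", "Rank"]  # select by row label
second_row_rank = by_name.iloc[1, 0]       # select the same cell by position

# 0 is no longer a row label, so .loc cannot find it.
try:
    by_name.loc[0, "Rank"]
    zero_is_a_label = True
except KeyError:
    zero_is_a_label = False

print(jason_rank, second_row_rank, zero_is_a_label)
```

With labels like these, .loc needs the label ('Jason') while .iloc still needs the position (1).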

#### Let's now get all the rows for the columns 'Name' and 'Rank'.
Notice that when you want multiple rows and/or columns, you need to list the labels of the rows and/or columns you want. (That is, the labels go inside square brackets: they are in a list.)

In [6]:
nameData.loc[:,['Name', 'Rank']]

Unnamed: 0,Name,Rank
0,Michael,1
1,Jason,2
2,Christopher,3
3,Matthew,4
4,David,5
5,James,6
6,Robert,7
7,Daniel,8
8,John,9
9,Joseph,10


## In Class Exercise
Please create a cell below and use the .loc[] method to explore the dataset.

In [12]:
nameData.loc[1:11:2,['Rank','Name']].sort_values('Rank')

Unnamed: 0,Rank,Name
1,2,Jason
3,4,Matthew
5,6,James
7,8,Daniel
9,10,Joseph
11,12,Joshua
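
Notice that the output above includes the row labelled 11. Unlike .iloc, a .loc slice includes its end label. The sketch below shows the difference on a made-up 12-row frame (`toy` is a hypothetical stand-in whose row labels are 0 through 11).

```python
import pandas as pd

# Hypothetical frame whose row labels are 0..11.
toy = pd.DataFrame({"Rank": range(1, 13)})

by_label = toy.loc[1:11:2, "Rank"]     # label slice: the end label 11 is included
by_position = toy.iloc[1:11:2, 0]      # position slice: position 11 is excluded

print(by_label.index.tolist())     # [1, 3, 5, 7, 9, 11]
print(by_position.index.tolist())  # [1, 3, 5, 7, 9]
```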


<a name='UsingLocWithCondition'></a>
# Using .loc to get rows based on a _condition_
In this section, we are going to look at how we get rows where a certain condition is True. This is a very common thing to do!  Often examples are shown on random data, but let's use it on real data, starting with the name data!

#### Reviewing Booleans
We first need to do a quick Boolean review. A 'boolean' is a variable type that can have a value of either True or False. They are usually created by performing some kind of simple test. For example, the statement 2 > 5 is _false_ because it is _not true_ that 2 is greater than 5. You will want to briefly review what each symbol below means:
* a == b, tests if a is the same value as b.
* a != b, tests if a is not the same value as b.
* a > b, tests if a is greater than b.
* a >= b, tests if a is greater than or equal to b.
* a < b, tests if a is less than b.
* a <= b, tests if a is less than or equal to b.

##### NOTE: '==' is not the same as '='. '=' is used to assign values to variable names. '==' is used to test for equivalence.

Let's try some other tests in the cell below.

In [13]:
print(2 > 1)
print(1 == 1)
print(1 == 3)
print(5 <= 6)
print(5 <= 5)
print(100 >= 101)
print(2 != 4)
print('apple' != 'banana')

True
True
False
True
True
False
True
True


#### Creating a column of true/false values based on values in a column of a dataframe.
Let's say we want all the rows of the baby name data where the value in the 'Rank' column is 1. We just need to use the '==' operator we reviewed above.

First get the column from the dataframe using the .loc method and then use '==' to test for equivalence to 1. Notice that this prints a Series (which is like a pandas DataFrame, but just 1-dimensional instead of having multiple columns).

In [14]:
nameData.loc[:, 'Rank'] == 1

0       True
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
820    False
821    False
822    False
823    False
824    False
825     True
826    False
827    False
828    False
829    False
830    False
831    False
832    False
833    False
834    False
835    False
836    False
837    False
838    False
839    False
840    False
841    False
842    False
843    False
844    False
845    False
846    False
847    False
848    False
849    False
Name: Rank, Length: 850, dtype: bool

#### Assign the True/False values to a variable and use it with the .loc method to index the dataframe
This time, we will assign the output True/False values to the variable name _ranked1st_.  Now we can use this variable to index our dataframe.

In [15]:
ranked1st = nameData.loc[:, 'Rank'] == 1

#### Using a boolean series with .loc
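You can use a boolean series with .loc to select the rows (or columns) where the series has a value of True. You can _not_ pass a boolean series to .iloc; it only accepts a plain boolean array, such as the one returned by the series' .to_numpy() method.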

Note how the below only gets the rows where _ranked1st_ has the value of True.

In [17]:
nameData.loc[ranked1st, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
25,1,1981,Michael,3632
50,1,1982,Michael,3664
75,1,1983,Michael,3681
100,1,1984,Michael,3669
125,1,1985,Michael,3480
150,1,1986,Michael,3337
175,1,1987,Michael,3467
200,1,1988,Michael,3540
225,1,1989,Michael,3624
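
The difference between .loc and .iloc with boolean values can be sketched on a made-up four-row frame. The frame `toy` is hypothetical, and the exact exception .iloc raises for a boolean Series is an assumption that may vary between pandas versions, so the sketch catches both likely types.

```python
import pandas as pd

# Hypothetical mini-dataset with values taken from the name data shown above.
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 1, 2],
        "Year": [1980, 1980, 1981, 1981],
        "Name": ["Michael", "Jason", "Michael", "Matthew"],
    }
)

ranked1st = toy.loc[:, "Rank"] == 1  # a boolean Series

# .loc accepts the boolean Series directly.
with_loc = toy.loc[ranked1st, :]

# .iloc rejects a boolean Series...
try:
    toy.iloc[ranked1st, :]
    iloc_accepts_series = True
except (NotImplementedError, ValueError):
    iloc_accepts_series = False

# ...but accepts the same True/False values as a plain NumPy array.
with_iloc = toy.iloc[ranked1st.to_numpy(), :]

print(iloc_accepts_series)
print(with_loc.equals(with_iloc))
```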


#### Note that you can use the True/False test directly in the .loc method. This is usually what you will see.

In [18]:
nameData.loc[nameData['Rank'] == 1, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
25,1,1981,Michael,3632
50,1,1982,Michael,3664
75,1,1983,Michael,3681
100,1,1984,Michael,3669
125,1,1985,Michael,3480
150,1,1986,Michael,3337
175,1,1987,Michael,3467
200,1,1988,Michael,3540
225,1,1989,Michael,3624


#### Let's look at some more examples: Use booleans and .loc to get all the rows for the name 'William'.

In [19]:
nameData.loc[nameData['Name'] == 'William', :]

Unnamed: 0,Rank,Year,Name,Frequency
17,18,1980,William,1192
40,16,1981,William,1176
69,20,1982,William,1124
95,21,1983,William,1033
120,21,1984,William,1075
149,25,1985,William,919
173,24,1986,William,949
196,22,1987,William,983
223,24,1988,William,916
244,20,1989,William,983


#### Get all rows where the rank is 3 or better (1st, 2nd, or 3rd place), that is, where Rank <= 3

In [20]:
topNames = nameData.loc[nameData['Rank'] <= 3, :]
topNames.head()

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273
25,1,1981,Michael,3632
26,2,1981,Matthew,2329


## In Class Exercise
Please create a cell below and use the .loc[] method to explore the dataset. Use the .loc method and an '==' test to get all the rows where the year is equal to 2000

In [22]:
nameData.loc[nameData['Year']==2000,:]

Unnamed: 0,Rank,Year,Name,Frequency
500,1,2000,Jacob,1640
501,2,2000,Michael,1553
502,3,2000,Matthew,1420
503,4,2000,Daniel,1269
504,5,2000,Nicholas,1211
505,6,2000,Joseph,1149
506,7,2000,Joshua,1123
507,8,2000,Anthony,1017
508,9,2000,Andrew,991
509,10,2000,Ryan,974


<a name='UsingLocWithMultipleConditions'></a>
# Using .loc to get rows based on multiple _conditions_
In this section, we are going to look at how we get rows where multiple conditions are True. This is also a very common thing to do!

First we need a quick review of the symbols we use for _and_ and _or_ when using arrays (or series) of True/False values:

* & - means 'and'
* | - means 'or'
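
As a quick illustration, the sketch below applies these operators to two small made-up boolean Series (`a` and `b` are hypothetical). It also shows `~`, which flips each True/False value and works as 'not'.

```python
import pandas as pd

# Two hypothetical boolean Series covering every True/False combination.
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

both = a & b    # True only where a AND b are True
either = a | b  # True where a OR b (or both) is True
not_a = ~a      # flips every value of a

print(both.tolist())    # [True, False, False, False]
print(either.tolist())  # [True, True, True, False]
print(not_a.tolist())   # [False, False, True, True]
```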

#### How to get the row where the rank equals 1 and the year equals 2000.
Use the same equivalence tests we used above, but combine them with the & operator. This produces a series of True/False values, where each value will only be True if both tests are True. We only expect one row to test as True; that is, there should only be one row where the year equals 2000 and the rank equals 1.

##### Note you must now use parentheses to group each test.

In [23]:
nameData.loc[(nameData['Rank'] == 1) & (nameData['Year'] == 2000), :]

Unnamed: 0,Rank,Year,Name,Frequency
500,1,2000,Jacob,1640
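
Why are the parentheses required? In Python, & is evaluated before ==, so without parentheses the two tests are not grouped the way you intend. The sketch below shows what happens on a made-up four-row frame (`toy` is hypothetical). The exact error message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical mini-dataset used only to illustrate operator grouping.
toy = pd.DataFrame({"Rank": [1, 2, 1, 2], "Year": [2000, 2000, 2001, 2001]})

# With parentheses: two element-wise tests combined with &.
mask = (toy["Rank"] == 1) & (toy["Year"] == 2000)
print(toy.loc[mask, :])

# Without parentheses: & is evaluated first, so Python reads the expression
# as the chained comparison  toy["Rank"] == (1 & toy["Year"]) == 2000,
# which needs a single True/False answer from a whole Series and fails.
try:
    toy.loc[toy["Rank"] == 1 & toy["Year"] == 2000, :]
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```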


### Now let's try some examples on our auto data. First we will load the data.

In [24]:
filepath = os.path.join(os.getcwd(), 'data', 'auto-mpg-tabs.csv')

autoMPGData = pd.read_csv(filepath, sep='\t', index_col=0)
autoMPGData.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino


#### Use the or operator, '|', to get the rows where the name is 'ford gran torino' or 'ford pinto'.

In [25]:
autoMPGData.loc[(autoMPGData['carname'] == 'ford gran torino') |
                (autoMPGData['carname'] == 'ford pinto'), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
88,14.0,8,302.0,137.0,4042.0,14.5,73,ford gran torino
112,19.0,4,122.0,85.00,2310.0,18.5,73,ford pinto
130,26.0,4,122.0,80.00,2451.0,16.5,74,ford pinto
136,16.0,8,302.0,140.0,4141.0,14.0,74,ford gran torino
168,23.0,4,140.0,83.00,2639.0,17.0,75,ford pinto
174,18.0,6,171.0,97.00,2984.0,14.5,75,ford pinto
190,14.5,8,351.0,152.0,4215.0,12.8,76,ford gran torino
206,26.5,4,140.0,72.00,2565.0,13.6,76,ford pinto


#### Get all rows for cars with a model year after 1980. This time we assign the output to a variable named _carModelsAbove80_ and use .head() to print the first few rows.
We also print the type of _carModelsAbove80_ so you can see that it is also a dataframe.

In [None]:
carModelsAbove80 = autoMPGData.loc[(autoMPGData['model year'] > 80), :]
print(type(carModelsAbove80))
carModelsAbove80.head()

#### Introducing the .isin() method for testing if a value in a column is in a list of possible values.
Let's say we wanted all rows where the car name is 'ford pinto', 'ford gran torino', or 'ford maverick'. We could use three equivalence tests and two or operators to string the tests together. But another way is to use the .isin() method.  See the example below:

In [None]:
# We can just use the .isin() method on any column and then pass the list of
# the values we want checked to the .isin() method
autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick'])

In [None]:
# Now, let's use it in the .loc[] method to get those rows from the dataframe
autoMPGData.loc[autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick']), :]
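
To confirm that .isin() gives the same answer as a chain of '==' tests joined with '|', here is a minimal sketch on a made-up five-row frame. The frame `toy` and the other variable names are hypothetical. It also shows that `~` in front of the test selects every row that is _not_ in the list.

```python
import pandas as pd

# Hypothetical mini-dataset of car names taken from the auto data shown above.
toy = pd.DataFrame(
    {
        "carname": [
            "ford pinto",
            "amc rebel sst",
            "ford maverick",
            "ford gran torino",
            "ford torino",
        ]
    }
)

wanted = ["ford pinto", "ford gran torino", "ford maverick"]

with_isin = toy["carname"].isin(wanted)
with_ors = (
    (toy["carname"] == "ford pinto")
    | (toy["carname"] == "ford gran torino")
    | (toy["carname"] == "ford maverick")
)

print(with_isin.tolist() == with_ors.tolist())  # both tests agree row by row

# ~ negates the test: keep the rows whose name is NOT in the list.
everything_else = toy.loc[~with_isin, :]
print(everything_else)
```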

#### Selecting rows by substring match to get all the models with the word 'ford'

We can use the `str.contains()` method on any column of strings to find all the rows whose string contains a given substring.  We can also set the `case` argument to False to ignore case. Note that we have to go through the `str` accessor to reach string methods for the column.

In [None]:
# Now, let's use the str.contains() method to find all the rows with ford in the carname
autoMPGData.loc[autoMPGData['carname'].str.contains('ford', case=False), :].head()

#### We can use the '|' as an 'or' in our matches, because str.contains() treats the pattern as a regular expression by default.

In [26]:
# Now, let's use the str.contains() method to find all the rows with ford or chevrolet in the carname
autoMPGData.loc[autoMPGData['carname'].str.contains('ford|chevrolet', case=False), :].head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,chevrolet impala
12,15.0,8,400.0,150.0,3761.0,9.5,70,chevrolet monte carlo
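
The '|' works as 'or' here because `str.contains()` reads its pattern as a regular expression unless told otherwise. The sketch below contrasts that default with `regex=False`, which looks for the literal text instead. The Series `names` is made up, and its last entry is a deliberately artificial value that contains a literal '|'.

```python
import pandas as pd

# Hypothetical car names; the last one is artificial and contains a literal '|'.
names = pd.Series(
    ["ford torino", "Chevrolet Impala", "amc rebel sst", "ford|chevrolet special"]
)

# Default (regex=True): '|' means "ford OR chevrolet"; case=False ignores case.
regex_or = names.str.contains("ford|chevrolet", case=False)

# regex=False: the whole pattern, including '|', must appear literally.
literal = names.str.contains("ford|chevrolet", regex=False)

print(regex_or.tolist())  # [True, True, False, True]
print(literal.tolist())   # [False, False, False, True]
```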


###  Now, let's use an equivalence test to get all the rows where the horsepower column has a value of '?'

In [32]:
autoMPGData.loc[autoMPGData['horsepower'] == '?', :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,amc concord dl


#### Now, use the not equivalent test, !=, to get all the rows that do not have a missing horsepower value.

In [30]:
autoMPGDataCleaned = autoMPGData.loc[autoMPGData['horsepower'] != '?', :].copy()
autoMPGDataCleaned.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
393,27.0,4,140.0,86.0,2790.0,15.6,82,ford mustang gl
394,44.0,4,97.0,52.0,2130.0,24.6,82,vw pickup
395,32.0,4,135.0,84.0,2295.0,11.6,82,dodge rampage
396,28.0,4,120.0,79.0,2625.0,18.6,82,ford ranger
397,31.0,4,119.0,82.0,2720.0,19.4,82,chevy s-10


#### Now, use the .astype method to convert the column to a float type.

In [31]:
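# Assign the converted column back by name; in recent pandas versions,
# assigning through .loc[:, 'horsepower'] can leave the column's dtype unchanged.
autoMPGDataCleaned['horsepower'] = autoMPGDataCleaned['horsepower'].astype(float)

# Now we can test for cars with a horsepower over 190. The ~ in front negates
# the test, so the result is True for cars whose horsepower is NOT over 190.
~(autoMPGDataCleaned['horsepower'] > 190)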

0       True
1       True
2       True
3       True
4       True
5      False
6      False
7      False
8      False
9       True
10      True
11      True
12      True
13     False
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25     False
26     False
27     False
28     False
29      True
       ...  
367     True
368     True
369     True
370     True
371     True
372     True
373     True
375     True
376     True
377     True
378     True
379     True
380     True
381     True
382     True
383     True
384     True
385     True
386     True
387     True
388     True
389     True
390     True
391     True
392     True
393     True
394     True
395     True
396     True
397     True
Name: horsepower, Length: 392, dtype: bool
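
The whole clean-then-filter sequence can be sketched end to end on a made-up four-row frame. The frame `toy` and the variable names are hypothetical, with values copied from rows of the auto data shown above. It removes the '?' rows, converts the column to float, and then uses the test and its negation with .loc.

```python
import pandas as pd

# Hypothetical mini-dataset; horsepower is stored as text because of the '?'.
toy = pd.DataFrame(
    {
        "horsepower": ["130.0", "?", "198.0", "85.00"],
        "carname": [
            "chevrolet chevelle malibu",
            "ford pinto",
            "ford galaxie 500",
            "ford pinto",
        ],
    }
)

# Keep only the rows with a real horsepower value, then convert to float.
cleaned = toy.loc[toy["horsepower"] != "?", :].copy()
cleaned["horsepower"] = cleaned["horsepower"].astype(float)
print(cleaned["horsepower"].dtype)

over_190 = cleaned["horsepower"] > 190

strong = cleaned.loc[over_190, :]   # rows where the test is True
others = cleaned.loc[~over_190, :]  # rows where the test is False

print(strong)
print(others)
```

Note that the row labels of the removed rows are simply missing from `cleaned`; the remaining rows keep their original labels.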

## In Class Exercise
Please create a cell below and use the .loc[] method to explore the dataset. Use multiple tests, and also use the .isin() method.

# Lesson Summary:
In this lesson you learned:
* How to use the .iloc[] method to select rows and columns from a dataframe by their number.
* How to use the .loc[] method to select rows and columns from a dataframe by their label.
* How to use the .loc[] method to select data based on boolean (True/False) arrays.

## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>