# Selecting Subsets of Data from DataFrames with `loc`
>In this chapter, we use the `loc` indexer to select subsets of data from DataFrames. The `loc` indexer selects data in a different manner than *just the brackets*. It has its own separate set of rules that we must learn. 

In [1]:
import pandas as pd
ps = pd.Series(['Ganga', 'Yamuna', 'Gomti', 'Koshi', 'Godavari', 'Kaveri'], index=['a', 'b', 'c', 'd', 'e', 'f'])
ps

a       Ganga
b      Yamuna
c       Gomti
d       Koshi
e    Godavari
f      Kaveri
dtype: object

In [2]:
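# indexing a Series by label with loc[]
ps.loc['d']          # single label, returns 'Koshi'
ps.loc['c':'f']      # label slice, the stop label 'f' is included
ps.loc['a':'f':2]    # every second label; only this last expression is displayed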

a       Ganga
c       Gomti
e    Godavari
dtype: object

In [3]:
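# indexing a Series by integer position with iloc[]
ps.iloc[5]      # single position, returns 'Kaveri'
ps.iloc[2:]     # position 2 through the end
ps.iloc[::2]    # every second position; only this last expression is displayed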

a       Ganga
c       Gomti
e    Godavari
dtype: object

In [4]:
ps.describe()

count         6
unique        6
top       Ganga
freq          1
dtype: object

### Iterating over all the elements

In [5]:
for i in ps:
    print(i)

Ganga
Yamuna
Gomti
Koshi
Godavari
Kaveri


In [6]:
# items() yields (label, value) pairs; iteritems() was removed in pandas 2.0
for i in ps.items():
    print(i)

('a', 'Ganga')
('b', 'Yamuna')
('c', 'Gomti')
('d', 'Koshi')
('e', 'Godavari')
('f', 'Kaveri')


>The `loc` indexer can select rows and columns simultaneously.  This is done by separating the row and column selections with a **comma**. The selection will look something like this:

`df.loc[rows, cols]`


In [2]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv', index_col=0)
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Select two rows and three columns with `loc`

>Let's make our first selection with `loc` by simultaneously selecting some rows and some columns. Let's select the rows `Dean` and `Cornelia` along with the columns `age`, `state`, and `score`. A list is used to contain both the row and column selections before being placed within the brackets following `loc`. Row and column selection must be separated by a comma.

In [3]:
rows = ['Dean', 'Cornelia']       # select the rows labeled 'Dean' and 'Cornelia'
cols = ['age', 'state', 'score']  # select the columns labeled 'age', 'state', and 'score'
df.loc[rows, cols]

name,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


### The possible types of row and column selections

In the above example, we used a list of labels for both the row and column selection. You are not limited to just lists. All of the following are valid objects available for both row and column selections with `loc`.

* A single label
* A list of labels
* A slice with labels
* A boolean Series (covered in a later chapter)
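
The selection types listed above can be sketched on a small frame. This is only an illustration: the three-row DataFrame below is made up for the example, and the boolean Series case is a preview of a later chapter.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration
df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'age': [30, 2, 12], 'score': [4.6, 8.3, 9.0]},
    index=['Jane', 'Niko', 'Aaron'],
)

single = df.loc['Niko', 'age']                          # a single label for each
listed = df.loc[['Jane', 'Aaron'], ['state', 'score']]  # a list of labels for each
sliced = df.loc['Jane':'Niko', 'state':'age']           # a slice of labels for each
masked = df.loc[df['age'] > 10, :]                      # a boolean Series for the rows

print(single)
print(listed)
print(sliced)
print(masked)
```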

### Select two rows and a single column

Let's select the rows `Aaron` and `Dean` along with the `food` column. We can use a list for the row selection and a single string for the column selection.

In [3]:
rows = ['Dean', 'Aaron']
cols = 'food'
df.loc[rows, cols]

name
Dean     Cheese
Aaron     Mango
Name: food, dtype: object

### Series returned

In the above example, a Series and not a DataFrame was returned. Whenever you select a single row or a single column using a string label, pandas returns a Series.

## `loc` with slice notation

Let's take a moment to review Python's slice notation, which is used to select subsets from core Python objects such as lists, tuples, and strings. Slice notation has three components - the **start**, **stop**, and **step**. Syntactically, the components are separated by colons like this - `start:stop:step`. Every component is optional and takes a default value when omitted: the start defaults to the beginning, the stop to the end, and the step size to 1.

### Example slices

Let's take a look at several slice notations and the value of each component of the slice.

* `'Niko':'Christina':2` - start is 'Niko', stop is 'Christina', step is 2
* `'Niko':'Christina'` - start is 'Niko', stop is 'Christina', step is 1
* `'Niko'::2` - start is 'Niko', stop is the end, step is 2
* `'Niko':` - start is 'Niko', stop is the end, step is 1
* `:'Christina':2` - start is the beginning, stop is 'Christina', step is 2
* `:` - start is the beginning, stop is the end, step is 1. All components take their default value.
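
The same combinations of start, stop, and step can be tried on a plain Python list, where the components are integers rather than labels. The list `letters` is a made-up example.

```python
# A made-up list used only to try out slice notation
letters = ['a', 'b', 'c', 'd', 'e', 'f']

start_stop_step = letters[1:5:2]   # start 1, stop 5, step 2
start_stop = letters[1:5]          # step defaults to 1
start_step = letters[1::2]         # stop defaults to the end
stop_step = letters[:5:2]          # start defaults to the beginning
everything = letters[:]            # all three components take their defaults

print(start_stop_step, start_stop, start_step, stop_step, everything)
```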

This same slice notation is allowed within the `loc` indexer. Let's select all of the rows from `Jane` to `Penelope` with slice notation along with the columns `state` and `color`.

In [4]:
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

name,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


### Slice notation is inclusive of the stop label

Slice notation with the `loc` indexer includes the stop label. This behaves differently than slicing done on Python lists, which is exclusive of the stop integer.
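
A short comparison makes the difference concrete. The Series `s` below is a made-up example; the point is only which elements each slice keeps.

```python
import pandas as pd

# Made-up Series with string labels
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']          # label slice: the stop label 'c' is included
as_list = [10, 20, 30, 40][0:2]    # list slice: the stop integer 2 is excluded

print(list(by_label.index))
print(as_list)
```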

### Slice notation only works within the brackets attached to the object

Python only allows us to use slice notation within the brackets that are attached to an object. If we try to assign slice notation to a variable outside of brackets, Python raises a `SyntaxError`.
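
A minimal sketch of this restriction, using a made-up two-column frame. The invalid statement is passed to `compile` as a string so that the rest of the cell still runs. Python's built-in `slice` function, which the text above does not cover, is shown as one way to store a slice in a variable.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'color': ['blue', 'green', 'red']},
    index=['Jane', 'Niko', 'Aaron'],
)

# Slice notation outside of brackets is not valid Python
try:
    compile("rows = 'Jane':'Niko'", '<example>', 'exec')
    raised = False
except SyntaxError:
    raised = True

# The built-in slice object holds the same start and stop outside of brackets
rows = slice('Jane', 'Niko')
subset = df.loc[rows, :]

print(raised)
print(subset)
```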

### Slice both the rows and columns

Both row and column selections support slice notation. In the following example, we slice all the rows from the beginning up to and including label `Dean` along with columns from `height` until the end.

In [5]:
df.loc[:'Dean', 'height':]

name,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


### Selecting all of the rows and some of the columns

It is possible to use slice notation to select all of the rows or all of the columns. We do so with a single colon, which is sometimes referred to as the **empty slice**. In this example, we select all of the rows and two of the columns.

In [6]:
cols = ['food', 'color']
df.loc[:, cols]

name,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


### Could have used *just the brackets*

It isn't necessary to use `loc` for this selection as we are only selecting two distinct columns. This could have been accomplished with *just the brackets*.

In [7]:
cols = ['food', 'color']
df[cols]

name,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


### A single colon is slice notation to select all values

That single colon might be intimidating, but it is technically slice notation that selects all items. In the following example, all of the elements of a Python list are selected using a single colon.

In [8]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]

[1, 2, 3, 4, 5, 6]

### Use a single colon to select all the columns

It is possible to use a single colon to represent a slice of all of the rows or all of the columns. Below, a colon is used as slice notation for all of the columns.

In [9]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### The above can be shortened

By default, pandas selects all of the columns if you only provide a row selection. Providing the colon is not necessary so the following syntax makes the exact same selection.

In [10]:
rows = ['Penelope', 'Cornelia']
df.loc[rows]

name,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


Though it is not syntactically necessary, one reason to use the colon is to reinforce the idea that `loc` may be used for simultaneous row and column selection. The first object passed to `loc` always selects rows and the second always selects columns.
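
To check that the two spellings really select the same data, the results can be compared with the DataFrame `equals` method. The small frame here is made up for the check.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame(
    {'state': ['AL', 'TX'], 'age': [4, 69]},
    index=['Penelope', 'Cornelia'],
)

rows = ['Cornelia']
short_form = df.loc[rows]         # row selection only
explicit_form = df.loc[rows, :]   # row selection plus the empty slice for columns

same = short_form.equals(explicit_form)
print(same)
```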

### Use slice notation to select a range of rows with all of the columns

Similarly, we can use slice notation to select several rows at a time. Below, the slice begins at the row labeled `Niko` and goes all the way through `Dean`. We omit the column selection so that all of the columns are returned.

In [11]:
df.loc['Niko':'Dean']

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


You could have written the above as `df.loc['Niko':'Dean', :]` to reinforce the fact that `loc` first selects rows and then columns.

### Changing the step size

The step size must be an integer when using slice notation with `loc`. In this example, we select every other row beginning at `Niko` and ending at `Christina`.

In [12]:
df.loc['Niko':'Christina':2, :]

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


### Select a single row and a single column

If the row and column selections are both a single label, then a scalar value and NOT a DataFrame or Series is returned.

In [13]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]

'NY'

### Select a single row as a Series with `loc`

The `loc` indexer returns a single row as a Series when given a single row label. Let's select the row for `Niko`. Notice that the column names have now become index labels.

In [14]:
df.loc['Niko']

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

Again, the column selection isn't necessary, but does provide clarity.

In [15]:
df.loc['Niko', :]

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

### Confusing output

This output is potentially confusing. The original row that was labeled by `Niko` had horizontal data. Selecting a single row returns a Series that displays the row data vertically.

### Selecting a single row as a DataFrame

It is possible to select a single row as a DataFrame instead of a Series. Create the row selection as a one-item list instead of just a string label. The returned result is a DataFrame and maintains the same horizontal position for the row.

In [16]:
rows = ['Niko']
df.loc[rows, :]

name,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3


## Summary of the `loc` indexer

* Primarily uses labels
* Selects rows and columns simultaneously with `df.loc[rows, cols]`
* Both row and column selections can be a:
    * single label
    * list of labels
    * slice of labels
    * boolean Series
* A comma separates row and column selections
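
As a recap of the return types seen in this chapter, the sketch below inspects what `loc` returns for different combinations of single labels, lists, and slices. The two-row frame is made up for the purpose.

```python
import pandas as pd

# Made-up two-row frame used only to inspect return types
df = pd.DataFrame(
    {'state': ['NY', 'TX'], 'age': [30, 2]},
    index=['Jane', 'Niko'],
)

scalar = df.loc['Jane', 'state']           # label and label -> scalar
row = df.loc['Niko', :]                    # label and slice -> Series
column = df.loc[['Jane', 'Niko'], 'age']   # list and label  -> Series
frame = df.loc[['Niko'], ['age']]          # list and list   -> DataFrame

for result in (scalar, row, column, frame):
    print(type(result).__name__)
```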

## Exercises

Read in the movie dataset by executing the cell below and use it for the following exercises.

In [17]:
pd.set_option('display.max_columns', 50)
movie = pd.read_csv('input/movie.csv', index_col='title')
movie

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
McHale's Navy,,Black and White,TV-G,30.0,,,Tim Conway,870,Gavin MacLeod,284.0,Bob Hastings,253.0,,Comedy|War,4.0,1558,american soldier|naval uniform|navy|patrol boa...,English,USA,,7.5
Micmacs,2009.0,Color,R,105.0,Jean-Pierre Jeunet,0.0,Omar Sy,1000,Dany Boon,172.0,André Dussollier,52.0,1260917.0,Action|Comedy|Crime,213.0,24657,bullet|contortionist|gag humor|human cannonbal...,French,France,27000000.0,7.2
8 Mile,2002.0,Color,R,110.0,Curtis Hanson,161.0,Mekhi Phifer,1000,Omar Benson Miller,418.0,Evan Jones,196.0,116724075.0,Drama|Music,119.0,187181,competition|contest|friend|self expression|whi...,English,USA,41000000.0,7.0
Animal Kingdom: Let's go Ape,2015.0,Color,,101.0,Jamel Debbouze,326.0,Jamel Debbouze,326,Youssef Hajdi,152.0,Mélissa Theuriau,6.0,,Adventure|Animation|Comedy|Family,9.0,590,ape|computer animation|evolution|first person ...,French,France,,4.9


# The DataFrame and Series

The DataFrame and Series are the two primary objects when using pandas to analyze data. In this chapter, we will learn how to read data into a DataFrame and understand its components. We will also learn how to select a single column of data as a Series and examine its components.

## Reading external data with pandas

The one thing you need for data analysis is **data**. If you do not have any data, then you won't be able to use pandas to analyze it. This book contains many datasets stored externally in the `input` directory, located in the same directory as this notebook. Most of these datasets are stored as comma-separated value (**CSV**) files. These CSVs are human-readable and separate each individual piece of data with a comma. The comma is referred to as the **delimiter**. Despite the name, CSV files can use other single-character delimiters, such as tabs or semicolons.
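
As a small illustration of a non-comma delimiter, the sketch below reads semicolon-separated text with the `sep` parameter of `read_csv`. The text is made up and read from memory with `io.StringIO` so that no file is needed.

```python
import io

import pandas as pd

# Made-up semicolon-delimited text standing in for a file on disk
text = "name;age\nJane;30\nNiko;2\n"

# sep tells read_csv which single character separates the values
people = pd.read_csv(io.StringIO(text), sep=';')
print(people)
```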


### City of Chicago bike rides

We begin our data analysis adventure with a dataset on public bike rides from the city of Chicago. The data is contained in the `bikes.csv` file. The file used here holds roughly 4,000 recorded rides, the earliest from June 2013. Each row of the dataset represents a single ride from a single person using the city's public bike stations. There are 11 columns of data containing information on gender, start time, trip duration, bike station name, temperature, wind speed, and more. Let's print out the first three lines of the `bikes.csv` file using Python's built-in capabilities for reading files. This does not use pandas. Take note of the commas separating each value on each line.

In [18]:
with open('input/bikes.csv') as f:
    for i in range(3):
        print(f.readline())     

gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events

Male,6/28/2013 19:01,6/28/2013 19:17,993,Lake Shore Dr & Monroe St,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy

Male,6/28/2013 22:53,6/28/2013 23:03,623,Clinton St & Washington Blvd,31,Wells St & Walton St,19,69.1,6.9,partlycloudy



### Understanding the file location

Above, the string `input/bikes.csv` was used to represent the file location of the data. This location is relative to the current working directory, which for a notebook is normally the directory where the notebook resides. Let's cover every part of this string to ensure we understand what it means.

The file location string begins with `input`, which translates as 'move down into the `input` directory inside the current working directory'. Had the data been stored one level above the notebook, the string would instead begin with two dots, `..`, which translates as 'move one level above the current working directory'.

Note that the forward slash was written to separate the directories. Both macOS and Linux operating systems use the forward slash to separate directories and files from one another, while the Windows operating system uses the backslash. Fortunately, we can always use a forward slash regardless of our operating system, as Python will automatically handle the file location string for us.

The string ends with `/bikes.csv`, which translates as 'reference the file named `bikes.csv`'. In summary, the file location `input/bikes.csv` is a relative location pointing to where the dataset resides.
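
How a relative location is resolved can be sketched with the standard library's `pathlib` module, which the text itself does not use. The location `input/bikes.csv` is the one passed to `open` and `read_csv` in the code cells; the second path only illustrates the two dots.

```python
from pathlib import Path

# The location used in this chapter's code cells
relative = Path('input') / 'bikes.csv'

# A relative location is resolved against the current working directory
absolute = Path.cwd() / relative

# Two dots move one level above the current working directory
one_level_up = Path('..') / 'data' / 'bikes.csv'

print(relative.as_posix())      # written with forward slashes on every operating system
print(absolute.is_absolute())
print(one_level_up.parts)
```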

### Import pandas

To use the pandas library, we need to import it into our namespace. By convention, pandas is imported and aliased to the name `pd`. After running the import statement below, we will have access to all pandas objects with variable name `pd`. It is possible to use any other valid variable name as an alias, but it's best to use `pd` as the official documentation uses it along with most everyone else.

In [19]:
import pandas as pd
bikes = pd.read_csv('input/bikes.csv')

### Display DataFrame in Jupyter Notebook

We assigned the output from the `read_csv` function to the `bikes` variable name, which now refers to a DataFrame object. To visually display the DataFrame, place the variable name as the last line in a code cell. By default, pandas outputs the first and last 5 rows and the first and last 10 columns. If there are 60 or fewer total rows, it displays all rows. We cover how to change these display options in an upcoming chapter.

### `head` and `tail` methods

A very useful and simple method is `head`, which returns the first 5 rows of the DataFrame by default. This avoids long default output and is something I highly recommend when doing data analysis within a notebook. The `tail` method returns the last 5 rows by default. There will only be a few instances in the book where the `head` method is not used, as displaying up to 60 rows is far too many and will take up a lot of space on a screen or page.

In [20]:
bikes.head()

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,6/28/2013 19:01,6/28/2013 19:17,993,Lake Shore Dr & Monroe St,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy
1,Male,6/28/2013 22:53,6/28/2013 23:03,623,Clinton St & Washington Blvd,31,Wells St & Walton St,19,69.1,6.9,partlycloudy
2,Male,6/30/2013 14:43,6/30/2013 15:01,1040,Sheffield Ave & Kingsbury St,15,Dearborn St & Monroe St,23,73.0,16.1,mostlycloudy
3,Male,7/1/2013 10:05,7/1/2013 10:16,667,Carpenter St & Huron St,19,Clark St & Randolph St,31,72.0,16.1,mostlycloudy
4,Male,7/1/2013 11:16,7/1/2013 11:18,130,Damen Ave & Pierce Ave,19,Damen Ave & Pierce Ave,19,73.0,17.3,partlycloudy


The last five rows of the DataFrame may be displayed with the `tail` method.

In [21]:
bikes.tail()

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
4054,Male,6/8/2014 21:09,6/8/2014 21:13,257,Pine Grove Ave & Irving Park Rd,15,Halsted St & Waveland Ave,15,62.1,8.1,mostlycloudy
4055,Male,6/8/2014 23:15,6/8/2014 23:20,304,Southport Ave & Irving Park Rd,15,Broadway & Sheridan Rd,15,59.0,6.9,mostlycloudy
4056,Male,6/9/2014 5:16,6/9/2014 5:25,530,Morgan Ave & 14th Pl,15,Wood St & Taylor St,15,55.0,9.2,partlycloudy
4057,Male,6/9/2014 7:31,6/9/2014 7:39,496,Clinton St & Washington Blvd,31,Stetson Ave & South Water St,19,60.1,8.1,partlycloudy
4058,Female,6/9/2014 7:41,6/9/2014 8:01,1169,Theater on the Lake,15,Lake Shore Dr & Monroe St,39,60.1,8.1,partlycloudy


### First and last `n` rows
Both the `head` and `tail` methods accept a single integer parameter `n` controlling the number of rows returned. Here, we output the first three rows.

In [22]:
bikes.head(3)

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,6/28/2013 19:01,6/28/2013 19:17,993,Lake Shore Dr & Monroe St,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy
1,Male,6/28/2013 22:53,6/28/2013 23:03,623,Clinton St & Washington Blvd,31,Wells St & Walton St,19,69.1,6.9,partlycloudy
2,Male,6/30/2013 14:43,6/30/2013 15:01,1040,Sheffield Ave & Kingsbury St,15,Dearborn St & Monroe St,23,73.0,16.1,mostlycloudy


## Components of a DataFrame

The DataFrame is composed of three separate components - the **columns**, the **index**, and the **data**. These terms will be used throughout the book and understanding them is vital to your ability to use pandas. Take a look at the following graphic of our `bikes` DataFrame, stylized to put emphasis on each component.

![Components of the bikes DataFrame](images/df_components.png)

### The columns

The columns provide a **label** for each column and are always displayed in **bold** font above the data. A column is a single vertical sequence of data. In the above DataFrame, the column name `tripduration` references all the values in that column (993, 623, 1040, etc...).

The columns are also referred to as the **column names** or the **column labels** with individual values referred to as a **column name** or **column label**.

Most DataFrames, like the one above, use strings for column names, but it is possible that they can be other types such as integers. The column names are not required to be unique, though having duplicate columns would be bad practice, as it's vital to be able to uniquely identify each column.

### The index

The index provides a **label** for each row and is always displayed to the left of the data. A row is a single horizontal sequence of data. For instance, the index label **3** references all the values in its row (12907, Subscriber, Male, etc...)

The index is also referred to as the **index names/labels** or the **row names/labels** with the individual values referred to as a(n) **index name/label** or **row name/label**.

In the above DataFrame, the index is simply a sequence of integers beginning at 0. The values in the index are not limited to integers. Strings are a common type that are used in the index and make for more descriptive labels.

Surprisingly, values in the index are not required to be unique. In fact, all of the index values can be the same. A row label does not guarantee a one-to-one mapping to one specific row.
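
A short sketch of what a repeated row label means in practice. The frame `dup` is made up, and its label `'a'` appears twice.

```python
import pandas as pd

# Made-up frame whose index repeats the label 'a'
dup = pd.DataFrame({'score': [1, 2, 3]}, index=['a', 'a', 'b'])

unique = dup.index.is_unique   # False because 'a' appears twice
matches = dup.loc['a']         # every row labeled 'a' is returned

print(unique)
print(matches)
```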

### The data

The actual data is to the right of the index and below the columns and is displayed with normal font. The data is also referred to as the **values**. The data represents all the values for all the columns. It is important to note that the index and the columns are NOT part of the data. They are separate objects that act as **labels** for either rows or columns.

### The Axes

The index and columns are known collectively as the **axes**, each representing a single **axis** of the two-dimensional DataFrame. pandas uses the integer **0** to reference the index and **1** for the columns.
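
Each component can be pulled out of a DataFrame as its own object. The sketch below uses a made-up two-row frame; `sum` is used only to show how the integers 0 and 1 refer to the two axes.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame(
    {'age': [30, 2], 'height': [165, 70]},
    index=['Jane', 'Niko'],
)

columns = df.columns    # the column labels
index = df.index        # the row labels
data = df.to_numpy()    # the values, without any labels

# axis=0 works down the index, giving one result per column;
# axis=1 works across the columns, giving one result per row
per_column = df.sum(axis=0)
per_row = df.sum(axis=1)

print(list(columns), list(index), data.shape)
print(per_column)
print(per_row)
```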


## What type of object is `bikes`?

Let's verify that `bikes` is indeed a DataFrame with the `type` function.

In [23]:
type(bikes)

pandas.core.frame.DataFrame

### Fully-qualified name

The above output is something called the **fully-qualified name**. Only the word after the last dot is the name of the type. We have now verified that the `bikes` variable has type `DataFrame`. 

The fully-qualified name always includes the package and module name of where the type was defined. The package name is the first part of the fully-qualified name and, in this case, is `pandas`. The module name is the word immediately preceding the name of the type. Here, it is `frame`.
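
The parts of the fully-qualified name can also be read from the type object itself through the standard `__module__` and `__name__` attributes. This is a sketch on a made-up one-column frame.

```python
import pandas as pd

# Made-up frame; only its type is of interest here
df = pd.DataFrame({'a': [1]})

cls = type(df)
module_path = cls.__module__   # dotted path of the module that defines the type
type_name = cls.__name__       # the word after the last dot

# The first piece of the dotted path is the package, the last piece is the module
package, *_, module = module_path.split('.')

print(module_path, type_name)
print(package, module)
```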

### Package vs Module

A Python **package** is a directory containing other directories or modules that contain Python code. A Python **module** is a file (typically a text file ending in .py) that contains Python code. 

### Sub-packages

Any directory containing other directories or modules within a Python package is considered a **sub-package**. In this case, `core` is the sub-package.

### Where are the packages located on my machine?

Third-party packages are installed in the `site-packages` directory which itself is set up during Python installation. We can get the actual location with help from the standard library's `site` module's `getsitepackages` function.

In [24]:
import site
site.getsitepackages()

['/Users/jerrychien/opt/anaconda3/lib/python3.9/site-packages']

If you navigate to this directory in your file system, you'll find the 'pandas' directory. Within it will be a 'core' directory which will contain the 'frame.py' file. It is this file which contains Python code where the DataFrame class is defined.

## Select a single column from a DataFrame - a Series

To select a single column from a DataFrame, append a set of square brackets, `[]`, to the end of the DataFrame variable name. Place the column name as a string within those brackets to select it. This returns a single column of data as a pandas **Series**. This is a separate (but similar) type of object from a DataFrame.

Let's select the column name `tripduration`, assign it to a variable name, and output the first few values to the screen. The `head` and `tail` methods work the same as they do with DataFrames.

In [25]:
td = bikes['tripduration']
td.head()

0     993
1     623
2    1040
3     667
4     130
Name: tripduration, dtype: int64

Select the last three values in the Series by passing the `tail` method the integer 3.

In [26]:
td.tail(3)

4056     530
4057     496
4058    1169
Name: tripduration, dtype: int64

Let's verify that `td` has the type Series.

In [27]:
type(td)

pandas.core.series.Series

## Components of a Series

A Series is similar to a DataFrame but contains only a single dimension of data. It has two components - the **index** and the **data**. Let's take a look at a stylized Series graphic.

![Components of a Series](images/series_components.png)

It's important to note that a Series has no rows and no columns. In appearance, it resembles a one-column DataFrame, but it technically has no columns. It just has a sequence of values that are labeled by an index.

### The index

A Series index serves as labels for the values. When the index labels are unique, a single **label** or **name** references a single value. In the above image, the index label **3** corresponds to the value 667. The Series index is virtually identical to the DataFrame index, so the same rules apply to it. Index values can be duplicated and can be types other than integers, such as strings.
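
The two components can be inspected directly. The Series below is built by hand from the first four `tripduration` values shown earlier, so it is a stand-in for the real column rather than the column itself.

```python
import pandas as pd

# Hand-built stand-in for the first four tripduration values
s = pd.Series([993, 623, 1040, 667], name='tripduration')

labels = s.index                      # the index component
values = s.to_numpy()                 # the data component
dims = s.ndim                         # a Series has a single dimension
has_columns = hasattr(s, 'columns')   # a Series has no columns
value_at_3 = s.loc[3]                 # the value labeled by index label 3

print(list(labels), values, dims, has_columns, value_at_3)
```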

### Output of Series vs DataFrame

Notice that there is no nice HTML styling for the Series. It's just plain text. Below the Series display, you will see a few other items printed to the screen - the **name**, **length**, and **dtype**. These other items are NOT part of the Series itself and are just extra pieces of information to help you understand the Series.

* The **name** is not important right now. If the Series was formed from a column of a DataFrame, it will be set to that column name.
* The **length** is the number of values in the Series
* The **dtype** is the data type of the Series, which will be discussed in an upcoming chapter.
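
The items printed below a Series are also available programmatically. A minimal sketch, assuming a made-up one-column frame in place of `bikes`:

```python
import pandas as pd

# Made-up frame standing in for the bikes data
df = pd.DataFrame({'tripduration': [993, 623, 1040]})
td = df['tripduration']

name = td.name      # taken from the column the Series came from
length = len(td)    # number of values
dtype = td.dtype    # data type of the values

print(name, length, dtype)
```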


## Changing display options

pandas gives you the ability to change how the output on your screen is displayed. For instance, the default number of columns displayed for a DataFrame is 20, meaning that if your DataFrame has more than 20 columns, then only the first and last 10 columns will be shown on the screen. All the other columns will be hidden and unable to be displayed. This is problematic as many DataFrames have more than 20 columns.

### Get current option value with `get_option`

There are a few dozen display options you can control to change the visual representation of your DataFrame. It is not necessary to remember the option names as the official documentation provides descriptions for all [available options][1]. 

Let's first learn how to retrieve each option value with the `get_option` function. This is not a DataFrame method, but instead, a function that is accessed directly from `pd`.  Below are three of the most common options to change.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

In [28]:
pd.get_option('display.max_columns')

50

In [29]:
pd.get_option('display.max_rows')

60

In [30]:
pd.get_option('display.max_colwidth')

50

### Use the `set_option` function to change an option value

To change an option's value, use the `set_option` function. You can set as many options as you would like at one time. Its usage is a bit strange. Pass it the option name as a string and follow it immediately with the value you want to set it to. Continue this pattern of option name followed by new value to set as many options as you desire. Below, we set the maximum number of columns to 100 and the maximum number of rows to 4.

In [31]:
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)
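
One way to confirm that the call took effect is to read the values back with `get_option`. This sketch simply repeats the call above and then checks both options.

```python
import pandas as pd

# Option names and values alternate: name, value, name, value
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)

max_columns = pd.get_option('display.max_columns')
max_rows = pd.get_option('display.max_rows')

print(max_columns, max_rows)
```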

We now read in the housing dataset. All 10 of its columns are visible, and because `display.max_rows` is now 4, the five rows returned by `head` are truncated to the first two and the last two.

In [32]:
housing = pd.read_csv('input/housing.csv')
housing.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41,880,129.0,322,126,8.3252,452600,NEAR BAY
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,358500,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,341300,NEAR BAY
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,342200,NEAR BAY


# Selecting Subsets of Data from DataFrames with `iloc`

The `iloc` indexer is very similar to the `loc` indexer but only uses **integer location** to make its subset selections. The word `iloc` itself stands for integer location and can help remind you what it does.

## Simultaneous row and column subset selection

The `iloc` indexer is capable of making simultaneous row and column selections just like `loc`. Selection with `iloc` takes the following form, with a comma separating the row and column selections.

```python
df.iloc[rows, cols]
```

Let's read in some sample data and then begin making selections with integer location using `iloc`.

In [33]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv', index_col=0)
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
...,...,...,...,...,...,...
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### What is integer location?

Integer location is the position of a row or column, counted from 0. The first row/column is referenced by the integer 0. Each subsequent row/column is referenced by the next integer. The last row/column is referenced by `n - 1`, where `n` is the number of rows/columns.
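
To make the idea concrete, the sketch below rebuilds the sample data by hand (an assumption: it copies the table displayed above rather than reading the CSV) and looks up integer locations with `Index.get_loc`.

```python
import pandas as pd

# Hand-built copy of the sample data shown above.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

# Pair every row label with its integer location.
row_positions = {label: i for i, label in enumerate(df.index)}
print(row_positions)

# get_loc returns the integer location of a single label.
aaron_pos = df.index.get_loc("Aaron")
score_pos = df.columns.get_loc("score")
print(aaron_pos, score_pos)

# The last row sits at n - 1, which -1 also reaches.
n = len(df)
last_row_same = df.iloc[n - 1].equals(df.iloc[-1])
print(last_row_same)
```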

### Select using a list for both rows and columns

Let's select rows with integer location 2 and 4 along with the first and last columns. It is possible to use negative integers in the same manner as Python lists. The integer location -1 refers to the last column below.

In [34]:
rows = [2, 4]
cols = [0, -1]
df.iloc[rows, cols]

Unnamed: 0_level_0,state,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Aaron,FL,9.0
Dean,AK,1.8


### The possible types of selections for `iloc`

In the above example, we used a list of integers for both the row and column selection. You are not limited to just lists. Unlike `loc`, the `iloc` indexer does not accept a boolean Series; it does accept a plain boolean list or NumPy array, but boolean selection is normally done with `loc`. All of the following integer-based objects are valid for both row and column selections with `iloc`.

* A single integer
* A list of integers
* A slice with integers
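
Boolean selection deserves a small caveat. The sketch below, which uses a hand-built four-row excerpt of the sample data (an assumption), suggests that `iloc` rejects a boolean Series, the form normally passed to `loc`, while a plain boolean NumPy array is accepted. The exact exception type may vary across pandas versions, so both likely types are caught.

```python
import pandas as pd

# Hand-built excerpt of the sample data.
df = pd.DataFrame(
    {"age": [30, 2, 12, 4], "height": [165, 70, 120, 80]},
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope"], name="name"),
)

mask = df["age"] > 10  # boolean Series labeled by name

# loc accepts the boolean Series directly.
by_label = df.loc[mask]

# iloc refuses a boolean Series because it carries an index.
try:
    df.iloc[mask]
    series_accepted = True
except (ValueError, NotImplementedError):
    series_accepted = False

# Stripping the index leaves a plain boolean array, which iloc accepts.
by_position = df.iloc[mask.to_numpy()]
print(series_accepted)
print(by_position)
```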

### Slice the rows and use a list for the columns

Let's use slice notation to select rows with integer location 2 and 3 and a list to select columns with integer location 4 and 2. Notice that the stop integer location is **excluded** with `iloc`, which is exactly how slicing works with Python lists, tuples, and strings. Slicing with `loc` is **inclusive** of the stop label.

In [35]:
cols = [4, 2]
df.iloc[2:4, cols]

Unnamed: 0_level_0,height,food
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Aaron,120,Mango
Penelope,80,Apple
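
The difference between the two slicing rules is easiest to see side by side. This sketch uses a hand-built five-row excerpt of the sample data (an assumption) and compares a positional slice with label slices.

```python
import pandas as pd

# Hand-built excerpt of the sample data.
df = pd.DataFrame(
    {
        "height": [165, 70, 120, 80, 180],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

# iloc: positions 2 and 3 are returned, position 4 (Dean) is excluded.
by_position = df.iloc[2:4]

# loc: both the start label and the stop label are included.
by_label = df.loc["Aaron":"Penelope"]
through_dean = df.loc["Aaron":"Dean"]

print(list(by_position.index))
print(list(by_label.index))
print(list(through_dean.index))
```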


### Use a list for the rows and a slice for the columns

In this example, we use a list for the row selection and slice notation for the columns.

In [36]:
rows = [5, 2, 4]
df.iloc[rows, 3:]

Unnamed: 0_level_0,age,height,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Christina,33,172,9.5
Aaron,12,120,9.0
Dean,32,180,1.8
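
Note that the output rows follow the order of the list, not the order of the original DataFrame. The sketch below, using a hand-built copy of the sample data (an assumption), shows this and restores the original order by sorting the list first.

```python
import pandas as pd

# Hand-built copy of the sample data shown above.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

rows = [5, 2, 4]

# Rows come back in list order: Christina, Aaron, Dean.
as_listed = df.iloc[rows, 3:]

# Sorting the positions first gives the original row order.
in_original_order = df.iloc[sorted(rows), 3:]

print(list(as_listed.index))
print(list(in_original_order.index))
```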


### Select all of the rows and some of the columns

You can use an empty slice (just the colon) to select all of the rows or columns. In the example below, we select all of the rows and some of the columns with a list.

In [37]:
cols = [2, 4]
df.iloc[:, cols]

Unnamed: 0_level_0,food,height
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Jane,Steak,165
Niko,Lamb,70
...,...,...
Christina,Melon,172
Cornelia,Beans,150


### Select some of the rows and all of the columns

We can again use an empty slice, but do so to select all of the columns. We use a list to select some of the rows.

In [38]:
rows = [-3, -1, -2]
df.iloc[rows, :]

Unnamed: 0_level_0,state,color,food,age,height,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dean,AK,gray,Cheese,32,180,1.8
Cornelia,TX,red,Beans,69,150,2.2
Christina,TX,black,Melon,33,172,9.5


It is possible to rewrite the above without the column selection. pandas defaults to selecting all of the columns if a selection for them is not explicitly present.

In [39]:
df.iloc[rows]

Unnamed: 0_level_0,state,color,food,age,height,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dean,AK,gray,Cheese,32,180,1.8
Cornelia,TX,red,Beans,69,150,2.2
Christina,TX,black,Melon,33,172,9.5
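
A quick check that the two forms agree, using a hand-built two-column excerpt of the sample data (an assumption):

```python
import pandas as pd

# Hand-built excerpt of the sample data.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

rows = [-3, -1, -2]

explicit = df.iloc[rows, :]  # empty slice selects every column
implicit = df.iloc[rows]     # column selection omitted

same = explicit.equals(implicit)
print(same)
```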


### Select a single row and a single column

We can select a single value in our DataFrame using `iloc` by providing a single integer for both the row and column selection. This returns the actual value by itself completely outside of a DataFrame or Series.

In [40]:
df.iloc[3, 2]

'Apple'

### Select a single row and a single column as a DataFrame

It is possible to select the above value as a DataFrame by using one-item lists for the row and column selections. The output looks a little bizarre, but it's just a DataFrame with one row and one column.

In [41]:
rows = [3]
cols = [2]
df.iloc[rows, cols]

Unnamed: 0_level_0,food
name,Unnamed: 1_level_1
Penelope,Apple


### Select some rows and a single column

In this example, a list of integers is used for the rows and a single integer for the columns. pandas returns a Series when a single integer is used for one selection and a list or slice for the other.

In [42]:
rows = [2, 3, 5]
cols = 4
df.iloc[rows, cols]

name
Aaron        120
Penelope      80
Christina    172
Name: height, dtype: int64

### Select a single row or column as a DataFrame and NOT a Series

You can select a single row (or column) and return a DataFrame and not a Series if you use a list to make the selection. Let's replicate the selection from the previous example, but use a one-item list for the column selection.

In [43]:
rows = [2, 3, 5]
cols = [4]
df.iloc[rows, cols]

Unnamed: 0_level_0,height
name,Unnamed: 1_level_1
Aaron,120
Penelope,80
Christina,172


### Select a single row as a Series

We can select a single row by providing a single integer as the row selection for `iloc`. We use an empty slice to select all of the columns. Because we are selecting a single row, a Series is returned. Just as with `loc`, the returned output can be confusing as the original horizontal row is now displayed vertically.

In [44]:
df.iloc[2, :]

state      FL
color     red
         ... 
height    120
score     9.0
Name: Aaron, Length: 6, dtype: object

To maintain the original orientation, we can select the row using a one-item list which returns a DataFrame.

In [45]:
df.iloc[[2], :]

Unnamed: 0_level_0,state,color,food,age,height,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aaron,FL,red,Mango,12,120,9.0


## Summary of `iloc`

The `iloc` indexer is analogous to `loc` but only uses **integer location** for selection. The official pandas documentation refers to it as selection by **position**.

* Uses only integer location
* Selects rows and columns simultaneously with `df.iloc[rows, cols]`
* Selection can be
    * a single integer
    * a list of integers
    * a slice of integers
* A comma separates row and column selections
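
The type of object returned depends on whether each selection is a single integer or a list/slice. The sketch below collects the cases from this chapter in one place, using a hand-built copy of the sample data (an assumption).

```python
import pandas as pd

# Hand-built copy of the sample data shown above.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

value = df.iloc[3, 2]             # integer, integer -> scalar
row = df.iloc[2, :]               # integer, slice   -> Series
column = df.iloc[[2, 3, 5], 4]    # list, integer    -> Series
one_cell = df.iloc[[3], [2]]      # list, list       -> DataFrame
block = df.iloc[2:4, [4, 2]]      # slice, list      -> DataFrame

for result in (value, row, column, one_cell, block):
    print(type(result).__name__)
```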

## Exercises

Read in the movie dataset by executing the cell below and use it for the following exercises.

In [46]:
pd.set_option('display.max_columns', 50)
movie = pd.read_csv('input/movie.csv', index_col='title')
movie.head(3)

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8


## Example

In [47]:
# Create dictionary of test scores
import pandas as pd
test_dict = {'Corey':[63,75,88], 'Kevin':[48,98,92], 'Akshay': [87, 86, 85]}
df = pd.DataFrame(test_dict)
df

Unnamed: 0,Corey,Kevin,Akshay
0,63,48,87
1,75,98,86
2,88,92,85


You can inspect the DataFrame. First, each dictionary key is listed as a column. Second, the rows are labeled with indices starting with 0 by default. Third, the visual layout is clear and legible.
Each column and row of a DataFrame is represented as a Series. A Series is a one-dimensional labeled array. Note that one-dimensional data can be held in either a Series or a NumPy array; however, these are two distinct data types and are not interchangeable.
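
As a brief illustration of how a Series and a NumPy array differ, the sketch below pulls one column out of the test score DataFrame and converts it with `to_numpy`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

test_dict = {'Corey': [63, 75, 88], 'Kevin': [48, 98, 92], 'Akshay': [87, 86, 85]}
df = pd.DataFrame(test_dict)

# A column is a Series: values plus index labels and a name.
corey = df['Corey']

# to_numpy keeps only the values and drops the labels.
corey_array = corey.to_numpy()

print(type(corey).__name__, list(corey.index), corey.name)
print(type(corey_array).__name__, corey_array.tolist())
```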

In [48]:
# Transpose DataFrame
df = df.T
df

Unnamed: 0,0,1,2
Corey,63,75,88
Kevin,48,98,92
Akshay,87,86,85


In [49]:
# Rename Columns
df.columns = ['Quiz_1', 'Quiz_2', 'Quiz_3']
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3
Corey,63,75,88
Kevin,48,98,92
Akshay,87,86,85


Selecting a single row or column:

In [50]:
# Access first row by index number
df.iloc[0]    # .iloc takes a row selection and, optionally, a column selection

Quiz_1    63
Quiz_2    75
Quiz_3    88
Name: Corey, dtype: int64

In [51]:
# Access first row by integer location, with an explicit empty slice for the columns
df.iloc[0,:]

Quiz_1    63
Quiz_2    75
Quiz_3    88
Name: Corey, dtype: int64

In [52]:
# Access first column by name
df['Quiz_1']

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64

In [53]:
# Access first column using dot notation
df.Quiz_1

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64

In [54]:
# Access first column by its integer location
df.iloc[:, 0]

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64
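
The three ways of selecting the first column return the same Series. The sketch below checks this on a hand-built copy of the quiz data and also shows one known limit of dot notation: it cannot reach a column whose name clashes with a DataFrame method. The `clash` DataFrame is a made-up example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'],
)

by_name = df['Quiz_1']
by_attribute = df.Quiz_1
by_position = df.iloc[:, 0]
all_equal = by_name.equals(by_attribute) and by_name.equals(by_position)
print(all_equal)

# A column called 'mean' is hidden by the DataFrame.mean method.
clash = pd.DataFrame({'mean': [1, 2]})
attribute_is_method = callable(clash.mean)  # the method, not the column
column = clash['mean']                      # brackets still reach the column
print(attribute_is_method, column.tolist())
```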

## Computing DataFrames within DataFrames

In [55]:
# Defining a new DataFrame from first 2 rows and last 2 columns 
rows = ['Corey', 'Kevin']
cols = ['Quiz_2', 'Quiz_3']
df_spring = df.loc[rows, cols]
df_spring

Unnamed: 0,Quiz_2,Quiz_3
Corey,75,88
Kevin,98,92


In [56]:
# Select first 2 rows and last 2 columns using index numbers
df.iloc[[0,1], [1,2]]

Unnamed: 0,Quiz_2,Quiz_3
Corey,75,88
Kevin,98,92


In [57]:
# Select first 2 rows and last 2 columns using integer slices
df.iloc[0:2, 1:3]

Unnamed: 0,Quiz_2,Quiz_3
Corey,75,88
Kevin,98,92


In [58]:
# Define new column as mean of other columns
df['Quiz_Avg'] = df.mean(axis=1)
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_Avg
Corey,63,75,88,75.333333
Kevin,48,98,92,79.333333
Akshay,87,86,85,86.0
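
The `axis` argument controls the direction of the calculation. As a small sketch on a hand-built copy of the quiz data, `axis=1` averages across the columns to give one value per student, while `axis=0` averages down the rows to give one value per quiz.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'],
)

# One mean per row (student).
row_means = df.mean(axis=1)

# One mean per column (quiz).
column_means = df.mean(axis=0)

print(row_means)
print(column_means)
```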


In [59]:
# Create a new column from a list
df['Quiz_4'] = [92, 95, 88]
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_Avg,Quiz_4
Corey,63,75,88,75.333333,92
Kevin,48,98,92,79.333333,95
Akshay,87,86,85,86.0,88


In [60]:
# Delete the Quiz_Avg column as it is no longer needed
del df['Quiz_Avg']
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_4
Corey,63,75,88,92
Kevin,48,98,92,95
Akshay,87,86,85,88
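
The `del` statement modifies the DataFrame in place. An alternative, sketched below on a hand-built two-column example, is `DataFrame.drop`, which returns a new DataFrame and leaves the original untouched. Which one fits is a matter of preference.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_Avg': [75.3, 79.3, 86.0]},
    index=['Corey', 'Kevin', 'Akshay'],
)

# drop returns a new DataFrame without the column.
trimmed = df.drop(columns='Quiz_Avg')
still_in_original = 'Quiz_Avg' in df.columns

# del removes the column from df itself.
del df['Quiz_Avg']

print(list(trimmed.columns), still_in_original, list(df.columns))
```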


## Concatenating and Finding the Mean with Null Values for Our Test Score Data

In [61]:
import numpy as np
# Create new DataFrame of one row
df_new = pd.DataFrame({'Quiz_1':[np.nan], 
                       'Quiz_2':[np.nan], 
                       'Quiz_3': [np.nan],
                       'Quiz_4':[71]}, index=['Adrian'])
df_new

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_4
Adrian,,,,71


In [68]:
# Concatenate the DataFrame with the new row, Adrian, and display the result
df = pd.concat([df, df_new])
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_4
Corey,63.0,75.0,88.0,92
Kevin,48.0,98.0,92.0,95
Akshay,87.0,86.0,85.0,88
Adrian,,,,71


In [64]:
# Create a new column of row means, ignoring NaN values
df['Quiz_Avg'] = df.mean(axis=1, skipna=True)
df

Unnamed: 0,Quiz_1,Quiz_2,Quiz_3,Quiz_4,Quiz_Avg
Corey,63.0,75.0,88.0,92,79.5
Kevin,48.0,98.0,92.0,95,83.25
Akshay,87.0,86.0,85.0,88,86.5
Adrian,,,,71,71.0
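
The `skipna` argument is `True` by default, so missing values are left out of the mean unless you ask otherwise. The sketch below uses a hand-built two-row excerpt of the quiz data to compare both settings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63.0, np.nan], 'Quiz_4': [92, 71]},
    index=['Corey', 'Adrian'],
)

# Default behavior: NaN values are skipped.
with_skip = df.mean(axis=1)

# skipna=False: any NaN in the row makes the result NaN.
without_skip = df.mean(axis=1, skipna=False)

print(with_skip)
print(without_skip)
```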


In [65]:
# Quiz_4 is an int column while the others are float; astype(float) returns a float copy of it
df.Quiz_4.astype(float)

Corey     92.0
Kevin     95.0
Akshay    88.0
Adrian    71.0
Name: Quiz_4, dtype: float64
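
Keep in mind that `astype` returns a new Series and does not change the DataFrame. To make the conversion stick, assign the result back to the column, as in this sketch on a hand-built copy of the `Quiz_4` column.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_4': [92, 95, 88, 71]},
    index=['Corey', 'Kevin', 'Akshay', 'Adrian'],
)

# astype returns a converted copy; df is unchanged.
converted = df['Quiz_4'].astype(float)
dtype_before = df['Quiz_4'].dtype

# Assign back to store the converted column in df.
df['Quiz_4'] = df['Quiz_4'].astype(float)
dtype_after = df['Quiz_4'].dtype

print(dtype_before, dtype_after)
```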