# Miscellaneous Subset Selection

In this chapter, a few more methods for subset selection are described. These methods do not add any functionality beyond what has already been covered, but are included for completeness.

I personally do not use the methods described in this chapter and suggest that you also avoid them. They are all valid syntax and some pandas users do actually use them, so you may find them valuable.

## Selecting a column with dot notation

In my opinion, the best way to select a single column from a DataFrame as a Series is by placing the name of the column within *just the brackets*. There is an alternative way to select a single column of data, and that is with dot notation. Let's read in the `sample_data2.csv` dataset.

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data2.csv')
df

Unnamed: 0,name,average score,max
0,Niko,99,100
1,Penelope,100,102
2,Aria,88,93


Place the name of the column directly after the dot as if it were an attribute.

In [2]:
df.name

0        Niko
1    Penelope
2        Aria
Name: name, dtype: object

This produces the same result as using *just the brackets*.

In [3]:
df['name']

0        Niko
1    Penelope
2        Aria
Name: name, dtype: object

### Avoid dot notation - use just the brackets

Although this method for column selection requires less syntax and is used by many pandas users, it has many downsides. The following is a partial list of the functionality that is impossible using dot notation.

* Select column names with spaces
* Select column names that have the same name as methods
* Select columns with variables
* Select columns that begin with numbers
* Select columns that are non-strings

Examples of some of the above scenarios will now be covered. Using dot notation does not allow you to select columns with spaces. Selecting the column `average score` raises a syntax error.

In [4]:
df.average score

SyntaxError: invalid syntax (190530440.py, line 1)

The only way to select this column is with *just the brackets*.

In [5]:
df['average score']

0     99
1    100
2     88
Name: average score, dtype: int64

Dot notation is unable to select columns that have the same name as methods. For instance, `max` is a method that all DataFrames have. In this particular DataFrame, it is also the name of a column. Attempting to select it via dot notation will access the method instead of the column.

In [6]:
df.max

<bound method DataFrame.max of        name  average score  max
0      Niko             99  100
1  Penelope            100  102
2      Aria             88   93>

Again, the only way to select this column is with *just the brackets*.

In [8]:
df['max']

0    100
1    102
2     93
Name: max, dtype: int64

Dot notation is unable to select a column using a variable. Let's say we assign the string `'name'`, which is the name of the first column, to the variable `col`. Attempting to select the column via dot notation raises an error because pandas looks for an attribute literally named `col`.

In [9]:
col = 'name'
df.col

AttributeError: 'DataFrame' object has no attribute 'col'

Once again, use *just the brackets*.

In [10]:
df[col]

0        Niko
1    Penelope
2        Aria
Name: name, dtype: object
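
The remaining scenarios from the list above can be sketched in the same way. The small DataFrame below is made up purely for illustration (it is not one of the datasets used in this chapter): one column name begins with a number, one is the integer `5` rather than a string, and one shares its name with the `max` method.

```python
import pandas as pd

# Hypothetical data, constructed only for this illustration
demo = pd.DataFrame({
    '2020 sales': [1, 2],   # begins with a number (and contains a space)
    5: [3, 4],              # a non-string column name
    'max': [7, 8],          # same name as the DataFrame.max method
})

# `demo.2020 sales` and `demo.5` are not valid Python, so they cannot be written.
# Just the brackets handles every column name the same way.
sales = demo['2020 sales']
five = demo[5]
max_col = demo['max']

# Dot notation finds the method, not the column
is_method = callable(demo.max)

# Selecting with a variable, here inside a loop, requires the brackets
first_values = {col: demo[col].iloc[0] for col in demo.columns}
```

Because the brackets accept any column name, the same code keeps working no matter how the columns happen to be named.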

### Video with 10 reasons why using the brackets is superior

There are actually many more reasons to use the brackets over dot notation. If you are interested in hearing all of my reasons, [watch this video][1].

[1]: https://www.youtube.com/watch?v=LxZvl9Mc1cY

## Selecting rows with just the brackets using slice notation

So far, we have covered three ways to select subsets of data with *just the brackets*. You can use a single string, a list of strings, or a boolean Series. Let's quickly review those ways right now using the bikes dataset.

In [9]:
bikes = pd.read_csv('../data/bikes.csv')

### A single string

In [10]:
bikes['tripduration'].head(3)

0     993
1     623
2    1040
Name: tripduration, dtype: int64

### A list of strings

In [11]:
cols = ['gender', 'tripduration']
bikes[cols].head(3)

Unnamed: 0,gender,tripduration
0,Male,993
1,Male,623
2,Male,1040


### A boolean Series

The previous two examples selected columns. Boolean Series select rows.

In [12]:
filt = bikes['tripduration'] > 5000
bikes[filt].head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,35.0,Millennium Park,35.0,79.0,13.8,cloudy
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19.0,Lake Shore Dr & Monroe St,11.0,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15.0,Sheffield Ave & Kingsbury St,15.0,82.9,5.8,mostlycloudy


### Using a slice

It is possible to use slice notation within *just the brackets*. For example, the following selects the rows beginning at location 2 up to, but not including, location 10 with a step size of 3. You can even use slice notation when the index consists of strings.

In [13]:
bikes[2:10:3]

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
5,Male,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,15.0,Clark St & Wrightwood Ave,15.0,73.0,17.3,mostlycloudy
8,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,31.0,Wood St & Division St,15.0,71.1,0.0,cloudy


### I do not recommend using slicing with *just the brackets*

Although slicing with *just the brackets* seems simple, I do not recommend using it. This is because it is ambiguous and can make selections either by integer location or by label. I always prefer explicit, unambiguous methods. Both `loc` and `iloc` are unambiguous and explicit, meaning that even without knowing anything about the DataFrame, you would be able to explain exactly how the selection will take place. If you do want to slice the rows, then use `loc` if you are using labels or `iloc` if you are using integer location, but do not use *just the brackets*.
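
To make the ambiguity concrete, here is a minimal sketch using a made-up DataFrame with string labels in the index (the name `small` and its data are assumptions for illustration only). The same bracket syntax selects by label in one line and by integer location in the next, and the two even treat the stop value differently.

```python
import pandas as pd

# Hypothetical data with string labels in the index
small = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Just the brackets: the same syntax behaves in two different ways
by_label = small['a':'c']     # label slice, includes 'c' -> rows a, b, c
by_position = small[0:2]      # integer slice, excludes location 2 -> rows a, b

# Explicit equivalents that state how the selection is made
explicit_label = small.loc['a':'c']
explicit_position = small.iloc[0:2]
```

With `loc` and `iloc`, a reader can tell which kind of selection is happening without first inspecting the index.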

## Selecting a single cell with `at` and `iat`

pandas provides two more rarely seen indexers, `at` and `iat`. These indexers are analogous to `loc` and `iloc` respectively, but only select a single cell of a DataFrame. Since they only select a single cell, you must pass both a row and a column selection, as labels for `at` (like `loc`) or as integer locations for `iat` (like `iloc`). Let's see an example of each.

In [17]:
bikes

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...
50084,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,27.0,Clark St & Elm St,27.0,5.0,16.1,partlycloudy
50085,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),16.0,Union Ave & Root St,11.0,5.0,16.1,partlycloudy
50086,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,31.0,Halsted St & Blackhawk St (*),20.0,5.0,16.1,partlycloudy
50087,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,23.0,Kingsbury St & Kinzie St,31.0,7.0,11.5,partlycloudy


In [16]:
bikes.at[40, 'temperature']

np.float64(87.1)

In [21]:
bikes.iat[-30, 5]

np.float64(23.0)

The current index labels for `bikes` are integers, which is why the number 40 was used above. It is the label for a row, but also happens to be an integer.
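
The difference between the two indexers is easier to see when the labels are not integers. The sketch below uses a small made-up DataFrame (the name `scores` and its contents are hypothetical) with strings in the index, so the label and the integer location of a row can no longer be confused.

```python
import pandas as pd

# Hypothetical data with string labels so that labels and locations differ
scores = pd.DataFrame(
    {'math': [90, 75], 'art': [60, 88]},
    index=['niko', 'aria'],
)

by_label = scores.at['aria', 'art']   # row label, column label
by_location = scores.iat[1, 1]        # row location, column location
```

Both selections return the same cell, the value 88.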

### What's the purpose of these indexers?

All usages of `at` and `iat` may be replaced with `loc` and `iloc` and would produce the exact same results. The `at` and `iat` indexers are optimized to select a single cell of data and therefore provide slightly better performance than `loc` or `iloc`. Let's verify below that `loc` and `iloc` return the same values as before.

In [19]:
bikes.loc[40, 'temperature']

np.float64(87.1)

In [20]:
bikes.iloc[-30, 5]

np.float64(23.0)

### I never use these indexers

Personally, I never use these specialty indexers as the performance advantage for a single selection is minor. It would require a case where single element selections were happening in great numbers to see any significant improvement and doing so is rare in data analysis.

### Much bigger performance improvement using numpy directly

If you truly wanted a large performance improvement for single-cell selection, you would select directly from numpy arrays and not a pandas DataFrame. Below, the data is extracted into the underlying numpy array with the `values` attribute. We then time the performance of selecting with numpy and also with `iat` and `iloc` on a DataFrame. 

The timing is done using the magic command `%time`. This is a special command only available in a Jupyter Notebook (or IPython shell). The **Wall time** provides the total time it took to complete the operation. On my machine, `iat` shows a negligible improvement over `iloc`, but selecting with numpy is about 15x as fast. There is no comparison here. If you care about performance for selecting a single cell of data, use numpy.

In [22]:
values = bikes.values

In [23]:
%time values[-30, 5]

CPU times: total: 0 ns
Wall time: 0 ns


23.0

In [27]:
%time bikes.iat[-30, 5]

CPU times: total: 0 ns
Wall time: 0 ns


np.float64(23.0)

In [26]:
%time bikes.iloc[-30, 5]

CPU times: total: 0 ns
Wall time: 0 ns


np.float64(23.0)
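
A single `%time` measurement of such a fast operation can fall below the timer's resolution, which is likely why the outputs above display 0 ns. As a sketch that does not depend on Jupyter, the standard library's `timeit` module can repeat each selection many times and report the total. The DataFrame here is synthetic and the repetition count is an arbitrary choice, so treat the numbers as rough and expect them to vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic numeric data, not the bikes dataset
data = pd.DataFrame(np.arange(60_000, dtype='float64').reshape(10_000, 6))
values = data.values

n = 10_000  # number of repetitions, chosen arbitrarily

# Total seconds taken to run each single-cell selection n times
t_numpy = timeit.timeit(lambda: values[-30, 5], number=n)
t_iat = timeit.timeit(lambda: data.iat[-30, 5], number=n)
t_iloc = timeit.timeit(lambda: data.iloc[-30, 5], number=n)

# Report the average time per selection in microseconds
for name, total in [('numpy', t_numpy), ('iat', t_iat), ('iloc', t_iloc)]:
    print(f'{name:>5}: {total / n * 1e6:.3f} us per selection')
```

Averaging over many repetitions gives a steadier basis for comparing the three approaches than a single timed call.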

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Provide several example column names that are not possible to select using dot notation.</span>

### Exercise 2

<span style="color:green; font-size:16px">Use the `%time` magic command to compare the performance difference between `loc` and `at` and between `iloc` and `iat`.</span>