
# pandas DataFrames
---

## Questions:
- How do I read in data from a file?
- How can I work with data in tabular format (tables)?
- How can I do basic descriptive statistics on tabular data?

## Learning Objectives:
- Select individual values from a Pandas dataframe
- Select entire rows or entire columns from a dataframe
- Select a subset of both rows and columns from a dataframe in a single operation
- Select a subset of a dataframe by a single Boolean criterion
- Obtain descriptive statistics for subsets of data within a table
- Use the split-apply-combine paradigm to work with data
---

## What is pandas?

Bad news first: there are no cute, black-and-white bears here. [pandas](https://pandas.pydata.org/docs/) (whose official name starts with a lower-case "p") is a Python *library* for working with data in a tabular format, such as is found in CSV files, Microsoft Excel, and Google Sheets. Unlike Excel or Sheets, pandas is not a point-and-click graphical interface for working with these files; everything is done through Python code. But compared to other ways of storing data in Python, such as lists and dictionaries, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas' mission statement is "...to be the fundamental high-level building block for doing practical, real world data analysis in Python".

The primary units of pandas data storage you will work with are [DataFrames](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented) (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas [Series](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays). 

Among the things that make pandas so attractive are the powerful interface for accessing individual records of the table, proper handling of missing values, and relational-database-style operations between DataFrames. In addition, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are *vectorized*, which means that they automatically apply to all values in a DataFrame or Series without the need to write `for` loops to execute the same operation on a set of cells.

pandas is built on top of the [NumPy](https://numpy.org) library. It's worth noting for your future reference that many of the methods defined for NumPy arrays also work on pandas Series and DataFrames.



~~~python
import pandas
~~~

Once a library is imported, we can use functions and methods from it. For functions, however, we have to tell Python which library the function comes from. For example, pandas has a function to import data from CSV (comma-separated value) files, called `read_csv`. To call it, we put the library name and a dot in front of the function name (the file to read goes inside the parentheses):

~~~python
pandas.read_csv()
~~~

Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an *alias* — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias `pd` like this:

~~~python
import pandas as pd
~~~

Then to read a CSV file we could use:

~~~python
pd.read_csv()
~~~

In the cell below, import pandas with the alias pd:

## Dataframes and Series

The main type of data structure in `pandas` is a `DataFrame`, which
organizes data into a 2D table, like a spreadsheet. Unlike a `numpy`
array, however, each column in a `DataFrame` can have different data
types - for example, you can have a string column, an integer column,
and a float column all in the same `DataFrame`.

(The other major type of data in `pandas` is a `Series`, which is like a
1D array: any individual row or column from a `DataFrame` will be a
`Series`.)

You *can* create a `DataFrame` or a `Series` “by hand” - for example,
try

``` python
pd.Series([1,2,3,99])
```

or

``` python
pd.DataFrame({'fruit': ['apple', 'banana', 'kiwi'], 'cost': [0.55, 0.99, 1.24]})
```
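As a quick check of the points above, the following sketch builds the same small table and inspects it. The variable names `fruit_df` and `cost_column` are illustrative choices, not part of the lesson's data:

```python
import pandas as pd

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana', 'kiwi'],
                         'cost': [0.55, 0.99, 1.24]})

# Each column keeps its own data type
print(fruit_df.dtypes)

# A single column pulled out of a DataFrame is a Series
cost_column = fruit_df['cost']
print(type(cost_column))

# Operations are vectorized: this doubles every cost without a loop
print(cost_column * 2)
```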

## Importing data with pandas

As noted, we can read a CSV file and use it to create a pandas DataFrame, with the function `pd.read_csv()`. [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas ("CSV" stands for "comma-separated values"). Often the first row of a CSV file will be the *header*, containing labels for each column.

The ADNI data subset is in CSV format, so let's load in the data with the command below. These data were collected as part of the Alzheimer’s Disease Neuroimaging Initiative (ADNI). ADNI researchers collected a wide variety of measurements, including MRI and PET images, genetics, cognitive tests, and CSF and blood biomarkers, as predictors of the disease (see https://adni.loni.usc.edu/ for more information).

Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame `df`. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.

The data are stored in a subfolder called `data`, so as the argument to `pd.read_csv()` below we give the folder name followed by a slash, then the file name:

~~~python
df = pd.read_csv('data/TADPOLE_select.csv')
~~~

We can view the contents of the DataFrame `df` by simply typing its name and running the cell. Note that, unlike most of the examples we've used in previous lessons, we *don't* use the `print()` function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.

That is, run this command: `df` — not `print(df)` — in the cell below.

You'll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the **index** of the DataFrame, and provides one way of accessing data by rows. Across the top, you'll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).

You can access the column names in the data using the `columns` attribute of the DataFrame.

DataFrames have lots of other useful attributes as well, like `index`, `dtypes`, `size`, `shape`, and `ndim`.

The `index` attribute holds the row labels of a DataFrame. The row labels can be integers (0, 1, 2, 3, …) or names.

We can also get a quick summary with the `.info()` method.
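A minimal sketch of these attributes, using a small toy table rather than the full ADNI data (the name `toy` and its handful of values are only for illustration):

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({'DX': ['NL', 'MCI', 'Dementia'],
                    'AGE': [74.3, 70.6, 81.3],
                    'MMSE': [28.0, 30.0, 20.0]})

print(toy.columns)   # column labels
print(toy.index)     # row labels (0, 1, 2 by default)
print(toy.dtypes)    # data type of each column
print(toy.shape)     # (number of rows, number of columns)
print(toy.size)      # total number of cells
print(toy.ndim)      # number of dimensions (2 for a DataFrame)

toy.info()           # prints a summary: column names, non-null counts, dtypes
```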

`read_csv` is for “flat” text files, where each record is on its own
row, and the fields in a row are separated by some delimiter
(e.g. comma). Other pandas functions exist for loading other kinds of
data (from a database, an Excel file, etc.).

## Heads or Tails?

We might want to "peek" at the DataFrame without printing out the entire thing, especially if it's big. We can see the first 5 rows of a DataFrame with the `.head()` method:

~~~python
df.head()
~~~

...or the last 5 rows with `.tail()`:

We can also see a random sample of rows from the DataFrame with `.sample()`, giving it a numerical argument to indicate the number of rows we want to see:

~~~python
df.sample(10)
~~~

Note that the `.head()` and `.tail()` methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.

Looking at some rows can help us spot obvious problems with data loading. For example, suppose we had tried to read in the data using a tab delimiter to separate fields on the same row, instead of a comma.
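A small sketch of that situation, assuming a tiny in-memory CSV (built with `io.StringIO`, with made-up contents) in place of the real file. Reading comma-separated text with `sep='\t'` does not raise an error, but everything lands in a single column, which `.head()` makes obvious:

```python
import io
import pandas as pd

# Hypothetical comma-separated content, for illustration only
csv_text = "PTID_Key,DX,AGE\n1,NL,74.3\n2,MCI,70.6\n"

# Correct: the default delimiter is a comma
good = pd.read_csv(io.StringIO(csv_text))

# Wrong delimiter: pandas finds no tabs, so each line becomes one field
bad = pd.read_csv(io.StringIO(csv_text), sep='\t')

print(good.head())
print(bad.head())
print(good.shape, bad.shape)
```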

## Accessing values in a DataFrame

One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations. 

### Select a column using `[]`

If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets. Below we are selecting the 'DX' (diagnosis) column

~~~python
df['DX']
~~~

Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we're asking for more than one column, we need to provide a *list* of columns inside the square brackets (so there are *two* sets of nested square brackets in the code below):

~~~python
df[['DX', 'AGE']]
~~~

### Numerical indexing using `.iloc[]`

Often we don't want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column], with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 4 rows and 4 columns, the indices of each cell are as shown:

|   | col 0  | col 1  | col 2  | col 3  |
|---|--------|--------|--------|--------|
| 0 | [0, 0] | [0, 1] | [0, 2] | [0, 3] |
| 1 | [1, 0] | [1, 1] | [1, 2] | [1, 3] |
| 2 | [2, 0] | [2, 1] | [2, 2] | [2, 3] |
| 3 | [3, 0] | [3, 1] | [3, 2] | [3, 3] |


Numerical indexing of DataFrames is done with `.iloc[]`. For example, to access the DX value for the patient in row 4 (the fifth row, since counting starts at 0), which is in the third column of our current DataFrame, we would use:

~~~python
df.iloc[4, 2]
~~~
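To make the [row, column] convention concrete, here is a sketch on a three-row toy table (the name `toy` is illustrative, and only three columns are included):

```python
import pandas as pd

toy = pd.DataFrame({'PTID_Key': [400, 564, 1354],
                    'EXAMDATE': ['9/8/05', '9/12/05', '4/19/07'],
                    'DX': ['NL', 'Dementia', 'MCI']})

# Row position 1 (second row), column position 2 (third column)
print(toy.iloc[1, 2])

# Negative positions count from the end, as with lists
print(toy.iloc[-1, 0])
```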

### Label-based indexing using `.loc[]`

The other way to access a location in a DataFrame is by its index and column *labels*, using `.loc[]`. As noted earlier, in the DataFrame we imported, the index values are currently numbers, which were created automatically when we imported the data. `.loc[]` always treats what you give it as labels rather than positions (positions are what `.iloc[]` is for, and you can't mix, say, a row position with a column label in one selection). With the automatic index, the labels are just the row numbers, which is not very informative. In the data set we imported, however, the first column of the CSV file is actually meant to be its index: while all other columns are data values (DX, AGE, APOE4, etc.), the first column identifies the patient with which each row of data is associated.

pandas has a method for setting an index column, `.set_index()`, where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:

~~~python
df = df.set_index('PTID_Key')
~~~

**Note** that we need to assign the result of this operation back to `df` (using `df = `), because `.set_index()` returns a new DataFrame rather than modifying `df` itself.

In the cell below, use the `.set_index()` method to set the index of `df` to `PTID_Key`, and then view the DataFrame again to see how it has changed.

Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument `index_col=` in the `pd.read_csv()` command:

~~~python
df = pd.read_csv('data/TADPOLE_select.csv', index_col='PTID_Key')
~~~



Now that we have defined the index, we can access a cognitive score (say, MMSE) for the patient with ID 347 by its index and column labels:

~~~python
df.loc[347, 'MMSE']
~~~
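The sketch below walks through the same two steps, setting the index and then selecting by label, on a three-row toy table (the name `toy` and the choice of columns are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'PTID_Key': [400, 564, 1354],
                    'DX': ['NL', 'Dementia', 'MCI'],
                    'MMSE': [28.0, 20.0, 30.0]})

# set_index returns a new DataFrame, so assign the result back
toy = toy.set_index('PTID_Key')

# Select by row label and column label
print(toy.loc[564, 'MMSE'])

# 564 is a label here, not a position: the table only has three rows
print(toy.index.tolist())
```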

## Use `:` on its own to mean all columns or all rows.

Using Python's familiar slicing notation (which we've previously used for strings and lists), we can use `:` with `.iloc[]` or `.loc[]`, to specify a range in a DataFrame.

For example, to see the data of patient with ID# 400 for every variable (column) in the DataFrame, we would use:

~~~python
df.loc[400, :]
~~~

Likewise, we could see the MMSE scores for every patient with:

~~~python
df.loc[:, 'MMSE']
~~~

You can also just specify the row index; if you don't specify anything for the columns, pandas assumes you want all columns:

~~~python
df.loc[400]
~~~

However, since the syntax for `.iloc[]` and `.loc[]` is [rows, columns], you cannot omit a row index; you need to use `:` if you want all rows.

## Slicing works on DataFrames

Slicing using numerical indices works similarly for DataFrames as we previously saw for strings and lists. For example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the "end" index):

~~~python
df.iloc[2:5, 4:8]
~~~

The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:

~~~python
df.iloc[5:-1, 8:]
~~~

**Note** however, that when using label-based indexing with `.loc[]`, pandas' slicing behaviour is a bit different. Specifically, the output *includes* the last item in the range, whereas numerical indexing with `.iloc[]` does not. 

So, considering that the first three rows of the DataFrame correspond to the patient IDs 400, 564 and 1354 and that columns 8 through 11 are the volumes for hippocampus, wholebrain, entorhinal cortex and fusiform regions, compare the output of:

~~~python
df.iloc[0:2, 8:11]
~~~

with:

~~~python
df.loc[400:1354, 'Hippocampus': 'Fusiform']
~~~

The "inclusive" label-based indexing with `.loc[]` is fairly intuitive, but it's important to remember that it works differently from numerical indexing.
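The difference is easy to see on a small toy table. In this sketch (the name `toy` is illustrative), both slices name the same start and end points, but the positional slice stops one short:

```python
import pandas as pd

toy = pd.DataFrame({'Hippocampus': [8336.0, 5319.0, 7420.0],
                    'WholeBrain': [1229740.0, 1129830.0, 1060910.0],
                    'Entorhinal': [4177.0, 1791.0, 4169.0]},
                   index=[400, 564, 1354])

# Positional slice: stops before position 2, so two rows and two columns
by_position = toy.iloc[0:2, 0:2]

# Label slice: includes both end labels, so three rows and three columns
by_label = toy.loc[400:1354, 'Hippocampus':'Entorhinal']

print(by_position.shape)
print(by_label.shape)
```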

## Use lists to select non-contiguous sections of a DataFrame

While slicing can be very useful, sometimes we might want to extract values that aren't next to each other in a DataFrame. For example, what if we only want values for the MMSE scores and diagnosis (DX), for patients with IDs (398, 613, 522)? Neither these patient IDs nor variables are in adjacent columns/rows in the DataFrame. With `.loc[]`, we can use lists, rather than ranges separated by `:`, as selectors:

~~~python
df.loc[[398, 613, 522], ['DX', 'MMSE']]
~~~

We can equivalently write the command over several lines to make it a bit easier to read:

~~~python
df.loc[[398, 613, 522], 
        ['DX', 'MMSE']]
~~~

We could also define those lists as variables, and pass the variables to `.loc[]`. This might be useful if you were going to use the lists more than once, as well as for clarity:

~~~python
patient_IDs = [398, 613, 522]
variables = ['DX', 'MMSE']
df.loc[patient_IDs, variables]
~~~

We can take this a step further, and assign the output of a `.loc[]` selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection. 

~~~python
subset_data = df.loc[patient_IDs, variables]
~~~

Lists also work for selecting non-contiguous sections with numerical indexing, using `.iloc[]`. Try the following:
```python 
df.iloc[[0,2], [1,-2]]
```
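On a small made-up table (the column names `a` to `d` are placeholders), the same pattern picks the first and third rows, and the second and second-to-last columns:

```python
import pandas as pd

# Placeholder data, for illustration only
toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [4, 5, 6],
                    'c': [7, 8, 9],
                    'd': [10, 11, 12]})

# Rows at positions 0 and 2; columns at positions 1 and -2 (second from the end)
picked = toy.iloc[[0, 2], [1, -2]]
print(picked)
```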

## It's easy to do simple math and statistics in DataFrames

We previously learned about functions to get simple statistical values out of a Python list, like `max()` and `min()`. pandas provides these as methods (`.max()`, `.min()`), along with many more. For example, we can view the mean volume of the Hippocampus across all patients (rows) with:

~~~python
df.loc[:, 'Hippocampus'].mean()
~~~

Or the max along a row, say for patient ID 400:

~~~python
df.loc[400].max()
~~~

The above doesn't work because some columns are not numeric, and text and numbers cannot be compared with each other. We can first select the columns with a numeric data type (say, float) and then compute the max:
```python
df.loc[400, df.columns[df.dtypes == 'float64']].max()
```
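An alternative sketch uses `select_dtypes`, which picks columns by data type and so also catches integer columns. It is shown here on a small toy table (the name `toy` is illustrative) as one possible approach, not as the lesson's prescribed method:

```python
import pandas as pd

toy = pd.DataFrame({'DX': ['NL', 'Dementia'],
                    'AGE': [74.3, 81.3],
                    'PTEDUCAT': [16, 18]},
                   index=[400, 564])

# Keep only the numeric columns (floats and integers)
numeric = toy.select_dtypes(include='number')

# Largest numeric value in the row labelled 400
print(numeric.loc[400].max())
```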

Another useful method is `.describe()`, which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:

~~~python
df.describe()
~~~

|       | AGE | PTEDUCAT | APOE4 | Hippocampus | WholeBrain | Entorhinal | Fusiform | MidTemp | ICV | ADAS13 | MMSE | Ventricles |
|-------|-----|----------|-------|-------------|------------|------------|----------|---------|-----|--------|------|------------|
| count | 1669.0 | 1669.0 | 1667.0 | 1143.0 | 1332.0 | 1091.0 | 1091.0 | 1091.0 | 1360.0 | 1434.0 | 1437.0 | 1324.0 |
| mean  | 73.791791 | 15.937088 | 0.561488 | 6817.128609 | 1025091.0 | 3529.007333 | 17729.987168 | 19695.106324 | 1523213.0 | 17.575767 | 26.899095 | 40817.602719 |
| std   | 7.196902 | 2.831223 | 0.660335 | 1215.890371 | 111486.6 | 786.346886 | 2781.797029 | 3041.776697 | 166830.3 | 10.524535 | 3.14753 | 22918.334187 |
| min   | 55.0 | 4.0 | 0.0 | 2991.0 | 669364.0 | 1143.0 | 10012.0 | 9375.0 | 708913.0 | 0.0 | 7.0 | 5650.0 |
| 25%   | 69.2 | 14.0 | 0.0 | 5963.0 | 947794.0 | 3025.0 | 15958.5 | 17736.0 | 1404960.0 | 9.33 | 25.0 | 23611.5 |
| 50%   | 73.9 | 16.0 | 0.0 | 6929.0 | 1021660.0 | 3563.0 | 17633.0 | 19675.0 | 1513275.0 | 15.165 | 28.0 | 35650.5 |
| 75%   | 79.0 | 18.0 | 1.0 | 7664.5 | 1098638.0 | 4066.0 | 19552.5 | 21813.0 | 1624448.0 | 24.0 | 29.0 | 52629.75 |
| max   | 91.4 | 20.0 | 2.0 | 10699.0 | 1486040.0 | 5896.0 | 29950.0 | 32189.0 | 2110290.0 | 71.0 | 30.0 | 156066.0 |


For categorical variables, you can compute the mode or check the unique values using the `.mode()` and `.unique()` methods, e.g.:
```python
df.loc[:,'PTETHCAT'].unique()
```
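A short sketch of these methods on a made-up Series of diagnoses (the name `dx` and its values are hypothetical). `value_counts()` is included as a related method that shows how often each value occurs:

```python
import pandas as pd

# Hypothetical diagnoses, for illustration only
dx = pd.Series(['NL', 'MCI', 'MCI', 'Dementia', 'MCI'])

print(dx.unique())        # the distinct values, in order of first appearance
print(dx.mode())          # a Series, because several values can tie for most frequent
print(dx.mode()[0])       # the first (here, only) most frequent value
print(dx.value_counts())  # how often each value occurs
```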

### Mini-Exercise
In the cell below, try to view descriptive statistics for columns [MMSE, Ventricles, ADAS13]

## Evaluate cells based on conditions

pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like `<`, `>`, and `==`. For example, let's see which patients in a list had a diagnosis of dementia. The result is reported as a Boolean (True/False) for each cell.

~~~python
patients = [1511, 583, 284, 1007, 316, 1124, 1558, 1367, 823, 235]
df.loc[patients, 'DX']  == 'Dementia'
~~~

Evaluate all rows by using `:` as the row selector (that is, `df.loc[:, 'DX'] == 'Dementia'`), and compare the result with:

```python
df['DX'] == 'Dementia'
```

Try the same for numeric variables:

```python
df.loc[patients, 'AGE'] < 80
```

## Select values or NaN using a Boolean mask.

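A Series or DataFrame full of Booleans is sometimes called a *mask* because of how it can be used. A Boolean Series with one value per row, like the ones we created above, keeps only the rows where the value is True. A Boolean DataFrame with one value per cell instead keeps the cells that are True and replaces the rest with `NaN`, a special floating-point value meaning "not a number". This can be useful because pandas ignores NaN values by default when doing computations such as `.mean()`.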

We create a mask by assigning the output of a conditional statement to a variable name:


~~~python
mask = df.loc[:, 'DX'] == 'Dementia'
~~~

Then we can apply the mask to the DataFrame to get only the values that meet the criterion:

~~~python
df[mask]
~~~
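A sketch of both kinds of mask on a small toy table (the names `toy`, `row_mask`, and `cell_mask` are illustrative): a Boolean Series selects whole rows, while a Boolean DataFrame of the same shape keeps matching cells and fills the rest with `NaN`:

```python
import pandas as pd

toy = pd.DataFrame({'AGE': [74.3, 81.3, 70.6],
                    'MMSE': [28.0, 20.0, 30.0]},
                   index=[400, 564, 1354])

# Boolean Series (one value per row): keeps only the rows that are True
row_mask = toy['AGE'] < 80
print(toy[row_mask])

# Boolean DataFrame (one value per cell): False cells become NaN
cell_mask = toy > 25
masked = toy[cell_mask]
print(masked)

# NaN values are skipped by default when computing statistics
print(masked['MMSE'].mean())
```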

You can also combine different conditions with `&` (and) or `|` (or), wrapping each condition in parentheses:
```python
mask = (df.loc[:, 'AGE'] < 80) & (df.loc[:, 'PTGENDER'] == 'Male')
```
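The parentheses matter because `&` and `|` bind more tightly than comparisons like `<` and `==`. A minimal sketch on a toy table (the names `toy`, `younger_men`, and `women_or_over_80` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'AGE': [74.3, 81.3, 69.7],
                    'PTGENDER': ['Male', 'Male', 'Female']},
                   index=[400, 564, 221])

# & means "and", | means "or"; each condition needs its own parentheses
younger_men = (toy['AGE'] < 80) & (toy['PTGENDER'] == 'Male')
women_or_over_80 = (toy['PTGENDER'] == 'Female') | (toy['AGE'] > 80)

print(toy[younger_men])
print(toy[women_or_over_80])
```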

As an example of how this might be used, the steps above would now allow us to find the mean age of patients with Dementia:

~~~python
mask = df.loc[:, 'DX'] == 'Dementia'
df_dementia = df[mask]
df_dementia.loc[:,'AGE'].mean()
~~~

### Mini-Exercise
Research has shown that the APOE4 gene can increase the risk of developing Alzheimer's disease. Compute the most frequent value of the APOE4 gene factor among patients who are cognitively normal (DX: 'NL') and among patients who have dementia.

Hint: You can use the `.mode()` method and Boolean masking.

Now, compute and compare the mean MMSE scores for patients with dementia and cognitively normal patients.

What about the hippocampal volumes?

## Split-Apply-Combine

A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously [described by Hadley Wickham in a 2011 paper](http://dx.doi.org/10.18637/jss.v040.i01).

pandas provides methods and grouping operations that are very efficient (*vectorized*) for split-apply-combine operations. 

As an example, let's say that we wanted to compare the average hippocampal volumes and MMSE scores for different groups of patients. To do this, we first have to create lists defining the patients belonging to each of these groups. In the following, we just create these groups randomly. 

~~~python
import random
group1 = random.sample(list(df.index.values), 10)
group2 = random.sample(list(df.index.values), 10)
group3 = random.sample(list(df.index.values), 10)
group4 = random.sample(list(df.index.values), 10)
~~~

Next we can make a new column simply by using `.loc[]` with the rows specified by one of the lists we just defined and a column name that doesn't already exist (in this case, we'll call it "group_id"), then assigning a group label to that combination of rows and column. We need to do this separately for each group. Note that when we first create the new column, pandas fills it with NaN values in any rows that were not covered by the assignment. For example, in the code below, the first line will create the column "group_id", filling it with "1" for any row in the `group1` list and with `NaN` in every other row. Because the groups were drawn independently, a patient may appear in more than one list, in which case the last assignment wins.

~~~python
df.loc[group1, 'group_id'] = '1'
df.loc[group2, 'group_id'] = '2'
df.loc[group3, 'group_id'] = '3'
df.loc[group4, 'group_id'] = '4'
~~~
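The way the new column is filled can be checked on a small toy table. In this sketch (the name `toy` is illustrative), only two of the four rows receive a label:

```python
import pandas as pd

toy = pd.DataFrame({'MMSE': [28.0, 20.0, 30.0, 24.0]},
                   index=[400, 564, 1354, 221])

# Assigning to a column that does not exist yet creates it
toy.loc[[400, 221], 'group_id'] = '1'
print(toy)

# Rows that were never assigned a label are missing (NaN) in the new column
print(toy['group_id'].isna().tolist())
```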

### Split

Now we can use this "group_id" column to split the data into groups, using a pandas method called `.groupby()`:

~~~python
grouped_patients = df.groupby('group_id')
~~~

Note that this step doesn't create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:

~~~python
type(grouped_patients)
~~~

### Apply

Now that we have split the data, we can apply a function separately to each group. Here we'll compute the mean hippocampal volumes and MMSE scores for each group:

~~~python
means_by_group = grouped_patients[['Hippocampus', 'MMSE']].mean()
~~~

### Combine

The combine step actually occurred with the *apply* step above: the result is automatically combined into a table of mean values organized by group. But since our *apply* step (`.mean()`) saved the result to a variable, we can view the resulting table as the output of the *combine* step:

~~~python
means_by_group
~~~

### Chaining

In Python, **chaining** refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don't assign the output to a variable name, it is displayed as output but not saved.

~~~python
df.groupby('group_id')[['Hippocampus', 'MMSE']].mean()
~~~

You can group data based on existing columns too:
```python
df.groupby('DX')['Hippocampus'].mean()
```

Compare with:
```python
df.loc[:,['DX', 'Hippocampus']].groupby('DX').mean()
```
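A sketch of the whole split-apply-combine sequence on a small made-up table (the name `toy` and its values are hypothetical). Note that rows whose grouping value is missing are left out of the result by default, and that `.agg()` can apply several functions at once:

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({'DX': ['NL', 'MCI', 'MCI', 'NL', None],
                    'MMSE': [28.0, 30.0, 24.0, 30.0, 29.0]})

# Split by diagnosis, apply the mean to each group, combine into one Series
mean_mmse = toy.groupby('DX')['MMSE'].mean()
print(mean_mmse)

# .agg() applies several functions at once
summary = toy.groupby('DX')['MMSE'].agg(['mean', 'count'])
print(summary)
```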

---
# Exercises

## Selecting Individual Values

Write an expression to find the MMSE score of patient ID # 635

## Extent of Slicing

1.  Do the two statements below produce the same output? (Hint: you might want to use the `.head()` method to remind yourself of the structure of the DataFrame)
2.  Based on this, what rule governs what is included (or not) in numerical slices and named slices in pandas?

~~~python
print(df.iloc[0:2, 0:2])
print(df.loc[400:1354, 'EXAMDATE':'AGE'])
~~~

For reference, `df` looks like this:

| PTID_Key | EXAMDATE | DX | AGE | PTGENDER | PTEDUCAT | PTETHCAT | PTMARRY | APOE4 | Hippocampus | WholeBrain | Entorhinal | Fusiform | MidTemp | ICV | ADAS13 | MMSE | Ventricles | group_id |
|----------|----------|----|-----|----------|----------|----------|---------|-------|-------------|------------|------------|----------|---------|-----|--------|------|------------|----------|
| 400 | 9/8/05 | NL | 74.3 | Male | 16 | Not Hisp/Latino | Married | 0.0 | 8336.0 | 1229740.0 | 4177.0 | 16559.0 | 27936.0 | 1984660.0 | 18.67 | 28.0 | 118233.0 | n |
| 564 | 9/12/05 | Dementia | 81.3 | Male | 18 | Not Hisp/Latino | Married | 1.0 | 5319.0 | 1129830.0 | 1791.0 | 15506.0 | 18422.0 | 1920690.0 | 31.00 | 20.0 | 84599.0 | n |
| 1354 | 4/19/07 | MCI | 70.6 | Male | 18 | Not Hisp/Latino | Married | 1.0 | 7420.0 | 1060910.0 | 4169.0 | 19522.0 | 22864.0 | 1627060.0 | 16.67 | 30.0 | 54752.0 | n |
| 221 | 3/26/07 | MCI | 69.7 | Female | 12 | Not Hisp/Latino | Widowed | 1.0 | 6688.0 | 949914.0 | 3817.0 | 13838.0 | 17121.0 | 1396120.0 | 27.33 | 24.0 | 16683.0 | 1 |
| 518 | 4/16/07 | MCI | 73.6 | Male | 4 | Hisp/Latino | Married | 0.0 | 5920.0 | 856238.0 | 3443.0 | 12897.0 | 12787.0 | 1398620.0 | 32.00 | 26.0 | 27088.0 | n |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 882 | 12/18/13 | NL | 69.3 | Male | 14 | Not Hisp/Latino | Married | 0.0 | 10602.0 | 1486040.0 | 4701.0 | 24783.0 | 32189.0 | 1998250.0 | 14.00 | 29.0 | 18633.0 | n |
| 250 | 8/15/07 | MCI | 74.4 | Male | 14 | Not Hisp/Latino | Married | 1.0 |  |  |  |  |  |  |  |  |  | n |
| 1507 | 3/5/09 | MCI | 82.8 | Female | 20 | Not Hisp/Latino | Married | 0.0 |  |  |  |  |  |  | 31.67 | 26.0 |  | n |
| 347 | 10/25/11 |  | 74.1 | Female | 18 | Not Hisp/Latino | Divorced | 0.0 | 6254.0 | 954172.0 | 2190.0 | 13415.0 | 18947.0 | 1415770.0 | 6.00 | 29.0 | 50196.0 | n |


## Reconstructing Data

Explain what each line in the following short program does. What is in `df1`, `df2`, etc.?

~~~python
df1 = pd.read_csv('data/TADPOLE_select.csv', index_col='PTID_Key')
df2 = df1[df1['PTGENDER'] == 'Female']
df3 = df2.drop(1507)
df4 = df3.drop('EXAMDATE', axis = 1)
#df4.to_csv('result.csv')
~~~

## Selecting Indices

Explain in simple terms what `idxmin` and `idxmax` do. When would you use these methods?

~~~python
data = pd.read_csv('data/TADPOLE_select.csv', index_col='PTID_Key')
numeric_columns = data.columns[data.dtypes == 'float64']
data = data.loc[:, numeric_columns]
data.idxmin()
~~~

~~~python
data.idxmax()
~~~

## Practice with Selection

Load the GDP data for Europe as `data`:
```python
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
```
Using this DataFrame, write an expression to select each of the following:

- GDP per capita for all countries in 1982

- GDP per capita for Denmark for all years

- GDP per capita for all countries for years *after* 1985

Note that a label slice starting at `'gdpPercap_1985'` does not give you an error, although no column with that name actually exists: when the column labels are in sorted order, pandas simply starts the slice at the first label that sorts after it. This is useful if new columns are added to the CSV file later.

- GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952

---
# Summary of Key Points:
- pandas DataFrames are a powerful way of storing and working with tabular (row/column) data
- pandas columns and rows can have names
- pandas row labels are called the *index*, which is numeric by default but can be set to other labels
- Use `.iloc[]` with a DataFrame to select values by integer location, using [row, column] format
- Use `.loc[]` with a DataFrame to select rows and/or columns by label, including label-based slices
- Use `:` on its own to mean all columns or all rows
- The result of slicing can be used in further operations
- Use comparisons to select data based on value
- Select values or `NaN` using a Boolean mask
- Use split-apply-combine to derive analytics from groupings within a DataFrame