# Subsetting 

Review core methods to select data from a `pandas.DataFrame`.

## Read in CSV

In [3]:
import pandas as pd

# Read in the file; the argument is the file path
df = pd.read_csv('data/wetlands_seasonal_bird_diversity.csv')

# Print data frame's first five rows 
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/wetlands_seasonal_bird_diversity.csv'

In [None]:
df.tail() # this is a method (a function attached to the object)

df.columns # this is an attribute 

df.dtypes # this is an attribute

type(df.dtypes)

df.shape
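
The difference between methods and attributes can be sketched on a small stand-in data frame. The `toy` frame and its values are hypothetical and only mimic the layout of the real file:

```python
import pandas as pd

# Hypothetical toy data; the real CSV has more rows and columns
toy = pd.DataFrame({
    'year': [2010, 2011, 2012],
    'MUL_spring': [46, 43, 47],
})

last_two = toy.tail(2)      # method: called with parentheses, can take arguments
n_rows, n_cols = toy.shape  # attribute: no parentheses, a (rows, columns) tuple
col_types = toy.dtypes      # attribute: a pandas.Series indexed by column name

print(type(col_types))
print(n_rows, n_cols)
```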

## Selecting a single column 

Simplest case: selecting a single column by column name.

General syntax: 
```
df['column_name']
```

This is an example of **label-based subsetting**, which means we select data from our data frame using the *names* of the columns, *not their positions*.

### Example

In [None]:
# Select number of bird species observed at Mugu Lagoon in spring
mul_spring = df['MUL_spring']

mul_spring

In [None]:
type(mul_spring)

We can also do label-based subsetting of a single column using attribute syntax:

`df.column_name`

### Example

In [None]:
df.MUL_spring # attribute syntax

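Favor `df['column_name']` instead of `df.column_name`: bracket syntax works for any column name, including names that contain spaces or that clash with existing data frame methods or attributes.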
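
A minimal sketch of where the two syntaxes differ. The `toy` frame and the column names `count` and `site name` are made up for illustration; they are not columns of the bird diversity file:

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute syntax falls short
toy = pd.DataFrame({
    'MUL_spring': [46, 43],
    'count': [5, 7],          # name clashes with the DataFrame.count method
    'site name': ['A', 'B'],  # name contains a space
})

same = toy['MUL_spring'].equals(toy.MUL_spring)  # both forms agree for a simple name

count_column = toy['count']    # the column, a pandas.Series
count_attribute = toy.count    # the method, not the column
site_names = toy['site name']  # only bracket syntax can reach this column

print(same, callable(count_attribute))
```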

## Selecting multiple columns ...

... using a list of column names. 

Syntax:

```
df[['column_1', 'column_10', 'column_245']]
```

Notice the double square brackets: we are passing a list of names to the selection brackets.

### Example

In [None]:
# Select species abundance in Tijuana Estuary during winter and fall 
tje_wf = df[['TJE_winter','TJE_fall']]

# check the data type of tje_wf
print(type(tje_wf))

# check the shape of the selection 
print(tje_wf.shape)
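
The effect of single versus double brackets on the result type can be sketched as follows. The `toy` frame and its values are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical toy data with the same column naming pattern
toy = pd.DataFrame({
    'year': [2010, 2011, 2012],
    'TJE_winter': [81, 78, 67],
    'TJE_fall': [38, 36, 40],
})

one_col_series = toy['TJE_fall']            # single brackets -> pandas.Series
one_col_frame = toy[['TJE_fall']]           # list with one name -> pandas.DataFrame
two_cols = toy[['TJE_winter', 'TJE_fall']]  # list with two names -> pandas.DataFrame

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```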

## ... using a slice 
To select a slice of the columns we will use a special case of `loc` selection.

Syntax:

```
df.loc[ : , 'column_start':'column_end']
```

- `column_start` and `column_end` are the starting and ending points of the column slice. The slice includes both endpoints.
- The first value passed to `loc` is used to select rows. Using `:` as the row-selection parameter means "select all the rows".


### Example

In [None]:
# Select all columns from CSM_winter through MUL_fall (both included)
csm_mul = df.loc[ : , 'CSM_winter':'MUL_fall']
csm_mul.head()
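
To see that a label slice keeps both endpoints, it may help to compare it with a position-based slice, which drops the stop position. This is a sketch on hypothetical data; `iloc` is used here only for contrast:

```python
import pandas as pd

# Hypothetical toy data; column order matters for slicing
toy = pd.DataFrame({
    'year': [2010, 2011],
    'CSM_winter': [39, 48],
    'CSM_spring': [40, 44],
    'CSM_fall': [50, 38],
    'MUL_fall': [38, 60],
})

by_label = toy.loc[:, 'CSM_winter':'CSM_fall']  # label slice: both endpoints kept
by_position = toy.iloc[:, 1:3]                  # position slice: stop position excluded

print(list(by_label.columns))
print(list(by_position.columns))
```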

## Selecting rows ... 

### ... using a condition

Syntax: 
```
df[condition_on_rows]
```

The `condition_on_rows` can take many forms.

### Example
We are interested in all data after 2020.

In [None]:
# select all rows with year>2020 

post_2020 = df[df['year']>2020]

post_2020

The condition for our rows is `df['year']>2020`. This is a `pandas.Series` with **boolean values** (`True` or `False`) indicating which rows satisfy the condition year > 2020.

In [None]:
# check the type of df['year']>2020
print(type(df['year']>2020))

# print the boolean series 
print(df['year']>2020)
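
A short sketch of how the boolean series acts as a row filter, and of one other form a condition can take: two conditions combined with `&`. The `toy` frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'year': [2019, 2020, 2021, 2022],
    'SDW_spring': [70, 80, 72, 77],
})

mask = toy['year'] > 2020  # boolean Series, one value per row
post_2020 = toy[mask]      # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
recent_and_rich = toy[(toy['year'] > 2020) & (toy['SDW_spring'] >= 75)]

print(mask.tolist())
print(recent_and_rich)
```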

### Check-in
Get the subset of the data frame in which the San Dieguito Wetland has at least 75 species recorded during spring.

In [None]:
df[df['SDW_spring'] >= 75]

In [None]:
# Select rows with year between 2012 and 2015 (both endpoints included)
subset = df[df['year'].between(2012,2015)]
subset

- `df['year']` is the column with the year values. This is a `pandas.Series`.
- In `df['year'].between()`, `between()` is a method of the `pandas.Series`. We call it using `.`.
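
As a sketch on hypothetical data, `between()` with its default settings gives the same rows as combining `>=` and `<=`, since both endpoints are included:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'year': [2011, 2012, 2013, 2015, 2016]})

with_between = toy[toy['year'].between(2012, 2015)]
with_operators = toy[(toy['year'] >= 2012) & (toy['year'] <= 2015)]

print(with_between['year'].tolist())
print(with_between.equals(with_operators))
```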