# Subsetting

- Review core methods to select data from a `pandas.DataFrame`

### Read in a CSV

To read a CSV file into our Python workspace as a `pandas.DataFrame`, we use the `pandas.read_csv` function:


In [1]:
import pandas as pd

# Read in the file; the argument is the file path as a string
df = pd.read_csv('data/wetlands_seasonal_bird_diversity.csv')

# Print the first 5 rows
df.head()

,year,CSM_winter,CSM_spring,CSM_fall,MUL_winter,MUL_spring,MUL_fall,SDW_winter,SDW_spring,SDW_fall,TJE_winter,TJE_spring,TJE_fall
0,2010,39.0,40.0,50.0,45.0,,61.0,,75.0,85.0,,,81.0
1,2011,48.0,44.0,,58.0,52.0,,78.0,74.0,,67.0,70.0,
2,2012,51.0,43.0,49.0,57.0,58.0,53.0,71.0,72.0,73.0,70.0,63.0,69.0
3,2013,42.0,46.0,38.0,60.0,58.0,62.0,69.0,70.0,70.0,69.0,74.0,64.0
4,2014,38.0,43.0,45.0,49.0,52.0,57.0,61.0,78.0,71.0,60.0,81.0,62.0


Birds were surveyed in the four wetlands:

- Carpinteria Salt Marsh (CSM)
- Mugu Lagoon (MUL)
- San Dieguito Wetland (SDW)
- Tijuana Estuary (TJE)

The values from the second column to the last column are the number of different bird species recorded across the survey sites in each wetland during winter, spring, and fall of a given year.

In [3]:
# Print the df's shape:
df.shape

(14, 13)

In [4]:
# List the data types of the df's columns:
df.dtypes

year            int64
CSM_winter    float64
CSM_spring    float64
CSM_fall      float64
MUL_winter    float64
MUL_spring    float64
MUL_fall      float64
SDW_winter    float64
SDW_spring    float64
SDW_fall      float64
TJE_winter    float64
TJE_spring    float64
TJE_fall      float64
dtype: object

## Selecting a single column

Simplest case: select a single column by its column name.

General syntax:
```python
df['column_name']
```

This is an example of **label-based subsetting**, which means we select data from our `df` using the *names* of the columns, not their position.

### Example

Select the number of bird species in Mugu Lagoon in the spring

In [6]:
#Select a single column by using square brackets []
mul_spring = df['MUL_spring']

mul_spring

0      NaN
1     52.0
2     58.0
3     58.0
4     52.0
5     50.0
6     48.0
7     54.0
8     54.0
9     52.0
10     NaN
11    55.0
12    55.0
13    59.0
Name: MUL_spring, dtype: float64

### Note:

- `df['column_name']` avoids conflicts with the methods and attributes of `pandas.DataFrame`
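
To make the note concrete, here is a minimal sketch of such a conflict. The toy data frame and its column name `shape` are hypothetical, chosen only because the name collides with the `DataFrame.shape` attribute; they are not part of the wetlands data.

```python
import pandas as pd

# Hypothetical toy frame: the column name 'shape' collides with
# the DataFrame.shape attribute.
toy = pd.DataFrame({'year': [2010, 2011], 'shape': ['round', 'flat']})

# Attribute access returns the (rows, columns) tuple, not the column
by_attribute = toy.shape

# Square brackets always return the column, as a pandas.Series
by_brackets = toy['shape']

print(by_attribute)
print(type(by_brackets))
```

With square brackets, the column is returned regardless of what the data frame's own attributes are called.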

## Selecting Multiple columns

### ...using a list of column names

```python
df[['col1', 'col2', 'col100']]
```

## Check-in

Is this label-based or location-based?

### Example


In [8]:
# Select columns with names 'TJE_winter' and 'TJE_fall'

tje_wf = df[['TJE_winter', 'TJE_fall']]

tje_wf

,TJE_winter,TJE_fall
0,,81.0
1,67.0,
2,70.0,69.0
3,69.0,64.0
4,60.0,62.0
5,73.0,64.0
6,76.0,58.0
7,72.0,57.0
8,66.0,55.0
9,63.0,50.0


In [9]:
print(type(tje_wf))

<class 'pandas.core.frame.DataFrame'>
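
The type check above suggests a useful contrast: single brackets with one name return a `pandas.Series`, while a list of names returns a `pandas.DataFrame`, even when the list holds only one name. A small sketch on a toy frame (two rows copied from the output above, used here as stand-in data):

```python
import pandas as pd

# Stand-in data: the 2012 and 2013 rows of the two TJE columns
toy = pd.DataFrame({'TJE_winter': [70.0, 69.0], 'TJE_fall': [69.0, 64.0]})

# A single column name gives a Series
single = toy['TJE_fall']

# A list of names gives a DataFrame, here with exactly one column
one_col_frame = toy[['TJE_fall']]

print(type(single))
print(type(one_col_frame))
```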


### ... using a slice

To select a slice of the columns we will use a special case of **`loc` selection**. General syntax:

```python
df.loc[:, 'column_start':'column_end']
```

Notice:

- The first value passed to `loc` selects rows; a colon `:` on its own means select all the rows.
- The second value selects columns; the resulting slice includes both endpoints.
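
As an illustration of this syntax, the sketch below slices the three CSM columns out of a toy frame. The frame is stand-in data built from the 2012 and 2013 rows printed earlier, so that the example runs without the CSV file.

```python
import pandas as pd

# Stand-in data: a few columns of the 2012 and 2013 rows
toy = pd.DataFrame({
    'year': [2012, 2013],
    'CSM_winter': [51.0, 42.0],
    'CSM_spring': [43.0, 46.0],
    'CSM_fall': [49.0, 38.0],
    'MUL_winter': [57.0, 60.0],
})

# All rows, columns from 'CSM_winter' through 'CSM_fall' (both included)
csm = toy.loc[:, 'CSM_winter':'CSM_fall']

print(csm.columns.tolist())
```

Unlike ordinary Python slices, the label-based slice keeps the end label `'CSM_fall'` in the result.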

In [10]:
# Check the type of df['year'] > 2020
print(type(df['year']>2020))

# Print the boolean series
df['year']>2020


<class 'pandas.core.series.Series'>


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12     True
13     True
Name: year, dtype: bool

In [14]:
# Select the rows where SDW_spring is greater than 75
SDW_spring = df[df['SDW_spring'] > 75]

SDW_spring

,year,CSM_winter,CSM_spring,CSM_fall,MUL_winter,MUL_spring,MUL_fall,SDW_winter,SDW_spring,SDW_fall,TJE_winter,TJE_spring,TJE_fall
4,2014,38.0,43.0,45.0,49.0,52.0,57.0,61.0,78.0,71.0,60.0,81.0,62.0
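
The two cells above follow the same pattern: a comparison produces a boolean `Series`, and passing that `Series` inside square brackets keeps only the rows where it is `True`. A minimal sketch on stand-in data (three rows copied from the table printed at the start, with a threshold of 70 chosen only for illustration) also shows that a comparison with `NaN` evaluates to `False`, so rows with missing values are dropped:

```python
import numpy as np
import pandas as pd

# Stand-in data: SDW values for 2010, 2011 and 2014
toy = pd.DataFrame({
    'year': [2010, 2011, 2014],
    'SDW_winter': [np.nan, 78.0, 61.0],
    'SDW_spring': [75.0, 74.0, 78.0],
})

# Boolean Series: NaN > 70 is False, 78 > 70 is True, 61 > 70 is False
mask = toy['SDW_winter'] > 70

# Keep only the rows where the mask is True
selected = toy[mask]

print(mask.tolist())
print(selected['year'].tolist())
```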
