# Discussion Section 3: Snowshoe Hares

In [31]:
import pandas as pd
import numpy as np

## Archive Exploration
1. What is this data about?
This dataset is about snowshoe hares. Researchers conducted capture-recapture studies of snowshoe hares at 5 locales in the Tanana Valley, from Tok in the east to Clear in the west.
2. During what time frame were the observations in the dataset collected?
1999 - 2012
3. Does the dataset contain sensitive data?
No
4. Is there a publication associated with this dataset?
yes?


## Data Citation
citation: 
**Kielland, K., F.S. Chapin, R.W. Ruess, and Bonanza Creek LTER. 2017. Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present ver 22. Environmental Data Initiative. https://doi.org/10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14 (Accessed 2024-10-17).**
date of access: 
October 17, 2024
link to the archive: https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701

![ALAN SCHMIERER, Set 72157600401137773, ID 18988734889, Original title SNOWSHOE HARE (Lepus americanus) (5-28-2015) quoddy head, washington co, maine -01](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

## Data loading and preliminary exploration

In [32]:
hares = pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

1. What are the dimensions of the dataframe and what are the data types of the columns? Do the data types match what you would expect from each column?
2. Are there any columns that have a significant number of NA values?
3. What are the minimum and maximum values for the weight and hind feet measurements?
4. What are the unique values for some of the categorical columns?
5. An exploratory question about the data frame that you come up with!

In [33]:
hares.shape

(3380, 14)

In [34]:
hares.dtypes

date           object
time           object
grid           object
trap           object
l_ear          object
r_ear          object
sex            object
age            object
weight        float64
hindft        float64
notes          object
b_key         float64
session_id      int64
study          object
dtype: object
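
The `date` column is read as `object` (plain strings), which is probably not what one would expect for a date. A minimal sketch of converting it, assuming every entry follows the month/day/year pattern seen in the rows printed in this notebook; `sample` is a hypothetical stand-in for `hares`:

```python
import pandas as pd

# Toy stand-in for the `date` column; the values are copied from rows shown in this notebook
sample = pd.DataFrame({"date": ["11/26/1998", "8/8/2002", "8/6/2002"]})

# Parse month/day/year strings into datetime values
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

# With real datetimes, the earliest and latest observation dates are easy to check
print(sample["date"].min(), sample["date"].max())
```

Applied to the full data frame, the same `min()` and `max()` calls would give a direct check of the time frame recorded in the Archive Exploration answers.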

In [16]:
hares.isnull().sum()

date             0
time          3116
grid             0
trap            12
l_ear           48
r_ear          169
sex            352
age           2111
weight         535
hindft        1747
notes         3137
b_key           47
session_id       0
study          163
dtype: int64
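
Raw counts are easier to judge as a share of all rows. With 3380 rows, the counts above correspond to roughly 93% missing for `notes`, 92% for `time`, 62% for `age`, and 52% for `hindft`. A short sketch of computing those shares directly; `toy` is a hypothetical stand-in for `hares`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `hares` with a few missing values
toy = pd.DataFrame({
    "weight": [1370.0, np.nan, 1430.0, 1490.0],
    "notes": [np.nan, np.nan, np.nan, "recapture"],
})

# isna() gives True/False per cell; the column mean of booleans is the share of missing values
na_share = toy.isna().mean().sort_values(ascending=False)
print(na_share)
```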

In [18]:
hares['weight'].max()

2365.0

In [19]:
hares['weight'].min()

0.0

In [21]:
hares['hindft'].max()

160.0

In [22]:
hares['hindft'].min()

60.0
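
The four separate calls above can also be condensed into one table. The sketch below uses a hypothetical `toy` frame whose extreme values mirror the outputs above. A minimum weight of 0.0 g looks more like a data-entry issue than a real measurement, so it may be worth pulling out those rows for a closer look.

```python
import pandas as pd

# Toy stand-in for `hares`; the extremes mirror the outputs shown above
toy = pd.DataFrame({
    "weight": [0.0, 1370.0, 2365.0],
    "hindft": [60.0, 135.0, 160.0],
})

# One table with the minimum and maximum of both measurements
summary = toy[["weight", "hindft"]].agg(["min", "max"])
print(summary)

# Rows with a weight of zero are candidates for a closer look
suspect = toy[toy["weight"] == 0]
print(suspect)
```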

In [23]:
hares['age'].unique()

array([nan, 'J', 'A', 'a 1 yr.', 'a 3/4 yr.', 'a 1 yr', '1 yr', '1 yr.',
       '2 yrs.', '2 yrs', 'a 2 yrs.', '2.25 yrs', '3.5 yrs.', '3 yrs.',
       '2.5 yrs', '3.25 yrs.', 'A 1.5', '?', 'U', 'j', 'a', 'u', 'J 3/4',
       'A 3/4', 'A 1/2', '3/4/2013', '1/4/2013', '1/2/2013', '1', '1.25',
       '1.5'], dtype=object)

In [24]:
hares.info()  # info is a method: without the parentheses, pandas only returns the bound method instead of the summary

## Detecting Messy Values

1. In the metadata section of the EDI repository, find the allowed values for the hares’ sex. Create a small table in a markdown cell showing the values and their definitions.

| Value | Definition |
| --- | --- |
| m | male |
| f | female |
| m? | male not confirmed |

2. Get the number of times each unique non-NA value of `sex` appears.

In [51]:
hares['sex'].value_counts(dropna=False)

F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: sex, dtype: int64
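
The labels `f` and `m` each appear twice in the counts above, which suggests that some entries differ by characters that are invisible in the printed output. The sketch below assumes the difference is trailing whitespace; that is a guess, and the values in `sex` are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical values; the trailing space in "f " and "m " is an assumption
sex = pd.Series(["F", "f", "f ", "M", "m ", "m?", None])

# repr() puts quotes around each value, so stray whitespace becomes visible
print([repr(v) for v in sex.dropna().unique()])

# Strip whitespace and lowercase before counting, so variants collapse together
normalized = sex.str.strip().str.lower()
print(normalized.value_counts())
```

Running the `repr()` line on `hares['sex']` would show what actually distinguishes the duplicated labels.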

## Clean Values

In [38]:
# Create a list with the conditions
conditions = [(hares.sex == 'F') | (hares.sex == 'f') | (hares.sex =='f_'), 
              (hares.sex == 'M') | (hares.sex == 'm') | (hares.sex =='m_')]

# Create a list with the choices
choices = ["female",
           "male"]

# Add the selections using np.select
hares['sex_simple'] = np.select(conditions, 
                             choices, 
                             default=np.nan) # Value for anything outside conditions

# Display the updated data frame to confirm the new column
hares.head()

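| | date | time | grid | trap | l_ear | r_ear | sex | age | weight | hindft | notes | b_key | session_id | study | sex_simple |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/26/1998 | NaN | bonrip | 1A | 414D096A08 | NaN | NaN | NaN | 1370.0 | 160.0 | NaN | 917.0 | 51 | Population | nan |
| 1 | 11/26/1998 | NaN | bonrip | 2C | 414D320671 | NaN | M | NaN | 1430.0 | NaN | NaN | 936.0 | 51 | Population | male |
| 2 | 11/26/1998 | NaN | bonrip | 2D | 414D103E3A | NaN | M | NaN | 1430.0 | NaN | NaN | 921.0 | 51 | Population | male |
| 3 | 11/26/1998 | NaN | bonrip | 2E | 414D262D43 | NaN | NaN | NaN | 1490.0 | 135.0 | NaN | 931.0 | 51 | Population | nan |
| 4 | 11/26/1998 | NaN | bonrip | 3B | 414D2B4B58 | NaN | NaN | NaN | 1710.0 | 150.0 | NaN | 933.0 | 51 | Population | nan |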


In [41]:
hares['sex_simple'].unique()

array(['nan', 'male', 'female'], dtype=object)
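
The output above shows that the default ended up as the string `'nan'` rather than a true missing value, so `isna()` would not detect it. One possible alternative, sketched here on a hypothetical `toy` frame, is to normalize the labels and map them through a dictionary. Anything not in the dictionary becomes a real `NaN`. Note that this sketch also strips whitespace, which the `np.select` version above does not do.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `sex` column of `hares`
toy = pd.DataFrame({"sex": ["F", "f", "M", "m", "?", np.nan]})

# Only confirmed labels are mapped; everything else is left as missing
labels = {"f": "female", "m": "male"}
toy["sex_simple"] = toy["sex"].str.strip().str.lower().map(labels)

print(toy["sex_simple"].unique())
print(toy["sex_simple"].isna().sum())  # number of rows without a confirmed sex
```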

## Calculate mean weight

In [52]:
hares.groupby("sex_simple").weight.mean()

sex_simple
female    1366.920372
male      1352.145553
nan       1176.511111
Name: weight, dtype: float64

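The weights are in grams. Females were the heaviest on average (about 1367 g), followed by males (about 1352 g). Hares whose sex was missing or not confirmed (the `nan` group) were the lightest (about 1177 g).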
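
A mean is easier to interpret next to the number of measurements behind it. The sketch below reports both on a hypothetical `toy` frame with made-up weights. It assumes missing labels are stored as real `NaN` values, in which case `dropna=False` keeps them as their own group.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `hares`; the weights are made up for illustration
toy = pd.DataFrame({
    "sex_simple": ["female", "female", "male", "male", np.nan],
    "weight": [1400.0, 1300.0, 1350.0, np.nan, 1100.0],
})

# count ignores missing weights, so it shows how many measurements each mean rests on
summary = toy.groupby("sex_simple", dropna=False)["weight"].agg(["count", "mean"])
print(summary)
```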