In [32]:
import pandas as pd
import numpy as np

## Archive Exploration

This dataset records physical attributes of snowshoe hares captured in the Bonanza Creek Experimental Forest. The data were collected from 1999-06-01 to 2012-09-14. It is unclear whether the dataset contains sensitive data.
![Snowshoe hare (Lepus americanus)](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

## Data loading and preliminary exploration

In [4]:
hares = pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

In [5]:
hares.shape

(3380, 14)

In [6]:
hares.columns

Index(['date', 'time', 'grid', 'trap', 'l_ear', 'r_ear', 'sex', 'age',
       'weight', 'hindft', 'notes', 'b_key', 'session_id', 'study'],
      dtype='object')

In [7]:
hares.dtypes

date           object
time           object
grid           object
trap           object
l_ear          object
r_ear          object
sex            object
age            object
weight        float64
hindft        float64
notes          object
b_key         float64
session_id      int64
study          object
dtype: object
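
The `date` column is read as `object` (plain text), so the collection period cannot be checked until the column is parsed. A minimal sketch of that step, using a small invented frame named `toy` instead of the real download and assuming the `MM/DD/YYYY` layout seen in the first rows of the data:

```python
import pandas as pd

# Toy stand-in for the hares data; these dates are invented for illustration.
toy = pd.DataFrame({"date": ["11/26/1998", "06/01/1999", "09/14/2012"]})

# Parse the text dates. errors="coerce" turns unparseable entries into NaT
# instead of raising, so bad rows can be inspected afterwards.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y", errors="coerce")

# Earliest and latest dates in the parsed column.
first_date = toy["date"].min()
last_date = toy["date"].max()
print(first_date.date(), last_date.date())
```

If the real column mixes several date layouts, the `format` argument would need adjusting; that has not been checked here.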

In [8]:
hares['sex'].unique()

array([nan, 'M', 'F', '?', 'F?', 'M?', 'pf', 'm', 'f', 'f?', 'm?', 'f ',
       'm '], dtype=object)

## 4. Detecting messy values

### a. 
| Sex Code     | Description |
| ----------- | ----------- |
| m?   | Male - not confirmed        |
| m      | Male       |
| f   | Female        |
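
One way to compare the column against the metadata is a set difference. The sketch below hard-codes the codes returned by `hares['sex'].unique()` rather than loading the data, and the names `metadata_codes`, `observed`, and `unexpected` are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Codes declared in the metadata table above.
metadata_codes = {"m", "f", "m?"}

# Values returned by hares['sex'].unique(), copied by hand.
observed = pd.Series([np.nan, "M", "F", "?", "F?", "M?", "pf",
                      "m", "f", "f?", "m?", "f ", "m "])

# Codes present in the data but absent from the metadata (NaN is dropped first).
unexpected = sorted(set(observed.dropna()) - metadata_codes)
print(unexpected)
```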

### b. 

In [17]:
hares['sex'].value_counts(dropna = True)

F     1161
M      730
f      556
m      515
?       40
F?      10
f        4
m        4
f?       3
M?       2
m?       2
pf       1
Name: sex, dtype: int64

In [29]:
# help(pd.Series.value_counts)
# The default is dropna=True, which excludes NaN values from the counts

In [30]:
hares['sex'].value_counts(dropna = False)

F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: sex, dtype: int64
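
The two calls above differ only in whether missing values get their own row. A small sketch on an invented series (not the hares data) shows that the gap between the two totals equals the number of missing entries:

```python
import numpy as np
import pandas as pd

# Toy series, invented for illustration: two codes and one missing value.
codes = pd.Series(["F", "M", "F", np.nan])

counts_default = codes.value_counts()               # dropna=True by default
counts_with_nan = codes.value_counts(dropna=False)  # NaN gets its own row

# The totals differ by exactly the number of missing entries.
n_missing = counts_with_nan.sum() - counts_default.sum()
print(n_missing, codes.isna().sum())
```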

Discuss with your team the output of the unique value counts. In particular:

Do the values in the sex column correspond to the values declared in the metadata?
No. The metadata declares only three codes (`m`, `f`, `m?`), but the column contains twelve distinct non-missing codes plus missing values.

What could have been potential causes for multiple codes?
If several observers recorded sex using different conventions, for example upper-case versus lower-case letters, the column would accumulate extra codes.

Are there seemingly repeated values? If so, what could be the cause?
Yes. For example, `f` and `m` each appear twice in the counts, most likely because some entries carry a trailing space (`'f '` and `'m '` in the `unique()` output). A likely cause is that different observers did not agree on a common code.
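
To make the repeated codes easier to spot, the raw codes can be grouped by a normalized form. This is a sketch only: it hard-codes the non-missing codes from `hares['sex'].unique()`, and treating case and surrounding whitespace as irrelevant is an assumption about what the observers meant:

```python
import pandas as pd

# Non-missing codes returned by hares['sex'].unique(), copied by hand.
codes = pd.Series(["M", "F", "?", "F?", "M?", "pf",
                   "m", "f", "f?", "m?", "f ", "m "])

# Strip surrounding whitespace and lower-case so spelling variants collapse.
normalized = codes.str.strip().str.lower()

# Collect the raw spellings behind each normalized code; a group with more
# than one member is a set of variants of the same code.
variants = codes.groupby(normalized).agg(list)
print(variants)
```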

## 6. Data Wrangle

In [33]:
# Create a list with the conditions
conditions = [
    # 'f ' and 'm ' (trailing space) also appear in the data
    hares['sex'].isin(['F', 'f', 'f ', 'f?', 'F?']),
    hares['sex'].isin(['M', 'm', 'm ', 'm?', 'M?'])
]

# Create a list with the choices
choices = ["female",
           "male"]

# Add the selections using np.select
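# np.nan (a float) cannot be combined reliably with string choices:
# depending on the NumPy version it raises or becomes the text 'nan'.
# None keeps the result as an object array with true missing values.
hares['simple_sex'] = np.select(conditions, 
                             choices, 
                             default=None) # Value for anything outside conditions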

# Display the updated data frame to confirm the new column
hares.head()

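|   | date | time | grid | trap | l_ear | r_ear | sex | age | weight | hindft | notes | b_key | session_id | study | simple_sex |
| - | ---- | ---- | ---- | ---- | ----- | ----- | --- | --- | ------ | ------ | ----- | ----- | ---------- | ----- | ---------- |
| 0 | 11/26/1998 |  | bonrip | 1A | 414D096A08 |  |  |  | 1370.0 | 160.0 |  | 917.0 | 51 | Population |  |
| 1 | 11/26/1998 |  | bonrip | 2C | 414D320671 |  | M |  | 1430.0 |  |  | 936.0 | 51 | Population | male |
| 2 | 11/26/1998 |  | bonrip | 2D | 414D103E3A |  | M |  | 1430.0 |  |  | 921.0 | 51 | Population | male |
| 3 | 11/26/1998 |  | bonrip | 2E | 414D262D43 |  |  |  | 1490.0 | 135.0 |  | 931.0 | 51 | Population |  |
| 4 | 11/26/1998 |  | bonrip | 3B | 414D2B4B58 |  |  |  | 1710.0 | 150.0 |  | 933.0 | 51 | Population |  |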
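
Looking at the first five rows confirms that the column exists but not that every code was mapped as intended. One possible check is to cross-tabulate the raw codes against the recoded values. The frame `toy` below is invented for illustration; in the notebook the same call would use `hares['sex']` and `hares['simple_sex']`:

```python
import pandas as pd

# Toy frame, invented for illustration; it mimics a raw and a recoded column.
toy = pd.DataFrame({
    "sex": ["F", "f", "M", "m?", "?", None],
    "simple_sex": ["female", "female", "male", "male", None, None],
})

# Cross-tabulate raw codes against recoded values. Missing values are
# relabelled first so they stay visible in the table.
check = pd.crosstab(
    toy["sex"].fillna("missing"),
    toy["simple_sex"].fillna("missing"),
)
print(check)
```

Each raw code should have counts in exactly one column; a code with counts in an unexpected column points to a mistake in the conditions.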
