In [32]:
import pandas as pd
import numpy as np

## Archive Exploration

This dataset records the physical attributes of snowshoe hares in the Bonanza Creek Experimental Forest. The data were collected from 1999-06-01 to 2012-09-14. It is unclear whether the dataset contains sensitive data.
![Snowshoe hare (Lepus americanus)](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

## Data loading and preliminary exploration

In [4]:
hares = pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

In [53]:
hares.head(30)

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study,sex_simple
3350,8/9/2002,6:00:00,bonmat,2a,1236,1238,f,,800.0,92.0,,78.0,50,Population,female
3351,8/9/2002,6:00:00,bonmat,5f,1221,1222,,,,,,72.0,50,Population,
3352,3/31/2003,11:30:00,bonmat,2b,1209,1210,m,,1450.0,135.0,,66.0,43,Population,male
3353,6/4/2003,10:30:00,bonmat,4d,639,640,m,,1350.0,134.0,,623.0,44,Population,male
3354,8/20/2003,10:40:00,bonmat,6b,1521,1522,f,j,750.0,110.0,,94.0,45,Population,female
3355,8/20/2003,10:30:00,bonmat,3b,1512,1516,f,j,650.0,104.0,,93.0,45,Population,female
3356,8/21/2003,18:30:00,bonmat,1c,1209,1210,m,,1400.0,134.0,,66.0,45,Population,male
3357,8/21/2003,9:00:00,bonmat,2d,1628,1629,m,,1150.0,132.0,,103.0,45,Population,male
3358,8/21/2003,8:45:00,bonmat,4f,1626,1627,f,,1425.0,142.0,,102.0,45,Population,female
3359,8/21/2003,8:35:00,bonmat,3a,1521,1522,f,j,875.0,110.0,,94.0,45,Population,female


In [6]:
hares.columns

Index(['date', 'time', 'grid', 'trap', 'l_ear', 'r_ear', 'sex', 'age',
       'weight', 'hindft', 'notes', 'b_key', 'session_id', 'study'],
      dtype='object')

In [39]:
hares['weight'].unique()

array([1370., 1430., 1490., 1710., 1890., 2170., 1510., 1590., 1830.,
       1780., 1730., 1670., 1990., 1870., 1650.,   nan, 1770., 1690.,
       1640., 1750., 1720., 1620., 1790., 1470., 1480., 1940., 1400.,
       1450., 1860., 1580., 1500., 1760., 1560., 1600., 1420., 1660.,
       1820., 1700., 1300., 1200., 1350., 1380., 1320., 1800., 1375.,
        395.,  420.,  445.,  410.,  430.,  520.,  435.,  545.,  425.,
        480.,  530.,  355.,  590.,  400.,  660., 1170., 1180., 1310.,
       1090.,  930., 1000., 1040.,  910., 1060., 1190., 1010.,  990.,
       1020., 1140., 1290., 1810., 1050., 1900., 1160., 1390., 1840.,
       1080., 1130.,  960., 1230., 1950., 1150., 1110., 1330., 1210.,
       1100., 1410., 1220., 1360.,  980., 1460.,  900., 1120., 1280.,
       2090., 1530., 2060., 1930.,  710., 1070.,  940., 2200., 1680.,
       1030., 1340., 1920., 1540., 1570.,  890., 1250., 1270., 2260.,
        700.,  750.,  760., 1240.,  595., 1550., 1520., 2000., 2070.,
       1485., 1740.,

In [7]:
hares.dtypes

date           object
time           object
grid           object
trap           object
l_ear          object
r_ear          object
sex            object
age            object
weight        float64
hindft        float64
notes          object
b_key         float64
session_id      int64
study          object
dtype: object
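
The dtypes above show `date` stored as `object`, meaning plain strings. A minimal sketch of parsing such strings into real dates follows, so that a collection range can be checked with `min()` and `max()`. It uses a toy frame rather than the real data, and it assumes the month/day/year format seen in the preview rows.

```python
import pandas as pd

# Toy rows in the m/d/Y string format seen in the preview (not the full dataset)
toy = pd.DataFrame({"date": ["8/9/2002", "3/31/2003", "6/4/2003"]})

# Parse strings into datetime64 values; an explicit format avoids day/month guessing
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

first_date = toy["date"].min()
last_date = toy["date"].max()
print(first_date.date(), last_date.date())
```

On the real frame the same two calls on `hares['date']` would give the first and last observation dates, provided every row follows this format.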

In [8]:
hares['sex'].unique()

array([nan, 'M', 'F', '?', 'F?', 'M?', 'pf', 'm', 'f', 'f?', 'm?', 'f ',
       'm '], dtype=object)

## 4. Detecting messy values

### a. 
| Sex Code     | Description |
| ----------- | ----------- |
| m?   | Male - not confirmed        |
| m      | Male       |
| f   | Female        |

### b. 

In [45]:
# dropna=True is the default, so NaN values are excluded from the counts
hares['sex'].value_counts(dropna=True)

F     1161
M      730
f      556
m      515
?       40
F?      10
f        4
m        4
f?       3
M?       2
m?       2
pf       1
Name: sex, dtype: int64

### c.

In [29]:
# help(pd.Series.value_counts)
# The default is dropna=True, which excludes NaN values from the counts

In [52]:
hares['sex'].value_counts(dropna = False)

F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: sex, dtype: int64
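
The only difference between the two outputs is the `NaN` row. As a small sketch on a toy Series (illustrative values, not the real column), the number of missing entries can also be counted directly with `isna().sum()`, and it matches the gap between the two `value_counts` totals.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hares['sex'] (illustrative values only)
sex = pd.Series(["F", "M", np.nan, "f", np.nan, "?"])

with_nan = sex.value_counts(dropna=False)  # includes a NaN row
without_nan = sex.value_counts()           # dropna=True by default

n_missing = sex.isna().sum()               # direct count of missing entries
gap = with_nan.sum() - without_nan.sum()   # rows that only appear with dropna=False
print(n_missing, gap)
```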

Discuss with your team the output of the unique value counts. In particular:

Do the values in the sex column correspond to the values declared in the metadata?
No. The metadata declares only three codes (`m`, `f`, `m?`), but the column contains twelve distinct non-missing values plus NaN.

What could have been potential causes for multiple codes?
If multiple observers recorded sex using different conventions, the column would accumulate extra codes.

Are there seemingly repeated values? If so, what could be the cause?
Yes. For example, `f` and `m` each appear twice in the counts because one variant has a trailing space (`'f '`, `'m '`), and `F`/`f` differ only in case. A likely cause is different observers who did not agree on a common code.
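
One way to inspect such near-duplicates is sketched below on a toy Series that mirrors some of the variants above. `repr()` makes trailing spaces visible, and stripping whitespace plus lowercasing shows how many codes remain once purely typographic differences are removed. This is for exploration only; it is not the recoding used in the next section.

```python
import pandas as pd

# Toy codes mirroring variants seen in the counts above
codes = pd.Series(["F", "f", "f ", "M", "m", "m ", "F?", "pf", "?"])

# repr() shows quotes around each value, so 'f ' is distinguishable from 'f'
print([repr(c) for c in codes.unique()])

# Strip surrounding whitespace and lowercase to collapse typographic variants
normalized = codes.str.strip().str.lower()
print(normalized.value_counts())
```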

## 6. Data Wrangle

In [50]:
# Create a list with the conditions
conditions = [
    hares['sex'].isin(['F', 'f', 'f ']),
    hares['sex'].isin(['M', 'm', 'm '])
]

# Create a list with the choices
gender = ["female",
           "male"]

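# Add the selections using np.select
# Note: because the choices are strings, the np.nan default ends up as the
# text 'nan' (see the counts below) rather than a true missing value
hares['sex_simple'] = np.select(conditions, 
                             gender, 
                             default=np.nan) # Value for anything outside conditions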

# Display the updated data frame to confirm the new column
hares.head(20)

hares['sex_simple'].value_counts()

female    1721
male      1249
nan        410
Name: sex_simple, dtype: int64
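
The counts above list `nan` as a category even though `value_counts()` drops real missing values by default, which suggests the unmatched rows hold the text `'nan'`. A minimal alternative sketch, assuming true missing values are wanted, maps the codes through a dictionary with `Series.map`, which leaves unmatched codes as real `NaN`. The toy Series and the `code_to_label` name are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hares['sex'] (illustrative values only)
sex = pd.Series(["F", "f ", "M", "m", "?", np.nan])

# Hypothetical lookup table covering the same codes as the conditions above
code_to_label = {"F": "female", "f": "female", "f ": "female",
                 "M": "male", "m": "male", "m ": "male"}

# Codes missing from the dictionary (and NaN inputs) become real NaN
sex_simple = sex.map(code_to_label)
print(sex_simple.value_counts(dropna=False))
```

With real `NaN` values, a later `groupby('sex_simple')` would drop the unmatched rows by default instead of reporting them as a third group.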

## 7. Use groupby() to calculate the mean weight by sex using the new column.
Write a full sentence explaining the results you obtained. Don't forget to include units.

In [51]:
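# Note: 'weight' is received as the positional numeric_only argument, so this
# averages every numeric column rather than selecting weight alone
hares.groupby('sex_simple').mean('weight')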

Unnamed: 0_level_0,weight,hindft,b_key,session_id
sex_simple,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,1365.164792,130.878889,481.653263,57.746078
male,1349.935542,133.194074,489.514812,49.230584
,1193.364055,103.758621,626.942935,46.47561


The mean weight of female hares is 1365.16 grams, the mean weight of male hares is 1349.94 grams, and the mean weight of hares with unconfirmed or missing sex is 1193.36 grams.
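
Since the question asks only about weight, a short sketch of restricting the aggregation to that column follows. It selects `weight` after `groupby` and before `mean`, and uses a toy frame with illustrative values rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values (not the full dataset)
toy = pd.DataFrame({
    "sex_simple": ["female", "female", "male", "male", np.nan],
    "weight": [800.0, 1425.0, 1450.0, 1350.0, 650.0],
    "hindft": [92.0, 142.0, 135.0, 134.0, 104.0],
})

# Select the weight column before aggregating; rows with a NaN key are dropped
mean_weight = toy.groupby("sex_simple")["weight"].mean()
print(mean_weight.round(2))
```

The result is a Series indexed by sex, which is easier to quote in a sentence than a table of all numeric columns.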

In [44]:
# Read in the data
hares = pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

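# Make a new column called 'sex_simple' that condenses the 'sex' codes into male and female
# Here the unconfirmed codes ('f?', 'F?', 'm?', 'M?') count as female/male, while the
# trailing-space variants ('f ', 'm ') are not listed and fall to the default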
hares['sex_simple'] = np.select(
    [hares['sex'].isin(['F', 'f', 'f?', 'F?']),
    hares['sex'].isin(['M', 'm', 'm?', 'M?'])], 
    ["female",
    "male"], 
    default = np.nan)

# Group by the new sex column to find the mean weights by sex
hares.groupby('sex_simple').mean('weight')

Unnamed: 0_level_0,weight,hindft,b_key,session_id
sex_simple,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,1364.986148,130.877348,482.393852,57.831214
male,1352.399642,133.388148,491.098479,49.226581
,1178.42723,98.754717,621.130556,45.86783
