# Week 3 - Discussion section
## Snowshoe hares at Bonanza Creek Experimental Forest

1. What is this data about?
2. During what time frame were the observations in the dataset collected?
3. Does the dataset contain sensitive data?
4. Is there a publication associated with this dataset?

Brief description of the dataset, including a citation, date of access, and a link to the archive:

Citation: Kielland, K., F.S. Chapin, R.W. Ruess, and Bonanza Creek LTER. 2017. Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present ver 22. Environmental Data Initiative. https://doi.org/10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14 (Accessed 2024-10-16).

Link: https://portal.edirepository.org/nis/advancedSearch.jsp

Date of Access: Oct. 17, 2024

![Snowshoe Hare](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)
Source: Schmierer, Alan. “Snowshoe Hare (Lepus americanus).” Wikimedia Commons, 28 May 2015. Last edited 30 Aug. 2024, https://commons.wikimedia.org/wiki/File:SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg

## 3. Data loading and preliminary exploration

In [1]:
# Loading libraries
import pandas as pd
import numpy as np

In [2]:
# Load data
hares=pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

In [3]:
# Set pandas to display all columns in the data frame.
pd.set_option("display.max_columns",None)
hares

Unnamed: 0,date,time,grid,trap,l_ear,r_ear,sex,age,weight,hindft,notes,b_key,session_id,study
0,11/26/1998,,bonrip,1A,414D096A08,,,,1370.0,160.0,,917.0,51,Population
1,11/26/1998,,bonrip,2C,414D320671,,M,,1430.0,,,936.0,51,Population
2,11/26/1998,,bonrip,2D,414D103E3A,,M,,1430.0,,,921.0,51,Population
3,11/26/1998,,bonrip,2E,414D262D43,,,,1490.0,135.0,,931.0,51,Population
4,11/26/1998,,bonrip,3B,414D2B4B58,,,,1710.0,150.0,,933.0,51,Population
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3375,8/8/2002,18:00:00,bonrip,1b,1201,1202,,,1400.0,,,63.0,64,Population
3376,8/8/2002,6:00:00,bonrip,4b,1201,1202,,,,,,63.0,64,Population
3377,8/7/2002,,bonrip,4b,1217,1218,,,1000.0,134.0,,69.0,64,Population
3378,8/8/2002,,bonrip,6d,1217,1218,,,990.0,,,69.0,64,Population


In [4]:
# What are the dimensions of the dataframe and what are the data types of the columns? 
# Use pandas methods to obtain preliminary info

hares.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3380 entries, 0 to 3379
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        3380 non-null   object 
 1   time        264 non-null    object 
 2   grid        3380 non-null   object 
 3   trap        3368 non-null   object 
 4   l_ear       3332 non-null   object 
 5   r_ear       3211 non-null   object 
 6   sex         3028 non-null   object 
 7   age         1269 non-null   object 
 8   weight      2845 non-null   float64
 9   hindft      1633 non-null   float64
 10  notes       243 non-null    object 
 11  b_key       3333 non-null   float64
 12  session_id  3380 non-null   int64  
 13  study       3217 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 369.8+ KB


In [5]:
hares.shape

(3380, 14)

# Do the data types match what you would expect from each column?

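- Not entirely. The `date` and `time` columns are stored as `object` (strings); they would be better represented as datetime and timedelta types, respectively.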
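A minimal sketch of that conversion, using a few toy rows that mimic the `date` and `time` strings in the preview above rather than the real data frame. The `%m/%d/%Y` format is an assumption based on values such as `11/26/1998`.

```python
import pandas as pd

# Toy rows mimicking the strings seen in the preview (not the real data)
sample = pd.DataFrame({
    "date": ["11/26/1998", "8/8/2002", "8/7/2002"],
    "time": [None, "18:00:00", "6:00:00"],
})

# Parse month/day/year strings into datetime values
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

# Parse clock times into timedeltas; missing times become NaT
sample["time"] = pd.to_timedelta(sample["time"])

print(sample.dtypes)
```

The same two calls could be applied to `hares['date']` and `hares['time']`, provided every value in those columns follows the assumed formats.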


In [6]:
# Are there any columns that have a significant number of NA values?
nan_counts = hares.isna().sum()
nan_counts

date             0
time          3116
grid             0
trap            12
l_ear           48
r_ear          169
sex            352
age           2111
weight         535
hindft        1747
notes         3137
b_key           47
session_id       0
study          163
dtype: int64

In [7]:
# What are the minimum and maximum values for the weight and hind feet measurements?
# Series.min() and Series.max() skip NaN values by default
print('the min weight: ', hares['weight'].min())
print('the max weight: ', hares['weight'].max())
print('the min hind feet: ', hares['hindft'].min())
print('the max hind feet: ', hares['hindft'].max())

the min weight:  0.0
the max weight:  2365.0
the min hind feet:  60.0
the max hind feet:  160.0
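The same summary can be produced in one call, and the minimum weight of 0.0 is worth a second look. The sketch below uses toy values, and the idea that a zero weight may be a placeholder rather than a real measurement is an assumption, not something the metadata confirms.

```python
import numpy as np
import pandas as pd

# Toy measurements, including a missing value and a zero weight
toy = pd.DataFrame({
    "weight": [1370.0, np.nan, 0.0, 2365.0],
    "hindft": [160.0, 60.0, np.nan, 135.0],
})

# One table with the min and max of both columns; NaN values are skipped
summary = toy[["weight", "hindft"]].agg(["min", "max"])
print(summary)

# Count how many weights are recorded as exactly zero
n_zero_weight = (toy["weight"] == 0).sum()
print(n_zero_weight)
```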


In [8]:
# What are the unique values for some of the categorical columns?
hares.nunique()

date           256
time           165
grid             5
trap           121
l_ear         1020
r_ear          963
sex             12
age             30
weight         266
hindft          90
notes          118
b_key         1026
session_id     113
study            4
dtype: int64
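`nunique()` reports how many distinct values each column has, not what they are. To list the values themselves, one option is `unique()` on each column of interest, sketched here on toy data (the column contents are illustrative only).

```python
import numpy as np
import pandas as pd

# Toy categorical columns (illustrative values only)
toy = pd.DataFrame({
    "sex": ["M", "m", np.nan, "F?"],
    "study": ["Population", "Population", np.nan, "Population"],
})

for col in ["sex", "study"]:
    # dropna() removes missing values before listing the distinct ones
    print(col, toy[col].dropna().unique())

unique_sex = sorted(toy["sex"].dropna().unique())
```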

In [9]:
# An exploratory question about the data frame that you come up with!

## 4. Detecting messy values

In the metadata section of the EDI repository, find which are the allowed values for the hares’ sex. Create a small table in a markdown cell showing the values and their definitions.

| Code | Definition                |
|------|---------------------------|
| m    | male                      |
| f    | female                    |
| m?   | male not confirmed        |



In [36]:
hares['sex'].unique()

array([nan, 'M', 'F', '?', 'F?', 'M?', 'pf', 'm', 'f', 'f?', 'm?', 'f ',
       'm '], dtype=object)

Get the number of times each unique sex non-NA value appears.

In [10]:
print(len(hares) - (hares['sex'].isna().sum()))

3028
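The cell above returns the total number of non-NA entries. A per-value breakdown is what `value_counts()` gives; a minimal sketch on a toy Series (not the real column):

```python
import numpy as np
import pandas as pd

# Toy 'sex' values, including a missing entry and a trailing-space variant
sex = pd.Series(["M", "F", "m", "M", np.nan, "f ", "M"], name="sex")

# Number of times each non-NA value appears, most frequent first
counts = sex.value_counts()
print(counts)
```

The counts add up to the number of non-NA entries, which matches what `len(sex) - sex.isna().sum()` computes.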


Check the documentation of value_counts(). What is the purpose of the dropna parameter and what is its default value? 

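The value_counts() method in pandas counts how many times each unique value appears in a Series. The dropna parameter determines whether NaN (missing) values are excluded from the count. Its default value is True, so missing values are left out unless dropna=False is passed.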

Repeat step (a), this time adding the dropna=False parameter to value_counts().

In [31]:
sum(hares['sex'].value_counts(dropna=False))

3380
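Summing the counts confirms that every row is now included, but it hides the NaN entry itself. A small sketch on toy values showing the extra row that `dropna=False` adds:

```python
import numpy as np
import pandas as pd

# Toy values with two missing entries
sex = pd.Series(["M", "F", np.nan, "M", np.nan], name="sex")

without_na = sex.value_counts()            # default: dropna=True
with_na = sex.value_counts(dropna=False)   # keeps a row for NaN
print(with_na)

# Pull out the count stored under the NaN label
n_missing = with_na[with_na.index.isna()].iloc[0]
print(n_missing)
```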

## 5. Brainstorm

Individually, write step-by-step instructions on how you would wrangle the hares data frame to clean the values in the sex column so that it has only two classes, female and male. Which codes would you assign to each new class? Remember: It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.

## 6. Clean values

a) Create a new column called sex_simple using the numpy.select() function so that
‘F’, ‘f’, and ‘f_’ in the sex column get assigned to ‘female’,
‘M’, ‘m’, and ‘m_’ get assigned to ‘male’, and
anything else gets assigned np.nan

In [30]:
# Define conditions for the 'sex' column based on its values
conditions = [
    hares['sex'].isin(['F', 'f', 'f_']),
    hares['sex'].isin(['M', 'm', 'm_'])
]

# Define the corresponding gender labels for the conditions
gender = ['female', 'male']

# Create a new column 'sex_simple' based on the defined conditions
# Use np.select to assign 'female' or 'male'; unmatched rows get the default,
# which shows up as the string 'nan' in the output because the choices are strings
hares['sex_simple'] = np.select(conditions, gender, default=np.nan)

print(hares['sex_simple'])

0        nan
1       male
2       male
3        nan
4        nan
        ... 
3375     nan
3376     nan
3377     nan
3378     nan
3379    male
Name: sex_simple, Length: 3380, dtype: object
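Two things stand out in this result. The unmatched rows print as the text `nan` with dtype `object`, which suggests the default was stored as a string rather than a true missing value. Also, the earlier `unique()` output lists `'f '` and `'m '` with a trailing space, which the literal `'f_'` and `'m_'` do not match. A hedged alternative, assuming the underscore in the prompt stands for that trailing space, is to normalize the strings and use `Series.map()`, sketched here on toy values:

```python
import numpy as np
import pandas as pd

# Toy 'sex' values covering the variants seen in the unique() output
sex = pd.Series(["M", "f ", "F?", np.nan, "m", "F"], name="sex")

# Strip surrounding whitespace and lowercase; NaN stays NaN
normalized = sex.str.strip().str.lower()

# map() leaves any value missing from the dictionary as a real NaN
sex_simple = normalized.map({"f": "female", "m": "male"})

print(sex_simple.tolist())
print(sex_simple.isna().sum())
```

With this approach uncertain codes such as `'F?'` stay unclassified, and `isna()` recognizes them as missing.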


b) Check the counts of unique values (including NAs) in the new sex_simple column.

In [26]:
hares['sex_simple'].isna().count()

3380
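Note that `isna().count()` counts the entries of the boolean mask, so it always equals the number of rows. The toy sketch below contrasts it with `isna().sum()`, which counts the missing entries, and with `nunique(dropna=False)`; it assumes the column holds real NaN values rather than the string `'nan'`.

```python
import numpy as np
import pandas as pd

# Toy column with real missing values
sex_simple = pd.Series(["male", "female", np.nan, "male", np.nan], name="sex_simple")

mask = sex_simple.isna()

total_rows = mask.count()   # length of the mask: every row
n_missing = mask.sum()      # number of True values: the missing rows
n_classes = sex_simple.nunique(dropna=False)  # distinct values, NaN included

print(total_rows, n_missing, n_classes)
```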

## 7. Calculate mean weight

a) Use groupby() to calculate the mean weight by sex using the new column.

In [29]:
mean_weight_bysex=hares.groupby('sex_simple')['weight'].mean()
mean_weight_bysex

sex_simple
female    1366.920372
male      1352.145553
nan       1176.511111
Name: weight, dtype: float64

b) Write a full sentence explaining the results you obtained. Don’t forget to include units.

The results indicate that the average weight of female hares is approximately 1366.92 grams, which is slightly higher than the average weight of male hares at around 1352.15 grams, while the average weight of the samples categorized as NaN (not classified by sex) is lower, at approximately 1176.51 grams.
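The `nan` row appears in the result because the unclassified rows carry the text `nan` as their label. When the grouping column holds real missing values, `groupby()` leaves them out by default and includes them only with `dropna=False`. A minimal sketch on toy weights (the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: real NaN labels for unclassified rows, one missing weight
toy = pd.DataFrame({
    "sex_simple": ["female", "male", np.nan, "female", "male", np.nan],
    "weight": [1400.0, 1300.0, 1100.0, 1500.0, np.nan, 1200.0],
})

# Default: rows whose group key is NaN are excluded
by_sex = toy.groupby("sex_simple")["weight"].mean()
print(by_sex)

# dropna=False keeps a separate group for the NaN key
by_sex_with_na = toy.groupby("sex_simple", dropna=False)["weight"].mean()
print(by_sex_with_na)
```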

## 8. Collect your code and explain your results

In a new code cell, collect all the relevant code to create a streamlined workflow that obtains the final result from exercise 7, starting from importing the data. Your code cell should:

In [None]:

1. Import data and libraries
2. Data exploration:
    - .info(), .shape, .isna().sum(), .nunique()
3. Group by sex: groupby('sex_simple')
4. Calculate the mean weight of each sex: ['weight'].mean()


In [44]:
# Import libraries
import pandas as pd
import numpy as np

# Load data
hares=pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

# Checking if the 'sex' column contains more than two categories
if len(hares['sex'].unique())>2:
    print("there's more than 2 categories")
    # Define conditions for the 'sex' column based on its values
    conditions = [
        hares['sex'].isin(['F', 'f', 'f_']),
        hares['sex'].isin(['M', 'm', 'm_'])
    ]

    # Define the corresponding gender labels for the conditions
    gender = ['female', 'male']

    # Create a new column 'sex_simple' based on the defined conditions
    # Use np.select to assign 'female' or 'male'; unmatched rows get the default,
    # which shows up as the string 'nan' in the output because the choices are strings
    hares['sex_simple'] = np.select(conditions, gender, default=np.nan)
    
    # Calculating the mean of weight for each sex group
    mean_weight_bysex=hares.groupby('sex_simple')['weight'].mean()
    
    # Display output of mean_weight_bysex
    print(mean_weight_bysex)

else:
    print("there's only 2 categories")

there's more than 2 categories
sex_simple
female    1366.920372
male      1352.145553
nan       1176.511111
Name: weight, dtype: float64
