# Gator hunt

The Florida Fish and Wildlife Conservation Commission keeps track of [gators killed by hunters](http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/). A cut of this data lives in `../data/gators.csv`.

Let's take a look.

In [21]:
# import pandas
import pandas as pd

In [22]:
# read in the CSV
df = pd.read_csv('../data/gators.csv')

In [23]:
# check the output with `head()`
df.head()

Unnamed: 0,Year,Area Number,Area Name,Carcass Size,Harvest Date,Location
0,2000,101,LAKE PIERCE,11 ft. 5 in.,09-22-2000,
1,2000,101,LAKE PIERCE,9 ft. 0 in.,10-02-2000,
2,2000,101,LAKE PIERCE,8 ft. 10 in.,10-06-2000,
3,2000,101,LAKE PIERCE,8 ft. 0 in.,09-25-2000,
4,2000,101,LAKE PIERCE,8 ft. 0 in.,10-07-2000,


### Check it out

First, let's take a look at our data and examine some of the column values that we might be interested in analyzing. We're already starting to think about the questions we want this data to help us answer.

In [24]:
# get the info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94337 entries, 0 to 94336
Data columns (total 6 columns):
Year            94337 non-null int64
Area Number     94337 non-null int64
Area Name       94337 non-null object
Carcass Size    94337 non-null object
Harvest Date    94337 non-null object
Location        94327 non-null object
dtypes: int64(2), object(4)
memory usage: 4.3+ MB
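
The `info()` output reports 94,327 non-null values in `Location` out of 94,337 entries, so 10 rows are null there. A quick way to get that tally directly is to chain `isna()` and `sum()`. This is a minimal sketch on a made-up three-row data frame (the `sample` frame and its values are stand-ins, not the real file):

```python
import pandas as pd

# hypothetical stand-in for a few rows of the gator data
sample = pd.DataFrame({
    'Area Name': ['LAKE PIERCE', 'HIGHLANDS COUNTY', 'LAKE JESUP'],
    'Location': [None, 'LITTLE RED WATER LAKE', None],
})

# isna() flags missing values; sum() counts the True values per column
null_counts = sample.isna().sum()
print(null_counts)
```

On the real data, the same idea would be `df.isna().sum()`.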


In [25]:
# what's the year range, with counts?
df['Year'].value_counts()

2011    8011
2013    7979
2009    7744
2010    7654
2014    7374
2016    7155
2015    6726
2012    6657
2006    6420
2008    6201
2007    5952
2005    3453
2004    3227
2003    2820
2000    2547
2001    2262
2002    2155
Name: Year, dtype: int64
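
`value_counts()` sorts by count, biggest first, which makes the year _range_ a little hard to eyeball. One option is to chain `sort_index()` to put the years in order, or just ask for the `min()` and `max()`. A minimal sketch, using a made-up `years` series in place of `df['Year']`:

```python
import pandas as pd

# hypothetical stand-in for the 'Year' column
years = pd.Series([2001, 2000, 2001, 2002, 2001, 2000], name='Year')

# value_counts() orders by count; sort_index() re-orders by year
counts_by_year = years.value_counts().sort_index()
print(counts_by_year)

# the range itself: earliest and latest year
first_year, last_year = years.min(), years.max()
print(first_year, last_year)
```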

In [26]:
# let's also peep the carcass size values to get the pattern
df['Carcass Size'].unique()

array(['11 ft. 5 in.', '9 ft. 0 in.', '8 ft. 10 in.', '8 ft. 0 in.',
       '7 ft. 2 in.', '7 ft. 1 in.', '6 ft. 11 in.', '6 ft. 7 in.',
       '6 ft. 6 in.', '6 ft. 3 in.', '12 ft. 7 in.', '12 ft. 3 in.',
       '12 ft. 2 in.', '12 ft. 0 in.', '11 ft. 10 in.', '11 ft. 7 in.',
       '11 ft. 1 in.', '10 ft. 9 in.', '10 ft. 4 in.', '9 ft. 9 in.',
       '9 ft. 8 in.', '9 ft. 6 in.', '9 ft. 3 in.', '9 ft. 2 in.',
       '8 ft. 3 in.', '8 ft. 2 in.', '7 ft. 9 in.', '7 ft. 8 in.',
       '7 ft. 5 in.', '7 ft. 3 in.', '11 ft. 2 in.', '11 ft. 0 in.',
       '10 ft. 8 in.', '10 ft. 3 in.', '10 ft. 2 in.', '9 ft. 10 in.',
       '8 ft. 11 in.', '8 ft. 9 in.', '8 ft. 5 in.', '7 ft. 11 in.',
       '7 ft. 10 in.', '7 ft. 6 in.', '7 ft. 0 in.', '6 ft. 4 in.',
       '6 ft. 0 in.', '11 ft. 6 in.', '10 ft. 5 in.', '9 ft. 7 in.',
       '6 ft. 8 in.', '6 ft. 5 in.', '5 ft. 8 in.', '5 ft. 6 in.',
       '5 ft. 0 in.', '11 ft. 9 in.', '11 ft. 4 in.', '10 ft. 1 in.',
       '10 ft. 0 in.', '7 ft. 4 in.

### Come up with a list of questions

- What's the longest gator in our data?
- Average length by year?
- How many gators are killed by month?

### Write a function to calculate gator length in inches

Right now, the value for the gator's length is a string following this pattern: `{} ft. {} in.`.

Let's create a new column to get the gator's length in a consistent, numeric unit: inches.

We're going to write a function to do these steps:
- Given a row of data, capture the feet and inch values in the carcass size column -- we can split the string on 'ft.' and clean up each piece from there
- Multiply feet by 12
- Add that number to the inch value
- `return` the result

We shall call this function on the data frame using the [`.apply()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method.

👉 Learn more about functions in [this notebook](../reference/Functions.ipynb).

👉 Learn more about using the `apply()` function in [this notebook](../reference/Using%20the%20apply%20method%20in%20pandas.ipynb).

👉 Learn more about string methods like `split()` in [this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#String-methods).

In [27]:
def get_inches(row):
    '''Given a row of gator data, parse out gator length in inches'''

    # get the carcass size string
    carcass_size = row['Carcass Size']
    
    # split on 'ft.'
    size_split = carcass_size.split('ft.')
    
    # grab the first item in the resulting list [0] - the number of feet
    # strip whitespace
    # coerce to integer
    feet = int(size_split[0].strip())
    
    # get the second item [1] in that list - the number of inches
    # replace 'in.' with nothing
    # strip whitespace
    # coerce to integer
    inches = int(size_split[1].replace('in.', '').strip())
    
    # return inches plus ft*12
    return inches + (feet * 12)

In [28]:
# apply our new function, specifying axis=1
# for row-wise application
df['length_in'] = df.apply(get_inches, axis=1)

In [29]:
# check the output with head()
df.head()

Unnamed: 0,Year,Area Number,Area Name,Carcass Size,Harvest Date,Location,length_in
0,2000,101,LAKE PIERCE,11 ft. 5 in.,09-22-2000,,137
1,2000,101,LAKE PIERCE,9 ft. 0 in.,10-02-2000,,108
2,2000,101,LAKE PIERCE,8 ft. 10 in.,10-06-2000,,106
3,2000,101,LAKE PIERCE,8 ft. 0 in.,09-25-2000,,96
4,2000,101,LAKE PIERCE,8 ft. 0 in.,10-07-2000,,96
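
For comparison, here's a sketch of another way to get the same numbers without a row-wise `apply()`: the `.str.extract()` string method with a regular expression. This is an alternative, not what the notebook above does. The `sample` frame and the group names `feet` and `inches` are made up for illustration, and the `astype(int)` step assumes every value matches the `{} ft. {} in.` pattern (a non-matching value would come back null and raise an error here).

```python
import pandas as pd

# hypothetical stand-in using carcass sizes from the first rows above
sample = pd.DataFrame({
    'Carcass Size': ['11 ft. 5 in.', '9 ft. 0 in.', '8 ft. 10 in.']
})

# capture the feet and inches digits as two named groups
pattern = r'(?P<feet>\d+)\s*ft\.\s*(?P<inches>\d+)\s*in\.'
parts = sample['Carcass Size'].str.extract(pattern)

# the captured pieces are strings, so coerce to integers
parts = parts.astype(int)

# feet * 12 plus inches, computed for the whole column at once
sample['length_in'] = parts['feet'] * 12 + parts['inches']
print(sample)
```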


In [30]:
# sort by length descending, check it out with head()
df.sort_values('length_in', ascending=False).head()

Unnamed: 0,Year,Area Number,Area Name,Carcass Size,Harvest Date,Location,length_in
44996,2010,502,ST. JOHNS RIVER (LAKE POINSETT),14 ft. 3 in.,10-31-2010,,171
78315,2014,828,HIGHLANDS COUNTY,14 ft. 3 in.,10-28-2014,LITTLE RED WATER LAKE,171
31961,2008,510,LAKE JESUP,14 ft. 1 in.,08-26-2008,,169
70005,2013,733,LAKE TALQUIN,14 ft. 1 in.,09-02-2013,,169
63077,2012,828,HIGHLANDS COUNTY,14 ft. 0 in.,10-31-2012,boat ramp north of boat ramp road,168
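
If all we want is the top few records, `nlargest()` is a shortcut for sort-then-head, and `max()` answers the "longest gator" question directly. A minimal sketch on a made-up `sample` frame, with `divmod()` to turn inches back into feet and inches:

```python
import pandas as pd

# hypothetical stand-in for the gator data
sample = pd.DataFrame({
    'Area Name': ['LAKE PIERCE', 'LAKE JESUP', 'LAKE TALQUIN'],
    'length_in': [137, 169, 108],
})

# the two longest records, longest first
longest = sample.nlargest(2, 'length_in')
print(longest)

# the single longest length, converted back to feet and inches
max_length = sample['length_in'].max()
feet, inches = divmod(max_length, 12)
print(feet, 'ft.', inches, 'in.')
```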


### Count by year

Our friend `value_counts()` is _on it_.

In [31]:
df['Year'].value_counts()

2011    8011
2013    7979
2009    7744
2010    7654
2014    7374
2016    7155
2015    6726
2012    6657
2006    6420
2008    6201
2007    5952
2005    3453
2004    3227
2003    2820
2000    2547
2001    2262
2002    2155
Name: Year, dtype: int64

### Average length by year

To get the average length of gators by year, we'll run a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

👉 For more details on creating pivot tables, [see this notebook](../reference/Grouping%20data%20in%20pandas.ipynb#pivot_table()).

In [32]:
# get average length harvested by year
avg_length_by_year = pd.pivot_table(df,
                                   values='length_in',
                                   index='Year',
                                   aggfunc='mean')

In [33]:
avg_length_by_year

Year,length_in
2000,104.10051
2001,104.16313
2002,99.721578
2003,100.596454
2004,101.734738
2005,100.789169
2006,101.802181
2007,102.617776
2008,101.234478
2009,95.944215
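
A pivot table with one index column and one aggregation is doing the same work as a `groupby()`. Here's a minimal sketch showing both side by side on a made-up `sample` frame (the values are invented for illustration):

```python
import pandas as pd

# hypothetical stand-in for the gator data
sample = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'length_in': [100, 110, 90, 96],
})

# pivot table version: returns a data frame
avg_pivot = pd.pivot_table(sample,
                           values='length_in',
                           index='Year',
                           aggfunc='mean')

# groupby version: returns a series with the same numbers
avg_groupby = sample.groupby('Year')['length_in'].mean()

print(avg_pivot)
print(avg_groupby)
```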


### Treating dates as dates

This data includes the date on which the gator was killed, but the date values are being stored as strings. If we want to do some time-based analysis -- comparing the gator hunt by month, or whatever -- we'd want to deal directly with native dates.

Noting the format (month-day-year), let's use the [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) method to convert the dates into native date objects. We'll tell pandas to use the [correct date specification](http://strftime.org/) and to coerce errors to null values rather than throw a giant exception.

👉 For more information on handling dates in pandas, [see this notebook](../reference/Date%20and%20time%20data%20types.ipynb#Working-with-dates-in-pandas).

In [34]:
df['harvest_date_clean'] = pd.to_datetime(df['Harvest Date'],
                                         format='%m-%d-%Y',
                                         errors='coerce')

In [35]:
df.head()

Unnamed: 0,Year,Area Number,Area Name,Carcass Size,Harvest Date,Location,length_in,harvest_date_clean
0,2000,101,LAKE PIERCE,11 ft. 5 in.,09-22-2000,,137,2000-09-22
1,2000,101,LAKE PIERCE,9 ft. 0 in.,10-02-2000,,108,2000-10-02
2,2000,101,LAKE PIERCE,8 ft. 10 in.,10-06-2000,,106,2000-10-06
3,2000,101,LAKE PIERCE,8 ft. 0 in.,09-25-2000,,96,2000-09-25
4,2000,101,LAKE PIERCE,8 ft. 0 in.,10-07-2000,,96,2000-10-07
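
To see what `errors='coerce'` does with a value that doesn't fit the format, here's a minimal sketch on a made-up `raw` series (the third value is deliberately junk). The bad value becomes a null date (`NaT`) instead of raising an exception, and counting the nulls afterward tells us how many values failed to parse.

```python
import pandas as pd

# hypothetical date strings; the last one does not match the format
raw = pd.Series(['09-22-2000', '10-02-2000', 'not a date'])

# parse month-day-year strings, turning failures into NaT
clean = pd.to_datetime(raw, format='%m-%d-%Y', errors='coerce')
print(clean)

# count how many values could not be parsed
n_bad = clean.isna().sum()
print(n_bad)
```

On the real data, `df['harvest_date_clean'].isna().sum()` would give the equivalent count.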


If you want to double-check that the data type is correct, you can access the `dtypes` attribute.

In [36]:
df.dtypes

Year                           int64
Area Number                    int64
Area Name                     object
Carcass Size                  object
Harvest Date                  object
Location                      object
length_in                      int64
harvest_date_clean    datetime64[ns]
dtype: object

### Gator hunt by month

[According to](http://myfwc.com/media/310257/Alligator-processors.pdf) the Florida Fish and Wildlife Conservation Commission, the gator hunt season is in the fall:

![gatorhunt](../img/gatorhunt.png "gatorhunt")

Let's look at the totals by month:
- Create a new column for the month using `apply()` with a [lambda expression](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) -- we'll access the `month` attribute of the date
- Do value counts by month

👉 For more information on using lambda expressions, [see this notebook](../reference/Functions.ipynb#Lambda-expressions).

In [37]:
df['month'] = df['harvest_date_clean'].apply(lambda x: x.month)

In [38]:
df['month'].unique()

array([ 9., 10., nan, 11.,  8.,  5.,  7.,  1.,  4.,  6.])

In [39]:
df['month'].value_counts().sort_index()

1.0         2
4.0         1
5.0         1
6.0         2
7.0         5
8.0     22912
9.0     32978
10.0    37470
11.0      702
Name: month, dtype: int64
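
The `nan` in the list of unique months comes from dates that couldn't be parsed, and it's also why the months show up as floats like `9.0`. As a sketch of an alternative to `apply()` with a lambda, the `.dt` accessor gets the month for the whole column at once, and the nullable `Int64` type keeps the months as integers while leaving the missing ones null. The `dates` series here is made up for illustration.

```python
import pandas as pd

# hypothetical dates; the last value fails to parse and becomes NaT
dates = pd.to_datetime(pd.Series(['09-22-2000', '10-02-2000', 'bad']),
                       format='%m-%d-%Y',
                       errors='coerce')

# .dt.month pulls the month from every date in the column
months = dates.dt.month
print(months)

# convert to pandas' nullable integer type to drop the ".0"
months_int = months.astype('Int64')
print(months_int)
```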

What if we wanted to get a count by month _by year_? Pivot tables to the rescue, again.

We'll provide the `pivot_table` function with five things:
- `df` specifies what data frame we're pivoting
- `index='month'` specifies the column we're grouping on
- `columns='Year'` specifies the column whose unique values become the columns of the pivot table
- `aggfunc='count'` tells pandas how to aggregate the data -- we want to count the values
- `values='length_in'` specifies the column of data to apply the aggregation to -- we're going to count up every record of a carcass that has a length

In [40]:
by_month_by_year = pd.pivot_table(df,
                                  index='month',
                                  values='length_in',
                                  columns='Year',
                                  aggfunc='count')

In [41]:
by_month_by_year

month,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
1.0,,,,,,,,,,,,,,,,2.0,
4.0,,,,,,,,,,,,,,,,1.0,
5.0,,,,,,,,,,,,,,,1.0,,
6.0,,,,,,,,,,,,,,,,,2.0
7.0,,,,,,,,,,,,,,,1.0,2.0,2.0
8.0,,,,,,,2129.0,2093.0,1854.0,2279.0,2081.0,2603.0,1848.0,2145.0,2096.0,1844.0,1940.0
9.0,1929.0,1562.0,1470.0,1870.0,1542.0,2296.0,2202.0,1795.0,1910.0,2225.0,2164.0,2290.0,2064.0,2174.0,1846.0,1719.0,1920.0
10.0,601.0,694.0,670.0,934.0,1548.0,1143.0,2073.0,2016.0,2250.0,3141.0,3342.0,3075.0,2709.0,3609.0,3371.0,3052.0,3242.0
11.0,,,,,117.0,1.0,16.0,16.0,133.0,40.0,37.0,43.0,36.0,51.0,57.0,106.0,49.0


All those `NaN`s mixed in with our numbers give me the fantods. Let's use the `.fillna()` method to replace those with `0`. (This returns a new data frame; `by_month_by_year` itself is unchanged unless we reassign it.)

In [42]:
by_month_by_year.fillna(0)

month,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,2.0
8.0,0.0,0.0,0.0,0.0,0.0,0.0,2129.0,2093.0,1854.0,2279.0,2081.0,2603.0,1848.0,2145.0,2096.0,1844.0,1940.0
9.0,1929.0,1562.0,1470.0,1870.0,1542.0,2296.0,2202.0,1795.0,1910.0,2225.0,2164.0,2290.0,2064.0,2174.0,1846.0,1719.0,1920.0
10.0,601.0,694.0,670.0,934.0,1548.0,1143.0,2073.0,2016.0,2250.0,3141.0,3342.0,3075.0,2709.0,3609.0,3371.0,3052.0,3242.0
11.0,0.0,0.0,0.0,0.0,117.0,1.0,16.0,16.0,133.0,40.0,37.0,43.0,36.0,51.0,57.0,106.0,49.0
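
Another option is to handle the empty cells when the pivot table is built, using the `fill_value` argument of `pivot_table()`. A minimal sketch on a made-up `sample` frame (values invented for illustration), with an explicit `astype(int)` at the end so the counts read as whole numbers:

```python
import pandas as pd

# hypothetical stand-in for the gator data
sample = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001, 2001],
    'month': [9, 10, 9, 9, 11],
    'length_in': [100, 110, 90, 96, 120],
})

# count records by month and year, filling empty combinations with 0
counts = pd.pivot_table(sample,
                        index='month',
                        columns='Year',
                        values='length_in',
                        aggfunc='count',
                        fill_value=0)

# make sure the counts are stored as integers
counts = counts.astype(int)
print(counts)
```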


In [43]:
# what else?