# Code to functions

One of the nice things about Python (and Jupyter notebooks) is that we can write functions 'on the fly' and apply them to our data.  This can simplify repetitive coding, give us clearer and more readable code, improve reproducibility, and allow us to make up for the shortcomings of built-in functions.

This is an example from [pyOpenSci](https://github.com/pyOpenSci/) that shows the potential benefits of writing your own functions.  Using Pandas, we're going to import a simple CSV file that lists locations, each location's average snowfall and temperature, and the country or geographic region where the location can be found.  We'll then do some simple operations to group locations by their snowfall and look at their average temperature:

In [1]:
# import Pandas
import pandas as pd

# Load the data 
data_path = "snow_data.csv"
data = pd.read_csv(data_path)
print(data.head())


             site  average_snowfall  average_temperature       country
0      Mount Etna                13                   22         Italy
1  Table Mountain                 0                   20  South Africa
2   Mount Massive                20                   28           USA
3   Mount Harvard                 3                   67           USA
4   La Plata Peak                 4                   68           USA


In [2]:
# Filter rows where the average snowfall is greater than 10, and assign them to a new DataFrame
sites_more_snow = data[data["average_snowfall"] > 10]

# we can print this result now
print(f"Sites with more than 10 inches of snow: \n{sites_more_snow}")


Sites with more than 10 inches of snow: 
                       site  average_snowfall  average_temperature  \
0                Mount Etna                13                   22   
2             Mount Massive                20                   28   
5                Pikes Peak                30                   25   
10               Mount Fuji                20                   10   
13            Mount Whitney                20                    5   
14            Mount Everest                50                  -15   
15                   Denali                45                  -10   
17               Mount Cook                25                    5   
19               Mont Blanc                35                    0   
20                   Elbrus                30                   -8   
21             Mount Vinson                20                  -30   
22              Cerro Torre                15                    5   
23             Mount Elbert                15    

In [3]:
# We can then use the .mean method to get the average temperature for those sites 
avg_temp_more_snow = sites_more_snow["average_temperature"].mean()
print(f"The mean temp for sites with more snow is {avg_temp_more_snow}")


The mean temp for sites with more snow is 7.4375


In [4]:
# Now let's find sites with a small accumulation of snow on average and once again calculate the average temperature for those sites
sites_less_snow = data[data["average_snowfall"] < 5]
avg_temp_less_snow = sites_less_snow["average_temperature"].mean()

print(f"The mean temp for sites with less snow is {avg_temp_less_snow}")


The mean temp for sites with less snow is 50.57142857142857


In [5]:
# Finally, let's filter rows where the average snowfall is between 5 and 15 inches
sites_medium_snow = data[(data["average_snowfall"] >= 5) & (data["average_snowfall"] <= 15)]

print(f"Sites with snowfall between 5 and 15 inches:\n{sites_medium_snow}")


Sites with snowfall between 5 and 15 inches:
                 site  average_snowfall  average_temperature    country
0          Mount Etna                13                   22      Italy
6    Mount Kosciuszko                 5                   15  Australia
7           Ben Nevis                 8                   10   Scotland
9       Mount Snowdon                 5                    8      Wales
11        Mount Teide                10                   18      Spain
16  Mount Kilimanjaro                 5                   20   Tanzania
18          Aconcagua                10                   -5  Argentina
22        Cerro Torre                15                    5  Argentina
23       Mount Elbert                15                   32        USA
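
The two-sided filter above combines two boolean masks with `&`, and each comparison needs its own parentheses because `&` binds more tightly than `>=` and `<=`. As a small sketch, the same inclusive range can also be written with `Series.between`. The three-row DataFrame here is a stand-in built from rows printed earlier, not the full `snow_data.csv`.

```python
import pandas as pd

# Stand-in sample using three rows from the earlier output (not the full file)
sample = pd.DataFrame(
    {
        "site": ["Mount Etna", "Table Mountain", "Mount Massive"],
        "average_snowfall": [13, 0, 20],
        "average_temperature": [22, 20, 28],
    }
)

# Two masks combined with & (parentheses are required around each comparison)
mask_combined = (sample["average_snowfall"] >= 5) & (sample["average_snowfall"] <= 15)

# Equivalent range check; between() includes both endpoints by default
mask_between = sample["average_snowfall"].between(5, 15)

print(sample[mask_between])
```

Both masks select the same rows, so which one to use is mostly a matter of readability.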


In [6]:
# And calculate the mean temperature for sites with medium snow
avg_temp_medium_snow = sites_medium_snow["average_temperature"].mean()
print(
    f"The mean temperature for sites with medium snow is {avg_temp_medium_snow}"
)

The mean temperature for sites with medium snow is 13.88888888888889


## Is there a 'cleaner' way?

That worked, but it was repetitive and a little mind-numbing.  We did the same few things over and over again: we filtered the rows by average snowfall, saved those rows as a new DataFrame, and then calculated the average temperature for those sites.
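
To make the repeated pattern concrete, the filter-then-average steps can be folded into one small helper. This is only a sketch: the name `mean_temperature_where` is made up for illustration and does not come from pyOpenSci, and the sample DataFrame holds just the first five rows printed earlier rather than the full file.

```python
import pandas as pd


def mean_temperature_where(data, mask):
    """Return the mean of average_temperature for the rows selected by a boolean mask."""
    return data.loc[mask, "average_temperature"].mean()


# Stand-in sample: the first five rows printed earlier (not the full snow_data.csv)
sample = pd.DataFrame(
    {
        "site": ["Mount Etna", "Table Mountain", "Mount Massive", "Mount Harvard", "La Plata Peak"],
        "average_snowfall": [13, 0, 20, 3, 4],
        "average_temperature": [22, 20, 28, 67, 68],
    }
)

snow = sample["average_snowfall"]
more = mean_temperature_where(sample, snow > 10)                      # Mount Etna, Mount Massive
less = mean_temperature_where(sample, snow < 5)                       # Table Mountain, Mount Harvard, La Plata Peak
medium = mean_temperature_where(sample, (snow >= 5) & (snow <= 15))   # Mount Etna only
print(more, less, medium)
```

Each call replaces one filter-and-average pair from the cells above, so the thresholds become the only thing that changes between calls.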

Here's an alternative approach using a set of simple functions to do the repetitive work for us.  Once again, this is from [pyOpenSci](https://github.com/pyOpenSci/):

In [7]:
# create a function that reads in the data and replaces any missing data
def load_data(filepath):
    # Load the data with Pandas and replace the missing-value flag of -999 with NaN
    return pd.read_csv(filepath, na_values=-999)


# create a function that categorizes a site by the amount of snow
def categorize_snow(x):
    if x > 15:
        return "High"
    elif x < 5:
        return "Low"
    else:
        return "Medium"

# create a function that adds a column to our DataFrame with the snow category
def add_snowfall_category(data):
    # apply categorize_snow (defined above) to every value in the average_snowfall column
    data["snowfall_category"] = data["average_snowfall"].apply(categorize_snow)
    return data

# Create a summary DataFrame that shows the characteristics of each group of locations in each snow category
def summarize_data(data):
    summary = data.groupby("snowfall_category").agg(
        avg_snowfall=("average_snowfall", "mean"),
        avg_temperature=("average_temperature", "mean"),
        site_count=("site", "count"),
    )
    return summary


Once we run the code cell above, those functions are available to us for the rest of this session! In the code block below, we'll call those functions to do the work for us:

In [8]:
# Load the data and deal with any missing data using the first function we wrote above
data = load_data(data_path)
data


               site  average_snowfall  average_temperature       country
0        Mount Etna                13                   22         Italy
1    Table Mountain                 0                   20  South Africa
2     Mount Massive                20                   28           USA
3     Mount Harvard                 3                   67           USA
4     La Plata Peak                 4                   68           USA
5        Pikes Peak                30                   25           USA
6  Mount Kosciuszko                 5                   15     Australia
7         Ben Nevis                 8                   10      Scotland
8    Mount Vesuvius                 0                   25         Italy
9     Mount Snowdon                 5                    8         Wales
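
The effect of `na_values=-999` in `load_data` is easiest to see on a tiny example. The CSV text and site names below are made up for illustration; `snow_data.csv` itself may or may not contain any -999 flags.

```python
import io

import pandas as pd

# Made-up two-row CSV with -999 standing in for a missing snowfall value
csv_text = "site,average_snowfall,average_temperature\nSite A,-999,10\nSite B,20,30\n"

raw = pd.read_csv(io.StringIO(csv_text))
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=-999)

# Without na_values, the flag is treated as a real measurement and drags the mean down
print(raw["average_snowfall"].mean())

# With na_values, the flag becomes NaN, which .mean() skips by default
print(cleaned["average_snowfall"].mean())
```

Without the flag handling, a single placeholder value would silently distort every average computed later.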


In [9]:
# Add snowfall category to our DataFrame using the third function we wrote above, which calls the second function we wrote to do the categorization
data = add_snowfall_category(data)
data


               site  average_snowfall  average_temperature       country  snowfall_category
0        Mount Etna                13                   22         Italy             Medium
1    Table Mountain                 0                   20  South Africa                Low
2     Mount Massive                20                   28           USA               High
3     Mount Harvard                 3                   67           USA                Low
4     La Plata Peak                 4                   68           USA                Low
5        Pikes Peak                30                   25           USA               High
6  Mount Kosciuszko                 5                   15     Australia             Medium
7         Ben Nevis                 8                   10      Scotland             Medium
8    Mount Vesuvius                 0                   25         Italy                Low
9     Mount Snowdon                 5                    8         Wales             Medium


In [10]:
# Summarize the data using the fourth function we wrote
summary = summarize_data(data)
summary


                   avg_snowfall  avg_temperature  site_count
snowfall_category
High                  29.230769         4.615385          13
Low                    1.285714        50.571429           7
Medium                 9.555556        13.888889           9
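
The category boundaries decide which group a site falls into, so it can help to check them directly. The sketch below repeats `categorize_snow` with the same thresholds and runs it on the edge values 4, 5, 15, and 16. The `np.select` version is an optional vectorized alternative added here for comparison; it is not part of the pyOpenSci example.

```python
import numpy as np
import pandas as pd


def categorize_snow(x):
    # Same thresholds as the function defined earlier
    if x > 15:
        return "High"
    elif x < 5:
        return "Low"
    else:
        return "Medium"


# Edge values around the two thresholds
snowfall = pd.Series([4, 5, 15, 16])

# Element-by-element, as in add_snowfall_category
by_apply = snowfall.apply(categorize_snow)
print(by_apply.tolist())

# Vectorized sketch with the same conditions, checked in the same order
by_select = np.select([snowfall > 15, snowfall < 5], ["High", "Low"], default="Medium")
print(by_select.tolist())
```

Both 5 and 15 land in "Medium", which matches the inclusive 5 to 15 range used in the manual filter earlier.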


## Creating and using your own module

Finally, if you find yourself regularly using functions you wrote, you might want to put them in a `.py` file that you can import just like we import `NumPy` or `Pandas`.  Once again based on the example from [pyOpenSci](https://github.com/pyOpenSci/), the functions we wrote above are stored in a single Python file called `my_module.py`.  Note that in that file the function that adds the snowfall category is named `categorize_snowfall_amount` rather than `add_snowfall_category`.
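
The file itself is not shown here, so the following is only a guess at what `my_module.py` might contain: the functions from the earlier cell with docstrings added, and with the category-adding function named `categorize_snowfall_amount` to match the import in the next cell. Treat the layout and docstrings as assumptions.

```python
# my_module.py (hypothetical contents; the real file is not shown in this notebook)
"""Helpers for loading and summarizing the snowfall data."""
import pandas as pd


def load_data(filepath):
    """Read the CSV at filepath, treating -999 as a missing value."""
    return pd.read_csv(filepath, na_values=-999)


def categorize_snow(x):
    """Label a snowfall amount as 'High' (> 15), 'Low' (< 5), or 'Medium'."""
    if x > 15:
        return "High"
    elif x < 5:
        return "Low"
    return "Medium"


def categorize_snowfall_amount(data):
    """Add a 'snowfall_category' column based on 'average_snowfall'."""
    data["snowfall_category"] = data["average_snowfall"].apply(categorize_snow)
    return data


def summarize_data(data):
    """Return mean snowfall, mean temperature, and site count per category."""
    return data.groupby("snowfall_category").agg(
        avg_snowfall=("average_snowfall", "mean"),
        avg_temperature=("average_temperature", "mean"),
        site_count=("site", "count"),
    )
```

For the import to work, the file needs to sit in the notebook's working directory or somewhere else on Python's import path.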

In [11]:
from my_module import load_data, summarize_data, categorize_snowfall_amount

data_path = "snow_data.csv"

# Load and clean the data
data = load_data(data_path)

# Add snowfall category
data = categorize_snowfall_amount(data)

# Summarize the data
summary = summarize_data(data)
summary

snow_data.csv


                   avg_snowfall  avg_temperature  site_count
snowfall_category
High                  29.230769         4.615385          13
Low                    1.285714        50.571429           7
Medium                 9.555556        13.888889           9
