# User-defined functions in Python


## Functions review

**User-defined functions** are a powerful tool in Python that let us write a section of code for a specific task and then call & run it whenever we choose in our program. Python functions are **objects**, which here simply means that a function is a named piece of code that can be called by its name. Functions take inputs called **arguments** and can **return** output data as well. 

One of the perks of functions is that they allow us to reduce the bulk in our code. Instead of repeating larger sections of code throughout our program, we call the function whenever that task needs to be completed. This makes code more readable for humans & makes tasks more easily repeatable. 

## Input arguments 

Let's create a simple function that capitalizes all letters in a given string & prints the output. The syntax to create your own function is as follows:

In [None]:
def make_caps(input_string):
    new_word = input_string.upper()
    print(new_word)

To use our function we have to **call** it & provide it with the appropriate input arguments. 

Let's do that now by setting our `input_string` to a string we want to capitalize, then do so using our function. The beauty of a function is that you can reuse it as many times as you like to repeat a certain task on various input arguments. 

In [None]:
# run function with an input argument
make_caps(input_string = 'hockey')

# run function with a different input argument 
make_caps(input_string = 'let\'s go rangers')
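
The calls above pass the argument by name (`input_string = ...`). As a small sketch, the same kind of function can also be called with a positional argument. The function `make_caps_text` below is a hypothetical variant (not part of the lesson) that returns the string instead of printing it, so the two call styles are easy to compare.

```python
def make_caps_text(input_string):
    # hypothetical variant of make_caps that returns the result instead of printing it
    return input_string.upper()

# keyword argument: the value is matched to the parameter by name
by_keyword = make_caps_text(input_string='hockey')

# positional argument: the value is matched to the parameter by position
by_position = make_caps_text('hockey')

print(by_keyword)
print(by_position)
```

Both calls give the same result. Keyword arguments are longer to type but make it clearer which input is which once a function takes several arguments.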

<hr style="border:2px solid gray"> </hr>

### Now you try! 

Make a function called `make_lowercase` that will print an input string in all *lowercase* letters. 

Use this new function to print out `Get THE pUcK in THE NET` in all lowercase letters. 

In [None]:
### BEGIN SOLUTION 

def make_lowercase(input_string):
    print(input_string.lower()) 

make_lowercase(input_string = 'Get THE pUcK in THE NET')

### END SOLUTION 

<hr style="border:2px solid gray"> </hr>

## Bringing in some data 

Now that we remember the basic mechanics of user-defined functions in Python, let's bring in some data we can work with to explore further. 

We will read in the following csv files: 

- `artemi-panarin.csv`
- `alex-ovechkin.csv`
- `pk-subban.csv`

These files contain career summaries by season for three National Hockey League (NHL) players, obtained [here](https://moneypuck.com/data.htm). The format of all three files is the same. Each file has a large number of columns, but we are only interested in these:

- `season` - season year
- `name` - player name
- `team` - NHL team they played on for that season 
- `position` - player position on the ice
- `situation` - in-game situation (e.g. 5 on 5 means 5 players from each team are facing off) 
- `games_played` - total number of games played that season 
- `icetime` - total amount of ice time (seconds) that season 
- `shifts` - number of 'shifts' the player took during the season, in other words the number of times the player got on the ice. Hockey players play in "shifts" that are only about a minute long, so they are getting on and off the ice very frequently in a single game. 
- `gameScore` - a statistic used to calculate how effective/productive a player was during a particular game 
    

In [None]:
# import pandas to our workspace 
import pandas as pd 

# use read_csv to read in the files 
panarin = pd.read_csv('artemi-panarin.csv')
subban = pd.read_csv('pk-subban.csv')
ovechkin = pd.read_csv('alex-ovechkin.csv')

In [None]:
# print a preview of one dataframe 
panarin.head()

## Creating a function to clean up our dataframes 

You can see that there are a whopping 154 columns! But we only want to keep the few we listed above. Since we know the dataframes are all formatted the same way, we can create **one common function** that could be applied to any of the three dataframes (and by extension the data from any player obtained from MoneyPuck!) 

Let's create a function that will take the input dataframe `input_df`, select only the desired columns which we will save in a list called `cols_to_keep`, and **return** a cleaned up version of the dataframe. We will name this function `clean_player_data`. 

Pandas syntax is such that a list of column names in square brackets will return *only* the columns with names in our list. 
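
As a quick illustration of that syntax, here is a minimal sketch on a toy dataframe. The values and the `extra_stat` column are made up for illustration and are not taken from the MoneyPuck files.

```python
import pandas as pd

# toy dataframe with made-up values (not the MoneyPuck data)
toy = pd.DataFrame({
    'season': [2019, 2020],
    'team': ['NYR', 'NYR'],
    'extra_stat': [1.5, 2.5],
})

# a list of column names inside square brackets keeps only those columns
subset = toy[['season', 'team']]
print(subset.columns.tolist())
```

Note the double square brackets: the outer pair indexes the dataframe and the inner pair is the Python list of column names.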

In [None]:
# define our function 
def clean_player_data(input_df, cols_to_keep):
    output_df = input_df[cols_to_keep]
    return output_df 

In [None]:
# create our list of columns to keep 
cols_to_keep = ['season', 'name', 'team', 'position', 
                  'situation', 'games_played', 'icetime', 'shifts', 'gameScore']

# run the function on a player's dataframe & save to a new dataframe 
panarin_clean = clean_player_data(input_df = panarin, cols_to_keep = cols_to_keep)
panarin_clean.head()

In [None]:
# now apply the function to the other two players 
ovechkin_clean = clean_player_data(input_df = ovechkin, cols_to_keep = cols_to_keep)
ovechkin_clean.head()

In [None]:
subban_clean = clean_player_data(input_df = subban, cols_to_keep = cols_to_keep)
subban_clean.head()

Great! We see this function works when we input any of our three dataframes. It takes our two input arguments and returns a new dataframe that we save as a new variable. 
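
Saving the output to a variable only works because the function uses `return`. The sketch below contrasts a function that prints with one that returns. Both `shout_print` and `shout_return` are hypothetical names used only for this illustration.

```python
def shout_print(input_string):
    # prints the result; the function itself gives back None
    print(input_string.upper())

def shout_return(input_string):
    # hands the result back to the caller so it can be stored
    return input_string.upper()

printed = shout_print('rangers')
returned = shout_return('rangers')

print(printed)
print(returned)
```

The variable `printed` holds `None` because nothing was returned, while `returned` holds the capitalized string and can be used later in the program.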

## Functions to help us filter our data

We've used a function to clean up our dataframes, so let's now create some functions to do a bit of analysis! 

Let's create a function we can use to isolate data for different values in the `situation` column. The values in this column describe different situations in gameplay. Let's look at the unique values in the column first. 

In [None]:
panarin.situation.unique()

For some context, a value of `5on5` is the typical gameplay situation. If it is `4on5` or `5on4`, it indicates a "power play" where players are removed due to a penalty, and as a result one team has fewer players on the ice for a certain period of time. `all` is any situation, and `other` is any situation that doesn't fall into any of the other categories. 

We will design a function `get_situation` that isolates the `target_situation` (any unique value from the `situation` column) from our `input_dataframe`. 

In [None]:
def get_situation(input_dataframe, target_situation):
    output_dataframe = input_dataframe[input_dataframe['situation'] == target_situation]
    return output_dataframe

In [None]:
# test function 
get_situation(panarin_clean, '5on5')
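
The filter inside `get_situation` happens in two steps that are easy to miss when written on one line. A minimal sketch on a toy dataframe with made-up values shows each step separately:

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'situation': ['all', '5on5', '5on4', '5on5'],
    'icetime': [1000, 700, 200, 650],
})

# step 1: comparing a column to a value gives one True/False per row
mask = toy['situation'] == '5on5'
print(mask.tolist())

# step 2: indexing with that mask keeps only the rows marked True
filtered = toy[mask]
print(filtered)
```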

We can do something similar where we create another function that isolates a particular team from that player's history. Let's print out the possible options. 

In [None]:
print('Artemi Panarin has played for these teams: ' + str(panarin_clean.team.unique()))
print('Alex Ovechkin has played for these teams: ' + str(ovechkin_clean.team.unique()))
print('P.K. Subban has played for these teams: ' + str(subban_clean.team.unique()))

And now let's design our function & test it. 

In [None]:
def get_team(input_dataframe, target_team):
    output_dataframe = input_dataframe[input_dataframe['team'] == target_team]
    return output_dataframe

In [None]:
# test it 
get_team(subban_clean, 'NJD')

Hm, we've found something odd! P.K. Subban was traded to the New Jersey Devils (NJD) in 2019, but the function is only returning values from the 2021 season. What happened to 2019 and 2020's data? Is it possible our function doesn't work? Or is there an underlying problem in our dataframe? Let's check these years in our dataframe. 

We will do this by isolating when the season was 2019 OR (`|`) 2020. 

In [None]:
subban_clean[(subban_clean['season'] == 2019) | (subban_clean['season'] == 2020)]
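
When checking a column against several values, chaining `|` gets long quickly. As an alternative sketch (not the approach used in this lesson), the pandas `.isin()` method accepts a list of values. The toy dataframe below uses made-up rows.

```python
import pandas as pd

# toy dataframe with made-up rows
toy = pd.DataFrame({
    'season': [2018, 2019, 2020, 2021],
    'team': ['NSH', 'N.J', 'N.J', 'NJD'],
})

# two conditions joined with OR; each condition needs its own parentheses
with_or = toy[(toy['season'] == 2019) | (toy['season'] == 2020)]

# the same filter written with isin and a list of accepted values
with_isin = toy[toy['season'].isin([2019, 2020])]
print(with_isin)
```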

Interesting! P.K. Subban is shown to have played for 'N.J', but that's not a proper team abbreviation. The value *should be* 'NJD', so we just found a typo in our dataset! Let's replace this using the pandas `.replace()` method. 

This method takes the value to be replaced, followed by the value to replace it with. We add `inplace = True` to indicate we want this change to be saved to our dataframe. 

In [None]:
subban_clean.replace('N.J', 'NJD', inplace = True)

get_team(subban_clean, 'NJD')

Great! Now we fixed it & can use our function properly. This is an instance of **troubleshooting issues** during data analysis. Sometimes the issue is with the design/syntax within the function itself, and other times it is your function not being able to catch an issue in the underlying data as in this case. 

Oftentimes, we can get to the bottom of most problems by querying our dataframes and printing out the contents.
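
One quick query for this kind of check is to count how often each label appears in a column. The sketch below uses a toy dataframe with made-up rows. It also shows `.replace()` used without `inplace`, where the result is assigned to a new variable instead.

```python
import pandas as pd

# toy dataframe with made-up rows
toy = pd.DataFrame({'team': ['NSH', 'NSH', 'N.J', 'N.J', 'NJD']})

# count how often each label appears; unexpected spellings stand out
counts = toy['team'].value_counts()
print(counts)

# replace returns a new dataframe; saving it to a variable keeps the fix
fixed = toy.replace('N.J', 'NJD')
print(fixed['team'].unique())
```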

<hr style="border:2px solid gray"> </hr>

### Now you try! 

Use `get_team` to isolate the data for years that Artemi Panarin played on the New York Rangers (NYR) and save this to a dataframe called `panarin_nyr`. Print out the unique years that Panarin played on the NYR. 

In [None]:
### BEGIN SOLUTION 

panarin_nyr = get_team(panarin_clean, 'NYR')
panarin_nyr.season.unique()

### END SOLUTION 

<hr style="border:2px solid gray"> </hr>

## Nesting functions 

Each of the functions we created is helpful for querying our dataframes & returning filtered versions. This is a really useful step in analysis! But you will usually want to **actually do something** with the filtered data, not just return it. 

We can use the functions we've created **nested within** another function to complete some larger analysis task! 

Let's create a function called `plot_icetime`. In this function we will do a few things: 
- first use `get_situation` to isolate a particular situation 
- then use `get_team` to isolate when a player was on a particular team - this **must** be done using the output of `get_situation` so **both** filters are applied on the resulting dataframe!  
- add a new column called `icetime_min` that contains the icetime in minutes (`icetime` is given in seconds, so divide this by 60 to get icetime in minutes). 
- create a variable named `plot_title` where we build a string based on the unique inputs for our plot title 
- create a bar plot with `season` on the x axis and `icetime_min` on the y axis, so the height of each bar is the icetime in minutes 

In [None]:
def plot_icetime(input_dataframe, target_situation, target_team):
    # apply first filter - situation 
    filter_df_first = get_situation(input_dataframe, target_situation) 
    # apply second filter - team (input is output of first filter)
    filter_df_second = get_team(filter_df_first, target_team)
    # get icetime in minutes 
    filter_df_second['icetime_min'] = filter_df_second['icetime']/60
    # create a title for the plot using the unique inputs (player name, situation, team)
    plot_title = str(filter_df_second.name.unique()[0]) + ' ' + target_situation + ' icetime on the ' + target_team
    # create a bar plot 
    filter_df_second.plot.bar(x='season', y='icetime_min', title = plot_title, ylabel = 'icetime [minutes]')
    # return the filtered df 
    return filter_df_second

In [None]:
plot_icetime(panarin_clean, '5on5', 'NYR')

We see that our function works! It returns a new dataframe that contains Panarin's stats in 5-on-5 play situations while he played on the Rangers. We also have a bar plot showing icetime in minutes. 

We can use this function to analyze & plot any combination of players we have stats for, particular play situations, and when they were on a certain team. From this we learned Panarin had a comparable amount of ice time in 2019 and 2021, but his icetime was nearly halved in 2020. 
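
One detail worth knowing: `plot_icetime` adds a new column to a dataframe that came out of a filter. Depending on the pandas version, this can produce a `SettingWithCopyWarning`. A minimal sketch, on made-up data, of one common way to sidestep it by taking an explicit `.copy()` of the filtered rows first:

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'team': ['NYR', 'NYR', 'CBJ'],
    'icetime': [60000, 48000, 72000],  # seconds
})

# take an explicit copy of the filtered rows before adding a column
nyr = toy[toy['team'] == 'NYR'].copy()
nyr['icetime_min'] = nyr['icetime'] / 60

print(nyr)
print('icetime_min' in toy.columns)  # the original dataframe is left untouched
```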

<hr style="border:2px solid gray"> </hr>

### Now you try! 

We assume this function should work with any player, situation, and team combination. Use `plot_icetime` to create a bar plot that shows Alex Ovechkin's play time in `all` situations while on the Ottawa Senators (team abbreviation `OTT`). 

Does the function work? Why or why not?

In [None]:
### BEGIN SOLUTION 

plot_icetime(ovechkin_clean, 'all', 'OTT')

# this function does not work because Ovechkin never played on OTT! 
# The function fails to build a plot title & plot because the filtered dataframe is empty. 

### END SOLUTION 

<hr style="border:2px solid gray"> </hr>

## Debugging functions 

Sometimes a function doesn't work and we have to figure out why. Previously we showed an example of a problem in the underlying dataframe, but most often it will be an issue with the function itself. Sometimes a function will not work altogether, and sometimes it will only work in particular situations. 

In cases such as this, we have to go in and figure out why our function is failing. Let's debug `plot_icetime`. The simplest way to do this is by **adding print statements** at different stages in the function to see what step in the function fails!

We need to embed these print/return statements within the function because local function variables (such as `filter_df_first`) do not exist in our global variable space. We can use `print` or `return` to inspect our local function variables. 
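
To see why the inspection has to happen inside the function, here is a small sketch with a hypothetical function `double_it`. The variable `doubled` is created inside the function and cannot be reached from outside it.

```python
def double_it(value):
    # 'doubled' is a local variable: it only exists while the function runs
    doubled = value * 2
    return doubled

result = double_it(21)

try:
    print(doubled)  # not defined outside the function
    local_visible = True
except NameError as error:
    local_visible = False
    print('NameError:', error)
```

The returned value is available as `result`, but asking for `doubled` directly raises a `NameError`.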

Let's add `return filter_df_first` right after we define it in our function to check the output. 

In [None]:
def plot_icetime(input_dataframe, target_situation, target_team):
    # apply first filter - situation 
    filter_df_first = get_situation(input_dataframe, target_situation) 
    return filter_df_first
    # apply second filter - team (input is output of first filter)
    filter_df_second = get_team(filter_df_first, target_team)
    # get icetime in minutes 
    filter_df_second['icetime_min'] = filter_df_second['icetime']/60
    # create a title for the plot using the unique inputs (player name, situation, team)
    plot_title = str(filter_df_second.name.unique()[0]) + ' ' + target_situation + ' icetime on the ' + target_team
    # create a bar plot 
    filter_df_second.plot.bar(x='season', y='icetime_min', title = plot_title, ylabel = 'icetime [minutes]')
    # return the filtered df 
    return filter_df_second
    
plot_icetime(ovechkin_clean, 'all', 'OTT')

Okay, so far so good... the function is returning all values where the `situation` column equals `all`. The issue is not in this step. 

Let's remove that return statement, and add another to print `filter_df_second`. 

In [None]:
def plot_icetime(input_dataframe, target_situation, target_team):
    # apply first filter - situation 
    filter_df_first = get_situation(input_dataframe, target_situation) 
    # apply second filter - team (input is output of first filter)
    filter_df_second = get_team(filter_df_first, target_team)
    return filter_df_second
    # get icetime in minutes 
    filter_df_second['icetime_min'] = filter_df_second['icetime']/60
    # create a title for the plot using the unique inputs (player name, situation, team)
    plot_title = str(filter_df_second.name.unique()[0]) + ' ' + target_situation + ' icetime on the ' + target_team
    # create a bar plot 
    filter_df_second.plot.bar(x='season', y='icetime_min', title = plot_title, ylabel = 'icetime [minutes]')
    # return the filtered df 
    return filter_df_second
    
plot_icetime(ovechkin_clean, 'all', 'OTT')

Uh oh! Here's the problem! `get_team` for the input target team `OTT` returns an **empty dataframe**! 

This is because Ovechkin never played for Ottawa! The nested `get_team` filter works properly, but its result causes our `plot_icetime` function to break at the step where we get the player's full name out of the dataframe, `filter_df_second.name.unique()[0]`. The array of unique names is empty, so there is no first element to take! 
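
The failure can be reproduced in isolation. The sketch below uses a toy dataframe with a made-up player name, filters it down to zero rows, and then tries to take the first unique name.

```python
import pandas as pd

# toy dataframe with a made-up player
toy = pd.DataFrame({'name': ['Player A', 'Player A'], 'team': ['WSH', 'WSH']})

# filtering for a team that never appears leaves zero rows
empty = toy[toy['team'] == 'OTT']
print(len(empty))

try:
    # there is no element at position 0 in an empty array
    first_name = empty['name'].unique()[0]
except IndexError as error:
    first_name = None
    print('IndexError:', error)
```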

## Building more resilient functions 

The function in its current state fails given certain inputs. But we can make some simple changes to ensure that it doesn't fail, and if a problem arises we can be warned so we don't have to go hunting for the issue. 

We can use a combination of **conditionals** and **print statements** to build a more resilient function that won't break even if we feed it faulty inputs (within reason). 

We will add the following to `plot_icetime`: 

- a conditional to check if the output of `get_situation` is empty - if it is empty, print a statement that lets us know what happened and use `return` to exit the function early
- a conditional to check if the output of `get_team` is empty - if it is empty, print a statement that lets us know what happened and use `return` to exit the function early

In [None]:
def plot_icetime(input_dataframe, target_situation, target_team):
    # apply first filter - situation 
    filter_df_first = get_situation(input_dataframe, target_situation) 
    if len(filter_df_first) == 0:
        print('The situation "' + target_situation + '" does not exist in the input dataframe!')
        return 
    # apply second filter - team (input is output of first filter)
    filter_df_second = get_team(filter_df_first, target_team)
    if len(filter_df_second) == 0:
        print('The team "' + target_team + '" does not exist in the input dataframe!')
        return
    # get icetime in minutes 
    filter_df_second['icetime_min'] = filter_df_second['icetime']/60
    # create a title for the plot using the unique inputs (player name, situation, team)
    plot_title = str(filter_df_second.name.unique()[0]) + ' ' + target_situation + ' icetime on the ' + target_team
    # create a bar plot 
    filter_df_second.plot.bar(x='season', y='icetime_min', title = plot_title, ylabel = 'icetime [minutes]')
    # return the filtered df 
    return filter_df_second
    
plot_icetime(ovechkin_clean, 'all', 'OTT')

Now we have a resilient function that doesn't fail even if the user tries to query for something that doesn't exist, or makes a typo in the input arguments! 

In [None]:
plot_icetime(ovechkin_clean, 'one man on the ice', 'WSH')

In [None]:
plot_icetime(subban_clean, '4on5', 'Pittsburg Penguins')
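
Printing a message and returning nothing works well in a notebook. Another common pattern, shown here only as an alternative sketch and not as this lesson's approach, is to `raise` an exception so that the calling code can decide how to react. The helper `require_rows` and the toy data are hypothetical.

```python
import pandas as pd

def require_rows(input_dataframe, description):
    # hypothetical helper: raise an error if a filter left no rows
    if len(input_dataframe) == 0:
        raise ValueError('No rows found for ' + description)
    return input_dataframe

# toy dataframe with made-up rows
toy = pd.DataFrame({'team': ['WSH', 'WSH']})

try:
    require_rows(toy[toy['team'] == 'OTT'], 'team "OTT"')
    message = None
except ValueError as error:
    message = str(error)
    print(message)
```

The trade-off is that an unhandled exception stops the program, which is often what you want in a script but can be more disruptive than a printed message while exploring data.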

<hr style="border:2px solid gray"> </hr>

# Practice on your own 

In this practice section you will work with data in the file `goalies.csv`, which contains stats on NHL goalies in the 2020 season. A file like this is produced at the end of every season, so we want to create a set of functions to be able to repeat this analysis every year. 

#### Exercise 1. Read in `goalies.csv` and take a look at the contents

In [None]:
goalies = pd.read_csv('goalies.csv')
goalies.head()

#### Exercise 2. In general, it is pretty rare for a goalie to take a penalty. Create a function called `find_penalties` that isolates the goalies with non-zero values in the `penalties` column & returns their names, and save the result in a variable called `goalies_with_penalties`. Then, print the names of the goalies with penalties, and print a statement that tells us the percentage of total goalies that took penalties. 

    percentage = (number of goalies with penalties / total number of goalies) * 100. 

In [None]:
### BEGIN SOLUTION 

# create function to find goalies with penalties
def find_penalties(input_dataframe):
    penalties_exist = input_dataframe[input_dataframe['penalties'] > 0]
    goalie_names = penalties_exist.name.unique()
    return goalie_names

# run function 
goalies_with_penalties = find_penalties(goalies)

# print goalie names 
print(goalies_with_penalties)

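# print the percentage of goalies that took penalties 
# (count unique names, since each goalie can appear in more than one row) 
print(str(round((len(goalies_with_penalties)/goalies['name'].nunique()*100), 2)) + '% of goalies took penalties in 2020')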

### END SOLUTION 

#### Exercise 3. Create a function called `get_most_icetime` that finds and returns the 10 goalies with the most ice time when the situation is `all`. Feel free to re-use the `get_situation` function we used in the guided section above as it will also work on this dataframe. You can use [`.sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to sort the goalies by the values in column `icetime` (remember the default is to sort in ascending order). 

In [None]:
def get_most_icetime(input_dataframe):
    # use the function's own input, not the global `goalies` variable 
    goalies_situation_all = get_situation(input_dataframe, 'all')
    goalies_sorted = goalies_situation_all.sort_values('icetime', ascending = False)
    goalies_most_icetime = goalies_sorted.iloc[0:10].reset_index(drop = True)
    return goalies_most_icetime

get_most_icetime(goalies)
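
The sort-then-slice pattern can be tried on a small example first. The sketch below uses a toy dataframe with made-up values and also shows the pandas `nlargest` method as an alternative way to get the same rows.

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'icetime': [300, 900, 100, 600],
})

# sort from largest to smallest, keep the first two rows, renumber the index
top_sorted = toy.sort_values('icetime', ascending=False).head(2).reset_index(drop=True)

# alternative: nlargest picks the rows with the largest values directly
top_nlargest = toy.nlargest(2, 'icetime').reset_index(drop=True)
print(top_sorted)
```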

#### Exercise 4. Generalize the function you made in exercise 3 so it can find the goalies with the most ice time in any given `target_situation` (all, other, 5on5, ...). Make sure the `get_situation` function is properly nested in `get_most_icetime`. Use this function to find the 10 goalies with most ice time in `4on5` situations. 

In [None]:
### BEGIN SOLUTION 

def get_most_icetime(input_dataframe, target_situation):
    # use the function's own input, not the global `goalies` variable 
    goalies_situation = get_situation(input_dataframe, target_situation)
    goalies_sorted = goalies_situation.sort_values('icetime', ascending = False)
    goalies_most_icetime = goalies_sorted.iloc[0:10].reset_index(drop = True)
    return goalies_most_icetime

get_most_icetime(goalies, '4on5')

### END SOLUTION 

#### Exercise 5. You are given the following function & pickle file, but something is wrong. The function is supposed to calculate the sum of each goal type (`lowDangerGoals` + `mediumDangerGoals` + `highDangerGoals`) and test if there is a difference between this calculated sum and the given `goals` column. It should plot the deviations. If there are none (if our sum is a perfect match to the given goals column) then the plot will be a straight line at the zero value. 

#### Debug the function - identify why it is failing, fix it, and then run it with the correct result. 

In [None]:
# import the new goalies file 
goalies_pkl = pd.read_pickle('goalies.pkl')

# define function 
def plot_total_shots(input_dataframe): 
    goalies_situation = get_situation(input_dataframe, 'all')
    # resume function 
    goalies_situation['sum_of_goals'] = goalies_situation.lowDangerGoals + \
        goalies_situation.mediumDangerGoals + goalies_situation.highDangerGoals
    goalies_situation['deviation_in_goal_count'] = goalies_situation['sum_of_goals'] - goalies_situation['goals']
    goalies_situation.plot(x = 'goals', y = 'deviation_in_goal_count')
    return goalies_situation 

# run the function 
plot_total_shots(goalies_pkl)


In [None]:
### BEGIN SOLUTION 

# define function 
def plot_total_shots(input_dataframe): 
    goalies_situation = get_situation(input_dataframe, 'all')
    # debugging step: temporarily return the first step's output & its data types to inspect them 
    # return goalies_situation, goalies_situation.dtypes
    # this shows the goal columns are stored as objects, not float/int/or any number type 
    # ADD THIS - change the type of these columns to floats so pandas can treat them as numbers, not strings 
    goalies_situation = goalies_situation.astype({'lowDangerGoals': 'float',
                                                  'mediumDangerGoals': 'float', 
                                                  'highDangerGoals': 'float', 
                                                  'goals': 'float', })
    # resume 
    goalies_situation['sum_of_goals'] = goalies_situation.lowDangerGoals + \
        goalies_situation.mediumDangerGoals + goalies_situation.highDangerGoals
    goalies_situation['deviation_in_goal_count'] = goalies_situation['sum_of_goals'] - goalies_situation['goals']
    goalies_situation.plot(x = 'goals', y = 'deviation_in_goal_count')
    return goalies_situation 

# run the function 
plot_total_shots(goalies_pkl)

### END SOLUTION 
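
The root cause in Exercise 5 is that the goal columns are stored as text rather than numbers. A small sketch on made-up values shows why arithmetic misbehaves in that case and how a conversion fixes it. Here `pd.to_numeric` is used as an alternative to the `astype` call in the solution.

```python
import pandas as pd

# toy dataframe where the numbers are stored as text (made-up values)
toy = pd.DataFrame({
    'lowDangerGoals': ['1', '2'],
    'highDangerGoals': ['3', '4'],
})
print(toy.dtypes)

# with text columns, + glues the strings together instead of adding numbers
as_text = toy['lowDangerGoals'] + toy['highDangerGoals']
print(as_text.tolist())

# after converting to numbers, + performs ordinary addition
as_numbers = pd.to_numeric(toy['lowDangerGoals']) + pd.to_numeric(toy['highDangerGoals'])
print(as_numbers.tolist())
```

Checking `.dtypes` early is a cheap way to catch this kind of problem before it surfaces as a confusing error further down in a function.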