## Hypothetical

I want to take the mean squared error of a series of 5 numbers, measured against their own mean: 5, 10, 15, 20, and 25.

1. Take the mean of the numbers.
2. For each number x, compute (x - mean)<sup>2</sup>.
3. Sum these squared differences.
4. Divide by the number of numbers.

Let's do this in Python:

In [None]:
import numpy as np
mean_ex = np.mean([5,10,15,20,25])
se1 = (5 - mean_ex)**2
se2 = (10 - mean_ex)**2
se3 = (15 - mean_ex)**2
se4 = (20 - mean_ex)**2
se5 = (25 - mean_ex)**2

MSE = (se1 + se2 + se3 + se4 + se5) / 5
MSE

There has to be a better way to do this! What if we had 60 numbers? One way would be to use array syntax, which you can see below.

But what if we wanted to also use different "means" based on different groupings of our samples? How can we tell Python how to process the numbers?

In [None]:
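samples = np.array([5,10,15,20,25])  # a NumPy array supports element-wise arithmetic
mean_ex = np.mean(samples)
(samples - mean_ex)**2  # squared difference of each sample from the mean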
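
The array expression above gives the squared differences but stops before steps 3 and 4. A minimal sketch of the full calculation with array syntax follows; the names `squared_errors` and `mse` are introduced here for illustration and are not part of the original cells.

```python
import numpy as np

# Convert the list to a NumPy array so arithmetic applies element-wise.
samples = np.array([5, 10, 15, 20, 25])
mean_ex = np.mean(samples)

# Squared difference of every sample from the mean, in one expression.
squared_errors = (samples - mean_ex) ** 2

# Averaging the squared differences covers steps 3 and 4 together.
mse = np.mean(squared_errors)
print(mse)  # 50.0
```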

## Functions in Python

* Meant to help keep us sane
    * it's difficult to process a long sequence of tasks or a repetitive task
* Systematic application of a set of rules to some input 
* _Encapsulation_ means that once we write a function, we can treat it just as one line/one unit

We use **def** to signal to `Python` that we are starting a function.

In [9]:
def writegreeting():
    print("Hello")

***Three Ingredients***: `def`, parentheses, and a colon (plus a `Python`ic indent for the body). You need to run the cell that defines a function before you can call it.

In [10]:
writegreeting()

Hello


_Arguments_ are used as inputs to a function, and are specified in parentheses.

In [11]:
def meanofnumbers(number_vector):
    sumofnumbers = 0
    for number in number_vector:
        sumofnumbers += number
        
    print(sumofnumbers / len(number_vector))
    
meanofnumbers([1,2,3,4])

2.5


The **return** value of a function specifies what the function should give back when you run it. You can assign the return value of a function to a variable.

In [13]:
def meanofnumbers(number_vector):
    sumofnumbers = 0
    for number in number_vector:
        sumofnumbers += number
        
    return (sumofnumbers / len(number_vector))

print(meanofnumbers([1,2,3,4]))
mean_of_numbers = meanofnumbers([1,2,3,4])
print(mean_of_numbers)

2.5
2.5
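
Returning to the opening hypothetical, a function with arguments and a return value is one way to tell Python how to process the numbers. The sketch below is an illustration rather than part of the original lesson: the name `mean_squared_error` and its optional `center` argument are assumptions, chosen so that a different "mean" (for example, a group mean) can be passed in.

```python
def mean_squared_error(number_vector, center=None):
    """Average squared difference between each number and `center`.

    If `center` is not given, the mean of `number_vector` is used.
    """
    if center is None:
        center = sum(number_vector) / len(number_vector)

    sum_of_squares = 0
    for number in number_vector:
        sum_of_squares += (number - center) ** 2

    return sum_of_squares / len(number_vector)


print(mean_squared_error([5, 10, 15, 20, 25]))             # 50.0
print(mean_squared_error([5, 10, 15, 20, 25], center=10))  # 75.0
```

Because the reference value is an argument, the same function works for 5 numbers or 60, and for any grouping's mean.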


Functions can have multiple return values.

In [15]:
def meanofnumbers(number_vector):
    sumofnumbers = 0
    for number in number_vector:
        sumofnumbers += number
        
    return (sumofnumbers / len(number_vector)), len(number_vector)

mean_of_numbers, len_of_numbers = meanofnumbers([1,2,3,4])
print(len_of_numbers)

4
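
Under the hood, a comma-separated `return` hands back a single tuple, which the assignment then unpacks. A short sketch, using the hypothetical name `mean_and_count` so it runs on its own:

```python
def mean_and_count(number_vector):
    # Hypothetical helper: returns two values, the mean and the count.
    return sum(number_vector) / len(number_vector), len(number_vector)


result = mean_and_count([1, 2, 3, 4])
print(type(result))  # <class 'tuple'>
print(result)        # (2.5, 4)

# Unpacking assigns each element of the tuple to its own variable.
mean_value, count = result
print(mean_value, count)  # 2.5 4
```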


**Encapsulate an If-Statement/Print Block**

Sometimes, we use **conditionals** to test what category our data falls into. For example, I might want to look at people's ages and classify them based on their generation, loosely:

- Under 25: Generation Z
- 25 to 39: Millennials (Generation Y)
- 40 to 55: Generation X
- 56 to 74: Baby Boomer
- 75 to 91: Silent Generation
- 92+: Greatest Generation

How can we check this systematically? We have a sample of 20 people and we want to associate each person's name with their generation.

### Zoom Poll: What is the shortest way to do this?

* Use a `for` loop and check what category each falls into - **Check Mark** ✅
* Write an `if` statement and just copy it however many times you need - **Red X** ❌
* Use an array of the people's ages and check the generation using a function - **Go Slower** ⏪
* Manually assign the labels to people - **Go Faster** ⏩

### Zoom Poll: What is the fastest way to do this?

* Use a `for` loop and check what category each falls into - **Check Mark** ✅
* Write an `if` statement and just copy it however many times you need - **Red X** ❌
* Use an array of the people's ages and check the generation using a function - **Go Slower** ⏪
* Manually assign the labels to people - **Go Faster** ⏩

In [None]:
import timeit

## Option 1: Using a `for` loop
def solution1():
    ## this is where our code starts
    x = list(range(0,81,20)) ## here is where we would put our list of ages - this is just a sample
    names = ["Your Name Here"] * len(x) ## same for the names!
    
    generations = []
    for curr_age in x:
        if curr_age < 25:
            generations.append("Generation Z")
        elif curr_age < 40:
            generations.append("Generation Y")
        elif curr_age < 56:
            generations.append("Generation X")
        elif curr_age < 75:
            generations.append("Baby Boomer")
        elif curr_age < 92:
            generations.append("Silent Generation")
        else:
            generations.append("Greatest Generation")
            
    return(generations)
            
print(solution1())
timeit.timeit(solution1,number=100)

In [None]:
## Option 2: Using a repeated "if" statement
def solution2():
    ## this is where our code starts
    x = list(range(0,81,20)) ## here is where we would put our list of ages - this is just a sample (five ages, one per copied block)
    names = ["Your Name Here"] * len(x) ## same for the names!
    
    generations = []
    
    curr_age = x[0]
    if curr_age < 25:
        generations.append("Generation Z")
    elif curr_age < 40:
        generations.append("Generation Y")
    elif curr_age < 56:
        generations.append("Generation X")
    elif curr_age < 75:
        generations.append("Baby Boomer")
    elif curr_age < 92:
        generations.append("Silent Generation")
    else:
        generations.append("Greatest Generation")
    
    curr_age = x[1]
    if curr_age < 25:
        generations.append("Generation Z")
    elif curr_age < 40:
        generations.append("Generation Y")
    elif curr_age < 56:
        generations.append("Generation X")
    elif curr_age < 75:
        generations.append("Baby Boomer")
    elif curr_age < 92:
        generations.append("Silent Generation")
    else:
        generations.append("Greatest Generation")
        
    curr_age = x[2]
    if curr_age < 25:
        generations.append("Generation Z")
    elif curr_age < 40:
        generations.append("Generation Y")
    elif curr_age < 56:
        generations.append("Generation X")
    elif curr_age < 75:
        generations.append("Baby Boomer")
    elif curr_age < 92:
        generations.append("Silent Generation")
    else:
        generations.append("Greatest Generation")
        
        
    curr_age = x[3]
    if curr_age < 25:
        generations.append("Generation Z")
    elif curr_age < 40:
        generations.append("Generation Y")
    elif curr_age < 56:
        generations.append("Generation X")
    elif curr_age < 75:
        generations.append("Baby Boomer")
    elif curr_age < 92:
        generations.append("Silent Generation")
    else:
        generations.append("Greatest Generation")
      
    curr_age = x[4]
    if curr_age < 25:
        generations.append("Generation Z")
    elif curr_age < 40:
        generations.append("Generation Y")
    elif curr_age < 56:
        generations.append("Generation X")
    elif curr_age < 75:
        generations.append("Baby Boomer")
    elif curr_age < 92:
        generations.append("Silent Generation")
    else:
        generations.append("Greatest Generation")
            
    return(generations)
            
print(solution2())
timeit.timeit(solution2,number=100)

In [None]:
def generation_checker(curr_age):    
    if curr_age < 25:
        return("Generation Z")
    elif curr_age < 40:
        return("Generation Y")
    elif curr_age < 56:
        return("Generation X")
    elif curr_age < 75:
        return("Baby Boomer")
    elif curr_age < 92:
        return("Silent Generation")
    else:
        return("Greatest Generation")

## Option 3: Using a separated, encapsulated function
def solution3():
    ## this is where our code starts
    x = list(range(0,90,1)) ## here is where we would put our list of ages - this is just a sample
    names = ["Your Name Here"] * len(x) ## same for the names!
    
    generations = []
    for curr_age in x:
        generations.append(generation_checker(curr_age))
    
    return(generations)

## Option 4: Using a separated, encapsulated function and list comprehension
def solution4():
    ## this is where our code starts
    x = list(range(0,90,1)) ## here is where we would put our list of ages - this is just a sample
    names = ["Your Name Here"] * len(x) ## same for the names!
    
    generations = [generation_checker(curr_age) for curr_age in x]
    return(generations)
            
print(solution3() == solution4())
print("The timing of the 3rd solution is:", timeit.timeit(solution3,number=100))
print("The timing of the 4th solution is:", timeit.timeit(solution4,number=100))
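
The solutions above build the list of generations but never use `names`. One possible way to associate each name with a generation is to pair the two lists with `zip`. This is a sketch under assumptions: the names and ages are made up, `generation_by_name` is a hypothetical variable, and the checker is repeated only so the block runs by itself.

```python
def generation_checker(curr_age):
    # Same thresholds as the checker above, repeated so this runs alone.
    if curr_age < 25:
        return "Generation Z"
    elif curr_age < 40:
        return "Generation Y"
    elif curr_age < 56:
        return "Generation X"
    elif curr_age < 75:
        return "Baby Boomer"
    elif curr_age < 92:
        return "Silent Generation"
    else:
        return "Greatest Generation"


# Made-up sample data, for illustration only.
names = ["Ana", "Ben", "Chen"]
ages = [19, 47, 80]

# zip pairs each name with the age at the same position.
generation_by_name = {name: generation_checker(age) for name, age in zip(names, ages)}
print(generation_by_name)
# {'Ana': 'Generation Z', 'Ben': 'Generation X', 'Chen': 'Silent Generation'}
```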

Why did we order the `if` statement that way?

What would happen if we checked if the age was less than 40 first?
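
One way to explore the ordering question is to try it. The sketch below uses a deliberately misordered, hypothetical function (`misordered_checker` is not part of the lesson) that tests `< 40` before `< 25`.

```python
def misordered_checker(curr_age):
    # Hypothetical variant: the "< 40" test comes before the "< 25" test.
    if curr_age < 40:
        return "Generation Y"
    elif curr_age < 25:
        # Never reached: any age under 25 is also under 40.
        return "Generation Z"
    else:
        return "Generation X or older"


print(misordered_checker(10))  # Generation Y
print(misordered_checker(30))  # Generation Y
```

Since an `if`/`elif` chain stops at the first condition that is true, the narrowest threshold has to be tested first.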

**Your Turn**: "Encapsulating Data Analysis" activity, in breakout rooms.

In [17]:
import pandas as pd

df = pd.read_csv("../python-novice-gapminder/data/gapminder_gdp_asia.csv",index_col=0)
japan = df.loc["Japan"]

In [23]:
year = 1983
gdp_decade = 'gdpPercap_' + str(year // 10)
print(gdp_decade)
avg = (japan.loc[gdp_decade + "2"] + japan.loc[gdp_decade + "7"]) / 2

gdpPercap_198


In [20]:
#egg sizing machinery prints a label
def checkeggsize(mass):
    if(mass>=90):
        return "warning: egg might be dirty"
    elif(mass>=85):
        return "jumbo"
    elif(mass>=70):
        return "large"
    elif(mass<70 and mass>=55):
        return "medium"
    elif(mass<50):
        return "too light, probably spoiled"
    else:
        return "small"

In [22]:
checkeggsize(90)

