# MSDS 631 - Supplemental Notes
## String Formatting and Functions

### String formatting

#### Quick tutorial on string formatting
In case you haven't understood how to do string formatting, I've put a very short tutorial in the cells below. The use of curly braces is what allows Python to understand that you want to replace that place in the string with some sort of input value.

In [None]:
# Wherever there are curly braces, Python will expect you to pass it something in the arguments of `format` 
# The simplest example is a single pair of empty curly braces with one value passed
base_string = 'Hello, my name is {}.'
filled_in_string = base_string.format('Jason')
print(filled_in_string)

In [None]:
# You can also assign a string to a variable and fill it in that way
name = 'Hilary'
base_string = '{} has curly hair.'
filled_in_string = base_string.format(name)
print(filled_in_string)

In [None]:
# Now I can just loop through an iterator and do the same thing over and over again
names = ['Jason', 'Hilary', 'Jon', 'Danny']
for name in names:
    print('Hello {}'.format(name))

In [None]:
# You can also ask Python to fill in multiple blanks by adding more curly braces.
mom = 'Margaret'
dad = 'Vincent'
me = 'Jason'
base_string = "{}'s parents' names are {} and {}"
filled_in_string = base_string.format(me, mom, dad)
print(filled_in_string)

In [None]:
# Sometimes you want to reuse the same value several times without passing it more than once.
# You can do this by naming the placeholders. We call this "aliasing".
# The alias does NOT need to be the same name as whatever variable you pass to it 
little_sister = 'Jan'
big_sister = 'Marsha'
base_string = "\"Everything is about {big_sis}, {big_sis}, {big_sis}!!!\", exclaimed {lil_sis}."
filled_in_string = base_string.format(lil_sis=little_sister, big_sis=big_sister)
print(filled_in_string)

In [None]:
# Let's combine the previous two to make our code a little more flexible
families = [{'mom': 'Lana', 'dad': 'Sterling', 'child': 'Abbiejean'}, {'mom': 'Marge', 'dad': 'Homer', 'child': 'Maggie'}]
for family in families:
    base_string = "\"{dad}, can you please make {child} a bottle? {child} hasn't eaten yet!\", exclaimed {mom}."
    filled_in_string = base_string.format(child=family['child'], mom=family['mom'], dad=family['dad'])
    print(filled_in_string)

In [None]:
# You can also fill in strings with numbers.
age = 40
name = 'Jason'
base_string = "{} is turning {} next month!"
filled_in_string = base_string.format(name, age)
print(filled_in_string)

In [None]:
# One more example: filling in a string from the values stored in a dictionary
my_name = {'first': 'Jason', 'last': 'Shu'}
base_string = "My reservation is under the name {first} {last}"
filled_in_string = base_string.format(first=my_name['first'], last=my_name['last'])
print(filled_in_string)

In [None]:
# But what happens if someone has no last name?
# Let's use some control flow to help out
names = [{'first': 'Jason', 'last': 'Shu'}, {'first':'Madonna', 'last': None}]
base_string = "My reservation is under the name {first}{space}{last}."
for name in names:
    first = name['first']
    last = name['last']
    if last is None: # Use `is` (not `==`) when comparing against None
        space = ''
        last = ''
    else:
        space = ' '
    filled_in_string = base_string.format(first=first, space=space, last=last)
    print(filled_in_string)

Let's format some strings with numbers now!

In [None]:
pi = 3.14159265358979
base_string = 'Everybody knows that pi is {:.2f}'
filled_in_string = base_string.format(pi)
print(filled_in_string)

In the string formatting above, the colon inside the braces tells Python that you want to apply special formatting to the value you pass in. By putting `.2f` after the colon, you are asking for the float to be displayed rounded to two decimal places (the result is a string; the number itself is not changed). You can choose whatever level of precision you want by changing the number (e.g. `.6f` will give you six decimal places).
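As a quick illustration of how the precision number changes the output, here is a small sketch. The variable names (`value`, `two_places`, and so on) are placeholders chosen for this example only.

```python
# Illustrative sketch: the same float displayed at several precisions
value = 3.14159265358979  # placeholder value for this example

two_places = '{:.2f}'.format(value)
six_places = '{:.6f}'.format(value)
zero_places = '{:.0f}'.format(value)

print(two_places)   # 3.14
print(six_places)   # 3.141593
print(zero_places)  # 3

# The result of format is a string, and the original float is untouched
print(type(two_places))
print(value)
```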

In [None]:
# Note that you are able to do this **without** affecting the actual value of pi.
print(pi)
print('{:.2f}'.format(pi))
print(pi)

In [None]:
# You can still use aliases by putting the name BEFORE the colon
base_string = '{number} has a lot of decimal places, but you could round it to just {number:.2f}'
filled_in_string = base_string.format(number=pi)
print(filled_in_string)

In [None]:
# Let's use a different number for the same phrase.
e = 2.71828182845904523 # Another famous math constant
filled_in_string = base_string.format(number=e)
print(filled_in_string)

In [None]:
# Now let's combine a few of the concepts.
constants = {'e': 2.71828182845904523, 'pi': 3.14159265358979}
math_saying = 'My favorite math constant is "{name}". It has a value of {value}, but I can round it to {value:.2f}.'
names = constants.keys()
for name in names:
    formatted_saying = math_saying.format(name=name, value=constants[name])
    print(formatted_saying)

Now let's look at another type of numeric string formatting.

In [None]:
# Let's turn our score into a percentage without changing the value of our variable
points_earned = 27
points_possible = 33
score = points_earned / points_possible
base_string = 'I scored a {:.1%} on the quiz last week. More precisely, I scored a {:.3%}'
filled_in_string = base_string.format(score, score)
print(filled_in_string)

In the above example, the colon once again tells Python that you want to apply formatting to the number you are going to pass it. By putting `.1%` after the colon (while still inside the braces), you are asking Python to display your number as a percentage with a single decimal place. In doing so, Python will automatically multiply your number by 100 and print the rounded result with one decimal place, followed by a percent sign. By putting `.3%` instead, you are asking Python to do the same thing, but with three decimal places. Like the decimal formatting, you can do this with whatever level of precision you would like.

This is important because you do not have to multiply your values by 100 to get a more "printable" number. You get to keep the actual mathematical representation of the score (the ratio of `points_earned` to `points_possible`).
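To make the "multiply by 100" step concrete, the sketch below compares the `%` format with doing the multiplication by hand. The variable names are placeholders for this example; the ratio reuses the 27 out of 33 score from above.

```python
# Illustrative sketch: percentage formatting vs. multiplying by 100 yourself
ratio = 27 / 33  # same ratio as the quiz score above

as_percent_1 = '{:.1%}'.format(ratio)       # Python multiplies by 100 and adds the % sign
as_percent_0 = '{:.0%}'.format(ratio)       # no decimal places
manual = '{:.1f}%'.format(ratio * 100)      # the by-hand equivalent of .1%

print(as_percent_1)  # 81.8%
print(as_percent_0)  # 82%
print(manual)        # 81.8%
```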

In [None]:
print(score)
print('{:.3%}'.format(score))
print(score)

There are **many** more ways to format strings. This is not critical to understand. Just memorize how to use the examples above and you will be good for the rest of this course!

### Functions
One of the challenges with the way we've been coding is that our solutions have been very specific to a single instance and require executing many lines of code to perform the same analysis. Our goal with functions is to "generalize" solutions so that we can "call" a single line of code to yield the same result.

Here's a simple example of a function:

In [None]:
def say_hello_to_me(my_name):
    phrase = 'Hello {}'.format(my_name)
    print(phrase)

Note that it doesn't do anything when I execute it. I have to "call" it in order for it to work. In calling it, I have to "pass" it an argument. The argument is what is contained inside the parentheses of the function.

In [None]:
say_hello_to_me('Jason')

In [None]:
name = 'Hilary'
say_hello_to_me(name)

Let's try another function.

In [None]:
def how_much_change(total_bill, tip_pct, cash_given):
    total_with_tip = total_bill * (1 + tip_pct)
    change = cash_given - total_with_tip
    return change

In [None]:
total_bill = 34.85
tip_pct = .2
cash_given = 50
change = how_much_change(total_bill, tip_pct, cash_given)
print(change)

Let's take a look at a few features of the function above.

First, a function definition starts with the keyword **`def`** followed by the name of the function. After this are the arguments, which are enclosed in parentheses. In this case, there are three different arguments. This means that we will pass the function three values that it will use to perform whatever computations it needs to make (NOTE: you do not have to pass variables with the same names as the arguments. You only need to make sure that each value represents what the argument is intended to represent and is of the expected data type). Inside of the function is the logic that will get run every time the function is called. Finally, a value is "returned", which can be assigned to a variable (or used immediately). You don't have to return a value (as the "Hello" example above shows).
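Here is a minimal sketch of two of the points above: the names of the variables you pass do not need to match the argument names, and a function without a `return` statement gives back `None`. The functions `add_tax` and `announce_total` and all of the values are hypothetical, made up for this illustration.

```python
def add_tax(price, tax_rate):
    # Returns a value that the caller can assign to a variable
    return price * (1 + tax_rate)

def announce_total(total):
    # Only prints; there is no return statement, so calling it gives back None
    print('Your total is {:.2f}'.format(total))

# The variable names here differ from the argument names, and that is fine
sticker_price = 20.0
sales_tax = 0.1
total = add_tax(sticker_price, sales_tax)

result = announce_total(total)
print(result)  # None
```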

#### So why bother with functions?

You can absolutely write code in Python without ever having to write functions. In fact, most data scientists don't even bother writing functions early in their careers. There are three primary reasons why you really want to write functions:
1. It will often help prevent you from re-writing a bunch of code where you have to create many unnecessary intermediate variables
2. It will help you organize your thoughts and your code so that you can break down problems into manageable chunks that are more solvable
3. It will help you test and debug your code much much more easily

As your problems get more difficult, you will start having complex logic using nested for-loops within multiple if-else statements. If you have ALL of your code in a giant script, the likelihood of making a mistake will increase exponentially. If you are lucky, your mistake will result in a syntax error and it won't run; if you are unlucky, your code will still run but it will return the wrong answer to you. The moment you start realizing that you are trying to "eat the elephant", you should take a step back and ask yourself if you can break apart your problem into smaller sub-problems (that are assembled later to answer your original problem).

Think about Susie's grocery bill from Homework 1. If you tried to do everything in one big blob, it is easy to make mistakes and you can't tell where things went wrong. By breaking down your code into chunks where you are "telling a story" by calling your functions, you will be on a much better path to success.

The previous examples were pretty simple. Let's add a little more logic now.

Given a list of itemized bill amounts, the quality of service that was provided, and the amount tendered, I want to compute the amount of change that will be required. We've obviously not accounted for the situation where the amount of cash given does not cover the total bill, but let's ignore that for now.

In [None]:
def how_much_change_new(itemized_bill_amounts, how_was_service, cash_given):
    total_bill = sum(itemized_bill_amounts)
    if how_was_service == 'amazing':
        tip_amount = .25 * total_bill
    elif how_was_service == 'good':
        tip_amount = .2 * total_bill
    elif how_was_service == 'okay':
        tip_amount = .15 * total_bill
    else:
        tip_amount = .1 * total_bill
    total_with_tip = total_bill + tip_amount
    change = cash_given - total_with_tip
    return change

In [None]:
line_items = [1.23, 2.34, 6.78, 9.87, 6.54, 10.98]
service_level = 'amazing'
cash_given = 100
change = how_much_change_new(line_items, service_level, cash_given)
print(change)

Ok. So now that we understand the bare-bones basics, let's try to put together a series of functions that we can call to perform a prime factorization of a number. Recall that the prime factorization of a number is the list of prime numbers that multiply together to give that number. For example, the prime factorization of 8 is [2,2,2]. The prime factorization of 42 is [2,3,7].

This is a fairly complex task. If you try to "eat the whole elephant", it may be very intimidating and you won't know where to start. This is how you wind up leaving whole problems blank. By starting with a process where you think about how one actually can do a prime factorization, we can write the steps down in a human-readable manner that makes sense to us (while still somewhat representing what we will do with our code). This is called pseudocode.

**Prime Factorization Pseudocode**
- Start by generating a list of prime numbers that goes at least as high as the number you are evaluating
- Check to see if the number is a prime number itself
 - If it is a prime, there is no need to factorize (its only prime factor is itself)
- If it is not a prime, we start by checking whether the number is divisible by the smallest prime number
- If it is, add the prime to the factors and compute the quotient
- Continue dividing by the prime number (and adding it to our factors each time) until the quotient is no longer divisible by it; keep track of all of the times that the prime goes into the number
- Check to see if the remaining quotient is either 1 or a prime number itself
- If the remaining quotient is not 1 or a prime, then continue the process with the next prime until the condition in the previous step is satisfied
 - If the remaining quotient is a prime (not 1), then add it to the factors list

After thinking about the pseudocode, I've decided I am going to build a few helper functions that do the following:
- Check to see if a number is divisible by another number
- Check to see how many times a number is divisible by another number and compute the remaining quotient
- Check to see if a number is equal to 1 or is a prime number

To complete this exercise I will use an algorithm that generates all of the prime numbers up to a given maximum value. I don't want anyone getting bogged down in the details of how it works, so just ignore its contents for now. We will review the algorithm another day.

In [None]:
# YOU CAN IGNORE ME UNLESS YOU ARE REALLY CURIOUS
def generate_all_primes(max_value):
    primes = [2,3]
    next_value = max(primes) + 2
    while next_value <= max_value:
        valid = True
        for prime in primes:
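            if prime ** 2 > next_value:
                break # No prime up to the square root divides next_value, so it is prime
            elif next_value % prime == 0:
                valid = False # Found a divisor, so next_value is not prime
                break
        if valid:
            primes.append(next_value)
        next_value += 2 # After 2, only odd numbers can be prime
    return [prime for prime in primes if prime <= max_value] # Drop the starting primes if max_value is below 3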

In [None]:
generate_all_primes(20)

It works!

Now let's start the REAL work!

In [None]:
# First function we need for our factorization algorithm
def check_if_divisible_by(number, divisor):
    # The comparison itself evaluates to True or False, so we can return it directly
    return number % divisor == 0

In [None]:
# Let's see if it works
print('Is 4 divisible by 2?', check_if_divisible_by(4,2))
print('Is 5 divisible by 2?', check_if_divisible_by(5,2))

Perfect!!

Now let's move on to the next function.

In [None]:
# Next function we need for our algorithm
def compute_how_many_times_divisible(number, divisor):
    times = 0
    keep_checking = True
    quotient = number
    while keep_checking:
        if quotient % divisor == 0:
            quotient //= divisor # Floor-divide by the divisor and reassign, which keeps the quotient an exact integer
            times += 1 # Add 1 to the number of times the divisor goes into the original number
        else:
            keep_checking = False # Stop dividing once the quotient is no longer divisible
    factors = [divisor] * times # Create a list of all of the times that a divisor goes into a number
    return factors, quotient # It is less common, but you are allowed to return more than one value

In [None]:
# Let's see if it works
factors, quotient = compute_how_many_times_divisible(8, 2)
print('How does 2 go into 8? ', factors, ' with {} remaining'.format(quotient))
factors, quotient = compute_how_many_times_divisible(10, 2)
print('How does 2 go into 10? ', factors, ' with {} remaining'.format(quotient))
factors, quotient = compute_how_many_times_divisible(18, 3)
print('How does 3 go into 18? ', factors, ' with {} remaining'.format(quotient))

Looks good to me!

Let's write our last function now.

In [None]:
# Last piece we need to finish our algorithm
def check_if_should_continue_searching(quotient, primes):
    if (quotient == 1) or (quotient in primes):
        continue_searching = False
    else:
        continue_searching = True
    return continue_searching

In [None]:
primes = generate_all_primes(100) # This will be made available in our function - this is just for testing purposes
quotient = 5.0
continue_searching = check_if_should_continue_searching(quotient, primes)
print('If the quotient is {} then continue_searching should be {}'.format(quotient, continue_searching))

Alright! I think we have all of the pieces to try to do a prime factorization! Since generating the list of primes takes longer as the number grows, we'll make sure that any number we want to factorize is under 1,000,000.

In [None]:
def compute_prime_factorization(number):
    primes = generate_all_primes(number) #Generate primes up to our number of interest
    if number in primes: # Check to see if there is even anything to factorize (i.e. is it already prime)
        return [number]
    else:
        all_factors = [] # Initialize a list to hold all of our prime factors
        quotient = number # Create modifiable number, starting with the original number
        for prime in primes: # Iterate through all of the primes
            valid = check_if_divisible_by(quotient, prime)
            if valid:
                temp_factors, quotient = compute_how_many_times_divisible(quotient, prime) # Get all instances of factor
                all_factors += temp_factors # Add factors to our list
                keep_looking = check_if_should_continue_searching(quotient, primes) # Check to see if we're done
                if not keep_looking: # I can negate a boolean by putting a not in front. Thus, not True == False
                    if quotient != 1:
                        all_factors.append(quotient) # Add final prime factor, if needed
                    break # exit out of the for loop because we are done!
        return all_factors

In [None]:
# Computing primes can take a little while, so let's try to keep it under 1,000,000 :)
number = 355510
compute_prime_factorization(number)

The function above may seem complicated, but it would be WAY harder to understand if we didn't have functions to help us make the code more readable.
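One habit worth building is checking a result independently. The sketch below is a compact trial-division factorization that does not use the helper functions above, so it can serve as a cross-check. The name `simple_prime_factors` is hypothetical, and `math.prod` assumes Python 3.8 or newer.

```python
import math

def simple_prime_factors(number):
    # Trial division: try each divisor in turn and divide it out as many times as it goes in
    factors = []
    divisor = 2
    while divisor * divisor <= number:
        while number % divisor == 0:
            factors.append(divisor)
            number //= divisor
        divisor += 1
    if number > 1:
        factors.append(number) # Whatever is left over is itself prime
    return factors

factors = simple_prime_factors(355510)
print(factors)                         # [2, 5, 73, 487]
print(math.prod(factors) == 355510)    # Multiplying the factors back together recovers the number
```

If this list matches what `compute_prime_factorization` returns for the same number, that is decent evidence that both are working.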

I'll conclude this review by applying functions to two of the problems we've looked at before in your homework and quiz. From the quiz, we were asked to compute the number of cows we would need for both the high estimate and the low estimate. The code looked something like this:

If you read closely, you'll see that a lot of that code is repeated. Computer programming is useful for its ability to perform the same computations over and over again very easily and quickly. In the solution above, the only real difference (besides the variable names) was the estimate error buffer being added in one case and subtracted in the next. Writing functions allows us to avoid rewriting a lot of code and makes our code a lot more readable.

Consider the following alternative:

In [None]:
def compute_cows_needed(estimated_dishes_ordered, pounds_beef_per_dish, pounds_beef_per_cow, error_buffer):
    num_dishes_total = estimated_dishes_ordered * (1 + error_buffer)
    pounds_beef_needed = pounds_beef_per_dish * num_dishes_total
    whole_cows_needed = pounds_beef_needed // pounds_beef_per_cow
    additional_pounds_needed = pounds_beef_needed % pounds_beef_per_cow
    if additional_pounds_needed > 0:
        whole_cows_needed += 1
    return whole_cows_needed

With this function-based approach, we can quickly compute both the high and low estimates: pass a positive `error_buffer` for the high estimate and the same value as a negative for the low estimate. Now my code is much easier to read and I don't have to keep writing the same basic code over and over again. I can also quickly change the input variable values and only re-run a single line of code. As your code gets more and more complex, this will save you a considerable amount of time and reduce the likelihood of error.
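As a sketch of what those two calls might look like, the block below repeats the function so that it runs on its own. All of the input values are made up for illustration and are not the numbers from the quiz.

```python
def compute_cows_needed(estimated_dishes_ordered, pounds_beef_per_dish, pounds_beef_per_cow, error_buffer):
    num_dishes_total = estimated_dishes_ordered * (1 + error_buffer)
    pounds_beef_needed = pounds_beef_per_dish * num_dishes_total
    whole_cows_needed = pounds_beef_needed // pounds_beef_per_cow
    additional_pounds_needed = pounds_beef_needed % pounds_beef_per_cow
    if additional_pounds_needed > 0:
        whole_cows_needed += 1 # A partial cow still means buying one more whole cow
    return whole_cows_needed

# Hypothetical inputs, chosen only to demonstrate the two calls
estimated_dishes = 1000
pounds_per_dish = 0.5
pounds_per_cow = 440
buffer = 0.25

high_estimate = compute_cows_needed(estimated_dishes, pounds_per_dish, pounds_per_cow, buffer)
low_estimate = compute_cows_needed(estimated_dishes, pounds_per_dish, pounds_per_cow, -buffer)
print('High estimate: {:.0f} cows'.format(high_estimate))
print('Low estimate: {:.0f} cows'.format(low_estimate))
```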

Now that you've seen a use case you are familiar with, let's take a few steps back and look at some simpler examples in action.

Let's modify the bus example from Homework 1 a little to see what the cost per student would be. If we assume that a 72-person bus costs \$5,000 and a 40-person bus costs \$3,000, which would be the more cost-efficient option for transporting 960 students?

In [212]:
# The inefficient way of computing our desired result
number_of_students = 960

# 72 person capacity cost
cost_per_big_bus = 5000
capacity_per_bus_big = 72
number_big_buses_full = number_of_students // capacity_per_bus_big
remaining_students_big_buses = number_of_students % capacity_per_bus_big
if remaining_students_big_buses > 0:
    number_big_buses = number_big_buses_full + 1
else:
    number_big_buses = number_big_buses_full
total_cost_big = number_big_buses * cost_per_big_bus
cost_per_student_big = total_cost_big / number_of_students

# 40 person capacity cost
cost_per_small_bus = 3000
capacity_per_bus_small = 40
number_small_buses_full = number_of_students // capacity_per_bus_small
remaining_students_small_buses = number_of_students % capacity_per_bus_small
if remaining_students_small_buses > 0:
    number_small_buses = number_small_buses_full + 1
else:
    number_small_buses = number_small_buses_full
total_cost_small = number_small_buses * cost_per_small_bus
cost_per_student_small = total_cost_small / number_of_students

per_student_diff = cost_per_student_big - cost_per_student_small
print('Big bus cost per student: ', cost_per_student_big)
print('Small bus cost per student: ', cost_per_student_small)
print('Difference: ', per_student_diff)

Big bus cost per student:  72.91666666666667
Small bus cost per student:  75.0
Difference:  -2.0833333333333286


We found that the large buses cost \$2.08 less per student than the small buses. However, it took a lot of code with a lot of variables (more opportunity for making mistakes). Functions to the rescue!

In [213]:
def compute_cost_per_student(number_of_students, capacity, cost_per_bus):
    full_buses = number_of_students // capacity
    extra_students = number_of_students % capacity
    if extra_students > 0:
        total_buses = full_buses + 1
    else:
        total_buses = full_buses
    total_cost = total_buses * cost_per_bus
    cost_per_student = total_cost / number_of_students
    return cost_per_student

In [214]:
# Now that we've got our handy function, I can compute the cost for our two options with two lines of code
cost_per_student_big = compute_cost_per_student(number_of_students, capacity_per_bus_big, cost_per_big_bus)
cost_per_student_small = compute_cost_per_student(number_of_students, capacity_per_bus_small, cost_per_small_bus)
per_student_diff = cost_per_student_big - cost_per_student_small
print('Big bus cost per student: ', cost_per_student_big)
print('Small bus cost per student: ', cost_per_student_small)
print('Difference: ', per_student_diff)

Big bus cost per student:  72.91666666666667
Small bus cost per student:  75.0
Difference:  -2.0833333333333286


Same answer!

Now if we wanted to evaluate a third option, we barely have to write any new code at all!

In [215]:
cost_per_bus_medium = 3700
capacity_per_bus_medium = 50
cost_per_student_medium = compute_cost_per_student(number_of_students, capacity_per_bus_medium, cost_per_bus_medium)
print('Medium bus cost per student: ', cost_per_student_medium)

Medium bus cost per student:  77.08333333333333


This is the most expensive option... and it only took one line of code to find that out!
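Taking this one step further, we could store every option in a dictionary and loop over them, combining the function with the loops and dictionaries from earlier. This is only a sketch: the `bus_options` dictionary and its keys are names invented for this example (the capacities and costs are the ones used above), and `min` with `key=costs.get` is just one way to pick the cheapest option.

```python
def compute_cost_per_student(number_of_students, capacity, cost_per_bus):
    full_buses = number_of_students // capacity
    extra_students = number_of_students % capacity
    if extra_students > 0:
        total_buses = full_buses + 1 # Leftover students need one more bus
    else:
        total_buses = full_buses
    total_cost = total_buses * cost_per_bus
    return total_cost / number_of_students

number_of_students = 960
bus_options = {
    'big': {'capacity': 72, 'cost': 5000},
    'small': {'capacity': 40, 'cost': 3000},
    'medium': {'capacity': 50, 'cost': 3700},
}

costs = {}
for name in bus_options:
    option = bus_options[name]
    costs[name] = compute_cost_per_student(number_of_students, option['capacity'], option['cost'])
    print('{} bus cost per student: {:.2f}'.format(name, costs[name]))

cheapest = min(costs, key=costs.get) # The option name with the lowest cost per student
print('Cheapest option:', cheapest)
```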