# MSDS 631 - Lecture 5 (February 20, 2019)

## Refresher about functions, variables, data structures
There are several key concepts that we need to fully internalize before moving forward.

##### Fundamentals of functions

There is a lot of confusion about how to use functions: calling functions from other functions, assuming that global variables are available, and knowing how and what to return.

First, let's talk about the calling of functions from other functions.

In [38]:
#Let's start by loading the students data.
import json
with open('students.json', 'r') as f: #The file is closed automatically when the block ends
    students = json.load(f)

In [39]:
students[0]

{'class': 'Junior',
 'first': 'Janis',
 'gender': 'Female',
 'gpa': 3.12,
 'last': 'Brown',
 'major': 'Economics',
 'student_id': '5a397209-3782-4764-a285-10fae807ee71'}

Let's say we want to compute the average GPA for a specific value of a specific attribute. For example, what is the average GPA of students whose "gender" attribute is "Female"?

In [43]:
#Average GPA for females
gpas = []
for student in students:
    if student['gender'] == 'Female':
        gpa = student['gpa']
        gpas.append(gpa)
avg_gpa = sum(gpas) / len(gpas)
avg_gpa

3.430955090428767

Let's generalize this and build a series of functions to find the highest average GPA within a specific attribute (e.g., finding the gender with the highest average GPA, or the major with the highest average GPA).

In [44]:
def get_avg_gpa(attribute, value):
    gpas = []
    for student in students:
        if student[attribute] == value:
            gpa = student['gpa']
            gpas.append(gpa)
    avg_gpa = sum(gpas) / len(gpas)
    rounded_gpa = round(avg_gpa, 4)
    return rounded_gpa

In [45]:
#Let's see if we get the same answer for females
get_avg_gpa('gender', 'Female')

3.431

Now let's find the highest average GPA amongst all of the possible values of an attribute.

In [50]:
#Write a function to find all of the possible values of an attribute
def get_all_possible_values(attribute):
    values = []
    for student in students:
        value = student[attribute]
        values.append(value)
    possible_values = set(values)
    return possible_values

In [51]:
#Let's see what we get for majors
get_all_possible_values('major') #Keys are case sensitive, so be careful!!!

{'Chemistry', 'Economics', 'Engineering', 'Finance', 'Math', 'Physics'}

In [52]:
#Write a function to compute the average GPA for each value of an attribute
def compute_avg_gpa_given_value(attribute):
    all_avg_gpas = {}
    possible_values = get_all_possible_values(attribute)
    for value in possible_values:
        avg_gpa = get_avg_gpa(attribute, value)
        all_avg_gpas[value] = avg_gpa
    return all_avg_gpas

In [55]:
#Let's see what we get for genders
all_avg_gpas = compute_avg_gpa_given_value('gender')

In [56]:
#Write a function to find the key with the largest value within a dictionary
def find_key_with_largest_value(dictionary_of_gpas):
    highest_avg_gpa = float('-inf') #Start below any possible GPA
    value_w_highest_avg_gpa = None
    possible_values = dictionary_of_gpas.keys()

    for value in possible_values:
        gpa = dictionary_of_gpas[value]
        if gpa > highest_avg_gpa:
            highest_avg_gpa = gpa
            value_w_highest_avg_gpa = value
    return value_w_highest_avg_gpa, highest_avg_gpa


In [58]:
#Let's see what we get for the previously generated dictionary of gender GPAs
find_key_with_largest_value(all_avg_gpas)

('Female', 3.431)
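The same search can also be written with Python's built-in `max`. The following is a minimal sketch, not a replacement for the loop above: the function name `find_key_with_largest_value_builtin` and the `'Male'` value in the sample dictionary are assumptions made for illustration (only the `'Female'` average appears in the output above).

```python
# Hypothetical averages for illustration; only the 'Female' value comes from the lecture output.
sample_avg_gpas = {'Female': 3.431, 'Male': 3.2}

def find_key_with_largest_value_builtin(dictionary_of_gpas):
    # max() iterates over the dictionary's keys and compares them
    # using the value that each key maps to.
    best_key = max(dictionary_of_gpas, key=dictionary_of_gpas.get)
    return best_key, dictionary_of_gpas[best_key]

result = find_key_with_largest_value_builtin(sample_avg_gpas)
print(result)
```

One difference to keep in mind: `max` raises a `ValueError` when the dictionary is empty, whereas the loop version returns `None` together with its starting value.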

In [59]:
#Put it all together to find the value with the highest GPA within an attribute
def find_group_with_highest_avg_gpa(attribute):
    dictionary_of_gpas = compute_avg_gpa_given_value(attribute)
    value_w_highest_avg_gpa, highest_avg_gpa = find_key_with_largest_value(dictionary_of_gpas)
    return value_w_highest_avg_gpa, highest_avg_gpa

In [60]:
find_group_with_highest_avg_gpa('major')

('Finance', 3.615)

In [61]:
find_group_with_highest_avg_gpa('gender')

('Female', 3.431)

In [62]:
find_group_with_highest_avg_gpa('class')

('Freshman', 3.3809)

The problem with the above code is that we've assumed the existence of **`students`** as a global variable. If we had created **`students`** within another function and then tried calling the previous functions, nothing would work, because the students data would be local to that function.

In [None]:
#Let's put it all together but not assume that we have the global variable 'students'
import json
def find_group_with_highest_avg_gpa(attribute):
    with open('students.json', 'r') as f:
        students = json.load(f)
    #Need to pass `students` to the function as an argument. This assumes that
    #compute_avg_gpa_given_value (and the functions it calls) now accept `students`.
    dictionary_of_gpas = compute_avg_gpa_given_value(students, attribute)
    value_w_highest_avg_gpa, highest_avg_gpa = find_key_with_largest_value(dictionary_of_gpas)
    return value_w_highest_avg_gpa, highest_avg_gpa

When building functions, you should ALWAYS pass in the data that the function needs as arguments. You should never assume that a global variable will exist. Relying on globals can cause naming conflicts, make debugging harder, and may prevent the code from running at all.
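To make this concrete, here is one way the whole chain of functions could look when every function receives `students` as an argument. This is a sketch under stated assumptions: the `sample_students` records are made up for illustration and stand in for the contents of `students.json`, and the function bodies simply mirror the earlier cells with one extra parameter.

```python
# Hypothetical records standing in for the data loaded from students.json.
sample_students = [
    {'first': 'A', 'gender': 'Female', 'major': 'Math', 'gpa': 3.5},
    {'first': 'B', 'gender': 'Male', 'major': 'Math', 'gpa': 3.0},
    {'first': 'C', 'gender': 'Female', 'major': 'Physics', 'gpa': 4.0},
]

def get_avg_gpa(students, attribute, value):
    # Collect the GPAs of the students whose attribute matches the value
    gpas = []
    for student in students:
        if student[attribute] == value:
            gpas.append(student['gpa'])
    return round(sum(gpas) / len(gpas), 4)

def get_all_possible_values(students, attribute):
    # A set keeps each distinct value once
    values = []
    for student in students:
        values.append(student[attribute])
    return set(values)

def compute_avg_gpa_given_value(students, attribute):
    # `students` is handed down to every helper that needs it
    all_avg_gpas = {}
    for value in get_all_possible_values(students, attribute):
        all_avg_gpas[value] = get_avg_gpa(students, attribute, value)
    return all_avg_gpas

def find_key_with_largest_value(dictionary_of_gpas):
    # This function never needed `students`, so its signature is unchanged
    highest_avg_gpa = float('-inf')
    value_w_highest_avg_gpa = None
    for value, gpa in dictionary_of_gpas.items():
        if gpa > highest_avg_gpa:
            highest_avg_gpa = gpa
            value_w_highest_avg_gpa = value
    return value_w_highest_avg_gpa, highest_avg_gpa

def find_group_with_highest_avg_gpa(students, attribute):
    dictionary_of_gpas = compute_avg_gpa_given_value(students, attribute)
    return find_key_with_largest_value(dictionary_of_gpas)

print(find_group_with_highest_avg_gpa(sample_students, 'gender'))
print(find_group_with_highest_avg_gpa(sample_students, 'major'))
```

Because nothing here depends on a global, the same functions work on any list of student dictionaries, whether it was loaded from a file or built by hand for testing.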

##### Fundamentals of using variables vs. other data structures
It's important to remember that data structures serve the same purpose as variables: storing data. The difference is that they *can* give you much more flexibility in how you organize and access that data.

In the problem where we wanted to see whether numbers were divisible by a certain divisor, we **could** have done this:

In [None]:
nums_divisible_by_2 = []
nums_divisible_by_3 = []
nums_divisible_by_4 = []
nums_divisible_by_5 = []
nums_divisible_by_6 = []
nums_divisible_by_7 = []
nums_divisible_by_8 = []
nums_divisible_by_9 = []
for i in range(1,51):
    if i % 2 == 0:
        nums_divisible_by_2.append(i)
    if i % 3 == 0:
        nums_divisible_by_3.append(i)
    if i % 4 == 0:
        nums_divisible_by_4.append(i)
    if i % 5 == 0:
        nums_divisible_by_5.append(i)
    if i % 6 == 0:
        nums_divisible_by_6.append(i)
    if i % 7 == 0:
        nums_divisible_by_7.append(i)
    if i % 8 == 0:
        nums_divisible_by_8.append(i)
    if i % 9 == 0:
        nums_divisible_by_9.append(i)
print(nums_divisible_by_2)
print(nums_divisible_by_3)
print(nums_divisible_by_4)
print(nums_divisible_by_5)
print(nums_divisible_by_6)
print(nums_divisible_by_7)
print(nums_divisible_by_8)
print(nums_divisible_by_9)


Any time you find yourself rewriting very similar-looking code, think about whether you can use loops and data structures to achieve your desired result.

In [64]:
#Let's generalize this so we don't have to repeat so much code
nums_divisible_by = {}
for divisor in range(2,10):
    nums_divisible_by[divisor] = []
    for dividend in range(1,51):
        if dividend % divisor == 0:
            nums_divisible_by[divisor].append(dividend)
nums_divisible_by

{2: [2,
  4,
  6,
  8,
  10,
  12,
  14,
  16,
  18,
  20,
  22,
  24,
  26,
  28,
  30,
  32,
  34,
  36,
  38,
  40,
  42,
  44,
  46,
  48,
  50],
 3: [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48],
 4: [4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48],
 5: [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
 6: [6, 12, 18, 24, 30, 36, 42, 48],
 7: [7, 14, 21, 28, 35, 42, 49],
 8: [8, 16, 24, 32, 40, 48],
 9: [9, 18, 27, 36, 45]}
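Storing the results in a dictionary also makes it easy to turn the cell above into a reusable function and to look up any one divisor afterwards. The sketch below assumes a hypothetical function name, `group_by_divisor`, and passes in both ranges as arguments rather than hard-coding them.

```python
def group_by_divisor(divisors, dividends):
    # Map each divisor to the list of dividends it divides evenly
    nums_divisible_by = {}
    for divisor in divisors:
        nums_divisible_by[divisor] = []
        for dividend in dividends:
            if dividend % divisor == 0:
                nums_divisible_by[divisor].append(dividend)
    return nums_divisible_by

nums_divisible_by = group_by_divisor(range(2, 10), range(1, 51))

# Look up a single divisor by key instead of remembering a variable name
print(nums_divisible_by[7])
# Count how many numbers each divisor matched
for divisor, nums in nums_divisible_by.items():
    print(divisor, len(nums))
```

Changing the divisors or the upper limit is now a matter of passing different arguments, with no new variables or `if` statements to write.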

I want to close by providing a few tips about choosing variable and function names.
- Be descriptive in your names
- If you have multiple variables holding similar data, consider using a more versatile data structure
- Use lowercase words separated by underscores (e.g. `avg_gpa`, `get_avg_gpa`)
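As a small illustration of these tips, the sketch below contrasts hard-to-read names with descriptive ones. All names and GPA values here are made up for the example.

```python
# Harder to read: short names and one variable per group
g1 = [3.0, 3.5]
g2 = [3.5, 4.0]

# Easier to read: descriptive lowercase names with underscores,
# and a single dictionary instead of several similar variables
gpas_by_major = {'Math': [3.0, 3.5], 'Physics': [3.5, 4.0]}

def compute_average(numbers):
    return sum(numbers) / len(numbers)

avg_gpa_by_major = {}
for major, gpas in gpas_by_major.items():
    avg_gpa_by_major[major] = compute_average(gpas)

print(avg_gpa_by_major)
```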