# MSDS 631 - Supplemental Notes
## Data Structure Manipulation and Usage

### Accessing and Storing Data
Let's start by getting some data.

In [None]:
import json
with open('data/students.json', 'r') as f: #Using "with" closes the file automatically once loading is done
    students = json.load(f)

In [None]:
#Let's take a look at the first record of students
students[0]

The `students` object is a list of 10,000 dictionaries, each containing one student's information. Each record includes their:
- Student ID
- First name
- Last name
- Gender
- Major
- Class
- GPA

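Note that the keys of the dictionary are not sorted in any meaningful way. Since Python 3.7, a dictionary keeps its keys in the order they were inserted, which here is the order they appear in the JSON file.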

If I wanted to get the first student's info and store it in variables, I *could* do this:

In [None]:
student_1_id = students[0]['student_id']
student_1_first = students[0]['first']
student_1_last = students[0]['last']
student_1_gender = students[0]['gender']
student_1_major = students[0]['major']
student_1_class = students[0]['class']
student_1_gpa = students[0]['gpa']

In [None]:
print(student_1_id)
print(student_1_first)
print(student_1_last)
print(student_1_gender)
print(student_1_major)
print(student_1_class)
print(student_1_gpa)

Unfortunately, creating all of these variables for every student is completely infeasible. Also, using generic names like `student_1_first` makes it difficult to access **specific** students' info. For a better structure, I could create a directory keyed on each student's ID. I can use a dictionary for this because I know that students' IDs are going to be unique.

Let's go ahead and do this, one step at a time.

## Reshaping Data
### Converting a list of unique objects into a keyed dictionary

In [None]:
#First we need to create a dictionary that we will store data into.
students_directory = {}

In [None]:
#Now let's create a key for each student and create a value that is another dict
for student in students: #This will loop through each of the 10,000 students and assign it to a variable 'student'
    specific_student_id = student['student_id'] #'student' contains a value within the 'students' list (which is a dictionary)
    students_directory[specific_student_id] = {} #Assign the new dictionary

In [None]:
#Let's verify that students_directory is in fact a dictionary
type(students_directory)

In [None]:
#Let's store the keys of the directory in a new variable
all_students_ids = list(students_directory.keys())

In [None]:
#Let's look at the first 10 IDs
all_students_ids[:10]

In [None]:
#So now let's look at the placeholder for the information for the first student ID
first_id = all_students_ids[0]
students_directory[first_id]

Now, rather than creating an empty dictionary while creating this student ID record, let's fill it in with the necessary data.

In [None]:
#First we need to get the keys. Let's automate this rather than hard-coding them.
first_student_dict = students[0]
student_keys = []
for key in first_student_dict.keys():
    student_keys.append(key)
print(student_keys)

In [None]:
#Now let's run the previous code but fill in the data we need
for student in students: #This will loop through each of the 10,000 students and assign it to a variable 'student'
    specific_student_id = student['student_id'] #'student' contains a value within the 'students' list (which is a dictionary)
    students_directory[specific_student_id] = {} #We need to initiate a new dictionary where we can store things
    # Now we need to go through and add keys to each student dictionary so that we have a dictionary of dictionaries
    for key in student_keys: #We found the keys in the previous cell
        if key != 'student_id': #We can ignore student ID since we already have it as the key
            students_directory[specific_student_id][key] = student[key] #Extracting data from the original data structure

So what does the record look like for our first student?

In [None]:
students_directory[first_id]

Perfect!! Now the data is much more accessible whenever we want to look up a specific student! Dictionaries are great for keeping track of specific things like this.
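
Once the step-by-step loop makes sense, the same reshaping can be written more compactly with a dictionary comprehension. The sketch below is only an illustration: `sample_students` and its values are made-up stand-ins for the real `students` list, and `build_directory` is a hypothetical helper name.

```python
# Hypothetical sample records standing in for the real `students` list
sample_students = [
    {'student_id': 101, 'first': 'Ana', 'last': 'Lopez', 'gender': 'Female',
     'major': 'Economics', 'class': 'Senior', 'gpa': 3.7},
    {'student_id': 102, 'first': 'Ben', 'last': 'Kim', 'gender': 'Male',
     'major': 'Biology', 'class': 'Junior', 'gpa': 3.1},
]

def build_directory(records, id_key='student_id'):
    """Return {id: {every other field}} for a list of record dictionaries."""
    return {
        record[id_key]: {key: value for key, value in record.items() if key != id_key}
        for record in records
    }

sample_directory = build_directory(sample_students)
print(sample_directory[102]['major'])  # Prints: Biology
```

The outer comprehension creates one entry per student, and the inner one copies every field except the ID, which is exactly what the nested for loops did.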

## Analyzing Data
### Using "counters"
Sometimes we want to keep track of how often we see things. In this case, we want to use a counter. Let's go through the list and see how many students have a certain gender. For this example, let's count "Male" and "Female" students, plus a running total of all students.

In [None]:
male_counter = 0 #To start, we know we've seen the word "Male" zero times
female_counter = 0 #To start, we know we've seen the word "Female" zero times
master_counter = 0 #To start, we know we've counted zero students

In [None]:
for student in students:
    student_gender = student['gender']
    if student_gender == 'Male':
        male_counter += 1 #Only increment the male_counter if the condition is satisfied
    elif student_gender == 'Female':
        female_counter += 1 #Only increment the female_counter if the condition is satisfied
    master_counter += 1 #Increment the master_counter every time
pct_male = male_counter / master_counter
print('There were {} male students. This is {:.1%} of students'.format(male_counter, pct_male))
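
For comparison, Python's standard library has a `Counter` class in the `collections` module that does this tallying for us. This is a sketch on made-up labels (`sample_genders` is hypothetical, standing in for the `gender` value of each student), not a replacement for understanding the manual counters above.

```python
from collections import Counter

# Hypothetical gender labels standing in for each student's 'gender' value
sample_genders = ['Male', 'Female', 'Female', 'Male', 'Female']

gender_counts = Counter(sample_genders)  # Tallies each distinct label
total = sum(gender_counts.values())      # Plays the role of master_counter
pct_male = gender_counts['Male'] / total

print('There were {} male students. This is {:.1%} of students'.format(gender_counts['Male'], pct_male))
# Prints: There were 2 male students. This is 40.0% of students
```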

### Storing data in lists

Another commonly used method is to store items in lists as we see them. In the previous example, we only cared about how many students were a specific gender. In this case, we want to know **which** students are a specific gender.

First, we need to find out the gender labels that actually appear in the data. Never assume you know what the labels should be, because many things can produce unexpected values that will break your code. For instance, if the data had Males, Females, males, and females, then Python would treat these as four different values. If you only created keys for Males and Females, your code would raise a `KeyError` whenever you tried accessing the key for males or females (lowercase).

In [None]:
all_students_genders = []
for student in students:
    all_students_genders.append(student['gender'])
all_genders = set(all_students_genders)
all_genders
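
If the set of labels does turn out to be messy, one possible fix is to normalize each label before using it as a key. The sketch below uses made-up labels (`raw_labels`) and a hypothetical helper (`normalize_label`); whether stripping whitespace and fixing capitalization is the right cleanup depends on what your data actually contains.

```python
# Hypothetical messy labels, to show why we inspect the data first
raw_labels = ['Male', 'male', 'Female', 'female ', 'FEMALE']

print(len(set(raw_labels)))  # Prints: 5 (Python sees five different values)

def normalize_label(label):
    """Strip surrounding whitespace and capitalize only the first letter."""
    return label.strip().capitalize()

clean_labels = {normalize_label(label) for label in raw_labels}
print(sorted(clean_labels))  # Prints: ['Female', 'Male']
```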

In [None]:
gender_lists = {}
for gender in all_genders:
    gender_lists[gender] = [] #Create an empty list in each key of the dictionary so we can store IDs

In [None]:
#Let's go through each student record and add the student ID of the student to the appropriate list
master_counter = 0
for i in students: #You can give the loop variable any name. I discourage non-descriptive names like "i", which I'm using here
    student_id = i['student_id']
    gender = i['gender']
    gender_lists[gender].append(student_id) #You don't need an if-statement because your gender is already in the dictionary
    master_counter += 1
pct_male = len(gender_lists['Male']) / master_counter

In [None]:
print('There were {} male students. This is {:.1%} of students'.format(len(gender_lists['Male']), pct_male))

Different approach, same answer!!
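
As an aside, `collections.defaultdict` can create the empty lists for us the first time a new gender shows up, so the separate step of pre-building the keys is not needed. This is a minimal sketch on made-up records (`sample_students` is hypothetical); the trade-off is that an unexpected label silently gets its own list instead of raising an error.

```python
from collections import defaultdict

# Hypothetical records standing in for the real `students` list
sample_students = [
    {'student_id': 101, 'gender': 'Female'},
    {'student_id': 102, 'gender': 'Male'},
    {'student_id': 103, 'gender': 'Female'},
]

ids_by_gender = defaultdict(list)  # A missing key starts out as an empty list
for student in sample_students:
    ids_by_gender[student['gender']].append(student['student_id'])

print(ids_by_gender['Female'])  # Prints: [101, 103]
print(len(ids_by_gender['Male']) / len(sample_students))
```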

### Counting unique instances amongst many choices
Let's now think about how we might see what the most popular name amongst the students is.

We're going to keep track of each time we see the name by using a dictionary and leveraging the idea of a "counter" from earlier in these notes. We could use the same approach used for finding the different genders in order to find all of the names, but I'll show you an alternative way here.

In [None]:
#First we need to create a dictionary that we will store data into.
names_dict = {}

If we go through each record of the original `students` list and pick out each name, we can see if the name is already in the dictionary. If it is, then we can add to the counter. If not, we can add the key and simultaneously set the counter to a value of 1.

In [None]:
for student in students:
    name = student['first']
    if name in names_dict.keys():
        names_dict[name] += 1
    else:
        names_dict[name] = 1

In [None]:
print('There were {} unique names seen in total.'.format(len(names_dict)))
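
The if/else check can also be folded into a single line with the dictionary's `get` method, which returns a default value when the key is missing. Here is a small sketch on made-up names (`sample_names` is hypothetical):

```python
# Hypothetical first names standing in for each student's 'first' value
sample_names = ['Ana', 'Ben', 'Ana', 'Cy', 'Ana', 'Ben']

name_counts = {}
for name in sample_names:
    # get() returns 0 the first time we see a name, so no if/else is needed
    name_counts[name] = name_counts.get(name, 0) + 1

print(name_counts)  # Prints: {'Ana': 3, 'Ben': 2, 'Cy': 1}
print('There were {} unique names seen in total.'.format(len(name_counts)))
```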

Now we'll do the same thing but find the names by gender.

In [None]:
#We need to create the keys and empty dictionaries for each gender
gender_names_dict = {}
for gender in all_genders:
    gender_names_dict[gender] = {}

In [None]:
#Just to be sure, let's take a look at the data structure we created
gender_names_dict['Male']

In [None]:
gender_names_dict['Female']

In [None]:
#Now let's use similar logic to what we did in the previous example
for student in students:
    name = student['first']
    gender = student['gender']
    if name in gender_names_dict[gender]: #You don't actually have to tell Python to look in the keys. It's assumed.
        gender_names_dict[gender][name] += 1
    else:
        gender_names_dict[gender][name] = 1

In [None]:
#Let's look at the first few names in the female names dictionary
list(gender_names_dict['Female'].keys())[:10]

Now that we have the counts of each name, let's find out which ones occur the most. We'll start by only looking at females.

In [None]:
#Let's find the most common female name now!
current_most_common_name = None #Initialize this value. Initialization is only necessary because we want to modify it later
current_most_common_name_counter = 0 #We've seen no names yet, so we've seen no names more than zero times
for key in gender_names_dict['Female'].keys():
    name = key #This line is not necessary, but I want you to see that the key is the student's name
    counter = gender_names_dict['Female'][name] #See how many times this name has appeared
    if counter > current_most_common_name_counter: #Check to see if the current name has more instances than previous most
        current_most_common_name_counter = counter
        current_most_common_name = name
print("The most common female name was {} and it appeared {} times".format(current_most_common_name, current_most_common_name_counter))
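
Writing the loop by hand is good practice, but the built-in `max` function can do the same search when we tell it to compare names by their counts. The sketch below uses a made-up dictionary (`sample_female_counts` is hypothetical) shaped like `gender_names_dict['Female']`.

```python
# Hypothetical name counts shaped like gender_names_dict['Female']
sample_female_counts = {'Ana': 3, 'Bea': 5, 'Cleo': 2}

# key=... tells max() to rank each name by its count rather than alphabetically
most_common_name = max(sample_female_counts, key=sample_female_counts.get)
most_common_count = sample_female_counts[most_common_name]

print("The most common female name was {} and it appeared {} times".format(most_common_name, most_common_count))
# Prints: The most common female name was Bea and it appeared 5 times
```

If two names are tied for the highest count, `max` returns the first one it encounters, which matches the behavior of the loop above.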

Let's do this for all genders now.

In [None]:
#Let's do this for all of the genders now
most_common = {} #Instead of hardcoding variable names as we did above, let's create a data structure to store our results
for gender in all_genders:
    most_common[gender] = {'most_common_name': None, 'most_common_name_counter': 0} #Initialize values for each gender

In [None]:
#Now let's count similarly to the example above

#In order to keep track of what's happening, I created a phrase to let me know what's going on
update_phrase = 'Updated previous leader ({}) which had {} occurrences and replaced it with {} which has {}'

for gender in gender_names_dict:
    print(gender)
    for key in gender_names_dict[gender].keys(): #It's actually implied you are looking in keys for a for loop, but I was explicit here
        name = key #This line is not necessary, but I want you to see that the key is the student's name
        num_occurrences = gender_names_dict[gender][name] #See how many times this name has appeared
        if num_occurrences > most_common[gender]['most_common_name_counter']: #Check to see if the current name has more instances than previous most
            old_name = most_common[gender]['most_common_name']
            old_value = most_common[gender]['most_common_name_counter']
            most_common[gender]['most_common_name_counter'] = num_occurrences
            most_common[gender]['most_common_name'] = name
            print(update_phrase.format(old_name, old_value, name, num_occurrences))

In [None]:
print("The most common female name was {} and it appeared {} times".format(most_common['Female']['most_common_name'], most_common['Female']['most_common_name_counter']))
print("The most common male name was {} and it appeared {} times".format(most_common['Male']['most_common_name'], most_common['Male']['most_common_name_counter']))

The example above had a lot of loops and logic, and even I had problems getting the code to work. Let's use functions to simplify this and make our code more testable.

Let's start by assuming a single gender for our analysis.

In [None]:
def find_most_common_name(gender, gender_names_dict):
    specific_gender_names_dict = gender_names_dict[gender] #We're only going to need one gender's data for this function
    all_names = specific_gender_names_dict.keys() #These are all of the names for one gender

    current_most_common_name = None #Initialize this value. Initialization is only necessary because we want to modify it later
    current_most_common_name_counter = 0 #We've seen no names yet, so we've seen no names more than zero times
    
    for name in all_names:
        num_occurrences = specific_gender_names_dict[name] #See how many times this name has appeared
        if num_occurrences > current_most_common_name_counter: #Check to see if the current name has more instances than previous most
            old_name = current_most_common_name
            old_value = current_most_common_name_counter
            current_most_common_name_counter = num_occurrences
            current_most_common_name = name
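            print(update_phrase.format(old_name, old_value, name, num_occurrences))

    return current_most_common_name, current_most_common_name_counter #Hand the final answer back to the caller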

In [None]:
find_most_common_name('Male', gender_names_dict)

Perfect! Now we can run a for loop on genders and only call this function. We were also able to use local variables that were easier to understand, and we avoided hurting our heads trying to keep track of levels within nested data structures. Nested data structures provide flexibility, but they aren't always easy to follow.

So while we're at a simpler state now, it's still a bit complicated. Let's try to simplify things even more. Rather than keeping all of the update-logic inside of the loop, let's move that into yet another function and run some tests.

In [None]:
def update_values_if_necessary(leader, leader_count, challenger_name, challenger_count):
    update_phrase = 'Updated previous leader ({}) which had {} occurrences and replaced it with {} which has {}'
    
    if leader_count >= challenger_count: #On a tie, keep the current leader, just like the strict ">" check in the earlier loops
        return leader, leader_count
    else:
        print(update_phrase.format(leader, leader_count, challenger_name, challenger_count))
        return challenger_name, challenger_count

In [None]:
#Let's test our function using dummy data
print(update_values_if_necessary('Old', 10, 'New', 8)) #Should return Old as being higher

In [None]:
#Let's test our function using dummy data
print(update_values_if_necessary('Old', 10, 'New', 12)) #Should return New as being higher
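
Rather than reading printed output and checking it by eye, we can let Python do the checking with `assert` statements. The sketch below uses `pick_leader`, a hypothetical print-free stand-in for `update_values_if_necessary`, so that it runs on its own. It also assumes that the current leader should stay in place on a tie, which is a design choice you may or may not want.

```python
def pick_leader(leader, leader_count, challenger_name, challenger_count):
    """Hypothetical, print-free stand-in for update_values_if_necessary."""
    if leader_count >= challenger_count:  # Assumption: on a tie, the current leader stays
        return leader, leader_count
    return challenger_name, challenger_count

# Each assert raises an AssertionError if the function returns something unexpected
assert pick_leader('Old', 10, 'New', 8) == ('Old', 10)    # Leader has more
assert pick_leader('Old', 10, 'New', 12) == ('New', 12)   # Challenger has more
assert pick_leader('Old', 10, 'New', 10) == ('Old', 10)   # Tie
assert pick_leader(None, 0, 'First', 1) == ('First', 1)   # Very first name seen
print('All checks passed')
```

Notice the last two cases: ties and the very first name are the kinds of edge cases that are easy to forget when you only test by eye.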

In [None]:
#Now let's build a function to call this over and over again using a loop:
def find_largest_count(specific_gender_names_dict):
    all_names = specific_gender_names_dict.keys()
    leader_name = None #Initialize this value. Initialization is only necessary because we want to modify it later
    leader_count = 0 #We've seen no names yet, so we've seen no names more than zero times

    for name in all_names:
        challenger_count = specific_gender_names_dict[name]
        leader_name, leader_count = update_values_if_necessary(leader_name, leader_count, name, challenger_count)
    return leader_name, leader_count

In [None]:
find_largest_count(gender_names_dict['Male'])

In [None]:
#Putting it all together, the new code looks like this:
most_common = {}
    
for gender in all_genders:
    print(gender)
    name, count = find_largest_count(gender_names_dict[gender])
    
    #Now that we know the final answer, we don't have to keep updating the values of our dictionary. This is simpler.
    most_common[gender] = {'most_common_name': name, 'most_common_name_count': count}

In [None]:
#Let's put a bow on this and print it nicely
for gender in most_common:
    name = most_common[gender]['most_common_name']
    count = most_common[gender]['most_common_name_count']
    print("The most common {} name was {} and it appeared {} times".format(gender, name, count))
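
For reference, here is one way the whole name-counting pipeline could look using `Counter` and `defaultdict` from the standard library. Treat it as a sketch under assumptions: `sample_students` is made-up data standing in for the real `students` list, and the variable names are my own. It reaches the same kind of answer as the functions we wrote, just with less code to maintain.

```python
from collections import Counter, defaultdict

# Hypothetical records standing in for the real `students` list
sample_students = [
    {'first': 'Ana', 'gender': 'Female'},
    {'first': 'Bea', 'gender': 'Female'},
    {'first': 'Ana', 'gender': 'Female'},
    {'first': 'Ben', 'gender': 'Male'},
    {'first': 'Cal', 'gender': 'Male'},
    {'first': 'Cal', 'gender': 'Male'},
]

# One Counter of first names per gender, created the first time a gender appears
names_by_gender = defaultdict(Counter)
for student in sample_students:
    names_by_gender[student['gender']][student['first']] += 1

# most_common(1) returns a one-item list of (name, count) pairs
top_names = {gender: counts.most_common(1)[0] for gender, counts in names_by_gender.items()}

for gender, (name, count) in top_names.items():
    print("The most common {} name was {} and it appeared {} times".format(gender, name, count))
```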

Breaking code into smaller pieces, testing each piece with values whose results you already know, and eliminating complicated data structures that invite errors are all good coding practices. There are many ways to solve the same problem, and working in manageable pieces leads to fewer errors. It lets you take bites of the elephant rather than trying to eat the whole elephant every time.