# Activity 4: Data Analysis!

## Overview

For this activity, we are going to be using data on the WIC (Women, Infants, and Children) supplemental nutrition program. Our data contains the state and/or tribe and the number of people enrolled in the program in a given year.

You don't need to know what the next few lines of code are doing, but in essence, they're reading in a table from a file! We'll talk more about how data is stored soon, but we often transport data from one place to another in CSV (comma-separated values) format. Since we're looking at real data here, we're reading it in from a CSV file!
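
If you are curious what a CSV file looks like, here is a minimal sketch using Python's built-in `csv` module. The text and names below are made up for illustration, and this is not necessarily how `read_table` works internally.

```python
import csv
import io

# Made-up CSV text standing in for a file on disk (an assumption for illustration).
csv_text = 'State,Enrollment\nExample State A,"10,000"\nExample State B,250\n'

# csv.reader splits each line into a list of strings, one string per column.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]      # the first row holds the column names
data_rows = rows[1:]  # every remaining row is one record

print(header)
print(data_rows)
```

Notice that every value comes back as a string, even the ones that look like numbers.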

We got this data from https://catalog.data.gov/dataset/women-infants-and-children-wic-participating-and-cost-data


In [None]:
from otd_util import *

In [None]:
WIC_States_and_Tribes_Data = read_table('WIC_Participation_edited.csv', columns=[0, 5], column_names={'FY 2018':'Total Enrollment', "State / Indian Tribe         ":'State/Indian Tribe'})
WIC_States_and_Tribes = {'State/Indian Tribe': list(WIC_States_and_Tribes_Data['State/Indian Tribe']), 'Total Enrollment': list(WIC_States_and_Tribes_Data['Total Enrollment'])}

## Problem 1: Loops

Loops can help us perform the same set of actions on multiple pieces of data! We will use loops to filter data and remove pieces we don't want, or to modify data to make it easier for us to use.

Let's take a look at the data we have first. We have two columns: `State/Indian Tribe` and `Total Enrollment`.

In [None]:
visualize_table(WIC_States_and_Tribes)

The cell above displays a table that shows each state and Native American tribe in the country and its enrollment in the WIC program.

As you may recall from Module 2, a table is made up of columns. The table we are using has two columns named `'Total Enrollment'` and `'State/Indian Tribe'`. Each column is a list with every piece of data about a certain topic. We can use square brackets, like `[]`, to view a column of the table in list form.

In [None]:
WIC_States_and_Tribes["Total Enrollment"]
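
As a small sketch of the same idea, here is a made-up table (the names `toy_table`, `'Fruit'`, and `'Count'` are invented for this example). Square brackets with a column name give you the whole column, and a second pair of square brackets picks one element out of it.

```python
# A made-up table: each key is a column name, each value is that column's list.
toy_table = {
    'Fruit': ['apple', 'banana', 'cherry'],
    'Count': ['3', '12', '7'],
}

count_column = toy_table['Count']    # the whole column, as a list
first_count = toy_table['Count'][0]  # the first element of that column

print(count_column)
print(first_count)
```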

Remember when we learned about data types? The four data types we talked about are: integers, strings, floats, and booleans. Look back in the textbook if you do not remember what these are. 

It is good practice to check what kind of data you are working with so you know what kind of tools you can use on it. A good rule of thumb is for data in 'words' (like the names of states and tribes) to be in strings while data in 'numbers' (like the total enrollment) would be in integers. You can use the type() function to check the type of a single piece of data. 

In [None]:
type('this is a string')
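
For reference, here is a short sketch that checks one made-up value of each of the four data types. The values are arbitrary examples.

```python
# One example value for each data type: integer, string, float, boolean.
examples = [42, 'forty-two', 4.2, True]

type_names = []
for value in examples:
    # type(value).__name__ gives the short name of the type, such as 'int'.
    type_names.append(type(value).__name__)

print(type_names)
```

Python shortens the names: `int` for integers, `str` for strings, `float` for floats, and `bool` for booleans.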

Let's check what type of data the 'State/Indian Tribe' column has by using the type() function on its first element!

In [None]:
type(WIC_States_and_Tribes["State/Indian Tribe"][0])

Now, let's look at the type of data in the 'Total Enrollment' column:

In [None]:
type(WIC_States_and_Tribes["Total Enrollment"][0])

Oh no! The data in this column is in strings instead of integers. Don't worry! We can use a for loop to change the type of every element in the column into an integer by applying the int() function to each value in the column.

In [None]:
int('10')

However, this might have issues when there are commas in the data. What happens when the following cell is run?

In [None]:
int('10,000')

So let's try removing the commas from our number string and turning that into an integer:

In [None]:
no_commas = '10,000'.replace(',', '')
int(no_commas)

Now, let's try doing this in a loop to get a new list where all of the values are numbers!

In [None]:
integer_enrollment = []

for num in WIC_States_and_Tribes["Total Enrollment"]:
    no_commas = ... # Fill in your code here! Hint: Think about what we did in the above cell 
    integer_enrollment.append(...) # Fill in your code here!

print(integer_enrollment)
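
If the loop-and-append pattern is unfamiliar, here is a sketch of the same shape on a different, made-up problem: turning price strings like `'$5'` into integers. The list `price_strings` is invented for this example.

```python
# Made-up price strings; the dollar sign stops int() from working directly.
price_strings = ['$5', '$12', '$130']

integer_prices = []
for price in price_strings:
    no_dollar_sign = price.replace('$', '')     # clean up the string first
    integer_prices.append(int(no_dollar_sign))  # then convert and store it

print(integer_prices)
```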

Our table currently shows us the number of people enrolled in WIC by state and by Native American Tribe. Since the state counts include tribe counts, let's try removing all the Tribe counts in the table. 

How do we tell Python which rows are tribe counts and which are state counts? Notice that the tribe entries have a space at the beginning of their names. For example, one of the first tribe entries is `' Navajo Nation'`. We need to construct a filter that goes through every entry in the data table and removes the ones that start with a space. We can do this using a loop.

In [None]:
WIC_States_and_Tribes["State/Indian Tribe"][5]

In [None]:
state_names = []
corresponding_numbers = []

index_value = 0

for state in ...: # Fill in your code here! What do you want to be indexing through?
    if state[0] != ' ': # This indexes a string to get the first character
        ... # Fill in your code here!
        corresponding_numbers.append(...) # Fill in your code here!
        
    index_value += 1
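
Here is a sketch of the same filtering idea on made-up data, where the entries to drop start with a dash instead of a space. The lists are invented for this example, and it uses the string method `startswith` in place of indexing the first character.

```python
# Two made-up parallel lists: item_counts[i] belongs to item_names[i].
item_names = ['Fruit', '-apple', '-banana', 'Vegetables', '-carrot']
item_counts = [2, 1, 1, 1, 1]

kept_names = []
kept_counts = []
index_value = 0

for name in item_names:
    if not name.startswith('-'):  # keep only names without the leading dash
        kept_names.append(name)
        kept_counts.append(item_counts[index_value])  # the count in the same row
    index_value += 1  # move to the next row whether or not we kept this one

print(kept_names)
print(kept_counts)
```

One difference worth knowing: `name.startswith('-')` also works on an empty string, while `name[0]` would raise an error on one.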

Now, let's replace our old `Total Enrollment` and `State/Indian Tribe` columns with our new lists!

In [None]:
WIC_States_and_Tribes["Total Enrollment"] = corresponding_numbers
WIC_States_and_Tribes["State/Indian Tribe"] = state_names

## Problem 2: Analysis

### Analysis Part 1:

In this next section, we will combine all the tools you have learned throughout this module to learn more about the data as a whole. You will find the 10 states with the largest enrollment in WIC and make a bar chart of the values.

We want to find the 10 largest values in `corresponding_numbers`, which is now our `Total Enrollment` column. We can start by sorting the enrollment counts per state in descending order (largest to smallest). We can use the `sorted` function to sort a list; setting the `reverse` argument to `True` tells it to sort from largest to smallest.

In [None]:
sorted_enrollment = sorted(WIC_States_and_Tribes["Total Enrollment"], reverse=True)
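
As a quick sketch on a made-up list, here is what `sorted` does with and without `reverse=True`. The list `scores` is invented for this example.

```python
scores = [40, 95, 10, 70]

largest_first = sorted(scores, reverse=True)  # descending order
smallest_first = sorted(scores)               # ascending order (the default)

print(largest_first)
print(smallest_first)
print(scores)  # sorted() returns a new list, so the original is unchanged
```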

Now that the list is sorted from largest to smallest, assign the ten largest counts to the name `top_10_amounts` as a list.

In [None]:
top_10_amounts = ... # Fill in your code here! How might you get the top 10 amounts using sorted_enrollment?
top_10_amounts
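
If you need a reminder about taking part of a list, here is a sketch of list slicing on a made-up list. The list `letters` is invented for this example.

```python
letters = ['a', 'b', 'c', 'd', 'e']

first_three = letters[:3]  # from the start up to, but not including, index 3
middle = letters[1:4]      # from index 1 up to, but not including, index 4

print(first_three)
print(middle)
```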

With the top ten values, we can find out where they were in the original list to find the corresponding states. To do this, we can use a for loop that goes through each "row" of our table and adds the state to our top ten dictionary if its enrollment value is one of the top ten values.

In [None]:
top_state_enrollments = {}
num_states = len(WIC_States_and_Tribes["State/Indian Tribe"]) # number of states

for i in range(num_states):
    state = WIC_States_and_Tribes["State/Indian Tribe"][i] # get state at row
    enrollment = ... # Fill in your code here to get enrollment at row
    
    if ...: # When do you want to add to top_state_enrollments?
        top_state_enrollments[state] = enrollment
        
top_state_enrollments
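
Here is a sketch of the same matching idea on made-up data, using the `in` keyword to check whether a value appears in a list. The names and scores are invented for this example.

```python
# Two made-up parallel lists: scores[i] belongs to names[i].
names = ['Ana', 'Ben', 'Cy', 'Dee']
scores = [88, 92, 75, 60]

top_2_scores = sorted(scores, reverse=True)[:2]  # the two largest scores

top_people = {}
for i in range(len(names)):
    name = names[i]
    score = scores[i]
    if score in top_2_scores:  # True when this row's score is one of the largest
        top_people[name] = score

print(top_people)
```

One thing to keep in mind with this approach: if two rows share the same value, both of them pass the `in` check.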

Now that we have the indices of the top ten enrollment counts, we can index into the `state_names` list to get the state names that correspond to these counts. Use a for loop to obtain the values.

Hooray! At this point, we have both the top ten enrollment counts and the state names that correspond to those counts. Let's make a new table and use the `visualize_table()` function to see what the values look like side by side. Start by making a dictionary of the values using the variables we just found in this section.

In [None]:
top_states = list(top_state_enrollments.keys())
top_enrollments = list(top_state_enrollments.values())
top_state_and_value = ... # Fill in your code here!
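
As a reminder of how a table is laid out, here is a sketch that turns a made-up dictionary into a two-column table. The names `pet_ages`, `'Pet'`, and `'Age'` are invented for this example.

```python
# A made-up dictionary that maps each pet's name to its age.
pet_ages = {'Rex': 4, 'Milo': 2}

pet_names = list(pet_ages.keys())  # the keys become one column
ages = list(pet_ages.values())     # the values become the other column

# A table is a dictionary of column name -> list of values.
pet_table = {'Pet': pet_names, 'Age': ages}

print(pet_table)
```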

In [None]:
visualize_table(top_state_and_value)

We can see the table we have made up to this point, and we can use this information to plot the different states against each other. This is helpful because it lets us compare how many people are enrolled in some states versus others.

In [None]:
# Fill in the column names you used when creating top_state_and_value above!
bar(top_state_and_value, ..., ...) 

In the bar chart, we can see that California has the largest enrollment in WIC, with 1,009,492 people using these services. The next state is Texas, which has around 250,000 fewer people enrolled. Why are the enrollment counts of these states so far apart? Something to consider is the size of the population. California has over 39 million people within the state at this time, while Texas has around 27 million people. Based on the different population sizes, we would expect California to have more people enrolled.
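
One way to take population into account is to compare the share of each state's population that is enrolled. The sketch below uses only the rough figures mentioned in the paragraph above. The Texas enrollment is an assumption worked out from "around 250,000 fewer", and both populations are rounded, so treat the results as illustrative only.

```python
# Rough figures taken from the text above; these are approximations, not exact data.
ca_enrollment = 1009492
tx_enrollment = ca_enrollment - 250000  # assumption: "around 250,000 fewer"
ca_population = 39000000                # "over 39 million"
tx_population = 27000000                # "around 27 million"

# Share of each state's population enrolled in WIC.
ca_share = ca_enrollment / ca_population
tx_share = tx_enrollment / tx_population

# Multiply by 100 to show the shares as percentages, rounded to one decimal place.
print(round(ca_share * 100, 1))
print(round(tx_share * 100, 1))
```

With these rough numbers, the two shares come out much closer together than the raw counts, which suggests that population size explains a good part of the gap.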
