Looking for a dataset for this activity!

Potential Ideas:
- WIC data
- SNAP data
- transit data

# Activity 4: Sampling!

Topics covered:
* Logical operators 
* Conditional statements (if / else)
* Loops (for loops)
* Analysis 

## Overview

We can load in the data we decide on here!

In [79]:
from otd_util import *

import pandas as pd

## Problem 1: Logical Operators

Common logical operators we will use in the course are `==`, `!=`, `>`, `>=`, `<`, and `<=`. We can use these to compare individual numbers or strings! This activity deals with WIC (Women, Infants, and Children) data. Let's consider the data on a smaller scale. Say we are looking at women in the table who make less than $20,000 a year. We can analyze how many dependents these women have. The first woman we find in the dataset has 3 dependents.

In [2]:
first_woman_dependents = 3
first_woman_dependents

3

For women with less than $21,330 in annual income, having 3 dependents means they are below the Federal poverty guidelines in California. If we wanted to check `first_woman_dependents` against the guideline, we could write an inequality.

In [3]:
first_woman_dependents >= 3

True

Make another variable named `second_woman_dependents`, where the second woman has 1 dependent and makes the same amount of money as the first woman. Write an inequality to check whether she would fall below the Federal poverty guidelines.

In [4]:
... # PROMPT
second_woman_dependents = 1 # SOLUTION
second_woman_dependents

1

In [5]:
... # PROMPT
second_woman_dependents >= 3 # SOLUTION

False

What if we had a list of how many dependents 10 different people had? Could we use the same operators? Try it out below using the list called `dependents`.

In [6]:
dependents = [1, 3, 1, 5, 6, 2, 1, 2, 3, 4] 
dependents

[1, 3, 1, 5, 6, 2, 1, 2, 3, 4]

In [7]:
dependents >= 3 #SOLUTION

## THIS WILL ERROR!!! 

TypeError: '>=' not supported between instances of 'list' and 'int'

By comparing the whole list to a number, we cannot check whether the elements inside the list are greater than or equal to 3. What happens if we want to check which of the women in the list have exactly 3 dependents? What if we tried the statement below? What does the result mean?

In [8]:
dependents == 3

False

**Rather than checking the elements in the list, Python checked whether the list itself is the same as a number, which is `False`. This is something to keep in mind for future comparisons!**
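To compare the elements themselves, one option is to visit them one at a time with a for loop (covered in Problem 3). The cell below is a minimal sketch of that idea. The names `at_least_three` and `exactly_three` are made up for this illustration.

```python
dependents = [1, 3, 1, 5, 6, 2, 1, 2, 3, 4]

at_least_three = []  # one True/False answer per person
exactly_three = []

for count in dependents:
    # Each comparison is now between two numbers, so it works as expected.
    at_least_three.append(count >= 3)
    exactly_three.append(count == 3)

# True counts as 1 and False counts as 0, so sum() counts the True values.
num_at_least_three = sum(at_least_three)
num_exactly_three = sum(exactly_three)

print(at_least_three)
print(num_at_least_three, num_exactly_three)
```

Each entry in the result lines up with the entry at the same position in `dependents`.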

Another place you may use logical operators is on strings. Consider the case where we have a dictionary that pairs each dependent count with the names of the women who have that number of dependents.

In [9]:
dependents_to_women = {'1':['Jessica', 'Natasha A.'], '2': ['Whitney', 'Natasha B.'], '3': ['Danielle', 'Valerie'], '4':['Chelsea', 'Karen'], '5+':['Robin', 'Marissa']}
dependents_to_women

{'1': ['Jessica', 'Natasha A.'],
 '2': ['Whitney', 'Natasha B.'],
 '3': ['Danielle', 'Valerie'],
 '4': ['Chelsea', 'Karen'],
 '5+': ['Robin', 'Marissa']}

By looking at the dictionary we see that there are two women with the same first name. Let's compare their names with the operators. First, make a variable called `one_dep` which contains all the values from the dictionary that match the key '1'. And make another variable called `two_dep` which contains all the values for the key '2'.

In [10]:
... #PROMPT
one_dep = dependents_to_women['1'] #SOLUTION
one_dep

['Jessica', 'Natasha A.']

In [11]:
... #PROMPT
two_dep = dependents_to_women['2'] #SOLUTION
two_dep

['Whitney', 'Natasha B.']

Using these variables, compare the lists below. What result do we expect to get? Double check that this matches your logic.

In [12]:
one_dep == two_dep

False

The result is `False` because `==` checks whether the two lists contain the same values in the same order. The first woman in each list is different, so the two lists are not equal. In the next cell, make two comparisons between the second element in each of the lists. You can use any operator you want, but get one to say `True` and another to say `False`. **Note: you must use the variables you created above.**

In [13]:
... #PROMPT
# POSSIBLE SOLUTION
one_dep[1] == two_dep[1]

False

In [14]:
... # PROMPT
# POSSIBLE SOLUTION
one_dep[1] <= two_dep[1]

True
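Why is `'Natasha A.' <= 'Natasha B.'` true? Python compares strings character by character, from left to right, and the first character that differs decides the result. The short sketch below illustrates this. The variable names and the extra names `'Zoe'` and `'amy'` are only for this example.

```python
first_name = 'Natasha A.'
second_name = 'Natasha B.'

# The two names are identical up to the last letter, so 'A' vs 'B' decides.
names_equal = first_name == second_name
first_before_second = first_name < second_name

# Uppercase letters are ordered before lowercase letters,
# so string order is not always the same as alphabetical order.
uppercase_first = 'Zoe' < 'amy'

print(names_equal, first_before_second, uppercase_first)
```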

## Problem 2: Conditional Statements

We can use conditionals to further impose restrictions on what parts of our code are executed or only allow some actions to be taken when certain conditions are met! Let's review the syntax we learnt earlier. Recall that we use the **if** statement to specify what our condition should be. When this condition evaluates to true, the program will execute the block of code under that if statement. You can also choose to specify an **else** statement and provide alternate instructions for what you want the program to do if the condition evaluates to false. 

In [15]:
#if (condition):
#    code to execute
#else:
#    alternate code to execute

Now, suppose we want to compare how many participants there were in the WIC program in Alabama in December 2018 vs. November 2018, and check whether this number increased or decreased.

In [16]:
alabama_december_2018 = 115036
alabama_november_2018 = 117518

#SOLUTION
if (alabama_december_2018 < alabama_november_2018):
    print("Decrease in program participants!")
else:
    print("Increase in program participants!")

Decrease in program participants!


We can also have multiple conditions that we can check for using **if** and **elif** statements! If the condition specified in the if(condition1...) statement is false, the program moves down to the elif(condition2...) statement and checks it. We can choose to have as many elif statements as needed. 

In [17]:
#if (condition 1):
#    code to execute for condition 1
#elif (condition 2):
#    code to execute for condition 2
#elif (condition 3):
#    ....
#else:
#    alternate code to execute if none of the conditions are met

Now suppose we want to check which state among California, Georgia, and Illinois had the largest **magnitude** of change from December 2017 to December 2018. Below, we've entered the numbers for you. Try calculating the change in number of program participants and check which state has the largest absolute change! Note that you can combine conditions using **and** if you want two or more conditions to **all** be satisfied, or **or** if **at least one** should be satisfied.

In [18]:
california_dec_2017 = 993706
california_dec_2018 = 926600

georgia_dec_2017 = 219881
georgia_dec_2018 = 199837

illinois_dec_2017 = 199365
illinois_dec_2018 = 182419

### SOLUTION ###
california_change = abs(california_dec_2018 - california_dec_2017)
georgia_change = abs(georgia_dec_2018 - georgia_dec_2017)
illinois_change = abs(illinois_dec_2018 - illinois_dec_2017)

if (california_change > georgia_change and california_change > illinois_change):
    print("California had the most change.")
elif (georgia_change > california_change and georgia_change > illinois_change):
    print("Georgia had the most change.")
else:
    # As long as there are no ties, reaching this branch means
    # (illinois_change > california_change and illinois_change > georgia_change).
    # Can you reason why?
    print("Illinois had the most change.")

California had the most change.
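To see the difference between **and** and **or** more directly, here is a small sketch that reuses the same December numbers. The 50,000 threshold and the variable names are arbitrary choices for this illustration, not part of the WIC data.

```python
# Signed change from December 2017 to December 2018 (negative means a decrease).
california_diff = 926600 - 993706
georgia_diff = 199837 - 219881
illinois_diff = 182419 - 199365

# `and` is True only when every condition is True.
all_decreased = california_diff < 0 and georgia_diff < 0 and illinois_diff < 0

# `or` is True when at least one condition is True.
threshold = 50000  # hypothetical cutoff for a "large" change
any_large_change = (abs(california_diff) > threshold
                    or abs(georgia_diff) > threshold
                    or abs(illinois_diff) > threshold)

# With `and`, a single False condition makes the whole expression False.
all_large_changes = (abs(california_diff) > threshold
                     and abs(georgia_diff) > threshold
                     and abs(illinois_diff) > threshold)

print(all_decreased, any_large_change, all_large_changes)
```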


You can also choose to **nest** if-else statements. You can use this in a situation when you want different parts of the code under a particular if statement to be executed under different conditions. Can we rewrite our program from the previous part using nested if-else statements?

In [19]:
### SOLUTION ###
if (california_change > georgia_change):
    if (california_change > illinois_change):
        print("California had the most change.")
    else:
        print("Illinois had the most change.")
else:
    if (georgia_change > illinois_change):
        print("Georgia had the most change.")
    else:
        print("Illinois had the most change.")

California had the most change.
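As a cross-check on the if-else logic, the same question can be answered with a dictionary and Python's built-in `max` function. This is only an alternative sketch, and the dictionary name `changes` is made up for this example.

```python
# Absolute change in participants, December 2017 to December 2018.
changes = {
    'California': abs(926600 - 993706),
    'Georgia': abs(199837 - 219881),
    'Illinois': abs(182419 - 199365),
}

# key=changes.get tells max() to compare the states by their change values.
# If two states tied, max() would return whichever appears first.
largest_state = max(changes, key=changes.get)

print(largest_state, "had the most change.")
```

Writing the conditionals by hand is still worth practicing, since it makes every comparison explicit.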


## Problem 3: Loops

Loops can help us perform the same set of actions on multiple pieces of data! We will use loops to filter data and remove pieces we don't want, or to modify data to make it easier for us to use.

In [20]:
# Downloaded from https://catalog.data.gov/dataset/women-infants-and-children-wic-participating-and-cost-data
WIC_States_and_Tribes = pd.read_csv('WIC_Participation_edited.csv', usecols = [0,5])
# Show the first five rows of the table
WIC_States_and_Tribes.head()

Unnamed: 0,State / Indian Tribe,FY 2018
0,Alabama,120605
1,Alaska,17092
2,American Samoa,5235
3,Arizona,149513
4,Dept. of Health,133547


In [21]:
# EXPLAIN NEXT CELL BREAK DOWN FROM TABLE TO COLUMN TO ITEM AND WHAT TYPE FUNCTION DOES
# EXPLAIN THAT IT IS GOOD DATA PRACTICE TO CHECK DATA TYPE BEFORE STARTING ANALYSIS

In [22]:
type(WIC_States_and_Tribes["FY 2018"][0])

str

In [23]:
# WE NOTICED THEY ARE NOT NUMBERS AND WILL CHANGE IT BELOW

In [36]:
strings_wic = WIC_States_and_Tribes["FY 2018"]
str_to_num = []

for i in strings_wic:
    without_comma = i.replace(',', "")
    str_to_num.append(int(without_comma))
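To see what the loop above does to each entry, here is the same idea on a three-item list. This sketch assumes the counts are stored as text with comma separators, such as `"120,605"`. The names `sample_counts` and `converted` are made up for this example.

```python
sample_counts = ["120,605", "17,092", "5,235"]

# Checking the type first tells us these are strings, not numbers.
first_type = type(sample_counts[0])

converted = []
for text in sample_counts:
    # int("120,605") would raise a ValueError, so remove the comma first.
    without_comma = text.replace(",", "")
    converted.append(int(without_comma))

print(first_type)
print(converted)
```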
    

This table shows us the number of people enrolled in WIC by state and by Native American Tribe. Since the state counts include the Tribe counts, let's write a loop to remove all the Tribe counts from the table.

How do we tell Python which rows are Tribe counts and which are state counts? Notice that the Tribe entries have a space at the beginning of their names. For example, one of the first Tribe entries is ' Navajo Nation'. We need to construct a filter that goes through every entry in the data table and removes the ones that start with a space. We can do this using a loop.

In [56]:
state_names = []
corresponding_numbers = []

index_value = 0

for i in WIC_States_and_Tribes["State / Indian Tribe         "]: 
    # i[0] indexes the string to get its first character
    if i[0] != ' ':
        state_names.append(i)
        corresponding_numbers.append(str_to_num[index_value])

    # Move to the next position so the names and counts stay lined up
    index_value += 1
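Another way to keep names and counts lined up, without tracking `index_value` by hand, is to loop over both lists at once with `zip`. The sketch below uses a few rows from the table, with leading spaces on the Tribe entries as described above. The list names are made up for this example.

```python
# A small sample: Tribe entries start with a space, state entries do not.
sample_names = ['Alabama', 'Arizona', ' Navajo Nation', ' Inter-Tribal Council', 'Arkansas']
sample_counts = [120605, 149513, 7561, 8406, 73607]

kept_names = []
kept_counts = []

# zip pairs the first name with the first count, the second with the second, and so on.
for name, count in zip(sample_names, sample_counts):
    # startswith also works on an empty string, where name[0] would raise an error.
    if not name.startswith(' '):
        kept_names.append(name)
        kept_counts.append(count)

print(kept_names)
print(kept_counts)
```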

## Problem 4: Analysis

In this next section, we will combine all the tools you have learned throughout the notebook to learn more about the data as a whole. There are two goals with this section:
1. Find the top 10 states with largest enrollment in WIC programs and make a bar chart of the values.
2. Find the relative enrollment in each state. (Optional: find the top 10 states using relative enrollment)

Let's start approaching our first goal! We want to find the 10 largest values within `corresponding_numbers`. We can start by sorting the enrollment counts per state in descending order (largest to smallest).

In [81]:
original_unsorted = corresponding_numbers.copy() 
corresponding_numbers.sort(reverse = True) ## TEACH SORT IN LIST SECTION

Now that the list is sorted from largest to smallest, assign the ten largest counts to the name `top_10_amounts` as a list.

In [82]:
top_10_amounts = corresponding_numbers[0:10] #SOLUTION
top_10_amounts

[1009492,
 746246,
 450624,
 416173,
 221719,
 218188,
 217695,
 213964,
 208955,
 199360]

With the top ten values, we can find out where they were in the original list to find the correct corresponding states. To do this we need to create a for loop that gets the index from the `original_unsorted` list of each count in the top 10 amounts.

In [59]:
indices = []
for i in top_10_amounts:
    indices.append(original_unsorted.index(i))
        
indices

[4, 43, 9, 32, 33, 38, 10, 22, 35, 13]
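One thing to keep in mind: `list.index` returns the position of the *first* match only, so two states with exactly the same count would both map to the same index. The sketch below shows the issue on made-up data, along with one possible workaround that sorts (count, name) pairs together. None of these values come from the WIC table.

```python
# Hypothetical names and counts, with a tie on purpose.
names = ['A', 'B', 'C', 'D']
counts = [50, 90, 90, 10]

top_2 = sorted(counts, reverse=True)[0:2]

# Both 90s are found at position 1, so 'C' would be missed.
indices = []
for value in top_2:
    indices.append(counts.index(value))

# Pairing each count with its name before sorting keeps them together.
pairs = sorted(zip(counts, names), reverse=True)
top_names = []
for count, name in pairs[0:2]:
    top_names.append(name)

print(indices)
print(top_names)
```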

Now that we have the indices of the top ten enrollment counts, we can index into the `state_names` list to get the state names that correspond to these counts. Use a for loop to obtain the values.

In [83]:
## SOLUTION
top_states = []
for i in indices:
    top_states.append(state_names[i])
top_states

['California',
 'Texas',
 'Florida',
 'New York',
 'North Carolina',
 'Pennsylvania',
 'Georgia',
 'Michigan',
 'Ohio',
 'Illinois']

Hooray! At this point, we have both the top ten enrollment counts and the state names that correspond to those counts. Let's make a new table using the `visualize_table()` function to see what the values look like side by side. Start by making a dictionary of the values using the variables we just found in this section.

In [61]:
# PROMPT
# top_state_and_value = {'State': ..., "Enrollment": ...}

top_state_and_value = {'State': top_states, "Enrollment": top_10_amounts}
top_state_and_value

{'Enrollment': [1009492,
  746246,
  450624,
  416173,
  221719,
  218188,
  217695,
  213964,
  208955,
  199360],
 'State': ['California',
  'Texas',
  'Florida',
  'New York',
  'North Carolina',
  'Pennsylvania',
  'Georgia',
  'Michigan',
  'Ohio',
  'Illinois']}

In [None]:
visualize_table(top_state_and_value)

We can see the table we have made up to this point, and we can use this information to plot the different states against each other. This is helpful because it lets us compare how many people rely on WIC in some states versus others.

In [None]:
bar(top_state_and_value, 'State', "Enrollment in WIC")

In the bar chart, we can see that California has the largest enrollment in WIC, with 1,009,492 people using these services. The next state is Texas, which has around 263,000 fewer people enrolled. Why are the enrollment counts of these states so far apart? Something to consider is the size of the population. California has over 39 million people at this time, while Texas has around 29 million. Based on the different population sizes, we would expect California to have more people enrolled.



## Vivi's part starts below

The second goal is to find the relative enrollment: the number of people receiving aid in a state divided by the state's population.


Now we have all the states with their counts of people enrolled. Let's find the states with more than X people in WIC.

In [26]:
WIC_States_and_Tribes #ADD THE STUFF  

Unnamed: 0,State / Indian Tribe,FY 2018
0,Alabama,120605
1,Alaska,17092
2,American Samoa,5235
3,Arizona,149513
4,Dept. of Health,133547
5,Navajo Nation,7561
6,Inter-Tribal Council,8406
7,Arkansas,73607
8,California,1009492
9,Colorado,85258


Now let's find the states with the highest *relative* number of people receiving assistance. To do this, we have to divide the number of people enrolled in WIC in every state by the state's population. In the cell below, we will add the total population counts for every state to the table.

In [27]:
#https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=PEP_2018_PEPANNRES&prodType=table

In [28]:

# make percent column
# find state with greatest percent enrolled
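Until the population counts are loaded, the calculation can be sketched on three states. The enrollment counts are the FY 2018 values found earlier in this notebook. The population figures are rough, rounded placeholders for illustration only and should be replaced with the Census estimates linked above. The dictionary names are made up for this example.

```python
# FY 2018 WIC enrollment from the table above.
enrollment = {'California': 1009492, 'Texas': 746246, 'Florida': 450624}

# PLACEHOLDER populations (rounded, illustrative); swap in the Census values.
population = {'California': 39600000, 'Texas': 28700000, 'Florida': 21300000}

# Percent enrolled = (people enrolled / total population) * 100
percent_enrolled = {}
for state in enrollment:
    percent_enrolled[state] = enrollment[state] / population[state] * 100

# Find the state with the greatest percent enrolled.
highest_state = max(percent_enrolled, key=percent_enrolled.get)

for state in percent_enrolled:
    print(state, round(percent_enrolled[state], 2))
print("Highest relative enrollment:", highest_state)
```

Notice that the state with the largest count is not necessarily the state with the largest percent, which is why relative enrollment is worth computing.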