## Visualizing Food Insecurity with Pandas and Pixie Dust

This notebook has been adapted from the Food Insecurity code pattern https://developer.ibm.com/patterns/create-visualizations-to-understand-food-insecurity/.

This journey focuses on food insecurity throughout the US. Using open government data, it considers low access to food, diet-related diseases, race, poverty, geography, and other factors. For context, the problem is increasingly relevant for the United States as obesity and diabetes rise: two out of three adult Americans are considered obese, one third of American minors are considered obese, nearly ten percent of Americans have diabetes, and nearly fifty percent of the African American population has heart disease. Moreover, cardiovascular disease is the leading global cause of death, accounting for 17.3 million deaths per year, and that number is rising. Native American populations more often than not do not have grocery stores on their reservations, and all of these trends are on the rise. The problem lies not only in low access to fresh produce, but also in food culture, limited education on healthy eating, and racial and income inequality.

The data used in this journey is aggregated from original government data published by the US Bureau of Labor Statistics https://www.bls.gov/cex/ and the United States Department of Agriculture https://www.ers.usda.gov/data-products/food-environment-atlas/data-access-and-documentation-downloads/.


The aggregated data is hosted here - https://ibm.box.com/s/058spwk7hvo8z2xguzr5jxsdjssbbvdl

## Import Dependencies

In [None]:
from io import StringIO
import requests
import json
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load Dataset

To get the data and load it into a pandas DataFrame:

1. Download the dataset as a CSV from this link [https://ibm.box.com/s/058spwk7hvo8z2xguzr5jxsdjssbbvdl] to your computer.
1. From this notebook, click the Find and Add Data icon on the top right (the icon looks like a 0100) and then upload the file you downloaded in step 1 in this panel.
1. Click in the next cell, choose Insert to code > Insert pandas DataFrame from below the file name, and then run the cell.
1. The generated code will read the CSV file into a DataFrame variable named df_data_#, where the # may vary. Be sure to change the inserted variable name to df_data_1 if it is not named that automatically.
1. Run the cell again if you changed the variable name.


In [None]:
# With your cursor in this cell, insert the code to read the dataset into a DataFrame as instructed in step 3.



In [None]:
try:
    disease_df = df_data_1
except NameError as e:
    print('Error: Setup is incorrect or incomplete.\n')
    print('Follow the instructions to insert the pandas DataFrame above, and edit to')
    print('make the generated df_data_# variable match the variable used here.')
    raise
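
If the Insert to code menu is not available in your environment, the downloaded CSV could probably be read with pandas directly. The following is a minimal sketch under that assumption, not the code the notebook generates: `load_dataset` is a hypothetical helper, and the inline sample (with made-up values) only stands in for the real file so the example runs on its own.

```python
from io import StringIO

import pandas as pd


def load_dataset(source):
    """Read the dataset CSV into a DataFrame.

    `source` may be a file path or any file-like object (hypothetical helper).
    """
    return pd.read_csv(source)


# Tiny stand-in for the downloaded file; the values are made up for illustration.
sample_csv = StringIO(
    "State,County,PCT_OBESE_ADULTS10\n"
    "AL,Autauga,30.5\n"
    "AL,Baldwin,26.6\n"
)
sample_df = load_dataset(sample_csv)
print(sample_df.shape)
```

With the real file, the call would look something like `df_data_1 = load_dataset("your_downloaded_file.csv")`, where the file name is a placeholder for wherever you saved the CSV.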

## Explore and Clean Dataset

In [None]:
# First let's see what columns we have in our dataset. A mapping of the column codes to descriptions
# is provided in the Dietrelateddisease_VariableMap.xlsx file.
disease_df.columns

In [None]:
# We can use pandas to look at the statistics of our dataset.
disease_df.describe()

In [None]:
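# To see general information, we can get some metrics for the entire dataset as follows.
# Only the last expression in a cell is displayed, so the metrics are combined into one table.
# std() is only defined for numeric data, so non-numeric columns such as State are left out.
disease_df.select_dtypes(include='number').agg(['max', 'min', 'std'])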

In [None]:
# Or we can get information on a specific column in the dataset.
disease_df['PCT_DIABETES_ADULTS10'].unique()

In [None]:
disease_df['FOODINSEC_10_12'].unique()
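
Since the cleaning step later in this notebook drops rows that contain zeros or NaNs, it may help to count how many of each a column holds before deciding what to keep. The helper below is a sketch under that assumption; `missing_and_zero_counts` is a hypothetical name, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def missing_and_zero_counts(df):
    """Count NaN values per column and zero values per numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_missing": df.isna().sum(),    # defined for every column
        "n_zero": numeric.eq(0).sum(),   # NaN for non-numeric columns
    })


# Made-up demo data standing in for disease_df.
demo = pd.DataFrame({
    "State": ["AL", "AK", "AZ"],
    "A": [0.0, 1.0, np.nan],
    "B": [0, 0, 2],
})
summary = missing_and_zero_counts(demo)
print(summary)
```

On the real data this would be called as `missing_and_zero_counts(disease_df)`.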

In [None]:
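# Pandas can show the pairwise correlation between the numeric columns in the dataset.
# Non-numeric columns such as State and County are excluded first, since corr() raises an error on them in recent pandas versions.
disease_df.select_dtypes(include='number').corr()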

In [None]:
# With over 1200 columns, reading the correlation values in table format can be hard.
# Let's use matplotlib to visualize this matrix.
plt.figure(figsize=(10,10))
plt.matshow(disease_df.select_dtypes(include='number').corr(), fignum=1)
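
A large correlation matrix is hard to read even as an image, so one option is to flatten it into a ranked list of variable pairs. The sketch below is an illustration rather than part of the original pattern: `top_correlations` is a hypothetical helper, and the demo frame uses made-up values.

```python
import numpy as np
import pandas as pd


def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    values = corr.to_numpy()
    # Indices of the upper triangle, excluding the diagonal, so each pair appears once.
    i, j = np.triu_indices(len(corr.columns), k=1)
    pairs = pd.DataFrame({
        "var_1": corr.columns[i],
        "var_2": corr.columns[j],
        "r": values[i, j],
    }).dropna()
    # Rank by strength of the relationship, regardless of sign.
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Made-up demo data: y is an exact multiple of x, z is not.
demo = pd.DataFrame({
    "label": ["a", "b", "c", "d", "e"],
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
strongest = top_correlations(demo, n=1)
print(strongest)
```

On the real data this would be called as `top_correlations(disease_df, n=20)`, with `n` chosen to taste.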


In [None]:
# Plot counts of a specified column using Pandas
disease_df.FOODINSEC_10_12.value_counts().plot(kind='barh')

In [None]:
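# Point plot example (factorplot was renamed catplot, and its size argument became height, in seaborn 0.9)
sns.catplot(x="PCT_SNAP09", y="PCT_OBESE_ADULTS10", data=disease_df, kind="point", height=3, aspect=2)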

## Clean Data

In [None]:
# Create a DataFrame of the columns that are most relevant to food insecurity
df_focusedvalues = disease_df[["State", "County","PCT_REDUCED_LUNCH10", "PCT_DIABETES_ADULTS10", "PCT_OBESE_ADULTS10", "FOODINSEC_10_12", "PCT_OBESE_CHILD11", "PCT_LACCESS_POP10", "PCT_LACCESS_CHILD10", "PCT_LACCESS_SENIORS10", "SNAP_PART_RATE10", "PCT_LOCLFARM07", "FMRKT13", "PCT_FMRKT_SNAP13", "PCT_FMRKT_WIC13", "FMRKT_FRVEG13", "PCT_FRMKT_FRVEG13", "PCT_FRMKT_ANMLPROD13", "FOODHUB12", "FARM_TO_SCHOOL", "SODATAX_STORES11", "State_y", "GROC12", "SNAPS12", "WICS12", "PCT_NHWHITE10", "PCT_NHBLACK10", "PCT_HISP10", "PCT_NHASIAN10", "PCT_65OLDER10", "PCT_18YOUNGER10", "POVRATE10", "CHILDPOVRATE10"]]

In [None]:
# Remove rows that contain a 0 or a NaN in any column
df_focusedvalues = df_focusedvalues[(df_focusedvalues != 0).all(axis=1)]
df_focusedvalues = df_focusedvalues.dropna(how='any')
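
To see what these two filters do, here is a small sketch on a made-up frame (the column names are borrowed from the dataset, the values are not). A row is dropped if any of its columns holds a 0; a NaN passes the zero check because `NaN != 0` is true, so it is the `dropna` call that removes it.

```python
import numpy as np
import pandas as pd

# Made-up values; the index labels stand in for counties.
demo = pd.DataFrame(
    {
        "FMRKT13": [3, 0, 2, 5],
        "POVRATE10": [12.5, 9.1, np.nan, 20.3],
    },
    index=["A", "B", "C", "D"],
)

# Keep only rows where every column is different from 0 (drops row B).
no_zeros = demo[(demo != 0).all(axis=1)]

# Then drop rows that contain a NaN in any column (drops row C).
cleaned = no_zeros.dropna(how="any")
print(cleaned)
```

Note that this is a strict filter: a county with a legitimate value of 0 in a single column, such as no farmers markets, is removed entirely.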

In [None]:
# Look at a heatmap of correlations within the DataFrame to see what we should visualize.
# Text columns such as State and County are excluded, so only the numeric columns are correlated.
corr = df_focusedvalues.select_dtypes(include='number').corr()
fig, ax = plt.subplots(figsize=(10,15))

sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            linewidths=.5, ax=ax)

We can immediately see that a fair number of strong correlations exist. Some of these include: the population aged 18 and younger and the Hispanic population; an inverse relationship between the Asian population and obesity; soda tax and the Hispanic population; the African American population and obesity as well as food insecurity; soda tax and obese minors; farmers markets and aid such as WIC and SNAP; and obese minors and reduced lunches, among others.

Let's try and plot some of these relationships with seaborn.

In [None]:
# Percent of the population that is white vs SNAP aid participation (positive relationship)
sns.regplot(x="PCT_NHWHITE10", y="SNAP_PART_RATE10", data=df_focusedvalues, robust=True, ci=95, color="seagreen")
sns.despine();

In [None]:
# SNAP aid participation vs percent of the population that is Hispanic (negative relationship)
sns.regplot(x="SNAP_PART_RATE10", y="PCT_HISP10", data=df_focusedvalues, robust=True, ci=95, color="seagreen")
sns.despine();

In [None]:
# Eligibility and use of reduced lunches in schools vs percent of the population that is Hispanic (positive relationship)
sns.regplot(x="PCT_REDUCED_LUNCH10", y="PCT_HISP10", data=df_focusedvalues, robust=True, ci=95, color="seagreen")
sns.despine();

In [None]:
# Percent of the population that is black vs percent of the population with diabetes (positive relationship)
sns.regplot(x="PCT_NHBLACK10", y="PCT_DIABETES_ADULTS10", data=df_focusedvalues, robust=True, ci=95, color="seagreen")
sns.despine();

In [None]:
# Percent of population with diabetes vs percent of population with obesity (positive relationship)
sns.regplot(x="PCT_DIABETES_ADULTS10", y="PCT_OBESE_ADULTS10", data=df_focusedvalues, robust=True, ci=95, color="seagreen")
sns.despine();

With these simple regression plots, we were able to glean some information from our data. In 2010, the percentage of the population that was non-Hispanic white was positively correlated with participation in the SNAP program, or food stamps, while the percentage that was Hispanic was not highly correlated with it in this time frame. This could be for a variety of reasons, including eligibility, reporting, varying policies, and use of the program. The next plots show that in 2010 the percentage of the population that was black was positively correlated with the diabetes rate, and that diabetes and obesity are highly correlated. These plots do not establish statistical significance, but they can help us understand and familiarize ourselves with the data.
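
To put a number on relationships like these, one option is the Pearson correlation coefficient that pandas computes with `Series.corr`. The sketch below is an illustration on made-up values, with a hypothetical helper name `pairwise_r`; it reports the strength of a linear relationship only and says nothing about significance, which would need a proper test (for example `scipy.stats.pearsonr`, which also returns a p-value).

```python
import pandas as pd


def pairwise_r(df, pairs):
    """Return the Pearson r for each (x, y) column pair as a labeled Series."""
    return pd.Series(
        {f"{x} vs {y}": df[x].corr(df[y]) for x, y in pairs},
        name="pearson_r",
    )


# Made-up values; the obesity column is an exact linear function of the diabetes column.
demo = pd.DataFrame({
    "PCT_DIABETES_ADULTS10": [8.0, 9.5, 11.0, 12.5, 14.0],
    "PCT_OBESE_ADULTS10": [24.0, 27.0, 30.0, 33.0, 36.0],
    "SNAP_PART_RATE10": [80.0, 60.0, 75.0, 55.0, 70.0],
})
r_values = pairwise_r(
    demo,
    [
        ("PCT_DIABETES_ADULTS10", "PCT_OBESE_ADULTS10"),
        ("PCT_DIABETES_ADULTS10", "SNAP_PART_RATE10"),
    ],
)
print(r_values)
```

On the real data the same helper would be called with `df_focusedvalues` and the column pairs plotted above.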

### Now, let's visualize with Pixie Dust.

Now that we've gained some initial insights, let's try out a different tool: Pixie Dust!

As you can see in the cells below, to activate Pixie Dust, we just install and import it and then write:

 ```display(your_dataframe_name)```
 
After doing this, your DataFrame will show up in a table of rows and columns. To visualize your data, you can click the chart icon at the top left (it looks like an arrow going up). From there you can choose from a variety of visuals. Once you select the type of chart you want, you can then select the variables you want to showcase. It's worth playing around with this to see how you can create the most effective visualizations for your audience. The cells below showcase a few options, such as scatterplots, bar charts, line charts, and histograms.

In [None]:
!pip install --user --upgrade pixiedust

In [None]:
import pixiedust

In [None]:
# Look at the DataFrame as a table. Pixie Dust shows this automatically, but to find it again you can click the table icon.
# Just to give an example of what you can do with the data, I've created a pie chart of the percent of food hubs in the country by state.
display(df_focusedvalues)

In [None]:
#using seaborn in Pixie Dust to look at Food Insecurity and the Percent of the population that is black in a scatter plot
display(df_focusedvalues)

In [None]:
#using matplotlib in Pixie Dust to view Food Insecurity by state in a bar chart
display(df_focusedvalues)

In [None]:
#using bokeh in Pixie Dust to view the percent of the population that is black vs the percent of the population that is obese in a line chart
display(df_focusedvalues)

In [None]:
#using seaborn in Pixie Dust to view obesity vs diabetes in a scatterplot
display(df_focusedvalues)

In [None]:
#using matplotlib in Pixie Dust to view childhood obesity vs reduced school lunches in a scatterplot
display(df_focusedvalues)