
# Project Title: Chronic Disease in the U.S.

Group Member 1: Kevin Wood,
kevin.wood@utah.edu,
u0658811

Group Member 2: Jassy Ju,
jassy.ju@utah.edu,
u0718404

## Background and Motivation

Health data is widely available and describes health conditions across many different populations. As biomedical informatics students, we are both interested in measuring, comparing, and depicting health datasets.
For this project, we will focus on exploring the U.S. Chronic Disease Indicators dataset.

## Project Objectives

The primary objective of this project is a geospatial analysis of the U.S. Chronic Disease Indicators dataset, focusing on the diabetes, alcohol, tobacco, and nutrition, physical activity, and weight status topics. Beyond comparing these topics state by state, we plan to look for connections between the health conditions and information scraped from the web or drawn from another public dataset.


## Data
We will use the U.S. Chronic Disease Indicators (CDI) dataset, a CSV file available from http://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi

## Ethical Consideration

The information we extract from the publicly available datasets will be valuable to anyone who wants to know more about chronic disease indicators across the states. Depending on the audience, what we display could be simply informative or could be used for other research purposes.

## Data Processing

We plan to use Python and its supporting libraries and packages to clean and further process the data we import from the CDI dataset, while also scraping other supporting data sources when necessary. Below, we read the dataset into a pandas DataFrame to give better context to our proposal.

In [1]:
import pandas as pd
data = pd.read_csv('US_CDI.csv')


In [5]:
data.dtypes

YearStart                      int64
YearEnd                        int64
LocationAbbr                  object
LocationDesc                  object
DataSource                    object
Topic                         object
Question                      object
Response                      object
DataValueUnit                 object
DataValueType                 object
DataValue                     object
DataValueAlt                 float64
DataValueFootnoteSymbol       object
DatavalueFootnote             object
LowConfidenceLimit           float64
HighConfidenceLimit          float64
StratificationCategory1       object
Stratification1               object
StratificationCategory2       object
Stratification2               object
StratificationCategory3       object
Stratification3               object
GeoLocation                   object
ResponseID                    object
LocationID                     int64
TopicID                       object
QuestionID                    object
D

In [7]:
data.describe()

| | YearStart | YearEnd | DataValueAlt | LowConfidenceLimit | HighConfidenceLimit | LocationID |
|---|---|---|---|---|---|---|
| count | 519718.0 | 519718.0 | 350335.0 | 311062.0 | 311062.0 | 519718.0 |
| mean | 2013.141885 | 2013.162754 | 891.7742 | 46.759401 | 58.991287 | 30.993144 |
| std | 1.777622 | 1.762672 | 18330.73 | 77.492628 | 88.668862 | 17.723341 |
| min | 2001.0 | 2001.0 | 0.0 | 0.2 | 0.42 | 1.0 |
| 25% | 2012.0 | 2012.0 | 18.455 | 12.7 | 18.9 | 17.0 |
| 50% | 2013.0 | 2013.0 | 41.0 | 30.2 | 43.8 | 30.0 |
| 75% | 2015.0 | 2015.0 | 70.3 | 55.4 | 70.4 | 45.0 |
| max | 2016.0 | 2016.0 | 2600878.0 | 1330.66 | 2088.0 | 78.0 |


In [15]:
data.DataValue.isnull().groupby(data['Topic']).sum().astype(int).reset_index(name='NullCNT')

| | Topic | NullCNT |
|---|---|---|
| 0 | Alcohol | 8968 |
| 1 | Arthritis | 12844 |
| 2 | Asthma | 14325 |
| 3 | Cancer | 6033 |
| 4 | Cardiovascular Disease | 14604 |
| 5 | Chronic Kidney Disease | 1997 |
| 6 | Chronic Obstructive Pulmonary Disease | 25062 |
| 7 | Diabetes | 30127 |
| 8 | Disability | 600 |
| 9 | Immunization | 251 |


In [16]:
data.groupby('Topic').mean(numeric_only=True)

| Topic | YearStart | YearEnd | DataValueAlt | LowConfidenceLimit | HighConfidenceLimit | LocationID |
|---|---|---|---|---|---|---|
| Alcohol | 2013.20157 | 2013.20157 | 43.326631 | 8.492928 | 12.442843 | 31.055604 |
| Arthritis | 2013.374716 | 2013.374716 | 36.630016 | 30.157477 | 43.597251 | 31.545816 |
| Asthma | 2013.174142 | 2013.174142 | 368.774039 | 31.374737 | 43.624873 | 31.11477 |
| Cancer | 2013.353663 | 2014.061698 | 881.422496 | 67.616579 | 77.122929 | 31.126941 |
| Cardiovascular Disease | 2012.658477 | 2012.658477 | 1327.05714 | 66.981051 | 81.234672 | 30.31541 |
| Chronic Kidney Disease | 2012.631706 | 2012.631706 | 720.707002 | 33.174245 | 41.138492 | 30.132876 |
| Chronic Obstructive Pulmonary Disease | 2012.985101 | 2012.985101 | 3074.420181 | 63.857687 | 78.264549 | 30.717639 |
| Diabetes | 2013.191446 | 2013.192827 | 838.749862 | 46.305193 | 60.812661 | 31.140234 |
| Disability | 2013.0 | 2013.0 | 37.01723 | 34.008404 | 40.024029 | 30.339623 |
| Immunization | 2013.5 | 2013.5 | 37.324592 | 32.42004 | 42.846143 | 31.542529 |


As you can see, the dataset isn't perfect: some data types are not entirely standardized (DataValue, for example, is stored as an object column rather than a numeric one). In light of this, we plan to extract subsets of the data by topic so that the data values correspond appropriately to the analysis we do on the dataset as a whole. We will also need to account for missingness within the dataset, and to limit its impact we have chosen to focus on topics with higher fill rates (e.g., alcohol and nutrition). We'll have to try out different methods, but we've initially discussed imputing null values with the mean, or with similar values for a given state where pertinent.
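
The topic subsetting and mean imputation discussed above might look roughly like the sketch below. The toy frame stands in for the CDI data and its values are invented purely for illustration; the helper name `impute_by_state` is hypothetical, and filling with the state mean is only one of the options we are considering.

```python
import pandas as pd

# Toy stand-in for the CDI data; values are invented for illustration only.
toy = pd.DataFrame({
    "Topic": ["Alcohol", "Alcohol", "Alcohol", "Alcohol", "Diabetes"],
    "LocationAbbr": ["UT", "UT", "CO", "CO", "UT"],
    "DataValueAlt": [10.0, None, 20.0, 30.0, 5.0],
})

def impute_by_state(df, topic, value_col="DataValueAlt"):
    """Return one topic's rows with nulls filled by that state's mean.

    Rows whose state has no values at all fall back to the mean of the
    remaining values in the topic subset.
    """
    subset = df[df["Topic"] == topic].copy()
    # Per-row mean of the row's own state, aligned to the subset's index.
    state_mean = subset.groupby("LocationAbbr")[value_col].transform("mean")
    subset[value_col] = subset[value_col].fillna(state_mean)
    subset[value_col] = subset[value_col].fillna(subset[value_col].mean())
    return subset

alcohol = impute_by_state(toy, "Alcohol")
print(alcohol)
```

Working on a copy of each topic subset keeps the original frame untouched, so different imputation strategies can be compared side by side.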

We also plan to scrape supplementary data from the web, for example this page on fast food restaurants per capita: https://www.thrillist.com/news/nation/states-with-most-fast-food-restaurants-datafiniti. We have also found data on the web regarding alcohol prevalence, which we would process using Beautiful Soup.
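
As a rough idea of the Beautiful Soup step, the sketch below parses a small HTML table into a list of records. The HTML fragment, its column names, and its numbers are placeholders we made up; the real pages may be laid out differently, and the helper name `parse_table` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the real page layout may differ and the
# numbers here are placeholders, not scraped values.
html = """
<table>
  <tr><th>State</th><th>Restaurants per 10k</th></tr>
  <tr><td>Utah</td><td>1.5</td></tr>
  <tr><td>Colorado</td><td>2.5</td></tr>
</table>
"""

def parse_table(markup):
    """Turn the first HTML table into a list of dicts keyed by header text."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find("table").find_all("tr")
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records

records = parse_table(html)
print(records)
```

In practice the markup would come from a downloaded page rather than an inline string, and the cell values would still need converting from text to numbers before joining them to the CDI data.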

The columns we expect to be most valuable for our purposes are topic, year start and end, question, state, and the data value itself. These columns from the CDI data, along with the supplementary data mentioned above, will be the focus of our analysis.
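
A minimal sketch of narrowing the frame to these columns and converting the data value to a numeric type is shown below. The toy rows are invented, and using `LocationAbbr` as the state column is our assumption (`LocationDesc` would work equally well).

```python
import pandas as pd

# Columns of interest, named as they appear in data.dtypes.
KEEP = ["Topic", "YearStart", "YearEnd", "Question", "LocationAbbr", "DataValue"]

# Toy rows with invented values; DataValue is text, as in the real file.
toy = pd.DataFrame({
    "Topic": ["Alcohol", "Alcohol", "Diabetes"],
    "YearStart": [2013, 2014, 2013],
    "YearEnd": [2013, 2014, 2013],
    "Question": ["q1", "q1", "q2"],
    "LocationAbbr": ["UT", "UT", "CO"],
    "DataValue": ["12.5", "No", "7"],
    "DataSource": ["example", "example", "example"],
})

core = toy[KEEP].copy()
# Non-numeric entries become NaN instead of raising an error.
core["DataValue"] = pd.to_numeric(core["DataValue"], errors="coerce")
print(core.dtypes)
```

Coercing rather than raising means unparseable entries show up as nulls, which folds them into the same missingness handling as the values that were empty to begin with.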

## Exploratory Analysis

Our project will rely on map visualizations, and we plan to implement a geospatial plot as well as other standard plots and charts to better describe the dataset. We'd like to start with some simple linear regressions on the data and show those regressions with the ggplot library. It may also be useful to try boxplots and to explore map-based visualization options such as those shown here: https://python-graph-gallery.com/.

## Analysis Methodology


Analysis of the CDI data will start with data profiling using basic pandas methods and then expand into progressively more complex analysis. We hope to tell the story of our data as it unfolds to us throughout our analysis, beginning with bar charts by topic and state and then bringing in other datasets to explore predictive variables that are outside the scope of the CDI data. We plan to start with linear regressions on different combinations of variables, then try a more complex method that we might not have covered in class. Ultimately, we'd like to keep the number of variables small while expanding the data frame with new variables for exploratory purposes.
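
To make the first regression step concrete, here is a minimal sketch of a single-variable least-squares fit using NumPy. The paired values are made up and stand in for one predictor and one outcome per state; which variables we actually pair up is still to be decided.

```python
import numpy as np

# Made-up paired values standing in for a predictor and an outcome per state.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 polynomial fit is an ordinary least-squares line.
slope, intercept = np.polyfit(x, y, 1)

# Coefficient of determination for the fitted line.
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)
```

Repeating this fit over different variable pairs gives a quick way to rank candidate predictors before moving to a more complex method.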

## Project Schedule


Our tentative project plan is shown below. We also plan to create a Trello board for task assignment and tracking while communicating via the class Slack and text.

###### Week of 2/25 - 3/3:
    -Procure initial dataset(s)
    -Set up project GitHub repository
    -Submit Project Proposal and Google Form

###### Week of 3/4-3/10:
    -Clean CDI dataset
    -Look for other supplementary data for cross-examination and add to project repo
    -Assess missingness
    -Experiment with and decide on way(s) to account for missingness/bad data
    -Decide on 8-10 hypotheses


###### Week of 3/11-3/17 (Spring Break):
    -Explore dataset
    -Determine best linear regression variables
    -Test 4 hypotheses

###### Week of 3/18-3/24:
    -Test 4-5 more hypotheses
    -Attempt 2-3 classification methods
    -Create successful map visualization
    -Staff Review 1

###### Week of 3/25-3/31:
    -Add map visualization variations
    -Add supporting descriptive statistics as needed
    -Staff Review 2
    -Submit Milestone 1

###### Week of 4/1-4/7:
    -Test 3 newly learned analysis methods
    -Test 3 new variables or hypotheses

###### Week of 4/8-4/14:
    -Refine aesthetics, especially visualizations
    -Edit and expound on explanations
    -Final staff review

###### Week of 4/15-4/21:
    -Final edits and implement feedback from staff review
    -Submit Final Project


Daniel Wells and Amy Record 

Diabetes
type 2 or 1?


