# Project Design Writeup - CHSI Cancer Research

Project Problem and Hypothesis

What's the project about? What problem are you solving?

This project uses county-level data from CHSI (the CDC's Community Health Status Indicators) on women aged 25-44 with cancer from various ethnic backgrounds. The dataset could help find correlations between these groups of women, their lifestyle habits, and their access to healthcare (private physicians and community health centers).




Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?

This is a binary classification problem. The binary value to predict is access to a community health center (column Community_Health_Center_Ind), coded 1 = No and 2 = Yes.
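
Most classifiers expect a 0/1 target rather than 1/2. A minimal sketch of the recoding, assuming the 1 = No / 2 = Yes coding stated above; the helper name recode_indicator and the new column name has_chc are hypothetical, and the demo rows are toy values rather than the full dataset:

```python
import pandas as pd


def recode_indicator(df, column="Community_Health_Center_Ind"):
    """Return a copy of df with a 0/1 target derived from the 1/2 indicator.

    Assumes 1 = No and 2 = Yes, as stated in this writeup.
    """
    out = df.copy()
    # map() turns any unexpected code into NaN instead of silently mislabeling it
    out["has_chc"] = out[column].map({1: 0, 2: 1})
    return out


# Toy rows for illustration only
demo = pd.DataFrame({"Community_Health_Center_Ind": [1, 1, 1, 1, 2]})
recoded = recode_indicator(demo)
print(recoded)
```

Keeping the original column alongside the recoded one makes it easy to double-check the mapping against the data dictionary later.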



What kind of impact do you think it could have?
What do you think will have the most impact in predicting the value you are interested in solving for?

I was recently assigned to a new client, MD Anderson Research Center in Houston, and I believe that understanding the relationship between lifestyle and access to health centers can inform the user experience design of its website. Using this community health data released by the CDC, I hope to predict access to community health centers from lifestyle factors and cancer indicators for women aged 25-44.

Dataset

In [10]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression  # the target is binary, so use a classifier rather than LinearRegression
import scipy.stats


def read_csv_file(path):
    """Load the CSV file at the given path into a pandas DataFrame."""
    df = pd.read_csv(path)
    return df


path = "./mdanderson.csv"

In [11]:
table = read_csv_file(path)

In [12]:
table.head()

Unnamed: 0,Strata_ID_Number,D_Wh_Cancer,D_Bl_Cancer,D_Hi_Cancer,No_Exercise,Few_Fruit_Veg,Obesity,Smoker,Uninsured,Prim_Care_Phys_Rate,Community_Health_Center_Ind
0,29,-1111,18,-1111,27.8,78.6,24.5,26.6,5690,45.3,1
1,16,11,-1111,-1111,27.2,76.2,23.6,24.6,19798,67.0,1
2,51,21,-1111,-1111,-1111.1,-1111.1,25.6,17.7,5126,45.8,1
3,42,16,21,-1111,-1111.1,86.6,-1111.1,-1111.1,3315,41.8,1
4,28,16,-1111,-1111,33.5,74.6,24.2,23.6,8131,16.2,2


In [13]:
table.describe()

Unnamed: 0,Strata_ID_Number,D_Wh_Cancer,D_Bl_Cancer,D_Hi_Cancer,No_Exercise,Few_Fruit_Veg,Obesity,Smoker,Uninsured,Prim_Care_Phys_Rate,Community_Health_Center_Ind
count,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0,3141.0
mean,44.696275,-277.122891,-719.173193,-1083.198981,-312.130213,-389.736071,-307.284241,-292.487902,12644.807386,57.562464,1.55078
std,25.118434,514.396503,554.457213,224.538711,520.268834,581.547655,516.246132,508.390227,54604.778511,44.79377,0.497494
min,1.0,-2222.0,-2222.0,-2222.0,-1111.1,-1111.1,-1111.1,-1111.1,-2222.0,0.0,1.0
25%,23.0,-1111.0,-1111.0,-1111.0,-1111.1,-1111.1,-1111.1,-1111.1,1549.0,30.5,1.0
50%,44.0,13.0,-1111.0,-1111.0,22.6,74.0,21.7,20.3,3426.0,50.6,2.0
75%,66.0,17.0,12.0,-1111.0,28.6,80.2,25.8,25.1,8116.0,74.7,2.0
max,88.0,30.0,26.0,23.0,52.4,96.4,42.6,46.2,2167891.0,581.2,2.0
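
The negative means and the minimums of -1111.1 and -2222 in the summary above suggest that these values are placeholder codes for missing data rather than real measurements, since rates and percentages cannot be negative. That reading is an assumption and should be confirmed against the CHSI data dictionary. A minimal sketch of replacing the codes with NaN before summarizing; SENTINELS and mask_sentinels are hypothetical names, and the demo rows are copied from the table.head() output:

```python
import numpy as np
import pandas as pd

# Assumed missing-data codes: only the ones visible in the outputs above
SENTINELS = [-1111, -1111.1, -2222]


def mask_sentinels(df, sentinels=SENTINELS):
    """Return a copy of df with the assumed sentinel codes replaced by NaN."""
    return df.replace(sentinels, np.nan)


# Two columns from the first five rows shown by table.head()
demo = pd.DataFrame({
    "D_Wh_Cancer": [-1111, 11, 21, 16, 16],
    "No_Exercise": [27.8, 27.2, -1111.1, -1111.1, 33.5],
})

clean = mask_sentinels(demo)
# describe() skips NaN, so the statistics now reflect reported values only
print(clean.describe())
```

On the real data this would be called as mask_sentinels(table). Columns with many missing values, such as D_Hi_Cancer, may end up with too few reported counties to be useful.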


In [14]:
table.corr()

Unnamed: 0,Strata_ID_Number,D_Wh_Cancer,D_Bl_Cancer,D_Hi_Cancer,No_Exercise,Few_Fruit_Veg,Obesity,Smoker,Uninsured,Prim_Care_Phys_Rate,Community_Health_Center_Ind
Strata_ID_Number,1.0,-0.196836,-0.259463,-0.233445,-0.437122,-0.438766,-0.418139,-0.404309,-0.291441,-0.367165,0.248034
D_Wh_Cancer,-0.196836,1.0,0.213077,0.242768,0.099152,0.099045,0.085683,0.082421,0.077463,0.064177,-0.057586
D_Bl_Cancer,-0.259463,0.213077,1.0,0.204589,0.11639,0.147047,0.111868,0.128172,0.191398,0.098563,-0.269681
D_Hi_Cancer,-0.233445,0.242768,0.204589,1.0,0.08651,0.121141,0.085313,0.086363,0.438443,0.164232,-0.161612
No_Exercise,-0.437122,0.099152,0.11639,0.08651,1.0,0.716274,0.830915,0.801036,0.124494,0.249496,-0.092165
Few_Fruit_Veg,-0.438766,0.099045,0.147047,0.121141,0.716274,1.0,0.699262,0.67653,0.147863,0.240161,-0.123075
Obesity,-0.418139,0.085683,0.111868,0.085313,0.830915,0.699262,1.0,0.78726,0.12194,0.238474,-0.080363
Smoker,-0.404309,0.082421,0.128172,0.086363,0.801036,0.67653,0.78726,1.0,0.118488,0.236696,-0.093855
Uninsured,-0.291441,0.077463,0.191398,0.438443,0.124494,0.147863,0.12194,0.118488,1.0,0.170448,-0.169715
Prim_Care_Phys_Rate,-0.367165,0.064177,0.098563,0.164232,0.249496,0.240161,0.238474,0.236696,0.170448,1.0,-0.111153
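
The strong correlations among No_Exercise, Few_Fruit_Veg, Obesity, and Smoker (roughly 0.68 to 0.83) may partly reflect counties that carry the same placeholder value (-1111.1) in several of these columns, rather than a real relationship. The sketch below illustrates the effect on the five rows shown by table.head(), assuming -1111.1 means "missing". Five rows are far too few to draw conclusions from; the point is only how much a shared placeholder can move the number.

```python
import numpy as np
import pandas as pd

# Rows copied from the table.head() output; -1111.1 is assumed to mean "missing"
demo = pd.DataFrame({
    "No_Exercise": [27.8, 27.2, -1111.1, -1111.1, 33.5],
    "Obesity": [24.5, 23.6, 25.6, -1111.1, 24.2],
    "Smoker": [26.6, 24.6, 17.7, -1111.1, 23.6],
})

# Correlation with the placeholder treated as a real measurement
raw_corr = demo.corr()

# Mask the assumed placeholder; corr() then ignores missing pairs by default
clean_corr = demo.replace(-1111.1, np.nan).corr()

print(raw_corr.round(2))
print(clean_corr.round(2))
```

Rerunning table.corr() on the full dataset after masking would show which of the relationships above hold up among counties that actually reported values.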


Domain knowledge

What experience do you already have around this area?

This will be the first project (client) I work on related to cancer research and community health. However, the goal is to understand how the results from the CDC community health indicators can inform how we iterate on and explore content and site conversion strategy.


Project Concerns

What questions do you have about your project? What are you not sure you quite yet understand? (The more honest you are about this, the easier your instructors can help).

The values are a bit confusing, even with the data descriptions. I am wondering whether I should transform the integers to actual percentages so the plots are easier to understand.


What are the assumptions and caveats to the problem?

My assumption is that the CDC gathered the data during the same time frame for all counties in the dataset.



Outcomes


How complicated does your model have to be?

Simplicity is key: this model will mostly guide strategy and design. The goal is to distill the model down to actionable recommendations for landing page test ideas.
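
One simple, interpretable option for a binary target is logistic regression, where each feature gets a single coefficient that can be read as a rough direction of influence. The sketch below is not a result from the CHSI data: it uses synthetic values, reuses two column names only for readability, and generates labels from a made-up rule so that there is something to fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: NOT the CHSI values, only the column names are reused
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Uninsured": rng.normal(0, 1, n),
    "Prim_Care_Phys_Rate": rng.normal(0, 1, n),
})

# Hypothetical rule used only to generate 0/1 labels for the demo
y = (X["Uninsured"] + rng.normal(0, 0.5, n) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# One coefficient per feature; sign and size give a first, rough read on influence
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

With the real data, the features would need missing values handled first, and coefficients are easier to compare when the features are on a similar scale.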

How successful does your project have to be in order to be considered a "success"?

Success means being able to predict varying behaviors and access to healthcare for different groups of women who have cancer, because this will help identify opportunities to drive the most effective awareness campaigns.
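
One way to make "success" measurable is to compare any model against a majority-class baseline. In the table.describe() output, Community_Health_Center_Ind has a minimum of 1, a maximum of 2, and a mean of 1.55078 over 3141 rows, so the share of rows coded 2 can be recovered from the mean. The arithmetic below is a sketch that assumes every row holds either 1 or 2 and uses the 1 = No / 2 = Yes coding from this writeup:

```python
# Values copied from the table.describe() output for Community_Health_Center_Ind
n_rows = 3141
indicator_mean = 1.55078

# With only the codes 1 and 2 present, mean - 1 is the share of rows coded 2
share_yes = indicator_mean - 1
n_yes = round(share_yes * n_rows)
n_no = n_rows - n_yes

# Accuracy of a "model" that always predicts the more common class
baseline_accuracy = max(n_yes, n_no) / n_rows
print(n_yes, n_no, round(baseline_accuracy, 3))
```

A model that always predicts the more common class is already right for roughly 55% of counties, so a useful classifier needs to beat that figure on held-out data.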

What will you do if the project is a bust (this happens! but it shouldn't here)?

There are some other features I can try if I find no relationship between cancer patients' lifestyle habits and their access to healthcare.