---
title: Final Project Results, Model Creation Document 
author: Sophie Seiple, Julia Joy, Lindsey Schweitzer
date: '2024-01-01'
description: "Process of model creation."
bibliography: refs.bib
format: html
---


In [64]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

### Reading in our Data

In [65]:
conditions_diabetes = pd.read_csv('conditions_diabetes.csv')
conditions_pregnancy = pd.read_csv('conditions_pregnancy.csv')
conditions_cancer = pd.read_csv('conditions_cancer.csv')
conditions_heart = pd.read_csv('conditions_heart.csv')
conditions_lungs = pd.read_csv('conditions_lungs.csv')

observations = pd.read_csv('observations_pivot.csv')
patients = pd.read_csv('patient_clean.csv')

Note: All of our datasets are grouped by related diseases (for example, diabetes and comorbidities such as diabetic retinopathy). For the rest of the post, when we say "diabetes" or "pregnancy complications," we mean diabetes and all present comorbidities, or a grouping of pregnancy complications such as pre/ante eclampsia and miscarriage.

# Diabetes Modeling & Analysis

### 1. Prepping Data

To prep our data for modeling, we label-encoded each of the qualitative variables (keeping track of the codes so we could decode them later). We wrapped this in a function so we could reuse it for each condition.

In [66]:
le = LabelEncoder()

# our data-prepping function for modeling
def prep_data(df):
    
    # label encode all qualitative vars
    df["race"] = le.fit_transform(df["race"]) 
    race_code = {code: race for code, race in enumerate(le.classes_)}

    df["ethnicity"] = le.fit_transform(df["ethnicity"])
    eth_code = {code: ethnicity for code, ethnicity in enumerate(le.classes_)}

    df["gender"] = le.fit_transform(df["gender"])
    gen_code = {code: gender for code, gender in enumerate(le.classes_)}

    df["birthplace"] = le.fit_transform(df["birthplace"])
    bp_code = {code: bp for code, bp in enumerate(le.classes_)}

    df["curr_town"] = le.fit_transform(df["curr_town"]) 
    curr_code = {code: bp for code, bp in enumerate(le.classes_)}
    
    # split data into test and train
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    
    X_train = train.drop(columns=['y'])
    y_train = train['y']
    
    X_test = test.drop(columns=['y'])
    y_test = test['y']
    
    # return split x, y, and all of the code tracking dicts
    return X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code

In [67]:
np.random.seed(300)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(conditions_diabetes)
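
As a small illustration of how the code-tracking dictionaries can be used later, the sketch below encodes a toy column and decodes it again. The toy data and the names `toy`, `encoder`, and `race_lookup` are assumptions for illustration and are not part of our pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data standing in for one qualitative column (illustrative only)
toy = pd.DataFrame({"race": ["white", "asian", "black", "white"]})

encoder = LabelEncoder()
toy["race_encoded"] = encoder.fit_transform(toy["race"])

# same construction as in prep_data: integer code -> original label
race_lookup = {code: label for code, label in enumerate(encoder.classes_)}

# decode the integer codes back into readable labels
toy["race_decoded"] = toy["race_encoded"].map(race_lookup)
```

Because `prep_data` reuses one encoder for every column, each dictionary has to be built right after the corresponding `fit_transform` call, as it is above.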

### 2. Finding optimal model

Next, we created a function we could reuse that identifies the best performing model on our data from the options random forest, SVC, logistic regression, and decision trees. The best model is what we use to predict the probability that each person has a certain disease (for our purposes, their risk score).

In [68]:
# our model-finding function
def train_model(X_train, y_train):
    
    #LogisticRegression
    LR = LogisticRegression(max_iter=10000000000000000000)
    LRScore = cross_val_score(LR, X_train, y_train, cv=5).mean()

    #DecisionTreeClassifier
    param_grid = { 'max_depth': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None ]}

    tree = DecisionTreeClassifier()
    grid_search = GridSearchCV(tree, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    DTCScore  = grid_search.best_score_
    bestDTCDepth = grid_search.best_params_


    # Random Forest Classifier
    forrest = RandomForestClassifier(random_state=0)
    grid_search = GridSearchCV(forrest, param_grid, cv=5)
    grid_search.fit(X_train, y_train)

    RFCScore  = grid_search.best_score_
    bestRFCDepth = grid_search.best_params_

    #SVC
    SVM = SVC()

    # use grid search to find best gamma for SVM
    g = {'gamma': 10.0 ** np.arange(-5, 5) }
    grid_search = GridSearchCV(SVM, g, cv=5)
    grid_search.fit(X_train, y_train)

    SVMScore  = grid_search.best_score_   


    print("best LR :", LRScore)
    print("best DTC:", DTCScore)
    print("best max depth: ", bestDTCDepth)
    print("best RFC: ", RFCScore)
    print("best max depth: ", bestRFCDepth)
    print("best SVM: ", SVMScore)

    # pick the model with the highest cross-validated score
    max_score = 0
    max_model = ""
    if LRScore > max_score:
        max_score = LRScore
        max_model = "LR"
    if DTCScore > max_score:
        max_score = DTCScore
        max_model = "DTC"
    if RFCScore > max_score:
        max_score = RFCScore
        max_model = "RFC"
    if SVMScore > max_score:
        max_score = SVMScore
        max_model = "SVM"

    print("best score overall is: ", max_score, " with model: ", max_model)

In [69]:
    
# run model finding function on our diabetes data
np.random.seed(500)
train_model(X_train, y_train)

best LR : 0.9050401672719269
best DTC: 0.9178790213124979
best max depth:  {'max_depth': 3}
best RFC:  0.9153112505043837
best max depth:  {'max_depth': 5}
best SVM:  0.9016066908770772
best score overall is:  0.9178790213124979  with model:  DTC


The results of our function show that the decision tree classifier is the best of the four models, with a cross-validated accuracy of about 91.8%. Our accuracies are generally on the lower side given the limited information we allowed the model to have: we wanted to see what the model would do when predicting from identity factors such as race, ethnicity, and birthplace, rather than from information on the specific procedures and allergies a patient had.
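
The chain of `if` statements in `train_model` can also be written as a dictionary of scores plus `max`. The sketch below is one possible shape for that, run on synthetic data from `make_classification`; the names `candidates`, `scores`, and `best_name` are assumptions, and only two of the four models are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: not our patient data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "DTC": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# mean 5-fold cross-validated accuracy for each candidate
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

# name of the candidate with the highest score
best_name = max(scores, key=scores.get)
```

Returning `best_name` and `scores` from the function, rather than only printing them, would let later cells pick up the winning model without copying it by hand.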

### 3. Create Risk Scores

Predict probabilities for all our entries using the best model we found.

In [70]:
dtc = DecisionTreeClassifier(max_depth=3)
dtc.fit(X_train, y_train)
pred_prob = dtc.predict_proba(X_test)

For ease we created a risk finding function that can be used across factors and disease probabilities.

In [71]:
def find_risk(code, col, probs):
    # finds the corresponding subset of our probability data
    indices = (X_test[col] == code)
    prob_subset = probs[indices]
    # finds the average of this subset
    av_prob = np.mean(prob_subset[:, 1]) 
    return av_prob   
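
`find_risk` reads `X_test` from the notebook's global scope, so it uses whichever condition was prepped most recently. A minimal sketch of a variant that takes the feature table as an argument is below; the name `find_risk_explicit`, the `positive_index` parameter, and the toy inputs are assumptions for illustration. In scikit-learn, the columns of `predict_proba` follow the order of the model's `classes_` attribute, so column 1 is the positive class only when the labels are 0 and 1.

```python
import numpy as np
import pandas as pd

def find_risk_explicit(features, col, code, probs, positive_index=1):
    """Average positive-class probability for rows where features[col] == code."""
    mask = (features[col] == code).to_numpy()
    if not mask.any():
        return float("nan")  # no test rows fall in this group
    return float(np.mean(probs[mask, positive_index]))

# toy inputs (illustrative only)
toy_X = pd.DataFrame({"gender": [0, 0, 1, 1]})
toy_probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.9, 0.1], [0.7, 0.3]])

risk_f = find_risk_explicit(toy_X, "gender", 0, toy_probs)
risk_m = find_risk_explicit(toy_X, "gender", 1, toy_probs)
risk_missing = find_risk_explicit(toy_X, "gender", 5, toy_probs)
```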

### 4. Compare Across Race, Gender, Ethnicity
Next, we find the average risk score for different demographic characteristics: Race, Gender, and Ethnicity.

#### Race

In [72]:
diabetesRaceRisk = []

# find the average risk for each race, looked up by its label-encoder code
for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    diabetesRaceRisk.append(newRow)

# print summary table
diabetesRaceRisk = pd.DataFrame(diabetesRaceRisk)
diabetesRaceRisk = diabetesRaceRisk.sort_values(by='risk', ascending=False)
diabetesRaceRisk

|   | race | risk |
|---|------|------|
| 0 | asian | 0.479592 |
| 2 | hispanic | 0.340659 |
| 3 | white | 0.312536 |
| 1 | black | 0.256158 |


Our model tells us that the group most susceptible to diabetes is Asian, followed by Hispanic and White, with Black the least susceptible. These results suggest that there may be a difference by race, and made us think about how we could explore demographic information about Massachusetts (where our data is "from") to understand whether these trends reflect larger ones.

#### Gender

In [73]:
diabetesGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    diabetesGenderRisk.append(newRow)

diabetesGenderRisk = pd.DataFrame(diabetesGenderRisk)
diabetesGenderRisk = diabetesGenderRisk.sort_values(by='risk', ascending=False)
diabetesGenderRisk

|   | gender | risk |
|---|--------|------|
| 0 | F | 0.375356 |
| 1 | M | 0.263908 |


Our model tells us that women are slightly more likely to experience diabetes (or comorbidities) than men, which is in line with medical research we've seen.

#### Ethnicity

In [74]:
av_risk_eth = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    av_risk_eth.append(new_row)

av_risk_eth_df = pd.DataFrame(av_risk_eth)
av_risk_eth_df = av_risk_eth_df.sort_values(by='risk', ascending=False)
av_risk_eth_df


|    | eth | risk |
|----|-----|------|
| 2 | asian_indian | 0.714286 |
| 13 | polish | 0.558405 |
| 9 | german | 0.492674 |
| 12 | mexican | 0.428571 |
| 1 | american | 0.428571 |
| 14 | portuguese | 0.397959 |
| 6 | english | 0.369491 |
| 17 | scottish | 0.333333 |
| 15 | puerto_rican | 0.32967 |
| 5 | dominican | 0.328571 |


This table gives us a lot of information about risk by ethnicity. Most interestingly, it agrees with our race finding that Asian patients are more likely to experience diabetes, in that our most at-risk ethnicity was Asian Indian. However, Chinese and West Indian, the two other Asian ethnicities in the dataset, are at the bottom of the risk hierarchy, which made us consider that the risk for Asian Indian patients specifically was what was driving our race findings.
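
When reading a per-ethnicity average, it can help to see how many test patients sit behind each number, since an average over very few patients can swing widely. The sketch below shows one way to report the mean and the group size together using `groupby`; the toy values and the names `toy_X`, `toy_risk`, and `summary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy encoded ethnicity column and positive-class probabilities (illustrative only)
toy_X = pd.DataFrame({"ethnicity": [0, 0, 0, 1, 2, 2]})
toy_risk = np.array([0.30, 0.35, 0.25, 0.70, 0.20, 0.40])

# mean risk and number of patients per ethnicity code, highest risk first
summary = (
    toy_X.assign(risk=toy_risk)
    .groupby("ethnicity")["risk"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
```

In this toy example the highest-risk group contains a single patient, which is the kind of case the `count` column is meant to flag.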

### 5. Compare Across Wealthier & Poorer Towns of Residence/Birthplace

In order to compare outcomes across towns of varying socioeconomic status, we compiled a list of the richest and poorest towns present in our dataset (using Census data).

In [75]:
# richest towns in Mass
richTowns = ["Dover", "Weston", "Wellesley", "Lexington", "Sherborn", "Cohasset", "Lincoln", "Carlisle", "Hingham", "Winchester", 
                "Medfield", "Concord", "Needham", "Sudbury", "Hopkinton", "Boxford", "Brookline", "Andover",  
                  "Southborough", "Belmont", "Acton", "Marblehead", "Newton", "Nantucket", "Duxbury", "Boxborough", "Westwood","Natick", 
                  "Longmeadow", "Marion", "Groton", "Newbury", "North Andover", "Sharon", "Arlington", "Norwell", "Reading", 
                  "Lynnfield", "Marshfield", "Holliston", "Medway", "Canton", "Milton", "Ipswich", "Littleton", "Westford", "North Reading", "Chelmsford", "Dedham",
                  "Walpole", "Mansfield", "Shrewsbury", "Norwood", "Hanover", "Stow", "Newburyport", "Chatham", "Orleans", "Harwich",
                  "Swampscott","Fairhaven", "Salem"]

# poorest towns in Mass
poorTowns = ["Springfield", "Lawrence", "Holyoke", "Amherst", "New Bedford", "Chelsea", "Fall River", "Athol", "Orange", "Lynn", "Fitchburg", "Gardner", "Brockton", "Malden", "Worcester", "Chicopee", "North Adams", "Everett",
    "Ware", "Dudley", "Greenfield Town", "Weymouth Town", "Montague", "Revere", "Taunton", "Adams", "Huntington", "Charlemont", "Leominster", "Florida", "Colrain", "Hardwick",
    "Palmer Town", "Peabody", "Somerville", "Lowell", "Westfield", "Billerica"]

Create a df with all the information for the rich and poor towns

In [76]:
def find_town_info_row(town, bp_code_swapped, townCounts_df, code_name):
    code = bp_code_swapped[town]
    
    if not townCounts_df[townCounts_df[code_name] == code].empty:
        count = townCounts_df[townCounts_df[code_name] == code]['count'].values[0]
    else:
        count = 0
    
    new_row = {code_name: town, 'code': code, 'count': count}
    
    new_row_df = pd.DataFrame([new_row])
    
    return new_row_df

In [77]:
def find_town_info_all(counts, code_name):
    
    townCounts_df = pd.merge(X_test, counts, on=code_name)
    town_info_rich = pd.DataFrame(columns=[code_name, 'code', 'count'])
    town_info_poor = pd.DataFrame(columns=[code_name, 'code', 'count'])

    bp_code_swapped = {value: key for key, value in bp_code.items()}

    for town in richTowns:
        
        new_row_df = find_town_info_row(town, bp_code_swapped, townCounts_df, code_name)
        town_info_rich = pd.concat([town_info_rich, new_row_df], ignore_index=True)

    for town in poorTowns:
        
        new_row_df = find_town_info_row(town, bp_code_swapped, townCounts_df, code_name)
        town_info_poor= pd.concat([town_info_poor, new_row_df], ignore_index=True)
        
    return town_info_rich, town_info_poor

birthplace_counts = X_test.groupby('birthplace').size().reset_index(name='count')

town_info_rich, town_info_poor = find_town_info_all(birthplace_counts, 'birthplace')

The following code walks through each list in order and collects towns until the running total of people passes 65 (or the list runs out), once for the richest towns and once for the poorest.

In [78]:
def get_towns_by_sum_pop(town_info, code_name):
    
    townsUsed = set()
    peopleCount = 0

    for index, row in town_info.iterrows():
        
        if peopleCount > 65:
            break
        
        name = row[code_name]
        count = row['count']
        townsUsed.add(name)
        peopleCount += count
    
    return townsUsed, peopleCount

richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'birthplace')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'birthplace')
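
The same selection can be expressed without an explicit loop by using a cumulative sum. The sketch below is intended to mirror the loop above on a toy table; the function name `towns_until_threshold` and the toy counts are assumptions for illustration.

```python
import pandas as pd

def towns_until_threshold(town_info, name_col, threshold=65):
    """Keep towns in order until the running total of people first exceeds the threshold."""
    counts = town_info["count"].astype(int)
    # people already counted before each town is considered
    counted_before = counts.cumsum() - counts
    kept = town_info[counted_before <= threshold]
    return set(kept[name_col]), int(kept["count"].sum())

# toy table in the same shape as town_info_rich (illustrative values)
toy_info = pd.DataFrame({
    "birthplace": ["A", "B", "C", "D"],
    "count": [40, 30, 10, 5],
})
towns, people = towns_until_threshold(toy_info, "birthplace")
```

Note that the last town is always added whole, so the total can end above 65 and the rich and poor groups need not be the same size.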

### Birthplace

In [79]:
def get_av_prob_bp(townsUsed, code_name, bp_code):
    
    town_codes = []
    bp_code_swapped = {value: key for key, value in bp_code.items()}


    for town_full in townsUsed:
        town_codes.append(bp_code_swapped[town_full])
        
    indices = X_test[code_name].isin(town_codes)
    prob_subset = pred_prob[indices]
    av_prob = np.mean(prob_subset[:, 1]) 

    return av_prob

In [80]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.33305156382079454 av_poor_prob:  0.31947027331642713


We find that there is not much difference in the average risk of diabetes when comparing poor and rich birthplace towns. 

### Current Town of Residence

Create a dataframe with the information for rich and poor towns. Then get the list of towns that sum up to 65 people from the richest towns, and 65 people from the poorest towns. 

In [81]:
curr_counts = X_test.groupby('curr_town').size().reset_index(name='count')
town_info_rich, town_info_poor = find_town_info_all(curr_counts, 'curr_town')

richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'curr_town')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'curr_town')

In [82]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.25274725274725274 av_poor_prob:  0.2827087442472057


In this comparison, we find that people currently residing in rich towns have slightly lower rates of diabetes than those residing in poorer towns. 

# Pregnancy Analysis

We repeated the same exact process as above for each of our condition subsets.

Finding the best model:

In [83]:
np.random.seed(567)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(conditions_pregnancy)

In [84]:
np.random.seed(567)
train_model(X_train, y_train) 

best LR : 0.9538094714060378
best DTC: 0.9632185172957705
best max depth:  {'max_depth': 1}
best RFC:  0.9632185172957705
best max depth:  {'max_depth': 1}
best SVM:  0.9632185172957705
best score overall is:  0.9632185172957705  with model:  DTC
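
Three of the four models tie at exactly the same score here, with a best depth of 1. One possible explanation, which the output alone does not confirm, is that they all predict the majority class for nearly every patient. A way to check is to compare against a majority-class baseline; the sketch below does this on synthetic, heavily imbalanced labels using scikit-learn's `DummyClassifier`. The data and the names `X_demo`, `y_demo`, and `baseline_score` are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# synthetic imbalanced labels: 4% positive (assumption, not our data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.zeros(500, dtype=int)
y_demo[:20] = 1

# a classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_score = cross_val_score(baseline, X_demo, y_demo, cv=5).mean()
```

If a real model's cross-validated accuracy matches this kind of baseline, accuracy alone says little about how well it separates the two classes.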


### Compute Average Risk scores

Predict probabilities for all our entries using the best model we found

In [85]:
DTC = DecisionTreeClassifier(max_depth=1)
DTC.fit(X_train, y_train)
pred_prob = DTC.predict_proba(X_test)

### Race

In [86]:
pregRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    pregRaceRisk.append(newRow)

pregRaceRisk = pd.DataFrame(pregRaceRisk)
pregRaceRisk = pregRaceRisk.sort_values(by='risk', ascending=False)
pregRaceRisk

|   | race | risk |
|---|------|------|
| 1 | black | 0.051395 |
| 2 | hispanic | 0.038217 |
| 0 | asian | 0.037262 |
| 3 | white | 0.03426 |


Here we can see that Black patients have about one and a half times the average risk of pregnancy complications that white patients have (0.051 vs. 0.034). Hispanic patients have the second highest risk.
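
The comparison between groups is easiest to read as a ratio against the lowest-risk group. The short calculation below uses the average risk scores from the race table above; the names `risk`, `reference`, and `relative` are just for this illustration.

```python
# average risk scores copied from the race table above
risk = {"black": 0.051395, "hispanic": 0.038217, "asian": 0.037262, "white": 0.034260}

# ratio of each group's average risk to the lowest-risk group
reference = min(risk.values())
relative = {race: round(value / reference, 2) for race, value in risk.items()}
```

For Black patients this gives 0.051395 / 0.034260, or roughly 1.5 times the average risk of white patients.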

### Gender

In [87]:
pregGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    pregGenderRisk.append(newRow)

pregGenderRisk = pd.DataFrame(pregGenderRisk)
pregGenderRisk = pregGenderRisk.sort_values(by='risk', ascending=False)
pregGenderRisk

|   | gender | risk |
|---|--------|------|
| 0 | F | 0.074523 |
| 1 | M | 0.0 |


This result may seem a bit redundant or silly, but it makes sense, as people identified as male generally do not get pregnant.

### Ethnicity

In [88]:
av_risk_eth = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    av_risk_eth.append(new_row)

av_risk_eth_df = pd.DataFrame(av_risk_eth)
av_risk_eth_df = av_risk_eth_df.sort_values(by='risk', ascending=False)


In [89]:
av_risk_eth_df

|    | eth | risk |
|----|-----|------|
| 5 | dominican | 0.074523 |
| 17 | scottish | 0.074523 |
| 1 | american | 0.054199 |
| 3 | central_american | 0.053231 |
| 19 | west_indian | 0.049682 |
| 8 | french_canadian | 0.049682 |
| 12 | mexican | 0.049682 |
| 14 | portuguese | 0.047908 |
| 4 | chinese | 0.042585 |
| 7 | french | 0.039746 |


Here we see that our finding that Black patients are more likely to experience pregnancy-related complications is driven largely by Dominican patients.

### Birthplace

In [90]:
birthplace_counts = X_test.groupby('birthplace').size().reset_index(name='count')
town_info_rich, town_info_poor = find_town_info_all(birthplace_counts, 'birthplace')
richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'birthplace')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'birthplace')

In [91]:
np.random.seed(234)

av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.038981469137448335 av_poor_prob:  0.03668844154112784


### Current Town of Residence 

In [92]:
curr_counts = X_test.groupby('curr_town').size().reset_index(name='count')
town_info_rich, town_info_poor = find_town_info_all(curr_counts, 'curr_town')
richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'curr_town')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'curr_town')

In [93]:
np.random.seed(234)
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.04586055192640981 av_poor_prob:  0.03630627027507444


This finding was somewhat surprising to us, in that wealthier towns were found to have higher risks of pregnancy complications. We discuss the potential implications of this result in our results section.

# Cancer Analysis 


In [94]:
np.random.seed(2)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(conditions_cancer)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
np.random.seed(500)
train_model(X_train, y_train)

best LR : 0.9486775980338212
best DTC: 0.9546641722607386
best max depth:  {'max_depth': 1}
best RFC:  0.9546641722607386
best max depth:  {'max_depth': 1}
best SVM:  0.9546641722607386
best score overall is:  0.9546641722607386  with model:  DTC


Once again, we find that the model with the best score is the decision tree classifier (DTC), with about 95% accuracy.
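
The `fillna` call above is applied only to `X_train`. If `X_test` also contains missing values, `predict_proba` may fail or treat them differently, depending on the estimator and the scikit-learn version. One way to apply the same fill to both splits is sketched below with `SimpleImputer` on toy frames; the toy data and the names `toy_train`, `toy_test`, and `imputer` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frames with missing values (illustrative only)
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [0.5, 0.2, np.nan]})
toy_test = pd.DataFrame({"a": [np.nan, 2.0], "b": [0.1, np.nan]})

# fill missing values with 0.0, matching the fillna(0.0) call above
imputer = SimpleImputer(strategy="constant", fill_value=0.0)
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
```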


In [95]:
DTC = DecisionTreeClassifier(max_depth=1)
DTC.fit(X_train, y_train)
pred_prob = DTC.predict_proba(X_test)

### Race

In [96]:
cancerRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    cancerRaceRisk.append(newRow)

cancerRaceRisk = pd.DataFrame(cancerRaceRisk)
cancerRaceRisk = cancerRaceRisk.sort_values(by='risk', ascending=False)
cancerRaceRisk


|   | race | risk |
|---|------|------|
| 3 | white | 0.051942 |
| 2 | hispanic | 0.05165 |
| 1 | black | 0.046859 |
| 0 | asian | 0.034009 |


We find that cancer risk is fairly even across the board. At the extremes, white patients have about a 5.2% average risk of being classified with cancer, and Asian patients have about 3.4%.

### Gender

In [97]:
cancerGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    cancerGenderRisk.append(newRow)

cancerGenderRisk = pd.DataFrame(cancerGenderRisk)
cancerGenderRisk = cancerGenderRisk.sort_values(by='risk', ascending=False)
cancerGenderRisk

|   | gender | risk |
|---|--------|------|
| 1 | M | 0.053825 |
| 0 | F | 0.047147 |


Women are slightly less likely to have cancer.

### Ethnicity 

In [98]:
cancerEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    cancerEthRisk.append(new_row)

cancerEthRisk = pd.DataFrame(cancerEthRisk)
cancerEthRisk = cancerEthRisk.sort_values(by='risk', ascending=False)

cancerEthRisk

|    | eth | risk |
|----|-----|------|
| 9 | german | 0.105675 |
| 15 | puerto_rican | 0.067085 |
| 0 | african | 0.067085 |
| 4 | chinese | 0.062675 |
| 11 | italian | 0.059576 |
| 6 | english | 0.057127 |
| 13 | polish | 0.049934 |
| 10 | irish | 0.049557 |
| 14 | portuguese | 0.048342 |
| 18 | swedish | 0.045475 |


Our results for ethnicity largely match the results we found distinguishing by race.

### Birthplace

In [99]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.06399830132085002 av_poor_prob:  0.04238804096017698


### Current Town of Residence

In [100]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.03621368085712754 av_poor_prob:  0.05164958111475114


We note that for birthplace, people in rich towns are more likely to get diagnosed with cancer as opposed to people from poorer towns. For current town of residence, the opposite is true. 

# Heart Analysis

In [101]:
np.random.seed(210)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(conditions_heart)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
np.random.seed(20)
train_model(X_train, y_train)

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

best LR : 0.87766773045743
best DTC: 0.8973478595796193
best max depth:  {'max_depth': 1}
best RFC:  0.8999156303877335
best max depth:  {'max_depth': None}
best SVM:  0.8973478595796193
best score overall is:  0.8999156303877335  with model:  RFC


### Compute Average Risk scores
We found that the best model to predict probabilities for all our entries in this case would be RFC. 

In [115]:
RFC = RandomForestClassifier(random_state=0, max_depth=1)
RFC.fit(X_train, y_train)

pred_prob = RFC.predict_proba(X_test)
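
Copying hyperparameters by hand from the printed output is easy to get out of step with the search result (the output above reports `{'max_depth': None}` for the random forest). One way to avoid this is to reuse the fitted search object directly. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, `search`, and the small parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (assumption: not our patient data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [1, 3, None]},
    cv=5,
)
search.fit(X_demo, y_demo)

# with the default refit=True, best_estimator_ is already refit on all the data
best_model = search.best_estimator_
demo_prob = best_model.predict_proba(X_demo)
```

Having `train_model` return `grid_search.best_estimator_` for the winning model would make this available to the later cells.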

### Compare Across Race, Gender, Ethnicity

### Race

In [116]:
heartRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    heartRaceRisk.append(newRow)

heartRaceRisk = pd.DataFrame(heartRaceRisk)
heartRaceRisk = heartRaceRisk.sort_values(by='risk', ascending=False)
heartRaceRisk

|   | race | risk |
|---|------|------|
| 0 | asian | 0.510746 |
| 3 | white | 0.502101 |
| 1 | black | 0.491353 |
| 2 | hispanic | 0.49128 |


We find that the demographic with the highest likelihood of having heart problems is Asian, but overall the results are fairly even.

### Gender

In [117]:
heartGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    heartGenderRisk.append(newRow)

heartGenderRisk = pd.DataFrame(heartGenderRisk)
heartGenderRisk = heartGenderRisk.sort_values(by='risk', ascending=False)
heartGenderRisk

|   | gender | risk |
|---|--------|------|
| 0 | F | 0.508013 |
| 1 | M | 0.492275 |


According to our results, women and men are about equally likely to have heart conditions, which disagrees with real medical trends showing that men are much more likely to have these conditions.

### Ethnicity

In [118]:
heartEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    heartEthRisk.append(new_row)

heartEthRisk = pd.DataFrame(heartEthRisk)
heartEthRisk = heartEthRisk.sort_values(by='risk', ascending=False)

heartEthRisk

|    | eth | risk |
|----|-----|------|
| 2 | asian_indian | 0.556735 |
| 17 | scottish | 0.556144 |
| 13 | polish | 0.553649 |
| 12 | mexican | 0.534791 |
| 14 | portuguese | 0.519123 |
| 1 | american | 0.516831 |
| 9 | german | 0.51627 |
| 16 | russian | 0.511412 |
| 18 | swedish | 0.508385 |
| 6 | english | 0.500485 |


Again, here we see very little variation.

### Birthplace

In [106]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.10892307692307693 av_poor_prob:  0.1103076923076923


### Current Town of Residence 

In [107]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.08661538461538462 av_poor_prob:  0.09256410256410255


There do not seem to be meaningful differences in the risk of heart disease between wealthier and less wealthy towns, whether we compare by birthplace or by current town of residence.
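
To put a number on whether a gap between two group averages could be due to chance, one option (not something we ran for this analysis) is a permutation test on the individual risk scores. The sketch below implements a simple two-sided version with NumPy on made-up scores; the function name `permutation_p_value` and all the toy values are assumptions for illustration.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean())
        hits += int(diff >= observed)
    # add one to numerator and denominator so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# toy risk scores for two groups of 65 people (illustrative only)
rng = np.random.default_rng(1)
rich_demo = rng.normal(0.10, 0.05, size=65)
poor_demo = rng.normal(0.10, 0.05, size=65)

p_same = permutation_p_value(rich_demo, poor_demo)
p_shifted = permutation_p_value(rich_demo, poor_demo + 0.2)
```

A small value such as `p_shifted` suggests the gap is unlikely to come from shuffling alone, while a larger value gives no evidence of a real difference.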

# Lungs Analysis 

In [108]:
np.random.seed(400)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(conditions_lungs)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
train_model(X_train, y_train)

best LR : 0.5705770147830234
best DTC: 0.6107626279300099
best max depth:  {'max_depth': 5}
best RFC:  0.6210850665786289
best max depth:  {'max_depth': 4}
best SVM:  0.5971387696709585
best score overall is:  0.6210850665786289  with model:  RFC


### Compute Average Risk scores
We found that the best model to predict probabilities for all our entries in this case would be RFC.

In [124]:
RFC = RandomForestClassifier(random_state=0, max_depth=4)
RFC.fit(X_train, y_train)

pred_prob = RFC.predict_proba(X_test)

### Compare Across Race, Gender, Ethnicity

### Race

In [125]:
lungsRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    lungsRaceRisk.append(newRow)

lungsRaceRisk = pd.DataFrame(lungsRaceRisk)
lungsRaceRisk = lungsRaceRisk.sort_values(by='risk', ascending=False)
lungsRaceRisk

|   | race | risk |
|---|------|------|
| 0 | asian | 0.514338 |
| 1 | black | 0.509092 |
| 3 | white | 0.507817 |
| 2 | hispanic | 0.492106 |


We find very little variation in the risk of lung issues across races.

### Gender

In [122]:
lungsGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    lungsGenderRisk.append(newRow)

lungsGenderRisk = pd.DataFrame(lungsGenderRisk)
lungsGenderRisk = lungsGenderRisk.sort_values(by='risk', ascending=False)
lungsGenderRisk

|   | gender | risk |
|---|--------|------|
| 0 | F | 0.519215 |
| 1 | M | 0.49355 |


Again, the risk of lung complications is fairly even across genders.

### Ethnicity

In [126]:
lungsEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    lungsEthRisk.append(new_row)

lungsEthRisk = pd.DataFrame(lungsEthRisk)
lungsEthRisk = lungsEthRisk.sort_values(by='risk', ascending=False)

lungsEthRisk

|    | eth | risk |
|----|-----|------|
| 13 | polish | 0.583031 |
| 12 | mexican | 0.580761 |
| 17 | scottish | 0.577455 |
| 2 | asian_indian | 0.577379 |
| 18 | swedish | 0.560368 |
| 0 | african | 0.55998 |
| 16 | russian | 0.549236 |
| 1 | american | 0.548145 |
| 9 | german | 0.521056 |
| 14 | portuguese | 0.516368 |


This analysis follows the trend of generally even probabilities throughout; at the extremes, Polish patients have a 58% risk and Chinese patients have 45%.

### Birthplace

In [127]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.4872429058942045 av_poor_prob:  0.5189382015534418


### Current Town of Residence

In [128]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.5039833131472166 av_poor_prob:  0.5124758651408807


The risk scores for patients of different races, genders, ethnicities, birthplace towns, and current towns of residence are curious: every score hovers around 0.5 and is generally even across demographics. This suggests that we may be grouping too many conditions under lung ailments.