# Project context
Obesity is among the most serious diseases of this century. Besides affecting an enormous number of people (projections suggest that by 2050 more than half of the world's population will be obese), it is extremely costly for public health systems and health insurers, since it is strongly linked to diabetes, hypertension, and 62 other diseases.

With this project we therefore seek to understand the causes and consequences of obesity, as well as its relationship with diabetes and hypertension, the two main related diseases.

To that end, we will use 7 datasets.

In [3]:
# import everything the notebook needs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

## HDI around the world
This dataset describes how the HDI (Human Development Index) of each country has evolved over the years.

In [4]:
humanDevelopmentIndex = pd.read_csv('./data/HumanDevelopmentIndex.csv')
humanDevelopmentIndex = humanDevelopmentIndex.set_index('Country')
humanDevelopmentIndex.index = humanDevelopmentIndex.index.str.strip()

humanDevelopmentIndex.head()

Country,HDI Rank (2017),1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Afghanistan,168,,,,,,,,,,...,0.437,0.453,0.463,0.471,0.482,0.487,0.491,0.493,0.494,0.498
Albania,68,0.645,0.626,0.61,0.613,0.619,0.632,0.641,0.641,0.652,...,0.724,0.729,0.741,0.752,0.767,0.771,0.773,0.776,0.782,0.785
Algeria,85,0.577,0.581,0.587,0.591,0.595,0.6,0.608,0.617,0.627,...,0.709,0.719,0.729,0.736,0.74,0.745,0.747,0.749,0.753,0.754
Andorra,35,,,,,,,,,,...,0.831,0.83,0.828,0.827,0.849,0.85,0.853,0.854,0.856,0.858
Angola,147,,,,,,,,,,...,0.502,0.522,0.52,0.535,0.543,0.554,0.564,0.572,0.577,0.581


## Obesity around the world
This dataset describes the growth of obesity over the years, by country. The data was collected from Our World in Data, a site that gathers data and visualizations from many datasets.

The original source is a data collection carried out by UNICEF in partnership with a few other groups.

The data is summarized by country, year, and percentage of obese adults in each country/year.

In [5]:
worldObesity = pd.read_csv('./data/share-of-adults-defined-as-obese.csv')
worldObesity.columns = ['entity', 'code', 'year', 'obesityPercentage']
worldObesity = worldObesity.drop(columns=['code'])
worldObesity.year = worldObesity.year.astype('int64')
worldObesity = worldObesity.set_index('entity')
worldObesity.head()

entity,year,obesityPercentage
Afghanistan,1975,0.5
Afghanistan,1976,0.5
Afghanistan,1977,0.6
Afghanistan,1978,0.6
Afghanistan,1979,0.6


#### Relationship between obesity and HDI worldwide
Now let's look at the world as a whole and check how statistically relevant HDI is to obesity.

First, let's take the mean obesity percentage across countries for each year.

In [6]:
worldObesityMeanPercentageByYear = worldObesity.groupby(['year'])['obesityPercentage'].mean().to_frame()
worldObesityMeanPercentageByYear.index = worldObesityMeanPercentageByYear.index.astype('int64')
worldObesityMeanPercentageByYear = worldObesityMeanPercentageByYear.rename(columns={'obesityPercentage': 'obesityPercentageMean'})
worldObesityMeanPercentageByYear.head()

year,obesityPercentageMean
1975,6.510995
1976,6.737696
1977,6.962304
1978,7.193717
1979,7.43089


Now let's take the world mean HDI for each year.

In [7]:
worldIdhMeanByYear = humanDevelopmentIndex.drop(columns=['HDI Rank (2017)'])
worldIdhMeanByYear = worldIdhMeanByYear.T
worldIdhMeanByYear = worldIdhMeanByYear.reset_index()
worldIdhMeanByYear = worldIdhMeanByYear.rename(columns={'index': 'year'})
worldIdhMeanByYear = worldIdhMeanByYear.set_index('year')
worldIdhMeanByYear.index = worldIdhMeanByYear.index.astype('int64')
worldIdhMeanByYear = worldIdhMeanByYear.mean(axis=1).to_frame()
# mean() returns an unnamed Series, so to_frame() yields a single column labeled 0 (an int); name it directly
worldIdhMeanByYear.columns = ['idhMean']
worldIdhMeanByYear.head()

year,idhMean
1990,0.596901
1991,0.598937
1992,0.60165
1993,0.605231
1994,0.609497
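One detail of the step above is worth making explicit: `mean(axis=1)` skips missing values by default, so the set of countries behind each yearly mean changes over time. The following is a minimal sketch using three countries and two year columns copied from the HDI table earlier in this notebook; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# HDI values copied from the table shown earlier (1990 and 2008 columns only)
hdi = pd.DataFrame(
    {"1990": [np.nan, 0.645, 0.577], "2008": [0.437, 0.724, 0.709]},
    index=["Afghanistan", "Albania", "Algeria"],
)

# Same shape of computation as above: years become rows, then average across countries
by_year = hdi.T
by_year.index = by_year.index.astype("int64")
mean_by_year = by_year.mean(axis=1)     # NaN entries are skipped (skipna=True by default)
countries_used = by_year.count(axis=1)  # how many countries contributed to each mean

print(mean_by_year)
print(countries_used)
```

Here the 1990 mean is built from two countries and the 2008 mean from three, so part of any change between years can come from countries entering the sample rather than from HDI itself.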


Finally, let's join the two dataframes and run the statistical analysis.

In [8]:
relationWorldObesityIdhMean = worldObesityMeanPercentageByYear.merge(worldIdhMeanByYear, left_on='year', right_on='year')
relationWorldObesityIdhMean.head()

year,obesityPercentageMean,idhMean
1990,10.41466,0.596901
1991,10.71623,0.598937
1992,11.018325,0.60165
1993,11.319895,0.605231
1994,11.625654,0.609497


In [9]:
relationWorldObesityIdhMean.describe()

Unnamed: 0,obesityPercentageMean,idhMean
count,27.0,27.0
mean,14.840702,0.649174
std,2.902844,0.035576
min,10.41466,0.596901
25%,12.42199,0.620718
50%,14.642932,0.644795
75%,17.144241,0.679594
max,19.960733,0.706931
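The count of 27 rows is consistent with an inner join: `merge` keeps only the years present in both frames. The sketch below assumes the obesity series covers 1975 to 2016 and the HDI series 1990 to 2017, which matches the tables shown in this notebook; the values are placeholders (all zeros) because only the year ranges matter here.

```python
import pandas as pd

# Placeholder values: only the year ranges are meaningful in this sketch.
# Assumed coverage: obesity 1975-2016, HDI 1990-2017.
obesity = pd.DataFrame(
    {"obesityPercentageMean": 0.0},
    index=pd.RangeIndex(1975, 2017, name="year"),
)
hdi = pd.DataFrame(
    {"idhMean": 0.0},
    index=pd.RangeIndex(1990, 2018, name="year"),
)

# merge defaults to how='inner', so only overlapping years survive
merged = obesity.merge(hdi, left_index=True, right_index=True)
print(len(merged), merged.index.min(), merged.index.max())
```

Years before 1990 (no HDI) and 2017 (no obesity value) are dropped silently, so the regression that follows only covers the overlapping period.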


In [10]:
x = relationWorldObesityIdhMean.obesityPercentageMean / 100  # percentage -> fraction
y = relationWorldObesityIdhMean.idhMean
relationWorldObesityIdhMeanModel = linregress(x, y)
print('p-value={:.3f}'.format(relationWorldObesityIdhMeanModel.pvalue))

p-value=0.000


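The p-value rounds to 0.000, meaning it is below 0.0005, which indicates a statistically significant linear relationship between mean HDI and mean obesity. Both series rise steadily over the period, however, so this result alone does not show that one drives the other.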
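A note on reading this output: the `{:.3f}` format prints any p-value below 0.0005 as `0.000`, which hides its magnitude. The sketch below reruns the regression on only the first five rows of the merged table shown above (1990 to 1994) and prints the p-value in scientific notation together with the correlation coefficient. The 0.05 threshold is the conventional significance level, an assumption of this sketch rather than something the notebook states.

```python
from scipy.stats import linregress

# First five rows of the merged table shown above (1990-1994)
obesity_fraction = [v / 100 for v in [10.41466, 10.71623, 11.018325, 11.319895, 11.625654]]
idh_mean = [0.596901, 0.598937, 0.60165, 0.605231, 0.609497]

model = linregress(obesity_fraction, idh_mean)

# '{:.3f}' collapses small values to 0.000; scientific notation keeps the magnitude
print('p-value={:.3e}'.format(model.pvalue))
print('r={:.3f}, slope={:.3f}'.format(model.rvalue, model.slope))

alpha = 0.05  # conventional significance level (assumption of this sketch)
is_significant = model.pvalue < alpha
print('significant at alpha={}: {}'.format(alpha, is_significant))
```

A small p-value rejects the hypothesis of zero slope; it is evidence for a linear association, not against one.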

#### Relationship between obesity and HDI by country
With these two datasets we can relate HDI to the percentage of obese adults in each country over the years.

In [11]:
def relationObesityIdhByCountry(country): 
    countryObesity = worldObesity.loc[country]
    countryIdh = humanDevelopmentIndex.loc[country]
    countryIdh = countryIdh.to_frame()
    countryIdh = countryIdh.drop('HDI Rank (2017)')
    countryIdh = countryIdh.reset_index()
    countryIdh = countryIdh.rename(columns={'index': 'year'})
    countryIdh.year = countryIdh.year.astype('int64')
    countryIdh = countryIdh.rename(columns={country: 'idh'})
    relationObesityIdhCountry = countryObesity.join(countryIdh.set_index('year'), on='year')
    relationObesityIdhCountry = relationObesityIdhCountry.dropna()
    relationObesityIdhCountry = relationObesityIdhCountry.reset_index()
    relationObesityIdhCountry = relationObesityIdhCountry.set_index('year')
    relationObesityIdhCountry = relationObesityIdhCountry.drop(columns=['entity'])
    return relationObesityIdhCountry

relationObesityIdhByCountry('Afghanistan')

year,obesityPercentage,idh
2002,2.6,0.373
2003,2.7,0.383
2004,2.9,0.398
2005,3.0,0.408
2006,3.2,0.417
2007,3.4,0.429
2008,3.6,0.437
2009,3.8,0.453
2010,4.0,0.463
2011,4.2,0.471


## Bariatric surgeries in the US
This dataset describes the growth in the number of bariatric surgeries in the US over the years.

In [12]:
bariatricSurgeriesUS = pd.read_csv('./data/bariatricSurgeriesInUsfrom2011to2017.csv')
bariatricSurgeriesUS = bariatricSurgeriesUS.iloc[0]
bariatricSurgeriesUS = bariatricSurgeriesUS.to_frame()
bariatricSurgeriesUS = bariatricSurgeriesUS.drop(axis=0, index='Unnamed: 0')
bariatricSurgeriesUS.columns = ['numSurgeries']
bariatricSurgeriesUS = bariatricSurgeriesUS.reset_index()
bariatricSurgeriesUS = bariatricSurgeriesUS.rename(columns={'index': 'year'})
bariatricSurgeriesUS.year = bariatricSurgeriesUS.year.astype('int64')
bariatricSurgeriesUS.numSurgeries = bariatricSurgeriesUS.numSurgeries.astype('int64')
bariatricSurgeriesUS = bariatricSurgeriesUS.set_index('year')
bariatricSurgeriesUS.numSurgeries = [i*1000 for i in bariatricSurgeriesUS.numSurgeries] 
bariatricSurgeriesUS

year,numSurgeries
2011,158000
2012,173000
2013,179000
2014,193000
2015,196000
2016,216000
2017,228000
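The cell above turns a one-row, wide CSV into a year-indexed column. The same reshaping can be sketched as a single method chain. The input layout here is an assumption inferred from the code (one unnamed label column plus one column per year, with counts in thousands), and the row label is made up; the numbers are the ones in the output above divided by 1000.

```python
import pandas as pd

# Assumed layout: one unnamed label column plus one column per year, counts in thousands.
# The label text is hypothetical.
wide = pd.DataFrame(
    [["Number of surgeries", 158, 173, 179, 193, 196, 216, 228]],
    columns=["Unnamed: 0", "2011", "2012", "2013", "2014", "2015", "2016", "2017"],
)

surgeries = (
    wide.drop(columns=["Unnamed: 0"])
    .iloc[0]                 # the single data row, as a Series indexed by year label
    .astype("int64")
    .mul(1000)               # thousands -> absolute counts
    .rename("numSurgeries")
    .rename_axis("year")
    .to_frame()
)
surgeries.index = surgeries.index.astype("int64")  # year labels '2011'... -> integers
print(surgeries)
```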


Now let's compute the p-value of the number of bariatric surgeries against the growth of obesity in the US.

In [13]:
usObesity = relationObesityIdhByCountry('United States')
usObesity = usObesity.drop(columns=['idh'])
relationObesityBariatricSurgeryUS = bariatricSurgeriesUS.merge(usObesity, left_on='year', right_on='year')
relationObesityBariatricSurgeryUS

year,numSurgeries,obesityPercentage
2011,158000,33.0
2012,173000,33.6
2013,179000,34.3
2014,193000,34.9
2015,196000,35.6
2016,216000,36.2


In [14]:
x = relationObesityBariatricSurgeryUS['numSurgeries']
y = relationObesityBariatricSurgeryUS['obesityPercentage']

relationObesityBariatricSurgeryUSModel = linregress(x, y)
print('p-value={:.3f}'.format(relationObesityBariatricSurgeryUSModel.pvalue))

p-value=0.001


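Here too, the p-value of 0.001 is below the usual 0.05 threshold, so the number of bariatric surgeries and the percentage of obese adults in the US are significantly associated over 2011-2016. With only six data points, and both series rising over time, this result should be read with caution.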

## Diabetes around the world
This dataset describes diabetes around the world by country, IDF region, age group, sex, and setting (rural or urban).

In [15]:
worldDiabetes = pd.read_csv('./data/IDFDiabetesAtlas-PrevalenceByAgeSexUrbanRural20-79years.csv')
worldDiabetes.head()

Unnamed: 0,country_id,report_country,IDF region,Report Age,Report Gender,Report set,"Adults with diabetes (20-79) in 1,000s","Adults population (20-79) in 1,000s",Diabetes prevalence (20-79)
0,1,Afghanistan,MENA,20-24,Female,Urban,6.786987,442.695289,1.53%
1,1,Afghanistan,MENA,20-24,Female,Rural,14.764568,1162.984711,1.27%
2,1,Afghanistan,MENA,25-29,Female,Urban,9.788225,347.899364,2.81%
3,1,Afghanistan,MENA,25-29,Female,Rural,21.052534,913.950636,2.30%
4,1,Afghanistan,MENA,30-34,Female,Urban,13.800323,288.586775,4.78%
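Note that the prevalence column is stored as text with a trailing percent sign, so it needs converting before any numeric analysis. A minimal sketch on the first three rows of the table above follows; the shortened column names are hypothetical and stand in for the long original headers.

```python
import pandas as pd

# First three rows of the table above, with shortened (hypothetical) column names
sample = pd.DataFrame({
    "with_diabetes_thousands": [6.786987, 14.764568, 9.788225],
    "population_thousands": [442.695289, 1162.984711, 347.899364],
    "prevalence": ["1.53%", "1.27%", "2.81%"],
})

# Strip the percent sign and convert to a float percentage
sample["prevalence_pct"] = sample["prevalence"].str.rstrip("%").astype("float64")

# Cross-check: prevalence recomputed from the two count columns
sample["recomputed_pct"] = (
    100 * sample["with_diabetes_thousands"] / sample["population_thousands"]
)
print(sample[["prevalence_pct", "recomputed_pct"]].round(2))
```

The recomputed values agree with the reported ones to two decimals, which suggests the prevalence column is simply the ratio of the two count columns.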


In [35]:
worldDiabetes['Report Age'].value_counts()

50-54    884
45-49    884
40-44    884
55-59    884
25-29    884
20-24    884
35-39    884
60-64    884
30-34    884
65-69    884
70-74    884
75-79    884
Name: Report Age, dtype: int64

## Clinical information on patients in Brazil
This dataset describes some clinical factors of patients in Brazil; its labels include 'Eobeso' (whether the patient is obese).

In [16]:
obesityClinicalInfo = pd.read_csv('./data/obesidade.csv')
obesityClinicalInfo.head()

Unnamed: 0,ID,Idade,GeneroCod,Eobeso,fumo_atual,imc,obesoHer,cc,cq,rcq,frqCardiaca,fumo,atvFisica,stress,psisto,pdiasto,psisalta
0,1,20.0,2.0,Não,0,27.94,0,95.0,112,0.85,75,0.0,2.0,3.0,120.0,80.0,0.0
1,2,31.0,1.0,Não,0,28.76,0,88.0,101,0.87,66,0.0,2.0,0.0,128.0,74.33,1.0
2,3,19.0,2.0,Não,0,25.35,0,79.0,102,0.77,69,0.0,2.0,0.0,113.33,70.0,0.0
3,4,20.0,2.0,Não,0,20.73,0,91.0,80,1.14,85,0.0,0.0,0.0,130.0,76.67,1.0
4,5,19.0,2.0,Não,0,24.54,0,83.0,98,0.85,72,0.0,2.0,0.0,130.0,80.0,1.0


In [43]:
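x = obesityClinicalInfo['Eobeso'].value_counts()
# ratio of obese ('Sim') to non-obese ('Não') patients, looked up by label rather than by position
print("Obesity: %.2f"%(x['Sim']/x['Não']))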

Obesity: 0.08
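The figure printed above is a ratio of obese to non-obese patients, which differs from the share of obese patients in the sample. The sketch below shows both on toy labels (25 'Não' and 2 'Sim', made-up counts chosen only so that the ratio also comes out as 0.08); it does not use the real dataset.

```python
import pandas as pd

# Toy labels, not the real data: 25 non-obese and 2 obese patients
labels = pd.Series(["Não"] * 25 + ["Sim"] * 2, name="Eobeso")

counts = labels.value_counts()
ratio = counts["Sim"] / counts["Não"]               # obese per non-obese patient
share = labels.value_counts(normalize=True)["Sim"]  # obese share of all patients

print("ratio=%.2f share=%.3f" % (ratio, share))
```

Looking counts up by label also avoids depending on the order in which `value_counts` happens to sort the categories.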


In [44]:
obesityClinicalInfo.groupby(by = ['Eobeso'])['frqCardiaca'].mean()

Eobeso
Não    79.276423
Sim    86.290323
Name: frqCardiaca, dtype: float64

In [46]:
obesityClinicalInfo.groupby(by = ['Eobeso'])['Idade'].median()

Eobeso
Não    21.0
Sim    24.0
Name: Idade, dtype: float64

## National Health And Nutrition Examination Survey
This dataset contains detailed information on demographics, diet, examinations, lab results, and medications of survey participants in the US.

In [17]:
# National Health and Nutrition Examination Survey (NHANES) data
demographic = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/demographic.csv')
diet = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/diet.csv')
examination = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/examination.csv')
labs = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/labs.csv')
medications = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/medications.csv')
questionnaire = pd.read_csv('./data/NationalHealthAndNutritionExaminationSurvey/questionnaire.csv')

In [18]:
demographic.head()

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,73557,8,2,1,69,,4,4,1.0,,...,3.0,4.0,,13281.237386,13481.042095,1,112,4.0,4.0,0.84
1,73558,8,2,1,54,,3,3,1.0,,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
2,73559,8,2,1,72,,3,3,2.0,,...,4.0,1.0,3.0,57214.803319,57193.285376,1,109,10.0,10.0,4.51
3,73560,8,2,1,9,,3,3,1.0,119.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
4,73561,8,2,2,73,,3,3,1.0,,...,5.0,1.0,5.0,63709.667069,65541.871229,2,116,15.0,15.0,5.0


In [19]:
diet.head()

Unnamed: 0,SEQN,WTDRD1,WTDR2D,DR1DRSTZ,DR1EXMER,DRABF,DRDINT,DR1DBIH,DR1DAY,DR1LANG,...,DRD370QQ,DRD370R,DRD370RQ,DRD370S,DRD370SQ,DRD370T,DRD370TQ,DRD370U,DRD370UQ,DRD370V
0,73557,16888.327864,12930.890649,1,49.0,2.0,2.0,6.0,2.0,1.0,...,,,,,,,,,,
1,73558,17932.143865,12684.148869,1,59.0,2.0,2.0,4.0,1.0,1.0,...,,2.0,,2.0,,2.0,,2.0,,2.0
2,73559,59641.81293,39394.236709,1,49.0,2.0,2.0,18.0,6.0,1.0,...,,,,,,,,,,
3,73560,142203.069917,125966.366442,1,54.0,2.0,2.0,21.0,3.0,1.0,...,,,,,,,,,,
4,73561,59052.357033,39004.892993,1,63.0,2.0,2.0,18.0,1.0,1.0,...,,2.0,,2.0,,2.0,,2.0,,2.0


In [20]:
examination.head()

Unnamed: 0,SEQN,PEASCST1,PEASCTM1,PEASCCT1,BPXCHR,BPAARM,BPACSZ,BPXPLS,BPXPULS,BPXPTY,...,CSXLEAOD,CSXSOAOD,CSXGRAOD,CSXONOD,CSXNGSOD,CSXSLTRT,CSXSLTRG,CSXNART,CSXNARG,CSAEFFRT
0,73557,1,620.0,,,1.0,4.0,86.0,1.0,1.0,...,2.0,1.0,1.0,1.0,4.0,62.0,1.0,,,1.0
1,73558,1,766.0,,,1.0,4.0,74.0,1.0,1.0,...,3.0,1.0,2.0,3.0,4.0,28.0,1.0,,,1.0
2,73559,1,665.0,,,1.0,4.0,68.0,1.0,1.0,...,2.0,1.0,2.0,3.0,4.0,49.0,1.0,,,3.0
3,73560,1,803.0,,,1.0,2.0,64.0,1.0,1.0,...,,,,,,,,,,
4,73561,1,949.0,,,1.0,3.0,92.0,1.0,1.0,...,3.0,1.0,4.0,3.0,4.0,,,,,1.0


In [21]:
labs.head()

Unnamed: 0,SEQN,URXUMA,URXUMS,URXUCR.x,URXCRS,URDACT,WTSAF2YR.x,LBXAPB,LBDAPBSI,LBXSAL,...,URXUTL,URDUTLLC,URXUTU,URDUTULC,URXUUR,URDUURLC,URXPREG,URXUAS,LBDB12,LBDB12SI
0,73557,4.3,4.3,39.0,3447.6,11.03,,,,4.1,...,,,,,,,,,524.0,386.7
1,73558,153.0,153.0,50.0,4420.0,306.0,,,,4.7,...,,,,,,,,,507.0,374.2
2,73559,11.9,11.9,113.0,9989.2,10.53,142196.890197,57.0,0.57,3.7,...,,,,,,,,,732.0,540.2
3,73560,16.0,16.0,76.0,6718.4,21.05,,,,,...,0.062,0.0,0.238,0.0,0.0071,0.0,,3.83,,
4,73561,255.0,255.0,147.0,12994.8,173.47,142266.006548,92.0,0.92,4.3,...,,,,,,,,,225.0,166.1


In [22]:
medications.head()

Unnamed: 0,SEQN,RXDUSE,RXDDRUG,RXDDRGID,RXQSEEN,RXDDAYS,RXDRSC1,RXDRSC2,RXDRSC3,RXDRSD1,RXDRSD2,RXDRSD3,RXDCOUNT
0,73557,1,99999,,,,,,,,,,2.0
1,73557,1,INSULIN,d00262,2.0,1460.0,E11,,,Type 2 diabetes mellitus,,,2.0
2,73558,1,GABAPENTIN,d03182,1.0,243.0,G25.81,,,Restless legs syndrome,,,4.0
3,73558,1,INSULIN GLARGINE,d04538,1.0,365.0,E11,,,Type 2 diabetes mellitus,,,4.0
4,73558,1,OLMESARTAN,d04801,1.0,14.0,E11.2,,,Type 2 diabetes mellitus with kidney complicat...,,,4.0
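Every preview in this section carries a `SEQN` column, and the medications file has several rows per `SEQN`. Assuming `SEQN` identifies the same respondent across the files, the tables can be combined with a one-to-many merge. The sketch below uses a few columns and rows copied from the demographic and medications previews above; the `_sample` names are illustrative.

```python
import pandas as pd

# A few columns and rows copied from the demographic and medications previews above
demographic_sample = pd.DataFrame({
    "SEQN": [73557, 73558],
    "RIAGENDR": [1, 1],
    "RIDAGEYR": [69, 54],
})
medications_sample = pd.DataFrame({
    "SEQN": [73557, 73557, 73558, 73558, 73558],
    "RXDDRUG": ["99999", "INSULIN", "GABAPENTIN", "INSULIN GLARGINE", "OLMESARTAN"],
})

# One demographic row per respondent, many medication rows: a one-to-many join.
# validate raises if SEQN is unexpectedly duplicated on the demographic side.
merged = demographic_sample.merge(
    medications_sample, on="SEQN", how="left", validate="one_to_many"
)
print(merged)

# Number of medication rows per respondent
rows_per_respondent = merged.groupby("SEQN")["RXDDRUG"].count()
print(rows_per_respondent)
```

Because each respondent's demographic values are repeated once per medication row, any per-person statistic computed on the merged frame should first be reduced back to one row per `SEQN`.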


In [29]:
medications[medications['RXDDRUG'].isin(['INSULIN', 'GABAPENTIN'])]

Unnamed: 0,SEQN,RXDUSE,RXDDRUG,RXDDRGID,RXQSEEN,RXDDAYS,RXDRSC1,RXDRSC2,RXDRSC3,RXDRSD1,RXDRSD2,RXDRSD3,RXDCOUNT
1,73557,1,INSULIN,d00262,2.0,1460.0,E11,,,Type 2 diabetes mellitus,,,2.0
2,73558,1,GABAPENTIN,d03182,1.0,243.0,G25.81,,,Restless legs syndrome,,,4.0
30,73566,1,GABAPENTIN,d03182,1.0,2555.0,M62.83,,,Muscle spasm,,,3.0
133,73626,1,GABAPENTIN,d03182,1.0,1460.0,M79.2,,,"Neuralgia and neuritis, unspecified",,,5.0
162,73638,1,GABAPENTIN,d03182,1.0,730.0,M79.2,,,"Neuralgia and neuritis, unspecified",,,12.0
534,73797,1,GABAPENTIN,d03182,1.0,3650.0,M79.2,,,"Neuralgia and neuritis, unspecified",,,19.0
626,73831,1,GABAPENTIN,d03182,1.0,3.0,M79.2,,,"Neuralgia and neuritis, unspecified",,,10.0
645,73839,1,GABAPENTIN,d03182,1.0,730.0,E11.4,,,Type 2 diabetes mellitus with neurological com...,,,7.0
774,73899,1,GABAPENTIN,d03182,1.0,1825.0,M79.2,,,"Neuralgia and neuritis, unspecified",,,17.0
826,73922,1,GABAPENTIN,d03182,1.0,21.0,M54.9,,,"Dorsalgia, unspecified",,,9.0


In [23]:
questionnaire.head()

Unnamed: 0,SEQN,ACD011A,ACD011B,ACD011C,ACD040,ACD110,ALQ101,ALQ110,ALQ120Q,ALQ120U,...,WHD080U,WHD080L,WHD110,WHD120,WHD130,WHD140,WHQ150,WHQ030M,WHQ500,WHQ520
0,73557,1.0,,,,,1.0,,1.0,3.0,...,,40.0,270.0,200.0,69.0,270.0,62.0,,,
1,73558,1.0,,,,,1.0,,7.0,1.0,...,,,240.0,250.0,72.0,250.0,25.0,,,
2,73559,1.0,,,,,1.0,,0.0,,...,,,180.0,190.0,70.0,228.0,35.0,,,
3,73560,1.0,,,,,,,,,...,,,,,,,,3.0,3.0,3.0
4,73561,1.0,,,,,1.0,,0.0,,...,,,150.0,135.0,67.0,170.0,60.0,,,


In [48]:
questionnaire[['ACD110', 'ALQ120Q']]

Unnamed: 0,ACD110,ALQ120Q
0,,1.0
1,,7.0
2,,0.0
3,,
4,,0.0
5,,5.0
6,,
7,,2.0
8,,
9,,1.0
