### Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

In [4]:
warnings.filterwarnings('ignore')

### Loading the data

In [9]:
medical_data = pd.read_csv('data/medical_data.csv', index_col=0)

In [11]:
pd.set_option('display.max_columns', None)

medical_data.head()

Unnamed: 0,sex,age,marital,income,state_code,obese,asthma,arthritis,depression,migraine,osteoporosis,cancer,bipolar_disorder,schizophrenia,major_depressive_disorder,generalized_anxiety_disorder,obsessive_compulsive_disorder,heart_attack,high_blood_pressure,high_cholesterol,smoker,stroke,diabetes,physical_activity,heavy_alcohol_consumption,chronic_obstructive_pulmonary_disease,thyroid_disorder,allergies,gastroesophageal_reflux_disease,sleep_apnea,fibromyalgia,irritable_bowel_syndrome,chronic_kidney_disease,back_pain,hepatitis,rheumatoid_arthritis,alzheimer,epilepsy,psoriasis,multiple_sclerosis,parkinson,celiac_disease,endometriosis,lupus
0,0,70,0,136480.4,DE,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,1,1,0,1
1,0,91,0,247620.82,NC,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,1,1,0,0,0,0,0,0
2,0,59,0,95551.06,TX,1,0,0,0,1,0,0,1,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,0,1,1,0,0
3,0,92,0,37194.94,GA,0,0,0,0,1,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1
4,0,95,0,91107.5,CO,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,0


**Stretch goal:** analyze the relationship between marital status and depression.

Two variables indicate whether an individual has depression: 'depression' and 'major_depressive_disorder'. The data set documentation does not explain the difference between the two, so I consider two possibilities:
1) depression covers any kind of depression (mild or major), while major_depressive_disorder covers only the major form
2) depression covers only non-major depression, and major_depressive_disorder is a separate column for the major form

Analyzing the data further may show which option is more likely. If 1) holds, then no row can have depression = 0 and major_depressive_disorder = 1. If 2) holds, then no row can have depression = 1 and major_depressive_disorder = 1. If the data contain both kinds of rows, we have to reject both 1) and 2).
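The two checks described above can be sketched as a small helper that counts the rows contradicting each option. The function name `count_contradictions` and the toy data frame are assumptions made for illustration; only the two column names come from the data set.

```python
import pandas as pd

def count_contradictions(df, broad="depression", major="major_depressive_disorder"):
    """Count rows that contradict option 1) and option 2) (hypothetical helper)."""
    # Option 1) forbids rows with depression = 0 and major_depressive_disorder = 1.
    against_option_1 = int(((df[broad] == 0) & (df[major] == 1)).sum())
    # Option 2) forbids rows with depression = 1 and major_depressive_disorder = 1.
    against_option_2 = int(((df[broad] == 1) & (df[major] == 1)).sum())
    return {"against_option_1": against_option_1, "against_option_2": against_option_2}

# Toy data for illustration only, not taken from medical_data.
toy = pd.DataFrame({
    "depression": [0, 0, 1, 1, 1],
    "major_depressive_disorder": [0, 1, 0, 1, 1],
})
result = count_contradictions(toy)
print(result)
```

On the real data the call would be `count_contradictions(medical_data)`; a non-zero count means the corresponding option cannot hold for every row.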

### Data Exploration

#### Exploring general characteristics of the data

In [15]:
medical_data.describe()

Unnamed: 0,sex,age,marital,income,obese,asthma,arthritis,depression,migraine,osteoporosis,cancer,bipolar_disorder,schizophrenia,major_depressive_disorder,generalized_anxiety_disorder,obsessive_compulsive_disorder,heart_attack,high_blood_pressure,high_cholesterol,smoker,stroke,diabetes,physical_activity,heavy_alcohol_consumption,chronic_obstructive_pulmonary_disease,thyroid_disorder,allergies,gastroesophageal_reflux_disease,sleep_apnea,fibromyalgia,irritable_bowel_syndrome,chronic_kidney_disease,back_pain,hepatitis,rheumatoid_arthritis,alzheimer,epilepsy,psoriasis,multiple_sclerosis,parkinson,celiac_disease,endometriosis,lupus
count,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0,229762.0
mean,0.439237,58.452699,0.300624,139627.472518,0.301368,0.29897,0.299932,0.299266,0.299654,0.299636,0.298626,0.300742,0.300646,0.299088,0.3023,0.299549,0.103202,0.454431,0.441753,0.465682,0.044755,0.172875,0.733389,0.060715,0.30005,0.299123,0.300894,0.300041,0.297895,0.300302,0.299632,0.299954,0.302117,0.297573,0.298861,0.301347,0.30079,0.298531,0.298722,0.300359,0.300594,0.300372,0.299858
std,0.496295,23.366982,0.45853,63521.953118,0.458853,0.457808,0.458229,0.457938,0.458107,0.4581,0.457657,0.458582,0.45854,0.457859,0.459255,0.458062,0.304224,0.49792,0.496597,0.498822,0.206766,0.37814,0.442188,0.238807,0.45828,0.457875,0.458648,0.458276,0.457334,0.45839,0.458098,0.458238,0.459177,0.457192,0.45776,0.458844,0.458602,0.457614,0.457699,0.458415,0.458517,0.458421,0.458197
min,0.0,18.0,0.0,30000.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,38.0,0.0,84658.975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,58.0,0.0,139425.145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,79.0,1.0,194568.15,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,99.0,1.0,249999.43,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [16]:
medical_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 229762 entries, 0 to 253660
Data columns (total 44 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   sex                                    229762 non-null  int64  
 1   age                                    229762 non-null  int64  
 2   marital                                229762 non-null  int64  
 3   income                                 229762 non-null  float64
 4   state_code                             229762 non-null  object 
 5   obese                                  229762 non-null  int64  
 6   asthma                                 229762 non-null  int64  
 7   arthritis                              229762 non-null  int64  
 8   depression                             229762 non-null  int64  
 9   migraine                               229762 non-null  int64  
 10  osteoporosis                           229762 non-null  int64

This looks like a very clean data set with no missing data. Most of the columns are integers representing binary features: whether an individual has a certain condition (1) or not (0).
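Both observations can also be checked programmatically rather than read off the `info()` output. The sketch below is one possible way to do it; the helper name `summarize_cleanliness` and the toy data frame are illustrative assumptions.

```python
import pandas as pd

def summarize_cleanliness(df):
    """Return the total number of missing values and the list of 0/1 columns (hypothetical helper)."""
    # Total number of missing cells across the whole frame.
    missing = int(df.isna().sum().sum())
    # A numeric column counts as binary if its distinct values are a subset of {0, 1}.
    numeric = df.select_dtypes(include="number")
    binary_cols = [c for c in numeric.columns if set(numeric[c].dropna().unique()) <= {0, 1}]
    return missing, binary_cols

# Toy data for illustration only, mimicking a few columns of medical_data.
toy = pd.DataFrame({
    "sex": [0, 1, 0],
    "age": [70, 91, 59],
    "state_code": ["DE", "NC", "TX"],
    "obese": [0, 1, 1],
})
missing, binary_cols = summarize_cleanliness(toy)
print(missing, binary_cols)
```

Applied to the real data, `summarize_cleanliness(medical_data)` would report the missing-value total and the columns that hold only 0 and 1.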

#### Exploring the two variables related to depression

In [12]:
medical_data.depression.value_counts()

depression
0    161002
1     68760
Name: count, dtype: int64

In [13]:
medical_data.major_depressive_disorder.value_counts()

major_depressive_disorder
0    161043
1     68719
Name: count, dtype: int64

In [14]:
pd.crosstab(medical_data.depression, medical_data.major_depressive_disorder)

depression,major_depressive_disorder=0,major_depressive_disorder=1
0,112789,48213
1,48254,20506


At this point both of my assumptions above are disproved: the crosstab contains 48213 rows with depression = 0 and major_depressive_disorder = 1, which contradicts 1), and 20506 rows with both equal to 1, which contradicts 2). I will therefore work only with the variable major_depressive_disorder. I prefer it over the depression variable because its meaning is clearer: major depressive disorder is the established diagnostic term for clinical depression.
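To see how the two columns relate beyond these two checks, one option is to compare the observed crosstab counts with the counts expected if the columns were statistically independent. The sketch below uses the four counts from the crosstab output above; the variable names are illustrative, and the independence comparison is an extra check, not part of the original analysis.

```python
# Counts copied from the crosstab output above.
# Keys are (depression, major_depressive_disorder).
observed = {(0, 0): 112789, (0, 1): 48213, (1, 0): 48254, (1, 1): 20506}

total = sum(observed.values())

# Marginal totals for each variable.
row_totals = {d: observed[(d, 0)] + observed[(d, 1)] for d in (0, 1)}
col_totals = {m: observed[(0, m)] + observed[(1, m)] for m in (0, 1)}

# Expected count per cell under independence: row total * column total / grand total.
expected = {
    (d, m): row_totals[d] * col_totals[m] / total
    for d in (0, 1)
    for m in (0, 1)
}

for cell in sorted(observed):
    print(cell, observed[cell], round(expected[cell], 1))
```

If the observed and expected counts are close in every cell, the two columns carry little information about each other, which would be another reason not to treat one as a refinement of the other.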

#### Exploring the thyroid disorder variable

In [None]:
medical_data.thyroid_disorder

In [17]:
medical_data.thyroid_disorder.value_counts()

thyroid_disorder
0    161035
1     68727
Name: count, dtype: int64

The data set documentation gives no additional information about what kind of thyroid disease this column represents, e.g. hypothyroidism, hyperthyroidism, or something else, or whether it simply records any kind of thyroid disorder. So we will just analyze the relationship between depression and thyroid disorder.
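A possible first step for that analysis is a row-normalized crosstab, which gives the share of individuals with major depressive disorder within each thyroid group. This is only a sketch: the helper name `depression_rate_by_thyroid` and the toy data are assumptions, and it uses major_depressive_disorder as the depression variable, following the choice made earlier.

```python
import pandas as pd

def depression_rate_by_thyroid(df):
    """Share of major_depressive_disorder values within each thyroid_disorder group (hypothetical helper)."""
    # normalize="index" divides each row by its row total, so each row sums to 1.
    return pd.crosstab(df["thyroid_disorder"], df["major_depressive_disorder"], normalize="index")

# Toy data for illustration only, not taken from medical_data.
toy = pd.DataFrame({
    "thyroid_disorder": [0, 0, 1, 1],
    "major_depressive_disorder": [0, 1, 1, 1],
})
rates = depression_rate_by_thyroid(toy)
print(rates)
```

On the real data, `depression_rate_by_thyroid(medical_data)` would show whether the depression rate differs between individuals with and without a thyroid disorder.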