 # Data Summary provided by GPT
 
 * There are several publicly available datasets that can be used for developing decision support systems in diabetes management. Here are a few examples:

    1. OhioT1DM Dataset: The OhioT1DM dataset (https://archive.ics.uci.edu/ml/datasets/OhioT1DM+Dataset) is a public dataset collected from patients with Type 1 Diabetes. It includes continuous glucose monitoring (CGM) data, insulin doses, meal information, physical activity, and physiological features. This dataset is suitable for developing decision support systems for insulin dosing, glucose prediction, and diabetes management.

    2. T2D-UCI Dataset: The T2D-UCI dataset (https://archive.ics.uci.edu/dataset/296/diabetes+130+us+hospitals+for+years+1999+2008) contains data from patients with Type 2 Diabetes, collected from 130 hospitals in the United States between 1999 and 2008. It includes demographic information, clinical factors, medications, and hospital readmission status. This dataset can be used for developing decision support systems for risk assessment, treatment planning, and predicting outcomes.

    3. Pima Indians Diabetes Database: The Pima Indians Diabetes Database (https://www.kaggle.com/uciml/pima-indians-diabetes-database) is a well-known dataset available on Kaggle. It includes data from female Pima Indians, including features such as glucose levels, BMI, age, and diabetes status. This dataset is often used for developing decision support systems for diabetes diagnosis and risk prediction.

        * Kaggle EDA Notebook - https://www.kaggle.com/code/shrutimechlearn/step-by-step-diabetes-classification-knn-detailed
        * Small dataset: only 768 rows
        * Can be used to predict diabetes. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
            * Glucose, BloodPressure, SkinThickness, Pregnancies, Insulin, BMI, DiabetesPedigreeFunction, Age
        

    4. EHR Datasets: Electronic Health Record (EHR) datasets, such as the MIMIC-III dataset (https://mimic.mit.edu/) and the eICU Collaborative Research Database (https://eicu-crd.mit.edu/), contain anonymized patient data collected from intensive care units. While these datasets cover a wide range of medical conditions, they can be used for developing decision support systems for critical care management in diabetes patients.

        * We probably cannot get access: https://eicu-crd.mit.edu/gettingstarted/access/ . It requires completing a training course, an affiliation, and an access request.

    5. CDC National Health and Nutrition Examination Survey

       * https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=Questionnaire&Cycle=2017-2020
       * 2017-March 2020 Pre-Pandemic Questionnaire Data - Continuous NHANES
           * Diabetes https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DIQ.htm
           * Age first told diabetes by a doctor
               - DIQ160 - Ever told you have prediabetes
               - DIQ230 - How long ago saw a diabetes specialist
               - DIQ240 - Is there one Dr you see for diabetes
           * How long taking insulin
               - DIQ050 - Taking insulin now
               - DID060 - How long taking insulin
           * Number of doctor visits in the last year
               - DID250 - Past year how many times seen doctor
           * How often blood is checked for glucose or sugar
               - DIQ180 - Had blood tested past three years
               - DIQ070 - Take diabetic pills to lower blood sugar
               - DID260 - How often check blood for glucose/sugar
               - DIQ275 - Past year Dr checked for A1C
                   * Glycosylated (GLY-KOH-SIH-LAY-TED) hemoglobin or the "A one C" test measures your average level of blood sugar for the past 3 months, and usually ranges between 5.0 and 13.9. During the past 12 months, has a doctor or other health professional checked {your/SP's} glycosylated hemoglobin or "A one C"?
               - DIQ280 - What was your last A1C level
               - DIQ291 - What does Dr say A1C should be
                   * What does {your/SP's} doctor or other health professional say {your/his/her} "A one C" level should be? (Pick the lowest level recommended by your health care professional.)
           * What the doctor says your blood pressure should be
           * Most recent LDL cholesterol number
           * What the doctor says your LDL cholesterol should be
           * Number of times a doctor checked feet for sores or irritations in the last year
           * How often you check your feet for sores or irritations
           * Eyes affected
               - DIQ360 - Last time had pupils dilated for exam
               - DIQ080 - Diabetes affected eyes/had retinopathy

           
           * Variable names across survey cycles

| Label                                  | 1999–2000 | 2001–2004 | 2005–2008 | 2009–Mar2020 |
|----------------------------------------|-----------|-----------|-----------|--------------|
| Age when first told you had diabetes   | DIQ040G   | DID040G   | DID040    | DID040       |
| Number of years of age                 | DIQ040Q   | DID040Q   |           |              |
| How long taking insulin                | DIQ060G   | DID060G   | DID060    | DID060       |
| Number of mos/yrs taking insulin       | DIQ060Q   | DID060Q   |           |              |
| Take diabetic pills to lower blood sugar | DIQ070    | DIQ070    | DID070    | DIQ070       |
| Past year times Dr check feet for sore | NA        | NA        | DID340    | DID341       |
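
Because the same question appears under different variable names depending on the survey cycle, combining cycles needs a renaming step first. The sketch below shows one possible way to do that with pandas, using two rows of the table above. The harmonized names (`takes_diabetic_pills`, `feet_checks_past_year`), the `harmonize` helper, and the toy values are assumptions made for illustration; `SEQN` is the NHANES respondent sequence number.

```python
import pandas as pd

# Cycle group -> {cycle-specific name: harmonized name}, transcribed from two rows
# of the table above. The harmonized names are hypothetical labels for this sketch.
RENAMES = {
    "1999-2000": {"DIQ070": "takes_diabetic_pills"},
    "2001-2004": {"DIQ070": "takes_diabetic_pills"},
    "2005-2008": {"DID070": "takes_diabetic_pills", "DID340": "feet_checks_past_year"},
    "2009-Mar2020": {"DIQ070": "takes_diabetic_pills", "DID341": "feet_checks_past_year"},
}


def harmonize(df: pd.DataFrame, cycle_group: str) -> pd.DataFrame:
    """Rename cycle-specific columns to shared names and tag the cycle group.

    Only names are aligned here; value codings can differ between cycles and
    still need to be checked against each cycle's codebook.
    """
    mapping = {old: new for old, new in RENAMES[cycle_group].items() if old in df.columns}
    return df.rename(columns=mapping).assign(cycle_group=cycle_group)


# Toy rows standing in for real NHANES extracts (all values are made up).
early = pd.DataFrame({"SEQN": [1], "DIQ070": [2]})
mid = pd.DataFrame({"SEQN": [2, 3], "DID070": [1, 2], "DID340": [3, 0]})
late = pd.DataFrame({"SEQN": [4], "DIQ070": [1], "DID341": [2]})

combined = pd.concat(
    [
        harmonize(early, "1999-2000"),
        harmonize(mid, "2005-2008"),
        harmonize(late, "2009-Mar2020"),
    ],
    ignore_index=True,
)
print(combined)
```

Variables that were not collected in a cycle (the `NA` cells in the table) simply end up as missing values after the concatenation, which keeps the gap visible instead of hiding it.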


* When working with public datasets, it's important to carefully review the data documentation, understand the limitations, and ensure compliance with any data usage policies. Additionally, it's recommended to combine these datasets with relevant clinical guidelines or expert knowledge to develop robust decision support systems for diabetes management.
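
As one concrete example of such a limitation, the Pima Indians data is commonly reported to use `0` as a placeholder in measurements where a true zero is not plausible (for example BMI or blood pressure). The sketch below shows one way to make those gaps explicit before modelling. The choice of columns, the `mark_implausible_zeros` helper, and the sample values are assumptions for illustration and should be checked against the dataset documentation.

```python
import numpy as np
import pandas as pd

# Columns where a recorded 0 is treated as "not measured" (an assumption of this sketch).
ZERO_MEANS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def mark_implausible_zeros(df: pd.DataFrame, columns=ZERO_MEANS_MISSING) -> pd.DataFrame:
    """Return a copy in which zeros in the given columns are replaced by NaN."""
    cols = list(columns)
    out = df.copy()
    out[cols] = out[cols].replace(0, np.nan)
    return out


# Made-up rows using the column names listed for the Pima dataset above.
pima_sample = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 120],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 0],
})

cleaned = mark_implausible_zeros(pima_sample)
print(cleaned.isna().sum())  # number of values now flagged as missing, per column
```

`Pregnancies` and `Outcome` are left untouched because zero is a legitimate value for both.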

## Data selection 
I will start with dataset 2 and then look at the others. Based on what we learn, we will decide on further selection or integration.

## Attributes of T2D-UCI Dataset 

* Source: https://www.hindawi.com/journals/bmri/2014/781670/tab1/

| Feature name | Type | Description and values | % missing |
|--------------|------|------------------------|-----------|
| Encounter ID | Numeric | Unique identifier of an encounter | 0% |
| Patient number | Numeric | Unique identifier of a patient | 0% |
| Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2% |
| Gender | Nominal | Values: male, female, and unknown/invalid | 0% |
| Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) | 0% |
| Weight | Numeric | Weight in pounds. | 97% |
| Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0% |
| Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0% |
| Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0% |
| Time in hospital | Numeric | Integer number of days between admission and discharge | 0% |
| Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52% |
| Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53% |
| Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0% |
| Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0% |
| Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0% |
| Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0% |
| Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0% |
| Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0% |
| Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0% |
| Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0% |
| Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1% |
| Number of diagnoses | Numeric | Number of diagnoses entered to the system | 0% |
| Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured | 0% |
| A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. | 0% |
| Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” | 0% |
| Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” | 0% |
| 24 features for medications | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed | 0% |
| Readmitted | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission. | 0% |
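
The table suggests a few preprocessing steps: columns such as Weight (97% missing), Payer code (52%), and Medical specialty (53%) are too sparse to use directly, Age arrives as an interval label rather than a number, and Readmitted can be reduced to a binary 30-day target. A minimal pandas sketch of those steps follows. The column names (`weight`, `race`, `age`, `readmitted`), the `?` missing-value marker, the 50% threshold, and both helper functions are assumptions for illustration; the spellings in the actual file may differ.

```python
import re

import numpy as np
import pandas as pd


def interval_midpoint(label: str) -> float:
    """Turn an age bucket such as "[50, 60)" into its midpoint (55.0)."""
    low, high = (int(n) for n in re.findall(r"\d+", label))
    return (low + high) / 2


def prepare(df: pd.DataFrame, missing_marker: str = "?", max_missing: float = 0.5):
    """Drop sparse columns and derive a numeric age and a 30-day readmission flag."""
    df = df.replace(missing_marker, np.nan)
    missing_share = df.isna().mean()  # fraction of missing values per column
    kept = df.loc[:, missing_share <= max_missing].copy()
    kept["age_mid"] = kept["age"].map(interval_midpoint)
    kept["readmitted_30d"] = (kept["readmitted"] == "<30").astype(int)
    return kept, missing_share


# Made-up encounters; column names and the "?" marker are assumptions.
sample = pd.DataFrame({
    "weight": ["?", "?", "?", "190"],
    "race": ["Caucasian", "?", "African American", "Asian"],
    "age": ["[0, 10)", "[50, 60)", "[70, 80)", "[90, 100)"],
    "readmitted": ["No", "<30", ">30", "No"],
})

prepared, missing_share = prepare(sample)
print(missing_share)
print(prepared)
```

Treating only `<30` as the positive class is one possible target definition; `>30` could also be kept as a separate class depending on the question being asked.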
