# Predicting Different Progressive Levels of Alzheimer's Disease

This project was carried out within the scope of UpSchool by Eda AYDIN.

## Table of Contents

- [A. Business Understanding - Project Objective](#a-business-understanding---project-objective)
- [B. Data Understanding](#b-data-understanding)
- [C. Data Analysis](#c-data-analysis)
- [D. Feature Engineering](#d-feature-engineering)
- [E. Modeling](#e-modeling)
- [F. Evaluation](#f-evaluation)

## A. Business Understanding - Project Objective

This is an optional model development project on a real dataset for predicting the progressive levels of Alzheimer's disease (AD). Students are expected to use the TensorFlow library for modeling and will be asked to submit predicted labels for a test dataset, which will be used to score their work objectively.

In this project, you are expected to build a data science model that determines the level of Alzheimer's disease. The levels are ordinal categories, listed from lowest to highest: 0, 0.25, 0.50, 1.0, 2.0, 3.0.
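
Because the levels listed above are ordinal but not evenly spaced, a classifier usually needs them mapped to consecutive integer class indices first. The following is a minimal sketch of such a mapping, not the project's actual preprocessing; the names `LEVELS`, `encode_level`, and `decode_index` are hypothetical and chosen only for illustration.

```python
# Ordinal target levels as listed in the project description, lowest to highest.
LEVELS = [0.0, 0.25, 0.5, 1.0, 2.0, 3.0]

# Hypothetical lookup table: level -> consecutive class index (0..5).
LEVEL_TO_INDEX = {level: index for index, level in enumerate(LEVELS)}


def encode_level(level):
    """Return the integer class index for a target level (assumed mapping)."""
    return LEVEL_TO_INDEX[float(level)]


def decode_index(index):
    """Return the original level for a predicted class index."""
    return LEVELS[index]
```

Keeping the indices in the same order as the levels preserves the ordinal structure, so a predicted index can be converted back to a level with `decode_index` before submission.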

You are expected to use the following features:

`["EDUC", "NACCMOCA", "MARISTAT", "NACCFAM", "NACCGDS", "NACCNE4S", "NACCAPOE", "INDEPEND", "RESIDENC", "ANYMEDS", "NACCAMD", "DEL", "HALL", "DEPD", "ANX", "APA", "DISN", "IRR", "MOT", "AGIT", "ELAT", "NITE", "APP", "DROPACT", "NACCAGEB", "SEX"]`

The dataset should be split into 70% training, 15% validation, and 15% test data. The target metric is the F1 score.
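
The split and the metric can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `split_indices` and `macro_f1`, the seed value, and the row count of 1000 are hypothetical, and macro averaging is an assumption because the project description does not say how the F1 score is averaged across the levels.

```python
import random


def split_indices(n_rows, train=0.70, val=0.15, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    The test part receives whatever remains after train and validation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_train = round(n_rows * train)
    n_val = round(n_rows * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical usage with 1000 rows.
train_idx, val_idx, test_idx = split_indices(1000)
```

In practice, scikit-learn's `train_test_split` (with its `stratify` argument, so that rare levels appear in every part) and `f1_score` cover the same ground; the hand-written version here only makes the arithmetic explicit.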

## B. Data Understanding

### B.1 Data Short Information


| Index | Variable Name | Section | Detail Section | Variable type | Data type | Short Descriptor | Data Source | Allowable codes | Missing Codes | Description / derivation |
| ----- | ------------- | ------- | -------------- | ------------- | --------- | ---------------- | ----------- | --------------- | ------------- | ------------------------ |
| 1 | SEX | A1 - Subject Demographics | | Original UDS question | Numeric cross-sectional | Subject's sex | rdd | 1 = Male<br>2 = Female | | |
| 2 | EDUC | A1 - Subject Demographics | | Original UDS question | Numeric cross-sectional | Years of education | rdd | 0 - 36<br>99 = Unknown | | In general,<br>12 = high school or GED,<br>16 = bachelor's degree,<br>18 = master's degree,<br>20 = doctorate.<br>Note that although this variable is not collected at follow-up visits, the value from the initial visit will be shown at all follow-up visits. |
| 3 | MARISTAT | A1 - Subject Demographics | | Original UDS question | Numeric longitudinal | Marital status | rdd | 1 = Married<br>2 = Widowed<br>3 = Divorced<br>4 = Separated<br>5 = Never married (or marriage was annulled)<br>6 = Living as married/domestic partner<br>9 = Other or unknown | | Note that in v1–2 there was an option for “other” status. These have been recoded to maristat = 9. |
| 4 | INDEPEND | A1 - Subject Demographics | | Original UDS question | Numeric longitudinal | Level of independence | rdd | 1 = Able to live independently<br>2 = Requires some assistance with complex activities<br>3 = Requires some assistance with basic activities<br>4 = Completely dependent<br>9 = Unknown | | |
| 5 | RESIDENC | A1 - Subject Demographics | | Original UDS question | Numeric longitudinal | Type of residence | rdd | 1 = Single- or multi-family private residence (apartment, condo, house)<br>2 = Retirement community or independent group living<br>3 = Assisted living, adult family home, or boarding home<br>4 = Skilled nursing facility, nursing home, hospital, or hospice<br>9 = Other or unknown | | Note that in v1–2 there was an option for “other” type of residence. These have been recoded to residenc = 9. |
| 6 | NACCAGEB | A1 - Subject Demographics | | NACC derived variable | Numeric cross-sectional | Subject's age at initial visit | rdd | 18 - 120 | | Birth month and year are required elements in the UDS; however, birth day is not collected. To calculate naccageb, birth day is set to 1 for all subjects. Baseline age is then computed as initial visit date minus birth date. Note that although this variable is listed for all visits, it does not change across visits; it is cross-sectional. |
| 7 | NACCFAM | A3 - Subject Family History | | NACC derived variable | Numeric cross-sectional | Indicator of first-degree family member with cognitive impairment | rdd | 0 = No report of a first-degree family member with cognitive impairment<br>1 = Report of at least one first-degree family member with cognitive impairment<br>9 = Unknown<br>\-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | UDS Form A3 version 1–2, submitted at all available visits: Subjects reporting at least one parent, sibling, or child with dementia at any visit will have naccfam = 1. Subjects who report no first-degree family members with dementia at all visits where Form A3 is submitted will have naccfam = 0.<br>UDS Form A3 version 3.0 or subsequent versions, submitted at all available visits: If at least one parent, sibling, or child is reported to have both a primary neurological problem/psychiatric condition of cognitive impairment/behavior change (coded as 1) and one of the primary diagnosis codes listed below at any visit, then naccfam = 1. Subjects who report all first-degree family members as having a family history absent of cognitive impairment/psychiatric condition (primary neurological problem/psychiatric condition coded as 2–8) or a primary neurological problem/psychiatric condition is reported (coded as 1), but a code other than those listed below is reported, will have naccfam = 0.<br>For subjects with Form A3 data from multiple form versions, all available data will be included in the calculation of naccfam. For example, if a family history of cognitive impairment is indicated on Form A3 using v3.0 but not on a previous version using v1–2, the subject will still have naccfam = 1.<br>Those with a submitted Form A3 (any version) who are missing data on all first-degree family members are coded as Unknown (naccfam = 9).<br>If some first-degree family members are coded as No and some are coded as Unknown, then they are all coded as Unknown (naccfam = 9).<br>In general, a known history of cognitive impairment reported at any visit supersedes all visits with missing codes. Likewise, an indication of cognitive impairment at any visit supersedes all other visits where a history of cognitive impairment is indicated as not present. In all other conditions where reporting varies, data from the most recent visit are used to calculate naccfam.<br>If Form A3 was never submitted for any version of the UDS, naccfam will take a value of -4. Note that although this variable is listed for all visits, it does not change across visits; it is cross-sectional. |
| 8 | ANYMEDS | A4 - Subject Medications | | Original UDS question | Numeric longitudinal | Subject taking any medications | rdd | 0 = No<br>1 = Yes<br>\-4 = Did not complete medications form | | If the medications form was not completed, then anymeds = -4. |
| 9 | NACCAMD | A4 - Subject Medications | | NACC derived variable | Numeric longitudinal | Total number of medications reported at each visit | rdd | 0 - 40<br>\-4 = Did not complete medications form | | This variable provides the total number of medications reported at a visit, including all prescription and over-the-counter medications reported on UDS Form A4 at a single visit. If the medications form was not completed, then naccamd = -4. |
| 10 | CDRGLOB | B4 CDR® Plus NACC FTLD | | Original UDS question | Numeric longitudinal | Global CDR® | rdd | 0.0 = No impairment<br>0.5 = Questionable impairment<br>1.0 = Mild impairment<br>2.0 = Moderate impairment<br>3.0 = Severe impairment | | |
| 11 | DEL | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Delusions in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (del = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 12 | HALL | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Hallucinations in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (hall = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 13 | AGIT | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Agitation or aggression in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (agit = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 14 | DEPD | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Depression or dysphoria in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (depd = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 15 | ANX | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Anxiety in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (anx = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 16 | ELAT | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Elation or euphoria in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (elat = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 17 | APA | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Apathy or indifference in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (apa = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 18 | DISN | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Disinhibition in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (disn = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 19 | IRR | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Irritability or lability in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (irr = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 20 | MOT | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Motor disturbance in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (mot = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 21 | NITE | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Nighttime behaviors in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (nite = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 22 | APP | B5 Neuropsychiatric Inventory Questionnaire (NPI-Q) | | Original UDS question | Numeric longitudinal | Appetite and eating problems in the last month | rdd | 0 = No<br>1 = Yes<br>9 = Unknown<br>-4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question | | An option of Unknown (app = 9) was added to UDS v3.0 and subsequent versions. Also note that the wording in v3.0 and subsequent versions changed to be consistent with the way the NPI-Q was originally intended to be completed; the wording changes are not expected to affect the essential meaning of the question. |
| 23    | NACCGDS       | B6 Geriatric Depression Scale (GDS)                 |                | NACC derived variable | Numeric longitudinal    | Total GDS Score                                                   | rdd         | 0 - 15<br>88 = Could not be calculated<br>\- 4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question                                                                                                                                   |               | In earlier versions of the UDS, Centers were not given instructions on how to calculate the total GDS score if three or fewer GDS items were missing. NACC has created a new derived variable for Total GDS score so that subjects who were given the GDS in the earlier versions of UDS v1 will have a total GDS score if they skipped three or fewer items on the questionnaire. If the subject was missing more than three of the 15 items on the GDS for any UDS version, naccgds = 88. The UDS Coding Guidebook for Form B6 provides the algorithm for calculating the GDS score when three or fewer items are missing.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 24    | DROPACT       | B6 Geriatric Depression Scale (GDS)                 |                | Original UDS question | Numeric longitudinal    | Have you dropped many of your activities and interests?           | rdd         | 0 = No<br>1 = Yes<br>9 = Did not answer<br>\- 4 = Not available: UDS form submitted did not collect data in this way, or a skip pattern precludes response to this question<br>                                                                                                                              |               | Note that an option of 9 = Did not answer was added to UDS v3.0 and subsequent versions.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 25    | NACCAPOE      |                                                     |                | NACC derived variable | Numeric cross-sectional | APOE genotype                                                     | rdd-genetic | 1 = e3,e3<br>2 = e3,e4<br>3 = e3,e2<br>4 = e4,e4<br>5 = e4,e2<br>6 = e2,e2<br>9 = Missing/ unknown/ not assessed                                                                                                                                                                                             |               | APOE genotype is run independently by the ADC and reported to NACC on the NACC Neuropathology Form. APOE genotype is also reported by ADGC and NCRAD. In the rare case that the ADC-reported genotype and the genotype reported by ADGC are not the same, the genotype is set to 9 = Missing for that subject.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 26    | NACCNE4S      |                                                     |                | NACC derived variable | Numeric cross-sectional | Number of APOE e4 alleles                                         | rdd-genetic | 0 = No e4 allele<br>1 = 1 copy of e4 allele<br>2 = 2 copies of e4 allele<br>9 = Missing/ unknown/ not assessed                                                                                                                                                                                               |               | APOE genotype is run independently by the ADC and reported to NACC on the NACC Neuropathology Form. APOE genotype is also reported by ADGC and NCRAD. In the rare case that the ADC-reported genotype and the genotype reported by ADGC are not the same, the genotype is set to 9 = Missing for that subject.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |

### B.2 Import Libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from imblearn.over_sampling import SMOTE # for oversampling
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score

In [3]:
%pip install fast_ml --quiet
%pip install -U tensorflow-addons


In [6]:
from fast_ml.model_development import train_valid_test_split
import tensorflow as tf
import tensorflow_addons as tfa
# Import Keras through TensorFlow. Importing the standalone `keras` package in this
# cell raised "ImportError: cannot import name 'get_config' from
# 'tensorflow.python.eager.context'", which usually points to mismatched
# keras and tensorflow versions.
from tensorflow.keras import backend as K
from tensorflow.keras.regularizers import l2

In [None]:
# If you run this code in VSCode or PyCharm, use this code segment.
# TensorFlow GPU does not support versions 12.0 and higher.
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [5]:
%pip install tensorflow-addons --upgrade

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

### B.3 Data Preprocessing

#### B.3.1 Collection of Raw Data

In [None]:
df = pd.read_excel("data/data.xlsx")
df_main = df.copy()  # untouched copy for the baseline model

#### B.3.2 Data Short Description

In [None]:
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("\n")
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("\n")
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("\n")
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("\n")
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("\n")
    print("##################### Quantiles #####################")
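    # numeric_only=True skips non-numeric columns, which would otherwise raise in pandas 2.x
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)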

In [None]:
check_df(df)

### B.4 Variable Type and Data Structure Consistency

#### B.4.1 Demystifying Variables Type (Numerical / Categorical)

In [None]:
def grab_col_names(dataframe, categorical_threshold=10, cardinal_threshold=20):
    """
    Return the names of the categorical, numerical, cardinal and nominal variables in the data set.
    Note: Numeric-looking variables with few unique values (nominal) are also counted as categorical.

    Parameters
    ----------
    dataframe : dataframe
        The dataframe from which variable names are to be retrieved.
    categorical_threshold : int, optional
        Class threshold for numeric but categorical variables
    cardinal_threshold : int, optional
        Class threshold for categorical but cardinal variables

    Returns
    -------
        categorical_cols : list
            Categorical variable list
        numerical_cols : list
            Numerical variable list
        cardinal_cols : list
            Categorical looking cardinal variable list
        nominal_cols : list
            Numeric looking categorical variable list

    Examples
    -------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    Notes
    -------
        categorical_cols + numerical_cols + cardinal_cols = total number of variables.
        nominal_cols is inside categorical_cols.

    """

    categorical_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    nominal_cols = [col for col in dataframe.columns if
                    dataframe[col].nunique() < categorical_threshold and dataframe[col].dtypes != "O"]
    cardinal_cols = [col for col in dataframe.columns if
                     dataframe[col].nunique() > cardinal_threshold and dataframe[col].dtypes == "O"]
    categorical_cols = categorical_cols + nominal_cols
    categorical_cols = [col for col in categorical_cols if col not in cardinal_cols]

    # numerical_cols
    numerical_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    numerical_cols = [col for col in numerical_cols if col not in categorical_cols]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'categorical_cols: {len(categorical_cols)}')
    print(f'numerical_cols: {len(numerical_cols)}')
    print(f'cardinal_cols: {len(cardinal_cols)}')
    print(f'nominal_cols: {len(nominal_cols)}')
    return categorical_cols, numerical_cols, cardinal_cols, nominal_cols

In [None]:
# Separate the variable types
categorical_cols, numerical_cols, cardinal_cols, nominal_cols = grab_col_names(df, categorical_threshold=5, cardinal_threshold=20)

In [None]:
print("Categorical column names: {} \n".format(categorical_cols))
print("Numerical column names: {}\n".format(numerical_cols))
print("Cardinal column names: {}\n".format(cardinal_cols))
print("Nominal column names: {}".format(nominal_cols))

#### B.4.2 Data Structure Control (Float / String / Integer)

In [None]:
"""

Description:
-----------

Reports the missing-value ratios and unique values of each column in a given dataframe.


R&D:
---

Add '#_infinity_' column to the dataframe

"""

def MissingUniqueStatistics(df):

  # Only the modules used below are imported here
  import psutil, os, gc, time

  print("MissingUniqueStatistics process has begun:\n")
  proc = psutil.Process(os.getpid())
  gc.collect()
  mem_0 = proc.memory_info().rss
  start_time = time.time()


  variable_name_list = []
  total_entry_list = []
  data_type_list = []
  unique_values_list = []
  number_of_unique_values_list = []
  missing_value_number_list = []
  missing_value_ratio_list = []
  mean_list=[]
  std_list=[]
  min_list=[]
  Q1_list=[]
  Q2_list=[]
  Q3_list=[]
  max_list=[]

  df_statistics = df.describe().copy()

  for col in df.columns:

    variable_name_list.append(col)
    total_entry_list.append(df.loc[:,col].shape[0])
    data_type_list.append(df.loc[:,col].dtype)
    unique_values_list.append(list(df.loc[:,col].unique()))
    number_of_unique_values_list.append(len(list(df.loc[:,col].unique())))
    missing_value_number_list.append(df.loc[:,col].isna().sum())
    missing_value_ratio_list.append(round((df.loc[:,col].isna().sum()/df.loc[:,col].shape[0]),4))

    try:
      # Look up statistics by label; positional access such as [1] is deprecated in recent pandas
      col_stats = df_statistics.loc[:, col]
      mean_list.append(col_stats["mean"])
      std_list.append(col_stats["std"])
      min_list.append(col_stats["min"])
      Q1_list.append(col_stats["25%"])
      Q2_list.append(col_stats["50%"])
      Q3_list.append(col_stats["75%"])
      max_list.append(col_stats["max"])
    except KeyError:
      # Non-numeric columns have no entry in df.describe()
      mean_list.append('NaN')
      std_list.append('NaN')
      min_list.append('NaN')
      Q1_list.append('NaN')
      Q2_list.append('NaN')
      Q3_list.append('NaN')
      max_list.append('NaN')

  data_info_df = pd.DataFrame({'Variable': variable_name_list,
                               '#_Total_Entry':total_entry_list,
                               '#_Missing_Value': missing_value_number_list,
                               '%_Missing_Value':missing_value_ratio_list,
                               'Data_Type': data_type_list,
                               'Unique_Values': unique_values_list,
                               '#_Unique_Values':number_of_unique_values_list,
                               'Mean':mean_list,
                               'STD':std_list,
                               'Min':min_list,
                               'Q1':Q1_list,
                               'Q2':Q2_list,
                               'Q3':Q3_list,
                               'Max':max_list
                               })

  data_info_df = data_info_df.set_index("Variable", inplace=False)


  print('MissingUniqueStatistics process has been completed!')
  print("--- in %s minutes ---" % ((time.time() - start_time)/60))

  return data_info_df.sort_values(by='%_Missing_Value', ascending=False)#, HTML(df.to_html(escape=False, formatters=dict(col=mapping)))


In [None]:
data_info = MissingUniqueStatistics(df)
# data_info = data_info.set_index("Variable")
data_info

### B.5 Building the Target Variable (Regression or Classification)

In [None]:
%matplotlib inline
# Histogram of the target categories
def histogram(df,feature):
    #df = input("Enter a DataFrame name: ")
    #col = input("Enter a target column name: ")
    #df=eval(df)
    ncount = len(df)
    ax = sns.countplot(x = feature, data=df ,palette="hls")
    sns.set(font_scale=1)
    ax.set_xlabel('Target Segments')
    plt.xticks(rotation=90)
    ax.set_ylabel('Number of Observations')
    fig = plt.gcf()
    fig.set_size_inches(12,5)
    # Make twin axis
    ax2=ax.twinx()
    # Switch so count axis is on right, frequency on left
    ax2.yaxis.tick_left()
    ax.yaxis.tick_right()
    # Also switch the labels over
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]')
    for p in ax.patches:
        x=p.get_bbox().get_points()[:,0]
        y=p.get_bbox().get_points()[1,1]
        ax.annotate('{:.2f}%'.format(100.*y/ncount), (x.mean(), y),
                ha='center', va='bottom') # set the alignment of the text
    # Use a LinearLocator to ensure the correct number of ticks
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))
    # Fix the frequency range to 0-100
    ax2.set_ylim(0,100)
    ax.set_ylim(0,ncount)
    # And use a MultipleLocator to ensure a tick spacing of 10
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(10))
    # Turn the grid on ax2 off, otherwise the gridlines end up on top of the bars
    ax2.grid(False)
    plt.title('Histogram of Target Categories', fontsize=20, y=1.08)
    plt.show()
    #plt.savefig('col.png')

In [None]:
target= "CDRGLOB"

histogram(df,target)

The histogram shows that the target classes are imbalanced. We will use oversampling to address this.
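
The notebook imports `SMOTE` from `imblearn` for this step. As a dependency-free illustration of what oversampling does to the class counts, here is a minimal sketch of plain random oversampling. This is a simpler stand-in for SMOTE, which synthesizes new points instead of duplicating rows. The toy labels and the `random_oversample` helper are hypothetical and not part of the project code.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to the current class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) as many extra rows as the class is short of
        extra = [rng.choice(pool) for _ in range(target_size - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical toy data: CDR-like labels with a strong imbalance
labels = [0.0] * 6 + [0.5] * 3 + [1.0] * 1
rows = [[i] for i in range(len(labels))]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Whichever method is used, oversampling is normally applied to the training split only, so that the validation and test sets keep the original class distribution.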

## C. Data Analysis

### C.1 Preparatory Data Analysis (PDA)

#### C.1.1 Data Dropping

We already did this at the outset, keeping only the variables listed in the form given to us.

#### C.1.2 Outlier Handling

##### C.1.2.1 Finding Outliers

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquartile_range = quartile3 - quartile1
    up_limit = quartile3 + (1.5 * interquartile_range)
    low_limit = quartile1 - (1.5 * interquartile_range)
    return low_limit, up_limit
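
To make the threshold rule concrete, here is a small worked sketch in plain Python. It assumes the linear-interpolation quantile that pandas uses by default; the `quantile_linear` and `iqr_limits` helpers and the sample values are hypothetical.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def iqr_limits(values, q1=0.25, q3=0.75):
    """Lower and upper limits at 1.5 * IQR beyond the two quantiles."""
    quartile1 = quantile_linear(values, q1)
    quartile3 = quantile_linear(values, q3)
    iqr = quartile3 - quartile1
    return quartile1 - 1.5 * iqr, quartile3 + 1.5 * iqr

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical sample
low_limit, up_limit = iqr_limits(values)

# Values outside the limits are flagged as outliers
outliers = [v for v in values if v < low_limit or v > up_limit]
# Capping pulls those values back to the nearest limit
capped = [min(max(v, low_limit), up_limit) for v in values]

print(low_limit, up_limit, outliers)
```

For this sample the quartiles are 3 and 7, so the interquartile range is 4 and the limits are -3 and 13; only the value 100 falls outside and would be capped at 13.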

In [None]:
def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    # True if at least one value falls outside the IQR-based limits
    is_outlier = (dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)
    return bool(is_outlier.any())

In [None]:
for col in numerical_cols:
    print(col, check_outlier(df, col))

##### C.1.2.2 Accessing Outliers

In [None]:
def grab_outliers(dataframe, col_name, index=False):
    low, up = outlier_thresholds(dataframe, col_name)
    if dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].shape[0] > 10:
        print(dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].head())
    else:
        print(dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))])

    if index:
        outlier_index = dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].index
        return outlier_index

In [None]:
for col in numerical_cols:
  grab_outliers(df,col,True)

##### C.1.2.3 Solving the Outlier Problem

In [None]:
def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
for col in numerical_cols:
    replace_with_thresholds(df, col)

In [None]:
for col in numerical_cols:
    print(col, check_outlier(df, col))

#### C.1.3 Missing Data Handling

In [None]:
def missing_values_table(dataframe, na_name=False):

    # Column Names with Missing Values
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]

    # Number of Missing Values of One Column
    number_of_missing_values = dataframe[na_columns].isnull().sum().sort_values(ascending=False)

    # Percentage Distribution of Missing Data
    percentage_ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)

    # Dataframe with Missing Data
    missing_df = pd.concat([number_of_missing_values, np.round(percentage_ratio, 2)], axis=1, keys=['number_of_missing_values', 'percentage_ratio'])

    print(missing_df, end="\n")

    if na_name:
        return na_columns

In [None]:
missing_values_table(df)
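
The data dictionary in B.1 records missing information with numeric sentinel codes (for example 9 = Unknown, 88 = Could not be calculated, -4 = Not available) rather than empty cells, so `isnull()` does not count them. The sketch below shows one way to convert such codes to `NaN` column by column before building the missing-value table. The toy frame is hypothetical, and the code lists are an assumption based on the rows shown in B.1; they should be checked against the full data dictionary.

```python
import pandas as pd

# Hypothetical toy frame using column names from the data dictionary
toy = pd.DataFrame({
    "APP": [0, 1, 9, -4],
    "NACCGDS": [3, 9, 88, -4],   # 9 is a valid GDS score here, 88 is not
    "NACCNE4S": [0, 1, 2, 9],
})

# Assumed sentinel codes per column; verify against the full data dictionary
missing_codes = {
    "APP": [9, -4],
    "NACCGDS": [88, -4],
    "NACCNE4S": [9],
}

cleaned = toy.copy()
for col, codes in missing_codes.items():
    # Series.mask replaces the values where the condition is True with NaN
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(codes))

print(cleaned.isnull().sum())
```

The mapping is kept per column because the same number can be a sentinel in one variable and a legitimate value in another, as with 9 in `NACCGDS`.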

### C.2 Exploratory Data Analysis

#### C.2.1 Basic Visualization

##### C.2.1.1 Univariate plots (PDF - PMF / BoxPlot / QQ Plot)

###### C.2.1.1.1 Analysis of Categorical Variables

In [None]:
def cat_summary(dataframe, col_name, plot=False,savefig=False):
    """
    Print a summary of a categorical column and optionally plot it.

    Args:
        dataframe (dataframe): The dataframe that contains the column.
        col_name (string): The name of the categorical column to summarize.
        plot (bool, optional): Plot the figure of the specified column. Defaults to False.
        savefig (bool, optional): Save the figure of the specified column to disk. Defaults to False.
    """
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("########################################## \n")

    if plot:
        # Order the bars using the dataframe passed in, not the global df
        ax = sns.countplot(x=dataframe[col_name], data=dataframe,
                           order = dataframe[col_name].value_counts().index)

        ncount = len(dataframe)
        sns.set(font_scale = 1)

        for p in ax.patches:
            x = p.get_bbox().get_points()[:, 0]
            y = p.get_bbox().get_points()[1, 1]
            ax.annotate('{:.2f}%'.format(100.*y/ncount), (x.mean(), y),ha='center', va='bottom')  # set the alignment of the text

        # Use a LinearLocator to ensure the correct number of ticks
        ax.yaxis.set_major_locator(ticker.LinearLocator(11))

        plt.xticks(rotation=45)
        plt.title("{} Count Graph".format(col_name.capitalize()))
        if savefig:
            plt.savefig("{} Count Graph.png".format(col_name.capitalize()))
        plt.show(block=True)

In [None]:
# Examine the categorical variables

for col in categorical_cols:
    cat_summary(df, col, plot=True, savefig=False)

###### C.2.1.1.2 Analysis of Target Variable with Categorical Variables

In [None]:
def target_summary_with_categorical_data(dataframe, target, categorical_col):
    """
    Print the mean of the target column for each level of the specified categorical column.

    Args:
        dataframe (dataframe): The dataframe that contains both columns.
        target (string): The name of the target column.
        categorical_col (string): The name of the categorical column to group by.
    """
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")

In [None]:
for col in categorical_cols:
    target_summary_with_categorical_data(dataframe=df, target = target, categorical_col=col)

###### C.2.1.1.3 Analysis of Numerical Variables

In [None]:
# Examine the numerical variables
df[numerical_cols].describe().T

In [None]:
def num_summary(dataframe, col_name, plot=False, savefig=False):
    """
    Print summary statistics of a numerical column and optionally plot its histogram.

    Args:
        dataframe (dataframe): The dataframe that contains the column.
        col_name (string): The name of the numerical column to summarize.
        plot (bool, optional): Plot the figure of the specified column. Defaults to False.
        savefig (bool, optional): Save the figure of the specified column to disk. Defaults to False.

    """
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[col_name].describe(quantiles).T)

    if plot:
        dataframe[col_name].hist()
        plt.xlabel(col_name)
        plt.title("{} Histogram Graph".format(col_name.capitalize()))
        if savefig:
            plt.savefig("{} Histogram Graph.png".format(col_name.capitalize()))
        plt.show(block=True)

In [None]:
for col in numerical_cols:
    num_summary(df, col, True)

###### C.2.1.1.4 Analysis of Target Variable with Numerical Variables

In [None]:
def target_summary_with_numerical_data(dataframe, target, numerical_col):
    """
    Print the mean of the specified numerical column for each level of the target column.

    Args:
        dataframe (dataframe): The dataframe that contains both columns.
        target (string): The name of the target column to group by.
        numerical_col (string): The name of the numerical column to average.
    """
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

In [None]:
for col in numerical_cols:
    target_summary_with_numerical_data(df, target ,col)

##### C.2.1.2 Multivariate Plots (Scatter Plot / Histogram)

In [None]:
column_names1 = ["DEL","HALL","DEPD","ANX","APA","DISN","IRR","MOT","AGIT","ELAT","NITE","APP"]

df_pivot = df.pivot_table(values=column_names1,
                   index=["SEX","CDRGLOB"],
                   aggfunc="sum")
df_pivot

In [None]:
# Create the figure first and draw on its axes, so the size and dpi apply to the bar chart
fig, ax = plt.subplots(figsize=(8, 6), dpi=80)

df_pivot.plot.barh(stacked=True, ax=ax)
plt.title("Relationship between \n B5 Neuropsychiatric Inventory Questionnaire Features - \n Sex - Alzheimer Impairment Level")
plt.savefig("Relationship between B5 Neuropsychiatric Inventory Questionnaire Features - Sex - Alzheimer Impairment Level.png")

### C.3 Confirmatory Data Analysis

#### C.3.1 Feature Selection (Importances, Associations, and Significances)

##### C.3.1.1 Chi-Square test for Nominal Features

In [None]:
def chi2_by_hand(df, col1, col2):
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)
    #---calculate degree of freedom---
    degree_f = (df_cont.shape[0]-1) * (df_cont.shape[1]-1)
    #---sum up the totals for row and columns---
    df_cont.loc[:,'Total']= df_cont.sum(axis=1)
    df_cont.loc['Total']= df_cont.sum()

    #---create the expected value dataframe---
    df_exp = df_cont.copy()
    df_exp.iloc[:,:] = np.multiply.outer(
        df_cont.sum(1).values,df_cont.sum().values) / df_cont.sum().sum()

    # calculate chi-square values
    df_chi2 = ((df_cont - df_exp)**2) / df_exp
    df_chi2.loc[:,'Total']= df_chi2.sum(axis=1)
    df_chi2.loc['Total']= df_chi2.sum()

    #---get chi-square score---
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    #---calculate the p-value---
    from scipy import stats
    p = stats.distributions.chi2.sf(chi_square_score, degree_f)

    return chi_square_score, degree_f, p

In [None]:
nominal_cols_for_exclude = []

for col in nominal_cols:
    chi_score, degree_f, p = chi2_by_hand(df,col,target)
    if p >= 0.05:
        nominal_cols_for_exclude.append(col)
    else:
        print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')

In [None]:
nominal_cols_for_exclude
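
The statistic and degrees of freedom computed by `chi2_by_hand` can be reproduced on a tiny table in plain Python. This is only a minimal sketch: the 2x3 table of counts is made up for illustration, and `chi2_statistic` is a hypothetical helper that is not part of the notebook.

```python
def chi2_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts (illustrative helper).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up counts: rows = two categories of a feature, columns = three target levels
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, dof = chi2_statistic(observed)
print(f"chi2 = {stat:.3f}, dof = {dof}")
```

The larger the statistic relative to the degrees of freedom, the smaller the p-value, which is why features with `p >= 0.05` are collected for exclusion above.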

##### C.3.1.2 ANOVA Test for Numerical Features

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

numerical_cols_for_exclude = []

for col in numerical_cols:
    model = ols(target + '~' + col, data = df).fit() #Ordinary least square method
    result_anova = sm.stats.anova_lm(model, typ = 2) # ANOVA Test
    if result_anova.loc[col]["PR(>F)"] >= 0.05:
        numerical_cols_for_exclude.append(col) # Save column names with a p-value of 0.05 or greater
    else:
        print(result_anova)
        print("\n")

In [None]:
numerical_cols_for_exclude

In [None]:
df_new = df.drop(columns=nominal_cols_for_exclude) # data set changed

In [None]:
df_new.head()

In [None]:
# Update the variable types
categorical_cols, numerical_cols, cardinal_cols, nominal_cols = grab_col_names(df_new, categorical_threshold=6, cardinal_threshold=20)

## D. Feature Engineering

### D.1 Encoding

#### D.1.1 Label Encoding

In [None]:
def label_encoder(dataframe, binary_col):
    """
    Apply label encoding to the specified binary column.

    Args:
        dataframe (dataframe): The dataframe that contains the column.
        binary_col (string): The name of the two-class categorical column to encode.

    Returns:
        dataframe: Return the new dataframe
    """
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

In [None]:
binary_cols = [col for col in df_new.columns if df_new[col].dtypes == "O" and df_new[col].nunique() == 2]
binary_cols

In [None]:
for col in binary_cols:
    df_new = label_encoder(df_new, col)

#### D.1.2 One-Hot Encoding

In [None]:
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    """
    Apply one-hot encoding to all specified categorical columns.

    Args:
        dataframe (dataframe): The dataframe that contains the columns.
        categorical_cols (list): The names of the categorical columns to encode.
        drop_first (bool, optional): Drop the first dummy column of each variable to avoid redundant, perfectly collinear columns. Defaults to False.

    Returns:
        dataframe: Return the new dataframe
    """
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

In [None]:
one_hot_encoding_cols = [col for col in categorical_cols if 10 >= df_new[col].nunique() > 2 and col != "CDRGLOB"]

In [None]:
one_hot_encoding_cols

In [None]:
df_new = one_hot_encoder(df_new, one_hot_encoding_cols,drop_first=True)
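
To illustrate what `drop_first=True` does, here is a minimal plain-Python sketch of one-hot encoding. The helper `one_hot_rows` and the category codes are made up for illustration; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot_rows(values, prefix, drop_first=False):
    """One-hot encode a list of category values into a list of dicts."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category stays identifiable: all remaining indicators are 0
        categories = categories[1:]
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


marital_status = [1, 2, 3, 2]   # made-up category codes
encoded = one_hot_rows(marital_status, prefix="MARISTAT", drop_first=True)
for row in encoded:
    print(row)
```

With three categories, only two indicator columns remain, and the first category is represented by a row of zeros.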

In [None]:
df_new.columns = [col.upper() for col in df_new.columns]

In [None]:
# Final variable types
categorical_cols, numerical_cols, cardinal_cols, nominal_cols = grab_col_names(df_new,
                                                                               categorical_threshold=5,
                                                                               cardinal_threshold=20)

In [None]:
categorical_cols = [col for col in categorical_cols if "CDRGLOB" not in col]

In [None]:
df_new.head()

## E. Modeling

### E.1 Data Splitting

In [None]:
X_train,y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df_new, target = target,
                                                                           train_size= 0.7,
                                                                           valid_size = 0.15,
                                                                           test_size = 0.15,
                                                                           random_state=0)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
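
The helper `train_valid_test_split` is not shown in this section. As a rough idea of what a 70% / 15% / 15% split involves, the sketch below shuffles row indices and cuts them into three parts. `split_indices` is a hypothetical function, and the sketch does not stratify by class, which the actual helper may or may not do.

```python
import random


def split_indices(n_rows, train_size=0.7, valid_size=0.15, random_state=0):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_rows))
    random.Random(random_state).shuffle(indices)

    n_train = int(round(n_rows * train_size))
    n_valid = int(round(n_rows * valid_size))

    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]   # the remainder becomes the test set
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```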

### E.2 SMOTE Oversampling for train dataset

In [None]:
# Class counts in the training set before SMOTE oversampling
y_train.value_counts()

In [None]:
# Apply SMOTE (to the training set only)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=101)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

In [None]:
# Class counts in the training set after SMOTE oversampling
y_train_smote.value_counts()
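
SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step in plain Python. The points are made up, `smote_like_sample` is a hypothetical helper, and the neighbor search that `imblearn` performs is omitted.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between `point` and `neighbor`."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(101)
minority_point = [1.0, 2.0]      # made-up minority sample
nearest_neighbor = [3.0, 6.0]    # made-up neighbor from the same class
synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because only the training set is resampled, the validation and test sets keep their original class distribution.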

In [None]:
histogram(X_train_smote,y_train_smote)

In [None]:
print(X_train_smote.shape, y_train_smote.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)

In [None]:
df_after_smote = pd.merge(X_train_smote, y_train_smote , left_index = True, right_index=True)

### E.3 Data Transformation - Scaling

#### E.3.1 Transformation (Scaling Numerical Variables)

In [None]:
min_max_scaler = MinMaxScaler()

X_train_smote = min_max_scaler.fit_transform(X_train_smote)
X_valid = min_max_scaler.transform(X_valid)
X_test = min_max_scaler.transform(X_test)
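
The scaler is fitted on the training data only and then reused for the validation and test sets. A minimal plain-Python sketch of that idea follows; the helpers `fit_min_max` and `apply_min_max` and the numbers are illustrative assumptions, not the `MinMaxScaler` implementation.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned statistics: (x - min) / (max - min)."""
    return [[(x - lo) / (hi - lo) if hi != lo else 0.0
             for x, lo, hi in zip(row, mins, maxs)]
            for row in rows]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
valid = [[12.0, 15.0]]   # 12.0 lies outside the training range of the first column

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
valid_scaled = apply_min_max(valid, mins, maxs)
print(train_scaled)
print(valid_scaled)
```

Values in the validation or test set can therefore fall slightly outside the range 0 to 1, which is expected when the statistics come from the training set alone.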

### E.4 Metric, Baseline, and Estimator / Classifier Selection

#### E.4.1 Metric and Baseline Selection for Model Evaluation

##### E.4.1.1 Metric Selection

I will use accuracy, recall, precision, F1-score, and AUC as evaluation metrics.
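
Since F1 is the target metric, the sketch below shows how a macro-averaged F1 score can be computed from scratch for a multiclass problem. Macro averaging is an assumption here, because the averaging scheme is not specified in the project description. The helper `macro_f1` and the labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]   # made-up encoded class labels
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 3))
```

Macro averaging weights every class equally, which matters when some impairment levels are much rarer than others.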

##### E.4.1.2 Baseline Selection

In [None]:
X_train_baseline,y_train_baseline, X_valid_baseline, y_valid_baseline, X_test_baseline, y_test_baseline = train_valid_test_split(df_main, target = target,
                                                                           train_size= 0.7,
                                                                           valid_size = 0.15,
                                                                           test_size = 0.15,
                                                                           random_state=0)

print(X_train_baseline.shape, y_train_baseline.shape)
print(X_valid_baseline.shape, y_valid_baseline.shape)
print(X_test_baseline.shape, y_test_baseline.shape)

In [None]:
from sklearn.linear_model import LogisticRegression


lg_model = LogisticRegression().fit(X_train_baseline, y_train_baseline)
y_pred2 =lg_model.predict(X_valid_baseline)

print(classification_report(y_valid_baseline, y_pred2))

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=46).fit(X_train_baseline, y_train_baseline)
y_pred = rf_model.predict(X_valid_baseline)

print(classification_report(y_valid_baseline, y_pred))

### E.5 Estimator / Classifier Selection

In [None]:
def valid_accuracy_result(model):
    predictions = [round(value) for value in model.predict(X_valid)]
    accuracy = accuracy_score(y_valid, predictions)
    return accuracy


def test_accuracy_result(model):
    predictions = [round(value) for value in model.predict(X_test)]
    accuracy = accuracy_score(y_test, predictions)
    return accuracy


def classification_report_result(model):
    print(classification_report(y_valid, model.predict(X_valid)))

#### E.5.1 Logistic Regression Classifier

In [None]:
# Training with oversampled data

logreg = LogisticRegression().fit(X_train_smote, y_train_smote)

classification_report_result(logreg)

In [None]:
logreg_valid = valid_accuracy_result(logreg)
print("Validation Accuracy: %.2f%%" % (logreg_valid * 100.0))

In [None]:
logreg_test = test_accuracy_result(logreg)
print("Test Accuracy: %.2f%%" % (logreg_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
logreg_tunned = LogisticRegression()
logreg_tunned_params = {"penalty": ['l2', 'elasticnet'],
                        "solver":['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                        "max_iter": [50,100,200],
                        "n_jobs" : [-1],
                        "C": [0.1,1,10]}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
logreg_tunned_model = GridSearchCV(logreg_tunned, logreg_tunned_params,
                                   scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)
print(logreg_tunned_model.best_params_)
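
A grid search fits one model per parameter combination and per cross-validation fold, so the cost grows quickly with the grid. As a rough sketch, the number of candidates for the logistic regression grid above can be counted with `itertools.product`; the variable names `logreg_grid`, `candidates`, and `n_fits` are introduced here only for illustration.

```python
from itertools import product

# Same parameter grid as in the logistic regression tuning cell
logreg_grid = {"penalty": ['l2', 'elasticnet'],
               "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               "max_iter": [50, 100, 200],
               "n_jobs": [-1],
               "C": [0.1, 1, 10]}

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(logreg_grid, values))
              for values in product(*logreg_grid.values())]

n_folds = 3                           # cv=3 in the grid search
n_fits = len(candidates) * n_folds    # excludes the final refit on the full training set
print(len(candidates), n_fits)
```

Not every combination is valid for `LogisticRegression` (for example, `elasticnet` is only supported by the `saga` solver), so some of these fits are expected to fail and be scored as missing.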

In [None]:
logreg_tunned = LogisticRegression(C=0.1,
                                   max_iter = 50,
                                   n_jobs = -1,
                                   penalty="l2",
                                   solver = "newton-cg").fit(X_train_smote,y_train_smote)

In [None]:
classification_report_result(logreg_tunned)

In [None]:
logreg_valid_tunned = valid_accuracy_result(logreg_tunned)
print("Validation Accuracy: %.2f%%" % (logreg_valid_tunned * 100.0))

In [None]:
logreg_test_tunned = test_accuracy_result(logreg_tunned)
print("Test Accuracy: %.2f%%" % (logreg_test_tunned * 100.0))

#### E.5.2 Ensembled Classifiers

##### E.5.2.1 Random Forest Classifier (Bagging)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier().fit(X_train_smote, y_train_smote)

classification_report_result(rfc)

In [None]:
rfc_valid = valid_accuracy_result(rfc)
print("Validation Accuracy: %.2f%%" % (rfc_valid * 100.0))

In [None]:
rfc_test = test_accuracy_result(rfc)
print("Test Accuracy: %.2f%%" % (rfc_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
rfc_tunned = RandomForestClassifier()
rfc_tunned_params = {'max_depth': [2,10,20],
                     'n_estimators': [100,200],
                     'max_features': [4,10,30],
                     'min_samples_leaf': [2,10]}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
rfc_tunned_model = GridSearchCV(rfc_tunned, rfc_tunned_params,
                                scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)
print(rfc_tunned_model.best_params_)

In [None]:
rfc_tunned = RandomForestClassifier(max_depth = 2,
                                    max_features = 4,
                                    min_samples_leaf=2,
                                    n_estimators=100).fit(X_train_smote,y_train_smote)

In [None]:
classification_report_result(rfc_tunned)

In [None]:
rfc_valid_tunned = valid_accuracy_result(rfc_tunned)
print("Validation Accuracy: %.2f%%" % (rfc_valid_tunned * 100.0))

In [None]:
rfc_test_tunned = test_accuracy_result(rfc_tunned)
print("Test Accuracy: %.2f%%" % (rfc_test_tunned * 100.0))

##### E.5.2.2 Gradient Boosting Classifiers

###### E.5.2.2.1 XGBoost

In [None]:
from xgboost import XGBClassifier

xgboost_model = XGBClassifier().fit(X_train_smote, y_train_smote)

In [None]:
classification_report_result(xgboost_model)

In [None]:
xgboost_model_valid = valid_accuracy_result(xgboost_model)
print("Validation Accuracy: %.2f%%" % (xgboost_model_valid * 100.0))

In [None]:
xgboost_model_test = test_accuracy_result(xgboost_model)
print("Test Accuracy: %.2f%%" % (xgboost_model_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
xgb_tunned = XGBClassifier()
xgb_tunned_params = {"learning_rate"    : [0.01, 0.10, 0.25],
                     "max_depth"        : [3, 6, 10],
                     "min_child_weight" : [1, 3, 7],
                     "gamma"            : [0.0, 0.1, 0.3],
                     "colsample_bytree" : [0.3, 0.5, 0.7],
                     "booster"          : ['gbtree','gblinear','dart'],
                     "n_jobs"           : [-1]}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
xgb_tunned_model = GridSearchCV(xgb_tunned, xgb_tunned_params,
                                scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)
print(xgb_tunned_model.best_params_)

In [None]:
xgb_tunned_new_params = {'booster' : ['gbtree'], 'colsample_bytree': [0.3], 'gamma': [0.0], 'learning_rate': [0.01], 'max_depth': [3], 'min_child_weight': [1], 'n_jobs': [-1]}

xgb_tunned = XGBClassifier()
# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
xgb_tunned_new_model = GridSearchCV(xgb_tunned, xgb_tunned_new_params,
                                scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)

In [None]:
classification_report_result(xgb_tunned_new_model)

In [None]:
xgboost_model_valid_tunned = valid_accuracy_result(xgb_tunned_new_model)
print("Validation Accuracy: %.2f%%" % (xgboost_model_valid_tunned * 100.0))

In [None]:
xgboost_model_test_tunned = test_accuracy_result(xgb_tunned_new_model)
print("Test Accuracy: %.2f%%" % (xgboost_model_test_tunned * 100.0))

###### E.5.2.2.2 LightGBM


In [None]:
from lightgbm import LGBMClassifier

lgbm_model = LGBMClassifier().fit(X_train_smote, y_train_smote)

classification_report_result(lgbm_model)

In [None]:
lgbm_model_valid = valid_accuracy_result(lgbm_model)
print("Validation Accuracy: %.2f%%" % (lgbm_model_valid * 100.0))

In [None]:
lgbm_model_test = test_accuracy_result(lgbm_model)
print("Test Accuracy: %.2f%%" % (lgbm_model_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
lgbm_tunned = LGBMClassifier()
lgbm_tunned_params = {"boosting_type":['gbdt', 'dart', 'goss', 'rf'],
                      "n_estimators":[100,200,500],
                      "learning_rate":[0.01,0.1,0.5]}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
lgbm_tunned_model = GridSearchCV(lgbm_tunned, lgbm_tunned_params,
                                 scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)

print(lgbm_tunned_model.best_params_)

In [None]:
lgbm_tunned = LGBMClassifier(boosting_type="gbdt",
                             learning_rate=0.01,
                             n_estimators=100).fit(X_train_smote,y_train_smote)

In [None]:
classification_report_result(lgbm_tunned)

In [None]:
lgbm_model_tunned_valid = valid_accuracy_result(lgbm_tunned)
print("Validation Accuracy: %.2f%%" % (lgbm_model_tunned_valid * 100.0))

In [None]:
lgbm_model_tunned_test = test_accuracy_result(lgbm_tunned)
print("Test Accuracy: %.2f%%" % (lgbm_model_tunned_test * 100.0))

###### E.5.2.2.3 CatBoost

In [None]:
from catboost import CatBoostClassifier

catboost_model = CatBoostClassifier().fit(
    X_train_smote,
    y_train_smote)

In [None]:
classification_report_result(catboost_model)

In [None]:
predictions = [value for value in catboost_model.predict(X_valid)]
catboost_model_valid = accuracy_score(y_valid, predictions)
print("Validation Accuracy: %.2f%%" % (catboost_model_valid * 100.0))

In [None]:
predictions = [value for value in catboost_model.predict(X_test)]
catboost_model_test = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (catboost_model_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
catboost_tunned = CatBoostClassifier()
catboost_tunned_params = {'iterations':[10, 1000],
                          'depth': [1, 8],
                          'learning_rate': [0.01, 1.0]
}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
catboost_tunned_model = GridSearchCV(catboost_tunned, catboost_tunned_params,
                                     scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)

print(catboost_tunned_model.best_params_)

In [None]:
catboost_tunned = CatBoostClassifier(depth= 1, iterations= 10, learning_rate= 0.01).fit(X_train_smote, y_train_smote)

In [None]:
classification_report_result(catboost_tunned)

In [None]:
predictions = [value for value in catboost_tunned.predict(X_valid)]
catboost_tunned_valid = accuracy_score(y_valid, predictions)
print("Validation Accuracy: %.2f%%" % (catboost_tunned_valid * 100.0))

In [None]:
predictions = [value for value in catboost_tunned.predict(X_test)]
catboost_tunned_test = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (catboost_tunned_test * 100.0))

### E.5.3 Neural Network Classifiers

#### E.5.3.1 Multi-Layer Perceptron (MLP)

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier().fit(X_train_smote, y_train_smote)

classification_report_result(mlp)

In [None]:
mlp_valid = valid_accuracy_result(mlp)
print("Validation Accuracy: %.2f%%" % (mlp_valid * 100.0))

In [None]:
mlp_test = test_accuracy_result(mlp)
print("Test Accuracy: %.2f%%" % (mlp_test * 100.0))

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
mlp_tunned = MLPClassifier()
mlp_tunned_params = {"activation":['identity', 'logistic', 'tanh', 'relu']}

# One-vs-rest ROC AUC, since the target has more than two classes ('roc_auc' is binary-only)
mlp_tunned_model = GridSearchCV(mlp_tunned, mlp_tunned_params,
                                scoring='roc_auc_ovr', cv=3, n_jobs=-1, verbose=2).fit(X_train_smote,y_train_smote)
print(mlp_tunned_model.best_params_)

In [None]:
mlp_tunned = MLPClassifier(activation="identity").fit(X_train_smote, y_train_smote)

In [None]:
classification_report_result(mlp_tunned)

In [None]:
mlp_tunned_valid = valid_accuracy_result(mlp_tunned)
print("Validation Accuracy: %.2f%%" % (mlp_tunned_valid * 100.0))

In [None]:
mlp_tunned_test = test_accuracy_result(mlp_tunned)
print("Test Accuracy: %.2f%%" % (mlp_tunned_test * 100.0))

#### E.5.3.2 Neural Networks

##### E.5.3.2.1 Creating a Callback Function

In [None]:
# Callback that stops training once the training loss drops below 0.4
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get("loss")
        if loss is not None and loss < 0.4:
            print("\n Loss is low so cancelling training!")
            self.model.stop_training = True

In [None]:
# Instantiate a callback object
callbacks = myCallback()

##### E.5.3.2.2 Dataset Preparation for Neural Networks Modeling

In [None]:
input_layer_size = df_after_smote.shape[1] - 1                     # Dimension of features
hidden_layer_size = input_layer_size*2                             # of units in hidden layer
output_layer_size = len(df_after_smote["CDRGLOB"].value_counts())  # number of label

In [None]:
print("Number of input layer size: {}".format(input_layer_size))
print("Number of hidden layer size: {}".format(hidden_layer_size))
print("Number of output_layer_size: {}".format(output_layer_size))

##### E.5.3.2.3 Neural Networks Modeling

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_layer_size,)),  # shape must be a tuple
    tf.keras.layers.Dense(hidden_layer_size, activation = "relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(int(hidden_layer_size/2.5), activation = "relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(output_layer_size, activation = "softmax")
])

In [None]:
model.summary()

In [None]:
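from keras import backend as K

# y_true holds integer class ids and y_pred holds softmax probabilities, so both are
# converted to one-hot vectors first. The resulting scores are micro-averaged per batch.
def to_one_hot(y_true, y_pred):
    n_classes = K.int_shape(y_pred)[-1]
    y_true = K.one_hot(K.cast(K.flatten(y_true), "int32"), n_classes)
    y_pred = K.one_hot(K.argmax(y_pred, axis=-1), n_classes)
    return y_true, y_pred

def recall_m(y_true, y_pred):
    y_true, y_pred = to_one_hot(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    y_true, y_pred = to_one_hot(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))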

In [None]:
model.compile(tf.keras.optimizers.Adam(learning_rate = 1e-4),
              loss = "sparse_categorical_crossentropy",
              metrics = [f1_m]) # tfa.metrics.F1Score(num_classes = output_layer_size)
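
The model ends in a softmax layer and is trained with sparse categorical cross-entropy, which expects integer class indices as labels rather than one-hot vectors. The plain-Python sketch below shows what these two pieces compute for a single sample. The logits are made up, and the helper functions are illustrative rather than the Keras implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_class, probabilities):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probabilities[true_class])


probs = softmax([2.0, 1.0, 0.1])                 # made-up logits for three classes
loss = sparse_categorical_crossentropy(0, probs)
print(probs, loss)
```

The loss approaches zero as the probability of the true class approaches one, and grows as the model assigns that class less probability.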

In [None]:
model_history = model.fit(X_train_smote, y_train_smote,
                          epochs = 100,
                          verbose = 1,
                          batch_size = 32,
                          validation_data = (X_valid, y_valid))
                          #callbacks = [callbacks])

## F. Evaluation

In [None]:
# Evaluate the model. With metrics=[f1_m], evaluate() returns the loss followed by the F1 metric.
test_loss, test_f1 = model.evaluate(X_test, y_test, verbose=0)

In [None]:
print("Test Loss of Neural Network Modeling: {:.2f}".format(test_loss))

In [None]:
print("Test F-1 Score of Neural Network Modeling: {:.2f}".format(test_f1))

In [None]:
plt.plot(model_history.history["loss"])
plt.plot(model_history.history["val_loss"])
plt.legend(["loss", "validation loss"], loc ="upper right")
plt.show()

In [None]:
print(model.evaluate(X_train_smote, y_train_smote))

In [None]:
print(model.evaluate(X_valid, y_valid))

In [None]:
print(model.evaluate(X_test, y_test))

![](Comparison%20of%20Validation%20Accuracy%20Result%20-%20Validation%20Tunned%20Accuracy%20Result.png)

![](Comparison%20of%20Test%20Accuracy%20Result%20-%20Test%20Tunned%20Accuracy%20Result.png)