## Machine Learning Capstone Project Proposal

### Domain Background
[Student briefly details background information of the domain from which the project is proposed. Historical information relevant to the project should be included. It should be clear how or why a problem in the domain can or should be solved. Related academic research should be appropriately cited. A discussion of the student's personal motivation for investigating a particular problem in the domain is encouraged but not required.]

This capstone project applies machine learning modeling and analysis to clinical, demographic, and derived anatomic brain measures from human MRI (magnetic resonance imaging) tests (http://www.oasis-brains.org/). These measurements are used to assess the level of dementia in individuals and the probability that they have Alzheimer's disease (AD).

Recently, a close relative of mine had to undergo a sequence of MRI tests for cognitive difficulties. The motivation for choosing this topic arose from the desire to understand and analyze the potential for dementia and AD from MRI-related data. This capstone project does not use the MRI imaging data itself, and it focuses only on dementia, not on AD.


### Problem Statement
[Student clearly describes the problem that is to be solved. The problem is well defined and has at least one relevant solution]

* Cross-sectional and longitudinal OASIS MRI structural and demographic data (clinical, demographic, and derived anatomic brain measures) from human MRI tests (http://www.oasis-brains.org/) will be used to train a set of linear and non-linear machine learning classification models.


* Clinical Dementia Rating (CDR) values provided in the dataset will be used as labels for training the classification models (CDR values: 0 = nondemented; 0.5 = very mild dementia; 1 = mild dementia; 2 = moderate dementia).


* Pandas will be used for data loading and Python scikit-learn library for modeling.


* The goal is to train machine learning models to predict whether the individuals in the cross-validation set (test set) have dementia (CDR > 0) and, if they do, the severity level of dementia (CDR values of 0.5, 1, and 2). The problem will be formulated both as a binary classification problem (CDR = 0 versus CDR > 0) and as a multiclass classification problem (CDR values in the dataset: 0, 0.5, 1, and 2). In the binary formulation, all CDR > 0 values in the sliced dataset will be relabeled as CDR = 1.


* Classification Accuracy will be used as the primary metric. The results from the best model (one that provides the highest accuracy) will be reported along with those from the other models.


* About 80% of the data in the dataset will be used for training the models. About 20% of the data will be held out for predicting the CDR label, and k-fold cross-validation with k=10 will be used. Sensitivity studies with other proportions, e.g. 70:30, will be run to test how sensitive the accuracy is to this split.


* The base case will use a dataset that combines the cross-sectional and the longitudinal MRI datasets. This has the benefit of providing a larger dataset. Models will also be trained and cross-validated on the cross-sectional and the longitudinal datasets separately, and the classification accuracy will be reported for each.


* Data cleaning (e.g. removal of NaN values), data exploration, data preparation, data visualization, and data preprocessing will be described as needed, and the impact of preprocessing on prediction accuracy will be discussed.
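
The binary and multiclass formulations described above differ only in how the CDR column is turned into a target. The following is a minimal sketch, assuming a pandas DataFrame with a `CDR` column; the toy values, the new column names, and the integer encoding are assumptions made for illustration. Many scikit-learn classifiers treat non-integer float targets such as 0.5 as continuous and refuse to fit, so the multiclass labels are mapped to integer class codes here.

```python
import pandas as pd

# Toy CDR values, invented for illustration; real values come from the OASIS CSV files.
df = pd.DataFrame({"CDR": [0.0, 0.5, 1.0, 2.0, 0.0, 0.5]})

# Binary formulation: 0 = nondemented (CDR = 0), 1 = demented (CDR > 0).
df["cdr_binary"] = (df["CDR"] > 0).astype(int)

# Multiclass formulation: map each CDR level to an integer class code.
# This encoding is a hypothetical choice, not part of the OASIS data.
CDR_TO_CLASS = {0.0: 0, 0.5: 1, 1.0: 2, 2.0: 3}
df["cdr_class"] = df["CDR"].map(CDR_TO_CLASS)

print(df)
```

Keeping the original `CDR` column alongside the encoded targets makes it easy to translate predictions back to clinical ratings when reporting results.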

### Datasets and Inputs
[The dataset(s) and/or input(s) to be used in the project are thoroughly described. Information such as how the dataset or input is (was) obtained, and the characteristics of the dataset or input, should be included. **It should be clear how the dataset(s) or input(s) will be used in the project and whether their use is appropriate given the context of the problem.**]

**Data source:**

Reference 1 below provides the downloadable MRI-related data in CSV format. Reference 2 provides metadata and additional facts about the cross-sectional MRI data. Reference 3 provides follow-up reliability data for subjects who continued to have no dementia. Reference 3 data may optionally be used to predict additional CDR = 0 (no dementia) cases.

Reference 1: http://www.oasis-brains.org/app/template/Index.vm;jsessionid=6926BBF18A3D5CD974E750FAC8ED01CE

Reference 2: http://www.oasis-brains.org/pdf/oasis_cross-sectional_facts.pdf

Reference 3: http://www.oasis-brains.org/app/action/BundleAction/bundle/OAS1_RELIABILITY

##### OASIS: Cross-sectional MRI Data in Young, Middle Aged, Nondemented and Demented Older Adults

Summary: This set consists of a cross-sectional collection of 416 subjects aged 18 to 96. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 100 of the included subjects over the age of 60 have been clinically diagnosed with very mild to moderate Alzheimer’s disease (AD). Additionally, a reliability data set (Reference 3) is included which contains 20 nondemented subjects imaged on a subsequent visit within 90 days of their initial session. See the dementia-related **Additional Data** below for the cross-sectional MRI cases used in this project. Features based on these **Additional Data** will be used to train classification models to predict the labels for the outcome (CDR).

##### OASIS: Longitudinal MRI Data in Nondemented and Demented Older Adults
 	
Summary: This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit. See the dementia-related **Additional Data** below for the longitudinal MRI cases used in this project.

Features based on the **Additional Data** are relevant to finding machine learning solutions to the problem defined above, and will be used to train classification models to predict the labels for the outcome (Clinical Dementia Rating, CDR).

**Additional data:** The references given in parentheses below, which describe the features, are taken from Reference 2: http://www.oasis-brains.org/pdf/oasis_cross-sectional_facts.pdf . The features include demographic, clinical, and derived anatomic brain measures, and are located in the file oasis_crosssectional.csv.

* Demographics Data

Gender (M/F), handedness (Hand), age (Age), education (Educ), socioeconomic status (SES, Rubin et al., 1998). Education codes correspond to the following levels of education: 1 = less than high school graduate; 2 = high school graduate; 3 = some college; 4 = college graduate; 5 = beyond college.

* Clinical Data

Mini-Mental State Examination (MMSE, Rubin et al., 1998), Clinical Dementia Rating (CDR; 0 = nondemented; 0.5 = very mild dementia; 1 = mild dementia; 2 = moderate dementia, from Morris, 1993). All participants with dementia (CDR > 0) were diagnosed with probable AD.

* Derived anatomic volumes

--Estimated total intracranial volume (eTIV (mm3), Buckner et al., 2004), 

--Atlas scaling factor (ASF, Buckner et al., 2004), 

--Normalized whole brain volume (nWBV, Fotenos et al., 2004). 


### Solution Statement
*Student clearly describes a solution to the problem. The solution is applicable to the project domain and appropriate for the dataset(s) or input(s) given. Additionally, the solution is quantifiable, measurable, and replicable.

Solution:

* Train a supervised machine learning classification model to classify the OASIS data according to Clinical Dementia Rating (CDR) values.

* Train a number of candidate models from the scikit-learn library such as Logistic Regression, Linear Discriminant Analysis, KNN, Naive Bayes, CART, and SVM. 

* Select the best model based on the "Accuracy" Metric. 

* Combine the cross-sectional and longitudinal MRI related demographic and clinical data into a single dataset. 

* Split the dataset into a training set (80%) and a remaining set (20%), with the usual ten-fold cross-validation.

* Report the prediction accuracy of the models and identify the model that yields the highest classification accuracy. Report results in scikit-learn confusion matrix format (to evaluate classifier output quality) and classification report format (which provides precision, recall, and F1-score). See details here: http://scikit-learn.org/stable/modules/model_evaluation.html.
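
The model comparison in the steps above can be sketched as follows. This is an illustrative outline only: the data come from `make_classification` as a synthetic stand-in for the OASIS features, the seed value and the default hyperparameters are arbitrary assumptions, and `X`, `y` stand for whatever training portion is actually used.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 7  # arbitrary seed, used only to make the sketch reproducible

# Synthetic stand-in for the OASIS features and a binary CDR label.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=SEED)

# The six candidate model families named in the solution statement.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=SEED),
    "SVM": SVC(),
}

# Ten-fold cross-validation with class proportions preserved in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Select the model with the highest mean cross-validation accuracy.
best_name = max(results, key=results.get)
print("Best by mean CV accuracy:", best_name)
```

Reporting the standard deviation next to the mean helps judge whether the gap between two models is larger than the fold-to-fold variation.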


### Benchmark Model
A benchmark model is provided that relates to the domain, problem statement, and intended solution. Ideally, the student's benchmark model provides context for existing methods or known information in the domain and problem given, which can then be objectively compared to the student's solution. The benchmark model is clearly defined and measurable.

------------
#### Benchmark
Student clearly defines a benchmark result or threshold for comparing performances of solutions obtained.

##### My benchmarks are taken from the two papers below:

1. Paper title: Usefulness of data from magnetic resonance imaging to improve prediction of dementia: population based cohort study

http://www.bmj.com/content/350/bmj.h2863

"Results During 10 years of follow-up, there were 119 confirmed cases of dementia, 84 of which were Alzheimer’s disease. The conventional risk model incorporated age, sex, education, cognition, physical function, lifestyle (smoking, alcohol use), health (cardiovascular disease, diabetes, systolic blood pressure), and the apolipoprotein genotype (C statistic for discrimination performance was 0.77, 95% confidence interval 0.71 to 0.82). No significant differences were observed in the discrimination performance of the conventional risk model compared with models incorporating data from MRI including white matter lesion volume (C statistic 0.77, 95% confidence interval 0.72 to 0.82; P=0.48 for difference of C statistics), brain volume (0.77, 0.72 to 0.82; P=0.60), hippocampal volume (0.79, 0.74 to 0.84; P=0.07), or all three variables combined (0.79, 0.75 to 0.84; P=0.05). Inclusion of hippocampal volume or all three MRI variables combined in the conventional model did, however, lead to significant improvement in reclassification measured by using the integrated discrimination improvement index (P=0.03 and P=0.04) and showed increased net benefit in decision curve analysis. Similar results were observed when the outcome was restricted to Alzheimer’s disease."

1a. C-statistic: http://www.statisticshowto.com/c-statistic/

2. Paper title: The Use of MRI and PET for Clinical Diagnosis of Dementia and Investigation of Cognitive Impairment: A Consensus Report

https://www.alz.org/national/documents/imaging_consensus_report.pdf

"Once the presence of dementia has been established, the role of imaging in the diagnosis of dementia subtypes is very much a function of the clinical diagnosis. The accuracy of the clinical diagnosis of Alzheimer’s disease (AD) is quite good. Pathological AD has a prevalence of about 70% (range 50% to above 80% depending upon whether the AD occurs in isolation or with other entities) among all dementias (see evidence Table 1 in reference 1 ); thus, even clinicians with limited neurological expertise should have a diagnostic accuracy, for AD at least, at about that level. A review of 13 published studies gave average values for sensitivity and specificity of the clinical diagnosis of AD of 81% and 70%, respectively(1). The overall accuracy of the clinical diagnosis of AD versus not-AD compared with the neuropathological standard based on those values for prevalence, sensitivity, and specificity, is 78%. "

----------
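
The 78% overall accuracy in the consensus report excerpt quoted above follows from the prevalence, sensitivity, and specificity it states. As a short worked check (the variable names are illustrative), accuracy is the fraction of true positives plus the fraction of true negatives, which gives about 0.777:

```python
# Values quoted in the consensus report excerpt.
prevalence = 0.70   # share of dementia cases with pathological AD
sensitivity = 0.81  # probability that an AD case is diagnosed as AD
specificity = 0.70  # probability that a not-AD case is diagnosed as not-AD

# Accuracy = true positives + true negatives, as fractions of all cases.
accuracy = prevalence * sensitivity + (1 - prevalence) * specificity
print(f"{accuracy:.3f}")
```

This benchmark refers to AD versus not-AD among dementia cases, so it is a reference point rather than a like-for-like target for the CDR classification accuracy.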

### Evaluation Metrics
* Student proposes at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model presented. The evaluation metric(s) proposed are appropriate given the context of the data, the problem statement, and the intended solution.


Classification accuracy (the number of records correctly classified divided by the total number of records classified) will be reported. Results will also be reported in scikit-learn confusion matrix format (to evaluate classifier output quality) and classification report format (which provides precision, recall, and F1-score), both of which are appropriate for the MRI dataset used to train models for CDR classification. See details and discussion of these scikit-learn metrics here:
http://scikit-learn.org/stable/modules/model_evaluation.html
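
A minimal sketch of these three reports, using small hypothetical label lists invented for illustration (0 = nondemented, 1 = demented) rather than real predictions:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted binary labels, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Accuracy: correctly classified records divided by all classified records.
acc = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print("Accuracy:", acc)
print(cm)
print(classification_report(y_true, y_pred,
                            target_names=["nondemented", "demented"]))
```

The confusion matrix and the per-class precision and recall show which classes are being confused, something a single accuracy figure can hide when the CDR classes are unevenly represented.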

### Project Design
* Student summarizes a theoretical workflow for approaching a solution given the problem. Discussion is made as to what strategies may be employed, what analysis of the data might be required, or which algorithms will be considered. The workflow and discussion provided align with the qualities of the project. Small visualizations, pseudocode, or diagrams are encouraged but not required.

The following steps will be used. Some of the step details are patterned after recommendations from:
https://machinelearningmastery.com/

1. Download the CSV data from the OASIS web site (Reference 1).

2. Load the data into a pandas DataFrame.

3. Load libraries (pandas, scikit-learn, and matplotlib).

4. Summarize data
       4a) Descriptive statistics: use the pandas describe() method.
       4b) Explore and visualize the dataset using histograms and scatter plots, with matplotlib for visualization.

5. Clean the dataset
       5a) Remove rows with NaN values, or replace NaN values with the column mean.
       5b) Remove or replace any other missing data.

6. Prepare data (rename columns so that the cross-sectional and the longitudinal MRI data can be combined).

7. Preprocess and transform the data, if appropriate, by normalizing and/or rescaling. Check whether normalization and scaling improve classification accuracy. As a sensitivity study, use data transforms, where attributes are scaled or redistributed, to better expose the structure of the problem to the learning algorithms.

8. Feature selection
    8a) Check the correlation among features by plotting the correlation matrix computed with the pandas corr() method, to help select features that are not strongly correlated with each other. Removing redundant features may improve the model input and the resulting accuracy. New features may also be developed at this stage.

    8b) As a sensitivity study, check whether PCA (principal component analysis) is helpful in reducing the number of features.
    
9. Evaluate algorithms: evaluate the classification algorithms identified in the Solution Statement.

    9a) Split the dataset into a training dataset and a validation dataset. Use a fixed random seed so that results are reproducible.

    9b) Define test options using scikit-learn, such as cross-validation, and choose the evaluation metrics (see the Evaluation Metrics section above).

    9c) Spot-check and compare algorithms: using scikit-learn, spot-check the suite of linear and nonlinear machine learning algorithms listed in the Solution Statement and compare their estimated accuracy. Pick the algorithm that provides the highest cross-validation accuracy.


10. Improve accuracy: two common and different approaches will be used to improve the accuracy of the models:

    10a) Algorithm tuning: use scikit-learn to search for the combination of parameters for each algorithm that yields the best results.

    10b) Ensembles: combine the predictions of multiple models into an ensemble prediction using ensemble techniques in scikit-learn.


11. Finalize model: use the optimal model tuned with scikit-learn to make predictions on unseen data.

    11a) Create a standalone model on the entire training dataset using the tuned parameters.

    11b) Make predictions on the validation dataset.

    11c) Save the model for later use.
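
The tuning and finalization steps above could look roughly like the sketch below. It is illustrative only: the data are synthetic, the SVM and its parameter grid are arbitrary choices rather than the selected model, and the file name `cdr_model.joblib` is hypothetical. Scaling is placed inside a `Pipeline` so that it is re-fit on the training part of each cross-validation fold.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 7  # arbitrary seed for reproducibility

# Synthetic stand-in for the cleaned OASIS features and binary CDR label.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED)

# Preprocessing and classifier combined, so scaling never sees validation folds.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Hypothetical grid; the real grid depends on the algorithm being tuned.
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

# Ten-fold cross-validated grid search on the training data only.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)  # by default the best setting is refit on all of X_train

# Evaluate the refit model once on the held-out validation data.
val_accuracy = search.score(X_val, y_val)
print(search.best_params_, round(val_accuracy, 3))

# Save the standalone model and load it back for later use.
model_path = os.path.join(tempfile.mkdtemp(), "cdr_model.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

Saving the whole fitted pipeline, rather than the classifier alone, keeps the scaling parameters together with the model so that new data are transformed the same way at prediction time.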


##### Representative screenshots and extracts of Python code and plots that will be part of the Analysis Report

* Load libraries

<img src="Libraries_used.jpg">

* Cross sectional data sample (Pandas data frame)

<img src="Crosssectional_head.jpg">

* Longitudinal data sample (Pandas data frame)

<img src="Longitudinal_head.jpg">

* Merge the Cross Sectional and the Longitudinal data sets (Pandas data frames) and get descriptive statistics

<img src="Merge_datasets.jpg">
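
For readers without access to the screenshot, combining the two tables might be sketched as below. The toy rows and the column names (`ID`, `Subject ID`, `Educ`, `EDUC`) are assumptions for illustration and should be checked against the actual CSV headers; only `dfoas_merge` and the `dropna` call come from this proposal.

```python
import pandas as pd

# Toy frames; column names and values are assumptions for illustration.
cross = pd.DataFrame({"ID": ["c1", "c2"], "Age": [74, 55],
                      "Educ": [2, 4], "CDR": [0.0, None]})
longi = pd.DataFrame({"Subject ID": ["l1", "l2"], "Age": [87, 75],
                      "EDUC": [14, 12], "CDR": [0.0, 0.5]})

# Give matching columns the same name before stacking the two tables.
longi = longi.rename(columns={"Subject ID": "ID", "EDUC": "Educ"})

# Keep only the columns present in both tables, then stack the rows.
shared = [c for c in cross.columns if c in longi.columns]
dfoas_merge = pd.concat([cross[shared], longi[shared]], ignore_index=True)

# Drop rows with any missing value, then summarize the result.
dfoas_merge = dfoas_merge.dropna(how="any")
print(dfoas_merge.describe())
```

Before stacking, it is worth confirming that columns with similar names (for example the education field) use the same coding in both files, since a shared name does not guarantee a shared scale.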

* Drop rows with NaN

    dfoas_merge = dfoas_merge.dropna(how='any')

* Code: split merged dataset into a Training set and cross-validation set

<img src="Code_split_dataset.jpg">
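
A hedged sketch of such a split, including the 70:30 sensitivity variant mentioned in the problem statement. The toy table, the `label` column name, and the seed are invented for illustration; `stratify` keeps the class proportions similar on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 7  # fixed seed so the split is reproducible

# Toy feature table with a balanced binary label, invented for illustration.
df = pd.DataFrame({
    "Age": list(range(60, 80)),
    "nWBV": [0.70 + 0.005 * i for i in range(20)],
    "label": [0, 1] * 10,
})
X = df.drop(columns="label")
y = df["label"]

# Compare the base 80:20 split with a 70:30 sensitivity case.
sizes = {}
for test_size in (0.20, 0.30):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=SEED)
    sizes[test_size] = (len(X_train), len(X_val))
    print(test_size, sizes[test_size])
```

Because the longitudinal set contains several visits per subject, a subject-wise split (for example with scikit-learn's `GroupShuffleSplit`) may be worth considering so that visits from one subject do not fall on both sides; that is a suggestion of this sketch, not part of the proposal.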

* Python Code and Plot showing correlation among features in the merged dataset

<img src="Code_Correlation_Matrix_plot.jpg">

<img src="Correlation_Plot.jpg">
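
The correlation check can also be done numerically. The sketch below uses synthetic columns `f1`, `f2`, `f3` (invented, with `f2` built to be strongly correlated with `f1`) and a hypothetical threshold of 0.9 to list feature pairs that may be redundant.

```python
import numpy as np
import pandas as pd

# Synthetic features: f2 is nearly a linear function of f1, f3 is independent.
rng = np.random.default_rng(7)
f1 = rng.normal(size=200)
df = pd.DataFrame({
    "f1": f1,
    "f2": -2.0 * f1 + rng.normal(scale=0.1, size=200),
    "f3": rng.normal(size=200),
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

THRESHOLD = 0.9  # hypothetical cut-off for "strongly correlated"

# Keep only the upper triangle so each pair is examined once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.abs().where(mask)

# Collect pairs whose absolute correlation exceeds the threshold.
pairs = [(r, c) for r in upper.index for c in upper.columns
         if upper.loc[r, c] > THRESHOLD]

print(corr.round(2))
print(pairs)
```

Using the absolute value treats strong negative correlations the same as strong positive ones, since either indicates that one feature largely duplicates the other.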

* Quick check of cross-validation accuracy of the chosen set of sklearn algorithms when applied to the merged MRI dataset

<img src="spot_check_algorithms.jpg">

* Training and Cross-Validation Accuracy for the Decision Tree Classifier

<img src="Training and Cross-Validation Accuracy for the Decision Tree Classifier.jpg">
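
One way such training and cross-validation accuracies could be produced is scikit-learn's `validation_curve`, sketched below on synthetic data. Whether the screenshot was generated this way is not stated here; the choice of `max_depth` as the varied parameter and the depth values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

SEED = 7  # arbitrary seed for reproducibility

# Synthetic stand-in for the merged MRI feature table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=SEED)

# Tree depths to compare; hypothetical values chosen for illustration.
depths = [1, 2, 3, 5, 8, 12]

# Each result has one row per depth and one column per cross-validation fold.
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=SEED), X, y,
    param_name="max_depth", param_range=depths,
    cv=10, scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for depth, tr, cv in zip(depths, train_mean, cv_mean):
    print(f"max_depth={depth}: train={tr:.3f}, cv={cv:.3f}")
```

A widening gap between training and cross-validation accuracy as the depth grows is the usual sign that the tree is overfitting.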




### Presentation
Proposal follows a well-organized structure and would be readily understood by its intended audience. Each section is written in a clear, concise and specific manner. Few grammatical and spelling mistakes are present. All resources used and referenced are properly cited.