# *Factors Linked to the Prevention or Onset of Dementia Disease in Elderly Individuals*
#### Author: Morgan Zimmerman
(Word Count: 3125)

## **Introduction**

The aim of this report is to analyze various lifestyle factors and their role in either the prevention or mitigation of dementia onset in an individual. Specifically, the question of interest is whether some lifestyle factors have a stronger association with the prevalence of dementia than others. Does going for a 30-minute walk or playing a board game with relatives lead to greater prevention of cognitive decline as one ages? These are the types of questions and comparisons that this research aims to answer. As discussed below, the lifestyle and diet factors of interest are all ones that individuals have the means to control in their own lives. This gives the report a practical implication: its findings can be reflected on and applied to personal lifestyle patterns to encourage healthy aging.

First, a clear problem statement will be provided, as well as background information as to why looking at lifestyle factors correlated with dementia is a necessary analysis. Then, the data and its sources will be presented, followed by an articulation of steps taken to wrangle the data. A discussion on the methods and tools used throughout the project and detailed results will follow, supplemented by figures and visualizations. Finally, there will be a review of what "success" has looked like in the final stages of the report, as well as what next analysis steps might look like if allotted more time. 

## **Problem Statement and Background**


The motivation behind this research question is that an inevitable consequence of an aging population is an increase in people living with dementia. According to a corresponding NIH (National Institutes of Health) analysis, "worldwide, there are roughly 50 million people living with dementia, and this number is projected to increase to 152 million by 2050, rising particularly in low-income and middle-income countries, where approximately two-thirds of all people living with dementia reside" [1]. Additionally, the Centers for Disease Control (CDC) collects survey data on Subjective Cognitive Decline (SCD) as part of their [Alzheimer's and Healthy Aging health portal](https://www.cdc.gov/aging/index.html) [2]. In their combined data from 2015-2018, the CDC found that 1 in 9 people aged 45 years and older are experiencing SCD. This is a concerning statistic, both because of the high prevalence of cognitive decline and because it includes individuals as young as forty-five. Furthermore, 1 in 3 people with SCD say it interferes with social activities, work, or volunteering. The National Institutes of Health has previously published reports on this type of data suggesting that different age groups see the effectiveness of cognition-retaining activities in different ways. These statistics motivate this project, which attempts to identify the activities that should be prioritized earlier in life in order to retain cognition and avoid losing the ability to partake in such activities later on.

Additionally, a jump in dementia prevalence has implications for healthcare. Currently, there are no disease-modifying (or curing) agents for dementia, but evidence of a decline in the prevalence of dementia in high-income countries may suggest that the disease can be prevented by reducing strong risk factors. It is important to tailor prevention programs to local context and risk factors. The effectiveness of certain factors in preventing dementia onset is an important question, as the answer can help motivate healthy aging lifestyles and encourage the formation of influential health policy initiatives. As this kind of research expands, health agencies can begin designing effective interventions, which should ultimately lead to more appropriate public health policies regarding healthy aging.

## **Data**

The data used for this analysis comes from the National Institutes of Health (NIH). Specifically, the data only includes individuals from Indonesia, a middle-income country. Data was collected for the study via questionnaires on lifestyle, health risk factors, and cognitive and functional tests from September 2013 to December 2013. Information was collected from elders, who are all above the age of 60 and reside in the Sumedang Regency of West Java, the most populous province in Indonesia. In total, 686 individuals qualified for and completed the study.

With each unit of observation at the patient-level, the data seeks to identify the modifiable risk factors for dementia, focusing on demographic, health, and lifestyle risk factors. Specific sociodemographic variables include age, gender, education, living area, and marital status. Additionally, socioeconomic factors, such as occupation and monthly income, are included. A number of health factors, like hypertension, cholesterol, body mass index (BMI), as well as cognitive function assessment scores, are present in the data set. Lastly, lifestyle and leisure factors, as well as dietary patterns are represented by a series of categorical variables. For these variables, respondents were asked how frequently (in times per week) they engaged in the leisure activity or ate from an individual food category. 

A selection of these variables was designated as predictors in the models to follow. The activity level variables and the dietary pattern variables were two categories of predictors that, separately, were applied to "dementia" (the outcome variable). These groups of predictors were selected because the overall aim of the project was to identify lifestyle factors that are linked to whether or not dementia is seen in an individual, and to encourage individuals to reflect on and utilize the results to prevent cognitive decline.

The outcome variable, whether or not an individual has dementia, was measured using the National Institute on Aging/Alzheimer's Association's (NIA-AA) criteria for dementia. In the absence of delirium or other mental disorders, individuals must have complained of severe cognitive decline that interferes with daily living activities to a healthcare professional and had that claim confirmed with cognitive assessments.

The columns with the most missing values in the dataset tend to be lab-based biomarkers. This could indicate that some patients had not completed certain tests or follow-up lab work. Additionally, there is missingness in the diet section of the dataset, which may mean patients did not properly document their food intake or do not eat a certain food category at all.
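The missingness check described above can be sketched with pandas. This is a minimal illustration on a toy frame; the column names (`cholesterol`, `fruit`) are placeholders and not the study's actual headers.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative, not the study's real headers.
df = pd.DataFrame({
    "age": [62, 70, 81, 66],
    "cholesterol": [190.0, np.nan, np.nan, 210.0],  # lab-style biomarker
    "fruit": [1.0, 0.0, np.nan, 1.0],               # diet-style variable
})

# Count missing entries per column and list the most incomplete columns first.
missing_counts = df.isna().sum().sort_values(ascending=False)

# Express the same information as a share of all rows.
missing_share = (missing_counts / len(df)).round(2)
print(missing_share)
```

Sorting by the count makes it easy to see at a glance which variable groups (for example, lab-based biomarkers) carry most of the gaps.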

The corresponding dataset, as well as previously performed studies, can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8345013/).

### *Data Wrangling*

Before any analysis could occur, the first step in this process was to reformat the data into a clean, readable dataframe. When the data was initially read in from Excel, variable names were altered because some columns were titled with subheadings (rather than one name). Additionally, variables related to lifestyle leisure activities were renamed so that they better explained the column being represented. Previously, these columns were labeled *Social_3*, *Intellectual_7*, *Recreational_2*, and so on. Corresponding files from the study's website listed which activities fall into each of these categories, but they had to be referenced frequently to draw any kind of conclusion about the data. It was more beneficial for this analysis to rename these variables to things like *volunteers*, *plays an instrument*, and *cooks*, respectively. This allowed for greater readability and made it significantly easier to translate results without cross-referencing many files.
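A renaming step like the one described above might look as follows. This is only a sketch: the mapping dictionary is hypothetical and simply mirrors the three example labels from the text, while the real mapping would come from the study's supplementary files.

```python
import pandas as pd

# Toy frame with coded column names like those described above.
raw = pd.DataFrame({
    "Social_3": [1, 0, 2],
    "Intellectual_7": [0, 0, 1],
    "Recreational_2": [3, 1, 0],
})

# Hypothetical code-to-label mapping; the real one comes from the study's codebook.
rename_map = {
    "Social_3": "volunteers",
    "Intellectual_7": "plays_instrument",
    "Recreational_2": "cooks",
}

# rename returns a new frame, so the original coded names stay available.
clean = raw.rename(columns=rename_map)
print(list(clean.columns))
```

Keeping the mapping in a single dictionary means the link between the coded and readable names is documented in one place.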

Another obstacle in the data cleaning process was dealing with categorical variables. A large majority of the variables present in the dataset, all but a few health biomarkers, needed to be converted to categorical columns. Because many values in these columns were character (word) entries, they needed to be cast to dummy variable values or ranking systems that represent the respective category's degrees. For example, columns representing dietary patterns and various food groups were valued as "frequently (2-3 times a week)" or "infrequently (<2 times a week)". These columns were transformed to dummy variables, in which a value of 1 indicated that the respondent "frequently" ate food from that food category and a value of 0 indicated that they did not. Once the variable names, types, and values were transformed, the next step in the process was to explore the data.
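The dummy-variable conversion described above can be illustrated with a small mapping. This is a minimal sketch that assumes the two label strings appear exactly as quoted in the text; the column names are placeholders.

```python
import pandas as pd

# Toy diet columns using the two labels quoted in the text; None marks a missing answer.
diet = pd.DataFrame({
    "fruit": [
        "frequently (2-3 times a week)",
        "infrequently (<2 times a week)",
        None,
    ],
    "carbohydrates": [
        "infrequently (<2 times a week)",
        "frequently (2-3 times a week)",
        "frequently (2-3 times a week)",
    ],
})

# 1 = eats the food category frequently, 0 = infrequently.
label_to_flag = {
    "frequently (2-3 times a week)": 1,
    "infrequently (<2 times a week)": 0,
}

# Values not found in the mapping (including missing answers) become NaN.
encoded = diet.apply(lambda col: col.map(label_to_flag))
print(encoded)
```

Mapping through an explicit dictionary leaves missing answers as missing rather than silently coding them as 0, which keeps the missingness visible for later steps.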

### *Data Exploration*

Exploring the data is necessary because it reveals potential areas of concern or of further analysis. A basic example of the data exploration stage is plotting frequencies for certain influential variables in order to gain an understanding of the type and range of observations represented in the data set. Below, *Figure 1* shows the frequencies for each age present in the data set. Only one frequency graph is included here for the sake of simplicity, but the benefit of this stage is a better understanding of the data being worked with, which in turn supports better interpretation of the results.

![age_frequency-2.png](attachment:age_frequency-2.png)

Before beginning the analysis, one final transformation on the data itself was to create multiple subsets based on variable categories. With a large number of variables in the original dataset, it was easier to work with subsets of the data that only included the necessary predictor and outcome variables at each stage of the analysis. For example, one of the main goals of the project was to identify key leisure activities that are most effective in reducing cognitive decline as an individual ages, so a subset was created that contained only activity variables and the necessary outcome variable of dementia. Likewise, a subset was created for solely diet variables.
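The subsetting step above amounts to selecting a list of predictor columns plus the outcome. The sketch below uses made-up column names to show the pattern; the real subsets would list the study's activity and diet variables.

```python
import pandas as pd

# Toy frame; column names are placeholders for the study's variables.
df = pd.DataFrame({
    "age": [62, 70, 81],
    "reads": [1, 0, 1],
    "volunteers": [0, 0, 1],
    "fruit": [1, 1, 0],
    "carbohydrates": [0, 1, 1],
    "dementia": [0, 1, 0],
})

activity_cols = ["reads", "volunteers"]   # assumed activity predictors
diet_cols = ["fruit", "carbohydrates"]    # assumed diet predictors
outcome = "dementia"

# Each subset keeps only its predictors and the shared outcome column.
activity_df = df[activity_cols + [outcome]].copy()
diet_df = df[diet_cols + [outcome]].copy()
print(activity_df.columns.tolist(), diet_df.columns.tolist())
```

Using `.copy()` keeps later changes to a subset from affecting the full dataframe.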

While not surprising, one important early validation from the data is that people living with dementia are less active and score lower on cognitive assessments, as shown in *Figure 2* below. To obtain this result, the total number of activities each person participated in (whether social, intellectual, physical, etc.) was calculated. The graph below was then developed, with those without dementia on the left and those with dementia on the right. The plot shows the distribution of activity totals within those two groups. The distribution for those with dementia is more right-skewed. In other words, those with dementia were associated with less active lifestyles, on average.

![active_totals-2.png](attachment:active_totals-2.png)
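The activity totals behind the figure above could be computed roughly as follows. This sketch assumes each activity column is a 0/1 participation flag, which is a simplification; the toy values are invented for illustration.

```python
import pandas as pd

# Toy data: each activity column is assumed to be a 0/1 participation flag.
df = pd.DataFrame({
    "reads": [1, 0, 1, 0],
    "volunteers": [1, 0, 1, 0],
    "cooks": [1, 1, 0, 0],
    "dementia": [0, 1, 0, 1],
})
activity_cols = ["reads", "volunteers", "cooks"]

# Row-wise sum gives the number of activities per person.
df["activity_total"] = df[activity_cols].sum(axis=1)

# Compare the two groups with simple summary statistics.
summary = df.groupby("dementia")["activity_total"].agg(["mean", "median"])
print(summary)
```

The grouped summary gives a numeric counterpart to the side-by-side distributions in the plot.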

Likewise, another variable of interest outside of activity and diet is the AMT (Abbreviated Mental Test) score. The AMT is a rapid assessment of elderly patients for the possibility of dementia, typically given by a healthcare professional. *Figure 3* shows the average AMT score for each unique age for which there are observations in the dataset. It confirms a clear downward trend in AMT scores with increasing age, which should not be a surprise to most.

![avg_AMT-3.png](attachment:avg_AMT-3.png)
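The per-age averages in the figure above correspond to a simple group-by. The sketch below uses invented scores and assumes the columns are named `age` and `AMT`.

```python
import pandas as pd

# Toy data; the column names "age" and "AMT" are assumptions.
df = pd.DataFrame({
    "age": [60, 60, 75, 75, 90],
    "AMT": [10, 8, 7, 5, 3],
})

# Mean AMT score for each unique age, ordered by age for plotting.
avg_amt = df.groupby("age")["AMT"].mean().sort_index()
print(avg_amt)
```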

## **Analysis**

For the statistical learning stage, the data was set up in the sklearn [3] framework. In other words, the outcome was separated from the predictors. Then, five models were considered and implemented using the sklearn API. This process is explained below.
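Separating the outcome from the predictors might be sketched as below. The hold-out split, its 25% size, and the stratification are assumptions made for illustration, as the exact split used in the project is not stated; the data is a toy example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a binary outcome column named "dementia" (placeholder values).
df = pd.DataFrame({
    "reads": [1, 0, 1, 0, 1, 0, 1, 0],
    "volunteers": [1, 1, 0, 0, 1, 1, 0, 0],
    "dementia": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Outcome vector and predictor matrix, as sklearn estimators expect.
y = df["dementia"]
X = df.drop(columns=["dementia"])

# Assumed 75/25 split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```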

### *Machine Learning*

The first goal of the machine learning stage was to implement and compare various models using the sklearn API. On a subset dataframe that included only numerical values (excluding categorical variables), five models were considered: linear regression, K-nearest neighbors, decision tree, bagging regressor, and random forest. For each model, K-fold cross-validation was used to estimate the test error. In order to make valid comparisons across models, a KFold generator was implemented to ensure the same data splits were used each time. The performance metric of interest, mean squared error (MSE), was collected for each model. A grid search, an exhaustive search over specified parameter values for an estimator, was implemented. After all potential values of the tuning parameters were specified and cross-validation was used to compare out-of-sample performance among these configurations, the models were ranked; *Figure 4* below shows their order (best to worst) based on MSE.

![model_MSE-2.png](attachment:model_MSE-2.png)

As a result, the Bagging Regressor model performed best on this subset of numerical variables. Applied to the test data, this model produced an R-squared fit of 0.8.
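The model comparison described in this section can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, and the tuning grids are hypothetical because the values searched are not listed in the report.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the numeric subset; the study data is not reproduced here.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# One shared splitter so every model is scored on identical folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical tuning grids; an empty grid means "use the defaults".
candidates = {
    "linear": (LinearRegression(), {}),
    "knn": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 10]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4]}),
    "bagging": (BaggingRegressor(random_state=0), {"n_estimators": [10, 25]}),
    "forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [25], "max_depth": [2, 4]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=folds, scoring="neg_mean_squared_error")
    search.fit(X, y)
    # sklearn maximizes scores, so MSE is reported negated; flip the sign back.
    results[name] = -search.best_score_

# Models ordered from lowest (best) to highest (worst) cross-validated MSE.
ranking = sorted(results, key=results.get)
print(ranking)
```

Passing the same `KFold` object to every search is what makes the MSE values comparable across models, since each one is evaluated on the same splits.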

## **Results**

The bulk of the analysis follows the above process, but on subsets of the data that are of more interest. Two key groups of variables, leisure activity levels and dietary patterns, were of most interest in terms of their effect on the outcome or presence of dementia. The reason for this is that leisure and diet are two lifestyle factors that individuals have the potential to control in their own lives. If a significant link between these two factors and a person's chance of living with dementia can be expressed to the public, people may be more inclined to change their daily habits and diet.

For this reason, the above modeling steps were implemented on the subset of data that only includes *activity* predictors. Note that this time the performance metric of interest is the ROC AUC score, an evaluation metric for binary classification problems. The best model for this data is the Random Forest Classifier with a max depth of 4 and 500 estimators. The resulting ROC AUC score, a measure of the ability of a classifier to distinguish between classes, is 0.81. For reference, an ROC AUC score of 1 would mean that the model is able to perfectly distinguish between dementia and no dementia. Additionally, the model has an accuracy score of 73%, indicating that it correctly predicts the dementia outcome from activity factors 73% of the time.
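A classifier with the settings named above (max depth 4, 500 estimators) can be fitted and scored as in the following sketch. The data here is synthetic and the hold-out split is an assumption, so the printed scores will not match the 0.81 and 73% reported for the study data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data standing in for the activity subset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters taken from the text: max depth of 4 and 500 estimators.
model = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# ROC AUC needs predicted probabilities for the positive class, not hard labels.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

# Accuracy is computed from the hard class predictions.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"ROC AUC: {auc:.2f}, accuracy: {acc:.2f}")
```

ROC AUC evaluates how well the predicted probabilities rank cases, while accuracy depends on the default 0.5 decision threshold, which is why the two numbers can differ.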

Furthermore, this set of activity features was permuted in order to determine importance. *Figure 5* below represents the relative importance of each activity variable in explaining the outcome (dementia). This figure illustrates that the three most important activities are reading, listening to music, and watching television. The implication for the analysis is that devoting more leisure time to these three specific activities may be linked to a decrease in dementia onset in an elderly individual. This figure is included in the final report because it provides readers with an assessment of how they might best use their time in order to preserve cognitive function as they age, especially those who are high-risk or have a family history of the disease.

![permutations-2.png](attachment:permutations-2.png)
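A feature-permutation ranking like the one in the figure above can be produced with sklearn's `permutation_importance`. The sketch below assumes that function was the method used, which the report does not confirm; the data is synthetic and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder feature names attached to synthetic data.
feature_names = ["reads", "listens_to_music", "watches_tv", "gardens", "volunteers"]
X_arr, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time and record the average drop in ROC AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)

# Larger mean drop = the model relies more heavily on that feature.
importances = pd.Series(result.importances_mean, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Permutation importance measures how much the model's predictive performance depends on a feature; it does not by itself indicate whether more of an activity raises or lowers risk.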

For the sake of consistency, the explanation and process described above was applied to solely *diet* predictors. As a result, it was found that the best model for this subset of data is also the Random Forest Classifier, but this time with a max depth of 4 and 1500 estimators.

The resulting ROC AUC score is also 0.81. The random forest model has an accuracy score of 72%, indicating that it correctly predicts the dementia outcome from diet factors 72% of the time. After analyzing the relative importance of each food category in predicting dementia, it was found that carbohydrates and fruit have the most influence on preventing the onset of dementia. Note, too, that the NIH has also found links between increased fruit intake and delaying the onset of dementia, a claim consistent with the result below.

![diets.png](attachment:diets.png)

### *Takeaways*

In conclusion, it is evident that individuals without dementia have higher activity levels than those with dementia. Secondly, AMT scores declined with age. Lastly, the three most important activities in retaining cognition for aging individuals are reading, listening to music, and watching television. Likewise, an emphasis on increased carbohydrate and fruit intake leads to better dementia prevention.

One of the key takeaways from this project was becoming more aware of the modeling process and, more importantly, the interpretation of various results. The initial challenge with this analysis was becoming familiar with both categorical predictors and a categorical outcome variable. Initially, implementing the modeling process on the data set as a whole, with dementia as the outcome, resulted in low predictive accuracy scores. The key takeaway from that preliminary result was that poor accuracy scores do not always mean a bad analysis. In other words, variables that are designated as predictors of an outcome may simply be poor indicators by nature. They may not have any strong correlation with the outcome variable.

### *Theoretical Implications*

This analysis demonstrates that both lifestyle and diet factors are associated with the onset of dementia in this sub-population of individuals from Indonesia. An awareness that low activity levels among elders are linked to a higher risk for the onset of dementia may encourage new intervention opportunities. Likewise, it may promote early prevention patterns in individuals at a young age. The knowledge that this type of study can provide is beneficial both to individuals, as they can apply lifestyle changes to mitigate their risk of dementia onset, and to society and healthcare initiatives as a whole, as it allows for reflection on the types of prevention strategies that are available to individuals. At any rate, being able to provide patients with knowledge about how minor changes in their own lives can have drastic effects on their health later in life is a powerful tool and should be implemented appropriately. In the case of dementia, this knowledge allows each individual to take part in protecting and preserving their own cognitive function.

## **Discussion**

As a whole, this project was successful in achieving the goals defined in the original proposal. In the preliminary proposal, *success* was defined as producing working code throughout the project, producing publication-quality visuals to easily explain relationships among predictor variables, and creating a model that represents which major characteristics (related to one's demographics and health history) most influence the risk of dementia. These goals were achieved, although one critique would be to spend more time on the graphics. They have room for improvement in terms of visual appeal, and it would also be beneficial to make them more cohesive (in terms of style, color, size, etc.) throughout the report.

The hope at the completion of this project is that conclusive statements have been presented about what interventions or health changes individuals should consider making, especially if they are at greater risk of developing dementia. With more time, another major aim of the project would be to incorporate data from multiple countries, especially those with varying income levels. It would be a logical next step to compare the results from Indonesia (a middle-income country) with those from a low-income and a high-income country. This would strengthen the analysis significantly, as it would allow for more generalizable claims. More large-scale research would lead to more effective and efficient dementia prevention strategies.

### **Sources**

[1] Ong, P. A., Annisafitrie, F. R., Purnamasari, N., Calista, C., Sagita, N., Sofiatin, Y., & Dikot, Y. (2021). Dementia Prevalence, Comorbidities, and Lifestyle Among Jatinangor Elders. Frontiers in neurology, 12, 643480. https://doi.org/10.3389/fneur.2021.643480

[2] “What Is Alzheimer's Disease?” Centers for Disease Control and Prevention. Centers for Disease Control and Prevention, November 23, 2021. https://www.cdc.gov/aging/index.html. 

[3] [Scikit-learn: Machine Learning in Python](https://jmlr.csail.mit.edu/), Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.