# Predicting Alzheimer's Disease
## Data source
- Alzheimer's disease dataset
- https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset

# Data overview
## 1. Demographics
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|PatientID|Patient identifier|4751-6899|Age|Age|60-90|
|Gender|Gender|0: male, 1: female|Ethnicity|Ethnicity|0: Caucasian, 1: African American, 2: Asian, 3: other|
|EducationLevel|Education level|0: none, 1: high school, 2: bachelor's, 3: higher| | | |

## 2. Lifestyle
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|Smoking|Smoking status|0: no, 1: yes|AlcoholConsumption|Weekly alcohol consumption|0-20|
|PhysicalActivity|Weekly physical activity|0-10|DietQuality|Diet quality|0-10|
|SleepQuality|Sleep quality|4-10| | | |

## 3. Medical history
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|FamilyHistoryAlzheimers|Family history of Alzheimer's|0: no, 1: yes|CardiovascularDisease|Cardiovascular disease|0: no, 1: yes|
|Diabetes|Diabetes|0: no, 1: yes|Depression|Depression|0: no, 1: yes|
|HeadInjury|Head injury|0: no, 1: yes|Hypertension|Hypertension|0: no, 1: yes|

## 4. Clinical measurements
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|SystolicBP|Systolic blood pressure|90-180|DiastolicBP|Diastolic blood pressure|60-120|
|CholesterolTotal|Total cholesterol|150-300|CholesterolLDL|LDL cholesterol|50-200|
|CholesterolHDL|HDL cholesterol|20-100|CholesterolTriglycerides|Triglycerides|50-400|
|BMI|Body mass index|15-40| | | |

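## 5. Cognitive and functional assessments
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|MMSE|Mini-Mental State Examination score|0-30; lower scores indicate cognitive impairment|FunctionalAssessment|Functional assessment score|0-10; lower scores indicate greater functional impairment|
|MemoryComplaints|Memory complaints|0: no, 1: yes|BehavioralProblems|Behavioral problems|0: no, 1: yes|
|ADL|Activities of Daily Living score|0-10; lower scores indicate greater impairment| | | |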

## 6. Symptoms
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|Confusion|Confusion|0: no, 1: yes|Disorientation|Disorientation|0: no, 1: yes|
|PersonalityChanges|Personality changes|0: no, 1: yes|DifficultyCompletingTasks|Difficulty completing tasks|0: no, 1: yes|
|Forgetfulness|Forgetfulness|0: no, 1: yes| | | |

## 7. Diagnosis
|Column|Description|Values|Column|Description|Values|
|--|--|--|--|--|--|
|Diagnosis|Alzheimer's diagnosis|0: no, 1: yes|DoctorInCharge|Information about the doctor in charge|'XXXConfid'|

In [1]:
pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
from scipy import stats 
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


df = pd.read_csv('../data/alzheimers_disease_data.csv')
df

Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,4751,73,0,0,2,22.927749,0,13.297218,6.327112,1.347214,...,0,0,1.725883,0,0,0,1,0,0,XXXConfid
1,4752,89,0,0,0,26.827681,0,4.542524,7.619885,0.518767,...,0,0,2.592424,0,0,0,0,1,0,XXXConfid
2,4753,73,0,3,1,17.795882,0,19.555085,7.844988,1.826335,...,0,0,7.119548,0,1,0,1,0,0,XXXConfid
3,4754,74,1,0,1,33.800817,1,12.209266,8.428001,7.435604,...,0,1,6.481226,0,0,0,0,0,0,XXXConfid
4,4755,89,0,0,0,20.716974,0,18.454356,6.310461,0.795498,...,0,0,0.014691,0,0,1,1,0,0,XXXConfid
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,6895,61,0,0,1,39.121757,0,1.561126,4.049964,6.555306,...,0,0,4.492838,1,0,0,0,0,1,XXXConfid
2145,6896,75,0,0,2,17.857903,0,18.767261,1.360667,2.904662,...,0,1,9.204952,0,0,0,0,0,1,XXXConfid
2146,6897,77,0,0,1,15.476479,0,4.594670,9.886002,8.120025,...,0,0,5.036334,0,0,0,0,0,1,XXXConfid
2147,6898,78,1,3,1,15.299911,0,8.674505,6.354282,1.263427,...,0,0,3.785399,0,0,0,0,1,1,XXXConfid


In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PatientID,2149.0,5825.0,620.507185,4751.0,5288.0,5825.0,6362.0,6899.0
Age,2149.0,74.908795,8.990221,60.0,67.0,75.0,83.0,90.0
Gender,2149.0,0.506282,0.500077,0.0,0.0,1.0,1.0,1.0
Ethnicity,2149.0,0.697534,0.996128,0.0,0.0,0.0,1.0,3.0
EducationLevel,2149.0,1.286645,0.904527,0.0,1.0,1.0,2.0,3.0
BMI,2149.0,27.655697,7.217438,15.008851,21.611408,27.823924,33.869778,39.992767
Smoking,2149.0,0.288506,0.453173,0.0,0.0,0.0,1.0,1.0
AlcoholConsumption,2149.0,10.039442,5.75791,0.002003,5.13981,9.934412,15.157931,19.989293
PhysicalActivity,2149.0,4.920202,2.857191,0.003616,2.570626,4.766424,7.427899,9.987429
DietQuality,2149.0,4.993138,2.909055,0.009385,2.458455,5.076087,7.558625,9.998346


# Data preprocessing

## Missing values check: none

In [5]:
df.isnull().sum()

PatientID                    0
Age                          0
Gender                       0
Ethnicity                    0
EducationLevel               0
BMI                          0
Smoking                      0
AlcoholConsumption           0
PhysicalActivity             0
DietQuality                  0
SleepQuality                 0
FamilyHistoryAlzheimers      0
CardiovascularDisease        0
Diabetes                     0
Depression                   0
HeadInjury                   0
Hypertension                 0
SystolicBP                   0
DiastolicBP                  0
CholesterolTotal             0
CholesterolLDL               0
CholesterolHDL               0
CholesterolTriglycerides     0
MMSE                         0
FunctionalAssessment         0
MemoryComplaints             0
BehavioralProblems           0
ADL                          0
Confusion                    0
Disorientation               0
PersonalityChanges           0
DifficultyCompletingTasks    0
Forgetfu

## Drop unneeded columns: PatientID, DoctorInCharge

In [6]:
df.drop(['PatientID','DoctorInCharge'],axis=1, inplace=True)

## Data normalization

In [7]:
scaler = MinMaxScaler()
# Rescale each continuous column to the [0, 1] range
for i in ['Age','BMI','AlcoholConsumption','PhysicalActivity','DietQuality','SleepQuality','SystolicBP','DiastolicBP','CholesterolTotal','CholesterolLDL','CholesterolHDL','CholesterolTriglycerides','MMSE','FunctionalAssessment','ADL']:
    df[i] = scaler.fit_transform(df[[i]])
df.describe()

Unnamed: 0,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis
count,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,...,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0
mean,0.49696,0.506282,0.697534,1.286645,0.506199,0.288506,0.502191,0.492456,0.498926,0.508312,...,0.508162,0.208004,0.156817,0.498244,0.205212,0.158213,0.150768,0.158678,0.301536,0.353653
std,0.299674,0.500077,0.996128,0.904527,0.288883,0.453173,0.288079,0.286182,0.291227,0.294065,...,0.28939,0.405974,0.363713,0.295023,0.40395,0.365026,0.357906,0.365461,0.459032,0.478214
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.233333,0.0,0.0,1.0,0.264272,0.0,0.257054,0.257117,0.245178,0.246843,...,0.256685,0.0,0.0,0.234191,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.5,1.0,0.0,1.0,0.512933,0.0,0.496936,0.477053,0.50723,0.519077,...,0.509601,0.0,0.0,0.503846,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.766667,1.0,1.0,2.0,0.754923,1.0,0.758278,0.743632,0.755758,0.760335,...,0.754954,0.0,0.0,0.758137,0.0,0.0,0.0,0.0,1.0,1.0
max,1.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
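
As a sanity check on the normalization step, min-max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces this by hand for a few `Age` values, using the range 60 to 90 reported by `describe()`, and compares the result with `MinMaxScaler`. The toy array is an assumption for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy Age values (assumption for illustration); 60 and 90 are the min/max from describe()
ages = np.array([[60.0], [73.0], [89.0], [90.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

With this range, an age of 73 maps to 13/30, roughly 0.4333, which is the scaled `Age` value that the first row takes in the later outputs.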


## Data sampling: SMOTE (SMOTENC)

In [8]:
cate_columns = ['Gender','Ethnicity','EducationLevel','Smoking','FamilyHistoryAlzheimers','CardiovascularDisease','Diabetes','Depression','HeadInjury','Hypertension','MemoryComplaints','BehavioralProblems','Confusion','Disorientation','PersonalityChanges','DifficultyCompletingTasks','Forgetfulness']
cate_indices = [df.columns.get_loc(col) for col in cate_columns]
cate_indices

[1, 2, 3, 5, 10, 11, 12, 13, 14, 15, 24, 25, 27, 28, 29, 30, 31]
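
`SMOTENC` expects the categorical positions relative to the feature matrix it is fitted on, while the indices above are computed on the full `df`. The two agree here because `Diagnosis` is the last column, so dropping it does not shift any other position. A minimal sketch of that check on a toy frame (the frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with the target as the last column (assumption mirroring the notebook's layout)
toy = pd.DataFrame({
    "Age": [0.1, 0.9],
    "Gender": [0, 1],
    "BMI": [0.3, 0.5],
    "Diagnosis": [0, 1],
})
cate = ["Gender"]

# Positions in the full frame versus positions after dropping the target
idx_full = [toy.columns.get_loc(c) for c in cate]
idx_features = [toy.drop("Diagnosis", axis=1).columns.get_loc(c) for c in cate]

print(idx_full, idx_features)
```

If the target sat before any categorical column, the two lists would differ and the indices would need to be computed on the feature frame instead.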

In [9]:
from imblearn.over_sampling import SMOTENC

# Positions of the categorical columns, which SMOTENC must not interpolate
cate_columns = ['Gender','Ethnicity','EducationLevel','Smoking','FamilyHistoryAlzheimers','CardiovascularDisease','Diabetes','Depression','HeadInjury','Hypertension','MemoryComplaints','BehavioralProblems','Confusion','Disorientation','PersonalityChanges','DifficultyCompletingTasks','Forgetfulness']
cate_indices = [df.columns.get_loc(col) for col in cate_columns]

# Apply SMOTENC: oversample the minority class up to 80% of the majority class size
overSampling = SMOTENC(categorical_features=cate_indices, sampling_strategy=0.8, random_state=42)
feature = df.drop('Diagnosis', axis=1)
target = df['Diagnosis']
feature_sample, target_sample = overSampling.fit_resample(feature, target)

print(feature.shape, target.shape, feature_sample.shape, target_sample.shape)
df = pd.concat([feature_sample, target_sample], axis=1)
df


(2149, 32) (2149,) (2500, 32) (2500,)


Unnamed: 0,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis
0,0.433333,0,0,2,0.316960,0,0.665183,0.633375,0.133931,0.837564,...,0.652102,0,0,0.172486,0,0,0,1,0,0
1,0.966667,0,0,0,0.473058,0,0.227170,0.762862,0.050995,0.525021,...,0.712108,0,0,0.259154,0,0,0,0,1,0
2,0.433333,0,3,1,0.111553,0,0.978276,0.785408,0.181896,0.945597,...,0.589697,0,0,0.711936,0,1,0,1,0,0
3,0.466667,1,0,1,0.752163,1,0.610751,0.843804,0.743443,0.731994,...,0.896823,0,1,0.648094,0,0,0,0,0,0
4,0.966667,0,0,0,0.228472,0,0.923204,0.631707,0.078698,0.265892,...,0.604699,0,0,0.001341,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,0.105265,1,0,2,0.662061,0,0.060032,0.503047,0.287995,0.162242,...,0.290674,0,0,0.164272,0,0,0,0,0,1
2496,0.522259,1,0,1,0.841854,0,0.828460,0.310607,0.905753,0.642736,...,0.192537,0,0,0.098954,0,0,0,0,0,1
2497,0.151783,1,0,0,0.167029,0,0.734901,0.278921,0.350008,0.670189,...,0.737102,0,0,0.553522,0,0,0,0,0,1
2498,0.552156,1,0,2,0.782612,0,0.837168,0.613912,0.624380,0.754645,...,0.709766,1,0,0.106809,0,0,0,0,0,1


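# Hypotheses

## 1. Demographics, medical history, lifestyle, clinical measurements, cognitive and functional assessments, and symptoms affect whether a patient has Alzheimer's.
- ### Result
    - The cognitive and functional assessment group has the strongest influence on the Alzheimer's diagnosis.

- ### Conclusion
    - Build the Alzheimer's prediction model from the cognitive and functional assessment group
    - Identify which variables influence the cognitive and functional assessments, to look for ways to support Alzheimer's prevention and treatment

### Fit a logistic regression model on all variables and extract the significant ones
#### Significant variables
- Ethnicity, EducationLevel, Smoking, FamilyHistoryAlzheimers, HeadInjury, CholesterolLDL, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems, ADL, Confusion, PersonalityChanges (p ≤ 0.05)
- Comparing absolute regression coefficients, the variables in the cognitive and functional assessment group (FunctionalAssessment, ADL, MMSE, MemoryComplaints, BehavioralProblems) are clearly larger than those in the other groups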

In [10]:
feature_columns = list(df.drop('Diagnosis',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)
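
The loop above wraps every categorical column in `C(...)` so that the formula interface treats it as a categorical term instead of a numeric one. A compact sketch of the same idea, with the hypothetical helper name `build_formula` and a three-column subset chosen only for illustration:

```python
def build_formula(target, columns, categorical):
    """Return a formula string, wrapping categorical columns in C(...)."""
    terms = [f"C({col})" if col in categorical else col for col in columns]
    return f"{target} ~ " + " + ".join(terms)


# Small subset of the notebook's columns, for illustration
formula = build_formula("Diagnosis", ["Age", "Gender", "MMSE"], {"Gender"})
print(formula)  # Diagnosis ~ Age + C(Gender) + MMSE
```

The resulting string has the same shape as the one passed to `logit` in the next cell.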

In [11]:
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train,test = train_test_split(df,test_size = 0.3, random_state = 23)
model = logit('Diagnosis ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
pred_class = (pred > 0.5).astype(int)
print(classification_report(test['Diagnosis'],pred_class))

Optimization terminated successfully.
         Current function value: 0.396831
         Iterations 7
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       403
           1       0.82      0.83      0.83       347

    accuracy                           0.84       750
   macro avg       0.84      0.84      0.84       750
weighted avg       0.84      0.84      0.84       750
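
The report above depends on the 0.5 cutoff applied to the predicted probabilities. The sketch below shows, on toy labels and probabilities that are assumptions for illustration, how precision and recall for the positive class move when the cutoff changes.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities (assumption for illustration)
y_true = np.array([0, 0, 0, 1, 1, 1])
prob = np.array([0.2, 0.4, 0.6, 0.45, 0.7, 0.9])

# Precision and recall of class 1 at two different cutoffs
results = {}
for threshold in (0.5, 0.4):
    pred_class = (prob > threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, pred_class),
        recall_score(y_true, pred_class),
    )

print(results)
```

Lowering the cutoff in this toy case recovers the positive that was missed at 0.5, which is the kind of trade-off worth checking when missed diagnoses are costlier than false alarms.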



In [12]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')

In [13]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

FunctionalAssessment               4.417250
ADL                                4.099187
MMSE                               2.923628
C(MemoryComplaints)[T.1]           2.311231
C(BehavioralProblems)[T.1]         2.121951
C(HeadInjury)[T.1]                 0.839768
C(Ethnicity)[T.3]                  0.670095
C(EducationLevel)[T.3]             0.586469
C(Smoking)[T.1]                    0.562357
C(Confusion)[T.1]                  0.560556
CholesterolLDL                     0.540790
C(PersonalityChanges)[T.1]         0.526275
C(Ethnicity)[T.1]                  0.513287
C(FamilyHistoryAlzheimers)[T.1]    0.467816
dtype: float64
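
The two cells above (filter by p-value, then rank by absolute coefficient) can also be written without an explicit loop. This is a sketch on hypothetical p-values and coefficients, not the fitted model's results. `errors="ignore"` is used so that the step does not fail when the intercept is not among the significant terms.

```python
import pandas as pd

# Hypothetical p-values and coefficients (assumption; not the notebook's results)
pvalues = pd.Series(
    {"Intercept": 0.001, "MMSE": 0.0004, "Age": 0.62, "C(Smoking)[T.1]": 0.03}
)
params = pd.Series(
    {"Intercept": 2.1, "MMSE": -2.9, "Age": 0.1, "C(Smoking)[T.1]": -0.56}
)

# Keep terms with p <= 0.05 and leave the intercept out of the ranking
significant = pvalues[pvalues <= 0.05].index.drop("Intercept", errors="ignore")

# Rank the remaining terms by absolute coefficient size
ranked = params[significant].abs().sort_values(ascending=False)
print(ranked)
```

With a fitted statsmodels result, `model.pvalues` and `model.params` would take the place of the two toy Series.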

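### Refit the logistic regression using only the cognitive and functional assessment group, then check how accuracy and the coefficients change
- With only the cognitive and functional assessment group, model performance did not drop (accuracy stays at 0.84) and the regression coefficients changed little
- We therefore judge that the cognitive and functional assessment group alone is sufficient for a prediction model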

In [14]:
df_top5 = df[['FunctionalAssessment','ADL','MMSE','MemoryComplaints','BehavioralProblems','Diagnosis']]
feature_columns = list(df_top5.drop('Diagnosis',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [15]:
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_top5,test_size = 0.3, random_state = 23)
model = logit('Diagnosis ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
pred_class = (pred > 0.5).astype(int)
print(classification_report(test['Diagnosis'],pred_class))

Optimization terminated successfully.
         Current function value: 0.430391
         Iterations 7
              precision    recall  f1-score   support

           0       0.85      0.86      0.85       403
           1       0.83      0.82      0.83       347

    accuracy                           0.84       750
   macro avg       0.84      0.84      0.84       750
weighted avg       0.84      0.84      0.84       750



In [16]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(MemoryComplaints)[T.1]', 'C(BehavioralProblems)[T.1]', 'FunctionalAssessment', 'ADL', 'MMSE']


In [17]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

FunctionalAssessment          4.149696
ADL                           3.979356
MMSE                          2.935631
C(MemoryComplaints)[T.1]      2.199120
C(BehavioralProblems)[T.1]    1.928970
dtype: float64
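
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives a multiplicative change in the odds. Because the continuous predictors were min-max scaled, a coefficient for such a variable describes the change across its full observed range. The values below are the absolute coefficients printed above, so the direction of each effect cannot be recovered from them. This is a sketch of the conversion only.

```python
import numpy as np
import pandas as pd

# Absolute coefficients copied from the output above (signs are not available there)
abs_coef = pd.Series({
    "FunctionalAssessment": 4.149696,
    "ADL": 3.979356,
    "MMSE": 2.935631,
    "C(MemoryComplaints)[T.1]": 2.199120,
    "C(BehavioralProblems)[T.1]": 1.928970,
})

# exp(|coefficient|) is the multiplicative change in the odds, ignoring direction
odds_factor = np.exp(abs_coef)
print(odds_factor.round(1))
```

Applying `np.exp` to the signed `model.params` instead would give conventional odds ratios, with values below 1 for protective effects.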

## 2. Demographics, symptoms, medical history, lifestyle, and clinical measurements affect the cognitive and functional assessment variables.

### FunctionalAssessment

In [18]:
df_FA = df.drop(['ADL','MMSE','MemoryComplaints','BehavioralProblems','Diagnosis'],axis=1)
feature_columns = list(df_FA.drop('FunctionalAssessment',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [19]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_FA,test_size = 0.3, random_state = 23)
model = ols('FunctionalAssessment ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['FunctionalAssessment'],pred))
print(mean_squared_error(test['FunctionalAssessment'],pred))

-0.03303482550595982
0.08347065485900691


In [20]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(Ethnicity)[T.2]', 'C(CardiovascularDisease)[T.1]', 'C(Diabetes)[T.1]', 'BMI']


In [21]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(Ethnicity)[T.2]                0.064325
BMI                              0.049295
C(CardiovascularDisease)[T.1]    0.047386
C(Diabetes)[T.1]                 0.040951
dtype: float64

In [22]:
df_FA = df[['Ethnicity','CardiovascularDisease','Diabetes','BMI','FunctionalAssessment']]
feature_columns = list(df_FA.drop('FunctionalAssessment',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [23]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_FA,test_size = 0.3, random_state = 23)
model = ols('FunctionalAssessment ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['FunctionalAssessment'],pred))
print(mean_squared_error(test['FunctionalAssessment'],pred))

-0.026924639872805356
0.08297694333688113


In [24]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(Ethnicity)[T.2]', 'C(CardiovascularDisease)[T.1]', 'C(Diabetes)[T.1]', 'BMI']


In [25]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(Ethnicity)[T.2]                0.068188
BMI                              0.047433
C(CardiovascularDisease)[T.1]    0.044630
C(Diabetes)[T.1]                 0.043763
dtype: float64

### ADL

In [26]:
df_ADL = df.drop(['FunctionalAssessment','MMSE','MemoryComplaints','BehavioralProblems','Diagnosis'],axis=1)
feature_columns = list(df_ADL.drop('ADL',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [27]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_ADL,test_size = 0.3, random_state = 23)
model = ols('ADL ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['ADL'],pred))
print(mean_squared_error(test['ADL'],pred))

-0.02158391164066753
0.0835098280588827


In [28]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(Ethnicity)[T.1]', 'C(Ethnicity)[T.3]', 'C(Diabetes)[T.1]', 'Age']


In [29]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(Ethnicity)[T.3]    0.059446
Age                  0.048319
C(Diabetes)[T.1]     0.042929
C(Ethnicity)[T.1]    0.042605
dtype: float64

In [30]:
df_ADL = df[['Ethnicity','Age','Diabetes','ADL']]
feature_columns = list(df_ADL.drop('ADL',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [31]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_ADL,test_size = 0.3, random_state = 23)
model = ols('ADL ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['ADL'],pred))
print(mean_squared_error(test['ADL'],pred))

-0.01288511741675924
0.08279874128306267


In [32]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(Ethnicity)[T.1]', 'C(Ethnicity)[T.3]', 'C(Diabetes)[T.1]', 'Age']


In [33]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(Ethnicity)[T.3]    0.060471
Age                  0.054536
C(Ethnicity)[T.1]    0.044079
C(Diabetes)[T.1]     0.044050
dtype: float64

### MMSE

In [34]:
df_MMSE = df.drop(['FunctionalAssessment','ADL','MemoryComplaints','BehavioralProblems','Diagnosis'],axis=1)
feature_columns = list(df_MMSE.drop('MMSE',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [35]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_MMSE,test_size = 0.3, random_state = 23)
model = ols('MMSE ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['MMSE'],pred))
print(mean_squared_error(test['MMSE'],pred))

-0.026856449865876764
0.08467841692560661


In [36]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(EducationLevel)[T.2]', 'C(PersonalityChanges)[T.1]', 'C(Forgetfulness)[T.1]']


In [37]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(PersonalityChanges)[T.1]    0.063428
C(EducationLevel)[T.2]        0.042213
C(Forgetfulness)[T.1]         0.029150
dtype: float64

In [38]:
df_MMSE = df[['PersonalityChanges','EducationLevel','Forgetfulness','MMSE']]
feature_columns = list(df_MMSE.drop('MMSE',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [39]:
from statsmodels.formula.api import ols 
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_MMSE,test_size = 0.3, random_state = 23)
model = ols('MMSE ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
print(r2_score(test['MMSE'],pred))
print(mean_squared_error(test['MMSE'],pred))

-0.016069594024050504
0.08378889251700025


In [40]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
list_column.remove('Intercept')
print(list_column)

['C(PersonalityChanges)[T.1]', 'C(EducationLevel)[T.2]', 'C(Forgetfulness)[T.1]']


In [41]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(PersonalityChanges)[T.1]    0.066026
C(EducationLevel)[T.2]        0.046393
C(Forgetfulness)[T.1]         0.032598
dtype: float64

### MemoryComplaints

In [111]:
from imblearn.over_sampling import SMOTENC
import pandas as pd
df_MC = df.drop(['FunctionalAssessment','ADL','MMSE','BehavioralProblems','Diagnosis','MemoryComplaints'],axis=1)
df_MC = pd.concat([df_MC,df['MemoryComplaints']],axis=1)
cate_columns = ['Gender','Ethnicity','EducationLevel','Smoking','FamilyHistoryAlzheimers','CardiovascularDisease','Diabetes','Depression','HeadInjury','Hypertension','Confusion','Disorientation','PersonalityChanges','DifficultyCompletingTasks','Forgetfulness']
cate_indices = [df_MC.columns.get_loc(col) for col in cate_columns]
cate_indices

# Apply SMOTENC
overSampling = SMOTENC(categorical_features=cate_indices, sampling_strategy=0.8, random_state=42)
feature = df_MC.drop('MemoryComplaints', axis=1)
target = df_MC['MemoryComplaints']
feature_sample, target_sample = overSampling.fit_resample(feature, target)

print(feature.shape, target.shape, feature_sample.shape, target_sample.shape)
df_MC = pd.concat([feature_sample, target_sample], axis=1)
df_MC


(2500, 27) (2500,) (3546, 27) (3546,)


Unnamed: 0,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,CholesterolTotal,CholesterolLDL,CholesterolHDL,CholesterolTriglycerides,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,MemoryComplaints
0,0.433333,0,0,2,0.316960,0,0.665183,0.633375,0.133931,0.837564,...,0.615567,0.039538,0.171039,0.319802,0,0,0,1,0,0
1,0.966667,0,0,0,0.473058,0,0.227170,0.762862,0.050995,0.525021,...,0.540822,0.956205,0.738026,0.698711,0,0,0,0,1,0
2,0.433333,0,3,1,0.111553,0,0.978276,0.785408,0.181896,0.945597,...,0.894520,0.688497,0.622290,0.095072,0,1,0,1,0,0
3,0.466667,1,0,1,0.752163,1,0.610751,0.843804,0.743443,0.731994,...,0.063302,0.101085,0.605851,0.649922,0,0,0,0,0,0
4,0.966667,0,0,0,0.228472,0,0.923204,0.631707,0.078698,0.265892,...,0.583781,0.284763,0.461019,0.688892,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3541,0.751756,1,0,1,0.656051,0,0.806172,0.338110,0.081291,0.599621,...,0.447325,0.370307,0.624714,0.216122,0,0,0,0,0,1
3542,0.810584,1,0,1,0.836502,0,0.657047,0.247631,0.501080,0.857900,...,0.525842,0.287754,0.082913,0.623987,0,0,0,0,1,1
3543,0.726720,0,0,1,0.798293,0,0.212301,0.249531,0.652724,0.107822,...,0.453033,0.516506,0.698206,0.275231,0,0,0,0,0,1
3544,0.222070,0,0,2,0.848765,0,0.389676,0.687634,0.855266,0.518864,...,0.532671,0.464366,0.845408,0.012641,0,0,0,0,0,1


In [112]:
feature_columns = list(df_MC.drop('MemoryComplaints',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [113]:
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_MC,test_size = 0.3, random_state = 23)
model = logit('MemoryComplaints ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
pred_class = (pred > 0.5).astype(int)
print(classification_report(test['MemoryComplaints'],pred_class))

Optimization terminated successfully.
         Current function value: 0.592570
         Iterations 6
              precision    recall  f1-score   support

           0       0.78      0.75      0.76       601
           1       0.69      0.72      0.70       463

    accuracy                           0.74      1064
   macro avg       0.73      0.73      0.73      1064
weighted avg       0.74      0.74      0.74      1064



In [114]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
# list_column.remove('Intercept')

In [115]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(HeadInjury)[T.1]                   1.058637
C(Diabetes)[T.1]                     0.988315
C(Ethnicity)[T.3]                    0.984485
C(PersonalityChanges)[T.1]           0.840936
C(Depression)[T.1]                   0.724144
C(Hypertension)[T.1]                 0.719963
C(Confusion)[T.1]                    0.702962
C(FamilyHistoryAlzheimers)[T.1]      0.633101
C(DifficultyCompletingTasks)[T.1]    0.504811
BMI                                  0.497418
C(Disorientation)[T.1]               0.467240
C(EducationLevel)[T.3]               0.451107
C(Ethnicity)[T.2]                    0.435246
C(Ethnicity)[T.1]                    0.373536
C(CardiovascularDisease)[T.1]        0.369511
C(Smoking)[T.1]                      0.338382
Age                                  0.318781
SleepQuality                         0.318725
DietQuality                          0.314050
dtype: float64

In [116]:
df_MC_top = df_MC[['HeadInjury','Diabetes','Ethnicity','PersonalityChanges','Depression','Hypertension','Confusion','FamilyHistoryAlzheimers','DifficultyCompletingTasks','BMI','Disorientation','EducationLevel','CardiovascularDisease','Smoking','Age','SleepQuality','DietQuality','MemoryComplaints']]
feature_columns = list(df_MC_top.drop('MemoryComplaints',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [118]:
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_MC_top,test_size = 0.3, random_state = 23)
model = logit('MemoryComplaints ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
pred_class = (pred > 0.5).astype(int)
print(classification_report(test['MemoryComplaints'],pred_class))

Optimization terminated successfully.
         Current function value: 0.594938
         Iterations 6
              precision    recall  f1-score   support

           0       0.77      0.74      0.76       601
           1       0.68      0.71      0.70       463

    accuracy                           0.73      1064
   macro avg       0.73      0.73      0.73      1064
weighted avg       0.73      0.73      0.73      1064



In [120]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
# list_column.remove('Intercept')
print(list_column)

['C(HeadInjury)[T.1]', 'C(Diabetes)[T.1]', 'C(Ethnicity)[T.1]', 'C(Ethnicity)[T.2]', 'C(Ethnicity)[T.3]', 'C(PersonalityChanges)[T.1]', 'C(Depression)[T.1]', 'C(Hypertension)[T.1]', 'C(Confusion)[T.1]', 'C(FamilyHistoryAlzheimers)[T.1]', 'C(DifficultyCompletingTasks)[T.1]', 'C(Disorientation)[T.1]', 'C(CardiovascularDisease)[T.1]', 'C(Smoking)[T.1]', 'BMI', 'Age', 'SleepQuality', 'DietQuality']


In [121]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(HeadInjury)[T.1]                   1.060703
C(Diabetes)[T.1]                     0.977051
C(Ethnicity)[T.3]                    0.973855
C(PersonalityChanges)[T.1]           0.837965
C(Depression)[T.1]                   0.742573
C(Confusion)[T.1]                    0.722652
C(Hypertension)[T.1]                 0.715254
C(FamilyHistoryAlzheimers)[T.1]      0.635545
BMI                                  0.511805
C(DifficultyCompletingTasks)[T.1]    0.511704
C(Disorientation)[T.1]               0.482041
C(Ethnicity)[T.2]                    0.430760
C(Ethnicity)[T.1]                    0.370145
C(CardiovascularDisease)[T.1]        0.360385
SleepQuality                         0.336768
Age                                  0.336490
DietQuality                          0.333135
C(Smoking)[T.1]                      0.321090
dtype: float64

### BehavioralProblems

In [98]:
df_BP.BehavioralProblems

Unnamed: 0,BehavioralProblems,BehavioralProblems.1
0,0,0
1,0,0
2,0,0
3,1,1
4,0,0
...,...,...
2495,0,0
2496,0,0
2497,0,0
2498,0,0


In [99]:
from imblearn.over_sampling import SMOTENC
import pandas as pd
df_BP = df.drop(['FunctionalAssessment','ADL','MMSE','MemoryComplaints','Diagnosis','BehavioralProblems'],axis=1)
df_BP = pd.concat([df_BP,df['BehavioralProblems']],axis=1)
cate_columns = ['Gender','Ethnicity','EducationLevel','Smoking','FamilyHistoryAlzheimers','CardiovascularDisease','Diabetes','Depression','HeadInjury','Hypertension','Confusion','Disorientation','PersonalityChanges','DifficultyCompletingTasks','Forgetfulness']
cate_indices = [df_BP.columns.get_loc(col) for col in cate_columns]
cate_indices

# Apply SMOTENC
overSampling = SMOTENC(categorical_features=cate_indices, sampling_strategy=0.8, random_state=42)
feature = df_BP.drop('BehavioralProblems', axis=1)
target = df_BP['BehavioralProblems']
feature_sample, target_sample = overSampling.fit_resample(feature, target)

print(feature.shape, target.shape, feature_sample.shape, target_sample.shape)
df_BP = pd.concat([feature_sample, target_sample], axis=1)
df_BP


(2500, 27) (2500,) (3841, 27) (3841,)


Unnamed: 0,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,CholesterolTotal,CholesterolLDL,CholesterolHDL,CholesterolTriglycerides,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,BehavioralProblems
0,0.433333,0,0,2,0.316960,0,0.665183,0.633375,0.133931,0.837564,...,0.615567,0.039538,0.171039,0.319802,0,0,0,1,0,0
1,0.966667,0,0,0,0.473058,0,0.227170,0.762862,0.050995,0.525021,...,0.540822,0.956205,0.738026,0.698711,0,0,0,0,1,0
2,0.433333,0,3,1,0.111553,0,0.978276,0.785408,0.181896,0.945597,...,0.894520,0.688497,0.622290,0.095072,0,1,0,1,0,0
3,0.466667,1,0,1,0.752163,1,0.610751,0.843804,0.743443,0.731994,...,0.063302,0.101085,0.605851,0.649922,0,0,0,0,0,1
4,0.966667,0,0,0,0.228472,0,0.923204,0.631707,0.078698,0.265892,...,0.583781,0.284763,0.461019,0.688892,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3836,0.713191,1,0,2,0.780545,0,0.287384,0.230303,0.603548,0.451176,...,0.311621,0.425407,0.131417,0.634075,0,0,0,0,1,1
3837,0.909341,1,1,2,0.963017,0,0.452093,0.865590,0.686150,0.461161,...,0.349243,0.708007,0.233756,0.515707,0,0,0,0,0,1
3838,0.460715,0,0,2,0.293292,0,0.364285,0.378214,0.605574,0.358289,...,0.892080,0.259335,0.926512,0.047448,0,0,0,0,0,1
3839,0.846374,1,0,2,0.464766,0,0.050584,0.229237,0.214091,0.516537,...,0.863201,0.748036,0.068554,0.459116,0,0,0,0,1,1


In [102]:
feature_columns = list(df_BP.drop('BehavioralProblems',axis=1).columns)
feature_list = []
for i in feature_columns:
    if i in cate_columns:
        i = 'C({})'.format(i)
    feature_list.append(i)

In [103]:
from statsmodels.formula.api import logit 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train,test = train_test_split(df_BP,test_size = 0.3, random_state = 23)
model = logit('BehavioralProblems ~ '+'+'.join(feature_list),train).fit()
pred = model.predict(test)
pred_class = (pred > 0.5).astype(int)
print(classification_report(test['BehavioralProblems'],pred_class))

Optimization terminated successfully.
         Current function value: 0.556099
         Iterations 6
              precision    recall  f1-score   support

           0       0.74      0.76      0.75       618
           1       0.72      0.69      0.70       535

    accuracy                           0.73      1153
   macro avg       0.73      0.73      0.73      1153
weighted avg       0.73      0.73      0.73      1153



In [104]:
list_column=[]
for i in model.pvalues.index:
    if model.pvalues[i] <= 0.05:
        list_column.append(i)
# list_column.remove('Intercept')

In [105]:
model_params = abs(model.params[list_column])
model_params.sort_values(ascending=False)

C(DifficultyCompletingTasks)[T.1]    1.179955
C(Ethnicity)[T.3]                    1.152485
C(Diabetes)[T.1]                     1.101324
C(PersonalityChanges)[T.1]           1.029852
C(Confusion)[T.1]                    1.010592
C(Ethnicity)[T.2]                    0.977410
C(FamilyHistoryAlzheimers)[T.1]      0.877655
C(CardiovascularDisease)[T.1]        0.835119
C(Smoking)[T.1]                      0.764679
C(Disorientation)[T.1]               0.735628
C(EducationLevel)[T.2]               0.631553
C(EducationLevel)[T.1]               0.606535
C(Depression)[T.1]                   0.564898
Age                                  0.540441
C(HeadInjury)[T.1]                   0.524331
C(Hypertension)[T.1]                 0.451799
C(Ethnicity)[T.1]                    0.434883
C(Gender)[T.1]                       0.336515
dtype: float64

# Conclusion