# <b>Background</b>

<p>ET patients: patients with essential tremor (ET). ET is the most common movement disorder and presents mainly as postural and kinetic tremor of the hands, head, and other parts of the body. Affected sites: upper limbs, head, face, and jaw.</p>
<p>See: Louis 2003, "Factors associated with increased risk of head tremor in essential tremor: a community-based study in northern Manhattan".</p>
<p>Midline tremor: includes the face (jaw and lips), tongue, voice, head (also called neck), and trunk.</p>

    
# <b>Objectives</b>
1.1.	Explore the risk factors for midline tremor in ET patients.
    
1.2.	Explore the risk factors for anxiety and depression in ET patients.
    
# <b>Task</b>

Use logistic regression models to assess the effect of each variable.
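As a reminder of what a logistic regression coefficient means, the sketch below computes an odds ratio and its Wald 95% confidence interval from a 2x2 table. For a model with an intercept and a single binary predictor, the fitted coefficient equals the log of this odds ratio. The counts are hypothetical and are not taken from the study.

```python
import math

# Hypothetical 2x2 counts (NOT study data): exposure status vs. outcome.
exposed_cases, exposed_controls = 30, 20
unexposed_cases, unexposed_controls = 15, 35

# Odds ratio; equals exp(beta) for a logistic model with an intercept
# and this single binary predictor.
odds_ratio = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)
beta = math.log(odds_ratio)

# Wald standard error of log(OR), z statistic, and 95% confidence interval.
se = math.sqrt(1 / exposed_cases + 1 / exposed_controls
               + 1 / unexposed_cases + 1 / unexposed_controls)
z = beta / se
ci_low = math.exp(beta - 1.96 * se)
ci_high = math.exp(beta + 1.96 * se)

print(f"OR = {odds_ratio:.2f}, z = {z:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes 1 corresponds to |z| > 1.96, i.e. the variable is associated with the outcome at the 5% level.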


# <b>Step 1: Imports</b>

In [16]:
# Imports: load the required Python packages
import xlrd
import os
import re
import pandas as pd
import numpy as np
import itertools
from scipy import stats
from scipy.stats import kstest
from scipy.stats import chi2_contingency
from scipy.stats import chisquare
from scipy.stats import mannwhitneyu

import matplotlib as mpl
from matplotlib import pyplot as plt
from numpy import nan

import seaborn as sns
import time

from imblearn.over_sampling import SMOTE
# Linear models and ensemble classifiers (random forest and others)
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)

from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve,auc
from sklearn import linear_model, datasets
import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import collections  # count the frequency of items in a list
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

# <b>Step 2: Load the data</b>

1) Set the default directory.

2) Read the CSV files.


In [3]:
dir = "./"
print(os.listdir(dir))

['.ipynb_checkpoints', '1.数据读取和清洗.ipynb', 'data', 'output', '2.中线震颤并发情况.ipynb', '3.人口特征统计.ipynb', '4.疾病特征统计.ipynb', '5.变量相关性和重要性.ipynb', '6.危险因子（逻辑回归）.ipynb']


In [5]:
# Cleaned data
model_data= pd.read_csv(dir+"output/model_data.csv",index_col=0) # first column is the row index
#df = df.drop(columns = df.columns[0]) # drop unneeded columns
model_data.head(2)

Unnamed: 0,编号,姓名,性别,受教育时间,工作状态合并栏,婚姻状况合并栏,年龄,发病年龄,HAMA总,HAMD总分,...,自述抑郁时长,主观认知功能下降,面声颈量表分数,声颈量表分数,总病程,四肢静止性震颤,HAMD分级2级,HAMD分级3级,HAMA分级2级,HAMA3级
0,N001,马叔杏,1,6,1,1,63,59,15,14,...,1.0,0.0,2,2,4.0,0,0,1,1,2
1,N002,方怀琼,0,9,2,1,64,44,4,6,...,0.0,0.0,0,0,20.0,0,0,0,0,0


In [14]:
inf= pd.read_csv(dir+"output/inf.csv",index_col=0) # first column is the row index
#df = df.drop(columns = df.columns[0]) # drop unneeded columns
inf.head(2)

Unnamed: 0,编号,姓名,主观认知功能下降,MMSE,家族史,高血压,糖尿病,其他,抗ET药物使用,抗焦虑抑郁药物使用,...,抑郁分类,焦虑分类,静止性上肢震颤分数有无,运动性上肢震颤总分有无,运动性下肢震颤分数有无,运动性四肢震颤总分有无,面声颈部位分级分数有无,面声颈量表分数有无,声颈量表部位分级有无,声颈量表分数有无
0,N001,马叔杏,0.0,26,0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,1,1,1,1,1,1,1
1,N002,方怀琼,0.0,28,1,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,1,0,0,0,0
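The regression cell further down takes predictors from `model_data` and the outcome from `inf`, and pandas pairs the two by row index when a column is assigned. A minimal sketch of why that matters, and of a safer join on the patient ID, is shown below. The frames and values are made up; only the column names `编号` and `抑郁分类` come from the notebook.

```python
import pandas as pd

# Made-up frames; the second one lists the same patients in a different order.
predictors = pd.DataFrame({"编号": ["P1", "P2", "P3"], "年龄": [63, 64, 70]})
outcomes = pd.DataFrame({"编号": ["P2", "P1", "P3"], "抑郁分类": [0, 1, 1]})

# Column assignment aligns on the row index (0, 1, 2), not on the patient ID,
# so P1 silently receives P2's label here.
naive = predictors.copy()
naive["type"] = outcomes["抑郁分类"]

# Joining on the ID pairs each patient with the right label;
# validate= raises if an ID is duplicated on either side.
merged = predictors.merge(outcomes, on="编号", how="inner", validate="one_to_one")

print(naive)
print(merged)
```

If both files were written from the same cleaned table with the same index, the plain assignment is fine; the merge is simply a cheap way to confirm it.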


# <b>Step 3: Logistic regression</b>


In [10]:
classfield = ['性别', '工作状态合并栏', '婚姻状况合并栏', '家族史', '抗焦虑抑郁药物使用', '抗ET药物使用', '高血压', '糖尿病', '其他', '吸烟', '饮酒', '自述焦虑时长', '自述抑郁时长', '主观认知功能下降', 'HAMD分级2级', 'HAMD分级3级', 'HAMA分级2级', 'HAMA3级']

In [11]:
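# Cast categorical fields to str so that pd.get_dummies later treats them as categories
model_data[classfield] = model_data[classfield].astype(str)
# Scale scores are numeric and stay as integers
model_data["声颈量表分数"] = model_data["声颈量表分数"].astype(int)
model_data["面声颈量表分数"] = model_data["面声颈量表分数"].astype(int)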
model_data.dtypes

编号            object
姓名            object
性别            object
受教育时间          int64
工作状态合并栏       object
婚姻状况合并栏       object
年龄             int64
发病年龄           int64
HAMA总          int64
HAMD总分         int64
匹兹堡总分          int64
TRS_C          int64
MMSE           int64
家族史           object
抗焦虑抑郁药物使用     object
抗ET药物使用       object
高血压           object
糖尿病           object
其他            object
吸烟            object
饮酒            object
自述焦虑时长        object
自述抑郁时长        object
主观认知功能下降      object
面声颈量表分数        int64
声颈量表分数         int64
总病程          float64
四肢静止性震颤        int64
HAMD分级2级      object
HAMD分级3级      object
HAMA分级2级      object
HAMA3级        object
dtype: object
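The cast to `str` above matters because `pd.get_dummies` only expands object, string, or category columns by default; integer-coded categories would otherwise enter the model as if they were continuous. A small sketch on made-up rows (column names follow the notebook, values are illustrative):

```python
import pandas as pd

# Made-up rows; categories are stored as integer codes at first.
df = pd.DataFrame({
    "年龄": [63, 64, 70, 58],
    "性别": [1, 0, 1, 0],
    "婚姻状况合并栏": [1, 2, 3, 1],
})

# Integer-coded categories pass through get_dummies unchanged.
as_numbers = pd.get_dummies(df, drop_first=True, dtype=int)

# After casting to str, every level except the first gets its own 0/1 column.
categorical = ["性别", "婚姻状况合并栏"]
df[categorical] = df[categorical].astype(str)
as_dummies = pd.get_dummies(df, drop_first=True, dtype=int)

print(list(as_numbers.columns))
print(list(as_dummies.columns))
```

With `drop_first=True` the dropped level acts as the reference category, so a column such as `性别_1` measures the effect of level 1 relative to level 0.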

In [17]:
# Dummy variables
import statsmodels.api as sm
# Variable combinations chosen from the correlation and importance analysis.
# DataFrame.filter(items=...) silently skips names that are not columns, so
# "受教育时长" and "四肢静止性" are corrected to the actual column names
# "受教育时间" and "四肢静止性震颤".
type1 = ['面部','声音','颈部', '声颈有无', '面声颈有无', '意向性震颤']
list1 =  [["性别","婚姻状况合并栏","家族史","工作状态合并栏",  "发病年龄"],["TRS_C","年龄","性别"],["匹兹堡总分","受教育时间","发病年龄","总病程"],
          ["HAMA总","HAMD总分","家族史",'抗焦虑抑郁药物使用','抗ET药物使用']]
type2 =["抑郁分类","焦虑分类"]
list2 = [["性别","声颈量表分数","婚姻状况合并栏"],["性别","面声颈量表分数","家族史","工作状态合并栏"],["声颈量表分数","发病年龄","TRS_C"],
         ["面声颈量表分数","发病年龄","TRS_C"],["匹兹堡总分","受教育时间","发病年龄","总病程"],["四肢静止性震颤","婚姻状况合并栏","性别"]]
for i in type2 :
    print ("---------------" , i ," ---------------------------")
    for j in list2:
        data_log = model_data.filter(items = j).copy()
        data_log["type"] = inf[i].astype(int)   # outcome, aligned on the row index
        # dtype=int keeps the dummy columns numeric for statsmodels
        data_log = pd.get_dummies(data_log, drop_first = True, dtype = int)

        y = data_log['type']
        X = data_log.drop(columns = ["type"])

        # Note: sm.Logit does not add an intercept (sm.add_constant is not called)
        logit_model=sm.Logit(y,X)
        result=logit_model.fit()
        print(result.summary2())
        # The larger |z| is, the stronger the evidence against H0.
        # H0: the coefficient is 0 (z is treated as standard normal under H0).

        # SMOTE: oversample the minority class in the training split.
        # The variable is named `smote` so that it does not shadow the os module.
        smote = SMOTE(random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
        columns = X_train.columns
        os_data_X,os_data_y=smote.fit_resample(X_train, y_train)
        os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
        data_log = os_data_X
        data_log["type"] = np.asarray(os_data_y).ravel()

        # Note: the resampled data is split again below, so X_test/y_test from the
        # split above are never used and accuracy is measured on SMOTE-augmented rows.
        average = 0
        testNum = 1
        for k in range(0, testNum):

            train,test = train_test_split(data_log,test_size = 1/3, random_state = 42)
            y_train = train['type'].to_numpy()
            train = train.drop(['type'], axis=1)  # DataFrame of predictors
            x_train = train.values   # array of the training data
            x_test = test.drop(['type'], axis=1).values       # array of the test data
            y_test = test['type'].to_numpy()
            # Train the logistic regression classifier
            clf = LogisticRegression(class_weight ="balanced")
            clf.fit(x_train, y_train)
            print ("coefficient importances : " ,np.std(x_train, 0)* clf.coef_)
            print ("odds ratios : " , np.exp(clf.coef_))    # odds ratios
            y_pred = clf.predict(x_test)
            p = np.mean(y_pred == y_test)
            average += p
        print ("mean accuracy : " , average / testNum)
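`sm.Logit` fits exactly the columns it is given and does not add an intercept; `sm.add_constant(X)` is the usual way to include one. Without an intercept, the fitted log-likelihood can fall below that of the intercept-only null model, which makes McFadden's pseudo R-squared negative. The recorded output that follows shows this pattern (Log-Likelihood -92.985 against LL-Null -84.164). The sketch below reproduces the effect on synthetic data with a small NumPy Newton-Raphson fit. The data, the helper `fit_logit`, and all parameter values are illustrative assumptions, not the study's data or the statsmodels implementation.

```python
import numpy as np

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic fit by Newton-Raphson (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, loglik

# Synthetic data: a rare outcome (true intercept -2) and one centred predictor.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 0.5 * x)))
y = (rng.random(n) < p_true).astype(float)

# Fit once without and once with an intercept column.
_, ll_no_const = fit_logit(x[:, None], y)
_, ll_const = fit_logit(np.column_stack([np.ones(n), x]), y)

# Log-likelihood of the intercept-only (null) model.
p_bar = y.mean()
ll_null = n * (p_bar * np.log(p_bar) + (1 - p_bar) * np.log(1 - p_bar))

# McFadden pseudo R-squared: 1 - LL_model / LL_null.
r2_no_const = 1 - ll_no_const / ll_null
r2_const = 1 - ll_const / ll_null
print("pseudo R2 without intercept:", r2_no_const)
print("pseudo R2 with intercept:   ", r2_const)
```

When the pseudo R-squared is negative, the coefficients of the remaining variables are absorbing the missing baseline rate, so their z values and odds ratios should be read with that in mind.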
        

--------------- 抑郁分类  ---------------------------
Optimization terminated successfully.
         Current function value: 0.444904
         Iterations 8
                        Results: Logit
Model:              Logit            Pseudo R-squared: -0.105  
Dependent Variable: type             AIC:              191.9697
Date:               2019-06-17 09:58 BIC:              201.9967
No. Observations:   209              Log-Likelihood:   -92.985 
Df Model:           2                LL-Null:          -84.164 
Df Residuals:       206              LLR p-value:      1.0000  
Converged:          1.0000           Scale:            1.0000  
No. Iterations:     8.0000                                     
----------------------------------------------------------------
             Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
----------------------------------------------------------------
声颈量表分数      -0.1930    0.0996  -1.9374  0.0527  -0.3883   0.0022
性别_1        -2.0993    0.3776  -5.559