## Dataset Source
This dataset comes from the UCI ML Repository:
https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

## Dataset Information
This training dataset contains 467 records: 333 liver patient records and 134 non-liver-patient records. The data was collected in the north-east of Andhra Pradesh, India. The Label field is a class label that divides the records into two groups (liver patient or not). Any patient whose age exceeded 89 is listed as being of age "90".

## Attribute Information
1. Age: age of the patient
2. Gender: gender of the patient
3. TB: Total Bilirubin
4. DB: Direct (conjugated) Bilirubin
5. Alkphos: Alkaline Phosphatase
6. Sgpt: Alanine Aminotransferase (glutamic-pyruvic transaminase, GPT)
7. Sgot: Aspartate Aminotransferase (glutamic-oxaloacetic transaminase, GOT)
8. TP: Total Proteins
9. ALB: Albumin
10. A/G Ratio: Albumin and Globulin Ratio
11. Label: class label separating liver patients from non-patients

The CSV column names keep the original spellings (`Alkaline_Phosphotase`, `Alamine_Aminotransferase`, `Total_Protiens`), so the code below uses them as-is.

## Additional information
[How to interpret a liver function test report] (in Chinese)
https://www.jah.org.tw/form/index-1.asp?m=3&m1=8&m2=366&gp=361&id=522


### Download the training set

In [None]:
# Download from Google Drive
!gdown --id 1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF

Downloading...
From: https://drive.google.com/uc?id=1Y2gYY8XUWgcIA_GbytBuXoRkLlAWxnAF
To: /content/project1_indian_liver_patient.zip
  0% 0.00/8.37k [00:00<?, ?B/s]100% 8.37k/8.37k [00:00<00:00, 14.3MB/s]


In [None]:
!unzip project1_indian_liver_patient.zip
# If you see the prompt "replace project1_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:" (asking whether to overwrite the file),
# enter "A" to overwrite all files.

Archive:  project1_indian_liver_patient.zip
replace project1_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: project1_test.csv       
  inflating: project1_train.csv      


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('project1_train.csv')
df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Label'],
      dtype='object')

In [None]:
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,1
1,78,Male,1.0,0.3,152,28,70,6.3,3.1,0.9,1
2,60,Male,2.0,0.8,190,45,40,6.0,2.8,0.8,1
3,75,Male,10.6,5.0,562,37,29,5.1,1.8,0.5,1
4,19,Female,0.7,0.2,186,166,397,5.5,3.0,1.2,1


Gender must be converted to 0/1.

Remember to scale the features and fill in the missing values.
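
The three preparation steps noted above (encode Gender, fill missing values, scale) can be bundled into a single scikit-learn transformer so that the same fitted steps are reused on any later data. The following is a minimal sketch on made-up rows, not the notebook's own method; `toy`, `numeric_cols`, and `preprocess` are illustrative names, and the Male=0 / Female=1 mapping is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy rows that mimic three of the training columns (values are made up).
toy = pd.DataFrame({
    "Age": [40.0, 78.0, 60.0, 19.0],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Albumin_and_Globulin_Ratio": [0.8, np.nan, 0.8, 1.2],
})

numeric_cols = ["Age", "Albumin_and_Globulin_Ratio"]

preprocess = ColumnTransformer([
    # Fill missing numeric values with the column mean, then scale to [0, 1].
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", MinMaxScaler()),
    ]), numeric_cols),
    # Map Gender to integers; the category order is explicit: Male=0, Female=1.
    ("gender", OrdinalEncoder(categories=[["Male", "Female"]]), ["Gender"]),
])

# Output columns: scaled Age, scaled A/G ratio, encoded Gender.
features = preprocess.fit_transform(toy)
print(features)
```

Calling `preprocess.transform(...)` on new rows would then apply the means and ranges learned here rather than recomputing them.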

### The stage is yours

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score, roc_auc_score,
)

In [None]:
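# Encode Gender as an integer: Female -> 1, Male -> 0.
# (Assigning the two-column result of pd.get_dummies to one column fails on recent pandas.)
df['Gender'] = (df['Gender'] == 'Female').astype(int)

# Mean imputation for the missing A/G ratio values
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy="mean")
df["Albumin_and_Globulin_Ratio"] = imp.fit_transform(df[["Albumin_and_Globulin_Ratio"]]).ravel()

# Count of missing values left in the column after imputation
df["Albumin_and_Globulin_Ratio"].isnull().sum()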

df


Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,40,1,0.9,0.3,293,232,245,6.8,3.1,0.80,1
1,78,0,1.0,0.3,152,28,70,6.3,3.1,0.90,1
2,60,0,2.0,0.8,190,45,40,6.0,2.8,0.80,1
3,75,0,10.6,5.0,562,37,29,5.1,1.8,0.50,1
4,19,1,0.7,0.2,186,166,397,5.5,3.0,1.20,1
...,...,...,...,...,...,...,...,...,...,...,...
462,32,0,0.7,0.2,276,102,190,6.0,2.9,0.93,1
463,58,0,0.8,0.2,180,32,25,8.2,4.4,1.10,0
464,34,0,5.9,2.5,290,45,233,5.6,2.7,0.90,1
465,36,0,0.8,0.2,182,31,34,6.4,3.8,1.40,0


In [None]:
# Feature-scale Age to the range [0, 1]
scaler = MinMaxScaler().fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])


In [None]:
# Scale the three enzyme columns to [0, 1], one scaler per column.
# Note: `scaler` is overwritten on every pass, so afterwards it only holds
# the range of the last column (Aspartate_Aminotransferase).
for col in ['Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase']:
    scaler = MinMaxScaler().fit(df[[col]])
    df[col] = scaler.transform(df[[col]])

df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Label
0,0.486486,1,0.9,0.3,0.107125,0.111558,0.047774,6.8,3.1,0.8,1
1,1.0,0,1.0,0.3,0.037838,0.009045,0.012198,6.3,3.1,0.9,1
2,0.756757,0,2.0,0.8,0.056511,0.017588,0.006099,6.0,2.8,0.8,1
3,0.959459,0,10.6,5.0,0.239312,0.013568,0.003863,5.1,1.8,0.5,1
4,0.202703,1,0.7,0.2,0.054545,0.078392,0.078675,5.5,3.0,1.2,1


In [None]:
from sklearn.model_selection import train_test_split
x = df.drop(['Label'], axis=1)
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=df['Label'])
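
As a side note on the split above: `stratify` keeps the class ratio the same in both parts, and passing `random_state` (not used in the notebook's call) makes the split repeatable between runs. A small sketch on synthetic labels, with all names and the 30/70 ratio made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 of class 0 and 70 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 30 + [1] * 70)

# Stratified 90/10 split with a fixed seed so reruns give the same rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

# Class counts in the held-out part follow the 30/70 ratio of the full data.
print(np.bincount(y_te))
```

With only about 47 held-out rows in this project, the scores reported below can shift noticeably from one random split to another, which is one reason to fix the seed.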

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
ypred = random_forest.predict(X_test)

# classification_report expects the true labels first, then the predictions
print(metrics.classification_report(y_test, ypred))

              precision    recall  f1-score   support

           0       0.50      0.23      0.32        13
           1       0.76      0.91      0.83        34

    accuracy                           0.72        47
   macro avg       0.63      0.57      0.57        47
weighted avg       0.69      0.72      0.69        47
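
A note on argument order: scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses made-up labels (`y_true` and `y_hat` are illustrative names) to show that swapping the two arguments exchanges precision and recall for a class.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: most true class-0 samples are predicted as class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Correct order: ground truth first, predictions second.
precision_ok = precision_score(y_true, y_hat, pos_label=0)
recall_ok = recall_score(y_true, y_hat, pos_label=0)

# Swapped order: the two numbers trade places.
precision_swapped = precision_score(y_hat, y_true, pos_label=0)
recall_swapped = recall_score(y_hat, y_true, pos_label=0)

print(precision_ok, recall_ok)
print(precision_swapped, recall_swapped)
```

Accuracy and F1 are symmetric in the two arguments, so a swap is easy to miss unless the per-class precision, recall, and support columns are checked.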



In [None]:
# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,random_state=0)

forest.fit(x, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
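
The cell above computes the importances, their spread across trees, and a sorted index, but does not display them. One way to print a ranking is sketched below on synthetic data from `make_classification`; the `demo_*` names and the generic feature names are placeholders, not the liver dataset columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the real feature matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
demo_importances = demo_forest.feature_importances_
# Spread of each feature's importance across the individual trees.
demo_std = np.std([t.feature_importances_ for t in demo_forest.estimators_], axis=0)
# Feature indices ordered from most to least important.
demo_order = np.argsort(demo_importances)[::-1]

for rank, idx in enumerate(demo_order, start=1):
    print(f"{rank}. {feature_names[idx]}: "
          f"{demo_importances[idx]:.3f} (+/- {demo_std[idx]:.3f})")
```

In the notebook itself, the same loop could run over `indices`, `importances`, `std`, and `x.columns`.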

In [None]:
from sklearn.svm import LinearSVC

# Build and fit the model
lsvm = LinearSVC()
lsvm.fit(X_train, y_train)

# Predict on the held-out split
y_pred = lsvm.predict(X_test)

# Classification accuracy
print("Accuracy:", round(lsvm.score(X_test, y_test), 4))

Accuracy: 0.7234


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search over C for the linear SVM, scored by F1
C_grid = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 4, 5, 10, 20, 30, 40]
param_grid = {'C': C_grid}
grid = GridSearchCV(LinearSVC(), param_grid, cv=10, scoring='f1')
grid.fit(X_train, y_train)

print('Best params', grid.best_params_)

# Grid search over C and gamma for the RBF-kernel SVC (default accuracy scoring)
param_grid = {'C': [0.0001, 0.01, 0.1, 0.3, 0.5, 1, 10, 100], 'gamma': [100, 10, 1, 0.1, 0.01, 0.001]}
grid_search = GridSearchCV(SVC(), param_grid, cv=10)
grid_search.fit(X_train, y_train)

# This shows the LinearSVC result again; use grid_search.best_params_ for the SVC search
grid.best_params_

Best params {'C': 2}


{'C': 2}

In [None]:
from sklearn.model_selection import GridSearchCV
C_grid = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 4, 5, 10, 20, 30, 40]
param_grid = {'C':C_grid}
grid_search = GridSearchCV(LinearSVC(), param_grid, cv=10, scoring='f1')
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=LinearSVC(),
             param_grid={'C': [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 4,
                               5, 10, 20, 30, 40]},
             scoring='f1')
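
The searches above tune the SVM on features that were scaled before cross-validation. A common variation, sketched here on synthetic data, wraps the scaler and the classifier in a `Pipeline` so the scaling is refitted inside each fold. The step names `scale` and `svc`, the reduced grid, and the `demo_*` variables are choices made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the liver features.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# The scaler is part of the estimator, so each CV fold fits it on its own training part.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
demo_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
demo_search = GridSearchCV(pipe, demo_grid, cv=5, scoring="f1")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

After fitting, `demo_search.predict(...)` applies the scaling and the best SVC together, so raw (unscaled) rows can be passed in directly.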

In [None]:
from sklearn import svm, metrics

# Build the SVC model
svc = svm.SVC()
svc_fit = svc.fit(X_train, y_train)

# Predict on the held-out split
test_y_predicted = svc.predict(X_test)

# Evaluate accuracy
accuracy = metrics.accuracy_score(y_test, test_y_predicted)
print(accuracy)

0.723404255319149


### Make prediction and submission file

In [None]:
x_test = pd.read_csv('project1_test.csv')
# Same encoding as the training set: Female -> 1, Male -> 0
x_test['Gender'] = (x_test['Gender'] == 'Female').astype(int)

# `scaler` was refitted for each training column, so by now it only holds the
# Aspartate_Aminotransferase range. Refit one scaler per column on the raw
# training values and apply it to the matching test column.
train_raw = pd.read_csv('project1_train.csv')
for col in ['Age', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase']:
    col_scaler = MinMaxScaler().fit(train_raw[[col]])
    x_test[col] = col_scaler.transform(x_test[[col]])

x_test.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
0,0.189189,0,0.8,0.2,0.10172,0.031156,0.026428,5.5,2.5,0.8
1,0.405405,0,4.1,2.0,0.10516,0.434673,0.146575,5.0,2.7,1.1
2,0.608108,0,2.0,0.6,0.065848,0.019095,0.004472,5.7,3.0,1.1
3,0.824324,0,7.9,4.3,0.10172,0.020101,0.012604,6.0,3.0,1.0
4,0.486486,1,0.9,0.3,0.107125,0.111558,0.047774,6.8,3.1,0.8


In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Category'] = forest.predict(x_test)

df_submit.to_csv('submission_forest1.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Category'] = lsvm.predict(x_test)

df_submit.to_csv('submission_lsvm1.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Category'] = grid.predict(x_test)

df_submit.to_csv('submission_grid1.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Category'] = grid_search.predict(x_test)

df_submit.to_csv('submission_grid_search1.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Category'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Category'] = random_forest.predict(x_test)

df_submit.to_csv('submission_random_forest1.csv', index=None)
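
The five submission cells above differ only in the model and the file name, so they could be folded into one small function. The sketch below is a hypothetical helper (`write_submission` is not part of the notebook) demonstrated with a stand-in `DummyClassifier` and a temporary file.

```python
import os
import tempfile

import pandas as pd
from sklearn.dummy import DummyClassifier


def write_submission(model, features, path):
    """Predict on `features` and save an Id/Category CSV (hypothetical helper)."""
    submission = pd.DataFrame({
        # Zero-padded row ids: 000, 001, 002, ...
        "Id": [f"{i:03d}" for i in range(len(features))],
        "Category": model.predict(features),
    })
    submission.to_csv(path, index=False)
    return submission


# Stand-in model and features, used only to exercise the helper.
demo_features = pd.DataFrame({"Age": [0.1, 0.5, 0.9]})
demo_model = DummyClassifier(strategy="most_frequent").fit(demo_features, [1, 1, 0])
demo_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
demo_submission = write_submission(demo_model, demo_features, demo_path)
print(demo_submission)
```

In the notebook, a call such as `write_submission(random_forest, x_test, 'submission_random_forest1.csv')` would replace one of the cells above.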