# Data Description
This data set contains 416 liver patient records and 167 non-liver patient records collected from the north-east of Andhra Pradesh, India. The "Dataset" column is a class label that divides the records into liver patients (liver disease) and non-patients (no disease). Of the 583 records, 441 are from male patients and 142 from female patients.

Any patient whose age exceeded 89 is listed as being of age "90".

Columns:

* Age of the patient
* Gender of the patient
* Total Bilirubin
* Direct Bilirubin
* Alkaline Phosphatase
* Alanine Aminotransferase
* Aspartate Aminotransferase
* Total Proteins
* Albumin
* Albumin and Globulin Ratio
* Dataset: field used to split the data into two sets (liver disease = 1, no liver disease = 2)

Dataset link on Kaggle: https://www.kaggle.com/uciml/indian-liver-patient-records/discussion

Original notebook workflow: https://www.kaggle.com/sanjames/liver-patients-analysis-prediction-accuracy
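
The label uses 1 for liver disease and 2 for no disease. A common convention is to remap this to 1 = disease and 0 = no disease before modelling. A minimal sketch on a toy frame; `toy_df` and the `has_disease` column name are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the real CSV; only the label column matters here.
toy_df = pd.DataFrame({"Dataset": [1, 2, 1, 1, 2]})

# 1 = liver disease, 2 = no disease (per the column description).
# Remap so that 1 = disease and 0 = no disease.
toy_df["has_disease"] = toy_df["Dataset"].map({1: 1, 2: 0})

print(toy_df["has_disease"].tolist())
```

The rest of the notebook keeps the original 1/2 coding, so this step is optional.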

# Support Vector Machine

In [1]:
import pandas as pd
from sklearn.metrics import classification_report , confusion_matrix  
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings(action= 'ignore')

%matplotlib inline




## Import the data file

In [2]:
liver_df = pd.read_csv('Data/indian_liver_patient.csv')
liver_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
Age                           583 non-null int64
Gender                        583 non-null object
Total_Bilirubin               583 non-null float64
Direct_Bilirubin              583 non-null float64
Alkaline_Phosphotase          583 non-null int64
Alamine_Aminotransferase      583 non-null int64
Aspartate_Aminotransferase    583 non-null int64
Total_Protiens                583 non-null float64
Albumin                       583 non-null float64
Albumin_and_Globulin_Ratio    579 non-null float64
Dataset                       583 non-null int64
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [3]:
liver_df.isna().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

* Fill missing values

In [4]:
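# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may act on a copy in recent pandas versions.
ratio_mean = liver_df['Albumin_and_Globulin_Ratio'].mean()
liver_df['Albumin_and_Globulin_Ratio'] = liver_df['Albumin_and_Globulin_Ratio'].fillna(ratio_mean)
liver_df.isna().sum()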

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Dataset                       0
dtype: int64
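
The cell above fills the four missing ratios with the mean of the whole column, which includes rows that later end up in the test set. If the fill value should not see test rows, one option is to compute it from the training portion only. A minimal sketch on made-up values; the split into `train` and `test` here is hypothetical and only illustrates the idea:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the numbers are made up for illustration.
ratio = pd.Series([0.9, 0.74, np.nan, 1.0, 0.4, 1.2], name="Albumin_and_Globulin_Ratio")

# Hypothetical split: the first four rows are "train", the last two are "test".
train, test = ratio.iloc[:4], ratio.iloc[4:]

# Compute the fill value from the training rows only (NaN is skipped by default),
# then apply the same value to both parts.
train_mean = train.mean()
train_filled = train.fillna(train_mean)
test_filled = test.fillna(train_mean)

print(round(train_mean, 2), int(train_filled.isna().sum()))
```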

* Convert categorical features to numeric

In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categ_features = ['Gender']
one_hot = OneHotEncoder()

# One-hot encode Gender and pass the remaining columns through unchanged.
transformer = ColumnTransformer([('one_hot', one_hot, categ_features)],
                                remainder='passthrough')
liver_df_transformed = pd.DataFrame(transformer.fit_transform(liver_df))

# The encoded Gender columns come first: '0' is Female and '1' is Male
# (categories are sorted alphabetically). The passthrough columns follow in
# their original order, so we restore their names here.
liver_df_transformed.columns = ['0', '1', 'Age', 'Total_Bilirubin', 'Direct_Bilirubin',
           'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
           'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
           'Albumin_and_Globulin_Ratio', 'Dataset']
liver_df_transformed.head()

|   | 0 | 1 | Age | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 65.0 | 0.7 | 0.1 | 187.0 | 16.0 | 18.0 | 6.8 | 3.3 | 0.9 | 1.0 |
| 1 | 0.0 | 1.0 | 62.0 | 10.9 | 5.5 | 699.0 | 64.0 | 100.0 | 7.5 | 3.2 | 0.74 | 1.0 |
| 2 | 0.0 | 1.0 | 62.0 | 7.3 | 4.1 | 490.0 | 60.0 | 68.0 | 7.0 | 3.3 | 0.89 | 1.0 |
| 3 | 0.0 | 1.0 | 58.0 | 1.0 | 0.4 | 182.0 | 14.0 | 20.0 | 6.8 | 3.4 | 1.0 | 1.0 |
| 4 | 0.0 | 1.0 | 72.0 | 3.9 | 2.0 | 195.0 | 27.0 | 59.0 | 7.3 | 2.4 | 0.4 | 1.0 |
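
The two encoded Gender columns always sum to 1, so one of them is redundant. As an alternative sketch (not the notebook's approach), `pd.get_dummies` with `drop_first=True` keeps a single, named indicator column. The toy frame below is an illustrative assumption:

```python
import pandas as pd

# Toy frame standing in for the first rows of the data set.
toy_df = pd.DataFrame({"Gender": ["Female", "Male", "Male"], "Age": [65, 62, 58]})

# drop_first=True keeps one indicator (Gender_Male); the dropped level
# (Female) is implied whenever Gender_Male is 0.
encoded = pd.get_dummies(toy_df, columns=["Gender"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

A named column such as `Gender_Male` also avoids having to remember what `0` and `1` stand for.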


In [6]:
x = liver_df_transformed.drop('Dataset' , axis = 1)
y = liver_df_transformed['Dataset']

x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = .2 , random_state = 100)
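
The split above is not stratified, so the class mix in the test set can drift from the overall mix (the reports further down show 87 of 117 test rows in class 1, versus 416 of 583 overall). A minimal sketch of a stratified split, using toy labels with the same class counts as the data description; `x_toy` is a placeholder feature, not real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 416 / 167 class counts as the data set description.
y_toy = np.array([1] * 416 + [2] * 167)
x_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

# stratify=y_toy asks for (approximately) the same class shares in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

# Share of class 1 in each part; both should sit near 416 / 583.
print(round((y_tr == 1).mean(), 3), round((y_te == 1).mean(), 3))
```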

In [7]:
svc_model = SVC()
svc_model.fit(x_train , y_train )


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [8]:
svc_model.score(x_train , y_train)

0.9935622317596566

In [9]:
svc_model.score(x_test, y_test)

0.7521367521367521
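
The gap between roughly 0.99 on the training set and 0.75 on the test set suggests overfitting. One possible contributor with an RBF kernel is unscaled input: in the table above, `Alkaline_Phosphotase` is in the hundreds while `Albumin_and_Globulin_Ratio` is close to 1. A hedged sketch of scaling inside a pipeline, run on synthetic data rather than the liver data set (all `*_demo` names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook this would be x and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo[:, 0] *= 1000  # mimic one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fit on the training data only, as part of the pipeline.
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scaled_svc.fit(X_tr, y_tr)

train_acc = scaled_svc.score(X_tr, y_tr)
test_acc = scaled_svc.score(X_te, y_te)
print(train_acc, test_acc)
```

Whether this narrows the gap on the liver data would need to be checked by rerunning the cells above with the pipeline in place of `SVC()`.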

In [10]:
y_predict_svc = svc_model.predict(x_test)
print(classification_report(y_test, y_predict_svc))

              precision    recall  f1-score   support

         1.0       0.75      1.00      0.86        87
         2.0       1.00      0.03      0.06        30

    accuracy                           0.75       117
   macro avg       0.88      0.52      0.46       117
weighted avg       0.81      0.75      0.65       117
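
The report above is easier to read as counts. With supports of 87 and 30, a class 1 recall of 1.00 and a class 2 recall of 0.03, the model appears to flag only 1 of the 30 non-disease patients and to assign everything else to class 1. The counts below are reconstructed from the rounded report, not printed by the notebook:

```python
# Reconstructed confusion counts (rows: true class, columns: predicted class).
true1_pred1, true1_pred2 = 87, 0
true2_pred1, true2_pred2 = 29, 1

precision_1 = true1_pred1 / (true1_pred1 + true2_pred1)  # 87 / 116
recall_2 = true2_pred2 / (true2_pred1 + true2_pred2)     # 1 / 30
accuracy = (true1_pred1 + true2_pred2) / 117             # 88 / 117

print(round(precision_1, 2), round(recall_2, 2), round(accuracy, 4))
```

The 0.75 accuracy therefore comes almost entirely from the majority class, which makes up 87 of the 117 test rows.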



# Tuning hyperparameters manually

In [11]:
# Note: gamma is ignored by the linear kernel, so only C has an effect here.
svc_model_imporved = SVC(kernel = 'linear' , C = 4 , gamma = 1000 ).fit(x_train , y_train)


In [12]:
svc_model_imporved.score(x_train , y_train)

0.7060085836909872

In [13]:
svc_model_imporved.score(x_test , y_test)

0.7435897435897436

In [14]:
y_predict_imporved = svc_model_imporved.predict(x_test)
print(classification_report(y_test, y_predict_imporved))

              precision    recall  f1-score   support

         1.0       0.74      1.00      0.85        87
         2.0       0.00      0.00      0.00        30

    accuracy                           0.74       117
   macro avg       0.37      0.50      0.43       117
weighted avg       0.55      0.74      0.63       117
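
With a recall of 0.00 for class 2, the manually tuned model assigns every test row to class 1, so its 0.74 accuracy equals the share of class 1 in the test set (87 of 117). A more systematic option is a grid search scored with a metric that accounts for both classes. The sketch below runs on synthetic data, and the grid values are illustrative assumptions, not recommendations for this data set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a roughly 70/30 class imbalance.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.7, 0.3], random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Illustrative grid; gamma is only searched for the rbf kernel.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
]

# balanced_accuracy averages recall over both classes, unlike plain accuracy.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Under `balanced_accuracy`, a model that always predicts the majority class scores 0.5, so it no longer looks competitive.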



## Cross validation

In [15]:
from sklearn.model_selection import cross_val_score

"""
 cross_val_score parameters
   estimator ===> svc_model
   data ===> x
   labels ===> y
   number of cross-validation folds ===> cv (10 here)
"""

cross_score = cross_val_score(svc_model , x , y , cv = 10)
cross_score

array([0.71186441, 0.71186441, 0.71186441, 0.71186441, 0.71186441,
       0.71186441, 0.70689655, 0.71929825, 0.71929825, 0.71929825])

In [16]:
np.mean(cross_score)

0.7135977729244208
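
The fold scores above take only three distinct values, and each one matches the share of class 1 rows in a fold of plausible size. That is what a model predicting class 1 for every row would score, although the notebook does not print per-fold predictions to confirm it. The (class 1 count, fold size) pairs below are inferred from the scores, not taken from the output:

```python
# Fold scores printed above, rounded to 8 decimals.
fold_scores = [0.71186441] * 6 + [0.70689655] + [0.71929825] * 3

# Inferred (class-1 count, fold size) pairs, one per fold.
candidates = [(42, 59)] * 6 + [(41, 58)] + [(41, 57)] * 3

# Each printed score should equal the class-1 share of its fold.
max_gap = max(abs(s - c / n) for s, (c, n) in zip(fold_scores, candidates))

# The inferred counts should add up to the full data set (416 of 583 rows).
total_class1 = sum(c for c, _ in candidates)
total_rows = sum(n for _, n in candidates)

print(max_gap < 1e-8, total_class1, total_rows)
```

Read this way, the mean score of about 0.714 is the majority-class rate rather than evidence that the model separates the two groups.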

In [17]:
import pickle

# Use a context manager so the file is closed after writing.
with open('indian_liver_svc_model.pkl', 'wb') as f:
    pickle.dump(svc_model, f)
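
To reuse a saved model, it can be loaded back with `pickle.load`. A minimal round-trip sketch with a tiny stand-in model and a temporary file; `demo_model` and the file name are illustrative assumptions:

```python
import os
import pickle
import tempfile

from sklearn.svm import SVC

# Tiny stand-in model; in the notebook this would be svc_model.
demo_model = SVC().fit([[0.0], [1.0], [2.0], [3.0]], [1, 1, 2, 2])

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_svc_model.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = (restored.predict([[0.5], [2.5]]) == demo_model.predict([[0.5], [2.5]])).all()
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.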