## LogisticRegression and others from SAS® Viya® on Heart Disease
### Source
This example is adapted from [Heart Disease UCI](https://www.kaggle.com/code/harshalgadhe/heart-disease-uci) by Harshal Gadhe.

### Data Preparation
#### About the data set
The original data contains 76 different attributes of patients from four different hospital databases. The goal is to determine whether these attributes can be used to predict a diagnosis of heart disease. The version used here has been subset to 14 attributes from the Cleveland database only. The variables included and their interpretations are:
- age
- trestbps: resting blood pressure
- chol: serum cholesterol
- thalch: maximum heart rate achieved
- ca: number of major vessels (0-3) colored by fluoroscopy
- sex
- cp: chest pain type
- exang: exercise-induced angina
- slope: slope of the peak exercise ST segment
- thal: thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- restecg: resting electrocardiographic results
- fbs: fasting blood sugar
- target: diagnosis of heart disease
- oldpeak: ST depression induced by exercise relative to rest


In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn

import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None

#### Importing the data set

In [None]:
workspace=f'{os.path.abspath("")}/../data/'
heart_df = pd.read_csv(workspace + "heart_disease.csv")
heart_df.head()

In [None]:
print(workspace)

### Data Preprocessing
We will start by getting some general characteristics about the data set.

In [None]:
heart_df.info()

#### Replacing NaN values with mean
In examining the data, we see there are some missing values.  We will replace those with the mean for the column.

In [None]:
cols_sum_null = heart_df.isnull().sum()
print(cols_sum_null)

In [None]:
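# Select only the columns that contain at least one missing value
hasnull_cols = cols_sum_null[cols_sum_null != 0]
for col in hasnull_cols.index:
    mean = heart_df[col].mean()
    # Assign the result back; an in-place fillna on a column selection may not update heart_df
    heart_df[col] = heart_df[col].fillna(mean)
heart_df.isnull().sum()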
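
As a minimal sketch of what mean imputation does, the following uses a small made-up data frame (the `toy` frame and its values are hypothetical, not taken from the heart disease data). Passing a Series of column means to `fillna` fills each column with its own mean, which is one way to express the loop above without iterating.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0],
    "ca": [0.0, 2.0, np.nan],
})

# Column means are computed over the non-missing entries only.
col_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the matching mean.
imputed = toy.fillna(col_means)
print(imputed)
```

Note that mean imputation leaves each column mean unchanged but shrinks its variance, which may matter when many values are missing.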

#### Correlations

In [None]:
heart_df.corr()
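
When the full correlation matrix is hard to scan, it can help to look only at each feature's correlation with the target. The sketch below assumes a small made-up frame (the `toy` values are hypothetical) and ranks the features by the absolute value of their Pearson correlation with `target`.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "oldpeak": [0.0, 1.0, 2.0, 3.0],
    "thalch": [180, 160, 150, 120],
    "target": [0, 0, 1, 1],
})

# Take the target column of the correlation matrix, drop the self-correlation,
# and sort by magnitude so strong negative correlations are not buried.
target_corr = (
    toy.corr()["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```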

### Visualizing the data
In order to get a better sense of the data, we will look at a variety of plots:
- a pairplot of continuous features
- a heatmap of correlations
- scatterplots of target versus each factor
- a histogram of target's values

In [None]:
cat_threshold = 8
quantitative = [c for c in heart_df.columns if len(heart_df[c].unique()) > cat_threshold]

sn.pairplot(heart_df[quantitative])
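
The cell above treats a column as continuous when it has more than `cat_threshold` distinct values. The following sketch shows the same heuristic on a made-up frame (the `toy` values are hypothetical), using `nunique` instead of `len(... .unique())`.

```python
import pandas as pd

# Hypothetical toy frame: "age" has many distinct values, "sex" has two.
toy = pd.DataFrame({
    "age": [29, 41, 54, 63, 67, 70, 35, 48, 59, 77],
    "sex": [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})

cat_threshold = 8

# Count distinct values per column and keep those above the threshold.
n_unique = toy.nunique()
quantitative = n_unique[n_unique > cat_threshold].index.tolist()
print(quantitative)
```

The threshold is a judgment call: a small sample of a truly continuous variable could fall below it, so the resulting list is worth checking by eye.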

In [None]:
plt.figure(figsize=(12,10))
sn.heatmap(heart_df.corr(),annot=True,cmap=plt.cm.plasma)

In [None]:
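plt.figure(figsize=(15,15))
# Plot every feature against target; selecting by name avoids assuming target is the last column
features = [c for c in heart_df.columns if c != 'target']
for i, col in enumerate(features):
    plt.subplot(5, 3, i+1)
    sn.scatterplot(data=heart_df, x='target', y=col, hue='target')
    plt.xticks([0, 1])
plt.tight_layout(pad=4.0)
plt.show()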

In [None]:
sn.countplot(x='target', data=heart_df)
plt.grid()

### Building and Training the Model
For details about using the classes in `sasviya.ml` see the [Python API documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=titlepage.htm).

#### Data preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler

sc=StandardScaler()
X=heart_df.drop('target',axis=1)
Y=heart_df['target']
# Store the scaled features back in X (as a DataFrame) so the models are trained on them
X=pd.DataFrame(sc.fit_transform(X),columns=X.columns,index=X.index)
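
`StandardScaler` rescales each column to zero mean and unit variance, that is, z = (x - mean) / std with the population standard deviation. As a small check under that assumption, the sketch below compares a manual z-score against the scaler's output on a made-up array (`X_toy` is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with two columns on different scales.
X_toy = np.array([[120.0, 200.0],
                  [130.0, 240.0],
                  [140.0, 280.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
```

Scaling matters most for distance-based and margin-based models such as k-nearest neighbors and SVC; tree-based models are largely insensitive to it.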

#### Creating training and test data
We split the original data by putting 75% into the training set and 25% into the test set. 

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=3)
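
To illustrate what `test_size=0.25` produces, here is a sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical). It also passes `stratify`, which the cell above does not use; that option is shown only as an assumption about what might be useful when the classes are unbalanced, since it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# 25% of the rows go to the test set; stratify preserves the 60/40 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```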

#### Training the models for six different algorithms
We will train six different models on the training set: LogisticRegression, DecisionTreeClassifier, SVC, ForestClassifier, KNeighborsClassifier, and GaussianNB. The latter two are sklearn classifiers.

In [None]:
from sasviya.ml.linear_model import LogisticRegression
from sasviya.ml.tree import DecisionTreeClassifier
from sasviya.ml.svm import SVC
from sasviya.ml.tree import ForestClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [None]:
def model(X_train,y_train):
    models=[]

    lr = LogisticRegression(
        solver='lbfgs',
        tol=1e-4,
        max_iter=1000)
    lr.fit(X_train,y_train)
    models.append(lr)

    tree=DecisionTreeClassifier()
    tree.fit(X_train,y_train)
    models.append(tree)

    svm=SVC(kernel='rbf', coef0=0.1, C=1.0)

    svm.fit(X_train,y_train)
    models.append(svm)

    knn=KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train,y_train)
    models.append(knn)

    rfc=ForestClassifier()
    rfc.fit(X_train,y_train)
    models.append(rfc)

    nb=GaussianNB()
    nb.fit(X_train,y_train)
    models.append(nb)

    return models

In [None]:
models=model(X_train,y_train)

#### Gathering the accuracy scores
As we just ran six models, we should examine how well they did relative to each other.  For each, we will gather the accuracy scores for the training and test data and add them into a summary dataframe.

In [None]:
from sklearn.metrics import accuracy_score
train_accuracy=[]
test_accuracy=[]

for m in models:
    train_accuracy.append(round(m.score(X_train, y_train),2))
    test_accuracy.append(round(m.score(X_test, y_test),2))

Accuracy_score=pd.DataFrame({
    'Model': [type(m).__name__ for m in models],
    'Train_Accuracy':train_accuracy,
    'Test_Accuracy':test_accuracy
})
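
Accuracy is the fraction of predictions that match the true labels. For scikit-learn classifiers this is what `score` returns; the loop above assumes the `sasviya.ml` models follow the same convention. A minimal sketch on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions; two of the eight predictions are wrong.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy by hand: the mean of the element-wise matches.
manual = (y_true == y_pred).mean()

# The same value through scikit-learn.
via_sklearn = accuracy_score(y_true, y_pred)

print(manual, via_sklearn)
```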

### Finding the Best Model
After displaying the table of all accuracy scores for the training and test data, we will graph the test accuracy for each algorithm to see how they all fared on the data.

In [None]:
Accuracy_score

In [None]:
plt.figure(figsize=(12,6))
plt.plot(Accuracy_score['Model'],Accuracy_score['Test_Accuracy'],marker='x',color='red')
plt.xlabel('Model')
plt.ylabel('Test Accuracy')
plt.title('Test Accuracy by Model')
plt.grid()
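
Besides reading the plot, the summary table can be ranked directly. The sketch below uses a made-up table with the same column names as `Accuracy_score` (the model names and accuracy values are hypothetical placeholders, not results from this notebook). It sorts by test accuracy and also computes the gap between training and test accuracy, where a large gap may indicate overfitting.

```python
import pandas as pd

# Hypothetical scores; placeholders only, not results from the heart disease data.
scores = pd.DataFrame({
    "Model": ["ModelA", "ModelB", "ModelC"],
    "Train_Accuracy": [0.86, 1.00, 0.91],
    "Test_Accuracy": [0.84, 0.76, 0.88],
})

# A large train/test gap suggests the model fits the training set too closely.
scores["Gap"] = scores["Train_Accuracy"] - scores["Test_Accuracy"]

# Rank by test accuracy, best first.
ranked = scores.sort_values("Test_Accuracy", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "Model"]
largest_gap = scores.loc[scores["Gap"].idxmax(), "Model"]

print(ranked)
print("Best on test data:", best_model)
```

With a test set this small, differences of a few percentage points between models can come from the particular split, so a ranking like this is better treated as indicative than conclusive.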