# Predicting Smoking Status Using Biosignal Data and SVM
ISTA 431 Final Project\
Author: Miguel Candido Aurora Peralta

## Data Source
[https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals](https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals)

The Kaggle data consists of two CSV files, `test_dataset.csv` and `train_dataset.csv`. Only the training file contains the `smoking` column, which records each individual's smoking status, so this project uses `train_dataset.csv` alone.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("train_dataset.csv")

In [None]:
display(df.columns)

Index(['age', 'height(cm)', 'weight(kg)', 'waist(cm)', 'eyesight(left)',
       'eyesight(right)', 'hearing(left)', 'hearing(right)', 'systolic',
       'relaxation', 'fasting blood sugar', 'Cholesterol', 'triglyceride',
       'HDL', 'LDL', 'hemoglobin', 'Urine protein', 'serum creatinine', 'AST',
       'ALT', 'Gtp', 'dental caries', 'smoking'],
      dtype='object')

In [None]:
print(f"Rows: {len(df.index)}")

Rows: 38984


In [None]:
df.describe()

,age,height(cm),weight(kg),waist(cm),eyesight(left),eyesight(right),hearing(left),hearing(right),systolic,relaxation,...,HDL,LDL,hemoglobin,Urine protein,serum creatinine,AST,ALT,Gtp,dental caries,smoking
count,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,...,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0,38984.0
mean,44.127591,164.689488,65.938718,82.062115,1.014955,1.008768,1.025369,1.02619,121.475631,75.994408,...,57.293146,115.081495,14.624264,1.086523,0.88603,26.198235,27.145188,39.905038,0.214421,0.367279
std,12.063564,9.187507,12.896581,9.326798,0.498527,0.493813,0.157246,0.159703,13.643521,9.658734,...,14.617822,42.883163,1.566528,0.402107,0.220621,19.175595,31.309945,49.693843,0.410426,0.48207
min,20.0,130.0,30.0,51.0,0.1,0.1,1.0,1.0,71.0,40.0,...,4.0,1.0,4.9,1.0,0.1,6.0,1.0,2.0,0.0,0.0
25%,40.0,160.0,55.0,76.0,0.8,0.8,1.0,1.0,112.0,70.0,...,47.0,91.0,13.6,1.0,0.8,19.0,15.0,17.0,0.0,0.0
50%,40.0,165.0,65.0,82.0,1.0,1.0,1.0,1.0,120.0,76.0,...,55.0,113.0,14.8,1.0,0.9,23.0,21.0,26.0,0.0,0.0
75%,55.0,170.0,75.0,88.0,1.2,1.2,1.0,1.0,130.0,82.0,...,66.0,136.0,15.8,1.0,1.0,29.0,31.0,44.0,0.0,1.0
max,85.0,190.0,135.0,129.0,9.9,9.9,2.0,2.0,233.0,146.0,...,359.0,1860.0,21.1,6.0,11.6,1090.0,2914.0,999.0,1.0,1.0


## Features

|Feature|Description|
|---|---|
|age|Age, in 5-year increments|
|height(cm)|Height, in 5 cm increments|
|weight(kg)|Weight, in 5 kg increments|
|waist(cm)|Waist circumference|
|eyesight(left) and eyesight(right)|From 0.1 to 2.0 (higher indicates better vision), but blindness is indicated by 9.9|
|hearing(left) and hearing(right)|1 is normal and 2 is abnormal|
|systolic|Blood pressure measurement when heart is contracting|
|relaxation|Blood pressure measurement when heart is relaxed, aka diastolic blood pressure|
|fasting blood sugar|Amount of glucose in blood when fasting (when it should be lowest)|
|Cholesterol|This figure is calculated as HDL+LDL+(0.2*triglyceride)|
|triglyceride|Amount of lipid in the blood.|
|HDL|High-density lipoprotein cholesterol level.|
|LDL|Low-density lipoprotein cholesterol level.|
|hemoglobin|Hemoglobin level in blood.|
|Urine protein|Level of protein in urine.|
|serum creatinine|Serum creatinine level.|
|AST|Level of aspartate aminotransferase enzyme.|
|ALT|Level of alanine aminotransferase enzyme.|
|Gtp|Level of gamma-glutamyl transpeptidase enzyme.|
|dental caries|Presence of dental cavities. 0 means they are absent, 1 means they are present|


## Preprocessing

The 9.9 values for eyesight, which indicate blindness, are recoded as 0 so that the column follows a single scale in which higher values mean better vision. <br><br>
The `Cholesterol` column is dropped because it is calculated from the existing HDL, LDL, and triglyceride values.

In [None]:
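# Recode the blindness sentinel 9.9 as 0; assign the result back instead of
# relying on in-place replacement of a column selection
df['eyesight(left)'] = df['eyesight(left)'].replace(9.9, 0)
df['eyesight(right)'] = df['eyesight(right)'].replace(9.9, 0)

# Cholesterol is derived from HDL, LDL, and triglyceride, so it is redundant
df = df.drop(columns=['Cholesterol'])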
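
Before dropping a derived column, it can help to confirm that it really is redundant. The following is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration and are not taken from the Kaggle data); it recodes the 9.9 sentinel and measures how far `Cholesterol` is from the formula in the feature table. On the real `df`, the same residual could be inspected before the drop.

```python
import pandas as pd

# Hypothetical toy rows that reuse the dataset's column names; not real data.
toy = pd.DataFrame({
    "eyesight(left)": [1.0, 9.9, 0.5],
    "HDL": [50.0, 60.0, 40.0],
    "LDL": [100.0, 120.0, 130.0],
    "triglyceride": [150.0, 100.0, 200.0],
})
# Build Cholesterol from the formula given in the feature table.
toy["Cholesterol"] = toy["HDL"] + toy["LDL"] + 0.2 * toy["triglyceride"]

# Recode the blindness sentinel 9.9 as 0.
toy["eyesight(left)"] = toy["eyesight(left)"].replace(9.9, 0.0)

# Residual between the stored column and the formula; values near zero
# suggest the column carries no extra information.
derived = toy["HDL"] + toy["LDL"] + 0.2 * toy["triglyceride"]
max_abs_residual = (toy["Cholesterol"] - derived).abs().max()
print("Largest absolute residual:", max_abs_residual)
```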


## Train and Test Splits

In [None]:
from sklearn import model_selection, preprocessing

X = df.drop(["smoking"], axis=1)
y = df["smoking"]

# Standardizing features (the scaler is fitted on all rows, before the split)
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
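
An alternative arrangement is to split first and let a pipeline fit the scaler on the training rows only, so that no statistics from the test rows reach the model. The sketch below illustrates that idea on synthetic data from `make_classification` (an assumption standing in for the biosignal features); the `stratify` and `random_state` arguments are also additions for illustration and were not used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and labels (not the Kaggle data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.63, 0.37], random_state=0
)

# Split first; stratify keeps the class balance similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only, then the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, passing `model` to `cross_val_score` would also refit the scaler within each fold.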

## Initial model training
The SVM model will be trained initially with the `linear` kernel. The kernel is a hyperparameter that can be fine-tuned after evaluating the performance to see if better results can be achieved with a different option.

In [24]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

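Five-fold cross-validation is then used to evaluate the model. `cross_val_score` fits a fresh copy of the classifier on four folds of the training set and scores it on the remaining fold, rotating through all five folds. The mean of the five validation scores (accuracy, the default for classifiers) is the score for the model.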

In [25]:
eval = cross_val_score(clf, X_train, y_train, cv=5)

In [27]:
eval.mean()

0.7311055562418178
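
To make the fold mechanics concrete, here is a small sketch on synthetic data (an assumption; it does not use the project's dataset). It passes an explicit `StratifiedKFold` splitter, whereas the cell above uses `cv=5`, and it shows that `cross_val_score` works on copies of the estimator, leaving the object passed in unfitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training data (not the Kaggle data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Explicit 5-fold splitter; shuffle and random_state are illustrative choices.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

unfitted = SVC(kernel="linear")
fold_scores = cross_val_score(unfitted, X_demo, y_demo, cv=cv)
mean_score = fold_scores.mean()

print("Per-fold accuracy:", fold_scores.round(3))
print(f"Mean accuracy: {mean_score:.3f}")
# The estimator passed in was cloned for each fold, so it has no fitted attributes.
print("Original estimator fitted:", hasattr(unfitted, "support_"))
```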

## Fine-tuning kernel
A different kernel may perform better under k-fold cross-validation. The model is also trained with the `rbf` and `sigmoid` kernels, and each is evaluated the same way for comparison.

In [29]:
clf_rbf = SVC(kernel="rbf")
clf_rbf.fit(X_train, y_train)

In [30]:
clf_sigmoid = SVC(kernel="sigmoid")
clf_sigmoid.fit(X_train, y_train)

In [31]:
eval_rbf = cross_val_score(clf_rbf, X_train, y_train, cv=5)

In [32]:
eval_sigmoid = cross_val_score(clf_sigmoid, X_train, y_train, cv=5)

In [34]:
print("Mean Cross-Validation Scores by Kernel:")
print(f"Linear: {eval.mean()}")
print(f"RBF: {eval_rbf.mean()}")
print(f"Sigmoid: {eval_sigmoid.mean()}")

Mean Cross-Validation Scores by Kernel:
Linear: 0.7311055562418178
RBF: 0.7528776983409878
Sigmoid: 0.6608523388153611


The RBF kernel achieved the best result, with a mean accuracy of ~75.29% across the 5 folds.
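
The kernel comparison can also be written as a single loop. This is a sketch under stated assumptions: it runs on synthetic data rather than the project's dataset, and it wraps each classifier in a scaling pipeline, which is a design choice for the example and not what the cells above do.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (not the Kaggle data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

kernels = ["linear", "rbf", "sigmoid"]
cv_means = {}
for kernel in kernels:
    # One scaled SVM per kernel, scored with the same 5-fold procedure.
    candidate = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_means[kernel] = cross_val_score(candidate, X_demo, y_demo, cv=5).mean()

# Pick the kernel with the highest mean validation accuracy.
best_kernel = max(cv_means, key=cv_means.get)
for kernel, score in cv_means.items():
    print(f"{kernel}: {score:.3f}")
print("Best kernel:", best_kernel)
```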

## Evaluating performance on test set
With the RBF kernel selected from the k-fold cross-validation results, the `clf_rbf` model fitted on the full training set is evaluated on the held-out test set.

In [35]:
y_pred = clf_rbf.predict(X_test)

Metrics are computed for the predictions made on the test set, with `smoking` = 1 as the positive class.
* **Accuracy**: The ratio of correctly predicted instances to the total instances in the dataset.
* **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
* **Recall**: The ratio of correctly predicted positive observations to the total actual positives.

In [36]:
from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

Accuracy: 0.7524688982942157
Precision: 0.6763545030498744
Recall: 0.6470992104359766
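
As a worked example of the three definitions, the sketch below computes them by hand from a confusion matrix. The eight labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for eight individuals (1 = positive class).
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (3 + 2) / 8 = 0.625
precision = tp / (tp + fp)                  # 3 / 5 = 0.6
recall = tp / (tp + fn)                     # 3 / 4 = 0.75

print("Accuracy:", accuracy, accuracy_score(y_true_demo, y_pred_demo))
print("Precision:", precision, precision_score(y_true_demo, y_pred_demo))
print("Recall:", recall, recall_score(y_true_demo, y_pred_demo))
```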


## Conclusion
The model predicted the correct smoking status for ~75.25% of the test set, close to the ~75.29% mean cross-validation accuracy. For context, about 36.7% of the rows in the full dataset are labeled as smokers, so always predicting the majority class would already be correct roughly 63% of the time on data with that class balance. With a precision of ~67.6%, about two out of three positive predictions are correct, and a recall of ~64.7% means the model misses roughly a third of the actual positives. The model is therefore right most of the time, but its performance is only moderately good.  <br><br>
While the model correctly predicts smoking status about 75% of the time, it's important to consider these results in the context of the medical field. A false positive or false negative from an algorithm used in healthcare could determine whether someone receives treatment, so higher accuracy is necessary. Better performance may be achievable with more extensive tuning, for example of the `C` (regularization) and `gamma` values. I limited tuning to the kernel for this assignment because of computational and time constraints: scikit-learn's `GridSearchCV` needed more processing power than I had available, even on Google Colab. Other types of models may also perform better.
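
One way to make a `C` and `gamma` search cheaper is to sample a few candidates at random and fit them on a subsample of the rows, instead of running an exhaustive grid on the full training set. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the search ranges, `n_iter=8`, and `cv=3` are assumptions picked for a quick illustration, not tuned recommendations, and this was not run as part of the project.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for a subsample of the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Log-uniform ranges for C and gamma (illustrative assumptions).
param_distributions = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}

# Eight random candidates with 3-fold CV: 24 fits plus one final refit.
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=8, cv=3, random_state=0
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print("Best parameters:", best_params)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

The best candidate found on the subsample could then be refit once on the full training set, which costs a single SVM fit instead of one per grid point.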