This notebook uses survey data on smoking habits from the UK. The objective is to predict whether a person smokes, using a Support Vector Machine (SVM).

In [2]:
# Load the libraries
import pandas as pd

In [3]:
# Read the dataset as dataframe
df = pd.read_csv('smoking.csv')

In [4]:
# A brief overview
df.head()

,Unnamed: 0,gender,age,marital_status,highest_qualification,nationality,ethnicity,gross_income,region,smoke,amt_weekends,amt_weekdays,type
0,1,Male,38,Divorced,No Qualification,British,White,"2,600 to 5,200",The North,No,,,
1,2,Female,42,Single,No Qualification,British,White,"Under 2,600",The North,Yes,12.0,12.0,Packets
2,3,Male,40,Married,Degree,English,White,"28,600 to 36,400",The North,No,,,
3,4,Female,40,Married,Degree,English,White,"10,400 to 15,600",The North,No,,,
4,5,Female,39,Married,GCSE/O Level,British,White,"2,600 to 5,200",The North,No,,,


In [42]:
# Dataset's dimensions (the cell number shows this was run after the cleaning steps, hence 1670 rows and 9 columns)
df.shape

(1670, 9)

In [6]:
# Column types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1691 entries, 0 to 1690
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1691 non-null   int64  
 1   gender                 1691 non-null   object 
 2   age                    1691 non-null   int64  
 3   marital_status         1691 non-null   object 
 4   highest_qualification  1691 non-null   object 
 5   nationality            1691 non-null   object 
 6   ethnicity              1691 non-null   object 
 7   gross_income           1691 non-null   object 
 8   region                 1691 non-null   object 
 9   smoke                  1691 non-null   object 
 10  amt_weekends           421 non-null    float64
 11  amt_weekdays           421 non-null    float64
 12  type                   421 non-null    object 
dtypes: float64(2), int64(2), object(9)
memory usage: 171.9+ KB


In [7]:
# Remove amt_weekends, amt_weekdays and type,
# because only about 25% of their values are non-null
df.drop(columns=['amt_weekends', 'amt_weekdays', 'type'], inplace=True)
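
The 25% figure follows from the non-null counts reported by df.info() (421 of 1691 rows). As a minimal sketch, the share of non-null values per column could be computed as shown here. The toy frame stands in for smoking.csv, and the 0.5 threshold is an assumption made only for illustration.

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "smoke": ["No", "Yes", "No", "No"],
    "amt_weekends": [None, 12.0, None, None],
    "type": [None, "Packets", None, None],
})

# Share of non-null values per column (1.0 means nothing is missing)
non_null_share = toy.notna().mean()

# Columns that fall below an assumed threshold of 50% non-null values
threshold = 0.5
sparse_cols = non_null_share[non_null_share < threshold].index.tolist()

# Drop the sparse columns; no axis argument is needed when columns= is given
cleaned = toy.drop(columns=sparse_cols)

print(non_null_share)
print(sparse_cols)
```

On the real dataframe the same expression, df.notna().mean(), would list the three dropped columns at roughly 0.25.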

In [8]:
# Are there any null values?
df.isnull().sum()

Unnamed: 0               0
gender                   0
age                      0
marital_status           0
highest_qualification    0
nationality              0
ethnicity                0
gross_income             0
region                   0
smoke                    0
dtype: int64

In [38]:
# How many duplicated rows?
df.duplicated().sum()

21

In [41]:
# Drop the 21 duplicated rows
df.drop_duplicates(inplace=True)

In [9]:
# Remove the Unnamed: 0 column because it does not add value
df.drop(columns=['Unnamed: 0'], inplace=True)
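
The order of these two steps matters. Unnamed: 0 appears to be a running row number, and such a column makes every row distinct, so duplicates only become visible once it has been removed (the cell numbers suggest the notebook was executed in that order). A small sketch with made-up data, where the column name row_id is hypothetical:

```python
import pandas as pd

# Made-up rows: the first and third respondent share all attributes
toy = pd.DataFrame({
    "row_id": [1, 2, 3],
    "gender": ["Male", "Female", "Male"],
    "age": [38, 42, 38],
})

# With the identifier present, no row counts as a duplicate
dups_with_id = int(toy.duplicated().sum())

# After removing the identifier, the repeated respondent shows up
without_id = toy.drop(columns=["row_id"])
dups_without_id = int(without_id.duplicated().sum())

# Keep the first occurrence of each distinct row
deduplicated = without_id.drop_duplicates()

print(dups_with_id, dups_without_id, len(deduplicated))
```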

In [10]:
# Convert object columns to integers
# (note: "Yes" -> 0 and "No" -> 1, so class 1 means non-smoker)
smoke_map = {"Yes": 0, "No": 1}
df["smoke"] = df["smoke"].map(smoke_map)

In [11]:
gender_map = {"Male": 0, "Female": 1}
df["gender"] = df["gender"].map(gender_map)

In [12]:
marital_map = {"Married": 0, "Single": 1, "Widowed": 2, 
               "Divorced": 3, "Separated": 4}
df["marital_status"] = df["marital_status"].map(marital_map)

In [13]:
# Import required library
from sklearn.preprocessing import LabelEncoder

In [14]:
# Create LabelEncoder object
encoder = LabelEncoder()

In [15]:
# Apply label encoding to "highest_qualification" feature
df["highest_qualification"] = encoder.fit_transform(df["highest_qualification"])

In [16]:
# Apply label encoding to "nationality" feature
df["nationality"] = encoder.fit_transform(df["nationality"])

In [17]:
# Apply label encoding to "ethnicity" feature
df["ethnicity"] = encoder.fit_transform(df["ethnicity"])

In [18]:
# Apply label encoding to "gross_income" feature
df["gross_income"] = encoder.fit_transform(df["gross_income"])

In [19]:
# Apply label encoding to "region" feature
df["region"] = encoder.fit_transform(df["region"])
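
LabelEncoder numbers the categories in sorted (alphabetical) order, so the resulting integers impose an arbitrary ordering on nominal features such as region or nationality, and a distance-based model like an SVM will treat that ordering as meaningful. One possible alternative, which is not what this notebook does, is one-hot encoding. A minimal sketch on made-up region values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration only
toy = pd.DataFrame({"region": ["The North", "London", "Scotland", "London"]})

# LabelEncoder assigns codes by sorted category name:
# London -> 0, Scotland -> 1, The North -> 2
codes = LabelEncoder().fit_transform(toy["region"])

# One-hot encoding creates one 0/1 column per category instead,
# so no category is numerically "larger" than another
one_hot = pd.get_dummies(toy["region"], prefix="region", dtype=int)

print(list(codes))
print(one_hot)
```

With an RBF kernel it may also help to bring age onto a scale comparable to the 0/1 columns, for example with StandardScaler, although that is a design choice rather than a requirement.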

In [32]:
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score 

In [43]:
# Feature dataframe & target column
X = df.drop(columns=['smoke'])
y = df['smoke']

In [44]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    random_state=42)

In [45]:
# Create SVM object with RBF kernel
svm = SVC(kernel='rbf', gamma=0.1)

In [46]:
# Train SVM model on training data
svm.fit(X_train, y_train)

SVC(gamma=0.1)

In [47]:
# Make predictions on test data
y_pred = svm.predict(X_test)

In [48]:
# Evaluate accuracy of predictions
accuracy = accuracy_score(y_test, y_pred)

In [49]:
# Print accuracy score
print("Accuracy: {:.2f}".format(accuracy))

Accuracy: 0.75


The above metric means that 75% of the samples in the test set were classified correctly by the SVM model as either smoker or non-smoker.
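
An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. The following is a minimal sketch on made-up labels; the helper name majority_baseline_accuracy is hypothetical, and on the real data one could pass y_test.tolist() instead of the toy list.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels: six samples of class 1, two of class 0
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_labels)

print("Baseline accuracy: {:.2f}".format(baseline))
```

If the model's accuracy is close to this baseline, most of its correct predictions may simply come from favouring the majority class.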

In [50]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [51]:
# Print the results
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1-score: {:.2f}".format(f1))

Precision: 0.77
Recall: 0.96
F1-score: 0.85


Precision is 0.77, which means that out of all the samples that the SVM model predicted as positive, 77% were actually positive, while the rest were false positives. Note that the positive class here is label 1, i.e. non-smokers, because "No" was mapped to 1 above.

Recall is 0.96, which means that out of all the actual positive samples (non-smokers) in the test set, the SVM model correctly identified almost all of them as positive.

F1-score is 0.85, which is the harmonic mean of precision and recall, taking both metrics into account. It indicates the balance between precision and recall, with higher values indicating better overall performance.
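
By default, scikit-learn's precision_score, recall_score and f1_score treat label 1 as the positive class, and "No" was mapped to 1 earlier, so the figures above describe non-smokers. The sketch below recomputes the three metrics by hand for either class on made-up labels (the function name binary_metrics is hypothetical) and checks that the reported F1-score agrees with the reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 = non-smoker, 0 = smoker
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

non_smoker_scores = binary_metrics(y_true, y_pred, positive=1)
smoker_scores = binary_metrics(y_true, y_pred, positive=0)

# Harmonic mean of the precision and recall reported in this notebook
f1_from_reported = 2 * 0.77 * 0.96 / (0.77 + 0.96)

print("Non-smoker (1):", non_smoker_scores)
print("Smoker (0):", smoker_scores)
print("F1 from reported values: {:.2f}".format(f1_from_reported))
```

In the toy example the scores for class 1 look strong while recall for class 0 is low, which is why it can be worth reporting the metrics for both classes.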

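So, taking all of the above metrics into account, the selected model performs well at recognising non-smokers, the positive class. These figures alone do not show how well it detects smokers.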