# Bio-Signal Analysis for Smoking
# Agenda
**01 Importing the Libraries**

**02 Loading the Data**

**03 Data Cleaning**

**04 Label Encoding**

**05 Feature Selection** 

**06 Bagging Algorithms**


# Problem Statement
**Over the years, the company has 
collected details and gathered a lot of 
information about individuals. The 
management wants to build an intelligent 
system from the data to determine the 
presence or absence of smoking in a person 
through bio-signals. Given a person’s 
information, build a machine learning model 
that can classify the presence or absence of 
smoking.**

# Dataset Information
**This dataset is a collection of basic health biological signal data. It
contains around 55K records with 27 attributes.**

# Importing the Libraries
**We start off this project by importing all the necessary
libraries that will be required for the process.**




In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Loading the Data
**Loading the data and removing the irrelevant columns.**


In [None]:
df = pd.read_csv("C:\\Users\\91959\\Downloads\\smoking.csv")
df = df.drop(columns=['ID','oral'])
df.head()

# Inspecting the Data
**Checking the shape of the dataframe and the data types of all columns,
and computing summary statistics.**


In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

# Missing Values 
**Checking for missing values in the dataframe.**

In [None]:
df.isnull().sum()

# Data visualization
**The graphs below show that most smokers are men.**

In [None]:
sns.barplot(data=df, x='gender',y='smoking')
plt.show()

In [None]:
sns.countplot(data=df, x='gender', hue='smoking')
plt.show()

**36.73 percent of the people in the dataset smoke cigarettes.**

In [None]:
plt.figure(figsize=(10,5))
df['smoking'].value_counts().plot.pie(autopct='%0.2f')
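
The same share can also be read off numerically instead of from the pie chart. The following is a minimal sketch on a hypothetical toy column (the values are made up for illustration and are not the real data); `value_counts(normalize=True)` returns class fractions, which are then converted to percentages.

```python
import pandas as pd

# Toy stand-in for the 'smoking' column (hypothetical values, not the real data)
smoking = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="smoking")

# normalize=True returns class fractions instead of raw counts
class_share = smoking.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```

On the actual dataframe, the equivalent call would be `df['smoking'].value_counts(normalize=True)`.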

**The largest number of smokers are aged 40.**

In [None]:
plt.figure(figsize=(9,6))
sns.histplot(data=df, x='age', hue='smoking')
plt.xlabel('Age of the population')
plt.show()

**Boxplots of the numeric columns are used to detect outliers. Here the outliers represent natural 
variations in the population (so-called true outliers), so they should be left as is in the dataset. 
Therefore, we will not remove outliers from this dataset.**


In [None]:
for i in df.columns:
    # Draw a boxplot only for numeric columns
    if pd.api.types.is_numeric_dtype(df[i]):
        sns.boxplot(x=df[i])
        # Show the figure only when a plot was actually drawn
        plt.show()
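
A boxplot usually marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. As a rough sketch of that rule, the helper below counts such points without removing them. The function name `count_iqr_outliers` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical toy values, not taken from the dataset
toy = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 40])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

Applied per numeric column, this gives a quick count of how many points each boxplot flags.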
    

# Data Cleaning
**Performing label encoding for the 
categorical features of the dataframe, so that each category is mapped to an integer.**

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['tartar'] = le.fit_transform(df['tartar'])
df['dental caries'] = le.fit_transform(df['dental caries'])
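
`LabelEncoder` maps each category to an integer, which is not the same as one-hot encoding. The sketch below contrasts the two on a hypothetical toy frame (the category labels are assumed for illustration). For two-valued columns, one-hot encoding with `drop_first=True` yields the same 0/1 column as label encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the category labels are assumptions for illustration
toy = pd.DataFrame({"gender": ["F", "M", "M", "F"],
                    "tartar": ["Y", "N", "Y", "Y"]})

# Label encoding: one integer column per feature, classes sorted alphabetically
label_encoded = toy.copy()
for col in label_encoded.columns:
    label_encoded[col] = LabelEncoder().fit_transform(label_encoded[col])

# One-hot encoding: one indicator column per category;
# drop_first=True removes the redundant first category
one_hot = pd.get_dummies(toy, drop_first=True, dtype=int)

print(label_encoded)
print(one_hot)
```

For features with more than two unordered categories the two encodings differ, and one-hot encoding avoids implying an order between categories.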

In [None]:
df.info()


# Feature selection using feature importance 
**Feature importance is a technique 
that calculates a score for each 
input feature of a given model. 
Out of the 24 features, we will select the 
top 15 based on this score.**


In [None]:
X = df.iloc[:,:-1]
y = df['smoking']

from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
df1 = pd.Series(model.feature_importances_, index=X.columns)
plt.figure(figsize=(8,8))
df1.nlargest(24).plot(kind='barh')
plt.show()
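
The top features can also be picked programmatically instead of being read off the chart. The following is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the real dataframe); the names `X_demo`, `selector` and `top_features` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 24 features, 6 of them informative
X_arr, y_demo = make_classification(n_samples=500, n_features=24,
                                    n_informative=6, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(24)])

# Fit the tree ensemble and collect one importance score per feature
selector = ExtraTreesClassifier(n_estimators=100, random_state=0)
selector.fit(X_demo, y_demo)
importances = pd.Series(selector.feature_importances_, index=X_demo.columns)

# Keep the 15 highest-scoring features
top_features = importances.nlargest(15).index.tolist()
X_selected = X_demo[top_features]
print(X_selected.shape)
```

Fixing `random_state` keeps the ranking reproducible, since tree-based importances vary slightly between runs.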

# Logistic Regression


In [None]:
X = df[['gender','height(cm)','Gtp','hemoglobin','triglyceride','age','weight(kg)','waist(cm)','HDL',
        'serum creatinine','ALT','fasting blood sugar','relaxation','LDL','systolic']]
y = df['smoking']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=42)



In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred = lr.predict(x_test)
      

**Calculating accuracy 
and generating the 
classification report 
of Logistic Regression**


In [None]:
from sklearn.metrics import accuracy_score, classification_report
print(f'Accuracy Score : {accuracy_score(y_test,y_pred)}')
print(classification_report(y_test,y_pred))
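
To make the numbers in the classification report easier to interpret, the sketch below derives accuracy, precision and recall by hand from a confusion matrix. The labels are hypothetical toy values, not model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
precision = tp / (tp + fp)                  # correct among predicted positives
recall = tp / (tp + fn)                     # correct among actual positives

print(accuracy, precision, recall)
```

Here precision and recall refer to the positive class (label 1); the report lists the same quantities for each class.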
      

**The accuracy of the logistic regression model is
78 percent.**

# Decision Tree


In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_pred2 = dt.predict(x_test)
print(classification_report(y_test, y_pred2))

# Bagging Algorithm – Bagging Classifier
**Bootstrap Aggregation or bagging involves taking multiple samples from the training dataset 
(with replacement) and training a model for each sample.**

In [None]:
from sklearn.ensemble import BaggingClassifier
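# Pass the base model positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=1000)

bagging_clf.fit(x_train,y_train)
print(f'Accuracy Score : {bagging_clf.score(x_test,y_test)}')
y_pred3 = bagging_clf.predict(x_test)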

print(classification_report(y_test,y_pred3))
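
The bootstrap-and-aggregate idea can be sketched by hand. The example below is a simplified illustration on synthetic data (an assumption), not how `BaggingClassifier` is implemented internally: each tree is trained on a sample drawn with replacement, and the predictions are combined by a plain majority vote. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), binary labels 0/1
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a 0/1 majority vote cannot tie
votes = np.zeros((n_models, len(X_te)), dtype=int)

for m in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=m).fit(X_tr[idx], y_tr[idx])
    votes[m] = tree.predict(X_te)

# Aggregate: predict 1 where more than half of the trees voted 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_te).mean()
print(bagged_accuracy)
```

Averaging many trees trained on different resamples mainly reduces variance compared with a single decision tree.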

# Bagging Algorithm – Extra Trees


In [None]:
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(n_estimators=1000, random_state=42)
et.fit(x_train,y_train)

y_pred4 = et.predict(x_test)
print(classification_report(y_test, y_pred4))

# Bagging Algorithm – Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000)
rf.fit(x_train,y_train)

y_pred5 = rf.predict(x_test)
print(classification_report(y_test, y_pred5))