# **Hybrid System for Lung Cancer Classification**

Created by : Harjinder Singh <br>
Email : hjbrar7@gmail.com

Here I will create a Hybrid System using **Particle Swarm Optimisation** and **Logistic Regression**.

We will see how to use **PSO** to extract useful features from the dataset and then train our model on these features.

### **Importing Useful Libraries**

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

### **Importing Dataset**

The dataset used here is taken from the **UCI Machine Learning Repository**. <br>
Link to the dataset: <a href="http://archive.ics.uci.edu/ml/datasets/Lung+Cancer" > Lung Cancer Dataset </a> <br>

This dataset contains 32 examples and 56 attributes. <br>
Some of its fields are missing, so we need to clean the dataset before using it in our ML model.

---
### ***Note :***
With only 32 examples, this dataset is not big enough to train a good Machine Learning model.


In [0]:
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/lung-cancer/lung-cancer.data"
data = pd.read_csv(data_url,header=None)
data.head()

In [0]:
data.describe()

In [0]:
data.info()

## Preprocessing

In [0]:
# Rename the columns to string labels '0', '1', ...
data.columns = [str(x) for x in range(len(data.columns))]

We will find the columns with missing data and then fill in the missing values.

In [0]:
# Columns that contain '?' are read as object dtype instead of a numeric dtype
cols = [c for c in data.columns if data[c].dtype == object]
print("Columns with missing values are : ",cols)

print("Missing data is given with '?' ")
data[cols].head()

Now we will convert these values to **np.int64** and convert **?** to **-1**. <br>
Then we will fill these -1 entries (missing values) with the **mean** of the column.

In [0]:
def convertValues(x):
    # '?' cannot be parsed as an integer, so it becomes -1
    try : x = np.int64(x)
    except (ValueError, TypeError) : x = -1
    return x

for c in cols:
    data[c] = pd.Series([convertValues(x) for x in data[c]])

print("Convert '?' to -1")
data[cols].head()
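
As an alternative sketch, pandas can mark `?` as missing while loading, through the `na_values` parameter of `read_csv`, which avoids the -1 sentinel. The inline CSV text and the names `sample_csv` and `sample` below are made up for illustration and are not rows from the real dataset.

```python
import io
import pandas as pd

# Made-up sample rows (illustrative only); '?' marks a missing value.
sample_csv = "1,0,3,?\n2,1,?,2\n3,0,2,1\n"

sample = pd.read_csv(io.StringIO(sample_csv), header=None, na_values="?")

# Columns that contained '?' are parsed as float, with NaN in the gaps.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With this approach the imputer would need to look for NaN instead of -1.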

Now fill the missing values.

In [0]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values= -1,strategy = "mean")
data[cols] = imputer.fit_transform(data[cols])
data[cols].head()
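
To make the effect of mean imputation concrete, here is a minimal sketch on a toy column. The values and the names `toy`, `toy_imputer` and `filled` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry encoded as -1 (illustrative values).
toy = np.array([[1.0], [3.0], [-1.0], [2.0]])

toy_imputer = SimpleImputer(missing_values=-1, strategy="mean")
filled = toy_imputer.fit_transform(toy)

# The -1 is replaced by the mean of the observed values: (1 + 3 + 2) / 3 = 2.0
print(filled.ravel())
```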

## **Visualising Data**

The first column holds our target values.

In [0]:
sns.countplot(x='0',data=data)

In [0]:
plt.subplots(figsize=(15,10))
# Plot pairwise correlations between columns rather than the raw values
sns.heatmap(data.corr(), cmap="coolwarm")

In the graph above we can see that a number of features are **highly negatively correlated**, while some are **highly positively correlated**. So we now need to extract only the useful features that can give us a good Machine Learning model.

# **Extracting useful features using PSO**

Not all of these features may contribute to <br>
good accuracy. So we may have to choose only those features whose <br>
contribution gives us useful information.

So we will use **Particle Swarm Optimization** - *(PSO)* to <br>
extract useful features from our dataset.

In [0]:
# Installing the pyswarms library

!python -m pip install pyswarms;

In [0]:
# importing the libraries
import pyswarms as ps

### Splitting data into Train and Test sets

In [0]:
from sklearn.model_selection import train_test_split
X,y = data.iloc[:,1:].values, data.iloc[:,0].values

Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3)
print("Xtrain shape : {} \nytrain shape : {}".format(Xtrain.shape,ytrain.shape))
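
Because the split above is random, each run can produce different train and test sets. A hedged sketch of a repeatable, class-balanced split follows, using the `stratify` and `random_state` parameters of `train_test_split`. The data here is a synthetic stand-in, and the names `X_demo`, `y_demo`, `Xtr`, `Xte`, `ytr` and `yte` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 samples, 3 balanced classes (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(30, 5))
y_demo = np.repeat([1, 2, 3], 10)

# stratify keeps the class ratios similar in both parts;
# random_state makes the split repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
print(Xtr.shape, Xte.shape)
print(np.bincount(yte)[1:])  # test-set count for classes 1, 2, 3
```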

### Use of PSO

In [0]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression Classifier
clf = LogisticRegression(solver='saga')

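def cal_per_particle(mask,alpha,no_of_features=56):
    # mask is one particle's binary position: 1 keeps a feature, 0 drops it
    if np.count_nonzero(mask) == 0:
        # an all-zero mask selects nothing, so fall back to every feature
        subset = Xtrain
    else :
        subset = Xtrain[:,mask==1]
    clf.fit(subset,ytrain)
    # accuracy on the training data for this subset
    pred = (clf.predict(subset)==ytrain).mean()
    # weighted sum of the error rate and the feature-count term
    tmp = (alpha * (1.0 - pred) + (1.0 - alpha)* (1 - (subset.shape[1]/no_of_features)))
    return tmp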

def calc(x,alpha=0.81):
    number_of_particles = x.shape[0]
    arr = [cal_per_particle(x[i],alpha) for i in range(number_of_particles)]
    return np.array(arr)
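
The cost that each particle receives is a weighted sum of two terms: the training error rate, and one minus the fraction of features kept. The sketch below repeats the same arithmetic without the classifier so the numbers can be checked by hand. `subset_cost` is a hypothetical helper introduced only for this illustration, and the accuracy values are made up.

```python
def subset_cost(accuracy, n_selected, n_total=56, alpha=0.81):
    """Same arithmetic as cal_per_particle, with the accuracy passed in."""
    error_term = alpha * (1.0 - accuracy)
    size_term = (1.0 - alpha) * (1.0 - n_selected / n_total)
    return error_term + size_term

# Hypothetical particle: 14 of 56 features, training accuracy 0.9
cost_a = subset_cost(accuracy=0.9, n_selected=14)
# Hypothetical particle: all 56 features, same accuracy
cost_b = subset_cost(accuracy=0.9, n_selected=56)
print(round(cost_a, 4), round(cost_b, 4))
```

With `alpha=0.81` the error term dominates. As written, the feature-count term is zero when every feature is kept, so at equal accuracy the larger subset gets the lower cost.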

In [0]:
%%time

op1 = ['c1','c2']
ops = {'w':0.9,'k':30,'p':2}
for o in op1:
    ops[o] = np.random.random()
dims = Xtrain.shape[1]
optimizer = ps.discrete.BinaryPSO(n_particles=32, dimensions=dims, options=ops)
cost, pos = optimizer.optimize(calc, iters=800)
print("Options are : {}".format(ops))

In [0]:
cols = data.columns[1:]
print("Selected features are : ")
print([c for c,p in zip(cols,pos) if p==1 ])
print("Excluded features are : ")
print([c for c,p in zip(cols,pos) if p==0 ])
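
The best position returned by the optimizer is a vector of 0s and 1s, one entry per feature. A small sketch of how such a mask selects columns with NumPy boolean indexing follows. The matrix and the mask are toy values, and `X_toy`, `pos_toy`, `selected` and `kept_idx` are illustrative names.

```python
import numpy as np

# Toy matrix: 3 samples, 4 features (illustrative values).
X_toy = np.array([[10, 11, 12, 13],
                  [20, 21, 22, 23],
                  [30, 31, 32, 33]])
# Hypothetical best position: 1 = keep the feature, 0 = drop it.
pos_toy = np.array([1, 0, 0, 1])

selected = X_toy[:, pos_toy == 1]        # keeps columns 0 and 3
kept_idx = np.flatnonzero(pos_toy == 1)  # indices of the kept features
print(kept_idx, selected.shape)
```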

### **Visualising PSO Optimizer Cost History**

In [0]:
from pyswarms.utils.plotters import plot_cost_history

plot_cost_history(optimizer.cost_history)

## **Checking Model performance over Selected Features**

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression(solver='saga')
# selected training features
s_xtrain = Xtrain[:,pos==1]
clf.fit(s_xtrain,ytrain)

# selected testing features
s_xtest = Xtest[:,pos==1]

pred = clf.predict(s_xtest)

In [0]:
mat = confusion_matrix(ytest,pred)
sns.heatmap(mat, cmap="coolwarm",fmt='d',annot=True)

In [0]:
from sklearn.metrics import accuracy_score as ac

print("Accuracy Score : ",ac(ytest,pred))

## **Hyperparameter Tuning**
As we saw, after extracting the features with the help of **PSO**<br>
we were able to achieve an **accuracy score of 0.6**.

So now we will try to tune our model to increase the accuracy with the selected features.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

Some parameters for tuning. We will create different combinations of parameters to search for the best solution.

In [0]:
c_param = [0.001, 0.01, 0.1, 1, 10, 100]
p_param = ['l1','l2']

param_list = [(c,p) for c in c_param for p in p_param]
print("Combinations are : ",param_list)

We will keep track of **C**, **penalty** and **accuracy** so that we can use the best hyperparameters in our final model.

In [0]:
C=None
P=None
Accu=0.0

In [0]:
i,j=0,0
f, axes = plt.subplots(3, 4, figsize=(20, 15), sharex=True)
sns.despine(left=True)
s_xtrain = Xtrain[:,pos==1]
s_xtest = Xtest[:,pos==1]

for c,p in param_list:
    clf = LogisticRegression(solver='saga',penalty=p,C=c)
    clf.fit(s_xtrain,ytrain)
    pred = clf.predict(s_xtest)
    acc_score = accuracy_score(ytest,pred)
    if acc_score > Accu:
        Accu = acc_score
        C,P=c,p
    cm = confusion_matrix(ytest,pred)
    axes[i,j].set_title("{ 'c' : "+str(c)+" , 'penalty' : "+p+" }")
    sns.heatmap(cm,cmap="coolwarm",fmt='d',annot=True,ax=axes[i,j])
    # move to the next cell of the 3 x 4 grid, row by row
    if j == 3: i = (i+1)%3
    j = (j+1)%4

plt.setp(axes, yticks=[])
plt.tight_layout()
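
The loop above scores every combination on the test set. As a hedged alternative sketch, scikit-learn's `GridSearchCV` can search the same grid with cross-validation on the training data only. The data below is a synthetic stand-in from `make_classification`, the names `X_demo`, `y_demo`, `param_grid`, `cv` and `search` are illustrative, and `max_iter=2000` is an assumption made to help `saga` converge.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the lung cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=60, n_features=8, n_informative=4, random_state=0
)

# Same grid as in the notebook: 6 values of C times 2 penalties.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=2000),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
with warnings.catch_warnings():
    # saga may not fully converge for some settings; silence only that warning
    warnings.simplefilter("ignore", ConvergenceWarning)
    search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The best combination is then available as `search.best_params_`, and the test set stays untouched until the final evaluation.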

### Result after Hyperparameter Tuning

In [0]:
print("Best Accuracy : ",Accu)
print("Penalty : ",P)
# print the best C tracked above, not the loop variable c
print("C : ",C)

As we can see, after hyperparameter tuning the <br>
accuracy score is **0.6**, which is quite good for this dataset.

# **Finalizing the Hybrid Model**

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix , accuracy_score, classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score, KFold

In [0]:
xtrain,xtest,ytrain,ytest = train_test_split(X[:,pos==1],y,test_size=0.3)

model = LogisticRegression(penalty=P,C=C,solver='saga')
model.fit(xtrain,ytrain)

pred = model.predict(xtest)

print("Accuracy Score : ",accuracy_score(ytest,pred))
c_mat = confusion_matrix(ytest,pred)
sns.heatmap(c_mat,cmap="coolwarm",fmt='d',annot=True)
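
A single train/test split on so few samples gives a noisy accuracy estimate. The notebook imports `cross_val_score` but does not use it, so here is a minimal sketch of a cross-validated estimate. The data is a synthetic three-class stand-in, the names `X_demo`, `y_demo`, `demo_model`, `cv` and `scores` are illustrative, and `C=1.0` with `max_iter=2000` are assumed settings rather than the tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class stand-in data (not the lung cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=60, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_model = LogisticRegression(solver="saga", C=1.0, max_iter=2000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold; the mean summarises them.
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```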

In [0]:
cr = classification_report(ytest,pred,output_dict=True)
print(cr)

In [0]:
cr_df = pd.DataFrame(cr)
cr_df = cr_df.transpose()

In [0]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cr_df,annot=True,cmap="coolwarm",ax=ax)

In [0]:
cr_df

# **Summary**

The main motive of this project was to use Soft Computing to build a Hybrid System.<br>
So I used **Particle Swarm Optimization - (PSO)** with **Logistic Regression** to create a <br>
Hybrid System that extracts features using PSO and then applies Logistic Regression<br>
to the data to classify different types of lung cancer.


The data used in this project is openly available from the UCI Machine Learning <br>
Repository. In the end, I was able to create a Hybrid model and apply it to this dataset.

### Main concepts used in this project


*   Preprocessing Dataset
*   Visualizing Data
*   Feature Extraction using PSO
*   Check Model Performance over Selected Features
*   Hyperparameter Tuning
*   Finalising Model with Logistic Regression
*   Classification Report
   
