Pap Smear Classification model

The goal of this project is to develop a classification model able to identify cervical dysplasia in two main categories, normal and abnormal. The dataset I am working with has a target variable divided into 7 categories depending on how severe the dysplasia is. In order to group these categories I will apply two unsupervised methods aiming to find:

  1. The number of clusters.

  2. How to split the data into these clusters.

In addition, I will implement some techniques to detect highly correlated features, as well as a dimensionality reduction method, in order to identify patterns and merge the target column into fewer categories.

My dataset has 26 features and 500 rows. Although I have to deal with many features, the number of rows is not very large, which is something I will try to increase later in order to develop another model and compare the results. First, I used some statistical techniques to examine the distribution of the features, and scatter plots to find which variables are highly correlated.

![cor](https://user-images.githubusercontent.com/66875726/97926634-21763300-1d6c-11eb-91ff-4ba6dbc76721.png)

Yes, it is a bit chaotic with that many features, but after applying the Univariate Feature Selection method I was left with 15 features to work with.

(Untitled figure: result of the Univariate Feature Selection step)
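As a rough illustration of these two steps, the sketch below loads the data, draws the correlation heatmap, and applies Univariate Feature Selection with scikit-learn's SelectKBest. The file name, the `target` column name, and the choice of `f_classif` as the scoring function are assumptions, not details taken from the original code.

```python
# Minimal sketch of the correlation check and Univariate Feature Selection.
# "pap_smear.csv" and the "target" column name are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("pap_smear.csv")          # hypothetical file name
X = df.drop(columns=["target"])            # the 26 features
y = df["target"]                           # the 7 dysplasia categories

# Correlation heatmap of all features (the "chaotic" plot above)
plt.figure(figsize=(12, 10))
sns.heatmap(X.corr(), cmap="coolwarm", center=0)
plt.show()

# Keep the 15 best features according to a univariate ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=15).fit(X, y)
selected_features = X.columns[selector.get_support()]
print(selected_features.tolist())
```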

After these steps I implemented two unsupervised techniques in order to determine the number of classes into which I could split the data. I started with a hierarchical clustering algorithm; its dendrogram is shown below.

(Figure: dendrogram of the hierarchical clustering)
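A sketch of how such a dendrogram can be produced with SciPy, continuing from the variables defined above; Ward linkage is an assumption, since the README does not state which linkage was used.

```python
# Hierarchical clustering dendrogram over the selected features.
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(df[selected_features], method="ward")   # Ward linkage is assumed

plt.figure(figsize=(10, 5))
dendrogram(Z, no_labels=True)
plt.title("Hierarchical clustering of the pap smear samples")
plt.show()
```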

As we can see from the dendrogram, I can split the data into 2 or 3 classes, which is definitely an improvement over the seven classes of the original target. The second step, which was really useful for understanding the distribution of the data, was a set of scatter plots showing how the features relate to the target group.

(Figures: scatter plots of the Kerne_Short, Kyto_Short, Cyto_Short, Cyto_Long, and Kyto_Long features against the target categories)

From the above scatter plots we can draw the conclusion that categories 1-4 could be classified into one group (normal cells) and the remaining categories 5-7 into another group (abnormal cells).
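This grouping can be expressed as a simple relabelling; the sketch below assumes the 7 categories are coded 1-7 in the `target` column.

```python
# Collapse the 7 categories into the two groups suggested by the scatter plots:
# categories 1-4 -> 0 (normal cells), categories 5-7 -> 1 (abnormal cells).
df["binary_target"] = (df["target"] >= 5).astype(int)
print(df["binary_target"].value_counts())
```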

Another technique that can be really helpful for separating the target classes and verifying the above conclusion is Principal Component Analysis. Looking at the explained variance ratio of the first two components, 80% of the dataset's variance lies along the first principal component and 14% along the second. Since the first two components carry most of the information, let's plot them.

(Figure: data projected onto the first two principal components)
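A sketch of the PCA step, continuing from the sketches above; standardizing the features before PCA is an assumption, and the variance figures in the comment are the ones reported in the text, not values this sketch is guaranteed to reproduce.

```python
# Fit PCA on the (standardized) selected features and inspect the variance ratio.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X_scaled = StandardScaler().fit_transform(df[selected_features])

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)   # text reports roughly 0.80 and 0.14 for the first two PCs

# Scatter of the first two principal components, coloured by the binary target
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df["binary_target"], cmap="coolwarm", s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```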

This plot indeed verifies that our target separates into normal cells (categories 1-4) and abnormal cells (categories 5-7). It would also be really interesting to produce a 3D plot to inspect this classification.

(Figure: interactive 3D plot of the data)
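The interactive 3D view could look something like the sketch below; plotting the first three principal components with Plotly is an assumption based on the exported plot, not a detail given in the README.

```python
# 3D scatter of the first three principal components (Plotly is assumed).
import plotly.express as px

fig = px.scatter_3d(
    x=X_pca[:, 0], y=X_pca[:, 1], z=X_pca[:, 2],
    color=df["target"].astype(str),      # colour by the 7 original categories
    labels={"x": "PC1", "y": "PC2", "z": "PC3"},
)
fig.show()
```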

Lastly, before the prediction models, I trained a supervised algorithm to examine how well the 7 different cell categories can be predicted. I trained a KNN model; the results are below.

(Figure: confusion matrix of the KNN model)

The model was able to identify the third and fourth categories almost perfectly, but it is not as accurate on the fifth and sixth.
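A sketch of this 7-class KNN baseline, continuing from the variables above; the train/test split ratio and `n_neighbors=5` are assumptions.

```python
# 7-class KNN baseline with a confusion matrix on a held-out test set.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Per-category performance, as in the confusion matrix above
ConfusionMatrixDisplay(confusion_matrix(y_test, knn.predict(X_test))).plot()
```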

Prediction Models

After those steps, I started the training phase. I trained and optimized 4 supervised models under 4 different assumptions: 2 different feature selection methods and 2 different feature scaling methods, StandardScaler and Normalizer, aiming to find out how these choices affect the results.
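One way to set up this comparison is to wrap each scaler and model in a Pipeline and tune it with cross-validated grid search, as in the sketch below. The Decision Tree grid, the split ratio, and the use of the binary target for the final models are assumptions; the other three models would be swapped in the same way.

```python
# Compare StandardScaler vs Normalizer for a candidate model inside a Pipeline,
# tuning with 5-fold cross-validated grid search (the grid values are placeholders).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# The binary target (normal vs abnormal) is assumed for the final prediction models.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    df[selected_features], df["binary_target"],
    test_size=0.25, random_state=42, stratify=df["binary_target"])

for scaler in (StandardScaler(), Normalizer()):
    pipe = Pipeline([("scale", scaler),
                     ("clf", DecisionTreeClassifier(random_state=42))])
    grid = GridSearchCV(pipe,
                        param_grid={"clf__max_depth": [3, 5, 7, None],
                                    "clf__min_samples_leaf": [1, 5, 10]},
                        cv=5, scoring="accuracy")
    grid.fit(Xb_train, yb_train)
    print(type(scaler).__name__, grid.best_params_, round(grid.best_score_, 3))
```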

Comparing all the different assumptions, with optimized parameters and the cross-validation scores also taken into account, the results are below:

Training Set

(Figures: training-set results for the Logistic Regression, SVM, KNN, and Decision Tree models)

Test Set

(Figures: test-set results for the KNN and Decision Tree models)

Considering the models' performance on the training, test, and cross-validation sets, the model that performed best is the optimized version of the Decision Tree classifier.
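Continuing from the grid-search sketch above, the winning configuration can be checked once on the held-out test set; the metrics below are just the standard scikit-learn ones, not the figures from the original notebooks.

```python
# Evaluate the best-scoring pipeline (the optimized Decision Tree) on the test set.
from sklearn.metrics import accuracy_score, classification_report

best_model = grid.best_estimator_   # the last fitted search; pick whichever configuration scored best
y_pred = best_model.predict(Xb_test)

print("Test accuracy:", round(accuracy_score(yb_test, y_pred), 3))
print(classification_report(yb_test, y_pred))
```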