## About Dataset

This dataset contains 13 clinicopathologic features intended to predict recurrence of well-differentiated thyroid cancer. The data were collected over a period of 15 years, and each patient was followed for at least 10 years.


## Content

The attributes of the file in this Kaggle dataset are listed below, each with a short description:

Age: The age of the patient at the time of diagnosis or treatment.

Gender: The gender of the patient (male or female).

Smoking: Whether the patient is a smoker or not.

Hx Smoking: Smoking history of the patient (e.g., whether they have ever smoked).

Hx Radiotherapy: History of radiotherapy treatment for any condition.

Thyroid Function: The status of thyroid function, possibly indicating if there are any abnormalities.

Physical Examination: Findings from a physical examination of the patient, which may include palpation of the thyroid gland and surrounding structures.

Adenopathy: Presence or absence of enlarged lymph nodes (adenopathy) in the neck region.

Pathology: Specific types of thyroid cancer as determined by pathology examination of biopsy samples.

Focality: Whether the cancer is unifocal (limited to one location) or multifocal (present in multiple locations).

Risk: The risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type.

T: Tumor classification based on its size and extent of invasion into nearby structures.

N: Nodal classification indicating the involvement of lymph nodes.

M: Metastasis classification indicating the presence or absence of distant metastases.

Stage: The overall stage of the cancer, typically determined by combining T, N, and M classifications.

Response: Response to treatment, indicating whether the cancer responded positively, negatively, or remained stable after treatment.

Recurred: Indicates whether the cancer has recurred after initial treatment.

## Importing basic Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore') 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import accuracy_score,classification_report,ConfusionMatrixDisplay,precision_score,recall_score,f1_score

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

## Loading Dataset

In [2]:
df = pd.read_csv("C:/Users/Hi/Downloads/Thyroid_Diff.csv")

In [3]:
df.head()

| | Age | Gender | Smoking | Hx Smoking | Hx Radiothreapy | Thyroid Function | Physical Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Indeterminate | No |
| 1 | 34 | F | No | Yes | No | Euthyroid | Multinodular goiter | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
| 2 | 30 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
| 3 | 62 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
| 4 | 62 | F | No | No | No | Euthyroid | Multinodular goiter | No | Micropapillary | Multi-Focal | Low | T1a | N0 | M0 | I | Excellent | No |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383 entries, 0 to 382
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Age                   383 non-null    int64 
 1   Gender                383 non-null    object
 2   Smoking               383 non-null    object
 3   Hx Smoking            383 non-null    object
 4   Hx Radiothreapy       383 non-null    object
 5   Thyroid Function      383 non-null    object
 6   Physical Examination  383 non-null    object
 7   Adenopathy            383 non-null    object
 8   Pathology             383 non-null    object
 9   Focality              383 non-null    object
 10  Risk                  383 non-null    object
 11  T                     383 non-null    object
 12  N                     383 non-null    object
 13  M                     383 non-null    object
 14  Stage                 383 non-null    object
 15  Response              383 non-null    object
 16  Recurred              383 non-null    object
dtypes: int64(1), object(16)

In [5]:
## checking for missing values
print("**********************************")
print("feature\t\tMissing values")
print("**********************************")
print(df.isnull().sum())
print('**Total missing values**')
print(df.isnull().sum().sum())

**********************************
feature		Missing values
**********************************
Age                     0
Gender                  0
Smoking                 0
Hx Smoking              0
Hx Radiothreapy         0
Thyroid Function        0
Physical Examination    0
Adenopathy              0
Pathology               0
Focality                0
Risk                    0
T                       0
N                       0
M                       0
Stage                   0
Response                0
Recurred                0
dtype: int64
**Total missing values**
0


In [6]:
print("**********************************")
print("feature\t\t\tdata type")
print("**********************************")
print(df.dtypes)

**********************************
feature			data type
**********************************
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiothreapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object


In [7]:
### Features

# Split columns by dtype; 'O' is the pandas object dtype used for strings
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
# Note: at this point cat_features still includes the target column 'Recurred'
cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
# Numeric features with at most 25 distinct values are treated as discrete
discrete_features = [feature for feature in num_features if len(df[feature].unique()) <= 25]
continuous_features = [feature for feature in num_features if feature not in discrete_features]

print("********************************************")
print("Feature type\t\t total no.of features")
print('********************************************')
print('Numerical features\t\t', len(num_features))
print("Categorical features\t\t", len(cat_features))
print("Discrete features\t\t", len(discrete_features))
print("Continuous features\t\t", len(continuous_features))

********************************************
Feature type		 total no.of features
********************************************
Numerical features		 1
Categorical features		 16
Discrete features		 0
Continuous features		 1


In [8]:
df.columns

Index(['Age', 'Gender', 'Smoking', 'Hx Smoking', 'Hx Radiothreapy',
       'Thyroid Function', 'Physical Examination', 'Adenopathy', 'Pathology',
       'Focality', 'Risk', 'T', 'N', 'M', 'Stage', 'Response', 'Recurred'],
      dtype='object')

In [9]:
df.Gender.value_counts()

F    312
M     71
Name: Gender, dtype: int64

In [10]:
df.Smoking.value_counts()

No     334
Yes     49
Name: Smoking, dtype: int64

In [11]:
df["Hx Smoking"].value_counts()

No     355
Yes     28
Name: Hx Smoking, dtype: int64

In [12]:
df["Hx Radiothreapy"].value_counts()

No     376
Yes      7
Name: Hx Radiothreapy, dtype: int64

In [13]:
df["Thyroid Function"].value_counts()

Euthyroid                      332
Clinical Hyperthyroidism        20
Subclinical Hypothyroidism      14
Clinical Hypothyroidism         12
Subclinical Hyperthyroidism      5
Name: Thyroid Function, dtype: int64

In [14]:
df["Physical Examination"].value_counts()

Multinodular goiter            140
Single nodular goiter-right    140
Single nodular goiter-left      89
Normal                           7
Diffuse goiter                   7
Name: Physical Examination, dtype: int64

In [15]:
df["Adenopathy"].value_counts()

No           277
Right         48
Bilateral     32
Left          17
Extensive      7
Posterior      2
Name: Adenopathy, dtype: int64

In [16]:
df["Focality"].value_counts()

Uni-Focal      247
Multi-Focal    136
Name: Focality, dtype: int64

In [17]:
df["Risk"].value_counts()

Low             249
Intermediate    102
High             32
Name: Risk, dtype: int64

In [18]:
df["T"].value_counts()

T2     151
T3a     96
T1a     49
T1b     43
T4a     20
T3b     16
T4b      8
Name: T, dtype: int64

In [19]:
df.N.value_counts()

N0     268
N1b     93
N1a     22
Name: N, dtype: int64

In [20]:
df.M.value_counts()

M0    365
M1     18
Name: M, dtype: int64

In [21]:
df.Stage.value_counts()

I      333
II      32
IVB     11
III      4
IVA      3
Name: Stage, dtype: int64

In [22]:
df.Response.value_counts()

Excellent                 208
Structural Incomplete      91
Indeterminate              61
Biochemical Incomplete     23
Name: Response, dtype: int64

In [23]:
df.Recurred.value_counts()

No     275
Yes    108
Name: Recurred, dtype: int64
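
The per-column `value_counts()` calls above could be collapsed into one loop over the non-numeric columns. The following is only a sketch on a small toy frame: `toy_df` and `summarize_categoricals` are hypothetical names, and the values are illustrative rather than taken from `Thyroid_Diff.csv`.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical values).
toy_df = pd.DataFrame({
    "Age": [27, 34, 30, 62],
    "Gender": ["F", "F", "M", "F"],
    "Risk": ["Low", "Low", "High", "Intermediate"],
})


def summarize_categoricals(frame):
    """Return {column name: value_counts Series} for every non-numeric column."""
    cat_cols = frame.select_dtypes(exclude="number").columns
    return {col: frame[col].value_counts() for col in cat_cols}


summary = summarize_categoricals(toy_df)
for col, counts in summary.items():
    print(f"--- {col} ---")
    print(counts.to_string())
```

In the notebook itself the same helper would presumably be called as `summarize_categoricals(df)`.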

In [24]:
## Dependent feature

df['Recurred'] = df['Recurred'].astype('category')
df['Recurred'] = df['Recurred'].cat.codes


In [25]:
df['Recurred'].value_counts()

0    275
1    108
Name: Recurred, dtype: int64
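
`cat.codes` assigns integer codes in sorted category order, which is why `No` becomes 0 and `Yes` becomes 1 here. An explicit mapping makes that choice visible and fails loudly on unexpected labels. A minimal sketch on a toy series (`recurred` and `label_map` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for df['Recurred'] (hypothetical values).
recurred = pd.Series(["No", "Yes", "No", "Yes", "No"], name="Recurred")

# Explicit label-to-integer mapping; any label missing from the map becomes NaN.
label_map = {"No": 0, "Yes": 1}
encoded = recurred.map(label_map)

# Guard against labels that are not covered by the mapping.
if encoded.isna().any():
    raise ValueError("Unexpected label in target column")

encoded = encoded.astype(int)
print(encoded.tolist())
```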

## Handling imbalanced dataset

In [26]:
df_minority = df[df['Recurred'] == 1]
df_majority = df[df['Recurred'] == 0]

In [27]:
from sklearn.utils import resample

df_minority_resampled = resample(df_minority, replace = True,n_samples = len(df_majority),random_state = 42)

In [28]:
df = pd.concat([df_majority,df_minority_resampled])

In [29]:
df.shape

(550, 17)

In [30]:
### Separating the independent and dependent features

X = df.drop(["Recurred"], axis =1)
y = df["Recurred"]

In [31]:
X.shape

(550, 16)

In [32]:
y.shape

(550,)

In [33]:
y.value_counts()

0    275
1    275
Name: Recurred, dtype: int64

## Train test split

In [34]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25, random_state = 42)
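
Because the minority class was oversampled with replacement before this split, copies of the same patient row can land in both the training and the test set, which may make the test scores look better than they would on unseen patients. One possible alternative is to split first and oversample only the training portion. The sketch below uses a synthetic frame (`toy`, `row_id`, and all values are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the dataset (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "row_id": np.arange(100),
    "Age": rng.integers(20, 80, size=100),
    "Recurred": [1] * 28 + [0] * 72,
})

# 1) Split first, stratifying so both parts keep the original class ratio.
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["Recurred"], random_state=42
)

# 2) Oversample the minority class inside the training part only.
train_majority = train_df[train_df["Recurred"] == 0]
train_minority = train_df[train_df["Recurred"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up])

# 3) No original row appears in both the training and the test data.
overlap = set(train_balanced["row_id"]) & set(test_df["row_id"])
print(train_balanced["Recurred"].value_counts().to_dict())
print("rows shared between train and test:", len(overlap))
```

The test set keeps its natural class imbalance in this arrangement, so accuracy alone becomes less informative and precision, recall, and AUC matter more.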

## Column transformer

In [35]:


num_features = X.select_dtypes(exclude = 'object').columns
one_hot_columns = X.select_dtypes(include = 'object').columns

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop = 'first')

preprocessor = ColumnTransformer(
[
    ("Onehotencoder",oh_transformer,one_hot_columns),
    ("Standardscaler",numeric_transformer,num_features)
],remainder = 'passthrough')

In [36]:
preprocessor

In [37]:
X_train = preprocessor.fit_transform(X_train)

In [38]:
X_test = preprocessor.transform(X_test)
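
The fit-on-train, transform-on-test pattern above can also be bundled with the estimator in a scikit-learn `Pipeline`, so the same preprocessing is applied automatically at prediction time. This is a sketch on toy data, not the notebook's configuration: `toy_X`, `toy_y`, and `new_patient` are hypothetical, and the encoder uses `handle_unknown="ignore"` without `drop="first"` so that an unseen category is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the feature matrix and target (hypothetical values).
toy_X = pd.DataFrame({
    "Age": [27, 34, 30, 62, 62, 45, 51, 38],
    "Gender": ["F", "F", "M", "F", "M", "F", "M", "F"],
    "Risk": ["Low", "Low", "High", "Low", "High", "Intermediate", "High", "Low"],
})
toy_y = [0, 0, 1, 0, 1, 0, 1, 0]

num_cols = list(toy_X.select_dtypes(include="number").columns)
cat_cols = list(toy_X.select_dtypes(exclude="number").columns)

toy_preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", StandardScaler(), num_cols),
])

# Preprocessing and model are fitted together on the training data only.
clf = Pipeline([
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(toy_X, toy_y)

# A row with a Risk value never seen during fitting still goes through.
new_patient = pd.DataFrame({"Age": [40], "Gender": ["F"], "Risk": ["Very high"]})
pred = clf.predict(new_patient)
print(pred)
```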

## Model Building

In [39]:
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    # "Gradient Boost": GradientBoostingClassifier()
}


def print_metrics(split_name, y_true, y_pred):
    """Print accuracy, F1, precision, recall and ROC AUC for one data split."""
    print('Model performance for {} set'.format(split_name))
    print('- Accuracy: {:.4f}'.format(accuracy_score(y_true, y_pred)))
    # F1 is weighted over both classes; precision and recall are for class 1 only
    print('- F1 score: {:.4f}'.format(f1_score(y_true, y_pred, average='weighted')))
    print('- Precision: {:.4f}'.format(precision_score(y_true, y_pred)))
    print('- Recall: {:.4f}'.format(recall_score(y_true, y_pred)))
    # Computed from hard class predictions, so this equals balanced accuracy
    print('- Roc Auc Score: {:.4f}'.format(roc_auc_score(y_true, y_pred)))


for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    print(name)
    print_metrics('Training', y_train, model.predict(X_train))
    print('----------------------------------')
    print_metrics('Test', y_test, model.predict(X_test))

    print('=' * 35)
    print('\n')

Logistic Regression
Model performance for Training set
- Accuracy: 0.9442
- F1 score: 0.9442
- Precision: 0.9552
- Recall: 0.9320
- Roc Auc Score: 0.9442
----------------------------------
Model performance for Test set
- Accuracy: 0.9493
- F1 score: 0.9493
- Precision: 0.9697
- Recall: 0.9275
- Roc Auc Score: 0.9493


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.9928
- F1 score: 0.9928
- Precision: 1.0000
- Recall: 0.9855
- Roc Auc Score: 0.9928
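
The ROC AUC values above were computed from hard class predictions, and in that case the score reduces to balanced accuracy. A threshold-independent AUC needs continuous scores, which in the notebook would presumably come from `model.predict_proba(X_test)[:, 1]`. The toy arrays below (`y_true`, `y_prob`) are illustrative assumptions that show the difference:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.60, 0.45, 0.80, 0.90])

# Hard predictions at the usual 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_pred)   # AUC of thresholded labels
auc_from_scores = roc_auc_score(y_true, y_prob)   # AUC of the ranking itself
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("AUC from hard labels  : {:.4f}".format(auc_from_labels))
print("Balanced accuracy     : {:.4f}".format(bal_acc))
print("AUC from probabilities: {:.4f}".format(auc_from_scores))
```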




## Hyperparameter tuning

In [40]:
lr_params = {"penalty" :['l1', 'l2', 'elasticnet', None],
            "solver" : ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag' ,'saga'],
           }

In [41]:
grid = GridSearchCV(estimator = LogisticRegression(),param_grid = lr_params, scoring = 'accuracy',cv=3,verbose=2,n_jobs=-1)

In [42]:
grid.fit(X_train,y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [43]:
grid.best_params_

{'penalty': None, 'solver': 'saga'}

In [44]:
grid.best_score_

0.9563101660848408
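
Not every penalty and solver pair in the grid above is supported by `LogisticRegression`: for example, `lbfgs` and `newton-cg` do not accept `'l1'`, and `'elasticnet'` additionally requires an `l1_ratio`. With default settings, `GridSearchCV` records such failed fits as NaN scores instead of stopping. One way to avoid wasted fits is to pass a list of dictionaries, one per solver family. The sketch below runs on synthetic data from `make_classification`; `X_demo`, `y_demo`, `valid_lr_params`, and the `C` values are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# One dict per solver family, listing only penalties those solvers accept.
valid_lr_params = [
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": [0.1, 1.0]},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": [0.1, 1.0]},
]

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=5000),
    param_grid=valid_lr_params,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

# Count candidates whose fits failed (these would show up as NaN scores).
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_)
print("best CV accuracy: {:.4f}".format(search.best_score_))
print("failed candidates:", n_failed)
```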