BUSINESS PROBLEM
A healthcare provider aims to improve the early detection and diagnosis of thyroid cancer, thereby enhancing patient outcomes and reducing healthcare costs. By leveraging patient data, including demographics, medical history, lifestyle factors, and thyroid hormone levels, the provider seeks to develop predictive models and risk stratification tools that aid clinicians in making more informed decisions regarding patient management and follow-up.

OBJECTIVES
1. Develop and validate a machine learning model that accurately predicts the risk of thyroid cancer (Diagnosis: Benign/Malignant) based on patient data.
2. Identify Key Risk Factors: Determine the most influential factors (e.g., age, family history, TSH levels) associated with thyroid cancer risk.
3. Improve Risk Stratification: Enhance the current risk stratification process (Thyroid_Cancer_Risk: Low/Medium/High) by developing a more precise and data-driven approach.


DATA UNDERSTANDING
Patient_ID (int): Unique identifier for each patient.
Age (int): Age of the patient.
Gender (object): Patient’s gender (Male/Female).
Country (object): Country of residence.
Ethnicity (object): Patient’s ethnic background.
Family_History (object): Whether the patient has a family history of thyroid cancer (Yes/No).
Radiation_Exposure (object): History of radiation exposure (Yes/No).
Iodine_Deficiency (object): Presence of iodine deficiency (Yes/No).
Smoking (object): Whether the patient smokes (Yes/No).
Obesity (object): Whether the patient is obese (Yes/No).
Diabetes (object): Whether the patient has diabetes (Yes/No).
TSH_Level (float): Thyroid-Stimulating Hormone level (µIU/mL).
T3_Level (float): Triiodothyronine level (ng/dL).
T4_Level (float): Thyroxine level (µg/dL).
Nodule_Size (float): Size of thyroid nodules (cm).
Thyroid_Cancer_Risk (object): Estimated risk of thyroid cancer (Low/Medium/High).
Diagnosis (object): Final diagnosis (Benign/Malignant).
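
The column descriptions above can be turned into a quick sanity check once the data is loaded. The sketch below is only an illustration and not part of the original analysis: `EXPECTED_KINDS`, `KIND_CHECKS`, and `check_schema` are hypothetical names, and the two sample rows are the first two rows of the dataset as displayed by `df.head()`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical mapping derived from the data dictionary: column -> broad kind.
EXPECTED_KINDS = {
    "Patient_ID": "int", "Age": "int",
    "Gender": "text", "Country": "text", "Ethnicity": "text",
    "Family_History": "text", "Radiation_Exposure": "text",
    "Iodine_Deficiency": "text", "Smoking": "text",
    "Obesity": "text", "Diabetes": "text",
    "TSH_Level": "float", "T3_Level": "float",
    "T4_Level": "float", "Nodule_Size": "float",
    "Thyroid_Cancer_Risk": "text", "Diagnosis": "text",
}

# Each kind is verified with a pandas dtype predicate.
KIND_CHECKS = {
    "int": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    "text": ptypes.is_string_dtype,
}


def check_schema(df, expected=EXPECTED_KINDS):
    """Return a list of schema problems; an empty list means the frame matches."""
    problems = []
    for col, kind in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not KIND_CHECKS[kind](df[col]):
            problems.append(f"{col}: expected {kind}, found {df[col].dtype}")
    return problems


# First two rows of the dataset, used here only to exercise the check.
sample = pd.DataFrame(
    [
        [1, 66, "Male", "Russia", "Caucasian", "No", "Yes", "No", "No", "No", "No",
         9.37, 1.67, 6.16, 1.08, "Low", "Benign"],
        [2, 29, "Male", "Germany", "Hispanic", "No", "Yes", "No", "No", "No", "No",
         1.83, 1.73, 10.54, 4.05, "Low", "Benign"],
    ],
    columns=list(EXPECTED_KINDS),
)

print(check_schema(sample))
```

On the full dataset the same call would be `check_schema(df)`. Checking broad kinds rather than exact dtype names keeps the check tolerant of pandas versions that store text columns differently.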

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# loading the dataset
df = pd.read_csv('thyroid_cancer_risk_data.csv')
# checking the first 5 rows
df.head()

Unnamed: 0,Patient_ID,Age,Gender,Country,Ethnicity,Family_History,Radiation_Exposure,Iodine_Deficiency,Smoking,Obesity,Diabetes,TSH_Level,T3_Level,T4_Level,Nodule_Size,Thyroid_Cancer_Risk,Diagnosis
0,1,66,Male,Russia,Caucasian,No,Yes,No,No,No,No,9.37,1.67,6.16,1.08,Low,Benign
1,2,29,Male,Germany,Hispanic,No,Yes,No,No,No,No,1.83,1.73,10.54,4.05,Low,Benign
2,3,86,Male,Nigeria,Caucasian,No,No,No,No,No,No,6.26,2.59,10.57,4.61,Low,Benign
3,4,75,Female,India,Asian,No,No,No,No,No,No,4.1,2.62,11.04,2.46,Medium,Benign
4,5,35,Female,Germany,African,Yes,Yes,No,No,No,No,9.1,2.11,10.71,2.11,High,Benign


In [4]:
# checking the last 5 rows
df.tail()

Unnamed: 0,Patient_ID,Age,Gender,Country,Ethnicity,Family_History,Radiation_Exposure,Iodine_Deficiency,Smoking,Obesity,Diabetes,TSH_Level,T3_Level,T4_Level,Nodule_Size,Thyroid_Cancer_Risk,Diagnosis
212686,212687,58,Female,India,Asian,No,No,No,No,Yes,No,2.0,0.64,11.92,1.48,Low,Benign
212687,212688,89,Male,Japan,Middle Eastern,No,No,No,No,Yes,No,9.77,3.25,7.3,4.46,Medium,Benign
212688,212689,72,Female,Nigeria,Hispanic,No,No,No,No,No,Yes,7.72,2.44,8.71,2.36,Medium,Benign
212689,212690,85,Female,Brazil,Middle Eastern,No,No,No,No,No,Yes,5.62,2.53,9.62,1.54,Medium,Benign
212690,212691,46,Female,Japan,Middle Eastern,No,No,No,Yes,No,No,5.6,2.73,10.59,2.53,Low,Malignant


In [5]:
# checking data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212691 entries, 0 to 212690
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Patient_ID           212691 non-null  int64  
 1   Age                  212691 non-null  int64  
 2   Gender               212691 non-null  object 
 3   Country              212691 non-null  object 
 4   Ethnicity            212691 non-null  object 
 5   Family_History       212691 non-null  object 
 6   Radiation_Exposure   212691 non-null  object 
 7   Iodine_Deficiency    212691 non-null  object 
 8   Smoking              212691 non-null  object 
 9   Obesity              212691 non-null  object 
 10  Diabetes             212691 non-null  object 
 11  TSH_Level            212691 non-null  float64
 12  T3_Level             212691 non-null  float64
 13  T4_Level             212691 non-null  float64
 14  Nodule_Size          212691 non-null  float64
 15  Thyroid_Cancer_Ri

In [6]:
# summary statistics for the numeric columns
df.describe()

Unnamed: 0,Patient_ID,Age,TSH_Level,T3_Level,T4_Level,Nodule_Size
count,212691.0,212691.0,212691.0,212691.0,212691.0,212691.0
mean,106346.0,51.918497,5.045102,2.001727,8.246204,2.503403
std,61398.74739,21.632815,2.860264,0.866248,2.164188,1.444631
min,1.0,15.0,0.1,0.5,4.5,0.0
25%,53173.5,33.0,2.57,1.25,6.37,1.25
50%,106346.0,52.0,5.04,2.0,8.24,2.51
75%,159518.5,71.0,7.52,2.75,10.12,3.76
max,212691.0,89.0,10.0,3.5,12.0,5.0
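
`describe()` summarises only the numeric columns, so the two outcome columns (Thyroid_Cancer_Risk and Diagnosis) do not appear in it. A minimal sketch of how their distribution could be inspected follows. It uses only the ten rows shown by `df.head()` and `df.tail()`, so the shares it prints say nothing about the full dataset, and `category_shares` is a hypothetical helper name.

```python
import pandas as pd

# Outcome columns from the ten rows displayed by df.head() and df.tail().
sample = pd.DataFrame({
    "Thyroid_Cancer_Risk": ["Low", "Low", "Low", "Medium", "High",
                            "Low", "Medium", "Medium", "Medium", "Low"],
    "Diagnosis": ["Benign"] * 9 + ["Malignant"],
})


def category_shares(df, column):
    """Return the share of rows falling in each category of `column`."""
    return df[column].value_counts(normalize=True)


# Class balance of the target, and how it relates to the existing risk label.
diagnosis_shares = category_shares(sample, "Diagnosis")
risk_by_diagnosis = pd.crosstab(sample["Thyroid_Cancer_Risk"], sample["Diagnosis"])

print(diagnosis_shares)
print(risk_by_diagnosis)
```

On the loaded data the equivalent calls would be `category_shares(df, "Diagnosis")` and `pd.crosstab(df["Thyroid_Cancer_Risk"], df["Diagnosis"])`. Knowing the class balance early matters because accuracy is easy to misread when one diagnosis dominates.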


DATA PREPARATION
DATA CLEANING

In [7]:
# Checking missing values 
df.isna().sum()

Patient_ID             0
Age                    0
Gender                 0
Country                0
Ethnicity              0
Family_History         0
Radiation_Exposure     0
Iodine_Deficiency      0
Smoking                0
Obesity                0
Diabetes               0
TSH_Level              0
T3_Level               0
T4_Level               0
Nodule_Size            0
Thyroid_Cancer_Risk    0
Diagnosis              0
dtype: int64

In [8]:
# there are no missing values in this data set
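
`isna()` only detects true nulls; a placeholder string such as "unknown" in a categorical column would pass unnoticed. The sketch below shows one possible complementary check against the categories listed in the data dictionary. `ALLOWED_VALUES` and `unexpected_values` are hypothetical names, and the toy frame (including its "unknown" entry) is made up purely for illustration.

```python
import pandas as pd

# Allowed categories taken from the data dictionary in this notebook.
YES_NO = {"Yes", "No"}
ALLOWED_VALUES = {
    "Gender": {"Male", "Female"},
    "Family_History": YES_NO,
    "Radiation_Exposure": YES_NO,
    "Iodine_Deficiency": YES_NO,
    "Smoking": YES_NO,
    "Obesity": YES_NO,
    "Diabetes": YES_NO,
    "Thyroid_Cancer_Risk": {"Low", "Medium", "High"},
    "Diagnosis": {"Benign", "Malignant"},
}


def unexpected_values(df, allowed=ALLOWED_VALUES):
    """Map each checked column to the set of values outside its allowed set."""
    found = {}
    for col, ok in allowed.items():
        if col in df.columns:
            extra = set(df[col].dropna().unique()) - ok
            if extra:
                found[col] = extra
    return found


# Made-up toy data: "unknown" stands in for a disguised missing value.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "unknown"],
    "Smoking": ["No", "Yes", "No"],
    "Diagnosis": ["Benign", "Benign", "Malignant"],
})

issues = unexpected_values(toy)
print(issues)
```

Running `unexpected_values(df)` on the loaded data and getting an empty dictionary would support the conclusion that no values are missing in disguise.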

In [9]:
# checking duplicates
df.duplicated().sum()

0

In [10]:
# there are no duplicates in this data set.
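
Because every row carries its own Patient_ID, `df.duplicated()` can only flag rows that repeat the ID as well. A record entered twice under different IDs would not be counted. The sketch below illustrates a slightly broader check; `duplicate_report` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def duplicate_report(df, id_col="Patient_ID"):
    """Count three kinds of duplication in a patient table."""
    feature_cols = [c for c in df.columns if c != id_col]
    return {
        # Rows identical in every column, ID included.
        "full_row": int(df.duplicated().sum()),
        # The same ID appearing more than once.
        "repeated_id": int(df[id_col].duplicated().sum()),
        # Identical feature values recorded under a different ID.
        "same_features_different_id": int(df.duplicated(subset=feature_cols).sum()),
    }


# Made-up toy data: the first and third rows match on everything except the ID.
toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Age": [66, 29, 66],
    "Smoking": ["No", "No", "No"],
})

report = duplicate_report(toy)
print(report)
```

A non-zero `same_features_different_id` count is not proof of an error, since two patients can share the same values by chance, but it marks rows worth a closer look.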

In [11]:
# feature engineering: listing the available columns first
df.columns

Index(['Patient_ID', 'Age', 'Gender', 'Country', 'Ethnicity', 'Family_History',
       'Radiation_Exposure', 'Iodine_Deficiency', 'Smoking', 'Obesity',
       'Diabetes', 'TSH_Level', 'T3_Level', 'T4_Level', 'Nodule_Size',
       'Thyroid_Cancer_Risk', 'Diagnosis'],
      dtype='object')

In [13]:
# Lifestyle risk evaluation: count how many of the three lifestyle risk factors each patient has.
# The columns hold 'Yes'/'No' strings, so adding them directly concatenates text ('NoNoNo');
# compare with 'Yes' first so the sum is numeric.
df['lifestyle_score'] = df[['Diabetes', 'Obesity', 'Smoking']].eq('Yes').sum(axis=1)
df.head()

Unnamed: 0,Patient_ID,Age,Gender,Country,Ethnicity,Family_History,Radiation_Exposure,Iodine_Deficiency,Smoking,Obesity,Diabetes,TSH_Level,T3_Level,T4_Level,Nodule_Size,Thyroid_Cancer_Risk,Diagnosis,lifestyle_score
0,1,66,Male,Russia,Caucasian,No,Yes,No,No,No,No,9.37,1.67,6.16,1.08,Low,Benign,0
1,2,29,Male,Germany,Hispanic,No,Yes,No,No,No,No,1.83,1.73,10.54,4.05,Low,Benign,0
2,3,86,Male,Nigeria,Caucasian,No,No,No,No,No,No,6.26,2.59,10.57,4.61,Low,Benign,0
3,4,75,Female,India,Asian,No,No,No,No,No,No,4.1,2.62,11.04,2.46,Medium,Benign,0
4,5,35,Female,Germany,African,Yes,Yes,No,No,No,No,9.1,2.11,10.71,2.11,High,Benign,0


In [14]:
# nodule size categorization
df['Nodule_Size'].unique()

array([1.08, 4.05, 4.61, 2.46, 2.11, 0.02, 0.01, 4.3 , 0.81, 1.44, 0.35,
       3.87, 4.15, 0.38, 1.68, 2.86, 0.25, 4.93, 1.63, 2.27, 2.41, 0.46,
       4.79, 3.63, 4.64, 4.22, 1.54, 4.26, 4.  , 3.06, 0.04, 3.4 , 0.17,
       0.09, 2.68, 3.9 , 3.76, 3.65, 1.14, 3.56, 3.57, 3.15, 4.21, 1.66,
       0.71, 4.27, 0.86, 2.7 , 1.25, 2.18, 1.09, 2.54, 4.99, 1.57, 1.64,
       1.75, 2.59, 2.55, 2.83, 1.06, 3.25, 2.32, 0.26, 2.01, 2.04, 1.76,
       0.72, 4.49, 0.42, 4.88, 0.24, 3.29, 2.5 , 0.84, 3.5 , 4.23, 3.3 ,
       3.01, 3.31, 1.3 , 0.77, 2.19, 4.47, 4.54, 3.36, 3.17, 3.91, 2.78,
       0.73, 4.77, 3.59, 0.37, 2.03, 0.19, 2.71, 1.17, 4.08, 2.77, 2.6 ,
       4.78, 1.59, 2.05, 4.01, 4.67, 0.44, 0.95, 3.81, 0.29, 1.11, 3.93,
       2.75, 0.83, 1.39, 1.02, 0.47, 4.04, 3.53, 4.12, 4.83, 3.69, 1.38,
       1.33, 4.81, 4.39, 0.65, 1.47, 0.18, 1.23, 1.05, 0.3 , 2.48, 0.92,
       2.69, 2.22, 0.66, 2.02, 0.15, 4.96, 3.09, 4.06, 1.22, 3.54, 0.23,
       0.68, 3.08, 1.37, 3.64, 1.1 , 0.27, 1.95, 4.