# European Cancer Data – Data Analytics Project
# Objectives

* Load and preprocess the European Cancer dataset.  
* Perform exploratory data analysis (EDA) to understand data distribution and relationships.  
* Identify key risk factors associated with cancer survival outcomes.  
* Store cleaned and processed data for further analysis and model training.  
* Develop data visualizations to support insights.  

# Inputs

* Dataset: cancer_patient_data.csv (https://www.kaggle.com/datasets/ak0212/european-cancer-patient-dataset)
* Required Libraries: Pandas, NumPy, Matplotlib, Seaborn, SciPy, Scikit-learn, Plotly
* Columns of Interest:  
  * Demographics: Age, Gender  
  * Target Features: Survival Status, Survival Duration (Months)  
  * Significant Features: Cancer Type, Cancer Stage  
  * Lifestyle Indicators: Smoking Status, Alcohol Consumption, BMI, Socioeconomic Status, Urban vs. Rural, Comorbidities, Quality of Life Score  
  * Healthcare Indicators: Healthcare System, Follow-up Visits, Recurrence, Clinical Trial Participation  
  * Other Indicators: Treatment_Delay_Category, Severity_Index, Access_Risk, Proxy_Comorbidity_Score  

# Outputs

* Cleaned dataset: Processed dataset stored as a CSV file for analysis (cancer_patient_data_cleaned.csv).  
* Exploratory Data Analysis (EDA):  
  * Distribution of the target variables across the other features.  
  * Distribution of features such as Age, Gender, Cancer Type/Stage, BMI, Smoking Status and Alcohol Consumption.  
  * Identification of outliers in numerical data.  
  * Comparison of categorical variables with the target variable.  
* Feature-engineered dataset: Enhanced dataset with new derived features.  
* Insights & Summary Reports: Key findings documented for further decision-making.  

# Additional Comments

Ensure proper handling of missing, duplicated and outlier values to maintain data integrity.

# Changing work directory  
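The notebook lives in the jupyter_notebooks folder, so relative paths such as data/raw only resolve once the working directory is the parent folder. To run the notebook in the editor, we therefore change the working directory to the parent, starting by reading the current directory with os.getcwd().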

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/jacobambat/dev/European_Cancer_Data/jupyter_notebooks'

Then we make the parent of the current directory the new current directory by using:

* os.path.dirname() to get the parent directory
* os.chdir() to define the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirming the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/jacobambat/dev/European_Cancer_Data'
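Re-running the os.chdir cell above moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is always named jupyter_notebooks as in the path printed earlier; project_root is a hypothetical helper, not part of the notebook:

```python
import os

def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent folder only when current_dir is the notebooks folder.

    Both the helper name and the default folder name are assumptions
    made for this illustration.
    """
    if os.path.basename(current_dir) == notebooks_folder:
        return os.path.dirname(current_dir)
    # Already at the project root (or somewhere else): leave it alone.
    return current_dir

from_notebooks = project_root("/Users/jacobambat/dev/European_Cancer_Data/jupyter_notebooks")
from_root = project_root("/Users/jacobambat/dev/European_Cancer_Data")
print(from_notebooks)
print(from_root)
```

Calling os.chdir(project_root(os.getcwd())) would then be safe to run more than once.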

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
!chmod 600 kaggle.json

In [6]:
KaggleDatasetPath = "ak0212/european-cancer-patient-dataset"
DestinationFolder = "data/raw"

In [7]:
print(DestinationFolder)

data/raw


In [8]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/ak0212/european-cancer-patient-dataset
License(s): MIT
Downloading european-cancer-patient-dataset.zip to data/raw
  0%|                                                | 0.00/344k [00:00<?, ?B/s]
100%|█████████████████████████████████████████| 344k/344k [00:00<00:00, 433MB/s]


In [None]:
import zipfile
import glob
import pandas as pd

In [9]:
zip_files = glob.glob(os.path.join(DestinationFolder, "*.zip"))
print(zip_files)

for zip_file in zip_files:
    print(zip_file)  # Show which archive is being extracted
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)  # Extract into the destination folder
    os.remove(zip_file)  # Remove the ZIP file after extraction

print("Extraction complete.")

['data/raw/european-cancer-patient-dataset.zip']
data/raw/european-cancer-patient-dataset.zip
Extraction complete.
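The extract-then-delete step can be tried safely on a throwaway folder before pointing it at data/raw. This is a sketch only: extract_and_remove_zips is a hypothetical helper name, and demo.zip / demo.csv are made-up files created in a temporary directory.

```python
import glob
import os
import tempfile
import zipfile

def extract_and_remove_zips(folder):
    """Extract every *.zip in folder into that folder, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        os.remove(zip_path)  # only reached if extraction did not raise
    return extracted

# Demonstration on a temporary folder with a tiny made-up archive.
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(os.path.join(tmp, "demo.zip"), "w") as zf:
        zf.writestr("demo.csv", "a,b\n1,2\n")
    names = extract_and_remove_zips(tmp)
    remaining = sorted(os.listdir(tmp))

print(names, remaining)
```

Because os.remove sits after the with block, a failed extraction raises before the archive is deleted.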


In [10]:
df = pd.read_csv(f'{DestinationFolder}/cancer_patient_data.csv')
df

Unnamed: 0,Patient ID,Country,Region,Age,Gender,Cancer Type,Cancer Stage,Diagnosis Date,Treatment Start Date,Treatment End Date,...,Alcohol Consumption,BMI,Socioeconomic Status,Urban vs. Rural,Healthcare System,Follow-up Visits,Recurrence,Clinical Trial Participation,Comorbidities,Quality of Life Score
0,PT00001,Finland,Southwest Finland,89.0,Male,Breast,III,2015-08-10,2015-09-20,2016-09-07,...,Moderate,39.3,Medium,Urban,NHS,12.0,Yes,Yes,Diabetes,5.0
1,PT00002,Belgium,Flanders,49.0,Female,Prostate,IV,2010-01-27,2010-06-29,2011-02-09,...,Moderate,21.6,High,Rural,Private Insurance,10.0,,Yes,Diabetes,10.0
2,PT00003,Poland,Silesian,42.0,Male,Lung,III,2016-08-25,2016-10-23,2016-12-27,...,,23.7,Low,,Statutory Health Insurance,5.0,No,No,Obesity,1.0
3,PT00004,Ireland,Dublin,51.0,Female,Prostate,III,,,2011-02-28,...,Moderate,34.0,Medium,Rural,Private Insurance,4.0,Yes,No,Diabetes,5.0
4,PT00005,,Sicily,76.0,Female,Lung,II,2018-05-26,,2019-04-20,...,,33.9,Medium,Urban,Statutory Health Insurance,5.0,No,No,Hypertension,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,PT09996,Austria,Styria,45.0,,Ovarian,II,2023-06-16,2023-10-10,2024-01-16,...,Moderate,38.2,Medium,Urban,NHS,11.0,No,No,,3.0
9996,PT09997,Netherlands,North Holland,22.0,,Leukemia,II,2022-11-24,2023-05-13,2023-12-22,...,,25.9,High,Rural,NHS,,No,No,Cardiovascular Disease,3.0
9997,PT09998,Norway,Oslo,21.0,,Pancreatic,II,2011-06-24,2011-10-18,2012-07-19,...,Moderate,36.8,High,Urban,NHS,17.0,Yes,Yes,Diabetes,10.0
9998,PT09999,Ireland,Dublin,21.0,Male,Pancreatic,I,2014-12-27,2015-02-10,2016-01-09,...,Moderate,18.8,Low,Rural,Private Insurance,3.0,Yes,No,Obesity,7.0


# ETL

In [11]:
#EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Patient ID                    9500 non-null   object 
 1   Country                       9500 non-null   object 
 2   Region                        9500 non-null   object 
 3   Age                           9500 non-null   float64
 4   Gender                        9500 non-null   object 
 5   Cancer Type                   9500 non-null   object 
 6   Cancer Stage                  9500 non-null   object 
 7   Diagnosis Date                9500 non-null   object 
 8   Treatment Start Date          9500 non-null   object 
 9   Treatment End Date            9500 non-null   object 
 10  Treatment Type                9500 non-null   object 
 11  Hospital Type                 9500 non-null   object 
 12  Survival Status               9500 non-null   object 
 13  Su

In [12]:
#get the general overview of dataset with head
df.head()

Unnamed: 0,Patient ID,Country,Region,Age,Gender,Cancer Type,Cancer Stage,Diagnosis Date,Treatment Start Date,Treatment End Date,...,Alcohol Consumption,BMI,Socioeconomic Status,Urban vs. Rural,Healthcare System,Follow-up Visits,Recurrence,Clinical Trial Participation,Comorbidities,Quality of Life Score
0,PT00001,Finland,Southwest Finland,89.0,Male,Breast,III,2015-08-10,2015-09-20,2016-09-07,...,Moderate,39.3,Medium,Urban,NHS,12.0,Yes,Yes,Diabetes,5.0
1,PT00002,Belgium,Flanders,49.0,Female,Prostate,IV,2010-01-27,2010-06-29,2011-02-09,...,Moderate,21.6,High,Rural,Private Insurance,10.0,,Yes,Diabetes,10.0
2,PT00003,Poland,Silesian,42.0,Male,Lung,III,2016-08-25,2016-10-23,2016-12-27,...,,23.7,Low,,Statutory Health Insurance,5.0,No,No,Obesity,1.0
3,PT00004,Ireland,Dublin,51.0,Female,Prostate,III,,,2011-02-28,...,Moderate,34.0,Medium,Rural,Private Insurance,4.0,Yes,No,Diabetes,5.0
4,PT00005,,Sicily,76.0,Female,Lung,II,2018-05-26,,2019-04-20,...,,33.9,Medium,Urban,Statutory Health Insurance,5.0,No,No,Hypertension,6.0


In [13]:
#get list of column names in dataset
df.columns.tolist()

['Patient ID',
 'Country',
 'Region',
 'Age',
 'Gender',
 'Cancer Type',
 'Cancer Stage',
 'Diagnosis Date',
 'Treatment Start Date',
 'Treatment End Date',
 'Treatment Type',
 'Hospital Type',
 'Survival Status',
 'Survival Duration (Months)',
 'Genetic Markers',
 'Family History',
 'Smoking Status',
 'Alcohol Consumption',
 'BMI',
 'Socioeconomic Status',
 'Urban vs. Rural',
 'Healthcare System',
 'Follow-up Visits',
 'Recurrence',
 'Clinical Trial Participation',
 'Comorbidities',
 'Quality of Life Score']

In [14]:
#checking the missing values
df.isnull().sum()


Patient ID                       500
Country                          500
Region                           500
Age                              500
Gender                           500
Cancer Type                      500
Cancer Stage                     500
Diagnosis Date                   500
Treatment Start Date             500
Treatment End Date               500
Treatment Type                   500
Hospital Type                    500
Survival Status                  500
Survival Duration (Months)       969
Genetic Markers                 2815
Family History                   500
Smoking Status                   500
Alcohol Consumption             4299
BMI                              500
Socioeconomic Status             500
Urban vs. Rural                  500
Healthcare System                500
Follow-up Visits                 500
Recurrence                       500
Clinical Trial Participation     500
Comorbidities                   2463
Quality of Life Score            500
d
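Raw counts are easier to judge next to the share of rows they represent. A small sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: the column names follow the dataset, the values are invented.
toy = pd.DataFrame({
    "Age": [89.0, np.nan, 42.0, 51.0],
    "Alcohol Consumption": ["Moderate", None, None, "Moderate"],
})

is_missing = toy.isnull()
summary = pd.DataFrame({
    "Missing": is_missing.sum(),                    # count of NaN per column
    "Percent": (is_missing.mean() * 100).round(1),  # share of rows that are NaN
})
print(summary)
```

Applied to the notebook's df, the same two lines would put, for example, the 4299 missing Alcohol Consumption values in proportion to the 10000 rows.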

In [15]:
#drop the rows with NaN in any of the required columns
required_columns = [
    'Patient ID', 'Country', 'Age', 'Gender', 'Cancer Type', 'Cancer Stage',
    'Diagnosis Date', 'Treatment Start Date', 'Treatment End Date',
    'Treatment Type', 'Urban vs. Rural', 'Healthcare System', 'Recurrence',
    'Region', 'Hospital Type', 'Socioeconomic Status', 'Follow-up Visits',
    'Quality of Life Score', 'Survival Duration (Months)',
]
df = df.dropna(axis=0, subset=required_columns)

In [16]:
# Replace NaN with 'Unknown' in the categorical columns (fillna does not touch empty strings)
df['Survival Status'] = df['Survival Status'].fillna('Unknown')
df['Genetic Markers'] = df['Genetic Markers'].fillna('Unknown')
df['Family History'] = df['Family History'].fillna('Unknown')
df['Smoking Status'] = df['Smoking Status'].fillna('Unknown')
df['Alcohol Consumption'] = df['Alcohol Consumption'].fillna('Unknown')
df['Clinical Trial Participation'] = df['Clinical Trial Participation'].fillna('Unknown')
df['Comorbidities'] = df['Comorbidities'].fillna('Unknown')

In [17]:
#Replace NaN in the BMI column with the median value
df['BMI'] = df['BMI'].fillna(df['BMI'].median())
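The three cleaning steps above (drop rows missing a required field, label missing categories as 'Unknown', fill numeric gaps with the median) can be checked on a toy frame. The values below are invented, and the extra replace("", np.nan) step is an assumption about how empty strings could be handled; the notebook itself only calls fillna.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Patient ID": ["PT00001", "PT00002", None, "PT00004"],
    "Smoking Status": ["Never", None, "Former", ""],
    "BMI": [20.0, np.nan, 30.0, 40.0],
})

# 1. Drop rows with no identifier; copy so later assignments are unambiguous.
cleaned = toy.dropna(subset=["Patient ID"]).copy()

# 2. Treat empty strings as missing, then label all missing categories.
cleaned["Smoking Status"] = (
    cleaned["Smoking Status"].replace("", np.nan).fillna("Unknown")
)

# 3. Fill numeric gaps with the median of the remaining rows.
cleaned["BMI"] = cleaned["BMI"].fillna(cleaned["BMI"].median())
print(cleaned)
```

Note that the median is computed after the rows are dropped, so the order of the steps affects the fill value.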

In [18]:
#convert the 'Quality of Life Score' column to int
df['Quality of Life Score'] = df['Quality of Life Score'].astype(int)

In [19]:
#convert the 'Age' column to int
df['Age'] = df['Age'].astype(int)

In [20]:
# Remove the 'PT' prefix from the 'Patient ID' column and convert the values to int
df['Patient ID'] = df['Patient ID'].str.replace('PT', '').astype(int)
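A quick check of the ID conversion on three identifiers in the dataset's PTnnnnn format. Passing regex=False is an optional addition here (not in the notebook) that makes the literal match explicit:

```python
import pandas as pd

ids = pd.Series(["PT00001", "PT00006", "PT09999"])

# Strip the literal 'PT' prefix; int() then discards the leading zeros.
numeric_ids = ids.str.replace("PT", "", regex=False).astype(int)
print(numeric_ids.tolist())
```

The conversion would raise a ValueError if any identifier contained other non-digit characters, which is a useful early warning for malformed IDs.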

In [21]:
#check for duplicate rows
duplicate_check= df.duplicated().any()
print('There are duplicates:', duplicate_check)

There are duplicates: False
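df.duplicated() only flags rows that match on every column. A sketch of the difference between that and a key-based check, using invented rows where one patient ID appears twice with different ages:

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient ID": [1, 2, 2],
    "Age": [50, 60, 61],
})

# Whole-row comparison: no two rows are identical.
full_row_duplicates = bool(toy.duplicated().any())

# Key-based comparison: patient 2 appears twice.
id_duplicates = bool(toy.duplicated(subset=["Patient ID"]).any())

print(full_row_duplicates, id_duplicates)
```

Whether a repeated Patient ID counts as a duplicate depends on the dataset; the sketch only shows how to test for it.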


In [22]:
#checking for NaN or empty values
df.dropna(axis=1, how='all')

Unnamed: 0,Patient ID,Country,Region,Age,Gender,Cancer Type,Cancer Stage,Diagnosis Date,Treatment Start Date,Treatment End Date,...,Alcohol Consumption,BMI,Socioeconomic Status,Urban vs. Rural,Healthcare System,Follow-up Visits,Recurrence,Clinical Trial Participation,Comorbidities,Quality of Life Score
0,1,Finland,Southwest Finland,89,Male,Breast,III,2015-08-10,2015-09-20,2016-09-07,...,Moderate,39.3,Medium,Urban,NHS,12.0,Yes,Yes,Diabetes,5
5,6,Sweden,Västra Götaland,28,Male,Pancreatic,IV,2019-03-06,2019-05-16,2020-03-23,...,Unknown,23.2,Medium,Urban,Private Insurance,15.0,No,No,Obesity,6
9,10,Spain,Andalusia,77,Female,Ovarian,I,2013-07-22,2013-11-08,2014-11-04,...,Unknown,38.3,Medium,Rural,Statutory Health Insurance,7.0,No,No,Hypertension,7
10,11,Spain,Andalusia,22,Female,Pancreatic,II,2012-12-27,2013-06-06,2013-12-27,...,Heavy,21.9,Medium,Urban,NHS,18.0,No,No,Unknown,2
15,16,Germany,North Rhine-Westphalia,53,Male,Breast,I,2012-07-24,2012-08-14,2013-03-14,...,Unknown,37.5,Medium,Urban,Private Insurance,12.0,Yes,No,Hypertension,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,Portugal,Lisbon,71,Female,Breast,IV,2017-10-25,2018-02-15,2018-09-29,...,Unknown,29.9,Medium,Urban,Statutory Health Insurance,18.0,No,No,Unknown,8
9992,9993,Norway,Trøndelag,89,Female,Breast,I,2020-11-12,2020-12-01,2021-01-01,...,Unknown,20.8,Low,Rural,NHS,16.0,Yes,No,Unknown,1
9993,9994,Denmark,Zealand,43,Male,Colorectal,I,2010-05-30,2010-08-25,2011-06-07,...,Moderate,20.1,High,Rural,NHS,12.0,No,No,Obesity,1
9994,9995,Belgium,Wallonia,36,Male,Breast,III,2013-02-06,2013-03-05,2013-12-23,...,Moderate,22.5,Medium,Urban,Private Insurance,10.0,No,No,Hypertension,1


In [23]:
# reset the index
df.reset_index(drop=True, inplace=True)

In [32]:
#checking for unique values
unique_counts = df.nunique()
unique_table = pd.DataFrame({'Column': unique_counts.index, 'Unique Values': unique_counts.values})
unique_table

Unnamed: 0,Column,Unique Values
0,Patient ID,3627
1,Country,15
2,Region,45
3,Age,70
4,Gender,3
5,Cancer Type,7
6,Cancer Stage,4
7,Diagnosis Date,2589
8,Treatment Start Date,2616
9,Treatment End Date,2626


In [39]:
# convert BMI to float
df['BMI']=df['BMI'].astype(float)

In [29]:
# Get the unique values of 'Survival Status' column
unique_values = df['Survival Status'].unique()
print(unique_values)
# Encoding the data for analysis
df['IsSurvivalStatus'] = df['Survival Status'].map({'Alive': 1, 'Deceased': 0, 'Unknown': 3})
df['Cancer Stage'] = df['Cancer Stage'].astype(str)
df['Age Group'] = pd.cut(df['Age'], bins=[0, 30, 45, 60, 75, 100], labels=["<30", "30-45", "45-60", "60-75", "75+"])

['Alive' 'Deceased' 'Unknown']
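pd.cut closes each bin on the right by default, so an age of exactly 30 falls in the "<30" group and 45 in "30-45". The sketch below contrasts that with right=False on a few boundary ages (the ages are chosen for illustration):

```python
import pandas as pd

ages = pd.Series([30, 31, 45, 89])
bins = [0, 30, 45, 60, 75, 100]
labels = ["<30", "30-45", "45-60", "60-75", "75+"]

# Default: intervals are (0, 30], (30, 45], ... as in the notebook cell.
right_closed = pd.cut(ages, bins=bins, labels=labels).astype(str).tolist()

# Alternative: intervals are [0, 30), [30, 45), ...
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str).tolist()

print(right_closed)
print(left_closed)
```

Which convention is appropriate is a judgment call; the point is that the labels only read literally under right=False.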


In [31]:
df['Diagnosis Date'] = pd.to_datetime(df['Diagnosis Date'], errors='coerce')
df['Treatment Start Date'] = pd.to_datetime(df['Treatment Start Date'], errors='coerce')
df['Time to Treatment (Days)'] = (df['Treatment Start Date'] - df['Diagnosis Date']).dt.days
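With errors='coerce', an unparseable date becomes NaT and the derived delay becomes NaN rather than raising. A sketch on two invented rows, one of which holds a deliberately invalid date:

```python
import pandas as pd

dates = pd.DataFrame({
    "Diagnosis Date": ["2015-08-10", "not a date"],
    "Treatment Start Date": ["2015-09-20", "2016-01-01"],
})

diagnosis = pd.to_datetime(dates["Diagnosis Date"], errors="coerce")
start = pd.to_datetime(dates["Treatment Start Date"], errors="coerce")

# Timedelta -> whole days; NaT on either side propagates as NaN.
delay_days = (start - diagnosis).dt.days
print(delay_days)
```

Counting delay_days.isna() after the conversion is a cheap way to see how many dates failed to parse.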

In [34]:
# Get the unique values of 'Cancer Stage' column
unique_values = df['Cancer Stage'].unique()
print(unique_values)
# Encode the 'Cancer Stage' column as 1, 2, 3, 4
df['Cancer Stage'] = df['Cancer Stage'].map({'I': 1, 'II': 2, 'III': 3, 'IV': 4})

print(df)

['III' 'IV' 'I' 'II']
      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  Dublin   21    Male  Pancreatic   

      Cancer Stage Diagnosis 

In [35]:
# Find the minimum and maximum 'Survival Duration (Months)'
min_SDM = df['Survival Duration (Months)'].min()
max_SDM = df['Survival Duration (Months)'].max()

print('Minimum Survival Duration (Months):', min_SDM)
print('Maximum Survival Duration (Months):', max_SDM)

# Bin 'Survival Duration (Months)' into categories with apply and a lambda.
# The cut-offs are 24, 48, 72, 94 and 116 months, so each label covers a band of roughly two years, not the single year its name suggests.
df["SDM_Category"] = df["Survival Duration (Months)"].apply(lambda x: 
                                     "One Year" if x <= 24 else 
                                     "Two Years" if x <= 48 else 
                                     "Three Years" if x <= 72 else
                                     "Four Years" if x <= 94 else
                                     "Five Years" if x <= 116 else
                                     "Less than Six Years")

# Print the DataFrame
print(df["SDM_Category"].count())
print(df)

Minimum Survival Duration (Months): 6.0
Maximum Survival Duration (Months): 119.0
3627
      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  
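For comparison, the same kind of banding can be written with pd.cut and labels that state the month range directly. This is a sketch under an assumption: it uses even 24-month edges (0, 24, 48, 72, 96, 120), which differ from the 94 and 116 cut-offs in the cell above, and the label text is invented.

```python
import pandas as pd

durations = pd.Series([6.0, 24.0, 25.0, 94.0, 95.0, 119.0])

# Assumed edges: even 24-month bands covering the observed 6-119 month range.
edges = [0, 24, 48, 72, 96, 120]
labels = ["0-24 months", "25-48 months", "49-72 months",
          "73-96 months", "97-120 months"]

# Right-closed bins: 24 falls in the first band, 25 in the second.
sdm_band = pd.cut(durations, bins=edges, labels=labels).astype(str).tolist()
print(sdm_band)
```

Under these edges, 95 months stays in the fourth band, whereas the notebook's 94-month cut-off would place it in the fifth.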

In [40]:
# Bin BMI into categories with apply and a lambda
df["BMI_Category"] = df["BMI"].apply(lambda x: 
                                     "Underweight" if x < 18.5 else 
                                     "Normal weight" if x < 25 else 
                                     "Overweight" if x < 30 else 
                                     "Obese")

# Print the DataFrame
print(df)

      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  Dublin   21    Male  Pancreatic   

      Cancer Stage Diagnosis Date Treatment Start D
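The BMI banding can also be expressed with pd.cut. The sketch below checks, on a few boundary values chosen for illustration, that left-closed bins reproduce the lambda's results exactly:

```python
import numpy as np
import pandas as pd

bmi = pd.Series([18.4, 18.5, 24.9, 25.0, 30.0])
labels = ["Underweight", "Normal weight", "Overweight", "Obese"]

# right=False gives [0, 18.5), [18.5, 25), [25, 30), [30, inf),
# matching the strict '<' comparisons in the lambda.
with_cut = pd.cut(
    bmi, bins=[0, 18.5, 25, 30, np.inf], labels=labels, right=False
).astype(str).tolist()

with_lambda = bmi.apply(
    lambda x: "Underweight" if x < 18.5 else
              "Normal weight" if x < 25 else
              "Overweight" if x < 30 else
              "Obese"
).tolist()

print(with_cut == with_lambda)
```

One difference to keep in mind: pd.cut leaves NaN as NaN, while the lambda would label a NaN BMI as "Obese" because every comparison with NaN is False.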

In [41]:
# Get the unique values of 'Gender' column
unique_values = df['Gender'].unique()
print(unique_values)
# Encode the 'Gender' column as 0, 1, 2
df['IsGender'] = df['Gender'].map({'Male': 0, 'Female': 1, 'Non-binary': 2})

print(df)

['Male' 'Female' 'Non-binary']
      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  Dublin   21    Male  Pancreatic   

      Cancer Stage D

In [42]:
# Get the unique values of 'Recurrence' column
unique_values = df['Recurrence'].unique()
print(unique_values)
# Encode the 'Recurrence' column as 0 and 1
df['IsRecurrence'] = df['Recurrence'].map({'No': 0, 'Yes': 1})

print(df)

['Yes' 'No']
      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  Dublin   21    Male  Pancreatic   

      Cancer Stage Diagnosis Date Trea

In [43]:
# Get the unique values of 'Clinical Trial Participation' column
unique_values = df['Clinical Trial Participation'].unique()
print(unique_values)
# Encode the 'Clinical Trial Participation' column as 0, 1, 2
df['IsClinical_Trial_Participation'] = df['Clinical Trial Participation'].map({'No': 0, 'Yes': 1, 'Unknown': 2 })
print(df)

['Yes' 'No' 'Unknown']
      Patient ID   Country                  Region  Age  Gender Cancer Type  \
0              1   Finland       Southwest Finland   89    Male      Breast   
1              6    Sweden         Västra Götaland   28    Male  Pancreatic   
2             10     Spain               Andalusia   77  Female     Ovarian   
3             11     Spain               Andalusia   22  Female  Pancreatic   
4             16   Germany  North Rhine-Westphalia   53    Male      Breast   
...          ...       ...                     ...  ...     ...         ...   
3622        9990  Portugal                  Lisbon   71  Female      Breast   
3623        9993    Norway               Trøndelag   89  Female      Breast   
3624        9994   Denmark                 Zealand   43    Male  Colorectal   
3625        9995   Belgium                Wallonia   36    Male      Breast   
3626        9999   Ireland                  Dublin   21    Male  Pancreatic   

      Cancer Stage Diagnosis
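Series.map silently returns NaN for any value missing from the dictionary, which is why the cells above print the unique values first. A sketch of an explicit check; the value 'Other' is hypothetical and does not occur in the dataset output shown above:

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Non-binary", "Other"])  # 'Other' is made up
mapping = {"Male": 0, "Female": 1, "Non-binary": 2}

codes = gender.map(mapping)

# Any original value whose code is NaN was not covered by the mapping.
unmapped = gender[codes.isna()].unique().tolist()
print(unmapped)
```

An empty unmapped list confirms the dictionary covers every category before the encoded column is used.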

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3627 entries, 0 to 3626
Data columns (total 35 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Patient ID                      3627 non-null   int64         
 1   Country                         3627 non-null   object        
 2   Region                          3627 non-null   object        
 3   Age                             3627 non-null   int64         
 4   Gender                          3627 non-null   object        
 5   Cancer Type                     3627 non-null   object        
 6   Cancer Stage                    3627 non-null   int64         
 7   Diagnosis Date                  3627 non-null   datetime64[ns]
 8   Treatment Start Date            3627 non-null   datetime64[ns]
 9   Treatment End Date              3627 non-null   object        
 10  Treatment Type                  3627 non-null   object        
 11  Hosp

Write the output to a new CSV file.

In [45]:
df.to_csv(f'data/processed/cancer_patient_data_cleaned.csv')
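By default to_csv also writes the row index, which comes back as an 'Unnamed: 0' column when the file is read again. A sketch of the round trip on a two-row toy frame, using in-memory buffers instead of files; index=False is an optional variation, not what the cell above does:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Patient ID": [1, 6], "Age": [89, 28]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

When writing to disk, the data/processed folder must already exist; os.makedirs('data/processed', exist_ok=True) creates it if needed.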