## 1. Exploratory data analysis

In this initial analysis, we load the dataset and perform a preliminary investigation to understand its structure and characteristics.

The goal of exploratory data analysis is to get an overall view of the dataset, identify patterns, detect possible inconsistencies or outliers, check for missing values, and understand the distribution of variables. This step is essential for guiding decisions on data preprocessing, model selection, and analytical approaches in the subsequent stages.

In [8]:
import pandas as pd

In [9]:
dataset_path = '../data/dataset.csv'
df = pd.read_csv(dataset_path)

The `shape` attribute of a DataFrame returns a tuple with the number of rows and the number of columns in the dataset.

In [10]:
n_rows, n_columns = df.shape
print(f"The dataset has {n_rows} rows and {n_columns} columns.")

The dataset has 890000 rows and 17 columns.


The `head()` method returns, by default, the first five rows of a DataFrame.

This initial preview gives a quick sense of the column names, the kind of values each column holds, and the overall layout of the dataset. It can also surface obvious problems early, such as irrelevant columns or inconsistently formatted values, although five rows are too few to rule such problems out.

In [11]:
df.head(5)

| | id | age | gender | country | diagnosis_date | cancer_stage | family_history | smoking_status | bmi | cholesterol_level | hypertension | asthma | cirrhosis | other_cancer | treatment_type | end_treatment_date | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 64.0 | Male | Sweden | 2016-04-05 | Stage I | Yes | Passive Smoker | 29.4 | 199 | 0 | 0 | 1 | 0 | Chemotherapy | 2017-09-10 | 0 |
| 1 | 2 | 50.0 | Female | Netherlands | 2023-04-20 | Stage III | Yes | Passive Smoker | 41.2 | 280 | 1 | 1 | 0 | 0 | Surgery | 2024-06-17 | 1 |
| 2 | 3 | 65.0 | Female | Hungary | 2023-04-05 | Stage III | Yes | Former Smoker | 44.0 | 268 | 1 | 1 | 0 | 0 | Combined | 2024-04-09 | 0 |
| 3 | 4 | 51.0 | Female | Belgium | 2016-02-05 | Stage I | No | Passive Smoker | 43.0 | 241 | 1 | 1 | 0 | 0 | Chemotherapy | 2017-04-23 | 0 |
| 4 | 5 | 37.0 | Male | Luxembourg | 2023-11-29 | Stage I | No | Passive Smoker | 19.7 | 178 | 0 | 0 | 0 | 0 | Combined | 2025-01-08 | 0 |
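Because a five-row preview can miss inconsistent labels further down the file, one option is to list the distinct values of every text column. The sketch below is only an illustration: the `sample` frame and the `category_levels` helper are hypothetical, and the deliberately messy values are invented, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample reusing column names from the preview;
# the inconsistent labels are invented for illustration only.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "female ", "Male"],
    "cancer_stage": ["Stage I", "Stage III", "Stage III", "stage i"],
    "bmi": [29.4, 41.2, 44.0, 43.0],
})


def category_levels(frame: pd.DataFrame) -> dict[str, list[str]]:
    """Return the sorted distinct values of every non-numeric column."""
    text_columns = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return {col: sorted(frame[col].dropna().unique()) for col in text_columns}


levels = category_levels(sample)
for column, values in levels.items():
    print(column, values)
```

Variants that differ only in case or trailing whitespace show up side by side in the sorted lists, which makes them easy to spot. On the full dataset the same helper could be called as `category_levels(df)`, assuming the text columns have a manageable number of distinct values.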


The `dtypes` attribute displays the data type of each column in the DataFrame, such as `int64`, `float64`, or `object` (commonly used for strings), among others.

This information is essential to understand how the data is represented internally and to ensure that each column has the appropriate type for future analyses, such as numerical computations, groupings, or data transformations.

In [12]:
df.dtypes

id                      int64
age                   float64
gender                 object
country                object
diagnosis_date         object
cancer_stage           object
family_history         object
smoking_status         object
bmi                   float64
cholesterol_level       int64
hypertension            int64
asthma                  int64
cirrhosis               int64
other_cancer            int64
treatment_type         object
end_treatment_date     object
survived                int64
dtype: object
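The listing shows `diagnosis_date` and `end_treatment_date` stored as `object`, meaning they are plain strings rather than dates. A minimal sketch of one possible conversion follows, assuming the dates use the `YYYY-MM-DD` format seen in the preview. The `sample` frame and the derived `days_to_treatment_end` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical two-row sample using column names from the dtypes listing.
sample = pd.DataFrame({
    "diagnosis_date": ["2016-04-05", "2023-04-20"],
    "end_treatment_date": ["2017-09-10", "2024-06-17"],
    "treatment_type": ["Chemotherapy", "Surgery"],
})

date_columns = ["diagnosis_date", "end_treatment_date"]
for column in date_columns:
    # An explicit format makes parsing strict: a malformed date raises
    # an error instead of being silently misread.
    sample[column] = pd.to_datetime(sample[column], format="%Y-%m-%d")

# Low-cardinality text columns can be stored as categoricals.
sample["treatment_type"] = sample["treatment_type"].astype("category")

# With real datetimes, an elapsed time is a simple subtraction.
sample["days_to_treatment_end"] = (
    sample["end_treatment_date"] - sample["diagnosis_date"]
).dt.days

print(sample.dtypes)
print(sample["days_to_treatment_end"].tolist())
```

Whether the categorical conversion is worthwhile depends on the column; it tends to help most when a column has few distinct values relative to the number of rows.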

The `describe()` method provides a statistical summary of the numerical columns in the DataFrame.

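It returns the count, mean, standard deviation, minimum and maximum values, and the quartiles (25%, 50%, 75%) of each column, offering an overview of the data distribution. This summary helps gauge central tendency and variability and flag potential outliers in the numerical variables. Because `id` and the binary flags (`hypertension`, `asthma`, `cirrhosis`, `other_cancer`, `survived`) are stored as integers, they are included as well: the statistics for `id` carry no meaning, while the mean of a 0/1 column is the proportion of ones.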

In [13]:
df.describe()

| | id | age | bmi | cholesterol_level | hypertension | asthma | cirrhosis | other_cancer | survived |
|---|---|---|---|---|---|---|---|---|---|
| count | 890000.0 | 890000.0 | 890000.0 | 890000.0 | 890000.0 | 890000.0 | 890000.0 | 890000.0 | 890000.0 |
| mean | 445000.5 | 55.007008 | 30.494172 | 233.633916 | 0.750024 | 0.46974 | 0.225956 | 0.088157 | 0.220229 |
| std | 256921.014127 | 9.994485 | 8.368539 | 43.432278 | 0.432999 | 0.499084 | 0.418211 | 0.283524 | 0.414401 |
| min | 1.0 | 4.0 | 16.0 | 150.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 222500.75 | 48.0 | 23.3 | 196.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 445000.5 | 55.0 | 30.5 | 242.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 667500.25 | 62.0 | 37.7 | 271.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 890000.0 | 104.0 | 45.0 | 300.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
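One common way to turn quartiles into an outlier check is Tukey's interquartile range (IQR) rule, which flags values more than 1.5 × IQR beyond the first or third quartile. The sketch below is an assumption about how such a check could look, not part of the original analysis. The `ages` series is hypothetical (only its extremes, 4 and 104, echo the reported minimum and maximum), and `iqr_bounds` is an illustrative helper.

```python
import pandas as pd

# Hypothetical ages; only 4 and 104 echo the reported min and max.
ages = pd.Series([4, 48, 50, 55, 55, 60, 62, 104], name="age")


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of Tukey's IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


lower, upper = iqr_bounds(ages)
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers.tolist())
```

Applied to the quartiles reported for `age` (48 and 62), the same rule gives fences of 27 and 83, so the reported minimum of 4 and maximum of 104 would both fall outside them. Whether such values are errors or legitimate extreme cases is a domain question the rule cannot answer.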


The expression `df.isna().sum()` counts the missing (NaN) values in each column of the DataFrame.

The per-column counts help identify variables that may require handling, such as imputing values or removing the affected rows or columns. This check is crucial to ensure data quality before moving on to more advanced steps.

In [14]:
df.isna().sum()

id                    0
age                   0
gender                0
country               0
diagnosis_date        0
cancer_stage          0
family_history        0
smoking_status        0
bmi                   0
cholesterol_level     0
hypertension          0
asthma                0
cirrhosis             0
other_cancer          0
treatment_type        0
end_treatment_date    0
survived              0
dtype: int64
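On datasets that do contain gaps, raw counts are often easier to interpret alongside percentages. The following is a small sketch of such a report. The `sample` frame with its missing entries and the `missing_report` helper are hypothetical and only illustrate the idea, since the dataset above reports no missing values.

```python
import pandas as pd

# Hypothetical sample with gaps, for illustration only.
sample = pd.DataFrame({
    "age": [64.0, None, 65.0, 51.0],
    "bmi": [29.4, 41.2, None, None],
    "survived": [0, 1, 0, 0],
})


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for affected columns only."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one gap, worst first.
    affected = report[report["missing"] > 0]
    return affected.sort_values("missing", ascending=False)


report = missing_report(sample)
print(report)
```

Note that `isna()` only detects true nulls. Placeholder values such as empty strings or labels like "Unknown" would not be counted and need a separate check.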