
In this project, we aim to predict the survival years of cancer patients using machine learning techniques. We begin by analyzing a global cancer patient dataset *(global_cancer_patients_2015_2024.csv)* and apply two models:

**Logistic Regression**

**Random Forest Classifier**

Our goal is to evaluate how accurately these models can estimate survival durations based on available patient features.

After building and testing our models on this dataset, we will repeat the process with a second cancer-related dataset *(cancer issue.csv)*. This lets us compare the performance and generalizability of both models across different data sources and determine which method is more effective for predicting survival outcomes.
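The comparison described above can be sketched on synthetic data. This is only a minimal illustration, not the notebook's actual pipeline: the columns `Risk_Score`, `Survival_Years`, and `Survival_Class` are hypothetical, the data is randomly generated, and because both models are classifiers the sketch assumes survival years are first binned into three classes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): these values are randomly generated
# and do not come from either CSV file.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "Age": rng.integers(20, 90, size=n),
    "Risk_Score": rng.uniform(0, 10, size=n),  # hypothetical feature
})

# Made-up target for illustration: survival shrinks with age and risk.
demo["Survival_Years"] = (
    12 - 0.05 * demo["Age"] - 0.6 * demo["Risk_Score"] + rng.normal(0, 1, size=n)
).clip(lower=0)

# Both models predict classes, so bin the continuous target into three groups.
demo["Survival_Class"] = pd.qcut(
    demo["Survival_Years"], q=3, labels=["short", "medium", "long"]
).astype(str)

X = demo[["Age", "Risk_Score"]]
y = demo["Survival_Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Running the same loop on a second dataset and comparing the two `scores` dictionaries would give a rough view of how well each model generalizes.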

In [None]:

# Import standard libraries for machine learning
import pandas as pd                  # For data manipulation
import numpy as np                   # For numerical operations
import matplotlib.pyplot as plt      # For plotting
import seaborn as sns                # For visualizations

# Import the model we'll use
from sklearn.linear_model import LogisticRegression

# Upload CSV file from the computer
from google.colab import files
uploaded = files.upload()

# Load the dataset into a pandas DataFrame
df = pd.read_csv('global_cancer_patients_2015_2024.csv')

# Display the first few rows of the dataset
df.head()

In [None]:
df.tail()  # Display the last 5 rows to check how the dataset ends

In [None]:
df.info() # Get a concise summary of the dataset: column names, non-null counts, and data types

In [None]:
df.describe() # Generate summary statistics for all numeric columns (count, mean, std, min, max, quartiles)

In [None]:
df.isnull().sum() # Check for missing values in each column

In [None]:
int(df.duplicated().sum()) # Count the number of duplicate rows in the dataset (converted to a regular integer)

In [None]:
df.shape # Show the number of rows and columns in the dataset as a tuple (rows, columns)

In [None]:
df.dtypes # Display the data type of each column (e.g., int, float, object)

In [None]:
df.columns # List all column names in the dataset

In [None]:
# Print all column names with their 1-based position numbers
for i, col in enumerate(df.columns, start=1):
    print(f"{i}. {col}")

In [None]:
# Find the maximum age in the dataset
max_age = df['Age'].max()
# Filter the dataset to find the patient(s) with the maximum age
oldest_patients = df[df['Age'] == max_age]
# Display the details of the oldest patient(s)
print(oldest_patients)

In [None]:
# Find the minimum age in the dataset
min_age = df['Age'].min()

# Filter the dataset to find the patient(s) with the minimum age
youngest_patients = df[df['Age'] == min_age]

# Display the details of the youngest patient(s)
print(youngest_patients)

In [None]:
# Create a table showing the number of patients for each cancer type by country/region
region_cancer_counts = df.groupby(['Country_Region', 'Cancer_Type']).size().unstack(fill_value=0)
# Display the first few rows of the resulting table
region_cancer_counts.head()

In [None]:
# Show summary statistics of severity scores for each smoking level
df.groupby('Smoking')['Target_Severity_Score'].describe()

To understand the data better, we now look for relationships between features. We will use several plots, including scatterplots, chosen to answer relevant questions such as "What factors impact survival?", "Which cancers are most common?", and "How does risk relate to severity?"
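Before plotting, a quick numeric check can hint at which relationships are worth visualizing. The sketch below uses a tiny made-up table (the values are invented for illustration and are not taken from the dataset) with column names that appear in this notebook. It computes pairwise correlations between numeric columns and counts how often each cancer type occurs.

```python
import pandas as pd

# Tiny made-up table for illustration only; not real patient data.
demo = pd.DataFrame({
    "Age": [34, 51, 62, 45, 70, 58],
    "Target_Severity_Score": [3.1, 4.8, 6.0, 4.2, 7.1, 5.5],
    "Cancer_Type": ["Lung", "Breast", "Lung", "Colon", "Lung", "Breast"],
})

# Pairwise Pearson correlation between the numeric columns.
corr = demo.corr(numeric_only=True)
print(corr.round(2))

# Frequency of each cancer type, most common first.
type_counts = demo["Cancer_Type"].value_counts()
print(type_counts)
```

On the real data, the same two calls applied to `df` would point to feature pairs that merit a scatterplot and show which cancer types dominate the sample.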

**Age distribution**

In [None]:
sns.histplot(df['Age'], kde=True)
plt.title("Age Distribution of Patients")
plt.show()