# Overview

Early-onset colon cancer (typically defined as cases diagnosed before age 50) has raised concern due to reports of increasing incidence among younger adults. This proposal outlines a comprehensive health data science study to investigate global early-onset colon cancer trends using the International Agency for Research on Cancer’s **Cancer Incidence in Five Continents Plus (CI5plus)** database. The rationale stems from recent observations suggesting a surge in colon cancer among individuals in their 20s–40s, contrasted with stabilizing or declining rates in older adults [(pressroom.cancer.org)](https://pressroom.cancer.org/Colorectal-Cancer-Cases-Surge-Globally#:~:text=cancer%20,the%20journal%20The%20Lancet%20Oncology). The research aims to determine whether these patterns constitute a widespread epidemiological shift or are confined to specific regions or cohorts. Key objectives include estimating temporal trends in colon cancer incidence for ages 15–49 versus 50+, comparing these trends across countries and regions, and assessing whether rising young-adult incidence is disproportionately high relative to older adults. The study will employ an age–period–cohort analytical framework, treating age as a continuous variable (using midpoint assignments and spline functions) to increase precision beyond the 5-year age bands provided by CI5plus. Incidence data from dozens of countries will be stratified by sex, world region, and country income level to identify disparities. 

**Significance:** By leveraging international registry data and robust statistical methods, this research will clarify the extent of the early-onset colon cancer phenomenon, discern global versus localized patterns, and inform public health strategies (such as targeted awareness or screening policies) in response to any confirmed trends.  



## Background of the Problem 

Colorectal cancer (CRC) – encompassing cancers of the colon and rectum – is the third most common cancer worldwide [(gut.bmj.com)](https://gut.bmj.com/content/66/4/683). Historically, CRC has predominantly affected older adults, with incidence rates increasing sharply with age. Screening programs for colon cancer (e.g. colonoscopy) typically begin at age 50 in many countries, reflecting the long-held view that those under 50 are at low risk. However, emerging evidence over the past decade has challenged this assumption, particularly for colon and rectal cancers occurring in younger adults. Multiple high-income countries have reported rising CRC incidence among adults in their 20s, 30s, and 40s [(academic.oup.com)](https://doi.org/10.1093/jnci/djw322) [(thelancet.com)](https://doi.org/10.1016/S1470-2045(24)00688-3). In the United States, for example, overall CRC incidence has been declining in older adults (due in part to screening and risk factor improvements), yet **incidence in adults under 50 has been increasing since the 1980s** [(academic.oup.com)](https://doi.org/10.1093/jnci/djw322). Notably, rectal cancers appear to be rising faster than colon cancers in these younger age groups [(academic.oup.com)](https://doi.org/10.1093/jnci/djw322). Similar patterns have been observed in other high-income settings: a European analysis found a significant increase in CRC among young adults across several countries over the last 25 years [(gut.bmj.com)](https://doi.org/10.1136/gutjnl-2018-317592), and recent global data indicate this is not a uniquely American or European phenomenon [(thelancet.com)](https://doi.org/10.1016/S1470-2045(24)00688-3). These reports have captured public attention, leading to alarm and speculation about an “epidemic” of early-onset colon cancer. Despite the publicity, robust epidemiological data on this issue remain limited and sometimes conflicting.  

Complicating the narrative, the large majority of CRC cases still occur in older adults: about 90% of diagnoses are in people over 50 [(nytimes.com)](https://www.nytimes.com/2017/03/13/well/live/colon-and-rectal-cancers-rising-in-young-people.html). Early-onset cases (under 50) remain rare in absolute terms; in the UK, for example, only ~5% of colorectal cancer cases occur below age 50 [(cancerresearchuk.org)](https://news.cancerresearchuk.org/2024/12/11/early-onset-bowel-cancer-rise-global-phenomenon/). This raises two questions: *Is the rising incidence among young adults large enough to signal a true epidemiological shift? Or could it partly reflect increased vigilance and diagnostic effort in younger populations?*  

## Statement of the Problem  


The central problem driving this research is the *uncertainty about global trends in early-onset colon cancer*. Media coverage and regional reports highlight an apparent rise, but there is a *significant lack of comprehensive data* to confirm or quantify this trend on a global scale. Most existing studies combine colon and rectal cancers, leaving a gap in understanding how *colon cancer specifically* is trending among younger age groups worldwide. It also remains unclear whether increases are *disproportionate relative to older adults* or simply mirror overall incidence shifts.  

In summary: **There is an urgent need for a rigorous, data-driven investigation into global incidence trends of early-onset colon cancer, to determine whether a genuine increase exists, to characterize its extent and distribution across populations, and to discern how these trends compare to those in older adults.**  


# Data Loading and Initial Exploration

In [49]:
import os

os.chdir('/Users/ogeohia/PYTHON/eo-colon-cancer-trends-ci5plus')

In [27]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [28]:
# Load the CI5plus Summary dataset from the local data directory into a pandas DataFrame
cancer_df = pd.read_csv("data/CI5plus_Summary/data.csv")
# Display the first 5 rows of the DataFrame
display(cancer_df.head())

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year
0,80000299,1,1,1,30,96307.0,1993
1,80000299,1,1,2,18,66677.0,1993
2,80000299,1,1,3,11,59556.0,1993
3,80000299,1,1,4,8,60462.0,1993
4,80000299,1,1,5,30,72770.0,1993


In [29]:
# Display column names and their data types
print(cancer_df.info())

# Display the number of rows and columns
print("\nShape of the DataFrame:", cancer_df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3963894 entries, 0 to 3963893
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id_code      int64  
 1   sex          int64  
 2   cancer_code  int64  
 3   age          int64  
 4   cases        int64  
 5   py           float64
 6   year         int64  
dtypes: float64(1), int64(6)
memory usage: 211.7 MB
None

Shape of the DataFrame: (3963894, 7)


# Data preprocessing and wrangling


In [30]:
# Filter the DataFrame for Colon Cancer and reset the index
colon_cancer_df = cancer_df[cancer_df['cancer_code'] == 21].copy().reset_index(drop=True)

# Display the shape of the filtered DataFrame and the first few rows to verify
display(colon_cancer_df.head())
print("\nShape of the filtered DataFrame:", colon_cancer_df.shape)

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year
0,80000299,1,21,1,2,96307.0,1993
1,80000299,1,21,2,0,66677.0,1993
2,80000299,1,21,3,0,59556.0,1993
3,80000299,1,21,4,0,60462.0,1993
4,80000299,1,21,5,0,72770.0,1993



Shape of the filtered DataFrame: (136686, 7)


In [31]:
# Merge with id_dict to get registry details
# Load the registry dictionary
id_dict = pd.read_csv('data/CI5plus_Summary/id_dict.csv')
colon_cancer_full = colon_cancer_df.merge(id_dict[['id_code', 'id_label', 'CI5_continent', 'registry_code']], on='id_code', how='left')
# Map sex codes to labels for better readability
sex_map = {1: 'Male', 2: 'Female'}
colon_cancer_full['sex_label'] = colon_cancer_full['sex'].map(sex_map)
display(colon_cancer_full.head())

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year,id_label,CI5_continent,registry_code,sex_label
0,80000299,1,21,1,2,96307.0,1993,"Uganda, Kyadondo County",1,800002,Male
1,80000299,1,21,2,0,66677.0,1993,"Uganda, Kyadondo County",1,800002,Male
2,80000299,1,21,3,0,59556.0,1993,"Uganda, Kyadondo County",1,800002,Male
3,80000299,1,21,4,0,60462.0,1993,"Uganda, Kyadondo County",1,800002,Male
4,80000299,1,21,5,0,72770.0,1993,"Uganda, Kyadondo County",1,800002,Male


In [32]:
# Build a 'country' column from 'id_label'
# Copy the UK rows so they can also be labelled by constituent nation
uk_data = colon_cancer_full[colon_cancer_full['id_label'].str.startswith('UK,')].copy()

# Extract the country name (the text before the first ':', ',', '(' or ')') for every row;
# UK rows are labelled 'UK' at this step
colon_cancer_full['country'] = colon_cancer_full['id_label'].str.split(r'[:,()]', expand=True)[0].str.strip()

# For the copied UK rows, keep the full 'id_label' (e.g. 'UK, England') as the country
uk_data['country'] = uk_data['id_label']

# Append the nation-level copies. UK rows now appear twice (once as 'UK', once per nation),
# so keep only one labelling before aggregating across countries to avoid double counting
colon_cancer_full = pd.concat([colon_cancer_full, uk_data], ignore_index=True)

# Display the first few rows to verify the new 'country' column
display(colon_cancer_full.head())

# Display the unique values in the new 'country' column to check for correctness
print("\nUnique countries extracted:")
display(colon_cancer_full['country'].unique())
colon_cancer_full.shape

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year,id_label,CI5_continent,registry_code,sex_label,country
0,80000299,1,21,1,2,96307.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda
1,80000299,1,21,2,0,66677.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda
2,80000299,1,21,3,0,59556.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda
3,80000299,1,21,4,0,60462.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda
4,80000299,1,21,5,0,72770.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda



Unique countries extracted:


array(['Uganda', 'Argentina', 'Chile', 'Colombia', 'Costa Rica',
       'Ecuador', 'France', 'Puerto Rico', 'Canada', 'USA', 'Bahrain',
       'China', 'India', 'Israel', 'Japan', 'Republic of Korea', 'Kuwait',
       'Philippines', 'Qatar', 'Thailand', 'Türkiye', 'Austria',
       'Belarus', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark',
       'Estonia', 'Germany', 'Iceland', 'Ireland', 'Italy', 'Latvia',
       'Lithuania', 'Malta', 'The Netherlands', 'Norway', 'Poland',
       'Slovenia', 'Spain', 'Switzerland', 'UK', 'Australia',
       'New Zealand', 'UK, England', 'UK, Scotland',
       'UK, Northern Ireland', 'UK, Wales'], dtype=object)

(141512, 12)
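
The row count grows from 136,686 to 141,512 in the cell above because the UK rows are present twice after the concatenation: once labelled `UK` and once labelled by constituent nation. The sketch below reproduces the effect on toy data and shows one possible way to keep a single labelling before any cross-country aggregation. The toy frame, its values, and the helper `select_uk_labelling` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for colon_cancer_full (values are made up for illustration)
toy = pd.DataFrame({
    "id_label": ["UK, England", "UK, Scotland", "Norway"],
    "cases": [10, 4, 3],
})

# Same steps as the notebook cell
uk = toy[toy["id_label"].str.startswith("UK,")].copy()
toy["country"] = toy["id_label"].str.split(r"[:,()]", expand=True)[0].str.strip()
uk["country"] = uk["id_label"]
combined = pd.concat([toy, uk], ignore_index=True)


def select_uk_labelling(df, by_nation=True):
    """Hypothetical helper: keep either the nation-level or the pooled 'UK' rows."""
    pooled_rows = df["country"] == "UK"
    nation_rows = df["country"].str.startswith("UK,")
    drop = pooled_rows if by_nation else nation_rows
    return df[~drop].reset_index(drop=True)


print(combined["cases"].sum())                       # UK cases are counted twice here
print(select_uk_labelling(combined)["cases"].sum())  # each case is counted once
```

Whether to analyse the UK as one country or as four nations is a design decision for the study; the point of the sketch is only that the two labellings should not be summed together.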

In [33]:
# Create a dictionary to map CI5_continent codes to continent names
continent_map = {
    1: 'Africa',
    2: 'Latin America and the Caribbean',
    3: 'Northern America',
    4: 'Asia',
    5: 'Europe',
    6: 'Oceania'
}

# Create a new 'continent' column by mapping the 'CI5_continent' codes
colon_cancer_full['continent'] = colon_cancer_full['CI5_continent'].map(continent_map)

# Display the first few rows to verify the new 'continent' column
display(colon_cancer_full.head())

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year,id_label,CI5_continent,registry_code,sex_label,country,continent
0,80000299,1,21,1,2,96307.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa
1,80000299,1,21,2,0,66677.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa
2,80000299,1,21,3,0,59556.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa
3,80000299,1,21,4,0,60462.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa
4,80000299,1,21,5,0,72770.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa


In [34]:
# Sanity check
# Display unique countries where 'continent' is NaN in colon_cancer_full
countries_with_nan_continent = colon_cancer_full[colon_cancer_full['continent'].isna()]['country'].unique()
print("Countries with NaN in 'continent':")
print(countries_with_nan_continent)

Countries with NaN in 'continent':
[]


In [35]:
# Add UN M49 sub-region column
# Load the lukes/ISO-3166-Countries-with-Regional-Codes dataset from GitHub
regions_df = pd.read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv")

# Prepare a mapping from country name to UN M49 sub-region
country_to_subregion = regions_df.set_index('name')['sub-region'].to_dict()

# Map the 'country' column in colon_cancer_full to the sub-region
colon_cancer_full['region'] = colon_cancer_full['country'].map(country_to_subregion)

# Display the first few rows to verify the new 'region' column
display(colon_cancer_full.head())
print("\nShape of the filtered DataFrame:", colon_cancer_full.shape)
print("\nUnique regions extracted:")
display(colon_cancer_full['region'].unique())

Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year,id_label,CI5_continent,registry_code,sex_label,country,continent,region
0,80000299,1,21,1,2,96307.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa
1,80000299,1,21,2,0,66677.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa
2,80000299,1,21,3,0,59556.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa
3,80000299,1,21,4,0,60462.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa
4,80000299,1,21,5,0,72770.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa



Shape of the filtered DataFrame: (141512, 14)

Unique regions extracted:


array(['Sub-Saharan Africa', 'Latin America and the Caribbean',
       'Western Europe', 'Northern America', nan, 'Western Asia',
       'Eastern Asia', 'Southern Asia', 'South-eastern Asia',
       'Eastern Europe', 'Southern Europe', 'Northern Europe',
       'Australia and New Zealand'], dtype=object)

In [36]:
# Display the unique countries where 'region' is NaN in colon_cancer_full
countries_with_nan_region = colon_cancer_full[colon_cancer_full['region'].isna()]['country'].unique()
print("Countries with NaN in 'region':")
print(countries_with_nan_region)

Countries with NaN in 'region':
['USA' 'Republic of Korea' 'Czech Republic' 'The Netherlands' 'UK'
 'UK, England' 'UK, Scotland' 'UK, Northern Ireland' 'UK, Wales']


In [37]:
# Manual mapping for countries with NaN region or special cases
manual_region_map = {
    'USA': 'Northern America',
    'Republic of Korea': 'Eastern Asia',
    'Czech Republic': 'Eastern Europe',
    'The Netherlands': 'Western Europe',
    'UK': 'Northern Europe',
    'UK, England': 'Northern Europe',
    'UK, Scotland': 'Northern Europe',
    'UK, Northern Ireland': 'Northern Europe',
    'UK, Wales': 'Northern Europe'
}

# Fill the remaining NaN regions from the manual mapping (rows that already have a region are unchanged)
colon_cancer_full['region'] = colon_cancer_full['region'].fillna(
    colon_cancer_full['country'].map(manual_region_map)
)

# Display the unique regions after mapping
print("Unique regions after manual mapping:")
display(colon_cancer_full['region'].unique())
print(f"Number of rows with NaN in 'region': {colon_cancer_full[colon_cancer_full['region'].isna()].shape[0]}")

Unique regions after manual mapping:


array(['Sub-Saharan Africa', 'Latin America and the Caribbean',
       'Western Europe', 'Northern America', 'Western Asia',
       'Eastern Asia', 'Southern Asia', 'South-eastern Asia',
       'Eastern Europe', 'Southern Europe', 'Northern Europe',
       'Australia and New Zealand'], dtype=object)

Number of rows with NaN in 'region': 0


In [38]:
# Add HDI category column
# Load the HDI dataset
hdi_df = pd.read_csv('data/hdi_2023.csv')

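# Map HDI values to categories using lower bounds, checked from highest to lowest
hdi_categories = [
    (0.8, "Very High"),
    (0.7, "High"),
    (0.55, "Medium"),
    (0.0, "Low"),
]

def categorize_hdi(hdi_value):
    # Return the first category whose lower bound is met, so an HDI of exactly 1.0 is "Very High";
    # NaN and negative values match no bound and fall through to "Unknown"
    for lower, category in hdi_categories:
        if hdi_value >= lower:
            return category
    return "Unknown"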

# Apply the categorization
hdi_df['hdi_category'] = hdi_df['hdi'].apply(categorize_hdi)

# Merge HDI data with colon_cancer_full on country name
colon_cancer_full = colon_cancer_full.merge(hdi_df, on='country', how='left')
display(colon_cancer_full.head())
print(f"Number of rows with NaN in 'hdi_category': {colon_cancer_full[colon_cancer_full['hdi_category'].isna()].shape[0]}")


Unnamed: 0,id_code,sex,cancer_code,age,cases,py,year,id_label,CI5_continent,registry_code,sex_label,country,continent,region,hdi,hdi_category
0,80000299,1,21,1,2,96307.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium
1,80000299,1,21,2,0,66677.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium
2,80000299,1,21,3,0,59556.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium
3,80000299,1,21,4,0,60462.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium
4,80000299,1,21,5,0,72770.0,1993,"Uganda, Kyadondo County",1,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium


Number of rows with NaN in 'hdi_category': 40698
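
The same banding can be expressed without a Python loop. The sketch below uses `pandas.cut` with left-closed bins on a few illustrative HDI values; the values themselves are assumptions chosen to hit the bin edges. It differs from `categorize_hdi` in one respect: missing values stay `NaN` instead of becoming `"Unknown"`.

```python
import numpy as np
import pandas as pd

# Illustrative HDI values, including bin edges and a missing value
hdi = pd.Series([0.549, 0.55, 0.582, 0.7, 0.8, 1.0, np.nan])

# right=False gives left-closed bins: [0, 0.55), [0.55, 0.7), [0.7, 0.8), [0.8, inf)
hdi_category = pd.cut(
    hdi,
    bins=[0.0, 0.55, 0.7, 0.8, np.inf],
    labels=["Low", "Medium", "High", "Very High"],
    right=False,
)
print(hdi_category.tolist())
```

Using `np.inf` as the last edge keeps an HDI of exactly 1.0 inside the top band.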


In [39]:
# Display the unique countries where 'hdi_category' is NaN in colon_cancer_full
nan_countries = colon_cancer_full[colon_cancer_full['hdi_category'].isna()]['country'].unique()
print("Countries with NaN HDI category:")
for country in nan_countries:
    print(f" - {country}")


Countries with NaN HDI category:
 - Puerto Rico
 - USA
 - Republic of Korea
 - Czech Republic
 - The Netherlands
 - UK
 - UK, England
 - UK, Scotland
 - UK, Northern Ireland
 - UK, Wales
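
Unmatched names like these can be detected at merge time instead of by inspecting `NaN` values afterwards. The sketch below uses the `indicator` and `validate` parameters of `DataFrame.merge` on toy data; the frames and the lookup spelling `"United States"` are hypothetical and only illustrate a naming mismatch.

```python
import pandas as pd

# Toy stand-ins; the lookup deliberately spells one country differently
incidence = pd.DataFrame({"country": ["Uganda", "USA", "USA"], "cases": [1, 5, 7]})
hdi_lookup = pd.DataFrame({"country": ["Uganda", "United States"], "hdi": [0.582, 0.938]})

merged = incidence.merge(
    hdi_lookup,
    on="country",
    how="left",
    validate="many_to_one",  # raises MergeError if the lookup repeats a country
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Countries present in the incidence data but absent from the lookup
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)
```

The `validate` check also guards against a lookup table with repeated country names, which would otherwise silently multiply incidence rows.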


In [40]:
# Manual HDI values and categories for specific countries
manual_hdi_map = {
    'USA': 0.938,
    'Republic of Korea': 0.937,
    'Czech Republic': 0.915,
    'The Netherlands': 0.955,
    'UK': 0.946,
    'UK, England': 0.940, # Ave. for English regions - HDI (2022) Source: https://globaldatalab.org/shdi/table/2022/shdi+lifexp+lgnic/GBR/
    'UK, Scotland': 0.933, # HDI (2022) Source - Global Data Lab: https://globaldatalab.org/shdi/table/2022/shdi+lifexp+lgnic/GBR/
    'UK, Northern Ireland': 0.907, # HDI (2022) Source - Global Data Lab: https://globaldatalab.org/shdi/table/2022/shdi+lifexp+lgnic/GBR/
    'UK, Wales': 0.910, # HDI (2022) Source - Global Data Lab: https://globaldatalab.org/shdi/table/2022/shdi+lifexp+lgnic/GBR/
    'Puerto Rico': 0.879 # HDI (2022) Source - Wikipedia: https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_Human_Development_Index_score
}

# Function to assign HDI category based on value
def get_hdi_category(hdi_value):
    if hdi_value >= 0.8:
        return "Very High"
    elif hdi_value >= 0.7:
        return "High"
    elif hdi_value >= 0.55:
        return "Medium"
    elif hdi_value >= 0.0:
        return "Low"
    else:
        return "Unknown"

# Update 'hdi' and 'hdi_category' for these countries in colon_cancer_full
for country, hdi_value in manual_hdi_map.items():
    mask = colon_cancer_full['country'] == country
    colon_cancer_full.loc[mask, 'hdi'] = hdi_value
    colon_cancer_full.loc[mask, 'hdi_category'] = get_hdi_category(hdi_value)

# Display updated rows for verification
display(colon_cancer_full[colon_cancer_full['country'].isin(manual_hdi_map.keys())][['country', 'hdi', 'hdi_category']].drop_duplicates())

Unnamed: 0,country,hdi,hdi_category
8854,Puerto Rico,0.879,Very High
10944,USA,0.938,Very High
49742,Republic of Korea,0.937,Very High
70946,Czech Republic,0.915,Very High
101688,The Netherlands,0.955,Very High
121600,UK,0.946,Very High
136686,"UK, England",0.94,Very High
138472,"UK, Scotland",0.933,Very High
139992,"UK, Northern Ireland",0.907,Very High
140942,"UK, Wales",0.91,Very High
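
For reference, the per-country loop above can also be written as a single vectorised step. This is a sketch on a toy frame, assuming (as in the loop) that manual values should take precedence over any merged value.

```python
import pandas as pd

# Toy frame: one country with a merged HDI, two that failed to match
toy = pd.DataFrame({
    "country": ["Uganda", "USA", "UK, Wales"],
    "hdi": [0.582, None, None],
})
manual_hdi_map = {"USA": 0.938, "UK, Wales": 0.910}

# Manual values win where a country is in the map; other rows keep their merged HDI
toy["hdi"] = toy["country"].map(manual_hdi_map).fillna(toy["hdi"])
print(toy)
```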


In [41]:
# Display the unique countries where 'hdi_category' is NaN in colon_cancer_full
nan_countries = colon_cancer_full[colon_cancer_full['hdi_category'].isna()]['country'].unique()
print("Countries with NaN HDI category:")
for country in nan_countries:
    print(f" - {country}")

Countries with NaN HDI category:


In [42]:
# Exclude ages 0-14, ages 80+, and records with missing age
# Age codes 1-3 cover 0-14 years
# Age codes 17 and 18 cover 80+ years
# Age code 19 marks missing age
excluded_age_codes = list(range(1, 4)) + list(range(17, 20))

# Filter the colon_cancer_full to exclude these age codes
colon_cancer_full = colon_cancer_full[~colon_cancer_full['age'].isin(excluded_age_codes)].copy()

# Display the shape of the age-filtered DataFrame and the unique age codes remaining
print("Shape of the DataFrame after excluding specified age ranges:", colon_cancer_full.shape)
print("\nUnique age codes remaining:")
display(colon_cancer_full['age'].unique())

Shape of the DataFrame after excluding specified age ranges: (96824, 16)

Unique age codes remaining:


array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [43]:
# Filter the DataFrame to include data from 1978 to 2017
colon_cancer_full = colon_cancer_full[(colon_cancer_full['year'] >= 1978) & (colon_cancer_full['year'] <= 2017)].copy()

# Display the shape of the year-filtered DataFrame and the unique years remaining
print("Shape of the DataFrame after excluding years before 1978 and after 2017:", colon_cancer_full.shape)
print("\nUnique years remaining:")
display(colon_cancer_full['year'].unique())

Shape of the DataFrame after excluding years before 1978 and after 2017: (92326, 16)

Unique years remaining:


array([1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
       1991, 1992, 1982, 1978, 1979, 1980, 1981])
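
The unordered year list suggests that registries enter the data in different years (the Uganda rows shown earlier start in 1993). Before comparing trends it may help to tabulate each registry's coverage. The sketch below does this on toy data; registry code `123456` and all year values are made up for illustration.

```python
import pandas as pd

# Toy stand-in with two registries; 123456 is a hypothetical registry code
toy = pd.DataFrame({
    "registry_code": [800002, 800002, 800002, 123456, 123456],
    "year": [1993, 1994, 1996, 1978, 1979],
})

# First year, last year, and number of distinct years per registry
coverage = toy.groupby("registry_code")["year"].agg(
    first_year="min", last_year="max", n_years="nunique"
)

# A registry has a gap when its span is longer than its count of distinct years
span = coverage["last_year"] - coverage["first_year"] + 1
coverage["has_gap"] = span > coverage["n_years"]
print(coverage)
```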

In [44]:
# Create a dictionary mapping age codes (4-16) to their midpoints
# Age code 4: 15-19 -> 17.5
# Age code 5: 20-24 -> 22.5
# ...
# Age code 16: 75-79 -> 77.5
age_midpoint_map = {
    4: 17.5,
    5: 22.5,
    6: 27.5,
    7: 32.5,
    8: 37.5,
    9: 42.5,
    10: 47.5,
    11: 52.5,
    12: 57.5,
    13: 62.5,
    14: 67.5,
    15: 72.5,
    16: 77.5,
}

# Create the 'age_cont' column using the mapping
colon_cancer_full['age_cont'] = colon_cancer_full['age'].map(age_midpoint_map)

# Drop redundant columns
columns_to_drop = ['id_code', 'sex', 'cancer_code', 'age', 'id_label', 'CI5_continent']
colon_cancer_full = colon_cancer_full.drop(columns=columns_to_drop)

# Display the first few rows to verify the new column and dropped columns
display(colon_cancer_full.head())

Unnamed: 0,cases,py,year,registry_code,sex_label,country,continent,region,hdi,hdi_category,age_cont
3,0,60462.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,17.5
4,0,72770.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,22.5
5,0,64952.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,27.5
6,1,45156.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,32.5
7,0,28283.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,37.5
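
The midpoints in `age_midpoint_map` follow a simple rule, which can be written out as a check on the hand-typed dictionary. The sketch assumes age code 1 is the 0-4 band and that every code up to 16 is a closed 5-year band, consistent with the comments above; `age_band` is a hypothetical helper. The rule does not apply to an open-ended band or to the missing-age code.

```python
def age_band(code, width=5):
    """Return (lower, upper, midpoint) for a closed age band, assuming code 1 is 0-4."""
    lower = width * (code - 1)
    upper = lower + width - 1
    midpoint = lower + width / 2
    return lower, upper, midpoint


# Rebuild the mapping for codes 4-16 (ages 15-79)
age_midpoint_map = {code: age_band(code)[2] for code in range(4, 17)}

print(age_band(4))
print(age_band(16))
```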


In [45]:
# Create a broad age group column 'age_group': 'Young' for ages under 50 (midpoints 17.5-47.5), 'Old' otherwise
colon_cancer_full['age_group'] = np.where(colon_cancer_full['age_cont'] < 50, 'Young', 'Old')

# Display the first few rows with the new 'age_group' column
display(colon_cancer_full.head())

Unnamed: 0,cases,py,year,registry_code,sex_label,country,continent,region,hdi,hdi_category,age_cont,age_group
3,0,60462.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,17.5,Young
4,0,72770.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,22.5,Young
5,0,64952.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,27.5,Young
6,1,45156.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,32.5,Young
7,0,28283.0,1993,800002,Male,Uganda,Africa,Sub-Saharan Africa,0.582,Medium,37.5,Young
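
The age-period-cohort framework described in the overview needs a birth-cohort coordinate in addition to `year` (period) and `age_cont` (age). The notebook does not compute one at this stage. A conventional choice, shown here as an assumption on toy rows, is cohort = period minus age at the band midpoint. The 5-year grouping in `cohort_5y` is likewise an illustrative choice.

```python
import pandas as pd

# Toy rows using the notebook's column names
toy = pd.DataFrame({"year": [1993, 2017, 2017], "age_cont": [17.5, 47.5, 77.5]})

# Approximate birth year at the age-band midpoint
toy["cohort"] = toy["year"] - toy["age_cont"]

# Group into 5-year birth cohorts, labelled by their starting year
toy["cohort_5y"] = (toy["cohort"] // 5 * 5).astype(int)
print(toy)
```

Because age comes from 5-year bands, each computed cohort is an approximation spanning several adjacent birth years.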


In [46]:
# Save colon_cancer_full DataFrame to CSV in the /data directory
colon_cancer_full.to_csv('data/colon_cancer_full.csv', index=False)
print("colon_cancer_full saved to data/colon_cancer_full.csv")

colon_cancer_full saved to data/colon_cancer_full.csv
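
As a first look at the saved data, crude incidence rates per 100,000 person-years can be computed from the `cases` and `py` columns. The sketch below uses made-up counts and sums cases and person-years within each group before dividing, instead of averaging stratum-level rates. These are crude rates, not age-standardised ones, and the sketch is not the study's modelling approach.

```python
import pandas as pd

# Toy strata using the notebook's column names (counts are illustrative)
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "age_group": ["Young", "Young", "Old", "Old"],
    "cases": [2, 3, 40, 60],
    "py": [100_000.0, 150_000.0, 100_000.0, 100_000.0],
})

# Pool cases and person-years per year and age group, then form the rate
totals = toy.groupby(["year", "age_group"], as_index=False)[["cases", "py"]].sum()
totals["rate_per_100k"] = totals["cases"] / totals["py"] * 100_000
print(totals)
```

If the UK rows are still present under both labellings, one of them should be dropped before pooling across countries.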
