# Cancer incidence trends in young adults

## Introduction

The idea of this project comes from a conversation I recently have with a friend of mine who worked in a cancer retreat center. The friend is a medical professional who worked at the center for many years. He noticed that over the past 10 years, there have been more relatively young patients in the center than it was before. The friend was inclined to attribute this observation to the deteriorating quality of air, food, and other environmental factors. I skeptically suggested that there are other possible explanations of this phenomenon. 10 years ago, the friend himself was younger and could perceive typical patients as older.

Cancer is a class of diseases in which some of the body’s cells grow uncontrollably and spread to other organs and tissues. Most cancers form a tumor. According to modern concepts, cancer is caused by changes to genes that control the way our cells function, especially how they grow and divide. Despite intensive research and development of new treatments, cancer remains the leading causes of death worldwide. In 2018, there were 18.1 million new cases and 9.5 million cancer-related deaths worldwide. By 2040, the number of new cancer cases per year is expected to rise to 29.5 million and the number of cancer-related deaths to 16.4 million [[1]](#ref_1).

Cancer can be considered an age-related disease because the incidence of most cancers increases with age, rising more rapidly beginning in midlife. Despite the fact that the disease can occur at any age, more than half of all cancers occurred in adults aged ≥65 years [[2]](#ref_2). Therefore, for a long time, cancer was considered a disease predominantly affecting the elderly. Given this age specificity, a possible increase in the number of diseases among younger age groups may be of great interest.


<figure>
  <img src="media/cancer_and_age.jpg" alt="Invasive cancer incidence, by age, U.S., 2009">
  <figcaption>
      <center>
          <strong>
              Figure 1. Invasive cancer incidence, by age, U.S., 2009 <a href="#ref_2">[2]</a>
          </strong>
      </center>
  </figcaption>
</figure>
    
    
As a software developer I am interested in Digital health. The idea of using information technologies to enhance the efficiency of healthcare seems very promising. Collection and analysis of health data using the data-science approach could potentially improve our understanding of diseases such as cancer. Therefore, I decided to investigate my friend's observation. Using available data and data-science approach it seems possible to determine if young adults have become more often diagnosed with cancer in recent decades.

## Aims and Objectives

Within this research project my aim is to find out if young adults are more likely to be diagnosed with cancer in the recent decades. The goal is to get a big picture across all regions of the world and all ethnic groups. I want to take into account population changes and other possible factors that can skew the statistics.

For the purpose of this project, a young adult is considered to be between the ages of 15 and 45. This range was chosen because people younger than 15 have other types of cancer with different epidemiological dynamics that are outside the scope of this study [[3]](#ref_3). The age group over 45 also has its own epidemiological dynamics. It has long seen an increase in morbidity, but it is attributed mainly to an increase in overall life expectancy and a decrease in mortality from other causes [[4]](#ref_4). This is also outside the scope of this study.

## Literature Review

I found several articles on the Internet in which the issue is studied. Below I provide a summary and key findings that are relevant in the context of this study.

**Incidence trends for twelve cancers in younger adults—a rapid review. Br J Cancer 126, 1374–1386 (2022). [[5]](#ref_5)**

This paper analyzed the epidemiological information of some types of cancer in young adult patients, and came to the following conclusion: "Overall, this review provides evidence that some cancers are increasingly being diagnosed in younger age groups, although the mechanisms remain unclear." [[5]](#ref_5) This is a meta-analysis of existing studies on different types of cancer, but my goal is to explore the big picture across all types of cancer. The sources used in the article were mainly originated from the United States, while my goal is to get numbers for all countries.

**Trends in Cancer Incidence in US Adolescents and Young Adults, 1973-2015 [[6]](#ref_6)**.

Some findings from this paper: "In this serial cross-sectional, US population-based study using cancer registry data from 497 452 AYAs, the rate of cancer increased by 29.6% from 1973 to 2015, with kidney carcinoma increasing at the greatest rate. Breast carcinoma and testicular cancer were the most common cancer diagnoses for female and male AYAs, respectively." [[6]](#ref_6)

The autors coclude: "In this cross-sectional, US population-based study, cancer in AYAs was shown to have a unique epidemiological pattern and is a growing health concern, with many cancer subtypes having increased in incidence from 1973 to 2015. Continued research on AYA cancers is important to understanding and addressing the distinct health concerns of this population." [[6]](#ref_6) AYA is stands for Adolescents and Young Adults.

The findings from this paper also support the idea that the increase in the number of diseases among young people in the United states has natural causes. However, the article is focused on the United States, while my goal is to get the overal picture for all regions of the world.

## Method

* Acquiring a dataset that is fit for purpose.
    1. Describe data requirements
    2. Evaluate different possible data sources
    3. Select a data source that fit for purpose
* Exploring the dataset through different lenses, identifying key features and potential flaws in the data
* Produce a systematic, rigorous and well-reasoned report on how you work through the dataset.

## Dataset

The only source for the agregated cancer worldwide statistics is [International Agency for Research on Cancer (IARC)](https://www.iarc.fr/).

### Preprocess

In [1]:
!pip install pandas==1.4.2

import os
import requests
import zipfile
import re
import io
import codecs
from urllib.parse import urlparse
import pandas as pd



Download "detailed data" files from https://ci5.iarc.fr/ci5i-x/pages/download.aspx for each volumes.

In [2]:
urls = (
    "https://ci5.iarc.fr/CI5-X/CI5-Xd.zip",
    "https://ci5.iarc.fr/ci5i-x/old/vol9/CI5-IXd.zip",
    "https://ci5.iarc.fr/ci5i-x/old/vol8/CI5-VIIId.zip",
    "https://ci5.iarc.fr/ci5i-x/old/vol7/CI5-VIId.zip",
    "https://ci5.iarc.fr/ci5i-x/old/vol6/CI5-VI.zip",
    "https://ci5.iarc.fr/ci5i-x/old/vol5/CI5-V.zip"
)

os.makedirs("CI5", exist_ok=True)

def download_if_not_exists(url, path):
    if os.path.exists(path):
        print(f"File {path} exists in cache")
    else:
        print(f"Downloading {url}...")
        response = requests.get(url)
        print(f"Save to {path}")
        open(path, "wb").write(response.content)

def unzip(file_path, target_dir):
    print(f"Extract {file_path} to {target_dir}")
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)

for url in urls:
    url_path = urlparse(url).path
    file_name = os.path.basename(url_path)
    file_path = os.path.join("CI5", file_name)
    volume_name = os.path.splitext(file_name)[0]
    volume_path = os.path.join("CI5", volume_name)
    
    if os.path.exists(volume_path):
        print(f"Volume {volume_name} exists in cache")
    else:
        download_if_not_exists(url, file_path)
        unzip(file_path, os.path.join("CI5", volume_name))

Volume CI5-Xd exists in cache
Volume CI5-IXd exists in cache
Volume CI5-VIIId exists in cache
Volume CI5-VIId exists in cache
Volume CI5-VI exists in cache
Volume CI5-V exists in cache


#### Process volumes

##### Process CI5-V

In [3]:
v_registry_df = pd.read_csv("CI5/CI5-V/registry.txt", sep="\t", index_col=0)
v_cases_df = pd.read_csv("CI5/CI5-V/cases.csv", index_col=0)

v_df = v_registry_df.join(v_cases_df, how="inner", lsuffix="_registry")
v_df["PERIOD"] = v_df["PERIOD_1"].astype(str) + '-' + v_df["PERIOD_2"].astype(str)

v_df = v_df[["PERIOD", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]]

v_df = v_df.groupby(["PERIOD"]).sum()

v_df

Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1973-1982,109,191,225,317,458,613
1977-1981,21619,26581,45167,71753,73766,102870
1978-1978,460,713,953,1440,1973,3098
1978-1979,72,144,331,516,1010,1501
1978-1980,57,89,198,454,666,1025
1978-1981,3327,5137,8708,14056,19693,31630
1978-1982,89948,139238,226030,332676,436280,636450
1979-1982,46021,60846,99846,184743,254224,380070
1980-1980,84,97,160,268,314,453
1980-1982,876,1221,1733,2691,3521,5299


##### Process CI5-VI

In [4]:
vi_registry_df = pd.read_csv("CI5/CI5-VI/registry.txt", sep="\t", index_col=0, names=["REGISTRY", "PERIOD_1", "PERIOD_2", "NAME"])
vi_cases_df = pd.read_csv("CI5/CI5-VI/cases.csv", index_col=0)
vi_df = vi_registry_df.join(vi_cases_df, how="inner", lsuffix="_registry")
vi_df["PERIOD"] = vi_df["PERIOD_1"].astype(str) + '-' + vi_df["PERIOD_2"].astype(str)
vi_df = vi_df[["PERIOD", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]]
vi_df = vi_df.groupby(["PERIOD"]).sum()
vi_df


Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1981-1985,101,100,270,752,1318,1936
1982-1986,2773,3781,5695,10208,15114,17332
1982-1987,333,476,869,1399,2187,3259
1983-1984,38,51,78,156,228,341
1983-1985,137,202,229,308,607,940
1983-1986,1118,1852,2802,4852,7770,11256
1983-1987,69574,113339,185680,285256,430059,590906
1984-1985,18,28,29,47,158,208
1984-1986,59,63,76,268,502,679
1984-1987,974,1332,2129,3189,4472,5211


##### Process CI5-VIId

In [5]:
viid_registry_df = pd.read_csv("CI5/CI5-VIId/registry.txt", index_col=0, names=["REGISTRY", "NAME"])
# remove the first row as it is a broken header
viid_registry_df = viid_registry_df[1:]

viid_name_split_df = viid_registry_df["NAME"].str.extract(r"(.+)\s+\((\d+)-(\d+)\)", expand=True)
viid_name_split_df
viid_registry_df[["PERIOD_1", "PERIOD_2"]] = viid_name_split_df[[1,2]].rename(columns= {1: "PERIOD_1", 2: "PERIOD_2"})
viid_registry_df.index =  viid_registry_df.index.astype(int)

viid_cases_df = pd.read_csv("CI5/CI5-VIId/CI5VII.csv", names=["REGISTRY", "SEX", "CANCER_NUMBER", "AGE", "CASES_COUNT", "PERSON_YEARS"])
viid_cases_df["AGE"].replace({1: "N0_4", 2: "N5_9", 3: "N10_14", 4: "N15_19", 5: "N20_24", 6: "N25_29", 7: "N30_34", 8: "N35_39", 9: "N40_44", 10: "N45_49", 11: "N50_54", 12: "N55_59", 13: "N60_64", 14: "N65_69", 15: "N70_74", 16: "N75_79", 17: "N80_84", 18: "N85+", 19: "N_UNK"}, inplace=True)
viid_cases_df = viid_cases_df.groupby(["REGISTRY", "AGE"])["CASES_COUNT"].sum().to_frame().reset_index().pivot(index="REGISTRY", columns="AGE", values="CASES_COUNT")


viid_df = viid_registry_df.join(viid_cases_df, how="inner", lsuffix="_registry")
viid_df["PERIOD"] = viid_df["PERIOD_1"].astype(str) + '-' + viid_df["PERIOD_2"].astype(str)
viid_df = viid_df[["PERIOD", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]].groupby(["PERIOD"]).sum()
viid_df


Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1983-1992,60,151,125,295,362,361
1986-1990,494,592,980,1590,3188,5433
1986-1992,55,40,98,98,151,235
1987-1991,633,968,1397,2270,2816,3951
1987-1992,1827,2874,4213,6356,10797,18241
1988-1989,1874,3788,7193,9600,11424,12735
1988-1990,8777,15118,23959,33124,48376,80768
1988-1991,3383,4693,7147,10743,17248,26871
1988-1992,128209,199744,344032,532654,765721,1088390
1988-1993,690,916,1686,2014,2684,2764


##### Process CI5-VIIId

In [6]:
viiid_registry_df = pd.read_table("CI5/CI5-VIIId/registry.txt", index_col=0)
viiid_registry_df = viiid_registry_df.index.str.extract(r"\s*(\d+)\s+(.*)\((\d+)(-(\d+))?\)", expand=True).drop(columns=3)
viiid_registry_df = viiid_registry_df.rename(columns= {0: "REGISTRY", 1: "NAME", 2: "PERIOD_1", 4: "PERIOD_2"}).set_index("REGISTRY")
viiid_registry_df.index = viiid_registry_df.index.astype(int)

viiid_cases_df = pd.read_csv("CI5/CI5-VIIId/CI5-VIII.csv", names=["REGISTRY", "SEX", "CANCER_NUMBER", "AGE", "CASES_COUNT", "PERSON_YEARS"])
viiid_cases_df["AGE"].replace({1: "N0_4", 2: "N5_9", 3: "N10_14", 4: "N15_19", 5: "N20_24", 6: "N25_29", 7: "N30_34", 8: "N35_39", 9: "N40_44", 10: "N45_49", 11: "N50_54", 12: "N55_59", 13: "N60_64", 14: "N65_69", 15: "N70_74", 16: "N75_79", 17: "N80_84", 18: "N85+", 19: "N_UNK"}, inplace=True)
viiid_cases_df = viiid_cases_df.groupby(["REGISTRY", "AGE"])["CASES_COUNT"].sum().to_frame().reset_index().pivot(index="REGISTRY", columns="AGE", values="CASES_COUNT")


viiid_df = viiid_registry_df.join(viid_cases_df, how="inner", lsuffix="_registry")
viiid_df = viiid_df[["PERIOD_1", "PERIOD_2", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]]

# Fix broken record for Taiwan
viiid_df.loc[81, 'PERIOD_1'] = 1993
viiid_df.loc[81, 'PERIOD_2'] = 1997

viiid_df["PERIOD"] = viiid_df["PERIOD_1"].astype(str) + '-' + viiid_df["PERIOD_2"].astype(str)
viiid_df = viiid_df.groupby(["PERIOD"]).sum()

viiid_df

Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1983-1997,60,151,125,295,362,361
1991-1995,1808,2912,4598,6919,8854,12016
1992-1993,108,109,169,318,377,456
1992-1995,866,1456,2390,3098,4561,7543
1992-1996,220,390,794,1286,1720,2166
1993-1994,182,200,408,587,751,1042
1993-1995,763,1144,1546,3125,5460,8027
1993-1996,7958,12858,19452,28071,36544,46988
1993-1997,124159,196708,340851,522488,756035,1093152
1994-1996,378,1042,1550,1767,1679,2256


##### Preprocess CI5-IXd

In [7]:
ixd_registry_df = pd.read_table("CI5/CI5-IXd/registry.txt", names=["REGISTRY", "NAME"], index_col=0)
ixd_registry_df = ixd_registry_df["NAME"].str.extract(r"\s*(.*)\s*\((\d+)-(\d+)\)", expand=True)
ixd_registry_df = ixd_registry_df.rename(columns= {0: "NAME", 1: "PERIOD_1", 2: "PERIOD_2"})

registry_dfs = []
for registry in ixd_registry_df.index:
    df = pd.read_csv(f"CI5/CI5-IXd/{registry}.csv", names=["SEX", "CANCER_NUMBER", "AGE", "CASES_COUNT", "PERSON_YEARS"])
    df['REGISTRY'] = registry
    registry_dfs.append(df)

ixd_cases_df = pd.concat(registry_dfs)
ixd_cases_df["AGE"].replace({1: "N0_4", 2: "N5_9", 3: "N10_14", 4: "N15_19", 5: "N20_24", 6: "N25_29", 7: "N30_34", 8: "N35_39", 9: "N40_44", 10: "N45_49", 11: "N50_54", 12: "N55_59", 13: "N60_64", 14: "N65_69", 15: "N70_74", 16: "N75_79", 17: "N80_84", 18: "N85+", 19: "N_UNK"}, inplace=True)
ixd_cases_df = ixd_cases_df.groupby(["REGISTRY", "AGE"])["CASES_COUNT"].sum().to_frame().reset_index().pivot(index="REGISTRY", columns="AGE", values="CASES_COUNT")


ixd_df = ixd_registry_df.join(ixd_cases_df, how="inner", lsuffix="_registry")
ixd_df["PERIOD"] = ixd_df["PERIOD_1"].astype(str) + '-' + ixd_df["PERIOD_2"].astype(str)
ixd_df = ixd_df[["PERIOD", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]]
ixd_df = ixd_df.groupby(["PERIOD"]).sum()
ixd_df

Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1996-2000,781,1313,1990,2940,5490,8848
1997-2001,903,1477,2268,3307,5062,7496
1998-2000,326,604,1031,1901,2790,4273
1998-2001,4394,6821,10911,16854,26314,38890
1998-2002,537275,779860,1257547,2011714,3197234,5082025
1999-2001,1054,1884,2758,4147,6544,10399
1999-2002,14162,19930,35855,63583,104313,164389
2000-2002,1082,1100,2286,3735,6616,9427


##### Preprocess CI5-Xd

In [8]:
with codecs.open("CI5/CI5-Xd/registry.txt", 'r', 'utf8', errors="ignore") as ff:
    content = ff.read()

xd_registry_df = pd.read_table(io.StringIO(content), names=["REGISTRY", "NAME"], index_col=0)
xd_registry_df
xd_registry_df = xd_registry_df["NAME"].str.extract(r"\s*(.*)\s*\((\d+)-(?:\d+,\d+-)?(\d+)\)", expand=True)
xd_registry_df
xd_registry_df = xd_registry_df.rename(columns= {0: "NAME", 1: "PERIOD_1", 2: "PERIOD_2"})

registry_dfs = []
for registry in xd_registry_df.index:
    df = pd.read_csv(f"CI5/CI5-Xd/{registry}.csv", names=["SEX", "CANCER_NUMBER", "AGE", "CASES_COUNT", "PERSON_YEARS"])
    df['REGISTRY'] = registry
    registry_dfs.append(df)

xd_cases_df = pd.concat(registry_dfs)
xd_cases_df

xd_cases_df["AGE"].replace({1: "N0_4", 2: "N5_9", 3: "N10_14", 4: "N15_19", 5: "N20_24", 6: "N25_29", 7: "N30_34", 8: "N35_39", 9: "N40_44", 10: "N45_49", 11: "N50_54", 12: "N55_59", 13: "N60_64", 14: "N65_69", 15: "N70_74", 16: "N75_79", 17: "N80_84", 18: "N85+", 19: "N_UNK"}, inplace=True)
xd_cases_df = xd_cases_df.groupby(["REGISTRY", "AGE"])["CASES_COUNT"].sum().to_frame().reset_index().pivot(index="REGISTRY", columns="AGE", values="CASES_COUNT")


xd_df = xd_registry_df.join(xd_cases_df, how="inner", lsuffix="_registry")
xd_df["PERIOD"] = xd_df["PERIOD_1"].astype(str) + '-' + xd_df["PERIOD_2"].astype(str)
xd_df = xd_df[["PERIOD", "N15_19", "N20_24", "N25_29", "N30_34", "N35_39", "N40_44"]]
xd_df = xd_df.groupby(["PERIOD"]).sum()
xd_df

Unnamed: 0_level_0,N15_19,N20_24,N25_29,N30_34,N35_39,N40_44
PERIOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2003-2005,2466,3390,5447,8052,12140,18517
2003-2006,4069,6424,11046,18414,28424,40588
2003-2007,751831,1155257,1748001,2683177,4147578,6988975
2004-2007,6383,9179,13859,22064,36656,59308
2005-2007,3339,5144,7793,11450,17771,25631


In [9]:
preprocessed_df = pd.concat([v_df, vi_df, viid_df, viiid_df, ixd_df, xd_df])
print(preprocessed_df.to_string())

           N15_19   N20_24   N25_29   N30_34   N35_39   N40_44
PERIOD                                                        
1973-1982     109      191      225      317      458      613
1977-1981   21619    26581    45167    71753    73766   102870
1978-1978     460      713      953     1440     1973     3098
1978-1979      72      144      331      516     1010     1501
1978-1980      57       89      198      454      666     1025
1978-1981    3327     5137     8708    14056    19693    31630
1978-1982   89948   139238   226030   332676   436280   636450
1979-1982   46021    60846    99846   184743   254224   380070
1980-1980      84       97      160      268      314      453
1980-1982     876     1221     1733     2691     3521     5299
1980-1983      65       89      139      158      301      449
1981-1982     243      475      816      797     1131     1811
1982-1982     542      918     1568     2277     3218     4093
1981-1985     101      100      270      752     1318  

## Limitations

In this study, I will try to find whether there is a tendency for an increase in the incidence in the selected age group, but the reasons for such an increase are beyond the scope of this study.

## References

<a id='ref_1'>[1]</a> "What Is Cancer?" by National Cancer Institute (2021, May 5) [Online]. Available: https://www.cancer.gov/about-cancer/understanding/what-is-cancer

<a id='ref_2'>[2]</a> "Age and Cancer Risk" Am J Prev Med. 2014 Mar; 46(3 0 1): S7–15. [Online]. Available: https://doi.org/10.1016/j.amepre.2013.10.029

<a id='ref_3'>[3]</a> "Childhood Cancers" by National Cancer Institute (2021, April 12) [Online]. Available: https://www.cancer.gov/types/childhood-cancers

<a id='ref_4'>[4]</a> "The Challenging Landscape of Cancer and Aging: Charting a Way Forward" by Norman E. Sharpless, M.D. (2018, January 24) [Online]. Available: https://www.cancer.gov/news-events/cancer-currents-blog/2018/sharpless-aging-cancer-research

<a id='ref_5'>[5]</a> di Martino, E., Smith, L., Bradley, S.H. et al. Incidence trends for twelve cancers in younger adults—a rapid review. Br J Cancer 126, 1374–1386 (2022). [Online]. Available: https://doi.org/10.1038/s41416-022-01704-x

<a id='ref_6'>[6]</a> Scott AR, Stoltzfus KC, Tchelebi LT, et al. Trends in Cancer Incidence in US Adolescents and Young Adults, 1973-2015. JAMA Netw Open. 2020;3(12):e2027738. [Online]. Available: https://doi.org/10.1001/jamanetworkopen.2020.27738
