In [None]:
import pandas as pd
from scipy import stats

# Cancer and BMI

In this notebook we explore the relationship between BMI and cancer. We will use clinical datasets, which describe the patient or sample rather than genes or proteins. The NIH National Cancer Institute has a page dedicated to the connection between BMI and cancer; it explains what BMI is, how a high BMI affects the body, and what current research has found linking a high BMI to cancer.

The NIH page provides a chart that shows the BMI ranges for people aged 20 and above. We will use these numbers as the cutoffs for the BMI categories. There is a different chart for children that uses weight percentiles, but the patients in our dataset will all be adults so we don't need to consider that chart.

Below 18.5: Underweight <br/>
18.5 to 24.9: Healthy <br/>
25.0 to 29.9: Overweight <br/>
30.0 to 39.9: Obese<br/>
40.0 or higher: Severely Obese
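
The cutoffs above can be turned into categories with `pd.cut`. The following is a minimal sketch on made-up BMI values (not taken from the clinical datasets); the names `bmi_edges`, `bmi_labels`, and `example_bmi` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Lower edge of each NIH adult BMI band; the last band is open-ended.
bmi_edges = [0, 18.5, 25.0, 30.0, 40.0, np.inf]
bmi_labels = ['Underweight', 'Healthy', 'Overweight', 'Obese', 'Severely Obese']

# Illustrative values only, chosen to sit on and around the cutoffs.
example_bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 39.9, 40.0])

# right=False makes every bin [lower, upper), so 25.0 lands in 'Overweight'.
example_category = pd.cut(example_bmi, bins=bmi_edges, labels=bmi_labels, right=False)
print(example_category.tolist())
```

Because `pd.cut` returns an ordered categorical, the categories also keep their clinical order when sorted or grouped.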


There are many kinds of cancer, so we need to narrow down which types to look at. Preferably these types will have a known association with BMI. The NIH page also has a chart listing each cancer type and how many times more likely a person is to develop it if they are overweight, obese, or severely obese. There is no cancer known to be associated with being underweight.

To narrow down which cancer types to explore, we take the cancers that are at least 2 times as likely in people with a high BMI.



<strong>Endometrial:</strong> <br/>
7 times as likely in people with severe obesity <br/>
2-4 times more likely in people with obesity or who are overweight <br/>

<strong>Esophageal Adenocarcinoma:</strong> <br/>
4.8 times as likely in people with severe obesity <br/>
2.4-2.7 times as likely in people with obesity <br/>
1.5 times as likely in people who are overweight <br/>

<strong>Gastric Cardia:</strong> <br/>
2 times as likely in people with obesity <br/>

<strong>Liver:</strong> <br/>
2 times as likely in people with obesity or who are overweight <br/>

<strong>Kidney:</strong> <br/>
2 times as likely in people with obesity or who are overweight <br/>

This information was taken from the NIH National Cancer Institute page.


# References
The NIH page also provides references to the papers that show these statistics for BMI and cancer.

<strong>Endometrial:</strong> <br/>
<ul>
Setiawan VW, Yang HP, Pike MC, et al. Type I and II endometrial cancers: Have they different risk factors? <em>Journal of Clinical Oncology</em> 2013; 31(20):2607-2618.

Dougan MM, Hankinson SE, Vivo ID, et al. Prospective study of body size throughout the life course and the incidence of endometrial cancer among premenopausal and postmenopausal women. <em>International Journal of Cancer</em> 2015; 137(3):625-637.
</ul>

<strong>Esophageal Adenocarcinoma:</strong> <br/>
<ul>
Hoyo C, Cook MB, Kamangar F, et al. Body mass index in relation to oesophageal and oesophagogastric junction adenocarcinomas: A pooled analysis from the International BEACON Consortium. <em>International Journal of Epidemiology</em> 2012; 41(6):1706-1718.
</ul>

<strong>Gastric Cardia:</strong>  <br/>
<ul>
Chen Y, Liu L, Wang X, et al. Body mass index and risk of gastric cancer: A meta-analysis of a population with more than ten million from 24 prospective studies. <em>Cancer Epidemiology, Biomarkers & Prevention</em> 2013; 22(8):1395–1408.
</ul>

<strong>Liver:</strong> <br/>
<ul>
Chen Y, Wang X, Wang J, Yan Z, Luo J. Excess body weight and the risk of primary liver cancer: An updated meta-analysis of prospective studies.<em> European Journal of Cancer </em> 2012; 48(14):2137–2145.

Campbell PT, Newton CC, Freedman ND, et al. Body mass index, waist circumference, diabetes, and risk of liver cancer for U.S. adults. <em>Cancer Research</em> 2016; 76(20):6076–6083.
</ul>

<strong>Kidney:</strong> <br>
<ul>
Wang F, Xu Y. Body mass index and risk of renal cell cancer: A dose-response meta-analysis of published cohort studies. <em>International Journal of Cancer</em> 2014; 135(7):1673–1686. (slightly higher in women)

Sanfilippo KM, McTigue KM, Fidler CJ, et al. Hypertension and obesity and the risk of kidney cancer in 2 large cohorts of US men and women. <em>Hypertension</em> 2014; 63(5):934–941.
</ul>




# Obtaining clinical data from cBioPortal

1. Open https://www.cbioportal.org/
2. Choose the tissue you want to look at from the left bar
3. Pick a study from the list, then click “Explore Selected Studies”
4. Select the Clinical Data tab at the top
5. Click “Columns” on the top right and select the columns that will be useful to you
6. Click the “Download TSV” button below the search bar

**Making your data accessible in Google Colab notebook**

1. Upload your data to your personal Google Drive
2. Go to the sharing options and set it so that anyone with the link can view
3. Copy the sharing link (or just the file ID contained in it)
4. Run the command `!gdown '<link or file ID>' --fuzzy`

The file should now be accessible in the runtime, and other collaborators can use the same command to access it.
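
If you prefer to pass only the file ID, it can be pulled out of the sharing link with a regular expression. This is a rough sketch: `drive_file_id` is a hypothetical helper, and the `/file/d/<ID>/view` link shape in the example is an assumption about how the sharing link looks.

```python
import re

def drive_file_id(sharing_link):
    """Hypothetical helper: extract the file ID from a Google Drive link."""
    # Sharing links usually carry the ID after '/d/'; direct links use '?id='.
    match = (re.search(r'/d/([A-Za-z0-9_-]+)', sharing_link)
             or re.search(r'[?&]id=([A-Za-z0-9_-]+)', sharing_link))
    if match is None:
        raise ValueError(f'No file ID found in: {sharing_link}')
    return match.group(1)

# Assumed link shape, built around the endometrial file ID used in this notebook.
example_link = 'https://drive.google.com/file/d/1QUrGUrgen7DtRKe5rI79c5hNtx2tbYEj/view?usp=sharing'
print(drive_file_id(example_link))
```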


In [None]:
!gdown 1QUrGUrgen7DtRKe5rI79c5hNtx2tbYEj --fuzzy #endometrial
!gdown 1jj8IqouXvC1wUZwlDG8R_FVkIdQEVSwK --fuzzy #esophageal
!gdown 1H44zgIepOhzoI01b-DLl44tA_t5-CdKo --fuzzy #kidney
!gdown 1tZGBpa-XYaXgceHqkVGL8bj-Pv-s501J --fuzzy #liver
!gdown 1TtfeSmUkva4XRIy1tRn29HEHlzVGlYe3 --fuzzy #skin

Downloading...
From: https://drive.google.com/uc?id=1QUrGUrgen7DtRKe5rI79c5hNtx2tbYEj
To: /content/ucec_tcga_clinical_data.tsv
100% 436k/436k [00:00<00:00, 120MB/s]
Downloading...
From: https://drive.google.com/uc?id=1jj8IqouXvC1wUZwlDG8R_FVkIdQEVSwK
To: /content/esca_tcga_clinical_data.tsv
100% 148k/148k [00:00<00:00, 74.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1H44zgIepOhzoI01b-DLl44tA_t5-CdKo
To: /content/kirp_tcga_clinical_data.tsv
100% 204k/204k [00:00<00:00, 93.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1tZGBpa-XYaXgceHqkVGL8bj-Pv-s501J
To: /content/lihc_tcga_clinical_data.tsv
100% 278k/278k [00:00<00:00, 99.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1TtfeSmUkva4XRIy1tRn29HEHlzVGlYe3
To: /content/skcm_tcga_clinical_data.tsv
100% 288k/288k [00:00<00:00, 95.4MB/s]


In [None]:
esca_file = '/content/esca_tcga_clinical_data.tsv'
kirp_file = '/content/kirp_tcga_clinical_data.tsv'
lihc_file = '/content/lihc_tcga_clinical_data.tsv'
ucec_file = '/content/ucec_tcga_clinical_data.tsv'
skcm_file = '/content/skcm_tcga_clinical_data.tsv'

In [None]:
esca_dataframe = pd.read_csv(esca_file, sep='\t')
kirp_dataframe = pd.read_csv(kirp_file, sep='\t')
lihc_dataframe = pd.read_csv(lihc_file, sep='\t')
ucec_dataframe = pd.read_csv(ucec_file, sep='\t')
skcm_dataframe = pd.read_csv(skcm_file, sep='\t')

# Manipulating and Combining Data

Before doing any analysis, we need to unify differently named columns across datasets, filter down to the relevant columns, repair missing values, and derive new columns from existing ones.
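
One way to see which columns the datasets share is a set intersection. The sketch below uses empty toy dataframes; the column names that differ between them ('Smoking History', 'Laterality', 'Menopause Status') are hypothetical stand-ins, not necessarily columns in the real files.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the clinical tables; only the column names matter here.
toy_frames = {
    'esca': pd.DataFrame(columns=['Patient ID', 'Sex', 'Patient Weight', 'Smoking History']),
    'kirp': pd.DataFrame(columns=['Patient ID', 'Sex', 'Patient Weight', 'Laterality']),
    'ucec': pd.DataFrame(columns=['Patient ID', 'Sex', 'Patient Weight', 'Menopause Status']),
}

# Intersect the column sets of all dataframes.
common = reduce(lambda a, b: a & b, (set(df.columns) for df in toy_frames.values()))
print(sorted(common))

# Columns that exist in only some datasets are candidates for renaming or dropping.
extra = {name: sorted(set(df.columns) - common) for name, df in toy_frames.items()}
print(extra)
```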

In [None]:
#used this to determine which columns were common between datasets before manually looking through to determine which columns we needed
commonColumns = []
for column in list(esca_dataframe.columns):
  if column in list(kirp_dataframe.columns):
    if column in list(lihc_dataframe.columns):
      if column in list(ucec_dataframe.columns):
        commonColumns.append(column)
print(len(list(esca_dataframe.columns)))
print(len(commonColumns))
sorted(commonColumns)
filter_columns = ['American Joint Committee on Cancer Tumor Stage Code',
'Cancer Type',
'Cancer Type Detailed',
'Diagnosis Age',
'Disease Free (Months)',
'Disease Free Status',
'Ethnicity Category',
'Race Category',
'Neoplasm Disease Stage American Joint Committee on Cancer Code',
'Other Patient ID',
'Other Sample ID',
'Overall Survival (Months)',
'Overall Survival Status',
'Patient Height',
'Patient ID',
'Patient Weight',
'Sample ID',
'Sex',
'Stage Other']
#en_filtered = filter_columns
#en_filtered.append('Neoplasm American Joint Committee on Cancer Clinical Group Stage')


112
68


In [None]:
#Filter columns on dataframes to the selected relevant entries
esca_dataframe_filtered = esca_dataframe[filter_columns]
kirp_dataframe_filtered = kirp_dataframe[filter_columns]
lihc_dataframe_filtered = lihc_dataframe[filter_columns]
skcm_dataframe_filtered = skcm_dataframe[filter_columns]

In [None]:
#Build a new list so that filter_columns itself is not modified
en_filtered = filter_columns + ['Neoplasm American Joint Committee on Cancer Clinical Group Stage']
ucec_dataframe_filtered = ucec_dataframe[en_filtered]

In [None]:
#Melanoma dataset is mostly NaN for cancer type, presumably because they didn't feel the need to include it in a set of only melanomas
#This changes the cancer type to melanoma and the detailed cancer type to nonspecified melanoma for NaN values
#Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
skcm_dataframe_filtered = skcm_dataframe_filtered.copy()
skcm_dataframe_filtered['Cancer Type'] = 'Melanoma'
skcm_dataframe_filtered['Cancer Type Detailed'] = skcm_dataframe_filtered['Cancer Type Detailed'].fillna('Nonspecified Melanoma')

#Filter out cancer types from the esophageal dataset that are not Esophageal Adenocarcinoma
#esca_dataframe_filtered = esca_dataframe_filtered.loc[esca_dataframe_filtered['Cancer Type Detailed'] == 'Esophageal Adenocarcinoma']


In [None]:
#Combining all the datasets
datasets = [esca_dataframe_filtered, kirp_dataframe_filtered, lihc_dataframe_filtered, skcm_dataframe_filtered, ucec_dataframe_filtered]
combined = pd.concat(datasets)
combined

Unnamed: 0,American Joint Committee on Cancer Tumor Stage Code,Cancer Type,Cancer Type Detailed,Diagnosis Age,Disease Free (Months),Disease Free Status,Ethnicity Category,Race Category,Neoplasm Disease Stage American Joint Committee on Cancer Code,Other Patient ID,Other Sample ID,Overall Survival (Months),Overall Survival Status,Patient Height,Patient ID,Patient Weight,Sample ID,Sex,Stage Other,Neoplasm American Joint Committee on Cancer Clinical Group Stage
0,T3,Esophagogastric Cancer,Esophageal Adenocarcinoma,67.0,5.65,1:Recurred/Progressed,,,,0500F1A6-A528-43F3-B035-12D3B7C99C0F,BB3B0DDF-9896-4A64-B59A-5A545383CBF0,25.76,1:DECEASED,183.0,TCGA-2H-A9GF,95.0,TCGA-2H-A9GF-01,Male,,
1,T3,Esophagogastric Cancer,Esophageal Adenocarcinoma,66.0,16.56,1:Recurred/Progressed,,,,70084008-697D-442D-8F74-C12F8F598570,0FA8DD36-C202-4643-9ACA-C5635BF56CF1,20.04,1:DECEASED,178.0,TCGA-2H-A9GG,74.0,TCGA-2H-A9GG-01,Male,,
2,T1,Esophagogastric Cancer,Esophageal Adenocarcinoma,44.0,26.28,1:Recurred/Progressed,,,,606DC5B8-7625-42A6-A936-504EF25623A4,17DF1FC1-16B7-4898-8359-413D5F143413,31.24,1:DECEASED,183.0,TCGA-2H-A9GH,91.0,TCGA-2H-A9GH-01,Male,,
3,T3,Esophagogastric Cancer,Esophageal Adenocarcinoma,68.0,13.14,1:Recurred/Progressed,,,,CEAF98F8-517E-457A-BF29-ACFE22893D49,0934D8AF-67FB-4917-A687-6AFD8948B7D9,14.29,1:DECEASED,188.0,TCGA-2H-A9GI,100.0,TCGA-2H-A9GI-01,Male,,
4,T1,Esophagogastric Cancer,Esophageal Adenocarcinoma,57.0,41.43,1:Recurred/Progressed,,,,EE47CD59-C8D8-4B1E-96DB-91C679E4106F,4F4BC0A9-262E-4828-AEE8-068378E820BF,58.51,1:DECEASED,189.0,TCGA-2H-A9GJ,70.0,TCGA-2H-A9GJ-01,Male,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
544,,Endometrial Cancer,Uterine Serous Carcinoma/Uterine Papillary Ser...,85.0,,,NOT HISPANIC OR LATINO,BLACK OR AFRICAN AMERICAN,,54AC877A-52FF-450C-9DF9-B7CD3FC8E2E2,7BC29ACD-1306-4439-9042-27A9115EB563,4.43,1:DECEASED,168.0,TCGA-QS-A8F1,75.0,TCGA-QS-A8F1-01,Female,,Stage IIIA
545,,Endometrial Cancer,Uterine Endometrioid Carcinoma,64.0,20.73,0:DiseaseFree,NOT HISPANIC OR LATINO,BLACK OR AFRICAN AMERICAN,,3C2A6E30-A507-49F6-8B1F-36EB3AA41E60,629770A3-4982-480B-A066-E85C609DFA3F,20.73,0:LIVING,66.0,TCGA-SJ-A6ZI,93.0,TCGA-SJ-A6ZI-01,Female,,Stage IB
546,,Endometrial Cancer,Uterine Endometrioid Carcinoma,61.0,18.27,0:DiseaseFree,NOT HISPANIC OR LATINO,BLACK OR AFRICAN AMERICAN,,FBFED398-2A44-44A9-83CF-657C29CB7D28,C3064816-C2C1-4AF0-8F0E-89D900E47D20,18.27,0:LIVING,168.0,TCGA-SJ-A6ZJ,132.0,TCGA-SJ-A6ZJ-01,Female,,Stage IB
547,,Endometrial Cancer,Uterine Endometrioid Carcinoma,73.0,0.07,0:DiseaseFree,NOT HISPANIC OR LATINO,BLACK OR AFRICAN AMERICAN,,C1F1DC90-2C67-4862-9E5F-078B861FFC6E,8A93E942-EE48-421A-A696-2B367C6C3867,0.07,0:LIVING,167.0,TCGA-SL-A6J9,88.0,TCGA-SL-A6J9-01,Female,,Stage IB


In [None]:
#Endometrial stage data comes in a different column, so we need an apply function to normalize the stage across endometrial and other datasets
def normalize(row):
  stageI = ['Stage I', 'Stage IA', 'Stage IB', 'Stage IC']
  stageII = ['Stage II', 'Stage IIA', 'Stage IIB']
  stageIII = ['Stage III', 'Stage IIIA', 'Stage IIIB','Stage IIIC', 'Stage IIIC1', 'Stage IIIC2']
  stageIV = ['Stage IV', 'Stage IVA', 'Stage IVB']
  t1 = ['T1a', 'T1', 'T1b']
  t2 = ['T2', 'T2a', 'T2b']
  t3 = ['T3', 'T3a', 'T3b', 'T3c']
  t4 = ['T4', 'T4a']
  tumor_code = row['American Joint Committee on Cancer Tumor Stage Code']
  group_stage = row['Neoplasm American Joint Committee on Cancer Clinical Group Stage']
  #The result is called stage rather than type so it does not shadow the Python builtin
  if tumor_code in t1 or group_stage in stageI:
    stage = 'T1'
  elif tumor_code in t2 or group_stage in stageII:
    stage = 'T2'
  elif tumor_code in t3 or group_stage in stageIII:
    stage = 'T3'
  elif tumor_code in t4 or group_stage in stageIV:
    stage = 'T4'
  else:
    stage = 'Tx or T0'
  return stage

In [None]:
#Apply function that takes the height and weight from the columns and gives the BMI in a new column
def calc_BMI(row):
  w = row['Patient Weight']
  h = (row['Patient Height']) / 100
  BMI = w / h**2
  return BMI

def group_BMI(row):
  BMI = row['BMI']
  category = 'Underweight'
#BMI over 40 is morbidly obese, but there were none in the datasets examined, so including it created problems with ordering the variable
#  if BMI >= 40:
#    category = 'Morbidly Obese'
  if BMI >= 30:
    category = 'Obese'
  elif BMI >= 25:
    category = 'Overweight'
  elif BMI >= 18.5: #18.5 is the lower bound of the healthy range in the NIH chart
    category = 'Normal'
  return category
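
The row-wise `calc_BMI` can also be written as a single vectorized expression. Below is a minimal sketch on toy rows, assuming, as `calc_BMI` does, height in centimetres and weight in kilograms. The combined table above includes a recorded height of 66.0, which gives an implausible BMI if read as centimetres, so the sketch also masks heights outside an assumed plausible range (100 to 250 cm; this range is an illustrative choice, not part of the original analysis).

```python
import numpy as np
import pandas as pd

# Toy rows only; the third row mimics a height that is unlikely to be in cm.
toy = pd.DataFrame({
    'Patient Height': [183.0, 168.0, 66.0, np.nan],
    'Patient Weight': [95.0, 75.0, 93.0, 80.0],
})

# BMI = weight (kg) / height (m) squared, computed for all rows at once.
height_m = toy['Patient Height'] / 100
toy['BMI'] = toy['Patient Weight'] / height_m ** 2

# Assumed plausibility window for adult height in cm; NaN heights count as implausible.
plausible = toy['Patient Height'].between(100, 250)
toy['BMI_checked'] = toy['BMI'].where(plausible)
print(toy)
```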


# Questions to Explore
Answering Questions with Data

**Study Based Questions**

Which Cancer types are associated with BMI?

Based on studies, we expect certain cancer subtypes, such as esophageal adenocarcinoma, to be associated with higher BMI.

**Exploration**

Is there a BMI difference between the sexes in BMI-linked cancer types?

Is there a link between cancer stage and BMI in cancer patients?

Is there a link between age and BMI in cancer patients?

**Other Potential Questions that could be Explored**

Is a higher BMI associated with lower survival time past a certain age?

Are there mutated genes correlated with a high BMI, and do they differ between cancer types?

Do certain pathways change based on BMI? This could be answered with proteomic data.

In [None]:
combined['Stage code'] = combined.apply(normalize,axis=1)
combined['BMI'] = combined.apply(calc_BMI,axis=1)
combined['BMI_category'] = combined.apply(group_BMI,axis=1)
combined = combined.loc[combined['BMI'] > 0]
#in R you can order a categorical variable as a factor, haven't quite figured out how to do it in python yet
#combined_data['BMI_category'] = combined_data['BMI_category'].astype('category')
#combined_data['BMI_category'] = combined_data['BMI_category'].cat.reorder_categories(['Underweight','Normal','Overweight','Obese'])
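
One way to get an R-style ordered factor in pandas is `pd.Categorical` with `ordered=True`. The sketch below runs on a toy column rather than on `combined`; `toy` and `bmi_order` are illustrative names.

```python
import pandas as pd

# Toy column using the same category names that group_BMI produces.
toy = pd.DataFrame({'BMI_category': ['Obese', 'Normal', 'Underweight', 'Overweight', 'Normal']})

bmi_order = ['Underweight', 'Normal', 'Overweight', 'Obese']
toy['BMI_category'] = pd.Categorical(toy['BMI_category'], categories=bmi_order, ordered=True)

# Sorting now follows the clinical order instead of alphabetical order.
print(toy['BMI_category'].sort_values().tolist())

# Ordered categoricals also support comparisons against a category.
print((toy['BMI_category'] >= 'Overweight').tolist())
```

For plotly express figures, the same list can be passed as `category_orders={'BMI_category': bmi_order}` to control the axis order.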

# Early Look at Data Using Graphs

Graphs are a great way to visually explore the relationships between different variables before doing statistical analysis.

In [None]:
#General look at the distribution of BMI split by detailed cancer type and colored by general cancer type
#Melanoma included as a control for a cancer that is not known to be related to BMI
import plotly.express as px
box = px.box(combined, x = 'Cancer Type Detailed', y = 'BMI', color = 'Cancer Type', title = 'BMI by Cancer Type', range_y = [0,80])
box.show()

In [None]:
#After splitting BMI into categories, comparing how many fall into the categories by cancer type
bar = px.bar(combined, x = 'BMI_category', facet_col='Cancer Type', color = 'Cancer Type')
bar.show()
#It is pretty easy to see here that the associated cancer types have much higher proportions in the higher BMI categories than melanoma

In [None]:
#Plotting the relationship between BMI and survival between cancer types and split by sex
scatter = px.scatter(combined, x = 'BMI', y = 'Overall Survival (Months)', color = 'Cancer Type', facet_col = 'Sex',
                     range_x = [10,70], range_y = [0,250])
scatter.show()
#nothing particularly interesting pops out here or in other similar comparisons I tried, but I feel like this is a good way to visually see if there is a connection

# Statistical Analysis

Next we will look at two cancer types associated with high BMI and run statistics comparing BMI by sex, stage, and age. We will use the esophageal and endometrial datasets for these statistics. We'll first pull a couple of preliminary percentages to look for what could be interesting, and then run formal statistics to see whether the differences are significant.

In [None]:
#Function that takes a dataframe and returns a dictionary with the fraction of samples in each race category
def percentRace(df):
  race_labels = {
    'African_American': 'BLACK OR AFRICAN AMERICAN',
    'White': 'WHITE',
    'Asian': 'ASIAN',
    'Native': 'AMERICAN INDIAN OR ALASKA NATIVE',
    'Islander': 'NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER',
  }
  #A comprehension avoids a variable named dict, which would shadow the Python builtin
  return {name: len(df.loc[df['Race Category'] == label]) / len(df) for name, label in race_labels.items()}

In [None]:
#takes a dataframe and returns the percent of samples that are obese
def percentObese(df):
  count = df.loc[df['BMI'] >= 30] #a BMI of 30.0 or above counts as obese, matching group_BMI
  percent = len(count) / len(df)
  return percent

In [None]:
print(percentRace(combined))
print(percentObese(combined))

{'African_American': 0.10756972111553785, 'White': 0.6799468791500664, 'Asian': 0.15803452855245684, 'Native': 0.00398406374501992, 'Islander': 0.00597609561752988}
0.37317397078353254


# ESCA Analysis

Before we begin performing statistical tests, we first need to create the dataframes we want to work with. We also need to drop NaN values from the BMI column; otherwise the statistical test will return NaN. We will first run a sex comparison, so we need separate dataframes for men and women. Because esophageal adenocarcinoma is the subtype linked with BMI, we also filter the dataset to that cancer specifically.

In [None]:
#Take a copy so that adding the BMI column in the next cell does not raise SettingWithCopyWarning
filtered_esca = esca_dataframe.loc[esca_dataframe['Cancer Type Detailed'] == 'Esophageal Adenocarcinoma'].copy()

In [None]:
filtered_esca['BMI'] = filtered_esca.apply(calc_BMI,axis=1)

In [None]:
esca_dataframe_all = esca_dataframe
esca_dataframe_all['BMI'] = esca_dataframe_all.apply(calc_BMI,axis=1)

In [None]:
esca_dataframe_nadrop = filtered_esca[filtered_esca['BMI'].notna()]

In [None]:
esca_dataframe_nadrop_all = esca_dataframe_all[esca_dataframe_all['BMI'].notna()]

In [None]:
female_esca_dataframe = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['Sex'] == 'Female']

In [None]:
male_esca_dataframe = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['Sex'] == 'Male']

### Normality Comparison

A Shapiro-Wilk test is used to check whether data can be assumed to be normally distributed. Normality matters because some tests, such as the t-test and ANOVA, assume it. If your data are not normally distributed, those tests are not suited to your data.
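
As a rough illustration of how the result is read, the sketch below runs `stats.shapiro` on a synthetic right-skewed sample (generated here, not taken from the clinical data). The 0.05 threshold is the conventional choice and is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample; the seed only makes the sketch repeatable.
rng = np.random.default_rng(0)
skewed_sample = rng.lognormal(mean=3.3, sigma=0.6, size=200)

result = stats.shapiro(skewed_sample)

# A small p-value rejects the null hypothesis that the sample is normally distributed.
alpha = 0.05
looks_normal = result.pvalue >= alpha
print(f'W = {result.statistic:.3f}, p = {result.pvalue:.3g}, treat as normal: {looks_normal}')
```

When `looks_normal` is false for any group in a comparison, a nonparametric test is the safer choice.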

In [None]:
stats.shapiro(male_esca_dataframe['BMI'])

In [None]:
stats.shapiro(female_esca_dataframe['BMI'])

Because half of our comparison is not normal according to the Shapiro-Wilk test, we cannot use a parametric test. For our two-group comparison we will therefore use the Mann-Whitney U test, and for our multiple-group comparisons we will use the Kruskal-Wallis test.

### Sex and BMI Comparison

After splitting the dataset into men and women, we can run the Mann-Whitney U test to see whether BMI differs significantly between the sexes.
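
To show the shape of the call and its result, here is a small sketch on two made-up groups of BMI values; the numbers are illustrative and the 0.05 threshold is an assumed convention.

```python
from scipy import stats

# Made-up BMI values for two independent groups.
group_a = [22.1, 24.8, 27.5, 31.0, 35.2, 29.4]
group_b = [21.0, 23.3, 25.1, 26.7, 28.9]

# Two-sided test: the null hypothesis is that neither group tends to have larger values.
result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f'U = {result.statistic}, p = {result.pvalue:.3f}')

alpha = 0.05
print('significant difference' if result.pvalue < alpha else 'no significant difference')
```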

In [None]:
stats.mannwhitneyu(x=male_esca_dataframe['BMI'], y=female_esca_dataframe['BMI'], alternative = 'two-sided')

### Non-associated and Associated Comparison
We also know that Esophageal Adenocarcinoma is specifically linked to obesity, whereas Esophageal Squamous Cell Carcinoma is not explicitly known to be linked. To explore this, we split the dataset into one dataframe for each of these two cancers. We check normality first to confirm that a nonparametric test is the most fitting choice.

#### Normality Comparison

In [None]:
squamous_cell_esca_dataframe = esca_dataframe_nadrop_all.loc[esca_dataframe_nadrop_all['Cancer Type Detailed'] == 'Esophageal Squamous Cell Carcinoma']

In [None]:
adenocarcinoma_esca_dataframe = esca_dataframe_nadrop_all.loc[esca_dataframe_nadrop_all['Cancer Type Detailed'] == 'Esophageal Adenocarcinoma']

In [None]:
stats.shapiro(squamous_cell_esca_dataframe['BMI'])

In [None]:
stats.shapiro(adenocarcinoma_esca_dataframe['BMI'])

Since our Shapiro-Wilk test is significant, we will use the Mann-Whitney U test.

In [None]:
stats.mannwhitneyu(squamous_cell_esca_dataframe['BMI'], adenocarcinoma_esca_dataframe['BMI'])

We can also visualize our data with a box plot

In [None]:
import plotly.express as px
esca_combined = esca_dataframe_nadrop_all.loc[(esca_dataframe_nadrop_all['Cancer Type Detailed'] == 'Esophageal Squamous Cell Carcinoma') | (esca_dataframe_nadrop_all['Cancer Type Detailed'] == 'Esophageal Adenocarcinoma')]
esca_box = px.box(esca_combined, x = 'Cancer Type Detailed', y = 'BMI', color = 'Cancer Type Detailed', title = 'Nonassociated and Associated Comparison', range_y = [0,80])
esca_box.show()

We can see that we have a significant p-value for the non-associated versus associated cancer type, which is consistent with the trend reported in the papers cited at the beginning.

### Stage and BMI Comparison

To run a comparison between cancer stage and BMI we will need to split up the data into different dataframes for each stage. An easy way to do this is to use the .unique() function on the stage column. This will show you what the different stages are and then you can filter accordingly.

In [None]:
esca_dataframe['American Joint Committee on Cancer Tumor Stage Code'].unique()

In [None]:

esca_stage1 = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T1']
esca_stage2 = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T2']
esca_stage3 = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T3']
esca_stage4 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T4') | (esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T4a')]
esca_stage0 = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['American Joint Committee on Cancer Tumor Stage Code'] == 'T0']

After filtering the dataframes you may want to check the length of each one. A dataframe with a length of zero will break the statistical test, and you will get NaN for your p-value. If all of your dataframes have a length of zero, the filtering probably went wrong somewhere. You could also consider removing groups with a low count.
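
The length check and the removal of small groups can be done in one pass with `groupby`. This is a sketch on toy data; the column name `stage`, the BMI values, and the minimum group size of 3 are all assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data: four stages, one of which has only a single patient.
toy = pd.DataFrame({
    'stage': ['T1'] * 4 + ['T2'] * 5 + ['T3'] * 6 + ['T4'] * 1,
    'BMI': [24.1, 27.3, 30.2, 22.8,
            26.5, 31.7, 28.4, 35.0, 23.9,
            29.8, 33.1, 25.6, 27.0, 36.4, 21.5,
            30.9],
})

# One BMI series per stage, with missing values dropped.
groups = {stage: g['BMI'].dropna() for stage, g in toy.groupby('stage')}
print({stage: len(values) for stage, values in groups.items()})

# Keep only groups that reach the assumed minimum size before testing.
min_group_size = 3
kept = {stage: values for stage, values in groups.items() if len(values) >= min_group_size}

result = stats.kruskal(*kept.values())
print(sorted(kept), result.pvalue)
```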

In [None]:
print(len(esca_stage1))
print(len(esca_stage2))
print(len(esca_stage3))
print(len(esca_stage4))
print(len(esca_stage0))

We can now run the Kruskal-Wallis test using the BMI column from each of our filtered dataframes. This compares BMI across the stages to see whether it differs significantly by stage.

In [None]:
stats.kruskal(esca_stage1['BMI'], esca_stage2['BMI'], esca_stage3['BMI'],  esca_stage4['BMI'], esca_stage0['BMI'] )

### Age and BMI Comparison


The age comparison is very similar to the stage comparison, but how you split the dataframes depends on how you want to bin the ages. Here we use five-year bins, except that everyone aged 44 or younger is grouped into a single category.
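
The same binning can be sketched with `pd.cut` instead of one filter per bin. The ages and BMI values below are made up, and the names `age_edges`, `age_labels`, and `age_bin` are illustrative; the sketch assumes whole-number ages, as in these datasets.

```python
import pandas as pd

# Made-up ages and BMI values.
toy = pd.DataFrame({
    'Diagnosis Age': [38, 44, 45, 52, 59, 60, 67, 73, 81, 90],
    'BMI': [31.2, 28.4, 35.1, 26.7, 29.9, 33.0, 24.5, 27.8, 23.1, 22.4],
})

# Edges 0, 44, 49, ..., 94 give the bins (0, 44], (44, 49], ..., (89, 94].
age_edges = [0] + list(range(44, 95, 5))
age_labels = ['<=44'] + [f'{lo + 1}-{hi}' for lo, hi in zip(age_edges[1:-1], age_edges[2:])]

toy['age_bin'] = pd.cut(toy['Diagnosis Age'], bins=age_edges, labels=age_labels)
print(toy['age_bin'].astype(str).tolist())

# One BMI series per non-empty bin, ready to pass to a test such as stats.kruskal.
bmi_by_bin = [g['BMI'] for _, g in toy.groupby('age_bin', observed=True)]
print(len(bmi_by_bin))
```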

In [None]:
esca_dataframe_nadrop['Diagnosis Age'].unique()

In [None]:
esca_age_40_44 = esca_dataframe_nadrop.loc[esca_dataframe_nadrop['Diagnosis Age'] <= 44]
esca_age_45_49 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 49) & (esca_dataframe_nadrop['Diagnosis Age'] >= 45)]
esca_age_50_54 =esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 54) & (esca_dataframe_nadrop['Diagnosis Age'] >= 50)]
esca_age_55_59 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 59) & (esca_dataframe_nadrop['Diagnosis Age'] >= 55)]
esca_age_60_64 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 64) & (esca_dataframe_nadrop['Diagnosis Age'] >= 60)]
esca_age_65_69 =esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 69) & (esca_dataframe_nadrop['Diagnosis Age'] >= 65)]
esca_age_70_74 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 74) & (esca_dataframe_nadrop['Diagnosis Age'] >= 70)]
esca_age_75_79 =esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 79) & (esca_dataframe_nadrop['Diagnosis Age'] >= 75)]
esca_age_80_84 =esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 84) & (esca_dataframe_nadrop['Diagnosis Age'] >= 80)]
esca_age_85_89 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 89) & (esca_dataframe_nadrop['Diagnosis Age'] >= 85)]
esca_age_90_94 = esca_dataframe_nadrop.loc[(esca_dataframe_nadrop['Diagnosis Age'] <= 94) & (esca_dataframe_nadrop['Diagnosis Age'] >= 90)]

In [None]:
print(len(esca_age_40_44))
print(len(esca_age_45_49))
print(len(esca_age_50_54))
print(len(esca_age_55_59))
print(len(esca_age_60_64))
print(len(esca_age_65_69))
print(len(esca_age_70_74))
print(len(esca_age_75_79))
print(len(esca_age_80_84))
print(len(esca_age_85_89))
print(len(esca_age_90_94))

We again will give the kruskal function the BMI column from each of our dataframes to perform the statistical test.

In [None]:
stats.kruskal(esca_age_40_44['BMI'], esca_age_45_49['BMI'], esca_age_50_54['BMI'],  esca_age_55_59['BMI'],  esca_age_60_64['BMI'],esca_age_65_69['BMI'], esca_age_70_74['BMI'],esca_age_75_79['BMI'],esca_age_80_84['BMI'],esca_age_85_89['BMI'])

# UCEC Analysis
Since endometrial cancer occurs in the female reproductive organs, a sex analysis is not possible. Therefore this cancer will only have a stage and an age comparison.

In [None]:
bmi_endo = ucec_dataframe
bmi_endo['BMI'] = bmi_endo.apply(calc_BMI,axis=1)

In [None]:
ucec_dataframe_nadrop = bmi_endo[bmi_endo['BMI'].notna()]

### Non-associated and Associated Comparison
We also know that Uterine Endometrioid Carcinoma is specifically linked to obesity, whereas Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma is primarily caused by a mutation in a gene. To explore this, we split the UCEC dataset into one dataframe for each of these two cancers. We check normality first to confirm that a nonparametric test is the most fitting choice.

#### Normality Comparison

In [None]:
serous_ucec_dataframe = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop['Cancer Type Detailed'] == 'Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma']

In [None]:
endometriod_ucec_dataframe = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop['Cancer Type Detailed'] == 'Uterine Endometrioid Carcinoma']

In [None]:
stats.shapiro(serous_ucec_dataframe['BMI'])

ShapiroResult(statistic=0.920781672000885, pvalue=9.861155376711395e-06)

In [None]:
stats.shapiro(endometriod_ucec_dataframe['BMI'])

ShapiroResult(statistic=0.6416081190109253, pvalue=9.02367993905904e-28)

Since our Shapiro-Wilk test is significant, we will use the Mann-Whitney U test.

In [None]:
stats.mannwhitneyu(serous_ucec_dataframe['BMI'], endometriod_ucec_dataframe['BMI'])

MannwhitneyuResult(statistic=14265.0, pvalue=1.816098982562477e-06)

We can also visualize our data with a box plot

In [None]:
import plotly.express as px
endo_combined = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Cancer Type Detailed'] == 'Uterine Endometrioid Carcinoma') | (ucec_dataframe_nadrop['Cancer Type Detailed'] == 'Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma')]
ucec_box = px.box(endo_combined, x = 'Cancer Type Detailed', y = 'BMI', color = 'Cancer Type Detailed', title = 'Nonassociated and Associated Comparison', range_y = [0,80])
ucec_box.show()

Our p-value is significant, meaning we can reject the null hypothesis that there is no difference in BMI between the non-associated and associated cancer types.

### Stage and BMI Comparison

The endometrial cancer dataset uses a different column for the stage of cancer. This column uses different codes than the previous one, so you will need to find all the stage names again using the unique function.

In [None]:
ucec_dataframe_nadrop['Neoplasm American Joint Committee on Cancer Clinical Group Stage'].unique()

array(['Stage III', 'Stage I', 'Stage IA', 'Stage IIIA', 'Stage IIIB',
       'Stage IIIC2', 'Stage IB', 'Stage IIIC1', 'Stage II', 'Stage IC',
       'Stage IIIC', 'Stage IVB', 'Stage IIB', 'Stage IIA', 'Stage IVA',
       'Stage IV'], dtype=object)

In [None]:
stage_column = 'Neoplasm American Joint Committee on Cancer Clinical Group Stage'
ucec_stage1 = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop[stage_column].isin(['Stage I', 'Stage IA', 'Stage IB', 'Stage IC'])]
ucec_stage2 = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop[stage_column].isin(['Stage II', 'Stage IIA', 'Stage IIB'])]
ucec_stage3 = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop[stage_column].isin(['Stage III', 'Stage IIIA', 'Stage IIIB', 'Stage IIIC', 'Stage IIIC1', 'Stage IIIC2'])]
#'Stage IVA' needs the space to match the label listed by .unique() above
ucec_stage4 = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop[stage_column].isin(['Stage IV', 'Stage IVA', 'Stage IVB'])]




In [None]:
print(len(ucec_stage1))
print(len(ucec_stage2))
print(len(ucec_stage3))
print(len(ucec_stage4))

325
49
115
27


In [None]:
stats.kruskal(ucec_stage1['BMI'], ucec_stage2['BMI'], ucec_stage3['BMI'],  ucec_stage4['BMI'])

KruskalResult(statistic=3.2516275298841375, pvalue=0.354432134649887)

### Age and BMI Comparison

In [None]:
ucec_dataframe['Diagnosis Age'].unique()

array([59., 54., 69., 51., 67., 57., 61., 73., 79., 65., 75., 38., 76.,
       63., 44., 53., 58., 46., 64., 55., 81., 68., 74., 62., 47., 86.,
       31., 77., 41., 71., 52., 70., 60., 85., nan, 83., 90., 78., 56.,
       66., 82., 42., 33., 45., 39., 40., 37., 72., 80., 84., 87., 35.,
       50., 48., 88., 43., 89., 49., 34., 36.])

In [None]:
ucec_dataframe_nadrop = ucec_dataframe[ucec_dataframe['BMI'].notna()]

In [None]:
ucec_age_40_44 = ucec_dataframe_nadrop.loc[ucec_dataframe_nadrop['Diagnosis Age'] <= 44]
ucec_age_45_49 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 49) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 45)]
ucec_age_50_54 =ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 54) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 50)]
ucec_age_55_59 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 59) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 55)]
ucec_age_60_64 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 64) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 60)]
ucec_age_65_69 =ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 69) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 65)]
ucec_age_70_74 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 74) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 70)]
ucec_age_75_79 =ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 79) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 75)]
ucec_age_80_84 =ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 84) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 80)]
ucec_age_85_89 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 89) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 85)]
ucec_age_90_94 = ucec_dataframe_nadrop.loc[(ucec_dataframe_nadrop['Diagnosis Age'] <= 94) & (ucec_dataframe_nadrop['Diagnosis Age'] >= 90)]

In [None]:
print(len(ucec_age_40_44))
print(len(ucec_age_45_49))
print(len(ucec_age_50_54))
print(len(ucec_age_55_59))
print(len(ucec_age_70_74))
print(len(ucec_age_75_79))
print(len(ucec_age_80_84))
print(len(ucec_age_85_89))
print(len(ucec_age_90_94))

25
19
51
81
68
39
28
13
3


In [None]:
stats.kruskal(ucec_age_40_44['BMI'], ucec_age_45_49['BMI'], ucec_age_50_54['BMI'],  ucec_age_55_59['BMI'], ucec_age_60_64['BMI'],ucec_age_65_69['BMI'], ucec_age_70_74['BMI'],ucec_age_75_79['BMI'],ucec_age_80_84['BMI'],ucec_age_85_89['BMI'], ucec_age_90_94['BMI'] )

KruskalResult(statistic=35.98814674047432, pvalue=8.45718173354837e-05)

If you wanted to look at age as a continuous variable rather than a categorical one, you could run a linear regression and see how age relates to BMI in that way.
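
A numeric counterpart to a trendline is `stats.linregress`, which reports the slope, correlation, and p-value. The sketch below uses made-up ages and BMI values and regresses BMI on age; both the data and that choice of direction are assumptions for illustration.

```python
from scipy import stats

# Made-up values: diagnosis age in years and the matching BMI.
ages = [35, 42, 50, 55, 61, 66, 72, 80]
bmis = [41.0, 38.5, 36.2, 35.0, 33.1, 31.8, 30.0, 27.5]

# Ordinary least squares fit of BMI as a linear function of age.
fit = stats.linregress(ages, bmis)
print(f'slope = {fit.slope:.3f} BMI units per year, r = {fit.rvalue:.3f}, p = {fit.pvalue:.3g}')
```

The slope gives the estimated change in BMI per year of age, and the p-value tests whether that slope differs from zero.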

In [None]:
scatter2 = px.scatter(bmi_endo, x = 'BMI', y = 'Diagnosis Age', color_discrete_sequence=['green'],
                      range_x = [15,70], range_y = [30,95],
                      trendline = 'ols', trendline_scope="overall", trendline_color_override='blue')
scatter2.show()