# Challenge 4: Preventing Melanomas

Challenge Statement: A cancer diagnosis is a significant moment for a patient and their support network. It often leads into a long period of treatment and support. Prevention and early detection play a big role in positive outcomes. How could we better track and support cancer patients through this journey?

Can we pre-emptively identify potential patients through targeted interviews and other ongoing cancer detection programs?


## Supporting Mentors

Richard Trethick – Department of Health

## Potential Focus Questions

This is a complex Challenge and you may find it useful to consider the following questions when building out your prototype:

    - Can we use different data types to predict the prevalence of melanoma (coastal locations, UV irradiance, proximity to medical services, etc.)?  

    - Can we help group cancer patients according to their cancer stage?  

    - Can we predict a cancer patient’s outcome based on their participation in interviews and programs?

## Critical Concepts

Emergency Departments and Hospitals are incredibly complex organisations. You may want to talk to your Challenge Mentors about the following ideas as you develop your prototype:

    - Melanoma progression and diagnostic approaches.  

    - Current thinking around melanoma causes and contributory factors.  

    - Melanoma treatment approaches.  

    - Melanoma prevention approaches.

## Supporting Data Sets

The Department of Health has produced a synthesised version of the WA Cancer Registry (WACR) to help resolve this Challenge. Since 1982, the Western Australian Cancer Registry has collected population-based incidence and mortality cancer data for use in the planning of health care services and the support of cancer monitoring, evaluation and research at local, national and international levels. The Registry includes data points on:

    - Age
    - Nationality
    - Location of diagnosis
    - Type of diagnosis
    - Morphology
    - Melanoma specific details


## Potential Solution Pathways

You are free to resolve this Challenge by developing your prototype in whatever way you like. Our mentors, partners and organising team have thought of the following techniques as viable ways to resolve the Challenge:

    - Statistical models to measure melanoma occurrence.  

    - Spatial analysis of melanoma occurrence and other descriptors.  

    - Machine learning models predicting cancer prevalence and risk.
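
As a hedged starting point for the statistical and spatial pathways, the sketch below counts diagnoses per postcode and year. The column names `diagnosis_postcode` and `diagnosis_year` match the synthesised registry file, but the rows are invented toy values and `count_cases` is a hypothetical helper name, not part of the Challenge material.

```python
import pandas as pd

def count_cases(df: pd.DataFrame) -> pd.DataFrame:
    """Count diagnoses per postcode (rows) and diagnosis year (columns)."""
    return (
        df.groupby(["diagnosis_postcode", "diagnosis_year"])
          .size()
          .unstack(fill_value=0)  # postcode/year pairs with no cases become 0
    )

# Toy rows for illustration only; these are NOT registry records.
toy = pd.DataFrame({
    "diagnosis_postcode": [6084, 6084, 6225, 6225, 6225],
    "diagnosis_year": [2013, 2013, 2019, 2019, 2010],
})
case_counts = count_cases(toy)
print(case_counts)
```

These are raw counts, not prevalence. Turning them into rates would need population denominators per postcode, which the registry columns listed above do not provide.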



---
# Imports

In [2]:
# import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import csv
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree, cut_tree
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.decomposition import PCA
from graphviz import Source

---
# Data Preparation

#### Reading the CSV file into Google Colab:

In [None]:
# # Code to read .csv file into Colaboratory:
# !pip install -U -q PyDrive
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials
# # Authenticate and create the PyDrive client.
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

# # https://drive.google.com/file/d/1HGs7U8uVfHMdrIdgmup42JC78rmZokre/view?usp=drive_link
# id = '1HGs7U8uVfHMdrIdgmup42JC78rmZokre'
# downloaded = drive.CreateFile({'id':id})
# downloaded.GetContentFile('Hackathon_syntheticMelanoma_5_Oct_2023.csv')

#### Inspecting the data:

In [31]:
data = pd.read_csv('Hackathon_syntheticMelanoma_5_Oct_2023.csv')

In [19]:
data.head()

Unnamed: 0,person_ID,sex,aboriginal_status,age,country_of_birth,diagnosis_postcode,diagnosis_year,tumour_site_code,morphology_code,basis_of_diagnosis,year_of_death,melanoma_clark_level,melanoma_breslow_thickness,stage
0,1,2,4.0,85,901.0,6084,2013,C443,8720,1,2013.0,,,4
1,2,1,4.0,30,905.0,6225,2019,C447,8743,1,,4.0,1.83,3
2,3,1,4.0,30,,6031,2011,C445,8723,1,,4.0,1.73,2
3,4,2,4.0,30,905.0,6148,2012,C444,8720,1,,3.0,0.61,1
4,5,1,4.0,30,905.0,6401,2010,C447,8720,1,,3.0,0.8,1


In [41]:
data.describe()

Unnamed: 0,sex,aboriginal_status,age,country_of_birth,diagnosis_year,basis_of_diagnosis,melanoma_clark_level,melanoma_breslow_thickness,stage,morphology_8720,...,morphology_8772,tumour_site_C440,tumour_site_C441,tumour_site_C442,tumour_site_C443,tumour_site_C444,tumour_site_C445,tumour_site_C446,tumour_site_C447,tumour_site_C449
count,13747.0,13735.0,13747.0,6338.0,13747.0,13747.0,13258.0,13149.0,13747.0,13747.0,...,13747.0,13747.0,13747.0,13747.0,13747.0,13747.0,13747.0,13747.0,13747.0,13747.0
mean,0.407507,4.040553,61.021677,1250.292995,2015.147814,1.020223,3.140444,1.650041,1.404816,0.286317,...,0.008511,0.002037,0.002619,0.022478,0.081472,0.078417,0.361752,0.248782,0.196988,0.005456
std,0.491388,0.530586,16.431136,1082.868856,3.075055,0.344183,0.973376,2.89927,0.788402,0.452056,...,0.091865,0.045087,0.051109,0.148236,0.273569,0.268837,0.480525,0.432323,0.397738,0.073664
min,0.0,1.0,0.0,0.0,2010.0,1.0,2.0,0.08,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,4.0,50.0,905.0,2013.0,1.0,2.0,0.4,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,4.0,65.0,905.0,2015.0,1.0,3.0,0.65,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,4.0,75.0,1101.0,2018.0,1.0,4.0,1.55,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,1.0,9.0,100.0,9232.0,2020.0,9.0,5.0,40.0,4.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13747 entries, 0 to 13746
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_ID                   13747 non-null  int64  
 1   sex                         13747 non-null  int64  
 2   aboriginal_status           13735 non-null  float64
 3   age                         13747 non-null  int64  
 4   country_of_birth            6338 non-null   float64
 5   diagnosis_postcode          13747 non-null  int64  
 6   diagnosis_year              13747 non-null  int64  
 7   tumour_site_code            13747 non-null  object 
 8   morphology_code             13747 non-null  int64  
 9   basis_of_diagnosis          13747 non-null  int64  
 10  year_of_death               3001 non-null   float64
 11  melanoma_clark_level        13258 non-null  float64
 12  melanoma_breslow_thickness  13149 non-null  float64
 13  stage                       137
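
The `data.info()` output shows that several columns are only partly filled: `country_of_birth` has 6338 non-null values out of 13747 and `year_of_death` has 3001. A per-column missing-value summary makes this easier to scan. The sketch below is one way to build it; `missing_summary` is a hypothetical helper name, and the toy frame reuses the first five rows shown by `data.head()` for three of the columns.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and missing share per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "share": df.isna().mean().round(3),
    })
    return summary.sort_values("missing", ascending=False)

# First five rows of three columns, as shown in the data.head() output.
toy = pd.DataFrame({
    "country_of_birth": [901.0, 905.0, np.nan, 905.0, 905.0],
    "year_of_death": [2013.0, np.nan, np.nan, np.nan, np.nan],
    "melanoma_clark_level": [np.nan, 4.0, 4.0, 3.0, 3.0],
})
report = missing_summary(toy)
print(report)
```

A missing `year_of_death` may simply mean that no death was recorded, which is different from a missing measurement such as `melanoma_breslow_thickness`. That interpretation is an assumption and is worth confirming with the Challenge Mentors.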

In [17]:
# data = data.drop(columns=['person_ID'])

In [None]:
# data['sex'] = data['sex'] - 1

In [None]:
data.head()

Unnamed: 0,sex,aboriginal_status,age,country_of_birth,diagnosis_postcode,diagnosis_year,tumour_site_code,morphology_code,basis_of_diagnosis,year_of_death,melanoma_clark_level,melanoma_breslow_thickness,stage
0,1,4.0,85,901.0,6084,2013,C443,8720,1,2013.0,,,4
1,0,4.0,30,905.0,6225,2019,C447,8743,1,,4.0,1.83,3
2,0,4.0,30,,6031,2011,C445,8723,1,,4.0,1.73,2
3,1,4.0,30,905.0,6148,2012,C444,8720,1,,3.0,0.61,1
4,0,4.0,30,905.0,6401,2010,C447,8720,1,,3.0,0.8,1


In [55]:
def preprocess(data):
    data = data.drop(columns=['person_ID', 'year_of_death', 'diagnosis_postcode', 'basis_of_diagnosis', 'country_of_birth'])
    data['sex'] = data['sex'] - 1
    # data['tumour_site_code'] = [int(code.replace('C', '')) - 440 for code in data['tumour_site_code']]

    # Perform one-hot encoding for 'morphology_code' column
    morphology_encoded = pd.get_dummies(data['morphology_code'], prefix='morphology')
    data = pd.concat([data, morphology_encoded], axis=1)
    
    # Perform one-hot encoding for 'tumour_site_code' column
    tumour_site_encoded = pd.get_dummies(data['tumour_site_code'], prefix='tumour_site')
    data = pd.concat([data, tumour_site_encoded], axis=1)
    
    # Drop the original 'morphology_code' and 'tumour_site_code' columns
    data = data.drop(columns=['morphology_code', 'tumour_site_code'])

    # Drop NULL rows
    data = data[data['aboriginal_status'].notnull()]
    data = data[data['melanoma_breslow_thickness'].notnull()]
    data = data[data['melanoma_clark_level'].notnull()]
    
    
    # Perform one-hot encoding for 'aboriginal_status' column
    aboriginal_status_encoded = pd.get_dummies(data['aboriginal_status'], prefix='aboriginal_status')
    data = pd.concat([data, aboriginal_status_encoded], axis=1)
    
    # Drop the original 'aboriginal_status' column
    data = data.drop(columns=['aboriginal_status'])
    
    return data
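
To make the encoding step in `preprocess` concrete, the sketch below applies the same encode, append and drop pattern to a three-row toy frame. The toy values mirror the format of the file but are for illustration only. Passing `dtype=int` is an addition not present in the notebook: recent pandas versions return boolean dummy columns by default, and `dtype=int` keeps the 0/1 values seen in the notebook output.

```python
import pandas as pd

# Toy rows for illustration only; the codes follow the file's format.
toy = pd.DataFrame({
    "age": [85, 30, 30],
    "tumour_site_code": ["C443", "C447", "C445"],
})

# Same pattern as in preprocess(): encode, append, then drop the original column.
encoded = pd.get_dummies(toy["tumour_site_code"], prefix="tumour_site", dtype=int)
toy_encoded = pd.concat([toy, encoded], axis=1).drop(columns=["tumour_site_code"])

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Each distinct code becomes its own indicator column, and every row has exactly one of those columns set to 1.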

In [56]:
data = pd.read_csv('Hackathon_syntheticMelanoma_5_Oct_2023.csv')
data = preprocess(data)

In [57]:
data.head()

Unnamed: 0,sex,age,diagnosis_year,melanoma_clark_level,melanoma_breslow_thickness,stage,morphology_8720,morphology_8721,morphology_8723,morphology_8727,...,tumour_site_C442,tumour_site_C443,tumour_site_C444,tumour_site_C445,tumour_site_C446,tumour_site_C447,tumour_site_C449,aboriginal_status_1.0,aboriginal_status_4.0,aboriginal_status_9.0
1,0,30,2019,4.0,1.83,3,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,0,30,2011,4.0,1.73,2,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
3,1,30,2012,3.0,0.61,1,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
4,0,30,2010,3.0,0.8,1,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
5,1,90,2010,2.0,0.24,1,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [58]:
column_names = data.columns.tolist()
print(column_names)

['sex', 'age', 'diagnosis_year', 'melanoma_clark_level', 'melanoma_breslow_thickness', 'stage', 'morphology_8720', 'morphology_8721', 'morphology_8723', 'morphology_8727', 'morphology_8730', 'morphology_8740', 'morphology_8742', 'morphology_8743', 'morphology_8744', 'morphology_8745', 'morphology_8750', 'morphology_8760', 'morphology_8770', 'morphology_8771', 'morphology_8772', 'tumour_site_C440', 'tumour_site_C441', 'tumour_site_C442', 'tumour_site_C443', 'tumour_site_C444', 'tumour_site_C445', 'tumour_site_C446', 'tumour_site_C447', 'tumour_site_C449', 'aboriginal_status_1.0', 'aboriginal_status_4.0', 'aboriginal_status_9.0']
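
If `stage` is later used as a prediction target (an assumption; the notebook has not stated this at this point), a natural next step is to separate it from the feature columns and check how the classes are distributed. The sketch below shows this on four rows and four columns copied from the `data.head()` output above; `split_features_target` is a hypothetical helper name.

```python
import pandas as pd

def split_features_target(df: pd.DataFrame, target: str = "stage"):
    """Separate the target column from the remaining feature columns."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Four rows and four columns taken from the preprocessed data.head() output.
toy = pd.DataFrame({
    "sex": [0, 0, 1, 0],
    "age": [30, 30, 30, 30],
    "melanoma_breslow_thickness": [1.83, 1.73, 0.61, 0.8],
    "stage": [3, 2, 1, 1],
})
X, y = split_features_target(toy)
stage_counts = y.value_counts().sort_index()

print(X.columns.tolist())
print(stage_counts)
```

The `data.describe()` output shows a median `stage` of 1 and a 75th percentile of 2, so the classes are likely imbalanced. Checking `stage_counts` on the full data before modelling would confirm this.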
