# Data Science Methodology for Breast Cancer Diagnosis (DSM-BCD)

**Problem.** Colombia faces limited access to early cancer detection and diagnosis, driven in most cases by factors such as socioeconomic stratum, health insurance coverage, place of origin, and accessibility. On average, a patient waits 90 days from the onset of symptoms to the diagnosis of the cancer. The first step toward reducing breast cancer mortality should therefore focus on faster diagnosis and timely access to care. Accordingly, the objective of this research is to apply the stages of the KDD (Knowledge Discovery in Databases) methodology to the cancer morbidity dataset for 2019 and 2020 in the municipality of Pereira-Risaralda, in order to forecast and characterize the population most susceptible to this disease by age, gender, zone, and health insurance regime.

**Dataset.** A total of 817 breast tumor samples were profiled with five different platforms, as previously described (Cancer Genome Atlas Research Network, 2014), and 633 cases were also profiled by reverse-phase protein array (RPPA).

1. Study ID
2. Patient ID
3. Sample ID
4. Diagnosis Age
5. American Joint Committee on Cancer Metastasis Stage Code
6. Neoplasm Disease Lymph Node Stage American Joint Committee on Cancer Code
7. Neoplasm Disease Stage American Joint Committee on Cancer Code
8. American Joint Committee on Cancer Publication Version Type
9. American Joint Committee on Cancer Tumor Stage Code
10. Brachytherapy first reference point administered total dose
11. Cancer Type
12. Cancer Type Detailed
13. Cent17 Copy Number
14. Birth from Initial Pathologic Diagnosis Date
15. Days to Sample Collection.
16. Death from Initial Pathologic Diagnosis Date
17. Last Alive Less Initial Pathologic Diagnosis Date Calculated Day Value
18. Days to Last Followup
19. Disease Free (Months)
20. Disease Free Status
21. Disease code
22. ER positivity scale other
23. ER positivity scale used
24. ER Status By IHC
25. ER Status IHC Percent Positive
26. Ethnicity Category
27. First surgical procedure other
28. Form completion date
29. Fraction Genome Altered
30. HER2 and cent17 cells count
31. HER2 and cent17 scale other
32. HER2 cent17 ratio
33. HER2 copy number
34. HER2 fish method
35. HER2 fish status
36. HER2 ihc percent positive
37. HER2 ihc score
38. HER2 positivity method text
39. HER2 positivity scale other
40. Neoplasm Histologic Type Name
41. Tumor Other Histologic Subtype
42. Neoadjuvant Therapy Type Administered Prior To Resection Text
43. Prior Cancer Diagnosis Occurence
44. ICD-10 Classification
45. International Classification of Diseases for Oncology, Third Edition ICD-O-3 Histology Code
46. International Classification of Diseases for Oncology, Third Edition ICD-O-3 Site Code
47. IHC-HER2
48. IHC Score
49. Informed consent verified
50. Year Cancer Initial Diagnosis
51. Is FFPE
52. Primary Lymph Node Presentation Assessment Ind-3
53. Positive Finding Lymph Node Hematoxylin and Eosin Staining Microscopy Count
54. Positive Finding Lymph Node Keratin Immunohistochemistry Staining Method Count
55. Lymph Node(s) Examined Number
56. Margin status reexcision
57. Menopause Status
58. Metastatic Site
59. Metastatic Site Other
60. Metastatic tumor indicator
61. First Pathologic Diagnosis Biospecimen Acquisition Method Type
62. First Pathologic Diagnosis Biospecimen Acquisition Other Method Type
63. Micromet detection by ihc
64. Mutation Count
65. New Neoplasm Event Post Initial Therapy Indicator
66. Nte cent 17 HER2 ratio
67. Nte er ihc intensity score
68. Nte er status
69. Nte er status ihc positive
70. Nte HER2 fish status
71. Nte HER2 positivity ihc score
72. Nte HER2 status
73. Nte HER2 status ihc positive
74. Nte pr ihc intensity score
75. Nte pr status by ihc
76. Nte pr status ihc positive
77. Oct embedded
78. Oncotree Code
79. Overall Survival (Months)
80. Overall Survival Status
81. Other Patient ID
82. Other Sample ID
83. Pathology Report File Name
84. Disease Surgical Margin Status
85. Adjuvant Postoperative Pharmaceutical Therapy Administered Indicator
86. Primary Tumor Site
87. Project code
88. Tissue Prospective Collection Indicator
89. PR positivity define method
90. PR positivity ihc intensity score
91. PR positivity scale other
92. PR positivity scale used
93. PR status by ihc
94. PR status ihc percent positive
95. Race Category
96. Did patient start adjuvant postoperative radiotherapy?
97. Tissue Retrospective Collection Indicator
98. Number of Samples Per Patient
99. Sample Type
100. Sex
101. Somatic Status
102. Staging System
103. Staging System_1
104. Surgery for positive margins
105. Surgery for positive margins other
106. Surgical procedure first
107. Tissue Source Site
108. TMB (nonsynonymous)
109. Person Neoplasm Status
110. Tumor Disease Anatomic Site


## Exploratory data analysis

In [None]:
import pandas as pd
import numpy as np
from IPython.display import Image
from dataprep.eda import plot, plot_correlation, plot_missing, create_report
import seaborn as sns

In [None]:
# The clinical export is semicolon-delimited, so the separator is set explicitly.
breast_cancer = pd.read_csv('brca_tcga_pub2015_clinical_data.csv', sep=';')

In [None]:
breast_cancer.head(10)

In [None]:
breast_cancer.shape
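
The shape alone does not confirm that the file was parsed correctly: a wrong delimiter typically yields a single wide column. A minimal sketch of a column check follows. The helper name `check_columns` and the toy frame are assumptions made for illustration; in the notebook the function would be called on `breast_cancer` with the column names listed in the dataset description.

```python
import pandas as pd


def check_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from df."""
    return [name for name in expected if name not in df.columns]


# Toy frame standing in for breast_cancer (invented values, illustration only).
toy = pd.DataFrame({"Patient ID": ["A", "B"], "Diagnosis Age": [55, 61]})

# "Sex" is deliberately left out of the toy frame to show a failed check.
missing = check_columns(toy, ["Patient ID", "Diagnosis Age", "Sex"])
print(toy.shape, missing)
```

An empty list means every expected column was found; any names returned point to a parsing problem or a renamed column.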

In [None]:
create_report(breast_cancer)
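
The full report can be slow to render on a table with 110 columns. As a lighter complement, the sketch below summarizes missing values per column using pandas only. The function name `missing_summary` and the toy data are hypothetical; the approach is a plain pandas alternative and not part of `dataprep`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(1),
    })
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame standing in for breast_cancer (invented values, illustration only).
toy = pd.DataFrame({
    "Diagnosis Age": [55, 61, np.nan, 47],
    "HER2 copy number": [np.nan, np.nan, np.nan, 2.0],
    "Sex": ["Female", "Female", "Female", "Male"],
})

summary = missing_summary(toy)
print(summary)
```

Columns near the top of the summary are candidates for removal or imputation before any modeling step.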

In [None]:
# Disabled: 'NOMBRE DIAGNOSTICO' does not appear among the 110 columns listed above.
# breast_cancer = breast_cancer.drop(['NOMBRE DIAGNOSTICO'], axis=1)

In [None]:
#breast_cancer.to_csv('Breast_Cancer_C500_C509.csv',index=False)
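
As a first step toward the characterization goal stated in the problem description, the sketch below cross-tabulates age bands against sex. It assumes the columns are named `Diagnosis Age` and `Sex` as in the column list; the helper name `age_profile`, the band edges, and the toy data are illustrative choices, not part of the original analysis.

```python
import pandas as pd


def age_profile(df: pd.DataFrame,
                age_col: str = "Diagnosis Age",
                group_col: str = "Sex") -> pd.DataFrame:
    """Cross-tabulate age bands against a categorical column."""
    # Band edges are an arbitrary choice; intervals are closed on the left,
    # so a 50-year-old falls in "50-59".
    bins = [0, 40, 50, 60, 70, 120]
    labels = ["<40", "40-49", "50-59", "60-69", "70+"]
    bands = pd.cut(df[age_col], bins=bins, labels=labels, right=False)
    return pd.crosstab(bands, df[group_col])


# Toy frame standing in for breast_cancer (invented values, illustration only).
toy = pd.DataFrame({
    "Diagnosis Age": [35, 45, 55, 55, 72],
    "Sex": ["Female", "Female", "Female", "Male", "Female"],
})

profile = age_profile(toy)
print(profile)
```

Passing a different `group_col` gives the same breakdown for any other categorical column in the table.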