# Personalized Medicine - Redefining Cancer Treatment

Much has been said in recent years about how precision medicine and,
more concretely, genetic testing will disrupt the treatment of diseases
such as cancer.

But this is still happening only partially, because of the huge amount of
manual work still required. In this project, we will try to take
personalized medicine to its full potential.

Once sequenced, a cancer tumor can have thousands of genetic mutations.
The challenge is to distinguish the mutations that contribute to tumor
growth (drivers) from the neutral mutations (passengers).

Currently, this interpretation of genetic mutations is done manually.
It is a very time-consuming task in which a clinical pathologist has to
manually review and classify every single genetic mutation based on
evidence from the text-based clinical literature.

For this project, MSKCC (Memorial Sloan Kettering Cancer Center) is
making available an expert-annotated knowledge base in which world-class
researchers and oncologists have manually annotated thousands of
mutations.

In this project, you will develop a Machine Learning algorithm that,
using this knowledge base as a baseline, automatically classifies
genetic variations.

The complete dataset can be found at:
https://www.kaggle.com/c/msk-redefining-cancer-treatment/data

This project is part of the Machine Learning course at Data Science Academy.

### Preparing the libraries to be used

In [45]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sn
import numpy as np
import keras
import re
import string
import nltk
import spacy 

Both the training and test data sets are provided via two different files. One (training/test_variants) provides the information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that the human experts used to classify the genetic mutations. Both are linked via the ID field.

Therefore, the genetic mutation (row) with ID=15 in the file training_variants was classified using the clinical evidence (text) from the row with ID=15 in the file training_text.

* **training_variants** - a comma-separated file containing the description of the genetic mutations used for training. Fields are ID (the id of the row, used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation), Class (1-9, the class this genetic mutation has been classified as)
* **training_text** - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row, used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)
* **test_variants** - a comma-separated file containing the description of the genetic mutations used for testing. Fields are ID (the id of the row, used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation)
* **test_text** - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row, used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)
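
As a rough illustration of the two file layouts described above, the sketch below loads a comma-separated variants file and a `||`-delimited text file with pandas and links them on `ID`. The rows, gene names and texts are made up, and in-memory buffers stand in for the real files; this is one possible way to read the format, not necessarily the approach used in this notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for training_variants and training_text (made-up rows).
variants_file = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENE_A,V1,1\n"
    "1,GENE_B,V2,2\n"
)
text_file = io.StringIO(
    "ID,Text\n"
    "0||First clinical evidence text, with a comma.\n"
    "1||Second clinical evidence text.\n"
)

variants = pd.read_csv(variants_file)

# The text file has a comma-separated header but "||"-separated rows,
# so the header is skipped and the column names are given by hand.
texts = pd.read_csv(
    text_file, sep=r"\|\|", engine="python", skiprows=1, names=["ID", "Text"]
)

# Each ID should appear exactly once in each file; validate raises otherwise.
merged = variants.merge(texts, on="ID", how="inner", validate="one_to_one")
print(merged)
```

Because the separator is a regular expression, the slower Python parsing engine is required; commas inside the text are then left untouched.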

In [3]:
test_text = "./data_files/test_text"
test_variants = "./data_files/test_variants"
training_text = "./data_files/training_text"
training_variants = "./data_files/training_variants"

### Reading the input files and converting the data into dataframes

In [4]:
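def convert_variant_df(read):
    # Read all lines of the comma-separated variants file
    with open(read, "r", encoding="utf8") as arquivo:
        lista = arquivo.readlines()
    # Each line ends with "\n", so strip it before splitting the fields on commas
    lista_nova = [texto.split(sep="\n")[0].split(",") for texto in lista]
    # The first line holds the column names; the remaining lines are the rows
    df = pd.DataFrame(lista_nova[1:], columns=lista_nova[0])
    return df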
    

In [5]:
def convert__df(read):
    # Split the text on the "<ID>||" delimiter; the capture groups keep the ID and the "||" in the result
    with open(read, "r", encoding="utf8") as arquivo:
        lista = re.split(r'([0-9]+)(\|\|)', arquivo.read())
    # Drop the "||" elements from the list
    lista = [elemento for elemento in lista if elemento != "||"]
    # Take the column names from the header line
    titulo = lista[0].split("\n")[0].split(",")
    # After the header, the list alternates ID, text, ID, text, ...
    lista_nova = [[lista[index+1], lista[index+2]] for index in range(0, len(lista[1:])-1, 2)]

    df = pd.DataFrame(lista_nova, columns=titulo)
    return df

In [6]:
df_test_text = convert__df(test_text)
df_test_variants = convert_variant_df(test_variants)
df_training_text = convert__df(training_text)
df_training_variants = convert_variant_df(training_variants)


#### Function for initial data exploration

In [7]:
# Create a function that returns a dataframe describing the data (like the describe function of the explore package in R)
def explore_describe(df):
    df_out = pd.DataFrame(columns = ['variable','type','na' ,'na_pct' ,'unique','min', 'quat25','median','mean', \
                                     'quat75','max','std','skewness','kurtosis','media_desvio'])
    df_out['variable'] = df.columns
    df_out['type'] = df.dtypes.values
    df_out['na'] = [sum(df[coluna].isna()) for coluna in df.columns]
    df_out['na_pct'] = [str(round(100*sum(df[coluna].isna())/df.shape[0],1))+'%' for coluna in df.columns]
    df_out['unique'] = [len(df[coluna].unique()) for coluna in df.columns]
    df_out['min']  = [round(min(df[coluna]),2) if 'int' in str(df[coluna].dtype) or 'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['mean'] = [round(df[coluna].mean(),2) if 'int' in str(df[coluna].dtype) or \
                      'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['max']  = [round(max(df[coluna]),2) if 'int' in str(df[coluna].dtype) or 'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['std'] = [round(df[coluna].std(),2) if 'int' in str(df[coluna].dtype) or \
                      'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['quat25'] = [round(df[coluna].quantile(0.25),2) if 'int' in str(df[coluna].dtype) or \
                      'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['quat75'] = [round(df[coluna].quantile(0.75),2) if 'int' in str(df[coluna].dtype) or \
                      'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['median'] = [round(df[coluna].quantile(0.5),2) if 'int' in str(df[coluna].dtype) or \
                      'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['skewness'] = [round(df[coluna].skew(),2) if 'int' in str(df[coluna].dtype) or \
                          'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    df_out['kurtosis'] = [round(df[coluna].kurt(),2) if 'int' in str(df[coluna].dtype) or \
                          'float' in str(df[coluna].dtype) else '-' for coluna in df.columns]
    
    df_out_media_desvio_list = []
    for coluna in df.columns:
        if(('int' in str(df[coluna].dtype)) or ('float' in str(df[coluna].dtype)) ):
            if((all(df[coluna] == 0)) or (df[coluna].std() == 0)):
                df_out_media_desvio_list.append(0)
            else:
                df_out_media_desvio_list.append(round(df[coluna].mean()/df[coluna].std(),2))
        else:
            df_out_media_desvio_list.append('-')
    
    df_out['media_desvio'] = df_out_media_desvio_list
    return(df_out)
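
For comparison, several of the columns produced by `explore_describe` can also be obtained from pandas built-ins. The sketch below is only a rough equivalent on a made-up frame (the values are invented for illustration); it does not reproduce the exact layout or rounding of the function above.

```python
import pandas as pd

# Made-up example frame with one numeric and one text column.
df = pd.DataFrame({"Class": [1, 2, 2, 4], "Gene": ["CBL", "CBL", "RUNX1", None]})

summary = pd.DataFrame({
    "type": df.dtypes,
    "na": df.isna().sum(),
    # dropna=False counts a missing value as one distinct value, like len(unique())
    "unique": df.nunique(dropna=False),
})

# Numeric-only statistics; mean / std corresponds to the media_desvio column.
numeric = df.select_dtypes("number")
mean_over_std = numeric.mean() / numeric.std()

print(summary)
print(numeric.describe())
print(mean_over_std)
```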

In [8]:
df_training = df_training_text.merge(right=df_training_variants,on = 'ID').drop(columns = "ID")
df_test = df_test_text.merge(right=df_test_variants,on = 'ID').drop(columns = "ID")
df_training

| | Text | Gene | Variation | Class |
|---|---|---|---|---|
| 0 | Cyclin-dependent kinases (CDKs) regulate a var... | FAM58A | Truncating Mutations | 1 |
| 1 | Abstract Background Non-small cell lung canc... | CBL | W802* | 2 |
| 2 | Abstract Background Non-small cell lung canc... | CBL | Q249E | 2 |
| 3 | Recent evidence has demonstrated that acquired... | CBL | N454D | 3 |
| 4 | Oncogenic mutations in the monomeric Casitas B... | CBL | L399V | 4 |
| ... | ... | ... | ... | ... |
| 3316 | Introduction Myelodysplastic syndromes (MDS) ... | RUNX1 | D171N | 4 |
| 3317 | Introduction Myelodysplastic syndromes (MDS) ... | RUNX1 | A122* | 1 |
| 3318 | The Runt-related transcription factor 1 gene (... | RUNX1 | Fusions | 1 |
| 3319 | The RUNX1/AML1 gene is the most frequent targe... | RUNX1 | R80C | 4 |


In [9]:
df_test

| | Text | Gene | Variation |
|---|---|---|---|
| 0 | 2. This mutation resulted in a myeloproliferat... | ACSL4 | R570S |
| 1 | Abstract The Large Tumor Suppressor 1 (LATS1)... | NAGLU | P521L |
| 2 | Vascular endothelial growth factor receptor (V... | PAH | L333F |
| 3 | Inflammatory myofibroblastic tumor (IMT) is a ... | ING1 | A148D |
| 4 | Abstract Retinoblastoma is a pediatric retina... | TMEM216 | G77A |
| ... | ... | ... | ... |
| 5663 | The realization in the late 1970s that RAS har... | SLC46A1 | R113S |
| 5664 | Hemizygous deletions are common molecular abno... | FOXC1 | L130F |
| 5665 | All most R267W of has with to SMARTpool invest... | GSS | R267W |
| 5666 | Abstract Blood samples from 125 unrelated fami... | CTSK | G79E |


In [10]:
# Checking the initial data
explore_describe(df_training)

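| | variable | type | na | na_pct | unique | min | quat25 | median | mean | quat75 | max | std | skewness | kurtosis | media_desvio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Text | object | 0 | 0.0% | 1921 | - | - | - | - | - | - | - | - | - | - |
| 1 | Gene | object | 0 | 0.0% | 264 | - | - | - | - | - | - | - | - | - | - |
| 2 | Variation | object | 0 | 0.0% | 2996 | - | - | - | - | - | - | - | - | - | - |
| 3 | Class | object | 0 | 0.0% | 9 | - | - | - | - | - | - | - | - | - | - |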


In [11]:
explore_describe(df_test)

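| | variable | type | na | na_pct | unique | min | quat25 | median | mean | quat75 | max | std | skewness | kurtosis | media_desvio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Text | object | 0 | 0.0% | 5611 | - | - | - | - | - | - | - | - | - | - |
| 1 | Gene | object | 0 | 0.0% | 1397 | - | - | - | - | - | - | - | - | - | - |
| 2 | Variation | object | 0 | 0.0% | 5628 | - | - | - | - | - | - | - | - | - | - |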


### Let's save the dataframes in CSV format to make later loading easier

In [17]:
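# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column
df_training.to_csv("df_training.csv", index=False)
df_test.to_csv("df_test.csv", index=False)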

## Let's import the CSV files

In [18]:
df_training = pd.read_csv("df_training.csv")
df_test = pd.read_csv("df_test.csv")
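
One detail of a CSV round trip is worth checking: by default `to_csv` also writes the row index, and `read_csv` then returns it as an extra column named `Unnamed: 0`. The sketch below shows this on a made-up two-row frame, using in-memory buffers instead of files, and compares it with `index=False`.

```python
import io
import pandas as pd

# Made-up frame used only to illustrate the round trip.
df = pd.DataFrame({"Text": ["a", "b"], "Class": [1, 2]})

# Default behaviour: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# With index=False only the data columns are written.
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Text', 'Class']
print(list(reloaded_no_index.columns))    # ['Text', 'Class']
```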

## Removing non-ASCII characters

In [24]:
# First, select the training texts (the basis for computing the frequency of each term)
textos_treino = df_training.Text

In [25]:
def removeNoAscii(s):
    return "".join(i for i in s if ord(i) < 128)

In [26]:
textos_treino_limpa = textos_treino.map(removeNoAscii)
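
To make the effect of this filter concrete, the sketch below applies the same logic to a few made-up strings and compares it with two standard-library variants: an `encode`/`decode` round trip, and a Unicode NFKD normalization step that keeps the base letter of accented characters. The function names are hypothetical and the samples are invented; note that characters without an ASCII base, such as Greek letters common in biomedical text, are dropped by all three.

```python
import unicodedata

def remove_non_ascii(s):
    # Same logic as removeNoAscii: keep only code points below 128.
    return "".join(ch for ch in s if ord(ch) < 128)

def remove_non_ascii_codec(s):
    # Equivalent result using the ASCII codec and ignoring unencodable characters.
    return s.encode("ascii", errors="ignore").decode("ascii")

def fold_to_ascii(s):
    # NFKD splits an accented letter into base letter + combining mark;
    # the mark is then dropped by the ASCII encoding step.
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

# "naive" with a diaeresis on the i, and "beta-catenin" written with the Greek letter.
samples = ["na\u00efve", "\u03b2-catenin"]
for s in samples:
    print(repr(s), repr(remove_non_ascii(s)), repr(fold_to_ascii(s)))
```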

## Building the corpus for data processing

In [166]:
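def corpusnization(text):
    # Tokenize in lowercase, dropping punctuation (apostrophes are kept as part of a token)
    nopunct_token = nltk.tokenize.regexp_tokenize(text.lower(), r"[\w']+")

    # Remove stopwords (the list is loaded once into a set instead of once per token)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    token_no_stopwords = [word for word in nopunct_token if word not in stop_words]

    # Stemming
    # cooking -> cook
    stemmer = nltk.stem.PorterStemmer()
    token_stem = [stemmer.stem(token) for token in token_no_stopwords]

    # Lemmatization
    # mice -> mouse
    lemmatizer = nltk.stem.WordNetLemmatizer()
    token_final = [lemmatizer.lemmatize(token) for token in token_stem]
    # NOTE: the stemmed tokens are returned; token_final is computed but not used
    return token_stem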
    

In [170]:
b = "AttributeError: module 'nltk.collections' has no attribute 'stopwords'. You can´t or can't do that. Why aren't you cooking that? There are a lot of Bacteria. Hunds is the plural of hund. Better for you"
doc = corpusnization(b)
doc

['attributeerror',
 'modul',
 "'nltk",
 "collections'",
 'attribut',
 "'stopwords'",
 "can't",
 'cook',
 'lot',
 'bacteria',
 'hund',
 'plural',
 'hund',
 'better']
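
The stray apostrophes in tokens such as `'nltk` and `collections'` come from the pattern `[\w']+`, which treats an apostrophe as part of a word so that contractions like `can't` stay in one piece. A minimal sketch of one possible cleanup, assuming it is acceptable to strip quotes only at the token edges, is shown below; it uses the standard library's `re.findall` with the same pattern as a stand-in for the tokenizer, on a shortened made-up sentence.

```python
import re

# Same pattern as in corpusnization; re.findall is used here as a stand-in for the tokenizer.
PATTERN = r"[\w']+"

sample = "module 'nltk.collections' has no attribute 'stopwords'. you can't do that"
raw_tokens = re.findall(PATTERN, sample.lower())

# Strip quotes at the token edges only, so contractions such as "can't" survive.
# Tokens that consist solely of quotes become empty and are discarded.
clean_tokens = [t.strip("'") for t in raw_tokens if t.strip("'")]

print(raw_tokens)
print(clean_tokens)
```

Stripping before the stopword filter would also let quoted stopwords be matched against the stopword list.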