## MACHINE LEARNING
### Professor: Hugo de Paula
### Course: Data Science and Big Data - Of. 01


#### Guided Assignment
The goal of this assignment is to apply machine learning techniques to extract knowledge from any chosen database, identifying interesting patterns that can support decision-making for a social or business problem. The data science process is iterative, so earlier steps may be revisited and adjusted when necessary.

#### Author
* Marcelo Oliveira (marcelohonoliveira@gmail.com)

#### Motivation

A biopsy is an invasive exam used to assess the health and integrity of tissues such as skin, lung, muscle, bone, liver, kidney, or spleen. Its purpose is to detect changes such as alterations in the shape and size of cells, which makes it useful for identifying cancer cells and other health problems.
Because the exam is invasive, complications (pain, bleeding, etc.) may follow. Being able to predict its result, or at least to prioritize patients whose indications suggest a higher probability of a positive result, would therefore be valuable.

### Objective
Develop a model that, given several pieces of information about a patient at risk of cervical cancer, predicts whether a biopsy of that patient would come back positive or negative.


#### Solution Developed
Two supervised machine learning models, Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), were fitted to classify a medical dataset. From the data, the models support the decision of whether to perform a biopsy by predicting its result, with an accuracy above 90% on the test cases. Each model takes 35 input variables and returns the likely biopsy result for a patient with suspected cervical cancer.

### Cervical Cancer (Risk Factors)
[Dataset available here](https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29)

**Summary:** This dataset focuses on predicting indicators/diagnosis of cervical cancer. The features cover demographic information, habits, and historical medical records.


#### Attribute information:
01. (int) Age 
02. (int) Number of sexual partners 
03. (int) First sexual intercourse (age) 
04. (int) Num of pregnancies 
05. (bool) Smokes 
06. (float) Smokes (years) 
07. (float) Smokes (packs/year) 
08. (bool) Hormonal Contraceptives 
09. (int) Hormonal Contraceptives (years) 
10. (bool) IUD 
11. (int) IUD (years) 
12. (bool) STDs 
13. (int) STDs (number) 
14. (bool) STDs:condylomatosis 
15. (bool) STDs:cervical condylomatosis 
16. (bool) STDs:vaginal condylomatosis 
17. (bool) STDs:vulvo-perineal condylomatosis 
18. (bool) STDs:syphilis 
19. (bool) STDs:pelvic inflammatory disease 
20. (bool) STDs:genital herpes 
21. (bool) STDs:molluscum contagiosum 
22. (bool) STDs:AIDS 
23. (bool) STDs:HIV 
24. (bool) STDs:Hepatitis B 
25. (bool) STDs:HPV 
26. (int) STDs: Number of diagnosis 
27. (int) STDs: Time since first diagnosis 
28. (int) STDs: Time since last diagnosis 
29. (bool) Dx:Cancer 
30. (bool) Dx:CIN 
31. (bool) Dx:HPV 
32. (bool) Dx 
33. (bool) Hinselmann: target variable 
34. (bool) Schiller: target variable 
35. (bool) Cytology: target variable 
36. (bool) Biopsy: target variable


#### Authors:
Kelwin Fernandes (kafc _at_ inesctec _dot_ pt) - INESC TEC & FEUP, Porto, Portugal. 
Jaime S. Cardoso - INESC TEC & FEUP, Porto, Portugal. 
Jessica Fernandes - Universidad Central de Venezuela, Caracas, Venezuela.


#### Dataset information:
The dataset was collected at the Hospital Universitario de Caracas in Caracas, Venezuela. It comprises demographic information, habits, and historical medical records of 858 patients. Several patients chose not to answer some of the questions because of privacy concerns (missing values).
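
The missing answers mentioned above need handling before any model is fitted. The following is a minimal sketch, assuming the unanswered questions are encoded as the string `?` in the raw file. The tiny `raw` table and the choice of median imputation are illustrative assumptions, not the preprocessing used in this notebook.

```python
import pandas as pd

# Tiny hypothetical sample; "?" marks an unanswered question (assumed encoding).
raw = pd.DataFrame({
    "Age": [18, 34, 52, 46],
    "Number of sexual partners": ["4.0", "?", "5.0", "3.0"],
    "Smokes": ["0.0", "0.0", "1.0", "?"],
})

# Coerce every column to a numeric dtype; anything unparseable ("?") becomes NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

# Count the missing answers per column before deciding how to treat them.
missing_per_column = clean.isna().sum()

# One simple option: fill each gap with the median of its column.
imputed = clean.fillna(clean.median())

print(missing_per_column)
print(imputed)
```

Dropping incomplete rows is the other common option, but with only 858 patients it may discard a large share of the data.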


#### Relevant papers:
Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. 'Transfer Learning with Partial Observability Applied to Cervical Cancer Screening.' Iberian Conference on Pattern Recognition and Image Analysis. Springer International Publishing, 2017.


#### Citation request:
Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. 'Transfer Learning with Partial Observability Applied to Cervical Cancer Screening.' Iberian Conference on Pattern Recognition and Image Analysis. Springer International Publishing, 2017.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

# Load the dataset from the "cancer" table (sqlContext is supplied by the Spark notebook environment)
pysparkDF = sqlContext.sql("SELECT * FROM cancer")
pandasDF = pysparkDF.toPandas()

In [4]:
pandasDF.dtypes

In [5]:
pandasDF.describe()

In [6]:
pandasDF.corr(method='pearson', min_periods=1)
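
The full correlation matrix above is large, so it can help to look only at how each column correlates with the target. This is a small sketch on a hypothetical `sample` table that borrows a few column names from the notebook. On the real data, `pandasDF` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for pandasDF with a few of the notebook's columns.
sample = pd.DataFrame({
    "Age": [18, 15, 34, 52, 51, 26],
    "Schiller": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "Hinselmann": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "Biopsy": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    sample.corr(method="pearson")["Biopsy"]
    .drop("Biopsy")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```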

In [7]:
display(pysparkDF)

Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,IUD (years),STDs,STDs (number),STDs:condylomatosis,STDs:cervical condylomatosis,STDs:vaginal condylomatosis,STDs:vulvo-perineal condylomatosis,STDs:syphilis,STDs:pelvic inflammatory disease,STDs:genital herpes,STDs:molluscum contagiosum,STDs:AIDS,STDs:HIV,STDs:Hepatitis B,STDs:HPV,STDs: Number of diagnosis,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42,3.0,23.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
51,3.0,17.0,6.0,1.0,34.0,3.4,0.0,0.0,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
26,1.0,26.0,3.0,0.0,0.0,0.0,1.0,2.0,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,1.0,20.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
44,3.0,15.0,0.0,1.0,1.266972909,2.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Create the training and test sets (features: 'Age' through 'Citology'; target: 'Biopsy')
X_train, X_test, y_train, y_test = train_test_split(pandasDF.loc[:,'Age':'Citology'], pandasDF.loc[:,'Biopsy'], random_state = 0)

# "Support Vector Machines (SVM)" model
SVM = svm.SVC(kernel = 'linear', C = 1).fit(X_train, y_train)
acSVM = SVM.score(X_test, y_test)
print("## MODEL ACCURACY EVALUATION ##\n1. (a) Accuracy of the 'Support Vector Machines (SVM)' model: {0:.2f}%\n".format(acSVM * 100))

# "K-Nearest Neighbors (KNN)" model
KNN = KNeighborsClassifier(n_neighbors = 1).fit(X_train, y_train)
acKNN = KNN.score(X_test, y_test)
print("1. (b) Accuracy of the 'K-Nearest Neighbors (KNN)' model: {0:.2f}%\n".format(acKNN * 100))

# New input data to classify (expected Biopsy = Negative); values follow the column order 'Age' through 'Citology'
negativo = np.array([[29, 2.0, 20.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# New input data to classify (expected Biopsy = Positive); the last three values (Hinselmann, Schiller, Citology) are set to 1
positivo = np.array([[33, 3.0, 18.0, 2.0, 0, 0, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])


# Classification by the SVM model (NEGATIVE case)
SVM_prediction = SVM.predict(negativo)

# predict() returns an array with one label per input row; take the single label
if SVM_prediction[0] == 0:
  biopsySVM = "Negative [-]"
elif SVM_prediction[0] == 1:
  biopsySVM = "Positive [+]"
else:
  biopsySVM = "Undefined"
  
print("## PATIENT WITH NEGATIVE PRIOR TESTS ##\n2. (a) SVM classification: the biopsy result is {0}\n".format(biopsySVM))


# Classification by the KNN model (NEGATIVE case)
KNN_prediction = KNN.predict(negativo)

if KNN_prediction[0] == 0:
  biopsyKNN = "Negative [-]"
elif KNN_prediction[0] == 1:
  biopsyKNN = "Positive [+]"
else:
  biopsyKNN = "Undefined"

print("2. (b) KNN classification: the biopsy result is {0}\n".format(biopsyKNN))



# Classification by the SVM model (POSITIVE case)
SVM_prediction = SVM.predict(positivo)

# predict() returns an array with one label per input row; take the single label
if SVM_prediction[0] == 0:
  biopsySVM = "Negative [-]"
elif SVM_prediction[0] == 1:
  biopsySVM = "Positive [+]"
else:
  biopsySVM = "Undefined"
  
print("## PATIENT WITH POSITIVE PRIOR TESTS ##\n3. (a) SVM classification: the biopsy result is {0}\n".format(biopsySVM))


# Classification by the KNN model (POSITIVE case)
KNN_prediction = KNN.predict(positivo)

if KNN_prediction[0] == 0:
  biopsyKNN = "Negative [-]"
elif KNN_prediction[0] == 1:
  biopsyKNN = "Positive [+]"
else:
  biopsyKNN = "Undefined"

print("3. (b) KNN classification: the biopsy result is {0}\n".format(biopsyKNN))
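
Accuracy alone can hide how many positive cases a model misses when positives are rare. The sketch below shows one way to complement the accuracy figures above with a confusion matrix and the recall of the positive class. It runs on synthetic data from `make_classification`, not on the patient table. The class ratio, the stratified split, and the scaling step are assumptions added for illustration and are not part of the original pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the patient table (not the real data).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# stratify=y keeps the share of positive cases the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling first keeps large-valued columns from dominating the KNN distances.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = model.score(X_test, y_test)
recall_positive = recall_score(y_test, y_pred, zero_division=0)

print(cm)
print("Accuracy: {0:.2f}%".format(accuracy * 100))
print("Recall (positive class): {0:.2f}%".format(recall_positive * 100))
```

For a screening use case, the recall of the positive class is arguably the more relevant number, since a missed positive is costlier than an unnecessary follow-up.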

#### Results
Both models were able to predict the biopsy result for the patients tested. This could support the medical decision of whether to request a more complex exam, which is more expensive and, being invasive, carries greater risk for the patient. The idea is not to replace the biopsy but to prioritize patients for whom the model predicts a positive result, and even to begin treatment immediately when doing so poses no additional risk to the patient.
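
The prioritization idea could be sketched by ranking patients on a continuous model score instead of only the 0/1 label. The example below uses the signed distance returned by `decision_function` of a linear `SVC`. The synthetic data and the split into 150 training rows and 50 "new patients" are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the patient table (not the real data).
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Fit on the first 150 rows; treat the remaining 50 as new patients.
model = SVC(kernel="linear", C=1).fit(X[:150], y[:150])
new_patients = X[150:]

# Signed distance to the separating hyperplane: larger means "more positive".
scores = model.decision_function(new_patients)

# Indices of the new patients, highest score first.
priority_order = np.argsort(-scores)

print(priority_order[:5])
```

Patients at the top of `priority_order` would be the first candidates for a biopsy under this scheme.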

#### Conclusion
From the results obtained, applying a machine learning model proves useful across many applications, including cases of great complexity and importance. Although this work is still preliminary and limited, it already shows that applications in medicine can lend more credibility to medical decisions, reduce costs by avoiding unnecessary exams, and mitigate the discomfort and complications of diagnostic procedures, improving patient well-being.