<a href="https://colab.research.google.com/github/joserobertofox/datascience/blob/main/MVP1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **MVP: Data Analysis and Best Practices**
# **Student: José Roberto Assis Silva**
  

# **1. Problem Definition**

This project uses the **HCV Data Set**, which contains data for predicting hepatitis C. The dataset was donated by Ralf Lichtinghagen, Institute of Clinical Chemistry, Medical University Hannover (MHH), Hannover, Germany. It contains laboratory values of blood donors and hepatitis C patients, along with demographic values such as age, and is treated here as a supervised learning problem. The laboratory values serve as predictor variables for the classification target, **Category (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')**. For more information about this dataset, see: https://archive.ics.uci.edu/ml/datasets/HCV+data.

**Dataset Attributes:**

1. **ID** - Patient ID number
2. **Category** - (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')
3. **Age** - Age in years
4. **Sex** - Sex (m/f)
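5. **ALB** - Albumin
6. **ALP** - Alkaline phosphatase
7. **ALT** - Alanine aminotransferase
8. **AST** - Aspartate aminotransferase
9. **BIL** - Bilirubin
10. **CHE** - Cholinesterase
11. **CHOL** - Cholesterol
12. **CREA** - Creatinine
13. **GGT** - Gamma-glutamyl transferase
14. **PROT** - Total protein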



In [8]:
# Imports
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms # for visualizing missing values
from matplotlib import cm
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

# **2. Data Loading**

I will use the Pandas package (Python Data Analysis Library) to load the **hcvdat0** file, in **.csv** format, and then run an exploratory analysis on the loaded data.



In [18]:
# Load the hcvdat0.csv file with Pandas from the GitHub repository URL

# URL of the file uploaded to the GitHub repository
url = "https://raw.githubusercontent.com/joserobertofox/datascience/main/hcvdat0.csv"

# Column header names
colunas = ['ID', 'Category', 'Age', 'Sex', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']

# Load the file using the column names above (skiprows=1 drops the file's own header row)
dataset = pd.read_csv(url, names=colunas, skiprows=1, delimiter=',')
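As a minimal sketch of what the `names` and `skiprows=1` arguments do in the load above, the snippet below reads a tiny in-memory CSV instead of the real file. The CSV text and the three-column layout are made up for illustration, and the sketch assumes that the file's own header leaves the first column unnamed.

```python
import io

import pandas as pd

# Made-up CSV text (assumption): the header row leaves the first column unnamed
csv_text = ",Category,Age\n1,0=Blood Donor,32\n2,0=Blood Donor,32\n"

# Illustrative column names, a shortened version of the `colunas` list
names = ["ID", "Category", "Age"]

# skiprows=1 discards the original header row; names= supplies the new header
df = pd.read_csv(io.StringIO(csv_text), names=names, skiprows=1, delimiter=",")

print(df.columns.tolist())
print(df)
```

Without `skiprows=1`, the original header row would be read as a data row, because passing `names` tells Pandas that the file has no header of its own.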

In [21]:
dataset.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


# **3. Data Analysis**




## **3.1. Descriptive Statistics**



In [22]:
# Show the number of instances (rows) and attributes (columns) in the dataset
print(dataset.shape)

(615, 14)


In [23]:
# Show the data types of the dataset's attributes
# (info() prints its report directly and returns None, so it is not wrapped in print)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        615 non-null    int64  
 1   Category  615 non-null    object 
 2   Age       615 non-null    int64  
 3   Sex       615 non-null    object 
 4   ALB       614 non-null    float64
 5   ALP       597 non-null    float64
 6   ALT       614 non-null    float64
 7   AST       615 non-null    float64
 8   BIL       615 non-null    float64
 9   CHE       615 non-null    float64
 10  CHOL      605 non-null    float64
 11  CREA      615 non-null    float64
 12  GGT       615 non-null    float64
 13  PROT      614 non-null    float64
dtypes: float64(10), int64(2), object(2)
memory usage: 67.4+ KB
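The non-null counts reported by `info()` imply missing values: out of 615 rows, ALB, ALT and PROT each lack 1 value, CHOL lacks 10 and ALP lacks 18. A minimal sketch of how such counts could be obtained directly is shown below. It runs on a small made-up frame (`sample` and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up miniature frame standing in for the real dataset (assumption)
sample = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, np.nan],
    "AST": [22.1, 24.7, 52.6, 22.6],
})

# Count missing values per column
missing = sample.isna().sum()

# Keep only the columns that have at least one missing value, largest first
missing = missing[missing > 0].sort_values(ascending=False)

print(missing)
```

Applied to the real `dataset`, the same two lines would be expected to list only the five columns named above.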


In [24]:
# List the first 15 rows of the dataset
dataset.head(15)

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
5,6,0=Blood Donor,32,m,41.6,43.3,18.5,19.7,12.3,9.92,6.05,111.0,91.0,74.0
6,7,0=Blood Donor,32,m,46.3,41.3,17.5,17.8,8.5,7.01,4.79,70.0,16.9,74.5
7,8,0=Blood Donor,32,m,42.2,41.9,35.8,31.1,16.1,5.82,4.6,109.0,21.5,67.1
8,9,0=Blood Donor,32,m,50.9,65.5,23.2,21.2,6.9,8.69,4.1,83.0,13.7,71.3
9,10,0=Blood Donor,32,m,42.4,86.3,20.3,20.0,35.2,5.46,4.45,81.0,15.9,69.9


In [26]:
# List the last 15 rows of the dataset
dataset.tail(15)

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
600,601,3=Cirrhosis,59,m,27.0,73.8,4.0,65.2,209.0,2.47,3.61,71.7,28.5,60.6
601,602,3=Cirrhosis,59,m,31.0,86.3,5.4,95.4,117.0,1.57,3.51,60.5,53.6,68.5
602,603,3=Cirrhosis,61,m,39.0,102.9,27.3,143.2,15.0,5.38,4.88,72.3,400.3,73.4
603,604,3=Cirrhosis,65,m,,,40.0,54.0,13.0,7.5,,70.0,107.0,79.0
604,605,3=Cirrhosis,74,m,23.0,34.1,2.1,90.4,22.0,2.5,3.29,51.0,46.8,57.1
605,606,3=Cirrhosis,42,f,33.0,79.0,3.7,55.7,200.0,1.72,5.16,89.1,146.3,69.9
606,607,3=Cirrhosis,49,f,33.0,190.7,1.2,36.3,7.0,6.92,3.82,485.9,112.0,58.5
607,608,3=Cirrhosis,52,f,39.0,37.0,1.3,30.4,21.0,6.33,3.78,158.2,142.5,82.7
608,609,3=Cirrhosis,58,f,34.0,46.4,15.0,150.0,8.0,6.26,3.98,56.0,49.7,80.6
609,610,3=Cirrhosis,59,f,39.0,51.3,19.6,285.8,40.0,5.77,4.51,136.1,101.1,70.5


After listing the first 15 and the last 15 rows, I found that some attributes have missing values and that the Category attribute combines a numeric code with the special character "=", which ties the code to the diagnosis. In the data processing phase, this attribute can be converted to an integer value only.
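A minimal sketch of that conversion, on a small illustrative frame, could look like the following. It assumes the `code=description` label format seen in the output, and the choice of mapping the suspect-donor code `0s` to 4 is hypothetical, since the text does not say how that class should be encoded.

```python
import pandas as pd

# Illustrative sample built from the Category values listed in the problem definition
sample = pd.DataFrame({
    "Category": [
        "0=Blood Donor",
        "0s=suspect Blood Donor",
        "1=Hepatitis",
        "2=Fibrosis",
        "3=Cirrhosis",
    ]
})

# Keep only the text before the first "=" (the class code)
codes = sample["Category"].str.split("=", n=1).str[0]

# Hypothetical mapping (assumption): "0s" is not a plain integer, so it gets its own value
code_to_int = {"0": 0, "0s": 4, "1": 1, "2": 2, "3": 3}
sample["Category_code"] = codes.map(code_to_int)

print(sample)
```

An explicit mapping is used instead of a direct integer cast because the `0s` code would make the cast fail.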

In [27]:
# Check the data type of each attribute
dataset.dtypes

ID            int64
Category     object
Age           int64
Sex          object
ALB         float64
ALP         float64
ALT         float64
AST         float64
BIL         float64
CHE         float64
CHOL        float64
CREA        float64
GGT         float64
PROT        float64
dtype: object

In [28]:
# Statistical summary of the dataset's numeric attributes (mean, standard deviation, minimum, maximum and quartiles)
dataset.describe()

Unnamed: 0,ID,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,615.0,615.0,614.0,597.0,614.0,615.0,615.0,615.0,605.0,615.0,615.0,614.0
mean,308.0,47.40813,41.620195,68.28392,28.450814,34.786341,11.396748,8.196634,5.368099,81.287805,39.533171,72.044137
std,177.679487,10.055105,5.780629,26.028315,25.469689,33.09069,19.67315,2.205657,1.132728,49.756166,54.661071,5.402636
min,1.0,19.0,14.9,11.3,0.9,10.6,0.8,1.42,1.43,8.0,4.5,44.8
25%,154.5,39.0,38.8,52.5,16.4,21.6,5.3,6.935,4.61,67.0,15.7,69.3
50%,308.0,47.0,41.95,66.2,23.0,25.9,7.3,8.26,5.3,77.0,23.3,72.2
75%,461.5,54.0,45.2,80.1,33.075,32.9,11.2,9.59,6.06,88.0,40.2,75.4
max,615.0,77.0,82.2,416.6,325.3,324.0,254.0,16.41,9.67,1079.1,650.9,90.0
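By default, `describe()` summarizes only the numeric columns, so `Category` and `Sex` are absent from the table above. One possible complement, sketched here on a made-up sample rather than the real data, is a frequency count per categorical column.

```python
import pandas as pd

# Made-up sample (assumption): only the two non-numeric columns are reproduced
sample = pd.DataFrame({
    "Category": ["0=Blood Donor", "0=Blood Donor", "3=Cirrhosis", "0=Blood Donor"],
    "Sex": ["m", "f", "m", "m"],
})

# Absolute frequency of each class label
category_counts = sample["Category"].value_counts()

# Relative frequency (proportion) of each sex
sex_share = sample["Sex"].value_counts(normalize=True)

print(category_counts)
print(sex_share)
```

For a classification target, these counts also show how balanced the classes are.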
