In [None]:
# Import main libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import math

## 1. Data Gathering

In [None]:
url = "https://filebin.net/axvze9ujhh6e15k6/patients_data.zip?t=v1j8o2xr"
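
The later cells use `patients`, `treatments`, and `adverse_reactions`, but the notebook does not show how they are loaded. A minimal sketch, assuming the archive holds one CSV file per table; the helper `read_csvs_from_zip` and the file name `patients.csv` are hypothetical, not taken from the archive:

```python
import io
import zipfile

import pandas as pd


def read_csvs_from_zip(zip_bytes):
    """Return {file stem: DataFrame} for every CSV inside a zip archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # "folder/patients.csv" -> "patients"
                stem = name.rsplit("/", 1)[-1][:-4]
                with archive.open(name) as handle:
                    tables[stem] = pd.read_csv(handle)
    return tables


# Demo with an in-memory archive, so the sketch runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("patients.csv", "patient_id,height\n1,72\n2,65\n")
demo_tables = read_csvs_from_zip(buffer.getvalue())
print(demo_tables["patients"].shape)
```

With the real archive, `zip_bytes` could come from `urllib.request.urlopen(url).read()`, and each table would then be picked out of the returned dictionary by its file stem.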

## 2. Data Assessing

The assessment is divided into two main aspects:

* Quality of the dataset
* Tidiness of the dataset

#### Quality

A low-quality dataset is a dirty dataset: the problem lies in the content of the data.

Common issues:

* Missing values
* Non-standard units (km, meters, inches, etc. all mixed)
* Inaccurate data, invalid data, inconsistent data, etc.

#### Data Quality Dimensions
* **Completeness** (most severe):
    - Do we have all the records that we should?
    - Do we have missing records or not?
    - Are there specific rows, columns, or cells missing?
    
* **Validity** (second most severe):
    - We have the records, but they're not valid, i.e. they don't conform to a defined schema. A schema is a defined set of rules for data. The rules can be real-world constraints (e.g. negative height is impossible).
    
* **Accuracy** (third most severe):
    - Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still not correct. Example: a typo of height = 27 in when it should be 72 in.
    
* **Consistency** (least severe):
    - Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistent data uses one standard format.

>One dataset may be high enough quality for one application but not for another.
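
The four dimensions can each be probed with a short pandas check. The sketch below uses an invented toy table with one problem per dimension; the 48 in threshold for a suspicious height is an arbitrary assumption, not a rule from the dataset.

```python
import pandas as pd

# Toy table (invented values) with one problem per quality dimension.
toy = pd.DataFrame({
    "height_in": [72.0, None, -5.0, 27.0],      # missing, invalid, inaccurate
    "state": ["CA", "California", "NY", "NY"],  # inconsistent format
})

# Completeness: count missing cells per column.
missing = toy.isna().sum()

# Validity: a height must be positive (a schema rule).
invalid = toy[toy["height_in"] <= 0]

# Accuracy: valid but implausible values need a domain threshold;
# 48 in is an arbitrary cutoff chosen for this sketch.
suspicious = toy[(toy["height_in"] > 0) & (toy["height_in"] < 48)]

# Consistency: state values that are not two-letter codes.
inconsistent = toy[toy["state"].str.len() != 2]

print(missing["height_in"], len(invalid), len(suspicious), len(inconsistent))
```

Note that the accuracy check only flags candidates: a human still has to decide whether 27 in is a typo for 72 in.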



#### Tidiness

Untidy data, or _messy_ data, has problems with the structure of the dataset. In a tidy dataset:

* Each observation forms a row;
* Each variable (feature) forms a column; and
* Each type of observational unit forms a table.

This is Hadley Wickham's definition of tidy data.

### Assessing the data

There are two ways to assess the data.

* Visual, and;
* Programmatic.

#### Visual Assessment

Visual assessment uses everyday tools such as charts, Excel, and tables. In other words, a human inspects the data directly.

#### Programmatic Assessment

Automating the evaluation of a dataset is scalable and lets you handle very large quantities of data.

Examples of programmatic assessment: analysing the data with `.info()`, `.head()`, `.describe()`, plotting (`.plot()`), etc.
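
As a rough illustration of these calls, here is a sketch on an invented three-row table (the column names mirror the patients table, the values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    "given_name": ["Ann", "Bob", "Ann"],
    "surname": ["Lee", "Ray", "Lee"],
    "weight": [150.2, None, 150.2],
})

toy.info()               # dtypes and non-null counts per column
print(toy.head(2))       # first rows
print(toy.describe())    # summary statistics for numeric columns

# Two targeted checks: missing weights and repeated names.
n_missing = toy["weight"].isna().sum()
n_dupes = toy.duplicated(subset=["given_name", "surname"]).sum()
print(n_missing, n_dupes)
```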

#### 2.1 Patients Table

In [None]:
patients.sample(30)

In [None]:
patients.info()

In [None]:
patients[patients[["given_name", "surname"]].duplicated(keep="first")]

In [None]:
plt.hist(patients.height)

In [None]:
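# Computed BMI: weight (lb -> kg) divided by squared height (in -> cm -> m)
imc_calculado = np.round((patients["weight"] * 0.453592) / (((patients["height"] * 2.54)/100)**2), 1)
# BMI as given in the table
imc_dado = patients.bmi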

imc_comparisons = pd.DataFrame({
    "imc_dado": patients[(imc_dado != imc_calculado)]["bmi"],
    "imc_calculado": imc_calculado[imc_dado != imc_calculado]
})

imc_comparisons["diff"] = imc_comparisons["imc_calculado"] - imc_comparisons["imc_dado"]

imc_comparisons
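
Comparing rounded floats with `!=` can flag rows that differ only by rounding. One possible alternative, sketched here on invented values, is a tolerance-based comparison with `np.isclose`; the helper name `bmi_from_imperial` and the 0.1 tolerance are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def bmi_from_imperial(weight_lb, height_in):
    """BMI = weight in kg divided by squared height in metres."""
    weight_kg = weight_lb * 0.453592
    height_m = height_in * 0.0254
    return weight_kg / height_m ** 2


toy = pd.DataFrame({
    "weight": [150.0, 200.0],
    "height": [65.0, 70.0],
    "bmi": [25.0, 31.0],   # second value is deliberately off
})
computed = bmi_from_imperial(toy["weight"], toy["height"])

# True where the recorded BMI is more than ~0.1 away from the computed one.
mismatch = ~np.isclose(toy["bmi"], computed, atol=0.1)
print(toy[mismatch])
```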

#### 2.2 Treatments Table

In [None]:
treatments.sample(30)

#### 2.3 Adverse Reactions Table

In [None]:
adverse_reactions.sample(30)

In [None]:
adverse_reactions.adverse_reaction.value_counts()

#### Quality
*PATIENTS Table*
- X Column `state` has some state names spelled out in full: California, New York, Illinois, Florida, Nebraska.
- X Column `patient_id` is stored as an integer. It should be a string.
- X Column `birthdate` is stored as a string. It should be datetime.
- X Patient John Doe has more than one record with the same information, except for `patient_id`.
- X Patient Jakob Jakobsen has 2 records, one under the name Jake. Jake's row index = 29
- X Patient Patrick Gersten has 2 records, one under the name Pat. Pat's row index = 502
- X Patient Sandra Taylor has 2 records, one under the name Sandy. Sandy's row index = 282
- X Patient Camilla Zaitseva has her weight in kg instead of lb.
- X Patient Tim Neudorf has a height of 27 in. The correct value is 72 in.
- X Column `contact`: phone numbers do not follow a single format.


*TREATMENTS Table*
- Columns `given_name` and `surname` hold lower-case names. They should be title-cased.
- Column `hba1c_change` has missing values and miscalculated values.
- Start and end doses carry a "u" suffix and are stored as strings.


*ADVERSE REACTIONS Table*
- Columns `given_name` and `surname` hold lower-case names.
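
Lower-case name columns, as noted for the treatments and adverse reactions tables, can be title-cased with pandas string methods. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows standing in for the treatments table.
toy = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrova", "richardson"],
})

for column in ["given_name", "surname"]:
    toy[column] = toy[column].str.title()

print(toy)
```

`str.title()` capitalizes the first letter of each word, so names with internal capitals (for example "McDonald") would need separate handling.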

#### Tidiness
*PATIENTS Table*
- X Column `contact` holds both an email address and a phone number in a single field.

*TREATMENTS Table*
- Columns `auralin` and `novodra` should be restructured:
    - Auralin and Novodra are values that belong in a `medicine` column.
    - Start and end doses should be in separate columns.
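
The restructuring described for the treatments table could look like the sketch below. It assumes each dose cell is a string such as "41u - 48u" and that "-" marks the medicine a patient did not take; both the cell format and the new column names `dose_start` and `dose_end` are assumptions for this example.

```python
import pandas as pd

# Invented rows; the "41u - 48u" cell format is an assumption for this sketch.
toy = pd.DataFrame({
    "given_name": ["ann", "bob"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "33u - 36u"],
})

# Wide to long: one row per (patient, medicine) pair.
tidy = toy.melt(id_vars="given_name", var_name="medicine", value_name="dose")
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)

# Split "start - end" into two integer columns and drop the "u" suffix.
doses = tidy["dose"].str.extract(r"(\d+)u\s*-\s*(\d+)u").astype(int)
tidy["dose_start"] = doses[0]
tidy["dose_end"] = doses[1]
tidy = tidy.drop(columns="dose")
print(tidy)
```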

## 3. Data Cleaning

#### 3.1 Patients Table

In [None]:
patients_clean = patients.copy()

**ISSUE**: Column `state` holds a mix of state abbreviations and full state names.

**Plan**: Abbreviate the names that are spelled out.

In [None]:
abbreviation_dic = {
    "California": "CA",
    "New York": "NY",
    "Illinois": "IL",
    "Florida": "FL",
    "Nebraska": "NE"
}

# Row-wise approach
#patients_clean["state"] = patients_clean.apply(lambda row: abbreviation_dic[row["state"]] if row["state"] in abbreviation_dic else row["state"], axis=1)

# Vectorized approach: map full names to abbreviations by exact match;
# values not in the dictionary are left unchanged
patients_clean["state"] = patients_clean["state"].replace(abbreviation_dic)

In [None]:
# Test
patients_clean.state.value_counts()

**ISSUE**: Column `contact` holds both an email address and a phone number.

**Plan**: Separate phone number from email using regex and put them in different columns. Delete the `contact` column.

In [None]:
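# Raw strings keep the regex backslashes intact; expand=False returns a Series
patients_clean["phone_number"] = patients_clean["contact"].str.extract(r"((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})", expand=False)
# [^\s@]+ matches the local part: any run of characters other than whitespace and "@"
patients_clean["email"] = patients_clean["contact"].str.extract(r"([A-Za-z][^\s@]+@\w+\.[A-Za-z]+)", expand=False)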

patients_clean.drop(["contact"], axis=1, inplace=True)

In [None]:
#Test
patients_clean[["phone_number", "email"]][:60]
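
To see what the two patterns capture, here is a small sketch on invented contact strings; real rows may use layouts these patterns do not cover.

```python
import pandas as pd

# Invented contact strings: phone then email, and email then phone.
contact = pd.Series([
    "555-123-4567jane.doe@example.com",
    "john@example.org+1 (555) 987-6543",
])

phone_pattern = r"((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})"
email_pattern = r"([A-Za-z][^\s@]+@\w+\.[A-Za-z]+)"

# expand=False returns a Series because each pattern has one capture group.
phone = contact.str.extract(phone_pattern, expand=False)
email = contact.str.extract(email_pattern, expand=False)
print(pd.DataFrame({"phone_number": phone, "email": email}))
```

The email pattern only accepts a single-label domain followed by one suffix, so an address such as `user@mail.example.com` would be cut short at `user@mail.example`.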

**ISSUE**: Column `patient_id` is stored as an integer.

**Plan**: Convert the column to string (object dtype).

In [None]:
patients_clean["patient_id"] = patients_clean["patient_id"].astype("str")

In [None]:
# Test
patients_clean.info()

**ISSUE**: Column `birthdate` is stored as a string.

**Plan**: Convert it to datetime.

In [None]:
patients_clean["birthdate"] = pd.to_datetime(patients_clean["birthdate"], format="%m/%d/%Y")

In [None]:
# Test
patients_clean.info()
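
With a fixed `format`, `pd.to_datetime` raises an error on any value that does not match. If malformed dates are possible, one option is `errors="coerce"`, which turns them into `NaT` so they can be counted and inspected. A sketch on invented values:

```python
import pandas as pd

raw = pd.Series(["7/10/1976", "12/3/1989", "not a date"])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
print(parsed)
print(parsed.isna().sum())
```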

**ISSUE**: Patient John Doe has several similar records.

**Plan**: Find the indices of the duplicated John Doe rows and remove them with `df.drop(lista_indices, axis=0, inplace=True)`.

In [None]:
lista_indices = patients_clean[patients_clean[["given_name", "surname"]].duplicated(keep="first")].index

patients_clean.drop(lista_indices, axis=0, inplace=True)

In [None]:
# Test
patients_clean[patients_clean[["given_name", "surname"]].duplicated(keep="first")]
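
The `keep` argument controls which copies `duplicated` marks. A short sketch on invented rows, showing the options side by side:

```python
import pandas as pd

toy = pd.DataFrame({
    "given_name": ["John", "Ann", "John", "John"],
    "surname": ["Doe", "Lee", "Doe", "Doe"],
})
key = ["given_name", "surname"]

# keep="first": mark every copy except the first occurrence.
later_copies = toy[toy.duplicated(subset=key, keep="first")]
# keep=False: mark every row that has a duplicate, first occurrence included.
all_copies = toy[toy.duplicated(subset=key, keep=False)]
# drop_duplicates performs the lookup and the removal in one step.
deduplicated = toy.drop_duplicates(subset=key, keep="first")

print(later_copies.index.tolist(), all_copies.index.tolist(), deduplicated.index.tolist())
```

Deduplicating on name alone assumes that no two distinct patients share the same given name and surname; adding a column such as `birthdate` to the key would make the match stricter.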

**ISSUE**: Some patients have one record under their real name and another under a nickname.

**Plan**: Drop the row indices of the nickname records.

In [None]:
patients_clean.drop([29, 502, 282], axis=0, inplace=True)

In [None]:
# Test: a cell only displays its last expression, so check all nicknames at once
patients_clean[patients_clean["given_name"].isin(["Jake", "Pat", "Sandy"])]

**ISSUE**: Patient Camilla Zaitseva has her weight in kg.

**Plan**: Convert her weight to lb.

In [None]:
# Locate Camilla's row (index 210)
patients_clean[patients_clean["given_name"] == "Camilla"]

# Convert the recorded 48.8 kg to lb (1 kg = 2.20462 lb)
patients_clean.loc[210, "weight"] = np.round(48.8 * 2.20462, 1)

In [None]:
# Test
patients_clean.loc[210, :]

**ISSUE**: Patient Tim Neudorf has a height with transposed digits.

**Plan**: Swap the digits back.

In [None]:
patients_clean.loc[4, "height"] = 72

In [None]:
# Test
patients_clean.loc[4, "height"]