## Artificial Intelligence - L.EIC029 - 2023/2025

### Group_A2_115
• Henrique Silva - up202105647 <br>
• João Couto - up202006526 <br>
• Tiago Azevedo - up202108840 <br>	


### 1. Data Preprocessing

To begin, we import all the libraries we will use.

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import LabelEncoder

Next, we read the dataset.

The dataset contains the following columns:

- **Case_No**: Case study number.
- **A1-A10**: [Questions](#q-chat-10-questions) from the Q-CHAT-10 (*Quantitative Checklist for Autism in Toddlers*), where the possible answers are *Always*, *Usually*, *Sometimes*, *Rarely* and *Never*. For questions 1-9 (A1-A9), the answers *Sometimes*, *Rarely* and *Never* are mapped to **1**. For question 10 (A10), the answers *Always*, *Usually* and *Sometimes* are mapped to **1**. All other answers are mapped to **0**.
- **Age_Mons**: Age of the individual, in months.
- **Qchat-10-Score**: Number of **1**s in the Q-CHAT-10. When this value is greater than **3**, there are indications that the individual may have *ASD*.
- **Sex**: Sex of the individual.
- **Ethnicity**: Ethnicity of the individual.
- **Jaundice**: Indicates whether the individual has [*jaundice*](https://pt.wikipedia.org/wiki/Icter%C3%ADcia).
- **Family_mem_with_ASD**: Indicates whether any family member has *ASD*.
- **Who completed the test**: Indicates who completed the test.
- **Class/ASD Traits**: Indicates whether the individual shows signs of *ASD*. <br><br>

#### Q-CHAT-10 Questions
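
The answer-to-binary mapping described for A1-A10 can be sketched as below. This is only an illustration under stated assumptions: the helper `qchat10_score` and the sample answers are hypothetical and are not part of the original notebook, since the dataset already ships with the binary values.

```python
# Hypothetical helper illustrating the scoring rule described above;
# it is not part of the original notebook or of the dataset's tooling.
RISK_ANSWERS_A1_TO_A9 = {"Sometimes", "Rarely", "Never"}
RISK_ANSWERS_A10 = {"Always", "Usually", "Sometimes"}


def qchat10_score(answers):
    """Return (binary_items, total_score) for a list of ten answers, A1 to A10."""
    if len(answers) != 10:
        raise ValueError("expected exactly ten answers (A1 to A10)")
    binary_items = [
        # index 9 is A10, which uses the reversed set of risk answers
        int(answer in (RISK_ANSWERS_A10 if index == 9 else RISK_ANSWERS_A1_TO_A9))
        for index, answer in enumerate(answers)
    ]
    return binary_items, sum(binary_items)


# Made-up answers, for illustration only
example_answers = ["Always", "Usually", "Never", "Sometimes", "Always",
                   "Rarely", "Usually", "Always", "Never", "Always"]
items, score = qchat10_score(example_answers)
has_asd_traits = score > 3  # threshold described in the column list
print(items, score, has_asd_traits)
```

With these sample answers, five items are mapped to 1, so the score is above the threshold of 3.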
- **A1**: Does your child look at you when you call his/her name?
- **A2**: How easy is it for you to get eye contact with your child?
- **A3**: Does your child point to indicate that s/he wants something? (e.g. a toy that is out of reach)
- **A4**: Does your child point to share interest with you? (e.g. pointing at an interesting sight)
- **A5**: Does your child pretend? (e.g. care for dolls, talk on a toy phone)
- **A6**: Does your child follow where you’re looking?
- **A7**: If you or someone else in the family is visibly upset, does your child show signs of wanting to comfort them? (e.g. stroking hair, hugging them)
- **A8**: Would you describe your child’s first words as unusual?
- **A9**: Does your child use simple gestures? (e.g. wave goodbye)
- **A10**: Does your child stare at nothing with no apparent purpose?

In [96]:
data = pd.read_csv("autism_dataset_for_toddlers.csv")

# remove the trailing space after 'Traits'
data.rename(columns={"Class/ASD Traits ": "Class/ASD Traits"}, inplace=True)

data.head()

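|   | Case_No | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | f | middle eastern | yes | no | family member | No |
| 1 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | m | White European | yes | no | family member | Yes |
| 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | m | middle eastern | yes | no | family member | Yes |
| 3 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | m | Hispanic | no | no | family member | Yes |
| 4 | 5 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | f | White European | no | yes | family member | Yes |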
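
The rename above fixes one header by hand. As a possible alternative, stray whitespace can be stripped from every column name at once. The sketch below uses a tiny in-memory CSV (made-up rows, not the real file) so that it runs on its own; `sample` and `csv_text` are illustrative names.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file; note the trailing
# space in the last header, as in "Class/ASD Traits ".
csv_text = "Case_No,A1,Class/ASD Traits \n1,0,No\n2,1,Yes\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Strip leading/trailing whitespace from every column name at once
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```

This avoids having to know in advance which headers carry extra spaces.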


Now that we know our dataset, we start by removing the `Case_No` column, since a case number carries no information relevant to the analysis.

In [97]:
data = data.drop(["Case_No"], axis="columns")
data.head()

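|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | f | middle eastern | yes | no | family member | No |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | m | White European | yes | no | family member | Yes |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | m | middle eastern | yes | no | family member | Yes |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | m | Hispanic | no | no | family member | Yes |
| 4 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | f | White European | no | yes | family member | Yes |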


Next, we make sure that any missing information is represented as `NaN`, and then check whether any column contains missing values.

In [98]:
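# read_csv already stores missing entries as NaN, so this call is
# effectively a no-op kept as a safeguard
data.fillna(np.nan, inplace=True)
# True for every column that contains at least one missing value
data.isna().any()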

A1                        False
A2                        False
A3                        False
A4                        False
A5                        False
A6                        False
A7                        False
A8                        False
A9                        False
A10                       False
Age_Mons                  False
Qchat-10-Score            False
Sex                       False
Ethnicity                 False
Jaundice                  False
Family_mem_with_ASD       False
Who completed the test    False
Class/ASD Traits          False
dtype: bool
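
`isna().any()` only says whether a column has gaps. If some column had returned `True`, a natural follow-up would be to count the gaps and look at the affected rows. The sketch below shows one way to do that on a toy frame with made-up values and deliberately missing entries. The real dataset has no missing values, so this is purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and deliberately missing entries
toy = pd.DataFrame({
    "Age_Mons": [28, np.nan, 36],
    "Sex": ["f", "m", None],
    "Jaundice": ["yes", "no", "no"],
})

# Number of missing cells per column
missing_per_column = toy.isna().sum()
# Rows that have at least one missing cell
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing))
```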

The last preprocessing step is to convert the columns whose values are not numeric.

In [99]:
data.head()

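|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | f | middle eastern | yes | no | family member | No |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | m | White European | yes | no | family member | Yes |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | m | middle eastern | yes | no | family member | Yes |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | m | Hispanic | no | no | family member | Yes |
| 4 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | f | White European | no | yes | family member | Yes |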


In [100]:
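# names of all non-numeric columns; the outputs below show the target
# "Class/ASD Traits" encoded as well, so it is not skipped here
object_columns = data.select_dtypes(include=["object"]).columns

for col in object_columns:
    # a fresh encoder per column, so no fitted state is shared between columns
    data[col] = LabelEncoder().fit_transform(data[col])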

data.head()

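|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | 0 | 8 | 1 | 0 | 4 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | 1 | 5 | 1 | 0 | 4 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | 1 | 8 | 1 | 0 | 4 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | 1 | 0 | 0 | 0 | 4 | 1 |
| 4 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | 0 | 5 | 0 | 1 | 4 | 1 |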
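
To interpret the integers in the encoded table, it helps to know that `LabelEncoder` sorts the distinct labels and uses each label's position as its code. The sketch below shows how that mapping could be recovered from the fitted encoder's `classes_` attribute. The `sex` series holds made-up sample values rather than the real column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["f", "m", "m", "f"])  # made-up sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the original labels in sorted order;
# the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```

For a nominal column such as `Ethnicity`, these integers carry no meaningful order. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when a model would otherwise read the codes as ordered.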


Next, we analyze the distribution of the data in the dataset and check whether we can find any anomalies.

In [101]:
data.describe()

|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 | 1054.0 |
| mean | 0.563567 | 0.448767 | 0.401328 | 0.512334 | 0.524668 | 0.57685 | 0.649905 | 0.459203 | 0.489564 | 0.586338 | 27.867173 | 5.212524 | 0.697343 | 5.863378 | 0.273245 | 0.16129 | 3.885199 | 0.690702 |
| std | 0.496178 | 0.497604 | 0.4904 | 0.500085 | 0.499628 | 0.494293 | 0.477226 | 0.498569 | 0.500128 | 0.492723 | 7.980354 | 2.907304 | 0.459626 | 2.098325 | 0.445837 | 0.367973 | 0.639852 | 0.462424 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.0 | 3.0 | 0.0 | 5.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 30.0 | 5.0 | 1.0 | 6.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 36.0 | 8.0 | 1.0 | 7.0 | 1.0 | 0.0 | 4.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 36.0 | 10.0 | 1.0 | 10.0 | 1.0 | 1.0 | 4.0 | 1.0 |


Based on this table, we can see that there are no anomalies.

- All columns have 1054 rows (*count*);
- In both *Qchat-10-Score* and the questions *A1-A10*, the recorded values fall within the expected range: the minimum (*min*) is 0, and the maximum (*max*) is 1 for the questions and 10 for the final score.
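
The range checks listed above could also be expressed in code, together with one extra consistency check that follows from the column description: the score should equal the number of 1s across A1-A10. The sketch below is a minimal illustration on a toy frame built from the first two rows of the `data.head()` output. The variable names are illustrative, and the same expressions would apply to `data`.

```python
import pandas as pd

# Toy frame holding the A1-A10 answers and scores of the first two dataset rows
question_cols = [f"A{i}" for i in range(1, 11)]
toy = pd.DataFrame([[0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
                    [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]], columns=question_cols)
toy["Qchat-10-Score"] = [3, 4]

# Every answer must be 0 or 1
answers_are_binary = bool(toy[question_cols].isin([0, 1]).all().all())
# The score must lie between 0 and 10 (inclusive)
score_in_range = bool(toy["Qchat-10-Score"].between(0, 10).all())
# The score should equal the number of 1s across A1-A10
score_matches_sum = bool(
    (toy[question_cols].sum(axis=1) == toy["Qchat-10-Score"]).all()
)

print(answers_are_binary, score_in_range, score_matches_sum)
```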

However, anomalies are easier to spot when this information is visualized.

#### Bibliography

- https://www.kaggle.com/datasets/vaishnavisirigiri/autism-dataset-for-toddlers?resource=download
- https://www.autismalert.org/uploads/PDF/SCREENING--AUTISM--QCHAT-10%20Question%20Autism%20Survey%20for%20Toddlers.pdf
- https://pt.wikipedia.org/wiki/Icter%C3%ADcia
- https://miamioh.edu/centers-institutes/center-for-analytics-data-science/students/coding-tutorials/python/data-cleaning.html
- https://hyperskill.org/learn/step/32241