<a href="https://colab.research.google.com/github/lucaaroeiracrv/dataset-lfw_people/blob/main/dataset_lfw_people.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notes: Data Analysis and Cleaning

1. **Importing the Dataset**:
   - Import the dataset into the notebook.
   - Dataset = the data used for training or analysis.

2. **Initial Exploration**:
   - Display the dataset and show its first 5 rows.
   - Show summary information about the dataset using the `describe` function.

3. **Identifying Variables**:
   - Check whether the dataset contains qualitative and quantitative variables.
   - Identify and list which variables are qualitative and which are quantitative.

4. **Cleaning and Normalizing the Dataset**:
   - Clean the dataset by removing useless data (up to 20% of the dataset may be dropped).
   - Normalize the data, if necessary, so that all variables share the same scale.

5. **Correlation Analyses**:
   - Carry out 3 analyses that make sense for this dataset, each involving the correlation between variables.
   - Use the Pearson correlation to measure these relationships.

6. **Explaining the Pearson Correlation**:
   - Explain what the Pearson correlation is and how it is calculated.
   - Include an explanation of the correlation table generated from the Pearson coefficients.



---



In [None]:
import pandas as pd
from sklearn.datasets import fetch_lfw_people
from sklearn.preprocessing import MinMaxScaler

In [None]:
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

In [None]:
df = pd.DataFrame(lfw_people.data)
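
Each row of `df` is one face image flattened into a single vector, so every column is a pixel position rather than a named feature. The sketch below illustrates that mapping with synthetic data. It assumes the 1850 columns come from 50 x 37 images (which is what `resize=0.4` is expected to produce) and uses a random array as a stand-in for `lfw_people.images`, so no download is needed; `df_demo`, `h` and `w` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lfw_people.images: 3 grayscale images of 50 x 37 pixels.
rng = np.random.default_rng(0)
h, w = 50, 37  # assumed image height and width (50 * 37 = 1850)
images = rng.random((3, h, w)).astype(np.float32)

# Flatten each image row by row: one DataFrame row of h * w values per image.
flat = images.reshape(len(images), h * w)
df_demo = pd.DataFrame(flat)

# Column k holds the pixel at image row k // w and image column k % w.
k = 1846
row, col = divmod(k, w)
print(df_demo.shape)  # (3, 1850)
print(row, col)       # 49 33
```

Under this assumption, neighbouring column numbers such as 0 and 1 are horizontally adjacent pixels, which matters later when reading the correlations.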

In [None]:
print("Dataset description:")
df.describe()

Dataset description:


|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1840 | 1841 | 1842 | 1843 | 1844 | 1845 | 1846 | 1847 | 1848 | 1849 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | ... | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 | 1288.0 |
| mean | 0.355769 | 0.374626 | 0.412965 | 0.462963 | 0.507093 | 0.541963 | 0.567585 | 0.587089 | 0.605063 | 0.621473 | ... | 0.370729 | 0.38826 | 0.413324 | 0.44784 | 0.480924 | 0.496889 | 0.486948 | 0.465413 | 0.436856 | 0.404322 |
| std | 0.180138 | 0.174567 | 0.169204 | 0.165087 | 0.159878 | 0.148425 | 0.142111 | 0.138726 | 0.135052 | 0.130857 | ... | 0.181329 | 0.201138 | 0.227058 | 0.249804 | 0.273326 | 0.285078 | 0.29493 | 0.303429 | 0.302333 | 0.303257 |
| min | 0.0 | 0.001307 | 0.001307 | 0.003922 | 0.005229 | 0.00915 | 0.00915 | 0.036601 | 0.039216 | 0.047059 | ... | 0.003922 | 0.007843 | 0.001307 | 0.003922 | 0.003922 | 0.002614 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.227124 | 0.252288 | 0.29902 | 0.359477 | 0.40915 | 0.449673 | 0.478105 | 0.500654 | 0.525163 | 0.539869 | ... | 0.243137 | 0.244118 | 0.244444 | 0.256209 | 0.259804 | 0.249673 | 0.220588 | 0.177451 | 0.157843 | 0.129412 |
| 50% | 0.339216 | 0.369935 | 0.415033 | 0.461438 | 0.509804 | 0.543137 | 0.571242 | 0.59085 | 0.609804 | 0.628758 | ... | 0.355556 | 0.371242 | 0.389542 | 0.420261 | 0.448366 | 0.477778 | 0.46732 | 0.438562 | 0.392157 | 0.330719 |
| 75% | 0.47451 | 0.486275 | 0.524183 | 0.574183 | 0.613072 | 0.639216 | 0.660131 | 0.678431 | 0.695425 | 0.70719 | ... | 0.473203 | 0.495425 | 0.548366 | 0.616013 | 0.695425 | 0.739869 | 0.745425 | 0.740196 | 0.699673 | 0.667974 |
| max | 0.997386 | 0.996078 | 0.992157 | 0.968627 | 0.959477 | 0.968627 | 0.971242 | 0.989542 | 0.996078 | 0.993464 | ... | 0.985621 | 0.993464 | 0.99085 | 0.998693 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
print("\nFirst 5 rows of the dataset:")
print(df.head())


First 5 rows of the dataset:
       0         1         2         3         4         5         6     \
0  0.997386  0.996078  0.992157  0.966013  0.758170  0.569935  0.700654   
1  0.147712  0.197386  0.175163  0.192157  0.385621  0.473203  0.543791   
2  0.343791  0.394771  0.491503  0.555556  0.597386  0.611765  0.606536   
3  0.047059  0.016993  0.023529  0.016993  0.031373  0.230065  0.677124   
4  0.471895  0.458824  0.486275  0.499346  0.494118  0.513726  0.545098   

       7         8         9     ...      1840      1841      1842      1843  \
0  0.794771  0.784314  0.767320  ...  0.437909  0.426144  0.422222  0.415686   
1  0.615686  0.671895  0.694118  ...  0.168627  0.239216  0.296732  0.307190   
2  0.626144  0.640523  0.652288  ...  0.483660  0.430065  0.379085  0.410458   
3  0.667974  0.641830  0.400000  ...  0.481046  0.749020  0.903268  0.915033   
4  0.543791  0.560784  0.581699  ...  0.107190  0.062745  0.019608  0.018301   

       1844      1845      1846     

In [None]:
print("\nGeneral information about the dataset:")
df.info()  # info() prints its own report and returns None, so it is not wrapped in print()


General information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1288 entries, 0 to 1287
Columns: 1850 entries, 0 to 1849
dtypes: float32(1850)
memory usage: 9.1 MB


In [None]:
# Check the type of each variable (every column is float32, so all are quantitative)
print(df.dtypes)
print(lfw_people.data)

0       float32
1       float32
2       float32
3       float32
4       float32
         ...   
1845    float32
1846    float32
1847    float32
1848    float32
1849    float32
Length: 1850, dtype: object
[[0.9973857  0.99607843 0.9921568  ... 0.38169935 0.38823533 0.3803922 ]
 [0.14771242 0.19738562 0.1751634  ... 0.45751634 0.44444445 0.53594774]
 [0.34379086 0.39477125 0.49150327 ... 0.709804   0.72156864 0.7163399 ]
 ...
 [0.3633987  0.3372549  0.30718955 ... 0.19738562 0.22091503 0.19346406]
 [0.19346406 0.24705882 0.34248367 ... 0.7346406  0.6640523  0.6117647 ]
 [0.11633987 0.10196079 0.1267974  ... 0.13333334 0.13725491 0.2535948 ]]
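
The `dtypes` listing shows only numeric pixel columns, so `df` by itself contains no qualitative variable. In the real dataset the identity labels live separately in `lfw_people.target` and `lfw_people.target_names`. As a minimal sketch of how the two kinds of variable could be listed once such a label is attached, the example below uses a small hypothetical frame (`df_demo`, with made-up names in a `person` column) instead of the downloaded data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in: 6 samples, 4 pixel columns, plus a categorical name label.
pixels = pd.DataFrame(rng.random((6, 4)).astype(np.float32))
names = ["Person A", "Person B", "Person A", "Person C", "Person B", "Person A"]
df_demo = pixels.assign(person=pd.Categorical(names))

# Numeric dtypes are treated as quantitative, everything else as qualitative.
quantitative = df_demo.select_dtypes(include="number").columns.tolist()
qualitative = df_demo.select_dtypes(exclude="number").columns.tolist()

print(quantitative)  # [0, 1, 2, 3]
print(qualitative)   # ['person']
```

Splitting by dtype is only a heuristic: an integer-coded label such as `lfw_people.target` would be reported as numeric even though it is qualitative in meaning.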


### Normalize the data
---

In [None]:
# Rescale every pixel column independently to the [0, 1] range.
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
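
With its default settings, `MinMaxScaler` maps each column to the range [0, 1] by subtracting the column minimum and dividing by the column range. The sketch below reproduces that calculation by hand on a tiny made-up frame (`demo`) and checks that it agrees with the scaler; it is an illustration of the formula, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two columns on different scales.
demo = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 10.5, 12.0]})

# Min-max scaling by hand: (x - min) / (max - min), computed column by column.
manual = (demo - demo.min()) / (demo.max() - demo.min())

# The same transformation using scikit-learn.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(demo), columns=demo.columns)

print(manual)
print(np.allclose(manual, scaled))  # True
```

Because the scaling is done per column, a pixel column whose values already span 0 to 1 is left unchanged, while a column with a narrower range is stretched.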

In [None]:
# Drop any rows that contain missing values.
df_cleaned = df_normalized.dropna()
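
Before dropping rows it can help to measure how much would actually be removed, since the notes allow discarding at most 20% of the dataset. The following is a minimal sketch on made-up data (`demo`). It also removes duplicate rows, which the notebook itself does not do, so treat that step as an assumption about what counts as useless data.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value (row 1) and one duplicated row (row 3 repeats row 2).
demo = pd.DataFrame({
    "p0": [0.1, np.nan, 0.3, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "p1": [0.5, 0.6, 0.7, 0.7, 0.2, 0.3, 0.4, 0.1, 0.0, 0.9],
})

n_missing = int(demo.isna().sum().sum())  # total number of NaN cells
n_dupes = int(demo.duplicated().sum())    # rows identical to an earlier row

# Remove rows with missing values, then exact duplicates (the latter is an added assumption).
cleaned = demo.dropna().drop_duplicates()
removed_fraction = (len(demo) - len(cleaned)) / len(demo)

print(n_missing, n_dupes)  # 1 1
print(removed_fraction)    # 0.2
```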

---
### Pearson correlation

In [None]:
correlation_matrix = df_cleaned.corr(method='pearson')

In [None]:
print(correlation_matrix)
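
The Pearson coefficient measures the strength of the linear relationship between two variables. It is the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the two sums of squared deviations, and it always lies between -1 and 1. The sketch below computes it by hand on made-up values and compares the result with pandas; the helper name `pearson` is illustrative.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation coefficient computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values, not taken from the LFW data.
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

r_manual = pearson(demo["x"], demo["y"])
r_pandas = demo["x"].corr(demo["y"], method="pearson")

print(round(r_manual, 4))  # 0.7746
print(round(r_pandas, 4))  # 0.7746
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.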

In [None]:
# Read three pairwise coefficients from the correlation matrix (row, column).
correlation_1 = correlation_matrix.iloc[0, 1]
correlation_2 = correlation_matrix.iloc[2, 3]
correlation_3 = correlation_matrix.iloc[4, 5]

print(f"Correlation between variables 0 and 1: {correlation_1}")
print(f"Correlation between variables 2 and 3: {correlation_2}")
print(f"Correlation between variables 4 and 5: {correlation_3}")
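
The correlation table is symmetric and has 1.0 on its diagonal, because each variable correlates perfectly with itself, so every distinct pair appears exactly once above the diagonal. Instead of choosing pairs by position, one possible way to read the table is to rank those upper-triangle entries by absolute value. The sketch below does this on a small synthetic frame (`demo`, with a column `b` built to be nearly a copy of `a`); with 1850 pixel columns the same idea applies, but the ranking is over a much larger set of pairs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.random(200)

# Synthetic data: b is almost identical to a, c is generated independently.
demo = pd.DataFrame({
    "a": base,
    "b": base + 0.01 * rng.random(200),
    "c": rng.random(200),
})

corr = demo.corr(method="pearson")

# Keep only entries strictly above the diagonal so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

print(pairs.head(3))
print(pairs.index[0])  # ('a', 'b')
```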