# Data Mining Project (P1)

- Student: Luiz Fernando Costa dos Santos
- Student ID: 20200025446

The dataset used is Adult, obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). More information is available in the file `./data/adult.names` or on the site linked above.

## Imports

In [1]:
import pandas as pd 
import numpy as np

## Preprocessing and data loading

In [2]:
np.random.seed(40)

features_names = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income"
]


In [3]:
original = pd.read_csv("data/adult.data.csv", header=None)
new = pd.read_csv("data/adult.data.csv", header=None)

original.columns = features_names
new.columns = features_names

In [13]:
cat_cols = original.select_dtypes(["object"]).columns

In [19]:
replace_map = {}
for i in cat_cols:
    replace_map[i] = dict(zip(original[i].unique(), range(len(original[i].unique()))))

In [23]:
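for i in cat_cols:
    # Assign the encoded column back instead of using a chained in-place replace;
    # every category is a key of replace_map[i], so map covers all values
    original[i] = original[i].map(replace_map[i])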

In [24]:
for i in cat_cols:
    # Same integer encoding as above, applied to the copy that will receive nulls
    new[i] = new[i].map(replace_map[i])

## Generating the null values

In [25]:
mask = np.random.choice(2, original.shape, p=[0.9, 0.1])

new = new.mask(mask == 1)
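
As a minimal sketch of what this cell does, the example below applies the same masking step to a small hypothetical frame (`toy`, its values, and the seed are illustrative and not part of the dataset). Each cell is blanked independently with probability 0.1, and integer columns are upcast to `float64` because `NaN` is a float.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # hypothetical seed, chosen only for this sketch

# Hypothetical toy frame standing in for the encoded Adult data.
toy = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) % 7})

# Draw 0 or 1 for every cell; 1 (probability 0.1) marks a cell to blank.
toy_mask = np.random.choice(2, toy.shape, p=[0.9, 0.1])
toy_missing = toy.mask(toy_mask == 1)

# Fraction of missing cells per column, expected to be close to 0.1.
missing_rate = toy_missing.isna().mean()
print(missing_rate)
print(toy_missing.dtypes)
```

A single pass should leave about 90% of the cells non-null. The `new.info()` output that follows reports roughly 81% non-null values per column, which would be consistent with the masking cell having been executed twice on `new` (0.9 × 0.9 = 0.81). This is an inference from the counts, not something the notebook states.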

In [26]:
new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             26396 non-null  float64
 1   workclass       26334 non-null  float64
 2   fnlwgt          26425 non-null  float64
 3   education       26276 non-null  float64
 4   education-num   26562 non-null  float64
 5   marital-status  26404 non-null  float64
 6   occupation      26374 non-null  float64
 7   relationship    26310 non-null  float64
 8   race            26319 non-null  float64
 9   sex             26228 non-null  float64
 10  capital-gain    26361 non-null  float64
 11  capital-loss    26486 non-null  float64
 12  hours-per-week  26536 non-null  float64
 13  native-country  26293 non-null  float64
 14  income          26315 non-null  float64
dtypes: float64(15)
memory usage: 3.7 MB


In [27]:
original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  int64
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  int64
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  int64
 6   occupation      32561 non-null  int64
 7   relationship    32561 non-null  int64
 8   race            32561 non-null  int64
 9   sex             32561 non-null  int64
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  int64
 14  income          32561 non-null  int64
dtypes: int64(15)
memory usage: 3.7 MB


## Mean method

In [29]:
new

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39.0,0.0,77516.0,0.0,,0.0,,0.0,0.0,,,0.0,40.0,0.0,0.0
1,,1.0,83311.0,,13.0,1.0,1.0,1.0,,0.0,0.0,0.0,13.0,0.0,0.0
2,38.0,2.0,215646.0,1.0,,2.0,2.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
3,53.0,2.0,234721.0,2.0,,,2.0,,1.0,0.0,0.0,,40.0,0.0,0.0
4,28.0,2.0,338409.0,0.0,,1.0,,2.0,1.0,1.0,0.0,0.0,40.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27.0,2.0,257302.0,6.0,,,10.0,2.0,0.0,,0.0,0.0,38.0,0.0,0.0
32557,40.0,2.0,154374.0,1.0,,1.0,9.0,1.0,,0.0,0.0,,40.0,0.0,1.0
32558,58.0,2.0,151910.0,1.0,9.0,6.0,0.0,,0.0,1.0,,,40.0,0.0,0.0
32559,22.0,2.0,201490.0,1.0,,0.0,0.0,3.0,0.0,,0.0,0.0,20.0,0.0,0.0


In [32]:
new_with_mean = new.fillna(new.mean(axis=0))
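
A small sketch of the mean fill on hypothetical toy data (the frame and its column names are illustrative only) shows what this one-liner does: `mean` skips nulls by default, and `fillna` with a Series fills each column with the entry that matches its name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "hours": [40.0, np.nan, 20.0, 60.0],
    "code": [0.0, 1.0, np.nan, 1.0],
})

# Column-wise means are computed over the observed (non-null) values only.
col_means = toy.mean(axis=0)

# Each null is replaced by the mean of its own column.
toy_filled = toy.fillna(col_means)
print(toy_filled)
```

For the integer-encoded categorical columns, the column mean is usually a fractional value that does not correspond to any category, which is worth keeping in mind when reading the distances for those columns.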

## Regression method

In [36]:
from sklearn.linear_model import LinearRegression

def replace_nulls_with_regression(df, target_column, predictor_columns):
    """
    Fit a linear regression of a target column on predictor columns and
    return the model's predictions for every row.

    Parameters:
    - df: pandas DataFrame whose predictor columns contain no null values
    - target_column: str, the column to be predicted
    - predictor_columns: list of str, columns to be used as predictors for linear regression

    Returns:
    - predicted_values: numpy array with one prediction per row of df

    Side effect: null values in df[target_column] are replaced in place by
    their predictions. In this notebook the function is called on
    new_with_mean, which has no nulls, so nothing is replaced in place and
    the returned predictions cover every row.
    """

    # Rows where the target is observed are the ones used to fit the model
    observed = df[target_column].notnull()

    # Prepare X (predictors) and y (target) for the linear regression model
    X = df.loc[observed, predictor_columns]
    y = df.loc[observed, target_column]

    # Create and fit the linear regression model
    model = LinearRegression()
    model.fit(X, y)

    # Predict the target for every row, observed or not
    predicted_values = model.predict(df[predictor_columns])

    # Replace only the null values with their own predictions
    df.loc[~observed, target_column] = predicted_values[~observed.to_numpy()]

    return predicted_values


In [39]:
new_with_regression = pd.DataFrame()
for i in new_with_mean.columns:
    new_with_regression[i] = replace_nulls_with_regression(new_with_mean, i, new_with_mean.columns.drop(i))
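
Because `new_with_mean` has no nulls left, the call above returns the model's prediction for every row, so `new_with_regression` holds fitted values even where the original value was observed. A minimal sketch of a variant that keeps the observed values and only fills the missing cells is shown below. It is not what the notebook ran, the outputs in this document were not produced by it, and the helper name `impute_missing_with_regression` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_missing_with_regression(df_missing, df_filled, target_column):
    """Hypothetical helper: fill only the null cells of one column.

    df_missing holds the data with NaNs; df_filled is the same frame with
    NaNs already replaced (for example by column means) and supplies the
    predictor values.
    """
    predictors = df_filled.columns.drop(target_column)
    is_null = df_missing[target_column].isna()

    result = df_missing[target_column].copy()
    if not is_null.any() or is_null.all():
        # Nothing to predict, or no observed target to fit on.
        return result

    # Fit only on rows where the target was actually observed.
    model = LinearRegression()
    model.fit(
        df_filled.loc[~is_null, predictors],
        df_missing.loc[~is_null, target_column],
    )

    # Predict only the rows where the target is missing.
    result.loc[is_null] = model.predict(df_filled.loc[is_null, predictors])
    return result


# Toy data (illustrative): y is exactly 2 * x + 1 wherever it is observed.
x = np.arange(10, dtype=float)
toy_missing = pd.DataFrame({"x": x, "y": 2 * x + 1})
toy_missing.loc[[3, 7], "y"] = np.nan
toy_filled = toy_missing.fillna(toy_missing.mean(axis=0))

y_imputed = impute_missing_with_regression(toy_missing, toy_filled, "y")
print(y_imputed)
```

With this variant, a distance to the original column would only reflect the imputed cells, since all observed cells are left unchanged.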

In [40]:
new_with_regression

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,35.070674,2.232403,185482.353543,3.282537,10.507562,1.029243,3.644106,1.393051,0.148466,0.192735,-273.068930,58.161293,41.226082,0.673317,0.218308
1,38.293821,2.058007,191643.741982,2.409689,9.835836,0.775311,4.053323,1.543484,0.166324,0.568812,489.563211,74.370304,42.888997,0.859395,0.277598
2,43.186002,2.027032,191468.033594,3.158802,10.403210,0.873332,4.953981,1.444638,0.143186,0.375428,283.934260,73.625034,42.690908,0.710452,0.283059
3,36.855431,2.153142,181242.345921,3.368272,10.079082,1.469981,5.021807,1.233227,0.157795,0.393299,619.718225,69.692988,40.936251,2.467205,0.338720
4,33.420089,2.321037,187517.004168,3.464150,10.237798,1.090650,3.884084,2.184602,0.271176,0.405108,-101.050805,39.857259,36.296223,2.484094,0.070359
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,36.416587,2.601833,195060.360371,3.429449,8.579781,0.707124,4.603170,1.747709,0.240439,0.317089,262.157884,58.658671,38.348115,1.237276,0.120630
32557,43.503652,2.658738,190790.645731,3.513749,10.931734,0.924543,4.426656,1.194936,0.135109,0.101156,2802.548869,166.362286,44.614382,1.137988,0.238870
32558,54.931774,2.136128,177209.370250,3.467516,10.268266,2.068024,4.179697,1.928066,0.232620,0.695440,95.393305,64.101894,38.418533,0.670646,0.210707
32559,30.686247,1.922533,200124.082401,3.342374,10.341361,0.569033,4.613231,2.017356,0.234902,0.596347,-208.401386,32.813528,38.780258,0.772166,0.089333


## Comparison using Euclidean distance

In [49]:
from scipy.spatial.distance import euclidean


results = {}
for i in new_with_regression.columns:
    results[i] = euclidean(original.loc[:, i], new_with_regression.loc[:, i])
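
As a brief sketch of what each entry of `results` measures, the example below computes the same quantity by hand on two hypothetical vectors (`a` and `b` are illustrative, not taken from the dataset). The Euclidean distance is the square root of the summed squared differences, so it grows both with the number of rows and with the scale of the column. Dividing it by the square root of the number of rows gives the root-mean-square error.

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Hypothetical vectors standing in for one original and one imputed column.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 5.0, 4.0])

# euclidean(a, b) is the square root of the summed squared differences.
dist = euclidean(a, b)
manual = np.sqrt(np.sum((a - b) ** 2))

# Dividing by sqrt(n) gives the RMSE, which does not grow with the row count.
rmse = dist / np.sqrt(len(a))
print(dist, manual, rmse)
```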

Summary statistics of the original data

In [63]:
original.describe()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,2.309972,189778.4,3.424465,10.080679,1.083781,4.666411,1.542397,0.221707,0.330795,1077.648844,87.30383,40.437456,1.290317,0.24081
std,13.640433,1.225728,105550.0,3.453582,2.57272,1.251381,3.386119,1.437431,0.627348,0.470506,7385.292085,402.960219,12.347429,5.045373,0.427581
min,17.0,0.0,12285.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,28.0,2.0,117827.0,1.0,9.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,40.0,0.0,0.0
50%,37.0,2.0,178356.0,2.0,10.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,40.0,0.0,0.0
75%,48.0,2.0,237051.0,5.0,12.0,1.0,7.0,3.0,0.0,1.0,0.0,0.0,45.0,0.0,0.0
max,90.0,8.0,1484705.0,15.0,16.0,6.0,14.0,5.0,4.0,1.0,99999.0,4356.0,99.0,41.0,1.0


Distances for the continuous variables

In [60]:
for i in list(set(new_with_regression.columns) - set(cat_cols)):
    print(i, results[i])

education-num 415.27673788371135
age 2162.3100521844035
hours-per-week 2123.3613189631865
capital-loss 71846.35284621688
capital-gain 1296898.234313095
fnlwgt 18968471.523893345


Distances for the categorical variables

In [64]:
for i in cat_cols:
    print(i, results[i])

workclass 215.58765934012987
education 602.2893552618021
marital-status 204.25899412621942
occupation 576.7436963054544
relationship 242.782219267876
race 110.02436865318583
sex 77.78678454920804
native-country 886.590352811457
income 68.38539484503234


Sorted by largest distance

In [74]:
pd.Series(results).sort_values(ascending=False)

fnlwgt            1.896847e+07
capital-gain      1.296898e+06
capital-loss      7.184635e+04
age               2.162310e+03
hours-per-week    2.123361e+03
native-country    8.865904e+02
education         6.022894e+02
occupation        5.767437e+02
education-num     4.152767e+02
relationship      2.427822e+02
workclass         2.155877e+02
marital-status    2.042590e+02
race              1.100244e+02
sex               7.778678e+01
income            6.838539e+01
dtype: float64

Comparison table

In [88]:
table = pd.DataFrame()
stats = original.describe()  # computed once and reused for every column
for i, col in enumerate(new_with_regression.columns):
    if col in cat_cols:
        type_ = "categorical"
    else:
        type_ = "numerical"

    # One column of `table` per feature; it is transposed in the next cell
    table[i] = [
        col,
        results[col],
        type_,
        stats.loc["min", col],
        stats.loc["max", col],
        abs(stats.loc["min", col] - stats.loc["max", col]),
    ]

In [89]:
table = table.T.sort_values(by=1, ascending=False)
table.columns = ["column", "distance", "type", "original_min", "original_max", "original_range"]

In [90]:
table

Unnamed: 0,column,distance,type,original_min,original_max,original_range
2,fnlwgt,18968471.523893,numerical,12285.0,1484705.0,1472420.0
10,capital-gain,1296898.234313,numerical,0.0,99999.0,99999.0
11,capital-loss,71846.352846,numerical,0.0,4356.0,4356.0
0,age,2162.310052,numerical,17.0,90.0,73.0
12,hours-per-week,2123.361319,numerical,1.0,99.0,98.0
13,native-country,886.590353,categorical,0.0,41.0,41.0
3,education,602.289355,categorical,0.0,15.0,15.0
6,occupation,576.743696,categorical,0.0,14.0,14.0
4,education-num,415.276738,numerical,1.0,16.0,15.0
7,relationship,242.782219,categorical,0.0,5.0,5.0


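The largest distances in this simulation belong to the variables with the widest gap between their minimum and maximum values (`fnlwgt`, `capital-gain`, and `capital-loss`). Euclidean distance grows with the scale of a column, so normalization may be an option to reduce the impact of the magnitude of the numbers.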
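
One way the normalization mentioned above could look is sketched below: both columns are min-max scaled with the minimum and range of the original column before the distance is computed. The helper name `range_normalized_distance` and the toy columns are hypothetical, and this is only one of several possible normalizations.

```python
import numpy as np
import pandas as pd


def range_normalized_distance(original_col, imputed_col):
    """Hypothetical helper: Euclidean distance after min-max scaling.

    Both columns are scaled with the minimum and range of the original
    column, so the result is expressed in units of that range.
    """
    col_min = original_col.min()
    col_range = original_col.max() - col_min
    if col_range == 0:
        # A constant column has no scale to normalize by.
        return float("nan")
    scaled_original = (original_col - col_min) / col_range
    scaled_imputed = (imputed_col - col_min) / col_range
    return float(np.sqrt(((scaled_original - scaled_imputed) ** 2).sum()))


# Toy columns (illustrative): same relative error, very different scales.
small = pd.Series([0.0, 5.0, 10.0])
large = small * 1000
small_imputed = small + 1.0
large_imputed = large + 1000.0

d_small = range_normalized_distance(small, small_imputed)
d_large = range_normalized_distance(large, large_imputed)
print(d_small, d_large)
```

Because the same minimum and range are applied to both columns, this is equal to the raw distance divided by the original range, so the `distance` and `original_range` columns of the comparison table are enough to compute it.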