# Data cleaning

### At this point, I think the most important columns are 'age', 'education.num', 'marital.status', 'race', 'sex', 'capital.gain', 'hours.per.week' and maybe 'occupation'.
We will use feature selection methods in the next notebook to choose the best ones.

In [1]:
#Importing pandas
import pandas as pd

In [2]:
#Reading our dataset
df = pd.read_csv('data.csv')

In [3]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In [4]:
#First, drop 'education', 'relationship' and 'native.country' because they don't give me any new information; 'fnlwgt' is dropped as well
df = df.drop(['education', 'relationship', 'native.country', 'fnlwgt'], axis=1)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education.num   32561 non-null  int64 
 3   marital.status  32561 non-null  object
 4   occupation      32561 non-null  object
 5   race            32561 non-null  object
 6   sex             32561 non-null  object
 7   capital.gain    32561 non-null  int64 
 8   capital.loss    32561 non-null  int64 
 9   hours.per.week  32561 non-null  int64 
 10  income          32561 non-null  object
dtypes: int64(5), object(6)
memory usage: 2.7+ MB


As we can see, there are no NaN values recorded directly in the dataset, so any missing values must be encoded in another form.

I will use the value_counts function to analyze each column individually.

I didn't wrap this in a function because I preferred to inspect each column directly with value_counts; that helps me choose the imputation method to apply to each column.

In [6]:
#Analyzing 'workclass' column 
df['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [7]:
#Analyzing 'occupation' column 
df['occupation'].value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

Only 2 columns contain missing values, marked as '?'.
We'll remove those rows: in most of the affected rows (95%) the value is missing in both columns (workclass and occupation), so removing them should help our model predict better.
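
The overlap between the two columns can be checked directly rather than estimated. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the real dataset); it builds a boolean mask per column and computes the share of affected rows that are missing both values.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "workclass":  ["Private", "?", "?", "State-gov", "Never-worked"],
    "occupation": ["Sales", "?", "?", "Tech-support", "?"],
})

# Boolean masks marking the '?' placeholder in each column.
missing_workclass = toy["workclass"] == "?"
missing_occupation = toy["occupation"] == "?"

# Cross-tabulate the masks to see how often the placeholders coincide.
overlap = pd.crosstab(missing_workclass, missing_occupation)

# Share of rows with any '?' that are missing both values.
any_missing = missing_workclass | missing_occupation
both_missing = missing_workclass & missing_occupation
share_both = both_missing.sum() / any_missing.sum()
print(overlap)
print(share_both)
```

Running the same masks on `df` would give the exact overlap for the real data.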


In [8]:
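#Removing rows with missing ('?') values in the workclass and occupation columns
#.copy() avoids a SettingWithCopyWarning when the columns are modified later
df = df[(df['workclass'] != '?') & (df['occupation'] != '?')].copy()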
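
An alternative worth knowing is to let pandas treat '?' as missing at read time, so that `isna()` and `dropna()` work as usual. This is only a sketch on an inline, made-up CSV (`csv_text` is hypothetical); it assumes the placeholder is exactly `?` with no surrounding spaces, as it appears in the value counts above.

```python
import io
import pandas as pd

# Small inline CSV standing in for data.csv (contents are made up).
csv_text = """age,workclass,occupation
39,Private,Sales
50,?,?
28,Never-worked,?
"""

# na_values tells pandas to parse '?' as NaN while reading.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# isna() now counts the placeholders, and dropna() removes those rows.
missing_per_column = toy.isna().sum()
cleaned = toy.dropna(subset=["workclass", "occupation"])
print(missing_per_column)
print(cleaned)
```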

Also, since each class has a different value count, it's safe to encode them with the frequency model; for that I will first encode them as numbers.

In [9]:
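#Replacing the categorical values with numerical codes so that we can apply Frequency Imputation later
#The codes follow the order of first appearance; their count is derived from the data instead of hard-coded
workclass_categories = df['workclass'].unique().tolist()
df['workclass'] = df['workclass'].replace(workclass_categories, list(range(len(workclass_categories))))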

In [10]:
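#Replacing the 'occupation' categories with numerical codes (in order of first appearance)
occupation_categories = df['occupation'].unique().tolist()
df['occupation'] = df['occupation'].replace(occupation_categories, list(range(len(occupation_categories))))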
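
The same order-of-appearance encoding can also be obtained with `pd.factorize`, which additionally returns the labels so the codes can be mapped back later. A minimal sketch on a made-up series (the values are hypothetical):

```python
import pandas as pd

# Made-up category values for illustration.
occupation = pd.Series(["Sales", "Tech-support", "Sales", "Craft-repair"])

# factorize returns integer codes (in order of first appearance)
# together with the unique labels those codes refer to.
codes, labels = pd.factorize(occupation)

# Keep a mapping from code back to label for later interpretation.
code_to_label = dict(enumerate(labels))
print(list(codes))
print(code_to_label)
```

Keeping `code_to_label` around makes it possible to interpret the encoded column after the transformers have been applied.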

For 'occupation', I chose to encode the categories as numbers and then apply target imputation, since the column has many different categories.

For 'marital.status', I will encode each category as a number to prepare the column for the Frequency Imputation Transformer.

In [11]:
#Analyzing 'marital.status' column 
df['marital.status'].value_counts()

Married-civ-spouse       14339
Never-married             9912
Divorced                  4258
Separated                  959
Widowed                    840
Married-spouse-absent      389
Married-AF-spouse           21
Name: marital.status, dtype: int64

In [12]:
#Replacing the 'marital.status' categories with numerical codes (in order of first appearance)
marital_categories = df['marital.status'].unique().tolist()
df['marital.status'] = df['marital.status'].replace(marital_categories, list(range(len(marital_categories))))

For the 'race' column I will create dummy variables. The majority of people are White, so in the next step, Feature Selection, I'd like to see whether the model selects only the White column (which would suggest that being White, rather than another race, has an influence on having a high income).

In [13]:
#Analyzing 'race' column 
df['race'].value_counts()

White                 26301
Black                  2909
Asian-Pac-Islander      974
Amer-Indian-Eskimo      286
Other                   248
Name: race, dtype: int64

In [14]:
#Creating a dummy variable for the 'race' column
df = pd.get_dummies(df, columns=['race'], drop_first = True)
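
With `drop_first=True`, pandas drops the dummy for the first category in sorted order, and that category becomes the implicit baseline (all remaining dummies are 0). In this notebook that baseline is 'Amer-Indian-Eskimo', which is why no such column appears in the final table. A small sketch on made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White"]})

# Sorted categories are Black, Other, White; drop_first removes race_Black.
dummies = pd.get_dummies(toy, columns=["race"], drop_first=True).astype(int)

# Rows where every remaining dummy is 0 belong to the dropped baseline category.
baseline_rows = dummies[(dummies == 0).all(axis=1)]
print(dummies)
print(baseline_rows)
```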

In [15]:
#Analyzing 'sex' column 
df['sex'].value_counts()

Male      20788
Female     9930
Name: sex, dtype: int64

In [16]:
#Creating a dummy variable for the 'sex' column
df = pd.get_dummies(df, columns=['sex'], drop_first = True)

In [17]:
#Analyzing TARGET column 
df['income'].value_counts()

<=50K    23068
>50K      7650
Name: income, dtype: int64

We'll interpret this as high income (>50K) versus low income (<=50K).

In [18]:
#Creating a dummy variable for the 'income' column
df = pd.get_dummies(df, columns=['income'], drop_first = True)
#Renaming the target column
df.rename(columns = {'income_>50K':'income'}, inplace = True)

In [19]:
#Changing data types to float for the Transformers
df = df.astype(float)

In [20]:
df

Unnamed: 0,age,workclass,education.num,marital.status,occupation,capital.gain,capital.loss,hours.per.week,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,income
1,82.0,0.0,9.0,0.0,0.0,0.0,4356.0,18.0,0.0,0.0,0.0,1.0,0.0,0.0
3,54.0,0.0,4.0,1.0,1.0,0.0,3900.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0
4,41.0,0.0,10.0,2.0,2.0,0.0,3900.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0
5,34.0,0.0,9.0,1.0,3.0,0.0,3770.0,45.0,0.0,0.0,0.0,1.0,0.0,0.0
6,38.0,0.0,6.0,2.0,4.0,0.0,3770.0,40.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22.0,0.0,10.0,3.0,11.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,1.0,0.0
32557,27.0,0.0,12.0,4.0,10.0,0.0,0.0,38.0,0.0,0.0,0.0,1.0,0.0,0.0
32558,40.0,0.0,9.0,4.0,1.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,1.0,1.0
32559,58.0,0.0,9.0,0.0,4.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0


In [21]:
#Importing the transformer from Imperio library and instantiating the model
from imperio import FrequencyImputationTransformer
freq = FrequencyImputationTransformer()

In [22]:
#Applying frequency imputation to our dataset (columns selected)
new_df = freq.apply(df, target = 'income', columns = ['workclass', 'marital.status'] )
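
In the final table, 'workclass' becomes 0.73885 for 'Private' rows, which matches the share of 'Private' among the remaining rows (22696 of 30718). This suggests that the transformer replaces each category with its relative frequency. The sketch below reproduces that idea in plain pandas on made-up data; it is an illustration under that assumption, not the Imperio implementation, and `workclass_freq` is a hypothetical column name.

```python
import pandas as pd

# Made-up encoded column: category 0.0 appears three times, 1.0 once.
toy = pd.DataFrame({"workclass": [0.0, 0.0, 0.0, 1.0]})

# Relative frequency of each category (counts divided by the number of rows).
frequencies = toy["workclass"].value_counts(normalize=True)

# Replace every category with its relative frequency.
toy["workclass_freq"] = toy["workclass"].map(frequencies)
print(toy)
```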

In [23]:
#Importing Target Imputation 
from imperio import TargetImputationTransformer
target = TargetImputationTransformer()

In [24]:
#Applying target imputation to 'occupation' column
new_df = target.apply(new_df, target = 'income', columns = ['occupation'] )
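
Target imputation is sketched below in plain pandas, assuming the transformer replaces each category with the mean of the target within that category (the usual meaning of target encoding). This is not the Imperio implementation; the data and the `occupation_target` column name are hypothetical.

```python
import pandas as pd

# Made-up encoded categories and a binary target.
toy = pd.DataFrame({
    "occupation": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "income":     [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
})

# Mean of the target within each category.
category_means = toy.groupby("occupation")["income"].mean()

# Replace every category with the mean target value of its group.
toy["occupation_target"] = toy["occupation"].map(category_means)
print(toy)
```

Because the encoded value is computed from the target itself, it is usually safer to fit such an encoding on the training split only and then apply it to the test split.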

In [25]:
new_df

Unnamed: 0,age,workclass,education.num,marital.status,occupation,capital.gain,capital.loss,hours.per.week,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,income
1,82.0,0.73885,9.0,0.027346,0.484014,0.0,4356.0,18.0,0.0,0.0,0.0,1.0,0.0,0.0
3,54.0,0.73885,4.0,0.138616,0.124875,0.0,3900.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0
4,41.0,0.73885,10.0,0.031219,0.449034,0.0,3900.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0
5,34.0,0.73885,9.0,0.138616,0.041578,0.0,3770.0,45.0,0.0,0.0,0.0,1.0,0.0,0.0
6,38.0,0.73885,6.0,0.031219,0.134483,0.0,3770.0,40.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22.0,0.73885,10.0,0.322677,0.325116,0.0,0.0,40.0,0.0,0.0,0.0,1.0,1.0,0.0
32557,27.0,0.73885,12.0,0.466795,0.304957,0.0,0.0,38.0,0.0,0.0,0.0,1.0,0.0,0.0
32558,40.0,0.73885,9.0,0.466795,0.124875,0.0,0.0,40.0,0.0,0.0,0.0,1.0,1.0,1.0
32559,58.0,0.73885,9.0,0.027346,0.134483,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0


Our dataset is ready to be analyzed

In [26]:
#Saving the dataset as a new csv file
new_df.to_csv("data2.csv")
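
Note that `to_csv` writes the index as an unnamed first column by default, so reading the file back produces an extra 'Unnamed: 0' column unless `index=False` (or `index_col=0` on reading) is used. A minimal sketch with in-memory buffers and made-up rows (`toy` is hypothetical), so no file is written:

```python
import io
import pandas as pd

# Made-up rows with a non-contiguous index, as left behind by row filtering.
toy = pd.DataFrame({"age": [82.0, 54.0], "income": [0.0, 0.0]}, index=[1, 3])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```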