# Universidade do Estado do Rio de Janeiro
## Departamento de Engenharia de Sistemas e Computação - Tópicos Especiais A

## Lecture 02 - Data Preprocessing: Categorical Data Transformation, Normalization, and Feature Selection


In this lecture, we will cover some important aspects of data preprocessing for machine learning. Specifically, we will discuss how to transform categorical data, normalize numerical data, and perform feature selection. We will be using the Adult Income dataset from the UCI Machine Learning Repository as our example.

### Loading the Dataset
Let's start by loading the dataset into our Python environment. We will be using the pandas library to work with our data.

In [1]:
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
           'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
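# Note: fields in this file carry a leading space (e.g. ' State-gov'), so missing
# values appear as ' ?' and are not matched by na_values='?' as written here
df = pd.read_csv(url, header=None, names=columns, na_values='?')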

# Print the first few rows of the dataset
df.head()


|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Data Transformation: Categorical Variables

Many machine learning algorithms require numerical input data. However, some of the features in our dataset are categorical, such as workclass, education, marital-status, occupation, relationship, race, sex, and native-country. To use these features in our machine learning models, we need to transform them into numerical values.

#### Label Encoding

One common way to transform categorical variables is label encoding, which assigns an integer to each unique category of a categorical feature. For example, we can encode the sex feature as follows:

In [2]:
from sklearn.preprocessing import LabelEncoder

# Instantiate the LabelEncoder
le = LabelEncoder()

# Fit and transform the 'sex' feature
df['sex'] = le.fit_transform(df['sex'])

# Print the unique encoded values of the 'sex' feature
print(df['sex'].unique())


[1 0]
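
The printed codes do not show which category received which integer. One way to check is the fitted encoder's `classes_` attribute, where a category's position is its code. A minimal sketch on toy values (hypothetical data, not the Adult dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for a categorical column (hypothetical data)
values = ["Male", "Female", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the categories in sorted order; the index of a category is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)         # {'Female': 0, 'Male': 1}
print(codes.tolist())  # [1, 0, 0, 1]

# inverse_transform recovers the original labels from the integer codes
restored = le.inverse_transform(codes).tolist()
print(restored)        # ['Male', 'Female', 'Female', 'Male']
```

Because the codes follow sorted order, they imply a ranking that nominal categories do not really have, which is one reason to consider one-hot encoding for features with more than two categories.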


In [3]:
# Instantiate a new LabelEncoder for the target
le = LabelEncoder()

# Fit and transform the 'income' target
df['income'] = le.fit_transform(df['income'])

# Print the unique encoded values of the 'income' target
print(df['income'].unique())


[0 1]


#### One-Hot Encoding
Another approach to transforming categorical variables is to use one-hot encoding. One-hot encoding involves creating a new binary feature for each unique category in a categorical feature. For example, we can one-hot encode the race feature as follows:

In [4]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate the OneHotEncoder
ohe = OneHotEncoder()

# Fit and transform the 'race' feature
race_ohe = ohe.fit_transform(df[['race']])

# Create a new dataframe with the one-hot encoded 'race' feature
# (get_feature_names was removed in recent scikit-learn releases; use get_feature_names_out)
race_df = pd.DataFrame(race_ohe.toarray(), columns=ohe.get_feature_names_out(['race']))

# Add the new dataframe to the original dataframe
df = pd.concat([df, race_df], axis=1)

# Drop the original 'race' feature
df.drop('race', axis=1, inplace=True)

# Print the first few rows of the dataset
df.head()




|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | sex | capital-gain | capital-loss | hours-per-week | native-country | income | race_ Amer-Indian-Eskimo | race_ Asian-Pac-Islander | race_ Black | race_ Other | race_ White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | 1 | 2174 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | 1 | 0 | 0 | 13 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | 1 | 0 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | 1 | 0 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | 0 | 0 | 0 | 40 | Cuba | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
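
For quick exploration, pandas offers a more compact route to the same kind of encoding through `pd.get_dummies`. The sketch below uses a small hypothetical dataframe rather than the Adult data, and is meant as an illustration of the idea rather than a replacement for the encoder above:

```python
import pandas as pd

# Hypothetical toy dataframe with one categorical column
toy = pd.DataFrame({"age": [39, 53, 28], "race": ["White", "Black", "Other"]})

# One binary column per category; the original 'race' column is dropped automatically
encoded = pd.get_dummies(toy, columns=["race"], dtype=int)
print(encoded)

# Each row has exactly one active indicator among the new columns
race_columns = [c for c in encoded.columns if c.startswith("race_")]
row_sums = encoded[race_columns].sum(axis=1).tolist()
```

A fitted `OneHotEncoder` remembers the categories it saw, so it can apply the same encoding to new data later. `get_dummies` only looks at the dataframe it is given, which makes it better suited to one-off transformations.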


### Data Normalization
Normalization is the process of scaling numerical features to a common range. This is often necessary for machine learning algorithms that use distance-based measures, such as k-nearest neighbors and support vector machines.
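
To illustrate why scale matters for distance-based methods, the sketch below measures how much each feature contributes to the squared Euclidean distance between two rows, before and after scaling. The three rows are hypothetical `[age, fnlwgt]` pairs chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [age, fnlwgt]
rows = np.array([
    [25.0, 200000.0],
    [60.0, 200500.0],
    [26.0, 205000.0],
])

def squared_share(x, y):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (x - y) ** 2
    return sq / sq.sum()

# Unscaled: the 35-year age gap is dwarfed by a small fnlwgt difference
raw_share = squared_share(rows[0], rows[1])

# After min-max scaling, both features live on the same [0, 1] range
scaled_rows = MinMaxScaler().fit_transform(rows)
scaled_share = squared_share(scaled_rows[0], scaled_rows[1])

print("raw share    [age, fnlwgt]:", raw_share.round(4))
print("scaled share [age, fnlwgt]:", scaled_share.round(4))
```

On the raw values, nearly all of the distance between the first two rows comes from fnlwgt. After scaling, the age difference dominates instead.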



#### Min-Max Scaling
One common method of normalization is min-max scaling. Min-max scaling scales the values of a feature to a range between 0 and 1. For example, we can normalize the age feature as follows:

In [5]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the 'age' feature
df['age'] = scaler.fit_transform(df[['age']])
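
Min-max scaling maps each value x to (x - min) / (max - min), where min and max are taken over the feature. The sketch below checks this formula against `MinMaxScaler` on a toy column of hypothetical ages:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of ages with shape (n_samples, 1), as scikit-learn expects 2D input
ages = np.array([[17.0], [39.0], [90.0]])

# Manual min-max scaling: x' = (x - min) / (max - min)
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.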



### Feature Selection
Feature selection is the process of selecting a subset of relevant features from our dataset that are most useful for our machine learning model. This can help to reduce overfitting and improve the performance of our model. There are different methods for feature selection, such as filter methods, wrapper methods, and embedded methods.

One common filter method for feature selection is correlation analysis, which measures the strength of the linear relationship between two variables. We can use the corr() method from pandas to compute the correlation matrix for the numeric columns of our dataset:

In [6]:
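# Restrict the computation to numeric columns; recent pandas versions
# raise an error if string columns are included
df.corr(numeric_only=True)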

|  | age | fnlwgt | education-num | sex | capital-gain | capital-loss | hours-per-week | income | race_ Amer-Indian-Eskimo | race_ Asian-Pac-Islander | race_ Black | race_ Other | race_ White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | -0.076646 | 0.036527 | 0.088832 | 0.077674 | 0.057775 | 0.068756 | 0.234037 | -0.010137 | -0.011111 | -0.019434 | -0.034415 | 0.033412 |
| fnlwgt | -0.076646 | 1.0 | -0.043195 | 0.026858 | 0.000432 | -0.010252 | -0.018768 | -0.009463 | -0.064148 | -0.051323 | 0.118009 | 0.006376 | -0.056896 |
| education-num | 0.036527 | -0.043195 | 1.0 | 0.01228 | 0.12263 | 0.079923 | 0.148123 | 0.335154 | -0.029345 | 0.062091 | -0.075272 | -0.044133 | 0.051353 |
| sex | 0.088832 | 0.026858 | 0.01228 | 1.0 | 0.04848 | 0.045567 | 0.229309 | 0.21598 | -0.01082 | -0.000856 | -0.115604 | -0.013906 | 0.103486 |
| capital-gain | 0.077674 | 0.000432 | 0.12263 | 0.04848 | 1.0 | -0.031615 | 0.078409 | 0.223329 | -0.006015 | 0.009851 | -0.020631 | -0.001774 | 0.014429 |
| capital-loss | 0.057775 | -0.010252 | 0.079923 | 0.045567 | -0.031615 | 1.0 | 0.054256 | 0.150526 | -0.012947 | 0.004469 | -0.021762 | -0.005964 | 0.021044 |
| hours-per-week | 0.068756 | -0.018768 | 0.148123 | 0.229309 | 0.078409 | 0.054256 | 1.0 | 0.229689 | -0.003096 | -0.004564 | -0.053153 | -0.007188 | 0.049345 |
| income | 0.234037 | -0.009463 | 0.335154 | 0.21598 | 0.223329 | 0.150526 | 0.229689 | 1.0 | -0.028721 | 0.010543 | -0.089089 | -0.03183 | 0.085224 |
| race_ Amer-Indian-Eskimo | -0.010137 | -0.064148 | -0.029345 | -0.01082 | -0.006015 | -0.012947 | -0.003096 | -0.028721 | 1.0 | -0.017829 | -0.031991 | -0.008996 | -0.237763 |
| race_ Asian-Pac-Islander | -0.011111 | -0.051323 | 0.062091 | -0.000856 | 0.009851 | 0.004469 | -0.004564 | 0.010543 | -0.017829 | 1.0 | -0.059144 | -0.016632 | -0.439572 |
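
A correlation matrix becomes a filter method once the features are ranked by the absolute value of their correlation with the target. The sketch below does this on synthetic data. The column names `strong`, `weak`, `noise`, and `target` are hypothetical and only illustrate the ranking step:

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the target closely, one loosely, one not at all
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),
    "weak": signal + 3.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
    "target": signal,
})

# Take the target's column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so strong negative correlations also rank highly
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear relationships, so a feature with a low score here may still be useful to a non-linear model.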


The dataframe after min-max scaling of the age feature:

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | sex | capital-gain | capital-loss | hours-per-week | native-country | income | race_ Amer-Indian-Eskimo | race_ Asian-Pac-Islander | race_ Black | race_ Other | race_ White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.301370 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | 1 | 2174 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.452055 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | 1 | 0 | 0 | 13 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.287671 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | 1 | 0 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.493151 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | 1 | 0 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.150685 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | 0 | 0 | 0 | 40 | Cuba | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 0.136986 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | 0 | 0 | 0 | 38 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 32557 | 0.315068 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | 1 | 0 | 0 | 40 | United-States | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 32558 | 0.561644 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | 0 | 0 | 0 | 40 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 32559 | 0.068493 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | 1 | 0 | 0 | 20 | United-States | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [7]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Keep X as a DataFrame so that X.columns is available below
X = df.drop('income', axis=1)
y = df['income']

# Apply SelectKBest and chi2 to select top 10 features
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)

# Get the names of the selected features
feature_names = X.columns[selector.get_support(indices=True)].tolist()

print("Selected features:", feature_names)

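ValueError: could not convert string to float: ' State-gov'

The call fails because chi2 needs numeric input, while the remaining categorical columns (such as workclass) still contain strings. We label-encode those columns first and then run the selection again.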

In [9]:
categorical_variables = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'native-country']

# Label-encode each remaining categorical feature with its own encoder
for var in categorical_variables:
    le = LabelEncoder()
    # Fit and transform the current feature
    df[var] = le.fit_transform(df[var])

In [11]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X = df.drop('income', axis=1).values
y = df['income'].values

X_columns = df.drop('income',axis=1).columns

# Apply SelectKBest and chi2 to select top 10 features
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)


# Get the names of the selected features
feature_names = X_columns[selector.get_support(indices=True)].tolist()

print("Selected features:", feature_names)

Selected features: ['fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week']
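
`SelectKBest` keeps the k features with the highest scores, and the scores themselves are available in the fitted selector's `scores_` attribute. The sketch below uses a tiny hypothetical dataset (the column names are made up for illustration) where the chi-squared statistics can be checked by hand:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features; chi2 does not accept negative values
X_toy = pd.DataFrame({
    "informative": [0, 0, 0, 5, 5, 5],   # all mass falls in class 1
    "constant":    [2, 2, 2, 2, 2, 2],   # identical in both classes
    "alternating": [1, 0, 1, 0, 1, 0],   # nearly balanced across classes
})
y_toy = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# One chi-squared score per feature; higher means stronger dependence on the target
scores = pd.Series(selector.scores_, index=X_toy.columns)
selected = X_toy.columns[selector.get_support(indices=True)].tolist()

print(scores)
print("Selected features:", selected)
```

Inspecting the scores alongside the selected names shows how wide the margin is between the features that were kept and those that were dropped.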


In [14]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Create a linear regression model to use as the RFE estimator
model = LinearRegression()

# Perform feature selection with RFE
rfe = RFE(model, n_features_to_select=10)
fit = rfe.fit(X, y)

# Get the names of the selected features
feature_names = X_columns[fit.support_].tolist()

print("Selected features:", feature_names)


Selected features: ['age', 'education-num', 'marital-status', 'relationship', 'sex', 'race_ Amer-Indian-Eskimo', 'race_ Asian-Pac-Islander', 'race_ Black', 'race_ Other', 'race_ White']
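
Recursive feature elimination (RFE) is a wrapper method: it fits the estimator, drops the weakest feature, and repeats until the requested number remains. Because income is a binary target, a classifier such as `LogisticRegression` may be a more natural estimator than a linear regression. The sketch below is an illustration on synthetic data from `make_classification`, not a rerun of the cell above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem: 2 informative features, 4 noise features
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Eliminate one feature per round until two remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_syn, y_syn)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and increases for features that were eliminated earlier
print("Kept mask:", rfe.support_.tolist())
print("Ranking:  ", rfe.ranking_.tolist())
```

The `ranking_` attribute gives more detail than the final mask, since it records the order in which features were discarded.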
