# Feature Engineering
Build a machine learning pipeline that includes data preprocessing and feature engineering steps. In particular, do the following:
- Load the `adult` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Preprocess the dataset by 
    - removing missing values using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html);
    - encoding categorical attributes using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html);
    - normalizing/scaling features using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html);
    - handling imbalanced classes using [Imbalanced-Learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html);
    - and reducing the dimensionality of the dataset using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Evaluate the impact of the data preprocessing and feature engineering methods on the effectiveness and efficiency of the model.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
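
One way to approach the evaluation item above is to time the fit and score the model for each preprocessing variant. The following is only a sketch under assumptions: it uses synthetic data from `make_classification` instead of the `adult` dataset, and the helper `fit_and_score` is a hypothetical name introduced for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_and_score(model, x_tr, y_tr, x_te, y_te):
    """Return (fit time in seconds, test accuracy) for one model variant."""
    start = time.perf_counter()
    model.fit(x_tr, y_tr)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(x_te, y_te)


# Synthetic data standing in for the adult dataset (assumption)
x, y = make_classification(n_samples=500, n_features=20, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

# Two variants: the same classifier with and without standardization
variants = {
    "raw": SVC(),
    "scaled": make_pipeline(StandardScaler(), SVC()),
}

results = {
    name: fit_and_score(model, x_tr, y_tr, x_te, y_te)
    for name, model in variants.items()
}
for name, (seconds, accuracy) in results.items():
    print(f"{name}: fit {seconds:.4f}s, accuracy {accuracy:.3f}")
```

The same pattern extends to other variants, for example with and without PCA or oversampling, as long as every variant is scored on the same untouched test set.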

## Importing Modules

In [1]:
import pandas as pd
import sklearn.model_selection
import sklearn.svm
import sklearn.metrics
import sklearn.preprocessing
import sklearn.decomposition
import imblearn.over_sampling

## Loading the Dataset

In [2]:
df = pd.read_csv("../../datasets/adult.csv")
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Splitting the Data into Training and Test Sets

In [3]:
df_train, df_test = sklearn.model_selection.train_test_split(df)
print("df_train:", df_train.shape)
print("df_test:", df_test.shape)

df_train: (24420, 15)
df_test: (8141, 15)
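
The call above relies on the defaults, so the rows are shuffled differently on every run and the class ratio may differ slightly between the two sets. A minimal sketch of a reproducible, stratified split follows. It assumes a made-up DataFrame `toy` in place of the real data, and the values of `test_size` and `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the adult dataset (made-up values): 60 "<=50K" and 20 ">50K" rows
toy = pd.DataFrame({
    "age": list(range(20, 60)) * 2,
    "target": ["<=50K"] * 60 + [">50K"] * 20,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,          # same share as the default
    random_state=42,         # fixes the shuffle so the split is reproducible
    stratify=toy["target"],  # keeps the class ratio equal in both sets
)

print("toy_train:", toy_train.shape)
print("toy_test:", toy_test.shape)
print(toy_test["target"].value_counts())
```

With `stratify`, both sets keep the 3:1 class ratio of the toy data.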


## Data Exploration on Training Data

In [4]:
df_train["workclass"].unique()

array([' Private', ' Self-emp-not-inc', ' Federal-gov', ' Self-emp-inc',
       ' Local-gov', ' ?', ' State-gov', ' Never-worked', ' Without-pay'],
      dtype=object)

In [11]:
df_train["target"].value_counts()

 <=50K    16995
 >50K      5635
Name: target, dtype: int64
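
These counts give a useful reference point: a trivial model that always predicts the majority class would already be right for the majority share of the rows. The sketch below works this out from the counts printed above. The share in the test set is not shown in this notebook and may differ slightly.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({"<=50K": 16995, ">50K": 5635})

# Share of each class in the training data
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"majority-class baseline: {majority_baseline:.3f}")
```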

## Removing Missing Values

In [6]:
# " ?" marks a missing value in this dataset; replace it with a proper missing marker
df_train = df_train.replace(" ?", pd.NA)
df_train_cleaned = df_train.dropna()
print("df_train_cleaned:", df_train_cleaned.shape)

df_test = df_test.replace(" ?", pd.NA)
df_test_cleaned = df_test.dropna()
print("df_test_cleaned:", df_test_cleaned.shape)

df_train_cleaned: (22630, 15)
df_test_cleaned: (7532, 15)
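
An alternative is to declare the missing-value marker when the file is read, so that no separate replacement step is needed. The sketch below assumes a made-up three-row excerpt in the same comma-plus-space layout. Note that `skipinitialspace=True` also strips the leading blank from every value, so categories become `"Private"` rather than `" Private"`.

```python
import io

import pandas as pd

# Made-up excerpt in the same layout as adult.csv (assumption)
raw = io.StringIO(
    "age, workclass, target\n"
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
    "28, Private, <=50K\n"
)

sample = pd.read_csv(
    raw,
    skipinitialspace=True,  # drops the blank that follows each comma
    na_values="?",          # treats "?" as a missing value
)

print(sample["workclass"].isna().sum())  # number of missing workclass entries
print(sample.dropna().shape)             # shape after removing incomplete rows
```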


## Splitting Features and the Target Label

In [7]:
x_train = df_train_cleaned.drop(["target"], axis=1)
y_train = df_train_cleaned["target"]
print("x_train:", x_train.shape)
print("y_train:", y_train.shape)

x_test = df_test_cleaned.drop(["target"], axis=1)
y_test = df_test_cleaned["target"]
print("x_test:", x_test.shape)
print("y_test:", y_test.shape)

x_train: (22630, 14)
y_train: (22630,)
x_test: (7532, 14)
y_test: (7532,)


## One-Hot Encoding

In [8]:
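# Building the one-hot encoder model
# Note: the encoder is fitted on every column, so numeric attributes such as
# "age" and "fnlwgt" are also treated as categories (hence the large feature count)
enc = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)

# Encoding the categorical attributes of training data
x_train_encoded = enc.transform(x_train).toarray()

# Encoding the categorical attributes of test data
x_test_encoded = enc.transform(x_test).toarray()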

print("x_train_encoded:", x_train_encoded.shape)
print("x_test_encoded:", x_test_encoded.shape)

x_train_encoded: (22630, 16851)
x_test_encoded: (7532, 16851)
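
The 16851 columns arise because the encoder receives all 14 attributes, including numeric ones with many distinct values. One possible alternative, sketched here on made-up data with two of the real column names, is to route numeric and categorical columns to different transformers with `ColumnTransformer`. The column lists are assumptions and would need to match the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames with two of the real column names (values are made up)
toy_train = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
})
toy_test = pd.DataFrame({"age": [28], "workclass": ["Local-gov"]})  # unseen category

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),  # numeric columns are scaled
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),  # categorical columns are one-hot encoded
    ],
    sparse_threshold=0.0,  # always return a dense array
)

toy_train_encoded = preprocess.fit_transform(toy_train)
toy_test_encoded = preprocess.transform(toy_test)

print(toy_train_encoded.shape)  # one scaled column plus one column per category
print(toy_test_encoded[0, 1:])  # all zeros: the unseen category is ignored
```

Because `handle_unknown="ignore"` is set, a category that appears only in the test data is encoded as a row of zeros instead of raising an error.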


## Standardization

In [9]:
# Building a standardization model
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(x_train_encoded)

# Scaling the training features
x_train_standardized = scaler.transform(x_train_encoded)

# Scaling the test features
x_test_standardized = scaler.transform(x_test_encoded)

print("x_train_standardized:", x_train_standardized.shape)
print("x_test_standardized:", x_test_standardized.shape)

x_train_standardized: (22630, 16851)
x_test_standardized: (7532, 16851)
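
The scaler is fitted on the training data only and then reused for the test data, so both sets are transformed with the same statistics. The sketch below, on a made-up matrix, shows the learned attributes `mean_` and `scale_` and checks that `transform` computes `(x - mean_) / scale_` for each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): one small-range and one large-range column
toy_train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
toy_test = np.array([[2.0, 7000.0]])

toy_scaler = StandardScaler().fit(toy_train)

# Statistics learned from the training data
print(toy_scaler.mean_)   # per-column mean
print(toy_scaler.scale_)  # per-column standard deviation

# transform applies z = (x - mean_) / scale_ column by column
manual = (toy_test - toy_scaler.mean_) / toy_scaler.scale_
matches = np.allclose(manual, toy_scaler.transform(toy_test))
print(matches)
```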


## Dimensionality Reduction

In [10]:
# Building a PCA model
pca = sklearn.decomposition.PCA(n_components=100)
pca.fit(x_train_standardized)

# Reducing the number of training features
x_train_reduced = pca.transform(x_train_standardized)

# Reducing the number of test features
x_test_reduced = pca.transform(x_test_standardized)

print("x_train_reduced:", x_train_reduced.shape)
print("x_test_reduced:", x_test_reduced.shape)

x_train_reduced: (22630, 100)
x_test_reduced: (7532, 100)
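
The value `n_components=100` is a fixed choice. To judge how much information a given number of components keeps, the fitted attribute `explained_variance_ratio_` can be inspected, and a float between 0 and 1 can be passed as `n_components` to keep just enough components for that share of the variance. A minimal sketch on synthetic data (an assumption, not the adult dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 200 samples, 10 features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
toy = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# A float in (0, 1) asks PCA to keep enough components to explain that variance share
toy_pca = PCA(n_components=0.95).fit(toy)

cumulative = np.cumsum(toy_pca.explained_variance_ratio_)
print("components kept:", toy_pca.n_components_)
print("variance explained:", cumulative[-1])
```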


## Oversampling

In [12]:
sm = imblearn.over_sampling.SMOTE() 
x_train_balanced, y_train_balanced = sm.fit_resample(x_train_reduced, y_train)
y_train_balanced.value_counts()

 <=50K    16995
 >50K     16995
Name: target, dtype: int64
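
Only the training data is resampled. The test set keeps its original class distribution so that the evaluation reflects real data. To illustrate the idea behind SMOTE, the sketch below creates synthetic points by interpolating between a minority sample and one of its nearest minority neighbors. It is a simplified NumPy illustration, not the Imbalanced-Learn implementation, and `smote_like_samples` is a hypothetical helper name.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of every point
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)         # random starting points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor for each
    gap = rng.random((n_new, 1))                              # position along the segment

    # New point lies on the line segment between a sample and its neighbor
    return minority[base] + gap * (minority[picked] - minority[base])


# Made-up minority-class points in two dimensions
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(minority, n_new=10)
print(synthetic.shape)
```

Every synthetic point lies between two existing minority samples, which is why the new points stay inside the region already covered by the minority class.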

## Training a Model

In [13]:
model = sklearn.svm.SVC()
model.fit(x_train_balanced, y_train_balanced);
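
The model above uses the default hyperparameters. Among the hyperparameters documented for `SVC`, `C`, `kernel`, and `gamma` usually have the largest effect. A minimal sketch of a cross-validated search over a small grid follows. The grid values are arbitrary, and synthetic data stands in for `x_train_balanced` and `y_train_balanced`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the balanced training data (assumption)
x_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Small, arbitrary grid over two SVC hyperparameters
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(x_toy, y_toy)

print(search.best_params_)            # best combination found on the grid
print(round(search.best_score_, 3))   # mean cross-validated accuracy of that combination

# Fitted attributes such as n_support_ describe the learned model
print(search.best_estimator_.n_support_)  # number of support vectors per class
```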

## Testing the Trained Model

In [15]:
y_predicted = model.predict(x_test_reduced)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
accuracy

0.7950079660116834
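
Accuracy alone can be misleading when the classes are imbalanced, because errors on the smaller class barely affect it. The sketch below uses hand-made labels (not results from this notebook) to show how the confusion matrix and balanced accuracy expose such errors.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hand-made labels for illustration: one of the two ">50K" rows is misclassified
y_true = ["<=50K"] * 6 + [">50K"] * 2
y_pred = ["<=50K"] * 6 + ["<=50K", ">50K"]

labels = ["<=50K", ">50K"]

acc = accuracy_score(y_true, y_pred)           # 7 of 8 correct -> 0.875
bal = balanced_accuracy_score(y_true, y_pred)  # mean of recalls 1.0 and 0.5 -> 0.75
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(bal)
print(cm)  # rows are true classes, columns are predicted classes
```

In this example the accuracy is 0.875 while the balanced accuracy is only 0.75, because half of the minority class is missed.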