## **Feature Engineering**

Feature engineering is the process of creating new features, or transforming existing ones, from raw data in order to improve the performance of a machine learning model. The quality and quantity of features can significantly affect a model's ability to make accurate predictions. Common feature engineering techniques include:

Feature scaling: This involves transforming the values of a feature to lie within a specific range, such as between 0 and 1 (min-max scaling). This can be useful when working with features that have different units or scales.

Feature normalization: This involves transforming the values of a feature to have a mean of 0 and a standard deviation of 1, which is also commonly called standardization. Like scaling, it puts features with different units on a comparable footing, and it is a common choice for models that are sensitive to feature magnitudes, such as distance-based or gradient-based methods.

Feature transformation: This involves applying mathematical operations to the values of a feature, such as taking the square, cube or logarithm of the values. This can be useful when working with features that have non-linear relationships with the target variable.

Feature extraction: This involves creating new features from existing features by applying mathematical or statistical operations. For example, you might extract the mean, median, or standard deviation of a set of features.

Binning: This involves grouping a set of continuous or numerical data into a smaller number of discrete "bins" or intervals.

Encoding categorical data: This involves converting categorical data, which can take on a limited number of values, into a format that can be understood by a machine learning model.

Handling missing values: This involves identifying and either dropping or imputing missing values in the data.

Combining features: This involves creating new features by combining multiple existing features. For example, you might create a new feature by taking the product or ratio of two existing features.

Dimensionality reduction: This involves reducing the number of features in the data set by either removing or combining features.
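
Several of the techniques listed above can be sketched in a few lines of pandas. The snippet below is a minimal illustration, not a recommended pipeline: the column names (height_cm, weight_kg) and their values are made up for the example, and median imputation is just one of several possible imputation choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, np.nan, 70.0, 80.0, 90.0],
})

# Handling missing values: impute with the column median
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())

# Feature scaling (min-max): map the values onto the range [0, 1]
h = df["height_cm"]
df["height_minmax"] = (h - h.min()) / (h.max() - h.min())

# Feature normalization (standardization): mean 0, standard deviation 1
# ddof=0 selects the population standard deviation
df["height_std"] = (h - h.mean()) / h.std(ddof=0)

# Combining features: a ratio of two existing features (body-mass index)
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

print(df.round(3))
```

In a real project the scaling statistics (minimum, maximum, mean, standard deviation) would normally be computed on the training split only and then reused on the test split.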

# **Encoding categorical data**

Encoding categorical data is the process of converting categorical data, which can take on a limited number of values, into a format that can be understood by a machine learning model. There are several ways to encode categorical data, including:

One-hot encoding: This method creates a new binary column for each unique category in the data. Each row is then given a 1 in the column corresponding to the category it belongs to, and 0s in all other columns. One-hot encoding can be useful when the categories are mutually exclusive and there is no inherent ordering of the categories.

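Label encoding: This method assigns an integer value to each category. It is most appropriate when the categories have an inherent order that should be preserved. Note that sklearn.preprocessing.LabelEncoder assigns integers in sorted (alphabetical) order, so the resulting integers reflect a meaningful order only if that order happens to match the sort order.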

Count encoding: This method replaces a categorical value with the count of how many times that value appears in the dataset.

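Binary Encoding: This method converts each category to an integer and then splits the binary digits of that integer into separate columns. It is useful when the categorical feature has high cardinality, because it produces far fewer columns than one-hot encoding, and it is often paired with tree-based models.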

Helmert Encoding: This is a contrast-coding scheme for categorical variables. Each level is compared with the mean of the subsequent levels (or, in the reverse Helmert variant, the previous levels) in terms of the mean of the target variable, rather than with the overall mean.

Libraries in Python, R, and other languages provide functions for encoding categorical data. In Python, examples include pandas.get_dummies and sklearn.preprocessing.LabelEncoder.

It's important to note that the choice of encoding method depends on the specific use case and the model that you are using. Some models may perform better with one type of encoding than others.

## **1. Label Encoding**

In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Create an instance of the LabelEncoder class
le = LabelEncoder()

# Example dataframe with a 'color' column containing categorical data
data = {'color': ['red', 'green', 'blue', 'red', 'green', 'blue']}
df = pd.DataFrame(data)

# Fit the encoder to the categorical data
le.fit(df['color'])

# View the mapping from categories to integers
print(le.classes_)  # prints the sorted categories: ['blue' 'green' 'red']

# Transform the categorical data into integers
df['color'] = le.transform(df['color'])

# View the encoded data
print(df)


['blue' 'green' 'red']
   color
0      2
1      1
2      0
3      2
4      1
5      0
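
LabelEncoder assigns integers in sorted order, as the output above shows (blue=0, green=1, red=2). When the categories have a real order, one option is to spell that order out explicitly. The sketch below assumes a hypothetical 'size' column and uses an ordered pandas Categorical; the column name and categories are invented for illustration.

```python
import pandas as pd

# Hypothetical example: a 'size' column whose categories have a natural order
df_size = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# State the order explicitly instead of relying on alphabetical sorting
size_order = ["small", "medium", "large"]
df_size["size_encoded"] = pd.Categorical(
    df_size["size"], categories=size_order, ordered=True
).codes

# small -> 0, medium -> 1, large -> 2
print(df_size)
```

With alphabetical sorting the same column would be encoded as large=0, medium=1, small=2, which does not reflect the intended order.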


## **2. One-Hot Encoding**

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Example dataframe with a 'color' column containing categorical data
data = {'color': ['red', 'green', 'blue', 'red', 'green', 'blue']}
df = pd.DataFrame(data)

# Create an instance of the OneHotEncoder class
enc = OneHotEncoder()

# Fit the encoder to the categorical data
enc.fit(df[['color']])

# Transform the categorical data into one-hot encoded data
one_hot_data = enc.transform(df[['color']]).toarray()

# Create a new dataframe with the one-hot encoded data
df_encoded = pd.DataFrame(one_hot_data, columns=enc.get_feature_names_out(['color']))

# View the encoded data
print(df_encoded)


   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0
3         0.0          0.0        1.0
4         0.0          1.0        0.0
5         1.0          0.0        0.0


In [None]:
type(df['color'])

pandas.core.series.Series

In [None]:
type(df[['color']])

pandas.core.frame.DataFrame

In [None]:
one_hot_data

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

One-hot encoding can also be performed with pandas get_dummies:

In [None]:
data = {'color': ['red', 'green', 'blue', 'red', 'green', 'blue']}
df = pd.DataFrame(data)
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
df_encoded1 = pd.get_dummies(df['color'], dtype=int)
df_encoded1


   blue  green  red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      0    1
4     0      1    0
5     1      0    0
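
Count encoding, described earlier in the list of encoding methods, has no dedicated cell in this notebook. A minimal sketch using only pandas could look like the following; the data is a made-up variation of the color example, and the names df_count and color_count are chosen for illustration.

```python
import pandas as pd

# Hypothetical data: 'red' appears three times, 'green' twice, 'blue' once
df_count = pd.DataFrame({"color": ["red", "green", "blue", "red", "green", "red"]})

# Count how often each category occurs
counts = df_count["color"].value_counts()

# Replace each category with its frequency in the dataset
df_count["color_count"] = df_count["color"].map(counts)

print(df_count)
```

One caveat of this approach is that two different categories with the same frequency receive the same encoded value.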


## **Binning**

Binning is a method of grouping a set of continuous or numerical data into a smaller number of discrete "bins" or intervals. This can be useful for data visualization, statistical analysis, and machine learning tasks. Each bin is typically represented by a single value, such as the midpoint or mean of the data in that bin, and the data points within a bin are considered to be similar to one another in some way. Binning can be used to reduce the complexity of a dataset and make patterns and trends more easily visible.

There are several ways to implement binning, depending on the specific use case and the type of data being binned. Here are a few common methods:

Fixed-width binning: In this method, the range of the data is divided into a fixed number of bins of equal width. This is a simple and straightforward way to bin data, but it may not be appropriate if the data has a non-uniform distribution.

Adaptive binning: In this method, the width of the bins is adjusted based on the distribution of the data. This can be useful when dealing with data that has a non-uniform distribution.

K-means binning: In this method, the data is divided into a fixed number of bins based on the k-means clustering algorithm. This can be useful when the data is not evenly distributed and there are clear clusters or groups in the data.

Libraries in Python, R, and other languages provide functions for binning. In Python, examples include numpy.histogram and pandas.cut.
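
K-means binning is not demonstrated elsewhere in this notebook, so here is a hedged sketch using scikit-learn's KBinsDiscretizer with strategy="kmeans". The input values are invented to form three obvious clusters, and the choice of three bins is an assumption made for the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical 1-D data with three visible clusters
values = np.array([1, 2, 3, 20, 21, 22, 50, 51, 52], dtype=float).reshape(-1, 1)

# strategy="kmeans" places the bin edges between 1-D k-means cluster centers;
# encode="ordinal" returns one integer bin index per value
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
bins = kbd.fit_transform(values).ravel().astype(int)

print(bins)               # bin index for each value
print(kbd.bin_edges_[0])  # learned bin edges for the single feature
```

Changing strategy to "uniform" or "quantile" gives fixed-width and quantile-based bins respectively, which makes it easy to compare the three approaches on the same data.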

In [None]:
import pandas as pd

# Example dataframe with a 'age' column
data = {'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]}
df = pd.DataFrame(data)

# Define the bin edges and labels
bin_edges = [20, 30, 40, 50, 60, 70]
bin_labels = ['20-30', '30-40', '40-50', '50-60', '60-70']

# Perform binning on the 'age' column
df['age_bins'] = pd.cut(df['age'], bin_edges, labels=bin_labels)

# View the binned data
print(df)


   age age_bins
0   25    20-30
1   30    20-30
2   35    30-40
3   40    30-40
4   45    40-50
5   50    40-50
6   55    50-60
7   60    50-60
8   65    60-70
9   70    60-70


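You can also use the qcut function to perform quantile-based binning, so that the number of observations in each bin is roughly equal. The labels below are reused from the previous cell, so they serve only as bin names and do not describe the actual quantile edges.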

In [None]:
# Perform quantile-based binning on the 'age' column
df['age_bins'] = pd.qcut(df['age'], q=5, labels=bin_labels)
df

   age age_bins
0   25    20-30
1   30    20-30
2   35    30-40
3   40    30-40
4   45    40-50
5   50    40-50
6   55    50-60
7   60    50-60
8   65    60-70
9   70    60-70


You can also use NumPy's histogram function to bin the data and get the number of observations in each bin together with the bin edges.

In [None]:
import numpy as np

# Perform binning on the 'age' column
# np.histogram returns the counts first and the bin edges second
bin_freq, bin_edges = np.histogram(df['age'], bins=[20,30,40,50,60,70])
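
np.histogram returns a tuple of (counts, edges), in that order. The self-contained sketch below repeats the age example with its own variable names (ages, counts, edges) and prints both results. Note that np.histogram treats every bin except the last as half-open on the right, whereas pd.cut closes bins on the right by default, so a value such as 30 lands in a different bin.

```python
import numpy as np

# Same ages as in the binning example above
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

# The first return value holds the counts, the second holds the bin edges
counts, edges = np.histogram(ages, bins=[20, 30, 40, 50, 60, 70])

print(counts)  # [1 2 2 2 3]
print(edges)   # [20 30 40 50 60 70]
```

Here 30 is counted in the bin [30, 40), while pd.cut with the same edges assigned it to the bin labelled 20-30.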


# **Variable Transformation**

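Some machine learning models, such as linear and logistic regression, tend to work better when numeric variables are roughly symmetric rather than heavily skewed. Strictly speaking, the normality assumption in linear regression applies to the residuals rather than to the variables themselves, but strongly skewed variables often make that assumption harder to satisfy. Variables in real datasets frequently follow a skewed distribution.

Applying a transformation that brings a skewed distribution closer to a normal distribution can therefore improve the performance of such models.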

The most commonly-used methods to transform variables are the following:

Logarithmic transformation

Square root transformation

Reciprocal transformation

Exponential or power transformation
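
To get a feel for how these four transformations affect skewness, the sketch below applies each of them to synthetic, right-skewed data and compares the sample skewness. The log-normal sample, the random seed, and the exponent 0.3 are all arbitrary choices made for illustration; results on real data will differ.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (strictly positive, so log and 1/x are defined)
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

transformed = pd.DataFrame({
    "raw": x,
    "log": np.log(x),               # logarithmic transformation
    "sqrt": np.sqrt(x),             # square root transformation
    "reciprocal": 1.0 / x,          # reciprocal transformation
    "power_0.3": np.power(x, 0.3),  # power transformation, exponent chosen arbitrarily
})

# Skewness close to 0 indicates a roughly symmetric distribution
skewness = transformed.skew()
print(skewness.round(2))
```

The logarithm and the reciprocal require strictly positive values, and the square root requires non-negative values, so the data may need to be shifted or filtered first.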

# **Logarithmic transformation**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Read the data and drop the 'date' column
data = pd.read_csv('https://raw.githubusercontent.com/jainrachit108/datasets/main/kc_house_data.csv')
new_data = data.drop('date', axis=1)

# Plot a histogram of the raw prices
plt.hist(new_data['price'], bins=10)


In [None]:
# Plot the histogram of prices after a log transformation
import numpy as np
new_data['price'] = np.log(new_data['price'])
plt.hist(new_data['price'])
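
np.log is undefined at zero, so a common variation for columns that can contain zeros is np.log1p, which computes log(1 + x). The sketch below uses a small made-up array rather than the house price data, and also shows how np.expm1 maps the transformed values back to the original scale.

```python
import numpy as np

# Hypothetical values including a zero, where np.log would return -inf
values = np.array([0.0, 9.0, 99.0, 999.0])

# log1p computes log(1 + x), so zero maps to zero
logged = np.log1p(values)

# expm1 computes exp(x) - 1 and reverses the transformation
restored = np.expm1(logged)

print(logged)
print(restored)
```

Inverting the transformation matters when the transformed variable is the prediction target, since predictions then need to be converted back to the original units.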