# __Feature Engineering__

## __Agenda__

In this lesson, we will cover the following concepts with the help of examples:
- Introduction to Feature Engineering
- Feature Engineering Methods
- Transforming Variables
  * Log Transformation
  * Square Root Transformation
  * Box-Cox Transformation
- Feature Scaling
- Label Encoding
- One-Hot Encoding
- Hashing
    * Hashlib Module
- Grouping Operations
- Removing Highly Correlated Features

## __1. Introduction to Feature Engineering__
Feature engineering is the process of selecting, modifying, or creating features (variables) from raw data to improve the performance of machine learning models.
- It involves transforming the data into a more suitable format, making it easier for models to learn patterns and make accurate predictions.
- It is a critical step in the data preprocessing pipeline and plays a key role in the success of machine learning projects.



## __2. Feature Engineering Methods__

Feature engineering methods create new features by applying mathematical operations or transformations to existing variables, or by combining them.
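
As a minimal sketch of combining existing variables, the example below derives two ratio features from a small, made-up table. The column names mirror those used later in this lesson, but the values and the derived feature names (`living_to_lot`, `sqft_per_bedroom`) are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "sqft_living": [1500, 2000, 1000],
    "sqft_lot": [6000, 8000, 5000],
    "bedrooms": [3, 4, 2],
})

# Ratio feature: share of the lot covered by living area
toy["living_to_lot"] = toy["sqft_living"] / toy["sqft_lot"]

# Ratio feature: average living area per bedroom
toy["sqft_per_bedroom"] = toy["sqft_living"] / toy["bedrooms"]

print(toy)
```

Ratios like these can expose relationships that the raw columns only express jointly. Whether they help a given model is an empirical question.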

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_10_Feature_Engineering/Feature_Engineering_Methods.png)

__Note:__ The __Data Wrangling__ lesson extensively addresses various feature engineering methods, including outlier handling, imputation, and data cleaning. Any aspects not covered in that lesson but deemed essential for feature engineering are comprehensively discussed here.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("../data/HousePrices.csv")

In [None]:
df.head()

In [None]:
# Create a new feature 'total_rooms' by adding bedrooms and bathrooms
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df.head()

## __3. Transforming Variables__
Transforming variables is a crucial aspect of feature engineering that involves modifying the scale, distribution, or nature of variables to meet certain assumptions or to make them more suitable for analysis or modeling.
Here are a few common techniques for transforming variables:
1. Log transformation
2. Square root transformation
3. Box-Cox transformation


### __3.1 Log Transformation__

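Log transformation is useful for handling right-skewed data and reducing the influence of large outliers. It applies the natural logarithm to each value, which compresses large values more than small ones and therefore makes a highly skewed distribution less skewed. The logarithm is defined only for strictly positive values.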
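
When a variable may contain zeros, one common workaround is `np.log1p`, which computes log(1 + x). The sketch below uses made-up values to show that a zero input stays finite and that `np.expm1` reverses the transformation. It is an illustration under that assumption, not a step taken in this lesson's notebook.

```python
import numpy as np

# Hypothetical right-skewed values, including a zero
values = np.array([0.0, 10.0, 100.0, 1000.0, 100000.0])

# np.log1p computes log(1 + x), so a zero input maps to 0 instead of -inf
log_values = np.log1p(values)

# np.expm1 computes exp(x) - 1, the inverse of np.log1p
restored = np.expm1(log_values)

print(log_values)
print(restored)
```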

In [None]:
# Logarithmic transformation of the 'price' column (vectorized)
# Note: np.log returns -inf for zero values; np.log1p is a common alternative in that case
df['log_price'] = np.log(df['price'])
df.head()

In [None]:
# Create boxplot for the 'price' and 'log_price' columns
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.boxplot(x='price', data=df, ax=ax[0])
sns.boxplot(x='log_price', data=df, ax=ax[1])
ax[0].set_title('Boxplot of the price column')
ax[1].set_title('Boxplot of the log_price column')
plt.show()

### __3.2 Square Root Transformation__
Like the log transformation, the square root transformation reduces right skew and helps stabilize variance. It is gentler than the log transformation, so it suits moderately skewed data, and unlike the logarithm it is defined at zero.
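
To make "gentler" concrete, the sketch below compares the sample skewness of a small, made-up right-skewed series before and after each transformation. The data are hypothetical; only the ordering of the three skewness values matters here.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is very large
s = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

skew_raw = s.skew()             # skewness of the original values
skew_sqrt = np.sqrt(s).skew()   # after the square root transformation
skew_log = np.log(s).skew()     # after the log transformation

print(f"raw:  {skew_raw:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

On this sample the square root reduces skewness somewhat and the logarithm reduces it further, which matches the description of the square root as the milder of the two.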

In [None]:
# Square root transforming the 'price' variable
df['SquareRoot_price'] = np.sqrt(df['price'])
# Displaying the DataFrame with the new feature
print("DataFrame with square root transformed 'price':")
df[['price', 'SquareRoot_price']].head()

In [None]:
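# Histograms of the original, log-transformed, and square-root-transformed price
import matplotlib.pyplot as plt
import seaborn as sns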

fig, ax = plt.subplots(1, 3, figsize=(10, 5))
sns.histplot(df['price'], ax=ax[0])
sns.histplot(df['log_price'], ax=ax[1])
sns.histplot(df['SquareRoot_price'], ax=ax[2])
ax[0].set_title('Histogram of the price column')
ax[1].set_title('Histogram of the log_price column')
ax[2].set_title('Histogram of the SquareRoot_price column')
plt.show()

### __3.3 Box-Cox Transformation__

The Box-Cox transformation is a family of power transformations controlled by a parameter lambda. It includes the log transformation (lambda = 0) and, up to a shift and rescaling, the square root transformation (lambda = 0.5).
- Because lambda can be tuned to the data, it can handle a broader range of data distributions.

- Box-Cox requires strictly positive data, because it involves logarithms and powers that are undefined for zero or negative values. If a variable contains zeros or negative values, add a constant that shifts all values above zero before applying the transformation.
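
The formula behind the transformation can be sketched in a few lines. The helper `box_cox` below is a hypothetical name written for illustration. For a fixed lambda it applies (x^lambda - 1) / lambda, with log(x) as the special case lambda = 0. `scipy.stats.boxcox`, by contrast, also estimates lambda from the data by maximum likelihood when no value is supplied.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)  # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

x = np.array([1.0, 4.0, 9.0, 16.0])

print(box_cox(x, 0))    # same as the log transformation
print(box_cox(x, 0.5))  # shifted and rescaled square root: 2 * (sqrt(x) - 1)
print(box_cox(x, 1))    # a plain shift: x - 1
```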

In [None]:
from scipy.stats import boxcox

# Apply the Box-Cox transformation to the 'sqft_living' column;
# with no lambda given, boxcox() also returns the lambda it estimated
df['BoxCox_sqft'], best_lambda = boxcox(df['sqft_living'])

# Display the original and Box-Cox transformed 'sqft_living' columns
print("DataFrame with Box-Cox transformed sqft_living:")
df[['sqft_living', 'BoxCox_sqft']].head()

In [None]:
# Lambda estimated by boxcox() for 'sqft_living'
print(best_lambda)
# Create histogram for the 'sqft_living' and 'BoxCox_sqft' columns
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(df['sqft_living'], ax=ax[0])
sns.histplot(df['BoxCox_sqft'], ax=ax[1])
ax[0].set_title('Histogram of the sqft_living column')
ax[1].set_title('Histogram of the BoxCox_sqft column')
plt.show()

## __4. Feature Scaling__
Feature scaling is a technique used in machine learning and data preprocessing to standardize or normalize the range of independent variables or features of a dataset.

- Min-max scaling transforms data to a specific range, typically between 0 and 1, preserving the relative differences between values. This normalization technique is ideal for datasets with known bounds, ensuring that all values are rescaled proportionally to fit within the specified range.

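- Standard scaling (standardization) subtracts the mean and divides by the standard deviation, so each scaled feature has a mean of 0 and a standard deviation of 1. It is preferable for data that is approximately normally distributed.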
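
Both scalers reduce to short formulas. The sketch below applies them with NumPy to a made-up array so the arithmetic is visible. The values are hypothetical. `np.std` defaults to the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values
x = np.array([1000.0, 1500.0, 2000.0, 4000.0])

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std, giving mean 0 and standard deviation 1
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```

Because the minimum and maximum define the min-max range, a single extreme value compresses all other scaled values toward one end of the interval.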

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_10_Feature_Engineering/Label_Encoding.png)

In [None]:
# Normalization using scikit-learn
from sklearn.preprocessing import MinMaxScaler

# Scaling numeric features using min-max scaling
scaler = MinMaxScaler()
df[['sqft_living_scaled', 'sqft_lot_scaled']] = scaler.fit_transform(df[['sqft_living', 'sqft_lot']])
df.head()

In [None]:
# Standardization using scikit-learn
from sklearn.preprocessing import StandardScaler

# Scaling numeric features using standard scaling
standard_sc = StandardScaler()
df[['sqft_living_standard', 'sqft_lot_standard']] = standard_sc.fit_transform(df[['sqft_living', 'sqft_lot']])
df.head()

## __5. Label Encoding__

Label encoding is a technique used to convert categorical labels into a numeric format, making it suitable for machine learning algorithms that require numerical input.
- In label encoding, each unique category is assigned an integer value.
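- Integer codes are a natural fit for ordinal categorical data, where the order of categories matters, provided the integers follow that order. `LabelEncoder` assigns integers in sorted (alphabetical) order, so `large`, `medium`, and `small` become 0, 1, and 2, which does not match the size order.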
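
When the order of the categories matters, the integers can be assigned from an explicit ordering rather than alphabetically. The sketch below uses an ordered pandas `Categorical` for this. It is one possible approach, and the variable names are illustrative.

```python
import pandas as pd

sizes = pd.Series(["small", "medium", "large", "medium", "small"])

# Alphabetical codes, the same assignment LabelEncoder makes:
# large=0, medium=1, small=2
alphabetical_codes = sizes.astype("category").cat.codes

# Explicit order: small < medium < large
size_order = ["small", "medium", "large"]
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=sizes.index)

print(alphabetical_codes.tolist())
print(ordinal_codes.tolist())
```

With the explicit ordering, a larger code always means a larger size, which is the property that models relying on numeric order need.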

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
data = {'size': ['small', 'medium', 'large', 'medium', 'small']}
df2 = pd.DataFrame(data)

# Before label encoding
print("Original DataFrame:")
df2.head()

In [None]:
# Apply label encoding
label_encoder = LabelEncoder()
df2['size_encoded'] = label_encoder.fit_transform(df2['size'])

# After label encoding
print("\nDataFrame after label encoding:")
df2.head()

In [None]:
label_encoder.transform(['small', 'medium', 'large'])

In [None]:
# Create one hot encoding for the 'size' column
df2n = pd.get_dummies(df2['size'])
df2 = df2.merge(df2n, left_index=True, right_index=True)
df2.head()


In [None]:
# Demonstrating label encoding using csv file
from sklearn.preprocessing import LabelEncoder

# Label encoding for the 'city' column
label_encoder = LabelEncoder()
df['city_encoded'] = label_encoder.fit_transform(df['city'])
df.head()

In [None]:
# Use one-hot encoding for the 'city' column
df_one_hot = pd.get_dummies(df['city'])
df = pd.concat([df, df_one_hot], axis=1)
df.head()

In [None]:
# Alternative without scikit-learn: pandas category codes
df['city_encoded_v2'] = df['city'].astype('category').cat.codes
df.head()

In [None]:
# Alternative without scikit-learn: explicit mapping in order of first appearance
my_cities = df['city'].unique()
my_cities_labels = {city: i for i, city in enumerate(my_cities)}
df['city_encoded_v3'] = df['city'].map(my_cities_labels)
df.head()

## __6. One-Hot Encoding__

One-hot encoding is a technique to represent categorical variables as binary vectors.
- It is particularly useful when dealing with nominal categorical data, where there is no inherent order among categories.
- In one-hot encoding, each unique category is transformed into a binary column, and only one column in each set of binary columns is _hot_ (or 1) to indicate the presence of that category.

- It adds one column per category, so dataset dimensionality grows with the number of categories. For features with many distinct values, this increases complexity and computational overhead.
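
The effect on dimensionality is easy to see on a tiny example. The sketch below encodes a made-up `color` series twice with `pd.get_dummies`: once in full, and once with `drop_first=True`, which omits the first category (alphabetically `blue`). The `prefix` and `dtype=int` arguments are choices made here for readable output.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "red", "green"], name="color")

# Full one-hot encoding: one column per category
full = pd.get_dummies(colors, prefix="color", dtype=int)

# drop_first=True omits the first category; a row of all zeros then means 'blue'
reduced = pd.get_dummies(colors, prefix="color", drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the full encoding every row sums to 1, so any one column can be derived from the others. Dropping one column removes that redundancy, which matters mainly for linear models.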

In [None]:
import pandas as pd

# Sample DataFrame
data = {'color': ['red', 'blue', 'green', 'red', 'green']}
df3 = pd.DataFrame(data)

# Before one-hot encoding
print("Original DataFrame:")
df3

In [None]:
# Apply one-hot encoding
# (pass drop_first=True to get_dummies to drop one column and avoid multicollinearity)
df_encoded = pd.get_dummies(df3['color'])
df3 = pd.concat([df3, df_encoded], axis=1)
# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
df3

In [None]:
# Read the housing data again
df = pd.read_csv("../data/HousePrices.csv")
df.head()


In [None]:
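# Demonstrating one-hot encoding on the housing data
# One-hot encoding for the 'view' column; the prefix gives the new columns
# descriptive names instead of bare category values
df_encode = pd.get_dummies(df['view'], prefix='view', drop_first=True)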
df = pd.concat([df, df_encode], axis=1)
# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
df.head()

In [None]:
# Same, using scikit-learn
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
tr = enc.fit_transform(df[['view']])
print(tr)

In [None]:
print(enc.categories_)

In [None]:
print(enc.categories_[0])

In [None]:
tr.toarray()

In [None]:
df_encode2 = df.copy()
# Convert the encoded array back to a DataFrame and change the column names to the original categories
df_encode2[enc.categories_[0]] = tr.toarray()
df_encode2.head()

In [None]:
df_encode2.shape

In [None]:
# Add one hot encoding for the 'city' column
one_hot_city_encoder = OneHotEncoder()
tr_city = one_hot_city_encoder.fit_transform(df[['city']])
df_encode3 = df.copy()
df_encode3[one_hot_city_encoder.categories_[0]] = tr_city.toarray()
print(df_encode3.head())
print(df_encode3.shape)


## __7. Hashing__

Hashing is a technique that converts input data of variable length into a fixed-size value, called a hash code.
- The hash function takes an input (or message) and returns a fixed-size string of characters, which is typically a hexadecimal number.
- It is commonly used for indexing data structures, checking data integrity, and hashing passwords.


![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_10/10_01/Lesson_10_Feature_EngineeringHashing.jpg)

In [None]:
# Example of hashing in Python
data = "Hello, Hashing!"

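# Using the built-in hash() function
# Note: string hashes are salted per Python process by default,
# so this value changes from one session to the next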
hash_value = hash(data)

print(f"Original data: {data}")
print(f"Hash value: {hash_value}")

In [None]:
# Demonstrating hashing on the housing data
# Hashing the 'street' column with the built-in hash();
# the resulting values are not reproducible across Python sessions
df['street_hashed'] = df['street'].apply(hash)
df.head(10)

### __7.1 Hashlib Module__

The hashlib module in Python is used for generating hash values. It offers interfaces to different cryptographic hash algorithms like MD5, SHA-1, SHA-256, SHA-384, and SHA-512.

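- Its hash functions are implemented in optimized native code, so they remain efficient on large inputs.
- Its digests are deterministic: the same input always produces the same hash value, across runs and machines.
- It is widely used for cryptographic operations, data integrity checks, and password hashing.
- It exposes all supported algorithms through a common interface, so they can be used interchangeably.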



Cryptographic hash algorithms vary in hash size and security levels.

- For tasks where security is not a concern, such as checksums, you can opt for MD5 or SHA-1. However, both algorithms are considered cryptographically broken because practical collision attacks exist, so they should not be used for security purposes.

- For security-sensitive applications, it's advisable to prioritize SHA-256, SHA-384, or SHA-512 due to their stronger security and larger hash sizes.

In [None]:
# Example of hashlib module in Python
import hashlib

# Input data
data = b'Hello, world!'
print(f"Original data: {data.decode()} \n")

# Calculate MD5 hash
md5_hash = hashlib.md5(data).hexdigest()
print("MD5 Hash:", md5_hash)

# Calculate SHA-1 hash
sha1_hash = hashlib.sha1(data).hexdigest()
print("SHA-1 Hash:", sha1_hash)

# Calculate SHA-256 hash
sha256_hash = hashlib.sha256(data).hexdigest()
print("SHA-256 Hash:", sha256_hash)

# Calculate SHA-384 hash
sha384_hash = hashlib.sha384(data).hexdigest()
print("SHA-384 Hash:", sha384_hash)

# Calculate SHA-512 hash
sha512_hash = hashlib.sha512(data).hexdigest()
print("SHA-512 Hash:", sha512_hash)


In this example, the `hashlib` module is imported and the input data is provided as bytes. Hash values are then computed using the `md5()`, `sha1()`, `sha256()`, `sha384()`, and `sha512()` functions, and their hexadecimal representations are obtained using `hexdigest()`.
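
A hexadecimal digest is not directly usable as a model input. One common way to turn it into a feature is to reduce it to a small number of buckets. The sketch below illustrates that idea. The helper name `hash_bucket`, the bucket count, and the sample street names are assumptions made for the example.

```python
import hashlib

def hash_bucket(value, n_buckets=8):
    """Map a string to a bucket index in [0, n_buckets) using SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and reduce it modulo n_buckets
    return int(digest, 16) % n_buckets

# Hypothetical street names
streets = ["12 Oak Street", "98 Pine Avenue", "12 Oak Street"]
buckets = [hash_bucket(s) for s in streets]

print(buckets)
```

Identical inputs always land in the same bucket, and the result is reproducible across sessions. Different inputs can share a bucket, so a smaller bucket count trades information for lower dimensionality.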

In [None]:
# Demonstrating MD5 hash function using csv file for the 'street' column

street_column = df['street']
hashed_streets = street_column.apply(lambda x: hashlib.md5(x.encode()).hexdigest())

# Store the hash values in a new column (the original 'street' column is kept)
df['hashed_street'] = hashed_streets

# Optionally, write the updated DataFrame back to a CSV file
df.to_csv('hashed_file.csv', index=False)

df

## __8. Grouping Operations__

Grouping operations involve splitting a dataset into groups based on some criteria, applying a function to each group independently, and then combining the results.
- This is a crucial step in data analysis and manipulation, allowing for insights into the data at a more granular level.
- Grouping operations are commonly combined with aggregate functions to summarize data within each group.
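
Group statistics become features when they are attached back to the individual rows. The sketch below uses `transform`, which returns one value per original row, on a small made-up table. The new column names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Revenue": [500, 300, 700, 400, 600],
})

# transform() returns a result aligned with the original rows,
# so it can be assigned directly as a new column
sales["category_mean_revenue"] = sales.groupby("Category")["Revenue"].transform("mean")

# How far each row is from its category average
sales["revenue_vs_category_mean"] = sales["Revenue"] - sales["category_mean_revenue"]

print(sales)
```

Unlike an aggregation such as `sum()`, which returns one row per group, `transform` keeps the original shape, so every row carries the statistic of its own group.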

In [None]:
import pandas as pd

# Sample DataFrame
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
        'Revenue': [500, 300, 700, 400, 600]}

df4 = pd.DataFrame(data)
df4

In [None]:
grouped_df = df4.groupby('Category')
grouped_df.head()

In [None]:
# Grouping by 'Category' and calculating total revenue for each category
revenues = grouped_df['Revenue'].sum()

print("\nGrouped DataFrame with total revenue:")
print(type(revenues))
print(revenues.head())

In [None]:
# Grouping by 'city'; head() on a GroupBy object returns the first rows of each group
df_grouped_city = df.groupby('city')
df_grouped_city.head()

In [None]:
# Grouping by 'city' and calculating the average price
average_price = df_grouped_city['price'].mean()
average_price.head()

In [None]:
# Grouping by 'city' and calculating the minimum price
df_grouped_min = df_grouped_city['price'].min()
df_grouped_min.head()

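## __9. Removing Highly Correlated Features__

Highly correlated features carry largely redundant information. Removing one feature from each highly correlated pair reduces dimensionality and multicollinearity with little loss of information.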

In [None]:
# Reload the housing data to start from the original columns
df = pd.read_csv("../data/HousePrices.csv")
df.head()

In [None]:
# Create a DataFrame with all columns except "price" (the target, which we don't want to remove)
import pandas as pd
import numpy as np
df_features = df.copy()
df_features = df_features.drop(columns=['price'])

df_features.head()

In [None]:
# Compute correlation
print(df_features.columns)
features_to_corr = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                    'floors', 'sqft_above',
                    'sqft_basement', 'yr_built', 'yr_renovated']
df_corr = df_features[features_to_corr].corr()

In [None]:
df_corr.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Heatmap
plt.figure(figsize=(12,10))
sns.heatmap(df_corr,annot=True,cmap='coolwarm')
plt.title("Heat map of the correlation matrix")
plt.show()

In [None]:
df_corr.columns

In [None]:
df_corr.iloc[0,1]

In [None]:
# Count the features listed after 'sqft_living' whose correlation with it exceeds 0.5
(df_corr.iloc[2,3:]>0.5).sum()

In [None]:
df_corr.shape[0]

In [None]:
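cols = df_corr.columns
cols_after_corr = list(cols)

# For each feature, look only at the features that come after it (upper triangle)
# and drop it if its absolute correlation with any of them exceeds 0.5
for i in range(df_corr.shape[0]):
    if (df_corr.iloc[i, i+1:].abs() > 0.5).any():
        cols_after_corr.remove(cols[i])
        print("Removing", cols[i])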

print(len(cols_after_corr))
cols_after_corr
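
The same idea can be written without an explicit loop by masking the correlation matrix to its strict upper triangle. The sketch below runs on a small made-up DataFrame, and the 0.5 threshold is only an example value. Note that this variant drops the later column of each correlated pair, whereas the loop above drops the earlier one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is an exact multiple of 'a'; 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.5  # example cut-off, not a recommended value
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print("Dropping:", to_drop)
print(reduced.columns.tolist())
```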


In [None]:
df_features_after_corr = df_features[cols_after_corr]
df_features_after_corr.head()

In [None]:
df_features_after_corr.shape

In [113]:
# Recompute the correlation matrix on the remaining features
df_corr = df_features_after_corr.corr()

In [None]:
# Heatmap
plt.figure(figsize=(12,10))
sns.heatmap(df_corr,annot=True,cmap='coolwarm')
plt.title("Heat map of the correlation matrix after removing correlated features")
plt.show()

# __Assisted Practice__

## __Problem Statement:__
A botanical research team is conducting a comprehensive analysis of iris flowers, aiming to derive valuable insights from their characteristics. The team wants to explore feature engineering techniques to understand and visualize the relationships within the Iris dataset.

## __Steps to Perform:__
- Understand the Dataset: Get familiar with the Iris dataset and its features
- Engineer Features: Create new features like sepal area and petal area
- Transform Variables: If the features are not normally distributed, apply transformations such as log, square root, or Box-Cox
- Scale Features: Use min-max scaling or standard scaling to scale the features
- Encode Labels: Convert the categorical data (species) into numerical data using label encoding
- One-Hot Encoding: Apply one-hot encoding to the species feature and compare the result with label encoding