## __1. Introduction to Feature Engineering__
Feature engineering is the process of selecting, modifying, or creating features (variables) from raw data to improve the performance of machine learning models.
- It involves transforming the data into a more suitable format, making it easier for models to learn patterns and make accurate predictions.
- It is a critical step in the data preprocessing pipeline and plays a key role in the success of machine learning projects.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('HousePrices.csv')
df.head(2)

FileNotFoundError: [Errno 2] No such file or directory: 'HousePrices.csv'

In [None]:
# Creating a new Feature 'total_rooms' by adding bedrooms and bathrooms

df['total_rooms'] = df['bedrooms']+df['bathrooms']
print(df['total_rooms'])
df.head(2)

## __3. Transforming Variables__
Transforming variables is a key part of feature engineering: it changes the scale or distribution of a variable so that it better meets a model's assumptions or is easier to analyze.
- Common techniques include:

1. Log transformation
2. Square root transformation
3. Box-Cox transformation

### __3.1 Log Transformation__

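Log transformation is useful for right-skewed data and for reducing the influence of large outliers. It applies the natural logarithm to each value, which compresses large values more than small ones and makes highly skewed distributions less skewed. It is defined only for positive values: the logarithm of 0 is negative infinity.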

In [None]:
# Logarithmic transformation of the sqft_living column
df['log_sqft_living'] = np.log(df['sqft_living'])
df['log_sqft_living']

In [None]:
df['sqft_living'].hist()

In [None]:
df['log_sqft_living'].hist()

In [None]:
df.head(2)

In [None]:
# Logarithmic transformation of the price column
df['log_price'] = np.log(df['price'])
df['log_price']

# log(0) is -inf, so any zero price produces -inf; replace those values with 0
df['log_price'] = df['log_price'].replace(-np.inf,0)
df['log_price'].min(), df['log_price'].max()


In [None]:
df['price'].hist()

In [None]:
df['log_price'].hist()
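
The cell above replaces `-inf` with 0 after the logarithm has been taken. One alternative, sketched here on made-up values rather than the course dataset, is `np.log1p`, which computes log(1 + x) and so maps 0 to 0 without producing infinities. The `prices` series is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a zero like the one that produced -inf above
prices = pd.Series([0.0, 150000.0, 320000.0, 2500000.0])

# log1p(x) = log(1 + x), so a zero price maps to exactly 0 instead of -inf
log1p_prices = np.log1p(prices)

# expm1 is the inverse of log1p and recovers the original scale when needed
recovered = np.expm1(log1p_prices)

print(log1p_prices)
print(np.isfinite(log1p_prices).all())
```

For large values such as house prices, log(1 + x) and log(x) are nearly identical, so the shape of the histogram is essentially unchanged.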

### __3.2 Square Root Transformation__
Square root transformation, like log transformation, reduces right skew and helps stabilize variance. It is milder than the log transformation, so it suits moderately skewed data, and unlike the logarithm it is defined at zero.

In [None]:
# Square root transformation of the price variable

df['sqrt_price'] = np.sqrt(df['price'])

# Displaying the DataFrame with the new feature
print("DataFrame with square-root-transformed price:")
print(df[['price','sqrt_price']])

In [None]:
df['sqrt_price'].hist()

In [None]:
df['price'].hist()
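
To put a number on "milder than log", the sketch below compares the sample skewness of a variable before and after each transformation. It uses a synthetic log-normal sample as a stand-in for a price-like column, so the exact figures are assumptions of the example and will differ on real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); a stand-in for a price-like column
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=12.0, sigma=0.8, size=5000))

# Series.skew() returns the sample skewness; values near 0 indicate symmetry
skew_raw = values.skew()
skew_sqrt = np.sqrt(values).skew()
skew_log = np.log(values).skew()

print(f"raw skewness:  {skew_raw:.2f}")
print(f"sqrt skewness: {skew_sqrt:.2f}")
print(f"log skewness:  {skew_log:.2f}")
```

On this kind of data the square root removes part of the skew and the logarithm removes most of it, which is why the square root is usually the first thing to try on moderately skewed variables.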

### __3.3 Box-Cox Transformation__

The Box-Cox transformation is a family of power transformations, controlled by a parameter lambda, that includes the log transformation (lambda = 0) and, up to shifting and scaling, the square root transformation (lambda = 0.5) as special cases.
- It can handle a broader range of data distributions.

- Box-Cox requires strictly positive data, because it involves powers and logarithms that are undefined for zero or negative values. If a column contains zeros or negative values, adding a constant to shift it into the positive range avoids mathematical errors and allows the transformation to be applied.

In [None]:
from scipy.stats import boxcox

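# Applying Box-Cox transformation to the sqft_living variable
# With no lambda supplied, boxcox returns (transformed values, fitted lambda)
df['bc_sqft_living'], fitted_lambda = boxcox(df['sqft_living'])

# Displaying DataFrame with the Box-Cox transformed sqft_living variable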

print(df[['sqft_living','bc_sqft_living']])
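
The relationship between Box-Cox and the earlier transformations can be checked directly. The sketch below assumes a synthetic, strictly positive sample (not the course dataset) and uses the `lmbda` argument of `scipy.stats.boxcox` to fix lambda instead of estimating it.

```python
import numpy as np
from scipy.stats import boxcox

# Synthetic, strictly positive sample (log-normal); not the course dataset
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7.0, sigma=0.5, size=500)

# With a fixed lambda, boxcox returns only the transformed array
bc_log = boxcox(x, lmbda=0)      # lambda = 0 reduces to log(x)
bc_sqrt = boxcox(x, lmbda=0.5)   # lambda = 0.5 gives (sqrt(x) - 1) / 0.5

# Without lmbda, boxcox also estimates lambda by maximum likelihood
bc_fitted, fitted_lambda = boxcox(x)

print(np.allclose(bc_log, np.log(x)))
print(np.allclose(bc_sqrt, (np.sqrt(x) - 1) / 0.5))
print(f"fitted lambda: {fitted_lambda:.3f}")
```

Keeping the fitted lambda is useful in practice, because the same value has to be reused when transforming new data or inverting the transformation.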

## __4. Feature Scaling__
Feature scaling is a technique used in machine learning and data preprocessing to standardize or normalize the range of independent variables or features of a dataset.

- Min-max scaling transforms data to a specific range, typically between 0 and 1, preserving the relative differences between values. This normalization technique is ideal for datasets with known bounds, ensuring that all values are rescaled proportionally to fit within the specified range.

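- Standard scaling (standardization) subtracts the mean and divides by the standard deviation, so each feature ends up with mean 0 and standard deviation 1. It is preferable for roughly normally distributed data or data without known bounds.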

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_10_Feature_Engineering/Label_Encoding.png)

In [None]:
from sklearn.preprocessing import MinMaxScaler

# scaling numeric feature using min max values
scaler = MinMaxScaler()
df[['sqft_living','sqft_lot']] = scaler.fit_transform(df[['sqft_living','sqft_lot']])
df[['sqft_living','sqft_lot']]
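
The two scaling methods described above reduce to simple formulas. The following is a minimal sketch with NumPy on made-up square-footage values; it is meant to show the arithmetic, not to replace the scikit-learn scalers.

```python
import numpy as np

# Hypothetical square-footage values, for illustration only
sqft = np.array([800.0, 1200.0, 1500.0, 2400.0, 4000.0])

# Min-max scaling: (x - min) / (max - min), which maps the data onto [0, 1]
min_max = (sqft - sqft.min()) / (sqft.max() - sqft.min())

# Standard scaling: (x - mean) / std, which gives mean 0 and standard deviation 1
# ndarray.std() uses the population standard deviation (ddof=0),
# the same convention as scikit-learn's StandardScaler
standard = (sqft - sqft.mean()) / sqft.std()

print(min_max)
print(standard)
```

In a modeling workflow the minimum, maximum, mean, and standard deviation should be computed on the training data only and then reused on the test data, which is what calling `fit` on the training set and `transform` on both sets does.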

In [None]:
df['price'].var()

In [None]:
df['log_price'].var()

## __5. Label Encoding__

Label encoding is a technique used to convert categorical labels into a numeric format, making it suitable for machine learning algorithms that require numerical input.
- In label encoding, each unique category is assigned an integer value.
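- This suits ordinal categorical data, where the order of categories matters, provided the integers follow that order. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels, not in their logical order.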

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
data = {'size': ['small', 'medium', 'large', 'medium', 'small']}
df1 = pd.DataFrame(data)

# Before label encoding
print("Original DataFrame:")
print(df1)

# Apply label encoding
label_encoder = LabelEncoder()
df1['size_encoded'] = label_encoder.fit_transform(df1['size'])

# After label encoding
print("\nDataFrame after label encoding:")
print(df1)
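
`LabelEncoder` sorts the labels before numbering them, so in the cell above 'large' becomes 0, 'medium' 1, and 'small' 2. When the logical order matters, one option is to state the order explicitly. The sketch below does this with an ordered pandas `Categorical`; the `df_sizes` and `size_order` names are introduced only for this example.

```python
import pandas as pd

df_sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Explicit logical order, from smallest to largest
size_order = ['small', 'medium', 'large']

# An ordered Categorical assigns codes by position in size_order:
# small -> 0, medium -> 1, large -> 2
df_sizes['size_ordinal'] = pd.Categorical(
    df_sizes['size'], categories=size_order, ordered=True
).codes

print(df_sizes)
```

Values that are not listed in `size_order` receive the code -1, which makes unexpected categories easy to spot.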

In [2]:
import pandas as pd

# Sample DataFrame
data = {'color': ['red', 'blue', 'green', 'red', 'green']}
df2 = pd.DataFrame(data)

# Before one-hot encoding
print("Original DataFrame:")
print(df2)

# Apply one-hot encoding
df2_encoded = pd.get_dummies(df2, columns=['color'], prefix='color')

# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
print(df2_encoded)

Original DataFrame:
   color
0    red
1   blue
2  green
3    red
4  green

DataFrame after one-hot encoding:
   color_blue  color_green  color_red
0       False        False       True
1        True        False      False
2       False         True      False
3       False        False       True
4       False         True      False


In [None]:
# Demonstrating label encoding using csv file
from sklearn.preprocessing import LabelEncoder

# Label encoding for the 'city' column
label_encoder = LabelEncoder()
df['city_encoded'] = label_encoder.fit_transform(df['city'])
print(df)

In [None]:
# Demonstrating one-hot encoding using csv file
# One-Hot Encoding for the 'view' column
df_encode = pd.get_dummies(df, columns=['view'], prefix='view')

# After one-hot encoding
print("\nDataFrame after one-hot encoding:")
print(df_encode)

In [None]:
df_encode.columns

## __7. Hashing__

Hashing is a technique that converts input data of variable length into a fixed-size value, called a hash code.
- A hash function takes an input (or message) and returns a fixed-size value, which is often displayed as a hexadecimal string.
- It is commonly used for indexing data structures, checking data integrity, and hashing passwords.


![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_10/10_01/Lesson_10_Feature_EngineeringHashing.jpg)

In [None]:
# Example of Hashing in Python

data = 'Hello! hashing!'

# Using the built-in hash() function
# Note: string hashes are randomized per Python process unless PYTHONHASHSEED is set
hash_value = hash(data)

print(f"Original data: {data}")
print(f"\nHash value: {hash_value}")

In [None]:
# Demonstrating hashing using csv file
# Hashing for the 'street' column
df['street_hashed'] = df['street'].apply(hash)
df.head(2)
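
Because the built-in `hash()` is salted per process for strings, the values in `street_hashed` can differ from one run to the next. For a feature that must be reproducible, one common approach is to take a deterministic digest and reduce it to a fixed number of buckets. The sketch below assumes a hypothetical helper `stable_bucket`, an arbitrary bucket count of 16, and made-up street names.

```python
import hashlib

def stable_bucket(value, n_buckets=16):
    """Map a value to a bucket index in [0, n_buckets) using an MD5 digest.

    Hypothetical helper for illustration; MD5 is used here only for
    bucketing, not for security.
    """
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# Made-up street names, with one repeated value
streets = ['12 Example St', '34 Sample Ave', '12 Example St']
buckets = [stable_bucket(s) for s in streets]

print(buckets)
```

With a DataFrame, the same helper could be applied as `df['street'].apply(stable_bucket)`. Different streets can land in the same bucket, so the bucket count trades feature size against collisions.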

## __Hashlib Module__

The hashlib module in Python is used for generating hash values. It offers interfaces to different cryptographic hash algorithms like MD5, SHA-1, SHA-256, SHA-384, and SHA-512.

- It provides a common interface to many hash algorithms, so switching algorithms only requires changing the constructor name.
- Unlike the built-in `hash()`, its digests are deterministic: the same input produces the same output across runs and machines.
- It is widely used for cryptographic operations, data integrity checks, and password hashing.

Cryptographic hash algorithms vary in hash size and security levels.

- For tasks where security is not a concern, such as checksums or bucketing values, MD5 or SHA-1 are adequate. However, both are considered cryptographically broken because practical collision attacks exist, so they should not be used where security matters.

- For security-sensitive applications, it's advisable to prioritize SHA-256, SHA-384, or SHA-512 due to their stronger security and larger hash sizes.
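
The difference in hash size can be read directly from the hash objects. This short sketch uses only the standard library; the `digest_bits` dictionary is a name introduced for the example.

```python
import hashlib

data = b"Hello World!"
digest_bits = {}

for name in ['md5', 'sha1', 'sha256', 'sha384', 'sha512']:
    h = hashlib.new(name, data)
    # digest_size is in bytes; each byte is shown as two hexadecimal characters
    digest_bits[name] = h.digest_size * 8
    print(f"{name:>6}: {digest_bits[name]} bits, {len(h.hexdigest())} hex characters")
```

The output size is fixed by the algorithm and does not depend on the length of the input.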

In [None]:
# Example of Hashlib module in python
import hashlib

In [None]:
# Input data
data = b"Hello World!"
print(f" Original Data: {data.decode()}\n")

# Calculate MD5 hash
md5_hash = hashlib.md5(data).hexdigest()
print(f"MD5 Hash: {md5_hash}")

# Calculate SHA-1 hash
sha1_hash = hashlib.sha1(data).hexdigest()
print(f"\nSHA-1 hash: {sha1_hash}")

# Calculate SHA-256 hash
sha256_hash = hashlib.sha256(data).hexdigest()
print(f"\nSHA-256 hash: {sha256_hash}")

# Calculate SHA-384 hash
sha384_hash = hashlib.sha384(data).hexdigest()
print(f"\nSHA-384 hash: {sha384_hash}")

# Calculate SHA-512 hash
sha512_hash = hashlib.sha512(data).hexdigest()
print(f"\nSHA-512 hash: {sha512_hash}")

In [None]:
# Demonstrating MD5 hash function using csv file for the 'street' column

street_column = df['street']
hashed_streets = street_column.apply(lambda x: hashlib.md5(x.encode()).hexdigest())

# Store the hash values in a new column (the original 'street' column is kept)
df['hashed_street'] = hashed_streets

# Optionally, write the updated DataFrame back to a CSV file
df.to_csv('hashed_file.csv', index=False)

df.head(2)

## __8. Grouping Operations__

Grouping operations involve splitting a dataset into groups based on some criteria, applying a function to each group independently, and then combining the results.
- This is a crucial step in data analysis and manipulation, allowing for insights into the data at a more granular level.
- Grouping operations are commonly combined with aggregate functions to summarize data within each group.

In [None]:
# Sample DataFrame
df3 = pd.DataFrame({'Category':['Electronics','Clothing','Electronics','Clothing','Electronics'],
                   'Revenue':[500,700,300,400,600]})

# Grouping by Categories and calculating the total revenue for each category

grouped_df3 = df3.groupby('Category')['Revenue'].sum().reset_index()

print(f"Original DataFrame:\n{df3}")
print(f"\nGrouped DataFrame with Total Revenue:\n{grouped_df3}")
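
Grouping is also a way to build features, not only summaries. The sketch below, which reuses the same sample values under the assumed name `sales`, computes several aggregates at once with `agg` and then uses `transform` to attach each category's mean revenue back to the individual rows.

```python
import pandas as pd

sales = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'Revenue': [500, 700, 300, 400, 600],
})

# Several aggregates per group in one call; the result has one row per category
summary = sales.groupby('Category')['Revenue'].agg(['sum', 'mean', 'count']).reset_index()

# transform returns one value per original row, so the result can be used as a new feature
sales['category_mean_revenue'] = sales.groupby('Category')['Revenue'].transform('mean')
sales['revenue_vs_category_mean'] = sales['Revenue'] - sales['category_mean_revenue']

print(summary)
print(sales)
```

The difference between `agg` and `transform` is the shape of the result: `agg` collapses each group to one row, while `transform` keeps the original row count.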

In [None]:
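import numpy as np
import pandas as pd

# Sample DataFrame
df3 = pd.DataFrame({'Revenue':[500,700,300,400,600,1000,2000,3000,4000,5000]})

# Label each row by comparing its revenue with 1000
# (a revenue of exactly 1000 falls into 'Above_1000', as in the else branch)
df3['R_Category'] = np.where(df3['Revenue'] < 1000, 'Below_1000', 'Above_1000')

print(df3)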
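
When more than two revenue bands are needed, `pd.cut` can assign them in one call. This is a sketch under assumed bin edges and label names; the `revenue` DataFrame and the `R_Band` column are introduced only for the example.

```python
import pandas as pd

revenue = pd.DataFrame({'Revenue': [500, 700, 300, 400, 600, 1000, 2000, 3000, 4000, 5000]})

# Bin edges and labels are illustrative assumptions
bins = [0, 1000, 3000, float('inf')]
labels = ['Below_1000', '1000_to_2999', '3000_plus']

# right=False makes each bin include its left edge: [0, 1000), [1000, 3000), [3000, inf)
revenue['R_Band'] = pd.cut(revenue['Revenue'], bins=bins, labels=labels, right=False)

print(revenue)
print(revenue.groupby('R_Band', observed=True)['Revenue'].sum())
```

The resulting column is an ordered categorical, so it can be grouped on directly, as in the last line, or encoded with the techniques from the encoding sections.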