# Data Preprocessing

### Key concepts in data preprocessing for supervised machine learning

### A general guide to data preprocessing: https://www.kaggle.com/code/farzadnekouei/flight-data-eda-to-preprocessing#Step-3-|-Dataset-Overview 

## Categorical Variables Encoding

### Why:
- Machine Learning Algorithms Require Numerical Input: <span style="color:orange;">Most machine learning algorithms, particularly those based on mathematical computations such as linear regression, logistic regression, and support vector machines, require numerical input. These algorithms cannot directly process categorical data in their raw form</span>

- Improves Model Performance: <span style="color:orange;">Encoding categorical variables can help improve the performance of a model. Proper encoding allows the model to interpret categorical data correctly, leading to more accurate predictions. For instance, one-hot encoding can help the model treat each category as distinct, while ordinal encoding can preserve the order in ordinal data.</span>

- Handles High Cardinality: <span style="color:orange;">Encoding techniques like target encoding or frequency encoding can be beneficial when dealing with high cardinality categorical variables (i.e., variables with many unique categories). These techniques can help reduce dimensionality while still capturing important information from the categories.</span>

- Prevents Ordinal Assumptions: <span style="color:orange;">One-hot encoding prevents the model from assuming any ordinal relationship between categories that are nominal. For example, encoding colors ("red", "green", "blue") with one-hot encoding ensures that the model does not treat "green" as being between "red" and "blue".</span>

- Enhances Interpretability: <span style="color:orange;">Some encoding methods, such as target encoding, can enhance the interpretability of the model by linking categories directly to the target variable. This can be useful for understanding the impact of different categories on the prediction.</span>

- Reduces Overfitting: <span style="color:orange;">Encoding techniques can also help in reducing overfitting. For example, binary encoding reduces the number of features compared to one-hot encoding, which can help in preventing the model from memorizing the training data.</span>

- Compatibility with Algorithms: <span style="color:orange;">Certain algorithms, such as tree-based models (e.g., decision trees, random forests), can handle categorical data more flexibly. However, even for these algorithms, encoding can help in improving the efficiency and performance of the model.</span>

- Data Consistency: <span style="color:orange;">Encoding ensures that the data fed into the model is consistent and in a suitable format. This consistency is crucial for maintaining the integrity of the training process and ensuring reliable predictions.</span>
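
To make the dimensionality argument above concrete, the following minimal sketch compares how many columns one-hot and binary encoding would produce for a feature of a given cardinality. The helper names are hypothetical, and the binary width assumes that categories are numbered from 1 and written in base 2.

```python
# Compare how many columns one-hot and binary encoding produce
# for a categorical feature with n_categories distinct values.

def one_hot_width(n_categories: int) -> int:
    # One-hot encoding creates one indicator column per category.
    return n_categories

def binary_width(n_categories: int) -> int:
    # Assumption: categories are numbered 1..n and written in base 2,
    # so the width is the bit length of the largest code.
    return n_categories.bit_length()

widths = {n: (one_hot_width(n), binary_width(n)) for n in (3, 11, 1000)}
for n, (one_hot, binary) in widths.items():
    print(f"{n:>5} categories: one-hot={one_hot:>5} columns, binary={binary} columns")
```

The gap widens quickly: one-hot width grows linearly with the number of categories, while the binary width grows logarithmically.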

In [9]:
# Import libraries  
import numpy as np 
import pandas as pd 
from sklearn import preprocessing 
from tabulate import tabulate

### Examples of Encoding

**1. Label Encoding**
- Description: Assigns each unique category a distinct integer code
- Use Case: Suitable for ordinal categorical variables, provided the integer codes follow the inherent order of the categories (enforced here with an ordered `pd.Categorical`)


In [14]:
cat_var = ["low", "medium", "high"]  
cat_var = pd.Categorical(cat_var, categories=["low", "medium", "high"], ordered=True) 
cat_var = cat_var.codes
cat_var

array([0, 1, 2], dtype=int8)
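
As a point of comparison, here is a small sketch of what happens when the order is not declared explicitly. It uses scikit-learn's `LabelEncoder`, which numbers the classes in sorted (alphabetical) order, so the resulting codes do not reflect low < medium < high.

```python
from sklearn.preprocessing import LabelEncoder

levels = ["low", "medium", "high"]

# LabelEncoder sorts the classes alphabetically before numbering them,
# so the codes do not follow the low < medium < high order.
encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

print(list(encoder.classes_))  # alphabetical order of the categories
print(codes.tolist())
```

This is why an explicit category order matters whenever the integers are meant to carry ordinal meaning.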

**2. One-Hot Encoding**
- Description: Creates a new binary column for each category
- Use Case: Suitable for nominal categorical variables with no intrinsic order

In [19]:
from sklearn.preprocessing import OneHotEncoder

cat_var = [["apple"], ["banana"], ["melon"], ["orange"]] #requires 2d array for one-hot encoding
print(tabulate(cat_var))
encoder = OneHotEncoder(sparse_output=False)
cat_var = encoder.fit_transform(cat_var)
print(tabulate(cat_var))

------
apple
banana
melon
orange
------
-  -  -  -
1  0  0  0
0  1  0  0
0  0  1  0
0  0  0  1
-  -  -  -
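
In practice, new data may contain categories that were not present when the encoder was fitted. The sketch below shows one way to handle this with `handle_unknown="ignore"`; the category "kiwi" is a made-up example of an unseen value.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["apple"], ["banana"], ["melon"], ["orange"]]
new_data = [["banana"], ["kiwi"]]  # "kiwi" never appeared during fitting

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a SciPy sparse matrix; convert it for display.
encoded = encoder.transform(new_data).toarray()
print(encoder.categories_[0].tolist())
print(encoded)
```

An all-zero row is only one possible convention; whether it is appropriate depends on how the downstream model treats it.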


**3. Binary Encoding**
- Description: Converts categories to binary digits and then splits them into separate columns
- Use Case: Useful when there are a large number of categories and one-hot encoding would create too many columns

In [25]:
import category_encoders as ce

df = pd.DataFrame({'cat_var':['green','red','blue','pink','black','white',
           'brown','purple','yellow','grey','wheat']})
print(tabulate(df))
encoder = ce.BinaryEncoder(cols=['cat_var'])
df = encoder.fit_transform(df)
print(tabulate(df))


--  ------
 0  green
 1  red
 2  blue
 3  pink
 4  black
 5  white
 6  brown
 7  purple
 8  yellow
 9  grey
10  wheat
--  ------
--  -  -  -  -
 0  0  0  0  1
 1  0  0  1  0
 2  0  0  1  1
 3  0  1  0  0
 4  0  1  0  1
 5  0  1  1  0
 6  0  1  1  1
 7  1  0  0  0
 8  1  0  0  1
 9  1  0  1  0
10  1  0  1  1
--  -  -  -  -
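
To show what happens inside the encoder, the following sketch rebuilds the same kind of table by hand with pandas and NumPy. It is an illustration of the idea (number the categories from 1, then write each number in base 2), not the `category_encoders` implementation, and the column names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

colors = ["green", "red", "blue", "pink", "black", "white",
          "brown", "purple", "yellow", "grey", "wheat"]

# Step 1: number the categories 1..n in order of first appearance.
codes = pd.factorize(pd.Series(colors))[0] + 1

# Step 2: write each code in base 2, one column per bit (most significant first).
n_bits = int(codes.max()).bit_length()
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

encoded = pd.DataFrame(bits, columns=[f"cat_var_{i}" for i in range(n_bits)])
encoded.insert(0, "cat_var", colors)
print(encoded)
```

Eleven categories fit into four bit columns, compared with eleven columns under one-hot encoding.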


**4. Frequency Encoding**
- Description: Replaces each category with its frequency in the dataset
- Use Case: When the frequency of categories can provide useful information to the model

In [31]:
# Note: each of the 20 values is drawn independently with probabilities p,
# so the observed frequencies only approximate p
cat_var = np.random.choice(['green', 'blue', 'red'], size=20, p=[0.3, 0.5, 0.2])
df = pd.DataFrame({'cat_var': cat_var})
print(tabulate(df.head(10)))
df['cat_var'] = df['cat_var'].map(df['cat_var'].value_counts(normalize=True))
print(tabulate(df.head(10)))

-  -----
0  blue
1  red
2  blue
3  blue
4  blue
5  blue
6  green
7  red
8  red
9  blue
-  -----
-  ----
0  0.45
1  0.2
2  0.45
3  0.45
4  0.45
5  0.45
6  0.35
7  0.2
8  0.2
9  0.45
-  ----
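
When the data is split into training and test sets, the frequencies are usually learned from the training set and reused elsewhere. The sketch below illustrates this on a small made-up dataset; replacing unseen categories with 0.0 is an assumption, not the only reasonable choice.

```python
import pandas as pd

train = pd.DataFrame({"cat_var": ["blue", "blue", "red", "green", "blue"]})
test = pd.DataFrame({"cat_var": ["red", "blue", "purple"]})  # "purple" is unseen

# Learn the relative frequencies from the training data only.
freq = train["cat_var"].value_counts(normalize=True)

# Apply the same mapping to both sets; unseen categories map to NaN,
# which is replaced here by 0.0 (one possible convention).
train["cat_var_freq"] = train["cat_var"].map(freq)
test["cat_var_freq"] = test["cat_var"].map(freq).fillna(0.0)
print(test)
```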


**5. Target Encoding (Mean Encoding)**
- Description: Replaces each category with the mean of the target variable for that category
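- Use Case: Useful for high cardinality categorical features, particularly in classification problems. The category means should be computed from the training data only; otherwise information about the target leaks into the feature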

In [33]:
cat_var = np.random.choice(['green', 'blue', 'red'], size=20, p=[0.3, 0.5, 0.2])
target_var = np.random.randint(0, 11, size=20)
df = pd.DataFrame({'cat_var': cat_var, 'target_var': target_var})
print(tabulate(df.head(10)))
target_mean = df.groupby('cat_var')['target_var'].mean()
df['cat_var'] = df['cat_var'].map(target_mean)
print(tabulate(df.head(10)))

-  -----  --
0  blue    8
1  blue    1
2  green   8
3  green   4
4  blue   10
5  blue    7
6  red     1
7  green   2
8  green   2
9  blue    9
-  -----  --
-  -----  --
0  6.3     8
1  6.3     1
2  3.875   8
3  3.875   4
4  6.3    10
5  6.3     7
6  5       1
7  3.875   2
8  3.875   2
9  6.3     9
-  -----  --
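
Categories with very few rows get noisy means. A common remedy is to blend each category mean with the global mean. The sketch below shows one such smoothing scheme on a tiny made-up dataset; the smoothing weight `m` and the column name `cat_var_encoded` are hypothetical choices for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "cat_var": ["a", "a", "a", "b"],
    "target_var": [1, 0, 1, 1],
})
test = pd.DataFrame({"cat_var": ["a", "b", "c"]})  # "c" is unseen

m = 2  # hypothetical smoothing weight: larger values pull rare categories toward the global mean
global_mean = train["target_var"].mean()

# Per-category count and mean, computed on the training rows only.
stats = train.groupby("cat_var")["target_var"].agg(["count", "mean"])

# Smoothed estimate: weighted average of the category mean and the global mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Unseen categories fall back to the global mean.
test["cat_var_encoded"] = test["cat_var"].map(smoothed).fillna(global_mean)
print(test)
```

Category "b" has a raw mean of 1.0 from a single row, and smoothing pulls it toward the global mean of 0.75.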


**6. Ordinal Encoding**
- Description: Assigns each unique category an integer value, respecting the order
- Use Case: Specifically for ordinal variables where categories have a defined order

In [42]:
from sklearn.preprocessing import OrdinalEncoder

cat_var = [['low'], ['medium'], ['high']] #2-d array
print(tabulate(cat_var))
encoder = OrdinalEncoder(categories=[['low','medium','high']]) #order the category
cat_var = encoder.fit_transform(cat_var)
print(tabulate(cat_var))

------
low
medium
high
------
-
0
1
2
-
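
If new data can contain levels outside the declared order, the encoder can be told how to treat them. This sketch assumes scikit-learn 0.24 or later, where `handle_unknown="use_encoded_value"` is available; the level "very high" is a made-up unseen value.

```python
from sklearn.preprocessing import OrdinalEncoder

train = [["low"], ["medium"], ["high"]]
new_data = [["medium"], ["very high"]]  # "very high" was not declared

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(
    categories=[["low", "medium", "high"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoder.fit(train)
encoded = encoder.transform(new_data)
print(encoded.ravel().tolist())
```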


**7. Hashing Encoding**
- Description: Applies a hash function to the category to convert it to a fixed-size vector
- Use Case: Useful when the number of unique categories is very large
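- Note: The number of output columns (`n_features`) is a tuning choice that depends on factors such as the size of the data and the number of unique categories; with too few columns, different categories collide and become indistinguishable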

In [84]:
from sklearn.feature_extraction import FeatureHasher


cat_var = [['green'], ['red'], ['blue'], ['pink'], ['black'], ['white'], ['brown'], ['purple'], ['yellow'], ['grey']]
print(tabulate(cat_var))
encoder = FeatureHasher(n_features=7, input_type='string')
cat_var_encoded = encoder.transform(cat_var).toarray()
df = pd.DataFrame(cat_var_encoded)
df = pd.concat([pd.DataFrame(cat_var), df], axis=1)
print(tabulate(df))

------
green
red
blue
pink
black
white
brown
purple
yellow
grey
------
-  ------  -  --  --  --  --  --  -
0  green   0   0   0   0   0   1  0
1  red     0   0  -1   0   0   0  0
2  blue    0   0   0   0  -1   0  0
3  pink    0   0   0   1   0   0  0
4  black   0   0   0  -1   0   0  0
5  white   0   0   0   0   0  -1  0
6  brown   0   1   0   0   0   0  0
7  purple  0   0   0   0   0   0  1
8  yellow  0   0   0   0   1   0  0
9  grey    0  -1   0   0   0   0  0
-  ------  -  --  --  --  --  --  -
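
The trade-off of hashing is that distinct categories can end up with the same encoding. The sketch below makes this visible by hashing 20 made-up category names into only 4 columns and counting how many distinct rows result.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# 20 hypothetical category names hashed into only 4 output columns.
categories = [[f"category_{i}"] for i in range(20)]
hasher = FeatureHasher(n_features=4, input_type="string")
hashed = hasher.transform(categories).toarray()

# Each single-string sample activates one column with value +1 or -1,
# so at most 2 * n_features distinct rows are possible.
n_distinct = len(np.unique(hashed, axis=0))
print(f"{len(categories)} categories -> {n_distinct} distinct encodings")
```

Increasing `n_features` lowers the collision rate at the cost of a wider feature matrix.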


# Feature Selection

## Why
- **Model Performance**: <span style="color:orange;">Using relevant features can enhance model accuracy and generalization by focusing on the most informative variables</span>
- **Overfitting Reduction**: <span style="color:orange;">Reducing the number of features can decrease the risk of overfitting, where the model learns noise rather than the underlying pattern</span>
- **Computational Efficiency**: <span style="color:orange;">Fewer features mean faster training times and reduced resource consumption</span>
- **Interpretability**: <span style="color:orange;">Models with fewer features are easier to understand and interpret, which is crucial in many applications</span>

## Types of Feature Selection

### 1. Filter Methods (quick, computationally inexpensive)
#### Overview: These methods evaluate the relevance of features by examining their intrinsic properties.
#### Techniques:
- **Correlation Coefficient**: Measures the linear relationship between each feature and the target variable
- **Chi-Square Test**: Evaluates the independence of categorical features with respect to the target variable
- **Mutual Information**: Assesses the dependency between each feature and the target
- **Variance Threshold**: Removes features with low variance, assuming they have little information


### 2. Wrapper Methods (computationally intensive)
#### Overview: These methods evaluate feature subsets based on model performance
#### Techniques:
- **Forward Selection**: Starts with no features and adds them one by one based on model improvement
- **Backward Elimination**: Starts with all features and removes them one by one based on model performance
- **Recursive Feature Elimination (RFE)**: Repeatedly fits the model, removes the least important features, and refits on the remaining ones until the desired number of features is left

### 3. Regularization Methods (for high-dimensional data or many irrelevant features)
#### Overview: These methods perform feature selection during the model training process
#### Techniques:
- **Lasso (L1 Regularization)**: Can shrink the coefficients of less important features to exactly zero, which removes them from the model
- **Ridge (L2 Regularization)**: Penalizes the size of coefficients but does not perform feature selection per se
- **Elastic Net**: Combines L1 and L2 regularization to select features and shrink coefficients


## Evaluation Criteria for Feature Selection
- **Relevance**: The feature's ability to predict the target variable
- **Redundancy**: The degree to which a feature is redundant with respect to other features
- **Interaction**: The feature's contribution in the presence of other features
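
Of these criteria, redundancy is the easiest to check directly. The following sketch flags features that are almost perfectly correlated with an earlier feature. The data is synthetic, and the 0.95 cut-off is a hypothetical choice rather than a recommended value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.01 * rng.normal(size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),               # independent of x1
})

# Absolute pairwise correlations; keep only the upper triangle so that
# each pair is inspected once and the first feature of a pair is retained.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # hypothetical cut-off for "redundant"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Redundant features:", to_drop)
```

Pairwise correlation only captures linear redundancy between two features; it says nothing about relevance to the target or about interactions.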


## Practical Steps in Feature Selection

- **Understand the Data**: Explore and preprocess the data (handling missing values, encoding categorical variables)
- **Initial Filtering**: Use filter methods to remove irrelevant or redundant features
- **Model-Based Selection**: Apply wrapper or regularization (embedded) methods to further refine the feature set
- **Validation**: Validate the selected features using cross-validation to ensure they generalize well to unseen data
- **Iterate**: Feature selection is an iterative process. Revisit and refine the feature set as necessary based on model performance and new insights

## Common Pitfalls in Feature Selection
- **Ignoring Domain Knowledge**: Failing to incorporate domain expertise can lead to the exclusion of important features
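- **Overfitting in Feature Selection**: Selecting features on the entire dataset before evaluation leaks information from the validation data and produces optimistic scores. Perform the selection inside each cross-validation fold, using the training portion only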
- **Too Many Features**: Even with selection, having too many features can still lead to overfitting and high variance
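
One way to avoid the selection-related overfitting described above is to place the selector inside a scikit-learn `Pipeline`, so that it is refitted within each cross-validation fold. The sketch below is a minimal illustration; the choice of `f_classif`, `k=10`, and logistic regression is an assumption made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling and selection live inside the pipeline, so each fold refits them
# on its own training part and the held-out part never influences the choice.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=5000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

The same pattern works for wrapper methods such as `RFE`, at a higher computational cost.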



## Examples

### 1. Filter Methods

#### a. Correlation Coefficient

In [18]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')



# Compute correlation coefficients
correlations = X.corrwith(y).abs().sort_values(ascending=False)
selected_features_corr = sorted(correlations.head(10).index.tolist())
print("After Correlation Coefficient Dataframe Shape: ",X[selected_features_corr].shape)
print(tabulate(X[selected_features_corr].head(5), headers=selected_features_corr, tablefmt='psql'))


Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points

#### b. Chi-Square Test

In [19]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

# Discretize features
discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
X_discretized = discretizer.fit_transform(X)

# Apply Chi-Square test
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X_discretized, y)
selected_features_chi2 = sorted(X.columns[chi2_selector.get_support()].tolist())
print("After Chi-Square Test Dataframe Shape: ",X[selected_features_chi2].shape)
print(tabulate(X[selected_features_chi2].head(5), headers=selected_features_chi2, tablefmt='psql'))

Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points

#### c. Mutual Information

In [20]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

# Apply Mutual Information method
mi_selector = SelectKBest(mutual_info_classif, k=10)
mi_selector.fit(X, y)
selected_features_mi = sorted(X.columns[mi_selector.get_support()].tolist())
print("After Mutual Information Dataframe Shape: ",X[selected_features_mi].shape)
print(tabulate(X[selected_features_mi].head(5), headers=selected_features_mi, tablefmt='psql'))

Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points

#### d. Variance Threshold

In [21]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

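# Apply Variance Threshold (variance depends on each feature's units, so the 0.5 cut-off is only meaningful relative to the raw scales)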
var_threshold = VarianceThreshold(threshold=0.5)
var_threshold.fit(X)
selected_features_var = sorted(X.columns[var_threshold.get_support()].tolist())
print("After Variance Threshold (0.5) Dataframe Shape: ",X[selected_features_var].shape)
print(tabulate(X[selected_features_var].head(5), headers=selected_features_var, tablefmt='psql'))

Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points
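
Because variance depends on the units of a feature, the same threshold can keep or drop a feature depending on how it is measured. The sketch below illustrates this with two hypothetical columns that hold the same lengths in metres and in millimetres.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Two hypothetical columns carrying the same information in different units.
length_m = [0.0, 0.1, 0.2, 0.3]
X = pd.DataFrame({
    "length_m": length_m,
    "length_mm": [value * 1000 for value in length_m],
})

selector = VarianceThreshold(threshold=0.5)
selector.fit(X)

print(dict(zip(X.columns, selector.variances_)))
print(X.columns[selector.get_support()].tolist())
```

Only the millimetre column passes the threshold, even though both columns are equally informative, so the threshold should be chosen with the feature scales in mind.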

### 2. Wrapper Methods

#### a. Forward Selection

**Pros**:
- Computationally efficient when the target subset is small: Starts with no features and adds them one by one, which requires fewer model fits than backward elimination when only a small fraction of the features is kept
- Scalable: Better suited for datasets with a large number of features
- Faster convergence: Typically reaches a satisfactory subset of features more quickly

**Cons**:

- Risk of missing important features: Since it starts with an empty model, there is a chance that some important features might not be added if their individual contributions are not immediately apparent
- Greedy algorithm: May not find the optimal set of features as it makes locally optimal choices at each step
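
The cost difference between the two search directions can be estimated with simple counting. The sketch below assumes a greedy search that scores every candidate at each step with 5-fold cross-validation; the helper `candidate_fits` is hypothetical and counts model fits only, ignoring that fits on more features are also slower.

```python
def candidate_fits(n_features: int, n_selected: int, direction: str, cv: int = 5) -> int:
    """Count model fits for greedy sequential selection with cross-validation."""
    if direction == "forward":
        # Step i tries each of the (n_features - i) features not yet selected.
        steps = n_selected
    else:
        # Step i tries removing each of the (n_features - i) features still present.
        steps = n_features - n_selected
    candidates = sum(n_features - i for i in range(steps))
    return candidates * cv

forward = candidate_fits(30, 10, "forward")
backward = candidate_fits(30, 10, "backward")
print(f"forward: {forward} fits, backward: {backward} fits")
```

For 30 features and a target of 10, the forward search needs fewer fits. The advantage flips when more than half of the features are to be kept.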

In [22]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

# Apply Forward Selection
model = LogisticRegression(max_iter=5000)
sfs = SequentialFeatureSelector(model, n_features_to_select=10, direction='forward')
sfs.fit(X, y)
selected_features_fs = sorted(X.columns[sfs.get_support()].tolist())
print("After Forward Selection Dataframe Shape: ", X[selected_features_fs].shape)
print(tabulate(X[selected_features_fs].head(5), headers=selected_features_fs, tablefmt='psql'))


Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points

#### b. Backward Elimination

**Pros**:
- Starts with all features, ensuring that no potentially important features are initially excluded
- Can sometimes be more thorough since it evaluates the importance of each feature within the context of all other features

**Cons**:
- Computationally intensive: Each step involves fitting the model multiple times, which can be very slow if there are many features
- Not scalable: As the number of features increases, the time required to perform backward elimination grows significantly

In [23]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

# Apply Backward Elimination
model = LogisticRegression(max_iter=5000)
sfs = SequentialFeatureSelector(model, n_features_to_select=10, direction='backward')
sfs.fit(X, y)
selected_features_be = sorted(X.columns[sfs.get_support()].tolist())
print("After Backward Elimination Dataframe Shape: ", X[selected_features_be].shape)
print(tabulate(X[selected_features_be].head(5), headers=selected_features_be, tablefmt='psql'))

Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points

#### c. Recursive Feature Elimination (RFE)

In [24]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataframe Shape: ", X.shape)
print(tabulate(X.head(5), headers='keys', tablefmt='psql'))
y = pd.Series(data.target, name='target')

# Apply Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=5000)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)
selected_features_rfe = sorted(X.columns[rfe.get_support()].tolist())
print("After Recursive Feature Elimination Dataframe Shape: ", X[selected_features_rfe].shape)
print(tabulate(X[selected_features_rfe].head(5), headers=selected_features_rfe, tablefmt='psql'))

Original Dataframe Shape:  (569, 30)
+----+---------------+----------------+------------------+-------------+-------------------+--------------------+------------------+-----------------------+-----------------+--------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+----------------+-----------------+-------------------+--------------+--------------------+---------------------+-------------------+------------------------+------------------+---------------------------+
|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points
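
RFE with a linear model ranks features by coefficient magnitude, which is sensitive to feature scale. The sketch below is a variant that standardizes the features first and then inspects `ranking_` to see the elimination order; standardizing before RFE is an assumption of this example, not a step taken in the cell above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Standardize first so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X_scaled, y)

# ranking_ is 1 for selected features; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking.head(12))
```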

### 3. Regularization Methods

#### a. L1 Lasso Regularization (Lasso Regression)

In [43]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply Lasso regularization
lasso = Lasso(alpha=0.1)  # You can adjust alpha for stronger or weaker regularization
lasso.fit(X_train_scaled, y_train)

# Observe the coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso.coef_})
# Set very small coefficients to exactly zero
coefficients['Coefficient'] = coefficients['Coefficient'].apply(lambda x: 0 if abs(x) < 1e-5 else x)
# Sort by signed coefficient, breaking ties alphabetically by feature name
coefficients = coefficients.sort_values(by=['Coefficient', 'Feature'], ascending=[True, True])
selected_features_lasso = coefficients[coefficients['Coefficient'] != 0]['Feature'].tolist()
print(f"Selected features using Lasso Regularization: {selected_features_lasso}")

# Display coefficients
print(tabulate(coefficients, headers='keys', showindex=False, tablefmt='psql'))

# Make predictions and evaluate the model
# Lasso is a regression model fitted on the 0/1 labels, so threshold its output at 0.5
y_pred = lasso.predict(X_test_scaled)
y_pred_class = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_class)
conf_matrix = confusion_matrix(y_test, y_pred_class)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)


Selected features using Lasso Regularization: ['worst concave points', 'worst radius', 'mean concave points', 'worst texture']
+-------------------------+---------------+
| Feature                 |   Coefficient |
|-------------------------+---------------|
| worst concave points    |    -0.149999  |
| worst radius            |    -0.117937  |
| mean concave points     |    -0.036743  |
| worst texture           |    -0.0167356 |
| area error              |     0         |
| compactness error       |     0         |
| concave points error    |     0         |
| concavity error         |     0         |
| fractal dimension error |     0         |
| mean area               |     0         |
| mean compactness        |     0         |
| mean concavity          |     0         |
| mean fractal dimension  |     0         |
| mean perimeter          |     0         |
| mean radius             |     0         |
| mean smoothness         |     0         |
| mean symmetry           |     0    
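
The manual filtering of non-zero coefficients can also be expressed with scikit-learn's `SelectFromModel`, which wraps the estimator and exposes the usual `get_support` and `transform` methods. The sketch below reuses the same split and `alpha` as the cell above; for L1-penalized estimators such as `Lasso`, the default threshold is a small value close to zero.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)

# SelectFromModel keeps the features whose Lasso coefficient is non-negligible.
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(scaler.transform(X_train), y_train)

selected = X.columns[selector.get_support()].tolist()
X_test_reduced = selector.transform(scaler.transform(X_test))
print(selected)
print(X_test_reduced.shape)
```

The reduced matrices can then be passed to any downstream classifier.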

#### b. L2 Ridge Regularization (Ridge Regression)

In [48]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply Ridge regularization
ridge = Ridge(alpha=0.1)  # You can adjust alpha for stronger or weaker regularization
ridge.fit(X_train_scaled, y_train)

# Observe the coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': ridge.coef_})
# Set very small coefficients to exactly zero (though for Ridge, coefficients usually aren't zero)
coefficients['Coefficient'] = coefficients['Coefficient'].apply(lambda x: 0 if abs(x) < 1e-5 else x)
# Sort by absolute value of coefficients in descending order
coefficients = coefficients.reindex(coefficients['Coefficient'].abs().sort_values(ascending=False).index)

selected_features_ridge = coefficients[coefficients['Coefficient'] != 0]['Feature'].tolist()
print(f"Top 5 features using Ridge Regularization: {selected_features_ridge[0:5]}")

# Display coefficients
print(tabulate(coefficients, headers='keys', showindex=False, tablefmt='psql'))

# Make predictions and evaluate the model
# Ridge is a regression model fitted on the 0/1 labels, so threshold its output at 0.5
y_pred = ridge.predict(X_test_scaled)
y_pred_class = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_class)
conf_matrix = confusion_matrix(y_test, y_pred_class)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)


Top 5 features using Ridge Regularization: ['worst radius', 'worst area', 'mean compactness', 'worst perimeter', 'mean concave points']
+-------------------------+---------------+
| Feature                 |   Coefficient |
|-------------------------+---------------|
| worst radius            |   -0.790691   |
| worst area              |    0.437734   |
| mean compactness        |    0.234254   |
| worst perimeter         |    0.196706   |
| mean concave points     |   -0.18459    |
| mean radius             |    0.183285   |
| radius error            |   -0.146449   |
| mean perimeter          |   -0.14281    |
| concavity error         |    0.139473   |
| worst concavity         |   -0.12651    |
| concave points error    |   -0.108204   |
| worst compactness       |   -0.104189   |
| mean concavity          |   -0.0916966  |
| worst concave points    |    0.0790364  |
| area error              |    0.0739497  |
| worst symmetry          |   -0.0685449  |
| worst texture           | 

#### c. Elastic Net (L1 & L2 Regularization)

In [45]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from tabulate import tabulate

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply Elastic Net regularization
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # You can adjust alpha and l1_ratio for stronger or weaker regularization
elastic_net.fit(X_train_scaled, y_train)

# Observe the coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': elastic_net.coef_})
# Set very small coefficients to exactly zero
coefficients['Coefficient'] = coefficients['Coefficient'].apply(lambda x: 0 if abs(x) < 1e-5 else x)
# Sort by signed coefficient, breaking ties alphabetically by feature name
coefficients = coefficients.sort_values(by=['Coefficient', 'Feature'], ascending=[True, True])
selected_features_elastic_net = coefficients[coefficients['Coefficient'] != 0]['Feature'].tolist()
print(f"Selected features using Elastic Net Regularization: {selected_features_elastic_net}")

# Display coefficients
print(tabulate(coefficients, headers='keys', showindex=False, tablefmt='psql'))

# Make predictions and evaluate the model
# ElasticNet is a regression model fitted on the 0/1 labels, so threshold its output at 0.5
y_pred = elastic_net.predict(X_test_scaled)
y_pred_class = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_class)
conf_matrix = confusion_matrix(y_test, y_pred_class)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)


Selected features using Elastic Net Regularization: ['worst concave points', 'worst radius', 'mean concave points', 'worst texture', 'worst perimeter', 'worst symmetry', 'worst smoothness']
+-------------------------+---------------+
| Feature                 |   Coefficient |
|-------------------------+---------------|
| worst concave points    |   -0.113208   |
| worst radius            |   -0.0953203  |
| mean concave points     |   -0.0754843  |
| worst texture           |   -0.0517978  |
| worst perimeter         |   -0.0466087  |
| worst symmetry          |   -0.0205105  |
| worst smoothness        |   -0.00269989 |
| area error              |    0          |
| compactness error       |    0          |
| concave points error    |    0          |
| concavity error         |    0          |
| fractal dimension error |    0          |
| mean area               |    0          |
| mean compactness        |    0          |
| mean concavity          |    0          |
| mean fractal dim