# Case 1 - Data Wrangling

### Table of Contents

1. **Importing Libraries**

2. **Loading Data**

3. **Data Wrangling**
   - Splitting the data into train (80%) and test set (20%)
   - Continuous Data: Standardizing
   - Continuous Data: KNN Imputation - Handling missing values by KNNImputer
   - Categorical Data: Mode Imputation - Handling missing values by SimpleImputer
   - Categorical Data: One-hot encoding
   - Saving wrangled data to new .csv file
   - Summary of the data

4. **Data Wrangling for Nested CV (Without Splitting, Standardizing, and Handling Missing Values)**

5. **Wrangling case1Data_Xnew.csv**

6. **Feature Extraction**

## 1. Importing Libraries

In [2]:
import numpy as np
import pandas as pd

# Imputers
from sklearn.impute import KNNImputer, SimpleImputer

# Standardization scalers
from sklearn.preprocessing import StandardScaler

# Splitting data
from sklearn.model_selection import train_test_split

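# Set seeds for reproducibility (Python's built-in RNG and NumPy's global RNG);
# the scikit-learn calls below also pass random_state explicitly
import random
random.seed(42)
np.random.seed(42)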

## 2. Loading Data

In [5]:
# Path to the data files
data_path_1 = '../data/case1Data.csv'
data_path_2 = '../data/case1Data_Xnew.csv'

# Load the data into a numpy array
data_np = np.loadtxt(data_path_1, delimiter=',', skiprows=1)
data_np_new = np.loadtxt(data_path_2, delimiter=',', skiprows=1)

# Print the shape of the data in the numpy array
print(data_np.shape) # 100 rows and 101 columns (100 features and 1 target)
print(data_np_new.shape) # 1000 rows and 100 columns (100 features and no target)

# Create a pandas dataframe and use the first row as the column names
data_pd = pd.read_csv(data_path_1, sep=',', header=0)
data_pd_new = pd.read_csv(data_path_2, sep=',', header=0)

# Print the shape of the data in the pandas dataframe
print(data_pd.shape)
print(data_pd_new.shape)

(100, 101)
(1000, 100)
(100, 101)
(1000, 100)


## 3. Data Wrangling

### Splitting the data into train (80%) and test set (20%)

In [29]:
# Splitting the data into features and target
X = data_pd.iloc[:, 1:]
y = data_pd.iloc[:, 0]

print("X: ", X.shape)
print("y: ", y.shape)

# Splitting into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X:  (100, 100)
y:  (100,)


### Continuous Data: Standardizing

In [30]:
# Using StandardScaler from scikit-learn to standardize the data
scaler = StandardScaler()

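# Working on explicit copies to avoid a possible SettingWithCopyWarning from pandas on the in-place assignments below
X_train, X_test = X_train.copy(), X_test.copy()

# Standardizing the numerical features (all columns except the last five);
# the scaler is fitted on the training data only and then applied to the test data
X_train.iloc[:, :-5] = scaler.fit_transform(X_train.iloc[:, :-5])
X_test.iloc[:, :-5] = scaler.transform(X_test.iloc[:, :-5])

# Optional (disabled): standardizing the target values as well.
# Note: the commented-out lines below reuse `scaler`; enabling them as written would overwrite the statistics fitted on the features.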
#y_train = scaler.fit_transform(y_train.values.reshape(-1, 1)).flatten() # reshaping to 1D array
#y_test = scaler.transform(y_test.values.reshape(-1, 1)).flatten() # reshaping to 1D array
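
Standardizing before imputing works because `StandardScaler` disregards NaN entries when fitting and leaves them in place when transforming. The following is a minimal sketch on hypothetical toy values (the names `X_train_toy`, `X_test_toy`, and `toy_scaler` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test matrices with missing entries (hypothetical values)
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, np.nan],
                        [3.0, 30.0],
                        [np.nan, 50.0]])
X_test_toy = np.array([[2.0, np.nan],
                       [4.0, 30.0]])

toy_scaler = StandardScaler()
X_train_std = toy_scaler.fit_transform(X_train_toy)  # statistics come from the training rows only
X_test_std = toy_scaler.transform(X_test_toy)        # test rows reuse the training mean and std

# The fitted means are computed over the observed (non-NaN) entries of each column
print(toy_scaler.mean_)

# Missing entries stay missing, so an imputer still has to handle them afterwards
print(np.isnan(X_train_std).sum(), np.isnan(X_test_std).sum())
```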

### Continuous Data: KNN Imputation - Handling missing values by KNNImputer

In [31]:
# Using KNNImputer from scikit-learn to impute the missing values in the data (for continuous variables) with the mean of the k-nearest neighbors (k=5)

# class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False, keep_empty_features=False)
continuous_imputer = KNNImputer(n_neighbors=5, missing_values=np.nan)

# Fitting the imputer on the training data and transforming the training and test data
X_train.iloc[:, :-5] = pd.DataFrame(continuous_imputer.fit_transform(X_train.iloc[:, :-5]))
X_test.iloc[:, :-5] = pd.DataFrame(continuous_imputer.transform(X_test.iloc[:, :-5]))
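
To make the imputation step concrete, here is a small sketch of `KNNImputer` on hypothetical toy data, using `n_neighbors=2` instead of 5 so the result is easy to check by hand. Each missing entry is replaced by the mean of that feature over the nearest training rows that have the feature observed; the test row is filled using donors from the training data only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy training matrix with two missing entries (hypothetical values)
X_train_toy = np.array([[1.0, 2.0, np.nan],
                        [3.0, 4.0, 3.0],
                        [np.nan, 6.0, 5.0],
                        [8.0, 8.0, 7.0]])
# Toy test row with a missing first feature
X_test_toy = np.array([[np.nan, 6.5, 6.0]])

toy_imputer = KNNImputer(n_neighbors=2)
X_train_imp = toy_imputer.fit_transform(X_train_toy)  # learn donors from the training rows
X_test_imp = toy_imputer.transform(X_test_toy)        # fill the test row using training donors

print(X_train_imp)
print(X_test_imp)
```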

### Categorical Data: Mode Imputation - Handling missing values by SimpleImputer

In [32]:
# Mode Imputation: Using SimpleImputer from scikit-learn to impute the missing values in the data (for categorical variables) with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Fitting the imputer on the training data and transforming the training and test data
X_train.iloc[:, -5:] = categorical_imputer.fit_transform(X_train.iloc[:, -5:])
X_test.iloc[:, -5:] = categorical_imputer.transform(X_test.iloc[:, -5:])
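
As an illustration of mode imputation, the sketch below fits `SimpleImputer(strategy='most_frequent')` on hypothetical toy categories and applies it to a test row. The per-column modes learned from the training data are stored in `statistics_` and reused for the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical columns encoded as numbers, with missing entries (hypothetical values)
cat_train_toy = np.array([[1.0, 3.0],
                          [1.0, np.nan],
                          [2.0, 3.0],
                          [np.nan, 4.0]])
cat_test_toy = np.array([[np.nan, np.nan]])

toy_mode_imputer = SimpleImputer(strategy='most_frequent')
cat_train_imp = toy_mode_imputer.fit_transform(cat_train_toy)  # modes come from the training rows
cat_test_imp = toy_mode_imputer.transform(cat_test_toy)        # test rows are filled with those modes

print(toy_mode_imputer.statistics_)
print(cat_test_imp)
```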

### Categorical Data: One-hot encoding

In [33]:
# One-hot encoding the categorical variables using get_dummies from pandas library (for the last five columns)
X_train = pd.get_dummies(X_train, columns=X_train.columns[-5:])
X_test = pd.get_dummies(X_test, columns=X_test.columns[-5:])

# Printing the shape of the data to check if the one-hot encoding worked and only performed once
print("Number of columns in X_train and X_test should be 95 continuous columns + 21 dummy columns (one per category in the last 5 columns): 116")
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("If the number of columns is >116, one-hot encoding was performed more than once and should be fixed by re-running the code from the beginning.")


Number of columns in X_train and X_test should be 95 continuous columns + 21 dummy columns (one per category in the last 5 columns): 116
X_train:  (80, 116)
X_test:  (20, 116)
If the number of columns is >116, one-hot encoding was performed more than once and should be fixed by re-running the code from the beginning.
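
Calling `pd.get_dummies` separately on the train and test sets only yields matching columns when both sets contain the same categories, as happens to be the case here (116 columns each). If a category were missing from one split, the column sets would differ. One possible safeguard, sketched below on hypothetical toy frames, is to reindex the test frame to the training columns:

```python
import pandas as pd

# Toy frames: category 2.0 and 3.0 appear in train but not in test (hypothetical values)
train_toy = pd.DataFrame({"x": [0.1, 0.2, 0.3], "cat": [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({"x": [0.4, 0.5], "cat": [1.0, 1.0]})

train_dummies = pd.get_dummies(train_toy, columns=["cat"])
test_dummies = pd.get_dummies(test_toy, columns=["cat"])
print(train_dummies.shape, test_dummies.shape)  # the column counts differ

# Reindex the test frame to the training columns;
# categories unseen in the test data become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.shape)
```

An alternative with the same effect is scikit-learn's `OneHotEncoder`, fitted on the training data and applied to the test data.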


### Saving wrangled data to new .csv file

In [34]:
# Converting the data into numpy arrays
X_train = np.asarray(X_train, dtype=np.float64)
X_test = np.asarray(X_test, dtype=np.float64)
y_train = np.asarray(y_train, dtype=np.float64)
y_test = np.asarray(y_test, dtype=np.float64)

# Saving the preprocessed data to csv files
np.savetxt('../data/case1Data_Xtrain.csv', X_train, delimiter=',')
np.savetxt('../data/case1Data_Xtest.csv', X_test, delimiter=',')
np.savetxt('../data/case1Data_ytrain.csv', y_train, delimiter=',')
np.savetxt('../data/case1Data_ytest.csv', y_test, delimiter=',')
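
Note that `np.savetxt` writes no header row by default, so the saved files can be read back without `skiprows=1`, unlike the original data files. A minimal round-trip sketch, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, -2.0],
                [0.1, 3.25]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name
    np.savetxt(path, arr, delimiter=",")
    # No skiprows needed: np.savetxt did not write a header line
    restored = np.loadtxt(path, delimiter=",")

print(restored.shape)
```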

### Summary of the data

In [35]:
# Printing the shape of the data
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

# Size of the training and test data
n_train = X_train.shape[0]
n_test = X_test.shape[0]
p = X_train.shape[1]

# Printing the size of the training and test data
print("n_train: ", n_train) # number of training samples
print("n_test: ", n_test) # number of test samples
print("p: ", p) # number of features/variables/columns/parameters

# Checking for missing values in the wrangled data
missing_values_X_train = np.isnan(X_train)
print("Number of missing values in X_train: ", np.sum(missing_values_X_train))
missing_values_X_test = np.isnan(X_test)
print("Number of missing values in X_test: ", np.sum(missing_values_X_test))
missing_values_y_train = np.isnan(y_train)
print("Number of missing values in y_train: ", np.sum(missing_values_y_train))
missing_values_y_test = np.isnan(y_test)
print("Number of missing values in y_test: ", np.sum(missing_values_y_test))

X_train:  (80, 116)
X_test:  (20, 116)
y_train:  (80,)
y_test:  (20,)
n_train:  80
n_test:  20
p:  116
Number of missing values in X_train:  0
Number of missing values in X_test:  0
Number of missing values in y_train:  0
Number of missing values in y_test:  0


## 4. Data Wrangling for Nested CV (Without Splitting, Standardizing, and Handling Missing Values)

In [3]:
# Loading the data into a numpy array
data = np.loadtxt('../data/case1Data.csv', delimiter=',', skiprows=1)

# Splitting the data into features (X) and target (y)
X = data[:, 1:] # All columns except the first one
y = data[:, 0] # First column
print("X: ", X.shape)
print("y: ", y.shape)

## Using StandardScaler from scikit-learn to standardize the data
#scaler = StandardScaler()
#
## Converting X into a pandas dataframe
#X = pd.DataFrame(X)
#
## Standardizing the numerical features (all columns except the last five)
#X.iloc[:, :-5] = scaler.fit_transform(X.iloc[:, :-5])
#
## Using KNNImputer from scikit-learn to impute the missing values in the data (for continuous variables) with the mean of the k-nearest neighbors (k=5)
#continuous_imputer = KNNImputer(n_neighbors=5, missing_values=np.nan)
#X.iloc[:, :-5] = pd.DataFrame(continuous_imputer.fit_transform(X.iloc[:, :-5]))
#
## Mode Imputation: Using SimpleImputer from scikit-learn to impute the missing values in the data (for categorical variables) with the most frequent value
#categorical_imputer = SimpleImputer(strategy='most_frequent')
#X.iloc[:, -5:] = categorical_imputer.fit_transform(X.iloc[:, -5:])
#
## One-hot encoding the categorical variables using get_dummies from pandas library (for the last five columns)
#X = pd.get_dummies(X, columns=X.columns[-5:])
#
## Printing the shape of the data to check if the one-hot encoding worked and only performed once
#print("Number of columns in X should be 95 continuous columns + 21 dummy columns (one per category in the last 5 columns): 116")
#print("X: ", X.shape)
#print("If the number of columns is >116, one-hot encoding was performed more than once and should be fixed by re-running the code from the beginning.")

# Converting the data into numpy arrays
X = np.asarray(X, dtype=np.float64)

# Saving the features and target to csv files (no preprocessing applied, since the steps above are disabled)
np.savetxt('../data/case1Data_X.csv', X, delimiter=',')
np.savetxt('../data/case1Data_y.csv', y, delimiter=',')

X:  (100, 100)
y:  (100,)
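
The data is saved unprocessed here, presumably so that standardization, imputation, and encoding can be fitted inside each cross-validation fold. The notebook does not show how that is done; the following is only a sketch of one way to do it with a scikit-learn `Pipeline` and `ColumnTransformer`. The synthetic data, the column counts, and the `Ridge` regressor are all assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the raw data: 6 continuous columns followed by 2 categorical ones
n_samples, n_cont, n_cat = 60, 6, 2
X_cont = rng.normal(size=(n_samples, n_cont))
X_cat = rng.integers(1, 4, size=(n_samples, n_cat)).astype(float)
X_toy = np.hstack([X_cont, X_cat])
y_toy = X_cont[:, 0] - 2.0 * X_cont[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Knock out roughly 10% of the entries to mimic missing values
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

cont_idx = list(range(n_cont))
cat_idx = list(range(n_cont, n_cont + n_cat))

# Same steps as in Section 3, but wrapped so they are refitted on each training fold
preprocess = ColumnTransformer([
    ("continuous", Pipeline([("scale", StandardScaler()),
                             ("impute", KNNImputer(n_neighbors=5))]), cont_idx),
    ("categorical", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                              ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_idx),
])

# Ridge is a placeholder model chosen for this sketch
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="neg_mean_squared_error")
print(scores.shape)
```

Because the preprocessing lives inside the pipeline, no statistics from a validation fold leak into the scaler or the imputers.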


## 5. Wrangling case1Data_Xnew.csv

In [42]:
# Loading the new data into a numpy array
X_new = pd.DataFrame(np.loadtxt('../data/case1Data_Xnew.csv', delimiter=',', skiprows=1))
print("X_new: ", X_new.shape)

# Using StandardScaler from scikit-learn to standardize the data
# Note: a new scaler is fitted on X_new itself, so the statistics learned from the training data in Section 3 are not reused
scaler = StandardScaler()

# Standardizing the numerical features (all columns except the last five)
X_new.iloc[:, :-5] = scaler.fit_transform(X_new.iloc[:, :-5])

# Using KNNImputer from scikit-learn to impute the missing values in the data (for continuous variables) with the mean of the k-nearest neighbors (k=5)
# Note: continuous_imputer (created in Section 3) is refitted on X_new here
X_new.iloc[:, :-5] = pd.DataFrame(continuous_imputer.fit_transform(X_new.iloc[:, :-5]))

# Mode Imputation: Using SimpleImputer from scikit-learn to impute the missing values in the data (for categorical variables) with the most frequent value
# Note: categorical_imputer (created in Section 3) is also refitted on X_new here
X_new.iloc[:, -5:] = categorical_imputer.fit_transform(X_new.iloc[:, -5:])

# One-hot encoding the categorical variables using get_dummies from pandas library (for the last five columns)
X_new = pd.get_dummies(X_new, columns=X_new.columns[-5:])

# Printing the shape of the data to check if the one-hot encoding worked and only performed once
print("Number of columns in X_new should be 95 continuous columns + 21 dummy columns (one per category in the last 5 columns): 116")
print("X_new: ", X_new.shape)
print("If the number of columns is >116, one-hot encoding was performed more than once and should be fixed by re-running the code from the beginning.")

# Converting the data into numpy arrays
X_new = np.asarray(X_new, dtype=np.float64)

# Saving the preprocessed data to a csv file
np.savetxt('../data/case1Data_Xnew_wrangled.csv', X_new, delimiter=',')
print("X_new: ", X_new.shape)

X_new:  (1000, 100)
Number of columns in X_new should be 95 continuous columns + 21 dummy columns (one per category in the last 5 columns): 116
X_new:  (1000, 116)
If the number of columns is >116, one-hot encoding was performed more than once and should be fixed by re-running the code from the beginning.
X_new:  (1000, 116)
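
The cell above fits the scaler and the imputers on `X_new` itself. Whether that is intended is not stated in the notebook. The sketch below, on hypothetical toy values, only illustrates how the two options differ for standardization: reusing a scaler fitted on training data versus fitting a fresh one on the new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the new samples are shifted relative to the training samples (hypothetical values)
train_toy = np.array([[0.0], [2.0], [4.0]])
new_toy = np.array([[10.0], [12.0], [14.0]])

# Option A: reuse the statistics learned from the training data
reused = StandardScaler().fit(train_toy).transform(new_toy)

# Option B: fit a fresh scaler on the new data (what the cell above does)
refit = StandardScaler().fit_transform(new_toy)

print(reused.ravel())  # the shift relative to the training data is preserved
print(refit.ravel())   # centered on zero, so the shift is removed
```

A model trained on features scaled with the training statistics would normally expect Option A at prediction time; Option B can be reasonable if the new data is known to come from the same distribution and is much larger than the training set.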


## 6. Feature Extraction

In [None]:
## Calculating the variance of the features in the training data
#var_X_train = np.var(X_train, axis=0)
#print("Variance of the features in the training data: ", var_X_train)
#
## Calculating the covariance matrix of the features and the target in the training data
## (the feature-target covariances are in the last row/column of the result)
#cov_X_train_y_train = np.cov(X_train.T, y_train)
#print("Covariance between the features and the target in the training data: ", cov_X_train_y_train)
#
## Removing all features with variance of 0.2 or below
#X_train = X_train[:, var_X_train > 0.2]
#
## Calculating the variance of the features in the training data
#var_X_train = np.var(X_train, axis=0)
#print("Variance of the features in the training data: ", var_X_train)
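
The manual variance filter above corresponds to scikit-learn's `VarianceThreshold`, which keeps the features whose training-set variance exceeds the threshold. A minimal sketch on hypothetical toy data, with the threshold of 0.2 taken from the commented-out code:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: a constant column, a high-variance column, and a binary column (hypothetical values)
X_toy = np.array([[0.0, 1.0, 0.0],
                  [0.0, 2.0, 1.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 4.0, 1.0]])

# Manual filter, as in the commented-out cell
manual_mask = np.var(X_toy, axis=0) > 0.2

# Equivalent filter with scikit-learn
vt = VarianceThreshold(threshold=0.2)
X_reduced = vt.fit_transform(X_toy)

print(manual_mask)
print(vt.get_support())
print(X_reduced.shape)
```

The fitted selector can then be applied to the test data with `vt.transform`, so that both sets keep the same columns.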

In [None]:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

lasso = LassoCV(cv=3, random_state=42) # LassoCV selects the regularization strength by 3-fold cross-validation
lasso.fit(X_train, y_train) # Fit the model on the training data

# Select important features (coefficients that are effectively nonzero, i.e. above SelectFromModel's default threshold of 1e-5 for Lasso models)
selector = SelectFromModel(lasso, prefit=True) # Use the already fitted Lasso model to select features
selected_features_mask = selector.get_support()
selected_features = np.where(selected_features_mask)[0]  # Get indices of selected features

print("Selected Feature Indices:", selected_features)

Selected Feature Indices: [  6   8   9  10  12  27  28  29  31  34  35  40  42  44  48  50  53  54
  56  58  61  63  67  72  74  75  77  81  82  85 112]
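
Once the mask is available, the same columns have to be kept in both the training and the test matrix. The sketch below shows this on synthetic data (the arrays `X_tr`, `X_te`, and `y_tr` are hypothetical stand-ins, with only features 0 and 5 carrying signal):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)

# Synthetic stand-in data: 20 features, of which only features 0 and 5 drive the target
X_tr = rng.normal(size=(80, 20))
X_te = rng.normal(size=(20, 20))
y_tr = 3.0 * X_tr[:, 0] - 2.0 * X_tr[:, 5] + rng.normal(scale=0.1, size=80)

lasso_toy = LassoCV(cv=3, random_state=42).fit(X_tr, y_tr)
selector_toy = SelectFromModel(lasso_toy, prefit=True)
mask = selector_toy.get_support()

# Apply the mask learned on the training data to both matrices
X_tr_sel = X_tr[:, mask]
X_te_sel = X_te[:, mask]

print(np.flatnonzero(mask))
print(X_tr_sel.shape, X_te_sel.shape)
```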


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

def compute_bic(X, y, model):
    """Compute Bayesian Information Criterion (BIC) for a given model."""
    n, k = X.shape  # n = samples, k = features
    model.fit(X, y)
    residuals = y - model.predict(X)
    rss = np.sum(residuals**2)  # Residual sum of squares
    sigma2 = rss / n  # Estimated variance
    bic = n * np.log(sigma2) + k * np.log(n)  # BIC formula
    return bic

def stepwise_bic_selection(X, y):
    """Greedy forward selection based on BIC."""
    n_features = X.shape[1]
    selected_features = []  # Start with an empty set
    best_bic = np.inf
    model = LinearRegression()

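    while True:
        bic_scores = []
        candidates = [i for i in range(n_features) if i not in selected_features]

        # Stop once every feature has been selected (avoids indexing an empty score list)
        if not candidates:
            break

        # Try adding each remaining feature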
        for feature in candidates:
            current_features = selected_features + [feature]
            bic = compute_bic(X[:, current_features], y, model)
            bic_scores.append((feature, bic))

        # Select the feature that minimizes BIC
        bic_scores.sort(key=lambda x: x[1])
        best_new_feature, new_bic = bic_scores[0]

        # Stop if BIC does not improve
        if new_bic >= best_bic:
            break

        # Otherwise, update the best BIC and selected features
        best_bic = new_bic
        selected_features.append(best_new_feature)

    return selected_features, best_bic

# Run BIC feature selection
best_features, best_bic = stepwise_bic_selection(X_train, y_train)

print("Selected Feature Indices:", best_features)
print("Best BIC Score:", best_bic)


Selected Feature Indices: [31, 61, 35, 53, 67, 36, 81, 48, 9, 38, 77, 74, 104]
Best BIC Score: 492.75769252942996
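
For reference, the score computed by `compute_bic` is the Gaussian-error form of the BIC with additive constants dropped, where $n$ is the number of samples, $k$ the number of selected features, and $\hat{y}_i$ the fitted values:

```latex
\mathrm{BIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \ln n,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```

The intercept fitted by `LinearRegression` is not counted in $k$. Counting it would add the same $\ln n$ to every candidate model, so the ranking of candidates would not change.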
