##### 1. Text Vectorization Techniques

| Technique             | Description                                                                                   | Use Case                                                                                   |
|-----------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Bag of Words (BoW)| Converts text into a matrix of token counts.                                                   | Basic text classification tasks. |
| TF-IDF            | Weighs terms based on frequency and inverse document frequency.                               | Text classification, information retrieval. |
| Count Vectorizer  | scikit-learn's `CountVectorizer`, an implementation of BoW that tokenizes text and counts token occurrences. | Tasks focusing on word frequency. |
| Word Embeddings   | Represents words in a continuous vector space.                                                | Sentiment analysis, named entity recognition. |
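| BERT              | Produces contextual embeddings from a pretrained transformer model.                           | Tasks where word meaning depends on context (e.g., question answering, sentiment analysis). |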
| Document Embeddings| Represents entire documents in vector space.                                                 | Document classification, clustering. |

Sources: [Vectorization Techniques in NLP - GeeksforGeeks](https://www.geeksforgeeks.org/vectorization-techniques-in-nlp/?utm_source=chatgpt.com), [Vectorization Techniques in NLP [Guide]](https://neptune.ai/blog/vectorization-techniques-in-nlp-guide?utm_source=chatgpt.com)
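
To make the first two rows of the table concrete, here is a minimal sketch of Bag of Words and TF-IDF in plain Python. The toy corpus is made up, and the weighting uses the textbook form tf × log(N / df). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the car is red",
    "the car is fast",
    "the bike is red",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
counts = [Counter(doc) for doc in tokenized]

# Bag of Words: one row per document, one column per vocabulary term.
bow = [[c[term] for term in vocab] for c in counts]

# Textbook TF-IDF: term frequency times log(N / document frequency).
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
tfidf = [
    [(c[term] / len(doc)) * math.log(n_docs / doc_freq[term]) for term in vocab]
    for c, doc in zip(counts, tokenized)
]

print(vocab)
print(bow[0])
print([round(v, 3) for v in tfidf[0]])
```

Terms that appear in every document (`is`, `the`) get a weight of zero, which is the effect the IDF factor is meant to have.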

##### 2. Scaling Techniques

| Technique              | Description                                                                                   | Use Case                                                                                   |
|------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| StandardScaler     | Standardizes features by removing the mean and scaling to unit variance.                      | Models sensitive to feature scaling (e.g., SVM, KNN). |
| MinMaxScaler       | Scales features to a given range, typically [0, 1].                                           | Neural networks, algorithms requiring bounded data. |
| RobustScaler       | Scales features using statistics that are robust to outliers.                                 | Datasets with outliers. |
| QuantileTransformer| Transforms features to follow a uniform or normal distribution.                               | When normality is required. |
| Normalizer          | Scales individual samples to have unit norm.                                                  | Text classification, clustering. |

Sources: [The choice of scaling technique matters for classification performance](https://arxiv.org/abs/2212.12343?utm_source=chatgpt.com), [Techniques of Feature Selection in Machine Learning | by Srivignesh Rajan | Analytics Vidhya | Medium](https://medium.com/analytics-vidhya/an-introduction-to-feature-selection-in-machine-learning-9d6f2d5e47?utm_source=chatgpt.com)
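
The arithmetic behind the first three scalers can be sketched with the standard library alone. The function names and the `prices` values below are hypothetical. The formulas are meant to mirror what `StandardScaler`, `MinMaxScaler`, and `RobustScaler` do with their default settings, not to replace them.

```python
import statistics


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature with one large outlier.
prices = [1.0, 2.0, 3.0, 4.0, 100.0]

print([round(v, 3) for v in standard_scale(prices)])
print([round(v, 3) for v in min_max_scale(prices)])
print(robust_scale(prices))
```

With the outlier present, min-max scaling squeezes the four ordinary values into a narrow band near zero, while the median/IQR version keeps them spread out. That is the reason the table recommends `RobustScaler` for datasets with outliers.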

##### 3. Feature Selection Techniques

| Technique                     | Description                                                                                   | Use Case                                                                                   |
|-------------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Filter Methods            | Select features based on statistical tests.                                                   | High-dimensional datasets. |
| Wrapper Methods           | Evaluate feature subsets using model performance.                                             | When model accuracy is critical. |
| Embedded Methods          | Perform feature selection during model training.                                              | Lasso regression, decision trees. |
| Recursive Feature Elimination (RFE)| Iteratively removes features and builds a model on the remaining attributes.           | When reducing model complexity is desired. |
| Principal Component Analysis (PCA)| Reduces dimensionality by projecting features onto linearly uncorrelated components. Strictly a feature extraction method rather than feature selection, since each component combines the original features. | When dealing with multicollinearity. |

Sources: [Feature Selection Techniques in Machine Learning | GeeksforGeeks](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/?utm_source=chatgpt.com), [Feature Selection Techniques in Machine Learning](https://www.analyticsvidhya.com/blog/2021/06/feature-selection-techniques-in-machine-learning-2/?utm_source=chatgpt.com), [Techniques of Feature Selection in Machine Learning | by Srivignesh Rajan | Analytics Vidhya | Medium](https://medium.com/analytics-vidhya/an-introduction-to-feature-selection-in-machine-learning-9d6f2d5e47?utm_source=chatgpt.com)
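
As a rough illustration of a filter method, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps the top `k`. The helper names and the toy data are hypothetical. For a single feature the F-statistic used by `f_regression` grows with the squared correlation, so this ranking should agree with `SelectKBest(score_func=f_regression)` on the same data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_k_best(features, target, k):
    """Return the k feature names with the highest |correlation| and all scores."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: two informative features and one noisy one.
features = {
    "age_years": [1, 2, 3, 4, 5],
    "km_driven": [10, 25, 28, 45, 50],
    "noise": [3, 1, 4, 1, 5],
}
price = [9.0, 8.1, 6.9, 6.2, 5.0]

selected, scores = select_k_best(features, price, k=2)
print(selected)
print({name: round(score, 3) for name, score in scores.items()})
```

Because the score looks at one feature at a time, a filter like this is cheap on high-dimensional data but cannot detect features that only matter in combination.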

##### 4. Missing Value Imputation Techniques

| Technique                   | Description                                                                                   | Use Case                                                                                   |
|-----------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Mean/Median/Mode Imputation| Replaces missing values with the mean, median, or mode.                                     | Simple and quick imputation. |
| K-Nearest Neighbors (KNN) Imputation| Imputes missing values using the mean of the nearest neighbors.                       | When relationships between features exist. |
| Multivariate Imputation by Chained Equations (MICE)| Iterative imputation using other features.                                              | Complex datasets with missing values. |
| Regression Imputation   | Predicts missing values using a regression model based on other variables.                   | When a predictive relationship exists. |
| Multiple Imputation     | Creates multiple imputations to account for uncertainty.                                      | When dealing with uncertainty in missing data. |

Sources: [Missing Value Imputation: Filling in missing data points | Machine Learning Design Patterns | Software Patterns Lexicon](https://softwarepatternslexicon.com/machine-learning/data-management-patterns/data-preprocessing/missing-value-imputation/?utm_source=chatgpt.com), [Data Imputation Techniques in ML | GeeksforGeeks](https://www.geeksforgeeks.org/data-imputation-techniques-in-ml/?utm_source=chatgpt.com), [Missing Data Imputation Techniques in Machine Learning - Analytics Yogi](https://vitalflux.com/missing-data-imputation-machine-learning/?utm_source=chatgpt.com)
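
The simplest row of the table, mean/median/mode imputation, can be sketched in a few lines. The `impute` helper and the sample columns are hypothetical. The strategy names are borrowed from `SimpleImputer`, and missing values are represented as `None` here rather than NaN to keep the example dependency-free.

```python
import statistics


def impute(values, strategy="mean"):
    """Fill None entries with a statistic computed from the observed entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical columns with gaps.
mileage = [12.0, None, 15.0, 18.0, None, 90.0]
fuel = ["petrol", None, "diesel", "petrol"]

print(impute(mileage, "mean"))
print(impute(mileage, "median"))
print(impute(fuel, "most_frequent"))
```

The mean is pulled upward by the single large value while the median is not, so the median is usually the safer choice for skewed numeric columns.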

##### 5. Regression Algorithms

Basic Algorithms:

- Linear Regression: Predicts a dependent variable by fitting a linear relationship with independent variables.
- Ridge Regression: Linear regression with L2 regularization to prevent overfitting.
- Lasso Regression: Linear regression with L1 regularization, promoting sparsity.
- ElasticNet: Combines L1 and L2 regularization.

Advanced Algorithms:

- Decision Tree Regressor: Non-linear model that splits data into subsets based on feature values.
- Random Forest Regressor: Ensemble of decision trees to improve accuracy and control overfitting.
- Gradient Boosting Regressor: Builds models sequentially to correct errors made by previous models.
- Support Vector Regression (SVR): Uses support vector machines for regression tasks.
- XGBoost: Optimized, regularized implementation of gradient-boosted decision trees.
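
The difference between L2 and L1 regularization in the basic algorithms above is easiest to see with a single feature and no intercept, where both penalized problems have a closed-form solution. The sketch below is an illustration under those assumptions. The function names and data are hypothetical, and library implementations scale the loss and penalty terms differently, so the `alpha` values are not comparable to scikit-learn's.

```python
import math


def ols_weight(xs, ys):
    """Minimizes 0.5 * sum((y - w*x)^2)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


def ridge_weight(xs, ys, alpha):
    """Minimizes 0.5 * sum((y - w*x)^2) + 0.5 * alpha * w^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + alpha)


def lasso_weight(xs, ys, alpha):
    """Minimizes 0.5 * sum((y - w*x)^2) + alpha * |w| by soft-thresholding."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(sxy) - alpha, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 14.0, 30.0):
    print(alpha, ridge_weight(xs, ys, alpha), lasso_weight(xs, ys, alpha))
```

Ridge shrinks the weight toward zero but never reaches it, whereas Lasso sets it to exactly zero once `alpha` is large enough. That is what "promoting sparsity" means in the list above.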



In [35]:
import os
import mlflow
import inspect
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV, RFE
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Regressors
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor

In [36]:
df = pd.read_excel(r"D:\campusx_dsmp2\9. MLOps revisited\cars24_mlops_project\experiment\cars24_v3.xlsx")

##### Processing columns

In [37]:
# Assuming the dataframe is called df
df['year'] = pd.to_numeric(df['year'], errors='coerce')  # Convert 'year' to numeric

# Convert 'ownership' to numeric (e.g., 'Owned' -> 1, 'Leased' -> 0)
df['ownership'] = df['ownership'].map({'Owned': 1, 'Leased': 0})

# Convert boolean-like columns to numeric (1 for True, 0 for False)
bool_columns = [
    '360DegreeCamera', 'AlloyWheels', 'AppleCarplayAndroidAuto', 'Bluetooth',
    'CruiseControl', 'GpsNavigation', 'InfotainmentSystem', 'LeatherSeats',
    'ParkingAssist', 'PushButtonStart', 'RearAc', 'SpecialRegNo', 'Sunroof/Moonroof',
    'TopModel', 'Tpms', 'VentilatedSeats'
]

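# DataFrame.applymap is deprecated since pandas 2.1, so map each column instead.
# Caveat: this relies on Python truthiness, so NaN and non-empty strings such as "False" become 1.
df[bool_columns] = df[bool_columns].apply(lambda col: col.map(lambda x: 1 if x else 0))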

In [38]:
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'bool']).columns.tolist()

In [39]:
import dagshub
dagshub.init(repo_owner='iamprashantjain', repo_name='MLOps_UsedCarPricePrediction', mlflow=True)
mlflow.set_tracking_uri("https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow")
mlflow.set_experiment('All Techniques & Algorithms')

<Experiment: artifact_location='mlflow-artifacts:/04caa17676fa4b6498a319c6ed0548f2', creation_time=1745946826363, experiment_id='2', last_update_time=1745946826363, lifecycle_stage='active', name='All Techniques & Algorithms', tags={}>

In [None]:
# Define target and features
target_col = "listingPrice"
X = df.drop(columns=[target_col])
y = df[target_col]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify columns
numerical_cols = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object', 'bool']).columns.tolist()

# Preprocessing components
scaler = StandardScaler()
imputer = SimpleImputer(strategy='mean')
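# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
# on older versions use sparse=False instead.
vectorizer = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')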

# Define pipelines
numeric_pipeline = Pipeline([
    ('imputer', imputer),
    ('scaler', scaler)
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', vectorizer)
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# Fit the preprocessor once and transform
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Store feature names for selectors (optional)
try:
    preprocessed_feature_names = (
        preprocessor.named_transformers_['num']['scaler'].get_feature_names_out(numerical_cols).tolist() +
        preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out().tolist()
    )
except Exception:  # fall back to generic names if feature names are unavailable
    preprocessed_feature_names = [f"f{i}" for i in range(X_train_processed.shape[1])]

# Regressors
regression_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5)
}

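# Feature selectors
# Note: the loop below rebuilds RFECV around the model being evaluated,
# so the RandomForest-based RFECV here only supplies the "RFECV" key.
feature_selectors = {
    "SelectKBest": SelectKBest(score_func=f_regression, k=10),
    "RFECV": RFECV(estimator=RandomForestRegressor(n_jobs=-1), step=1, cv=2)
}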

# Start MLflow run
with mlflow.start_run(run_name="All_Pipeline_Experiments") as parent_run:
    mlflow.set_tag("experiment_type", "minimal_preprocessing_once")

    # Log the notebook
    try:
        mlflow.log_artifact(
            r"D:\campusx_dsmp2\9. MLOps revisited\cars24_mlops_project\experiment\5_Experiment2_AllTechniques.ipynb",
            artifact_path="source_code"
        )
    except Exception:
        print("⚠️ Could not log notebook file.")

    for model_name, model in regression_models.items():
        for selector_name, selector in feature_selectors.items():

            with mlflow.start_run(
                run_name=f"{model_name}_{selector_name}",
                nested=True
            ):
                mlflow.log_param("model", model_name)
                mlflow.log_param("scaler", "StandardScaler")
                mlflow.log_param("imputer", "SimpleMean")
                mlflow.log_param("vectorizer", "OHE")
                mlflow.log_param("feature_selector", selector_name)

                # Select features
                if selector_name == "RFECV":
                    feature_selector = RFECV(estimator=model, step=1, cv=3)
                else:
                    feature_selector = selector

                pipeline = Pipeline([
                    ('feature_selector', feature_selector),
                    ('model', model)
                ])

                try:
                    pipeline.fit(X_train_processed, y_train)
                    y_pred = pipeline.predict(X_test_processed)

                    # Metrics
                    mae = mean_absolute_error(y_test, y_pred)
                    mse = mean_squared_error(y_test, y_pred)
                    r2 = r2_score(y_test, y_pred)

                    mlflow.log_metric("MAE", mae)
                    mlflow.log_metric("MSE", mse)
                    mlflow.log_metric("R2", r2)

                    mlflow.sklearn.log_model(pipeline, artifact_path=f"{model_name}_pipeline")

                    # Save predictions
                    results_df = pd.DataFrame({
                        "Actual": y_test,
                        "Predicted": y_pred
                    })
                    results_file = f"{model_name}_{selector_name}_results.csv"
                    results_df.to_csv(results_file, index=False)
                    mlflow.log_artifact(results_file, artifact_path="predictions")
                    os.remove(results_file)

                    print(f"✅ {model_name} + {selector_name} | MAE: {mae:.2f}, R2: {r2:.2f}")
                except Exception as e:
                    print(f"❌ Error with {model_name} + {selector_name}: {e}")
                    mlflow.log_param("error", str(e)[:250])  # truncated: MLflow limits parameter value length



✅ LinearRegression + SelectKBest | MAE: 103072.10, R2: 0.95


2025/04/30 23:33:33 INFO mlflow.tracking._tracking_service.client: 🏃 View run LinearRegression_SelectKBest at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2/runs/aedcb00e80dc4842b0372695477cd193.
2025/04/30 23:33:33 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2.


✅ LinearRegression + RFECV | MAE: 342656.61, R2: 0.00


2025/05/01 04:37:06 INFO mlflow.tracking._tracking_service.client: 🏃 View run LinearRegression_RFECV at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2/runs/58eeff6e8d7d40e6a97c82d031922d01.
2025/05/01 04:37:06 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2.


✅ Ridge + SelectKBest | MAE: 108728.56, R2: 0.91


2025/05/01 04:37:20 INFO mlflow.tracking._tracking_service.client: 🏃 View run Ridge_SelectKBest at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2/runs/baa0600fe2a04efbae92d8e244095f1b.
2025/05/01 04:37:20 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2.


✅ Ridge + RFECV | MAE: 118156.36, R2: 0.89


2025/05/01 05:25:10 INFO mlflow.tracking._tracking_service.client: 🏃 View run Ridge_RFECV at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2/runs/ae1d0c81c2024c0d8cffd4c37f9a4aa7.
2025/05/01 05:25:10 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2.
  model = cd_fast.enet_coordinate_descent(


✅ Lasso + SelectKBest | MAE: 123339.99, R2: 0.84


2025/05/01 05:25:25 INFO mlflow.tracking._tracking_service.client: 🏃 View run Lasso_SelectKBest at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2/runs/cab6a1ff3b8c4b0f89bc51bdd93c5fea.
2025/05/01 05:25:25 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/2.
  model = cd_fast.enet_coordinate_descent(

KeyboardInterrupt: 
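
For reading the MAE and R2 figures in the output above, it can help to see the metrics written out by hand. The sketch below is a plain-Python version of the standard definitions, with hypothetical prices. The notebook itself uses the scikit-learn functions.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical listing prices and two sets of predictions.
actual = [300000.0, 500000.0, 700000.0, 900000.0]
close_fit = [310000.0, 480000.0, 720000.0, 890000.0]
mean_only = [600000.0] * 4  # always predict the mean of `actual`

print(mae(actual, close_fit), mse(actual, close_fit), r2(actual, close_fit))
print(mae(actual, mean_only), r2(actual, mean_only))
```

An R2 of 0.00 therefore means the model did no better on the test set than always predicting the mean price, while MAE stays in the same units as `listingPrice`.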