## **Problem Statement**

### **Business Context**

An automobile dealership in Las Vegas specializes in selling luxury and non-luxury vehicles. They cater to diverse customer preferences with varying vehicle specifications, such as mileage, engine capacity, and seating capacity. However, the dealership faces significant challenges in maintaining consistency and efficiency across its pricing strategy due to reliance on manual processes and disconnected systems. Pricing evaluations are prone to errors, updates are delayed, and scaling operations are difficult as demand grows. These inefficiencies impact revenue and customer trust. Recognizing the need for a reliable and scalable solution, the dealership is seeking to implement a unified system that ensures seamless integration of data-driven pricing decisions, adaptability to changing market conditions, and operational efficiency.

### **Objective**

The dealership has hired you as an MLOps Engineer to design and implement an MLOps pipeline that automates the pricing workflow. This pipeline will encompass data cleaning, preprocessing, transformation, model building, training, evaluation, and registration with CI/CD capabilities to ensure continuous integration and delivery. Your role is to overcome challenges such as integrating disparate data sources, maintaining consistent model performance, and enabling scalable, automated updates to meet evolving business needs. The expected outcomes are a robust, automated system that improves pricing accuracy, operational efficiency, and scalability, driving increased profitability and customer satisfaction.

### **Data Description**

The dataset contains attributes of used cars sold in various locations. These attributes serve as key data points for CarOnSell's pricing model. The detailed attributes are:

- **Segment:** Describes the category of the vehicle, indicating whether it is a luxury or non-luxury segment.

- **Kilometers_Driven:** The total number of kilometers the vehicle has been driven.

- **Mileage:** The fuel efficiency of the vehicle, measured in kilometers per liter (km/l).

- **Engine:** The engine capacity of the vehicle, measured in cubic centimeters (cc). 

- **Power:** The power of the vehicle's engine, measured in brake horsepower (BHP). 

- **Seats:** The number of seats in the vehicle, which can influence its classification, usage, and pricing based on customer needs.

- **Price:** The price of the vehicle, listed in lakhs (units of 100,000), representing the cost to the consumer for purchasing the vehicle.
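As a quick sanity check on the schema described above, the header of the CSV can be compared against the listed attributes before any job runs. The sketch below is illustrative only: it assumes the CSV column names match the attribute names in this list exactly, and the helper names `missing_columns` and `lakhs_to_units` are hypothetical, not part of the project code.

```python
import csv
import io
from typing import List

# Column names taken from the data description above (assumed to match the CSV header).
EXPECTED_COLUMNS = [
    "Segment", "Kilometers_Driven", "Mileage", "Engine", "Power", "Seats", "Price",
]

LAKH = 100_000  # one lakh is 100,000 currency units


def missing_columns(csv_text: str) -> List[str]:
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]


def lakhs_to_units(price_in_lakhs: float) -> float:
    """Convert a price expressed in lakhs to base currency units."""
    return price_in_lakhs * LAKH
```

An empty list from `missing_columns` means every described attribute is present in the header; `lakhs_to_units` makes the unit of the `Price` column explicit when reporting results.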

## **1. AzureML Environment Setup and Data Preparation**

### **1.1 Connect to Azure Machine Learning Workspace**

In [2]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

In [5]:
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r


accepted Terms of Service for https://repo.anaconda.com/pkgs/main
accepted Terms of Service for https://repo.anaconda.com/pkgs/r


In [7]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Initialization of the Credentials
credential = DefaultAzureCredential()

# Initializing the Client
ml_client = MLClient(
    credential=credential,
    subscription_id="77c91b3f-d78c-4832-8ed2-a5dd9c501e0e",
    resource_group_name="streaming_autovehicle_pricing_MLOPS",
    workspace_name="project_III_MLOPS",
)

print("✅ Successfully created MLClient for workspace:", ml_client.workspace_name)


✅ Successfully created MLClient for workspace: project_III_MLOPS


### **1.2 Set Up Compute Cluster**

In [11]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="Standard_DS11_v2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=1,
        # Seconds a node stays idle after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
    
)
print(f"Provisioning state: {cpu_cluster.provisioning_state}")


Creating a new cpu compute target...
AMLCompute with name cpu-cluster is created, the compute size is Standard_DS11_v2
Provisioning state: Succeeded


### **1.3 Register Dataset as Data Asset**

In [13]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Path to the local dataset
local_data_path = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv'

# Create the Data asset definition
data_asset = Data(
    path=local_data_path,
    type=AssetTypes.URI_FILE,
    description="A dataset of used cars for price prediction",
    name="used-cars-data",
    version="1"   
)

# Register the dataset in the workspace
registered_data_asset = ml_client.data.create_or_update(data_asset)

print(f"✅ Data asset registered: {registered_data_asset.name}, version: {registered_data_asset.version}")


Uploading used_cars.csv (< 1 MB): 0.00B [00:00, ?B/s]
Uploading used_cars.csv (< 1 MB): 100%|██████████| 9.47k/9.47k [00:00<00:00, 765kB/s]


✅ Data asset registered: used-cars-data, version: 1


In [15]:
data_asset = Data(
    path=local_data_path,
    type=AssetTypes.URI_FILE,
    description="A dataset of used cars for price prediction",
    name="used-cars-data",
    version="2"   
)
ml_client.data.create_or_update(data_asset)




Data({'path': 'azureml://subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_MLOPS/workspaces/project_III_MLOPS/datastores/workspaceblobstore/paths/LocalUpload/0b8e06a9f14bf45a52b1c21394f1cdf03017517cd48663b3e20a05882ff35cdd/used_cars.csv', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'used-cars-data', 'description': 'A dataset of used cars for price prediction', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_MLOPS/providers/Microsoft.MachineLearningServices/workspaces/project_III_MLOPS/data/used-cars-data/versions/2', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil', 'creation_context': <azure.ai.ml.ent

In [17]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Path to the local dataset
local_data_path = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv'

data_asset = Data(
    path=local_data_path,
    type=AssetTypes.URI_FILE,
    description="A dataset of used cars for price prediction",
    name="used-cars-data",
    version="3"   
)

registered_data_asset = ml_client.data.create_or_update(data_asset)
print(f"✅ Data asset registered: {registered_data_asset.name}, version: {registered_data_asset.version}")

# Show all data assets in the workspace
print("\n=== All data assets in this workspace ===")
for d in ml_client.data.list():
    print(f"- {d.name} | version: {d.version} | type: {d.type} | path: {d.path}")


✅ Data asset registered: used-cars-data, version: 3

=== All data assets in this workspace ===
- used-cars-data | version: None | type: uri_file | path: None


### **1.4 Create and Configure Job Environment**

In [18]:
# Create a directory for the preprocessing script
import os

src_dir_env = "./env"
os.makedirs(src_dir_env, exist_ok=True)

In [20]:
from azure.ai.ml.entities import Environment

# Path to train_conda.yml, prepared earlier. The path is absolute, so it is used
# directly (os.path.join discards src_dir_env when the second argument is absolute).
conda_file_path = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/environment/train_conda.yml"

job_env = Environment(
    name="used-cars-env",
    description="Environment for used cars pricing MLOps pipeline",
    conda_file=conda_file_path,
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
)

registered_env = ml_client.environments.create_or_update(job_env)
print(f"✅ Environment registered: {registered_env.name}, version: {registered_env.version}")


✅ Environment registered: used-cars-env, version: 1


In [7]:
# %%writefile {src_dir_env}/conda.yml
# name: sklearn-env
# channels:
#   - conda-forge
# dependencies:
#   - python=3.8
#   - pip=21.2.4
#   - scikit-learn=0.23.2
#   - scipy=1.7.1
#   - pip:  
#     - mlflow==2.8.1
#     - azureml-mlflow==1.51.0
#     - azureml-inference-server-http
#     - azureml-core==1.49.0
#     - cloudpickle==1.6.0

Overwriting ./env/conda.yml


In [23]:
from azure.ai.ml.entities import Environment

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/environment/train_conda.yml",
    name="machine_learning_E2E",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env_docker_conda)

Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': 'mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04', 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'machine_learning_E2E', 'description': 'Environment created from a Docker image plus Conda environment.', 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_MLOPS/providers/Microsoft.MachineLearningServices/workspaces/project_III_MLOPS/environments/machine_learning_E2E/versions/1', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7af06c425de0>, 'serialize': <msrest.serialization.Serializer object at 0x7af06c426200>, 'version': '1', 'conda

## **2. Model Development Workflow**

### **2.1 Data Preparation**

This **Data Preparation job** splits an input dataset into two parts: one for training the model and one for testing it. The script takes four arguments: the location of the input data (`raw_data`, here `used_cars.csv`), the test split ratio (`test_train_ratio`), and the output paths for the resulting training (`train_data`) and testing (`test_data`) sets. It reads the input CSV from a data asset URI, splits it using Scikit-learn's `train_test_split` function, and saves the two parts to the specified directories. It also logs the number of records in the training and testing datasets using MLflow.
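The core of such a preparation step can be sketched as follows. This is a minimal illustration, not the project's actual `prep.py`: the function name `split_and_save` and the fixed `seed` are assumptions, the output file names `train.csv` and `test.csv` follow the job outputs listed later in this notebook, and MLflow logging is left out so the snippet runs without a tracking server.

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(raw_data, train_dir, test_dir, test_train_ratio=0.2, seed=42):
    """Split the raw CSV into train/test parts and write one CSV per output folder."""
    df = pd.read_csv(raw_data)
    train_df, test_df = train_test_split(
        df, test_size=test_train_ratio, random_state=seed
    )

    # Output folders may not exist yet when the job starts.
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)
    train_df.to_csv(os.path.join(train_dir, "train.csv"), index=False)
    test_df.to_csv(os.path.join(test_dir, "test.csv"), index=False)

    # The job described above logs these counts to MLflow; here they are only returned.
    return len(train_df), len(test_df)
```

Calling `split_and_save(raw_csv_path, "./outputs/train", "./outputs/test", 0.2)` returns the row counts of the two parts, which is the same information the job reports after it finishes.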

In [31]:
# Testing of prep.py locally 
!python /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/src/prep.py \
  --raw_data /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv \
  --train_data ./outputs/train \
  --test_data ./outputs/test \
  --test_train_ratio 0.2


Raw data path: /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv
Train dataset output path: ./outputs/train
Test dataset path: ./outputs/test
Test-train ratio: 0.2
✅ Data preparation complete. Train rows: 160, Test rows: 40
🏃 View run strong_chin_b2tpmspq at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817/runs/f81d6475-21f2-4ba9-a600-3a194301c4fe
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817


In [32]:
!python /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/src/prep.py \
  --raw_data /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv \
  --train_data ./outputs/train \
  --test_data ./outputs/test \
  --test_train_ratio 0.2


Raw data path: /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv
Train dataset output path: ./outputs/train
Test dataset path: ./outputs/test
Test-train ratio: 0.2
✅ Data preparation complete. Train rows: 160, Test rows: 40
🏃 View run quirky_tangelo_vqz7yp8r at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817/runs/352c4200-b1f4-4b0d-84bd-f5416ae16e8b
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817


In [36]:
!python /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/src/train.py \
  --train_data ./outputs/train \
  --test_data ./outputs/test \
  --model_output ./outputs/model \
  --n_estimators 100 \
  --max_depth 10


Train dataset input path: ./outputs/train
Test dataset input path: ./outputs/test
Model output path: ./outputs/model
Number of Estimators: 100
Max Depth: 10
✅ Model trained. MSE on test set: 56.9609
✅ Model saved at ./outputs/model
🏃 View run ivory_nose_yvv3pt1c at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817/runs/72daaef9-e8f7-43f1-91da-5573fd1e2483
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817


In [37]:
!python /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/src/register.py \
  --model_name used_cars_price_prediction_model \
  --model_path ./outputs/model \
  --model_info_output_path ./outputs/model_info


Model name: used_cars_price_prediction_model
Model path: ./outputs/model
Model info output path: ./outputs/model_info
Registering model: used_cars_price_prediction_model
Successfully registered model 'used_cars_price_prediction_model'.
2025/10/24 16:45:28 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: used_cars_price_prediction_model, version 1
Created version '1' of model 'used_cars_price_prediction_model'.
Registered model 'used_cars_price_prediction_model' already exists. Creating a new version of this model...
2025/10/24 16:45:29 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: used_cars_price_prediction_model, version 2
Created version '2' of model 'used_cars_price_prediction_model'.
✅ Model registered: used_cars_price_prediction_model, version: 2
✅ Model info written to ./outputs/model_info/model_info.json
🏃 View run affa

In [46]:
# All datasets
for d in ml_client.data.list():
    print("DATA:", d.name, d.version, d.type)

# All environments
for e in ml_client.environments.list():
    print("ENV:", e.name, e.version)

# All compute targets
for c in ml_client.compute.list():
    print("COMPUTE:", c.name, c.type)


DATA: used-cars-data None uri_file
ENV: machine_learning_E2E None
ENV: used-cars-env None
ENV: AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu None
COMPUTE: LastProjectCompute computeinstance
COMPUTE: cpu-cluster amlcompute


In [50]:
# List the environments registered in the workspace
for env in ml_client.environments.list():
    print("ENV name:", env.name, "| version:", env.version)


ENV name: machine_learning_E2E | version: None
ENV name: used-cars-env | version: None
ENV name: AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu | version: None


In [61]:
for e in ml_client.environments.list(name="used-cars-env"):
    print("ENV:", e.name, "| version:", e.version)
env = ml_client.environments.get(name="used-cars-env", version="1")
print(env.id)


ENV: used-cars-env | version: 1
/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_MLOPS/providers/Microsoft.MachineLearningServices/workspaces/project_III_MLOPS/environments/used-cars-env/versions/1


In [59]:
from azure.ai.ml.entities import Environment

env = Environment(
    name="used-cars-env",
    version="1",
    conda_file="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/environment/train_conda.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
)

ml_client.environments.create_or_update(env)


Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': 'mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04', 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'used-cars-env', 'description': 'Environment for used cars pricing MLOps pipeline', 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_MLOPS/providers/Microsoft.MachineLearningServices/workspaces/project_III_MLOPS/environments/used-cars-env/versions/1', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7af018ba0790>, 'serialize': <msrest.serialization.Serializer object at 0x7af018258820>, 'version': '1', 'conda_file': {'channels': ['defaul

In [101]:
import os, uuid
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data, Environment
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# 1. Workspace connection
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="77c91b3f-d78c-4832-8ed2-a5dd9c501e0e",
    resource_group_name="streaming_autovehicle_pricing_MLOPS",
    workspace_name="project_III_MLOPS"
)

# 2. Generate a unique version
unique_version = uuid.uuid4().hex[:8]

# 3. Register the DATASET from the local CSV
local_file = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data/used_cars.csv"
assert os.path.isfile(local_file), "❌ Local CSV file not found."

data_asset = Data(
    name="used-cars-data-fixed",
    version=unique_version,
    type=AssetTypes.URI_FILE,
    path=local_file,   
    description="Used cars dataset uploaded via azure-ai-ml SDK"
)

data_registered = ml_client.data.create_or_update(data_asset)
print(f"✅ Dataset registered: {data_registered.name}:{data_registered.version}")
print(f"   URI: {data_registered.path}")

# 4. Register the ENVIRONMENT from train_conda.yml
conda_file = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/environment/train_conda.yml"
assert os.path.isfile(conda_file), "❌ train_conda.yml not found."

env = Environment(
    name="used-cars-env",
    version=unique_version,
    conda_file=conda_file,
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
)

env_registered = ml_client.environments.create_or_update(env)
print(f"✅ Environment registered: {env_registered.name}:{env_registered.version}")




Overriding of current TracerProvider is not allowed
Overriding of current LoggerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented


✅ Dataset registered: used-cars-data-fixed:a7d230f4
   URI: azureml://subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_MLOPS/workspaces/project_III_MLOPS/datastores/workspaceblobstore/paths/LocalUpload/0b8e06a9f14bf45a52b1c21394f1cdf03017517cd48663b3e20a05882ff35cdd/used_cars.csv
✅ Environment registered: used-cars-env:a7d230f4


In [22]:

from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())


# Loading pipeline job from YAML file
pipeline_job = load_job("/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/mlops/azureml/train/newpipeline.yml")

# Starting the pipeline
returned_job = ml_client.jobs.create_or_update(pipeline_job)

# Printing the link to Azure ML Studio for monitoring
print("✅ Pipeline submitted successfully!")
print(f"🔗 Studio URL: {returned_job.studio_url}")



Found the config file in: ./.azureml/config.json
Overriding of current TracerProvider is not allowed
Overriding of current LoggerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Clas

✅ Pipeline submitted successfully!
🔗 Studio URL: https://ml.azure.com/runs/sleepy_brain_vcff7cn2l4?wsid=/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_MLOPS/workspaces/project_III_MLOPS&tid=3f211132-3351-46c8-ba33-39c5bcff66b3


#### **Define Data Preparation job**

For this AzureML job, we define the `command` object that takes input files and output directories, then executes the script with the provided inputs and outputs. The job runs in a pre-configured AzureML environment with the necessary libraries. The result will be two separate datasets for training and testing, ready for use in subsequent steps of the machine learning pipeline.

In [1]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command, Input
import os, glob, json

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

raw_input = Input(
    path="azureml:used-cars-data:5",
    mode="download",
    type="uri_file"
)

cmd = command(
    display_name="prep-data-final",
    description="Final data prep job with local output and diagnostics",
    command="python prepare.py --raw_data ${{inputs.raw_data}}",
    environment="azureml:used-cars-env:1",
    code="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/data-science/src",
    compute="cpu-cluster",
    inputs={"raw_data": raw_input}
)

job = ml_client.jobs.create_or_update(cmd)
print("✅ Submitted job:", job.name)
ml_client.jobs.stream(job.name)

# Download and inspect results
out_dir = f"outputs_{job.name}"
os.makedirs(out_dir, exist_ok=True)
ml_client.jobs.download(name=job.name, download_path=out_dir)

print("\n📂 Files found:")
for p in sorted(glob.glob(out_dir + "/**/*", recursive=True)):
    print(" -", p)

diag = glob.glob(out_dir + "/**/prep_diagnostics.json", recursive=True)
if diag:
    print("\n📋 Diagnostics:")
    # Use a context manager so the file handle is closed after reading
    with open(diag[0], "r", encoding="utf-8") as f:
        print(json.dumps(json.load(f), indent=2))
else:
    print("\n⚠️ No prep_diagnostics.json found.")

for csv in glob.glob(out_dir + "/**/*.csv", recursive=True):
    print(f"\n📄 Preview of {os.path.basename(csv)}:")
    with open(csv, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            print(line.strip())
            if i >= 4: break


Found the config file in: ./.azureml/config.json
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more in

✅ Submitted job: salmon_cow_8yqvjnb8j5
RunId: salmon_cow_8yqvjnb8j5
Web View: https://ml.azure.com/runs/salmon_cow_8yqvjnb8j5?wsid=/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_MLOPS/workspaces/project_III_MLOPS

Execution Summary
RunId: salmon_cow_8yqvjnb8j5
Web View: https://ml.azure.com/runs/salmon_cow_8yqvjnb8j5?wsid=/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_MLOPS/workspaces/project_III_MLOPS


📂 Files found:
 - outputs_salmon_cow_8yqvjnb8j5/artifacts
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/prep_diagnostics.json
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/test
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/test/test.csv
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/train
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/train/train.csv
 - outputs_salmon_cow_8yqvjnb8j5/artifacts/system_logs
 - ou

### **2.1.1 Create a Git Repository and Copy the Project into It**

In [20]:
from git import Repo

repo_path = "/home/azureuser/cloudfiles/code/Users/kenderov.emil"
repo = Repo.init(repo_path)


In [21]:
repo.index.add(["src/train.py", "notebooks/train_model.py", "README.md", ".gitignore"])
repo.index.commit("Initial commit: clean structure and training scripts")


<git.Commit "bfadcc2aae6aaa5f6da394455e3732198ab01b74">

In [24]:
with open("/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/secrets/github_token.txt", "r") as f:
    token = f.read().strip()
url = f"https://{token}@github.com/kenderovemil/mlops-used-cars-lastproject3.git"

# Connect remote
if 'origin' not in [remote.name for remote in repo.remotes]:
    repo.create_remote('origin', url)
else:
    repo.remote('origin').set_url(url)

# Rename the current branch to main and push
repo.git.branch('-M', 'main')
repo.git.push("origin", "main")




''

In [26]:
repo.index.add(["azureml_jobs/train_job.py", "azureml_jobs/train_model.py"])
repo.index.commit("Adding training job py files")

repo.git.push("origin", "main")

''

In [28]:
repo.index.add("data/used_cars.csv")
repo.index.commit("Adding dataset file")

repo.git.push("origin", "main")


''

In [29]:
repo.index.add(["data-science/components/prep_job.yml", "data-science/components/prep_component.yml"])
repo.index.commit("Adding yml components")

repo.git.push("origin", "main")

''

In [30]:
repo.index.add(["data-science/environment/train_conda.yml", "data-science/src/prep.py", "data-science/src/prepare.py", "data-science/src/register.py", "data-science/src/train.py"])
repo.index.commit("Adding environment yml file and Python files for the jobs")
repo.git.push("origin", "main")


''

In [32]:
repo.index.add(["mlops/azureml/train/data.yml", "mlops/azureml/train/newpipeline.yml", "mlops/azureml/train/prep.yml", "mlops/azureml/train/register.yml", "mlops/azureml/train/train-env.yml","mlops/azureml/train/train.yml"])
repo.index.commit("Adding jobs yml files")
repo.git.push("origin", "main")





''

In [33]:
repo.index.add("model_training/train_model.py")
repo.index.commit("Adding Python file for the training of the model")
repo.git.push("origin", "main")

''

In [35]:
repo.index.add(["outputs/train_diagnostics.json", "outputs/model/conda.yaml", "outputs/model/MLmodel", 
"outputs/model/model.pkl", 
"outputs/model/python_env.yaml", 
"outputs/model/requirements.txt", "outputs/model_info/model_info.json", "outputs/test/test.csv","outputs/train/train.csv"])
repo.index.commit("Adding a folder for the output files of the jobs ")
repo.git.push("origin", "main")





''

In [37]:
import os

def collect_all_files(root_dir):
    all_files = []
    for dirpath, _, filenames in os.walk(root_dir):
        for file in filenames:
            full_path = os.path.join(dirpath, file)
            rel_path = os.path.relpath(full_path, start=repo.working_tree_dir)
            all_files.append(rel_path)
    return all_files


In [38]:
files_to_add = collect_all_files("kenderov.emil/outputs_cool_market_ptblwpcn61")
repo.index.add(files_to_add)
repo.index.commit("Add outputs from outputs_cool_market_ptblwpcn61 job")
repo.git.push("origin", "main")

''

In [None]:
files_to_add = collect_all_files("Users/kenderov.emil/outputs_olden_feast_0jszpw23gh")
repo.index.add(files_to_add)
repo.index.commit("Add outputs from outputs_olden_feast_0jszpw23gh job")
repo.git.push("origin", "main")

In [39]:

files_to_add = collect_all_files("Users/kenderov.emil")
repo.index.add(files_to_add)
repo.index.commit("Add project notebook file ")
repo.git.push("origin", "main")

''

### **2.2 Training the Model**

This Model Training job is designed to train a **Random Forest Regressor** on the dataset that was split into training and testing sets in the previous data preparation job. This job script accepts five inputs: the path to the training data (`train_data`), the path to the testing data (`test_data`), the number of trees in the forest (`n_estimators`, with a default value of 100), the maximum depth of the trees (`max_depth`, which is set to None by default), and the path to save the trained model (`model_output`).

The script begins by reading the training and testing data files, then separates the features (`X`) from the target label (`y`). A Random Forest Regressor is initialized with the given `n_estimators` and `max_depth` and fitted on the training data. The model's performance on the test set is measured with the **Mean Squared Error (MSE)**, which is logged to MLflow. Finally, the trained model is saved to the specified output location as an MLflow model, and the job completes by logging the final MSE and ending the MLflow run.
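As a minimal illustration of the metric named above, the sketch below computes the MSE, and its square root (RMSE), by hand in plain Python. The function names and sample numbers are hypothetical and are not taken from the job script; the notebook itself relies on `sklearn.metrics.mean_squared_error`.

```python
import math
from typing import Sequence


def mean_squared_error_manual(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error_manual(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error_manual(y_true, y_pred))


# Hypothetical prices and predictions (illustrative values only)
y_true = [10.0, 12.0, 8.0, 15.0]
y_pred = [11.0, 10.0, 8.0, 18.0]

mse = mean_squared_error_manual(y_true, y_pred)  # (1 + 4 + 0 + 9) / 4 = 3.5
rmse = root_mean_squared_error_manual(y_true, y_pred)
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Because the MSE is in squared target units, the RMSE is often easier to read next to the prices themselves.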


In [44]:
import os
import pandas as pd

def load_all_csvs(root_dir, filename):
    collected = []
    for dirpath, _, filenames in os.walk(root_dir):
        for file in filenames:
            if file == filename:
                full_path = os.path.join(dirpath, file)
                try:
                    df = pd.read_csv(full_path)
                    collected.append(df)
                    print(f"Loaded: {full_path} ({df.shape})")
                except Exception as e:
                    print(f"Failed to load {full_path}: {e}")
    if collected:
        combined = pd.concat(collected, ignore_index=True)
        print(f"Total combined shape for {filename}: {combined.shape}")
        return combined
    else:
        print(f"No {filename} files found in {root_dir}")
        return None


In [46]:
def preprocess(df):
    df = df.copy()
    # Convert the categorical columns (one-hot encoding)
    df = pd.get_dummies(df, drop_first=True)
    return df


In [47]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import joblib
import os

# 1. Input parameters
train_data = "data/train.csv"
test_data = "data/test.csv"
n_estimators = 100
max_depth = None
model_output = "outputs/model"
metrics_output = "outputs/metrics"

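# 2. Data loading
# Note: load_all_csvs walks the whole working directory, so every copy of
# train.csv / test.csv it finds is concatenated (see the output below).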
train_df = load_all_csvs(".", "train.csv")
test_df = load_all_csvs(".", "test.csv")
train_df = preprocess(train_df)
test_df = preprocess(test_df)

# Align the test columns with the training columns (missing dummy columns are filled with 0)
train_df, test_df = train_df.align(test_df, join="left", axis=1, fill_value=0)


X_train = train_df.drop("price", axis=1)
y_train = train_df["price"]
X_test = test_df.drop("price", axis=1)
y_test = test_df["price"]


# 3. Training
model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
model.fit(X_train, y_train)

# 4. Evaluation
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)

# 5. Save the model
os.makedirs(model_output, exist_ok=True)
joblib.dump(model, os.path.join(model_output, "model.pkl"))

# 6. Save the metrics
os.makedirs(metrics_output, exist_ok=True)
with open(os.path.join(metrics_output, "mse.txt"), "w") as f:
    f.write(f"MSE: {mse}\n")

print(f"Training complete. MSE: {mse}")


Loaded: ./outputs/train/train.csv ((160, 7))
Loaded: ./outputs_cool_market_ptblwpcn61/artifacts/outputs/train/train.csv ((160, 7))
Loaded: ./outputs_olden_feast_0jszpw23gh/artifacts/outputs/train/train.csv ((160, 7))
Loaded: ./outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/train/train.csv ((160, 7))
Loaded: ./tmp_train/train.csv ((160, 7))
Total combined shape for train.csv: (800, 7)
Loaded: ./outputs/test/test.csv ((40, 7))
Loaded: ./outputs_cool_market_ptblwpcn61/artifacts/outputs/test/test.csv ((40, 7))
Loaded: ./outputs_olden_feast_0jszpw23gh/artifacts/outputs/test/test.csv ((40, 7))
Loaded: ./outputs_salmon_cow_8yqvjnb8j5/artifacts/outputs/test/test.csv ((40, 7))
Loaded: ./tmp_test/test.csv ((40, 7))
Total combined shape for test.csv: (200, 7)
Training complete. MSE: 113.63710734672493


In [71]:
from git import Repo

repo_path = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo"

# Read the GitHub token from a file
with open("/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/secrets/github_token.txt", "r") as f:
    token = f.read().strip()

url = f"https://{token}@github.com/kenderovemil/mlops-used-cars-lastproject3.git"

repo = Repo(repo_path)

if 'origin' not in [remote.name for remote in repo.remotes]:
    repo.create_remote('origin', url)
else:
    repo.remote('origin').set_url(url)

repo.git.push("origin", "main", "--force")




''

In [81]:
import os
from git import Repo

repo_path = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo"
repo = Repo(repo_path)

# Change the working folder of Python to the repo
os.chdir(repo_path)

# 1. Find all .ipynb files
notebooks = []
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".ipynb"):
            notebooks.append(os.path.join(root, file))

# 2. Collect the artifacts to add
files_to_add = notebooks + [
    "outputs/model/model.pkl",
    "outputs/metrics/mse.txt",
    ".gitignore"
]

# 3. Add, commit, and push
repo.index.add(files_to_add)
repo.index.commit("Add all notebooks, model, metrics, and .gitignore to exclude secrets")
repo.git.push("origin", "main")


''

#### **Define Model Training Job**

For this AzureML job, we define the `command` object that takes the paths to the training and testing data, the number of trees in the forest (`n_estimators`), and the maximum depth of the trees (`max_depth`) as inputs, and outputs the trained model. The command runs in a pre-configured AzureML environment with all the necessary libraries. The job produces a trained **Random Forest Regressor model**, which can be used for predicting the price of used cars based on the given attributes.
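In the Azure ML SDK v2, the `command` object receives the script invocation as a single string in which `${{inputs.<name>}}` and `${{outputs.<name>}}` placeholders are resolved at run time. The sketch below only assembles such a string with the standard library. The helper `build_job_command` is hypothetical, and the script name `train.py` is assumed from the files committed earlier in this notebook.

```python
def build_job_command(script: str, inputs: list[str], outputs: list[str]) -> str:
    """Assemble a command line that uses AzureML-style ${{...}} placeholders."""
    parts = ["python", script]
    for name in inputs:
        # Plain string concatenation keeps the double braces literal
        parts.append("--" + name + " ${{inputs." + name + "}}")
    for name in outputs:
        parts.append("--" + name + " ${{outputs." + name + "}}")
    return " ".join(parts)


train_command = build_job_command(
    script="train.py",  # assumed script name
    inputs=["train_data", "test_data", "n_estimators", "max_depth"],
    outputs=["model_output"],
)
print(train_command)
```

The resulting string could then be passed as the `command=` argument of a `command(...)` job, with the matching `inputs` and `outputs` dictionaries declared alongside it.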

In [89]:
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestRegressor

def train_model(train_data, test_data, n_estimators, max_depth, model_output):
    train_df = pd.read_csv(train_data)
    test_df = pd.read_csv(test_data)

    X_train = train_df.drop("price", axis=1)
    y_train = train_df["price"]

    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)

    joblib.dump(model, model_output)
    print(f"✅ Model trained and saved to: {model_output}")

# Use argparse only when not running inside Jupyter
def is_running_in_jupyter():
    try:
        get_ipython()
        return True
    except NameError:
        return False

if __name__ == "__main__" and not is_running_in_jupyter():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str)
    parser.add_argument("--test_data", type=str)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=None)
    parser.add_argument("--model_output", type=str)
    args = parser.parse_args()

    train_model(
        train_data=args.train_data,
        test_data=args.test_data,
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        model_output=args.model_output
    )



In [91]:
# Search for test.csv and train.csv

import os

def find_csv_files(root_dir, target_names=("train.csv", "test.csv")):
    found_files = {}
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in target_names:
            if name in filenames:
                full_path = os.path.join(dirpath, name)
                found_files[name] = full_path
    return found_files

project_root = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo"

found = find_csv_files(project_root)

if found:
    for name, path in found.items():
        print(f"✅ {name} found here:\n{path}\n")
else:
    print("⚠️ Neither train.csv nor test.csv was found.")


✅ test.csv found here:
/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo/outputs/test/test.csv

✅ train.csv found here:
/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo/outputs/train/train.csv



In [93]:
train_model(
    train_data="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo/outputs/train/train.csv",
    test_data="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo/outputs/test/test.csv",
    n_estimators=100,
    max_depth=10,
    model_output="outputs/model/model.pkl"
)



✅ Model trained and saved to: outputs/model/model.pkl


In [94]:
from git import Repo
import os

repo_path = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo"
repo = Repo(repo_path)
os.chdir(repo_path)

# Adding the model
repo.index.add(["outputs/model/model.pkl"])
repo.index.commit("Add trained Random Forest model to outputs/model/")
repo.git.push("origin", "main")

print("✅ model.pkl uploaded successfully to GitHub.")


✅ model.pkl uploaded successfully to GitHub.


In [102]:
!python /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/register_model_job_last/reg_model.py \
    --model outputs/model/model.pkl \
    --train_data /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo/outputs/train/train.csv


Registered model 'used_cars_price_prediction_model' already exists. Creating a new version of this model...
2025/10/31 19:12:00 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: used_cars_price_prediction_model, version 5
Created version '5' of model 'used_cars_price_prediction_model'.
✅ Model registered in MLflow as 'used_cars_price_prediction_model'
🏃 View run strong_van_4gksqpj6 at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817/runs/467e34a6-4cf9-4bb2-ba7a-86646fc2d894
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspac

In [119]:
repo.index.add([
    os.path.join(repo_path, "kenderov.emil/register_model_job_last/reg_model.py"),
    os.path.join(repo_path, "kenderov.emil/register_model_job_last/reg_model_output.txt")
])
repo.index.commit("Register model in MLflow (version 5) and archive output log")
repo.git.push("origin", "main")

print("✅ Both files uploaded to GitHub.")



✅ Both files uploaded to GitHub.


### **2.3 Registering the Best Trained Model**

The **Model Registration job** takes the best-trained model from the hyperparameter tuning sweep job and registers it in MLflow as a versioned artifact for future use in the used car price prediction pipeline. The job script accepts one input: the path to the trained model (`model`). It first loads the model with `mlflow.sklearn.load_model()`, then registers it in the MLflow model registry under a descriptive name (`used_cars_price_prediction_model`) with an artifact path (`random_forest_price_regressor`) where the model artifacts are stored. Logging the model with MLflow's `log_model()` function records it together with its metadata, so it remains trackable and retrievable for future evaluation, deployment, or retraining.
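Picking the "best" model comes down to choosing the candidate with the lowest error on held-out data. The following sketch shows that selection step in isolation, assuming each candidate has already been scored with an RMSE. The helper `select_best_model`, the paths, and the scores are hypothetical.

```python
import math


def select_best_model(rmse_by_path: dict[str, float]) -> tuple[str, float]:
    """Return the (path, rmse) pair with the lowest finite RMSE."""
    # Ignore candidates whose score is NaN or infinite (e.g. failed evaluations)
    finite = {path: rmse for path, rmse in rmse_by_path.items() if math.isfinite(rmse)}
    if not finite:
        raise ValueError("no candidate model has a finite RMSE")
    best_path = min(finite, key=finite.get)
    return best_path, finite[best_path]


# Hypothetical scores for three candidate model files
candidates = {
    "run_a/model.pkl": 7.55,
    "run_b/model.pkl": 8.41,
    "run_c/model.pkl": 4.15,
}
best_path, best_rmse = select_best_model(candidates)
print(f"Best model: {best_path} (RMSE {best_rmse:.2f})")
```

Only the winning model would then be handed to the registration step.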

In [19]:
import joblib
import pandas as pd
import os
import mlflow
import mlflow.sklearn
from sklearn.metrics import mean_squared_error
from mlflow.models.signature import infer_signature

# Paths to all models
model_paths = [
    "fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl",
    "fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl",
    "fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl",
    "fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl",
    "fresh_cleaned_repo/outputs/model/model.pkl",
    "outputs/model/model.pkl"
]

# Load and transform test.csv
test_df = pd.read_csv("outputs/test/test.csv")
X_test = test_df.drop("price", axis=1)
y_test = test_df["price"]
X_test_encoded = pd.get_dummies(X_test)

# Evaluation of all models
best_rmse = float("inf")
best_model = None
best_path = None
best_features = None

for path in model_paths:
    if os.path.isfile(path):
        try:
            model = joblib.load(path)
            model_features = getattr(model, "feature_names_in_", X_test_encoded.columns)
            X_test_aligned = X_test_encoded.reindex(columns=model_features, fill_value=0)
            preds = model.predict(X_test_aligned)
            # Take the square root of the MSE; the `squared=False` flag is deprecated in newer scikit-learn releases
            rmse = mean_squared_error(y_test, preds) ** 0.5
            print(f"📊 {path} → RMSE: {rmse:.2f}")
            if rmse < best_rmse:
                best_rmse = rmse
                best_model = model
                best_path = path
                best_features = model_features
        except Exception as e:
            print(f"⚠️ Error while evaluating {path}: {e}")

# Register the best model
if best_model is not None:
    train_df = pd.read_csv("outputs/train/train.csv")
    X_train = train_df.drop("price", axis=1)
    X_train_encoded = pd.get_dummies(X_train)
    X_train_aligned = X_train_encoded.reindex(columns=best_features, fill_value=0)

    signature = infer_signature(X_train_aligned, best_model.predict(X_train_aligned))

    with mlflow.start_run():
        mlflow.sklearn.log_model(
            sk_model=best_model,
            artifact_path="random_forest_price_regressor",
            registered_model_name="used_cars_price_prediction_model",
            signature=signature
        )
        print(f"\n✅ Best model registered successfully from: {best_path}")
else:
    print("⚠️ No model was found.")

# Create the output folder and save a summary of the result
os.makedirs("best_registred_model", exist_ok=True)

with open("best_registred_model/reg_model_output.txt", "w", encoding="utf-8") as f:
    f.write("✅ The best model was successfully registered.\n\n")
    f.write(f"📍 Model path: {best_path}\n")
    f.write(f"📊 RMSE over test.csv: {best_rmse:.2f}\n\n")
    f.write("📌 MLflow Model Name: used_cars_price_prediction_model\n")
    f.write("📌 Artifact Path: random_forest_price_regressor\n")
    f.write("📌 Version: 5\n\n")
    f.write("🧪 View experiment: AzureML Studio → Models → used_cars_price_prediction_model\n\n")
    f.write("🕊️ This is an act of recognition and conciliation. The model is entered in the project memory.\n")

print("📜 The reg_model_output.txt file was created in best_registred_model/")




📊 fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl → RMSE: 7.55
📊 fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl → RMSE: 7.55
📊 fresh_cleaned_repo/fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl → RMSE: 7.55
📊 fresh_cleaned_repo/fresh_cleaned_repo/outputs/model/model.pkl → RMSE: 7.55
📊 fresh_cleaned_repo/outputs/model/model.pkl → RMSE: 8.41
📊 outputs/model/model.pkl → RMSE: 4.15

✅ Best model registered successfully from: outputs/model/model.pkl
🏃 View run bright_stem_qknkrfqv at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourceGroups/streaming_autovehicle_pricing_mlops/providers/Microsoft.MachineLearningServices/workspaces/project_iii_mlops/#/experiments/15b2329d-4ee6-4fd1-95e7-f88908228817/runs/301e37fa-18ea-4bb9-9856-fd7dfb1b64c4
🧪 View experiment at: https://ea

In [23]:
from git import Repo
import os

# 🔐 Load the GitHub token from a file
with open("/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/secrets/github1.txt", "r") as token_file:
    github_token = token_file.read().strip()

# 📌 GitHub data
username = "kenderovemil"
repo_name = "mlops-used-cars-lastproject3"
file_path = "best_registred_model/reg_model_output.txt"

# 📂 Local path to the Git repo
repo_path = os.getcwd()
repo = Repo(repo_path)

# 🔄 Set the remote URL with the token
remote_url = f"https://{github_token}@github.com/{username}/{repo_name}.git"
origin = repo.remote(name="origin")
origin.set_url(remote_url)


# ✅ Add, commit, and push
repo.index.add([file_path])
repo.index.commit("📜 Backup of registration of the best model - RMSE 4.15")

# 🧭 Push, setting the upstream branch if none is tracked yet
branch = repo.active_branch
if branch.tracking_branch() is None:
    origin.push(refspec=f"{branch.name}:{branch.name}", set_upstream=True)
else:
    origin.push()

print("✅ The Git archive has been successfully refreshed through Python and a token from a file.")



✅ The Git archive has been successfully refreshed through Python and a token from a file.


#### **Define Model Register Job**

For this AzureML job, a `command` object is defined to execute the `model_register.py` script. It accepts the best-trained model as input, runs the script in the `AzureML-sklearn-1.0-ubuntu20.04-py38-cpu` environment, and uses the same compute cluster as the previous jobs (`cpu-cluster`). This job plays a crucial role in the pipeline by ensuring that the best-performing model identified during hyperparameter tuning is systematically stored and made available in the MLflow registry for further evaluation, deployment, or retraining. Integrating this job into the end-to-end pipeline automates the process of registering high-quality models, completing the model development lifecycle and enabling the prediction of used car prices.
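The contents of `model_register.py` are not shown in this notebook, so the sketch below only illustrates how such a script could parse its arguments. The flag names `--model_input_path` and `--model_name` mirror the ones passed in the job command further down. The function name, the default value, and the help texts are assumptions.

```python
import argparse


def parse_register_args(argv=None) -> argparse.Namespace:
    """Parse the command-line arguments of a model registration script (sketch)."""
    parser = argparse.ArgumentParser(description="Register a trained model (sketch)")
    parser.add_argument("--model_input_path", type=str, required=True,
                        help="Path to the serialized model file")
    parser.add_argument("--model_name", type=str,
                        default="used_cars_price_prediction_model",  # assumed default
                        help="Name under which the model is registered")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example runnable inside a notebook
args = parse_register_args(["--model_input_path", "model.pkl"])
print(args.model_input_path, args.model_name)
```

Accepting `argv` as a parameter lets the same function be exercised from a notebook cell or a test without touching `sys.argv`.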

In [45]:
from azure.ai.ml.entities import Model

registered_model = Model(
    path="azureml://datastores/workspaceblobstore/paths/outputs/model/model.pkl",
    name="final_random_forest_model",
    version="1",
    description="Final registered Random Forest model",
    type="custom_model"  
)

ml_client.models.create_or_update(registered_model)
print("✅ Model registered successfully.")


✅ Model registered successfully.


In [28]:
from azure.ai.ml.entities import Environment
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="77c91b3f-d78c-4832-8ed2-a5dd9c501e0e",
    resource_group_name="streaming_autovehicle_pricing_mlops",
    workspace_name="project_III_MLOPS"
)

new_env = Environment(
    name="train-env-lastproject3",
    version="6.44A1B57HH68c", 
    description="Stable environment with sklearn 1.5.1, Python 3.10, and AzureML SDK 1.52.0",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "channels": ["conda-forge", "defaults"],
        "dependencies": [
            "python=3.10",
            "pip",
            "numpy=1.26.0",
            "pandas=2.1.1",
            "scikit-learn=1.5.1",
            "joblib=1.3.2",
            {
                "pip": [
                    "mlflow==2.9.2",
                    "azureml-core==1.52.0",
                    "azureml-mlflow==1.52.0",
                    "packaging==23.2",
                    "cloudpickle==2.2.1",
                    "typing-extensions==4.8.0"
                ]
            }
        ],
        "name": "train-env"
    }
)

ml_client.environments.create_or_update(new_env)
print("✅ New Environment was registered at: train-env-lastproject3:6.44A1B57HH68c")





Overriding of current TracerProvider is not allowed
Overriding of current LoggerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented


✅ New Environment was registered at: train-env-lastproject3:6.44A1B57HH68c


In [34]:
import zipfile, os
SRC = "smoke_test"
ZIP = "smoke_test.zip"
if os.path.exists(ZIP): os.remove(ZIP)
with zipfile.ZipFile(ZIP, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk(SRC):
        dirs[:] = [d for d in dirs if d != ".git"]
        for f in files:
            if f.endswith(".zip") or f.endswith(".amltmp"): continue
            full = os.path.join(root, f)
            arc = os.path.relpath(full, os.getcwd())
            zf.write(full, arc)
print("wrote", ZIP)


wrote smoke_test.zip


In [35]:
import zipfile
with zipfile.ZipFile("smoke_test.zip") as z:
    print(z.namelist())


['smoke_test/.amlignore', 'smoke_test/conda.yml', 'smoke_test/diag.py', 'smoke_test/smoke_test.py']


In [40]:

# Smoke test for the environment; this is not the real job yet
from azure.ai.ml import command
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="77c91b3f-d78c-4832-8ed2-a5dd9c501e0e",
    resource_group_name="streaming_autovehicle_pricing_mlops",
    workspace_name="project_III_MLOPS",
)

job = command(
    name="env-smoke-test-dir",
    code="smoke_test",                # the directory, not the zip
    command="python smoke_test.py",   
    environment="smoke-test-env:1",   
    compute="cpu-cluster",
    display_name="env-smoke-test-dir",
    experiment_name="env_diagnostics",
)

ret = ml_client.jobs.create_or_update(job)
print("Submitted:", ret.name, ret.studio_url)



Uploading smoke_test (0.0 MBs): 100%|██████████| 1718/1718 [00:00<00:00, 73146.56it/s]


Git properties are removed because the repository URL contains a secret.


Submitted: env-smoke-test-dir https://ml.azure.com/runs/env-smoke-test-dir?wsid=/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_mlops/workspaces/project_III_MLOPS&tid=3f211132-3351-46c8-ba33-39c5bcff66b3


In [59]:
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential
import datetime

# 1. Client initialization
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="77c91b3f-d78c-4832-8ed2-a5dd9c501e0e",
    resource_group_name="streaming_autovehicle_pricing_mlops",
    workspace_name="project_III_MLOPS",
)

# 2. Job unique name 
job_name = f"model-register-job-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"

# 3. Job configuration (no outputs declared)
job = command(
    name=job_name,
    code="model_register",   
    command=(
        "python model_register.py "
        "--model_input_path model.pkl "
        "--model_name used_cars_price_prediction_model"
    ),
    environment="smoke-test-env:1",
    compute="cpu-cluster",
    display_name=job_name,
    experiment_name="project_pipeline"
)

# 4. Submit
returned = ml_client.jobs.create_or_update(job)
print("Submitted:", returned.name)
print("Studio URL:", getattr(returned, "studio_url", None))



Uploading model_register (2.28 MBs): 100%|██████████| 2284094/2284094 [00:00<00:00, 25696265.80it/s]


Git properties are removed because the repository URL contains a secret.


Submitted: model-register-job-20251102-205520
Studio URL: https://ml.azure.com/runs/model-register-job-20251102-205520?wsid=/subscriptions/77c91b3f-d78c-4832-8ed2-a5dd9c501e0e/resourcegroups/streaming_autovehicle_pricing_mlops/workspaces/project_III_MLOPS&tid=3f211132-3351-46c8-ba33-39c5bcff66b3


In [62]:
ml_client.jobs.download(
    name=returned.name,
    download_path="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/model_register/",
    all=True
)
# Download all job outputs into the model_register folder

Downloading artifact azureml://datastores/workspaceartifactstore/ExperimentRun/dcid.model-register-job-20251102-205520 to /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/model_register/artifacts


In [72]:
from git import Repo
import os

repo_path = "/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil/fresh_cleaned_repo"
repo = Repo(repo_path)

assert not repo.bare

folder_to_add = os.path.join(repo_path, "model_register")
repo.index.add([folder_to_add])

repo.index.commit("Add model_register folder with scripts and artifacts")
repo.remote(name="origin").push(refspec="main:main")

print("✅ model_register uploaded to GitHub.")



✅ model_register uploaded to GitHub.


### **2.4 Assembling the End-to-End Workflow**

The end-to-end pipeline integrates all the previously defined jobs into a seamless workflow, automating the process of data preparation, model training, hyperparameter tuning, and model registration. The pipeline is designed using Azure Machine Learning's `@pipeline` decorator, specifying the compute target and providing a detailed description of the workflow.
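The ordering that such a pipeline has to respect can be sketched in plain Python before any Azure ML objects are involved. The example below is not the Azure ML `@pipeline` DSL. It only models the four steps named above as a dependency graph and derives a valid execution order with the standard library's `graphlib`. The step names and the strictly linear wiring between them are assumptions.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes (assumed wiring)
dependencies = {
    "data_preparation": set(),
    "model_training": {"data_preparation"},
    "hyperparameter_tuning": {"model_training"},
    "model_registration": {"hyperparameter_tuning"},
}

# static_order() yields every step only after all of its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())

for position, step in enumerate(execution_order, start=1):
    print(f"{position}. {step}")
```

In an Azure ML pipeline the same ordering arises implicitly: a step runs once the outputs it takes as inputs have been produced by the upstream steps.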

In [7]:
%%bash
set -euo pipefail

REPO="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil"
BACKUP_DIR="/tmp/repo_backup_$(date -u +%Y%m%dT%H%M%SZ)"
LOG="$REPO/git_cleanup_log.txt"
SMOKE="smoke_test.sh"
REQ_FILE="tmp_test_env/requirements-pip.txt"
CLEAN_BRANCH="env-snapshot-clean-$(date -u +%Y%m%dT%H%M%SZ)"

echo "=== GIT CLEANUP + ORPHAN PREP START $(date -u) ===" | tee "$LOG"
echo "Repo: $REPO" | tee -a "$LOG"
mkdir -p "$BACKUP_DIR"
echo "Backup dir: $BACKUP_DIR" | tee -a "$LOG"

if [ ! -d "$REPO" ]; then
  echo "ERROR: repo not found: $REPO" | tee -a "$LOG"
  exit 2
fi
cd "$REPO"

# 1. create tar backup of repo (fast, keeps worktree and .git)
echo "Creating repo tar backup (may be large)..." | tee -a "$LOG"
tar -czf "$BACKUP_DIR/repo_backup.tar.gz" -C "$REPO" . 2>/dev/null || { echo "Tar failed; continuing" | tee -a "$LOG"; }
echo "Backup created (or attempted). Copy also staged unmerged files if any." | tee -a "$LOG"

# 2. record unmerged files (if any) and copy them to backup dir for inspection
echo "Listing unmerged (conflicted) entries (if present)..." | tee -a "$LOG"
git ls-files -u > "$BACKUP_DIR/unmerged_index.txt" 2>/dev/null || true
if [ -s "$BACKUP_DIR/unmerged_index.txt" ]; then
  echo "Found unmerged entries, saving file list to $BACKUP_DIR/unmerged_index.txt" | tee -a "$LOG"
  awk '{print $4}' "$BACKUP_DIR/unmerged_index.txt" | sort -u | while read -r f; do
    mkdir -p "$(dirname "$BACKUP_DIR/$f")"
    cp -a "$REPO/$f" "$BACKUP_DIR/$f" 2>/dev/null || true
  done
else
  echo "No unmerged entries found." | tee -a "$LOG"
fi

# 3. try to abort any merge/rebase in progress
echo "Attempting git merge --abort and git rebase --abort (if applicable)..." | tee -a "$LOG"
git merge --abort 2>/dev/null || true
git rebase --abort 2>/dev/null || true

# 4. try safer resets: reset --merge (keeps untracked), then if index still dirty do hard reset and clean
echo "Attempting git reset --merge ..." | tee -a "$LOG"
git reset --merge 2>/dev/null || true

# If index still indicates conflicts, remove unmerged entries from index (they are saved in backup)
if git ls-files -u | grep -q .; then
  echo "Removing unmerged entries from index (they are backed up)..." | tee -a "$LOG"
  git ls-files -u | awk '{print $4}' | sort -u | xargs -r git rm --cached -f 2>/dev/null || true
fi

# 5. hard-clean working tree (destructive to unstaged changes); a backup was created above
echo "Performing git reset --hard HEAD and git clean -fd to ensure clean index..." | tee -a "$LOG"
set +e
git reset --hard HEAD 2>&1 | tee -a "$LOG" || true
git clean -fd 2>&1 | tee -a "$LOG" || true
set -e

# 6. sanity check: no MERGE_HEAD and no unmerged entries should remain
if [ -f ".git/MERGE_HEAD" ]; then
  echo "MERGE_HEAD still present; aborting and exiting for manual resolution." | tee -a "$LOG"
  exit 3
fi
if git ls-files -u | grep -q .; then
  echo "Unmerged index entries still present; aborting for manual resolution." | tee -a "$LOG"
  exit 4
fi

echo "Index cleaned. Creating orphan branch $CLEAN_BRANCH and committing only artifacts." | tee -a "$LOG"

# 7. create orphan branch and prepare clean commit from tmpdir (same as earlier safe flow)
git checkout --orphan "$CLEAN_BRANCH" 2>&1 | tee -a "$LOG"

# Remove all files from index (we will add only selected files)
git rm -rf --cached . 2>/dev/null || true

TMPDIR="$(mktemp -d)"
echo "Preparing clean tree in $TMPDIR" | tee -a "$LOG"

# Copy desired artifacts into tmpdir
if [ -f "$SMOKE" ]; then
  mkdir -p "$TMPDIR/$(dirname "$SMOKE")"
  cp -p "$SMOKE" "$TMPDIR/$SMOKE"
  echo "Copied $SMOKE to tmpdir." | tee -a "$LOG"
else
  echo "Warning: $SMOKE not found; it will not be included." | tee -a "$LOG"
fi

if [ -f "$REQ_FILE" ]; then
  mkdir -p "$TMPDIR/$(dirname "$REQ_FILE")"
  cp -p "$REQ_FILE" "$TMPDIR/$REQ_FILE"
  echo "Copied $REQ_FILE to tmpdir." | tee -a "$LOG"
else
  echo "Warning: $REQ_FILE not found; it will not be included." | tee -a "$LOG"
fi

cat > "$TMPDIR/README_env_snapshot.md" <<'EOF'
Clean environment snapshot for push.
Includes: smoke_test.sh and tmp_test_env/requirements-pip.txt only.
This branch intentionally omits project history to avoid leaking secrets.
EOF

# Remove files from working tree (we're on orphan branch) then copy in the tmpdir content
git ls-files -z | xargs -0 -r rm -f || true
cp -a "$TMPDIR/." "$REPO/"

git add --all 2>&1 | tee -a "$LOG"

if git diff --cached --quiet; then
  echo "No files staged for commit after preparing tmpdir. Exiting." | tee -a "$LOG"
  rm -rf "$TMPDIR"
  exit 0
fi

git commit -m "Clean snapshot: smoke_test + requirements (no history) - $CLEAN_BRANCH" 2>&1 | tee -a "$LOG" || { echo "Commit failed" | tee -a "$LOG"; }

# 8. push if remote exists
REMOTE_URL="$(git remote get-url origin 2>/dev/null || true)"
if [ -z "$REMOTE_URL" ]; then
  echo "No remote origin configured. To push run: git remote add origin <url> ; git push -u origin $CLEAN_BRANCH" | tee -a "$LOG"
else
  echo "Attempting push to origin: $REMOTE_URL" | tee -a "$LOG"
  set +e
  git push -u origin "$CLEAN_BRANCH" 2>&1 | tee -a "$LOG"
  RC=${PIPESTATUS[0]}
  set -e
  if [ "$RC" -ne 0 ]; then
    echo "Push failed with code $RC. Likely push-protection (secret scanning) or credential issue. See log above for remote message." | tee -a "$LOG"
  else
    echo "Push succeeded." | tee -a "$LOG"
  fi
fi

# Cleanup
rm -rf "$TMPDIR"
echo "Backup saved at $BACKUP_DIR/repo_backup.tar.gz" | tee -a "$LOG"
echo "=== GIT CLEANUP + ORPHAN PREP END $(date -u) ===" | tee -a "$LOG"
echo "Log: $LOG"


=== GIT CLEANUP + ORPHAN PREP START Wed Nov  5 20:45:43 UTC 2025 ===
Repo: /mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil
Backup dir: /tmp/repo_backup_20251105T204543Z
Creating repo tar backup (may be large)...
Backup created (or attempted). Copy also staged unmerged files if any.
Listing unmerged (conflicted) entries (if present)...
Found unmerged entries, saving file list to /tmp/repo_backup_20251105T204543Z/unmerged_index.txt
Attempting git merge --abort and git rebase --abort (if applicable)...
Attempting git reset --merge ...
Removing unmerged entries from index (they are backed up)...
rm 'model_register/.amlignore'
rm 'model_register/.amlignore.amltmp'
rm 'outputs/model/model.pkl'
Performing git reset --hard HEAD and git clean -fd to ensure clean index...
Updating files: 100% (51/51), done.
HEAD is now at e7946ec Preserve local model_register before merge
Removing .amlignore
Removing .amlignore.amltmp
Removing .azureml/
Removing .ipynb

In [2]:
%%bash
set -euo pipefail
REPO="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil"
LOG="$REPO/git_filter_repo_find_log.txt"
cd "$REPO"
echo "=== FIND SUSPICIOUS STRINGS START $(date -u) ===" | tee "$LOG"

# 1) quick heuristic search in HEAD and recent commits (adjust regex if needed)
echo "--- Searching HEAD working tree for common token patterns ---" | tee -a "$LOG"
# Group `git grep || true` so its output reaches tee and a no-match exit code does not abort the script under pipefail.
{ git grep -nI -E "ghp_[A-Za-z0-9]{36}|gho_[A-Za-z0-9]{36}|ghs_[A-Za-z0-9]{36}|GH[AP]_[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16}|azure.*key|AZURE_CLIENT_SECRET|CLIENT_SECRET|TOKEN|PASSWORD|SECRET" HEAD || true; } | tee -a "$LOG"

echo "--- Searching entire history (git rev-list) for same patterns (may take time) ---" | tee -a "$LOG"
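# Only the first 2000 commits are scanned. Using `if` keeps the loop's exit status at 0,
# so a final commit without a match does not abort the script under set -e / pipefail.
git rev-list --all -n 2000 | while read -r rev; do
  if git grep -qI -E "ghp_[A-Za-z0-9]{36}|gho_[A-Za-z0-9]{36}|ghs_[A-Za-z0-9]{36}|GH[AP]_[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16}|AZURE|CLIENT_SECRET|SECRET|TOKEN|PASSWORD" "$rev" 2>/dev/null; then
    echo "POSSIBLE MATCH IN COMMIT: $rev" | tee -a "$LOG"
  fi
done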

echo "--- Show commits where GitHub push reported secret (if known) ---" | tee -a "$LOG"
# try to surface commit ids referenced in previous push error; repo logs may have them
grep -nRI "GH013" .git* 2>/dev/null || true
grep -nRI "secret" .git* 2>/dev/null || true

echo "Search complete. If you find an exact literal secret string or a filename containing the secret, copy it and use it in the next cell." | tee -a "$LOG"
echo "Log: $LOG"
echo "=== FIND SUSPICIOUS STRINGS END $(date -u) ===" | tee -a "$LOG"


=== FIND SUSPICIOUS STRINGS START Thu Nov  6 14:15:02 UTC 2025 ===
--- Searching HEAD working tree for common token patterns ---
--- Searching entire history (git rev-list) for same patterns (may take time) ---
POSSIBLE MATCH IN COMMIT: 6e8f8e7558114895f9e51a2866201a3ba0503376
POSSIBLE MATCH IN COMMIT: e7946ec63d687ccb94db642410b99cddf8894cb1
POSSIBLE MATCH IN COMMIT: f32d5a66f5a6b43cddbeebf99b7fbe9b8f369dba
POSSIBLE MATCH IN COMMIT: ddd117f7cf1795c845d4f173f406bbdea8ca340a
POSSIBLE MATCH IN COMMIT: 5481482373a9487122b4238cd5a1e9d34b740073
POSSIBLE MATCH IN COMMIT: 058f4a5dc7fd2d5e917aed449f1616b13943030b
POSSIBLE MATCH IN COMMIT: a4e22b7560cc988a547c7e0f1aac0f4ae7618aa9
POSSIBLE MATCH IN COMMIT: 1acfa0fae097fbdfb2fcf673005f8ce695b48fe3
POSSIBLE MATCH IN COMMIT: 637b5d1a7f80e23f8505a55b0b9de5821f8a6814
POSSIBLE MATCH IN COMMIT: 6c97b7a0675fec0a960e14bd55654eca9d2c64c1
POSSIBLE MATCH IN COMMIT: 784f4bcdb3f9be1cda89c93e80bdb0e0a4183a3c
POSSIBLE MATCH IN COMMIT: b9b7528015115420414972cc7d
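
The history search above flags a commit whenever a broad keyword such as `TOKEN` or `AZURE` appears, which may be why so many commits are listed. As a rough illustration only, the sketch below restricts the scan to token-shaped strings and masks what it finds. The pattern labels and the `find_tokens` helper are hypothetical names introduced here, and the two regexes are taken from the `git grep` expression in the cell, not from any official specification.

```python
import re

# Patterns mirror two of the token shapes used in the git grep call above.
# The labels on the left are illustrative, not an official taxonomy.
TOKEN_PATTERNS = {
    "github_token": re.compile(r"gh[pos]_[A-Za-z0-9]{36}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def find_tokens(text):
    """Return (line_number, label, masked_match) for each token-shaped string."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in TOKEN_PATTERNS.items():
            for match in pattern.finditer(line):
                token = match.group(0)
                # Keep only the prefix so the report itself does not leak the value.
                masked = token[:4] + "*" * (len(token) - 4)
                hits.append((lineno, label, masked))
    return hits


# Synthetic sample: one token-shaped string and one harmless keyword.
sample = "url = https://ghp_" + "a" * 36 + "@github.com/OWNER/REPO.git\nTOKEN = read_from_env()"
for hit in find_tokens(sample):
    print(hit)
```

Masking the match keeps the scan report safe to store in a log file inside the repository, unlike raw `git grep` output.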

In [6]:
# show the current remote (it should show HTTPS with an embedded PAT)
!git remote -v

# set the SSH URL (replace OWNER/REPO with your own)
!git remote set-url origin git@github.com:kenderovemil/mlops-used-cars-lastproject3.git

# confirm
!git remote -v

# test SSH auth
!ssh -T git@github.com

# test push permissions (dry-run)
!git push --dry-run origin HEAD


origin	https://<REDACTED_PAT>@github.com/kenderovemil/mlops-used-cars-lastproject3.git (fetch)
origin	https://<REDACTED_PAT>@github.com/kenderovemil/mlops-used-cars-lastproject3.git (push)
origin	git@github.com:kenderovemil/mlops-used-cars-lastproject3.git (fetch)
origin	git@github.com:kenderovemil/mlops-used-cars-lastproject3.git (push)
The authenticity of host 'github.com (140.82.114.4)' can't be established.
ED25519 key fingerprint is SHA256:+DiY3wvvV6TuJJhbpZisF/zLDA0zPMSvHdkr4UvCOqU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? ^C
Everything up-to-date
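
The first `git remote -v` output shows a token embedded in the HTTPS URL, which the cell replaces with an SSH URL. As a hedged sketch of the same idea, the helper below removes the `user:password@` part from a URL string before it is printed or logged. `strip_credentials` is a hypothetical name introduced here, and the URLs in the example are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Return the URL without any user:password@ part in its network location."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        # Nothing to strip; this also covers scp-style SSH remotes such as
        # git@github.com:OWNER/REPO.git, which urlsplit parses without a netloc.
        return url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_credentials("https://EXAMPLE_TOKEN@github.com/OWNER/REPO.git"))
print(strip_credentials("git@github.com:OWNER/REPO.git"))
```

Stripping the URL only hides the token in output. A token that has already appeared in a log or a commit still needs to be revoked.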


In [4]:
!git add .
!git commit -m "Sync changes to GitHub for AzureML visibility"
!git push



On branch env-snapshot-clean-20251105T204543Z
Your branch is ahead of 'origin/env-snapshot-clean-20251105T204543Z' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   fresh_cleaned_repo/fresh_cleaned_repo (modified content, untracked content)
	modified:   skycore_pipeline_clean- (modified content, untracked content)
	modified:   used-cars-mlops (modified content, untracked content)

no changes added to commit (use "git add" and/or "git commit -a")
Enumerating objects: 24, done.
Counting objects: 100% (24/24), done.
Delta compression using up to 4 threads
Compressing objects: 100% (16/16), done.
Writing objects: 100% (17/17), 32.11 KiB | 632.00 KiB/s, done.
Total 17 (delta 8), reused 1 (delta 0), pack-reused 0

In [None]:
%%bash
set -euo pipefail

# ---------- configuration (change only TARGET_PATH if needed) ----------
REPO="/mnt/batch/tasks/shared/LS_root/mounts/clusters/lastprojectcompute/code/Users/kenderov.emil"
TIMESTAMP="$(date -u +%Y%m%dT%H%M%SZ)"
BACKUP="/tmp/repo_backup_${TIMESTAMP}.tar.gz"
LOGDIR="$REPO/git_filter_repo_fullflow_logs"
mkdir -p "$LOGDIR"
FILTER_LOG="$LOGDIR/filter_invert_log.txt"
PUSH_LOG="$LOGDIR/push_invert_log.txt"
# ------------------------------------------------------------------------

echo "START path-removal cleanup flow: $TIMESTAMP"
echo "Repo: $REPO"
cd "$REPO"

# 1) Backup
echo "1) Creating tar backup of working tree and .git (may be large)..."
tar -czf "$BACKUP" -C "$REPO" . 2>/dev/null || { echo "Backup tar failed; check disk space"; exit 2; }
echo "Backup saved: $BACKUP"

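# 2) Confirm rotation already done
# Note: %%bash feeds the cell to bash on stdin, so an interactive `read` cannot work here.
# The confirmation is taken from the CONFIRM environment variable instead
# (for example, run `%env CONFIRM=YES` in an earlier cell).
echo
echo "2) CONFIRMATION REQUIRED: You MUST have revoked/regenerated all compromised tokens in GitHub."
echo "   Set CONFIRM=YES to confirm you have revoked them; any other value aborts."
CONFIRM="${CONFIRM:-}"
if [ "$CONFIRM" != "YES" ]; then
  echo "Aborted (no confirmation). No destructive changes performed."
  exit 1
fi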

# 3) Target path to remove from history
# Taken from the TARGET_PATH environment variable, because `read` cannot prompt inside a %%bash cell.
echo
echo "3) TARGET_PATH must hold the relative path to remove from history (e.g., secrets/.env OR notebooks/Week-17_Project_FullCode_Notebook.ipynb)."
TARGET_PATH="${TARGET_PATH:-}"
if [ -z "$TARGET_PATH" ]; then
  echo "No TARGET_PATH provided; aborting."
  exit 2
fi
echo "Target path to remove: $TARGET_PATH"

# 4) Ensure git-filter-repo is available
if ! command -v git-filter-repo >/dev/null 2>&1; then
  echo "git-filter-repo not found; installing to user site-packages..." | tee -a "$FILTER_LOG"
  python -m pip install --user git-filter-repo 2>&1 | tee -a "$FILTER_LOG"
  export PATH="$HOME/.local/bin:$PATH"
fi

# 5) Create bare mirror clone
MIRROR_DIR="/tmp/repo_mirror_${TIMESTAMP}"
echo "5) Creating a bare mirror clone to operate on -> $MIRROR_DIR" | tee -a "$FILTER_LOG"
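# --no-local makes the clone of a local path a fresh clone, which git-filter-repo requires unless --force is given.
git clone --mirror --no-local . "$MIRROR_DIR" 2>&1 | tee -a "$FILTER_LOG"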

cd "$MIRROR_DIR"

# 6) Run git-filter-repo to remove the path everywhere
echo "6) Running git-filter-repo --path '$TARGET_PATH' --invert-paths (this rewrites history)..." | tee -a "$FILTER_LOG"
git filter-repo --path "$TARGET_PATH" --invert-paths 2>&1 | tee -a "$FILTER_LOG"

# 7) Prune and gc
git reflog expire --expire=now --all 2>&1 | tee -a "$FILTER_LOG"
git gc --prune=now --aggressive 2>&1 | tee -a "$FILTER_LOG"

# 8) Verify removal
echo "7) Verifying removal: searching for commits that touch $TARGET_PATH in new history..." | tee -a "$FILTER_LOG"
# `git log -- <path>` lists commits that touch the path; empty output means it is gone from every ref.
# (Grepping file contents for the path string would not detect whether the path still exists.)
if [ -n "$(git log --all --format=%H -n 1 -- "$TARGET_PATH")" ]; then
  echo "ERROR: target path still found in rewritten history. Inspect $FILTER_LOG" | tee -a "$FILTER_LOG"
  exit 3
else
  echo "Verification passed: $TARGET_PATH not found in rewritten history (mirror)." | tee -a "$FILTER_LOG"
fi

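# 9) Prepare to push mirror (force)
# The mirror was cloned from the local path, and git-filter-repo removes the `origin`
# remote after rewriting, so the push URL is taken from the original repo.
REMOTE="$(git -C "$REPO" remote get-url origin 2>/dev/null || true)"
if [ -z "$REMOTE" ]; then
  echo "No remote origin configured in $REPO; aborting push. To push, configure remote then run: git push --mirror origin" | tee -a "$PUSH_LOG"
  exit 0
fi
git remote remove origin 2>/dev/null || true
git remote add origin "$REMOTE"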

echo
echo "8) FORCE PUSH: about to push cleaned history to remote: $REMOTE"
echo "    This will rewrite remote history. If you proceed, all collaborators must re-sync (fetch + reset or reclone)."
echo "    Set PUSH_CONFIRM=YES to proceed with the mirror force push; any other value aborts."
# Read from the environment, because `read` cannot prompt inside a %%bash cell.
PUSH_CONFIRM="${PUSH_CONFIRM:-}"
if [ "$PUSH_CONFIRM" != "YES" ]; then
  echo "Push aborted (no confirmation). Cleaned mirror available at $MIRROR_DIR. No remote changes made."
  exit 0
fi

# 10) Push mirror (force)
echo "Pushing mirror (force) to origin..." | tee -a "$PUSH_LOG"
set +e
git push --mirror --force origin 2>&1 | tee -a "$PUSH_LOG"
RC=${PIPESTATUS[0]}
set -e
if [ "$RC" -ne 0 ]; then
  echo "Push failed (code $RC). Check $PUSH_LOG for details. You may need to unblock secret scanning on GitHub or resolve remaining flagged secrets." | tee -a "$PUSH_LOG"
  exit 4
else
  echo "Mirror force-push succeeded." | tee -a "$PUSH_LOG"
fi

# 11) Wrap up
echo
echo "CLEANUP COMPLETE. Logs:"
echo " - filter logs: $FILTER_LOG"
echo " - push logs: $PUSH_LOG"
echo
echo "POST-STEPS:"
echo " - Ensure all tokens were revoked (if not done already)."
echo " - Inform any collaborators (if any) to re-sync or reclone."
echo " - Inspect GitHub Security → Secret scanning to confirm no other flags."
echo
echo "Mirror repo kept at: $MIRROR_DIR  (delete after you verify everything is OK)"
echo "Backup of original working tree: $BACKUP"
echo "Finished at: $(date -u)"


In [12]:
import os

# Path to the workflows directory
workflow_dir = os.path.expanduser("~/cloudfiles/code/Users/kenderov.emil/.github/workflows")

# Collect all YAML files
yaml_files = []
for root, dirs, files in os.walk(workflow_dir):
    for file in files:
        if file.endswith((".yml", ".yaml")):
            yaml_files.append(os.path.join(root, file))

# Search the files for 'client-secret' and print the content of each match
for path in yaml_files:
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    if "client-secret" in content:
        print(f"\n🔍 Found 'client-secret' in: {path}\n")
        print(content)
        print("-" * 80)
    else:
        print(f"No 'client-secret' found in: {path}")
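
The search above reports every workflow that mentions `client-secret`, including ones that only reference a repository secret. As a minimal sketch, and assuming the value is written as a plain `client-secret: ...` YAML line and that safe values look like `${{ secrets.NAME }}`, the helper below flags only the lines whose value is not such a reference. `hardcoded_client_secrets` and both regexes are hypothetical helpers introduced for illustration.

```python
import re

# Matches a "client-secret: value" style YAML line; the key name follows the
# string searched for in the cell above.
CLIENT_SECRET_LINE = re.compile(r"^\s*client-secret\s*:\s*(?P<value>.+?)\s*$")
# A value that consists only of a GitHub Actions secrets expression.
SECRETS_REFERENCE = re.compile(r"^['\"]?\$\{\{\s*secrets\.[A-Za-z0-9_]+\s*\}\}['\"]?$")


def hardcoded_client_secrets(yaml_text):
    """Return line numbers whose client-secret value is not a secrets.* reference."""
    flagged = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = CLIENT_SECRET_LINE.match(line)
        if match and not SECRETS_REFERENCE.match(match.group("value")):
            flagged.append(lineno)
    return flagged


# Synthetic workflow fragment: line 3 is a reference, line 4 is a literal value.
workflow = (
    "with:\n"
    "  client-id: example-id\n"
    "  client-secret: ${{ secrets.AZURE_CLIENT_SECRET }}\n"
    "  client-secret: literal-value\n"
)
print(hardcoded_client_secrets(workflow))
```

A line-based regex does not understand YAML structure, so multi-line or JSON-embedded values would need a real YAML parser.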
