# In ML, a PIPELINE is a chain (workflow) that orders and automates all of the stages

# Why is a PIPELINE important?

1) Code stays organized (each stage lives in its own block)
2) Repetitive steps are automated (for example, the scaler and encoder run by themselves every time)
3) Data leakage is prevented (the scaler is fit on the training data only, then applied to the test data)
4) Deploying the model gets easier (the same pipeline is used in prod)
5) Hyperparameter tuning gets simpler (GridSearchCV works conveniently with a Pipeline)
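
A minimal sketch of point 3, assuming synthetic data from `make_classification` and a `LogisticRegression` model (neither comes from these notes). It checks that the scaler inside the pipeline stores statistics from the training split only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

leak_free = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() learns the scaler statistics from X_train only
leak_free.fit(X_train, y_train)

# score() only calls transform() on X_test; the scaler is not refit
test_accuracy = leak_free.score(X_test, y_test)

# The stored mean is compared with the mean of the training split
scaler_mean = leak_free.named_steps["scaler"].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))
```

Fitting the scaler on the full data set before splitting would let test-set statistics influence training, which is the leakage the pipeline avoids.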

# Pipeline types

1) Manual 
2) Auto
3) Manual + Auto combination

# Existing kinds of pipelines

1) Data Preprocessing Pipeline :: missing value imputation // encoding // scaling // feature engineering // balancing // outlier removal
2) Modeling Pipeline :: combines preprocessing with the model
3) MLOps Pipeline (Production pipeline) :: data ingestion // data validation // feature store // model training // model tuning // continuous training (CT) // continuous deployment (CD) // monitoring and drift detection (tools :: Airflow // Prefect // MLflow // Kubeflow)

# SCIKIT-LEARN + full PIPELINE practice

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer


df = pd.read_csv(
    r"C:\Users\Jahongir\desktop\practise\Data\Row_Data\employee_promotion.csv"
)

# TARGET
y = df["recruitment_channel"]
X = df.drop("recruitment_channel", axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Numeric columns: fill gaps with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Pick the columns for each branch by dtype
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["object", "category"]).columns


preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, cat_cols)
    ]
)

pipe = Pipeline([
    ("prep", preprocess),
    ("model", GradientBoostingClassifier(random_state=42))
])

# Grid keys follow the "<step name>__<parameter>" convention
params = {
    "model__n_estimators": [50, 100],
    "model__learning_rate": [0.01, 0.1]
}

grid = GridSearchCV(
    pipe,
    params,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid.fit(X_train, y_train)


print("Best Params:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
print("Test Accuracy:", grid.score(X_test, y_test))


Best Params: {'model__learning_rate': 0.1, 'model__n_estimators': 50}
Best CV Score: 0.5544405586540021
Test Accuracy: 0.558474730888524
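
The `model__` prefix in the best parameters comes from the step name given to the Pipeline. A small sketch of that naming rule, assuming a reduced two-step pipeline (`demo_pipe` is a hypothetical name, and a plain `StandardScaler` stands in for the full `preprocess` step):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("prep", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# Every tunable setting is exposed as "<step name>__<parameter name>"
model_keys = sorted(k for k in demo_pipe.get_params() if k.startswith("model__"))
print(model_keys[:5])

# set_params accepts the same keys that GridSearchCV receives in its grid
demo_pipe.set_params(model__n_estimators=50, model__learning_rate=0.1)
print(demo_pipe.named_steps["model"].n_estimators)
```

Renaming the step (for example to `"clf"`) would change every grid key to `clf__...`, so the grid and the step names have to stay in sync.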


In [None]:
# PIPELINE ARCHITECTURE IN REAL PROJECTS


>>>> Raw Data 

>>>> Data Cleaning (missing values, duplicates)

>>>> Feature engineering

>>>> Training Pipeline

>>>> Model Registry (MLflow)

>>>> Deployment (Docker, FastAPI)


>>>> Monitoring + Drift Detection
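
The hand-off from the training stage to the deployment stage can be sketched roughly as follows. This is only an illustration: it uses plain `joblib` files instead of an MLflow registry, synthetic data, and a hypothetical artifact name `pipeline.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

trained = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

# Training side: write the fitted pipeline to disk as a single artifact
artifact_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")  # hypothetical file name
joblib.dump(trained, artifact_path)

# Serving side: load the artifact and predict on raw, unscaled rows
served = joblib.load(artifact_path)
same_predictions = (served.predict(X[:5]) == trained.predict(X[:5])).all()
print(same_predictions)
```

Because preprocessing and model travel in one object, the serving code does not need its own copy of the scaling logic.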



# Where PIPELINEs are used


| Domain | Example (pipeline) |
|--------|--------------------|
| Fraud Detection | incoming transaction → pipeline → model |
| Recommendation System | user event → pipeline → ranking |
| NLP (Natural Language Processing) | text cleaning → tokenizer → model |
| CV (Computer Vision) | image resize → augmentation → model |
| Banking | scoring pipeline |
| Industry | predictive maintenance |
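
As one concrete case, the NLP row can be sketched with scikit-learn alone. The corpus, the labels, and the choice of `TfidfVectorizer` with `LogisticRegression` are assumptions made for illustration, not part of these notes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only
texts = [
    "great product, works well",
    "excellent quality, very happy",
    "terrible, broke after a day",
    "awful experience, very disappointed",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

text_pipe = Pipeline([
    # lowercase=True acts as the cleaning step; the vectorizer also tokenizes
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("model", LogisticRegression()),
])

text_pipe.fit(texts, labels)
prediction = text_pipe.predict(["very happy with the quality"])
print(prediction)
```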


# A system that automates and organizes the ML process

In [21]:
# Manual Pipeline -- a pipeline built by hand
# AutoML Pipeline -- a pipeline that selects the model automatically
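
The idea of automatic model selection can be imitated in a few lines. This is a toy stand-in, not a real AutoML library: the candidate models, the synthetic data, and the names `candidates` and `cv_scores` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical candidate list; a real AutoML tool searches a far larger space
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score each candidate inside the same preprocessing pipeline
cv_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    cv_scores[name] = cross_val_score(pipe, X, y, cv=3).mean()

# Keep the candidate with the highest mean cross-validation accuracy
best_name = max(cv_scores, key=cv_scores.get)
print(best_name, round(cv_scores[best_name], 3))
```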


In [23]:
# # Manual Pipeline -- here the developer specifies every step by hand

# 1) Preprocessing
# 2) Feature engineering
# 3) Scaling
# 4) Encoding
# 5) Model selection
# 6) Hyperparameter tuning
# 7) Training
# 8) Prediction
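
A compact sketch of several of these steps done by hand, without `Pipeline`. The rows are made up and the model choice (`LogisticRegression`) is an assumption; feature engineering and tuning are left out to keep it short.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows: one numeric column with a gap, one categorical column
num_train = np.array([[25.0], [np.nan], [40.0], [31.0]])
cat_train = np.array([["a"], ["b"], ["a"], ["b"]])
y_train = np.array([0, 1, 0, 1])

# 1) preprocessing: fill the missing value with the column median
imputer = SimpleImputer(strategy="median")
num_filled = imputer.fit_transform(num_train)

# 3) scaling
scaler = StandardScaler()
num_scaled = scaler.fit_transform(num_filled)

# 4) encoding
encoder = OneHotEncoder(handle_unknown="ignore")
cat_encoded = encoder.fit_transform(cat_train).toarray()

# 5) and 7) pick a model and train it on the assembled matrix
X_train_manual = np.hstack([num_scaled, cat_encoded])
model = LogisticRegression().fit(X_train_manual, y_train)

# 8) prediction: every fitted transformer is re-applied in the same order
num_new = scaler.transform(imputer.transform(np.array([[np.nan]])))
cat_new = encoder.transform(np.array([["a"]])).toarray()
pred = model.predict(np.hstack([num_new, cat_new]))
print(X_train_manual.shape, pred.shape)
```

Each fitted object (`imputer`, `scaler`, `encoder`) has to be kept and replayed at prediction time, which is the bookkeeping that `Pipeline` otherwise handles.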

In [25]:
# # Why is a Manual Pipeline preferable?

# 1) Full control stays with the developer
# 2) Any transformation you want can be applied
# 3) Widely used on Kaggle and in real projects
# 4) Debugging is easy along the way
# 5) 

In [27]:
# # Drawbacks ---

# 1) It takes a lot of time
# 2) The code gets long
# 3) Tuning is done by hand


# Manual Pipeline practice