## Blurb baseline model (TF-IDF + Logistic Regression)

This notebook builds a Logistic Regression classifier that uses TF-IDF vectors of the project blurb to predict whether a project succeeds.

- The TF-IDF features are produced by `make_blurb_tfidf_pipeline()` from the module `preprocessing/blurb_tfidf.py`.
- The notebook covers loading the preprocessed data, vectorization, the train/test split, model training, and basic metrics (accuracy, confusion matrix, classification report).
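
To make the vectorization step concrete, here is a minimal sketch on three invented blurbs. It assumes scikit-learn's `TfidfVectorizer` with default settings as a stand-in. The actual configuration inside `make_blurb_tfidf_pipeline()` is not shown in this notebook and may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy blurbs, invented for illustration only (not taken from the dataset).
blurbs = [
    "a board game about space trading",
    "a documentary film about space",
    "handmade leather wallets",
]

# Hypothetical stand-in for make_blurb_tfidf_pipeline(); real settings may differ.
vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights, then returns a sparse
# matrix with one row per blurb and one column per vocabulary term.
X = vectorizer.fit_transform(blurbs)

print(X.shape)
# With default settings, single-character tokens such as "a" are dropped
# and each row is L2-normalized.
print(sorted(vectorizer.vocabulary_))
```

Terms that appear in several blurbs (here "about" and "space") receive a lower IDF weight than terms unique to one blurb.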


In [1]:
import pandas as pd

from preprocessing.blurb_tfidf import make_blurb_tfidf_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)


In [2]:
def load_and_vectorize_blurbs(path: str = "../data/processed_data.parquet"):
    """
    Load the dataset and vectorize the blurb column with TF-IDF.

    Returns:
        X: sparse matrix of TF-IDF features
        y: target variable (1 = successful, 0 = otherwise)
        vectorizer: the fitted TF-IDF vectorizer
    """
    # Preprocessed dataset
    df = pd.read_parquet(path)

    # NOTE: the vectorizer is fitted on all rows, before the train/test split,
    # so the vocabulary and IDF weights also reflect rows that end up in the test set.
    vectorizer = make_blurb_tfidf_pipeline()
    X = vectorizer.fit_transform(df["blurb"].fillna(""))

    # Project success as a binary target (successful = 1, everything else = 0)
    y = (df["state"] == "successful").astype(int)

    return X, y, vectorizer


X, y, vectorizer = load_and_vectorize_blurbs()
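
Because the cell above fits the vectorizer on the full dataset, the test rows influence the vocabulary and IDF weights. One way to avoid this is to split the raw text first and fit the vectorizer on the training part only. The sketch below shows that pattern on invented toy data, using scikit-learn's `TfidfVectorizer` and `make_pipeline` as assumed stand-ins for the project's own pipeline. It is an alternative workflow, not the one that produced the results in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy texts and labels, invented for illustration only.
texts = [
    "space board game", "leather wallet shop",
    "space documentary film", "handmade wooden toys",
    "card game expansion", "indie music album",
    "strategy game reprint", "poetry zine print",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the test texts stay unseen during fitting.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))
model.fit(texts_train, y_train)

# At prediction time the already-fitted vectorizer is only applied (transform),
# so words that occur solely in the test texts are ignored.
test_pred = model.predict(texts_test)
print(test_pred)
```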


In [3]:
def split_train_test(X, y, test_size: float = 0.2, random_state: int = 42):
    """Split into train and test sets, stratified on y to preserve the class balance."""
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=test_size,
        random_state=random_state,
        stratify=y,
    )
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = split_train_test(X, y)
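
The effect of `stratify=y` can be checked on a small example. The sketch below uses invented toy labels (60 negatives, 40 positives) and the same `test_size` and `random_state` as above, then compares the share of positives in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both parts keep the 40 % positive share of the full data.
print(f"train positives: {y_tr.mean():.2f}")
print(f"test positives:  {y_te.mean():.2f}")
```

Without `stratify`, the two shares would generally differ from split to split, which makes metrics harder to compare across runs.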


In [4]:
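def train_logreg_baseline(X_train, y_train) -> LogisticRegression:
    """Fit a Logistic Regression baseline on the TF-IDF training features."""
    # max_iter=200 doubles scikit-learn's default of 100 iterations.
    clf = LogisticRegression(max_iter=200, n_jobs=-1)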
    clf.fit(X_train, y_train)
    return clf


clf = train_logreg_baseline(X_train, y_train)


In [5]:
def evaluate_baseline(clf, X_test, y_test) -> None:
    """Print accuracy, the confusion matrix, and the classification report for the test set."""
    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print("=== Baseline: blurb only (Logistic Regression) ===")
    print(f"Accuracy: {acc:.4f}\n")
    print("Confusion matrix:")
    print(cm)
    print("\nClassification report:")
    print(report)


evaluate_baseline(clf, X_test, y_test)


=== Baseline: blurb only (Logistic Regression) ===
Accuracy: 0.6669

Confusion matrix:
[[48985 16081]
 [23653 30552]]

Classification report:
              precision    recall  f1-score   support

           0       0.67      0.75      0.71     65066
           1       0.66      0.56      0.61     54205

    accuracy                           0.67    119271
   macro avg       0.66      0.66      0.66    119271
weighted avg       0.67      0.67      0.66    119271
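
The figures in the report can be reproduced directly from the confusion matrix printed above. The sketch below does this in plain Python and also computes the share of the larger class, which is the accuracy a classifier would reach by always predicting that class. The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above.
# scikit-learn convention: rows are true classes, columns are predicted classes.
cm = [[48985, 16081],
      [23653, 30552]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of the predicted successes, how many were real
recall_1 = tp / (tp + fn)     # of the real successes, how many were found

# Accuracy of always predicting the larger class (class 0 here).
majority_share = max(tn + fp, fn + tp) / total

print(f"accuracy:       {accuracy:.4f}")
print(f"precision (1):  {precision_1:.2f}")
print(f"recall (1):     {recall_1:.2f}")
print(f"majority share: {majority_share:.4f}")
```

Comparing `accuracy` with `majority_share` shows how much the blurb-only model adds over a constant prediction.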

