# Logistic Regression Model Analysis
### Raha Musavi-Masterarbeit-VAMoS
This notebook provides a detailed analysis of the trained Logistic Regression model for identifying the presence of Ezafe in Middle Persian nominal phrases. It covers the evaluation metrics, ROC analysis, learning curve, data integrity checks, random label shuffling test, and error analysis, aligning with Sections 5.9 and 5.10 of the thesis.

## Setup and Data Loading

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import joblib # For loading the saved model
from sklearn.model_selection import train_test_split, learning_curve, StratifiedKFold # learning_curve, StratifiedKFold for learning curve
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score
from imblearn.over_sampling import RandomOverSampler # For reproducing the split
from sklearn.utils import shuffle # For random label shuffling
import matplotlib.pyplot as plt
import seaborn as sns
import os # To check if files exist

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

In [2]:
# --- Configuration --- (Matches lr_train_evaluate.py configuration for consistency)
INPUT_FEATURES_CSV = "lr_final_features.csv"
INPUT_TARGET_CSV = "lr_target.csv"
INPUT_MODEL_PATH = "final_logistic_regression_model.pkl"

# Configuration needed to reproduce the train/test split and resampling
TEST_SIZE = 0.3
RANDOM_STATE = 42

# Configuration for Learning Curve plot
CV_FOLDS_LC = 5 # Number of folds for learning curve calculation (often less than training CV for speed)

# --- Load Data and Model --- (Assumes lr_feature_engineering.py and lr_train_evaluate.py were run)
if not os.path.exists(INPUT_FEATURES_CSV) or not os.path.exists(INPUT_TARGET_CSV) or not os.path.exists(INPUT_MODEL_PATH):
    print("Error: Required input files not found.")
    print(f"Please ensure '{INPUT_FEATURES_CSV}', '{INPUT_TARGET_CSV}', and '{INPUT_MODEL_PATH}' exist.")
else:
    X = pd.read_csv(INPUT_FEATURES_CSV)
    y = pd.read_csv(INPUT_TARGET_CSV).squeeze() # Use squeeze() to get a Series
    best_model = joblib.load(INPUT_MODEL_PATH)

    print(f"Loaded final features (X) shape: {X.shape}")
    print(f"Loaded target (y) shape: {y.shape}")
    print("Loaded trained Logistic Regression model.")

    # --- Reproduce Train/Test Split and Resampling --- (Same as lr_train_evaluate.py)
    # Balance the classes by random oversampling first, then split, so the test set matches the one used in lr_train_evaluate.py.
    oversampler = RandomOverSampler(random_state=RANDOM_STATE)
    X_resampled, y_resampled = oversampler.fit_resample(X, y)
    print(f"\nResampled data shape: {X_resampled.shape}")
    print(f"Resampled target distribution:\n{y_resampled.value_counts()}")

    # Split data into train and test sets. Stratifying keeps the (now balanced) class ratio identical in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X_resampled, y_resampled,
        test_size=TEST_SIZE,
        random_state=RANDOM_STATE,
        stratify=y_resampled # Use stratify to maintain class distribution
    )
    print(f"\nReproduced Train data shape: {X_train.shape}")
    print(f"Reproduced Test data shape: {X_test.shape}")

Error: Required input files not found.
Please ensure 'lr_final_features.csv', 'lr_target.csv', and 'final_logistic_regression_model.pkl' exist.
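
The resample-then-split step in the cell above can be illustrated on toy data. The following is a minimal standard-library sketch, not the imblearn or scikit-learn implementation: `random_oversample` and `stratified_split` are hypothetical helpers written for illustration, and the ten toy rows are invented. It also counts how many test rows have an identical copy in the training set, which can happen when oversampling precedes the split and may be worth checking alongside the data integrity checks.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (simplified stand-in)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, test_indices), splitting each class separately
    so both sets keep the same class ratio (simplified stand-in)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test

# Invented toy data: 8 rows of class 0 and 2 rows of class 1.
rows = [(i,) for i in range(10)]
labels = [0] * 8 + [1] * 2

rows_rs, labels_rs = random_oversample(rows, labels)
balanced_counts = Counter(labels_rs)
train_idx, test_idx = stratified_split(labels_rs)

# Count test rows whose exact feature values also occur in the training set.
train_rows = {rows_rs[i] for i in train_idx}
overlap = sum(rows_rs[i] in train_rows for i in test_idx)

print("Class counts after oversampling:", dict(balanced_counts))
print(f"{overlap} of {len(test_idx)} test rows also occur in the training set")
```

The overlap count depends on the data and the seed, so no particular value should be expected.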


## 5.9 Results
This section presents the primary evaluation metrics for the optimal Logistic Regression model on the test set, as discussed in Section 5.9.

In [None]:
# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

print("--- Test Set Evaluation (Section 5.9) ---")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
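
The accuracy and the per-class figures in the classification report all derive from the four confusion-matrix counts. The following is a minimal sketch of that arithmetic for the positive class, assuming binary labels coded as 0 and 1; `binary_metrics` is a hypothetical helper and the demo labels are invented, not results from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class
    from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented demo labels (1 = Ezafe present, 0 = absent is an assumed coding).
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

On the notebook's own data, the same counts are available from `confusion_matrix(y_test, y_pred)`, which is already imported above.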