<a href="https://colab.research.google.com/github/mariah0134/Data-Science-Project/blob/main/Notebooks/Phase-3/baseline_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [44]:
#imports:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

Configure the Data Files:

In [45]:
DATA_CONFIG = [
    {
        "filename": "boe_cleaned.csv",
        "text_cols": ["Text"],
        "label": "boe"
    },
    {
        "filename": "qiwa_data_cleaned.csv",
        "text_cols": ["Content"],
        "label": "qiwa"
    },
    {
        "filename": "labor_law_faq_cleaned.csv",
        "text_cols": ["Question", "Answer"],
        "label": "faq"
    },
    {
        "filename": "istitlaa_cleaned.csv",
        "text_cols": ["Current Text", "Proposed Text"],
        "label": "istitlaa"
    }
]

Load and Process Data:

In [46]:
def clean_text(text):
    """Collapse runs of whitespace left over after merging the text columns."""
    text = str(text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

all_data_frames = []

for source in DATA_CONFIG:
    try:
        df = pd.read_csv(source["filename"])
        df['text'] = df[source["text_cols"]].fillna('').astype(str).agg(' '.join, axis=1)
        df['text'] = df['text'].apply(clean_text)
        df['source'] = source["label"]
        processed_df = df[['text', 'source']]
        processed_df = processed_df[processed_df['text'] != '']
        all_data_frames.append(processed_df)
        print(f"✅ Loaded successfully {source['label']} (# of Samples: {len(processed_df)})")

    except FileNotFoundError:
        print(f"❌ Couldn't find file: {source['filename']}")
    except Exception as e:
        print(f"❌ Something went wrong: {source['filename']}: {e}")

if not all_data_frames:
    print("❌ Couldn't load any data")
else:
    combined_df = pd.concat(all_data_frames, ignore_index=True)
    print(f"\nLoaded successfully # of samples: {len(combined_df)}")

    x = combined_df['text']
    y = combined_df['source']

✅ Loaded successfully boe (# of Samples: 248)
✅ Loaded successfully qiwa (# of Samples: 16)
✅ Loaded successfully faq (# of Samples: 16)
✅ Loaded successfully istitlaa (# of Samples: 16)

Loaded successfully # of samples: 296


Splitting the Data

In [47]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print(f"We split {len(x_train)} Training, {len(x_test)} Testing.")

try:
    pd.DataFrame(x_train, columns=['text']).to_csv('x_train.csv', index=False)
    pd.DataFrame(y_train, columns=['source']).to_csv('y_train.csv', index=False)
    pd.DataFrame(x_test, columns=['text']).to_csv('x_test.csv', index=False)
    pd.DataFrame(y_test, columns=['source']).to_csv('y_test.csv', index=False)

    print("✅ 4 files were saved successfully")
    print("x_train.csv")
    print("y_train.csv")
    print("x_test.csv")
    print("y_test.csv")
except Exception as e:
    print(f"Something went wrong: {e}")

We split 236 Training, 60 Testing.
✅ 4 files were saved successfully
x_train.csv
y_train.csv
x_test.csv
y_test.csv


Building the Baseline Model:

In [48]:
print(f"We have: {len(x_train)} Training samples, {len(x_test)} Testing samples.")

#Building the Pipeline (Encoding + Model)
baseline_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(random_state=42, solver='liblinear'))
])

#Training the model
baseline_pipeline.fit(x_train, y_train)
print("Training completed")

#Testing
y_pred = baseline_pipeline.predict(x_test)

print("\n--- Baseline Results ---")

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
results_dict = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
results_df = pd.DataFrame(results_dict).transpose()
print(results_df)

We have: 236 Training samples, 60 Testing samples.
Training completed

--- Baseline Results ---
Accuracy: 0.8500

Classification Report:
              precision  recall  f1-score  support
boe              0.8500    1.00  0.918919    51.00
faq              0.0000    0.00  0.000000     3.00
istitlaa         0.0000    0.00  0.000000     3.00
qiwa             0.0000    0.00  0.000000     3.00
accuracy         0.8500    0.85  0.850000     0.85
macro avg        0.2125    0.25  0.229730    60.00
weighted avg     0.7225    0.85  0.781081    60.00


## Model 1

**TF-IDF Encoding**

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=6000,
    ngram_range=(1,2)
)

X_train_tfidf = tfidf.fit_transform(x_train)
X_test_tfidf = tfidf.transform(x_test)

print("TF-IDF encoding done!")

TF-IDF encoding done!


This step converts the raw text into numerical features using TF-IDF (Term Frequency–Inverse Document Frequency).
TF-IDF gives a higher weight to terms that occur often in a document but rarely across the corpus, and a lower weight to terms that are common everywhere, which helps the model pick out the words that distinguish each text.
The vectorizer is fitted on the training set only and then applied to both the training and testing sets, so both are encoded with the same vocabulary and weights.
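
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a tiny corpus. The three documents are hypothetical (they are not taken from the project data), and the sketch uses plain whitespace tokenization rather than `TfidfVectorizer`'s own tokenizer. It follows the smoothed IDF formula and L2 normalization that scikit-learn applies by default.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the project data).
docs = [
    "labor contract renewal",
    "labor contract termination",
    "labor visa fees",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_weights(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_freq = Counter(tokens)
    # Smoothed IDF (scikit-learn's default): ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf_weights(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the first document, "renewal" appears in no other document and so receives the largest weight, while "labor" appears in all three and receives the smallest.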

**Building Model 1 (Random Forest Classifier)**



In [50]:
from sklearn.ensemble import RandomForestClassifier

model1 = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42
)

model1.fit(X_train_tfidf, y_train)
print("Model 1 (Random Forest) has been trained!")


Model 1 (Random Forest) has been trained!


In this step, we build Model 1 using the Random Forest Classifier, an ensemble machine-learning algorithm.
A Random Forest trains many decision trees on random subsets of the samples and features and combines their predictions, which makes it more stable and less prone to overfitting than a single decision tree.
We train the model on the TF-IDF-encoded training data.
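
The way the trees' predictions are combined can be checked on a small example. The sketch below uses hypothetical random features and labels (not the project's TF-IDF matrix) and a smaller forest than Model 1. In scikit-learn, the forest averages the class probabilities predicted by its trees rather than counting one hard vote per tree, so averaging the per-tree outputs by hand should reproduce `predict_proba`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy features and labels (not the project's TF-IDF matrix).
rng = np.random.default_rng(0)
X_toy = rng.random((40, 5))
y_toy = np.where(X_toy[:, 0] > 0.5, "boe", "qiwa")

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_toy, y_toy)

# Average the class probabilities predicted by each individual tree.
per_tree = [tree.predict_proba(X_toy) for tree in forest.estimators_]
averaged = np.mean(per_tree, axis=0)

# Compare the manual average with the forest's own probabilities.
matches = np.allclose(averaged, forest.predict_proba(X_toy))
print(f"Manual average matches forest.predict_proba: {matches}")
```

The predicted label is then the class with the highest averaged probability, which is what `model1.predict` returns for each test sample.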

**Predicting on the Test Set**

In [51]:
y_pred_model1 = model1.predict(X_test_tfidf)


Here, we use the trained Random Forest model to predict the class labels for the test set.
These predictions are compared with the true labels in the next step to evaluate the model’s performance.
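
One simple way to compare predictions with true labels is a cross-tabulation. The sketch below uses hypothetical label lists for illustration only; the names `y_true_toy` and `y_pred_toy` are not part of the notebook.

```python
import pandas as pd

# Hypothetical true and predicted labels, for illustration only.
y_true_toy = pd.Series(
    ["boe", "boe", "faq", "qiwa", "qiwa", "istitlaa"], name="actual"
)
y_pred_toy = pd.Series(
    ["boe", "boe", "boe", "qiwa", "boe", "istitlaa"], name="predicted"
)

# Rows are true classes, columns are predicted classes.
comparison = pd.crosstab(y_true_toy, y_pred_toy)
print(comparison)
```

Off-diagonal cells show which classes are confused with each other. In the notebook, the same call could presumably be made with `y_test.to_numpy()` and `y_pred_model1`.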

**Model Evaluation**

In [52]:
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred_model1)
print(f"Model 1 Accuracy: {accuracy:.4f}")

# zero_division=0 matches the baseline cell and avoids undefined-metric warnings
report = classification_report(y_test, y_pred_model1, output_dict=True, zero_division=0)
report_df = pd.DataFrame(report).transpose()
report_df


Model 1 Accuracy: 0.9167


              precision    recall  f1-score    support
boe            0.910714  1.000000  0.953271  51.000000
faq            0.000000  0.000000  0.000000   3.000000
istitlaa       1.000000  1.000000  1.000000   3.000000
qiwa           1.000000  0.333333  0.500000   3.000000
accuracy       0.916667  0.916667  0.916667   0.916667
macro avg      0.727679  0.583333  0.613318  60.000000
weighted avg   0.874107  0.916667  0.885280  60.000000


In this step, we measure the model’s performance using:

- Accuracy (the share of test samples classified correctly)
- Precision, recall, and F1-score for each class
- A detailed classification report showing how well the model performs for each dataset source

This evaluation shows whether the model can correctly classify text into its source: boe, faq, qiwa, or istitlaa.
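
As a worked check of these metrics, the sketch below recomputes them from the counts that the Model 1 report implies. The counts are a reconstruction from the reported precision, recall, and support values (51 boe, 3 istitlaa, and 1 qiwa predicted correctly, with the remaining 3 faq and 2 qiwa samples predicted as boe); they are not printed by the notebook itself. The helper `class_metrics` is hypothetical.

```python
# (true label, predicted label) -> count, reconstructed from the report.
# This is an inference from the reported metrics, not notebook output.
pairs = {
    ("boe", "boe"): 51,
    ("faq", "boe"): 3,
    ("istitlaa", "istitlaa"): 3,
    ("qiwa", "qiwa"): 1,
    ("qiwa", "boe"): 2,
}


def class_metrics(label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    true_pos = pairs.get((label, label), 0)
    predicted = sum(n for (_, pred), n in pairs.items() if pred == label)
    actual = sum(n for (true, _), n in pairs.items() if true == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


total = sum(pairs.values())
accuracy_check = sum(n for (true, pred), n in pairs.items() if true == pred) / total
print(f"accuracy: {accuracy_check:.4f}")

for label in ["boe", "faq", "istitlaa", "qiwa"]:
    precision, recall, f1 = class_metrics(label)
    print(f"{label:10s} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under this reconstruction, the high accuracy comes mostly from the 51 boe samples, while no faq sample is recognized. This is why the macro-averaged scores in the report are much lower than the accuracy.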

**Saving Model Results**

In [54]:
model1_accuracy = accuracy
model1_report_df = report_df

report_df.to_csv("model1_report.csv")
print("Model 1 metrics saved to model1_report.csv")


Model 1 metrics saved to model1_report.csv
