In [1]:
# Import libraries
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
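# Load the data and drop rows with missing values
data = pd.read_csv("Breast_Cancer.csv")
data = data.dropna()
# Encode the target: each distinct tumor size becomes its own integer class
label = LabelEncoder()
data['Tumor Size'] = label.fit_transform(data['Tumor Size'])
data.head()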

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,3,Positive,Positive,24,1,60,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,34,Positive,Positive,14,5,62,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,62,Positive,Positive,14,7,75,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,17,Positive,Positive,2,1,84,Alive
4,47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,40,Positive,Positive,3,1,50,Alive


In [3]:
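# Split features and target, holding out 20% of rows for testing
# Note: 'Unnamed: 0' (the index column saved in the CSV) stays in X as a feature
X = data.drop('Tumor Size', axis=1)
y = data['Tumor Size']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)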

In [5]:
# One-hot encode categorical columns
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'  # Keep other columns
)

X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)


In [6]:
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_encoded, y_train)

In [7]:
# Predict on the test set, which was already encoded with the fitted preprocessor
y_pred = rf.predict(X_test_encoded)

In [11]:
# Evaluate
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')  # or 'macro' or 'micro'

print("Accuracy:", acc)
print("F1-score:", f1)

Accuracy: 0.0968944099378882
F1-score: 0.08481980930457798


**Part 3: Ethical Reflection**

Deploying a predictive model trained on the Breast Cancer dataset in a corporate environment to allocate issue priorities raises ethical concerns. One major risk is dataset bias: if the training data underrepresents certain teams, departments, or issue types, the model may systematically deprioritize their tickets. For instance, if issues from a particular team were often marked “low priority” in the historical data, the model may perpetuate that pattern even when the issues are critical. The result can be unfair resource allocation and eroded trust in the system.
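As a rough illustration of how such a pattern could be spotted before training, the sketch below compares the share of “low priority” labels across groups. The ticket table, the `team` and `priority` column names, and the helper function are all hypothetical; none of them come from the notebook above.

```python
import pandas as pd

# Hypothetical ticket history; `team` and `priority` are illustrative column names.
tickets = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "priority": ["high", "low", "high", "high", "low", "low", "low", "high"],
})

def low_priority_rate_by_group(df, group_col, label_col, low_value="low"):
    """Return the share of rows labelled `low_value` within each group."""
    is_low = df[label_col] == low_value
    return is_low.groupby(df[group_col]).mean()

rates = low_priority_rate_by_group(tickets, "team", "priority")
print(rates)
```

A large gap between groups does not prove the labels are unfair, but it marks where a model is likely to inherit the historical pattern and where a closer review is warranted.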

Tools like IBM’s AI Fairness 360 (AIF360) can help mitigate these risks. AIF360 provides metrics to detect bias (e.g., disparate impact, equal opportunity difference) and algorithms to reduce it (e.g., reweighing, adversarial debiasing). By integrating fairness checks into the ML pipeline, organizations can detect and reduce such disparities before a model reaches production. Transparency, regular audits, and stakeholder feedback loops remain essential to maintaining fairness and accountability in AI-driven decision-making.
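To make the two metrics named above concrete, here is a minimal sketch that computes them by hand with NumPy rather than through AIF360's own API. The arrays are toy data, and the encoding (1 = favorable outcome, group 0 = unprivileged, group 1 = privileged) is an assumption made for illustration.

```python
import numpy as np

def disparate_impact(y_pred, group):
    """Ratio of favorable-prediction rates: unprivileged (0) over privileged (1)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true positive rates: unprivileged (0) minus privileged (1)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))

    def true_positive_rate(g):
        # Among members of group g whose true label is favorable,
        # the fraction that the model also predicted as favorable.
        mask = (group == g) & (y_true == 1)
        return y_pred[mask].mean()

    return true_positive_rate(0) - true_positive_rate(1)

# Toy data (assumed): 1 = favorable outcome, e.g. "ticket escalated".
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

di = disparate_impact(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print("Disparate impact:", di)
print("Equal opportunity difference:", eod)
```

A disparate impact near 1 and an equal opportunity difference near 0 indicate similar treatment of the two groups. A commonly cited rule of thumb flags disparate impact below 0.8, though the right threshold depends on the application.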