# Ensemble Model
We aim to improve the recall of DeBERTa: although it was the best of our
baseline models overall, its recall was low.
Hence, we add decision logic for the samples that DeBERTa predicts to be
negative.

To reduce the number of false negatives, a negative prediction is swapped
to a positive one if the models applied after DeBERTa all agree on a
positive prediction.

This approach increases recall (as we make fewer negative predictions),
but it also increases the number of false positives (as we make more positive
predictions overall).
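
The rule described above can be sketched in plain Python. The helper name `flip_if_all_agree` and the toy inputs are hypothetical and only illustrate the logic; the notebook itself builds the same decision column by column with polars.

```python
def flip_if_all_agree(base: int, auxiliary: list[int]) -> int:
    """Return 1 if the base model predicts positive, or if it predicts
    negative but every auxiliary model predicts positive."""
    if base == 1:
        return 1  # positive predictions are never changed
    # A negative prediction is flipped only on unanimous agreement.
    return int(len(auxiliary) > 0 and all(a == 1 for a in auxiliary))


# Toy example (hypothetical data): (base prediction, auxiliary predictions)
samples = [(1, [0, 0, 0]), (0, [1, 1, 1]), (0, [1, 0, 1])]
flipped = [flip_if_all_agree(base, aux) for base, aux in samples]
print(flipped)  # [1, 1, 0]
```

Because a positive prediction is never turned into a negative one, the number of true positives cannot drop, so recall can only stay equal or rise.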

In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
import numpy as np
import polars
from polars import DataFrame, Series, col
from transformers import AutoModelForSequenceClassification
from label_legends.deberta import load_dataset, load_deberta
from label_legends.preprocess import holdout, load_data, load_test, transform, load_own
from label_legends.female import predict_female
from label_legends.swears_negative import predict_swear, predict_negative_sentiment
from label_legends.result import calculate_scores
from label_legends.util import ROOT

# Use DeBERTa to create initial predictions
We fine-tune DeBERTa on our train set as we
did for the baseline and report the performance of
DeBERTa alone.

In [3]:
deberta = load_deberta()

In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

  self.pid = os.fork()


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




DeBERTa is trained for 5 epochs; for details, see `deberta.py` in our package.

In [4]:
train_out = deberta.train()
train_out.metrics


🏃 View run ./results at: https://mlflow.mahluke.page/#/experiments/0/runs/84afb3bcc4784f47b2b4d4b838223629
🧪 View experiment at: https://mlflow.mahluke.page/#/experiments/0


{'train_runtime': 3930.9354,
 'train_samples_per_second': 37.396,
 'train_steps_per_second': 2.338,
 'total_flos': 7025499503316000.0,
 'train_loss': 0.2813671406776005,
 'epoch': 5.0}

In [7]:
test = load_own() # replace with `load_test()` if test data should be used
test_transformed = transform(test)
deberta_prediction = deberta.predict(load_dataset(test_transformed["text"].to_list(), test_transformed["label"].to_list()))
deberta_prediction





PredictionOutput(predictions=array([[ 0.22100908, -0.3849549 ],
       [ 0.20448087, -0.35345423],
       [ 1.4354323 , -1.9412558 ],
       [ 1.3453149 , -1.7552105 ],
       [ 1.6932449 , -2.0605826 ],
       [ 1.330438  , -1.7849141 ],
       [ 0.00815822, -0.13858354],
       [ 1.4483907 , -1.8163158 ],
       [ 1.6816638 , -2.0947795 ],
       [ 1.3424137 , -1.6290814 ],
       [ 1.4861088 , -1.8607656 ],
       [ 1.5169125 , -1.922798  ],
       [ 1.5939958 , -1.9327121 ],
       [ 1.6717114 , -2.0650768 ],
       [ 1.3191297 , -1.8152466 ],
       [ 1.5501674 , -1.966885  ],
       [ 0.01321473, -0.18853423],
       [ 0.04816356, -0.10234571],
       [ 0.5855616 , -0.77393943],
       [ 0.6758357 , -0.7819318 ],
       [ 0.32601324, -0.5094394 ],
       [ 0.9591873 , -1.4126728 ],
       [ 1.2254938 , -1.559727  ],
       [ 1.1200161 , -1.5564845 ],
       [ 1.4277071 , -1.8830895 ],
       [ 1.1576838 , -1.5392526 ],
       [ 1.2855673 , -1.7444466 ],
       [ 1.6091292 , -2.05

In [10]:
predictions = DataFrame({"id": test["id"], "label": deberta_prediction.label_ids, "deberta": np.argmax(deberta_prediction.predictions, axis=1)})
predictions.head()

id,label,deberta
i64,i64,i64
1,1,0
2,1,0
3,0,0
4,1,0
5,0,0
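
`np.argmax(..., axis=1)` picks, for each row of logits, the index of the larger value. A minimal plain-Python sketch of the same step follows, using the first and third logit pairs from the `PredictionOutput` above. The `softmax` helper is an assumption added for illustration only and is not part of the notebook.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: list[float]) -> int:
    """Index of the largest logit, i.e. argmax over the class axis."""
    return max(range(len(logits)), key=lambda i: logits[i])


rows = [[0.22100908, -0.3849549], [1.4354323, -1.9412558]]
labels = [predict_label(r) for r in rows]
probs = [softmax(r) for r in rows]
print(labels)  # [0, 0]
```

Both rows are predicted negative, but the first one is much closer to the decision boundary than the second.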


In [11]:
scores_deberta = calculate_scores(predictions["label"], predictions["deberta"])
scores_deberta

precision:	0.8750
recall:		0.2500
fscore:		0.3889
accuracy:	0.6333
tn: 31	 fp: 1
fn: 21	 tp: 7
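
The scores above follow from the confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, F1 and accuracy; the implementation of `calculate_scores` is not shown in this notebook, so `scores_from_counts` is a hypothetical stand-in that assumes it uses the same formulas.

```python
def scores_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary classification scores from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
        "accuracy": accuracy,
    }


# Counts taken from the DeBERTa output above.
deberta_scores = scores_from_counts(tp=7, fp=1, fn=21, tn=31)
print({k: round(v, 4) for k, v in deberta_scores.items()})
# {'precision': 0.875, 'recall': 0.25, 'fscore': 0.3889, 'accuracy': 0.6333}
```

The low recall is driven by the 21 false negatives, which are the predictions the post-processing targets.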

# Post-Process negative predictions
To improve the base results from above, we employ three additional models:
a detector for references to a female, a negative-sentiment detector, and a swear-word detector.
If all of them predict positive, our ensemble swaps a negative
DeBERTa prediction to positive, which allows us to achieve a higher recall.

## Predict if sentence is referencing a female

In [12]:
def assign_type(prediction: int, label: str):
    if prediction == 0:
        if label == "not sexist":
            return "tn"
        return "fn"
    if label == "sexist":
        return "tp"
    return "fp"


# Attach the gold labels and tag each DeBERTa prediction as tp/fp/tn/fn.
predictions = (
    predictions.join(test, on="id")
    .with_columns(
        polars.struct(["deberta", "label_sexist"])
        .map_elements(lambda x: assign_type(x["deberta"], x["label_sexist"]), return_dtype=polars.String)
        .alias("type")
    )
    .select(["id", "type", "label", "deberta"])
)

predictions.head()

id,type,label,deberta
i64,str,i64,i64
1,"""fn""",1,0
2,"""fn""",1,0
3,"""tn""",0,0
4,"""fn""",1,0
5,"""tn""",0,0


Add the predictions whether a text is about a female or not:

In [13]:
def alt_on_neg_pred(prediction: int, alternative: int):
    """Use the alternative if prediction is negative"""
    if prediction == 1:
        return 1
    return alternative


predictions = predictions.join(predict_female(test), on="id").with_columns(
    polars.struct(["female", "deberta"])
    .map_elements(lambda x: alt_on_neg_pred(x["deberta"], x["female"]), return_dtype=polars.Int64)
    .alias("pred_female"))


We see that the prediction of most false negatives (about 62%) is reconsidered, as those contain some female phrase. However, a share of the true negatives (about 23%) is also reconsidered, which potentially lowers precision. As we are focusing on increasing recall, this is a trade-off we have to make.

In [14]:
predictions.group_by("female", "type").len().with_columns((polars.col("len") / polars.sum("len").over("type")).alias("perc")).filter(col("female")==1).sort("len").rename({"len": "samples"})

female,type,samples,perc
i32,str,u32,f64
1,"""tp""",6,0.857143
1,"""tn""",7,0.225806
1,"""fn""",13,0.619048
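
The `perc` column divides each group's sample count by the total number of samples of the same `type`. A small sketch of that arithmetic, assuming the per-type totals from the DeBERTa confusion matrix above (the dictionary names are hypothetical):

```python
# Totals per type, taken from the DeBERTa confusion matrix above.
totals = {"tp": 7, "tn": 31, "fn": 21}
# Samples per type that mention a female (female == 1), from the table.
female_counts = {"tp": 6, "tn": 7, "fn": 13}

# Share of each type that would be reconsidered by the female check.
perc = {t: female_counts[t] / totals[t] for t in female_counts}
for t, share in perc.items():
    print(f"{t}: {share:.6f}")
# tp: 0.857143
# tn: 0.225806
# fn: 0.619048
```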


In [15]:
predictions.head()

id,type,label,deberta,female,pred_female
i64,str,i64,i64,i32,i64
1,"""fn""",1,0,1,1
2,"""fn""",1,0,1,1
3,"""tn""",0,0,0,0
4,"""fn""",1,0,1,1
5,"""tn""",0,0,0,0


Using only `female`, without additional checks, raises recall considerably (from 0.25 to about 0.71 in the evaluation below), but it costs precision (from 0.875 to about 0.71), since every true negative that mentions a female becomes a false positive. To avoid degrading the model that strongly, we add further checks to decide when to change a negative prediction by DeBERTa to a positive one.

In [16]:
scores_female = calculate_scores(predictions["label"], predictions["pred_female"])
scores_female

precision:	0.7143
recall:		0.7143
fscore:		0.7143
accuracy:	0.7333
tn: 24	 fp: 8
fn: 8	 tp: 20

## Sentiment analysis
Predict whether the text has a negative connotation. If it does, and the text is also targeted towards a female, the sample is predicted to be sexist.




In [17]:
predictions = predictions.join(predict_negative_sentiment(test), on="id").with_columns(
    polars.struct(["negative", "deberta"])
    .map_elements(lambda x: alt_on_neg_pred(x["deberta"], x["negative"]), return_dtype=polars.Int64)
    .alias("pred_negative"))
predictions.head()


id,type,label,deberta,female,pred_female,negative,pred_negative
i64,str,i64,i64,i32,i64,i32,i64
1,"""fn""",1,0,1,1,0,0
2,"""fn""",1,0,1,1,0,0
3,"""tn""",0,0,0,0,1,1
4,"""fn""",1,0,1,1,0,0
5,"""tn""",0,0,0,0,1,1


In [18]:
predictions.group_by("negative", "type", "female").len().with_columns((polars.col("len") / polars.sum("len").over("type")).alias("perc")).filter(col("type").is_in(["fn", "tn"])).sort("len").rename({"len": "samples"}).sort("type")

negative,type,female,samples,perc
i32,str,i32,u32,f64
1,"""fn""",0,2,0.095238
1,"""fn""",1,3,0.142857
0,"""fn""",0,6,0.285714
0,"""fn""",1,10,0.47619
1,"""tn""",1,2,0.064516
0,"""tn""",1,5,0.16129
1,"""tn""",0,9,0.290323
0,"""tn""",0,15,0.483871


In [19]:
scores_negative = calculate_scores(predictions["label"], predictions["pred_negative"])
scores_negative

precision:	0.5000
recall:		0.4286
fscore:		0.4615
accuracy:	0.5333
tn: 20	 fp: 12
fn: 16	 tp: 12

### Combine female & negative

In [20]:
predictions = predictions.with_columns(
    polars.struct(["pred_negative", "pred_female"])
    .map_elements(lambda x: x["pred_negative"] and x["pred_female"], return_dtype=polars.Int64)
    .alias("pred_female_negative"))


In [21]:
scores_female_negative = calculate_scores(predictions["label"], predictions["pred_female_negative"])
scores_female_negative

precision:	0.7692
recall:		0.3571
fscore:		0.4878
accuracy:	0.6500
tn: 29	 fp: 3
fn: 18	 tp: 10

## Bad words
Check whether the text contains a 'bad word'; if so, predict the sample to be sexist.

In [22]:
predictions = predictions.join(predict_swear(test), on="id").with_columns(
    polars.struct(["swear", "deberta"])
    .map_elements(lambda x: alt_on_neg_pred(x["deberta"], x["swear"]), return_dtype=polars.Int64)
    .alias("pred_swear"))


In [23]:
predictions.group_by("negative", "swear", "type").len().with_columns((polars.col("len") / polars.sum("len").over("type")).alias("perc")).filter(col("type").is_in(["fn", "tn"])).sort("len").rename({"len": "samples"})

negative,swear,type,samples,perc
i32,i32,str,u32,f64
0,1,"""fn""",1,0.047619
1,1,"""fn""",1,0.047619
0,1,"""tn""",2,0.064516
1,0,"""fn""",4,0.190476
1,0,"""tn""",11,0.354839
0,0,"""fn""",15,0.714286
0,0,"""tn""",18,0.580645


In [24]:
scores_swear = calculate_scores(predictions["label"], predictions["pred_swear"])
scores_swear

precision:	0.7500
recall:		0.3214
fscore:		0.4500
accuracy:	0.6333
tn: 29	 fp: 3
fn: 19	 tp: 9

## Ensemble
Combine all three checks: a negative DeBERTa prediction is changed to sexist only if the text is about a female AND it is negative AND it contains swear words.


In [25]:
predictions = predictions.with_columns(
    polars.struct(["pred_female_negative", "pred_swear"])
    .map_elements(lambda x: x["pred_female_negative"] and x["pred_swear"], return_dtype=polars.Int64)
    .alias("pred_ensemble"))


In [26]:
scores_ensemble = calculate_scores(predictions["label"], predictions["pred_ensemble"])
scores_ensemble

precision:	0.8889
recall:		0.2857
fscore:		0.4324
accuracy:	0.6500
tn: 31	 fp: 1
fn: 20	 tp: 8
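
The `pred_ensemble` column is the logical AND of three columns that were each built with `alt_on_neg_pred`. The sketch below checks exhaustively that this column-wise construction is equivalent to the direct rule "DeBERTa positive, or all three checks positive". `alt_on_neg_pred` is copied from the notebook; the two `ensemble_*` helpers are hypothetical names introduced for this check.

```python
from itertools import product


def alt_on_neg_pred(prediction: int, alternative: int) -> int:
    """Use the alternative if prediction is negative (as in the notebook)."""
    if prediction == 1:
        return 1
    return alternative


def ensemble_columnwise(deberta: int, female: int, negative: int, swear: int) -> int:
    """Mirror the notebook: build each pred_* column, then AND them."""
    pred_female = alt_on_neg_pred(deberta, female)
    pred_negative = alt_on_neg_pred(deberta, negative)
    pred_swear = alt_on_neg_pred(deberta, swear)
    pred_female_negative = pred_negative and pred_female
    return int(pred_female_negative and pred_swear)


def ensemble_direct(deberta: int, female: int, negative: int, swear: int) -> int:
    """Positive if DeBERTa is positive or all three checks agree."""
    return int(deberta == 1 or (female == 1 and negative == 1 and swear == 1))


# Compare both formulations on all 16 combinations of binary inputs.
mismatches = [
    bits
    for bits in product([0, 1], repeat=4)
    if ensemble_columnwise(*bits) != ensemble_direct(*bits)
]
print(mismatches)  # []
```

Requiring all three checks explains why the ensemble recovers only one additional true positive here: few false negatives are flagged by every check at once.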

# Collect scores

In [27]:
scores_df = DataFrame([scores.asdict() for scores in [scores_deberta, scores_female, scores_negative, scores_swear, scores_female_negative, scores_ensemble]]).with_columns_seq(Series(name="model", values=["deberta", "female", "negative", "swear", "female_negative", "ensemble"])).select("model", "precision", "recall", "fscore", "accuracy", "tp", "tn", "fp","fn")
scores_df.head()

model,precision,recall,fscore,accuracy,tp,tn,fp,fn
str,f64,f64,f64,f64,i64,i64,i64,i64
"""deberta""",0.875,0.25,0.388889,0.633333,7,31,1,21
"""female""",0.714286,0.714286,0.714286,0.733333,20,24,8,8
"""negative""",0.5,0.428571,0.461538,0.533333,12,20,12,16
"""swear""",0.75,0.321429,0.45,0.633333,9,29,3,19
"""female_negative""",0.769231,0.357143,0.487805,0.65,10,29,3,18


# Result storage
We save the scores and all predictions in the `resource` directory so that they can be loaded in
the `ensemble_analysis.ipynb` notebook, where we do a more thorough analysis
of the results and create some plots.

In [28]:
scores_df.write_csv(ROOT / "resource" / "ensemble_scores.csv")

In [29]:
predictions.write_csv(ROOT / "resource" / "ensemble_predictions.csv")