# Text Classification

This project focused on two subtasks from the OffensEval 2019 shared task, part of SemEval, centered on offensive language detection in social media posts. The work included Subtask A and a modified version of Subtask B.

Subtask A involved identifying whether a post contained offensive language. Offensive content included insults, threats, and other unacceptable expressions. The problem was framed as a binary text classification task.

Subtask B required distinguishing whether an offensive post was targeted or untargeted. Targeted posts directed insults or threats at specific individuals or groups, while untargeted posts used offensive language without a specific target. In this assignment, a three-label formulation was used: targeted, untargeted, and non-offensive, resulting in a multiclass classification setup.

The project used scikit-learn for feature extraction, model development, and evaluation, applying its standard text-processing and machine-learning components to build the required classifiers.

In [22]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score, accuracy_score

The dataset contained 13240 annotated tweets for training and 860 tweets for testing. Each tweet included labels for Subtask A with values True or False and Subtask B with values TIN, UNT, and NOT. Sentiment annotations were also provided and were incorporated later in the workflow. The data was loaded into two DataFrames representing the training and test splits.

In [23]:
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")
train[["tweet", "sentiment", "subtask_a", "subtask_b"]]

Unnamed: 0,tweet,sentiment,subtask_a,subtask_b
0,@USER She should ask a few native Americans wh...,neutral,True,UNT
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,negative,True,TIN
2,Amazon is investigating Chinese employees who ...,neutral,False,NOT
3,"@USER Someone should'veTaken"" this piece of sh...",negative,True,UNT
4,@USER @USER Obama wanted liberals &amp; illega...,negative,False,NOT
...,...,...,...,...
13235,@USER Sometimes I get strong vibes from people...,negative,True,TIN
13236,Benidorm ✅ Creamfields ✅ Maga ✅ Not too sh...,positive,False,NOT
13237,@USER And why report this garbage. We don't g...,negative,True,TIN
13238,@USER Pussy,negative,True,UNT
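The two label columns are related: in the three-label formulation described above, a tweet is offensive exactly when its Subtask B label is not `NOT`. The following is a minimal sketch of that sanity check on a hypothetical four-row frame (`toy_labels` is an illustrative name, not part of the project); the same comparison could be run on `train`.

```python
import pandas as pd

# Hypothetical miniature of the label columns, mirroring the rows shown above.
toy_labels = pd.DataFrame({
    "subtask_a": [True, True, False, True],
    "subtask_b": ["UNT", "TIN", "NOT", "TIN"],
})

# An offensive tweet (subtask_a == True) should carry TIN or UNT, never NOT.
derived_a = toy_labels["subtask_b"] != "NOT"
labels_agree = bool((derived_a == toy_labels["subtask_a"]).all())
print(labels_agree)
```

If the check failed on the real data, the two subtasks would be using inconsistent annotations and the multiclass results would be harder to interpret.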


# Text Representation

Text representation was performed to convert the tweets into numerical feature vectors for both subtasks. A bag-of-words model based on tf-idf was used. The representation was obtained using scikit-learn's `TfidfVectorizer`, which provided built-in tokenization, stop-word handling, and other text pre-processing options. The `create_tfidfvectorizer` function created and returned a `TfidfVectorizer` object with all parameters set to their default values, following the library's documentation.

In [24]:
def create_tfidfvectorizer():
  vectorizer = TfidfVectorizer()
  return vectorizer

In [25]:
vectorizer = create_tfidfvectorizer()
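To make the default behaviour concrete, here is a small sketch on three made-up tweets (`toy_tweets` and the other `toy_*` names are illustrative assumptions). With default parameters the vectorizer lowercases the text, keeps only tokens of two or more word characters (so `a` and the `@` sign disappear), and does not remove stop words unless `stop_words` is set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up examples, not taken from the dataset.
toy_tweets = [
    "@USER you are a star",
    "You are NOT welcome here",
    "a quiet day here",
]

toy_vectorizer = TfidfVectorizer()  # all defaults, as in create_tfidfvectorizer
toy_matrix = toy_vectorizer.fit_transform(toy_tweets)

# The learned vocabulary: lowercased, single-character tokens dropped.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (number of tweets, vocabulary size)
```

Common words such as `you` and `are` stay in the vocabulary here, which is why the real representation reaches so many dimensions.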

The `TfidfVectorizer` was applied to the dataset to generate numerical representations of the tweets. The vectorizer was first trained on the training set to learn the vocabulary and idf values. It was then used to transform both the training and test sets, producing feature vectors for each tweet. The resulting representations had 19,083 dimensions, with the train set shaped as (13240, 19083) and the test set shaped as (860, 19083). The `run_vectorizer` function performed this training and transformation process.

In [26]:
def run_vectorizer(vectorizer, train, test):
  X_train = vectorizer.fit_transform(train)
  X_test = vectorizer.transform(test)
  return X_train, X_test

In [27]:
train_x, test_x = run_vectorizer(vectorizer, train["tweet"], test["tweet"])
print(f"Shape of train input data: {train_x.get_shape()}")
print(f"Shape of test input data: {test_x.get_shape()}")

Shape of train input data: (13240, 19083)
Shape of test input data: (860, 19083)
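Both splits end up with the same number of columns because the vocabulary is learned from the training tweets only. The sketch below, on made-up sentences, illustrates that `transform` silently ignores words never seen during fitting (the `toy_*` names are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["good morning everyone", "good night"]
toy_test = ["good evening everyone"]  # "evening" never appears in toy_train

toy_vectorizer = TfidfVectorizer()
toy_train_x = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary and idf
toy_test_x = toy_vectorizer.transform(toy_test)        # reuses them unchanged

print(toy_train_x.shape, toy_test_x.shape)
print("evening" in toy_vectorizer.vocabulary_)
print(toy_test_x.nnz)  # non-zero entries: only the known words contribute
```

Calling `fit_transform` on the test set instead would build a different vocabulary and produce columns that the classifier cannot interpret.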


## Logistic Regression 

A Logistic Regression classifier was created to predict offensive language in tweets. The `create_model` function generated a `LogisticRegression` object with default parameters, including the Limited-memory BFGS (`lbfgs`) optimizer. With `lbfgs` as the default solver (scikit-learn 0.22 and later), multiclass problems are fitted with a multinomial (softmax) loss by default, whereas earlier releases defaulted to a one-vs-rest strategy. The maximum number of training iterations was increased to 1000 so that the optimizer could converge.


In [28]:
def create_model():
  model = LogisticRegression(max_iter=1000)
  return model

In [29]:
model = create_model()

The `LogisticRegression` model automatically adapted to the classification type based on the training labels, handling both binary and multiclass cases. The `run_model` function trained the model using the training feature vectors and target labels, and then generated predictions for the test set. The same implementation was applied for both subtasks.

In [30]:
def run_model(model, train_x, train_y, test_x):
  model.fit(train_x, train_y)
  pred = model.predict(test_x)
  return pred
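
The way one estimator class serves both subtasks can be seen from its fitted attributes. The sketch below uses random features and made-up labels (all names are illustrative assumptions): with two labels the model stores a single coefficient row, and with three labels it stores one row per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in features; the labels only mimic the two label schemes.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(60, 4))
binary_y = np.array([True, False] * 30)
three_y = np.array(["NOT", "TIN", "UNT"] * 20)

binary_clf = LogisticRegression(max_iter=1000).fit(toy_features, binary_y)
multi_clf = LogisticRegression(max_iter=1000).fit(toy_features, three_y)

print(binary_clf.classes_, binary_clf.coef_.shape)  # two classes, one weight row
print(multi_clf.classes_, multi_clf.coef_.shape)    # three classes, three rows
```

Because `fit` re-estimates everything from the labels it receives, the same `model` object can be refitted for Subtask B after being used for Subtask A.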

In [31]:
prediction = run_model(model, train_x, train["subtask_a"], test_x)
test['prediction_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,False
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,False
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,False


In [32]:
prediction = run_model(model, train_x, train["subtask_b"], test_x)
test['prediction_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,NOT
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,NOT
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,NOT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


The performance of the `LogisticRegression` model was evaluated with metrics suited to each subtask. For SubTask A, accuracy and binary F1 (the F1 of the offensive class, `True`) were computed, yielding an accuracy of 0.80 and a binary F1 of 0.49. For SubTask B, micro- and macro-averaged F1 scores were calculated, resulting in a micro F1 of 0.78 and a macro F1 of 0.46.


> \*\*\* SubTask A \*\*\*  
accuracy: 0.80  
binary f1: 0.49  
>
> \*\*\* SubTask B \*\*\*    
micro f1: 0.78  
macro f1: 0.46

In [33]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_a']):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.80
binary f1: 0.49

*** SubTask B ***
micro f1: 0.78
macro f1: 0.46
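
The gap between these metrics can be reproduced by hand. The sketch below uses made-up labels and a hypothetical helper, `f1_for_label`, to work through the arithmetic: micro F1 on a single-label problem equals accuracy, while macro F1 is the unweighted mean of the per-class F1 scores, so one badly predicted rare class pulls it down sharply.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 of one class, computed from true/false positives and negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up multiclass case: the single rare "UNT" tweet is predicted as "NOT".
y_true = ["NOT"] * 6 + ["TIN"] * 3 + ["UNT"]
y_pred = ["NOT"] * 6 + ["TIN"] * 3 + ["NOT"]
labels = sorted(set(y_true))
macro_f1 = sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro f1: {micro_f1:0.2f}, macro f1: {macro_f1:0.2f}")  # 0.90 vs 0.64

# Made-up binary case: one of the two offensive tweets is missed.
a_true = [False] * 8 + [True] * 2
a_pred = [False] * 9 + [True]
accuracy = sum(t == p for t, p in zip(a_true, a_pred)) / len(a_true)
binary_f1 = f1_for_label(a_true, a_pred, True)
print(f"accuracy: {accuracy:0.2f}, binary f1: {binary_f1:0.2f}")  # 0.90 vs 0.67
```

In both toy cases nine of ten predictions are correct, yet the class-sensitive metric is much lower, which is the same pattern seen in the real scores.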


## Balancing the Dataset

The gaps between the metrics in the evaluation above indicate that the **OffensEval** dataset is imbalanced. In **SubTask A**, an `accuracy` much higher than the `binary f1` suggests that `False` cases outnumber `True` cases. Similarly, the large gap between `micro f1` and `macro f1` in **SubTask B** hints that some classes are much more frequent than others. Counting the training tweets per label confirms this:

```python
train.groupby(by="subtask_a")[["tweet"]].count().reset_index()
```

|    | subtask_a   |   tweet |
|---:|:------------|--------:|
|  0 | False       |    8840 |
|  1 | True        |    4400 |

```python
train.groupby(by="subtask_b")[["tweet"]].count().reset_index()
```

|    | subtask_b   |   tweet |
|---:|:------------|--------:|
|  0 | NOT         |    8840 |
|  1 | TIN         |    3876 |
|  2 | UNT         |     524 |

One way to mitigate this problem is to assign class weights that reduce the influence of the most frequent classes. **Scikit-learn** supports this through an option that is set when the model is created. A new version of the `LogisticRegression` model was therefore built to handle the imbalanced dataset better.

The function `create_balanced_model` created and returned a `LogisticRegression` identical to the one created by `create_model`, except that it automatically adjusted the class weights (`class_weight='balanced'`).

In [34]:
def create_balanced_model():
  balanced_model = LogisticRegression(max_iter=1000, class_weight='balanced')
  return balanced_model
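
For intuition about what the `'balanced'` option does, the scikit-learn documentation gives the weight of each class as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training counts reported above; `balanced_weights` is a hypothetical helper written for illustration, not a library function.

```python
# Class counts taken from the training-set tables above.
train_counts_a = {False: 8840, True: 4400}
train_counts_b = {"NOT": 8840, "TIN": 3876, "UNT": 524}

def balanced_weights(counts):
    """Apply the documented heuristic n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

weights_a = balanced_weights(train_counts_a)
weights_b = balanced_weights(train_counts_b)
print({k: round(v, 2) for k, v in weights_a.items()})  # False: 0.75, True: 1.5
print({k: round(v, 2) for k, v in weights_b.items()})  # NOT: 0.5, TIN: 1.14, UNT: 8.42
```

Under this heuristic an error on a `UNT` tweet costs roughly 17 times as much as an error on a `NOT` tweet, which explains why the balanced model predicts the rare classes more often.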

In [35]:
balanced_model = create_balanced_model()
prediction = run_model(balanced_model, train_x, train["subtask_a"], test_x)
test['prediction_balanced_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_balanced_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_balanced_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,True
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,False
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,True


In [36]:
prediction = run_model(balanced_model, train_x, train["subtask_b"], test_x)
test['prediction_balanced_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_balanced_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_balanced_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,TIN
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,TIN
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,UNT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


The balanced `LogisticRegression` model reduced the disparity between accuracy and binary F1 in SubTask A, and between micro and macro F1 in SubTask B. Accuracy and micro F1 decreased slightly (0.80 to 0.78 and 0.78 to 0.75), while binary F1 and macro F1 improved markedly (0.49 to 0.61 and 0.46 to 0.59), reflecting better performance on the underrepresented classes.


> \*\*\* SubTask A \*\*\*  
accuracy: 0.78  
binary f1: 0.61  
>
> \*\*\* SubTask B \*\*\*  
micro f1: 0.75  
macro f1: 0.59

In [37]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_balanced_a']):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_balanced_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_balanced_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_balanced_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.78
binary f1: 0.61

*** SubTask B ***
micro f1: 0.75
macro f1: 0.59


## Additional Features 

A `ColumnTransformer` was created to incorporate additional features into the text representation for the `LogisticRegression` model. The `create_column_transformer` function returned a transformer with two components: a `TfidfVectorizer` applied to the tweet text and a `OneHotEncoder` applied to the sentiment annotations. Both transformers used default parameters, producing a combined feature vector that included textual and sentiment information.

In [38]:
def create_column_transformer():
  transformers = [
      ('text', TfidfVectorizer(), 'tweet'),
      ('sentiment', OneHotEncoder(), ['sentiment'])
  ]
  column_transformer = ColumnTransformer(transformers)
  return column_transformer

The `ColumnTransformer` was applied to the train and test DataFrames using the `run_vectorizer` function. The transformer was fitted on the training data and then used to transform both sets. The resulting feature vectors combined the tf-idf text representation with the one-hot encoded sentiment annotations, producing vectors of 19,086 dimensions per tweet, three more than the previous approach (one per sentiment category).

> Shape of train input data: (13240, 19086)  
Shape of test input data: (860, 19086)

In [39]:
column_transformer = create_column_transformer()
train_x_sentiment, test_x_sentiment = run_vectorizer(column_transformer, train, test)
print(f"Shape of train input data: {train_x_sentiment.get_shape()}")
print(f"Shape of test input data: {test_x_sentiment.get_shape()}")

Shape of train input data: (13240, 19086)
Shape of test input data: (860, 19086)
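
The extra columns can be traced on a toy example. The sketch below (the `toy_*` names are illustrative assumptions) builds the same two-part transformer on three made-up rows. Note the column selectors: the plain string `"tweet"` passes a one-dimensional column, which `TfidfVectorizer` expects, while the list `["sentiment"]` passes a two-dimensional one, which `OneHotEncoder` expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only the column names match the real DataFrames.
toy_frame = pd.DataFrame({
    "tweet": ["good day", "bad day", "fine day"],
    "sentiment": ["positive", "negative", "neutral"],
})

toy_transformer = ColumnTransformer([
    ("text", TfidfVectorizer(), "tweet"),            # string selector: 1-D input
    ("sentiment", OneHotEncoder(), ["sentiment"]),   # list selector: 2-D input
])
toy_combined = toy_transformer.fit_transform(toy_frame)

n_words = len(toy_transformer.named_transformers_["text"].vocabulary_)
n_sentiments = len(toy_transformer.named_transformers_["sentiment"].categories_[0])
print(n_words, n_sentiments, toy_combined.shape)  # width = n_words + n_sentiments
```

The sketch reads `.shape` rather than calling `get_shape()` because `ColumnTransformer` may return a dense array when the combined output is not sparse enough, as can happen with very small inputs like this one.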


In [40]:
prediction = run_model(balanced_model, train_x_sentiment, train["subtask_a"], test_x_sentiment)
test['prediction_sentiment_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_sentiment_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_sentiment_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,True
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,True
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,True


In [41]:
prediction = run_model(balanced_model, train_x_sentiment, train["subtask_b"], test_x_sentiment)
test['prediction_sentiment_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_sentiment_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_sentiment_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,NOT
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,TIN
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,UNT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


Adding the sentiment annotations to the input feature vector improved every metric in both **SubTask A** and **SubTask B** relative to the balanced model, most notably `binary f1` (0.61 to 0.66) and `macro f1` (0.59 to 0.62):

> \*\*\* SubTask A \*\*\*  
accuracy: 0.79  
binary f1: 0.66    
>
> \*\*\* SubTask B \*\*\*  
micro f1: 0.77  
macro f1: 0.62

In [42]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_sentiment_a']):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_sentiment_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_sentiment_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_sentiment_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.79
binary f1: 0.66

*** SubTask B ***
micro f1: 0.77
macro f1: 0.62
