<div style="background-color:#00008B; 
            color:white; 
            padding:15px; 
            border-radius:10px; 
            text-align:center; 
            font-size:36px; 
            font-weight:bold;">
    BDA Competition 2: Predicting whether social media posts contain offensive content <br>
    <span style="font-size:20px; font-weight:normal;">
        Group 1: Charlotte de Vries, Johannes Degner & Oliver Hutton
    </span>
</div>

<div style="background-color:#00008b; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:24px; 
            font-weight:bold;">
    Table of Contents
</div>

<br>

1. Setup


2. Read Data

3. Data Exploration

4. Preprocessing<br>
   4.1 Tokenization<br>
   4.2 Stopwords

5. Feature Engineering<br>
   5.1 Tf-idf and character-level features<br>
   5.2 Average tweet embeddings<br>
   5.3 Sentiment analysis<br>
   4 + 5 Function with preprocessing and features<br>
   5.4 Merge all features

6. Models<br>
   6.1 Model fitting<br>
   6.2 Plotting and picking best set of features<br>
   6.3 Using best model with more tweets

7. Submission<br>
   7.1 Converting test tweets to features<br>
   7.2 Generating predictions for each tweet<br>
   7.3 Formatting submission file

8. Division of Labour



<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    Introduction
    </div>

The goal of this competition is to construct a classifier that can accurately recognize and label tweets as including either offensive (1) or non-offensive language (0). 

The data consist of a large collection of tweets and comments gathered from various online platforms. Human judges assessed each text to determine whether it contained offensive content. As a result, models trained on these data should generalize to comments and posts on Twitter and comparable social media platforms.

We use penalized logistic regression, with Ridge and Lasso as the two extremes of the elastic net. The regularization parameter (λ) is tuned by cross-validation, and prediction performance is measured primarily by AUC.

The feature set derives from tweet text and includes:
- Lexico-syntactic n-grams (unigram to quadrigram) capturing local word-order cues (negation, modifiers, fixed phrases/slogans), using raw within-tweet counts to preserve intensity/repetition.
- Term-weighting features (TF, IDF, TF-IDF) to highlight tokens that are locally frequent yet globally distinctive.
- Surface (structural) features (unique tokens, words, characters, average word length) characterizing lexical diversity, verbosity, and compression/elongation.
- Semantic embeddings (tweet-level means of pretrained 200-D GloVe vectors) providing compact meaning that groups spelling variants/slang/euphemisms and complements token features on short, noisy text.
- Affective lexicon features (NRC emotion counts, e.g., nrc_anger, nrc_disgust, and a Bing negativity proportion) offering low-dimensional, interpretable tone cues.
- Abusive-language lexicon features: a profanity lexicon matched against unigrams to quadrigrams (e.g., “fuck you,” “son of a bitch”), capped at 4-grams because the lexicon contains only one 5-gram, with tweet-level toxicity scored by summing the 1 to 3 severity ratings across matches.

It is important to recognize that some labeling errors are inevitable, because some words can be either normal or offensive depending on the context and possibly on subtle use of tone. We therefore anticipate an error rate of roughly 5-10%.

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    1. Setup
</div>


We start by loading the required packages: tidyverse and tidytext.

In [None]:
## Importing packages
library(tidyverse) # metapackage with lots of helpful functions
library(tidytext)

## Limiting the number of data frame rows displayed
options(repr.matrix.max.rows=8)

## Data attached to this notebook
list.files(path = "../input")

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    2. Read Data
</div>
To begin with this step, we locate the data and load it into memory.

In [None]:
dir("../input", recursive=TRUE)

The data consist of two separate files. The training data are already labeled as either including offensive language (1) or not (0); the test data are not yet labeled. The training data initially consist of 400,000 tweets, while the test data include 50,001 tweets. The goal is to build a model on the training data that accurately predicts whether a tweet in the test data includes offensive language.

In [None]:
# Find the right file paths
train_filepath = dir("..", pattern="train.csv", recursive=TRUE, full.names = TRUE)
test_filepath = dir("..", pattern="test.csv", recursive=TRUE, full.names = TRUE)

# Read in the csv files
traindat = read_csv(train_filepath, col_types="cci") 
testdat = read_csv(test_filepath, col_types="cc")

In [None]:
traindat
testdat

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    3. Data Exploration
</div>

Before creating features and building models, we performed a quick exploratory analysis of our dataset. We compared offensive and normal tweets by looking at their mean number of words, mean number of characters, and mean average word length.

In [None]:
# Filter to labelled tweets
eda_dat <- traindat %>%
  filter(label %in% c(0, 1))

# Character-level features
eda_summary <- eda_dat %>%
  mutate(
    word_count   = str_count(tweet, "\\S+"),
    char_count   = str_length(tweet),
    avg_word_len = char_count / pmax(word_count, 1)
  )

# Summary statistics by label
eda_stats <- eda_summary %>%
  group_by(label) %>%
  summarise(
    mean_words   = mean(word_count, na.rm = TRUE),
    mean_chars   = mean(char_count, na.rm = TRUE),
    mean_avg_word_len = mean(avg_word_len, na.rm = TRUE),
    .groups = "drop"
  )

eda_stats


In [None]:
# Set plotting area: 1 row, 3 columns
par(mfrow = c(1, 3))

# Plot 1: mean words per tweet
barplot(
  height = eda_stats$mean_words,
  names.arg = eda_stats$label,
  col = c("#00BFC4", "#F8766D"),
  main = "Mean Words per Tweet",
  xlab = "Label",
  ylab = "Mean Words"
)

# Plot 2: mean characters per tweet
barplot(
  height = eda_stats$mean_chars,
  names.arg = eda_stats$label,
  col = c("#00BFC4", "#F8766D"),
  main = "Mean Characters per Tweet",
  xlab = "Label",
  ylab = "Mean Characters"
)

# Plot 3: mean average word length
barplot(
  height = eda_stats$mean_avg_word_len,
  names.arg = eda_stats$label,
  col = c("#00BFC4", "#F8766D"),
  main = "Mean Average Word Length",
  xlab = "Label",
  ylab = "Avg Word Length"
)


Having plotted our findings, we can see that offensive tweets generally have fewer words and fewer characters. This makes sense, considering that offensive statements often rely on shorter and stronger words, for instance "shut up" or "fuck off". Non-offensive tweets might be more descriptive and conversational, which tends to make them longer and more structured. However, average word length is similar in both types of tweets.

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    4. Preprocessing
</div>

Our initial approach was to run tokenization, optional stopword removal, and feature engineering as separate steps. However, we found this approach slow, and it carried a risk of data leakage. Therefore, we redesigned preprocessing and feature generation into a single, parameterized feature builder. A unified function applies the same tokenization, stopword rules, n-gram boundaries, normalizations, and lexicon joins to each split.

The function avoids alignment issues and post-hoc merges by returning sparse feature matrices that share a single row order (`all_ids`), with prefixed labels for the lexicon features (for instance, nrc_ and bing_). Runtime and memory use are reduced because the unigram tokens are computed once and reused for TF-IDF, embeddings, and sentiment and profanity matching. Finally, this method enables us to compare models with different sets of features on a subset of tweets (n = 20000), which is faster than using all tweets. Afterwards, we select the best model and apply it to more tweets for our final model.
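
The row-alignment idea can be illustrated with a small sketch. `pad_rows` is a hypothetical helper (it is not defined in this notebook) that adds all-zero rows for tweets missing from a feature block and then reorders the rows to a shared id vector, which is the same pattern the feature builder applies to the lexicon matrices. The toy ids and column names are made up.

```r
library(Matrix)

# Hypothetical helper: make sure X has one row per id in all_ids, in that order.
pad_rows <- function(X, all_ids) {
  missing_ids <- setdiff(all_ids, rownames(X))
  if (length(missing_ids) > 0) {
    # All-zero sparse block for the tweets that had no matches
    empty <- sparseMatrix(
      i = integer(0), j = integer(0), x = numeric(0),
      dims = c(length(missing_ids), ncol(X)),
      dimnames = list(missing_ids, colnames(X))
    )
    X <- rbind(X, empty)
  }
  X[all_ids, , drop = FALSE]
}

# Toy block: tweet "t2" has no lexicon matches and is therefore absent
X_small <- sparseMatrix(
  i = c(1, 2), j = c(1, 2), x = c(3, 1),
  dimnames = list(c("t1", "t3"), c("nrc_anger", "nrc_joy"))
)
X_full <- pad_rows(X_small, c("t1", "t2", "t3"))
rownames(X_full)
```

Using one helper like this for every feature block would keep the padding logic in a single place instead of repeating it per lexicon.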

## 4.1 Tokenization


We first tokenize each tweet. We retain raw within-tweet counts to capture direct lexical signal and intensity (e.g., repeated slurs), giving the model access to uncompromised frequency information before normalization.

We add bigrams (2-grams) to capture short-range context and compositional meanings that unigrams miss, especially negation (“not good”), intensifiers/modifiers (“very stupid”, “disgusting behavior”), targeted phrases (“[slur] people”, “illegal immigrants”), and common phrasal verbs/commands (“go back”, “get out”). Raw counts preserve how often such pairings occur within a tweet, reflecting intensity and repetition of phrase-level cues.

Trigrams (3-grams) capture multiword expressions and templated statements typical of tweets (“go back to”, “they are not”) that uni- and bigrams might miss. They can convey stronger intent and polarity than any of their sub-sequences alone. They also help to disambiguate contexts like sarcasm markers (“yeah right sure”) or imperative structures, with counts reflecting repeated use of these patterned constructions.

Quadrigrams (4-grams) preserve even more context. They can capture slogans, threats, and templated incitements that are highly diagnostic but only emerge over multiple tokens (e.g., “go back to [country]”, “make X great again”, “ban them all now”). Quadrigrams also encode entity-specific frames (group + action + location/time), and raw frequency highlights repeated slogan-level usage within and across tweets.
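
To make the n-gram windows concrete, here is a minimal base-R sketch. `make_ngrams` is a hypothetical helper that splits on whitespace only; it is a simplification of the tidytext tokenizer used in the feature builder, and the toy tweet is made up.

```r
# Hypothetical helper: sliding windows of n consecutive words
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_tweet <- "go back to your country"
bigrams   <- make_ngrams(toy_tweet, 2)
quadgrams <- make_ngrams(toy_tweet, 4)

bigrams
quadgrams
table(bigrams)  # raw within-tweet counts per bigram
```

A five-word tweet yields four bigrams and two quadgrams, so longer n-grams become sparse quickly on short texts.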

## 4.2 Stopwords


At this point in the notebook, we deemed removing stopwords inappropriate. In some contexts, even words like “you”, “we”, or “them” might be highly informative for detecting offensive language. Therefore, we deliberately retain function words (e.g., pronouns, negation), because in targeted abuse detection their distribution can carry crucial cues about intent and target. However, we also tried a version in which stopwords are removed before averaging the word embeddings. Since stopwords carry little semantic meaning, we thought that this might make average tweet embeddings more accurate in predicting new offensive tweets.
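
The trade-off can be seen on a toy example. The stopword vector below is a tiny stand-in chosen for illustration (an assumption; the list used in the feature builder is much longer), and the tokens are made up.

```r
# Tiny illustrative stopword list (assumption, not the real lexicon)
toy_stopwords <- c("you", "are", "a", "the", "not")

tokens <- c("you", "are", "not", "a", "friend")
kept   <- tokens[!tokens %in% toy_stopwords]

kept
```

After filtering, only "friend" remains: both the target ("you") and the negation ("not") are gone, which is why the token features keep stopwords while one embedding variant drops them.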

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    5. Feature Engineering
</div>

## 5.1 Tf-idf and character-level features

***Lexical weighting features***

**Term Frequency (TF)**: We normalize counts by tweet length to prevent longer tweets from dominating purely by verbosity, ensuring each token’s weight reflects its salience within that tweet.
	
**Inverse Document Frequency (IDF)**: It lowers the weight of very common words and boosts rarer, more informative ones, so feature weights better reflect each token’s ability to distinguish tweets across the corpus.

**TF–IDF**: By combining TF and IDF, we emphasize tokens that are both prominent in a tweet and informative globally - an effective, sparse representation for linear classifiers on short texts.
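
As a worked example of the three weights, the sketch below computes them by hand in base R for a two-tweet toy corpus. It assumes the usual definitions (TF as the count divided by the tweet's total count, IDF as the natural log of the number of tweets over the number of tweets containing the token); the toy data are made up.

```r
# One row per (tweet, token) with the raw within-tweet count n
toy <- data.frame(
  id    = c("d1", "d1", "d2", "d2", "d2"),
  token = c("you", "idiot", "you", "are", "kind"),
  n     = c(2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# TF: share of the tweet's tokens taken up by this token
doc_totals <- tapply(toy$n, toy$id, sum)
toy$tf <- toy$n / as.numeric(doc_totals[toy$id])

# IDF: log(number of tweets / number of tweets containing the token)
n_docs   <- length(unique(toy$id))
doc_freq <- table(toy$token)
toy$idf  <- log(n_docs / as.numeric(doc_freq[toy$token]))

toy$tf_idf <- toy$tf * toy$idf
toy
```

"you" appears in both tweets, so its IDF and TF-IDF are zero, whereas "idiot" appears in one tweet only and receives a positive weight.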

***Lexical diversity***

**Unique tokens per tweet (num_unique_tokens)**: We add a tweet-level measure of distinct vocabulary to capture stylistic structure (verbosity vs. repetitiveness), which can correlate with abusive or chant-like discourse beyond word content alone.

***Structural (surface) features***

**Words per tweet**: The number of words per tweet indexes overall verbosity or utterance length; short bursts often mark tags, insults, or slogans, whereas longer tweets typically contain rationale or narrative content.

**Characters per tweet**: The total character count provides an orthographic footprint, sensitive to elongations (e.g., “soooo”), emoji or digit density, extended punctuation runs, and highly compressed styles.


### 5.1.1 Near-zero variance features

We decided to remove tokens that occur in fewer than 0.01% of the documents. The 0.01% threshold was chosen arbitrarily, but it should remove idiosyncratic strings and misspellings that occur in only a single tweet.
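
The threshold can be expressed as an IDF cutoff. The sketch below assumes IDF is the natural log of (number of tweets / number of tweets containing the token) and uses n = 20000 tweets as in the initial run; `idf_for` is a hypothetical helper for illustration.

```r
n_docs     <- 20000
min_share  <- 0.01 / 100        # 0.01% of tweets
idf_cutoff <- -log(min_share)   # tokens with idf above this are dropped

# Hypothetical helper: IDF of a token found in `docs_with_token` tweets
idf_for <- function(docs_with_token) log(n_docs / docs_with_token)

n_docs * min_share          # the threshold expressed in tweets
idf_for(1) > idf_cutoff     # token seen in one tweet: above the cutoff
idf_for(3) <= idf_cutoff    # token seen in three tweets: within the cutoff
```

With 20000 tweets the threshold corresponds to two tweets, so a token that appears in a single tweet is removed, while tokens in exactly two tweets sit right on the boundary.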

## 5.2 Average tweet embeddings

We also include a tweet-level average of pretrained 200-dimensional GloVe word vectors. This gives each tweet a compact “meaning” summary, so similar words—including slang, spelling variants, and euphemisms—are treated alike even when they don’t match exactly. That semantic signal complements TF-IDF’s exact-token focus and is especially helpful because tweets are short and noisy. Using pretrained vectors also brings in knowledge from large text corpora at low cost, improving robustness and generalization.
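
A minimal sketch of the averaging step, using a made-up 3-dimensional embedding table in place of the 200-dimensional GloVe vectors. `tweet_embedding` is a hypothetical helper; like the join in the feature builder, it averages over the distinct tokens that have a vector and ignores the rest.

```r
# Made-up 3-D vectors standing in for pretrained embeddings
toy_vectors <- rbind(
  stupid = c(0.9, 0.1, 0.0),
  dumb   = c(0.8, 0.2, 0.0),
  nice   = c(-0.7, 0.6, 0.1)
)

# Hypothetical helper: mean vector over the tokens found in the table
tweet_embedding <- function(tokens, vectors) {
  known <- intersect(tokens, rownames(vectors))
  if (length(known) == 0) return(rep(NA_real_, ncol(vectors)))
  colMeans(vectors[known, , drop = FALSE])
}

emb_insult  <- tweet_embedding(c("so", "stupid", "dumb"), toy_vectors)
emb_unknown <- tweet_embedding(c("zzzz"), toy_vectors)

emb_insult
emb_unknown
```

Tweets with no in-vocabulary tokens get no embedding at all, which is one reason the embedding matrix can have fewer rows than the TF-IDF matrix before alignment.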

## 5.3 Sentiment analysis
We load the NRC emotion lexicon and the Bing polarity lexicon, prefix the labels (“nrc_…”, “bing_…”) to avoid name collisions and make feature provenance explicit, and then join them to our tokenized tweets. From NRC we construct per-tweet emotion counts by tallying how many lexicon-matched tokens fall into each category (e.g., nrc_anger, nrc_disgust), resulting in an interpretable affect profile for every tweet. From Bing we compute a frequency-weighted negativity proportion - the share of token counts labeled negative among all lexicon-covered tokens in the tweet - which summarizes overall tone. We include these features because offensive language systematically co-occurs with certain emotions and negative polarity, and these low-dimensional, interpretable signals complement TF–IDF and embeddings by capturing affect rather than exact wording. Although the join excludes out-of-lexicon tokens, this trade-off yields robust cues that improve discrimination on short, noisy tweets.

Upon inspecting the features, we noticed that certain sentiments also appear as tf-idf features. Hence, we rename the sentiments to start with 'nrc_' for the nrc lexicon, and 'bing_' for the bing lexicon.
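
The Bing proportion can be illustrated on a toy tweet. The feature builder computes the share of negative-labelled token counts among all lexicon-covered token counts; the mini lexicon and the counts below are made up for illustration.

```r
# Token counts for one toy tweet
toy_tokens <- data.frame(
  token = c("hate", "love", "you", "awful"),
  n     = c(2, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Made-up mini lexicon with prefixed sentiment labels
toy_bing <- data.frame(
  word      = c("hate", "love", "awful"),
  sentiment = c("bing_negative", "bing_positive", "bing_negative"),
  stringsAsFactors = FALSE
)

# Keep only lexicon-covered tokens ("you" is dropped by the join)
matched <- merge(toy_tokens, toy_bing, by.x = "token", by.y = "word")

negativity <- sum((matched$sentiment == "bing_negative") * matched$n) /
  sum(matched$n)
negativity
```

Here three of the four lexicon-covered token occurrences are negative, so the proportion is 0.75; the out-of-lexicon token "you" does not affect it.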

We further decided to apply a profanities lexicon to our tokenized tweets. However, since profanities are not always single words, some entries in the lexicon consist of multiple words, for instance "fuck you" or "son of a bitch". We account for this by applying the profanities lexicon to unigrams, bigrams, trigrams and quadgrams. There was only one instance of a slur with 5 words, so we decided to stop at quadgrams. The profanities lexicon also contains a severity rating for each insult/slur, ranging from 1 to 3. We use the sum of the severity ratings of all matched n-grams in each tweet.
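
A small sketch of the severity sum. The lexicon rows and their ratings below are hypothetical values chosen for illustration, and `severity_sum` is a hypothetical helper; the feature builder performs the same lookup with joins.

```r
# Hypothetical lexicon entries and ratings (for illustration only)
toy_lexicon <- data.frame(
  text            = c("idiot", "son of a bitch"),
  severity_rating = c(1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the ratings of all n-grams found in the lexicon
severity_sum <- function(ngrams, lexicon) {
  sum(lexicon$severity_rating[match(ngrams, lexicon$text)], na.rm = TRUE)
}

# n-grams of the toy tweet "you idiot son of a bitch"
unigrams  <- c("you", "idiot", "son", "of", "a", "bitch")
quadgrams <- c("you idiot son of", "idiot son of a", "son of a bitch")

total_severity <- severity_sum(unigrams, toy_lexicon) +
  severity_sum(quadgrams, toy_lexicon)
total_severity
```

The unigram pass finds "idiot" and the quadgram pass finds "son of a bitch", so both the single-word and the multi-word entry contribute to the tweet's score.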

## 4 + 5: Function with preprocessing and feature engineering

In [None]:
# Feature builder
build_feature_matrices <- function(n = 15000,
                                   glove_path = "/kaggle/input/glove-embeddings/glove.6B.200d.txt",
                                   glove_n_lines = 100000,
                                   ncores = 4) {
    cat(sprintf("\n[build_feature_matrices] Using n = %d tweets\n", n))
  
  # Helper: tweet-level character features
  extract_char_features <- function(data) {
    data |>
      mutate(word_count   = str_count(tweet, "\\S+"),
             tweet_length = str_length(tweet)) |>
      select(id, word_count, tweet_length)
  }
  
  ## 4.1 Tokenization of n-grams (uni to quad)
  tokenized_tweets <- traindat |>
    slice_head(n = n) |>
    unnest_tokens(token, tweet, token = "words") |>
    count(id, label, token, name = "n")
  
  tweets_bigrams <- traindat |>
    slice_head(n = n) |>
    unnest_tokens(bigram, tweet, token = "ngrams", n = 2) |>
    filter(!is.na(bigram)) |>
    count(id, bigram, name = "n")
  
  tweets_trigrams <- traindat |>
    slice_head(n = n) |>
    unnest_tokens(trigram, tweet, token = "ngrams", n = 3) |>
    filter(!is.na(trigram)) |>
    count(id, trigram, name = "n")
  
  tweets_quadgrams <- traindat |>
    slice_head(n = n) |>
    unnest_tokens(quadgram, tweet, token = "ngrams", n = 4) |>
    filter(!is.na(quadgram)) |>
    count(id, quadgram, name = "n")
  
  # token dataset without stopwords
  tokenized_tweets_no_sw = traindat |>
    slice_head(n = n) |>
    unnest_tokens(token, tweet, token = "words") |> # tokenize tweets at word level
    filter(!token %in% stop_words$word) |>
    count(id, token) # count tokens within tweets as 'n'
  
  ## 5.1 TF-IDF + character-level features
  features <- tokenized_tweets |>
    bind_tf_idf(token, id, n)
  
  tweet_lengths <- tokenized_tweets |>
    group_by(id) |>
    summarise(num_unique_tokens = n_distinct(token), .groups = "drop")
  
  
  char_feats <- traindat |>
    slice_head(n = n) |>
    extract_char_features()
  
  features <- features |>
    left_join(char_feats, by = "id") |>
    left_join(tweet_lengths, by = "id")
  
  # 5.1.1 Near-zero variance cutoff: keep tokens that occur in at least 0.01% of tweets
  features_keep <- features |>
    filter(idf <= -log(0.01 / 100)) |>
    rename(tweet_id = id)
  
  ## 5.2 GloVe embeddings (200d)
  cat("[build_feature_matrices] Reading GloVe…\n")
  embedding_lines <- readLines(glove_path, n = glove_n_lines)
  
  emb <- strsplit(embedding_lines, " ")
  emb <- do.call(rbind, emb)
  colnames(emb) <- c("token", paste0("dim", 1:200))
  glove_tibble <- as_tibble(emb) |> mutate(across(2:201, as.numeric))

  # Join words with GloVe embeddings
  word_embeddings <- tokenized_tweets |>
    inner_join(glove_tibble, by = "token")

  # Calculate mean of word embedding for each tweet
  mean_embeddings <- word_embeddings |>
    group_by(id) |>
    summarise(across(starts_with("dim"), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
  
  # Embeddings without stopwords
  # Join words with GloVe embeddings
  word_embeddings_nosw <- tokenized_tweets_no_sw %>%
    inner_join(glove_tibble, by = "token")
  
  # Calculate mean of word embedding for each tweet
  mean_embeddings_nosw <- word_embeddings_nosw %>%
    group_by(id) %>%
    summarise(across(starts_with("dim"), function(x) mean(x, na.rm = TRUE)))
  
  
  ## 5.3 Sentiment lexicons
  # Load lexicons
  nrc  <- read.csv("/kaggle/input/bing-nrc-afinn-lexicons/NRC.csv")
  bing <- read.csv("/kaggle/input/bing-nrc-afinn-lexicons/Bing.csv")

  # Rename sentiments as they might overlap with the tf-idf word features
  nrc$sentiment  <- paste0("nrc_", nrc$sentiment)
  bing$sentiment <- paste0("bing_", bing$sentiment)

  # Calculate sentiment for tweets
  sentiment_tweets <- tokenized_tweets |>
    inner_join(nrc,
               by = c("token" = "word"),
               relationship = "many-to-many") |>
    count(id, sentiment, name = "n") |>
    pivot_wider(
      id_cols = id,
      names_from = sentiment,
      values_from = n,
      values_fill = 0
    )
                     
  # Calculate average negativity per tweet
  avg_negativity <- tokenized_tweets |>
    inner_join(bing,
               by = c("token" = "word"),
               relationship = "many-to-many") |>
    mutate(is_negative = ifelse(sentiment == "bing_negative", 1, 0)) |>
    group_by(id) |>
    summarise(negativity = sum(is_negative * n) / sum(n),
              .groups = "drop")
  
  ## Profanities, looking at all ngrams
  profanities <- read.csv("/kaggle/input/profanities-in-english-collection/profanity_en.csv")
  profanities_clean <- profanities |>
    mutate(text = str_trim(str_to_lower(text)))
                     
  profanity_unigrams   <- profanities_clean |> filter(!str_detect(text, " "))
  profanity_bigrams    <- profanities_clean |> filter(str_count(text, " ") == 1)
  profanity_trigrams   <- profanities_clean |> filter(str_count(text, " ") == 2)
  profanity_quadgrams  <- profanities_clean |> filter(str_count(text, " ") == 3)
                     
  # Profanity rating for unigrams in tweets and sum the rating for each tweet
  profanity_rating_unigrams <- tokenized_tweets |>
    left_join(
      profanity_unigrams |> select(text, severity_rating),
      by = c("token" = "text"),
      relationship = "many-to-many"
    ) |>
    mutate(across(everything(), ~ replace_na(.x, 0))) |>
    group_by(id) |>
    summarise(severity_unigram = sum(severity_rating),
              .groups = "drop")
                     
  # Profanity rating for bigrams in tweets and sum the rating for each tweet
  profanity_rating_bigrams <- tweets_bigrams |>
    left_join(
      profanity_bigrams |> select(text, severity_rating),
      by = c("bigram" = "text"),
      relationship = "many-to-many"
    ) |>
    mutate(across(everything(), ~ replace_na(.x, 0))) |>
    group_by(id) |>
    summarise(severity_bigram = sum(severity_rating),
              .groups = "drop")
                     
  # Profanity rating for trigrams in tweets and sum the rating for each tweet
  profanity_rating_trigrams <- tweets_trigrams |>
    left_join(
      profanity_trigrams |> select(text, severity_rating),
      by = c("trigram" = "text"),
      relationship = "many-to-many"
    ) |>
    mutate(across(everything(), ~ replace_na(.x, 0))) |>
    group_by(id) |>
    summarise(severity_trigram = sum(severity_rating),
              .groups = "drop")
                     
  # Profanity rating for quad in tweets and sum the rating for each tweet
  profanity_rating_quadgrams <- tweets_quadgrams |>
    left_join(
      profanity_quadgrams |> select(text, severity_rating),
      by = c("quadgram" = "text"),
      relationship = "many-to-many"
    ) |>
    mutate(across(everything(), ~ replace_na(.x, 0))) |>
    group_by(id) |>
    summarise(severity_quadgram = sum(severity_rating),
              .groups = "drop")
                     
  # join all ratings together
  profanity_total_rating <- profanity_rating_unigrams |>
    left_join(profanity_rating_bigrams, by = "id") |>
    left_join(profanity_rating_trigrams, by = "id") |>
    left_join(profanity_rating_quadgrams, by = "id") |>
    mutate(across(starts_with("severity"), ~ replace_na(.x, 0)))
                     
  # and sum them together
  profanity_sum <- profanity_total_rating |>
    transmute(id,
              profanity_sum_rating =
                severity_unigram + severity_bigram + severity_trigram + severity_quadgram)
  
  
  ## Align IDs and build sparse matrices
  all_ids <- sort(unique(features_keep$tweet_id))
  
  # tf-idf
  X_tfidf <- features_keep |>
    mutate(tweet_id = factor(tweet_id, levels = all_ids)) |>
    cast_sparse(tweet_id, token, tf_idf)
  
  # character features
  X_chars <- features_keep |>
    select(tweet_id, tweet_length, word_count, num_unique_tokens) |>
    pivot_longer(-tweet_id, names_to = "feature", values_to = "value") |>
    mutate(tweet_id = factor(tweet_id, levels = all_ids)) |>
    cast_sparse(tweet_id, feature, value)
  
  # embeddings
  X_emb <- mean_embeddings |>
    
    filter(id %in% all_ids) |>
    mutate(tweet_id = factor(id, levels = all_ids)) |>
    pivot_longer(starts_with("dim"),
                 names_to = "feature",
                 values_to = "value") |>
    cast_sparse(tweet_id, feature, value)

  # embeddings without stopwords                   
  X_emb_nosw <- mean_embeddings_nosw %>% # without stopwords
    filter(id %in% all_ids) %>%
    mutate(tweet_id = factor(id, levels = all_ids)) %>%
    pivot_longer(cols = starts_with("dim"),
                 names_to = "feature",
                 values_to = "value") %>%
    cast_sparse(tweet_id, feature, value)
  
  # NRC, ensuring full row coverage
  valid_ids_nrc <- intersect(all_ids, sentiment_tweets$id)
  X_nrc_tmp <- sentiment_tweets |>
    filter(id %in% valid_ids_nrc) |>
    rename(tweet_id = id) |>
    mutate(tweet_id = factor(tweet_id, levels = valid_ids_nrc)) |>
    pivot_longer(-tweet_id, names_to = "sentiment_nrc", values_to
                 = "count_nrc") |>
    cast_sparse(tweet_id, sentiment_nrc, count_nrc)

  # when running it the first time, there were some missing row ids
  # so we account for this now by filling them to 0              
  missing_ids_nrc <- setdiff(all_ids, rownames(X_nrc_tmp))
  if (length(missing_ids_nrc) > 0) {
    empty_mat <- Matrix::sparseMatrix(
      i = integer(0),
      j = integer(0),
      x = numeric(0),
      dims = c(length(missing_ids_nrc), ncol(X_nrc_tmp)),
      dimnames = list(missing_ids_nrc, colnames(X_nrc_tmp))
    )
    X_nrc <- rbind(X_nrc_tmp, empty_mat)
    X_nrc <- X_nrc[all_ids, ]
  } else {
    X_nrc <- X_nrc_tmp[all_ids, ]
  }
  
  # Bing: negativity proportion per tweet (avg_negativity, computed above).
  # The original block re-used sentiment_tweets here, which duplicated the NRC columns.
  valid_ids_bing <- intersect(all_ids, avg_negativity$id)
  X_bing_tmp <- avg_negativity |>
    filter(id %in% valid_ids_bing) |>
    rename(tweet_id = id, bing_negativity = negativity) |>
    mutate(tweet_id = factor(tweet_id, levels = valid_ids_bing)) |>
    pivot_longer(-tweet_id, names_to = "sentiment_bing", values_to = "count_bing") |>
    cast_sparse(tweet_id, sentiment_bing, count_bing)

  # same as for nrc, there were some missing row ids
  # we account for this now by filling them with 0
  missing_ids_bing <- setdiff(all_ids, rownames(X_bing_tmp))
  if (length(missing_ids_bing) > 0) {
    empty_mat <- Matrix::sparseMatrix(
      i = integer(0),
      j = integer(0),
      x = numeric(0),
      dims = c(length(missing_ids_bing), ncol(X_bing_tmp)),
      dimnames = list(missing_ids_bing, colnames(X_bing_tmp))
    )
    X_bing <- rbind(X_bing_tmp, empty_mat)
    X_bing <- X_bing[all_ids, ]
  } else {
    X_bing <- X_bing_tmp[all_ids, ]
  }
  
  # Profanity score
  X_profanities <- profanity_sum |>
    filter(id %in% all_ids) |>
    mutate(tweet_id = factor(id, levels = all_ids)) |>
    transmute(tweet_id, feature = "profanity_rating", value = profanity_sum_rating) |>
    cast_sparse(tweet_id, feature, value)
  
  
  list(
    ids = all_ids,
    X_tfidf = X_tfidf,
    X_chars = X_chars,
    X_emb = X_emb,
    X_emb_nosw = X_emb_nosw,
    X_nrc = X_nrc,
    X_bing = X_bing,
    X_profanities = X_profanities
  )
  
}


## 5.4 Merge all features

Here, we merge all the feature sets into 5 different versions:
- V1: Tf-idf + Character-level features
- V2: Tf-idf + Character-level features + Sentiment analyses
- V3: Tf-idf + Character-level features + Sentiment analyses + average tweet embeddings
- V4: Sentiment analyses + average tweet embeddings
- V5: Tf-idf + Character-level features + Sentiment analyses + average tweet embeddings no stopwords
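
Because `cbind` joins matrices by position and not by row name, it only gives a valid design matrix when every block has the same rows in the same order. A minimal check could look like the sketch below; `rows_aligned` is a hypothetical helper that is not part of this notebook, shown here on small dense matrices.

```r
# Hypothetical helper: TRUE only if every matrix has exactly the same
# row names, in the same order, as the first one
rows_aligned <- function(...) {
  mats <- list(...)
  reference <- rownames(mats[[1]])
  all(vapply(mats,
             function(m) identical(rownames(m), reference),
             logical(1)))
}

A <- matrix(1:4, nrow = 2, dimnames = list(c("t1", "t2"), c("a1", "a2")))
B <- matrix(5:8, nrow = 2, dimnames = list(c("t1", "t2"), c("b1", "b2")))
C <- B[c("t2", "t1"), ]   # same tweets, different order

rows_aligned(A, B)  # TRUE
rows_aligned(A, C)  # FALSE
```

Running such a check on the feature blocks before combining them would catch silent misalignment between features and labels.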

In [None]:
# Initial run with 20000
built <- build_feature_matrices(n = 20000)

X_v1 <- cbind(built$X_tfidf, built$X_chars)
X_v2 <- cbind(built$X_tfidf, built$X_chars, built$X_nrc, built$X_profanities, built$X_bing)
X_v3 <- cbind(built$X_tfidf, built$X_chars, built$X_nrc, built$X_profanities, built$X_bing, built$X_emb)
X_v4 <- cbind(built$X_nrc, built$X_profanities, built$X_bing, built$X_emb)
X_v5 <- cbind(built$X_tfidf, built$X_chars, built$X_nrc, built$X_profanities, built$X_bing, built$X_emb_nosw)


<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    6. Models
</div>

## 6.1 Model fitting

The code below creates the target vector `y`. 
We ensure that every row of `X` has the correct corresponding entry in `y`. 
We use a training-validation split (80/20) to choose between feature sets and values of alpha; with such high-dimensional data, this is faster than cross-validating every comparison. Within the training portion, `cv.glmnet` still runs 3-fold cross-validation to tune λ.
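
For intuition about the metric used on the validation rows, AUC can be computed from ranks alone: it is the share of (offensive, non-offensive) pairs in which the offensive tweet gets the higher score. The sketch below is a base-R illustration with made-up scores and labels; `auc_from_scores` is a hypothetical helper, not the function the notebook calls.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC
auc_from_scores <- function(scores, labels) {
  ranks <- rank(scores)          # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_scores <- c(0.9, 0.8, 0.35, 0.6, 0.2)
toy_labels <- c(1, 0, 1, 1, 0)

toy_auc <- auc_from_scores(toy_scores, toy_labels)
toy_auc
```

In this toy case four of the six positive-negative pairs are ordered correctly, so the AUC is 4/6.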

We decided to use elastic net as it performs particularly well with high-dimensional data and allows flexibility in regularisation through the mixing parameter alpha. When alpha = 1, only the L1 penalty is applied, which is lasso regression. When alpha = 0, only the L2 penalty is applied, which is ridge regression. Intermediate values of alpha blend the two penalties. Therefore, we try a range of values for alpha (0, 0.2, 0.4, 0.6, 0.8 and 1) and wrap this in a function.
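
For reference, the penalised objective that glmnet minimises can be written as follows, where $\ell$ denotes the negative log-likelihood contribution of one tweet (the logistic loss for the binomial family), $N$ the number of training tweets, and $\beta_0$, $\beta$ the intercept and coefficients:

$$
\min_{\beta_0,\,\beta}\;\frac{1}{N}\sum_{i=1}^{N}\ell\!\left(y_i,\;\beta_0 + x_i^{\top}\beta\right)\;+\;\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
$$

Setting $\alpha = 1$ leaves only the $\lVert\beta\rVert_1$ term (lasso), and $\alpha = 0$ leaves only the $\lVert\beta\rVert_2^2$ term (ridge), while $\lambda$ controls the overall strength of the penalty.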

In [None]:
# Elastic net evaluator
evaluate_elastic_net_model <- function(X, alpha_values = seq(0, 1, 0.2)) {
  set.seed(2025)

  # Build y from X rownames joined to traindat
  y <- data.frame(id = rownames(X)) |>
    inner_join(traindat, by = "id") |>
    pull(label)

  # Train/valid split
  trainidx <- caret::createDataPartition(y, p = 0.8)$Resample1
  Xtrain <- X[trainidx, ]
  ytrain <- y[trainidx]
  Xvalid <- X[-trainidx, ]
  yvalid <- y[-trainidx]

  # Parallel folds
  doMC::registerDoMC(cores = 4)

  results <- tibble(alpha = numeric(), auc = numeric(), lambda_min = numeric())
  best_auc   <- -Inf
  best_model <- NULL
  best_alpha <- NA_real_

  for (a in alpha_values) {
    fit <- glmnet::cv.glmnet(
      Xtrain, ytrain,
      alpha = a,
      nfolds = 3,

      family = "binomial",
      type.measure = "auc",
      parallel = TRUE
    )
    preds <- predict(fit, Xvalid, s = "lambda.min", type = "response")
    auc_v <- glmnet::assess.glmnet(preds, newy = yvalid, family = "binomial")$auc

    results <- add_row(results, alpha = a, auc = as.numeric(auc_v), lambda_min = fit$lambda.min)

    if (auc_v > best_auc) {
      best_auc   <- auc_v
      best_model <- fit
      best_alpha <- a
    }
  }

  list(results = results, best_model = best_model, best_alpha = best_alpha)
}


In [None]:
# Run function on each set of features
results_v1 <- evaluate_elastic_net_model(X_v1)
results_v2 <- evaluate_elastic_net_model(X_v2)
results_v3 <- evaluate_elastic_net_model(X_v3)
results_v4 <- evaluate_elastic_net_model(X_v4)
results_v5 <- evaluate_elastic_net_model(X_v5)

# Label each auc score with the appropriate version
results_v1$results$features <- "V1"
results_v2$results$features <- "V2"
results_v3$results$features <- "V3"
results_v4$results$features <- "V4"
results_v5$results$features <- "V5"

# Combine all results
results_all <- dplyr::bind_rows(
  results_v1$results,
  results_v2$results,
  results_v3$results,
  results_v4$results,
  results_v5$results
)

## 6.2 Plotting and picking best set of features

To compare and help visualise the different models with the various sets of features, we decided to plot each model.

In [None]:
ggplot(results_all, aes(x = alpha, y = auc, color = features)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  labs(
    title = "Elastic Net AUC by Feature Set and Alpha",
    x = "Alpha (Elastic Net Mixing Parameter)",
    y = "AUC",
    color = "Feature Set"
  ) +
  theme_minimal(base_size = 14)

# Pick the overall winner across V1 to V5
pick_best <- function() {
  sets <- list(
    V1 = results_v1,
    V2 = results_v2,
    V3 = results_v3,
    V4 = results_v4,
    V5 = results_v5
  )
  best <- tibble(
    set   = names(sets),
    auc   = sapply(sets, function(x) max(x$results$auc, na.rm = TRUE)),
    alpha = sapply(sets, function(x) x$best_alpha)
  )
  best <- arrange(best, desc(auc)) %>% slice(1)
  best
}

best_small <- pick_best()
print(best_small)


Summary of different versions
- V1: Tf-idf + Character features
- V2: Tf-idf + Character features + Sentiment analyses
- V3: Tf-idf + Character features + Sentiment analyses + average tweet embeddings
- V4: Sentiment analyses + average tweet embeddings
- V5: Tf-idf + Character features + Sentiment analyses + average tweet embeddings no stopwords

The best model will vary depending on how many tweets we use. When more tweets are used, we expect the models including average tweet embeddings to perform better.

## 6.3 Using best model with more tweets

Now that we have identified the best feature set, we re-train the model with the same features and the same alpha on more tweets. Here, we use 130,000 tweets (the value of `n` in the code below). This is less computationally intensive than using the entire dataset, while still including enough data to make accurate predictions.

In [None]:

## more tweets with same features as winner
built_new <- build_feature_matrices(n = 130000)

# Here, we used an LLM to help us construct the best-performing model 
## ==> LLM Start src: https://chatgpt.com/s/t_68e75a75a9588191a3aba5c4bf723f9f
# Helper to construct the winning X on demand
build_X_from_set <- function(set_name, built_obj) {
  switch(set_name,
    V1 = cbind(built_obj$X_tfidf, built_obj$X_chars),
    V2 = cbind(built_obj$X_tfidf, built_obj$X_chars, built_obj$X_nrc, built_obj$X_profanities, built_obj$X_bing),
    V3 = cbind(built_obj$X_tfidf, built_obj$X_chars, built_obj$X_nrc, built_obj$X_profanities, built_obj$X_bing, built_obj$X_emb),
    V4 = cbind(built_obj$X_nrc, built_obj$X_profanities, built_obj$X_bing, built_obj$X_emb),
    V5 = cbind(built_obj$X_tfidf, built_obj$X_chars, built_obj$X_nrc, built_obj$X_profanities, built_obj$X_bing, built_obj$X_emb_nosw),
    stop("Unknown feature set: ", set_name)
  )
}


X_best <- build_X_from_set(best_small$set, built_new)
## ==> LLM End

# Get y aligned to the matrix rows
y_new <- data.frame(id = rownames(X_best)) %>%
  inner_join(traindat, by = "id") %>%
  pull(label)

# Train final model on more tweets with the same alpha as the best model from before
doMC::registerDoMC(cores = 4)
alpha_star <- best_small$alpha

cat("Refitting with more tweets with set =", best_small$set, " alpha =", alpha_star, "\n")

final_fit <- glmnet::cv.glmnet(
  x = X_best,
  y = y_new,
  alpha = alpha_star,
  nfolds = 3,
  family = "binomial",
  type.measure = "auc",
  parallel = TRUE
)


In [None]:
# Predict probabilities on the training matrix
# (in-sample, so the AUC below is an optimistic estimate)
prob_preds <- predict(final_fit, newx = X_best, s = "lambda.min", type = "response")

# Compute the ROC curve
roc_embedding <- glmnet::roc.glmnet(as.numeric(prob_preds), newy = y_new)

# Plot the ROC curve
plot(roc_embedding, type = 'l', main = "Receiver Operating Characteristic function")

# Compute the AUC
glmnet::assess.glmnet(prob_preds, newy = y_new, family = "binomial")$auc


<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:30px; 
            font-weight:bold;">
    7. Submitting Predictions
</div>

In [None]:
sample_filepath = dir("..", pattern="sample.*.csv", recursive=TRUE, full.names = TRUE)

sample_submission = read_csv(sample_filepath, col_types = cols(col_character(), col_double()))

head(sample_submission)

### 7.1 Converting test tweets to features

We have to create a sparse matrix for each group of test features (tf-idf, character features, sentiment counts from NRC and Bing, profanity ratings, and mean tweet embeddings), combine the matrices, and align their columns with those of the training matrix.
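
The alignment step can be illustrated with a small, self-contained sketch. It assumes a hypothetical training vocabulary `train_cols` and a toy test matrix `X_test_toy`; both are made up, and the logic is a simplified version of the idea rather than the exact helper used in this notebook. Columns that exist only in the test data are dropped, and training columns missing from the test data are added as all-zero columns.

```r
library(Matrix)

# Hypothetical training vocabulary (column names of the training matrix).
train_cols <- c("good", "bad", "ugly")

# Toy test matrix with one known token ("bad") and one unseen token ("new").
X_test_toy <- sparseMatrix(
  i = c(1, 2), j = c(1, 2), x = c(0.5, 0.9),
  dims = c(2, 2), dimnames = list(c("t1", "t2"), c("bad", "new"))
)

# All-zero block for training columns that are absent from the test matrix.
missing_cols <- setdiff(train_cols, colnames(X_test_toy))
zero_block <- sparseMatrix(
  i = integer(0), j = integer(0), x = numeric(0),
  dims = c(nrow(X_test_toy), length(missing_cols)),
  dimnames = list(rownames(X_test_toy), missing_cols)
)

# Keep only the training columns, in training order; "new" is dropped.
X_aligned_toy <- cbind(X_test_toy, zero_block)[, train_cols, drop = FALSE]
X_aligned_toy
```

The fitted model only has coefficients for the training columns, so a test matrix with a different column set or order could not be passed to `predict()` directly.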


In [None]:
## Some helper functions

# Create the design matrix from built components
build_X_from_set <- function(set_name, b) {
  switch(set_name,
    V1 = cbind(b$X_tfidf, b$X_chars),
    V2 = cbind(b$X_tfidf, b$X_chars, b$X_nrc, b$X_profanities, b$X_bing),
    V3 = cbind(b$X_tfidf, b$X_chars, b$X_nrc, b$X_profanities, b$X_bing, b$X_emb),
    V4 = cbind(b$X_nrc, b$X_profanities, b$X_bing, b$X_emb),
    V5 = cbind(b$X_tfidf, b$X_chars, b$X_nrc, b$X_profanities, b$X_bing, b$X_emb_nosw),
    stop("Unknown feature set: ", set_name)
  )
}

# Align test sparse matrix columns to training matrix columns
align_sparse_cols <- function(M, ref_cols) {
  cur  <- colnames(M)
  miss <- setdiff(ref_cols, cur)
  if (length(miss) > 0) {
    zero_block <- Matrix::sparseMatrix(
      i = integer(0), j = integer(0), x = numeric(0),
      dims = c(nrow(M), length(miss)),
      dimnames = list(rownames(M), miss)
    )
    M <- cbind(M, zero_block)
  }
  M[, ref_cols, drop = FALSE]
}


In [None]:
## Training idf
# use the same n that we trained on (130000)
n_train_idf <- 130000

train_tokens_for_idf <- traindat %>%
  slice_head(n = n_train_idf) %>%
  mutate(id = as.character(id)) %>%
  unnest_tokens(token, tweet, token = "words") %>%
  count(id, token, name = "n")

idf_map <- train_tokens_for_idf %>%
  bind_tf_idf(token, id, n) %>%
  distinct(token, idf)

n_docs_train <- n_distinct(train_tokens_for_idf$id)


In [None]:
# Basic objects
test_small <- testdat %>% mutate(id = as.character(id))
test_ids   <- test_small$id

# Character features
test_char_feats <- test_small %>%
  transmute(
    id,
    word_count   = str_count(tweet, "\\S+"),
    tweet_length = str_length(tweet)
  )

# Unigram tokens for test
test_tokens <- test_small %>%
  unnest_tokens(token, tweet, token = "words") %>%
  count(id, token, name = "n")

# Unique-token counts (for X_chars)
test_num_unique <- test_tokens %>%
  group_by(id) %>%
  summarise(num_unique_tokens = n_distinct(token), .groups = "drop")

# tf-idf on test using training idf
sum_n <- test_tokens %>% group_by(id) %>% summarise(sum_n = sum(n), .groups = "drop")
test_tf <- test_tokens %>%
  left_join(sum_n, by = "id") %>%
  mutate(tf = n / pmax(sum_n, 1))

test_tfidf <- test_tf %>%
  left_join(idf_map, by = "token") %>%
  mutate(idf = ifelse(is.na(idf), log(n_docs_train), idf),
         tf_idf = tf * idf) %>%
  transmute(tweet_id = id, token, tf_idf)

# convert to sparse matrices
X_tfidf_test <- test_tfidf %>%
  cast_sparse(tweet_id, token, tf_idf)

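X_chars_test <- test_char_feats %>%
  left_join(test_num_unique, by = "id") %>%
  # tweets with no word tokens are absent from test_num_unique; count them as 0
  mutate(num_unique_tokens = replace_na(num_unique_tokens, 0L)) %>%
  rename(tweet_id = id) %>%
  pivot_longer(-tweet_id, names_to = "feature", values_to = "value") %>%
  cast_sparse(tweet_id, feature, value)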

In [None]:
## Do the same for the sentiment analysis
nrc <- read.csv("/kaggle/input/bing-nrc-afinn-lexicons/NRC.csv")
bing <- read.csv("/kaggle/input/bing-nrc-afinn-lexicons/Bing.csv")
nrc$sentiment <- paste0("nrc_",  nrc$sentiment)
bing$sentiment <- paste0("bing_", bing$sentiment)

# nrc counts
nrc_counts_test <- test_tokens %>%
  inner_join(nrc, by = c("token" = "word"), relationship = "many-to-many") %>%
  count(id, sentiment, wt = n, name = "cnt")

X_nrc_test <- nrc_counts_test %>%
  transmute(tweet_id = id, sentiment_nrc = sentiment, count_nrc = cnt) %>%
  cast_sparse(tweet_id, sentiment_nrc, count_nrc)

# bing counts
bing_counts_test <- test_tokens %>%
  inner_join(bing, by = c("token" = "word"), relationship = "many-to-many") %>%
  count(id, sentiment, wt = n, name = "cnt")

#convert to sparse matrix
X_bing_test <- bing_counts_test %>%
  transmute(tweet_id = id, sentiment_bing = sentiment, count_bing = cnt) %>%
  cast_sparse(tweet_id, sentiment_bing, count_bing)


In [None]:
## Do the same for the profanities lexicon
profanities <- read.csv("/kaggle/input/profanities-in-english-collection/profanity_en.csv") %>%
  mutate(text = str_trim(str_to_lower(text)))
prof_uni <- profanities %>% filter(!str_detect(text, " "))

profanity_scores_test <- test_tokens %>%
  left_join(prof_uni %>% select(text, severity_rating),
            by = c("token" = "text"), relationship = "many-to-many") %>%
  mutate(severity_rating = replace_na(severity_rating, 0)) %>%
  group_by(id) %>%
  summarise(profanity_rating = sum(n * severity_rating), .groups = "drop")

#convert to sparse matrix
X_profanities_test <- profanity_scores_test %>%
  transmute(tweet_id = id, feature = "profanity_rating", value = profanity_rating) %>%
  cast_sparse(tweet_id, feature, value)

In [None]:
# Read GloVe once (same file & dims as training)
embedding_lines <- readLines("/kaggle/input/glove-embeddings/glove.6B.200d.txt", n = 100000)
emb <- strsplit(embedding_lines, " ")
emb <- do.call(rbind, emb)
colnames(emb) <- c("token", paste0("dim", 1:200))
glove_tibble <- as_tibble(emb) %>% mutate(across(2:201, as.numeric))

# With all tokens
word_embeddings_test <- test_tokens %>% inner_join(glove_tibble, by = "token")
mean_embeddings_test <- word_embeddings_test %>%
    group_by(id) %>%
    summarise(across(starts_with("dim"), ~mean(.x, na.rm = TRUE)), .groups = "drop")

X_emb_test <- mean_embeddings_test %>%
    mutate(tweet_id = id) %>% select(-id) %>%
    pivot_longer(-tweet_id, names_to = "feature", values_to = "value") %>%
    cast_sparse(tweet_id, feature, value)
    
# Without stopwords
# Build X_emb_nosw_test only if V5 is the best model, computationally less intense
if (best_small$set == "V5") {
  test_tokens_nosw <- test_small %>%
    unnest_tokens(token, tweet, token = "words") %>%
    anti_join(stop_words, by = c("token" = "word")) %>%
    count(id, token, name = "n")

  # re-use glove_tibble we loaded for embeddings
  word_embeddings_nosw_test <- test_tokens_nosw %>% inner_join(glove_tibble, by = "token")
  mean_embeddings_nosw_test <- word_embeddings_nosw_test %>%
    group_by(id) %>%
    summarise(across(starts_with("dim"), ~mean(.x, na.rm = TRUE)), .groups = "drop")

  X_emb_nosw_test <- mean_embeddings_nosw_test %>%
    mutate(tweet_id = id) %>% select(-id) %>%
    pivot_longer(-tweet_id, names_to = "feature", values_to = "value") %>%
    cast_sparse(tweet_id, feature, value)
}


In [None]:
# Decide the row order we want
test_ids <- test_small$id

## ==> LLM Start src: https://chatgpt.com/s/t_68e75a75a9588191a3aba5c4bf723f9f
# Helper: pad to all rows in the right order
pad_rows <- function(M, ids) {
  if (is.null(M)) return(NULL)
  rn <- rownames(M)
  miss <- setdiff(ids, rn)
  if (length(miss) > 0) {
    empty <- Matrix::sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                                  dims = c(length(miss), ncol(M)),
                                  dimnames = list(miss, colnames(M)))
    M <- rbind(M, empty)
  }
  M[ids, , drop = FALSE]
}
## ==> LLM End

# Which components are needed for a given set
needed_for <- function(set_name) {
  switch(set_name,
    V1 = c("tfidf","chars"),
    V2 = c("tfidf","chars","nrc","profanities","bing"),
    V3 = c("tfidf","chars","nrc","profanities","bing","emb"),
    V4 = c("nrc","profanities","bing","emb"),
    V5 = c("tfidf","chars","nrc","profanities","bing","emb_nosw")
  )
}

# Build the list of components our set actually needs
need <- needed_for(best_small$set)

# Using the helper function, fill in missing rows
built_test <- list(ids = test_ids)
if ("tfidf" %in% need) built_test$X_tfidf <- pad_rows(X_tfidf_test, test_ids)
if ("chars" %in% need) built_test$X_chars <- pad_rows(X_chars_test, test_ids)
if ("nrc" %in% need) built_test$X_nrc <- pad_rows(X_nrc_test, test_ids)
if ("bing" %in% need) built_test$X_bing <- pad_rows(X_bing_test, test_ids)
if ("profanities" %in% need) built_test$X_profanities <- pad_rows(X_profanities_test, test_ids)
if ("emb" %in% need) built_test$X_emb <- pad_rows(X_emb_test, test_ids)
if ("emb_nosw" %in% need) built_test$X_emb_nosw <- pad_rows(X_emb_nosw_test, test_ids)

# Bind in the right order for our winning set
X_test <- build_X_from_set(best_small$set, built_test)

# Align columns to the exact training matrix we used to fit final_fit
X_test_aligned <- align_sparse_cols(X_test, colnames(X_best))


### 7.2 Generating predictions for each tweet

Next, we use our final model to predict, for each tweet, the probability that it contains offensive language.
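
As a brief illustration of what a predicted probability means for a binomial model, the sketch below converts linear predictors (log-odds) into probabilities with the logistic function. The values in `eta` are made up and do not come from our fitted model.

```r
# Hypothetical linear predictors (log-odds) for three tweets; values are made up.
eta <- c(-2, 0, 1.5)

# Logistic function: maps log-odds to probabilities in (0, 1).
prob <- plogis(eta)  # equivalent to 1 / (1 + exp(-eta))
round(prob, 3)
```

A linear predictor of 0 corresponds to a probability of 0.5, and larger values correspond to a higher predicted probability of offensive content.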

In [None]:
# Predict probabilities with the trained model
pred_prob_final <- as.numeric(predict(final_fit, newx = X_test_aligned, s = "lambda.min", type = "response"))

In [None]:
dim(testdat)


### 7.3 Writing the submission file

In [None]:
# Convert the predictions to a one-column matrix so they can carry tweet ids as row names
pred_prob_final <- cbind(prob = as.numeric(pred_prob_final))
rownames(pred_prob_final) <- rownames(X_test_aligned)

tibble(id = rownames(pred_prob_final), prob = pred_prob_final[,1]) |>
    right_join(testdat, by="id") |>
    arrange(as.numeric(id)) |> 
    replace_na(list(prob = mean(y_new))) |>
    select(-tweet) |>
    write_csv("submission.csv")

# Check to see if the format is correct
list.files()
read_csv("submission.csv", col_types="cd") # should be 50,001 x 2

<div style="background-color:#00008B; 
            color:white; 
            padding:12px; 
            border-radius:8px; 
            font-size:20px; 
            font-weight:bold;">
    8. Division of Labour
</div>

All three of us developed features: Johannes created the character-level features, Charlotte created the sentiment features, and Oliver created the word embedding features. Everyone helped to join these together. Oliver worked on the overall function that combines preprocessing and feature extraction. Johannes and Charlotte researched the topic and wrote detailed descriptions of each step of the notebook. Everyone was involved in building and testing the models. Charlotte and Oliver formatted the submission part.