## Assignment 3: Subreddit Classification

In this assignment, you will train and evaluate text classification models.
Your goal is to classify Reddit comments according to which subreddit they were posted in.


### 📚 Data

The data you will be working with consists of real-world text comments from Reddit.
Comments are sampled in equal proportions from each of five political subreddits: The_Donald, Conservative, politics, Libertarian, and ChapoTrapHouse.

Here is an example entry: 
- `id`: 2091
- `text`: Eating soap to own the republicans
- `label`: ChapoTrapHouse

The data is split into three files.
The TRAIN and DEV sets contain texts and labels. Use these to develop your classification models.
The TEST set contains only texts. This is what you will be assessed on.

Please note that these are real Reddit comments.
We filtered out likely-inappropriate content, but some comments may still be sensitive or offensive.


### 📝 Task

Your task is to train a model to classify posts according to which subreddit they belong to.
This is a five-way classification task with balanced data.


### ⚙️ Implementation

There are three tracks.
Each track corresponds to a different type of model you will train.

- **Track 1 - "classic ML"**: Logistic regression, SVM, random forest, or any other model that does not require training a neural network. You can still use neural embeddings (e.g. GloVe, Word2Vec) as input features.
- **Track 2 - "BERT-style"**: Encoder models like DistilBERT, RoBERTa, DeBERTa, etc. You can use pre-trained models, but you must fine-tune them on the training data.
- **Track 3 - "Open" (✨OPTIONAL✨)**: Any classifier. This could include methods not covered in class. It is important that you provide a clear description of the methodology you used in your submission.

You may **not** use additional data for training your model.
You may **not** use API models (e.g. GPT-4, Claude, etc.) either for processing inputs (e.g. getting embeddings) or for getting predictions.

### 📈 Baselines

For your reference, we provide the following baselines:

#### TF-IDF Logistic Regression Classifier with 2-6 Character Ngrams
Trained on TRAIN and evaluated on DEV.

| Class           | Precision | Recall | F1-score | Support |
|-----------------|-----------|--------|----------|---------|
| ChapoTrapHouse  | 0.45      | 0.55   | 0.50     | 799     |
| Libertarian     | 0.52      | 0.52   | 0.52     | 798     |
| politics        | 0.39      | 0.34   | 0.36     | 799     |
| Conservative    | 0.32      | 0.26   | 0.29     | 802     |
| The_Donald      | 0.39      | 0.44   | 0.41     | 802     |
| Accuracy        |           |        | 0.42     |         |
| Macro Avg       | 0.42      | 0.42   | 0.42     | 4000    |
| Weighted Avg    | 0.42      | 0.42   | 0.42     | 4000    |
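
A baseline of this kind can be sketched with scikit-learn as follows. This is a minimal sketch, not the exact configuration behind the table above: the toy texts, the `analyzer="char"` setting, and `max_iter=1000` are assumptions made for illustration, and the real TRAIN and DEV files would replace the in-line lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the TRAIN and DEV files.
train_texts = [
    "lower taxes and smaller government please",
    "taxation is theft and always has been",
    "the senate votes on the new bill today",
    "congress passed the budget bill last night",
]
train_labels = ["Libertarian", "Libertarian", "politics", "politics"]
dev_texts = ["government should cut taxes", "the bill goes to the senate"]
dev_labels = ["Libertarian", "politics"]

# TF-IDF over character n-grams of length 2 to 6, followed by logistic regression.
# analyzer="char" and max_iter=1000 are assumptions, not the official settings.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 6)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Score on the held-out texts with the same metric used for assessment.
dev_pred = model.predict(dev_texts)
dev_macro_f1 = f1_score(dev_labels, dev_pred, average="macro")
print(f"DEV macro F1: {dev_macro_f1:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted on the training texts only, so no information from DEV leaks into the features.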

#### Random Choice Classifier
Assigns a class label to each comment entirely at random, without considering any input features or patterns in the data.

| Class           | Precision | Recall | F1-score | Support |
|-----------------|-----------|--------|----------|---------|
| ChapoTrapHouse  | 0.20      | 0.20   | 0.20     | 799     |
| Libertarian     | 0.20      | 0.21   | 0.20     | 798     |
| politics        | 0.19      | 0.19   | 0.19     | 799     |
| Conservative    | 0.20      | 0.19   | 0.20     | 802     |
| The_Donald      | 0.19      | 0.20   | 0.19     | 802     |
| Accuracy        |           |        | 0.20     |         |
| Macro Avg       | 0.20      | 0.20   | 0.20     | 4000    |
| Weighted Avg    | 0.20      | 0.20   | 0.20     | 4000    |
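
For intuition about why this baseline lands near 0.20, here is a minimal sketch of a random-choice classifier on balanced five-way labels. The helper name `random_choice_predict`, the fixed seed, and the synthetic gold labels are assumptions for illustration. This is not the script that produced the table above.

```python
import random

from sklearn.metrics import f1_score

LABELS = ["The_Donald", "Conservative", "politics", "Libertarian", "ChapoTrapHouse"]


def random_choice_predict(n_examples, labels=LABELS, seed=0):
    """Return one uniformly random label per example (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.choice(labels) for _ in range(n_examples)]


# Synthetic, perfectly balanced gold labels: 800 per class, 4000 in total.
y_true = LABELS * 800
y_pred = random_choice_predict(len(y_true))

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Random baseline macro F1: {macro_f1:.2f}")
```

With five balanced classes, a random guess is correct about one time in five, so per-class precision, recall, and F1 all hover around 0.20.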

### 🏅 Assessment

Tracks will have equal weighting in the final grade.
If you submit to Track 3, **we will use your two best tracks to evaluate your assignment**.

For each track, you will be assessed based on the **Macro F1 Score** of your predictions on the TEST set.
We have provided an example of how to calculate Macro F1 below, which you should use when developing your classification model on the TRAIN and DEV sets.
We will use this exact implementation to evaluate your predictions on the TEST set.

For each track, your submission will be a CSV file with two columns:
`id`, which is the ID of the comment in the TEST set, and
`label`, which is the text label you predict for that comment (e.g. "politics" or "ChapoTrapHouse").
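
A file in this format can be written with Python's standard `csv` module. The sketch below is only illustrative: `write_predictions` is a hypothetical helper, the second ID is made up, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_predictions(ids, labels, out_file):
    """Write an (id, label) CSV with a header row (hypothetical helper)."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    writer = csv.writer(out_file)
    writer.writerow(["id", "label"])
    for comment_id, label in zip(ids, labels):
        writer.writerow([comment_id, label])


# In-memory demo. For a real submission, open the target file instead, e.g.
# open("track_1_test.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_predictions([2091, 2092], ["ChapoTrapHouse", "politics"], buffer)
print(buffer.getvalue())
```

Labels are written as the subreddit names themselves, not as integer class indices.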

You also have to submit a brief description of the methodology you used for each track (max 100 words per track).
It is very important that you stick to the "allowed" methods for each track.
We will check your code: if it is missing or does not conform to the regulations, you will receive a 0 for that track.


### 📥 Submission Instructions

Follow these instructions to submit your assignment on BlackBoard:

1. **File structure**: Ensure that your submission is a .zip file, and that it contains the following items with exactly these specified names:
  - `track_1_test.csv`: A CSV file with two columns (id, label) for Track 1.
  - `track_2_test.csv`: A CSV file with two columns (id, label) for Track 2.
  - `track_3_test.csv` (optional): A CSV file with two columns (id, label) for Track 3.
  - `description.txt`: A brief description of the methodology you used for each track (max 100 words per track).
  - `/code`: A folder containing all your code for the assignment. This code needs to be well-documented, and fully and easily reproducible by us. For very large files, include a Google Drive link to the files in your description.txt instead of uploading them directly.
2. **Submission**: Upload the .zip file to the BlackBoard Assignment 3 section.
3. **Deadline**: Please refer to BlackBoard.
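
Before uploading, it may help to check the archive against the required file names. The sketch below is not an official validator: `missing_items` is a hypothetical helper, and it assumes the required items sit at the top level of the .zip file.

```python
import io
import zipfile

# Required items from the file-structure list (Track 3 is optional).
REQUIRED_FILES = ["track_1_test.csv", "track_2_test.csv", "description.txt"]


def missing_items(zip_file):
    """Return the required items that are absent from the archive."""
    with zipfile.ZipFile(zip_file) as archive:
        names = archive.namelist()
    missing = [name for name in REQUIRED_FILES if name not in names]
    # The code folder counts as present if any entry lives under "code/".
    if not any(name.startswith("code/") for name in names):
        missing.append("code/")
    return missing


# In-memory demo archive that deliberately omits track_2_test.csv.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("track_1_test.csv", "id,label\n")
    archive.writestr("description.txt", "Track 1: ...\n")
    archive.writestr("code/train.py", "# training code\n")

missing = missing_items(demo)
print(missing)
```

For a real submission, pass the path of the .zip file to `missing_items` instead of the in-memory buffer.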



```python
# Example of Macro F1 Score calculation.
# We will use this implementation to evaluate your submissions.

from sklearn.metrics import f1_score

y_true = ["ChapoTrapHouse", "politics", "ChapoTrapHouse"]
y_pred = ["ChapoTrapHouse", "politics", "politics"]

f1_score(y_true, y_pred, average="macro")
# Output: 0.6666666666666666
```
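
To see where this number comes from, the same score can be computed by hand: Macro F1 is the unweighted mean of the per-class F1 scores, where each class's F1 equals 2·TP / (2·TP + FP + FN). The function below is an illustrative re-implementation under that definition, not the code used for grading, which is the scikit-learn call above.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this class, but it was wrong
            fn[gold] += 1  # missed the true class
    labels = sorted(set(y_true) | set(y_pred))
    # Every label here occurs at least once, so the denominator is never zero.
    per_class = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels]
    return sum(per_class) / len(per_class)


y_true = ["ChapoTrapHouse", "politics", "ChapoTrapHouse"]
y_pred = ["ChapoTrapHouse", "politics", "politics"]

# ChapoTrapHouse: TP=1, FP=0, FN=1 gives F1 = 2/3.
# politics:       TP=1, FP=1, FN=0 gives F1 = 2/3.
result = macro_f1(y_true, y_pred)
print(result)
```

Because every class counts equally, a model that never predicts one of the five subreddits gets an F1 of 0 for that class, which lowers the macro average even if its accuracy on the other classes is high.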