# Final Project

**Group HOMEWORK**. This final project can be collaborative. A group may have at most 2 members, and you can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you will receive an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can determine: 
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report the best performance together with (1) the feature engineering method and (2) the model that achieved it. 
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions: 
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs? 
    - Which model do you use and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two rules** (violating either will result in a grade of 0 points): 
    - Do not use the test set in training: you CANNOT use any of the instances from the test set in the training process. 
    - Do not use code from generative AI (e.g., ChatGPT). 

## Evaluation

Performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and f1-score (use `classification_report` in sklearn). 
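The weighted-average metrics and the results dataframe described above could be assembled roughly as follows. This is a minimal sketch on made-up values: `y_true`, the entries of `predictions`, and the 0/1 label encoding are hypothetical placeholders, not results from the actual dataset.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, precision_score)

# Hypothetical true labels and predictions (assumed encoding: 1 = sexist, 0 = not sexist)
y_true = [1, 0, 0, 1, 0, 0]
predictions = {
    "TF-IDF + LogisticRegression": [1, 0, 0, 0, 0, 0],
    "N-grams + SVC": [1, 0, 1, 1, 0, 0],
}

records = []
for y_pred in predictions.values():
    records.append({
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred, average="weighted"),
    })

# One row per feature+model combination
results = pd.DataFrame(records, index=list(predictions.keys()))
print(results.round(3))

# Per-class precision, recall and f1-score for one combination
print(classification_report(y_true, predictions["N-grams + SVC"]))
```

In the real report, each entry of `predictions` would come from a fitted model evaluated on the test set only.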

The project is worth 10.0 points in total. Each team will compete with the other teams in the class on their best performance. Points will be deducted if the requirements above are not followed.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report. 

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) a summary (you can add markdown cells). 

The due date is **Friday, December 8, by 11:59pm**.

## Imports

In [61]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import warnings
import re
import emoji
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

nltk.download("punkt")

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Suppress warnings
warnings.filterwarnings("ignore")

print(stop_words)

{'hers', 'o', "you'd", "wasn't", 'for', 'needn', 've', 'in', 'all', 'yourselves', "it's", 'above', 'them', 'through', 'once', 'ain', 'being', 'until', 'down', "isn't", "mightn't", 't', 'only', 'myself', 'have', 'd', 'nor', 'will', 'at', 'do', 'up', 'that', "should've", 'yours', 'll', 'himself', 'the', 'few', 'over', "aren't", 'against', 'such', "doesn't", "hasn't", 'its', 'by', 'should', 'very', 'him', 'it', 'does', 'our', 're', 'these', "needn't", 'out', 'now', 'having', 'with', 'be', 'both', 'shouldn', 'just', 'themselves', "that'll", 'to', 'shan', "don't", 'after', 'doesn', 'some', 'yourself', 'each', 'was', "you've", 'any', 'than', 'because', 'about', "shouldn't", 'we', "couldn't", 'they', "wouldn't", 'he', 's', 'can', 'from', 'your', 'my', 'into', 'ma', 'don', 'an', 'of', 'wasn', 'she', 'below', "didn't", "hadn't", "you're", 'or', 'same', 'this', "she's", 'did', 'as', 'between', 'weren', 'his', 'ourselves', 'and', 'more', 'couldn', 'when', 'me', 'wouldn', 'those', "you'll", 'not',

[nltk_data] Downloading package stopwords to C:\Users\very cool
[nltk_data]     guy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\very cool
[nltk_data]     guy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read data

In [62]:


df = pd.read_csv('edos_labelled_data.csv')

def remove_emojis(string):
    return emoji.replace_emoji(string, replace='')

def remove_stop_words(string):
    # Split on single spaces, then drop empty tokens and stop words
    kept = [w for w in string.split(' ') if w and w not in stop_words]
    return " ".join(kept)
    
def to_lower_case(value):
    return value.lower()

def remove_punctuation(string):
    return re.sub(r'[^\w\s]', '', string)
    
def stem_words(string):
    # PorterStemmer performs stemming (rule-based suffix stripping), not lemmatization
    words = word_tokenize(string)
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in words])

# Stop words are removed before punctuation so that contractions such as "don't" still match
df['text'] = df['text'].apply(to_lower_case)
df['text'] = df['text'].apply(remove_emojis)
df['text'] = df['text'].apply(remove_stop_words)
df['text'] = df['text'].apply(remove_punctuation)
df['text'] = df['text'].apply(stem_words)

## Encode data

In [71]:
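tfidf_vectorizer = TfidfVectorizer()
# NOTE: fitting on all of df means the vocabulary and IDF weights also see the rows
# whose split is "test"; fit on the training rows only to keep the test set out of training
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

feature_names = tfidf_vectorizer.get_feature_names_out()

# Dense copy of the TF-IDF matrix, one column per vocabulary term
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

encoded_df = pd.concat([df, tfidf_df], axis=1)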



      0 bodi fat dehydr bro ricel gauntmax volcel hold ghoul frame theori legit tri contribut im incel got pussi sorri rather sorri sorri lad x  \
0                                                 False                                                                                           
1                                                 False                                                                                           
2                                                 False                                                                                           
3                                                 False                                                                                           
4                                                 False                                                                                           
...                                                 ...                                                               

## Split into train and test

In [68]:
train = df[df['split'] == "train"]
test = df[df['split'] == "test"]

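# Select single columns as 1-D Series; iterating over a one-column DataFrame
# yields its column names rather than its rows, which breaks text vectorizers
X_train = train["text"]
y_train = train["label"]

X_test = test["text"]
y_test = test["label"]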


0       nigeria rape woman men rape back nsfw in niger...
1                                             then keeper
2       like metallica video poor mutil bastard say pl...
3                                                   woman
4                                            bet wish gun
                              ...                        
5274    make clear look ltr mislead her get better dis...
5275    like big sisterhood stem her feminist bitch ol...
5276    goe like thi im danc floor there girl seem che...
5277    could like ladi corner man cave could paus che...
5278                              yea tran women hate men
Name: text, Length: 5279, dtype: object
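One way to keep the test rows out of training is to wrap the encoder and the classifier in a `Pipeline` and fit it on the training split only. The sketch below assumes the same `text`, `label` and `split` column names as above. The rows of `toy_df`, the 0/1 labels and the choice of `LogisticRegression` are hypothetical placeholders.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the preprocessed dataframe (same column names as above)
toy_df = pd.DataFrame({
    "text": ["bad awful post", "good kind post", "awful rude reply",
             "kind helpful reply", "bad unseen post", "good helpful post"],
    "label": [1, 0, 1, 0, 1, 0],
    "split": ["train", "train", "train", "train", "test", "test"],
})

toy_train = toy_df[toy_df["split"] == "train"]
toy_test = toy_df[toy_df["split"] == "test"]

# The pipeline fits the vectorizer and the classifier on training rows only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(toy_train["text"], toy_train["label"])

# Test rows are only transformed and scored, never fitted
toy_pred = pipeline.predict(toy_test["text"])

# "unseen" occurs only in a test row, so it is absent from the learned vocabulary
print("unseen" in pipeline.named_steps["tfidf"].vocabulary_)
```

Because the vectorizer is a pipeline step, swapping in a different encoder or classifier changes one line and leaves the train/test separation intact.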


## Summary

1. What preprocessing steps do you follow?
   
   Your answer:
   
2. How do you select the features from the inputs?
   
   Your answer:
   
3. Which model do you use and what is the structure of your model?
   
   Your answer:
   
4. How do you train your model?
   
   Your answer:
   
5. What is the performance of your best model?
   
   Your answer:
   
6. What other models or feature engineering methods would you like to implement in the future?
   
   Your answer:
   