# Final Project

**Group HOMEWORK**. This final project can be collaborative. A group may have at most 2 members. You can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you will receive an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves the interpretability of, trust in, and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You decide:
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. 
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions: 
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs? 
    - Which model do you use, and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two rules** (violating either one results in 0 points for the project): 
    - Do not use the test set in training: you CANNOT use any instance from the test set in the training process. 
    - Do not use code from generative AI (e.g., ChatGPT). 
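
One way to organize the required 2 x 3 comparison is to loop over feature extractors and models and collect one row per pair. The sketch below is illustrative only: the toy sentences, the `features` and `models` dictionaries, and the choice of bigram counts and logistic regression are assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy data standing in for the real train/test split (hypothetical examples).
X_train = ["good post", "nice day", "bad insult", "awful insult", "good day", "bad post"]
y_train = [0, 0, 1, 1, 0, 1]
X_test = ["nice post", "awful day"]
y_test = [0, 1]

features = {
    "TF-IDF": TfidfVectorizer(),
    "Bigram counts": CountVectorizer(ngram_range=(1, 2)),
}
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

rows = []
for feat_name, vectorizer in features.items():
    for model_name, model in models.items():
        # clone() gives every pipeline fresh, unfitted copies.
        pipe = Pipeline([("features", clone(vectorizer)), ("model", clone(model))])
        pipe.fit(X_train, y_train)  # only the training split is used for fitting
        pred = pipe.predict(X_test)
        precision, _, f1, _ = precision_recall_fscore_support(
            y_test, pred, average="weighted", zero_division=0
        )
        rows.append({
            "Feature And Model": f"{model_name} With {feat_name}",
            "Precision": precision,
            "Accuracy": accuracy_score(y_test, pred),
            "F1-score": f1,
        })

comparison = pd.DataFrame(rows)
print(comparison)
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned from the training texts only, which keeps the test set out of training.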

## Evaluation

Performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and F1-score (use `classification_report` in sklearn). 
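
The "weighted avg" row of `classification_report` is the per-class score averaged with each class weighted by its support (number of true instances). The following sketch checks this on a handful of made-up labels; `y_true` and `y_pred` are hypothetical, not project results.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# With output_dict=True the report is a nested dict keyed by class name.
report = classification_report(y_true, y_pred, output_dict=True)

# Recompute the weighted precision by hand from the per-class rows.
total = len(y_true)
manual_weighted_precision = sum(
    report[label]["precision"] * report[label]["support"] for label in ("0", "1")
) / total

print(manual_weighted_precision)
print(report["weighted avg"]["precision"])
```

Both printed values agree, so the weighted average can hide a weak minority class when the majority class dominates the support.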

The project is worth 10.0 points in total. Each team competes with the other teams in the class on its best performance. Points will be deducted if the requirements above are not followed.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report. 

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) a summary (you can add markdown cells). 

The due date is **Friday, December 8, by 11:59pm**.

## Imports

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import warnings
import re
import emoji
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder 
from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

nltk.download("punkt")

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Suppress warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lukel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read data

In [2]:


df = pd.read_csv('edos_labelled_data.csv')

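def remove_emojis(string):
    """Strip all emoji characters from the text."""
    return emoji.replace_emoji(string, replace='')

def remove_stop_words(string):
    """Drop English stop words; tokens are split on single spaces."""
    words = filter(None, string.split(' '))
    return " ".join(w for w in words if w not in stop_words)

def to_lower_case(value):
    """Lower-case the text so that matching is case-insensitive."""
    return value.lower()

def remove_punctuation(string):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', string)

# Create the stemmer once instead of on every call.
stemmer = PorterStemmer()

def stem_words(string):
    """Reduce each token to its Porter stem (this is stemming, not lemmatization)."""
    words = word_tokenize(string)
    return " ".join(stemmer.stem(word) for word in words)

# Encode the string labels as integers (classes are sorted alphabetically).
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["label"])

# Apply the preprocessing steps in order.
df['text'] = df['text'].apply(to_lower_case)
df['text'] = df['text'].apply(remove_emojis)
df['text'] = df['text'].apply(remove_stop_words)
df['text'] = df['text'].apply(remove_punctuation)
df['text'] = df['text'].apply(stem_words)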
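
The order of the preprocessing steps affects the result. In the cell above, stop words are removed before punctuation is stripped, so a token such as `the,` does not match the stop-word set and survives. The sketch below shows the effect with plain `re` and a hypothetical three-word stop list (not the NLTK list); the helper names are illustrative.

```python
import re

STOP_WORDS = {"the", "is", "a"}  # hypothetical tiny stop list for illustration

def strip_punctuation(text):
    # Remove everything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text):
    # Keep only the tokens that are not in the stop list.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "the, cat is a pet."

# Stop words first: "the," still carries its comma and is not filtered.
stop_first = strip_punctuation(drop_stop_words(sample.lower()))

# Punctuation first: "the" is clean by the time the filter runs.
punct_first = drop_stop_words(strip_punctuation(sample.lower()))

print(stop_first)   # the cat pet
print(punct_first)  # cat pet
```

Whether the leftover stop words matter for the classifier is an empirical question; comparing both orders on the training data is one way to find out.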

## Split into train and test

In [13]:
train = df[df['split'] == "train"]
test = df[df['split'] == "test"]

X_train = train["text"]
y_train = train["label"]

X_test = test["text"]
y_test = test["label"]
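
Since using test instances in training is disallowed, a quick sanity check on the split can be worthwhile. The sketch below uses a small made-up DataFrame with the same `text`, `label`, and `split` columns; the rows are hypothetical and the checks are only one possible way to verify the split.

```python
import pandas as pd

# Hypothetical miniature version of the labelled data.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 0, 1, 0],
    "split": ["train", "train", "train", "train", "test", "test"],
})

train = df[df["split"] == "train"]
test = df[df["split"] == "test"]

# No row should belong to both splits.
overlap = train.index.intersection(test.index)

# No identical text should appear in both splits either.
shared_texts = set(train["text"]) & set(test["text"])

# Class proportions in each split, to spot imbalance early.
train_balance = train["label"].value_counts(normalize=True)
test_balance = test["label"].value_counts(normalize=True)

print(len(overlap), len(shared_texts))
print(train_balance)
print(test_balance)
```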

## Create the Results DataFrame

In [105]:
columns = ['Feature And Model', 'Precision', 'Accuracy', 'F1-score']
results = pd.DataFrame(columns=columns) 

def add_new_result(results_df, report, model_name, accuracy):
    precision = report['weighted avg']['precision']
    f1_score = report['weighted avg']['f1-score']
    data = {
        'Feature And Model': model_name, 
        'Precision': precision, 
        'Accuracy': accuracy, 
        'F1-score': f1_score
    }
    row_data = pd.DataFrame([data])
    # ignore_index=True keeps the row index sequential instead of repeating 0.
    results_df = pd.concat([results_df, row_data], ignore_index=True)
    return results_df
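
Appending a row on every call means that re-running a model cell adds the same result again. One possible alternative, sketched here with a hypothetical `record_result` helper and illustrative numbers, is to key the results by their label so that a re-run overwrites the earlier entry, and to build the DataFrame once at the end.

```python
import pandas as pd

results_by_name = {}  # hypothetical store keyed by the feature+model label

def record_result(store, name, precision, accuracy, f1):
    # Assigning by key replaces any earlier entry with the same label.
    store[name] = {"Precision": precision, "Accuracy": accuracy, "F1-score": f1}

# Illustrative values only.
record_result(results_by_name, "KNN With TF-IDF", 0.68, 0.73, 0.67)
record_result(results_by_name, "KNN With TF-IDF", 0.68, 0.73, 0.67)  # re-run: no duplicate
record_result(results_by_name, "SVM With TF-IDF", 0.81, 0.82, 0.80)

# Build the table once, turning the dictionary keys into a regular column.
results_table = (
    pd.DataFrame.from_dict(results_by_name, orient="index")
    .rename_axis("Feature And Model")
    .reset_index()
)
print(results_table)
```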

## Naive Bayes

In [100]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('naive bayes', MultinomialNB())         
])

clf.fit(X_train, y_train)

nb_predictions = clf.predict(X_test)
report = classification_report(y_test, nb_predictions, output_dict=True)
accuracy = accuracy_score(y_test, nb_predictions)
results = add_new_result(results, report, "Naive Bayes With TF-IDF", accuracy) 

## Support Vector Machine

In [101]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('svm', SVC(kernel='linear'))         
])

clf.fit(X_train, y_train)

svm_predictions = clf.predict(X_test)
report = classification_report(y_test, svm_predictions, output_dict=True)
accuracy = accuracy_score(y_test, svm_predictions)
results = add_new_result(results, report, "Support Vector machine With TF-IDF", accuracy) 

## K Nearest Neighbors

In [104]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('knn', KNeighborsClassifier(n_neighbors=3))         
])

clf.fit(X_train, y_train)

knn_predictions = clf.predict(X_test)
report = classification_report(y_test, knn_predictions, output_dict=True)
accuracy = accuracy_score(y_test, knn_predictions)
results = add_new_result(results, report, "K Nearest Neighbors With TF-IDF", accuracy) 
print(classification_report(y_test, knn_predictions))
results

              precision    recall  f1-score   support

           0       0.75      0.95      0.84       789
           1       0.52      0.15      0.24       297

    accuracy                           0.73      1086
   macro avg       0.63      0.55      0.54      1086
weighted avg       0.68      0.73      0.67      1086



| Feature And Model | Precision | Accuracy | F1-score |
|---|---|---|---|
| Naive Bayes With TF-IDF | 0.792938 | 0.740331 | 0.644122 |
| Support Vector machine With TF-IDF | 0.813795 | 0.81768 | 0.799877 |
| K Nearest Neighbors With TF-IDF | 0.684964 | 0.729282 | 0.672106 |
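
The support column of the classification report shows 789 test instances of class 0 and 297 of class 1, so accuracy alone can be misleading. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class; the variable names are illustrative.

```python
# Test-set class counts, taken from the support column of the report above.
support = {0: 789, 1: 297}

total = sum(support.values())

# Always predicting the most frequent class gets every majority instance right.
majority_accuracy = max(support.values()) / total

print(total)                       # 1086
print(f"{majority_accuracy:.3f}")  # 0.727
```

Against this baseline of about 0.727, the Naive Bayes (0.740) and K Nearest Neighbors (0.729) accuracies are only marginally better, while the SVM (0.818) is clearly above it.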


## Create the Results DataFrame

In [None]:
results = {
    'Feature And Model': ['Naive Bayes With TF-IDF', 'Naive Bayes With TF-IDF',
                          'SVM With TF-IDF', 'SVM With TF-IDF',
                          'KNN With TF-IDF', 'KNN With TF-IDF'],
    'Precision': [0.74, 0.00, 0.68, 0.00, 0.00, 0.00],
    'Accuracy': [0.74, 0.00, 0.68, 0.00, 0.00, 0.00],
    'F1-score': [0.85, 0.00, 0.68, 0.00, 0.00, 0.00]
}

# Classification report pasted for reference:
#
#               precision    recall  f1-score   support
#
#            0       0.74      1.00      0.85       789
#            1       0.94      0.05      0.10       297
#
#     accuracy                           0.74      1086
#    macro avg       0.84      0.53      0.48      1086
# weighted avg       0.79      0.74      0.64      1086

## Summary

1. What preprocessing steps do you follow?
   
   Your answer:
   
2. How do you select the features from the inputs?
   
   Your answer:
   
3. Which model do you use, and what is the structure of your model?
   
   Your answer:
   
4. How do you train your model?
   
   Your answer:
   
5. What is the performance of your best model?
   
   Your answer:
   
6. What other models or feature engineering methods would you like to implement in the future?
   
   Your answer:
   