# Final Project

**Group HOMEWORK**. This final project can be collaborative. A group may have at most 2 members, and you can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you get an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can determine:
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. 
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions: 
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs? 
    - Which model do you use and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two Rules**. Violating either one will result in 0 points: 
    - Do not use the test set in training: you CANNOT use any of the instances from the test set in the training process. 
    - Do not use code from generative AI (e.g., ChatGPT). 

## Evaluation

The performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and f1-score (use `classification_report` in sklearn). 
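As a rough illustration of what the weighted average in `classification_report` summarizes, the sketch below recomputes support-weighted precision, recall and F1 using only the standard library. The function name `weighted_scores` and the toy labels are assumptions made for this example, not part of the assignment.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return support-weighted precision, recall and F1 over the true classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / n
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Each class contributes in proportion to its share of the true labels
        weight = n / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy labels (hypothetical): three "not sexist" (0) posts and one "sexist" (1) post
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(precision, recall, f1)
```

Because each class is weighted by its support, the weighted recall always equals plain accuracy, which is one reason the F1-score is the more informative column when classes are imbalanced.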

The total points are 10.0. Each team will compete with other teams in the class on their best performance. Points will be deducted if the requirements above are not followed.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report. 

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) summary (you can add markdown cells). 

The due date is **Friday, December 8, by 11:59pm**.

## Imports
We include all of the imports needed throughout the program here, so that later steps do not need to import each component separately.


In [13]:
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.svm import SVC
import re
import emoji
import nltk
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to C:\Users\very cool
[nltk_data]     guy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\very cool
[nltk_data]     guy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


TensorFlow version: <module 'tensorflow._api.v2.version' from 'C:\\Users\\very cool guy\\anaconda3\\envs\\tf-gpu\\lib\\site-packages\\tensorflow\\_api\\v2\\version\\__init__.py'>
Num GPUs Available:  1


## Process data
In this step we clean all of the text: we convert it to lowercase, remove emojis, stop words, and punctuation, and reduce each word to its stem with the Porter stemmer. We also encode the string labels as integers. Removing these uninformative components leaves the models with a smaller, more consistent vocabulary to learn from.

In [24]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download("punkt")

df = pd.read_csv('edos_labelled_data.csv')

def remove_emojis(string):
    return emoji.replace_emoji(string, replace='')

def remove_stop_words(string):
    # split() with no argument splits on any whitespace and never yields empty tokens
    words = string.split()
    return " ".join(w for w in words if w not in stop_words)
    
def to_lower_case(value):
    return value.lower()

def remove_punctuation(string):
    return re.sub(r'[^\w\s]', '', string)
    
# Build the stemmer once instead of on every call
stemmer = PorterStemmer()

def stem(string):
    # Porter stemming strips suffixes (e.g. "running" -> "run"); it is not true lemmatization
    words = word_tokenize(string)
    return " ".join([stemmer.stem(word) for word in words])

# Encode the string labels as integers
label_encoder = LabelEncoder() 
df["label"] = label_encoder.fit_transform(df["label"])

# Apply the cleaning steps in order
df['text'] = df['text'].apply(to_lower_case)
df['text'] = df['text'].apply(remove_emojis)
df['text'] = df['text'].apply(remove_stop_words)
df['text'] = df['text'].apply(remove_punctuation)
df['text'] = df['text'].apply(stem)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lukel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
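The order of the cleaning steps affects the result. The sketch below uses only the standard library and a tiny, hypothetical stop-word set (`TINY_STOP_WORDS` stands in for NLTK's English list) to show that removing stop words before punctuation can leave stop words behind when they are attached to punctuation. The helper names are assumptions for this illustration.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list
TINY_STOP_WORDS = {"you", "are", "a", "the"}

def strip_punctuation(text):
    # Same regular expression as the notebook's remove_punctuation
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text, stop_words):
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "You, sir, are a genius!".lower()

# Order used in the notebook: stop words first, punctuation second.
# The token "you," does not match "you", so it survives.
stop_then_punct = strip_punctuation(drop_stop_words(sample, TINY_STOP_WORDS))

# Alternative order: punctuation first, stop words second.
punct_then_stop = drop_stop_words(strip_punctuation(sample), TINY_STOP_WORDS)

print(stop_then_punct)
print(punct_then_stop)
```

Neither order is strictly better: stripping punctuation first also turns contractions such as "don't" into "dont", which may no longer match the stop-word list.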


## Split into train and test
This splits the data into separate training and test sets using the `split` column of the dataset. The models are fit on the training set only, and the held-out test set is used to gauge their performance.

In [25]:
train = df[df['split'] == "train"]
test = df[df['split'] == "test"]

X_train = train["text"]
y_train = train["label"]

X_test = test["text"]
y_test = test["label"]
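A quick sanity check on the split can catch accidental leakage. The sketch below works on a handful of made-up rows (plain dictionaries rather than the real dataframe) and checks that no text appears in both partitions and what share of each label each partition holds. The rows and the helper `label_share` are hypothetical.

```python
from collections import Counter

# Hypothetical rows standing in for the CSV, using the same column names
rows = [
    {"text": "first post", "label": 0, "split": "train"},
    {"text": "second post", "label": 1, "split": "train"},
    {"text": "third post", "label": 0, "split": "train"},
    {"text": "fourth post", "label": 0, "split": "test"},
    {"text": "fifth post", "label": 1, "split": "test"},
]

train_rows = [r for r in rows if r["split"] == "train"]
test_rows = [r for r in rows if r["split"] == "test"]

# No text should appear in both partitions
overlap = {r["text"] for r in train_rows} & {r["text"] for r in test_rows}

def label_share(partition):
    """Return the fraction of rows carrying each label."""
    counts = Counter(r["label"] for r in partition)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(len(train_rows), len(test_rows), overlap)
print(label_share(train_rows), label_share(test_rows))
```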

## Import BERT
This loads two layers from TensorFlow Hub: the BERT preprocessing layer, which tokenizes raw text, and the uncased BERT base encoder (12 layers, hidden size 768). Together they let us encode each message as a fixed-length numerical feature vector.


In [26]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [27]:
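def get_sentence_embedding(sentences):
    # Tokenize the raw strings, then run them through the encoder
    preprocessed_text = bert_preprocess(sentences)
    # pooled_output holds one 768-dimensional vector per input sentence
    return bert_encoder(preprocessed_text)['pooled_output']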
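Passing thousands of posts through BERT in a single call may exhaust memory, so one option is to embed the texts in batches. The sketch below shows the batching logic only; `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for the real BERT call so that the example runs without TensorFlow.

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Apply embed_fn to consecutive slices of texts and collect one vector per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # embed_fn is expected to return one row per text in the batch
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    # Stand-in for the BERT call: a 2-dimensional vector per text
    # (character count, number of spaces)
    return [[float(len(t)), float(t.count(" "))] for t in batch]

vectors = embed_in_batches(["a b", "hello", "x y z"], fake_embed, batch_size=2)
print(vectors)
```

With the real encoder, each batch result would likely need converting to a NumPy array (for example with `.numpy()`) before the rows are handed to scikit-learn.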

## Create model
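This section trains a small neural classifier directly on the BERT pooled output: a dropout layer followed by a single sigmoid unit. Because `hub.KerasLayer` is not trainable by default, the BERT weights stay frozen and only the output layer is trained. This is an experiment that is not included in the results table, and the section may be removed later.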

In [19]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [21]:
METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)

model.fit(X_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x24502638cd0>

In [22]:
model.evaluate(X_test, y_test)



[0.5658291578292847,
 0.7504603862762451,
 0.595588207244873,
 0.27272728085517883]
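Keras returns the loss followed by the compiled metrics, so the four numbers above should correspond to loss, accuracy, precision and recall, in the order given in `METRICS`. The snippet below derives the F1-score for the positive class from the reported precision and recall, so that this experiment can be compared with the other models. It assumes the label encoder mapped the sexist class to 1.

```python
# Values copied from the model.evaluate output; the order is assumed to be
# [loss, accuracy, precision, recall], following the order of METRICS
loss, accuracy, precision, recall = (
    0.5658291578292847,
    0.7504603862762451,
    0.595588207244873,
    0.27272728085517883,
)

# F1 is the harmonic mean of precision and recall
f1_positive = 2 * precision * recall / (precision + recall)
print(f"F1 for the positive class: {f1_positive:.3f}")
```

The low recall dominates the harmonic mean, so the F1-score for the positive class is far below the 0.75 accuracy.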

## Create the Results DataFrame
This creates an empty dataframe whose columns hold the performance metrics of each feature and model combination. It also defines a helper function that appends a new row of results to the dataframe.

In [5]:
columns = ['Feature And Model', 'Precision', 'Accuracy', 'F1-score']
results = pd.DataFrame(columns=columns) 

def add_new_result(results_df, report, model_name, accuracy):
    precision = report['weighted avg']['precision']
    f1_score = report['weighted avg']['f1-score']
    data = {
        'Feature And Model': model_name, 
        'Precision': precision, 
        'Accuracy': accuracy, 
        'F1-score': f1_score
    }
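    row_data = pd.DataFrame([data])
    # ignore_index=True renumbers the rows so that every entry does not end up with index 0
    results_df = pd.concat([results_df, row_data], ignore_index=True)
    return results_df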

## Naive Bayes TF-IDF
This first uses TF-IDF to encode the text as numerical feature vectors. The encoded data is then passed to a Naive Bayes model for training. Finally, the trained model is evaluated on the test set and its performance is added to the results dataframe.

In [6]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('naive bayes', MultinomialNB())         
])

clf.fit(X_train, y_train)

nb_predictions = clf.predict(X_test)
report = classification_report(y_test, nb_predictions, output_dict=True)
accuracy = accuracy_score(y_test, nb_predictions)
results = add_new_result(results, report, "Naive Bayes With TF-IDF", accuracy) 
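To make the Naive Bayes step less of a black box, the sketch below implements a minimal multinomial Naive Bayes with Laplace smoothing in plain Python. It is a simplification under stated assumptions: it works on raw word counts rather than TF-IDF weights, and the function names and toy documents are hypothetical, so it is not what `MultinomialNB` does internally line by line.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from word counts."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

# Toy training set (hypothetical): label 1 marks the flagged class
docs = ["rude insult rude", "nice weather today", "insult slur", "nice game today"]
labels = [1, 0, 1, 0]
model = train_nb(docs, labels)
prediction = predict_nb(model, "insult today rude")
print(prediction)
```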



## Naive Bayes BERT
This will first use BERT to encode the text as sentence embeddings. The embeddings will then be passed to a Naive Bayes model for training. Finally, the trained model will be evaluated on the test set and its performance added to the results dataframe.

In [28]:
#TODO
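One obstacle for this cell is that `MultinomialNB` rejects negative feature values, while BERT's pooled output passes through a tanh and therefore lies between -1 and 1. A possible workaround, sketched below in plain Python, is to rescale each dimension to the range 0 to 1 using statistics from the training set only. The helper names and the 2-dimensional toy vectors are hypothetical; in practice scikit-learn's `MinMaxScaler` does the same job, and `GaussianNB` is an alternative that accepts negative values directly.

```python
def fit_min_max(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each column to [0, 1] with the training statistics, clipping outliers."""
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            ratio = (value - lo) / span if span else 0.0
            # Test values outside the training range are clipped to stay non-negative
            new_row.append(min(1.0, max(0.0, ratio)))
        scaled.append(new_row)
    return scaled

# Hypothetical 2-dimensional "embeddings"; real pooled outputs have 768 dimensions
train_vectors = [[-1.0, 0.5], [0.0, -0.5], [1.0, 0.0]]
test_vectors = [[0.5, 0.9]]

# Fit on the training set only, then apply the same scaling to the test set
mins, maxs = fit_min_max(train_vectors)
train_scaled = apply_min_max(train_vectors, mins, maxs)
test_scaled = apply_min_max(test_vectors, mins, maxs)
print(train_scaled, test_scaled)
```

Fitting the scaling on the training rows alone keeps the test set out of the training process, as the project rules require.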

## Support Vector Machine TF-IDF
This first uses TF-IDF to encode the text as numerical feature vectors. The encoded data is then passed to a Support Vector Machine with a linear kernel for training. Finally, the trained model is evaluated on the test set and its performance is added to the results dataframe.

In [7]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('SVC', SVC(kernel='linear'))         
])

clf.fit(X_train, y_train)

svm_predictions = clf.predict(X_test)
report = classification_report(y_test, svm_predictions, output_dict=True)
accuracy = accuracy_score(y_test, svm_predictions)
results = add_new_result(results, report, "Support Vector machine With TF-IDF", accuracy) 
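Once a linear SVM is trained, prediction reduces to checking which side of the hyperplane a feature vector falls on. The sketch below shows that decision rule with made-up weights for a 3-term vocabulary; the values of `weights` and `bias` are hypothetical. In scikit-learn, a fitted linear `SVC` exposes the learned values as `coef_` and `intercept_`.

```python
def decision_value(weights, bias, features):
    """Signed score of a feature vector relative to the hyperplane w.x + b = 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify(weights, bias, features):
    # Positive side of the hyperplane -> class 1, otherwise class 0
    return 1 if decision_value(weights, bias, features) > 0 else 0

# Hypothetical learned parameters for a 3-term vocabulary
weights = [1.2, -0.8, 0.1]
bias = -0.3

first = classify(weights, bias, [0.5, 0.0, 0.2])
second = classify(weights, bias, [0.0, 0.6, 0.0])
print(first, second)
```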

## Support Vector Machine BERT
This will first use BERT to encode the text as sentence embeddings. The embeddings will then be passed to a Support Vector Machine model for training. Finally, the trained model will be evaluated on the test set and its performance added to the results dataframe.

In [22]:
#TODO

## K Nearest Neighbors TF-IDF
This first uses TF-IDF to encode the text as numerical feature vectors. The encoded data is then passed to a K Nearest Neighbors model (with 3 neighbors) for training. Finally, the trained model is evaluated on the test set and its performance is added to the results dataframe.

In [29]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('knn', KNeighborsClassifier(n_neighbors=3))         
])

clf.fit(X_train, y_train)

knn_predictions = clf.predict(X_test)
report = classification_report(y_test, knn_predictions, output_dict=True)
accuracy = accuracy_score(y_test, knn_predictions)
results = add_new_result(results, report, "K Nearest Neighbors With TF-IDF", accuracy) 
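The voting step behind K Nearest Neighbors can be sketched in a few lines of plain Python. The example below uses Euclidean distance, which is the default metric of `KNeighborsClassifier`, and k = 3 as in the pipeline above. The function name `knn_predict` and the 2-dimensional toy vectors are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional feature vectors and labels
train_vectors = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
train_labels = [0, 0, 1, 1, 1]

near_ones = knn_predict(train_vectors, train_labels, [1.0, 0.9])
near_zeros = knn_predict(train_vectors, train_labels, [0.05, 0.1])
print(near_ones, near_zeros)
```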

## K Nearest Neighbors BERT
This will first use BERT to encode the text as sentence embeddings. The embeddings will then be passed to a K Nearest Neighbors model for training. Finally, the trained model will be evaluated on the test set and its performance added to the results dataframe.

In [23]:
#TODO

## Results Display 
This displays the results for every model and encoding combination so that their performance can be compared easily.

In [30]:
results

| Feature And Model | Precision | Accuracy | F1-score |
|---|---|---|---|
| Naive Bayes With TF-IDF | 0.792938 | 0.740331 | 0.644122 |
| Support Vector machine With TF-IDF | 0.813795 | 0.81768 | 0.799877 |
| K Nearest Neighbors With TF-IDF | 0.684964 | 0.729282 | 0.672106 |


### Summary

1. **What preprocessing steps do you follow?**  
   We first converted all characters to lowercase and removed emojis using a Python library that detects emoji characters. We then eliminated stop words, filtering out common words that hold very little relevance in deciding whether a message contains sexist content, and removed all punctuation. Lastly, we applied stemming (with the Porter stemmer) so that different forms of a word are reduced to a common stem.

2. **How do you select the features from the inputs?**  
   We began by extracting the messages from the CSV file and preprocessing the text. Since the models cannot operate on raw text, we decided to convert it into numerical values using two different methods: tf-idf and BERT encoding. The tf-idf method weights each term by how often it occurs in a message and how rare it is across all messages, so less frequent words such as "ugly" carry more weight while very common words carry less. BERT encoding, in contrast, produces contextual sentence embeddings from a model pretrained on large amounts of text.

3. **Which model do you use and what is the structure of your model?**  
   We tried various models which included Naive Bayes, Support Vector Machine, and K Nearest Neighbors to ensure we selected the best model for classifying a message as sexist or not.
   - **Naive Bayes:** This uses the probability of each term given each class to make a prediction. For example, terms like “hate” are likely to occur more frequently in sexist messages than in non-sexist messages. The model combines the probabilities of all the terms in a message to estimate how likely the message is to be sexist or not sexist, and then chooses the label with the higher probability.
   - **Support Vector Machines:** Fits a hyperplane that separates the two classes in the training data with the largest possible margin. To classify a new message, it maps the message to a point in the same feature space and uses the side of the hyperplane on which the point falls to decide whether it is sexist or not.
   - **K Nearest Neighbors:** Stores the training messages as vectors. To classify a new message, the model looks at the k closest training vectors to the test vector (we used k = 3) and assigns the label held by the majority of these neighbors.

4. **How do you train your model?**  
   To train each model, we passed our processed training data to a scikit-learn pipeline and called `fit`, which first fits the vectorizer and then trains the classifier on the encoded features. We could then call `predict` on the fitted pipeline to determine whether a message is sexist or not.

5. **What is the performance of your best model?**  
   Overall, our best performance by far came from the Support Vector Machine. With the tf-idf method we achieved an accuracy of around 0.82, whereas with the BERT method we achieved an accuracy of xxx. The tf-idf result is roughly 8 to 9 percentage points higher than the accuracy of our other tf-idf models (0.74 for Naive Bayes and 0.73 for K Nearest Neighbors), making it our best model for predicting whether a message is sexist or not.

6. **What other models or feature engineering methods would you like to implement in the future?**  
   In the future, we would like to implement a neural network to determine whether a message is sexist or not. Neural networks often perform very well on text classification, so this could give us a higher prediction accuracy. We chose not to use this method because of the complexity of understanding how to tune its parameters to improve the model. As our understanding of how these neural networks work continues to grow, we will likely want to implement one.
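The tf-idf weighting described in answer 2 can be sketched in plain Python. The version below follows the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, but it tokenizes on whitespace only, which is a simplification. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return the idf of every term and one L2-normalized tf-idf row per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return idf, rows

# Toy corpus (hypothetical): "the" occurs everywhere, "ugly" only once
docs = ["the cat sat", "the dog sat", "the ugly dog"]
idf, rows = tfidf(docs)
print(idf)
print(rows[2])
```

In the third document the rare term "ugly" receives a larger weight than "dog", which in turn outweighs "the", matching the intuition that less frequent words are more informative.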

**Overview:**  
Through this project, we trained a variety of models to predict whether a message is sexist or not. Before training, we cleaned the messages by converting them to lowercase, removing emojis, stop words, and punctuation, and stemming each word. We then turned the text into numerical values so that the models could be trained on it, using tf-idf and BERT as our two encodings. Once the data was encoded, we used it to train three separate models: Naive Bayes, Support Vector Machine, and K Nearest Neighbors. Of these, the Support Vector Machine gave the best results, with an accuracy of about 0.82 using tf-idf.
