<a href="https://colab.research.google.com/github/iolef/Sarcasm-identification-in-implicit-misogyny/blob/main/3_1_Models_application_to_AMI_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Application of the models to the AMI 2018 dataset**

This notebook applies the [implicit_hate_detection_model](https://colab.research.google.com/drive/1JKPe3_dBIM0slAlNJD9vLX9FXXvfNg1o) and the [humour_detection_model](https://colab.research.google.com/drive/13GXmdFHAvd37D4uyFKUiL_EF40ylXwOi) to the [AMI 2018 dataset](https://amievalita2018.wordpress.com/).

# **1. Setup**

1.1 Installing Transformers

In [None]:
# Transformers installation
! pip install transformers[torch] datasets evaluate
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting transformers[torch]
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers[torch])
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_6

1.2 Imports

In [None]:
import os
import pandas as pd
import re

# **2. Dataset upload**

In [None]:
# Selecting the training and test files from the working directory
train_path = os.path.join(os.getcwd(), "en_training_anon.tsv")
test_path = os.path.join(os.getcwd(), "en_testing_labeled_anon.tsv")

# Creating the dataframes
train_df = pd.read_csv(train_path, delimiter="\t")
test_df = pd.read_csv(test_path, delimiter="\t")

# Merging the two dataframes into one
df = pd.concat([train_df, test_df])
df

Unnamed: 0,id,text,misogynous,misogyny_category,target
0,1,Please tell me why the bitch next to me in the...,1,dominance,active
1,2,<MENTION_1> <MENTION_2> Bitch shut the fuck up,1,dominance,active
2,3,"<MENTION_1> Dear cunt, please shut the fuck up.",1,dominance,active
3,4,RT <MENTION_1> Pls shut the fuck up bitch,1,dominance,active
4,5,"RT <MENTION_1> ""when u gonna get your license""...",1,dominance,active
...,...,...,...,...,...
995,5064,<MENTION_1> You people are hysterical. Dow up ...,0,0,0
996,5065,<MENTION_1> <MENTION_2> you leftist gimps are ...,0,0,0
997,5066,<MENTION_1> <MENTION_2> 10 year old you is hys...,0,0,0
998,5067,<MENTION_1> Tekashi a whole bitch lol fuck tha...,0,0,0
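
By default `pd.concat` keeps each frame's own row labels, so the merged index restarts at 0 where the test rows begin (visible above, where the last rows are labelled 995 to 998 although their `id` values exceed 5000). The following minimal sketch uses toy frames with made-up rows, not AMI data, to show the effect and how `ignore_index=True` would produce a continuous index if one is wanted. The names `toy_train`, `toy_test`, `kept` and `renumbered` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical rows, not AMI data)
toy_train = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})
toy_test = pd.DataFrame({"id": [3], "text": ["c"]})

# Default behaviour: each frame keeps its own row labels
kept = pd.concat([toy_train, toy_test])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([toy_train, toy_test], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Duplicate row labels are harmless for the column-wise operations used in this notebook, but they matter as soon as rows are selected with `df.loc`.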


# **3. Preprocessing**

3.1 Creating a function which includes all the text preprocessing operations.

In [None]:
def clean_text(post):
    # Lowercasing
    post = post.lower()
    # Hashtag removal (the "#" together with the tag that follows it)
    post = re.sub(r'((?<=[\s\W])|^)[#](\w+|[^#]|$)', ' ', post)
    # Leading "rt" (retweet marker) removal
    post = re.sub(r'^rt\s*', '', post)
    # Special characters removal (underscores count as word characters and are kept)
    post = re.sub(r'[^\w]', ' ', post)
    # Anonymised mention placeholders removal (<MENTION_n>, lowercased above)
    post = re.sub(r'mention_[0-9]+', '', post)
    # Stripping
    post = post.strip()
    # Removing the unnecessary whitespaces between words
    post = ' '.join(post.split())
    return post

# Applying the clean_text function to each post
df["text"] = df["text"].apply(clean_text)
df

Unnamed: 0,id,text,misogynous,misogyny_category,target
0,1,please tell me why the bitch next to me in the...,1,dominance,active
1,2,bitch shut the fuck up,1,dominance,active
2,3,dear cunt please shut the fuck up,1,dominance,active
3,4,pls shut the fuck up bitch,1,dominance,active
4,5,when u gonna get your license shut the fuck up...,1,dominance,active
...,...,...,...,...,...
995,5064,you people are hysterical dow up 26 since elec...,0,0,0
996,5065,you leftist gimps are hysterical i d call you ...,0,0,0
997,5066,10 year old you is hysterical,0,0,0
998,5067,tekashi a whole bitch lol fuck that nigga he s...,0,0,0
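
To make the order of the substitutions easier to follow, the sketch below replays the same regular expressions as `clean_text` on a single post and records the string after each step. The helper `trace_clean_text`, the `STEPS` list and the sample post are hypothetical and exist only for illustration.

```python
import re

# Same substitutions as clean_text, in the order they are applied
STEPS = [
    ("hashtags", r'((?<=[\s\W])|^)[#](\w+|[^#]|$)', ' '),
    ("leading rt", r'^rt\s*', ''),
    ("special characters", r'[^\w]', ' '),
    ("mention placeholders", r'mention_[0-9]+', ''),
]

def trace_clean_text(post):
    """Return the cleaned post and the intermediate string after each step."""
    post = post.lower()
    trace = [("lowercase", post)]
    for name, pattern, replacement in STEPS:
        post = re.sub(pattern, replacement, post)
        trace.append((name, post))
    # split() without arguments also strips leading and trailing whitespace
    post = ' '.join(post.split())
    trace.append(("whitespace", post))
    return post, trace

# Made-up sample post in the style of the dataset
sample = 'RT <MENTION_1> "When u gonna get your license" #driving LOL!!'
cleaned, trace = trace_clean_text(sample)
for name, text in trace:
    print(f"{name:>22}: {text!r}")
```

For this sample the final string is `when u gonna get your license lol`: the hashtag, the retweet marker, the punctuation and the mention placeholder are all gone.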


# **4. Application of the models to the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from transformers import pipeline

# Folder on Google Drive that contains the fine-tuned checkpoints
models_dir = "/content/drive/MyDrive/NLP PROJECT"

# Implicit_hate_detection_model
hate_model_path = os.path.join(models_dir, "checkpoint-2700")
implicit_classifier = pipeline(task="sentiment-analysis", model=hate_model_path, tokenizer=hate_model_path)

# Humour_detection_model
# Note: the tokenizer is loaded from checkpoint-525, while the model weights come from checkpoint-840
humour_model_path = os.path.join(models_dir, "checkpoint-840")
humour_tokenizer_path = os.path.join(models_dir, "checkpoint-525")
humour_classifier = pipeline(task="sentiment-analysis", model=humour_model_path, tokenizer=humour_tokenizer_path)

# Collecting the label predicted by each model for every post
classification = {'hate': [], 'humour': []}
for post in df['text']:
    classification['hate'].append(implicit_classifier(post)[0]["label"])
    classification['humour'].append(humour_classifier(post)[0]["label"])

# Adding the predictions to the dataframe as new columns
df['hate'] = classification['hate']
df['humour'] = classification['humour']

# Saving the dataframe as a TSV file
df.to_csv('classified_dataset.tsv', sep='\t')
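
When called on a single string, each pipeline returns a list containing one dictionary with a `label` and a `score`, which is why the loop reads `[0]["label"]`. The sketch below illustrates that shape with `stub_classifier`, a hypothetical stand-in for the fine-tuned models, so it runs without the checkpoints. The `LABEL_0`/`LABEL_1` names and the fixed score are assumptions; the real label names depend on each model's configuration.

```python
def stub_classifier(text):
    """Hypothetical stand-in mimicking the pipeline's return shape:
    a list with one {"label", "score"} dictionary for a single input string."""
    label = "LABEL_1" if "lol" in text else "LABEL_0"
    return [{"label": label, "score": 0.5}]

def predicted_label(classifier, post):
    """Extract the label from the first (and only) prediction for a post."""
    return classifier(post)[0]["label"]

# Made-up posts, already in cleaned form
posts = ["first post", "second post lol"]
labels = [predicted_label(stub_classifier, post) for post in posts]
print(labels)
```

The same extraction could also keep the `score` entry, should the confidence of each prediction be needed alongside the label.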
