# Milestone 1: preprocessing
___

In [1]:
import pandas as pd
import re
from collections import Counter

In [2]:
data = pd.read_csv('../data/edos_labelled_individual_annotations.csv')
print("Columns:", list(data.columns))
print("Shape:", data.shape)
data.head()

Columns: ['rewire_id', 'text', 'annotator', 'label_sexist', 'label_category', 'label_vector', 'split']
Shape: (60000, 7)


Unnamed: 0,rewire_id,text,annotator,label_sexist,label_category,label_vector,split
0,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,17,sexist,2. derogation,2.2 aggressive and emotive attacks,train
1,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,2,sexist,2. derogation,2.2 aggressive and emotive attacks,train
2,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,6,not sexist,none,none,train
3,sexism2022_english-1,"What do you guys think about female ""incels""? ...",17,not sexist,none,none,train
4,sexism2022_english-1,"What do you guys think about female ""incels""? ...",15,not sexist,none,none,train


The dataset also provides finer-grained sexism annotations (`label_category`, `label_vector`), but we work only with the binary `label_sexist` label.

In [3]:
data = data.drop(columns=['label_category', 'label_vector'])
data.head()

Unnamed: 0,rewire_id,text,annotator,label_sexist,split
0,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,17,sexist,train
1,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,2,sexist,train
2,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,6,not sexist,train
3,sexism2022_english-1,"What do you guys think about female ""incels""? ...",17,not sexist,train
4,sexism2022_english-1,"What do you guys think about female ""incels""? ...",15,not sexist,train


### Exploratory analysis

In [4]:
print(f"There are: {len(data['annotator'].unique())} different annotators.")
print("Annotator IDs:", sorted(data['annotator'].unique()))

There are: 19 different annotators.
Annotator IDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


Each of the 20000 unique comments was annotated by 3 different annotators. In 4444 cases the annotators disagreed, so the label rests on a 2/3 majority rather than full 3/3 agreement.

In [5]:
print(f"There are: {len(data['rewire_id'].unique())} different comments annotated in total.")
print("Minimum number of annotations for a comment:", data['rewire_id'].value_counts().min())
print("Maximum number of annotations for a comment:", data['rewire_id'].value_counts().max())

There are: 20000 different comments annotated in total.
Minimum number of annotations for a comment: 3
Maximum number of annotations for a comment: 3


In [6]:
unique_label_counts = data.groupby('rewire_id')['label_sexist'].nunique() # 1 (3/3 agreement) or 2 (2/3 agreement)
agreement_2_3_count = (unique_label_counts != 1).sum() # number of comments where annotators have 2/3 agreement

print("Number of 'rewire_id' entries (comments) with 2/3 agreement among annotators:", agreement_2_3_count)
print(unique_label_counts)

Number of 'rewire_id' entries (comments) with 2/3 agreement among annotators: 4444
rewire_id
sexism2022_english-0       2
sexism2022_english-1       1
sexism2022_english-10      2
sexism2022_english-100     1
sexism2022_english-1000    1
                          ..
sexism2022_english-9995    1
sexism2022_english-9996    2
sexism2022_english-9997    1
sexism2022_english-9998    2
sexism2022_english-9999    2
Name: label_sexist, Length: 20000, dtype: int64
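
The per-annotator labels can be collapsed into one label per comment by majority vote. The sketch below is only an illustration on toy rows. It assumes a majority-vote rule, which this notebook does not confirm is how the aggregated file was produced, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy per-annotator rows mirroring the (rewire_id, label_sexist) columns above.
rows = [
    ("sexism2022_english-0", "sexist"),
    ("sexism2022_english-0", "sexist"),
    ("sexism2022_english-0", "not sexist"),
    ("sexism2022_english-1", "not sexist"),
    ("sexism2022_english-1", "not sexist"),
    ("sexism2022_english-1", "not sexist"),
]

def majority_vote(rows):
    """Return {rewire_id: most frequent label}; hypothetical helper."""
    votes = defaultdict(Counter)
    for rewire_id, label in rows:
        votes[rewire_id][label] += 1
    # With 3 annotators and 2 possible labels a tie cannot occur.
    return {rid: counts.most_common(1)[0][0] for rid, counts in votes.items()}

aggregated = majority_vote(rows)
print(aggregated)
```

With three annotations per comment and a binary label, a strict majority always exists, so no tie-breaking rule is needed.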


### Aggregated dataset

In [6]:
data_agg = pd.read_csv('../data/edos_labelled_aggregated.csv')
data_agg = data_agg.drop(columns=['label_category', 'label_vector'])
print("Columns:", list(data_agg.columns))
print("Shape:", data_agg.shape)
data_agg

Columns: ['rewire_id', 'text', 'label_sexist', 'split']
Shape: (20000, 4)


Unnamed: 0,rewire_id,text,label_sexist,split
0,sexism2022_english-9609,"In Nigeria, if you rape a woman, the men rape ...",not sexist,dev
1,sexism2022_english-16993,"Then, she's a keeper. 😉",not sexist,train
2,sexism2022_english-13149,This is like the Metallica video where the poo...,not sexist,train
3,sexism2022_english-13021,woman?,not sexist,train
4,sexism2022_english-966,I bet she wished she had a gun,not sexist,dev
...,...,...,...,...
19995,sexism2022_english-5228,girls really get fucked almost every weekend ?,not sexist,train
19996,sexism2022_english-10140,The hatred for moslems are Reasonable and Just...,not sexist,train
19997,sexism2022_english-9726,Now this is a woman who gets it. 👆,not sexist,train
19998,sexism2022_english-13365,“American Idol” finalist [USER] said nothing i...,not sexist,train


Conclusions drawn using regular expressions:

- `[URL]` and `[USER]` are placeholders that the dataset authors substituted for actual URLs and real usernames
- female-related nouns and pronouns are more frequent than male-related ones
- hashtags (`#`) are used often
- profanities are very frequent

In [7]:
def count_patterns(pattern, data):
    # Case-sensitive count of all regex matches over data.text, most frequent first.
    return Counter(match for text in data.text for match in re.findall(pattern, text)).most_common()

In [20]:
# count_patterns(r'\[[A-Z]+\]', data_agg) # catching: [USER], [URL]
# count_patterns(r'\b(she|her|wom[ae]n|female|girl|lady)\b', data_agg) # female related nouns, pronouns etc.
# count_patterns(r'\b(he|him|his|m[ae]n|male|boy|guy|dude)\b', data_agg) # male related nouns, pronouns etc.
# count_patterns(r'#\w+', data_agg) # hashtag
# count_patterns(r'\b(fuck|shit|damn|asshole|bitch|slut)\b', data_agg) # profanities
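
The patterns above are lowercase and matched case-sensitively, so capitalized forms such as `She` are not counted. A minimal sketch of a case-insensitive variant follows. `count_patterns_ci` and the toy texts are hypothetical and not part of the original analysis.

```python
import re
from collections import Counter

def count_patterns_ci(pattern, texts):
    """Case-insensitive variant of count_patterns (hypothetical helper)."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    # Lowercase each match so "She" and "she" share one counter key.
    return Counter(
        match.lower() for text in texts for match in regex.findall(text)
    ).most_common()

toy_texts = ["She said her piece.", "she left", "[USER] He agreed with HER"]
female = count_patterns_ci(r'\b(she|her|wom[ae]n|female|girl|lady)\b', toy_texts)
print(female)
```

On the real data the same call would take `data_agg.text` as the second argument.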

### Text normalization

What are the most common words in our concatenated text?
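
One way to answer this is to tokenize every comment and count the tokens. The sketch below assumes a simple regex tokenizer and strips the `[USER]`/`[URL]` placeholders first. `most_common_words` and the toy texts are hypothetical.

```python
import re
from collections import Counter

def most_common_words(texts, n=10):
    """Count lowercase word tokens across all texts; hypothetical helper."""
    counts = Counter()
    for text in texts:
        # Drop bracketed placeholders such as [USER] and [URL] before tokenizing.
        cleaned = re.sub(r'\[[A-Z]+\]', ' ', text)
        counts.update(re.findall(r"[a-z']+", cleaned.lower()))
    return counts.most_common(n)

toy_texts = ["[USER] I bet she wished she had a gun", "Then, she's a keeper."]
top_words = most_common_words(toy_texts, n=3)
print(top_words)
```

Calling `most_common_words(data_agg.text)` would apply the same count to the aggregated dataset.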

##### Stopword removal

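Gendered pronouns such as `she` and `herself` should be excluded from the stopword list, i.e. kept in the text, since they are likely informative for this task.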
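
A minimal sketch of this idea, assuming a small illustrative stopword list in place of a full one from an NLP library. The names `BASE_STOPWORDS`, `KEEP`, and `remove_stopwords` are hypothetical, and keeping the male pronouns as well is an assumption made here for symmetry.

```python
# Small illustrative stopword list; a real run would load a fuller list
# from an NLP library of choice.
BASE_STOPWORDS = {"a", "an", "the", "is", "was", "i", "to", "and",
                  "she", "her", "herself", "he", "him", "himself"}

# Gendered words to keep in the text (assumed set, extend as needed).
KEEP = {"she", "her", "hers", "herself", "he", "him", "his", "himself"}

STOPWORDS = BASE_STOPWORDS - KEEP

def remove_stopwords(tokens):
    """Drop stopwords but keep gendered pronouns; hypothetical helper."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

filtered = remove_stopwords("She said the award was hers and she is herself".split())
print(filtered)
```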

##### Lemmatization

### Exporting in the CoNLL format
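
CoNLL-style files put one token per line with tab-separated columns and a blank line between units. The sketch below assumes a simplified two-column layout (token index, token) and a comment line carrying the comment id. The actual column set for this milestone is not specified here, real CoNLL variants define more columns (CoNLL-U, for instance, has ten), and `to_conll` is a hypothetical helper.

```python
def to_conll(docs):
    """Serialize tokenized documents as token-per-line text; hypothetical helper.

    docs: list of (doc_id, tokens) pairs.
    """
    lines = []
    for doc_id, tokens in docs:
        lines.append(f"# id = {doc_id}")        # comment line carrying the comment id
        for index, token in enumerate(tokens, start=1):
            lines.append(f"{index}\t{token}")   # assumed columns: token index, token
        lines.append("")                        # blank line ends the document
    return "\n".join(lines)

docs = [("sexism2022_english-13021", ["woman", "?"])]
conll_text = to_conll(docs)
print(conll_text)
```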
