<a href="https://colab.research.google.com/github/karynaur/degree-of-profanity/blob/main/Affinity_Answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Affinity Answers Task:

1. Imagine there is a file full of Twitter tweets by various users and you are provided a set of words that indicates racial slurs. Write a program that can indicate the degree of profanity for each sentence in the file. Write in any programming language (preferably in Python). Make any assumptions, but remember to state them. Please place the code in GitHub with proper documentation and share.

Assumptions made:      

1. Data is provided in text files
2. The data provided is just the tweet content and nothing else
3. The slurs provided appear at least once in the tweet dataset

### Create Data



In [1]:
%%writefile tweets.txt
Hey there! Heres a slur
Bad words are not acceptable
Slurs here, slurs there, slurs everywhere
I am a bully!
Heres a nice sentence
Twitter is sometimes good!
Twitter is honestly the worst
And the list goes on and on with more RACE SLURS

Writing tweets.txt


In [2]:
%%writefile slurs.txt
bad, slur, bully, worst, race

Writing slurs.txt


In [3]:
with open('tweets.txt', 'r') as f:
    sentences = f.readlines()

with open('slurs.txt', 'r') as f:
    # Strip the spaces and trailing newline left over from the comma split
    slurs = [s.strip() for s in f.read().split(',')]

### Cleaning the data

1. Make all text lower case
2. Remove mentions, URLs, the leading retweet marker, and non-alphanumeric characters
3. Remove stopwords
4. Lemmatize words so that inflected forms map to a common root

In [4]:
import re
import nltk
import nltk.corpus
from nltk.stem import WordNetLemmatizer


nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet = True)

lemmatizer = WordNetLemmatizer()
stop = nltk.corpus.stopwords.words('english')


def clean_data(text_list):
    clean_text = []
    for text in text_list:
        # 1. Make all text lower case
        text = text.lower()

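        # 2. Remove mentions, non-alphanumeric characters, URLs, and a leading "rt"
        #    (the mention branch is @[A-Za-z0-9]+; escaping the bracket would match a literal "[")
        text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http\S+", "", text)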

        # 3. Remove stopwords
        text = " ".join([word for word in text.split() if word not in (stop)])
        
        # 4. Lemmatization to group words based on root definition
        text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])

        clean_text.append(text)
    return clean_text



In [5]:
clean_sentence = clean_data(sentences)

table = [[raw,clean] for raw,clean in zip(sentences, clean_sentence)]

from tabulate import tabulate
print(tabulate(table, headers = ['Raw data', 'Clean data']))

Raw data                                          Clean data
------------------------------------------------  -------------------------
Hey there! Heres a slur                           hey here slur
Bad words are not acceptable                      bad word acceptable
Slurs here, slurs there, slurs everywhere         slur slur slur everywhere
I am a bully!                                     bully
Heres a nice sentence                             here nice sentence
Twitter is sometimes good!                        twitter sometimes good
Twitter is honestly the worst                     twitter honestly worst
And the list goes on and on with more RACE SLURS  list go race slur
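
The sample tweets contain no mentions, links, or retweet markers, so the table does not exercise most of the substitution in step 2. The following is a minimal standalone sketch of that step on an invented tweet. The function name `strip_noise` and the sample text are illustrative assumptions, and the mention branch is written as `@[A-Za-z0-9]+` on the assumption that the whole handle should be dropped.

```python
import re

# Same alternation as step 2 of the cleaning function.
# Assumption: the mention branch should remove the whole @handle.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|^rt|http\S+")


def strip_noise(tweet):
    """Lower-case a tweet, remove noise, and collapse leftover whitespace."""
    text = PATTERN.sub("", tweet.lower())
    return " ".join(text.split())


sample = "RT @user123 check https://t.co/abc Slurs!!"  # invented example tweet
print(strip_noise(sample))  # prints: check slurs
```

Stopword removal and lemmatization would then run on the result, exactly as in `clean_data`.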


In [6]:
slurs = clean_data(slurs)
slurs

['bad', 'slur', 'bully', 'worst', 'race']

## Naive Approach
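Iterate through each cleaned tweet and count the words that appear in the provided slur list. The degree of profanity is that count divided by the number of words left after cleaning.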

In [7]:
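for i, sentence in enumerate(clean_sentence):
    original = sentences[i].splitlines()[0]
    words = sentence.split()
    # Count the cleaned words that appear in the slur list
    count = sum(word in slurs for word in words)
    # Guard against tweets that are empty after cleaning (e.g. only stopwords)
    ratio = count / len(words) if words else 0.0
    print(f"Degree of profanity of \"{original}\": {ratio: 0.4f}")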
    

Degree of profanity of "Hey there! Heres a slur":  0.3333
Degree of profanity of "Bad words are not acceptable":  0.3333
Degree of profanity of "Slurs here, slurs there, slurs everywhere":  0.7500
Degree of profanity of "I am a bully!":  1.0000
Degree of profanity of "Heres a nice sentence":  0.0000
Degree of profanity of "Twitter is sometimes good!":  0.0000
Degree of profanity of "Twitter is honestly the worst":  0.3333
Degree of profanity of "And the list goes on and on with more RACE SLURS":  0.5000


## Using Word2Vec

In [8]:
from gensim.models import Word2Vec

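# Train a Word2Vec model on the cleaned tweets; min_count=1 keeps every word
model = Word2Vec([i.split() for i in clean_sentence], min_count=1)
# Note: `wv.vocab` exists in gensim < 4.0; gensim 4 replaced it with `wv.key_to_index`
list(model.wv.vocab)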

['hey',
 'here',
 'slur',
 'bad',
 'word',
 'acceptable',
 'everywhere',
 'bully',
 'nice',
 'sentence',
 'twitter',
 'sometimes',
 'good',
 'honestly',
 'worst',
 'list',
 'go',
 'race']

#### Absolute degree of profanity

In [21]:
import numpy as np
profanities = []

# Minimum similarity a (slur, word) pair needs before it counts towards the score
threshold = 0.5
print(f"Absolute degree of profanity with a threshold profanity of {threshold}\n")
for i, sentence in enumerate(clean_sentence):
    total = 0
    original = sentences[i].splitlines()[0]
    for word in sentence.split():

        # Add the similarity of every (slur, word) pair that clears the threshold
        for slur in slurs:
            score = model.wv.similarity(slur.split()[0], word)
            if score > threshold:
                total += score
    profanities.append(total)
    print(f"Degree of profanity of \"{original}\": {total: 0.4f}")
    

Absolute degree of profanity with a threshold profanity of 0.5

Degree of profanity of "Hey there! Heres a slur":  1.0000
Degree of profanity of "Bad words are not acceptable":  1.0000
Degree of profanity of "Slurs here, slurs there, slurs everywhere":  3.0000
Degree of profanity of "I am a bully!":  1.0000
Degree of profanity of "Heres a nice sentence":  0.0000
Degree of profanity of "Twitter is sometimes good!":  0.0000
Degree of profanity of "Twitter is honestly the worst":  1.0000
Degree of profanity of "And the list goes on and on with more RACE SLURS":  2.0000
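
The totals above are whole numbers, which suggests that only exact matches cleared the threshold. `model.wv.similarity` returns the cosine similarity of two word vectors, and a word compared with itself always scores 1.0. A minimal sketch of that measure follows. The 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


slur_vec = [0.2, -0.1, 0.4]    # hypothetical vector for a slur
other_vec = [-0.3, 0.5, 0.1]   # hypothetical vector for an unrelated word

same = cosine_similarity(slur_vec, slur_vec)    # identical vectors: 1.0 up to rounding
other = cosine_similarity(slur_vec, other_vec)  # negative here, so below the 0.5 threshold
print(f"{same:.4f} {other:.4f}")
```

On a corpus this small the learned vectors carry little semantic signal, so the absolute score is likely to behave like a plain count of slur occurrences. A larger training corpus or pretrained vectors would be needed for near-synonyms to contribute.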


#### Relative degree of profanity

In [24]:
from scipy.special import softmax

# Take the softmax of individual profanity values to get relative profanity in our dataset
profanities = softmax(profanities)

In [37]:
print(f"Relative degree of profanity with a threshold of {threshold} in percentage\n")
for sentence, relative in zip(sentences, profanities):
    print(f"Degree of profanity of \"{sentence.splitlines()[0]}\": {relative*100: 0.2f}%")

Relative degree of profanity with a threshold of 0.5 in percentage

Degree of profanity of "Hey there! Heres a slur":  6.74%
Degree of profanity of "Bad words are not acceptable":  6.74%
Degree of profanity of "Slurs here, slurs there, slurs everywhere":  49.78%
Degree of profanity of "I am a bully!":  6.74%
Degree of profanity of "Heres a nice sentence":  2.48%
Degree of profanity of "Twitter is sometimes good!":  2.48%
Degree of profanity of "Twitter is honestly the worst":  6.74%
Degree of profanity of "And the list goes on and on with more RACE SLURS":  18.31%
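
To make the percentages above easier to check, here is a small standard-library sketch of the softmax, applied to the absolute scores printed earlier. It is a re-implementation for illustration rather than the `scipy.special.softmax` call used in the notebook, and the names `absolute` and `relative` are chosen here.

```python
import math


def softmax(scores):
    """Exponentiate each score and normalise so the results sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Absolute scores copied from the output of the previous step
absolute = [1.0, 1.0, 3.0, 1.0, 0.0, 0.0, 1.0, 2.0]
relative = softmax(absolute)
print([f"{p * 100:.2f}%" for p in relative])
# prints: ['6.74%', '6.74%', '49.78%', '6.74%', '2.48%', '2.48%', '6.74%', '18.31%']
```

Because the softmax never outputs zero, tweets with no slurs still receive 2.48%. The values rank tweets against each other within this dataset and should not be read as a standalone probability of profanity.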
