# Feature Engineering: Profanity

There are several options for detecting profanity. Three prominent ones are

   - [x] [profanity-filter](https://pypi.org/project/profanity-filter/): a sophisticated, word-list-based package with boolean methods like `is_profane`
   - [ ] [profanity-check](https://pypi.org/project/alt-profanity-check/): a model-based approach to profanity detection compatible with spaCy
   - [ ] [better-profanity](https://pypi.org/project/better-profanity/): another list-based approach, though less accurate than profanity-filter
   
profanity-filter is the most appropriate for this project. While a model-based approach is appealing, it is more likely to flag a word or sentence as profane simply because the sentence is structured like an insult. For instance, if a comedian says, "That guy is a dingleberry," profanity-check would flag it as profane, while profanity-filter would not.
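
To make the distinction concrete, here is a minimal sketch of how a word-list lookup behaves. The word list and the `is_profane_wordlist` helper are hypothetical stand-ins, not profanity-filter's actual list or API; the point is only that a list lookup ignores sentence structure, so an insult built from unlisted words is not flagged.

```python
import re

# Hypothetical, deliberately tiny word list; real packages ship far larger ones.
WORD_LIST = {"damn", "hell", "crap"}


def is_profane_wordlist(text, word_list=WORD_LIST):
    """Return True if any token in `text` appears in `word_list`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in word_list for token in tokens)


insult = "That guy is a dingleberry."
swear = "Well, damn."

print(is_profane_wordlist(insult))  # no listed word appears in the sentence
print(is_profane_wordlist(swear))   # "damn" is on the list
```

A model-based classifier scores the whole sentence instead, which is why the insult-shaped sentence can be flagged even though no single word is on a list.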

## import

In [None]:
import pickle
import numpy as np
import pandas as pd
from datetime import date
import json
from tqdm.notebook import tqdm

import re
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
from gensim.corpora.dictionary import Dictionary
from gensim.models import Phrases

from profanityfilter import ProfanityFilter

In [None]:
sw = stopwords.words("english")

In [None]:
with open('../data/metascripts/metascript_df_ws.pickle', 'rb') as file:
    metascripts = pickle.load(file)

## prepare the data

In [None]:
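# metascripts is a DataFrame, so each column is converted to a plain list
descriptions = metascripts['description'].tolist()
scripts = metascripts['transcript'].tolist()
# map each description to its transcript
scripts_dict = dict(zip(descriptions, scripts))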

## profanity detection, frequency, and proportion

In [None]:
pf = ProfanityFilter()
tok_scripts = [regexp_tokenize(transcript, r"\b[a-zA-Z'\w\-\*]+\b") for transcript in scripts]
tok_scripts_lc = [[token.lower() for token in script] for script in tok_scripts]
word_counts = [Counter(script) for script in tok_scripts_lc]

In [None]:
dictionary = Dictionary(tok_scripts_lc)
corpus = [dictionary.doc2bow(script) for script in tok_scripts_lc]

In [None]:
# total occurrences of each word across all scripts
corpus_overall_counts = {}
for bow in tqdm(corpus):
    for token_id, count in bow:
        word = dictionary[token_id]
        corpus_overall_counts[word] = corpus_overall_counts.get(word, 0) + count

In [None]:
profane_dict = {word: pf.is_profane(word) for word in tqdm(corpus_overall_counts)}

In [None]:
with open('../data/profanity_booleans_no_lemma.pickle', 'wb') as file:
    pickle.dump(profane_dict, file)
    
with open('../data/profanity_booleans_no_lemma.pickle', 'rb') as file:
    profane_dict = pickle.load(file)
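
The dump-then-load pair above works as a cache for the per-word booleans. One way to skip the recomputation on later runs is a load-or-compute helper; the sketch below is an assumption about how that could look, not part of the original notebook. `load_or_compute` and `toy_compute` are hypothetical names, and the demonstration writes to a temporary directory rather than `../data`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result from `path`, or compute and pickle it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute()
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result


# Toy demonstration with a stand-in for the slow per-word check.
calls = []


def toy_compute():
    calls.append(1)  # record each time the "expensive" step actually runs
    return {"damn": True, "show": False}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "profanity_booleans.pickle")
    first = load_or_compute(cache_path, toy_compute)   # computes and writes the file
    second = load_or_compute(cache_path, toy_compute)  # reads the cached file
```

In the notebook, `compute` would be a function wrapping the dictionary comprehension that calls `pf.is_profane` on each vocabulary word.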

In [None]:
# for each script, keep only the counts of words flagged as profane
profanity_counts = {
    description: {
        word: count
        for word, count in script_counts.items()
        if profane_dict[word]
    }
    for script_counts, description in zip(word_counts, descriptions)
}

In [None]:
# sentence count per transcript (assumed definition; computed with NLTK's sent_tokenize)
sent_counts = [len(sent_tokenize(transcript)) for transcript in scripts]

profane_words = [sum(words.values()) for words in profanity_counts.values()]
total_words = [sum(script_word_counts.values()) for script_word_counts in word_counts]
profane_proportion = [profane/total for profane, total in zip(profane_words, total_words)]
profane_per_sent = [profane/sent_count for profane, sent_count in zip(profane_words, sent_counts)]
profane_per_min = [profane/minutes for profane, minutes in zip(profane_words, metascripts['runtimeMins'].values)]
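
As a sanity check on the arithmetic, the same four measures can be worked through by hand for a single toy transcript. Everything in this sketch is made up for illustration: the tokens, the `toy_profane_words` set, the sentence count, and the runtime are hypothetical values, not data from `metascripts`.

```python
from collections import Counter

# Hypothetical toy inputs standing in for one tokenized, lowercased transcript.
tokens = ["well", "damn", "that", "was", "a", "damn", "good", "show"]
toy_profane_words = {"damn"}  # stand-in for the words flagged by the filter
sentence_count = 2            # assumed number of sentences in the transcript
runtime_minutes = 0.5         # assumed runtime in minutes

counts = Counter(tokens)
toy_profane_dict = {word: word in toy_profane_words for word in counts}

# Same steps as above: filter the counts, then normalize three ways.
profane_count = sum(count for word, count in counts.items() if toy_profane_dict[word])
total_count = sum(counts.values())

proportion = profane_count / total_count      # 2 / 8
per_sentence = profane_count / sentence_count # 2 / 2
per_minute = profane_count / runtime_minutes  # 2 / 0.5

print(profane_count, proportion, per_sentence, per_minute)
```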

In [None]:
metascripts['profane count'] = profane_words
metascripts['profane proportion'] = profane_proportion
metascripts['profanity per sentence'] = profane_per_sent
metascripts['profanity per minute'] = profane_per_min

## pickle updated metascripts

In [None]:
with open('../data/metascripts_df_profanity.pickle', 'wb') as file:
    pickle.dump(metascripts, file)

## explore

In [None]:
import plotly.express as px

px.box(metascripts, x='profanity per minute', hover_data=['description', 'profane count'], points='all')