<a href="https://colab.research.google.com/github/josbex/HS-detection_in_social_media_posts/blob/master/Hate_speech_detection_in_social_media_posts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hate Speech detection in social media posts

## Using GPU for training BERT model

Go to: Edit -> Notebook settings -> Hardware accelerator -> (GPU)

In [1]:
import tensorflow as tf
import torch
import pandas as pd
import re
import numpy as np

In [2]:
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')


Found GPU at: /device:GPU:0


In [3]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


## Dataset

The dataset is loaded from Google Drive, so it must already be stored in your drive. If it is, run the cell below and follow the link to get an authorization code.

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


Once the drive is mounted, the dataset can be read from it: just specify the name of the dataset you want to read. In this case the OLID training dataset is loaded.

In [5]:
df = pd.read_csv("/content/gdrive/My Drive/olid-training-v1.0.tsv", sep="\t") 
print(df.head())

      id                                              tweet  ... subtask_b subtask_c
0  86426  @USER She should ask a few native Americans wh...  ...       UNT       NaN
1  90194  @USER @USER Go home you’re drunk!!! @USER #MAG...  ...       TIN       IND
2  16820  Amazon is investigating Chinese employees who ...  ...       NaN       NaN
3  62688  @USER Someone should'veTaken" this piece of sh...  ...       UNT       NaN
4  43605  @USER @USER Obama wanted liberals &amp; illega...  ...       NaN       NaN

[5 rows x 5 columns]


## Data pre-processing 

Some preprocessing steps are needed. First, the tweet-preprocessor library (https://pypi.org/project/tweet-preprocessor/) is used to clean the tweets of URLs, user mentions and emojis. Hashtags are left in place so that they can be parsed with the vocabulary described later.

In [6]:
!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading https://files.pythonhosted.org/packages/17/9d/71bd016a9edcef8860c607e531f30bd09b13103c7951ae73dd2bf174163c/tweet_preprocessor-0.6.0-py3-none-any.whl
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [7]:
import preprocessor as p

In [8]:
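def tokenize_tweets():
  # Replace URLs, mentions and emojis in every tweet with placeholder tokens.
  p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI)
  df['tweet'] = df['tweet'].apply(p.tokenize)

def remove_pattern(input_txt, pattern, replace):
    # Find every match of `pattern`, then substitute each matched string.
    r = re.findall(pattern, input_txt)
    for i in r:
        # Escape the match so regex metacharacters in it are treated literally.
        input_txt = re.sub(re.escape(i), replace, input_txt)
    return input_txt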

## Tokenize tweets

Here we replace user mentions (@), emojis and URLs with placeholder tokens using the tweet preprocessor.

In [9]:
tokenize_tweets()
print(df.tweet[1]) 

$MENTION$ $MENTION$ Go home youre drunk!!! $MENTION$ #MAGA #Trump2020 $EMOJI$$EMOJI$ URL
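To make the placeholder idea concrete, here is a minimal standard-library sketch of the same kind of substitution. It is not how tweet-preprocessor is implemented: the two regular expressions and the `replace_tokens` helper are assumptions made up for illustration, and emoji handling is left out entirely.

```python
import re

# Hypothetical stand-in patterns; the real library applies its own, more thorough rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def replace_tokens(text):
    """Replace URLs and @-mentions with placeholder tokens."""
    text = URL_RE.sub("$URL$", text)
    text = MENTION_RE.sub("$MENTION$", text)
    return text

sample = "@alice look at https://example.com #news"
print(replace_tokens(sample))  # $MENTION$ look at $URL$ #news
```

Hashtags pass through unchanged, which matches the output above, where `#MAGA` and `#Trump2020` survive tokenization.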


## Vocab

To parse hashtags efficiently, a large vocabulary is needed. For this method specifically, the vocabulary must be a list sorted by word length, shortest to longest. It will later be saved to a CSV file to make it easier to add new words.
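A rough sketch of how the length-ordered list and the planned CSV storage could fit together is shown here. The helper names (`sort_by_length`, `write_vocab`, `read_vocab`), the one-word-per-row layout and the alphabetical tie-break within a length are all assumptions, not part of the notebook; an in-memory buffer stands in for a real file.

```python
import csv
import io

def sort_by_length(vocab):
    """Return unique words sorted shortest to longest (alphabetical within a length)."""
    return sorted(set(vocab), key=lambda w: (len(w), w.lower(), w))

def write_vocab(vocab, handle):
    """Write one word per CSV row, in length order."""
    writer = csv.writer(handle)
    for word in sort_by_length(vocab):
        writer.writerow([word])

def read_vocab(handle):
    """Read the words back in the order they were stored."""
    return [row[0] for row in csv.reader(handle) if row]

buffer = io.StringIO()  # stands in for an open CSV file
write_vocab(["human", "a", "rights", "at", "a"], buffer)
buffer.seek(0)
vocab = read_vocab(buffer)
print(vocab)  # ['a', 'at', 'human', 'rights']
```

Sorting at write time means new words can be appended to the source list in any order and the stored file still satisfies the ordering requirement.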

For now, a list of the 3000 most common English words has been combined with a list of 1300 different slurs and curse words. A better vocabulary list can probably be added later, since this one does not cover word variations: for example, it can split #humanright but not #humanrights.
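The limitation with word variations can be reproduced with a small vocabulary-based splitter. This is only a sketch under stated assumptions: `split_hashtag` is a hypothetical helper using a standard dynamic-programming word break, which is not necessarily the method this notebook uses, and the four-word vocabulary is invented for the example.

```python
def split_hashtag(tag, vocab):
    """Split a hashtag into vocabulary words, or return None if no split exists."""
    text = tag.lstrip("#").lower()
    known = {w.lower() for w in vocab}
    max_len = max((len(w) for w in known), default=0)
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Smallest start first, i.e. the longest candidate word ending at `end`.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in known:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]

toy_vocab = ["a", "at", "human", "right"]  # hypothetical mini vocabulary
print(split_hashtag("#humanright", toy_vocab))   # ['human', 'right']
print(split_hashtag("#humanrights", toy_vocab))  # None
```

Because "rights" is missing from the toy vocabulary, the trailing "s" cannot be matched and the whole hashtag fails to split, which is the behavior described above. Adding inflected forms to the vocabulary would be one way to address it.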

In [10]:
words = ['a', 'I', 'ad', 'ah', 'AM', 'as', 'at', 'be', 'by', 'do', 'go', 'he', 'hi', 'ie', 'if', 'in', 'it', 'me', 'Mr', 'Ms', 'my', 'no', 'of', 'oh', 'ok', 'on', 'or', 'PC', 'PM', 'so', 'to', 'TV', 'up', 'us', 'vs', 'we', 'bi', 'fu', 'ho', 'ky', 'uk', 'wn', 'act', 'add', 'age', 'ago', 'aid', 'aim', 'air', 'all', 'and', 'any', 'arm', 'art', 'ask', 'bad', 'bag', 'ban', 'bar', 'bed', 'bet', 'big', 'bit', 'box', 'boy', 'bus', 'but', 'buy', 'can', 'cap', 'car', 'cat', 'CEO', 'cop', 'cow', 'cry', 'cup', 'cut', 'dad', 'day', 'die', 'dig', 'DNA', 'dog', 'dry', 'due', 'ear', 'eat', 'egg', 'end', 'era', 'etc', 'eye', 'fan', 'far', 'fat', 'fee', 'few', 'fit', 'fix', 'fly', 'for', 'fun', 'gap', 'gas', 'gay', 'get', 'God', 'gun', 'guy', 'hat', 'her', 'hey', 'him', 'hip', 'his', 'hit', 'hot', 'how', 'ice', 'ill', 'its', 'jet', 'Jew', 'job', 'joy', 'key', 'kid', 'lab', 'lap', 'law', 'lay', 'leg', 'let', 'lie', 'lip', 'lot', 'low', 'mad', 'man', 'map', 'may', 'mix', 'mom', 'Mrs', 'net', 'new', 'nod', 'nor', 'not', 'now', "n't", 'nut', 'odd', 'off', 'oil', 'old', 'one', 'our', 'out', 'owe', 'own', 'pan', 'pay', 'per', 'pet', 'pie', 'pop', 'pot', 'put', 'raw', 'red', 'rid', 'row', 'rub', 'run', 'sad', 'say', 'sea', 'see', 'set', 'sex', 'she', 'sin', 'sir', 'sit', 'six', 'ski', 'sky', 'son', 'sue', 'sun', 'tap', 'tax', 'tea', 'ten', 'the', 'tie', 'tip', 
'toe', 'too', 'top', 'toy', 'try', 'two', 'use', 'via', 'war', 'way', 'wet', 'who', 'why', 'win', 'yes', 'yet', 'you', 'abo', 'ass', 'bra', 'cum', 'die', 'dix', 'ero', 'evl', 'fag', 'fat', 'fok', 'fuc', 'fuk', 'gay', 'gin', 'gob', 'god', 'goy', 'gun', 'gyp', 'hiv', 'jap', 'jew', 'kid', 'kkk', 'kum', 'lez', 'lsd', 'mad', 'nig', 'nip', 'pee', 'pom', 'poo', 'pot', 'pud', 'sex', 'sob', 'sos', 'tit', 'tnt', 'uck', 'wab', 'wog', 'wop', 'wtf', 'xtc', 'xxx', 'able', 'acid', 'aide', 'AIDS', 'ally', 'also', 'Arab', 'area', 'army', 'auto', 'away', 'baby', 'back', 'bake', 'ball', 'band', 'bank', 'base', 'bean', 'bear', 'beat', 'beer', 'bell', 'belt', 'bend', 'best', 'bike', 'bill', 'bind', 'bird', 'bite', 'blow', 'blue', 'boat', 'body', 'bomb', 'bond', 'bone', 'book', 'boom', 'boot', 'born', 'boss', 'both', 'bowl', 'buck', 'burn', 'bury', 'busy', 'cake', 'call', 'camp', 'card', 'care', 'case', 'cash', 'cast', 'cell', 'chef', 'chip', 'cite', 'city', 'club', 'clue', 'coal', 'coat', 'code', 'cold', 'come', 'cook', 'cool', 'cope', 'copy', 'core', 'corn', 'cost', 'crew', 'crop', 'dare', 'dark', 'data', 'date', 'dead', 'deal', 'dear', 'debt', 'deck', 'deep', 'deer', 'deny', 'desk', 'diet', 'dirt', 'dish', 'door', 'down', 'drag', 'draw', 'drop', 'drug', 'dust', 'duty', 'each', 'earn', 'ease', 'east', 'easy', 'edge', 'else', 'even', 'ever', 'face', 'fact', 'fade', 'fail', 'fair', 'fall', 'farm', 'fast', 'fate', 'fear', 'feed', 'feel', 'file', 'fill', 'film', 'find', 'fine', 'fire', 'firm', 'fish', 'five', 'flag', 'flat', 'flee', 'flow', 'folk', 'food', 'foot', 'form', 
'four', 'free', 'from', 'fuel', 'full', 'fund', 'gain', 'game', 'gang', 'gate', 'gaze', 'gear', 'gene', 'gift', 'girl', 'give', 'glad', 'goal', 'gold', 'golf', 'good', 'grab', 'gray', 'grow', 'hair', 'half', 'hall', 'hand', 'hang', 'hard', 'hate', 'have', 'head', 'hear', 'heat', 'heel', 'hell', 'help', 'here', 'hero', 'hide', 'high', 'hill', 'hire', 'hold', 'hole', 'holy', 'home', 'hope', 'host', 'hour', 'huge', 'hurt', 'idea', 'into', 'iron', 'item', 'jail', 'join', 'joke', 'jump', 'jury', 'just', 'keep', 'kick', 'kill', 'kind', 'king', 'kiss', 'knee', 'know', 'lack', 'lady', 'lake', 'land', 'last', 'late', 'lawn', 'lead', 'leaf', 'lean', 'left', 'less', 'life', 'lift', 'like', 'line', 'link', 'list', 'live', 'load', 'loan', 'lock', 'long', 'look', 'lose', 'loss', 'lost', 'lots', 'loud', 'love', 'luck', 'lung', 'mail', 'main', 'make', 'male', 'mall', 'many', 'mark', 'mask', 'mass', 'math', 'meal', 'mean', 'meat', 'meet', 'menu', 'mere', 'mess', 'milk', 'mind', 'mine', 'miss', 'mode', 'mood', 'moon', 'more', 'most', 'move', 'much', 'must', 'myth', 'name', 'near', 'neck', 'need', 'news', 'next', 'nice', 'nine', 'none', 'nose', 'note', 'odds', 'okay', 'once', 'only', 'onto', 'open', 'oven', 'over', 'pace', 'pack', 'page', 'pain', 'pair', 'pale', 'palm', 'pant', 'park', 'part', 'pass', 'past', 'path', 'peak', 'peer', 'pick', 'pile', 'pine', 'pink', 'pipe', 'plan', 'play', 'plot', 'plus', 'poem', 'poet', 'pole', 'poll', 'pool', 'poor', 'port', 'pose', 'post', 'pour', 'pray', 'pull', 'pure', 'push', 'quit', 'race', 'rail', 'rain', 'rank', 'rare', 'rate', 'read', 'real', 'rely', 'rest', 'rice', 'rich', 'ride', 'ring', 'rise', 'risk', 'road', 'rock', 'role', 'roll', 'roof', 'room', 'root', 'rope', 'rose', 'rule', 'rush', 'safe', 'sake', 'sale', 'salt', 'same', 'sand', 'save', 
'seat', 'seed', 'seek', 'seem', 'self', 'sell', 'send', 'ship', 'shit', 'shoe', 'shop', 'shot', 'show', 'shut', 'sick', 'side', 'sigh', 'sign', 'sing', 'sink', 'site', 'size', 'skin', 'slip', 'slow', 'snap', 'snow', 'soft', 'soil', 'some', 'song', 'soon', 'sort', 'soul', 'soup', 'spin', 'spot', 'star', 'stay', 'step', 'stir', 'stop', 'such', 'suit', 'sure', 'swim', 'tail', 'take', 'tale', 'talk', 'tall', 'tank', 'tape', 'task', 'team', 'tear', 'teen', 'tell', 'tend', 'tent', 'term', 'test', 'text', 'than', 'that', 'them', 'then', 'they', 'thin', 'this', 'thus', 'time', 'tiny', 'tire', 'tone', 'tool', 'toss', 'tour', 'town', 'tree', 'trip', 'true', 'tube', 'turn', 'twin', 'type', 'ugly', 'unit', 'upon', 'urge', 'used', 'user', 'vary', 'vast', 'very', 'view', 'vote', 'wage', 'wait', 'wake', 'walk', 'wall', 'want', 'warm', 'warn', 'wash', 'wave', 'weak', 'wear', 'week', 'well', 'west', 'what', 'when', 'whom', 'wide', 'wife', 'wild', 'will', 'wind', 'wine', 'wing', 'wipe', 'wire', 'wise', 'wish', 'with', 'wood', 'word', 'work', 'wrap', 'yard', 'yeah', 'year', 'yell', 'your', 'zone', 'abbo', 'anal', 'anus', 'arab', 'arse', 'babe', 'barf', 'bast', 'blow', 'bomb', 'bomd', 'bong', 'boob', 'boom', 'burn', 'butt', 'chav', 'chin', 'cigs', 'clit', 'cock', 'coon', 'crap', 'cumm', 'cunn', 'cunt', 'dago', 'damn', 'dead', 'dego', 'deth', 'dick', 'died', 'dies', 'dike', 'dink', 'dive', 'dong', 'doom', 'dope', 'drug', 'dumb', 'dyke', 'fart', 'fear', 'fire', 'floo', 'fore', 'fuck', 'fuks', 'geez', 'geni', 'gipp', 'gook', 'groe', 'gypo', 'gypp', 'hapa', 'hebe', 'heeb', 'hell', 'hobo', 'hoes', 'hole', 'homo', 'honk', 'hook', 'hore', 'hork', 'horn', 'ikey', 'itch', 'jade', 'jeez', 'jiga', 'jigg', 'jism', 'jiz', 'jizz', 'jugs', 'kike', 'kill', 'kink', 'kock', 'koon', 'krap', 'kums', 'kunt', 'kyke', 'laid', 'lezz', 'lies', 'limy', 'mams', 'meth', 'milf', 'mofo', 'moky', 'muff', 'munt', 'nazi', 'nigg', 'nigr', 'nook', 'nude', 'nuke', 'oral', 'orga', 'orgy', 'paki', 'payo', 'peck', 'perv', 
'phuk', 'phuq', 'piky', 'pimp', 'piss', 'pixy', 'pohm', 'poon', 'poop', 'porn', 'pric', 'pros', 'pube', 'pudd', 'puke', 'puss', 'pusy', 'quim', 'rape', 'rere', 'rump', 'scag', 'scat', 'scum', 'sexy', 'shag', 'shat', 'shav', 'shit', 'sick', 'skum', 'slav', 'slut', 'smut', 'snot', 'spic', 'spig', 'spik', 'spit', 'suck', 'taff', 'tang', 'tard', 'teat', 'tits', 'turd', 'twat', 'vibr', 'wank', 'wetb', 'whit', 'whiz', 'whop', 'wuss', 'about', 'above', 'abuse', 'actor', 'adapt', 'admit', 'adopt', 'adult', 'after', 'again', 'agent', 'agree', 'ahead', 'album', 'alive', 'allow', 'alone', 'along', 'alter', 'among', 'anger', 'angle', 'angry', 'apart', 'apple', 'apply', 'argue', 'arise', 'armed', 'Asian', 'aside', 'asset', 'avoid', 'award', 'aware', 'awful', 'badly', 'basic', 'basis', 'beach', 'begin', 'being', 'below', 'bench', 'Bible', 'birth', 'black', 'blade', 'blame', 'blind', 'block', 'blood', 'board', 'brain', 'brand', 'bread', 'break', 'brick', 'brief', 'bring', 'broad', 'brown', 'brush', 'build', 'bunch', 'buyer', 'cabin', 'cable', 'carry', 'catch', 'cause', 'chain', 'chair', 'chart', 'chase', 'cheap', 'check', 'cheek', 'chest', 'chief', 'child', 'civil', 'claim', 'class', 'clean', 'clear', 'climb', 'clock', 'close', 'cloud', 'coach', 'coast', 'color', 'couch', 'could', 'count', 'court', 'cover', 'crack', 'craft', 'crash', 'crazy', 'cream', 'crime', 'cross', 'crowd', 'cycle', 'daily', 'dance', 'death', 'delay', 'depth', 'dirty', 'doubt', 'dozen', 'draft', 'drama', 'dream', 'dress', 'drink', 'drive', 'eager', 'early', 'earth', 'eight', 'elect', 'elite', 'empty', 'enemy', 'enjoy', 'enter', 'entry', 'equal', 'error', 'essay', 'event', 'every', 'exact', 'exist', 'extra', 'faith', 'false', 'fault', 'favor', 'fence', 'fewer', 'fiber', 'field', 'fifth', 'fifty', 'fight', 'final', 'first', 'flame', 'flesh', 'float', 'floor', 'focus', 'force', 'forth', 'found', 'frame', 'fresh', 'front', 'fruit', 'fully', 'funny', 'ghost', 'giant', 'given', 'glass', 'glove', 'grade', 'grain', 
'grand', 'grant', 'grass', 'grave', 'great', 'green', 'group', 'guard', 'guess', 'guest', 'guide', 'habit', 'happy', 'heart', 'heavy', 'hello', 'honey', 'honor', 'horse', 'hotel', 'house', 'human', 'humor', 'ideal', 'image', 'imply', 'index', 'inner', 'Iraqi', 'Irish', 'issue', 'joint', 'judge', 'juice', 'knife', 'knock', 'label', 'labor', 'large', 'later', 'Latin', 'laugh', 'layer', 'learn', 'least', 'leave', 'legal', 'lemon', 'level', 'light', 'limit', 'local', 'loose', 'lover', 'lower', 'lucky', 'lunch', 'major', 'maker', 'marry', 'match', 'maybe', 'mayor', 'media', 'metal', 'meter', 'might', 'minor', 'model', 'money', 'month', 'moral', 'motor', 'mount', 'mouse', 'mouth', 'movie', 'music', 'naked', 'nerve', 'never', 'newly', 'night', 'noise', 'north', 'novel', 'nurse', 'occur', 'ocean', 'offer', 'often', 'onion', 'order', 'other', 'ought', 'owner', 'paint', 'panel', 'paper', 'party', 'patch', 'pause', 'peace', 'phase', 'phone', 'photo', 'piano', 'piece', 'pilot', 'pitch', 'place', 'plane', 'plant', 'plate', 'point', 'porch', 'pound', 'power', 'press', 'price', 'pride', 'prime', 'print', 'prior', 'proof', 'proud', 'prove', 'quick', 'quiet', 'quite', 'quote', 'radio', 'raise', 'range', 'rapid', 'ratio', 'reach', 'react', 'ready', 'refer', 'relax', 'reply', 'rifle', 'right', 'river', 'rough', 'round', 'route', 'rural', 'salad', 'sales', 'sauce', 'scale', 'scene', 'scope', 'score', 'seize', 'sense', 'serve', 'seven', 'shade', 'shake', 'shall', 'shape', 'share', 'sharp', 'sheet', 'shelf', 'shell', 'shift', 'shine', 'shirt', 'shock', 'shoot', 'shore', 'short', 'shout', 'shrug', 'sight', 'since', 'skill', 'slave', 'sleep', 'slice', 'slide', 'small', 'smart', 'smell', 'smile', 'smoke', 'solar', 'solid', 'solve', 'sorry', 'sound', 'south', 'space', 'speak', 'speed', 'spend', 'split', 'sport', 'staff', 'stage', 'stair', 'stake', 'stand', 'stare', 'start', 'state', 'steal', 'steel', 'stick', 'still', 'stock', 'stone', 'store', 'storm', 'story', 'strip', 'study', 'stuff', 
'style', 'sugar', 'super', 'swear', 'sweep', 'sweet', 'swing', 'table', 'taste', 'teach', 'terms', 'thank', 'their', 'theme', 'there', 'these', 'thick', 'thing', 'think', 'third', 'those', 'three', 'throw', 'tight', 'tired', 'title', 'today', 'tooth', 'topic', 'total', 'touch', 'tough', 'tower', 'trace', 'track', 'trade', 'trail', 'train', 'treat', 'trend', 'trial', 'tribe', 'trick', 'troop', 'truck', 'truly', 'trust', 'truth', 'twice', 'uncle', 'under', 'union', 'until', 'upper', 'urban', 'usual', 'value', 'video', 'virus', 'visit', 'vital', 'voice', 'voter', 'waste', 'watch', 'water', 'weigh', 'wheel', 'where', 'which', 'while', 'white', 'whole', 'whose', 'woman', 'works', 'world', 'worry', 'worth', 'would', 'wound', 'write', 'wrong', 'yield', 'young', 'yours', 'youth', 'abuse', 'adult', 'allah', 'angie', 'angry', 'arabs', 'argie', 'asian', 'asses', 'balls', 'beast', 'bible', 'bitch', 'black', 'blind', 'boang', 'bogan', 'bombs', 'boner', 'boobs', 'booby', 'boody', 'boong', 'booty', 'bunga', 'chink', 'choad', 'chode', 'cocky', 'cohee', 'color', 'cooly', 'cra5h', 'crabs', 'crack', 'crash', 'crime', 'darky', 'death', 'demon', 'devil', 'dildo', 'dirty', 'drunk', 'eatme', 'enema', 'enemy', 'erect', 'fagot', 'fairy', 'faith', 'fatah', 'fatso', 'feces', 'felch', 'fight', 'forni', 'fraud', 'fubar', 'fucck', 'fucka', 'fucks', 'fugly', 'fuuck', 'ginzo', 'girls', 'goyim', 'gross', 'gubba', 'gyppo', 'gyppy', 'hamas', 'harem', 'honky', 'horny', 'hoser', 'husky', 'hussy', 'hymen', 'hymie', 'idiot', 'jebus', 'jesus', 'jigga', 'jiggy', 'jihad', 'jizim', 'jizm', 'joint', 'kafir', 'kills', 'kinky', 'knife', 'kotex', 'kraut', 'latin', 'lesbo', 
'lezbe', 'lezbo', 'lezzo', 'limey', 'loser', 'lugan', 'lynch', 'mafia', 'mgger', 'mggor', 'mocky', 'moles', 'moron', 'naked', 'nasty', 'necro', 'negro', 'niger', 'nigga', 'nigra', 'nigre', 'nymph', 'osama', 'pansy', 'panti', 
'pendy', 'penis', 'piker', 'pikey', 'pixie', 'pocha', 'pocho', 'pommy', 'porno', 'prick', 'pubic', 'pussy', 'queef', 'queer', 'rabbi', 'randy', 'raped', 'raper', 'roach', 'sadis', 'sadom', 'sandm', 'satan', 'screw', 'semen', 
'seppo', 'sexed', 'shhit', 'shite', 'shits', 'shoot', 'sissy', 'skank', 'slant', 'slave', 'slime', 'slopy', 'sluts', 'slutt', 'smack', 'sodom', 'sooty', 'spank', 'sperm', 'spick', 'spunk', 'squaw', 'stagg', 'taboo', 'teste', 
'titty', 'tramp', 'trots', 'twink', 'urine', 'usama', 'vomit', 'vulva', 'whash', 'whore', 'willy', 'abroad', 'absorb', 'accept', 'access', 'accuse', 'across', 'action', 'active', 'actual', 'adjust', 'admire', 'advice', 'advise', 'affair', 'affect', 'afford', 'afraid', 'agency', 'agenda', 'almost', 'always', 'amount', 'animal', 'annual', 'answer', 'anyone', 'anyway', 'appeal', 'appear', 'around', 'arrest', 'arrive', 'artist', 'asleep', 'aspect', 'assert', 'assess', 'assign', 'assist', 'assume', 'assure', 'attach', 'attack', 'attend', 'author', 'barely', 'barrel', 'basket', 'battle', 'beauty', 'become', 'before', 'behind', 'belief', 'belong', 'beside', 'better', 'beyond', 'border', 'borrow', 'bother', 'bottle', 'bottom', 'branch', 'breast', 'breath', 'bridge', 'bright', 'broken', 'budget', 'bullet', 'burden', 'butter', 'button', 'camera', 'campus', 'cancer', 'carbon', 'career', 'center', 'chance', 'change', 'charge', 'cheese', 'choice', 'choose', 'church', 'circle', 'client', 'clinic', 'closer', 'coffee', 'column', 'comedy', 'commit', 'common', 'cookie', 'corner', 'cotton', 'county', 'couple', 'course', 'cousin', 'create', 'credit', 'crisis', 'critic', 'custom', 'damage', 'danger', 'dealer', 'debate', 'decade', 'decide', 'deeply', 'defeat', 'defend', 'define', 'degree', 'demand', 'depend', 'depict', 'deputy', 'derive', 'desert', 'design', 'desire', 'detail', 'detect', 'device', 'devote', 'differ', 'dining', 'dinner', 'direct', 'divide', 'doctor', 'double', 'driver', 'during', 'easily', 'editor', 'effect', 'effort', 'either', 'e-mail', 'emerge', 'employ', 'enable', 'energy', 'engage', 'engine', 'enough', 'ensure', 'entire', 'escape', 'estate', 'ethics', 'ethnic', 'evolve', 'exceed', 'except', 'expand', 'expect', 'expert', 'expose', 'extend', 'extent', 'fabric', 'factor', 'fairly', 'family', 'famous', 'farmer', 'father', 'fellow', 'female', 'figure', 'finger', 'finish', 'flavor', 'flight', 'flower', 'follow', 'forest', 'forget', 'formal', 'former', 'fourth', 'freeze', 
'French', 'friend', 'future', 'galaxy', 'garage', 'garden', 'garlic', 'gather', 'gender', 'gently', 'German', 'gifted', 'glance', 'global', 'golden', 'ground', 'growth', 'guilty', 'handle', 'happen', 'hardly', 'health', 'heaven', 'height', 'highly', 'honest', 'horror', 'hungry', 'hunter', 'ignore', 'impact', 'impose', 'income', 'indeed', 'Indian', 'infant', 'inform', 'injury', 'inside', 'insist', 'intend', 'invest', 'invite', 'island', 'itself', 'jacket', 'Jewish', 'junior', 'killer', 'latter', 'launch', 'lawyer', 'leader', 'league', 'legacy', 'legend', 'length', 'lesson', 'letter', 'likely', 'listen', 'little', 'living', 'locate', 'lovely', 'mainly', 'makeup', 'manage', 'manner', 'margin', 'market', 'master', 'matter', 'medium', 'member', 'memory', 'mental', 'merely', 'method', 'middle', 'minute', 'mirror', 'mm-hmm', 'modern', 'modest', 'moment', 'mostly', 'mother', 'motion', 'murder', 'muscle', 'museum', 'Muslim', 'mutual', 'myself', 'narrow', 'nation', 'native', 'nature', 'nearby', 'nearly', 'nobody', 'normal', 'notice', 'notion', 'number', 'object', 'obtain', 'occupy', 'office', 'online', 'oppose', 'option', 'orange', 'origin', 'others', 'parent', 'partly', 'people', 'pepper', 'period', 'permit', 'person', 'phrase', 'planet', 'player', 'please', 'plenty', 'pocket', 'poetry', 'police', 'policy', 'potato', 'powder', 'prayer', 'prefer', 'pretty', 'priest', 'prison', 'profit', 'prompt', 'proper', 'public', 'pursue', 'racial', 'rarely', 'rather', 'rating', 'reader', 'really', 'reason', 'recall', 'recent', 'recipe', 'record', 'reduce', 'reform', 'refuse', 'regard', 'regime', 'region', 'reject', 'relate', 'relief', 'remain', 'remind', 'remote', 'remove', 'repeat', 'report', 'resist', 'resort', 'result', 'retain', 'retire', 'return', 'reveal', 'review', 'rhythm', 'sacred', 'safety', 'salary', 'sample', 'saving', 'scared', 'scheme', 'school', 'scream', 'screen', 'script', 'search', 'season', 'second', 'secret', 'sector', 'secure', 'select', 'Senate', 'senior', 'series', 
'settle', 'severe', 'sexual', 'shadow', 'should', 'shower', 'signal', 'silent', 'silver', 'simple', 'simply', 'singer', 'single', 'sister', 'slight', 'slowly', 'smooth', 'soccer', 'social', 'source', 'Soviet', 'speech', 'spirit', 'spread', 'spring', 'square', 'stable', 'status', 'steady', 'stream', 'street', 'stress', 'strike', 'string', 'stroke', 'strong', 'studio', 'stupid', 'submit', 'sudden', 'suffer', 'summer', 'summit', 'supply', 'surely', 'survey', 'switch', 'symbol', 'system', 'tactic', 'talent', 'target', 'tennis', 'terror', 'thanks', 'theory', 'thirty', 'though', 'threat', 'throat', 'ticket', 'tissue', 'tomato', 'tongue', 'toward', 'travel', 'treaty', 'tunnel', 'twelve', 'twenty', 'unable', 'unique', 'United', 'unless', 'unlike', 'useful', 'valley', 'versus', 'vessel', 'victim', 'viewer', 'virtue', 'vision', 'visual', 'volume', 'wander', 'wealth', 'weapon', 'weekly', 'weight', 'widely', 'window', 'winner', 'winter', 'wisdom', 'within', 'wonder', 'wooden', 'worker', 'writer', 'yellow', 'addict', 'africa', 'areola', 'asshat', 'assman', 'attack', 'babies', 'beaner', 'beaver', 'biatch', 'bigass', 'bigger', 'bitchy', 'biteme', 'blacks', 'bohunk', 'boonga', 'boonie', 'breast', 'bugger', 'buried', 'byatch', 'cacker', 'cancer', 'chinky', 'christ', 'church', 'coitus', 'commie', 'condom', 'coolie', 'crappy', 'creamy', 'crimes', 'crotch', 'cummer', 'cunntt', 'dahmer', 'dammit', 'damnit', 'darkie', 'desire', 'diddle', 'doodoo', 'doodoo', 'dyefly', 'escort', 'ethnic', 'faeces', 'faggot', 'failed', 'farty', 'fatass', 'fckcum', 'feltch', 'fetish', 'firing', 'fister', 'flange', 'flydie', 'flydye', 'fondle', 'fucked', 'fucker', 'fuckin', 'fuckit', 'fungus', 'geezer', 'german', 'gringo', 'gummer', 'gyppie', 'harder', 'hardon', 'heroin', 'herpes', 'hijack', 'hindoo', 'hitler', 'hodgie', 'honger', 'honkey', 'hooker', 'horney', 'hummer', 'iblowu', 'incest', 'insest', 'israel', 'jewish', 'jizzim', 'jizzum', 'kaffer', 'kaffir', 'kaffre', 'kanake', 'kigger', 'killed', 'killer', 
'kondum', 'krappy', 'kummer', 'lesbin', 'libido', 'licker', 'lickme', 'liquor', 'lolita', 'looser', 'lotion', 'macaca', 'mockey', 'mockie', 'molest', 'mormon', 'moslem', 'murder', 'muslim', 'negros', 'niggah', 'niggaz', 'nigger', 'niggle', 'niggor', 'niggur', 'niglet', 'nignog', 'nipple', 'nittit', 'nlgger', 'nlggor', 'nookey', 'nookie', 'noonan', 'nooner', 'nudger', 'orgasm', 'orgies', 'pecker', 'penile', 'period', 'phuked', 'pimped', 'pimper', 'pissed', 'pisser', 'pistol', 'polack', 'pommie', 'pooper', 'popimp', 'pudboy', 'pussie', 'racial', 'racist', 'rapist', 'rectum', 'reefer', 'reject', 'retard', 'ribbed', 'rigger', 'rimjob', 'robber', 'russki', 'sexing', 'sexpot', 'sextoy', 'sexual', 'shited', 'skanky', 'slopey', 'slutty', 'snatch', 'sniper', 'sodomy', 'soviet', 'spooge', 'spunky', 'stiffy', 'stroke', 'stupid', 'sucker', 'suckme', 'swalow', 'tampon', 'tantra', 'terror', 'tinkle', 'titjob', 'tittie', 'toilet', 'tongue', 'tortur', 'tosser', 'tranny', 'trojan', 'turnon', 'uterus', 'vagina', 'virgin', 'wanker', 'weapon', 'weenie', 'weewee', 'whites', 'whitey', 'wigger', 'willie', 'womens', 'wuzzie', 'yankee', 'zigabo', 'abandon', 'ability', 'absence', 'account', 'achieve', 'acquire', 'actress', 'address', 'advance', 'adviser', 'African', 'against', 'airline', 'airport', 'alcohol', 'already', 'amazing', 'analyst', 'analyze', 'ancient', 'another', 'anxiety', 'anybody', 'anymore', 'appoint', 'approve', 'arrange', 'arrival', 'article', 'assault', 'athlete', 'attempt', 'attract', 'average', 'balance', 'barrier', 'battery', 'because', 'bedroom', 'believe', 'beneath', 'benefit', 'besides', 'between', 'billion', 'blanket', 'bombing', 'breathe', 'briefly', 'British', 'brother', 'cabinet', 'capable', 'capital', 'captain', 'capture', 'careful', 'carrier', 'ceiling', 'central', 'century', 'certain', 'chamber', 'channel', 'chapter', 'charity', 'chicken', 
'Chinese', 'citizen', 'classic', 'clearly', 'climate', 'closely', 'clothes', 'cluster', 'collect', 'college', 'combine', 'comfort', 'command', 'comment', 'company', 'compare', 'compete', 'complex', 'compose', 'concept', 'concern', 'concert', 'conduct', 'confirm', 'connect', 'consist', 'consume', 'contact', 'contain', 'content', 'contest', 'context', 'control', 'convert', 'cooking', 'correct', 'council', 'counter', 'country', 'courage', 'crucial', 'culture', 'curious', 'current', 'declare', 'decline', 'defense', 'deficit', 'deliver', 'deserve', 'despite', 'destroy', 'develop', 'digital', 'discuss', 'disease', 'dismiss', 'display', 'dispute', 'distant', 'diverse', 'divorce', 'drawing', 'eastern', 'economy', 'edition', 'educate', 'elderly', 'element', 'embrace', 'emotion', 'English', 'enhance', 'episode', 'equally', 'evening', 'exactly', 'examine', 'example', 'exhibit', 'expense', 'explain', 'explode', 'explore', 'express', 'extreme', 'factory', 'faculty', 'failure', 'fantasy', 'fashion', 'feature', 'federal', 'feeling', 'fiction', 'fifteen', 'fighter', 'finally', 'finance', 'finding', 'fishing', 'fitness', 'foreign', 'forever', 'formula', 'fortune', 'forward', 'founder', 'freedom', 'funding', 'funeral', 'gallery', 'general', 'genetic', 'gesture', 'grocery', 'growing', 'habitat', 'handful', 'healthy', 'hearing', 'heavily', 'helpful', 'herself', 'highway', 'himself', 'history', 'holiday', 'horizon', 'housing', 'however', 'hundred', 'hunting', 'husband', 'illegal', 'illness', 'imagine', 'impress', 'improve', 'include', 'initial', 'inquiry', 'insight', 'inspire', 'install', 'instead', 'intense', 'involve', 'Islamic', 'Israeli', 'Italian', 'journal', 'journey', 'justice', 'justify', 'killing', 'kitchen', 'largely', 'lawsuit', 'leading', 'leather', 'liberal', 'library', 'license', 'limited', 'machine', 'manager', 'married', 'massive', 'meaning', 'measure', 'medical', 'meeting', 'mention', 'message', 'Mexican', 'million', 'miracle', 'missile', 'mission', 'mistake', 
'mixture', 'monitor', 'morning', 'musical', 'mystery', 'natural', 'neither', 'nervous', 'network', 'nothing', 'nowhere', 'nuclear', 'observe', 'obvious', 'offense', 'officer', 'Olympic', 'ongoing', 'opening', 'operate', 'opinion', 'organic', 'outcome', 'outside', 'overall', 'package', 'painful', 'painter', 'parking', 'partner', 'passage', 'passion', 'patient', 'pattern', 'payment', 'penalty', 'perfect', 'perform', 'perhaps', 'picture', 'plastic', 'popular', 'portion', 'portray', 
'possess', 'poverty', 'predict', 'prepare', 'present', 'pretend', 'prevent', 'primary', 'privacy', 'private', 'problem', 'proceed', 'process', 'produce', 'product', 'profile', 'program', 'project', 'promise', 'promote', 'propose', 'protect', 'protein', 'protest', 'provide', 'publish', 'purpose', 'qualify', 'quality', 'quarter', 'quickly', 'quietly', 'radical', 'rapidly', 'reading', 'reality', 'realize', 'receive', 'recover', 'recruit', 'reflect', 'refugee', 'regular', 'release', 'replace', 'request', 'require', 'resolve', 'respect', 'respond', 'restore', 'revenue', 'roughly', 'routine', 'running', 'Russian', 'satisfy', 'scandal', 'scholar', 'science', 'section', 'segment', 'senator', 'serious', 'service', 'session', 'setting', 'several', 'shelter', 'shortly', 'silence', 'similar', 'society', 'soldier', 'somehow', 'someone', 'Spanish', 'speaker', 'special', 'species', 'squeeze', 'station', 'stomach', 'storage', 'strange', 'stretch', 'student', 'subject', 'succeed', 'success', 'suggest', 'suicide', 'support', 'suppose', 'Supreme', 'surface', 'surgery', 'survive', 'suspect', 'sustain', 'symptom', 'teacher', 'tension', 'testify', 'testing', 'theater', 'therapy', 'thought', 'through', 'tobacco', 'tonight', 'totally', 'tourist', 'towards', 'traffic', 'tragedy', 'trouble', 'typical', 'undergo', 'uniform', 'unknown', 'unusual', 'usually', 'utility', 'variety', 'various', 'vehicle', 'venture', 'version', 'veteran', 'victory', 'village', 'violate', 'violent', 'visible', 'visitor', 'warning', 'wealthy', 'weather', 'wedding', 'weekend', 'welcome', 'welfare', 'western', 'whereas', 'whether', 'whisper', 'willing', 'without', 'witness', 'working', 'worried', 'writing', 'addicts', 'african', 'amateur', 'analsex', 'aroused', 'assault', 'assfuck', 'asshole', 'asshore', 'asskiss', 'asslick', 'asswipe', 'badfuck', 'banging', 'baptist', 'barface', 'bazooms', 'beatoff', 'bestial', 'bigbutt', 'bitcher', 'bitches', 'bitchez', 'bitchin', 'blowjob', 'bollick', 'bollock', 'bombers', 
'bombing', 'bondage', 'boobies', 'brothel', 'buggery', 'bumfuck', 'buttman', 'carruth', 'chinese', 'clogwog', 'cocaine', 'cocknob', 'colored', 'coondog', 'crapola', 'crapper', 'cumfest', 'cumming', 'cumquat', 'cumshot', 'deposit', 'destroy', 'dickman', 'dickwad', 'dipshit', 'disease', 'drunken', 'dumbass', 'ecstacy', 'execute', 'fagging', 'failure', 'fairies', 'farted', 'fatfuck', 'felcher', 'fisting', 'flasher', 'fuckbag', 'fuckers', 'fuckher', 'fuckina', 'fucking', 'fuckme', 'fuckoff', 'fuckpig', 'fuckyou', 'funeral', 'funfuck', 'gangsta', 'gaysex', 'genital', 'getiton', 'goddamn', 'handjob', 'hiscock', 'honkers', 'hookers', 'hooters', 'hosejob', 'hostage', 'hotdamn', 'hustler', 'illegal', 'israeli', 'israels', 'jackass', 'jackoff', 'japcrap', 'jerkoff', 'jigaboo', 'jiggabo', 'jigger', 'jimfish', 'juggalo', 'killing', 'kissass', 'kumming', 'kumquat', 'lactate', 'lesbain', 'lesbayn', 'lesbian', 'liberal', 'livesex', 'lovegoo', 'lovegun', 'lowlife', 'lubejob', 'lucifer', 'mexican', 'mideast', 'mulatto', 'muncher', 'nastyho', 'negroes', 'negroid', 'niggard', 'niggers', 'niggers', 'niggled', 'niggles', 'pansies', 'panties', 'peehole', 'pee-pee', 'penises', 'phuking', 'phukked', 'phungky', 'pindick', 'pisses', 'pissin', 'pissing', 'playboy', 'pooping', 'poverty', 'puddboy', 'puntang', 'pussies', 'quashie', 'quickie', 'radical', 'raghead', 'rearend', 'redneck', 'reestie', 'refugee', 'remains', 'rimming', 'russkie', 'schlong', 'scrotum', 'servant', 'sexfarm', 'sextogo', 'sextoys', 'shaggin', 'sheeney', 'shinola', 'shitcan', 'shitfit', 'shiting', 'shitola', 'shitted', 'shitter', 'shitty', 'skumbag', 'slapper', 'snigger', 'spitter', 'strapon', 'suckoff', 'suicide', 'swallow', 'tarbaby', 'titfuck', 'titties', 'torture', 'trannie', 'triplex', 'twinkie', 'upskirt', 'urinary', 'urinate', 'vaginal', 'vatican', 'wanking', 'waysted', 'welcher', 'welfare', 'wetback', 'wetspot', 'whacker', 'whigger', 'whiskey', 'abortion', 'absolute', 'academic', 'accident', 'accurate', 
'activist', 'activity', 'actually', 'addition', 'adequate', 'advanced', 'advocate', 'aircraft', 'alliance', 'although', 'American', 'analysis', 'announce', 'anything', 'anywhere', 'apparent', 'approach', 'approval', 'argument', 'artistic', 'athletic', 'attitude', 'attorney', 'audience', 'baseball', 'bathroom', 'behavior', 'birthday', 'boundary', 'building', 'business', 'campaign', 'Canadian', 'capacity', 'category', 'Catholic', 'ceremony', 'chairman', 'champion', 'changing', 'chemical', 'civilian', 'clinical', 'clothing', 'collapse', 'colonial', 'complain', 'complete', 'computer', 'conclude', 'concrete', 'conflict', 'confront', 'Congress', 'consider', 'constant', 'consumer', 'continue', 'contract', 'contrast', 'convince', 'coverage', 'creation', 'creative', 'creature', 'criminal', 'criteria', 'critical', 'cultural', 'customer', 'darkness', 'daughter', 'decision', 'decrease', 'delivery', 'Democrat', 'describe', 'designer', 'detailed', 'dialogue', 'directly', 'director', 'disagree', 'disaster', 'discover', 'disorder', 'distance', 'distinct', 'district', 'division', 'document', 'domestic', 'dominant', 'dominate', 'downtown', 'dramatic', 'earnings', 'economic', 'educator', 'election', 'electric', 'emission', 'emphasis', 'employee', 'employer', 'engineer', 'enormous', 'entirely', 'entrance', 'estimate', 'European', 'evaluate', 'everyday', 'everyone', 'evidence', 'exchange', 'exciting', 'exercise', 'existing', 'exposure', 'external', 'facility', 'familiar', 'favorite', 'fighting', 'football', 'frequent', 'friendly', 'function', 'generate', 'governor', 'graduate', 'greatest', 'headline', 'heritage', 'historic', 'homeless', 'hospital', 'identify', 'identity', 'incident', 'increase', 'indicate', 'industry', 'innocent', 'instance', 'interest', 'internal', 'Internet', 'invasion', 'investor', 'involved', 'Japanese', 'judgment', 'language', 'learning', 'lifetime', 'literary', 'location', 'magazine', 'maintain', 'majority', 'marriage', 'material', 'medicine', 'military', 
'minister', 'minority', 'moderate', 'moreover', 'mortgage', 'mountain', 'movement', 'multiple', 'musician', 'national', 'negative', 'neighbor', 'normally', 'northern', 'numerous', 'observer', 'occasion', 'official', 'operator', 'opponent', 'opposite', 'ordinary', 'organize', 'original', 'overcome', 'overlook', 'painting', 'perceive', 'personal', 'persuade', 'physical', 'planning', 'platform', 'pleasure', 'politics', 'portrait', 'position', 'positive', 'possible', 'possibly', 'powerful', 'practice', 'pregnant', 'presence', 'preserve', 'pressure', 'previous', 'priority', 'prisoner', 'probably', 'producer', 'progress', 'properly', 'property', 'proposal', 'proposed', 'prospect', 'provider', 'province', 'publicly', 'purchase', 'question', 'reaction', 'recently', 'recovery', 'regional', 'register', 'regulate', 'relation', 'relative', 'relevant', 'religion', 'remember', 'reporter', 'research', 'resemble', 'resident', 'resource', 'response', 'romantic', 'sanction', 'scenario', 'schedule', 'security', 'sentence', 'separate', 'sequence', 'shooting', 'shopping', 'shoulder', 'slightly', 'software', 'solution', 'somebody', 'somewhat', 'southern', 'specific', 'spending', 'standard', 'standing', 'straight', 'stranger', 'strategy', 'strength', 'strongly', 'struggle', 'suddenly', 'supposed', 'surprise', 'surround', 'survival', 'survivor', 'taxpayer', 'teaching', 'teaspoon', 'teenager', 'tendency', 'terrible', 'thinking', 'thousand', 'threaten', 'together', 'tomorrow', 'training', 'transfer', 'ultimate', 'universe', 'unlikely', 'vacation', 'valuable', 'variable', 'violence', 'whatever', 'whenever', 'withdraw', 'workshop', 'yourself', 'abortion', 'american', 'arsehole', 'assassin', 'assclown', 'assholes', 'assklown', 'asslover', 'assmunch', 'asswhore', 'backdoor', 'backseat', 'ballsack', 'barfface', 'bastard', 'bazongas', 'beastial', 'beat-off', 'bisexual', 'bitching', 'blackman', 'blackout', 'buggered', 'bullcrap', 'bulldike', 'bulldyke', 'bullshit', 'bunghole', 'buttbang', 
'buttface', 'buttfuck', 'butthead', 'buttplug', 'cameltoe', 'canadian', 'catholic', 'cemetery', 'chinaman', 'chinamen', 'clitoris', 'cockhead', 'cockknob', 'cocksman', 'cocktail', 'coloured', 'copulate', 'cornhole', 'criminal', 'cumqueen', 'cunteyed', 'cuntfuck', 'cybersex', 'datnigga', 'defecate', 'dickhead', 'dickless', 'dicklick', 'dickweed', 'dipstick', 'diseases', 'dripdick', 'dumbfuck', 'easyslut', 'eatballs', 'eatpussy', 'erection', 'european', 'executed', 'farting', 'fastfuck', 'felatio', 'felching', 'fellatio', 'feltcher', 'filipina', 'filipino', 'fistfuck', 'footfuck', 'footstar', 'foreskin', 'foursome', 'freefuck', 'fuckable', 'fuckedup', 'fuckface', 'fuckfest', 'fuckhead', 'fuckknob', 'fucktard', 'gangbang', 'givehead', 'godammit', 'goddamit', 'gonzagas', 'gotohell', 'headfuck', 'henhouse', 'hijacker', 'homicide', 'horniest', 'hotpussy', 'intheass', 'italiano', 'jackshit', 'japanese', 'jijjiboo', 'jizjuice', 'knockers', 'lapdance', 'limpdick', 'lingerie', 'lovebone', 'manhater', 'manpaste', 'meatrack', 'minority', 'molester', 'molestor', 'mosshead', 'muffdive', 'murderer', 'narcotic', 'nigerian', 'niggards', 'niggards', 'niggling', 'ontherag', 'orgasim', 'peepshow', 'peepshpw', 'phonesex', 'phukking', 'pimpjuic', 'pimpsimp', 'pisshead', 'pissoff', 'playgirl', 'poontang', 'pornking', 'pussycat', 'radicals', 'redlight', 'retarded', 'roundeye', 'screwyou', 'sexhound', 'sexhouse', 'sexslave', 'sexually', 'sexwhore', 'sexymoma', 'shagging', 'shitdick', 'shitface', 'shitfuck', 'shitfull', 'shithead', 'shitlist', 'shitting', 'shooting', 'showtime', 'slanteye', 'slutting', 'slutwear', 'sniggers', 'snowback', 'sodomise', 'sodomite', 'sodomize', 'spermbag', 'spigotty', 'stringer', 
'stroking', 'suckdick', 'swastika', 'syphilis', 'testicle', 'thirdeye', 'thirdleg', 'threeway', 'titlover', 'tuckahoe', 'uptheass', 'vibrater', 'vibrator', 'vietcong', 'violence', 'accompany', 'according', 'admission', 'advantage', 'adventure', 'afternoon', 'agreement', 'apartment', 'architect', 'assistant', 'associate', 'attention', 'attribute', 'authority', 'available', 'awareness', 'basically', 'beautiful', 'beginning', 'boyfriend', 'breakfast', 'brilliant', 'calculate', 'candidate', 'carefully', 'celebrate', 'celebrity', 'certainly', 'challenge', 'character', 'childhood', 'chocolate', 'Christian', 'Christmas', 'cigarette', 'classroom', 'coalition', 'cognitive', 'colleague', 'commander', 'committee', 'community', 'complaint', 'component', 'concerned', 'condition', 'confident', 'confusion', 'consensus', 'construct', 'container', 'continued', 'corporate', 'counselor', 'criticism', 'criticize', 'currently', 'dangerous', 'defendant', 'defensive', 'democracy', 'dependent', 'depending', 'desperate', 'determine', 'different', 'difficult', 'dimension', 'direction', 'disappear', 'discourse', 'discovery', 'diversity', 'economics', 'economist', 'education', 'effective', 'efficient', 'eliminate', 'elsewhere', 'emergency', 'emotional', 'emphasize', 'encounter', 'encourage', 'equipment', 'essential', 'establish', 'everybody', 'evolution', 
'excellent', 'exception', 'executive', 'existence', 'expansion', 'expensive', 'explosion', 'extension', 'extensive', 'extremely', 'financial', 'following', 'formation', 'framework', 'frequency', 'furniture', 'generally', 'gentleman', 'gradually', 'guarantee', 'guideline', 'highlight', 'historian', 'household', 'immediate', 'immigrant', 'implement', 'important', 'incentive', 'including', 'increased', 'infection', 'inflation', 'influence', 'initially', 'insurance', 'intensity', 'intention', 'interpret', 'interview', 'introduce', 'knowledge', 'landscape', 'lifestyle', 'literally', 'long-term', 'marketing', 'meanwhile', 'mechanism', 'narrative', 'naturally', 'necessary', 'negotiate', 'newspaper', 'objective', 'obviously', 'offensive', 'operating', 'operation', 'otherwise', 'ourselves', 'passenger', 'perfectly', 'permanent', 'personnel', 'physician', 'political', 'pollution', 'potential', 'practical', 'precisely', 'pregnancy', 'president', 'primarily', 'principal', 'principle', 'procedure', 'professor', 'prominent', 'provision', 'publisher', 'recognize', 'recommend', 'recording', 'reduction', 'reference', 'regarding', 'regularly', 'reinforce', 'religious', 'remaining', 'represent', 'satellite', 'scientist', 'secretary', 'selection', 'sensitive', 'seriously', 'similarly', 'situation', 'so-called', 'something', 'sometimes', 'somewhere', 'spiritual', 'spokesman', 'stability', 'statement', 'strategic', 'structure', 'substance', 'supporter', 'surprised', 'technical', 'technique', 'telephone', 'telescope', 'temporary', 'territory', 'terrorism', 'terrorist', 'testimony', 'therefore', 'tradition', 'transform', 'translate', 'treatment', 'typically', 'universal', 'variation', 'vegetable', 'violation', 'virtually', 'volunteer', 'wonderful', 'yesterday', 'analannie', 'assbagger', 'asscowboy', 'assfucker', 'assjockey', 'asskisser', 'asslicker', 'assmonkey', 'asspacker', 'asspirate', 'assranger', 'bicurious', 'bitchslap', 'bootycall', 'bountybar', 'breastjob', 'breastman', 
'butchdike', 'butchdyke', 'butt-bang', 'buttmunch', 'buttstain', 'catholics', 'childrens', 'christian', 'cigarette', 'clamdiver', 'cockblock', 'cockfight', 'cocklover', 'cockqueen', 'cockrider', 'cocksmith', 'cocksucer', 'cocksuck', 'cocktease', 'communist', 'crackpipe', 'criminals', 'crotchrot', 'cumbubble', 'cumjockey', 'cuntlick', 'damnation', 'dickbrain', 'disturbed', 'dixiedike', 'dixiedyke', 'dragqueen', 'dragqween', 'dumbbitch', 'ejaculate', 'ethiopian', 'excrement', 'execution', 'explosion', 'fatfucker', 'feltching', 'fornicate', 'freakfuck', 'fuckbuddy', 'fuckfreak', 'fuckwhore', 'gatorbait', 'goddammit', 'goddamned', 'goddamnes', 'goddamnit', 'hijacking', 'hitlerism', 'hitlerist', 'horseshit', 'hottotrot', 'inthebuff', 'kumbubble', 'kumbullbe', 'loadedgun', 'lovejuice', 'magicwand', 'marijuana', 'mastabate', 'moneyshot', 'mothafuck', 'muffdiver', 'nastyslut', 'nigerians', 'niggarded', 'niggardly', 'nigglings', 'nutfucker', 'penthouse', 'picaninny', 'pimpjuice', 'pornflick', 'premature', 'prickhead', 'primetime', 'pubiclice', 'pussylips', 'rearentry', 'rentafuck', 'scallywag', 'sexkitten', 'sexy-slim', 'shiteater', 'shitfaced', 'shithouse', 'shitstain', 'shortfuck', 'sixsixsix', 'sixtynine', 'skankfuck', 'skinflute', 'slaughter', 'sleezebag', 'slideitin', 'slimeball', 'slopehead', 'slutwhore', 'sniggered', "snigger's", 'splittail', 'stripclub', 'suckmyass', 'suckmytit', 'swallower', 'sweetness', 'terrorist', 'testicles', 'thicklips', 'threesome', 'titfucker', 'titfuckin', 'titlicker', 'towelhead', 'trisexual', 'upthebutt', 'yellowman', 'zipperhead', 'absolutely', 'accomplish', 'additional', 'adjustment', 'adolescent', 'aggressive', 'anticipate', 'apparently', 'appearance', 'appreciate', 'assessment', 'assignment', 'assistance', 'assumption', 'atmosphere', 'attractive', 'background', 'basketball', 'biological', 'capability', 'collection', 'collective', 'commercial', 'commission', 'commitment', 'comparison', 'competitor', 'completely', 'conclusion', 
'conference', 'confidence', 'connection', 'consistent', 'constantly', 
'constitute', 'consultant', 'contribute', 'convention', 'conviction', 'curriculum', 'definitely', 'definition', 'democratic', 'department', 'depression', 'developing', 'difference', 'difficulty', 'disability', 'discipline', 'discussion', 'distribute', 'efficiency', 'electronic', 'elementary', 'employment', 'enterprise', 'especially', 'evaluation', 'eventually', 'everything', 'everywhere', 'exhibition', 'experience', 'experiment', 'expression', 'foundation', 'frequently', 'friendship', 'generation', 'girlfriend', 'government', 'helicopter', 'historical', 'hypothesis', 'illustrate', 'importance', 'impossible', 'impression', 'impressive', 'increasing', 'incredible', 'indication', 'individual', 'industrial', 'ingredient', 'initiative', 'instructor', 'instrument', 'interested', 'investment', 'journalist', 'laboratory', 'leadership', 'legitimate', 'limitation', 'literature', 'management', 'medication', 'membership', 'motivation', 'nomination', 'obligation', 'occupation', 'opposition', 'originally', 'particular', 'percentage', 'perception', 'permission', 'personally', 'phenomenon', 'philosophy', 'photograph', 'physically', 'politician', 'population', 'preference', 'previously', 'production', 'profession', 'proportion', 'prosecutor', 'protection', 'psychology', 'punishment', 'reasonable', 'reflection', 'regardless', 'regulation', 'relatively', 'remarkable', 'repeatedly', 'Republican', 'reputation', 'researcher', 'resistance', 'resolution', 'respondent', 'restaurant', 'retirement', 'revolution', 'scientific', 'settlement', 'specialist', 'statistics', 'strengthen', 'subsequent', 'successful', 'sufficient', 'suggestion', 'surprising', 'tablespoon', 'technology', 'television', 'themselves', 'throughout', 'tournament', 'transition', 'tremendous', 'ultimately', 'understand', 'university', 'vulnerable', 'widespread', 'assblaster', 'assmuncher', 'asspuppies', 'australian', 'balllicker', 'beastality', 'bestiality', 'bigbastard', 'bumblefuck', 'butchbabes', 'buttfucker', 
'buttpirate', 'chickslick', 'clamdigger', 'cockcowboy', 'cocklicker', 'cocksmoker', 'cocksucker', 'conspiracy', 'corruption', 'crackwhore', 'cunilingus', 'cuntfucker', 'cuntsucker', 'deapthroat', 'deepthroat', 'dicklicker', 'doggystyle', 'ejaculated', 'facefucker', 'fingerfood', 'flatulence', 'footaction', 'footfucker', 'footlicker', 'fourtwenty', 'fuckfriend', 'fuckinnuts', 'fuckmehard', 'fuckmonkey', 'gangbanger', 'gonorrehea', 'greaseball', 'headlights', 'homosexual', 'kunilingus', 'lovemuscle', 'lovepistol', 'loverocket', 'mastabater', 'masterbate', 'masturbate', 'mickeyfinn', 'mothafucka', 'motherfuck', 'mufflikcer', 'nastybitch', 'nastywhore', 'niggaracci', 'niggarding', 'niggerhead', 'niggerhole', 'nipplering', 'palesimian', 'peckerwood', 'piccaninny', 'pickaninny', 'pocketpool', 'propaganda', 'prostitute', 'protestant', 'pussyeater', 'pussylover', 'republican', 'samckdaddy', 'sandnigger', 'shawtypimp', 'shitfucker', 'shithapens', 'sixtyniner', 'skankbitch', 'skankwhore', 'sleezeball', 'sniggering', 'snownigger', 'sonofbitch', 'spermacide', 'stupidfuck', 'suckmydick', 'transexual', 'unfuckable', 'whiskydick', 'whitetrash', 'whorehouse', 
'achievement', 'acknowledge', 'advertising', 'alternative', 'anniversary', 'application', 'appointment', 'appropriate', 'arrangement', 'association', 'celebration', 'cholesterol', 'combination', 'comfortable', 'communicate', 
'competition', 'competitive', 'complicated', 'composition', 'concentrate', 'consequence', 'consumption', 'controversy', 'cooperation', 'corporation', 'demonstrate', 'description', 'destruction', 'development', 'differently', 
'distinction', 'distinguish', 'educational', 'effectively', 'electricity', 'enforcement', 'engineering', 'environment', 'essentially', 'examination', 'expectation', 'explanation', 'frustration', 'fundamental', 'furthermore', 
'grandfather', 'grandmother', 'imagination', 'immediately', 'immigration', 'implication', 'improvement', 'incorporate', 'independent', 'information', 'institution', 'instruction', 'interaction', 'interesting', 'investigate', 
'involvement', 'legislation', 'maintenance', 'measurement', 'necessarily', 'negotiation', 'nonetheless', 'observation', 'opportunity', 'orientation', 'Palestinian', 'participant', 'participate', 'partnership', 'performance', 
'personality', 'perspective', 'politically', 'possibility', 'potentially', 'preparation', 'publication', 'quarterback', 'recognition', 'requirement', 'reservation', 'responsible', 'restriction', 'scholarship', 'significant', 
'substantial', 'temperature', 'traditional', 'assassinate', 'backdoorman', 'barelylegal', 'beastiality', 'breastlover', 'buttfuckers', 'buttmuncher', 'cameljockey', 'cockblocker', 'cocksucked', 'cocksucking', 'crack-whore', 'cunillingus', 'cunnilingus', 'cuntlicker', 'cyberslimer', 'dingleberry', 'doggiestyle', 'ejaculation', 'executioner', 'fannyfucker', 'fingerfuck', 'fistfucked', 'fistfucker', 'fuckinright', 'fudgepacker', 'gangbanged', 'glazeddonut', 'hillbillies', 'holestuffer', 'homobangers', 'intercourse', 'interracial', 'jesuschrist', 'junglebunny', 'kunnilingus', 'mastrabator', 'meatbeatter', 'molestation', 'mooncricket', 'mothafuckaz', 'mothafucker', 'mothafuckin', 'muffindiver', 'palestinian', 'penetration', 'porchmonkey', 'pornography', 'pussyfucker', 'pussylicker', 'shithappens', 'skankybitch', 'skankywhore', 'slavedriver', 'slimebucket', 'snatchpatch', 'sonofabitch', 'spermherder', 'spreadeagle', 'tonguetramp', 'transsexual', 'twobitwhore', 'whiskeydick', 'whitenigger', 'whorefucker', 'agricultural', 'championship', 'characterize', 'circumstance', 'conservative', 'considerable', 'construction', 'contemporary', 'contribution', 'conventional', 'conversation', 'distribution', 'dramatically', 'headquarters', 'increasingly', 'independence', 'intellectual', 'intelligence', 'intervention', 'introduction', 'investigator', 
'manufacturer', 'neighborhood', 'nevertheless', 'occasionally', 'organization', 'particularly', 'photographer', 'prescription', 'presentation', 'presidential', 'professional', 'psychologist', 'relationship', 'satisfaction', 'significance', 'specifically', 'successfully', 'surprisingly', 'athletesfoot', 'beatyourmeat', 'cherrypopper', 'conservative', 'crotchjockey', 'crotchmonkey', 'cuntlicking', 'ejaculating', 'fistfucking', 'freakyfucker', 'fuckingbitch', 'goldenshower', 'grostulation', 'heterosexual', 'lezbefriends', 'masturbating', 'mothafucked', 'motherfucked', 'motherfucker', 'motherfuckin', 'nofuckingway', 'pornprincess', 'pussypounder', 'spermhearder', 'stupidfucker', 'timbernigger', 'titbitnipply', 'tonguethrust', 'trailertrash', 'transvestite', 'tunneloflove', 'williewanker', 'administrator', 'approximately', 'communication', 'comprehensive', 'concentration', 'congressional', 'consciousness', 'consideration', 'controversial', 'correspondent', 'demonstration', 'entertainment', 'environmental', 'establishment', 'extraordinary', 'institutional', 'international', 'investigation', 'manufacturing', 'participation', 'psychological', 'significantly', 'sophisticated', 'understanding', 'unfortunately', 'alligatorbait', 'assassination', 'carpetmuncher', 'dickforbrains', 'fingerfucked', 'fingerfucker', 'fingerfuckers', 'jacktheripper', 'masterblaster', 'mothafucking', 'mothafuckings', 'motherfucking', 'niggardliness', 'pearlnecklace', 'pooperscooper', 'shitforbrains', 'shitoutofluck', 'tongethruster', 'virginbreaker', 'administration', 'characteristic', 'constitutional', 'discrimination', 'identification', 'interpretation', 'recommendation', 'representation', 'representative', 'responsibility', 'transformation', 'transportation', 'fingerfucking', 'luckycammeltoe', 'motherfuckings', 'motherlovebone', 'niggardlinesss', 'poorwhitetrash', 'purinapricness', 'smackthemonkey', 'spankthemonkey', 'devilworshipper', 'spaghettibender', 'spaghettinigger', 'African-American', 
'mattressprincess', 'goddamnmuthafucker', 'gaymuthafuckinwhore']

## Hashtag manipulation

In [11]:
def clean_hashtag(tag):
  return re.sub("[^a-zA-Z]+", '', tag).lower() 

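# Returns every way to split `tag` into words from `words`, as a list of word lists.
# The early `break` assumes `words` is sorted by increasing word length.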
def split_tag(tag):
  results = []
  for w in words:
    if len(w) > len(tag): 
      break
    if tag.startswith(w):
      suffix = tag[len(w):]
      if suffix == '':
        return [[w]]
      rest = split_tag(suffix)
      for res in rest:
        result = [w] + res
        if sum([len(s) for s in result]) == len(tag):
          results.append(result)
        else:
          print('wrong length: ' + str(sum([len(s) for s in result])) + ' != ' + str(len(tag)))
  return results

def tag_to_string(split_tag):
  t = ""
  if len(split_tag) > 0:
    for word in split_tag[0]:
      t = t + " " + word
  return t

def replace_hashtag(tweet, tag, tag_to_str):
  return tweet.replace(tag, tag_to_str)

def find_hashtags(tweet):
  # Raw string avoids an invalid-escape warning for \w.
  tags = re.findall(r"#[\w]*", tweet)
  return tags

In [None]:
print(find_hashtags(df.tweet[13239]))

['#Spanishrevenge', '#justice', '#HumanRights', '#FreedomOfExpression', '#Spain', '#fakedemocracy', '#cddr', '#shameonSpain', '#WakeupEurope']
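
The splitting idea behind `split_tag` can be sketched on a toy vocabulary. The names `toy_vocab`, `segment` and `expand_hashtag` are hypothetical and not part of this notebook. Unlike `split_tag`, this sketch returns only the first segmentation it finds and does not depend on the ordering of the vocabulary.

```python
import re

# Hypothetical toy vocabulary; the notebook's real `words` list is far larger.
toy_vocab = {"freedom", "of", "expression", "human", "rights"}

def segment(tag, vocab):
    """Return the first split of `tag` into vocabulary words, or None."""
    if tag == "":
        return []
    for end in range(1, len(tag) + 1):
        prefix = tag[:end]
        if prefix in vocab:
            rest = segment(tag[end:], vocab)
            if rest is not None:
                return [prefix] + rest
    return None

def expand_hashtag(tag, vocab):
    """Clean a hashtag and split it; fall back to the cleaned tag if no split exists."""
    cleaned = re.sub(r"[^a-zA-Z]+", "", tag).lower()
    parts = segment(cleaned, vocab)
    return " ".join(parts) if parts is not None else cleaned

print(expand_hashtag("#FreedomOfExpression", toy_vocab))  # freedom of expression
print(expand_hashtag("#cddr", toy_vocab))                 # cddr
```

A hashtag such as `#cddr` has no split in the toy vocabulary, so it is kept as the cleaned, unsplit string.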


This replaces each hashtag in a tweet with the words it is made of. If a hashtag cannot be split into words from the vocabulary, it is kept unsplit: the cleaned, lowercased tag text is inserted into the tweet instead.

In [12]:
for tweet in df.tweet:
  new_tweet = tweet
  # re.findall always returns a list (empty when the tweet has no hashtags).
  for tag in find_hashtags(tweet):
    c_tag = clean_hashtag(str(tag))
    split_hashtag = split_tag(c_tag)
    if len(split_hashtag) > 0:
      # Use the first segmentation that was found.
      new_tweet = replace_hashtag(new_tweet, tag.strip(), tag_to_string(split_hashtag))
    else:
      # No segmentation found: fall back to the cleaned, unsplit tag.
      new_tweet = replace_hashtag(new_tweet, tag.strip(), c_tag)
  # Replaces every cell in df that equals the original tweet text.
  df.replace(tweet, new_tweet, inplace=True)

In [13]:
print(df.tweet[55])

 gun control advocates must STOP falling all over themselves to assure electorate that they too love the HORRIFIC 2A URL


In [None]:
print(df.tweet[13239])

spanishrevenge vs.  justice humanrights and  freedom of expression spain is a fakedemocracy $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ $MENTION$ cddr shameonspain wakeupeurope $MENTION$ URL


## Saving the labels and parsed tweets of the training data

The tweets and labels of the training data are stored as NumPy arrays. Labels are converted to a binary representation: non-offensive tweets (`NOT`) become 0 and offensive tweets become 1.

In [17]:
tweets = df.tweet.values
labels = df.subtask_a.values
labels = np.where(labels == "NOT", 0, 1)

In [18]:
print(tweets[0] + " label " + str(labels[0]))

$MENTION$ She should ask a few native Americans what their take on this is. label 1


## BERT Tokenization 

The BERT tokenization below follows this guide: https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03

In [19]:
!pip install transformers

from transformers import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [20]:
# Tokenize with BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

INFO:filelock:Lock 139902625113480 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpwnv8m58m



INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt in cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:filelock:Lock 139902625113480 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f7




## Final text clean-up and addition of special BERT tokens

In [21]:
# Collapse runs of '.', '?' and '!' so that only the last character of each run remains.
def clean_punctutation(tweets):
  cleaned_tweets = []
  for tweet in tweets:
    i = 0
    while(i < len(tweet)-1):
      if(isMultiplePunctuation(tweet[i], tweet[i+1])):
        tweet= tweet[:i] + tweet[i+1:]
      else:
        i+=1
    cleaned_tweets.append(tweet)
  return cleaned_tweets

def isMultiplePunctuation(c1, c2):
  # True when two adjacent characters are both sentence-ending punctuation.
  return c1 in '.?!' and c2 in '.?!'

def add_sentence_seperators(tweets):
  seperated_tweets = []
  for tweet in tweets:
    seperated_tweet = "[CLS] " + seperate_sentences(str(tweet))
    if(seperated_tweet[-5:] != "[SEP]"):
      seperated_tweet += " [SEP]"
    seperated_tweets.append(seperated_tweet)
  return seperated_tweets

def seperate_sentences(tweet):
  tweet = tweet.replace('.', ". [SEP]")
  tweet = tweet.replace('!', "! [SEP]")
  tweet = tweet.replace('?', "? [SEP]")
  return tweet

cleaned_tweets = clean_punctutation(tweets)
print(cleaned_tweets[1])
seperated_tweets = add_sentence_seperators(cleaned_tweets)
print(seperated_tweets[1])

# Convert to a NumPy array so that tweets and labels use the same container type
seperated_tweets = np.array(seperated_tweets)


$MENTION$ $MENTION$ Go home youre drunk! $MENTION$ maga trump $EMOJI$$EMOJI$ URL
[CLS] $MENTION$ $MENTION$ Go home youre drunk! [SEP] $MENTION$ maga trump $EMOJI$$EMOJI$ URL [SEP]
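
The two clean-up steps above can also be written with regular expressions. This is a sketch of an equivalent approach, not the notebook's implementation; `collapse_punctuation` and `add_bert_markers` are hypothetical names introduced for illustration.

```python
import re

def collapse_punctuation(text):
    # Keep only the last character of any run of '.', '?' or '!'.
    return re.sub(r"[.?!]+([.?!])", r"\1", text)

def add_bert_markers(text):
    # Put [SEP] after every sentence-ending character, then prepend [CLS].
    marked = "[CLS] " + re.sub(r"([.?!])", r"\1 [SEP]", text)
    # Make sure the sequence always ends with a separator.
    if not marked.endswith("[SEP]"):
        marked += " [SEP]"
    return marked

example = "Go home!!! Now..."
cleaned = collapse_punctuation(example)
print(cleaned)                    # Go home! Now.
print(add_bert_markers(cleaned))  # [CLS] Go home! [SEP] Now. [SEP]
```

Collapsing repeated punctuation first matters: without it, a run such as `!!!` would produce three consecutive `[SEP]` tokens.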


The tokenization and the conversion of the lists into tensors follow this guide: https://mccormickml.com/2019/07/22/BERT-fine-tuning/

In [22]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
token_ids = []
attention_masks = []

# For every sentence...
for tweet in seperated_tweets:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Map tokens to their IDs.
    #   (3) Pad or truncate the sentence to `max_length`
    #   (4) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # tweet to encode.
                        add_special_tokens = False, # '[CLS]' and '[SEP]' were already added above.
                        max_length = 300,           # Pad & truncate all tweets.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])

    # Add the token type (segment) IDs for each tweet.
    token_ids.append(encoded_dict['token_type_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_ids = torch.cat(token_ids, dim=0)
labels = torch.tensor(labels)

# Print tweet 1, now as a list of IDs.
print('Original: ', seperated_tweets[1])
print('Input IDs:', input_ids[1])
print('Token IDs:', token_ids[1])


Original:  [CLS] $MENTION$ $MENTION$ Go home youre drunk! [SEP] $MENTION$ maga trump $EMOJI$$EMOJI$ URL [SEP]
Input IDs: tensor([  101,  1002,  5254,  1002,  1002,  5254,  1002,  2175,  2188,  2115,
         2063,  7144,   999,   102,  1002,  5254,  1002, 23848,  2050,  8398,
         1002,  7861, 29147,  2072,  1002,  1002,  7861, 29147,  2072,  1002,
        24471,  2140,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,

In [25]:
# Print tweet 154, now as a list of IDs.
print('Original: ', seperated_tweets[154])
print('Input IDs:', input_ids[154])
print('Token IDs:', token_ids[154])

Original:  [CLS] $MENTION$ $MENTION$ In case of antifa take a little bat or something. [SEP] Good for you- more of should do that kind of thing. [SEP]
Input IDs: tensor([ 101, 1002, 5254, 1002, 1002, 5254, 1002, 1999, 2553, 1997, 3424, 7011,
        2202, 1037, 2210, 7151, 2030, 2242, 1012,  102, 2204, 2005, 2017, 1011,
        2062, 1997, 2323, 2079, 2008, 2785, 1997, 2518, 1012,  102,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

## Attention Mask

Every tweet must have the same length, so each one has been padded (or truncated) to a fixed length of 300 tokens, the `max_length` set during tokenization. BERT should not attend to the padding, so each tweet needs an attention mask that distinguishes real input from padding. See more: https://huggingface.co/transformers/glossary.html#attention-mask
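
As a minimal sketch of what padding and masking produce, the plain-Python function below mimics the behaviour for a single list of token IDs. `pad_and_mask` is a hypothetical helper, not a tokenizer API; the padding ID of 0 matches the zeros in the tensors printed above.

```python
PAD_ID = 0  # padding ID, as seen in the printed input IDs

def pad_and_mask(token_ids, max_length):
    """Truncate or pad `token_ids` to `max_length` and build the matching mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    padded = ids + [PAD_ID] * n_pad
    mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

padded, mask = pad_and_mask([101, 2175, 2188, 102], max_length=6)
print(padded)  # [101, 2175, 2188, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```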


## Split data into training and validation sets
This step also follows the guide at: https://mccormickml.com/2019/07/22/BERT-fine-tuning/

A validation set is useful because it gives a check for overfitting. Besides validation accuracy, validation loss can also be observed: it measures how confident the model was in the correct label, not just whether the prediction was right.
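
To make the difference between accuracy and loss concrete, here is a small sketch using cross-entropy on two hypothetical predictions for a tweet whose true label is 1. The probabilities are made up for illustration. Both predictions count as correct for accuracy, but the less confident one has a much higher loss.

```python
import math

def cross_entropy(probs, label):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[label])

true_label = 1
confident = [0.05, 0.95]  # hypothetical model output: P(class 0), P(class 1)
hesitant = [0.45, 0.55]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    predicted = probs.index(max(probs))
    loss = cross_entropy(probs, true_label)
    print(f"{name}: predicted={predicted}, loss={loss:.3f}")
```

A validation loss that rises while validation accuracy stays flat is therefore an early hint that the model is starting to overfit.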

In [26]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, token_ids, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))


11,916 training samples
1,324 validation samples


The guide also recommends wrapping each dataset in a `DataLoader` iterator. It yields one batch at a time, so only a single batch has to be passed through the model at once, which saves memory. See more: https://pytorch.org/docs/stable/data.html

In [27]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32. 16 is chosen here because the GPU runs out of memory with 32.
batch_size = 16

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
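
Conceptually, a loader behaves like a generator that hands out one batch at a time. The sketch below illustrates that idea in plain Python; it is not how `DataLoader` is implemented, and `iter_batches` is a hypothetical helper.

```python
import random

def iter_batches(samples, batch_size, shuffle=False, seed=None):
    """Yield successive batches lazily instead of building them all up front."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

samples = list(range(50))  # stand-in for 50 encoded tweets
batches = list(iter_batches(samples, batch_size=16))
print([len(b) for b in batches])
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size. With 11,916 training samples and a batch size of 16 this gives 745 batches, the last one holding 12 samples, which matches the batch count in the training log.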


## Training the model

In [28]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()


INFO:filelock:Lock 139902599672160 acquired on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp_dso8bfv



INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json in cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
INFO:filelock:Lock 139902599672160 released on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3





INFO:transformers.file_utils:storing https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin in cache at /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
INFO:filelock:Lock 139902614146016 released on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock
INFO:transformers.modeling_utils:loading weights file https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin from cache at /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157





INFO:transformers.modeling_utils:Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
INFO:transformers.modeling_utils:Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [29]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

In [30]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4. 
epochs = 3

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
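
To show the shape of this schedule, the following sketch computes the factor by which the base learning rate is multiplied at a given step: a linear ramp-up during warmup, then a linear decay to zero. It reflects my understanding of the schedule rather than the library's source code, and `linear_schedule_multiplier` is a hypothetical name. The step count assumes 745 batches per epoch, as reported in the training log.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 745 * 3  # batches per epoch x epochs

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(step, lr)
```

With `num_warmup_steps = 0`, as in the cell above, training starts at the full learning rate of 2e-5 and the rate reaches zero at the final step.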


In [31]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
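
One subtlety: `flat_accuracy` returns the accuracy of a single batch, and the training loop below averages these per-batch values. When the last batch is smaller than the others (1,324 validation samples with a batch size of 16 leave a final batch of 12), that average weights the last batch slightly too heavily. The sketch below, in plain Python with made-up logits and hypothetical helper names, contrasts the two ways of aggregating.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def count_correct(logits, labels):
    """Number of rows whose highest logit matches the label."""
    return sum(argmax(row) == label for row, label in zip(logits, labels))

# Toy batches of (logits, labels); the second batch is smaller.
toy_batches = [
    ([[2.0, -1.0], [0.2, 0.9], [1.5, 0.1], [-0.3, 0.4]], [0, 1, 1, 1]),
    ([[0.1, 0.8]], [0]),
]

# Average of per-batch accuracies (what the training loop reports).
mean_of_batch_acc = sum(
    count_correct(lg, lb) / len(lb) for lg, lb in toy_batches
) / len(toy_batches)

# Accuracy over all samples, independent of how they are batched.
overall_acc = sum(count_correct(lg, lb) for lg, lb in toy_batches) / sum(
    len(lb) for _, lb in toy_batches
)
print(mean_of_batch_acc, overall_acc)
```

For the batch sizes used in this notebook the difference is small, but counting correct predictions over all samples avoids it entirely.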

In [32]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))



In [33]:
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Format the elapsed time as h:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: token type ids (segment ids)
        #   [3]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_token_ids = batch[2].to(device)
        b_labels = batch[3].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns a different number of values depending on which arguments
        # are given and which flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        loss, logits = model(b_input_ids, 
                             token_type_ids=b_token_ids, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: token type ids (segment ids)
        #   [3]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_token_ids = batch[2].to(device)
        b_labels = batch[3].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=b_token_ids, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation tweets, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))



Training...
  Batch    40  of    745.    Elapsed: 0:01:03.
  Batch    80  of    745.    Elapsed: 0:02:05.
  Batch   120  of    745.    Elapsed: 0:03:07.
  Batch   160  of    745.    Elapsed: 0:04:10.
  Batch   200  of    745.    Elapsed: 0:05:12.
  Batch   240  of    745.    Elapsed: 0:06:14.
  Batch   280  of    745.    Elapsed: 0:07:16.
  Batch   320  of    745.    Elapsed: 0:08:19.
  Batch   360  of    745.    Elapsed: 0:09:21.
  Batch   400  of    745.    Elapsed: 0:10:23.
  Batch   440  of    745.    Elapsed: 0:11:25.
  Batch   480  of    745.    Elapsed: 0:12:28.
  Batch   520  of    745.    Elapsed: 0:13:30.
  Batch   560  of    745.    Elapsed: 0:14:32.
  Batch   600  of    745.    Elapsed: 0:15:34.
  Batch   640  of    745.    Elapsed: 0:16:36.
  Batch   680  of    745.    Elapsed: 0:17:38.
  Batch   720  of    745.    Elapsed: 0:18:40.

  Average training loss: 0.47
  Training epcoh took: 0:19:19

Running Validation...
  Accuracy: 0.81
  Validation Loss: 0.41
  Validation to

In [None]:
import pandas as pd

# Display floats with two decimal places.
pd.set_option('display.precision', 2)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap.
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

# Display the table.
df_stats
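
One way to use these statistics as an overfitting check is to look for the epoch with the lowest validation loss and for epochs where validation loss rises while training loss keeps falling. The sketch below works on a list of dicts with the same keys as `training_stats`. The numbers are invented for illustration and are not results from this notebook; `best_epoch` and `overfitting_epochs` are hypothetical helpers.

```python
# Illustrative values only; these are NOT results from this notebook.
example_stats = [
    {'epoch': 1, 'Training Loss': 0.50, 'Valid. Loss': 0.42},
    {'epoch': 2, 'Training Loss': 0.35, 'Valid. Loss': 0.40},
    {'epoch': 3, 'Training Loss': 0.22, 'Valid. Loss': 0.46},
]

def best_epoch(stats):
    """Epoch with the lowest validation loss."""
    return min(stats, key=lambda row: row['Valid. Loss'])['epoch']

def overfitting_epochs(stats):
    """Epochs where validation loss rose while training loss fell."""
    flagged = []
    for prev, cur in zip(stats, stats[1:]):
        val_up = cur['Valid. Loss'] > prev['Valid. Loss']
        train_down = cur['Training Loss'] < prev['Training Loss']
        if val_up and train_down:
            flagged.append(cur['epoch'])
    return flagged

print(best_epoch(example_stats), overfitting_epochs(example_stats))
```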
