# Jigsaw Unintended Bias in Toxicity Classification

<h1> 1. Business Problem </h1>

<h3> 1.1 Description:</h3>
    The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation.
    A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion.
    They use data labeled by human raters, covering various toxic conversational attributes, to improve civility in online conversations.
    
   <h3>1.2 Context:</h3>
    This is a Kaggle competition: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview
    
   <h3>1.3 Data:</h3>
    train.csv, test.csv (Download from Kaggle)
    
   <h3>1.4 Output to be submitted:</h3>
    A submission.csv file in the format of the sample submission (download from Kaggle)
            


<h1> 2. Machine Learning Problem Foundation </h1>

<h2>2.1 Data overview</h2>

Attribute information:
* comment_text: text of individual comments  

* target: toxicity label (to be predicted for the test data; target >= 0.5 is considered the positive class, i.e. toxic)


Identity Attributes:
* male
* female
* transgender
* other_gender
* heterosexual
* homosexual_gay_or_lesbian
* bisexual
* other_sexual_orientation
* christian
* jewish
* muslim
* hindu
* buddhist
* atheist
* other_religion
* black
* white
* asian
* latino
* other_race_or_ethnicity
* physical_disability
* intellectual_or_learning_disability
* psychiatric_or_mental_illness
* other_disability


# 3. Load Libraries

In [62]:
#Loading Libraries

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from bs4 import BeautifulSoup

import contractions

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc,roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

import os
import re
from tqdm import tqdm

from keras.layers import Embedding
from keras.layers import Input
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.models import Model
from keras.models import load_model

from keras_bert import load_trained_model_from_checkpoint, load_vocabulary
from keras_bert import Tokenizer
from keras_bert import AdamWarmup, calc_train_steps

# 4. Exploratory Data Analysis

<h2> 4.1 Load Data </h2>

In [63]:
tr_data=pd.read_csv("../Data/train.csv")
test_data=pd.read_csv("../Data/test.csv")
#submission = pd.read_csv('../Data/sample_submission.csv', index_col='id')

#tr_data=pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
#test_data=pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv")
#submission = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv', index_col='id')

In [64]:
print("Shape of tr_data",tr_data.shape)
print("Shape of test_data",test_data.shape)

tr_data.head()

Shape of tr_data (1804874, 45)
Shape of test_data (97320, 2)


Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [65]:
tr_data.columns

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count'],
      dtype='object')

Data overview
Attribute information:
* comment_text: text of individual comments
* target: toxicity label (to be predicted for the test data; target >= 0.5 is considered the positive class, i.e. toxic)


When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways. 
The dataset therefore labels comments for the following identity attributes, which are used to measure this bias:
asian, atheist, bisexual, black, buddhist, christian, female, heterosexual, hindu, homosexual_gay_or_lesbian, intellectual_or_learning_disability, jewish, latino, male, muslim, other_disability, other_gender, other_race_or_ethnicity, other_religion, other_sexual_orientation, physical_disability, psychiatric_or_mental_illness, transgender, white.
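
To make the mechanism concrete, here is a minimal sketch on a tiny, made-up set of comments (not taken from the competition data). It compares the overall toxic fraction with the toxic fraction among comments that mention one identity term. The helper name `toxic_rate` and the toy rows are illustrative assumptions.

```python
# Toy illustration (hypothetical data): an identity term that mostly appears
# in toxic comments looks like a toxicity signal to a model.
comments = [
    ("i am a gay woman", False),
    ("gay people are disgusting", True),
    ("all gay people should leave", True),
    ("what a lovely morning", False),
    ("thanks for the helpful article", False),
    ("you are an idiot", True),
]

def toxic_rate(rows):
    """Fraction of (text, is_toxic) rows that are labelled toxic."""
    rows = list(rows)
    return sum(is_toxic for _, is_toxic in rows) / len(rows)

overall = toxic_rate(comments)
mentions_gay = toxic_rate(r for r in comments if "gay" in r[0].split())

print(f"overall toxic rate:           {overall:.2f}")
print(f"toxic rate when 'gay' occurs: {mentions_gay:.2f}")
```

In this toy set the term co-occurs with toxicity more often than average, so a model trained on it could wrongly score the harmless first comment as toxic.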

In [66]:
# Create dataset with comment text, target and identity_attributes and drop other columns from train_data
identity_attribute=['male','female','transgender','other_gender','heterosexual','homosexual_gay_or_lesbian','bisexual','other_sexual_orientation',
                    'christian','jewish','muslim','hindu','buddhist','atheist','other_religion','black','white','asian','latino','other_race_or_ethnicity',
                    'physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability']

data=tr_data[['comment_text','target']]
data1=tr_data[identity_attribute]
data=pd.concat([data,data1],axis=1)
data.head()

Unnamed: 0,comment_text,target,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,other_sexual_orientation,...,other_religion,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability
0,"This is so cool. It's like, 'would you want yo...",0.0,,,,,,,,,...,,,,,,,,,,
1,Thank you!! This would make my life a lot less...,0.0,,,,,,,,,...,,,,,,,,,,
2,This is such an urgent design problem; kudos t...,0.0,,,,,,,,,...,,,,,,,,,,
3,Is this something I'll be able to install on m...,0.0,,,,,,,,,...,,,,,,,,,,
4,haha you guys are a bunch of losers.,0.893617,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0


In [67]:
# Convert target and identity columns to booleans
def convert_to_bool(df, col_name):
    df[col_name] = np.where(df[col_name] >= 0.5, True, False)
    
def convert_dataframe_to_bool(df):
    bool_df = df.copy()
    for col in ['target'] + identity_attribute:
        convert_to_bool(bool_df, col)
    return bool_df

data = convert_dataframe_to_bool(data)

In [68]:
data.head()

Unnamed: 0,comment_text,target,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,other_sexual_orientation,...,other_religion,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability
0,"This is so cool. It's like, 'would you want yo...",False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Thank you!! This would make my life a lot less...,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,This is such an urgent design problem; kudos t...,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Is this something I'll be able to install on m...,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,haha you guys are a bunch of losers.,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
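
The first four rows had NaN identity values before the conversion and are all False afterwards. The sketch below, on made-up scores, shows why: a comparison with NaN is always False, so comments that were never annotated for identity are treated as not mentioning that identity.

```python
import numpy as np

# Hypothetical identity scores; NaN means the comment was not annotated for identity.
scores = np.array([np.nan, 0.25, 0.5, 0.9])

# NaN >= 0.5 evaluates to False, so unannotated rows end up as False.
as_bool = np.where(scores >= 0.5, True, False)
print(as_bool)
```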


<h2>4.2 Data Preprocessing </h2>

Text Preprocessing

In the preprocessing phase we do the following, in the order below:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (reportedly, there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Expand contractions
8. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)
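
As a rough illustration of the steps above, here is a simplified sketch that uses only the standard library. The function name `clean_comment` and the short stopword set are assumptions made for this example. Contraction expansion and stemming are left out to keep it dependency-free, so it is not the notebook's actual pipeline.

```python
import re

# Small illustrative stopword list (the notebook defines a much longer one).
STOPWORDS = {"the", "is", "this", "to", "a", "and", "it", "so"}

def clean_comment(text):
    """Simplified cleaning: strip tags and URLs, keep alphabetic tokens, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"http\S+", " ", text)       # remove URLs
    text = re.sub(r"[^A-Za-z]+", " ", text)    # keep letters only
    tokens = text.lower().split()
    # keep tokens longer than 2 characters that are not stopwords
    return " ".join(t for t in tokens if len(t) > 2 and t not in STOPWORDS)

print(clean_comment("<b>This</b> is SO cool, see http://example.com/page 123!"))
```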



In [69]:
#Preprocessing steps to delete and isolate unwanted symbols
symbols_to_delete = '→★©®●ː☆¶）иʿ。ﬂﬁ₁♭年▪←ʒ、（月■⇌ɹˤ³の¤‿عدويهصقناخلىبمغرʀɴשלוםביエンᴵאעכח‐ικξتحكسةفزط‑地谷улкноה歌мυтэпрдˢᵒʳʸᴺʷᵗʰᵉᵘοςתמדףנרךצט成都ех小土》करमा英文レクサス外国人бьыгя不つзц会下有的加大子ツشءʲшчюж戦щ明קљћ我出生天一家新ʁսհןجі‒公美阿ספ白マルハニチロ社ζ和中法本士相信政治堂版っфچیリ事「」シχψմեայինրւդک《ლさようならعدويهصقناخلىبمغرʀɴשלוםביエンᴵאעכח‐ικξتحكسةفزط‑地谷улкноה歌мυтэпрдˢᵒʳʸᴺʷᵗʰᵉᵘοςתמדףנרךצט成都ех小土》करमा英文レクサス外国人бьыгя不つзц会下有的加大子ツشءʲшчюж戦щ明קљћ我出生天一家新ʁսհןجі‒公美阿ספ白マルハニチロ社ζ和中法本士相信政治堂版っфچیリ事「」シχψմեայինրւդک《ლさようなら\n＼🍕\r🐵😑\xa0\ue014≠\t\uf818\uf04a\xad😢🐶❤️☺\uf0e0😜😎👊\u200b\u200e😁أ😍💖̶💵❥━┣┫Е┗Ｏ►👎😀😂\u202a\u202c🔥😄🏻💥ᴍʏᴇᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ✔\x96\x92😋👏😱‼\x81ジ故障➤\u2009🚌͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘☕♡◐║▬💩💯⛽🚄🏼ஜ۩۞😖ᴠ🚲✒➥😟😈═ˌ💪🙏🎯◄🌹😇💔😡\x7f👌ἐὶήὲἀίῃἴ🙄✬ＳＵＰＥＲＨＩＴ😠\ufeff☻\u2028😉😤⛺♍🙂\u3000👮💙😏🍾🎉😞\u2008🏾😅😭👻😥😔😓🏽🎆✓◾🍻🍽🎶🌺🤔😪\x08؟🐰🐇🐱🙆😨⬅🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚獄℅ВПАН🐾🐕❣😆🔗🚽舞伎🙈😴🏿🤗🇺🇸♫ѕＣＭ⤵🏆🎃😩█▓▒░\u200a🌠🐟💫💰💎\x95🖐🙅⛲🍰⭐🤐👆🙌\u2002💛🙁👀🙊🙉\u2004❧▰▔ᴼᴷ◞▀\x13🚬▂▃▄▅▆▇↙🤓\ue602😵άόέὸ̄😒͝☹➡🆕👅👥👄🔄🔤👉👤👶👲🔛🎓\uf0b7✋\uf04c\x9f\x10😣⏺̲̅😌🤑́🌏😯😲∙‛Ἰᾶὁ💞🚓◇🔔📚✏🏀👐\u202d💤🍇\ue613豆🏡▷❔❓⁉❗\u202f👠्🇹🇼🌸蔡🌞˚🎲😛˙关系С💋💀🎄💜🤢َِ✨是\x80\x9c\x9d🗑\u2005💃📣👿༼◕༽😰ḷЗ▱￼🤣卖温哥华议降％你失去所钱拿坏税骗🐝¯🎅\x85🍺آإ🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003🚀🤴⌠ИОРФДЯМ✘😝🖑ὐύύ特殊作群╪💨圆园▶ℐ☭✭🏈😺♪🌍⏏ệ🍔🐮🍁☔🍆🍑🌮🌯☠🤦\u200d♂𝓒𝓲𝓿𝓵안영하세요ЖК🍀😫🤤ῦ在了可以说普通话汉语好极🎼🕺☃🍸🥂🗽🎇🎊🆘☎🤠👩✈🖒✌✰❆☙🚪⚲\u2006⚭⚆⬭⬯⏖○‣⚓∎ℒ▙☏⅛✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看𝗪𝗵𝗲𝗻𝘆𝗼𝘂𝗿𝗮𝗹𝗶𝘇𝗯𝘁𝗰𝘀𝘅𝗽𝘄𝗱📺ｃϖ\u2000үａᴦᎥһͺ\u2007ｓǀ\u2001ɩ℮ｙｅ൦ｌƽ¸ｗｈ𝐓𝐡𝐞𝐫𝐮𝐝𝐚𝐃𝐜𝐩𝐭𝐢𝐨𝐧Ƅᴨᑯ໐ΤᏧ௦Іᴑ܁𝐬𝐰𝐲𝐛𝐦𝐯𝐑𝐙𝐣𝐇𝐂𝐘𝟎ԜТᗞ౦〔Ꭻ𝐳𝐔𝐱𝟔𝟓𝐅🐋∼ﬃ💘💓ё𝘥𝘯𝘶💐🌋🌄🌅𝙬𝙖𝙨𝙤𝙣𝙡𝙮𝙘𝙠𝙚𝙙𝙜𝙧𝙥𝙩𝙪𝙗𝙞𝙝𝙛👺🐷ℋℳ𝐀𝐥𝐪❄🚶𝙢Ἱ🤘ͦ💸☼패티Ｗ⋆𝙇ᵻ👂👃ɜ🎫\uf0a7БУ🚢🚂ગુજરાતીῆ🏃𝓬𝓻𝓴𝓮𝓽𝓼☘﴾͡๏̯﴿⚾⚽Φ₽\ue807𝑻𝒆𝒍𝒕𝒉𝒓𝒖𝒂𝒏𝒅𝒔𝒎𝒗𝒊👽😙\u200cЛ🎾👹￦⎌🏒⛸寓养宠物吗🏄🐀🚑🤷操𝒑𝒚𝒐𝑴🤙🐒℃欢迎来到拉斯𝙫⏩☮🐈𝒌𝙊𝙭𝙆𝙋𝙍𝘼𝙅ﷻ⚠🦄巨收赢得鬼愤怒要买额ẽ🚗✊🐳𝟏𝐟𝟖𝟑𝟕𝒄𝟗𝐠𝙄𝙃👇锟斤拷❌⭕▸𝗢𝟳𝟱𝟬⦁株式⛷한국어ㄸㅓ니͜ʖ𝘿𝙔₵𝒩ℯ𝒾𝓁𝒶𝓉𝓇𝓊𝓃𝓈𝓅ℴ𝒻𝒽𝓀𝓌𝒸𝓎𝙏𝙟𝘃𝗺𝟮𝟭𝟯𝟲👋🦊☐☑多伦⚡☄ǫ🐽🎻🎹⛓🏹╭╮🍷🦆为友谊祝贺与其想象对如直接问用自己猜传教没积唯认识基督徒曾经让耶稣复活死怪他但当们聊些题时候例战胜因圣把全结婚孩恐惧且栗谓这样还♾🎸🤕🤒⛑🎁批判检讨🏝🦁＞ʕ̣Δ🙋😶쥐스탱트뤼도석유가격인상이경제황을렵게만들지않록잘관리해야합다캐나에서대마초와화약금의품런성분갈때는반드시허된사용✞🔫👁┈╱╲▏▕┃╰▊▋╯┳┊☒凸ὰ💲🗯𝙈Ἄ𝒇𝒈𝒘𝒃𝑬𝑶𝕾𝖙𝖗𝖆𝖎𝖌𝖍𝖕𝖊𝖔𝖑𝖉𝖓𝖐𝖜𝖞𝖚𝖇𝕿𝖘𝖄𝖛𝖒𝖋𝖂𝕴𝖟𝖈𝕸👑🚿☝💡知彼百\uf005𝙀𝒛𝑲𝑳𝑾𝒋𝟒😦𝙒𝘾𝘽🏐𝘩𝘨ὼṑ✅☛𝑱𝑹𝑫𝑵𝑪🇰🇵👾ᓇᒧᔭᐃᐧᐦᑳᐨᓃᓂᑲᐸᑭᑎᓀᐣ🐄🎈🔨♩🐎🤞☞🐸💟🎰🌝🛳点击查🍭𝑥𝑦𝑧ＡＮＧＪＢ👣\uf020◔◡🏉💭🎥♀Ξ🐴👨🤳⬆🦍\x0b🍩𝑯𝒒😗𝟐🏂👳🍗🕉🐲̱ℏ𝑮𝗕𝗴\x91🍒⠀ꜥⲣⲏ╚🐑⏰↺⇤∏鉄件✾◦♬ї💊\uf203\uf09a\uf222\ue608\uf202\uf099\uf469\ue607\uf410\ue600燻製虚偽屁理屈｜Г𝑩𝑰𝒀𝑺🌤∵∴𝗳𝗜𝗙𝗦𝗧🍊ὺἈἡῖΛΩ⤏🇳𝒙Ձռձ冬至ὀ𝒁🔹🤚🍎𝑷🐂💅𝘬𝘱𝘸𝘷𝘐𝘭𝘓𝘖𝘹𝘲𝘫☜Βώ💢▲ΜΟΝΑ
Ε🇱♲𝝈↴↳💒⊘▫Ȼ⬇🚴🖕🖤🥘📍👈➕🚫🎨🌑🐻𝐎𝐍𝐊𝑭🤖🎎✧😼🕷ｇｏｖｒｎｍｔｉｄｕ２０８ｆｂ＇ｋ𝟰🇴🇭🇻🇲𝗞𝗭𝗘𝗤👼📉🍟🍦∕🌈🔭🐊🐍\uf10aˆ⚜☁ڡ🐦\U0001f92f\U0001f92a🐡💳ἱ🙇𝗸𝗟𝗠𝗷🥜🔼'
symbols_to_isolate = '.,?!-;*"…:—()%#$&_/@・ω+=”“[]^–>\\°<~•™ˈʊɒ∞§{}·ταɡ|¢`―ɪ£♥´¹≈÷′ɔ€†μ½ʻπδηλσερνʃ±µº¾．»ав⋅¿¬β⇒›¡₂₃γ″«φ⅓„：¥сɑ！−²ʌ¼⁴⁄₄‚‖⊂⅔¨×θ？∩，ɐ₀≥↑↓／√－‰≤'
isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}
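
The two dictionaries above are translation tables for `str.translate`: each key is a Unicode code point and each value is its replacement string. Here is a minimal sketch with much smaller, made-up symbol sets:

```python
# Hypothetical, much smaller symbol sets than the ones defined above.
to_delete = "★\u200b"
to_isolate = ".,!?"

remove_map = {ord(c): "" for c in to_delete}
isolate_map = {ord(c): f" {c} " for c in to_isolate}

text = "Great★ idea,\u200b well done!"
text = text.translate(remove_map)     # drop unwanted symbols
text = text.translate(isolate_map)    # pad punctuation with spaces
tokens = text.split()
print(tokens)
```

Isolating punctuation this way lets a whitespace tokenizer treat each mark as its own token instead of leaving it glued to the neighbouring word.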
    

In [70]:
#Contraction map to decontract the phrases
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}


In [71]:
#Code ref: https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    """decontract the phrases"""
    
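    # Try longer contractions first so that e.g. "can't've" is not matched as "can't" followed by a stray "'ve"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)), 
                                      flags=re.IGNORECASE|re.DOTALL)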
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
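
The same idea in a self-contained form, as a sketch with a three-entry mapping. The names `SMALL_MAP` and `expand` are made up for this example. Regex alternation takes the first alternative that matches, so the keys are sorted longest first to let `can't've` win over its prefix `can't`.

```python
import re

# Small hypothetical mapping; the notebook's CONTRACTION_MAP is far longer.
SMALL_MAP = {"can't": "cannot", "can't've": "cannot have", "it's": "it is"}

# Longest keys first so "can't've" wins over its prefix "can't".
_keys = sorted(SMALL_MAP, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys), flags=re.IGNORECASE)

def expand(text):
    """Replace each contraction, preserving the case of its first letter."""
    def _sub(match):
        found = match.group(0)
        expanded = SMALL_MAP[found.lower()]
        return found[0] + expanded[1:]
    return _pattern.sub(_sub, text)

print(expand("It's late and I can't've missed it."))
```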



In [72]:
#English stopwords to be removed
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [73]:
# Implementing all pre-processing steps
from tqdm import tqdm
preprocessed_data = []
stemmer = SnowballStemmer("english")
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer() 
  
# Iterate over every comment (wrap the iterable in tqdm(...) to show a progress bar)
for sentance in (data['comment_text'].values):
    sentance = re.sub(r"http\S+", "", sentance)  # remove http/https URLs
    sentance = expand_contractions(sentance) # expand short forms such as n't to not
    sentance = BeautifulSoup(sentance, 'lxml').get_text() # remove HTML/XML tags
    sentance = contractions.fix(sentance)
    sentance = sentance.translate(remove_dict)
    sentance = sentance.translate(isolate_dict)
    

    sentance = re.sub(r"\S*\d\S*", "", sentance).strip() # remove words containing digits
    sentance = re.sub('[^A-Za-z]+', ' ', sentance) # remove special characters
    # NOTE: stem() and lemmatize() receive the whole comment as one string,
    # so they treat it as a single token instead of processing each word
    sentance = stemmer.stem(sentance)
    sentance = lemmatizer.lemmatize(sentance)
    preprocessed_data.append(' '.join(token.lower() for token in nltk.word_tokenize(sentance) if token.lower() not in stopwords))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mukesh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [74]:
# Print some sample values of text before and after pre-processing

print(data['comment_text'].values[0])
print('-'*50)
print(preprocessed_data[0])
print('='*50)

print(data['comment_text'].values[100])
print('-'*50)
print(preprocessed_data[100])
print('='*50)

print(data['comment_text'].values[456])
print('-'*50)
print(preprocessed_data[456])
print('='*50)

print(data['comment_text'].values[10000])
print('-'*50)
print(preprocessed_data[10000])
print('='*50)

This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!
--------------------------------------------------
cool like would want mother read really great idea well done
Loving this collection. Cant wait till Season 2 is released. Should be any day now according to http://yeezy-season2.com/
--------------------------------------------------
loving collection not wait till season released day according
The preserve was sold big time to the public, the only version available to the public before the ballot was the version with the preserve. Yes it was deleted from the ballot language, but that is a pull the wool over the eyes trick that gives Metro the excuse to not be accountable. But they and the zoo are accountable for their statements -- which were unequivacally that there would be an offiste preserve for the elephants.  This just proves that we cannot trust our government to tell the truth. But we still need to hold them accountable for th

# 5. Data Splitting

In [75]:
# Adding preprocessed_data to dataframe and drop comment_text column
final=data
final['preprocessed_data']=preprocessed_data
final.head()
final=final.drop("comment_text",axis=1)
final.head()
#final_original=final
#final = final.sample(n=50000)

Unnamed: 0,target,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,other_sexual_orientation,christian,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,preprocessed_data
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,cool like would want mother read really great ...
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,thank would make life lot less anxiety inducin...
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,urgent design problem kudos taking impressive
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,something able install site releasing
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,haha guys bunch losers


In [76]:
# Split the data into train and cross-validation (cv) sets
X_train, X_cv, y_train, y_cv = train_test_split(final,final['target'], test_size=0.33, shuffle=True) 

print(X_train.shape,y_train.shape,X_cv.shape,y_cv.shape)

(1209265, 26) (1209265,) (595609, 26) (595609,)


In [77]:
X_train.head()

Unnamed: 0,target,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,other_sexual_orientation,christian,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,preprocessed_data
457810,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,enjoyed column also enjoyed news today kodak b...
1682152,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,not disagree post general not police dept exis...
463850,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,actually spoke jewish irony pointing peter par...
1124330,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,ok deal taking early rrsp rrif income yes draw...
864802,False,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,could not agree men lost ability play level pl...


<h1> 5. Featurization: GloVe word embeddings </h1>

In [78]:
# Split each preprocessed comment into a list of tokens (one list per comment)

list_of_sentance_train=[]
for sentance in X_train['preprocessed_data']:
    list_of_sentance_train.append(sentance.split())


list_of_sentance_cv=[]
for sentance in X_cv['preprocessed_data']:
    list_of_sentance_cv.append(sentance.split())
    


In [79]:
print(list_of_sentance_train[1])

['not', 'disagree', 'post', 'general', 'not', 'police', 'dept', 'existing', 'building']


In [80]:
len(list_of_sentance_train)

1209265

In [140]:
## Code adapted from (https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py)
# Vectorize the text 

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

#MAX_NUM_WORDS = 100000
MAX_SEQUENCE_LENGTH = 220
#tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list_of_sentance_train + list_of_sentance_cv)
sequences = tokenizer.texts_to_sequences(list_of_sentance_train)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

train_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) # pad (or truncate) every sequence to MAX_SEQUENCE_LENGTH tokens
train_labels = to_categorical(np.asarray(y_train))

cv_data=pad_sequences(tokenizer.texts_to_sequences(list_of_sentance_cv), maxlen=MAX_SEQUENCE_LENGTH)
cv_labels=to_categorical(np.asarray(y_cv))

print(train_data.shape)
print(train_labels.shape)

print(cv_data.shape)
print(cv_labels.shape)

Found 319910 unique tokens.
(1209265, 220)
(1209265, 2)
(595609, 220)
(595609, 2)
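
Every comment is now a row of exactly 220 integer ids. By default, Keras `pad_sequences` pads and truncates at the front of the sequence. The plain-Python sketch below mimics that default on short lists; the helper name `pad_pre` is made up for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default pad_sequences behaviour: pad and truncate at the front."""
    seq = list(seq)[-maxlen:]                    # keep the last `maxlen` ids
    return [value] * (maxlen - len(seq)) + seq   # left-pad with zeros

short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long_)
```

Id 0 is reserved for padding because the tokenizer's `word_index` starts at 1, which is why the embedding matrix built later has `len(word_index) + 1` rows.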


In [141]:
# use glove.840B.300d.txt for word embeddings
#GloVe is an unsupervised learning algorithm for obtaining vector representations for words. 
#Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

embeddings_index = {}
#EMBEDDING_FILE="../input/glove840b300dtxt/glove.840B.300d.txt"
EMBEDDING_FILE='../data/glove.840B.300d.txt'
EMBEDDINGS_DIMENSION=300
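# Read the file line by line; each line holds a token followed by its coefficients
with open(EMBEDDING_FILE, encoding="utf-8") as f:
    for line in f:
        # Split from the right so that a token containing spaces cannot shift the coefficients
        values = line.rstrip().rsplit(' ', EMBEDDINGS_DIMENSION)
        word = values[0] ## The first entry is the word
        coefs = np.asarray(values[1:], dtype='float32') ## The remaining entries are the embedding vector for the word
        embeddings_index[word] = coefs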


print('GloVe data loaded')

GloVe data loaded
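
As a small illustration of the file format and of how the vectors are used, the sketch below parses three made-up lines with 3-dimensional vectors (real GloVe vectors here have 300 values) and compares them by cosine similarity. The helper names `parse_glove_line` and `cosine` are assumptions for this example.

```python
import numpy as np

def parse_glove_line(line, dim):
    """Split one 'token v1 ... v_dim' line; rsplit keeps tokens that contain spaces intact."""
    parts = line.rstrip().rsplit(" ", dim)
    return parts[0], np.asarray(parts[1:], dtype="float32")

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors, for illustration only.
toy_lines = ["happy 0.9 0.1 0.0", "glad 0.8 0.2 0.1", "table 0.0 0.1 0.9"]
vectors = dict(parse_glove_line(l, dim=3) for l in toy_lines)

sim_glad = cosine(vectors["happy"], vectors["glad"])
sim_table = cosine(vectors["happy"], vectors["table"])
print(sim_glad, sim_table)
```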


In [142]:
embeddings_index["happy"]


array([ 0.036775 ,  0.40917  , -0.52141  , -0.067184 ,  0.087702 ,
       -0.048564 ,  0.40947  , -0.42818  ,  0.19304  ,  2.3925   ,
       -0.11441  , -0.22952  , -0.16061  ,  0.035533 , -0.53179  ,
        0.19764  , -0.48827  ,  0.57439  , -0.064301 ,  0.47053  ,
       -0.29647  , -0.15927  , -0.052798 ,  0.10121  , -0.054461 ,
        0.036129 , -0.16118  , -0.34139  ,  0.45834  , -0.20144  ,
       -0.29067  , -0.51888  , -0.062106 ,  0.14084  ,  0.016413 ,
        0.050826 ,  0.13243  , -0.033663 , -0.42228  , -0.30086  ,
        0.06202  ,  0.26338  ,  0.077223 ,  0.27307  ,  0.13392  ,
        0.30183  , -0.16546  ,  0.057011 , -0.0034585, -0.071113 ,
       -0.27287  , -0.10297  ,  0.07457  , -0.32104  ,  0.36696  ,
        0.27051  , -0.15776  ,  0.2978   , -0.18988  ,  0.097477 ,
        0.035665 , -0.49749  , -0.52759  , -0.046148 ,  0.021715 ,
       -0.11047  , -0.18007  ,  0.20295  ,  0.15254  , -0.045976 ,
       -0.21846  , -0.066865 , -0.21355  ,  0.017509 ,  0.6647

In [143]:
# prepare embedding matrix for all unique tokens in preprocessed text
from keras.layers import Embedding
from keras.initializers import Constant


## EMBEDDINGS_DIMENSION is the dimension of the GloVe vectors (300)
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1,
                                 EMBEDDINGS_DIMENSION))
num_words_in_embedding = 0
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        num_words_in_embedding += 1
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector


print(embedding_matrix.shape)


(319911, 300)
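
The loop above counts the matched tokens in `num_words_in_embedding` but never reports the count. The sketch below, on a made-up three-word vocabulary with 2-dimensional vectors, builds the same kind of matrix and prints the coverage. Tokens without a pretrained vector, such as misspellings, keep an all-zero row.

```python
import numpy as np

# Hypothetical toy vocabulary and pretrained vectors (2 dimensions instead of 300).
toy_word_index = {"cool": 1, "idea": 2, "anothet": 3}   # index 0 is reserved for padding
toy_embeddings = {"cool": np.array([0.1, 0.2]), "idea": np.array([0.3, 0.4])}

matrix = np.zeros((len(toy_word_index) + 1, 2))
found = 0
for word, idx in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[idx] = vec      # copy the pretrained vector into the token's row
        found += 1

coverage = found / len(toy_word_index)
print(f"{found} of {len(toy_word_index)} tokens have a pretrained vector ({coverage:.0%})")
```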


# 6. Define and train an LSTM neural net for classifying toxic comments

In [144]:
# Model Architecture
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D

def model2_fn(embedding_matrix,train_data,train_labels,cv_data,cv_labels):
    """ Model Architecture"""
    BATCH_SIZE = 512
    LSTM_UNITS = 128
    DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
    NUM_EPOCHS = 4

    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedding_layer = Embedding(len(tokenizer.word_index) + 1,
                                EMBEDDINGS_DIMENSION,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
    x = embedding_layer(sequence_input)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)

    hidden = concatenate([
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    
    result = Dense(128, activation='sigmoid')(hidden)
    result = Dense(2, activation='sigmoid')(result)
    
   
    # Compile model.
    print('compiling model')
    input_layer, output_layer =sequence_input, result
    model = Model(input_layer, output_layer)
    model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])


    # Train model.
    print('training model')
    model.fit(train_data,
          train_labels,
          batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS,
          validation_data=(cv_data, cv_labels),
          verbose=2)
    model.save('../input/mymodel/my_model.h5')
    return(model)
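
A note on `DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS`: the residual `add` layers need both inputs to have the same width. The NumPy sketch below uses random numbers as a stand-in for the LSTM output and assumes the default concatenating merge of `Bidirectional`. It only checks the shape arithmetic and does not build the Keras model.

```python
import numpy as np

LSTM_UNITS = 128
batch, steps = 2, 220

# Stand-in for the output of a bidirectional LSTM with return_sequences=True:
# forward and backward states are concatenated, giving 2 * LSTM_UNITS features per step.
x = np.random.rand(batch, steps, 2 * LSTM_UNITS)

max_pool = x.max(axis=1)     # like GlobalMaxPooling1D     -> (batch, 256)
avg_pool = x.mean(axis=1)    # like GlobalAveragePooling1D -> (batch, 256)
hidden = np.concatenate([max_pool, avg_pool], axis=1)

print(hidden.shape)
```

The concatenated vector has 4 * LSTM_UNITS = 512 features, so a Dense layer with 512 units can be added to it element-wise.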
   


In [145]:
# Run the model
import keras
if os.path.exists('../input/mymodel/my_model.h5'):
    print("Loading model")
    model2 = keras.models.load_model('../input/mymodel/my_model.h5')
else:
    model2=model2_fn(embedding_matrix,train_data,train_labels,cv_data,cv_labels)

compiling model
training model
Train on 1209265 samples, validate on 595609 samples
Epoch 1/4
 - 797s - loss: 0.1441 - acc: 0.9447 - val_loss: 0.1303 - val_acc: 0.9477
Epoch 2/4
 - 789s - loss: 0.1281 - acc: 0.9493 - val_loss: 0.1272 - val_acc: 0.9491
Epoch 3/4
 - 764s - loss: 0.1232 - acc: 0.9509 - val_loss: 0.1228 - val_acc: 0.9508
Epoch 4/4
 - 777s - loss: 0.1194 - acc: 0.9520 - val_loss: 0.1222 - val_acc: 0.9511


# 7. Generate model predictions on the validation set

In [146]:
# Predict validation data
MODEL_NAME='model2'
cv_pred= model2.predict(cv_data) # predict on the tokenized, padded cv data

cv_model_data=X_cv
cv_model_data[MODEL_NAME]=cv_pred[:, 1] # column 1 holds the predicted probability of the toxic class


In [148]:
cv_model_data[1:20]


Unnamed: 0,target,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,other_sexual_orientation,christian,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,preprocessed_data,model2
1331671,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,military members disobey orders risk also obey...,0.000111
755024,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,shall call glancing blow directed middle class...,0.000736
909536,True,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,ribeiro top earner sedgwick insurance practice...,0.791725
1035788,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,anothet brilliant factoid keep em coming,0.000196
1432339,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,not think would ever considered buy vw learned...,0.100668
100812,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,uk made horrible decision based fact free leav...,0.016328
1678263,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,forgot one,0.000149
1027361,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,boy wow sure came spending numbers really fast...,0.004299
292640,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,follow precedent jesus suggest pope answer sin...,0.000169
693871,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,hate buzz kill human interest story perfect ex...,0.021808



# 8. Define bias metrics, then evaluate our new model for bias using the validation set predictions
Code taken from the benchmark kernel.


In [149]:
from sklearn import metrics

SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive

def compute_auc(y_true, y_pred):
    try:
        return metrics.roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan

def compute_subgroup_auc(df, subgroup, label, model_name):
    subgroup_examples = df[df[subgroup]]
    return compute_auc(subgroup_examples[label], subgroup_examples[model_name])

def compute_bpsn_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
    subgroup_negative_examples = df[df[subgroup] & ~df[label]]
    non_subgroup_positive_examples = df[~df[subgroup] & df[label]]
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bnsp_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
    subgroup_positive_examples = df[df[subgroup] & df[label]]
    non_subgroup_negative_examples = df[~df[subgroup] & ~df[label]]
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
    """Computes per-subgroup metrics for all subgroups and one model."""
    records = []
    for subgroup in subgroups:
        record = {
            'subgroup': subgroup,
            'subgroup_size': len(dataset[dataset[subgroup]])
        }
        record[SUBGROUP_AUC] = compute_subgroup_auc(dataset, subgroup, label_col, model)
        record[BPSN_AUC] = compute_bpsn_auc(dataset, subgroup, label_col, model)
        record[BNSP_AUC] = compute_bnsp_auc(dataset, subgroup, label_col, model)
        records.append(record)
    return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)
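To make the three metrics concrete, here is a minimal sketch on a made-up eight-row dataset. It does not use the notebook's data or helpers: `pairwise_auc` is a hypothetical, simplified stand-in for `metrics.roc_auc_score` (the fraction of positive/negative pairs that are ranked correctly), and the `group`, `target` and `score` values are invented for illustration.

```python
def pairwise_auc(rows):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Simplified stand-in for roc_auc_score.
    """
    pos = [r["score"] for r in rows if r["target"]]
    neg = [r["score"] for r in rows if not r["target"]]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 'group' marks subgroup membership. All values are made up.
toy = [
    {"group": True,  "target": True,  "score": 0.9},
    {"group": True,  "target": True,  "score": 0.8},
    {"group": True,  "target": False, "score": 0.7},  # non-toxic, scored high
    {"group": True,  "target": False, "score": 0.2},
    {"group": False, "target": True,  "score": 0.6},
    {"group": False, "target": True,  "score": 0.5},
    {"group": False, "target": False, "score": 0.3},
    {"group": False, "target": False, "score": 0.1},
]

# Subgroup AUC: only rows inside the subgroup.
subgroup_rows = [r for r in toy if r["group"]]

# BPSN: background positives plus subgroup negatives.
bpsn_rows = [r for r in toy
             if (r["group"] and not r["target"]) or (not r["group"] and r["target"])]

# BNSP: background negatives plus subgroup positives.
bnsp_rows = [r for r in toy
             if (r["group"] and r["target"]) or (not r["group"] and not r["target"])]

subgroup_auc = pairwise_auc(subgroup_rows)
bpsn_auc = pairwise_auc(bpsn_rows)
bnsp_auc = pairwise_auc(bnsp_rows)
print(subgroup_auc, bpsn_auc, bnsp_auc)
```

In this toy case the subgroup AUC and BNSP AUC are perfect, but the BPSN AUC is only 0.5: one non-toxic subgroup comment is scored above both toxic background comments, which is the kind of false-positive bias the BPSN metric is meant to expose.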


# 9. Calculate the final score
Code taken from the benchmark kernel.

In [150]:
def calculate_overall_auc(df, model_name):
    """Calculate the overall AUC."""
    true_labels = df['target']
    predicted_scores = df[model_name]
    return metrics.roc_auc_score(true_labels, predicted_scores)
    
    

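def power_mean(series, p):
    """Generalized mean of `series` with exponent `p`; a negative `p` weights low values more heavily."""
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)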

def get_final_metric(bias_df, overall_auc, POWER=-5, OVERALL_MODEL_WEIGHT=0.25):
    """ Compute the final metric"""
    print(power_mean(bias_df[SUBGROUP_AUC], POWER))
    print(power_mean(bias_df[BPSN_AUC], POWER))
    print(power_mean(bias_df[BNSP_AUC], POWER))
    bias_score = np.average([
        power_mean(bias_df[SUBGROUP_AUC], POWER),
        power_mean(bias_df[BPSN_AUC], POWER),
        power_mean(bias_df[BNSP_AUC], POWER)
    ])
    print(OVERALL_MODEL_WEIGHT)
    print(overall_auc)
    print(bias_score)
    return (OVERALL_MODEL_WEIGHT * overall_auc) + ((1 - OVERALL_MODEL_WEIGHT) * bias_score)
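The effect of the negative exponent is easiest to see on a small example. The sketch below re-implements `power_mean` in plain Python and applies it to four hypothetical per-subgroup AUCs (made-up numbers, not results from this notebook), comparing the ordinary average (`p = 1`) with the `p = -5` used in `get_final_metric`.

```python
def power_mean(values, p):
    """Generalized (power) mean, mirroring the notebook's definition."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

# Hypothetical per-subgroup AUCs: three good subgroups and one weak one.
aucs = [0.95, 0.95, 0.95, 0.70]

arithmetic = power_mean(aucs, 1)   # ordinary average
penalised = power_mean(aucs, -5)   # exponent used in get_final_metric

print(arithmetic, penalised)
```

With `p = -5` the result falls below the arithmetic mean and moves toward the weakest subgroup, so a model cannot hide one badly served identity group behind several well served ones.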
    


In [156]:
bias_metrics_df2 = compute_bias_metrics_for_model(cv_model_data, identity_attribute, MODEL_NAME, 'target')
# Keep the first 22 rows; rows are sorted by subgroup AUC in ascending order, with NaN values last
bias_metrics_df2_mod = bias_metrics_df2[0:22]
final_metric = get_final_metric(bias_metrics_df2_mod, calculate_overall_auc(cv_model_data, MODEL_NAME))
print("final_metric: ", final_metric)

0.8682228932796209
0.8792234594705047
0.9579244672201932
0.25
0.9585494302688679
0.9017902733234395
final_metric:  0.9159800625597966
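The printed values are, in order, the power means of the subgroup, BPSN and BNSP AUCs, the overall-model weight, the overall AUC, the bias score and the final metric. As a sanity check, the short sketch below recombines them by hand; the variable names are chosen here for illustration, while the numbers are copied from the output above.

```python
# Values copied from the printed output above.
subgroup_pm = 0.8682228932796209   # power mean of subgroup AUCs
bpsn_pm = 0.8792234594705047       # power mean of BPSN AUCs
bnsp_pm = 0.9579244672201932       # power mean of BNSP AUCs
overall_auc = 0.9585494302688679
weight = 0.25                      # OVERALL_MODEL_WEIGHT

# Bias score: plain average of the three power means.
bias_score = (subgroup_pm + bpsn_pm + bnsp_pm) / 3

# Final metric: weighted sum of the overall AUC and the bias score.
final_metric = weight * overall_auc + (1 - weight) * bias_score
print(bias_score, final_metric)
```

The subgroup AUC power mean is the lowest of the three components, so it contributes most to the gap between the overall AUC and the final metric.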


# 10. Prediction on test data

In [159]:
submission = pd.read_csv('../Data/sample_submission.csv', index_col='id')

# Convert the test comments to padded integer sequences
sequences_test = tokenizer.texts_to_sequences(test_data['comment_text'])
test_df = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

# Column 1 of the model output is used as the toxicity score
submission['prediction'] = model2.predict(test_df)[:, 1]
submission.to_csv('submission.csv')
submission.head()

id,prediction
7097320,0.000111
7097321,0.00837
7097322,0.034457
7097323,0.000234
7097324,0.000206
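Before prediction, the test comments are tokenized and padded to `MAX_SEQUENCE_LENGTH`. As a rough illustration of what the padding step does, the sketch below is a simplified pure-Python re-implementation, not the Keras code itself. The helper name `pad_pre` is hypothetical, and it assumes the Keras defaults of padding and truncating at the front, which the `pad_sequences` call above does not override.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate every sequence to exactly `maxlen` items.

    Mimics the default behaviour of Keras pad_sequences
    (padding='pre', truncating='pre'). Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# One short and one long token sequence (made-up token ids).
batch = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(batch)
```

The short sequence gains leading zeros and the long one loses its earliest tokens, so every row ends up with the same length and the end of each comment is preserved.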


# 11. Conclusions

The final metric score on the validation set is 0.9159800625597966.