# Jigsaw Unintended Bias in Toxicity Classification

<h1> 1. Business Problem</h1>

<h3> 1.1 Description:</h3>
    The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation.
    A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion.
    Their models are trained on comments that human raters have labeled for various toxic conversational attributes, with the goal of improving civility in online conversations.
    
   <h3>1.2 Context:</h3>
    This is a Kaggle competition: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview
    
   <h3>1.3 Data:</h3>
    train.csv, test.csv (Download from Kaggle)
    
   <h3>1.4 Output to be submitted:</h3>
    Predictions are submitted as submission.csv, in the format of the sample submission file (download from Kaggle)
            


<h1> 2. Machine Learning Problem Foundation</h1>

<h2>2.1 Data overview</h2>

Attribute information:
* comment_text: text of individual comments  

* target: toxicity label (to be predicted for the test data). A comment with target >= 0.5 is considered to be in the positive (toxic) class.  


Identity Attributes:
* male
* female
* transgender
* other_gender
* heterosexual
* homosexual_gay_or_lesbian
* bisexual
* other_sexual_orientation
* christian
* jewish
* muslim
* hindu
* buddhist
* atheist
* other_religion
* black
* white
* asian
* latino
* other_race_or_ethnicity
* physical_disability
* intellectual_or_learning_disability
* psychiatric_or_mental_illness
* other_disability


# 3. Load Libraries

In [1]:
#Loading Libraries

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from bs4 import BeautifulSoup

import contractions

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc,roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
import gc

import os
import re
from tqdm import tqdm

from keras.layers import Embedding
from keras.layers import Input
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.models import Model
from keras.models import load_model
from keras.callbacks import LearningRateScheduler

from keras_bert import load_trained_model_from_checkpoint, load_vocabulary
from keras_bert import Tokenizer
from keras_bert import AdamWarmup, calc_train_steps
from keras.losses import binary_crossentropy

Using TensorFlow backend.


# 4. Exploratory Data Analysis

<h2> 4.1 Load Data</h2>

In [2]:
tr_data=pd.read_csv("../Data/train.csv")
test_data=pd.read_csv("../Data/test.csv")
#submission = pd.read_csv('../Data/sample_submission.csv', index_col='id')

#tr_data=pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
#test_data=pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv")
#submission = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv', index_col='id')

In [3]:
print("Shape of tr_data",tr_data.shape)
print("Shape of test_data",test_data.shape)

tr_data.head()

Shape of tr_data (1804874, 45)
Shape of test_data (97320, 2)


Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


Data overview
Attribute information:
* comment_text: text of individual comments
* target: toxicity label (to be predicted for the test data). A comment with target >= 0.5 is considered to be in the positive (toxic) class.


When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways. 
The dataset annotates the following identity attributes:
asian, atheist, bisexual, black, buddhist, christian, female, heterosexual, hindu, homosexual_gay_or_lesbian, intellectual_or_learning_disability, jewish, latino, male, muslim, other_disability, other_gender, other_race_or_ethnicity, other_religion, other_sexual_orientation, physical_disability, psychiatric_or_mental_illness, transgender, white.
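
To make the effect described above concrete, here is a minimal sketch that compares the toxic fraction among comments mentioning an identity with the overall toxic fraction. The rows are made-up toy values, not taken from train.csv, and `toxic_rate` is a hypothetical helper introduced only for illustration.

```python
# Toy rows (made up for illustration): a toxicity target plus two identity scores.
# None stands for a comment whose identities were not annotated.
rows = [
    {"target": 0.9, "muslim": 1.0, "female": 0.0},
    {"target": 0.1, "muslim": 0.8, "female": 0.0},
    {"target": 0.7, "muslim": 0.6, "female": 0.0},
    {"target": 0.0, "muslim": 0.0, "female": 1.0},
    {"target": 0.2, "muslim": None, "female": None},
    {"target": 0.6, "muslim": None, "female": None},
]


def toxic_rate(rows, identity=None, threshold=0.5):
    """Fraction of toxic rows, optionally restricted to rows mentioning an identity."""
    if identity is not None:
        rows = [r for r in rows if (r[identity] or 0.0) >= threshold]
    if not rows:
        return float("nan")
    return sum(r["target"] >= threshold for r in rows) / len(rows)


overall = toxic_rate(rows)
muslim = toxic_rate(rows, "muslim")
female = toxic_rate(rows, "female")
print(overall, muslim, female)
```

In this toy sample the rate for one identity is above the overall rate. A model trained on such data could learn to treat the identity term itself as a signal of toxicity.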

In [4]:
# Create dataset with comment text, target and identity_attributes and drop other columns from train_data
identity_attribute=['male', 'female','homosexual_gay_or_lesbian', 'muslim', 'christian', 'jewish','white', 'black', 'psychiatric_or_mental_illness']


aux_columns = ['target', 'severe_toxicity','obscene','identity_attack','insult','threat','sexual_explicit']
data=tr_data[['comment_text']]
data1=tr_data[identity_attribute]
data2=tr_data[aux_columns]
data=pd.concat([data,data1,data2],axis=1)
data[29:34]

Unnamed: 0,comment_text,male,female,homosexual_gay_or_lesbian,muslim,christian,jewish,white,black,psychiatric_or_mental_illness,target,severe_toxicity,obscene,identity_attack,insult,threat,sexual_explicit
29,I applaud Civil's efforts to create some new t...,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30,Why bother writing a review if it's going to b...,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31,Yet call out all Muslims for the acts of a few...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.9125,0.05,0.2375,0.6125,0.8875,0.1125,0.0
32,"I love the idea of upvoting entire articles, t...",,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,Troll free since 2016.,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
x_data = data['comment_text']
y_aux_data = data[aux_columns].fillna(0).values

# Initialize sample weights. Each term below adds 1/4 to a comment's weight.
# (np.int was removed in NumPy 1.24; the built-in int is used instead.)

# Overall: every comment starts at 1/4
weights = np.ones((len(data),)) / 4
# Subgroup: the comment mentions at least one identity
weights += (data[identity_attribute].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int) / 4
# Background Positive, Subgroup Negative: toxic comment with at least one identity not mentioned
weights += (((data['target'].values >= 0.5).astype(bool).astype(int) +
            (data[identity_attribute].fillna(0).values < 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4
# Background Negative, Subgroup Positive: non-toxic comment that mentions an identity
weights += (((data['target'].values < 0.5).astype(bool).astype(int) +
             (data[identity_attribute].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4
#Normalize them
#weights/= weights.mean()
loss_weight = 1.0 / weights.mean()
y_data=data['target'].values

#y = np.vstack([(data['target'].values >= 0.5).astype(int), weights]).T

#y_aux = np.hstack([y, y_aux_data])
# Convert the identity columns and the target to booleans (NaN becomes False)
for column in identity_attribute +  ['target'] : 
    data[column] = np.where(data[column] >=0.5, True, False)
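
The weighting scheme is easier to follow on a handful of rows. The sketch below applies the same four terms to a made-up toy array with two identity columns; the variable names (`any_identity`, `any_non_identity`) are introduced here for readability and do not appear in the notebook.

```python
import numpy as np

# Toy data (made up): two identity columns and a toxicity target; NaN = not annotated.
identity = np.array([
    [np.nan, np.nan],  # no identity annotation, non-toxic
    [np.nan, np.nan],  # no identity annotation, toxic
    [1.0, 0.0],        # mentions an identity, non-toxic
    [0.0, 0.8],        # mentions an identity, toxic
])
target = np.array([0.0, 0.9, 0.2, 0.7])

ident = np.where(np.isnan(identity), 0.0, identity)  # same effect as fillna(0)
toxic = target >= 0.5
any_identity = (ident >= 0.5).any(axis=1)      # at least one identity mentioned
any_non_identity = (ident < 0.5).any(axis=1)   # at least one identity not mentioned

weights = np.full(len(target), 0.25)           # overall term
weights += any_identity / 4                    # subgroup term
weights += (toxic & any_non_identity) / 4      # background positive, subgroup negative
weights += (~toxic & any_identity) / 4         # background negative, subgroup positive
print(weights)
```

With these rows a plain non-toxic comment gets 0.25 and a plain toxic comment 0.5, which matches the values printed for `y_data` in the notebook. Comments that mention an identity end up with 0.75.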

In [6]:
data.head()

Unnamed: 0,comment_text,male,female,homosexual_gay_or_lesbian,muslim,christian,jewish,white,black,psychiatric_or_mental_illness,target,severe_toxicity,obscene,identity_attack,insult,threat,sexual_explicit
0,"This is so cool. It's like, 'would you want yo...",False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,0.0,0.0,0.0
1,Thank you!! This would make my life a lot less...,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,0.0,0.0,0.0
2,This is such an urgent design problem; kudos t...,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,0.0,0.0,0.0
3,Is this something I'll be able to install on m...,False,False,False,False,False,False,False,False,False,False,0.0,0.0,0.0,0.0,0.0,0.0
4,haha you guys are a bunch of losers.,False,False,False,False,False,False,False,False,False,True,0.021277,0.0,0.021277,0.87234,0.0,0.0


In [7]:
#y_data=np.where(data['target']>= 0.5, 1, 0)
# Pack the binary label (column 0) and the sample weight (column 1) into one array
y_data= np.vstack([(data['target'].values>=0.5).astype(int),weights]).T
y_data


array([[0.  , 0.25],
       [0.  , 0.25],
       [0.  , 0.25],
       ...,
       [0.  , 0.25],
       [1.  , 0.5 ],
       [0.  , 0.25]])
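
One plausible use of a two-column target like this is a custom loss that reads the label from the first column and the per-sample weight from the second. This part of the notebook does not show such a loss, so the function below is an assumption: a NumPy sketch of a weighted binary cross-entropy, with `weighted_bce` as a hypothetical name.

```python
import numpy as np


def weighted_bce(y_packed, y_pred, eps=1e-7):
    """Mean binary cross-entropy, scaled per sample by the weight in column 1."""
    labels, weights = y_packed[:, 0], y_packed[:, 1]
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.mean(per_sample * weights))


y_packed = np.array([[0.0, 0.25], [1.0, 0.5], [0.0, 0.25]])
y_pred = np.array([0.5, 0.5, 0.5])
loss = weighted_bce(y_packed, y_pred)
print(loss)
```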

<h2>4.2 Data Preprocessing</h2>

Text Preprocessing

In the preprocessing phase we apply the following steps, in this order:

1. Remove the HTML tags.
2. Remove punctuation and a limited set of special characters such as , . and #.
3. Check that the word consists only of English letters and is not alphanumeric.
4. Check that the word is longer than 2 characters (reportedly, there are no two-letter adjectives).
5. Convert the word to lowercase.
6. Remove stopwords.
7. Expand contractions.
8. Finally, apply Snowball stemming to the word (it was observed to perform better than Porter stemming).<br>
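
The first six steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: a regex stands in for BeautifulSoup, `TINY_STOPWORDS` is a hypothetical three-word subset, and contraction expansion and stemming (steps 7 and 8) are left out.

```python
import re

TINY_STOPWORDS = {"the", "this", "was"}  # hypothetical subset, for illustration only


def simple_clean(text, stopwords=TINY_STOPWORDS):
    text = re.sub(r"<[^>]+>", " ", text)         # 1. drop HTML tags (regex stand-in)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. replace punctuation and symbols
    tokens = []
    for word in text.split():
        if not word.isalpha():                   # 3. skip tokens containing digits
            continue
        if len(word) <= 2:                       # 4. skip one- and two-letter words
            continue
        word = word.lower()                      # 5. lowercase
        if word in stopwords:                    # 6. drop stopwords
            continue
        tokens.append(word)
    return " ".join(tokens)


cleaned = simple_clean("<br/>This movie was GREAT, it is 10x better!!")
print(cleaned)
```

In a full pipeline, contractions would normally be expanded before punctuation is removed, because removing the apostrophe first turns "don't" into "don t".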



In [8]:
#Preprocessing steps to delete and isolate unwanted symbols
symbols_to_delete = '→★©®●ː☆¶）иʿ。ﬂﬁ₁♭年▪←ʒ、（月■⇌ɹˤ³の¤‿عدويهصقناخلىبمغرʀɴשלוםביエンᴵאעכח‐ικξتحكسةفزط‑地谷улкноה歌мυтэпрдˢᵒʳʸᴺʷᵗʰᵉᵘοςתמדףנרךצט成都ех小土》करमा英文レクサス外国人бьыгя不つзц会下有的加大子ツشءʲшчюж戦щ明קљћ我出生天一家新ʁսհןجі‒公美阿ספ白マルハニチロ社ζ和中法本士相信政治堂版っфچیリ事「」シχψմեայինրւդک《ლさようならعدويهصقناخلىبمغرʀɴשלוםביエンᴵאעכח‐ικξتحكسةفزط‑地谷улкноה歌мυтэпрдˢᵒʳʸᴺʷᵗʰᵉᵘοςתמדףנרךצט成都ех小土》करमा英文レクサス外国人бьыгя不つзц会下有的加大子ツشءʲшчюж戦щ明קљћ我出生天一家新ʁսհןجі‒公美阿ספ白マルハニチロ社ζ和中法本士相信政治堂版っфچیリ事「」シχψմեայինրւդک《ლさようなら\n＼🍕\r🐵😑\xa0\ue014≠\t\uf818\uf04a\xad😢🐶❤️☺\uf0e0😜😎👊\u200b\u200e😁أ😍💖̶💵❥━┣┫Е┗Ｏ►👎😀😂\u202a\u202c🔥😄🏻💥ᴍʏᴇᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ✔\x96\x92😋👏😱‼\x81ジ故障➤\u2009🚌͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘☕♡◐║▬💩💯⛽🚄🏼ஜ۩۞😖ᴠ🚲✒➥😟😈═ˌ💪🙏🎯◄🌹😇💔😡\x7f👌ἐὶήὲἀίῃἴ🙄✬ＳＵＰＥＲＨＩＴ😠\ufeff☻\u2028😉😤⛺♍🙂\u3000👮💙😏🍾🎉😞\u2008🏾😅😭👻😥😔😓🏽🎆✓◾🍻🍽🎶🌺🤔😪\x08؟🐰🐇🐱🙆😨⬅🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚獄℅ВПАН🐾🐕❣😆🔗🚽舞伎🙈😴🏿🤗🇺🇸♫ѕＣＭ⤵🏆🎃😩█▓▒░\u200a🌠🐟💫💰💎\x95🖐🙅⛲🍰⭐🤐👆🙌\u2002💛🙁👀🙊🙉\u2004❧▰▔ᴼᴷ◞▀\x13🚬▂▃▄▅▆▇↙🤓\ue602😵άόέὸ̄😒͝☹➡🆕👅👥👄🔄🔤👉👤👶👲🔛🎓\uf0b7✋\uf04c\x9f\x10😣⏺̲̅😌🤑́🌏😯😲∙‛Ἰᾶὁ💞🚓◇🔔📚✏🏀👐\u202d💤🍇\ue613豆🏡▷❔❓⁉❗\u202f👠्🇹🇼🌸蔡🌞˚🎲😛˙关系С💋💀🎄💜🤢َِ✨是\x80\x9c\x9d🗑\u2005💃📣👿༼◕༽😰ḷЗ▱￼🤣卖温哥华议降％你失去所钱拿坏税骗🐝¯🎅\x85🍺آإ🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003🚀🤴⌠ИОРФДЯМ✘😝🖑ὐύύ特殊作群╪💨圆园▶ℐ☭✭🏈😺♪🌍⏏ệ🍔🐮🍁☔🍆🍑🌮🌯☠🤦\u200d♂𝓒𝓲𝓿𝓵안영하세요ЖК🍀😫🤤ῦ在了可以说普通话汉语好极🎼🕺☃🍸🥂🗽🎇🎊🆘☎🤠👩✈🖒✌✰❆☙🚪⚲\u2006⚭⚆⬭⬯⏖○‣⚓∎ℒ▙☏⅛✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看𝗪𝗵𝗲𝗻𝘆𝗼𝘂𝗿𝗮𝗹𝗶𝘇𝗯𝘁𝗰𝘀𝘅𝗽𝘄𝗱📺ｃϖ\u2000үａᴦᎥһͺ\u2007ｓǀ\u2001ɩ℮ｙｅ൦ｌƽ¸ｗｈ𝐓𝐡𝐞𝐫𝐮𝐝𝐚𝐃𝐜𝐩𝐭𝐢𝐨𝐧Ƅᴨᑯ໐ΤᏧ௦Іᴑ܁𝐬𝐰𝐲𝐛𝐦𝐯𝐑𝐙𝐣𝐇𝐂𝐘𝟎ԜТᗞ౦〔Ꭻ𝐳𝐔𝐱𝟔𝟓𝐅🐋∼ﬃ💘💓ё𝘥𝘯𝘶💐🌋🌄🌅𝙬𝙖𝙨𝙤𝙣𝙡𝙮𝙘𝙠𝙚𝙙𝙜𝙧𝙥𝙩𝙪𝙗𝙞𝙝𝙛👺🐷ℋℳ𝐀𝐥𝐪❄🚶𝙢Ἱ🤘ͦ💸☼패티Ｗ⋆𝙇ᵻ👂👃ɜ🎫\uf0a7БУ🚢🚂ગુજરાતીῆ🏃𝓬𝓻𝓴𝓮𝓽𝓼☘﴾͡๏̯﴿⚾⚽Φ₽\ue807𝑻𝒆𝒍𝒕𝒉𝒓𝒖𝒂𝒏𝒅𝒔𝒎𝒗𝒊👽😙\u200cЛ🎾👹￦⎌🏒⛸寓养宠物吗🏄🐀🚑🤷操𝒑𝒚𝒐𝑴🤙🐒℃欢迎来到拉斯𝙫⏩☮🐈𝒌𝙊𝙭𝙆𝙋𝙍𝘼𝙅ﷻ⚠🦄巨收赢得鬼愤怒要买额ẽ🚗✊🐳𝟏𝐟𝟖𝟑𝟕𝒄𝟗𝐠𝙄𝙃👇锟斤拷❌⭕▸𝗢𝟳𝟱𝟬⦁株式⛷한국어ㄸㅓ니͜ʖ𝘿𝙔₵𝒩ℯ𝒾𝓁𝒶𝓉𝓇𝓊𝓃𝓈𝓅ℴ𝒻𝒽𝓀𝓌𝒸𝓎𝙏𝙟𝘃𝗺𝟮𝟭𝟯𝟲👋🦊☐☑多伦⚡☄ǫ🐽🎻🎹⛓🏹╭╮🍷🦆为友谊祝贺与其想象对如直接问用自己猜传教没积唯认识基督徒曾经让耶稣复活死怪他但当们聊些题时候例战胜因圣把全结婚孩恐惧且栗谓这样还♾🎸🤕🤒⛑🎁批判检讨🏝🦁＞ʕ̣Δ🙋😶쥐스탱트뤼도석유가격인상이경제황을렵게만들지않록잘관리해야합다캐나에서대마초와화약금의품런성분갈때는반드시허된사용✞🔫👁┈╱╲▏▕┃╰▊▋╯┳┊☒凸ὰ💲🗯𝙈Ἄ𝒇𝒈𝒘𝒃𝑬𝑶𝕾𝖙𝖗𝖆𝖎𝖌𝖍𝖕𝖊𝖔𝖑𝖉𝖓𝖐𝖜𝖞𝖚𝖇𝕿𝖘𝖄𝖛𝖒𝖋𝖂𝕴𝖟𝖈𝕸👑🚿☝💡知彼百\uf005𝙀𝒛𝑲𝑳𝑾𝒋𝟒😦𝙒𝘾𝘽🏐𝘩𝘨ὼṑ✅☛𝑱𝑹𝑫𝑵𝑪🇰🇵👾ᓇᒧᔭᐃᐧᐦᑳᐨᓃᓂᑲᐸᑭᑎᓀᐣ🐄🎈🔨♩🐎🤞☞🐸💟🎰🌝🛳点击查🍭𝑥𝑦𝑧ＡＮＧＪＢ👣\uf020◔◡🏉💭🎥♀Ξ🐴👨🤳⬆🦍\x0b🍩𝑯𝒒😗𝟐🏂👳🍗🕉🐲̱ℏ𝑮𝗕𝗴\x91🍒⠀ꜥⲣⲏ╚🐑⏰↺⇤∏鉄件✾◦♬ї💊\uf203\uf09a\uf222\ue608\uf202\uf099\uf469\ue607\uf410\ue600燻製虚偽屁理屈｜Г𝑩𝑰𝒀𝑺🌤∵∴𝗳𝗜𝗙𝗦𝗧🍊ὺἈἡῖΛΩ⤏🇳𝒙Ձռձ冬至ὀ𝒁🔹🤚🍎𝑷🐂💅𝘬𝘱𝘸𝘷𝘐𝘭𝘓𝘖𝘹𝘲𝘫☜Βώ💢▲ΜΟΝΑ
Ε🇱♲𝝈↴↳💒⊘▫Ȼ⬇🚴🖕🖤🥘📍👈➕🚫🎨🌑🐻𝐎𝐍𝐊𝑭🤖🎎✧😼🕷ｇｏｖｒｎｍｔｉｄｕ２０８ｆｂ＇ｋ𝟰🇴🇭🇻🇲𝗞𝗭𝗘𝗤👼📉🍟🍦∕🌈🔭🐊🐍\uf10aˆ⚜☁ڡ🐦\U0001f92f\U0001f92a🐡💳ἱ🙇𝗸𝗟𝗠𝗷🥜🔼'
symbols_to_isolate = '.,?!-;*"…:—()%#$&_/@・ω+=”“[]^–>\\°<~•™ˈʊɒ∞§{}·ταɡ|¢`―ɪ£♥´¹≈÷′ɔ€†μ½ʻπδηλσερνʃ±µº¾．»ав⋅¿¬β⇒›¡₂₃γ″«φ⅓„：¥сɑ！−²ʌ¼⁴⁄₄‚‖⊂⅔¨×θ？∩，ɐ₀≥↑↓／√－‰≤'
isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}
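
Both dictionaries are translation tables for `str.translate`, which maps Unicode code points to replacement strings in a single pass. A minimal sketch with two tiny, hypothetical character sets (the notebook's real sets are far larger):

```python
# Small hypothetical character sets, chosen only for this example.
to_delete = "★\u200b"   # a decorative symbol and a zero-width space
to_isolate = ".,!?"

remove_map = {ord(c): "" for c in to_delete}
isolate_map = {ord(c): f" {c} " for c in to_isolate}

text = "Great★ idea!Well\u200b done."
step1 = text.translate(remove_map)    # unwanted symbols disappear
step2 = step1.translate(isolate_map)  # punctuation gets spaces on both sides
tokens = step2.split()
print(tokens)
```

Isolating punctuation this way lets a whitespace tokenizer treat "!" and "." as separate tokens instead of gluing them to the neighbouring word.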
    

In [9]:
#Contraction map to decontract the phrases
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}


In [10]:
#Code ref: https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html
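def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    """Expand contractions in text, e.g. "can't" -> "cannot"."""

    # Try longer keys first so that "can't've" is not matched as "can't" followed by "'ve".
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact match first, then fall back to its lowercase form.
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())
        # Keep the original casing of the first letter; keys that start with an
        # apostrophe (such as "'cause") are returned unchanged.
        if first_char.isalpha():
            expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes that remain after expansion.
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text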
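
Regex alternation tries its alternatives from left to right, so the order of the keys in the pattern matters for contractions that share a prefix. The sketch below uses a three-entry map and two hypothetical helpers (`build_pattern`, `expand`) to compare dictionary order with longest-key-first order.

```python
import re

small_map = {"can't": "cannot", "can't've": "cannot have", "it's": "it is"}


def build_pattern(keys):
    return re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)


def expand(text, mapping, pattern):
    def repl(m):
        matched = m.group(0)
        expanded = mapping.get(matched) or mapping[matched.lower()]
        return matched[0] + expanded[1:]  # keep the casing of the first letter
    return pattern.sub(repl, text)


unsorted_pattern = build_pattern(small_map)                               # dict order
sorted_pattern = build_pattern(sorted(small_map, key=len, reverse=True))  # longest first

sentence = "It's odd, she can't've known."
naive = expand(sentence, small_map, unsorted_pattern)
fixed = expand(sentence, small_map, sorted_pattern)
print(naive)
print(fixed)
```

With dictionary order, "can't've" is consumed as "can't" and the trailing "'ve" is left behind. Sorting the keys by length avoids this.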



In [11]:
#English stopwords to be removed (note: this set shadows the `stopwords` name imported from nltk.corpus above)
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [12]:
# Implementing all pre-processing steps
preprocessed_data = []
stemmer = SnowballStemmer("english")
#nltk.download('wordnet')
lemmatizer = WordNetLemmatizer() 

if os.path.exists('preprocessed_data.npy'):
    print("Loading preprocessed data")
    preprocessed_data = np.load('preprocessed_data.npy')
else:
    for sentance in (data['comment_text'].values):
        sentance = re.sub(r"http\S+", "", sentance)  # Remove http/https URLs
        #sentance = expand_contractions(sentance) # expand short forms like n't to not
        sentance = BeautifulSoup(sentance, 'lxml').get_text() # Remove HTML/XML tags
        #sentance = contractions.fix(sentance)
        sentance = sentance.translate(remove_dict)   # delete unwanted symbols
        sentance = sentance.translate(isolate_dict)  # surround punctuation with spaces

        sentance = re.sub(r"\S*\d\S*", "", sentance).strip() # Remove words with numbers
        sentance = re.sub(r'[^A-Za-z]+', ' ', sentance) # Replace non-letter characters with a space
        # Note: stem() and lemmatize() receive the whole comment string rather than
        # individual tokens, so most words are left unstemmed.
        sentance = stemmer.stem(sentance)
        sentance = lemmatizer.lemmatize(sentance)
        preprocessed_data.append(' '.join(token.lower() for token in nltk.word_tokenize(sentance) ))
    #np.save('preprocessed_data', preprocessed_data)


In [13]:
# Print some sample values of text before and after pre-processing

print(data['comment_text'].values[0])
print('-'*50)
print(preprocessed_data[0])
print('='*50)

print(data['comment_text'].values[100])
print('-'*50)
print(preprocessed_data[100])
print('='*50)

print(data['comment_text'].values[456])
print('-'*50)
print(preprocessed_data[456])
print('='*50)

print(data['comment_text'].values[10000])
print('-'*50)
print(preprocessed_data[10000])
print('='*50)

This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!
--------------------------------------------------
this is so cool it s like would you want your mother to read this really great idea well done
Loving this collection. Cant wait till Season 2 is released. Should be any day now according to http://yeezy-season2.com/
--------------------------------------------------
loving this collection cant wait till season is released should be any day now according to
The preserve was sold big time to the public, the only version available to the public before the ballot was the version with the preserve. Yes it was deleted from the ballot language, but that is a pull the wool over the eyes trick that gives Metro the excuse to not be accountable. But they and the zoo are accountable for their statements -- which were unequivacally that there would be an offiste preserve for the elephants.  This just proves that we cannot trust our government to te

# 5. Data Splitting

In [14]:
final=data
final['preprocessed_data']=preprocessed_data

In [15]:
# Splitting data indexes

tr_ind, val_ind = train_test_split(list(range(len(final))) ,test_size = 0.33, random_state = 23)
print(len(tr_ind), len(val_ind))

1209265 595609
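
Splitting a list of row indices, rather than the data itself, means the same two index lists can later be used to slice the texts, the labels and the weights consistently. The sketch below is a standard-library analogue, not scikit-learn's implementation (its shuffling differs). `split_indices` is a hypothetical helper; it rounds the validation size up, which reproduces the 595609 validation rows printed above.

```python
import math
import random


def split_indices(n, test_size, seed):
    """Shuffle 0..n-1 and cut off ceil(n * test_size) indices for validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n * test_size)
    return indices[n_val:], indices[:n_val]


tr, val = split_indices(10, 0.33, seed=23)

# The same index list selects matching rows from any parallel collection.
features = ["c%d" % i for i in range(10)]
labels = list(range(10))
tr_features = [features[i] for i in tr]
tr_labels = [labels[i] for i in tr]
print(len(tr), len(val))
```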


<h1> 5. Featurization: pretrained word embeddings (GloVe)</h1>

In [16]:
## Code adapted from (https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py)
# Vectorize the text 

# Note: this Tokenizer replaces the keras_bert Tokenizer imported in section 3.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

#MAX_NUM_WORDS = 100000
MAX_SEQUENCE_LENGTH = 220
#tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(final['preprocessed_data'])
sequences = tokenizer.texts_to_sequences(final['preprocessed_data'][tr_ind])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

train_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) # pad or truncate every comment to MAX_SEQUENCE_LENGTH tokens
#train_labels = to_categorical(np.asarray(y_train))

cv_data=pad_sequences(tokenizer.texts_to_sequences(final['preprocessed_data'][val_ind]), maxlen=MAX_SEQUENCE_LENGTH)
#cv_labels=to_categorical(np.asarray(y_cv))

print(train_data.shape)

print(cv_data.shape)


Found 309240 unique tokens.
(1209265, 220)
(595609, 220)
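
To show what these shapes contain, here is a pure-Python sketch of the two steps. It assumes the default behaviour of the Keras `Tokenizer` and `pad_sequences` (indices assigned by descending word frequency starting at 1, zero-padding and truncation both applied at the front). The helper names are hypothetical and this is not the library code.

```python
from collections import Counter


def fit_word_index(texts):
    counts = Counter(word for t in texts for word in t.split())
    # The most frequent word gets index 1; index 0 is reserved for padding.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}


def to_sequence(text, word_index):
    return [word_index[w] for w in text.split() if w in word_index]


def pad(seq, maxlen):
    seq = seq[-maxlen:]                     # keep only the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq  # left-pad with zeros


texts = ["good idea", "bad idea really bad idea"]
word_index = fit_word_index(texts)
padded = [pad(to_sequence(t, word_index), maxlen=4) for t in texts]
print(word_index)
print(padded)
```

Every row has the same length afterwards, which is why the arrays above have exactly `MAX_SEQUENCE_LENGTH` = 220 columns.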


In [17]:
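#tokenizing test data
# Note: the raw test comments are tokenized here, whereas the training comments
# went through the preprocessing of section 4.2 first.
sequences_test= tokenizer.texts_to_sequences(test_data['comment_text'])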

test_data = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

In [18]:
# use glove.840B.300d.txt for word embeddings
#GloVe is an unsupervised learning algorithm for obtaining vector representations for words. 
#Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

embeddings_index_glove = {}
embeddings_index_crawl = {}

#EMBEDDING_FILE="../input/glove840b300dtxt/glove.840B.300d.txt"
#EMBEDDING_FILE_CRAWL='../data/crawl-300d-2M.vec'
EMBEDDING_FILE_GLOVE='../data/glove.840B.300d.txt'
    
EMBEDDINGS_DIMENSION=300
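# Each line holds a token followed by EMBEDDINGS_DIMENSION floats. Splitting from the
# right keeps tokens that themselves contain spaces intact.
with open(EMBEDDING_FILE_GLOVE,encoding="utf-8") as f:
    for line in f:
        values = line.rstrip('\n').rsplit(' ', EMBEDDINGS_DIMENSION)
        word = values[0] ## The first entry is the word
        coefs = np.asarray(values[1:], dtype='float32') ## These are the vectors representing the embedding for the word
        embeddings_index_glove[word] = coefs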


#f = open(EMBEDDING_FILE_CRAWL,encoding="utf-8") 
#for line in f:
 #   values = line.split(' ')
  #  word = values[0] ## The first entry is the word
   # coefs = np.asarray(values[1:]) ## These are the vecotrs representing the embedding for the word
    #embeddings_index_crawl[word] = coefs
#f.close()



print('data loaded')

data loaded
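
The file format is plain text: one token per line followed by its vector components, separated by single spaces. The sketch below parses a two-line, made-up stand-in for the file, held in memory, with a toy dimension of 3. `load_vectors` is a hypothetical helper; it splits from the right so that a token containing spaces stays in one piece.

```python
import io

import numpy as np

DIM = 3  # toy dimension; the notebook uses 300
fake_file = io.StringIO(
    "happy 0.1 0.2 0.3\n"
    ". . . 0.4 0.5 0.6\n"  # a token that itself contains spaces
)


def load_vectors(handle, dim):
    index = {}
    for line in handle:
        parts = line.rstrip("\n").rsplit(" ", dim)  # split off the last dim fields
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index


vectors = load_vectors(fake_file, DIM)
print(sorted(vectors))
```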


In [19]:
embeddings_index_glove["happy"][:300]



array([ 0.036775 ,  0.40917  , -0.52141  , -0.067184 ,  0.087702 ,
       -0.048564 ,  0.40947  , -0.42818  ,  0.19304  ,  2.3925   ,
       -0.11441  , -0.22952  , -0.16061  ,  0.035533 , -0.53179  ,
        0.19764  , -0.48827  ,  0.57439  , -0.064301 ,  0.47053  ,
       -0.29647  , -0.15927  , -0.052798 ,  0.10121  , -0.054461 ,
        0.036129 , -0.16118  , -0.34139  ,  0.45834  , -0.20144  ,
       -0.29067  , -0.51888  , -0.062106 ,  0.14084  ,  0.016413 ,
        0.050826 ,  0.13243  , -0.033663 , -0.42228  , -0.30086  ,
        0.06202  ,  0.26338  ,  0.077223 ,  0.27307  ,  0.13392  ,
        0.30183  , -0.16546  ,  0.057011 , -0.0034585, -0.071113 ,
       -0.27287  , -0.10297  ,  0.07457  , -0.32104  ,  0.36696  ,
        0.27051  , -0.15776  ,  0.2978   , -0.18988  ,  0.097477 ,
        0.035665 , -0.49749  , -0.52759  , -0.046148 ,  0.021715 ,
       -0.11047  , -0.18007  ,  0.20295  ,  0.15254  , -0.045976 ,
       -0.21846  , -0.066865 , -0.21355  ,  0.017509 ,  0.6647

In [20]:
# prepare embedding matrix for all unique tokens in preprocessed text
from keras.layers import Embedding
from keras.initializers import Constant

# My system cannot handle the 600-dimensional concatenated embeddings (GloVe + Crawl), so only the GloVe embeddings are used.

## EMBEDDINGS_DIMENSION =  ## dimension of glove vectors
embedding_matrix_glove = np.zeros((len(tokenizer.word_index) + 1,
                                 EMBEDDINGS_DIMENSION))
num_words_in_embedding = 0
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index_glove.get(word)
    if embedding_vector is not None:
        num_words_in_embedding += 1
        embedding_matrix_glove[i] = embedding_vector
    # words not found in the embedding index keep their all-zero row.


print(embedding_matrix_glove.shape)




## EMBEDDINGS_DIMENSION =  ## dimension of glove vectors
#embedding_matrix_crawl = np.zeros((len(tokenizer.word_index)+1 ,
#                                 EMBEDDINGS_DIMENSION))
#num_words_in_embedding = 0
#for word, i in tokenizer.word_index.items():
#    embedding_vector = embeddings_index_crawl.get(word)
#    if embedding_vector is not None:
#        num_words_in_embedding += 1
#        # words not found in embedding index will be all-zeros.
#        embedding_matrix_crawl[i] = embedding_vector[:300]


#print(embedding_matrix_crawl.shape)



(309241, 300)
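
The loop above counts matched words in `num_words_in_embedding` but never reports the count. A minimal sketch of how coverage could be checked, using a toy vocabulary and toy 3-dimensional vectors (both are assumptions for illustration, standing in for `tokenizer.word_index` and `embeddings_index_glove`):

```python
import numpy as np

# Toy stand-ins (hypothetical data, not the real tokenizer or GloVe index).
word_index = {"the": 1, "cat": 2, "zzzunknown": 3}
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.4, 0.5, 0.6]),
}
dim = 3

# Index 0 is not used by word_index, hence the +1 on the row count.
embedding_matrix = np.zeros((len(word_index) + 1, dim))
found = 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        found += 1

# Share of vocabulary words that received a pretrained vector.
coverage = found / len(word_index)
print(embedding_matrix.shape, f"coverage = {coverage:.2%}")
```

Low coverage would suggest that the preprocessing produces tokens the pretrained vocabulary does not contain.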


In [21]:
#embedding_matrix = np.concatenate([embedding_matrix_glove,embedding_matrix_crawl], axis =-1)
embedding_matrix = embedding_matrix_glove
print(embedding_matrix.shape)
#del embedding_matrix_crawl
del embedding_matrix_glove
import gc
gc.collect()

(309241, 300)


67

## Adding an attention layer
Source: https://www.kaggle.com/christofhenkel/keras-baseline-lstm-attention-5-fold

In [22]:
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim
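
To make the computation in `call` easier to follow, here is a NumPy sketch of the same steps: score each time step with `tanh(x·W + b)`, exponentiate, normalise over the time axis, and take the weighted sum of the inputs. The function name `attention_pool` and the random inputs are assumptions for illustration; masking is left out.

```python
import numpy as np

def attention_pool(x, W, b, eps=1e-7):
    """x: (batch, steps, features), W: (features,), b: (steps,)."""
    e = np.tanh(x @ W + b)                        # (batch, steps) scores
    a = np.exp(e)
    a = a / (a.sum(axis=1, keepdims=True) + eps)  # normalise over time steps
    return (x * a[..., None]).sum(axis=1)         # (batch, features)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))   # 2 sequences, 5 steps, 4 features
W = rng.normal(size=4)
b = np.zeros(5)

out = attention_pool(x, W, b)
print(out.shape)
```

With `W` set to all zeros every step gets the same weight, so the output reduces to the mean over time steps, which is a quick sanity check for the normalisation.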

# 6. Define and train an LSTM neural net to classify toxic comments

In [23]:
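from keras.losses import binary_crossentropy

def custom_loss(y_true, y_pred):
    # Column 0 of y_true is the label; column 1 scales each sample's loss.
    return binary_crossentropy(K.reshape(y_true[:,0],(-1,1)), y_pred) * y_true[:,1]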
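
A plain-Python sketch of what this loss computes per sample, assuming (as the indexing above implies) that the second column of `y_true` holds a per-sample weight. The helper name `weighted_bce` and the toy values are hypothetical.

```python
import math

def weighted_bce(y_true, y_pred, eps=1e-7):
    """y_true: list of (label, weight) pairs; y_pred: list of probabilities."""
    losses = []
    for (label, weight), p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        bce = -(label * math.log(p) + (1 - label) * math.log(1 - p))
        losses.append(weight * bce)    # the weight scales this sample's loss
    return losses

y_true = [(1.0, 1.0), (0.0, 2.0)]  # (label, weight)
y_pred = [0.9, 0.9]
print(weighted_bce(y_true, y_pred))
```

The second sample is a confident wrong prediction with weight 2, so it contributes far more to the total than the first.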

## Model Architecture

In [24]:
# Model Architecture
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D

def build_model(embedding_matrix, num_aux_targets,loss_weight):
    """ Model Architecture"""
    
    LSTM_UNITS = 128
    DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
    

    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
    embedding_layer = Embedding(*embedding_matrix.shape,
                                weights=[embedding_matrix],
                                trainable=False)
    x = embedding_layer(sequence_input)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
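    # NOTE: the attention output is bound to `X` (capital) and is not used below;
    # the pooled features are computed from `x`.
    X = Attention(MAX_SEQUENCE_LENGTH)(x)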

    hidden = concatenate([        
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    
    result = Dense(1, activation='sigmoid')(hidden)
    # NOTE: the auxiliary head is fed the 1-unit `result`, not `hidden`.
    aux_result = Dense(num_aux_targets, activation='sigmoid')(result)
    
   
    # Compile model.
    print('compiling model')
    model = Model(inputs=sequence_input, outputs=[result, aux_result])
    model.compile(loss=[custom_loss,'binary_crossentropy'],loss_weights=[loss_weight, 1.0],
              optimizer='adam')



    return model
   


In [25]:
import tensorflow as tf
gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

In [26]:
# Build Model
print(y_aux_data.shape[-1])
build_model(embedding_matrix, y_aux_data.shape[-1], loss_weight).summary()

7


compiling model



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 220)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 220, 300)     92772300    input_1[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d_1 (SpatialDro (None, 220, 300)     0           embedding_1[0][0]                
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 220, 256)     440320      spatial_dropout1d_1[0][0]        
__________________________________________________________________________________________________
bidirectio
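
The parameter counts in the summary rows shown above can be reproduced by hand. The sketch below assumes the usual CuDNNLSTM layout, which keeps two bias vectors per gate (8 × units bias terms in total); the helper name `cudnn_lstm_params` is hypothetical.

```python
vocab_rows, embed_dim = 309241, 300   # embedding matrix shape printed earlier
lstm_units = 128                      # LSTM_UNITS in build_model

# One weight per (token row, embedding dimension).
embedding_params = vocab_rows * embed_dim

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input and recurrent kernels, plus 8 * units biases.
    return 4 * units * (input_dim + units) + 8 * units

# A Bidirectional wrapper holds one forward and one backward LSTM.
bilstm_1_params = 2 * cudnn_lstm_params(embed_dim, lstm_units)

print(embedding_params)   # 92772300, matches embedding_1
print(bilstm_1_params)    # 440320, matches bidirectional_1
```

Since the embedding layer is created with `trainable=False`, its 92,772,300 weights are frozen during training.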

## Training the model

In [27]:
# train the model
checkpoint_predictions = []
val_preds=[]
weights = []
NUM_MODELS=2
NUM_EPOCHS = 4
BATCH_SIZE = 512

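# Run the model
import os
import keras
from keras.callbacks import LearningRateScheduler

if os.path.exists('../input/mymodel/my_model_new.h5'):
    print("Loading model")
    # custom_loss must be supplied so Keras can deserialize the compiled model
    model = keras.models.load_model('../input/mymodel/my_model_new.h5',
                                    custom_objects={'custom_loss': custom_loss})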
else:
    model = build_model(embedding_matrix,y_aux_data.shape[-1], loss_weight)
    model.fit(
        train_data,
        [y_data[tr_ind], y_aux_data[tr_ind]],
        batch_size=BATCH_SIZE,
        epochs=NUM_EPOCHS,
        verbose=1,
        validation_data=(cv_data, [y_data[val_ind], y_aux_data[val_ind]]),
        callbacks=[
            # The exponent is fixed at 1, so the learning rate stays at 6e-4 for every epoch.
            LearningRateScheduler(lambda epoch: 1e-3 * (0.6 ** 1))
        ])



compiling model
Train on 1209265 samples, validate on 595609 samples
Epoch 1/4


 [progress-bar output condensed to the first and last readings shown]
 first reading:                     loss: 1.2630 - dense_7_loss: 0.1757 - dense_8_loss: 0.69
 last reading shown (partial epoch): loss: 0.9182 - dense_7_loss: 0.0852 - dense_8_loss: 0.644

Epoch 2/4


 [progress-bar output condensed to the first and last readings shown]
 first reading:                     loss: 0.5218 - dense_7_loss: 0.0695 - dense_8_loss: 0.29
 last reading shown (partial epoch): loss: 0.4728 - dense_7_loss: 0.0598 - dense_8_loss: 0.28

Epoch 3/4


 [progress-bar output condensed to the first and last readings shown]
 first reading:                     loss: 0.3650 - dense_7_loss: 0.0594 - dense_8_loss: 0.17
 last reading shown (partial epoch): loss: 0.3468 - dense_7_loss: 0.0560 - dense_8_loss: 0.167

Epoch 4/4


 [progress-bar output condensed to the first and last readings shown]
 first reading:                     loss: 0.3038 - dense_7_loss: 0.0543 - dense_8_loss: 0.12
 last reading shown (partial epoch): loss: 0.3010 - dense_7_loss: 0.0541 - dense_8_loss: 0.12

<keras.callbacks.History at 0x1d3a7607860>

In [28]:
# Saving the model
model.save('../input/mymodel/my_model_new.h5')
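
A note on the learning-rate callback used in the training cell: Keras calls the schedule function with the epoch index (starting at 0), but the expression there does not depend on it. The sketch below compares that constant schedule with a per-epoch decay; the decaying variant is a hypothetical alternative, not what the run above used.

```python
def constant_schedule(epoch):
    # Same expression as in the training cell: the exponent is fixed at 1.
    return 1e-3 * (0.6 ** 1)

def decaying_schedule(epoch):
    # Hypothetical variant: shrink the rate by a factor of 0.6 each epoch.
    return 1e-3 * (0.6 ** epoch)

for epoch in range(4):
    print(epoch, constant_schedule(epoch), decaying_schedule(epoch))
```

Either function could be passed to `LearningRateScheduler`; only the second actually changes the rate between epochs.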

## Prediction for Validation data

In [29]:
#Prediction for Validation data
val_preds.append(model.predict(cv_data, batch_size=512)[0].flatten())

In [38]:
y_data[val_ind].shape

(595609, 2)

In [51]:
cv_preds= np.array(val_preds).T
print(cv_preds.shape)
score=roc_auc_score(y_data[val_ind][:,0],cv_preds)*100
print("Validation auc is: ",score)

(595609, 1)
Validation auc is:  96.2946537314013
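
For intuition about the score above: ROC AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score. A small pure-Python sketch on toy data (the function name `roc_auc` and the values are assumptions for illustration; it is quadratic in the number of samples, so `roc_auc_score` remains the practical choice):

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

The notebook multiplies the result by 100, so a value of 0.75 here would be reported as 75.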


## Prediction on test data

In [49]:
#Prediction for test data
checkpoint_predictions.append(model.predict(test_data, batch_size=512)[0].flatten())


In [59]:
test_preds= np.array(checkpoint_predictions).T
test_preds.shape

(97320, 1)

In [61]:
submission = pd.read_csv('../Data/sample_submission.csv', index_col='id')
submission['prediction'] = test_preds
submission.to_csv('submission_new.csv')
submission.head()

id,prediction
7097320,0.002563
7097321,0.010771
7097322,0.215665
7097323,0.001489
7097324,0.000401
