## Background

```
In the data supplied for this competition, the text of each individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute, like all the others, is a fractional value representing the fraction of human raters who believed the attribute applied to the given comment. For evaluation, test set examples with target >= 0.5 are considered to be in the positive class (toxic).
```
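
The labelling rule quoted above can be sketched in a few lines. This is a minimal illustration only: `fraction_of_raters`, `is_toxic`, and `TOXICITY_THRESHOLD` are hypothetical names, the vote counts are made up, and the sketch assumes a plain unweighted fraction. Only the 0.5 cut-off comes from the competition text.

```python
# Hypothetical helper names; only the 0.5 threshold is taken from the text above.
TOXICITY_THRESHOLD = 0.5


def fraction_of_raters(positive_votes: int, total_raters: int) -> float:
    """Fraction of raters who said the attribute applies (assumed unweighted)."""
    if total_raters <= 0:
        raise ValueError("total_raters must be positive")
    return positive_votes / total_raters


def is_toxic(target: float) -> bool:
    """Return True when a fractional target falls in the positive class."""
    return target >= TOXICITY_THRESHOLD


# Made-up vote counts: 0, 2, and 5 raters out of 10 flag the comment as toxic.
targets = [fraction_of_raters(votes, 10) for votes in (0, 2, 5)]
labels = [is_toxic(t) for t in targets]
```

Because the comparison is `>=`, a comment flagged by exactly half of its raters lands in the positive class.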

In [16]:
# The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. The subtype attributes are:
TOXICITY_SUBTYPES = [
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit"
]

In [18]:
# Additionally, a subset of comments has been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation. In the list below, these identities are marked with evaluated=True.

class IdentityAttribute:
    def __init__(self, identity, evaluated):
        self.identity = identity
        self.evaluated = evaluated

IDENTITY_ADDTRIBUTES = [
    IdentityAttribute("male", True),
    IdentityAttribute("female", True),
    IdentityAttribute("transgender", False),
    IdentityAttribute("other_gender", False),
    IdentityAttribute("heterosexual", False),
    IdentityAttribute("homosexual_gay_or_lesbian", True),
    IdentityAttribute("bisexual", False),
    IdentityAttribute("other_sexual_orientation", False),
    IdentityAttribute("christian", True),
    IdentityAttribute("jewish", True),
    IdentityAttribute("muslim", True),
    IdentityAttribute("hindu", False),
    IdentityAttribute("buddhist", False),
    IdentityAttribute("atheist", False),
    IdentityAttribute("other_religion", False),
    IdentityAttribute("black", True),
    IdentityAttribute("white", True),
    IdentityAttribute("asian", False),
    IdentityAttribute("latino", False),
    IdentityAttribute("other_race_or_ethnicity", False),
    IdentityAttribute("physical_disability", False),
    IdentityAttribute("intellectual_or_learning_disability", False),
    IdentityAttribute("psychiatric_or_mental_illness", True),
    IdentityAttribute("other_disability", False)
]
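
The `evaluated` flag makes it straightforward to pull out just the identities that count toward the evaluation. The sketch below is self-contained and therefore uses a hypothetical stand-in class (`IdentityFlag`) and an abbreviated sample list rather than the full list above; `evaluated_identities` is likewise an illustrative name.

```python
from typing import List, NamedTuple


class IdentityFlag(NamedTuple):
    """Hypothetical stand-in for IdentityAttribute, for illustration only."""
    identity: str
    evaluated: bool


# Abbreviated sample of the identity list; flags match the entries above.
SAMPLE_ATTRIBUTES = [
    IdentityFlag("male", True),
    IdentityFlag("transgender", False),
    IdentityFlag("muslim", True),
    IdentityFlag("hindu", False),
]


def evaluated_identities(attributes) -> List[str]:
    """Return the column names of identities included in the evaluation."""
    return [a.identity for a in attributes if a.evaluated]


evaluated = evaluated_identities(SAMPLE_ATTRIBUTES)
```

The same comprehension should work on the full list, since it only relies on the `identity` and `evaluated` attributes.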

```
In addition to the labels described above, the dataset also provides metadata from Jigsaw's annotation: toxicity_annotator_count and identity_annotator_count, and metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, disagree. Civil Comments' label rating is the civility rating Civil Comments users gave the comment.
```

In [29]:
CIVIL_COMMENT_EMOTES = [
    "funny",
    "wow",
    "sad",
    "likes",
    "disagree"
]

CIVIL_COMMENT_METADATA = [
    "created_date",
    "publication_id",
    "parent_id",
    "article_id"
]

JIGSAW_ANNOTATIONS = [
    "toxicity_annotator_count",
    "identity_annotator_count"
]

## Initialization

In [1]:
%run jupyter/notebook_modules.ipynb

In [2]:
import ops.kaggle_dependencies

importing Jupyter notebook from /home/jupyter/kaggle-jigsaw-unintended-bias-in-toxicity-classification/ops/kaggle_dependencies.ipynb
Note: you may need to restart the kernel to use updated packages.


In [3]:
import kaggle
import pandas as pd

import bronze_data_preparation
import bq
from competition import THIS_COMPETITION

importing Jupyter notebook from bronze_data_preparation.ipynb
importing Jupyter notebook from bq.ipynb
importing Jupyter notebook from competition.ipynb


In [4]:
TRUNCATE_BRONZE_DATA = False

In [6]:
bronze_data_preparation.initialize(TRUNCATE_BRONZE_DATA)

## Basic Data Structure

In [5]:
kaggle.api.competition_list_files_cli(THIS_COMPETITION)

name                    size  creationDate         
---------------------  -----  -------------------  
sample_submission.csv    1MB  2019-03-28 21:16:21  
test.csv                29MB  2019-03-28 21:16:21  
train.csv              778MB  2019-03-28 21:16:18  


In [7]:
bq.query("""
SELECT * FROM {dataset_id}.train LIMIT 10
""")

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,239639,0.0,Probably because they consistently waste funds...,0.0,0.0,0.0,0.0,0.0,,,...,26673,approved,0,0,0,2,0,0.0,0,4
1,239690,0.0,let me some up the heavy vibers feelings:\n\nG...,0.0,0.0,0.0,0.0,0.0,,,...,26655,rejected,0,0,0,0,0,0.0,0,4
2,239755,0.0,What are the incidents of disrespect shown to ...,0.0,0.0,0.0,0.0,0.0,,,...,29795,approved,0,0,0,0,0,0.0,0,4
3,240341,0.0,"What I've always found so ironic, is that even...",0.0,0.0,0.0,0.0,0.0,,,...,32846,approved,0,0,0,2,0,0.0,0,4
4,240449,0.0,"Star Trek is our future, Star Wars is our past...",0.0,0.0,0.0,0.0,0.0,,,...,32846,approved,0,0,0,1,0,0.0,0,4
5,240489,0.0,Star Wars: Deep Space Nine is clearly the best...,0.0,0.0,0.0,0.0,0.0,,,...,32846,approved,0,0,0,1,0,0.0,0,4
6,240645,0.0,Do you think that treating the majority of the...,0.0,0.0,0.0,0.0,0.0,,,...,32516,approved,0,0,0,3,0,0.0,0,4
7,240743,0.5,The comments I see in WW today could have been...,0.0,0.4,0.3,0.3,0.0,,,...,33231,approved,0,0,0,3,0,0.0,0,10
8,240810,0.2,This stuff gets weirder by the day. The might...,0.0,0.0,0.1,0.1,0.1,,,...,33626,approved,0,0,0,0,0,0.0,0,10
9,241000,0.0,"Dear Willamette Weak,\nI am writing in regards...",0.0,0.0,0.0,0.0,0.0,,,...,34427,approved,0,0,0,4,0,0.0,0,4


In [30]:
# Collect the column names from the schema of the train table.
fields = (f.name for f in bq.client.get_table("{dataset_id}.train".format(dataset_id=bq.dataset_id)).schema)

# Every column described in the Background section.
known_fields = ["id", "target", "comment_text", "rating"]
known_fields.extend(TOXICITY_SUBTYPES)
known_fields.extend([i.identity for i in IDENTITY_ADDTRIBUTES])
known_fields.extend(CIVIL_COMMENT_EMOTES)
known_fields.extend(CIVIL_COMMENT_METADATA)
known_fields.extend(JIGSAW_ANNOTATIONS)

# List any column in the table that is not accounted for above.
[u for u in fields if u not in known_fields]

[]

In [31]:
# TODO: data validation
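
One possible starting point for this validation, offered as a sketch rather than the notebook's actual implementation: since the label columns are described as fractions of raters, every non-missing value should lie in [0, 1]. The function name `invalid_fraction_rows` and the sample rows are hypothetical, and the rows are plain dictionaries rather than a query result so that the example stays self-contained.

```python
import math


def invalid_fraction_rows(rows, columns):
    """Return (row_index, column, value) for each value outside [0, 1].

    Missing values (None or NaN) are skipped, because identity columns
    are only labelled for a subset of comments.
    """
    problems = []
    for index, row in enumerate(rows):
        for column in columns:
            value = row.get(column)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue
            if not 0.0 <= value <= 1.0:
                problems.append((index, column, value))
    return problems


# Made-up rows: the second one has an out-of-range target.
sample_rows = [
    {"target": 0.5, "obscene": 0.4, "asian": None},
    {"target": 1.2, "obscene": 0.0, "asian": float("nan")},
]
problems = invalid_fraction_rows(sample_rows, ["target", "obscene", "asian"])
```

An empty result would indicate that all checked columns respect the fractional range; further checks (unique ids, non-empty comment text) could follow the same pattern.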

## Summary Statistics

## Univariate Analysis

## Baseline Modelling