Kaggle Toxic Comment Classification Challenge

This repository contains my solution to the Kaggle Toxic Comment Classification Challenge. The competition's goal was to train a model to detect toxic comments such as threats, obscenity, insults, and identity-based hate. The data set consisted of comments from Wikipedia's talk page edits.

My final solution was a bi-directional RNN with 80 GRU units. I wrote it in Python using TensorFlow, spaCy, Gensim, and scikit-learn. I also used pre-trained FastText embedding vectors.

Before training the RNN, I preprocessed the data by:

  1. Tokenizing and lemmatizing the data (spaCy)
  2. Learning the vocabulary (Gensim)
  3. Creating TF-IDF vector models of each comment (Gensim)
  4. Scoring each vocabulary term's toxic/non-toxic discrimination using Chi2 (scikit-learn) and Delta TF-IDF metrics (both sketched below)
  5. Manually correcting a small number of discriminating non-dictionary words
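This README doesn't include the code inline, so here is a minimal Python sketch of steps 1–4, assuming the comments and binary toxic/non-toxic labels are already loaded. All variable names are illustrative, the spaCy model choice is an assumption, and the per-label scoring details that produce the 14 score columns in the diagram below are omitted.

```python
import numpy as np
import spacy
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

train_texts = ["you are an idiot", "thanks for the helpful edit"]  # placeholders
train_labels = np.array([1, 0])  # 1 = toxic, 0 = non-toxic

# 1. Tokenize and lemmatize each comment with spaCy.
docs = [[tok.lemma_.lower() for tok in doc if not tok.is_space]
        for doc in nlp.pipe(train_texts)]

# 2. Learn the vocabulary with Gensim.
vocab = Dictionary(docs)

# 3. Create a TF-IDF vector model of each comment with Gensim.
bow = [vocab.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(bow)
tfidf_vectors = [tfidf[b] for b in bow]

# 4a. Chi2 discrimination scores (scikit-learn). The docs are already
# tokenized, so the vectorizer's analyzer is the identity function.
counts = CountVectorizer(analyzer=lambda d: d).fit_transform(docs)
chi2_scores, _ = chi2(counts, train_labels)

# 4b. Delta TF-IDF's IDF component: the difference between a term's
# smoothed IDF in toxic and in non-toxic comments.
toxic = train_labels.astype(bool)
df_pos = np.asarray((counts[toxic] > 0).sum(axis=0)).ravel() + 1
df_neg = np.asarray((counts[~toxic] > 0).sum(axis=0)).ravel() + 1
delta_idf = np.log2(toxic.sum() / df_pos) - np.log2((~toxic).sum() / df_neg)
```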

The following diagram illustrates my final network design. Each line is a tensor annotated with its dimensions (excluding batch size). Each box is a simplified representation of an operation.

                                                 Logits                      
                                                    ▲                        
                                                    │ 1x6                    
                                                    │                        
                                         ┌─────────────────────┐             
                                         │   Dense Layer (6)   │             
                                         └─────────────────────┘             
                                                    ▲                        
                                                    │ 1x334                  
                                                    │                        
                                         ┌─────────────────────┐             
                                         │       Concat        │             
                                         └─────────────────────┘             
                                                    ▲                        
                    ┌───────────────────────────────┤                        
               1x14 │                               │ 1x320                  
           ┌────────────────┐            ┌─────────────────────┐             
           │   Reduce Max   │            │       Concat        │             
           └────────────────┘            └─────────────────────┘             
                    ▲                               ▲                        
                    │                  ┌────────────┴────────────┐           
                    │             1x160│                         │ 1x160     
                    │                  │                         │           
                    │       ┌─────────────────────┐   ┌─────────────────────┐
                    │       │   Avg Pooling 1D    │   │   Max Pooling 1D    │
                    │       └─────────────────────┘   └─────────────────────┘
                    │                  ▲                         ▲           
                    │                  │                         │           
                    │                  └────────────┬────────────┘           
                    │                               │ 150x160                
                    │                               │                        
                    │                    ┌─────────────────────┐             
                    │                    │       Concat        │             
                    │                    └─────────────────────┘             
                    │                               ▲                        
                    │                  ┌────────────┴────────────┐           
                    │           150x80 │                         │150x80     
                    │                  │                         │           
                    │       ┌─────────────────────┐   ┌─────────────────────┐
                    │       │  Forward GRU (80)   │   │  Backward GRU (80)  │
      ┌─────────────┘       └─────────────────────┘   └─────────────────────┘
      │                                ▲                         ▲           
      │                                │                         │           
      │                                └────────────┬────────────┘           
      │                                             │                        
      │                                             │ 150x300                
      │                                             │                        
      │                                ┌─────────────────────────┐           
      │                                │         Dropout         │           
      │                                └─────────────────────────┘           
      │                                             ▲                        
      │                                             │ 150x300                
      │                                             │                        
      │                                ┌─────────────────────────┐           
      │               ┌───────────────▶│   Embedding Weighting   │           
      │               │                └─────────────────────────┘           
      │               │                             ▲                        
      │               │ 150x1                       │ 150x300                
      │               │                             │                        
      │  ┌─────────────────────────┐   ┌─────────────────────────┐           
      │  │     1D Convolution      │   │   FastText Embeddings   │           
      │  └─────────────────────────┘   └─────────────────────────┘           
      │               ▲                             ▲                        
      └───────────────┤ 150x14                      │ 150x1                  
                      │                             │                        
                                                                             
                Term Scores                     Term IDs                     
                                                                             

The model's inputs were:

  • The comment's first 150 preprocessed tokens
  • The Chi2 and Delta TF-IDF scores for each token and label

The term scores were used in two ways (see the sketch after this list):

  • To weight the FastText embeddings via a 1D convolutional layer that merged the 14 scores into one scalar weight
  • As features for the final dense layer, after a reduce max operation that keeps each score's highest value across the terms
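Here is a rough reconstruction of the network in Keras-style TensorFlow code. It is a sketch built from the diagram, not the original source: the dropout rate, vocabulary size, and convolution settings aren't stated above and are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, EMB_DIM, N_SCORES = 150, 300, 14
VOCAB_SIZE = 100_000  # placeholder; the real vocabulary size isn't stated above

term_ids = layers.Input(shape=(SEQ_LEN,), dtype="int32", name="term_ids")
term_scores = layers.Input(shape=(SEQ_LEN, N_SCORES), name="term_scores")

# Pre-trained FastText embeddings (weight matrix omitted here): 150x300.
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(term_ids)

# 1D convolution merges the 14 per-term scores into one scalar weight: 150x1.
score_weight = layers.Conv1D(filters=1, kernel_size=1)(term_scores)

# Embedding weighting: scale each token's embedding by its scalar weight.
weighted = layers.Multiply()([emb, score_weight])  # 150x300

x = layers.Dropout(0.3)(weighted)  # rate is an assumption

# Forward and backward GRUs with 80 units each, outputs concatenated: 150x160.
x = layers.Bidirectional(layers.GRU(80, return_sequences=True))(x)

# Average and max pooling over the sequence, concatenated: 1x320.
pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(x),
                               layers.GlobalMaxPooling1D()(x)])

# Reduce max keeps each score's highest value across the 150 terms: 1x14.
score_max = layers.GlobalMaxPooling1D()(term_scores)

merged = layers.Concatenate()([pooled, score_max])  # 1x334
logits = layers.Dense(6)(merged)  # one logit per toxicity label

model = tf.keras.Model([term_ids, term_scores], logits)
```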

Weighting the embeddings was inspired by previous experiments using TF-IDF-weighted embeddings. I don't recall exactly how much the weighting helped, but I believe it had a positive effect.

Another novel thing I tried was weighting the losses for each category by their log odds ratios. The rationale was to use boosting to address class imbalance. Again, I don't recall how much this helped, but I must have had a good reason to keep it!
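I no longer have the exact formula, so the following is just one plausible reading of "weighting the losses by log odds ratio": weight each label's binary cross-entropy by the magnitude of that label's log odds in the training set, so the rarer, more skewed labels contribute more to the loss.

```python
import numpy as np
import tensorflow as tf

def log_odds_weights(labels):
    """labels: (n_samples, 6) binary matrix. Returns |log odds| per label,
    so rarer (more skewed) labels get larger weights."""
    p = labels.mean(axis=0).clip(1e-6, 1 - 1e-6)  # positive rate per label
    return np.abs(np.log(p / (1 - p)))

y_train = np.random.randint(0, 2, size=(1000, 6))  # placeholder labels
w = tf.constant(log_odds_weights(y_train), dtype=tf.float32)

def weighted_bce(y_true, logits):
    # Per-label binary cross-entropy, scaled by the per-label weights.
    per_label = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(y_true, tf.float32), logits=logits)
    return tf.reduce_mean(per_label * w)
```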

I trained the model for 8 epochs at a batch size of 128 on my openSUSE Linux box with a Core i7-6850K (6 cores), 32 GB of RAM, and an Nvidia Titan X (Pascal) GPU. My final score was an ROC AUC of 0.9804, which would normally be great. However, I only ranked 2443 out of 4551 teams (53%).
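Wired together with the sketches above, that training configuration would look roughly like this; the optimizer choice is an assumption, and ids_train / scores_train are hypothetical names for the preprocessed inputs.

```python
# Reuses `model` and `weighted_bce` from the sketches above.
model.compile(optimizer="adam", loss=weighted_bce)
model.fit([ids_train, scores_train], y_train, batch_size=128, epochs=8)
```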

The source code contains the preprocessing pipeline and the final model. Also included are unused models that I tried during the competition.
