Data downloaded from https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification

Example of parallelizing a slow function on a dataframe that fits in memory

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Any results you write to the current directory are saved as output.

Running a spaCy function on a pandas dataframe

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
pd.set_option("max_colwidth", 400)

In [4]:
train = pd.read_csv("/kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")

In [5]:
train.shape
train.head()

(1804874, 45)

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,"Thank you!! This would make my life a lot less anxiety-inducing. Keep it up, and don't let anyone get in your way!",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos to you for taking it on. Very impressive!,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on my site? When will you be releasing it?,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')

An expensive function

In [7]:
train_sample = train.head(50_000)

In [8]:
train_sample.shape

(50000, 45)

In [9]:
%%time
doc = {}
for i, text in enumerate(train_sample.comment_text):
    doc[i] = nlp(text)

CPU times: user 1h 33min 18s, sys: 4min 21s, total: 1h 37min 40s
Wall time: 12min 17s


In [10]:
import dask
import dask.dataframe as dd
from dask.distributed import Client, progress
from dask import delayed

# Create a client to parallelize all dask functions.
client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')
client

# Dask kills and restarts any worker whose memory use exceeds memory_limit

Client:  Scheduler: tcp://127.0.0.1:59887  Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4  Cores: 8  Memory: 16.00 GB


In [11]:
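def delayed_nlp(df, col="comment_text"):
    # Called once per partition; df is a plain pandas dataframe.
    # Relies on the module-level nlp pipeline loaded above.
    # Returns {row position within the partition: spaCy Doc}.
    doc_dict = {}
    for i, text in enumerate(df[col]):
        doc_dict[i] = nlp(text)

    return doc_dict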

In [12]:
train_dask = dd.from_pandas(train_sample, npartitions=16)

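The meta argument of map_partitions takes a dataframe whose column names and dtypes match the output of the mapped function. The dataframe can be empty. Here delayed_nlp returns a dict rather than a dataframe, so train.head() acts only as a placeholder.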

In [13]:
%%time
doc_output = train_dask.map_partitions(delayed_nlp, meta=train.head())
doc_dict = doc_output.compute()

  (           id    target  \
18750  265153  0.00000 ... efb2fc9242db0')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)


CPU times: user 1min 31s, sys: 46.6 s, total: 2min 18s
Wall time: 7min 41s
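
To put the two runs side by side, the wall times reported above (12min 17s for the plain loop, 7min 41s with Dask) can be converted into a speedup and a rough throughput. This is only arithmetic on the numbers printed by %%time; the to_seconds helper is introduced here for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a %%time style 'Xmin Ys' reading to seconds."""
    return minutes * 60 + seconds


n_docs = 50_000                     # rows in train_sample
serial_wall = to_seconds(12, 17)    # plain for-loop over nlp(text)
dask_wall = to_seconds(7, 41)       # map_partitions on 4 workers

speedup = serial_wall / dask_wall
print(f"speedup: {speedup:.2f}x")
print(f"serial: {n_docs / serial_wall:.1f} docs/s")
print(f"dask:   {n_docs / dask_wall:.1f} docs/s")
```

The serial cell's CPU time is far larger than its wall time, which suggests that run was already using several cores, so the gain from four Dask workers is smaller than the worker count might imply.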


In [14]:
len(doc_dict) # equal to the number of dataframe partitions

16

In [18]:
train_dict = {}

In [20]:
%%time
for i in range(len(doc_dict)):
    train_dict[i] = train_dask.get_partition(i).compute()

CPU times: user 220 ms, sys: 73.8 ms, total: 294 ms
Wall time: 579 ms


In [21]:
len(train_dict)

16

In [50]:
%%time
for i in range(len(doc_dict)):
    # Materialize the values as a list; dicts keep insertion order, so rows stay aligned.
    train_dict[i]["spacy_nlp"] = list(doc_dict[i].values())

CPU times: user 3.62 s, sys: 38.4 ms, total: 3.65 s
Wall time: 3.63 s


In [76]:
train_dict[0].loc[0, "spacy_nlp"]
type(train_dict[0].loc[0, "spacy_nlp"])

This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!

spacy.tokens.doc.Doc

In [73]:
train_dict[0].loc[0, "spacy_nlp"].print_tree()

[{'word': 'is',
  'lemma': 'be',
  'NE': '',
  'POS_fine': 'VBZ',
  'POS_coarse': 'VERB',
  'arc': 'ROOT',
  'modifiers': [{'word': 'This',
    'lemma': 'this',
    'NE': '',
    'POS_fine': 'DT',
    'POS_coarse': 'DET',
    'arc': 'nsubj',
    'modifiers': []},
   {'word': 'cool',
    'lemma': 'cool',
    'NE': '',
    'POS_fine': 'JJ',
    'POS_coarse': 'ADJ',
    'arc': 'acomp',
    'modifiers': [{'word': 'so',
      'lemma': 'so',
      'NE': '',
      'POS_fine': 'RB',
      'POS_coarse': 'ADV',
      'arc': 'advmod',
      'modifiers': []}]},
   {'word': '.',
    'lemma': '.',
    'NE': '',
    'POS_fine': '.',
    'POS_coarse': 'PUNCT',
    'arc': 'punct',
    'modifiers': []}]},
 {'word': "'s",
  'lemma': 'be',
  'NE': '',
  'POS_fine': 'VBZ',
  'POS_coarse': 'VERB',
  'arc': 'ROOT',
  'modifiers': [{'word': 'It',
    'lemma': '-PRON-',
    'NE': '',
    'POS_fine': 'PRP',
    'POS_coarse': 'PRON',
    'arc': 'nsubj',
    'modifiers': []},
   {'word': 'like',
    'lemma': 'lik

Convert the dictionary back to the dataframe

In [77]:
%%time
train_df = pd.concat(train_dict, ignore_index=True)

CPU times: user 80.8 ms, sys: 45 ms, total: 126 ms
Wall time: 125 ms


In [78]:
train_df.shape

(50000, 46)
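
The partition, annotate, and concatenate steps above can be sketched in plain pandas on a toy dataframe. This is a minimal sketch under stated assumptions: tokenize is a hypothetical stand-in for nlp, the toy comments are made up, and the partitions are built with iloc slices instead of Dask.

```python
import pandas as pd

toy = pd.DataFrame({"comment_text": [
    "This is so cool.", "Thank you!!", "Very impressive!", "haha", "When?",
]})

# Split into partitions keyed by partition number (like train_dict).
n_parts = 2
size = -(-len(toy) // n_parts)  # ceiling division
parts = {i: toy.iloc[i * size:(i + 1) * size].copy() for i in range(n_parts)}


def tokenize(text):
    # Hypothetical cheap stand-in for the expensive nlp(text) call.
    return text.split()


# One result dict per partition, keyed by row position (like doc_dict).
results = {
    i: {j: tokenize(t) for j, t in enumerate(part["comment_text"])}
    for i, part in parts.items()
}

# Attach each partition's results as a new column, aligned on its index.
for i, part in parts.items():
    part["tokens"] = pd.Series(list(results[i].values()), index=part.index)

# Stitch the partitions back into one dataframe.
combined = pd.concat(parts, ignore_index=True)
print(combined.shape)
```

Because each result dict is filled in row order, turning its values into a list keeps them aligned with the rows of the partition they came from.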

In [85]:
train_df["spacy_nlp"][0]

print("\n Tree")
train_df["spacy_nlp"][0].print_tree()

This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!


 Tree


[{'word': 'is',
  'lemma': 'be',
  'NE': '',
  'POS_fine': 'VBZ',
  'POS_coarse': 'VERB',
  'arc': 'ROOT',
  'modifiers': [{'word': 'This',
    'lemma': 'this',
    'NE': '',
    'POS_fine': 'DT',
    'POS_coarse': 'DET',
    'arc': 'nsubj',
    'modifiers': []},
   {'word': 'cool',
    'lemma': 'cool',
    'NE': '',
    'POS_fine': 'JJ',
    'POS_coarse': 'ADJ',
    'arc': 'acomp',
    'modifiers': [{'word': 'so',
      'lemma': 'so',
      'NE': '',
      'POS_fine': 'RB',
      'POS_coarse': 'ADV',
      'arc': 'advmod',
      'modifiers': []}]},
   {'word': '.',
    'lemma': '.',
    'NE': '',
    'POS_fine': '.',
    'POS_coarse': 'PUNCT',
    'arc': 'punct',
    'modifiers': []}]},
 {'word': "'s",
  'lemma': 'be',
  'NE': '',
  'POS_fine': 'VBZ',
  'POS_coarse': 'VERB',
  'arc': 'ROOT',
  'modifiers': [{'word': 'It',
    'lemma': '-PRON-',
    'NE': '',
    'POS_fine': 'PRP',
    'POS_coarse': 'PRON',
    'arc': 'nsubj',
    'modifiers': []},
   {'word': 'like',
    'lemma': 'lik

In [79]:
train_df.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count,spacy_nlp
0,59848,0.0,"This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!",0.0,0.0,0.0,0.0,0.0,,,...,rejected,0,0,0,0,0,0.0,0,4,"(This, is, so, cool, ., It, 's, like, ,, ', would, you, want, your, mother, to, read, this, ?, ?, ', Really, great, idea, ,, well, done, !)"
1,59849,0.0,"Thank you!! This would make my life a lot less anxiety-inducing. Keep it up, and don't let anyone get in your way!",0.0,0.0,0.0,0.0,0.0,,,...,rejected,0,0,0,0,0,0.0,0,4,"(Thank, you, !, !, This, would, make, my, life, a, lot, less, anxiety, -, inducing, ., Keep, it, up, ,, and, do, n't, let, anyone, get, in, your, way, !)"
2,59852,0.0,This is such an urgent design problem; kudos to you for taking it on. Very impressive!,0.0,0.0,0.0,0.0,0.0,,,...,rejected,0,0,0,0,0,0.0,0,4,"(This, is, such, an, urgent, design, problem, ;, kudos, to, you, for, taking, it, on, ., Very, impressive, !)"
3,59855,0.0,Is this something I'll be able to install on my site? When will you be releasing it?,0.0,0.0,0.0,0.0,0.0,,,...,rejected,0,0,0,0,0,0.0,0,4,"(Is, this, something, I, 'll, be, able, to, install, on, my, site, ?, When, will, you, be, releasing, it, ?)"
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,rejected,0,0,0,1,0,0.0,4,47,"(haha, you, guys, are, a, bunch, of, losers, .)"
