This notebook shows you how to use the pre-processing functionality of the `ktext` library to prepare data for Keras.   

### Preview Data

GitHub Issues pulled from [GitHub Archive](https://www.githubarchive.org/).  The data set itself is not provided with this notebook; it is used only for illustration.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet('demo_df.parquet')
train, test = train_test_split(df, test_size=0.2)
train[['issue_url', 'body', 'issue_title']].head(3)

Unnamed: 0,issue_url,body,issue_title
25056,https://github.com/CitizensDev/Citizens2/issue...,update players based on navigating npc's: upda...,"fix npe, improve navigating npc skins, fix tab..."
82182,https://github.com/docker-library/tomcat/issue...,docker image from java:openjdk-8u92-jre-alpine...,add use alpine image dockerfile issue 33
14214,https://github.com/UnionOfRAD/lithium/issues/317,it would be a nice addition to the dispatcher:...,allow closure usage in dispatcher::config


Background: The data shown above are [GitHub Issues](https://guides.github.com/features/issues/), specifically the body and title of each issue.  Suppose you want to train a model that reads the body of an issue and predicts what the issue title will be.  To do that, we need to pre-process the **body** and **issue_title** fields from this dataframe.

`ktext` operates on lists of strings (where each element in the list is a document).  Therefore, we can extract these text fields from the dataframe as lists. 

In [2]:
body = train.body.tolist()
issue_title = train.issue_title.tolist()

Let's look at the raw data:

In [3]:
print('URL: ', train.issue_url.iloc[0])
print('Body:\n', body[0])
print('Title:\n', issue_title[0])

URL:  https://github.com/CitizensDev/Citizens2/issues/514
Body:
 update players based on navigating npc's: update player when npc navigates into players field of view. moved skin update tracker code to own class skinupdatetracker fix use incorrect setting for tab list

Title:
 fix npe, improve navigating npc skins, fix tablist setting


## Process Data For Deep Learning

In [32]:
%reload_ext autoreload
%autoreload 2
from ktext.preprocess import processor

## Initialize **processor** object
It is important to read the docstring so you can see all the options. 

In [33]:
help(processor.__init__)

Help on function __init__ in module ktext.preprocess:

__init__(self, hueristic_pct_padding:float=0.9, append_indicators:bool=False, keep_n:int=150000, padding:str='pre', padding_maxlen:Union[int, NoneType]=None, truncating:str='post')
    Parameters:
    ----------
    hueristic_pct_padding: float
        This parameter is only used if `padding_maxlen` = None.  A histogram
        of documents is calculated, and the maxlen is set hueristic_pct_padding.
    append_indicators: bool
        If True, will append the tokens '_start_' and '_end_' to the beginning
        and end of your tokenized documents.  This can be useful when training
        seq2seq models.
    keep_n: int = 150000
        This is the maximum size of your vocabulary (unique number of words
        allowed).  Consider limiting this to a reasonable size based upon
        your corpus.
    padding : str
        'pre' or 'post', pad either before or after each sequence.
    padding_maxlen : int or None
        Maximum se

Let's initialize our processor so that it truncates and pads our documents to a common length equal to the 70th percentile of document lengths in the data set.  Furthermore, let's keep only the top 5,000 tokens in our vocabulary.

In [34]:
issue_body_proc = processor(hueristic_pct_padding=.7, keep_n=5000)

### `fit_transform` method

**fit_transform** cleans and tokenizes the documents, builds a vocabulary, and then truncates and pads the documents.  Look at the docstring to get a more detailed explanation.
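To make these steps concrete, here is a minimal sketch of tokenizing, building a capped vocabulary, and mapping tokens to integers. It is not `ktext`'s implementation: the regular-expression tokenizer, the helper names (`tokenize`, `build_vocab`, `encode`), and the toy documents are all assumptions made for illustration, and padding and truncation are left out to keep it short.

```python
import re
from collections import Counter

UNK_ID = 1  # id 0 is left free for padding, id 1 marks rare/unknown tokens


def tokenize(doc):
    """Lowercase and split into word and punctuation tokens (toy tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", doc.lower())


def build_vocab(token_lists, keep_n):
    """Map the keep_n most frequent tokens to ids starting at 2."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def encode(tokens, token2id):
    """Replace each token with its id; tokens outside the vocabulary become 1."""
    return [token2id.get(tok, UNK_ID) for tok in tokens]


docs = ["Fix the tab list setting.", "Update the player skin."]
token_lists = [tokenize(d) for d in docs]
token2id = build_vocab(token_lists, keep_n=3)
encoded = [encode(toks, token2id) for toks in token_lists]
print(token2id)
print(encoded)
```

With `keep_n=3`, only three tokens receive their own id, and every other token collapses to `1`.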

In [35]:
%%time
# prepare text on 8 core node
train_result = issue_body_proc.fit_transform(body)

 See full histogram by insepecting the `document_length_stats` attribute.


CPU times: user 1min 22s, sys: 3.14 s, total: 1min 25s
Wall time: 2min 4s


In [36]:
print('shape of result', train_result.shape)
print('\noriginal string:\n', body[0])
print('after pre-processing:\n', train_result[0])

shape of result (80000, 90)

original string:
 update players based on navigating npc's: update player when npc navigates into players field of view. moved skin update tracker code to own class skinupdatetracker fix use incorrect setting for tab list

after pre-processing:
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  2  3  4  5  6  1  7  8  2  9 10  1  1 11  3 12 13 14 15 16
  1  2 17 18 19 20 21  1 22 23 24 25 26 27 28]


We can see that the body of this issue has been converted to an array of integers and has been padded to length 90.  This is because we initialized the `processor` to calculate the maximum length using the heuristic of the 70th percentile of document length.  We can see the histogram of document lengths by inspecting the **`document_length_stats`** attribute.

In [37]:
stats = issue_body_proc.document_length_stats
stats.head(15)

Unnamed: 0,bin,doc_count,cumsum_pct
13,10,2205,0.027563
10,20,8240,0.130562
2,30,6623,0.21335
0,40,7305,0.304663
7,50,5845,0.377725
3,60,7338,0.46945
4,70,6437,0.549913
8,80,8687,0.6585
6,90,8802,0.768525
1,100,10775,0.903212
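The table suggests how the length of 90 could have been chosen. The sketch below assumes that `bin` is the upper edge of each bucket and that the chosen length is the first bin whose `cumsum_pct` reaches the requested percentile. That rule is inferred from the output above, not read from the library source, and `pick_maxlen` is a hypothetical helper.

```python
# (bin, cumsum_pct) pairs copied from the document_length_stats rows shown above
cumulative = [
    (10, 0.027563), (20, 0.130562), (30, 0.21335), (40, 0.304663),
    (50, 0.377725), (60, 0.46945), (70, 0.549913), (80, 0.6585),
    (90, 0.768525), (100, 0.903212),
]


def pick_maxlen(cumulative, pct):
    """Return the first bin whose cumulative share of documents reaches pct."""
    for bin_edge, cum_pct in cumulative:
        if cum_pct >= pct:
            return bin_edge
    return cumulative[-1][0]  # fall back to the largest bin listed


print(pick_maxlen(cumulative, 0.7))  # 90
```

Under that assumption, about 66% of documents fit in 80 tokens and about 77% fit in 90, so 90 is the first bin that covers the 70th percentile.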


If you do not like this behavior, you can set your own maximum length manually by specifying the **`padding_maxlen`** parameter (which skips the process of building a histogram and runs faster!).  This overrides the **`hueristic_pct_padding`** parameter if it is specified.  Example:

In [31]:
%%time
issue_body_proc2 = processor(keep_n=5000, padding_maxlen=85)
train_result2 = issue_body_proc2.fit_transform(body)



CPU times: user 56.8 s, sys: 2.39 s, total: 59.2 s
Wall time: 1min 37s


In [38]:
print('shape of result', train_result2.shape)
print('\noriginal string:\n', body[0])
print('after pre-processing:\n', train_result2[0])

shape of result (80000, 85)

original string:
 update players based on navigating npc's: update player when npc navigates into players field of view. moved skin update tracker code to own class skinupdatetracker fix use incorrect setting for tab list

after pre-processing:
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  2  3  4  5  6  1  7  8  2  9 10  1  1 11  3 12 13 14 15 16  1  2 17 18 19
 20 21  1 22 23 24 25 26 27 28]
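Both outputs carry their zeros at the front, which matches the defaults `padding='pre'` and `truncating='post'` in the `processor` signature. A minimal sketch of that behavior follows; `pad_and_truncate` is a hypothetical helper written for illustration, not a `ktext` function.

```python
def pad_and_truncate(ids, maxlen, padding="pre", truncating="post", pad_id=0):
    """Force a list of token ids to exactly maxlen entries."""
    if len(ids) > maxlen:
        # 'post' drops ids from the end, 'pre' drops them from the start
        ids = ids[:maxlen] if truncating == "post" else ids[-maxlen:]
    fill = [pad_id] * (maxlen - len(ids))
    # 'pre' puts the padding in front, 'post' puts it behind
    return fill + ids if padding == "pre" else ids + fill


short = [2, 3, 4]
long = list(range(2, 12))  # ten ids: 2 through 11

print(pad_and_truncate(short, 6))  # [0, 0, 0, 2, 3, 4]
print(pad_and_truncate(long, 6))   # [2, 3, 4, 5, 6, 7]
```

This also explains why the same document has 55 leading zeros at length 90 and 50 at length 85: the tokens stay the same and only the amount of padding changes.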


### `fit` method

The same as the **`fit_transform`** method except no data is returned.  See docstring for further details.

### **`id2token`** attribute

The **`id2token`** attribute contains all of the index-to-word mappings, which we can use to convert indices back to tokens.  Remember that `0` is just padding and `1` is reserved for rare (below the `keep_n` threshold) or unknown words.
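The reverse lookup can be sketched with a toy mapping. The four entries below are read off the pre-processing output shown earlier (ids 2 to 5 line up with the first four words of the issue body). The `decode` helper and the `<unk>` marker are assumptions for illustration only.

```python
PAD_ID, UNK_ID = 0, 1

# Toy mapping; a fitted processor would hold thousands of entries
id2token = {2: "update", 3: "players", 4: "based", 5: "on"}


def decode(ids, id2token, unk="<unk>"):
    """Turn ids back into text, dropping padding and marking unknown tokens."""
    words = []
    for idx in ids:
        if idx == PAD_ID:
            continue
        words.append(unk if idx == UNK_ID else id2token[idx])
    return " ".join(words)


print(decode([0, 0, 2, 3, 1, 4, 5], id2token))  # update players <unk> based on
```

Keeping a visible marker for id `1`, rather than skipping it, makes it easier to see where rare tokens were lost.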

In [41]:
' '.join([issue_body_proc2.id2token[idx] for idx in train_result2[0] if idx >1])

"update players based on navigating 's : update player when into players field of view . moved update tracker code to own class fix use incorrect setting for tab list"

### `token_count_pandas` method

We can use this method to see the top tokens in the dataset.

In [43]:
token_count_df = issue_body_proc2.token_count_pandas()
token_count_df.head(10)

Unnamed: 0,count,token
13,62447,.
42,58599,the
71,53646,","
17,52374,to
35,47715,*
6,40674,:
51,40232,a
95,38626,is
67,36808,in
127,35813,and
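For intuition, a frequency table like this one can be produced from tokenized documents with the standard library's `collections.Counter`. The toy token lists below are made up, and this is only a sketch of the idea, not how `token_count_pandas` is implemented.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data)
token_lists = [
    ["fix", "the", "tab", "list", "."],
    ["update", "the", "player", "."],
    ["the", "skin", "tracker"],
]

# Count every token across all documents
counts = Counter(tok for toks in token_lists for tok in toks)

# Show the two most frequent tokens with their counts
for token, count in counts.most_common(2):
    print(token, count)
```

As in the real output above, punctuation and stop words tend to dominate the top of such a list.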


### `transform` method

This method transforms new raw data without process-based parallelization, which is suitable when you do not have much data to transform.

In [46]:
%%time
body_test = test.body.tolist()
body_test_transformed = issue_body_proc2.transform(body_test)

CPU times: user 32.8 s, sys: 28 ms, total: 32.8 s
Wall time: 32.8 s


In [60]:
print('shape of result', body_test_transformed.shape)

shape of result (20000, 85)


### `transform_parallel` method

Same as the **`transform`** method, but parallelizes the work across processes.  In this case, we can see an appreciable speedup because we are transforming 20k documents.

In [49]:
%%time
body_test_transformed = issue_body_proc2.transform_parallel(body_test)

CPU times: user 1.46 s, sys: 340 ms, total: 1.8 s
Wall time: 8.9 s


In [50]:
body_test_transformed[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,  136,  137, 1124,  840,  541,  180,   86,    1,  373,
         19,  207,  123, 2611,   10,  288,  223,    1,   15,   65,  136,
        774,    1,   44, 1678,  596,  476,   15,  123,  386, 1127,    1,
         92,    1,   92, 1563, 1950,  241,    1,    8,  117,   73, 1563,
       1376, 2616, 4449, 1563,  131,    1,   19,   53, 3782,   13, 1812,
        596,    1,    8,   92,    1,    8,    1,   73], dtype=int32)
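The general pattern behind process-based parallel transformation can be sketched with the standard library: split the documents into chunks, transform each chunk in a separate process, and stitch the results back together in order. How `ktext` actually divides the work is not shown in this notebook, so `chunk`, `transform_chunk`, and `transform_in_parallel` are hypothetical helpers, and the per-document work here is just a word count.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal parts."""
    size, extra = divmod(len(items), n_chunks)
    parts, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            parts.append(items[start:end])
        start = end
    return parts


def transform_chunk(docs):
    """Stand-in for the real per-document work (hypothetical): count words."""
    return [len(doc.split()) for doc in docs]


def transform_in_parallel(docs, n_workers=4):
    """Transform each chunk in its own process and keep the input order."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(transform_chunk, chunk(docs, n_workers))
    return [row for part in results for row in part]


print(chunk(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

When run as a script, call `transform_in_parallel` under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. Starting processes and copying data has a fixed cost, which is why this approach pays off mainly on larger inputs.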

# Appendix

#### `vocabulary` attribute

This is a gensim `Dictionary` object that is updated when you call **`fit`** or **`fit_transform`**.

See https://radimrehurek.com/gensim/corpora/dictionary.html for more details on this object.

In [51]:
issue_body_proc.vocabulary

<gensim.corpora.dictionary.Dictionary at 0x7f2c75a36550>

#### Other methods under construction but not yet completed:

1. set_tokenizer
2. set_cleaner

####  Experimental: incremental parsing with `fit` or `fit_transform` 

If you call **`fit`** or **`fit_transform`** more than once using the same **`processor`** object, it will simply **append to the existing vocabulary**.  This can be useful if you want to add to your vocabulary incrementally.  Care must be taken here: `keep_n` throws out tokens completely, so set `keep_n` sufficiently large if you want to allow new tokens to be added.  With incremental training you will also likely want to use **`padding_maxlen`** instead of the **`hueristic_pct_padding`** parameter.  Furthermore, every time you call `fit` or `fit_transform`, word indices are changed.  Use incremental training with extreme caution.

Example:

In [54]:
new_proc = processor(keep_n=10000, padding_maxlen=70)
new_proc.fit(body)



In [57]:
print('Number of documents: ', new_proc.vocabulary.num_docs)

Number of documents:  80000


Add 15k more documents

In [58]:
new_data = body[:15000]
new_proc.fit(new_data)



In [59]:
print('Number of documents: ', new_proc.vocabulary.num_docs)

Number of documents:  95000
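To see why word indices can shift between calls, consider a toy vocabulary that accumulates counts and re-ranks ids by frequency after every `fit`. This is one plausible mechanism, written only to illustrate the caution above. `ToyVocab` is hypothetical and does not reproduce how `ktext` or gensim assign ids or apply `keep_n`.

```python
from collections import Counter


class ToyVocab:
    """Accumulates token counts across fit calls and re-ranks ids each time."""

    def __init__(self, keep_n):
        self.keep_n = keep_n
        self.counts = Counter()
        self.num_docs = 0
        self.token2id = {}

    def fit(self, token_lists):
        for tokens in token_lists:
            self.counts.update(tokens)
            self.num_docs += 1
        # ids 0 and 1 stay reserved; the rest follow the frequency ranking
        ranked = self.counts.most_common(self.keep_n)
        self.token2id = {tok: i + 2 for i, (tok, _) in enumerate(ranked)}


vocab = ToyVocab(keep_n=2)
vocab.fit([["fix", "bug"], ["fix", "typo"]])
first = dict(vocab.token2id)

vocab.fit([["bug", "report"], ["bug", "typo"]])
second = dict(vocab.token2id)

print(vocab.num_docs)
print(first)
print(second)
```

After the second call, "bug" overtakes "fix" in frequency and takes over id `2`. Any array encoded with the first mapping would then decode incorrectly with the second, so data transformed before a refit should be transformed again.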
