# CRF based tagger of Code-Mixed text

This tutorial utilizes the `python-crfsuite` for modeling Conditional Random Fields. Most of the code is from the tutorial on [NER using CRFs](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb). This is a demo for sequence labeling tasks like language labeling and POS tagger.

Installation requirements:

* nltk (if already not installed: use `pip install nltk`)
* scikit-learn (if already not installed: `pip install sklearn`)
* python-crfsuite (`pip install python-crfsuite`)

In [2]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pycrfsuite
import random

Loading raw text data from [ICON 2016 shared contest](http://amitavadas.com/Code-Mixing.html) on POS tagging of social media text.

In [3]:
def load_data(files):
    data, sent = [], []
    for file in files:
        with open(file, 'r') as rf:
            for line in rf:
                if line.strip() != '':
                    # Note: the shared corpus is already tokenized
                    sent.append(line.strip().split('\t'))
                else:
                    if len(sent) > 0:
                        data.append(sent)
                        sent = []
    return data

sents = load_data(['data/FB_HI_EN_CR.txt', 'data/TWT_HI_EN_CR.txt', 'data/WA_HI_EN_CR.txt',
                   'data/FB_BN_EN_CR.txt', 'data/TWT_BN_EN_CR.txt', 'data/WA_BN_EN_CR.txt',
                   'data/FB_TE_EN_CR.txt', 'data/TWT_TE_EN_CR.txt', 'data/WA_TE_EN_CR.txt'])

Creating a train and validation split. Here, we are first randomly shuffling the entire corpus and use the first 80% of the instances as train and the remaining as validation. Random seed is set to allow for reproduction of results.

In [4]:
# random.seed(7)
random.shuffle(sents)
train_sents = sents[:int(0.8*len(sents))]
valid_sents = sents[int(0.8*len(sents)):]
print("# Train sentences: %d" % (len(train_sents)))
print("# Validation sentences: %d" % (len(valid_sents)))

# Train sentences: 4186
# Validation sentences: 1047


Sample train data with language labels and POS tags

In [5]:
print(train_sents[0])

[['Gn', 'hi', 'U']]


## Extracting features

Now we extract some features from the raw text data to pass onto the CRF model 

* Current word
* Character n-grams of the current word
* Context
    * Previous word
* Begin and End of sentence markers (BOS and EOS)

Note: Please feel free to explore other features as described in lectures and other reading material that can be of help to the model.

In [6]:
def word2features(sent, k):
    word = sent[k][0]
    features = [
        'token=%s' % (word)
    ]
    # extracting n-grams, for n=1 to 5
    for i in range(1,6):
        # if the value of n is greater than the word length, we exit the loop
        if i > len(word):
            break
        character_features = [word[j:j+i] for j in range(len(word)-i+1)]
        features.extend([
            # is count of individual n-grams important? is the order important?
            "char-%d-gram=%s" % (i, ' '.join(list(set(character_features))))
        ])
    if k == 0:
        # first word in the sentence
        features.append('BOS')
    else:
        features.extend([
            "-1:word=%s" % (sent[k-1][0])
        ])
    if i == len(sent):
        # last word in the sentence         
        features.append('EOS')
 
    return features
        
def sent2features(sent):
    # generating features for all the words/tokens in a sentence `sent`    
    return [word2features(sent, i) for i in range(len(sent))]

def sent2langs(sent):
    return [language_label for token, language_label, pos_tag in sent]

def sent2pos(sent):
    return [pos_tag for token, language_label, pos_tag in sent]

def sent2tokens(sent):
    return [token for token, language_label, pos_tag in sent]

Generating these features from train and validation data. If you are building a language identification model, the y_train and y_test should be language labels (`sent2langs`), and if you are learning a pos-tagger, these should contain pos-tags (`sent2pos`) as shown in the below code

In [7]:
%%time
X_train = [sent2features(sent) for sent in train_sents]
# y_train = [sent2langs(sent) for sent in train_sents]
# for training a pos-tagging system
y_train = [sent2pos(sent) for sent in train_sents]

X_test = [sent2features(sent) for sent in valid_sents]
# y_test = [sent2langs(sent) for sent in valid_sents]
y_test = [sent2pos(sent) for sent in valid_sents]



CPU times: user 1.03 s, sys: 24.5 ms, total: 1.06 s
Wall time: 1.07 s


View a sample of train data with their corresponding features

In [None]:
print(X_train[0])

In [8]:
%%time
trainer = pycrfsuite.Trainer(verbose=True)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 1.18 s, sys: 21.8 ms, total: 1.2 s
Wall time: 1.21 s


Setting training parameters for the CRF model

In [9]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [None]:
trainer.params()

Training the model

In [10]:
%%time
trainer.train('test1.crfsuite')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 144439
Seconds required: 0.775

L-BFGS optimization
c1: 1.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 184944.424686
Feature norm: 1.000000
Error norm: 12483.045187
Active features: 45909
Line search trials: 1
Line search step: 0.000078
Seconds required for this iteration: 0.169

***** Iteration #2 *****
Loss: 158459.145080
Feature norm: 5.659668
Error norm: 34159.864001
Active features: 45871
Line search trials: 4
Line search step: 0.125000
Seconds required for this iteration: 0.313

***** Iteration #3 *****
Loss: 145972.613679
Feature norm: 4.888285
Error norm: 7962.356410
Active features: 45710
Line search trials: 1
Line search step: 1.000000
Seconds required f

***** Iteration #44 *****
Loss: 59484.698474
Feature norm: 130.642568
Error norm: 120.763590
Active features: 29274
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.083

***** Iteration #45 *****
Loss: 59456.445504
Feature norm: 131.059697
Error norm: 202.832362
Active features: 29099
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.088

***** Iteration #46 *****
Loss: 59431.407720
Feature norm: 131.486823
Error norm: 307.707884
Active features: 28903
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.086

***** Iteration #47 *****
Loss: 59405.157022
Feature norm: 131.935112
Error norm: 101.385246
Active features: 28739
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.082

***** Iteration #48 *****
Loss: 59383.653193
Feature norm: 132.224912
Error norm: 81.017232
Active features: 28636
Line search trials: 1
Line search step: 1.000000


Creating a tagger and loading the trained model

In [11]:
tagger = pycrfsuite.Tagger()
tagger.open('test1.crfsuite')

<contextlib.closing at 0x1a24772390>

Making predictions on the validation data using the trained model

In [14]:
example_sent = valid_sents[56]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2pos(example_sent)))

@news24tvchannel very good Virat and m.s Dhoni aap ko aur puri team ko bahut bahut badhaiya

Predicted: @ G_R G_J G_N CC G_N G_N G_X PSP DT G_N G_N PSP G_N G_N G_N
Correct:   @ G_R G_J G_N CC G_N G_N G_X PSP DT G_N G_X PSP G_N G_N G_N


In [15]:
y_true = sent2pos(example_sent)
y_pred = tagger.tag(sent2features(example_sent))
t = set(y_true)
t = list(t)
t

['G_J', 'G_X', 'DT', '@', 'CC', 'G_R', 'PSP', 'G_N']

In [16]:
confusion_matrix(y_true, y_pred, labels=t )

array([[1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0, 0, 0, 7]])

In [17]:
y_pred_t = []
y_true_t = []
for valids in valid_sents:
    y_pred_t.extend(tagger.tag(sent2features(valids)))
    y_true_t.extend(sent2pos(valids))

In [18]:
t = set(y_true_t)
t = list(t)
confusion_matrix(y_true_t, y_pred_t, labels=t )

array([[  77,    0,    1,    8,    0,    1,   10,    0,    0,    0,    0,
           0,    0,    1,    1,    0,   40,    0],
       [   1,  418,    0,   11,    8,    0,    6,    3,    5,    0,    0,
           5,   14,   27,    0,    3,  333,    0],
       [   2,    0,  111,    8,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   25,    0],
       [   5,    3,    3, 2138,    6,    0,    7,    4,   13,    1,    1,
          42,    3,   29,    1,   30,  565,    0],
       [   0,    1,    0,    4,  388,    0,    0,    1,    3,    1,    0,
          18,    0,    4,    0,    1,   29,    0],
       [   0,    1,    0,   14,    0,    0,    0,    0,    0,    0,    0,
           2,    0,    0,    0,    0,   19,   10],
       [   3,    1,    1,    0,    0,    0,  245,    0,    0,    0,    0,
           0,    0,    0,    0,    1,  262,    0],
       [   1,   12,    0,    1,    0,    0,    0,   63,    5,    0,    0,
           8,    0,    8,    0,    1,   37,   10],
