### Load liar, liar dataset

Here we will load in the raw "liar, liar" dataset, and prepare it for use in training our model.

### Step 1: load raw data from tsv files

In [121]:
import pandas as pd

train_raw = pd.read_csv("./liar_dataset/train.tsv", 
                        delimiter = '\t', header = None)

train_raw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


This dataset contains our text examples in column 2, and our labels in column 1. There is also additional metadata about the speaker, which we won't use for now, but may return to later.
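The integer column positions can be mapped to readable names. The sketch below assumes the column order described in the LIAR dataset's README. The names in `LIAR_COLUMNS` are illustrative and are not used elsewhere in this notebook, which keeps indexing columns by number. A one-row stand-in for `train_raw` is built from the first row of the output above.

```python
import pandas as pd

# Assumed column order (from the LIAR README); the names themselves are illustrative.
LIAR_COLUMNS = [
    "id", "label", "statement", "subject", "speaker", "job_title",
    "state", "party", "barely_true_count", "false_count",
    "half_true_count", "mostly_true_count", "pants_fire_count", "context",
]

# Stand-in for train_raw: one row copied from the output above.
raw = pd.DataFrame([[
    "2635.json", "false",
    "Says the Annies List political group supports third-trimester abortions on demand.",
    "abortion", "dwayne-bohac", "State representative", "Texas", "republican",
    0.0, 1.0, 0.0, 0.0, 0.0, "a mailer",
]])

# Map integer positions 0..13 to names; rename() returns a new frame,
# so the original integer-indexed frame is left unchanged.
named = raw.rename(columns=dict(enumerate(LIAR_COLUMNS)))
print(named[["label", "statement", "speaker"]])
```

Working on a renamed copy keeps the later cells, which refer to columns `1` and `2`, valid.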


Currently our labels fall into six categories:

In [97]:
train_raw[1].value_counts()

half-true      2114
false          1995
mostly-true    1962
true           1676
barely-true    1654
pants-fire      839
Name: 1, dtype: int64

These six labels, ranging from "true" to "pants-fire", were applied by human editors at PolitiFact.com.

We may want to group some of the labels to reduce our task from six-way classification to fewer classes. Two-way or three-way classification may be more attainable.

Let's try a two-way classification in which we group "pants-fire," "false," and "barely-true" into one "false" category, and the other three categories ("half-true," "mostly-true," and "true") into one "true" category.

Since we're trying to detect deception we'll make "false" the positive class, indicated by a 1.

In [140]:
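def LabelCats(row):
    # Map the six PolitiFact labels in column 1 to a binary label:
    # 1 = "false" group (positive class), 0 = "true" group.
    if row[1] in ('pants-fire', 'false', 'barely-true'):
        return 1
    return 0

train_raw['BinaryLabel'] = train_raw.apply(LabelCats, axis=1)
train_raw['BinaryLabel'].value_counts()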

0    5752
1    4488
Name: BinaryLabel, dtype: int64
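The same grouping can be written as a vectorized operation instead of a row-wise `apply`. This is a minimal sketch of that alternative, not the method used in this notebook. `FALSE_LABELS` and the small `labels` Series are stand-ins introduced for illustration.

```python
import pandas as pd

# Labels grouped into the positive ("false") class, as described above.
FALSE_LABELS = ["pants-fire", "false", "barely-true"]

# Stand-in for column 1 of train_raw, with one example of each label.
labels = pd.Series(["true", "mostly-true", "half-true",
                    "barely-true", "false", "pants-fire"])

# isin() returns a boolean Series; astype(int) converts True/False to 1/0.
binary = labels.isin(FALSE_LABELS).astype(int)
print(binary.tolist())
```

On the real data, `train_raw[1].isin(FALSE_LABELS).astype(int)` should produce the same values as the `BinaryLabel` column built above.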

In [141]:
train_raw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,BinaryLabel
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer,1
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.,0
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver,0
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release,1
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN,0


In [142]:
train_examples = list(train_raw[2])
train_textlabels = list(train_raw[1])
train_labels = list(train_raw['BinaryLabel'])

Here we can draw some random examples from the training data to look at our text examples, the original 6-way classification from PolitiFact.com, and our new binary label.

In [143]:
import random
ex_num = random.randrange(len(train_examples))  # valid indices are 0 .. len - 1

print(train_examples[ex_num]) 
print(train_textlabels[ex_num])
print(train_labels[ex_num])

Barack Hussein Obama will ... force local authorities to allow Occupy protesters to live in parks.
pants-fire
1
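To inspect several examples at once, and to get the same ones on every run, a seeded generator can be used. The helper `sample_indices` below is hypothetical, and the short lists stand in for `train_examples` and `train_labels`.

```python
import random

def sample_indices(n_items, k, seed=0):
    """Return k distinct indices in range(n_items), reproducible for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(n_items), k)

# Stand-in lists; in the notebook these would be train_examples and train_labels.
examples = ["claim a", "claim b", "claim c", "claim d"]
labels = [1, 0, 0, 1]

for i in sample_indices(len(examples), k=2, seed=42):
    print(examples[i], labels[i])
```

Sampling from `range(n_items)` keeps every index valid, including index 0.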


Now we have a list of examples and a list of binary labels that are ready to be tokenized and used with a word embedding. Let's repeat the process for the test set before moving on to the next steps.

In [144]:
test_raw = pd.read_csv("./liar_dataset/test.tsv", 
                        delimiter = '\t', header = None)
test_raw['BinaryLabel'] = test_raw.apply(LabelCats, axis = 1)
test_examples = list(test_raw[2])
test_textlabels = list(test_raw[1])
test_labels = list(test_raw['BinaryLabel'])

### Step 2: Tokenization

Tokenization and vocabulary functions are taken from d2l chapter 8, Text Preprocessing.

In [145]:
def tokenize(lines, token='word'):
    if token == 'word':
        return [line.split(' ') for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        raise ValueError('unknown token type: ' + token)
        
train_tokens = tokenize(train_examples, token = 'word')
test_tokens = tokenize(test_examples, token = 'word')

We now have our tokenized data: `train_tokens` is a list of training examples, where each item is a list of the tokens (words) in the example. For instance, the first training example looks like this:

In [146]:
train_tokens[0]

['Says',
 'the',
 'Annies',
 'List',
 'political',
 'group',
 'supports',
 'third-trimester',
 'abortions',
 'on',
 'demand.']
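Splitting on spaces keeps case and leaves punctuation attached to words, as in `'demand.'` above. If that turns out to matter for the vocabulary, one possible alternative is a regex-based tokenizer. The function `tokenize_regex` and its pattern are assumptions for illustration and are not part of the d2l code.

```python
import re

# A word (optionally joined by hyphens or apostrophes) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize_regex(lines):
    """Lowercase each line and split it into word and punctuation tokens."""
    return [TOKEN_PATTERN.findall(line.lower()) for line in lines]

tokens = tokenize_regex(
    ["Says the Annies List political group supports third-trimester abortions on demand."]
)
print(tokens[0])
```

Here `third-trimester` stays a single token, while the final period becomes its own token.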

We are now ready to move on to word embedding, with our four required objects in place:

`train_tokens`

`test_tokens`

`train_labels`

`test_labels`
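Before moving on, it may be worth checking that each token list lines up with its label list. The helper `check_aligned` below is a hypothetical sanity check, shown on stand-in data rather than the real lists.

```python
def check_aligned(tokens, labels, name):
    """Raise if the lists differ in length or the labels are not all 0 or 1."""
    if len(tokens) != len(labels):
        raise ValueError(f"{name}: {len(tokens)} examples but {len(labels)} labels")
    if not set(labels) <= {0, 1}:
        raise ValueError(f"{name}: labels must be 0 or 1")
    return len(tokens)

# Stand-in data; in the notebook, pass train_tokens/train_labels
# and test_tokens/test_labels instead.
demo_tokens = [["says", "the"], ["health", "care"]]
demo_labels = [1, 0]
print(check_aligned(demo_tokens, demo_labels, "demo"))
```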