## 04 Handcoding the data to train a classifier

To train a classifier, I need labelled examples, which means I have to code a selection of press releases by hand. First I import the necessary packages and define a string for the base directory, which saves typing or copy-pasting the full path every time I open a file:

In [1]:
import os.path
import csv
import random
import datetime

basedir = os.path.expanduser('~/Dropbox/Studies/Semester 2/Block I/data_IMEM/intermediate/')

I randomly sample 1000 cases from the initial data set:

In [2]:
code_cases = random.sample(range(29536), 1000)
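
Because `random.sample` draws a different subset on every run, re-running the cell above selects different cases. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for this project (the helper name `draw_cases` and the seed value `42` are illustrative and not part of the original analysis):

```python
import random

def draw_cases(n_rows, n_cases, seed=42):
    """Return a sorted list of n_cases distinct row indices below n_rows."""
    rng = random.Random(seed)  # private generator; leaves the global RNG state untouched
    return sorted(rng.sample(range(n_rows), n_cases))

code_cases_example = draw_cases(29536, 1000)
print(len(code_cases_example), code_cases_example[:3])
```

Calling `draw_cases` twice with the same seed returns the same indices, so the hand-coded file can always be traced back to the rows it came from.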

I could simply subset the data, write it to a csv, and then code zeros and ones by hand in a new column. However, I can make this task less error-prone and less tedious by writing some code!

In [None]:
dt = str(datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

# open csv with all cases
with open(basedir + 'Releases20190516-190649.csv', mode="r", encoding="utf-8") as fi:
    # open new csv to write coded cases to
    with open(basedir + "Coded"+dt+".csv",mode="w", encoding="utf-8") as fo:    
        reader = csv.reader(fi)
        writer = csv.writer(fo, lineterminator='\n')
        
        # define subsample from full set
        codesample = [row for idx, row in enumerate(reader) if idx in code_cases]
        
        # set up counter
        i = 1
        for row in codesample:
            input_val = False
            # as long as no correct input was given, keep asking
            while not input_val:
                
                # question
                print('\nIs this text about immigration?\n\n Text ' + str(i) + '/1000:\n\n')
                
                # print press release title
                print(row[2])
                print('\n[y]es\n[n]o\n[m]ore?')
                
                # get keyboard input
                code1 = input()
                
                # if input == y, append 1
                if code1 == 'y':
                    row.append(1)
                    writer.writerow(row)
                    i += 1
                    input_val = True
                
                # append 0 for n
                elif code1 ==  'n':
                    row.append(0)
                    writer.writerow(row)
                    i += 1
                    input_val = True
                
                # if input is m, print press release content for further inspection and ask again for classification
                elif code1 == 'm':
                    print('\nIs this text about immigration?\n\nText:\n\n')
                    print(row[2] + '\n\n' + row[4])
                    print('\n[y]es\n[n]o?')
                    code2 = input()
                    if code2 == 'y':
                        row.append(1)
                        writer.writerow(row)
                        i += 1
                        input_val = True
                    elif code2 ==  'n':
                        row.append(0)
                        writer.writerow(row)
                        i += 1
                        input_val = True


Is this text about immigration?

 Text 1/1000:


Otte: Vergabe von externen Beratungsleistungen transparenter machen

[y]es
[n]o
[m]ore?
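
The coding loop repeats the same append-and-write steps for every answer. One way to shorten it, sketched here under the assumption that the prompt logic can live in its own function, is a helper that only returns the label and leaves the writing to the caller. The name `ask_label` and its `ask`/`show` parameters are hypothetical; they exist so the function can be tried out without a keyboard:

```python
def ask_label(title, body, ask=input, show=print):
    """Ask until the coder answers y or n; return 1 for yes, 0 for no.

    Answering m prints the full text and then repeats the question.
    """
    options = '\n[y]es\n[n]o\n[m]ore?'
    while True:
        show('\nIs this text about immigration?\n\n' + title)
        show(options)
        answer = ask().strip().lower()
        if answer == 'y':
            return 1
        if answer == 'n':
            return 0
        if answer == 'm':
            show(title + '\n\n' + body)
        # any other input falls through and the question is asked again

# scripted answers stand in for keyboard input: invalid key, "more", then "yes"
scripted = iter(['x', 'm', 'y'])
label = ask_label('Example title', 'Example body',
                  ask=lambda: next(scripted), show=lambda *_: None)
print(label)  # 1
```

Inside the loop, each row would then need a single `row.append(ask_label(row[2], row[4]))` followed by `writer.writerow(row)`.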


I then join the coded cases back onto the cleaned data set, matching on each press release's URL:

In [14]:
with open(basedir+"Coded20190518-132350.csv",mode="r", encoding="utf-8") as fi:
    reader = csv.reader(fi)
    positive = []
    negative = []
    for row in reader:
        if row[5] == '1':
            positive.append(row[3])
        elif row[5] == '0':
            negative.append(row[3])
        else:
            print('Error')
            break

dt = str(datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
with open(basedir + 'Cleaned20190520-122409.csv', mode="r", encoding="utf-8") as fi:
    with open(basedir + 'Cleaned_coded' + dt + '.csv', mode="w", encoding="utf-8") as fo:
        reader = csv.reader(fi)
        next(reader)
        fieldnames = ['date', 'sender', 'title', 'link', 'raw', 'clean_full', 'clean_rest', 'coding']
        writer = csv.DictWriter(fo, lineterminator='\n', fieldnames = fieldnames)
        writer.writeheader()
        for row in reader:
            if row[3] in positive:
                row.append(1)
            elif row[3] in negative:
                row.append(0)
            else:
                row.append('')    
            writer.writerow({'date':        row[0], 
                            'sender':       row[1],
                            'title':        row[2], 
                            'link':         row[3], 
                            'raw':          row[4], 
                            'clean_full':   row[5], 
                            'clean_rest':   row[6],
                            'coding':       row[7]})
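
Checking `row[3] in positive` scans a list for every row of the full data set. A dictionary keyed by the link does the membership test and the label lookup in one step. The sketch below assumes the same column positions as the cell above (link in column 3, coding in column 5 of the coded file); the function name `attach_codings` and the toy rows are made up for illustration:

```python
def attach_codings(rows, coded_rows, link_col=3, code_col=5):
    """Append the hand-coded label to each row, matched on the link column.

    Rows without a hand-coded counterpart get an empty string.
    """
    codings = {r[link_col]: r[code_col] for r in coded_rows}
    return [row + [codings.get(row[link_col], '')] for row in rows]

# toy rows standing in for the coded and the cleaned csv
coded = [['d', 's', 't', 'http://a', 'raw', '1'],
         ['d', 's', 't', 'http://b', 'raw', '0']]
cleaned = [['d', 's', 't', 'http://a', 'raw', 'full', 'rest'],
           ['d', 's', 't', 'http://c', 'raw', 'full', 'rest']]
merged = attach_codings(cleaned, coded)
print([row[-1] for row in merged])  # ['1', '']
```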

Let's inspect the data: how many press releases in our set are concerned with immigration?

In [4]:
with open(basedir + 'Coded20190518-132350.csv', mode = 'r', encoding = 'utf-8') as fi:
    reader = csv.reader(fi)
    i = 0
    for row in reader:
        if row[5] == '1':
            i += 1
print(i)

29
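
The loop above counts only the positive cases. A small sketch with `collections.Counter` reports both classes and their shares in one pass; `label_shares` is a hypothetical helper, and the toy list simply mirrors the 29 positives out of 1000 coded cases:

```python
from collections import Counter

def label_shares(labels):
    """Return {label: (count, share)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# toy labels: 29 positives, the remaining 971 coded as negative
toy_labels = ['1'] * 29 + ['0'] * 971
shares = label_shares(toy_labels)
print(shares['1'])
```

With the real file, `labels` would be the list of `row[5]` values read from the coded csv.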


Oh no, only 29 out of 1000 cases are related to immigration! Such an unbalanced sample won't be sufficient to train the final classifier. The data is not useless, though: I can use it to train a preliminary classifier that tells me which cases in the full dataset are worth inspecting, so that the next round of hand-coding oversamples press releases related to immigration.

In order to do this, I first need to create a balanced sample and vectorize the texts:

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# define train and predict set
train = []
predict = []

with open(basedir + 'Cleaned_coded20190520-151658.csv', mode="r", encoding="utf-8") as fi:
    reader = csv.reader(fi)
    for row in reader:
        if row[7] == '1':
            train.append(row)
        elif row[7] == '0':
            train.append(row)
        predict.append(row[5])

# define x and y for the classifier:
texts  = [t[5] for t in train]
predictor = [int(t[7]) for t in train]
sum(predictor) # 29 positive cases -> lets use a similar number of non-cases for the algorithm

# undersample majority class (= not immigration) to get a more balanced sample
neg_idx = [idx for idx, p in enumerate(predictor) if p == 0]
# keep all positives plus 70 randomly drawn negatives
keep_neg = set(random.sample(neg_idx, min(70, len(neg_idx))))
texts_us = []
predictor_us = []
for idx, (p, t) in enumerate(zip(predictor, texts)):
    if p == 1 or idx in keep_neg:
        predictor_us.append(p)
        texts_us.append(t)
        


# vectorize texts
vec_count = CountVectorizer(max_df=.5, min_df=5)
count = vec_count.fit_transform(texts_us)
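
The two arguments passed to `CountVectorizer` prune the vocabulary by document frequency: `min_df=5` drops terms that occur in fewer than five texts, and `max_df=.5` drops terms that occur in more than half of them. The following pure-Python sketch illustrates that filter only. It assumes plain whitespace tokenization, which is simpler than what scikit-learn does, and `filter_vocabulary` and the toy documents are invented for the example:

```python
from collections import Counter

def filter_vocabulary(docs, min_df=5, max_df=0.5):
    """Return the sorted terms whose document frequency lies within the bounds.

    min_df is an absolute document count, max_df a share of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    upper = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)

# 12 toy documents: "und" is in all of them, the other terms in 5 or 7
toy_docs = ['asyl und migration'] * 5 + ['haushalt und steuern'] * 7
print(filter_vocabulary(toy_docs))  # ['asyl', 'migration']
```

With a training sample this small, both bounds remove a large part of the vocabulary, which keeps the preliminary model from fitting on terms that appear in only one or two texts.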

Now we can train a model to identify most-likely cases for subsequent hand-coding:

In [6]:
from sklearn.linear_model import LogisticRegression

# define and fit model
logreg = LogisticRegression()
logreg.fit(count, predictor_us)

#%% identify most-likely immigration cases by scoring every text in the cleaned file
count_f = vec_count.transform(predict[1:]) # [1:] excludes header
probability = logreg.predict_proba(count_f) # let's take the probability to identify most likely cases
print(sorted([p[1] for p in probability])[-500]) # return value of the 500th-most-likely case



0.8651075232554201
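
The printed value is the predicted probability of the 500th most likely case, which then serves as a cut-off. A sketch of the same idea as a small function, using `heapq.nlargest` so the full list does not have to be sorted (`kth_largest` and the toy probabilities are illustrative only):

```python
import heapq

def kth_largest(values, k):
    """Return the k-th largest value, i.e. sorted(values)[-k]."""
    if not 1 <= k <= len(values):
        raise ValueError('k must be between 1 and len(values)')
    return heapq.nlargest(k, values)[-1]

toy_probs = [0.10, 0.95, 0.40, 0.87, 0.87, 0.60]
threshold = kth_largest(toy_probs, 3)
print(threshold)                               # 0.87
print(sum(p >= threshold for p in toy_probs))  # 3
```

Ties at the cut-off, or comparing against a rounded threshold, can select slightly more or fewer than `k` cases, so the number of texts that end up in the hand-coding file need not be exactly 500.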


Using the threshold provided by this preliminary classifier, I can code about 500 more cases, which are far more likely to be about immigration and should therefore yield a more balanced sample. To be safe, I write the output of the preliminary classifier, as well as the subsample of most-likely cases for hand-coding, to csv files so I don't lose them.

In [7]:
train2 = []
dt = str(datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
with open(basedir + 'Cleaned_coded20190520-151658.csv', mode="r", encoding="utf-8") as fi:
    with open(basedir + 'Cleaned_prelim.csv', mode="w", encoding="utf-8") as fo:
        reader = csv.reader(fi)
        fieldnames = ['date', 'sender', 'title', 'link', 'raw', 'clean_full', 'clean_rest', 'coding', 'probability']
        writer = csv.DictWriter(fo, lineterminator='\n', fieldnames = fieldnames)
        writer.writeheader()
        i = 0
        next(reader)  # Skip header row.
        for row in reader:
            row.append(round(probability[i][1], 2))
            writer.writerow({'date':        row[0], 
                            'sender':       row[1],
                            'title':        row[2], 
                            'link':         row[3], 
                            'raw':          row[4], 
                            'clean_full':   row[5], 
                            'clean_rest':   row[6],
                            'coding':       row[7],
                            'probability':  row[8]})
            i += 1
            if row[7] != '0' and row[7] != '1': # check whether cases are pre-coded 
                if float(row[8]) > .86: # only include 500 most likely cases
                    train2.append([row[2], row[3], row[4]]) # extract title, link and raw text

with open(basedir + 'oversample.csv', mode="w", encoding="utf-8") as fo:
    writer = csv.writer(fo, lineterminator='\n')
    for row in train2:
        writer.writerow(row)

Again, I use a small program for the hand-coding, so that I cannot accidentally skip a row:

In [None]:
i = 0
with open(basedir + 'oversample.csv', mode="r", encoding="utf-8") as fi:
    with open(basedir + 'oversample_coded.csv', mode="w", encoding="utf-8") as fo:           
        reader = csv.reader(fi)
        writer = csv.writer(fo, lineterminator='\n')
        for row in reader:
            input_val = False
            while not input_val:
                print('\nIs this text about immigration?\n\n Text ' + str(i) + '/495:\n\n')
                print(row[0])
                print('\n[y]es\n[n]o\n[m]ore?')
                code1 = input()
                if code1 == 'y':
                    row.append(1)
                    i += 1
                    input_val = True
                elif code1 ==  'n':
                    row.append(0)
                    i += 1
                    input_val = True
                elif code1 == 'm':
                    print('\nIs this text about immigration?\n\nText:\n\n')
                    print(row[0] + '\n\n' + row[2])
                    print('\n[y]es\n[n]o?')
                    code2 = input()
                    if code2 == 'y':
                        row.append(1)
                        i += 1
                        input_val = True
                    elif code2 ==  'n':
                        row.append(0)
                        i += 1
                        input_val = True
            writer.writerow(row)


Is this text about immigration?

 Text 0/495:


Gröhe: Dankbar für den Einsatz der Kirchen bei der Suizidprävention

[y]es
[n]o
[m]ore?
n

Is this text about immigration?

 Text 1/495:


Brand: Humanitäre Hilfe für Syrien und Nachbarländer sorgt für mehr Stabilität

[y]es
[n]o
[m]ore?
m

Is this text about immigration?

Text:


Brand: Humanitäre Hilfe für Syrien und Nachbarländer sorgt für mehr Stabilität

14.03.2019 – 16:28                                               Deutschland kommt seiner Verantwortung nach Die Bundesregierung sagt auf der Syrien-Konferenz in Brüssel 1,44 Milliarden Euro für humanitäre Hilfe und entwicklungsorientierte Maßnahmen in den Nachbarländern zu. Dazu erklärt der Vorsitzende der Arbeitsgruppe Menschenrechte und humanitäre Hilfe der CDU/CSU-Bundestagsfraktion, Michael Brand: "Mit der Zusage Deutschlands, den Betrag für die humanitäre Hilfe in Syrien und den Nachbarländern erneut zu erhöhen, kommt Deutschland als zweitgrößter internationaler Geber seiner h

I write the new codings back to the full csv (I am aware that repeating this block is inefficient and that it belongs in a function or loop, but I lack the time to change this now):

In [9]:
with open(basedir + "oversample_final.csv",mode="r", encoding="utf-8") as fi:
    reader = csv.reader(fi)
    positive = []
    negative = []
    for row in reader:
        if row[3] == '1':
            positive.append(row[1])
        elif row[3] == '0':
            negative.append(row[1])
        else:
            print('Error')
            break

with open(basedir+'Cleaned_coded20190520-151658.csv', mode="r", encoding="utf-8") as fi:
    with open(basedir+'full_coded.csv', mode="w", encoding="utf-8") as fo:
        reader = csv.reader(fi)
        fieldnames = ['date', 'sender', 'title', 'link', 'raw', 'clean_full', 'clean_rest', 'coding']
        writer = csv.DictWriter(fo, lineterminator='\n', fieldnames = fieldnames)
        writer.writeheader()
        next(reader)
        for row in reader:
            if row[3] in positive:
                row[7] = 1
            elif row[3] in negative:
                row[7] = 0    
            writer.writerow({'date':        row[0], 
                            'sender':       row[1],
                            'title':        row[2], 
                            'link':         row[3], 
                            'raw':          row[4], 
                            'clean_full':   row[5], 
                            'clean_rest':   row[6],
                            'coding':       row[7]})

Did my method work? How many of the cases are positive?

In [10]:
print(len(positive))

196


196 of the roughly 500 oversampled cases are positive! This should be sufficient to work with. With a little oversampling, I should be able to train a decent classifier. Lastly, I divide the hand-coded data into a training and a test set and assign groups to the training data for cross-validation:

In [16]:
with open(basedir + 'full_coded.csv', mode="r", encoding="utf-8") as fi:
    with open(basedir + 'train.csv', mode="w", encoding="utf-8") as train:
        with open(basedir + 'test.csv', mode="w", encoding="utf-8") as test:
            with open(basedir + 'predict'+dt+'.csv', mode="w", encoding="utf-8") as predict:
                reader = csv.reader(fi)
                writer_train   = csv.writer(train, lineterminator='\n')
                writer_test    = csv.writer(test, lineterminator='\n')
                writer_predict = csv.writer(predict, lineterminator='\n')
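                # keep the header out of the random split; train.csv gets it
                # because the cross-validation step below skips a header row
                header = next(reader)
                writer_train.writerow(header)
                traintest = []
                for row in reader:
                    if row[7] == '':
                        writer_predict.writerow(row)
                    else:
                        traintest.append(row)

                # random sample 60:40 train:test
                rsample = set(random.sample(range(len(traintest)), round(len(traintest)*.4)))
                for i, row in enumerate(traintest):
                    if i in rsample:
                        writer_test.writerow(row)
                    else:
                        writer_train.writerow(row)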


# define split-samples in training data for cross-validation
with open(basedir + 'train.csv', mode="r", encoding="utf-8") as train:
    with open(basedir + 'crossval.csv', mode="w", encoding="utf-8") as crossval:
        reader = csv.reader(train)
        fieldnames = ['date', 'sender', 'title', 'link', 'raw', 'clean_full', 'clean_rest', 'coding', 'val_sample']
        writer = csv.DictWriter(crossval, lineterminator='\n', fieldnames = fieldnames)
        writer.writeheader()
        next(reader) # skip header
        for row in reader:
            # assign each row to one of five validation groups at random
            row.append(random.randrange(5))
            writer.writerow({'date':         row[0], 
                            'sender':       row[1], 
                            'title':        row[2], 
                            'link':         row[3],
                            'raw':          row[4],
                            'clean_full':   row[5],
                            'clean_rest':   row[6], 
                            'coding':       row[7], 
                            'val_sample':   row[8]})
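
Drawing a group for each row independently gives five groups of unequal size, which can matter with only a few hundred training cases. A sketch of an alternative that keeps the groups as even as possible, assuming equal fold sizes are wanted (the function name `assign_folds` and its `seed` parameter are hypothetical):

```python
import random

def assign_folds(n_rows, n_folds=5, seed=None):
    """Return one fold id per row, with fold sizes differing by at most one."""
    folds = [i % n_folds for i in range(n_rows)]  # 0, 1, ..., n_folds - 1, 0, 1, ...
    random.Random(seed).shuffle(folds)            # randomize which row gets which fold
    return folds

fold_ids = assign_folds(12, n_folds=5, seed=1)
print(sorted(fold_ids))  # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4]
```

The resulting list can be appended to the training rows in order, in place of the per-row random draw.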