Steps:
1. create a DataFrame with one row per document, holding:
    1. id (the filename)
    2. language
    3. number of tokens
    4. number of entities
    5. number of relations
    6. number of annotations
2. select a certain number of negative examples, i.e. documents without entities (50%)
3. select a certain number of positive examples (50%):
    1. with a high density of entities/annotations
    2. selecting an equal number for each language
    3. keeping the total number of tokens at roughly 5200 or fewer
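
The steps above can be sketched on a toy table as follows. This is only an illustration: the column names mirror those used later in this notebook, while the rows, the helper name `select_positive` and the per-language count `k` are made up for the example. The sketch treats any document with at least one entity as positive, which is an assumption of the example.

```python
import pandas as pd

# Toy metadata table; all values are invented for illustration only.
toy = pd.DataFrame([
    {"filename": "a.txt", "language": "de", "n_tokens": 40, "n_entities": 3, "n_relations": 2},
    {"filename": "b.txt", "language": "de", "n_tokens": 25, "n_entities": 1, "n_relations": 0},
    {"filename": "c.txt", "language": "en", "n_tokens": 30, "n_entities": 4, "n_relations": 3},
    {"filename": "d.txt", "language": "en", "n_tokens": 20, "n_entities": 0, "n_relations": 0},
    {"filename": "e.txt", "language": "it", "n_tokens": 35, "n_entities": 0, "n_relations": 0},
    {"filename": "f.txt", "language": "it", "n_tokens": 50, "n_entities": 2, "n_relations": 1},
])


def select_positive(df, k):
    """Hypothetical helper: return the k densest positive documents per language."""
    # assumption of this sketch: "positive" means at least one entity
    positive = df[df.n_entities > 0]
    ranked = positive.sort_values(by=["n_relations", "n_entities"], ascending=False)
    # rows keep their ranked order within each group, so head(k) is the top k
    return ranked.groupby("language").head(k)


# Step 2: negative examples are documents without any entity
negative = toy[toy.n_entities == 0]
# Step 3: the same number of dense documents for every language
positive = select_positive(toy, k=1)
# Token total of the positive selection, to compare against the budget
total_tokens = int(positive.n_tokens.sum())
```

Lowering `k` is the simplest way to bring `total_tokens` back under the budget when the selection turns out too large.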

In [181]:
import glob
import os
import pandas as pd
import shutil
import random
import codecs
from random import shuffle
import langid  # needed for langid.classify() in the document loop below
import citation_extractor
from citation_extractor.pipeline import read_ann_file_new
from citation_extractor.Utils.IO import file_to_instances,count_tokens

In [182]:
basedir = "/home/romanell/APh_Corpus/devset/"
testdir = "/home/romanell/APh_Corpus/testset/"
anndir = "%s%s"%(basedir,'ann/')
iobdir = "%s%s"%(basedir,'iob/')
txtdir = "%s%s"%(basedir,'txt/')

In [3]:
files = [(anndir,os.path.basename(file).replace('-doc-1.ann','')) for file in glob.glob("%s*.ann"%anndir)]

In [4]:
files[:10]

[('/home/romanell/APh_Corpus/devset/ann/', '75-13923.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-07293.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-01074.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-04941.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-07985.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-02129.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-13338.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-07106.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-04943.txt'),
 ('/home/romanell/APh_Corpus/devset/ann/', '75-09102.txt')]

In [145]:
documents = []
for directory,filename in files:
    # read in and count entities/relations/annotations
    entities,relations,annotations = read_ann_file_new(filename,directory)
    n_entities,n_relations,n_annotations = len(entities),len(relations),len(annotations)
    # count tokens
    iob = file_to_instances("%s%s"%(iobdir,filename))
    n_tokens = count_tokens(iob)
    # detect language; langid.classify returns a (language, score) pair
    text = " ".join([token[0] for sentence in iob for token in sentence])
    lang = langid.classify(text)
    document = {
        "filename":filename
        ,"language":lang[0]
        ,"language_probability":lang[1]
        ,"n_tokens":n_tokens
        ,"n_entities":n_entities
        ,"n_relations":n_relations
        ,"n_annotations":n_annotations
        ,"selected":False
    }
    documents.append(document)

In [146]:
len(documents)

6693

In [147]:
df = pd.DataFrame.from_dict(documents)

In [148]:
negative_documents = list(df[df.n_entities==0]['filename'])

In [149]:
len(negative_documents)

4769

In [151]:
total = 0
selected_positive_documents = []
# rank documents by annotation density, then take the top 12 of each language
ranked = df[df.n_entities > 1].sort_values(by=['n_relations','n_entities','n_annotations']
                                           ,ascending=False)
for lang,group in ranked.groupby('language'):
    top = group[:12]
    total += top["n_tokens"].sum()
    selected_positive_documents += list(top["filename"])
print("selected %i documents for %i tokens in total"%(len(selected_positive_documents),total))

selected 64 documents for 5473 tokens in total


In [118]:
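for i,filename in enumerate(selected_positive_documents):
    # close each file after reading instead of leaving the handle open
    with codecs.open("%s%s"%(txtdir,filename),'r','utf-8') as f:
        print("%i %s %s"%(i+1,filename,f.read()))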

1 75-05688.txt Der Epheserbrief hat eine ausgefeilte symmetrische Struktur. 
 Mittlerer Hauptteil ist 4, 1-16 ; inhaltlich überlappen sich darin Ekklesiologie und Ethik, die beiden Themen der Teile davor bzw. danach. 
 Dem entspricht die formale Disposition, die hier erstmals konsequent stichometrisch analysiert wird. 
 Als Masszeile dient der 15-Silben-Stichos. 
 Die Textabschnitte 1, 1-3, 21 und 4, 17-6, 24 haben genau denselben Zeilenumfang, ebenso die Hauptteile 1, 3-2,10 und 2, 11-3, 21 ; die Teile 4, 17-5, 14 und 5, 15-6, 24 stehen exakt im Verhältnis 2:3. 
 Jeweils sind Bausteine von 21, 13 oder 8 Stichoi verwendet. 
 Die Symmetrie im Briefaufbau erinnert an Körperbau und Tempelbau, kaum zufällig, denn im Epheserbrief ist beides Bild für die Kirche 

2 75-04599.txt Hinweis auf sprachliche und metrische Parallelen unter anderem bei Leonidas 30 HE (AP 9, 24) und bei Asklepiades 6 HE (AP 5, 203). 

3 75-04382.txt Anhand von Epist. 6, 21 ; 7, 17 ; 7, 9 und 2, 3
4 75-10152.txt Nach S

In [152]:
selected_negative_documents = list(df[df.n_entities==0][:95]["filename"])

In [153]:
len(selected_negative_documents)

95
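
The slice above takes the first 95 negative rows in DataFrame order. If a random draw is preferred instead, a reproducible sample could be obtained along the lines sketched below. The toy frame, the seed and the sample size are illustrative assumptions, not part of the original selection.

```python
import pandas as pd

# Toy stand-in for `df`; the values are invented for illustration.
toy = pd.DataFrame({
    "filename": ["doc%02d.txt" % i for i in range(10)],
    "n_entities": [0, 2, 0, 0, 1, 0, 0, 3, 0, 0],
})

negatives = toy[toy.n_entities == 0]
# random_state fixes the seed, so the same documents are drawn on every run
sampled = negatives.sample(n=3, random_state=42)
sampled_negative_documents = list(sampled["filename"])
```

Fixing the seed keeps the selection stable across reruns of the notebook, which matters once the chosen files have been copied elsewhere.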

In [161]:
for filename in selected_negative_documents+selected_positive_documents:
    df.loc[df.filename==filename,'selected'] = True

In [168]:
df.to_csv("/home/romanell/myfiles/aph_testset_selection.csv")

In [171]:
!grep -c True /home/romanell/myfiles/aph_testset_selection.csv

159


In [189]:
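for doc_id in list(df[df.selected]["filename"]):
    # the annotation file and its companion text file both live in ann/
    ann_file = "%s%s%s"%(anndir,doc_id,'-doc-1.ann')
    txt_file = "%s%s%s"%(anndir,doc_id,'-doc-1.txt')
    # the IOB file and the original text are named after the document id
    iob_file = "%s%s"%(iobdir,doc_id)
    orig_file = "%s%s"%(txtdir,doc_id)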
    shutil.copy(ann_file,"%s%s"%(testdir,'ann/'))
    shutil.copy(txt_file,"%s%s"%(testdir,'ann/'))
    shutil.copy(iob_file,"%s%s"%(testdir,'iob/'))
    shutil.copy(orig_file,"%s%s"%(testdir,'txt/'))
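
The copy step above assumes that the `ann/`, `iob/` and `txt/` subdirectories already exist under `testdir`. A minimal sketch of the same layout with explicit directory creation is given below, run against temporary directories so that it does not touch the corpus. The helper `copy_document`, the `layout` function and the document identifier are hypothetical.

```python
import os
import shutil
import tempfile


def layout(doc_id):
    """Hypothetical helper: (subdirectory, file name) pairs for one document."""
    return [
        ("ann", doc_id + "-doc-1.ann"),
        ("ann", doc_id + "-doc-1.txt"),
        ("iob", doc_id),
        ("txt", doc_id),
    ]


def copy_document(doc_id, src_base, dst_base):
    """Hypothetical helper: copy the four files of one document."""
    for subdir, name in layout(doc_id):
        dst_dir = os.path.join(dst_base, subdir)
        if not os.path.isdir(dst_dir):
            os.makedirs(dst_dir)  # create the target subdirectory on first use
        shutil.copy(os.path.join(src_base, subdir, name), dst_dir)


# Demonstration on throwaway directories filled with empty placeholder files
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
doc_id = "00-00000.txt"  # made-up identifier
for subdir, name in layout(doc_id):
    src_dir = os.path.join(src, subdir)
    if not os.path.isdir(src_dir):
        os.makedirs(src_dir)
    open(os.path.join(src_dir, name), "w").close()

copy_document(doc_id, src, dst)
copied = sorted(os.listdir(os.path.join(dst, "ann")))
```

Building paths with `os.path.join` also removes the dependence on trailing slashes in the base directory strings.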