**Author:** Mattias Östmar, mattiasostmar@gmail.com

**Published:** 2019-03-36

# Create a pickled pandas DataFrame of texts with paragraph separation

This notebook creates the initial dataframe, in which paragraphs are separated by `\n\n`. In `create_dataframe_unparagraphed_texts_from_validation_bypublisher_xml_file` we use a different method to extract the texts from the XML files in [Hyperpartisan News Detection PAN @ SemEval 2019](https://pan.webis.de/semeval19/semeval19-web/), one in which the `\n\n` between paragraphs is not preserved.

In the notebook `compare_paragraphed_vs_non-paragraphed_validation_bypublisher_texts.ipynb` we compare the results of running the discoursebias software on the two versions of the texts, one with `\n\n` preserved (this one) and one with no paragraph separation, and conclude that the bias index scores for both versions are exactly the same.
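
The comparison itself lives in the other notebook and is not reproduced here. As a rough sanity check before any scoring, one could confirm that two extractions of the same article differ only in whitespace. The sketch below is an illustration under that assumption; the helper names `normalize_whitespace` and `same_words` are hypothetical and are not part of the discoursebias software.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including paragraph breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def same_words(paragraphed, unparagraphed):
    """Return True when the two texts are identical apart from whitespace."""
    return normalize_whitespace(paragraphed) == normalize_whitespace(unparagraphed)


# Made-up example strings, not taken from the dataset.
paragraphed = "Title\n\n First paragraph.\n\n Second paragraph."
unparagraphed = "Title First paragraph. Second paragraph."
print(same_words(paragraphed, unparagraphed))
```

If this check fails for an article, the two extraction methods kept different content, and a difference in scores could not be attributed to paragraph separation alone.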

In [1]:
import xmltodict
import pandas as pd
import json

In [75]:
!ls -la

total 1839152
drwxr-xr-x@  9 mos  staff        288 21 Mar 13:34 .
drwx------@ 28 mos  staff        896 20 Mar 19:56 ..
-rw-r--r--@  1 mos  staff      10244 21 Mar 13:36 .DS_Store
drwxr-xr-x@  3 mos  staff         96 20 Mar 20:15 .ipynb_checkpoints
-rw-r--r--@  1 mos  staff        867 20 Mar 22:14 README.md
-rw-rw-r--@  1 mos  staff  937751369 22 Nov 00:50 articles-validation-bypublisher-20181122.xml
drwxr-xr-x@ 11 mos  staff        352 21 Mar 13:36 compressed_dataset
drwxr-xr-x@ 10 mos  staff        320 21 Mar 13:34 data
-rw-r--r--@  1 mos  staff       5557 20 Mar 22:54 make_nice_dataset.ipynb


In [91]:
!ls -la data

total 5593256
drwxr-xr-x@ 10 mos  staff         320 21 Mar 13:34 .
drwxr-xr-x@  9 mos  staff         288 21 Mar 13:54 ..
-rw-r--r--@  1 mos  staff        6148 21 Mar 13:34 .DS_Store
-rw-r--r--@  1 mos  staff        2101 21 Mar 09:51 article.xsd
-rw-rw-r--@  1 mos  staff     2718431 16 Nov 15:23 articles-training-byarticle-20181122.xml
-rw-rw-r--@  1 mos  staff  2717573604 22 Nov 01:17 articles-training-bypublisher-20181122.xml
-rw-rw-r--@  1 mos  staff      111875 16 Nov 15:25 ground-truth-training-byarticle-20181122.xml
-rw-rw-r--@  1 mos  staff   104765374 22 Nov 00:13 ground-truth-training-bypublisher-20181122.xml
-rw-rw-r--@  1 mos  staff    25504989 22 Nov 00:12 ground-truth-validation-bypublisher-20181122.xml
-rw-r--r--@  1 mos  staff        1628 21 Mar 09:52 ground-truth.xsd


In [84]:
article_xsd = xmltodict.parse(open("./compressed_dataset/article.xsd").read())
print(json.dumps(article_xsd, indent=4))

{
    "xs:schema": {
        "@xmlns:xs": "http://www.w3.org/2001/XMLSchema",
        "xs:group": {
            "@name": "articleContent",
            "xs:choice": {
                "xs:element": [
                    {
                        "@name": "p",
                        "xs:complexType": {
                            "@mixed": "true",
                            "xs:choice": {
                                "@minOccurs": "0",
                                "@maxOccurs": "unbounded",
                                "xs:group": {
                                    "@ref": "articleContent"
                                }
                            }
                        }
                    },
                    {
                        "@name": "q",
                        "xs:complexType": {
                            "@mixed": "true",
                            "xs:choice": {
                                "@minOccurs": "0",
                                "@m

In [85]:
ground_truth_xsd = xmltodict.parse(open("./compressed_dataset/ground-truth.xsd").read())
print(json.dumps(ground_truth_xsd, indent=4))

{
    "xs:schema": {
        "@xmlns:xs": "http://www.w3.org/2001/XMLSchema",
        "xs:element": {
            "@name": "articles",
            "xs:complexType": {
                "xs:sequence": {
                    "xs:element": {
                        "@name": "article",
                        "@minOccurs": "1",
                        "@maxOccurs": "unbounded",
                        "xs:complexType": {
                            "xs:attribute": [
                                {
                                    "@name": "id",
                                    "@use": "required",
                                    "xs:simpleType": {
                                        "xs:restriction": {
                                            "@base": "xs:string",
                                            "xs:pattern": {
                                                "@value": "[0-9]+"
                                            }
                                        }

In [114]:
texts = xmltodict.parse(open("data/articles-validation-bypublisher-20181122.xml").read())

ids = []
articles = []

texts_df = None

for article in texts["articles"]["article"]:
    paras = []
    
    str_id = article["@id"]
    ids.append(str_id)
    
    title = article["@title"]
    paras.append(title)
    #print("--------------------- {} ----------------".format(str_id))
    
    # Keep only plain-text paragraphs; this drops meta-information such as image data.
    text = None
    if "p" in article and article["p"] is not None:
        p_items = article["p"]
        if not isinstance(p_items, list):  # xmltodict returns a lone <p> as a bare str or dict
            p_items = [p_items]
        for para in p_items:
            if isinstance(para, str):  # Skip non-string items, e.g. collections.OrderedDict
                paras.append(para)
        text = "\n\n ".join(paras)  # Join once, after all paragraphs are collected
    else:
        print("No `p` tags in article {}".format(str_id))
        text = article["#text"]
        if len(text) < 5:
            print("Suspiciously little #text in article {}".format(str_id))
    articles.append(text)

# Build and pickle the dataframe once, after the loop, instead of on every iteration.
texts_df = pd.DataFrame({"id":ids, "text":articles})
texts_df.to_pickle("paragraphed_article_validation_bypublisher_20181122.pickle")

No `p` tags in article 0000339
No `p` tags in article 0000495
No `p` tags in article 0000798
No `p` tags in article 0001403
No `p` tags in article 0004217
No `p` tags in article 0004380
No `p` tags in article 0005526
No `p` tags in article 0005799
No `p` tags in article 0006557
No `p` tags in article 0007987
No `p` tags in article 0008490
No `p` tags in article 0009348
No `p` tags in article 0010433
No `p` tags in article 0010854
No `p` tags in article 0011181
No `p` tags in article 0012081
No `p` tags in article 0013531
No `p` tags in article 0013716
No `p` tags in article 0013831
No `p` tags in article 0014319
No `p` tags in article 0014971
No `p` tags in article 0016026
No `p` tags in article 0016777
No `p` tags in article 0016981
No `p` tags in article 0017177
No `p` tags in article 0017876
No `p` tags in article 0018057
No `p` tags in article 0019201
No `p` tags in article 0019335
No `p` tags in article 0019587
No `p` tags in article 0019753
No `p` tags in article 0020264
No `p` t

No `p` tags in article 0196368
No `p` tags in article 0196866
No `p` tags in article 0198526
No `p` tags in article 0198702
No `p` tags in article 0199086
No `p` tags in article 0200562
No `p` tags in article 0201501
No `p` tags in article 0202856
No `p` tags in article 0203014
No `p` tags in article 0203992
No `p` tags in article 0205823
No `p` tags in article 0207108
No `p` tags in article 0207215
No `p` tags in article 0207329
No `p` tags in article 0208939
No `p` tags in article 0209066
No `p` tags in article 0209213
No `p` tags in article 0209314
No `p` tags in article 0209370
No `p` tags in article 0210379
No `p` tags in article 0210445
No `p` tags in article 0210629
No `p` tags in article 0211210
No `p` tags in article 0212574
No `p` tags in article 0213177
No `p` tags in article 0213983
No `p` tags in article 0214181
No `p` tags in article 0215538
No `p` tags in article 0215576
No `p` tags in article 0217228
No `p` tags in article 0217639
No `p` tags in article 0218038
No `p` t

No `p` tags in article 0393643
No `p` tags in article 0395822
No `p` tags in article 0395842
No `p` tags in article 0396157
No `p` tags in article 0396543
No `p` tags in article 0397200
No `p` tags in article 0399523
No `p` tags in article 0399829
No `p` tags in article 0401001
No `p` tags in article 0401184
No `p` tags in article 0402135
No `p` tags in article 0402606
No `p` tags in article 0403161
No `p` tags in article 0403282
No `p` tags in article 0403448
No `p` tags in article 0404331
No `p` tags in article 0405595
No `p` tags in article 0406803
No `p` tags in article 0407373
No `p` tags in article 0407774
No `p` tags in article 0407965
No `p` tags in article 0408274
No `p` tags in article 0408359
No `p` tags in article 0408554
No `p` tags in article 0408564
No `p` tags in article 0411368
No `p` tags in article 0412732
No `p` tags in article 0413113
No `p` tags in article 0413474
No `p` tags in article 0414457
No `p` tags in article 0414485
No `p` tags in article 0416579
No `p` t

No `p` tags in article 0621632
No `p` tags in article 0622048
No `p` tags in article 0622547
No `p` tags in article 0623213
No `p` tags in article 0623930
No `p` tags in article 0624192
No `p` tags in article 0624367
No `p` tags in article 0626833
No `p` tags in article 0627384
No `p` tags in article 0627885
No `p` tags in article 0629190
No `p` tags in article 0630058
No `p` tags in article 0632043
No `p` tags in article 0632108
No `p` tags in article 0632564
No `p` tags in article 0632776
No `p` tags in article 0632851
No `p` tags in article 0633675
No `p` tags in article 0634389
No `p` tags in article 0635611
No `p` tags in article 0636705
No `p` tags in article 0637279
No `p` tags in article 0637519
No `p` tags in article 0638440
No `p` tags in article 0639324
No `p` tags in article 0639736
No `p` tags in article 0640496
No `p` tags in article 0640703
No `p` tags in article 0640902
No `p` tags in article 0642046
No `p` tags in article 0642187
No `p` tags in article 0642808
No `p` t

No `p` tags in article 0817006
No `p` tags in article 0817049
No `p` tags in article 0817463
No `p` tags in article 0818258
No `p` tags in article 0818365
No `p` tags in article 0818800
No `p` tags in article 0819879
No `p` tags in article 0821760
No `p` tags in article 0823109
No `p` tags in article 0823429
No `p` tags in article 0824097
No `p` tags in article 0825064
No `p` tags in article 0826878
No `p` tags in article 0828956
No `p` tags in article 0829475
No `p` tags in article 0829747
No `p` tags in article 0831474
No `p` tags in article 0833265
No `p` tags in article 0833881
No `p` tags in article 0834179
No `p` tags in article 0834458
No `p` tags in article 0835580
No `p` tags in article 0835784
No `p` tags in article 0838124
No `p` tags in article 0838220
No `p` tags in article 0838385
No `p` tags in article 0839045
No `p` tags in article 0839915
No `p` tags in article 0840569
No `p` tags in article 0842161
No `p` tags in article 0842240
No `p` tags in article 0842878
No `p` t

No `p` tags in article 1046098
No `p` tags in article 1046107
No `p` tags in article 1047497
No `p` tags in article 1047842
No `p` tags in article 1048133
No `p` tags in article 1049520
No `p` tags in article 1050845
No `p` tags in article 1051708
No `p` tags in article 1052155
No `p` tags in article 1052799
No `p` tags in article 1053327
No `p` tags in article 1053744
No `p` tags in article 1053865
No `p` tags in article 1054657
No `p` tags in article 1054856
No `p` tags in article 1057048
No `p` tags in article 1057847
No `p` tags in article 1058195
No `p` tags in article 1058256
No `p` tags in article 1059101
No `p` tags in article 1059260
No `p` tags in article 1059569
No `p` tags in article 1060162
No `p` tags in article 1060423
No `p` tags in article 1061142
No `p` tags in article 1062360
No `p` tags in article 1062361
No `p` tags in article 1066354
No `p` tags in article 1066922
No `p` tags in article 1067579
No `p` tags in article 1067785
No `p` tags in article 1068187
No `p` t

No `p` tags in article 1253431
No `p` tags in article 1254689
No `p` tags in article 1255324
No `p` tags in article 1255414
No `p` tags in article 1255690
No `p` tags in article 1256400
No `p` tags in article 1256913
No `p` tags in article 1257330
No `p` tags in article 1258551
No `p` tags in article 1260226
No `p` tags in article 1260399
No `p` tags in article 1260648
No `p` tags in article 1260717
No `p` tags in article 1261169
No `p` tags in article 1262508
No `p` tags in article 1263427
No `p` tags in article 1264207
No `p` tags in article 1264465
No `p` tags in article 1265360
No `p` tags in article 1265493
No `p` tags in article 1265672
No `p` tags in article 1265916
No `p` tags in article 1266333
No `p` tags in article 1267345
No `p` tags in article 1267409
No `p` tags in article 1267421
No `p` tags in article 1268140
No `p` tags in article 1268384
No `p` tags in article 1268397
No `p` tags in article 1268871
No `p` tags in article 1272337
No `p` tags in article 1272583
No `p` t

No `p` tags in article 1446386
No `p` tags in article 1446401
No `p` tags in article 1449240
No `p` tags in article 1449822
No `p` tags in article 1451666
No `p` tags in article 1453731
No `p` tags in article 1453820
No `p` tags in article 1454143
No `p` tags in article 1455286
No `p` tags in article 1456481
No `p` tags in article 1458007
No `p` tags in article 1458315
No `p` tags in article 1459684
No `p` tags in article 1460988
No `p` tags in article 1461859
No `p` tags in article 1463078
No `p` tags in article 1463446
No `p` tags in article 1464678
No `p` tags in article 1464898
No `p` tags in article 1465033
No `p` tags in article 1465934
No `p` tags in article 1468623
No `p` tags in article 1469538
No `p` tags in article 1471381
No `p` tags in article 1472006
No `p` tags in article 1473174
No `p` tags in article 1473748
No `p` tags in article 1474316
No `p` tags in article 1475582
No `p` tags in article 1476647
No `p` tags in article 1477592
No `p` tags in article 1479125
No `p` t
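
The loop above iterates over `article["p"]`, so it matters what shape xmltodict gives that value: a repeated child element becomes a list, a single child becomes a bare string (or a dict when it has attributes or nested elements), and a missing child is simply absent. The sketch below shows one way to normalise those shapes before iterating. The helpers `as_list` and `plain_paragraphs` are hypothetical names, and the dicts are hand-written imitations of parsed articles, not rows from the dataset.

```python
def as_list(value):
    """Wrap a lone item in a list; map None to an empty list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def plain_paragraphs(article):
    """Return only the <p> items that are plain strings."""
    return [p for p in as_list(article.get("p")) if isinstance(p, str)]


# Hand-written dicts imitating the three shapes (not real dataset rows).
many = {"@id": "1", "p": ["First.", {"a": {"@href": "#", "#text": "link"}}, "Second."]}
single = {"@id": "2", "p": "Only one."}
missing = {"@id": "3", "#text": "Bare text."}

print(plain_paragraphs(many))     # ['First.', 'Second.']
print(plain_paragraphs(single))   # ['Only one.']
print(plain_paragraphs(missing))  # []
```

Without such a guard, iterating over a lone string yields its characters one by one. xmltodict's `force_list` argument to `parse` is another way to get a list for a chosen tag every time.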

In [115]:
print(texts_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 2 columns):
id      150000 non-null object
text    150000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
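
With 150,000 articles in an input file of roughly 940 MB, reading the whole file into a string and parsing it in one go keeps everything in memory at once. A possible alternative, sketched below, streams the file with the standard library's `xml.etree.ElementTree.iterparse` instead of xmltodict. This is a swapped-in technique, not the method used in this notebook, and it behaves differently in two ways: paragraphs containing nested tags are kept with their inner text rather than dropped, and paragraphs are joined with `\n\n` with no trailing space. The function name `iter_articles` and the sample XML are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield (id, text) pairs one article at a time from a path or binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        # Direct <p> children only, so nested paragraphs are not counted twice.
        paragraphs = []
        for p in elem.findall("p"):
            para = "".join(p.itertext()).strip()
            if para:
                paragraphs.append(para)
        if paragraphs:
            text = "\n\n".join([elem.get("title", "")] + paragraphs)
        else:
            # No <p> tags: fall back to the article's bare text.
            text = "".join(elem.itertext()).strip()
        yield elem.get("id"), text
        elem.clear()  # Release the children of the article just processed


# Tiny made-up sample in the same general shape as the articles file.
sample = b"""<articles>
  <article id="0000001" title="First title"><p>One.</p><p>Two <a href="#">link</a>.</p></article>
  <article id="0000002" title="Second title">Bare text only.</article>
</articles>"""

rows = list(iter_articles(io.BytesIO(sample)))
print(rows)
```

The resulting pairs could be collected into two lists and passed to `pd.DataFrame` as in the cell above.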


In [116]:
texts_df.head(3)

|   | id | text |
|---|----|------|
| 0 | 17 | SAN FRANCISCO / Head of Juvenile Probation Dep... |
| 1 | 23 | University leaders ban pro-life flag display, ... |
| 2 | 29 | DONALD TRUMP, GET YOUR TINY PIGGY PERVERT HAND... |


In [88]:
train = xmltodict.parse(open("data/ground-truth-training-byarticle-20181122.xml").read())

train_df = None

hyperpartisans = []
ids = []
labeled_bys = []
urls = []


cnt = 0
for article in train["articles"]["article"]:
    hyperpartisan = article["@hyperpartisan"]
    hyperpartisans.append(hyperpartisan)
    
    str_id = article["@id"]
    ids.append(str_id)
    
    labeled_by = article["@labeled-by"]
    labeled_bys.append(labeled_by)
    
    url = article["@url"]
    urls.append(url)
    
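    cnt += 1
    if cnt >= 3:  # Only sample the first three articles as a quick check
        break

# Build the dataframe after the loop so it also exists if fewer than three articles are found.
train_df = pd.DataFrame({"id":ids,
                         "hyperpartisan":hyperpartisans,
                         "labeled_by":labeled_bys,
                         "url":urls})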

In [74]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
id               3 non-null object
hyperpartisan    3 non-null object
labeled_by       3 non-null object
url              3 non-null object
dtypes: object(4)
memory usage: 176.0+ bytes
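
All four columns above have dtype `object` because the attributes are read as strings. The sketch below shows one way to read the same four attributes with the standard library and convert the label to a boolean on the way in. It assumes that `hyperpartisan` holds the strings `true` or `false`; the function name `read_ground_truth`, its `limit` parameter, and the sample values are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def read_ground_truth(source, limit=None):
    """Return a list of row dicts, stopping after `limit` articles if given."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        rows.append({
            "id": elem.get("id"),
            # Assumption: the attribute is the string "true" or "false".
            "hyperpartisan": elem.get("hyperpartisan") == "true",
            "labeled_by": elem.get("labeled-by"),
            "url": elem.get("url"),
        })
        if limit is not None and len(rows) >= limit:
            break
    return rows


# Made-up sample values; only the attribute names come from the notebook.
sample = b"""<articles>
  <article id="0000001" hyperpartisan="true" labeled-by="article" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" labeled-by="article" url="https://example.com/b"/>
  <article id="0000003" hyperpartisan="true" labeled-by="article" url="https://example.com/c"/>
</articles>"""

rows = read_ground_truth(io.BytesIO(sample), limit=2)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give a frame with the same four columns, with `hyperpartisan` as a boolean column.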
