# Hyperpartisan News Detection

Task: https://webis.de/events/semeval-19/

Dataset: https://zenodo.org/record/5776081

Paper: https://aclanthology.org/S19-2145/

Training, validation, and test data for the PAN @ SemEval 2019 Task 4: Hyperpartisan News Detection.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/LAP/Subjects/AP1/project/hyperpartisan-news-detection

/content/drive/MyDrive/LAP/Subjects/AP1/project/hyperpartisan-news-detection


In [3]:
! pip install datasets

Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)

In [None]:
!pip install scattertext

Collecting scattertext
  Downloading scattertext-0.1.5-py3-none-any.whl (7.3 MB)
Collecting gensim>=4.0.0
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Collecting mock
  Downloading mock-4.0.3-py3-none-any.whl (28 kB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9309 sha256=3cffdc4a34938d79c6c3bae0f4bf4e7f727151dcfe3c69e717823c9f85541f1c
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: mock, gensim, flashtext, scattertext
  Attempting uninstall: gensim
    Found existing installation: gens

## Load Dataset

The data is split into multiple files. The articles are contained in the files with names starting with "articles-" (which validate against the XML schema article.xsd). The ground-truth information is contained in the files with names starting with "ground-truth-" (which validate against the XML schema ground-truth.xsd).
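The pairing of article files and ground-truth files described above can be sketched with the standard library. This is a minimal illustration on two tiny in-memory XML strings, not the loader used in this notebook: the element and attribute names (`article`, `id`, `title`, `hyperpartisan`) are assumptions and should be checked against article.xsd and ground-truth.xsd, and `join_articles_with_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for an "articles-*.xml" and a "ground-truth-*.xml" file.
# Element and attribute names are assumptions; verify them against the XSDs.
ARTICLES_XML = """<articles>
  <article id="0000001" published-at="2017-09-10" title="Example A"><p>First body.</p></article>
  <article id="0000002" published-at="2017-09-11" title="Example B"><p>Second body.</p></article>
</articles>"""

GROUND_TRUTH_XML = """<articles>
  <article id="0000001" hyperpartisan="true" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" url="https://example.com/b"/>
</articles>"""


def join_articles_with_labels(articles_xml, ground_truth_xml):
    """Return one dict per article, joined with its label on the shared id."""
    labels = {
        node.get("id"): node.get("hyperpartisan") == "true"
        for node in ET.fromstring(ground_truth_xml).iter("article")
    }
    rows = []
    for node in ET.fromstring(articles_xml).iter("article"):
        rows.append({
            "id": node.get("id"),
            "title": node.get("title"),
            # itertext() concatenates all text nested inside the element
            "text": "".join(node.itertext()).strip(),
            "hyperpartisan": labels[node.get("id")],
        })
    return rows


rows = join_articles_with_labels(ARTICLES_XML, GROUND_TRUTH_XML)
for row in rows:
    print(row["id"], row["hyperpartisan"], row["title"])
```

For the real files, which are far larger, a streaming parser such as `ET.iterparse` would likely be a better fit than loading each file into memory at once.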

In [None]:
from datasets import load_dataset

### By Publisher

The first part of the data (filename contains "bypublisher") is labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com. It contains a total of 750,000 articles, half of which (375,000) are hyperpartisan and half of which are not. Half of the articles that are hyperpartisan (187,500) are on the left side of the political spectrum, half are on the right side. This data is split into a training set (80%, 600,000 articles) and a validation set (20%, 150,000 articles), where no publisher that occurs in the training set also occurs in the validation set. Similarly, none of the publishers in those sets occurs in the test set.
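The publisher-disjoint property of the splits can be spot-checked from the `url` field. The sketch below assumes that the host name of a URL is an adequate stand-in for the publisher, which may not hold for publishers with several domains; `publisher_of` and `shared_publishers` are hypothetical helpers and the URLs are toy values, not taken from the dataset.

```python
from urllib.parse import urlparse


def publisher_of(url):
    """Use the URL's host name (minus a leading 'www.') as a stand-in for the publisher."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def shared_publishers(urls_a, urls_b):
    """Return the publishers that occur in both URL collections."""
    return {publisher_of(u) for u in urls_a} & {publisher_of(u) for u in urls_b}


# Toy URLs (hypothetical), not taken from the dataset.
train_urls = ["https://www.example-left.com/a", "https://example-right.org/b"]
validation_urls = ["https://example-center.net/c"]
leaky_urls = ["https://example-right.org/z"]

print(shared_publishers(train_urls, validation_urls))  # no overlap expected
print(shared_publishers(train_urls, leaky_urls))       # one shared publisher

# The split sizes quoted above follow directly from the 80/20 ratio.
total_articles = 750_000
print(total_articles * 80 // 100, total_articles * 20 // 100)
```

Once the dataset is loaded, the same check could be run on the `url` columns of the train and validation splits; an empty result would be consistent with the description above.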

In [25]:
dataset_bypublisher = load_dataset('hyperpartisan_news_detection.py', "bypublisher")

Downloading and preparing dataset hyperpartisan_news_detection/bypublisher to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/bypublisher/1.0.0/bf7007ea142fcae75583f28ee2160e33cffd758d65ffc750b22125e9fe1aa04e...



Dataset hyperpartisan_news_detection downloaded and prepared to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/bypublisher/1.0.0/bf7007ea142fcae75583f28ee2160e33cffd758d65ffc750b22125e9fe1aa04e. Subsequent calls will reuse this data.



In [26]:
dataset_bypublisher

DatasetDict({
    train: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 600000
    })
    test: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 4000
    })
    validation: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 150000
    })
})

In [27]:
dataset_bypublisher["train"][0]

{'bias': 0,
 'hyperpartisan': True,
 'published_at': '2017-09-10',
 'text': '<p>When explaining her decision to reevaluate Title IX guidelines as they pertain to sexual assault on college campuses, Secretary of Education Betsy DeVos <a href="https://www.nbcnews.com/news/us-news/betsy-devos-overhaul-obama-era-guidance-campus-sex-assault-n799471" type="external">said</a>: &#8220;Every survivor of sexual misconduct must be taken seriously. Every student accused of sexual misconduct must know that guilt is not predetermined.&#8221;</p> \n\n<p>The Obama administration&#8217;s changes to Title IX have been <a href="" type="internal">criticized</a> for, among other things, substantially lowering the burden of proof as it pertains to sexual assault, as well as denying elements of due process to the accused.</p> \n\n<p>However, many progressives are lashing out at DeVos because they hate her, and also rape culture and stuff.</p> \n\n<p>Perhaps the most grotesque attack came when Rob Ranco, a Te

In [28]:
dataset_bypublisher["train"]["text"][:10]

['<p>When explaining her decision to reevaluate Title IX guidelines as they pertain to sexual assault on college campuses, Secretary of Education Betsy DeVos <a href="https://www.nbcnews.com/news/us-news/betsy-devos-overhaul-obama-era-guidance-campus-sex-assault-n799471" type="external">said</a>: &#8220;Every survivor of sexual misconduct must be taken seriously. Every student accused of sexual misconduct must know that guilt is not predetermined.&#8221;</p> \n\n<p>The Obama administration&#8217;s changes to Title IX have been <a href="" type="internal">criticized</a> for, among other things, substantially lowering the burden of proof as it pertains to sexual assault, as well as denying elements of due process to the accused.</p> \n\n<p>However, many progressives are lashing out at DeVos because they hate her, and also rape culture and stuff.</p> \n\n<p>Perhaps the most grotesque attack came when Rob Ranco, a Texas attorney, tweeted Friday that &#8220;I\'m not wishing for it &#8230; bu
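The `text` values shown above still contain HTML tags and numeric character references such as `&#8220;`. A minimal cleaning pass, assuming that a regex-based tag stripper is good enough for exploratory analysis, might look like the following; `strip_markup` is a hypothetical helper and is not part of the dataset loader.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def strip_markup(raw_text):
    """Remove HTML tags, decode character references, and collapse whitespace."""
    # Replace tags with a space so adjacent paragraphs do not run together.
    without_tags = TAG_PATTERN.sub(" ", raw_text)
    decoded = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", decoded).strip()


sample = (
    '<p>DeVos <a href="" type="internal">said</a>: '
    "&#8220;Every survivor&#8221;</p> \n\n<p>Next paragraph.</p>"
)
print(strip_markup(sample))
```

Because tags are replaced by a space, punctuation that directly follows a closing tag ends up preceded by a space. Doubly escaped references such as `&amp;#160;` are only decoded one level by a single `html.unescape` call and would need a second pass.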

### By Article

The second part of the data (filename contains "byarticle") is labeled through crowdsourcing on an article basis. The data contains only articles for which a consensus among the crowdsourcing workers existed. It contains a total of 645 articles. Of these, 238 (37%) are hyperpartisan and 407 (63%) are not. We will use a similar (but balanced!) test set. Again, none of the publishers in this set occurs in the test set.
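The class shares quoted above can be reproduced with a few lines. The sketch uses synthetic labels built from the stated counts rather than the loaded dataset, and `label_shares` is a hypothetical helper.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to a (count, share of the total) pair."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}


# Synthetic labels with the counts stated above: 238 hyperpartisan, 407 not.
labels = [True] * 238 + [False] * 407
shares = label_shares(labels)
for label, (count, share) in shares.items():
    print(f"hyperpartisan={label}: {count} articles ({share:.0%})")
```

The same function could be applied to the `hyperpartisan` column of a loaded split to confirm that the training data is imbalanced.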

In [None]:
dataset_byarticle = load_dataset('hyperpartisan_news_detection.py', "byarticle")

Downloading and preparing dataset hyperpartisan_news_detection/byarticle to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/bf7007ea142fcae75583f28ee2160e33cffd758d65ffc750b22125e9fe1aa04e...



Dataset hyperpartisan_news_detection downloaded and prepared to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/bf7007ea142fcae75583f28ee2160e33cffd758d65ffc750b22125e9fe1aa04e. Subsequent calls will reuse this data.



In [None]:
dataset_byarticle

DatasetDict({
    train: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at'],
        num_rows: 645
    })
    test: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at'],
        num_rows: 628
    })
})

In [None]:
dataset_byarticle["train"][0]

{'hyperpartisan': True,
 'published_at': '2017-09-10',
 'text': '<p>Money ( <a href="https://farm8.static.flickr.com/7020/6551534889_9c8ae52997.jpg" type="external">Image</a> by <a href="https://www.flickr.com/people/68751915@N05/" type="external">401(K) 2013</a>) <a href="https://creativecommons.org/licenses/by-sa/2.0/" type="external">Permission</a> <a type="internal">Details</a> <a type="internal">DMCA</a></p> No Pill Can Stop Tinnitus, But This 1 Weird Trick Can \n<p>The walls are closing in on Congress.</p> \n<p>Terrifying walls of water from Hurricanes Harvey and Irma, which, when the damage is totaled, could rise to a half trillion dollars. The Walls of War: The multi-trillion dollar ongoing cost of Afghanistan, Iraq and other interventions. The crumbling walls of the U.S. infrastructure, which need at least $3 trillion to be repaired or replaced. A wall of 11 million undocumented immigrants, whose deportation could easily cost $200 billion. The planned wall at the Mexican borde

## Analyze Data

In [9]:
import scattertext as st

### By Publisher

In [29]:
bypublisher_test_df = dataset_bypublisher["test"].to_pandas()

In [None]:
bypublisher_test_df

#### Hyperpartisan

In [31]:
bypublisher_test_df["hyperpartisan"] = bypublisher_test_df["hyperpartisan"].apply(str)

In [32]:
bypublisher_test_corpus = st.CorpusFromPandas(bypublisher_test_df, 
                              category_col='hyperpartisan', 
                              text_col='text').build()

In [33]:
print(list(bypublisher_test_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'obama', 'facebook', 'twitter', 'hillary', 'p', 'barack', 'maplight', 'obamacare', 'isis']


In [34]:
term_freq_df = bypublisher_test_corpus.get_term_freq_df()
term_freq_df['Hyperpartisan Score'] = bypublisher_test_corpus.get_scaled_f_scores("True")
print(list(term_freq_df.sort_values(by='Hyperpartisan Score', ascending=False).index[:10]))

['2015in', '2016in', 'sanders', 'bernie', 'amp;#160;|', 'amp;#160;| &', '& amp;#160;|', 'rob', 'amp;#160 a', 'liberals']


In [35]:
term_freq_df['Non-Hyperpartisan Score'] = bypublisher_test_corpus.get_scaled_f_scores("False")
print(list(term_freq_df.sort_values(by='Non-Hyperpartisan Score', ascending=False).index[:10]))

['contributions', 'maplight', '# 183', '183', 'firms', 'firm', 'promoted</p', 'associate', 'alabama', 'comments</a></p']


In [41]:
html = st.produce_scattertext_explorer(bypublisher_test_corpus,
          category="True",
          category_name='Hyperpartisan',
          not_category_name='Non-hyperpartisan',
          width_in_pixels=1000)
open("by_publisher.html", 'wb').write(html.encode('utf-8'))

25256764

#### Bias

In [44]:
bypublisher_test_corpus_bias = st.CorpusFromPandas(bypublisher_test_df, 
                              category_col='bias', 
                              text_col='text').build()

In [45]:
print(list(bypublisher_test_corpus_bias.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'obama', 'facebook', 'twitter', 'hillary', 'p', 'barack', 'maplight', 'obamacare', 'isis']


In [46]:
term_freq_df = bypublisher_test_corpus_bias.get_term_freq_df()
term_freq_df['Right Score'] = bypublisher_test_corpus_bias.get_scaled_f_scores(0)
print(list(term_freq_df.sort_values(by='Right Score', ascending=False).index[:10]))

['content.ad', 'modal&amp;amp;utm_source =', 'modal&amp;amp;utm_source', '= modal&amp;amp;utm_source', 'family friendly', 'just happened', 'happened to', 'content</p p', 'friendly content</p', 'content</p']


In [48]:
# Score against the bias corpus; bypublisher_test_corpus is keyed on the hyperpartisan column.
term_freq_df['Left Score'] = bypublisher_test_corpus_bias.get_scaled_f_scores(4)
print(list(term_freq_df.sort_values(by='Left Score', ascending=False).index[:10]))

In [None]:
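# Bias code 0 is the class scored as 'Right Score' above, so it is labeled 'Right' here;
# every other bias code falls into the not-category.
html = st.produce_scattertext_explorer(bypublisher_test_corpus_bias,
          category=0,
          category_name='Right',
          not_category_name='Not right',
          width_in_pixels=1000)
open("by_publisher_bias.html", 'wb').write(html.encode('utf-8'))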
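The integer bias codes are easy to mix up, so a readable mapping can help when labeling plots. In the sketch below, codes 0 and 4 follow the 'Right Score' and 'Left Score' cells above; the three middle names and their order are an assumption and should be verified against the `names` of the loaded dataset's `bias` feature. `BIAS_NAMES` and `bias_name` are hypothetical.

```python
# Assumed order of the integer bias codes. 0 (right) and 4 (left) match the
# score cells above; the three middle names are an assumption to verify.
BIAS_NAMES = ["right", "right-center", "least", "left-center", "left"]


def bias_name(code):
    """Translate an integer bias code into a readable label."""
    if not 0 <= code < len(BIAS_NAMES):
        raise ValueError(f"unknown bias code: {code}")
    return BIAS_NAMES[code]


codes = [0, 4, 2, 0]
print([bias_name(code) for code in codes])
```

Mapping the `bias` column to these strings before building the corpus would make the category labels in the explorer self-describing.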

### By Article

In [10]:
byarticle_test_df = dataset_byarticle["test"].to_pandas()

In [None]:
byarticle_test_df

In [12]:
byarticle_test_df["hyperpartisan"] = byarticle_test_df["hyperpartisan"].apply(str)

In [13]:
byarticle_test_corpus = st.CorpusFromPandas(byarticle_test_df, 
                              category_col='hyperpartisan', 
                              text_col='text').build()

In [14]:
print(list(byarticle_test_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'twitter', 'obama', 'comey', 'bannon', 'hashtag', 'tweeted', 'kaepernick', 'facebook', 'barack']


In [15]:
term_freq_df = byarticle_test_corpus.get_term_freq_df()
term_freq_df['Hyperpartisan Score'] = byarticle_test_corpus.get_scaled_f_scores("True")
print(list(term_freq_df.sort_values(by='Hyperpartisan Score', ascending=False).index[:10]))

['the left', 'class', 'gold', 'ruling', 'israel', 'the ruling', 'conservative', 'ruling class', 'china', 'he&#8217;s']


In [16]:
term_freq_df['Non-Hyperpartisan Score'] = byarticle_test_corpus.get_scaled_f_scores("False")
print(list(term_freq_df.sort_values(by='Non-Hyperpartisan Score', ascending=False).index[:10]))

['href="https://www.businessinsider.com /', 'a href="https://www.businessinsider.com', 'href="https://www.businessinsider.com', 'send free', 'type="external">https://twitter.com /', 'type="external">https://twitter.com', 'ca)</a', 'q p><a', 'o', '|']


In [17]:
html = st.produce_scattertext_explorer(byarticle_test_corpus,
          category="True",
          category_name='Hyperpartisan',
          not_category_name='Non-hyperpartisan',
          width_in_pixels=1000,
          metadata=byarticle_test_df['title'])
open("by_article.html", 'wb').write(html.encode('utf-8'))

4873082

In [22]:
from IPython.display import HTML

In [24]:
HTML(html)
