# Hyperpartisan News Detection

Task: https://webis.de/events/semeval-19/

Dataset: https://zenodo.org/record/5776081

Paper: https://aclanthology.org/S19-2145/

Training, validation, and test data for the PAN @ SemEval 2019 Task 4: Hyperpartisan News Detection.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/master/Applications1/hyperpartisan-news-detection

In [2]:
%cd /content/drive/MyDrive/LAP/Subjects/AP1/project/hyperpartisan-news-detection

/content/drive/MyDrive/LAP/Subjects/AP1/project/hyperpartisan-news-detection


In [3]:
! pip install datasets

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading

In [4]:
!pip install scattertext

Collecting scattertext
  Downloading scattertext-0.1.6-py3-none-any.whl (7.3 MB)
Collecting mock
  Downloading mock-4.0.3-py3-none-any.whl (28 kB)
Collecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Collecting gensim>=4.0.0
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9309 sha256=b2de61d9aaa44282f83754a7c5eb07b61d62600df06d4ddf578150197180df20
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: mock, gensim, flashtext, scattertext
  Attempting uninstall: gensim
    Found existing installation: gen

In [10]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Load Dataset

The data is split into multiple files. The articles are contained in the files with names starting with "articles-" (which validate against the XML schema article.xsd). The ground-truth information is contained in the files with names starting with "ground-truth-" (which validate against the XML schema ground-truth.xsd).
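To make the two-file layout concrete, here is a minimal sketch of joining article bodies with their labels by article `id`. The inline XML snippets, the element name `article`, and the attribute names (`id`, `title`, `hyperpartisan`, `url`) are assumptions for illustration; the authoritative definitions are in article.xsd and ground-truth.xsd.

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-ins for an "articles-*.xml" and a "ground-truth-*.xml" file.
# Element and attribute names are assumed for illustration, not copied from the schemas.
ARTICLES_XML = """<articles>
  <article id="0000001" title="First title"><p>Body one.</p></article>
  <article id="0000002" title="Second title"><p>Body two.</p></article>
</articles>"""

GROUND_TRUTH_XML = """<articles>
  <article id="0000001" hyperpartisan="true" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" url="https://example.com/b"/>
</articles>"""


def join_articles_with_labels(articles_xml, ground_truth_xml):
    """Return one dict per article, combining its text with its label."""
    # Map article id -> boolean label from the ground-truth file.
    labels = {
        node.get("id"): node.get("hyperpartisan") == "true"
        for node in ET.fromstring(ground_truth_xml).iter("article")
    }
    rows = []
    for node in ET.fromstring(articles_xml).iter("article"):
        rows.append({
            "id": node.get("id"),
            "title": node.get("title"),
            # itertext() concatenates the text of all nested elements.
            "text": "".join(node.itertext()).strip(),
            "hyperpartisan": labels[node.get("id")],
        })
    return rows


rows = join_articles_with_labels(ARTICLES_XML, GROUND_TRUTH_XML)
print(rows[0])
```

The loading script used in this notebook already performs this kind of join, so the sketch is only meant to show what the raw files contain.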

In [6]:
from datasets import load_dataset

### By Publisher

The first part of the data (filename contains "bypublisher") is labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com. It contains a total of 750,000 articles, half of which (375,000) are hyperpartisan and half of which are not. Half of the articles that are hyperpartisan (187,500) are on the left side of the political spectrum, half are on the right side. This data is split into a training set (80%, 600,000 articles) and a validation set (20%, 150,000 articles), where no publisher that occurs in the training set also occurs in the validation set. Similarly, none of the publishers in those sets occurs in the test set.
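The publisher-disjoint property of the splits can be spot-checked from the `url` column. The following is a rough sketch that treats the host name as a proxy for the publisher, which is an assumption (the official split was made on publisher identity, not on host names); the helper names and example URLs are hypothetical.

```python
from urllib.parse import urlparse


def publisher_of(url):
    """Approximate the publisher by the lowercased host name without 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def shared_publishers(urls_a, urls_b):
    """Return the publishers that appear in both URL collections."""
    return {publisher_of(u) for u in urls_a} & {publisher_of(u) for u in urls_b}


# Hypothetical URLs standing in for the 'url' column of two splits.
train_urls = ["http://www.example-left.com/a", "http://example-right.com/b"]
validation_urls = ["http://example-news.org/c"]

# An empty set means the two splits share no publisher.
print(shared_publishers(train_urls, validation_urls))
```

On the real data, the two arguments would be `dataset_bypublisher["train"]["url"]` and `dataset_bypublisher["validation"]["url"]`.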

In [None]:
dataset_bypublisher = load_dataset('hyperpartisan_news_detection.py', "bypublisher")

In [None]:
dataset_bypublisher

DatasetDict({
    train: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 600000
    })
    test: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 4000
    })
    validation: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at', 'bias'],
        num_rows: 150000
    })
})

In [None]:
dataset_bypublisher["train"][0]

{'bias': 0,
 'hyperpartisan': True,
 'published_at': '2017-09-10',
 'text': '<p>When explaining her decision to reevaluate Title IX guidelines as they pertain to sexual assault on college campuses, Secretary of Education Betsy DeVos <a href="https://www.nbcnews.com/news/us-news/betsy-devos-overhaul-obama-era-guidance-campus-sex-assault-n799471" type="external">said</a>: &#8220;Every survivor of sexual misconduct must be taken seriously. Every student accused of sexual misconduct must know that guilt is not predetermined.&#8221;</p> \n\n<p>The Obama administration&#8217;s changes to Title IX have been <a href="" type="internal">criticized</a> for, among other things, substantially lowering the burden of proof as it pertains to sexual assault, as well as denying elements of due process to the accused.</p> \n\n<p>However, many progressives are lashing out at DeVos because they hate her, and also rape culture and stuff.</p> \n\n<p>Perhaps the most grotesque attack came when Rob Ranco, a Te

In [None]:
dataset_bypublisher["train"]["text"][:10]

['<p>When explaining her decision to reevaluate Title IX guidelines as they pertain to sexual assault on college campuses, Secretary of Education Betsy DeVos <a href="https://www.nbcnews.com/news/us-news/betsy-devos-overhaul-obama-era-guidance-campus-sex-assault-n799471" type="external">said</a>: &#8220;Every survivor of sexual misconduct must be taken seriously. Every student accused of sexual misconduct must know that guilt is not predetermined.&#8221;</p> \n\n<p>The Obama administration&#8217;s changes to Title IX have been <a href="" type="internal">criticized</a> for, among other things, substantially lowering the burden of proof as it pertains to sexual assault, as well as denying elements of due process to the accused.</p> \n\n<p>However, many progressives are lashing out at DeVos because they hate her, and also rape culture and stuff.</p> \n\n<p>Perhaps the most grotesque attack came when Rob Ranco, a Texas attorney, tweeted Friday that &#8220;I\'m not wishing for it &#8230; bu

### By Article

The second part of the data (filename contains "byarticle") is labeled through crowdsourcing on a per-article basis. It contains only articles for which the crowdsourcing workers reached a consensus, 645 articles in total. Of these, 238 (37%) are hyperpartisan and 407 (63%) are not. We will use a similar (but balanced!) test set. Again, none of the publishers in this set occurs in the test set.
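The class proportions quoted above can be reproduced with a few lines. This is a small sketch on a synthetic label list built from the stated counts; `class_balance` is a hypothetical helper, not part of the dataset tooling.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, rounded percentage)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total)) for label, n in counts.items()}


# Synthetic labels matching the counts stated for the by-article training data.
labels = [True] * 238 + [False] * 407
print(class_balance(labels))
```

With the real data, `labels` would be `dataset_byarticle["train"]["hyperpartisan"]`.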

In [27]:
dataset_byarticle = load_dataset('hyperpartisan_news_detection.py', "byarticle")

Downloading and preparing dataset hyperpartisan_news_detection/byarticle to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/13eae61fdca037c340f13a9eaccb56ba9303c4f0e1505830fd519542be1c1478...


Dataset hyperpartisan_news_detection downloaded and prepared to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/13eae61fdca037c340f13a9eaccb56ba9303c4f0e1505830fd519542be1c1478. Subsequent calls will reuse this data.



In [28]:
dataset_byarticle

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'clean_text', 'title', 'hyperpartisan', 'url', 'published_at'],
        num_rows: 645
    })
    test: Dataset({
        features: ['id', 'text', 'clean_text', 'title', 'hyperpartisan', 'url', 'published_at'],
        num_rows: 628
    })
})

In [29]:
dataset_byarticle["test"][0]

{'clean_text': 'according evansville watch epd bomb squad dispatched downtown evansville suspicious device located afternoon according facebook post evansville watch device found parking lot located cherry streets downtown evansville police bomb squad responding streets within one block radius shut ems fire standby last update heard device back tan truck removed assessed keep updated learn',
 'hyperpartisan': False,
 'id': '0000645',
 'published_at': '2017-10-17',
 'text': 'According to Evansville Watch, the EPD &amp; the Bomb Squad have been dispatched to Downtown Evansville after a suspicious device was located this afternoon. According to a Facebook Post from Evansville Watch, the device was found in a parking lot located at 3rd and Cherry Streets downtown. Evansville Police and Bomb Squad are responding. The streets within a one block radius have been shut down and EMS and Fire are on standby. The last update we heard, the device is in the back of a tan truck. It has been removed a

## Analyze Data

In [30]:
import scattertext as st

### By Publisher

In [None]:
bypublisher_test_df = dataset_bypublisher["test"].to_pandas()

In [None]:
bypublisher_test_df

#### Hyperpartisan

In [None]:
bypublisher_test_df["hyperpartisan"] = bypublisher_test_df["hyperpartisan"].apply(str)

In [None]:
bypublisher_test_corpus = st.CorpusFromPandas(bypublisher_test_df, 
                              category_col='hyperpartisan', 
                              text_col='text').build()

In [None]:
print(list(bypublisher_test_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'obama', 'facebook', 'twitter', 'hillary', 'p', 'barack', 'maplight', 'obamacare', 'isis']


In [None]:
term_freq_df = bypublisher_test_corpus.get_term_freq_df()
term_freq_df['Hyperpartisan Score'] = bypublisher_test_corpus.get_scaled_f_scores("True")
print(list(term_freq_df.sort_values(by='Hyperpartisan Score', ascending=False).index[:10]))

['2015in', '2016in', 'sanders', 'bernie', 'amp;#160;|', 'amp;#160;| &', '& amp;#160;|', 'rob', 'amp;#160 a', 'liberals']


In [None]:
term_freq_df['Non-Hyperpartisan Score'] = bypublisher_test_corpus.get_scaled_f_scores("False")
print(list(term_freq_df.sort_values(by='Non-Hyperpartisan Score', ascending=False).index[:10]))

['contributions', 'maplight', '# 183', '183', 'firms', 'firm', 'promoted</p', 'associate', 'alabama', 'comments</a></p']
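Several of the top-scoring terms above (`amp;#160;|`, `promoted</p`, `comments</a></p`) are fragments of HTML markup and escaped entities rather than words, because the `text` column of the by-publisher data still contains raw HTML. A minimal sketch of stripping that markup before building the corpus follows. The regular expression and the double unescape are assumptions about what is sufficient for this data, and `strip_markup` is a hypothetical helper.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </a>, <a href="...">.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(raw):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    # Unescape twice so double-escaped entities such as '&amp;#160;' are decoded too.
    text = html.unescape(html.unescape(text))
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(text.split())


sample = "<p>Every &#8220;survivor&#8221; &amp;#160;| here</p>"
print(strip_markup(sample))
```

Applied to the DataFrame, this would be something like `bypublisher_test_df["text"].apply(strip_markup)` before calling `st.CorpusFromPandas`.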


In [None]:
html = st.produce_scattertext_explorer(bypublisher_test_corpus,
          category="True",
          category_name='Hyperpartisan',
          not_category_name='Non-hyperpartisan',
          width_in_pixels=1000)
open("by_publisher.html", 'wb').write(html.encode('utf-8'))

25256764

#### Bias

In [None]:
bypublisher_test_corpus_bias = st.CorpusFromPandas(bypublisher_test_df, 
                              category_col='bias', 
                              text_col='text').build()

In [None]:
print(list(bypublisher_test_corpus_bias.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'obama', 'facebook', 'twitter', 'hillary', 'p', 'barack', 'maplight', 'obamacare', 'isis']


In [None]:
term_freq_df = bypublisher_test_corpus_bias.get_term_freq_df()
term_freq_df['Right Score'] = bypublisher_test_corpus_bias.get_scaled_f_scores(0)
print(list(term_freq_df.sort_values(by='Right Score', ascending=False).index[:10]))

['content.ad', 'modal&amp;amp;utm_source =', 'modal&amp;amp;utm_source', '= modal&amp;amp;utm_source', 'family friendly', 'just happened', 'happened to', 'content</p p', 'friendly content</p', 'content</p']


In [None]:
term_freq_df['Left Score'] = bypublisher_test_corpus_bias.get_scaled_f_scores(4)
print(list(term_freq_df.sort_values(by='Left Score', ascending=False).index[:10]))

In [None]:
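# Bias value 0 is the class scored as 'Right Score' above; every other bias value
# falls into the comparison group, so the two axis labels reflect that.
html = st.produce_scattertext_explorer(bypublisher_test_corpus_bias,
          category=0,
          category_name='Right',
          not_category_name='Other',
          width_in_pixels=1000)
open("by_publisher_bias.html", 'wb').write(html.encode('utf-8'))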

### By Article

In [31]:
byarticle_test_df = dataset_byarticle["test"].to_pandas()
byarticle_test_df

| | id | text | clean_text | title | hyperpartisan | url | published_at |
|---|---|---|---|---|---|---|---|
| 0 | 0000645 | According to Evansville Watch, the EPD &amp; t... | according evansville watch epd bomb squad disp... | Breaking: Bomb Squad Dispatched for Suspicious... | False | http://1061evansville.com/breaking-bomb-squad-... | 2017-10-17 |
| 1 | 0000646 | Photo by Scott Olson/Getty Images A man (not J... | photo scott olsongetty images man joe walsh po... | The Crazy Republican-Endorsed Logic Behind “If... | True | http://www.slate.com/blogs/the_slatest/2016/10... | 2016-10-26 |
| 2 | 0000647 | Justin Sullivan/Getty Images Hillary Clinton s... | justin sullivangetty images hillary clinton sp... | Review of 650,000 Emails in Weiner’s Laptop fo... | False | http://www.slate.com/blogs/the_slatest/2016/10... | 2016-10-30 |
| 3 | 0000648 | Jewel Samad/AFP/Getty Images Childish, mean-sp... | jewel samadafpgetty images childish meanspirit... | Trump’s Astounding, Hypocritical Cruelty Peaks... | True | http://www.slate.com/blogs/xx_factor/2016/09/3... | 2016-09-30 |
| 4 | 0000649 | \nPresident Donald Trump speaks during a meeti... | president donald trump speaks meeting governor... | Max Boot: Republicans have Stockholm Syndrome ... | True | http://www.sltrib.com/opinion/commentary/2017/... | 2017-10-20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 623 | 0001303 | \nThe protester was waving signs containing po... | protester waving signs containing popular altr... | Video: Alt-right protester choked out for wavi... | False | http://www.chron.com/news/houston-texas/housto... | |
| 624 | 0001304 | \nNFL's Kaepernick sits in protest during nati... | nfl kaepernick sits protest national anthem cl... | I was on board with Kaepernick until.... | True | http://www.cnn.com/2016/08/30/opinions/where-k... | 2016-08-31 |
| 625 | 0001305 | \n1. They define themselves with the flags of ... | define flags losers white supremacists tend us... | 5 Reasons White Supremacists Are The Dumbest W... | True | http://www.collegehumor.com/post/7044796/5-rea... | 2017-05-05 |
| 626 | 0001306 | Marco Rubio represents Florida in the United S... | marco rubio represents florida united states s... | Iran nuclear deal an unfolding disaster | True | http://www.cnn.com/2016/10/17/opinions/iran-nu... | 2016-10-18 |


In [32]:
byarticle_test_df["hyperpartisan"] = byarticle_test_df["hyperpartisan"].apply(str)

#### Text

In [38]:
byarticle_test_corpus = st.CorpusFromPandas(byarticle_test_df, 
                              category_col='hyperpartisan', 
                              text_col='text').build()

In [39]:
print(list(byarticle_test_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'twitter', 'obama', 'comey', 'facebook', 'tweeted', 'antifa', 'bannon', 'hillary', 'tweet']


In [40]:
term_freq_df = byarticle_test_corpus.get_term_freq_df()
term_freq_df['Hyperpartisan Score'] = byarticle_test_corpus.get_scaled_f_scores("True")
print(list(term_freq_df.sort_values(by='Hyperpartisan Score', ascending=False).index[:10]))

['jihad', 'trump is', 'trump has', 'iran', 'trump ’s', 'administration', 'cnn', 'cia', 'he ’s', 'trump']


In [41]:
term_freq_df['Non-Hyperpartisan Score'] = byarticle_test_corpus.get_scaled_f_scores("False")
print(list(term_freq_df.sort_values(by='Non-Hyperpartisan Score', ascending=False).index[:10]))

['dental', '|', 'votes', 'tooth', 'officers', 'police', 'constabulary', 'florida', 'regeneration', 'tooth regeneration']


In [42]:
html = st.produce_scattertext_explorer(byarticle_test_corpus,
          category="True",
          category_name='Hyperpartisan',
          not_category_name='Non-hyperpartisan',
          width_in_pixels=1000,
          metadata=byarticle_test_corpus.get_df()['title'])
open("by_article_test.html", 'wb').write(html.encode('utf-8'))

4347922

#### Clean Text
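Judging from the sample row shown earlier, the `clean_text` column looks like a lowercased version of `text` with punctuation, stop words, and non-alphabetic tokens removed. The exact preprocessing is not shown in this notebook, so the following is only a guess at an equivalent pipeline; `clean` is a hypothetical helper, and the small inline stop-word set stands in for a full list such as NLTK's English stop words.

```python
import html
import string

# Small illustrative stop-word set; a real run would use a full list
# (for example nltk.corpus.stopwords.words("english")).
STOPWORDS = {"to", "the", "have", "been", "after", "a", "was", "this",
             "and", "at", "in", "of", "from", "is", "are", "on"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lowercase, drop punctuation, stop words, and tokens containing non-letters."""
    text = html.unescape(text).lower().translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if t.isalpha() and t not in STOPWORDS]
    return " ".join(tokens)


sample = "According to Evansville Watch, the EPD &amp; the Bomb Squad have been dispatched"
print(clean(sample))
```

Deleting punctuation without inserting a space is what would turn `Olson/Getty` into `olsongetty`, as seen in the table above, and the `isalpha` filter is what would drop tokens such as `3rd`.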

In [43]:
byarticle_test_corpus_clean = st.CorpusFromPandas(byarticle_test_df, 
                              category_col='hyperpartisan', 
                              text_col='clean_text').build()

In [44]:
print(list(byarticle_test_corpus_clean.get_scaled_f_scores_vs_background().index[:10]))

['twitter', 'trump', 'comey', 'obama', 'tweeted', 'antifa', 'facebook', 'bannon', 'tweet', 'abedin']


In [45]:
term_freq_df = byarticle_test_corpus_clean.get_term_freq_df()
term_freq_df['Hyperpartisan Score'] = byarticle_test_corpus_clean.get_scaled_f_scores("True")
print(list(term_freq_df.sort_values(by='Hyperpartisan Score', ascending=False).index[:10]))

['damore', 'iranian', 'thomas wictor', 'wictor', 'thomaswictor october', 'wictor thomaswictor', 'thomaswictor', 'john hawkins', 'young adults', 'adults know']


In [46]:
term_freq_df['Non-Hyperpartisan Score'] = byarticle_test_corpus_clean.get_scaled_f_scores("False")
print(list(term_freq_df.sort_values(by='Non-Hyperpartisan Score', ascending=False).index[:10]))

['constabulary', 'tooth regeneration', 'regeneration', 'dental', 'usher', 'party votes', 'regeneration procedures', 'ai', 'labour', 'separatist']


In [47]:
html = st.produce_scattertext_explorer(byarticle_test_corpus_clean,
          category="True",
          category_name='Hyperpartisan',
          not_category_name='Non-hyperpartisan',
          width_in_pixels=1000,
          metadata=byarticle_test_corpus_clean.get_df()['title'])
open("by_article_test_clean.html", 'wb').write(html.encode('utf-8'))

3264149