<div style="text-align:center;">
    <h1><b>Scattertext applied to Hyperpartisan News Detection</b></h1>
    <h3><b>Authors:</b> Julen Rodriguez and Mikel Salvoch</h3>
</div>

# 1. Necessary Imports

In [24]:
#numpy under 2.x could be required

In [25]:
# !pip install scattertext
import numpy as np
import pandas as pd
import scattertext as st
import spacy
from pprint import pprint

# 2. Data
Assemble the data to analyze into a Pandas data frame. It should have at least two columns, the text you'd like to analyze, and the category to study.

## 2.1. Load the data

In [26]:
# load hyperpartisan\articles-training-byarticle-20181122.tsv
convention_df = pd.read_csv('articles-training-byarticle-20181122.tsv', sep='\t')
# non header, first column: 1 for hyperpartisan, 0 for non-hyperpartisan
convention_df.columns = ['hyperpartisan', 'article']
convention_df['hyperpartisan'] = convention_df['hyperpartisan'].astype(bool)

In [27]:
convention_df.head()

Unnamed: 0,hyperpartisan,article
0,True,Trump Just Woke Up & Viciously Attacked Puerto...
1,True,"Liberals wailing about gun control, but what a..."
2,True,Laremy Tunsil joins NFL players in kneeling du...
3,False,It's 1968 All Over Again Almost a half-centur...
4,True,Gold Price in December 2017 - Myriads of Signa...


## 2.2 Load data into Scattertext corpus
Turn the data frame into a Scattertext Corpus to begin analyzing it

In [28]:
# !python -m spacy download en_core_web_sm

In [30]:
# Turn it into a Scattertext Corpus 
nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(convention_df, 
                             category_col='hyperpartisan',
                             text_col='article',
                             nlp=nlp).build()

Here are the terms that differentiate the corpus from a general English corpus.

In [None]:
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'twitter', 'obama', 'comey', 'tweeted', 'bannon', 'facebook', 'barack', 'hillary', 'kaepernick']


Here are the terms that are most associated with Hyperpartisan news:

In [None]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['hyperpartisan Score'] = corpus.get_scaled_f_scores('True')
pprint(list(term_freq_df.sort_values(by='hyperpartisan Score', ascending=False).index[:10]))

['the left',
 'class',
 'israel',
 'ruling',
 'china',
 'conservative',
 'ruling class',
 'the ruling',
 'he ’s',
 'the media']


And here are the terms that are most associated with non-Hyperpartisan news:

In [None]:
term_freq_df['non-hyperpartisan Score'] = corpus.get_scaled_f_scores('False')
pprint(list(term_freq_df.sort_values(by='non-hyperpartisan Score', ascending=False).index[:10]))

['send free',
 '|',
 '⚪',
 'california ca',
 '⠀',
 'hurricane',
 'august',
 'angeles',
 'los angeles',
 'los']


# 3. Visualizing the data

## 3.1. Visualizing term associations

In [None]:
html = st.produce_scattertext_explorer(
    corpus,
    category='True',
    category_name='Hyperpartisan',
    not_category_name='Non-Hyperpartisan',
    width_in_pixels=1000,
    metadata=convention_df['hyperpartisan']
)
with open("Convention-Visualization.html", 'wb') as file:
    file.write(html.encode('utf-8'))

## 3.2. Visualizing Phrase associations

In [None]:
# !pip install --user pytextrank 

Collecting pytextrank
  Downloading pytextrank-3.3.0-py3-none-any.whl.metadata (12 kB)
Collecting GitPython>=3.1 (from pytextrank)
  Downloading GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Collecting icecream>=2.1 (from pytextrank)
  Downloading icecream-2.1.4-py3-none-any.whl.metadata (1.3 kB)
Collecting gitdb<5,>=4.0.1 (from GitPython>=3.1->pytextrank)
  Downloading gitdb-4.0.12-py3-none-any.whl.metadata (1.2 kB)
Collecting executing>=2.1.0 (from icecream>=2.1->pytextrank)
  Downloading executing-2.2.0-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->GitPython>=3.1->pytextrank)
  Downloading smmap-5.0.2-py3-none-any.whl.metadata (4.3 kB)
Collecting numpy>=1.22 (from networkx[default]>=2.6->pytextrank)
  Using cached numpy-1.26.4-cp39-cp39-win_amd64.whl.metadata (61 kB)
Downloading pytextrank-3.3.0-py3-none-any.whl (26 kB)
Downloading GitPython-3.1.44-py3-none-any.whl (207 kB)
Downloading icecream-2.1.4-py3-none-any.whl (14 kB)
Downloadin

ERROR: Could not install packages due to an OSError: [WinError 5] Acceso denegado: 'd:\\anaconda\\lib\\site-packages\\numpy\\_core\\_multiarray_tests.cp39-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



In [None]:
convention_df.head()

Unnamed: 0,hyperpartisan,article
0,True,Trump Just Woke Up & Viciously Attacked Puerto...
1,True,"Liberals wailing about gun control, but what a..."
2,True,Laremy Tunsil joins NFL players in kneeling du...
3,False,It's 1968 All Over Again Almost a half-centur...
4,True,Gold Price in December 2017 - Myriads of Signa...


In [None]:
import pytextrank

# nlp.add_pipe("textrank", last=True)

convention_df = convention_df.assign(
    parse=lambda df: df.article.apply(nlp),
    # we label as hyperpartisan and non_hyperpartisan to avoid problems with True and False reserved words
    hyperpartisan=lambda df: df.hyperpartisan.apply(lambda x: 'hyperpartisan' if x else 'non_hyperpartisan')
)
corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='hyperpartisan',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()
).build(
).compact(
    st.AssociationCompactor(2000, use_non_text_features=True)
)

d:\Anaconda\lib\site-packages


AttributeError: module 'pytextrank' has no attribute 'TextRank'

Note that the terms present in the corpus are named entities, and, as opposed to frequency counts, their scores are the eigencentrality scores assigned to them by the TextRank algorithm. Running ```corpus.get_metadata_freq_df('')``` will return, for each category, the sums of terms' TextRank scores. The dense ranks of these scores will be used to construct the scatter plot.

In [None]:
term_category_scores = corpus.get_metadata_freq_df('')
print(term_category_scores)

                            hyperpartisan  non_hyperpartisan
term                                                        
Puerto Rican people              0.104552           0.000000
Donald Trump                    22.772139          24.630074
Puerto Rican                     0.468275           0.000000
Trump                          125.671835          88.974905
Puerto Rico                      3.506832           0.511172
...                                   ...                ...
the stomach                      0.013283           0.000000
another interview                0.013148           0.000000
his depravity                    0.012987           0.000000
F**                              0.009696           0.000000
And possibly tranquilizers       0.005146           0.000000

[49258 rows x 2 columns]


Before we construct the plot, let's some helper variables Since the aggregate TextRank scores aren't particularly interpretable, we'll display the per-category rank of each score in the metadata_description field. These will be displayed after a term is clicked.

In [None]:
term_ranks = pd.DataFrame(
    np.argsort(np.argsort(-term_category_scores, axis=0), axis=0) + 1,
    columns=term_category_scores.columns,
    index=term_category_scores.index)

metadata_descriptions = {
    term: '<br/>' + '<br/>'.join(
        '<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
        for cat in corpus.get_categories())
    for term in corpus.get_metadata()
}

In [None]:
category_specific_prominence = term_category_scores.apply(
    lambda r: r.hyperpartisan if r.hyperpartisan > r.non_hyperpartisan else -r.non_hyperpartisan,
    axis=1
)

Now we proceed to construct the plot defining an html file to save it.

In [None]:
html = st.produce_scattertext_explorer(
    corpus,
    category='hyperpartisan',
    not_category_name='non_hyperpartisan',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.dense_rank,
    metadata=corpus.get_df()['hyperpartisan'],
    scores=category_specific_prominence,
    sort_by_dist=False,
    use_non_text_features=True,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)

In [None]:
with open("Convention-Visualization-TextRank.html", 'wb') as file:
    file.write(html.encode('utf-8'))

True

## 3.3. Visualizing Empath topics and categories

In order to visualize Empath (Fast et al., 2016) topics and categories instead of terms, we'll need to create a Corpus of extracted topics and categories rather than unigrams and bigrams

In [None]:
convention_df.head()

Unnamed: 0,hyperpartisan,article,parse
0,hyperpartisan,Trump Just Woke Up & Viciously Attacked Puerto...,"(Trump, Just, Woke, Up, &, Viciously, Attacked..."
1,hyperpartisan,"Liberals wailing about gun control, but what a...","(Liberals, wailing, about, gun, control, ,, bu..."
2,hyperpartisan,Laremy Tunsil joins NFL players in kneeling du...,"(Laremy, Tunsil, joins, NFL, players, in, knee..."
3,non_hyperpartisan,It's 1968 All Over Again Almost a half-centur...,"(It, 's, 1968, All, Over, Again, , Almost, a,..."
4,hyperpartisan,Gold Price in December 2017 - Myriads of Signa...,"(Gold, Price, in, December, 2017, -, Myriads, ..."


In [None]:
# !pip install empath

In [None]:
feat_builder = st.FeatsFromOnlyEmpath()
empath_corpus = st.CorpusFromParsedDocuments(convention_df,
                                             category_col='hyperpartisan',
                                             feats_from_spacy_doc=feat_builder,
                                             parsed_col='parse').build()
html = st.produce_scattertext_explorer(empath_corpus,
                                       category='hyperpartisan',
                                       category_name='Hyperpartisan',
                                       not_category_name='Non-Hyperpartisan',
                                       width_in_pixels=1000,
                                       metadata=convention_df['hyperpartisan'],
                                       use_non_text_features=True,
                                       use_full_doc=True,
                                       topic_model_term_lists=feat_builder.get_top_model_term_lists())

In [None]:
with open("Convention-Visualization-Empath.html", 'wb') as file:
    file.write(html.encode('utf-8'))

## 3.4. Ordering Terms by Corpus Characteristicness

We are identifying terms that are frequent within the studied documents but less common in general language. The characteristic score compares these terms against a general English frequency list.

In [31]:
# ***CAUTION: run only if needed***
convention_df = convention_df.assign(
    parse=lambda df: df.article.apply(nlp),
    # we label as hyperpartisan and non_hyperpartisan to avoid problems with True and False reserved words
    hyperpartisan=lambda df: df.hyperpartisan.apply(lambda x: 'hyperpartisan' if x else 'non_hyperpartisan')
)

In [32]:
convention_df.head()

Unnamed: 0,hyperpartisan,article,parse
0,hyperpartisan,Trump Just Woke Up & Viciously Attacked Puerto...,"(Trump, Just, Woke, Up, &, Viciously, Attacked..."
1,hyperpartisan,"Liberals wailing about gun control, but what a...","(Liberals, wailing, about, gun, control, ,, bu..."
2,hyperpartisan,Laremy Tunsil joins NFL players in kneeling du...,"(Laremy, Tunsil, joins, NFL, players, in, knee..."
3,non_hyperpartisan,It's 1968 All Over Again Almost a half-centur...,"(It, 's, 1968, All, Over, Again, , Almost, a,..."
4,hyperpartisan,Gold Price in December 2017 - Myriads of Signa...,"(Gold, Price, in, December, 2017, -, Myriads, ..."


In [33]:
corpus = (st.CorpusFromPandas(convention_df,
                              category_col='hyperpartisan',
                              text_col='article',
                              nlp=st.whitespace_nlp_with_sentences)
          .build()
          .get_unigram_corpus()
          .compact(st.ClassPercentageCompactor(term_count=2,
                                               term_ranker=st.OncePerDocFrequencyRanker)))
html = st.produce_characteristic_explorer(
    corpus,
    category='hyperpartisan',
    category_name='Hyperpartisan',
    not_category_name='Non-Hyperpartisan',
    metadata=convention_df['hyperpartisan'],
)

In [34]:
with open("Convention-Visualization-Characteristics.html", 'wb') as file:
    file.write(html.encode('utf-8'))