# Analyzing Hyperpartisan Documents

## 1. Setup Working Environment

### 1.1. Load zip data files

#### 1.1.1. Download zip files from Google Drive link

In [1]:
import gdown

zip_data_url = "https://drive.google.com/drive/folders/1e8tgF2tGdZJ0HU-7pKX7yMLBLw71XTMe?usp=drive_link"
gdown.download_folder(zip_data_url)

Retrieving folder contents


Processing file 11XnbNY-7LdzEoo-HLaTzQcnFy4CVBPPZ articles-validation-bypublisher-20181122.zip
Processing file 193XKdPA2rq5oq1KNKZdp7-jCDcjE9rOb ground-truth-validation-bypublisher-20181122.zip


Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=11XnbNY-7LdzEoo-HLaTzQcnFy4CVBPPZ
From (redirected): https://drive.google.com/uc?id=11XnbNY-7LdzEoo-HLaTzQcnFy4CVBPPZ&confirm=t&uuid=b3d2f4d0-b8bb-4d2e-ba30-12f235b42efa
To: /home/jfu08/PycharmProjects/AnalyzingHyperpartisanDocuments/notebooks/data/articles-validation-bypublisher-20181122.zip
100%|██████████| 337M/337M [01:48<00:00, 3.11MB/s] 
Downloading...
From: https://drive.google.com/uc?id=193XKdPA2rq5oq1KNKZdp7-jCDcjE9rOb
To: /home/jfu08/PycharmProjects/AnalyzingHyperpartisanDocuments/notebooks/data/ground-truth-validation-bypublisher-20181122.zip
100%|██████████| 5.24M/5.24M [00:01<00:00, 3.17MB/s]
Download completed


['/home/jfu08/PycharmProjects/AnalyzingHyperpartisanDocuments/notebooks/data/articles-validation-bypublisher-20181122.zip',
 '/home/jfu08/PycharmProjects/AnalyzingHyperpartisanDocuments/notebooks/data/ground-truth-validation-bypublisher-20181122.zip']

#### 1.1.2. Move downloaded data files to the corresponding `zip_data_path`

In [2]:
import os
import shutil


downloaded_data_path = 'data'
zip_data_path = '../data/zip'

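# exist_ok=True lets the cell be re-run without failing on an existing directory
os.makedirs(zip_data_path, exist_ok=True)
for zip_data_file in os.listdir(downloaded_data_path):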
    source_path = os.path.join(downloaded_data_path, zip_data_file)
    destination_path = os.path.join(zip_data_path, zip_data_file)
    shutil.move(source_path, destination_path)

os.rmdir(downloaded_data_path)

### 1.2. Install required packages

In [3]:
!pip install -r ../requirements.txt



## 2. Perform Hyperpartisan Documents Analysis 

### 2.1. Add necessary import statements

In [4]:
from src.log_odd_ratios.LogOddRatiosAnalyzer import LogOddRatiosAnalyzer
from src.log_odd_ratios.LogOddRatiosCalculator import LogOddRatiosCalculator, TokenType
from src.constant_values.enums import DocumentType
from src.get_hyperpartisan_data.HyperpartisanDocumentsFormatter import HyperpartisanDocumentsFormatter
from src.get_hyperpartisan_data.HyperpartisanDocumentsProcessor import HyperpartisanDocumentsProcessor
from src.utils import console_output_formatter

### 2.2. Adapt hyperpartisan documents format into text files

In [5]:
console_output_formatter.print_section_header(section_header='Data formatting')
hyperpartisan_documents_formatter = HyperpartisanDocumentsFormatter()
hyperpartisan_documents_formatter.adapt_dataset_format()

###########################
#                         #
#     DATA FORMATTING     #
#                         #
###########################

Extracting XML data from zip files ...
Extracting file articles-validation-bypublisher-20181122.zip ...
File articles-validation-bypublisher-20181122.zip extracted
Extracting file ground-truth-validation-bypublisher-20181122.zip ...
File ground-truth-validation-bypublisher-20181122.zip extracted


Getting ground truth values ...: 100%|██████████| 150000/150000 [00:00<00:00, 281905.20it/s]
Extracting text from articles ...: 100%|██████████| 150000/150000 [00:24<00:00, 6011.95it/s]
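
The internals of `HyperpartisanDocumentsFormatter` are not shown in this notebook. As a rough sketch of what the two progress bars above might correspond to, the snippet below reads labels and article text from two small inline XML strings. The element and attribute names (`article`, `id`, `hyperpartisan`) and the helper names `read_labels` and `read_texts` are assumptions made for illustration, not the project's actual code.

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-ins for the two XML files (structure is an assumption).
ARTICLES_XML = """<articles>
  <article id="1" title="First"><p>Some <a href="#">partisan</a> text.</p></article>
  <article id="2" title="Second"><p>Neutral report.</p></article>
</articles>"""

GROUND_TRUTH_XML = """<articles>
  <article id="1" hyperpartisan="true"/>
  <article id="2" hyperpartisan="false"/>
</articles>"""


def read_labels(xml_text):
    """Map article id -> True if labelled hyperpartisan (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {a.get("id"): a.get("hyperpartisan") == "true" for a in root.iter("article")}


def read_texts(xml_text):
    """Map article id -> plain text with markup stripped and whitespace collapsed."""
    root = ET.fromstring(xml_text)
    return {
        a.get("id"): " ".join("".join(a.itertext()).split())
        for a in root.iter("article")
    }


labels = read_labels(GROUND_TRUTH_XML)
texts = read_texts(ARTICLES_XML)

# Split the article texts into the two document groups.
hyperpartisan_texts = [texts[i] for i in texts if labels[i]]
non_hyperpartisan_texts = [texts[i] for i in texts if not labels[i]]
```

For files of the size used here (150000 articles), a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole tree at once.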






### 2.3. Process hyperpartisan documents (if necessary) and load

In [6]:
console_output_formatter.print_section_header(section_header='Data loading / preprocessing')
hyperpartisan_documents_processor = HyperpartisanDocumentsProcessor()

hyperpartisan_document_list = hyperpartisan_documents_processor.get_clean_documents(
    document_type=DocumentType.HYPERPARTISAN
)
console_output_formatter.print_document_list_stats(
    document_list=hyperpartisan_document_list,
    document_type=DocumentType.HYPERPARTISAN
)

non_hyperpartisan_document_list = hyperpartisan_documents_processor.get_clean_documents(
    document_type=DocumentType.NON_HYPERPARTISAN
)
console_output_formatter.print_document_list_stats(
    document_list=non_hyperpartisan_document_list,
    document_type=DocumentType.NON_HYPERPARTISAN
)

########################################
#                                      #
#     DATA LOADING / PREPROCESSING     #
#                                      #
########################################

The HYPERPARTISAN document list already exists. Loading it from pickle file ...
Clean document list (HYPERPARTISAN) loaded successfully. 

DOCUMENT LIST STATISTICS (HYPERPARTISAN):
Number of documents: 75000
Document sample after performing preprocessing:


The NON-HYPERPARTISAN document list already exists. Loading it from pickle file ...
Clean document list (NON-HYPERPARTISAN) loaded successfully. 

DOCUMENT LIST STATISTICS (NON-HYPERPARTISAN):
Number of documents: 75000
Document sample after performing preprocessing:


### 2.4. Remove infrequent words (log-odd ratios are sensitive to them)

In [7]:
hyperpartisan_document_list, non_hyperpartisan_document_list = (hyperpartisan_documents_processor.
    remove_infrequent_words(
        hyperpartisan_documents=hyperpartisan_document_list,
        non_hyperpartisan_documents=non_hyperpartisan_document_list
    )
)

Calculating unigrams frequency for the whole corpus ...
Unigrams frequency calculated successfully. 

Removing infrequent words from the corpus ...
Infrequent words removed successfully. 
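
The implementation of `remove_infrequent_words` is not shown here. A minimal sketch of the idea, assuming each document is a list of tokens and that "infrequent" means a corpus-wide count below a threshold, could look like this. The function signature and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def remove_infrequent_words(hyper_docs, non_hyper_docs, min_count=2):
    """Drop tokens whose frequency over BOTH groups combined is below min_count."""
    corpus_counts = Counter(
        token for doc in hyper_docs + non_hyper_docs for token in doc
    )

    def keep_frequent(docs):
        return [[t for t in doc if corpus_counts[t] >= min_count] for doc in docs]

    return keep_frequent(hyper_docs), keep_frequent(non_hyper_docs)


# Toy documents (made up for illustration).
hyper = [["trump", "wall", "xyzzy"], ["trump", "said"]]
non_hyper = [["said", "wall"], ["markets", "said"]]

filtered_hyper, filtered_non_hyper = remove_infrequent_words(hyper, non_hyper)
```

Counting over the whole corpus rather than per group means a word can survive the filter and still be absent from one of the two groups, which matters for the ratio computed in the next step.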


### 2.5. Calculate log-odd ratios

#### 2.5.1. Calculate on `UNIGRAMS`

In [8]:
console_output_formatter.print_section_header(section_header='Log-odd ratios calculation (on unigrams)')
log_odd_ratios_calculator = LogOddRatiosCalculator(
    hyperpartisan_documents=hyperpartisan_document_list,
    non_hyperpartisan_documents=non_hyperpartisan_document_list,
    token_type=TokenType.UNIGRAM
)

log_odd_ratios_calculator.calculate_log_odd_ratios()

####################################################
#                                                  #
#     LOG-ODD RATIOS CALCULATION (ON UNIGRAMS)     #
#                                                  #
####################################################


Calculating unigrams frequency for HYPERPARTISAN ...: 100%|██████████| 75000/75000 [00:06<00:00, 12182.48it/s]
Calculating unigrams frequency for NON-HYPERPARTISAN ...: 100%|██████████| 75000/75000 [00:10<00:00, 7410.57it/s]



Calculating o values for HYPERPARTISAN ...
o values calculated successfully.
Highest values: {',': 0.07904558812222764, '.': 0.05905167566126709, '’': 0.017834360472030117, ';': 0.010282649671985545, '”': 0.009157323635250748, '“': 0.0091371932147936, '&': 0.008775418331800382, '#': 0.008418050626064075, '160': 0.008213163032176257, "'s": 0.006635610382607333, ':': 0.006315486901737, "''": 0.004601259868295875, ')': 0.004290455344636663, '(': 0.004171680981721613, '``': 0.004038930885589468, 'people': 0.0035832291148293633, '?': 0.0035464001595223124, 'would': 0.003481079941100117, 'one': 0.003438485628349947, 'trump': 0.0033961246759758932} ... 

Calculating o values for NON-HYPERPARTISAN ...
o values calculated successfully.
Highest values: {',': 0.07733586360719497, '.': 0.0575189046097237, '’': 0.018772515090338244, ';': 0.01231004804516985, '&': 0.010770717754870098, '#': 0.010621687172105558, '160': 0.010550057423716536, ':': 0.007379839242451155, '“': 0.007092393923448603, '”':
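
The notebook does not show how `LogOddRatiosCalculator` defines its "o values". One common formulation, used here purely as an assumption, takes the odds of a token as `f / (1 - f)`, where `f` is its relative frequency in a group, and the log-odd ratio as the logarithm of the quotient of the two groups' odds. The sketch below follows that formulation; the function names are hypothetical.

```python
import math
from collections import Counter


def odds(docs):
    """Per-token odds f / (1 - f), with f the token's relative frequency."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    result = {}
    for token, count in counts.items():
        f = count / total
        # A token that makes up the whole group has unbounded odds.
        result[token] = math.inf if count == total else f / (1 - f)
    return result


def log_odd_ratios(target_docs, reference_docs):
    """log(o_target / o_reference) for every token seen in the target group."""
    target_odds = odds(target_docs)
    reference_odds = odds(reference_docs)
    ratios = {}
    for token, o_target in target_odds.items():
        o_reference = reference_odds.get(token, 0.0)
        # Tokens absent from the reference group get an infinite ratio.
        ratios[token] = math.inf if o_reference == 0.0 else math.log(o_target / o_reference)
    return ratios


# Toy token lists (made up for illustration).
hyper = [["a", "b", "b"], ["c"]]
non_hyper = [["a", "b"], ["a", "a"]]

ratios = log_odd_ratios(hyper, non_hyper)
```

Under this formulation, a token that occurs in only one group gets a ratio of `inf`, which would be one explanation for the `inf` entries that appear when the most relevant words are listed.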

#### 2.5.2. Calculate on `BIGRAMS`

In [9]:
console_output_formatter.print_section_header(section_header='Log-odd ratios calculation (on bigrams)')
log_odd_ratios_calculator = LogOddRatiosCalculator(
    hyperpartisan_documents=hyperpartisan_document_list,
    non_hyperpartisan_documents=non_hyperpartisan_document_list,
    token_type=TokenType.BIGRAM
)

log_odd_ratios_calculator.calculate_log_odd_ratios()

###################################################
#                                                 #
#     LOG-ODD RATIOS CALCULATION (ON BIGRAMS)     #
#                                                 #
###################################################


Calculating bigrams frequency for HYPERPARTISAN ...: 100%|██████████| 75000/75000 [00:28<00:00, 2668.83it/s]
Calculating bigrams frequency for NON-HYPERPARTISAN ...: 100%|██████████| 75000/75000 [00:59<00:00, 1254.03it/s]



Calculating o values for HYPERPARTISAN ...
o values calculated successfully.
Highest values: {('#', '160'): 0.008207190205724578, ('&', '#'): 0.00820713221746512, ('160', ';'): 0.00820713221746512, ('.', '’'): 0.002100346182670835, (',', '”'): 0.0019771636865404305, (',', '’'): 0.0018951256541697179, ('.', '“'): 0.0017713638492958578, ('.', "''"): 0.0014065261258350292, (',', '“'): 0.0012648976549313271, ('.', '``'): 0.0011852349477046455, (',', "''"): 0.001079142535134596, ('.', ','): 0.000984561713714705, (';', '&'): 0.0009715578911369213, ('.', "'s"): 0.0009260614611197653, (',', ','): 0.0008269952913596135, ('united', 'states'): 0.000720864523626125, (')', ','): 0.0007113809853931122, ('.', '&'): 0.0007063536404343701, (')', '.'): 0.0006886155143177224, ('said', '.'): 0.0006643658569952778} ... 

Calculating o values for NON-HYPERPARTISAN ...
o values calculated successfully.
Highest values: {('&', '#'): 0.010543000855111064, ('#', '160'): 0.010543000855111064, ('160', ';'): 0.010
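
The bigram keys in the output above are tuples of two adjacent tokens. A minimal way to count such pairs, assuming each document is already a list of tokens, is sketched below. The function name `bigram_counts` is hypothetical and this is not necessarily how `TokenType.BIGRAM` is handled internally.

```python
from collections import Counter


def bigram_counts(docs):
    """Count adjacent token pairs; pairs never span two documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy documents (made up for illustration).
docs = [["united", "states", "said", "."], ["united", "states"]]
counts = bigram_counts(docs)
```

Because every document contributes one fewer bigram than it has tokens, and the set of distinct pairs is much larger than the vocabulary, this step tends to be slower and more memory-hungry than the unigram count, which is in line with the timings shown above.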

### 2.6. Get most relevant words for each document group

#### 2.6.1. Most relevant words (on `UNIGRAMS`)

In [10]:
console_output_formatter.print_section_header(section_header='Most relevant words extraction (on unigrams)')
log_odd_ratios_analyzer = LogOddRatiosAnalyzer(token_type=TokenType.UNIGRAM)

hyperpartisan_most_relevant_words_on_unigrams = log_odd_ratios_analyzer.get_most_relevant_words(
    document_type=DocumentType.HYPERPARTISAN
)
print(f'{hyperpartisan_most_relevant_words_on_unigrams} \n')
hyperpartisan_most_relevant_words_not_inf_on_unigrams = (log_odd_ratios_analyzer.
    get_most_relevant_words(
        document_type=DocumentType.HYPERPARTISAN,
        infinite_values=False
    )
)
print(f'{hyperpartisan_most_relevant_words_not_inf_on_unigrams} \n\n')

non_hyperpartisan_most_relevant_words_on_unigrams = log_odd_ratios_analyzer.get_most_relevant_words(
    document_type=DocumentType.NON_HYPERPARTISAN
)
print(f'{non_hyperpartisan_most_relevant_words_on_unigrams} \n')
non_hyperpartisan_most_relevant_words_not_inf_on_unigrams = (log_odd_ratios_analyzer.
    get_most_relevant_words(
        document_type=DocumentType.NON_HYPERPARTISAN,
        infinite_values=False
    )
)
print(non_hyperpartisan_most_relevant_words_not_inf_on_unigrams)

########################################################
#                                                      #
#     MOST RELEVANT WORDS EXTRACTION (ON UNIGRAMS)     #
#                                                      #
########################################################

Loading log-odd ratios for unigrams from pickle file ...
Log-odd ratios loaded successfully. 

Obtaining top 50 most relevant unigrams for HYPERPARTISAN:
{'wwc': inf, 'arrigoni': inf, 'bernays': inf, 'doesn�t': inf, 'hierzulande': inf, 'non-capitalist': inf, 'lannisters': inf, 'trusto': inf, "qur'an": inf, 'trumper': inf, 'lower-third': inf, 'self-authored': inf, 'bieten': inf, 'gender-affirming': inf, 'editrix': inf, '//twitter.com/chuckcjohnson/status/506397004038017024': inf, 'tsuburaya': inf, 'economy/finance/crypto': inf, 'shaftan': inf, 'aric': inf, 'wurden': inf, 'δεν': inf, 'daario': inf, '2c350': inf, 'kaili': inf, 'c4ss': inf, '�': inf, 'hudes': inf, 'tgp': inf, 'burgos': inf, 'khazar': inf, 'ed
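
The cell above calls `get_most_relevant_words` twice per group, once with the default and once with `infinite_values=False`. A rough sketch of what such a ranking could do, assuming the ratios are held in a plain dictionary, is given below. The function `most_relevant` and the numeric values are illustrative only.

```python
import math


def most_relevant(ratios, top_k=50, infinite_values=True):
    """Return the top_k tokens by ratio, optionally skipping infinite ratios."""
    items = list(ratios.items())
    if not infinite_values:
        items = [(token, value) for token, value in items if math.isfinite(value)]
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:top_k])


# Toy ratios (values made up for illustration).
toy_ratios = {"wwc": math.inf, "trump": 1.2, "said": -0.3, "people": 0.4}

with_inf = most_relevant(toy_ratios, top_k=2)
without_inf = most_relevant(toy_ratios, top_k=2, infinite_values=False)
```

Filtering out infinite ratios is useful because tokens that occur in only one group crowd out the top of the list, regardless of how often they actually appear.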

#### 2.6.2. Most relevant words (on `BIGRAMS`)

In [11]:
console_output_formatter.print_section_header(section_header='Most relevant words extraction (on bigrams)')
log_odd_ratios_analyzer = LogOddRatiosAnalyzer(token_type=TokenType.BIGRAM)

hyperpartisan_most_relevant_words_on_bigrams = log_odd_ratios_analyzer.get_most_relevant_words(
    document_type=DocumentType.HYPERPARTISAN
)
print(f'{hyperpartisan_most_relevant_words_on_bigrams} \n')
hyperpartisan_most_relevant_words_not_inf_on_bigrams = (log_odd_ratios_analyzer.
    get_most_relevant_words(
        document_type=DocumentType.HYPERPARTISAN,
        infinite_values=False
    )
)
print(f'{hyperpartisan_most_relevant_words_not_inf_on_bigrams} \n\n')

non_hyperpartisan_most_relevant_words_on_bigrams = log_odd_ratios_analyzer.get_most_relevant_words(
    document_type=DocumentType.NON_HYPERPARTISAN
)
print(f'{non_hyperpartisan_most_relevant_words_on_bigrams} \n')
non_hyperpartisan_most_relevant_words_not_inf_on_bigrams = (log_odd_ratios_analyzer.
    get_most_relevant_words(
        document_type=DocumentType.NON_HYPERPARTISAN,
        infinite_values=False
    )
)
print(non_hyperpartisan_most_relevant_words_not_inf_on_bigrams)

#######################################################
#                                                     #
#     MOST RELEVANT WORDS EXTRACTION (ON BIGRAMS)     #
#                                                     #
#######################################################

Loading log-odd ratios for bigrams from pickle file ...
Log-odd ratios loaded successfully. 

Obtaining top 50 most relevant bigrams for HYPERPARTISAN:
{('incitements', 'assassination'): inf, ('delaware', 'violent'): inf, ('incapacitating', 'aggressors'): inf, ('budget', 'brinkmanship'): inf, ('thrown', 'light'): inf, ('game', 'stereo'): inf, ('sense', 'murdered'): inf, ('word', 'hurricane'): inf, ('hid', 'profit'): inf, ('chances', 'eventual'): inf, ('sharpton', 'andré'): inf, ('nuptials', 'state'): inf, ('spokesman', 'nueces'): inf, ('certain', 'implications'): inf, ('chart', 'disaster'): inf, ('sweepstakes', 'kathy'): inf, ('sticking', '7-11'): inf, ('leaves', 'empire'): inf, ('movies', 'looper'): inf, ('wi

\* NOTE: Because of high RAM usage, some intermediate results were checkpointed to *pickle* files and the environment was restarted 2-3 times.
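
The checkpointing described in the note can be approximated with a small load-or-compute helper, sketched below. The helper name `load_or_compute`, the file name, and the toy data are assumptions for illustration; the project's own checkpoint logic is not shown in this notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result if it exists; otherwise compute it and store it."""
    if os.path.exists(path):
        with open(path, "rb") as checkpoint_file:
            return pickle.load(checkpoint_file)
    result = compute()
    with open(path, "wb") as checkpoint_file:
        pickle.dump(result, checkpoint_file)
    return result


calls = []


def expensive():
    """Stand-in for a memory-hungry step such as counting n-grams."""
    calls.append(1)
    return {"trump": 3, "said": 5}


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "unigram_counts.pkl")
    first = load_or_compute(checkpoint, expensive)   # computes and writes the file
    second = load_or_compute(checkpoint, expensive)  # reads the file, no recompute
```

With this pattern, restarting the environment between steps only costs the time needed to read the pickle file back, since the expensive step runs once.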

## 3. Results Analysis