Crawling And Multi-Level Clustering Of News Articles 📰

Crawling a dataset of news articles from CommonCrawl and clustering them by their topics.

Table of Contents

  • Motivation
  • Installation
  • Crawling
  • Clustering
  • Parameters
  • Results
  • Dataset

Motivation

Every day, thousands of news articles with different political orientations are published.
The goal of this project is to create a large collection of news articles (>250k) that...

  1. ...covers the most recent five years of the English-speaking news agenda
  2. ...has a multi-level topic structure to enable exploring news articles at varying similarity levels

For example, one can move from a broad cluster about the power of national leaders worldwide down to narrower clusters about the leaders of a particular country. Because of the five-year timeframe of this news collection, it is possible to investigate how the narrative, agenda, and/or framing changed within the narrow clusters.

This project was part of the course Key Competencies in Computer Science at the University of Wuppertal; its goal was to collect and aggregate news articles for cross-document coreference resolution at scale. It was supervised by Anastasia Zhukova.

Installation

To set up a conda environment and install the requirements:

conda env create -f env.yml
conda activate kccs

We recommend using Python 3.9.4 to run this project.

This project consists of two parts: the crawler and the clustering algorithms. The crawler is a regular Python script. The clustering is performed in two Jupyter notebooks, which makes it easier to adjust hyperparameters and inspect visualisations.

The dataset consists of ~268,000 American news articles from 03/2016 to 07/2021. The crawled websites are based on the POLUSA dataset to ensure a diverse political spectrum.

Crawling

To crawl a dataset of news articles from CommonCrawl, run:

python crawl.py

Pipeline

The crawler gathers WARC data from CommonCrawl and processes it into a JSON layout. This JSON data is later used for clustering.

Running the crawler for the first time produces a commoncrawl_archives.json file, which allows the crawler to be stopped and resumed at a later time. If a file with this name exists, the crawler skips the initialization of WARC paths and continues downloading immediately (skipping already processed data). This can also be used to extend an existing dataset by only changing the number of articles crawled per timeframe.
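For illustration, here is a minimal sketch of how WARC records can be located and fetched from CommonCrawl via its CDX index API; the index name and URL pattern are placeholders, and the actual logic in crawl.py may differ:

    import json
    import requests

    INDEX = "CC-MAIN-2021-25"             # example CommonCrawl index
    TARGET = "https://www.example.com/*"  # example URL pattern

    # query the CDX index for WARC records matching the target pattern
    resp = requests.get(
        f"https://index.commoncrawl.org/{INDEX}-index",
        params={"url": TARGET, "output": "json"},
    )
    records = [json.loads(line) for line in resp.text.strip().splitlines()]

    # fetch the byte range of the first matching WARC record
    first = records[0]
    start = int(first["offset"])
    end = start + int(first["length"]) - 1
    warc = requests.get(
        "https://data.commoncrawl.org/" + first["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )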

Clustering

After crawling, you can cluster the dataset on one or multiple levels.

  1. Latent Dirichlet Allocation (LDA): For the first level, start by running the Jupyter notebook:
LDA.ipynb
  2. K-Means & Timed Events: For the second and third level, run the Jupyter notebook:
KMeans.ipynb

Pipeline LDA Clustering

The LDA.ipynb notebook takes all JSON files within the directory ./crawl_json and performs this pipeline on the combined data. Each resulting cluster is written to its own JSON file.

Preprocessing

During preprocessing, multiple filters are applied to the dataset. This makes the overall topic of each article easier to determine. The following wordclouds give an idea of how preprocessing improves the dataset for our specific use case: the first wordcloud represents the plain maintext of all articles, while the second represents only the words that have not been filtered out by the preprocessing.
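The details live in the notebook, but the filtering boils down to something like the following sketch, assuming NLTK's English stopword list extended by the words in stopwords.json (assumed here to be a JSON array; lemmatization and other filters are omitted):

    import json
    import re

    from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

    # assumption: stopwords.json holds a JSON array of lowercase words
    with open("stopwords.json") as f:
        custom_stopwords = set(json.load(f))
    stop_words = set(stopwords.words("english")) | custom_stopwords

    def preprocess(maintext):
        # lowercase, keep alphabetic tokens, drop stopwords and short tokens
        tokens = re.findall(r"[a-z]+", maintext.lower())
        return [t for t in tokens if t not in stop_words and len(t) > 2]

    print(preprocess("The president spoke about the new government policies."))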

Pipeline K-Means Clustering

To apply K-Means to the text-based dataset, we generate TF-IDF vectors to represent each news article as a document vector. The KMeans.ipynb notebook processes the JSON files within the directory ./LDA_clustered_json one by one. Each level 1 cluster is then divided into subclusters, which are represented by a folder hierarchy in the output.
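As an illustration, here is a minimal sketch of this step, assuming scikit-learn; the cluster file name is borrowed from the Results section, and the assumptions that the file holds a list of article objects and that 8 subclusters are wanted are placeholders:

    import json
    from pathlib import Path

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # assumption: a level 1 cluster file holds a list of news-please style
    # article objects with a "maintext" field
    articles = json.loads(
        Path("./LDA_clustered_json/cluster_0-president_king_trump.json").read_text()
    )
    texts = [a["maintext"] for a in articles]

    vectorizer = TfidfVectorizer(min_df=0.05, max_df=0.6)  # values from Results
    X = vectorizer.fit_transform(texts)

    kmeans = KMeans(n_clusters=8, random_state=0)  # example subcluster count
    labels = kmeans.fit_predict(X)                 # level 2 cluster per article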

Output Directories

  • Crawler: ./crawl_json
  • LDA Clustering: ./LDA_clustered_json
    • Multiple JSON files, each one representing a cluster of a different topic
  • Three-Level Clustering: ./clustered_json
    • Directories generated by the clustering algorithms

The clustering output consists of multiple directories and JSON files, which are named according to the format

  • LDA: cluster_X-keyword1_keyword2_keyword3
  • K-Means: cluster_X-keyword1_keyword2_keyword3_keyword4_keyword5
  • Timeframe: year-month.json

where the keywords are the most dominant keywords within the cluster (sorted in descending order of dominance). All JSON outputs follow the news-please format while adding some new variables. The added variables are:

Variable                Description
LDA_ID                  ID of the article's level 1 (LDA) cluster
LDA_topic_percentage    Indicates how well the article fits into its LDA cluster
LDA_topic_keywords      The most dominant keywords within the LDA cluster
kMeans_ID               ID of the article's level 2 (K-Means) cluster
kMeans_topic_keywords   The most dominant keywords within the K-Means cluster
year_month              The timeframe in which the article was released

JSON Output Example

"date_download": "09/07/2021, 01:35:50",
"date_modify": "09/07/2021, 01:35:50",
"date_publish": "2016-05-31T05:12:50Z",
"description": "string",
"language": "en",
"source_domain": "www.website.com",
"maintext": "maintext string",
"url": "http://www.website.com/xyz",
"LDA_ID": 0,
"LDA_topic_percentage": 0.5086100101,
"LDA_topic_keywords": "president, king, trump, stone, heche, government, house, right, degeneres, official",
"kMeans_ID": 0,
"kMeans_topic_keywords": "photo, journal, wall, street, jason, accurate, trump, look, transcript, tour",
"year_month": "2016-05"

Parameters

To achieve the best results, you may adjust some parameters in the code. The following parameters have a significant influence on the quality of the produced dataset.

Crawler

Parameter                   Description
TARGET_WEBSITES             Websites you want to keep crawled data from
TEST_TARGETS                URLs used to request WARC files from CommonCrawl
INDEXES                     Indexes from CommonCrawl
MAX_ARCHIVE_FILES_PER_URL   Maximum number of archive files per entry of TEST_TARGETS
MINIMUM_MAINTEXT_LENGTH     Articles with a shorter maintext are discarded
MAX_CONNECTION_RETRIES      Maximum number of retries while downloading
START_NUMERATION_AT         Change this if you want to extend an existing dataset
DESIRED_LANGUAGE            The desired article language, for example en

Define INDEXES (which determine the release dates of the news articles) by choosing them from the CommonCrawl Index List.
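A hypothetical example of this configuration (the index names follow CommonCrawl's CC-MAIN-<year>-<week> scheme; the websites and values are illustrative placeholders, not the project's defaults):

    INDEXES = ["CC-MAIN-2016-18", "CC-MAIN-2018-22", "CC-MAIN-2021-25"]
    TARGET_WEBSITES = ["www.reuters.com", "www.foxnews.com"]  # example sites
    MAX_ARCHIVE_FILES_PER_URL = 10
    MINIMUM_MAINTEXT_LENGTH = 500   # characters
    MAX_CONNECTION_RETRIES = 5
    DESIRED_LANGUAGE = "en"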

Clustering

LDA Clustering

These parameters can be adjusted within LDA.ipynb (first level).

Parameter             Description
topic_amount_start    Minimum number of clusters
topic_amount_end      Maximum number of clusters
iteration_interval    Step size between tested cluster counts (default: 1)
desired_coherence     The algorithm stops once this coherence value is reached

The LDA pipeline filters out a predefined list of stopwords, extended by a JSON file. You can add/remove keywords by separating them with commas in this file:

stopwords.json
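A hypothetical example of its contents, assuming the file holds a plain JSON array (pick words that are frequent but uninformative in your dataset):

    ["also", "would", "could", "said", "mr", "new", "one"]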

K-Means Clustering

These parameters can be adjusted within KMeans.ipynb (second & third level).

Parameter      Description
max_clusters   Maximum possible number of clusters
min_df         Ignore terms that appear in less than this fraction of articles (percent)
max_df         Ignore terms that appear in more than this fraction of articles (percent)

Results

LDA Clustering

The optimal number of clusters is determined by calculating the coherence score for each iteration of the algorithm; the cluster count with the highest coherence score is chosen.
As you can see in the data below, the maximum coherence score is reached relatively quickly. This makes LDA a good choice as a level 1 clustering algorithm, as it is not too specific.
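A minimal sketch of this selection loop, assuming gensim; the loop variables match the parameter names from the Parameters section, and the dummy documents are placeholders for the preprocessed token lists:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # placeholder for the preprocessed token lists of all articles
    tokenized_docs = [["president", "government", "house"],
                      ["trump", "official", "right"]]

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    topic_amount_start, topic_amount_end = 5, 30  # example search range
    iteration_interval = 1
    desired_coherence = 0.58                      # example target value

    best_k, best_score = None, 0.0
    for k in range(topic_amount_start, topic_amount_end + 1, iteration_interval):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       random_state=0)
        score = CoherenceModel(model=lda, texts=tokenized_docs,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if score > best_score:
            best_k, best_score = k, score
        if score >= desired_coherence:  # stop once the target is reached
            break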

Data

Amount of clusters    Coherence score (%)
...                   ...
14                    48.48
15                    51.29
16 (best result)      58.06
17                    50.73
18                    54.37
...                   ...

K-Means Clustering

The optimal number of clusters is determined by performing K-Means for a range of cluster counts. The definitive choice is made by locating the elbow/knee of the resulting distortion curve. The number of level 2 clusters is calculated independently for every level 1 cluster. We chose min_df = 0.05 and max_df = 0.6 for this dataset.
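A minimal sketch of the elbow detection, using the kneed package as one possible knee locator (the notebook may locate the knee differently); X stands for the TF-IDF matrix from the K-Means sketch above, and max_clusters matches the parameter table:

    from kneed import KneeLocator
    from sklearn.cluster import KMeans

    max_clusters = 20  # example value
    ks = list(range(2, max_clusters + 1))

    # distortion (inertia) of a K-Means run for every candidate cluster count
    distortions = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
                   for k in ks]

    knee = KneeLocator(ks, distortions, curve="convex", direction="decreasing")
    print("chosen number of level 2 clusters:", knee.knee)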

Data

We applied K-Means to every main LDA cluster. Here you can see the detected elbow/knee for the distortion curve of cluster_0-president_king_trump:

You can find all additional distortion graphs for our dataset in the directory ./repo_images/kMeans_elbow_curves/. Each detected elbow was used as the cluster count for the K-Means clustering.

Dataset

The complete resulting dataset contains ~268,000 clustered news articles from 03/2016 to 07/2021.

  • You can download the sample dataset which has been crawled with this project by clicking here.
  • You can download the already clustered dataset by clicking here.

