tarb_gsoc23_content_drift

Overview

This repository houses the Content Drift Assessment Tool developed for the TARB project at the Internet Archive. The tool is a collection of Python scripts that analyze Wikipedia pages and compute relevancy scores for embedded non-Wikipedia links, using BERT embeddings and LDA topic modeling.
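
To make the idea concrete, here is a minimal sketch of how a BERT-based relevancy score can be computed. It is illustrative only and does not reproduce the repository's actual scripts: it assumes the sentence-transformers package and a generic pretrained model, and it scores a candidate page against the Wikipedia context using cosine similarity of sentence embeddings.

# Illustrative only: assumes the sentence-transformers package; this is not
# the repository's actual implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def relevancy_score(wiki_context: str, linked_page_text: str) -> float:
    # Cosine similarity between the Wikipedia context and the linked page text.
    embeddings = model.encode([wiki_context, linked_page_text])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

context = "The 1969 Apollo 11 mission landed the first humans on the Moon."
linked = "NASA's Apollo 11 carried Armstrong and Aldrin to the lunar surface."
print(relevancy_score(context, linked))  # values near 1.0 indicate high relevancy

A score near 1.0 suggests the linked page still matches the citing context; a low score can be a sign of content drift.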

Directory Structure

  • scripts/: Python scripts that use BERT and LDA models to calculate content-relevancy metrics. These metrics quantify how well the embedded links align with the content of the Wikipedia pages that cite them.

  • api/: APIs that expose the relevancy metrics calculated by the models, so they can be integrated into other systems or used for batch processing.

  • data/: TSV (tab-separated values) files storing the anchor texts, sub-headings, and surrounding paragraphs for each analyzed Wikipedia page. These files are the data foundation for the relevancy calculations.

  • webui/: A Streamlit application that provides a user-friendly interface to the BERT model for calculating relevancy metrics. It serves as a demo of the tool's capabilities; a rough sketch of such an interface follows this list.
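
For a rough idea of what such a demo interface can look like, here is a hedged Streamlit sketch. It is not the actual webui/ application; the model, field labels, and file name are assumptions, and the score is again a plain cosine similarity of sentence embeddings.

# Hypothetical sketch of a Streamlit demo; the real app lives in webui/.
import streamlit as st
from sentence_transformers import SentenceTransformer, util  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

st.title("Content Drift Relevancy Demo (sketch)")
wiki_context = st.text_area("Wikipedia context (anchor text or surrounding paragraph)")
linked_text = st.text_area("Text from the linked page")

if st.button("Compute relevancy") and wiki_context and linked_text:
    embeddings = model.encode([wiki_context, linked_text])
    score = float(util.cos_sim(embeddings[0], embeddings[1]))
    st.metric("Relevancy score", f"{score:.3f}")

Such a script would be launched with streamlit run followed by the file name (for example, streamlit run demo.py, where demo.py is a hypothetical name).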

Prerequisites

  • Python 3.x
  • pip

Installation and Usage

  1. Clone the repository to your local machine:
     git clone https://github.com/internetarchive/tarb_gsoc23_content_drift.git

  2. Instructions for the remaining use cases are provided within the individual directories.
