tarb_gsoc23_content_drift

Overview

This repository houses the Content Drift Assessment Tool developed for the TARB project at the Internet Archive. The tool is a collection of Python scripts that analyze Wikipedia pages and compute relevancy scores for the external (non-Wikipedia) links embedded in them, using NLP models such as BERT and LDA.
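This README does not spell out how a relevancy score is computed. As a rough illustration only, the sketch below compares the text around a link with the text of the linked page using cosine similarity. The bag-of-words vectors are a deliberately crude, dependency-free stand-in for BERT embeddings or LDA topic distributions, and the function names (`bag_of_words`, `cosine_similarity`, `relevancy_score`) are hypothetical, not taken from this repository.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens.

    Assumption: this stands in for a real embedding (e.g. from BERT);
    it is NOT what the repository's scripts do.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(wiki_context: str, linked_text: str) -> float:
    """Hypothetical score: similarity of the link's context to the linked page."""
    return cosine_similarity(bag_of_words(wiki_context), bag_of_words(linked_text))


if __name__ == "__main__":
    context = "The archive stores snapshots of web pages over time."
    live_page = "Snapshots of web pages are stored in the archive."
    drifted_page = "Buy discount shoes online today."
    print(relevancy_score(context, live_page))
    print(relevancy_score(context, drifted_page))
```

Under this toy scoring, a linked page whose content has drifted away from the citing paragraph gets a lower score than one that still covers the same topic. A model-based embedding would replace `bag_of_words` while the comparison step stays the same.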

Directory Structure

scripts/: Python scripts that use BERT and LDA models to calculate content relevancy metrics. These metrics indicate how well the embedded links align with the content of the Wikipedia pages.
api/: APIs that expose the relevancy metrics calculated by the models. They can be integrated into other systems or used for batch processing.
data/: TSV (tab-separated values) files that store the anchor text, sub-heading, and surrounding paragraph of each link on every analyzed Wikipedia page. These files are the input data for the relevancy calculations.
webui/: A Streamlit application that provides an interface to the BERT model for calculating relevancy metrics. It serves as a demo of the tool.
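As a minimal sketch of how the TSV files in data/ might be read, the example below parses tab-separated rows into small records using only the standard library. The column names (`anchor_text`, `sub_heading`, `paragraph`), the presence of a header row, and the absence of quoting are all assumptions made for illustration; check the actual files in data/ before relying on them.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass
class LinkContext:
    """One link's context on a Wikipedia page (field names are hypothetical)."""
    anchor_text: str
    sub_heading: str
    paragraph: str


def read_link_contexts(handle: TextIO) -> Iterator[LinkContext]:
    """Yield one LinkContext per data row of a tab-separated file.

    Assumptions: the first row is a header with the three column names
    used below, and fields are not quoted (hence csv.QUOTE_NONE).
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield LinkContext(
            anchor_text=row["anchor_text"].strip(),
            sub_heading=row["sub_heading"].strip(),
            paragraph=row["paragraph"].strip(),
        )


# Made-up sample data so the sketch runs without any repository files.
SAMPLE_TSV = (
    "anchor_text\tsub_heading\tparagraph\n"
    "Example Site\tHistory\tA paragraph that cites the example site.\n"
    "Another Source\tReception\tA paragraph that cites another source.\n"
)

if __name__ == "__main__":
    for context in read_link_contexts(io.StringIO(SAMPLE_TSV)):
        print(context.sub_heading, "|", context.anchor_text)
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`.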

Prerequisites

Python 3.x
pip

Installation and Usage

Clone the repository to your local machine:

    git clone https://github.com/internetarchive/tarb_gsoc23_content_drift.git
Usage instructions for each component are in its respective directory.
