Spatial analysis and reading recommendation system for the Savoirs digital humanities corpus.
Analyzes the Savoirs French academic corpus by extracting geographic references from TEI/XML-encoded articles, geocoding place names, and computing spatial similarity between articles. Articles that discuss geographically related topics are surfaced as reading recommendations.
- Geospatial: GeoPandas, Shapely, Geopy (Nominatim), Geocoder (GeoNames API)
- XML Processing: BeautifulSoup with lxml parser
- Data Processing: Pandas
- Data Format: GeoPackage (GPKG)
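The XML-processing step can be illustrated with a minimal sketch. The project itself uses BeautifulSoup with the lxml parser; the stdlib `ElementTree` is used below so the snippet runs without third-party dependencies, and the sample TEI and function name are illustrative rather than taken from `functions_xml.py`.

```python
# Minimal sketch of TEI place-name extraction. The project uses BeautifulSoup
# with lxml; stdlib ElementTree is used here so the example runs standalone.
# The sample document and function name are illustrative.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

tei_sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Pausanias décrit <placeName>Delphes</placeName>
       et <placeName>Olympie</placeName>.</p>
  </body></text>
</TEI>"""

def extract_place_names(tei_xml: str) -> list[str]:
    """Return the text of every TEI <placeName> element, in document order."""
    root = ET.fromstring(tei_xml)
    return [el.text.strip() for el in root.iter(TEI_NS + "placeName")]

print(extract_place_names(tei_sample))  # ['Delphes', 'Olympie']
```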
TEI/XML Corpus (~100 articles)
↓
Extract place names from <placeName> tags
↓
Geocode locations (GeoNames API + Nominatim fallback)
↓
Save as GeoPackage (one per article)
↓
Compute pairwise spatial distances between articles
↓
Export similarity rankings as CSV
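The geocoding step above queries GeoNames first and falls back to Nominatim on a miss. A sketch of that control flow follows; since the real geocoders (the `geocoder` and `geopy` libraries) hit the network, they are stubbed out here, and all names and coordinates are illustrative.

```python
# Sketch of the GeoNames-then-Nominatim fallback used during geocoding.
# Real geocoder calls are network-bound, so stubs stand in for them;
# only the fallback logic itself is demonstrated.
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)

def geocode_with_fallback(
    place: str, geocoders: list[Callable[[str], Optional[Coords]]]
) -> Optional[Coords]:
    """Try each geocoder in order; return the first hit, or None if all miss."""
    for gc in geocoders:
        coords = gc(place)
        if coords is not None:
            return coords
    return None

# Stubs standing in for the GeoNames API and the Nominatim fallback.
def fake_geonames(place: str) -> Optional[Coords]:
    return {"Delphes": (38.48, 22.50)}.get(place)

def fake_nominatim(place: str) -> Optional[Coords]:
    return {"Olympie": (37.64, 21.63)}.get(place)

chain = [fake_geonames, fake_nominatim]
print(geocode_with_fallback("Delphes", chain))    # resolved by the primary geocoder
print(geocode_with_fallback("Olympie", chain))    # resolved by the fallback
print(geocode_with_fallback("Atlantide", chain))  # None: both miss
```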
hackaton_savoirs/
├── python/
│   ├── pre_processing.py      # Geocoding pipeline (TEI → GeoPackage)
│   ├── main.py                # Pairwise distance computation
│   ├── functions_xml.py       # TEI/XML extraction (titles, authors, places)
│   ├── functions_geodata.py   # Geocoding and GeoDataFrame creation
│   ├── geo_article.py         # GeoArticle class for spatial similarity
│   ├── functions.py           # General utilities
│   └── tests/                 # Unit tests
├── CorpusTEI/                 # Input: TEI-encoded articles (ANABASES journal)
└── data/
    ├── article_locations/     # Output: GeoPackage files per article
    └── distances/             # Output: CSV distance matrices
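Stage 2 (`main.py` with the `GeoArticle` class) reduces each article to its set of geocoded points and compares articles pairwise. The exact metric is not documented here; the sketch below uses one plausible choice, the mean over article A's places of the distance to the nearest place in article B, with a plain haversine formula instead of GeoPandas so it runs standalone.

```python
# Illustrative pairwise article distance. The real pipeline operates on
# GeoPackage-backed GeoDataFrames; the metric below (mean nearest-place
# haversine distance) is an assumption, not the project's documented one.
from math import asin, cos, radians, sin, sqrt

def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def article_distance(places_a, places_b) -> float:
    """Mean, over places in A, of the distance to the nearest place in B."""
    return sum(min(haversine_km(p, q) for q in places_b)
               for p in places_a) / len(places_a)

athens = [(37.98, 23.73)]
delphi_olympia = [(38.48, 22.50), (37.64, 21.63)]
print(f"{article_distance(athens, delphi_olympia):.0f} km")
```

Articles whose place sets lie near each other get small distances, so sorting the pairwise CSV in ascending order yields the geography-based reading recommendations.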
# Stage 1: Geocode place names from TEI corpus
python python/pre_processing.py
# Stage 2: Compute pairwise article distances
python python/main.py

Built during the Hackathon Savoirs for challenges on the Savoirs corpus: extracting knowledge through text mining, spatial analysis, and data visualization, and designing reading recommendation strategies based on geographic and thematic similarity.