Spatial analysis and reading recommendation system for the Savoirs digital humanities corpus.
Analyzes the Savoirs French academic corpus by extracting geographic references from TEI/XML-encoded articles, geocoding place names, and computing spatial similarity between articles. Articles that discuss geographically related topics are surfaced as reading recommendations.
- Geospatial: GeoPandas, Shapely, Geopy (Nominatim), Geocoder (GeoNames API)
- XML Processing: BeautifulSoup with lxml parser
- Data Processing: Pandas
- Data Format: GeoPackage (GPKG)
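The XML-processing step can be illustrated with a minimal sketch. The project itself uses BeautifulSoup with the lxml parser; the stdlib `ElementTree` is used below so the snippet runs without third-party dependencies, and the sample TEI and function name are illustrative rather than taken from `functions_xml.py`.

```python
# Minimal sketch of TEI place-name extraction. The project uses BeautifulSoup
# with lxml; stdlib ElementTree is used here so the example runs standalone.
# The sample document and function name are illustrative.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

tei_sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Pausanias décrit <placeName>Delphes</placeName>
       et <placeName>Olympie</placeName>.</p>
  </body></text>
</TEI>"""

def extract_place_names(tei_xml: str) -> list[str]:
    """Return the text of every TEI <placeName> element, in document order."""
    root = ET.fromstring(tei_xml)
    return [el.text.strip() for el in root.iter(TEI_NS + "placeName")]

print(extract_place_names(tei_sample))  # ['Delphes', 'Olympie']
```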
TEI/XML Corpus (~100 articles)
↓
Extract place names from <placeName> tags
↓
Geocode locations (GeoNames API + Nominatim fallback)
↓
Save as GeoPackage (one per article)
↓
Compute pairwise spatial distances between articles
↓
Export similarity rankings as CSV
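The geocoding step above queries GeoNames first and falls back to Nominatim on a miss. A sketch of that control flow follows; since the real geocoders (the `geocoder` and `geopy` libraries) hit the network, they are stubbed out here, and all names and coordinates are illustrative.

```python
# Sketch of the GeoNames-then-Nominatim fallback used during geocoding.
# Real geocoder calls are network-bound, so stubs stand in for them;
# only the fallback logic itself is demonstrated.
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)

def geocode_with_fallback(
    place: str, geocoders: list[Callable[[str], Optional[Coords]]]
) -> Optional[Coords]:
    """Try each geocoder in order; return the first hit, or None if all miss."""
    for gc in geocoders:
        coords = gc(place)
        if coords is not None:
            return coords
    return None

# Stubs standing in for the GeoNames API and the Nominatim fallback.
def fake_geonames(place: str) -> Optional[Coords]:
    return {"Delphes": (38.48, 22.50)}.get(place)

def fake_nominatim(place: str) -> Optional[Coords]:
    return {"Olympie": (37.64, 21.63)}.get(place)

chain = [fake_geonames, fake_nominatim]
print(geocode_with_fallback("Delphes", chain))    # resolved by the primary geocoder
print(geocode_with_fallback("Olympie", chain))    # resolved by the fallback
print(geocode_with_fallback("Atlantide", chain))  # None: both miss
```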
hackaton_savoirs/
├── python/
│   ├── pre_processing.py      # Geocoding pipeline (TEI → GeoPackage)
│   ├── main.py                # Pairwise distance computation
│   ├── functions_xml.py       # TEI/XML extraction (titles, authors, places)
│   ├── functions_geodata.py   # Geocoding and GeoDataFrame creation
│   ├── geo_article.py         # GeoArticle class for spatial similarity
│   ├── functions.py           # General utilities
│   └── tests/                 # Unit tests
├── CorpusTEI/                 # Input: TEI-encoded articles (ANABASES journal)
└── data/
    ├── article_locations/     # Output: GeoPackage files per article
    └── distances/             # Output: CSV distance matrices
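Stage 2 (`main.py` with the `GeoArticle` class) reduces each article to its set of geocoded points and compares articles pairwise. The exact metric is not documented here; the sketch below uses one plausible choice, the mean over article A's places of the distance to the nearest place in article B, with a plain haversine formula instead of GeoPandas so it runs standalone.

```python
# Illustrative pairwise article distance. The real pipeline operates on
# GeoPackage-backed GeoDataFrames; the metric below (mean nearest-place
# haversine distance) is an assumption, not the project's documented one.
from math import asin, cos, radians, sin, sqrt

def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def article_distance(places_a, places_b) -> float:
    """Mean, over places in A, of the distance to the nearest place in B."""
    return sum(min(haversine_km(p, q) for q in places_b)
               for p in places_a) / len(places_a)

athens = [(37.98, 23.73)]
delphi_olympia = [(38.48, 22.50), (37.64, 21.63)]
print(f"{article_distance(athens, delphi_olympia):.0f} km")
```

Articles whose place sets lie near each other get small distances, so sorting the pairwise CSV in ascending order yields the geography-based reading recommendations.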
# Stage 1: Geocode place names from TEI corpus
python python/pre_processing.py
# Stage 2: Compute pairwise article distances
python python/main.py

Built during the Hackathon Savoirs for challenges on the Savoirs corpus: extracting knowledge through text mining, spatial analysis, and data visualization, and designing reading recommendation strategies based on geographic and thematic similarity.