Name-based nationality prediction

This is a data analysis repository for the study at

Scrape Wikipedia

Names are pulled from Wikipedia's category of Living People. Step 1 navigates through the multiple pages of the category to allow access to all 900,000+ pages. Step 2 navigates to the individual people's pages, as linked from the Living People list, and parses out name and nationality. This step is time-intensive, so parallelization is strongly recommended. This functionality is available via
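The two scraping steps can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the category is read through the MediaWiki API's `list=categorymembers` query (the repository may instead parse the HTML category pages), and `fetch_json` and `parse_person` are hypothetical callables supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed MediaWiki API query parameters; the repository may scrape HTML instead.
BASE_PARAMS = {
    "action": "query",
    "format": "json",
    "list": "categorymembers",
    "cmtitle": "Category:Living people",
    "cmlimit": "500",
}


def iter_category_titles(fetch_json, params=BASE_PARAMS):
    """Yield page titles, following continuation tokens until the category ends.

    fetch_json is a hypothetical callable: it takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    cont = {}
    while True:
        response = fetch_json({**params, **cont})
        for member in response["query"]["categorymembers"]:
            yield member["title"]
        # The API includes a "continue" object while more pages remain.
        cont = response.get("continue")
        if not cont:
            break


def scrape_people(titles, parse_person, max_workers=8):
    """Run parse_person (hypothetical: title -> (name, nationality)) in parallel.

    Results come back in the same order as the input titles.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_person, titles))
```

Threads suit this workload because the time is spent waiting on network responses; the worker count of 8 is an arbitrary placeholder and should respect Wikipedia's rate limits.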

03.process-wiki-output cleans the names returned in step 2 (removing nicknames, common suffixes, etc.) and maps the nationality returned in step 2 to a larger region and subsequent classifier class.
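A rough sketch of this kind of cleaning and mapping is shown below. The suffix list, the region map, and the function names are illustrative assumptions; the lists and classes actually used by 03.process-wiki-output are not reproduced here.

```python
import re

# Illustrative values only; the repository's real lists are not shown here.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
REGION_BY_NATIONALITY = {"French": "Europe", "Japanese": "East Asia"}  # hypothetical


def clean_name(raw):
    """Drop quoted or parenthesised nicknames and trailing suffixes like 'Jr.'."""
    # Remove "nickname" and (nickname) spans.
    name = re.sub(r'"[^"]*"|\([^)]*\)', " ", raw)
    tokens = name.replace(",", " ").split()
    # Strip any run of known suffixes from the end of the name.
    while tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def to_region(nationality):
    """Map a nationality to a broader region, or None if it is not in the map."""
    return REGION_BY_NATIONALITY.get(nationality)
```

Returning `None` for unmapped nationalities makes it easy to filter out records that cannot be assigned a classifier class.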

Build classifier

Step 4 splits the names into a list of n-grams (e.g. "Jane Doe" split into 3-grams becomes the document ["Jan", "ane", "ne ", "e D", " Do", "Doe"]), which is used as input for the classifier, and splits the Wikipedia data into training and evaluation sets. Step 5 builds the LSTM neural network used as a classifier on the training data created in step 4, and then returns accuracy metrics as measured on the evaluation set.
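The n-gram splitting and the training/evaluation split can be sketched as below. The function names, the 20% evaluation fraction, and the fixed seed are assumptions made for illustration; the repository's own split proportions are not stated here.

```python
import random


def char_ngrams(name, n=3):
    """Return the overlapping character n-grams of name, in order."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def train_eval_split(records, eval_fraction=0.2, seed=0):
    """Shuffle a copy of records and split it into (train, eval) lists.

    eval_fraction and seed are placeholder values, not the repository's settings.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

Calling `char_ngrams("Jane Doe")` reproduces the six 3-grams listed above, spaces included. Names shorter than `n` yield an empty list, so they would need padding or filtering before being passed to a classifier.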

Use classifier

This step takes conference speaker and journal author data from (code to produce these data also available in that repo) and runs them through our classifier.


The entire repository is released under a BSD 3-Clause License as written in In addition, the contents of the data and models directories are released under a CC0 Public Domain Dedication. Note that this repository contains information from Wikipedia, which is licensed under CC BY-SA 3.0. However, our understanding is that we are reusing purely factual data from Wikipedia that is not subject to copyright in the United States.


