This repository includes a workshop (more info below) introducing word embedding models, as well as hack-session starter code for loading and exploring word embedding models trained on charter school data. Some data are contained in the repo; the rest will be linked into the Jupyter instance we'll set up at the start of the workshop. The charter school data come from repository author Jaren Haber's web scraping of charter school websites, and the embeddings were created with the word2vec implementation in gensim. The repository was prepared for TextXD 2018 (http://www.textxd.org/) at the Berkeley Institute for Data Science (BIDS), UC Berkeley.
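For concreteness, here is a minimal sketch of the kind of loading-and-exploring the starter code supports: opening a saved gensim word2vec model and querying its embedding space. The file name `charters.model` and the query word `students` are hypothetical placeholders; the actual model files will be linked in the Jupyter instance.

```python
from gensim.models import Word2Vec

# Load a previously trained model from disk. The file name is a
# hypothetical placeholder for one of the models linked in the
# Jupyter instance.
model = Word2Vec.load("charters.model")

# Explore the embedding space: find the words whose vectors are
# closest (by cosine similarity) to a query word.
print(model.wv.most_similar("students", topn=10))
```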
This one-hour workshop introduces word embeddings in Python and explores the semantic relationships captured by the word2vec model. We'll mainly use the Akkadian ORACC corpus assembled by Professor Niek Veldhuis of UC Berkeley Near Eastern Studies. We'll also look briefly at a word2vec model trained on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts, made available by Ryan Heuser.
- Learn the intuition behind word embedding models (WEMs)
- Learn how to train a WEM using gensim's implementation of word2vec (see the sketch after this list)
- Explore a corpus you've probably never seen before
- Think through how visualization of WEMs might help you explore your corpus
- Implement text analysis on a non-English language
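As a preview of the hands-on portion, here is a minimal, self-contained sketch of training a word2vec model with gensim. It assumes gensim 4.0 or later, where the dimensionality parameter is `vector_size` (older versions call it `size`); the toy corpus is invented for illustration, and in the workshop we'll train on real tokenized text instead.

```python
from gensim.models import Word2Vec

# Toy corpus: each inner list is one tokenized sentence.
# Invented for illustration only.
sentences = [
    ["charter", "schools", "emphasize", "discipline"],
    ["students", "learn", "reading", "and", "math"],
    ["teachers", "support", "students", "in", "reading"],
]

# Train a skip-gram word2vec model on the toy corpus.
model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word (the corpus is tiny)
    sg=1,             # 1 = skip-gram; 0 = CBOW
)

# Every word in the vocabulary now maps to a dense vector.
print(model.wv["students"].shape)  # -> (100,)

# Cosine similarity between two word vectors.
print(model.wv.similarity("students", "teachers"))
```

With a real corpus, `min_count` is usually raised (e.g., to 5) to drop rare words, and similarity queries become far more meaningful.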
All are welcome! You don't need to know how neural nets work or be a Python expert to benefit from this workshop. We'll focus on the concepts behind word embeddings more than the specific syntax. This workshop will be most useful to people who have some familiarity with Python but have never done word embeddings before.
If you notice a problem with these materials, please open an issue describing it. Collaboration and transparency are worth everyone's time!
- Jaren Haber
- Laura Nelson and the D-Lab