This course investigates how to use digitized texts -- news articles, speeches, laws, press releases, party manifestos/platforms, transcripts, open-ended survey responses, tweets, etc. -- as sources of data for social science research.
We begin with an overview of the ``text as data'' field in political science -- which is heavily influenced by the computer science subfield of natural language processing (NLP). One practical payoff: much text is freely available, so even as a resource-constrained graduate student you can assemble data for your own research and use it to answer questions of theoretical interest.
We then discuss the theory and mechanics of converting text into data. This includes preprocessing and related NLP tasks (e.g., stemming, tokenizing) and representing text as data (e.g., bag-of-words, measures of association). Text data is often ``messy,'' so handling that messiness -- web scraping, file encodings, file formats, extracting the relevant text from raw strings, etc. -- will be a large part of this course.
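As a concrete taste of what tokenizing and a ``bag-of-words'' representation involve, here is a minimal sketch (written in Python for illustration; in the course itself we will use R and dedicated text-analysis packages):

```python
# Minimal sketch: tokenize two toy documents and build a bag-of-words
# document-term matrix. Illustrative only -- real pipelines also handle
# punctuation, stemming, stopwords, and sparse storage.
from collections import Counter

def tokenize(text):
    """Lowercase and split a document into word tokens."""
    return text.lower().split()

docs = [
    "The economy is growing",
    "The economy is shrinking",
]

# Shared vocabulary across all documents, in a fixed order.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Each document becomes a vector of word counts over the vocabulary.
dtm = [[Counter(tokenize(doc))[w] for w in vocab] for doc in docs]

print(vocab)  # the shared vocabulary
print(dtm)    # one row of counts per document
```

Note that word order is discarded: the two documents differ only in the columns for ``growing'' and ``shrinking,'' which is exactly the simplification (and the loss) that bag-of-words entails.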
We'll then turn to the major approaches to measuring social science concepts with textual data: rule-based methods, supervised learning from human-coded or known examples, and unsupervised methods. Along the way, we will discuss particular measurement objectives -- classification, scaling, topic modeling, and analysis of sentiment and stance -- as well as ways of validating our models.
Depending on time, student interest and capacity, we may learn about the neural network / deep learning approach that has come to dominate NLP in recent years.
The course assumes students have some graduate-level background in statistical inference, quantitative social science methodology, or machine learning, and at least passing familiarity with R (ideally, some hands-on experience).