Text Classification: A Practical Introduction in Python
Massive Data Institute, Georgetown University
Overview
One of the most common tasks in natural language processing (NLP) is classifying texts into categories: which emails are spam, which tweets are happy, and so on. This task is called text classification and involves building models that predict a label for each text. This workshop will give you hands-on experience with a typical text classification workflow, including text featurization, supervised machine learning, and model evaluation. Basic familiarity with Python is required, but no prior experience with NLP is needed.
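To make the workflow concrete, here is a minimal sketch of the featurization and supervised learning steps using scikit-learn. The toy texts, the labels, and the choice of CountVectorizer with LogisticRegression are assumptions made for illustration; they are not taken from the workshop notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-written examples (hypothetical data, not from the workshop).
texts = [
    "win a free prize now",
    "free money click here",
    "claim your free reward today",
    "meeting moved to friday",
    "lunch at noon tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Step 1 (featurization): CountVectorizer turns each text into word counts.
# Step 2 (supervised learning): LogisticRegression learns to map counts to labels.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict labels for texts the model has not seen.
new_texts = ["free prize inside", "see you at the meeting"]
predictions = model.predict(new_texts)
print(list(zip(new_texts, predictions)))
```

With only six training texts the predictions are not reliable; the point is the shape of the workflow, where raw text goes in and a predicted label comes out.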
Workshop goals
- Get comfortable with the vocabulary of text classification
- Understand the intuition behind supervised machine learning
- Learn how to implement a few key supervised machine learning algorithms, including logistic regression and decision trees
- Learn a few methods for model evaluation, including cross-validation, and how to implement them
- Gain experience implementing, comparing, and optimizing machine learning models
- Gain the foundational knowledge for continued learning
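As a rough preview of the goals above, the sketch below compares logistic regression and a decision tree with cross-validation, then tunes one hyperparameter with a grid search. It assumes scikit-learn is installed; the toy data, the TF-IDF featurization, the number of folds, and the grid of C values are all illustrative choices rather than the workshop's own settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy, hand-written data (hypothetical): 1 = positive, 0 = negative.
texts = [
    "what a great movie", "i loved this film", "wonderful acting and story",
    "great fun from start to finish", "i loved the wonderful cast",
    "a great and moving story",
    "what a terrible movie", "i hated this film", "boring acting and story",
    "terrible from start to finish", "i hated the boring cast",
    "a dull and terrible story",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Compare models: 3-fold cross-validation fits on two thirds of the data
# and scores accuracy on the held-out third, three times per model.
mean_scores = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.2f}")

# Optimize a model: try several regularization strengths (C) for
# logistic regression, again scored by cross-validation.
search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), LogisticRegression()),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(texts, labels)
print("best C:", search.best_params_["logisticregression__C"])
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the held-out texts never influence the vocabulary.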
Prerequisites
We will get our hands dirty implementing an assortment of simple text classification models. To follow along with the code (which is the point), you will need some familiarity with Python and Jupyter Notebooks. If you haven't programmed in Python or haven't used Jupyter Notebooks, please do some self-teaching before this workshop using resources like those listed below.
Getting started & software prerequisites
For simplicity, just click the "Launch Binder" button (at the top of this README) to create a virtual environment ready for this workshop. It may take a few minutes; if it takes longer than 10 minutes, try again.
If you want to run the code on your computer, you have two options. You could use Anaconda to make installation easy: download Anaconda. Or, if you already have Python 3.x installed along with the libraries listed in requirements.txt, you're welcome to clone this repository and follow along on your own machine. You can also install all the necessary packages like so:
pip3 install -r requirements.txt
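If you would like to confirm that the installation worked, a small script along these lines could help. This is a sketch that uses only the Python standard library (Python 3.8 or later); the function name find_missing_packages is hypothetical, and it handles only simple requirement lines such as name or name==version, not URLs or other advanced pip syntax.

```python
import re
from importlib import metadata
from pathlib import Path


def find_missing_packages(requirements_path="requirements.txt"):
    """Return names listed in a requirements file that are not installed."""
    missing = []
    for raw_line in Path(requirements_path).read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # Take the package name that precedes any version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)  # raises if the package is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if Path("requirements.txt").exists():
        print("Missing packages:", find_missing_packages() or "none")
```

Run it from the repository root; an empty result suggests that every listed package is present in the active environment.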
About the presenter
Dr. Jaren Haber is a Postdoctoral Fellow with Georgetown University’s Massive Data Institute. His research applies computational methods to study how organizational contexts, social categories, and media segmentation shape the impacts of structural inequalities. He also leads the GU Interdisciplinary Text Analysis Research (GUITAR) working group. Dr. Haber received his PhD in Sociology from the University of California, Berkeley in 2020.
Online references
Supervised machine learning
- Scikit-learn user guide on supervised learning
- Scikit-learn tutorial on machine learning
- Great intro book: Pattern Recognition and Machine Learning by Christopher Bishop
- Canonical book on statistics & machine learning: The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Articles showing text classification in action
- "The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods" by Laura Nelson, Derek Burk, Marcel Knudsen, and Leslie McCall
- "Identifying what types of blog posts are censored in China" by Gary King, Jennifer Pan, and Margaret E Roberts
- "Literary Pattern Recognition" by Hoyt Long and Richard So
- "How Quickly Do Literary Standards Change?" by Ted Underwood and Jordan Sellers
NLP in Python
- NLP course & scripts, for social scientists & digital humanists (Laura Nelson)
- NLP textbook (Jurafsky & Martin @ Stanford)
- Book on NLTK (NLTK team)
- Datasets for NLP (Hugging Face)
- Intro to SpaCy and NLP concepts (Allison Parrish)
- Workshops on NLTK and SpaCy (Geoff Bacon @ D-Lab)
Python and Jupyter notebooks
- Introduction to Jupyter Notebooks (Real Python)
- Quick Python intro (a Jupyter Notebook)
- Great book on Python (with exercises): “Python for Everybody” (Charles Severance)
- Official Python Tutorial
- Python tutorials for social scientists (Neal Caren)
- Popular intro book for all things Python: Automate the Boring Stuff with Python, free for Georgetown students/affiliates (log in here)
Contributing
If you spot a problem with these materials, please open an issue describing the problem or contact Jaren at jhaber@berkeley.edu. If you want to suggest additional resources or materials, please create a branch and open a pull request!
Acknowledgments
- Laura K. Nelson, especially her course on Text Analysis for the Social Sciences and Humanities
- D-Lab at the University of California, Berkeley, especially their Machine Learning in Python workshop
- Geoff Bacon, especially his Introduction to Text Classification workshop
- Summer Institute in Computational Social Science