Natural language processing: An introduction in Python
Massive Data Institute, Georgetown University
Overview
This workshop will equip newcomers to natural language processing (NLP) who have some Python know-how with a foundation for applying NLP methods in their work. The focus is on common steps in an NLP research workflow and on user-friendly implementations of popular packages and methods.
We will first go through the common “preprocessing recipe” used for a variety of applications and NLP techniques. This includes: a) tokenization; b) removing stopwords, punctuation, and numbers; c) stemming/lemmatizing words; d) calculating word frequencies/proportions; and e) part-of-speech tagging. We will then go over simple dictionary methods (including sentiment analysis) using a bag-of-words approach.
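To give a flavor of that recipe, here is a minimal sketch using NLTK. The sample sentence, variable names, and exact resource downloads are illustrative assumptions, not the workshop's actual notebooks (see the repository folders for those).

```python
# A minimal sketch of the preprocessing recipe described above, using NLTK.
# The sample sentence is made up for illustration.
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the resources NLTK needs (only required once per environment;
# exact resource names can vary slightly across NLTK versions).
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The 3 judges said the ruling was fair, but critics disagreed loudly."

# a) Tokenization
tokens = nltk.word_tokenize(text.lower())

# b) Remove stopwords, punctuation, and numbers
stops = set(stopwords.words("english"))
tokens = [
    t for t in tokens
    if t not in stops and t not in string.punctuation and not t.isdigit()
]

# c) Lemmatize (a stemmer such as nltk.stem.PorterStemmer would work similarly)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# d) Word frequencies and proportions
counts = Counter(lemmas)
total = sum(counts.values())
proportions = {word: n / total for word, n in counts.items()}

# e) Part-of-speech tagging (run on the original, unfiltered tokens)
pos_tags = nltk.pos_tag(nltk.word_tokenize(text))

print(counts.most_common(5))
print(pos_tags[:5])
```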
For a recorded introduction to NLP and text preprocessing, watch my talk here on YouTube (58 mins). You can also see the slides under the day-1/ folder.
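As a first look at the dictionary methods mentioned above, the sketch below scores documents with a bag-of-words sentiment dictionary. The tiny lexicon and example documents are invented for illustration; in practice you would use an established lexicon (for example VADER, which ships with NLTK) or a domain-specific dictionary.

```python
# A minimal sketch of a dictionary (lexicon) method on a bag-of-words
# representation. The lexicon and documents below are made up for illustration.
from collections import Counter

import nltk

nltk.download("punkt")  # tokenizer models

positive_words = {"good", "great", "fair", "benefit", "support"}
negative_words = {"bad", "unfair", "harm", "oppose", "crisis"}

documents = [
    "The new policy is a great benefit to working families.",
    "Critics call the ruling unfair and warn of a looming crisis.",
]

for doc in documents:
    # Bag of words: word order is ignored, only token counts matter.
    bag = Counter(nltk.word_tokenize(doc.lower()))
    pos = sum(n for word, n in bag.items() if word in positive_words)
    neg = sum(n for word, n in bag.items() if word in negative_words)
    # A simple sentiment score: (positive - negative) / total tokens.
    score = (pos - neg) / sum(bag.values())
    print(f"{score:+.2f}  {doc}")
```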
Workshop goals
- Build intuitions about opportunities and limitations for using text as data
- Understand at a high-level:
- how a few primary NLP methods work
- what kinds of questions they answer
- how to design and implement an NLP project
- Gain practice with:
- preprocessing text data
- common steps in NLP
- dictionary methods
- NLTK and Scikit-learn
- Acquire resources for further learning
Prerequisites
We will get our hands dirty implementing basic natural language processing tools and methods. To follow along with the code (which is the point), you will need some familiarity with Python and Jupyter Notebooks. If you haven't programmed in Python or haven't used Jupyter Notebooks, please do some self-teaching before this workshop using resources like those listed below.
Getting started & software prerequisites
For simplicity, just click the "Launch Binder" button (at the top of this README) to create a virtual environment ready for this workshop. It may take a few minutes to load; if it takes longer than 10 minutes, try again.
If you want to run the code on your own computer, you have two options. You could use Anaconda to make installation easy: download Anaconda. Or, if you already have Python 3.x installed with the full list of libraries listed under requirements.txt, you're welcome to clone this repository and follow along on your own machine. You can install all the necessary packages like so:
pip3 install -r requirements.txt
Open-Access, Online Resources on Python and NLP
- Introduction to Jupyter Notebooks (Real Python)
- Quick Python intro (a Jupyter Notebook)
- Great book on Python (with exercises): “Python for Everybody” (Charles Severance)
- Official Python Tutorial
- Python tutorials for social scientists (Neal Caren)
- NLP course & scripts, for social scientists & digital humanists (Laura Nelson)
- NLP textbook (Jurafsky & Martin @ Stanford)
- Book on NLTK (NLTK team)
- Datasets for NLP (Hugging Face)
- Intro to SpaCy and NLP concepts (Allison Parrish)
- Workshops on NLTK and SpaCy (Geoff Bacon @ D-Lab)
Contributing
If you spot a problem with these materials, please make an issue describing the problem.
Acknowledgments
- D-Lab at the University of California, Berkeley
- Summer Institute in Computational Social Science
- Laura Nelson
- Geoff Bacon
- Ben Gebre-Medhin
- David Bamman