Natural language processing: An introduction in Python

A workshop with the Massive Data Institute, Georgetown University


This workshop will equip newcomers to natural language processing, or NLP, (with some Python know-how) with a foundation for applying NLP methods in their work. The focus is on common steps in an NLP research workflow and user-friendly implementations of popular packages and methods.

We will first go through the common “preprocessing recipe” used for a variety of applications and NLP techniques. This includes: a) tokenization; b) removing stopwords, punctuation, and numbers; c) stemming/lemmatizing words; d) calculating word frequencies and proportions; and e) part-of-speech tagging. We will then go over simple dictionary methods (including sentiment analysis) using a bag-of-words approach.
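To give a feel for steps (a) through (d) of the recipe, here is a minimal sketch using only the Python standard library. It is not the workshop's own code, which uses NLTK and Scikit-learn: the stopword list, the `crude_stem` helper, and the sample sentence are all made up for illustration, and part-of-speech tagging is left out because it needs a trained tagger.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on", "were"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only.

    Matching only letters also drops punctuation and numbers.
    """
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


def word_proportions(tokens):
    """Return each word's share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


sample = "The 3 cats were chasing the dogs, and the dogs chased cats."
tokens = preprocess(sample)
counts = Counter(tokens)
proportions = word_proportions(tokens)

print(tokens)       # ['cat', 'chas', 'dog', 'dog', 'chas', 'cat']
print(counts)
print(proportions)
```

Note how stemming maps "chasing" and "chased" to the same (non-word) stem, so they are counted together; a lemmatizer would instead aim to return a dictionary form such as "chase".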

For a recorded introduction to NLP and text preprocessing, watch my talk here on YouTube (58 mins). You can also see the slides under the day-1/ folder.

Workshop goals

  • Build intuitions about opportunities and limitations for using text as data
  • Understand at a high-level:
    • how a few primary NLP methods work
    • what kinds of questions they answer
    • how to design and implement an NLP project
  • Gain practice with:
    • preprocessing text data
    • common steps in NLP
    • dictionary methods
    • NLTK and Scikit-learn
  • Acquire resources for further learning
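As a taste of what a dictionary method looks like, here is a minimal sketch of bag-of-words sentiment scoring in plain Python. The `POSITIVE` and `NEGATIVE` word lists and the `dictionary_sentiment` function are hypothetical, invented for this example; real projects typically rely on a published, validated lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "helpful", "clear"}
NEGATIVE = {"bad", "confusing", "slow", "boring"}


def dictionary_sentiment(text):
    """Count lexicon hits in a bag of words and return a normalized score."""
    # Bag of words: word order is discarded, only counts are kept.
    bag = Counter(re.findall(r"[a-z]+", text.lower()))
    pos = sum(bag[word] for word in POSITIVE)
    neg = sum(bag[word] for word in NEGATIVE)
    total = sum(bag.values())
    # Net sentiment as a share of all tokens; 0.0 for empty input.
    score = (pos - neg) / total if total else 0.0
    return {"positive": pos, "negative": neg, "score": score}


result = dictionary_sentiment(
    "The workshop was great and the examples were clear, but the setup was slow."
)
print(result)
```

Because the bag of words ignores order, a phrase like "not good" would still count as positive here, which is one of the limitations of simple dictionary methods.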


We will get our hands dirty implementing basic natural language processing tools and methods. To follow along with the code (which is the point), you will need some familiarity with Python and Jupyter Notebooks. If you haven't programmed in Python or haven't used Jupyter Notebooks, please do some self-teaching before this workshop using resources like those listed below.

Getting started & software prerequisites

For simplicity, just click the "Launch Binder" button (at the top of this Readme) to create a virtual environment ready for this workshop. It may take a few minutes; if it takes longer than 10 minutes, try again.

If you want to run the code on your own computer, you have two options. You could use Anaconda to make installation easy: download Anaconda. Or, if you already have Python 3.x installed along with all the libraries listed in requirements.txt, you're welcome to clone this repository and follow along on your own machine. You can install all the necessary packages like so:

pip3 install -r requirements.txt
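After installing, a quick check like the sketch below can confirm that the two libraries named in the workshop goals are importable. It only assumes the import names `nltk` and `sklearn` (the import name for Scikit-learn); requirements.txt may list other packages that this check does not cover, and the `check_packages` helper is made up for this example.

```python
import importlib.util


def check_packages(module_names):
    """Map each module name to True if it can be found, False otherwise."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Import names for NLTK and Scikit-learn.
status = check_packages(["nltk", "sklearn"])
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```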

Open-Access, Online Resources on Python and NLP


If you spot a problem with these materials, please open an issue describing it.




An introduction to Natural Language Processing for NLP beginners with some Python know-how. Created for GU's Massive Data Institute in fall 2020 by Jaren Haber, PhD






