Code and notes for lecture, on NLP and ML, for NYC Ascent


Notebooks for a lecture given to postdocs at NYC Ascent by Cesar Koirala, Kyle P. Johnson, and Ken Bame, on 26 February 2016. They cover some fundamentals of natural language processing (NLP) and how to use machine learning to gain insights into human language.


These were created on Windows 7 with the multi-platform Anaconda distribution.

Software setup

  1. Install Anaconda for your OS.
  2. Install pandas (from the Anaconda Prompt, run conda install pandas).
  3. Install scikit-learn (conda install scikit-learn).
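
To confirm that the setup worked, a quick check along the following lines may help. This is only a sketch: the function name check_packages is hypothetical and not part of the lecture code. It relies on the fact that scikit-learn is imported under the module name sklearn.

```python
import importlib.util


def check_packages(names=("pandas", "sklearn")):
    """Return a dict mapping each module name to True if it can be found.

    Note: scikit-learn is imported as "sklearn", not "scikit-learn".
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print a short report for the two packages the setup steps install.
for name, found in check_packages().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

If either package is reported as missing, rerun the matching conda install command from the Anaconda Prompt.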

Get lecture code

  1. Install Git
  2. With a terminal app (on Windows, Git Bash is strongly recommended), fetch this repo's source: git clone
  3. Change into the repo (cd lecture_nyc_ascent) and start the Jupyter notebook (jupyter notebook)


The folder tweets contains two .csv files, one of popular tweets (more than 500 retweets) and another of unpopular tweets (fewer than 10 retweets). These were obtained with the script To use this file, you will need to obtain authentication tokens and add them to

tweets_to_features.ipynb is a Jupyter notebook that illustrates some NLP basics (e.g., tokenization, stopword filtering) and also shows how to extract features from text (e.g., bag of words). When you run all of its cells, it will create a directory, feature_tables, which holds several feature tables for the tweets.
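
As a rough illustration of the ideas named above (tokenization, stopword filtering, bag of words), here is a minimal sketch that uses only the Python standard library. It is not the notebook's code: the function names, the tiny stopword list, and the sample sentences are all assumptions made for the example, and the notebook may rely on other tools.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def bag_of_words(text):
    """Count how often each non-stopword token occurs in the text."""
    return Counter(remove_stopwords(tokenize(text)))


# Made-up sample texts standing in for tweets.
tweets = ["The cat sat on the mat", "A dog and a cat"]

# The vocabulary is every distinct token seen across all texts, sorted.
vocabulary = sorted(set().union(*(bag_of_words(t) for t in tweets)))

# One row per text, one column per vocabulary word: a small feature table.
table = [[bag_of_words(t)[word] for word in vocabulary] for t in tweets]

print(vocabulary)
for row in table:
    print(row)
```

Each row of table is a feature vector for one text, which is the general shape of input that a machine learning classifier expects.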

The code_snippets directory has simplified code that serves as easy-to-understand examples of what appears in the other notebooks.