
title: Data and Databases
date: 5/28/14 - 7/14/14
time: T & Th
affiliation: Columbia University, Lede Program
instructors: Allison Parrish, Matthew Jones, Dan Vegeto (TA)
Room: Pulitzer 607B
State of Being:

description: Consideration of both the scientific and social implications of counting, of turning the world into bits. Through the process of gaining fluency in the use of Python, students will spend some time thinking through representations of core "data types" like time, location, text, image, sound and relationships (or networks), and the computational "affordances" associated with each. Students will study several common metaphors for organizing and storing data – from structureless key-value stores, to document collections like MongoDB, to a single table or spreadsheet, to the "multiple tables" of a relational database. We will also discuss ideas behind publishing or sharing data, moving from HTML documents and Web 1.0 to data services and APIs in Web 2.0, to semantics in Web 3.0. These efforts will be project-driven, with students using and building modern data services with a scripting language. Their projects will underscore the reality that data are plentiful and circulate and interact in a kind of informational ecosystem. As researchers, our students will be called on both to access and to publish data products.


Readings must be completed before the beginning of class for each session. They are likely to change as our collective interests become clearer. The readings comprise, on the one hand, prominent examples of data journalism and, on the other, more reflective methodological discussions, often in more academic idioms.


  • 35% attendance and participation (incl. reading discussions)
  • 35% final project
  • 30% homework assignments (5% each)


session 01: tuesday, may 27th 2014

  • setting up an ec2 server

  • how to use ipython

session 02: thursday, may 29th 2014


homework assignment due 6-3: Lists and list operations (answer key).

session 03: tuesday, june 3rd 2014

  • dictionaries, getting results from JSON APIs. Notes here.
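
As a rough illustration of this session's topic, the sketch below parses a JSON response into Python dictionaries and lists. The payload and its field names (`results`, `title`, `year`) are made up for the example and do not come from any particular API; a real request would first fetch this text over HTTP.

```python
import json

# Hypothetical response body; a real API would return this text over HTTP.
raw = '{"results": [{"title": "Alien", "year": 1979}, {"title": "Brazil", "year": 1985}]}'

def titles_by_year(payload):
    """Map each year to the list of titles released that year."""
    data = json.loads(payload)  # JSON object -> dict, JSON array -> list
    by_year = {}
    for item in data.get("results", []):
        # setdefault creates the list the first time a year is seen
        by_year.setdefault(item["year"], []).append(item["title"])
    return by_year

lookup = titles_by_year(raw)
print(lookup)
```

Using `data.get("results", [])` rather than `data["results"]` means a response without that key yields an empty dictionary instead of raising a `KeyError`.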


  • Wickham, Hadley, Deborah Swayne, and David Poole. “Bay Area Blues: The Effect of the Housing Crisis.” Beautiful Data: The Stories Behind Elegant Data Solutions, 2009, 303–22.

    • as you read this: what are all of the different sources of structured data that they draw upon? What formats?
  • Felten, Edward W. “Declaration of Professor Edward W. Felten in ACLU et al. v. James R. Clapper et al.,” August 26, 2013.


session 04: thursday, june 5th 2014

  • strings and string operations; regular expressions. Notes here.
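
A minimal sketch of the kind of task this session covers: pulling structured values out of free text with a regular expression. The pattern and the helper name `find_due_dates` are assumptions chosen for illustration, not taken from the course notes.

```python
import re

# Matches month-day pairs such as "6-3" or "6-10" (illustrative pattern only).
DUE_DATE = re.compile(r"\b(\d{1,2})-(\d{1,2})\b")

def find_due_dates(text):
    """Return (month, day) integer pairs for every month-day match in text."""
    return [(int(month), int(day)) for month, day in DUE_DATE.findall(text)]

line = "homework assignment due 6-10: Dictionaries, web APIs"
print(find_due_dates(line))
# Plain string methods handle the simpler cases without a regex.
print(line.split(":")[0].upper())
```

With two capture groups in the pattern, `findall` returns a list of tuples of strings, which is why each pair is converted to integers.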


homework assignment due 6-10: Dictionaries, web APIs, strings, regular expressions (answer key).

session 05: tuesday, june 10th 2014

Making structure: number munging

  • basic linear algebra
  • tables, arrays
  • Pandas & NumPy


  • Edwards, Paul. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge: MIT Press, 2010, ch. 10: “Making Data Global”

session 06: thursday, june 12th 2014

Making structure: text mining

  • text munging: textblob, nltk - tokenizing, stemming
    • term-document matrix (tdm)
    • bag of words and its alternatives
    • algorithms (clustering, LSA)
    • sentiment analysis
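
The bag-of-words idea listed above can be sketched with the standard library alone. This is a simplified illustration, not the course's method: it skips stemming, uses a naive tokenizer rather than textblob or nltk, and lays the counts out with one row per document (the transpose of a term-by-document layout). The function names and sample sentences are invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_matrix(docs):
    """Return (vocabulary, rows): one row of term counts per document."""
    bags = [Counter(tokenize(doc)) for doc in docs]  # one bag of words per document
    vocab = sorted(set().union(*bags))               # shared, alphabetically ordered vocabulary
    rows = [[bag[term] for term in vocab] for bag in bags]
    return vocab, rows

docs = ["Data are plentiful", "data circulate and data interact"]
vocab, rows = count_matrix(docs)
for doc, row in zip(docs, rows):
    print(row, doc)
```

A `Counter` returns 0 for terms it has never seen, so each row can be built by looking up every vocabulary term without checking for membership first.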


homework assignment: TBD

session 07: monday, june 16th 2014

Documenting data journalism

Readings: "Presenting data to the public,"

Friedman, Batya, and Helen Nissenbaum. “Bias in Computer Systems.” ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

session 08: tuesday, june 17th 2014

(overflow/catch-up day for previous sessions)

session 09: tuesday, june 24th 2014

HTTP, HTML (Beautiful Soup), XML. Notes here.
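
For the XML part of this session, a small sketch using only the standard library's `xml.etree.ElementTree` (this is not Beautiful Soup, which the session uses for HTML). The document, its tag names and the `price` attribute are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up XML document for illustration.
xml_text = """
<menu>
  <dish price="8.50"><name>Pad Thai</name></dish>
  <dish price="6.00"><name>Spring Rolls</name></dish>
</menu>
"""

def dish_prices(text):
    """Return a dict mapping each dish name to its price as a float."""
    root = ET.fromstring(text.strip())
    # iter("dish") walks every <dish> element; get() reads an attribute,
    # findtext() reads the text of a child element.
    return {dish.findtext("name"): float(dish.get("price")) for dish in root.iter("dish")}

prices = dish_prices(xml_text)
print(prices)
```

The same split between attributes (`price`) and child elements (`name`) shows up when navigating HTML, where tags carry both attributes and nested content.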


  • Liu, Alan. “Transcendental Data: Toward a Cultural History and Aesthetics of the New Encoded Discourse.” Critical Inquiry 31, no. 1 (September 2004): 49–84. doi:10.1086/427302.

session 10: thursday, june 26th 2014


Homework assignment, due July 1st: Scraping with Beautiful Soup (answer key).

session 11: tuesday, july 1st 2014

mongodb, an introduction. Notes here.

session 12: thursday, july 3rd 2014

introduction to web APIs w/tornado


homework assignment due july 8th: MongoDB and Tornado (answer key).

session 13: tuesday, july 8th 2014

slush/overflow/lab/selected topics day

session 14: thursday, july 10th 2014

final project presentations

additional resources


learn python the hard way:

how to think like a computer scientist (python edition):

data wrangling

data journalism examples and awardees

additional readings cut from syllabus

David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning about a Highly Connected World, ch. 2

“Connected China”

TED talk: