A collection of resources and readings for people wanting to get acquainted with computational social science.
Credit goes to Andrew Hall for pitching the idea at Data Science Nights@Northwestern. Quoted text is taken directly from the website or document. Suggestions welcome.
This matrix of skills is a nice starting point for thinking about "data science" or computational social science as a collection of activities that can be more or less complex.
- Perspectives on Computational Analysis Syllabus
- Computational Social Science, syllabus by Nir Grinberg, Ben-Gurion University
- Very high-level view of what makes up "data science:" Curriculum Guidelines for Undergraduate Programs in Data Science
-
Learn R in R: the swirl package.
-
Fast lane to learning R by Norman Matloff (professor of Computer Science at UC Davis). Self-description:
This site is for those who know nothing of R, or maybe even nothing of programming, and seek QUICK, painless entree to the world of R.
The course is quite thorough regarding base R, including graphics (ggplot2 is covered as well). NM is a proponent of learning base R first before learning third-party packages and I tend to agree.
-
R for Data Science by Garrett Grolemund and Hadley Wickham. The authors are important originators of/contributors to the so-called "tidyverse", a collection of packages for R. These packages tend to make things easier (especially for automated workflows). However, starting out with the "tidyverse" when learning R is, in my opinion, a bit like learning to run before learning to walk.
-
Starting from zero, Data Carpentry workshop. These resources are intended for in-person workshops but can be used by self-learners.
This is an introduction to R designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, how to calculate summary statistics from a data frame, and a brief introduction to plotting.
-
Data Science Course in a Box (Course materials) by Mine Cetinkaya-Rundel for RStudio. Primarily intended for teachers but might be valuable for self-learners too. Self-presentation:
Data Science in a Box contains the materials required to teach (or learn from) an introductory data science course using R, all of which are freely-available and open-source. They include course materials such as slide decks, homework assignments, guided labs, sample exams, a final project assignment, as well as materials for instructors such as pedagogical tips, information on computing infrastructure, technology stack, and course logistics.
See datasciencebox.org for everything you need to know about the project!
-
R for Stata users, for people coming from Stata and wanting to learn R. An earlier draft is available for free. This book is structured somewhat similarly to the O'Reilly Cookbooks, i.e. it is a laundry list of problems or situations for which solutions are given in both Stata and R. If your particular problem is among those covered, great! If not, you will still need to learn the basics of R and translate Stata logic into R logic yourself.
-
Chromebook Data Science project
Chromebook Data Science (CBDS) is an online educational program to help anyone who can read, write, and use a computer to move into data science. It is offered by faculty members in the Johns Hopkins Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. There are currently 12 courses that are offered in the Chromebook Data Science Curriculum.
-
UK Data Service Data Skills Modules
These introductory level interactive modules are designed for users who want to get to grips with key aspects of survey, longitudinal and aggregate data.
-
The BBC's visual and data journalism cookbook for R, Blog post announcing and explaining the launch of the BBC's visual and data journalism cookbook in R
-
Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. The different chapters each correspond to a 1 to 2 hours course with increasing level of expertise, from beginner to expert.
"How to name files," Jenny Bryan's speaker deck
"Project structure & Naming files," Danielle Navarro (inspired by Jenny Bryan), slides
Version control in R:
"starting R markdown," a YouTube tutorial playlist by Danielle Navarro
Should I learn R or Python? It depends on what you want to do with it. What tasks do you want to accomplish? What professional goals do you want to attain? Norman Matloff discusses how R and Python compare to each other for various tasks (and on some more general dimensions) here.
Matthew Salganik, Bit by bit (free version)
Matthew Salganik, Bit by bit (tree version)
Bernard E. Harcourt, Against Prediction (tree version) Summary: Against Prediction argues that predictive policing models not “crime” but “arrests”, i.e. police behavior, not the supposed underlying behavior (not what crimes will happen where, but who will be arrested). Therefore, it will reinforce existing trends in policing instead of “improving” policing.
Bernard E. Harcourt, Against Prediction (working paper)
Bernard E. Harcourt, Against Prediction (review by Cosma Shalizi)
Thoughts on algorithmic fairness: "Algorithmic fairness is an interdisciplinary research field concerned with the various ways that algorithms may perpetuate or reinforce unfair legacies of our history, and how we might modify the algorithms or systems they are used in to prevent this. For example, if the training data used in a machine learning method contains patterns caused by things like racism, sexism, ableism, or other types of injustice, then the model may learn those patterns and use them to make predictions and decisions that are unfair. There are many ways that technology can have unintended consequences, and this is just one of them."
Claus Wilke, Fundamentals of Data Visualization
Claus Wilke, Fundamentals of Data Visualization (R markdown source)
Kieran J. Healy, Data Visualization - A practical introduction
Garrett Grolemund, Hadley Wickham, R for Data Science
Quartz guide to real world data munging problems
Data cleaning in R needn't be hard - presentation materials by Crystal Lewis
Fairness definitions and their politics: presentation by Arvind Narayanan, video, presentation by Arvind Narayanan, text, article by Verma and Rubin, pdf
Notes from @DynamicWebPaige on Google's ML Fairness course
I have to thank the terrific Tom Theile and the pretty excellent Peter Eibich for suggesting many of these. In particular, you could check out Tom's guide to finding datasets online.
Inter-university Consortium for Political and Social Research - A data repository for mostly survey data. North America-centric.
Internet Archive datasets - The Internet Archive is primarily known for the Wayback Machine. It also stores and makes available data.
Urban Institute Data Catalogue, UIDC announcement and short presentation
Wikidata - Wikidata is a human-curated database of every "fact" of Wikipedia (and more) in a structured format.
Google tool for finding datasets
Data portals - a list of open data portals globally
Office for National Statistics
GovData - German administrative data
German maps and geographic data
European Union data repository
Cross-national equivalent file - The Cross-National Equivalent File (CNEF) project harmonizes a subset of the data found on seven panel data sets collected in Australia, Canada, China, Germany, Korea, Russia, Switzerland, UK, and US.
LIS acquires datasets with income, wealth, employment, and demographic data from many high- and middle-income countries, harmonises them to enable cross-national comparisons, and makes them publicly available in two databases, the Luxembourg Income Study Database (LIS) and the Luxembourg Wealth Study Database (LWS).
German and EU surveys and administrative data
Cook County Open Data - State Attorney (e.g. arrest data)
Wesleyan Media Project: "The Wesleyan Media Project tracks and analyzes all broadcast advertisements aired by or on behalf of federal and state election candidates in every media market in the country."
The Stanford Open Policing Project: "Our team is gathering, analyzing, and releasing records from millions of traffic stops by law enforcement agencies across the country."
The @unitedstates project - Scrapers and parsers for many aspects regarding Congress, e.g. bios of members past and present, data about bills and roll call votes, district shapefiles, and much more.
Congressional record parser: "This tool converts HTML files containing the text of the Congressional Record into structured text data. It is particularly useful for identifying speeches by members of Congress."
OpenStreetMap - OpenStreetMap is a volunteer-built map of the world. You may download all of the data (or parts of it) from OpenStreetMap (https://wiki.openstreetmap.org/wiki/Planet.osm) and/or the Internet Archive (https://archive.org/details/osmdata).
List of sociologists on twitter, by Philip N. Cohen
List of demographers on twitter, by Conrad Hackett
List of demographers on twitter, by Cameron Campbell
#rladies
#rstats
The Data Science job market is saturated
Cheatsheet - Neural networks/machine learning
R for the Rest of Us, Resources
R resources collection, NU Research Computing Services
Python resources collection, NU Research Computing Services
Following recent events at DataCamp (see here, here, here, here), this guide prefers to recommend other resources. The course offer is, however, comprehensive, and university students may benefit from special offers.