EPSY 887 Data Science Institute (Fall 2014)
Over the last five to ten years, data science has become one of the fastest growing fields and was named one of the "sexiest" jobs by Harvard Business Review. But is data science something new? Conway (2010) describes data science as the intersection of math and statistical knowledge, hacking skills (e.g. programming, computer skills, etc.), and substantive expertise (see Figure 1). This seminar will explore the skills and tools necessary for data science, with a special emphasis on the role of data science in educational and social science contexts. The first third of the course will focus on learning R, the open source statistical software and programming language used by many data scientists (i.e. hacking skills). The middle third will explore some of the important statistical procedures used by data scientists, including data visualization, classification and regression trees, logistic regression, propensity score analysis, and other topics as time permits (i.e. math and statistical knowledge). The final third of the class will be left for topics of special interest to students and their research agendas (i.e. substantive expertise). Class examples will utilize the Programme for International Student Assessment (PISA), a large scale international study conducted every three years. Other open and freely available datasets will also be discussed as appropriate.
Figure 1. The data science Venn diagram (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
- Introduction to R (e.g. data input, recoding, etc.)
- Reshaping data
- Data visualization vis-à-vis a grammar of graphics
- Introduction to programming for data analysis (e.g. loops, conditional statements, functions, etc.)
- Missing data
- Analysis of complex surveys (e.g. use of replicate weights and multiple plausible values)
- Document preparation and typesetting with Markdown, LaTeX, and Sweave
- R package development
- Web-based and interactive graphics (including Shiny)
- Software project management principles as applied to data analysis (e.g. source control, progress tracking, versioning, GitHub, R-Forge, etc.)
- Other statistical topics as identified by students and appropriate for analysis of large datasets. Topics may include:
  - Propensity score analysis
  - Item response theory
  - Random forests
  - Classification and regression trees
  - Cluster analysis
  - Principal component analysis
  - Factor analysis
  - Multilevel modeling
Students are encouraged to bring their own data and/or research questions, as this seminar will emphasize applied statistics and analysis. Class examples, however, will utilize the Programme for International Student Assessment (PISA; OECD 2009).
It is recommended that students have taken at least two graduate level statistics courses (EPSY 530 and EPSY 630 or equivalent). No prior experience with R is expected, but some experience with statistical software would be helpful.
This course will make substantial use of the R software language. The following software is required and freely available. See the installation page for details.
The first two books, Kabacoff and Matloff, are recommended for learning R. They provide two different perspectives on R. Specifically, Kabacoff presents R from a data analyst's point of view, whereas Matloff provides more of a software programming perspective. They complement each other nicely. Zumel and Mount's book covers topics in R more relevant to data science. Lastly, Shron's book discusses how to think about data from different perspectives. It is not necessary to purchase all of these books.
Kabacoff, R.J. (2011). R in Action: Data Analysis and Graphics with R. Shelter Island, NY: Manning.
Matloff, N. (2011). The Art of R Programming. San Francisco, CA: No Starch Press.
Zumel, N., & Mount, J. (2014). Practical Data Science with R. Shelter Island, NY: Manning.
Shron, M. (2014). Thinking With Data: How to Turn Information into Insights. Sebastopol, CA: O'Reilly.
NOTE: Tentative. Subject to change
|Date|Topic|
|---|---|
|Aug-26|Intro to Data Science; Tools of the Trade|
|Sep-2|Loading and working with data in R|
|Sep-9|Data Visualizations using a grammar of graphics|
|Sep-16|More Grammar of Graphics|
|Sep-23|Intro to programming in R|
|Sep-30|Missing Data / Analysis of complex surveys|
|Oct-7|Classification and Regression Trees (CART) methods|
|Oct-14|Intro to Propensity Score Analysis|
|Oct-28|PSA with Non-Binary Treatments|
|Nov-11|Reproducible research with Markdown and LaTeX|
This course is graded as pass/fail. Successful students will attend and participate in the weekly classes. Additionally, contributing to the course wiki is expected.
The culmination of the course will be a short (20 to 30 minute) presentation and a document outlining the analysis you conducted with your dataset. Students are encouraged to bring their own dataset (e.g. data to be used for a dissertation), but that is not necessary. Many free and public datasets are available for use and will be discussed in the first couple of classes.
Whatever you produce for this course should be your own work and created specifically for this course. You cannot present work produced by others, nor offer any work that you presented or will present to another course. If you borrow text or media from another source or paraphrase substantial ideas from someone else, you must provide a reference to your source.
The University policy on academic dishonesty is clearly outlined in the Student Bulletin and includes, but is not limited to, plagiarism, cheating on examinations, multiple submissions, forgery, unauthorized collaboration, and falsification. These are serious infractions of University regulations and could result in a failing grade for the work in question, a failing grade in the course, or dismissal from the University. http://www.albany.edu/undergraduate_bulletin/regulations.html
Reasonable accommodations will be provided for students with documented physical, sensory, systemic, cognitive, learning and psychiatric disabilities. If you believe you have a disability requiring accommodation in this class, please notify the Director of Disabled Student Services (Campus Center 137, 442-5490). That office will provide the course instructor with verification of your disability, and will recommend appropriate accommodations. For more information, visit the website of the UAlbany Office for Disabled Student Services. http://www.albany.edu/studentlife/DSS/guidelines/accomodation.html