Lectures, assignments and all other material for the course

Modern Data Structures

QMSS G5072 - Columbia University

Fall 2018
Lecture: Mondays 6.10 - 8pm (but see weekly schedule)
Location: 825 Seeley W. Mudd Building

Instructor: Thomas Brambor
thomas.brambor.com
tb2729@columbia.edu
IAB 509E Mon 4.50 - 5.50pm

TA1: Crystal Ni
xn2115@tc.columbia.edu
IAB 270J Thur 10am - 12pm

TA2: Mikaela Zhang
xz2782@columbia.edu
IAB 270J Tue 10am - 12pm


Course Description

This course is intended to provide a detailed tour of how to access, clean, “munge” and organize data, both big and small. (It should also give students a flavor of what would be expected of them in a typical data science interview.) Each week will have simple, moderate and complex examples in class, with code to follow. Students will then practice additional exercises at home. The end point of each project is to get the data organized and cleaned enough that it sits in a data frame, ready for subsequent analysis and graphing. Therefore, no analysis or visualization (beyond basic tables and plots to make sure everything was correctly organized) will be taught; this frees up substantial time for the “nitty-gritty” of data wrangling.

Course Website

All lecture materials, exercises, and (links to) readings will be made available in the GitHub course repository.

This is a fairly new course. The materials and topics indicated below are a provisional roadmap that will be adjusted to the needs of the students. I will let you know well ahead of time of any changes.

Communications

For all questions to the members of the teaching team, we will be using a discussion forum on Piazza. Please sign up here. The forum will be used to exchange questions about lectures, assignments, software etc. Students are encouraged to help each other!

Students are asked to customize their Piazza notification preferences to receive immediate (ASAP) notifications of messages and announcements through the third-party provider of choice (e.g. email, SMS/text). Students are also asked to log into the course regularly (more than twice a week) and check announcements and the Piazza inbox immediately upon logging in to stay on top of developments in the course as they occur.

Please send all emails and messages to the instructor and teaching assistants through Piazza. Messages sent through the Piazza Inbox (Send a Message) feature will be answered within 24 hours during the week and within 48 hours on weekends. Please consider these response times when asking about assignments etc.


References and Resources

Books

There are no required books for the course. All required readings will be provided as PDFs or links. However, here are some books that you may find useful in addition to the lectures and course readings.

  • Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). O’Reilly Media. -- A great introduction to using R. Written by the creator of many R packages that we use in the course, it will help with the usual tasks of data import and management, modeling, and some visualization. Book is available for free online.

  • Wickham, H. (2014). Advanced R (1st ed.). Boca Raton, FL: Chapman and Hall/CRC. Book is available for free online.

  • Boehmke, B. C. (2016). Data Wrangling with R (1st ed.). New York, NY: Springer. Book is available as an electronic resource in the library.

Free Online Resources

Datacamp

I have obtained a license from Datacamp, a provider of online education. All students will be enrolled within the first week of the course. The syllabus indicates suggested Datacamp modules to complement the lecture material and graded assignments. Of course, feel free to try out any other Datacamp offerings you are interested in.

R, RStudio, and R Markdown

  • IDRE at UCLA has lots of tutorials and code examples for R and other statistical packages.
  • Try R. In-browser, interactive online tutorial. Particularly useful if you have not used R (much) before.
  • Cheat sheets for data wrangling, data visualization, general use of R, RStudio, R Markdown etc.
  • RStudio resources for R Markdown. Get started here with markdown.
  • Awesome-R. A curated list of great R packages and tools.

Git and GitHub

Coding Help Sites

  • http://stackoverflow.com/ Programming Q&A site. Excellent first stop if you have questions on coding. Search for keywords, and restrict your queries by adding tags for the coding language or package in square brackets, e.g. [R], [ggplot], or [shiny].

  • http://stats.stackexchange.com/ A stackoverflow off-shoot with a bit more focus on conceptual questions in statistics.

  • http://rseek.org/ Search engine for R-related stuff, including tutorials and code.

Requirements and Assessments

Requirements

This course will guide you through the data wrangling process using the software package R for most exercises. The program R itself can be downloaded for free at http://cran.r-project.org/.

Some familiarity with the software, in particular with the base functions in R, is assumed. Knowledge of specific packages and other software tools will be built throughout the course. If you have extensive experience with other similar programming tools, say Python or Matlab, you will be fine. However, if you are completely new to R and do not have compensatory experience in other coding languages, please consider the QMSS course "Data Mining" instead.

You will need access to your own computer to install software and packages, do your assignments etc. I highly recommend bringing your laptop to class to follow along with the coding tutorials and examples.

Assessments

Homework

Homework problems will be assigned on a weekly basis, and students are expected to work on them alone.

Exams

There is no in-class final exam. Instead, the focus will be on developing a final project in the form of an R package.

Grade Distribution

The distribution of the parts for your grade is as follows:

  • Final Project = 30%
  • Homework Assignments = 60%
  • Attendance and Participation = 10%

Policies

Attendance and Class Participation

Your attendance and participation are necessary at every meeting. This class will work best when students ask a lot of questions.

Academic Integrity

This course is based on the principles of academic integrity established by Columbia University and agreed to by each student. The same rules hold in this course. Academic dishonesty will not be tolerated. All submitted work must be your own work and properly cited.

The full guidelines on academic integrity as well as a review of how or what to cite, can be found here: http://gsas.columbia.edu/academic-integrity

Students found guilty of plagiarism or academic dishonesty will be subject to appropriate disciplinary action, which may include reduction of grade, a failure in the course, suspension or expulsion. This includes lab reports – if they are copied from another student, severe penalties may be applied. Note that plagiarism is also possible when writing code, so be careful to write your own code.

Late Assignment Policy

Students will lose points for handing in late assignments, at the discretion of the instructor and teaching assistant.

Other

Turn off or silence your cell phones prior to the beginning of class. I reserve the right to answer all calls (yours, not mine) received during class time and let your friends know what you are learning that day.

Feel free to use laptops in class - in fact, I encourage it. Out of respect for your classmates and me, please refrain from using Facebook, shopping sites or other random distractions during class.

Changes

There may be adjustments to readings, assignments, exams, and classrooms. Changes will be posted on Piazza/GitHub along with announcements.

Slides

Lecture slides will be made available on the course website. However, I believe that learning and understanding are better served when you aggregate and structure your notes yourself, so I suggest you take your own notes as well.


Lecture Topics

  • On your own: Install R and RStudio on your own computer. Try out R Markdown (use the tutorial to get familiar).
  • Datacamp: To review base R, complete the course Introduction to R.

Part 1 - Data Manipulation

  • On your own:

    • Sign up for a GitHub account.
    • Install GitHub Desktop (if you are confident in using command-line Git or have a different software preference, feel free to skip this step.)
    • Claim your private repository connected with this class.
  • Reading:

  • Datacamp:

    • Git: Feel free to check out this fairly comprehensive introduction to Git. Several things are beyond what is required in the course (undo, branches, collaboration).
    • R Markdown: For a more comprehensive introduction to R Markdown the course Reporting with R Markdown is worth a look.
  • Advanced Topics (optional, on your own only):

    • Combining Shiny & RMarkdown (Overview here):
    • RMarkdown can be used directly from the command line or from within R. You can render .R scripts into reports.
    • Report Automation: The creation of reports (as well as uploading/emailing them) can be automated completely.
    • Git:
      • The in-class introduction to Git was centered around GitHub. To learn a bit more, get comfortable with command-line Git usage.
      • Also, make sure you understand how branches work and how to work with a group of people.
      • Submit something to a public repository on GitHub using a pull request.

Homework 1: Using RMarkdown and GitHub. Also see the homework submission instructions.

Homework 2: Data Wrangling with the Tidyverse. Also see the homework submission instructions.

  • Reading:

  • Datacamp:

  • Advanced Topics (optional, on your own only):

    • We discussed how scoping depends on R environments. Learn more about how these environments are named, how to create new environments, how to look up their contents, and how to define the search path of a scoping operation. See Wickham, Advanced R, chapter on "Environments".
    • We only discussed for loops in lecture. Check out two other types of loops - while and repeat loops - and how they can be useful for programming. Datacamp has a tutorial on "A Tutorial on Loops in R - Usage and Alternatives".
    • A great way to hide additional arguments for advanced users of a function is the ... (read dot-dot-dot) argument. Try to get familiar with it and hide some options of a function you created.
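
The loop types and the ... argument mentioned above can be tried out with a few lines of base R. This is only a minimal sketch and not part of the lecture material: the function name center() and the variables n, steps, and i are made up for illustration.

```r
# while loop: keep halving n for as long as the condition holds.
n <- 10
steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}

# repeat loop: runs until an explicit break is reached.
i <- 0
repeat {
  i <- i + 1
  if (i^2 > 50) break
}

# Hypothetical helper: center a numeric vector around its mean.
# Anything passed via `...` is forwarded unchanged to mean(), so
# advanced users can set options such as na.rm or trim without
# those options cluttering the function signature.
center <- function(x, ...) {
  x - mean(x, ...)
}

center(c(1, 2, 3))                 # default behaviour
center(c(1, 2, NA), na.rm = TRUE)  # extra argument reaches mean()
```

A while loop checks its condition before every pass, whereas repeat has no condition of its own and relies entirely on break, which is handy when the stopping rule is only known in the middle of the loop body.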

Homework 3: for loops and functions. Also see the homework submission instructions.

Homework 4: Functions II. Also see the homework submission instructions.

Homework 5: Writing an R Package. Also see the homework submission instructions.

Homework 6: Working with Strings. Also see the homework submission instructions.

Part 2 - Getting Data In

  • On your own: Install the httr package.

  • Reading:

  • Datacamp:

  • Advanced Topics (optional, on your own only):

    • Writing an R API client: We have already learned how to write an R package. So how about writing an R API package if none is available yet? CRAN hosts "Best practices for API packages" by Hadley Wickham.
    • Creating an API: We can go even further. The plumber package allows you to turn your existing R code into a web API.

Homework 7: Calling an API using httr. Also see the homework submission instructions.

(Nov 5) University holiday. NO CLASS is held.

Try to catch up with any material. Ask questions on Piazza to clarify any issues.

Homework 8: Writing a simple API client. Also see the homework submission instructions.

Homework 9: Web Scraping from Wikipedia. Also see the homework submission instructions.

Homework 10: Practicing SQL Queries. Also see the homework submission instructions.

Part 3 - Other “Big Data” Considerations

Final Project Proposal: Final Project Proposal due on Dec 1.

Final Project due on Dec 17: Final Project Description.
