Materials for GWU DNSC 6279 and 6290
DNSC 6279 ("Data Mining") provides exposure to various data preprocessing, statistics, and machine learning techniques that can be used both to discover relationships in large data sets and to build predictive models. Techniques covered will include basic and analytical data preprocessing, regression models, decision trees, neural networks, clustering, association analysis, and basic text mining. Techniques will be presented in the context of data-driven organizational decision making using statistical and machine learning approaches.
DNSC 6290 ("Machine Learning") is a follow-up course to DNSC 6279 that expands on both the theoretical and practical aspects of subjects covered in the prerequisite course while optionally introducing new material. Techniques covered may include feature engineering, penalized regression, neural networks and deep learning, ensemble models including stacked generalization and super learner approaches, matrix factorization, model validation, and model interpretation. Classes will be taught as workshops where groups of students will apply lecture materials to the ongoing Kaggle Advanced Regression and Digit Recognizer contests.
Some external reference material
- A Few Kaggle Grandmasters Pointers:
- Data visualization
- Data science quick references
- Data science interview questions
- Python introductory materials
Course Syllabi (Outdated/Unofficial)
DNSC 6279 ("Data Mining"): Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), MSBA Program Candidacy or instructor approval.
DNSC 6290 ("Machine Learning"): Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), Data Mining, MSBA Program Candidacy or instructor approval.
Mr. Patrick Hall
Location: Duques Hall, Room 255 Thursdays 6:10-8:40 PM
Office Hours: Funger Hall, Room 415 Thursdays 5:00 - 6:00 PM
Copyrights and Licenses
Some teaching materials are copyrighted by the instructor. Some copyrights are owned by other individuals and entities.
Most code examples are copyrighted by the instructor and provided with an MIT license, meaning they can be used for almost anything as long as the copyright and license notice are preserved. Some code examples are copyrighted by other entities and are usually provided with an Apache Version 2 license. These code examples can also be used for nearly any purpose, even commercially, as long as the copyright and license notice are preserved.
DNSC 6279 ("Data Mining")
Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
DNSC 6290 ("Machine Learning")
Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
A Primer on Scientific Programming with Python, by Hans Petter Langtangen
The student is responsible for studying and understanding all assigned materials. If reading generates questions that are not discussed in class, the student is responsible for asking the instructor privately or raising the issue in an appropriate digital medium.
Some materials for this class have personal or corporate copyrights or licenses that prevent them from being shared on GitHub. Those materials or other internal information will be shared with students via Blackboard.
DNSC 6279 ("Data Mining")
The course grade will be based on team homework assignments, a midterm and final exam, and a team project. Each grading component is described in detail below.
Homework Assignments: You will be given several homework assignments during the semester. Homework assignments will typically require the use of software. A typical homework assignment will consist of a few problems with several parts. Homework assignments may be completed in groups of 2-4 students. You may be given up to several weeks to complete the assignment. Late homework assignments may be rejected. In preparing your homework assignments, please follow these guidelines:
- Ensure any submitted computer program solutions are commented and runnable in a standard Python, R, or SAS environment.
- Ensure any written solutions are typed or easily readable by anyone.
- Ensure a clear logical flow and mark your answers.
- Print/type your name(s) in the top right-hand corner of every page or in a header of any papers submitted.
Midterm and Final Exam: A midterm exam will address content from the first half of the class and a final exam will address content from the second half of the class. The final exam will be scheduled during finals week. Graduate final exams are scheduled by the university late in the semester, and the final exam date will be made known at that time. No make-up midterm or final exams will be given. The exams are individual assignments. If you are taking the class remotely and cannot attend the exams in person, make arrangements with the instructor immediately.
Project: The project is designed to serve as an exercise in applying one or more of the data mining techniques covered in the course to analyze real-life data sets. A primary objective is to understand the complexities that arise in mining large, real-life data sets that are often inconsistent, incomplete, and unclean. Students can use a variety of software tools to perform the analysis, including standard Python, R, or SAS packages. This is a semester-long project, and it can be completed individually or in 2-4 person teams. The deliverables include a formal project proposal (due mid-semester) and a final report or presentation (due at the end of the semester). As the project for this class, students may select:
- A current Kaggle contest
- Their MSBA practicum project
- Group homework assignments: 25%
- Midterm exam: 30%
- Final exam: 30%
- Group semester project: 15%
| Numeric Grade | Letter Grade |
| --- | --- |
DNSC 6290 ("Machine Learning")
In-class Participation: As this will be a six-week, workshop-based course, student attendance and participation in class is expected.
Kaggle Performance: Lecture materials and hands-on workshop materials will be geared toward application to the Kaggle Advanced Regression and Digit Recognizer contests. Students are expected to participate in these contests as individuals or in groups and to do reasonably well.
Public GitHub Contributions: Students are expected to write code and generate other artifacts (e.g., notebooks, visualizations, markdown) and to store them in a publicly accessible GitHub repository (or another public location, e.g., a personal website).
- In-class participation: 1/3
- Kaggle performance: 1/3
- Public GitHub contributions: 1/3
If you are struggling with an assignment or class materials, require extra time for an assignment, or simply require additional assistance, see the instructor immediately.
Cheating and plagiarism will not be tolerated. Any case will automatically result in the loss of all points for the assignment, and may be grounds for a failing grade and/or dismissal. In the case of a group assignment, all group members will receive a zero grade.
Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will be reported to the Office of Academic Integrity. Students are expected to know and understand all college policies, especially the code of academic integrity.
Please contact the Disability Support Services to establish eligibility and to coordinate reasonable accommodation.
Regular attendance is expected, except for remote students. All students are held responsible for all of the work of the courses in which they are registered, and all absences must be excused by the instructor before provision is made to make up the work missed.
Class Policy Changes
The instructor reserves the right to revise any item on this syllabus, including, but not limited to any class policy, course outline or schedule, grading policy, tests, etc. Note that the requirements for deliverables may be clarified and expanded in class, via email, on GitHub, or on Blackboard. Students are expected to complete the deliverables incorporating such additions.
Anaconda Python: Python is an approachable, general-purpose programming language with excellent add-on libraries for math and data analysis. Anaconda is a Python distribution that bundles these add-on packages (and many other packages) together with convenient development utilities like the Spyder IDE.
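As a quick taste of what the bundled libraries make easy, here is a minimal sketch of summarizing a small data set with pandas and NumPy (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# build a tiny example data set
df = pd.DataFrame({
    'price': [12.5, 14.1, 13.3, 15.0],
    'units': [100, 80, 95, 60]
})

# summary statistics (count, mean, std, quartiles) in one call
print(df.describe())

# NumPy functions operate directly on DataFrame columns
df['log_price'] = np.log(df['price'])
print(df.head())
```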
H2O (from H2O.ai) is a package of high-performance functions and algorithms for preprocessing data and training statistical and machine learning models. It can be accessed without the need for coding through a standalone web browser client, or by installing additional coding interfaces for R and/or Python.
PySpark is a convenient, Python-based way to use the extremely powerful and scalable Spark platform. (Spark is becoming the new standard commercial data engineering tool.)
R is a tremendously popular language for data analysis, with thousands of user contributed packages for different types of data analysis tasks.
R Studio is the standard IDE for the R language.
SAS 9.4 and Enterprise Miner are commercial packages for preprocessing data and training statistical and machine learning models. Enterprise Miner allows for the construction of complex data mining workflows without writing code. Enterprise Miner is a proprietary commercial product and not freely available. You may access Enterprise Miner through the SAS on Demand for Academics portal or by contacting the GWU Instructional Technology Lab.
SAS 9.4 University Edition is a free edition of SAS' proprietary commercial data analysis software. SAS University Edition contains the newest version of several SAS software packages along with learning tools and utilities for new users. It also requires a virtual machine player which you may need to install separately.
TensorFlow + Keras are two of several popular deep learning toolkits and libraries; this particular combination will work on Windows. TensorFlow is a lower-level library for performing mathematical operations. It is GPU-enabled. (GPU support is optional but helpful for this class.) Keras is a higher-level library that makes TensorFlow easier to use for building and training common deep learning architectures. Both are available as Python packages.
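Under the hood, the basic building block these libraries provide, a fully connected ("dense") layer, is just a matrix multiply plus a bias followed by a nonlinearity. A rough sketch in plain NumPy (the shapes and names below are illustrative, not the Keras API; the 784 inputs correspond to a flattened 28x28 Digit Recognizer image):

```python
import numpy as np

def dense_layer(x, W, b):
    """One fully connected layer: affine transform followed by ReLU."""
    return np.maximum(0.0, x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))          # batch of 32 flattened 28x28 images
W1 = rng.normal(size=(784, 128)) * 0.01  # small random initial weights
b1 = np.zeros(128)
W2 = rng.normal(size=(128, 10)) * 0.01
b2 = np.zeros(10)

hidden = dense_layer(x, W1, b1)          # shape (32, 128)
logits = hidden @ W2 + b2                # shape (32, 10): one score per digit
print(logits.shape)
```

Keras wraps exactly this kind of computation, plus weight initialization and gradient-based training, behind a much simpler interface.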
XGBoost is an optimized and highly accurate library for gradient boosted regression and classification. There are Python and R packages available for XGBoost. (I have found XGBoost easiest to install as an R package, but if you get stuck with Python and Windows, you can try following the directions in this blog post.)
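The core idea behind gradient boosting, where each new weak learner fits the residual errors of the current ensemble, can be sketched in a few lines of NumPy using depth-one "stump" learners. This is a toy illustration of the technique, not the XGBoost implementation:

```python
import numpy as np

def fit_stump(x, r):
    """Find the single split on x that best fits residuals r by least squares."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # noisy nonlinear target

pred = np.full_like(y, y.mean())   # start from the constant mean prediction
lr = 0.1                           # shrinkage / learning rate
for _ in range(100):
    stump = fit_stump(x, y - pred) # fit a stump to the current residuals
    pred += lr * stump(x)          # add the shrunken correction

print(np.mean((y - pred)**2))      # training MSE, far below the variance of y
```

XGBoost replaces the stumps with regularized trees and adds second-order gradient information, sparsity handling, and heavy performance optimization.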
Using Git for this Material
You are welcome to use git and/or GitHub to save and manage your own copies of class materials.
The easiest way to do so is to download this entire repository as a zip file. However, you will need to download a new copy of the repository whenever changes are made to it. To download the course repository, navigate to the course GitHub repository (i.e. this page), click the 'Clone or Download' button, and then select 'Download Zip'.
If you would like to take advantage of the version control capabilities of git, then follow these steps.
Install required software
Fork and pull materials
Navigate to the course GitHub repository (i.e. this page) and click the 'Fork' button.
Enter the following statements on the Git Bash command line:
$ cd <parent directory>
$ mkdir GWU_data_mining
$ cd GWU_data_mining
$ git init
$ git remote add origin https://github.com/<your username>/GWU_data_mining.git
$ git remote add upstream https://github.com/jphall663/GWU_data_mining.git
$ git pull origin master
$ git lfs install
$ git lfs track '*.jpg' '*.png' '*.csv' '*.sas7bdat'
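Because the commands above add the instructor's repository as the `upstream` remote, you can later fetch the latest course changes into your fork at any time (run from inside the GWU_data_mining directory):

```shell
$ git pull upstream master
$ git push origin master
```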
A Dockerfile is provided to create an Anaconda Python 3.5 environment with H2O, XGBoost, and GraphViz.
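Build the image first from the directory containing the Dockerfile (the `-t` tag name here is arbitrary; you can also use the resulting image id in the run command below):

```shell
docker build -t gwu_data_mining .
```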
Start the image with:
docker run -i -t -p 8888:8888 <image_id> /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && /opt/conda/bin/jupyter notebook --notebook-dir=/GWU_data_mining --ip='*' --port=8888 --no-browser"