GPU Open Analytics Initiative
Accelerating the Scalable Data Science Environment with GPU-enabled Python
KDD'18 Hands-On Tutorial
Tuesday 8:30 am
Software / Hardware Requirements
The tutorial will use cloud resources that provide a common environment for all students.
Requirements:
- Laptop with WiFi
- We will be using the conference WiFi; please ensure that you can connect prior to the tutorial
- Web browser: the latest version of any browser will work; Firefox or Chrome is preferred.
Tutorial Agenda
Introductions
- Who we are
Getting Connected
- Connect to Qwiklabs
- Introduction notebook to validate
Introduction and Background
- Big Data Ecosystem
- Challenges in Big Data today
- Apache Arrow
- GPUs for compute
- The GPU Open Analytics Initiative
- The GPU Data Frame (GDF)
- Python library for GDF (PyGDF)
Hands-on: Data Loading and Manipulation
- Lab 1: Data Loading and Manipulation
- Traditional interface through Pandas
- Pandas to/from PyGDF
- Column Function and Basic Transforms
- Filtering
- Student Assignment
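As a rough preview of the column transform and filtering steps listed for Lab 1, here is a minimal sketch using plain Pandas on the CPU. The records and column names (`host`, `bytes_sent`, `duration_s`) are made up for illustration and are not the tutorial's dataset; the PyGDF conversion step is not shown because it needs a GPU environment.

```python
import pandas as pd

# Hypothetical connection-style records; names and values are illustrative only.
df = pd.DataFrame({
    "host": ["a", "b", "c", "d"],
    "bytes_sent": [1200, 50, 98000, 430],
    "duration_s": [2.0, 0.5, 10.0, 1.0],
})

# Basic column transform: derive a throughput column from two existing columns.
df["bytes_per_s"] = df["bytes_sent"] / df["duration_s"]

# Filtering: keep only rows whose throughput exceeds a threshold (boolean mask).
busy = df[df["bytes_per_s"] > 500]

print(busy[["host", "bytes_per_s"]])
```

The same derive-then-filter pattern is what the lab applies to GPU data frames; only the container type changes.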
Break
Hands-on: Data Science and Machine Learning
- Lab 3: Classification using XGBoost
- Familiarize with IoT cyber network data
- Data ingest and feature extraction
- Time binning and preparation for classification
- Building XGBoost model
- Evaluating the model via ROC curves and AUC
- Student Assignment:
- Investigation into other time binnings, aggregations, and XGBoost parameters
- Using additional features (quantitative and categorical) in the data to build better models
- Moving beyond connection logs to other log types (e.g., DNS) and building models
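To make the time binning and AUC evaluation steps of the classification lab concrete, the following is a small standard-library sketch. The helper names `bin_counts` and `roc_auc` are hypothetical, and the inputs are toy values; the lab itself uses XGBoost on real log data, which is not reproduced here.

```python
from collections import Counter

def bin_counts(timestamps, bin_seconds):
    """Count events per fixed-width time bin (bin index = floor(t / width))."""
    return Counter(int(t // bin_seconds) for t in timestamps)

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy event times in seconds, grouped into 60-second bins.
counts = bin_counts([0, 30, 61, 119, 120], 60)

# Toy labels and classifier scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(dict(counts), auc)
```

The pairwise AUC above is quadratic in the number of samples, which is fine for illustration; library implementations use a sort-based computation instead.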
Break
Wrap-up and Conclusion
- Roadmap
- Scaling out to multi-GPU and multi-node
- Partner Activities
- Conclusion