All materials for workshops - HackOn(Data) - Toronto


Workshops are based on the Databricks edX lectures on Apache Spark.

Goal: Learn how to apply data science and data engineering techniques using parallel programming in Apache Spark.

Participants are expected to review the material before each session and complete the weekly challenges. Participants will be awarded points each week based on their participation, lab completion, and other tasks (more below). Questions during the session are encouraged, but priority will be given to questions sent before the session and to those that benefit the majority of the group.

Points system:

Please view "Competition" tab on FAQ

How to communicate?

Please view "Communication" tab on FAQ

Late Submissions

Submission time                   Points subtracted
1 week < submission < 2 weeks     1
2 weeks < submission < 3 weeks    2
3 weeks < submission < 4 weeks    3
4 weeks < submission < 5 weeks    4
5 weeks < submission < 9 weeks    5


Each session is delivered through its own link. Click on the session title to see the link for that session. All sessions will be recorded and published after the live session.

Sessions run every Tuesday from 6:30pm to 8:30pm, from July 4, 2017 until the day of the hackathon.


Remote sessions: Use

Join the live session here URL

Jul 4 - In-person - Intro

  • Notebook usage
  • Intro to Spark and the PySpark API
  • Using RDDs
  • Lambda functions
  • RDD actions, transformations, caching
  • Debugging and lazy evaluation
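As a rough feel for lazy evaluation without a Spark installation, the sketch below uses plain Python iterators, which are only an analogy for RDD transformations and actions. The function name `double` and the sample numbers are made up for illustration.

```python
calls = []  # records every time double() actually runs

def double(x):
    calls.append(x)
    return 2 * x

numbers = range(5)

# map() and filter() return lazy iterators, loosely like RDD transformations:
# building the pipeline computes nothing.
doubled = map(double, numbers)
large = filter(lambda y: y > 4, doubled)
calls_before = len(calls)

# list() plays the role of an action: it forces the whole pipeline to run.
result = list(large)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The same idea explains why an error inside a transformation may only surface once an action is called, which is worth keeping in mind when debugging.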

Recording is Available Here

*Please subscribe to HackOn(Data) channel to get notified when we upload a new video!

Jul 11 - Virtual session - RDDs

  • Create an RDD and a pair RDD
  • Counting words
  • Finding unique words and mean value
  • Reference to regular expressions
  • Apply word count to a file
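The word count topics above can be sketched in plain Python, assuming a simple regular-expression tokenizer and made-up sample lines. In PySpark the same steps would typically be expressed on an RDD with `flatMap`, `map`, and `reduceByKey`; this version only illustrates the logic.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", line.lower())

def word_count(lines):
    """Count how often each word occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

# Hypothetical sample input; a real run would read lines from a file.
sample_lines = ["The quick brown fox", "the lazy dog", "The fox"]

counts = word_count(sample_lines)
unique_words = len(counts)                        # number of distinct words
mean_count = sum(counts.values()) / unique_words  # average occurrences per word

print(counts.most_common(2), unique_words, mean_count)
```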

Jul 18 - Virtual session - Data Exploration

  • Server log analysis statistics
  • Finding problematic endpoints, unique hosts
  • Visualizing data analysis results
  • Data exploration

Jul 25 - Virtual session - Text Analysis

  • Entity Resolution as text similarity
  • Weighted bag-of-words
  • Cosine similarity
  • Scalable Entity Resolution
  • Analysis
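A minimal sketch of weighted bag-of-words and cosine similarity follows, in plain Python rather than Spark. It assumes term frequency is count divided by document length and inverse document frequency is the plain ratio N / n(t); many references apply a logarithm to IDF instead. The tiny corpus is invented for illustration.

```python
import math
from collections import Counter

def tf(tokens):
    """Term frequency: each token's count divided by the document length."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}

def idf(corpus):
    """Inverse document frequency as N / n(t), with no logarithm (an assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    return {t: n_docs / df for t, df in doc_freq.items()}

def tfidf(tokens, idfs):
    """Weighted bag-of-words: TF times IDF for every token in the document."""
    return {t: w * idfs.get(t, 0.0) for t, w in tf(tokens).items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical product titles, already tokenized.
corpus = [["adobe", "photoshop"], ["adobe", "photoshop", "cs3"], ["garden", "hose"]]
idfs = idf(corpus)
vectors = [tfidf(doc, idfs) for doc in corpus]

sim_close = cosine_similarity(vectors[0], vectors[1])  # overlapping titles
sim_far = cosine_similarity(vectors[0], vectors[2])    # no shared tokens
print(sim_close, sim_far)
```

For entity resolution, pairs of records whose similarity exceeds a chosen threshold would be treated as candidate matches.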

Aug 1 - Virtual session - Review

  • Math review
  • NumPy and Spark
  • Lambda functions

Aug 8 - Virtual session - Read, parse, and visualize dataset

  • Baseline model
  • Train linear regression
  • Hyperparameter tuning
  • Feature interactions
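To make the baseline and linear regression topics concrete, here is a small plain-Python sketch. It assumes a baseline that always predicts the mean label, a one-feature least-squares fit, and root mean squared error (RMSE) as the metric. The data points are invented, and the session itself presumably trains models with Spark rather than by hand.

```python
import math

def rmse(predictions, labels):
    """Root mean squared error between predictions and true labels."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))

def fit_line(xs, ys):
    """Closed-form least squares for a single feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Baseline: predict the average label for every example.
baseline = sum(ys) / len(ys)
baseline_rmse = rmse([baseline] * len(ys), ys)

# Linear model: should beat the baseline on this data.
slope, intercept = fit_line(xs, ys)
model_rmse = rmse([slope * x + intercept for x in xs], ys)

print(baseline_rmse, model_rmse)
```

Comparing a model against the baseline RMSE gives a quick check that training added something beyond predicting the average.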

Aug 15 - Virtual session - Feature Hashing

  • One-Hot Encoding (OHE)
  • OHE Dictionary
  • Prediction and log loss evaluation
  • Feature reduction
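The one-hot encoding topics above can be sketched as follows, in plain Python. The sketch assumes an OHE dictionary that maps each distinct (feature position, value) pair to a column index, a sparse output listing only the active columns, and log loss for a single binary prediction. The sample rows and the clipping constant `eps` are illustrative choices.

```python
import math

def build_ohe_dict(rows):
    """Map each distinct (feature_position, value) pair to a column index."""
    distinct = sorted({(i, v) for row in rows for i, v in enumerate(row)})
    return {pair: idx for idx, pair in enumerate(distinct)}

def one_hot(row, ohe_dict):
    """Return the sorted indices of active columns; unseen values are dropped."""
    return sorted(ohe_dict[(i, v)] for i, v in enumerate(row) if (i, v) in ohe_dict)

def log_loss(p, label, eps=1e-11):
    """Log loss for one prediction p = P(label == 1), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -math.log(p) if label == 1 else -math.log(1 - p)

# Hypothetical categorical rows: (animal, color).
rows = [("mouse", "black"), ("cat", "tabby"), ("bear", "black")]
ohe_dict = build_ohe_dict(rows)

encoded_seen = one_hot(("mouse", "black"), ohe_dict)
encoded_unseen = one_hot(("cat", "white"), ohe_dict)  # "white" was never seen

print(len(ohe_dict), encoded_seen, encoded_unseen, log_loss(0.5, 1))
```

Because the dictionary grows with the number of distinct values, feature hashing trades exact column assignments for a fixed number of columns.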

Aug 22 - In-person - Principal Component Analysis

  • PCA on a sample dataset
  • PCA calculation and evaluation
  • Data preprocessing for PCA
  • Feature-based aggregation

Sep 9 - Hackathon day

Additional information: