ScaDaMaLe Course
[site](https://lamastex.github.io/scalable-data-science/sds/3/x/) and
[book](https://lamastex.github.io/ScaDaMaLe/index.html)

Introduction
============

-   **Course Name:** *Scalable Data Science and Distributed Machine
    Learning*
-   **Course Acronym:** *ScaDaMaLe* or *sds-3.x*.

The course is the fifth and final mandatory course in the [AI-Track of
the WASP Graduate
School](https://wasp-sweden.org/graduate-school/ai-graduate-school-courses/).
It is given in three modules. In addition to academic lectures there are
invited guest speakers from industry.

This site provides course contents for modules 1 and 3 with some
background materials for module 2. This content is referred to as
**sds-3.x** here.

**Module 1** – Introduction to Data Science: Introduction to
fault-tolerant distributed file systems and computing.

The whole data science process illustrated with industrial case-studies.
Practical introduction to scalable data processing to ingest, extract,
load, transform, and explore (un)structured datasets. Scalable machine
learning pipelines to model, train/fit, validate, select, tune, test and
predict or estimate in an unsupervised and a supervised setting using
nonparametric and partitioning methods such as random forests.
Introduction to distributed vertex-programming.

**Module 2** – Distributed Deep Learning: Introduction to the theory and
implementation of distributed deep learning.

Classification and regression using generalised linear models, including
different learning, regularization, and hyperparameter tuning
techniques. The feedforward deep network as a fundamental network, and
the advanced techniques to overcome its main challenges, such as
overfitting, vanishing/exploding gradients, and training speed. Various
deep neural networks for various kinds of data: for example, CNNs for
scaling up neural networks to process large images, RNNs for scaling up
deep neural models to long temporal sequences, and autoencoders and GANs.

**Module 3** – Decision-making with Scalable Algorithms

Theoretical foundations of distributed systems and analysis of their
scalable algorithms for sorting, joining, streaming, sketching,
optimising and computing in numerical linear algebra with applications
in scalable machine learning pipelines for typical decision problems
(e.g. prediction, A/B testing, anomaly detection) with various types of
data (e.g. time-indexed, space-time-indexed and network-indexed).
Privacy-aware decisions with sanitized (cleaned, imputed, anonymised)
datasets and datastreams. Practical applications of these algorithms on
real-world examples (e.g. mobility, social media, machine sensors and
logs). Illustration via industrial use-cases.

Expected Reference Readings
---------------------------

Note that you need to be logged into your library with access to these
publishers:

-   <https://learning.oreilly.com/library/view/high-performance-spark/9781491943199/>
-   <https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/>
-   <https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/>
-   Introduction to Algorithms, Third Edition, Thomas H. Cormen,
    Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein from
    -   <https://ebookcentral.proquest.com/lib/uu/reader.action?docID=3339142>
-   [Reading Materials
    Provided](https://github.com/lamastex/scalable-data-science/tree/master/read)

Course Contents
---------------

The Databricks notebooks will be made available as the course progresses
on the:

-   course site at:
    <https://lamastex.github.io/scalable-data-science/sds/3/x/>
-   and course book at:
    <https://lamastex.github.io/ScaDaMaLe/index.html>

-   You may upload Course Content into Databricks Community Edition
    from:
    -   [2020 dbc
        ARCHIVES](https://github.com/lamastex/scalable-data-science/tree/master/dbcArchives/2020)
    -   [Extra
        Resources](https://github.com/lamastex/scalable-data-science/blob/master/dbcArchives/2017/parts/xtraResources.dbc)

Course Assessment
-----------------

There will be minimal reading and coding exercises that will not be
graded. The main assessment will be based on a peer-reviewed group
project. The group project will include notebooks/codes along with a
video of the project presentation. Each group can have at most four
members. The project should be seen as an opportunity to do something
you are passionate about or interested in, as opposed to completing an
auto-gradeable programming assessment in the shortest amount of time.

Detailed instructions will be given in the sequel, especially during the
12-16 Open Office Hours held once the lab/lectures finish, that is,
after 6 full days of interactions.

Course Sponsors
---------------

The course builds on contents developed since 2016 with support from New
Zealand's Data Industry. The 2017-2019 versions were academically
sponsored by Uppsala University's Inter-Faculty Course grant, Department
of Mathematics and The Centre for Interdisciplinary Mathematics and
industrially sponsored by [databricks](https://databricks.com),
[AWS](https://aws.amazon.com/) and Swedish data industry via [Combient
AB](https://combient.com), [SEB](https://seb.se/) and [Combient Mix
AB](https://combient.com/mix). This 2021 version is academically
sponsored by [AI-Track of the WASP Graduate
School](https://wasp-sweden.org/graduate-school/ai-graduate-school-courses/)
and [Centre for Interdisciplinary
Mathematics](https://www.math.uu.se/research/cim/) and industrially
sponsored by [databricks](https://databricks.com) and
[AWS](https://aws.amazon.com/) via *databricks University Alliance* and
[Combient Mix AB](https://combient.com/mix) via industrial mentorships.

Course Instructor
-----------------

I, Raazesh Sainudiin or **Raaz**, will be an instructor for the course.

I have

-   more than 15 years of academic research experience in applied
    mathematics and statistics and
-   over 3 years of full-time and 5 years of part-time experience in
    the data industry.

I currently (2020) have an effective joint appointment as:

-   [Associate Professor of Mathematics with specialisation in Data
    Science](http://katalog.uu.se/profile/?id=N17-214) at [Department of
    Mathematics](http://www.math.uu.se/), [Uppsala
    University](http://www.uu.se/), Uppsala, Sweden and
-   Director, Technical Strategy and Research at [Combient Mix
    AB](https://combient.com/mix), Stockholm, Sweden

Quick links on Raaz's background:

-   <https://www.linkedin.com/in/raazesh-sainudiin-45955845/>
-   [Raaz's academic CV](https://lamastex.github.io/cv/)
-   [Raaz's publications list](https://lamastex.github.io/publications/)

Industrial Case Study
---------------------

We will see an industrial case-study that will illustrate a concrete
**data science process** in action in the sequel.

What is the [Data Science Process](https://en.wikipedia.org/wiki/Data_science)
==============================================================================

**The Data Science Process in one picture**

![what is
sds?](https://github.com/lamastex/scalable-data-science/raw/master/assets/images/sds.png "sds")

------------------------------------------------------------------------

What is scalable data science and distributed machine learning?
---------------------------------------------------------------

Scalability merely refers to the ability of the data science process to
scale to massive datasets (popularly known as *big data*).

For this we need *distributed fault-tolerant computing* typically over
large clusters of commodity computers -- the core infrastructure in a
public cloud today.

*Distributed Machine Learning* allows the models in the data science
process to be trained scalably and to extract value from big data.
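
As a minimal sketch of the fault-tolerance idea, the following plain Python (run on a single machine, not on a real cluster and not using any Spark API) splits a computation into partitions and re-runs only the partitions whose simulated worker failed. The function names, the `failed_workers` set and the spare worker id `-1` are all hypothetical and chosen purely for illustration.

```python
def partition(data, num_partitions):
    """Split data into contiguous chunks, one per simulated worker."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(chunk, worker_id, failed_workers):
    """Compute the sum of squares of one chunk; raise if the worker is down."""
    if worker_id in failed_workers:
        raise RuntimeError(f"worker {worker_id} failed")
    return sum(x * x for x in chunk)


def fault_tolerant_sum_of_squares(data, num_partitions, failed_workers):
    """Combine per-partition results, retrying failed partitions elsewhere."""
    partial_results = []
    for worker_id, chunk in enumerate(partition(data, num_partitions)):
        try:
            partial_results.append(run_task(chunk, worker_id, failed_workers))
        except RuntimeError:
            # Only the lost partition is recomputed, on a hypothetical
            # healthy spare worker with id -1.
            partial_results.append(run_task(chunk, -1, failed_workers))
    return sum(partial_results)


data = list(range(1, 11))
total = fault_tolerant_sum_of_squares(data, num_partitions=4, failed_workers={1, 3})
print(total)
```

The point of the sketch is that a failure costs only the work of the lost partition, which is what makes large clusters of unreliable commodity machines usable.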

What is Data Science?
---------------------

It is increasingly accepted that [Data
Science](https://en.wikipedia.org/wiki/Data_science)

> is an inter-disciplinary field that uses scientific methods,
> processes, algorithms and systems to extract knowledge and insights
> from many structural and unstructured data. Data science is related to
> data mining, machine learning and big data.

> Data science is a "concept to unify statistics, data analysis and
> their related methods" in order to "understand and analyze actual
> phenomena" with data. It uses techniques and theories drawn from many
> fields within the context of mathematics, statistics, computer
> science, domain knowledge and information science. Turing award winner
> Jim Gray imagined data science as a "fourth paradigm" of science
> (empirical, theoretical, computational and now data-driven) and
> asserted that "everything about science is changing because of the
> impact of information technology" and the data deluge.

Now, let us look at two industrially-informed academic papers that
influence the above quote on what is Data Science, but with a view
towards the contents and syllabus of this course.

Source: [Vasant Dhar, Data Science and Prediction, Communications of the
ACM, Vol. 56 (1). p. 64,
DOI:10.1145/2500499](http://dl.acm.org/citation.cfm?id=2500499)

**key insights in the above paper**

-   Data Science is the study of *the generalizable extraction of
    knowledge from data*.
-   A common epistemic requirement in assessing whether new knowledge is
    actionable for decision making is its predictive power, not just its
    ability to explain the past.
-   A *data scientist requires an integrated skill set spanning*
    -   mathematics,
    -   machine learning,
    -   artificial intelligence,
    -   statistics,
    -   databases, and
    -   optimization,
    -   along with a deep understanding of the craft of problem
        formulation to engineer effective solutions.
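
The distinction between predictive power and the ability to explain the past can be illustrated with a small sketch. Under the assumption of a toy dataset generated as `y = 2x` plus alternating noise (invented here for illustration, not taken from the paper), a model that memorises the training data explains the past perfectly but predicts new points worse than a simple least-squares line. All names below are hypothetical.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_memoriser(xs, ys):
    """1-nearest-neighbour lookup: reproduces every training point exactly."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


# Toy data: y = 2x with alternating noise of +1 / -1 on the training set.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 1, 5, 5, 9]
# Unseen points with noise-free targets y = 2x.
test_x = [0.4, 1.4, 2.4, 3.4]
test_y = [0.8, 2.8, 4.8, 6.8]

memoriser = fit_memoriser(train_x, train_y)
line = fit_line(train_x, train_y)

for name, model in [("memoriser", memoriser), ("line", line)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The memoriser has zero error on the data it has already seen, yet the line, which fits the past less well, has the smaller error on the unseen points.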

Source: [Machine learning: Trends, perspectives, and prospects, M. I.
Jordan, T. M. Mitchell, Science 17 Jul 2015: Vol. 349, Issue 6245, pp.
255-260, DOI:
10.1126/science.aaa8415](http://science.sciencemag.org/content/349/6245/255.full-text.pdf+html)

**key insights in the above paper**

-   ML is concerned with the building of computers that improve
    automatically through experience
-   ML lies at the intersection of computer science and statistics and
    at the core of artificial intelligence and data science
-   Recent progress in ML is due to:
    -   development of new algorithms and theory
    -   ongoing explosion in the availability of online data
    -   availability of low-cost computation (through clusters of
        commodity hardware in the *cloud*)
-   The adoption of data science and ML methods is leading to more
    evidence-based decision-making across:
    -   health sciences (neuroscience research)
    -   manufacturing
    -   robotics (autonomous vehicles)
    -   vision, speech processing, natural language processing
    -   education
    -   financial modeling
    -   policing
    -   marketing

  

But what is Data Engineering (including Machine Learning Engineering and
Operations) and how does it relate to Data Science?

Data Engineering
================

There are several views on what a data engineer is supposed to do:

Some views are rather narrow and emphasise division of labour between
data engineers and data scientists:

-   <https://www.oreilly.com/ideas/data-engineering-a-quick-and-simple-definition>
    -   Let's check out what skills a data engineer is expected to have
        according to the link above.

> "Ian Buss, principal solutions architect at Cloudera, notes that data
> scientists focus on finding new insights from a data set, while data
> engineers are concerned with the production readiness of that data and
> all that comes with it: formats, scaling, resilience, security, and
> more."

> What skills do data engineers need? Those “10-30 different big data
> technologies” Anderson references in “Data engineers vs. data
> scientists” can fall under numerous areas, such as file formats,
> ingestion engines, stream processing, batch processing, batch SQL,
> data storage, cluster management, transaction databases, web
> frameworks, data visualizations, and machine learning. And that’s just
> the tip of the iceberg.

> Buss says data engineers should have the following skills and
> knowledge:

> -   They need to know Linux and they should be comfortable using the
>     command line.
> -   They should have experience programming in at least Python or
>     Scala/Java.
> -   They need to know SQL.
> -   They need some understanding of distributed systems in general and
>     how they are different from traditional storage and processing
>     systems.
> -   They need a deep understanding of the ecosystem, including
>     ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g.
>     Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu).
>     They should know the strengths and weaknesses of each tool and
>     what it's best used for.
> -   They need to know how to access and process data.

Let's dive deeper into such highly compartmentalised views of data
engineers and data scientists and the so-called "machine learning
engineers" according to the following view:

-   <https://www.oreilly.com/ideas/data-engineers-vs-data-scientists>

  

The Data Engineering Scientist as "The Middle Way"
--------------------------------------------------

Here are some basic axioms that should be self-evident.

-   Yes, there are differences in skillsets across humans
    -   some humans will be better and have inclinations for engineering
        and others for pure mathematics by nature and nurture
    -   one human cannot easily be a master of everything needed for
        innovating a new data-based product or service (very very rarely
        though this happens)
-   Skills can be gained by any human who wants to learn to the extent
    s/he is able to expend time, energy, etc.

For the **Scalable Data Engineering Science Process:** *towards
Production-Ready and Productisable Prototyping for the Data-based
Factory* we need to allow each data engineer to be more of a data
scientist and each data scientist to be more of a data engineer, up to
each individual's *comfort zones* in technical and
mathematical/conceptual and time-availability planes, but with some
**minimal expectations** of mutual appreciation.

This course is designed to help you take the first minimal steps towards
such a **data engineering science**.

In the sequel it will become apparent **why a team of data engineering
scientists** with skills across the conventional (2021) spectrum of data
engineer versus data scientist **is crucial** for **Production-Ready and
Productisable Prototyping for the Data-based Factory**, whose outputs
include standard AI products today.

Standing on the shoulders of giants!
------------------------------------

Where needed, this course will build on content from two edX courses
from 2015, which is owned by their respective instructors.

-   [BerkeleyX/CS100-1x, Introduction to Big Data Using Apache Spark by
    Anthony D. Joseph, Chancellor's Professor, UC
    Berkeley](https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x)
-   [BerkeleyX/CS190-1x, Scalable Machine Learning by Ameet Talwalkar,
    Asst. Prof., UC Los
    Angeles](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x)

This course will be an *expanded and up-to-date Scala version* with an
emphasis on an *individualized course project* as opposed to completing
auto-gradeable labs that test syntactic skills.

We will also be borrowing more theoretical aspects from the following
course:

-   [Stanford/CME323, Distributed Algorithms and Optimization by Reza
    Zadeh, Asst. Prof., Institute for Computational and Mathematical
    Engineering, Stanford Univ.](http://stanford.edu/~rezab/dao/)

Note the **Expected Reference Readings** above for this course.

A Brief Tour of Data Science
============================

History of Data Analysis and Where Does "Big Data" Come From?
-------------------------------------------------------------

The following content was created by Anthony Joseph and used in
BerkeleyX/CS100.1x from 2015.

-   **(watch now 1:53):** A Brief History of Data Analysis
    -   [![A Brief History of Data Analysis by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/5fSSvYlDkag/0.jpg)](https://www.youtube.com/watch?v=5fSSvYlDkag)
-   **(watch now 5:05)**: Where does Data Come From?
    -   [![Where Does Data Come From by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/eEJFlHE7Gt4/0.jpg)](https://www.youtube.com/watch?v=eEJFlHE7Gt4?rel=0&autoplay=1&modestbranding=1)
    -   SUMMARY of Some of the sources of big data.
        -   online click-streams (a lot of it is recorded but a tiny
            amount is analyzed):
            -   record every click
            -   every ad you view
            -   every billing event,
            -   every transaction, every network message, and every
                fault.
        -   User-generated content (on web and mobile devices):
            -   every post that you make on Facebook
            -   every picture sent on Instagram
            -   every review you write for Yelp or TripAdvisor
            -   every tweet you send on Twitter
            -   every video that you post to YouTube.
        -   Science (for scientific computing):
            -   data from various repositories for natural language
                processing:
                -   Wikipedia,
                -   the Library of Congress,
                -   the Twitter firehose, Google Ngrams and digital
                    archives,
            -   data from scientific instruments/sensors/computers:
                -   the Large Hadron Collider (more data in a year than
                    all the other data sources combined!)
                -   genome sequencing data (sequencing cost is dropping
                    much faster than Moore's Law!)
                -   output of high-performance computers
                    (super-computers) for data fusion,
                    estimation/prediction and exploratory data analysis
        -   Graphs are also an interesting source of big data (*network
            science*).
            -   social networks (collaborations, followers, fb-friends
                or other relationships),
            -   telecommunication networks,
            -   computer networks,
            -   road networks
        -   machine logs:
            -   by servers around the internet (hundreds of millions of
                machines out there!)
            -   internet of things.

Data Science Defined, Cloud Computing and What's Hard About Data Science?
-------------------------------------------------------------------------

The following content was created by Anthony Joseph and used in
BerkeleyX/CS100.1x from 2015.

-   **(watch now 2:03)**: Data Science Defined
    -   [![Data Science Defined by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/g4ujW1m2QNc/0.jpg)](https://www.youtube.com/watch?v=g4ujW1m2QNc?rel=0&modestbranding=1)
-   **(watch now 1:11)**: Cloud Computing
    -   [![Cloud Computing by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/TAZvh0WmOHM/0.jpg)](https://www.youtube.com/watch?v=TAZvh0WmOHM?rel=0&modestbranding=1)
    -   In fact, if you are logged into `https://*.databricks.com/*` you
        are computing in the cloud!
    -   The Scalable Data Science course is supported by the Databricks
        Academic Partners Program and the AWS Educate Grant to University
        of Canterbury (applied for by Raaz Sainudiin in 2015).
-   **(watch now 3:31)**: What's hard about data science
    -   [![What's hard about data science by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/MIqbwJ6AbIY/0.jpg)](https://www.youtube.com/watch?v=MIqbwJ6AbIY?rel=0&modestbranding=1)

Here is a recommended light reading on **What is "Big Data" --
Understanding the History** (18 minutes):

-   <https://towardsdatascience.com/what-is-big-data-understanding-the-history-32078f3b53ce>

  

------------------------------------------------------------------------


Background Materials on Data Science
------------------------------------

The following content was created by Anthony Joseph and used in
BerkeleyX/CS100.1x from 2015.

-   **(watch later 2:31)**: Why all the excitement about *Big Data
    Analytics*? (using google search to now-cast google flu-trends)

    -   [![Why all the excitement about Big Data Analytics by Anthony
        Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/16wqonWTAsI/0.jpg)](https://www.youtube.com/watch?v=16wqonWTAsI)

-   Other interesting big data examples: recommender systems and the
    Netflix Prize.

-   **(watch later 10:41)**: Contrasting data science with traditional
    databases, ML, Scientific computing

    -   [![Data Science Database Contrast by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/c7KG0c3ADk0/0.jpg)](https://www.youtube.com/watch?v=c7KG0c3ADk0)
    -   SUMMARY:
    -   traditional databases versus data science
        -   preciousness versus cheapness of the data
        -   ACID and eventual consistency, CAP theorem, ...
        -   interactive querying: SQL versus noSQL
        -   querying the past versus querying/predicting the future
    -   traditional scientific computing versus data science
        -   science-based or mechanistic models versus data-driven
            black-box (deep-learning) statistical models (of course both
            schools co-exist)
        -   super-computers in traditional science-based models versus
            cluster of commodity computers
    -   traditional ML versus data science
        -   smaller amounts of clean data in traditional ML versus
            massive amounts of dirty data in data science
        -   traditional ML researchers try to publish academic papers
            versus data scientists try to produce actionable intelligent
            systems

-   **(watch later 1:49)**: Three Approaches to Data Science

    -   [![Approaches to Data Science by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/yAOEyeDVn8s/0.jpg)](https://www.youtube.com/watch?v=yAOEyeDVn8s)

-   **(watch later 4:29)**: Performing Data Science and Preparing Data,
    Data Acquisition and Preparation, ETL, ...

    -   [![Performing Data Science and Preparing Data by Anthony Joseph
        in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/3V6ws_VEzaE/0.jpg)](https://www.youtube.com/watch?v=3V6ws_VEzaE)

-   **(watch later 2:01)**: Four Examples of Data Science Roles

    -   [![Data Science Roles by Anthony Joseph in
        BerkeleyX/CS100.1x](http://img.youtube.com/vi/gB-9rdM6W1A/0.jpg)](https://www.youtube.com/watch?v=gB-9rdM6W1A)
    -   SUMMARY of Data Science Roles.
    -   individual roles:
        1.  business person
        2.  programmer
    -   organizational roles:
        1.  enterprise
        2.  web company
    -   Each role has its own unique set of:
        -   data sources
        -   Extract-Transform-Load (ETL) process
        -   business intelligence and analytics tools
    -   Most Maths/Stats/Computing programs cater to the *programmer*
        role
        -   NumPy and Matplotlib, R, MATLAB, and Octave.
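
The Extract-Transform-Load (ETL) process mentioned in the summary above can be sketched in a few lines. This is a toy, single-machine illustration using only the Python standard library, not the pipeline shown in the videos; the CSV contents, the column names and the `purchases` table are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with some dirty records.
RAW_CSV = """customer,event,amount
alice,purchase,10.50
bob,click,
alice,purchase,4.50
carol,purchase,not_a_number
bob,purchase,20.00
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only purchases with a valid numeric amount."""
    cleaned = []
    for row in rows:
        if row["event"] != "purchase":
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop records whose amount cannot be parsed
        cleaned.append((row["customer"], amount))
    return cleaned


def load(records):
    """Load: write the cleaned records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()
    return conn


conn = load(transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once the data is loaded, the final query plays the part of the business intelligence and analytics step, here a per-customer total.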

What should *you* be able to do at the end of this course?
==========================================================

By following these online interactions in the form of lab/lectures,
asking questions, engaging in discussions, doing HOMEWORK assignments
and completing the group project, you should be able to:

-   Understand the principles of fault-tolerant scalable computing in
    Spark
    -   in-memory and generic DAG extensions of MapReduce
    -   resilient distributed datasets for fault-tolerance
    -   skills to process today's big data using state-of-the-art
        techniques in Apache Spark 3.0, in terms of:
        -   hands-on coding with realistic datasets
        -   an intuitive understanding of the ideas behind the
            technology and methods
        -   pointers to academic papers in the literature, technical
            blogs and video streams for *you to further your theoretical
            understanding*.
-   More concretely, you will be able to:
    -   Extract, Transform, Load, Interact, Explore and Analyze Data
    -   Build Scalable Machine Learning Pipelines (or help build them)
        using Distributed Algorithms and Optimization
-   How to keep up?
    -   This is a fast-changing world.
    -   Recent videos around Apache Spark are archived here (these
        videos are a great way to learn the latest happenings in
        industrial R&D today!):
        -   https://databricks.com/sparkaisummit/north-america/sessions
-   What is mathematically stable in the world of 'big data'?
    -   There is a growing body of work on the analysis of parallel and
        distributed algorithms, the work-horse of big data and AI.
    -   We will see some of this in a theoretical module later, but the
        focus here is on how to write programs and analyze data.
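
To make the MapReduce starting point mentioned above concrete, here is a minimal word-count sketch in plain Python. It runs on one machine and does not use the Spark API; the function names `map_phase`, `shuffle` and `reduce_phase` and the toy input are assumptions made for illustration. Each inner list stands in for one partition of a distributed dataset.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


# Two hypothetical partitions of a tiny text dataset.
partitions = [
    ["big data is big", "data science"],
    ["science of data"],
]

mapped = [pair for part in partitions for pair in map_phase(part)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The map step touches each partition independently, which is why it parallelises across machines; only the shuffle needs data to move between them.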