Getting to know each other
-
Motivation: the era of big data, sources of big data, statistics, applications and use cases.
For further applications of Big Data, have a look at: Big Data Applications
-
Outline: Topics to be covered throughout the course
- The position of Spark in the Hadoop ecosystem, the main components of Apache Spark, and the limitations of using Hadoop for processing big data.
- Scala Programming with practices on Spark applications
- The main differences between RDDs, Dataframes, and Datasets in Spark
- Use cases revealing the important role of Spark SQL in data exploration
- Introduction to NoSQL and practices of data manipulation using MongoDB.
- Indexing and searching big textual content using Elasticsearch, with practical examples.
- Spark ML library with use case scenarios on regression and recommendation systems.
- Spark NLP in text preprocessing and analysis.
- Introduction to TensorFlow and Keras to classify large textual content, applied to sentiment analysis.
- Data Stream and Apache Kafka
- Big Data Tools
- What is Apache Spark?
- Started as a Hadoop sub-project by UC Berkeley's AMPLab in 2009
- Spark is a UNIFIED platform for cluster computing and distributed big data processing, enabling efficient big data management and analytics.
- Spark is designed to be fast and general purpose.
- data parallelism and fault-tolerance
- Spark History Depicted in the figure below
- Who Can Benefit from Spark?
- Data Scientists: Spark provides data-focused tools for going beyond problems that fit on a single machine
- Engineers: Spark helps eliminate the need to use many different special-purpose platforms for streaming, machine learning, and graph analytics.
- Students: The rich API provided by Spark makes it easy to learn data analysis and program development in Java, Scala or Python.
- Researchers: New opportunities exist for designing distributed algorithms and testing their performance in clusters.
- Spark vs Hadoop (sorting 1PB)
- Spark Features:
- Real-time processing with low latency because of in-memory computation
- Up to 100x faster for large-scale data processing
- Polyglot: Spark applications can be built using multiple languages
- Powerful caching
- Multiple deployment modes: e.g., Mesos, YARN, Spark standalone cluster manager
- Industries using Apache Spark:
-
Spark Architecture:
- Spark uses a master-worker architecture
- Master node: a driver program that drives your application (The code you are writing) or the shell if you are using an interactive shell.
- a node where a spark-context is created.
- A SparkContext: a gateway to all Spark functionalities; it takes care of job execution within the cluster.
- The SparkContext communicates with the cluster manager, which controls the resources in the Spark cluster and allocates suitable resources to the tasks composing a certain job.
- Examples of cluster manager: Spark Standalone Cluster, Apache Mesos, Hadoop Yarn, and Kubernetes.
- Tasks are distributed over the worker nodes.
- Worker nodes (slave nodes): they execute the tasks and return results to the SparkContext.
-
Spark applications can be built using a number of programming languages, e.g., Scala, Python, Java, and R
-
Scala is often preferred along with Spark because:
- Scala is a statically typed language which allows us to find compile time errors
- Scala has multiple standard libraries and cores which allows quick integration of the databases in Big Data ecosystems.
- Scala is often considerably faster than Python for Spark workloads (sometimes reported as around 10x).
- A single line of Scala can sometimes replace 20 to 25 lines of Java code.
- Excellent built-in concurrency support and libraries. For more information, see (Scala vs Python)
-
Spark Shells:
- spark-shell: with Scala support
- PySpark: with Python support
- SparkR: with R support
-
Spark RDDs:
- the fundamental data structure of Apache Spark
- a distributed collection of items
- a read-only (i.e., immutable) collection of items partitioned across a set of machines that can be rebuilt if a partition is destroyed
- enables in-memory data sharing
- a considerable amount of work done in Spark includes:
- creating RDDs
- transforming existing RDDs
- performing actions on RDDs
- RDD creation:
- loading an already existing set of objects
val localRDD = sc.textFile("localfile.txt")
- parallelizing a data collection in the driver
val listRDD = sc.parallelize(myList)
val listRDD = sc.parallelize(List("Java", "Scala", "R"))
- transforming a certain RDD into another RDD
val resultRDD = listRDD.filter(x=>x.contains("R"))
- Operations on RDDs:
-
transformation: creating a new RDD from an existing one, e.g., using
map, filter, groupBy, reduceByKey, partitionBy, sortByKey, join, etc.
E.g., val resultRDD = listRDD.map(x=>(x,1))
E.g., sc.parallelize(List(1,2,3,4,5)).filter(_ % 2 == 0)
// filter: returns an RDD containing all elements that meet a passed predicate
action: applied on the RDD, e.g., using
reduce, collect, take, count, saveAsTextFile, etc
computation takes place only on actions and not on transformations! (Lazy evaluation)
At any point we can force the execution of transformations by applying a simple action such as count().
E.g.,val result = sc.parallelize(1 to 100).reduce(_+_)
// sum of all elements
E.g.,val result = sc.parallelize(List("Hamed","Abdelhaq")).reduce((x1,x2)=>x1 + " " + x2)
Exercise: Have a look at the first and take action functions.
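A minimal sketch combining the transformations and actions above into a word count, assuming spark-shell where sc is available and a hypothetical input file "input.txt":
val lines = sc.textFile("input.txt")                      // load an RDD of text lines
val counts = lines.flatMap(_.split("\\s+"))               // transformation: split lines into words
                  .map(word => (word, 1))                 // transformation: pair each word with 1
                  .reduceByKey(_ + _)                     // transformation: sum the counts per word
counts.take(10).foreach(println)                          // action: triggers the actual computation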
-
-
Spark Components: The main functionality is implemented in Spark Core. Other components exist, that integrate tightly with Spark Core.
-
Useful libraries in Spark: currently the following libraries exist, and they are evolving very fast:
- Spark SQL: a library for querying structured datasets as well as distributed datasets, allowing relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark.
- Spark Streaming: a library to ease the development of complex streaming applications. Data can be ingested into Spark from different sources like Kafka and Kinesis.
- Spark Machine Learning (MLlib): MLlib is Spark's scalable machine learning library.
- Spark Graph (GraphX): an API for graph processing and graph-parallel algorithms on-top of Spark
- Java installed
- IntelliJ installed and Scala included
- Spark binaries downloaded and configured. For more detailed steps, follow the instructions in Configuring Spark
-
Running spark-shell: from the command line, type "spark-shell"
Spark Context:
-
the entry gate of Apache Spark functionality.
-
allows your Spark application to access the Spark cluster with the help of a resource manager (YARN/Mesos).
-
To create a SparkContext, a SparkConf should be established first.
- The SparkConf holds the configuration parameters that our Spark driver application will pass to the SparkContext.
- SparkContext is generated automatically when spark-shell is loaded. SparkContext will use default configurations to set the parameters of SparkConf.
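A minimal sketch of creating a SparkContext explicitly in a standalone application (inside spark-shell, sc is already created for you); the application name and master URL are illustrative assumptions:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyFirstSparkApp")   // name shown in the Spark UI
  .setMaster("local[*]")           // run locally using all available cores
val sc = new SparkContext(conf)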
-
-
Simple commands: creating a list variable, parallelizing it, and returning a sublist with elements below 10 (see the sketch below)
-
Also, we will try out the code directly using spark-shell.
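A minimal sketch of these commands, assuming spark-shell where sc is available (the list values are illustrative):
val myList = List(3, 7, 12, 25, 8, 41, 5)      // a local Scala list
val listRDD = sc.parallelize(myList)           // distribute it as an RDD
val below10 = listRDD.filter(_ < 10)           // keep only elements below 10
below10.collect().foreach(println)             // action: bring the results back to the driver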
-
-
-
Clarifying the concept of transformations (lazy evaluation) and actions with examples; see the sketch below
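A small sketch illustrating lazy evaluation (the values and the println side effect are only for illustration):
val doubled = sc.parallelize(1 to 5).map { x =>
  println(s"processing $x")   // side effect used only to show when the work happens
  x * 2
}
// nothing has been printed so far: map is a lazy transformation
doubled.count()               // the action triggers the computation; only now do the prints appear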
-
Basics of Scala programming
After completing this session, the participants will be able to:
- Understand the concept of Information Retrieval and its role in building search engines
- Apply APIs from SparkSQL library to answer queries over DataFrames
- Differentiate between DataFrames, Datasets and RDDs
- Index textual content using ElasticSearch framework
- Pose rich queries against ES indexes
Querying structured datasets as well as distributed datasets, allowing relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. Spark SQL provides an interface that makes RDDs appear as relational tables. Utilizing these sorts of abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics within a single application. In other words, Spark SQL allows developers to:
- Import relational data from Parquet files and Hive tables
- Run SQL queries over imported data and existing RDDs
- Build UDFs and call them during the manipulation of DataFrames
- Easily write RDDs out to Parquet files and other storage resources
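The sketch below illustrates these capabilities, assuming a SparkSession spark (created automatically in spark-shell) and a hypothetical JSON file with name and city fields:
val customers = spark.read.json("customers.json")    // load structured data into a DataFrame
customers.createOrReplaceTempView("customers")       // expose it as a SQL table

// plain SQL over the DataFrame
spark.sql("SELECT city, COUNT(*) AS cnt FROM customers GROUP BY city").show()

// a user-defined function (UDF) registered and called from SQL
spark.udf.register("upperName", (s: String) => s.toUpperCase)
spark.sql("SELECT upperName(name) FROM customers").show()

// write the DataFrame out as a Parquet file
customers.write.parquet("customers_parquet")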
Resource 2.1: a brief introduction to the Spark APIs, i.e., RDD, DataFrame, and Dataset
- RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner
- Spark Dataframe:
- data organized into named columns
- similar to tables in a relational database
- allows developers to impose a structure onto a distributed collection of data
- memory management and optimized execution plan
- Spark Dataset:
- extension of DataFrame API
- provides type-safe, object-oriented programming interface
- If you try to access a column that does not exist in the table, the Dataset API catches the error at compile time (see the sketch below).
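A minimal sketch contrasting the two APIs, assuming spark-shell (where spark and its implicits are available) and illustrative data:
import spark.implicits._

case class Person(name: String, age: Long)

val ds = Seq(Person("Huda", 28), Person("Omar", 35)).toDS()   // typed Dataset[Person]
val df = ds.toDF()                                            // untyped DataFrame (Dataset[Row])

ds.filter(_.age > 30).show()      // field access is checked at compile time
df.filter($"age" > 30).show()     // column names are only checked at runtime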
Parquet stores binary data in a column-wise manner; that is, the values in each column are organized so that they are all adjacent, enabling better compression and performance. This is essential for queries that read particular columns from tables with many columns, since only the needed columns are read and IO is reduced. Read this for more details on Parquet. Resource 2.2
- loading a parquet file into a spark dataframe
// Read in the parquet file created above
// The result is a DataFrame
val df = spark.read.parquet("path_of_parquet_file")
- storing a dataframe as a parquet file
df.write.parquet("path_of_parquet_file")
- App-1: In this demo, we read a json file containing a large number of records representing customer information.
- More comments/description/details can be found as inline comments within the code lines.
- A number of practices are also embedded within code segments.
- The code of this demo can be accessed through: code
- Information Retrieval (IR)
- Definition
- Motivation
- Applications
- Technologies and links to big data
- [Resource-2.4](tutorials/Information Retrieval-1.pptx)
- What is ES?
- Elasticsearch is an Apache Lucene-based search server
- A real-time distributed and open source full-text search and analytics engine.
- accessible through a RESTful web service interface; it uses schema-less JSON (JavaScript Object Notation) documents to store data.
- It enables users to explore very large amounts of data at very high speed.
- Key features:
- scalable up to petabytes of structured and unstructured data.
- can be used as a replacement for document stores like MongoDB
- currently being used by many big organizations like Wikipedia, StackOverflow, GitHub, etc.
- open source and available under the Apache license.
Install ES via ElasticSearch Download.
- Creating an index: (school)
- Inserting a new record to school index
POST localhost:9200/school/_doc
{
"name":"Saint Paul School222",
"description":"ICSE Afiliation222",
"street":"Dawarka",
"city":"Delhi",
"state":"Delhi",
"zip":"110075",
"location":[28.5733056, 77.0122136],
"fees":5000,
"tags":["Good Faculty", "Great Sports"],
"rating":"4.5"
}
-
Data organization under ES
- An index is a collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data.
- Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document.
- However, you usually know more about your data and how you want to use it than Elasticsearch does. You can define rules to control dynamic mapping and explicitly define your own mappings.
- More information
-
Document indexing
-
Query Processing
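As an illustrative query against the school index created above (following the same endpoint-plus-JSON-body convention; the field and value come from the record inserted earlier):
GET localhost:9200/school/_search
{
  "query": {
    "match": { "city": "Delhi" }
  }
}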
-
MLlib is Spark's machine learning (ML) library; it makes the implementation of different machine learning algorithms scalable and easy. The main features of MLlib can be summarized as follows:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
[more details on Spark MLlib](https://spark.apache.org/docs/latest/ml-guide.html)
-
Transformers, estimators, and pipelines:
First, let's start with ML Pipelines. ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames allowing users to create and tune practical machine learning pipelines.
-
Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
-
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
-
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Estimators and transformers
-
Estimators, transformers, and a pipeline
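A minimal sketch of these concepts: two Transformers (Tokenizer, HashingTF) and one Estimator (LogisticRegression) chained into a Pipeline. The column names and tiny training set are illustrative assumptions.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// illustrative training data: id, text, label
val training = spark.createDataFrame(Seq(
  (0L, "spark is great", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")       // Transformer
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")   // Transformer
val lr = new LogisticRegression().setMaxIter(10)                                // Estimator

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // fitting the Pipeline (an Estimator) yields a PipelineModel (a Transformer)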
Practice-1:Regression in Boston Housing Dataset
The goal here is to build a regression model using data collected from homes in the suburbs of Boston. This model is used to make predictions about the monetary value of houses in Boston, which is very beneficial for real estate agents who could make use of such information on a daily basis.
- ILOs to be achieved:
- Applying transformers and estimators.
- combining predictors (features) using VectorAssembler.
- Applying LinearRegression and RandomForestRegressor as estimators in Spark MLlib.
- Model evaluation using R2 and RMSE.
- Read more about the data here (a short sketch of the modeling steps is given below)
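A minimal sketch of these steps, assuming a DataFrame named housing with illustrative column names (crim, rm, lstat as predictors and medv as the target):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

val assembler = new VectorAssembler()
  .setInputCols(Array("crim", "rm", "lstat"))   // combine predictors into a single vector column
  .setOutputCol("features")
val data = assembler.transform(housing).select("features", "medv")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val lr = new LinearRegression().setLabelCol("medv").setFeaturesCol("features")
val model = lr.fit(train)
val preds = model.transform(test)

val rmse = new RegressionEvaluator().setLabelCol("medv").setMetricName("rmse").evaluate(preds)
val r2 = new RegressionEvaluator().setLabelCol("medv").setMetricName("r2").evaluate(preds)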
- Exercise-1: using class ParamGridBuilder to conduct model selection and improve the performance of the generated model. Use this resource: model selection using hyper-parameter tuning
Using MLlib in Spark to predict which passengers survived the Titanic shipwreck.
-
ILOs to be achieved:
- Using pipeline as estimators
- Creating classification models using MLlib
- Introducing and using StringIndexer and OneHotEncoder (see the sketch below).
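A minimal sketch of these steps, assuming Spark 3.x and a DataFrame named titanic with illustrative columns (Sex, Age, Fare, and a numeric 0/1 Survived label):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val sexIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val sexEncoder = new OneHotEncoder().setInputCols(Array("SexIndex")).setOutputCols(Array("SexVec"))
val assembler = new VectorAssembler()
  .setInputCols(Array("SexVec", "Age", "Fare"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("Survived").setFeaturesCol("features")

// the whole preprocessing + training workflow as a single Pipeline estimator
val pipeline = new Pipeline().setStages(Array(sexIndexer, sexEncoder, assembler, lr))
val model = pipeline.fit(titanic)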
-
Exercises
- rewrite the code using pyspark
- evaluate the model built above by calculating:
- the accuracy, precision, and recall
- AUC (extra)
- Use Random forest and compare its results with Logistic regression
Practice-3: Personalized Movie Recommendation
-
ILOs to be achieved:
- having an idea about collaborative filtering.
-
We will use two files from this MovieLens dataset: "ratings.dat" and "movies.dat". All ratings are contained in the file "ratings.dat" and are in the following format:
UserID::MovieID::Rating::Timestamp
Movie information is in the file "movies.dat" and is in the following format:
MovieID::Title::Genres
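A minimal sketch of training a recommender with ALS, the collaborative-filtering estimator in MLlib, on the ratings format above; the file path and parsing are illustrative:
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._   // already available inside spark-shell

// parse "UserID::MovieID::Rating::Timestamp" lines into a DataFrame
val ratings = spark.read.textFile("ratings.dat")
  .map { line =>
    val f = line.split("::")
    (f(0).toInt, f(1).toInt, f(2).toFloat)
  }
  .toDF("userId", "movieId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(ratings)
val top5 = model.recommendForAllUsers(5)   // top-5 movie recommendations per user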
Each group presents the current status of their project
- What we have discussed on Topic-3
- Plan describing the topics to be discussed on Topic-4
-
Introduction to NLP
-
Applications
-
Challenges in analysing text
https://bit.ly/2Szk5zF -
Introduction to Spark NLP
Spark NLP is an open-source natural language processing library built on top of Apache Spark and Spark ML. Developed by John Snow Labs, it provides an easy API to integrate with ML Pipelines in order to:
- provide a single unified solution for all your NLP needs
- take advantage of transfer learning
- bridge the gap left by the lack of an NLP library that's fully supported by Spark
- deliver a mission-critical, enterprise-grade NLP library
-
Services provided by Spark NLP:
- tokenization
- lemmatization
- stemming
- POS tagging
- Sentiment analysis
- Spell checking
- pre-trained pipelines and models
- named entity recognition and more.
-
Practice-1: Text Preprocessing and NER using Spark
data
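A minimal sketch, assuming the Spark NLP library is on the classpath and using one of its pretrained pipelines (the pipeline name and output keys are assumptions that may need adjusting to the models available to you):
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// download and load a pretrained pipeline (tokenization, POS tagging, NER, ...)
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")

// annotate a piece of text; the result maps annotator names to their outputs
val annotations = pipeline.annotate("Spark NLP is developed by John Snow Labs.")
println(annotations("token"))
println(annotations("ner"))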
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
- Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets,
- Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
more details
[taken from: https://databricks.com/glossary/what-is-spark-streaming]
- Spark Streaming receives live input data streams and divides the data into batches
- batches are then processed by the Spark engine to generate the final stream of results in batches.
[taken from: https://spark.apache.org/docs/latest/streaming-programming-guide.html]
- Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
- can be created either from input data streams from sources such as Kafka, and Kinesis,
- or by applying high-level operations on other DStreams
- Internally, a DStream is represented as a sequence of RDDs.
Practice-1: Counting words using Spark Streaming
- In this practice, students understand how to:
- using Spark to create a streaming context
- build a consumer that receives lines of text and counts the number of words in a streaming fashion
- using the DStream APIs (see the sketch below)
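A minimal sketch of this word count with the DStream API, assuming text lines arrive on a local TCP socket (e.g., one opened with netcat on port 9999):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))          // 5-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)     // DStream of text lines
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()            // print the counts of each batch

ssc.start()               // start receiving and processing data
ssc.awaitTermination()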
- In this practice, students understand how to:
- using Spark to create a streaming context
- using DataFrames and Datasets in a streaming context (see the sketch below)
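A minimal sketch of the same word count using Structured Streaming (DataFrames/Datasets), again assuming lines arrive on a local socket:
import org.apache.spark.sql.functions._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()                                      // streaming DataFrame with a "value" column

val counts = lines
  .select(explode(split(col("value"), "\\s+")).as("word"))
  .groupBy("word")
  .count()

val query = counts.writeStream
  .outputMode("complete")                      // emit the full updated counts on each trigger
  .format("console")
  .start()
query.awaitTermination()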
- Running Zookeeper
zkServer.cmd
- Running Kafka server
kafka-server-start.bat [path to server.properties]
- Creating a new topic
kafka-topics.bat --create --topic [topic name] --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
- Running a producer to send events to that topic
kafka-console-producer.bat --topic [topic name] --broker-list localhost:9092
- Running a consumer to read these events (messages)
kafka-console-consumer.bat --topic testing-kafka1 --bootstrap-server localhost:9092 --from-beginning
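As a follow-up, a minimal sketch of consuming the same topic from Spark with Structured Streaming (this assumes the spark-sql-kafka connector package is on the classpath and a broker running at localhost:9092):
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testing-kafka1")       // the topic used above
  .option("startingOffsets", "earliest")       // read from the beginning, like --from-beginning
  .load()

// Kafka records arrive as binary key/value columns; cast them to strings
val messages = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

messages.writeStream
  .format("console")
  .start()
  .awaitTermination()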