- Spark is a complete in-memory framework. Data gets loaded from storage (for instance, HDFS) into the memory of the workers. Data locality, meaning processing data where it is stored, is the most efficient approach, and that is exactly what Spark does. You can and should run Spark workers directly on the data nodes of your Hadoop cluster.
- There is no longer a fixed map and reduce stage. Your code can be as complex as you want.
- Once in memory, the input data and the intermediate results stay there until the job finishes. They are not written to disk between stages, as they are with MapReduce.
- This makes Spark the optimal choice for complex analytics. It allows you, for instance, to run iterative processes: modifying a dataset multiple times to produce an output is straightforward.
- Streaming analytics capability is another thing that makes Spark so great. Spark natively offers the option to schedule a job to run every X seconds or X milliseconds.
- As a result, Spark can deliver you results from streaming data in "real time".
- Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool called PySpark.
- The best part of Spark is that it offers various built-in packages for machine learning, making it more versatile. These built-in machine-learning packages are known as MLlib in Apache Spark.
- The MLlib API is very similar to the scikit-learn API, so there is not much new to learn.
- MLlib offers various types of prebuilt machine-learning models.
- Spark’s MLlib supports computer vision as well as Natural Language Processing.
- It can be applied to real-time data as well as to distributed data systems.
- Compared to Hadoop, Spark is "just" an analytics framework. It has no storage capability. Although it ships with a standalone resource manager, you usually don't use that feature.
- So, if Hadoop and Spark are not the same thing, can they work together? For storage you use HDFS, analytics is done with Apache Spark, and YARN takes care of resource management.
- It just would not make sense to have two resource managers managing the same server's resources; sooner or later they would get in each other's way. That's why the Spark standalone resource manager is seldom used. So, the question is not Spark or Hadoop. The question has to be: should you use Spark or MapReduce alongside Hadoop's HDFS and YARN?
- If you are doing simple batch jobs like counting values or calculating averages: Go with MapReduce.
- If you need more complex analytics like machine learning or fast stream processing: Go with Apache Spark.
- In-memory caching allows real-time computation and low latency.
- It can be deployed in multiple ways: with Spark's standalone cluster manager, on Mesos, or on Hadoop via YARN.
- User-friendly APIs are available for all popular languages, hiding the complexity of running a distributed system.
- It can be up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk.
- Although you can run any SQL query in Spark, don't expect Spark to answer in a few milliseconds the way MySQL or PostgreSQL do.
- Although Spark is low-latency compared to other big data solutions like Hive or Impala, you cannot compare it with a classic database: Spark is not a database, and the data it reads is not indexed!
- One of the most-used functions is Spark's applyInPandas(), which enables splitting giant data sources into pandas-sized chunks and processing them independently.
- Install java
- Install pyspark from Anaconda:
conda install -c conda-forge pyspark
alternatively use pip:
pip install pyspark
- Download the latest version of Apache Spark from this link, unzip it, place the folder in your home directory, and rename the folder to just spark.
- Define these environment variables. On Unix/Mac, this can be done in .bashrc or .bash_profile.
export SPARK_HOME=~/spark
# Tell spark which version of python you want to use
export PYSPARK_PYTHON=~/anaconda3/bin/python
- Verify installation:
cd spark
# launching pyspark
./bin/pyspark
# or to launch the scala console
./bin/spark-shell
- If you are using jupyter notebook:
- Install java
- Then
pip install pyspark
- Installing Spark on your local machine can get very complicated, and it might not be worth the effort, since you won’t actually run a production-like cluster on your local machine.
- It’s just easier and quicker to use a container; pull the Spark image from Docker Hub. Once this is done, you will be able to access a ready notebook at localhost:8888 by running
$ docker run -p 8888:8888 jupyter/pyspark-notebook
- Navigate to the Notebook and try to run this:
import pyspark
sc = pyspark.SparkContext()
- via Synapse
- via Elastic MapReduce (EMR)
- via Kubernetes
- Examples of manipulating data (crimes data) and building a RandomForest model with PySpark MLlib
- GroupBy And Aggregate Functions
- Tuning Spark Partitions
- Dataframe- Handling Missing Values
- Dataframes - Filter operation
- PySpark ML
- PCA with PySpark on a local machine
- Linear regression
- PySpark basics
- PySpark dataframe wrangling
- Building a KMeans model with PySpark MLlib
- How to install PySpark locally
- What is Spark
- Introduction to spark and its datasets
- YouTube #1
- YouTube #2
- YouTube #3
- YouTube #4
- YouTube #5
- YouTube #6
- Notes on parallel computing and big data
- Apache Spark Tutorial: Get Started With Serving ML Models With Spark
- PySpark Course: Big Data Handling with Python and Apache Spark