# Running PySpark on Google Colab
To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 2.4.7 with Hadoop 2.7, Java 8, and Findspark, which locates the Spark installation on the system. All of the installation can be done from inside the Colab notebook. One important note: if you are new to Spark, it is better to avoid version 2.4.0, since some users have reported compatibility issues with Python. Follow the steps below to install the dependencies:

In [1]:
#Java Installation
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
#Download Apache Spark Package
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz

In [3]:
#Extract the Spark Package
!tar -xf spark-2.4.7-bin-hadoop2.7.tgz
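
The download URL, the archive name, and the directory that `tar` extracts all embed the same Spark and Hadoop versions, so they have to change together if you pick a different release. A minimal sketch of that relationship follows; `spark_package` is a hypothetical helper written for illustration, not part of Spark or Findspark, and `/content` is assumed to be the Colab working directory.

```python
# Hypothetical helper (not part of Spark): derive the names used in the cells
# above from the Spark and Hadoop version strings.
def spark_package(spark_version="2.4.7", hadoop_version="2.7", base_dir="/content"):
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        # File fetched by wget and unpacked by tar
        "archive": f"{name}.tgz",
        # Same URL pattern as the wget cell above
        "url": f"https://downloads.apache.org/spark/spark-{spark_version}/{name}.tgz",
        # Directory created by tar, assuming the notebook runs in base_dir
        "extracted_dir": f"{base_dir}/{name}",
    }

pkg = spark_package()
print(pkg["url"])
print(pkg["extracted_dir"])
```

Whether a given version is actually available at that URL is not checked here; the helper only keeps the three strings consistent.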

In [4]:
# Install PySpark and FindSpark packages to access the functionality of Spark
!pip install -q pyspark
!pip install -q findspark

In [5]:
#Set the environment variables for JAVA and SPARK
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
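
If either variable points to a directory that does not exist (for example because the download or extraction step failed silently), the later Spark start-up is likely to fail with a less obvious error. As an optional sanity check, the sketch below lists any variable that is unset or does not point to a directory; `missing_paths` is a hypothetical helper name used only for this illustration.

```python
import os

# Hypothetical helper: report variables that are unset or whose value is not
# an existing directory.
def missing_paths(var_names=("JAVA_HOME", "SPARK_HOME")):
    return [name for name in var_names
            if not os.path.isdir(os.environ.get(name, ""))]

problems = missing_paths()
if problems:
    print("Check these variables before calling findspark.init():", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```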

In [6]:
# Import FindSpark and PySpark
import findspark
findspark.init()
import pyspark
# Create sc, the SparkContext object used to create RDDs and run transformations and actions
sc = pyspark.SparkContext(appName='TestApp')

In [9]:
# Create a list
l = [1,2,3,4]
# Create an RDD from the list l using the parallelize function
data = sc.parallelize(l)

In [10]:
# Perform action - count() on the RDD - data
data.count()

4

In [11]:
# Perform action - collect() on the RDD - data
data.collect()

[1, 2, 3, 4]

That's it!