<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/PySpark/pyspark_loading_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark on Google Colab: Load CSV Example

The following features set Apache Spark apart:

* Speed: Runs workloads up to 100x faster than Hadoop MapReduce when data is processed in memory, according to the Spark project.
* Ease of Use: Offers APIs in several programming languages, including Java, Scala, Python, and R.
* Generality: Combines SQL, streaming, and complex analytics.
* Runs Everywhere: Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

### Setting up PySpark in Colab

Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.

In [51]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we download and unpack the Apache Spark 3.2.1 build for Hadoop 2.7.

Apache Spark Versions:
http://apache.osuosl.org/spark/

In [52]:
!wget -q http://apache.osuosl.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz 
!tar xf spark-3.2.1-bin-hadoop2.7.tgz

Now we set the JAVA_HOME and SPARK_HOME environment variables so that PySpark can find both installations.

In [53]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.2.1-bin-hadoop2.7/"
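
Before initializing findspark, it can help to confirm that both variables point at directories that actually exist, since a wrong path is easier to diagnose here than after Spark fails to start. The following is a minimal sketch using only the standard library; the helper name `missing_env_dirs` is hypothetical and is not part of PySpark or findspark.

```python
import os

def missing_env_dirs(names):
    """Return the variables that are unset or do not point to an existing directory."""
    missing = []
    for name in names:
        value = os.environ.get(name)
        if not value or not os.path.isdir(value):
            missing.append(name)
    return missing

# An empty list means both installations were found where expected.
print(missing_env_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

A relative SPARK_HOME is resolved against the current working directory, so this check should be run from the same directory in which the archive was unpacked.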

Then we need to install the findspark library, which locates Spark on the system and makes it importable as a regular library.

In [54]:
!pip install findspark



In [55]:
import findspark
findspark.init()

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

In [56]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()



Set up data from GitHub using SparkFiles

In [57]:
from pyspark import SparkFiles
url = "https://raw.githubusercontent.com/jgamel/learn_n_dev/input_data/Mall_Customers.csv"
spark.sparkContext.addFile(url)

Set up data from GitHub using wget

In [64]:
!wget --continue https://raw.githubusercontent.com/jgamel/learn_n_dev/input_data/Mall_Customers.csv -O /tmp/Mall_Customers.csv

--2022-05-03 21:34:18--  https://raw.githubusercontent.com/jgamel/learn_n_dev/input_data/Mall_Customers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3981 (3.9K) [text/plain]
Saving to: ‘/tmp/Mall_Customers.csv’


2022-05-03 21:34:18 (31.1 MB/s) - ‘/tmp/Mall_Customers.csv’ saved [3981/3981]
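
A quick look at the first line of the downloaded file can confirm that it really is the expected CSV (and not, for example, an error page) before handing it to Spark. This is a small sketch with Python's standard `csv` module rather than Spark; the helper name `csv_header` is hypothetical.

```python
import csv
import os

def csv_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as f:
        # next() falls back to an empty list when the file has no rows.
        return next(csv.reader(f), [])

path = "/tmp/Mall_Customers.csv"
if os.path.exists(path):
    print(csv_header(path))
```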



In [58]:
spark

### Loading data into PySpark

PySpark is designed for large-scale datasets, so loading data is the first step of any processing job. The following commands show how to load data into PySpark, using a small dataset that contains customer data. We pass three arguments to read.csv(): the path of the CSV file, header=True to treat the first line as column names, and inferSchema=True to have Spark infer each column's data type instead of reading every column as a string.

Load using SparkFiles

In [68]:
df1 = spark.read.csv(SparkFiles.get("Mall_Customers.csv"), header=True, inferSchema=True)

Load using wget

In [69]:
df2 = spark.read.csv("/tmp/Mall_Customers.csv", header=True, inferSchema=True)
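
With inferSchema=True, Spark has to read the data an extra time to guess the column types. When the types are known in advance, an explicit schema can be passed instead. The sketch below builds a DDL-formatted schema string, which read.csv() accepts through its `schema` parameter. The column types are an assumption based on the integer-looking values in this dataset, and `MALL_COLUMNS`, `ddl_schema`, and `load_with_schema` are hypothetical names.

```python
# Column names are taken from the CSV header; the types are assumed, not confirmed.
MALL_COLUMNS = [
    ("CustomerID", "INT"),
    ("Gender", "STRING"),
    ("Age", "INT"),
    ("Annual Income (k$)", "INT"),
    ("Spending Score (1-100)", "INT"),
]

def ddl_schema(columns):
    """Build a DDL schema string, backtick-quoting every name so spaces and symbols are allowed."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)

def load_with_schema(spark, path):
    """Read the CSV with an explicit schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=ddl_schema(MALL_COLUMNS))

print(ddl_schema(MALL_COLUMNS))
```

With the session created earlier, a call such as `load_with_schema(spark, "/tmp/Mall_Customers.csv")` should load the same file without the inference pass.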

### Data Exploration with PySpark DataFrames

After loading data, we can perform several tasks related to our dataset. Let’s explore a few of them.

* Display data - The show() method prints the first rows of the dataset, as follows.

In [70]:
df1.show(10)

+----------+------+---+------------------+----------------------+
|CustomerID|Gender|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
|         1|  Male| 19|                15|                    39|
|         2|  Male| 21|                15|                    81|
|         3|Female| 20|                16|                     6|
|         4|Female| 23|                16|                    77|
|         5|Female| 31|                17|                    40|
|         6|Female| 22|                17|                    76|
|         7|Female| 35|                18|                     6|
|         8|Female| 23|                18|                    94|
|         9|  Male| 64|                19|                     3|
|        10|Female| 30|                19|                    72|
+----------+------+---+------------------+----------------------+
only showing top 10 rows



In [71]:
df2.show(10)

+----------+------+---+------------------+----------------------+
|CustomerID|Gender|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
|         1|  Male| 19|                15|                    39|
|         2|  Male| 21|                15|                    81|
|         3|Female| 20|                16|                     6|
|         4|Female| 23|                16|                    77|
|         5|Female| 31|                17|                    40|
|         6|Female| 22|                17|                    76|
|         7|Female| 35|                18|                     6|
|         8|Female| 23|                18|                    94|
|         9|  Male| 64|                19|                     3|
|        10|Female| 30|                19|                    72|
+----------+------+---+------------------+----------------------+
only showing top 10 rows



* Count the records

In [72]:
df1.count()

200

* Drop null values - Remove any rows in the dataset that contain null values.

In [73]:
df1 = df1.na.drop()
df1.count()

200

* Display specific columns only

In [74]:
df1.select("Gender","Age").show(5)

+------+---+
|Gender|Age|
+------+---+
|  Male| 19|
|  Male| 21|
|Female| 20|
|Female| 23|
|Female| 31|
+------+---+
only showing top 5 rows

