# Install PySpark

🎯 AGENDA


File formats we can read into a DataFrame

How to read CSV, JSON, Parquet, Text, ORC

Options while reading files
– header, inferSchema, delimiter, multiline
– schema definition
– handling bad records

How to write data into files

Partitions & performance

Hands-on labs

Interview questions & alternative methods

In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark= SparkSession \
       .builder \
       .appName("Our First Spark Example") \
       .getOrCreate()

spark

[33m0% [Working][0m            Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]

# Reading and Writing Files to DataFrames



### **1. Introduction to Reading Data into Spark DataFrames**

In PySpark, the `SparkSession` object exposes a `DataFrameReader` through `spark.read`, which loads data from external sources into DataFrames. Supported formats include CSV, JSON, Parquet, ORC, and plain text, and each format has its own options that control how the data is parsed.

### **2. Basic Syntax**

```python
spark.read.format("file_format").option("key", "value").load("file_path")
```

- `file_format`: The format of the file (e.g., csv, json, parquet, orc, text).
- `option`: A configuration option for the reader (e.g., header, inferSchema); chain one call per option.
- `load()`: Takes the path of the file or directory to read and returns the DataFrame.

---

### **3. Reading Data from Different Formats**

#### **3.1 Reading CSV Files**

CSV (Comma-Separated Values) is a common format for data storage.

**Basic Example**:
```python
# Reading a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
```
- `header=True`: Indicates that the first row of the CSV contains column names.
- `inferSchema=True`: Infers each column's data type by scanning the data, which costs an extra pass over the file. Without it, every column is read as a string.

**With Additional Options**:
```python
df = spark.read.option("delimiter", ";").csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
```
- `delimiter=";"`: Specifies a custom delimiter (e.g., semicolon) instead of the default comma.

---

#### **3.2 Reading JSON Files**

JSON (JavaScript Object Notation) is a lightweight data-interchange format.

**Basic Example**:
```python
# Reading a JSON file
df = spark.read.json("path/to/file.json")
df.show()
```

**With the `multiline` Option**:
```python
df = spark.read.option("multiline", True).json("path/to/file.json")
df.show()
```
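- `multiline=True`: Used when a single JSON record spans multiple lines, for example a pretty-printed file or a top-level array. By default Spark expects JSON Lines, that is, one complete JSON object per line.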

---

#### **3.3 Reading Parquet Files**

Parquet is a columnar storage file format that provides efficient data compression.

**Basic Example**:
```python
# Reading a Parquet file
df = spark.read.parquet("path/to/file.parquet")
df.show()
```
- Parquet format automatically preserves the schema and data types, so there's no need for `header` or `inferSchema` options.

---

#### **3.4 Reading Text Files**

Text files contain plain text, where each line represents one record.

**Basic Example**:
```python
# Reading a text file
df = spark.read.text("path/to/file.txt")
df.show()
```
- This reads the file as a single column DataFrame with each line stored as a string.

---

#### **3.5 Reading ORC Files**

ORC (Optimized Row Columnar) is a columnar format, like Parquet, that stores its schema with the data and is widely used in the Hive ecosystem.

**Basic Example**:
```python
# Reading an ORC file
df = spark.read.orc("path/to/file.orc")
df.show()
```

---

### **4. Writing DataFrames to Files**

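You can also write Spark DataFrames to different file formats through the DataFrame's `write` attribute, which returns a `DataFrameWriter`. Each call writes a directory of part files rather than a single file, and by default it fails if the output path already exists.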

#### **4.1 Writing CSV Files**

**Example**:
```python
# Writing DataFrame to CSV
df.write.csv("path/to/output_directory", header=True)
```

#### **4.2 Writing JSON Files**

**Example**:
```python
# Writing DataFrame to JSON
df.write.json("path/to/output_directory")
```

#### **4.3 Writing Parquet Files**

**Example**:
```python
# Writing DataFrame to Parquet
df.write.parquet("path/to/output_directory")
```

#### **4.4 Writing Text Files**

**Example**:
```python
# Writing DataFrame to Text (the DataFrame must contain exactly one string column)
df.write.text("path/to/output_directory")
```

#### **4.5 Writing ORC Files**

**Example**:
```python
# Writing DataFrame to ORC
df.write.orc("path/to/output_directory")
```

---

### **5. Handling Large Files with PySpark**

For large datasets, Spark splits the data into multiple partitions so that they can be processed in parallel. The number of partitions created at read time is decided by Spark from the input size and file layout; `repartition()` changes it afterwards.

#### **Example**:
```python
# Repartitioning the DataFrame after reading
df = spark.read.csv("path/to/large_file.csv", header=True, inferSchema=True).repartition(5)
```
- `.repartition(5)`: Redistributes the rows into 5 partitions after the file has been read. This triggers a full shuffle, so use it deliberately.

---

### **6. Additional Options for Reading Files**

1. **Schema Definition**: You can define your own schema instead of relying on schema inference.

   **Example**:
   ```python
   from pyspark.sql.types import StructType, StructField, StringType, IntegerType
   
   schema = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), True),
       StructField("salary", IntegerType(), True)
   ])
   
   df = spark.read.schema(schema).csv("path/to/file.csv", header=True)
   df.show()
   ```

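2. **Handling Bad Records**: Use the `mode` option to specify how corrupt or malformed records are handled.
   - **Options**: `PERMISSIVE` (default; keeps the row and sets fields it cannot parse to null), `DROPMALFORMED` (skips the row), `FAILFAST` (raises an error on the first bad record)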
   
   **Example**:
   ```python
   df = spark.read.option("mode", "DROPMALFORMED").csv("path/to/file.csv", header=True)
   df.show()
   ```

---

### **7. Practical Exercise for Students**

**Exercise**: Ask students to read a CSV file, filter the data based on some conditions, and write the filtered DataFrame to a Parquet file.

```python
# Reading CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Filtering the data (e.g., filtering rows where age is greater than 30)
filtered_df = df.filter(df['age'] > 30)

# Writing the filtered DataFrame to Parquet format
filtered_df.write.parquet("path/to/output_directory")
```

---



In [None]:
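df = spark.read.csv('/content/person_data.csv', header=True)
# Without inferSchema, Age is read as a string; Spark casts it for the comparison.
filter_df = df.filter(df['age'] < 30)
# Run once to create the output (commented out so reruns do not fail on the existing path):
# filter_df.write.parquet('/content/filtered1.parquet')
# Read the output directory rather than a single part file; part-file names change on every write.
filter_output1 = spark.read.parquet("/content/filtered1.parquet")
filter_output1.show()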





+----------+---------+----------+---+
|First_Name|Last_Name|       DOB|Age|
+----------+---------+----------+---+
|   Michael|  Johnson|1998-05-14| 21|
|   Michael|    Jones|1987-02-15| 29|
|      Sara|      Doe|1985-06-15| 25|
|   Michael|   Garcia|2002-12-05| 28|
|    Olivia|     King|1984-03-26| 21|
|    Sophia|     King|1987-04-12| 26|
|    Olivia|    Allen|1977-04-22| 20|
|    Robert|   Walker|1984-04-09| 24|
|    Olivia|     Hall|1980-01-03| 21|
+----------+---------+----------+---+



In [None]:
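# Convert the CSV file to JSON (the output path below is an assumed example)
df.write.mode("overwrite").json("/content/person_data_json")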

In [None]:
#df.write.parquet("/content/person_data.parquet")

In [None]:
df33=spark.read.parquet("/content/person_data.parquet/part-00000-317c95fe-5c78-4429-bbbf-794e8236335f-c000.snappy.parquet")
df33.collect()


[Row(First_Name='Michael', Last_Name='Garcia', DOB='1990-06-12', Age='48'),
 Row(First_Name='Lucy', Last_Name='Jones', DOB='2018-10-08', Age='31'),
 Row(First_Name='Sara', Last_Name='Williams', DOB='2016-08-24', Age='47'),
 Row(First_Name='Michael', Last_Name='Johnson', DOB='1998-05-14', Age='21'),
 Row(First_Name='Michael', Last_Name='Jones', DOB='1987-02-15', Age='29'),
 Row(First_Name='Sara', Last_Name='Doe', DOB='1985-06-15', Age='25'),
 Row(First_Name='Emily', Last_Name='Johnson', DOB='2014-01-05', Age='33'),
 Row(First_Name='Jane', Last_Name='Williams', DOB='1998-01-28', Age='40'),
 Row(First_Name='Michael', Last_Name='Garcia', DOB='2002-12-05', Age='28'),
 Row(First_Name='Michael', Last_Name='Brown', DOB='2011-09-23', Age='40'),
 Row(First_Name='John', Last_Name='Johnson', DOB='1999-07-18', Age='35'),
 Row(First_Name='Jane', Last_Name='Martinez', DOB='2001-03-06', Age='55'),
 Row(First_Name='Emily', Last_Name='Brown', DOB='1997-10-07', Age='41'),
 Row(First_Name='Emily', Last_Na

In [None]:
df31=spark.read.parquet("/content/person_data.parquet/part-00000-317c95fe-5c78-4429-bbbf-794e8236335f-c000.snappy")
df31.collect()

AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/content/person_data.parquet/part-00000-317c95fe-5c78-4429-bbbf-794e8236335f-c000.snappy.

In [None]:
df4.read.parquet("/content/emp1.parqu

In [None]:
What is the primary storage format used in Databricks for structured data?

-> csv
-> parquet
-> json
->

In [None]:
Which PySpark method is used to read a CSV file with headers into a DataFrame?
a) spark.read.csv("file.csv", header=True)
b) spark.read.format("csv").option("header", "true").load("file.csv")
c) spark.read.option("header", "true").csv("file.csv")
d) All of the above

In [None]:
2. What is the correct way to register a DataFrame as a temporary view in PySpark?

a) df.createTempView("my_view")
b) df.createOrReplaceTempView("my_view")
c) df.registerTempTable("my_view")
d) Both a and b are correct

In [None]:
3. Which Spark SQL function would you use to parse a JSON string column?
a) json_parse()
b) from_json()
c) parse_json()
d) json_extract()

In [None]:
4. In PySpark, how do you select specific columns from a DataFrame?
a) df.select("col1", "col2")
b) df.select(df.col1, df.col2)
c) df.select(F.col("col1"), F.col("col2"))
d) All of the above

In [None]:
What does the following code do?

df.write.mode("overwrite").partitionBy("year").parquet("/path/to/table")

a) Writes data in append mode with year partitioning
b) Writes data in overwrite mode with year partitioning
c) Creates a new table partitioned by year
d) Updates existing partitions for the specified year


In [None]:
Which SQL command creates a managed table in Databricks?
a) CREATE TABLE table_name USING DELTA LOCATION '/path'
b) CREATE TABLE table_name (col1 STRING, col2 INT) USING DELTA
c) CREATE OR REPLACE TABLE table_name AS SELECT * FROM source
d) Both b and c are correct

In [None]:
8. Which PySpark transformation is used to remove duplicate rows?
a) df.distinct()
b) df.dropDuplicates()
c) df.drop_duplicates()
d) Both a and b are correct

In [None]:
1. Which PySpark method is used to filter rows in a DataFrame?
a) df.filter(col("age") > 25)
b) df.where(col("age") > 25)
c) df.filter("age > 25")
d) All of the above

In [None]:
What does the following code do?

from pyspark.sql.functions import col, when
df.withColumn("status", when(col("age") >= 18, "Adult").otherwise("Minor"))

a) Filters rows where age >= 18
b) Creates a new column "status" with conditional values
c) Updates the "age" column with status values
d) Groups data by age ranges

In [None]:
What does the collect() action do in PySpark?
a) Collects all rows from the DataFrame to the driver
b) Groups rows by a specified column
c) Counts the number of rows in the DataFrame
d) Removes duplicate rows from the DataFrame

Note: collect() returns a Python list of Row objects.


In [None]:
8. In Spark SQL, which function is used to get the current date?
a) current_date()
b) now()
c) today()
d) getdate()

In [None]:
What is the correct way to add a new column with a constant value to a DataFrame?
a) df.withColumn("new_col", "constant_value")
b) df.withColumn("new_col", lit("constant_value"))
c) df.select("*", lit("constant_value").alias("new_col"))
d) Both b and c are correct

In [None]:
Which window function would you use to assign row numbers to each row within a partition?
a) rank()
b) dense_rank()
c) row_number()
d) ntile()

In [None]:
What does the following code do?

from pyspark.sql.functions import avg, count
df.groupBy("department").agg(
    count("*").alias("employee_count"),
    avg("salary").alias("avg_salary")
)

a) Filters employees by department
b) Groups by department and calculates count and average salary
c) Sorts employees by department and salary
d) Creates a pivot table

In [None]:
df=spark.read.format("csv").option("header", True).load("/content/sample_data.csv")
df.show()


+---+-------+---+---------+
| id|   name|age|     city|
+---+-------+---+---------+
|  1|  Alice| 25|Hyderabad|
|  2|    Bob| 30|Bangalore|
|  3|Charlie| 28|  Chennai|
+---+-------+---+---------+



In [None]:
df=spark.read.format("csv").option("header", True).option("inferSchema",True).load("/content/sample_data.csv")
df.show()

+---+-------+---+---------+
| id|   name|age|     city|
+---+-------+---+---------+
|  1|  Alice| 25|Hyderabad|
|  2|    Bob| 30|Bangalore|
|  3|Charlie| 28|  Chennai|
+---+-------+---+---------+



In [None]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)



In [None]:
df.write.parquet("/content/sample_data.parquet")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=spark.read.format("parquet").load("/content/sample_data.parquet")
df.show()

+---+-------+---+---------+
| id|   name|age|     city|
+---+-------+---+---------+
|  1|  Alice| 25|Hyderabad|
|  2|    Bob| 30|Bangalore|
|  3|Charlie| 28|  Chennai|
+---+-------+---+---------+



In [None]:
df=spark.read.format("json").load("/content/sample_data.json")
df.show()

+---+---------+---+-------+
|age|     city| id|   name|
+---+---------+---+-------+
| 25|Hyderabad|  1|  Alice|
| 30|Bangalore|  2|    Bob|
| 28|  Chennai|  3|Charlie|
+---+---------+---+-------+



In [None]:
df=spark.read.json("/content/sample_data.json")
df.show()

+---+---------+---+-------+
|age|     city| id|   name|
+---+---------+---+-------+
| 25|Hyderabad|  1|  Alice|
| 30|Bangalore|  2|    Bob|
| 28|  Chennai|  3|Charlie|
+---+---------+---+-------+



In [None]:
df.write.csv("/content/sample_data1.csv")