# PySpark ETL with CSV, JSON, Parquet, and Text Files Using Spark DataFrames

[1. Read CSV file](#read-csv-file) 

[2. Read JSON file](#read-json-file)

[3. Read Parquet file](#read-parquet-file)

[4. Read Text file](#read-text-file)

[5. Create Temp Table](#create-temp-table)

[6. Create JSON from CSV df](#create-json-from-csv)

[7. Create CSV from Parquet df](#create-csv-from-parquet)

[8. Create Parquet from JSON df](#create-parquet-from-json)


### Import required libraries

In [5]:
# Load required Libraries
from pyspark.sql import SparkSession

In [4]:
pip install pyspark

Defaulting to user installation because normal site-packages is not writeable
Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)

In [None]:
# Start Spark session

spark = SparkSession.builder.appName("Extract").getOrCreate()
# SparkSession already provides .sql(), so reuse it under the name the queries below expect
sqlContext = spark

# Log errors only; suppress warnings and info messages
spark.sparkContext.setLogLevel("ERROR")

<a id="read-csv-file"> </a>

## Read CSV File

In [None]:
# Load CSV file

csv_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("nyc_taxi_zone.csv")

In [None]:
# check df schema

csv_df.printSchema() 

In [None]:
csv_df.show(10)

<a id="read-json-file"> </a>

## Read JSON File

In [None]:
# load JSON file into DF

json_df = spark.read.format("json").option("multiline","true").load("nyc_taxi_zone.json")

In [None]:
# check df schema

json_df.printSchema()

In [None]:
json_df.show(10)
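
By default, Spark's JSON reader expects JSON Lines input, with one complete JSON object per line. The `multiline` option used above is for files where a single JSON document (for example, a pretty-printed array) spans several lines. The sketch below illustrates the difference with plain Python and no Spark session. The sample records, file names, and the `parses_line_by_line` helper are hypothetical and are not taken from `nyc_taxi_zone.json`.

```python
import json
import os
import tempfile

# Hypothetical sample records; the real nyc_taxi_zone.json may use different fields.
records = [
    {"LocationID": 1, "Zone": "Newark Airport"},
    {"LocationID": 2, "Zone": "Jamaica Bay"},
]


def parses_line_by_line(path):
    """Return True if every non-empty line of the file is a complete JSON value."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True


workdir = tempfile.mkdtemp()

# Layout 1: JSON Lines, one object per line.
jsonl_path = os.path.join(workdir, "zones.jsonl")
with open(jsonl_path, "w", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

# Layout 2: a single pretty-printed array spanning several lines.
multiline_path = os.path.join(workdir, "zones_multiline.json")
with open(multiline_path, "w", encoding="utf-8") as handle:
    json.dump(records, handle, indent=2)

jsonl_ok = parses_line_by_line(jsonl_path)
multiline_ok = parses_line_by_line(multiline_path)
print(jsonl_ok, multiline_ok)
```

If a file passes this line-by-line check, it is in JSON Lines layout and the `multiline` option is likely unnecessary. If it fails, as the pretty-printed layout does here, reading it with `multiline` set to `true` is the usual choice.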

<a id="read-parquet-file"> </a>

## Read Parquet File

In [None]:
# load parquet file

parquet_df = spark.read.format("parquet").load("yellow_tripdata_2024-01.parquet")

In [None]:
# check df schema

parquet_df.printSchema()

In [None]:
parquet_df.show(10)

In [None]:
parquet_df.count()

<a id="read-text-file"> </a>

## Read Text File

In [None]:
# load text file

text_df = spark.read.text("sample.txt")

In [None]:
text_df.show()

In [None]:
text_df = spark.read.option("lineSep",",").text("sample.txt")

In [None]:
text_df.show()

<a id="create-temp-table"> </a>

## Create Temp Table

In [None]:
csv_df.createOrReplaceTempView("tempCSV")

In [None]:
json_df.createOrReplaceTempView("tempJSON")

In [None]:
parquet_df.createOrReplaceTempView("tempParquet")

In [None]:
text_df.createOrReplaceTempView("tempText")

In [None]:
sqlContext.sql("select * from tempCSV LIMIT 10").show()

In [None]:
sqlContext.sql("select * from tempJSON limit 10").show()

In [None]:
sqlContext.sql("select count(*) as count from tempParquet").show()

<a id="create-json-from-csv"> </a>

## Create JSON file from CSV df

In [None]:
csv_df.write.format("json").save("jsondata",mode='append')
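
Note that `save("jsondata")` creates a directory of `part-*` files, not a single file, and `mode='append'` adds new part files on every run, so re-running the cell duplicates the rows. As a rough way to check for that without Spark, the sketch below counts JSON Lines records across the part files of an output directory. The `count_json_records` helper and the demo directory are hypothetical, and the part file names are simplified stand-ins for the longer names Spark generates.

```python
import glob
import json
import os
import tempfile


def count_json_records(output_dir):
    """Count JSON Lines records across all part files in a Spark output directory."""
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total


# Stand-in for a directory written by Spark: two part files, as if the cell ran twice.
demo_dir = tempfile.mkdtemp()
for run in range(2):
    part_path = os.path.join(demo_dir, f"part-0000{run}-demo.json")
    with open(part_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"LocationID": 1}) + "\n")
        handle.write(json.dumps({"LocationID": 2}) + "\n")

record_count = count_json_records(demo_dir)
print(record_count)
```

Pointing the helper at the real `jsondata` directory and comparing the result with `csv_df.count()` would show whether an earlier run has already appended the same rows. If duplicates are not wanted, `mode='overwrite'` replaces the directory contents instead.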

<a id="create-csv-from-parquet"> </a>

## Create CSV file from Parquet df

In [None]:
parquet_df.write.format("csv").option("header","true").save("csvdata",mode='append')

<a id="create-parquet-from-json"> </a>

## Create Parquet file from JSON df

In [None]:
json_df.write.format("parquet").option("compression","snappy").save("parquetdata",mode='append')

### Stopping Spark

In [None]:
spark.stop()