* How to create a CSV file with malformed records
* How to read it in PySpark
* How to handle malformed rows using different modes (PERMISSIVE, DROPMALFORMED, FAILFAST)

In [9]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

25/11/15 11:48:11 WARN Utils: Your hostname, user-HP-Pavilion-x360-Convertible-14-dh0xxx resolves to a loopback address: 127.0.1.1; using 192.168.1.24 instead (on interface wlo1)
25/11/15 11:48:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/15 11:48:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


There are three modes:  
* PERMISSIVE 
* DROPMALFORMED 
* FAILFAST

* **PERMISSIVE:** The default mode. Every record is kept. Fields that cannot be parsed are set to NULL, and the raw text of the malformed record is stored in a string column (_corrupt_record) if the schema defines such a column.
  It is useful in exploratory data analysis and in workflows where the corrupt records are corrected later.
* **DROPMALFORMED:** Records containing malformed data are excluded, so the DataFrame contains only correctly parsed records.
  It is ideal when the malformed records are few in number and can safely be discarded.
* **FAILFAST:** The read throws an exception as soon as the first malformed record is found.
  This is important where data integrity is essential, such as in financial institutions and in machine learning models that use sensitive data.
  Even a single bad record is not acceptable in such cases.
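
The difference between the three modes can be modelled in a few lines of plain Python. This is a simplified sketch, not Spark's parser: parse_rows and SCHEMA are hypothetical names, and the rule used here (a wrong field count or a failed cast makes a row malformed) is an assumption that only approximates Spark's behaviour.

```python
# Simplified model of the three read modes (illustration only, not Spark code).
SCHEMA = [("id", int), ("name", str), ("age", int)]  # hypothetical schema


def parse_rows(lines, mode="PERMISSIVE"):
    """Parse comma-separated lines against SCHEMA according to `mode`."""
    rows = []
    for raw in lines:
        tokens = raw.split(",")
        record = {}
        malformed = len(tokens) != len(SCHEMA)
        for i, (name, cast) in enumerate(SCHEMA):
            try:
                record[name] = cast(tokens[i])
            except (ValueError, IndexError):
                record[name] = None  # a field that fails to parse becomes NULL
                malformed = True
        if malformed and mode == "FAILFAST":
            raise ValueError(f"Malformed record: {raw!r}")
        if malformed and mode == "DROPMALFORMED":
            continue  # skip the bad row
        # PERMISSIVE keeps the row and stores the raw text of bad rows
        record["_corrupt_record"] = raw if malformed else None
        rows.append(record)
    return rows


lines = ["1,John,30", "2,Amanda,twenty", "3,Bob", "4,Chris,40,Extra", "five,Sarah,25"]
permissive = parse_rows(lines, "PERMISSIVE")
dropped = parse_rows(lines, "DROPMALFORMED")
print(len(permissive), len(dropped))
```

In this model PERMISSIVE returns every row, DROPMALFORMED returns only the row that parses cleanly, and FAILFAST stops at the first bad row.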

import os

csv_path = "/home/user/Documents/spark/interview_questions/malformed.csv"

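# Rows written to the file in the next cell, and why each one is malformed.
# The notes stay in comments: anything placed inside the triple-quoted
# string (including a literal \n) would become part of the CSV data.
#   1,John,30          valid
#   2,Amanda,twenty    malformed: age should be an integer
#   3,Bob              malformed: age is missing
#   4,Chris,40,Extra   malformed: extra column
#   five,Sarah,25      malformed: id should be an integer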

In [5]:
# Create malformed CSV content
data = """id,name,age 
1,John,30 
2,Amanda,twenty       
3,Bob 
4,Chris,40,Extra      
five,Sarah,25        
"""
with open(csv_path, "w") as f:
    f.write(data)

print("CSV file created:", csv_path)

CSV file created: /home/user/Documents/spark/interview_questions/malformed.csv
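
Every line in the string above ends with spaces, so the last header field is "age " rather than "age", and values such as "30 " are not clean integers. A minimal sketch of writing the same rows without trailing whitespace follows. The temporary directory and the name clean_path are assumptions for illustration and are not part of the original notebook.

```python
import csv
import os
import tempfile

# Same records as above, held as lists so no stray whitespace can creep in.
rows = [
    ["id", "name", "age"],
    ["1", "John", "30"],
    ["2", "Amanda", "twenty"],      # malformed: age is not an integer
    ["3", "Bob"],                   # malformed: age is missing
    ["4", "Chris", "40", "Extra"],  # malformed: extra column
    ["five", "Sarah", "25"],        # malformed: id is not an integer
]

# Hypothetical output location; the notebook itself writes to csv_path.
clean_path = os.path.join(tempfile.mkdtemp(), "malformed_clean.csv")

with open(clean_path, "w", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

with open(clean_path) as f:
    written = f.read().splitlines()

# Confirm that no line carries trailing whitespace.
print(all(line == line.rstrip() for line in written))
```

Writing from lists instead of a hand-typed string keeps the malformed rows deliberate (wrong types, wrong field counts) and avoids accidental ones caused by invisible spaces.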


In [6]:
with open(csv_path, "r") as f:
    for line in f:
        # Each line already ends with a newline, so suppress print's own
        print(line, end="")

id,name,age 
1,John,30 
2,Amanda,twenty       
3,Bob 
4,Chris,40,Extra      
five,Sarah,25        


In [14]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])


Read the malformed CSV (three modes)

(A) PERMISSIVE mode (default)

* Keeps malformed rows; the raw text of each bad row goes to _corrupt_record.

In [19]:
df_permissive = (spark.read.format("csv")
                 .option("header", "true")
                 .option("mode", "PERMISSIVE")
                 .schema(schema)
                 .load(csv_path))

df_permissive.show(truncate=False)


+----+------+----+
|id  |name  |age |
+----+------+----+
|1   |John  |NULL|
|2   |Amanda|NULL|
|3   |Bob   |NULL|
|4   |Chris |40  |
|NULL|Sarah |NULL|
+----+------+----+



25/11/15 13:27:11 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: id, name, age 
 Schema: id, name, age
Expected: age but found: age 
CSV file: file:///home/user/Documents/spark/interview_questions/malformed.csv


(B) DROPMALFORMED mode

* Drops bad rows entirely.

In [20]:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])


In [21]:
df_drop = (spark.read.format("csv")
           .option("header", "true")
           .option("mode", "DROPMALFORMED")
           .schema(schema)
           .load(csv_path))

df_drop.show()


+---+----+---+
| id|name|age|
+---+----+---+
+---+----+---+



(C) FAILFAST mode

* Raises an exception at the first malformed row.

In [24]:
try: 
    df = (spark.read.format("csv")
               .option("header", True)
               .option("mode", "FAILFAST") 
               .schema(schema) 
               .load(csv_path)
         ) 
    df.show()
except Exception as e:
    print(f"Error : {e}")
    

25/11/15 13:38:58 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: id, name, age 
 Schema: id, name, age
Expected: age but found: age 
CSV file: file:///home/user/Documents/spark/interview_questions/malformed.csv
25/11/15 13:38:58 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
org.apache.spark.SparkException: Encountered error while reading file file:///home/user/Documents/spark/interview_questions/malformed.csv. Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:863)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.B

Error : An error occurred while calling o159.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 10) (user-HP-Pavilion-x360-Convertible-14-dh0xxx.lan executor driver): org.apache.spark.SparkException: Encountered error while reading file file:///home/user/Documents/spark/interview_questions/malformed.csv. Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:863)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buf

In [26]:
filepath = "/home/user/sql_cookbook_pyspark/file1.csv"
# Batch read: show() is not supported on a streaming DataFrame (spark.readStream)
df = (spark.read
      .format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load(filepath)
)
df.show()


                                                                                

+---+---------+---+------------+
| id|     name|age|        city|
+---+---------+---+------------+
|  1|  Grace_1| 30|    New York|
|  2|Charlie_2| 44|    Columbus|
|  3|  Grace_3| 23|   Cleveland|
|  4|Charlie_4| 54|  Pittsburgh|
|  5|  Frank_5| 36|    New York|
|  6|   Sara_6| 58|    New York|
|  7| Naryan_7| 56|    Hartford|
|  8|   Amit_8| 44|      Dallas|
|  9|    Eva_9| 38|    Hartford|
| 10| Grace_10| 23|      Austin|
| 11| David_11| 43|      Albany|
| 12| Helen_12| 50|  Providence|
| 13|Naryan_13| 35|      Albany|
| 14|   Bob_14| 34|     Chicago|
| 15| Alice_15| 25|    Hartford|
| 16| Alice_16| 39|      Austin|
| 17|   Eva_17| 20|Indianapolis|
| 18| David_18| 50|     Newyark|
| 19|   Bob_19| 49|     Chicago|
| 20|  Sara_20| 43| Los Angeles|
+---+---------+---+------------+
only showing top 20 rows

