#Telecom Domain Read & Write Ops Assignment - Building Datalake & Lakehouse
This notebook contains assignments to practice Spark read options and Databricks volumes.
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.


##First Import all required libraries & Create spark session object

###Write SQL statements to create the catalog, schema, and volume:

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign


In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone

In [0]:
%sql
  CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")

###Explain the difference between (Just google and understand why we are going for volume concept for prod ready systems):
We use Unity Catalog volumes in production because they give files (CSV, JSON, logs, and other non-tabular data) governed, access-controlled, and auditable storage under a stable /Volumes/<catalog>/<schema>/<volume>/ path that does not depend on any single cluster or user.

####a. Volume vs DBFS/FileStore

DBFS root and FileStore are workspace-level locations where data is broadly accessible to workspace users, so they suit only development and temporary data. Volumes are production-ready because Unity Catalog adds governance, security, auditing, and scalability on top of cloud object storage.

####b. Why production teams prefer Volumes for regulated data

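Access to regulated files can be granted per volume through Unity Catalog privileges (for example READ VOLUME and WRITE VOLUME) and is recorded in audit logs, which provides the governance, security, and auditing that compliance reviews require.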
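Every path in this notebook follows the volume layout `/Volumes/<catalog>/<schema>/<volume>/...`. A minimal sketch of a helper that builds such paths so the three names are written only once; `volume_path` is a hypothetical name for illustration, not a Databricks API.

```python
import posixpath

# Names taken from the CREATE statements in this notebook.
CATALOG = "telecom_catalog_assign"
SCHEMA = "landing_zone"
VOLUME = "landing_vol"

def volume_path(*parts: str) -> str:
    """Join sub-folders or file names onto the volume root (hypothetical helper)."""
    return posixpath.join("/Volumes", CATALOG, SCHEMA, VOLUME, *parts)

print(volume_path("customer", "customer.csv"))
```

The returned string can be passed wherever the notebook currently spells out the full path.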

####Data files to use in this usecase

In [0]:
customer_csv = """
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,Sneha,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
"""
usage_csv = """customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
"""

tower_logs_region1 = """event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
"""

#2. Filesystem operations

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/")

In [0]:
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv", customer_csv, True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv", usage_csv, True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv", tower_logs_region1, True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/tower_logs_region1.csv", tower_logs_region1, True)
display(dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/"))

####Write a command to validate whether files were successfully copied

In [0]:
display(dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/"))
display(dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/"))
display(dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/"))
display(dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/"))
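The entries returned by `dbutils.fs.ls` expose `name` and `size`, so the visual check above can also be turned into an automatic one. A minimal sketch, assuming only those two attributes: `missing_files` is a hypothetical helper, and `FileEntry` merely stands in for the listing objects so the code runs outside a notebook.

```python
from collections import namedtuple

# Stand-in for the objects returned by dbutils.fs.ls (assumption, used only
# so this sketch is runnable outside Databricks).
FileEntry = namedtuple("FileEntry", ["name", "size"])

def missing_files(listing, expected_names):
    """Return the expected file names that are absent or empty in a listing."""
    present = {entry.name for entry in listing if entry.size > 0}
    return sorted(set(expected_names) - present)

listing = [FileEntry("tower_logs_region1.csv", 150)]
print(missing_files(listing, ["tower_logs_region1.csv"]))  # nothing missing
print(missing_files(listing, ["usage.csv"]))               # usage.csv missing
```

In the notebook, the result of a `dbutils.fs.ls(...)` call on one of the folders above would be passed as `listing`.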


#3. Spark Directory Read Use Cases

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",inferSchema=True,recursiveFileLookup=True,pathGlobFilter="tower*",header=True)
df1.show()
df1.count()


In [0]:
df2=spark.read.csv(path=["/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region*","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/"],recursiveFileLookup=True)
df2.count()

In [0]:
df3=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",inferSchema=True).toDF("cust_id","name","age","city","plan")
display(df3)

In [0]:
df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
display(df5)
df6=spark.read.option("header","true").option("inferSchema","true").option('sep','|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
display(df6)



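###How did schema inference handle “abc” in age?
With inferSchema=True the age column is inferred as string, because “abc” cannot be parsed as a number. With the explicit IntegerType schema below, “abc” is read as null instead.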

In [0]:
from pyspark.sql.types import *
schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("plan", StringType(), True)
])
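# customer.csv has no header row, so header must be false; with header=true the first customer (101) would be consumed as the header.
schema_df=spark.read.schema(schema).option("header","False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")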
schema_df.show()
schema_df.printSchema()
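With an explicit integer type for `age`, nothing is inferred: in the default permissive mode a value that cannot be parsed, such as `abc`, is loaded as null. A plain-Python illustration of that rule (not Spark's implementation; `to_int_or_null` is a hypothetical name):

```python
def to_int_or_null(value: str):
    """Mimic permissive parsing of an integer column: bad input becomes None."""
    try:
        return int(value)
    except ValueError:
        return None

# The age values from customer_csv, including the bad one.
ages = ["31", "45", "29", "52", "27", "abc"]
parsed = [to_int_or_null(a) for a in ages]
print(parsed)  # [31, 45, 29, 52, 27, None]
```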

####Write a note on What changed when we use header or inferSchema with true/false?
**header = true**
- Spark treats the first row as column names
- Columns get meaningful names
- First row is NOT part of data

**header = false (default)**
- Spark treats the first row as data
- Column names become _c0, _c1, _c2...
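The naming rule can be reproduced with the standard `csv` module. This only illustrates the behaviour described above and is not how Spark reads files; `read_rows` is a hypothetical helper.

```python
import csv
import io

def read_rows(text: str, header: bool):
    """Return (column_names, data_rows) the way the header option would."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]  # skip blank lines
    if header:
        return rows[0], rows[1:]          # first row supplies the names
    # Without a header, columns get positional names _c0, _c1, ...
    return [f"_c{i}" for i in range(len(rows[0]))], rows

sample = "id,name\n101,Arun\n102,Meera\n"
print(read_rows(sample, header=True))
print(read_rows(sample, header=False))
```

With `header=False` the line `id,name` is returned as an ordinary data row, which is why a file without a header must not be read with `header=True`.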

####inferSchema = true vs inferSchema = false

**inferSchema = true**

- Spark makes an extra pass over the data and picks a type that fits every value in a column
- Converts columns to types such as:
  - IntegerType
  - DoubleType
  - BooleanType
  - TimestampType
- Enables numeric operations without manual casts

**inferSchema = false (default)**

- All columns are read as STRING
- No automatic type conversion
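A simplified plain-Python illustration of why one stray value matters: a column gets a numeric type only if every value parses as that type, otherwise it stays a string. Spark's real inference covers more types than this; `infer_type` is a hypothetical helper.

```python
def infer_type(values):
    """Pick the narrowest type that fits every value: int, double, else string."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue  # at least one value did not fit, try the next type
    return "string"

print(infer_type(["31", "45", "29"]))    # int
print(infer_type(["1.5", "2"]))          # double
print(infer_type(["31", "45", "abc"]))   # string
```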

##5. Column Renaming Usecases

In [0]:
todf=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",inferSchema=True).toDF("cust_id","name","age","city","plan")
todf.show()
todf.printSchema()

####Apply column names and datatype using the schema function for usage data

In [0]:
# usage.csv is tab-separated and has a header row.
schema_function="customer_id int,voice_mins int,data_mb int,sms_count int"
df7=spark.read.schema(schema_function).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",header=True,sep='\t')
df7.show()
df7.printSchema()

####Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data

In [0]:
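from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
# Tower logs are pipe-separated with a header row; the timestamp column is parsed as TimestampType.
schema = StructType([
    StructField("event_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("tower_id", StringType(), True),
    StructField("signal_strength", IntegerType(), True),
    StructField("timestamp", TimestampType(), True)
])
schema_df=spark.read.schema(schema).option("header","True").option("sep","|").option("timestampFormat","yyyy-MM-dd HH:mm:ss").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
schema_df.show()
schema_df.printSchema()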

##Spark Write Operations using
- csv, json, orc, parquet, delta, saveAsTable, insertInto, xml with different write modes, header and sep options

In [0]:
from pyspark.sql.types import *
schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("plan", StringType(), True)
])
# customer.csv has no header row, so header stays false to keep customer 101.
schema_df=spark.read.schema(schema).option("header","False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

schema_df.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/csv_targetdata",header=True,mode='overwrite')
schema_df.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/json_targetdata",mode='overwrite')
schema_df.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/parquet_targetdata",mode='overwrite')
schema_df.write.format("orc").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata")
schema_df.write.format("delta").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/delta_targetdata",mode='overwrite')
schema_df.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/xml_targetdata",mode='overwrite',rowTag="customer")
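
In [0]:
# insertInto() only writes into an existing table, so saveAsTable() must run first.
schema_df.write.saveAsTable("telecom_catalog_assign.landing_zone.customertbl",mode='overwrite')
schema_df.write.insertInto("telecom_catalog_assign.landing_zone.customertbl",overwrite=True)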

##6. Write Operations (Data Conversion/Schema migration) – CSV Format Usecases

#####Write customer data into CSV format using overwrite mode
- Write usage data into CSV format using append mode
- Write tower data into CSV format with header enabled and custom separator (|)
- Read the tower data in a dataframe and show only 5 rows.
- Download the file from the catalog volume location to local disk and open any of the above files in Notepad++ to see the data.

In [0]:

df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
display(df5)
df5.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/csv_targetdata",header=True,mode='append')



In [0]:
df6=spark.read.option("header","true").option("inferSchema","true").option('sep','|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
display(df6)
df6.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/csv_targetdata_pipe_delimiter",header=True,sep='|',mode='overwrite')

##7. Write Operations (Data Conversion/Schema migration)– JSON Format Usecases
####Write customer data into JSON format using overwrite mode
- Write usage data into JSON format using append mode and snappy compression format
- Write tower data into JSON format using ignore mode and observe the behavior of this mode
- Read the tower data in a dataframe and show only 5 rows.
- Download the file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to see the data.
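The save modes differ only in what happens when the target already exists. A plain-Python sketch of those semantics, using a dict as a stand-in for the target location; `save`, `storage`, and `MODES` are hypothetical names and this is not Spark code.

```python
MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}

def save(storage: dict, path: str, rows: list, mode: str = "error") -> None:
    """Apply Spark-like save-mode semantics to an in-memory 'storage' dict."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if path not in storage or mode == "overwrite":
        storage[path] = list(rows)      # create, or replace existing data
    elif mode == "append":
        storage[path].extend(rows)      # keep old rows, add new ones
    elif mode == "ignore":
        return                          # target exists: silently do nothing
    else:
        raise FileExistsError(f"{path} already exists")

storage = {}
save(storage, "json_out", [1, 2], mode="ignore")  # first write succeeds
save(storage, "json_out", [3], mode="ignore")     # second write is skipped
print(storage)  # {'json_out': [1, 2]}
```

This is the behaviour to look for when the tower write is run twice with ignore mode: the second run leaves the existing files untouched and reports no error.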

In [0]:
df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
display(df5)
df5.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/json_targetdata_new",mode='append',compression='snappy')



In [0]:
df6=spark.read.option("header","true").option("inferSchema","true").option('sep','|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")

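# mode='ignore' writes only if the target does not exist yet; on a re-run the write is skipped without an error.
df6.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/json_targetdata_without_mode_new1",mode='ignore')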

df7=spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/json_targetdata_without_mode_new1")
display(df7)


##8. Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases
#####Write customer data into Parquet format using overwrite mode and in a gzip format
- Write usage data into Parquet format using error mode
- Write tower data into Parquet format with gzip compression option
- Read the usage data in a dataframe and show only 5 rows.
- Download the file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to see the data.

In [0]:

df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
df5.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/parquet_targetdata_new4",mode='error')
df6=spark.read.option("header","true").option("inferSchema","true").option('sep','|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
df6.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/parquet_targetdata_new4",compression='gzip',mode='overwrite')
df7=spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/parquet_targetdata_new4")
display(df7)


##9. Write Operations (Data Conversion/Schema migration) – Orc Format Usecases
#####Write customer data into ORC format using overwrite mode
- Write usage data into ORC format using append mode
- Write tower data into ORC format and see the output file structure
- Read the usage data in a dataframe and show only 5 rows.
- Download the file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to see the data.

In [0]:


df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')

df5.write.format("orc").mode("append").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata_new")

df6=spark.read.option("header","true").option("inferSchema","true").option('sep','|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
df6.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata_new_asitisfile")
df6_tower_read=spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata_new_asitisfile")
df6_tower_read.show()
display(df6_tower_read)


In [0]:
df5.write.format("orc").mode("append").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata_new")
df6=spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/orc_targetdata_new")
df6.show()


##10. Write Operations (Data Conversion/Schema migration) – Delta Format Usecases
#####Write customer data into Delta format using overwrite mode
- Write usage data into Delta format using append mode
- Write tower data into Delta format and see the output file structure
- Read the usage data in a dataframe and show only 5 rows.
- Download the file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to see the data.
- Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.
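The differentiating factor is the `_delta_log` folder: a Delta location holds Parquet data files plus a transaction log, while a plain Parquet location has data files only. A minimal sketch that checks a folder for that log; `is_delta_location` is a hypothetical helper, and the temporary folders below only imitate the two layouts.

```python
import os
import tempfile

def is_delta_location(path: str) -> bool:
    """True if the folder contains a _delta_log directory."""
    return os.path.isdir(os.path.join(path, "_delta_log"))

# Imitate the two layouts in a throwaway directory (illustrative names).
with tempfile.TemporaryDirectory() as root:
    parquet_dir = os.path.join(root, "parquet_targetdata")
    delta_dir = os.path.join(root, "delta_targetdata")
    os.makedirs(parquet_dir)
    os.makedirs(os.path.join(delta_dir, "_delta_log"))
    results = (is_delta_location(parquet_dir), is_delta_location(delta_dir))

print(results)  # (False, True)
```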

In [0]:
df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
df5.show(5)

df5.write.format("delta").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/delta_targetdata_output_append",mode='append')
df5.show(5)

df6_delta=spark.read.option("header","true").option("inferSchema","true").option("sep",'|').csv(path="dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
df6_delta.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/delta_targetdata_output_overwrite")
df6_delta.show()






In [0]:
df6_delta_read=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/delta_targetdata_output_append")

df6_delta_read.show(5)


##11. Write Operations (Lakehouse Usecases) – Delta table Usecases
#####
- Write customer data using saveAsTable() as a managed table
- Write usage data using saveAsTable() with overwrite mode
- Drop the managed table and verify data removal
- Go and check the table overview and realize it is in delta format in the Catalog.
- Use spark.sql() to run some simple queries on the tables created above.

In [0]:
from pyspark.sql.types import *
schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("plan", StringType(), True)
])
# customer.csv has no header row, so header stays false to keep customer 101.
schema_df=spark.read.schema(schema).option("header","False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
schema_df.printSchema()

schema_df.write.saveAsTable("telecom_catalog_assign.landing_zone.customertbl_update")

In [0]:
df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')
df5.write.saveAsTable("telecom_catalog_assign.landing_zone.usagertbl_overwrite",mode='overwrite')
df5.show(5)

In [0]:
%sql
drop table if exists telecom_catalog_assign.landing_zone.usagertbl_overwrite
    


In [0]:
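# After the drop, a SELECT on this table raises an AnalysisException; tableExists() verifies the removal without an exception.
print(spark.catalog.tableExists("telecom_catalog_assign.landing_zone.usagertbl_overwrite"))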

##12. Write Operations (Lakehouse Usecases) – Delta table Usecases
#####Write customer data using insertInto() in a new table and find the behavior
- Write usage data using insertInto() with overwrite mode
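Two behaviours are worth observing here: `insertInto()` expects the target table to exist already, and it matches DataFrame columns to table columns by position rather than by name. A plain-Python illustration of positional matching (`insert_by_position` is a hypothetical helper, not Spark code):

```python
def insert_by_position(table_columns, df_columns, df_row):
    """Map a row onto table columns by position, ignoring the DataFrame's names."""
    if len(table_columns) != len(df_columns):
        raise ValueError("column count mismatch")
    return dict(zip(table_columns, df_row))

table_columns = ["cust_id", "name", "age"]
# The DataFrame has the same columns in a different order, so values land
# in the wrong table columns.
swapped = insert_by_position(table_columns, ["name", "cust_id", "age"], ("Arun", 101, 31))
print(swapped)  # {'cust_id': 'Arun', 'name': 101, 'age': 31}
```

Selecting the DataFrame columns in the table's order before calling `insertInto()` avoids this kind of silent mismatch.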

In [0]:
from pyspark.sql.types import *
schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("plan", StringType(), True)
])
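# customer.csv has no header row, so header stays false to keep customer 101.
schema_df=spark.read.schema(schema).option("header","False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
# Behavior to observe: insertInto() does not create tables, so this raises an AnalysisException unless customertbl_insertinto already exists.
schema_df.write.insertInto("telecom_catalog_assign.landing_zone.customertbl_insertinto")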
spark.sql("select * from telecom_catalog_assign.landing_zone.customertbl_insertinto").display()


In [0]:
df5=spark.read.csv("dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",inferSchema=True,header=True,sep='\t')

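# The table was dropped in section 11 and insertInto() needs an existing table, so recreate it first.
df5.write.saveAsTable("telecom_catalog_assign.landing_zone.usagertbl_overwrite",mode='overwrite')
df5.write.insertInto("telecom_catalog_assign.landing_zone.usagertbl_overwrite",overwrite=True)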
df5.show()



##13. Write Operations (Data Conversion/Schema migration) – XML Format Usecases
- Write customer data into XML format using rowTag as cust
- Write usage data into XML format using overwrite mode with the rowTag as usage
- Download the XML data, open the file in Notepad++, and see what the XML file looks like.

In [0]:
from pyspark.sql.types import *
schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("plan", StringType(), True)
])
# customer.csv has no header row, so header stays false to keep customer 101.
schema_df=spark.read.schema(schema).option("header","False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
# rowTag is "cust", as the exercise asks.
schema_df.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/xml_targetdata_output",rowTag="cust")
spark.read.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/xml_targetdata_output",rowTag="cust").display()

#####Write usage data into XML format using overwrite mode with the rowTag as usage

In [0]:
df5.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/xml_targetdata_output_overwrite_data",rowTag="usage",mode="overwrite")
spark.read.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/serialized_compressed_data_sources/xml_targetdata_output_overwrite_data",rowTag="usage").display()


##14. Compare all the downloaded files (csv, json, orc, parquet, delta and xml)
- Capture the size occupied between all of these file formats and list the formats below based on the order of size from small to big.
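One way to capture the sizes is to sum the bytes of all files under each output folder and sort the totals. A minimal sketch with hypothetical helpers `folder_size_bytes` and `rank_by_size`; the demo folders and byte counts below are arbitrary placeholders, not measurements of the notebook's outputs.

```python
import os
import tempfile

def folder_size_bytes(path: str) -> int:
    """Total size of all files under a folder, including sub-folders."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def rank_by_size(folders: dict) -> list:
    """Return (label, bytes) pairs sorted from smallest to largest."""
    sizes = [(label, folder_size_bytes(p)) for label, p in folders.items()]
    return sorted(sizes, key=lambda pair: pair[1])

# Demo on throwaway folders with made-up sizes.
with tempfile.TemporaryDirectory() as root:
    folders = {}
    for label, n_bytes in (("format_a", 30), ("format_b", 10)):
        folders[label] = os.path.join(root, label)
        os.makedirs(folders[label])
        with open(os.path.join(folders[label], "part-0000"), "wb") as f:
            f.write(b"x" * n_bytes)
    ranking = rank_by_size(folders)

print(ranking)  # [('format_b', 10), ('format_a', 30)]
```

For the real comparison, `folders` would map each format name to its downloaded output folder.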

##2-Advanced-Readops

- Very Important - path:PathOrPaths,schema,sep,header, inferSchema,
- Important - mode,columnNameOfCorruptRecord,quote,escape,
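The three values of `mode` decide what happens to a record that does not fit the schema. A simplified plain-Python sketch of the idea: it nulls the whole bad row, whereas Spark's permissive mode nulls only the fields it cannot parse. `parse_line` and `read_lines` are hypothetical helpers, not Spark code.

```python
def parse_line(line: str):
    """Parse 'id,name' into (int, str); raise ValueError on malformed input."""
    fields = line.split(",")
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields: {line!r}")
    return int(fields[0]), fields[1]

def read_lines(lines, mode="permissive"):
    """Imitate the three modes; rows are (id, name, corrupt_record)."""
    if mode not in ("permissive", "dropmalformed", "failfast"):
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            rows.append((*parse_line(line), None))
        except ValueError:
            if mode == "failfast":
                raise                            # abort on the first bad record
            if mode == "permissive":
                rows.append((None, None, line))  # keep the raw text
            # dropmalformed: skip the record entirely
    return rows

lines = ["1,Arun", "x,Meera", "3,Raj,extra"]
print(read_lines(lines, "permissive"))
print(read_lines(lines, "dropmalformed"))
```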

In [0]:
struct="id int,name string, amount decimal(10,2),pos date,corrupted_record string"

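# The corrupt-record column is only filled when its name matches columnNameOfCorruptRecord (default: _corrupt_record).
df1=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata.txt",mode='permissive',comment="#",columnNameOfCorruptRecord='corrupted_record')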
display(df1)



In [0]:
df2=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata.txt",mode='dropMalformed',comment="#")
display(df2)

In [0]:
struct="id int,name string, amount decimal(10,2),pos date"
df5=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='failFast',comment='#')
display(df5)

In [0]:
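# columnNameOfCorruptRecord only takes effect when the schema contains a string column with that name.
struct_corrupt="id int,name string, amount decimal(10,2),pos date,corrupted_record string"
df3=spark.read.schema(struct_corrupt).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata.txt",mode='permissive',comment="#",columnNameOfCorruptRecord='corrupted_record')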
display(df3)

In [0]:
df3=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='permissive',comment="#",escape="~",ignoreLeadingWhiteSpace=True,ignoreTrailingWhiteSpace=True,multiLine=True,dateFormat='yyyy-dd-MM')
display(df3)


In [0]:
df5=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='failFast')
display(df5)

In [0]:
df4=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='permissive',comment="#",escape="~",ignoreLeadingWhiteSpace=True,ignoreTrailingWhiteSpace=True,multiLine=True,dateFormat='yyyy-dd-MM',nullValue='na')
display(df4)

In [0]:
df5=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='dropMalformed',comment="#",escape="~",ignoreLeadingWhiteSpace=True,ignoreTrailingWhiteSpace=True,multiLine=True,dateFormat='yyyy-dd-MM',nullValue='na',quote="'")
display(df5)

In [0]:
df5=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata1.txt",mode='permissive',comment="#",escape="~",ignoreLeadingWhiteSpace=True,ignoreTrailingWhiteSpace=True,multiLine=True,dateFormat='yyyy-dd-MM',nullValue='na',quote="'",nanValue='nan')
display(df5)

###Schema Evolution

In [0]:
struct="id int,name string, amount decimal(10,2),pos date"
df2=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/sampledate/malformeddata.txt",mode='dropMalformed',comment="#")
display(df2)
df2.write.parquet("/Volumes/we47catalog1/we47db1/we47volume1/mergeschema_output2/malformeddata.parquet")

In [0]:
struct="id int,name string, amount decimal(10,2),pos date,city string"
df5=spark.read.schema(struct).csv("/Volumes/we47catalog1/we47db1/we47volume1/mergeschema/malformeddata1.txt",mode='failFast',comment="#",escape="~",ignoreLeadingWhiteSpace=True,ignoreTrailingWhiteSpace=True,multiLine=True,dateFormat='yyyy-dd-MM',nullValue='na',quote="'",nanValue='nan')
display(df5)
df5.write.format("parquet").save("/Volumes/we47catalog1/we47db1/we47volume1/mergeschema_output2/malformeddata.parquet",mode='append')

In [0]:
# The Parquet reader option is spelled mergeSchema; merge_schema is not recognized.
merge_schema = spark.read.parquet(
    "/Volumes/we47catalog1/we47db1/we47volume1/mergeschema_output2/malformeddata.parquet",mergeSchema=True)
display(merge_schema)
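When the two Parquet writes above are read together with schema merging, the result carries the union of both column sets, and rows from the file that lacks a column show null there. A plain-Python sketch of that union; `merge_schemas` and `align` are hypothetical helpers, the sample row is made up, and Spark additionally reconciles data types, which is omitted here.

```python
def merge_schemas(*schemas):
    """Union column lists, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged

def align(row: dict, merged_columns):
    """Fill the columns a row does not have with None."""
    return {c: row.get(c) for c in merged_columns}

first = ["id", "name", "amount", "pos"]            # schema of the first write
second = ["id", "name", "amount", "pos", "city"]   # second write adds city
merged = merge_schemas(first, second)
print(merged)
# A made-up row from the first file: city is missing, so it becomes None.
old_row = align({"id": 1, "name": "a", "amount": 10.0, "pos": "2024-01-01"}, merged)
print(old_row)
```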