#Telecom Domain Write Ops Assignment - Building Datalake & Lakehouse
This notebook contains assignments to practice Spark write operations and Databricks volumes. <br>


##First import all required libraries & create the Spark session object

## Spark Write Operations using 
- csv, json, orc, parquet, delta, saveAsTable, insertInto, xml with different write mode, header and sep options
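
The formats, modes and options listed above all go through the same `DataFrameWriter` chain (`format`, `mode`, `options`, `save`). The following is a minimal sketch of that chain, assuming a DataFrame already exists; the helper name `write_dataframe` is hypothetical and only there for illustration.

```python
def write_dataframe(df, fmt, path, mode="errorifexists", **options):
    """Write df to path in the given format (hypothetical helper, not part of the assignment).

    fmt     : "csv", "json", "orc", "parquet", "delta", ...
    mode    : "append", "overwrite", "ignore", "error" or "errorifexists"
    options : format-specific writer options such as header="true" or sep="|"
    """
    writer = df.write.format(fmt).mode(mode)
    if options:
        # options() takes keyword arguments and applies them all at once
        writer = writer.options(**options)
    writer.save(path)
```

In a notebook cell this could be called as `write_dataframe(tower_final_df, "csv", target_folder, mode="overwrite", header="true", sep="|")`, where `target_folder` stands for any volume path you choose.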

##6. Write Operations (Data Conversion/Schema migration) – CSV Format Usecases
1. Write customer data into CSV format using overwrite mode
2. Write usage data into CSV format using append mode
3. Write tower data into CSV format with header enabled and custom separator (|)
4. Read the tower data in a dataframe and show only 5 rows.
5. Download any of the above files from the catalog volume location to your local machine and open it in Notepad++ to inspect the data.

In [0]:
from pyspark.sql.functions import lower,col
#Write customer data into CSV format using overwrite mode
#Simulating a different dataset 
customer_data = '''
custid,name,age,city,plan
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_src_data/cust.csv", customer_data,overwrite = True)

df_cust = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_src_data/cust.csv", header = True)

df_cust = df_cust.withColumn("plan", lower(col("plan")))
df_cust.show()

df_cust.write.mode('overwrite').option('header','true').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_csv_op/")


In [0]:
#Moving the already existing usage file into usage_tgt, to see how append works
dbutils.fs.mv('/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_csv.csv', '/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/')

#Write usage data into CSV format using append mode
#Simulating new data to append with the existing one
usage_data = '''customer_id\tvoice_mins\tdata_mb\tsms_count
106\t210\t1800\t18
107\t75\t900\t7
108\t600\t5200\t65
109\t30\t150\t1
110\t420\t2600\t34
111\t95\t1100\t9
112\t300\t3500\t22
113\t15\t50\t0
114\t510\t4800\t58
115\t180\t1400\t12
116\t0\t300\t0
117\t260\t2200\t19
118\t360\t4100\t28
119\t60\t800\t4
120\t700\t6000\t72
'''
#saving the above data in a source folder (To read this into a dataframe and write it to tgt with append mode)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_append_csv.csv", usage_data,overwrite = True)

#Seems the move didn't work. So I'm again writing the original data into source

usage_data = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_csv.csv", usage_data,overwrite = True)

#Now I'm just reading the original data from file usage_csv.csv
usage_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_csv.csv",header = True, sep = '\t')

#I'm writing that to tgt - I deleted the moved file from target folder from UI
usage_df.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/",sep='\t',header=True,mode='overwrite')

#I'm reading and writing the new data simulated to append with old data
usage_df_append = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_append_csv.csv",header = True, sep = '\t' )
usage_df_append.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/",sep='\t',header=True,mode='append')


In [0]:
usage_final_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/",header = True, sep = '\t')
#Appended new and old data successfully
display(usage_final_df)


In [0]:
#Write tower data into CSV format with header enabled and custom separator (|)
tower_all_data = '''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5001|101|TWR01|-80|region1|ericsson|2025-01-10 10:21:54
5002|104|TWR05|-75|region1|ericsson|2025-01-10 11:01:12
5003|106|TWR06|-45|region1|nokia|2025-01-10 10:21:54
5004|107|TWR07|-55|region1|nokia|2025-01-10 11:01:12
5005|108|TWR08|-66|region1|huawei|2025-01-13 10:21:54
5006|109|TWR09|-76|region1|huawei|2025-01-10 11:01:12
5007|111|TWR10|-10|region2|ericsson|2025-01-19 10:21:54
5008|112|TWR11|-73|region2|ericsson|2025-01-18 11:01:12
5009|113|TWR16|-80|region2|nokia|2025-01-20 10:21:54
5010|117|TWR15|-75|region2|nokia|2025-01-28 11:01:12
5011|118|TWR06|-10|region2|huawei|2025-01-20 10:21:54
5012|119|TWR05|-15|region2|huawei|2025-01-10 11:01:12'''

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_src/tower_all.csv", tower_all_data,overwrite = True)

tower_final_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_src/tower_all.csv",header = True, sep = '|')

tower_final_df.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/",sep='|',header=True,mode='overwrite')

In [0]:
#Read the tower data in a dataframe and show only 5 rows.

df_tower_read = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/",sep='|',header=True)
df_tower_read.show(5)

##7. Write Operations (Data Conversion/Schema migration)– JSON Format Usecases
1. Write customer data into JSON format using overwrite mode
2. Write usage data into JSON format using append mode and snappy compression format
3. Write tower data into JSON format using ignore mode and observe the behavior of this mode
4. Read the tower data in a dataframe and show only 5 rows.
5. Download any of the above files from the catalog volume location to your local hard disk and open it in Notepad++ to inspect the data.

In [0]:
#Write customer data into JSON format using overwrite mode

cust_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/", header = True)
display(cust_df)
cust_df.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/jsonout", mode = "overwrite")

In [0]:
#Write usage data into JSON format using append mode and snappy compression format
usage_final_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/",header = True, sep = '\t')
#display(usage_final_df)
usage_final_df.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/jsonout/", mode = "append", compression = 'snappy')



In [0]:
#Write tower data into JSON format using ignore mode and observe the behavior of this mode
df_tower_read_ig = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/",sep='|',header=True)
df_tower_read_ig.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/ignore_mode/", mode = 'ignore')


In [0]:
#Read the tower data in a dataframe and show only 5 rows.
#show() returns None, so keep the DataFrame in its own variable before calling it
df_tow = spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/ignore_mode/")
df_tow.show(5)
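
To make the "observe the behavior" part of the ignore-mode task concrete, the sketch below summarizes what each save mode is expected to do depending on whether the target location already holds data. It is a plain-Python lookup rather than a Spark call, and the function name `save_mode_outcome` is hypothetical.

```python
def save_mode_outcome(mode, target_exists):
    """Describe the expected effect of a Spark save mode (illustrative lookup only)."""
    mode = mode.lower()
    if mode not in ("append", "overwrite", "ignore", "error", "errorifexists"):
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        # With no existing data, every mode simply writes the DataFrame
        return "write"
    return {
        "append": "add new files next to the existing ones",
        "overwrite": "replace the existing data",
        "ignore": "skip the write silently",
        "error": "fail with an error",
        "errorifexists": "fail with an error",
    }[mode]
```

Under these rules, the first run of the ignore-mode cell writes the JSON files because the `ignore_mode/` folder is empty, and a second run leaves the folder untouched without raising an error.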

##8. Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases
1. Write customer data into Parquet format using overwrite mode and in a gzip format
2. Write usage data into Parquet format using error mode
3. Write tower data into Parquet format with gzip compression option
4. Read the usage data in a dataframe and show only 5 rows.
5. Download any of the above files from the catalog volume location to your local hard disk and open it in Notepad++ to inspect the data.

In [0]:
#Write customer data into Parquet format using overwrite mode and in a gzip format
cust_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/", header = True)
cust_data.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_parquet/", mode = "overwrite", compression = "gzip")
#Write usage data into Parquet format using error mode
usage_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/*.csv", header=True, sep = '\t')
display(usage_data)
usage_data.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_pqt", mode = "error")

In [0]:
usage_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/*.csv", header=True, sep = '\t')
display(usage_data)

In [0]:
#Read the usage data in a dataframe and show only 5 rows.
#show() returns None, so keep the DataFrame in its own variable before calling it
usage_pqt_df = spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_pqt/")
usage_pqt_df.show(5)

In [0]:
#Write tower data into Parquet format with gzip compression option
tower_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_src/tower_all.csv", header = True, sep = '|')
tower_data.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/tower_pqt", compression = 'gzip')

In [0]:
tow_data = spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/tower_pqt")
display(tow_data)

##9. Write Operations (Data Conversion/Schema migration) – Orc Format Usecases
1. Write customer data into ORC format using overwrite mode
2. Write usage data into ORC format using append mode
3. Write tower data into ORC format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download any of the above files from the catalog volume location to your local hard disk and open it in Notepad++ to inspect the data.

In [0]:
#Write customer data into ORC format using overwrite mode
cust_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/", header = True)
cust_data.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_orc/", mode = "overwrite")

In [0]:
#Write usage data into ORC format using append mode
#First write the original usage data with overwrite mode so the target starts clean
usage_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_csv.csv", header=True, sep = '\t')
usage_data.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_orc/", mode = "overwrite")
usage_data_orc = spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_orc/")
display(usage_data_orc)
#Then append the newly simulated usage data to the same ORC location
usage_append_data = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_append_csv.csv", header=True, sep = '\t')
usage_append_data.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_orc/", mode = "append")
#Reading the location again shows both the original and the appended rows
usage_data_orc_append = spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_orc/")
display(usage_data_orc_append)


In [0]:
#Write tower data into ORC format and see the output file structure

#Read the whole JSON output folder instead of one part file, because the generated part file name changes on every write
tower_json_read = spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/ignore_mode/")
tower_json_read.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/tower_orc/")

In [0]:
#Read the usage data in a dataframe and show only 5 rows.
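#Read back from the ORC location so the rows shown come from the ORC files, not from the CSV source
spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tgt/usage_orc/").show(5)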

##10. Write Operations (Data Conversion/Schema migration) – Delta Format Usecases
1. Write customer data into Delta format using overwrite mode
2. Write usage data into Delta format using append mode
3. Write tower data into Delta format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download any of the above files from the catalog volume location to your local hard disk and open it in Notepad++ to inspect the data.
6. Compare the Parquet location with the Delta location and work out what differentiates them, since both store their data as Parquet files.

In [0]:
#Write customer data into Delta format using overwrite mode

delta_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_src_data/", header = True)
delta_df.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_delta/")

In [0]:
#Simulating new customer data to append to the customer Delta output
cust_append = '''
custid,name,age,city,plan
107,Karthik,38,Coimbatore,POSTPAID
108,Divya,25,,PREPAID
109,Ramesh,60,Madurai,POSTPAID
110,Anitha,34,Bangalore,PREPAID
'''
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_src_data/cust_append.csv", cust_append,overwrite = True)

df_cust = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_src_data/cust_append.csv", header = True)

In [0]:
df_cust.write.format("delta").mode("append").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust_delta/")

In [0]:
#Write usage data into Delta format using append mode

usage_delta = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_csv.csv", header = True, sep = '\t')
usage_delta.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_delta_output/")
usage_delta = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_src/usage_append_csv.csv", header = True, sep = '\t')
usage_delta.write.format("delta").mode("append").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_delta_output/")


In [0]:
#Write tower data into Delta format and see the output file structure

tower_df = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_src/", header = True, sep = '|')
tower_df.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower_all_data/tower_all_data_tgt/tower_delta_output/")

In [0]:
#Read the usage data in a dataframe and show only 5 rows.
spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_delta_output/").show(5)
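
For the Parquet versus Delta comparison in task 6, one way to see the difference is to list each output folder and check for a `_delta_log` subfolder, which is where Delta keeps its transaction log. The sketch below uses only the standard library and assumes the volume paths are reachable as ordinary local paths from the notebook; the function name `describe_output_dir` is hypothetical.

```python
import os

def describe_output_dir(path):
    """Summarize an output folder: Parquet data files and whether a Delta log exists."""
    entries = sorted(os.listdir(path))
    return {
        # A Delta location carries a _delta_log folder with its commit history
        "is_delta": os.path.isdir(os.path.join(path, "_delta_log")),
        # Both plain Parquet and Delta outputs store rows in .parquet files
        "parquet_files": [e for e in entries if e.endswith(".parquet")],
        # Everything else: marker files, the log folder, and so on
        "other_entries": [e for e in entries if not e.endswith(".parquet")],
    }
```

Calling it once on the `customer_parquet/` folder and once on the `cust_delta/` folder written above should show Parquet data files in both, with `is_delta` true only for the Delta location.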

##11. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using saveAsTable() as a managed table
2. Write usage data using saveAsTable() with overwrite mode
3. Drop the managed table and verify data removal
4. Check the table overview in the Catalog and confirm that the table is stored in Delta format.
5. Use spark.sql to write some simple queries on the tables created above.
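
The tasks above can be sketched as a few small helpers. This is only an outline under assumptions: the function names are hypothetical, the table name is whatever you choose (for example a three-part name under the `telecom_catalog_assign.landing_zone` schema), and the query assumes the customer table has a `plan` column as in the CSV data.

```python
def save_as_managed_table(df, table_name, mode="overwrite"):
    """Write df as a managed table; no path is given, so the catalog owns the data."""
    df.write.mode(mode).saveAsTable(table_name)

def drop_and_verify(spark, table_name):
    """Drop the table and return True if the catalog no longer lists it."""
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    return not spark.catalog.tableExists(table_name)

def customers_per_plan(spark, table_name):
    """Simple example query: number of customers for each plan value."""
    return spark.sql(
        f"SELECT plan, COUNT(*) AS customers FROM {table_name} GROUP BY plan"
    )
```

Because a managed table's files belong to the catalog, dropping it is expected to remove the data as well as the metadata, which is what task 3 asks you to verify.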


##12. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using insertInto() in a new table and find the behavior
2. Write usage data using insertInto() with overwrite mode
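
Two behaviors are worth keeping in mind for these tasks: `insertInto()` expects the target table to exist already, and it matches columns by position rather than by name. The sketch below, with the hypothetical helper name `insert_into_existing`, reorders the DataFrame to the table's column order before inserting; it assumes the DataFrame contains every column of the target table.

```python
def insert_into_existing(spark, df, table_name, overwrite=False):
    """Insert df into an existing table, aligning columns to the table's order first."""
    # insertInto() resolves columns by position, so select them in the table's order
    target_columns = spark.table(table_name).columns
    aligned = df.select(*target_columns)
    # overwrite=True replaces the existing rows instead of adding to them
    aligned.write.insertInto(table_name, overwrite=overwrite)
```

Calling `insertInto()` against a table name that does not exist yet is expected to fail, which is the behavior task 1 asks you to find.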

##13. Write Operations (Data Conversion/Schema migration) – XML Format Usecases
1. Write customer data into XML format using rowTag as cust
2. Write usage data into XML format using overwrite mode with the rowTag as usage
3. Download the XML data, open the file in Notepad++ and see what the XML file looks like.
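
A minimal sketch of the XML write, assuming the cluster has XML support available (built in on recent Databricks runtimes and Spark 4.0, otherwise through the spark-xml package). The helper name `write_xml` is hypothetical and the output path is left for you to choose.

```python
def write_xml(df, path, row_tag, mode="overwrite"):
    """Write df as XML, wrapping each row in an element named row_tag."""
    (
        df.write.format("xml")
        .option("rowTag", row_tag)  # e.g. "cust" for customers, "usage" for usage rows
        .mode(mode)
        .save(path)
    )
```

For the two tasks above this would be called once with `row_tag="cust"` for the customer data and once with `row_tag="usage"` for the usage data.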

##14. Compare all the downloaded files (csv, json, orc, parquet, delta and xml) 
1. Capture the size occupied by each of these file formats and list the formats below in order of size, from smallest to largest.
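
The size comparison can also be done in code instead of by hand. The sketch below uses only the standard library and works on any local folder, so it can be pointed at the downloaded copies; the function names are hypothetical. Note that it counts every file in a folder, including marker files and a Delta location's `_delta_log`.

```python
import os

def dir_size_bytes(path):
    """Total size in bytes of all files under path, including subfolders."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def rank_by_size(paths_by_format):
    """Return (format, size_in_bytes) pairs ordered from smallest to largest."""
    sizes = {fmt: dir_size_bytes(path) for fmt, path in paths_by_format.items()}
    return sorted(sizes.items(), key=lambda item: item[1])
```

Pass it a dictionary that maps each format name (csv, json, orc, parquet, delta, xml) to the folder holding that format's output.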

##15. Do a final exercise of writing a one- or two-line summary of... 
1. When to use/benefits of csv
2. When to use/benefits of json
3. When to use/benefits of orc
4. When to use/benefits of parquet
5. When to use/benefits of delta
6. When to use/benefits of xml
7. When to use/benefits of delta tables
