#Telecom Domain ReadOps Assignment
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://fplogoimages.withfloats.com/actual/68009c3a43430aff8a30419d.png)
![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First Import all required libraries & Create spark session object

##1. Write SQL statements to create:
1. A catalog named telecom_catalog_assign
2. A schema landing_zone
3. A volume landing_vol
4. Using dbutils.fs.mkdirs, create folders:<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
5. Explain the difference between the following (research why Volumes are preferred for production-ready systems):<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data<br>

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign;
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone;
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol;

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")


In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2")

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/ericsson")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/nokia")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/huawei")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/ericsson")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/nokia")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/huawei")

In [0]:
cust_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/"
usage_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/"
tower_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/"
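
The three path variables above end with a trailing slash, so joining them with another `/` in an f-string produces a doubled separator (for example `.../customer//customer_csv.csv`). Whether that is normalized depends on the API being called, so a small join helper can keep paths clean. A minimal sketch; `volume_path` and `VOLUME_ROOT` are hypothetical names introduced here, not Databricks APIs:

```python
import posixpath

# Root of the volume created earlier in this notebook
VOLUME_ROOT = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"

def volume_path(*parts: str) -> str:
    """Join path segments under the volume root without doubled slashes."""
    # Strip leading/trailing slashes from each segment and drop empty ones
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return posixpath.join(VOLUME_ROOT, *cleaned)

print(volume_path("customer/", "customer_csv.csv"))
```

The result can be passed to `dbutils.fs.put` or `spark.read` in place of the f-string concatenation.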

###DBFS (Databricks File System):

DBFS is a filesystem abstraction layer used by Databricks clusters to access data stored in cloud object storage. It provides a Unix-like filesystem interface (dbfs:/) that Spark and other runtimes use to read and write files. DBFS maps filesystem operations to native cloud storage APIs and serves as the technical access layer for file operations. DBFS itself does not provide governance, fine-grained access control, or auditing. It is commonly used for legacy workloads, temporary files, and backward compatibility.

###Volume:

A volume is a Unity Catalog–governed object that represents a logical volume of storage in a cloud object storage location. Volumes are designed to store non-tabular data and provide governance capabilities similar to tables, including fine-grained access control, auditing, and lineage. Volumes are organized under a catalog and schema alongside tables and views. A volume can be either managed or external. Files stored in volumes are accessed using paths under /Volumes/<catalog>/<schema>/<volume>/.

Although volume paths often appear as dbfs:/Volumes/..., this does not mean volumes are DBFS. DBFS is used only as the runtime filesystem interface, while Unity Catalog enforces governance, permissions, and auditing on the volume.

###Key Difference:

DBFS focuses on providing a filesystem interface to interact with cloud object storage, whereas volumes focus on governed, secure, and organized file storage under Unity Catalog. DBFS is the access mechanism used by the runtime, while volumes are the authoritative storage objects that control security and metadata. For new production workloads, Databricks recommends using Unity Catalog volumes instead of DBFS mounts.

###Volume vs DBFS / FileStore

###DBFS / FileStore
DBFS is a filesystem abstraction layer provided by Databricks that allows clusters to interact with cloud object storage using Unix-like paths (dbfs:/, /dbfs/). FileStore is a DBFS-backed location mainly intended for temporary files, UI uploads, and notebooks. DBFS provides convenience and backward compatibility but does not offer governance, fine-grained access control, auditing, or lineage, making it unsuitable for production-grade systems.

###Volumes
Volumes are Unity Catalog–governed storage objects designed for storing non-tabular data in production environments. Volumes provide enterprise-grade features such as fine-grained access control (GRANT/REVOKE), auditing, lineage, and centralized credential management. Volumes are organized under catalogs and schemas, enabling consistent data governance across teams. Although volume paths may appear as dbfs:/Volumes/..., DBFS is only the runtime access layer, while Unity Catalog enforces all security and governance rules.

###Why volumes are preferred for production-ready systems:
Volumes enable secure, auditable, and governed access to files, align with enterprise data governance standards, support multi-team environments, and eliminate credential sprawl, making them the recommended approach for all new production workloads.
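
The access-control point can be made concrete with Unity Catalog privileges on the volume. A minimal sketch, assuming a hypothetical group named `telecom_analysts`; `READ VOLUME` and `WRITE VOLUME` are Unity Catalog volume privileges, and the hypothetical helper `grant_volume_sql` only builds the SQL text so it can be reviewed before being run with `spark.sql(...)`:

```python
# Fully qualified name of the volume created in section 1
VOLUME_NAME = "telecom_catalog_assign.landing_zone.landing_vol"

def grant_volume_sql(privilege: str, principal: str, volume: str = VOLUME_NAME) -> str:
    """Build a GRANT statement for a Unity Catalog volume."""
    allowed = {"READ VOLUME", "WRITE VOLUME"}  # restriction chosen for this sketch
    if privilege not in allowed:
        raise ValueError(f"unsupported volume privilege: {privilege}")
    # Backticks quote the principal, which may contain spaces or '@'
    return f"GRANT {privilege} ON VOLUME {volume} TO `{principal}`"

stmt = grant_volume_sql("READ VOLUME", "telecom_analysts")
print(stmt)
# On Databricks, the statement would be executed with: spark.sql(stmt)
```

The principal also needs `USE CATALOG` and `USE SCHEMA` on the parent objects before the volume grant takes effect.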

##Data files to use in this use case:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''

##2. Filesystem operations
1. Write code to copy the above datasets into your created Volume folders:<br>
Customer → /Volumes/.../customer/<br>
Usage → /Volumes/.../usage/<br>
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/

2. Write a command to validate whether files were successfully copied
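
One way to approach the validation step: on Databricks compute, paths under `/Volumes/...` can generally be read with ordinary Python file APIs, so a recursive listing shows what landed where (`dbutils.fs.ls(path)` is the notebook-native alternative for a single directory). A minimal sketch; `list_files` is a hypothetical helper, and the demo runs against a temporary directory so it also works outside Databricks:

```python
import os
import tempfile

def list_files(root: str) -> list:
    """Return all files under root as sorted paths relative to root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(found)

# Demo on a throwaway directory shaped like one of the tower folders
with tempfile.TemporaryDirectory() as demo_root:
    target = os.path.join(demo_root, "region1", "nokia")
    os.makedirs(target)
    with open(os.path.join(target, "tower_region1_nokia.csv"), "w") as fh:
        fh.write("event_id|customer_id\n5003|106\n")
    demo_listing = list_files(demo_root)

print(demo_listing)
```

In the notebook, calling `list_files` on the tower folder after the copy cells have run would be expected to show six CSV files, one per region/vendor folder.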

In [0]:
customer_data = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''
dbutils.fs.put(f"{cust_path}/customer_csv.csv", customer_data,overwrite = True)

In [0]:
usage_data = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''
dbutils.fs.put(f"{usage_path}/usage_csv.csv", usage_data,overwrite = True)

In [0]:
tower_region1_ericsson_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5001|101|TWR01|-80|region1|ericsson|2025-01-10 10:21:54
5002|104|TWR05|-75|region1|ericsson|2025-01-10 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region1/ericsson/tower_region1_ericsson.csv", tower_region1_ericsson_data,overwrite = True)

In [0]:
tower_region1_nokia_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5003|106|TWR06|-45|region1|nokia|2025-01-10 10:21:54
5004|107|TWR07|-55|region1|nokia|2025-01-10 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region1/nokia/tower_region1_nokia.csv", tower_region1_nokia_data,overwrite = True)

In [0]:
tower_region1_huawei_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5005|108|TWR08|-66|region1|huawei|2025-01-13 10:21:54
5006|109|TWR09|-76|region1|huawei|2025-01-10 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region1/huawei/tower_region1_huawei.csv", tower_region1_huawei_data,overwrite = True)

In [0]:
tower_region2_ericsson_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5007|111|TWR10|-10|region2|ericsson|2025-01-19 10:21:54
5008|112|TWR11|-73|region2|ericsson|2025-01-18 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region2/ericsson/tower_region2_ericsson.csv", tower_region2_ericsson_data,overwrite = True)

tower_region2_nokia_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5009|113|TWR16|-80|region2|nokia|2025-01-20 10:21:54
5010|117|TWR15|-75|region2|nokia|2025-01-28 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region2/nokia/tower_region2_nokia.csv", tower_region2_nokia_data,overwrite = True)

tower_region2_huawei_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5011|118|TWR06|-10|region2|huawei|2025-01-20 10:21:54
5012|119|TWR05|-15|region2|huawei|2025-01-10 11:01:12
'''
dbutils.fs.put(f"{tower_path}/region2/huawei/tower_region2_huawei.csv", tower_region2_huawei_data,overwrite = True)

##3. Directory Read Use Cases
1. Read all tower logs using:<br>
Path glob filter (example: *.csv)<br>
Multiple paths input<br>
Recursive lookup

2. Demonstrate these 3 reads separately:<br>
Using pathGlobFilter<br>
Using list of paths in spark.read.csv([path1, path2])<br>
Using .option("recursiveFileLookup","true")

3. Compare the outputs and understand when each should be used.
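
To compare the three reads, it helps to see which files each DataFrame was actually built from. PySpark's `DataFrame.inputFiles()` returns a best-effort list of those files. A minimal sketch; `summarize_reads` is a hypothetical helper that accepts any object with an `inputFiles()` method, and the `FakeFrame` stand-in with its made-up paths exists only so the example runs without a cluster:

```python
import posixpath

def summarize_reads(reads: dict) -> dict:
    """Map each read's label to the sorted base names of its input files."""
    return {
        label: sorted(posixpath.basename(p) for p in df.inputFiles())
        for label, df in reads.items()
    }

class FakeFrame:
    """Stand-in for a Spark DataFrame; real DataFrames provide inputFiles()."""
    def __init__(self, files):
        self._files = files

    def inputFiles(self):
        return list(self._files)

demo = summarize_reads({
    "two explicit paths": FakeFrame(["/v/region2/nokia/b.csv", "/v/region2/huawei/a.csv"]),
    "recursive, region1": FakeFrame(["/v/region1/nokia/c.csv"]),
})
print(demo)
```

With real DataFrames, pass a dictionary of label to DataFrame once the reads have been executed. As a rule of thumb: an explicit list of paths suits a known, small set of files; `recursiveFileLookup` suits arbitrarily nested folders (it also disables partition inference); `pathGlobFilter` narrows any of these by file name pattern.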

In [0]:
# Read all tower logs recursively, keeping only files whose names match *.csv
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the notebook's existing session if one is active
df_tower_recursive = (
    spark.read
         .format("csv")
         .option("recursiveFileLookup", "true")
         .option("pathGlobFilter", "*.csv")
         .option("header", True)
         .option("sep" , '|')
         .load(tower_path)
)
display(df_tower_recursive)

In [0]:
# Reading from multiple paths
df_tower_multi_path = (
    spark.read
         .csv(
             path=[
                 # File name corrected to match the file written into region2/huawei
                 f"{tower_path}/region2/huawei/tower_region2_huawei.csv",
                 f"{tower_path}/region2/nokia/tower_region2_nokia.csv",
             ],
             header=True,
             inferSchema=True,
             sep="|",
         )
)
display(df_tower_multi_path)

# Reading using the recursive option alone (no glob filter)
df_tower_recursive_alone = (
    spark.read
         .format("csv")
         .option("recursiveFileLookup", "true")
         .option("header", True)
         .option("sep" , '|')
         .load(f"{tower_path}/region1")
)
display(df_tower_recursive_alone)

##4. Schema Inference, Header, and Separator
1. Try reading the Customer and Usage files with option and options, using both read.csv and the format function, with:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on what changes when header or inferSchema is set to true versus false.<br>
3. How did schema inference handle “abc” in the age column?<br>
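
The effect of the two options can be previewed without Spark. The sketch below is a plain-Python analogy, not Spark's implementation: `read_rows` and `infer_type` are hypothetical helpers that mimic treating the first row as a header and typing a column as integer only when every non-empty value parses as one. It suggests what to look for in the Spark output: the customer file has no real header, so `header=true` turns the first customer record into column names, and a single `abc` is enough to make the age column a string.

```python
import csv
import io

CUSTOMER_CSV = """101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
"""

def read_rows(text, header):
    """Return (column_names, data_rows); header=True consumes the first row."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    if header:
        return rows[0], rows[1:]
    # Spark names headerless columns _c0, _c1, ...
    return [f"_c{i}" for i in range(len(rows[0]))], rows

def infer_type(values):
    """'int' if every non-empty value is an integer, else 'string'."""
    non_empty = [v for v in values if v != ""]
    try:
        for v in non_empty:
            int(v)
    except ValueError:
        return "string"
    return "int" if non_empty else "string"

# header=False: every line is data, columns get generated names
names, data = read_rows(CUSTOMER_CSV, header=False)
types = {n: infer_type([row[i] for row in data]) for i, n in enumerate(names)}
print(types)

# header=True on a headerless file: the first record is lost to column names
bad_names, bad_data = read_rows(CUSTOMER_CSV, header=True)
print(bad_names, len(bad_data))
```

Comparing this against `printSchema()` on the DataFrames read with Spark gives a starting point for the written note in item 2.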

In [0]:
df_customer_allfalse = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv", header = False, inferSchema = False)
df_usage_allfalse = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_csv.csv", header = False, inferSchema = False, sep = '\t')
df_customer_alltrue = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv", header = True, inferSchema = True)
df_usage_alltrue = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_csv.csv", header = True, inferSchema = True , sep = '\t')

display(df_customer_allfalse)
display(df_customer_alltrue)
display(df_usage_allfalse)
display(df_usage_alltrue)
df_customer_alltrue.printSchema()

##5. Column Renaming Use Cases
1. Apply column names as strings using the toDF function for the customer data
2. Apply column names and data types using the schema function for the usage data
3. Apply column names and data types using StructType with IntegerType, StringType, TimestampType, and other classes for the tower data

In [0]:
from pyspark.sql.types import StructType, IntegerType, StringType, StructField, TimestampType
df_customer_cols = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv").toDF("cust_id","cust_name","age","city","plan")

schema = "cust_id integer, voice_mins float,data_mb integer, sms_count integer"

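# header=True skips the file's header row so it is not parsed as a data row under the explicit schema
df_usage_cols = spark.read.schema(schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_csv.csv", sep='\t', header=True)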

schema_struct_cust = StructType([StructField("cust_id", IntegerType(), False), StructField("cust_name", StringType(), True), StructField("age", IntegerType(), True), StructField("cust_city", StringType(), True), StructField("cust_plan", StringType(), True)])

df_customer_struct = spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv", schema = schema_struct_cust)

schema_struct_tower = StructType([StructField("event_id", IntegerType(), True), StructField("cust_id", IntegerType(), True), StructField("tower_id", StringType(), True), StructField("sig_strngth", IntegerType(), True), StructField("region", StringType(), True), StructField("vendor", StringType(), True), StructField("signal_ts", TimestampType(), True)])


df_struct_tower = (
    spark.read
         .format("csv")
         .schema(schema_struct_tower)
         .option("recursiveFileLookup", "true")
         .option("pathGlobFilter", "*.csv")
         .option("header", True)
         .option("sep" , '|')
         .load(tower_path)
)



display(df_customer_cols)
display(df_usage_cols)
display(df_customer_struct)
display(df_struct_tower)


## 6. More to come (stay motivated)....