## Data is stored in HDFS

In [1]:
%file

ls /user/majesteye/DS05_INSURANCE_DATASET/input


In [2]:
%spark
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:9000")


In [3]:
val df = spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", ",")
    .load("/user/majesteye/DS05_INSURANCE_DATASET/input/*.csv")

In [4]:
df.printSchema()

## Spark Memory Management

If you are working with large datasets in Spark and run several actions or transformations on the same DataFrame, **cache** it so that Spark reuses the computed data instead of re-reading and recomputing it from the source for every action.

### How to Cache a DataFrame

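You can cache a DataFrame using `.cache()`. For DataFrames this is shorthand for `persist()` with the default storage level, `MEMORY_AND_DISK`: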

```scala
df.cache()  // Mark the DataFrame for caching; it is materialized by the first action
```
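
Caching is lazy, which is easy to miss. The sketch below illustrates this on a small generated DataFrame rather than the insurance data; the names `demo`, `total`, and `evens` and the local `SparkSession` settings are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook's session if one already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-sketch")
  .getOrCreate()

// Hypothetical stand-in for the DataFrame loaded from HDFS.
val demo = spark.range(0, 1000000).toDF("id")

demo.cache()                                   // only marks the plan; nothing is stored yet
val total = demo.count()                       // first action computes the rows and fills the cache
val evens = demo.filter("id % 2 = 0").count()  // later actions can read from the cached data

println(total)                        // 1000000
println(evens)                        // 500000
println(demo.storageLevel.useMemory)  // true: the DataFrame is registered as cached
```

If only one action ever runs on a DataFrame, caching it adds overhead without any benefit, so it is worth reserving for DataFrames that are reused.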

### Persist

`persist()` is more flexible than `cache()` because it lets you specify how and where to store the DataFrame: in memory, on disk, or a combination of both.

```scala
import org.apache.spark.storage.StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
```

- `MEMORY_ONLY`: Stores the DataFrame in memory only. Partitions that do not fit in memory are not cached and are recomputed each time they are needed.
- `MEMORY_AND_DISK`: Stores the DataFrame in memory and spills the partitions that do not fit to disk.
- `DISK_ONLY`: Stores the DataFrame on disk only. This is useful if the data is too large to fit in memory.
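
To make the difference between the levels concrete, the following sketch persists two generated DataFrames at different levels and inspects the flags that Spark records for each. The names `small` and `large` and the local session settings are hypothetical; they are not part of the insurance pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// getOrCreate() reuses the notebook's session if one already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-sketch")
  .getOrCreate()

// Hypothetical DataFrames standing in for a small lookup table and a large fact table.
val small = spark.range(0, 1000).toDF("id")
val large = spark.range(0, 500000).toDF("id")

small.persist(StorageLevel.MEMORY_ONLY)  // fast access, but nothing is spilled to disk
large.persist(StorageLevel.DISK_ONLY)    // keeps executor memory free for other work

// Persisting is lazy, so run an action to materialize both.
small.count()
large.count()

println(small.storageLevel.useMemory)  // true
println(small.storageLevel.useDisk)    // false
println(large.storageLevel.useMemory)  // false
println(large.storageLevel.useDisk)    // true
```

`storageLevel` is a convenient way to confirm which level a DataFrame actually has before relying on it.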

### Unpersist

`unpersist()` releases a cached or persisted DataFrame from memory and disk, freeing those resources once the DataFrame is no longer needed.

```scala
df.unpersist()
```
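
One way to check that the release took effect is to compare the storage level before and after the call. This is a minimal sketch on a generated DataFrame; the name `tmp` and the local session settings are assumptions, and `blocking = true` is used here only so that the check runs after the blocks are removed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// getOrCreate() reuses the notebook's session if one already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("unpersist-sketch")
  .getOrCreate()

// Hypothetical intermediate DataFrame that is only needed for a while.
val tmp = spark.range(0, 100000).toDF("id")

tmp.persist(StorageLevel.MEMORY_AND_DISK)
tmp.count()                                   // materializes the cached data
val levelBefore = tmp.storageLevel            // MEMORY_AND_DISK

tmp.unpersist(blocking = true)                // wait until the cached blocks are removed
val levelAfter = tmp.storageLevel             // StorageLevel.NONE

println(levelBefore == StorageLevel.MEMORY_AND_DISK)  // true
println(levelAfter == StorageLevel.NONE)              // true
```

The DataFrame itself remains usable after `unpersist()`; later actions simply recompute it from the source.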