# Writing Data

Just as there are many ways to read data, there are many ways to write it.

In this notebook, we will take a quick peek at how to write data back out to Parquet files.

**Technical Accomplishments:**
- Writing data to Parquet files

## Getting Started

In [0]:
from pyspark.sql import SparkSession

In [0]:
# Initialize Spark Session
spark = (SparkSession.builder
         .appName("Write Data")
         .getOrCreate())

In [0]:
%run ../DatasetSourcePath

## Writing Data

Let's start with one of our original data sources, the tab-separated file **pageviews_by_second.tsv**:

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

csvSchema = StructType([
  StructField("timestamp", StringType(), False),
  StructField("site", StringType(), False),
  StructField("requests", IntegerType(), False)
])

csvFile = sourcePath + "/dataset/pageviews_by_second.tsv"

csvDF = (spark.read
  .option('header', 'true')
  .option('sep', "\t")
  .schema(csvSchema)
  .csv(csvFile)
)

Now that we have a `DataFrame`, we can write it back out as Parquet files or in various other formats.

In [0]:
# directory_path = sourcePath + "/--your name--/out/pageviews_by_second.parquet"
directory_path = "FileStore/out/pageviews_by_second.parquet"
print("Output location: " + directory_path)

(csvDF.write                       # Our DataFrameWriter
  .option("compression", "snappy") # Codec, e.g. none, snappy, gzip, or lzo
  .mode("overwrite")               # Replace existing files
  .parquet(directory_path)         # Write DataFrame to Parquet files
)

In [0]:
csvDF.count()

We can then read the same Parquet files back in and display the results:

In [0]:
# display(spark.read.parquet(directory_path).limit(10))
display(spark.read.parquet('dbfs:/' + directory_path).limit(10))

Now that the data has been written out, we can list the resulting files in DBFS:

In [0]:
file_list = dbutils.fs.ls(directory_path)
for f in file_list:
  print(f.name)

In [0]:
spark.read.parquet('dbfs:/' + directory_path).count()

In [0]:
# import os

# # List files in the directory
# file_list = os.listdir(directory_path)

# # Display the list of files
# print("\n".join(file_list))

In [0]:
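dirPath = "dbfs:/FileStore/out/sitecount.parquet"
# save() without format() uses the default data source (spark.sql.sources.default),
# which is Delta on recent Databricks runtimes and Parquet in open-source Spark
csvDF.groupBy('site').count().write.mode('overwrite').save(dirPath)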

In [0]:
spark.read.load(dirPath).display()

### Saving Partitioned Data

In [0]:
dirPath = "dbfs:/FileStore/out/pagecount/partition"
csvDF.write.mode('overwrite').partitionBy('site').save(dirPath)

In [0]:
file_list = dbutils.fs.ls(dirPath)
for f in file_list:
  print(f.name)

In [0]:
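# Filter on the partition column rather than loading the partition subdirectory;
# Spark prunes the read to the matching partition
spark.read.format("delta").load(dirPath).where("site = 'desktop'").count()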