## Fix the CSV Data

### The Data Source
The `sale-poc` folder contains sales data for May 2017.

In [0]:
%run ../MountDatasetSample

### Exploring files with dbutils

In [0]:
datapath = "/mnt/files/dataset/sale-poc/"
datapath

In [0]:
%fs ls '/mnt/files/dataset/sale-poc/'

### Exploring files with Spark

In [0]:
df1 = spark.read.csv(datapath+"sale-20170501.csv", header=True)
df1.printSchema()

In [0]:
%fs head '/mnt/files/dataset/sale-poc/sale-20170501.csv'

In [0]:
df2 = spark.read.csv(datapath+"sale-20170502.csv", header=True)
df2.printSchema()

In [0]:
%fs head "/mnt/files/dataset/sale-poc/sale-20170502.csv"

### Handling and fixing poorly formed CSV files

The steps below provide example code for fixing the poorly formed CSV file, `sale-20170502.csv`, which we discovered while exploring the files in the `dataset/sale-poc` folder. This is just one of many ways to "fix" a poorly formed CSV file using Spark.

To "fix" the bad file, we need to take a programmatic approach, using Python to read in the contents of the file and then parse them to put them into the proper shape.

> To handle the data being in a single row, we can use the `textFile()` method of our SparkContext to read the file as a collection of rows into a resilient distributed dataset (RDD). This sidesteps the column-count errors because each line is read as a single string value stored in a single column.

In [0]:
rdd = sc.textFile(datapath + "sale-20170502.csv")

In [0]:
df1 = spark.read.text(datapath + "sale-20170502.csv")
df1.first()

In [0]:
# Since we know there is only one row, grab the first row of the RDD and split it on the field delimiter (comma).
data = rdd.first().split(',')

field_count = len(data)

# Print out the count of fields read into the array.
print(field_count)

By splitting the row on the field delimiter, we created an array of all the individual field values in the file, the count of which you can see above.
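
One caveat with a plain `split(',')` is that it also splits on commas inside quoted values. The notebook does not indicate that this file contains quoted fields, so the sketch below is only a precaution: it uses a made-up line (not data from `sale-poc`) to compare a naive split with Python's standard `csv` module.

```python
import csv

# Hypothetical single-line content; the quoted field contains a comma.
line = 'T1,C9,"Widget, large",2,9.99'

# A naive split treats the comma inside the quotes as a delimiter.
naive = line.split(',')

# csv.reader respects the quoting and keeps the field intact.
parsed = next(csv.reader([line]))

print(len(naive), len(parsed))
print(parsed[2])
```

If both approaches give the same field count on the real data, the simpler `split(',')` used above is sufficient.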

In [0]:
import math

expected_row_count = math.floor(field_count / 11)
print(f'The expected row count is: {expected_row_count}')
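
`math.floor` silently discards any fields left over after the last complete row. As a small sketch, `divmod` returns both the number of complete rows and the remainder, so a non-zero remainder can flag a truncated or otherwise malformed file. The helper name `expected_rows` and the sample counts are illustrative, not part of the original notebook.

```python
COLUMNS_PER_ROW = 11  # column count observed in the other sale-poc files

def expected_rows(field_count, columns=COLUMNS_PER_ROW):
    """Return (complete_rows, leftover_fields) for a flat list of fields."""
    return divmod(field_count, columns)

# 35 is a made-up field count used only to show a non-zero remainder.
rows, leftover = expected_rows(35)
print(f'{rows} complete rows, {leftover} leftover fields')
```

In the notebook, this could be called as `expected_rows(field_count)`; a leftover of 0 means the fields divide evenly into 11-column rows.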

Next, let's create an array to store the data associated with each "row".

> We will set the max_index to the number of columns that are expected in each row. We know from our exploration of other files in the dataset/sale-poc folder that they contain 11 columns, so that is the value we will set.

In addition to setting variables, we will use the cell below to loop through the data array and assign every 11 values to a row. By doing this, we are able to "split" the data that was once a single row into appropriate rows containing the proper data and columns from the file.

In [0]:
# Create a list to store the data associated with each "row". Set max_index to the number of columns in each row.
# This is 11, which we noted above when viewing the schema of the May 1 file.
row_list = []
max_index = 11

# Loop through the values extracted from the single row of the file and build rows consisting of 11 columns.
while max_index <= len(data):
    row = data[max_index-11:max_index]
    row_list.append(row)

    max_index += 11

print(f'The row array contains {len(row_list)} rows. The expected number of rows was {expected_row_count}.')
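
The same row-building logic can be expressed as a small reusable function. The sketch below is an alternative formulation, not the notebook's own code: the name `chunk` and the sample values are hypothetical, and like the loop above it drops any incomplete trailing group.

```python
def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items.

    Any incomplete group at the end of the list is dropped.
    """
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]

# Seven made-up values grouped into rows of three; the seventh value is dropped.
sample = [str(n) for n in range(7)]
rows = chunk(sample, 3)
print(rows)
```

Applied to the notebook's data, `chunk(data, 11)` should produce the same rows as `row_list`.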

Finally, we can use the `row_list` we created above to create a DataFrame. We also pass a `schema` parameter containing the column names we saw in the schema of the first file.

In [0]:
df_fixed = spark.createDataFrame(
    row_list,
    schema=['TransactionId', 'CustomerId', 'ProductId', 'Quantity', 'Price', 'TotalAmount',
            'TransactionDateId', 'ProfitAmount', 'Hour', 'Minute', 'StoreId'],
)

display(df_fixed.limit(10))

### Write the "fixed" file into the data lake

In [0]:
savePath = "FileStore/out/sale-20170502-fixed.csv"
df_fixed.write.format('csv').option('header',True).mode('overwrite').option('sep',',').save(savePath)

In [0]:
%fs ls FileStore/out/sale-20170502-fixed.csv