
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# DataFrames and Transformations Review
## De-Duping Data Lab

In this exercise, we're doing ETL on a file we've received from a customer. That file contains data about people, including:

* first, middle and last names
* gender
* birth date
* Social Security number
* salary

But, as is unfortunately common in data we get from this customer, the file contains some duplicate records. Worse:

* In some of the records, the names are mixed case (e.g., "Carol"), while in others, they are uppercase (e.g., "CAROL").
* The Social Security numbers aren't consistent either. Some of them are hyphenated (e.g., "992-83-4829"), while others are missing hyphens ("992834829").

If all of the name fields match when character case is disregarded, then the birth dates and salaries are guaranteed to match as well,
and the Social Security numbers *would* match if they were put in the same format.

Your job is to remove the duplicate records. The specific requirements of your job are:

* Remove duplicates. It doesn't matter which record you keep; it only matters that you keep one of them.
* Preserve the data format of the columns. For example, if you write the first name column in all lowercase, you haven't met this requirement.
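
The matching rule and the format requirement above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the lab solution: the helper names `dedup_key` and `dedupe` and the sample rows are hypothetical. It compares records on lowercased names and hyphen-free SSNs, keeps the first record seen for each key, and returns the kept records unmodified.

```python
def dedup_key(record):
    """Build a comparison key: case-insensitive names plus a hyphen-free SSN."""
    return (
        record["firstName"].lower(),
        record["middleName"].lower(),
        record["lastName"].lower(),
        record["ssn"].replace("-", ""),
    )


def dedupe(records):
    """Keep the first record seen for each key; kept records are not altered."""
    seen = set()
    kept = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Hypothetical sample rows, made up for illustration only.
people = [
    {"firstName": "Carol", "middleName": "Ann", "lastName": "Smith", "ssn": "992-83-4829"},
    {"firstName": "CAROL", "middleName": "ANN", "lastName": "SMITH", "ssn": "992834829"},
    {"firstName": "Dave", "middleName": "Lee", "lastName": "Jones", "ssn": "911-22-3333"},
]

unique_people = dedupe(people)
print(len(unique_people))
```

Because the normalized values are used only as a key and never written back, the surviving records keep their original case and SSN format.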

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> The initial dataset contains 103,000 records.
The de-duplicated result has 100,000 records.

Next, write the results in **Delta** format as a **single data file** to the directory given by the variable *deltaDestDir*.

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> Remember the relationship between the number of partitions in a DataFrame and the number of files written.
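
As a rough mental model (an assumption for illustration, not a description of Spark's internals), each non-empty partition is written by one task and produces one data file. The plain-Python sketch below simulates that idea; `split_into_partitions` and `files_written` are hypothetical helpers, and the round-robin split merely stands in for whatever partitioning Spark actually applies.

```python
def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin across num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def files_written(partitions):
    """Toy model: one part file per non-empty partition."""
    non_empty = [p for p in partitions if p]
    return [f"part-{i:05d}" for i in range(len(non_empty))]


rows = list(range(100))  # stand-in for DataFrame rows

eight_files = files_written(split_into_partitions(rows, 8))
one_file = files_written(split_into_partitions(rows, 1))
print(len(eight_files), len(one_file))
```

Under this model, reducing a DataFrame to a single partition before writing is what yields a single data file.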

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameWriter</a>

In [0]:
%run ./Includes/Classroom-Setup

It's helpful to look at the file first, so you can check the format. `dbutils.fs.head()` (or just `%fs head`) is a big help here.

In [0]:
%fs head dbfs:/mnt/training/dataframes/people-with-dups.txt
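
Once the first lines of the file are visible, the delimiter and column names can be read off the header. The plain-Python sketch below shows one way to check this programmatically. The `head_text` value is a hypothetical stand-in for the real output of `%fs head` (the column names match the displayed DataFrame and `:` matches the `sep` option used when loading, but the exact raw date format is an assumption), and `guess_delimiter` is a hypothetical helper.

```python
# Hypothetical first lines of the file; the real content may differ.
head_text = (
    "firstName:middleName:lastName:gender:birthDate:salary:ssn\n"
    "Emanuel:Wallace:Panton:M:1988-03-04:101255:935-90-7627\n"
)


def guess_delimiter(text, candidates=(",", ":", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the header line."""
    header = text.splitlines()[0]
    return max(candidates, key=header.count)


delimiter = guess_delimiter(head_text)
columns = head_text.splitlines()[0].split(delimiter)
print(repr(delimiter), columns)
```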

### Loading the data

In [0]:
# TODO

sourceFile = "dbfs:/mnt/training/dataframes/people-with-dups.txt"
# Note: despite the ".parquet" suffix, this directory will hold a Delta table
deltaDestDir = workingDir + "/people.parquet"

# In case it already exists
dbutils.fs.rm(deltaDestDir, True)

# Complete your work here...
spark.conf.set("spark.sql.shuffle.partitions", 8)

df = (spark
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ":")
      .csv(sourceFile)
     )
display(df)

firstName,middleName,lastName,gender,birthDate,salary,ssn
Emanuel,Wallace,Panton,M,1988-03-04T00:00:00.000+0000,101255,935-90-7627
Eloisa,Rubye,Cayouette,F,2000-06-20T00:00:00.000+0000,204031,935-89-9009
Cathi,Svetlana,Prins,F,2012-12-22T00:00:00.000+0000,35895,959-30-7957
Mitchel,Andres,Mozdzierz,M,1966-05-06T00:00:00.000+0000,55108,989-27-8093
Angla,Melba,Hartzheim,F,1938-07-26T00:00:00.000+0000,13199,935-27-4276
Rachel,Marlin,Borremans,F,1923-02-23T00:00:00.000+0000,67070,996-41-8616
Catarina,Phylicia,Dominic,F,1969-09-29T00:00:00.000+0000,201021,999-84-8888
Antione,Randy,Hamacher,M,2004-03-05T00:00:00.000+0000,271486,917-96-3554
Madaline,Shawanda,Piszczek,F,1996-03-17T00:00:00.000+0000,183944,963-87-9974
Luciano,Norbert,Sarcone,M,1962-12-14T00:00:00.000+0000,73069,909-96-1669


### Feature Engineering

In [0]:
# ANSWER
from pyspark.sql.functions import col, lower, translate

dedupedDF = (df
             .select(col("*"),
                     lower(col("firstName")).alias("lcFirstName"),
                     lower(col("lastName")).alias("lcLastName"),
                     lower(col("middleName")).alias("lcMiddleName"),
                     translate(col("ssn"), "-", "").alias("ssnNums")
                     # regexp_replace(col("ssn"), "-", "").alias("ssnNums")  # Alternative: strip the hyphens (requires importing regexp_replace)
                     # regexp_replace(col("ssn"), r"^(\d{3})(\d{2})(\d{4})$", "$1-$2-$3").alias("ssnNums")  # Alternative: add hyphens if missing
                    )
             .dropDuplicates(["lcFirstName", "lcMiddleName", "lcLastName", "ssnNums", "gender", "birthDate", "salary"])
             .drop("lcFirstName", "lcMiddleName", "lcLastName", "ssnNums")
            )
display(dedupedDF)

firstName,middleName,lastName,gender,birthDate,salary,ssn
Aaron,Andrea,Mondloch,M,1937-12-02T00:00:00.000+0000,75637,906-59-7221
Aaron,Jermaine,Resler,M,1982-08-26T00:00:00.000+0000,80253,911-19-1232
Aaron,Brady,Morgans,M,1935-10-25T00:00:00.000+0000,283121,912-45-3172
Aaron,Russ,Kopera,M,2008-05-08T00:00:00.000+0000,272069,914-32-4016
Aaron,Willard,Kolden,M,1978-02-07T00:00:00.000+0000,73609,916-12-3224
Aaron,Lesley,Strnad,M,1977-10-25T00:00:00.000+0000,179723,916-43-4368
Aaron,Jonah,Crnich,M,1980-12-09T00:00:00.000+0000,256775,916-48-8115
Aaron,Noe,Arujo,M,1983-05-18T00:00:00.000+0000,163936,918-18-4042
Aaron,Wyatt,Cubito,M,1949-12-10T00:00:00.000+0000,181455,921-39-4145
Aaron,Abraham,Tatters,M,1931-10-06T00:00:00.000+0000,80020,922-54-5718


### Write to Delta table

In [0]:
# Now, write the results in Delta format as a single file. We'll also display the Delta files to make sure they were written as expected.

(dedupedDF
 .repartition(1)
 .write
 .mode("overwrite")
 .format("delta")
 .save(deltaDestDir)
)

display(dbutils.fs.ls(deltaDestDir))

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/spark_programming/asp_3_4_review/people.parquet/_delta_log/,_delta_log/,0,1658818290000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/spark_programming/asp_3_4_review/people.parquet/part-00000-de2602ed-4a8c-4727-ae5d-1cbe54fe0ad3-c000.snappy.parquet,part-00000-de2602ed-4a8c-4727-ae5d-1cbe54fe0ad3-c000.snappy.parquet,2770567,1658818289000


**CHECK YOUR WORK**

In [0]:
verify_files = dbutils.fs.ls(deltaDestDir)
verify_delta_format = False
verify_num_data_files = 0
for f in verify_files:
    if f.name == '_delta_log/':
        verify_delta_format = True
    elif f.name.endswith('.parquet'):
        verify_num_data_files += 1

assert verify_delta_format, "Data not written in Delta format"
assert verify_num_data_files == 1, "Expected 1 data file written"

verify_record_count = spark.read.format("delta").load(deltaDestDir).count()
assert verify_record_count == 100000, "Expected 100000 records in final result"

del verify_files, verify_delta_format, verify_num_data_files, verify_record_count

## Clean up classroom
Run the cell below to clean up resources.

In [0]:
%run "./Includes/Classroom-Cleanup"

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>