## Problem 03

Weightage: 50

Convert data from json to a Hive table with simple data types. **address** should be stored in 4 columns - city, postal_code, state and street. **phone_numbers** should be of type string and existing numbers should be comma separated.

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements

* Create Database `whoami`_mt02
* Create external table pointing to below location.
```
/user/`whoami`/mock_test_02/problem03/solution
```
* Table Name: **addresses**
* Column Names and Data Types
```
 |-- id: long
 |-- first_name: string
 |-- last_name: string
 |-- gender: string
 |-- email: string
 |-- ip_address: string
 |-- address: string
 |-- phone_numbers: string
```
* Use parquet file format. There should be exactly 4 files under the folder related to table.
* Data should be sorted in ascending order by id.

## Validation

Here are the self validation steps:
* Validate if the table is created or not.
```
import getpass
username = getpass.getuser()
data = spark.read.table(f'{username}_mt02.addresses')
```
* Get Schema by running `data.printSchema()`. Output should be as below. Ignore Nullability if it does not match exactly.
```
root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- email: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- address: string (nullable = true)
 |-- phone_numbers: string (nullable = true)
```
* Get count by running `data.count()`. It should return **1,000,000**.
* Run `data.orderBy('id').show()` to validate the data. Output should be like this.
| id|first_name|   last_name|gender|               email|     ip_address|             address|       phone_numbers|
|---|----------|------------|------|--------------------|---------------|--------------------|--------------------|
|  1|    Corrie|Van den Oord|Female|cvandenoord0@etsy...| 114.69.173.253|4391 Coleman Lane...|        785-310-6676|
|  2|  Nikolaus|     Brewitt|  Male|nbrewitt1@dailyma...| 191.17.232.147|263 Graedel Drive...|609-306-0365,205-...|
|  3|    Orelie|      Penney|Female|openney2@vistapri...| 109.206.143.10|5152 Twin Pines P...|619-727-2916,570-...|
|  4|     Ashby|    Maddocks|  Male|  amaddocks3@home.pl|  171.173.96.24|31696 Longview Wa...|415-708-9669,806-...|
|  5|      Kurt|        Rome|  Male|krome4@shutterfly...| 125.35.144.111|8854 Rusk Street,...|        915-912-2446|
|  6|    Idelle|      Dorsey|Female|idorsey5@artistee...|  89.128.71.151|95353 Carpenter P...|949-257-9443,504-...|
|  7|      Levy|       Pacey|  Male|lpacey6@bloglovin...|   149.60.175.7|2997 Maryland Cir...|202-667-9730,810-...|
|  8|   Hershel|       Kneal|  Male|hkneal7@engadget.com|206.179.142.167|3101 Ilene Plaza,...|719-481-1263,561-...|
|  9|     Kelly|  Gatheridge|Female|kgatheridge8@mysp...|  101.79.38.236|91956 Stone Corne...|        719-867-3789|
| 10|     Aksel|       Ewles|  Male| aewles9@samsung.com|  219.49.91.115|39 Warbler Avenue...|        501-903-4014|
| 11| Millicent|    Whitwell|Female| mwhitwella@army.mil|   46.90.146.88|3598 Carpenter Ci...|716-582-3471,864-...|
| 12|      Levy|    Fennelow|  Male|lfennelowb@so-net...| 102.19.205.231|20 Glacier Hill P...|330-621-5532,562-...|
| 13|     Bucky|       Harle|  Male|   bharlec@europa.eu|113.149.152.231|06 Badeau Alley,M...|        712-111-4246|
| 14|     Randy|   Kleinmann|Female|rkleinmannd@frien...|   5.89.218.201|04 Manitowish Dri...|515-109-8291,408-...|
| 15|   Eveleen|     Lanaway|Female|elanawaye@blinkli...| 219.210.42.139|512 Prairie Rose ...|360-712-7619,239-...|
| 16|  Eleonore|      Cordle|Female|ecordlef@printfri...| 34.134.136.163|2893 Red Cloud Tr...|                    |
| 17|     Monte|     Sidaway|  Male|msidawayg@unicef.org|162.189.175.228|50225 Eagle Crest...|213-965-4880,619-...|
| 18|     Heddi|      Sackes|Female|hsackesh@business...|   59.44.144.36|65403 Hermina Pla...|                    |
| 19|    Tabina|     Olivari|Female|    tolivarii@goo.gl|   91.22.33.111|05 Ryan Trail,Arl...|602-594-0067,803-...|
| 20|Rutherford|   Josephson|  Male|rjosephsonj@sprin...| 93.124.195.224|38907 Sunfield Pa...|979-330-2010,562-...|
* Validate whether the files are of type Parquet or not.
```
import getpass
username = getpass.getuser()
data_p = spark. \
    read. \
    parquet(f'/user/{username}/mock_test_02/problem03/solution')
```
* Run `data_p.orderBy('id').show()` to validate the data. Output should be like above.

In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'Problem 03 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [3]:
spark.sql(f"DROP DATABASE IF EXISTS {username}_mt02 CASCADE")

In [4]:
spark.sql(f"CREATE DATABASE {username}_mt02")

In [5]:
spark.catalog.setCurrentDatabase(f'{username}_mt02')

In [6]:
spark.sql(f'DROP TABLE if exists {username}_mt02.addresses')

In [7]:
from pyspark.sql.functions import expr, concat_ws, col
df=spark.read.json('/public/addresses')
final_df = df.select("id", "first_name", "last_name", "gender", "email", "ip_address",
          expr("concat_ws(' ', address.street, address.state, address.postal_code, address.city)").alias("address"),
          expr("concat_ws(',', phone_numbers)").alias("phone_numbers")). \
    orderBy("id")

In [8]:
final_df.coalesce(4).write.format('parquet'). \
    options(path="/user/itv002480/mock_test_02/problem03/solution").mode("overwrite").saveAsTable("addresses")

In [9]:
spark.catalog.listTables()

[Table(name='addresses', database='itv002480_mt02', description=None, tableType='EXTERNAL', isTemporary=False)]

In [10]:
spark.sql('DESCRIBE FORMATTED addresses').show(100, False)

+----------------------------+----------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                   |comment|
+----------------------------+----------------------------------------------------------------------------+-------+
|id                          |bigint                                                                      |null   |
|first_name                  |string                                                                      |null   |
|last_name                   |string                                                                      |null   |
|gender                      |string                                                                      |null   |
|email                       |string                                                                      |null   |
|ip_address                  |string                                    

In [13]:
import getpass
username = getpass.getuser()
data_p = spark. \
  read. \
  parquet(f'/user/{username}/mock_test_02/problem03/solution')

In [14]:
data_p.orderBy('id').show()

+---+----------+------------+------+--------------------+---------------+--------------------+--------------------+
| id|first_name|   last_name|gender|               email|     ip_address|             address|       phone_numbers|
+---+----------+------------+------+--------------------+---------------+--------------------+--------------------+
|  1|    Corrie|Van den Oord|Female|cvandenoord0@etsy...| 114.69.173.253|4391 Coleman Lane...|        785-310-6676|
|  2|  Nikolaus|     Brewitt|  Male|nbrewitt1@dailyma...| 191.17.232.147|263 Graedel Drive...|609-306-0365,205-...|
|  3|    Orelie|      Penney|Female|openney2@vistapri...| 109.206.143.10|5152 Twin Pines P...|619-727-2916,570-...|
|  4|     Ashby|    Maddocks|  Male|  amaddocks3@home.pl|  171.173.96.24|31696 Longview Wa...|415-708-9669,806-...|
|  5|      Kurt|        Rome|  Male|krome4@shutterfly...| 125.35.144.111|8854 Rusk Street ...|        915-912-2446|
|  6|    Idelle|      Dorsey|Female|idorsey5@artistee...|  89.128.71.151

In [2]:
import getpass
username = getpass.getuser()
data = spark.read.table(f'{username}_mt02.addresses')

In [3]:
data.show()

+------+----------+-----------+------+--------------------+---------------+--------------------+--------------------+
|    id|first_name|  last_name|gender|               email|     ip_address|             address|       phone_numbers|
+------+----------+-----------+------+--------------------+---------------+--------------------+--------------------+
|749687|      Cash|    Zanutti|  Male|czanutti12c6@tele...| 167.239.107.38|29 Ruskin Parkway...|                    |
|749688|    Siward|      Timby|  Male|stimby12c7@accuwe...|   1.193.228.37|837 Fremont Parkw...|        202-266-2715|
|749689|    Andrej|       Leil|  Male|aleil12c8@bigcart...|143.161.109.163|68936 1st Place C...|843-539-4452,915-...|
|749690|    Jeffie|   Sutworth|  Male|jsutworth12c9@ama...|  44.87.175.234|901 Bunting Plaza...|754-951-1231,816-...|
|749691|     Jorry|       Shay|Female|jshay12ca@rakuten...|  176.255.2.116|75396 Bluejay Pas...|        754-403-4969|
|749692|   Prissie|Yakunchikov|Female|pyakunchikov12cb@.

In [4]:
data.count()

1000000