<a href="https://colab.research.google.com/github/iGhostlp/Albus/blob/Gunter-y-Ernesto/Proyecto_BBVA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment setup

In [2]:
# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz

In [3]:
# Unzip the file
!tar xf spark-3.3.2-bin-hadoop3.tgz

In [4]:
!readlink -f $(which java) | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/


In [5]:
# Set up the environment for Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
os.environ["SPARK_HOME"] = '/content/spark-3.3.2-bin-hadoop3'

In [6]:
# Install library for finding Spark
!pip install -q findspark

# Import the library
import findspark

# Initiate findspark
findspark.init()

In [7]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.master("local[*]").config('spark.sql.parquet.datetimeRebaseModeInRead','CORRECTED').getOrCreate()

# Check Spark Session Information
spark

# Data extraction from Parquet: customers and phones

In [8]:
# Parquet files carry their own schema, so the CSV-style options (sep, inferschema, header) are not needed
df_customer = spark.read.parquet('sample_data/customer_basics_bootcamp.snappy.parquet')
df_phones = spark.read.parquet('sample_data/phones_bootcamp.snappy.parquet')


In [9]:
df_customer_phones = df_customer.join(df_phones, 'customer_id', how='inner')

In [10]:
df_customer_phones.show()

+-----------+-------------+-------------+--------------+---------------+--------------------+---------+----------------+-----------+-----------------------+-----------------------+----------+-------------------+-----------------+--------------------------+-----------+----------------+-----------------+-------------------+-----------+----------+----------------------+--------------------+---------------------+-----------+--------------------------+----------------------------+-------------------------------+------------------------+------------------------+--------------------------+------------------------+--------------------------+----------------------------+------------------+------------------------+-------------------+-------------------+---------------------------+-----------------+-------------------+-----------------+-------------------------------+--------------------+---------------------+------------------+----------------+--------------------+--------------------------+---

# Phone filter
I WANT: Filter the customer phone contacts DataFrame and keep the 3 most recent contacts per customer.
SO THAT: data volume is reduced and we work only with the most up-to-date records.

registry_entry_date: moment at which a record is first created.
last_change_date: date on which a change to the information is recorded in the system.


In [11]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,col, row_number, desc, collect_list
from pyspark.sql.window import Window

We trim the table, dropping the columns that we do not consider part of the analysis.

In [12]:
# Columns left out of the analysis (the duplicated 'dlvy_day_friday_type' entry is listed once)
df_phones_cut = df_phones.drop(
    'phone_intern_id', 'phone_country_id', 'aditional_info_txt_desc', 'primary_phone_type',
    'address_sequence_type', 'address_town_name', 'zipcode_id', 'province_id',
    'sender_application_id', 'normalization_status_type', 'normalization_reason_name',
    'validity_start_date', 'validity_end_date',
    'dlvy_day_monday_type', 'dlvy_day_tuesday_type', 'dlvy_day_wednesday_type',
    'dlvy_day_thursday_type', 'dlvy_day_friday_type', 'dlvy_day_saturday_type',
    'delivery_contact_start_hm_date', 'delivery_contact_end_hm_date',
    'operational_load_date', 'normalization_date')

In [13]:
df_phones_sorted = df_phones_cut.orderBy([df_phones_cut.customer_id, desc('last_change_date')])
df_phones_sorted.show()

+-----------+--------------+-------------------+-----------------+----------+---------------+-------------+-------------------+-----------------+-------------+-------------+--------------------------+---------------------+--------------------+----------------+-------------------+----------------+----------------+-------------------+--------------------+-----------------------+
|customer_id|phone_use_type|address_sequence_id|phone_sequence_id|phone_type|prefix_phone_id|phone_area_id|cellphone_prefix_id|phone_exchange_id|phone_line_id|     phone_id|customer_phone_status_type|phone_status_mod_date|contact_channel_type|wrong_phone_type|registry_entry_date|register_user_id|last_change_date|last_change_user_id|last_change_hms_date|last_change_terminal_id|
+-----------+--------------+-------------------+-----------------+----------+---------------+-------------+-------------------+-----------------+-------------+-------------+--------------------------+---------------------+------------------


We concatenate the phone-number columns into a single column.


In [14]:
df_phones_sorted = df_phones_sorted.select(concat(df_phones_sorted.prefix_phone_id,df_phones_sorted.phone_area_id,df_phones_sorted.phone_exchange_id,df_phones_sorted.phone_line_id).alias('Full_Phone'),'customer_id','last_change_date')

In [15]:
df_phones_sorted.show()

+-------------+-----------+----------------+
|   Full_Phone|customer_id|last_change_date|
+-------------+-----------+----------------+
| 542664697946|   00000007|      2022-09-01|
| 543412847321|   00000039|      2022-09-20|
| 541125064159|   00000044|      2022-10-03|
| 543815909885|   00000381|      2022-07-19|
| 542975296284|   00000442|      2022-07-20|
| 542974729337|   00000442|      2022-07-20|
| 541166793207|   00001419|      2022-11-01|
| 541138700150|   00001939|      2022-09-21|
| 542994477116|   00002707|      2022-08-17|
| 543424662478|   00002790|      2022-11-18|
| 543424883620|   00002790|      2022-10-08|
|5435415988799|   00004287|      2022-08-29|
| 541124084447|   00004724|      2022-11-04|
| 542613862762|   00005527|      2022-10-03|
| 542615904192|   00005527|      2022-10-03|
| 543854065887|   00007932|      2022-09-29|
| 543489493578|   00011850|      2022-10-27|
| 541125781080|   00011850|      2022-10-24|
| 543401534381|   00012051|      2022-10-06|
| 54116114

We use a window to partition the rows by customer_id.

In [16]:
window = Window.partitionBy(df_phones_sorted.customer_id).orderBy(desc(df_phones_sorted.last_change_date))

In [17]:
df_phone = df_phones_sorted.withColumn('row_num', row_number().over(window))

In [18]:
df_phone = df_phone.filter(df_phone.row_num <= 3)
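
The ranking done in cells 16 to 18 can be pictured without Spark. What follows is a minimal plain-Python sketch of "keep the n most recent rows per customer", not the notebook's implementation; the helper name `latest_per_customer`, the customer ids `C1` and `C2` and their phone values are made up for illustration. It also makes the tie case visible: when two rows share the same `last_change_date` (as customer 00000442 does in the earlier output), their relative order depends on a secondary rule, which in this sketch is simply input order.

```python
from collections import defaultdict
from datetime import date


def latest_per_customer(rows, n=3):
    """Return, for each customer_id, its n most recently changed rows."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row["customer_id"]].append(row)

    result = {}
    for customer_id, group in by_customer.items():
        # Newest last_change_date first. sorted() is stable, so rows with the
        # same date keep their input order (the tie-break rule of this sketch).
        ranked = sorted(group, key=lambda r: r["last_change_date"], reverse=True)
        result[customer_id] = ranked[:n]
    return result


# Hypothetical sample: customer C1 has four phones, C2 has one.
sample_rows = [
    {"customer_id": "C1", "phone": "A", "last_change_date": date(2022, 7, 1)},
    {"customer_id": "C1", "phone": "B", "last_change_date": date(2022, 11, 18)},
    {"customer_id": "C1", "phone": "C", "last_change_date": date(2022, 10, 8)},
    {"customer_id": "C1", "phone": "D", "last_change_date": date(2022, 1, 5)},
    {"customer_id": "C2", "phone": "E", "last_change_date": date(2022, 9, 1)},
]

top = latest_per_customer(sample_rows)
print([r["phone"] for r in top["C1"]])  # ['B', 'C', 'A']
print([r["phone"] for r in top["C2"]])  # ['E']
```

In Spark, a window ordered only by `last_change_date` leaves the order of tied rows unspecified, so adding a second ordering column to the window may be worth considering if reproducible results matter.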

In [19]:
df_phone.show()

+-------------+-----------+----------------+-------+
|   Full_Phone|customer_id|last_change_date|row_num|
+-------------+-----------+----------------+-------+
| 542664697946|   00000007|      2022-09-01|      1|
| 543815909885|   00000381|      2022-07-19|      1|
| 543424662478|   00002790|      2022-11-18|      1|
| 543424883620|   00002790|      2022-10-08|      2|
|5435415988799|   00004287|      2022-08-29|      1|
| 543854065887|   00007932|      2022-09-29|      1|
| 543489493578|   00011850|      2022-10-27|      1|
| 541125781080|   00011850|      2022-10-24|      2|
| 543401534381|   00012051|      2022-10-06|      1|
| 542216208511|   00013498|      2022-08-22|      1|
| 541165050605|   00014664|      2022-10-06|      1|
| 541161577947|   00041884|      2022-07-28|      1|
| 541165187983|   00048225|      2022-07-08|      1|
| 541169951912|   00052103|      2022-09-27|      1|
| 543364577255|   00056407|      2022-10-17|      1|
| 543412740967|   00058519|      2022-10-26|  

We build the pivot table.

In [20]:
df_pivot_phone = df_phone.groupBy('customer_id').agg(collect_list('Full_Phone').alias('last_3_changes_list'))

In [21]:
df_pivot_phone = df_pivot_phone.selectExpr('customer_id', 'last_3_changes_list[0] as Phone_1', 'last_3_changes_list[1] as Phone_2', 'last_3_changes_list[2] as Phone_3')
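
One caveat about this pivot: Spark documents `collect_list` as non-deterministic with respect to element order after a shuffle, so `last_3_changes_list[0]` is not guaranteed to be the most recent phone. The plain-Python sketch below shows an alternative idea, placing each value in the slot given by its `row_num` instead of relying on list order. The helper `pivot_ranked` is hypothetical; the sample values are taken from the ranked output shown earlier, deliberately supplied out of order.

```python
def pivot_ranked(rows, value_key, prefix, n=3, filler="---"):
    """Turn ranked rows (one per contact) into one wide record per customer."""
    wide = {}
    for row in rows:
        # Start every customer with n filler slots: Phone_1 .. Phone_n.
        record = wide.setdefault(
            row["customer_id"],
            {f"{prefix}_{i}": filler for i in range(1, n + 1)},
        )
        rank = row["row_num"]
        if 1 <= rank <= n:
            # The slot is chosen by rank, not by arrival order.
            record[f"{prefix}_{rank}"] = row[value_key]
    return wide


ranked_rows = [
    {"customer_id": "00002790", "Full_Phone": "543424883620", "row_num": 2},
    {"customer_id": "00000007", "Full_Phone": "542664697946", "row_num": 1},
    {"customer_id": "00002790", "Full_Phone": "543424662478", "row_num": 1},
]

wide = pivot_ranked(ranked_rows, "Full_Phone", "Phone")
print(wide["00002790"])
# {'Phone_1': '543424662478', 'Phone_2': '543424883620', 'Phone_3': '---'}
```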

In [22]:
df_pivot_phone.show()

+-----------+-------------+------------+-------+
|customer_id|      Phone_1|     Phone_2|Phone_3|
+-----------+-------------+------------+-------+
|   00000007| 542664697946|        null|   null|
|   00000381| 543815909885|        null|   null|
|   00002790| 543424662478|543424883620|   null|
|   00004287|5435415988799|        null|   null|
|   00007932| 543854065887|        null|   null|
|   00011850| 543489493578|541125781080|   null|
|   00012051| 543401534381|        null|   null|
|   00013498| 542216208511|        null|   null|
|   00014664| 541165050605|        null|   null|
|   00041884| 541161577947|        null|   null|
|   00048225| 541165187983|        null|   null|
|   00052103| 541169951912|        null|   null|
|   00056407| 543364577255|        null|   null|
|   00058519| 543412740967|        null|   null|
|   00058909| 541125222584|        null|   null|
|   00064043| 542976219525|542975133355|   null|
|   00064339| 542966425661|        null|   null|
|   00071569| 543584

We replace null with ---.

In [23]:
df_pivot_phone = df_pivot_phone.na.fill('---')
df_pivot_phone.show()

+-----------+-------------+------------+-------+
|customer_id|      Phone_1|     Phone_2|Phone_3|
+-----------+-------------+------------+-------+
|   00000007| 542664697946|         ---|    ---|
|   00000381| 543815909885|         ---|    ---|
|   00004287|5435415988799|         ---|    ---|
|   00007932| 543854065887|         ---|    ---|
|   00011850| 543489493578|541125781080|    ---|
|   00012051| 543401534381|         ---|    ---|
|   00013498| 542216208511|         ---|    ---|
|   00014664| 541165050605|         ---|    ---|
|   00041884| 541161577947|         ---|    ---|
|   00048225| 541165187983|         ---|    ---|
|   00052103| 541169951912|         ---|    ---|
|   00056407| 543364577255|         ---|    ---|
|   00058519| 543412740967|         ---|    ---|
|   00058909| 541125222584|         ---|    ---|
|   00064043| 542976219525|542975133355|    ---|
|   00064339| 542966425661|         ---|    ---|
|   00071569| 543584112643|         ---|    ---|
|   00077558| 541151

# Data extraction from Parquet: customers and emails

In [24]:
df_emails = spark.read.parquet('sample_data/emails_bootcamp.snappy.parquet')

In [25]:
df_customer_emails = df_emails.join(df_customer, 'customer_id', how="right")

# Filtering emails

In [26]:
df_emails_cut = df_emails.drop('role_type', 'email_type','address_sequence_id','residence_type','primary_email_type','email_domain_type','encripted_type','field_length_number','comments_desc','customer_email_status_type','email_app_type','register_user_id','last_change_user_id','last_change_hms_date','last_change_terminal_id','operational_load_date','customer_email_status_date','registry_entry_date')

In [27]:
df_emails_sorted = df_emails_cut.orderBy([df_emails_cut.customer_id, desc('last_change_date')])

df_emails_sorted.toPandas()

Unnamed: 0,customer_id,email_desc,last_change_date
0,00001419,JU_LY1@HOTMAIL.COM,2019-05-11
1,00001419,EMILIA.RUBIANES@HOTMAIL.COM,2015-10-08
2,00002790,NOTIENE@HOIMAIL.COM,2019-07-29
3,00002790,DIGITALIZACION@EECC.COM,1900-01-01
4,00014664,alq@ciudad.com.ar,2009-06-27
...,...,...,...
495,28990339,ELSAMO@GMAIL.COM,2019-11-08
496,28993945,navyig@fibertel.com.ar,2009-06-27
497,29003190,ANLAU_08@LIVE.COM.AR,2020-06-01
498,29008648,ROBERTOWINY@GMAIL.COM,2013-01-23


In [28]:
window = Window.partitionBy(df_emails_sorted.customer_id).orderBy(desc(df_emails_sorted.last_change_date))

In [29]:
df_email = df_emails_sorted.withColumn('row_num', row_number().over(window))

In [30]:
df_email = df_email.filter(df_email.row_num <= 3)

In [31]:
df_email.toPandas()

Unnamed: 0,customer_id,email_desc,last_change_date,row_num
0,00001419,JU_LY1@HOTMAIL.COM,2019-05-11,1
1,00001419,EMILIA.RUBIANES@HOTMAIL.COM,2015-10-08,2
2,00002790,NOTIENE@HOIMAIL.COM,2019-07-29,1
3,00002790,DIGITALIZACION@EECC.COM,1900-01-01,2
4,00014664,alq@ciudad.com.ar,2009-06-27,1
...,...,...,...,...
495,28990339,ELSAMO@GMAIL.COM,2019-11-08,1
496,28993945,navyig@fibertel.com.ar,2009-06-27,1
497,29003190,ANLAU_08@LIVE.COM.AR,2020-06-01,1
498,29008648,ROBERTOWINY@GMAIL.COM,2013-01-23,1


In [32]:
df_pivot_email = df_email.groupBy('customer_id').agg(collect_list('email_desc').alias('last_3_changes_list'))

In [33]:
df_pivot_email = df_pivot_email.selectExpr('customer_id', 'last_3_changes_list[0] as Email_1', 'last_3_changes_list[1] as Email_2', 'last_3_changes_list[2] as Email_3')

In [34]:
df_pivot_email.toPandas()

Unnamed: 0,customer_id,Email_1,Email_2,Email_3
0,00001419,JU_LY1@HOTMAIL.COM,EMILIA.RUBIANES@HOTMAIL.COM,
1,00002790,NOTIENE@HOIMAIL.COM,DIGITALIZACION@EECC.COM,
2,00014664,alq@ciudad.com.ar,,
3,00027568,SCATIVA@GMAIL.COM,,
4,00027877,LAURAFRISCHKNECHT@FIBERTEL.COM.AR,,
...,...,...,...,...
424,28990339,ELSAMO@GMAIL.COM,,
425,28993945,navyig@fibertel.com.ar,,
426,29003190,ANLAU_08@LIVE.COM.AR,,
427,29008648,ROBERTOWINY@GMAIL.COM,,


In [35]:
df_pivot_email = df_pivot_email.na.fill('---')
df_pivot_email.toPandas()

Unnamed: 0,customer_id,Email_1,Email_2,Email_3
0,00001419,JU_LY1@HOTMAIL.COM,EMILIA.RUBIANES@HOTMAIL.COM,---
1,00002790,NOTIENE@HOIMAIL.COM,DIGITALIZACION@EECC.COM,---
2,00014664,alq@ciudad.com.ar,---,---
3,00027568,SCATIVA@GMAIL.COM,---,---
4,00027877,LAURAFRISCHKNECHT@FIBERTEL.COM.AR,---,---
...,...,...,...,...
424,28990339,ELSAMO@GMAIL.COM,---,---
425,28993945,navyig@fibertel.com.ar,---,---
426,29003190,ANLAU_08@LIVE.COM.AR,---,---
427,29008648,ROBERTOWINY@GMAIL.COM,---,---


# Data extraction from Parquet: customers and addresses

In [36]:
df_address = spark.read.parquet('sample_data/address_bootcamp.snappy.parquet')

In [37]:
df_customer_address = df_address.join(df_customer, 'customer_id')

# Filtering addresses

In [38]:
df_address_cut = df_address.drop('address_priority_number','address_sequence_id','residence_type','address_without_number_type','province_id','address_country_id','other_information_desc','address_relationship_type','address_start_date','address_verified_date','customer_locator_verified_type','address_status_mod_date','contact_channel_type','sender_application_id','returned_mail_type','normalization_status_type','normalization_reason_name','normalization_date','normalized_level_match_number','dlvy_day_monday_type','dlvy_day_tuesday_type','dlvy_day_wednesday_type','dlvy_day_thursday_type','dlvy_day_friday_type','dlvy_day_saturday_type','delivery_contact_start_hm_date','delivery_contact_end_hm_date','prev_address_sequence_id','registry_entry_date','register_user_id','last_change_user_id','last_change_hms_date','last_change_terminal_id','registration_type','operational_load_date')

In [39]:
df_address_cut.toPandas()

Unnamed: 0,customer_id,street_name,address_outdoor_id,address_indoor_id,indoor_number,address_district_name,address_town_name,address_department_name,zipcode_id,long_zipcode_id,last_change_date
0,00000660,PASCUALA DEL JUNCAL,0000850,,,,VIRREYES,112233114455,01646,,2011-07-01
1,07121078,SAN NICOLAS,0002478,,,,VILLA VATTEONE,FLORENCIO VARELA,01888,B1853AMP,2016-09-29
2,22374047,J DE LA CRUZ CONTRERAS,0000408,,,,FLORENCIO VARELA,FLORENCIO VARELA,01888,B1888IIJ,2011-07-01
3,00002450,BACACAY,0001466,,,,ITUZAINGO,ITUZAINGO,01714,B1714ERV,2009-01-26
4,00003925,MANUELA PEDRAZA,0001715,5,B,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01429,C1429CBE,2009-01-26
...,...,...,...,...,...,...,...,...,...,...,...
2495,00005200,PARAGUAY,0003091,,,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01425,C1425BRK,2009-01-26
2496,08341355,AV CORRIENTES,0004923,2,F,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01414,C1414AJC,2009-01-26
2497,06987700,ESTADOS UNIDOS,0002772,2,B,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01227,C1227ABT,2009-01-26
2498,00015314,AV ALVAREZ THOMAS,0000195,13,A,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01427,C1427CCB,2009-01-26


In [40]:
df_address_sorted = df_address_cut.orderBy([df_address_cut.customer_id, desc('last_change_date')])
df_address_sorted.toPandas()

Unnamed: 0,customer_id,street_name,address_outdoor_id,address_indoor_id,indoor_number,address_district_name,address_town_name,address_department_name,zipcode_id,long_zipcode_id,last_change_date
0,00000003,AV PRES BARTOLOME MITRE,0001500,,,,CRUCESITA,AVELLANEDA,01870,B1873AMN,2011-09-07
1,00000050,DR A ALSINA,0002849,,,,FLORIDA,VICENTE LOPEZ,01602,B1602EEA,2017-04-18
2,00000050,DR A ALSINA,0002849,,,,,VICENTE LOPEZ,01602,,2016-10-12
3,00000173,CALLE 150,0003726,,,,,BERAZATEGUI,01885,,2016-10-12
4,00000173,CALLE 150,0003726,,,,,BERAZATEGUI,01885,,2016-10-12
...,...,...,...,...,...,...,...,...,...,...,...
2495,29015563,IBANEZ TENIENTE 1 RO,0001355,,,,BELLA VISTA,112233114455,01661,,2009-01-26
2496,29015902,GARIBALDI,0001554,,,,RAMOS MEJIA,LA MATANZA,01704,B1704IEH,2009-01-26
2497,29017100,DOMINGO MATHEU,0000972,,,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01219,C1219AAH,2009-01-26
2498,29017191,VIRREY ARREDONDO,0002641,3,B,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01426,C1426DZI,2009-01-26


In [41]:
df_address_sorted = df_address_sorted.select('customer_id', concat(df_address_sorted.street_name,df_address_sorted.address_outdoor_id,df_address_sorted.address_indoor_id,df_address_sorted.indoor_number).alias('Full_Address'),'address_district_name','address_town_name','address_department_name','zipcode_id','long_zipcode_id','last_change_date')
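
The resulting `Full_Address` values include entries such as `VIRREY ARREDONDO 00026413 B`, where the street number `0002641` and the floor `3` run together because `concat` joins its inputs with no separator. The sketch below is a plain-Python illustration of a separator-aware join that also skips blank parts; it assumes (from the spacing in that output) that some fields arrive padded with trailing spaces, and the helper name `full_address` is hypothetical.

```python
def full_address(street, number, floor="", apartment="", sep=" "):
    """Join the address parts with a separator, skipping blank or missing ones."""
    parts = (street, number, floor, apartment)
    # strip() removes the assumed fixed-width padding; empty parts are dropped
    # so that no doubled separators appear.
    cleaned = [p.strip() for p in parts if p is not None and p.strip()]
    return sep.join(cleaned)


print(full_address("VIRREY ARREDONDO ", "0002641", "3 ", "B"))
# VIRREY ARREDONDO 0002641 3 B
print(full_address("GARIBALDI ", "0001554", "", None))
# GARIBALDI 0001554
```

In PySpark, the closest counterpart is probably `concat_ws(' ', ...)` applied to trimmed columns. `concat_ws` skips nulls but keeps empty strings, so blank fields would still need handling.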

In [42]:
df_address_sorted.toPandas()

Unnamed: 0,customer_id,Full_Address,address_district_name,address_town_name,address_department_name,zipcode_id,long_zipcode_id,last_change_date
0,00000003,AV PRES BARTOLOME MITRE 0001500,,CRUCESITA,AVELLANEDA,01870,B1873AMN,2011-09-07
1,00000050,DR A ALSINA 0002849,,FLORIDA,VICENTE LOPEZ,01602,B1602EEA,2017-04-18
2,00000050,DR A ALSINA 0002849,,,VICENTE LOPEZ,01602,,2016-10-12
3,00000173,CALLE 150 0003726,,,BERAZATEGUI,01885,,2016-10-12
4,00000173,CALLE 150 0003726,,,BERAZATEGUI,01885,,2016-10-12
...,...,...,...,...,...,...,...,...
2495,29015563,IBANEZ TENIENTE 1 RO 0001355,,BELLA VISTA,112233114455,01661,,2009-01-26
2496,29015902,GARIBALDI 0001554,,RAMOS MEJIA,LA MATANZA,01704,B1704IEH,2009-01-26
2497,29017100,DOMINGO MATHEU 0000972,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01219,C1219AAH,2009-01-26
2498,29017191,VIRREY ARREDONDO 00026413 B,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01426,C1426DZI,2009-01-26


In [43]:
window = Window.partitionBy(df_address_sorted.customer_id).orderBy(desc(df_address_sorted.last_change_date))

In [44]:
df_address = df_address_sorted.withColumn('row_num', row_number().over(window))

In [45]:
df_address = df_address.filter(df_address.row_num <= 3)
df_address.toPandas()

Unnamed: 0,customer_id,Full_Address,address_district_name,address_town_name,address_department_name,zipcode_id,long_zipcode_id,last_change_date,row_num
0,00000003,AV PRES BARTOLOME MITRE 0001500,,CRUCESITA,AVELLANEDA,01870,B1873AMN,2011-09-07,1
1,00000050,DR A ALSINA 0002849,,FLORIDA,VICENTE LOPEZ,01602,B1602EEA,2017-04-18,1
2,00000050,DR A ALSINA 0002849,,,VICENTE LOPEZ,01602,,2016-10-12,2
3,00000173,CALLE 150 0003726,,,BERAZATEGUI,01885,,2016-10-12,1
4,00000173,CALLE 150 0003726,,,BERAZATEGUI,01885,,2016-10-12,2
...,...,...,...,...,...,...,...,...,...
2478,29015563,IBANEZ TENIENTE 1 RO 0001355,,BELLA VISTA,112233114455,01661,,2009-01-26,1
2479,29015902,GARIBALDI 0001554,,RAMOS MEJIA,LA MATANZA,01704,B1704IEH,2009-01-26,1
2480,29017100,DOMINGO MATHEU 0000972,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01219,C1219AAH,2009-01-26,1
2481,29017191,VIRREY ARREDONDO 00026413 B,,CIUDAD AUTONOMA BUENOS AIRES,CAPITAL FEDERAL,01426,C1426DZI,2009-01-26,1


In [46]:
df_pivot_address = df_address.groupBy('customer_id').agg(collect_list('Full_Address').alias('last_3_changes_list'))

In [47]:
df_pivot_address = df_pivot_address.selectExpr('customer_id', 'last_3_changes_list[0] as Address_1', 'last_3_changes_list[1] as Address_2', 'last_3_changes_list[2] as Address_3')

In [48]:
df_pivot_address = df_pivot_address.na.fill('---')
df_pivot_address.toPandas()

Unnamed: 0,customer_id,Address_1,Address_2,Address_3
0,00000003,AV PRES BARTOLOME MITRE 0001500,---,---
1,00000050,DR A ALSINA 0002849,DR A ALSINA 0002849,---
2,00000173,CALLE 150 0003726,CALLE 150 0003726,CALLE 150 0003726
3,00000188,CALLE 156 0004344,---,---
4,00000204,GRAL PINTO 0002441,---,---
...,...,...,...,...
2395,29015563,IBANEZ TENIENTE 1 RO 0001355,---,---
2396,29015902,GARIBALDI 0001554,---,---
2397,29017100,DOMINGO MATHEU 0000972,---,---
2398,29017191,VIRREY ARREDONDO 00026413 B,---,---


# Add a new column to each contact DataFrame indicating its contact_type (address, email, phone)

In [49]:
from pyspark.sql.functions import lit

In [50]:
df_phones_contact_col = df_phones.withColumn('contact_type_phones', lit('phone'))

In [51]:
df_phones_contact_col.select('customer_id','contact_type_phones',).show()

+-----------+-------------------+
|customer_id|contact_type_phones|
+-----------+-------------------+
|   29354201|              phone|
|   29389432|              phone|
|   29382041|              phone|
|   07395331|              phone|
|   29349520|              phone|
|   29349520|              phone|
|   29390571|              phone|
|   29353393|              phone|
|   29387501|              phone|
|   29391432|              phone|
|   29390553|              phone|
|   29389361|              phone|
|   29400313|              phone|
|   29401025|              phone|
|   29401299|              phone|
|   29325999|              phone|
|   29402665|              phone|
|   29402665|              phone|
|   29375587|              phone|
|   29375587|              phone|
+-----------+-------------------+
only showing top 20 rows



In [52]:
df_emails_contact_col = df_emails.withColumn('contact_type_emails', lit('e-mail'))
df_emails.show()

+-----------+---------+-------------------+--------------+----------+------------------+--------------------+-----------------+--------------+-------------------+----------------+--------------------------+--------------+----------------+-------------------+--------------------+-----------------------+---------------------+--------------------------+-------------------+----------------+
|customer_id|role_type|address_sequence_id|residence_type|email_type|primary_email_type|          email_desc|email_domain_type|encripted_type|field_length_number|   comments_desc|customer_email_status_type|email_app_type|register_user_id|last_change_user_id|last_change_hms_date|last_change_terminal_id|operational_load_date|customer_email_status_date|registry_entry_date|last_change_date|
+-----------+---------+-------------------+--------------+----------+------------------+--------------------+-----------------+--------------+-------------------+----------------+--------------------------+--------------

In [53]:
df_address_contact_col = df_address.withColumn('contact_type_address', lit('address'))
df_address.show()

+-----------+--------------------+---------------------+--------------------+-----------------------+----------+---------------+----------------+-------+
|customer_id|        Full_Address|address_district_name|   address_town_name|address_department_name|zipcode_id|long_zipcode_id|last_change_date|row_num|
+-----------+--------------------+---------------------+--------------------+-----------------------+----------+---------------+----------------+-------+
|   00000003|AV PRES BARTOLOME...|                  ...|CRUCESITA        ...|   AVELLANEDA       ...|     01870|       B1873AMN|      2011-09-07|      1|
|   00000050|DR A ALSINA      ...|                  ...|FLORIDA          ...|   VICENTE LOPEZ    ...|     01602|       B1602EEA|      2017-04-18|      1|
|   00000050|DR A ALSINA      ...|                  ...|                 ...|   VICENTE LOPEZ    ...|     01602|               |      2016-10-12|      2|
|   00000173|CALLE 150        ...|                  ...|                 ...

# Combine the customer phone, address and email contact DataFrames into a single one
I used a full join because, as I understand it, the result has to cover every customer, regardless of whether a given customer has 1 or 3 contact types loaded.

In [54]:
df_contact_types = df_phones_contact_col.join(df_address_contact_col, 'customer_id', how='full')\
.join(df_emails_contact_col, 'customer_id', how='full')
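
To illustrate why a full join is the right choice for covering every customer, here is a plain-Python sketch of the same idea reduced to presence per customer. The function `full_outer_presence` is hypothetical; the three customer ids come from outputs elsewhere in this notebook (00000007 has only a phone, 00000188 only an address, 00002790 has all three).

```python
def full_outer_presence(phone_ids, address_ids, email_ids):
    """One record per customer found in any source; missing sources become None."""
    phones, addresses, emails = set(phone_ids), set(address_ids), set(email_ids)
    rows = []
    # The union of keys is what makes this a full (outer) combination.
    for customer_id in sorted(phones | addresses | emails):
        rows.append({
            "customer_id": customer_id,
            "contact_type_phones": "phone" if customer_id in phones else None,
            "contact_type_address": "address" if customer_id in addresses else None,
            "contact_type_emails": "e-mail" if customer_id in emails else None,
        })
    return rows


rows = full_outer_presence(
    phone_ids=["00000007", "00002790"],
    address_ids=["00000188", "00002790"],
    email_ids=["00002790"],
)
for row in rows:
    print(row)
```

The sketch tracks presence only. A real join on row-level tables yields one row per combination of matching rows, so a customer with 2 phones and 2 addresses produces 4 joined rows.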



In [55]:
df_contact_types.select('customer_id','contact_type_phones','contact_type_address','contact_type_emails').show()

+-----------+-------------------+--------------------+-------------------+
|customer_id|contact_type_phones|contact_type_address|contact_type_emails|
+-----------+-------------------+--------------------+-------------------+
|   00000007|              phone|                null|               null|
|   00000188|               null|             address|               null|
|   00000204|               null|             address|               null|
|   00000228|               null|             address|               null|
|   00000274|               null|             address|               null|
|   00000282|               null|             address|               null|
|   00000305|               null|             address|               null|
|   00000381|              phone|                null|               null|
|   00000429|               null|             address|               null|
|   00000445|               null|             address|               null|
|   00000451|            

Join the pivot tables.

In [59]:
df_contactos = df_pivot_phone.join(df_pivot_email, "customer_id") \
                   .join(df_pivot_address, "customer_id")

In [60]:
sorted(df_contactos.columns)

['Address_1',
 'Address_2',
 'Address_3',
 'Email_1',
 'Email_2',
 'Email_3',
 'Phone_1',
 'Phone_2',
 'Phone_3',
 'customer_id']

In [61]:
df_contactos.show()

+-----------+------------+------------+-------+--------------------+--------------------+-------+--------------------+---------+---------+
|customer_id|     Phone_1|     Phone_2|Phone_3|             Email_1|             Email_2|Email_3|           Address_1|Address_2|Address_3|
+-----------+------------+------------+-------+--------------------+--------------------+-------+--------------------+---------+---------+
|   00002790|543424662478|543424883620|    ---| NOTIENE@HOIMAIL.COM|DIGITALIZACION@EE...|    ---|AV CORDOBA       ...|      ---|      ---|
|   00014664|541165050605|         ---|    ---|   alq@ciudad.com.ar|                 ---|    ---|AV TRIUNVIRATO   ...|      ---|      ---|
|   00056407|543364577255|         ---|    ---|NELLY.S.GEREZ@GMA...|                 ---|    ---|AV DR J BAUTISTA ...|      ---|      ---|
|   00058909|541125222584|         ---|    ---|jmanau@fibertel.c...|                 ---|    ---|AV PTE J D PERON ...|      ---|      ---|
|   00079642|541168707818| 

# I WANT: Filter the contacts DataFrame and keep only 3 contacts per customer, respecting the priority phone, email, address. Save the result as a temporary view.
SO THAT: data volume is reduced and we work only with the most up-to-date records.

In [62]:
df_contactos_1 = df_contactos.select('customer_id','Phone_1','Email_1','Address_1')
df_contactos_1.createOrReplaceTempView("contactos_temp")
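
The cell above keeps the most recent contact of each type. Another possible reading of the ticket is "up to 3 contacts in total, filling slots in priority order". The plain-Python sketch below illustrates that alternative reading only. It is an assumption about the requirement, the helper `top_contacts` is hypothetical, and the address value is a made-up placeholder.

```python
def top_contacts(phones, emails, addresses, limit=3, filler="---"):
    """Pick up to `limit` contacts, taking phones first, then emails, then addresses."""
    ordered = (
        [("phone", p) for p in phones]
        + [("email", e) for e in emails]
        + [("address", a) for a in addresses]
    )
    # Skip the '---' placeholders that the pivot tables use for missing values.
    real = [(kind, value) for kind, value in ordered if value != filler]
    return real[:limit]


selected = top_contacts(
    phones=["543424662478", "543424883620", "---"],
    emails=["NOTIENE@HOIMAIL.COM", "DIGITALIZACION@EECC.COM", "---"],
    addresses=["EXAMPLE STREET 123", "---", "---"],  # placeholder address
)
print(selected)
# [('phone', '543424662478'), ('phone', '543424883620'), ('email', 'NOTIENE@HOIMAIL.COM')]
```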

In [63]:
df_contactos_1.show()

+-----------+------------+--------------------+--------------------+
|customer_id|     Phone_1|             Email_1|           Address_1|
+-----------+------------+--------------------+--------------------+
|   00002790|543424662478| NOTIENE@HOIMAIL.COM|AV CORDOBA       ...|
|   00014664|541165050605|   alq@ciudad.com.ar|AV TRIUNVIRATO   ...|
|   00056407|543364577255|NELLY.S.GEREZ@GMA...|AV DR J BAUTISTA ...|
|   00058909|541125222584|jmanau@fibertel.c...|AV PTE J D PERON ...|
|   00079642|541168707818| MARITA@4HOTMAIL.COM|VICTORIANO AGUILA...|
|   00096670|541144071782|CONTACTO@COMPRESO...|CATTANEO         ...|
|   00101437|543815853371|   NOTIENE@GMAIL.COM|CALLE 51         ...|
|   00477803|543416646849|LEMOS-LILIANA@HOT...|PTE D F SARMIENTO...|
|   01010176|543816425198|  CORTESMT@GMAIL.COM|DEAN FUNES       ...|
|   01157212|541124841730|ADURIZM.57@GMAIL.COM|BENITO PEREZ GALD...|
|   01670443|541155784448|smbortolussi@s5.c...|HILARION DE LA QU...|
|   01718049|541138492510|CARMELA1

I WANT: Add a new column to the customer phone contacts DataFrame, storing the contact as a JSON string with the fields: Phone_type (mobile, landline), country code, area code, phone number. Tkt 31

In [64]:
from pyspark.sql.functions import concat_ws, to_json, struct

In [65]:
# Build the struct from the individual columns so that each value keeps its own JSON key;
# wrapping a single concat_ws(...) string would produce a JSON object with only one field
df_contact_types = df_contact_types.withColumn("phone_contact", to_json(struct(df_contact_types.phone_type, df_contact_types.phone_country_id, df_contact_types.prefix_phone_id, df_contact_types.phone_area_id, df_contact_types.cellphone_prefix_id, df_contact_types.phone_exchange_id, df_contact_types.phone_line_id)))
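
To make the target format of ticket 31 concrete, this is a plain-Python sketch of one possible JSON layout for a phone contact. The key names, the helper `phone_contact_json` and the way the sample number is split into country code, area code and number are all assumptions for illustration; the ticket text does not fix them.

```python
import json


def phone_contact_json(phone_type, country_code, area_code, number):
    """Serialize one phone contact as a JSON string with one key per field."""
    payload = {
        "phone_type": phone_type,      # e.g. "mobile" or "landline"
        "country_code": country_code,
        "area_code": area_code,
        "phone_number": number,
    }
    return json.dumps(payload)


contact = phone_contact_json("mobile", "54", "11", "25064159")
print(contact)
# {"phone_type": "mobile", "country_code": "54", "area_code": "11", "phone_number": "25064159"}
```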

In [66]:
df_contact_types.show()

+-----------+--------------+-------------------+-----------------+----------+----------------+---------------+-------------+-------------------+-----------------+-------------+------------+---------------+-----------------------+------------------+---------------------+-----------------+----------+-----------+--------------------------+---------------------+--------------------+---------------------+----------------+-------------------------+-------------------------+------------------+-------------------+-----------------+--------------------+---------------------+-----------------------+----------------------+--------------------+----------------------+------------------------------+----------------------------+-------------------+----------------+----------------+-------------------+--------------------+-----------------------+---------------------+-------------------+--------------------+---------------------+--------------------+-----------------------+----------+---------------+-

I WANT: Add a new column to the customer addresses DataFrame, storing the contact as a JSON string with the fields: street, number, floor, apartment, locality, province, postal code.
SO THAT: the data is enriched by applying the required format. Tkt 32
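
The notebook has no cell for this ticket, so the following is only a hedged plain-Python sketch of what the address JSON could look like. The mapping from output keys to columns (in particular `address_town_name` as locality and `address_department_name` as province) is an assumption, as are the key names and the helper `address_contact_json`. The sample row reuses values printed earlier for customer 00015314.

```python
import json

# Assumed mapping from JSON keys to the address columns seen in this notebook.
ADDRESS_FIELDS = {
    "street": "street_name",
    "number": "address_outdoor_id",
    "floor": "address_indoor_id",
    "apartment": "indoor_number",
    "locality": "address_town_name",
    "province": "address_department_name",
    "zip_code": "zipcode_id",
}


def address_contact_json(row):
    """Serialize one address row as a JSON string, omitting blank fields."""
    payload = {}
    for key, column in ADDRESS_FIELDS.items():
        value = (row.get(column) or "").strip()
        if value:
            payload[key] = value
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(payload, ensure_ascii=False)


sample_row = {
    "street_name": "AV ALVAREZ THOMAS",
    "address_outdoor_id": "0000195",
    "address_indoor_id": "13",
    "indoor_number": "A",
    "address_town_name": "CIUDAD AUTONOMA BUENOS AIRES",
    "address_department_name": "CAPITAL FEDERAL",
    "zipcode_id": "01427",
}
print(address_contact_json(sample_row))
```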

# Generate a temporary view

I WANT: Generate a temporary view from the contacts DataFrame.
SO THAT: the data is prepared for working with SparkSQL. Tkt 35

In [67]:
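df_contactos_vt = df_contactos
# createOrReplaceTempView lets the cell be re-run without failing because the view already exists
df_contactos_vt.createOrReplaceTempView('contact_vtemporal')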

I WANT: Generate a temporary view from the t_abtq_customer_basics file.
SO THAT: the data is prepared for working with SparkSQL. I use the normalized file. Tkt 36

In [68]:
df_customer_vt = spark.read.parquet('sample_data/customer_basics_bootcamp.snappy.parquet')

In [69]:
df_customer_vt.createOrReplaceTempView('customer_vtemporal')

I WANT: Generate a temporary view from the t_acog_marital_status_type file.
SO THAT: the data is prepared for working with SparkSQL. Tkt 37

In [71]:
df_marital_status_vt = spark.read.parquet('sample_data/t_acog_marital_status_type.snappy.parquet')

In [72]:
df_marital_status_vt.createOrReplaceTempView('marital_status_type_vt')

I WANT: Generate a temporary view from the t_acog_nationality file.
SO THAT: the data is prepared for working with SparkSQL. Tkt 38

In [73]:
df_nationality_vt = spark.read.parquet('sample_data/t_acog_nationality.snappy.parquet')

In [74]:
df_nationality_vt.createOrReplaceTempView('nationality_vt')