<img src="pyspark.jpeg" width="300" height="150"> 

# Introduction to PySpark

## Felipe Meza

### Prerequisites

- findspark
- pyspark
- spark

We start by importing the libraries and setting the other required parameters.

In [6]:
print("hola")

hola


In [1]:
from pyspark.sql import SparkSession

spark=SparkSession.builder.appName('data_processing').getOrCreate()
#spark = SparkSession.builder.master("local").appName("Search").config(conf=SparkConf()).getOrCreate()

import pyspark.sql.functions as F
from pyspark.sql.types import *

In [3]:
import findspark
# Raw string, so the backslash in the Windows path is not read as an escape sequence.
findspark.init(r"C:\Spark")


from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, udf 
from pyspark.sql.types import DateType



We create the schema (the column structure we will use to build the DataFrame):

In [4]:
schema=StructType().add("user_id","string").add("country","string").add("browser", "string").add("OS",'string').add("age", "integer")

Then we create the DataFrame:

In [5]:
df=spark.createDataFrame([("A203",'India',"Chrome","WIN", 33),
                          ("A201",'China',"Safari","MacOS",35),
                          ("A205",'UK',"Mozilla", "Linux",25)],schema=schema)

Display the DataFrame schema:

In [7]:
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- browser: string (nullable = true)
 |-- OS: string (nullable = true)
 |-- age: integer (nullable = true)



Display the DataFrame:

In [8]:
df.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|  India| Chrome|  WIN| 33|
|   A201|  China| Safari|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



We create a new DataFrame, this time with some null values:

In [9]:
df_na=spark.createDataFrame([("A203",None,"Chrome","WIN",33),("A201",'China',None,"MacOS",35),("A205",'UK',"Mozilla","Linux",25)],schema=schema)

In [11]:
df_na.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|   null| Chrome|  WIN| 33|
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Show the previous DataFrame with the string '0' in place of the nulls. Note that `fillna` returns a new DataFrame that is only displayed here; THE ORIGINAL DATAFRAME IS NOT MODIFIED:

In [10]:
df_na.fillna('0').show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|      0| Chrome|  WIN| 33|
|   A201|  China|      0|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Note that it has not been modified:

In [11]:
df_na.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|   null| Chrome|  WIN| 33|
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Fill the nulls with a specific value per column. Again, the result is only displayed and the original is not modified:

In [12]:
df_na.fillna( { 'country':'USA', 'browser':'Safari' } ).show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|    USA| Chrome|  WIN| 33|
|   A201|  China| Safari|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Drop the rows that contain any null value:

In [13]:
df_na.na.drop().show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Drop the rows that have a null value in the `country` column:

In [16]:
df_na.na.drop(subset='country').show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



Values can be replaced with others:

In [14]:
df_na.replace("Chrome","Google Chrome").show()

+-------+-------+-------------+-----+---+
|user_id|country|      browser|   OS|age|
+-------+-------+-------------+-----+---+
|   A203|   null|Google Chrome|  WIN| 33|
|   A201|  China|         null|MacOS| 35|
|   A205|     UK|      Mozilla|Linux| 25|
+-------+-------+-------------+-----+---+



Entire columns can be dropped:

In [15]:
df_na.drop('user_id').show()

+-------+-------+-----+---+
|country|browser|   OS|age|
+-------+-------+-----+---+
|   null| Chrome|  WIN| 33|
|  China|   null|MacOS| 35|
|     UK|Mozilla|Linux| 25|
+-------+-------+-----+---+



We can load a data file and build a DataFrame from it:

In [16]:
df=spark.read.csv("customer_data.csv",header=True, inferSchema=True)

Inspect the data:

In [17]:
df.count()

2000

In [18]:
len(df.columns)

7

In [19]:
df.printSchema()

root
 |-- Customer_subtype: string (nullable = true)
 |-- Number_of_houses: integer (nullable = true)
 |-- Avg_size_household: integer (nullable = true)
 |-- Avg_age: string (nullable = true)
 |-- Customer_main_type: string (nullable = true)
 |-- Avg_Salary: integer (nullable = true)
 |-- label: integer (nullable = true)



In [20]:
df.show(3)

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Lower class large...|               1|                 3|30-40 years|Family with grown...|     44905|    0|
|Mixed small town ...|               1|                 2|30-40 years|Family with grown...|     37575|    0|
|Mixed small town ...|               1|                 2|30-40 years|Family with grown...|     27915|    0|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
only showing top 3 rows



In [21]:
df.summary().show()

+-------+--------------------+------------------+------------------+-----------+--------------------+-----------------+------------------+
|summary|    Customer_subtype|  Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|       Avg_Salary|             label|
+-------+--------------------+------------------+------------------+-----------+--------------------+-----------------+------------------+
|  count|                2000|              2000|              2000|       2000|                2000|             2000|              2000|
|   mean|                null|            1.1075|            2.6895|       null|                null|     1616908.0835|            0.0605|
| stddev|                null|0.3873225521186316|0.7914562220841646|       null|                null|6822647.757312146|0.2384705099001677|
|    min|Affluent senior a...|                 1|                 1|20-30 years|      Average Family|             1361|                 0|
|    25%|                nu

Show only selected columns (`show` prints the first 20 rows by default):

In [22]:
df.select(['Customer_subtype','Avg_Salary']).show()

+--------------------+----------+
|    Customer_subtype|Avg_Salary|
+--------------------+----------+
|Lower class large...|     44905|
|Mixed small town ...|     37575|
|Mixed small town ...|     27915|
|Modern, complete ...|     19504|
|  Large family farms|     34943|
|    Young and rising|     13064|
|Large religious f...|     29090|
|Lower class large...|      6895|
|Lower class large...|     35497|
|     Family starters|     30800|
|       Stable family|     39157|
|Modern, complete ...|     40839|
|Lower class large...|     30008|
|        Mixed rurals|     37209|
|    Young and rising|     45361|
|Lower class large...|     45650|
|Traditional families|     18982|
|Mixed apartment d...|     30093|
|Young all america...|     27097|
|Low income catholics|     23511|
+--------------------+----------+
only showing top 20 rows



Show only the rows that meet a criterion:

In [23]:
df.filter(df['Avg_Salary'] > 1000000).count()

128

In [24]:
df.filter(df['Avg_Salary'] > 1000000).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
| High status seniors|               1|                 3|40-50 years|Successful hedonists|   4670288|    0|
| High status seniors|               1|                 3|50-60 years|Successful hedonists|   9561873|    0|
| High status seniors|               1|                 2|40-50 years|Successful hedonists|  18687005|    0|
| High status seniors|               1|                 2|40-50 years|Successful hedonists|  24139960|    0|
| High status seniors|               1|                 2|50-60 years|Successful hedonists|   6718606|    0|
|High Income, expe...|               1|                 3|40-50 years|Successful hedonists|  19347139|    0|
|High Income, expe.

Filtering with chained conditions:

In [25]:
df.filter(df['Avg_Salary'] > 500000).filter(df['Number_of_houses'] > 2).count()

4

In [26]:
df.filter(df['Avg_Salary'] > 500000).filter(df['Number_of_houses'] > 2).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    596723|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    944444|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    788477|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    994077|    0|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+



Another way to filter is with `where` and Boolean operators:

In [27]:
df.where((df['Avg_Salary'] > 500000) & (df['Number_of_houses'] > 2)).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    596723|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    944444|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    788477|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    994077|    0|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+



We can count how many rows fall into each category:

In [28]:
df.groupBy('Customer_subtype').count().show()

+--------------------+-----+
|    Customer_subtype|count|
+--------------------+-----+
|Lower class large...|  288|
|Mixed small town ...|   47|
|Modern, complete ...|   93|
|  Large family farms|   26|
|    Young and rising|   78|
|Large religious f...|  107|
|     Family starters|   55|
|       Stable family|   62|
|        Mixed rurals|   67|
|Traditional families|  129|
|Mixed apartment d...|   34|
|Young all america...|   62|
|Low income catholics|   72|
|Large family, emp...|   56|
| Young, low educated|   56|
|Middle class fami...|  122|
|Dinki's (double i...|   17|
| High status seniors|   76|
|Couples with teen...|   83|
|Young seniors in ...|   22|
+--------------------+-----+
only showing top 20 rows



In [29]:
df.groupBy('Number_of_houses').count().show()

+----------------+-----+
|Number_of_houses|count|
+----------------+-----+
|               1| 1808|
|               2|  178|
|               3|   12|
|              10|    1|
|               5|    1|
+----------------+-----+



Extracting insights: what characterizes the customers who have more than 3 houses?

In [30]:
df.filter(df['Number_of_houses'] > 3).show()

+----------------+----------------+------------------+-----------+------------------+----------+-----+
|Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|Customer_main_type|Avg_Salary|label|
+----------------+----------------+------------------+-----------+------------------+----------+-----+
|    Single youth|              10|                 2|30-40 years|     Career Loners|     13815|    0|
|Young and rising|               5|                 1|40-50 years|       Living well|     17123|    0|
+----------------+----------------+------------------+-----------+------------------+----------+-----+



For a given grouping column, we can compute statistics (mean, maximum, etc.) over the values associated with each group:

In [31]:
df.groupBy('Customer_main_type').agg(F.mean('Avg_Salary')).show()

+--------------------+--------------------+
|  Customer_main_type|     avg(Avg_Salary)|
+--------------------+--------------------+
|Family with grown...|  28114.191881918818|
|      Average Family|  104256.62337662338|
|             Farmers|  30209.333333333332|
|         Living well|  31194.044943820223|
|Conservative fami...|  29504.419491525423|
|Retired and Relig...|   27338.80693069307|
|      Driven Growers|   30769.04069767442|
|Successful hedonists|1.6278923510309279E7|
|    Cruising Seniors|  28870.333333333332|
|       Career Loners|             32272.6|
+--------------------+--------------------+



In [32]:
df.groupBy('Customer_main_type').agg(F.max('Avg_Salary')).show()

+--------------------+---------------+
|  Customer_main_type|max(Avg_Salary)|
+--------------------+---------------+
|Family with grown...|          49901|
|      Average Family|         991838|
|             Farmers|          49965|
|         Living well|          49816|
|Conservative fami...|          49965|
|Retired and Relig...|          49564|
|      Driven Growers|          49932|
|Successful hedonists|       48919896|
|    Cruising Seniors|          49526|
|       Career Loners|          49903|
+--------------------+---------------+



We can also show the data in ascending or descending order:

In [33]:
df.sort("Avg_Salary").show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Low income catholics|               1|                 2|40-50 years|Retired and Relig...|      1361|    0|
|Lower class large...|               1|                 2|40-50 years|Family with grown...|      1502|    0|
|Lower class large...|               1|                 2|40-50 years|Family with grown...|      1718|    1|
|Lower class large...|               1|                 3|40-50 years|Family with grown...|      1750|    0|
|Lower class large...|               1|                 3|40-50 years|Family with grown...|      1865|    0|
|Lower class large...|               1|                 2|50-60 years|Family with grown...|      2021|    0|
|Lower class large.

In [34]:
df.sort("Avg_Salary", ascending=True).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Low income catholics|               1|                 2|40-50 years|Retired and Relig...|      1361|    0|
|Lower class large...|               1|                 2|40-50 years|Family with grown...|      1502|    0|
|Lower class large...|               1|                 2|40-50 years|Family with grown...|      1718|    1|
|Lower class large...|               1|                 3|40-50 years|Family with grown...|      1750|    0|
|Lower class large...|               1|                 3|40-50 years|Family with grown...|      1865|    0|
|Lower class large...|               1|                 2|50-60 years|Family with grown...|      2021|    0|
|Lower class large.

In [35]:
df.sort("Avg_Salary", ascending=False).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
| High status seniors|               1|                 2|60-70 years|Successful hedonists|  48919896|    0|
|High Income, expe...|               1|                 2|50-60 years|Successful hedonists|  48177970|    0|
|High Income, expe...|               1|                 2|50-60 years|Successful hedonists|  48069548|    1|
|High Income, expe...|               1|                 3|40-50 years|Successful hedonists|  46911924|    0|
| High status seniors|               1|                 3|40-50 years|Successful hedonists|  46614009|    0|
|High Income, expe...|               1|                 3|30-40 years|Successful hedonists|  45952441|    0|
|High Income, expe.

The previous operations can also be combined:

In [36]:
df.groupBy('Customer_subtype').agg(F.avg('Avg_Salary').alias('mean_salary')).orderBy('mean_salary', ascending=False).show()

+--------------------+--------------------+
|    Customer_subtype|         mean_salary|
+--------------------+--------------------+
| High status seniors| 2.507677857894737E7|
|High Income, expe...|2.3839817807692308E7|
|Affluent young fa...|   662068.7777777778|
|Affluent senior a...|   653638.8235294118|
|Senior cosmopolitans|             49903.0|
|Students in apart...|  35532.142857142855|
|  Large family farms|   33135.61538461538|
| Young, low educated|   33072.21428571428|
|Large family, emp...|  32867.857142857145|
|      Suburban youth|             32558.0|
|    Village families|  32449.470588235294|
|Middle class fami...|  31579.385245901638|
|Modern, complete ...|             31576.0|
|   Etnically diverse|             31572.0|
|    Young and rising|  30795.897435897437|
|       Mixed seniors|  30759.267605633802|
|Very Important Pr...|         30548.40625|
|Religious elderly...|   30540.59574468085|
|     Family starters|             30376.2|
|Career and childcare|  30110.93

Passing a number and `False` to `show` displays more than the default 20 rows, and `False` also turns off the truncation of long values:

In [37]:
df.groupBy('Customer_subtype').agg(F.avg('Avg_Salary').alias('mean_salary')).orderBy('mean_salary', ascending=False).show(50, False)

+------------------------------------------+--------------------+
|Customer_subtype                          |mean_salary         |
+------------------------------------------+--------------------+
|High status seniors                       |2.507677857894737E7 |
|High Income, expensive child              |2.3839817807692308E7|
|Affluent young families                   |662068.7777777778   |
|Affluent senior apartments                |653638.8235294118   |
|Senior cosmopolitans                      |49903.0             |
|Students in apartments                    |35532.142857142855  |
|Large family farms                        |33135.61538461538   |
|Young, low educated                       |33072.21428571428   |
|Large family, employed child              |32867.857142857145  |
|Suburban youth                            |32558.0             |
|Village families                          |32449.470588235294  |
|Middle class families                     |31579.385245901638  |
|Modern, c

With `collect_set` we can list the distinct values that a column takes within each group:

In [38]:
df.groupby("Customer_subtype").agg(F.collect_set("Number_of_houses")).show()

+--------------------+-----------------------------+
|    Customer_subtype|collect_set(Number_of_houses)|
+--------------------+-----------------------------+
|Lower class large...|                       [1, 2]|
|Mixed small town ...|                          [1]|
|Modern, complete ...|                       [1, 2]|
|  Large family farms|                          [1]|
|    Young and rising|                    [1, 5, 2]|
|Large religious f...|                       [1, 2]|
|     Family starters|                       [1, 2]|
|       Stable family|                    [1, 2, 3]|
|        Mixed rurals|                          [1]|
|Traditional families|                       [1, 2]|
|Mixed apartment d...|                    [1, 2, 3]|
|Young all america...|                       [1, 2]|
|Low income catholics|                          [1]|
|Large family, emp...|                       [1, 2]|
| Young, low educated|                       [1, 2]|
|Middle class fami...|                       [

What other operations do you know that are relevant for data inspection?