# Apache Spark Assignment
>Import the required libraries wherever they are needed.


In [2]:
import pandas as pd 
from pyspark.sql import SparkSession

### SparkSession
>Create a SparkSession to begin the assignment.

In [3]:
spark = SparkSession.builder \
    .appName("Mi Aplicación Spark") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/07 10:25:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Create a DataFrame
>Read the CSV file datosTarea.csv, load it into a DataFrame, and display it.

In [4]:
df = spark.read.csv("datosTarea.csv", header=True, inferSchema=True)
df

DataFrame[Index: int, Organization Id: string, Name: string, Website: string, Country: string, Description: string, Founded: int, Industry: string, Number of employees: int, Networth: int, stock_price: int, profit: int]

In [6]:
df.show()

+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------+-----------+------+
|Index|Organization Id|                Name|             Website|             Country|         Description|Founded|            Industry|Number of employees|Networth|stock_price|profit|
+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------+-----------+------+
|    1|FAB0d41d5b5d22c|         Ferrell LLC|  https://price.net/|    Papua New Guinea|Horizontal empowe...|   1990|            Plastics|               3498|  402269|         33| 12125|
|    2|6A7EdDEA9FaDC52|Mckinney, Riley a...|http://www.hall-b...|             Finland|User-centric syst...|   2015|Glass / Ceramics ...|               4952|  569480|         49| 12001|
|    3|0bFED1ADAE4bcC1|          Hester Ltd|http://sullivan-r...|          

### Data filtering
>Get all companies whose name starts with 'M' and that have between 4000 and 7000 employees. Show only the names and the number of employees.

In [18]:
filtered_df = df.filter((df.Name.startswith("M")) & (df["Number of employees"] >= 4000) & (df["Number of employees"] <= 7000))
result_df = filtered_df.select("Name", "Number of employees")
result_df.show()


+--------------------+-------------------+
|                Name|Number of employees|
+--------------------+-------------------+
|Mckinney, Riley a...|               4952|
|       Mcintosh-Mora|               4389|
|     Mckenzie-Melton|               4589|
|          Massey LLC|               5004|
|        Mays-Preston|               5786|
+--------------------+-------------------+



>Get all countries that do not start with the letters 'B', 'S', or 'M' and whose networth is also not greater than 500000. Show the company name, the country, and the networth.

In [19]:
from pyspark.sql.functions import col

filtered_df = df.filter(~col("Country").startswith("B") & \
                        ~col("Country").startswith("S") & \
                        ~col("Country").startswith("M") & \
                        (col("Networth") <= 500000))
result_df = filtered_df.select("Name", "Country", "Networth")
result_df.show()

+--------------------+--------------------+--------+
|                Name|             Country|Networth|
+--------------------+--------------------+--------+
|         Ferrell LLC|    Papua New Guinea|  402269|
|      Holder-Sellers|        Turkmenistan|  105914|
|Keller, Campos an...|             Liberia|  329130|
|         Harrell LLC|          Guadeloupe|  251274|
|Dickson, Richmond...|      Czech Republic|  359030|
|        Prince-David|    Christmas Island|  120289|
|         Rivas Group|           Australia|  477824|
|Sloan, Mays and W...|                Chad|   41975|
|Glass, Barrera an...|     Kyrgyz Republic|  300150|
|Baker, Mccann and...|               Kenya|  188370|
|Valentine, Fergus...|              Jersey|  412274|
|           Walls LLC|          Cape Verde|  192969|
|Mitchell, Warren ...| Trinidad and Tobago|  438839|
|      Walton-Barnett|      Western Sahara|  200789|
|     Bartlett-Arroyo|Northern Mariana ...|  458504|
|         Berg-Sparks|              Canada|  2
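
The task statement lists the letters in lowercase, while the filter above tests uppercase prefixes; that works here because the country names are capitalized. As a plain-Python illustration of the same row predicate (not Spark code), a case-insensitive version could look like the sketch below. The helper `keeps_row` and the constants are hypothetical names introduced for this example; the sample rows are copied from the tables shown earlier.

```python
# Plain-Python sketch of the row predicate; keeps_row is a hypothetical helper.
EXCLUDED_PREFIXES = ("b", "s", "m")
MAX_NETWORTH = 500000


def keeps_row(country: str, networth: int) -> bool:
    """Return True when the country does not start with an excluded letter
    (case-insensitive) and the networth is at most MAX_NETWORTH."""
    starts_excluded = country.lower().startswith(EXCLUDED_PREFIXES)
    return not starts_excluded and networth <= MAX_NETWORTH


# (country, networth) pairs taken from the output shown earlier.
rows = [("Papua New Guinea", 402269), ("Finland", 569480), ("Chad", 41975)]
kept = [country for country, networth in rows if keeps_row(country, networth)]
print(kept)
```

Lowercasing the country before the prefix test is the only difference from the Spark filter; the networth bound is inclusive in both.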

### Functions
>Create a function with @pandas_udf that subtracts the mean from the profit in each row. Create a new column that shows the results.

In [8]:
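from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF declared with type hints (PandasUDFType is deprecated since Spark 3.0).
# Spark calls it once per Arrow batch, so the mean below is the batch mean; it equals
# the mean of the whole column only when all rows arrive in a single batch.
@pandas_udf("double")
def subtract_mean(profit_series: pd.Series) -> pd.Series:
    return profit_series - profit_series.mean()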




In [9]:
df = df.withColumn("profit_minus_mean", subtract_mean(df["profit"]))


In [10]:
df.show()



+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------+-----------+------+-----------------+
|Index|Organization Id|                Name|             Website|             Country|         Description|Founded|            Industry|Number of employees|Networth|stock_price|profit|profit_minus_mean|
+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------+-----------+------+-----------------+
|    1|FAB0d41d5b5d22c|         Ferrell LLC|  https://price.net/|    Papua New Guinea|Horizontal empowe...|   1990|            Plastics|               3498|  402269|         33| 12125|           -194.0|
|    2|6A7EdDEA9FaDC52|Mckinney, Riley a...|http://www.hall-b...|             Finland|User-centric syst...|   2015|Glass / Ceramics ...|               4952|  569480|         49| 12001|    
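
A scalar pandas UDF receives its input in batches rather than as one full column, so a mean computed inside it is a per-batch mean. The plain-Python sketch below (not Spark code) mimics that behavior to show how the result changes once the data is split into more than one batch. The function `subtract_batch_mean` and the profit values are made up for illustration.

```python
from statistics import mean


def subtract_batch_mean(values, batch_size):
    """Mimic a scalar pandas UDF: each fixed-size batch is processed
    independently, so each batch is centered on its own mean."""
    result = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        batch_mean = mean(batch)
        result.extend(v - batch_mean for v in batch)
    return result


profits = [10.0, 20.0, 30.0, 40.0]  # made-up values for illustration

one_batch = subtract_batch_mean(profits, batch_size=4)
two_batches = subtract_batch_mean(profits, batch_size=2)
print(one_batch)    # [-15.0, -5.0, 5.0, 15.0]
print(two_batches)  # [-5.0, 5.0, -5.0, 5.0]
```

With a small file that fits in one batch the two results coincide, which is presumably why the column above looks right. On larger data, one option is to compute the mean once with an aggregation and subtract it as a constant.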

                                                                                

### Grouping data
>Group by industry and show which companies have the highest profit. Show the first three.

In [15]:
from pyspark.sql.functions import desc, row_number
from pyspark.sql.window import Window

# Highest profit per industry; the grouping column is renamed to keep the join unambiguous.
max_profit_df = (
    df.groupBy("Industry").max("profit")
    .withColumnRenamed("max(profit)", "max_profit")
    .withColumnRenamed("Industry", "IndustryGroup")
)
# Keep only the companies whose profit equals the maximum of their industry.
joined_df = df.join(
    max_profit_df,
    (df.Industry == max_profit_df.IndustryGroup) & (df.profit == max_profit_df.max_profit),
)
# Number the remaining rows within each industry; more than one row appears only on ties.
windowSpec = Window.partitionBy(joined_df["Industry"]).orderBy(desc("profit"))
ranked_df = joined_df.withColumn("rank", row_number().over(windowSpec))
top_companies_df = ranked_df.filter(ranked_df["rank"] <= 3)
top_companies_df.select("Industry", "Name", "profit").show()


+--------------------+--------------------+------+
|            Industry|                Name|profit|
+--------------------+--------------------+------+
|          Accounting|          Massey LLC|  6931|
|Alternative Dispu...|      Velazquez-Odom| 18466|
|Architecture / Pl...|         Branch-Mann|  8745|
|       Arts / Crafts|         Berg-Sparks| 14682|
|          Automotive|      Holder-Sellers|  8200|
|  Banking / Mortgage|        Prince-David|  6476|
|     Broadcast Media|   Clements-Espinoza| 12510|
|  Building Materials|Wilkinson, Charle...| 11036|
|Business Supplies...|          Soto Group| 12498|
|Capital Markets /...|       Eaton-Morales| 16816|
|Civic / Social Or...|     Bartlett-Arroyo| 16020|
|   Civil Engineering|Sloan, Mays and W...| 12456|
|Computer Software...|           Hicks LLC| 16068|
|        Construction|         Harrell LLC|  8266|
|Consumer Electronics|Erickson, Andrews...|  6931|
|      Consumer Goods|        Gonzales Ltd|  7369|
|   Consumer Services|         
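
Because the join keeps only rows that match the industry maximum, the `rank <= 3` filter can only matter when several companies tie for that maximum. If the intent were instead the three most profitable companies in each industry, the ranking would have to run over all rows. The plain-Python sketch below (not Spark code) shows that top-N-per-group logic. The function `top_n_per_group` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def top_n_per_group(rows, n):
    """rows: iterable of (industry, name, profit) tuples.
    Returns {industry: [(name, profit), ...]} with at most n entries
    per industry, ordered from highest to lowest profit."""
    groups = defaultdict(list)
    for industry, name, profit in rows:
        groups[industry].append((name, profit))
    return {
        industry: sorted(members, key=lambda m: m[1], reverse=True)[:n]
        for industry, members in groups.items()
    }


# Made-up rows for illustration only.
sample_rows = [
    ("Plastics", "Company A", 500),
    ("Plastics", "Company B", 900),
    ("Plastics", "Company C", 700),
    ("Plastics", "Company D", 100),
    ("Wireless", "Company E", 300),
]
top3 = top_n_per_group(sample_rows, n=3)
print(top3)
```

In Spark terms this corresponds to applying `row_number()` over a window partitioned by industry directly on the full DataFrame, without the preceding join.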

>Group by industry and compute the average number of employees in each one.

In [16]:
from pyspark.sql.functions import avg
avg_employees_df = df.groupBy("Industry").agg(avg("Number of employees").alias("average_employees"))
avg_employees_df.show()

+--------------------+-----------------+
|            Industry|average_employees|
+--------------------+-----------------+
|Primary / Seconda...|6457.666666666667|
|     Broadcast Media|           2589.0|
|           Wholesale|           5010.0|
|Investment Manage...|           3133.5|
|    Food / Beverages|           9011.0|
|  Gambling / Casinos|           4873.0|
|Logistics / Procu...|           4155.0|
|            Maritime|            769.0|
|            Wireless|           6146.0|
|Education Management|            339.0|
|       Arts / Crafts|           2800.0|
|           Insurance|           1215.0|
|  Financial Services|           5157.0|
|Business Supplies...|           9097.0|
|Consumer Electronics|           5022.0|
|       Public Safety|           5287.0|
|Information Techn...|           3934.0|
|Civic / Social Or...|           2442.0|
|      Consumer Goods|           9069.0|
|Glass / Ceramics ...|           4952.0|
+--------------------+-----------------+
only showing top

### SQL
>Using Spark SQL, find how many companies were founded after 2000.

In [17]:
df.createOrReplaceTempView("empresas")
result = spark.sql("SELECT COUNT(*) AS total_empresas FROM empresas WHERE Founded > 2000")
result.show()


+--------------+
|total_empresas|
+--------------+
|            38|
+--------------+



### ML Linear Regression
>Using number of employees, networth, and stock price, predict the profit with a linear regression.

In [21]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Prepare the data
assembler = VectorAssembler(inputCols=["Number of employees", "Networth", "stock_price"], outputCol="features")
assembled_df = assembler.transform(df)

# Split the data into training and test sets (no seed is set, so the split
# and the resulting RMSE change from run to run)
train_data, test_data = assembled_df.randomSplit([0.7, 0.3])

# Create and train the linear regression model
lr = LinearRegression(featuresCol="features", labelCol="profit")
lr_model = lr.fit(train_data)

# Make predictions on the test set
predictions = lr_model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="profit", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on the test set: %g" % rmse)

# Show some predictions
predictions.select("prediction", "profit", "features").show()


23/12/07 10:47:08 WARN Instrumentation: [d3050896] regParam is zero, which might cause numerical instability and overfitting.
23/12/07 10:47:09 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/12/07 10:47:09 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
23/12/07 10:47:10 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


Root Mean Squared Error (RMSE) on the test set: 4096.76
+------------------+------+--------------------+
|        prediction|profit|            features|
+------------------+------+--------------------+
|13275.273360936133| 12001|[4952.0,569480.0,...|
|11584.018976383506|  8297|[9011.0,1036264.0...|
|11534.760959963733| 17809|[9079.0,1044084.0...|
|13380.444263317988| 18079|[5984.0,688160.0,...|
|11625.811547650183|  8266|[2185.0,251274.0,...|
|  13302.1327402587| 12456|[365.0,41975.0,43.0]|
|11795.854818091233|  5730|[3715.0,427224.0,...|
|11623.035788453673| 12915|[7645.0,879174.0,...|
|11503.912121765901|  9426|[7034.0,808909.0,...|
|13552.569284683013| 11514|[9443.0,1085945.0...|
|11125.413522852545|  7369|[9069.0,1042934.0...|
| 11304.89594200228| 11781|[3527.0,405604.0,...|
|11226.070809637167|  5634|[3450.0,396749.0,...|
|10997.207313609753| 10173|[7202.0,828229.0,...|
| 11194.04626934738| 18239|[6923.0,796144.0,...|
|11853.030359158209| 13879|[3934.0,452409.0,...|
|108
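
The RMSE reported by the evaluator is the square root of the mean squared difference between each prediction and its label. As a minimal plain-Python sketch of that formula (not Spark code), the hypothetical helper `root_mean_squared_error` below applies it to the first three rows of the table above; the full test set gives a different value because it contains more rows.

```python
import math


def root_mean_squared_error(predictions, labels):
    """Square root of the mean of the squared differences (prediction - label)."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First three (prediction, profit) pairs from the output shown above.
sample_predictions = [13275.273360936133, 11584.018976383506, 11534.760959963733]
sample_profits = [12001, 8297, 17809]
print(root_mean_squared_error(sample_predictions, sample_profits))
```

Since the error is expressed in the same units as profit, it can be read directly against the profit column.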

>Once you have the results, use the pandas API to convert them to a pandas-on-Spark DataFrame and write them to CSV.

In [23]:
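# Select only the columns of interest that are not of a structured type
selected_columns_df = predictions.select("prediction", "profit", "Number of employees", "Networth", "stock_price")

# Convert to a pandas-on-Spark DataFrame (pandas_api() replaces the deprecated to_pandas_on_spark())
ps_df = selected_columns_df.pandas_api()

# Export to CSV. pandas-on-Spark writes a directory of part files at this path and
# omits the index by default; its to_csv has no `index` parameter.
ps_df.to_csv("predictions.csv")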


                                                                                