<a href="https://colab.research.google.com/github/luisosmx/spark/blob/main/pyspark_poc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
# Extracts to /content/spark-3.1.2-bin-hadoop3.2, the directory SPARK_HOME points to in the imports cell
!tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark==3.1.2
# JAVA_HOME and SPARK_HOME are set with os.environ in the imports cell;
# `!export` would not persist because each `!` command runs in its own subshell.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



In [None]:
pip freeze

In [None]:
import os
import findspark

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

# Point PySpark at the Java runtime and at the Spark distribution extracted in the setup cell
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['SPARK_HOME'] = '/content/spark-3.1.2-bin-hadoop3.2'
# Extra packages fetched when the JVM starts (AWS SDK and the hadoop-aws connector)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'

# Make the pyspark under SPARK_HOME importable
findspark.init()

# Start a local Spark context on all available cores and wrap it in a session
sc = pyspark.SparkContext("local[*]")
spark = SparkSession(sc)

print('Modules imported and Spark loaded')

Modules imported and Spark loaded


# Loading data into PySpark

Get the data from the GitHub repo without cloning the project, using [raw.githubusercontent.com](https://stackoverflow.com/questions/39065921/what-do-raw-githubusercontent-com-urls-represent).

In [None]:
# wget expects a full URL including the scheme; a bare local path fails with "Scheme missing"
!wget --continue /content/afluenciastc_simple_01_2023.csv

/content/afluenciastc_simple_01_2023.csv: Scheme missing.
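
A raw file URL follows a fixed pattern: host, then owner, repository, branch, and the path inside the repository. The sketch below only illustrates that pattern. `raw_github_url` is a hypothetical helper, the owner, repository, and branch are taken from the Colab badge at the top of this notebook, and `path/to/file.csv` is a placeholder, not the real location of the CSV.

```python
def raw_github_url(owner, repo, branch, path):
    """Build a raw.githubusercontent.com URL for one file in a public repo."""
    # Strip a leading slash so the path joins cleanly
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path.lstrip('/')}"


# Placeholder path: replace it with the file's actual location in the repo
url = raw_github_url("luisosmx", "spark", "main", "path/to/file.csv")
print(url)
```

The resulting string can then be passed to `wget` in place of the local path.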


Read the file into a Spark DataFrame

In [None]:
spark_df = spark.read.csv('/content/afluenciastc_simple_01_2023.csv')
spark_df.show(20)



+----------+----+-----+-------+-------------------+---------+
|       _c0| _c1|  _c2|    _c3|                _c4|      _c5|
+----------+----+-----+-------+-------------------+---------+
|     fecha|anio|  mes|  linea|           estacion|afluencia|
|2010-01-01|2010|Enero|Linea 1|           Zaragoza|    20227|
|2010-01-01|2010|Enero|Linea 1| Isabel la Católica|     6487|
|2010-01-01|2010|Enero|Linea 1|          Moctezuma|    10304|
|2010-01-01|2010|Enero|Linea 1|        Pino Suárez|     8679|
|2010-01-01|2010|Enero|Linea 1|       Gómez Farías|    19499|
|2010-01-01|2010|Enero|Linea 6|Deptvo. 18 de Marzo|      621|
|2010-01-01|2010|Enero|Linea 6|  La Villa-Basilica|    24792|
|2010-01-01|2010|Enero|Linea 9|          Pantitlán|    27000|
|2010-01-01|2010|Enero|Linea 8|             Aculco|     3652|
|2010-01-01|2010|Enero|Linea 9|          Velódromo|     3239|
|2010-01-01|2010|Enero|Linea 5|Autobuses del Norte|    16824|
|2010-01-01|2010|Enero|Linea 5|          Misterios|     3513|
|2010-01

In [None]:
df = spark.read.option("header",True).csv('/content/afluenciastc_simple_01_2023.csv')
df.show(20)

+----------+----+-----+-------+-------------------+---------+
|     fecha|anio|  mes|  linea|           estacion|afluencia|
+----------+----+-----+-------+-------------------+---------+
|2010-01-01|2010|Enero|Linea 1|           Zaragoza|    20227|
|2010-01-01|2010|Enero|Linea 1| Isabel la Católica|     6487|
|2010-01-01|2010|Enero|Linea 1|          Moctezuma|    10304|
|2010-01-01|2010|Enero|Linea 1|        Pino Suárez|     8679|
|2010-01-01|2010|Enero|Linea 1|       Gómez Farías|    19499|
|2010-01-01|2010|Enero|Linea 6|Deptvo. 18 de Marzo|      621|
|2010-01-01|2010|Enero|Linea 6|  La Villa-Basilica|    24792|
|2010-01-01|2010|Enero|Linea 9|          Pantitlán|    27000|
|2010-01-01|2010|Enero|Linea 8|             Aculco|     3652|
|2010-01-01|2010|Enero|Linea 9|          Velódromo|     3239|
|2010-01-01|2010|Enero|Linea 5|Autobuses del Norte|    16824|
|2010-01-01|2010|Enero|Linea 5|          Misterios|     3513|
|2010-01-01|2010|Enero|Linea 7|     Constituyentes|     1417|
|2010-01
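
With only `header` set, Spark still loads every column as a string. A minimal sketch of also asking Spark to infer column types, so that `afluencia` becomes numeric, might look like this. `read_ridership_csv` is a hypothetical helper name; `header` and `inferSchema` are standard CSV reader options.

```python
def read_ridership_csv(spark, path):
    """Read a CSV with a header row and let Spark infer column types."""
    # inferSchema makes Spark scan the data an extra time to guess types
    return spark.read.options(header=True, inferSchema=True).csv(path)


# Usage in this notebook (assumes the `spark` session created during setup):
# typed_df = read_ridership_csv(spark, '/content/afluenciastc_simple_01_2023.csv')
# typed_df.printSchema()
```

Inference costs an additional pass over the file, so for large inputs an explicit schema is usually the better choice.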

# Replace incoming column headers from Spanish to English

In [None]:
spark_df.select("_c3","_c4","_c5") \
     .show(20)

+-------+-------------------+---------+
|    _c3|                _c4|      _c5|
+-------+-------------------+---------+
|  linea|           estacion|afluencia|
|Linea 1|           Zaragoza|    20227|
|Linea 1| Isabel la Católica|     6487|
|Linea 1|          Moctezuma|    10304|
|Linea 1|        Pino Suárez|     8679|
|Linea 1|       Gómez Farías|    19499|
|Linea 6|Deptvo. 18 de Marzo|      621|
|Linea 6|  La Villa-Basilica|    24792|
|Linea 9|          Pantitlán|    27000|
|Linea 8|             Aculco|     3652|
|Linea 9|          Velódromo|     3239|
|Linea 5|Autobuses del Norte|    16824|
|Linea 5|          Misterios|     3513|
|Linea 7|     Constituyentes|     1417|
|Linea 7|          Refinería|     2325|
|Linea 3|            Etiopía|     7078|
|Linea 7|            Polanco|     6173|
|Linea 4|    Canal del Norte|     2317|
|Linea 4|          Bondojito|     2474|
|Linea 4|        Santa Anita|     1042|
+-------+-------------------+---------+
only showing top 20 rows
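
The cell above selects columns by their default names but does not rename anything. One way to do the renaming, sketched under the assumption that it is applied to the header-aware `df`, is to map each Spanish header to an English one and pass the result to `DataFrame.toDF`. The English names and both helper functions are illustrative choices, not part of the original notebook.

```python
# Assumed English names for the Spanish headers; adjust as needed
HEADER_TRANSLATIONS = {
    "fecha": "date",
    "anio": "year",
    "mes": "month",
    "linea": "line",
    "estacion": "station",
    "afluencia": "ridership",
}


def translate_headers(columns):
    """Translate known headers; leave unknown ones unchanged."""
    return [HEADER_TRANSLATIONS.get(name, name) for name in columns]


def rename_to_english(df):
    """Return a new DataFrame whose columns carry the translated names."""
    # toDF(*names) renames columns positionally and returns a new DataFrame
    return df.toDF(*translate_headers(df.columns))


print(translate_headers(["fecha", "anio", "mes", "linea", "estacion", "afluencia"]))
```

In the notebook this would be used as `df_en = rename_to_english(df)`.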



# Filter Rows

In [None]:
spark_df.select("_c3","_c4","_c5") \
  .where("_c3 == 'Linea 1'") \
  .show(20)

+-------+------------------+-----+
|    _c3|               _c4|  _c5|
+-------+------------------+-----+
|Linea 1|          Zaragoza|20227|
|Linea 1|Isabel la Católica| 6487|
|Linea 1|         Moctezuma|10304|
|Linea 1|       Pino Suárez| 8679|
|Linea 1|      Gómez Farías|19499|
|Linea 1|    Salto del Agua| 5483|
|Linea 1|          Balderas| 3771|
|Linea 1|          Tacubaya|12110|
|Linea 1|      Observatorio|30492|
|Linea 1|       Chapultepec|22692|
|Linea 1|         Pantitlán|17042|
|Linea 1|Blvd. Puerto Aéreo|16348|
|Linea 1|           Sevilla| 4713|
|Linea 1|          Balbuena| 2879|
|Linea 1|        Candelaria| 9685|
|Linea 1|            Merced|21524|
|Linea 1|       Insurgentes|19578|
|Linea 1|       Juanacatlán| 1493|
|Linea 1|        Cuauhtémoc| 5791|
|Linea 1|        San Lázaro|12677|
+-------+------------------+-----+
only showing top 20 rows



# Sorting

In [None]:
spark_df.select("_c3","_c4","_c5") \
  .where("_c4 in ('Chapultepec','Observatorio','San Lázaro')") \
  .orderBy("_c3") \
  .show(50)

+-------+------------+-----+
|    _c3|         _c4|  _c5|
+-------+------------+-----+
|Linea 1| Chapultepec|58767|
|Linea 1|Observatorio|64659|
|Linea 1|Observatorio|77938|
|Linea 1|Observatorio|85805|
|Linea 1|  San Lázaro|25708|
|Linea 1|  San Lázaro|27868|
|Linea 1| Chapultepec|55607|
|Linea 1|  San Lázaro|23551|
|Linea 1| Chapultepec|52262|
|Linea 1| Chapultepec|41543|
|Linea 1|  San Lázaro|28121|
|Linea 1|Observatorio|46024|
|Linea 1|  San Lázaro|45170|
|Linea 1| Chapultepec|47254|
|Linea 1| Chapultepec|51880|
|Linea 1|Observatorio|96871|
|Linea 1| Chapultepec|51550|
|Linea 1|  San Lázaro|33554|
|Linea 1|  San Lázaro|29557|
|Linea 1| Chapultepec|58686|
|Linea 1| Chapultepec|62015|
|Linea 1|Observatorio|84456|
|Linea 1|Observatorio|78425|
|Linea 1|  San Lázaro|26184|
|Linea 1|  San Lázaro|28196|
|Linea 1|  San Lázaro|24156|
|Linea 1| Chapultepec|70333|
|Linea 1|Observatorio|84285|
|Linea 1|  San Lázaro|24735|
|Linea 1| Chapultepec|66817|
|Linea 1| Chapultepec|39736|
|Linea 1|  San
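
The query above orders by `_c3` only, so rows within the same line come back in no particular order. Ordering by `_c5` directly would not help much either: read without a schema, the column holds strings, and strings compare character by character. The sketch below shows the difference in plain Python and then a hypothetical helper, `sort_by_ridership`, that casts the column to an integer before sorting.

```python
# Ridership values as Spark loads them without a schema: plain strings
sample = ["9685", "12110", "621"]

print(sorted(sample))           # ['12110', '621', '9685'] (string comparison)
print(sorted(sample, key=int))  # ['621', '9685', '12110'] (numeric comparison)


def sort_by_ridership(df, column="_c5"):
    """Order a DataFrame by `column` cast to int, largest first."""
    return df.orderBy(df[column].cast("int").desc())


# Usage in this notebook:
# sort_by_ridership(spark_df.select("_c3", "_c4", "_c5")).show(20)
```

Note that `spark_df` carries the header row as data, and the text `afluencia` casts to null.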

# Grouping

In [None]:
spark_df.groupBy("_c4").count() \
  .show(30)

+--------------------+-----+
|                 _c4|count|
+--------------------+-----+
|           Panteones| 4779|
| Deptvo. 18 de Marzo| 9558|
|    Bosque de Aragón| 4171|
|      Aquiles Serdán| 4171|
|         Universidad| 4779|
|              Tepito| 4779|
|           Atlalilco| 9558|
|           Iztacalco| 4779|
|  Inst. del Petróleo| 8342|
|       Ciudad Azteca| 4779|
|           Velódromo| 4171|
|         San Joaqu�n|   31|
| Autobuses del Norte| 4779|
|              Aragón| 4171|
|       Plaza AragÃ³n|  608|
|         Pino Suárez| 8342|
|           Garibaldi| 9558|
|           Bondojito| 4779|
|         Insurgentes| 4779|
|          Candelaria| 9558|
|     Villa de Cortés| 4171|
|          Tlatelolco| 4779|
|    San Juan LetrÃ¡n|  608|
|         San Antonio| 4779|
|              Olivos| 4779|
|Ricardo Flores Magón| 4171|
|             Eugenia| 4779|
|          Zapotitlán| 4171|
|   Bosque de AragÃ³n|  608|
|        Romero Rubio| 4779|
+--------------------+-----+
only showing t
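
The counts above list some stations twice, for example `Bosque de Aragón` (4171) and `Bosque de AragÃ³n` (608), because the two spellings are different strings. The sequence `Ã³` is what UTF-8 text looks like after being decoded as Latin-1, so one possible repair, sketched here in plain Python under that assumption, is to reverse the mis-decoding before grouping. `repair_mojibake` is a hypothetical helper. Values that already contain the replacement character `�` have lost the original bytes and cannot be recovered this way.

```python
def repair_mojibake(name):
    """Undo UTF-8 text that was mis-decoded as Latin-1; return other input unchanged."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct accents or unrecoverable characters: keep as-is
        return name


print(repair_mojibake("Bosque de AragÃ³n"))  # Bosque de Aragón
print(repair_mojibake("San Juan LetrÃ¡n"))   # San Juan Letrán
print(repair_mojibake("Pino Suárez"))        # Pino Suárez (unchanged)
```

In Spark, a function like this could be applied to the station column before `groupBy`, for example through a UDF.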