<a href="https://colab.research.google.com/github/gregorylira/learning-pyspark/blob/main/Fundamentos_Pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What is Spark?

Spark is a platform for cluster computing. It lets you spread data and computations over a cluster with several nodes (think of each node as a separate computer). Splitting the data makes very large datasets easier to work with, because each node only handles a small part of the data.

Deciding whether or not Spark is the best solution for your problem requires some experience, but you can consider questions such as:

- Is my data too big to run on a single machine?
- Can my calculations be easily parallelized?

Install PySpark on Colab:

In [None]:
%pip install pyspark

Mount Google Drive to access the CSV data files:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from pyspark.sql import SparkSession
# creating a local session of pyspark
spark = (
    SparkSession.builder
    .master('local') # create a local instance
    .appName("learning_pyspark_01")
    .getOrCreate()
    )

## Dataframe

Spark's main data structure is the Resilient Distributed Dataset (RDD). This is a low-level object that allows Spark to work its magic by splitting data across multiple nodes in the cluster. However, RDDs are difficult to work with directly.

In this notebook, I use the Spark DataFrame abstraction, which is built on top of RDDs.

The Spark DataFrame is designed to behave much like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, but DataFrames are also more optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to achieve the same result, but some take much longer than others. With RDDs, it is up to the data scientist to work out how to optimize the query; the DataFrame implementation has much of this optimization built in.

To start working with Spark DataFrames, you first need to create a SparkSession object from your SparkContext. You can think of SparkContext as your connection to the cluster and SparkSession as your interface to that connection.

## Viewing tables

Your SparkSession has an attribute called catalog that lists all the data within the cluster. This attribute has some methods to extract different information.

One of the most useful is the .listTables() method, which returns a list describing every table registered in the catalog, including each table's name.

In [5]:
print(spark.catalog.listTables())
# The list is empty here because no DataFrame has been registered as a table yet

[]


## Import table and make query

One of the advantages of the DataFrame interface is that you can run SQL queries against the tables in your Spark cluster.

In this section, I load flights_small.csv. This table contains a row for every flight that left Portland International Airport (PDX) or Seattle-Tacoma International Airport (SEA) in 2014 and 2015.

In [None]:
flight_path = "./drive/MyDrive/learning/spark/flights_small.csv"
# Read the CSV with a header row and let Spark infer the column types
flights = (
    spark.read
    .option("inferSchema", True)
    .option("header", True)
    .csv(flight_path)
)