# Intro to PySpark

In this module, we will introduce [PySpark](https://spark.apache.org/docs/3.3.1/api/python/index.html#:~:text=PySpark%20is%20an%20interface%20for,data%20in%20a%20distributed%20environment.) - the Python API for Apache Spark.

If you know Python, PySpark will come naturally.

Here is a great article to check out if you want to learn more: [__What is PySpark?__](https://www.databricks.com/glossary/pyspark)


>📋🐍 - We will run through some simple __PySpark__ operations in this notebook.


---
__Last Update:__ 27 MAY 2024

# 3.1 - Get your hands on some data ✋

🏎️ - For this exercise, download the [Formula 1 World Championship](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020) dataset from Kaggle.


>Note: These will be CSV files. Make sure to complete __Notebook #2 - Intro to Data Tasks__ before getting started here.

In [0]:
#This is a dataframe of Constructors that have competed in F1
constructors_df = spark.read.option("header","true").option("inferSchema","true").csv("dbfs:/FileStore/tables/constructors.csv")
display(constructors_df) 

constructorId,constructorRef,name,nationality,url
1,mclaren,McLaren,British,http://en.wikipedia.org/wiki/McLaren
2,bmw_sauber,BMW Sauber,German,http://en.wikipedia.org/wiki/BMW_Sauber
3,williams,Williams,British,http://en.wikipedia.org/wiki/Williams_Grand_Prix_Engineering
4,renault,Renault,French,http://en.wikipedia.org/wiki/Renault_in_Formula_One
5,toro_rosso,Toro Rosso,Italian,http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso
6,ferrari,Ferrari,Italian,http://en.wikipedia.org/wiki/Scuderia_Ferrari
7,toyota,Toyota,Japanese,http://en.wikipedia.org/wiki/Toyota_Racing
8,super_aguri,Super Aguri,Japanese,http://en.wikipedia.org/wiki/Super_Aguri_F1
9,red_bull,Red Bull,Austrian,http://en.wikipedia.org/wiki/Red_Bull_Racing
10,force_india,Force India,Indian,http://en.wikipedia.org/wiki/Racing_Point_Force_India


In [0]:
#This is a dataframe of race results for the constructors' championship
constructor_results_df = spark.read.option("header","true").option("inferSchema","true").csv("dbfs:/FileStore/tables/constructor_results.csv")
display(constructor_results_df) 

constructorResultsId,raceId,constructorId,points,status
1,18,1,14.0,\N
2,18,2,8.0,\N
3,18,3,9.0,\N
4,18,4,5.0,\N
5,18,5,2.0,\N
6,18,6,1.0,\N
7,18,7,0.0,\N
8,18,8,0.0,\N
9,18,9,0.0,\N
10,18,10,0.0,\N
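Both reads above set two options: `header` tells Spark to treat the first line as column names, and `inferSchema` asks Spark to scan the data and guess each column's type instead of reading everything as a string. Here is a rough plain-Python sketch of that idea (no Spark required), using the first rows of `constructor_results.csv` shown above. The `infer_value` helper is hypothetical and only for illustration. Spark infers one type per column, while this simplified version converts each value on its own.

```python
import csv
import io

# Header line plus the first two data rows of constructor_results.csv.
# A raw string keeps the literal \N marker from being read as an escape.
csv_text = r"""constructorResultsId,raceId,constructorId,points,status
1,18,1,14.0,\N
2,18,2,8.0,\N
"""


def infer_value(text):
    """Hypothetical helper: try int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# Like header=true: the first line supplies the column names.
reader = csv.DictReader(io.StringIO(csv_text))

# Roughly like inferSchema=true: convert each field to a type that fits it.
rows = [{col: infer_value(val) for col, val in record.items()} for record in reader]

for row in rows:
    print(row)
```

Notice that `\N` cannot be parsed as a number, so the `status` values stay as strings in this sketch.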


# 3.2 - PySpark Commands 🐍

Now that we have our data, let's answer another question using _PySpark_!

>💭 - Which constructor has scored the __most points in F1 History__?


🤝 - __Joins__ allow you to combine two or more tables together if they have a common column.

In this scenario - we will be joining our two tables together on the "constructorId" column.

> __Alex The Analyst__ has a great video on Joins if you want to know more: [Joins in MySQL | Intermediate MySQL](https://www.youtube.com/watch?v=lXQzD09BOH0)

In [0]:

#Here - we join the constructor information with the race results information for further analysis
#Joining on the shared column name keeps a single constructorId column in the result
points_df = (
    constructors_df
    .join(constructor_results_df, on="constructorId", how="inner")
    .select("name", "nationality", "points")
)
display(points_df)

name,nationality,points
McLaren,British,14.0
BMW Sauber,German,8.0
Williams,British,9.0
Renault,French,5.0
Toro Rosso,Italian,2.0
Ferrari,Italian,1.0
Toyota,Japanese,0.0
Super Aguri,Japanese,0.0
Red Bull,Austrian,0.0
Force India,Indian,0.0
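To see what the join is doing row by row, here is a plain-Python sketch (no Spark required) of an inner join on `constructorId`, using three rows from each of the tables shown earlier. The `inner_join` helper is hypothetical and only for illustration. Spark runs its joins in a distributed way and is implemented very differently.

```python
# A few rows copied from the two tables displayed earlier in this notebook.
constructors = [
    {"constructorId": 1, "name": "McLaren", "nationality": "British"},
    {"constructorId": 6, "name": "Ferrari", "nationality": "Italian"},
    {"constructorId": 9, "name": "Red Bull", "nationality": "Austrian"},
]
constructor_results = [
    {"constructorResultsId": 1, "raceId": 18, "constructorId": 1, "points": 14.0},
    {"constructorResultsId": 6, "raceId": 18, "constructorId": 6, "points": 1.0},
    {"constructorResultsId": 9, "raceId": 18, "constructorId": 9, "points": 0.0},
]


def inner_join(left, right, key):
    """Hypothetical helper: pair each left row with every right row sharing `key`."""
    # Index the right-hand rows by the join key so each lookup is quick.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        # Left rows without a match are dropped, which is what "inner" means.
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


joined_rows = inner_join(constructors, constructor_results, "constructorId")

# Same idea as .select("name", "nationality", "points")
points_rows = [
    {"name": r["name"], "nationality": r["nationality"], "points": r["points"]}
    for r in joined_rows
]
for row in points_rows:
    print(row)
```

Each constructor row gets matched with its result rows, and only the three selected columns are kept.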


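__groupBy()__ allows you to group together rows that have the same value in one or more columns, so that you can run an aggregation (such as a sum) on each group.

In this scenario - we will perform __groupBy()__ on the name column and sum the points column to see how many points each constructor has earned during its time in F1.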

> [__Spark By {Examples}__](https://sparkbyexamples.com/) is another great resource to reference on your learning journey: [PySpark Groupby Explained with Examples](https://sparkbyexamples.com/pyspark/pyspark-groupby-explained-with-example/)


In [0]:
#Using the newly created points_df, we can see how many points each constructor has earned in its F1 history
#sum("points") names its output column "sum(points)", so we rename it back to "points"
grouped_points_df = points_df.groupBy("name").sum("points").withColumnRenamed("sum(points)", "points")
display(grouped_points_df)

name,points
Shannon,0.0
Shadow,58.0
Cooper-Castellotti,3.0
Leyton House,8.0
Shadow-Matra,0.0
Politoys,0.0
Lotus-Pratt & Whitney,0.0
Cooper-OSCA,0.0
Coloni,0.0
Onyx,6.0
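If the group-then-sum idea feels abstract, here is a plain-Python sketch of it (no Spark required), using the ten joined rows from the join output earlier. Those ten rows have only one row per constructor, so this sketch groups by nationality instead of name to show several rows collapsing into one total. The `group_and_sum` helper is hypothetical and only for illustration.

```python
from collections import defaultdict

# The ten joined rows displayed earlier: (name, nationality, points).
points_rows = [
    ("McLaren", "British", 14.0),
    ("BMW Sauber", "German", 8.0),
    ("Williams", "British", 9.0),
    ("Renault", "French", 5.0),
    ("Toro Rosso", "Italian", 2.0),
    ("Ferrari", "Italian", 1.0),
    ("Toyota", "Japanese", 0.0),
    ("Super Aguri", "Japanese", 0.0),
    ("Red Bull", "Austrian", 0.0),
    ("Force India", "Indian", 0.0),
]


def group_and_sum(rows, key_index, value_index):
    """Hypothetical helper: total one field for each distinct value of another."""
    totals = defaultdict(float)
    for row in rows:
        # Rows sharing the same key land in the same bucket and are added up.
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Group by nationality (index 1) and sum points (index 2).
points_by_nationality = group_and_sum(points_rows, key_index=1, value_index=2)
for nationality, total in points_by_nationality.items():
    print(nationality, total)
```

The two British rows (14.0 and 9.0) collapse into a single total of 23.0. Grouping the full dataset by name works the same way, with one bucket per constructor.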


Lastly, we can use [__sort()__](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sort.html) to return a dataframe that is sorted by the specified column.

Here - we will sort the __grouped_points_df__ in descending order to see which constructor has scored the most points in F1.

>Hint: 🟥🟨


In [0]:
#Ferrari has scored the most points in the history of F1. The Ferrari Tax pays off!! Grande Macchina!!
display(grouped_points_df.sort("points", ascending=False))

name,points
Ferrari,9505.0
Mercedes,7060.5
Red Bull,6891.0
McLaren,6191.5
Williams,3609.0
Renault,1777.0
Force India,1098.0
Team Lotus,918.0
Benetton,861.5
Lotus F1,706.0
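The descending sort can also be sketched in plain Python (no Spark required). This example uses five of the unsorted rows from the grouped output earlier. It only illustrates the ordering and is not how Spark sorts a distributed dataframe.

```python
# Five (name, points) rows taken from the unsorted grouped output earlier.
grouped_rows = [
    ("Shannon", 0.0),
    ("Shadow", 58.0),
    ("Cooper-Castellotti", 3.0),
    ("Leyton House", 8.0),
    ("Onyx", 6.0),
]

# Same idea as .sort("points", ascending=False): largest points first.
sorted_rows = sorted(grouped_rows, key=lambda row: row[1], reverse=True)

# The first row after sorting is the top scorer within this small sample.
top_name, top_points = sorted_rows[0]

for name, points in sorted_rows:
    print(name, points)
```

Within this small sample, Shadow comes out on top with 58.0 points. The same ordering over the full grouped dataframe is what puts Ferrari first.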



# 🏁 Conclusion - READ THE DOCS!!!

In this notebook, you were introduced to some __intermediate PySpark__ tasks: reading CSV files into dataframes, joining them, grouping and summing, and sorting.

>🔐 - If you take anything from this module, remember to __READ THE DOCS__!!

> 📝 - __Technical Reference Documentation__ tells you how to do __EVERYTHING__, you just have to __READ IT__! 

Here is a link to the [PySpark API Reference](https://spark.apache.org/docs/3.1.2/api/python/reference/index.html) - which holds __all public PySpark modules, classes, functions, and methods.__

Docs can be intimidating at first, so don't try to read everything.
> 🔑 - __The 80/20 Rule__ - Focus on the most important information you need for your specific task.
