## F1 Big Data Challenge

#### Questions

1. ~~Median~~ Average points per season for the top 20 drivers of the last 10 seasons

2. All races in which only 3 teams scored points.

3. Best pit stop time per season, with the team that performed it and the driver who was in the car

4. Best pit stop time per team per season

5. Highest-scoring driver among those who never reached the podium <br/>
   to qualify for this group, the driver must never have finished on a podium in their Formula One career

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
# Note: this `sum` is PySpark's column aggregate and shadows Python's built-in sum().
from pyspark.sql.functions import col, sum, avg
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("f1-challenge")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [3]:
! hadoop fs -put ../datasets/f1

put: `f1/circuits.csv': File exists
put: `f1/constructorResults.csv': File exists
put: `f1/constructorStandings.csv': File exists
put: `f1/constructors.csv': File exists
put: `f1/driverStandings.csv': File exists
put: `f1/drivers.csv': File exists
put: `f1/lapTimes.csv': File exists
put: `f1/pitStops.csv': File exists
put: `f1/qualifying.csv': File exists
put: `f1/races.csv': File exists
put: `f1/results.csv': File exists
put: `f1/seasons.csv': File exists
put: `f1/status.csv': File exists


In [4]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-03-07 21:10 hdfs://node-master:9000/user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1


In [5]:
! hadoop fs -ls hdfs://node-master:9000/user/root/f1

Found 13 items
-rw-r--r--   2 root supergroup       8667 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/circuits.csv
-rw-r--r--   2 root supergroup     224140 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/constructorResults.csv
-rw-r--r--   2 root supergroup     267256 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/constructorStandings.csv
-rw-r--r--   2 root supergroup      15617 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/constructors.csv
-rw-r--r--   2 root supergroup     768136 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/driverStandings.csv
-rw-r--r--   2 root supergroup      79533 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/drivers.csv
-rw-r--r--   2 root supergroup   12118621 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/lapTimes.csv
-rw-r--r--   2 root supergroup     220898 2023-03-07 21:02 hdfs://node-master:9000/user/root/f1/pitStops.csv
-rw-r--r--   2 root supergroup     315477 2023-03-07 21:02 hdfs://node-m

In [6]:
files = ["circuits.csv",
"constructorResults.csv",
"constructorStandings.csv",
"constructors.csv",
"driverStandings.csv",
"drivers.csv",
"lapTimes.csv",
"pitStops.csv",
"qualifying.csv",
"races.csv",
"results.csv",
"seasons.csv",
"status.csv",]

In [7]:
dfs = {}
for file in files:
    dfs[file[:-4]] = spark.read.format("csv").option("header", "true").load(f"/user/root/f1/{file}")
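
The loop above derives each dictionary key with `file[:-4]`, which assumes every name ends in a four-character extension such as `.csv`. Since `Path` is already imported, a minimal alternative sketch is shown below; the shortened `files` list and the `keys` name are only illustrative.

```python
from pathlib import Path

# Illustrative subset of the file list used in the notebook.
files = ["circuits.csv", "lapTimes.csv", "pitStops.csv"]

# Path.stem drops the final suffix regardless of its length,
# so the key does not depend on slicing a fixed number of characters.
keys = {Path(name).stem: name for name in files}

print(keys)
```

The same expression, `Path(file).stem`, could replace `file[:-4]` inside the loading loop without changing the resulting keys for these files.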

In [8]:
dfs.keys()

dict_keys(['circuits', 'constructorResults', 'constructorStandings', 'constructors', 'driverStandings', 'drivers', 'lapTimes', 'pitStops', 'qualifying', 'races', 'results', 'seasons', 'status'])

In [9]:
for file, df in dfs.items():
    print(file)
    df.printSchema()

circuits
root
 |-- circuitId: string (nullable = true)
 |-- circuitRef: string (nullable = true)
 |-- name: string (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- lng: string (nullable = true)
 |-- alt: string (nullable = true)
 |-- url: string (nullable = true)

constructorResults
root
 |-- constructorResultsId: string (nullable = true)
 |-- raceId: string (nullable = true)
 |-- constructorId: string (nullable = true)
 |-- points: string (nullable = true)
 |-- status: string (nullable = true)

constructorStandings
root
 |-- constructorStandingsId: string (nullable = true)
 |-- raceId: string (nullable = true)
 |-- constructorId: string (nullable = true)
 |-- points: string (nullable = true)
 |-- position: string (nullable = true)
 |-- positionText: string (nullable = true)
 |-- wins: string (nullable = true)
 |-- _c7: string (nullable = true)

constructors
root
 |-- constructorId: string (nullable

In [10]:
dfs["seasons"].show()

+----+--------------------+
|year|                 url|
+----+--------------------+
|2009|http://en.wikiped...|
|2008|http://en.wikiped...|
|2007|http://en.wikiped...|
|2006|http://en.wikiped...|
|2005|http://en.wikiped...|
|2004|http://en.wikiped...|
|2003|http://en.wikiped...|
|2002|http://en.wikiped...|
|2001|http://en.wikiped...|
|2000|http://en.wikiped...|
|1999|http://en.wikiped...|
|1998|http://en.wikiped...|
|1997|http://en.wikiped...|
|1996|http://en.wikiped...|
|1995|http://en.wikiped...|
|1994|http://en.wikiped...|
|1993|http://en.wikiped...|
|1992|http://en.wikiped...|
|1991|http://en.wikiped...|
|1990|http://en.wikiped...|
+----+--------------------+
only showing top 20 rows



## Q1

In [11]:
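# `year` was loaded as a string; for four-digit years, descending string order matches numeric order.
last_10_seasons = dfs["seasons"].orderBy(dfs["seasons"].year.desc()).limit(10)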

In [12]:
races_last_10_seasons = dfs["races"].filter(
    (dfs["races"].year.isin(list(last_10_seasons.toPandas()["year"])))
)

In [13]:
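# Cast points from string to a numeric column so it can be aggregated.
# Note: an integer cast cannot represent fractional scores (e.g. half points);
# casting to "double" would preserve them if the data contains any.
dfs["results"] = dfs["results"].withColumn("points_", dfs["results"].points.cast("int"))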

In [14]:
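# Average and total points per driver over the last 10 seasons.
# Note: avg() is taken over individual race results, not over season totals,
# and limit(10) keeps the top 10 drivers, whereas Q1 asks for 20.
dfs["results"].join(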
    races_last_10_seasons, dfs["results"].raceId == races_last_10_seasons.raceId, "inner"
).join(
    dfs["drivers"], dfs["results"].driverId == dfs["drivers"].driverId, "inner"
).groupBy(
    dfs["results"].driverId, dfs["drivers"].driverRef
).agg(
    avg("points_").alias("avg_points"),
    sum("points_").alias("all_points")
).orderBy(
    col("avg_points").desc()
).limit(10).show()

+--------+--------------+------------------+----------+
|driverId|     driverRef|        avg_points|all_points|
+--------+--------------+------------------+----------+
|       1|      hamilton|13.890173410404625|      2403|
|      20|        vettel|13.780346820809248|      2384|
|       3|       rosberg| 10.15032679738562|      1553|
|      17|        webber|10.074468085106384|       947|
|       8|     raikkonen| 7.774436090225564|      1034|
|       4|        alonso| 7.635294117647059|      1298|
|     822|        bottas|7.3061224489795915|       716|
|     830|max_verstappen| 7.016666666666667|       421|
|      18|        button| 6.512987012987013|      1003|
|     817|     ricciardo| 6.325581395348837|       816|
+--------+--------------+------------------+----------+
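
The query above averages `points_` over individual race results. Q1 as worded asks for the average per season, which suggests a two-step aggregation: first sum the points per driver and season, then average those season totals per driver. Below is a minimal plain-Python sketch of that logic, not the notebook's own method. The rows are made-up toy data and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (driver, season, points scored in one race).
results = [
    ("driver_a", 2016, 25.0),
    ("driver_a", 2016, 18.0),
    ("driver_a", 2017, 25.0),
    ("driver_b", 2016, 10.0),
    ("driver_b", 2017, 0.5),
    ("driver_b", 2017, 12.0),
]


def average_points_per_season(rows):
    """Return {driver: mean of that driver's season totals}."""
    # Step 1: total points per (driver, season).
    season_totals = defaultdict(float)
    for driver, season, points in rows:
        season_totals[(driver, season)] += points

    # Step 2: collect each driver's season totals and average them.
    totals_by_driver = defaultdict(list)
    for (driver, _season), total in season_totals.items():
        totals_by_driver[driver].append(total)
    return {driver: mean(totals) for driver, totals in totals_by_driver.items()}


def top_n(averages, n):
    """Return the n (driver, average) pairs with the highest average."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


per_season = average_points_per_season(results)
print(top_n(per_season, 20))
```

In PySpark the same idea would presumably be two chained `groupBy` steps, one on driver and year with `sum`, then one on driver with `avg`, followed by `limit(20)`. Keeping the points as floats in the sketch also avoids losing fractional scores.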

