# Dataset
## Preparation: Setting Up the Working Environment and Providing the Data

In [12]:
# !hdfs dfs -mkdir /airline-data

In [65]:
!hdfs dfs -ls /

Found 2 items
drwxr-xr-x   - cluster supergroup          0 2022-05-12 12:26 /airline-data
drwxr-xr-x   - cluster supergroup          0 2022-05-05 12:04 /user


In [67]:
# !hdfs dfs -put ~/BDLC-Project-G01/Dataset/airline.csv /airline-data
# !hdfs dfs -put ~/BDLC-Project-G01/Dataset/carriers.txt /airline-data

In [68]:
!hdfs dfs -ls /airline-data

Found 2 items
-rw-r--r--   2 cluster supergroup 12029207752 2022-05-12 12:29 /airline-data/airline.csv
-rw-r--r--   2 cluster supergroup       43758 2022-05-12 12:29 /airline-data/carriers.txt


## First Steps: Inspecting the Schema

In [69]:
from pyspark.sql import SparkSession

In [70]:
spark = SparkSession.builder.appName('BabySteps').getOrCreate()

In [72]:
df_airlines = spark.read.csv('/airline-data/airline.csv', header=True)

In [73]:
df_airlines.show(2)

+-----------------+-------+--------+-------+----------+----------+--------------+----------------+---------+------------+---------+----------+--------+-------+----+--------+--------+---------+-----------------+-----+--------+------+-------------+-------+------+-------+-------------+------------+----+
|ActualElapsedTime|AirTime|ArrDelay|ArrTime|CRSArrTime|CRSDepTime|CRSElapsedTime|CancellationCode|Cancelled|CarrierDelay|DayOfWeek|DayofMonth|DepDelay|DepTime|Dest|Distance|Diverted|FlightNum|LateAircraftDelay|Month|NASDelay|Origin|SecurityDelay|TailNum|TaxiIn|TaxiOut|UniqueCarrier|WeatherDelay|Year|
+-----------------+-------+--------+-------+----------+----------+--------------+----------------+---------+------------+---------+----------+--------+-------+----+--------+--------+---------+-----------------+-----+--------+------+-------------+-------+------+-------+-------------+------------+----+
|               53|     32|      -8|   1642|      1650|      1545|            65|             

In [74]:
df_airlines.printSchema()

root
 |-- ActualElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Cancelled: string (nullable = true)
 |-- CarrierDelay: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Diverted: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- LateAircraftDelay: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- NASDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- SecurityDelay: string (nullable = true)
 |-- TailNum: string (nullable = true)
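
Every column comes back as `string` because the CSV is read without a schema. A minimal sketch of one way to get typed columns is to pass a DDL schema string to `spark.read.csv`. The column names are copied from the `show()` header above. The set `INT_COLUMNS` and both helper functions are hypothetical names introduced for illustration, and which columns really hold integers is an assumption that should be checked against the data.

```python
# Column names copied from the header printed by df_airlines.show().
COLUMNS = [
    "ActualElapsedTime", "AirTime", "ArrDelay", "ArrTime", "CRSArrTime",
    "CRSDepTime", "CRSElapsedTime", "CancellationCode", "Cancelled",
    "CarrierDelay", "DayOfWeek", "DayofMonth", "DepDelay", "DepTime", "Dest",
    "Distance", "Diverted", "FlightNum", "LateAircraftDelay", "Month",
    "NASDelay", "Origin", "SecurityDelay", "TailNum", "TaxiIn", "TaxiOut",
    "UniqueCarrier", "WeatherDelay", "Year",
]

# Assumption: these columns contain integers. Adjust after inspecting the data.
INT_COLUMNS = {"ArrDelay", "DepDelay", "Distance"}


def build_ddl_schema(columns=COLUMNS, int_columns=INT_COLUMNS):
    """Return a DDL schema string such as 'ArrDelay INT, Dest STRING'."""
    return ", ".join(
        f"{name} {'INT' if name in int_columns else 'STRING'}"
        for name in columns
    )


def read_airlines_typed(spark, path="/airline-data/airline.csv"):
    """Read the CSV with an explicit schema instead of all-string columns."""
    # DataFrameReader.csv accepts a DDL-formatted string for `schema`.
    return spark.read.csv(path, header=True, schema=build_ddl_schema())
```

In a notebook cell this could be used as `read_airlines_typed(spark).printSchema()`. Under Spark's default permissive parsing mode, fields that cannot be parsed as integers are set to null, so the null counts are worth checking afterwards.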

In [75]:
spark.stop()

## First Steps: Saving as Parquet

In [76]:
from pyspark.sql import SparkSession

In [77]:
spark = SparkSession.builder.appName('ConvertToParquet').getOrCreate()

In [78]:
!hdfs dfs -ls /airline-data/

Found 2 items
-rw-r--r--   2 cluster supergroup 12029207752 2022-05-12 12:29 /airline-data/airline.csv
-rw-r--r--   2 cluster supergroup       43758 2022-05-12 12:29 /airline-data/carriers.txt


### Writing the Parquet File
Repartition factor: 4x the number of cores in the cluster, i.e. 55 for the cores available here

In [79]:
spark.read.csv('/airline-data/airline.csv', header=True).repartition(55).write.parquet('/airline-data/airline.parquet')
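
The rule of thumb stated above (a multiple of the available cores) can be sketched as a small helper, together with the average amount of CSV input that lands in each partition. The function names are hypothetical, the byte count is taken from the `hdfs dfs -ls` listing, and `defaultParallelism` is only an approximation of the core count that depends on the cluster manager and configuration.

```python
# Size of airline.csv in bytes, taken from the `hdfs dfs -ls` output.
CSV_BYTES = 12029207752


def target_partitions(total_cores, factor=4):
    """Partition count as a multiple of the cores available to Spark."""
    return max(1, total_cores * factor)


def mib_per_partition(total_bytes, partitions):
    """Average input size per partition in MiB."""
    return total_bytes / partitions / (1024 ** 2)


def cores_seen_by_spark(spark):
    """Approximate core count as reported by a running SparkSession."""
    # Assumption: defaultParallelism reflects the total executor cores,
    # which holds for common setups but is not guaranteed.
    return spark.sparkContext.defaultParallelism
```

With the 55 partitions used in this notebook, `mib_per_partition(CSV_BYTES, 55)` comes out at a little over 200 MiB of CSV input per partition. The Parquet part files will typically be smaller because Parquet is compressed and column-oriented.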

In [80]:
!hdfs dfs -ls /airline-data/

Found 3 items
-rw-r--r--   2 cluster supergroup 12029207752 2022-05-12 12:29 /airline-data/airline.csv
drwxr-xr-x   - cluster supergroup           0 2022-05-12 12:35 /airline-data/airline.parquet
-rw-r--r--   2 cluster supergroup       43758 2022-05-12 12:29 /airline-data/carriers.txt
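
The listing shows `airline.parquet` as a directory, because Spark writes one part file per partition. Before the source CSV is removed, it can be worth confirming that the write job finished. A minimal sketch, assuming the default Hadoop output committer (which leaves a `_SUCCESS` marker in the output directory) and that the `hdfs` CLI is on the path. Both function names are hypothetical.

```python
import subprocess


def success_marker_cmd(parquet_dir):
    """Build the `hdfs dfs -test -e` command for the _SUCCESS marker."""
    marker = parquet_dir.rstrip("/") + "/_SUCCESS"
    return ["hdfs", "dfs", "-test", "-e", marker]


def parquet_write_succeeded(parquet_dir, run=subprocess.run):
    """Return True if the _SUCCESS marker exists in the output directory.

    `hdfs dfs -test -e` exits with status 0 when the path exists.
    The `run` parameter exists only so the check can be tried without a cluster.
    """
    return run(success_marker_cmd(parquet_dir)).returncode == 0
```

A call such as `parquet_write_succeeded('/airline-data/airline.parquet')` could then gate the deletion. Comparing row counts between the CSV and the Parquet output would be a stricter check, at the cost of another full scan of the 12 GB file.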


### Deleting the Raw File

In [81]:
!hdfs dfs -rm /airline-data/airline.csv

Deleted /airline-data/airline.csv


In [82]:
!hdfs dfs -ls /airline-data/

Found 2 items
drwxr-xr-x   - cluster supergroup          0 2022-05-12 12:35 /airline-data/airline.parquet
-rw-r--r--   2 cluster supergroup      43758 2022-05-12 12:29 /airline-data/carriers.txt


In [83]:
spark.stop()