# Exercise 2.9 Causes of Death
As in the earlier exercises, the data must first be loaded into HDFS. I copied the file `death2016.csv` into the volume of the `namenode` container, opened a shell in that container with `docker exec -it namenode bash`, changed to the corresponding directory with `cd /hadoop-data`, and finally copied the data into HDFS with `hadoop fs -copyFromLocal death2016.csv workspace/pyspark`.
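
The upload steps can also be scripted. The following is only a minimal sketch that assembles the equivalent non-interactive command. The helper name `hdfs_upload_command` and its parameters are hypothetical, and the sketch assumes the file already sits in `/hadoop-data` inside the container.

```python
import shlex

def hdfs_upload_command(local_file="death2016.csv",
                        container="namenode",
                        workdir="/hadoop-data",
                        hdfs_dir="workspace/pyspark"):
    """Hypothetical helper: builds the command, it does not run it."""
    # Same two steps as in the interactive shell: cd, then copyFromLocal
    inner = (f"cd {shlex.quote(workdir)} && "
             f"hadoop fs -copyFromLocal {shlex.quote(local_file)} {shlex.quote(hdfs_dir)}")
    # No -it flags, because no interactive terminal is needed here
    return ["docker", "exec", container, "bash", "-c", inner]

cmd = hdfs_upload_command()
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on the Docker host. The relative HDFS target `workspace/pyspark` resolves against the HDFS home directory of the user running the command, which matches the path `/user/root/workspace/pyspark` used when the file is read later.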

In [3]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, BooleanType
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions
from pyspark.sql import dataframe
from pyspark.sql.functions import to_timestamp, to_date, year, dayofweek

## Reading the file

In [14]:
# Spark session & context
spark = SparkSession \
    .builder \
    .master('spark://spark-master:7077') \
    .appName("uebung_29") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
sc = spark.sparkContext

In [31]:
# Define schema of death file
death_cols = [
    StructField('country', StringType()),
    StructField('cause_no', IntegerType()),
    StructField('cause_name', StringType()),
    StructField('sex', StringType()),
    StructField('age', IntegerType()),
    StructField('age_group', StringType()),
]

# The loop variable is named `yr` so that it does not shadow the
# pyspark function `year` imported above
for yr in range(2000, 2017):
    death_low_up = [StructField(f'deaths_{yr}', FloatType()),
                    StructField(f'low_{yr}', FloatType()),
                    StructField(f'up_{yr}', FloatType())]
    death_cols += death_low_up
    
death_schema = StructType(death_cols)
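
As a quick sanity check on the schema, the expected column names can be listed in plain Python, without Spark. This is a sketch only. The helper name `expected_columns` is hypothetical and simply mirrors the six fixed columns and the three columns per year defined above.

```python
def expected_columns(first_year=2000, last_year=2016):
    """Hypothetical helper: column names in the order the schema defines them."""
    base = ['country', 'cause_no', 'cause_name', 'sex', 'age', 'age_group']
    per_year = [f'{prefix}_{y}'
                for y in range(first_year, last_year + 1)
                for prefix in ('deaths', 'low', 'up')]
    return base + per_year

cols = expected_columns()
print(len(cols))   # 6 fixed columns + 17 years * 3 columns = 57
print(cols[6:9])   # ['deaths_2000', 'low_2000', 'up_2000']
```

In the notebook, the list could be compared against `death_schema.fieldNames()` to confirm that the schema has the intended 57 columns in the intended order.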

In [34]:
# Read in death file
file_path = 'hdfs://namenode:8020/user/root/workspace/pyspark/death2016.csv'
deaths = spark.read.csv(file_path, death_schema)
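# printSchema() and show() print directly and return None,
# so wrapping them in print() would only add a stray "None"
deaths.printSchema()
deaths.show(1)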

root
 |-- country: string (nullable = true)
 |-- cause_no: integer (nullable = true)
 |-- cause_name: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- age_group: string (nullable = true)
 |-- deaths_2000: float (nullable = true)
 |-- low_2000: float (nullable = true)
 |-- up_2000: float (nullable = true)
 |-- deaths_2001: float (nullable = true)
 |-- low_2001: float (nullable = true)
 |-- up_2001: float (nullable = true)
 |-- deaths_2002: float (nullable = true)
 |-- low_2002: float (nullable = true)
 |-- up_2002: float (nullable = true)
 |-- deaths_2003: float (nullable = true)
 |-- low_2003: float (nullable = true)
 |-- up_2003: float (nullable = true)
 |-- deaths_2004: float (nullable = true)
 |-- low_2004: float (nullable = true)
 |-- up_2004: float (nullable = true)
 |-- deaths_2005: float (nullable = true)
 |-- low_2005: float (nullable = true)
 |-- up_2005: float (nullable = true)
 |-- deaths_2006: float (nullable = true)
 |-- 

In [46]:
# Creating a temporary view so that we can execute HiveQL statements
deaths.createOrReplaceTempView('deaths')

## Subtask 1
Causes of death with the total number of deaths (Germany, 2000 to 2016)

In [48]:
death_cols = [f'deaths_{year}' for year in range(2000, 2017)]
death_cols_str = '+'.join(death_cols)
stmt = (f'select cause_name, sum({death_cols_str}) as total '
        'from deaths where country = "DEU" '
        'group by cause_name '
        'order by total desc;')
print('HiveQL statement:', stmt)
death_cause_ger = spark.sql(stmt)
death_cause_ger.show()

HiveQL statement: select cause_name, sum(deaths_2000+deaths_2001+deaths_2002+deaths_2003+deaths_2004+deaths_2005+deaths_2006+deaths_2007+deaths_2008+deaths_2009+deaths_2010+deaths_2011+deaths_2012+deaths_2013+deaths_2014+deaths_2015+deaths_2016) as total from deaths where country = "DEU" group by cause_name order by total desc;
+--------------------+--------------------+
|          cause_name|               total|
+--------------------+--------------------+
|          All Causes|1.4551129185058594E7|
|Noncommunicable d...|1.3325673312011719E7|
|Cardiovascular di...|    6070731.26240921|
| Malignant neoplasms|   3786898.846229553|
|Ischaemic heart d...|   3556194.562406063|
|              Stroke|  1135246.9709677696|
|Other circulatory...|   888717.4789266586|
|Respiratory diseases|   875449.3753051758|
|Trachea; bronchus...|    738239.602329731|
|  Digestive diseases|   737520.3945465088|
|Chronic obstructi...|   711904.0276870728|
|    Ischaemic stroke|   681536.0698115826|
|Communica
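
To make the logic of the query explicit, the following sketch reproduces it in plain Python on a few made-up rows. It filters on the country, adds the yearly columns per row, sums per cause and sorts in descending order. The rows, the shortened year range and the helper `row_total` are illustrative assumptions and are not taken from `death2016.csv`. The sketch also mirrors one SQL detail: `a + b` is NULL as soon as one operand is NULL, and `sum()` then skips that row.

```python
from collections import defaultdict

YEARS = range(2000, 2003)  # shortened range, for the toy example only

# Made-up rows for illustration; not real data
rows = [
    {'country': 'DEU', 'cause_name': 'A', 'deaths_2000': 1.0, 'deaths_2001': 2.0, 'deaths_2002': 3.0},
    {'country': 'DEU', 'cause_name': 'A', 'deaths_2000': 4.0, 'deaths_2001': None, 'deaths_2002': 5.0},
    {'country': 'DEU', 'cause_name': 'B', 'deaths_2000': 10.0, 'deaths_2001': 10.0, 'deaths_2002': 10.0},
    {'country': 'FRA', 'cause_name': 'A', 'deaths_2000': 100.0, 'deaths_2001': 100.0, 'deaths_2002': 100.0},
]

def row_total(row):
    """Row-wise sum of the yearly columns; None if any value is missing (SQL NULL)."""
    values = [row[f'deaths_{y}'] for y in YEARS]
    return None if any(v is None for v in values) else sum(values)

totals = defaultdict(float)
for row in rows:
    if row['country'] != 'DEU':      # where country = "DEU"
        continue
    total = row_total(row)
    if total is not None:            # sum() ignores NULL inputs
        totals[row['cause_name']] += total   # group by cause_name

# order by total desc
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # [('B', 30.0), ('A', 6.0)]
```

The second row for cause `A` drops out entirely because one year is missing. If missing years should count as zero instead, each column in the SQL statement could be wrapped as `coalesce(deaths_2000, 0)` and so on.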

In [4]:
# NEVER FORGET to stop the session
spark.stop()