![](imgs/kodolamaczlogo.png)

# Big Data Processing with Apache Spark

Notebook author: Jakub Nowacki.


## Spark SQL - statistics using SQL

The `data/` directory contains the file `rollingsales_bronx.json`.

Below we analyze house statistics: the number of records and the average land area, year, and price, grouped by neighborhood (`hood`) and building type (`type`).

In [1]:
import pyspark
import pyspark.sql.functions as func

spark = pyspark.sql.SparkSession.builder \
    .appName('HouseStatsSQL') \
    .getOrCreate()

# Legacy (Spark 1.x) equivalent of the session setup above:
#sc = pyspark.SparkContext(appName='HouseStatsSQL')
#spark = pyspark.sql.SQLContext(sc)

In [2]:
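# Read the data
sales = spark.read.json('data/rollingsales_bronx.json')
# Remember to register the DataFrame as a temporary view so that SQL can query it
# (createOrReplaceTempView replaces the deprecated registerTempTable)
sales.createOrReplaceTempView('sales')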
sales.printSchema()
sales.show()

root
 |-- grossArea: long (nullable = true)
 |-- hood: string (nullable = true)
 |-- landArea: long (nullable = true)
 |-- price: long (nullable = true)
 |-- type: string (nullable = true)
 |-- year: long (nullable = true)

+---------+--------------------+--------+------+--------------------+----+
|grossArea|                hood|landArea| price|                type|year|
+---------+--------------------+--------+------+--------------------+----+
|     2048|BATHGATE         ...|    1842|355000|01  ONE FAMILY HO...|1901|
|     1290|BATHGATE         ...|    1103|474819|01  ONE FAMILY HO...|1910|
|     1344|BATHGATE         ...|    1986|210000|01  ONE FAMILY HO...|1899|
|     1431|BATHGATE         ...|    2329|343116|01  ONE FAMILY HO...|1901|
|     4452|BATHGATE         ...|    1855|     0|02  TWO FAMILY HO...|1931|
|     2400|BATHGATE         ...|    2000|316500|02  TWO FAMILY HO...|1993|
|     2394|BATHGATE         ...|    2498|390000|02  TWO FAMILY HO...|1995|
|     1542|BATHGATE       

In [3]:
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    AVG(landArea) as avgLandArea,
    AVG(year) as avgYear,
    AVG(price) as avgPrice
FROM sales
GROUP BY hood, type
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()

+--------------------+--------------------+-----+------------------+------------------+------------------+
|                hood|                type|count|       avgLandArea|           avgYear|          avgPrice|
+--------------------+--------------------+-----+------------------+------------------+------------------+
|BATHGATE         ...|01  ONE FAMILY HO...|    4|            1815.0|           1902.75|         345733.75|
|BATHGATE         ...|02  TWO FAMILY HO...|   10|            2131.1|            1947.2|          203427.6|
|BATHGATE         ...|03  THREE FAMILY ...|    6|            2252.5|            1919.0| 292019.1666666667|
|BATHGATE         ...|05  TAX CLASS 1 V...|    1|            2099.0|               0.0|           40730.0|
|BATHGATE         ...|07  RENTALS - WAL...|    6|3161.6666666666665|            1924.5| 424286.1666666667|
|BATHGATE         ...|10  COOPS - ELEVA...|    1|               0.0|               0.0|               4.0|
|BATHGATE         ...|14  RENTALS - 4
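
The first row of the result can be checked by hand against the four `01  ONE FAMILY HO...` rows visible in the `sales.show()` output. The sketch below does that in plain Python (no Spark needed) and then shows how a single placeholder value pulls an average down, which is one plausible explanation for values such as `avgYear` of `0.0` in the table above. The fifth row is hypothetical and added only for illustration; it is not taken from the dataset.

```python
from statistics import mean

# The four BATHGATE one-family rows shown by sales.show():
# (landArea, year, price)
rows = [
    (1842, 1901, 355000),
    (1103, 1910, 474819),
    (1986, 1899, 210000),
    (2329, 1901, 343116),
]

# Same aggregates as AVG(landArea), AVG(year), AVG(price) in the query
avg_land_area = mean(r[0] for r in rows)
avg_year = mean(r[1] for r in rows)
avg_price = mean(r[2] for r in rows)
print(avg_land_area, avg_year, avg_price)

# Hypothetical extra record whose unknown year is stored as 0
with_placeholder = rows + [(2000, 0, 300000)]
skewed_year = mean(r[1] for r in with_placeholder)
print(skewed_year)  # far below any real construction year in the group
```

The averages for the four real rows agree with the first line of the query result (`1815.0`, `1902.75`, `345733.75`), so the aggregation itself is fine; the suspicious values come from the input records.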

## Exercises

* Improve the results by removing invalid data.
* Present each value in a form appropriate to its type.
* Compute the averages for houses built in the 20th century, only for groups containing more than 10 houses.
* ★ Compute the averages only for the 10 richest neighborhoods.
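
One possible reading of the third exercise can be sketched as follows. To keep the example runnable without a Spark installation, it uses Python's built-in `sqlite3` module as a stand-in for `spark.sql`; the table contents are made up, `MIN_GROUP_SIZE` is lowered from 10 to 2 so that the tiny sample produces a result, and the 20th century is assumed to mean the years 1901 to 2000.

```python
import sqlite3

# Stand-in for Spark: an in-memory SQLite table with the same column names
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(hood TEXT, type TEXT, landArea INTEGER, price INTEGER, year INTEGER)"
)

# Made-up rows, for illustration only
sample = [
    ("A", "01", 1000, 100000, 1910),
    ("A", "01", 2000, 200000, 1950),
    ("A", "01", 3000, 300000, 1990),
    ("A", "01", 4000, 400000, 2005),  # 21st century: filtered out
    ("A", "01", 1500, 0, 1920),       # price 0: treated as invalid
    ("B", "02", 1200, 150000, 1930),
    ("B", "02", 0, 250000, 1960),     # landArea 0: treated as invalid
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", sample)

MIN_GROUP_SIZE = 2  # the exercise asks for 10

query = """
SELECT hood, type, COUNT(*) AS cnt,
       AVG(landArea) AS avgLandArea,
       AVG(year) AS avgYear,
       AVG(price) AS avgPrice
FROM sales
WHERE year BETWEEN 1901 AND 2000 AND landArea > 0 AND price > 0
GROUP BY hood, type
HAVING COUNT(*) > ?
ORDER BY cnt DESC, hood, type
"""
result = conn.execute(query, (MIN_GROUP_SIZE,)).fetchall()
print(result)
```

`WHERE` removes individual records before grouping, while `HAVING` removes whole groups after aggregation, so group `B` disappears because only one of its rows survives the row-level filter.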

In [15]:
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count, -- number of records in the group
    CAST(AVG(landArea) AS INT) as avgLandArea,
    CAST(AVG(year) AS INT) as avgYear,
    CAST(AVG(price) AS DECIMAL(10,2)) as avgPrice
FROM sales
WHERE year > 1500 AND landArea > 0 AND price > 0
GROUP BY hood, type
HAVING count > 10
ORDER BY count DESC, hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+-----------+-------+---------+
|                hood|                type|count|avgLandArea|avgYear| avgPrice|
+--------------------+--------------------+-----+-----------+-------+---------+
|THROGS NECK      ...|01  ONE FAMILY HO...|   83|       2686|   1945|340273.33|
|BAYCHESTER       ...|01  ONE FAMILY HO...|   77|       2854|   1941|242051.42|
|BAYCHESTER       ...|02  TWO FAMILY HO...|   74|       2781|   1954|344140.58|
|WILLIAMSBRIDGE   ...|02  TWO FAMILY HO...|   73|       2697|   1939|343334.67|
|SCHUYLERVILLE/PEL...|02  TWO FAMILY HO...|   57|       3137|   1938|387447.02|
|SOUNDVIEW        ...|02  TWO FAMILY HO...|   57|       2586|   1943|309658.05|
|MORRIS PARK/VAN N...|02  TWO FAMILY HO...|   48|       2367|   1935|376374.40|
|THROGS NECK      ...|02  TWO FAMILY HO...|   47|       2852|   1949|342417.23|
|CASTLE HILL/UNION...|02  TWO FAMILY HO...|   45|       2651|   1949|332409.47|
|MORRISANIA/LONGWO...|03  THREE FAMILY .

In [20]:
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count, -- number of records in the group
    CAST(AVG(landArea) AS INT) as avgLandArea,
    CAST(AVG(year) AS INT) as avgYear,
    CAST(AVG(price) AS DECIMAL(10,2)) as avgPrice
FROM sales
WHERE year > 1500 AND landArea > 0 AND price > 0 AND hood IN (
    SELECT hood FROM sales GROUP BY hood ORDER BY AVG(price) DESC LIMIT 10
    ) 
GROUP BY hood, type
HAVING count > 10
ORDER BY count DESC, hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+-----------+-------+-----------+
|                hood|                type|count|avgLandArea|avgYear|   avgPrice|
+--------------------+--------------------+-----+-----------+-------+-----------+
|HIGHBRIDGE/MORRIS...|02  TWO FAMILY HO...|   24|       2407|   1945|  274374.42|
|BEDFORD PARK/NORW...|02  TWO FAMILY HO...|   23|       2413|   1917|  360086.17|
|FORDHAM          ...|07  RENTALS - WAL...|   22|       6657|   1921| 2315263.64|
|BEDFORD PARK/NORW...|07  RENTALS - WAL...|   21|       8995|   1922| 3907523.81|
|MELROSE/CONCOURSE...|07  RENTALS - WAL...|   18|       4512|   1924| 1800854.17|
|MOUNT HOPE/MOUNT ...|08  RENTALS - ELE...|   17|      15337|   1932| 7809504.35|
|MELROSE/CONCOURSE...|03  THREE FAMILY ...|   16|       2272|   1907|  277009.06|
|MOTT HAVEN/PORT M...|02  TWO FAMILY HO...|   16|       1921|   1941|  317152.25|
|KINGSBRIDGE HTS/U...|07  RENTALS - WAL...|   15|       7084|   1919| 2510546.27|
|MELROSE/CONCOUR
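
Note that the nested query ranks neighborhoods using all rows of `sales`, including those with `price` equal to 0, whereas the outer query filters such rows out. Whether this is intended is not stated, so the sketch below only illustrates that the two choices can select different neighborhoods. It again uses `sqlite3` as a stand-in for Spark, with made-up data and a hypothetical `TOP_N` of 1 instead of 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (hood TEXT, price INTEGER)")

# Made-up rows: X is expensive but has several 0-price records
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("X", 900000), ("X", 0), ("X", 0),
        ("Y", 400000), ("Y", 500000),
        ("Z", 100000), ("Z", 200000),
    ],
)

TOP_N = 1  # the notebook uses 10

# Ranking as in the notebook's subquery: all rows count towards the average
unfiltered = """
SELECT hood FROM sales
GROUP BY hood ORDER BY AVG(price) DESC LIMIT ?
"""

# Alternative ranking: ignore 0-price rows before averaging
filtered = """
SELECT hood FROM sales WHERE price > 0
GROUP BY hood ORDER BY AVG(price) DESC LIMIT ?
"""

top_unfiltered = [row[0] for row in conn.execute(unfiltered, (TOP_N,))]
top_filtered = [row[0] for row in conn.execute(filtered, (TOP_N,))]
print(top_unfiltered, top_filtered)
```

In this sample the unfiltered ranking picks `Y`, while the filtered one picks `X`; applying the same `WHERE` conditions inside the subquery keeps the ranking consistent with the averages reported by the outer query.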

In [19]:
sales.cache()

DataFrame[grossArea: bigint, hood: string, landArea: bigint, price: bigint, type: string, year: bigint]