![](imgs/kodolamaczlogo.png)

# Big Data Processing with Apache Spark

Notebook author: Jakub Nowacki.


## Spark SQL - statistics using SQL

The `data/` directory contains the file `rollingsales_bronx.json`.

Below we compute statistics for the houses in that file: counts, average land area, average year and average price, grouped by neighborhood (`hood`) and building type (`type`).

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
import pyspark.sql.functions as func

spark = pyspark.sql.SparkSession.builder \
    .appName('HouseStatsSQL') \
    .getOrCreate()

#sc = pyspark.SparkContext(appName='HouseStatsSQL')
#spark = pyspark.sql.SQLContext(sc)

In [3]:
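# Read the data
sales = spark.read.json('data/rollingsales_bronx.json')
# Don't forget to register the table so it can be queried with SQL
# (createOrReplaceTempView replaces the deprecated registerTempTable)
sales.createOrReplaceTempView('sales')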
sales.printSchema()
sales.show()

root
 |-- grossArea: long (nullable = true)
 |-- hood: string (nullable = true)
 |-- landArea: long (nullable = true)
 |-- price: long (nullable = true)
 |-- type: string (nullable = true)
 |-- year: long (nullable = true)

+---------+--------------------+--------+------+--------------------+----+
|grossArea|                hood|landArea| price|                type|year|
+---------+--------------------+--------+------+--------------------+----+
|     2048|BATHGATE         ...|    1842|355000|01  ONE FAMILY HO...|1901|
|     1290|BATHGATE         ...|    1103|474819|01  ONE FAMILY HO...|1910|
|     1344|BATHGATE         ...|    1986|210000|01  ONE FAMILY HO...|1899|
|     1431|BATHGATE         ...|    2329|343116|01  ONE FAMILY HO...|1901|
|     4452|BATHGATE         ...|    1855|     0|02  TWO FAMILY HO...|1931|
|     2400|BATHGATE         ...|    2000|316500|02  TWO FAMILY HO...|1993|
|     2394|BATHGATE         ...|    2498|390000|02  TWO FAMILY HO...|1995|
|     1542|BATHGATE       

In [4]:
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    AVG(landArea) as avgLandArea,
    AVG(year) as avgYear,
    AVG(price) as avgPrice
FROM sales
GROUP BY hood, type
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()

+--------------------+--------------------+-----+------------------+------------------+------------------+
|                hood|                type|count|       avgLandArea|           avgYear|          avgPrice|
+--------------------+--------------------+-----+------------------+------------------+------------------+
|BATHGATE         ...|01  ONE FAMILY HO...|    4|            1815.0|           1902.75|         345733.75|
|BATHGATE         ...|02  TWO FAMILY HO...|   10|            2131.1|            1947.2|          203427.6|
|BATHGATE         ...|03  THREE FAMILY ...|    6|            2252.5|            1919.0| 292019.1666666667|
|BATHGATE         ...|05  TAX CLASS 1 V...|    1|            2099.0|               0.0|           40730.0|
|BATHGATE         ...|07  RENTALS - WAL...|    6|3161.6666666666665|            1924.5| 424286.1666666667|
|BATHGATE         ...|10  COOPS - ELEVA...|    1|               0.0|               0.0|               4.0|
|BATHGATE         ...|14  RENTALS - 4

## Tasks

* Fix the results by removing invalid data.
* Present the data in a form suited to each column's type.
* Compute the averages for 20th-century houses, only for groups containing more than 10 houses.
* ★ Compute the averages only for the 10 richest neighborhoods.
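
The first and third tasks rely on two different kinds of filter: `WHERE` drops individual rows before grouping, while `HAVING` drops whole groups after aggregation. The sketch below illustrates the difference under some assumptions: SQLite (Python's standard-library `sqlite3`) stands in for Spark SQL so that it runs without a Spark installation, the rows are made-up toy data rather than records from the file, and the threshold is lowered to "more than 1" to fit the tiny sample.

```python
import sqlite3

# Hypothetical toy rows (hood, type, landArea, year, price); not taken from the dataset.
rows = [
    ("A", "01", 1000, 1920, 100000),
    ("A", "01", 2000, 1930, 300000),
    ("A", "01", 0, 0, 0),             # invalid record: zero area, year and price
    ("B", "02", 1500, 1950, 200000),  # valid, but the only row in its group
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (hood TEXT, type TEXT, landArea INTEGER, year INTEGER, price INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT hood, type, COUNT(*) AS cnt, AVG(price) AS avgPrice
FROM sales
WHERE year > 1500 AND landArea > 0 AND price > 0   -- row filter, applied before grouping
GROUP BY hood, type
HAVING COUNT(*) > 1                                 -- group filter, applied after aggregation
ORDER BY hood, type
"""
result = conn.execute(query).fetchall()
print(result)
```

Only group `A` survives: its invalid row is removed by `WHERE`, and group `B` is removed by `HAVING` because it has a single row.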

In [18]:
# Fix the results by filtering out invalid records (zero year, land area or price)
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    AVG(landArea) as avgLandArea,
    AVG(year) as avgYear,
    AVG(price) as avgPrice
FROM sales
WHERE YEAR > 1500 AND landArea > 0 AND price > 0
GROUP BY hood, type
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()

+--------------------+--------------------+-----+------------------+------------------+------------------+
|                hood|                type|count|       avgLandArea|           avgYear|          avgPrice|
+--------------------+--------------------+-----+------------------+------------------+------------------+
|BATHGATE         ...|01  ONE FAMILY HO...|    4|            1815.0|           1902.75|         345733.75|
|BATHGATE         ...|02  TWO FAMILY HO...|    6|2378.1666666666665|1951.8333333333333|          339046.0|
|BATHGATE         ...|03  THREE FAMILY ...|    4|           1937.75|           1926.25|         438028.75|
|BATHGATE         ...|07  RENTALS - WAL...|    3|2904.6666666666665|            1918.0| 848572.3333333334|
|BATHGATE         ...|22  STORE BUILDIN...|    1|            5489.0|            1900.0|          466154.0|
|BATHGATE         ...|29  COMMERCIAL GA...|    2|           14556.0|            1944.0|         1647500.0|
|BATHGATE         ...|30  WAREHOUSES 

In [19]:
# Present the data in a form suited to each column's type.
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    AVG(landArea) as avgLandArea,
    CAST(AVG(year) as INT) as avgYear,
    CAST(AVG(price) as DECIMAL(10,2)) as avgPrice
FROM sales
WHERE YEAR > 1500 AND landArea > 0 AND price > 0
GROUP BY hood, type
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+------------------+-------+----------+
|                hood|                type|count|       avgLandArea|avgYear|  avgPrice|
+--------------------+--------------------+-----+------------------+-------+----------+
|BATHGATE         ...|01  ONE FAMILY HO...|    4|            1815.0|   1902| 345733.75|
|BATHGATE         ...|02  TWO FAMILY HO...|    6|2378.1666666666665|   1951| 339046.00|
|BATHGATE         ...|03  THREE FAMILY ...|    4|           1937.75|   1926| 438028.75|
|BATHGATE         ...|07  RENTALS - WAL...|    3|2904.6666666666665|   1918| 848572.33|
|BATHGATE         ...|22  STORE BUILDIN...|    1|            5489.0|   1900| 466154.00|
|BATHGATE         ...|29  COMMERCIAL GA...|    2|           14556.0|   1944|1647500.00|
|BATHGATE         ...|30  WAREHOUSES   ...|    1|            9180.0|   1931|9733979.00|
|BATHGATE         ...|41  TAX CLASS 4 -...|    2|            5261.5|   1940| 535000.00|
|BAYCHESTER       ...|01  ONE FA
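
Comparing the two outputs above, the average year 1902.75 becomes 1902 after `CAST(... AS INT)`, so the cast truncates rather than rounds. The sketch below shows the same contrast between casting and rounding. It uses SQLite through Python's `sqlite3` module as an assumed stand-in for Spark SQL, and the literal value is copied from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CAST to an integer drops the fractional part; ROUND goes to the nearest whole number.
truncated, rounded = conn.execute(
    "SELECT CAST(1902.75 AS INTEGER), ROUND(1902.75)"
).fetchone()
print(truncated, rounded)
```

If the nearest year is preferred over the truncated one, rounding before the cast is one option.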

In [23]:
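# Compute the averages for 20th-century houses, only for groups containing more than 10 houses.
# Note: the WHERE clause below only drops invalid years; it does not restrict the data to the 20th century.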
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    AVG(landArea) as avgLandArea,
    CAST(AVG(year) as INT) as avgYear,
    CAST(AVG(price) as DECIMAL(10,2)) as avgPrice
FROM sales
WHERE YEAR > 1500 AND landArea > 0 AND price > 0
GROUP BY hood, type
HAVING count > 10
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+------------------+-------+----------+
|                hood|                type|count|       avgLandArea|avgYear|  avgPrice|
+--------------------+--------------------+-----+------------------+-------+----------+
|BAYCHESTER       ...|01  ONE FAMILY HO...|   77|2854.4935064935066|   1941| 242051.42|
|BAYCHESTER       ...|02  TWO FAMILY HO...|   74|2781.4594594594596|   1954| 344140.58|
|BAYCHESTER       ...|03  THREE FAMILY ...|   17|2522.6470588235293|   1975| 460011.06|
|BEDFORD PARK/NORW...|01  ONE FAMILY HO...|   11| 2839.090909090909|   1909| 541681.82|
|BEDFORD PARK/NORW...|02  TWO FAMILY HO...|   23| 2413.304347826087|   1917| 360086.17|
|BEDFORD PARK/NORW...|07  RENTALS - WAL...|   21| 8995.285714285714|   1922|3907523.81|
|BEDFORD PARK/NORW...|08  RENTALS - ELE...|   13| 10067.76923076923|   1931|3668538.46|
|BELMONT          ...|02  TWO FAMILY HO...|   18|2724.9444444444443|   1915| 296362.00|
|BELMONT          ...|07  RENTAL
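
The query above keeps every record with a valid year, so houses from outside the 20th century still contribute to the averages. A minimal sketch of how the century restriction could be expressed follows. It assumes the 20th century means the years 1901 to 2000, uses SQLite via `sqlite3` in place of Spark SQL, and works on made-up toy rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (hood TEXT, year INTEGER, price INTEGER)")
# Hypothetical toy rows; only the 1901, 1950 and 2000 records fall in the assumed range.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", 1899, 100),
        ("A", 1901, 200),
        ("A", 1950, 400),
        ("A", 2000, 600),
        ("A", 2005, 800),
        ("A", 0, 0),  # invalid record
    ],
)

query = """
SELECT hood, COUNT(*) AS cnt, AVG(year) AS avgYear, AVG(price) AS avgPrice
FROM sales
WHERE year BETWEEN 1901 AND 2000 AND price > 0   -- BETWEEN is inclusive on both ends
GROUP BY hood
"""
century_stats = conn.execute(query).fetchall()
print(century_stats)
```

In the notebook's query, the same `BETWEEN` condition would replace `YEAR > 1500`, since it also excludes the zero years.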

In [33]:
# ★ Compute the averages only for the 10 richest neighborhoods (ranked here by total sales value, SUM(price)).
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    CAST(AVG(landArea) AS INT) as avgLandArea,
    CAST(AVG(year) as INT) as avgYear,
    CAST(AVG(price) as DECIMAL(10,2)) as avgPrice
FROM sales
WHERE YEAR > 1500 AND landArea > 0 AND price > 0 AND hood in (
    SELECT hood FROM sales GROUP BY hood ORDER BY sum(price) desc LIMIT 10)
GROUP BY hood, type
HAVING count > 10
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+------------------+-------+-----------+
|                hood|                type|count|       avgLandArea|avgYear|   avgPrice|
+--------------------+--------------------+-----+------------------+-------+-----------+
|BEDFORD PARK/NORW...|01  ONE FAMILY HO...|   11| 2839.090909090909|   1909|  541681.82|
|BEDFORD PARK/NORW...|02  TWO FAMILY HO...|   23| 2413.304347826087|   1917|  360086.17|
|BEDFORD PARK/NORW...|07  RENTALS - WAL...|   21| 8995.285714285714|   1922| 3907523.81|
|BEDFORD PARK/NORW...|08  RENTALS - ELE...|   13| 10067.76923076923|   1931| 3668538.46|
|FORDHAM          ...|02  TWO FAMILY HO...|   11|2231.5454545454545|   1918|  288136.36|
|FORDHAM          ...|07  RENTALS - WAL...|   22| 6657.909090909091|   1921| 2315263.64|
|FORDHAM          ...|08  RENTALS - ELE...|   12|10386.083333333334|   1936| 3728073.58|
|HIGHBRIDGE/MORRIS...|02  TWO FAMILY HO...|   24|2407.6666666666665|   1945|  274374.42|
|HIGHBRIDGE/MORRIS...

In [34]:
# ★ Compute the averages only for the 10 richest neighborhoods (ranked here by average price, AVG(price)).
query = """
SELECT 
    hood,
    type,
    COUNT(*) as count,
    CAST(AVG(landArea) AS INT) as avgLandArea,
    CAST(AVG(year) AS INT) as avgYear,
    CAST(AVG(price) AS DECIMAL(10,2)) as avgPrice
FROM sales
WHERE YEAR > 1500 AND landArea > 0 AND price > 0 AND hood IN (
    SELECT hood FROM sales GROUP BY hood ORDER BY AVG(price) DESC LIMIT 10
)
GROUP BY hood, type
HAVING count > 10
ORDER BY hood, type
"""
a1 = spark.sql(query)
a1.show()
a1.printSchema()

+--------------------+--------------------+-----+-----------+-------+-----------+
|                hood|                type|count|avgLandArea|avgYear|   avgPrice|
+--------------------+--------------------+-----+-----------+-------+-----------+
|BEDFORD PARK/NORW...|01  ONE FAMILY HO...|   11|       2839|   1909|  541681.82|
|BEDFORD PARK/NORW...|02  TWO FAMILY HO...|   23|       2413|   1917|  360086.17|
|BEDFORD PARK/NORW...|07  RENTALS - WAL...|   21|       8995|   1922| 3907523.81|
|BEDFORD PARK/NORW...|08  RENTALS - ELE...|   13|      10067|   1931| 3668538.46|
|FORDHAM          ...|02  TWO FAMILY HO...|   11|       2231|   1918|  288136.36|
|FORDHAM          ...|07  RENTALS - WAL...|   22|       6657|   1921| 2315263.64|
|FORDHAM          ...|08  RENTALS - ELE...|   12|      10386|   1936| 3728073.58|
|HIGHBRIDGE/MORRIS...|02  TWO FAMILY HO...|   24|       2407|   1945|  274374.42|
|HIGHBRIDGE/MORRIS...|03  THREE FAMILY ...|   11|       3767|   1951|  387788.00|
|HIGHBRIDGE/MORR
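
The two solutions to the last task differ only in how "richest" is ranked inside the subquery: by `SUM(price)` or by `AVG(price)`. These can select different neighborhoods, because a total favors neighborhoods with many sales and an average favors those with expensive ones. The sketch below shows this on made-up toy data, again with SQLite via `sqlite3` as an assumed stand-in for Spark SQL. The helper `top_hoods` is hypothetical and not part of the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (hood TEXT, price INTEGER)")
# Hypothetical toy rows: X has many cheap sales, Y has one expensive sale.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("X", 100), ("X", 100), ("X", 100), ("Y", 250)],
)

def top_hoods(agg, n=1):
    """Return the n neighborhoods ranked highest by the given aggregate of price."""
    if agg not in ("SUM", "AVG"):  # whitelist, since agg is interpolated into the SQL text
        raise ValueError("agg must be 'SUM' or 'AVG'")
    sql = f"SELECT hood FROM sales GROUP BY hood ORDER BY {agg}(price) DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (n,))]

by_total = top_hoods("SUM")    # X: total 300 vs Y: total 250
by_average = top_hoods("AVG")  # X: average 100 vs Y: average 250
print(by_total, by_average)
```

Note that the subqueries in the notebook rank on all rows of `sales`, including those with a zero price, so the invalid records still affect which neighborhoods are selected.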