![](imgs/kodolamaczlogo.png)

# Big Data Processing with Apache Spark

Notebook author: Jakub Nowacki.


## Spark SQL - statistics using DataFrame methods

The `data/` directory contains the file `rollingsales_bronx.csv`.

Below we analyze statistics for the houses in this file: counts, average areas, years built and prices, grouped by neighborhood and building type.

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
import pyspark.sql.functions as func
import pyspark.sql.types as types

spark = pyspark.sql.SparkSession.builder \
    .appName('HouseStatsDF') \
    .getOrCreate()
    
#sc = pyspark.SparkContext(appName='HouseStatsDF')
#spark = pyspark.sql.SQLContext(sc)

In [3]:
import re

# Using Row is one way to create a DataFrame from data; it does not require a separate schema
# see the documentation: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
def from_csv(line):
    # The constants are defined only for readability
    HOOD_COLUMN = 1
    TYPE_COLUMN = 2
    LAND_AREA_COLUMN = 14
    GROSS_AREA_COLUMN = 15
    YEAR_COLUMN = 16
    PRICE_COLUMN = 19
    # Split the line into columns and assign them to dictionary keys
    c = line.split(',')
    row = dict()
    row['hood'] = c[HOOD_COLUMN]
    row['type'] = c[TYPE_COLUMN]
    # Strip every non-digit character before converting to int
    row['landArea'] = int(re.sub(r'[^\d]', '', c[LAND_AREA_COLUMN]))
    row['grossArea'] = int(re.sub(r'[^\d]', '', c[GROSS_AREA_COLUMN]))
    row['year'] = int(re.sub(r'[^\d]', '', c[YEAR_COLUMN]))
    row['price'] = int(re.sub(r'[^\d]', '', c[PRICE_COLUMN]))
    # Return a Row object
    return pyspark.Row(**row)
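
`from_csv` assumes that no field contains a comma and that every numeric field holds at least one digit. If a file had quoted fields such as `"1,842"` or empty values, `line.split(',')` would split in the wrong places and `int('')` would raise an error. The sketch below shows one possible alternative based on Python's standard `csv` module. The helper names `to_int` and `parse_line`, the `default` value and the sample line are made up for illustration and are not part of the notebook.

```python
import csv
import re


def to_int(text, default=0):
    """Keep only the digits of `text`; return `default` when there are none."""
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else default


def parse_line(line):
    """Split one CSV line, honouring quoted fields that contain commas."""
    return next(csv.reader([line]))


# Made-up sample line (not taken from the data file)
sample = 'a,"1,842",,"$355,000"'
fields = parse_line(sample)
print(fields)
print([to_int(f) for f in fields[1:]])
```

Whether a missing number should become 0 or be dropped is a modelling decision; returning a sentinel simply keeps the sketch short.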

In [4]:
# Use the function above, which generates Row objects, to read the CSV line by line
# Note that this creates a plain RDD
sales_rdd = spark.sparkContext.textFile('data/rollingsales_bronx.csv').map(from_csv)
# A DataFrame can be created directly from an RDD of Row objects, without a separate schema
sales = spark.createDataFrame(sales_rdd)
sales.printSchema()
sales.show()

root
 |-- grossArea: long (nullable = true)
 |-- hood: string (nullable = true)
 |-- landArea: long (nullable = true)
 |-- price: long (nullable = true)
 |-- type: string (nullable = true)
 |-- year: long (nullable = true)

+---------+--------------------+--------+------+--------------------+----+
|grossArea|                hood|landArea| price|                type|year|
+---------+--------------------+--------+------+--------------------+----+
|     2048|BATHGATE         ...|    1842|355000|01  ONE FAMILY HO...|1901|
|     1290|BATHGATE         ...|    1103|474819|01  ONE FAMILY HO...|1910|
|     1344|BATHGATE         ...|    1986|210000|01  ONE FAMILY HO...|1899|
|     1431|BATHGATE         ...|    2329|343116|01  ONE FAMILY HO...|1901|
|     4452|BATHGATE         ...|    1855|     0|02  TWO FAMILY HO...|1931|
|     2400|BATHGATE         ...|    2000|316500|02  TWO FAMILY HO...|1993|
|     2394|BATHGATE         ...|    2498|390000|02  TWO FAMILY HO...|1995|
|     1542|BATHGATE       

In [None]:
a1 = sales.groupBy('hood', 'type') \
    .agg(func.count('type').alias('count'),
         func.avg('landArea').alias('avgLandArea'),
         func.avg('year').alias('avgYear'),
         func.avg('price').alias('avgPrice')) \
    .orderBy('hood', 'type')
a1.show()
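
To make explicit what `groupBy('hood', 'type')` followed by `agg` computes, here is a plain-Python sketch of the same idea on a few made-up rows: rows are bucketed by the key pair, then a count and an average are derived for each bucket. The rows and the helper name `group_stats` are illustrative only; Spark does the equivalent work in a distributed fashion.

```python
from collections import defaultdict


def group_stats(rows, keys, value):
    """Return {key tuple: (count, average of `value`)} for a list of dicts."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row[value])
    return {k: (len(v), sum(v) / len(v)) for k, v in sorted(buckets.items())}


# Made-up rows, not taken from the data file
rows = [
    {'hood': 'A', 'type': '01', 'price': 100},
    {'hood': 'A', 'type': '01', 'price': 300},
    {'hood': 'A', 'type': '02', 'price': 50},
    {'hood': 'B', 'type': '01', 'price': 0},
]
stats = group_stats(rows, keys=('hood', 'type'), value='price')
for key, (count, avg) in stats.items():
    print(key, count, avg)
```

The last group shows how a zero price, like the one visible in the sample output above, pulls an average down.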

## Exercises

* Improve the results by removing invalid data.
* Present the data in a form appropriate to its type.
* Compute the averages for houses built in the 20th century, only for groups containing more than 10 houses.
* ★ Compute the averages only for the 10 wealthiest neighborhoods.
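
For the third exercise it helps to pin down what "the 20th century" means. The sketch below assumes the convention that the century runs from 1901 to 2000 inclusive (1900 to 1999 is another common reading); `is_20th_century` is a hypothetical helper name. In the DataFrame API an analogous condition could be written with `Column.between`, for example `sales.year.between(1901, 2000)`.

```python
def is_20th_century(year, first=1901, last=2000):
    """True when `year` falls in the inclusive range [first, last]."""
    return first <= year <= last


# Years from the sample output above, plus 0 as a made-up invalid value
years = [1901, 1899, 1931, 1995, 0]
print([y for y in years if is_20th_century(y)])
```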

In [22]:
#a1 = sales.where(sales.year > 1500).where(sales.landArea > 0) \
a1 = sales.where((sales.year > 1500) & (sales.landArea > 0) & (sales.price > 0)) \
    .groupBy('hood', 'type') \
    .agg(func.count('type').alias('count'),
         func.avg('landArea').alias('avgLandArea'),
         func.avg('year').cast('int').alias('avgYear'),
         #func.avg('price').cast('decimal(10,2)').alias('avgPrice')) \
         func.avg('price').cast(types.DecimalType(10,2)).alias('avgPrice')) \
    .where('count > 10') \
    .orderBy(func.col('count').desc(), 'hood', 'type')
    #.orderBy('count', ascending=False) \
a1.show()

+--------------------+--------------------+-----+------------------+-------+---------+
|                hood|                type|count|       avgLandArea|avgYear| avgPrice|
+--------------------+--------------------+-----+------------------+-------+---------+
|THROGS NECK      ...|01  ONE FAMILY HO...|   83|2686.3373493975905|   1945|340273.33|
|BAYCHESTER       ...|01  ONE FAMILY HO...|   77|2854.4935064935066|   1941|242051.42|
|BAYCHESTER       ...|02  TWO FAMILY HO...|   74|2781.4594594594596|   1954|344140.58|
|WILLIAMSBRIDGE   ...|02  TWO FAMILY HO...|   73| 2697.794520547945|   1939|343334.67|
|SCHUYLERVILLE/PEL...|02  TWO FAMILY HO...|   57|3137.9298245614036|   1938|387447.02|
|SOUNDVIEW        ...|02  TWO FAMILY HO...|   57| 2586.315789473684|   1943|309658.05|
|MORRIS PARK/VAN N...|02  TWO FAMILY HO...|   48|2367.2083333333335|   1935|376374.40|
|THROGS NECK      ...|02  TWO FAMILY HO...|   47|2852.5106382978724|   1949|342417.23|
|CASTLE HILL/UNION...|02  TWO FAMILY HO...|

In [24]:
posh = sales.groupBy('hood') \
   .agg(func.avg('price').alias('p')) \
   .orderBy('p', ascending=False) \
   .select('hood') \
   .limit(10)
   
p = [r['hood'] for r in posh.collect()]
p

['MOUNT HOPE/MOUNT EDEN    ',
 'HIGHBRIDGE/MORRIS HEIGHTS',
 'MOTT HAVEN/PORT MORRIS   ',
 'CO-OP CITY               ',
 'FORDHAM                  ',
 'BRONX PARK               ',
 'PELHAM PARKWAY SOUTH     ',
 'BEDFORD PARK/NORWOOD     ',
 'KINGSBRIDGE HTS/UNIV HTS ',
 'MELROSE/CONCOURSE        ']
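
Note that `posh` is computed from all rows, including sales with a `price` of 0 like the one visible in the earlier sample output, so such rows lower a neighborhood's average. Whether to exclude them is a judgment call for the exercise. The plain-Python sketch below, on made-up data with a hypothetical helper `top_n_by_avg`, shows how the ranking can change when zero prices are filtered out first.

```python
from collections import defaultdict


def top_n_by_avg(rows, n, min_price=0):
    """Rank hoods by average price, using only rows with price > min_price."""
    prices = defaultdict(list)
    for hood, price in rows:
        if price > min_price:
            prices[hood].append(price)
    averages = {hood: sum(p) / len(p) for hood, p in prices.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]


# Made-up (hood, price) pairs, not taken from the data file
rows = [('A', 900), ('A', 0), ('A', 0), ('B', 400), ('B', 500)]
print(top_n_by_avg(rows, n=1, min_price=-1))  # zero prices included
print(top_n_by_avg(rows, n=1))                # zero prices excluded
```

In Spark the same idea would amount to applying a filter such as `.where(sales.price > 0)` before `groupBy('hood')`.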

In [27]:
a1 = sales.join(posh, on='hood', how='leftsemi') \
    .where((sales.year > 1500) & (sales['price'] > 0)) \
    .where('landArea > 0') \
    .groupBy('hood', 'type') \
    .agg(func.count('type').alias('count'),
         func.avg('landArea').alias('avgLandArea'),
         func.avg('year').cast('int').alias('avgYear'),
         func.avg('price').cast(types.DecimalType(10,2)).alias('avgPrice')) \
    .where('count > 10') \
    .orderBy(func.col('count').desc(), 'hood', 'type')

a1.show()
a1.printSchema()

+--------------------+--------------------+-----+------------------+-------+-----------+
|                hood|                type|count|       avgLandArea|avgYear|   avgPrice|
+--------------------+--------------------+-----+------------------+-------+-----------+
|HIGHBRIDGE/MORRIS...|02  TWO FAMILY HO...|   24|2407.6666666666665|   1945|  274374.42|
|BEDFORD PARK/NORW...|02  TWO FAMILY HO...|   23| 2413.304347826087|   1917|  360086.17|
|FORDHAM          ...|07  RENTALS - WAL...|   22| 6657.909090909091|   1921| 2315263.64|
|BEDFORD PARK/NORW...|07  RENTALS - WAL...|   21| 8995.285714285714|   1922| 3907523.81|
|MELROSE/CONCOURSE...|07  RENTALS - WAL...|   18| 4512.944444444444|   1924| 1800854.17|
|MOUNT HOPE/MOUNT ...|08  RENTALS - ELE...|   17|15337.823529411764|   1932| 7809504.35|
|MELROSE/CONCOURSE...|03  THREE FAMILY ...|   16|         2272.0625|   1907|  277009.06|
|MOTT HAVEN/PORT M...|02  TWO FAMILY HO...|   16|          1921.375|   1941|  317152.25|
|KINGSBRIDGE HTS/U...