# Big Data Analysis on houses in King County, WA, USA

![King County](https://upload.wikimedia.org/wikipedia/commons/b/bc/Seattle_-_King_County_Courthouse_and_King_County_Administration_Building_01.jpg)

The data for this project comes from another project of mine, **House Sales in King County**, part of the **IBM Data Science Specialisation** ([source](https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv)).

I use this data for big data analysis, based on the features recorded for each house.
The data has the following features:
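- Date (`date`)
- Id (`id`)
- Price (`price`)
- Bedrooms (`bedrooms`)
- Bathrooms (`bathrooms`)
- Sqft of living area (`sqft_living`)
- Sqft of total lot area (`sqft_lot`)
- Floors (`floors`)
- Waterfront (`waterfront`)
- View (`view`)
- Condition (`condition`)
- Grade (`grade`)
- Sqft of floors above the basement (`sqft_above`)
- Sqft of basement (`sqft_basement`)
- Year built (`yr_built`)
- Year renovated (`yr_renovated`)
- Zipcode (`zipcode`)
- Latitude (`lat`)
- Longitude (`long`)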


In [1]:
#Setting up environment for sparksql-pyspark
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
sc = SparkContext('local')
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

import pandas as pd #for importing datasets and cleaning

import matplotlib.pyplot as plt # for basic visualisation
%matplotlib inline
import folium # for plotting geospatial data


In [3]:
df = pd.read_csv("./ML datasets/kc_house_data_NaN.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,0,7129300520,20141013T000000,221900.0,3.0,1.0,1180,5650,1.0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,1,6414100192,20141209T000000,538000.0,3.0,2.25,2570,7242,2.0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,2,5631500400,20150225T000000,180000.0,2.0,1.0,770,10000,1.0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,3,2487200875,20141209T000000,604000.0,4.0,3.0,1960,5000,1.0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,4,1954400510,20150218T000000,510000.0,3.0,2.0,1680,8080,1.0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [55]:
df.shape

(21613, 22)

- The dataframe has the shape (21613, 22)
- 21613 rows (records)
- 22 columns

In [6]:
cleaned_df = df.drop(["Unnamed: 0","date"],axis =1).dropna()
cleaned_df.isna().sum()

id               0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
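
The cleaning step above drops the `date` column outright. If the date were needed later, a minimal sketch for parsing it could look like the following. The format string is inferred from the values shown in the `df.head()` output (for example `20141013T000000`), and `parse_date` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
from datetime import datetime

def parse_date(raw):
    """Parse a date string such as '20141013T000000' (hypothetical helper)."""
    # Format assumed from the df.head() output: YYYYMMDD, a literal 'T', then HHMMSS.
    return datetime.strptime(raw, "%Y%m%dT%H%M%S")

parsed = parse_date("20141013T000000")
print(parsed.date())  # 2014-10-13
```

With pandas, the same format string could presumably be passed to `pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")` before the column is dropped.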

In [8]:
house_df = sqlContext.createDataFrame(cleaned_df)
house_df.show(5)

+----------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+
|        id|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|
+----------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+
|7129300520|221900.0|     3.0|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|
|6414100192|538000.0|     3.0|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|
|563150040

In [10]:
# counting houses per zipcode
house_count = house_df.select("id","zipcode","lat","long").groupBy("zipcode").count()

In [11]:
house_count.show(5)

+-------+-----+
|zipcode|count|
+-------+-----+
|  98148|   57|
|  98166|  254|
|  98136|  263|
|  98065|  308|
|  98115|  583|
+-------+-----+
only showing top 5 rows
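
Conceptually, `groupBy("zipcode").count()` is a tally per key. A minimal plain-Python sketch of the same idea is shown below. It uses the zipcodes of the five rows printed by `df.head()` plus one artificial duplicate, so `sample_zipcodes` is an illustrative assumption and not the real dataset, and `count_per_zipcode` is a hypothetical helper.

```python
from collections import Counter

# Zipcodes of the five rows printed by df.head(), plus one artificial
# duplicate (98178) so that at least one count is greater than 1.
sample_zipcodes = [98178, 98125, 98028, 98136, 98074, 98178]

def count_per_zipcode(zipcodes):
    """Return {zipcode: number_of_houses}, mirroring groupBy().count()."""
    return dict(Counter(zipcodes))

counts = count_per_zipcode(sample_zipcodes)
print(counts)
```

Spark does the same tally, but distributed across partitions, which is why the row order of `house_count.show()` is not sorted.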



In [None]:
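# Bar plot of the number of houses per zipcode
house_count_df = house_count.toPandas().sort_values("count", ascending=False)
fig = plt.figure(figsize=(10, 5))
plt.bar(house_count_df["zipcode"].astype(str), house_count_df["count"],
        color='maroon', width=0.4)
plt.xticks(rotation=90)
plt.xlabel("Zipcode")
plt.ylabel("No. of houses")
plt.title("Houses per zipcode")
plt.show()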

In [28]:
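# Both conditions go into one SQL expression: Python's `and` between two
# strings would pass only the second one ("bathrooms>2") to filter().
penthouses = house_df.select("id","price","bedrooms","bathrooms","sqft_living","sqft_lot","floors","condition","grade",
                             "sqft_above","sqft_basement","yr_built","zipcode","lat","long").filter("bedrooms > 3 AND bathrooms > 2")
penthouses.count()

- 11239 is the count from the original run, whose filter `"bedrooms>3" and "bathrooms>2"` reduced to `bathrooms>2` alone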
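
Python evaluates an expression such as `"bedrooms>3" and "bathrooms>2"` before Spark ever sees it, and `and` between two non-empty strings simply returns the second string. The sketch below shows that behaviour and one way to build a single SQL filter string instead. The helper `all_of` is hypothetical and not part of PySpark.

```python
def all_of(*conditions):
    """Join SQL condition strings with AND, parenthesising each (hypothetical helper)."""
    return " AND ".join(f"({c})" for c in conditions)

# Python's `and` returns the last operand when every operand is truthy,
# so only the second condition would reach DataFrame.filter().
python_and = "bedrooms>3" and "bathrooms>2"
print(python_and)  # bathrooms>2

combined = all_of("bedrooms > 3", "bathrooms > 2")
print(combined)  # (bedrooms > 3) AND (bathrooms > 2)
```

The combined string can then be passed as `house_df.filter(combined)`, since `filter` accepts a SQL expression string.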

In [29]:
penthouses_df = penthouses.toPandas()
penthouses_map = folium.Map(zoom_start=12,width=500,height=500,location=[47.4319561,-122.3638441])
for i, r in penthouses_df.iterrows():
    #setting for the popup; iterrows() upcasts the integer id to float, so pass it as a string
    popup=folium.Popup(str(int(r['id'])),max_width=1000)
    #Plotting the Marker for each house
    folium.map.Marker(
        location=[r['lat'], r['long']], 
        popup=popup,
        icon=folium.Icon(color="green",icon="house", prefix='fa')
    ).add_to(penthouses_map)
    
penthouses_map.save("index.html")
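
The map above is centred on a hard-coded coordinate. A minimal sketch, assuming you would rather centre the map on the houses being plotted, is to average their latitudes and longitudes. `map_center` is a hypothetical helper, and the sample coordinates are the five rows from the `df.head()` output.

```python
from statistics import mean

def map_center(points):
    """Return [mean_lat, mean_long] for (lat, long) pairs (hypothetical helper)."""
    points = list(points)
    if not points:
        raise ValueError("need at least one point")
    return [mean(p[0] for p in points), mean(p[1] for p in points)]

# (lat, long) of the five rows shown by df.head(); illustrative sample only.
sample_points = [
    (47.5112, -122.257),
    (47.721, -122.319),
    (47.7379, -122.233),
    (47.5208, -122.393),
    (47.6168, -122.045),
]
center = map_center(sample_points)
print(center)
```

In the notebook this could be used as `folium.Map(location=map_center(zip(penthouses_df["lat"], penthouses_df["long"])), zoom_start=12)`.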

In [44]:
most_expensive = house_df.select("id","price","sqft_lot","lat","long","zipcode").filter("price>2000000")
most_expensive.count()

198

In [45]:
most_expensive_df =most_expensive.toPandas()
most_expensive_map = folium.Map(zoom_start=12,width=500,height=500,location=[47.4319561,-122.3638441])
for i, r in most_expensive_df.iterrows():
    #setting for the popup; iterrows() upcasts the integer id to float, so pass it as a string
    popup=folium.Popup(str(int(r['id'])),max_width=1000)
    #Plotting the Marker for each house
    folium.map.Marker(
        location=[r['lat'], r['long']], 
        popup=popup,
        icon=folium.Icon(color="red",icon="house", prefix='fa')
    ).add_to(most_expensive_map)
    
most_expensive_map
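
Because `iterrows()` returns each row as a single-dtype Series, an integer `id` comes back as a float when the row also holds float columns such as `price`. A minimal sketch of a popup label that undoes this and also shows the price is given below. `popup_label` is a hypothetical helper, and the sample values are taken from the first row of the `df.head()` output.

```python
def popup_label(house_id, price):
    """Build popup text from a row's id and price (hypothetical helper)."""
    # int() removes the trailing '.0' introduced by the float upcast;
    # ',.0f' formats the price with thousands separators and no decimals.
    return f"House {int(house_id)}: ${price:,.0f}"

label = popup_label(7129300520.0, 221900.0)
print(label)  # House 7129300520: $221,900
```

Inside the loop this could be used as `folium.Popup(popup_label(r['id'], r['price']), max_width=1000)`.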

In [53]:
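# houses with a waterfront and more than 5 bedrooms

# A single SQL expression applies both conditions; Python's `and` between the
# two strings would pass only "bedrooms > 5" to filter().
with_waterfront = house_df.select("id","lat","long","price","waterfront").filter("waterfront == 1 AND bedrooms > 5")
with_waterfront.count()

- 334 is the count from the original run, whose filter `"waterfront == 1" and "bedrooms > 5"` reduced to `bedrooms > 5` alone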

In [54]:
with_waterfront_df =with_waterfront.toPandas()
with_waterfront_map = folium.Map(zoom_start=12,width=500,height=500,location=[47.4319561,-122.3638441])
for i, r in with_waterfront_df.iterrows():
    #setting for the popup; iterrows() upcasts the integer id to float, so pass it as a string
    popup=folium.Popup(str(int(r['id'])),max_width=1000)
    #Plotting the Marker for each house
    folium.map.Marker(
        location=[r['lat'], r['long']], 
        popup=popup,
        icon=folium.Icon(color="blue",icon="house", prefix='fa')
    ).add_to(with_waterfront_map)
    
with_waterfront_map