<a href="https://colab.research.google.com/github/jianzhiw/SparkML/blob/master/SparkBostonHousing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab #

Google Colab is a Google Research project created to support education and research in machine learning and deep learning. It is a Jupyter notebook environment that requires no setup and runs entirely in the cloud.

Requirements:
1. Google Account
2. Internet connection

Benefits:
1. Free of charge
2. No setup required
3. Cloud computing
4. Free GPU for faster computing

# How to enable GPU #

Go to Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU

![alt text](https://i.imgur.com/hQi43WH.png)

# To keep a copy of this Jupyter notebook #

Go to File -> Save a copy in Drive

You can now access the notebook in My Drive -> Colab Notebooks

# Let's Get Started with PySpark #

Apache Spark was built to analyze big data quickly. One of its important features is the ability to run computations in memory. It is also considered more efficient than MapReduce for complex applications running on disk.

<br>

PySpark is the Python API for Apache Spark. Apache Spark is an open-source cluster-computing framework built around speed, ease of use, and streaming analytics, whereas Python is a general-purpose, high-level programming language. PySpark is a great tool for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. If you're already familiar with Python and libraries such as Pandas, PySpark is a natural next step for creating more scalable analyses and pipelines.

# The Dataset #

In this tutorial we are going to look at the Boston Housing dataset. This data was originally part of the UCI Machine Learning Repository but has since been removed. It was also available from the scikit-learn library, although `load_boston` was removed in scikit-learn 1.2. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices from the given features. See [this page](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html?source=post_page---------------------------) to learn more about the dataset.

<br><br>

Before diving into the code, you will need to download this [dataset](https://drive.google.com/file/d/1JdhWZPPClfJwHpOJPDgue_9VDkSrOIA2/view).

<br><br>

To do so:
1. Click on the arrow button on the left
2. Select Upload and insert the file
3. You are good to go

![alt text](https://i.imgur.com/ku133ZN.png)

Alternatively, you can use the shell command below to download the file.



In [0]:
# Uncomment the line below to run it
# !wget -O BostonHousing.csv "https://drive.google.com/uc?id=1JdhWZPPClfJwHpOJPDgue_9VDkSrOIA2"
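
Whichever route you take, it can help to confirm that the file arrived intact before starting Spark. The sketch below uses only Python's standard `csv` module. The helper name `summarize_csv` is hypothetical (it is not part of the original notebook), and the snippet assumes the file was saved as `BostonHousing.csv` in the working directory.

```python
import csv
import os


def summarize_csv(path):
    """Return (column_names, number_of_data_rows) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)                      # first line holds the column names
        n_rows = sum(1 for row in reader if row)   # count non-empty data rows
    return header, n_rows


# Assumed file name, matching the download step above
csv_path = "BostonHousing.csv"
if os.path.exists(csv_path):
    columns, n_rows = summarize_csv(csv_path)
    print(columns, n_rows)
else:
    print("File not found:", csv_path)
```

For the dataset described above, the row count should come out as 506.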

# Running PySpark in Colab # 

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 2.4.3 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out inside the Colab notebook itself.

In [0]:
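# Install Java 8, download and unpack Spark 2.4.3, then install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases are served from archive.apache.org
!wget -q https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark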

Now that you have installed Spark and Java in Colab, it is time to set the environment variables that let PySpark find them. Set the location of Java and Spark by running the following code:

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"
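
A wrong path here usually only shows up later as a confusing error when the session starts, so a quick check can save time. The sketch below is an optional addition; the helper name `missing_paths` is hypothetical, and it only verifies that each variable points to an existing directory.

```python
import os


def missing_paths(var_names):
    """Return the environment variables that are unset or point to a non-existent directory."""
    problems = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


# An empty list means both directories were found
print(missing_paths(["JAVA_HOME", "SPARK_HOME"]))
```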

Run a local Spark session to test your installation:

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
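
Creating the session does not by itself run any computation. One way to confirm that Spark can execute a job is to build a small DataFrame and count its rows. The helper `spark_smoke_test` below is a hypothetical name introduced for illustration; it relies only on `SparkSession.range` and `DataFrame.count`.

```python
def spark_smoke_test(session, n=1000):
    """Run a trivial job: build a DataFrame of n rows and check that count() returns n."""
    return session.range(n).count() == n


# `spark` is the session created in the cell above; the guard keeps this
# snippet harmless if it is run before that cell.
if "spark" in globals():
    print("Spark is working:", spark_smoke_test(spark))
```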

Your Colab is now ready to run PySpark. Let's build a simple Linear Regression model.

# Linear Regression Model #
Linear regression is one of the oldest and most widely used machine learning approaches. It assumes a linear relationship between the dependent and independent variables. For example, a modeler might want to predict rainfall based on humidity. Linear regression finds the best-fitting line through the scattered points on a graph; this line is known as the regression line.
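
To make the idea of a best-fitting line concrete, here is a minimal sketch of ordinary least squares for a single feature in plain Python. It is an illustration only, not how Spark fits the model internally; the function name `fit_line` and the data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Because the points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With 13 features, as in the housing data, the same idea applies with one coefficient per feature.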

<br>

The goal of this exercise is to predict housing prices from the given features. Let's predict the prices in the Boston Housing dataset by treating MEDV as the output variable and all the other variables as inputs.

For our linear regression model we need to import two classes from PySpark: `VectorAssembler` and `LinearRegression`. `VectorAssembler` is a transformer that combines multiple numeric columns into a single vector column. If any of our columns contained string values, we could have used `StringIndexer` to convert them into numeric values. Luckily, the BostonHousing dataset only contains numeric values, so we don't need to worry about `StringIndexer` for now.

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

Notice that we passed `inferSchema=True` to `read.csv`. This option tells Spark to infer the data type of each column automatically instead of reading every column as a string.
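
As a rough mental model of schema inference, the plain-Python sketch below picks the narrowest of three types that every value in a column can be parsed as. This is a deliberate simplification under stated assumptions: Spark's actual inference considers more types (for example nulls, longs, booleans and timestamps). The function name `infer_column_type` and the sample values are made up.

```python
def infer_column_type(values):
    """Return 'int', 'double', or 'string' for a list of raw CSV strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "string"


print(infer_column_type(["0", "1", "0"]))      # every value parses as an integer
print(infer_column_type(["6.575", "6.421"]))   # needs a floating-point type
print(infer_column_type(["yes", "no"]))        # falls back to string
```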

<br>

Let us look at the dataset to see the data type of each column:

In [0]:
dataset.printSchema()

The next step is to combine the features from the different columns into a single vector column, which we will call 'Attributes'.

In [0]:
#Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat'], outputCol = 'Attributes')

output = assembler.transform(dataset)

# Keep only the feature vector (input) and the target column (output)
finalized_data = output.select("Attributes","medv")

finalized_data.show()
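
Conceptually, the assembly step above takes the named columns of each row and packs their values into one vector. The plain-Python sketch below mimics that for a single row; it is an illustration, not Spark's implementation, and the function name `assemble` and the row values are made up.

```python
def assemble(row, input_cols):
    """Collect the named numeric fields of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# A made-up row with only three of the feature columns, for illustration
row = {"crim": 0.1, "rm": 6, "lstat": 5.0, "medv": 24.0}
print({"Attributes": assemble(row, ["crim", "rm", "lstat"]), "medv": row["medv"]})
```

The order of `input_cols` determines the order of values in the vector, which is why the same column list must be used for training and prediction.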

Here, 'Attributes' holds the input features from all the columns and 'medv' is the target column. Next, we split the data into training and testing sets (80% and 20% in this case).

In [0]:
# Split the data into training and testing sets
train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

regressor = LinearRegression(featuresCol='Attributes', labelCol='medv')

# Fit the model on the training set
regressor = regressor.fit(train_data)

# Evaluate the fitted model on the testing set
pred = regressor.evaluate(test_data)

# Show the predictions next to the actual values
pred.predictions.show()
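
Note that `randomSplit` assigns rows at random, so the two sets are only approximately 80% and 20% of the data, and the results change from run to run unless a seed is passed (for example `randomSplit([0.8, 0.2], seed=42)`). The plain-Python sketch below illustrates the idea; it is a conceptual model rather than Spark's actual implementation, and the name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 0.2] -> [0.8, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(506), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Every row lands in exactly one split, but the sizes vary around the requested proportions.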

We can also print the coefficients and intercept of the regression model with the following commands:

In [0]:
# Coefficients of the regression model (one per feature)
coeff = regressor.coefficients

# Intercept of the regression model
intr = regressor.intercept

print("The coefficients of the model are: %s" % coeff)
print("The intercept of the model is: %f" % intr)
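
These two quantities fully describe the fitted linear model: a prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below shows that calculation in plain Python with made-up numbers; the function name `predict` is hypothetical and is not a PySpark API.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Made-up numbers for a model with two features
print(predict([2.0, 3.0], [0.5, -1.0], 10.0))  # 10 + 0.5*2 - 1.0*3 = 8.0
```

In the notebook, `regressor.coefficients` and `regressor.intercept` play the roles of `coefficients` and `intercept`, and one row's 'Attributes' vector supplies the features.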

# Basic Statistical Analysis #
Once we are done with the basic linear regression, we can go a bit further and assess the model statistically using the `RegressionEvaluator` class from PySpark.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# Named `evaluator` to avoid shadowing Python's built-in eval()
evaluator = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")

# Root Mean Square Error
rmse = evaluator.evaluate(pred.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = evaluator.evaluate(pred.predictions, {evaluator.metricName: "r2"})
print("r2: %.3f" % r2)
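
To see what these four metrics measure, the sketch below computes them in plain Python using the standard textbook definitions. It is meant as an illustration of the formulas rather than a replacement for `RegressionEvaluator`; the function name `regression_metrics` and the sample values are made up and are not output from the model above.

```python
import math


def regression_metrics(actual, predicted):
    """Return a dict with mse, rmse, mae and r2 for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n        # mean of squared errors
    mae = sum(abs(e) for e in errors) / n        # mean of absolute errors
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)         # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1 - ss_res / ss_tot}


# Made-up values for illustration
print(regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

RMSE and MAE are in the same units as the target, so lower is better. An r2 close to 1 means the model explains most of the variance in the target.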

# Source #

[PySpark in Colab](https://github.com/asifahmed90/pyspark-ML-in-Colab)

[Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html?source=post_page---------------------------)