<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/misc/pyspark_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark in Google Colab

- From [PySpark in Google Colab](https://towardsdatascience.com/pyspark-in-google-colab-6821c2faf41c) in [towardsdatascience.com](https://towardsdatascience.com) by [Asif Ahmed](https://towardsdatascience.com/@asifahmed90?source=post_page-----6821c2faf41c----------------------)
Creating a simple linear regression model with PySpark in Colab

Updated by [John Fogarty](https://github.com/jfogarty) for Python 3.6 and [Base2 MLI](https://github.com/base2solutions/mli) and [colab](https://colab.research.google.com) standalone evaluation.

With broadening sources of the data pool, the topic of Big Data has received an increasing amount of attention in the past few years. Besides dealing with the gigantic data of all kinds and shapes, the target turnaround time of the analysis part for the big data has been reduced significantly. Not only has this speed and efficiency helped in the immediate analysis of the Big Data but also in identifying new opportunities. This, in turn, has lead to smarter business moves, more efficient operations, higher profits, and happier customers.

Apache Spark was build to analyze Big Data with faster speed. One of the important features that Apache Spark offers is the ability to run the computations in memory. It is also considered to be more efficient than MapReduce for the complex application running on Disk.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.

PySpark is the interface that gives access to Spark using the Python programming language. PySpark is an API developed in python for spark programming and writing spark applications in Python style, although the underlying execution model is the same for all the API languages.
In this tutorial, we will mostly deal with the PySpark machine learning library Mllib that can be used to import the Linear Regression model or other machine learning models.

### Yes, but why Google Colab?

Colab by Google is based on Jupyter Notebook which is an incredibly powerful tool that leverages google docs features. Since it runs on google server, we don't need to install anything in our system locally, be it Spark or deep learning model. The most attractive features of Colab are the free GPU and TPU support! Since the GPU support runs on Google's own server, it is, in fact, faster than some commercially available GPUs like the Nvidia 1050Ti. A piece of general system information allocated for a user looks like the following:

```
Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 11439MB | Used: 0MB | Util  0% | Total 11439MB
```


## Running Pyspark in Colab

To run spark in Colab, first we need to install all the dependencies in Colab environment such as Apache Spark 2.3.2 with hadoop 2.7, Java 8 and Findspark in order to locate the spark in the system. The tools installation can be carried out inside the Jupyter Notebook of the Colab.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

**Note!** This was out of date and had to be updated from [apache spark](https://www-us.apache.org/dist/spark) to 2.4.3 form 2.4.1 before it would install.

Now that we have installed Spark and Java in Colab, it is time to set the environment path that enables us to run PySpark in our Colab environment. Set the location of Java and Spark by running the following code:


In [0]:
import os
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
SPARK_HOME = "/content/spark-2.4.3-bin-hadoop2.7"

def set_os_environ_path(var, val):
    os.environ[var] = val
    if not os.path.exists(JAVA_HOME):
        print(f"** Yikes! the {var} path {val} does not exist!  Your environment is not valid.")

set_os_environ_path("JAVA_HOME",  JAVA_HOME)
set_os_environ_path("SPARK_HOME", SPARK_HOME)


**Note!** You **must** check these paths in the **Files** tab on the left side of your notebook page.  

We can run a local spark session to test our installation:

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## Linear Regression Model

Linear Regression model is one the oldest and widely used machine learning approach which assumes a relationship between dependent and independent variables. For example, a modeler might want to predict the forecast of the rain based on the humidity ratio. Linear Regression consists of the best fitting line through the scattered points on the graph and the best fitting line is known as the regression line. Detailed about linear regression can be found here.

For our purpose of starting with Pyspark in Colab and to keep things simple, we will use the famous Boston Housing dataset. A full description of this dataset can be found in this [link](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).

### The Boston Housing Dataset

A Dataset derived from information collected by the U.S. Census Service concerning housing in the area of Boston Mass.
BackUpDelve

This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It was obtained from the [StatLib archive](http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small in size with only 506 cases.

The data was originally published by Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

#### Dataset Naming
The name for this dataset is simply boston. It has two prototasks: nox, in which the nitrous oxide level is to be predicted; and price, in which the median value of a home is to be predicted

#### Miscellaneous Details

- Origin : The origin of the boston housing data is Natural.
- Usage : This dataset may be used for Assessment.
- Number of Cases : The dataset contains a total of 506 cases.
- Order : The order of the cases is mysterious.
- Variables : There are 14 attributes in each case of the dataset. They are:
  - CRIM - per capita crime rate by town
  - ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  - INDUS - proportion of non-retail business acres per town.
  - CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  - NOX - nitric oxides concentration (parts per 10 million)
  - RM - average number of rooms per dwelling
  - AGE - proportion of owner-occupied units built prior to 1940
  - DIS - weighted distances to five Boston employment centres
  - RAD - index of accessibility to radial highways
  - TAX - full-value property-tax rate per 10,000 dollars
  - PTRATIO - pupil-teacher ratio by town
  - B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  - LSTAT - % lower status of the population
  - MEDV - Median value of owner-occupied homes in $1000's

- Note : Variable #14 seems to be censored at 50.00 \(corresponding to a median price of 50,000 dollars);

 Censoring is suggested by the fact that the highest median price of exactly 50,000 dollars is reported in 16 cases, while 15 cases have prices between 40,000 dollars  and 50,000 dollars , with prices rounded to the nearest hundred. Harrison and Rubinfeld do not mention any censoring.

### Getting the dataset

The goal of this exercise is to predict the housing prices from the given features. Let’s predict the prices of the Boston Housing dataset by considering MEDV as the target variable and all other variables as input features.

We can download the dataset from this [Github repo: We can download the dataset from this [Github repo: link](https://github.com/asifahmed90/pyspark-ML-in-Colab](https://github.com/asifahmed90/pyspark-ML-in-Colab/blob/master/BostonHousing.csv) and keep it somewhere accessible in our local drives. The dataset can be loaded in the Colab directory using the following command from the same drive.

```
    from google.colab import files
    files.upload()
```

**JF Note : this is tedious so instead let's fetch it directly from the raw Github content using this code:

In [0]:
#@title Nasty File Transfer Utility Tools
import numpy as np
import requests
import shutil
import os
from bs4 import BeautifulSoup

ds = np.DataSource()
def copyHere(URL, toPath, quiet=False):
    toDir, toFile = os.path.split(toPath)
    toPath = os.path.join(toDir, toFile)
    if os.path.exists(toPath):
        if not quiet:
            print(f"- Skipped copy of existing file {toPath}.")
    else:
        if ds.exists(URL):
            if not toFile:
                urlPrefix, toFile = os.path.split(URL)
            response = requests.get(URL, stream=True)
            if toDir:
                if not os.path.exists(toDir): 
                  print(f"- Creating directory '{toDir}'.")
                  os.makedirs(toDir)
            with open(toPath, 'wb') as fin: shutil.copyfileobj(response.raw, fin)
            if not quiet:
                print(f"- Copied {URL} to {toPath}.")
        else:
            print(f"** Sorry, can't copy '{URL}' to '{toPath}'.")

In [0]:
Github_REPO = 'https://github.com/asifahmed90/pyspark-ML-in-Colab/'
REPO        = 'https://raw.githubusercontent.com/asifahmed90/pyspark-ML-in-Colab/'
BRANCH      = 'master/'
DIR         = 'train/cats/'
TMPDIR      = 'tmpData'
filename    = 'BostonHousing.csv'

URL = os.path.join(REPO, BRANCH, filename)
copyHere(URL, filename, quiet=False)

- Copied https://raw.githubusercontent.com/asifahmed90/pyspark-ML-in-Colab/master/BostonHousing.csv to BostonHousing.csv.


In [0]:
response = requests.get(URL, stream=True)
with open('BH.csv', 'wb') as fin: shutil.copyfileobj(response.raw, fin)

We can now check the directory content of the Colab `/content` directory.  Note that your colab is running in a full VM instance and you are installing new packages into the root wil full superuser privileges.

In [0]:
!pwd ; ls -al

/content
total 449232
drwxr-xr-x  1 root root      4096 Aug 16 20:25 .
drwxr-xr-x  1 root root      4096 Aug 16 16:44 ..
-rw-r--r--  1 root root     11769 Aug 16 20:25 BostonHousing.csv
drwxr-xr-x  1 root root      4096 Aug 13 16:04 .config
drwxr-xr-x  1 root root      4096 Aug  2 16:06 sample_data
drwxr-xr-x 13 1000 1000      4096 May  1 05:19 spark-2.4.3-bin-hadoop2.7
-rw-r--r--  1 root root 229988313 May  1 05:57 spark-2.4.3-bin-hadoop2.7.tgz
-rw-r--r--  1 root root 229988313 May  1 05:57 spark-2.4.3-bin-hadoop2.7.tgz.1


We should see a file named BostonHousing.csv saved. Now that we have uploaded the dataset successfully, we can start analyzing.

For our [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model, we need to import [Vector Assembler](https://spark.apache.org/docs/2.2.0/ml-features.html) and [Linear Regression](https://spark.apache.org/docs/2.1.1/ml-classification-regression.html) modules from the [PySpark API](). Vector Assembler is a transformer tool that assembles all the features into one vector from multiple columns that contain type [double](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). We should have used (must use) [StringIndexer](https://spark.rstudio.com/reference/ft_string_indexer/) if any of our columns contains string values to convert it into numeric values. Luckily, the BostonHousing dataset only contains type double, so we can skip StringIndexer for now.

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

import pandas as pd
df = pd.read_csv('BH.csv')

UnicodeDecodeError: ignored

In [0]:
import chardet
import pandas as pd
fn = 'BH.csv'
with open(fn, 'rb') as f: result = chardet.detect(f.read()) 
xx = pd.read_csv(fn, encoding=result['encoding'])

UnicodeDecodeError: ignored

Notice that we used InferSchema inside read.csv(). InferSchema automatically infers different data types for each column.
Let us print look into the dataset to see the data types of each column:

In [0]:
dataset

DataFrame[�      }}ے: string, 9��{}F?�����v֤1���f[2��^p#2�d��������
���ǿ����������������_���������������?��������������������������?��������������������?�R�����w���Wz��_�]��Z}뫼�x�KG{I}�W
�=�O��R���jz�;�D(m8BQyY��m����wٻO���Ry�@ѱ��[��C�rD�w�X����: string]

### End of notebook.