# Jupyter Notebook

This is a Jupyter Notebook, which is basically just a super fancy, interactive Python shell.

You may have "cells" that can either be text (like this one) or executable Python code. Notebooks are really nice because they allow you to rapidly develop Python code by writing small bits of code, testing their output, and moving on to the next bit; this interactive nature of the notebook is a huge plus to professional Python developers. 

It's also nice because it's really easy to share your code with others and surround it with text to tell a story!

# Colaboratory
Colaboratory is a service provided by Google that takes a Jupyter Notebook (stored in the standard `.ipynb` file format) and lets users edit and run the code in the notebook for free!

This notebook is write-protected, so you are not able to edit the notebook that the whole class will look at. You can, however, open the notebook in "playground mode", which lets you make edits to a temporary copy of the notebook. If you want to keep the changes you made, follow the instructions that appear when you try to save; they will copy the notebook to your Google Drive.

# Setup
Make sure you run the following cell(s) before trying to run any of the later cells. You do not need to understand what they are doing; they are just a way to make sure a file we want to use is stored on the computer running this notebook.


In [0]:
import requests

def save_file(url, file_name):
  r = requests.get(url)
  with open(file_name, 'wb') as f:
    f.write(r.content)

save_file('https://courses.cs.washington.edu/courses/cse163/19sp/' +
          'files/lectures/05-29/BostonHousing.csv', 'BostonHousing.csv')

# Functional programming
For this problem, we want to multiply every number in a list by 2.

In [0]:
nums = [1, 5, 7, 10, 14, 27]

We have seen before in class that we can solve this with a simple for loop

In [0]:
two_nums = []
for num in nums:
  two_nums.append(2 * num)
print(two_nums)

[2, 10, 14, 20, 28, 54]


You might also remember we can use a **list comprehension** to abstract away building up the new list explicitly

In [0]:
[2 * num for num in nums]

[2, 10, 14, 20, 28, 54]

While this approach is a bit easier to read, it still requires us to write out an explicit loop. In **functional programming**, we prefer not to write "how" to do something and instead focus on the "what".

In Python, we use the `map` function to apply a given function over a list of values. The `map` function is "lazy" in the sense that it won't compute any of the values until we ask for them. To force it to compute everything, we pass the result to the `list` constructor, which evaluates all the elements.

In [0]:
def times_two(num):
  return 2 * num

list(map(times_two, nums))

[2, 10, 14, 20, 28, 54]
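To make the "lazy" behavior concrete, here is a small sketch (the variable names `lazy`, `first`, `rest`, and `leftover` are just for illustration). It shows that the object returned by `map` hands out values one at a time and can only be consumed once.

```python
nums = [1, 5, 7, 10, 14, 27]

def times_two(num):
  return 2 * num

lazy = map(times_two, nums)  # nothing has been computed yet
first = next(lazy)           # computes only the first value
rest = list(lazy)            # computes the remaining values
leftover = list(lazy)        # the map object is now used up

print(first)     # 2
print(rest)      # [10, 14, 20, 28, 54]
print(leftover)  # []
```

This is why we wrap the call in `list(...)` when we want all the results at once.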

We could go ahead and write our own `our_map` function to show exactly how `map` is implemented (minus the lazy evaluation)

In [0]:
def our_map(f, vals):
  return [f(v) for v in vals]

In [0]:
our_map(times_two, nums)

[2, 10, 14, 20, 28, 54]
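If we wanted to mimic the lazy evaluation as well, one possible approach is a generator function. This is only a sketch of the idea, not how Python implements `map` internally; `our_lazy_map`, `noisy_times_two`, and `calls` are hypothetical names used for this example.

```python
def our_lazy_map(f, vals):
  # A generator: each value is computed only when the caller asks for it
  for v in vals:
    yield f(v)

calls = []  # records which inputs have actually been processed

def noisy_times_two(num):
  calls.append(num)
  return 2 * num

doubled = our_lazy_map(noisy_times_two, [1, 5, 7])
calls_before = list(calls)  # still empty: no work has been done yet
result = list(doubled)      # asking for the values triggers the work

print(calls_before)  # []
print(result)        # [2, 10, 14]
print(calls)         # [1, 5, 7]
```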

It is also good to remind you that we can use the lambda syntax to define an **anonymous function** so we don't have to go write a named function like `times_two` every single time.

In [0]:
list(map(lambda num: 2 * num, nums))

[2, 10, 14, 20, 28, 54]

Next is another higher-order function, **filter**. Its structure is similar to `map`, but instead of returning new values, the function we pass in should return a `bool`. `filter` keeps every value for which the function returns `True`.

In [0]:
list(filter(lambda num: num % 2 == 0, nums))

[10, 14]
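As with `map`, it can help to see one way `filter` could be written by hand. The `our_filter` function below is a hypothetical helper for illustration (it leaves out the lazy evaluation).

```python
def our_filter(f, vals):
  # Keep only the values for which f returns a truthy result
  return [v for v in vals if f(v)]

nums = [1, 5, 7, 10, 14, 27]
evens = our_filter(lambda num: num % 2 == 0, nums)
print(evens)  # [10, 14]
```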

`reduce` was actually removed as a built-in function from Python in version 3.0. If we want to use `reduce`, we have to import it from `functools`.

The function passed to `reduce` takes 2 values: the first is the current accumulation of the previous values and the second is the next value. The function describes how to combine the next value with all the previous ones.

In [0]:
from functools import reduce
reduce(lambda cumulative, curr: cumulative + curr, nums)

64

The example above computes the sum of all the numbers. It might be clearer to see one way that we could implement `reduce`.

In [0]:
def our_reduce(f, vals):
  acc = vals[0]
  for v in vals[1:]:
    acc = f(acc, v)
  return acc
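Here is a short usage sketch for `our_reduce` (repeated so the cell runs on its own). It also shows one difference from the real `functools.reduce`: our version indexes `vals[0]`, so it fails on an empty list, whereas `functools.reduce` accepts an optional starting value as a third argument.

```python
from functools import reduce

def our_reduce(f, vals):
  acc = vals[0]
  for v in vals[1:]:
    acc = f(acc, v)
  return acc

nums = [1, 5, 7, 10, 14, 27]

# Same sum as before, using our own implementation
total = our_reduce(lambda cumulative, curr: cumulative + curr, nums)

# Any function that combines two values works, for example keeping the larger one
largest = our_reduce(lambda cumulative, curr: max(cumulative, curr), nums)

# functools.reduce with a starting value of 0 can handle an empty list
empty_sum = reduce(lambda cumulative, curr: cumulative + curr, [], 0)

print(total)      # 64
print(largest)    # 27
print(empty_sum)  # 0
```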

# Spark
We saw in lecture that a framework called Spark lets us easily write code that runs on a distributed system. Below we have an example that trains a machine learning model to predict housing prices. Notice that the API is very different from `sklearn`, but a lot of the same components are there.

This example is a bit lame in some sense because it is running locally (on this one Google machine) and not on a distributed system, but the cool thing is this code will run on a distributed system with minor modification! That is the nice thing about using the Spark API in this case.

We will explain what each cell is trying to do, but we cannot explain every single line without writing a huge chapter.

This cell installs all the tools needed to run Spark (Java, Spark, and the `findspark` Python package).

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

The next cell is setting some "environment variables" so that the program knows where to find the required files.

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"

`findspark` is the library that locates the Spark installation so we can import and use Spark from Python. The `spark` variable is what we will use to talk to Spark.

In [0]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
# This is the line that starts our Spark session. "local[*]" runs Spark on
# this one machine; a cluster address would go here to use a remote cluster.
spark = SparkSession.builder.master("local[*]").getOrCreate()

The following cell reads the Boston housing dataset into a Spark data structure

In [0]:
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

The `dataset` variable has a method to let us see the columns and their types

In [0]:
dataset.printSchema()

root
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- b: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)



The code in this cell sets up the training data so that we separate the features and the outputs. The last line shows us what this new dataset looks like. The dataset has two columns: one stores a vector of all the features for that row, and the other is the label `medv`.

In [0]:
from pyspark.ml.feature import VectorAssembler

# Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus', 'chas', 'nox', 
                                       'rm', 'age', 'dis', 'rad', 'tax', 
                                       'ptratio', 'b', 'lstat'], 
                            outputCol = 'Attributes')
output = assembler.transform(dataset)

# Input vs Output
finalized_data = output.select('Attributes', 'medv')
finalized_data.show()

+--------------------+----+
|          Attributes|medv|
+--------------------+----+
|[0.00632,18.0,2.3...|24.0|
|[0.02731,0.0,7.07...|21.6|
|[0.02729,0.0,7.07...|34.7|
|[0.03237,0.0,2.18...|33.4|
|[0.06905,0.0,2.18...|36.2|
|[0.02985,0.0,2.18...|28.7|
|[0.08829,12.5,7.8...|22.9|
|[0.14455,12.5,7.8...|27.1|
|[0.21124,12.5,7.8...|16.5|
|[0.17004,12.5,7.8...|18.9|
|[0.22489,12.5,7.8...|15.0|
|[0.11747,12.5,7.8...|18.9|
|[0.09378,12.5,7.8...|21.7|
|[0.62976,0.0,8.14...|20.4|
|[0.63796,0.0,8.14...|18.2|
|[0.62739,0.0,8.14...|19.9|
|[1.05393,0.0,8.14...|23.1|
|[0.7842,0.0,8.14,...|17.5|
|[0.80271,0.0,8.14...|20.2|
|[0.7258,0.0,8.14,...|18.2|
+--------------------+----+
only showing top 20 rows



The cell below actually trains the model and prints its predictions on the test set

In [0]:
from pyspark.ml.regression import LinearRegression

# Split training and testing data
train_data,test_data = finalized_data.randomSplit([0.8, 0.2])

# Learn to fit the model from training set
regressor = LinearRegression(featuresCol = 'Attributes', labelCol = 'medv')
regressor = regressor.fit(train_data)

# To predict the prices on testing set
pred = regressor.evaluate(test_data)
pred.predictions.show()

+--------------------+----+------------------+
|          Attributes|medv|        prediction|
+--------------------+----+------------------+
|[0.00906,90.0,2.9...|32.2| 31.41528319800314|
|[0.01311,90.0,1.2...|35.4|30.961419505594797|
|[0.01381,80.0,0.4...|50.0| 40.68022853897539|
|[0.01538,90.0,3.7...|44.0| 37.53039981560234|
|[0.01951,17.5,1.3...|33.0| 22.91406322761884|
|[0.02177,82.5,2.0...|42.3| 36.77392391351592|
|[0.02763,75.0,2.9...|30.8|31.641561563069722|
|[0.02875,28.0,15....|25.0| 29.41820794057692|
|[0.02985,0.0,2.18...|28.7| 25.17682920723198|
|[0.03113,0.0,4.39...|17.5| 16.10388162964486|
|[0.03445,82.5,2.0...|24.1|29.162470316265285|
|[0.03466,35.0,6.0...|19.4| 23.16534532152732|
|[0.03502,80.0,4.9...|28.5| 33.87080772723026|
|[0.03615,80.0,4.9...|27.9| 32.28690618950903|
|[0.03659,25.0,4.8...|24.8| 26.14666933761385|
|[0.03932,0.0,3.41...|22.0|27.673911925939606|
|[0.04527,0.0,11.9...|20.6|23.010732689972585|
|[0.0456,0.0,13.89...|23.3|26.843149181282804|
|[0.0459,52.5

The model we are using is called **linear regression**, which tries to fit a "line" through the dataset to predict the value. We put line in quotes because in higher dimensions, it's actually a hyperplane. With linear regression, we can look at the coefficients for each feature and the intercept of the line.

In [0]:
# coefficient of the regression model
coeff = regressor.coefficients
print(f'The coefficient of the model is : {coeff}')

# Y intercept
intr = regressor.intercept
print(f'The Intercept of the model is : {intr}')

The coefficient of the model is : [-0.10725558788839656,0.051397562790752235,0.025320444127234085,2.8069557086057144,-18.896903103025814,3.7332456339812734,0.002297829720421558,-1.6259291041305706,0.29383293785790826,-0.01352466376227244,-0.868844389251999,0.0089061537930031,-0.5190221632958711]
The Intercept of the model is : 37.05904798202716
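To see what the coefficients and intercept mean, here is a minimal plain-Python sketch of how a linear model turns a row of features into a prediction: multiply each feature by its coefficient, add the results up, and add the intercept. The `predict` helper and the small numbers below are made up for illustration; they are not the values learned above.

```python
def predict(coefficients, intercept, features):
  # prediction = intercept + c1 * x1 + c2 * x2 + ...
  return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical model with two features
coefficients = [2.0, -1.0]
intercept = 5.0
features = [3.0, 4.0]

prediction = predict(coefficients, intercept, features)
print(prediction)  # 5.0 + 2.0 * 3.0 - 1.0 * 4.0 = 7.0
```

A positive coefficient pushes the prediction up as that feature grows, and a negative one pushes it down.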


The last step is to evaluate the model by looking at the test error. There are two metrics we will look at:
* The first is called Root Mean Square Error (RMSE), which measures the square root of the mean of the squared errors made by the model on each point (a lower RMSE is better). Mathematically, for our model $\hat{f}$ on a dataset $X$ with labels $y$, the RMSE is 

$$RMSE(\hat{f}, X, y) = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( y_i - \hat{f}(X_i)\right)^2}$$

* The other metric used is $R^2$, the coefficient of determination, which is a common statistical measurement of the "fit" of the model. A value closer to 1 indicates a better fit.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# Named `evaluator` so we don't shadow Python's built-in `eval` function
evaluator = RegressionEvaluator(labelCol='medv', 
                                predictionCol='prediction', 
                                metricName='rmse')

# Root Mean Square Error
rmse = evaluator.evaluate(pred.predictions)
print(f'RMSE: {rmse}')

# r2 - coefficient of determination
r2 = evaluator.evaluate(pred.predictions, {evaluator.metricName: "r2"})
print(f'r2: {r2}')

RMSE: 4.686977314624061
r2: 0.7307918797596511
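To connect the two metrics to their formulas, here is a small plain-Python sketch using the standard textbook definitions (RMSE as above, and $R^2$ as 1 minus the ratio of the squared errors to the total squared variation around the mean label). The helper names `compute_rmse` and `compute_r2` and the tiny dataset are made up for illustration; this is not Spark's internal code.

```python
import math

def compute_rmse(y_true, y_pred):
  # Square root of the mean of the squared errors
  n = len(y_true)
  return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def compute_r2(y_true, y_pred):
  # 1 - (squared error of the model) / (squared error of always guessing the mean)
  mean_y = sum(y_true) / len(y_true)
  ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
  ss_tot = sum((y - mean_y) ** 2 for y in y_true)
  return 1 - ss_res / ss_tot

# Hypothetical labels and predictions: only the last prediction is wrong
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

example_rmse = compute_rmse(y_true, y_pred)
example_r2 = compute_r2(y_true, y_pred)
print(example_rmse)  # 1.0
print(example_r2)    # roughly 0.2
```

A model that predicted every label exactly would get an RMSE of 0 and an $R^2$ of 1.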
