# Regression and Spark (42 points)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Regression (16 points)
In this regression task, we will use the following dataset. Please take a look at the description output.

In [2]:
import sklearn.datasets as mldata
data_dict = mldata.load_boston()
print(data_dict['DESCR']) # output 1 point

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

Coding Question 1: As in the classification task in workshop 4, use the data to fit a linear regression model, make predictions on the training and test sets, and report the performance (MSE). (15 points)

In [3]:
# Convert data_dict to a DataFrame and attach the target column (MEDV)
housing = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names']) 
housing['MEDV'] = data_dict['target'] 
housing.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [4]:
from sklearn.model_selection import train_test_split 
train, test = train_test_split(housing, test_size=0.25, random_state=100)
x_train = train.drop('MEDV', axis=1).values
y_train = train['MEDV'].values
x_test = test.drop('MEDV', axis=1).values
y_test = test['MEDV'].values

print("Training Data Size: ", len(train))
print("Test Data Size: ", len(test))

('Training Data Size: ', 379)
('Test Data Size: ', 127)
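The split sizes above follow from the 506 instances and `test_size=0.25`. As a small sanity check, the arithmetic can be sketched as follows, assuming the fractional test count is rounded up and the remainder goes to training (which matches the sizes printed above):

```python
import math

n_samples = 506   # number of instances from the dataset description
test_size = 0.25  # fraction passed to train_test_split

# Assumption: the fractional test count is rounded up,
# and every remaining row goes to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # prints: 379 127
```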


In [6]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
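`LinearRegression(fit_intercept=True)` fits an ordinary least-squares model with an intercept term. To make that concrete, here is a minimal sketch of the same kind of fit using `np.linalg.lstsq`. The data is synthetic and invented for this illustration (it is not the housing data), and the sketch is not a claim about the solver scikit-learn uses internally.

```python
import numpy as np

# Synthetic data, invented for this sketch: y = 2 + 3*x1 - 1*x2 exactly
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 5.0],
              [5.0, 3.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first weight plays the role of the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ w - y||^2
w, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = w[0], w[1:]

print(intercept, coefs)
```

Because the toy targets are an exact linear function of the features, the recovered intercept and coefficients should match 2, 3, and -1 up to floating-point error.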

In [7]:
from sklearn.metrics import mean_squared_error
y_train_pred = linear_model.predict(x_train)
mean_squared_error(y_train, y_train_pred)

20.506781668028975

In [8]:
y_test_pred = linear_model.predict(x_test)
mean_squared_error(y_test, y_test_pred)

27.173144173043564
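The test MSE (about 27.17) is higher than the training MSE (about 20.51), which is the usual pattern because the model was fit on the training rows only. For reference, MSE is the average of the squared differences between targets and predictions. A minimal plain-Python sketch follows; the helper name `mse` and the toy values are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values invented for this sketch, not taken from the housing data
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 9.0, 9.0]

toy_mse = mse(toy_true, toy_pred)
print(toy_mse)  # (1 + 0 + 4 + 0) / 4 = 1.25
```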

# Spark (25 points)

In workshop 4, we learned how to write a word count task in Spark, using both a notebook and a cluster node.

Now you will write your first Spark job to accomplish the following task. You are required to complete it both ways, in a Jupyter notebook and on a cluster node, as in workshop 4:

• Output the number of words that start with each letter (i.e., the 52 letters A, B, C, ..., Z and a, b, c, ..., z). That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation, ignore all non-alphabetic characters.

• Run your program over the same input data pg100.txt as in workshop 4.

What to hand in:
1. The Jupyter notebook version, as in workshop 4, given below. (15 points)
2. A zip file containing: 1) the output file with the required results (e.g., printed output or a txt file written by the application) [3 points]; 2) a screenshot showing that your application finished running [1 point]; 3) your source code (.py) [5 points]; 4) your full name in the source code [1 point]. (10 points in total)
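The counting rule in the task statement can be checked on a small input without Spark. The following is a minimal plain-Python sketch, not the required Spark job. The helper name `first_letter_counts` and the sample lines are hypothetical, and the sketch assumes that "alphabetic" means the ASCII letters A-Z and a-z, so it splits on everything else (including digits and underscores).

```python
import re
from collections import Counter

def first_letter_counts(lines):
    """Count words by first letter, keeping upper and lower case apart."""
    counts = Counter()
    for line in lines:
        # Split on runs of non-letters so digits and punctuation are ignored
        for word in re.split(r"[^A-Za-z]+", line):
            if word:  # re.split can yield empty strings at the edges
                counts[word[0]] += 1
    return counts

# Sample lines invented for this sketch
sample = ["To be, or not to be: that is the question.", "2 B or not 2 b?"]
result = first_letter_counts(sample)

for letter in sorted(result):
    print(letter, result[letter])
```

A helper like this is handy for comparing against the Spark output on a few lines of pg100.txt before running the full job.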

In [9]:
import os
import findspark
os.environ["PYSPARK_PYTHON"] = "python2"
findspark.init("/Users/juhuah/Downloads/spark-2.4.2-bin-hadoop2.7/")

In [10]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .master("local[*]")
        .appName("LectureExample")
        .getOrCreate()
)
sc = spark.sparkContext

In [11]:
import re

# Load the data
lines = sc.textFile("./workshop4/pg100.txt")

# Parse the data into words
words = lines.flatMap(lambda line: re.split(r'[^\w]+', line))

# Remove empty words
words = words.filter(lambda word: word != '')

# Remove words that do not start with an alphabetic character
final = words.filter(lambda word: word[0].isalpha())

# Convert each word into a tuple of its first letter and 1
pairs = final.map(lambda word: (word[0], 1))

# Sum the counts for each first letter
counts = pairs.reduceByKey(lambda c1, c2: c1 + c2)

# Sort the result by key (better for end-users)
counts = counts.sortByKey()

# Show the result as a pandas DataFrame
counts.toDF().toPandas()

Unnamed: 0,_1,_2
0,A,21692
1,B,10992
2,C,11171
3,D,6275
4,E,8511
5,F,8086
6,G,6218
7,H,10198
8,I,30031
9,J,1762
