# Classification and Spark

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Classification example (6 points)
First, install scikit-learn (imported as `sklearn`), a package of simple and efficient tools for data mining and data analysis. Then load a data set from `sklearn.datasets` to use as a classification example.

In [None]:
!pip install scikit-learn

In [None]:
import sklearn.datasets as mldata
data_dict = mldata.load_breast_cancer()  # load the data

# Convert data_dict to a DataFrame with one column per feature
cancer = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names'])

# Constant column that acts as the intercept term (the model below uses fit_intercept=False)
cancer['bias'] = 1.0

# In data_dict['target'], 0 is malignant and 1 is benign; flip it so that malignant = 1
cancer['malignant'] = 1 - data_dict['target']
cancer.iloc[0]

Now we can split the data once into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split 

# train_test_split in sklearn splits the data as follows
train, test = train_test_split(cancer, test_size=0.25, random_state=100)
x_train = train.drop('malignant', axis=1).values
y_train = train['malignant'].values
x_test = test.drop('malignant', axis=1).values
y_test = test['malignant'].values

print("Training Data Size: ", len(train))
print("Test Data Size: ", len(test))
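Conceptually, a train-test split shuffles the row indices and cuts them into two groups. The following is a minimal sketch of that idea in NumPy, not scikit-learn's actual implementation; the helper name `simple_split` and the rounding rule are assumptions made for illustration.

```python
import math
import numpy as np

def simple_split(n_rows, test_size=0.25, seed=100):
    """Hypothetical helper: return (train_idx, test_idx) for a shuffled split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)        # random ordering of 0..n_rows-1
    n_test = math.ceil(n_rows * test_size)    # assumed rounding: round the test size up
    return shuffled[n_test:], shuffled[:n_test]

# The breast cancer data set has 569 rows
train_idx, test_idx = simple_split(569)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two groups, so the model is never evaluated on a row it was trained on.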

The training data can be used to fit a LogisticRegression model as follows.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=False, C=1e-5, solver='lbfgs')
model.fit(x_train, y_train)
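The model is fit with `fit_intercept=False` because the `bias` column of ones already plays the role of the intercept. A small sketch with made-up numbers (the feature values, weights, and intercept below are illustrative, not taken from the fitted model) shows that the two formulations give the same linear scores:

```python
import numpy as np

# Made-up features and parameters, for illustration only
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -0.25])
intercept = 2.0

# Formulation 1: explicit intercept added to the weighted sum
scores_explicit = X @ weights + intercept

# Formulation 2: append a column of ones and treat the intercept as one more weight
X_bias = np.column_stack([X, np.ones(len(X))])
weights_bias = np.append(weights, intercept)
scores_bias = X_bias @ weights_bias

print(np.allclose(scores_explicit, scores_bias))
```

One difference to keep in mind: the weight on the bias column is part of the coefficient vector, so it is regularized like every other coefficient.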

In [None]:
# Training accuracy: the fraction of training examples predicted correctly
correct_train = model.predict(x_train) == y_train
np.mean(correct_train)

In [None]:
# Test accuracy: the fraction of held-out examples predicted correctly
correct_test = model.predict(x_test) == y_test
np.mean(correct_test)
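The accuracy computed above is the mean of a boolean array, where `True` counts as 1 and `False` as 0. The sketch below uses made-up labels and predictions (not the model's output) to show that calculation, and also breaks the predictions down by error type, which accuracy alone does not reveal.

```python
import numpy as np

# Made-up labels and predictions, for illustration only (1 = malignant, 0 = benign)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# Accuracy: fraction of positions where prediction and label agree
correct = y_pred == y_true
accuracy = np.mean(correct)

# Counts by outcome type
true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))   # malignant cases missed
false_pos = int(np.sum((y_pred == 1) & (y_true == 0)))
true_neg = int(np.sum((y_pred == 0) & (y_true == 0)))

print(accuracy, true_pos, false_neg, false_pos, true_neg)
```

Here six of the eight predictions are correct, giving an accuracy of 0.75, with one missed malignant case and one false alarm.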

# Spark example (6 points)

# Installing PySpark Locally

Uncomment the following lines to install Spark locally in the same folder as this notebook:

In [None]:
#!curl -O http://mirror.metrocast.net/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz 
#!tar -xvf spark-2.4.2-bin-hadoop2.7.tgz

The following Python library, `findspark`, configures your Python environment so that it can locate the Spark installation:

In [None]:
!pip install findspark

If you would like to try using Spark on a cluster for free without any setup, check out [Databricks Community Edition](https://databricks.com/try-databricks).

# Launching PySpark

Set up the PySpark environment.

In [None]:
import os
import findspark
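# PYSPARK_PYTHON selects the interpreter used by the Spark workers; it should match
# the Python version running this notebook.
os.environ["PYSPARK_PYTHON"] = "python2"
findspark.init("spark-2.4.2-bin-hadoop2.7/")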

Initialize the Spark SQL session, which contains a basic Spark context. This may take a few moments while Spark launches its local workers (typically 4 to 8 Python processes in the background). Note that in a real Spark deployment you would simply change `.master("local[*]")` to point to the YARN resource manager instead. To learn more about deploying Spark on a cluster of machines, read [this tutorial](https://spark.apache.org/docs/latest/cluster-overview.html).

Note: You must have Java installed on your computer for this to work!

In [None]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .master("local[*]")
        .appName("LectureExample")
        .getOrCreate()
)
sc = spark.sparkContext

## Word Count Example

As a quick example of what Spark can do, the following code computes the word counts of `pg100.txt` in parallel. That means that if your computer has multiple cores, they are all put to use computing the word counts.

Beneath the layer of abstraction that we can see, Spark is running a MapReduce-style computation.
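To make the map and reduce steps concrete, here is a single-machine sketch of the same word count in plain Python. It is not how Spark is implemented, and the two sample lines are made up; it only mirrors the logical stages (flat-map, map, group by key, reduce) that Spark distributes across workers.

```python
import re
from collections import defaultdict
from functools import reduce

# Made-up sample input standing in for the lines of a text file
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words and flatten into one list
words = [w for line in lines for w in re.split(r'[^\w]+', line)]

# map: emit a (word, 1) pair for every word
pairs = [(w, 1) for w in words]

# shuffle: group the values that share the same key
grouped = defaultdict(list)
for word, n in pairs:
    grouped[word].append(n)

# reduce: combine the values of each key with a binary function
counts = {word: reduce(lambda n1, n2: n1 + n2, ns) for word, ns in grouped.items()}
print(counts)
```

Because the reduce function is applied per key and addition is associative, each group can be summed independently, which is what lets the work be spread over several processes.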

In [None]:
import re #regular expression used to split lines of text into words

# Download pg100.txt from the Spark folder on Canvas and place it next to this notebook
lines = sc.textFile("./pg100.txt")

# Split the lines into words on runs of non-word characters
# (anything other than letters, digits, and underscores)
words = lines.flatMap(lambda line: re.split(r'[^\w]+', line))

#Mapper
pairs = words.map(lambda word: (word, 1))

#Reducer
counts = pairs.reduceByKey(lambda n1, n2: n1 + n2)

#Result
counts.toDF().toPandas()

What if you want to remove words that meet a specific condition? For example, the split above can produce empty strings, which we can filter out:

In [None]:
words = words.filter(lambda word: word != '')

counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.toDF().toPandas()
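The empty-string filter is needed because `re.split` returns an empty string whenever a line begins or ends with a separator, or is itself empty. A quick check in plain Python, using a made-up sample line:

```python
import re

# Made-up sample line that ends with punctuation
line = "Hello, world!"
tokens = re.split(r'[^\w]+', line)
print(tokens)            # ['Hello', 'world', '']

# An empty line yields a single empty string
empty_tokens = re.split(r'[^\w]+', "")
print(empty_tokens)      # ['']

# Same condition as the Spark filter above, applied to a Python list
kept = [t for t in tokens if t != '']
print(kept)              # ['Hello', 'world']
```

Any other condition can be expressed the same way, by changing the predicate passed to `filter`.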