# Assignment 4

This assignment gives you practice with **Inference** and **Spark**.

To receive credit, answer all questions correctly and submit to Canvas before the deadline.

**This assignment is due Monday, May 24 at 11:59 PM.**

**NOTE: Following the instructions earns 0 points by itself. However, you will lose 5 points if you do not follow, run, and understand them.**

**YOUR FULL NAME (1 POINT)**: Nazim Zerrouki

## Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the assignment, we ask that you **write your solutions individually**. If you do discuss the assignment with others, please **include their names** below.

**Collaborators**: *list collaborators here*

In [10]:
# import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

# Part 1: Cross Validation and A/B Testing (20 points)

In A3, we used the following data to train a LogisticRegression classifier with a single train-test split.

As we know, we can use models other than LogisticRegression to solve the prediction problem. However, a one-shot comparison based on a single train-test split is not sufficient to tell which model solves the problem better.

In the following, we will use cross validation to compare two different models.

In [11]:
import sklearn.datasets as mldata
data_dict = mldata.load_breast_cancer() #load the data
print(data_dict['DESCR']) 

# You may copy your code from A3 to convert data_dict to a DataFrame and prepare the target
dataset = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names']) 
dataset['Target'] = data_dict['target']
dataset.head()

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,Target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


Scikit-learn has built-in support for cross validation. However, to compare two models fairly, we need to make sure the same folds are used to cross validate both models. Complete the following function.

1. Use the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 5 splits on the entire data. Note that `split` returns the indices of the data for that split.
2. For **each** split:
    1. Select out the training and validation rows and columns based on the split indices and features.
    2. Compute the validation score on the validation split for each model. The implementation below uses classification accuracy rather than RMSE, because the target is a class label.
3. Return both the score vector and the average score across all cross validation splits for each model.
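
The index-based splitting described in step 1 can be sketched on toy data. This is a minimal illustration, not part of the assignment solution: the array `X` and its shape are made up for the example. It shows that `KFold.split` yields pairs of index arrays, and that an unshuffled `KFold` produces the same folds each time, which is what lets two models be scored on identical validation sets.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)  # no shuffling, so the folds are deterministic

# split() yields (train_idx, valid_idx) pairs of row indices, not the rows themselves
folds = list(kf.split(X))
for train_idx, valid_idx in folds:
    # Index the array with the returned indices to get the actual rows
    X_fold_train, X_fold_valid = X[train_idx], X[valid_idx]
    print("validation rows:", valid_idx, "train shape:", X_fold_train.shape)

# Splitting again gives the same folds, so both models see identical splits
folds_again = list(kf.split(X))
```

If shuffling is wanted, passing `shuffle=True` together with a fixed `random_state` keeps the folds reproducible across calls.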

In [12]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def compute_CV_scores(modelA, modelB, X_train, Y_train):
    '''
    Split the data into 5 folds, using the same folds for both models.
    For each fold,
        fit modelA and modelB on the remaining 4 folds
        compute each model's accuracy on the held-out fold (the validation set)
    Each model is fit 5 times, once per fold.

    Args:
        modelA and modelB: sklearn models with fit and predict functions
        X_train (np.ndarray): Feature matrix
        Y_train (np.ndarray): Labels

    Return:
        list of 5 validation accuracies for modelA
        the average validation accuracy of modelA over the 5 folds
        list of 5 validation accuracies for modelB
        the average validation accuracy of modelB over the 5 folds
    '''
    kf = KFold(n_splits=5)
    validation_accuracies_A = []
    validation_accuracies_B = []
    for train_idx, valid_idx in kf.split(X_train):
        # Select the training and validation rows for this fold
        split_X_train, split_X_valid = X_train[train_idx], X_train[valid_idx]
        split_Y_train, split_Y_valid = Y_train[train_idx], Y_train[valid_idx]

        # Fit modelA on the training split
        modelA.fit(split_X_train, split_Y_train)

        # Compute modelA's accuracy on the validation split
        Y_valid_pred = modelA.predict(split_X_valid)
        accuracyA = accuracy_score(split_Y_valid, Y_valid_pred)
        validation_accuracies_A.append(accuracyA)

        # Fit modelB on the same training split
        modelB.fit(split_X_train, split_Y_train)

        # Compute modelB's accuracy on the same validation split
        Y_valid_pred = modelB.predict(split_X_valid)
        accuracyB = accuracy_score(split_Y_valid, Y_valid_pred)
        validation_accuracies_B.append(accuracyB)
        
    return validation_accuracies_A, np.mean(validation_accuracies_A), validation_accuracies_B, np.mean(validation_accuracies_B)

Using the above function, compare the average error between LogisticRegression and SVM on the breast cancer prediction problem. Which one has the lower average error? **Please use the code to clearly show your conclusion.**

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X_train = np.array(dataset.drop('Target', axis=1).values)
Y_train = np.array(dataset['Target'])
#print(X_train)
#print(Y_train)

In [14]:
accuracyA, avg_accuracyA, accuracyB, avg_accuracyB = compute_CV_scores(LogisticRegression(max_iter=5000), SVC(), X_train, Y_train)
print("Model A:", avg_accuracyA, "Model B:", avg_accuracyB)

Model A: 0.9507840397453812 Model B: 0.9069243906225741


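**Answer:** Model A (LogisticRegression) has an average accuracy of about 0.951 (average error of about 0.049), compared to about 0.907 (average error of about 0.093) for Model B (SVC). Thus, LogisticRegression has the lower average error across the 5 folds.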

Statistical inference is a necessary step to tell whether one method is really better than the other. Please use [Student's t-test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to determine whether one method is **significantly** better than the other and **explain why** (with a significance level of 0.05).

In [15]:
from scipy.stats import ttest_ind
ttest_ind(accuracyA, accuracyB, equal_var=False)

Ttest_indResult(statistic=1.181867163362735, pvalue=0.2935126401850613)

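**Answer:** The p-value (about 0.294) is greater than the significance level of 0.05, so we fail to reject the null hypothesis that the two models have the same mean accuracy. Although LogisticRegression has the higher average accuracy across the 5 folds, the difference is not statistically significant at the 0.05 level, so we cannot conclude that either method is significantly better than the other.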
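
The decision rule used here (compare the p-value to the significance level) can be sketched as follows. The per-fold scores are hypothetical values chosen for illustration, not the ones computed in this notebook; `equal_var=False` selects Welch's variant of the t-test, as in the cell above.

```python
from scipy.stats import ttest_ind

# Hypothetical per-fold accuracies, for illustration only
scores_a = [0.93, 0.95, 0.97, 0.94, 0.96]
scores_b = [0.90, 0.93, 0.92, 0.89, 0.91]

alpha = 0.05  # significance level

# Welch's t-test: does not assume equal variances in the two samples
result = ttest_ind(scores_a, scores_b, equal_var=False)

# Reject the null hypothesis of equal means only when p < alpha
significant = result.pvalue < alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("reject H0" if significant else "fail to reject H0")
```

With these made-up scores the test rejects the null hypothesis, whereas the notebook's actual fold scores give p ≈ 0.294 and therefore do not.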

# Part 2: Spark (19 points)

In class, we learned how to write a word count task in Spark using a notebook. Please feel free to use that example as a reference to finish this task.

Now you will write your first Spark job to accomplish the following task:

1. Output the number of words that start with each letter (i.e., the 52 letters A, B, C, ..., Z and a, b, c, ..., z). This means that for every letter we want to count the total number of (non-unique) words that start with that letter. **Example: every occurrence of 'Apple2019' as a word should contribute 1 count to letter A.**

1. Run your program over the same input data pg100.txt as in class and output the result as a dataframe, similar to the example shown in class.
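
The counting rule in the first item can be sketched in plain Python, without Spark, to make the expected behavior concrete. This is a minimal sketch under stated assumptions: the helper `count_first_letters` and the sample lines are hypothetical, and a "word" is taken to be a maximal run of `\w` characters. A word is counted whenever its first character is one of the 52 ASCII letters, so 'Apple2019' counts toward A.

```python
import re
from collections import Counter

def count_first_letters(lines):
    """Count words by first character, keeping only ASCII letters (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Split on runs of non-word characters; this can yield empty strings
        for word in re.split(r'[^\w]+', line):
            # Skip empty tokens and tokens that start with a digit, underscore,
            # or non-ASCII character
            if word and word[0].isascii() and word[0].isalpha():
                counts[word[0]] += 1
    return counts

# Made-up sample input
sample = ["Apple2019 and apple pie", "Banana, apple!"]
counts = count_first_letters(sample)
print(dict(counts))
```

In Spark, the same test on `word[0]` could be used inside the `filter` step before mapping each word to its first letter.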

In [16]:
!pip install findspark
!pip install pyspark
!pip install -U -q PyDrive
!apt-get update
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"


In [17]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [18]:
id='1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('pg100.txt')

In [19]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [20]:
import re

lines = sc.textFile("./pg100.txt")

# Split each line into tokens on runs of non-word characters
tokens = lines.flatMap(lambda line: re.split(r'[^\w]+', line))

# Mapper: keep purely alphabetic tokens and map each one to its first letter.
# Note: isalpha() also drops tokens that contain digits, such as 'Apple2019'.
letters = tokens.filter(lambda word: word.isalpha()).map(lambda word: word[0])
pairs = letters.map(lambda letter: (letter, 1))

# Reducer: sum the counts for each letter
counts = pairs.reduceByKey(lambda n1, n2: n1 + n2)

# Result: convert to a pandas DataFrame for display
counts.toDF().toPandas()

Unnamed: 0,_1,_2
0,C,11171
1,W,14809
2,S,13572
3,b,35009
4,i,32389
5,c,23812
6,r,11256
7,g,14949
8,L,7312
9,R,3978
