# Assignment 4

This assignment is meant to exercise you on **Inference** and **Spark**.

To receive credit, answer all questions correctly and submit to Canvas before the deadline.

**This assignment is due Monday, May 24 at 11:59 PM.**

**NOTE: The instructions themselves carry no points. However, you will lose 5 points if you do not follow, run, and understand them.**

**YOUR FULL NAME (1 POINT)**: *Jing Tian*

## Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the assignment, we ask that you **write your solutions individually**. If you do discuss the assignment with others, please **include their names** below.

**Collaborators**: *list collaborators here*

In [1]:
# import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

# Part 1: Cross Validation and A/B Testing (20 points)

In A3, we used the following data to train a LogisticRegression classifier with a single train-test split.

As we know, other models can be used instead of LogisticRegression for this prediction problem. However, a one-shot comparison on a single train-test split is not sufficient to tell which model solves the problem better.

In the following, we will use cross validation to compare two different models.

In [2]:
import sklearn.datasets as mldata
data_dict = mldata.load_breast_cancer() #load the data
print(data_dict['DESCR']) 

# You may copy your code in A3 to translate the data_dict to dataframe and prepare the target
pd_cancer = pd.DataFrame()
pd_cancer['data'] =  data_dict['data'].tolist()
pd_cancer['target'] =  data_dict['target']
pd_cancer['true_label'] = pd_cancer['target'].apply(lambda x: 'malignant' if x==0 else 'benign')
pd_cancer.head()

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

Unnamed: 0,data,target,true_label
0,"[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, ...",0,malignant
1,"[20.57, 17.77, 132.9, 1326.0, 0.08474, 0.07864...",0,malignant
2,"[19.69, 21.25, 130.0, 1203.0, 0.1096, 0.1599, ...",0,malignant
3,"[11.42, 20.38, 77.58, 386.1, 0.1425, 0.2839, 0...",0,malignant
4,"[20.29, 14.34, 135.1, 1297.0, 0.1003, 0.1328, ...",0,malignant


Scikit-learn has built-in support for cross validation. However, to compare two models fairly, we need to make sure the same folds are used to cross-validate both models. Complete the following function.

1. Use the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 5 splits on the entire data. Note that `split` returns the indices of the data for that split.
2. For **each** split:
    1. Select out the training and validation rows and columns based on the split indices and features.
    2. Compute the accuracy on the validation split for each model.
3. Return both the accuracy vector and the average accuracy across all cross validation splits for each model.
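
The splitting step above can be sketched without scikit-learn. This is a minimal, hand-rolled illustration of how an unshuffled 5-fold split partitions row indices into consecutive folds, which is also how `KFold(n_splits=5)` behaves by default. The helper name `kfold_indices` is hypothetical and is not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) lists for consecutive, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra row.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

# 569 rows, as in the breast cancer data
folds = list(kfold_indices(569))
fold_sizes = [len(valid) for _, valid in folds]
print(fold_sizes)
```

For 569 rows this gives four validation folds of 114 rows and one of 113. Because the folds depend only on the number of rows, reusing the same index pairs for both models guarantees that they are trained and scored on identical data.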

In [3]:
from sklearn.model_selection import KFold

def compute_CV_accuracy(modelA, modelB, X_train, Y_train):
    '''
    Split the data into 5 folds.
    For each fold,
        fit both models on the other 4 folds
        compute each model's accuracy on the held-out fold (the validation set)
    Each model is therefore fit 5 times, on the same folds.
    Return the accuracies and the average accuracy of modelA and modelB.

    Args:
        modelA and modelB: sklearn models with fit, predict, and score methods
        X_train (NumPy array): Data
        Y_train (NumPy array): Label

    Return (in this order):
        Accuracy vector containing 5 accuracies for modelA
        the average accuracy for the 5 splits of modelA
        Accuracy vector containing 5 accuracies for modelB
        the average accuracy for the 5 splits of modelB
    '''
    kf = KFold(n_splits=5)
    validation_accuracies_A = []
    validation_accuracies_B = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data (index the function arguments, not the notebook globals X and y)
        split_X_train, split_X_valid = X_train[train_idx], X_train[valid_idx]
        split_Y_train, split_Y_valid = Y_train[train_idx], Y_train[valid_idx]

        # Fit the modelA on the training split
        clfA = modelA.fit(split_X_train, split_Y_train)
        
        # Compute the prediction accuracy on the validation split
        accuracyA = clfA.score(split_X_valid, split_Y_valid)
        validation_accuracies_A.append(accuracyA)

        # Fit the modelB on the training split
        clfB = modelB.fit(split_X_train, split_Y_train)
        
        # Compute the prediction accuracy on the validation split
        accuracyB = clfB.score(split_X_valid, split_Y_valid)
        validation_accuracies_B.append(accuracyB)
        
    return validation_accuracies_A, np.mean(validation_accuracies_A), validation_accuracies_B, np.mean(validation_accuracies_B)

Using the above function, compare the average accuracy of LogisticRegression and SVM on the breast cancer prediction problem. Which one has the higher average accuracy? **Please use code to clearly show your conclusion.**

**Answer: LogisticRegression has the higher average accuracy: about 97.72%, compared with about 97.01% for SVM.**

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
modelA = LogisticRegression()
modelB = SVC()
# normalize the data
sc = StandardScaler()
sc.fit(pd_cancer['data'].tolist())
X = sc.transform(pd_cancer['data'].tolist())
y = pd_cancer['target'].to_numpy()
score_A, mean_A, score_B, mean_B = compute_CV_accuracy(modelA, modelB, X, y)
print("The accuracy in Logistic:", ["{:.2%}".format(score) for score in score_A])
print("The accuracy in SVM:     ", ["{:.2%}".format(score) for score in score_B])
print("The mean accuracy in Logistic:", "{:.2%}".format(mean_A))
print("The mean accuracy in SVM:     ", "{:.2%}".format(mean_B))
print("Conclusion: Logistic has higher average accuracy.")

The accuracy in Logistic: ['97.37%', '95.61%', '98.25%', '98.25%', '99.12%']
The accuracy in SVM:      ['94.74%', '96.49%', '97.37%', '99.12%', '97.35%']
The mean accuracy in Logistic: 97.72%
The mean accuracy in SVM:      97.01%
Conclusion: Logistic has higher average accuracy.


Statistical inference is a necessary step to tell whether one method is really better than the other. Please use [Student's t-test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to tell if one method is **significantly** better than the other and **explain why** (with a significance level of 0.05).

**Answer: In the t-test, H0 is that the two methods have the same mean accuracy, and H1 is that their mean accuracies differ. `ttest_ind` reports a two-sided p-value, so it is compared directly with the significance level 0.05. The observed p-value (about 0.47) is far larger than 0.05, so we cannot reject the null hypothesis. There is therefore no statistically significant evidence that one method is better than the other, even though Logistic has a slightly higher average accuracy.**

In [5]:
from scipy import stats
print(stats.ttest_ind(score_A, score_B, equal_var=False))
print(stats.ttest_ind(score_A, score_B))

Ttest_indResult(statistic=0.7601044118173171, pvalue=0.46967865771022477)
Ttest_indResult(statistic=0.7601044118173169, pvalue=0.46899819537426424)
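
To see where the test statistic comes from, the sketch below recomputes it by hand from the fold accuracies printed earlier. Those values are rounded to two decimals, so the result only approximates the SciPy output. It uses only the standard library. The function name `two_sample_t` is hypothetical, and the critical value 2.306 is an assumption taken from a standard t-table (two-sided, 5% level, 8 degrees of freedom, i.e. the equal-variance form of the test) rather than computed here.

```python
from math import sqrt
from statistics import mean, variance

# Fold accuracies (in percent) copied from the printed output above
acc_logistic = [97.37, 95.61, 98.25, 98.25, 99.12]
acc_svm = [94.74, 96.49, 97.37, 99.12, 97.35]

def two_sample_t(a, b):
    """t statistic for two independent samples of equal size."""
    # With equal sample sizes the pooled and Welch statistics coincide.
    n = len(a)
    standard_error = sqrt((variance(a) + variance(b)) / n)
    return (mean(a) - mean(b)) / standard_error

t_stat = two_sample_t(acc_logistic, acc_svm)
T_CRITICAL = 2.306  # assumed table value: two-sided, alpha = 0.05, df = 8
significant = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), significant)
```

The statistic is well below the critical value, which agrees with the large p-values reported by `ttest_ind`.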


# Part 2: Spark (19 points)

In class, we learned how to write a word count task in Spark using a notebook. Please feel free to use that example as a reference to finish this task.

Now you will write your first Spark job to accomplish the following task:

1. Output the number of words that start with each letter (i.e., the 52 letters A, B, C, ..., Z and a, b, c, ..., z). This means that for every letter we want to count the total number of (non-unique) words that start with that letter. **Example: every occurrence of 'Apple2019' as a word should contribute 1 count to letter A.**

2. Run your program over the same input data pg100.txt used in class and output the result as a dataframe, similar to the example shown in class.
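
Before moving to Spark, the counting rule in step 1 can be sketched in plain Python. This is only an illustration of the rule on a made-up sample: the helper `first_letter_counts` and the sample lines are hypothetical. It assumes that words are split on non-word characters and that only the 52 ASCII letters count as a starting letter.

```python
import re
import string
from collections import Counter

LETTERS = set(string.ascii_letters)  # the 52 letters A-Z and a-z

def first_letter_counts(lines):
    """Count words by their first character, keeping only ASCII letters."""
    counts = Counter()
    for line in lines:
        for word in re.split(r'[^\w]+', line):
            # Skip empty tokens and tokens that start with a digit or underscore.
            if word and word[0] in LETTERS:
                counts[word[0]] += 1
    return counts

# Hypothetical sample: 'Apple2019' counts toward 'A', '2019Apple' is skipped
sample = ["Apple2019 and apple", "2019Apple _x Banana"]
demo_counts = first_letter_counts(sample)
print(dict(demo_counts))
```

Checking only the first character, rather than the whole word, is what lets a token such as 'Apple2019' contribute to letter A.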

In [7]:
# setup Spark on your Colab environment
!pip install pyspark
!pip install -U -q PyDrive
!apt-get update
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
     |████████████████████████████████| 212.3MB 64kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 19.7MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=441caaae42321c04dfd4f671a64e2ec3dd2e6a64ce0a4ec3904d5793b2005460
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1
Get:

In [9]:
# write a Spark application
# import only what is used; wildcard imports from pyspark.sql.functions shadow builtins such as sum and max
from pyspark.sql import SparkSession

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [51]:
import re #regular expression used to split lines of text into words

lines = sc.textFile("./pg100.txt") # load the file
# Split the lines into words and keep only the purely alphabetic words
# (isalpha() also drops tokens that contain digits or underscores, such as 'Apple2019')
words = lines.flatMap(lambda line: re.split(r'[^\w]+', line))
words_letters = words.filter(lambda word: word.isalpha())

# convert each word to its first letter
words_letters = words_letters.map(lambda word: word[0])
print("First 5 elements in words_letters:", words_letters.take(5))

# Mapper
pairs = words_letters.map(lambda word: (word, 1))
print("First 5 elements in pairs:", pairs.take(5))

# Reducer
counts = pairs.reduceByKey(lambda n1, n2: n1 + n2)
print("First 5 elements in counts:", counts.take(5))

First 5 elements in words_letters: ['T', 'P', 'G', 'E', 'o']
First 5 elements in pairs: [('T', 1), ('P', 1), ('G', 1), ('E', 1), ('o', 1)]
First 5 elements in counts: [('C', 11171), ('W', 14809), ('S', 13572), ('b', 35009), ('i', 32389)]


In [52]:
# Result
result = counts.toDF().toPandas()
print('The total number of words which start with a letter:', result['_2'].sum())
result

The total number of words which start with a letter: 930707


Unnamed: 0,_1,_2
0,C,11171
1,W,14809
2,S,13572
3,b,35009
4,i,32389
5,c,23812
6,r,11256
7,g,14949
8,L,7312
9,R,3978
