## Employing pySpark to classify emails and detect spam

### Alexandros Dimitrios Nalmpantis; Georgios Kyriakopoulos (2017)

_**Abstract:**_ _This notebook presents pySpark code that processes and classifies text data from a corpus of emails. The code implements data wrangling and classification techniques (i.e. logistic regression analysis) to build a process that recognises whether a given email is spam or not. Alternative specifications of the classification technique are explored to assess the performance and efficiency of the process._

* We start by importing the modules that the analysis will use.

In [1]:
import re
from operator import add
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
from math import log
import time
from pprint import pprint

* We define directory paths, which include the corpus of text files that the analysis will use. **[Task a]**

In [2]:
prefix = 'hdfs://saltdean.nsqdc.city.ac.uk/data/'
dirPath = prefix + 'spam/bare/part1'
dirPath_2 = prefix + 'spam/bare/part10'

* We create two functions. Function *__splitFileWords()__* creates (file, word) tuples, with words in lowercase. A second function, *__read_file_word_RDD()__*, uses it to load the text files into an RDD of (file, word) tuples; it can optionally report the number and location of the files read, as well as a summary histogram (these diagnostics are commented out in the code). **[Task b]**

In [3]:
def splitFileWords(file_text): # function (a) builds (file, word) tuples from (file, text) tuples
    f,t = file_text # define the input to the function
    file_word_List = [] # create an empty (file,word) list
    word_List = re.split(r'\W+',t) # split texts into words using a regular expression (raw string avoids an invalid-escape warning)
    for w in word_List: 
        file_word_List.append((f,w.lower())) # append words in lowercase to their corresponding file
    return file_word_List

def read_file_word_RDD(argDir): # function (b) reads the files in argDir and builds (file, word) tuples using function (a)
    file_text_RDD = sc.wholeTextFiles(argDir)# read the files and build (file, text) tuples
    file_word_RDD = file_text_RDD.flatMap(splitFileWords) #use function (a)to build (file, word) tuples
    #print('Read {} files from directory {}'.format(file_text_RDD.count(), argDir)) # print count and location of files used
    #print('file word count histogram')
    #print(file_word_RDD.map(lambda fwL: (len(fwL[1]))).histogram([0,10,100,500, 1000, 5000, 10000])) # print word-count histogram 
    return file_word_RDD 

file_word_RDD = read_file_word_RDD(dirPath) # apply function (b) on the text corpus for the analysis 
pprint(file_word_RDD.take(2)) # print (file, word) tuples indicatively


[('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/3-1msg1.txt',
  'subject'),
 ('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/3-1msg1.txt', 're')]
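
One detail worth noting: splitting on `\W+` yields an empty string whenever the text starts or ends with a non-word character, and that empty token then travels through the pipeline as if it were a word (an empty-string token is visible in the normalised output further down). The following is a minimal local sketch, with no Spark needed, of a variant that drops such tokens. `split_file_words_clean` is a hypothetical name that is not part of the notebook, and the notebook's results were produced without this filter.

```python
import re

def split_file_words_clean(file_text):
    """Build (file, word) tuples, dropping empty tokens (hypothetical variant)."""
    f, t = file_text
    return [(f, w.lower()) for w in re.split(r'\W+', t) if w]

# toy (file, text) tuple, invented for illustration
sample = ('msg1.txt', 'Subject: Re: hello!\n')
print(re.split(r'\W+', sample[1]))      # the raw split keeps a trailing empty string
print(split_file_words_clean(sample))   # the variant filters it out
```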


* We create the function *__file_word_RDD_map_reduce()__* that yields ((file, word), count) tuples. **[Task c]**

In [4]:
def file_word_RDD_map_reduce(file_word): # function (c) uses map and reduce to build ((file, word), count) tuples
    file_word_1_RDD = file_word.map(lambda x: (x,1)) # map (file, word) tuples against 1
    fileWord_count_RDD = file_word_1_RDD.reduceByKey(add) # aggregate the (file, word) tuples
    return fileWord_count_RDD

fileWord_count_RDD = file_word_RDD_map_reduce(file_word_RDD) # count the occurrences of each (file, word) pair
pprint(fileWord_count_RDD.take(2)) # print ((file, word), count) tuples indicatively

[(('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/5-1230msg1.txt',
   'quite'),
  1),
 (('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/spmsga140.txt',
   'that'),
  31)]
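
For readers without a cluster at hand, the map-then-`reduceByKey(add)` step can be mimicked locally. The sketch below uses `collections.Counter` on a toy list of (file, word) pairs; the file names and words are invented for illustration.

```python
from collections import Counter

# toy (file, word) pairs, invented for illustration
file_word_pairs = [
    ('a.txt', 'free'), ('a.txt', 'offer'), ('a.txt', 'free'),
    ('b.txt', 'meeting'), ('b.txt', 'free'),
]

# same result as mapping each pair to (pair, 1) and summing the 1s by key
file_word_counts = Counter(file_word_pairs)
print(sorted(file_word_counts.items()))
```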


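* We create the function *__reorganise_tuples()__* that reorganises the reduced tuples from *((file, word), count)* to *(file, [(word, count)])*. Wrapping each pair in a one-element list lets a later `reduceByKey(add)` concatenate the lists per file. **[Task c - continued]**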

In [5]:
def reorganise_tuples(fw_c): # function (d) reorganises tuples from ((file, word), count) to (file, [(word, count)])
    fw,c = fw_c # unpack the ((file, word), count) tuple into its elements
    f,w = fw # unpack the nested (file, word) tuple into its elements
    return (f,[(w,c)]) # reorganise the elements into the structure (file, [(word, count)])

file_wordCount_RDD = fileWord_count_RDD.map(reorganise_tuples) 
pprint(file_wordCount_RDD.top(2))

[('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/spmsga141.txt',
  [('your', 5)]),
 ('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/spmsga141.txt',
  [('you', 9)])]


* We create the function *__make_file_termFreq_norm_RDD()__* that yields normalised frequency vectors. **[Task c - continued]**

In [6]:
def make_file_termFreq_norm_RDD(argDir): # function (e) produces normalised frequency vectors
    file_word_RDD = read_file_word_RDD(argDir) # use function (b) 
    fileWord_count_RDD = file_word_RDD_map_reduce(file_word_RDD) # use function (c)
    file_wordCount_RDD = fileWord_count_RDD.map(reorganise_tuples) # use function (d)
    file_wordCount2_RDD = file_wordCount_RDD.reduceByKey(add)
    file_wordCount_norm_RDD = file_wordCount2_RDD.map(lambda f_wcL:(f_wcL[0],[(w,c/sum([c for (w, c) in f_wcL[1]]))for (w,c) in f_wcL[1]])) # normalise
    return file_wordCount_norm_RDD                                                

file_wordCount_norm_RDD = make_file_termFreq_norm_RDD(dirPath) # test
print(file_wordCount_norm_RDD.take(1))

word_count_norm = file_wordCount_norm_RDD.take(1)[0][1] # get the first normalised word count list
pprint(sum([c for (w,c) in word_count_norm])) # check that the normalised counts sum to approximately 1

[('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/3-550msg1.txt', [('sikillian', 0.045454545454545456), ('or', 0.09090909090909091), ('does', 0.045454545454545456), ('lists', 0.09090909090909091), ('latin', 0.045454545454545456), ('bitnet', 0.045454545454545456), ('any', 0.045454545454545456), ('to', 0.045454545454545456), ('query', 0.045454545454545456), ('anyone', 0.045454545454545456), ('annotext', 0.045454545454545456), ('', 0.045454545454545456), ('thanks', 0.045454545454545456), ('greek', 0.045454545454545456), ('subject', 0.045454545454545456), ('know', 0.045454545454545456), ('classical', 0.045454545454545456), ('michael', 0.045454545454545456), ('internet', 0.045454545454545456), ('dedicated', 0.045454545454545456)])]
0.9999999999999999
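
The normalisation divides each count by the file's total word count, so the frequencies of every file sum to 1 up to floating-point rounding (hence the 0.9999999999999999 above). A local sketch follows; `normalise_counts` is a hypothetical helper that computes the total once per file rather than once per word, and the counts are invented.

```python
def normalise_counts(word_counts):
    """Turn [(word, count), ...] into [(word, relative frequency), ...]."""
    total = sum(c for _, c in word_counts)   # computed once per file
    return [(w, c / total) for w, c in word_counts]

tf = normalise_counts([('or', 2), ('lists', 2), ('latin', 1), ('greek', 1)])
print(tf)
print(sum(f for _, f in tf))   # close to 1, up to rounding
```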


* We create the function *__hashing_vectorizer()__* that creates fixed-sized frequency vectors. **[Task d]**

In [7]:
def hashing_vectorizer(word_count_list, N): # function (f) applies the hashing approach to creating a vector
     v = [0] * N  # create fixed size vector of 0s
     for word_count in word_count_list: 
         word,count = word_count# unpack tuple
         h = hash(word)# get hash value
         v[h % N] = v[h % N] + count # add count
     return v# return hashed word vector
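
One caveat with the built-in `hash()`: in Python 3, string hashes are salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the same word could land in different positions on different worker processes, depending on how the cluster is configured. The sketch below is one possible alternative, not the notebook's method: it swaps in `zlib.crc32`, which returns the same value in every process. `stable_hashing_vectorizer` is a hypothetical name.

```python
import zlib

def stable_hashing_vectorizer(word_count_list, N):
    """Hash (word, count) pairs into a fixed-size vector using CRC32."""
    v = [0] * N
    for word, count in word_count_list:
        h = zlib.crc32(word.encode('utf-8'))   # identical across processes
        v[h % N] += count                      # colliding words share a position
    return v

vec = stable_hashing_vectorizer([('free', 0.5), ('offer', 0.25), ('now', 0.25)], 10)
print(vec)
```

As with the original function, the entries of the vector add up to the sum of the input counts, whatever the collisions.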

* We create the function *__make_file_wordHashVector_norm_RDD()__* that applies the hashing vectoriser to every file's normalised word-count list, yielding (file, vector) tuples. **[Task d - continued]**

In [8]:
def make_file_wordHashVector_norm_RDD(file_wordCount_norm, argN): # function (g) applies the hashing vectoriser
    file_wordHashVector_norm_RDD = file_wordCount_norm.map(lambda f_wc: (f_wc[0],hashing_vectorizer(f_wc[1],argN))) 
    return file_wordHashVector_norm_RDD

N=100
file_wordHashVector_norm_RDD = make_file_wordHashVector_norm_RDD(make_file_termFreq_norm_RDD(dirPath),N)
print(file_wordHashVector_norm_RDD.take(2)) # test
print(sum(file_wordHashVector_norm_RDD.take(1)[0][1])) # test

[('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/3-550msg1.txt', [0.09090909090909091, 0.09090909090909091, 0, 0.09090909090909091, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0, 0, 0, 0, 0.045454545454545456, 0, 0, 0.045454545454545456, 0, 0.045454545454545456, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0, 0.045454545454545456, 0.045454545454545456, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0.045454545454545456, 0, 0, 0, 0, 0.09090909090909091, 0.045454545454545456, 0, 0, 0.045454545454545456]), ('hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part1/3-416msg2.txt', [0.006915629322268327, 0.013831258644536654, 0.009681881051175657, 0.02351313969571231, 0, 0.012448132780082988, 0.017980636237897647, 0.013831258644536652, 0.0027662517289073307, 0.019363762102351315, 0.0304287690

* We create the function *__make_label_point_RDD()__* that builds labelled-point objects. The function accepts either a path or an RDD as input and assigns label 1 to files marked as spam (file names containing 'spmsg') and label 0 to all other files. **[Task e]**

In [9]:
def make_label_point_RDD(inp,argN,trg): # function (f) creates labelled points where 1=spam; 0=non-spam and works when the argument inp is either a path (when trg=='path') or an RDD (when trg==''). The latter becomes useful at Task i
    file_wordHashVector_norm_RDD = make_file_wordHashVector_norm_RDD((make_file_termFreq_norm_RDD(inp) if trg=='path' else inp),argN) # retrieve the hashed normalised vector using function (g)
    label_point_RDD = file_wordHashVector_norm_RDD.map(lambda f_wVec: LabeledPoint(0 if re.search('spmsg', f_wVec[0]) is None else 1,f_wVec[1])) # label by file name: 1 if it contains 'spmsg', else 0
    return label_point_RDD

label_point_RDD = make_label_point_RDD(dirPath, 100,'path')# test
print(label_point_RDD.take(3)) # test

[LabeledPoint(0.0, [0.0909090909091,0.0909090909091,0.0,0.0909090909091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0454545454545,0.0,0.0454545454545,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0454545454545,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0454545454545,0.0,0.0,0.0,0.0,0.0909090909091,0.0454545454545,0.0,0.0,0.0454545454545]), LabeledPoint(0.0, [0.00691562932227,0.0138312586445,0.00968188105118,0.0235131396957,0.0,0.0124481327801,0.0179806362379,0.0138312586445,0.00276625172891,0.0193637621024,0.030428769018,0.00138312586445,0.00553250345781,0.00138312586445,0.00414937759336,0.00691562932227,0.00691562932227,0.0553250345781,0.00691562932227,0.0110650069156,0.00829875518672,0.016597
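
The labelling rule relies purely on the file name: in this corpus, spam files contain `spmsg` in their names (e.g. `spmsga140.txt`), while the other messages do not (e.g. `3-1msg1.txt`). A minimal local sketch of the rule, using a hypothetical helper name and shortened paths:

```python
import re

def label_from_path(path):
    """Return 1 if the file name marks the message as spam, else 0."""
    return 0 if re.search('spmsg', path) is None else 1

print(label_from_path('spam/bare/part1/spmsga140.txt'))   # spam file name
print(label_from_path('spam/bare/part1/3-1msg1.txt'))     # non-spam file name
```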

* We create the function *__trainModel()__* that trains a classifier for spam (=1) versus non-spam (=0) files **[Task f]**

In [23]:
def trainModel(argDir,argN): # function(g) trains a logistic regression model to detect spam vs. non-spam files
    label_point_RDD_train = make_label_point_RDD(argDir, argN,'path') # uses function (f) to build labelled points when argument inp is a path (i.e. trg=='path')
    logReg_model = LogisticRegressionWithLBFGS.train(label_point_RDD_train) # train the algorithm
    correct = label_point_RDD_train.map(lambda lp: 1 if logReg_model.predict(lp.features) == lp.label else 0).sum() # calculate correctly classified data points
    count = label_point_RDD_train.count() # counts the size of training set
    print('training Logistic Regression with Limited-memory the Broyden–Fletcher–Goldfarb–Shanno (BFGS)')
    print('training data items: {}, correct: {}'.format(count, correct))
    print('training accuracy {:.1%}'.format(correct/count)) 
    return logReg_model

logReg_model = trainModel(dirPath, 100) # test

training Logistic Regression with Limited-memory the Broyden–Fletcher–Goldfarb–Shanno (BFGS)
training data items: 289, correct: 289
training accuracy 100.0%


* We create the function *__testModel()__* that tests the classifier on a different corpus. **[Task g]**

In [24]:
def testModel(argDir, argN): # function (h) tests the logistic regression model that is trained with function (g)
    label_point_RDD_test = make_label_point_RDD(argDir, argN,'path') # uses function (f) to build labelled points when argument inp is a path (i.e. trg=='path')
    logReg_model = trainModel(dirPath, argN) # trains the model on dirPath with the same vector size as the test data
    correct_test = label_point_RDD_test.map( lambda lp: 1 if logReg_model.predict(lp.features) == lp.label else 0).sum() # calculates correctly classified data points after applying the trained model
    count_test = label_point_RDD_test.count() # counts data points in the test dataset
    print('test data items: {}, correct:{}'.format(count_test,correct_test))
    print('testing accuracy {:.1%}'.format(correct_test/count_test))

N = 100
testModel(dirPath_2, N) # test (the function prints its results and returns nothing)

training Logistic Regression with Limited-memory the Broyden–Fletcher–Goldfarb–Shanno (BFGS)
training data items: 289, correct: 289
training accuracy 100.0%
test data items: 291, correct:254
testing accuracy 87.3%
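
The accuracy figures are simply the share of correctly classified items. As a quick local check of the numbers printed above (a sketch; `accuracy` is a hypothetical helper):

```python
def accuracy(correct, total):
    """Fraction of items whose predicted label equals the true label."""
    return correct / total

print('training accuracy {:.1%}'.format(accuracy(289, 289)))
print('testing accuracy {:.1%}'.format(accuracy(254, 291)))
```

The gap between the two figures suggests that the model fits the training files more closely than it generalises to unseen ones.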


* We create the function *__trainTestModel()__* that trains and tests a classifier for spam (=1) versus non-spam (=0) files **[Task h]**

In [10]:
def trainTestModel(train_inp,test_inp,argN,trg): # function (i) trains and tests the classifier
    label_point_RDD_train = make_label_point_RDD(train_inp, argN,trg) # build label point from training data using function (f)
    logReg_model_train = LogisticRegressionWithLBFGS.train(label_point_RDD_train) # train logistic regression
    correct_train = label_point_RDD_train.map(lambda lp: 1 if logReg_model_train.predict(lp.features) == lp.label else 0).sum() 
    count_train = label_point_RDD_train.count()
    label_point_RDD_test = make_label_point_RDD(test_inp, argN,trg) # build label point from test data using function (f)
    correct_test = label_point_RDD_test.map( lambda lp: 1 if logReg_model_train.predict(lp.features) == lp.label else 0).sum()     
    count_test = label_point_RDD_test.count()
    print('Logistic Regression BFGS; training data items: {}, correct: {}'.format(count_train, correct_train))
    print('Logistic Regression BFGS; training accuracy {:.1%}'.format(correct_train/count_train))
    print('Logistic Regression BFGS; test data items: {}, correct:{}'.format(count_test,correct_test))
    print('Logistic Regression BFGS; testing accuracy {:.1%}'.format(correct_test/count_test))

trainTestModel('hdfs://saltdean/data/spam/bare/part6','hdfs://saltdean/data/spam/bare/part10',100,'path') # test (the function prints its results and returns nothing)

Logistic Regression BFGS; training data items: 289, correct: 289
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:271
Logistic Regression BFGS; testing accuracy 93.1%


* We define a dictionary for the four text corpora that we will examine **[Task h]**

In [11]:
setDict = {'EXPERIMENT A: Testing different vector sizes - No preprocessing':prefix+'spam/bare/',
           'EXPERIMENT B: Testing different vector sizes - Stopwords removed':prefix+'spam/stop/',
           'EXPERIMENT C: Testing different vector sizes - Lemmatised':prefix+'spam/lemm/',
           'EXPERIMENT D: Testing different vector sizes - Lemmatised and stopwords removed':prefix+'spam/lemm_stop/'}

* We conduct **Experiment 1** that explores classification accuracy against the size of the training dataset **[Task h - continued]**

In [27]:
N=100 # define indicative vector size

print('\n***** Experiment 1 initiated *****')
for sp in sorted(setDict):
    print('\n*****',sp)
    testPath = setDict[sp]+'part10' # define the path of the test data
    dirPattern = setDict[sp]+"part[1-{}]" # define a pattern whereby train data will be loaded (i.e. part 1; part1 + part2; ...)
    print(testPath)
    for i in range(1,10):
        print('\n**********add part %d**********' %(i)) 
        trainPaths = dirPattern.format(i) 
        print(trainPaths) # show the glob pattern of the parts used for training
        trainTestModel(trainPaths,testPath,N,'path')

print('\n***** Experiment 1 completed *****')


***** Experiment 1 initiated *****

***** EXPERIMENT A: Testing different vector sizes - No preprocessing
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part10

**********add part 1**********
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part[1-1]
Logistic Regression BFGS; training data items: 289, correct: 289
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:254
Logistic Regression BFGS; testing accuracy 87.3%

**********add part 2**********
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part[1-2]
Logistic Regression BFGS; training data items: 578, correct: 578
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:266
Logistic Regression BFGS; testing accuracy 91.4%

**********add part 3**********
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part[1-3]
Logistic Regression BFGS; training data items: 867, correct: 867
Logistic Regression BFGS; training accuracy 100.0%
Lo

**Key observations on the output of Experiment 1:** (i) For a given vector size N, increases in the size of the training dataset tend to enhance testing accuracy; (ii) For a given vector size N, removing stop words from the text corpus is linked to decreases in testing accuracy. A potential interpretation of this effect is that stopwords may be carrying information about the writing style of the texts, which is indicative of whether they are spam or not; and (iii) For a given vector size N, processing a lemmatised corpus yields fairly similar testing accuracy to processing an unprocessed corpus.
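
The training paths in this experiment rely on a glob character range: `part[1-3]` matches `part1`, `part2` and `part3`. Because the bracket matches exactly one character, `part10` is never picked up by `part[1-9]`, which keeps the test set out of the training data. Python's `fnmatch` follows the same bracket convention, so the behaviour can be checked locally. This is only a sketch: on the cluster the expansion is done by the file system, not by `fnmatch`.

```python
from fnmatch import fnmatch

parts = ['part{}'.format(i) for i in range(1, 11)]   # part1 .. part10

for i in (1, 3, 9):
    pattern = 'part[1-{}]'.format(i)
    print(pattern, [p for p in parts if fnmatch(p, pattern)])
```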

* We conduct **Experiment 2** that explores classification accuracy against vector size **[Task h - continued]**

In [28]:
print('\n***** Experiment 2 initiated *****')

for sp in sorted(setDict):
    print('\n*****',sp)
    testPath = setDict[sp]+'part10'
    trainPaths = setDict[sp]+"part[1-9]"
    print(testPath)
    print(trainPaths)
    for i in [3, 10, 30, 100, 300, 1000, 3000]:
        print('\n***** size of vector N = %d' %(i))
        start_time = time.time()
        trainTestModel(trainPaths,testPath,i,'path')
        print("*****processing duration %s seconds*****" % (time.time() - start_time))

print('\n***** Experiment 2 completed *****')


***** Experiment 2 initiated *****

***** EXPERIMENT A: Testing different vector sizes - No preprocessing
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part10
hdfs://saltdean.nsqdc.city.ac.uk/data/spam/bare/part[1-9]

***** size of vector N = 3
Logistic Regression BFGS; training data items: 2602, correct: 2165
Logistic Regression BFGS; training accuracy 83.2%
Logistic Regression BFGS; test data items: 291, correct:233
Logistic Regression BFGS; testing accuracy 80.1%
*****processing duration 50.01354503631592 seconds*****

***** size of vector N = 10
Logistic Regression BFGS; training data items: 2602, correct: 2194
Logistic Regression BFGS; training accuracy 84.3%
Logistic Regression BFGS; test data items: 291, correct:248
Logistic Regression BFGS; testing accuracy 85.2%
*****processing duration 50.11001801490784 seconds*****

***** size of vector N = 30
Logistic Regression BFGS; training data items: 2602, correct: 2354
Logistic Regression BFGS; training accuracy 90.5%
Logistic Regr

**Key observations on the output of Experiment 2:** (i) Increasing the vector size tends to increase testing accuracy, but is computationally more intensive, as computation timings suggest; (ii) For a given vector size N, increases in the size of the training dataset tend to enhance testing accuracy (as in Experiment 1); (iii) For a given vector size N, removing stop words from the text corpus is linked to decreases in testing accuracy (as in Experiment 1); (iv) For a given vector size N, processing a lemmatised corpus yields fairly similar testing accuracy to processing an unprocessed corpus (as in Experiment 1).
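
A likely reason why small vectors hurt accuracy is hash collisions: with few positions available, many distinct words share a position and their frequencies are added together. The sketch below counts the occupied positions for a toy vocabulary at the vector sizes used in the experiment. It uses `zlib.crc32` as a stand-in for `hash()` so that the result is reproducible, and the vocabulary is invented for illustration.

```python
import zlib

def occupied_buckets(words, N):
    """Number of distinct vector positions used by the given words."""
    return len({zlib.crc32(w.encode('utf-8')) % N for w in words})

vocab = ['word{}'.format(i) for i in range(500)]   # toy vocabulary of 500 words
for N in [3, 10, 30, 100, 300, 1000, 3000]:
    print(N, occupied_buckets(vocab, N))
```

The number of occupied positions can never exceed N, so for small N most of the 500 words necessarily collide.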

* Finally, we conduct **Experiment 3** that explores differently preprocessed datasets. We define the vector size N = 300, based on the outcome of Experiment 2 **[Task h - continued]**

In [12]:
print('\n***** Experiment 3 initiated *****')

N=300 # vector size of 300 appears faster with equivalent accuracy

print('EXPERIMENT 3: Testing differently preprocessed data sets')
print('training on parts 1-9, N = {}'.format(N))
for sp in sorted(setDict):
    print('\n*****',sp)
    testPath = setDict[sp]+'part10'
    trainPaths = setDict[sp]+"part[1-9]"
    #print(testPath)
    #print(trainPaths)
    trainTestModel(trainPaths,testPath,N,'path') 
    
print('\n***** Experiment 3 completed *****')
    


***** Experiment 3 initiated *****
EXPERIMENT 3: Testing differently preprocessed data sets
training on parts 1-9, N = 300

***** EXPERIMENT A: Testing different vector sizes - No preprocessing
Logistic Regression BFGS; training data items: 2602, correct: 2602
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:284
Logistic Regression BFGS; testing accuracy 97.6%

***** EXPERIMENT B: Testing different vector sizes - Stopwords removed
Logistic Regression BFGS; training data items: 2602, correct: 2602
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:273
Logistic Regression BFGS; testing accuracy 93.8%

***** EXPERIMENT C: Testing different vector sizes - Lemmatised
Logistic Regression BFGS; training data items: 2602, correct: 2602
Logistic Regression BFGS; training accuracy 100.0%
Logistic Regression BFGS; test data items: 291, correct:285
Logistic Regression BFGS; testing ac

**Key observations on the output of Experiment 3:** (i) Training on the lemmatised corpus yields marginally higher testing accuracy compared to training on the unprocessed corpus; (ii) Perhaps *surprisingly*, training on a corpus where stop words have been removed and words have been lemmatised yields substantially higher testing accuracy compared to training on a corpus where stop words have been removed but lemmatisation has not been applied.

* We define the function *__make_f_wtfiL_RDD()__* that creates a TF.IDF RDD, which will be used for Experiment 4. **[Task i]**

In [15]:
def make_f_wtfiL_RDD(path): # define function (j) that creates tf.idf
    # Calculate the IDFs
    fw_RDD = read_file_word_RDD(path)
    fw_u_RDD=fw_RDD.distinct() # maintains unique (file, word) pairs
    fw_uwf_RDD=fw_u_RDD.map(lambda fw:(fw[1],[fw[0]])) # reorganises (file, word) -to (word,[file])
    fw_uwfn_RDD=fw_uwf_RDD.reduceByKey(add) # joins the lists of files with reduceByKey
    vocSize = fw_RDD.map(lambda fw: fw[0]).distinct().count() # counts the distinct files, i.e. the number of documents used in the IDF (despite the name, this is not the vocabulary size)
    #print('\nvocSize: {}'.format(vocSize)) # print the vocabulary size 
    wIdf_RDD=fw_uwfn_RDD.map(lambda wf: (wf[0],log(vocSize/(1+len(wf[1]))))) # calculates the IDF  
    # print('\nwIdf_RDD.count(): ',wIdf_RDD.count()) # testing
    # print('\ncalculated idf',wIdf_RDD.take(2)) # testing

    # Get the normalised word counts (TFs) and organise them by word (word,(file,count))
    f_wcLn_RDD = make_file_termFreq_norm_RDD(path) # creates the normalised word count lists 
    #print('f_wcLn_RDD: ',f_wcLn_RDD.map(
    #        lambda x: sum([c for (w,c) in x[1]]).histogram([0,10,100,1000,10000]))) # checks for the per-file word counts
    w_fcn_RDD=f_wcLn_RDD.flatMap(lambda fwc:[(w,(fwc[0],c)) for (w,c) in fwc[1]]) # flatMap creates tuples (word,(file,count)) from each file's word list
    # print('w_fcn_RDD.count(): {}'.format(w_fcn_RDD.count())) # testing
    # print('\nmade the tf as (w,(f,cn))',w_fcn_RDD.take(2)) # testing

    # now we can join the IDFs and TFs by word: (word,(file,count)) join (word,idf) gives (word,((file,count),idf))
    w_fcnIdf_RDD=w_fcn_RDD.join(wIdf_RDD) # join the TF and IDF RDDs
    # print( '\nw_fcnIdf_RDD.count(): ', w_fcnIdf_RDD.count())
    # print( '\njoined(w,((f,cn),idf))', w_fcnIdf_RDD.take(2))

    # we have doubly nested tuples (word,((file,count),idf)) in the RDD, 
    # but they let us calculate the TF.IDF per file and word (file,[(word,count*idf)]).
    f_wtfiL_RDD=w_fcnIdf_RDD.map(lambda w_fcnIdf:(w_fcnIdf[1][0][0],[(w_fcnIdf[0],(w_fcnIdf[1][0][1]*w_fcnIdf[1][1]))])) #Map to (f,[(w,Tf*idf)])
    f_wtfiL_RDD=f_wtfiL_RDD.reduceByKey(add)
    # print('\nf_wtfiL_RDD.count()', f_wtfiL_RDD.count())
    # print('\n Calculated TF.IDF',str(f_wtfiL_RDD.take(2)))

    return f_wtfiL_RDD
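
To make the weighting concrete, here is a local sketch of the same IDF formula, log(D / (1 + df)), where D is the number of files and df is the number of files containing the word. The three toy documents are invented for illustration. Note that with this formula a word that occurs in every file receives a negative weight.

```python
from math import log

# toy corpus: file name -> list of words, invented for illustration
docs = {
    'a.txt': ['free', 'offer', 'now'],
    'b.txt': ['free', 'meeting'],
    'c.txt': ['free', 'offer'],
}

D = len(docs)    # number of files
df = {}          # number of files containing each word
for words in docs.values():
    for w in set(words):
        df[w] = df.get(w, 0) + 1

idf = {w: log(D / (1 + n)) for w, n in df.items()}

# TF.IDF per file: normalised term frequency multiplied by the IDF
tfidf = {f: [(w, words.count(w) / len(words) * idf[w]) for w in sorted(set(words))]
         for f, words in docs.items()}
print(idf)
print(tfidf['a.txt'])
```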

* We conduct **Experiment 4** that explores classification accuracy with TF.IDF features across the differently preprocessed datasets **[Task i - continued]**

In [16]:
# Application of trainTestModel() to RDDs created with make_f_wtfiL_RDD()

N=100 # indicative vector size

print('EXPERIMENT 4: Testing differently preprocessed data sets with TF.IDF')
print('\n***** Experiment 4 initiated *****')
print('training on parts 1-9, N = {}'.format(N))
for sp in sorted(setDict):
    print('\n*****',sp)
    test_RDD = make_f_wtfiL_RDD(setDict[sp]+'part10') # apply function (j) to create TF.IDF on test data
    train_RDD = make_f_wtfiL_RDD(setDict[sp]+"part[1-9]") # apply function (j) to create TF.IDF on train data
    trainTestModel(train_RDD,test_RDD,N,'') # apply function (i) to train and test the data
print('\n***** Experiment 4 completed *****')

EXPERIMENT 4: Testing differently preprocessed data sets with TF.IDF

***** Experiment 4 initiated *****
training on parts 1-9, N = 100

***** EXPERIMENT A: Testing different vector sizes - No preprocessing
Logistic Regression BFGS; training data items: 2602, correct: 2336
Logistic Regression BFGS; training accuracy 89.8%
Logistic Regression BFGS; test data items: 291, correct:251
Logistic Regression BFGS; testing accuracy 86.3%

***** EXPERIMENT B: Testing different vector sizes - Stopwords removed
Logistic Regression BFGS; training data items: 2602, correct: 2278
Logistic Regression BFGS; training accuracy 87.5%
Logistic Regression BFGS; test data items: 291, correct:245
Logistic Regression BFGS; testing accuracy 84.2%

***** EXPERIMENT C: Testing different vector sizes - Lemmatised
Logistic Regression BFGS; training data items: 2602, correct: 2353
Logistic Regression BFGS; training accuracy 90.4%
Logistic Regression BFGS; test data items: 291, correct:261
Logistic Regression BFGS; t

**Key observations on the output of Experiment 4:** (i) For N = 100, training on data with TF.IDF yields systematically lower testing accuracy compared to training on hashed normalised word counts (see Experiment 2); (ii) For N = 100, testing accuracy is maximised when training on lemmatised data.