## Neroma Kossi : 3A ENSAE, AS-DS
> Project for the course on software elements for massive data processing

Our project consists of parallelising the Latent Dirichlet Allocation (**LDA**) algorithm. The base paper is [PLDA, a parallel Gibbs-sampling-based algorithm](https://www.semanticscholar.org/paper/PLDA%3A-Parallel-Latent-Dirichlet-Allocation-for-Wang-Bai/376ffb536c3dc5675e9ab875b10b9c4a1437da5d).

The main idea is to run several Gibbs samplers concurrently, one per data partition. This can be done with a distributed framework such as MPI or MapReduce; this project uses the latter, with PySpark as the MapReduce implementation.

In [1]:
from pyspark import SparkConf,  SparkContext  # Spark

In [2]:
import numpy as np # math ops
import os, shutil, json #File ops
import pickle as pkl # Serialiser

from datetime import datetime
import time

In [3]:
# Some utilities saved into custom modules

from nlp import preprocessAndGetTokens
from fileUtils import load, pickleLoader, dump, saveByPartition

# 1. Create the Spark Context

In [4]:
driver_memory = '1g' # Max memory available for the driver
executor_memory = '1g' # Max memory by executor
# These parameters must be set before instantiating the SparkContext; otherwise it is too late
pyspark_submit_args = ' --driver-memory {0} --executor-memory {1} pyspark-shell'\
                                .format(driver_memory, executor_memory)
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

In [5]:
conf = SparkConf().setAll([
    ('spark.app.name', 'pLDA'),
    ('spark.master', 'local[*]'),  # use all available cores
    ('spark.scheduler.mode', 'FAIR'),
    # NOTE: Spark size strings normally use suffixes such as 'm' or 'mb' (e.g. '500m'); 'Mo' may not be parsed
    ('spark.files.maxPartitionBytes', '500Mo')])

In [6]:
spark = SparkContext(conf = conf) # Here we create the Spark context

In [7]:
spark._conf.getAll()

[('spark.files.maxPartitionBytes', '500Mo'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.name', 'pLDA'),
 ('spark.scheduler.mode', 'FAIR'),
 ('spark.driver.host', '192.168.0.41'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.memory', '1g'),
 ('spark.app.id', 'local-1549573274837'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.port', '43347')]

# 2. Data pre-processing

Our dataset consists of abstracts of research papers in computer science, neuroscience, and biomedicine. It is freely available at https://labs.semanticscholar.org/corpus/

## 2.1. Load the data from file

In [120]:
def LoadPaperAbstract(docstr, trunc = 1500):
    """Parse the paper's JSON record; keep its id and the first `trunc` characters of its abstract and of its title."""

    doc = json.loads(docstr)
    return (doc["id"], doc["paperAbstract"][:trunc], doc["title"][:trunc])
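To make the record format concrete, here is a minimal sketch of the same parsing step applied to a hand-made record. The record content and the function name `load_paper_abstract` are hypothetical; using `.get` with a default is an assumption added here to tolerate records with a missing field, and is not part of the original function.

```python
import json

def load_paper_abstract(docstr, trunc=1500):
    """Parse one JSON line; keep the id, the truncated abstract and the truncated title."""
    doc = json.loads(docstr)
    # .get with a default guards against records where a field is missing (assumption)
    abstract = doc.get("paperAbstract", "")[:trunc]
    title = doc.get("title", "")[:trunc]
    return (doc["id"], abstract, title)

# Hypothetical record, shaped like one line of the corpus file
record = json.dumps({"id": "abc123", "title": "A toy title", "paperAbstract": "x" * 2000})
doc_id, abstract, title = load_paper_abstract(record)
print(doc_id, len(abstract), title)
```

The abstract is cut to `trunc` characters, which explains why the longest abstract lengths reported in this notebook are exactly 1500.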

In [121]:
def processPaperAbstract(abstract):
    """Wrapper around preprocessAndGetTokens, which applies some basic NLP techniques
    to the paper's abstract: lowercasing, stop-word removal, stemming..."""

    return np.array(preprocessAndGetTokens(abstract))
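The `nlp` module is not shown in this notebook, so the following is only a rough, dependency-free sketch of what such a preprocessing function might look like. The stop-word list, the `naive_stem` helper and the regular expression are all illustrative assumptions; a real pipeline would more likely rely on a proper stemmer. The sketch also de-duplicates tokens, which is an inference from the sample outputs in this notebook, where each token appears only once.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "is", "has", "been", "with"}

def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess_and_get_tokens(text):
    """Lowercase, keep alphabetic words, remove stop words, stem, and de-duplicate."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = {naive_stem(w) for w in words if w not in STOPWORDS}
    return sorted(stems)

tokens = preprocess_and_get_tokens("Primary debulking surgery has been the standard treatment.")
print(tokens)
```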

In [122]:
nbPartitions = 10 # Number of partitions; this matters because our Gibbs sampler is designed to
                # launch one sampler per partition

> **Let's read the data**

In [137]:
data = spark.textFile("sample-S2-records")\
            .repartition(nbPartitions)\
                .map(LoadPaperAbstract)
# data = spark.textFile('/home/nerk/Downloads/s2-corpus-46.gz')\
#         .repartition(nbPartitions)\
#                 .map(LoadPaperAbstract)

In [138]:
data.getNumPartitions() # Same as assigned above

10

In [None]:
data.sample(False, 1/100)

In [139]:
%%time

data.map(lambda x : len(x[1])).take(10) # A look at the abstracts' lengths

CPU times: user 18.9 ms, sys: 136 µs, total: 19.1 ms
Wall time: 236 ms


[1500, 0, 0, 762, 0, 0, 628, 603, 360, 0]

In [140]:
%%time
data.take(2)  # A sample of our dataset

CPU times: user 19.8 ms, sys: 550 µs, total: 20.4 ms
Wall time: 144 ms


[('4cbba8127c8747a3b2cfb9c1f48c43e5c15e323e',
  'Primary debulking surgery (PDS) has historically been the standard treatment for advanced ovarian cancer. Recent data appear to support a paradigm shift toward neoadjuvant chemotherapy with interval debulking surgery (NACT-IDS). We hypothesized that stage IV ovarian cancer patients would likely benefit from NACT-IDS by achieving similar outcomes with less morbidity. Patients with stage IV epithelial ovarian cancer who underwent primary treatment between January 1, 1995 and December 31, 2007, were identified. Data were retrospectively extracted. Each patient record was evaluated to subclassify stage IV disease according to the sites of tumor dissemination at the time of diagnosis. The Kaplan–Meier method was used to compare overall survival (OS) data. A total of 242 newly diagnosed stage IV epithelial ovarian cancer patients were included in the final analysis; 176 women (73%) underwent PDS, 45 (18%) NACT-IDS, and 21 (9%) chemotherapy onl

In [141]:
if os.path.exists("matrix/docTitles/") :
    shutil.rmtree("matrix/docTitles/")
data.map(lambda x :  {x[0]:x[2]}).saveAsPickleFile("matrix/docTitles/") # Save doc titles (x[2] is the title, x[1] the abstract)

## 2.2. Preprocessing

In [142]:
%%time

# Now we do all the preprocessing and save the dataset
folder = "corpus/sample-S2-records/"
if os.path.exists(folder) :
    shutil.rmtree(folder)

# Tokenise each abstract and drop the documents left with no tokens
data = data.mapValues(processPaperAbstract)\
           .filter(lambda x : len(x[1]) > 0)

data.saveAsPickleFile(folder, 10)

CPU times: user 11 ms, sys: 141 µs, total: 11.2 ms
Wall time: 1.26 s


In [143]:
data.take(1) # A sample

[('4cbba8127c8747a3b2cfb9c1f48c43e5c15e323e',
  array(['unit', 'analysi', 'compar', 'includ', 'histor', 'decemb',
         'surgeri', 'subclassifi', 'onli', 'paradigm', 'chemotherapi',
         'treatment', 'diagnosi', 'cancer', 'trend', 'complet', 'januari',
         'appear', 'admiss', 'recent', 'final', 'toward', 'complic',
         'accord', 'receiv', 'resect', 'ovarian', 'frequenc', 'os',
         'support', 'treat', 'method', 'outcom', 'use', 'versus', 'patient',
         'postop', 'extract', 'evalu', 'time', 'standard', 'diagnos',
         'tumor', 'nactid', 'retrospect', 'achiev', 'overal', 'higher',
         'kaplanmei', 'vs', 'record', 'advanc', 'diseas', 'iv', 'pds',
         'shift', 'site', 'stage', 'neoadjuv', 'identifi', 'intens', 'like',
         'epitheli', 'residu', 'less', 'benefit', 'hypothes', 'total',
         'longer', 'group', 'underw', 'month', 'surviv', 'median',
         'dissemin', 'data', 'signific', 'interv', 'morbid', 'similar',
         'newli', 'frequen

> At this point, our dataset is in its raw format (docId, docContent). Next, we will assign a random topic to each word in a document. We will also need to build the vocabulary and the set of documents.

## 2.3. Building the vocabulary and the set of docs

**Reloading and partitioning the dataset**

In [38]:
# corpus = spark.pickleFile("corpus/sample-S2-records/" ).repartition(nbPartitions)
corpus = spark.pickleFile("corpus/corpus-46/part-00000" ).repartition(nbPartitions)
#         .union(spark.pickleFile("corpus/corpus-46/part-00001" ))
corpus.getNumPartitions()

10

In [39]:
# corpus2 = corpus.map(lambda x  : randomPartitionner(x, nbPartitions))\
#             .partitionBy(nbPartitions)

In [40]:
%%time
corpus.take(1)

CPU times: user 11.4 ms, sys: 1.09 ms, total: 12.5 ms
Wall time: 4.05 s


[('be8aa262e6d0c7122ba3e4b619da264c00ae15b9',
  array(['approxim', 'analysi', 'ethnic', 'men', 'physic', 'old', 'unclear',
         'louisiana', 'state', 'pressur', 'intak', 'brfss', 'postmenopaus',
         'three', 'determin', 'condit', 'alcohol', 'may', 'cholesterol',
         'veget', 'result', 'moder', 'blood', 'tennesse', 'screen',
         'relationship', 'factor', 'activ', 'onethird', 'effect', 'report',
         'studi', 'system', 'behavior', 'whether', 'daili', 'treat',
         'method', 'four', 'diabet', 'tobacco', 'adjust', 'use', 'femal',
         'crosssect', 'set', 'diagnos', 'race', 'surveil', 'howev',
         'premenopaus', 'obes', 'control', 'half', 'conclus', 'smoke',
         'weight', 'recommend', 'present', 'educ', 'relat', 'symptom',
         'purpos', 'comorbid', 'status', 'indic', 'regress', 'fruit',
         'larg', 'data', 'nevada', 'signific', 'overweight', 'year',
         'women', 'across', 'care', 'primari', 'logist', 'size', 'michigan',
         'risk'

### 2.3.1. Build the vocabularies (one per partition)

In [41]:
import importlib, builder
importlib.reload(builder)

<module 'builder' from '/home/nerk/Documents/3A_ENSAE/mapReduceLda/builder.py'>

In [42]:
from builder  import makeVocabularies, makeVocabulariesFolder, getUniqueWords, getUniqueWords2

In [43]:
makeVocabulariesFolder() # Instantiate the vocabularies' folder

In [44]:
%%time

# Compute the set of unique words per partition. Since a word can be very long,
# we will later assign each word an integer id ranging from 0 to V-1, where V is the vocabulary size
uniqueWordsByPartition = corpus.mapPartitionsWithIndex(getUniqueWords).collect()

CPU times: user 57.4 ms, sys: 83.7 ms, total: 141 ms
Wall time: 27.7 s
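The `builder` module is not shown here, so as an illustration only, this is a minimal sketch of how the words and documents of one partition can be mapped to integer ids. The function names `build_vocabulary` and `build_doc_map` and the toy data are hypothetical, and sorting the words is just one possible way to make the ids deterministic.

```python
def build_vocabulary(docs):
    """Map every distinct word of the partition to an integer id in [0, V-1]."""
    unique_words = sorted({w for _, words in docs for w in words})
    return {word: idx for idx, word in enumerate(unique_words)}

def build_doc_map(docs):
    """Map every document id of the partition to an integer id in [0, D-1]."""
    return {doc_id: idx for idx, (doc_id, _) in enumerate(docs)}

# Toy partition: (docId, tokens) pairs, as in the corpus RDD
toy = [("d1", ["cancer", "stage"]), ("d2", ["stage", "network"])]
vocab = build_vocabulary(toy)
doc_map = build_doc_map(toy)
print(vocab, doc_map)
```

Working with small integer ids instead of strings keeps the count matrices used by the sampler compact and lets them be indexed directly.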


In [45]:
# corpus.glom().map(len).collect()

In [49]:
# Number of documents & words per partition

L = [{"Partition": "%02d"%i, "ndocs": len(x[0]), "nvocabs": len(x[1])}
     for i, x in enumerate(uniqueWordsByPartition)]
L[:5]

[{'Partition': '00', 'ndocs': 3250, 'nvocabs': 31522},
 {'Partition': '01', 'ndocs': 3260, 'nvocabs': 32318},
 {'Partition': '02', 'ndocs': 3260, 'nvocabs': 32495},
 {'Partition': '03', 'ndocs': 3260, 'nvocabs': 33301},
 {'Partition': '04', 'ndocs': 3260, 'nvocabs': 33233}]

In [50]:
print("Total docs : %d "%sum(l["ndocs"] for l in L))

Total docs : 32569 


In [51]:
%%time
# Here we build the vocabularies, one per partition

makeVocabularies([ w[1] for w in  uniqueWordsByPartition]) # Build and save the vocabularies

Vocabulary 0 successfully built
Vocabulary 1 successfully built
Vocabulary 2 successfully built
Vocabulary 3 successfully built
Vocabulary 4 successfully built
Vocabulary 5 successfully built
Vocabulary 6 successfully built
Vocabulary 7 successfully built
Vocabulary 8 successfully built
Vocabulary 9 successfully built

 Global vocabulary  built too
CPU times: user 6.3 s, sys: 672 ms, total: 6.98 s
Wall time: 5.62 s
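The per-partition vocabularies overlap, which is why the global vocabulary is smaller than the sum of the partition vocabularies. As a hedged sketch of how a global vocabulary could be derived, and how local word ids could be translated to global ones, consider the following. The names `merge_vocabularies` and `local_to_global` are hypothetical and are not the functions of the `builder` module.

```python
def merge_vocabularies(word_sets):
    """Union the per-partition word sets into one global word -> id mapping."""
    global_words = sorted(set().union(*word_sets))
    return {word: idx for idx, word in enumerate(global_words)}

def local_to_global(local_vocab, global_vocab):
    """For each local word id, give the id of the same word in the global vocabulary."""
    table = [0] * len(local_vocab)
    for word, local_id in local_vocab.items():
        table[local_id] = global_vocab[word]
    return table

# Two toy partitions sharing the word "stage"
part0 = {"cancer", "stage"}
part1 = {"stage", "network"}
global_vocab = merge_vocabularies([part0, part1])
local_vocab1 = {"network": 0, "stage": 1}
print(global_vocab, local_to_global(local_vocab1, global_vocab))
```

A translation table of this kind is one way to add per-partition word counts into a single global count matrix.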


In [62]:
del uniqueWordsByPartition # free up some memory

### 2.3.2. Make docMaps: the set of all the documents

In [63]:
# import builder, importlib
# importlib.reload(builder)

In [64]:
from builder import makeDocsMaps, makeDocsMapsFolder

In [65]:
makeDocsMapsFolder() # Instantiate the documents' folder

In [66]:
%%time

corpus.mapPartitionsWithIndex(makeDocsMaps).collect()

CPU times: user 9.09 ms, sys: 123 µs, total: 9.21 ms
Wall time: 574 ms


['docMap 0 successfully built',
 'docMap 1 successfully built',
 'docMap 2 successfully built',
 'docMap 3 successfully built',
 'docMap 4 successfully built',
 'docMap 5 successfully built',
 'docMap 6 successfully built',
 'docMap 7 successfully built',
 'docMap 8 successfully built',
 'docMap 9 successfully built']

### 2.3.3. Test if vocabularies and docMaps are correctly built

In [67]:
# vocabSize = len(load("matrix/vocabulary/vocabAll"))
# print("vocabSize : %d "%vocabSize)

In [68]:
# ndocs00 = len(load("matrix/docsMap/docs__0000__"))
# print("Number of docs in docs00 : %d"%ndocs00)

Since the vocabularies and docMaps were successfully built, let's load them.

In [69]:
%%time
vocabAll = load("matrix/vocabulary/vocabAll")

vocabs = [load("matrix/vocabulary/vocab__%04d__"%ind) for ind in range(nbPartitions)] 

CPU times: user 1.15 s, sys: 39.5 ms, total: 1.19 s
Wall time: 1.19 s


In [70]:
print("Total words in Vocab : ", len(vocabAll))

Total words in Vocab :  151812


In [71]:
%%time
from builder import loadDocsAll
docsAll = loadDocsAll(nbPartitions)

docs = [load("matrix/docsMap/docs__%04d__"%ind) for ind in range(nbPartitions)] 

CPU times: user 25.7 ms, sys: 177 µs, total: 25.9 ms
Wall time: 43.3 ms


In [72]:
%%time
nbDocs = list(map(len, docs))
nbVocabs = list(map(len, vocabs))
print(nbDocs[:2], nbVocabs[:2])

[3250, 3260] [31522, 32318]
CPU times: user 0 ns, sys: 1.71 ms, total: 1.71 ms
Wall time: 1.06 ms


# 3. Prepare the data for the training step
> Encode the corpus and add topics: use ids instead of the documents' full text

## 3.1. Encode corpus

In [73]:
import builder, importlib
importlib.reload(builder)
from builder  import encodeAddTopics

In [74]:
%%time 

# The corpus is still in full text; we encode it in the next step
corpus.take(1)

CPU times: user 7.22 ms, sys: 0 ns, total: 7.22 ms
Wall time: 57.4 ms


[('be8aa262e6d0c7122ba3e4b619da264c00ae15b9',
  array(['approxim', 'analysi', 'ethnic', 'men', 'physic', 'old', 'unclear',
         'louisiana', 'state', 'pressur', 'intak', 'brfss', 'postmenopaus',
         'three', 'determin', 'condit', 'alcohol', 'may', 'cholesterol',
         'veget', 'result', 'moder', 'blood', 'tennesse', 'screen',
         'relationship', 'factor', 'activ', 'onethird', 'effect', 'report',
         'studi', 'system', 'behavior', 'whether', 'daili', 'treat',
         'method', 'four', 'diabet', 'tobacco', 'adjust', 'use', 'femal',
         'crosssect', 'set', 'diagnos', 'race', 'surveil', 'howev',
         'premenopaus', 'obes', 'control', 'half', 'conclus', 'smoke',
         'weight', 'recommend', 'present', 'educ', 'relat', 'symptom',
         'purpos', 'comorbid', 'status', 'indic', 'regress', 'fruit',
         'larg', 'data', 'nevada', 'signific', 'overweight', 'year',
         'women', 'across', 'care', 'primari', 'logist', 'size', 'michigan',
         'risk'

In [75]:
nbTopics = 10

In [76]:
# Encode every word and document as an integer id, and attach a topic to every word
corpus2 = corpus.mapPartitionsWithIndex(lambda ind, part : encodeAddTopics(ind, part,docs[ind],
                                                                           vocabs[ind], nbTopics), 
                                       preservesPartitioning = True)
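For readers without access to `builder.py`, the encoding step can be pictured roughly as follows. This is a sketch under assumptions: the name `encode_add_topics`, its signature and the use of a NumPy `Generator` are illustrative choices, not the notebook's actual `encodeAddTopics`.

```python
import numpy as np

def encode_add_topics(doc_words, vocab, nb_topics, rng):
    """Replace each word by its vocabulary id and draw a random initial topic per word."""
    word_ids = np.array([vocab[w] for w in doc_words], dtype=np.int64)
    # Uniform random initial topic in [0, nb_topics) for every token
    topics = rng.integers(0, nb_topics, size=len(word_ids))
    return word_ids, topics

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible
vocab = {"cancer": 0, "network": 1, "stage": 2}
word_ids, topics = encode_add_topics(["stage", "cancer", "stage"], vocab, nb_topics=10, rng=rng)
print(word_ids, topics)
```

The two arrays have the same length, so position `i` of `topics` is the current topic of the word at position `i` of `word_ids`.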

In [77]:
%%time
corpus2.take(1) # Just word's and doc's ids now, topics have been added too

CPU times: user 24.6 s, sys: 506 ms, total: 25.1 s
Wall time: 26.9 s


[(0,
  (2153, array([    2, 15724, 13033,  5618, 27042,  4616,  9594,  9125, 31045,
           1042, 14542,  2022,  9139, 17210, 27572, 17212,  6891, 11348,
            290, 15306, 17485, 15791, 17005, 31340, 29573, 17495, 21204,
          29580, 13827, 29581, 19967, 30079,  9923, 16770, 11866, 16776,
           7207, 10183,  5454,  8231, 31372, 17772,   349, 15103, 27135,
           8245, 30390, 23937, 26646, 20745,  2829, 14883,   606, 21484,
          31161,  8783, 31177,  9490, 22036,   421, 30690,  4552,  5287,
          17614, 10760, 21799, 27228,  7066,  7589, 20103,  5828, 27500,
          12255, 10531, 12748, 31499,  4823, 25997, 13490,  9089, 21349,
          20878, 14750, 29989, 25515,  8879]), array([9, 6, 7, 1, 7, 9, 6, 7, 5, 5, 7, 1, 6, 5, 3, 2, 9, 4, 3, 5, 4, 8,
          7, 2, 9, 1, 0, 8, 4, 2, 5, 5, 9, 1, 0, 5, 7, 3, 8, 9, 3, 5, 6, 4,
          5, 0, 5, 8, 6, 8, 8, 3, 9, 2, 6, 0, 1, 9, 1, 8, 0, 7, 3, 6, 7, 2,
          8, 1, 4, 4, 0, 3, 1, 3, 3, 0, 6, 5, 0, 9, 0, 2, 1,

## 3.2. Save the whole work for the next step 

In [78]:
from fileUtils import saveAsPickleFile

In [79]:
%%time
# saveAsPickleFile(corpus2)
if os.path.exists("initial_train"):
    shutil.rmtree("initial_train")
corpus2.saveAsPickleFile("initial_train")

CPU times: user 24.5 s, sys: 211 ms, total: 24.7 s
Wall time: 37.6 s


> This is the end of the data preprocessing: the data is now in the right format and we can train our model. Let's start.

In [83]:
del data, corpus, corpus2

# 4. Parallel LDA (MapReduce version)

> Here is the ML part

# Define some parameters

In [84]:
nbVocabAll = len(vocabAll)
alpha = 0.1
beta = 0.1
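The role of `alpha` and `beta` is easiest to see in the collapsed Gibbs update for LDA, where a token with word `w` in document `d` gets topic `k` with probability proportional to `(n_dk + alpha) * (n_wk + beta) / (n_k + V * beta)`. The sketch below implements one sweep over a single document with NumPy. It is a generic textbook version under assumed names (`gibbs_sweep`, `doc_counts`, `word_counts`, `topic_totals`) and is not the code of `model.py`.

```python
import numpy as np

def gibbs_sweep(word_ids, topics, doc_counts, word_counts, topic_totals, alpha, beta, rng):
    """Resample the topic of every token of one document, updating the counts in place."""
    vocab_size = word_counts.shape[0]
    for pos, w in enumerate(word_ids):
        old = topics[pos]
        # Remove the token's current assignment from all counts
        doc_counts[old] -= 1
        word_counts[w, old] -= 1
        topic_totals[old] -= 1
        # Unnormalised conditional probability of each topic for this token
        weights = (doc_counts + alpha) * (word_counts[w] + beta) / (topic_totals + vocab_size * beta)
        new = rng.choice(len(weights), p=weights / weights.sum())
        # Record the new assignment
        topics[pos] = new
        doc_counts[new] += 1
        word_counts[w, new] += 1
        topic_totals[new] += 1

# Toy setup: one document, 5 tokens, 5 vocabulary words, 3 topics
rng = np.random.default_rng(0)
K, V = 3, 5
word_ids = np.array([0, 2, 2, 4, 1])
topics = rng.integers(0, K, size=len(word_ids))
doc_counts = np.bincount(topics, minlength=K).astype(float)   # n_dk for this document
word_counts = np.zeros((V, K))                                 # n_wk
np.add.at(word_counts, (word_ids, topics), 1)
topic_totals = word_counts.sum(axis=0)                         # n_k
gibbs_sweep(word_ids, topics, doc_counts, word_counts, topic_totals, 0.1, 0.1, rng)
print(topics, doc_counts)
```

Small values such as 0.1 for both hyperparameters favour sparse distributions: each document concentrates on few topics and each topic on few words.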

In [85]:
from builder  import init

In [86]:
# from builder import makeConfig, updateConfig, get_now
# makeConfig(id = "all", countWordsUpdated = {str(ind):False for ind in range(nbPartitions)}, time = get_now())

# Training

In [87]:
import importlib, model, builder, fileUtils
importlib.reload(model)
importlib.reload(builder)
importlib.reload(fileUtils)
from fileUtils import saveAsPickleFile
from model import pldaMap0
from builder import updateCountWordsAll, init

In [88]:
# pldaMap(0, 1, alpha, beta, len(vocabAll), nbTopics)

In [89]:
# rdd = spark.pickleFile("pickle/")
# rdd.getNumPartitions()

In [90]:
rdd = spark.pickleFile("initial_train")
# Format of a record: (partition_id, (doc_id, doc_words, doc_topics))
rdd.take(1)

[(3,
  (3209, array([14352, 26794,  6539, 25779, 16817, 18158, 30432,  4878, 25353,
          28446, 14962, 20647,  3334, 19869,  1979,  2372,  4267, 25919,
          21715, 28195, 17774, 15978, 29534,  1131, 29342, 20536,  1263,
           4825, 27174,  6720, 10404, 28654, 27089,  7755, 28584, 20482,
          21929, 14178, 26413, 32594, 22826,   502, 12037,  9509,  2975,
           5851, 28311, 33085, 31120, 17305, 12981,  9543, 27103, 28075,
          13975, 14374,  6536, 21439, 20559]), array([9, 6, 7, 1, 7, 9, 6, 7, 5, 5, 7, 1, 6, 5, 3, 2, 9, 4, 3, 5, 4, 8,
          7, 2, 9, 1, 0, 8, 4, 2, 5, 5, 9, 1, 0, 5, 7, 3, 8, 9, 3, 5, 6, 4,
          5, 0, 5, 8, 6, 8, 8, 3, 9, 2, 6, 0, 1, 9, 1])))]

In [99]:
%%time

t0 = time.time()
# Route every record to the partition given by its key, then drop the key
rdd = spark.pickleFile("initial_train").partitionBy(nbPartitions).map(lambda x: x[1])
# rdd = corpus2
init(rdd, vocabs, nbDocs, nbVocabs, len(vocabAll), nbTopics)

for i in range(20):
    # One sampling pass per partition
    rdd = rdd.mapPartitionsWithIndex(lambda ind, part : pldaMap0(ind, part, alpha, beta, nbVocabAll, nbTopics),
                                     preservesPartitioning = True)
    saveAsPickleFile(rdd)
    # Reload the saved result so that the next iteration starts from it
    rdd = spark.pickleFile("pickle/").partitionBy(nbPartitions).map(lambda x: x[1])
    updateCountWordsAll()
print("Time : {}".format(time.time() - t0))

Time : 1518.7058153152466
CPU times: user 26.1 s, sys: 1.39 s, total: 27.5 s
Wall time: 25min 18s
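The implementation of `updateCountWordsAll` is not shown in this notebook. In the scheme described by the PLDA paper, every worker samples against its own copy of the word-topic counts, and the copies are reconciled after each iteration by summing each worker's changes. The sketch below illustrates that reconciliation step only; the function name `merge_word_counts` is hypothetical, and it assumes all matrices are indexed by the same global word ids.

```python
import numpy as np

def merge_word_counts(global_counts, local_counts_list):
    """Apply every worker's changes (local copy minus starting point) to the global counts."""
    merged = global_counts.copy()
    for local in local_counts_list:
        merged += local - global_counts
    return merged

# Global word-topic counts at the start of the iteration (2 words, 2 topics)
start = np.array([[2., 0.], [1., 1.]])
# Worker A moved one token of word 0 from topic 0 to topic 1
worker_a = np.array([[1., 1.], [1., 1.]])
# Worker B moved one token of word 1 from topic 1 to topic 0
worker_b = np.array([[2., 0.], [2., 0.]])
merged = merge_word_counts(start, [worker_a, worker_b])
print(merged)
```

Because each worker only moves its own tokens between topics, the total number of tokens is unchanged by the merge.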


In [68]:
from builder import initDocCounts, initWordCounts, initCountWordsAll

In [69]:
from model import makeConfig, get_now, pldaMap, updateConfig

In [70]:
rdd = spark.pickleFile("initial_train").partitionBy(nbPartitions).map(lambda x: x[1])
rdd.take(1)

[(25, array([ 886,    0,  802, 1284,    3,  535,  265, 1048,  589,  806,  193,
          767,  269,  348,   52,  393,  428, 1083, 1298, 1019,  974,  553,
         1022, 1167,  775,  376,   20,  982,  477,  872,  361,  819,   21,
          649,  983, 1243,  907,  211,  367,  322,  961,  746,  486, 1104,
          953, 1310,  289,  704,  660, 1240,  752, 1189, 1149,  529, 1156]), array([6, 2, 5, 3, 9, 8, 0, 9, 2, 8, 1, 2, 7, 2, 5, 0, 1, 2, 8, 5, 6, 4,
         4, 4, 0, 7, 5, 3, 2, 9, 2, 9, 5, 8, 4, 0, 7, 7, 1, 5, 5, 2, 3, 8,
         5, 2, 2, 2, 0, 4, 9, 2, 7, 1, 0]))]

In [71]:
from multiprocessing import Process

In [72]:
def supervise(nrounds):
    """Supervisor: start one worker process per partition and call updateCountWordsAll after each round."""
    processes = [Process(target = pldaMap,
                         args = (ind, nrounds, alpha, beta, len(vocabAll), nbTopics))
                 for ind in range(nbPartitions)]
    if __name__ == "__main__":
        # Run the worker processes
        for p in processes:
            p.start()

    count = 0
    while count < nrounds:
        # A round is over once no worker reports the "busy" state
        allFree = True
        for id_ in range(nbPartitions):
            with open("configs/config__%04d__"%id_, "r") as f:
                slave = json.load(f)
            if slave["state"] == "busy":
                allFree = False

        if allFree:
            updateCountWordsAll()
            updateConfig(id = "all",
                         countWordsUpdated = {str(ind): True for ind in range(nbPartitions)},
                         time = get_now())
            count += 1

        # Abort if any worker has died
        if not all(p.is_alive() for p in processes):
            for p in processes:
                p.kill()
            raise ValueError("A worker process has died")
        time.sleep(1)

    for p in processes:
        p.kill()
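The supervisor and the workers communicate through small JSON files, one per worker. The following self-contained sketch shows that polling pattern in isolation. The helpers `write_state` and `all_free` are hypothetical, and the `"free"` state value is an assumption, since only `"busy"` appears in the code above.

```python
import json
import os
import tempfile

def write_state(folder, worker_id, state):
    """Write the state of one worker to its own JSON config file."""
    path = os.path.join(folder, "config__%04d__" % worker_id)
    with open(path, "w") as f:
        json.dump({"state": state}, f)

def all_free(folder, nb_workers):
    """Return True when no worker reports the 'busy' state."""
    for worker_id in range(nb_workers):
        with open(os.path.join(folder, "config__%04d__" % worker_id)) as f:
            if json.load(f)["state"] == "busy":
                return False
    return True

with tempfile.TemporaryDirectory() as folder:
    write_state(folder, 0, "free")
    write_state(folder, 1, "busy")
    before = all_free(folder, 2)   # worker 1 is still sampling
    write_state(folder, 1, "free")
    after = all_free(folder, 2)    # every worker has finished its round
print(before, after)
```

File-based signalling is simple to debug, at the cost of polling latency (one second per check in the supervisor above).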

In [79]:
rdd = spark.pickleFile("initial_train").partitionBy(nbPartitions).map(lambda x: x[1])
rdd.mapPartitionsWithIndex(saveByPartition).collect()

rdd.mapPartitionsWithIndex(lambda ind, part : initDocCounts(ind, part, nbDocs[ind], nbTopics)).collect()
rdd.mapPartitionsWithIndex(lambda ind, part : initWordCounts(ind, part, nbVocabs[ind], nbTopics)).collect()

initCountWordsAll(vocabs, len(vocabAll), nbTopics)

countWords = load("matrix/countWords/words_all")
countWords[:10]

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 2., 2., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [80]:
makeConfig(id = "all", countWordsUpdated = {str(ind):False for ind in range(nbPartitions)}, time = get_now())

all


In [93]:
# master = Process(target = supervise, args = [10] )
# master.start()

In [76]:
master.kill()

In [94]:
# pldaMap(0, 1, alpha, beta, len(vocabAll), nbTopics)

# Post-training analysis

In [111]:
%%time
cl = 5      # topic (cluster) to inspect
subdoc = 1  # index of the partition to inspect
countDocs = load("matrix/countDocs/docs__%04d__"%subdoc)
countDocs[:10]

CPU times: user 1.88 ms, sys: 86 µs, total: 1.96 ms
Wall time: 1.28 ms


In [112]:
topics = countDocs.argmax(1)
v = np.where(topics == cl)
countDocs[v]

array([[ 0.,  0.,  3., ...,  0.,  0.,  3.],
       [ 0.,  0.,  0., ...,  0.,  1.,  2.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 9.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4., 11., ...,  0.,  0.,  4.]])
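Raw counts can also be turned into smoothed topic proportions per document, which makes documents of different lengths comparable. The sketch below uses the usual estimate `theta[d, k] = (n_dk + alpha) / (n_d + K * alpha)` on a toy count matrix; the function name `doc_topic_proportions` and the data are illustrative assumptions.

```python
import numpy as np

def doc_topic_proportions(count_docs, alpha):
    """Smoothed per-document topic proportions; every row sums to 1."""
    nb_topics = count_docs.shape[1]
    row_totals = count_docs.sum(axis=1, keepdims=True)
    return (count_docs + alpha) / (row_totals + nb_topics * alpha)

# Toy document-topic counts: 2 documents, 3 topics
toy_counts = np.array([[0., 3., 1.], [2., 0., 0.]])
theta = doc_topic_proportions(toy_counts, alpha=0.1)
dominant = theta.argmax(axis=1)          # most probable topic of each document
members = np.where(dominant == 1)[0]     # documents whose dominant topic is 1
print(theta.round(3), dominant, members)
```

With a shared `alpha`, the dominant topic is the same as with `argmax` on the raw counts, but `theta` additionally tells how strongly a document belongs to its cluster.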

In [113]:
dks = np.array( list(docs[subdoc].items()))
cluster = dks[v]
cluster[:10]

array([['c112e6cad39c90c3b8e4f3c7c99705a763272eb8', '2'],
       ['a14a171c3627543bc0db4c5a07e9ee861acd2887', '12'],
       ['ce2b170225dbcd0e453817a8efa9c5bfddc6edc2', '31'],
       ['6c0a25fc368e0d8ae4680798b834e71e96c6820b', '38'],
       ['fd6658ca16d647a564666bb947d678c763423cb1', '69'],
       ['fa080d44b8906c268cca8ae15602d0bcd041a077', '70'],
       ['75de42b2686d0337a8b58f4482ea72c18ee8f307', '71'],
       ['ed82035983dd18f638d5d7689d0daaecf38cfbeb', '74'],
       ['3618734a404b80e8adf020104a74e87450ac05d1', '82'],
       ['bcab651e0fb36eece46bde097e1f4de1cf8081a5', '96']], dtype='<U40')

In [114]:
corpus = spark.pickleFile("corpus/corpus-46/part-00000" ).repartition(nbPartitions)

In [115]:
%%time
corpus.filter(lambda x : np.isin(x[0], cluster[:, 0])).take(10)

CPU times: user 15.9 ms, sys: 7.83 ms, total: 23.8 ms
Wall time: 5.85 s


[('32ebec9e3312f7a0b47b7ab346a91c8ceb1e7cae',
  array(['secur', 'level', 'face', 'therefor', 'degrad', 'import', 'forc',
         'anomali', 'avail', 'possibl', 'current', 'network', 'fall',
         'propos', 'arriv', 'expect', 'determin', 'minim', 'monitor',
         'flaw', 'experi', 'complet', 'region', 'isp', 'seri', 'show',
         'reason', 'cope', 'connect', 'servic', 'syn', 'legitim', 'target',
         'effect', 'profit', 'henc', 'provid', 'model', 'queue', 'note',
         'daili', 'method', 'negat', 'nevertheless', 'client', 'tune',
         'use', 'analyz', 'necessari', 'maximum', 'detect', 'howev',
         'access', 'unwant', 'particular', 'fals', 'act', 'first', 'tcp',
         'avoid', 'attack', 'poisson', 'deni', 'explain', 'realtim',
         'intens', 'also', 'order', 'caus', 'synflood', 'paper', 'machin',
         'accept', 'potenti', 'purpos', 'sever', 'length', 'fail',
         'suitabl', 'correct', 'threat', 'requir', 'segment', 'solut',
         'allow', 'prob

In [169]:
countWords = load("matrix/countWords/words__%04d__"%1)
countWords[countWords < 0]

array([], dtype=float64)

In [170]:
countWords[506]

array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])

In [171]:
def func(ind, part):
    """Debugging helper: collect the document with id 0 from partition 2."""
    S = []
    if ind == 2 :
        for el in part :
            if el[0] == 0:
                S.append((ind, el))
    return S

In [172]:
rdd = spark.pickleFile("pickle/").partitionBy(nbPartitions).map(lambda x: x[1])

In [173]:
el = rdd.mapPartitionsWithIndex(func).collect()[0]
el

(2, (0, array([497,   3,  97,  98,   7,  99,   8, 240, 502, 352, 571, 390, 171,
         317, 172,  14, 210,  61, 436, 214,  66, 359, 543, 216, 176, 286,
         474, 513, 549, 181, 399, 584,  23, 220, 112,  73, 330, 225, 405,
         113, 447, 184, 478, 260, 188, 145, 484, 593, 294, 147,  75, 410,
         152, 451,  78,  79, 454, 415,  40, 416, 267,  44,  83,  45, 528,
         564, 343,  48,  88, 122, 424,  90, 565, 124]), array([5, 8, 5, 5, 9, 9, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9,
         5, 5, 9, 5, 5, 5, 5, 5, 5, 9, 5, 5, 5, 5, 5, 5, 9, 5, 5, 9, 5, 5,
         6, 6, 5, 5, 5, 5, 9, 5, 9, 9, 5, 9, 9, 5, 9, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 9, 6])))

In [174]:
np.where(el[1][1] == 543)

(array([22]),)

In [175]:
el[1][2][22]

5

In [176]:
countDocs[0]

array([ 0.,  0., 33.,  4.,  8.,  1.,  0.,  0.,  8.,  0.])

In [177]:
countDocs[0].sum(), len(el[1][1])

(54.0, 74)
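The two numbers above differ, but note that `countDocs` was loaded for `subdoc = 1` while `el` comes from partition 2, so they may simply describe different documents. A hedged sketch of a like-for-like consistency check is given below: it recomputes a document's topic counts from its assignment array and compares them with the stored row. The helper names are hypothetical, and the check assumes both inputs belong to the same document and partition.

```python
import numpy as np

def topic_counts_from_assignments(topics, nb_topics):
    """Count how many tokens of the document are assigned to each topic."""
    return np.bincount(topics, minlength=nb_topics).astype(float)

def is_consistent(count_row, topics):
    """True when the stored count row matches the document's topic assignments."""
    expected = topic_counts_from_assignments(topics, len(count_row))
    return bool(np.array_equal(count_row, expected))

# Toy document with 4 tokens and 10 topics
toy_topics = np.array([5, 8, 5, 9])
good_row = topic_counts_from_assignments(toy_topics, 10)
bad_row = good_row.copy()
bad_row[5] -= 1  # simulate a count that drifted out of sync
print(is_consistent(good_row, toy_topics), is_consistent(bad_row, toy_topics))
```

In a consistent state, the row sum always equals the number of tokens of the document.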

# 5. Conclusion