# TEXT MINING ASSIGNMENT 
NICOLE MARIA FORMENTI - 941481

### Part (a): (Text data pre-processing)
Consider the corpus you choose. If it is the case, consider a subset of the corpus.
**Task 1**:
- Clean the corpus by eliminating punctuation and stop words
- Tokenize it
- Try to obtain bi-grams

**Task 2**:
- Split the original corpus in sentences.
- Vectorise it with bag-of words and TF-IDF methods.
- Try to form a document-term matrix.

**Task 3**: 
- Try to create a pipeline for implementing Task 1, parts 1 and 2 .

In [601]:
import spacy
import re

import nltk
from nltk.corpus import stopwords 
import string
import pandas as pd
import numpy as np

In [615]:
filename = '20news-bydate-train/sci.med/58800'
file = open(filename, 'rt')
text = file.read()
file.close()
#Printing:
print(text)

From: neal@cmptrc.lonestar.org (Neal Howard)
Subject: Re: Science and methodology (was: Homeopathy ... tradition?)
Organization: CompuTrac Inc., Richardson TX
Lines: 20

In article <1993Apr15.150550.15347@ecsvax.uncecs.edu> ccreegan@ecsvax.uncecs.edu (Charles L. Creegan) writes:
>
>What about Kekule's infamous derivation of the idea of benzene rings
>from a daydream of snakes in the fire biting their tails?  Is this
>specific enough to count?  Certainly it turns up repeatedly in basic
>phil. of sci. texts as an example of the inventive component of
>hypothesizing. 

I sometimes wonder if Kekule's dream wasn't just a wee bit influenced by
aromatic solvent vapors ;-) heh heh.


-- 
Neal Howard   '91 XLH-1200      DoD #686      CompuTrac, Inc (Richardson, TX)
	      doh #0000001200   |355o33|      neal@cmptrc.lonestar.org
	      Std disclaimer: My opinions are mine, not CompuTrac's.
         "Let us learn to dream, gentlemen, and then perhaps
          we shall learn the truth." -- August

# TASK 1 
### PART 1: clean the corpus by eliminating stop words
We eliminate both stop words which are common words that don't add any useful information and punctuation marks

In [616]:
nltk.download('stopwords')
stop_words_en = stopwords.words("english") 
stop_words_en = stop_words_en + list(string.punctuation)

[nltk_data] Downloading package stopwords to /Users/niki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [617]:
# tokenise to remove stop words and punctuation
text = [w for w in nltk.word_tokenize(text.lower()) if w not in stop_words_en]
# we also remove the new line and tab for readibility
text = ' '.join(text) 
text = text.replace('\n', ' ')
text = text.replace('\t', ' ')
text = text.replace('\'', '')
text



The initial part of the article containing the mail and the subject writing it is not deleted since some mails of users can be correlated to some specific topics and the articles are titled according to their content. Some punctuation marks won't be eliminated by we ignore them for now since they probably don't pose a real problem. 

### PART 2: tokenize it
Tokenize means to separate the text into unigrams, which are single words

In [618]:
# tokenize

token = nltk.word_tokenize(text.lower())

for t in token:
    print(t)

neal
cmptrc.lonestar.org
neal
howard
subject
science
methodology
homeopathy
...
tradition
organization
computrac
inc.
richardson
tx
lines
20
article
1993apr15.150550.15347
ecsvax.uncecs.edu
ccreegan
ecsvax.uncecs.edu
charles
l.
creegan
writes
kekule
s
infamous
derivation
idea
benzene
rings
daydream
snakes
fire
biting
tails
specific
enough
count
certainly
turns
repeatedly
basic
phil
sci
texts
example
inventive
component
hypothesizing
sometimes
wonder
kekule
s
dream
nt
wee
bit
influenced
aromatic
solvent
vapors
heh
heh
--
neal
howard
91
xlh-1200
dod
686
computrac
inc
richardson
tx
doh
0000001200
|355o33|
neal
cmptrc.lonestar.org
std
disclaimer
opinions
mine
computrac
s
``
let
us
learn
dream
gentlemen
perhaps
shall
learn
truth
--
august
kekule
1890


### PART 3: obtain bigrams
Bigrams are part of the text formed by 2 consecutive single words 

In [375]:
# create bigrams
bigrams = nltk.ngrams(nltk.word_tokenize(text), n = 2)
for grams in bigrams: 
    print (grams)

('neal', 'cmptrclonestarorg')
('cmptrclonestarorg', 'neal')
('neal', 'howard')
('howard', 'subject')
('subject', 'science')
('science', 'methodology')
('methodology', 'homeopathy')
('homeopathy', 'tradition')
('tradition', 'organization')
('organization', 'computrac')
('computrac', 'inc')
('inc', 'richardson')
('richardson', 'tx')
('tx', 'line')
('line', '20')
('20', 'article')
('article', '1993apr1515055015347')
('1993apr1515055015347', 'ecsvaxuncecsedu')
('ecsvaxuncecsedu', 'ccreegan')
('ccreegan', 'ecsvaxuncecsedu')
('ecsvaxuncecsedu', 'charles')
('charles', 'l')
('l', 'creegan')
('creegan', 'write')
('write', 'kekule')
('kekule', "'s")
("'s", 'infamous')
('infamous', 'derivation')
('derivation', 'idea')
('idea', 'benzene')
('benzene', 'ring')
('ring', 'daydream')
('daydream', 'snake')
('snake', 'fire')
('fire', 'bite')
('bite', 'tail')
('tail', 'specific')
('specific', 'enough')
('enough', 'count')
('count', 'certainly')
('certainly', 'turn')
('turn', 'repeatedly')
('repeatedly', '

# TASK 2
### PART 1
Transform the corpus in sentences, in this case each article corresponds to a document. The documents are saved in a dataframe

In [626]:
from glob import glob
import os
import re

- Training data

In [627]:
path = '20news-bydate-test'

corpus = []
labels = []

for fold in os.listdir(path):
     # retrieve the path of the folders where the documents are stored. each folder corresponds to 
     #a topic. 11 topics have been selected and for each of them we take 200 documents
    if fold == '.DS_Store':
        continue
    path_fold = os.path.join(path, fold)
    path_files = os.listdir(path_fold)

    # randomly select the 200 documents for each topic
    random_file = np.random.choice(path_files,200)
    # retrieve the name of the topic 
    label = re.findall(r'/\S+', path_fold)[0][1:]
    
    for f in random_file:    
        # the encoding used is latin-1, since utf-8 isn't compatible with the encoding of some symbols
        #in the documents
        with open(os.path.join(path_fold,f), 'r', encoding='latin-1') as file:
            file = file.read()
            # remove new line and tab
            file = file.replace('\n', ' ')
            file = file.replace('\t', ' ')
            
            corpus.append(file) #vector of documents
            labels.append(label) #vector of labels assigned to each document
            

In [424]:
# create a dataframe which associates to each document its topic
corpus_df_train = pd.DataFrame(corpus, columns=['text'])
corpus_df_train['labels'] = labels
display(corpus_df_train)

Unnamed: 0,text,labels
0,From: amoss@shuldig.cs.huji.ac.il (Amos Shapir...,talk.politics.mideast
1,From: enis@cbnewsg.cb.att.com (enis.surensoy) ...,talk.politics.mideast
2,From: sera@zuma.UUCP (Serdar Argic) Subject: T...,talk.politics.mideast
3,From: sera@zuma.UUCP (Serdar Argic) Subject: A...,talk.politics.mideast
4,From: sunder@grusin.crhc.uiuc.edu (Srinivas Su...,talk.politics.mideast
...,...,...
2195,From: cdt@sw.stratus.com (C. D. Tavares) Subje...,talk.politics.guns
2196,From: jagst18+@pitt.edu (Josh A Grossman) Subj...,talk.politics.guns
2197,From: rats@cbnewsc.cb.att.com (Morris the Cat)...,talk.politics.guns
2198,From: malexan@a.cs.okstate.edu (ALEXANDER MICH...,talk.politics.guns


- Test data

In [429]:
# apply the same procedure to retrieve the test set
path = '20news-bydate-test'

corpus = []
labels = []

for fold in os.listdir(path):
    if fold == '.DS_Store':
        continue
    path_fold = os.path.join(path, fold)
    path_files = os.listdir(path_fold)
    
    random_file = np.random.choice(path_files,100)
    label = re.findall(r'/\S+', path_fold)[0][1:]
    
    for f in random_file:    
        with open(os.path.join(path_fold,f), 'r', encoding='latin-1') as file:
            file = file.read()
            file = file.replace('\n', ' ')
            file = file.replace('\t', ' ')
            
            corpus.append(file)
            labels.append(label)
            

In [430]:
corpus_df_test = pd.DataFrame(corpus, columns=['text'])
corpus_df_test['labels'] = labels
display(corpus_df_test)

Unnamed: 0,text,labels
0,From: eshneken@ux4.cso.uiuc.edu (Edward A Shne...,talk.politics.mideast
1,From: sadek@cbnewsg.cb.att.com (mohamed.s.sade...,talk.politics.mideast
2,From: oaf@zurich.ai.mit.edu (Oded Feingold) Su...,talk.politics.mideast
3,From: tclock@orion.oac.uci.edu (Tim Clock) Sub...,talk.politics.mideast
4,From: jake@bony1.bony.com (Jake Livni) Subject...,talk.politics.mideast
...,...,...
1095,From: cescript@mtu.edu (Charles Scripter) Subj...,talk.politics.guns
1096,From: paull@hplabsz.hpl.hp.com (Robert Paull) ...,talk.politics.guns
1097,From: jpsb@NeoSoft.com (Jim Shirreffs) Subject...,talk.politics.guns
1098,From: rscharfy@magnus.acs.ohio-state.edu (Ryan...,talk.politics.guns


### PART 2

- Vectorise using TF-IDF

First we a pipeline to clean the text and lemmatize it. Lemmatization retrieves the root of the word which is the base format, so it is a normalisation of the words, so that words with the same root but in a different form are not counted as different

In [666]:
## BUILD A PIPELINE
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nlp = spacy.load("en_core_web_sm")


class Preprocess( BaseEstimator, TransformerMixin ):

    def __init__( self, lang = "english"):
        self.lang = lang
        self.stop_words_en = stopwords.words(self.lang) + list(string.punctuation)
      
    def fit( self, X, y = None ):
        return self 
    
    # the method clean() remove the stop words and punctuation and re-join the tokens in a string
    def clean( self, x ):
        
        text = [w for w in nltk.word_tokenize(x.lower()) if w not in self.stop_words_en]
        text = ' '.join([elem for elem in text]) 
        text = text.replace('\'','')
        
        return str(text)
    
    # the method lemmatize() retrieve the lemma for each word and then re-join the tokens in a string
    def lemmatize( self, x ):
        doc = nlp(x)
        text = [t.lemma_ for t in doc]
        return ' '.join(text) 
    
    # apply the clean() and lemmatize() method to the column 'text' of the given dataset
    def transform( self, X, y = None ):
        return X['text'].apply(self.clean).apply(self.lemmatize)

In [644]:
# apply the function to a subset of the dataset
processed_text = Preprocess().transform(corpus_df_train[:20]) #it works!
processed_text

0     amoss shuldig.cs.huji.ac.il amo shapira subjec...
1     enis cbnewsg.cb.att.com enis.surensoy subject ...
2     sera zuma.uucp serdar argic subject traditiona...
3     sera zuma.uucp serdar argic subject many mosle...
4     sunder grusin.crhc.uiuc.edu srinivas sunder su...
5     dfs doe.carleton.ca david f. skoll subject mos...
6     adam endor.uucp adam shostack subject freedom ...
7     eggertj moses.ll.mit.edu jim eggert x6127 g41 ...
8     adam endor.uucp adam shostack subject israel a...
9     center policy research cpr igc.apc.org subject...
10    aap wam.umd.edu alberto adolfo pinkas subject ...
11    kunda hanuman.eng.sun.com ramachandra p. kunda...
12    ab4z virginia.edu ` ` andi beyer   subject fre...
13    jaskew spam.maths.adelaide.edu.au joseph askew...
14    eggertj moses.ll.mit.edu jim eggert x6127 g41 ...
15    reply - to dcs witsend.tnet.com ` ` d. c. sess...
16    nstramer supergas.dazixco.ingr.com naftaly str...
17    dfs doe.carleton.ca david f. skoll subject

In [667]:
# build a pipeline which can be implemented with other operations
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

preproccess_pipe = Pipeline([('preprocess', Preprocess())])
preproccess_pipe 

Pipeline(memory=None, steps=[('preprocess', Preprocess(lang='english'))],
         verbose=False)

- Vectorize with TF-IDF: it counts the term frequency over the inverse dovument frequency. It is a measure that assigns more weights to terms specific for a certain document, so frequent in that document but not in all the others 

In [658]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_text)
# we print a subset of the terms 
print(vectorizer.get_feature_names()[100:200])

['accomodation', 'accord', 'account', 'achieve', 'act', 'action', 'actual', 'actually', 'adam', 'adams', 'add', 'addis', 'address', 'adelaide', 'adis', 'administration', 'adolfo', 'advance', 'adventure', 'african', 'agent', 'agnostic', 'ago', 'ahmed', 'ahronot', 'ai843', 'aid', 'aiken', 'aim', 'airlift', 'al', 'alan', 'albanian', 'alberto', 'algeria', 'ali', 'aliyev', 'allow', 'almost', 'alone', 'already', 'also', 'alternative', 'although', 'ambargo', 'american', 'ammunition', 'amo', 'amos', 'amoss', 'anarchy', 'andi', 'angell', 'annoy', 'another', 'anounced', 'answer', 'anticipate', 'antisemitic', 'anybody', 'anyone', 'anything', 'anyway', 'anyways', 'apart', 'apartheid', 'apc', 'apparently', 'appeal', 'appear', 'applied', 'apply', 'appressian', 'apr', 'april', 'arab', 'arabic', 'arabs', 'archive', 'archives', 'area', 'arena', 'argic', 'argument', 'arif', 'arm', 'armenia', 'armenian', 'armenians', 'around', 'arrival', 'article', 'ask', 'askew', 'assad', 'assassination', 'assault', 'as

In [659]:
print(X.shape) #there are 20 sentences and 1389 words

(20, 1389)


In [660]:
# we return the tf-idf score in a document-term matrix which contains douments as rows and terms as columns
display(X.toarray())

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.19662758],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.05423027, ..., 0.05423027, 0.        ,
        0.        ]])

- Vectorise using BAG OF WORDS: bow simply counts the number of occurence of each word in a document

In [651]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed_text)
print(vectorizer.get_feature_names()[100:200])

['accomodation', 'accord', 'account', 'achieve', 'act', 'action', 'actual', 'actually', 'adam', 'adams', 'add', 'addis', 'address', 'adelaide', 'adis', 'administration', 'adolfo', 'advance', 'adventure', 'african', 'agent', 'agnostic', 'ago', 'ahmed', 'ahronot', 'ai843', 'aid', 'aiken', 'aim', 'airlift', 'al', 'alan', 'albanian', 'alberto', 'algeria', 'ali', 'aliyev', 'allow', 'almost', 'alone', 'already', 'also', 'alternative', 'although', 'ambargo', 'american', 'ammunition', 'amo', 'amos', 'amoss', 'anarchy', 'andi', 'angell', 'annoy', 'another', 'anounced', 'answer', 'anticipate', 'antisemitic', 'anybody', 'anyone', 'anything', 'anyway', 'anyways', 'apart', 'apartheid', 'apc', 'apparently', 'appeal', 'appear', 'applied', 'apply', 'appressian', 'apr', 'april', 'arab', 'arabic', 'arabs', 'archive', 'archives', 'area', 'arena', 'argic', 'argument', 'arif', 'arm', 'armenia', 'armenian', 'armenians', 'around', 'arrival', 'article', 'ask', 'askew', 'assad', 'assassination', 'assault', 'as

In [652]:
print(X.shape) # 20 documents and 1389 terms

(20, 1389)


In [653]:
# we return the bow value in the term-document matrix
display(X.toarray())

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 2],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0]])

We build a function to extract from a selected document the words and the corresponding scores. We can decide to extract words corresponding to values higher than a certain threshold or the top frequent words

In [654]:
def extract_keywords(doc, document_n=None, method='tfidf', threshold=0, top_words = None):
    if method == 'tfidf':
        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(processed_text)
        features = vectorizer.get_feature_names()
        
    elif method == 'bow':
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(processed_text)
        features = vectorizer.get_feature_names()
        
    # it deals with single documents (integers). it retrieves the words and the
    #corresponding scores where the score is above a selected threshold
    
    if isinstance(document_n, int):
        keywords=[]
        for (x,name) in zip(X.toarray()[document_n],features):
                if x>=threshold:
                    keywords.append((name,round(x,3)))
        
        # returns the topwords 
        if top_words:
            dict_ = {}
            for i,j in keywords:
                dict_[i]=j

            return sorted(dict_.items(), key=lambda item: item[1])[-top_words:]
        else:
            return keywords
    
    # it deals with lists of documents (integers)
    else:        
        keywords=[]
        for d in document_n:
            for (x,name) in zip(X.toarray()[int(d)],features):
                if x>=threshold:
                    keywords.append((name,round(x,3)))
                    
        # returns the topwords 
        if top_words:
            dict_ = {}
            for i,j in keywords:
                dict_[i]=j

            return sorted(dict_.items(), key=lambda item: item[1])[-top_words:]
        else:
            return keywords

In [655]:
# extract from document 10 the words having a tf-idf score higher than 0.15
extract_keywords(processed_text, document_n=10, method='tfidf', threshold=0.15)

[('adam', 0.215),
 ('atheist', 0.19),
 ('believe', 0.162),
 ('cultural', 0.333),
 ('god', 0.238),
 ('idea', 0.275),
 ('identity', 0.286),
 ('jewish', 0.169),
 ('nation', 0.19),
 ('religion', 0.238)]

In [662]:
# extract from all the documents the top 10 words according to the bag of word value
extract_keywords(processed_text, document_n=range(0,20), method='bow', top_words=10)

[('new', 3),
 ('palestine', 3),
 ('people', 3),
 ('bony', 4),
 ('bony1', 4),
 ('com', 4),
 ('palestinean', 5),
 ('state', 5),
 ('jake', 7),
 ('nationalism', 10)]

### PART 3: Build the term-document matrix
We display the term-document format in a readable format where it is easy to retrieve columns and rows corresponding to specific terms or documents

In [471]:
# term document matrix

import pandas as pd

def TermDoc_matrix(doc, method='tfidf'):
    if method == 'tfidf':
        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(processed_text)
        features = vectorizer.get_feature_names()
        
        matrix = pd.DataFrame(data=X.toarray().round(3), columns=features)
        matrix = matrix.rename_axis("Terms", axis="columns")
        matrix = matrix.rename_axis("Documents")
        
        return matrix
    
    elif method == 'bow':
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(processed_text)
        features = vectorizer.get_feature_names()
        
        matrix = pd.DataFrame(data=X.toarray(), columns=features)
        matrix = matrix.rename_axis("Terms", axis="columns")
        matrix = matrix.rename_axis("Documents")
        
        return matrix

In [472]:
pd.set_option('display.max_columns', 500)

matrix_tfidf = TermDoc_matrix(processed_text, 'tfidf')
display(matrix_tfidf)

Terms,00,000,005019,005225,012045,013527,05,09,1000,10716,11,111030,116,14,1483500379,156,16,1699,17,178,184547,1919,1920,1923,1934,1948,1972,1983,1989,1993,1993apr26,1993apr27,1993may10,1993may12,1993may19,19th,1pp,1rbn60,1rh,1slm8r,1smbma,1smllm,20,200,20058,20r,21,210,211316,21705,21904,23,2370,24910,25,26,27,28,28455,287,29,2bec0a64,30,303,32,33,37,38,39,40,41,45,46,48,50,56,581,6101,800,80301,8231,8543,8mr,91904,93,93apr24130647,93apr26212846,93may8143340,93may9230207,93y05m11d509,98,9972,aap,ab4z,ababa,abhorrant,able,absolute,ac,accept,accomodation,accord,account,achieve,act,action,actual,actually,adam,adams,add,addis,address,adelaide,adis,administration,adolfo,advance,adventure,african,agent,agnostic,ago,ahmed,ahronot,ai843,aid,aiken,aim,airlift,al,alan,albanian,alberto,algeria,ali,aliyev,allow,almost,alone,already,also,alternative,although,ambargo,american,ammunition,amo,amos,amoss,anarchy,andi,angell,annoy,another,anounced,answer,anticipate,antisemitic,anybody,anyone,anything,anyway,anyways,apart,apartheid,apc,apparently,appeal,appear,applied,apply,appressian,apr,april,arab,arabic,arabs,archive,archives,area,arena,argic,argument,arif,arm,armenia,armenian,armenians,around,arrival,article,ask,askew,assad,assassination,assault,assure,astein,atc,atheism,atheist,att,attack,attraction,au,aunt,author,authority,autumn,avail,avoid,aware,azerbaijani,azzam,baboon,baby,back,baffle,balkan,bar,barbarism,barlow,base,basis,battle,battles,bc744,be,become,bedford,begin,behind,believe,bellini,benefit,benzion,berkeley,bernadotte,besides,beyer,beyond,bi,big,bigoted,bigotry,bill,blood,body,bomb,...,sex,sexual,shall,shameful,shapira,shares,shaul,shell,shoot,shooting,shostack,show,shuldig,significant,silver,simply,since,single,site,six,sixteen,skoll,sky,small,smart,society,soldier,solution,somalia,someone,something,somewhat,somwhere,son,soon,sorry,soul,sound,source,sovereignty,soviet,spam,speach,speak,specifically,speech,spend,spokesman,sport,spread,square,squatter,srinivas,srv,stab,stage,standard,start,starvation,state,statistic,stay,steet,stein,still,stillness,stop,story,stramer,street,strife,student,stupid,subject,submachine,substantive,successful,sue,suite,sun,sunder,supergas,support,sure,surensoy,surprise,surrender,survive,survivor,synagogue,syria,syrian,system,take,talk,tape,target,tashach,tclock,television,tell,ten,tent,term,terminator,territorially,territory,terrorism,testify,testimony,the,themslve,thing,think,thomas,thorny,thou,thousand,thread,three,throughout,ticket,tight,tim,time,tiny,tmail,tnet,to,today,together,torah,torture,tortured,total,tourist,traditional,traffic,truck,true,trusteeship,try,tue,turk,turkish,turkiye,turks,turn,turning,twenty,two,ucdavis,uchicago,uci,uiuc,umd,umich,un,unable,unarmed,uncle,university,unknown,unlike,untill,upi,uproar,urbana,usa,use,usenet,ustache,uucp,uva,valuable,various,ve,version,vicinity,video,view,views,village,violation,virginia,voice,vs,wail,wait,wall,wam,want,war,warn,waste,watch,way,weapon,week,welcome,welfare,well,welshed,west,whether,whoever,wife,wiggle,will,window,within,witness,witsend,wo,woe,woman,wonder,word,work,world,worship,would,wound,write,wrong,x6127,yarah,yassin,year,yediot,yehuda,yemen,yemeni,yesterday,yet,yfn,yisrael,yitzhak,york,yosef,young,ysu,yugoslavia,zahran,zeidan,zichron,zion,zionism,zionist,zuma
Documents,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1,Unnamed: 216_level_1,Unnamed: 217_level_1,Unnamed: 218_level_1,Unnamed: 219_level_1,Unnamed: 220_level_1,Unnamed: 221_level_1,Unnamed: 222_level_1,Unnamed: 223_level_1,Unnamed: 224_level_1,Unnamed: 225_level_1,Unnamed: 226_level_1,Unnamed: 227_level_1,Unnamed: 228_level_1,Unnamed: 229_level_1,Unnamed: 230_level_1,Unnamed: 231_level_1,Unnamed: 232_level_1,Unnamed: 233_level_1,Unnamed: 234_level_1,Unnamed: 235_level_1,Unnamed: 236_level_1,Unnamed: 237_level_1,Unnamed: 238_level_1,Unnamed: 239_level_1,Unnamed: 240_level_1,Unnamed: 241_level_1,Unnamed: 242_level_1,Unnamed: 243_level_1,Unnamed: 244_level_1,Unnamed: 245_level_1,Unnamed: 246_level_1,Unnamed: 247_level_1,Unnamed: 248_level_1,Unnamed: 249_level_1,Unnamed: 250_level_1,Unnamed: 251_level_1,Unnamed: 252_level_1,Unnamed: 253_level_1,Unnamed: 254_level_1,Unnamed: 255_level_1,Unnamed: 256_level_1,Unnamed: 257_level_1,Unnamed: 258_level_1,Unnamed: 259_level_1,Unnamed: 260_level_1,Unnamed: 261_level_1,Unnamed: 262_level_1,Unnamed: 263_level_1,Unnamed: 264_level_1,Unnamed: 265_level_1,Unnamed: 266_level_1,Unnamed: 267_level_1,Unnamed: 268_level_1,Unnamed: 269_level_1,Unnamed: 270_level_1,Unnamed: 271_level_1,Unnamed: 272_level_1,Unnamed: 273_level_1,Unnamed: 274_level_1,Unnamed: 275_level_1,Unnamed: 276_level_1,Unnamed: 277_level_1,Unnamed: 278_level_1,Unnamed: 279_level_1,Unnamed: 280_level_1,Unnamed: 281_level_1,Unnamed: 282_level_1,Unnamed: 283_level_1,Unnamed: 284_level_1,Unnamed: 285_level_1,Unnamed: 286_level_1,Unnamed: 287_level_1,Unnamed: 288_level_1,Unnamed: 289_level_1,Unnamed: 290_level_1,Unnamed: 291_level_1,Unnamed: 292_level_1,Unnamed: 293_level_1,Unnamed: 294_level_1,Unnamed: 295_level_1,Unnamed: 296_level_1,Unnamed: 297_level_1,Unnamed: 298_level_1,Unnamed: 299_level_1,Unnamed: 300_level_1,Unnamed: 301_level_1,Unnamed: 302_level_1,Unnamed: 303_level_1,Unnamed: 304_level_1,Unnamed: 305_level_1,Unnamed: 306_level_1,Unnamed: 307_level_1,Unnamed: 308_level_1,Unnamed: 309_level_1,Unnamed: 310_level_1,Unnamed: 311_level_1,Unnamed: 312_level_1,Unnamed: 313_level_1,Unnamed: 314_level_1,Unnamed: 315_level_1,Unnamed: 316_level_1,Unnamed: 317_level_1,Unnamed: 318_level_1,Unnamed: 319_level_1,Unnamed: 320_level_1,Unnamed: 321_level_1,Unnamed: 322_level_1,Unnamed: 323_level_1,Unnamed: 324_level_1,Unnamed: 325_level_1,Unnamed: 326_level_1,Unnamed: 327_level_1,Unnamed: 328_level_1,Unnamed: 329_level_1,Unnamed: 330_level_1,Unnamed: 331_level_1,Unnamed: 332_level_1,Unnamed: 333_level_1,Unnamed: 334_level_1,Unnamed: 335_level_1,Unnamed: 336_level_1,Unnamed: 337_level_1,Unnamed: 338_level_1,Unnamed: 339_level_1,Unnamed: 340_level_1,Unnamed: 341_level_1,Unnamed: 342_level_1,Unnamed: 343_level_1,Unnamed: 344_level_1,Unnamed: 345_level_1,Unnamed: 346_level_1,Unnamed: 347_level_1,Unnamed: 348_level_1,Unnamed: 349_level_1,Unnamed: 350_level_1,Unnamed: 351_level_1,Unnamed: 352_level_1,Unnamed: 353_level_1,Unnamed: 354_level_1,Unnamed: 355_level_1,Unnamed: 356_level_1,Unnamed: 357_level_1,Unnamed: 358_level_1,Unnamed: 359_level_1,Unnamed: 360_level_1,Unnamed: 361_level_1,Unnamed: 362_level_1,Unnamed: 363_level_1,Unnamed: 364_level_1,Unnamed: 365_level_1,Unnamed: 366_level_1,Unnamed: 367_level_1,Unnamed: 368_level_1,Unnamed: 369_level_1,Unnamed: 370_level_1,Unnamed: 371_level_1,Unnamed: 372_level_1,Unnamed: 373_level_1,Unnamed: 374_level_1,Unnamed: 375_level_1,Unnamed: 376_level_1,Unnamed: 377_level_1,Unnamed: 378_level_1,Unnamed: 379_level_1,Unnamed: 380_level_1,Unnamed: 381_level_1,Unnamed: 382_level_1,Unnamed: 383_level_1,Unnamed: 384_level_1,Unnamed: 385_level_1,Unnamed: 386_level_1,Unnamed: 387_level_1,Unnamed: 388_level_1,Unnamed: 389_level_1,Unnamed: 390_level_1,Unnamed: 391_level_1,Unnamed: 392_level_1,Unnamed: 393_level_1,Unnamed: 394_level_1,Unnamed: 395_level_1,Unnamed: 396_level_1,Unnamed: 397_level_1,Unnamed: 398_level_1,Unnamed: 399_level_1,Unnamed: 400_level_1,Unnamed: 401_level_1,Unnamed: 402_level_1,Unnamed: 403_level_1,Unnamed: 404_level_1,Unnamed: 405_level_1,Unnamed: 406_level_1,Unnamed: 407_level_1,Unnamed: 408_level_1,Unnamed: 409_level_1,Unnamed: 410_level_1,Unnamed: 411_level_1,Unnamed: 412_level_1,Unnamed: 413_level_1,Unnamed: 414_level_1,Unnamed: 415_level_1,Unnamed: 416_level_1,Unnamed: 417_level_1,Unnamed: 418_level_1,Unnamed: 419_level_1,Unnamed: 420_level_1,Unnamed: 421_level_1,Unnamed: 422_level_1,Unnamed: 423_level_1,Unnamed: 424_level_1,Unnamed: 425_level_1,Unnamed: 426_level_1,Unnamed: 427_level_1,Unnamed: 428_level_1,Unnamed: 429_level_1,Unnamed: 430_level_1,Unnamed: 431_level_1,Unnamed: 432_level_1,Unnamed: 433_level_1,Unnamed: 434_level_1,Unnamed: 435_level_1,Unnamed: 436_level_1,Unnamed: 437_level_1,Unnamed: 438_level_1,Unnamed: 439_level_1,Unnamed: 440_level_1,Unnamed: 441_level_1,Unnamed: 442_level_1,Unnamed: 443_level_1,Unnamed: 444_level_1,Unnamed: 445_level_1,Unnamed: 446_level_1,Unnamed: 447_level_1,Unnamed: 448_level_1,Unnamed: 449_level_1,Unnamed: 450_level_1,Unnamed: 451_level_1,Unnamed: 452_level_1,Unnamed: 453_level_1,Unnamed: 454_level_1,Unnamed: 455_level_1,Unnamed: 456_level_1,Unnamed: 457_level_1,Unnamed: 458_level_1,Unnamed: 459_level_1,Unnamed: 460_level_1,Unnamed: 461_level_1,Unnamed: 462_level_1,Unnamed: 463_level_1,Unnamed: 464_level_1,Unnamed: 465_level_1,Unnamed: 466_level_1,Unnamed: 467_level_1,Unnamed: 468_level_1,Unnamed: 469_level_1,Unnamed: 470_level_1,Unnamed: 471_level_1,Unnamed: 472_level_1,Unnamed: 473_level_1,Unnamed: 474_level_1,Unnamed: 475_level_1,Unnamed: 476_level_1,Unnamed: 477_level_1,Unnamed: 478_level_1,Unnamed: 479_level_1,Unnamed: 480_level_1,Unnamed: 481_level_1,Unnamed: 482_level_1,Unnamed: 483_level_1,Unnamed: 484_level_1,Unnamed: 485_level_1,Unnamed: 486_level_1,Unnamed: 487_level_1,Unnamed: 488_level_1,Unnamed: 489_level_1,Unnamed: 490_level_1,Unnamed: 491_level_1,Unnamed: 492_level_1,Unnamed: 493_level_1,Unnamed: 494_level_1,Unnamed: 495_level_1,Unnamed: 496_level_1,Unnamed: 497_level_1,Unnamed: 498_level_1,Unnamed: 499_level_1,Unnamed: 500_level_1,Unnamed: 501_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.057,0.063,0.0,0.063,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.057,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.071,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.052,0.0,0.048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.021,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.063,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052,0.0,0.0,0.0,0.039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.126,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104,0.0,0.0,0.0,0.048,0.0,0.0,0.0,0.0,0.052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.032,0.0,0.029,0.071,0.052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.136,0.0,0.0,0.0,0.0,0.0,0.054,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054,0.0,0.0,0.0,0.0,0.0,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.054,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.136,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054,0.0,0.068,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.054,0.06,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.046,0.0,0.0,0.0,0.0,0.0,0.0,0.136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.098,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.224,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.098,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.295,0.0,0.0,0.0,0.098,0.197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.098,0.0,0.0,0.0,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062,0.081,0.0,0.0,0.0,0.0,0.112,0.0,0.112,0.0,0.0,0.0,0.0,0.081,0.0,0.089,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089,0.055,0.0,0.0,0.098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.197
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.086,0.086,0.097,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.257,0.0,0.0,0.077,0.086,0.086,0.171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086,0.0,0.097,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054,0.071,0.0,0.0,0.171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.114,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.077,0.048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.109,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.064,0.121,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.069,0.0,0.109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.138,0.0,0.069,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021,0.0,0.0,0.0,0.0,0.0,0.0,0.275,0.0,0.046,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.138,0.0,0.0,0.207,0.0,0.0,0.242,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034,0.0,0.03,0.0,0.027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069,0.0,0.0,0.205,0.0,0.0,0.0,0.0,0.0,0.031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.149,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041,0.0,0.075,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.0,0.077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.101,0.0,0.0,0.101,0.0,0.0,0.0,...,0.115,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051,0.0,0.046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032,0.0,0.059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113,0.0,0.0,0.0,0.051,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.307,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.087,0.064,0.051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064,0.0,0.0,0.129,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.057,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.129,0.0,0.0,0.0,0.0,0.0,0.038,0.0,0.0,0.0,0.0,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.227,0.0,0.0,0.0,0.0,0.0,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038,0.0,0.0,0.0,0.064,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.227,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032,0.0,0.028,0.0,0.077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.063,0.063,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.063,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.149,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.046,0.0,0.0,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.188,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039,0.0,0.0,0.0,0.126,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.074,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.042,0.0,0.063,0.0,0.063,0.137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126,0.0


In [473]:
pd.set_option('display.max_columns', 500)

matrix_bow = TermDoc_matrix(processed_text, method='bow')
display(matrix_bow)

Terms,00,000,005019,005225,012045,013527,05,09,1000,10716,11,111030,116,14,1483500379,156,16,1699,17,178,184547,1919,1920,1923,1934,1948,1972,1983,1989,1993,1993apr26,1993apr27,1993may10,1993may12,1993may19,19th,1pp,1rbn60,1rh,1slm8r,1smbma,1smllm,20,200,20058,20r,21,210,211316,21705,21904,23,2370,24910,25,26,27,28,28455,287,29,2bec0a64,30,303,32,33,37,38,39,40,41,45,46,48,50,56,581,6101,800,80301,8231,8543,8mr,91904,93,93apr24130647,93apr26212846,93may8143340,93may9230207,93y05m11d509,98,9972,aap,ab4z,ababa,abhorrant,able,absolute,ac,accept,accomodation,accord,account,achieve,act,action,actual,actually,adam,adams,add,addis,address,adelaide,adis,administration,adolfo,advance,adventure,african,agent,agnostic,ago,ahmed,ahronot,ai843,aid,aiken,aim,airlift,al,alan,albanian,alberto,algeria,ali,aliyev,allow,almost,alone,already,also,alternative,although,ambargo,american,ammunition,amo,amos,amoss,anarchy,andi,angell,annoy,another,anounced,answer,anticipate,antisemitic,anybody,anyone,anything,anyway,anyways,apart,apartheid,apc,apparently,appeal,appear,applied,apply,appressian,apr,april,arab,arabic,arabs,archive,archives,area,arena,argic,argument,arif,arm,armenia,armenian,armenians,around,arrival,article,ask,askew,assad,assassination,assault,assure,astein,atc,atheism,atheist,att,attack,attraction,au,aunt,author,authority,autumn,avail,avoid,aware,azerbaijani,azzam,baboon,baby,back,baffle,balkan,bar,barbarism,barlow,base,basis,battle,battles,bc744,be,become,bedford,begin,behind,believe,bellini,benefit,benzion,berkeley,bernadotte,besides,beyer,beyond,bi,big,bigoted,bigotry,bill,blood,body,bomb,...,sex,sexual,shall,shameful,shapira,shares,shaul,shell,shoot,shooting,shostack,show,shuldig,significant,silver,simply,since,single,site,six,sixteen,skoll,sky,small,smart,society,soldier,solution,somalia,someone,something,somewhat,somwhere,son,soon,sorry,soul,sound,source,sovereignty,soviet,spam,speach,speak,specifically,speech,spend,spokesman,sport,spread,square,squatter,srinivas,srv,stab,stage,standard,start,starvation,state,statistic,stay,steet,stein,still,stillness,stop,story,stramer,street,strife,student,stupid,subject,submachine,substantive,successful,sue,suite,sun,sunder,supergas,support,sure,surensoy,surprise,surrender,survive,survivor,synagogue,syria,syrian,system,take,talk,tape,target,tashach,tclock,television,tell,ten,tent,term,terminator,territorially,territory,terrorism,testify,testimony,the,themslve,thing,think,thomas,thorny,thou,thousand,thread,three,throughout,ticket,tight,tim,time,tiny,tmail,tnet,to,today,together,torah,torture,tortured,total,tourist,traditional,traffic,truck,true,trusteeship,try,tue,turk,turkish,turkiye,turks,turn,turning,twenty,two,ucdavis,uchicago,uci,uiuc,umd,umich,un,unable,unarmed,uncle,university,unknown,unlike,untill,upi,uproar,urbana,usa,use,usenet,ustache,uucp,uva,valuable,various,ve,version,vicinity,video,view,views,village,violation,virginia,voice,vs,wail,wait,wall,wam,want,war,warn,waste,watch,way,weapon,week,welcome,welfare,well,welshed,west,whether,whoever,wife,wiggle,will,window,within,witness,witsend,wo,woe,woman,wonder,word,work,world,worship,would,wound,write,wrong,x6127,yarah,yassin,year,yediot,yehuda,yemen,yemeni,yesterday,yet,yfn,yisrael,yitzhak,york,yosef,young,ysu,yugoslavia,zahran,zeidan,zichron,zion,zionism,zionist,zuma
Documents,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1,Unnamed: 216_level_1,Unnamed: 217_level_1,Unnamed: 218_level_1,Unnamed: 219_level_1,Unnamed: 220_level_1,Unnamed: 221_level_1,Unnamed: 222_level_1,Unnamed: 223_level_1,Unnamed: 224_level_1,Unnamed: 225_level_1,Unnamed: 226_level_1,Unnamed: 227_level_1,Unnamed: 228_level_1,Unnamed: 229_level_1,Unnamed: 230_level_1,Unnamed: 231_level_1,Unnamed: 232_level_1,Unnamed: 233_level_1,Unnamed: 234_level_1,Unnamed: 235_level_1,Unnamed: 236_level_1,Unnamed: 237_level_1,Unnamed: 238_level_1,Unnamed: 239_level_1,Unnamed: 240_level_1,Unnamed: 241_level_1,Unnamed: 242_level_1,Unnamed: 243_level_1,Unnamed: 244_level_1,Unnamed: 245_level_1,Unnamed: 246_level_1,Unnamed: 247_level_1,Unnamed: 248_level_1,Unnamed: 249_level_1,Unnamed: 250_level_1,Unnamed: 251_level_1,Unnamed: 252_level_1,Unnamed: 253_level_1,Unnamed: 254_level_1,Unnamed: 255_level_1,Unnamed: 256_level_1,Unnamed: 257_level_1,Unnamed: 258_level_1,Unnamed: 259_level_1,Unnamed: 260_level_1,Unnamed: 261_level_1,Unnamed: 262_level_1,Unnamed: 263_level_1,Unnamed: 264_level_1,Unnamed: 265_level_1,Unnamed: 266_level_1,Unnamed: 267_level_1,Unnamed: 268_level_1,Unnamed: 269_level_1,Unnamed: 270_level_1,Unnamed: 271_level_1,Unnamed: 272_level_1,Unnamed: 273_level_1,Unnamed: 274_level_1,Unnamed: 275_level_1,Unnamed: 276_level_1,Unnamed: 277_level_1,Unnamed: 278_level_1,Unnamed: 279_level_1,Unnamed: 280_level_1,Unnamed: 281_level_1,Unnamed: 282_level_1,Unnamed: 283_level_1,Unnamed: 284_level_1,Unnamed: 285_level_1,Unnamed: 286_level_1,Unnamed: 287_level_1,Unnamed: 288_level_1,Unnamed: 289_level_1,Unnamed: 290_level_1,Unnamed: 291_level_1,Unnamed: 292_level_1,Unnamed: 293_level_1,Unnamed: 294_level_1,Unnamed: 295_level_1,Unnamed: 296_level_1,Unnamed: 297_level_1,Unnamed: 298_level_1,Unnamed: 299_level_1,Unnamed: 300_level_1,Unnamed: 301_level_1,Unnamed: 302_level_1,Unnamed: 303_level_1,Unnamed: 304_level_1,Unnamed: 305_level_1,Unnamed: 306_level_1,Unnamed: 307_level_1,Unnamed: 308_level_1,Unnamed: 309_level_1,Unnamed: 310_level_1,Unnamed: 311_level_1,Unnamed: 312_level_1,Unnamed: 313_level_1,Unnamed: 314_level_1,Unnamed: 315_level_1,Unnamed: 316_level_1,Unnamed: 317_level_1,Unnamed: 318_level_1,Unnamed: 319_level_1,Unnamed: 320_level_1,Unnamed: 321_level_1,Unnamed: 322_level_1,Unnamed: 323_level_1,Unnamed: 324_level_1,Unnamed: 325_level_1,Unnamed: 326_level_1,Unnamed: 327_level_1,Unnamed: 328_level_1,Unnamed: 329_level_1,Unnamed: 330_level_1,Unnamed: 331_level_1,Unnamed: 332_level_1,Unnamed: 333_level_1,Unnamed: 334_level_1,Unnamed: 335_level_1,Unnamed: 336_level_1,Unnamed: 337_level_1,Unnamed: 338_level_1,Unnamed: 339_level_1,Unnamed: 340_level_1,Unnamed: 341_level_1,Unnamed: 342_level_1,Unnamed: 343_level_1,Unnamed: 344_level_1,Unnamed: 345_level_1,Unnamed: 346_level_1,Unnamed: 347_level_1,Unnamed: 348_level_1,Unnamed: 349_level_1,Unnamed: 350_level_1,Unnamed: 351_level_1,Unnamed: 352_level_1,Unnamed: 353_level_1,Unnamed: 354_level_1,Unnamed: 355_level_1,Unnamed: 356_level_1,Unnamed: 357_level_1,Unnamed: 358_level_1,Unnamed: 359_level_1,Unnamed: 360_level_1,Unnamed: 361_level_1,Unnamed: 362_level_1,Unnamed: 363_level_1,Unnamed: 364_level_1,Unnamed: 365_level_1,Unnamed: 366_level_1,Unnamed: 367_level_1,Unnamed: 368_level_1,Unnamed: 369_level_1,Unnamed: 370_level_1,Unnamed: 371_level_1,Unnamed: 372_level_1,Unnamed: 373_level_1,Unnamed: 374_level_1,Unnamed: 375_level_1,Unnamed: 376_level_1,Unnamed: 377_level_1,Unnamed: 378_level_1,Unnamed: 379_level_1,Unnamed: 380_level_1,Unnamed: 381_level_1,Unnamed: 382_level_1,Unnamed: 383_level_1,Unnamed: 384_level_1,Unnamed: 385_level_1,Unnamed: 386_level_1,Unnamed: 387_level_1,Unnamed: 388_level_1,Unnamed: 389_level_1,Unnamed: 390_level_1,Unnamed: 391_level_1,Unnamed: 392_level_1,Unnamed: 393_level_1,Unnamed: 394_level_1,Unnamed: 395_level_1,Unnamed: 396_level_1,Unnamed: 397_level_1,Unnamed: 398_level_1,Unnamed: 399_level_1,Unnamed: 400_level_1,Unnamed: 401_level_1,Unnamed: 402_level_1,Unnamed: 403_level_1,Unnamed: 404_level_1,Unnamed: 405_level_1,Unnamed: 406_level_1,Unnamed: 407_level_1,Unnamed: 408_level_1,Unnamed: 409_level_1,Unnamed: 410_level_1,Unnamed: 411_level_1,Unnamed: 412_level_1,Unnamed: 413_level_1,Unnamed: 414_level_1,Unnamed: 415_level_1,Unnamed: 416_level_1,Unnamed: 417_level_1,Unnamed: 418_level_1,Unnamed: 419_level_1,Unnamed: 420_level_1,Unnamed: 421_level_1,Unnamed: 422_level_1,Unnamed: 423_level_1,Unnamed: 424_level_1,Unnamed: 425_level_1,Unnamed: 426_level_1,Unnamed: 427_level_1,Unnamed: 428_level_1,Unnamed: 429_level_1,Unnamed: 430_level_1,Unnamed: 431_level_1,Unnamed: 432_level_1,Unnamed: 433_level_1,Unnamed: 434_level_1,Unnamed: 435_level_1,Unnamed: 436_level_1,Unnamed: 437_level_1,Unnamed: 438_level_1,Unnamed: 439_level_1,Unnamed: 440_level_1,Unnamed: 441_level_1,Unnamed: 442_level_1,Unnamed: 443_level_1,Unnamed: 444_level_1,Unnamed: 445_level_1,Unnamed: 446_level_1,Unnamed: 447_level_1,Unnamed: 448_level_1,Unnamed: 449_level_1,Unnamed: 450_level_1,Unnamed: 451_level_1,Unnamed: 452_level_1,Unnamed: 453_level_1,Unnamed: 454_level_1,Unnamed: 455_level_1,Unnamed: 456_level_1,Unnamed: 457_level_1,Unnamed: 458_level_1,Unnamed: 459_level_1,Unnamed: 460_level_1,Unnamed: 461_level_1,Unnamed: 462_level_1,Unnamed: 463_level_1,Unnamed: 464_level_1,Unnamed: 465_level_1,Unnamed: 466_level_1,Unnamed: 467_level_1,Unnamed: 468_level_1,Unnamed: 469_level_1,Unnamed: 470_level_1,Unnamed: 471_level_1,Unnamed: 472_level_1,Unnamed: 473_level_1,Unnamed: 474_level_1,Unnamed: 475_level_1,Unnamed: 476_level_1,Unnamed: 477_level_1,Unnamed: 478_level_1,Unnamed: 479_level_1,Unnamed: 480_level_1,Unnamed: 481_level_1,Unnamed: 482_level_1,Unnamed: 483_level_1,Unnamed: 484_level_1,Unnamed: 485_level_1,Unnamed: 486_level_1,Unnamed: 487_level_1,Unnamed: 488_level_1,Unnamed: 489_level_1,Unnamed: 490_level_1,Unnamed: 491_level_1,Unnamed: 492_level_1,Unnamed: 493_level_1,Unnamed: 494_level_1,Unnamed: 495_level_1,Unnamed: 496_level_1,Unnamed: 497_level_1,Unnamed: 498_level_1,Unnamed: 499_level_1,Unnamed: 500_level_1,Unnamed: 501_level_1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,3,0,0,0,1,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,0,0,1,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,4,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,3,0,0,4,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,...,1,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,6,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0


In [475]:
matrix_bow[matrix_bow['speak']>0].index # the word occurs in document 9,12,15

Int64Index([9, 12, 15], dtype='int64', name='Documents')

## TASK 3: create a pipeline for cleaning the corpus and tokenizing it

In [619]:
## BUILD A PIPELINE
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nlp = spacy.load("en_core_web_sm")


class Preprocess( BaseEstimator, TransformerMixin ):

    def __init__( self, lang = "english"):
        self.lang = lang
        self.stop_words_en = stopwords.words(self.lang) + list(string.punctuation)
      
    def fit( self, X, y = None ):
        return self 
    
    # the method clean() tokenize the words and eliminate the stop words, returning a list
    #of tokens
    def clean( self, x ):
        
        text = [w for w in nltk.word_tokenize(x.lower()) if w not in self.stop_words_en]
        return text
    
    # transform applies the method clean() to the column 'text' of the dataset given as input
    def transform( self, X, y = None ):
        return X['text'].apply(self.clean)

In [620]:
processed_text = Preprocess().transform(corpus_df_train[:20]) #it works!
processed_text

0     [amoss, shuldig.cs.huji.ac.il, amos, shapira, ...
1     [enis, cbnewsg.cb.att.com, enis.surensoy, subj...
2     [sera, zuma.uucp, serdar, argic, subject, trad...
3     [sera, zuma.uucp, serdar, argic, subject, many...
4     [sunder, grusin.crhc.uiuc.edu, srinivas, sunde...
5     [dfs, doe.carleton.ca, david, f., skoll, subje...
6     [adam, endor.uucp, adam, shostack, subject, fr...
7     [eggertj, moses.ll.mit.edu, jim, eggert, x6127...
8     [adam, endor.uucp, adam, shostack, subject, is...
9     [center, policy, research, cpr, igc.apc.org, s...
10    [aap, wam.umd.edu, alberto, adolfo, pinkas, su...
11    [kunda, hanuman.eng.sun.com, ramachandra, p., ...
12    [ab4z, virginia.edu, ``, andi, beyer, '', subj...
13    [jaskew, spam.maths.adelaide.edu.au, joseph, a...
14    [eggertj, moses.ll.mit.edu, jim, eggert, x6127...
15    [reply-to, dcs, witsend.tnet.com, ``, d., c., ...
16    [nstramer, supergas.dazixco.ingr.com, naftaly,...
17    [dfs, doe.carleton.ca, david, f., skoll, s

In [621]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

clean_tokenize = Pipeline([('preprocess', Preprocess())])
clean_tokenize

Pipeline(memory=None, steps=[('preprocess', Preprocess(lang='english'))],
         verbose=False)

---
### Part (b): (Classification and clustering, topic model and summarisation).

Consider the corpus you choose or another corpus suitable for the tasks included in this part3 . If it is the case, consider a subset of the corpus. Exploit what you have done in Part (a).

**Task 1**:
- Perform classification and clustering and provide comments (within your Python code) on your results (commenting your code).

**Task 2:**
- Perform topic model and provide comments on your results. 

**Task 3:**
- Perform summarisation and provide comments (within your Python code) on your results.

## TASK 1
### CLASSIFICATION
For classification it will be applied the 
- Logistic regression, which is a simple and fast to train model
- Linear support vector classifier, which tends to have high performances. The linear kernel allows for a faster training
- Random Forest, which is among the best performing ensemble classifiers

To assess the performance of the model on the test set the F1 score, precision and recall scores will be used. The precision focuses on reducing the false negatives, while the recall on the false positives. Instead, the F1 score balances the two scores

In [491]:
display(corpus_df_train, corpus_df_test)

Unnamed: 0,text,labels
0,From: amoss@shuldig.cs.huji.ac.il (Amos Shapir...,talk.politics.mideast
1,From: enis@cbnewsg.cb.att.com (enis.surensoy) ...,talk.politics.mideast
2,From: sera@zuma.UUCP (Serdar Argic) Subject: T...,talk.politics.mideast
3,From: sera@zuma.UUCP (Serdar Argic) Subject: A...,talk.politics.mideast
4,From: sunder@grusin.crhc.uiuc.edu (Srinivas Su...,talk.politics.mideast
...,...,...
2195,From: cdt@sw.stratus.com (C. D. Tavares) Subje...,talk.politics.guns
2196,From: jagst18+@pitt.edu (Josh A Grossman) Subj...,talk.politics.guns
2197,From: rats@cbnewsc.cb.att.com (Morris the Cat)...,talk.politics.guns
2198,From: malexan@a.cs.okstate.edu (ALEXANDER MICH...,talk.politics.guns


Unnamed: 0,text,labels
0,From: eshneken@ux4.cso.uiuc.edu (Edward A Shne...,talk.politics.mideast
1,From: sadek@cbnewsg.cb.att.com (mohamed.s.sade...,talk.politics.mideast
2,From: oaf@zurich.ai.mit.edu (Oded Feingold) Su...,talk.politics.mideast
3,From: tclock@orion.oac.uci.edu (Tim Clock) Sub...,talk.politics.mideast
4,From: jake@bony1.bony.com (Jake Livni) Subject...,talk.politics.mideast
...,...,...
1095,From: cescript@mtu.edu (Charles Scripter) Subj...,talk.politics.guns
1096,From: paull@hplabsz.hpl.hp.com (Robert Paull) ...,talk.politics.guns
1097,From: jpsb@NeoSoft.com (Jim Shirreffs) Subject...,talk.politics.guns
1098,From: rscharfy@magnus.acs.ohio-state.edu (Ryan...,talk.politics.guns


The labels are mapped to integers numbers for performing the classification and clustering

In [668]:
# create a dictionary which has as key the label and as associated value the integer corresponding to the class
topics = corpus_df_train['labels'].unique()
map_dict = {}

for (t,n) in zip(topics,range(0,len(topics))):
    map_dict[t]=n

In [493]:
# map the labels to the corresponding integer value
corpus_df_train['labels'] = corpus_df_train['labels'].map(map_dict)
corpus_df_test['labels'] = corpus_df_test['labels'].map(map_dict)

In [494]:
display(corpus_df_train, corpus_df_test)

Unnamed: 0,text,labels
0,From: amoss@shuldig.cs.huji.ac.il (Amos Shapir...,0
1,From: enis@cbnewsg.cb.att.com (enis.surensoy) ...,0
2,From: sera@zuma.UUCP (Serdar Argic) Subject: T...,0
3,From: sera@zuma.UUCP (Serdar Argic) Subject: A...,0
4,From: sunder@grusin.crhc.uiuc.edu (Srinivas Su...,0
...,...,...
2195,From: cdt@sw.stratus.com (C. D. Tavares) Subje...,10
2196,From: jagst18+@pitt.edu (Josh A Grossman) Subj...,10
2197,From: rats@cbnewsc.cb.att.com (Morris the Cat)...,10
2198,From: malexan@a.cs.okstate.edu (ALEXANDER MICH...,10


Unnamed: 0,text,labels
0,From: eshneken@ux4.cso.uiuc.edu (Edward A Shne...,0
1,From: sadek@cbnewsg.cb.att.com (mohamed.s.sade...,0
2,From: oaf@zurich.ai.mit.edu (Oded Feingold) Su...,0
3,From: tclock@orion.oac.uci.edu (Tim Clock) Sub...,0
4,From: jake@bony1.bony.com (Jake Livni) Subject...,0
...,...,...
1095,From: cescript@mtu.edu (Charles Scripter) Subj...,10
1096,From: paull@hplabsz.hpl.hp.com (Robert Paull) ...,10
1097,From: jpsb@NeoSoft.com (Jim Shirreffs) Subject...,10
1098,From: rscharfy@magnus.acs.ohio-state.edu (Ryan...,10


#### LOGISTIC REGRESSION

In [575]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

pipe_logistic = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()), #vectorization of features using TF-IDF
                          ('scale', StandardScaler(with_mean=False)), #standardize the features value
                              #by dividing by the standard deviation
                              #mean is set to false since the matrix is sparse
                          ('classify', LogisticRegression(max_iter=10000, tol=0.1, solver='liblinear'))])
                              #the solver 'liblinear' supports both L1 and L2 norm

In [576]:
# perform CROSS-VALIDATION by trying different parameters

from sklearn.model_selection import GridSearchCV

param_logistic = {
    'classify__penalty' : ('l1', 'l2'), #regularisation through L1 or L2 norm, since
    #the matrix of term is very sparse. in this way we can reduce the number of features
    #either by setting some of them to 0 or very close to 0
    'classify__C' : np.logspace(-4, 4, 4) #regularisation parameter
}

cv_logistic = GridSearchCV(pipe_logistic, param_logistic, cv=5, scoring='f1_weighted')
# f1 weighted is used since the problem is a multiclassification problem. it calculates
#the f1 score for each label and then take the weighted average based on the number
#of each label in the dataset

In [577]:
model = cv_logistic.fit(corpus_df_train[['text']], corpus_df_train['labels'])

In [578]:
print(cv_logistic.best_score_, cv_logistic.best_params_)

0.9242982160782764 {'classify__C': 21.54434690031882, 'classify__penalty': 'l2'}


In [579]:
y_pred = model.predict(corpus_df_test[['text']])

In [580]:
from sklearn.metrics import f1_score, precision_score, recall_score

# "Shepherd %s is %d years old." %

print('f1 score: %f' % f1_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'precision: %f' % precision_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'recall: %f' % recall_score(corpus_df_test['labels'], y_pred, average='weighted')
         )

f1 score: 0.924497 precision: 0.925099 recall: 0.924545


The resulting F1, precision and recall scores are around 92%, which is a good result. 

#### SVM

In [582]:
from sklearn.svm import LinearSVC # faster implementation for large datasets

pipe_svc = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()), #vectorization of features using TF-IDF
                          ('scale', StandardScaler(with_mean=False)), #standardize the features value
                              #by dividing by the standard deviation
                              #mean is set to false since the matrix is sparse
                          ('classify', LinearSVC(max_iter=10000))])

In [583]:
# perform CROSS-VALIDATION by trying different parameters

param = {
    'classify__C': [0.01, 0.1, 1], # regularisation parameter. it indicates the 
    #amount of observations we allow to be on the wrong side of the separator or 
    #inside the margin
}

cv_svc = GridSearchCV(pipe_svc, param, cv=5, scoring='f1_weighted')

In [584]:
cv_svc.fit(corpus_df_train[['text']], corpus_df_train['labels'])

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('pre_process',
                                        Pipeline(memory=None,
                                                 steps=[('preprocess',
                                                         Preprocess(lang='english'))],
                                                 verbose=False)),
                                       ('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
 

In [585]:
print(cv_svc.best_score_, cv_svc.best_params_)

0.9202506631942207 {'classify__C': 0.01}


In [586]:
y_pred = cv_svc.predict(corpus_df_test[['text']])

In [587]:
from sklearn.metrics import f1_score, precision_score, recall_score

print('f1 score: %f' % f1_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'precision: %f' % precision_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'recall: %f' % recall_score(corpus_df_test['labels'], y_pred, average='weighted')
         )

f1 score: 0.925584 precision: 0.926222 recall: 0.925455


Even in this case the F1, precision and recall scores are around 92%, providing a good classification.

#### RANDOM FOREST

In [588]:
from sklearn.ensemble import RandomForestClassifier

In [589]:
pipe_rf = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()), #vectorization of features using TF-IDF
                          ('scale', StandardScaler(with_mean=False)), #standardize the features value
                            #by dividing by the standard deviation
                            #mean is set to false since the matrix is sparse
                          ('classify', RandomForestClassifier(max_features='auto'))])
                            # max_features = 'auto' select at most sqrt(n.features) at each split

In [590]:
# perform CROSS-VALIDATION by trying different parameters

param_rf = {
    'classify__n_estimators': [100, 200, 300], # n.trees to be used for the ensemble
}

cv_rf = GridSearchCV(pipe_rf, param_rf, cv=5, scoring='f1_weighted')

In [591]:
cv_rf.fit(corpus_df_train[['text']], corpus_df_train['labels'])

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('pre_process',
                                        Pipeline(memory=None,
                                                 steps=[('preprocess',
                                                         Preprocess(lang='english'))],
                                                 verbose=False)),
                                       ('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
 

In [592]:
print(cv_rf.best_score_, cv_rf.best_params_)

0.9126904650570193 {'classify__n_estimators': 300}


In [593]:
y_pred = cv_rf.predict(corpus_df_test[['text']])

In [594]:
# fai gridsearch
print('f1 score: %f' % f1_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'precision: %f' % precision_score(corpus_df_test['labels'], y_pred, average='weighted'),
      'recall: %f' % recall_score(corpus_df_test['labels'], y_pred, average='weighted')
         )

f1 score: 0.925255 precision: 0.927761 recall: 0.925455


All the classificators have a similar test score around 92%, moreover the F1 score, precision and recall return the same result. Since the SVM is the longer model to train, it can be excluded. The logistic regression is the simplest and fastest method to train, so it should be preferred if it works properly. The cross-validation procedure selected the L2 norm penalisation, also called Ridge Rgeression.

---
### CLUSTERING
Clustering is an unsupervised method which is used to group similar documents into groups. It relies on the concept of distance, since in the same cluster we find similar observations, but they must be sufficienlty dissimilar from the observations in the other clusters. <br>
It will be applied:
- K-means clustering, where we randomly select an observation as the centroid for each cluster and recursively add to each cluster the nearest observations and adjust the centroid until convergence
- Hierarchical agglomerative clustering, which recursively merge pairs of similar observations or clusters according to some metrics
- A combination of PCA to reduce the feature space and K-means clustering

#### K-MEANS CLUSTERING

In [503]:
from sklearn.cluster import KMeans

In [504]:
for i in sorted(map_dict.keys()):
    print(i)

# we see there are possibly 5 topics: computers, vehicles, sport, medicine and politics

comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.med
talk.politics.guns
talk.politics.mideast
talk.politics.misc


In [507]:
# k -means 5 clusters

kmeans_pipe_5 = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()),#vectorization of features using TF-IDF
                          ('cluster', KMeans(n_clusters=5))])

In [508]:
kmeans_5 = kmeans_pipe_5.fit_predict(corpus_df_train[['text']])

In [509]:
# create a new database containing the original label and the assigned cluster which will be used later on
corpus_df_train_kmeans = corpus_df_train.copy()
corpus_df_train_kmeans['kmeans_5'] = kmeans_5

In [510]:
display(corpus_df_train_kmeans)

Unnamed: 0,text,labels,kmeans_5
0,From: amoss@shuldig.cs.huji.ac.il (Amos Shapir...,0,4
1,From: enis@cbnewsg.cb.att.com (enis.surensoy) ...,0,4
2,From: sera@zuma.UUCP (Serdar Argic) Subject: T...,0,4
3,From: sera@zuma.UUCP (Serdar Argic) Subject: A...,0,4
4,From: sunder@grusin.crhc.uiuc.edu (Srinivas Su...,0,0
...,...,...,...
2195,From: cdt@sw.stratus.com (C. D. Tavares) Subje...,10,1
2196,From: jagst18+@pitt.edu (Josh A Grossman) Subj...,10,0
2197,From: rats@cbnewsc.cb.att.com (Morris the Cat)...,10,0
2198,From: malexan@a.cs.okstate.edu (ALEXANDER MICH...,10,0


We re-transform the labels from integers to their original name for readibility and then we create a function cluster_topicounts() which counts the number of documents with the same labels in each cluster found. In this way we can assess if the clustering has been able to properly separate topics

In [669]:
# inverse mapping of labels
inverse_map_dict = {}
for k,v in map_dict.items():
    inverse_map_dict[v] = k

In [512]:
corpus_df_train_kmeans['labels'] = corpus_df_train_kmeans['labels'].map(inverse_map_dict)

In [513]:
def cluster_topicounts(n_clusters,column_name):
    for i in range(0,n_clusters):
        # select the label of all the documents belonging to the same cluster
        x = corpus_df_train_kmeans[corpus_df_train_kmeans[column_name] == i]['labels']
        print('cluster n.%d' % i)
        # count the number of documents with the same label
        y = np.unique(x,return_counts=True)
        # return the cluster and the labels contained with the corresponding counts
        for n,j in zip(y[0],y[1]):
            print('label %s:'%n,
                  'count %d'%j)
    
        print('\n')

In [514]:
cluster_topicounts(5, 'kmeans_5')    

cluster n.0
label comp.os.ms-windows.misc: count 95
label comp.sys.ibm.pc.hardware: count 72
label comp.sys.mac.hardware: count 120
label rec.autos: count 198
label rec.motorcycles: count 200
label rec.sport.baseball: count 79
label rec.sport.hockey: count 32
label sci.med: count 195
label talk.politics.guns: count 55
label talk.politics.mideast: count 72
label talk.politics.misc: count 65


cluster n.1
label rec.sport.hockey: count 1
label sci.med: count 2
label talk.politics.guns: count 145
label talk.politics.mideast: count 8
label talk.politics.misc: count 132


cluster n.2
label comp.os.ms-windows.misc: count 105
label comp.sys.ibm.pc.hardware: count 127
label comp.sys.mac.hardware: count 80
label rec.autos: count 2
label rec.sport.baseball: count 2
label rec.sport.hockey: count 5
label sci.med: count 3


cluster n.3
label comp.sys.ibm.pc.hardware: count 1
label rec.sport.baseball: count 119
label rec.sport.hockey: count 162


cluster n.4
label talk.politics.mideast: count 120
lab

- Cluster 0: related to computers (mainly hardware), medicine, auto and motorcycles
- Cluster 1: related to politics (guns and miscellaneous)
- Cluster 2: related to computers
- Cluster 3: related to baseball and hockey
- Cluster 4: it's related to the mideast politics

We can try to add other clusters to see if we can better group topics in cluster 1

In [515]:
# k-means 6 clusters

kmeans_pipe_6 = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()),#vectorization of features by using TF-IDF
                          ('cluster', KMeans(n_clusters=6))])

In [516]:
kmeans_6 = kmeans_pipe_6.fit_predict(corpus_df_train[['text']])

In [517]:
corpus_df_train_kmeans['kmeans_6'] = kmeans_6

In [518]:
cluster_topicounts(6, 'kmeans_6')

cluster n.0
label comp.os.ms-windows.misc: count 122
label comp.sys.ibm.pc.hardware: count 91
label comp.sys.mac.hardware: count 20
label rec.autos: count 2


cluster n.1
label comp.os.ms-windows.misc: count 77
label comp.sys.ibm.pc.hardware: count 98
label comp.sys.mac.hardware: count 180
label rec.autos: count 36
label rec.motorcycles: count 28
label rec.sport.baseball: count 61
label rec.sport.hockey: count 17
label sci.med: count 197
label talk.politics.guns: count 43
label talk.politics.mideast: count 90
label talk.politics.misc: count 80


cluster n.2
label comp.os.ms-windows.misc: count 1
label comp.sys.ibm.pc.hardware: count 11
label rec.autos: count 154
label rec.motorcycles: count 172
label rec.sport.baseball: count 20
label rec.sport.hockey: count 13
label sci.med: count 3
label talk.politics.guns: count 8
label talk.politics.mideast: count 4
label talk.politics.misc: count 1


cluster n.3
label talk.politics.mideast: count 96
label talk.politics.misc: count 44


cluster n.4

- Cluster 0: computers
- Cluster 1: computers, medicine and politics
- Cluster 2: auto and motorcycles
- Cluster 3: mideastern politics
- Cluster 4: baseball and hockey
- Cluster 5: politics, in particular guns

We still obtain a good clustering. The kmeans cannot distinguish well between computers, medicine and politics. However, the other topics are quite well separated. By reducing or increasing the number of topics we don't seem to obtain a better result

#### HIERARCHICAL CLUSTERING
Now we try with the hierarchical clustering. We use the average and complete linkage since the single linkage usually doesn't provide good results. We use 5 clusters <br>
The cosine distance is the metrics used to calculate the distance since it works well with documents

In [519]:
# HIERARCHICAL CLUSTERING average linkage

from sklearn.cluster import AgglomerativeClustering
from mlxtend.preprocessing import DenseTransformer


aggl_avg_5 = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()),# vectorization of features by using TF-IDF
                          ('to_dense', DenseTransformer()), #to not have a sparse matrix since agglomerative clustering 
                           #cannot work with them
                          ('hier_cluster',  AgglomerativeClustering(n_clusters=5, affinity='cosine', linkage='average'))])

In [520]:
model = aggl_avg_5.fit_predict(corpus_df_train[['text']], corpus_df_train['labels'])

In [521]:
corpus_df_train_kmeans['aggl_avg_5'] = model

In [522]:
cluster_topicounts(5, 'aggl_avg_5')

cluster n.0
label comp.os.ms-windows.misc: count 198
label comp.sys.ibm.pc.hardware: count 199
label comp.sys.mac.hardware: count 200
label rec.autos: count 200
label rec.motorcycles: count 200
label rec.sport.baseball: count 197
label rec.sport.hockey: count 200
label sci.med: count 199
label talk.politics.guns: count 200
label talk.politics.mideast: count 198
label talk.politics.misc: count 199


cluster n.1
label sci.med: count 1
label talk.politics.mideast: count 2
label talk.politics.misc: count 1


cluster n.2
label comp.os.ms-windows.misc: count 1


cluster n.3
label comp.sys.ibm.pc.hardware: count 1


cluster n.4
label comp.os.ms-windows.misc: count 1
label rec.sport.baseball: count 3




The clustering is quite bad, since almost all the documents are in cluster 0 and there are few documents in the other clusters

In [523]:
# HIERARCHICAL CLUSTERING complete linkage

from sklearn.cluster import AgglomerativeClustering


aggl_compl_5 = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()), #vectorization of features by using TF-IDF
                          ('to_dense', DenseTransformer()), #to not have a sparse matrix since agglomerative clustering cannot
                            #work with them
                          ('hier_cluster',  AgglomerativeClustering(n_clusters=5, affinity='cosine', linkage='complete'))])

In [524]:
model = aggl_compl_5.fit_predict(corpus_df_train[['text']], corpus_df_train['labels'])

In [525]:
corpus_df_train_kmeans['aggl_compl_5'] = model

In [526]:
cluster_topicounts(5, 'aggl_compl_5')

cluster n.0
label comp.os.ms-windows.misc: count 84
label comp.sys.ibm.pc.hardware: count 88
label comp.sys.mac.hardware: count 104
label rec.autos: count 99
label rec.motorcycles: count 110
label rec.sport.baseball: count 68
label rec.sport.hockey: count 60
label sci.med: count 64
label talk.politics.guns: count 32
label talk.politics.mideast: count 29
label talk.politics.misc: count 31


cluster n.1
label comp.os.ms-windows.misc: count 52
label comp.sys.ibm.pc.hardware: count 67
label comp.sys.mac.hardware: count 45
label rec.autos: count 83
label rec.motorcycles: count 80
label rec.sport.baseball: count 54
label rec.sport.hockey: count 37
label sci.med: count 94
label talk.politics.guns: count 143
label talk.politics.mideast: count 161
label talk.politics.misc: count 147


cluster n.2
label comp.os.ms-windows.misc: count 11
label comp.sys.ibm.pc.hardware: count 27
label comp.sys.mac.hardware: count 30
label rec.autos: count 7
label rec.motorcycles: count 10
label rec.sport.baseball:

- cluster 0: Computers, auto and motorcycles
- cluster 1: Politics and medicine
- Cluster 2: not very clear, seems to be related with hockey
- Cluster 3: not very clear, it seems computers
- Cluster 4: not very clear, it seems to be related with baseball and hockey


We obtain a better result than before but still not very satisfying 

#### PCA + KMEANS
Another alternative is that of reducing the dimensionality of the feature space before performing the cluster by using PCA. In this way the terms are summarised

In [531]:
# pca + kmeans 5

from sklearn.decomposition import PCA

pca_kmeans_5 = Pipeline([('pre_process', preproccess_pipe), #remove stop-words, punctuation,
                              #normalise through lemmatization
                          ('tfidf', TfidfVectorizer()),#vectorization of features by using TF-IDF
                               #('transpose1', Transpose()), #shape = 60000 if only this
                          ('to_dense', DenseTransformer()), #to not have a sparse matrix since PCA cannot
                            #work with them
                          ('pca', PCA(n_components = 500)), # we try to use 500 features
                          ('cluster', KMeans(n_clusters=5))])

In [532]:
model = pca_kmeans_5.fit_predict(corpus_df_train[['text']], corpus_df_train['labels'])

In [533]:
model.shape

(2200,)

In [534]:
corpus_df_train_kmeans['pca_kmeans_5'] = model

In [535]:
cluster_topicounts(5, 'pca_kmeans_5')

cluster n.0
label talk.politics.guns: count 73
label talk.politics.misc: count 56


cluster n.1
label rec.autos: count 1
label sci.med: count 1
label talk.politics.guns: count 28
label talk.politics.misc: count 1


cluster n.2
label comp.os.ms-windows.misc: count 28
label comp.sys.ibm.pc.hardware: count 65
label comp.sys.mac.hardware: count 22


cluster n.3
label comp.os.ms-windows.misc: count 172
label comp.sys.ibm.pc.hardware: count 135
label comp.sys.mac.hardware: count 178
label rec.autos: count 199
label rec.motorcycles: count 184
label rec.sport.baseball: count 200
label rec.sport.hockey: count 200
label sci.med: count 199
label talk.politics.guns: count 91
label talk.politics.mideast: count 200
label talk.politics.misc: count 143


cluster n.4
label rec.motorcycles: count 16
label talk.politics.guns: count 8




The result is not satisfying. The best clustering was performed by the 5-means algorithm

## TASK 2
### TOPIC MODELLING
It is used in order to retreive keywords from a corpus of documents which help us in understanding its main topic. We proceed in a similar way as with clustering, we retrieve the keywords and the associated topics, then retrieve the count of the original labels of the documents. <br>
The model used is the latent dirichlet allocation, which returns, given a document, its probability of belonging to each class

In [477]:
# create a function to transform the input into the gensim dictionary format, which will be used by the latent dirichlet allocation model

class To_Dictionary( BaseEstimator, TransformerMixin ):

    def __init__( self):
        self = self
       
    
    def fit( self, X, y = None ):
        return self 
    
    
    # tokenize the test to transform it into the gensim dictionary format
    def tokenize( self, X ):
        
        corpus = []
        for document in X:
            d = []
            doc = nltk.word_tokenize(document)
            for tok in doc:
                if len(tok) > 4:
                    d.append(tok)
            corpus.append(d)
        
        return corpus
    
    # return the gensim dictionary
    def transform( self, X, y = None ):
        return corpora.Dictionary(self.tokenize(X))

In [478]:
from gensim import corpora

# first we create the dictionary which will be used as a mapping in the LdaModel()
dictionary_pipe = Pipeline([('pre_process', Preprocess()),
                         ('to_dict', To_Dictionary())])

In [479]:
dictionary_pipe = dictionary_pipe.transform(corpus_df_train[['text']])
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7fe5c880b650>

In [480]:
# apply bag of words, which is the required input for LDA

class To_bow( BaseEstimator, TransformerMixin ):

    def __init__( self, dictionary):
        self.dictionary = dictionary
       
    
    def fit( self, X, y = None ):
        return self 
    
    def tokenize( self, X ):
        
        corpus = []
        for document in X:
            d = []
            doc = nltk.word_tokenize(document) # tokenize the document
            for tok in doc:
                if len(tok) > 4: # perform a further selection by eliminating words with less than 5 words
                    d.append(tok)
            corpus.append(d)
        
        return corpus
            
    
    # transform the input text into a suitable format for LdaModel
    def transform( self, X, y = None ):

        return [self.dictionary.doc2bow(text) for text in self.tokenize(X)]

In [481]:
pipe = Pipeline([('pre_process', Preprocess()),
                ('doc2bow',  To_bow(dictionary=dictionary))])

corpus = pipe.transform(corpus_df_train[['text']])

In [482]:
print(corpus[:2])

[[(10, 1), (16, 1), (17, 1), (19, 2), (22, 1), (57, 4), (73, 1), (94, 1), (106, 1), (136, 2), (167, 2), (182, 1), (190, 1), (204, 2), (205, 2), (211, 1), (307, 1), (332, 1), (333, 1), (426, 1), (438, 1), (503, 1), (624, 1), (650, 1), (785, 2), (801, 2), (853, 2), (1024, 1), (1061, 1), (1205, 1), (1227, 1), (1270, 2), (1364, 1), (1568, 1), (1999, 1), (2053, 2), (2607, 1), (2698, 1), (2701, 1), (3176, 1), (4574, 1), (4929, 1), (4930, 1), (4931, 2), (4939, 1), (4945, 1), (4949, 2), (5282, 1), (5319, 1), (12199, 1), (12249, 1), (24811, 1)], [(2, 1), (10, 1), (16, 1), (25, 1), (28, 1), (66, 2), (78, 1), (127, 2), (155, 1), (171, 1), (226, 1), (245, 4), (254, 1), (297, 4), (367, 2), (382, 2), (414, 1), (417, 3), (463, 2), (481, 1), (537, 1), (598, 2), (629, 1), (635, 2), (638, 3), (639, 1), (662, 1), (862, 1), (877, 1), (899, 1), (1172, 1), (1193, 1), (1195, 3), (1273, 1), (1420, 1), (1493, 1), (1497, 1), (1505, 2), (1523, 2), (1532, 1), (1533, 3), (1545, 2), (1696, 2), (1769, 3), (1952, 1),

As with clustering we try 5 topics

In [538]:
# train the latent dirichlet allocation model with 5 topics

from gensim.models.ldamodel import LdaModel
NUM_TOPICS = 5

LDA = LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)

In [484]:
# for each topic print the 10 most relevant words and their importance for the topic

topics = LDA.print_topics(num_words=10)

for topic in topics:
    print(topic)

(0, '0.011*"subject" + 0.011*"organization" + 0.010*"write" + 0.008*"would" + 0.008*"drive" + 0.006*"article" + 0.004*"think" + 0.004*"university" + 0.004*"problem" + 0.003*"player"')
(1, '0.016*"subject" + 0.015*"organization" + 0.009*"write" + 0.009*"university" + 0.007*"article" + 0.005*"window" + 0.005*"system" + 0.005*"would" + 0.004*"problem" + 0.004*"distribution"')
(2, '0.012*"organization" + 0.011*"subject" + 0.010*"write" + 0.007*"bullet" + 0.007*"would" + 0.006*"article" + 0.005*"wound" + 0.005*"evidence" + 0.005*"state" + 0.005*"university"')
(3, '0.009*"article" + 0.009*"write" + 0.007*"subject" + 0.006*"grenade" + 0.006*"organization" + 0.006*"irvine" + 0.005*"agent" + 0.005*"press" + 0.005*"would" + 0.005*"koresh"')
(4, '0.018*"people" + 0.013*"write" + 0.013*"would" + 0.011*"article" + 0.009*"subject" + 0.008*"organization" + 0.008*"believe" + 0.008*"think" + 0.008*"right" + 0.007*"clinton"')


In [485]:
# create a dictionary to store for each document the probability of belonging to a
#certain topic. Each document is associated with a sub-dictionary

def topic_dictionary(model, corpus):
    get_topics = model.get_document_topics(corpus) 
    
     # create the dictionary storing documents
    doc2topic_dict = {}
    for (doc, num_doc) in zip(get_topics, range(0,len(get_topics))):
    
        # create the dictionary storing for each document its distribution of topics
        topic2prob = {}
        # returns a list of tuples associating the topics with the conditional probability 
        #of the document to belong to each of them.
        
        # build the sub-dictionary for each document containing as a key the number
        #of topic and as a vaue the probability of the document to belong to that topic
        for topic in doc: 
            topic2prob[topic[0]] = topic[1]
    
        # assign to each document a unique integer key, then assign the sub-dictionary
        #created before
        doc2topic_dict[num_doc] = topic2prob
    
    return doc2topic_dict

In [537]:
doc2topic = topic_dictionary(LDA, corpus)

In [487]:
doc2topic[14]

{0: 0.10232485, 1: 0.52694935, 3: 0.2492066, 4: 0.11909425}

As with clustering we retrieve for each topic the count of the documents with the same label

In [488]:
def get_topic_label(topics_dict, labels_array, return_counts=False, n_topics=None):

    topics = []
    # return the topic associated with the maximum conditioned probability
    for i in range(0,len(labels_array)):
        extr_dict = topics_dict[i]
        topics.append((max(extr_dict.items(), key=operator.itemgetter(1))[0]))
    
    # create a dataframe containing the predicted topic and the original label of each document
    labels_array = labels_array.map(inverse_map_dict)
    df = pd.DataFrame(labels_array)
    df['topics'] = topics

    # return the number of documents with the same label for each topic 
    if return_counts:
        for i in range(0,n_topics):
            x = df[df['topics'] == i]['labels']
            print('topic n.%d' % i)
        
            y = np.unique(x,return_counts=True)
            for n,j in zip(y[0],y[1]):
                print('label %s:'%n,
                      'count %d'%j)
    
            print('\n')
        
        return df

    else:
        return df

In [495]:
df = get_topic_label(doc2topic, corpus_df_train['labels'], return_counts=True, n_topics=5 )

topic n.0
label comp.os.ms-windows.misc: count 59
label comp.sys.ibm.pc.hardware: count 75
label comp.sys.mac.hardware: count 50
label rec.autos: count 114
label rec.motorcycles: count 139
label rec.sport.baseball: count 137
label rec.sport.hockey: count 181
label sci.med: count 15
label talk.politics.guns: count 2
label talk.politics.mideast: count 23
label talk.politics.misc: count 18


topic n.1
label comp.os.ms-windows.misc: count 125
label comp.sys.ibm.pc.hardware: count 113
label comp.sys.mac.hardware: count 134
label rec.autos: count 41
label rec.motorcycles: count 19
label rec.sport.baseball: count 23
label rec.sport.hockey: count 6
label sci.med: count 99
label talk.politics.guns: count 21
label talk.politics.mideast: count 58
label talk.politics.misc: count 64


topic n.2
label comp.os.ms-windows.misc: count 12
label comp.sys.ibm.pc.hardware: count 7
label comp.sys.mac.hardware: count 5
label rec.autos: count 23
label rec.motorcycles: count 17
label rec.sport.baseball: count 

- Topic 0: mainly related to baseball, hockey, auto and motorcycles, so sport
- Topic 1: computers and medicine
- Topic 2: not very clear, seems also to be related with medicine
- Topic 3: mideastern politics 
- Topic 4: politics 

Topic 2 doesn't well separate documents. In particular medicine is not clearly categorised in a specific topic

In [496]:
df[:10]

Unnamed: 0,labels,topics
0,talk.politics.mideast,1
1,talk.politics.mideast,3
2,talk.politics.mideast,3
3,talk.politics.mideast,3
4,talk.politics.mideast,1
5,talk.politics.mideast,1
6,talk.politics.mideast,1
7,talk.politics.mideast,1
8,talk.politics.mideast,4
9,talk.politics.mideast,3


In [501]:
# LDA model with 4 topics

from gensim.models.ldamodel import LdaModel
NUM_TOPICS = 4

LDA = LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)

doc2topic = topic_dictionary(LDA, corpus)

df = get_topic_label(doc2topic, corpus_df_train['labels'], return_counts=True, n_topics=4 )

topic n.0
label comp.os.ms-windows.misc: count 3
label comp.sys.ibm.pc.hardware: count 1
label comp.sys.mac.hardware: count 2
label rec.autos: count 5
label rec.motorcycles: count 4
label rec.sport.baseball: count 17
label rec.sport.hockey: count 23
label sci.med: count 56
label talk.politics.guns: count 7
label talk.politics.mideast: count 81
label talk.politics.misc: count 53


topic n.1
label comp.os.ms-windows.misc: count 1
label comp.sys.ibm.pc.hardware: count 4
label comp.sys.mac.hardware: count 1
label rec.autos: count 2
label rec.motorcycles: count 7
label rec.sport.baseball: count 15
label rec.sport.hockey: count 12
label sci.med: count 1
label talk.politics.guns: count 27
label talk.politics.mideast: count 51
label talk.politics.misc: count 6


topic n.2
label comp.os.ms-windows.misc: count 196
label comp.sys.ibm.pc.hardware: count 195
label comp.sys.mac.hardware: count 196
label rec.autos: count 192
label rec.motorcycles: count 184
label rec.sport.baseball: count 165
label r

There seems to be a worse separation with respect to the 5 topics case. The main bulk of documents have been classified in topic 2. By reducing or increasing the number of topics the result doesn't seem to improve. <br>
We can also obtain a visual representation of the various topics and their keywords, along with their distance in the space

In [539]:
import pyLDAvis.gensim #it's a library to visualise topics

lda_display = pyLDAvis.gensim.prepare(LDA, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

## TASK 3
### DOCUMENT SUMMARIZATION
Document summarization can be used either to summarize the documents by retrieving the most relevant sentences or by retrieving the most important keywords

**GENSIM LIBRARY**: summarize the document by selecting the most important sentences

In [540]:
from gensim.summarization.summarizer import summarize

In [572]:
# clean the text. in this case we forcibly remove all the punctuation marks which are not removed with the previous text cleaning, 
#except for useful punctuation marks which are useful for text readibility. 

def clean_text(sentence_n):
    text = corpus_df_train['text'][sentence_n]
    
    list_punct = list(string.punctuation)
    for i in [',','.','(',')','!','?']:
        list_punct.remove(i)
        
    
    for punct in list_punct:
        text = text.replace(punct, '')
    
    return text

In [573]:
# EXAMPLE 1: the speaker talks about what it means to be Jewish and its main opinion is that it is not merely a fact of being a 
#believer, but of having a specific cultural identity, in fact he is an atheist. That's why he belives Jewish should have the right
#of constituting their own state
text = clean_text(sentence_n=10)
text

'From aapwam.umd.edu (Alberto Adolfo Pinkas) Subject Re Israel An Apartheid state. Organization University of Maryland, College Park Lines 56 NNTPPostingHost rac2.wam.umd.edu  In article 1993May12.013527.21904das.harvard.edu adamendor.uucp (Adam Shostack) writes   Which was my point. By converting to another religion I do not loose my cultural identity, I just loose my religious identification.   I disagree.  By converting to another religion, you certainly do change your cultural identity, and lose that part of you which was Jewish.   I would change one of the many parts that define my cultural identity. If I loose a leg, it might change my personality, but I do not stop being a human being.  Even more, when someone gets a baboon heart, that person is still human.   To be a part or not of the Jeish Nation is defined by my culture and not by my religion. Actually, if I am an atheist, which is in fact like  converting into a nonJewish in terms of religion, I am still considered as part 

In [574]:
print(summarize(text, word_count=100))

By converting to another religion, you certainly do change your cultural identity, and lose that part of you which was Jewish.
Actually, if I am an atheist, which is in fact like  converting into a nonJewish in terms of religion, I am still considered as part of the Jewish Nation.
For me, religion is just another piece in what constitutes the cultural identity of the Jewish people.
I believe that as a people with a  cultural identity they constitute a Nation and have the same right as any other people in the world to have their own State.


From this example it can be seen that the summarizer has chosen sentences clearly indicating opinions of the speaker, and where the pronoun 'I' is often used. The method has taken full sentences from the text. Let's try with another example

In [677]:
# EXAMPLE 2: the speaker is talking about hockey matches and the main message is that she doesn't agree with people saying that Pens 
#matches have become boring due to the fact that they seem to alway win, but she's still excited to watch and cheer them
text = clean_text(sentence_n=1000)
text

'From Anna Matyas am2xandrew.cmu.edu Subject Re Pens fans reactions Organization HSS Deans Office, Carnegie Mellon, Pittsburgh, PA Lines 46  9835blue.cis.pitt.edu  8fqBHaG00WBMIrPhqandrew.cmu.edu NNTPPostingHost po2.andrew.cmu.edu InReplyTo 8fqBHaG00WBMIrPhqandrew.cmu.edu   Terence Rokop writes  Richard J Coyle writes   Thats not inner calm.  Its boredom, and its being spoiled.  The Arenas been as quiet as a church on many nights this year too many of us just take winning for granted.  Its been seemingly forever since the team lost, and weve forgotten what its like to feel real excitement and surprise at victory.  I dont really agree with this.  But it is an entirely different high, at any rate.  The first Cup the Pens won, I didnt think about anything else I just watched Mario and all skate the thing around the ice.  Now it seems to be more of a question whether or not, thirty years from now, young hockey fans (may there be millions!) will still ask us what it was like to watch this t

In [678]:
print(summarize(text, word_count=100))

From Anna Matyas am2xandrew.cmu.edu Subject Re Pens fans reactions Organization HSS Deans Office, Carnegie Mellon, Pittsburgh, PA Lines 46  9835blue.cis.pitt.edu  8fqBHaG00WBMIrPhqandrew.cmu.edu NNTPPostingHost po2.andrew.cmu.edu InReplyTo 8fqBHaG00WBMIrPhqandrew.cmu.edu   Terence Rokop writes  Richard J Coyle writes   Thats not inner calm.
Its been seemingly forever since the team lost, and weve forgotten what its like to feel real excitement and surprise at victory.
The first Cup the Pens won, I didnt think about anything else I just watched Mario and all skate the thing around the ice.
But Im every bit as excited this year and I am experiencing that inner calm to which Susan originally referred.
Inner calm is not boredom.


In this case the summarizer selected the heading of the mail, which is not that informative. However also in this case it seems that the algorithm selected focusedpersonal opinions of the speaker. They probably are regarded as more informative

**TEXTRANK ALGORITHM**: we use the textrank algorithm to return the most important keywords first and then to return the most important sentences

In [595]:
from collections import OrderedDict
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

In [596]:
class TextRank4Keyword():
    """Extract keywords from text"""
    
    def __init__(self):
        self.d = 0.85 # damping coefficient, usually is .85
        self.min_diff = 1e-5 # convergence threshold
        self.steps = 10 # iteration steps
        self.node_weight = None # save keywords and its weight

    
    def set_stopwords(self, stopwords):  
        """Set stop words"""
        for word in STOP_WORDS.union(set(stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True
    
    def sentence_segment(self, doc, candidate_pos, lower):
        """Store those words only in cadidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Store words only with cadidate POS tag
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences
        
    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab
    
    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i+1, i+window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs
        
    def symmetrize(self, a):
        return a + a.T - np.diag(a.diagonal())
    
    def get_matrix(self, vocab, token_pairs):
        """Get normalized matrix"""
        # Build matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1
            
        # Get Symmeric matrix
        g = self.symmetrize(g)
        
        # Normalize matrix by column
        norm = np.sum(g, axis=0)
        g_norm = np.divide(g, norm, where=norm!=0) # this is ignore the 0 element in norm
        
        return g_norm

    
    def get_keywords(self, number=10):
        """Print top number keywords"""
        node_weight = OrderedDict(sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        for i, (key, value) in enumerate(node_weight.items()):
            print(key + ' - ' + str(value))
            if i > number:
                break
        
        
    def analyze(self, text, 
                candidate_pos=['NOUN', 'PROPN'], 
                window_size=4, lower=False, stopwords=list()):
        """Main function to analyze text"""
        
        # Set stop words
        self.set_stopwords(stopwords)
        
        # Parse text by spaCy
        doc = nlp(text)
        
        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
        
        # Build vocabulary
        vocab = self.get_vocab(sentences)
        
        # Get token_pairs from windows
        token_pairs = self.get_token_pairs(window_size, sentences)
        
        # Get normalized matrix
        g = self.get_matrix(vocab, token_pairs)
        
        # Initionlization for weight(pagerank value)
        pr = np.array([1] * len(vocab))
        
        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1-self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr))  < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Get weight for each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]
        
        self.node_weight = node_weight

In [681]:
# EXAMPLE 1:

text = corpus_df_train['text'][10]

tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos = ['NOUN', 'PROPN'], window_size=4, lower=False)
tr4w.get_keywords(10)

idea - 2.6712676788435012
Nation - 2.23762938485165
right - 2.0829765382571233
people - 1.892057659292521
identity - 1.5668412384011923
religion - 1.526976492270181
Israel - 1.4141027019410557
god - 1.4113879558016182
meaning - 1.2479166666666663
Park - 1.2137986111111108
Lines - 1.2137986111111108
College - 1.2067152777777777


The algorithm returns the most important words for summarising the document. They are words specific to the documents, which talks about the personal opinion of the speaker about what it means to be Jewish, such as 'identity', 'religion', 'israel', 'god' etc, 

In [695]:
# EXAMPLE 2: 
    
text = corpus_df_train['text'][1000]

tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos = ['NOUN', 'PROPN'], window_size=4, lower=False)
tr4w.get_keywords(10)

fans - 2.977215104166666
game - 1.714572569444444
time - 1.4212635416666668
Pens - 1.3589833333333332
surprise - 1.3191927083333335
Mellon - 1.3187027777777778
express - 1.3124635416666668
year - 1.2479166666666668
Richard - 1.2479166666666663
man - 1.1442934027777778
Carnegie - 1.1386916666666667
Pittsburgh - 1.1386916666666667


The algorithm captures the keywords which help us in understanding what is the text about, which is sport. However it doesn't mention anything specific about hockey

We now try to return the 

In [682]:
import re

import numpy as np
from nltk import sent_tokenize, word_tokenize

from nltk.cluster.util import cosine_distance

In [683]:
def normalize_whitespace(text):
    """
    Translates multiple whitespace into single space character.
    If there is at least one new line character chunk is replaced
    by single LF (Unix new line) character.
    """
    return re.sub(r"\s+", " ", text)


def is_blank(string):
    """
    Returns `True` if string contains only white-space characters
    or is empty. Otherwise `False` is returned.
    """
    return not string or string.isspace()


def get_symmetric_matrix(matrix):
    """
    Get Symmetric matrix
    :param matrix:
    :return: matrix
    """
    return matrix + matrix.T - np.diag(matrix.diagonal())


def core_cosine_similarity(vector1, vector2):
    """
    measure cosine similarity between two vectors
    :param vector1:
    :param vector2:
    :return: 0 < cosine similarity value < 1
    """
    return 1 - cosine_distance(vector1, vector2)


'''
Note: This is not a summarization algorithm. 
This Algorithm pics top sentences irrespective of the order they appeared.
'''


class TextRank4Sentences():
    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually is .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # iteration steps
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

    def _sentence_similarity(self, sent1, sent2, stopwords=None):
        if stopwords is None:
            stopwords = []

        sent1 = [w.lower() for w in sent1]
        sent2 = [w.lower() for w in sent2]

        all_words = list(set(sent1 + sent2))

        vector1 = [0] * len(all_words)
        vector2 = [0] * len(all_words)

        # build the vector for the first sentence
        for w in sent1:
            if w in stopwords:
                continue
            vector1[all_words.index(w)] += 1

        # build the vector for the second sentence
        for w in sent2:
            if w in stopwords:
                continue
            vector2[all_words.index(w)] += 1

        return core_cosine_similarity(vector1, vector2)

    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])

        for idx1 in range(len(sentences)):
            for idx2 in range(len(sentences)):
                if idx1 == idx2:
                    continue

                sm[idx1][idx2] = self._sentence_similarity(sentences[idx1], sentences[idx2], stopwords=stopwords)

        # Get Symmeric matrix
        sm = get_symmetric_matrix(sm)

        # Normalize matrix by column
        norm = np.sum(sm, axis=0)
        sm_norm = np.divide(sm, norm, where=norm != 0)  # this is ignore the 0 element in norm

        return sm_norm

    def _run_page_rank(self, similarity_matrix):

        pr_vector = np.array([1] * len(similarity_matrix))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix, pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector

    def _get_sentence(self, index):

        try:
            return self.sentences[index]
        except IndexError:
            return ""

    def get_top_sentences(self, number=5):

        top_sentences = []

        if self.pr_vector is not None:

            sorted_pr = np.argsort(self.pr_vector)
            sorted_pr = list(sorted_pr)
            sorted_pr.reverse()

            index = 0
            for epoch in range(number):
                sent = self.sentences[sorted_pr[index]]
                sent = normalize_whitespace(sent)
                top_sentences.append(sent)
                index += 1

        return top_sentences

    def analyze(self, text, stop_words=None):
        self.text_str = text
        self.sentences = sent_tokenize(self.text_str)

        tokenized_sentences = [word_tokenize(sent) for sent in self.sentences]

        similarity_matrix = self._build_similarity_matrix(tokenized_sentences, stop_words)

        self.pr_vector = self._run_page_rank(similarity_matrix)

In [690]:
# EXAMPLE 1: 

text = clean_text(sentence_n=10)

tr4sh = TextRank4Sentences()
tr4sh.analyze(text)
print(tr4sh.get_top_sentences(1))

['Actually, if I am an atheist, which is in fact like converting into a nonJewish in terms of religion, I am still considered as part of the Jewish Nation.']


The most importance sentence extracted from document 10 summarize quite well the main opinion of the speaker which can be inferred by reading the whole document. <br>
Let's try with the second document

In [691]:
# EXAMPLE 2:
text = clean_text(sentence_n=1000)

tr4sh = TextRank4Sentences()
tr4sh.analyze(text)
print(tr4sh.get_top_sentences(1))

['I doubt that will happen but its possible.']


A sentence alone is not very informative, we try with more

In [694]:
text = clean_text(sentence_n=1000)

tr4sh = TextRank4Sentences()
tr4sh.analyze(text)
print(tr4sh.get_top_sentences(5))

['I doubt that will happen but its possible.', 'Not in the least.', 'Mom.', 'Uhuh.', 'Its boredom, and its being spoiled.']


Even by using 5 sentences the algorithm is not able to clearly catch the main opinion of the writer in this case. Instead, it selects non-informative sentences which seems to contain strong opinions, but in fact don't convey a specific message. The previous method worked better for this document