# Keyword-Extractor EN-ES

This notebook extracts keywords from plain-text (.txt) files in two languages, English and Spanish, because the descriptive texts of the artists are written in one or the other. The extracted keywords are needed to compare the MetMuseum keyword list with keywords from artists outside the collection.

It is better to keep each language as separate data, partly because the stop-word lists are language-specific, but mainly because the tokenizers may perform badly on mixed-language text (the resulting keyword lists can be concatenated or expanded afterwards, which might take a few extra lines).

The output is either a pickle or a txt file. The count-based extraction produces a nested list that includes each word's count (which might be annoying when importing it into the keyword comparison file), while the TF-IDF extraction produces a plain list of words. When applying TF-IDF I effectively leave no word out of the document, to avoid an error that seems to come from how short the analysed files are. This largely defeats the purpose of TF-IDF, which is to down-weight overly frequent terms, but the results are still fine given how short the files are.

In [1]:
import glob
import os
import re
import codecs
from nltk.corpus import stopwords
import operator
from tabulate import tabulate
from stop_words import get_stop_words
import nltk

In [36]:
#raw string, so that backslashes such as \t are not read as escape sequences
book_filenames = sorted(glob.glob(r"D:\Path\to\ArtistTxtFiles\English\*.txt"))
print("Found books:")
book_filenames

Found books:


['D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\catalinaortiz1065.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\feliperomero30.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\juan_covelli.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\juancortes79.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\oldnewflesh.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\silvanuchis.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\susanaordonezartista.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\vanessanietoromero.txt',
 'D:\\ML\\ColombiArtistas\\Textos\\Todos\\English\\vivianabtroya.txt']

In [23]:
#Get the file name without path and extension. The slice indices depend on the length of the
#directory path, so make sure they match the ones used in the for loop below.
book_filenames[1][43:-4]

'feliperomero30'
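The hard-coded slice above only works while the directory path keeps the same length. A minimal sketch of an alternative that does not depend on it, assuming the Windows-style paths shown in the output; `artist_key` is a hypothetical helper name, not part of the notebook:

```python
from pathlib import PureWindowsPath


def artist_key(path):
    """Return the file name without directory or extension."""
    # PureWindowsPath parses backslash-separated paths on any operating system.
    return PureWindowsPath(path).stem


example = r"D:\ML\ColombiArtistas\Textos\Todos\English\feliperomero30.txt"
print(artist_key(example))  # feliperomero30
```

Only the last extension is removed, so a file name that contains a dot, such as `ojedaenel.lente.txt`, keeps the part before `.txt`.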

This cell builds a dictionary with the file name as key and the text of each file as value. There was an issue reading some characters from files produced by pytesseract OCR, but oddly not from files whose text I had simply copied and pasted. It is important to keep the 'utf-8' encoding because the texts contain Spanish characters (and also in case the notebook is later adapted to other languages). With 'charmap' or another encoding the files are still read, but the accented characters are replaced with meaningless garbage.

In [37]:
artist_dictionary = {}
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        artist_dictionary[book_filename[43:-4]] = book_file.read()
    #print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\catalinaortiz1065.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\feliperomero30.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\juan_covelli.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\juancortes79.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\oldnewflesh.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\silvanuchis.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\susanaordonezartista.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\vanessanietoromero.txt'...

Reading 'D:\ML\ColombiArtistas\Textos\Todos\English\vivianabtroya.txt'...
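The encoding problem described before the reading loop can be reproduced without any files. This is a small sketch, assuming the wrong codec was a Windows code page such as cp1252 (the notebook only says 'charmap', so that choice is a guess):

```python
text = "cosmovisión"

# Bytes as they would be stored in a UTF-8 encoded file.
raw_bytes = text.encode("utf-8")

good = raw_bytes.decode("utf-8")
garbled = raw_bytes.decode("cp1252")

print(good)     # cosmovisión
print(garbled)  # cosmovisiÃ³n
```

The two bytes that encode `ó` in UTF-8 are read as two separate characters by the single-byte codec, which is why every accented letter turns into a pair of unrelated symbols.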



In [7]:
#confirm the data was loaded properly
artist_dictionary['agustinalallana']

'\r\nOriente\r\nEn la cosmovisión de los pueblos originarios, la unión íntima entre el ser humano y el animal tiene una profunda relevancia. Son los animales del entorno inmediato los que sobrecogen al hombre, haciendo que este elabore mitos y creencias en torno a ellos. A través de los animales, el hombre precolombino intentó transmitir su modo de internalizar el funcionamiento del cosmos que lo rodeaba. Los cazadores-recolectores, por razones de sobrevivencia, sostuvieron un contacto más íntimo con la fauna de su entorno.\r\nEsta interrelación les permitió ahondar en las conductas y ciclos de distintas especies, lo que desembocaría en una comprensión especializada de los momentos adecuados para la caza, para de este modo asegurar el sustento del grupo. Desde esta compenetración necesaria y vital con la fauna surgieron las representaciones de diferentes especies en paredes rocosas, en las que se plasman escenas de cacería, aludiendo a la preponderancia que el mundo animal poseía para 

In [8]:
#confirm the keys and have them visible in case you need a cherry-picked example
artist_dictionary.keys()

dict_keys(['adalbertocalvogonzalez_artista', 'agustinalallana', 'alequint', 'alexandra_mccormick_artista', 'aliriocc', 'andres_moreno_hoffmannn', 'anibaldo081', 'aparissio', 'burningflags', 'camilabarretohoyos', 'camilacostalzate', 'carolina_diaz_g', 'catalinajaramilloquijano', 'colectivomonomero', 'crila_regina', 'dacunat', 'ejelejalea', 'estebanl_lopez_e', 'faaloon', 'federicopuyo', 'felipeuribemejia', 'felipezapataz4', 'fernandaluzavendano', 'gabrielhernandezserrato', 'guarnizo_david', 'haymuchasanas', 'jpuribem', 'juansebastianrosillo', 'klauslundi', 'layosandres', 'mauriciojaramilloartista', 'mayacorredorr', 'namejiam', 'natalia_castillo_rincon', 'nclsbarrera', 'ojedaenel.lente', 'rocio_pardo_e', 'sebastian_fonnegra', 'sergiogalvis.art', 'soniarojaspez', 'sophiaprietov', 'villamilyvillamil'])

In [14]:
#clean word with regex
def clean_word(word):
    cleaned_word = re.sub('[^A-Za-z]+', '', word)
    return cleaned_word

def getWordList(texts):
    word_list = []
    #go through each text string
    for content in texts:
        if content is None:
            continue
        #lowercase and split into an array
        words = content.lower().split()

        #for each word
        for word in words:
            #remove non-chars
            cleaned_word = clean_word(word)
            #if there is still something there
            if len(cleaned_word) > 0:
                #add it to our word list
                word_list.append(cleaned_word)

    return word_list

def createFrquencyTable(word_list):
    #word count
    word_count = {}
    for word in word_list:
        #index is the word
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    return word_count

#remove stop words
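The counting done by `createFrquencyTable` can also be written with the standard library. A minimal sketch with a made-up word list, not a replacement the notebook depends on:

```python
from collections import Counter

# Made-up sample; in the notebook this would be the cleaned word list.
words = ["selva", "ser", "selva", "mundo", "ser", "selva"]

word_count = Counter(words)              # behaves like a dict of word -> count
top_words = word_count.most_common(2)    # already sorted by descending count

print(top_words)  # [('selva', 3), ('ser', 2)]
```

Because `Counter` is a dictionary subclass, `word_count.items()` can be passed to the same sorting and stop-word steps used later on.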


## Spanish

In [25]:
def remove_stop_words(frequency_list):
    stop_words = get_stop_words('spanish')

    temp_list = []
    for key,value in frequency_list:
        if key not in stop_words:
            temp_list.append([key, value])

    return temp_list

In [26]:
stop_words = get_stop_words('spanish')

## English

In [62]:
def remove_stop_words(frequency_list):
    stop_words = get_stop_words('en')

    temp_list = []
    for key,value in frequency_list:
        if key not in stop_words:
            temp_list.append([key, value])

    return temp_list
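Both definitions of `remove_stop_words` share one name, so whichever cell ran last decides the language. A sketch of a variant that takes the stop-word list as an argument instead; `remove_stop_words_from` is a hypothetical name, and the tiny lists below stand in for what `get_stop_words('spanish')` or `get_stop_words('en')` would return:

```python
def remove_stop_words_from(frequency_list, stop_words):
    """Keep the [word, count] pairs whose word is not a stop word."""
    stop_set = set(stop_words)  # set membership tests are fast
    return [[word, count] for word, count in frequency_list if word not in stop_set]


# Tiny hand-made stop lists, for illustration only.
spanish_stops = ["de", "la", "que"]
english_stops = ["the", "of", "and"]

counts = [("de", 12), ("selva", 10), ("the", 4), ("fauna", 4)]
print(remove_stop_words_from(counts, spanish_stops))
# [['selva', 10], ['the', 4], ['fauna', 4]]
print(remove_stop_words_from(counts, english_stops))
# [['de', 12], ['selva', 10], ['fauna', 4]]
```

Passing the list explicitly makes it harder to filter Spanish text with the English list by accident.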

### Extract keywords for one dictionary key at a time

This is the code to extract keywords by COUNT. The keywords are extracted from the dictionary entry named on the first line of the next cell. The result is a nice little table.

In [47]:
toget_list = artist_dictionary['agustinalallana']


#Replace dots, commas and other punctuation with spaces, then lowercase and split
toget_list = re.sub(r'\W', ' ', toget_list).lower().split()
#Drops short typos or any word shorter than 3 characters
new_toget = [word for word in toget_list if len(word) > 2]
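The `\W` substitution used in this cell treats accented letters as word characters, unlike the ASCII-only pattern in `clean_word`. A short sketch with a made-up sentence shows the difference:

```python
import re

sample = "Unión íntima: cazadores-recolectores."

# \W matches anything that is not a letter, digit or underscore (Unicode-aware for str).
tokens = re.sub(r"\W", " ", sample).lower().split()
print(tokens)  # ['unión', 'íntima', 'cazadores', 'recolectores']

# The ASCII-only pattern deletes accented letters from inside the word.
ascii_only = re.sub("[^A-Za-z]+", "", "íntima")
print(ascii_only)  # ntima
```

For Spanish text the `\W` approach is the safer of the two, with the caveat that it also keeps digits and underscores.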

In [48]:
#use the list without the short words
page_word_list = new_toget
#create table of word counts, dictionary
page_word_count = createFrquencyTable(page_word_list)
#sort the table by the frequency count
sorted_word_frequency_list = sorted(page_word_count.items(), key=operator.itemgetter(1), reverse=True)
#remove the stop words of the language whose remove_stop_words cell was run last
sorted_word_frequency_list = remove_stop_words(sorted_word_frequency_list)

In [50]:
#sum the total words to calculate frequencies   
total_words_sum = 0
for key,value in sorted_word_frequency_list:
    total_words_sum = total_words_sum + value
    
#just get the top 20 words
if len(sorted_word_frequency_list) > 20:
    sorted_word_frequency_list = sorted_word_frequency_list[:20]

In [52]:
#sorted_word_frequency_list

In [41]:
#create our final list which contains words, frequency (word count), percentage
final_list = []

for key,value in sorted_word_frequency_list:
    percentage_value = float(value * 100) / total_words_sum
    final_list.append([key, value, round(percentage_value, 4)])

#headers before the table
print_headers = ['Word', 'Frequency', 'Frequency Percentage']

#print the table with tabulate
print(tabulate(final_list, headers=print_headers, tablefmt='orgtbl'))


| Word     |   Frequency |   Frequency Percentage |
|----------+-------------+------------------------|
| selva    |          10 |                 1.5625 |
| ser      |           8 |                 1.25   |
| hombre   |           6 |                 0.9375 |
| mundo    |           6 |                 0.9375 |
| animal   |           5 |                 0.7812 |
| animales |           5 |                 0.7812 |
| tiempo   |           5 |                 0.7812 |
| gallo    |           5 |                 0.7812 |
| fauna    |           4 |                 0.625  |
| especies |           4 |                 0.625  |
| indígena |           4 |                 0.625  |
| putumayo |           4 |                 0.625  |
| pueblos  |           3 |                 0.4688 |
| íntima   |           3 |                 0.4688 |
| través   |           3 |                 0.4688 |
| así      |           3 |                 0.4688 |
| frente   |           3 |                 0.4688 |
| bien     |

### Extract keywords for the whole directory

This is the code to extract keywords by COUNT from all the documents in the directory (run it once per language, and change the stop-word list above depending on the language of the content). The result is a dictionary whose values are nested lists with each word and its count.

In [None]:
#If the dictionary has already been created, do not run this cell again
artist_keywords_count = {}

In [63]:
for k in artist_dictionary:
    toget_list = artist_dictionary[k]
    #Replace dots, commas and other punctuation with spaces, then lowercase and split
    toget_list = re.sub(r'\W', ' ', toget_list).lower().split()
    #Drops short typos or any word shorter than 3 characters
    new_toget = [word for word in toget_list if len(word) > 2]
    page_word_list = new_toget
    #create table of word counts, dictionary
    page_word_count = createFrquencyTable(page_word_list)
    #sort the table by the frequency count
    sorted_word_frequency_list = sorted(page_word_count.items(), key=operator.itemgetter(1), reverse=True)
    #remove the stop words of the language whose remove_stop_words cell was run last
    sorted_word_frequency_list = remove_stop_words(sorted_word_frequency_list)
    #sum the total words to calculate frequencies
    total_words_sum = 0
    for key,value in sorted_word_frequency_list:
        total_words_sum = total_words_sum + value
    #just get the top 20 words
    if len(sorted_word_frequency_list) > 20:
        sorted_word_frequency_list = sorted_word_frequency_list[:20]
    #store the [word, count] pairs for this artist
    artist_keywords_count[k] = sorted_word_frequency_list

In [64]:
artist_keywords_count

{'adalbertocalvogonzalez_artista': [['serie', 6],
  ['asedio', 6],
  ['paisaje', 5],
  ['reducido', 5],
  ['papel', 3],
  ['paso', 2],
  ['lenguajes', 1],
  ['universos', 1],
  ['desdoblados', 1],
  ['arqueología', 1],
  ['graffiti', 1],
  ['futurismo', 1],
  ['surrealismo', 1],
  ['representaciones', 1],
  ['arqueológicas', 1],
  ['dibujo', 1],
  ['pintura', 1],
  ['alta', 1],
  ['velocidad', 1],
  ['construcción', 1]],
 'agustinalallana': [['selva', 10],
  ['ser', 8],
  ['hombre', 6],
  ['mundo', 6],
  ['animal', 5],
  ['animales', 5],
  ['tiempo', 5],
  ['gallo', 5],
  ['fauna', 4],
  ['especies', 4],
  ['indígena', 4],
  ['putumayo', 4],
  ['pueblos', 3],
  ['íntima', 3],
  ['través', 3],
  ['así', 3],
  ['frente', 3],
  ['bien', 3],
  ['serie', 3],
  ['región', 3]],
 'alequint': [['cuerpo', 5],
  ['haciendo', 3],
  ['demás', 3],
  ['pregunta', 2],
  ['tiempo', 2],
  ['después', 2],
  ['celebrar', 2],
  ['mismo', 2],
  ['forma', 2],
  ['balnearios', 2],
  ['sino', 2],
  ['mientras'

In [65]:
#write as utf-8 so the accented characters survive on any platform
with open("keywords_artistas_conteo.txt", "w", encoding="utf-8") as f:
    f.write(str(artist_keywords_count))

In [66]:
import pickle

f = open("keywords_artistas_conteo.pkl","wb")
pickle.dump(artist_keywords_count,f)
f.close()
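Since the nested `[word, count]` lists can be awkward to import elsewhere, one option is to store the dictionary as JSON and flatten it on load. This is only a sketch with a two-word sample; the file name `keywords_example.json` is made up for the illustration:

```python
import json
import os
import tempfile

# Small sample in the same shape as artist_keywords_count.
sample_counts = {"agustinalallana": [["selva", 10], ["ser", 8]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords_example.json")

    # ensure_ascii=False keeps accented characters readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_counts, f, ensure_ascii=False, indent=2)

    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Drop the counts to get a plain list of words per artist.
plain_words = {artist: [word for word, _count in pairs] for artist, pairs in loaded.items()}
print(plain_words)  # {'agustinalallana': ['selva', 'ser']}
```

Unlike the `str(...)` dump, a JSON file can be read back without evaluating Python code, and unlike the pickle it stays human-readable.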

In [55]:
sorted_word_frequency_list

[['sed', 1],
 ['montañas', 1],
 ['naturaleza', 1],
 ['popular', 1],
 ['ornamento', 1],
 ['illustración', 1],
 ['pintura', 1],
 ['tejido', 1],
 ['patrón', 1],
 ['color', 1]]

# Keywords with TF-IDF

This code did not work on my local machine at first, but now it does. It is also simpler, since Spanish is better supported than Chinese. Ranking keywords by frequency and by TF-IDF gives different results, but for texts this short the difference is not critical. This first section is independent of language. To process a different language, go back and load either the English or the Spanish texts.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [6]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        fname = feature_names[idx]
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results
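To make the scores printed further down easier to read, here is a plain-Python sketch of the weighting that, as far as I understand the scikit-learn documentation, `TfidfTransformer(smooth_idf=True)` applies: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of the document vector. `tfidf_scores` is a hypothetical helper and the sentences are made up:

```python
import math
from collections import Counter


def tfidf_scores(sentences, document):
    """Score the words of `document` with IDF weights learned from `sentences`."""
    n = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(sentence.split()))  # count each word once per sentence
    idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}

    term_freq = Counter(w for w in document.split() if w in idf)
    raw = {w: tf * idf[w] for w, tf in term_freq.items()}

    # L2 normalisation, so the squared scores add up to 1.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}


sentences = ["selva fauna", "selva gallo", "selva mundo"]
scores = tfidf_scores(sentences, " ".join(sentences))
for word, score in sorted(scores.items(), key=lambda kv: (kv[1], kv[0]), reverse=True):
    print(word, round(score, 3))
```

Words with equal counts and equal document frequency get identical scores. In `sort_coo` such ties fall back to the column index, and since the vectorizer's vocabulary is, to my knowledge, stored in sorted order, tied words come out in reverse alphabetical order. That would explain runs such as 'último', 'íntimo', 'época' in the outputs.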

## Spanish

*Extract all the keywords at once*

The next cell loads the stop words and the tokenizer for Spanish. I then add some stop words that were not in the default list. I changed max_df for the count vectorizer because, with some documents having just 10 words and others many more, keeping the original value of 85% raised an error. This might let in some less relevant words for the longer texts, but it still gives a good representation while keeping all 10 words for artists with very short texts.

In [7]:
tokenizer_spa = nltk.data.load('tokenizers/punkt/spanish.pickle')
stop_words = get_stop_words('spanish')
#get the text column
stoplist = stop_words#.set(stopwords.words("spanish"))
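The `max_df` setting mentioned above behaves differently for integers and floats: scikit-learn documents an integer as an absolute document count and a float as a proportion of documents. The following sketch mimics that rule in plain Python; `kept_terms` is a hypothetical helper, not scikit-learn code, and the sentences are made up:

```python
from collections import Counter
from numbers import Integral


def kept_terms(sentences, max_df):
    """Mimic how a document-frequency ceiling filters a vocabulary."""
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(sentence.split()))
    # Integers are absolute counts, floats are proportions of the documents.
    ceiling = max_df if isinstance(max_df, Integral) else max_df * len(sentences)
    return sorted(w for w, df in doc_freq.items() if df <= ceiling)


sentences = ["selva fauna", "selva gallo", "selva mundo"]
print(kept_terms(sentences, 1))     # ['fauna', 'gallo', 'mundo']
print(kept_terms(sentences, 1.0))   # ['fauna', 'gallo', 'mundo', 'selva']
print(kept_terms(sentences, 0.85))  # ['fauna', 'gallo', 'mundo']
```

Under this reading, `max_df=1` drops every word that occurs in more than one sentence, while `max_df=1.0` keeps all of them, so the two are worth telling apart when the goal is to leave no word out.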


In [None]:
nuevas_stop_words = ["después","antes","tan","van","trata","trato","toma", "the", "that", "tema","puesto","hacia","ésta","éste","hace","éstos","éstas","realizó","últimos","veces","toda","todo"]
for wrd in nuevas_stop_words:
    stop_words.append(wrd)
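Re-running the cell above appends the same words again. A small sketch of an extension step that is safe to repeat; `extend_stop_words` is a hypothetical helper and the base list is a made-up stand-in for the real stop-word list:

```python
def extend_stop_words(stop_words, extra_words):
    """Return a new list with `extra_words` added, skipping ones already present."""
    seen = set(stop_words)
    merged = list(stop_words)
    for word in extra_words:
        if word not in seen:
            merged.append(word)
            seen.add(word)
    return merged


base = ["de", "la", "que"]
once = extend_stop_words(base, ["después", "antes", "la"])
twice = extend_stop_words(once, ["después", "antes", "la"])
print(once)           # ['de', 'la', 'que', 'después', 'antes']
print(once == twice)  # True
```

The result stays a list because `CountVectorizer` documents its `stop_words` parameter as a list.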

In [None]:
stop_words

In [11]:
#Run this only once unless you want to discard the stored results. The loop below reassigns the entry
#for each key on every run, so after a mistake it can be rerun without resetting the whole dictionary
artist_keywords_tfidf = {}

In [19]:
stoplist = stop_words
for k in artist_dictionary:
    raw_sentences = tokenizer_spa.tokenize(artist_dictionary[k])
    docs = raw_sentences
    
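    #create a vocabulary of words and eliminate stop words
    #max_df=1 is an integer, so scikit-learn treats it as an absolute count and ignores
    #words that appear in more than one sentence; max_df=1.0 would keep every word
    cv = CountVectorizer(max_df=1, stop_words = stoplist)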
    word_count_vector = cv.fit_transform(docs)
    word_count_vector.shape
    tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
    tfidf_transformer.fit(word_count_vector)
    
    #The vectorizer works on raw text, so lowercase the artist's text
    #and drop the stop words and the short words by hand
    corpus_raw_lower = artist_dictionary[k].lower()
    corpus_raw_3 = u""
    
    clean_corp = corpus_raw_lower.split()
    for w in clean_corp :
        if w not in stop_words and len(w) >= 3 :
            corpus_raw_3 += w + ' ' 
            
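    #the feature names only need to be fetched once per fitted vectorizer
    #(get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 still need it)
    feature_names=cv.get_feature_names_out()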
    
    # get the document that we want to extract keywords from
    doc = corpus_raw_3
    
    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
    
    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())
    
    #extract only the top n; n here is 20
    keywords = extract_topn_from_vector(feature_names,sorted_items,20)
    
    terms_list = []
    
    for y in keywords:
        terms_list.append(y)
        artist_keywords_tfidf[k] = terms_list
    
    print("\n===Keywords===")
    for b in keywords:
        print(b,keywords[b])
            


===Keywords===
serie 0.479
asedio 0.479
reducido 0.399
paisaje 0.399
papel 0.239
paso 0.16
velocidad 0.08
universos 0.08
trabajo 0.08
surrealismo 0.08
representaciones 0.08
pintura 0.08
pieza 0.08
parte 0.08
nacrílico 0.08
lenguajes 0.08
graffiti 0.08
futurismo 0.08
formal 0.08
dibujo 0.08

===Keywords===
indicio 0.095
último 0.047
íntimo 0.047
época 0.047
área 0.047
zona 0.047
yai 0.047
xvi 0.047
wisuya 0.047
vínculos 0.047
vínculo 0.047
voluntad 0.047
viviendas 0.047
vital 0.047
visión 0.047
virtud 0.047
vida 0.047
verdor 0.047
vasijas 0.047
varias 0.047

===Keywords===
vida 0.099
vez 0.099
verlos 0.099
ve 0.099
universales 0.099
todas 0.099
tipo 0.099
sociedad 0.099
sistema 0.099
sentencias 0.099
semejante 0.099
segunda 0.099
retratos 0.099
restringirlas 0.099
respuesta 0.099
rescatar 0.099
reparador 0.099
rendirse 0.099
regla 0.099
reconocer 0.099

===Keywords===
madeja 0.133
patrimonio 0.088
oficina 0.088
minuta 0.088
medellín 0.088
llantas 0.088
invertido 0.088
intangible 0.088


In [20]:
artist_keywords_tfidf

{'adalbertocalvogonzalez_artista': ['serie',
  'asedio',
  'reducido',
  'paisaje',
  'papel',
  'paso',
  'velocidad',
  'universos',
  'trabajo',
  'surrealismo',
  'representaciones',
  'pintura',
  'pieza',
  'parte',
  'nacrílico',
  'lenguajes',
  'graffiti',
  'futurismo',
  'formal',
  'dibujo'],
 'agustinalallana': ['indicio',
  'último',
  'íntimo',
  'época',
  'área',
  'zona',
  'yai',
  'xvi',
  'wisuya',
  'vínculos',
  'vínculo',
  'voluntad',
  'viviendas',
  'vital',
  'visión',
  'virtud',
  'vida',
  'verdor',
  'vasijas',
  'varias'],
 'alequint': ['vida',
  'vez',
  'verlos',
  've',
  'universales',
  'todas',
  'tipo',
  'sociedad',
  'sistema',
  'sentencias',
  'semejante',
  'segunda',
  'retratos',
  'restringirlas',
  'respuesta',
  'rescatar',
  'reparador',
  'rendirse',
  'regla',
  'reconocer'],
 'alexandra_mccormick_artista': ['madeja',
  'patrimonio',
  'oficina',
  'minuta',
  'medellín',
  'llantas',
  'invertido',
  'intangible',
  'impresión',
  '

dict_keys(['naturaleza', 'popular'])

## English

The next cell loads the stop words and the tokenizer for English. I then add some stop words that were not in the default list. As in the Spanish section, I changed max_df for the count vectorizer because, with some documents having just 10 words and others many more, keeping the original value of 85% raised an error. This might let in some less relevant words for the longer texts, but it still gives a good representation while keeping all 10 words for artists with very short texts. Last, I save the resulting dictionary as a pickle and as a txt file.

In [None]:
tokenizer_en = nltk.data.load('tokenizers/punkt/english.pickle')
stop_words = get_stop_words('en')
#.set(stopwords.words("spanish"))

In [32]:
new_stop_words = ['will','whose','well','ways','way','vs','eish','wider']
for wrd in new_stop_words:
    stop_words.append(wrd)

In [42]:
#stop_words

In [38]:
stoplist = stop_words
for k in artist_dictionary:
    raw_sentences = tokenizer_en.tokenize(artist_dictionary[k])
    docs = raw_sentences
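    #create a vocabulary of words and eliminate stop words
    #max_df=1 is an integer, so scikit-learn treats it as an absolute count and ignores
    #words that appear in more than one sentence; max_df=1.0 would keep every word
    cv = CountVectorizer(max_df=1, stop_words = stoplist)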
    word_count_vector = cv.fit_transform(docs)
    word_count_vector.shape
    tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
    tfidf_transformer.fit(word_count_vector)
    
    #The vectorizer works on raw text, so lowercase the artist's text
    #and drop the stop words and the short words by hand
    corpus_raw_lower = artist_dictionary[k].lower()
    corpus_raw_3 = u""
    
    clean_corp = corpus_raw_lower.split()
    for w in clean_corp :
        if w not in stop_words and len(w) >= 3 :
            corpus_raw_3 += w + ' ' 
            
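    #the feature names only need to be fetched once per fitted vectorizer
    #(get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 still need it)
    feature_names=cv.get_feature_names_out()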
    
    # get the document that we want to extract keywords from
    doc = corpus_raw_3
    
    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
    
    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())
    
    #extract only the top n; n here is 20
    keywords = extract_topn_from_vector(feature_names,sorted_items,20)
    
    terms_list = []
    
    for y in keywords:
        terms_list.append(y)
        artist_keywords_tfidf[k] = terms_list
    
    print("\n===Keywords===")
    #b instead of k, so the outer loop variable is not shadowed
    for b in keywords:
        print(b,keywords[b])


===Keywords===
represent 0.111
packaging 0.111
little 0.111
home 0.111
book 0.111
year 0.055
work 0.055
women 0.055
within 0.055
white 0.055
wears 0.055
wanted 0.055
vehicle 0.055
vast 0.055
used 0.055
us 0.055
unfolding 0.055
underwear 0.055
typographic 0.055
translating 0.055

===Keywords===
bush 0.208
avoid 0.138
armed 0.138
youth 0.069
years 0.069
worship 0.069
whether 0.069
waters 0.069
watching 0.069
visual 0.069
villages 0.069
victims 0.069
valid 0.069
usually 0.069
usual 0.069
uses 0.069
unto 0.069
undetermined 0.069
two 0.069
transits 0.069

===Keywords===
mapping 0.092
forever 0.092
dimension 0.092
cet 0.092
away 0.092
édouard 0.046
åse 0.046
zkm 0.046
zeroes 0.046
year 0.046
word 0.046
within 0.046
widespread 0.046
western 0.046
welcome 0.046
web 0.046
weave 0.046
volume 0.046
visiting 0.046
visit 0.046

===Keywords===
juan 0.092
wounds 0.046
without 0.046
warlike 0.046
virtual 0.046
violence 0.046
villegas 0.046
vietnam 0.046
updated 0.046
unpredictable 0.046
unleashed 0.0

In [39]:
artist_keywords_tfidf

{'adalbertocalvogonzalez_artista': ['serie',
  'asedio',
  'reducido',
  'paisaje',
  'papel',
  'paso',
  'velocidad',
  'universos',
  'trabajo',
  'surrealismo',
  'representaciones',
  'pintura',
  'pieza',
  'parte',
  'nacrílico',
  'lenguajes',
  'graffiti',
  'futurismo',
  'formal',
  'dibujo'],
 'agustinalallana': ['indicio',
  'último',
  'íntimo',
  'época',
  'área',
  'zona',
  'yai',
  'xvi',
  'wisuya',
  'vínculos',
  'vínculo',
  'voluntad',
  'viviendas',
  'vital',
  'visión',
  'virtud',
  'vida',
  'verdor',
  'vasijas',
  'varias'],
 'alequint': ['vida',
  'vez',
  'verlos',
  've',
  'universales',
  'todas',
  'tipo',
  'sociedad',
  'sistema',
  'sentencias',
  'semejante',
  'segunda',
  'retratos',
  'restringirlas',
  'respuesta',
  'rescatar',
  'reparador',
  'rendirse',
  'regla',
  'reconocer'],
 'alexandra_mccormick_artista': ['madeja',
  'patrimonio',
  'oficina',
  'minuta',
  'medellín',
  'llantas',
  'invertido',
  'intangible',
  'impresión',
  '

In [40]:
#write as utf-8 so the accented characters survive on any platform
with open("keywords_artistas_tfidf.txt", "w", encoding="utf-8") as f:
    f.write(str(artist_keywords_tfidf))

In [41]:
import pickle

f = open("keywords_artistas_tfidf.pkl","wb")
pickle.dump(artist_keywords_tfidf,f)
f.close()

*Leftover cells from the single-document processing*

In [30]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer()

In [31]:
#The vectorizer works on raw text, so lowercase the artist's text
#and drop the stop words and the short words by hand
corpus_raw_lower = artist_dictionary['agustinalallana'].lower()
corpus_raw_3 = u""

clean_corp = corpus_raw_lower.split()
for w in clean_corp :
    if w not in stop_words and len(w) >= 3 :
        corpus_raw_3 += w + ' ' 

In [32]:
if clean_corp[0] not in stop_words and len(clean_corp[0]) >= 3 :
    print(clean_corp[0])
# len(clean_corp[0])

oriente


In [33]:
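#the feature names only need to be fetched once per fitted vectorizer
#(get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 still need it)
feature_names=cv.get_feature_names_out()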

# get the document that we want to extract keywords from
doc = corpus_raw_3

#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 20
keywords = extract_topn_from_vector(feature_names,sorted_items,20)

print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


===Keywords===
selva 0.206
ser 0.185
mundo 0.152
hombre 0.145
tiempo 0.134
gallo 0.127
animales 0.127
animal 0.127
putumayo 0.107
indígena 0.107
fauna 0.107
especies 0.107
vez 0.092
íntima 0.086
través 0.086
serie 0.086
región 0.086
pueblos 0.086
frente 0.086
colombia 0.086
