# Multilingual Sentence Encoder similarity comparison between Met-Corpus and 50 Colombian Contemporary Artists Keywords

This notebook takes the 100 MetMuseum keywords obtained from the `Scraping` and `Text` folders and compares them to the keywords obtained by running the `Keyword_Extractor_EN-ES` notebook on text scraped from each contemporary artist's website and Instagram account. I used the [Multilingual Universal Sentence Encoder](https://www.tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa) because it was originally designed for text in English and Chinese, and later for English and Spanish. A multilingual embedding is useful here because it allows word embedding without a separate model for each language.

It then ranks the artists by two parameters: their mean similarity to all keywords, and their mean similarity over the 30 most similar keyword pairs. Both are taken into account because the range of values against the overall collection is quite narrow, so a second parameter lessens the influence of tiny decimal differences.
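As a rough illustration of this two-parameter ranking, the sketch below ranks four artists on each score and sums the two ranks. The scores are copied from the TF-IDF results further down in this notebook; the column names `mean_sim`, `top30_sim` and `total` are shorthand chosen for this example only, and the ranks are relative to these four artists rather than to the full set.

```python
import pandas as pd

# Scores copied from this notebook's TF-IDF results for four artists.
scores = pd.DataFrame(
    {
        "mean_sim": [0.627557, 0.627455, 0.624115, 0.620684],
        "top30_sim": [0.718135, 0.732312, 0.707029, 0.725793],
    },
    index=["klauslundi", "alexandra_mccormick_artista", "dacunat", "alequint"],
)

# Higher similarity is better, so rank in descending order (1 = most similar).
ranks = scores.rank(ascending=False)

# Summing the two ranks gives one combined score; lower totals are better.
scores["total"] = ranks["mean_sim"] + ranks["top30_sim"]
print(scores.sort_values("total"))
```

In this small sample `klauslundi` leads on the overall mean but drops to second place once the top-30 rank is added, which is the kind of shift the second parameter is meant to capture.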

Be careful about the specific versions of TensorFlow and sentencepiece. Google has released newer versions of the [MUSE](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder) model, but this one took long enough to get working, so I'll stick with it for now.

In [1]:
import codecs
import glob
import logging
import os
import pprint
import re
import pickle
import multiprocessing
import nltk
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Setting up GUSE

*IMPORTANT*

The compatible library versions:

`tensorflow and tensorflow-gpu == 1.13.1` 

`tf-sentencepiece == 0.1.81` 

`tensorflow-hub == 0.7.0`

It is recommended to run this notebook in a new environment, as it requires these very specific library versions.
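Before running the rest of the notebook, it can help to confirm that the environment really has the pinned versions. The following is a minimal sketch, not part of the original workflow: `installed_version` and `find_mismatches` are hypothetical helper names, and the fallback branch assumes `setuptools` is available on older interpreters.

```python
# Pinned versions taken from the list above.
PINNED = {
    "tensorflow": "1.13.1",
    "tf-sentencepiece": "0.1.81",
    "tensorflow-hub": "0.7.0",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Older interpreters (assumption: setuptools provides pkg_resources).
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{package}: expected {wanted}, found {found}")
```

An empty result means every pinned package matches; a `found` value of `None` means the package is not installed at all.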

In [None]:
#latest Tensorflow that supports sentencepiece is 1.13.1
#In case you want to uninstall your current tensorflow version
# !pip uninstall --quiet --yes tensorflow
# !pip uninstall --quiet --yes tensorflow-gpu


In [42]:
#@title Setup Environment
!pip install --quiet tensorflow-gpu==1.13.1
!pip install --quiet tensorflow==1.13.1
!pip install --quiet tensorflow-hub==0.7.0
!pip install --quiet bokeh
!pip install --quiet tf-sentencepiece==0.1.81
!pip install --quiet simpleneighbors
!pip install --quiet tqdm

### Imports

In [1]:
#@title Setup common imports and functions
import bokeh
import bokeh.models
import bokeh.plotting
import numpy as np
import os
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # Not used directly but needed to import TF ops.
import sklearn.metrics.pairwise

from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange



The 16-language multilingual module is the default, but feel free to pick others from the list and compare the results.

In [2]:
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/1'  

#@param ['https://tfhub.dev/google/universal-sentence-encoder-multilingual/1', 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/1', 'https://tfhub.dev/google/universal-sentence-encoder-xling-many/1']

# Set up graph.
g = tf.Graph()
with g.as_default():
  text_input = tf.placeholder(dtype=tf.string, shape=[None])
  multiling_embed = hub.Module(module_url)
  embedded_text = multiling_embed(text_input)
  init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

# Initialize session.
session = tf.Session(graph=g)
session.run(init_op)

W1007 23:39:40.599957 140207048439616 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


I1007 23:39:45.843727 140207048439616 saver.py:1483] Saver not created because there are no variables in the graph to restore


### Text Lists

*texts* is a list containing the Met corpus keywords.

The *artist_keywords* come in a pickle file as a dictionary, with 10 to 20 keywords per artist. The number of keywords does not seem to affect an artist's position in the ranking: both the first and last spots belong to artists with 20 keywords, and some artists with 10 keywords sit towards the end. (In a previous trial, having fewer words in a string made it work better, but since each term is embedded individually here, it does not matter much.)

In [3]:
import pickle

# Load the TF-IDF keywords: a dict mapping each artist to a list of keywords.
with open("/storage/keywords_artistas_tfidf.pkl", 'rb') as f:
    artist_keywords = pickle.load(f)
artist_keywords

{'adalbertocalvogonzalez_artista': ['serie',
  'asedio',
  'reducido',
  'paisaje',
  'papel',
  'paso',
  'velocidad',
  'universos',
  'trabajo',
  'surrealismo',
  'representaciones',
  'pintura',
  'pieza',
  'parte',
  'nacrílico',
  'lenguajes',
  'graffiti',
  'futurismo',
  'formal',
  'dibujo'],
 'agustinalallana': ['indicio',
  'último',
  'íntimo',
  'época',
  'área',
  'zona',
  'yai',
  'xvi',
  'wisuya',
  'vínculos',
  'vínculo',
  'voluntad',
  'viviendas',
  'vital',
  'visión',
  'virtud',
  'vida',
  'verdor',
  'vasijas',
  'varias'],
 'alequint': ['vida',
  'vez',
  'verlos',
  've',
  'universales',
  'todas',
  'tipo',
  'sociedad',
  'sistema',
  'sentencias',
  'semejante',
  'segunda',
  'retratos',
  'restringirlas',
  'respuesta',
  'rescatar',
  'reparador',
  'rendirse',
  'regla',
  'reconocer'],
 'alexandra_mccormick_artista': ['madeja',
  'patrimonio',
  'oficina',
  'minuta',
  'medellín',
  'llantas',
  'invertido',
  'intangible',
  'impresión',
  '

In [5]:
texts = ['first', 'uninterruptedly', 'circulating', 'later', 'time', 'tragically', 'postponed', 'hiatus', 'untimely', 
         'delayed', 'protracted', 'premature', 'exhausting', 'dissatisfied', 'monopolized', 'portrait', 'mademoiselle', 
         'madame', 'monsieur', 'saint', 'martyr', 'early', 'museum', 'christ', 'period', 'figure', 'nude', 'androgynous', 
         'posture', 'work', 'collaboration', 'drafts', 'cartoonist', 'miniaturist', 'versatile', 'printmaker', 'mediocre', 
         'speculating', 'style', 'manner', 'idiom', 'classic', 'mannerism', 'madonna', 'trends', 'vocabulary', 'peculiarity',
         'characteristic', 'undeniably', 'quintassentially', 'art', 'timeline', 'secrecy', 'society', 'ethnographical', 
         'international', 'jewelry', 'archaeology', 'rarities',  'gold', 'drawing', 'century', 'collection', 'artist', 
         'painter', 'etcher', 'prinmaker', 'inventor', 'copyist', 'precocious', 'collaborator', 'improbably', 'uninspired',
         'confident', 'ambidiextrous', 'muralist', 'painting', 'unquestionably', 'masterpiece', 'picture', 'imitator',
         'rethinking', 'magisterial', 'dislike', 'design',  'pattern', 'geometrical', 'layout', 'scheme', 'structurally', 
         'ornamentation', 'composition', 'repeat', 'chasing', 'elaboration', 'floral']

In [6]:
# Just in case you need a list of the artists' names, here it is
for k in artist_keywords:
    print(k)

adalbertocalvogonzalez_artista
agustinalallana
alequint
alexandra_mccormick_artista
aliriocc
andres_moreno_hoffmannn
anibaldo081
aparissio
burningflags
camilabarretohoyos
camilacostalzate
carolina_diaz_g
catalinajaramilloquijano
colectivomonomero
crila_regina
dacunat
ejelejalea
estebanl_lopez_e
faaloon
federicopuyo
felipeuribemejia
felipezapataz4
fernandaluzavendano
gabrielhernandezserrato
guarnizo_david
haymuchasanas
jpuribem
juansebastianrosillo
klauslundi
layosandres
mauriciojaramilloartista
mayacorredorr
namejiam
natalia_castillo_rincon
nclsbarrera
ojedaenel.lente
rocio_pardo_e
sebastian_fonnegra
sergiogalvis.art
soniarojaspez
sophiaprietov
vicentavictoriagomez
villamilyvillamil
catalinaortiz1065
danielalopezp
feliperomero30
juan_covelli
juancortes79
oldnewflesh
silvanuchis
susanaordonezartista
vanessanietoromero
vivianabtroya


#### Embedding of each artist list in a separate variable

In [7]:
MethEmb = session.run(embedded_text, feed_dict={text_input: texts})

# Store each artist's embeddings in a global variable named emb_<dictionary key>

namespace = globals()
for k in artist_keywords:
    namespace['emb_%s' % k] = session.run(embedded_text, feed_dict={text_input: artist_keywords[k]})
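A dictionary keyed by artist name is one possible alternative to writing variables into `globals()`. It keeps the embeddings grouped and sidesteps keys such as `sergiogalvis.art`, which cannot be typed as a bare variable name. The sketch below is only an illustration: `fake_embed` is a stand-in for `session.run(embedded_text, ...)` so the example runs without TensorFlow, and `embed_all` is a hypothetical helper.

```python
import numpy as np


def fake_embed(words, dim=4):
    """Stand-in for the real encoder: one random unit vector per word."""
    rng = np.random.RandomState(0)
    vectors = rng.normal(size=(len(words), dim))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def embed_all(keywords_by_artist, embed):
    """Return {artist: embedding matrix} instead of creating emb_<artist> globals."""
    return {artist: embed(words) for artist, words in keywords_by_artist.items()}


sample = {
    "agustinalallana": ["indicio", "época", "verdor"],
    # Placeholder keywords, not this artist's real ones.
    "sergiogalvis.art": ["pintura", "dibujo"],
}
embeddings = embed_all(sample, fake_embed)
print(embeddings["sergiogalvis.art"].shape)  # (2, 4): two keywords, four dimensions
```

With the real encoder, `embed` would be a small function wrapping `session.run`, and later cells would read `embeddings[k]` instead of `namespace['emb_%s' % k]`.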


#### Function that computes the similarity of each word pair, originally from MUSE

In [8]:
def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2):
    """Return a DataFrame with the similarity of every (labels_1, labels_2) word pair."""
    assert len(embeddings_1) == len(labels_1)
    assert len(embeddings_2) == len(labels_2)

    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    cos = sklearn.metrics.pairwise.cosine_similarity(embeddings_1, embeddings_2)
    # Clip to [-1, 1] so rounding error cannot push arccos into NaN
    sim = 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    # Flatten the similarity matrix into one row per word pair
    embeddings_1_col, embeddings_2_col, sim_col = [], [], []
    for i in range(len(embeddings_1)):
        for j in range(len(embeddings_2)):
            embeddings_1_col.append(labels_1[i])
            embeddings_2_col.append(labels_2[j])
            sim_col.append(sim[i][j])
    df = pd.DataFrame(list(zip(embeddings_1_col, embeddings_2_col, sim_col)),
                      columns=['embeddings_1', 'embeddings_2', 'sim'])
    return df
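The function above turns cosine similarity into an angular similarity, 1 - arccos(cos)/π, which maps identical directions to 1, orthogonal ones to 0.5 and opposite ones to 0. A minimal NumPy-only sketch of the same formula follows; `angular_similarity` is a name chosen for this example, and the clipping step guards against cosine values that drift slightly outside [-1, 1] through rounding.

```python
import numpy as np


def angular_similarity(a, b):
    """Pairwise 1 - arccos(cosine)/pi between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cosine = np.clip(a @ b.T, -1.0, 1.0)  # guard against rounding just past +/-1
    return 1.0 - np.arccos(cosine) / np.pi


x = np.array([[1.0, 0.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(angular_similarity(x, y))  # same direction, orthogonal, opposite
```

On this scale a value near 0.62, typical of the means reported in this notebook, corresponds to a cosine similarity of roughly 0.37.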

#### Similarity computed from the embeddings obtained above

In [9]:
for k in artist_keywords:
    namespace['df_%s' % k] = visualize_similarity(MethEmb, namespace['emb_%s' % k], texts, artist_keywords[k]) 

# df_simm = visualize_similarity(MethEmb, Key1emb, texts, AltargerelDelgerama)
# dfart2 = visualize_similarity(MethEmb, Key2emb, texts, AnaCurk)

*List of top similarities between collection and artist words*

Replace `df_artistname` with another artist's dataframe to see the top words for that artist.

In [15]:
# Test to check that the similarity works. I chose this artist because I knew beforehand that she had the match
# flowers - floral, which shows up with a similarity of 0.80

df_agustinalallana.sort_values(by=['sim'],ascending=False).head(35)

Unnamed: 0,embeddings_1,embeddings_2,sim
63,time,época,0.780992
151,protracted,verdor,0.772422
363,period,época,0.753094
123,untimely,época,0.732093
768,timeline,época,0.731801
181,exhausting,verdor,0.731308
375,figure,indicio,0.719905
590,manner,voluntad,0.718293
586,manner,verdor,0.716051
918,century,época,0.714054


### Lists that store the mean and median of each similarity column

In [60]:
simmil_art_mean = []
simmil_art_median = []

# Store the mean similarity of each artist's dataframe in the list above
for k in artist_keywords:
    simmil_art_mean.append(namespace['df_%s' % k]['sim'].mean())

# Store the median similarity of each artist's dataframe in the list above
for k in artist_keywords:
    simmil_art_median.append(namespace['df_%s' % k]['sim'].median())
    

In [61]:
# Mean and median values span roughly the same min-max range, so it is unnecessary to include both in the calculation
print(min(simmil_art_mean))
print(max(simmil_art_mean))

print(min(simmil_art_median))
print(max(simmil_art_median))


0.6111212555939952
0.6275568507993833
0.6091335713863373
0.6257765293121338


### List that stores the mean of the top 30 similarity values (closeness to top 5%)

In [62]:
# I used only the mean, as it captured a wider variation between min and max values.

simmil_30 = []

for k in artist_keywords:
    simmil_30.append(namespace['df_%s' % k]['sim'].sort_values(ascending=False).head(30).mean())
    

In [63]:
print(min(simmil_30))
print(max(simmil_30))

0.6833540479342143
0.746122153600057
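The top-30 mean looks only at the strongest word pairs, which is why its range is wider than that of the overall mean. A small sketch of the same computation on synthetic data is shown below; `top_n_mean` is a hypothetical helper and the random scores merely stand in for one artist's `sim` column.

```python
import numpy as np
import pandas as pd


def top_n_mean(similarities, n=30):
    """Mean of the n largest values in a Series of similarity scores."""
    return similarities.nlargest(n).mean()


# Synthetic scores standing in for one artist's 'sim' column.
rng = np.random.RandomState(0)
sim = pd.Series(rng.uniform(0.5, 0.8, size=1440))

print(sim.mean(), top_n_mean(sim))  # the top-30 mean is at least the overall mean
```

`nlargest(30).mean()` gives the same value as `sort_values(ascending=False).head(30).mean()`, the expression used in the cell above.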


Getting the artist names and keywords into lists, so they fit in a dictionary used to build the DataFrame

In [18]:
nombres_artistas =  []
palabras_key = []

for k in artist_keywords:
    nombres_artistas.append(k)
    palabras_key.append(artist_keywords[k])    
    
ArtistTextDict = {'Artist':nombres_artistas,'Keywords':palabras_key,'Average Simil to Text Core':simmil_art_mean,'Median Simil to Text Core':simmil_art_median, 'Simil top 30':simmil_30}

## General Dataframe with keywords extracted using TF-IDF

In [19]:
df = pd.DataFrame(ArtistTextDict) 
df

Unnamed: 0,Artist,Keywords,Average Simil to Text Core,Median Simil to Text Core,Simil top 30
0,adalbertocalvogonzalez_artista,"[asedio, formal, dibujo, serie, representacion...",0.620029,0.617029,0.724822
1,agustinalallana,"[indicio, verdor, íntimo, época, fauna, volunt...",0.623766,0.62264,0.707438
2,alequint,"[rendirse, reconocer, cuerpo, enigma, tipo, re...",0.620684,0.619929,0.725793
3,alexandra_mccormick_artista,"[madeja, invertido, época, imagen, fenómeno, t...",0.627455,0.625081,0.732312
4,aliriocc,"[urbano, tejido, metal, material, hogar, habit...",0.615642,0.613963,0.68497
5,andres_moreno_hoffmannn,"[ópticos, ángulos, visualizar, visuales, vislu...",0.615792,0.613789,0.712518
6,anibaldo081,"[dibujo, íntimo, paisaje, nubes, naturaleza, m...",0.623523,0.621949,0.712278
7,aparissio,"[video, historia, década, conceptos, órgano, x...",0.612967,0.610762,0.693233
8,burningflags,"[fotografía, vivo, vinilo, serigrafia, publica...",0.614018,0.611329,0.696313
9,camilabarretohoyos,"[procesos, nueva, manera, imagen, género, chri...",0.616646,0.613646,0.714643


*Checking whether the ranking changes between the overall mean and the top 30 words*

In [20]:
df[['Average Simil to Text Core','Artist']].sort_values(by=['Average Simil to Text Core'], ascending=False).head(10)

Unnamed: 0,Average Simil to Text Core,Artist
28,0.627557,klauslundi
3,0.627455,alexandra_mccormick_artista
20,0.626211,felipeuribemejia
15,0.624115,dacunat
1,0.623766,agustinalallana
6,0.623523,anibaldo081
50,0.623115,susanaordonezartista
49,0.622938,silvanuchis
43,0.62248,catalinaortiz1065
48,0.620753,oldnewflesh


In [21]:
df[['Simil top 30','Artist']].sort_values(by=['Simil top 30'], ascending=False).head(10)

Unnamed: 0,Simil top 30,Artist
38,0.746122,sergiogalvis.art
52,0.74118,vivianabtroya
20,0.739599,felipeuribemejia
3,0.732312,alexandra_mccormick_artista
23,0.730696,gabrielhernandezserrato
17,0.72916,estebanl_lopez_e
19,0.725794,federicopuyo
2,0.725793,alequint
35,0.725537,ojedaenel.lente
10,0.725048,camilacostalzate


Indeed, the two top-10 lists differ: only `felipeuribemejia` and `alexandra_mccormick_artista` appear in both.
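The overlap between the two top-10 lists can also be checked programmatically. The sketch below copies the artist names from the two outputs above and intersects them; the variable names are chosen for this example.

```python
# Top-10 artists by mean similarity, copied from the output above.
top_mean = [
    "klauslundi", "alexandra_mccormick_artista", "felipeuribemejia", "dacunat",
    "agustinalallana", "anibaldo081", "susanaordonezartista", "silvanuchis",
    "catalinaortiz1065", "oldnewflesh",
]
# Top-10 artists by top-30 similarity, copied from the output above.
top_30 = [
    "sergiogalvis.art", "vivianabtroya", "felipeuribemejia",
    "alexandra_mccormick_artista", "gabrielhernandezserrato", "estebanl_lopez_e",
    "federicopuyo", "alequint", "ojedaenel.lente", "camilacostalzate",
]

# Artists present in both top-10 lists.
shared = sorted(set(top_mean) & set(top_30))
print(shared)  # ['alexandra_mccormick_artista', 'felipeuribemejia']
```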

*Making a ranking for each*

In [23]:
# rank(ascending=False) gives rank 1 to the highest similarity
df['Score Mean TextCorpus'] = df['Average Simil to Text Core'].rank(ascending=False)
df['Score Top30'] = df['Simil top 30'].rank(ascending=False)
df['Total Text Score'] = df['Score Mean TextCorpus'] + df['Score Top30']
# sort_values returns a sorted copy; df itself stays in its original order
df.sort_values(by='Total Text Score')
#df.drop('Text Rank',axis=1, inplace=True)
df

Unnamed: 0,Artist,Keywords,Average Simil to Text Core,Median Simil to Text Core,Simil top 30,Score Mean TextCorpus,Score Top30,Total Text Score
0,adalbertocalvogonzalez_artista,"[asedio, formal, dibujo, serie, representacion...",0.620029,0.617029,0.724822,17.0,12.0,29.0
1,agustinalallana,"[indicio, verdor, íntimo, época, fauna, volunt...",0.623766,0.62264,0.707438,5.0,34.0,39.0
2,alequint,"[rendirse, reconocer, cuerpo, enigma, tipo, re...",0.620684,0.619929,0.725793,11.0,8.0,19.0
3,alexandra_mccormick_artista,"[madeja, invertido, época, imagen, fenómeno, t...",0.627455,0.625081,0.732312,2.0,4.0,6.0
4,aliriocc,"[urbano, tejido, metal, material, hogar, habit...",0.615642,0.613963,0.68497,36.0,52.0,88.0
5,andres_moreno_hoffmannn,"[ópticos, ángulos, visualizar, visuales, vislu...",0.615792,0.613789,0.712518,35.0,27.0,62.0
6,anibaldo081,"[dibujo, íntimo, paisaje, nubes, naturaleza, m...",0.623523,0.621949,0.712278,6.0,28.0,34.0
7,aparissio,"[video, historia, década, conceptos, órgano, x...",0.612967,0.610762,0.693233,50.0,50.0,100.0
8,burningflags,"[fotografía, vivo, vinilo, serigrafia, publica...",0.614018,0.611329,0.696313,45.0,49.0,94.0
9,camilabarretohoyos,"[procesos, nueva, manera, imagen, género, chri...",0.616646,0.613646,0.714643,31.0,24.0,55.0


*Attention:* MUSE measures SIMILARITY, so the highest number is the closest match. Do not confuse it with DISTANCE values, where the lowest number would be the most similar.
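To make the distinction concrete, the sketch below ranks three made-up scores once as similarities and once as distances. The artist labels and values are invented for illustration, and `1 - sim` is just one simple way to turn a similarity into a distance.

```python
import pandas as pd

# Made-up similarity values for three hypothetical artists.
sim = pd.Series({"artist_a": 0.63, "artist_b": 0.61, "artist_c": 0.62})
dist = 1.0 - sim  # one simple way to turn a similarity into a distance

rank_by_similarity = sim.rank(ascending=False)  # highest similarity gets rank 1
rank_by_distance = dist.rank(ascending=True)    # lowest distance gets rank 1

print(rank_by_similarity.equals(rank_by_distance))  # True
```

Both rankings agree only because the sort direction is flipped along with the measure, which is why `rank(ascending=False)` is used throughout this notebook.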

In [24]:
df['Rank'] = df['Average Simil to Text Core'].rank(ascending = 0)
df.sort_values(by=['Rank'])#.head(20)

Unnamed: 0,Artist,Keywords,Average Simil to Text Core,Median Simil to Text Core,Simil top 30,Score Mean TextCorpus,Score Top30,Total Text Score,Rank
28,klauslundi,"[íntimos, pretensión, forma, querido, tener, s...",0.627557,0.625777,0.718135,1.0,20.0,21.0,1.0
3,alexandra_mccormick_artista,"[madeja, invertido, época, imagen, fenómeno, t...",0.627455,0.625081,0.732312,2.0,4.0,6.0,2.0
20,felipeuribemejia,"[vicioso, sucesivamente, sentido, portada, ter...",0.626211,0.622392,0.739599,3.0,3.0,6.0,3.0
15,dacunat,"[vanitas, realismo, nubes, negro, muerte, loom...",0.624115,0.621312,0.707029,4.0,36.0,40.0,4.0
1,agustinalallana,"[indicio, verdor, íntimo, época, fauna, volunt...",0.623766,0.62264,0.707438,5.0,34.0,39.0,5.0
6,anibaldo081,"[dibujo, íntimo, paisaje, nubes, naturaleza, m...",0.623523,0.621949,0.712278,6.0,28.0,34.0,6.0
50,susanaordonezartista,"[territory, sweet, sugarcane, socialite, sculp...",0.623115,0.620609,0.696696,7.0,47.0,54.0,7.0
49,silvanuchis,"[visual, sunset, remains, photography, light, ...",0.622938,0.620901,0.702922,8.0,43.0,51.0,8.0
43,catalinaortiz1065,"[represent, packaging, little, home, book, yea...",0.62248,0.62069,0.723487,9.0,16.0,25.0,9.0
48,oldnewflesh,"[culture, coffee, workshop, upon, sort, sol, s...",0.620753,0.619195,0.71814,10.0,19.0,29.0,10.0


In [25]:
df.set_index('Artist', inplace=True)

*Saving it to .csv file*

In [32]:
df.to_csv("colombiartistas_text_simmil.csv", encoding='utf-8')

In [35]:
textsdict = {'MetMuseum Keywords' : texts}

In [38]:
dftexts = pd.DataFrame(textsdict) 
dftexts.to_csv("metmuseumkeywords.csv", encoding='utf-8')

In [98]:
p = df_klauslundi.groupby(by='embeddings_2').mean()
p.sort_values(by='sim', ascending=False)

Unnamed: 0_level_0,sim
embeddings_2,Unnamed: 1_level_1
ritos,0.65158
pretensión,0.638849
querido,0.636366
tener,0.634979
superposiciones,0.632303
territorio,0.628726
serie,0.628261
sigue,0.628021
recorrido,0.627727
sica,0.626052


# Ranking using Frequency List Keywords

I also made a ranking using the keywords extracted from a frequency list. Although TF-IDF yields better terms, I wanted to see how the final rankings differ, in order to assess the effectiveness of the MUSE embeddings and of the ranking method in general.

In [35]:
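# Load the frequency-count keywords: a dict mapping each artist to a list of keywords.
with open("/storage/keywords_artistas_conteo.pkl", 'rb') as f:
    artist_keywordsc = pickle.load(f)
artist_keywordsc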

{'adalbertocalvogonzalez_artista': ['serie',
  'asedio',
  'paisaje',
  'reducido',
  'papel',
  'paso',
  'lenguajes',
  'universos',
  'desdoblados',
  'arqueología',
  'graffiti',
  'futurismo',
  'surrealismo',
  'representaciones',
  'arqueológicas',
  'dibujo',
  'pintura',
  'alta',
  'velocidad',
  'construcción'],
 'agustinalallana': ['selva',
  'ser',
  'hombre',
  'mundo',
  'animal',
  'animales',
  'tiempo',
  'gallo',
  'fauna',
  'especies',
  'indígena',
  'putumayo',
  'pueblos',
  'íntima',
  'través',
  'así',
  'frente',
  'bien',
  'serie',
  'región'],
 'alequint': ['cuerpo',
  'haciendo',
  'demás',
  'pregunta',
  'tiempo',
  'después',
  'celebrar',
  'mismo',
  'forma',
  'balnearios',
  'sino',
  'mientras',
  've',
  'cuál',
  'enigma',
  'detrás',
  'retratos',
  'albinos',
  'fotógrafa',
  'alejandra'],
 'alexandra_mccormick_artista': ['luz',
  'espacio',
  'lugar',
  'agua',
  'imágenes',
  'dibujos',
  'papel',
  'vida',
  'diapositivas',
  'superficie',

In [36]:
MethEmb = session.run(embedded_text, feed_dict={text_input: texts})

# Store each artist's embeddings in a global variable named embc_<dictionary key>

namespace = globals()
for k in artist_keywordsc:
    namespace['embc_%s' % k] = session.run(embedded_text, feed_dict={text_input: artist_keywordsc[k]})


In [37]:
def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2):
    """Return a DataFrame with the similarity of every (labels_1, labels_2) word pair."""
    assert len(embeddings_1) == len(labels_1)
    assert len(embeddings_2) == len(labels_2)

    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    cos = sklearn.metrics.pairwise.cosine_similarity(embeddings_1, embeddings_2)
    # Clip to [-1, 1] so rounding error cannot push arccos into NaN
    sim = 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    # Flatten the similarity matrix into one row per word pair
    embeddings_1_col, embeddings_2_col, sim_col = [], [], []
    for i in range(len(embeddings_1)):
        for j in range(len(embeddings_2)):
            embeddings_1_col.append(labels_1[i])
            embeddings_2_col.append(labels_2[j])
            sim_col.append(sim[i][j])
    df = pd.DataFrame(list(zip(embeddings_1_col, embeddings_2_col, sim_col)),
                      columns=['embeddings_1', 'embeddings_2', 'sim'])
    return df

In [38]:
for k in artist_keywordsc:
    namespace['dfc_%s' % k] = visualize_similarity(MethEmb, namespace['embc_%s' % k], texts, artist_keywordsc[k]) 



In [62]:
dfc_haymuchasanas.sort_values(by=['sim'],ascending=False).head(35)

Unnamed: 0,embeddings_1,embeddings_2,sim
1014,art,arte,0.923247
1274,artist,arte,0.77973
1686,design,proyecto,0.769824
626,drafts,proyecto,0.75814
170,untimely,ser,0.740667
1550,unquestionably,ser,0.736547
1294,painter,arte,0.734229
970,undeniably,ser,0.734102
49,circulating,movimiento,0.718278
1766,scheme,proyecto,0.71679


In [15]:
simmilc_art_mean = []
simmilc_art_median = []

# Store the mean similarity of each artist's dataframe in the list above
for k in artist_keywordsc:
    simmilc_art_mean.append(namespace['dfc_%s' % k]['sim'].mean())

# Store the median similarity of each artist's dataframe in the list above
for k in artist_keywordsc:
    simmilc_art_median.append(namespace['dfc_%s' % k]['sim'].median())
    

In [16]:
print(min(simmilc_art_mean))
print(max(simmilc_art_mean))

print(min(simmilc_art_median))
print(max(simmilc_art_median))


0.6045667453979452
0.6241151082583449
0.6065887212753296
0.6219494044780731


In [17]:
simmil_30c = []

for k in artist_keywordsc:
    simmil_30c.append(namespace['dfc_%s' % k]['sim'].sort_values(ascending=False).head(30).mean())
    

In [18]:
print(min(simmil_30c))
print(max(simmil_30c))

0.6833540260791778
0.7692031582196553


In [20]:
nombres_artistasc =  []
palabras_keyc = []

for k in artist_keywordsc:
    nombres_artistasc.append(k)
    palabras_keyc.append(artist_keywordsc[k])    
    
ArtistTextDictc = {'Artist':nombres_artistasc,'Keywords':palabras_keyc,'Average Simil to Text Core':simmilc_art_mean,'Median Simil to Text Core':simmilc_art_median, 'Simil top 30':simmil_30c}

In [21]:
dfc = pd.DataFrame(ArtistTextDictc) 
dfc

Unnamed: 0,Artist,Keywords,Average Simil to Text Core,Median Simil to Text Core,Simil top 30
0,adalbertocalvogonzalez_artista,"[serie, asedio, paisaje, reducido, papel, paso...",0.619336,0.616342,0.732681
1,agustinalallana,"[selva, ser, hombre, mundo, animal, animales, ...",0.62093,0.619647,0.725307
2,alequint,"[cuerpo, haciendo, demás, pregunta, tiempo, de...",0.618375,0.614423,0.744468
3,alexandra_mccormick_artista,"[luz, espacio, lugar, agua, imágenes, dibujos,...",0.612733,0.613017,0.729265
4,aliriocc,"[escultura, hogar, habitar, espacio, contenido...",0.615642,0.613963,0.68497
5,andres_moreno_hoffmannn,"[artista, espacio, bogotá, museo, moreno, hoff...",0.613428,0.610498,0.755692
6,anibaldo081,"[dibujo, naturaleza, paisaje, instalación, cie...",0.623523,0.621949,0.712278
7,aparissio,"[tiempo, mundo, imágenes, reloj, biblioteca, a...",0.616998,0.61374,0.738137
8,burningflags,"[fotografía, publicaciones, impresiones, serig...",0.614018,0.611329,0.696313
9,camilabarretohoyos,"[barreto, textil, camila, tiempo, dispositivo,...",0.620611,0.618642,0.729784


In [22]:
dfc[['Average Simil to Text Core','Artist']].sort_values(by=['Average Simil to Text Core'], ascending=False).head(10)

Unnamed: 0,Average Simil to Text Core,Artist
15,0.624115,dacunat
6,0.623523,anibaldo081
50,0.623115,susanaordonezartista
49,0.622938,silvanuchis
34,0.622812,nclsbarrera
43,0.620984,catalinaortiz1065
29,0.620975,layosandres
1,0.62093,agustinalallana
31,0.620671,mayacorredorr
9,0.620611,camilabarretohoyos


In [23]:
dfc[['Simil top 30','Artist']].sort_values(by=['Simil top 30'], ascending=False).head(10)

Unnamed: 0,Simil top 30,Artist
34,0.769203,nclsbarrera
16,0.75638,ejelejalea
5,0.755692,andres_moreno_hoffmannn
22,0.752441,fernandaluzavendano
46,0.751148,juan_covelli
41,0.74666,vicentavictoriagomez
37,0.745868,sebastian_fonnegra
2,0.744468,alequint
12,0.742111,catalinajaramilloquijano
7,0.738137,aparissio


In [24]:
# rank(ascending=False) gives rank 1 to the highest similarity
dfc['Score Mean TextCorpus'] = dfc['Average Simil to Text Core'].rank(ascending=False)
dfc['Score Top30'] = dfc['Simil top 30'].rank(ascending=False)
dfc['Total Text Score'] = dfc['Score Mean TextCorpus'] + dfc['Score Top30']
# sort_values returns a sorted copy; dfc itself stays in its original order
dfc.sort_values(by='Total Text Score')
#df.drop('Text Rank',axis=1, inplace=True)
dfc

Unnamed: 0,Artist,Keywords,Average Simil to Text Core,Median Simil to Text Core,Simil top 30,Score Mean TextCorpus,Score Top30,Total Text Score
0,adalbertocalvogonzalez_artista,"[serie, asedio, paisaje, reducido, papel, paso...",0.619336,0.616342,0.732681,17.0,14.0,31.0
1,agustinalallana,"[selva, ser, hombre, mundo, animal, animales, ...",0.62093,0.619647,0.725307,8.0,28.0,36.0
2,alequint,"[cuerpo, haciendo, demás, pregunta, tiempo, de...",0.618375,0.614423,0.744468,20.0,8.0,28.0
3,alexandra_mccormick_artista,"[luz, espacio, lugar, agua, imágenes, dibujos,...",0.612733,0.613017,0.729265,46.0,20.0,66.0
4,aliriocc,"[escultura, hogar, habitar, espacio, contenido...",0.615642,0.613963,0.68497,34.0,52.0,86.0
5,andres_moreno_hoffmannn,"[artista, espacio, bogotá, museo, moreno, hoff...",0.613428,0.610498,0.755692,44.0,3.0,47.0
6,anibaldo081,"[dibujo, naturaleza, paisaje, instalación, cie...",0.623523,0.621949,0.712278,2.0,37.0,39.0
7,aparissio,"[tiempo, mundo, imágenes, reloj, biblioteca, a...",0.616998,0.61374,0.738137,30.0,10.0,40.0
8,burningflags,"[fotografía, publicaciones, impresiones, serig...",0.614018,0.611329,0.696313,41.0,50.0,91.0
9,camilabarretohoyos,"[barreto, textil, camila, tiempo, dispositivo,...",0.620611,0.618642,0.729784,10.0,19.0,29.0
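One way to quantify how much the TF-IDF and frequency-list rankings agree is a Spearman rank correlation. The sketch below is only illustrative: it uses the `Total Text Score` values of the ten rows displayed in each table of this notebook, not the full set of artists, so the resulting number should not be read as the overall agreement. Spearman is computed here as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

artists = [
    "adalbertocalvogonzalez_artista", "agustinalallana", "alequint",
    "alexandra_mccormick_artista", "aliriocc", "andres_moreno_hoffmannn",
    "anibaldo081", "aparissio", "burningflags", "camilabarretohoyos",
]
# 'Total Text Score' for the ten rows displayed in each table of this notebook.
tfidf_total = pd.Series([29, 39, 19, 6, 88, 62, 34, 100, 94, 55], index=artists)
count_total = pd.Series([31, 36, 28, 66, 86, 47, 39, 40, 91, 29], index=artists)

# Spearman correlation = Pearson correlation of the ranks.
rho = tfidf_total.rank().corr(count_total.rank())
print(round(rho, 3))
```

A value close to 1 would mean the two keyword extraction methods order these artists almost identically; a value near 0 would mean the orderings are largely unrelated.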
