# ART RECOMMENDER SYSTEM

## Building a Translator - Word2Vec

After completing the [Data Analysis for MET](03_MET_AnalysisTFIDF.ipynb) and the [Data Analysis for Prado](04_Prado_AnalysisTFIDF.ipynb), we reached a dead end trying to match the topics of one museum to those of the other. We will now try to translate one dataset into the language of the other, so that a single combined dataset can be used to build a new model.

This notebook takes the data produced by [Importing and Cleaning Data - MET](01_MET_LoadClean.ipynb) and [Importing and Cleaning Data - Prado](02_Prado_LoadClean.ipynb) and tries to build word vectors from them.

The translator (for a word in one language, the closest word of the other language in the shared vector space) will then be used to build a new topic model from a single translated dataset in [Data Analysis for Both Museums](06_Museums_AnalysisTFIDF.ipynb).
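As a rough illustration of the "closest word" idea, the sketch below looks up the nearest neighbour of a source word among the words of the other language only. The toy 2-d vectors, the `translate` helper and the hand-written Spanish vocabulary are hypothetical; in this notebook the vectors would come from the trained Word2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translate(word, vectors, target_vocab):
    """Return the word of target_vocab closest to `word` (hypothetical helper)."""
    source_vec = vectors[word]
    return max(target_vocab, key=lambda cand: cosine(source_vec, vectors[cand]))

# Toy vectors invented for illustration only, not taken from the trained model.
toy_vectors = {
    'hand': [0.9, 0.1],
    'sword': [0.2, 0.8],
    'mano': [0.8, 0.2],
    'espada': [0.1, 0.9],
}
toy_spanish_vocab = ['mano', 'espada']

translation = translate('hand', toy_vectors, toy_spanish_vocab)
print(translation)
```

Restricting the candidates to the target vocabulary matters because, in a jointly trained space, the nearest neighbours of a word are usually words of its own language.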

In [1]:
import gensim
import pickle

from textblob import TextBlob
from os import listdir
from os.path import isfile, join

**Load MET Data**

In [2]:
mypath = '../MetFiles/'
onlyfiles = sorted([f for f in listdir(mypath) if isfile(join(mypath, f))])
len(onlyfiles)

34

In [3]:
met_df = []
for p in onlyfiles:
    # Each file holds a pickled list of records; append them to the running list
    with open(join(mypath, p), 'rb') as f:
        met_df.extend(pickle.load(f))
len(met_df)

75528

In [4]:
documents = [x[1] for x in met_df]
documents = list(set(documents))
print(len(documents))
print(documents[-5:])

41596
['Because the mold for such an ambitious piece as this one would have been very costly, this urn was probably one of the most expensive pressed-glass articles of its time. It is decorated in a pattern called "magnet and grape" by collectors. The patent information on the spigot indicates that this piece was manufactured after 1869, although the pattern may date from around 1850. The design motif reflects Victorian delight in ornament appropriate to the object; here, a bunch of grapes for a wine urn.', 'Vertical panel with a textile design that is part of a group of 266 textile designs by the American artist Robert Bryer, possibly made for United Designing Co., since most of the designs carry a stamp of the "United Designing Co. / WOrth 4 - 8975". Some of them also contain a stamp in the verso of the "Original Designing Company, Inc." The collection contains a great variety of designs, from the more traditional floral and stripe patterns, to thematic designs based on various trave
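Note that `list(set(...))` removes duplicates but leaves the resulting order arbitrary, so the printed sample can differ between runs. A minimal sketch of an order-preserving alternative, using made-up descriptions instead of the museum texts:

```python
# Made-up descriptions standing in for the museum texts.
sample_docs = ['Oil on canvas.', 'Pressed glass urn.', 'Oil on canvas.', '']

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts since Python 3.7).
unique_docs = list(dict.fromkeys(sample_docs))

# Optionally drop empty descriptions, which would otherwise become empty token lists.
unique_docs = [doc for doc in unique_docs if doc.strip()]
print(unique_docs)
```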

**Load Prado Data and join them together**

In [5]:
mypath = '../PradoFiles/'
onlyfiles = sorted([f for f in listdir(mypath) if isfile(join(mypath, f))])
len(onlyfiles)

3

In [6]:
prado_df = []
for p in onlyfiles:
    # Each file holds a pickled list of records; append them to the running list
    with open(join(mypath, p), 'rb') as f:
        prado_df.extend(pickle.load(f))
len(prado_df)

9990
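The MET and Prado cells repeat the same loading loop. One possible refactoring is sketched below with a hypothetical `load_pickled_lists` helper; the demo writes two throwaway files to a temporary directory in place of `../MetFiles/` and `../PradoFiles/`, and assumes every pickle file contains a list.

```python
import pickle
import tempfile
from pathlib import Path

def load_pickled_lists(folder):
    """Concatenate the lists stored in every pickle file of `folder`, in file-name order."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with open(path, 'rb') as f:
                records.extend(pickle.load(f))
    return records

# Demo with two small throwaway files (invented records).
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [('part_0.pkl', [('id1', 'text one')]),
                  ('part_1.pkl', [('id2', 'text two'), ('id3', 'text three')])]
    for name, rows in demo_files:
        with open(Path(tmp) / name, 'wb') as f:
            pickle.dump(rows, f)
    demo_records = load_pickled_lists(tmp)

print(len(demo_records))
```

Using `extend` also avoids building a new list on every iteration, which `met_df = met_df + ...` does.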

In [7]:
documents_es = [x[2] for x in prado_df]
documents_es = list(set(documents_es))
print(len(documents_es))
print(documents_es[-5:])

6965
['A lo largo del siglo XVII se pintan grandes series de cuadros para las órdenes religiosas que incluyen figuras exentas de santos y escenas de composición más compleja. Es el caso de este cuadro, que pertenece a una serie de santos realizada hacia 1657 por Valdés Leal para la sacristía del Convento de San Jerónimo de Sevilla. Todos ellos comparten características similares: el santo de pie, visto de abajo a arriba y acompañado de los atributos que le identifican. En esta ocasión son el capelo cardenalicio, la mesa con recado de escribir y el león al que curó la pata cuando estaba retirado haciendo penitencia. La perspectiva y el tamaño dan lugar a una obra muy monumental y solemne. La huella del artista es evidente en toda la obra, de una factura muy libre y muy segura, y se observa sobre todo en el rostro del santo, de gran expresividad. Pertenece a la serie de santos realizada para la sacristía del Convento de San Jerónimo de Sevilla, dispersada en el siglo XIX.', 'A la izquier

In [8]:
documents += documents_es

In [9]:
len(documents)

48561

**Stop Words**  
(Stop words in Spanish were downloaded from [here](https://github.com/stopwords-iso))

In [10]:
import nltk
from nltk.corpus import stopwords
# Spanish: stop-word file plus single letters and Roman numerals
with open('../utils/stopwords-es.txt') as f:
    stop_words_es = f.readlines()
stop_words_es = [sw.replace('\n', '') for sw in stop_words_es]
stop_words_es += 'a b c d e f g h i j k l m n ñ o p q r s t u v w x y z'.split()
stop_words_es += 'i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi xvii xviii xix xx pp'.split()

# English: NLTK list plus the Spanish list, number words and single letters
stop_words_en = stopwords.words('english')
stop_words_en = stop_words_en + stop_words_es
stop_words_en = stop_words_en + 'one two three four five six seven eight nine ten \
                                eleven twelve thirteen fourteen fifteen sixteen seventeen \
                                eighteen nineteen twenty &'.split()
stop_words_en = stop_words_en + 'a b c d e f g h i j k l m n ñ o p q r s t u v w x y z'.split()

In [12]:
stop_words_full = list(set(stop_words_en + stop_words_es))
stop_words_full

['sobre',
 'otras',
 'últimas',
 'despacio',
 'd',
 'she',
 'consideró',
 'buenas',
 'dia',
 'vii',
 'fuisteis',
 'tuve',
 'ampleamos',
 'twenty',
 'habremos',
 'ese',
 'día',
 'pp',
 'haces',
 'posible',
 'somos',
 'hablan',
 'varios',
 'only',
 'hubieras',
 'sabes',
 'then',
 'parece',
 'under',
 'habida',
 'aunque',
 'should',
 'dijeron',
 'informó',
 'habidas',
 'voy',
 'fueran',
 'segunda',
 'been',
 'han',
 'tuvo',
 'casi',
 'poca',
 'está',
 "wouldn't",
 "shouldn't",
 'at',
 'podriamos',
 'who',
 'muchas',
 'estad',
 'tuvieses',
 'ver',
 'trabajo',
 'again',
 'seis',
 'xiii',
 'mejor',
 'algunas',
 'arribaabajo',
 'mi',
 'so',
 'ninguna',
 'xii',
 'siete',
 '1',
 'few',
 're',
 'my',
 'do',
 'tengas',
 'dieron',
 'en',
 'empleas',
 'les',
 'pasado',
 'estarías',
 'ésas',
 'had',
 'solas',
 'didn',
 'podría',
 'señaló',
 'soy',
 'ningunas',
 'tarde',
 'medio',
 'tan',
 'estarán',
 '9',
 'trabajar',
 "mustn't",
 'estuviesen',
 'poder',
 'ayer',
 'pronto',
 'trabajais',
 'verdadera

In [13]:
# Word2Vec expects a list of tokenized documents (one list of words per document).

# texts = [[word for word in document.lower().split() if word not in stop_words_en]
#          for document in documents]
stop_words_set = set(stop_words_full)  # set membership tests are faster than list lookups
texts = [[word for word in TextBlob(document.lower()).words if word.lower() not in stop_words_set]
         for document in documents]
print(texts[:5])

[[], ['postcard', 'section', 'hall', 'science', 'century', 'progress', 'international', 'exposition', 'chicago', '1933', 'pc225-1', 'found', 'album', '435', 'page', '15'], ['published', 'cesare', 'vecellio', 'italian', 'pieve', 'di', 'cadore', '1521-1601', 'venice', 'venice.from', 'top', 'bottom', 'left', 'right', 'design', 'composed', 'horizontal', 'registers', 'top', 'register', 'formed', 'top', 'edge', 'zigzagging', 'line', 'flower', 'vines', 'curve', 'around', 'central', 'bud', 'bottom', 'register', 'decorated', 'circles', 'frame', 'flower', 'center', 'left', 'right', 'circles', 'contain', 'type', 'flower'], ['born', 'budapest', 'hungary', 'agnes', 'denes', 'educated', 'sweden', 'united', 'states', 'began', 'career', 'painter', 'since', 'expanded', 'activities', 'encompass', 'wide', 'range', 'media', 'including', 'drawing', 'printmaking', 'photography', 'site-specific', 'sculpture', 'environmental', 'art', 'although', 'fit', 'neatly', 'specific', 'school', 'philosophical', 'approac
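The filtering step can also be sketched without TextBlob. The version below swaps in a simple regular-expression tokenizer (an assumption, not this notebook's tokenizer), keeps the stop words in a set, and skips documents that end up empty, like the first entry in the output above. The stop-word set and sample sentences are invented for the demo.

```python
import re

# Tiny stand-in stop-word set; the notebook uses stop_words_full instead.
demo_stop_words = {'the', 'of', 'a', 'la', 'de', 'el'}

# Runs of word characters, allowing internal hyphens (e.g. 'site-specific').
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")

def tokenize(document, stop_words):
    """Lowercase the text, split it into tokens, and drop stop words."""
    return [tok for tok in TOKEN_RE.findall(document.lower()) if tok not in stop_words]

demo_docs = ['The hand of the saint', 'La mano de el santo', 'the of']
tokenized = (tokenize(doc, demo_stop_words) for doc in demo_docs)
demo_texts = [tokens for tokens in tokenized if tokens]  # drop empty documents
print(demo_texts)
```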

**Word2Vec Model using our data**

In [14]:
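# gensim < 4.0 API (in gensim >= 4.0 the `size` argument is named `vector_size`); sg=1 selects skip-gram
model_w2v = gensim.models.Word2Vec(texts, size=100, window=5, min_count=5, workers=2, sg=1)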

In [15]:
# Take a look at the vocabulary (gensim < 4.0; in gensim >= 4.0 use model_w2v.wv.key_to_index)
list(model_w2v.wv.vocab.items())[:7]

[('postcard', <gensim.models.keyedvectors.Vocab at 0x7ff369e05e48>),
 ('section', <gensim.models.keyedvectors.Vocab at 0x7ff369e05e80>),
 ('hall', <gensim.models.keyedvectors.Vocab at 0x7ff369e05e10>),
 ('science', <gensim.models.keyedvectors.Vocab at 0x7ff369e05da0>),
 ('century', <gensim.models.keyedvectors.Vocab at 0x7ff369e05f98>),
 ('progress', <gensim.models.keyedvectors.Vocab at 0x7ff369e05eb8>),
 ('international', <gensim.models.keyedvectors.Vocab at 0x7ff369e05ef0>)]

**Test and save our Word2Vec model**

In [24]:
model_w2v.wv.most_similar('mano', topn=8)

[('sostiene', 0.8453785181045532),
 ('apoya', 0.7846434116363525),
 ('sujeta', 0.7823660969734192),
 ('espada', 0.7738539576530457),
 ('cuchillo', 0.756999135017395),
 ('pierna', 0.7506818771362305),
 ('sujetando', 0.7477926015853882),
 ('abanico', 0.745064377784729)]

In [26]:
model_w2v.wv.similarity('fruits', 'frutas')

0.3196284894902604
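A single pair gives only a narrow view of translation quality. The sketch below scores a list of candidate pairs with any similarity function; `score_pairs` is a hypothetical helper that assumes the function raises `KeyError` for out-of-vocabulary words. The lookup table stands in for the model's `similarity` method, with the 'fruits'/'frutas' value rounded from the cell above and a second pair deliberately left out to show the missing-word case.

```python
def score_pairs(pairs, similarity_fn):
    """Return {pair: similarity}; None marks a pair with a word missing from the vocabulary."""
    scores = {}
    for en_word, es_word in pairs:
        try:
            scores[(en_word, es_word)] = similarity_fn(en_word, es_word)
        except KeyError:
            scores[(en_word, es_word)] = None
    return scores

# Stand-in for the trained model: a lookup table instead of real vectors.
fake_table = {('fruits', 'frutas'): 0.32}

def fake_similarity(word_a, word_b):
    """Look the pair up in the table; unknown pairs raise KeyError."""
    return fake_table[(word_a, word_b)]

pair_scores = score_pairs([('fruits', 'frutas'), ('hand', 'mano')], fake_similarity)
print(pair_scores)
```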

In [18]:
with open('../models/Word2Vec_model1.pkl', 'wb') as f:
    pickle.dump(model_w2v, f)

**Conclusions:**
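- The volume of data is not enough to produce an acceptable Word2Vec translator.
- Within a single language the vector space looks reasonable (the nearest neighbours of 'mano' are all related Spanish words), but distances between words of different languages are not acceptable: 'fruits' and 'frutas' only reach a similarity of about 0.32.
- The next step will be to work with pre-existing word vectors to achieve a better result.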
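One way to quantify the language separation noted above is to measure how many nearest neighbours of a word belong to its own language. The sketch below uses the neighbours of 'mano' printed earlier in this notebook (scores rounded); `same_language_share` and the hand-written vocabulary set are hypothetical, and in practice the set could be built from the tokens of the Prado texts.

```python
def same_language_share(neighbours, language_vocab):
    """Fraction of neighbour words that belong to language_vocab."""
    if not neighbours:
        return 0.0
    hits = sum(1 for word, _score in neighbours if word in language_vocab)
    return hits / len(neighbours)

# Neighbours of 'mano' as printed earlier in this notebook (scores rounded).
mano_neighbours = [('sostiene', 0.845), ('apoya', 0.785), ('sujeta', 0.782),
                   ('espada', 0.774), ('cuchillo', 0.757), ('pierna', 0.751),
                   ('sujetando', 0.748), ('abanico', 0.745)]

# Hand-written stand-in for the Spanish vocabulary.
prado_vocab_demo = {'sostiene', 'apoya', 'sujeta', 'espada', 'cuchillo',
                    'pierna', 'sujetando', 'abanico', 'mano'}

share = same_language_share(mano_neighbours, prado_vocab_demo)
print(share)
```

A share close to 1 for most words would suggest that the two languages occupy separate regions of the space, which is consistent with the low cross-language similarity observed.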