# Natural Language Processing (NLP)

---

## Semantic search and query expansion

This notebook builds a simple search engine that lets users find artworks matching a natural-language query. Queries are matched against each artwork's title and its natural-language description. Artwork information comes from [WikiArt](https://www.wikiart.org/). The search engine is built on [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html), an open-source Python search-engine library, and expands query terms using [pre-trained word embeddings from the Nordic Language Processing Laboratory \(NLPL\)](http://vectors.nlpl.eu/repository/).

In [1]:
# Import libraries
import json
import numpy as np
import pandas as pd
import os
import requests
from urllib.request import urlopen

import nltk 
from nltk.stem import WordNetLemmatizer
from nltk import wordnet
nltk.download('wordnet') # large lexical database
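#nltk.download('omw-1.4')
#nltk.download('punkt') # tokenizer models used by sent_tokenize/word_tokenize, if not already installed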
from nltk.corpus import stopwords
nltk.download('stopwords')
import re

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh import index
from whoosh.qparser import MultifieldParser
#from whoosh.reading import IndexReader

from gensim.models import KeyedVectors

import ipywidgets as widgets
from PIL import Image

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hannahtempest/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannahtempest/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Set up language model for query-term expansion

This code uses the following pre-trained word embedding model:

|   |   |
|---|---|
| ID: | 12 |
| Download link:| http://vectors.nlpl.eu/repository/20/12.zip |
| Vector size: | 300 |
| Window: | 5 |
| Corpus: | Gigaword 5th Edition |
| Vocabulary size: | 292,479 |
| Algorithm: | Gensim Continuous Skipgram |
| Lemmatization: | False |

The model has been downloaded from the [online repository](http://vectors.nlpl.eu/repository/) and saved locally.
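
The unpacking step is not shown in this notebook. As a minimal sketch, assuming the archive has already been downloaded to a local file such as `12.zip` and that it contains `model.bin` at its top level, it could be extracted as follows. The function name `extract_model` and the file names are hypothetical.

```python
import zipfile
from pathlib import Path

def extract_model(zip_path, target_dir):
    """Unpack a downloaded model archive and return the path to model.bin."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    model_file = target / "model.bin"
    # Assumption: the binary embeddings sit at the top level of the archive
    if not model_file.exists():
        raise FileNotFoundError(f"model.bin not found in {zip_path}")
    return model_file
```

Under these assumptions, `extract_model('12.zip', '12')` would produce the `12/model.bin` path used in the next cell.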

In [2]:
# Set model path
word_embeddings_model_path = '12/model.bin'

# Set search-engine path
search_engine_index_path = 'whoosh_artwork_index'

# Define a function to load embeddings from a pretrained model in word2vec binary format
def load_model(path):
    # Use the `path` argument rather than the global variable
    word_embeddings_model = KeyedVectors.load_word2vec_format(path, binary=True)
    return word_embeddings_model

In [3]:
# Use the above function to load the model
model = load_model(word_embeddings_model_path)

In [4]:
# Define a function to use the language model to get synonyms for a given term
def get_similar_words(model, search_term):
    similarity_list = model.most_similar(search_term, topn=5)
    similar_words = [sim_tuple[0] for sim_tuple in similarity_list]
    return similar_words
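
To illustrate what `most_similar` does conceptually, the sketch below ranks words by cosine similarity to the query term, using made-up three-dimensional vectors in place of the 300-dimensional NLPL embeddings. It is a toy re-implementation for illustration, not gensim's code; like gensim, it raises a `KeyError` for out-of-vocabulary terms.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, term, topn=5):
    """Rank every other word by cosine similarity to `term`."""
    if term not in vectors:
        raise KeyError(f"'{term}' is not in the vocabulary")
    query = vectors[term]
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items() if word != term]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up vectors, for illustration only
toy_vectors = {
    "sea": [0.9, 0.1, 0.0],
    "ocean": [0.8, 0.2, 0.1],
    "wave": [0.7, 0.3, 0.2],
    "portrait": [0.0, 0.9, 0.4],
}
similar = [word for word, _ in most_similar(toy_vectors, "sea", topn=2)]
print(similar)  # ['ocean', 'wave']
```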

## Get artwork information

The artworks included in the search engine are the most-viewed artworks on WikiArt from the last 30 days, so they will change periodically. Artwork information is gathered using the [WikiArt API](https://www.wikiart.org/en/App/GetApi), which returns information in JSON format.

In [5]:
# The 600 most-viewed paintings on WikiArt from the last 30 days
# Store the URL in `url` as parameter for urlopen
url = "https://www.wikiart.org/en/App/Painting/MostViewedPaintings?randomSeed=123&json=2&inPublicDomain={true/false}"
  
# Store the data from urlopen() call to URL as a new variable called `response`
response = urlopen(url)
  
# Use JSON to load the data
data = json.loads(response.read())

# Put the JSON data into a pandas DataFrame
df = pd.DataFrame(data)
artwork_ids = [df['contentId']]
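
As a rough picture of the data being loaded, the sketch below parses a hand-written payload with the same field names that are used later in this notebook (`title`, `contentId`, `artistName`, `image`). The titles and IDs are taken from the output shown further down; the image URLs are placeholders, and the real response is assumed to contain further fields.

```python
import json

# Hand-written sample; the image URLs are hypothetical placeholders
sample_payload = """
[
  {"title": "Mona Lisa", "contentId": 225189,
   "artistName": "Leonardo da Vinci", "image": "https://example.org/mona-lisa.jpg"},
  {"title": "The Starry Night", "contentId": 207190,
   "artistName": "Vincent van Gogh", "image": "https://example.org/starry-night.jpg"}
]
"""

# json.loads turns the JSON array into a list of dictionaries, one per artwork
records = json.loads(sample_payload)
content_ids = [record["contentId"] for record in records]
print(content_ids)  # [225189, 207190]
```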

This JSON data includes lots of useful information, but doesn't include the natural-language description of the artworks. Each artwork description is found on the artwork's individual page, the content of which is also accessible in JSON format through WikiArt's API.

In [6]:
# Create an empty Python dictionary into which to put the descriptions
artwork_description_dict = {}

# Use a for-loop to access each individual artwork page and get each artwork description. 
for artwork_id in artwork_ids[0]:
    start_artwork_url = 'https://www.wikiart.org/en/App/Painting/ImageJson/'
    full_artwork_url = start_artwork_url+str(artwork_id)

    artwork_response = urlopen(full_artwork_url)
    artwork_data = json.loads(artwork_response.read())
    description = artwork_data['description']
    # Append the artwork description as 'value' in the dictionary, with artwork_id as the key
    artwork_description_dict[artwork_id] = description
    
# Keep the (artwork_id, description) pairs. After cleaning, the descriptions are put into
# a pandas DataFrame and merged with the rest of the artwork information
descriptions_data = artwork_description_dict.items()
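
The loop above stops at the first failed request. A more defensive variant might look like the sketch below, which returns `None` when a page cannot be fetched or parsed. The helper `fetch_description`, its `opener` parameter (included so the function can be exercised without network access) and the 10-second timeout are assumptions, not part of the original notebook.

```python
import json
from urllib.request import urlopen

BASE_URL = 'https://www.wikiart.org/en/App/Painting/ImageJson/'

def fetch_description(artwork_id, opener=urlopen, timeout=10):
    """Return the description for one artwork, or None if it cannot be fetched."""
    try:
        with opener(BASE_URL + str(artwork_id), timeout=timeout) as response:
            payload = json.loads(response.read())
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed JSON
        return None
    return payload.get('description')
```

With this helper, the loop body could be reduced to `artwork_description_dict[artwork_id] = fetch_description(artwork_id)`; missing descriptions are already skipped by the `is not None` check in the cleaning step.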

In [7]:
wn1 = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

The descriptions need a bit of processing and cleaning up. I will remove the URLs and tags, convert to lowercase, remove stopwords and lemmatize, using tools from NLTK.

In [8]:
processed_descriptions_dict = {}
#pattern = re.compile("^([A-Z][0-9]+)+$")
pattern = re.compile(r'\w')
url_pattern = r'(url)|(href=.+?])'

for artwork_id in artwork_description_dict:
    description = artwork_description_dict[artwork_id]
    if description is not None:
        sents = nltk.sent_tokenize(description)
        sents_no_urls = [re.sub(url_pattern, '', sent) for sent in sents]
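        # Tokenize each sentence into words (list comprehension with the NLTK word tokenizer)
        words = [nltk.word_tokenize(sent) for sent in sents_no_urls]
        # Flatten and drop stop words; the check is case-sensitive, so capitalised
        # stop words such as 'The' are kept
        all_words = [item for sublist in words for item in sublist if item not in stop_words]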
        words_lower = [word.lower() for word in all_words if pattern.match(word)]
        words_lemmatized = [wn1.lemmatize(word) for word in words_lower]
        processed_desc = ' '.join(words_lemmatized)
        processed_descriptions_dict[artwork_id] = processed_desc
        
descriptions_list = [(k, v) for k, v in processed_descriptions_dict.items()]
description_df = pd.DataFrame(descriptions_list)
description_df.columns = ['contentId', 'description']
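
The cleaning steps can be illustrated without NLTK. The sketch below is a simplified variant, not the cell above: it uses a tiny hand-picked stop-word list instead of NLTK's, omits lemmatization, and lowercases before filtering so that capitalised stop words are removed as well. The sample sentence and its tag markup are invented for illustration.

```python
import re

# Tiny illustrative stop-word list (the notebook uses NLTK's English list)
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "by"}
# Same URL/tag pattern as in the notebook
URL_PATTERN = re.compile(r'(url)|(href=.+?])')
TOKEN_PATTERN = re.compile(r'\w+')

def clean_description(text):
    """Strip URL markup, lowercase, tokenize and drop stop words."""
    text = URL_PATTERN.sub('', text)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return ' '.join(token for token in tokens if token not in STOP_WORDS)

sample = "The Starry Night is a painting by [url href=https://example.org]Van Gogh[/url]."
print(clean_description(sample))  # starry night painting van gogh
```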

In [9]:
# Merge the dataframes
complete_df = df.merge(description_df, on='contentId', how='left', sort=False)
complete_df.replace('None', np.nan, inplace=True)  # np.NaN was removed in NumPy 2.0
complete_df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
database_df = complete_df[['title', 'contentId', 'artistName', 'image', 'description']].copy()
database_df.head()

|   | title | contentId | artistName | image | description |
|---|---|---|---|---|---|
| 0 | Mona Lisa | 225189 | Leonardo da Vinci | https://uploads0.wikiart.org/00339/images/leon... | one iconic recognizable painting world mona li... |
| 1 | The Starry Night | 207190 | Vincent van Gogh | https://uploads4.wikiart.org/00142/images/vinc... | van gogh night sky field roiling energy below ... |
| 2 | The Persistence of Memory | 221654 | Salvador Dali | https://uploads.wikiart.org/Content/images/FRA... | the persistence memory 1931 one iconic recogni... |
| 3 | In Bed, The Kiss | 230453 | Henri de Toulouse-Lautrec | https://uploads8.wikiart.org/images/henri-de-t... | this captivating 1892 artwork in bed the kiss ... |
| 4 | The Birth of Venus | 189114 | Sandro Botticelli | https://uploads6.wikiart.org/images/sandro-bot... | the birth venus painted sandro botticelli 1480... |


In [10]:
# Add artwork data to Whoosh index 

schema = Schema(title=TEXT(stored=True, analyzer=StemmingAnalyzer()), 
                contentId=ID(stored=True),
                artistName=TEXT(stored=True),
                imageURL=ID(stored=True),
                description=TEXT(stored=True, analyzer=StemmingAnalyzer()))

if not os.path.exists(search_engine_index_path):
    os.mkdir(search_engine_index_path)

ix = index.create_in(search_engine_index_path, schema)

with ix.writer() as writer:
    for i in range(len(database_df)):
        writer.add_document(title=str(database_df.title.iloc[i]), 
                            contentId=str(database_df.contentId.iloc[i]), 
                            artistName=str(database_df.artistName.iloc[i]), 
                            imageURL=str(database_df.image.iloc[i]), 
                            description=str(database_df.description.iloc[i]))
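
To make the role of the index concrete, here is a toy inverted index in plain Python. It is not how Whoosh is implemented (Whoosh also applies stemming and relevance scoring); the function names `build_inverted_index` and `search_any` are hypothetical, and the two descriptions are shortened from the DataFrame output above.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to the set of document IDs whose text contains it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            inverted[token].add(doc_id)
    return inverted

def search_any(inverted, terms):
    """Return IDs of documents containing at least one term (an OR query)."""
    matches = set()
    for term in terms:
        matches |= inverted.get(term.lower(), set())
    return matches

docs = {
    "225189": "one iconic recognizable painting world",
    "207190": "van gogh night sky field roiling energy",
}
toy_index = build_inverted_index(docs)
print(search_any(toy_index, ["night", "evening"]))  # {'207190'}
```

Looking terms up in a prebuilt mapping avoids scanning every description at query time, which is the main reason for indexing the artworks before searching.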

In [11]:
# https://stackoverflow.com/questions/19477319/whoosh-accessing-search-page-result-items-throws-readerclosed-exception
# http://annamarbut.blogspot.com/2018/08/whoosh-pandas-and-redshift-implementing.html
# https://ai.intelligentonlinetools.com/ml/search-text-documents-whoosh/
def index_search(search_query):
    
    print(f"You entered '{search_query}'.")
    
    ix = index.open_dir(search_engine_index_path)
    schema = ix.schema
    
    try:
        other_words = get_similar_words(model, search_query)
        final_search_query = ' OR '.join([search_query] + other_words)
        print('Synonyms: {}\n'.format(other_words))
    except KeyError:
        # The term (or multi-word phrase) is not in the embedding vocabulary
        final_search_query = search_query
        print('No synonyms available for your search term.')
    
    q = MultifieldParser(['title', 'description'], schema).parse(final_search_query)
    
    results = ix.searcher().search(q)
    
    if len(results)==0:
        print('\nSorry, there are no good matches for your search term.\nWould you like to try entering something different?')
    
    else:
        top_image_url = results[0]['imageURL']
        top_artist = results[0]['artistName']
        top_title = results[0]['title']
        # WikiArt serves this placeholder frame when an artwork has no image
        placeholder_url = 'https://uploads.wikiart.org/Content/images/FRAME-600x480.jpg'
        try:
            print("\nTop search result(s) for '{}':".format(search_query))
            for i, result in enumerate(results):
                artist = result['artistName']
                title = result['title']
                print(i+1, ': \'{}\' by {}'.format(title, artist))

            if str(top_image_url) == placeholder_url:
                try:
                    # This URL returns an HTML results page rather than an image file,
                    # so Image.open() is expected to fail and the message below is shown
                    top_image_url = 'https://www.google.com/search?tbm=isch&q=find'+str(top_artist)+str(top_title)
                    top_image = Image.open(requests.get(top_image_url, stream=True).raw)
                    display(top_image)
                except Exception:
                    print('\nSorry, no image is available for your top artwork.')

                # Show the first lower-ranked result that has a real image
                try:
                    for j in range(1, len(results)):
                        backup_image_url = results[j]['imageURL']
                        if str(backup_image_url) == placeholder_url:
                            continue
                        backup_image = Image.open(requests.get(backup_image_url, stream=True).raw)
                        print('\nHere\'s an image of your number {} result:'.format(j+1))
                        display(backup_image)
                        break
                except Exception:
                    pass

            else:
                print('\nHere\'s an image of your top result:')
                top_image = Image.open(requests.get(top_image_url, stream=True).raw)
                display(top_image)

            # Return the top-ranked result, not the last one printed in the loop
            return top_artist, top_title

        except Exception as e:
            print('\nSorry, an error occurred.')
            print(e)
        pass        
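
The query-expansion step inside `index_search` can be isolated as a small helper. In Whoosh's default query language, an uppercase `OR` between terms matches documents containing any of them. The sketch below builds such a query string and additionally skips duplicate terms; the name `build_expanded_query` and the example neighbours are hypothetical, not output from the NLPL model.

```python
def build_expanded_query(search_term, similar_words):
    """Join the original term and its nearest neighbours into one OR query."""
    terms = [search_term]
    for word in similar_words:
        # Skip case-insensitive duplicates so each term appears only once
        if word.lower() not in (term.lower() for term in terms):
            terms.append(word)
    return ' OR '.join(terms)

expanded = build_expanded_query('sea', ['ocean', 'Sea', 'waters'])
print(expanded)  # sea OR ocean OR waters
```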
    

## Artwork search engine

In [12]:
search_engine_prompt = widgets.Label(value='Enter your search term to see the best-matching artwork:')
text_input = widgets.Text(value='', placeholder='Type something', disabled=False, continuous_update=False)
search_engine_input = widgets.HBox([search_engine_prompt, text_input])

def f(query):
    # Skip the search until the user has typed something
    if not query.strip():
        return None
    return index_search(query)
        
out = widgets.interactive_output(f, {'query': text_input})

widgets.VBox([search_engine_input, out])
