# Wiki text extraction

The ```wikiExtractor``` function extracts the introduction text of a given wiki topic through the MediaWiki API. Plain text is returned.

The ```wiki_find_redirect``` function resolves redirects: it returns the canonical page title when several topics point to the same wiki page.
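
As a rough illustration of the request and response shapes involved, the sketch below builds the query string with the standard library and reads the first page out of a response. The helper names ```build_query_url``` and ```first_page``` are made up for this example, the sample payload (including its page id) is hand-written, and folding the redirect handling into the same request is a choice made for the sketch, not what the notebook code does. No network call is made.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_query_url(topic, intro_only=True):
    """Build an extracts query for `topic`; urlencode escapes spaces and '&'."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,   # plain text instead of HTML
        "redirects": 1,     # follow redirects to the canonical page
        "titles": topic,
    }
    if intro_only:
        params["exintro"] = 1  # only the text before the first section heading
    return API_URL + "?" + urlencode(params)

def first_page(payload):
    """Return the first (for a single title, the only) page record."""
    pages = payload["query"]["pages"]
    return next(iter(pages.values()))

# Hand-written payload in the shape the notebook code indexes into (illustrative values).
sample_payload = {
    "query": {"pages": {"12345": {"title": "Ancient Egypt",
                                  "extract": "Ancient Egypt was a civilization ..."}}}
}

url = build_query_url("Ancient Egypt")
page = first_page(sample_payload)
print(url)
print(page["title"])
```

Letting ```urlencode``` assemble the query string avoids malformed URLs when a topic contains spaces or ampersands, which plain string concatenation does not guard against.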

In [1]:
import json
import requests

In [2]:
def wiki_find_redirect(topic):
    
    url = 'https://en.wikipedia.org/w/api.php?action=query&titles=' + topic + '&redirects&format=json'
    r = requests.get(url)
    pageid = list(json.loads(r.text)['query']['pages'].keys())[0]
    title = json.loads(r.text)['query']['pages'][pageid]['title']
    return title

def wikiExtractor(topic):
        
    url = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&exintro&explaintext&titles=' + topic
    # if the full text of the article needs to be extracted
    #url = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&explaintext&titles=' + topic
    page_r = requests.get(url)
    page_content = page_r.content
    page_json = json.loads(page_content)

    pageid = list(page_json['query']['pages'].keys())[0]
    text = page_json['query']['pages'][pageid]['extract']
    
    return text

# Data preprocessing

The event data is scraped from nyc.com and is stored as text. 
```preprocess``` removes invalid characters from the text and prepares the data for later steps. All event descriptions are stored separately in ```event_list_desc```. The TF-IDF model is defined on these descriptions.
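
To make the effect of the cleaning concrete, here is a minimal sketch that applies the same kinds of substitutions to one made-up sentence. The function name ```clean_sample``` and the input string are illustrative only; the notebook's own ```preprocess``` covers a few more cases (newlines, URLs).

```python
import re

def clean_sample(text):
    """Apply the cleaning substitutions in order on a single string."""
    text = text.replace("&#x27;", "'")                  # HTML-escaped apostrophe
    text = text.replace("B.C.", "BC").replace("A.D.", "AD")
    text = text.replace("&amp;", "and")
    text = re.sub(r"(\d+),(\d+)", r"\1\2", text)        # 36,000 -> 36000
    text = re.sub(r"&#x(.*?);", " ", text)              # any remaining hex entity
    return text

raw = "The Museum&#x27;s 36,000 objects (2450 B.C.&#x2013;4th century A.D.) &amp; more"
cleaned = clean_sample(raw)
print(cleaned)
# The Museum's 36000 objects (2450 BC 4th century AD) and more
```

The order matters: specific entities such as ```&#x27;``` are replaced first, because the generic ```&#x...;``` pattern at the end would otherwise turn them into spaces.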

In [3]:
data_path = '../insight_api_data/'

In [4]:
import json
import re

with open(data_path + 'event_list.txt', 'r') as f:
    event_list = json.load(f)

event_list_desc = []
for i, event in enumerate(event_list):
    event_list_desc.append(event_list[i]['description'])

Total number of events in the dataset:

In [5]:
len(event_list_desc)

217

In [6]:
for i, event in enumerate(event_list_desc):
    event_list_desc[i] = event_list_desc[i].replace('&#x27;', "'")
    event_list_desc[i] = event_list_desc[i].replace('&#x2019;', "'")
    event_list_desc[i] = event_list_desc[i].replace('B.C.', "BC")
    event_list_desc[i] = event_list_desc[i].replace('A.D.', "AD")
    event_list_desc[i] = event_list_desc[i].replace('&amp;', "and")

    # remove ',' in numbers
    event_list_desc[i] = re.sub(r'(\d+),(\d+)', r'\1\2', event_list_desc[i])
    event_list_desc[i] = re.sub('&#x(.*?);', ' ', event_list_desc[i])
    event_list_desc[i] = re.sub('http(.+?) ', '', event_list_desc[i])

In [7]:
def preprocess(text):
    
    text = text.replace('\n', ' ')
    text = text.replace('&#x27;', "'")
    text = text.replace('&#x2019;', "'")
    
    # remove the dots in B.C. and A.D.
    text = text.replace('B.C.', "BC")
    text = text.replace('A.D.', "AD")
    
    text = text.replace('&amp;', "and")

    # remove ',' in numbers
    text = re.sub(r'(\d+),(\d+)', r'\1\2', text)
    text = re.sub('&#x(.*?);', ' ', text)
    text = re.sub('http(.+?) ', '', text)
    
    return text

# Keyword extraction

Three types of keywords are considered in the model:
- Named entities, e.g. "Claude Monet", "Africa".
- Proper noun(s) + normal noun, e.g. "Greek sculptures".
- Adjectives + normal noun, e.g. "contemporary arts".

The functions ```NER_phrases```, ```nnp_nn``` and ```jj_nn``` are defined to find these three types of keywords.

### Named entity recognition

spaCy is used for *named entity recognition (NER)*. The entity types considered, such as person, location and work of art, are set by ```spacy_labels```.

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")

spacy_labels = set(['PERSON', 'NORP', 'ORG', 'GPE', 'LOC', 'WORK_OF_ART'])
def NER_phrases(text):
    phrases = {}
    spacy_phrases = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in spacy_labels:
            phrase = ent.text
            phrase = phrase.replace('the ', '')
            phrase = phrase.replace('The ', '')
            phrase = phrase.replace("'s", "")
            spacy_phrases.append(phrase)
    phrases['spacy_phrases'] = spacy_phrases
    phrases['nnp_nn'] = nnp_nn(text)
    phrases['jj_nn'] = jj_nn(text)
    
    return phrases

### PoS tagging and pattern recognition
To find patterns like proper noun(s) + normal noun and adjectives + normal noun, part-of-speech (PoS) tagging is needed. The NLTK package is used for tagging and pattern recognition.

In [9]:
import nltk

The ```doc2tag``` function converts text into a list of (word, tag) pairs.

In [10]:
def doc2tag(text):
    sentences = nltk.sent_tokenize(text)
    tag_list = []
    for s in sentences:
        tokens = nltk.word_tokenize(s)
        text_tagged = nltk.pos_tag(tokens)
        pair = [(word, pos) for (word, pos) in text_tagged]
        tag_list.extend(pair)
    
    return tag_list

Patterns like proper noun(s) + normal noun and adjective(s) + normal noun are captured by regex-like chunk grammars defined in NLTK.
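
The idea behind such tag patterns can be sketched without NLTK: encode the tag sequence as a string of ```<TAG>``` tokens and run an ordinary regular expression over it. This is a stand-in written for illustration, not NLTK's implementation; the helper ```chunk```, the two pattern constants, and the hand-assigned tags are all assumptions (a real tagger may label these words differently).

```python
import re

# Plain-regex counterparts of the chunk grammars <NNP>+(<NNS>|<NN>+) and <JJ>+(<NN>+)
NNP_NN = r"(?:<NNP>)+(?:<NNS>|(?:<NN>)+)"
JJ_NN = r"(?:<JJ>)+(?:<NN>)+"

def chunk(tagged, tag_pattern):
    """Return the word spans whose tag sequence matches `tag_pattern`."""
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged)
    phrases = []
    for match in re.finditer(tag_pattern, tag_string):
        start = tag_string[:match.start()].count("<")   # index of first matched token
        length = match.group(0).count("<")              # number of matched tokens
        words = [word for word, _ in tagged[start:start + length]]
        phrases.append(" ".join(words))
    return phrases

# Hand-tagged sentence (tags assigned by hand for this example)
tagged = [("The", "DT"), ("museum", "NN"), ("shows", "VBZ"),
          ("Greek", "NNP"), ("sculptures", "NNS"), ("and", "CC"),
          ("contemporary", "JJ"), ("art", "NN")]

print(chunk(tagged, NNP_NN))  # ['Greek sculptures']
print(chunk(tagged, JJ_NN))   # ['contemporary art']
```

Because every tag is wrapped in angle brackets, ```<NN>``` cannot accidentally match inside ```<NNP>``` or ```<NNS>```.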

In [11]:
def nnp_nn(text):
    patterns = "NNP_NN: {<NNP>+(<NNS>|<NN>+)}" # at least one NNP followed by NNS or at least one NN
    parser = nltk.RegexpParser(patterns)
    p = parser.parse(doc2tag(text))
    phrase = []
    for node in p:
        if type(node) is nltk.Tree:
            phrase_str = ''
            for w in node:
                phrase_str += w[0]
                phrase_str += ' '
            phrase_str = phrase_str.strip()
            phrase.append(phrase_str)
    return phrase

def jj_nn(text):
    patterns = "JJ_NN: {<JJ>+(<NN>+)}" # at least one JJ followed by at least one NN
    parser = nltk.RegexpParser(patterns)
    p = parser.parse(doc2tag(text))
    phrase = []
    for node in p:
        if type(node) is nltk.Tree:
            phrase_str = ''
            for w in node:
                phrase_str += w[0]
                phrase_str += ' '
            phrase_str = phrase_str.strip()
            phrase.append(phrase_str)
    return phrase

The keywords/phrases are captured and converted into tokens after lemmatization and conversion to lower case.

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [13]:
def extract_key_tokens(text):
    
    phrases = {}
    spacy_phrases = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in spacy_labels:
            phrase = ent.text
            phrase = phrase.replace('the ', '')
            phrase = phrase.replace('The ', '')
            phrase = phrase.replace("'s", "")
            spacy_phrases.append(phrase)
    phrases['spacy_phrases'] = spacy_phrases
    phrases['nnp_nn'] = nnp_nn(text)
    phrases['jj_nn'] = jj_nn(text)
    
    # add all phrases to the tokens; split and lemmatize nnp_nn, jj_nn
    tokens = set()
    for p in phrases['spacy_phrases']:
        tokens.add(p.lower())
    for p in phrases['nnp_nn']:
#         tokens.add(p.lower())
#         tokens.update(p.lower().split(' '))
        tokens.add(lemmatizer.lemmatize(p.lower()))
        for word in p.lower().split(' '):
            tokens.add(lemmatizer.lemmatize(word))
    for p in phrases['jj_nn']:
#         tokens.add(p.lower())
#         tokens.update(p.lower().split(' '))
        tokens.add(lemmatizer.lemmatize(p.lower()))
        for word in p.lower().split(' '):
            tokens.add(lemmatizer.lemmatize(word))
    
    return tokens

# TF-IDF

The event matching uses the TF-IDF algorithm. TF stands for term frequency: how often a term appears in a document. IDF stands for inverse document frequency: the logarithm of the total number of documents divided by the number of documents that contain the term. The dimension of the TF-IDF vector is determined by the number of terms considered. In this model, IDF is calculated only from the event descriptions; TF is calculated for each document (either a wiki article or an event description).

Build a list of tokens of keywords in all the event descriptions:

In [14]:
event_list_key_tokens = []
for t in event_list_desc:
    event_list_key_tokens.append(extract_key_tokens(t))

Put all the keyword tokens into a bag of words:

In [15]:
BoW = set()
for event in event_list_key_tokens:
    BoW.update(event)

### Inverse document frequency

The inverse document frequency (IDF) of a token is defined as log(doc_count / doc_count_containing_token). Stop words are excluded from the IDF calculation.
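
A small worked example of this definition, using four made-up token sets in place of the event descriptions. The function name ```idf_table``` and the toy data are assumptions for illustration.

```python
import math

def idf_table(doc_tokens, stop=frozenset()):
    """IDF per token: log(number of docs / number of docs containing the token)."""
    doc_count = len(doc_tokens)
    vocab = set().union(*doc_tokens) - set(stop)
    return {
        token: math.log(doc_count / sum(1 for doc in doc_tokens if token in doc))
        for token in vocab
    }

# Toy corpus: one set of key tokens per "event description"
docs = [
    {"egypt", "art", "museum"},
    {"art", "monet"},
    {"egypt", "pharaoh"},
    {"jazz"},
]

idf = idf_table(docs, stop={"museum"})
print(idf["egypt"])  # log(4 / 2), in 2 of 4 documents
print(idf["jazz"])   # log(4 / 1), in 1 of 4 documents
```

Tokens that occur in few documents get a larger weight, and a token that occurs in every document gets log(1) = 0.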

In [16]:
from collections import defaultdict
import numpy as np

In [17]:
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")

# Import the common English stopwords and add custom stopwords.
stop = set(stopwords.words('english'))
stop_words = set(['event', 'collection', 'street', 'many', 
                  'exhibitions', 'works', 'monday', 'tuesday', 
                  'wednesday', 'thursday', 'friday', 'saturday', 
                  'sunday', 'new', 'york', 'new york', 'new york city',
                  'visit', 'museum', 'world', 'department', 'NYC'
                 ])
stop.update(stop_words)

In [18]:
idf_dict = defaultdict(int)
D = len(event_list_desc)
for t in BoW:
    if t in stop:
        continue
    for event in event_list_key_tokens:
        if t in event:
            idf_dict[t] += 1
    idf_dict[t] = np.log(D/idf_dict[t])

In this model, only uni-, bi- and tri-grams are considered for the IDF space.

In [19]:
ngram = 3
key_tokens = set([key for key in idf_dict.keys() if len(key.split()) <= ngram])

The total number of tokens included is:

In [20]:
len(key_tokens)

4242

### Term frequency

The ```tf_idf``` function calculates the TF-IDF vector of a text from the IDF values of the tokens considered.
Each dimension of the vector is calculated as **(token_count / total_ngram_count) * log(doc_count / doc_count_containing_token)**.
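
The formula can be checked by hand on a three-word text. The sketch below is a simplified stand-in (no preprocessing or lemmatization); the names ```ngrams_up_to``` and ```tf_idf_vector``` and the toy IDF values are assumptions for this example.

```python
import math

def ngrams_up_to(tokens, max_n):
    """All uni-, bi-, ... up to max_n-grams of `tokens`, joined with spaces."""
    out = []
    for n in range(1, max_n + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tf_idf_vector(tokens, idf, max_n=3):
    """Map each key token to (count / total n-gram count) * idf."""
    terms = ngrams_up_to(tokens, max_n)
    total = len(terms)
    if total == 0:
        return {key: 0.0 for key in idf}
    return {key: terms.count(key) / total * weight for key, weight in idf.items()}

# Toy IDF table (values picked for illustration)
toy_idf = {"egypt": math.log(2), "ancient egypt": math.log(4), "monet": math.log(4)}

vec = tf_idf_vector(["ancient", "egypt", "art"], toy_idf)
print(vec)
```

The text yields 3 unigrams, 2 bigrams and 1 trigram, so the total n-gram count is 6: "egypt" scores (1/6) * log 2, "ancient egypt" scores (1/6) * log 4, and "monet" scores 0 because it does not occur.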

In [21]:
def tf_idf(text, key_tokens, idf_dict, normalize=True, ngram=3):
    
    tf_idf_dict = defaultdict(int)
    
    text = preprocess(text)
    text = text.lower()
    
    # tokens to be used for tf-idf
    tokens = nltk.word_tokenize(text)
    
    token_list = []
    for i in range(1, ngram+1):
        token_list.extend(nltk.ngrams(tokens, i))
    token_list = [' '.join(token) for token in token_list]
    
    # lemmatize the tokens
    for i, token in enumerate(token_list):
        token_list[i] = lemmatizer.lemmatize(token)
    
    # calculate the normalized term frequency using count of total uni, bi, trigrams
    terms = len(token_list)
    
    # initialize the tf_idf_dict dictionary
    for token in key_tokens:
        tf_idf_dict[token] = 0
    
    # count only the tokens in the key_tokens (uni-, bi- and tri-grams)
    for token in token_list:
        if token in key_tokens:
            tf_idf_dict[token] += 1
    
    # if normalize is set, the term counts are divided by the number of terms in the text
    for key in tf_idf_dict.keys():
        
        if normalize:
            tf_idf_dict[key] = tf_idf_dict[key] / terms * idf_dict[key]
        else:
            tf_idf_dict[key] = tf_idf_dict[key] * idf_dict[key]
            
        # NOTE: the IDF factor is applied a second time here, so each entry ends up as tf * idf**2
        tf_idf_dict[key] = tf_idf_dict[key] * idf_dict[key]
    
    # convert the dictionary into an np array
    tf_idf_vec = np.zeros((len(key_tokens),))
    for i, key in enumerate(key_tokens):
        tf_idf_vec[i] = tf_idf_dict[key]
        
    # returns a 1d np array as tf-idf vector
    return tf_idf_vec

# Prediction

Each event has a feature vector. When both vectors are L2-normalized, the inner product between the feature vector of an event and that of a wiki article is the cosine similarity between them. Once the similarity exceeds a threshold, the event is considered *relevant* to the article.
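
The relation between inner product and cosine similarity can be sketched with plain lists. This stand-in uses only the standard library rather than the matrix operations used in the notebook; the helper names and the toy vectors are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale `vec` to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relevant_events(event_rows, query, threshold):
    """Indices of events whose cosine similarity to `query` exceeds `threshold`."""
    q = l2_normalize(query)
    return [i for i, row in enumerate(event_rows)
            if dot(l2_normalize(row), q) > threshold]

# Toy feature vectors (3 dimensions instead of thousands)
events = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
query = [1.0, 0.0, 0.0]

similarity = dot(l2_normalize(events[0]), l2_normalize(query))
print(similarity)                             # 3 / 5
print(relevant_events(events, query, 0.05))   # [0]
```

If either vector is left unnormalized, the inner product scales with its length and is no longer bounded by 1, so a fixed threshold would mean something different for long and short texts.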

An event matrix needs to be constructed to include all the feature vectors for the events. Some events are not suitable for recommendation (not cultural, e.g. magic show) and are excluded from the recommendation candidate list.

In [22]:
event_set = set()
with open(data_path + 'event_list_wiki_dict_picked.txt', 'r') as f:
    event_list_wiki_dict_picked = json.load(f)
for key in event_list_wiki_dict_picked.keys():
    if len(event_list_wiki_dict_picked[key]) != 0:
        event_set.add(key)

Lookup tables between event matrix indices and event list indices are constructed, to retrieve the event description text once a recommendation is made.

In [23]:
event_matrix2list_dict = {}
list2event_matrix_dict = {}
matrix_ind = 0
for key in event_list_wiki_dict_picked.keys():
    if len(event_list_wiki_dict_picked[key]) != 0:
        event_matrix2list_dict[str(matrix_ind)] = key
        list2event_matrix_dict[key] = str(matrix_ind)
        matrix_ind += 1

### Event matrix

Construct the event matrix. All the event feature vectors are normalized.

In [24]:
from sklearn.preprocessing import normalize

vector_dim = len(key_tokens)
event_count = len(event_matrix2list_dict)
event_matrix = np.zeros((event_count, vector_dim))

for key in event_matrix2list_dict.keys():
    event_matrix_ind = int(key)
    event_list_ind = int(event_matrix2list_dict[key])
    event_desc = event_list_desc[event_list_ind]
    event_matrix[event_matrix_ind, :] = tf_idf(event_desc, key_tokens, idf_dict)
    
event_matrix = normalize(event_matrix, axis=1, norm='l2')

### Prediction function

The threshold is determined by maximizing the F1 score on the test data (the steps are not included in this notebook).
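
Since those steps are not shown, the following is only a guess at what such a selection could look like: sweep a few candidate thresholds and keep the one with the highest F1 score. The scores, labels and candidate values are invented toy data, and ```f1_at``` is a hypothetical helper; none of this reproduces the actual tuning.

```python
def f1_at(scores, labels, threshold):
    """F1 score when every score above `threshold` is predicted relevant."""
    preds = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented similarity scores with hand-assigned relevance labels
scores = [0.01, 0.04, 0.06, 0.20, 0.30]
labels = [False, False, True, False, True]
candidates = [0.0, 0.05, 0.1, 0.25]

best = max(candidates, key=lambda t: f1_at(scores, labels, t))
print(best, f1_at(scores, labels, best))
```

On this toy data the sweep picks 0.05 with an F1 of 0.8: lower thresholds admit more false positives, higher ones start missing relevant events.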

In [25]:
threshold = 0.05

def predict(event_matrix, vec_norm, threshold):
    
    prod = vec_norm.dot(event_matrix.T)
    # returns the event matrix indices of the selected events
    return list(np.nonzero(prod[0, :] > threshold)[0])

### Example

In this section, assume the user is viewing the wiki topic "Ancient Egypt". The intro text of the wiki page is extracted and vectorized using the ```tf_idf``` function defined previously. Events with a similarity higher than the threshold are returned.

In [26]:
topic = 'Ancient_Egypt'
wiki_text = wikiExtractor(topic)

In [27]:
wiki_text

"Ancient Egypt was a civilization of ancient North Africa, concentrated along the lower reaches of the Nile River in the place that is now the country Egypt. Ancient Egyptian civilization followed prehistoric Egypt and coalesced around 3100 BC (according to conventional Egyptian chronology) with the political unification of Upper and Lower Egypt under Menes (often identified with Narmer). The history of ancient Egypt occurred as a series of stable kingdoms, separated by periods of relative instability known as Intermediate Periods: the Old Kingdom of the Early Bronze Age, the Middle Kingdom of the Middle Bronze Age and the New Kingdom of the Late Bronze Age.\nEgypt reached the pinnacle of its power in the New Kingdom, ruling much of Nubia and a sizable portion of the Near East, after which it entered a period of slow decline. During the course of its history Egypt was invaded or conquered by a number of foreign powers, including the Hyksos, the Libyans, the Nubians, the Assyrians, the 

In [28]:
wiki_vec = tf_idf(wiki_text, key_tokens, idf_dict, normalize=True, ngram=3)

In [29]:
events = predict(event_matrix, wiki_vec.reshape(1, -1), threshold)
for ix in events:
    print('Event: ', event_list[int(event_matrix2list_dict[str(ix)])]['name'])
    print(event_list[int(event_matrix2list_dict[str(ix)])]['description'])  
    print('\n')


Event:  Ancient Egyptian Art
The collection of ancient Egyptian art at the Metropolitan Museum ranks among the finest outside Cairo. It consists of approximately 36,000 objects of artistic, historical, and cultural importance, dating from the Paleolithic to the Roman period (ca. 300,000 B.C.&#x2013;4th century A.D.). More than half of the collection is derived from the Museum&#x27;s thirty-five years of archaeological work in Egypt, initiated in 1906 in response to increasing public interest in the culture of ancient Egypt. Today, virtually the entire collection is on display in thirty-two major galleries and eight study galleries, with objects arranged chronologically. Overall, the holdings reflect the aesthetic values, history, religious beliefs, and daily life of the ancient Egyptians over the entire course of their great civilization.  The Department of Egyptian Art is particularly well known for the Old Kingdom mastaba (offering chapel) of Perneb (ca. 2450 B.C.); a set of Middle K

For the wiki topic "Ancient Egypt", two events are recommended: "Ancient Egyptian Art" and "Egypt Reborn: Art for Eternity".