# CS 429: Information Retrieval

<br>

## Lecture 10: Query Expansion

<br>

### Dr. Aron Culotta
### Illinois Institute of Technology 


---

Last time:

- Evaluation
  - accuracy, precision, recall, MAP
  
This time:

- How can we incorporate user feedback to improve search?
- How can we alter the user's query to improve search?

# Relevance Feedback

- An *interactive* IR system in which:


1. The user enters a query.
2. The system returns results.
3. The user indicates which results are relevant.
4. GoTo 2.

# How should we incorporate user feedback?

- Create a new query that is similar to relevant documents but dissimilar to irrelevant documents.

# Rocchio

$ \DeclareMathOperator*{\argmax}{arg\,max}$
$\vec{q}^* \leftarrow \argmax_{\vec{q}} sim(\vec{q}, C_r) - sim(\vec{q}, C_{nr})$

- where $\vec{q}$ is a query vector
- $C_r$ is a set of relevant documents
- $C_{nr}$ is a set of irrelevant documents
- $sim$ is cosine similarity
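As a rough illustration of the objective above, the sketch below scores a candidate query by its similarity to relevant documents minus its similarity to irrelevant ones. Treating $sim(\vec{q}, C)$ as the *average* cosine similarity to the documents in $C$ is an assumption made here for concreteness; the helper names (`cosine_sim`, `avg_sim`, `rocchio_objective`) and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two nonzero vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_sim(q, docs):
    """Average cosine similarity between q and each document in docs.
    (Assumed reading of sim(q, C); not taken from the lecture.)"""
    return float(np.mean([cosine_sim(q, d) for d in docs]))

def rocchio_objective(q, relevant, irrelevant):
    """Value that Rocchio tries to maximize over q."""
    return avg_sim(q, relevant) - avg_sim(q, irrelevant)

# Toy 2-d "tf-idf" vectors.
relevant = [[1.0, 5.0], [1.1, 5.1]]
irrelevant = [[5.0, 1.0], [4.0, 0.5]]

# A query pointing toward the relevant documents scores higher
# than one pointing toward the irrelevant documents.
print(rocchio_objective([0.0, 1.0], relevant, irrelevant))
print(rocchio_objective([1.0, 0.0], relevant, irrelevant))
```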

# Document Centroid

Recall that we represent each document as a vector of tf-idf values.

Given a collection of documents $D = \{\vec{d_1} \ldots \vec{d_N}\}$, the centroid vector is:

$$ \frac{1}{N} \sum_{\vec{d_j} \in D}\vec{d}_j $$

In [None]:
# A word about numpy arrays...
import numpy as np
a = [1,2,3]
b = [4,5,6]
print('list addition:', a + b)
print('numpy array addition:', np.array(a) + np.array(b))
print('numpy array division:', np.array(a) / 3)

In [None]:
# List division? Nope: Python lists don't support `/`, so this raises a TypeError.
a / 3

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

points = [[1, 4],
          [.5, .5],
          [4, 6]]

# Compute centroid.
centroid = np.sum(points, axis=0) / len(points)
plt.figure()
plt.scatter([p[0] for p in points],
            [p[1] for p in points])
plt.scatter([centroid[0]],
            [centroid[1]], marker='x', s=60)
plt.show()

Want a query that is closest to relevant documents, but far from irrelevant documents.

$$\vec{q}^* = \frac{1}{|C_r|} \sum_{\vec{d_j} \in C_r}\vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$$

![rocchio](files/rocchio.png)

Source: [MRS](http://nlp.stanford.edu/IR-book/pdf/09expand.pdf)

But, we don't know the set of all relevant and irrelevant documents.


$$\vec{q}_m = \alpha \vec{q}_0 + \beta\frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma\frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$

- $\vec{q}_0$ is the original query vector.
- $D_r$ and $D_{nr}$ are the documents the user has marked relevant and irrelevant so far.
- $\alpha$, $\beta$, $\gamma$ are tunable parameters.


In [None]:
# Plot effect of relevance feedback as we change parameters.
import numpy as np
from numpy import array as npa
import random as rnd

def centroid(docs):
    return np.sum(docs, axis=0) / len(docs)

def rocchio(query, relevant, irrelevant, alpha, beta, gamma):
    return alpha * query + beta * centroid(relevant) - gamma * centroid(irrelevant) 

# Create some documents
relevant = npa([[1, 5], [1.1, 5.1], [0.9, 4.9], [1.0, 4.8]])
irrelevant = npa([[rnd.random()*6, rnd.random()*6] for i in range(30)])

# Create a query
query = npa([.1, .1])

# Compute three Rocchio updates: (gamma=0.5), (gamma=0), and (alpha=0, gamma=0)
new_query_g5 = rocchio(query, relevant, irrelevant, 1., .75, .5)
new_query_g0 = rocchio(query, relevant, irrelevant, 1., .75, 0.)
new_query_g0_a0 = rocchio(query, relevant, irrelevant, 0., .75, 0.)

# Plot them.
plt.figure()
pos = plt.scatter([p[0] for p in relevant], [p[1] for p in relevant],
                  color='g', marker='o', label='relevant')
neg = plt.scatter([p[0] for p in irrelevant], [p[1] for p in irrelevant],
              marker='+', color='red', label='irrelevant')

q = plt.scatter(query[0], query[1], marker='v',
                color='b', s=100, label='query')
newq_b5 = plt.scatter([new_query_g5[0]], [new_query_g5[1]],
                      marker='*', s=100, color='black', label='gamma=.5')  # s=100, c=.9)
#newq_b0 = plt.scatter([new_query_g0[0]], [new_query_g0[1]],
#                      marker='d', s=100, color='black', label='gamma=0')  # s=100, c=.8)
newq_b0 = plt.scatter([new_query_g0_a0[0]], [new_query_g0_a0[1]],
                      marker='^', s=100, color='black', label='alpha=0, gamma=0')

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

- $\gamma=0$ is often used, since we're more confident in relevant annotations than in irrelevant ones.

- One might decrease $\alpha$ as the number of relevant documents increases.

# Does relevance feedback help precision or recall?

- Mostly recall: "adding" similar terms to query vector from relevant documents.

- When would it not help?

- Spelling correction?
- Different language?
- Synonyms?


- Assumption 1: query is "close" to relevant documents
  - feedback makes the query closer
  
- Assumption 2: relevant documents form one cluster.

In [None]:
# What happens if there are two clusters of relevant examples?

import numpy as np
import random as rnd

points = [[1, 5], [1.1, 5.1], [0.9, 4.9], [1.0, 4.8],
          [5, 1.2], [4.9, 1.1], [5.1, 1.0], [4.8,1.2]]
plt.figure()
# Keep the centroid point and its plot handle under separate names,
# so the centroid() function defined earlier is not overwritten.
center = np.sum(points, axis=0) / len(points)
pos = plt.scatter([p[0] for p in points], [p[1] for p in points],
                 color='green', label='relevant')
neg = plt.scatter([rnd.random()*6 for i in range(30)],
                  [rnd.random() * 6 for i in range(30)],
                  marker='+', color='red', label='irrelevant')
center_plot = plt.scatter([center[0]], [center[1]],
                   marker='x', s=100, color='blue', label='centroid')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

*"If you have your one foot in a bucket of boiling hot water and another foot in a bucket of ice cold water, on average you ought to feel pretty comfortable."* --Unknown

# Does relevance feedback affect search time?

- Much longer queries
- How to approximate?
  - Use top $k$ most informative terms from relevant set.
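One way to keep the expanded query short is sketched below: keep only the $k$ highest-weighted terms of the updated query vector and zero out the rest. Using raw term weight as a stand-in for "most informative" is an assumption, and `keep_top_k_terms` is a hypothetical helper.

```python
import numpy as np

def keep_top_k_terms(query_vec, k):
    """Return a copy of query_vec with all but the k largest weights set to 0."""
    query_vec = np.asarray(query_vec, dtype=float)
    pruned = np.zeros_like(query_vec)
    if k <= 0:
        return pruned
    if k >= len(query_vec):
        return query_vec.copy()
    top = np.argsort(query_vec)[-k:]  # indices of the k largest weights
    pruned[top] = query_vec[top]
    return pruned

# An expanded query over a 6-term vocabulary; most weights are tiny.
expanded = np.array([0.9, 0.05, 0.4, 0.0, 0.3, 0.01])
print(keep_top_k_terms(expanded, k=3))
```

Only the terms with nonzero weight need to be looked up in the index, so search time grows with $k$ rather than with the size of the relevant documents' vocabulary.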

# Variants of relevance feedback

- Pseudo-relevance: Assume top $k$ documents are relevant.
- Indirect relevance: Mine click logs.

# Pseudo-relevance feedback

1. Rank documents
2. Let $V$ be the top $k$ documents. We pretend these are all relevant.
3. Update $q$ according to Rocchio

We can repeat steps 1 through 3 (re-ranking with the updated query each time) until the ranking stops changing.
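The loop can be sketched as follows. This is a minimal version under several assumptions: documents are rows of a dense numpy array with nonzero norm, $\gamma = 0$, the update always starts from the original query `q0`, and `max_iters` caps the loop in case the ranking oscillates. The function names and toy data are hypothetical.

```python
import numpy as np

def rank(query, docs):
    """Return document indices sorted by decreasing cosine similarity to query."""
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims, kind='stable')

def pseudo_relevance_feedback(q0, docs, k=2, alpha=1.0, beta=0.75, max_iters=10):
    """Repeatedly treat the top k documents as relevant and apply Rocchio (gamma=0)."""
    query = q0
    ranking = rank(query, docs)
    for _ in range(max_iters):
        top_k = docs[ranking[:k]]                        # pretend these are relevant
        query = alpha * q0 + beta * top_k.mean(axis=0)   # Rocchio update, gamma = 0
        new_ranking = rank(query, docs)
        if np.array_equal(new_ranking, ranking):         # ranking stopped changing
            break
        ranking = new_ranking
    return query, ranking

docs = np.array([[1.0, 0.2], [0.9, 0.4], [0.1, 1.0], [0.5, 0.5]])
q0 = np.array([1.0, 0.0])
new_query, ranking = pseudo_relevance_feedback(q0, docs, k=2)
print(new_query, ranking)
```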

When would this work? When would this not work?

# Explicit query expansion

- Thesaurus
- Word co-occurrences
- Mine reformulations from query log
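For the thesaurus option, a minimal sketch is shown below. The thesaurus here is a hand-built toy dictionary, not a real resource, and `expand_query` is a hypothetical helper; a real system would also need to decide how to weight the added terms relative to the original ones.

```python
def expand_query(query_terms, thesaurus):
    """Append each query term's synonyms, preserving order and skipping duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

# Toy thesaurus, made up for illustration.
toy_thesaurus = {
    'cheap': ['affordable', 'inexpensive'],
    'tables': ['desks'],
}

print(expand_query(['cheap', 'tables'], toy_thesaurus))
# ['cheap', 'tables', 'affordable', 'inexpensive', 'desks']
```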

# WordNet

<http://wordnetweb.princeton.edu/perl/webwn>

---

<br><br><br><br><br>

# Thesaurus discovery

**Idea:** Look for words that occur in same context.

- "He put the mug on the \_\_\_\_\_"

- "He put his feet on the \_\_\_\_"

Query: "cheap tables" 
  - expand to include "affordable ottomans"

In [None]:
def get_context(tokens, position, window):
    """ Get tokens to the left and right of this position.
    Params:
      tokens.....list of strings in this sentence
      position...integer. index into tokens
      window.....integer. number of tokens to the left and right to consider.
    """
    start = max(position - window, 0)
    end = min(position + window + 1, len(tokens))
    left = ['L=%s' % x for x in tokens[start : position]]
    right = ['R=%s' % x for x in tokens[position + 1 : end]]
    return left + right
    
get_context(['a', 'b', 'c', 'd', 'e'], position=2, window=2)

In [None]:
from collections import Counter, defaultdict
from sklearn.datasets import fetch_20newsgroups
import re
docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data
print('read', len(docs), 'docs')

In [None]:
# Count words that occur within a window of -n to +n of each word.
def term2contexts(docs, n):
    contexts = defaultdict(lambda: Counter())
    for d in docs:
        toks = re.findall(r'\w+', d.lower())  # raw string avoids an invalid-escape warning
        for i in range(len(toks)):
            contexts[toks[i]].update(get_context(toks, i, n))
    return contexts

contexts = term2contexts(docs, n=2)
# Print top contexts for email
contexts['email'].most_common(10)

In [None]:
# NB: Efficiently incrementing a Counter using the .update method.
from collections import Counter
c = Counter()
c.update([1,2,1,1,3])
print(c)
c.update([3,3,3])
print(c)
print(c.most_common(2))

In [None]:
print('made context vectors for %d terms' % len(contexts))
# Each word now has a "context vector" indicating
# the terms that often occur before/after it.

In [None]:
contexts['believe'].most_common(10)

In [None]:
# Downweighting common terms with IDF.
# Compute inverse document frequency values for each term.
# Here: document frequency means how many different contexts this feature appears in.
import math
def compute_idfs(contexts):
    idfs = Counter()
    for term, context in contexts.items():
        idfs.update(context.keys())
    for d in idfs:
        idfs[d] = math.log10(len(contexts) / idfs[d])
    return idfs

idfs = compute_idfs(contexts)
print('idf of L=the:', idfs['L=the'], ' of L=mouse:', idfs['L=mouse'])

In [None]:
# Multiply each context value by its idf
idf_contexts = {}
for term, counts in contexts.items():
    newcounts = defaultdict(lambda: 0)
    for term2, value in counts.items():
        if value > 1:  # remove context terms that don't occur at least twice, to reduce noise.
            newcounts[term2] = (1 + math.log10(value)) * idfs[term2]  # (1 + log tf) * idf
    idf_contexts[term] = newcounts

In [None]:
sorted(idf_contexts['email'].items(), key=lambda x: -x[1])[:10]

In [None]:
sorted(idf_contexts['believe'].items(), key=lambda x: -x[1])[:10]

In [None]:
# Filter terms that don't appear very often.
idf_contexts = dict([(term, cont) for term, cont in idf_contexts.items()
                     if len(cont) > 10])

In [None]:
print(len(idf_contexts), 'remain')

In [None]:
def norm(context):
    return math.sqrt(sum(x**2 for x in context.values()))
    
def cosine(term1, term2, contexts):
    # Compute cosine similarity between term1
    # and term2 context vectors.
    # NB: slow!
    context1 = contexts[term1]
    context2 = contexts[term2]
    # Use .get so that looking up a missing feature does not insert it into the defaultdict.
    dotprod = sum(context1[term] * context2.get(term, 0) for term in context1)
    return dotprod / (norm(context1) * norm(context2))

def find_closest_terms(term, contexts):
    cosines = [(term2, cosine(term, term2, contexts)) for term2 in contexts]
    return sorted(cosines, key=lambda x: x[1], reverse=True)

find_closest_terms('believe', idf_contexts)[:10]

In [None]:
find_closest_terms('email', idf_contexts)[:10]

In [None]:
find_closest_terms('mouse', idf_contexts)[:10]

In [None]:
find_closest_terms('difficult', idf_contexts)[:10]

In [None]:
find_closest_terms('love', idf_contexts)[:10]

In [None]:
find_closest_terms('heaven', idf_contexts)[:10]

Google's $n$-gram data: 

<http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html>

# How do we decide when to expand the query?

- Few results returned.
- Query log data
  - Searches where few results are clicked.
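These two signals could be combined into a simple rule, sketched below. The thresholds (`min_results`, `min_ctr`) and the function name are hypothetical values chosen for illustration, not recommendations from the lecture.

```python
def should_expand(num_results, clicks, impressions, min_results=10, min_ctr=0.05):
    """Decide whether to expand a query.

    num_results...number of documents the original query returned
    clicks........number of past result clicks for this query (from the query log)
    impressions...number of times this query was issued in the log
    """
    if num_results < min_results:
        return True   # few results returned
    if impressions > 0 and clicks / impressions < min_ctr:
        return True   # users rarely click any result for this query
    return False

print(should_expand(num_results=3, clicks=0, impressions=0))       # True: few results
print(should_expand(num_results=500, clicks=2, impressions=100))   # True: low click rate
print(should_expand(num_results=500, clicks=40, impressions=100))  # False
```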