### Information Retrieval and Document Search
* Concerned with obtaining information stored in an unstructured manner
  * Find documents that satisfy some level of query-document similarity
* `tf-idf` computes a "weight" for every word of the query.
  * Those can be used to derive features for words in a document
* So far, each query word is taken in a literal sense
  * If you use the keyword `notebook`, documents containing the word `laptop` won't be identified
* This is problematic since there are many ways synonyms can be combined to form a query
  * A naïve information retrieval system does nothing to help
  * We can handle stemming and case folding but cannot do more
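
The limitation above can be illustrated with a minimal sketch. The documents, the `normalize` helper, and its crude plural-stripping rule are all invented for illustration; a real system would use a proper stemmer.

```python
# Toy collection; the documents are invented for illustration.
docs = {
    "d1": "This notebook has a fast processor",
    "d2": "A lightweight laptop with long battery life",
    "d3": "Notebooks and pens for school",
}

def normalize(text):
    """Case folding plus a crude plural-stripping rule (a stand-in for real stemming)."""
    tokens = text.lower().split()
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens}

def literal_search(query, docs):
    """Return ids of documents sharing at least one normalized token with the query."""
    q = normalize(query)
    return sorted(doc_id for doc_id, text in docs.items() if q & normalize(text))

hits = literal_search("Notebook", docs)
print(hits)  # d1 and d3 match; the laptop document d2 is missed
```

Case folding and stemming let `Notebook` match `Notebooks`, but nothing in this pipeline connects `notebook` to `laptop`.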

### Including Word Meanings

* Use information on word similarity using synonyms?
  * E.g. use a database of synonyms and include a word's popular synonyms in the query?
  * Could be a manually curated measure of word similarity.
    * relatively easy to derive from a big collection of documents
* Unfortunately, context-free query expansion ends up problematic
  * [light hair] ≈ [fair hair] => expand [light] ⇒ [light fair]
  * [outdoor light] ≠ [outdoor fair]
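
A minimal sketch of context-free query expansion, assuming a hypothetical one-entry synonym table (`SYNONYMS` and `expand_query` are illustrative names, not part of any library):

```python
# Hypothetical synonym table; a real system might use a curated thesaurus.
SYNONYMS = {"light": ["fair"]}

def expand_query(query, synonyms=SYNONYMS):
    """Append every listed synonym after each query term, ignoring context."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded

hair = expand_query("light hair")
outdoor = expand_query("outdoor light")
print(hair)     # ['light', 'fair', 'hair']
print(outdoor)  # ['outdoor', 'light', 'fair']
```

The expansion helps the first query but pollutes the second: `fair` is added to `outdoor light` even though it is not a synonym of `light` in that context.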


### Synonyms versus Word Similarity

* It's hypothesized that there are no perfect synonyms
  * No two words that have the same meaning and that can be used interchangeably in all circumstances

 <img src="https://www.dropbox.com/s/nkirtpgru0nmcce/about_synonyms.png?dl=1" alt="Drawing" style="width: 400px;"/>

* Gabriel Girard: French churchman (abbot) and grammarian
  * Author of the first work on synonyms published in France
* While there may be no perfect synonyms, there are words sharing some level of meaning
  * Big/large, water/H2O, Apples/Oranges?

### Examples of Word Similarity
* SimLex-999 was produced by mining the opinions of 500 annotators via Amazon Mechanical Turk.
 <img src="https://www.dropbox.com/s/i5s3xy9h3fzapzq/simlex-999%20question.png?dl=1" alt="Drawing" style="width: 500px;"/>

* Often used as the gold standard resource for the evaluation of models that learn the meaning of words and concepts.
  
* https://fh295.github.io/simlex.html

### Similarity and Relatedness

* Words can be related in ways that go beyond synonymy
* Those are called word associations or word relatedness. E.g.: 
  * Coffee and tea 
    * They are similar: both are hot beverages, produced from plants, stimulants, etc.
  * Coffee and mug
    * Are related: one is used to consume the other
    * Are not similar: one is produced from a plant, the other is manufactured; one is liquid, the other is solid, etc.
      


### Similarity and Relatedness - Cont'd

* Related words belong to the same field
  * Medicine: doctors, surgeon, nurse, hospital, anesthetic, etc.
  * Restaurant: menu, waiter, chef, food, drink, etc.
  * etc.
* Antonyms are also typically related as they describe opposite meanings of the same feature.
  * E.g.: hot/cold, dark/light, slow/fast, etc. 
* Words with a similar connotation (sentiment) are also related
  * positive connotation (happy, ecstatic, joyful, ...)
  * negative connotation (sad, upset, down, ...)


### Connotation

* FYI: Since many are working on this for the final project!
* Words vary along 3 affective dimensions
  * Valence: "relative capacity to unite, react, or interact as with antigens or a biological substrate" (Webster dictionary)
    * Pleasantness of a stimulus
    * love=1 versus nightmare=0
        
  * Arousal: Intensity of emotion provoked by a stimulus
    * Elated=1 versus meditative=0
  * Dominance: Degree of control asserted by a stimulus
    * Leadership=1 versus weak=0
  


### Learning Similarity

* You can learn context-specific query rewritings from search logs by identifying cases where the same user makes a second attempt at the same information need
  * Foundational idea: the meaning of a word is tied to its use in the language
    * We have tons of data that shows how words are used. How can we extract the meaning of words from that data?

* So far, we've encoded a document as a bag of words
  * Words are assigned numbers based on, for example, their occurrence in a text
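
As a reminder, a bag-of-words encoding can be sketched in a few lines. This assumes simple whitespace tokenization; the sentence and the `bag_of_words` helper are illustrative.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercased token to its number of occurrences; word order is discarded."""
    return Counter(text.lower().split())

bow = bag_of_words("I had ong choi over rice and more rice")
print(bow["rice"])  # 2

# Word order is lost: both strings produce the same bag.
same = bag_of_words("rice over choi") == bag_of_words("choi over rice")
```

Note that each word is just a key with a count: nothing in this encoding says that two different keys might mean similar things.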

### Automatic Thesaurus Generation

* Attempt to generate a thesaurus automatically by analyzing a collection of documents
* Fundamental notion: similarity between two words
* Definition 1: Two words are similar if they co-occur with similar words.
* Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.


### Definition 1: Meaning from Linguistic Distribution

* What is Ong Choi?

* Given the following set of statements:
  * Ong choi is delicious sauteed with garlic... salt
  * I had ong choi over rice
  * Ong choi .... leaves should be washed thoroughly
* And the following second set of document titles
  * Recipe for spinach with garlic and rice
  * Garlicky Swiss Chard Recipe on the NYT .. garlic and rice 
* From the above, we can hypothesize that:
  * Ong choi is an edible vegetable that is **similar** to spinach and chard greens?
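
The hypothesis above can be sketched as a comparison of context words. The context sets below are simplified by hand from the statements (they are not the full sentences), the `laptop` entry is invented as a contrast, and Jaccard overlap is just one possible way to compare contexts.

```python
# Hand-picked context words; "laptop" is an invented contrast case.
contexts = {
    "ong_choi": {"delicious", "sauteed", "garlic", "salt", "rice", "leaves", "washed"},
    "spinach": {"recipe", "garlic", "rice"},
    "chard": {"recipe", "garlic", "rice", "garlicky"},
    "laptop": {"battery", "screen", "keyboard"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

sim_spinach = jaccard(contexts["ong_choi"], contexts["spinach"])
sim_laptop = jaccard(contexts["ong_choi"], contexts["laptop"])
print(sim_spinach, sim_laptop)  # 0.25 0.0
```

Because `ong_choi` shares `garlic` and `rice` with `spinach` and nothing with `laptop`, it ends up closer to the vegetable, without anyone having defined what ong choi is.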
 


### Definition 2: Meaning as a point in Space

* We can represent a word as a point in space
  * each dimension describes the value along a given axis
    * e.g.: scores for valence, arousal, and dominance.

* We can use alternative or even custom features that are domain-specific
  * For instance, axes can be tool-like, size-like, transportation-specific, etc.

* We call this an embedding
  * Vector representation such that similar words have a similar representation in a higher-dimensional space
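
A minimal sketch of words as points in a three-axis space (valence, arousal, dominance). All scores are invented for illustration and are not taken from any published lexicon; `nearest` is a hypothetical helper.

```python
import math

# Axes: (valence, arousal, dominance). All scores are made up for illustration.
embedding = {
    "love":      (1.0, 0.8, 0.6),
    "joy":       (0.9, 0.7, 0.6),
    "nightmare": (0.0, 0.8, 0.2),
    "calm":      (0.8, 0.1, 0.5),
}

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embedding):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in embedding if w != word)
    return max(others, key=lambda w: cosine(embedding[word], embedding[w]))

print(nearest("love", embedding))  # joy
```

The drawback of this approach is visible in the dictionary itself: someone had to choose the axes and score every word by hand.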

### Word Embeddings

* Ideally, we would prefer a method that combines both approaches
  * Words that have similar linguistic distribution have similar embeddings
* We want the embedding to be easily learned
  * We don't want to have to define the axes and score each word on its value for a specific axis
* Word embeddings are the standard way to represent words in modern Natural Language Processing (NLP) applications
* Embeddings allow us to compare documents based on "similar" words
  * As opposed to the exact same words, as done in Assignment 2


### Document Representation in Terms of Words
* Representation used in Assignment 2
 <img src="https://www.dropbox.com/s/lh2afrz6p1l6a5t/doc_word_representation.png?dl=1" alt="Drawing" style="width: 300px;"/>



### Word Representation in Terms of Documents

* But we can also choose a representation of words based on which document they occur in.

 <img src="https://www.dropbox.com/s/wdwjexi30f8txqa/word_doc_representation.png?dl=1" alt="Drawing" style="width: 300px;"/>
 
* Over a very large collection of documents, similar words will have similar vectors because they co-occur in the same documents
  * This, at least intuitively, satisfies definitions 1 and 2

* Instead of working with a document-word matrix, we can work with a word-word matrix
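
A minimal sketch of reading word vectors off a document-word matrix, assuming a tiny invented collection and whitespace tokenization:

```python
# Toy collection; the documents are invented for illustration.
docs = [
    "spinach garlic rice",
    "chard garlic rice",
    "laptop battery screen",
]
vocab = sorted({w for d in docs for w in d.split()})

# Rows are documents, columns are words (the document-word view).
doc_word = [[d.split().count(w) for w in vocab] for d in docs]

# Transposing gives one vector per word, indexed by document.
word_doc = {w: [row[j] for row in doc_word] for j, w in enumerate(vocab)}

print(word_doc["garlic"])  # [1, 1, 0]
print(word_doc["rice"])    # [1, 1, 0]
print(word_doc["laptop"])  # [0, 0, 1]
```

Here `garlic` and `rice` get identical vectors because they occur in the same documents, while `laptop` shares no document with either.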


### Word-Word Matrix (Word-context)
* Instead of working with a document-word matrix, we can work with a word-word matrix
 <img src="https://www.dropbox.com/s/wanm7n5uvw772dt/word-word_matrix.png?dl=1" alt="Drawing" style="width: 300px;"/>

* Each cell represents the number of times words $w_i$ and $w_j$ were observed in the same context.
  * context is not limited to documents.
    * Can be a paragraph or a sentence or even a window of custom size (5 words to the right or left)

* These are called term-context matrices
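
A minimal sketch of building a word-word (term-context) matrix with a sliding window. The sentences are invented, the window size of 2 is an arbitrary choice, and nested dictionaries are used here as one simple way to store only the non-zero cells.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts

sentences = ["i had ong choi over rice", "i had spinach over rice"]
counts = cooccurrence(sentences, window=2)
print(counts["rice"]["over"])  # 2
print(dict(counts["spinach"]))
```

Changing `window`, or passing paragraphs instead of sentences, changes what counts as a context without changing the rest of the code.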

### Word-Context Example

 <img src="https://www.dropbox.com/s/csqlw32a4sqku6a/word_context_example.png?dl=1" alt="Drawing" style="width: 300px;"/>

* Note that these vectors are extremely large and sparse
* Appropriate data structures need to be used to store and compute on these matrices

* Use cosine similarity to measure the similarity between words
  * Euclidean distance is not appropriate for the same reasons discussed in the tf-idf section
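
A minimal sketch comparing the two measures on sparse count vectors stored as dictionaries. The counts are invented, and the sketch assumes that sensitivity to vector length is the relevant weakness of Euclidean distance here.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {context: count} dicts."""
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    """Euclidean distance; missing keys are treated as zero counts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

# Invented counts: chard has the same context profile as spinach, just 10x more often.
spinach = {"garlic": 2, "rice": 1}
chard = {"garlic": 20, "rice": 10}
laptop = {"battery": 2, "screen": 1}

print(cosine(spinach, chard), cosine(spinach, laptop))
print(euclidean(spinach, chard), euclidean(spinach, laptop))
```

Cosine similarity treats `spinach` and `chard` as pointing in the same direction and `laptop` as unrelated, whereas Euclidean distance places `spinach` closer to `laptop` simply because both have small counts.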


### Weighting to Mitigate Overly Frequent Words

* The frequencies measured are raw. 
* Words like `raw` and `vegetable` will have a low frequency compared to `a` and `the`
* While frequency is useful, we want to reduce the influence of overly frequent words
* Two solutions:
  * tf-idf: compute tf-idf values instead of raw frequencies
    * Important to remember that here, the document is the context.
    * A document can be a paragraph or even a sentence
  * pointwise mutual information (PMI): instead of the raw frequency, compare the probability of observing the two words together with the probability expected if they occurred independently, given their individual frequencies.
$$
  \mathrm{PMI}(w_1, w_2) = \log\frac{p(w_1, w_2)}{p(w_1) \times p(w_2)}
$$
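
A minimal sketch of the PMI computation from raw counts. It assumes probabilities are estimated as relative frequencies over `total` observed contexts and uses a base-2 logarithm, which is a common but not mandatory choice; the counts in the example are invented.

```python
import math

def pmi(pair_count, count_w1, count_w2, total):
    """PMI of two words from co-occurrence and individual counts (log base 2)."""
    p_joint = pair_count / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return math.log2(p_joint / (p_w1 * p_w2))

# Invented counts over 1000 contexts.
associated = pmi(pair_count=10, count_w1=20, count_w2=50, total=1000)
independent = pmi(pair_count=1, count_w1=20, count_w2=50, total=1000)
print(associated, independent)
```

A pair that co-occurs about as often as chance predicts gets a PMI near 0, while a pair that co-occurs ten times more often than chance gets a PMI of about $\log_2 10 \approx 3.32$. A pair that never co-occurs would need special handling, since the logarithm of 0 is undefined.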


### Word2Vec Embedding Method

* Word2Vec content was moved to its own notebook in `Week 13`