### Analyzing a Large Collection of Documents

* Identifying Common Content: Finding overlap or common content among multiple documents.
  * Word Handling: Specifically, it looks at non-stop words (words that carry significant meaning) to determine the relationship and content of the documents.
    * Raw Word Counting: It considers the sheer occurrence of words without diving deep into the semantics or the context in which the word is used.

* While this approach offers a straightforward way to analyze documents, there are certain limitations we need to be aware of.
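
A minimal sketch of the raw-counting idea described above. The stop-word list and the two sample sentences are assumptions made up for illustration; real stop-word lists are much longer.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "in", "a", "of", "and"}

def non_stop_counts(text):
    """Count lowercase whitespace tokens that are not stop words."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc_a = non_stop_counts("the sky is blue")
doc_b = non_stop_counts("the sun in the sky is bright")

# Words that occur in both documents (the "common content").
common = set(doc_a) & set(doc_b)
print(common)  # {'sky'}
```

Only raw occurrences are compared here; nothing about meaning or context is used, which is exactly the limitation discussed next.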



### Overlapping Issues in Large Document Sets

* As we include more diverse documents, the content they have in common shrinks.
  * "Sparse matches"
  * This can negatively impact "similarity" and subsequent tasks like clustering and classification.

* Taking all non-stop words from combined documents results in significantly large datasets.


* Forms of the same word, like "leave," "leaving," and "left," are counted as different words, even though they express similar ideas.
* Ignoring Word Semantics
  * We don't account for the different meanings a word can have based on context. For example:
    * "She works at the bank across the street" versus "houses on the bank of the river flooded due to a storm surge."
* Accurately understanding word semantics in context is a challenging task and we will delve into this later.

### Distribution of Words in a Text 

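* The frequency distribution of words in a language follows Zipf's law: a word's frequency is roughly inversely proportional to its frequency rank
  * Just FYI: a few words are very common and most words are rare, which makes computing reliable statistics for rare words difficult or impossible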
  
![](https://www.dropbox.com/s/neydq8wi2kqqof3/zipf_law.png?dl=1)

### Understanding Document Similarity: How We Measure a Match

* When we search for documents in Information Retrieval (IR), the goal is often to rank the results based on their relevance:
  * We want to find documents that are similar to our search criteria.

* The ideal result would rank these documents by how closely they match the search term.
* The aim is to list the most relevant documents first, making it easier for the searcher to find what they're looking for.
  * So, how do we decide the order of these documents in relation to a search term?
  * We give each document a score between 0 and 1.
* This score tells us how closely the document aligns with the search query.
  * Over the years, experts have come up with many innovative solutions in the Document Retrieval field to address these challenges.



### Measuring Document Similarity: Calculating a Match Score

* Think about a simple search using just one term.

* If the document doesn't have the search term, the score is 0.

* If the search term appears often in the document, the score should increase.


### Understanding the Jaccard Coefficient

* Jaccard Coefficient measures the overlap between two sets, A and B.
  * It calculates the overlap by considering all the terms in both A and B.

* It works even if A and B are of different sizes.

* The result is always a value between 0 and 1.

* Limitations:
  * It doesn't account for how often a term appears.
  * It doesn't recognize that rare terms can be more valuable than common ones.
    * This is why simply looking at the intersection might not always be best.

* A better method is needed to adjust for length, rather than just using $|A \cup B|$.
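
As a rough sketch, the coefficient is $|A \cap B| / |A \cup B|$. The function name `jaccard` and the example query and document below are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of terms, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch; the ratio is undefined here
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc = "caesar died in march".split()

# One shared term ("march") out of six distinct terms overall.
print(jaccard(query, doc))
```

Because the inputs are converted to sets, repeating a term in the document does not change the result, which is the first limitation listed above.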


### Understanding Term-Document Count Matrices

* A count matrix displays the frequency of each word within a document.
  * This approach is known as the "bag of words" model.
* The sequence of words in the document is not taken into account.
* For instance, the phrases `John is quicker than Mary` and `Mary is quicker than John` would produce identical vectors in this model.
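
A quick way to check the claim above, using `collections.Counter` as a stand-in for one row of a count matrix (a simplification; a real count matrix uses a fixed vocabulary order):

```python
from collections import Counter

# Each Counter maps a word to its count; word order is discarded.
bow_1 = Counter("John is quicker than Mary".lower().split())
bow_2 = Counter("Mary is quicker than John".lower().split())

print(bow_1 == bow_2)  # True
```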


### Understanding Term Frequency (`tf`)

* The term frequency, denoted as $tf_{t,d}$, represents how many times a term $t$ appears in a document $d$.
* While a higher $tf$ can indicate a better match, it's not always directly tied to the significance of that match. For instance:
  
  * A document where the term appears 10 times is more relevant than one where it appears just once. However, it's not necessarily 10 times more relevant.
  
  * This means the relevance doesn't scale linearly with the term frequency.


### Understanding Log-Frequency Weighting

* The weight of term $t$ in document $d$ can be determined using log-frequency as:

$$
w_{t,d} = \begin{cases} 
1+\log_{10}\mbox{tf}_{t,d} & \text{if } \mbox{tf}_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
$$

* As a reference: 
  * 0 maps to 0
  * 1 maps to 1
  * 2 maps to 1.3
  * 10 maps to 2
  * 1000 maps to 4, and so on.

* To calculate the score for a document-query pair, sum over terms `t` present in both the query (`q`) and the document (`d`):

$$
\mbox{score} = \sum_{t\in q \cap d}(1+\log_{10}\mbox{tf}_{t,d})
$$

* A score of 0 indicates that none of the terms from the query are found in the document.
* While there might be different formulas, the core idea behind this calculation remains consistent.
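
The weighting and the score above can be sketched as follows. The helper names `log_tf_weight` and `log_tf_score` and the sample document are assumptions for illustration, and tokenization is plain whitespace splitting.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def log_tf_score(query_terms, doc_tokens):
    """Sum the log-frequency weights of the query terms found in the document."""
    return sum(log_tf_weight(doc_tokens.count(t)) for t in set(query_terms))

# Reproduce the reference mapping listed above.
for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(log_tf_weight(tf), 1))

doc_tokens = "the sun in the sky is bright".split()
print(log_tf_score(["sun", "sky"], doc_tokens))  # each term occurs once, so 1 + 1
```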


### Importance of Document Frequency

* The challenge of rare terms remains:
  * Rare terms often provide more valuable information than common ones.
    * Stop words are the extreme case of common terms: they appear in nearly every document and carry little information.
* Take the term 'arachnid' in a query, which is seldom found in the collection:
  * A document that includes this term is highly probable to be pertinent to the query 'arachnid'.
  * This term significantly aids in contrasting documents effectively.
  * Therefore, it's beneficial to assign a higher weight to infrequent terms like 'arachnid'.


### Continuing with Document Frequency

* Common terms often offer less unique information than their rarer counterparts.
* Think of a query term that's widely seen in the collection, like `high`, `increase`, or `true`:
  * Solely using the $tf$ score, a document with these terms seems more relevant compared to one without.
  * However, this doesn't guarantee its significance.
* To evaluate how often a term appears across documents, we'll determine (or normalize using) its document frequency, denoted as `df`.


### Inverse Document Frequency (`idf`)

- $\mbox{df}_t$ is the document frequency of term $t$: the number of documents in the collection that contain $t$.
- $\mbox{df}_t$ serves as an inverse gauge of the term $t$'s informativeness.
  * Note: $\mbox{df}_t \le N$, with $N$ being the entire document count.

- The inverse document frequency ($\mbox{idf}$) of term $t$ is defined as:
$$
\mbox{idf}_t = \log_{10}(N/\mbox{df}_t)
$$

- We opt for the inverse since it's more practical than handling minuscule numbers, especially when $N$ is much larger than $\mbox{df}_t$.
- The logarithm (`log`) is incorporated to temper the `idf` effect, which becomes especially vital when managing vast document sets.
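
To get a feel for the numbers, here is a small sketch of the definition above. The collection size of one million documents and the three term names are hypothetical values picked for illustration.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: log10(N / df), defined for df >= 1."""
    return math.log10(n_docs / df)

N = 1_000_000  # hypothetical collection size

# (term, document frequency) pairs; both are made-up values.
for term, df in [("rare_term", 1), ("mid_term", 1_000), ("common_term", 1_000_000)]:
    print(term, df, round(idf(df, N), 2))
```

A term that occurs in every document gets an `idf` of 0, so it contributes nothing to a score, while the logarithm keeps the weight of a very rare term from growing in proportion to $N$.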


### Term Frequency-Inverse Document Frequency (`tf-idf`) Scheme

- The `tf-idf` weight of a term is derived from multiplying its term frequency (`tf`) and inverse document frequency (`idf`).

$$
\begin{split}
w_{t,d} &= (1+\log_{10}\mbox{tf}_{t,d}) \times \mbox{idf}_t \\
&= (1+\log_{10}\mbox{tf}_{t,d}) \times \log_{10}(N/\mbox{df}_t)
\end{split}
$$
- This weighting method is a well-accepted strategy in the realm of information retrieval.
  * Also written as: tf.idf, tf x idf

- The weight:
  * Rises as a term's occurrence within a document increases.
  * Also grows with the term's scarcity across the entire document set.


### Score for a Document Given a Query


$$
\mbox{Score}(q, d) = \sum_{t\in q\cap d} \mbox{tf-idf}_{t,d}
$$

* There are many variants
  * How `tf` is computed (with/without logs)
  * Whether the terms in the query are also weighted
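
One possible variant, sketched under stated assumptions: the log-frequency weight $1+\log_{10}\mbox{tf}$ for `tf`, $\log_{10}(N/\mbox{df})$ for `idf`, and unweighted query terms. The three-document collection `toy_docs` and the function names are made up for illustration.

```python
import math

# Hypothetical toy collection, already tokenized on whitespace.
toy_docs = {
    "doc_1": "the sky is blue".split(),
    "doc_2": "the sun is bright".split(),
    "doc_3": "the sun in the sky is bright".split(),
}

def idf(term, docs):
    """log10(N / df); terms that occur in no document get 0."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf_score(query, tokens, docs):
    """Sum of (1 + log10 tf) * idf over query terms present in the document."""
    score = 0.0
    for term in set(query):
        tf = tokens.count(term)
        if tf > 0:
            score += (1 + math.log10(tf)) * idf(term, docs)
    return score

query = "blue sky".split()
scores = {name: tf_idf_score(query, tokens, toy_docs) for name, tokens in toy_docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc_1', 'doc_3', 'doc_2']
```

`doc_1` matches both query terms, `doc_3` matches only the more common term `sky`, and `doc_2` matches neither, so it scores 0.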




### Using `tf-idf` for Feature Engineering
* Each document is represented by a real-valued vector of $\mbox{tf-idf}$ weights in $\mathbb{R}^{|V|}$, where $|V|$ is the size of the vocabulary

![](https://www.dropbox.com/s/1bx77e488ee6wek/count_tf_idf.png?dl=1)

In [50]:
import math

docs = {
    "doc_1": "the sky is blue",
    "doc_2": "the sun is bright",
    "doc_3": "the sun in the sky is bright",
}

docs = {k: v.lower().split() for k, v in docs.items()}
docs

{'doc_1': ['the', 'sky', 'is', 'blue'],
 'doc_2': ['the', 'sun', 'is', 'bright'],
 'doc_3': ['the', 'sun', 'in', 'the', 'sky', 'is', 'bright']}

In [166]:
N = len(docs)

def compute_tf(term, document):
    return document.count(term)
assert compute_tf("the",  docs['doc_1']) == 1


In [167]:
assert compute_tf("the",  docs['doc_3']) == 2


In [168]:
term= "the"

sum(1 for document in docs.values() if term in document)


3

In [169]:
term= "sky"
sum(1 for document in docs.values() if term in document)


2

In [170]:
term= "blue"
sum(1 for document in docs.values() if term in document)


1

In [171]:
def compute_idf(term, documents):
    N = len(documents)
    df = sum(1 for document in documents if term in document)    
    return math.log((1 + N) / (1 + df)) + 1

compute_idf("the", docs.values())

1.0

In [172]:
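# compute_idf uses the natural log, so compare against math.log
# (either base gives 1.0 here because log(1) == 0).
assert compute_idf("the", docs.values()) == math.log(4 / 4) + 1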



In [173]:
def compute_tf_idf(term, document, documents):
    tf = compute_tf(term, document)
    idf = compute_idf(term, documents)
    return round(tf * idf, 2)

compute_tf_idf("the",  docs["doc_1"], docs.values())


1.0

In [174]:
compute_tf_idf("sky",  docs["doc_1"], docs.values())


1.29

In [175]:
compute_tf_idf("blue",  docs["doc_1"], docs.values())

1.69

### Vector Representation in Information Retrieval (IR)

- Concept 1: Represent both queries and documents as vectors within a defined space.
- Concept 2: Sort documents based on how close their vectors are to the query vector in this space.
  * Here, closeness refers to vector similarity.
    * Using Euclidean distance might be misleading especially if vectors vary in length.
      * Large Euclidean distances can arise between vectors of dissimilar lengths.
- Concept 3: Order documents based on the angle they form with the query vector.
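
The difference between the two notions of closeness can be seen on a tiny example. The three 2-dimensional vectors below are hypothetical; `d_long` points in the same direction as the query but is ten times longer, as a long document with the same term proportions would be.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

q = [1.0, 1.0]          # query
d_long = [10.0, 10.0]   # same direction as q, much longer
d_other = [1.0, 0.0]    # different direction, similar length

print(euclidean(q, d_long), euclidean(q, d_other))  # d_other looks closer
print(cosine(q, d_long), cosine(q, d_other))        # d_long has the smaller angle
```

Euclidean distance prefers `d_other` only because of its length, while the angle-based view ranks `d_long` first.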


In [266]:
doc_1 = """The king hath happily received, Macbeth,
The news of thy success; and when he reads
Thy personal venture in the rebels' fight,
His wonders and his praises do contend
Which should be thine or his: silenced with that,
In viewing o'er the rest o' the selfsame day,
He finds thee in the stout Norweyan ranks,
Nothing afeard of what thyself didst make,
Strange images of death. As thick as hail
Came post with post; and every one did bear
Thy praises in his kingdom's great defence,
And pour'd them down before him."""

doc_1 = doc_1.lower()
doc_1

"the king hath happily received, macbeth,\nthe news of thy success; and when he reads\nthy personal venture in the rebels' fight,\nhis wonders and his praises do contend\nwhich should be thine or his: silenced with that,\nin viewing o'er the rest o' the selfsame day,\nhe finds thee in the stout norweyan ranks,\nnothing afeard of what thyself didst make,\nstrange images of death. as thick as hail\ncame post with post; and every one did bear\nthy praises in his kingdom's great defence,\nand pour'd them down before him."

In [267]:
doc_2 =  """The king was really happy to hear about your success, Macbeth. 
When he read about your brave actions fighting the rebels, he was so impressed he couldn't
decide who to praise more, you or himself. That same day, he noticed you also stood bravely
against the Norwegians and not scared at all, even when facing dangerous situations. Many
messengers came, one after the other, all of them talking about how great you were in defending
the kingdom. They all praised you in front of the king."""

doc_2 = doc_2.lower()
doc_2

"the king was really happy to hear about your success, macbeth. \nwhen he read about your brave actions fighting the rebels, he was so impressed he couldn't\ndecide who to praise more, you or himself. that same day, he noticed you also stood bravely\nagainst the norwegians and not scared at all, even when facing dangerous situations. many\nmessengers came, one after the other, all of them talking about how great you were in defending\nthe kingdom. they all praised you in front of the king."

In [268]:
lorem_ipsum = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris 
nisi ut aliquip ex ea commodo consequat. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris 
nisi ut aliquip ex ea commodo consequat. 
""".lower()

In [271]:
corpus = [doc_1, doc_2,  doc_1 + lorem_ipsum]

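# Caution: corpus entries are raw lowercase strings, not token lists, so
# compute_tf counts substrings, and the uppercase "KING" never matches (tf = 0).
compute_tf_idf("KING",  corpus[0], corpus)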

In [272]:
corpus

["the king hath happily received, macbeth,\nthe news of thy success; and when he reads\nthy personal venture in the rebels' fight,\nhis wonders and his praises do contend\nwhich should be thine or his: silenced with that,\nin viewing o'er the rest o' the selfsame day,\nhe finds thee in the stout norweyan ranks,\nnothing afeard of what thyself didst make,\nstrange images of death. as thick as hail\ncame post with post; and every one did bear\nthy praises in his kingdom's great defence,\nand pour'd them down before him.",
 "the king was really happy to hear about your success, macbeth. \nwhen he read about your brave actions fighting the rebels, he was so impressed he couldn't\ndecide who to praise more, you or himself. that same day, he noticed you also stood bravely\nagainst the norwegians and not scared at all, even when facing dangerous situations. many\nmessengers came, one after the other, all of them talking about how great you were in defending\nthe kingdom. they all praised you 

In [274]:
vocabulary = ['king', 'happily', 'and', "thy", "ipsum", "laboris"]
print([compute_tf_idf(v,  corpus[0], corpus) for v in vocabulary])
print([compute_tf_idf(v,  corpus[1], corpus) for v in vocabulary])
print([compute_tf_idf(v,  corpus[2], corpus) for v in vocabulary])


In [158]:

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline



In [205]:
c_vec = CountVectorizer(vocabulary=vocabulary)
c_vec

In [209]:
c_vec.fit_transform(corpus).todense()

matrix([[1, 1, 4, 3, 0, 0],
        [2, 0, 1, 0, 0, 0],
        [1, 1, 4, 3, 1, 2]])

In [213]:
[corpus[0].lower().split().count(v) for v in vocabulary]

[1, 1, 4, 3, 0, 0]

In [163]:

tfidf_vec = TfidfVectorizer(vocabulary=vocabulary)


In [200]:
tfidf_vec.fit_transform(corpus).todense().round(2)

array([[0.17, 0.22, 0.69, 0.67, 0.  , 0.  ],
       [0.89, 0.  , 0.45, 0.  , 0.  , 0.  ],
       [0.14, 0.19, 0.58, 0.56, 0.24, 0.49]])

In [248]:
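import math
import numpy as np

def compute_tf(term, document):
    # Relative frequency over whitespace tokens; punctuation stays attached,
    # so "king." is not counted as "king".
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

def compute_idf(term, documents):
    N = len(documents)
    # Substring test on the raw text of each document.
    df = sum(1 for document in documents if term in document)
    # Smoothed idf with log base 10; scikit-learn's TfidfVectorizer uses the
    # natural log, so the values below differ slightly from its output.
    return math.log10((1+N) / (1 + df)) + 1


def compute_tf_idf(document, terms, idfs):
    tf_idf_raw = [compute_tf(term, document) * idfs.get(term, 0) for term in terms]
    norm = np.linalg.norm(tf_idf_raw, 2)  # L2 norm
    tf_idf_normalized = [round(value / norm, 2) for value in tf_idf_raw]

    return tf_idf_normalized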


idfs = {term: compute_idf(term, corpus) for term in vocabulary}

for doc in corpus:
    tf_idf_values = compute_tf_idf(doc, vocabulary, idfs)
    print(tf_idf_values)

[0.18, 0.21, 0.73, 0.62, 0.0, 0.0]
[0.71, 0.0, 0.71, 0.0, 0.0, 0.0]
[0.16, 0.18, 0.65, 0.55, 0.21, 0.42]


In [262]:
vocabulary

['king', 'happily', 'and', 'thy', 'ipsum', 'laboris']

In [279]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.preprocessing import normalize

c_vec = CountVectorizer(vocabulary=vocabulary)
X = c_vec.fit_transform(corpus).toarray()

X

array([[1, 1, 4, 3, 0, 0],
       [2, 0, 1, 0, 0, 0],
       [1, 1, 4, 3, 1, 2]])

In [284]:
print(X[0])
print(sum(X[0]))
print(X[0]/sum(X[0]))



[1 1 4 3 0 0]
9
[0.11111111 0.11111111 0.44444444 0.33333333 0.         0.        ]


In [285]:
tf = X / np.sum(X, axis=1, keepdims=True)
tf

array([[0.11111111, 0.11111111, 0.44444444, 0.33333333, 0.        ,
        0.        ],
       [0.66666667, 0.        , 0.33333333, 0.        , 0.        ,
        0.        ],
       [0.08333333, 0.08333333, 0.33333333, 0.25      , 0.08333333,
        0.16666667]])

In [288]:
X

array([[1, 1, 4, 3, 0, 0],
       [2, 0, 1, 0, 0, 0],
       [1, 1, 4, 3, 1, 2]])

In [287]:
df = np.sum(X > 0, axis=0)
df

array([3, 2, 3, 2, 1, 1])

In [290]:
N = X.shape[0]
N

3

In [294]:
idf = np.log((1 + N) / (1 + df)) + 1
idf

array([1.        , 1.28768207, 1.        , 1.28768207, 1.69314718,
       1.69314718])

In [293]:
tfidf = tf * idf
tfidf

array([[0.11111111, 0.14307579, 0.44444444, 0.42922736, 0.        ,
        0.        ],
       [0.66666667, 0.        , 0.33333333, 0.        , 0.        ,
        0.        ],
       [0.08333333, 0.10730684, 0.33333333, 0.32192052, 0.1410956 ,
        0.2821912 ]])

In [296]:
x = np.array([1,2,3])
x*2

array([2, 4, 6])

In [302]:
sum(x**2)

14

In [303]:
x / np.sqrt(sum(x**2))

array([0.26726124, 0.53452248, 0.80178373])

In [304]:
normalize([x], norm='l2', axis=1)

array([[0.26726124, 0.53452248, 0.80178373]])

In [305]:
tfidf = normalize(tfidf, norm='l2', axis=1)
print(tfidf.round(2))

[[0.17 0.22 0.69 0.67 0.   0.  ]
 [0.89 0.   0.45 0.   0.   0.  ]
 [0.14 0.19 0.58 0.56 0.24 0.49]]


### From Angles to Cosines

* In information retrieval, the following two notions are equivalent.
  * Rank documents in increasing order of the angle between query and hit
  * Rank documents in decreasing order of cosine(query, hit)

* Cosine is a monotonically decreasing function on the interval [0°, 180°]

![](https://www.dropbox.com/s/lpq4vvnlnmz0oxw/cosine.png?dl=1)

### Length Normalization

* A vector can be (length-) normalized by dividing each of its components by its length 
  * We commonly use the $L2$ norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$

* Dividing a vector by its $L2$ norm makes it a unit (length) vector

  * Effect: a document $d$ and $d'$ ($d$ appended to itself) have identical vectors after length normalization.
  * Thus, long and short documents now have comparable weights
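
A small sketch of the effect described above, using raw term counts as the vector components. The vocabulary, the sample document, and the helper names are assumptions for illustration.

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

vocab = ["the", "sky", "is", "blue"]

def count_vector(tokens):
    """Raw term counts in the fixed vocabulary order."""
    return [tokens.count(term) for term in vocab]

d = "the sky is blue the sky".split()
d_prime = d + d  # d appended to itself

v = l2_normalize(count_vector(d))
v_prime = l2_normalize(count_vector(d_prime))

# Counts double, but the unit-length vectors coincide (up to floating-point error).
print(count_vector(d), count_vector(d_prime))
print(all(math.isclose(a, b) for a, b in zip(v, v_prime)))  # True
```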


### Cosine Similarity

* $q_i$ is the `tf-idf` weight of term `i` in the query
* $d_i$ is the `tf-idf` weight of term `i` in the document

![](https://www.dropbox.com/s/4x1fb50xiqidmnf/cos_equation.png?dl=1)

### Normalized TF-IDF Vectors



### Cosine Similarity Illustrated

![](https://www.dropbox.com/s/4inqt6nf9mfz6h9/cosine_similarity.png?dl=1)

### Example 
* Books: "Sense and Sensibility" (SaS), "Pride and Prejudice" (PaP), "Wuthering Heights" (WH).

<img src="https://www.dropbox.com/s/z28xu8xxhuv8ll5/example_books.png?dl=1" alt="Drawing" style="width: 400px;"/>

```
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
```
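
The first line of the calculation above can be reproduced with a short sketch. The two vectors are the length-normalized weights that appear in that line; the function name `cosine_similarity` is an assumption, and it divides by the norms so it also works for vectors that are not yet normalized.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

sas = [0.789, 0.515, 0.335, 0.0]  # "Sense and Sensibility"
pap = [0.832, 0.555, 0.0, 0.0]    # "Pride and Prejudice"

print(round(cosine_similarity(sas, pap), 2))  # 0.94
```

Since both vectors already have (approximately) unit length, the cosine is essentially just their dot product, which is why the worked example only multiplies and adds.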

### `tf-idf` Weighting Variants

* Just an FYI
<div align="center">
<img src="https://www.dropbox.com/s/r88cmbmaqyk7hcp/weighting_schemes.png?dl=1" alt="Drawing" style="width: 700px;"/>
</div>
* The table identifies the components of the SMART notation: the combination in use in a search engine is written as ddd.qqq
  * e.g.: lnc.ltc
  * From [Wikipedia](https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System): "To the legacy of the SMART system belongs the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for representing a combination of weights takes the form ddd.qqq, where the first three letters represents the term weighting of the collection document vector and the second three letters represents the term weighting for the query document vector. For example, ltc.lnn represents the ltc weighting applied to a collection document and the lnn weighting applied to a query document."
  
* See [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/) for more info if you're interested in the topic


