# IR in the Richard Cabot book - Atoms

Case teaching in medicine; by Cabot, Richard C. (Richard Clarke), 1868-1939.

Accessing Solr via [pysolr]()

The following examples are extracted from Richard Cabot's book, which you can access here:
[Case teaching in medicine](../../data/case-teaching-cabot/caseteachinginm02cabogoog_djvu.txt)
or in CSV: [Case teaching in medicine in CSV](../../data/case-teaching-cabot/case-teaching-cabot.csv).

# IR Atoms in Python

Based on the tutorial [Text clustering with K-means and tf-idf](https://medium.com/@MSalnikov/text-clustering-with-k-means-and-tf-idf-f099bcf95183) by Mikhail Salnikov on Aug 5, 2018.

See also the accompanying notebook: https://github.com/MihailSalnikov/tf-idf_and_k-means

In [1]:
import re
import string
import pandas as pd
from functools import reduce
from math import log

## Simple example of [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
1. Example of corpus
2. Preprocessing and Tokenizing
3. Calculating bag of words
4. TF
5. IDF
6. TF-IDF

## Scenario 1 - Simpler

Run all algorithms below with the following data (skip the data initialization cells of the other scenarios).

In [11]:
#1
# case 8
# case 10
# case 17
# case 40
corpus = """
pain fever thrombosis
pain vomiting edema
pain fever weakness
pain fever weakness
""".split("\n")[1:-1]

## Scenario 2 - Simpler with Phrases

In [3]:
#1
# case 8
# case 10
# case 17
# case 40
corpus = """
has pain fever and thrombosis
has pain is vomiting and has edema
has pain fever and weakness
has pain fever and weakness
""".split("\n")[1:-1]

## Scenario 3 - More Complex

Run all algorithms below with the following data.

In [4]:
#1
# case 8
# case 29
# case 55
# case 66
corpus = """
as pulmonary embolism from the thrombosed abdominal veins
Mitral disease may favor the occurrence of bronchitis
mitral stenosis and cerebral embolism
as in mitral disease or from embolism
""".split("\n")[1:-1]

# Scenarios

After selecting and initializing one scenario (1, 2, or 3), run the following algorithms. The outputs shown below correspond to Scenario 1.

In [12]:
#2
l_A = corpus[0].lower().split()
l_B = corpus[1].lower().split()
l_C = corpus[2].lower().split()
l_D = corpus[3].lower().split()

print(l_A)
print(l_B)
print(l_C)
print(l_D)

['pain', 'fever', 'thrombosis']
['pain', 'vomiting', 'edema']
['pain', 'fever', 'weakness']
['pain', 'fever', 'weakness']
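
The scenario strings contain no punctuation, so `lower().split()` is enough here. For raw text from the book, a slightly more careful tokenizer may be needed. The following is a minimal sketch, assuming that punctuation should simply be replaced by spaces; `tokenize` is a hypothetical helper, not part of the original notebook.

```python
import re
import string

# Hypothetical helper (an assumption, not from the original notebook):
# matches any single ASCII punctuation character.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def tokenize(text):
    """Lowercase `text`, replace punctuation with spaces, split on whitespace."""
    return _PUNCT_RE.sub(" ", text.lower()).split()

print(tokenize("Mitral disease may favor the occurrence of bronchitis."))
# ['mitral', 'disease', 'may', 'favor', 'the', 'occurrence', 'of', 'bronchitis']
```

Because hyphens count as punctuation in this sketch, a hyphenated word is split into its parts.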


In [13]:
#3
word_set = set(l_A).union(set(l_B)).union(set(l_C)).union(set(l_D))
print(word_set)

{'edema', 'vomiting', 'fever', 'thrombosis', 'weakness', 'pain'}


In [14]:
word_dict_A = dict.fromkeys(word_set, 0)
word_dict_B = dict.fromkeys(word_set, 0)
word_dict_C = dict.fromkeys(word_set, 0)
word_dict_D = dict.fromkeys(word_set, 0)

for word in l_A:
    word_dict_A[word] += 1

for word in l_B:
    word_dict_B[word] += 1

for word in l_C:
    word_dict_C[word] += 1

for word in l_D:
    word_dict_D[word] += 1

    
pd.DataFrame([word_dict_A, word_dict_B, word_dict_C, word_dict_D])

| | edema | fever | pain | thrombosis | vomiting | weakness |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 |
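
The cells above use one variable per string (`l_A` to `l_D`), which ties the code to exactly four strings. The same vocabulary and counts can be built for any number of strings. This is a sketch using `collections.Counter`; the names `docs`, `vocabulary`, and `rows` are illustrative and do not appear in the original notebook.

```python
from collections import Counter

corpus = [
    "pain fever thrombosis",
    "pain vomiting edema",
    "pain fever weakness",
    "pain fever weakness",
]

# Tokenize every string the same way as the cells above.
docs = [line.lower().split() for line in corpus]

# Vocabulary: union of the words of all strings, sorted for a stable order.
vocabulary = sorted(set().union(*docs))

# One {word: count} dict per string; words absent from a string count as 0.
rows = []
for doc in docs:
    counts = Counter(doc)
    rows.append({word: counts[word] for word in vocabulary})

print(vocabulary)
print(rows[0])
```

Passing `rows` to `pd.DataFrame` should give the same table as above.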


## \#4 tf - term frequency
The simplest choice for the term frequency $tf(t,d)$ is the raw count of a term in a string. Here the count is normalized by the length of the string:
$${\displaystyle \mathrm {tf} (t,d)={\frac {n_{t}}{\sum _{k}n_{k}}}} $$
where $n_t$ is the number of occurrences of the word $t$ in the string and the denominator is the total number of words in that string.

In [15]:
def compute_tf(word_dict, l):
    tf = {}
    sum_nk = len(l)
    for word, count in word_dict.items():
        tf[word] = count/sum_nk
    return tf

tf_A = compute_tf(word_dict_A, l_A)
tf_B = compute_tf(word_dict_B, l_B)
tf_C = compute_tf(word_dict_C, l_C)
tf_D = compute_tf(word_dict_D, l_D)

pd.DataFrame([tf_A, tf_B, tf_C, tf_D])

| | edema | fever | pain | thrombosis | vomiting | weakness |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.333333 | 0.333333 | 0.333333 | 0.0 | 0.0 |
| 1 | 0.333333 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.0 |
| 2 | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.333333 |
| 3 | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.333333 |


## \#5 idf - inverse document frequency
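idf measures how much information a word provides: a word that appears in few strings gets a high idf, and a word that appears in every string gets an idf of 0.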
$$ \mathrm{idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|} $$
- $N$: total number of strings in the corpus, $N = |D|$
- $|\{d \in D : t \in d\}|$: number of strings in which the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$). If the term does not occur in the corpus, this count is zero and the division fails. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$.
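
The adjusted denominator can be sketched as follows. This is an illustrative variant, not the notebook's `compute_idf`; the name `compute_idf_smoothed` and the zero-count word `cough` are assumptions added only to show the case of a term that appears in no string.

```python
from math import log

def compute_idf_smoothed(word_dicts):
    """idf with the 1 + df denominator; each dict maps word -> count."""
    n = len(word_dicts)
    vocabulary = set().union(*word_dicts)  # union of the keys of all dicts
    idf = {}
    for word in vocabulary:
        # df: number of strings in which the word occurs at least once
        df = sum(1 for d in word_dicts if d.get(word, 0) > 0)
        idf[word] = log(n / (1 + df))  # never divides by zero
    return idf

word_dicts = [
    {"pain": 1, "fever": 1, "thrombosis": 1, "cough": 0},  # "cough" never occurs
    {"pain": 1, "vomiting": 1, "edema": 1},
    {"pain": 1, "fever": 1, "weakness": 1},
    {"pain": 1, "fever": 1, "weakness": 1},
]
idf_smoothed = compute_idf_smoothed(word_dicts)
print(idf_smoothed["cough"])  # log(4 / 1): defined even though df is 0
print(idf_smoothed["pain"])   # log(4 / 5): negative, since "pain" is in every string
```

The adjustment shifts every value, so a term that appears in all strings gets a slightly negative idf instead of 0.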

In [16]:
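def compute_idf(strings_list):
    # strings_list: one {word: count} dict per string, all sharing the same keys
    n = len(strings_list)
    # Document frequency: number of strings in which each word occurs
    idf = dict.fromkeys(strings_list[0].keys(), 0)
    for word_dict in strings_list:
        for word, count in word_dict.items():
            if count > 0:
                idf[word] += 1

    # Every word comes from the corpus, so v >= 1 and the division is safe
    for word, v in idf.items():
        idf[word] = log(n / float(v))
    return idf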

idf = compute_idf([word_dict_A, word_dict_B, word_dict_C, word_dict_D])

pd.DataFrame([idf])

| | edema | fever | pain | thrombosis | vomiting | weakness |
|---|---|---|---|---|---|---|
| 0 | 1.386294 | 0.287682 | 0.0 | 1.386294 | 1.386294 | 0.693147 |


## \#6 tf-idf
Then tf–idf is calculated as
$$ {\displaystyle \mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)} $$
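
As a worked instance for Scenario 1, using the natural logarithm (the `log` imported from `math`) and writing $d_A$ for the first string (a notation introduced here only for illustration): "thrombosis" is one of the three words of $d_A$ and occurs in one of the four strings, while "pain" occurs in all four.

$$ \mathrm{tfidf}(\text{thrombosis}, d_A, D) = \frac{1}{3} \cdot \log\frac{4}{1} \approx 0.4621, \qquad \mathrm{tfidf}(\text{pain}, d_A, D) = \frac{1}{3} \cdot \log\frac{4}{4} = 0 $$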

In [17]:
def compute_tf_idf(tf, idf):
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for word, v in tf.items():
        tf_idf[word] = v * idf[word]
    return tf_idf

tf_idf_A = compute_tf_idf(tf_A, idf)
tf_idf_B = compute_tf_idf(tf_B, idf)
tf_idf_C = compute_tf_idf(tf_C, idf)
tf_idf_D = compute_tf_idf(tf_D, idf)

pd.DataFrame([tf_idf_A, tf_idf_B, tf_idf_C, tf_idf_D])

| | edema | fever | pain | thrombosis | vomiting | weakness |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.095894 | 0.0 | 0.462098 | 0.0 | 0.0 |
| 1 | 0.462098 | 0.0 | 0.0 | 0.0 | 0.462098 | 0.0 |
| 2 | 0.0 | 0.095894 | 0.0 | 0.0 | 0.0 | 0.231049 |
| 3 | 0.0 | 0.095894 | 0.0 | 0.0 | 0.0 | 0.231049 |
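
The steps #2 to #6 can be combined into a single function that works for any number of strings. The following is a minimal sketch under the same definitions (normalized tf, unsmoothed idf with the natural logarithm); the function name `tf_idf` and the "top term" selection are illustrative additions, not part of the original notebook.

```python
from collections import Counter
from math import log

def tf_idf(corpus):
    """Return one {word: tf-idf} dict per string (strings must be non-empty)."""
    docs = [line.lower().split() for line in corpus]
    n = len(docs)
    # Document frequency: in how many strings each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append(
            {w: (counts[w] / len(doc)) * log(n / df[w]) for w in sorted(df)}
        )
    return rows

corpus = [
    "pain fever thrombosis",
    "pain vomiting edema",
    "pain fever weakness",
    "pain fever weakness",
]
rows = tf_idf(corpus)

# Highest-weighted term per string; ties go to the alphabetically first word.
top_terms = [max(row, key=row.get) for row in rows]
print(top_terms)  # ['thrombosis', 'edema', 'weakness', 'weakness']
```

In this scenario "pain" scores 0 in every string because it appears in all of them, so it never becomes the top term.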
