# Thesis notebook


### Setup and prep
In this section we check that we are using the correct version of Python (3.8), and we import some of the basic libraries which we will need throughout the notebook.

In [1]:
!python --version

Python 3.8.18


In [2]:
import os

# tqdm.auto selects the notebook or console progress bar automatically.
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

  from .autonotebook import tqdm as notebook_tqdm


## Ranking Queries

In [3]:
from Library.searcher import Searcher
searcher = Searcher()

In [4]:
from Library.evaluations import *



data/trec_2022_train_reldocs.jsonl


In [5]:
results = searcher.search(0, 5) 

for (result, rel_judge) in results:
    query_relevancy_labels = get_relevancy_labels(result, rel_judge)
    print(NDCG(query_relevancy_labels, 500))

Agriculture
Amphibians and Reptiles
Astronomy
Aviation
Biography/WikiProject Actors and Filmmakers
0.43301057140649213
0.5796176244972758
0.7367988169552996
0.8857038233406795
0.6385168742336628
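
The `NDCG` helper is imported from `Library.evaluations` and its implementation is not shown in this notebook. As a rough guide to what such a score measures, here is a minimal sketch of the textbook definition for binary labels. The names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the ideal ordering is computed from the retrieved list only, which may differ from how the library normalises.

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

perfect = [1, 1, 0, 0]         # relevant documents ranked first
reversed_order = [0, 0, 1, 1]  # relevant documents ranked last
print(ndcg_at_k(perfect, 4))
print(ndcg_at_k(reversed_order, 4))
```

A ranking that places all relevant documents first scores 1, and the score drops as relevant documents move down the list.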


In [6]:
ranking1 = results[0][0]
relevance_1 = results[0][1]

print(sum(get_relevancy_labels(ranking1[:100], relevance_1)))
print(sum(get_relevancy_labels(ranking1[:50], relevance_1)))
print(sum(get_relevancy_labels(ranking1[:10], relevance_1)))

54
28
7


#### Conclusions from default ranking
As the results show, 54 of the top 100 documents (out of the 500 retrieved) are relevant according to our relevance judgments, so the default ranking works okay-ish but is definitely not great. The counts at cutoffs 50 and 10 (28 and 7 relevant documents) show that just over half of the results are relevant at the deeper cutoffs and 70% in the top 10, so relevant documents are only mildly concentrated near the top. Thus, our ranking method could definitely improve.
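
The counts printed above can be turned into precision at each cutoff. The snippet below is a small sketch: `precision_at_k` is a hypothetical helper that is not part of `Library.evaluations`, and the counts are copied from the cell output.

```python
def precision_at_k(labels, k):
    """Fraction of the first k results that are labelled relevant (label 1)."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0

# Relevant-document counts printed by the cell above, keyed by cutoff.
relevant_counts = {100: 54, 50: 28, 10: 7}
for k, count in sorted(relevant_counts.items()):
    print(f"P@{k} = {count / k:.2f}")
```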

## Fairness
In this section we set up our lookup table for the fairness attributes.

In [7]:
from Library.fairness import *

In [8]:
gender_align.sum()

gender
@UNKNOWN    4610461.0
female       353933.0
male        1495647.0
NB              571.0
dtype: float64
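
To read these totals as proportions, the sketch below normalises them in plain Python. The numbers are copied from the output above, and `normalise` is a hypothetical helper written only for this illustration.

```python
# Totals copied from the `gender_align.sum()` output above.
totals = {"@UNKNOWN": 4610461.0, "female": 353933.0, "male": 1495647.0, "NB": 571.0}

def normalise(counts):
    """Scale non-negative counts so that they sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

overall = normalise(totals)
# Shares among pages with a known gender label only.
known = normalise({k: v for k, v in totals.items() if k != "@UNKNOWN"})

for label in totals:
    print(f"{label:>9}: {overall[label]:.4f}")
```

Most pages carry no gender label at all, which is why the later cells treat the `@UNKNOWN` column separately from the known groups.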

## Clustering

Data exploration for a new searcher

### LDA

#### Setup and prep

In [9]:
!python -m spacy download en

⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)

In [10]:
from Library.lda import *

rankedClusters, cluster_preferences, doc_topics = get_clustering(results[0][0], searcher.queries)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# For each document, keep the id of the topic with the highest probability.
rankedClusters = [max(d, key=lambda pair: pair[1])[0] for d in doc_topics]
print(rankedClusters)

[7, 9, 1, 3, 9, 1, 9, 9, 9, 1, 1, 9, 9, 3, 1, 9, 1, 3, 9, 1, 9, 9, 9, 9, 9, 9, 7, 1, 1, 6, 1, 1, 9, 3, 3, 1, 3, 3, 3, 1, 9, 9, 9, 7, 9, 3, 3, 9, 6, 1, 9, 3, 1, 1, 1, 9, 3, 1, 3, 3, 9, 3, 9, 3, 1, 9, 9, 3, 9, 9, 3, 3, 9, 9, 9, 3, 9, 9, 1, 1, 9, 3, 9, 9, 1, 9, 3, 1, 3, 1, 1, 1, 9, 9, 9, 9, 3, 1, 1, 3, 9, 3, 3, 9, 9, 9, 3, 3, 9, 3, 1, 3, 1, 1, 9, 9, 3, 3, 9, 3, 9, 9, 1, 9, 3, 3, 9, 9, 3, 3, 3, 1, 9, 3, 1, 9, 9, 9, 3, 3, 3, 3, 3, 9, 3, 9, 9, 1, 9, 9, 9, 9, 1, 1, 1, 9, 9, 9, 9, 1, 3, 9, 3, 9, 3, 1, 9, 3, 3, 3, 9, 9, 9, 9, 9, 7, 3, 7, 9, 9, 9, 3, 1, 9, 9, 1, 3, 1, 3, 9, 9, 3, 9, 9, 3, 1, 1, 1, 9, 1, 3, 9, 3, 1, 7, 9, 3, 1, 3, 1, 6, 9, 9, 9, 3, 3, 9, 3, 9, 3, 1, 9, 1, 1, 9, 3, 1, 1, 3, 9, 1, 1, 1, 3, 3, 3, 3, 4, 9, 9, 3, 3, 3, 9, 3, 3, 3, 1, 3, 3, 1, 9, 9, 1, 9, 9, 3, 1, 9, 9, 1, 3, 1, 9, 1, 9, 3, 3, 1, 9, 9, 3, 3, 1, 1, 1, 9, 9, 9, 9, 3, 9, 9, 3, 9, 9, 3, 1, 1, 3, 3, 9, 9, 1, 3, 9, 9, 9, 3, 9, 1, 1, 9, 3, 9, 9, 6, 9, 3, 9, 9, 3, 9, 9, 3, 3, 7, 1, 1, 9, 3, 3, 1, 3, 9, 9, 9, 1, 3, 9, 9, 9, 3, 

## Clustering Based reranking

### Round Robin Clusters

In [12]:
from Library.roundrobin import *
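
The implementation in `Library.roundrobin` is not shown here. The sketch below is only an assumption of what a round-robin interleaving over clusters can look like: it cycles through the clusters and takes the best remaining document from each in turn. The function name, its parameters, and the toy data are all hypothetical.

```python
from collections import defaultdict, deque

def round_robin_by_cluster(ranking, cluster_ids, cluster_order=None, limit=None):
    """Interleave a ranking by cycling over clusters.

    ranking       -- documents in their original order (best first)
    cluster_ids   -- cluster label for each document, aligned with ranking
    cluster_order -- optional order in which clusters are visited
    """
    queues = defaultdict(deque)
    for doc, cluster in zip(ranking, cluster_ids):
        queues[cluster].append(doc)  # keeps the original order inside a cluster
    if cluster_order is None:
        cluster_order = list(queues)  # order of first appearance
    limit = len(ranking) if limit is None else limit
    output = []
    while len(output) < limit and any(queues[c] for c in cluster_order):
        for cluster in cluster_order:
            if queues[cluster] and len(output) < limit:
                output.append(queues[cluster].popleft())
    return output

docs = ["a", "b", "c", "d", "e", "f"]
clusters = [9, 9, 9, 1, 1, 3]
print(round_robin_by_cluster(docs, clusters))  # ['a', 'd', 'f', 'b', 'e', 'c']
```

Passing an explicit `cluster_order` would let a preference ordering over clusters decide which cluster is served first in each round.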

## MMR

$$ \text{MMR} = \arg\max_{d_i \in D \setminus R} \left[ \lambda \cdot Sim_1(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in R} Sim_2(d_i, d_j) \right] $$

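Here, $D$ is the set of all candidate documents, $R$ is the set of already selected documents, $q$ is the query, $Sim_1$ is the similarity function between a document and the query, and $Sim_2$ is the similarity function between two documents. $d_i$ ranges over the candidates that have not been selected yet ($D \setminus R$), and $d_j$ ranges over the selected documents in $R$. The parameter $\lambda \in [0, 1]$ sets the trade-off: $\lambda = 1$ ranks purely by relevance, while smaller values penalise similarity to already selected documents more strongly.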

In [13]:
def MMR(ranked_list_inp, comp_function, lamb=0.5):
    # Work on a shallow copy so the caller's list is left untouched.
    ranked_list = ranked_list_inp[:]
    # The top-scored item is always selected first.
    output_list = [ranked_list.pop(0)]

    # Trade off the ranking score against the similarity to the items
    # selected so far (comp_function), weighted by lamb.
    # Stop early if the candidates run out before 100 items are selected.
    while ranked_list and len(output_list) < 100:
        intermediate_list = [
            lamb * item.score - (1 - lamb) * comp_function(item, output_list)
            for item in ranked_list]

        index = intermediate_list.index(max(intermediate_list))
        output_list.append(ranked_list.pop(index))
    return output_list
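
The `comp_function` argument is not defined in this section. To make the selection step concrete, here is a self-contained toy sketch under stated assumptions: documents are hypothetical `Doc` tuples carrying a score and a set of terms, and the similarity to the selected set is the maximum Jaccard overlap. None of these names come from `Library.mmr`.

```python
from collections import namedtuple

# Hypothetical document type: a retrieval score plus a set of terms.
Doc = namedtuple("Doc", ["docid", "score", "terms"])

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def max_similarity(item, selected):
    """Sim_2 term: highest similarity between item and any selected document."""
    return max(jaccard(item.terms, other.terms) for other in selected)

def mmr_select(candidates, comp_function, lamb=0.5, limit=100):
    """Greedy MMR over a non-empty candidate list sorted by score, best first."""
    remaining = list(candidates)
    selected = [remaining.pop(0)]
    while remaining and len(selected) < limit:
        gains = [lamb * d.score - (1 - lamb) * comp_function(d, selected)
                 for d in remaining]
        selected.append(remaining.pop(gains.index(max(gains))))
    return selected

docs = [
    Doc("d1", 1.0, {"frog", "toad"}),
    Doc("d2", 0.9, {"frog", "toad"}),   # near-duplicate of d1
    Doc("d3", 0.8, {"snake", "lizard"}),
]
print([d.docid for d in mmr_select(docs, max_similarity, lamb=0.5)])  # ['d1', 'd3', 'd2']
```

With `lamb=0.5` the near-duplicate `d2` is pushed below `d3`. With `lamb=1.0` the similarity penalty vanishes and the original score order is kept.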

#### MMR on fairness features

In [14]:
from Library.mmr import *

### Clustering
Either create or load the stored clustering of our original ranking.

In [15]:
import csv
clusterings = []
clustering_preferences = []
clustering_topics = []

version = "500items_10_clusters"
file = f"clusters/clusterings{version}.csv"
file2 = f"clusters/clustering_preferences{version}.csv"
file3 = f"clusters/clustering_topics{version}.csv"
if not (os.path.isfile(file)):
    
    for i, ranking in enumerate(tqdm(rankings_total)):
        clusters, preferences, topics = get_clustering(ranking, ranking_index=i, searcher=searcher)
        
        clusterings.append(clusters)
        clustering_preferences.append(preferences)
        clustering_topics.append(list(topics))
        
    with open(file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(clusterings)

    with open(file2, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(clustering_preferences)

    with open(file3, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(clustering_topics)

else:
    import ast  # literal_eval parses stored literals without executing code

    def load_rows(path):
        # Read a CSV file and parse every cell back into a Python literal.
        with open(path, newline='') as f:
            return [[ast.literal_eval(cell) for cell in row]
                    for row in csv.reader(f)]

    clusterings = load_rows(file)
    clustering_preferences = load_rows(file2)
    clustering_topics = load_rows(file3)

In [16]:
print(max(clustering_topics[0]))

[(0, 0.22743246), (1, 0.092498556), (2, 4.889585e-06), (3, 0.14233871), (4, 4.109632e-05), (5, 1.7180402e-05), (6, 0.033089202), (7, 8.071449e-05), (8, 1.3253429e-05), (9, 0.50448394)]
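
Each entry is a list of (topic id, probability) pairs for one document. As a small illustration, the dominant topic of the distribution printed above can be read off as follows. `dominant_topic` is a hypothetical helper, and the numbers are copied from the output.

```python
# Topic distribution copied from the output above: (topic id, probability) pairs.
doc_topic_dist = [
    (0, 0.22743246), (1, 0.092498556), (2, 4.889585e-06), (3, 0.14233871),
    (4, 4.109632e-05), (5, 1.7180402e-05), (6, 0.033089202),
    (7, 8.071449e-05), (8, 1.3253429e-05), (9, 0.50448394),
]

def dominant_topic(distribution):
    """Return the topic id with the highest probability."""
    topic_id, _ = max(distribution, key=lambda pair: pair[1])
    return topic_id

print(dominant_topic(doc_topic_dist))  # prints 9
```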


In [17]:
def qr_join(align):
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])

In [18]:
qr_gender_align = qr_join(gender_align)
qr_gender_align.head()

                  @UNKNOWN  female  male   NB
topic_id page_id
84       572           1.0     0.0   0.0  0.0
84       627           1.0     0.0   0.0  0.0
84       678           1.0     0.0   0.0  0.0
84       903           1.0     0.0   0.0  0.0
84       1193          1.0     0.0   0.0  0.0


In [19]:
qr_gender_tgt = qr_gender_align.groupby('topic_id').mean()
qr_gender_fk = qr_gender_tgt.iloc[:, 1:].sum('columns')
qr_gender_tgt.iloc[:, 1:] *= 0.5
qr_gender_tgt.iloc[:, 1:] += qr_gender_fk.apply(lambda k: gender_tgt * k * 0.5)
qr_gender_tgt.head()

          @UNKNOWN    female      male        NB
topic_id
84        0.905943  0.033790  0.059797  0.000470
111       0.996106  0.001344  0.002531  0.000019
265       0.883099  0.038968  0.077328  0.000647
323       0.890183  0.033058  0.076210  0.000549
396       0.007847  0.428546  0.558768  0.005349
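
The cell above leaves the `@UNKNOWN` share of each topic untouched and blends the known groups half-and-half between the observed shares and a global target scaled by the topic's known mass. A plain-Python sketch of that arithmetic is given below. `gender_tgt` is not defined in this section, so the `global_target` values here are made up for illustration, as are the observed shares and the helper name `blend_target`.

```python
def blend_target(observed, global_target, weight=0.5):
    """Blend a topic's observed known-gender shares with a global target.

    observed      -- dict with '@UNKNOWN' plus the known groups, summing to 1
    global_target -- dict over the known groups only, summing to 1
    """
    # Total mass of pages with a known label (qr_gender_fk in the cell above).
    known_mass = sum(v for k, v in observed.items() if k != "@UNKNOWN")
    blended = {"@UNKNOWN": observed["@UNKNOWN"]}
    for group, share in global_target.items():
        blended[group] = weight * observed[group] + (1 - weight) * known_mass * share
    return blended

# Illustrative numbers only; not taken from the notebook's data.
observed = {"@UNKNOWN": 0.2, "female": 0.2, "male": 0.6, "NB": 0.0}
global_target = {"female": 0.49, "male": 0.49, "NB": 0.02}
target = blend_target(observed, global_target)
print(target)
```

Because the global target sums to 1, the blended row still sums to 1, and a group that is absent from a topic (here `NB`) still receives a non-zero target share.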


In [20]:
gender_targets = list(qr_gender_tgt.iterrows())
gender_targets[0]

(84,
 @UNKNOWN    0.905943
 female      0.033790
 male        0.059797
 NB          0.000470
 Name: 84, dtype: float64)

In [21]:
gender_targets = list(map(lambda x: x[1], gender_targets))
print(gender_targets)

[@UNKNOWN    0.905943
female      0.033790
male        0.059797
NB          0.000470
Name: 84, dtype: float64, @UNKNOWN    0.996106
female      0.001344
male        0.002531
NB          0.000019
Name: 111, dtype: float64, @UNKNOWN    0.883099
female      0.038968
male        0.077328
NB          0.000647
Name: 265, dtype: float64, @UNKNOWN    0.890183
female      0.033058
male        0.076210
NB          0.000549
Name: 323, dtype: float64, @UNKNOWN    0.007847
female      0.428546
male        0.558768
NB          0.005349
Name: 396, dtype: float64, @UNKNOWN    0.176227
female      0.314011
male        0.505519
NB          0.004484
Name: 397, dtype: float64, @UNKNOWN    0.007350
female      0.370282
male        0.617495
NB          0.005075
Name: 403, dtype: float64, @UNKNOWN    0.954086
female      0.013917
male        0.031767
NB          0.000230
Name: 409, dtype: float64, @UNKNOWN    0.969935
female      0.011101
male        0.018813
NB          0.000150
Name: 426, dtype: float64, @

## Relevant Topics

In [22]:
s = 0
e = 45
k = 500
rankings_total, relevancy_labels_total = zip(*searcher.search(s, e, k=k))

Agriculture
Amphibians and Reptiles
Astronomy
Aviation
Biography/WikiProject Actors and Filmmakers
Biography/WikiProject Musicians
Biography/science and academia work group
Birds
Books
Business
Chemicals
Christianity
Cities
Classical music
Computer science
Computing
Cricket
Crime and Criminal Biography
Cycling
Dams
Engineering
Film/American cinema task force
Former countries
Geography
Human rights
Insects
Islam
Japan
Japan/Biography task force
Jewish history
Languages
Literature
Medicine
Middle Ages
Military history
Military history/Maritime warfare task force
Motorsport
Netherlands
Photography
Politics
Skiing and Snowboarding
Southeast Asia
Television
Tennis
Trains


In [32]:
rel_indexes = [2,4,5,8,9,12,13,17,18,21,28,29,31,37,38,39,40,41,42,43]
rankings_total_2_indexes = [(i, rankings_total[i]) for i in rel_indexes]
rankings_total_2= [x for _,x in rankings_total_2_indexes]

In [24]:
rerankings_MMR = [MMR_gender(x, lamb=0.1) for x in rankings_total_2]

Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 1 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 2 values!
Missing fairness info for 2 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!
Missing fairness info for 1 values!
Missing fairness info for 1 values!
Missing fairness info for 0 values!
Missing fairness info for 2 values!
Missing fairness info for 0 values!
Missing fairness info for 0 values!


In [25]:
rerankings_CL, CLS = zip(*[
    zip(*roundRobinWithClusters(r, clusterings[index], clustering_preferences[index]))
    for index, r in rankings_total_2_indexes
])

In [26]:
import random 

ndcg_len = 50

results = []

def shuffle(lst, n):
    lsts = []
    for _ in range(n):
        shuffled = lst[:]
        random.shuffle(shuffled)
        lsts.append(shuffled)
    return lsts

version=2

with open(f'results/results{version}.csv', 'w') as csvfile:
    csvfile.write(f"NAME, QUERY, NDCG, ALPHA-NDCG, TREC, AWRF\n")
    for i, q_index in enumerate(rel_indexes):
        print(q_index, i)
        
        ranking = rankings_total[q_index]
        reranking_MMR = rerankings_MMR[i]
    
        reranking_CL = rerankings_CL[i]
    
        shuffled_lst = shuffle(ranking, 5)

        ### rel_docs
        ranking_rels =  get_relevancy_labels(ranking, relevancy_labels_total[q_index])
        reranking_MMR_rels = get_relevancy_labels(reranking_MMR, relevancy_labels_total[q_index])
        reranking_CL_rels = get_relevancy_labels(reranking_CL, relevancy_labels_total[q_index])
        shuffled_rels = [get_relevancy_labels(s, relevancy_labels_total[q_index]) for s in shuffled_lst]
        
        ### document_topics
        ct = clustering_topics[q_index]
        ct_reranking_MMR = get_topic_info_reranking(ct, ranking, reranking_MMR)
        ct_reranking_CL = get_topic_info_reranking(ct, ranking, reranking_CL)
        ct_reranking_SHUFFLEDs = [get_topic_info_reranking(ct, ranking, shuffled) for shuffled in shuffled_lst]
        
        ### BM25 ####
        AWRF_bm25 = JS(get_fairness_NDCG(ranking[:ndcg_len])[1:], gender_targets[q_index][1:])
        NDCG_bm25 = NDCG(ranking_rels, ndcg_len)
        alpha_NDCG_bm25 = NDCG_2(ct, ndcg_len)
        TREC_bm25 = JS(get_fairness_NDCG(ranking[:ndcg_len])[1:], gender_targets[q_index][1:]) * NDCG(ranking_rels, ndcg_len)
        csvfile.write(f"BM25, {q_index}, {NDCG_bm25}, {alpha_NDCG_bm25}, {TREC_bm25}, {AWRF_bm25}\n")
        
        ### MMR ###
        AWRF_MMR = JS(get_fairness_NDCG(reranking_MMR[:ndcg_len])[1:], gender_targets[q_index][1:])
        NDCG_MMR = NDCG(reranking_MMR_rels, ndcg_len)
        alpha_NDCG_MMR = NDCG_2(ct_reranking_MMR, ndcg_len)
        TREC_MMR = JS(get_fairness_NDCG(reranking_MMR[:ndcg_len])[1:], gender_targets[q_index][1:]) * NDCG(reranking_MMR_rels, ndcg_len)
        csvfile.write(f"MMR, {q_index}, {NDCG_MMR}, {alpha_NDCG_MMR}, {TREC_MMR}, {AWRF_MMR}\n")

        ### MMR targets ###
        AWRF_MMR_T = JS(get_fairness_NDCG(reranking_MMR[:ndcg_len])[1:], gender_targets[q_index][1:])
        NDCG_MMR_T = NDCG(reranking_MMR_rels, ndcg_len)
        alpha_NDCG_MMR_T = NDCG_2(ct_reranking_MMR, ndcg_len)
        TREC_MMR_T = JS(get_fairness_NDCG(reranking_MMR[:ndcg_len])[1:], gender_targets[q_index][1:]) * NDCG(reranking_MMR_rels, ndcg_len)
        csvfile.write(f"MMR_T, {q_index}, {NDCG_MMR_T}, {alpha_NDCG_MMR_T}, {TREC_MMR_T}, {AWRF_MMR_T}\n")
        
        ### CLUSTER ###
        AWRF_CL = JS(get_fairness_NDCG(reranking_CL[:ndcg_len])[1:], gender_targets[q_index][1:])
        NDCG_CL = NDCG(reranking_CL_rels, ndcg_len)
        alpha_NDCG_CL = NDCG_2(ct_reranking_CL, ndcg_len)
        TREC_CL = JS(get_fairness_NDCG(reranking_CL[:ndcg_len])[1:], gender_targets[q_index][1:]) * NDCG(reranking_CL_rels, ndcg_len)
        csvfile.write(f"CL, {q_index}, {NDCG_CL}, {alpha_NDCG_CL}, {TREC_CL}, {AWRF_CL}\n")

        ### Full-random ###
        AWRF_FR = sum([JS(get_fairness_NDCG(shuffled[:ndcg_len])[1:], gender_targets[q_index][1:]) for shuffled in shuffled_lst])/len(shuffled_lst)
        NDCG_FR = sum([NDCG(shuffled_rel, ndcg_len) for shuffled_rel in shuffled_rels])/len(shuffled_rels)
        alpha_NDCG_FR = sum([NDCG_2(ct_s, ndcg_len) for ct_s in ct_reranking_SHUFFLEDs ])/len(ct_reranking_SHUFFLEDs)
        TREC_FR = AWRF_FR*NDCG_FR
        csvfile.write(f"FR, {q_index}, {NDCG_FR}, {alpha_NDCG_FR}, {TREC_FR}, {AWRF_FR}\n")
        
        

2 0


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fairness.loc[eval(item.docid)]/= math.log2(max(k, 2))


4 1
5 2
8 3
9 4
12 5
13 6
17 7
18 8
21 9
28 10
29 11
31 12
37 13
38 14
39 15
40 16
41 17


  p = p / np.sum(p, axis=axis, keepdims=True)


42 18
43 19
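
The `JS` helper used for the AWRF column is not defined in this notebook. Assuming the name stands for the Jensen-Shannon divergence between the exposure distribution of a ranking and its target distribution, a minimal pure-Python sketch looks as follows. The function names are hypothetical, and the library version may differ, for example by returning a distance or a complement of the divergence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    p_total, q_total = sum(p), sum(q)
    if p_total == 0 or q_total == 0:
        raise ValueError("both inputs need a positive total mass")
    # Normalise both inputs so that they are probability distributions.
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    # Mixture distribution halfway between p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence([1, 0], [1, 0]))  # identical distributions
print(js_divergence([1, 0], [0, 1]))  # disjoint distributions
```

Identical distributions give 0 and disjoint ones give 1. An input with zero total mass cannot be normalised, which this sketch rejects explicitly.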


In [28]:
bm25_pop = pd.concat([get_fairness(rankings_total[i][:100]) for i in rel_indexes])
bm25_pop = bm25_pop.drop('@UNKNOWN', axis=1)
bm25_pop.to_csv(f"results/bm25_pop{version}")

MMR_POP = pd.concat([get_fairness(rerankings_MMR[i][:100]) for i in range(len(rel_indexes))])
MMR_POP = MMR_POP.drop('@UNKNOWN', axis=1)
MMR_POP.to_csv(f"results/MMR_POP{version}")

CL_POP = pd.concat([get_fairness(rerankings_CL[i][:100]) for i in range(len(rel_indexes))])
CL_POP = CL_POP.drop('@UNKNOWN', axis=1)
CL_POP.to_csv(f"results/CL_POP{version}")

# Shuffle the rankings of the selected queries (rel_indexes), matching bm25_pop above.
FR_POP = pd.concat([get_fairness(x[:100]) for i in rel_indexes for x in shuffle(rankings_total[i], 5)])
FR_POP = FR_POP.drop('@UNKNOWN', axis=1)
FR_POP.to_csv(f"results/FR_POP{version}")