## Search Engine with HuggingFace

<a href="https://colab.research.google.com/github/EffiSciencesResearch/ML4G/blob/main/days/w1d4/Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this tutorial is to learn the basics of Hugging Face, whose libraries are among the most widely used in modern NLP.

We will walk through the first steps of building a semantic search engine.

Before starting, you can skim a pandas cheat sheet and read about the `.loc` and `.iloc` indexers:

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

## Download the list of papers

In [11]:
!wget https://github.com/EffiSciencesResearch/ML4G/raw/4010bb6ccd63dee5896b26ee3c045898e0cf9ed6/days/w1d4/keynesian_eco_ML4G.xlsx -q

In [6]:
%pip install transformers sentence-transformers openpyxl -q

In [2]:
import pandas as pd

df = pd.read_excel("keynesian_eco_ML4G.xlsx")

In [3]:
df

| Unnamed: 0 | id | nbInCitations | nbOutCitations | authors | paperAbstract | title | year | doi | doiUrl | fieldsOfStudy | journalName | magId | s2Url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | c2890ae1a86d104b0b5b2cf71d890cb514b6590b | 0.0 | 0.0 | [['2069055537', 'Robin Boadway'], ['119227414... | | Fiscal Federalism: Preface | 2009.0 | 10.1017/CBO9780511626883.001 | https://doi.org/10.1017/CBO9780511626883.001 | ['Economics'] | | 1597197407 | |
| 1 | b849439367ccb03b11183b9e3fdf594aae90255a | 12.0 | 0.0 | [['116983853', 'Nancy J. Wulwick']] | | Two Econometric Replications: The Historic Phi... | 1996.0 | 10.1215/00182702-28-3-391 | https://doi.org/10.1215/00182702-28-3-391 | ['Economics'] | History of Political Economy | 2007001276 | |
| 2 | 1cc40d9cc0ee8978339d55b0406cb88014710ae9 | 0.0 | 0.0 | [['40266255', 'Julia M. Colston'], ['46784895'... | | An adenoviral model to unlock the secrets of m... | 2013.0 | 10.1016/J.JINF.2013.07.009 | https://doi.org/10.1016/J.JINF.2013.07.009 | ['Economics'] | Journal of Infection | 2064032475 | |
| 3 | 13bdff11c2e33df661d1bd7f08f7ab0bfe46acb6 | 0.0 | 0.0 | [['119468193', 'Baldwin Ranson']] | | The Keynesian Revolution and its Critics | 1988.0 | 10.1080/00213624.1988.11504824 | https://doi.org/10.1080/00213624.1988.11504824 | ['Economics'] | | 2565556283 | |
| 4 | 6286ee0c109ac7ce25a9d8cdc15281ed4d218cfa | 4.0 | 1.0 | [['119242604', 'Josef Steindl']] | The control of the economy is examined in term... | The control of the economy | 2013.0 | 10.1007/978-1-349-20821-0_16 | https://doi.org/10.1007/978-1-349-20821-0_16 | ['Economics'] | PSL Quarterly Review | 1804776193 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15293 | 1a68ce621afac192a4b7c3c30152fb9f210ec0d2 | 0.0 | 0.0 | [['1883404', 'Christian Fries']] | | Heath‐Jarrow‐Morton Framework: Foundations | 2007.0 | 10.1002/9780470179789.CH22 | https://doi.org/10.1002/9780470179789.CH22 | ['Economics'] | | 1499640240 | |
| 15294 | c637aa731bd2423633e75067ae039ec6133e32eb | 0.0 | 0.0 | [['47021855', 'Miriam Smith']] | | Discrimination, From Romer to Vriend, 1986–2000 | 2008.0 | 10.4324/9780203895016-9 | https://doi.org/10.4324/9780203895016-9 | ['Economics'] | | 3028705958 | |
| 15295 | f6375b906fc8496a361ef1c083624424af98ad7e | 50.0 | 22.0 | [['2990724', 'Imad A. Moosa']] | Abstract Okun's coefficient is estimated from ... | Cyclical output, cyclical unemployment, and Ok... | 1999.0 | 10.1016/S1059-0560(99)00028-3 | https://doi.org/10.1016/S1059-0560%2899%2900028-3 | ['Economics'] | International Review of Economics & Finance | 2037151286 | |
| 15296 | b493013938a9b18de3009164cc9c21a95acb662c | 5.0 | 3.0 | [['90362779', 'João Sicsú']] | The article criticizes the main hypothesis of ... | A NEGAÇÃO DA INEFICÁCIA DA POLÍTICA MONETÁRIA:... | 2009.0 | 10.22456/2176-5456.10546 | https://doi.org/10.22456/2176-5456.10546 | ['Economics'] | | 1957768946 | |


## Semantic search engine

Build a search engine by embedding the papers with https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Run a few queries and check that the results make sense.

Switch to GPU mode in Colab!

NB: In the sentence-transformers library, `model.encode(sentences)` can process an arbitrary number of sentences because it batches automatically. With the transformers library, you generally have to batch the inputs yourself.


In [None]:
query = "Poverty in the US"

...
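
One way to organize the search step, as a minimal sketch: embed every document once, embed the query, and rank the documents by cosine similarity to the query. The names `cosine_similarity`, `search`, `toy_encode` and the `encode` argument are illustrative and not part of any library. In the notebook, `encode` would presumably be `SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode`, applied to a text column such as `title`. The toy word-count encoder below is only a stand-in so that the sketch runs without downloading a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, documents, encode, top_k=3):
    """Return the top_k (score, document) pairs most similar to the query.

    `encode` maps a list of strings to a list of vectors.
    """
    doc_vectors = encode(documents)  # in practice, compute once and cache
    query_vector = encode([query])[0]
    scored = [
        (cosine_similarity(query_vector, vector), document)
        for vector, document in zip(doc_vectors, documents)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical stand-in for a real embedding model: word counts over a tiny vocabulary.
VOCAB = ["poverty", "us", "monetary", "policy", "unemployment"]


def toy_encode(texts):
    return [[text.lower().split().count(word) for word in VOCAB] for text in texts]


titles = [
    "Monetary policy and inflation",
    "Poverty in the US since 1960",
    "Cyclical unemployment and output",
]
results = search("Poverty in the US", titles, toy_encode, top_k=2)
print(results[0][1])
```

With real embeddings the document vectors are the expensive part, so it is worth computing them once and reusing them across queries.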

## Few-Shot Learning: TL;DR

Use GPT-J 6B with some prompt engineering to create a TL;DR of each abstract.

You can begin with a smaller model: https://huggingface.co/gpt2


The first step is to create a single TL;DR using few-shot learning. You can use this dataset to build your prompt: https://github.com/EffiSciencesResearch/ML4G/blob/38f80110be0802837254c1cd888f387475c9b5fe/days/w1d4/tldr_dataset.csv
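
A few-shot prompt can be assembled programmatically from a handful of abstract/TL;DR pairs. The sketch below is one possible layout, not the required one: the function name `build_few_shot_prompt` and the default instruction text are assumptions, and because the column names of `tldr_dataset.csv` are not shown here, the example pairs are passed as plain tuples.

```python
def build_few_shot_prompt(examples, abstract,
                          instruction="Generate a TL;DR for each abstract."):
    """Assemble a few-shot prompt from (abstract, tldr) example pairs."""
    parts = [instruction, ""]
    for example_abstract, example_tldr in examples:
        parts.append(f'Abstract : "{example_abstract}"')
        parts.append(f'TL;DR : "{example_tldr}"')
        parts.append("")
    # End on an open TL;DR line so the model continues with the summary.
    parts.append(f'Abstract : "{abstract}"')
    parts.append("TL;DR :")
    return "\n".join(parts)


examples = [
    ("Review(s) of: The House of Rothschild: The World's Banker, "
     "by Niall Ferguson, Two vols, Penguin, 2000.",
     "The House of Rothschild: The World's Banker."),
    ("The benefits and drawbacks of sub‐contracting distribution are reviewed. "
     "The financial aspects of the decision, whether sub‐contracting is to be "
     "wholly adopted or in part, are discussed together with the implications "
     "for management.",
     "Benefits and drawbacks of subcontracting distribution are reviewed."),
]
# Placeholder text: replace with a real abstract from the dataframe.
prompt = build_few_shot_prompt(examples, "<abstract to summarize>")
print(prompt)
```

Keeping the examples short matters with small models such as gpt2, whose context window is limited.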


Once you have created a single TL;DR, the aim is to add a new column containing the TL;DRs to the pandas dataframe:
- Automate the process and run inference in batches. Store the results in a new column of the dataframe.
- Use the tqdm library to display a progress bar.
- Notice that it is too slow and switch to the GPU. To do this, call `.to(device)` on the tokenizer's output tensors and on the model.
- Use the `nvidia-smi` command in a terminal to monitor GPU usage. Aim for a GPU utilization of about 70%.
- How does inference speed vary with the batch size?
- How does inference speed vary with the tokenizer's padding policy?
- How do the quality and speed of inference vary with the beam search settings (for example `num_beams`)?
- Bonus: read https://huggingface.co/blog/how-to-generate
- Bonus: use a bigger model: https://huggingface.co/EleutherAI/gpt-j-6B
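
The batching and timing steps in the list above can be prototyped independently of the model. In the sketch below, `batched`, `run_in_batches` and `fake_generate` are hypothetical helper names. `fake_generate` stands in for the real step (tokenize with padding, move the tensors to the GPU, call the model, decode) so that the loop can be checked before plugging in a slow model.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(prompts, generate_batch, batch_size=8):
    """Apply `generate_batch` to each batch; return (outputs, seconds elapsed)."""
    outputs = []
    start_time = time.perf_counter()
    # Wrapping the iterator as tqdm(batched(...)) would add a progress bar.
    for batch in batched(prompts, batch_size):
        outputs.extend(generate_batch(batch))
    return outputs, time.perf_counter() - start_time


def fake_generate(batch):
    # Placeholder for the real model call on one batch of prompts.
    return [f"tldr of {prompt}" for prompt in batch]


prompts = [f"abstract {i}" for i in range(10)]
for size in (1, 4, 8):
    tldrs, seconds = run_in_batches(prompts, fake_generate, batch_size=size)
    print(f"batch_size={size}: {len(tldrs)} outputs in {seconds:.6f}s")

# With one prompt per row, the results could then be stored in a new column,
# for example df["tldr"] = tldrs (the column name is an assumption).
```

Running the same loop with several batch sizes gives a direct way to answer the speed questions above once `fake_generate` is replaced by the real model call.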


In [None]:
# No need to 
!wget https://raw.githubusercontent.com/EffiSciencesResearch/ML4G/38f80110be0802837254c1cd888f387475c9b5fe/days/w1d4/tldr_dataset.csv -q
df_tldr = pd.read_csv("tldr_dataset.csv")

In [None]:
prompt = "You must craft a good prompt by selecting 2 good abstract-tldr pairs from df_tldr"
# NB: An example of a working prompt is written below.

from transformers import pipeline

# NB: We use gpt2 here because Colab may not be able to run a larger model.
generator = pipeline("text-generation", model="gpt2")
generator(prompt, max_length=30, num_return_sequences=3)
# Example of the output format:
## [{'generated_text': "Hello, I'm a language modeler. So while writing this, when I went out to meet my wife or come home she told me that my"},
##  {'generated_text': "Hello, I'm a language modeler. I write and maintain software in Python. I love to code, and that includes coding things that require writing"}, ...

## Quality filtering

(Bonus) Implement a strategy to keep only high-quality TL;DRs.
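
As one possible starting point, here is a sketch of a purely heuristic filter. The function name `is_good_tldr` and all three thresholds are arbitrary assumptions to tune on real outputs. It rejects TL;DRs that are too short, barely shorter than the abstract, or dominated by repeated words, a common failure mode of small generative models.

```python
def is_good_tldr(abstract, tldr, min_words=3, max_length_ratio=0.8,
                 min_unique_ratio=0.5):
    """Heuristic quality check for a generated TL;DR (thresholds are arbitrary)."""
    words = tldr.split()
    if len(words) < min_words:
        return False  # too short to be informative
    if len(tldr) > max_length_ratio * len(abstract):
        return False  # barely shorter than (or a copy of) the abstract
    unique_ratio = len({word.lower() for word in words}) / len(words)
    if unique_ratio < min_unique_ratio:
        return False  # mostly repeated words
    return True


abstract = (
    "The benefits and drawbacks of sub‐contracting distribution are reviewed. "
    "The financial aspects of the decision, whether sub‐contracting is to be "
    "wholly adopted or in part, are discussed together with the implications "
    "for management."
)
candidates = [
    "Benefits and drawbacks of subcontracting distribution are reviewed.",
    "Reviewed.",
    "the the the the the the",
    abstract,
]
kept = [tldr for tldr in candidates if is_good_tldr(abstract, tldr)]
print(kept)
```

On a dataframe, the same function could be applied row by row to build a boolean mask and keep only the rows that pass.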

## Fine-Tuning

Bonus: fine-tune T5-base on the corpus generated by GPT-J to speed up inference, and fine-tune your first LLM along the way.

Use: https://huggingface.co/docs/transformers/training#training-hyperparameters

(You can use either the native PyTorch method or the Trainer object. Both are fine.)

In [None]:
# Working Prompt to summarize some abstracts
prompt = """I'm a TL;DR text generator, and need to generate a corresponding TL;DR from the following abstracts.

Abstract : "Review(s) of: The House of Rothschild: The World's Banker, by Niall Ferguson, Two vols, Penguin, 2000."
TL;DR : "The House of Rothschild: The World's Banker."

Abstract : "The benefits and drawbacks of sub‐contracting distribution are reviewed. The financial aspects of the decision, whether sub‐contracting is to be wholly adopted or in part, are discussed together with the implications for management."
TL;DR : "Benefits and drawbacks of subcontracting distribution are reviewed."

Abstract : "A Research Project Report by Kyagera Benjamin, Submitted to the Chandaria School of Business in Partial Fulfillment of the Requirement for the Degree of Master of Business Administration"
TL;DR : "Chandaria School of Business in Partial Fulfillment of the Degree of Master of Business Administration."

Abstract : "AbstractPolitical science and public policy scholars have long emphasised the importance of understanding institutional change and policy entrepreneurship. This review article is a response to this..."
TL;DR : "Political science and public policy scholars have long emphasised the importance of understanding institutional change and policy entrepreneurship."

Abstract : [insert your abstract]
TL;DR :"""