<a href="https://colab.research.google.com/github/nbarnett19/Computational_Language_Tech/blob/Natalie/natalie_stage_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 3: Question answering / Information retrieval

In this step, students create a question-answering system based on the cleantech media dataset following the stages below. The minimum task is to fine-tune an LLM for question answering.

Optionally, students can also implement a system for Retrieval-Augmented Generation, which is a major industry trend at the moment.

**Question answering (QA) system**
> * Extract key sentences from the given cleantech dataset using, for example, TextRank (https://github.com/davidadamojr/TextRank) or BERT Extractive Summarizer (https://pypi.org/project/bert-extractive-summarizer/).

> * Generate a question and an answer for each sentence using a pre-trained language model such as GPT-2 or T5.

> * Manually clean up the generated question-answer pairs to create a high-quality QA dataset.

> * Use the prepared QA dataset to fine-tune GPT-2 or T5 and evaluate model performance on new input data in the cleantech field.

> * Comparing the above results with the zero-shot capability of some open source large language models (LLMs) such as ChatGPT and Llama-2.


**Outputs:**

1. An QA dataset that can be shared between groups for model training purposes.
2. Training notebook, including the QA results on the trained model

In [None]:
%%capture
%pip install bert-extractive-summarizer

In [1]:
# Preprocessing
%%capture
!python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import nltk
import spacy
import math
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re

nlp = spacy.load('en_core_web_sm')

In [2]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

True

In [None]:
from summarizer import Summarizer

ModuleNotFoundError: No module named 'summarizer'

# BERT Extractive Summarizer

The first stage use's BERT extractive summarizer to extract key sentneces from the articles.

# Pre-Processing

Processing steps must be adjusted for use of BERT summarizer. Sentences must be preserved, so punctuation cannot be fully removed. Additionally, as the goal is the creation of a QA dataset it is not advantageous to remove numbers, as these may be important for answers questions/providing context to the articles.


In [1]:
df = pd.read_csv("/content/drive/MyDrive/Comp_Ling/cleantech_media_dataset_v1_20231109.csv")

NameError: name 'pd' is not defined

In [None]:
def preprocess_data(df):
    # Remove duplicates
    df = df.drop_duplicates()

    # Remove symbols, but keep punctuation for sentence tokenization
    df['content_cleaned_text'] = df['content'].apply(lambda x: re.sub(r"^[^a-zA-Z0-9.!?,/'-]+|[^a-zA-Z0-9.!?,/'-]+$", r" ", x))

    # Remove apostrophes not directly preceded and followed by a letter, handling possessive forms
    df['content_cleaned_text'] = df['content_cleaned_text'].apply(lambda x: re.sub(r"(?<![a-zA-Z])'(?![a-zA-Z])|(?<![a-zA-Z])'(?=[a-zA-Z])|(?<=[a-zA-Z])'(?![a-zA-Z])|(?<=[a-zA-Z])'s", "", x))

    # Remove unused columns
    df.drop('Unnamed: 0', axis=1, inplace=True)
    df.drop('author', axis=1, inplace=True)

    return df

# Example usage:
df = preprocess_data(df)


In [None]:
df

Unnamed: 0,title,date,content,domain,url,content_cleaned_text
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...
...,...,...,...,...,...,...
9602,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...,Strata Clean Energy has closed a $ 300 millio...
9603,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...,Global renewable energy developer Ørsted is d...
9604,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...,"Veolia North America, a provider of environme..."
9605,"SunEdison: Too Big, Too Fast?",2023-11-08,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...,Once the self-proclaimed “ leading renewable ...


In [None]:
ids_articles = []

for index, row in df.iterrows():
    article_id = row['title']
    article = row['content_cleaned_text']

    ids_articles.append({'article_id': article_id, 'content': article})

In [None]:
articles = [article['content'] for article in ids_articles]

Examine articles after processing:

In [None]:
articles[9606]

' Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer., Arevon secured a commitment with J.P. Morgan to purchase $ 191 million of investment tax credits and production tax credits, among the nation’ s first transactions announced to date that leverage the Inflation Reduction Act’ s transferability provision., The additional $ 338 million debt facility was financed with MUFG, BNP Paribas, Sumitomo Mitsui Banking Corp., and First Citizens Bank, who acted as coordinating lead arrangers. National Bank of Canada also participated as a lender. Stoel Rives represented Arevon as legal counsel; Milbank LLP served as transfer counsel; and Winston & Strawn LLP served as lender counsel., “ Vikings has been a landmark project from its inception. It is one of the nation’ s first solar peaker plants, and today it is one of the first utility-scale solar-plus-storage ITC and PTC transferability transactions to cl

Text looks reasonably clean and should be able to be handled by BERT summarizer.

Next step is to load the model.

In [None]:
model = Summarizer()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

We test the model on one article. The number of sentences or ratio of sentences to the article length can be specified.

In [None]:
model(articles[9606], num_sentences=1)

'Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer.,'

Here we test another article using both the ratio method and the number of sentences method.

In [None]:
articles[50]

" The energy transition is very much about how far and how fast electrification can go. Siemens Energy is involved in most electrification technologies, from conventional and renewable power generation to storage, grids and green hydrogen production. To understand the issues, risks and opportunities, Energy Intelligence Senior Reporter Philippe Roos caught up with Stefan Diezinger, in charge of Sustainable Energy Systems at the German energy giant Industrial Applications division ( related). Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. What the potential there? A: We see efficiency enhancement as an important part of decarbonizing industries, especially energy-intensive process industries. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. With an upgrade, you can easily get 20% more efficiency. If you put this in terms of CO2 avoidance costs, it is extremely attractive. Then there so much waste heat whi

In [None]:
model(articles[50], ratio=0.2)



"The energy transition is very much about how far and how fast electrification can go. Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. High-temperature heat pumps can address this. Like this, you can get 5% -10% efficiency improvement, which also means CO2 reduction. To find the optimal configuration for a specific application, sophisticated design algorithms and a broad toolbox of technologies are available -- including equipment, electrification, automation and digitalization. We have at the moment ongoing discussions with companies in different parts of the world. If this is the case, you can produce green hydrogen or other green molecules like methanol and use them to replace fossil feedstock in the chemical industry and the mobility sector. But fuel shifting is not necessarily just about hydrogen. For example, even in Europe you still have a large amou

In [None]:
model(articles[50], num_sentences=3)



'The energy transition is very much about how far and how fast electrification can go. But if you really go into the details of the hydrogen business case, you can clearly see it only makes sense if there is a substantial amount of renewable power at very low cost. With digitalization, we can also optimize operations in an online mode.'

The ratio method has the danger of producing a large number of sentences when used on longer articles. For our purposes it is safer to specify the number of sentences directly, as this will create a more manageable dataset later on.

In [None]:
import warnings
warnings.filterwarnings("ignore")

#df['summary'] = ''

# Set the interval to save the DataFrame
#save_interval = 100

# Iterate over the rows of the DataFrame
#for index, row in df.iloc[1351:].iterrows():
    #body_text = row['content_cleaned_text']
    #summary_sentences = model(body_text, num_sentences=3)
    #df.at[index, 'summary'] = summary_sentences

    #if index % save_interval == 0:
        #df.to_csv('/content/drive/MyDrive/Comp_Ling/output_file_2.csv', index=False)

# Save the final DataFrame after all lines are processed
#df.to_csv('/content/drive/MyDrive/Comp_Ling/final_output_file.csv', index=False)


## Load the new dataset

The summarised sentences are contained in the 'summary' column.

In [1]:
import pandas as pd

In [2]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip final_summary_file.zip

--2024-01-15 15:34:35--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip [following]
--2024-01-15 15:34:35--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16819008 (16M) [application/zip]
Saving to: ‘final_summary_file.zip’


2024-01-15 15:34:36 (337 MB/s) - ‘final_summary_file.zip’ saved [16819008/16819008]

Archive:  final_summary_file.z

In [3]:
df = pd.read_csv("final_summary_file.csv")
df

Unnamed: 0,title,date,content,domain,url,content_cleaned_text,summary
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...,Qatar Petroleum ( QP) is targeting aggressive ...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...,New US President Joe Biden took office this we...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...,The slow pace of Japanese reactor restarts con...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...,Two of New York City largest pension funds say...
...,...,...,...,...,...,...,...
9602,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...,Strata Clean Energy has closed a $ 300 millio...,Strata Clean Energy has closed a $ 300 million...
9603,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...,Global renewable energy developer Ørsted is d...,Global renewable energy developer Ørsted is de...
9604,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...,"Veolia North America, a provider of environme...","Veolia North America, a provider of environmen..."
9605,"SunEdison: Too Big, Too Fast?",2023-11-08,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...,Once the self-proclaimed “ leading renewable ...,Once the self-proclaimed “ leading renewable p...


In [4]:
df["summary"][0]

'Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr. But QP judged them to be too expensive and none met its targeted 50-week construction schedule.'

# T5 Question and Answer Pairs

In [3]:
!pip install sentencepiece



In [4]:
import nltk
nltk.download('punkt')

True

In [5]:
import numpy as np
import nltk
import math
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset

In [6]:
from transformers import AutoModelWithLMHead, AutoTokenizer
from transformers import pipeline
import sentencepiece
import pandas as pd

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
# Clone the model repository
!git clone https://github.com/patil-suraj/question_generation.git

fatal: destination path 'question_generation' already exists and is not an empty directory.


In [9]:
%cd question_generation

/content/question_generation


In [13]:
# Test the model

text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."

nlp = pipeline("text2text-generation", model="valhalla/t5-small-e2e-qg", max_length = 100)
result = nlp(text)

for item in result:
    print(f"Question: {item['generated_text']}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Question: Python is an interpreted, high-level, general-purpose programming language.<sep> When was Python first released?<sep> What is Python's design philosophy?<sep>


In [14]:
# extract title and summary columns
sentences_df = pd.DataFrame(df,columns=['title', 'summary'])

# Split the data into training and validation sets
train, test = train_test_split(sentences_df, train_size=1000, random_state=42)

In [15]:
train.to_csv('/content/train_df.csv', index=False)

# Save val_df to CSV
test.to_csv('/content/val_df.csv', index=False)


In [10]:
# Tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("valhalla/t5-small-e2e-qg")
model = T5ForConditionalGeneration.from_pretrained("valhalla/t5-small-e2e-qg")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

In [55]:
train['question'] = None

# Q&A generation loop

for index, row in train.iterrows():
    context = row['summary']

    try:
        # Tokenize the context and generate question and answer
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        questions = tokenizer.decode(outputs[0], skip_special_tokens=False)
        questions = questions.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")
      # question, answer = question_answer.split(tokenizer.sep_token)

        # Assign question and answer to the corresponding row in train_df
        train.at[index, 'question'] = questions
        # train_df.at[index, 'answer'] = answer

    except Exception as e:
        # Handle the error (you can print or log the error message)
        print(f"Error processing row {index}: {str(e)}")
        # Set default values to None
        train.at[index, 'question'] = None
        train.at[index, 'answer'] = None



# save the train_df
train.to_csv('/content/train_results.csv', index=False)

KeyboardInterrupt: 

Multiple questions are generated for each summary in the data frame. Now we must build the answers for each question.

In [None]:
for index, row in train.iterrows():
  print(f"Question: {row['question']}, Context: {row['summary']}")

Question: What could Electrification cost-efficiently bust? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to progress prospective electrification projects? <sep> , Context: Electrification could cost-efficiently bust the vast majority of greenhouse gases arising from North Sea platforms., To achieve these goals, the NSTA is working with industry representatives to progress prospective electrification projects, including those in the central North Sea and West of Shetland., Sensitivities to gas, electricity and carbon prices were also investigated.
Question: What is the name of the company that has secured debt to acquire electric vehicles? <sep> What is the name of the company that has raised $ 750 million through green bonds? <sep> What is the name of the company that has raised $ 750 million through green bonds? <sep> , Context: Homegrown Blusmart Mobility has secured debt to acquire a massive fleet

In [118]:
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/train_results.csv

--2024-01-15 17:24:17--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/train_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 937976 (916K) [text/plain]
Saving to: ‘train_results.csv.1’


2024-01-15 17:24:17 (177 MB/s) - ‘train_results.csv.1’ saved [937976/937976]



In [119]:
df = pd.read_csv("train_results.csv")
df["question"][0]

'What could Electrification cost-efficiently bust? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to progress prospective electrification projects? <sep> '

In [120]:
# Split the sentences into different rows
df['question'] = df['question'].str.split('<sep>')
df = df.explode('question')

# Convert all columns to string
df = df.astype(str)

# Replace empty strings with None in the 'question' column
df['question'] = df['question'].replace('', None)

# Replace empty strings with None in the 'question' column
df['question'] = df['question'].replace(' ', None)

# Drop rows with no value in the 'question' column
df = df.dropna(subset=['question'])

# Drop duplicates based on all columns
df = df.drop_duplicates()

# Reset the index to get consecutive row indices
df = df.reset_index(drop=True)

# Display the resulting DataFrame
df

Unnamed: 0,title,summary,question
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...
...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...


In [121]:
df.to_csv('/content/questions.csv', index=False)

In [11]:
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions.csv

--2024-01-15 18:01:17--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1807293 (1.7M) [text/plain]
Saving to: ‘questions.csv’


2024-01-15 18:01:18 (203 MB/s) - ‘questions.csv’ saved [1807293/1807293]



In [12]:
df = pd.read_csv("questions.csv")
# Rename the columns
df = df.rename(columns={'summary': 'context'})
df

Unnamed: 0,title,context,question
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...
...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...


In [13]:
!pip install transformers



In [14]:
from transformers import AutoTokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [15]:
tokenizer.is_fast

True

In [16]:
# Load and test pre-trained model
from pipelines import pipeline
nlp = pipeline("multitask-qa-qg")

# for qa pass a dict with "question" and "context"
nlp({"question": "What is 42 ?", "context": "42 is the answer to life, the universe and everything."})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

'the answer to life, the universe and everything'

In [17]:
# Convert DataFrame to list of dictionaries
data_list = df.to_dict(orient='records')

# Iterate through the list and pass each dictionary to the pipeline
results = []
for data in data_list:
    result = nlp(data)
    results.append(result)

# Add the answers to a new column 'answer' in the DataFrame
df['answer'] = results
df.to_csv('/content/questions_answer_pairs_t5.csv', index=False)

In [18]:
df

Unnamed: 0,title,context,question,answer
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...,greenhouse gases
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?,progress prospective electrification projects
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...,industry representatives
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...,Homegrown Blusmart Mobility
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...,Damodar Valley Corporation
...,...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...,the Norwegian Continental Shelf
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...,40%


Manually review and clean up the generated question-answer pairs. Ensure clarity, correctness, and coherence in the QA pairs. This step is crucial for creating a high-quality training dataset.

Train the T5 model to map tokenized input sequences to tokenized output sequences. The model learns the mapping from sentences to question-answer pairs.

Use the prepared QA dataset to fine-tune the pre-trained T5 model specifically for question answering. Fine-tuning adapts the model to generate context-specific questions and answers.

In [None]:
# # Fine-tuning
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# t5_model.to(device)
# qa_dataloader = DataLoader(qa_dataset, batch_size=2, shuffle=True)

# optimizer = AdamW(t5_model.parameters(), lr=5e-5)
# num_epochs = 3

# for epoch in range(num_epochs):
#     t5_model.train()
#     total_loss = 0.0

#     for batch in tqdm(qa_dataloader, desc=f"Epoch {epoch + 1}"):
#         input_ids = batch['input_ids'].to(device)
#         labels = batch['labels'].to(device)

#         optimizer.zero_grad()
#         outputs = t5_model(input_ids, labels=labels)
#         loss = outputs.loss
#         total_loss += loss.item()

#         loss.backward()
#         optimizer.step()

#     average_loss = total_loss / len(qa_dataloader)
#     print(f"Epoch {epoch + 1}, Average Loss: {average_loss}")

In [None]:
# # Save the fine-tuned model
# t5_model.save_pretrained('fine_tuned_t5_model')

Evaluate the performance of the fine-tuned T5 model on a validation dataset. Use metrics like accuracy, precision, recall, or F1 score to assess how well the model is answering questions.

In [None]:
# validation_questions = ["What is the capital of France?", "Who wrote 'Romeo and Juliet'?"]
# validation_reference_answers = ["The capital of France is Paris.", "'Romeo and Juliet' was written by William Shakespeare."]

# # Step 3: Tokenize the validation data
# t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
# validation_dataset = ValidationDataset(validation_questions, validation_reference_answers, t5_tokenizer)

# # Step 4: Initialize T5 model for conditional generation
# t5_model = T5ForConditionalGeneration.from_pretrained('fine_tuned_t5_model')  # Load the fine-tuned model

# # Step 5: Evaluation
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# t5_model.to(device)
# validation_dataloader = DataLoader(validation_dataset, batch_size=2, shuffle=False)

# all_predictions = []
# all_reference_ids = []

# t5_model.eval()
# with torch.no_grad():
#     for batch in tqdm(validation_dataloader, desc="Validation"):
#         input_ids = batch['input_ids'].to(device)
#         reference_ids = batch['reference_ids'].to(device)

#         # Generate predictions
#         generated_ids = t5_model.generate(input_ids)
#         predictions = t5_tokenizer.decode(generated_ids, skip_special_tokens=True)

#         # Convert reference_ids to string for comparison
#         reference_strings = [t5_tokenizer.decode(ref, skip_special_tokens=True) for ref in reference_ids]

#         all_predictions.extend(predictions)
#         all_reference_ids.extend(reference_strings)

# # Calculate evaluation metrics
# accuracy = accuracy_score(all_reference_ids, all_predictions)
# precision = precision_score(all_reference_ids, all_predictions, average='weighted')
# recall = recall_score(all_reference_ids, all_predictions, average='weighted')
# f1 = f1_score(all_reference_ids, all_predictions, average='weighted')

# print(f"Accuracy: {accuracy:.4f}")
# print(f"Precision: {precision:.4f}")
# print(f"Recall: {recall:.4f}")
# print(f"F1 Score: {f1:.4f}")

Compare the results of the fine-tuned T5 model with the zero-shot capability of other open-source large language models such as ChatGPT and Llama-2.

In [None]:
# from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# model_name = "deepset/tinyroberta-squad2"

# # a) Get predictions
# nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
# QA_input = {
#     'question': 'Why is model conversion important?',
#     'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
# }
# res = nlp(QA_input)

# # b) Load model & tokenizer
# model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)


config.json:   0%|          | 0.00/835 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/326M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]