<a href="https://colab.research.google.com/github/nbarnett19/Computational_Language_Tech/blob/Main/stage_3_Natalie_Alina_Nina.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accelerating Cleantech Advancements through NLP-Powered Text Mining and Knowledge Extraction:

### Stage 3: Question answering / Information retrieval

Natalie Barnett, Alina Yaroshchuk, Nina Merryweather

In this step, we create a question-answering system based on the cleantech media dataset following the stages below:

> * Extract key sentences from the given cleantech dataset using BERT Extractive Summarizer (https://pypi.org/project/bert-extractive-summarizer/).

> * Generate a question and an answer for each sentence using a pre-trained language model.

> * Manually clean up the generated question-answer pairs to create a high-quality QA dataset.

> * Use the prepared QA dataset to fine-tune GPT-2 and T5 and evaluate model performance on new input data in the cleantech field.

> * Compare the above results with the zero-shot capability of ChatGPT.

In [None]:
%%capture
%pip install bert-extractive-summarizer

In [None]:
%%capture
# Preprocessing (the %%capture cell magic must be the first line of the cell)
!python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import nltk
import spacy
import math
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re

nlp = spacy.load('en_core_web_sm')

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

True

In [None]:
from summarizer import Summarizer

# Pre-Processing

The processing steps must be adjusted for the BERT summarizer. Sentence boundaries must be preserved, so punctuation cannot be fully removed. In addition, because the goal is to create a QA dataset, numbers are kept: they may be needed to answer questions or to provide context for the articles.
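
As a minimal sketch of this idea (the sample string is made up for illustration and is not taken from the dataset), stripping only the wrapper characters at the start and end of a stored article leaves sentence punctuation and numbers intact:

```python
import re

# Hypothetical raw article string, mimicking the list-like wrapper
# seen in the 'content' column of the dataset.
raw = '["Solar capacity grew 20% in 2022. Will it double by 2030?"]'

# Match a run of disallowed characters at the very start or the very end only;
# everything in between (periods, question marks, digits) is left untouched.
edge_symbols = r"^[^a-zA-Z0-9.!?,/'-]+|[^a-zA-Z0-9.!?,/'-]+$"

# .strip() is added here only to tidy the printed result.
cleaned = re.sub(edge_symbols, " ", raw).strip()

print(cleaned)
```

Because the pattern is anchored with `^` and `$`, symbols inside the text (such as `%`) are not affected.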


In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/cleantech_media_dataset_v1_20231109.zip
!unzip /content/cleantech_media_dataset_v1_20231109.zip

--2024-01-19 20:48:17--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/cleantech_media_dataset_v1_20231109.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/cleantech_media_dataset_v1_20231109.zip [following]
--2024-01-19 20:48:17--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/cleantech_media_dataset_v1_20231109.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14880158 (14M) [application/zip]
Saving to: ‘cleantech_media_dataset_v1_20231109.zip’


2024-01-19 20:48:18 (150 MB/s) - ‘cleante

In [None]:
df = pd.read_csv("/content/cleantech_media_dataset_v1_20231109.csv")

In [None]:
def preprocess_data(df):
    # Remove duplicate rows and work on a copy to avoid SettingWithCopyWarning
    df = df.drop_duplicates().copy()

    # Strip disallowed symbols from the start and end of each article only
    # (e.g. the brackets and quotes wrapping the text); punctuation inside the
    # text is kept so that sentences can still be tokenized
    df['content_cleaned_text'] = df['content'].apply(lambda x: re.sub(r"^[^a-zA-Z0-9.!?,/'-]+|[^a-zA-Z0-9.!?,/'-]+$", r" ", x))

    # Remove apostrophes that are not between two letters and drop a trailing 's
    # (possessive forms; note that this also shortens contractions such as "it's")
    df['content_cleaned_text'] = df['content_cleaned_text'].apply(lambda x: re.sub(r"(?<![a-zA-Z])'(?![a-zA-Z])|(?<![a-zA-Z])'(?=[a-zA-Z])|(?<=[a-zA-Z])'(?![a-zA-Z])|(?<=[a-zA-Z])'s", "", x))

    # Remove unused columns
    df = df.drop(columns=['Unnamed: 0', 'author'])

    return df

# Example usage:
df = preprocess_data(df)


In [None]:
df

Unnamed: 0,title,date,content,domain,url,content_cleaned_text
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...
...,...,...,...,...,...,...
9602,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...,Strata Clean Energy has closed a $ 300 millio...
9603,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...,Global renewable energy developer Ørsted is d...
9604,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...,"Veolia North America, a provider of environme..."
9605,"SunEdison: Too Big, Too Fast?",2023-11-08,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...,Once the self-proclaimed “ leading renewable ...


In [None]:
ids_articles = []

for index, row in df.iterrows():
    article_id = row['title']
    article = row['content_cleaned_text']

    ids_articles.append({'article_id': article_id, 'content': article})

In [None]:
articles = [article['content'] for article in ids_articles]

Examine articles after processing:

In [None]:
articles[9606]

' Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer., Arevon secured a commitment with J.P. Morgan to purchase $ 191 million of investment tax credits and production tax credits, among the nation’ s first transactions announced to date that leverage the Inflation Reduction Act’ s transferability provision., The additional $ 338 million debt facility was financed with MUFG, BNP Paribas, Sumitomo Mitsui Banking Corp., and First Citizens Bank, who acted as coordinating lead arrangers. National Bank of Canada also participated as a lender. Stoel Rives represented Arevon as legal counsel; Milbank LLP served as transfer counsel; and Winston & Strawn LLP served as lender counsel., “ Vikings has been a landmark project from its inception. It is one of the nation’ s first solar peaker plants, and today it is one of the first utility-scale solar-plus-storage ITC and PTC transferability transactio

The text looks reasonably clean, and the BERT summarizer should be able to handle it.

Next step is to load the model.

# BERT Extractive Summarizer

The first stage uses BERT extractive summarizer to extract key sentences from the articles.

In [None]:
model = Summarizer()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We test the model on one article. Either the number of sentences or the ratio of summary sentences to article length can be specified.

In [None]:
model(articles[9606], num_sentences=1)

'Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer.,'

Here we test another article using both the ratio method and the number of sentences method.

In [None]:
articles[50]

" The energy transition is very much about how far and how fast electrification can go. Siemens Energy is involved in most electrification technologies, from conventional and renewable power generation to storage, grids and green hydrogen production. To understand the issues, risks and opportunities, Energy Intelligence Senior Reporter Philippe Roos caught up with Stefan Diezinger, in charge of Sustainable Energy Systems at the German energy giant Industrial Applications division ( related). Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. What the potential there? A: We see efficiency enhancement as an important part of decarbonizing industries, especially energy-intensive process industries. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. With an upgrade, you can easily get 20% more efficiency. If you put this in terms of CO2 avoidance costs, it is extremely attractive. Then there so much waste heat whi

In [None]:
model(articles[50], ratio=0.2)



"The energy transition is very much about how far and how fast electrification can go. Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. High-temperature heat pumps can address this. Like this, you can get 5% -10% efficiency improvement, which also means CO2 reduction. To find the optimal configuration for a specific application, sophisticated design algorithms and a broad toolbox of technologies are available -- including equipment, electrification, automation and digitalization. We have at the moment ongoing discussions with companies in different parts of the world. If this is the case, you can produce green hydrogen or other green molecules like methanol and use them to replace fossil feedstock in the chemical industry and the mobility sector. But fuel shifting is not necessarily just about hydrogen. For example, even in Europe you still have a large amou

In [None]:
model(articles[50], num_sentences=3)



'The energy transition is very much about how far and how fast electrification can go. But if you really go into the details of the hydrogen business case, you can clearly see it only makes sense if there is a substantial amount of renewable power at very low cost. With digitalization, we can also optimize operations in an online mode.'

The ratio method risks producing a large number of sentences for longer articles. For our purposes it is safer to specify the number of sentences directly, which keeps the resulting dataset manageable.
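
To make the difference concrete, here is a rough sketch of how the summary size scales under each option. The helper `sentences_selected` is hypothetical and only approximates the behaviour; the summarizer itself may round differently.

```python
import math

def sentences_selected(total_sentences, ratio=None, num_sentences=None):
    """Rough count of sentences an extractive summary would keep.

    Illustrative helper only, not part of the summarizer library.
    """
    if num_sentences is not None:
        # A fixed count never exceeds the number of sentences available.
        return min(num_sentences, total_sentences)
    # A ratio scales with the article length.
    return max(1, math.floor(total_sentences * ratio))

for total in (10, 50, 200):
    by_ratio = sentences_selected(total, ratio=0.2)
    fixed = sentences_selected(total, num_sentences=3)
    print(f"{total:>3} sentences: ratio=0.2 keeps ~{by_ratio}, num_sentences=3 keeps {fixed}")
```

With a ratio, a 200-sentence article contributes about 40 sentences to the QA dataset, whereas a fixed count caps every article at 3.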

In [None]:
import warnings
warnings.filterwarnings("ignore")

df['summary'] = ''

# Set the interval to save the DataFrame
#save_interval = 100

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    body_text = row['content_cleaned_text']

    try:
        summary_sentences = model(body_text, num_sentences=3)
        df.at[index, 'summary'] = summary_sentences
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        df.at[index, 'summary'] = None  # or any default value you prefer

    #if index % save_interval == 0:
        #df.to_csv('/content/drive/MyDrive/Comp_Ling/output_file_2.csv', index=False)

# Save the final DataFrame after all lines are processed
df.to_csv('/content/final_summary_file.csv', index=False)
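
The commented-out `save_interval` lines above hint at periodic checkpointing, which is useful when a long Colab run may be interrupted. A minimal sketch of that pattern follows; the function name, the stand-in summarizer and the stand-in save function are all hypothetical and chosen only so the example runs without loading a model.

```python
def summarize_with_checkpoints(texts, summarize, save, save_interval=100):
    """Apply `summarize` to each text and call `save` every `save_interval` items."""
    summaries = []
    for i, text in enumerate(texts, start=1):
        try:
            summaries.append(summarize(text))
        except Exception as exc:
            # Keep going if a single article fails, as in the loop above.
            print(f"Error processing item {i - 1}: {exc}")
            summaries.append(None)
        if i % save_interval == 0:
            save(summaries)  # intermediate checkpoint
    save(summaries)  # final save after all items are processed
    return summaries

# Stand-ins for demonstration: "summarize" keeps the text before the first
# sentence break, and "save" just records how many summaries existed.
saved_sizes = []
result = summarize_with_checkpoints(
    ["a. b. c.", "d. e."],
    summarize=lambda t: t.split(". ")[0],
    save=lambda s: saved_sizes.append(len(s)),
    save_interval=1,
)
print(result, saved_sizes)
```

In the notebook, `save` would write the DataFrame to a CSV file and `summarize` would call the summarizer model.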

# Falconsai/text_summarization

Here we test a different summarizer, a fine-tuned T5 model.

In [None]:
%%capture
!pip install datasets evaluate transformers rouge-score nltk

In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip /content/final_summary_file.zip

--2024-01-19 20:58:13--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip [following]
--2024-01-19 20:58:13--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16819008 (16M) [application/zip]
Saving to: ‘final_summary_file.zip’


2024-01-19 20:58:15 (101 MB/s) - ‘final_summary_file.zip’ saved [16819008/16819008]

Archive:  /content/fin

In [None]:
import pandas as pd
df = pd.read_csv("/content/final_summary_file.csv")

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="Falconsai/text_summarization")



In [None]:
print(summarizer(df['content_cleaned_text'][0], max_length=100, min_length=30, do_sample=False))

Token indices sequence length is longer than the specified maximum sequence length for this model (612 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'QP said its goals include  reducing emissions intensity of Qatar LNG facilities by 25% and of its upstream facilities by at least 15% . About 2.2 million tons/yr of the carbon capture goal will come from Phase 1 of the LNG expansion . QP says it should be able to eliminate routine gas flaring by 2030, with methane emissions limited by 0.2% across all facilities by 2025 .'}]


In [None]:
import warnings
import time
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

# Function to apply summarizer to each row and extract the summary text
def generate_summary(text):
    try:
        result = summarizer(text, max_length=100, min_length=30, do_sample=False)
        return result[0]['summary_text']
    except Exception as e:
        print(f"Error processing row: {e}")
        return None  # or any default value you prefer

# Record the start time
start_time = time.time()

# Create a subset of 10 articles for testing (copy to avoid SettingWithCopyWarning)
df_subset = df.head(10).copy()

# Apply the function to each row in the 'content_cleaned_text' column
df_subset['summaries_2'] = df_subset['content_cleaned_text'].apply(generate_summary)

# Calculate the elapsed time
elapsed_time = time.time() - start_time

# Display the DataFrame with the new 'summaries_2' column
df_subset

# Print the elapsed time
print(f"Elapsed Time: {elapsed_time} seconds")


Elapsed Time: 112.30806469917297 seconds
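
The timing above can be extrapolated to the full dataset with a back-of-the-envelope calculation. The article count of 9,600 is an assumption based on the row index shown earlier, and the estimate assumes the per-article time stays roughly constant on the same runtime.

```python
# Measured above: 10 articles took about 112.3 seconds.
seconds_per_article = 112.3 / 10

# Assumption: roughly 9,600 articles in the full dataset
# (the row index displayed earlier ends at 9605).
n_articles = 9600

estimated_hours = seconds_per_article * n_articles / 3600
print(f"Estimated run time for the full dataset: ~{estimated_hours:.0f} hours")
```

Under these assumptions the full run would take on the order of 30 hours.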


Unfortunately, this takes quite a while to run: 10 articles took about 112 seconds, so using it on the whole dataset is not feasible within our time limit. We can now compare the different summaries with the original content:

In [None]:
# Falconsai summariser:
df_subset['summaries_2'][2]

"China' s imports of US crude jumped 211% in 2020 to a record 396,000 barrels per day, a trade worth $ 6.27 billion, according to the Chinese customs administration . China’ s US crude imports reached 3.32 million tons, worth $ 1.11 billion, up from 851,000 b/d in December . The total value of these energy imports was $ 9.12 billion -- equal to the 2017 baseline ."

In [None]:
# BERT summariser

df_subset['summary'][2]

'New US President Joe Biden took office this week with the US-China relationship at its worst in decades. Meanwhile, imports of US LNG reached 3.32 million tons, worth $ 1.11 billion, up from a lone shipment in 2019, and 4.2 million tons of US LPG was discharged in China, for $ 1.74 billion, up from zero in 2019. With the dramatic change in tone in the White House on climate change, symbolized by the US rejoining the Paris climate accord this week, opportunities may widen beyond the oil and gas trade that could help rebuild dialogue between Washington and Beijing.'

In [None]:
# Original content
df_subset['content_cleaned_text'][2]

' New US President Joe Biden took office this week with the US-China relationship at its worst in decades. Energy has come to play a bigger role in that relationship than ever before, and rising Chinese imports of US oil and LNG could serve as the foundation for fresh discussions on trade -- one of the few areas where US-China communications have not completely broken down. But tackling climate change, a priority for Biden unlike predecessor Donald Trump, may offer the easiest and biggest opportunity for cooperation between the two powers now. Due to a bipartisan perception that China’ s economic, geopolitical and technological rise poses an existential threat to the US, a Biden administration is unlikely to soften the tone on Beijing. Trump demanded bigger purchases of US energy products by China as part of a Phase 1 trade deal before he would lift US tariffs on Chinese products. After Biden won the November US presidential election, Trump moved into high gear to crack down on the p

The Falconsai summariser does produce better summaries than BERT. However, it takes longer to run, so we will proceed with BERT.

## Load the new dataset

The summarised sentences are contained in the 'summary' column.

In [None]:
import pandas as pd

In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip final_summary_file.zip

--2024-01-19 16:39:28--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip [following]
--2024-01-19 16:39:28--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/final_summary_file.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16819008 (16M) [application/zip]
Saving to: ‘final_summary_file.zip’


2024-01-19 16:39:29 (281 MB/s) - ‘final_summary_file.zip’ saved [16819008/16819008]

Archive:  final_summar

In [None]:
df = pd.read_csv("final_summary_file.csv")
df

Unnamed: 0,title,date,content,domain,url,content_cleaned_text,summary
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...,Qatar Petroleum ( QP) is targeting aggressive ...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...,New US President Joe Biden took office this we...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...,The slow pace of Japanese reactor restarts con...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...,Two of New York City largest pension funds say...
...,...,...,...,...,...,...,...
9602,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...,Strata Clean Energy has closed a $ 300 millio...,Strata Clean Energy has closed a $ 300 million...
9603,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...,Global renewable energy developer Ørsted is d...,Global renewable energy developer Ørsted is de...
9604,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...,"Veolia North America, a provider of environme...","Veolia North America, a provider of environmen..."
9605,"SunEdison: Too Big, Too Fast?",2023-11-08,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...,Once the self-proclaimed “ leading renewable ...,Once the self-proclaimed “ leading renewable p...


In [None]:
df["summary"][0]

'Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr. But QP judged them to be too expensive and none met its targeted 50-week construction schedule.'

# T5 Question and Answer Pairs

## Install packages

In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


Restart the session to make sure sentencepiece is loaded.

In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip /content/final_summary_file.zip

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import sentencepiece
import pandas as pd
import numpy as np
import math
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoModelWithLMHead, AutoTokenizer
from transformers import pipeline
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
import pandas as pd
df = pd.read_csv("/content/final_summary_file.csv")

For some of the question-and-answer generator models we use, the context length needed to be smaller than the summaries we generated. To facilitate this, we split the summaries into sentences.

In [None]:
import warnings
warnings.filterwarnings("ignore")

rows = []

for index, row in df.iterrows():
    article_title = row['title']
    sentences = sent_tokenize(str(row['summary']))

    # Collect one record per sentence, keeping the article index and title
    for sentence in sentences:
        rows.append({'index': index, 'title': article_title, 'sentence': sentence})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
sentences_df = pd.DataFrame(rows, columns=['index', 'title', 'sentence'])

sentences_df

Unnamed: 0,index,title,sentence
0,0,Qatar to Slash Emissions as LNG Expansion Adva...,Qatar Petroleum ( QP) is targeting aggressive ...
1,0,Qatar to Slash Emissions as LNG Expansion Adva...,A further 1.1 million tons/yr will come from P...
2,0,Qatar to Slash Emissions as LNG Expansion Adva...,But QP judged them to be too expensive and non...
3,1,India Launches Its First 700 MW PHWR,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
4,1,India Launches Its First 700 MW PHWR,India nuclear suppliers should be feeling some...
...,...,...,...
22005,9603,Orsted Deploying SparkCognition Renewable Suit...,“ From raw materials straight through to end-u...
22006,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,"Veolia North America, a provider of environmen..."
22007,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,Solar Industry offers industry participants pr...
22008,9605,"SunEdison: Too Big, Too Fast?",Once the self-proclaimed “ leading renewable p...


In [None]:
# Missing summaries became the string 'nan' through str(); turn them back into NaN and drop them
sentences_df['sentence'] = sentences_df['sentence'].replace('nan', float('nan'))
sentences_df = sentences_df.dropna(subset=['sentence'])
sentences_df

Unnamed: 0,index,title,sentence
0,0,Qatar to Slash Emissions as LNG Expansion Adva...,Qatar Petroleum ( QP) is targeting aggressive ...
1,0,Qatar to Slash Emissions as LNG Expansion Adva...,A further 1.1 million tons/yr will come from P...
2,0,Qatar to Slash Emissions as LNG Expansion Adva...,But QP judged them to be too expensive and non...
3,1,India Launches Its First 700 MW PHWR,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
4,1,India Launches Its First 700 MW PHWR,India nuclear suppliers should be feeling some...
...,...,...,...
22005,9603,Orsted Deploying SparkCognition Renewable Suit...,“ From raw materials straight through to end-u...
22006,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,"Veolia North America, a provider of environmen..."
22007,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,Solar Industry offers industry participants pr...
22008,9605,"SunEdison: Too Big, Too Fast?",Once the self-proclaimed “ leading renewable p...


## mrm8488/t5-base-finetuned-question-generation-ap

In [None]:
df['summary'][0]

'Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr. But QP judged them to be too expensive and none met its targeted 50-week construction schedule.'

In [None]:
df['summary'][1]

'Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820). India nuclear suppliers should be feeling some relief over Kakrapar-3s start-up, although order flows will depend on how quickly NPCIL can get other projects moving, and the course of the Covid-19 pandemic ( NIW Dec.1120). • Across the ocean NuScale has launched a play for the UK market via a memorandum of understanding with start-up clean energy firm Shearwater Energy to explore the deployment of hybrid SMR and wind energy projects across the country.'

In [None]:
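from transformers import AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM loads the same T5 checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

def get_question(answer, context, max_length=64):
  # Prompt format: the answer first, then the context the question should be about
  input_text = "answer: %s  context: %s </s>" % (answer, context)
  features = tokenizer([input_text], return_tensors='pt')

  # Generate the question tokens (at most max_length tokens)
  output = model.generate(input_ids=features['input_ids'],
               attention_mask=features['attention_mask'],
               max_length=max_length)

  # The decoded string still contains special tokens such as <pad> and </s>
  return tokenizer.decode(output[0])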


In [None]:
context = df['content_cleaned_text'][0]
answer = "Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion."

get_question(answer, context)

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors


'<pad> question: What is the goal of Qatar Petroleum in its latest Sustainability Report?</s>'

In [None]:
context=df['content_cleaned_text'][1]
answer="Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820)."

In [None]:
print(get_question(answer, context))
print(answer)

<pad> question: What is the first of 700 megawatts of indigenously developed PHWRs?</s>
Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820).


Try the model using the article title as context and the sentences as answers.

In [None]:
context=sentences_df['title'][0]
answer=sentences_df['sentence'][0]
print(get_question(answer, context))
print(answer)

<pad> question: What is the plan of Qatar Petroleum to cut its greenhouse gas emissions?</s>
Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion.


In [None]:
context=sentences_df['title'][1]
answer=sentences_df['sentence'][1]

print(get_question(answer, context))
print(answer)

<pad> question: How much more LNG will be produced from the North Field South project?</s>
A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr.
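
The raw outputs above still carry the `<pad>` and `</s>` special tokens and a `question:` prefix. A minimal sketch of a post-processing step follows. The helper name `clean_generated_question` and the token strings are assumptions for illustration; the actual strings should be read from the tokenizer in use.

```python
import re

# Assumed special-token strings for a T5-style tokenizer; in practice read
# them from tokenizer.pad_token and tokenizer.eos_token.
SPECIAL_TOKENS = ("<pad>", "</s>")


def clean_generated_question(raw: str) -> str:
    """Strip special tokens and a leading 'question:' label (hypothetical helper)."""
    text = raw
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    # Remove the 'question:' prefix if the model emitted one.
    text = re.sub(r"^\s*question:\s*", "", text)
    return text.strip()


raw = "<pad> question: What is the first of 700 megawatts of indigenously developed PHWRs?</s>"
print(clean_generated_question(raw))
```

Applied to every generated string, this would leave only the question text for later inspection.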


## potsawee/t5-large-generation-squad-QuestionAnswer

In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/qa_section_data.zip
!unzip /content/qa_section_data.zip

--2024-01-19 21:15:28--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/qa_section_data.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/qa_section_data.zip [following]
--2024-01-19 21:15:29--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/qa_section_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2798097 (2.7M) [application/zip]
Saving to: ‘qa_section_data.zip’


2024-01-19 21:15:29 (58.9 MB/s) - ‘qa_section_data.zip’ saved [2798097/2798097]

Archive:  /content/qa_section_data.zip

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")

model = AutoModelForSeq2SeqLM.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")


In [None]:
context = df['summary'][4]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  How much money did the New York City Board of Education Retirement System have under management?
answer:  $ 7.4 billion


In [None]:
context = df['summary'][1]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What is the name of the company that has a memorandum of understanding with Shearwater Energy?
answer:  NuScale


In [None]:
df['summary'][9605]

'Once the self-proclaimed “ leading renewable power plant developer in the world, ” U.S.-based SunEdison filed for Chapter 11 bankruptcy on April 21., “ Nevertheless, the fall of SunEdison has made some renewable energy stakeholders question the viability of the yieldco structure., However, Chase asserts, “ SunEdison’ s bankruptcy says more about the company’ s strategic decisions than about the solar industry as a whole.'

In [None]:
context = df['summary'][9605]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What did Chase believe SunEdison’s bankruptcy says more about than the solar industry?
answer:  strategic decisions


In [None]:
context = sentences_df['sentence'][4]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What is the name of the pandemic that is threatening India?
answer:  Covid-19


In [None]:
context = sentences_df['sentence'][4000]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What are two key factors in making electric vehicle ownership a practical and enjoyable experience?
answer:  Effective route planning and access to charging stations


Test the model on the summaries produced by the Falconai model:

In [None]:
df_subset = pd.read_csv("/content/qa_section_data/df_subset.csv")

In [None]:
context = df_subset['summaries_2'][1]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What was the name of the former chairman of the Department of Atomic Energy?
answer:  Anil Kakodkar


In [None]:
context = df_subset['summaries_2'][8]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What was the projected date for the filling of the tank?
answer:  summer of 2022


Split data into training and validation sets:

In [None]:
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(sentences_df, train_size=1000, random_state=42)

In [None]:
train_df.to_csv('/content/train_df.csv', index=False)

# Save val_df to CSV
val_df.to_csv('/content/val_df.csv', index=False)

Generate questions and answers:

In [None]:
train_df['question'] = None
train_df['answer'] = None

# Q&A generation loop

for index, row in train_df.iterrows():
    context = row['sentence']

    try:
        # Tokenize the context and generate question and answer
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
        question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")
        question, answer = question_answer.split(tokenizer.sep_token)

        # Assign question and answer to the corresponding row in train_df
        train_df.at[index, 'question'] = question
        train_df.at[index, 'answer'] = answer

    except Exception as e:
        # Handle the error (you can print or log the error message)
        print(f"Error processing row {index}: {str(e)}")
        # Set default values to None
        train_df.at[index, 'question'] = None
        train_df.at[index, 'answer'] = None



# save the train_df
train_df.to_csv('/content/qa_results.csv', index=False)

Error processing row 17586: not enough values to unpack (expected 2, got 1)
Error processing row 12790: not enough values to unpack (expected 2, got 1)
Error processing row 9510: not enough values to unpack (expected 2, got 1)
Error processing row 9310: not enough values to unpack (expected 2, got 1)
Error processing row 8512: not enough values to unpack (expected 2, got 1)
Error processing row 9865: not enough values to unpack (expected 2, got 1)
Error processing row 21343: not enough values to unpack (expected 2, got 1)
Error processing row 9765: not enough values to unpack (expected 2, got 1)
Error processing row 12941: not enough values to unpack (expected 2, got 1)
Error processing row 13746: not enough values to unpack (expected 2, got 1)
Error processing row 13703: not enough values to unpack (expected 2, got 1)
Error processing row 10954: not enough values to unpack (expected 2, got 1)
Error processing row 6168: not enough values to unpack (expected 2, got 1)
Error processing r
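
The `not enough values to unpack` errors above occur when the decoded output contains no separator token, so `split` returns a single element. A tolerant parser could be sketched as below. The function name `parse_question_answer` and the default token strings are assumptions; in the notebook they would come from `tokenizer.sep_token`, `tokenizer.pad_token` and `tokenizer.eos_token`.

```python
from typing import Optional, Tuple


def parse_question_answer(
    decoded: str,
    sep_token: str = "<sep>",  # assumed default; use tokenizer.sep_token
    pad_token: str = "<pad>",  # assumed default; use tokenizer.pad_token
    eos_token: str = "</s>",   # assumed default; use tokenizer.eos_token
) -> Tuple[Optional[str], Optional[str]]:
    """Return (question, answer), or (None, None) when the output is malformed."""
    text = decoded.replace(pad_token, "").replace(eos_token, "")
    # partition never raises: it splits on the first separator only.
    question, sep, answer = text.partition(sep_token)
    if not sep or not question.strip() or not answer.strip():
        return None, None
    return question.strip(), answer.strip()


print(parse_question_answer("<pad> What is 4th Resource?<sep> a next generation geothermal energy developer</s>"))
print(parse_question_answer("<pad> What is 4th Resource?</s>"))
```

With a helper like this the loop would not need a broad `except` to cover the missing-separator case, and rows without a usable pair could be counted rather than only printed.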

In [None]:
train_df = pd.read_csv("/content/qa_section_data/qa_results.csv")

In [None]:
for index, row in train_df.head(10).iterrows():
  print(f"Question: {row['question']}, Answer: {row['answer']}")

Question:  What is the name of Atome’ s CEO?, Answer:  Olivier Mussat
Question:  What is the price of the ID.4?, Answer:  cheaper than the other models on the list
Question:  What is the process of selecting the correct BoM?, Answer:  an art in itself
Question:  What did the IEA say industry needs to do to lower nuclear construction costs?, Answer:  lower reactor construction costs by 40%
Question:  What is the typical server utilisation rate for AWS?, Answer:  65%
Question:  What is the name of the organization that aims to support South Africa transition from conventional plastics to more environmentally sustainable alternatives?, Answer:  UNIDO
Question:  What kind of look can French doors give to a tiny sustainable home?, Answer:  grand
Question:  What was collected daily from each mesocosm tank?, Answer:  50 mL of seawater
Question:  What is 4th Resource?, Answer:  a next generation geothermal energy developer
Question: nan, Answer: nan


The question-answer pairs vary in quality. The model is predisposed towards noun or noun-phrase answers, and the questions often omit important context, such as the names of countries.

There are probably several reasons for the poor quality. One is that the BERT summarizer does not paraphrase the articles; it extracts key sentences verbatim. A different summarizer might improve the results significantly. However, the other summarizer we tested required considerably more compute and was slow to run.

We will continue to test other question- and answer-generation models in the hope of improving these results.
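
Some of the quality problems described above can be caught automatically before any manual review. The sketch below is one possible set of heuristics, not the procedure used in this notebook. The function name `is_usable_pair`, the word-count threshold and the example contexts are all assumptions made for illustration.

```python
def is_usable_pair(question, answer, context, min_question_words=4):
    """Heuristic quality check; the rules and threshold are illustrative assumptions."""
    if not isinstance(question, str) or not isinstance(answer, str):
        return False  # covers None and NaN placeholders from failed rows
    question, answer = question.strip(), answer.strip()
    if not question.endswith("?"):
        return False
    if len(question.split()) < min_question_words:
        return False
    # An extractive answer should appear verbatim in the source sentence.
    return bool(answer) and answer.lower() in context.lower()


# Made-up contexts, used only to exercise the function.
print(is_usable_pair(
    "What is the typical server utilisation rate for AWS?",
    "65%",
    "AWS runs its servers at a typical utilisation rate of 65%.",
))
print(is_usable_pair(float("nan"), float("nan"), "Any context."))
```

A filter like this only removes obviously broken pairs; questions that lack context, such as a missing country name, would still need manual review.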

## T5-small End-to-End Question Generation

Now we will use T5 to generate questions from the cleantech dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Clone the GitHub repository containing the T5 model.

In [None]:
# Clone the model repository
!git clone https://github.com/patil-suraj/question_generation.git

Cloning into 'question_generation'...
remote: Enumerating objects: 268, done.
remote: Total 268 (delta 0), reused 0 (delta 0), pack-reused 268
Receiving objects: 100% (268/268), 299.04 KiB | 11.08 MiB/s, done.
Resolving deltas: 100% (140/140), done.


In [None]:
%cd question_generation

/content/question_generation


We test the model to make sure it is working properly.

In [None]:
# Test the model
from transformers import pipeline

text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."

# Load the end-to-end question generation model as a text2text pipeline
nlp = pipeline("text2text-generation", model="valhalla/t5-small-e2e-qg", max_length=100)
result = nlp(text)

for item in result:
    print(f"Question: {item['generated_text']}")



Question: Python is an interpreted, high-level, general-purpose programming language.<sep> When was Python first released?<sep> What is Python's design philosophy?<sep>


To simplify the cleantech dataframe, we keep only the title and summary columns and split the data into training and test sets. Because of long computation times, we extract only 1,000 rows; with sufficient GPU resources we would use the whole dataset with an 80/20 train/test split.
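
For reference, the 80/20 split mentioned above amounts to shuffling the rows and cutting at 80%. With scikit-learn this corresponds to passing `test_size=0.2` to `train_test_split`. The standard-library sketch below only illustrates the idea; the function name `split_80_20` and the toy rows are assumptions.

```python
import random


def split_80_20(rows, seed=42):
    """Shuffle a copy of rows and return (train, test) in an 80/20 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Toy rows standing in for the title/summary records.
rows = [{"title": f"article {i}", "summary": f"summary {i}"} for i in range(10)]
train_rows, test_rows = split_80_20(rows)
print(len(train_rows), len(test_rows))
```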

In [None]:
# extract title and summary columns
sentences_df = pd.DataFrame(df,columns=['title', 'summary'])

# Split the data into training and validation sets
train, test = train_test_split(sentences_df, train_size=1000, random_state=42)

In [None]:
# Save the training set to CSV
train.to_csv('/content/train_df.csv', index=False)

# Save the test set to CSV
test.to_csv('/content/val_df.csv', index=False)


In [None]:
# Use the sample method to get a random sample of 1000 rows
test= test.sample(n=1000, random_state=42)  # Set a seed for reproducibility (optional)

len(test)

1000

In [None]:
# Load the tokenizer and model
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("valhalla/t5-small-e2e-qg")
model = T5ForConditionalGeneration.from_pretrained("valhalla/t5-small-e2e-qg")


Use the pretrained model to generate questions for both the training and test sets.

In [None]:
train['question'] = None

# Question generation loop
for index, row in train.iterrows():
    context = row['summary']

    try:
        # Tokenize the context and generate questions
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        questions = tokenizer.decode(outputs[0], skip_special_tokens=False)
        # Drop padding and end-of-sequence tokens; keep <sep> so the questions can be split later
        questions = questions.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

        # Assign the generated questions to the corresponding row in train
        train.at[index, 'question'] = questions

    except Exception as e:
        # Report the error and leave the question empty for this row
        print(f"Error processing row {index}: {e}")
        train.at[index, 'question'] = None

# Save the training results
train.to_csv('/content/train_results.csv', index=False)

In [None]:
test['question'] = None

# Question generation loop
for index, row in test.iterrows():
    context = row['summary']

    try:
        # Tokenize the context and generate questions
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        questions = tokenizer.decode(outputs[0], skip_special_tokens=False)
        # Drop padding and end-of-sequence tokens; keep <sep> so the questions can be split later
        questions = questions.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

        # Assign the generated questions to the corresponding row in test
        test.at[index, 'question'] = questions

    except Exception as e:
        # Report the error and leave the question empty for this row
        print(f"Error processing row {index}: {e}")
        test.at[index, 'question'] = None

# Save the test results
test.to_csv('/content/test_results.csv', index=False)

In [None]:
for index, row in test.iterrows():
  print(f"Question: {row['question']}, Context: {row['summary']}")

Question: How many plugin registrations were logged last month? <sep> What is the Dutch plugin vehicle market? <sep> What is the Dutch plugin vehicle market? <sep> , Context: The world is going through historic transitions, a global shift of energy, transportation, and consumption that will impact every aspect of our lives, but..., Originally posted on EVANNEX. By Charles Morris If you’ re a car buff, and have traveled abroad, you’ ve doubtless noticed that the mix of auto brands you..., The Dutch plugin vehicle ( PEV) market continues in the fast lane, with 6,637 plugin registrations logged last month. The content produced by this site is for entertainment purposes only.
Question: What is the name of the Pennsylvania Agriculture Secretary? <sep> What is the name of the Pennsylvania Agriculture Secretary? <sep> What is the name of the Pennsylvania Agriculture Secretary? <sep> What is the name of the Pennsylvania Agriculture Secretary? <sep> What is the name of the Pennsylvania Agri

Multiple questions are generated for each summary in the dataframe. Next, we must generate an answer for each question. Before we can do this, however, we must split the questions into individual rows of the dataframe.

In [None]:
# Load data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/train_results.csv

--2024-01-15 17:24:17--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/train_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 937976 (916K) [text/plain]
Saving to: ‘train_results.csv.1’


2024-01-15 17:24:17 (177 MB/s) - ‘train_results.csv.1’ saved [937976/937976]



In [None]:
df = pd.read_csv("train_results.csv")
df["question"][0]

'What could Electrification cost-efficiently bust? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to achieve? <sep> What is the NSTA working with to progress prospective electrification projects? <sep> '

In [None]:
# Split the sentences into different rows
df['question'] = df['question'].str.split('<sep>')
df = df.explode('question')

# Convert all columns to string
df = df.astype(str)

# Replace empty strings with None in the 'question' column
df['question'] = df['question'].replace('', None)

# Replace single-space strings with None in the 'question' column
df['question'] = df['question'].replace(' ', None)

# Drop rows with no value in the 'question' column
df = df.dropna(subset=['question'])

# Drop duplicates based on all columns
df = df.drop_duplicates()

# Reset the index to get consecutive row indices
df = df.reset_index(drop=True)

# Display the resulting DataFrame
df

Unnamed: 0,title,summary,question
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...
...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...




In [None]:
df.to_csv('/content/questions.csv', index=False)
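
The table above still contains repeated questions (for example rows 2480 and 2481), most likely because the pieces produced by splitting on `<sep>` differ only in surrounding whitespace. A minimal sketch of a stricter per-row split is shown below. The helper name `split_questions` is an assumption; in pandas it could be applied to the `question` column before `explode`.

```python
def split_questions(raw, sep="<sep>"):
    """Split a generated string into unique, non-empty questions, keeping order."""
    seen = set()
    questions = []
    for part in raw.split(sep):
        question = part.strip()  # whitespace-only pieces become empty strings
        if question and question not in seen:
            seen.add(question)
            questions.append(question)
    return questions


raw = ('What could Electrification cost-efficiently bust? <sep> '
       'What is the NSTA working with to achieve? <sep> '
       'What is the NSTA working with to achieve? <sep> '
       'What is the NSTA working with to progress prospective electrification projects? <sep> ')
for question in split_questions(raw):
    print(question)
```

Stripping before comparing means the duplicate check no longer depends on leading or trailing spaces.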

Do the same for the test data.

In [None]:
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/test_results.csv

--2024-01-15 22:58:01--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/test_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 927401 (906K) [text/plain]
Saving to: ‘test_results.csv’


2024-01-15 22:58:01 (132 MB/s) - ‘test_results.csv’ saved [927401/927401]



In [None]:
val_data = pd.read_csv("test_results.csv")

In [None]:
val_data

Unnamed: 0,title,summary,question
0,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,How many plugin registrations were logged last...
1,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricultu...
2,REA: Half of UK electricity generation could b...,Half of the UK electricity generation could be...,What is the name of the major report from the ...
3,Engie brings online 181MW solar PV plant in Chile,Utility giant Engie has reached commercial ope...,Engie has reached commercial operation for a s...
4,Greenlandic Ice is Teeming With Different Spec...,By clicking `` Allow All you agree to the sto...,What is the name of the research that NASA is ...
...,...,...,...
995,Corona Reinvents Sustainable Packaging by Laun...,By clicking `` Allow All you agree to the sto...,What is the name of the process that creates a...
996,The Great Electricity Transmission Debate — Ho...,Many people want to build new transmission lin...,What is the inference that many people want to...
997,Fourth UK CfD round secures 11GW of new renewa...,"The UK’ s Department for Business, Energy and ...","What is the UK's Department for Business, Ener..."
998,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law? <se...


In [None]:
# Split the sentences into different rows
val_data['question'] = val_data['question'].str.split('<sep>')
val_data = val_data.explode('question')

# Convert all columns to string
val_data = val_data.astype(str)

# Replace empty strings with None in the 'question' column
val_data['question'] = val_data['question'].replace('', None)

# Replace single-space strings with None in the 'question' column
val_data['question'] = val_data['question'].replace(' ', None)

# Drop rows with no value in the 'question' column
val_data = val_data.dropna(subset=['question'])

# Drop duplicates based on all columns
val_data = val_data.drop_duplicates()

# Reset the index to get consecutive row indices
val_data = val_data.reset_index(drop=True)

# Display the resulting DataFrame
val_data

Unnamed: 0,title,summary,question
0,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,How many plugin registrations were logged last...
1,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,What is the Dutch plugin vehicle market?
2,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricultu...
3,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricult...
4,REA: Half of UK electricity generation could b...,Half of the UK electricity generation could be...,What is the name of the major report from the ...
...,...,...,...
2494,Fourth UK CfD round secures 11GW of new renewa...,"The UK’ s Department for Business, Energy and ...",What is the record amount of support given to...
2495,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?
2496,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?
2497,Oil-Free Air Compressors Set to Support Sustai...,By clicking `` Allow All you agree to the sto...,What is the name of the product manager for Oi...


In [None]:
val_data.to_csv('/content/val_data_new.csv', index=False)

In [None]:
# Download the data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions.csv

--2024-01-15 18:01:17--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1807293 (1.7M) [text/plain]
Saving to: ‘questions.csv’


2024-01-15 18:01:18 (203 MB/s) - ‘questions.csv’ saved [1807293/1807293]



We rename the 'summary' column to 'context' to make the dataframe clearer.

In [None]:
df = pd.read_csv("questions.csv")
# Rename the columns
df = df.rename(columns={'summary': 'context'})
df

Unnamed: 0,title,context,question
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...
...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...


## T5 for Multi-task QA and QG - Answer Generation

We now use a different model from the same repository, one designed for answering questions.

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer

In [None]:
from transformers import AutoTokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
tokenizer.is_fast

True

In [None]:
# Load and test pre-trained model
from pipelines import pipeline
nlp = pipeline("multitask-qa-qg")

# for qa pass a dict with "question" and "context"
nlp({"question": "What is 42 ?", "context": "42 is the answer to life, the universe and everything."})


'the answer to life, the universe and everything'

In [None]:
# Convert DataFrame to list of dictionaries
data_list = df.to_dict(orient='records')

# Iterate through the list and pass each dictionary to the pipeline
results = []
for data in data_list:
    result = nlp(data)
    results.append(result)

# Add the answers to a new column 'answer' in the DataFrame
df['answer'] = results
df.to_csv('/content/questions_answer_pairs_t5.csv', index=False)

In [None]:
df

Unnamed: 0,title,context,question,answer
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...,greenhouse gases
1,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to achieve?,progress prospective electrification projects
2,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What is the NSTA working with to progress pro...,industry representatives
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...,Homegrown Blusmart Mobility
4,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has rais...,Damodar Valley Corporation
...,...,...,...,...
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett
2481,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett
2482,AGR picks new head of wells and operations geo...,Lene Thorstensen has been appointed to head up...,Lene Thorstensen has been appointed to head u...,the Norwegian Continental Shelf
2483,The weekend read: Charging with solar at home ...,Just under 40% of the residential EV chargers ...,How much of the residential EV chargers in pv ...,40%


We must create answers for the validation set as well.
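
Since the same answer-generation loop is applied to both the training and the validation questions, it can be wrapped in a small helper. The following is a minimal sketch rather than the notebook's actual code: `answer_all` and `fake_qa` are hypothetical names, and the helper assumes the QA callable accepts a dict holding only `question` and `context`, as in the test call above. Rows with a missing field yield `None` instead of raising an error.

```python
def answer_all(records, qa_fn):
    """Apply qa_fn to each record, passing only the question and context fields.

    records: list of dicts, e.g. from DataFrame.to_dict(orient="records").
    qa_fn:   any callable taking {"question": ..., "context": ...}; a
             hypothetical stand-in for the question-answering pipeline.
    """
    answers = []
    for record in records:
        question = record.get("question")
        context = record.get("context")
        # Rows with a missing or non-string field (e.g. NaN floats) get None
        if not isinstance(question, str) or not isinstance(context, str):
            answers.append(None)
            continue
        answers.append(qa_fn({"question": question, "context": context}))
    return answers


def fake_qa(payload):
    """Toy stand-in for a QA model: returns the first word of the context."""
    return payload["context"].split()[0]


# Toy rows (not taken from the cleantech dataset)
rows = [
    {"title": "t1", "question": "What is 42?", "context": "42 is the answer."},
    {"title": "t2", "question": "Who?", "context": float("nan")},
]
print(answer_all(rows, fake_qa))
```

With the real pipeline, the call would presumably look like `df['answer'] = answer_all(df.to_dict(orient='records'), nlp)`.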

In [None]:
# Download data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/val_data_new.csv

In [None]:
val_data = pd.read_csv("val_data.csv")

In [None]:
# Rename the columns
val_data = val_data.rename(columns={'summary': 'context'})
val_data

Unnamed: 0,title,context,question
0,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,How many plugin registrations were logged last...
1,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,What is the Dutch plugin vehicle market?
2,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricultu...
3,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricult...
4,REA: Half of UK electricity generation could b...,Half of the UK electricity generation could be...,What is the name of the major report from the ...
...,...,...,...
2494,Fourth UK CfD round secures 11GW of new renewa...,"The UK’ s Department for Business, Energy and ...",What is the record amount of support given to...
2495,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?
2496,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?
2497,Oil-Free Air Compressors Set to Support Sustai...,By clicking `` Allow All you agree to the sto...,What is the name of the product manager for Oi...


In [None]:
# Load and test pre-trained model
from pipelines import pipeline
nlp = pipeline("multitask-qa-qg")

# for qa pass a dict with "question" and "context"
nlp({"question": "What is 42 ?", "context": "42 is the answer to life, the universe and everything."})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'the answer to life, the universe and everything'

In [None]:
# Convert DataFrame to list of dictionaries
data_list = val_data.to_dict(orient='records')

# Iterate through the list and pass each dictionary to the pipeline
results = []
for data in data_list:
    result = nlp(data)
    results.append(result)

# Add the answers to a new column 'answer' in the DataFrame
val_data['answer'] = results
val_data.to_csv('/content/new_validation_answer_pairs_t5.csv', index=False)

In [None]:
# Download data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions_answer_pairs_t5.csv

--2024-01-16 00:10:27--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/questions_answer_pairs_t5.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1865876 (1.8M) [text/plain]
Saving to: ‘questions_answer_pairs_t5.csv’

2024-01-16 00:10:27 (213 MB/s) - ‘questions_answer_pairs_t5.csv’ saved [1865876/1865876]



In [None]:
# Download data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/new_validation_answer_pairs_t5.csv

--2024-01-16 00:10:33--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/new_validation_answer_pairs_t5.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1856501 (1.8M) [text/plain]
Saving to: ‘new_validation_answer_pairs_t5.csv’

2024-01-16 00:10:34 (141 MB/s) - ‘new_validation_answer_pairs_t5.csv’ saved [1856501/1856501]



In [None]:
import pandas as pd
qa_pairs = pd.read_csv("questions_answer_pairs_t5.csv")
val_data = pd.read_csv("new_validation_answer_pairs_t5.csv")

## Clean QA data set

Next, we manually review and clean up the generated question-answer pairs so that they are clear, correct, and coherent. This step is crucial for creating a high-quality training dataset.

In [None]:
# Rows to remove after manual inspection - index: 21, 25, 29, 34, 45, 79, 80, 84, 87, 97, 103, 2435
# These questions do not make sense and must be dropped

index_to_remove = [21, 25, 29, 34, 45, 79, 80, 84, 87, 97, 103, 2435]
qa_pairs = qa_pairs.drop(index_to_remove)

In [None]:
# Keep only rows whose question starts with Who, What, Where, When, Why, How, or In
# Specify the keywords to keep
keywords = ['Who', 'What', 'Where', 'When', 'Why', 'How', 'In']

# Use boolean indexing to filter rows
qa_pairs = qa_pairs[qa_pairs['question'].str.startswith(tuple(keywords), na=False)]

In [None]:
# Use boolean indexing to filter rows
val_data = val_data[val_data['question'].str.startswith(tuple(keywords), na=False)]

In [None]:
# Drop duplicates based on all columns
qa_pairs = qa_pairs.drop_duplicates()
qa_pairs

Unnamed: 0,title,context,question,answer
0,Electrification could trim 87% off North Sea p...,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...,greenhouse gases
3,"Saurabh, Author at CleanTechnica",Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...,Homegrown Blusmart Mobility
5,CPV Gains Maryland Approval for 200 MW Solar Farm,Competitive Power Ventures ( CPV) has achieved...,What is the name of the 200 MW DC Backbone Sol...,CPV Backbone
10,Massachusetts – pv magazine International,Academics from MIT and Stanford who have posit...,What is the name of the new method that MIT an...,machine learning system
13,New Jersey approves pilot program to demonstra...,Nicholas Welch makes his way down one of the r...,What is the name of the project that Nicholas ...,Jack’ s Solar Garden
...,...,...,...,...
2474,German power group to add storage at all new s...,EnBW said it has become the first German energ...,What is the name of the company that plans all...,EnBW
2476,Clean-energy groups hail 'new era ' as Biden t...,Clean-energy groups praised the inauguration o...,Who was the CEO of the Business Network for Of...,Liz Burdock
2478,Global Price Strength Seen Driving Natural Gas...,Sign in to get the best natural gas news and d...,What is the name of the U.S. Energy Informatio...,EIA
2480,"Natural Gas, Oil Players Eyeing Lithium to Bui...",Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett


In [None]:
# Drop duplicates based on all columns
val_data = val_data.drop_duplicates()
val_data

Unnamed: 0,title,context,question,answer
0,Netherlands Archives - Page 5 of 43,The world is going through historic transition...,How many plugin registrations were logged last...,6637
2,Pennsylvania Unveils Solar Installation Guidan...,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricultu...,Russell Redding
4,REA: Half of UK electricity generation could b...,Half of the UK electricity generation could be...,What is the name of the major report from the ...,REA
11,Greenlandic Ice is Teeming With Different Spec...,By clicking `` Allow All you agree to the sto...,What is the name of the research that NASA is ...,life in the ice
13,David Carroll – pv magazine International,The Australian government’ s green bank has co...,What is the name of the Australian government'...,Waratah Super Battery Project
...,...,...,...,...
2486,Corona Reinvents Sustainable Packaging by Laun...,By clicking `` Allow All you agree to the sto...,What is the name of the process that creates a...,100% recycled wood fibers
2488,The Great Electricity Transmission Debate ‚Äî Ho...,Many people want to build new transmission lin...,What is the inference that many people want to...,solar and wind
2492,Fourth UK CfD round secures 11GW of new renewa...,"The UK’ s Department for Business, Energy and ...","What is the UK's Department for Business, Ener...",BEIS
2495,Implementing the Clean Energy Investments in U...,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?,$ 11 billion


The datasets are cleaned by removing incoherent questions and all rows whose question does not start with an appropriate question word (i.e., who, what, where, etc.). For training the model, we also remove the title column because it is not needed.
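
The filtering and de-duplication steps can be illustrated without pandas. This is a minimal sketch under stated assumptions: `clean_pairs` and the toy rows are hypothetical, and duplicates are detected on the `context`, `question`, and `answer` fields only, whereas the notebook calls `drop_duplicates()` across all columns.

```python
QUESTION_WORDS = ("Who", "What", "Where", "When", "Why", "How", "In")


def clean_pairs(rows, keywords=QUESTION_WORDS):
    """Keep rows whose question starts with a keyword and drop repeated QA pairs."""
    seen = set()
    cleaned = []
    for row in rows:
        question = row.get("question")
        # Missing or non-string questions are dropped (like na=False in pandas)
        if not isinstance(question, str) or not question.startswith(keywords):
            continue
        # Assumption: two rows are duplicates if these three fields match
        key = (row.get("context"), question, row.get("answer"))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Toy rows for illustration only
rows = [
    {"context": "c1", "question": "What is the capacity?", "answer": "200 MW"},
    {"context": "c1", "question": "What is the capacity?", "answer": "200 MW"},
    {"context": "c2", "question": "The company has secured what?", "answer": "debt"},
    {"context": "c3", "question": None, "answer": "n/a"},
]
for row in clean_pairs(rows):
    print(row["question"], "->", row["answer"])
```

The prefix match is case-sensitive, as is `str.startswith` in pandas, so a question beginning with a lowercase "what" would be dropped by both versions.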

In [None]:
qa_pairs = qa_pairs.drop(["title"], axis=1)

In [None]:
qa_pairs

Unnamed: 0,context,question,answer
0,Electrification could cost-efficiently bust th...,What could Electrification cost-efficiently bu...,greenhouse gases
3,Homegrown Blusmart Mobility has secured debt t...,What is the name of the company that has secur...,Homegrown Blusmart Mobility
5,Competitive Power Ventures ( CPV) has achieved...,What is the name of the 200 MW DC Backbone Sol...,CPV Backbone
10,Academics from MIT and Stanford who have posit...,What is the name of the new method that MIT an...,machine learning system
13,Nicholas Welch makes his way down one of the r...,What is the name of the project that Nicholas ...,Jack’ s Solar Garden
...,...,...,...
2474,EnBW said it has become the first German energ...,What is the name of the company that plans all...,EnBW
2476,Clean-energy groups praised the inauguration o...,Who was the CEO of the Business Network for Of...,Liz Burdock
2478,Sign in to get the best natural gas news and d...,What is the name of the U.S. Energy Informatio...,EIA
2480,Sign in to get the best natural gas news and d...,What is B3 Insight's CEO?,Kelly Bennett


In [None]:
qa_pairs.to_csv('/content/qa_pairs.csv', index=False)

In [None]:
val_data = val_data.drop(["title"], axis=1)
val_data

Unnamed: 0,context,question,answer
0,The world is going through historic transition...,How many plugin registrations were logged last...,6637
2,Pennsylvania Agriculture Secretary Russell Red...,What is the name of the Pennsylvania Agricultu...,Russell Redding
4,Half of the UK electricity generation could be...,What is the name of the major report from the ...,REA
11,By clicking `` Allow All you agree to the sto...,What is the name of the research that NASA is ...,life in the ice
13,The Australian government’ s green bank has co...,What is the name of the Australian government'...,Waratah Super Battery Project
...,...,...,...
2486,By clicking `` Allow All you agree to the sto...,What is the name of the process that creates a...,100% recycled wood fibers
2488,Many people want to build new transmission lin...,What is the inference that many people want to...,solar and wind
2492,"The UK’ s Department for Business, Energy and ...","What is the UK's Department for Business, Ener...",BEIS
2495,Originally published on WRI’ s Resource Instit...,What is the Bipartisan Infrastructure Law?,$ 11 billion


In [None]:
val_data.to_csv('/content/validation_pairs.csv', index=False)

# Fine-tune models

## T5

Next, we fine-tune the pre-trained T5 model on our cleantech Q&A pairs. Fine-tuning adapts the model to generate context-specific answers to cleantech questions.

In [None]:
# Download data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/qa_pairs.csv

--2024-01-19 17:43:57--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/qa_pairs.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 625943 (611K) [text/plain]
Saving to: ‘qa_pairs.csv.4’

2024-01-19 17:43:57 (35.4 MB/s) - ‘qa_pairs.csv.4’ saved [625943/625943]



In [None]:
# Download data if needed
!wget https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/validation_pairs.csv

--2024-01-19 17:43:58--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Natalie/validation_pairs.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 617458 (603K) [text/plain]
Saving to: ‘validation_pairs.csv.4’

2024-01-19 17:43:58 (37.3 MB/s) - ‘validation_pairs.csv.4’ saved [617458/617458]



In [None]:
import pandas as pd
qa_pairs = pd.read_csv("qa_pairs.csv")
val_data = pd.read_csv("validation_pairs.csv")

In [None]:
import torch
import json
from tqdm import tqdm
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader, RandomSampler
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, T5TokenizerFast
import matplotlib.pyplot as plt

In [None]:
class QA_Dataset(Dataset):
    """Tokenizes (question, context) inputs and answer targets for T5 fine-tuning."""

    def __init__(self, tokenizer, dataframe, q_len, t_len):
        self.tokenizer = tokenizer
        self.q_len = q_len  # maximum input length (question + context), in tokens
        self.t_len = t_len  # maximum target (answer) length, in tokens
        self.data = dataframe
        self.questions = self.data["question"]
        self.context = self.data["context"]
        self.answer = self.data["answer"]

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        # Positional indexing also works when the DataFrame index is not 0..n-1
        question = self.questions.iloc[idx]
        context = self.context.iloc[idx]
        answer = self.answer.iloc[idx]

        # padding="max_length" pads every sequence to the fixed length
        question_tokenized = self.tokenizer(question, context, max_length=self.q_len, padding="max_length",
                                            truncation=True, add_special_tokens=True)
        answer_tokenized = self.tokenizer(answer, max_length=self.t_len, padding="max_length",
                                          truncation=True, add_special_tokens=True)

        # Replace padding token ids (0 for T5) with -100 so the loss ignores them
        labels = torch.tensor(answer_tokenized["input_ids"], dtype=torch.long)
        labels[labels == 0] = -100

        return {
            "input_ids": torch.tensor(question_tokenized["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(question_tokenized["attention_mask"], dtype=torch.long),
            "labels": labels,
            "decoder_attention_mask": torch.tensor(answer_tokenized["attention_mask"], dtype=torch.long)
        }
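
The line `labels[labels == 0] = -100` in the dataset class replaces padding positions in the target with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below mimics that step with plain Python lists instead of tensors. `mask_padding`, `masked_mean`, and the token ids are hypothetical and only illustrate the effect on the averaged loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(token_ids, pad_id=0, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index, like labels[labels == 0] = -100."""
    return [ignore_index if t == pad_id else t for t in token_ids]


def masked_mean(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average the per-token losses over the positions that are not ignored."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


# Toy target: three real tokens followed by two padding tokens (id 0)
target_ids = [363, 19, 1, 0, 0]
labels = mask_padding(target_ids)
print(labels)

# Toy per-token losses; the large values at padded positions do not contribute
losses = [0.2, 0.4, 0.6, 9.0, 9.0]
print(round(masked_mean(losses, labels), 3))
```

Without the masking, the two padded positions would be scored like real tokens and would dominate the average in this toy example.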

In [None]:
# Set up the T5 model
TOKENIZER = T5TokenizerFast.from_pretrained("valhalla/t5-base-qa-qg-hl")
MODEL = T5ForConditionalGeneration.from_pretrained("valhalla/t5-base-qa-qg-hl", return_dict=True)
OPTIMIZER = Adam(MODEL.parameters(), lr=0.00001)
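Q_LEN = 256   # Maximum input length (question + context), in tokens
T_LEN = 32    # Maximum target (answer) length, in tokens
BATCH_SIZE = 4
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"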

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [None]:
# Move the model to the selected device
MODEL = MODEL.to(DEVICE)

# # Move the input data tensors to the GPU
# input_ids = input_ids.to('cuda')
# attention_mask = attention_mask.to('cuda')
# # decoder_input_ids = decoder_input_ids.to('cuda')
# # decoder_attention_mask = decoder_attention_mask.to('cuda')

In [None]:
# Data Loader
qa_dataset = QA_Dataset(TOKENIZER, qa_pairs, Q_LEN, T_LEN)
val_dataset = QA_Dataset(TOKENIZER, val_data, Q_LEN, T_LEN)

train_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

In [None]:
train_loss = 0
val_loss = 0
train_batch_count = 0
val_batch_count = 0

for epoch in range(2):
    MODEL.train()
    for batch in tqdm(train_loader, desc="Training batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
                          input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels,
                          decoder_attention_mask=decoder_attention_mask
                        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        train_loss += outputs.loss.item()
        train_batch_count += 1

    # Evaluation: measure validation loss without computing gradients or updating weights
    MODEL.eval()
    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Validation batches"):
            input_ids = batch["input_ids"].to(DEVICE)
            attention_mask = batch["attention_mask"].to(DEVICE)
            labels = batch["labels"].to(DEVICE)
            decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

            outputs = MODEL(
                              input_ids=input_ids,
                              attention_mask=attention_mask,
                              labels=labels,
                              decoder_attention_mask=decoder_attention_mask
                            )

            val_loss += outputs.loss.item()
            val_batch_count += 1

    print(f"{epoch+1}/{2} -> Train loss: {train_loss / train_batch_count}\tValidation loss: {val_loss/val_batch_count}")


Training batches: 100%|██████████| 234/234 [00:35<00:00,  6.65it/s]
Validation batches: 100%|██████████| 232/232 [00:30<00:00,  7.52it/s]


1/2 -> Train loss: 1.2147699728226051	Validation loss: 0.45939696580171585


Training batches: 100%|██████████| 234/234 [00:32<00:00,  7.17it/s]
Validation batches: 100%|██████████| 232/232 [00:30<00:00,  7.50it/s]

2/2 -> Train loss: 0.8173677359954414	Validation loss: 0.34696191029045087
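
In the training loop, `train_loss`, `val_loss`, and the batch counters are initialized once before the epoch loop, so the printed values are running averages over all batches seen so far rather than per-epoch means. The sketch below shows the difference on toy numbers. `epoch_means` is a hypothetical helper, not part of the notebook.

```python
def epoch_means(batch_losses_per_epoch):
    """Return a (per_epoch_mean, running_mean) pair for each epoch."""
    results = []
    total = 0.0
    count = 0
    for losses in batch_losses_per_epoch:
        epoch_mean = sum(losses) / len(losses)
        # The running mean keeps accumulating across epochs, as in the loop above
        total += sum(losses)
        count += len(losses)
        results.append((epoch_mean, total / count))
    return results


# Toy batch losses for two epochs (not the notebook's values)
toy = [[2.0, 1.0], [0.5, 0.5]]
for epoch, (epoch_mean, running_mean) in enumerate(epoch_means(toy), start=1):
    print(f"epoch {epoch}: epoch mean = {epoch_mean:.3f}, running mean = {running_mean:.3f}")
```

Because both epochs contain the same number of batches, the second running average is the mean of the two per-epoch values. Applied to the printed results, this implies second-epoch losses of roughly 2 × 0.817 − 1.215 ≈ 0.42 for training and 2 × 0.347 − 0.459 ≈ 0.23 for validation.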





In [None]:
MODEL.save_pretrained("/content/drive/MyDrive/NLP_models/t5_qa_model")
TOKENIZER.save_pretrained("/content/drive/MyDrive/NLP_models/t5_qa_tokenizer")

('/content/drive/MyDrive/NLP_models/t5_qa_tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_models/t5_qa_tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_models/t5_qa_tokenizer/spiece.model',
 '/content/drive/MyDrive/NLP_models/t5_qa_tokenizer/added_tokens.json',
 '/content/drive/MyDrive/NLP_models/t5_qa_tokenizer/tokenizer.json')

## Evaluate Model T5


The fine-tuned models are saved in a shared Google Drive file: https://drive.google.com/file/d/19-c3cJmdS-uJwiX4UzvGiI9Z94zvuVX8/view?usp=sharing

In [None]:
# Load the fine-tuned model and tokenizer from Google Drive
MODEL = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/NLP_models/t5_qa_model")
TOKENIZER = T5Tokenizer.from_pretrained("/content/drive/MyDrive/NLP_models/t5_qa_tokenizer")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# @title
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
# @title
# !pip install dill==0.3.4

In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill (from evaluate)
  Using cached dill-0.3.7-py3-none-any.whl (115 kB)
Installing collected packages: dill, responses, evaluate
  Attempting uninstall: dill
    Found existing installation: dill 0.3.4
    Uninstalling dill-0.3.4:
      Successfully uninstalled dill-0.3.4
Successfully installed dill-0.3.7 evaluate-0.4.1 responses-0.18.0


We use the Google BLEU (GLEU) metric from the `evaluate` library to evaluate model performance.
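
The following pure-Python sketch illustrates the idea behind the Google BLEU (GLEU) score: count the n-grams of orders 1 to 4 shared by prediction and reference, compute precision and recall over them, and take the smaller of the two. It is an illustration, not the `evaluate` implementation. The function names are hypothetical, only a single reference is supported, and tokens are split on whitespace, which is an assumption (the library applies its own tokenizer).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def gleu(prediction, reference, min_n=1, max_n=4):
    """Sentence-level GLEU sketch: min(precision, recall) over 1- to 4-grams."""
    pred_tokens = prediction.split()  # assumption: whitespace tokenization
    ref_tokens = reference.split()
    matches = pred_total = ref_total = 0
    for n in range(min_n, max_n + 1):
        pred_ngrams = ngram_counts(pred_tokens, n)
        ref_ngrams = ngram_counts(ref_tokens, n)
        # Counter intersection keeps the minimum count of each shared n-gram
        matches += sum((pred_ngrams & ref_ngrams).values())
        pred_total += sum(pred_ngrams.values())
        ref_total += sum(ref_ngrams.values())
    if pred_total == 0 or ref_total == 0:
        return 0.0
    return min(matches / pred_total, matches / ref_total)


print(gleu("670 million", "670 million people"))  # 0.5
print(gleu("18", "18"))  # 1.0
```

In the first call, all three n-grams of the prediction appear in the reference (precision 1.0), but they cover only three of the reference's six n-grams (recall 0.5). A short but correct answer is therefore penalized when the reference is longer.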

In [None]:
from transformers import pipeline

# Load the question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Import the evaluate library, which provides the BLEU metric
import evaluate

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
def predict_answer(context, question, ref_answer=None):
    """Generate an answer for the question given the context; score it if a reference is given."""
    inputs = TOKENIZER(question, context, max_length=Q_LEN, padding="max_length", truncation=True, add_special_tokens=True)

    # Place the inputs on the same device as the model
    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(MODEL.device).unsqueeze(0)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(MODEL.device).unsqueeze(0)

    outputs = MODEL.generate(input_ids=input_ids, attention_mask=attention_mask)

    predicted_answer = TOKENIZER.decode(outputs.flatten(), skip_special_tokens=True)

    if ref_answer:
        # Load the Google BLEU (GLEU) metric
        bleu = evaluate.load("google_bleu")
        score = bleu.compute(predictions=[predicted_answer],
                             references=[ref_answer])

        print("Context: \n", context)
        print("\n")
        print("Question: \n", question)
        return {
            "Reference Answer: ": ref_answer,
            "Predicted Answer: ": predicted_answer,
            "BLEU Score: ": score
        }
    else:
        return predicted_answer
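
To go beyond spot checks on single examples, the same prediction function can be applied to a list of examples and the scores averaged. The sketch below is one possible way to do this, not the notebook's method. `evaluate_pairs`, `toy_predict`, and `toy_score` are hypothetical, and the exact-match rate is an added convenience measure. In the notebook, `predict_fn` could wrap `predict_answer(context, question)` and `score_fn` could wrap the Google BLEU metric.

```python
def evaluate_pairs(examples, predict_fn, score_fn):
    """Return the mean score and exact-match rate over (context, question, reference) triples."""
    if not examples:
        raise ValueError("examples must not be empty")
    scores = []
    exact = 0
    for context, question, reference in examples:
        prediction = predict_fn(context, question)
        scores.append(score_fn(prediction, reference))
        # Exact match ignores case and surrounding whitespace
        exact += int(prediction.strip().lower() == reference.strip().lower())
    n = len(examples)
    return {"mean_score": sum(scores) / n, "exact_match": exact / n}


# Toy stand-ins so the sketch runs without a model
CANNED = {"q1": "18", "q2": "670 million"}


def toy_predict(context, question):
    """Hypothetical predictor: looks up a canned answer by question id."""
    return CANNED[question]


def toy_score(prediction, reference):
    """Hypothetical score: fraction of reference tokens found in the prediction."""
    ref_tokens = reference.split()
    pred_tokens = set(prediction.split())
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)


examples = [("ctx", "q1", "18"), ("ctx", "q2", "670 million people")]
print(evaluate_pairs(examples, toy_predict, toy_score))
```

Reporting an average over many held-out examples gives a more stable picture than individual scores, which jump between 0.0 and 1.0 for short answers.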

In [None]:
# Choose a random row from new samples
df = pd.read_csv("final_summary_file.csv")
df = df[3000:]
context = df["summary"].sample(n=1).iloc[0]

In [None]:
# Display the sampled context
context

'Sign in to get the best natural gas news and data. GE’ s 130-plus years of experience in energy applications will be invaluable to us as we rapidly scale our operations and manufacturing capacity to be able to capture millions of tonnes of carbon dioxide ( CO2) from diverse industrial sites around the world, ” said Svante CEO Claude Letourneau., Funded by the Infrastructure Investment and Jobs Act of 2021, the program is designed to decrease CO2 emissions from electricity generation with an ambition of net-zero emissions by 2050.,'

In [None]:
question = "What act is funding the Svante program?"
answer = "Infrastructure Investment and Jobs Act of 2021"

In [None]:
predict_answer(context, question, answer)



Context: 
 Sign in to get the best natural gas news and data. GE’ s 130-plus years of experience in energy applications will be invaluable to us as we rapidly scale our operations and manufacturing capacity to be able to capture millions of tonnes of carbon dioxide ( CO2) from diverse industrial sites around the world, ” said Svante CEO Claude Letourneau., Funded by the Infrastructure Investment and Jobs Act of 2021, the program is designed to decrease CO2 emissions from electricity generation with an ambition of net-zero emissions by 2050.,


Question: 
 What act is funding the Svante program?


{'Reference Answer: ': 'Infrastructure Investment and Jobs Act of 2021',
 'Predicted Answer: ': 'Infrastructure Investment and Jobs Act of 2021',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'Sign in to get the best natural gas news and data. California Gov. Gavin Newsom joined members of the Biden administration in May 2021 to announce the state would be advancing with at least two offshore wind energy areas ( WEA), including Humboldt Bay and central California’ s Morro Bay., The CEC noted that initial funds may also draw in federal grants for project construction.,'

In [None]:
question = "What did Gov. Gavin Newsom announce in May 2021?"
answer = "The state would be advancing with at least two offshore wind energy areas"
predict_answer(context, question, answer)



Context: 
 Sign in to get the best natural gas news and data. California Gov. Gavin Newsom joined members of the Biden administration in May 2021 to announce the state would be advancing with at least two offshore wind energy areas ( WEA), including Humboldt Bay and central California’ s Morro Bay., The CEC noted that initial funds may also draw in federal grants for project construction.,


Question: 
 What did Gov. Gavin Newsom announce in May 2021?


{'Reference Answer: ': 'The state would be advancing with at least two offshore wind energy areas',
 'Predicted Answer: ': 'Humboldt Bay and central California',
 'BLEU Score: ': {'google_bleu': 0.0}}

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.'

In [None]:
q1 = "How manny people will still lack ecelctricity in 2030?"
a1 = "670 million people"
predict_answer(context, q1, a1)

Context: 
 The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.


Question: 
 How manny people will still lack ecelctricity in 2030?


{'Reference Answer: ': '670 million people',
 'Predicted Answer: ': '670 million',
 'BLEU Score: ': {'google_bleu': 0.5}}

In [None]:
q2 = "How many countries are planning new clean power infrastructure in Africa?"
a2 = "18"
predict_answer(context, q2, a2)

Context: 
 The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.


Question: 
 How many countries are planning new clean power infrastructure in Africa?


{'Reference Answer: ': '18',
 'Predicted Answer: ': '18',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'FranklinWH Energy Storage has announced the addition of the aHub to its Franklin Home Power Solution. Whether due to extreme weather events, issues within our aging power grid or unknown reasons, power outages create a lot of anxiety for homeowners across the country, ” said Gary Lam, Co-Founder of FranklinWH. “ With aHub, homeowners will have more choices on how to power their homes, so they do not have to suffer during extended outages.'

In [None]:
question = "What will aHub provide homeowners?"
answer = "More choices on how to power their homes"
predict_answer(context, question, answer)

Context: 
 FranklinWH Energy Storage has announced the addition of the aHub to its Franklin Home Power Solution. Whether due to extreme weather events, issues within our aging power grid or unknown reasons, power outages create a lot of anxiety for homeowners across the country, ” said Gary Lam, Co-Founder of FranklinWH. “ With aHub, homeowners will have more choices on how to power their homes, so they do not have to suffer during extended outages.


Question: 
 What will aHub provide homeowners?


{'Reference Answer: ': 'More choices on how to power their homes',
 'Predicted Answer: ': 'Franklin Home Power Solution',
 'BLEU Score: ': {'google_bleu': 0.0}}

In [None]:
# https://www.iisd.org/articles/insight/cop-28-energy-transition-expectations
# Try on new data outside of the Cleantech corpus
context = "As COP 28 approaches, all eyes are on the energy sector. Amid a striking 91% \
of global carbon dioxide emissions originating from fossil fuels in 2022, the world is watching \
for an ambitious energy transition package at this year‚Äôs UN Climate Change Conference in Dubai. \
Many regard this year‚Äôs edition as the most important COP since COP 21, where governments \
adopted the Paris Agreement. The global stocktake, the Paris Agreement mandated inventory \
of global progress on climate action, is set to conclude in Dubai, with world leaders expected \
to present a political response. The global stocktake has already revealed that countries are not \
on track to meet the Paris Agreement’s mitigation goal of limiting global warming to 1.5°C. \
Rather, the world needs to move much faster to reduce emissions. In this context, COP 28 is a \
moment of reckoning for the energy sector. It presents a significant opportunity for governments \
to construct a credible and ambitious deal to dramatically step up climate action to speed up the \
energy transition. In the last year, discussions about phasing out fossil fuels and scaling up \
renewable energy have gained traction. The Dubai conference is a chance to build on this momentum. \
For an ambitious and robust energy package at COP 28, governments need to step up on five key areas."

In [None]:
question1 = "How much global carbon dioxide originiated from fossil fuels in 2022?"
answer1 = "91%"
predict_answer(context, question1, answer1)

Context: 
 As COP 28 approaches, all eyes are on the energy sector. Amid a striking 91% of global carbon dioxide emissions originating from fossil fuels in 2022, the world is watching for an ambitious energy transition package at this year’s UN Climate Change Conference in Dubai. Many regard this year’s edition as the most important COP since COP 21, where governments adopted the Paris Agreement. The global stocktake, the Paris Agreement mandated inventory of global progress on climate action, is set to conclude in Dubai, with world leaders expected to present a political response. The global stocktake has already revealed that countries are not on track to meet the Paris Agreement’s mitigation goal of limiting global warming to 1.5°C. Rather, the world needs to move much faster to reduce emissions. In this context, COP 28 is a moment of reckoning for the energy sector. It presents a significant opportunity for governments to construct a credible and ambitious deal to dramatical

{'Reference Answer: ': '91%',
 'Predicted Answer: ': '91%',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [None]:
question2 = "what discussions have been taking place in the last year?"
answer2 = "Discussions about phasing out fossil fuels and scaling up renewable energy"
predict_answer(context, question2, answer2)



Context: 
 As COP 28 approaches, all eyes are on the energy sector. Amid a striking 91% of global carbon dioxide emissions originating from fossil fuels in 2022, the world is watching for an ambitious energy transition package at this year’s UN Climate Change Conference in Dubai. Many regard this year’s edition as the most important COP since COP 21, where governments adopted the Paris Agreement. The global stocktake, the Paris Agreement mandated inventory of global progress on climate action, is set to conclude in Dubai, with world leaders expected to present a political response. The global stocktake has already revealed that countries are not on track to meet the Paris Agreement’s mitigation goal of limiting global warming to 1.5°C. Rather, the world needs to move much faster to reduce emissions. In this context, COP 28 is a moment of reckoning for the energy sector. It presents a significant opportunity for governments to construct a credible and ambitious deal to dramatical

{'Reference Answer: ': 'Discussions about phasing out fossil fuels and scaling up renewable energy',
 'Predicted Answer: ': 'UN Climate Change Conference',
 'BLEU Score: ': {'google_bleu': 0.0}}

In [None]:
# https://www.euronews.com/green/2023/12/29/can-you-make-a-wind-farm-out-of-wood-meet-the-swedish-start-up-making-clean-energy-even-gr
context = "Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.\
The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?\
Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.\
Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.\
Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.\
With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.\
They can also be joined together with glue, rather than the thousands of bolts needed for steel towers that require regular inspection. A coat of watertight paint protects them from the elements.\
Cost-wise, wooden towers work out less expensive than steel towers over time, according to Modvion. The cost advantage is greater for taller towers, and they are also shielded from the fluctuating price of steel.\
So far, Modvion has only installed wooden towers on land. But with minor adaptations, it says they could also be used offshore.\
In theory, wooden towers could be built as high as 1,500 metres, but current demand is closer to 150-200 metres.\ How do the emissions of steel and wooden wind towers compare?\
The life cycle emissions of a 110-metre-tall steel wind turbine sit at around 1,250 tonnes of CO2, according to Modvion.\
This is slashed by 90 per cent - to around 125 tonnes - for a wooden tower.\
When they need to be decommissioned, wooden towers can find new life as high-strength beams for the building industry, further reducing their footprint.\
“We like to see our towers as vertical storage of future building material,” says Modvion.\
Building towers in wood lowers the emissions from wind power plants by approximately 30 per cent per kWh, they estimate.\
Between 300 and 1,200 cubic metres of wood are needed per tower. The company uses Scandinavian spruce from sustainably managed forests, for which re-growth exceeds logging."

In [None]:
q1 = "Is wood a good alternative to steel for wind towers?"
predict_answer(context, q1)

'wood'

In [None]:
q2 = "How can wind towers find new life after being decomissioned?"
a2 = "as high-strength beams for the building industry"
predict_answer(context, q2, a2)

Context: 
 Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.They can also be joined together with glue, rather than the thousands of bolts needed for

{'Reference Answer: ': 'as high-strength beams for the building industry',
 'Predicted Answer: ': 'Scandi design',
 'BLEU Score: ': {'google_bleu': 0.0}}

In [None]:
q3 = "Which Swedish company is making wind towers from wood?"
a3 = "Modvion"
predict_answer(context, q3, a3)

Context: 
 Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.They can also be joined together with glue, rather than the thousands of bolts needed for

{'Reference Answer: ': 'Modvion',
 'Predicted Answer: ': 'Modvion',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [None]:
q4 = "How high can wooden towers be built?"
a4 = "As high as 1,500 meters"
predict_answer(context, q4, a4)

Context: 
 Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.They can also be joined together with glue, rather than the thousands of bolts needed for

{'Reference Answer: ': 'As high as 1,500 meters',
 'Predicted Answer: ': 'Scandi design',
 'BLEU Score: ': {'google_bleu': 0.0}}

In [None]:
q5 = "What is the life cycle emissions of a 110-metre-tall steel wind turbine?"
a5 = "250 tonnes of CO2"  # the article text reads "1,250 tonnes of CO2"; the leading "1," is missing from this reference
predict_answer(context, q5, a5)

Context: 
 Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.They can also be joined together with glue, rather than the thousands of bolts needed for

{'Reference Answer: ': '250 tonnes of CO2',
 'Predicted Answer: ': 'wood',
 'BLEU Score: ': {'google_bleu': 0.0}}

These examples show that the model only performs well on questions whose answer is a short span, such as a numeric figure, a company name, or a person. Situational questions that call for a longer, descriptive answer perform poorly.

## GPT-2

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multipr

In [None]:
!pip install transformers[torch]



In [None]:
!pip install --upgrade accelerate transformers

Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.2
    Uninstalling transformers-4.35.2:
      Successfully uninstalled transformers-4.35.2
Successfully installed transformers-4.36.2


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict, Dataset


We tokenize the data with the GPT-2 tokenizer. GPT-2 defines an end-of-text token but no padding token, so the end-of-text token is reused for padding. We also load the QA pairs from CSV files: the train split contains the examples used to train the model, while the validation split contains the examples used to evaluate its performance.


In [None]:
# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize questions and answers
def tokenize_function(examples):
    # Concatenate question and answer texts
    concatenated_qa_pairs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    return tokenizer(concatenated_qa_pairs, padding="max_length", truncation=True)
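
The training text built above is simply the question followed by the answer, whereas the evaluation prompt used later in this notebook has the layout `Context: ... Question: ... Answer:`. As a minimal sketch, training examples could be formatted with that same layout. `format_qa_example` is a hypothetical helper that is not part of the original pipeline, and whether it would improve the results here is untested.

```python
def format_qa_example(context: str, question: str, answer: str = "") -> str:
    """Build a prompt in the Context/Question/Answer layout (hypothetical helper)."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    # For training the answer is appended; at inference time it is left empty
    return f"{prompt} {answer}" if answer else prompt


# Toy strings for illustration only
example = format_qa_example(
    "Modvion builds wooden wind turbine towers.",
    "Which company builds wooden towers?",
    "Modvion",
)
print(example)
```

With a shared template, the model would see the same cue (`Answer:`) during fine-tuning that it later sees at prediction time.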

In [None]:
dataset = load_dataset('csv', data_files={'train': 'qa_pairs.csv', 'validation': 'validation_pairs.csv'})

In [None]:
print(dataset['train'][:5])

{'context': ['Electrification could cost-efficiently bust the vast majority of greenhouse gases arising from North Sea platforms., To achieve these goals, the NSTA is working with industry representatives to progress prospective electrification projects, including those in the central North Sea and West of Shetland., Sensitivities to gas, electricity and carbon prices were also investigated.', 'Homegrown Blusmart Mobility has secured debt to acquire a massive fleet of electric vehicles. According to media reports, Railways Energy Management..., A large vertically integrated energy utility in India has auctioned a huge solar power project. According to media reports, Damodar Valley Corporation ( DVC), a..., A government-owned non-banking financial institution in India has raised $ 750 million through the issuance of green bonds.', 'Competitive Power Ventures ( CPV) has achieved a major milestone in its plan to construct and operate the 200 MW DC Backbone Solar Farm ( CPV Backbone) in we

In [None]:
print(dataset['validation'][:5])

{'context': ['The world is going through historic transitions, a global shift of energy, transportation, and consumption that will impact every aspect of our lives, but..., Originally posted on EVANNEX. By Charles Morris If you’ re a car buff, and have traveled abroad, you’ ve doubtless noticed that the mix of auto brands you..., The Dutch plugin vehicle ( PEV) market continues in the fast lane, with 6,637 plugin registrations logged last month. The content produced by this site is for entertainment purposes only.', 'Pennsylvania Agriculture Secretary Russell Redding has released guidance for landowners considering solar production on farmland and in rural communities in Pennsylvania. It includes maintenance that protects soil and pollinators through native vegetation free of invasive species; gives priority to grazing vs. mowing, compensating farmers for the service; and includes decommissioning plans to fully restore any soil at a project’ s completion., Solar Industry offers i

After tokenization, we load the pre-trained GPT-2 model.

In [None]:
# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

The function `group_texts` concatenates the tokenized examples and splits them into fixed-size blocks of `block_size` tokens, discarding any remainder that does not fill a whole block. Fixed-length blocks allow the examples to be batched without any further padding.

In [None]:
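from itertools import chain

block_size = 128

def group_texts(examples):
    # Concatenate only the tokenized columns; the raw text columns
    # (context, question, answer) must not be chunked as if they were tokens
    token_columns = ["input_ids", "attention_mask"]
    concatenated_examples = {k: list(chain.from_iterable(examples[k])) for k in token_columns}
    # Use the number of tokens, rounded down to a multiple of block_size
    total_length = len(concatenated_examples["input_ids"])
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size
    result = {
        k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Drop the original columns so that every remaining column has one row per block
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    remove_columns=tokenized_datasets["train"].column_names,
)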
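
The grouping step can be illustrated on a toy input. The sketch below is a standalone simplification that works on plain Python lists rather than on a `datasets` batch; `group_into_blocks` and the toy token ids are made up for illustration.

```python
from itertools import chain


def group_into_blocks(token_lists, block_size):
    """Concatenate token lists and cut the result into equal-sized blocks."""
    flat = list(chain.from_iterable(token_lists))
    # Drop the tail that does not fill a whole block
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


toy_tokens = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = group_into_blocks(toy_tokens, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore example boundaries: the first block mixes tokens from the first two examples, and the last token (9) is discarded.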

The next step is to define the training arguments for fine-tuning the model. We specify arguments such as the total number of training epochs, the per-device batch sizes used for training and for evaluation on the validation set, the number of learning-rate warmup steps, and so on.

In [None]:
# Define the training arguments
training_args1 = TrainingArguments(
    output_dir="./results",
    num_train_epochs=30,
    per_device_train_batch_size=8,    # batch size for training
    per_device_eval_batch_size=8,     # batch size for evaluation
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=1,
    evaluation_strategy="epoch",      # perform evaluation each epoch
    save_strategy="epoch",
    load_best_model_at_end=True)      # load the best model when finished training

Next, we initialize a Trainer with the pre-trained model, the training arguments, and the training and validation data, and then train the model.

In [None]:
trainer1 = Trainer(
    model=model,
    args=training_args1,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['validation'])

In [None]:
# Train and fine-tune the model
trainer1.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 9.2505 | 8.692105 |
| 2 | 8.6131 | 8.66907 |
| 3 | 9.2585 | 8.623162 |
| 4 | 8.4906 | 8.55423 |
| 5 | 8.4452 | 8.463243 |
| 6 | 8.235 | 8.350114 |
| 7 | 8.199 | 8.215183 |
| 8 | 8.2562 | 8.061435 |
| 9 | 8.0712 | 7.889064 |
| 10 | 7.929 | 7.702499 |


Checkpoint destination directory ./results/checkpoint-1 already exists and is non-empty.Saving will proceed but saved results may be invalid.

TrainOutput(global_step=30, training_loss=6.2947769403457645, metrics={'train_runtime': 168.6999, 'train_samples_per_second': 1.245, 'train_steps_per_second': 0.178, 'total_flos': 13717831680000.0, 'train_loss': 6.2947769403457645, 'epoch': 30.0})
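
One detail worth checking is how the warmup setting relates to the length of this run. The sketch below uses `global_step=30` from the `TrainOutput` above and `warmup_steps=500` from `training_args1`. It assumes the default learning rate of `5e-5` and a linear warmup, since neither is set explicitly in the notebook.

```python
# Values taken from the cells above
total_steps = 30     # global_step reported by TrainOutput
warmup_steps = 500   # warmup_steps in training_args1
base_lr = 5e-5       # assumption: default learning_rate, not set explicitly


def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during a linear warmup phase (constant afterwards in this sketch)."""
    return base_lr * min(1.0, step / warmup_steps)


final_lr = linear_warmup_lr(total_steps, base_lr, warmup_steps)
print(f"learning rate at the last step: {final_lr:.2e}")
print(f"fraction of the peak rate reached: {final_lr / base_lr:.0%}")
```

Under these assumptions the run ends long before warmup completes, so the learning rate stays at a small fraction of its nominal value for the whole of training. A smaller `warmup_steps` may be more appropriate for a run of this length.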

In [None]:
evaluation_results = trainer1.evaluate()

print(evaluation_results)

{'eval_loss': 2.3671534061431885, 'eval_runtime': 0.0319, 'eval_samples_per_second': 219.554, 'eval_steps_per_second': 31.365, 'epoch': 30.0}


The output shows that the validation loss decreased steadily over the epochs shown, and the final evaluation loss after 30 epochs is 2.37. A lower loss indicates better performance: the model's predictions are closer to the actual target tokens.
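
A cross-entropy loss is often easier to interpret as perplexity, which is the exponential of the loss. A minimal sketch using the `eval_loss` value reported above:

```python
import math

eval_loss = 2.3671534061431885  # eval_loss reported by trainer1.evaluate()

# Perplexity is the exponential of the mean cross-entropy loss (in nats)
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")
```

Because the inputs were padded to a fixed length and the labels are a copy of the input ids, padding tokens contribute to this loss, so the figure should be read as indicative only.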

In [None]:
trainer1.save_model("/content/drive/MyDrive/NLP_models/gpt_qa_model1")

## Evaluate model GPT-2

In [None]:
from datasets import load_metric

In [None]:
modelGPT2 = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/NLP_models/gpt_qa_model1')
tokenizerGPT2 = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizerGPT2.pad_token = tokenizerGPT2.eos_token

We designed the function predict_answer_GPT_2 to generate an answer to a question using the fine-tuned GPT-2 model. It takes a context paragraph and a question as inputs, and optionally a reference answer for evaluation. The function constructs a prompt from the context and question, tokenizes this prompt, generates an answer using the model, and then decodes the generated tokens back into text, keeping only the text between "Answer:" and the first period. If a reference answer is provided, the function computes the BLEU score to evaluate the quality of the predicted answer.

In [None]:
def predict_answer_GPT_2(context, question, ref_answer=None):
    # Prepare the prompt with a cue for the model
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    # Encode the inputs
    inputs = tokenizerGPT2(prompt, return_tensors="pt", max_length=512, padding=True, truncation=True)
    # Generate the output with controlled length
    output_sequences = modelGPT2.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=inputs['input_ids'].shape[1] + 50,)

    # Decode the generated sequence
    generated_sequence = output_sequences[0].tolist()
    predicted_answer = tokenizerGPT2.decode(generated_sequence, skip_special_tokens=True)

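    # Post-process to extract only the answer part (up to the first period, if any)
    answer_start = predicted_answer.find("Answer:") + len("Answer:")
    answer_end = predicted_answer.find(".", answer_start)
    if answer_end == -1:
        # No period was generated: keep everything after "Answer:"
        answer_end = len(predicted_answer)
    predicted_answer = predicted_answer[answer_start:answer_end]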

    print(f"Context: {context}\n")
    print(f"Question: {question}\n")
    print(f"Reference answer: {ref_answer}\n")
    print(f"Predicted answer GPT-2: {predicted_answer}\n")

    # BLEU score
    if ref_answer:
        bleu = load_metric("bleu")
        # Tokenize the reference and predicted answers
        tokenized_ref_answer = tokenizerGPT2.tokenize(ref_answer)
        tokenized_predicted_answer = tokenizerGPT2.tokenize(predicted_answer)

        # Compute the BLEU score
        score = bleu.compute(predictions=[tokenized_predicted_answer], references=[[tokenized_ref_answer]])
        print(f"BLEU score: {score['bleu']}\n")

    return predicted_answer
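
The post-processing step in the function above can be looked at in isolation. The sketch below is a standalone version of that step. `extract_answer` is a hypothetical helper name and the sample strings are made up for illustration.

```python
def extract_answer(generated_text: str, marker: str = "Answer:") -> str:
    """Return the text after `marker`, cut at the first period (hypothetical helper)."""
    marker_pos = generated_text.find(marker)
    if marker_pos == -1:
        # The marker is missing, e.g. because the prompt was truncated
        return generated_text.strip()
    start = marker_pos + len(marker)
    end = generated_text.find(".", start)
    if end == -1:
        # No period generated: keep everything after the marker
        end = len(generated_text)
    return generated_text[start:end].strip()


# Toy strings for illustration only
print(extract_answer("Question: q\nAnswer: Modvion"))
print(extract_answer("Question: q\nAnswer: It costs $1.5 billion."))
```

The second call shows a limitation of cutting at the first period: decimal numbers and abbreviations such as "Gov." end the answer early.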

In [None]:
# Load the summaries and keep the rows from index 3000 onward as new samples
df = pd.read_csv("final_summary_file.csv")
df = df[3000:]

Below we test the model on the Cleantech data. The BLEU score is printed in each output.

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'California Gov. Gavin Newsom has signed Senate Bill 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems by two years., “ That said, this policy has a direct impact on property tax revenues that support essential services at the local level. Gov. Newsom’ s signature on this two-year tax exclusion will help ensure solar and storage projects continue providing clean, reliable power to California at a time of unprecedented stress on the state’ s electric grid, ” says Rick Umoff, senior director and counsel for California at the Solar Energy Industries Association ( SEIA). “'

In [None]:
question = "What Gavin Newsom has signed?"
ref_answer = "Senate Bill 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems by two years"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: California Gov. Gavin Newsom has signed Senate Bill 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems by two years., “ That said, this policy has a direct impact on property tax revenues that support essential services at the local level. Gov. Newsom’ s signature on this two-year tax exclusion will help ensure solar and storage projects continue providing clean, reliable power to California at a time of unprecedented stress on the state’ s electric grid, ” says Rick Umoff, senior director and counsel for California at the Solar Energy Industries Association ( SEIA). “

Question: What Gavin Newsom has signed?

Reference answer: Senate Bill 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems by two years

Predicted answer GPT-2:  The governor signed SB 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems b

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


BLEU score: 0.829360614494932



' The governor signed SB 1340, which extends the existing property tax exclusion for newly constructed, active solar energy systems by two years'

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'The Clean Air Task Force has published a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems., GEMex, a Mexico-based EU program with focus on hot dry rock and super hot rock systems has drilled several wells at the Acoculco geothermal field with dry wells reportedly exceeding 300 degrees Celsius. Moreover, being able to drill into superhot rock systems can make geothermal energy available nearly anywhere in the world.,'

In [None]:
question = "What the Clean Air Task Force has published?"
ref_answer = "a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: The Clean Air Task Force has published a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems., GEMex, a Mexico-based EU program with focus on hot dry rock and super hot rock systems has drilled several wells at the Acoculco geothermal field with dry wells reportedly exceeding 300 degrees Celsius. Moreover, being able to drill into superhot rock systems can make geothermal energy available nearly anywhere in the world.,

Question: What the Clean Air Task Force has published?

Reference answer: a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems

Predicted answer GPT-2:  The Clean Air Task Force has published a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems





BLEU score: 0.6855303107442261



' The Clean Air Task Force has published a detailed report on the potential, initiatives, challenges of extracting geothermal energy from superhot rock systems'

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'PT Barito Renewables has announced the share price for the company’ s planned IPO in the Indonesian Stock Exchange, in a bid to raise USD 203 million., PT Barito Renewables Energy has provided additional details on the planned initial public offering on the Indonesian Stock Exchange ( IDX). The company has set a price of 780 rupiah ( approx.'

In [None]:
question = "Who has announced the share price for the company’ s planned IPO in the Indonesian Stock Exchange?"
ref_answer = "PT Barito Renewables"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: PT Barito Renewables has announced the share price for the company’ s planned IPO in the Indonesian Stock Exchange, in a bid to raise USD 203 million., PT Barito Renewables Energy has provided additional details on the planned initial public offering on the Indonesian Stock Exchange ( IDX). The company has set a price of 780 rupiah ( approx.

Question: Who has announced the share price for the company’ s planned IPO in the Indonesian Stock Exchange?

Reference answer: PT Barito Renewables

Predicted answer GPT-2:  PT Barito Renewables has provided additional details on the planned initial public offering on the Indonesian Stock Exchange ( IDX)





BLEU score: 0.09849349468888721



' PT Barito Renewables has provided additional details on the planned initial public offering on the Indonesian Stock Exchange ( IDX)'
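
The low score for this last example, where the prediction does begin with the correct entity, follows from how BLEU counts n-gram overlap: every extra predicted token lowers the precision. The sketch below is a simplified illustration. It uses whitespace tokens instead of GPT-2 tokens and single n-gram orders instead of the full BLEU formula, so its numbers do not reproduce the score above.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Modified n-gram precision of `candidate` against a single `reference`."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "PT Barito Renewables".split()
prediction = (
    "PT Barito Renewables has provided additional details on the planned "
    "initial public offering on the Indonesian Stock Exchange ( IDX)"
).split()

precisions = {n: ngram_precision(prediction, reference, n) for n in (1, 2)}
for n, value in precisions.items():
    print(f"{n}-gram precision: {value:.3f}")
```

Only the first three tokens of the prediction match the reference, so precision is dominated by the unmatched tail. An answer that is correct but verbose is therefore scored much like a wrong one.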

Next, we compare against some of the same questions used with the T5 model, drawn both from the Cleantech dataset and from outside sources.

In [None]:
context = "Sign in to get the best natural gas news and data. GE’ s 130-plus years of experience in energy applications will be invaluable to us as we rapidly scale our operations and manufacturing capacity to be able to capture millions of tonnes of carbon dioxide ( CO2) from diverse industrial sites around the world, ” said Svante CEO Claude Letourneau., Funded by the Infrastructure Investment and Jobs Act of 2021, the program is designed to decrease CO2 emissions from electricity generation with an ambition of net-zero emissions by 2050"

In [None]:
question = "What act is funding the Svante program?"
ref_answer = "Infrastructure Investment and Jobs Act of 2021"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Sign in to get the best natural gas news and data. GE’ s 130-plus years of experience in energy applications will be invaluable to us as we rapidly scale our operations and manufacturing capacity to be able to capture millions of tonnes of carbon dioxide ( CO2) from diverse industrial sites around the world, ” said Svante CEO Claude Letourneau., Funded by the Infrastructure Investment and Jobs Act of 2021, the program is designed to decrease CO2 emissions from electricity generation with an ambition of net-zero emissions by 2050

Question: What act is funding the Svante program?

Reference answer: Infrastructure Investment and Jobs Act of 2021

Predicted answer GPT-2:  The Svante program is a $1





BLEU score: 0.0



' The Svante program is a $1'

In [None]:
context = "Sign in to get the best natural gas news and data. California Gov. Gavin Newsom joined members of the Biden administration in May 2021 to announce the state would be advancing with at least two offshore wind energy areas ( WEA), including Humboldt Bay and central California’ s Morro Bay., The CEC noted that initial funds may also draw in federal grants for project construction.,"

In [None]:
question = "What did Gov. Gavin Newsom announce in May 2021?"
ref_answer = "The state would be advancing with at least two offshore wind energy areas"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Sign in to get the best natural gas news and data. California Gov. Gavin Newsom joined members of the Biden administration in May 2021 to announce the state would be advancing with at least two offshore wind energy areas ( WEA), including Humboldt Bay and central California’ s Morro Bay., The CEC noted that initial funds may also draw in federal grants for project construction.,

Question: What did Gov. Gavin Newsom announce in May 2021?

Reference answer: The state would be advancing with at least two offshore wind energy areas

Predicted answer GPT-2:  The state will begin construction of a new offshore wind energy area in Morro Bay, which will be located in the western part of the state





BLEU score: 0.0



' The state will begin construction of a new offshore wind energy area in Morro Bay, which will be located in the western part of the state'

In [None]:
context = "The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers."

In [None]:
q1 = "How manny people will still lack ecelctricity in 2030?"
ref_answer = "670 million people"
predict_answer_GPT_2(context, q1, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.

Question: How manny people will still lack ecelctricity in 2030?

Reference answer: 670 million people

Predicted answer GPT-2:  The world's population will continue to grow by about 1





BLEU score: 0.0



" The world's population will continue to grow by about 1"

In [None]:
# Try on new data outside of the Cleantech corpus
context = "As COP 28 approaches, all eyes are on the energy sector. Amid a striking 91% \
of global carbon dioxide emissions originating from fossil fuels in 2022, the world is watching \
for an ambitious energy transition package at this year’s UN Climate Change Conference in Dubai. \
Many regard this year’s edition as the most important COP since COP 21, where governments \
adopted the Paris Agreement. The global stocktake, the Paris Agreement mandated inventory \
of global progress on climate action, is set to conclude in Dubai, with world leaders expected \
to present a political response. The global stocktake has already revealed that countries are not \
on track to meet the Paris Agreement’s mitigation goal of limiting global warming to 1.5°C. \
Rather, the world needs to move much faster to reduce emissions. In this context, COP 28 is a \
moment of reckoning for the energy sector. It presents a significant opportunity for governments \
to construct a credible and ambitious deal to dramatically step up climate action to speed up the \
energy transition. In the last year, discussions about phasing out fossil fuels and scaling up \
renewable energy have gained traction. The Dubai conference is a chance to build on this momentum. \
For an ambitious and robust energy package at COP 28, governments need to step up on five key areas."

In [None]:
question1 = "How much global carbon dioxide originiated from fossil fuels in 2022?"
ref_answer1 = "Amid a striking 91% of global carbon dioxide emissions originating from fossil fuels in 2022"
predict_answer_GPT_2(context, question1, ref_answer1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: As COP 28 approaches, all eyes are on the energy sector. Amid a striking 91% of global carbon dioxide emissions originating from fossil fuels in 2022, the world is watching for an ambitious energy transition package at this year’s UN Climate Change Conference in Dubai. Many regard this year’s edition as the most important COP since COP 21, where governments adopted the Paris Agreement. The global stocktake, the Paris Agreement mandated inventory of global progress on climate action, is set to conclude in Dubai, with world leaders expected to present a political response. The global stocktake has already revealed that countries are not on track to meet the Paris Agreement’s mitigation goal of limiting global warming to 1.5°C. Rather, the world needs to move much faster to reduce emissions. In this context, COP 28 is a moment of reckoning for the energy sector. It presents a significant opportunity for governments to construct a credible and ambitious deal to dramatically



BLEU score: 0.0



' The world is watching for an ambitious energy transition package at this year’s UN Climate Change Conference in Dubai'

In [None]:
# https://www.euronews.com/green/2023/12/29/can-you-make-a-wind-farm-out-of-wood-meet-the-swedish-start-up-making-clean-energy-even-gr
context = "Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.\
The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?\
Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.\
Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.\
Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.\
With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.\
They can also be joined together with glue, rather than the thousands of bolts needed for steel towers that require regular inspection. A coat of watertight paint protects them from the elements.\
Cost-wise, wooden towers work out less expensive than steel towers over time, according to Modvion. The cost advantage is greater for taller towers, and they are also shielded from the fluctuating price of steel.\
So far, Modvion has only installed wooden towers on land. But with minor adaptations, it says they could also be used offshore.\
In theory, wooden towers could be built as high as 1,500 metres, but current demand is closer to 150-200 metres.\ How do the emissions of steel and wooden wind towers compare?\
The life cycle emissions of a 110-metre-tall steel wind turbine sit at around 1,250 tonnes of CO2, according to Modvion.\
This is slashed by 90 per cent - to around 125 tonnes - for a wooden tower.\
When they need to be decommissioned, wooden towers can find new life as high-strength beams for the building industry, further reducing their footprint.\
“We like to see our towers as vertical storage of future building material,” says Modvion.\
Building towers in wood lowers the emissions from wind power plants by approximately 30 per cent per kWh, they estimate.\
Between 300 and 1,200 cubic metres of wood are needed per tower. The company uses Scandinavian spruce from sustainably managed forests, for which re-growth exceeds logging."

In [None]:
question = "Which Swedish company is making wind towers from wood?"
ref_answer = "Modvion"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers.The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers?Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex.Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight.Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement.With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place.They can also be joined together with glue, rather than the thousands of bolts needed for s



BLEU score: 0.0



' Modvion'

The model performs better on the Cleantech corpus: in two of the three tests the score is about 0.7 on average. Unfortunately, the model does not achieve comparable results when tested on other data.
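
A BLEU score of 0.0 does not necessarily mean that a prediction is wrong: above, the prediction `' Modvion'` receives 0.0 even though it matches the reference. One plausible explanation, assuming the `bleu` metric follows the standard unsmoothed definition with n-grams up to order 4, is that an answer shorter than four tokens contains no 4-grams, which forces the geometric mean of the n-gram precisions to zero. The sketch below is a simplified re-implementation for illustration only, not the code of the metric itself; `simple_bleu` and the token lists are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=4):
    """Unsmoothed sentence-level BLEU for one prediction and one reference."""
    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(prediction, n)
        ref_counts = ngram_counts(reference, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(overlap / total if total > 0 else 0.0)

    # Without smoothing, a single zero precision makes the whole score zero
    if min(precisions) == 0.0:
        return 0.0

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    ratio = len(prediction) / len(reference)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)
    return geo_mean * brevity_penalty


# Hypothetical subword tokens for a short answer: an exact match still scores 0.0
short_score = simple_bleu(["Mod", "v", "ion"], ["Mod", "v", "ion"])

# An exact match with at least four tokens scores 1.0
long_score = simple_bleu(["a", "100", "kW", "fuel", "cell"], ["a", "100", "kW", "fuel", "cell"])

print(short_score, long_score)
```

If this assumption holds, BLEU is a poor fit for very short reference answers such as a single name or number, and an exact-match or token-overlap measure may be more informative for those cases.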

# Comparing the results with the zero-shot capability of LLMs

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

As in the previous steps, we load the tokenizer and the pre-trained GPT-2 model.

In [None]:
# Load tokenizer and model
model_name = "gpt2"
tokenizer_GPT = AutoTokenizer.from_pretrained(model_name)
model_GPT = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
# GPT-2 has no padding token by default, so reuse the end-of-sequence token
if tokenizer_GPT.pad_token is None:
    tokenizer_GPT.pad_token = tokenizer_GPT.eos_token

To compare results, we reuse the function from the stage "Fine-tune model: evaluate model GPT-2" with one modification: the fine-tuned model is replaced with the pre-trained GPT-2 model, used zero-shot.

In [None]:
def predict_answer_GPT(context, question, ref_answer=None):
    # Format the prompt for the model
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    # Tokenize the inputs with padding
    inputs = tokenizer_GPT(prompt, return_tensors="pt", max_length=512, padding=True, truncation=True)

    # Generate the output with controlled length
    output_sequences = model_GPT.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=inputs['input_ids'].shape[1] + 50,
    )

    # Decode the generated sequence with the same tokenizer used for encoding
    generated_sequence = output_sequences[0].tolist()
    predicted_answer = tokenizer_GPT.decode(generated_sequence, skip_special_tokens=True)

    # Post-process to extract only the answer part (up to the first period, if any)
    answer_start = predicted_answer.find("Answer:") + len("Answer:")
    answer_end = predicted_answer.find(".", answer_start)
    if answer_end == -1:
        answer_end = len(predicted_answer)
    predicted_answer = predicted_answer[answer_start:answer_end]

    print(f"Context: {context}\n")
    print(f"Question: {question}\n")
    print(f"Reference answer: {ref_answer}\n")
    print(f"Predicted answer GPT-2 (zero-shot): {predicted_answer}\n")

    # BLEU score
    if ref_answer:
        bleu = load_metric("bleu")
        # Tokenize the reference and predicted answers
        tokenized_ref_answer = tokenizer_GPT.tokenize(ref_answer)
        tokenized_predicted_answer = tokenizer_GPT.tokenize(predicted_answer)

        # Compute the BLEU score
        score = bleu.compute(predictions=[tokenized_predicted_answer], references=[[tokenized_ref_answer]])
        print(f"BLEU score: {score['bleu']}\n")

    return predicted_answer
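
The post-processing step in the function above cuts the answer at the first period after `Answer:`. This likely explains truncated predictions such as `The Svante program is a $1`, where the period is probably a decimal point rather than the end of a sentence. A minimal sketch of a more tolerant extraction step follows; `extract_answer` is a hypothetical helper that is not part of the notebook, and the example strings are made up for illustration.

```python
import re


def extract_answer(generated_text, marker="Answer:"):
    """Return the first sentence after the first marker, keeping decimals intact."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        # The marker may be missing if the prompt was truncated
        return ""
    tail = tail.strip()
    # A sentence ends at ., ! or ? followed by whitespace or the end of the text,
    # so the period in "1.2" does not end the sentence
    match = re.search(r"[.!?](?=\s|$)", tail)
    return tail[:match.start()] if match else tail


# Made-up generated text, used only to illustrate the behaviour
example = "Context: c\nQuestion: q\nAnswer: The program is a $1.2 billion effort. More text"
print(extract_answer(example))
```

This rule still splits at abbreviations followed by a space, such as "Gov. Gavin Newsom", so it is an improvement rather than a complete sentence splitter.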

We take random rows from the Cleantech data and run the prediction functions for three models: zero-shot GPT-2 (`predict_answer_GPT`), fine-tuned GPT-2 (`predict_answer_GPT_2`), and fine-tuned T5 (`predict_answer`).

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'Solar loan company Sungage Financial has released a new 0% solar financing product for homeowners. This new 0% interest rate helps us achieve our mission to help as many homeowners as possible realize the benefits of solar energy ownership by providing the most affordable path to a renewable energy future, ” said Michael Gilroy, Sungage Financial CEO. “ Two close friends of mine are now receiving 0% on their solar loans.'

In [None]:
question = "What a Solar loan company Sungage Financial has released?"
ref_answer = "a new 0% solar financing product for homeowners."
predict_answer_GPT(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Solar loan company Sungage Financial has released a new 0% solar financing product for homeowners. This new 0% interest rate helps us achieve our mission to help as many homeowners as possible realize the benefits of solar energy ownership by providing the most affordable path to a renewable energy future, ” said Michael Gilroy, Sungage Financial CEO. “ Two close friends of mine are now receiving 0% on their solar loans.

Question: What a Solar loan company Sungage Financial has released?

Reference answer: a new 0% solar financing product for homeowners.

Predicted answer GPT-2 (zero-shot):  The company has released a new solar financing product for homeowners





BLEU score: 0.3672056269893592



' The company has released a new solar financing product for homeowners'

In [None]:
question = "What a Solar loan company Sungage Financial has released?"
ref_answer = "a new 0% solar financing product for homeowners."
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Solar loan company Sungage Financial has released a new 0% solar financing product for homeowners. This new 0% interest rate helps us achieve our mission to help as many homeowners as possible realize the benefits of solar energy ownership by providing the most affordable path to a renewable energy future, ” said Michael Gilroy, Sungage Financial CEO. “ Two close friends of mine are now receiving 0% on their solar loans.

Question: What a Solar loan company Sungage Financial has released?

Reference answer: a new 0% solar financing product for homeowners.

Predicted answer GPT-2:  The company has released a new solar financing product for homeowners





BLEU score: 0.3672056269893592



' The company has released a new solar financing product for homeowners'

In [None]:
question = "What a Solar loan company Sungage Financial has released?"
ref_answer = "a new 0% solar financing product for homeowners."
predict_answer(context, question, ref_answer)



Context: 
 Solar loan company Sungage Financial has released a new 0% solar financing product for homeowners. This new 0% interest rate helps us achieve our mission to help as many homeowners as possible realize the benefits of solar energy ownership by providing the most affordable path to a renewable energy future, ” said Michael Gilroy, Sungage Financial CEO. “ Two close friends of mine are now receiving 0% on their solar loans.


Question: 
 What a Solar loan company Sungage Financial has released?


{'Reference Answer: ': 'a new 0% solar financing product for homeowners.',
 'Predicted Answer: ': '0% solar financing product',
 'BLEU Score: ': {'google_bleu': 0.4117647058823529}}

In [None]:
# Choose a random row
context = df["summary"].sample(n=1).iloc[0]
context

'A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo. The demonstration aims to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources., The system is the first in the world to mix renewable energy hydrogen and waste plastic hydrogen, supply the mixture to a fuel cell, and connect it to the building air conditioning system for advanced use of electricity and heat, ” said the Japanese consortium.",'

In [None]:
question = "What is Japanese consortium testing?"
ref_answer = "a 100 kW fuel cell system in Tokyo"
predict_answer_GPT(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo. The demonstration aims to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources., The system is the first in the world to mix renewable energy hydrogen and waste plastic hydrogen, supply the mixture to a fuel cell, and connect it to the building air conditioning system for advanced use of electricity and heat, ” said the Japanese consortium.",

Question: What is Japanese consortium testing?

Reference answer: a 100 kW fuel cell system in Tokyo

Predicted answer GPT-2 (zero-shot):  The Japanese consortium is testing a 100 kW fuel cell system in Tokyo





BLEU score: 0.4703709593866897



' The Japanese consortium is testing a 100 kW fuel cell system in Tokyo'

In [None]:
question = "What is Japanese consortium testing?"
ref_answer = "a 100 kW fuel cell system in Tokyo"
predict_answer_GPT_2(context, question, ref_answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo. The demonstration aims to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources., The system is the first in the world to mix renewable energy hydrogen and waste plastic hydrogen, supply the mixture to a fuel cell, and connect it to the building air conditioning system for advanced use of electricity and heat, ” said the Japanese consortium.",

Question: What is Japanese consortium testing?

Reference answer: a 100 kW fuel cell system in Tokyo

Predicted answer GPT-2:  The Japanese consortium is testing a 100 kW fuel cell system in Tokyo





BLEU score: 0.4703709593866897



' The Japanese consortium is testing a 100 kW fuel cell system in Tokyo'

In [None]:
question = "What is Japanese consortium testing?"
ref_answer = "a 100 kW fuel cell system in Tokyo"
predict_answer(context, question, ref_answer)



Context: 
 A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo. The demonstration aims to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources., The system is the first in the world to mix renewable energy hydrogen and waste plastic hydrogen, supply the mixture to a fuel cell, and connect it to the building air conditioning system for advanced use of electricity and heat, ” said the Japanese consortium.",


Question: 
 What is Japanese consortium testing?


{'Reference Answer: ': 'a 100 kW fuel cell system in Tokyo',
 'Predicted Answer: ': 'a 100 kW fuel cell system',
 'BLEU Score: ': {'google_bleu': 0.6923076923076923}}

**BLEU score results**\
Test 1:\
GPT-2 zero-shot: 0.3672056269893592\
GPT-2 after fine-tuning: 0.3672056269893592\
T5 after fine-tuning: 0.4117647058823529

Test 2:\
GPT-2 zero-shot: 0.4703709593866897\
GPT-2 after fine-tuning: 0.4703709593866897\
T5 after fine-tuning: 0.6923076923076923

In both tests, the fine-tuned GPT-2 model shows no improvement in BLEU score over its zero-shot performance. In contrast, the fine-tuned T5 model achieves higher scores than either GPT-2 variant, especially in Test 2. Note, however, that the T5 function reports `google_bleu` while the GPT-2 functions report `bleu`, so the values are not strictly comparable.
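
The T5 results above are reported under the key `google_bleu`, while the GPT-2 functions print a `bleu` value. As a rough illustration of how a Google-BLEU-style score is formed, the sketch below takes the minimum of n-gram precision and n-gram recall over all n-grams up to order 4. It is a simplified re-implementation that assumes whitespace tokenization; `google_bleu_sketch` is a hypothetical name and not the library function.

```python
from collections import Counter


def all_ngrams(tokens, max_order=4):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def google_bleu_sketch(prediction, reference, max_order=4):
    """Minimum of n-gram precision and recall, pooled over orders 1..max_order."""
    pred = all_ngrams(prediction, max_order)
    ref = all_ngrams(reference, max_order)
    overlap = sum((pred & ref).values())
    pred_total = sum(pred.values())
    ref_total = sum(ref.values())
    if pred_total == 0 or ref_total == 0:
        return 0.0
    return min(overlap / pred_total, overlap / ref_total)


# Test 2: fine-tuned T5 prediction against the reference answer
score = google_bleu_sketch(
    "a 100 kW fuel cell system".split(),
    "a 100 kW fuel cell system in Tokyo".split(),
)
print(score)
```

With whitespace tokens this gives 18/26, about 0.692, which agrees with the Test 2 value reported for T5. Because the score pools all n-gram orders and uses recall as well as precision, a short but correct answer is not forced to zero, unlike unsmoothed BLEU.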

The identical predictions and BLEU scores of the zero-shot and fine-tuned GPT-2 models suggest that the fine-tuning process may not have provided the model with new patterns or information beyond its pre-training. This can happen when the fine-tuning data does not contain enough novel information to challenge the model's existing knowledge base.

To fine-tune models effectively, a substantial amount of data is typically required. We used only part of the Cleantech data to create question-answer pairs, prioritizing time efficiency and saving computational resources. Even with the entire dataset, the performance improvements might not be significantly higher, and fine-tuning on a larger dataset demands considerable computational resources and time.

## GPT-2 vs ChatGPT

We can also qualitatively compare the results of our fine-tuned model with the freely available LLM ChatGPT (https://chat.openai.com). ChatGPT runs on GPT-3.5 and should perform better than GPT-2 on language tasks.

First, we can test the limits of the model's knowledge by simply typing in some of the questions obtained from our cleantech articles.

Example 1 (no context given to ChatGPT):

- **Question:** What a Solar loan company Sungage Financial has released?

- **Answer - Human generated:** a new 0% solar financing product for homeowners.

- **Answer - Fine-tuned GPT-2:** The company has released a new solar financing product for homeowners

- **Answer - ChatGPT:** As of my last knowledge update in January 2022, I don't have specific information about any recent releases or updates from Sungage Financial, including any new solar loan products. Companies often introduce new products and services, and details about these developments may change over time.

To get the latest and most accurate information about Sungage Financial's offerings, I recommend checking their official website, press releases, or contacting the company directly for the most up-to-date details on any solar loan products they may have released since my last update.

Example 2 (no context given to ChatGPT):

- **Question:** What is Japanese consortium testing?

- **Answer - Human generated:** a 100 kW fuel cell system in Tokyo

- **Answer - Fine-tuned GPT-2:** The Japanese consortium is testing a 100 kW fuel cell system in Tokyo

- **Answer - ChatGPT:** I don't have real-time information, and my knowledge was last updated in January 2022. Therefore, I'm not aware of any specific Japanese consortium or their current testing activities. Consortia in various industries often collaborate on research and development projects, and their focus can vary widely.

If there have been developments or news regarding a Japanese consortium and their testing activities since my last update, I recommend checking recent news sources, press releases, or the official announcements from the consortium in question for the latest and most accurate information.

As these results show, ChatGPT unsurprisingly requires more context to be useful in these cases. To provide it, we use the following prompt format:

- **Context:** [e.g. Solar loan company Sungage Financial has released a new 0% solar financing product for homeowners. This new 0% interest rate helps us achieve our mission to help as many homeowners as possible realize the benefits of solar energy ownership by providing the most affordable path to a renewable energy future, ” said Michael Gilroy, Sungage Financial CEO. “ Two close friends of mine are now receiving 0% on their solar loans.]

- **Question:** [e.g. What a Solar loan company Sungage Financial has released?]


With the above example we now get the following results:

Example 1 (context given to ChatGPT):

- **Question:** What a Solar loan company Sungage Financial has released?

- **Answer - Human generated:** a new 0% solar financing product for homeowners.

- **Answer - Fine-tuned GPT-2:** The company has released a new solar financing product for homeowners

- **Answer - ChatGPT:** Sungage Financial has released a new 0% solar financing product for homeowners. This financing option with a 0% interest rate aims to make solar energy ownership more affordable and accessible for homeowners, aligning with the company's mission to help as many individuals as possible transition to a renewable energy future. Michael Gilroy, the CEO of Sungage Financial, highlighted the significance of this initiative in providing a cost-effective pathway for homeowners to embrace solar energy.

Example 2 (context given to ChatGPT):

- **Context:** A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo. The demonstration aims to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources., The system is the first in the world to mix renewable energy hydrogen and waste plastic hydrogen, supply the mixture to a fuel cell, and connect it to the building air conditioning system for advanced use of electricity and heat, ” said the Japanese consortium."

- **Question:** What is Japanese consortium testing?

- **Answer - Human generated**: a 100 kW fuel cell system in Tokyo

- **Answer - Fine-tuned GPT-2:**  The Japanese consortium is testing a 100 kW fuel cell system in Tokyo

- **Answer - ChatGPT:** The Japanese consortium is testing a 100 kW fuel cell system in Tokyo. The goal of the demonstration is to establish an urban hydrogen energy utilization model that combines global and local hydrogen sources. The system being tested is the first in the world to mix renewable energy hydrogen and hydrogen derived from waste plastic. The resulting mixture is supplied to a fuel cell, and the system is connected to a building's air conditioning system, enabling advanced utilization of both electricity and heat.

With this limited prompt, ChatGPT provides answers with more depth than the fine-tuned and non-fine-tuned GPT-2 models. They are, however, as long as and sometimes longer than the given context. Depending on the needs of the Q&A system, ChatGPT may not be the optimal model when only short answers are required. On the other hand, if more detail is required, ChatGPT is clearly superior.

## T5 vs ChatGPT

Now we can compare the results of our fine-tuned T5 model with ChatGPT's capabilities. Having already established that ChatGPT requires prompting with an article for context, we proceed with the comparison using the same [Context: <context>, Question: <question>] prompt format described above.
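
For repeated comparisons, the prompt can be assembled programmatically rather than typed by hand. The sketch below is one way to do this; `build_prompt` is a hypothetical helper, and the blank line between the two parts is an assumption, since the exact separator used in the ChatGPT sessions is not recorded here.

```python
def build_prompt(context, question):
    """Format one context/question pair in the Context/Question prompt layout."""
    # Strip stray whitespace so the labels always sit directly before the text
    return f"Context: {context.strip()}\n\nQuestion: {question.strip()}"


prompt = build_prompt(
    "A Japanese consortium is now testing a 100 kW fuel cell system in Tokyo.",
    "What is Japanese consortium testing?",
)
print(prompt)
```

The resulting string can be pasted into the ChatGPT interface, which keeps the wording of the prompt identical across all examples.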

Example 1:

- **Context:** Sign in to get the best natural gas news and data. GE’ s 130-plus years of experience in energy applications will be invaluable to us as we rapidly scale our operations and manufacturing capacity to be able to capture millions of tonnes of carbon dioxide ( CO2) from diverse industrial sites around the world, ” said Svante CEO Claude Letourneau., Funded by the Infrastructure Investment and Jobs Act of 2021, the program is designed to decrease CO2 emissions from electricity generation with an ambition of net-zero emissions by 2050.

- **Question:** What act is funding the Svante program?

- **Answer - Human generated:** Infrastructure Investment and Jobs Act of 2021

- **Answer - Fine-tuned T5 model:**  Infrastructure Investment and Jobs Act of 2021

- **Answer - ChatGPT:** The Svante program is funded by the Infrastructure Investment and Jobs Act of 2021. This act is designed to support initiatives that aim to decrease carbon dioxide (CO2) emissions from electricity generation, with the overarching goal of achieving net-zero emissions by the year 2050.




Example 2:

- **Context:** Sign in to get the best natural gas news and data. California Gov. Gavin Newsom joined members of the Biden administration in May 2021 to announce the state would be advancing with at least two offshore wind energy areas ( WEA), including Humboldt Bay and central California’ s Morro Bay., The CEC noted that initial funds may also draw in federal grants for project construction.,

- **Question:** What did Gov. Gavin Newsom announce in May 2021?

- **Answer - Human generated:** The state would be advancing with at least two offshore wind energy areas

- **Answer - Fine-tuned T5 model:** Humboldt Bay and central California

- **Answer - ChatGPT:** In May 2021, California Gov. Gavin Newsom announced that the state would be advancing with at least two offshore wind energy areas (WEA). These areas were identified as Humboldt Bay and central California's Morro Bay. The announcement signified the state's commitment to the development of offshore wind energy projects in those locations.


Example 3:

- **Context:** The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.

- **Question:**  How many people will still lack electricity in 2030?

- **Answer - Human generated:** 670 million people

- **Answer - Fine-tuned T5 model:** 670 million

- **Answer - ChatGPT:** In 2030, an estimated 670 million people will still lack electricity.

Example 4:

- **Context:** The International Renewable Energy Agency’ s latest annual report on the progress towards UN sustainable development goal seven estimates 670 million people will still lack electricity in 2030, and more than 2 billion will be reliant on unhealthy, polluting cooking methods., The Africa Solar Industry Association has recorded almost 2 GW of large scale project announcements since the start of last month with 18 countries planning new clean power infrastructure and including energy storage in the plants., This website uses cookies to anonymously count visitor numbers.

- **Question:** How many countries are planning new clean power infrastructure in Africa?

- **Answer - Human generated:** 18

- **Answer - Fine-tuned T5 model:** 18

- **Answer - ChatGPT:** The text mentions that 18 countries in Africa are planning new clean power infrastructure.

Example 5:

- **Context:** Swedish start-up Modvion is channelling Scandi design to combat this problem and create net zero wind power by erecting wooden towers. The company recently debuted the world’s tallest wooden wind turbine tower, near Gothenburg in southwest Sweden. Is wood a good alternative to steel for wind towers? Steel - the more common material for wind turbine towers - is favoured for its strength and durability. But as towers get ever taller, the logistics of putting them in place become more complex. Huge pieces of metal are heavy and difficult to transport. The high steel towers require extra enforcement to carry their own weight. Modvion’s laminated wood towers are manufactured from modules that are easy to stack onto lorries and do not need reinforcement. With a higher strength to weight ratio, they enable lighter construction and can be assembled by a small crane before being lifted into place. They can also be joined together with glue, rather than the thousands of bolts needed for steel towers that require regular inspection. A coat of watertight paint protects them from the elements. Cost-wise, wooden towers work out less expensive than steel towers over time, according to Modvion. The cost advantage is greater for taller towers, and they are also shielded from the fluctuating price of steel. So far, Modvion has only installed wooden towers on land. But with minor adaptations, it says they could also be used offshore. In theory, wooden towers could be built as high as 1,500 metres, but current demand is closer to 150-200 metres. How do the emissions of steel and wooden wind towers compare? The life cycle emissions of a 110-metre-tall steel wind turbine sit at around 1,250 tonnes of CO2, according to Modvion. This is slashed by 90 per cent - to around 125 tonnes - for a wooden tower. When they need to be decommissioned, wooden towers can find new life as high-strength beams for the building industry, further reducing their footprint. “We like to see our towers as vertical storage of future building material,” says Modvion. Building towers in wood lowers the emissions from wind power plants by approximately 30 per cent per kWh, they estimate. Between 300 and 1,200 cubic metres of wood are needed per tower. The company uses Scandinavian spruce from sustainably managed forests, for which re-growth exceeds logging.

- **Question:** How can wind towers find new life after being decommissioned?

- **Answer - Human generated:** as high-strength beams for the building industry

- **Answer - Fine-tuned T5 model:** Scandi design

- **Answer - ChatGPT:** After being decommissioned, wooden wind towers can find new life by repurposing the materials for various applications. In the case of Modvion's towers made from laminated wood, they can be transformed into high-strength beams for the building industry. This repurposing not only extends the lifespan of the materials but also contributes to sustainability and reduces the environmental impact.
By using the decommissioned wooden towers as high-strength beams, they can be integrated into construction projects, providing structural support in buildings and infrastructure. This approach aligns with the concept of circular economy, where materials are reused and recycled to minimize waste and promote sustainability.
The idea behind viewing the towers as "vertical storage of future building material," as mentioned by Modvion, reflects the commitment to creating a more sustainable and environmentally friendly approach to wind power infrastructure. It also highlights the potential for a closed-loop system where materials from decommissioned structures are efficiently repurposed for new applications, reducing the overall carbon footprint of the wind power industry.

The answers generated by ChatGPT in these examples are factually correct and in-depth, but without further prompt engineering they can be quite wordy. Adding an instruction such as "Limit the answer to 10 words or less" can shorten the output. However, even when prompted this way, ChatGPT does not reliably keep to the limit.
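The length instruction and a compliance check can be sketched as follows. This is a minimal illustration, not the exact prompt we used: the helper names `build_prompt` and `within_limit` and the prompt layout are assumptions, and no API call is made.

```python
def build_prompt(context: str, question: str, max_words: int = 10) -> str:
    """Compose a zero-shot QA prompt with an explicit length instruction.

    The layout (Context / Question / instruction) is an assumed format.
    """
    return (
        f"Context: {context}\n\n"
        f"Question: {question}\n\n"
        f"Limit the answer to {max_words} words or less."
    )


def within_limit(answer: str, max_words: int = 10) -> bool:
    """Return True if the whitespace-separated word count respects the limit."""
    return len(answer.split()) <= max_words


# Answers taken from the Africa example above
t5_answer = "18"
chatgpt_answer = (
    "The text mentions that 18 countries in Africa are planning "
    "new clean power infrastructure."
)

print(within_limit(t5_answer))       # short extractive answer
print(within_limit(chatgpt_answer))  # full-sentence answer
```

Because the instruction is not always followed, checking each response with something like `within_limit` and re-prompting or truncating when it fails may be more dependable than relying on the prompt alone.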

To conclude, ChatGPT can be considered much more effective than our fine-tuned T5 and GPT-2 models at summarising the complete context of an article. When given context and then queried, it gives more accurate and more in-depth answers. Depending on the use case, its answers may be too long. This can be mitigated with more detailed prompts; however, even then ChatGPT prefers to answer in full sentences.
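One caveat when comparing the models numerically is that overlap-based metrics penalise long answers even when they are correct. The sketch below assumes SQuAD-style normalisation (lowercasing, stripping punctuation and articles) and computes exact match and token-level F1 against the human answer. It is an illustration under those assumptions and not necessarily the metric used elsewhere in this notebook.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Answers from the Africa example above
human = "18"
t5 = "18"
chatgpt = (
    "The text mentions that 18 countries in Africa are planning "
    "new clean power infrastructure."
)

print(exact_match(t5, human), token_f1(t5, human))
print(exact_match(chatgpt, human), round(token_f1(chatgpt, human), 3))
```

In this example the ChatGPT answer contains the correct figure but scores a low F1 because of its extra tokens, so such metrics are best read alongside a manual check of factual correctness.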

# Conclusion

Large language models such as ChatGPT perform at a much higher level than our fine-tuned pretrained models. However, if we provided more Q&A pairs for training, it is possible that the fine-tuned models could give more accurate and concise answers in the clean energy field. This would require not only more data but also questions and answers of varying complexity, to prepare the models for more situational and abstract questions.

# Student Contributions

* BERT Summarizer: *Nina Merryweather*

* Question and Answer Pairs: *Natalie Barnett, Nina Merryweather*

* Q&A manual cleaning: *Natalie Barnett*

* T5 Fine-Tuning: *Natalie Barnett*

* GPT-2 Fine-Tuning: *Alina Yaroshchuk*

* Zero-Shot-Capability Evaluation: *Alina Yaroshchuk*

* LLM Zero-Shot-Capability Comparison: *Nina Merryweather*