<a href="https://colab.research.google.com/github/nbarnett19/Computational_Language_Tech/blob/Main/stage_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [22]:
%%capture
%pip install bert-extractive-summarizer

In [21]:
%%capture
# Preprocessing (the %%capture cell magic must be the first line of the cell)
!python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import nltk
import spacy
import math
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re

nlp = spacy.load('en_core_web_sm')

In [23]:
from summarizer import Summarizer

# Pre-Processing

The processing steps must be adjusted for use with the BERT summarizer. Sentences must be preserved, so punctuation cannot be fully removed. Additionally, because the goal is to create a QA dataset, it is not advantageous to remove numbers, as these may be important for answering questions or providing context for the articles.


In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/cleantech_media_dataset_v1_20231109.zip
!unzip /content/cleantech_media_dataset_v1_20231109.zip

In [19]:
df = pd.read_csv("/content/cleantech_media_dataset_v1_20231109.csv")

In [24]:
def preprocess_data(df):
    # Remove duplicates
    df = df.drop_duplicates()

    # Strip symbols from the start and end of each text (e.g. list brackets and quotes), keeping sentence punctuation for tokenization
    df['content_cleaned_text'] = df['content'].apply(lambda x: re.sub(r"^[^a-zA-Z0-9.!?,/'-]+|[^a-zA-Z0-9.!?,/'-]+$", r" ", x))

    # Remove stray apostrophes (those not between two letters) and possessive 's endings
    df['content_cleaned_text'] = df['content_cleaned_text'].apply(lambda x: re.sub(r"(?<![a-zA-Z])'(?![a-zA-Z])|(?<![a-zA-Z])'(?=[a-zA-Z])|(?<=[a-zA-Z])'(?![a-zA-Z])|(?<=[a-zA-Z])'s", "", x))

    # Remove unused columns
    df.drop('Unnamed: 0', axis=1, inplace=True)
    df.drop('author', axis=1, inplace=True)

    return df

# Example usage:
df = preprocess_data(df)
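
To see what the two regular expressions in `preprocess_data` do, the sketch below applies them to a single made-up string instead of the DataFrame. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Same two patterns as in preprocess_data above
EDGE_SYMBOLS = r"^[^a-zA-Z0-9.!?,/'-]+|[^a-zA-Z0-9.!?,/'-]+$"
APOSTROPHES = (r"(?<![a-zA-Z])'(?![a-zA-Z])|(?<![a-zA-Z])'(?=[a-zA-Z])"
               r"|(?<=[a-zA-Z])'(?![a-zA-Z])|(?<=[a-zA-Z])'s")

def clean_text(text):
    """Hypothetical helper: apply both substitutions to one string."""
    text = re.sub(EDGE_SYMBOLS, " ", text)   # leading/trailing symbols -> space
    return re.sub(APOSTROPHES, "", text)     # stray apostrophes and 's -> removed

# Made-up sample mimicking the list-like format of the 'content' column
sample = '["New York City\'s funds hold $ 300 million. Is that a lot?"]'
cleaned = clean_text(sample)
print(repr(cleaned))
```

The brackets and quotes at the edges are replaced by a space, the possessive ending is dropped, and the numbers, the `$` sign and the sentence-ending punctuation inside the text are left untouched.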


In [6]:
df

Unnamed: 0,title,date,content,domain,url,content_cleaned_text
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...
...,...,...,...,...,...,...
9602,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...,Strata Clean Energy has closed a $ 300 millio...
9603,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...,Global renewable energy developer Ørsted is d...
9604,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...,"Veolia North America, a provider of environme..."
9605,"SunEdison: Too Big, Too Fast?",2023-11-08,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...,Once the self-proclaimed “ leading renewable ...


In [7]:
ids_articles = []

for index, row in df.iterrows():
    article_id = row['title']
    article = row['content_cleaned_text']

    ids_articles.append({'article_id': article_id, 'content': article})

In [8]:
articles = [article['content'] for article in ids_articles]

Examine articles after processing:

In [9]:
articles[9606]

' Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer., Arevon secured a commitment with J.P. Morgan to purchase $ 191 million of investment tax credits and production tax credits, among the nation’ s first transactions announced to date that leverage the Inflation Reduction Act’ s transferability provision., The additional $ 338 million debt facility was financed with MUFG, BNP Paribas, Sumitomo Mitsui Banking Corp., and First Citizens Bank, who acted as coordinating lead arrangers. National Bank of Canada also participated as a lender. Stoel Rives represented Arevon as legal counsel; Milbank LLP served as transfer counsel; and Winston & Strawn LLP served as lender counsel., “ Vikings has been a landmark project from its inception. It is one of the nation’ s first solar peaker plants, and today it is one of the first utility-scale solar-plus-storage ITC and PTC transferability transactions to cl

The text looks reasonably clean, so the BERT summarizer should be able to handle it.

Next step is to load the model.


# BERT Extractive Summarizer

The first stage uses the BERT extractive summarizer to extract key sentences from the articles.

In [25]:
model = Summarizer()

We test the model on one article. The number of sentences or ratio of sentences to the article length can be specified.

In [11]:
model(articles[9606], num_sentences=1)

'Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer.,'

Here we test another article using both the ratio method and the number of sentences method.

In [12]:
articles[50]

" The energy transition is very much about how far and how fast electrification can go. Siemens Energy is involved in most electrification technologies, from conventional and renewable power generation to storage, grids and green hydrogen production. To understand the issues, risks and opportunities, Energy Intelligence Senior Reporter Philippe Roos caught up with Stefan Diezinger, in charge of Sustainable Energy Systems at the German energy giant Industrial Applications division ( related). Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. What the potential there? A: We see efficiency enhancement as an important part of decarbonizing industries, especially energy-intensive process industries. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. With an upgrade, you can easily get 20% more efficiency. If you put this in terms of CO2 avoidance costs, it is extremely attractive. Then there so much waste heat whi

In [13]:
model(articles[50], ratio=0.2)



"The energy transition is very much about how far and how fast electrification can go. Q: Industrial carbon dioxide emissions can be reduced with energy efficiency. We still see a lot of old equipment which has 30, 40 or even 50 years of operation, even in Germany. High-temperature heat pumps can address this. Like this, you can get 5% -10% efficiency improvement, which also means CO2 reduction. To find the optimal configuration for a specific application, sophisticated design algorithms and a broad toolbox of technologies are available -- including equipment, electrification, automation and digitalization. We have at the moment ongoing discussions with companies in different parts of the world. If this is the case, you can produce green hydrogen or other green molecules like methanol and use them to replace fossil feedstock in the chemical industry and the mobility sector. But fuel shifting is not necessarily just about hydrogen. For example, even in Europe you still have a large amou

In [14]:
model(articles[50], num_sentences=3)



'The energy transition is very much about how far and how fast electrification can go. But if you really go into the details of the hydrogen business case, you can clearly see it only makes sense if there is a substantial amount of renewable power at very low cost. With digitalization, we can also optimize operations in an online mode.'

The ratio method risks producing a large number of sentences for longer articles. For our purposes it is safer to specify the number of sentences directly, as this yields a more manageable dataset later on.
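
The difference in output size can be made concrete with a little arithmetic. The sketch below assumes the ratio method keeps roughly `ratio × number of sentences`; this is an approximation for illustration, not the library's exact selection rule, and `approx_summary_size` is a hypothetical helper.

```python
def approx_summary_size(n_sentences, ratio=None, num_sentences=None):
    """Rough count of sentences kept (an approximation, not the library's rule)."""
    if num_sentences is not None:
        # A fixed count can never exceed the article length
        return min(num_sentences, n_sentences)
    # Ratio mode: the summary grows with the article
    return max(1, round(n_sentences * ratio))

for n in (20, 100, 400):
    by_ratio = approx_summary_size(n, ratio=0.2)
    by_count = approx_summary_size(n, num_sentences=3)
    print(f"{n} sentences -> ratio=0.2 keeps ~{by_ratio}, num_sentences=3 keeps {by_count}")
```

With a fixed count, the summary length stays constant regardless of article length, which keeps the downstream sentence table small.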

In [29]:
import warnings
warnings.filterwarnings("ignore")
import time

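# NOTE: df_subset is created in the Falconsai section further down (the cells were run out of order)
start_time = time.time()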
for index, row in df_subset.iterrows():
    body_text = row['content_cleaned_text']

    try:
        summary_sentences = model(body_text, num_sentences=3)
        df_subset.at[index, 'summary'] = summary_sentences
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        df_subset.at[index, 'summary'] = None  # or any default value you prefer

    # if index % save_interval == 0:
    #     df_subset.to_csv('/content/drive/MyDrive/Comp_Ling/output_file_2.csv', index=False)
total_elapsed_time = time.time() - start_time
print(f"Total Elapsed Time: {total_elapsed_time:.2f} seconds")

# Display the subset DataFrame with the 'summary' column
df_subset

Total Elapsed Time: 6.89 seconds


Unnamed: 0,title,date,content,domain,url,content_cleaned_text,summary,summaries_2
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...,Qatar Petroleum ( QP) is targeting aggressive ...,QP said its goals include reducing emissions ...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...,New US President Joe Biden took office this we...,China' s imports of US crude jumped 211% in 20...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...,The slow pace of Japanese reactor restarts con...,Tokyo Electric Power Co. ( Tepco) is strugglin...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...,Two of New York City largest pension funds say...,Two of New York City largest pension funds say...
5,Japan: Supreme Court Will Likely Decide on Fuk...,2021-01-28,"[""Japan's Supreme Court will likely become the...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Japan Supreme Court will likely become the ar...,Japan Supreme Court will likely become the arb...,Japan Supreme Court will likely become the arb...
6,Biden Appointees Signal Progressive Engagement,2021-01-28,"[""Oil and natural gas industry officials have ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Oil and natural gas industry officials have b...,Oil and natural gas industry officials have be...,Biden’ s cabinet nominees include those with a...
7,The Big Picture: The New 'Great Game ',2021-02-02,"[""• A new “ great game ” is emerging for the e...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,A new “ great game ” is emerging for the ener...,A new “ great game ” is emerging for the energ...,China's low-carbon energy race will be at the ...
8,Japan: Tritium Release Plans at Fukushima On Hold,2021-02-11,"[""Close to the 10th anniversary of the Fukushi...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Close to the 10th anniversary of the Fukushim...,Close to the 10th anniversary of the Fukushima...,In the face of widespread opposition to the pl...
9,United States: Cold Snap Highlights Electrific...,2021-02-18,"[""As the coldest weather in a generation broug...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,As the coldest weather in a generation brough...,As the coldest weather in a generation brought...,"In Texas, the problem is particularly acute si..."
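
The elapsed time of this ten-row test can be extrapolated to estimate a full run. The sketch below is a simple linear extrapolation; the 6.89 seconds for 10 rows comes from the run above, while the total of 9,607 rows is an assumed dataset size and `estimate_total_seconds` is a hypothetical helper.

```python
def estimate_total_seconds(sample_seconds, sample_rows, total_rows):
    """Linear extrapolation: assumes every article takes about the sample average."""
    return sample_seconds / sample_rows * total_rows

# 6.89 s for 10 rows is the timing printed above; 9607 rows is an assumption
seconds = estimate_total_seconds(6.89, 10, 9607)
print(f"Estimated full run: about {seconds / 60:.0f} minutes")
```

Actual run time depends on article length and hardware, so the figure is only a rough guide.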


In [30]:
import warnings
warnings.filterwarnings("ignore")

df['summary'] = ''

# Set the interval to save the DataFrame
#save_interval = 100

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    body_text = row['content_cleaned_text']

    try:
        summary_sentences = model(body_text, num_sentences=3)
        df.at[index, 'summary'] = summary_sentences
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        df.at[index, 'summary'] = None  # or any default value you prefer

    #if index % save_interval == 0:
        #df.to_csv('/content/drive/MyDrive/Comp_Ling/output_file_2.csv', index=False)

# Save the final DataFrame after all lines are processed
df.to_csv('/content/final_summary_file.csv', index=False)
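
The commented-out `save_interval` lines hint at periodic checkpointing, which is useful when a long run may be interrupted. A minimal sketch of that pattern follows, with the model call and the CSV write replaced by caller-supplied stand-in functions; `summarize_with_checkpoints` and its parameters are hypothetical names, not part of the notebook.

```python
def summarize_with_checkpoints(texts, summarize, save, save_interval=100):
    """Summarize each text and call save(results) every save_interval items.

    summarize and save are stand-ins for the model call and DataFrame.to_csv.
    """
    results = []
    for i, text in enumerate(texts, start=1):
        try:
            results.append(summarize(text))
        except Exception as e:
            print(f"Error processing item {i}: {e}")
            results.append(None)  # keep positions aligned with the input
        if i % save_interval == 0:
            save(results)         # intermediate checkpoint
    save(results)                 # final save after the loop
    return results

# Toy usage: the "summary" is the first sentence, "saving" records progress
checkpoints = []
out = summarize_with_checkpoints(
    ["a. b. c.", "d. e."] * 3,
    summarize=lambda t: t.split(".")[0] + ".",
    save=lambda r: checkpoints.append(len(r)),
    save_interval=2,
)
print(out, checkpoints)
```

Counting items with `enumerate` rather than relying on the DataFrame index keeps the interval correct even if the index has gaps after `drop_duplicates`.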



In [31]:
df = pd.read_csv("/content/final_summary_file.csv")

In [34]:
df['summary'][9606]

'Arevon Energy Inc. has closed financing on the Vikings solar-plus-storage project with a combination of debt financing and tax credit transfer., The project showcases key U.S. manufacturers, with PV module supply from Arizona-based First Solar, along with solar trackers from Nextracker, whose headquarters are in Fremont, Calif. Tesla is supplying the facility’ s utility-scale batteries, which allow the solar energy generated to be directed to the grid during peak demand., Construction of the facility is well underway, with commercial operations scheduled for the third quarter of 2024.'

# Falconsai/text-summarization
Here we test a different summarizer, which uses a fine-tuned T5 model.

In [35]:
%%capture
!pip install datasets evaluate transformers rouge-score nltk

In [60]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip /content/final_summary_file.zip


In [39]:
import pandas as pd
df = pd.read_csv("/content/final_summary_file.csv")

In [40]:
from transformers import pipeline

summarizer = pipeline("summarization", model="Falconsai/text_summarization")


In [49]:
print(summarizer(df['content_cleaned_text'][0], max_length=100, min_length=30, do_sample=False))

[{'summary_text': 'QP said its goals include  reducing emissions intensity of Qatar LNG facilities by 25% and of its upstream facilities by at least 15% . About 2.2 million tons/yr of the carbon capture goal will come from Phase 1 of the LNG expansion . QP says it should be able to eliminate routine gas flaring by 2030, with methane emissions limited by 0.2% across all facilities by 2025 .'}]


In [18]:
import warnings
import time
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

# Function to apply summarizer to each row and extract the summary text
def generate_summary(text):
    try:
        result = summarizer(text, max_length=100, min_length=30, do_sample=False)
        return result[0]['summary_text']
    except Exception as e:
        print(f"Error processing row: {e}")
        return None  # or any default value you prefer

# Record the start time
start_time = time.time()

# Create a subset of 10 articles for testing
df_subset = df.head(10).copy()  # copy so the new column is not set on a slice of df

# Apply the function to each row in the 'content_cleaned_text' column
df_subset['summaries_2'] = df_subset['content_cleaned_text'].apply(generate_summary)

# Calculate the elapsed time
elapsed_time = time.time() - start_time

# Display the DataFrame with the new 'summaries_2' column
df_subset

# Print the elapsed time
print(f"Elapsed Time: {elapsed_time} seconds")


                                               title        date  \
0  Qatar to Slash Emissions as LNG Expansion Adva...  2021-01-13   
1               India Launches Its First 700 MW PHWR  2021-01-15   
2              New Chapter for US-China Energy Trade  2021-01-20   
3  Japan: Slow Restarts Cast Doubt on 2030 Energy...  2021-01-22   
4     NYC Pension Funds to Divest Fossil Fuel Shares  2021-01-25   
5  Japan: Supreme Court Will Likely Decide on Fuk...  2021-01-28   
6     Biden Appointees Signal Progressive Engagement  2021-01-28   
7             The Big Picture: The New 'Great Game '  2021-02-02   
8  Japan: Tritium Release Plans at Fukushima On Hold  2021-02-11   
9  United States: Cold Snap Highlights Electrific...  2021-02-18   

                                             content       domain  \
0  ["Qatar Petroleum ( QP) is targeting aggressiv...  energyintel   
1  ["• Nuclear Power Corp. of India Ltd. ( NPCIL)...  energyintel   
2  ["New US President Joe Biden took office 



Unfortunately, this takes quite a while to run, so applying it to the whole dataset is not feasible within our time limit. We can, however, create a smaller dataset to test with the Q&A models:

In [51]:
def generate_summary(text):
    try:
        result = summarizer(text, max_length=100, min_length=30, do_sample=False)
        return result[0]['summary_text']
    except Exception as e:
        print(f"Error processing row: {e}")
        return None  # or any default value you prefer



# Create a subset of 100 articles for testing
df_subset = df.head(100).copy()  # copy so the new column is not set on a slice of df

# Apply the function to each row in the 'content_cleaned_text' column
df_subset['summaries_2'] = df_subset['content_cleaned_text'].apply(generate_summary)


# Display the DataFrame with the new 'summaries_2' column
df_subset

Unnamed: 0,title,date,content,domain,url,content_cleaned_text,summary,summaries_2
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum ( QP) is targeting aggressive...,Qatar Petroleum ( QP) is targeting aggressive ...,QP said its goals include reducing emissions ...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp. of India Ltd. ( NPCIL) sy...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this w...,New US President Joe Biden took office this we...,China' s imports of US crude jumped 211% in 20...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,The slow pace of Japanese reactor restarts co...,The slow pace of Japanese reactor restarts con...,Tokyo Electric Power Co. ( Tepco) is strugglin...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Two of New York City largest pension funds sa...,Two of New York City largest pension funds say...,Two of New York City largest pension funds say...
...,...,...,...,...,...,...,...,...
95,Upstream Capex Set for Growth Despite Pricing ...,2021-10-26,"['Oil and gas service companies, always hopefu...",energyintel,https://www.energyintel.com/0000017c-baac-dcc3...,"Oil and gas service companies, always hopeful...","Oil and gas service companies, always hopeful ...",Oil and gas service companies are always hopef...
96,Japan: Why Japan's Latest Nuclear Targets Won'...,2021-10-29,"[""Japan's latest energy plan, ratified Oct. 22...",energyintel,https://www.energyintel.com/0000017c-c667-dd9d...,"Japan latest energy plan, ratified Oct. 22, s...","Japan latest energy plan, ratified Oct. 22, st...","Japan's latest energy plan, ratified Oct. 22, ..."
97,Repsol CEO Calls Strategic Plan a 'Transformat...,2021-11-01,['Repsol recently accelerated its energy trans...,energyintel,https://www.energyintel.com/0000017c-c249-d117...,Repsol recently accelerated its energy transi...,Repsol recently accelerated its energy transit...,Repsol has accelerated its energy transition p...
98,"COP26 Roundup: Methane, Carbon Pricing, Coal",2021-11-02,"[""Methane emissions were high on the agenda on...",energyintel,https://www.energyintel.com/0000017c-e1e6-dbc1...,Methane emissions were high on the agenda on ...,Methane emissions were high on the agenda on t...,Methane emissions were high on the agenda on t...


In [61]:
# read_csv has no `index` argument (that belongs to to_csv), so it is dropped here
df_subset = pd.read_csv('/content/df_subset.csv')

Compare the different summaries vs. the original content:

In [53]:
# Falconsai summarizer:
df_subset['summaries_2'][2]

"China' s imports of US crude jumped 211% in 2020 to a record 396,000 barrels per day, a trade worth $ 6.27 billion, according to the Chinese customs administration . China’ s US crude imports reached 3.32 million tons, worth $ 1.11 billion, up from 851,000 b/d in December . The total value of these energy imports was $ 9.12 billion -- equal to the 2017 baseline ."

In [54]:
# BERT summarizer:

df_subset['summary'][2]

'New US President Joe Biden took office this week with the US-China relationship at its worst in decades. Meanwhile, imports of US LNG reached 3.32 million tons, worth $ 1.11 billion, up from a lone shipment in 2019, and 4.2 million tons of US LPG was discharged in China, for $ 1.74 billion, up from zero in 2019. With the dramatic change in tone in the White House on climate change, symbolized by the US rejoining the Paris climate accord this week, opportunities may widen beyond the oil and gas trade that could help rebuild dialogue between Washington and Beijing.'

In [55]:
# Original content
df_subset['content_cleaned_text'][2]

' New US President Joe Biden took office this week with the US-China relationship at its worst in decades. Energy has come to play a bigger role in that relationship than ever before, and rising Chinese imports of US oil and LNG could serve as the foundation for fresh discussions on trade -- one of the few areas where US-China communications have not completely broken down. But tackling climate change, a priority for Biden unlike predecessor Donald Trump, may offer the easiest and biggest opportunity for cooperation between the two powers now. Due to a bipartisan perception that China’ s economic, geopolitical and technological rise poses an existential threat to the US, a Biden administration is unlikely to soften the tone on Beijing. Trump demanded bigger purchases of US energy products by China as part of a Phase 1 trade deal before he would lift US tariffs on Chinese products. After Biden won the November US presidential election, Trump moved into high gear to crack down on the per

# Q & A generation

In [35]:
%%capture
!pip install sentencepiece

Restart runtime to make sure sentencepiece is loaded.

In [1]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/final_summary_file.zip
!unzip /content/final_summary_file.zip


In [2]:
!python -m nltk.downloader punkt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
from transformers import AutoModelWithLMHead, AutoTokenizer

In [4]:
import sentencepiece

In [5]:
import pandas as pd
df = pd.read_csv("/content/final_summary_file.csv")


In [6]:
import numpy as np
import nltk
import math
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re

In [7]:
import warnings
warnings.filterwarnings("ignore")

# Collect one record per summary sentence, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop.)
rows = []
for index, row in df.iterrows():
    for sentence in sent_tokenize(str(row['summary'])):
        rows.append({'index': index, 'title': row['title'], 'sentence': sentence})

sentences_df = pd.DataFrame(rows, columns=['index', 'title', 'sentence'])

sentences_df

Unnamed: 0,index,title,sentence
0,0,Qatar to Slash Emissions as LNG Expansion Adva...,Qatar Petroleum ( QP) is targeting aggressive ...
1,0,Qatar to Slash Emissions as LNG Expansion Adva...,A further 1.1 million tons/yr will come from P...
2,0,Qatar to Slash Emissions as LNG Expansion Adva...,But QP judged them to be too expensive and non...
3,1,India Launches Its First 700 MW PHWR,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
4,1,India Launches Its First 700 MW PHWR,India nuclear suppliers should be feeling some...
...,...,...,...
22005,9603,Orsted Deploying SparkCognition Renewable Suit...,“ From raw materials straight through to end-u...
22006,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,"Veolia North America, a provider of environmen..."
22007,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,Solar Industry offers industry participants pr...
22008,9605,"SunEdison: Too Big, Too Fast?",Once the self-proclaimed “ leading renewable p...


In [8]:
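# Missing summaries became the string 'nan' via str() above; turn them back into NaN so they can be dropped
sentences_df['sentence'] = sentences_df['sentence'].replace('nan', float('nan'))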
sentences_df = sentences_df.dropna(subset=['sentence'])
sentences_df

Unnamed: 0,index,title,sentence
0,0,Qatar to Slash Emissions as LNG Expansion Adva...,Qatar Petroleum ( QP) is targeting aggressive ...
1,0,Qatar to Slash Emissions as LNG Expansion Adva...,A further 1.1 million tons/yr will come from P...
2,0,Qatar to Slash Emissions as LNG Expansion Adva...,But QP judged them to be too expensive and non...
3,1,India Launches Its First 700 MW PHWR,Nuclear Power Corp. of India Ltd. ( NPCIL) syn...
4,1,India Launches Its First 700 MW PHWR,India nuclear suppliers should be feeling some...
...,...,...,...
22005,9603,Orsted Deploying SparkCognition Renewable Suit...,“ From raw materials straight through to end-u...
22006,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,"Veolia North America, a provider of environmen..."
22007,9604,Veolia Has Plans for 5 MW of Solar in Arkansas,Solar Industry offers industry participants pr...
22008,9605,"SunEdison: Too Big, Too Fast?",Once the self-proclaimed “ leading renewable p...


## mrm8488/t5-base-finetuned-question-generation-ap

In [8]:
df['summary'][0]

'Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr. But QP judged them to be too expensive and none met its targeted 50-week construction schedule.'

In [9]:
df['summary'][1]

'Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820). India nuclear suppliers should be feeling some relief over Kakrapar-3s start-up, although order flows will depend on how quickly NPCIL can get other projects moving, and the course of the Covid-19 pandemic ( NIW Dec.1120). • Across the ocean NuScale has launched a play for the UK market via a memorandum of understanding with start-up clean energy firm Shearwater Energy to explore the deployment of hybrid SMR and wind energy projects across the country.'

In [9]:
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

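def get_question(answer, context, max_length=64):
    # Build the "answer: ... context: ..." prompt used with this checkpoint
    input_text = "answer: %s  context: %s </s>" % (answer, context)
    features = tokenizer([input_text], return_tensors='pt')

    # max_length caps the length of the generated question
    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'],
                            max_length=max_length)

    # The decoded string still contains special tokens such as <pad> and </s>
    return tokenizer.decode(output[0])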

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [11]:
context = df['content_cleaned_text'][0]
answer = "Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion."

get_question(answer, context)

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors


'<pad> question: What is the goal of Qatar Petroleum in its latest Sustainability Report?</s>'
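
The warning above reports that the tokenized input (651 tokens) is longer than the 512-token maximum given for this model. One option is to pass `truncation=True, max_length=512` to the tokenizer call. Another, sketched below, is to trim the context to a window around the answer before tokenizing. This is only an illustration under stated assumptions: `window_context` and `max_words` are hypothetical names that do not appear in the notebook, and whitespace-separated words are used as a rough stand-in for model tokens.

```python
def window_context(context, answer, max_words=300):
    """Return at most max_words words of context, centred on the answer if it is found."""
    words = context.split()
    if len(words) <= max_words:
        return context  # already short enough, leave untouched

    pos = context.find(answer)
    if pos == -1:
        # Answer not found verbatim: fall back to the start of the article
        return " ".join(words[:max_words])

    # Word index at which the answer starts (approximate if it starts mid-word)
    answer_start = len(context[:pos].split())
    answer_len = len(answer.split())

    # Spread the remaining word budget evenly before and after the answer
    slack = max(max_words - answer_len, 0)
    start = max(answer_start - slack // 2, 0)
    start = min(start, len(words) - max_words)
    return " ".join(words[start:start + max_words])
```

A call such as `get_question(answer, window_context(context, answer))` would then pass a shorter context. The word budget would need tuning, since one word often maps to more than one token.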

In [18]:
context=df['content_cleaned_text'][1]
answer="Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820)."

In [19]:
print(get_question(answer, context))
print(answer)

<pad> question: What is the first of 700 megawatts of indigenously developed PHWRs?</s>
Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.1820).
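
The decoded outputs above still carry the `<pad>` and `</s>` markers and a leading `question:` label. Passing `skip_special_tokens=True` to `tokenizer.decode` would drop the special tokens. The sketch below shows a plain-string alternative that also removes the label. `clean_question` is a hypothetical helper, not part of the notebook, and the token strings are taken from the outputs shown above.

```python
def clean_question(decoded, special_tokens=("<pad>", "</s>"), prefix="question:"):
    """Strip special-token markers and the leading 'question:' label from a decoded string."""
    text = decoded
    for token in special_tokens:
        text = text.replace(token, "")
    text = text.strip()
    # Remove the label only when it appears at the very start
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


raw = "<pad> question: What is the goal of Qatar Petroleum in its latest Sustainability Report?</s>"
print(clean_question(raw))
```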


Try the model using the article title as context and the sentences as answers.

In [20]:
context=sentences_df['title'][0]
answer=sentences_df['sentence'][0]
print(get_question(answer, context))
print(answer)

<pad> question: What is the plan of Qatar Petroleum to cut its greenhouse gas emissions?</s>
Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion.


In [21]:
context=sentences_df['title'][1]
answer=sentences_df['sentence'][1]

print(get_question(answer, context))
print(answer)

<pad> question: How much more LNG will be produced from the North Field South project?</s>
A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar LNG capacity by a further 16 million tons/yr.


## potsawee/t5-large-generation-squad-QuestionAnswer

In [None]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/qa_section_data.zip
!unzip /content/qa_section_data.zip

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")

model = AutoModelForSeq2SeqLM.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")


In [37]:
context = df['summary'][4]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  How much money did the New York City Board of Education Retirement System have under management?
answer:  $ 7.4 billion


In [38]:
context = df['summary'][1]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What is the name of the company that has a memorandum of understanding with Shearwater Energy?
answer:  NuScale
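
The cells in this section repeat the same tokenize, generate, decode, and split steps for each context. A minimal sketch of wrapping those steps in one function follows. `generate_qa` is a hypothetical name, the tokenizer and model are passed in as arguments rather than read from globals, and the added `strip()` calls (which remove the leading spaces visible in the printed outputs) are a small departure from the original cells.

```python
def generate_qa(context, tokenizer, model, max_length=100):
    """Generate one (question, answer) pair for a context string."""
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=max_length)

    # Keep special tokens so the separator between question and answer survives
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
    decoded = decoded.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

    # Raises ValueError if the separator is missing, as in the original cells
    question, answer = decoded.split(tokenizer.sep_token)
    return question.strip(), answer.strip()
```

With this in place, a cell could reduce to `question, answer = generate_qa(df['summary'][1], tokenizer, model)`.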


In [12]:
df['summary'][9605]

'Once the self-proclaimed “ leading renewable power plant developer in the world, ” U.S.-based SunEdison filed for Chapter 11 bankruptcy on April 21., “ Nevertheless, the fall of SunEdison has made some renewable energy stakeholders question the viability of the yieldco structure., However, Chase asserts, “ SunEdison’ s bankruptcy says more about the company’ s strategic decisions than about the solar industry as a whole.'

In [11]:
context = df['summary'][9605]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What did Chase believe SunEdison’s bankruptcy says more about than the solar industry?
answer:  strategic decisions


In [13]:
context = sentences_df['sentence'][4]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What is the name of the pandemic that is threatening India?
answer:  Covid-19


In [14]:
context = sentences_df['sentence'][4000]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What are two key factors in making electric vehicle ownership a practical and enjoyable experience?
answer:  Effective route planning and access to charging stations


Test the model on the summaries produced by the Falconai model:

In [None]:
df_subset = pd.read_csv("/content/df_subset.csv")

In [57]:
context = df_subset['summaries_2'][1]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What was the name of the former chairman of the Department of Atomic Energy?
answer:  Anil Kakodkar


In [59]:
context = df_subset['summaries_2'][8]

inputs = tokenizer(context, return_tensors="pt")

outputs = model.generate(**inputs, max_length=100)

question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)

question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")

question, answer = question_answer.split(tokenizer.sep_token)

print("question:", question)
print("answer:", answer)

question:  What was the projected date for the filling of the tank?
answer:  summer of 2022


Split data into training and validation sets:

In [16]:
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(sentences_df, train_size=1000, random_state=42)

In [21]:
# Save train_df to CSV
train_df.to_csv('/content/train_df.csv', index=False)

# Save val_df to CSV
val_df.to_csv('/content/val_df.csv', index=False)

Generate question-answer pairs:

In [23]:
train_df['question'] = None
train_df['answer'] = None

# Q&A generation loop

for index, row in train_df.iterrows():
    context = row['sentence']

    try:
        # Tokenize the context and generate question and answer
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=100)
        question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
        question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")
        question, answer = question_answer.split(tokenizer.sep_token)

        # Assign question and answer to the corresponding row in train_df
        train_df.at[index, 'question'] = question
        train_df.at[index, 'answer'] = answer

    except Exception as e:
        # Handle the error (you can print or log the error message)
        print(f"Error processing row {index}: {str(e)}")
        # Set default values to None
        train_df.at[index, 'question'] = None
        train_df.at[index, 'answer'] = None



# save the train_df
train_df.to_csv('/content/qa_results.csv', index=False)

Error processing row 17586: not enough values to unpack (expected 2, got 1)
Error processing row 12790: not enough values to unpack (expected 2, got 1)
Error processing row 9510: not enough values to unpack (expected 2, got 1)
Error processing row 9310: not enough values to unpack (expected 2, got 1)
Error processing row 8512: not enough values to unpack (expected 2, got 1)
Error processing row 9865: not enough values to unpack (expected 2, got 1)
Error processing row 21343: not enough values to unpack (expected 2, got 1)
Error processing row 9765: not enough values to unpack (expected 2, got 1)
Error processing row 12941: not enough values to unpack (expected 2, got 1)
Error processing row 13746: not enough values to unpack (expected 2, got 1)
Error processing row 13703: not enough values to unpack (expected 2, got 1)
Error processing row 10954: not enough values to unpack (expected 2, got 1)
Error processing row 6168: not enough values to unpack (expected 2, got 1)
Error processing r
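
The "not enough values to unpack" errors occur when `split` returns a single piece, which happens when the decoded output contains no separator token. A minimal sketch of a more tolerant split follows. `split_qa` is a hypothetical helper, not part of the notebook. It uses `str.partition`, so a missing separator yields `None` for the answer instead of raising, and a repeated separator splits only at the first occurrence.

```python
def split_qa(decoded, sep_token):
    """Split 'question<sep>answer' text; the answer is None when the separator is absent."""
    question, sep, answer = decoded.partition(sep_token)
    if not sep:
        # No separator found: keep the text as the question and flag the missing answer
        return question.strip(), None
    return question.strip(), answer.strip()
```

Rows for which the answer comes back as `None` could then be counted or inspected separately rather than discarded along with their generated question.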

In [None]:
train_df = pd.read_csv("/content/qa_results.csv")

In [33]:
for index, row in train_df.iterrows():
  print(f"Question: {row['question']}, Answer: {row['answer']}")

Question:  What is the name of Atome’ s CEO?, Answer:  Olivier Mussat
Question:  What is the price of the ID.4?, Answer:  cheaper than the other models on the list
Question:  What is the process of selecting the correct BoM?, Answer:  an art in itself
Question:  What did the IEA say industry needs to do to lower nuclear construction costs?, Answer:  lower reactor construction costs by 40%
Question:  What is the typical server utilisation rate for AWS?, Answer:  65%
Question:  What is the name of the organization that aims to support South Africa transition from conventional plastics to more environmentally sustainable alternatives?, Answer:  UNIDO
Question:  What kind of look can French doors give to a tiny sustainable home?, Answer:  grand
Question:  What was collected daily from each mesocosm tank?, Answer:  50 mL of seawater
Question:  What is 4th Resource?, Answer:  a next generation geothermal energy developer
Question: None, Answer: None
Question:  What do cookies help us do?, An

The question-answer pairs are mostly of poor quality. The model is predisposed towards noun or noun-phrase answers, and questions often omit important contextual information such as the names of countries.
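
One rough way to put numbers on these observations is to count the rows where generation failed and the share of very short answers. The sketch below is only an illustration: `summarize_pairs` and the three-word threshold are assumptions, not part of the notebook, and the sample pairs are copied from the printed output above.

```python
def summarize_pairs(pairs, short_answer_words=3):
    """Count missing pairs and the share of valid answers with at most short_answer_words words."""
    total = len(pairs)
    valid = [(q, a) for q, a in pairs if q is not None and a is not None]
    short = [a for _, a in valid if len(a.split()) <= short_answer_words]
    return {
        "total": total,
        "missing": total - len(valid),
        "short_answer_share": len(short) / len(valid) if valid else 0.0,
    }


# A few pairs taken from the output above, including one failed row
sample_pairs = [
    ("What is the name of Atome’ s CEO?", "Olivier Mussat"),
    ("What is the process of selecting the correct BoM?", "an art in itself"),
    ("What is the typical server utilisation rate for AWS?", "65%"),
    ("What is 4th Resource?", "a next generation geothermal energy developer"),
    (None, None),
]
print(summarize_pairs(sample_pairs))
```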

There are probably multiple reasons why the quality is poor. One is that the BERT summarizer does not adequately paraphrase articles; rather, it extracts key sentences verbatim. Using a different summarizer could possibly improve results significantly. However, the other summarizer that was tested required comparatively more compute power and was slow to execute.