<a href="https://colab.research.google.com/github/nbarnett19/Computational_Language_Tech/blob/Main/Stage_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accelerating Cleantech Advancements through NLP-Powered Text Mining and Knowledge Extraction:

Stage 2: Training word and sentence embedding models




In this step, students train their own embedding models based on the given dataset and compare the model performance with the open-source embedding models.

> Data Preparation for Embeddings
*   Preprocess the text data for training embeddings, ensuring it is clean and well-structured.
*   Split the data into training and validation sets to assess model performance.

> Word Embedding Training
*   Train word embeddings using techniques like Word2Vec, FastText, or GloVe on the text data.
* Experiment with hyperparameters such as vector dimensions, context window size, and training epochs to optimize word embeddings.

> Sentence Embedding Training
* Develop sentence embeddings using methods like averaging word vectors, Doc2Vec, or
BERT embeddings.
* Fine-tune the sentence embeddings on the cleantech-specific data.
> Embedding Model Evaluation
* Assess the quality of both word and sentence embeddings using intrinsic evaluation methods,
including word similarity and analogy tasks.
* Compare the performance of the in-house embeddings to open source embeddings like Word2Vec, GloVe, or BERT embeddings.

> Transfer Learning with Open Source Models [Optional]
* Implement transfer learning by fine-tuning pre-trained open source models such as BERT or GPT-2 on the text data.
* Compare the performance of transfer learning with the in-house embeddings. This comparison could be done through evaluating the effectiveness of the embeddings in domain-specific tasks like topic classification.

> Outputs:
* Notebook with annotated model training steps.
* Notebook with visualizations comparing the performance of the embedding models.


In [1]:
# Top2Vec Modelling
%%capture
!pip install top2vec
!pip install top2vec[sentence_encoders]
!pip install top2vec[sentence_transformers]
!pip install top2vec[indexing]

# Import Libraries

In [2]:
# Preprocessing
%%capture
!python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import nltk
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS
import re

nlp = spacy.load('en_core_web_sm')

In [3]:
# Preprocessing
# Download nltk packages
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

True

In [4]:
# Exploratory Analysis
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.feature_extraction import _stop_words as sklearn_stop_words
from gensim.models.doc2vec import TaggedDocument
from sklearn import preprocessing

In [5]:
# Word2Vec and Doc2Vec
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from gensim.models import Word2Vec
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset
from torchtext.data.utils import get_tokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Processing packages
import gensim
from gensim.models import Word2Vec
from gensim import utils
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from gensim.test.utils import datapath
from nltk.probability import FreqDist
import random
import copy

In [6]:
# Plots
from pathlib import Path
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.core.interactiveshell import InteractiveShell

pd.options.display.max_colwidth = 100
pd.options.display.max_columns = 30

In [7]:
# Install packages for Top2Vec
%%capture
from top2vec import Top2Vec

# Load the Data

First step is to load our data from the csv file into a dataframe.

In [8]:
!wget https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/cleantech_media_dataset_v1_20231109.zip
!unzip /content/cleantech_media_dataset_v1_20231109.zip

--2023-12-18 22:49:12--  https://github.com/nbarnett19/Computational_Language_Tech/raw/Main/cleantech_media_dataset_v1_20231109.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/cleantech_media_dataset_v1_20231109.zip [following]
--2023-12-18 22:49:12--  https://raw.githubusercontent.com/nbarnett19/Computational_Language_Tech/Main/cleantech_media_dataset_v1_20231109.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14880158 (14M) [application/zip]
Saving to: ‘cleantech_media_dataset_v1_20231109.zip.4’


2023-12-18 22:49:13 (36.6 MB/s) - ‘cleantec

In [9]:
df = pd.read_csv("cleantech_media_dataset_v1_20231109.csv")

In [10]:
# Inspect dataframe
pd.DataFrame.head(df)

Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,1280,Qatar to Slash Emissions as LNG Expansion Advances,2021-01-13,,"[""Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepa...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de685b0000
1,1281,India Launches Its First 700 MW PHWR,2021-01-15,,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of G...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de6c710001
2,1283,New Chapter for US-China Energy Trade,2021-01-20,,"[""New US President Joe Biden took office this week with the US-China relationship at its worst i...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de735a0000
3,1284,Japan: Slow Restarts Cast Doubt on 2030 Energy Plan,2021-01-22,,"[""The slow pace of Japanese reactor restarts continues to cast doubt on the goal of the governme...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de79160000
4,1285,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,,"[""Two of New York City's largest pension funds say they will divest roughly $ 4 billion in share...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de7d9e0000


In [11]:
# Inspect dataframe shape
df.shape

(9607, 7)

The dataframe contains 9607 records and 7 attributes.

In [12]:
# Check for NAs
print(df.isnull().sum())

Unnamed: 0       0
title            0
date             0
author        9576
content          0
domain           0
url              0
dtype: int64


There are no NAs in the title or content columns which are the most important for topic modelling. We can also see that the author column has almost no data so we can drop this column.

# Preprocessing

We create a function to apply the first preprocessing steps. This includes dropping any duplicated records, changing the contents to lower case, removing non alpha-numeric characters, tokenizing the contents and adding a word count.

In [13]:
def preprocess_data(df):
    # Remove duplicates
    df = df.drop_duplicates()

    # Remove digits << Added because the tokenized numbers do not bring value to our analysis
    df['content_cleaned'] = df['content'].str.replace('\d+', '', regex=True)

    # Convert content to lower case
    df['content_cleaned'] = df['content_cleaned'].apply(lambda x: x.lower())

    # Remove symbols and punctuation (not sure about this step)
    df['content_cleaned'] = df['content_cleaned'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

    # Tokenize
    df['tokenized_content'] = df['content_cleaned'].apply(lambda x: nltk.word_tokenize(x))

    # Remove digits << Added because the tokenized numbers do not bring value to our analysis
    df['tokenized_content'] = df['tokenized_content'].replace('[0-9]+', '', regex=True)

    # Add word count column
    df['word_count'] = df['tokenized_content'].apply(lambda x: len(x))

    # Remove unused columns
    df.drop('Unnamed: 0', axis=1, inplace=True)
    df.drop('author', axis=1, inplace=True)

    return df

df = preprocess_data(df)

In [14]:
df[['content_cleaned', 'tokenized_content', 'word_count']].head()

Unnamed: 0,content_cleaned,tokenized_content,word_count
0,qatar petroleum qp is targeting aggressive cuts in its greenhouse gas emissions as it prepares ...,"[qatar, petroleum, qp, is, targeting, aggressive, cuts, in, its, greenhouse, gas, emissions, as,...",415
1,nuclear power corp of india ltd npcil synchronized kakrapar in the western state of gujarat to...,"[nuclear, power, corp, of, india, ltd, npcil, synchronized, kakrapar, in, the, western, state, o...",518
2,new us president joe biden took office this week with the uschina relationship at its worst in d...,"[new, us, president, joe, biden, took, office, this, week, with, the, uschina, relationship, at,...",679
3,the slow pace of japanese reactor restarts continues to cast doubt on the goal of the government...,"[the, slow, pace, of, japanese, reactor, restarts, continues, to, cast, doubt, on, the, goal, of...",663
4,two of new york citys largest pension funds say they will divest roughly billion in shares of ...,"[two, of, new, york, citys, largest, pension, funds, say, they, will, divest, roughly, billion, ...",384


Lemmatize tokens

In [16]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])


def lemmatize_tokens(token_list):
    # Join the tokens back into a string
    joined_tokens = ' '.join(token_list)
    # Process the text with spacy
    doc = nlp(joined_tokens)
    # Return the lemmatized tokens
    return [token.lemma_ for token in doc]

# Apply the lemmatization function to the 'tokenized_content' column
spacy_lemma = df['tokenized_content'].apply(lemmatize_tokens)
spacy_lemma

0       [qatar, petroleum, qp, be, target, aggressive, cut, in, its, greenhouse, gas, emission, as, it, ...
1       [nuclear, power, corp, of, india, ltd, npcil, synchronize, kakrapar, in, the, western, state, of...
2       [new, us, president, joe, biden, take, office, this, week, with, the, uschina, relationship, at,...
3       [the, slow, pace, of, japanese, reactor, restart, continue, to, cast, doubt, on, the, goal, of, ...
4       [two, of, new, york, city, large, pension, fund, say, they, will, divest, roughly, billion, in, ...
                                                       ...                                                 
9602    [strata, clean, energy, have, close, a, million, revolving, loan, and, letter, of, credit, facil...
9603    [global, renewable, energy, developer, rste, be, deploy, sparkcognition, s, renewable, suite, ac...
9604    [veolia, north, america, a, provider, of, environmental, solution, in, the, us, and, canada, hav...
9605    [once, the, selfproc

In [17]:
df['spacy_lemma'] = spacy_lemma

In [18]:
# Remove stops
stop_words_spacy = nlp.Defaults.stop_words
stops_spacy = df['spacy_lemma'].apply(lambda x: [word for word in x if word.lower() not in stop_words_spacy])
print(len(stops_spacy[0]))

232


In [19]:
df['stops_spacy'] = stops_spacy
df.head()

Unnamed: 0,title,date,content,domain,url,content_cleaned,tokenized_content,word_count,spacy_lemma,stops_spacy
0,Qatar to Slash Emissions as LNG Expansion Advances,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepa...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de685b0000,qatar petroleum qp is targeting aggressive cuts in its greenhouse gas emissions as it prepares ...,"[qatar, petroleum, qp, is, targeting, aggressive, cuts, in, its, greenhouse, gas, emissions, as,...",415,"[qatar, petroleum, qp, be, target, aggressive, cut, in, its, greenhouse, gas, emission, as, it, ...","[qatar, petroleum, qp, target, aggressive, cut, greenhouse, gas, emission, prepare, launch, phas..."
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of G...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de6c710001,nuclear power corp of india ltd npcil synchronized kakrapar in the western state of gujarat to...,"[nuclear, power, corp, of, india, ltd, npcil, synchronized, kakrapar, in, the, western, state, o...",518,"[nuclear, power, corp, of, india, ltd, npcil, synchronize, kakrapar, in, the, western, state, of...","[nuclear, power, corp, india, ltd, npcil, synchronize, kakrapar, western, state, gujarat, grid, ..."
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this week with the US-China relationship at its worst i...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de735a0000,new us president joe biden took office this week with the uschina relationship at its worst in d...,"[new, us, president, joe, biden, took, office, this, week, with, the, uschina, relationship, at,...",679,"[new, us, president, joe, biden, take, office, this, week, with, the, uschina, relationship, at,...","[new, president, joe, biden, office, week, uschina, relationship, bad, decade, energy, come, pla..."
3,Japan: Slow Restarts Cast Doubt on 2030 Energy Plan,2021-01-22,"[""The slow pace of Japanese reactor restarts continues to cast doubt on the goal of the governme...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de79160000,the slow pace of japanese reactor restarts continues to cast doubt on the goal of the government...,"[the, slow, pace, of, japanese, reactor, restarts, continues, to, cast, doubt, on, the, goal, of...",663,"[the, slow, pace, of, japanese, reactor, restart, continue, to, cast, doubt, on, the, goal, of, ...","[slow, pace, japanese, reactor, restart, continue, cast, doubt, goal, government, fifth, basic, ..."
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds say they will divest roughly $ 4 billion in share...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de7d9e0000,two of new york citys largest pension funds say they will divest roughly billion in shares of ...,"[two, of, new, york, citys, largest, pension, funds, say, they, will, divest, roughly, billion, ...",384,"[two, of, new, york, city, large, pension, fund, say, they, will, divest, roughly, billion, in, ...","[new, york, city, large, pension, fund, divest, roughly, billion, share, fossil, fuel, company, ..."


In [20]:
# Join text for each doc
df['final_cleaned'] = df['stops_spacy'].apply(lambda x: ' '.join(x))

# Topic Labeling

From stage 1, we identified 8 topics that will be used in our topic classification model. Below we will rerun the top2vec model and assign the topics to the cleantech corpus.

In [23]:
documents = df["content"].tolist()
model2 = Top2Vec(documents, embedding_model='universal-sentence-encoder-multilingual')

INFO:top2vec:Pre-processing documents for training

The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

INFO:top2vec:Downloading universal-sentence-encoder-multilingual model
INFO:top2vec:Creating joint document/word embedding
INFO:top2vec:Creating lower dimension embedding of documents
INFO:top2vec:Finding dense areas of documents
INFO:top2vec:Finding topics


In [24]:
# Get topics
topics = model2.get_topics()

In [25]:
model2.hierarchical_topic_reduction(num_topics=8)

[[3, 17, 7, 20, 21, 25, 37, 39, 0],
 [31, 34, 48, 49, 24, 47, 51, 9, 15, 11, 12, 19, 30, 40, 27],
 [26, 54, 14, 28, 16, 41, 50, 5],
 [36, 33, 45, 38, 43, 10, 44, 29, 18, 42, 8],
 [53, 22, 35, 46, 4],
 [52, 32, 1],
 [2],
 [23, 13, 6]]

In [26]:
# # Creating a dataframe with topic numbers and topic words
# Get the topic numbers and words and scores for a specific topic
topic_sizes, topic_nums = model2.get_topic_sizes(reduced=True)
topic_words, word_scores, topic_num = model2.get_topics(len(topic_nums))

# Create an empty list to store the results
topics_data = []

# Iterate over topics
for i in range(len(topic_nums)):
    # Get words, word scores, and topic number for the current topic
    current_topic_words, current_word_scores, current_topic_num = model2.get_topics()
    # Append the information for the current topic to the list
    # Words scores were excluded from the data frame because it did not add value to the analysis
    topics_data.append({'topic_nums': current_topic_num[i], 'topic_sizes': topic_sizes[i], 'topic_words': current_topic_words[i]})

# Create a DataFrame from the list of dictionaries
topics_df = pd.DataFrame(topics_data)

# Display the results DataFrame
topics_df

Unnamed: 0,topic_nums,topic_sizes,topic_words
0,0,1951,"[solar, solarpower, solarapp, solarize, agrivoltaic, photovoltaic, solaredge, agrivoltaics, geoe..."
1,1,1591,"[tesla, electrics, electricity, superchargers, exxonmobil, electric, agrivoltaic, renewables, to..."
2,2,1357,"[environmentally, greenpeace, environmental, ecological, eco, greentech, environment, ecojustice..."
3,3,1260,"[geothermal, geoenergy, hydrothermal, thinkgeoenergy, bioenergy, energies, energie, geosciences,..."
4,4,1010,"[solar, solarpower, agrivoltaic, agrivoltaics, solarapp, photovoltaic, solarize, terawatt, geoen..."
5,5,901,"[solar, solarpower, solarapp, agrivoltaic, solarize, agrivoltaics, photovoltaic, photovoltaics, ..."
6,6,796,"[greenpeace, climatic, environmentally, environmental, ecological, climate, ecology, climates, e..."
7,7,741,"[geoenergy, renewables, energies, energie, thinkgeoenergy, solarpower, bioenergy, totalenergies,..."


In [27]:
# # Creating a dataframe with documents assigned to the topics and document scores
# Get the topic sizes and topic numbers
topic_sizes, topic_nums = model2.get_topic_sizes(reduced = True)

# Create an empty DataFrame to store the results
results_df2 = pd.DataFrame(columns=['topic', 'document_ids','document_scores'])

# Iterate over topics
for i in range(len(topic_sizes)):
    # Get documents, document scores, and document IDs for the current topic
    documents, document_scores, document_ids = model2.search_documents_by_topic(reduced = True, topic_num=i, num_docs=topic_sizes[i])

    # Create a DataFrame for the current topic
    topic_df = pd.DataFrame({'topic': i, 'document_ids': document_ids, 'document_scores': document_scores})

    # Append the DataFrame for the current topic to the results DataFrame
    results_df2 = pd.concat([results_df2, topic_df], ignore_index=True)

# Display the results DataFrame
results_df2

Unnamed: 0,topic,document_ids,document_scores
0,0,4749,0.868814
1,0,1150,0.868699
2,0,8986,0.866830
3,0,9597,0.855517
4,0,9256,0.854619
...,...,...,...
9602,7,2563,0.333337
9603,7,2588,0.331727
9604,7,2521,0.326119
9605,7,2547,0.249271


Assign the documents to the identified topics.

In [28]:
# the index of each document in the original corpus is the id
# Therefor we can simply join the two dataframes on the index to assign the topics

# Merge DataFrames on the index of df and the document ids from results_df2 column
df_labeled = pd.merge(df, results_df2, left_index=True, right_on='document_ids', how='inner')

In [29]:
# Merge DataFrames on the 'ID' column
df_labeled

Unnamed: 0,title,date,content,domain,url,content_cleaned,tokenized_content,word_count,spacy_lemma,stops_spacy,final_cleaned,topic,document_ids,document_scores
2466,Qatar to Slash Emissions as LNG Expansion Advances,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepa...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de685b0000,qatar petroleum qp is targeting aggressive cuts in its greenhouse gas emissions as it prepares ...,"[qatar, petroleum, qp, is, targeting, aggressive, cuts, in, its, greenhouse, gas, emissions, as,...",415,"[qatar, petroleum, qp, be, target, aggressive, cut, in, its, greenhouse, gas, emission, as, it, ...","[qatar, petroleum, qp, target, aggressive, cut, greenhouse, gas, emission, prepare, launch, phas...",qatar petroleum qp target aggressive cut greenhouse gas emission prepare launch phase plan milli...,1,0,0.761405
844,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of G...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de6c710001,nuclear power corp of india ltd npcil synchronized kakrapar in the western state of gujarat to...,"[nuclear, power, corp, of, india, ltd, npcil, synchronized, kakrapar, in, the, western, state, o...",518,"[nuclear, power, corp, of, india, ltd, npcil, synchronize, kakrapar, in, the, western, state, of...","[nuclear, power, corp, india, ltd, npcil, synchronize, kakrapar, western, state, gujarat, grid, ...",nuclear power corp india ltd npcil synchronize kakrapar western state gujarat grid jan indias me...,0,1,0.746307
3315,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this week with the US-China relationship at its worst i...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de735a0000,new us president joe biden took office this week with the uschina relationship at its worst in d...,"[new, us, president, joe, biden, took, office, this, week, with, the, uschina, relationship, at,...",679,"[new, us, president, joe, biden, take, office, this, week, with, the, uschina, relationship, at,...","[new, president, joe, biden, office, week, uschina, relationship, bad, decade, energy, come, pla...",new president joe biden office week uschina relationship bad decade energy come play big role re...,1,2,0.657593
3470,Japan: Slow Restarts Cast Doubt on 2030 Energy Plan,2021-01-22,"[""The slow pace of Japanese reactor restarts continues to cast doubt on the goal of the governme...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de79160000,the slow pace of japanese reactor restarts continues to cast doubt on the goal of the government...,"[the, slow, pace, of, japanese, reactor, restarts, continues, to, cast, doubt, on, the, goal, of...",663,"[the, slow, pace, of, japanese, reactor, restart, continue, to, cast, doubt, on, the, goal, of, ...","[slow, pace, japanese, reactor, restart, continue, cast, doubt, goal, government, fifth, basic, ...",slow pace japanese reactor restart continue cast doubt goal government fifth basic energy plan l...,1,3,0.584835
2987,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds say they will divest roughly $ 4 billion in share...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c-a17b-e7de7d9e0000,two of new york citys largest pension funds say they will divest roughly billion in shares of ...,"[two, of, new, york, citys, largest, pension, funds, say, they, will, divest, roughly, billion, ...",384,"[two, of, new, york, city, large, pension, fund, say, they, will, divest, roughly, billion, in, ...","[new, york, city, large, pension, fund, divest, roughly, billion, share, fossil, fuel, company, ...",new york city large pension fund divest roughly billion share fossil fuel company aim insulate h...,1,4,0.705993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1168,Strata Clean Energy Nets $ 300 Million in Funding to Support Growth,2023-11-06,['Strata Clean Energy has closed a $ 300 million revolving loan and letter of credit facility to...,solarindustrymag,https://solarindustrymag.com/strata-clean-energy-nets-300-million-in-funding-to-support-growth,strata clean energy has closed a million revolving loan and letter of credit facility to expan...,"[strata, clean, energy, has, closed, a, million, revolving, loan, and, letter, of, credit, facil...",294,"[strata, clean, energy, have, close, a, million, revolving, loan, and, letter, of, credit, facil...","[strata, clean, energy, close, million, revolving, loan, letter, credit, facility, expand, opera...",strata clean energy close million revolving loan letter credit facility expand operational fleet...,0,9602,0.716440
1140,Orsted Deploying SparkCognition Renewable Suite for Solar Asset Management,2023-11-07,['Global renewable energy developer Ørsted is deploying SparkCognition’ s Renewable Suite across...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-sparkcognition-renewable-suite-for-solar-asset-man...,global renewable energy developer rsted is deploying sparkcognition s renewable suite across gw...,"[global, renewable, energy, developer, rsted, is, deploying, sparkcognition, s, renewable, suite...",319,"[global, renewable, energy, developer, rste, be, deploy, sparkcognition, s, renewable, suite, ac...","[global, renewable, energy, developer, rste, deploy, sparkcognition, s, renewable, suite, gw, la...",global renewable energy developer rste deploy sparkcognition s renewable suite gw landbased wind...,0,9603,0.718924
51,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,"['Veolia North America, a provider of environmental solutions in the U.S. and Canada, has partne...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-for-5-mw-of-solar-in-arkansas,veolia north america a provider of environmental solutions in the us and canada has partnered wi...,"[veolia, north, america, a, provider, of, environmental, solutions, in, the, us, and, canada, ha...",341,"[veolia, north, america, a, provider, of, environmental, solution, in, the, us, and, canada, hav...","[veolia, north, america, provider, environmental, solution, canada, partner, today, s, power, in...",veolia north america provider environmental solution canada partner today s power inc install mw...,0,9604,0.833346
4701,"SunEdison: Too Big, Too Fast?",2023-11-08,"['Once the self-proclaimed “ leading renewable power plant developer in the world, ” U.S.-based ...",solarindustrymag,http://www.solarindustrymag.com/online/issues/SI1606/FEAT_01_SunEdison-Too-Big-Too-Fast.html,once the selfproclaimed leading renewable power plant developer in the world usbased sunedison...,"[once, the, selfproclaimed, leading, renewable, power, plant, developer, in, the, world, usbased...",1683,"[once, the, selfproclaime, lead, renewable, power, plant, developer, in, the, world, usbased, su...","[selfproclaime, lead, renewable, power, plant, developer, world, usbased, sunedison, file, chapt...",selfproclaime lead renewable power plant developer world usbased sunedison file chapter bankrupt...,2,9605,0.676568


# Word Embeddings


In [None]:
df.head()

## Word2Vec Embedding

In [None]:
sentences = df['stops_spacy']

In [None]:
# ensure the data is a list of words.
sentences

In [None]:
# Set seeds for reproducibility
SEED = 37
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

data = copy.deepcopy(sentences)

# We define how to split the data
split_train = int(len(data) * 0.8)
split_val = (len(data) - split_train) // 2
split_test = len(data) - split_train - split_val

train_data = data[:split_train]
data = data[split_train:]

val_data = data[:split_val]
data = data[split_val:]

test_data = data

# Sanity check
assert len(train_data + val_data + test_data) == len(sentences)
print(len(train_data), len(val_data), len(test_data))

In [None]:
# min_count = removes words with a frequency less than listed
# Vector size = the number of dimensions that gensim Word2Vec maps the words into. Bigger size requires more training data, but can lead to better models
# Workers = parallelization to speeed up training
min_count = 5
vector_size = 200
workers=4

model = gensim.models.Word2Vec(sentences=train_data, min_count=min_count,
                               vector_size=vector_size,workers=workers,
                               compute_loss=True, seed = 55, epochs=50 )

In [None]:
# getting the training loss value
training_loss = model.get_latest_training_loss()
print(training_loss)

Show the model works by obtaining a vector from a common word in the model.

In [None]:
vec_energy = model.wv['energy']
vec_energy

Retrieve Vocabulary Words

In [None]:
for index, word in enumerate(model.wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")

Word2Vec supports word similarity tasks.

In [None]:
pairs = [
    ('energy', 'electricity'),
    ('energy', 'solar'),
    ('energy', 'gas'),
    ('energy', 'clean'),
    ('energy', 'climate'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))

In [None]:
# 5 most similar words to energy
model.wv.most_similar(positive=['energy'], topn=5)

In [None]:
# Which word does not belong in the sequence
print(model.wv.doesnt_match(['energy', 'solar', 'wind', 'water', 'electricity', 'president']))

We can evaluate the effectiveness of Word2Vec with word analogy or word pair mthods. Word analogies evaluates the model on a set of syntactic and semantic word analogies. The output is a tuple containing a total accuracy and a list of dictionaries for each section with correct and incorrect analogies. Word similarity evaluates the model on a dataset containing word pairs with human-assigned similarity judgments. The output includes the correlation coefficient and p-value.

In [None]:
# Word Analogy Evaluation
model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("Analogies Score:", analogies[0])

In [None]:
# Word Similarity Evaluation
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

* Pearson correlation coefficient: 0.41519180153657675
* Significance: p-value = 1.4867188725644225e-13 (very close to zero)
* Spearman rank-order correlation coefficient: 0.43021345396803157
* Coverage: 17.56% (percentage of word pairs covered by the model)

The Pearson correlation coefficient measures the linear relationship between the model's similarity scores and the human similarity judgments. In this case, the low p-value suggests that the correlation is statistically significant.

The Spearman rank-order correlation measures how consistently two sets of rankings are related. A higher Spearman coefficient indicates a better performance in capturing the ordinal relationships.

The coverage indicates the percentage of word pairs from the evaluation set that are present in the model's vocabulary.

Overall, these metrics provide insights into how well the Word2Vec model aligns with human judgments of word similarity.

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import random
import numpy as np

def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # extract the words & their vectors, as numpy arrays
    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index_to_key)  # fixed-width numpy strings

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(model)

plt.figure(figsize=(12, 12))
plt.scatter(x_vals, y_vals)

# Label randomly subsampled 25 data points
indices = list(range(len(labels)))
selected_indices = random.sample(indices, 25)
for i in selected_indices:
    plt.annotate(labels[i], (x_vals[i], y_vals[i]))

plt.show()


Build a new model using skip-gram.

In [None]:
# min_count = removes words with a frequency less than listed
# Vector size = the number of dimensions that gensim Word2Vec maps the words into. Bigger size requires more training data, but can lead to better models
# Workers = parallelization to speeed up training
min_count = 5
vector_size = 200
workers=4

model = gensim.models.Word2Vec(sentences=train_data, min_count=min_count,
                               vector_size=vector_size,workers=workers,
                               compute_loss=True, seed = 72, sg=1, hs=0, epochs=50 )

In [None]:
# getting the training loss value
training_loss = model.get_latest_training_loss()
print(training_loss)

In [None]:
pairs = [
    ('energy', 'electricity'),
    ('energy', 'solar'),
    ('energy', 'gas'),
    ('energy', 'clean'),
    ('energy', 'climate'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))

In [None]:
# 5 most similar words to energy
model.wv.most_similar(positive=['energy'], topn=5)

In [None]:
# Word Analogy Evaluation
analogies = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("Analogies Score:", analogies[0])

In [None]:
# Word Similarity Evaluation
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

* Pearson correlation coefficient: 0.41225004262668175
* Significance: p-value = 2.2882532293753467e-13 (very close to zero)
* Spearman rank-order correlation coefficient: 0.46687706846402177
* Coverage: 17.56% (percentage of word pairs covered by the model)

The Pearson correlation coefficient measures the linear relationship between the model's similarity scores and the human similarity judgments. In this case, the low p-value suggests that the correlation is statistically significant.

The Spearman rank-order correlation measures how consistently two sets of rankings are related. A higher Spearman coefficient indicates a better performance in capturing the ordinal relationships.

The coverage indicates the percentage of word pairs from the evaluation set that are present in the model's vocabulary.

Overall, these metrics show a slight improvement from the CBOW word2vec model.

The word embeddings made by the model can be visualised by reducing dimensionality of the words to 2 dimensions using tSNE.

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import random
import numpy as np

def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # extract the words & their vectors, as numpy arrays
    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index_to_key)  # fixed-width numpy strings

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(model)

plt.figure(figsize=(12, 12))
plt.scatter(x_vals, y_vals)

# Label randomly subsampled 25 data points
indices = list(range(len(labels)))
selected_indices = random.sample(indices, 25)
for i in selected_indices:
    plt.annotate(labels[i], (x_vals[i], y_vals[i]))

plt.show()


RNN classification

In [None]:
# # Complete the RNN class
# class RNNModel(nn.Module):
#     def __init__(self, input_size, hidden_size, num_layers, num_classes):
#         super(RNNModel, self).__init__()
#         self.hidden_size = hidden_size
#         self.num_layers = num_layers
#         self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
#         self.fc = nn.Linear(hidden_size, num_classes)
#     def forward(self, x):
#         h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
#         out, _ = self.rnn(x, h0)
#         out = out[:, -1, :]
#         out = self.fc(out)
#         return out

# # Initialize the model
# rnn_model = RNNModel(input_size, hidden_size, num_layers, num_classes)
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(rnn_model.parameters(), lr=0.01)

# # Train the model for ten epochs and zero the gradients
# for epoch in range(10):
#     optimizer.zero_grad()
#     outputs = rnn_model(X_train_seq)
#     loss = criterion(outputs, y_train_seq)
#     loss.backward()
#     optimizer.step()
#     print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Building an LSTM model

In [None]:
# # Initialize the LSTM and the output layer with parameters
# class LSTMModel(nn.Module):
#     def __init__(self, input_size, hidden_size, num_layers, num_classes):
#         super(LSTMModel, self).__init__()
#         self.hidden_size = hidden_size
#         self.num_layers = num_layers
#         self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
#         self.fc = nn.Linear(hidden_size, num_classes)
#     def forward(self, x):
#         h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
#         c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
#         out, _ = self.lstm(x, (h0, c0))
#         out = out[:, -1, :]
#         out = self.fc(out)
#         return out

# # Initialize model with required parameters
# lstm_model = LSTMModel(input_size, hidden_size, num_layers, num_classes)
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

# # Train the model by passing the correct parameters and zeroing the gradient
# for epoch in range(10):
#     optimizer.zero_grad()
#     outputs = lstm_model(X_train_seq)
#     loss = criterion(outputs, y_train_seq)
#     loss.backward()
#     optimizer.step()
#     print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Building a GRU Model

In [None]:
# # Complete the GRU model
# class GRUModel(nn.Module):
#     def __init__(self, input_size, hidden_size, num_layers, num_classes):
#         super(GRUModel, self).__init__()
#         self.hidden_size = hidden_size
#         self.num_layers = num_layers
#         self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
#         self.fc = nn.Linear(hidden_size, num_classes)
#     def forward(self, x):
#         h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
#         out, _ = self.gru(x, h0)
#         out = out[:, -1, :]
#         out = self.fc(out)
#         return out

# # Initialize the model
# gru_model = GRUModel(input_size, hidden_size, num_layers, num_classes)
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(gru_model.parameters(), lr=0.01)

# # Train the model and backpropagate the loss after initialization
# for epoch in range(15):
#     optimizer.zero_grad()
#     outputs = gru_model(X_train_seq)
#     loss = criterion(outputs, y_train_seq)
#     loss.backward()
#     optimizer.step()
#     print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Evaluate RNN model

In [None]:
# # Create an instance of the metrics
# accuracy = Accuracy(task="multiclass", num_classes=3)
# precision = Precision(task="multiclass", num_classes=3)
# recall = Recall(task="multiclass", num_classes=3)
# f1 = F1Score(task="multiclass", num_classes=3)

# # Generate the predictions
# outputs = rnn_model(X_test_seq)
# _, predicted = torch.max(outputs, 1)

# # Calculate the metrics
# accuracy_score = accuracy(predicted, y_test_seq)
# precision_score = precision(predicted, y_test_seq)
# recall_score = recall(predicted, y_test_seq)
# f1_score = f1(predicted, y_test_seq)
# print("RNN Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_score, precision_score, recall_score, f1_score))

Evaluate LSTM and GRU

In [None]:
# # Create an instance of the metrics
# accuracy = Accuracy(task="multiclass", num_classes=3)
# precision = Precision(task="multiclass", num_classes=3)
# recall = Recall(task="multiclass", num_classes=3)
# f1 = F1Score(task="multiclass", num_classes=3)

# # Calculate metrics for the LSTM model
# accuracy_1 = accuracy(y_pred_lstm, y_test)
# precision_1 = precision(y_pred_lstm, y_test)
# recall_1 = recall(y_pred_lstm, y_test)
# f1_1 = f1(y_pred_lstm, y_test)
# print("LSTM Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_1, precision_1, recall_1, f1_1))

# # Calculate metrics for the GRU model
# accuracy_2 = accuracy(y_pred_gru, y_test)
# precision_2 = precision(y_pred_gru, y_test)
# recall_2 = recall(y_pred_gru, y_test)
# f1_2 = f1(y_pred_gru, y_test)
# print("GRU Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_2, precision_2, recall_2, f1_2))

## Doc2Vec

Prepare training, test and validation data.

In [None]:
# Set seeds for reproducibility
SEED = 87
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

data = copy.deepcopy(df['final_cleaned'])

# We define how to split the data
split_train = int(len(data) * 0.8)
split_val = (len(data) - split_train) // 2
split_test = len(data) - split_train - split_val

train_data = data[:split_train]
data = data[split_train:]

val_data = data[:split_val]
data = data[split_val:]

test_data = data

# Sanity check
assert len(train_data + val_data + test_data) == len(sentences)
print(len(train_data), len(val_data), len(test_data))

We define a funciton to read the corpus line by line (each line of the corpus represents a document), tokenize text into individual words, remove puctuation and set to lowercase. To train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

In [None]:
def read_corpus(data, tokens_only=False):
    for i, line in enumerate(data):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))

In [None]:
# Look at training and test corpus
print(train_corpus[:2])
print(test_corpus[:2])

The testing is just a list of lists and should not contain any tags.

Now we intitiate a Doc2Vec model with a vector size with 50 dimensions and iterating over the training corpus 40 times. The minimum word count is set to 2 in order to discard words with very few occurences.

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

Build the vocabulary which is a list of all the unique words extracted from the training corpus.

In [None]:
model.build_vocab(train_corpus)

In [None]:
# Can view the vocabulary below
model.wv.index_to_key

In [None]:
# Can view additional attributes using the get_vecattr method
print(f"Word 'energy' appeared {model.wv.get_vecattr('energy', 'count')} times in the training corpus.")

Next, we train the model.

In [None]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Use the trained model to infer a vector for any piece of text to ensure the model is working.

In [None]:
vector = model.infer_vector([ 'instead','quickly','creating','neutral','planned','farms'])
print(vector)

Assess the model by obtaining vector representations for each document in the training set. The idea is to rank the documents based on their self-similarity. The expectation is that, if the model has overfit the training data, all the ranks will be less than 2. This is because a document is most similar to itself, and potentially one other document, indicating overfitting. If overfitting has occurred, the model should be able to find similar documents very easily among the training corpus. Additionally, the second ranks are tracked. This is useful for comparing less similar documents. If the model has overfit, the second-ranked documents should be less similar, reflecting a distinction between very similar and less similar documents.

In [None]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

Now we will count how each document ranks with respect to the training corpus.

In [None]:
import collections

counter = collections.Counter(ranks)
print(counter)

Basically, greater than 99% of the inferred documents are found to be most similar to itself and about 1% of the time it is mistakenly most similar to another document. Checking the inferred-vector against a training-vector is a sort of ‘sanity check’ as to whether the model is behaving in a usefully consistent manner, though not a real ‘accuracy’ value.

In [None]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

The most similar document usually has a similarity score close to 1.0, while the second-ranked document should have a significantly lower similarity score, assuming the documents are different. However, the documents in our corpus show more similarieties because the documents all discuss various forms of clean energy. This can also be seen by running the below cell mutliple times.

In [None]:
# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Using the same approach as above, we will infer the vector for a randomly chosen test document and compare the results.

In [None]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

# Model Evaluation

Now we will compare our in house train word2vec models with a pre-trained model.

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

Retrieve the vocabulary.

In [None]:
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

Obtain word vectors availablie in the model

In [None]:
try:
    vec_energy = wv['energy']
except KeyError:
    print("The word does not appear in this model")

In [None]:
# 5 most similar words to energy
print(model.wv.most_similar(positive=['energy'], topn=5))

In [None]:
pairs = [
    ('energy', 'electricity'),
    ('energy', 'solar'),
    ('energy', 'gas'),
    ('energy', 'clean'),
    ('energy', 'climate'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))