
# Stage 2: Advanced Embedding Models Training and Analysis
## Part 1: Model Training Steps

**Objective**: Develop and use advanced embedding models to represent the content of the Cleantech Media and Google Patent datasets, and compare the domain-specific embeddings to gain unique insights.

**Output**: Notebook with annotated model training steps

## Data Preparation for Embeddings
Lead: Alvaro Cervan

### Preprocessing Steps

The following preprocessing steps were already completed in the previous stage:
- Dropping duplicates
- Setting data types
- Dropping unnecessary columns
- Tokenizing text data
- Removing stopwords
- Language detection
- Translating non-English text to English
- Lemmatization

These steps were applied to both datasets, `media` and `patents`, and the results were saved in the `data` folder. We now load the processed data, inspect it, and split it into training and validation sets.

In [None]:
# module imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


In [None]:
from google.colab import drive
drive.mount('/content/drive')

processed_media_data_backup = pd.read_csv("/content/drive/MyDrive/CLT/data/processed_media_data_backup.csv")
processed_patent_data_backup = pd.read_csv("/content/drive/MyDrive/CLT/data/processed_patent_data_backup.csv")

# Alternative: load from a local `data` folder instead of Google Drive
# processed_media_data_backup = pd.read_csv("data/processed_media_data_backup.csv")
# processed_patent_data_backup = pd.read_csv("data/processed_patent_data_backup.csv")

print("Media Backup:")
processed_media_data_backup.head(5)

Media Backup:


Unnamed: 0.1,Unnamed: 0,title,date,content,domain,url,processed_text
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,['Chinese automotive startup XPeng has shown o...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-del...,chinese automotive startup XPeng show one dram...
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,['Sinopec has laid plans to build the largest ...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-gre...,Sinopec lay plan build large green hydrogen pr...
2,98159,World’ s largest floating PV plant goes online...,2022-01-03,['Huaneng Power International has switched on ...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-...,Huaneng Power International switch mw float pv...
3,98158,Iran wants to deploy 10 GW of renewables over ...,2022-01-03,"['According to the Iranian authorities, there ...",pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wa...,accord iranian authority currently renewable e...
4,31128,Eastern Interconnection Power Grid Said ‘ Bein...,2022-01-03,['Sign in to get the best natural gas news and...,naturalgasintel,https://www.naturalgasintel.com/eastern-interc...,sign get good natural gas news datum follow to...
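
The `Unnamed: 0` and `Unnamed: 0.1` columns in this preview look like index columns left over from earlier saves that wrote the index to the CSV. The following is a minimal sketch, on a toy frame with made-up values, of one way to drop them. Whether they should be dropped in this project is an assumption, and `drop_index_artifacts` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def drop_index_artifacts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns whose name starts with 'Unnamed:'."""
    keep = [c for c in df.columns if not str(c).startswith("Unnamed:")]
    return df[keep].copy()

# Toy frame mimicking the column layout above (values are made up)
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [93320, 93321],
    "title": ["a", "b"],
    "processed_text": ["solar panel", "green hydrogen"],
})

clean = drop_index_artifacts(toy)
print(list(clean.columns))
```

Saving with `to_csv(path, index=False)` avoids creating such columns in the first place.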


### Create training and validation sets for both media and patent texts

In [None]:
# Shuffle the rows, then drop incomplete rows and duplicate texts
def preprocess_data(data, seed=42):
    data = shuffle(data, random_state=seed).reset_index(drop=True)  # Reproducible shuffle
    # Note: dropna() removes rows with a NaN in ANY column, not only in 'processed_text'
    data = data.dropna().drop_duplicates(subset=['processed_text'])
    return data.reset_index(drop=True)  # Contiguous 0..n-1 index after dropping rows

# Preprocess media and patent data
media_data = preprocess_data(processed_media_data_backup.copy(), seed=42)
patent_data = preprocess_data(processed_patent_data_backup.copy(), seed=42)

# Split the data into training and validation sets with seed
media_train, media_val = train_test_split(media_data, test_size=0.2, random_state=42)
patent_train, patent_val = train_test_split(patent_data, test_size=0.2, random_state=42)

# Display sample data
print("Media Train:")
media_train.head(5)  # index values are non-sequential because train_test_split shuffles the rows again

Media Train:


Unnamed: 0.1,Unnamed: 0,title,date,content,domain,url,processed_text
12565,63589,Second Westbridge Alberta Project Wins Approval,2023-06-07,['The Alberta Utilities Commission ( AUC) rece...,solarindustrymag,https://solarindustrymag.com/second-westbridge...,the Alberta Utilities Commission AUC recently ...
1085,93711,Study: Bitcoin Could Achieve Zero Emissions by...,2022-09-07,['Despite all the promise of a decentralized c...,cleantechnica,https://cleantechnica.com/2022/09/07/study-bit...,despite promise decentralized currency free go...
19776,103900,Solar and PHES projects deemed ‘ critical’ in ...,2024-07-04,['The New South Wales ( NSW) government has de...,pv-tech,https://www.pv-tech.org/solar-and-pumped-hydro...,the New South Wales NSW government declare six...
9016,21606,10 Entrepreneurs Share CHF1.75 million to Tack...,2023-01-19,"[""By clicking `` Allow All '' you agree to the...",azocleantech,https://www.azocleantech.com/news.aspx?newsID=...,by click allow all agree storing cookie device...
10443,103449,Trinasolar rooftop project in Vietnam connecte...,2024-01-15,['Trinasolar has announced the grid connection...,pv-tech,https://www.pv-tech.org/industry-updates/trina...,trinasolar announce grid connection MW rooftop...
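
As a quick sanity check on the split logic, the sketch below repeats the same `train_test_split` call on a toy frame (made-up rows, not the real dataset). It illustrates the 80/20 sizes, that the two sets do not overlap, that the seed makes the split reproducible, and how the index could be reset if a contiguous index is wanted. The variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for media_data (made-up rows, not the real dataset)
toy = pd.DataFrame({"processed_text": [f"doc {i}" for i in range(10)]})

train, val = train_test_split(toy, test_size=0.2, random_state=42)

# An 80/20 split of 10 rows gives 8 training and 2 validation rows
print(len(train), len(val))

# No row appears in both sets
overlap = set(train.index) & set(val.index)
print(len(overlap))

# The same seed reproduces the same split
train_again, _ = train_test_split(toy, test_size=0.2, random_state=42)
print(train.index.equals(train_again.index))

# Optional: make the index contiguous again if positional order matters later
train_reset = train.reset_index(drop=True)
print(list(train_reset.index))
```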


## Word Embeddings Training

The table below compares **Word2Vec**, **FastText**, and **GloVe** to help us understand the differences between them and choose the best model for our use case.

| **Feature**           | **Word2Vec**                        | **FastText**                              | **GloVe**                               |
|------------------------|-------------------------------------|-------------------------------------------|-----------------------------------------|
| **Speed**             | Fast                                | Moderate (slower due to subword modeling) | Slow (requires building a co-occurrence matrix) |
| **Performance**       | Good (captures semantic relationships) | Best (handles OOV words and morphology)   | Good (captures global relationships)    |
| **Handles OOV Words** | No                                  | Yes (via subword embeddings)              | No                                      |
| **Captures Morphology**| No                                 | Yes                                       | No                                      |
| **Focus**             | Local Context (Skip-gram or CBOW)   | Local Context + Subwords                  | Global Co-occurrence                    |
| **GPU Compatibility** | Limited (Gensim trains on CPU with multithreading) | Yes (custom implementations)              | Limited (custom implementations)        |
| **Best Use Case**     | General-purpose, fast training      | Rare words, morphologically rich languages | Global word relationships and context   |
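
To make the "subword" rows concrete, here is a minimal sketch of the character n-gram idea behind FastText: each word is wrapped in `<` and `>` boundary markers and split into n-grams, so an unseen word still shares pieces with known words. The helper `char_ngrams` is hypothetical and only illustrates the idea; the actual library additionally hashes the n-grams into buckets and sums their vectors.

```python
from typing import List

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with a known one
known = set(char_ngrams("photovoltaic"))
unseen = set(char_ngrams("photovoltaics"))
shared = known & unseen
print(len(shared) > 0)
```

This overlap is why a subword model can produce a vector for an out-of-vocabulary word, while Word2Vec and GloVe have no entry for it.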

Given our limited computing resources and time, GloVe will not be used: it requires building a co-occurrence matrix first and has limited GPU support. We will focus on training **Word2Vec** for its speed, general-purpose use, and ability to capture semantic relationships.
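
To illustrate what "local context" means for Word2Vec in skip-gram mode, the sketch below generates the (center, context) pairs the model would learn from for a short token list. This is a simplified illustration with a hypothetical helper, `skipgram_pairs`; real implementations add details such as subsampling of frequent words and negative sampling, which are omitted here.

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """(center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "green hydrogen project".split()
print(skipgram_pairs(tokens, window=1))
# [('green', 'hydrogen'), ('hydrogen', 'green'), ('hydrogen', 'project'), ('project', 'hydrogen')]
```

A larger `window` yields more pairs per word and tends to capture broader topical similarity, at the cost of more training work.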

**TODO**: Also try **FastText**, for its ability to handle OOV words and morphology.