<a href="https://colab.research.google.com/github/lucarenz1997/NLP/blob/main/Stage_1_Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 1: Enhanced Data Cleaning and  Preprocessing
Objective: Analyzing both the Cleantech Media Dataset and the Cleantech Google Patent Dataset to
identify emerging trends, technologies, and potential innovation gaps in the cleantech sector

# Setup and Imports

In [2]:
!pip install langdetect
!pip install googletrans
!pip install nest_asyncio
!pip install demoji contractions unidecode num2words pyspellchecker

import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from google.colab import drive
import pandas as pd
from langdetect import detect, DetectorFactory
from googletrans import Translator
from collections import Counter
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
import nest_asyncio
import re
import string
from num2words import num2words
from spellchecker import SpellChecker
nest_asyncio.apply()
import contractions
import unidecode
nltk.download('stopwords')
nltk.download('wordnet')

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25l[?25hdone
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=b891cb0bb71a858c08fc4c835fd3d672b78732b39d5eb55b994af7be1f38b346
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Collecting googletrans
  Downloading googletrans-4.0.2-py3-none-any.whl.metadata (10 kB)
Downloading googletrans-4.0.2-py3-none-any.whl (18 kB)
Installing collected packages: googletrans
Successfully installed googletrans-4.0.2
Collecting demoji
  Downloadi

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Collection and Cleaning
- Download and load the Cleantech Media Dataset and the Cleantech Google Patent Dataset.

### Patent Data

In [None]:
patent_data = pd.read_json("/content/drive/MyDrive/CLT/patent/CleanTech_patent_22-24.json", lines=True) # smaller json file to start with
patent_data.head()

Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc
0,US-2022239235-A1,US-202217717397-A,US,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,20220728,[],"[{'code': 'H02M7/5395', 'inventive': True, 'fi..."
1,US-2022239251-A1,US-202217580956-A,US,System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",20220728,[],"[{'code': 'H02S40/38', 'inventive': True, 'fir..."
2,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
3,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
4,US-11396827-B2,US-202117606042-A,US,Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,20220726,[],"[{'code': 'F24S50/00', 'inventive': True, 'fir..."


#### Basic Checks

In [None]:
print("NAs in patent data \n",patent_data.isna().sum()) # no NAs to handle
print("datatypes in patent data \n", patent_data.dtypes) #

NAs in patent data 
 publication_number    0
application_number    0
country_code          0
title                 0
abstract              0
publication_date      0
inventor              0
cpc                   0
dtype: int64
datatypes in patent data 
 publication_number            object
application_number            object
country_code                category
title                         object
abstract                      object
publication_date      datetime64[ns]
inventor                      object
cpc                           object
dtype: object


In [None]:
patent_data['publication_date'] = pd.to_datetime(patent_data['publication_date'], format='%Y%m%d')
patent_data['country_code'] = patent_data['country_code'].astype('category')

In [None]:
# drop irrelevant features
patent_data.drop(columns=["country_code"]) # if we were to analyse regional trends for example. but not part of our goals.

Unnamed: 0,publication_number,application_number,title,abstract,publication_date,inventor,cpc
0,US-2022239235-A1,US-202217717397-A,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,2022-07-28,[],"[{'code': 'H02M7/5395', 'inventive': True, 'fi..."
1,US-2022239251-A1,US-202217580956-A,System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",2022-07-28,[],"[{'code': 'H02S40/38', 'inventive': True, 'fir..."
2,EP-4033090-A1,EP-21152924-A,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,2022-07-27,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
3,EP-4033090-A1,EP-21152924-A,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,2022-07-27,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
4,US-11396827-B2,US-202117606042-A,Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,2022-07-26,[],"[{'code': 'F24S50/00', 'inventive': True, 'fir..."
...,...,...,...,...,...,...,...
23561,CN-113883761-A,CN-202111329337-A,LNG cold energy and solar energy-based combine...,The invention discloses an LNG cold energy and...,2022-01-04,"[LIAN XIAOLONG, XIAO JUNFENG, XIA LIN, WANG YI...","[{'code': 'Y02E10/44', 'inventive': False, 'fi..."
23562,CN-113883736-A,CN-202111153046-A,一种移动式液冷源的散热器调控装置及调控方法,The invention discloses a radiator regulating ...,2022-01-04,"[WANG ZANSHE, GU ZHAOLIN, FENG SHIYU, GAO XIUF...","[{'code': 'F25B1/00', 'inventive': True, 'firs..."
23563,CN-113883736-A,CN-202111153046-A,Mobile liquid cold source radiator regulation ...,The invention discloses a radiator regulating ...,2022-01-04,"[WANG ZANSHE, GU ZHAOLIN, FENG SHIYU, GAO XIUF...","[{'code': 'F25B1/00', 'inventive': True, 'firs..."
23564,TW-202201895-A,TW-109121032-A,Fish farming system capable of generating and ...,The present invention relates to a fish farmin...,2022-01-01,"[WANG, YI-FENG]","[{'code': 'Y02B10/10', 'inventive': False, 'fi..."


- Perform an initial data cleaning to remove e.g. duplicates and irrelevant information from both
datasets

#### Duplications
because above duplications show the issue that most of them are due to language difference, it has been decided to use the following approach:
1. Group records by publication_number and application_number
2. For each group, detect language of title and abstract and do either
- keep record that has both title and abstract in English already
- Merge two records so that the resulting record has the English title from one and the English abstract from the other,
- if neither record is fully in English, translate first record.

In [None]:
# Get duplications
duplicate_rows = patent_data[patent_data.duplicated(subset=['publication_number', 'application_number'], keep=False)]

# Display the duplicate rows to see a pattern
duplicate_rows.head(50)


Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc
2,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
3,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
5,CN-217015449-U,CN-202220853486-U,CN,Automatic desilting channel of diversion type ...,本实用新型公开了一种引水式水电站的自动沉沙渠，包括渠体、柔性扰流杆、混凝土斜块和抽沙机构，渠...,20220722,[],[]
6,CN-217015449-U,CN-202220853486-U,CN,Automatic desilting channel of diversion type ...,The utility model discloses an automatic sand ...,20220722,[],[]
7,CN-217015449-U,CN-202220853486-U,CN,一种引水式水电站的自动沉沙渠,The utility model discloses an automatic sand ...,20220722,[],[]
8,CN-114778667-A,CN-202210212793-A,CN,一种水电站螺栓监测装置、方法及电子设备,The invention discloses a hydropower station b...,20220722,[],[]
9,CN-114778667-A,CN-202210212793-A,CN,Hydropower station bolt monitoring device and ...,The invention discloses a hydropower station b...,20220722,[],[]
10,CN-217026943-U,CN-202220132423-U,CN,Water conservancy water and electricity gate h...,The utility model belongs to the technical fie...,20220722,[],[]
11,CN-217026943-U,CN-202220132423-U,CN,一种水利水电闸门提升装置,The utility model belongs to the technical fie...,20220722,[],[]
12,CN-217031317-U,CN-202123252107-U,CN,Comprehensive energy heating system for coupli...,本实用新型公开一种浅层地热和火电厂耦合的综合能源供热系统，所述浅层地热和火电厂耦合的综合能源...,20220722,[],[]


In [None]:
import pandas as pd
from langdetect import detect, DetectorFactory
import asyncio
from googletrans import Translator

DetectorFactory.seed = 0
translator = Translator()

def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        return False

def translate_text_sync(text):
    try:
        # Use asyncio.run to get the translation result synchronously as we are working in a jupyter notebook
        translation = asyncio.run(translator.translate(text, dest='en'))
        return translation.text
    except Exception as e:
        print("Translation error:", e)
        return text

def merge_duplicate_group(group):
    # Compute an English score for each record (1 point for an English title, 1 for an English abstract)
    group = group.copy()
    group['eng_score'] = group.apply(lambda row: int(is_english(row['title'])) + int(is_english(row['abstract'])), axis=1)

    # 1. If any record has both fields in English (score == 2), choose that record
    complete_eng = group[group['eng_score'] == 2]
    if not complete_eng.empty:
        return complete_eng.iloc[0]

    # 2. If no single record has both fields, check if there's a record with an English title
    #    and another with an English abstract, then merge them.
    eng_title = group[group['title'].apply(is_english)]
    eng_abstract = group[group['abstract'].apply(is_english)]
    if not eng_title.empty and not eng_abstract.empty:
        record = eng_title.iloc[0].copy()
        record['abstract'] = eng_abstract.iloc[0]['abstract']
        return record

    # 3. Otherwise, select the record with the highest English score.
    # This ensures that if, for example, the first record is not in English but a later one is,
    # you pick the one with a higher score.
    best_idx = group['eng_score'].idxmax()
    record = group.loc[best_idx].copy()

    # Translate any fields that are not in English
    if not is_english(record['title']):
        record['title'] = translate_text_sync(record['title'])
    if not is_english(record['abstract']):
        record['abstract'] = translate_text_sync(record['abstract'])

    return record

patent_data_no_duplicates = patent_data.groupby(['publication_number', 'application_number']).apply(merge_duplicate_group)
patent_data_no_duplicates = patent_data_no_duplicates.reset_index(drop=True)
patent_data_no_duplicates.head(50)


  patent_data_no_duplicates = patent_data.groupby(['publication_number', 'application_number']).apply(merge_duplicate_group)


Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc,eng_score
0,AU-2016208290-B2,AU-2016208290-A,AU,Closed loop control system for heliostats,CLOSED LOOP CONTROL SYSTEM FOR HELIOSTATS \nAb...,20220317,"[BURTON, ALEXANDER]","[{'code': 'F24S2050/25', 'inventive': False, '...",2
1,AU-2016321918-B2,AU-2016321918-A,AU,Device for capturing solar energy,The invention relates to a device for capturin...,20220407,"[LÓPEZ ONA, Sergio, ROS RUÍZ, Antonio José]","[{'code': 'H02S20/32', 'inventive': True, 'fir...",2
2,AU-2017246326-B2,AU-2017246326-A,AU,Combined window shade and solar panel,A window shade system includes a mounting brac...,20220526,"[GEIGER, JAMES]","[{'code': 'E06B2009/6827', 'inventive': False,...",2
3,AU-2017267740-B2,AU-2017267740-A,AU,System and methods for improving the accuracy ...,A computer system and method for improving the...,20220324,"[FORBES, KEVIN F., ZAMPELLI, ERNEST M.]","[{'code': 'G06F17/10', 'inventive': True, 'fir...",2
4,AU-2017276466-B2,AU-2017276466-A,AU,Ocean carbon capture and storage method and de...,Provided is an ocean carbon capture and storag...,20220602,"[PENG, SIGAN]","[{'code': 'Y02C20/40', 'inventive': False, 'fi...",2
5,AU-2019479010-A1,AU-2019479010-A,AU,Vertical-axis wind turbine,The invention relates to wind energy engineeri...,20220721,"[LEOSHKO, Anatolij Viktorovich]","[{'code': 'F03D3/06', 'inventive': True, 'firs...",2
6,AU-2020207782-B2,AU-2020207782-A,AU,Cladding sheet,"Cladding sheets (9), such as roof or wall clad...",20220331,"[KLEES, Robert, RYAN, BRAD, CLAYTON, Trevor, A...","[{'code': 'E04D3/30', 'inventive': True, 'firs...",2
7,AU-2020227052-B2,AU-2020227052-A,AU,Conversion of Solar Energy,A solar energy plant/system includes different...,20220602,"[LASICH, JOHN BEAVIS]","[{'code': 'Y02E10/52', 'inventive': False, 'fi...",2
8,AU-2020281145-B2,AU-2020281145-A,AU,Infrared transmissive concentrated photovoltai...,The use of photovoltaic (PV) cells to convert ...,20220707,"[ESCARRA, MATTHEW DAVID, LEWSON, Benjamin, JI,...","[{'code': 'Y02E10/60', 'inventive': False, 'fi...",2
9,AU-2020290034-A1,AU-2020290034-A,AU,Highly efficient low-cost static planar solar ...,The invention relates to a solar energy concen...,20220120,"[PARE, SYLVAIN]","[{'code': 'Y02E10/52', 'inventive': False, 'fi...",2


In [None]:
# if you want to run from this point to save yourself 10 minutes on initial run

# patent_data_no_duplicates.to_csv("/content/drive/MyDrive/CLT/patent/CleanTech_patent_22-24_no_duplicates.csv")

# patent_data_no_duplicates = pd.read_csv("/content/drive/MyDrive/CLT/patent/CleanTech_patent_22-24_no_duplicates.csv")

#### Normalisation
Due to having lists in columns 'inventor' and 'cpc', we had to decide if we want to focus solely on textual contents or also inventor and cpc fields.
Inventor-normalization would help with collaboration networks and counting patents per inventor or link inventors to specific textual features.
CPC normalisation (exploding) would be helpful if we wanted to perform frequency analysis on CPC codes.

Below, the two dataframes can be found which were normalized. They can be used in a later step in case richer analysis which incorporates metadata could be required.

Examples:

**Pre-NLP Processing**:
When NLP tasks are focused solely on textual content (such as topic modeling on titles and abstracts), it is practical to perform text cleaning, feature extraction, and modeling on the main DataFrame initially, without integrating the normalized tables. Once the textual features are obtained, the inventor and CPC metadata can be merged to provide additional context for further analysis or visualization.

**Feature Engineering**:
If metadata—such as inventor networks or CPC code frequencies—is intended to be used as features for downstream tasks (like classification or clustering), merging the normalized metadata with the main DataFrame at an early stage in feature engineering can be beneficial.

**Post-NLP Enrichment**:
After conducting the initial NLP analysis, if there are patterns that warrant further investigation—such as correlations with inventor collaboration or specific technological domains indicated by CPC codes—merging the normalized metadata with the text-based results can offer more comprehensive insights.



In [None]:
inventors_df = patent_data[['publication_number', 'application_number', 'inventor']].explode('inventor')
cpc_df = patent_data[['publication_number', 'application_number', 'cpc']].explode('cpc')

### Media Data

In [4]:
media_data = pd.read_csv("/content/drive/MyDrive/CLT/media/media_data.csv")
media_data.head()

Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,,['Chinese automotive startup XPeng has shown o...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-del...
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,,['Sinopec has laid plans to build the largest ...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-gre...
2,98159,World’ s largest floating PV plant goes online...,2022-01-03,,['Huaneng Power International has switched on ...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-...
3,98158,Iran wants to deploy 10 GW of renewables over ...,2022-01-03,,"['According to the Iranian authorities, there ...",pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wa...
4,31128,Eastern Interconnection Power Grid Said ‘ Bein...,2022-01-03,,['Sign in to get the best natural gas news and...,naturalgasintel,https://www.naturalgasintel.com/eastern-interc...


#### Basic Checks

In [9]:
print("NAs in media data \n",media_data.isna().sum(), "\n") # no NAs to handle
print("Authors:", media_data['author'].notna().sum(), "\n") # number of authors.
print("datatypes in media data \n", media_data.dtypes) #

NAs in media data 
 Unnamed: 0        0
title             0
date              0
author        20111
content           0
domain            0
url               0
dtype: int64 

Authors: 0 

datatypes in media data 
 Unnamed: 0      int64
title          object
date           object
author        float64
content        object
domain         object
url            object
dtype: object


In [None]:
# Drop irrelevant features
media_data = media_data.drop(columns=["Unnamed: 0", "author"]) # unnamed is assumed to be the index and author does not have any values. So they are irrelevant to us

In [11]:
# Assign Datatypes
media_data['date'] = pd.to_datetime(media_data['date'], format='%Y-%m-%d')
media_data['domain'] = media_data['domain'].astype('category')

#### Duplications
TODO
we don't really have duplicates based on id. however, some information is almos tthe same (example: cleantechnica). Figure out how to do this.

In [None]:
# Get duplications
duplicate_rows = media_data[media_data.duplicated(subset=['publication_number', 'application_number'], keep=False)]

# Display the duplicate rows to see a pattern
duplicate_rows.head(50)

## Text preprocessing
- Tokenize the text data from both datasets.
- To refine the data, apply techniques such as stemming and lemmatization, remove stop words
and non-informative terms, and convert text to lowercase for consistency.

In [None]:
STOPWORDS = set(stopwords.words('english'))

nltk.download('punkt')
nltk.download('punkt_tab')

def freq_words(text, counter, n=10):
    tokens = word_tokenize(text)
    FrequentWords = []

    for word in tokens:
        counter[word] += 1

    for (word, word_count) in counter.most_common(n):
        FrequentWords.append(word)
    return FrequentWords


def rare_words(text, counter, n=10):
    # tokenization
    tokens = word_tokenize(text)
    for word in tokens:
        counter[word]= +1

    RareWords = []
    number_rare_words = 10
    # take top 10 frequent words
    frequentWords = counter.most_common()
    for (word, word_count) in frequentWords[:-number_rare_words:-1]:
        RareWords.append(word)

    return RareWords

def remove_words(text, Words):
    tokens = word_tokenize(text)
    without_words = []
    for word in tokens:
        if word not in Words:
            without_words.append(word)

    without_words = ' '.join(without_words)
    return without_words

def nums_to_words(text):
    new_text = []
    for word in text.split():
        if word.isdigit():
            new_text.append(num2words(word))
        else:
            new_text.append(word)
    return " ".join(new_text)

def expand_contractions(text):
  return contractions.fix(text)

def clean_url(text):
  # Remove protocol (http:// or https://) and www.
  return re.sub(r'https?://|www\.', '', text)

stemmer = PorterStemmer() # since we're dealing now only with english text, we can use PorterStemmer

def stem_words(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

spell = SpellChecker()

def correct_spelling(text):
    correct_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            correction = spell.correction(word)
            correct_text.append(correction if correction is not None else word)
        else:
            correct_text.append(word)
    return " ".join(correct_text)

def accented_to_ascii(text):
    return unidecode.unidecode(text)

# Remove Hashtags
pattern = re.compile(r'(@\S+|#\S+)')

PUNCTUATIONS = string.punctuation
lemmatizer = WordNetLemmatizer()

def clean_text(text):

  if text is None:
    return ""
  else:
    text = str(text)

    # To Lower-Case
    text = text.lower();

    # Replace newline characters with a space
    text = text.replace('\n', ' ')

    # Remove email
    text = re.sub(r'\S+@\S+', '', text)

    # Remove pagination text like "page 1 of 2"
    text = re.sub(r'page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE)

    # Remove hashtags
    text= pattern.sub('', text)

    text = text.translate(str.maketrans('', '', PUNCTUATIONS))

    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    # Remove frequent words
    text = remove_words(text, freq_words(text, Counter()))

    # Remove rare words
    text = remove_words(text, rare_words(text, Counter()))

    # Normalize multiple spaces to a single space
    text = re.sub(r'\s+', ' ', text)

    # Numbers to Words
    text = nums_to_words(text)

    text = expand_contractions(text)

    # # Stemming
    # text = stem_words(text)# rather take lemmatization due to accuracy & clarity is preferred.

    #Lemmatization
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

    # Spell Checking
    text = correct_spelling(text)

    # Accented to ASCII
    text = accented_to_ascii(text)

    # Strip leading and trailing whitespace
    return " ".join(text.split())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
#Text preprocessing

#Patent Data
patent_data['title'] = patent_data['title'].apply(clean_text)
patent_data['abstract'] = patent_data['abstract'].apply(clean_text)

# Media Data
media_data['title'] = media_data['title'].apply(clean_text)
media_data['content'] = media_data['content'].apply(clean_text)

media_data['url'] = media_data['url'].apply(clean_url)

## Exploratory Data Analysis
- Perform separate and comparative exploratory data analysis (EDA) on both datasets, such as
temporal analysis and sentiment analysis, to understand the landscape of cleantech innovations
and patents.
- Use pre-trained Named Entity Recognition (NER) models (like spaCy, Hugging Face
Transformers) to extract companies and technologies, then build a co-occurrence matrix
representing the frequency with which a company is mentioned together with a particular
technology in the text. Construct a graph where nodes represent entities, and edges reflect
relationships, and analyze it using centrality and clustering for insights.
- Use visualization techniques such as word clouds, bar charts, scatter plots and NetworkX (with
Matplotlib) to illustrate the results.


## Topic Modeling
- Test topic modeling techniques such as LDA and NMF (https://github.com/AnushaMeka/NLPTopic-Modeling-LDA-NMF), Top2Vec (https://github.com/ddangelov/Top2Vec), and BERTopic
(https://github.com/MaartenGr/BERTopic), evaluate the quality of the topics using such as
coherence metrics, and refine the topic model based on evaluation results and domain
expertise.
- Apply hierarchical topic modeling to explore more granular subtopics
(https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.htm
l) within major cleantech technologies (e.g., solar energy subtopics).
- Visualize and interpret the topics, comparing emerging trends in media publications against focuses of recent patents


# LAST CHECKS
Outputs:
- Notebook with data cleaning and preprocessing steps.
- Notebook with EDA visualizations on e.g. hidden topics and the detailed comparison between
the two datasets.