<a href="https://colab.research.google.com/github/lucarenz1997/NLP/blob/main/Stage_1_Part_1_Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 1: Enhanced Data Cleaning and Preprocessing
Objective: Analyze the Cleantech Media Dataset and the Cleantech Google Patent Dataset to
identify emerging trends, technologies, and potential innovation gaps in the cleantech sector.

# Install Required Libraries

Install the Python libraries needed for topic modeling, visualization, and NLP preprocessing.

In [None]:
# 1) Install all required libraries at the top of the notebook
!pip install googletrans langdetect nest_asyncio demoji contractions unidecode num2words \
             pyspellchecker spacy matplotlib wordcloud networkx pyLDAvis top2vec bertopic \
             gensim

# 2) Install CuPy and cuGraph builds for CUDA 12.x (the cupy-cuda12x / cugraph-cu12 wheels below)
!pip install cupy-cuda12x --upgrade
!pip install cugraph-cu12 --extra-index-url https://pypi.nvidia.com --upgrade

Collecting googletrans
  Downloading googletrans-4.0.2-py3-none-any.whl.metadata (10 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting demoji
  Downloading demoji-1.1.0-py3-none-any.whl.metadata (9.2 kB)
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Collecting num2words
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting pyspellchecker
  Downloading pyspellchecker-0.8.2-py3-none-any.whl.metadata (9.4 kB)
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting top2ve

In [None]:
!nvidia-smi

Sat Mar 15 12:01:49 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   50C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

NOTE: If the imports below fail (in particular `import cugraph`), restart the session and rerun the cells.

In [None]:
# Import necessary libraries
import os
import re
import numpy as np
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import nltk
import spacy
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.preprocessing import normalize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import torch

from top2vec import Top2Vec
from collections import Counter
from wordcloud import WordCloud
from langdetect import detect, DetectorFactory
from googletrans import Translator
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm.auto import tqdm
import cugraph
import cudf

Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)

In [None]:
# Download NLTK resources: stopwords, WordNet, the punkt tokenizer, and the VADER lexicon
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('vader_lexicon')

# Download and load spaCy model
!python -m spacy download en_core_web_sm
spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")

# Set a deterministic seed for language detection
DetectorFactory.seed = 0

# Initialize a Google Translator instance
translator = Translator()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Load datasets
media_data = pd.read_csv("/content/drive/MyDrive/CLT/data/media_data/cleantech_media_dataset_v3_2024-10-28.csv")
patent_data = pd.read_json("/content/drive/MyDrive/CLT/data/patent_data/CleanTech_22-24.json", lines=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
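
The `lines=True` argument used for the patent file tells pandas to treat the input as JSON Lines, that is, one JSON object per line rather than a single JSON document. A minimal sketch on two made-up records (the field names mirror the patent data, the values are hypothetical):

```python
import io
import pandas as pd

# Two hypothetical patent records in JSON Lines form: one JSON object per line.
jsonl = io.StringIO(
    '{"publication_number": "US-1-A1", "publication_date": 20220728, "inventor": []}\n'
    '{"publication_number": "EP-2-A1", "publication_date": 20220727, "inventor": ["Doe, Jane"]}\n'
)

# lines=True parses each line as a separate record.
demo = pd.read_json(jsonl, lines=True)

print(demo.shape)
# Nested JSON arrays arrive as Python lists inside an object column.
print(demo["inventor"].map(type).unique())
```

List-valued columns such as `inventor` stay as Python lists after loading, which matters for the duplicate removal later in the notebook.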


In [None]:
patent_data.head()

Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc
0,US-2022239235-A1,US-202217717397-A,US,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,20220728,[],"[{'code': 'H02M7/5395', 'inventive': True, 'fi..."
1,US-2022239251-A1,US-202217580956-A,US,System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",20220728,[],"[{'code': 'H02S40/38', 'inventive': True, 'fir..."
2,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
3,EP-4033090-A1,EP-21152924-A,EP,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
4,US-11396827-B2,US-202117606042-A,US,Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,20220726,[],"[{'code': 'F24S50/00', 'inventive': True, 'fir..."


In [None]:
media_data.head()

Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,,['Chinese automotive startup XPeng has shown o...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-del...
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,,['Sinopec has laid plans to build the largest ...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-gre...
2,98159,World’ s largest floating PV plant goes online...,2022-01-03,,['Huaneng Power International has switched on ...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-...
3,98158,Iran wants to deploy 10 GW of renewables over ...,2022-01-03,,"['According to the Iranian authorities, there ...",pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wa...
4,31128,Eastern Interconnection Power Grid Said ‘ Bein...,2022-01-03,,['Sign in to get the best natural gas news and...,naturalgasintel,https://www.naturalgasintel.com/eastern-interc...


# Data Collection and Cleaning
Lead: Luca Renz & Rafaella Miranda-Sousa Wasser
- Dropping duplicates
- Setting datatypes
- Dropping unnecessary columns

In [None]:
# Check for missing values
print("\nMissing values in patent data:")
print(patent_data.isna().sum())

print("\nMissing values in media data:")
print(media_data.isna().sum())


# Convert all columns to strings so that list-valued cells (e.g. inventor, cpc) become hashable;
# drop_duplicates() raises a TypeError on unhashable list objects
patent_data = patent_data.astype(str)

# Remove duplicate entries
media_data = media_data.drop_duplicates()
patent_data = patent_data.drop_duplicates()

# Convert the publication date (stored as YYYYMMDD) to datetime format.
# Plain column assignment replaces the column; assigning through .loc[:, col]
# can leave it as object dtype.
patent_data['publication_date'] = pd.to_datetime(patent_data['publication_date'], format='%Y%m%d', errors='coerce')

# Remove unnecessary columns
patent_data = patent_data.drop(columns=["country_code", "cpc"], errors="ignore")
media_data = media_data.drop(columns=["author"], errors="ignore")  # every author value is NaN



Missing values in patent data:
publication_number    0
application_number    0
country_code          0
title                 0
abstract              0
publication_date      0
inventor              0
cpc                   0
dtype: int64

Missing values in media data:
Unnamed: 0        0
title             0
date              0
author        20111
content           0
domain            0
url               0
dtype: int64
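
The duplicate removal above depends on list-valued cells being converted first. A minimal sketch on a hypothetical three-row frame (the column names mirror the patent data, the values are made up); as a variation, it converts only the list-valued columns instead of the whole frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "publication_number": ["EP-1", "EP-1", "US-2"],
    "inventor": [["A", "B"], ["A", "B"], []],
    "publication_date": [20220727, 20220727, 20220726],
})

# drop_duplicates() hashes cell values, so list-valued cells are rejected.
try:
    toy.drop_duplicates()
except TypeError as exc:
    print("drop_duplicates failed:", exc)

# Find the columns that contain lists and turn only those into strings.
list_cols = [c for c in toy.columns if toy[c].map(lambda v: isinstance(v, list)).any()]
toy[list_cols] = toy[list_cols].astype(str)

deduped = toy.drop_duplicates().copy()

# YYYYMMDD integers need an explicit format to be read as dates.
deduped["publication_date"] = pd.to_datetime(
    deduped["publication_date"].astype(str), format="%Y%m%d", errors="coerce"
)
print(deduped.dtypes)
```

Converting only the affected columns keeps the remaining dtypes intact, at the cost of one extra pass to find them.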


# Language Detection, Translation & Text Preprocessing
Lead: Rafaella Miranda-Sousa Wasser & Luca Renz
- Translation to English
- Lemmatization / Tokenization
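
Repeated strings (for example a title that occurs in several rows) only need to be translated once. A minimal sketch of that idea using `functools.lru_cache`; `fake_translate` is a hypothetical stand-in for a remote translation call and does no network access:

```python
from functools import lru_cache

call_log = []  # records every text that reaches the stand-in "API"

def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a remote translation call."""
    call_log.append(text)
    return text.upper()

@lru_cache(maxsize=None)
def translate_cached(text: str) -> str:
    # Only cache misses reach fake_translate.
    return fake_translate(text)

titles = ["Windenergieanlage", "Solarzelle", "Windenergieanlage", "Windenergieanlage"]
translated = [translate_cached(t) for t in titles]

print(translated)
print(len(call_log))                   # 2 calls for 4 inputs
print(translate_cached.cache_info())   # hits=2, misses=2
```

A plain dictionary works equally well; `lru_cache` mainly saves the bookkeeping.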

In [None]:
tqdm.pandas()  # Enables progress tracking in Pandas

# Function to detect if a text is in English
def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        return False

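# Cache translations to avoid redundant API calls
translation_cache = {}

# googletrans >= 4.0.1 exposes translate() as a coroutine; nest_asyncio allows it to be
# run to completion from inside the notebook's already-running event loop
import asyncio
import nest_asyncio
nest_asyncio.apply()

def translate_text(text):
    if text in translation_cache:  # Avoid redundant translations
        return translation_cache[text]

    try:
        result = translator.translate(text, dest='en')
        if asyncio.iscoroutine(result):  # async API: wait for the actual result
            result = asyncio.get_event_loop().run_until_complete(result)
        translation_cache[text] = result.text
        return result.text
    except Exception:
        return text  # Return original text if translation fails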

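# Build the stopword set once; calling stopwords.words('english') for every token
# rebuilds the list each time and slows the loop considerably
STOP_WORDS = set(stopwords.words('english'))

def preprocess_texts(texts):
    processed_texts = []
    # Tokenization, Lemmatization
    for doc in tqdm(nlp.pipe(texts, disable=["parser", "ner"]), total=len(texts), desc="Processing Text"):
        # Keep lemmas of alphabetic tokens; compare the lowercased text so that
        # capitalized stopwords ("The") are removed as well
        words = [token.lemma_ for token in doc if token.is_alpha and token.lower_ not in STOP_WORDS]
        processed_texts.append(" ".join(words))

    return processed_texts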

# Flag non-English rows so that only those are sent to the translator
non_english_mask_media_title = ~media_data['title'].apply(is_english)
non_english_mask_media_content = ~media_data['content'].apply(is_english)
non_english_mask_patent_title = ~patent_data['title'].apply(is_english)
non_english_mask_patent_abstract = ~patent_data['abstract'].apply(is_english)

media_data.loc[non_english_mask_media_title, 'title'] = media_data.loc[non_english_mask_media_title, 'title'].progress_apply(translate_text)
media_data.loc[non_english_mask_media_content, 'content'] = media_data.loc[non_english_mask_media_content, 'content'].progress_apply(translate_text)
patent_data.loc[non_english_mask_patent_title, 'title'] = patent_data.loc[non_english_mask_patent_title, 'title'].progress_apply(translate_text)
patent_data.loc[non_english_mask_patent_abstract, 'abstract'] = patent_data.loc[non_english_mask_patent_abstract, 'abstract'].progress_apply(translate_text)

# Apply optimized text preprocessing
media_data['processed_text'] = preprocess_texts(media_data['content'].tolist())
patent_data['processed_text'] = preprocess_texts(patent_data['abstract'].tolist())


  return text  # Return original text if translation fails
100%|██████████| 1147/1147 [00:00<00:00, 104016.49it/s]
  return text  # Return original text if translation fails
100%|██████████| 5/5 [00:00<00:00, 6034.97it/s]
  return text  # Return original text if translation fails
100%|██████████| 11124/11124 [00:00<00:00, 314115.35it/s]
  return text  # Return original text if translation fails
100%|██████████| 2420/2420 [00:00<00:00, 249311.42it/s]
Processing Text: 100%|██████████| 20111/20111 [44:31<00:00,  7.53it/s]
Processing Text: 100%|██████████| 22815/22815 [10:29<00:00, 36.24it/s]
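
Two details of stopword filtering affect both speed and results: how the stopword collection is stored, and whether the comparison is case-sensitive. A minimal sketch with a hypothetical five-word stopword list (the notebook itself uses NLTK's English list); `filter_list_scan` and `filter_set_lookup` are illustrative names:

```python
# Hypothetical miniature stopword list, for illustration only.
STOPWORD_LIST = ["the", "a", "of", "and", "to"]

def filter_list_scan(tokens):
    # Copies the list for every token and scans it linearly; case-sensitive.
    return [t for t in tokens if t not in list(STOPWORD_LIST)]

# Built once; membership tests on a frozenset take constant time on average.
STOPWORD_SET = frozenset(STOPWORD_LIST)

def filter_set_lookup(tokens):
    # Lowercasing before the lookup also catches capitalized stopwords.
    return [t for t in tokens if t.lower() not in STOPWORD_SET]

tokens = "The cost of solar modules continues to fall".split()

print(filter_list_scan(tokens))   # keeps "The" because the comparison is case-sensitive
print(filter_set_lookup(tokens))
```

With a stopword list of realistic size and millions of tokens, the per-token list rebuild dominates the run time, while the set lookup stays cheap.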


To avoid rerunning the slow preprocessing in every session, the processed datasets are saved to Google Drive below.

In [None]:
# Paths for backups
media_data_filepath = "/content/drive/MyDrive/CLT/data/processed_media_data_backup.csv"
patent_data_filepath = "/content/drive/MyDrive/CLT/data/processed_patent_data_backup.csv"

# Save processed files
media_data.to_csv(media_data_filepath, index=False)
patent_data.to_csv(patent_data_filepath, index=False)

print(f"\n File for media data saved under: {media_data_filepath}")
print(f" File for patent data saved under: {patent_data_filepath}")


 File for media data saved under: /content/drive/MyDrive/CLT/data/processed_media_data_backup.csv
 File for patent data saved under: /content/drive/MyDrive/CLT/data/processed_patent_data_backup.csv
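
In a later session the backups can be read back instead of recomputed. A minimal sketch, assuming a hypothetical helper `load_or_build` and a temporary directory in place of the Drive paths; the demo frame is made up:

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build_fn, **read_kwargs):
    """Return the CSV at `path` if it exists; otherwise build, save, and return the frame."""
    if os.path.exists(path):
        return pd.read_csv(path, **read_kwargs)
    df = build_fn()
    df.to_csv(path, index=False)
    return df

def build_demo():
    # Stand-in for the expensive translation and preprocessing steps.
    return pd.DataFrame({
        "publication_date": pd.to_datetime(["2022-07-28", "2022-07-27"]),
        "processed_text": ["adaptable inverter system", "control wind energy system"],
    })

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_patent_data_backup.csv")
    first = load_or_build(path, build_demo)  # builds and writes the file
    second = load_or_build(path, build_demo, parse_dates=["publication_date"])  # reads it back

print(second.dtypes)
```

CSV files store no dtypes, so date columns come back as strings unless `parse_dates` is passed on reload.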
