<a href="https://colab.research.google.com/github/lucarenz1997/NLP/blob/main/Stage_3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 3: Implementing an RAG System for Question Answering

Part 1: Model Training Steps
Objective: Developing and utilizing advanced embedding models to represent the content of Cleantech Media and Google Patent datasets and compare domain-specific embeddings to gain unique insights.

Output: Notebook with annotated model training steps

Data Preparation for Embeddings
Lead: Alvaro Cervan

Preprocessing Steps
The preprocessing steps have already been completed in the previous stage, which include:

Dropping duplicates
Setting data types
Dropping unnecessary columns
Tokenizing text data
Stopword Removal
Language detection
Translating non-English text to English
Lemmatization
These steps were applied to both datasets, media and patents, and the resulting data was saved in the data folder. We will now load the data and perform the following steps:

## SETUP & DATA LOADING

Installationen (einmalig):

In [3]:
!pip install transformers accelerate bitsandbytes




In [12]:
# Import imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from tqdm import tqdm
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### LOAD & PREPARE DATA

In [5]:
prepdata_media= pd.read_csv("/content/drive/MyDrive/CLT/data/processed_media_data_backup.csv")
prepdata_patent = pd.read_csv("/content/drive/MyDrive/CLT/data/processed_patent_data_backup.csv")
print("Media Backup:")
prepdata_media.head(5)

Media Backup:


Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url,processed_text
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,Unknown,['Chinese automotive startup XPeng has shown o...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-del...,chinese automotive startup XPeng show one dram...
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,Unknown,['Sinopec has laid plans to build the largest ...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-gre...,Sinopec lay plan build large green hydrogen pr...
2,98159,World’ s largest floating PV plant goes online...,2022-01-03,Unknown,['Huaneng Power International has switched on ...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-...,Huaneng Power International switch MW float pv...
3,98158,Iran wants to deploy 10 GW of renewables over ...,2022-01-03,Unknown,"['According to the Iranian authorities, there ...",pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wa...,accord iranian authority currently renewable e...
4,31128,Eastern Interconnection Power Grid Said ‘ Bein...,2022-01-03,Unknown,['Sign in to get the best natural gas news and...,naturalgasintel,https://www.naturalgasintel.com/eastern-interc...,sign get good natural gas news datum follow to...


In [6]:
prepdata_patent.head(5)

Unnamed: 0,publication_number,application_number,title,abstract,publication_date,inventor,processed_text
0,US-2022239235-A1,US-202217717397-A,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,2022-07-28 00:00:00,[],disclose adaptable DC AC inverter system opera...
1,US-2022239251-A1,US-202217580956-A,System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",2022-07-28 00:00:00,[],in accordance example embodiment solar energy ...
2,EP-4033090-A1,EP-21152924-A,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,2022-07-27 00:00:00,"['Schaper, Ulf', 'von Aswege, Enno', 'Gerke Fu...",Verfahren zum steuern einer Windenergieanlage ...
3,EP-4033090-A1,EP-21152924-A,Method for controlling a wind energy system,Verfahren zum Steuern einer Windenergieanlage ...,2022-07-27 00:00:00,"['Schaper, Ulf', 'von Aswege, Enno', 'Gerke Fu...",Verfahren zum steuern einer Windenergieanlage ...
4,US-11396827-B2,US-202117606042-A,Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,2022-07-26 00:00:00,[],a control method optimize solar power efficien...


## Select 50–100 Relevant Paragraphs

In [14]:
# Function to select long, unique paragraphs
def get_paragraphs(df, min_words=40, max_paragraphs=50):
    paragraphs = df["processed_text"].dropna().unique()
    filtered = [p for p in paragraphs if len(p.split()) >= min_words]
    return filtered[:max_paragraphs]

# Extract paragraphs
media_paragraphs = get_paragraphs(prepdata_media, max_paragraphs=50)
patent_paragraphs = get_paragraphs(prepdata_patent, max_paragraphs=50)

# Combine and export to CSV for manual processing
selected_paragraphs = media_paragraphs + patent_paragraphs

pd.DataFrame({'paragraph': selected_paragraphs}).to_csv(
    "/content/drive/MyDrive/CLT/data/selected_paragraphs.csv", index=False
)

Export to a text or CSV file for easy copy-pasting:

### Manually Generate QA Pairs in ChatGPT-4

#experiment ohne api

In [13]:
#Load Model (TinyLLaMA)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

Device set to use cuda:0


### Step 2: Load Data & Extract Paragraphs

testen

In [14]:
# Combine and sample 5 paragraphs
def extract_sample(df, column='processed_text', n=5):
    paragraphs = df[column].dropna().unique()
    paragraphs = [p for p in paragraphs if len(p.split()) > 30]
    return random.sample(paragraphs, min(len(paragraphs), n))

sample_paragraphs = extract_sample(prepdata_media, n=3) + extract_sample(prepdata_patent, n=2)

In [7]:
# Funktion: längere Absätze filtern & zufällig auswählen
def extract_paragraphs(df, column='processed_text', num_samples=150):
    paragraphs = df[column].dropna().unique()
    paragraphs = [p for p in paragraphs if len(p.split()) > 30]
    return random.sample(paragraphs, min(len(paragraphs), num_samples))

media_paragraphs = extract_paragraphs(prepdata_media, num_samples=150)
patent_paragraphs = extract_paragraphs(prepdata_patent, num_samples=150)

# Gesamt: ca. 300 Absätze
combined_paragraphs = media_paragraphs + patent_paragraphs


### Step 3: Generate QA Pairs with TinyLLaMA

test

In [15]:
def generate_qa_local(paragraph):
    prompt = f"""<|system|>
You are a helpful assistant that generates question-answer pairs from academic or technical paragraphs.</s>
<|user|>
Based on the following reference text, generate a relevant question and its corresponding answer.

Reference Text:
\"\"\"{paragraph}\"\"\"

Output format:
Q: <question>
A: <answer></s>
<|assistant|>"""

    try:
        output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)[0]['generated_text']
        if "Q:" in output and "A:" in output:
            result = output.split("Q:")[-1]
            result = "Q:" + result
            return result.strip()
        else:
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None


In [None]:
def generate_qa_local(paragraph):
    prompt = f"""<|system|>
You are a helpful assistant that generates question-answer pairs from academic or technical paragraphs.</s>
<|user|>
Based on the following reference text, generate a relevant question and its corresponding answer.

Reference Text:
\"\"\"{paragraph}\"\"\"

Output format:
Q: <question>
A: <answer></s>
<|assistant|>"""

    try:
        output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)[0]['generated_text']
        if "Q:" in output and "A:" in output:
            result = output.split("Q:")[-1]
            result = "Q:" + result
            return result.strip()
        else:
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None


### Step 4: Generate QA Pairs from All Paragraphs

testen

In [16]:
def generate_qa_local(paragraph):
    prompt = f"""<|system|>
You are a helpful assistant that generates question-answer pairs from academic or technical paragraphs.</s>
<|user|>
Based on the following reference text, generate a relevant question and its corresponding answer.

Reference Text:
\"\"\"{paragraph}\"\"\"

Output format:
Q: <question>
A: <answer></s>
<|assistant|>"""

    try:
        output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)[0]['generated_text']
        if "Q:" in output and "A:" in output:
            result = output.split("Q:")[-1]
            result = "Q:" + result
            return result.strip()
        else:
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None


In [None]:
qa_pairs = []

for para in tqdm(combined_paragraphs):
    result = generate_qa_local(para)
    if result:
        qa_pairs.append({'paragraph': para, 'qa': result})

qa_df = pd.DataFrame(qa_pairs)

### Step 5: Split QA into Separate Columns

testen

In [17]:
qa_results = []

for para in sample_paragraphs:
    result = generate_qa_local(para)
    if result:
        qa_results.append({'paragraph': para, 'qa': result})

qa_df = pd.DataFrame(qa_results)


In [None]:
def split_qa(text):
    try:
        q = text.split("Q:")[1].split("A:")[0].strip()
        a = text.split("A:")[1].strip()
        return q, a
    except:
        return None, None

qa_df[['question', 'answer']] = qa_df['qa'].apply(lambda x: pd.Series(split_qa(x)))
qa_df.dropna(subset=['question', 'answer'], inplace=True)

### Step 6: Categorize Questions

testen

In [18]:
def split_qa(text):
    try:
        q = text.split("Q:")[1].split("A:")[0].strip()
        a = text.split("A:")[1].strip()
        return q, a
    except:
        return None, None

qa_df[['question', 'answer']] = qa_df['qa'].apply(lambda x: pd.Series(split_qa(x)))

# Preview result
qa_df[['question', 'answer']]


Unnamed: 0,question,answer
0,<question>,<answer></s>\n<|assistant|>\nQuestion: What is...
1,<question>,<answer></s>\n<|assistant|>\nBased on the give...
2,How does the Council Scientific Industrial Res...,The CSIR aims to conduct a Strategic Environme...
3,"Based on the provided reference text, what is ...",The invention includes a solar power generatio...
4,How does the utility model provide solar contr...,The utility model provides a combination of so...


In [None]:
def categorize(row):
    q = row['question'].lower()
    if any(w in q for w in ['how', 'why']):
        return 'Analytical'
    elif any(w in q for w in ['compare', 'difference', 'vs']):
        return 'Comparative'
    elif any(w in q for w in ['when', 'where', 'who']):
        return 'Factual'
    elif 'advantage' in q or 'benefit' in q:
        return 'Evaluation'
    else:
        return 'General'

qa_df['category'] = qa_df.apply(categorize, axis=1)

# Optional: Define descriptions for each category
category_explanations = {
    'Analytical': 'Exploratory or reasoning-based questions (e.g., how, why)',
    'Comparative': 'Questions comparing concepts or methods',
    'Factual': 'Fact-based questions (e.g., when, where, who)',
    'Evaluation': 'Questions about benefits or advantages',
    'General': 'Broad or uncategorized questions'
}


### Step 7: Add Manual Review Fields


In [None]:
qa_df['review_comment'] = ""
qa_df['review_quality'] = ""  # e.g., good / average / poor


### Step 8: Save Final QA Dataset

In [None]:
qa_df.to_csv("/content/drive/MyDrive/CLT/data/generated_qa_pairs_tinyllama_full.csv", index=False)
