### Building a RAG system to determine whether a semiconductor unit is faulty
Dataset: https://archive.ics.uci.edu/dataset/179/secom
- Modern semiconductor manufacturing involves continuous monitoring using sensor signals or process measurement points.

- Not all monitored signals are equally valuable; they may contain useful data, irrelevant information, and noise.

- Often, useful data is hidden within irrelevant information and noise.

- Engineers typically collect more signals than necessary.

- Feature selection helps identify the most relevant signals (features) for analysis.

- Selected features assist Process Engineers in identifying key factors causing yield issues in production.

- Benefits include:

    - Improved process throughput

    - Reduced time to learning

    - Lower per-unit production costs

- Feature selection is being explored as an intelligent systems technique to support business improvement.

- The dataset includes:

    - Individual production instances (entities)

    - Associated measured features

    - Labels indicating pass (-1) or fail (1) outcomes from in-house testing

    - Date-time stamps corresponding to the test point

In [None]:
#1
from google.colab import drive
drive.mount('/content/drive')

1. Install the required packages

In [None]:
#2
!pip install -q langchain langchain-community
!pip install langchain-huggingface
!pip install --upgrade torch transformers accelerate bitsandbytes sentence-transformers
!pip install torchvision
!pip install faiss-gpu-cu12
!pip install imbalanced-learn
# If pandas and NumPy versions mismatch, remove both and reinstall them; FAISS also needs NumPy < 2.
!pip uninstall -y pandas numpy
!pip install --force-reinstall numpy==1.26 pandas

Check the NumPy and pandas versions after installation

In [None]:
import numpy as np
import pandas as pd

print(np.__version__)
print(pd.__version__)

2. Load the dataset

In [None]:
import pandas as pd
import numpy as np
# secom.data: 1567 rows, 590 sensor features (f_0 ... f_589)
data = pd.read_csv("/content/drive/MyDrive/RAG Q&A System/semoconductor_data/data/secom/secom.data", sep=r"\s+", header=None)

# secom_labels.data: one label and one timestamp per row
label_df = pd.read_csv("/content/drive/MyDrive/RAG Q&A System/semoconductor_data/data/secom/secom_labels.data", sep=" ", header=None)

#3

In [None]:
print(data.shape)
print(label_df.shape)

In [None]:
# rename the integer column names 0, 1, ... to f_0, f_1, ...
data.columns = [f"f_{i}" for i in data.columns]


In [None]:
# rename label data
label_df.rename(columns={0: "label", 1:"timestamp"}, inplace=True)


In [None]:
label_df["label"].value_counts()
# 1 :  faulty
# -1 : not faulty

In [None]:
# work on a copy so the raw data stays untouched
data_for_process = data.copy()
# handle NaN values by imputing each column's mean
data_for_process.fillna(data.mean(), inplace=True)
# True here would mean that some NaN values remain
print(data_for_process.isnull().values.any())
#6
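
The mean imputation in the cell above can be illustrated on a toy column. This is a plain-Python sketch of the idea, not the pandas implementation; the helper name `impute_mean` and the sample values are made up for illustration.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the non-NaN entries (hypothetical helper)."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        # An all-NaN column has no mean, so it is returned unchanged.
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

toy_column = [1.0, math.nan, 3.0, math.nan, 5.0]
print(impute_mean(toy_column))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```

A column that contains only NaN values has no mean to fill in, which is one reason the `isnull().values.any()` check after imputation is worth keeping.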

3. Undersample the data

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# merge the imputed features with the labels and timestamps
new_df = pd.concat([data_for_process, label_df], axis = 1)
# split into x (features + timestamp) and y (label)
x = new_df.drop("label", axis = 1)
y = new_df["label"]

# randomly drop majority-class rows until both classes have the same size
rus = RandomUnderSampler(random_state = 42, sampling_strategy='majority')
new_x, new_y = rus.fit_resample(x, y)
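
To make the undersampling step concrete, here is a minimal plain-Python sketch of random undersampling on toy data. It is an illustration of the idea only, not imbalanced-learn's implementation; `random_undersample` and the toy rows are assumptions introduced for this example.

```python
import random
from collections import defaultdict

def random_undersample(rows, labels, seed=42):
    """Randomly keep as many rows per class as the smallest class has (illustrative helper)."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # The minority class is kept whole; larger classes are sampled down.
        for row in rng.sample(group, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data imbalanced in the same direction as SECOM: many -1, few 1.
rows = list(range(10))
labels = [-1] * 8 + [1] * 2
new_rows, new_labels = random_undersample(rows, labels)
print(new_labels.count(-1), new_labels.count(1))  # 2 2
```

Applied to the class counts reported later in this notebook (104 faulty, 1463 not faulty), the same idea yields 104 rows per class, 208 in total.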


In [None]:
# check whether any NaN values remain
new_x.isnull().values.any()


4. Shuffle and then scale the data


In [None]:
# after undersampling
# convert the label Series to a DataFrame
resampled_label_df = new_y.to_frame()

# concatenate features and labels
new_df = pd.concat([new_x, resampled_label_df], axis = 1)

# shuffle the dataset after resampling
new_df = new_df.sample(frac = 1, random_state = 42)

# split into separate feature and label DataFrames
feature_df = new_df.drop(['timestamp', 'label'], axis = 1)
label_df = new_df[["label", "timestamp"]]



In [None]:
(feature_df.index == label_df.index).all()


In [49]:
new_df.head(10) # run

Unnamed: 0,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,...,f_582,f_583,f_584,f_585,f_586,f_587,f_588,f_589,timestamp,label
448,2942.41,2523.71,2207.0444,1269.6078,1.7571,100.0,97.0189,0.1221,1.5272,0.0181,...,0.5017,0.0161,0.004,3.1998,0.0235,0.0355,0.0099,150.7761,24/08/2008 13:03:00,1
257,3012.98,2498.28,2243.7778,1502.9221,1.816,100.0,102.0978,0.1195,1.44,-0.0281,...,0.5001,0.0191,0.0058,3.8122,0.0135,0.0114,0.0043,84.4337,18/08/2008 10:13:00,-1
80,2855.8,2537.35,2183.4333,1582.5646,1.3601,100.0,99.0267,0.124,1.4912,-0.0004,...,0.5011,0.0122,0.0032,2.425,0.0218,0.0152,0.005,69.422,03/08/2008 20:23:00,-1
1296,2951.06,2503.18,2228.4778,1721.1108,1.4301,100.0,93.6222,0.1221,1.3841,-0.0264,...,0.4997,0.0131,0.0034,2.6132,0.0308,0.0183,0.0063,59.3775,04/10/2008 18:59:00,-1
583,2949.82,2497.56,2173.4556,1433.6732,1.0304,100.0,110.5422,0.1245,1.4031,0.0027,...,0.4961,0.0185,0.0044,3.7335,0.0332,0.0216,0.0083,65.1043,30/08/2008 14:10:00,1
970,3066.18,2539.01,2180.5556,1165.1351,0.7892,100.0,101.4578,0.1226,1.4454,0.0177,...,0.4999,0.0174,0.0046,3.486,0.0223,0.0159,0.0053,71.0108,21/09/2008 01:06:00,-1
648,3068.56,2363.52,2171.3222,966.5755,0.8066,100.0,107.17,0.1242,1.5316,-0.0214,...,0.5036,0.0152,0.0034,3.0241,0.0211,0.0106,0.0034,50.065,02/09/2008 01:10:00,-1
231,2940.65,2495.850231,2214.0556,1150.7775,1.3772,100.0,102.9389,0.1205,1.4978,0.0221,...,0.5038,0.0188,0.004,3.7356,0.0118,0.0098,0.0031,83.1192,17/08/2008 12:16:00,1
63,3016.64,2492.8,2246.4889,1006.9548,1.0997,100.0,103.3222,0.1184,1.5068,0.0126,...,0.4984,0.0146,0.004,2.9336,0.0296,0.0062,0.0018,20.8909,01/08/2008 02:02:00,-1
321,2936.59,2526.22,2196.6889,1593.122,1.5925,100.0,99.1133,0.1226,1.4934,-0.0074,...,0.5004,0.0174,0.0045,3.47,0.0171,0.0096,0.0025,56.0858,20/08/2008 02:27:00,1


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# scale the data (zero mean, unit variance per column)
scalled_data = scaler.fit_transform(feature_df)

# convert back to a DataFrame; passing index keeps the original row index
scalled_dataframe = pd.DataFrame(scalled_data, columns = feature_df.columns, index = feature_df.index, dtype=np.float32)
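
The standardization performed by `StandardScaler` can be sketched per column as follows. This is a plain-Python illustration, assuming the population standard deviation (ddof = 0); the helper `standardize` is hypothetical and not part of scikit-learn.

```python
import statistics

def standardize(column):
    """z-score a column using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # Constant column: avoid dividing by zero; every value maps to 0.0.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

print(standardize([100.0, 100.0, 100.0]))  # [0.0, 0.0, 0.0]
print(standardize([1.0, 2.0, 3.0]))
```

The constant-column case is relevant here: `f_5` reads 100.0 in every row shown in the preview, which would explain why it becomes 0.0 everywhere in the scaled output.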



In [None]:
# check that the feature and label indices still match
(scalled_dataframe.index == label_df.index).all()



In [50]:
scalled_dataframe.head() # run

Unnamed: 0,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,...,f_580,f_581,f_582,f_583,f_584,f_585,f_586,f_587,f_588,f_589
448,-0.825624,0.361398,0.304672,-0.203265,0.808139,0.0,-0.814064,-0.146505,0.893895,1.480429,...,0.05904,0.047889,0.331202,0.154244,0.14803,0.140071,0.179105,2.46911,1.916266,0.730306
257,0.091214,0.006365,1.459878,0.427774,0.928286,0.0,0.087459,-1.436443,-0.379969,-1.732891,...,0.05904,0.047889,-0.131535,0.656584,1.662304,0.652909,-0.778577,-0.555906,-0.293701,-0.104598
80,-1.950852,0.551829,-0.437859,0.643181,-0.001683,0.0,-0.457673,0.796142,0.367988,0.19371,...,-0.177553,-0.555919,0.157676,-0.498798,-0.524981,-0.508763,0.016299,-0.078932,-0.017455,-0.293517
1296,-0.713244,0.074775,0.978718,1.017903,0.141107,0.0,-1.416991,-0.146505,-1.196586,-1.614652,...,0.05904,0.047889,-0.24722,-0.348096,-0.356728,-0.351161,0.878212,0.310178,0.495573,-0.419925
583,-0.729354,-0.003687,-0.751642,0.240478,-0.674223,0.0,1.58637,1.044207,-0.919024,0.409323,...,0.05904,0.047889,-1.28838,0.556116,0.484535,0.587004,1.108056,0.724392,1.284847,-0.347854


5. Create text from the scaled features and labels. Later we will use this text for embedding.



Problem/solution 1: the text was changed to contain only the scaled, NaN-imputed data, because the input limit of the embedding model is below 9000 tokens.

In [None]:
# 14
text = []
labels = []
for i in range(len(scalled_dataframe)):
  scaled_data = ", ".join([f"{col}: {scalled_dataframe.iloc[i][col]:.2f}" for col in scalled_dataframe.columns])
  label = "Faulty" if label_df.iloc[i]["label"] == 1 else "Not Faulty"
  timestamp = label_df.iloc[i]["timestamp"]

  # This text is only embedded, so it carries no instructions;
  # the instructions are given later, when the model is prompted.
  text.append(f"Timestamp: {timestamp} | Sensor Data: {scaled_data} | Label: {label}")
  labels.append(label)

# description of the preprocessing, stored later as document metadata
process = "Data Preprocessed: NaN filled with mean value, standardized"
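
The row-to-text step can also be written as a small helper, which makes the layout of each embedded string easier to see. This is a sketch using the same layout as the cell above; the function name `row_to_text` is an assumption, and the example uses only the first three scaled readings of row 448 from the preview.

```python
def row_to_text(timestamp, features, label):
    """Serialize one row as 'Timestamp | Sensor Data | Label'."""
    status = "Faulty" if label == 1 else "Not Faulty"
    sensor_data = ", ".join(f"{name}: {value:.2f}" for name, value in features.items())
    return f"Timestamp: {timestamp} | Sensor Data: {sensor_data} | Label: {status}"

example = row_to_text(
    "24/08/2008 13:03:00",
    {"f_0": -0.825624, "f_1": 0.361398, "f_2": 0.304672},
    1,
)
print(example)
# Timestamp: 24/08/2008 13:03:00 | Sensor Data: f_0: -0.83, f_1: 0.36, f_2: 0.30 | Label: Faulty
```

Rounding to two decimals keeps each row short, which matters because every row has to fit within the embedding model's input limit.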

6. Document creation

In [None]:
from langchain_core.documents import Document

In [None]:
# Document(page_content=<str>, metadata=<dict>)
#16
documents = [Document(page_content=cont, metadata={"data processing": process}) for cont in text]

In [None]:
# check the document lengths (in characters) to choose chunk_size;
# a single row should not be split across different chunks
length = [len(doc.page_content) for doc in documents]

In [None]:
max(length)

In [None]:
import locale

# force the preferred encoding reported by locale to UTF-8
locale.getpreferredencoding = lambda: "UTF-8"

7. Embedding model initialization

In [None]:

from langchain_huggingface import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name = "BAAI/bge-m3") # max sequence length 8192 tokens, embedding dimension 1024

8. Chunking the docs




In [None]:
# building the vector DB takes time, so the documents are split into chunks first
# 18
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=7958, chunk_overlap=0)

chunk_docs = splitter.split_documents(documents)
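
Because `RecursiveCharacterTextSplitter` measures length in characters by default, `chunk_size` is a character count, while the embedding model's limit is counted in tokens, so the two numbers are not directly comparable. The sketch below shows how one might check that a chosen `chunk_size` leaves every row whole. The helper `count_chunks` and the lengths are hypothetical, and it only counts documents longer than the limit rather than reproducing the splitter's behavior.

```python
def count_chunks(lengths, chunk_size):
    """Return (documents that fit in one chunk, documents that would be split)."""
    whole = sum(1 for n in lengths if n <= chunk_size)
    return whole, len(lengths) - whole

# Hypothetical character lengths of four serialized rows.
lengths = [7801, 7958, 7640, 7912]
print(count_chunks(lengths, chunk_size=max(lengths)))  # (4, 0)
print(count_chunks(lengths, chunk_size=7900))          # (2, 2)
```

Setting `chunk_size` to the maximum document length, as computed with `max(length)` earlier, keeps one row per chunk.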

In [None]:
len(chunk_docs) # 29773

9. Using FAISS as the vector store

In [None]:

from langchain_community.vectorstores import FAISS


In [None]:
# # convert the chunked docs into a NumPy array
# # embed_documents() expects a list of strings, not Document objects
# texts = [doc.page_content for doc in chunk_docs] # list of str
# text_embeddings = embedding.embed_documents(texts) # list of str --> list of float vectors
# embeddings_np_text = np.array(text_embeddings, dtype = np.float32)
# embeddings_np_text.shape # (number of chunks, 1024) for bge-m3

In [None]:
## from_embeddings raised an error, so the index is built with from_documents instead
# https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS.from_embeddings
faiss = FAISS.from_documents(chunk_docs, embedding)
# faiss = FAISS.from_texts(texts, embedding)

In [None]:
faiss.embeddings # max_seq_length : 8192, good for our case

In [None]:
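# save the index to Google Drive
faiss.save_local("/content/drive/MyDrive/RAG Q&A System/fiasis_doc_chunk_from_documents") # chunked docs
# load it back later (recent langchain-community versions require allow_dangerous_deserialization=True)
# db = FAISS.load_local("/content/drive/MyDrive/RAG Q&A System/fiasis_doc_chunk_from_documents", embedding, allow_dangerous_deserialization=True)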

10. Retriever setup. Retrieval systems are fundamental to many AI applications: they efficiently identify relevant information in large datasets and can accommodate various data formats.

In [None]:
# We need a way to return(retrieve) the documents given an unstructured query. For that, we’ll use the as_retriever method
retriever = faiss.as_retriever(search_type = "similarity", search_kwargs ={"k": 10})
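
What a similarity retriever with `k` results does can be sketched as a brute-force nearest-neighbor search. This is a toy illustration, assuming Euclidean (L2) distance, which to my knowledge is the default of LangChain's FAISS wrapper; `top_k` and the 2-D vectors are made up, and FAISS itself is far more efficient.

```python
import math

def top_k(query, vectors, k):
    """Return the indices of the k vectors closest to the query (Euclidean distance)."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in sorted(distances)[:k]]

# Toy 2-D "embeddings"; real bge-m3 vectors have 1024 dimensions.
vectors = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (5.0, 5.0)]
print(top_k((1.0, 1.0), vectors, k=2))  # [1, 2]
```

With `k = 10` and rows of several thousand characters each, the retrieved context alone can become very long, so the value of `k` also affects how much of the LLM's context window is used.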

11. LLM model initialization with quantization

In [None]:
# load quantized model


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import bitsandbytes

model_name = "HuggingFaceH4/zephyr-7b-beta"


In [None]:
# bnb configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# model initialization
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config)
# tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained(model_name)
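
The reason for 4-bit quantization is memory. A back-of-the-envelope estimate for the weights alone is sketched below; it assumes roughly 7 billion parameters for a "7b" model and ignores quantization overhead, activations, and the KV cache, so the numbers are approximations only.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumption: roughly 7 billion parameters
print(weight_memory_gb(n_params, 16))  # about 14 GB in bfloat16 / float16
print(weight_memory_gb(n_params, 4))   # about 3.5 GB in 4-bit NF4
```

Under these assumptions, the 4-bit weights need roughly a quarter of the memory of 16-bit weights, which is what makes a 7B model practical on a single Colab GPU.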

12. Set up the LLM chain

In [None]:
# setup the llm chain
# text_generation pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model = model,
    tokenizer = tokenizer,
    task = "text-generation",
    temperature = 0.6,
    do_sample = True,
    repetition_penalty = 1.1,
    return_full_text = True,  # the returned text includes the prompt, followed by the completion
    max_new_tokens = 400,
)

llm = HuggingFacePipeline(pipeline = text_generation_pipeline)
# Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

prompt_template = """
<|system|>
You are a highly skilled semiconductor systems expert. Analyze the sensore log data provided in the context and answer the question accurately and concisely:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

"""

prompt = PromptTemplate(input_variables = ["context", "question"],
                        template = prompt_template)

llm_chain = prompt | llm | StrOutputParser() # no retriever: "context" and "question" must be passed in by hand

In [None]:
# faiss = FAISS.load_local("/content/drive/MyDrive/RAG Q&A System/fiasis_doc_chunk_from_documents", embedding)

13. Set up the RAG chain

In [None]:
from langchain_core.runnables import RunnablePassthrough

retriever = faiss.as_retriever() # default retriever (similarity search, k=4); replaces the k=10 retriever defined above

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain # with context
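
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` as-is, so the prompt contains the full object representation, including ids and metadata, as the captured output further down shows. A common alternative is to join only the page contents before they reach the prompt. The sketch below uses a stand-in `Doc` class so that it runs without LangChain; `format_docs` is a conventional name, not something defined in this notebook.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str

def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

docs = [Doc("Timestamp: A | Label: Faulty"), Doc("Timestamp: B | Label: Not Faulty")]
print(format_docs(docs))
```

In a LangChain pipeline this would typically be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`; that changes the prompt text, so answers would differ from the outputs recorded below.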

14. Ask the LLM chain and then the RAG chain the same questions to see the difference between them

In [None]:
question_1 = "How do sensor patterns differ between Faulty and Not Faulty logs?"
question_2 = "Were there any abnormal scaled readings before a Faulty event occurred?"
question_3 = "Summarize the system behavior and sensor anomalies for the logs labeled as Faulty."
question_4 = "What are the key sensor readings recorded around the most recent Faulty event?"
question_5 = "Can you tell me what the data actually tells us?"
question_6 = "What do faulty and not faulty mean in the context of the semiconductor manufacturing process?"
question_7 = "Why was the system labeled as faulty at timestamp 24/08/2008 13:03:00?"
question_8 = "Which features contributed most to the fault?"
question_9 = "What anomaly is present in the sensor readings?"
question_10 = "Do faulty logs have a pattern in any particular sensor?"
question_11 = "What is the typical range of f_3 when the system is faulty?"
question_12 = "What would you suggest, based on this history, to reduce the faulty logs?"
#question_13 = "If i ask you to generate a faulty d"

## Without retrieved context: llm_chain

In [51]:
questions = [question_1, question_2, question_3, question_4, question_5, question_6,
             question_7, question_8, question_9, question_10, question_11, question_12]

# ask every question with an empty context, separated by a line of '#'
for i, question in enumerate(questions):
    if i > 0:
        print("#"*100)
    print(llm_chain.invoke({"context": "", "question": question}))


<|system|>
You are a highly skilled semiconductor systems expert. Analyze the sensore log data provided in the context and answer the question accurately and concisely:



</s>
<|user|>
How do sensor patterns differ between Faulty and Not Faulty logs?
</s>
<|assistant|>

To answer this question, we will need to analyze the sensor log data provided in the context. From the given information, it is not clear what type of sensors are being monitored or what constitutes a faulty versus non-faulty log. Therefore, we cannot provide a definitive answer without additional context.

However, generally speaking, faulty sensor logs may exhibit abnormal readings, frequent errors or fluctuations, and patterns that deviate significantly from those observed during normal operations. In contrast, non-faulty sensor logs should have consistent and expected readings within normal operating ranges. By comparing these patterns, we can identify potential faults and differentiate between faulty and non-faul

## With retrieved context: rag_chain

- total data: 1567
 - 104 with label 1 (faulty)
 - 1463 with label -1 (not faulty)

- scaled, preprocessed, and undersampled data
 - total: 208
  - 104 for each label

In [52]:
questions = [question_1, question_2, question_3, question_4, question_5, question_6,
             question_7, question_8, question_9, question_10, question_11, question_12]

# ask every question through the RAG chain, separated by a line of '#'
for i, question in enumerate(questions):
    if i > 0:
        print("#"*100)
    print(rag_chain.invoke(question))


<|system|>
You are a highly skilled semiconductor systems expert. Analyze the sensore log data provided in the context and answer the question accurately and concisely:

[Document(id='bd8e29bb-31e9-4e48-aa76-242c3fd0c2a1', metadata={'data processing': 'Data Preprocessed: NaN filled with mean value, standardized'}, page_content='Timestamp: 02/10/2008 14:07:00 | Sensor Data: f_0: -1.75, f_1: 0.45, f_2: -0.70, f_3: -0.87, f_4: -0.16, f_5: 0.00, f_6: -0.17, f_7: 1.29, f_8: 0.03, f_9: -0.01, f_10: -0.37, f_11: -0.10, f_12: 1.63, f_13: 0.00, f_14: -0.68, f_15: 0.74, f_16: 1.01, f_17: 0.62, f_18: 1.44, f_19: -0.01, f_20: -1.43, f_21: 0.07, f_22: -0.43, f_23: -0.12, f_24: 1.96, f_25: 0.71, f_26: 0.47, f_27: 0.61, f_28: 0.57, f_29: 0.25, f_30: 0.50, f_31: -0.42, f_32: 0.48, f_33: 0.18, f_34: 0.12, f_35: -0.07, f_36: -0.12, f_37: -0.99, f_38: 0.99, f_39: 0.10, f_40: 0.58, f_41: 0.00, f_42: 0.00, f_43: 1.74, f_44: 0.55, f_45: -0.37, f_46: 0.72, f_47: 0.78, f_48: 0.36, f_49: 0.00, f_50: 1.11, f_5