In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement re (from versions: none)
ERROR: No matching distribution found for re


Web Traffic Log-Based Q&A System

This project builds a question-answering system over web traffic logs using Retrieval-Augmented Generation (RAG). It parses the logs, stores them in a vector database, and uses two language models (LLaMA 3 and Google T5) to generate answers to user queries.


Imports and Setup

In [9]:
import re
"""
Imports for the Q&A bot notebook:
- re: Regular expression operations.
- pandas: Data manipulation and analysis.
- numpy: Numerical computing.
- datetime: Date and time manipulation.
- faiss: Efficient similarity search and clustering of dense vectors.
- tqdm: Progress bars for loops and tasks.
- langchain.vectorstores: Vector stores for embeddings.
- langchain_huggingface: Hugging Face embeddings and pipeline wrappers.
- langchain.docstore: In-memory document store.
- langchain.schema: Document schema.
- transformers: Tokenizers, models, and pipelines.
- langchain.chains: RetrievalQA chain for retrieval-based question answering.
The cell also clears the GPU cache and sets the device to "cuda" or "cpu", depending on availability.
Note: This cell assumes the libraries are already installed.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import faiss
from tqdm.notebook import tqdm
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.docstore import InMemoryDocstore
from langchain.schema import Document
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA

import torch

# Release cached GPU memory left over from earlier runs
torch.cuda.empty_cache()
# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")



Using device: cuda


Log parsing and preprocessing

In [1]:
import pandas as pd
from langchain.document_loaders import CSVLoader
import re
# The following code handles the loading, parsing, and preprocessing of web traffic logs. The logs are parsed to extract useful fields, and the data is then cleaned and structured for further use.

# Function to parse the page content
def parse_page_content(page_content):
    """
    Parses the content of a web traffic log entry.
    
    Args:
        page_content (str): A single log entry as a string.
        
    Returns:
        dict: A dictionary containing parsed fields like IP, identity, user, datetime, method, etc.
    """
    pattern = re.compile(
        r'IP: (?P<ip>[\d\.]+)\n'
        r'Identity: (?P<identity>.+)\n'
        r'User: (?P<user>.+)\n'
        r'Timestamp: (?P<datetime>.+)\n'
        r'Request: (?P<method>\w+) (?P<url>.+) HTTP/\d\.\d\n'
        r'Status: (?P<status>\d+)\n'
        r'Size: (?P<size>\d+)\n'
        r'Referer: (?P<referer>.+)\n'
        r'User-Agent: (?P<user_agent>.+)'
    )
    match = pattern.search(page_content)
    if match:
        return match.groupdict()
    return {}

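# Function to downsample the CSV file
def downsample_csv(csv_file_path, sample_size, output_file_path):
    """
    Writes a random sample of the CSV rows to a new file.

    Args:
        csv_file_path (str): Path to the full CSV file.
        sample_size (int): Maximum number of rows to keep; falsy keeps all rows.
        output_file_path (str): Where to write the sampled CSV.

    Returns:
        str: The output file path.
    """
    df = pd.read_csv(csv_file_path)
    # Sample only when the file has more rows than requested
    if sample_size and sample_size < len(df):
        df = df.sample(n=sample_size)
    df.to_csv(output_file_path, index=False)
    return output_file_path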

# Load and preprocess logs using CSVLoader
def load_and_preprocess_logs(csv_file_path, sample_size=10000):
    """
    Loads and preprocesses web traffic logs from a CSV file.
    
    Args:
        csv_file_path (str): Path to the CSV file containing log data.
        sample_size (int): Number of samples to load from the file.
        
    Returns:
        pd.DataFrame: A DataFrame containing processed log data with additional fields for analysis.
    """
    # Downsample the CSV file
    downsampled_csv = downsample_csv(csv_file_path, sample_size, 'downsampled_logs.csv')
    
    # Load the downsampled CSV file
    loader = CSVLoader(file_path=downsampled_csv)
    documents = loader.load()
    
    # Debug: Print the first few documents to check if they are loaded correctly
    print("Loaded documents:", documents[:5])
    
    # Convert documents to DataFrame
    data = []
    for doc in documents:
        parsed_data = parse_page_content(doc.page_content)
        data.append(parsed_data)
    df = pd.DataFrame(data)
    
    # Debug: Print the DataFrame to check if it is populated correctly
    print("DataFrame head:", df.head())
    
    # Preprocess logs
    df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%b/%Y:%H:%M:%S %z', errors='coerce')
    df['hour'] = df['datetime'].dt.hour
    df['day'] = df['datetime'].dt.day
    df['month'] = df['datetime'].dt.month
    df['year'] = df['datetime'].dt.year
    df['weekday'] = df['datetime'].dt.weekday
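    # Convert to numbers; values that cannot be parsed become NaN instead of
    # staying as strings, so the integer division below does not fail
    df['status'] = pd.to_numeric(df['status'], errors='coerce')
    df['size'] = pd.to_numeric(df['size'], errors='coerce')
    df['status_category'] = df['status'] // 100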
    
    # Create a text field for embedding
    df['text'] = df.apply(lambda row: f"{row['method']} {row['url']} (Status: {row['status']}, Size: {row['size']}, IP: {row['ip']})", axis=1)
    
    return df

# Example usage
processed_logs = load_and_preprocess_logs('processed_logs.csv', sample_size=10000)

print(processed_logs.head())

Loaded documents: [Document(metadata={'source': 'downsampled_logs.csv', 'row': 0}, page_content='IP: 188.158.164.229\nIdentity: -\nUser: -\nTimestamp: 26/Jan/2019:20:01:52 +0330\nRequest: GET /image/57274/productModel/150x150 HTTP/1.1\nStatus: 200\nSize: 5687\nReferer: https://www.zanbil.ir/browse/cooktop/%D8%A7%D8%AC%D8%A7%D9%82-%DA%AF%D8%A7%D8%B2-%D8%B5%D9%81%D8%AD%D9%87-%D8%A7%DB%8C\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'), Document(metadata={'source': 'downsampled_logs.csv', 'row': 1}, page_content='IP: 83.121.146.171\nIdentity: -\nUser: -\nTimestamp: 25/Jan/2019:01:37:21 +0330\nRequest: GET /image/58606/productModel/200x200 HTTP/1.1\nStatus: 200\nSize: 2492\nReferer: https://www.zanbil.ir/m/filter/b2%2Cp3?page=1\nUser-Agent: Mozilla/5.0 (Linux; Android 8.1.0; SM-G610F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.99 Mobile Safari/537.36'), Document(metadata={'source': 'downsampled_l
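
The parsing and preprocessing steps above can be checked on a single entry without the CSV file. The sketch below reuses the same regular expression on the second entry from the output above and derives a few of the same fields using only the standard library instead of pandas. The names LOG_PATTERN, sample_entry, and derived are illustrative and not part of the notebook.

```python
import re
from datetime import datetime

# Same field layout as the CSVLoader page_content shown in the output above
LOG_PATTERN = re.compile(
    r'IP: (?P<ip>[\d\.]+)\n'
    r'Identity: (?P<identity>.+)\n'
    r'User: (?P<user>.+)\n'
    r'Timestamp: (?P<datetime>.+)\n'
    r'Request: (?P<method>\w+) (?P<url>.+) HTTP/\d\.\d\n'
    r'Status: (?P<status>\d+)\n'
    r'Size: (?P<size>\d+)\n'
    r'Referer: (?P<referer>.+)\n'
    r'User-Agent: (?P<user_agent>.+)'
)

# One entry copied from the "Loaded documents" output
sample_entry = (
    "IP: 83.121.146.171\n"
    "Identity: -\n"
    "User: -\n"
    "Timestamp: 25/Jan/2019:01:37:21 +0330\n"
    "Request: GET /image/58606/productModel/200x200 HTTP/1.1\n"
    "Status: 200\n"
    "Size: 2492\n"
    "Referer: https://www.zanbil.ir/m/filter/b2%2Cp3?page=1\n"
    "User-Agent: Mozilla/5.0 (Linux; Android 8.1.0; SM-G610F) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.99 Mobile Safari/537.36"
)

match = LOG_PATTERN.search(sample_entry)
fields = match.groupdict() if match else {}

# Same timestamp format string that the notebook passes to pd.to_datetime
timestamp = datetime.strptime(fields["datetime"], "%d/%b/%Y:%H:%M:%S %z")
status = int(fields["status"])

derived = {
    "hour": timestamp.hour,
    "weekday": timestamp.weekday(),   # Monday is 0
    "status_category": status // 100,  # 2 for 2xx, 4 for 4xx, and so on
}
print(fields["method"], fields["url"], derived)
```

Because the pattern requires digits after "Size: ", an entry with a non-numeric size would not match and parse_page_content would return an empty dict for it.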

Vector Database Setup

In [4]:
from langchain.text_splitter import CharacterTextSplitter

# The following section creates a vector database using FAISS and stores the processed log data for efficient retrieval during query processing.
#
# Steps in this cell:
# 1. Initialize the embedding model (HuggingFaceEmbeddings) from a Hugging Face model name.
# 2. Create one document per log row from the 'text', 'datetime', and 'user_agent' columns of processed_logs.
# 3. Split the documents into smaller chunks; chunk_size is the size of each chunk and chunk_overlap is the overlap between consecutive chunks.
# 4. Create the FAISS vector store from the split documents and the embedding model.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create documents for vector store
documents = [
    f"{row['text']} (Datetime: {row['datetime']}, User Agent: {row['user_agent']})"
    for _, row in processed_logs.iterrows()
]

# Split documents
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.create_documents(documents)

# Create the vector store
vectorstore = FAISS.from_documents(texts, embeddings)
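
To illustrate what the vector store does at query time, here is a toy sketch of embedding-based retrieval in plain Python. It is not FAISS and not the sentence-transformers model: toy_embed is a hypothetical stand-in that counts tokens, and cosine similarity is used here only for simplicity. The third document is a made-up entry added for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for the embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, docs, k=2):
    """Return the k documents most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


docs = [
    "GET /image/57274/productModel/150x150 (Status: 200, Size: 5687, IP: 188.158.164.229)",
    "GET /image/58606/productModel/200x200 (Status: 200, Size: 2492, IP: 83.121.146.171)",
    "POST /login (Status: 404, Size: 0, IP: 10.0.0.1)",  # made-up entry
]

results = top_k("POST requests with status 404", docs, k=1)
print(results)
```

In the notebook itself, the equivalent lookup is vectorstore.as_retriever(search_kwargs={"k": 5}), which returns the five nearest chunks for each query.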



RAG Model Setup

In [5]:
import logging
"""
Create a RAG (Retrieval-Augmented Generation) chain for question answering.
The chain is built from:
    llm (HuggingFacePipeline): The Hugging Face pipeline used for text generation.
    chain_type (str): How retrieved documents are combined; "stuff" places them all in one prompt.
    retriever: The vector store retriever used for document retrieval.
    chain_type_kwargs (dict): Additional keyword arguments for the chain type, here the prompt.
Result:
    rag_chain (RetrievalQA): The RAG chain used for question answering.
"""
from langchain.prompts import PromptTemplate
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

pipe = pipeline(
    "text2text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=512,
    device=0 if device == "cuda" else -1
)

llm = HuggingFacePipeline(pipeline=pipe)
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# Create the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
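
As a rough picture of what the "stuff" chain type does with the prompt above, the sketch below joins the retrieved texts into one context string and fills the template. The helper build_stuff_prompt and the blank-line separator are assumptions made for illustration, not LangChain internals, and the two sample texts are shortened versions of entries from the log output.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""


def build_stuff_prompt(question, retrieved_texts, separator="\n\n"):
    """Hypothetical helper: place every retrieved text into a single prompt."""
    context = separator.join(retrieved_texts)
    return TEMPLATE.format(context=context, question=question)


retrieved = [
    "GET /image/57274/productModel/150x150 (Status: 200, Size: 5687, IP: 188.158.164.229)",
    "GET /image/58606/productModel/200x200 (Status: 200, Size: 2492, IP: 83.121.146.171)",
]
prompt_text = build_stuff_prompt("Which status codes appear?", retrieved)
print(prompt_text)
```

Because all k retrieved entries go into one prompt, long entries can make the prompt large relative to what a small model such as flan-t5-base handles well, so it may be worth checking prompt length when raising k.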

Asking a Question

In [6]:
# The following function takes a question about the web traffic logs, runs it through the RAG chain, and prints the answer along with the log entries retrieved as evidence.

def ask_question(question):
    """
    Processes a user query using the RAG model and returns an answer with relevant source documents.
    
    Args:
        question (str): The user's question about the web traffic logs.
        
    Returns:
        None: Prints the question, the generated answer, and source documents.
    """
    # invoke() replaces the deprecated direct call rag_chain({...})
    result = rag_chain.invoke({"query": question})
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print("\nSource Documents:")
    for i, doc in enumerate(result['source_documents'], 1):
        print(f"{i}. {doc.page_content[:200]}...")

In [7]:
ask_question("Are there any unusual patterns in user-agent strings that might indicate bot activity or potential attackers?")
ask_question("Which HTTP methods are predominantly used in the logs, and what does this tell us about the nature of the traffic?")

  warn_deprecated(
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


Question: Are there any unusual patterns in user-agent strings that might indicate bot activity or potential attackers?
Answer: Based on the provided log entries, I have analyzed the user-agent strings to identify any unusual patterns. Here's my findings:

The user-agent strings in all four log entries are identical and match a Googlebot/2.1 crawler. The presence of `compatible;` at the end of the string is also indicative of a crawler, as it suggests that this user agent is compatible with other services.

The repeated occurrence of the same user-agent string across multiple log entries (specifically, from IP address 66.249.66.194) raises suspicions about potential bot activity or crawler-based traffic. The fact that all four log entries have the same user-agent string and IP address suggests a consistent pattern, which could be indicative of automated crawling activity.

To further support this conclusion, I would cross-reference these log entries with other logs to determine if there

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


Question: Which HTTP methods are predominantly used in the logs, and what does this tell us about the nature of the traffic?
Answer: Based on the provided log entries, we can see that there are only two instances of different HTTP methods being used. The first is a GET request (200 OK), and the second is a POST request.

Here are the relevant log entries:

* GET requests:
	+ /blog/tag/%d8%aa%d8%b4%da%a9-%d8%b1%d9%88%db%8c%d8%a7/ (2019-01-26 17:05:25+03:30)
	+ /m/article/711/%DA%86%DA%AF%D9%88%D9%86%D9%87-%D8%A7%D8%B2-%D9%88%D8%A7%DA%A9%D8%B3-%D9%85%D9%88-%D8%A7%D8%B3%D8%AA%D9%81%D8%A7%D8%AF%D9%87-%DA%A9%D9%86%DB%8C%D9%85%D8%9F (2019-01-23 17:02:28+03:30)
	+ /m/article/122/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AE%D8%B1%DB%8C%D8%AF-%D8%AA%D8%BE%D9%85-%D9%85%D8%B1%D8%BA-%D9%BE%D8%B2 (2019-01-25 18:10:40+03:30)
	+ /blog/home-appliances/%da%86%d8%b1%d9%88%da%a9-%d9%87%d8%a7%db%8c-%d9%84%d8%a8%d8%a7%d8%b3-%d8%b1%d8%a7-%d8%a7%d8%b2-%d8%a8%db%8c%d9%86-%d8%a8%d8%a8%d8%b1%db%8c%d8%af/ (
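
The retriever passes only k=5 entries to the model, so an answer about "predominant" methods reflects those few entries rather than the whole log. One possible cross-check is a direct count over the parsed methods. The sketch below uses a made-up list of method values in place of the real column.

```python
from collections import Counter

# Made-up method values standing in for the parsed 'method' column
methods = ["GET", "GET", "POST", "GET", "HEAD", "GET"]

counts = Counter(methods)
total = sum(counts.values())

# Share of each method, ordered from most to least frequent
shares = {method: count / total for method, count in counts.most_common()}

for method, share in shares.items():
    print(f"{method}: {counts[method]} requests ({share:.0%})")
```

On the DataFrame built earlier, processed_logs['method'].value_counts(normalize=True) should give the same kind of summary.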

Streamlit

In [20]:

# Run the Streamlit app
!streamlit run streamlit_app.py

^C
