## RAG Pipeline - Data Ingestion to Vector DB Pipeline

In [1]:
import os
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path



In [3]:
### Read all the PDFs inside the directory
def process_all_pdfs(pdf_directory):
    """Process all PDF files in a directory"""
    all_documents = []
    pdf_dir = Path(pdf_directory)
    
    # Find all PDF files recursively
    pdf_files = list(pdf_dir.glob("**/*.pdf"))
    
    print(f"Found {len(pdf_files)} PDF files to process")
    
    for pdf_file in pdf_files:
        print(f"\nProcessing: {pdf_file.name}")
        try:
            loader = PyPDFLoader(str(pdf_file))
            documents = loader.load()
            
            # Add source information to metadata
            for doc in documents:
                doc.metadata['source_file'] = pdf_file.name
                doc.metadata['file_type'] = 'pdf'
            
            all_documents.extend(documents)
            print(f"  ✓ Loaded {len(documents)} pages")
            
        except Exception as e:
            print(f"  ✗ Error: {e}")
    
    print(f"\nTotal documents loaded: {len(all_documents)}")
    return all_documents

# Process all PDFs in the data directory
all_pdf_documents = process_all_pdfs("../data")

Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 14 0 (offset 0)
Ignoring wrong pointing object 17 0 (offset 0)
Ignoring wrong pointing object 61 0 (offset 0)
Ignoring wrong pointing object 88 0 (offset 0)


Found 3 PDF files to process

Processing: Peer Critique Review 6 (1).pdf
  ✓ Loaded 14 pages

Processing: da_2026_roadmap.pdf
  ✓ Loaded 26 pages

Processing: Grades for venkatanikhilkumarreddyu@gwmail.gwu.edu_ AWS Academy Cloud Architecting [132274].pdf
  ✓ Loaded 4 pages

Total documents loaded: 44


In [4]:
all_pdf_documents

[Document(metadata={'producer': 'macOS Version 26.1 (Build 25B78) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20251121225625Z00'00'", 'moddate': "D:20251121225625Z00'00'", 'source': '../data/pdf_files/Peer Critique Review 6 (1).pdf', 'total_pages': 14, 'page': 0, 'page_label': '1', 'source_file': 'Peer Critique Review 6 (1).pdf', 'file_type': 'pdf'}, page_content='Peer Critique Review: Vision to Road Zero - NYC Name: Ulindala Venkata Nikhil Kumar Reddy   Overall Assessment This project tackles an important and policy-relevant question: Which specific Vision Zero interventions in New York City actually reduce pedestrian injuries and fatalities at intersections and street segments? The team does a strong job framing the problem, identifying appropriate outcome and treatment variables, and recognizing that a true randomized controlled trial (RCT) is impossible in this context. The proposed approach-combining matching, pre/post comparisons, and Difference-in-Differences (DiD

In [5]:
### Split the text into chunks

def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into smaller chunks for better RAG performance"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show example of a chunk
    if split_docs:
        print(f"\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs

In [6]:
chunks=split_documents(all_pdf_documents)
chunks

Split 44 documents into 83 chunks

Example chunk:
Content: Peer Critique Review: Vision to Road Zero - NYC Name: Ulindala Venkata Nikhil Kumar Reddy   Overall Assessment This project tackles an important and policy-relevant question: Which specific Vision Zer...
Metadata: {'producer': 'macOS Version 26.1 (Build 25B78) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20251121225625Z00'00'", 'moddate': "D:20251121225625Z00'00'", 'source': '../data/pdf_files/Peer Critique Review 6 (1).pdf', 'total_pages': 14, 'page': 0, 'page_label': '1', 'source_file': 'Peer Critique Review 6 (1).pdf', 'file_type': 'pdf'}


[Document(metadata={'producer': 'macOS Version 26.1 (Build 25B78) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20251121225625Z00'00'", 'moddate': "D:20251121225625Z00'00'", 'source': '../data/pdf_files/Peer Critique Review 6 (1).pdf', 'total_pages': 14, 'page': 0, 'page_label': '1', 'source_file': 'Peer Critique Review 6 (1).pdf', 'file_type': 'pdf'}, page_content='Peer Critique Review: Vision to Road Zero - NYC Name: Ulindala Venkata Nikhil Kumar Reddy   Overall Assessment This project tackles an important and policy-relevant question: Which specific Vision Zero interventions in New York City actually reduce pedestrian injuries and fatalities at intersections and street segments? The team does a strong job framing the problem, identifying appropriate outcome and treatment variables, and recognizing that a true randomized controlled trial (RCT) is impossible in this context. The proposed approach-combining matching, pre/post comparisons, and Difference-in-Differences (DiD
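To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. The helper `chunk_with_overlap` is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on the listed separators, so its chunk boundaries will differ. The sketch only illustrates how consecutive chunks share text.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only: it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small sizes so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each printed piece begins with the last four characters of the previous one. That shared text is what lets a sentence cut at a chunk boundary still appear whole in at least one chunk.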

## Embedding and Vector Store DB

In [7]:
import sys
print(sys.executable)
print(sys.version)


/Users/uvnikhil/Desktop/RAG/.venv/bin/python
3.12.12 (main, Dec  9 2025, 19:05:33) [Clang 21.1.4 ]


In [8]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [9]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager
        
        Args:
            model_name: HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts
        
        Args:
            texts: List of text strings to embed
            
        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings


## Initialize the embedding manager

embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x11f9fa240>
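Retrieval later in this notebook compares a query embedding against the stored chunk embeddings. As a rough illustration of that comparison, the sketch below computes cosine similarity in plain Python on made-up three-dimensional vectors (the real model produces 384 dimensions). The function name `cosine_sim` is hypothetical, chosen so it does not shadow the `cosine_similarity` imported from scikit-learn above.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]     # orthogonal to the query

print(cosine_sim(query, doc_close))
print(cosine_sim(query, doc_far))
```

Cosine similarity ignores vector length, so `doc_close` scores about 1 even though it is twice as long as the query, while the orthogonal `doc_far` scores 0.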

## Vector Store

In [10]:
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store"""
    
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """
        Initialize the vector store
        
        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory to persist the vector store
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)
            
            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF document embeddings for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
            
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise
    
    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store
        
        Args:
            documents: List of LangChain documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store...")
        
        # Prepare data for ChromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []
        
        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)
            
            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)
            
            # Document content
            documents_text.append(doc.page_content)
            
            # Embedding
            embeddings_list.append(embedding.tolist())
        
        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")
            
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorstore=VectorStore()
vectorstore
    

Vector store initialized. Collection: pdf_documents
Existing documents in collection: 83


<__main__.VectorStore at 0x11ff6ad20>

In [11]:
### Convert the text to embeddings
texts = [doc.page_content for doc in chunks]

## Generate the embeddings

embeddings = embedding_manager.generate_embeddings(texts)

## Store in the vector database
vectorstore.add_documents(chunks, embeddings)

Generating embeddings for 83 texts...


Batches: 100%|██████████| 3/3 [00:00<00:00,  5.40it/s]

Generated embeddings with shape: (83, 384)
Adding 83 documents to vector store...
Successfully added 83 documents to vector store
Total documents in collection: 166
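The count above went from 83 to 166 because the collection already held 83 chunks from an earlier run, and `add_documents` generates a fresh random ID for every chunk, so re-running the cell stores each chunk a second time. One possible way to avoid this, sketched below under stated assumptions, is to derive the ID from the chunk itself so that a re-run produces the same IDs. The helper `stable_chunk_id` is hypothetical, and the plain dictionary stands in for the collection purely to show the overwrite behaviour.

```python
import hashlib


def stable_chunk_id(source: str, page: int, chunk_index: int, content: str) -> str:
    """Derive a deterministic ID from the chunk's origin and text (hypothetical helper)."""
    raw = f"{source}|{page}|{chunk_index}|{content}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"


# Made-up chunks standing in for the real ones
sample_chunks = [("a.pdf", 0, 0, "hello"), ("a.pdf", 1, 1, "world")]

store = {}  # dictionary used as a stand-in for the vector store
for run in range(2):  # simulate running the ingestion cell twice
    for source, page, index, text in sample_chunks:
        store[stable_chunk_id(source, page, index, text)] = text  # overwrite, not append

print(len(store))  # 2, not 4
```

Chroma collections also expose an `upsert` method that takes the same arguments as `add`, which should pair naturally with stable IDs. Verify the behaviour against the installed ChromaDB version before relying on it.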





## Retriever Pipeline from Vector Store

In [12]:
class RAGRetriever:
    """Handles query-based retrieval from the vector store"""
    
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the retriever
        
        Args:
            vector_store: Vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        """
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """
        Retrieve relevant documents for a query
        
        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold
            
        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")
        
        # Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            
            # Process results
            retrieved_docs = []
            
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]
                
                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
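                    # Convert distance to a similarity-like score. 1 - distance equals cosine
                    # similarity only if the collection uses cosine distance; Chroma's default
                    # space is squared L2 unless "hnsw:space" is set to "cosine".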
                    similarity_score = 1 - distance
                    
                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i + 1
                        })
                
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            
            return retrieved_docs
            
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []

rag_retriever=RAGRetriever(vectorstore,embedding_manager)

In [13]:
rag_retriever

<__main__.RAGRetriever at 0x11f2efd40>

In [14]:
rag_retriever.retrieve("Data Analyst Roadmap 2026")

Retrieving documents for query: 'Data Analyst Roadmap 2026'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 12.71it/s]

Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)





[{'id': 'doc_48e85150_31',
  'content': 'codebasics.io \nData Analyst Roadmap 2026 \nFollowing is the roadmap for anyone aspiring to become a Data Analyst, Business Analyst, \nor any form of Analytics Professional in 2026. It includes FREE learning resources for \ntechnical skills (or tool skills) and soft (or core) skills                        \nThis roadmap is designed based on the analysis of hundreds of Data Analyst jobs and our \nown experience of working on Data projects at AtliQ Technologies (https://www.atliq.com/). \nMore than 90% of our clients at AtliQ are SMEs (Small to medium-sized enterprises) based in \nthe USA. \nData Analyst Jobs Analysis: https://github.com/codebasics/job-\nscrapper/blob/main/Analysis/Data%20Analysis/data_visualizer.ipynb  \nLink to our full YouTube video for this roadmap: https://youtu.be/d8U2G2_7dnM  \nFind Your Suitability: Before starting your journey, it is important to understand whether \nthis role truly aligns with your natural strengths and 
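The retriever reports `similarity_score = 1 - distance`. How to read that number depends on the distance the collection uses. The sketch below assumes two things that the notebook does not confirm: that the collection uses squared L2 distance (Chroma's default when no space is configured) and that the embeddings have unit length. Under those assumptions, squared L2 distance `d` and cosine similarity are related by `cos = 1 - d / 2`. The helper names are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Recover cosine similarity from squared L2 distance (unit-length vectors only)."""
    return 1.0 - distance / 2.0


# Toy unit vectors, not real embeddings
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

distance = squared_l2(a, b)
direct_cosine = sum(x * y for x, y in zip(a, b))

print(distance)
print(direct_cosine, cosine_from_squared_l2(distance))
```

If both assumptions hold, `1 - distance` understates the cosine similarity, which matters when choosing `score_threshold`. Creating the collection with cosine distance instead (Chroma has supported this through the `hnsw:space` collection metadata key; check the installed version) would make `1 - distance` a true cosine similarity.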

## Integration VectorDB Context Pipeline with LLM output

In [29]:
### Simple RAG pipeline with Groq LLM
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

### Initialize the Groq LLM (set your GROQ_API_KEY in environment)
groq_api_key = os.getenv("GROQ_API_KEY")

llm=ChatGroq(groq_api_key=groq_api_key,model="llama-3.1-8b-instant",temperature=0.1,max_tokens=1024)

def rag_simple(query, retriever, llm, top_k=3):
    # Get docs (support both your custom retriever and LangChain retrievers)
    if hasattr(retriever, "retrieve"):
        results = retriever.retrieve(query, top_k=top_k)
        context = "\n\n".join([doc["content"] for doc in results]) if results else ""
    else:
        docs = retriever.invoke(query)  # LangChain retriever style
        docs = docs[:top_k]
        context = "\n\n".join([d.page_content for d in docs]) if docs else ""

    if not context:
        return "No relevant context found to answer the question."

    messages = [
        ("system", "Use the provided context to answer the question concisely."),
        ("human", f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"),
    ]

    response = llm.invoke(messages)
    return response.content


In [30]:
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv

load_dotenv()

# Use a currently supported Groq model ID
llm = ChatGroq(
    model="llama-3.1-8b-instant",   # or "llama-3.3-70b-versatile"
    api_key=os.getenv("GROQ_API_KEY"),
    temperature=0.1,
    max_tokens=1024,
)

# quick sanity check: this should NOT mention gemma
test = llm.invoke([("human", "say 'ok'")])
print("✅ model used:", test.response_metadata.get("model_name"))
print(test.content)


✅ model used: llama-3.1-8b-instant
ok


In [31]:
answer = rag_simple("What do we need to do to become a Data Analyst?", rag_retriever, llm)
print(answer)


Retrieving documents for query: 'What do we need to do to become a Data Analyst?'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  6.30it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents (after filtering)





To become a Data Analyst, you need to:

1. Check your suitability for the role by taking the short suitability test at https://codebasics.io/find-your-match-da.
2. If the results show that this career role matches you, proceed further with the roadmap.
3. Follow the 4-6 month roadmap, dedicating 4 hours of study every day, 6 days a week.

This will help you land a job in roles such as Data Analyst, Business Analyst, or other related positions.


In [16]:
import sys
print(sys.executable)


/Users/uvnikhil/Desktop/RAG/.venv/bin/python


In [18]:
import sys, subprocess, importlib

print("Kernel Python:", sys.executable)

subprocess.check_call([sys.executable, "-m", "pip", "install", "-U",
                       "langchain-groq", "python-dotenv"])

importlib.invalidate_caches()

from langchain_groq import ChatGroq
print("✅ ChatGroq import works now")


Kernel Python: /Users/uvnikhil/Desktop/RAG/.venv/bin/python


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


✅ ChatGroq import works now


## Enhanced RAG Pipeline

In [33]:
# --- Enhanced RAG Pipeline Features ---
def rag_advanced(query, retriever, llm, top_k=5, min_score=0.2, return_context=False):
    """
    RAG pipeline with extra features:
    - Returns answer, sources, confidence score, and optionally full context.
    """
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {'answer': 'No relevant context found.', 'sources': [], 'confidence': 0.0, 'context': ''}
    
    # Prepare context and sources
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page', 'unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300] + '...'
    } for doc in results]
    confidence = max([doc['similarity_score'] for doc in results])
    
    # Generate answer. The f-string already interpolates context and query, so no
    # further .format() call is needed (it would fail on text containing braces).
    prompt = f"""Use the following context to answer the question concisely.\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"""
    response = llm.invoke(prompt)
    
    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }
    if return_context:
        output['context'] = context
    return output

# Example usage:
result = rag_advanced("Data Analyst roadmap", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'Data Analyst roadmap'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.61it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents (after filtering)





Answer: Data Analyst Roadmap 2026 by codebasics.io, designed for aspiring Data Analysts, Business Analysts, or Analytics Professionals, including free learning resources for technical and soft skills.
Sources: [{'source': 'da_2026_roadmap.pdf', 'page': 0, 'score': 0.3374539613723755, 'preview': 'codebasics.io \nData Analyst Roadmap 2026 \nFollowing is the roadmap for anyone aspiring to become a Data Analyst, Business Analyst, \nor any form of Analytics Professional in 2026. It includes FREE learning resources for \ntechnical skills (or tool skills) and soft (or core) skills                     ...'}, {'source': 'da_2026_roadmap.pdf', 'page': 0, 'score': 0.3374539613723755, 'preview': 'codebasics.io \nData Analyst Roadmap 2026 \nFollowing is the roadmap for anyone aspiring to become a Data Analyst, Business Analyst, \nor any form of Analytics Professional in 2026. It includes FREE learning resources for \ntechnical skills (or tool skills) and soft (or core) skills                     ..
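The first two sources above are identical (same file, page, score, and preview), which is consistent with each chunk having been stored twice in the collection. Until the store is cleaned up, one option is to drop repeated chunks after retrieval so the prompt does not carry the same text twice. This is a minimal sketch with a hypothetical helper `dedupe_results`; the sample entries are made up and only mimic the shape of the dictionaries returned by `RAGRetriever.retrieve`.

```python
from typing import Any, Dict, List


def dedupe_results(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each distinct chunk text, preserving rank order."""
    seen = set()
    unique = []
    for item in results:
        if item["content"] in seen:
            continue  # same text was already kept from a higher-ranked result
        seen.add(item["content"])
        unique.append(item)
    return unique


# Made-up results in the same shape the retriever returns
sample_results = [
    {"id": "doc_a_0", "content": "Data Analyst Roadmap 2026", "similarity_score": 0.34},
    {"id": "doc_b_0", "content": "Data Analyst Roadmap 2026", "similarity_score": 0.34},
    {"id": "doc_c_1", "content": "Another chunk of text", "similarity_score": 0.21},
]

unique_results = dedupe_results(sample_results)
print([item["id"] for item in unique_results])  # ['doc_a_0', 'doc_c_1']
```

Because duplicates take up slots in `top_k`, it may help to request more results than needed and trim after deduplication.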

In [35]:
# --- Advanced RAG Pipeline: Streaming, Citations, History, Summarization ---
from typing import List, Dict, Any
import time

class AdvancedRAGPipeline:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.history = []  # Store query history

    def query(self, question: str, top_k: int = 5, min_score: float = 0.2, stream: bool = False, summarize: bool = False) -> Dict[str, Any]:
        # Retrieve relevant documents
        results = self.retriever.retrieve(question, top_k=top_k, score_threshold=min_score)
        if not results:
            answer = "No relevant context found."
            sources = []
            context = ""
        else:
            context = "\n\n".join([doc['content'] for doc in results])
            sources = [{
                'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
                'page': doc['metadata'].get('page', 'unknown'),
                'score': doc['similarity_score'],
                'preview': doc['content'][:120] + '...'
            } for doc in results]
            # Build the prompt (the f-string already interpolates context and question)
            prompt = f"""Use the following context to answer the question concisely.\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"""
            if stream:
                # Streaming simulation: this loop echoes the prompt in 80-character
                # slices; it does not stream the model's answer.
                print("Streaming answer:")
                for i in range(0, len(prompt), 80):
                    print(prompt[i:i+80], end='', flush=True)
                    time.sleep(0.05)
                print()
            # No .format() call: it would fail on context containing braces
            response = self.llm.invoke(prompt)
            answer = response.content
        
        # Add citations to answer
        citations = [f"[{i+1}] {src['source']} (page {src['page']})" for i, src in enumerate(sources)]
        answer_with_citations = answer + "\n\nCitations:\n" + "\n".join(citations) if citations else answer

        # Optionally summarize answer
        summary = None
        if summarize and answer:
            summary_prompt = f"Summarize the following answer in 2 sentences:\n{answer}"
            summary_resp = self.llm.invoke([summary_prompt])
            summary = summary_resp.content

        # Store query history
        self.history.append({
            'question': question,
            'answer': answer,
            'sources': sources,
            'summary': summary
        })

        return {
            'question': question,
            'answer': answer_with_citations,
            'sources': sources,
            'summary': summary,
            'history': self.history
        }

# Example usage:
adv_rag = AdvancedRAGPipeline(rag_retriever, llm)
result = adv_rag.query("Skills required to become a HR Data Analyst", top_k=3, min_score=0.1, stream=True, summarize=True)
print("\nFinal Answer:", result['answer'])
print("Summary:", result['summary'])
print("History:", result['history'][-1])

Retrieving documents for query: 'Skills required to become a HR Data Analyst'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 12.64it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents (after filtering)
Streaming answer:
Use the following context to answer the question concisely.
Context:
knowledge to explore diverse roles related to data analytics. For example, if you have 
domain knowledge in a specific field like HR, then HR Analytics could be a promising 
career opportunity as a data analyst. 
 
 
 
check out this resource to understand more about the career role related to data 
analytics: https://codebasics.io/resources/data-analytics-and-beyond-career-paths-
you-should-know

knowledge to explore diverse roles related to data analytics. For example, if you have 
domain knowledge in a specific field like HR, then HR Analytics could be a promising 
career opportunity as a data analyst. 
 
 
 
check out this resource to understand more about the career role related to data 
analytics: https://codebasics.io/resources/data-analytics-and-beyond-career-paths-
you-should-know

codebasics.io 
• Assignment  
☐ Finish all these exercises/projects from the resources provided.  
☐ Pick any domain (fintech, banking, healthcare, etc.) and write 5–7 clear 
questions you would ask a stakeholder to understand their requirement. 
Post it on LinkedIn in 3–4 lines and tag Codebasics. 
 
     Insight: Only 30 among 100 aspirants continue furth
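The "Streaming answer" output above is the prompt echoed in slices, not the model's reply. To stream the reply itself, LangChain chat models such as `ChatGroq` provide a `stream` method that yields message chunks with a `.content` attribute. The sketch below shows the consuming loop only. `FakeStreamingLLM` and `FakeChunk` are hypothetical stand-ins so the example runs without an API key, and `stream_answer` is a hypothetical helper, not part of the pipeline class.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FakeChunk:
    """Stand-in for a streamed message chunk; only `.content` is modelled."""
    content: str


class FakeStreamingLLM:
    """Stand-in for a chat model: yields a canned reply one word at a time."""

    def __init__(self, reply: str):
        self.reply = reply

    def stream(self, prompt: str) -> Iterator[FakeChunk]:
        for word in self.reply.split(" "):
            yield FakeChunk(content=word + " ")


def stream_answer(llm, prompt: str) -> str:
    """Print chunks as they arrive and return the assembled answer."""
    parts = []
    for chunk in llm.stream(prompt):
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()
    return "".join(parts)


fake_llm = FakeStreamingLLM("HR analytics needs domain knowledge")
streamed = stream_answer(fake_llm, "Question: which skills matter for HR analytics?")
```

With a real chat model, passing the `llm` object from this notebook in place of `fake_llm` should work the same way, since the loop relies only on `stream` and `.content`.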