## **Approach**

### 1. **Document Loading**
The Data Science interview questions and answers are fetched from GitHub.

### 2. **Document Splitting**
The loaded document is split into smaller, manageable chunks to allow for efficient search and retrieval. This is done with a Recursive Character Text Splitter, which tries a list of separators in order (paragraph breaks, line breaks, spaces, then individual characters) so that each chunk stays as coherent as possible. The text is split into chunks of up to 1000 characters with a small overlap between them to maintain context across chunks.

### 3. **Embedding Generation**
The system uses HuggingFace Embeddings to transform each document chunk into a dense vector representation. These embeddings capture the semantic meaning of the text and allow for similarity-based retrieval. The same HuggingFace model generates the embeddings for both the document chunks and the user queries.

### 4. **Vector Store with FAISS**
The embeddings are stored in a FAISS (Facebook AI Similarity Search) vector store, which enables fast similarity search. When a user issues a query, FAISS compares the query embedding to the stored document embeddings and retrieves the most similar chunks. FAISS handles large-scale vector search efficiently, which makes it well suited to querying the document database.

### 5. **Query Processing and Document Retrieval**
When a user inputs a query, the system transforms the query into an embedding using the same HuggingFace model. FAISS then performs a similarity search to retrieve the document chunks that are most semantically relevant to the query.

### 6. **Generative Response with Groq Language Model**
The retrieved chunks and the user's query are passed to a language model served by Groq, which generates an answer based on that context.

Note: Groq is used for faster inference.

## **RAG**
### **Retrieval**
The system retrieves relevant context (document chunks) using FAISS, which ensures the answer is grounded in the provided documents.
### **Generation**
The Groq LLM generates a coherent answer based on the retrieved context and the user's query.
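
The retrieve-then-generate flow can be sketched without any external library. This is a toy illustration only: a word-overlap score stands in for the HuggingFace embeddings and the FAISS search, and the function names (`score`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical, not taken from LangChain.

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval step: return the k chunks with the highest score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Input to the generation step: retrieved chunks plus the question in one prompt."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n\n{joined}\n\nQuestion: {query}"


chunks = [
    "Entropy is the measure of impurity in a set of examples.",
    "The kernel trick avoids transforming data to a higher dimensional space.",
    "Information gain is the reduction in entropy after a split.",
]
query = "what is entropy"
top_chunks = retrieve(query, chunks, k=2)
prompt = build_prompt(query, top_chunks)  # this string would be sent to the LLM
print(prompt)
```

In the notebook itself, the scoring is done on dense embedding vectors and the prompt is assembled by the QA chain rather than by hand.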





## Installing Dependencies and libraries

In [1]:
!pip install -q langchain langchain_core langchain_community sentence_transformers faiss-cpu chromadb langchain_groq


## Setting Up the GROQ API Key


In [2]:
import getpass
import os

if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("Provide your GROQ API TOKEN")

Provide your GROQ API TOKEN··········


## Fetching Data Science Interview QnA Text Data

In [3]:
import requests

url = "https://raw.githubusercontent.com/youssefHosni/Data-Science-Interview-Questions-Answers/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md"
res = requests.get(url)
res.raise_for_status()  # stop early if the download failed
with open("ds_interview_ques.txt", "w") as f:
    f.write(res.text)

In [4]:
# Document Loader (imported from langchain_community; the langchain.document_loaders path is deprecated)
from langchain_community.document_loaders import TextLoader
loader = TextLoader('./ds_interview_ques.txt')
documents = loader.load()

In [5]:
documents[:]

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='# Machine Learning Interview Questions & Answers for Data Scientists #\n\n## Questions ##\n* [Q1: Mention three ways to make your model robust to outliers?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-mention-three-ways-to-make-your-model-robust-to-outliers)\n* [Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-describe-the-motivation-behind-random-forests-and-mention-two-reasons-why-they-are-better-than-individual-decision-trees)\n* [Q3: What are the differences and similarities between gradient boosting and random forest? and what are the 

## Wrap text while preserving newlines using the `textwrap` module in Python.


*  **Preserve Line Breaks:** The function ensures that existing line breaks in the text are maintained while formatting each line to a specified width, which is useful for readability and consistent formatting.

* **Control Line Length:** By wrapping lines to a maximum width, the function helps in managing text presentation, making it suitable for outputs like console logs, text files, or documents where line length consistency is important.















In [6]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

In [7]:
print(wrap_text_preserve_newlines(str(documents[0])))

page_content='# Machine Learning Interview Questions & Answers for Data Scientists #

## Questions ##
* [Q1: Mention three ways to make your model robust to outliers?](https://github.com/youssefHosni/Data-
Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for
%20Data%20Scientists.md#q1-mention-three-ways-to-make-your-model-robust-to-outliers)
* [Q2: Describe the motivation behind random forests and mention two reasons why they are better than
individual decision trees?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main
/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-describe-the-
motivation-behind-random-forests-and-mention-two-reasons-why-they-are-better-than-individual-decision-trees)
* [Q3: What are the differences and similarities between gradient boosting and random forest? and what are the
advantages and disadvantages of each when compared to each 

## Use the `RecursiveCharacterTextSplitter` from LangChain to split documents into chunks, trying each separator in turn as a fallback
- **`chunk_size`**: Maximum size of each chunk (1000 characters).
- **`chunk_overlap`**: Number of overlapping characters between chunks (10).
- **`separators`**: Ordered list of separators to try: paragraph breaks (`"\n\n"`), line breaks (`"\n"`), spaces (`" "`), and finally the empty string (`""`), which splits between individual characters.
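
The fallback idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, drops empty pieces between repeated separators, and the function name `recursive_split` is made up for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, finer))
        elif part:
            pieces.append(part)

    # Greedily merge neighbouring pieces while the result still fits in one chunk
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Entropy measures impurity.\n\n"
    "Information gain is the reduction in entropy after a split.\n\n"
    "Short."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The second paragraph is longer than 40 characters, so it falls through to the space separator and is cut at a word boundary; the other paragraphs are kept whole.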


In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try paragraph breaks first, then line breaks, then spaces; fall back to single characters if necessary
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""]
)
docs = text_splitter.split_documents(documents)

In [9]:
len(docs)

70

In [10]:
docs[23]

Document(metadata={'source': './ds_interview_ques.txt'}, page_content='Answer:\n\nLogistic regression is used to calculate the probability of occurrence of an event in the form of a dependent output variable based on independent input variables. Logistic regression is commonly used to estimate the probability that an instance belongs to a particular class. If the probability is bigger than 0.5 then it will belong to that class (positive) and if it is below 0.5 it will belong to the other class. This will make it a binary classifier.\n\nIt is important to remember that the Logistic regression isn\'t a classification model, it\'s an ordinary type of regression algorithm, and it was developed and used before machine learning, but it can be used in classification when we put a threshold to determine specific categories"\n\nThere is a lot of classification applications to it:\n\nClassify email as spam or not, To identify whether the patient is healthy or not, and so on.')

## Create Embeddings

In [11]:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()



## Create a **FAISS** vector store **(Database)** from documents using embeddings and perform a similarity search with a query to find relevant documents.


In [12]:
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)

query = "Mention three ways to make your model robust to outliers."
docs = db.similarity_search(query)
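
Conceptually, the similarity search ranks the stored vectors by their distance to the query vector. Below is a minimal sketch, assuming an exhaustive Euclidean-distance scan (what a flat index does; other FAISS index types trade exactness for speed). The 2-D vectors and the function name `top_k_l2` are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math


def top_k_l2(query_vec, stored_vecs, k=4):
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    distances = [
        (i, math.dist(query_vec, vec))  # Euclidean (L2) distance
        for i, vec in enumerate(stored_vecs)
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:k]


stored = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
query_vec = (1.0, 0.05)
nearest = top_k_l2(query_vec, stored, k=2)
print(nearest)  # indices 1 and 2 are the closest stored vectors
```

Each returned index would map back to one document chunk, which is what `similarity_search` hands to the next step.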

In [13]:
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='-------------------------------------------------------------------------------------------------------------------------------------------------------------\n\n## Questions & Answers ##\n\n### Q1: Mention three ways to make your model robust to outliers. ###\n\nInvestigating the outliers is always the first step in understanding how to treat them. After you understand the nature of why the outliers occurred you can apply one of the several methods mentioned [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#11:~:text=for%20large%20datasets.-,Bonus%20Question%3A%20Discuss%20how%20to%20make%20your%20model%20robust%20to%20outliers.,-There%20are%20several).\n\n### Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees. ###'),
 Document(metadata={'source': './ds_interview_ques.txt'}, p

## Create LLM

In [14]:
from langchain_groq import ChatGroq
from langchain.chains.question_answering import load_qa_chain

GROQ_LLM = ChatGroq(
    api_key=os.getenv("GROQ_API_KEY"),
    model="gemma2-9b-it",
)

# Load a QA chain with the specified language model and chain type for question answering.
chain = load_qa_chain(GROQ_LLM, chain_type="stuff")



## Q1. What is information gain and entropy in the context of decision trees?

In [28]:
query = "what is information gain and entropy in the context of decision trees?"

# Perform a similarity search in the FAISS vector store to retrieve relevant documents based on the query
docs = db.similarity_search(query)

# Use the QA chain to process the retrieved documents and generate an answer to the query
chain.run(input_documents=docs, question=query)


Here's a breakdown of information gain and entropy in the context of decision trees, based on the provided text:

**Entropy**

* **Measure of Impurity:**  Entropy essentially quantifies the disorder or uncertainty within a set of data points. In decision trees, it measures how mixed the classes are at a particular node. A node with high entropy means the data points are spread across many different classes, indicating higher uncertainty.

* **Goal:** The goal of building a decision tree is to reduce entropy as you move down the tree. Each split (decision) should lead to nodes with lower entropy, meaning the data points are becoming more homogeneous (grouped into fewer, more distinct classes).

**Information Gain**

* **Measuring Split Effectiveness:** Information gain calculates how much the entropy *decreases* when you split the data based on a specific feature.

* **Choosing the Best Split:**  Decision trees use information gain to determine the best feature to split on at each node. The feature that results in the largest reduction in entropy (the highest information gain) is chosen. This ensures that the tree is built in a way that progressively separates the data into more distinct classes.


**Analogy:**

Imagine a bag of mixed fruit (high entropy).  You want to sort them into separate piles (low entropy).

* **Feature:**  The feature you choose to split on (e.g., color, size).
* **Split:**  Dividing the fruit based on the chosen feature (e.g., red apples separate from green apples).
* **Information Gain:** How much "cleaner" your piles become after the split (i.e., how much the overall entropy decreases).
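
The two quantities in the answer above can be made concrete with a short calculation. This is a minimal sketch using the standard Shannon entropy in bits; the fruit labels echo the analogy, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["apple"] * 4 + ["pear"] * 4
perfect_split = [["apple"] * 4, ["pear"] * 4]
poor_split = [["apple", "apple", "pear", "pear"], ["apple", "apple", "pear", "pear"]]

print(entropy(parent))                          # 1.0 bit: classes are evenly mixed
print(information_gain(parent, perfect_split))  # 1.0: both children are pure
print(information_gain(parent, poor_split))     # 0.0: children are as mixed as the parent
```

A decision tree would prefer the first split, since it removes all of the uncertainty in the parent node.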




In [29]:
# Retrieved Document
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='### Q9: Explain what is information gain and entropy in the context of decision trees. ###\nEntropy and Information Gain are two key metrics used in determining the relevance of decision-making when constructing a decision tree model and determining the nodes and the best way to split.\n\nThe idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until we reach a small enough set that contains data points that fall under one label.'),
 Document(metadata={'source': './ds_interview_ques.txt'}, page_content='Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples. Entropy controls how a Decision Tree decides to split the data.\nInformation gain calculates the reduction in entropy or surprise from transforming a dataset in some way. It is commonly used in the construction of decision trees from a training dataset, by evaluating the informat

## Q2. Explain the kernel trick in SVM.

In [32]:
query = "Explain the kernel trick in SVM. "

docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)


The kernel trick in SVM allows us to perform calculations in a higher dimensional space without actually transforming the data into that space.  

Here's why we use it and how to choose a kernel:

**Why use the kernel trick?**

* **High dimensionality:**  Sometimes, data is not linearly separable in the original space.  By transforming it into a higher dimensional space, we might find a linear separation.  However, this transformation can be computationally expensive.
* **Efficiency:** The kernel trick calculates the dot product in the higher dimensional space directly, without performing the explicit transformation. This is much faster and more memory-efficient.

**Choosing a kernel:**

The choice of kernel depends on the nature of the data and the problem:

* **Linear kernel:**  Suitable for linearly separable data.
* **Polynomial kernel:**  Can capture non-linear relationships by increasing the dimensionality. The degree of the polynomial determines the complexity.
* **Radial basis function (RBF) kernel:**  Very flexible and commonly used. It creates a "similarity" measure between data points, allowing for complex decision boundaries.
* **Sigmoid kernel:**  Similar to the activation function in neural networks, often used in bioinformatics.

**Common practice:**

It's often necessary to experiment with different kernels and their parameters (e.g., the degree of the polynomial or the width of the RBF) to find the best performing kernel for a particular dataset.
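
The claim that a kernel gives the higher-dimensional dot product without the explicit transformation can be checked numerically. The sketch below assumes a degree-2 homogeneous polynomial kernel on 2-D points; the feature map `phi` is the standard one for that kernel, and all names are chosen for this example.

```python
import math


def phi(x):
    """Explicit degree-2 feature map taking a 2-D point into 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))  # dot product computed in the 3-D feature space
implicit = poly_kernel(x, y)    # same quantity computed in the original 2-D space
print(explicit, implicit)       # both are 121 up to floating-point rounding
```

The kernel needs one 2-D dot product and a square, while the explicit route has to build both 3-D vectors first; the gap grows quickly with the degree and the input dimension.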




In [33]:
# Retrieved Document
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='Typically without the kernel trick, in order to calculate support vectors and support vector classifiers, we need first to transform data points one by one to the higher dimensional space, do the calculations based on SVM equations in the higher dimensional space, and then return the results. The ‘trick’ in the kernel trick is that we design the kernels based on some conditions as mathematical functions that are equivalent to a dot product in the higher dimensional space without even having to transform data points to the higher dimensional space. i.e. we can calculate support vectors and support vector classifiers in the same space where the data is provided which saves a lot of time and calculations.'),
 Document(metadata={'source': './ds_interview_ques.txt'}, page_content="**Nonparametric models** don't assume anything about the function from which the dataset was sampled. For these models, the number of paramet

## Q3. What are Loss Functions and Cost Functions?

In [38]:
query = "What are Loss Functions and Cost Functions?"

docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The provided text explains the difference between loss functions and cost functions:\n\n* **Loss function:** Measures the performance of the model on a single training example.\n* **Cost function:**  Averages the loss function over all training examples (or a batch of examples in mini-batch gradient descent). \n\n\nLet me know if you have other questions. \n'

The provided text explains the difference between loss functions and cost functions:

* **Loss function:** Measures the performance of the model on a single training example.
* **Cost function:**  Averages the loss function over all training examples (or a batch of examples in mini-batch gradient descent).
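
The distinction can be shown with squared error, one of the loss functions the source document mentions. A minimal sketch with made-up numbers and illustrative function names:

```python
def squared_error(y_true, y_pred):
    """Loss: the error for a single training example."""
    return (y_true - y_pred) ** 2


def mse_cost(targets, predictions):
    """Cost: the average of the per-example losses over the whole set or batch."""
    losses = [squared_error(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


targets = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]

per_example = [squared_error(t, p) for t, p in zip(targets, predictions)]
print(per_example)                     # [1.0, 0.0, 4.0]
print(mse_cost(targets, predictions))  # the mean of the three losses, 5/3
```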



In [39]:
# Retrieved Document
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='To scale the data, normalization, and standardization are the most popular approaches.\n![SVM scaled Vs non scaled](https://user-images.githubusercontent.com/72076328/192571498-4a939472-7bb1-4bf2-963f-a6e6394802ba.png)\n\n### Q27: What are Loss Functions and Cost Functions? Explain the key Difference Between them. ###\n\nAnswer:\nThe loss function is the measure of the performance of the model on a single training example, whereas the cost function is the average loss function over all training examples or across the batch in the case of mini-batch gradient descent.\n\nSome examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc.\n\nWhereas, the cost function is the average of the above loss functions over training examples.\n\n### Q28: What is the importance of batch in machine learning and explain some batch-dependent gradient descent algorithms? ###'),
 Document(metadata={'source': './ds_int

## Q4. What are the assumptions made by the ARIMA model?

In [23]:
query = "What are the assumptions made by the ARIMA model?"

docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

"The provided text lists the assumptions made by the ARIMA model:\n\n* **Normally Distributed Residuals:** The residuals follow a normal distribution with a mean of zero.\n* **Stationarity:** The time series is stationary, meaning its statistical properties remain constant over time.\n* **Linearity:** The relationship between observations and lagged values is linear.\n* **No Autocorrelation in Residuals:** The residuals are not correlated with each other. \n\n\nLet me know if you'd like more detail on any of these assumptions. \n"

The provided text lists the assumptions made by the ARIMA model:

* **Normally Distributed Residuals:** The residuals follow a normal distribution with a mean of zero.
* **Stationarity:** The time series is stationary, meaning its statistical properties remain constant over time.
* **Linearity:** The relationship between observations and lagged values is linear.
* **No Autocorrelation in Residuals:** The residuals are not correlated with each other.


Let me know if you'd like more detail on any of these assumptions.
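
As a rough illustration of the last assumption, the lag-1 autocorrelation of a residual series can be computed directly. This is a sketch on made-up residuals, not output from a fitted ARIMA model, and the function name is hypothetical; values near zero are consistent with uncorrelated residuals, while values near +1 or -1 are not.

```python
def lag1_autocorrelation(residuals):
    """Sample autocorrelation between each residual and the one that follows it."""
    mean = sum(residuals) / len(residuals)
    centered = [r - mean for r in residuals]
    numerator = sum(a * b for a, b in zip(centered, centered[1:]))
    denominator = sum(c * c for c in centered)
    return numerator / denominator


alternating = [1.0, -1.0] * 4  # residuals that flip sign at every step
print(lag1_autocorrelation(alternating))  # -0.875, strongly autocorrelated
```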


In [24]:
# Retrieved Document
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content="ARIMA models are widely used in forecasting applications, but they do make certain assumptions about the underlying data, such as linearity and stationarity. It's important to validate these assumptions and adjust the model accordingly if they are not met.\n![1-1](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/assets/72076328/12707951-bdf5-4cd1-9efd-c60c465007a3)\n\n### Q36: What are the assumptions made by the ARIMA model? ###\nAnswer:\n\nThe ARIMA model makes several assumptions about the underlying time series data. These assumptions are important to ensure the validity and accuracy of the model's results. Here are the key assumptions:"),
 Document(metadata={'source': './ds_interview_ques.txt'}, page_content="Normally Distributed Residuals: The ARIMA model assumes that the residuals follow a normal distribution with a mean of zero. This assumption is necessary for statistical inference,

## Q5. Discuss two clustering algorithms that can scale to large datasets.


In [42]:
query = "Discuss two clustering algorithms that can scale to large datasets."

docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

"The provided context discusses two clustering algorithms that can scale to large datasets: \n\n* **Minibatch Kmeans:** Uses mini-batches of data instead of the entire dataset at each iteration, speeding up the process and allowing for larger datasets.\n* **BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):** Creates a compact summary of the large dataset, clustering the summary instead of the full dataset.  \n\n\nLet me know if you'd like more details about either of these algorithms. \n"

The provided context discusses two clustering algorithms that can scale to large datasets:

* **Minibatch Kmeans:** Uses mini-batches of data instead of the entire dataset at each iteration, speeding up the process and allowing for larger datasets.
* **BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):** Creates a compact summary of the large dataset, clustering the summary instead of the full dataset.  
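
The mini-batch idea can be sketched in a few lines. This is a simplified one-dimensional version with a per-centroid learning rate of 1/count, written for illustration; it is not Scikit-Learn's `MiniBatchKMeans`, and the data, starting centroids, and function name are invented.

```python
import random


def minibatch_kmeans_1d(data, centroids, batch_size=10, iterations=50, seed=0):
    """Update centroids from small random batches instead of the full dataset."""
    rng = random.Random(seed)
    centroids = list(centroids)
    counts = [0] * len(centroids)  # how many points each centroid has absorbed
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)
        for x in batch:
            # Assign the point to its nearest centroid
            j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            counts[j] += 1
            step = 1.0 / counts[j]  # steps shrink as the centroid sees more points
            centroids[j] += step * (x - centroids[j])
    return centroids


data_rng = random.Random(42)
data = [data_rng.uniform(-0.5, 0.5) for _ in range(100)]
data += [data_rng.uniform(9.5, 10.5) for _ in range(100)]

centroids = minibatch_kmeans_1d(data, centroids=[1.0, 9.0])
print(centroids)  # one centroid close to 0, the other close to 10
```

Each iteration touches only `batch_size` points, which is why the approach scales to datasets that do not fit in memory.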

In [44]:
# Retrieved Document
docs

[Document(metadata={'source': './ds_interview_ques.txt'}, page_content='Answer:\n\n**Minibatch Kmeans:**  Instead of using the full dataset at each iteration, the algorithm\nis capable of using mini-batches, moving the centroids just slightly at each iteration.\nThis speeds up the algorithm typically by a factor of 3 or 4 and makes it\npossible to cluster huge datasets that do not fit in memory. Scikit-Learn implements\nthis algorithm in the MiniBatchKMeans class.\n\n**Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)**\xa0\nis a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of clustering the larger dataset.'),
 Document(metadata={'source': './ds_interview_ques.txt'}, page_content='* [Q18: You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choos