
# **NLP Test Questions & Answers**

## 1. What do you understand by Text Processing? Write code to perform text processing.


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

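# Tokenizer models and stopword list.
# Newer NLTK releases look for 'punkt_tab' when word_tokenize is called.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')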

def text_processing(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Perform stemming
    porter = PorterStemmer()
    tokens = [porter.stem(token) for token in tokens]

    return tokens

# Given text
text = "If you want to see a sunrise, Ken said, we can go hiking in the morning next time."

# Process the text
processed_text = text_processing(text)

# Display the processed text
print("Processed Text:")
print(processed_text)


## 2. What do you understand by NLP toolkits and the spaCy library? Write code that uses either one.

**Natural Language Processing (NLP) Toolkit**

An NLP toolkit is a collection of libraries and tools designed to process and analyze natural language data. It typically includes functionalities for tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and more. NLTK (Natural Language Toolkit) is a popular Python library for NLP, offering a wide range of tools and resources for linguistic tasks.

**spaCy Library**

spaCy is an open-source natural language processing library designed for production use and known for its speed and ease of use. It supports a range of NLP tasks and includes a fast statistical system for named entity recognition (NER).



In [12]:
import spacy
from spacy import displacy

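# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
NER = spacy.load("en_core_web_sm")
raw_text = "The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru.Apple tops Indian smartphone market by revenue in 2023"
text1 = NER(raw_text)

# Print each detected entity with its label
for word in text1.ents:
    print(word.text, " : ", word.label_)

# Highlight the entities inline in the notebook
displacy.render(text1, style="ent", jupyter=True)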


The Indian Space Research Organisation  :  ORG
India  :  GPE
Bengaluru  :  GPE
Apple  :  ORG
Indian  :  NORP
2023  :  DATE


## 3. Describe Neural Networks and Deep Learning in depth.

### **Neural Networks**

A Neural Network (NN) is a computational model inspired by the human brain's structure and functioning. It consists of interconnected nodes, or artificial neurons, organized into layers. The three main types of layers are the input layer, hidden layers, and output layer. Neurons in each layer are connected with weighted edges, and each connection has an associated weight that the network learns during training.

* Feedforward Neural Networks (FNN): In an FNN, information flows in one direction, from the input layer through the hidden layers to the output layer. FNNs are commonly used for tasks like image and speech recognition.

* Recurrent Neural Networks (RNN): RNNs have connections that form directed cycles, allowing them to capture sequential dependencies. They are well-suited for tasks involving sequential data, such as time series analysis and natural language processing.

* Convolutional Neural Networks (CNN): CNNs are designed for tasks involving grid-like data, such as images. They use convolutional layers to efficiently learn hierarchical representations of patterns.
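
To make the layered structure described above concrete, here is a minimal sketch of a forward pass through a feedforward network, written with only the Python standard library. The layer sizes and all weight values are hypothetical and chosen by hand for illustration. A trained network would learn them.

```python
import math

def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
W_hidden = [[1.0, -1.0], [0.5, 0.5]]
b_hidden = [0.0, -0.5]
W_out = [[1.0, 1.0]]
b_out = [0.0]

def forward(x):
    """Input layer -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = dense(x, W_hidden, b_hidden, relu)
    return dense(hidden, W_out, b_out, sigmoid)

y = forward([1.0, 0.0])
print(y)
```

Each row of a weight matrix holds the incoming connection weights of one neuron, which is the "weighted edges" picture from the description above.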

### **Deep Learning**

Deep Learning (DL) is a subfield of machine learning that focuses on neural networks with many layers, often referred to as deep neural networks. Deep learning has gained prominence due to its ability to automatically learn hierarchical representations from data, enabling the extraction of complex features.

* Deep Neural Networks (DNN):

    A deep neural network typically has more than one hidden layer, allowing it to model intricate relationships in the data. It excels in tasks with large amounts of labeled data.

* Training and Backpropagation:

    Deep learning models are trained with backpropagation and an optimization algorithm such as gradient descent. Backpropagation computes the gradient of the loss with respect to each weight, and gradient descent uses those gradients to update the weights so that the difference between predicted and actual outputs shrinks.

* Activation Functions:

    Non-linear activation functions (e.g., ReLU, Sigmoid, Tanh) introduce non-linearity to the model, enabling it to learn complex mappings between inputs and outputs.

* Loss Functions:

    Loss functions measure the difference between predicted and actual values. Common loss functions include Mean Squared Error (MSE) for regression tasks and Cross-Entropy for classification tasks.

* Applications:

    Deep learning has achieved remarkable success in various domains, including computer vision (object recognition), natural language processing (language translation, sentiment analysis), speech recognition, and reinforcement learning.
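
The training, activation, and loss items above can be tied together in one small sketch. It trains a network with one hidden layer on the XOR problem, using tanh and sigmoid activations, cross-entropy loss, and full-batch gradient descent with hand-written backpropagation. The hidden size, learning rate, step count, and random seed are arbitrary assumptions for illustration, not recommended settings.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# XOR inputs and targets: not linearly separable, so a hidden layer is needed.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0.0, 1.0, 1.0, 0.0]

H = 3  # number of hidden units (arbitrary choice for this sketch)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = [0.0]  # one-element list so train_step can update it in place

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    """Return hidden activations (tanh) and the output probability (sigmoid)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(H)]
    p = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2[0])
    return h, p

def mean_loss():
    """Binary cross-entropy averaged over the four training examples."""
    total = 0.0
    for x, y in zip(X, Y):
        _, p = forward(x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() away from 0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(X)

def train_step(lr):
    """One full-batch gradient descent step using backpropagation."""
    gW1 = [[0.0, 0.0] for _ in range(H)]
    gb1 = [0.0] * H
    gW2 = [0.0] * H
    gb2 = 0.0
    for x, y in zip(X, Y):
        h, p = forward(x)
        dz2 = p - y  # gradient at the output pre-activation (sigmoid + cross-entropy)
        gb2 += dz2
        for j in range(H):
            gW2[j] += dz2 * h[j]
            dz1 = dz2 * W2[j] * (1.0 - h[j] ** 2)  # chain rule back through tanh
            gb1[j] += dz1
            for i in range(2):
                gW1[j][i] += dz1 * x[i]
    n = len(X)
    for j in range(H):
        W2[j] -= lr * gW2[j] / n
        b1[j] -= lr * gb1[j] / n
        for i in range(2):
            W1[j][i] -= lr * gW1[j][i] / n
    b2[0] -= lr * gb2 / n

initial_loss = mean_loss()
for _ in range(3000):
    train_step(lr=0.3)
final_loss = mean_loss()
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In practice a framework computes these gradients automatically. Writing them out by hand here only shows what backpropagation does at each layer.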

**Challenges:**

* Computational Resources: Deep learning models demand significant computational power for training, which can be resource-intensive.

* Interpretability: As deep networks become more complex, interpreting their decision-making processes becomes challenging.

* Data Requirements: Deep learning models often require large amounts of labeled data for effective training, which may be a limitation in some domains.

In summary, neural networks form the foundation of deep learning, with deep neural networks exhibiting remarkable capabilities in learning complex representations from data, leading to state-of-the-art performance in various tasks.


## 4. What do you understand by Hyperparameter Tuning?

Hyperparameter tuning is a critical phase in machine learning, involving the adjustment of external configuration settings, or hyperparameters, to optimize model performance. Hyperparameters, distinct from internal parameters learned during training, significantly influence a model's behavior.

* **Grid search** evaluates every combination of predefined hyperparameter values, which is exhaustive over the chosen grid but computationally expensive.
* **Random search** samples hyperparameter values at random from specified ranges, which is often more efficient when the compute budget is limited.
* **Bayesian optimization** fits a probabilistic model to the results of past evaluations and uses it to choose the next configuration to try, which suits models that are expensive to evaluate.
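
As a rough illustration of the first two strategies, the sketch below runs grid search and random search with only the standard library. The function `validation_score` is a hypothetical stand-in for "train a model with these settings and return its validation score", and the hyperparameter names and ranges are assumptions made up for the example. Bayesian optimization is not shown.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on validation data.
    By construction the score peaks at learning_rate=0.1 and depth=5."""
    return -((learning_rate - 0.1) ** 2) * 100 - ((depth - 5) ** 2) * 0.01

# Grid search: evaluate every combination of the predefined values.
grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [3, 5, 7]}
grid_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]
best_grid = max(grid_candidates, key=lambda params: validation_score(**params))

# Random search: sample the same number of configurations from ranges instead.
rng = random.Random(0)
random_candidates = [
    {"learning_rate": 10 ** rng.uniform(-3, 0), "depth": rng.randint(2, 8)}
    for _ in range(len(grid_candidates))
]
best_random = max(random_candidates, key=lambda params: validation_score(**params))

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.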

## 5. What do you understand by Ensemble Learning?

Ensemble Learning is a machine learning technique that involves combining the predictions of multiple individual models to create a stronger, more robust, and accurate model. The main idea is that by aggregating the predictions of several models, the ensemble can often outperform any individual model within the ensemble. This approach is particularly beneficial when dealing with complex datasets and diverse patterns that might not be captured effectively by a single model.

The predictions from individual models are combined through techniques such as averaging, voting, or weighted averaging.

**Common Ensemble Techniques:**
* Bagging (Bootstrap Aggregating)

> * Train multiple instances of the same model on different subsets of the training data (selected with replacement).
> * Example: Random Forests, which are an ensemble of decision trees trained on different bootstrap samples of the data.

* Boosting

> * Train multiple weak learners sequentially, with each learner focusing on correcting the errors of its predecessor.
> * Example: AdaBoost (Adaptive Boosting) and Gradient Boosting. AdaBoost increases the weights of misclassified instances so that later learners focus on them, while Gradient Boosting fits each new learner to the residual errors (the gradient of the loss) of the current ensemble.

* Stacking

> * Combine predictions from multiple models using another model (meta-model) that takes these predictions as input.
> * Example: Train multiple diverse models, obtain their predictions, and use these predictions as features to train a meta-model.
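
Bagging with majority voting, the first technique listed above, can be sketched in a few lines. The base model here is a one-feature threshold classifier (a "decision stump"), and the toy dataset, the number of models, and the seed are assumptions chosen for illustration. Boosting and stacking are not shown.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier that predicts 1 when x >= threshold.
    Tries every observed x as the threshold and keeps the most accurate one."""
    best_threshold, best_correct = None, -1
    for threshold, _ in points:
        correct = sum((x >= threshold) == bool(label) for x, label in points)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def bagging_fit(points, n_models, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    return [
        fit_stump([rng.choice(points) for _ in points])
        for _ in range(n_models)
    ]

def bagging_predict(thresholds, x):
    """Combine the individual stump predictions by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: the label is 1 when x >= 5.
data = [(float(x), int(x >= 5)) for x in range(10)]
rng = random.Random(42)
ensemble = bagging_fit(data, n_models=15, rng=rng)

predictions = [bagging_predict(ensemble, x) for x, _ in data]
print("thresholds learned by the stumps:", ensemble)
print("ensemble predictions:", predictions)
```

Because every stump sees a different bootstrap sample, the learned thresholds vary, and the vote smooths out the disagreement between them.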

**Advantages of Ensemble Learning:**
* Improved Accuracy: Ensemble methods often lead to higher accuracy than individual models, especially when models are diverse.

* Robustness: Ensembles are less susceptible to overfitting on specific patterns in the training data.

* Generalization: Ensemble methods generalize well to new, unseen data, making them suitable for a wide range of applications.

* Handling Noise: Ensemble methods can be effective in handling noisy data and outliers.

**Considerations:**
* Computational Cost: Training multiple models can be computationally expensive, especially for large datasets.

* Interpretability: Ensembles might sacrifice interpretability, as understanding the reasoning behind the combined predictions can be complex.

Ensemble Learning has proven to be a powerful tool in machine learning, and its applications extend to various domains, contributing to the success of models in predictive tasks.

## 6. What do you understand by Model Evaluation and Selection?

### **Model Evaluation**

Model evaluation is the process of assessing the performance of a trained machine learning model. It involves comparing the model's predictions to the actual outcomes on a set of data reserved for evaluation, typically the test set.

Goals:

1) Performance Assessment: Understand how well the model generalizes to unseen data and performs on tasks it was trained for.

2) Identify Weaknesses: Identify areas where the model may struggle, such as handling specific classes or making accurate predictions under certain conditions.

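3) Optimization: Use evaluation results to fine-tune hyperparameters, modify the model architecture, or adjust preprocessing steps. These decisions should be based on a separate validation set so that the test set remains an unbiased estimate of final performance.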

**Common Metrics:**

* Accuracy: The proportion of correctly classified instances out of the total instances.
* Precision: The ratio of correctly predicted positive observations to the total predicted positives (relevant in binary and multiclass classification).
* Recall (Sensitivity or True Positive Rate): The ratio of correctly predicted positive observations to all observations that are actually positive.
* F1 Score: The harmonic mean of precision and recall, providing a balance between precision and recall.
* Area Under the ROC Curve (AUC-ROC): Evaluates the trade-off between true positive rate and false positive rate.
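
The first four metrics above follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them for a binary task from scratch. The label lists are made-up example data, and AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Returning 0.0 when a denominator is zero is one possible convention. Other implementations may raise a warning or return NaN instead.
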

### **Model Selection**

Model selection is the process of choosing the best-performing model among different candidate models. It involves training multiple models with different configurations or algorithms and selecting the one that performs the best on the evaluation metrics.

Goals:

1) Optimal Model Choice: Choose a model that strikes the right balance between bias and variance and generalizes well to new, unseen data.

2) Avoid Overfitting: Ensure that the selected model doesn't overfit the training data and performs well on diverse datasets.

3) Balancing Complexity: Consider the trade-off between model complexity and performance.

Methods:

* Cross-Validation: Split the dataset into multiple subsets, train the model on different subsets, and evaluate on the remaining data. This helps assess how well the model generalizes to different partitions of the data.
* Grid Search: Systematically test different hyperparameter combinations to find the set that results in the best model performance.
* Random Search: Similar to grid search, but instead of testing all possible combinations, random search tests a random subset of hyperparameter combinations.
* Ensemble Methods: Combine predictions from multiple models (e.g., Random Forests or Gradient Boosting) to create a more robust and accurate model.
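
Cross-validation used for model selection can be sketched as follows. The candidate models are k-nearest-neighbour regressors that differ only in the number of neighbours, and the one with the lowest cross-validated mean squared error is selected. The dataset, the candidate values, and the interleaved fold assignment are assumptions made for this example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs; fold f holds indices f, f+k, f+2k, ..."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test

def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict the mean target of the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cross_val_mse(xs, ys, n_neighbors, k=5):
    """Average mean squared error over k held-out folds."""
    fold_errors = []
    for train, test in k_fold_splits(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        errors = [
            (knn_predict(train_x, train_y, xs[i], n_neighbors) - ys[i]) ** 2
            for i in test
        ]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

# Made-up one-dimensional regression data.
xs = [i * 0.5 for i in range(20)]
ys = [x * x for x in xs]

candidates = [1, 3, 5]  # candidate values for the number of neighbours
scores = {n: cross_val_mse(xs, ys, n) for n in candidates}
best_n = min(scores, key=scores.get)

print("cross-validated MSE per candidate:", scores)
print("selected n_neighbors:", best_n)
```

Every sample is held out exactly once, so each candidate is scored only on data it was not fitted to.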

**Key Difference:**

* Model Evaluation: Focuses on assessing the performance of a specific model on a given dataset using various metrics.

* Model Selection: Involves comparing and choosing the best-performing model among different models or configurations based on evaluation metrics.

In practice, both model evaluation and model selection are iterative processes. Model evaluation helps understand the current model's strengths and weaknesses, while model selection helps choose the most suitable model for a particular task.


## 7. What do you understand by Feature Engineering and Feature Selection? What is the difference between them?

### **Feature Engineering**

Feature engineering involves creating new features or modifying existing ones to enhance the dataset's quality and provide more informative input to machine learning models.

Goals:

* Improving Model Performance: By introducing new features or transforming existing ones, we aim to capture more relevant information that can potentially lead to better model performance.

* Handling Complex Relationships: Creating interaction terms, polynomial features, or domain-specific transformations helps the model better capture complex relationships within the data.

* Dealing with Non-Linearity: In scenarios where relationships between features and the target variable are non-linear, feature engineering can help linear models by creating new representations.

Examples:

* Interaction Terms: Combining two or more features to capture their joint effect. For instance, in a housing dataset, combining "number of bedrooms" and "number of bathrooms" to create a new feature representing the interaction between them.

* Polynomial Features: Introducing polynomial terms, such as squaring or cubing existing features, to account for non-linear relationships.

* Encoding Categorical Variables: Converting categorical variables into a numerical format that the model can understand, such as one-hot encoding or label encoding.

* Temporal Features: Extracting features related to time, like day of the week, month, or season, which can be useful in time-series analysis.
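
The four example techniques above can be shown on a tiny table. The rows below are hypothetical housing records invented for illustration, and the new column names (`bed_x_bath`, `bedrooms_sq`, and so on) are arbitrary choices.

```python
from datetime import date

# Hypothetical housing rows, invented for illustration.
houses = [
    {"bedrooms": 3, "bathrooms": 2, "city": "Pune", "sold_on": date(2023, 1, 15)},
    {"bedrooms": 2, "bathrooms": 1, "city": "Kochi", "sold_on": date(2023, 7, 4)},
]

def engineer_features(row, cities):
    """Build a numeric feature dictionary from one raw record."""
    features = {
        "bedrooms": row["bedrooms"],
        "bathrooms": row["bathrooms"],
        # Interaction term: joint effect of two features.
        "bed_x_bath": row["bedrooms"] * row["bathrooms"],
        # Polynomial feature: squared term for non-linear effects.
        "bedrooms_sq": row["bedrooms"] ** 2,
        # Temporal features extracted from the sale date.
        "sale_month": row["sold_on"].month,
        "sale_weekday": row["sold_on"].weekday(),  # Monday = 0, Sunday = 6
    }
    # One-hot encoding: one 0/1 column per known category.
    for city in cities:
        features[f"city_{city}"] = int(row["city"] == city)
    return features

cities = sorted({row["city"] for row in houses})
engineered = [engineer_features(row, cities) for row in houses]

for row in engineered:
    print(row)
```

The category list is collected from the data first so that every row gets the same set of one-hot columns.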

### **Feature Selection**

Feature selection involves choosing a subset of relevant features from the original set to simplify models, reduce overfitting, and improve computational efficiency.

Goals:

* Simplifying Models: By selecting only the most relevant features, we aim to simplify the model's structure, making it more interpretable and easier to understand.

* Reducing Overfitting: Including irrelevant features may lead to overfitting, where the model performs well on the training data but poorly on new, unseen data. Feature selection helps mitigate this issue.

* Improving Computational Efficiency: Fewer features mean less computational resources required for training and making predictions, which is crucial for large datasets.

Examples:

* Univariate Feature Selection: Selecting features based on univariate statistical tests, such as chi-squared tests or mutual information.

* Recursive Feature Elimination (RFE): Iteratively removing the least important features based on model performance until the desired number of features is reached.

* Feature Importance from Models: Some models provide feature importance scores, and features can be selected based on these scores. Random Forests and Gradient Boosting models are examples.

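* Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) reduce the dataset's dimensionality by transforming features into a new, smaller set of uncorrelated features. Strictly speaking this is feature extraction rather than selection, because the resulting components are combinations of the original features, but it serves the same goal of reducing the number of inputs.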
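
Univariate feature selection, the first example above, can be sketched with a simple filter that scores each feature on its own and keeps the top k. The scoring function used here is the absolute Pearson correlation with the target, which is a swapped-in choice for brevity rather than the chi-squared or mutual information tests named above. The feature table is made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def select_top_k(features, target, k):
    """Score each feature independently and keep the k highest-scoring names."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Made-up feature columns and target.
features = {
    "size": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "constant": [1, 1, 1, 1, 1],
    "noise": [3, 1, 4, 1, 5],
}
target = [100, 120, 160, 200, 240]

selected, scores = select_top_k(features, target, k=2)
print("scores:", scores)
print("selected features:", selected)
```

A filter like this ignores interactions between features, which is the usual trade-off of univariate methods against wrapper methods such as RFE.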

**Key Difference:**

1) Creation vs. Selection: Feature engineering involves creating new features or transforming existing ones, while feature selection involves choosing a subset of features from the existing set.

2) Information vs. Simplicity: Feature engineering aims to provide more information to the model, capturing complex relationships. Feature selection aims to simplify models by retaining only the most relevant features.

In practice, a combination of both feature engineering and feature selection is often employed to achieve the best model performance while maintaining simplicity and interpretability. How much weight each one gets depends on the specific characteristics of the dataset and the goals of the machine learning task.

-----------------------------------------------------------------------------

**Submitted by : Mrudula A P**