# NLP Assignment 2

### 1. What are Corpora?

Ans:-Corpora (singular: corpus) are large and structured collections of text or spoken language data that are used for linguistic analysis, language modeling, and various natural language processing (NLP) tasks. Corpora serve as essential resources for studying language, understanding linguistic patterns, and developing and evaluating language-related algorithms and models.

Here are some key points about corpora:

1. **Text and Speech Data:** Corpora can consist of written text, transcribed spoken language (audio recordings converted to text), or a combination of both. Text corpora may include books, articles, websites, social media posts, and more, while speech corpora contain spoken language data, such as interviews, conversations, and speeches.

2. **Size and Diversity:** Corpora vary in size, from relatively small collections to massive datasets that encompass a wide range of text or speech sources. The diversity of data sources and topics covered in a corpus can influence its usefulness for specific research or NLP applications.

3. **Structured Data:** Corpora are typically structured in a way that allows researchers and NLP practitioners to access and analyze the data efficiently. This often involves organizing the data into documents, sentences, paragraphs, or other linguistic units, making it suitable for computational analysis.

4. **Annotation:** Many corpora are annotated with linguistic information such as part-of-speech tags, named entity labels, syntactic structures, sentiment labels, and more. Annotations let researchers analyze specific linguistic phenomena and train machine learning models for the corresponding tasks.

5. **Domain and Language:** Corpora can be domain-specific (e.g., medical texts, legal documents) or cover a broad range of topics. They can also be language-specific or multilingual, depending on their intended use.

6. **Creation Methods:** Corpora can be created through various methods, including manual annotation, data scraping from the web, transcription of audio recordings, and more. Some corpora are curated and maintained by academic institutions, while others are crowd-sourced or collected by private organizations.

7. **Use Cases:** Corpora have diverse applications in linguistics, computational linguistics, NLP, and related fields. They are used for tasks such as training and evaluating language models (e.g., machine translation, chatbots), linguistic research (e.g., studying language evolution, dialects), sentiment analysis, information retrieval, and much more.

Examples of well-known corpora include the Penn Treebank (a corpus of English text with syntactic annotations), the Brown Corpus (a diverse text corpus for studying language patterns), and the Common Crawl (a web corpus containing data from websites worldwide). Corpora are foundational resources in language-related research and play a critical role in advancing our understanding of language and improving the performance of NLP systems.
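To make the idea of a structured corpus concrete, here is a minimal sketch in Python. The two documents, the `corpus` layout, and the `corpus_stats` helper are made up for illustration; real corpora are far larger and are usually read with a proper tokenizer.

```python
from collections import Counter

# A toy corpus: each document is a dict with an id and raw text (made-up data).
corpus = [
    {"id": "doc1", "text": "The cat sat on the mat."},
    {"id": "doc2", "text": "The dog chased the cat."},
]

def corpus_stats(docs):
    """Return (number of tokens, vocabulary size, word frequencies)."""
    counts = Counter()
    for doc in docs:
        # Lowercase and drop periods; a crude stand-in for real tokenization.
        words = doc["text"].lower().replace(".", "").split()
        counts.update(words)
    return sum(counts.values()), len(counts), counts

n_tokens, vocab_size, freqs = corpus_stats(corpus)
print(n_tokens, vocab_size)   # 11 7
print(freqs.most_common(2))   # [('the', 4), ('cat', 2)]
```

Token counts, vocabulary size, and word frequencies like these are typically the first statistics computed when exploring a new corpus.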

### 2. What are Tokens?

Ans:-Tokens are the individual units that a larger body of text, such as a document or a sentence, is divided into for linguistic analysis or processing. In natural language processing (NLP), text is typically broken down into tokens, which are often words but can also be smaller units such as subwords (word pieces) or single characters.

Here are some key points about tokens:

1. **Words as Tokens:** In most NLP applications, words are the most common type of tokens. For example, in the sentence "The quick brown fox jumps," the tokens are "The," "quick," "brown," "fox," and "jumps."

2. **Whitespace Separation:** In English and many other languages, words in text are usually separated by whitespace (spaces, tabs, or line breaks). Tokenization often involves splitting the text at these whitespace characters to identify individual words.

3. **Punctuation:** Punctuation marks like periods, commas, and hyphens are typically treated as separate tokens. For example, in the sentence "I have a dog, and he is brown," the tokens include "I," "have," "a," "dog," ",", "and," "he," "is," and "brown."

4. **Subword Tokens:** In some NLP applications, words are further divided into subword units, such as word pieces or individual characters. This is common for languages with complex morphology and for tasks like machine translation and text generation, where it helps handle rare and unseen words. For example, the word "unhappiness" might be tokenized into subword tokens like "un," "happi," and "ness."

5. **Tokenization Rules:** Tokenization rules can vary depending on the language and the specific task. For instance, some languages use spaces to separate words, while others do not. Additionally, tokenization may take into account special cases like contractions ("can't" as a token) or compound words ("New York" as a token).

6. **Importance in NLP:** Tokenization is a crucial preprocessing step in NLP. It serves as the foundation for various NLP tasks, including text classification, machine translation, named entity recognition, sentiment analysis, and more. Tokens are the units on which linguistic analysis, feature extraction, and modeling are performed.

7. **Token Count:** The number of tokens in a text document is often used as a basic statistic for understanding document length. It can be useful for tasks like estimating reading time or analyzing the distribution of words in a corpus.

8. **Stop Words:** Stop words are common words like "the," "and," "is," "in," etc. They are often removed after tokenization, as a separate filtering step, because they are considered to carry little semantic information in many NLP tasks.

In summary, tokens are the building blocks of text data in NLP, representing individual words or subword units that are essential for linguistic analysis and the development of NLP models and algorithms. Proper tokenization is crucial for accurate and effective text processing and analysis.
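The points above about whitespace, punctuation, contractions, and stop words can be sketched with a small regular-expression tokenizer. This is a minimal illustration, not a production tokenizer: the pattern, the `tokenize` and `remove_stop_words` helpers, and the tiny stop-word list are all assumptions made for this example.

```python
import re

# A word (optionally with an internal apostrophe, e.g. "can't")
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "in", "a", "he"}

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = tokenize("I have a dog, and he can't swim.")
print(tokens)
# ['I', 'have', 'a', 'dog', ',', 'and', 'he', "can't", 'swim', '.']
print(remove_stop_words(tokens))
# ['I', 'have', 'dog', ',', "can't", 'swim', '.']
```

Note that stop-word removal runs on the token list, after tokenization has finished.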

### 3. What are Unigrams, Bigrams, Trigrams?

Ans:-Unigrams, bigrams, and trigrams are different types of n-grams, which are contiguous sequences of 'n' items (typically words or characters) in a given order within a text document. N-grams are used in natural language processing (NLP) and computational linguistics to analyze the structure and patterns of text data. The choice of 'n' determines the size of the n-grams and the level of context they capture.

Here's what each term means:

1. **Unigrams (1-grams):**
   - Unigrams are the simplest type of n-grams, where 'n' is set to 1.
   - Unigrams are individual words in a text, and they capture each word's frequency and presence in the text.
   - For example, in the sentence "I love programming," the unigrams are "I," "love," and "programming."

2. **Bigrams (2-grams):**
   - Bigrams are n-grams where 'n' is set to 2, meaning they consist of pairs of consecutive words in a text.
   - Bigrams provide some context by capturing the co-occurrence of words in pairs.
   - For example, in the sentence "I love programming," the bigrams are "I love" and "love programming."

3. **Trigrams (3-grams):**
   - Trigrams are n-grams where 'n' is set to 3, so they consist of triples of consecutive words in a text.
   - Trigrams provide more context and capture the co-occurrence of words in triplets.
   - For example, in the sentence "I love programming," the trigram is "I love programming."

N-grams can be used for various NLP tasks and text analysis purposes:

- **Language Modeling:** N-grams are used to build language models that predict the likelihood of a word based on the context of the previous 'n-1' words. For example, trigrams are used in trigram language models.

- **Text Classification:** N-grams can be used as features for text classification tasks, where they represent the presence or frequency of specific word sequences in documents.

- **Information Retrieval:** In information retrieval systems, search queries and documents can be represented as n-grams to improve search results.

- **Sentiment Analysis:** N-grams are used to capture sentiment-related phrases and idioms in text for sentiment analysis tasks.

- **Machine Translation:** N-grams can be employed in statistical machine translation systems to capture translation probabilities for sequences of words.

- **Text Generation:** In text generation tasks, n-grams can be used to generate coherent and contextually appropriate sequences of words.

The choice of 'n' depends on the specific task and the desired level of context. Smaller 'n' (e.g., unigrams and bigrams) are useful for capturing local patterns, while larger 'n' (e.g., trigrams and higher-order n-grams) capture more context. Larger 'n' also makes the data sparser, because most long word sequences occur rarely or never in a corpus, so higher-order models need more training text.
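As a sketch of the language-modeling use case, a bigram model can estimate the probability of a word given the previous word as count(previous word, word) / count(previous word). The example below is a minimal maximum-likelihood version with no smoothing; the function names, the `<s>` and `</s>` boundary markers, and the three training sentences are assumptions made for illustration.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Add sentence-boundary markers so every word has a predecessor.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

sentences = ["I love programming", "I love music", "I hate bugs"]
uni, bi = train_bigram_model(sentences)

# "i" occurs 3 times and is followed by "love" in 2 of them.
print(bigram_prob(uni, bi, "i", "love"))
# "love" occurs twice and is followed by "music" once.
print(bigram_prob(uni, bi, "love", "music"))
```

Any bigram that never appears in the training sentences gets probability zero here, which is why practical n-gram models add smoothing.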

### 4. How to generate n-grams from text?

Ans:-Generating n-grams from text is a common text processing task in natural language processing (NLP). N-grams are contiguous sequences of 'n' items, which are typically words or characters, in a given order within a text document. Here's how you can generate n-grams from text using Python:

Let's assume you have a text string:

```python
text = "Natural language processing is a subfield of artificial intelligence."
```

You can create n-grams from this text using Python code. Here are some examples:

#### Generating Word N-Grams

```python
def generate_ngrams(text, n):
    # Tokenize the text into words
    words = text.split()
    
    # Initialize an empty list to store the n-grams
    ngrams = []
    
    # Iterate through the list of words to create n-grams
    for i in range(len(words) - n + 1):
        ngram = ' '.join(words[i:i+n])
        ngrams.append(ngram)
    
    return ngrams

# Generate word bigrams (2-grams)
bigrams = generate_ngrams(text, 2)
print("Bigrams:", bigrams)

# Generate word trigrams (3-grams)
trigrams = generate_ngrams(text, 3)
print("Trigrams:", trigrams)
```

#### Generating Character N-Grams

```python
def generate_char_ngrams(text, n):
    # Initialize an empty list to store the n-grams
    ngrams = []
    
    # Iterate through the text to create n-grams
    for i in range(len(text) - n + 1):
        ngram = text[i:i+n]
        ngrams.append(ngram)
    
    return ngrams

# Generate character bigrams (2-grams)
char_bigrams = generate_char_ngrams(text, 2)
print("Character Bigrams:", char_bigrams)

# Generate character trigrams (3-grams)
char_trigrams = generate_char_ngrams(text, 3)
print("Character Trigrams:", char_trigrams)
```

In the code examples above:

- `generate_ngrams` is a Python function that takes the input text and the desired 'n' (e.g., 2 for bigrams, 3 for trigrams) as arguments.
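- For word n-grams, the text is first tokenized into words using the string `split()` method. It splits on whitespace only, so punctuation stays attached to the neighboring word (for example, "intelligence." keeps its period).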
- Then, a loop iterates through the list of words to create n-grams by joining 'n' consecutive words.
- For character n-grams, the function directly iterates through the text to create n-grams of characters.

You can adjust the value of 'n' to generate n-grams of different lengths. These n-grams can be useful for various NLP tasks such as text classification, language modeling, and text generation.
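A more compact alternative, sketched below, builds n-grams with `zip` over shifted copies of the input. It works on any sequence, so the same function covers word n-grams (a list of words) and character n-grams (a string). The function name `ngrams` and the sample sentence are choices made for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return n-grams as tuples from any sequence (list of words or a string)."""
    # zip stops at the shortest shifted copy, which yields exactly
    # len(tokens) - n + 1 n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "to be or not to be".split()
print(ngrams(words, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

# Counting n-grams gives the frequencies used by language models and classifiers.
print(Counter(ngrams(words, 2)).most_common(1))
# [(('to', 'be'), 2)]

print(ngrams("text", 3))
# [('t', 'e', 'x'), ('e', 'x', 't')]
```

Returning tuples rather than joined strings keeps the n-grams hashable and unambiguous, which is convenient for counting.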

### 5. Explain Lemmatization?

Ans:-Lemmatization is a natural language processing (NLP) technique used to reduce words to their base or dictionary form, known as the "lemma." The main goal of lemmatization is to normalize words so that different inflected forms or variations of a word are converted to a common base form. This helps in text analysis, information retrieval, and improving the efficiency of text processing tasks.

Here are the key aspects of lemmatization:

1. **Base Form:** In lemmatization, words are transformed into their base, dictionary, or canonical form, which is the word as it would appear in a dictionary or lexicon. For example:
   - "Running" → "run" (base form: verb)
   - "Better" → "good" (base form: adjective)

2. **Language-Specific:** Lemmatization is language-specific because the rules for finding the base form of words vary from one language to another. Different languages have different inflectional patterns and rules.

3. **Lemmatization vs. Stemming:** Lemmatization is similar to stemming, but they are not the same. Stemming reduces words to their base or root form by removing prefixes or suffixes, but it may not always produce a valid word. Lemmatization, on the other hand, ensures that the resulting lemma is a valid word in the language.

4. **Part-of-Speech Consideration:** Lemmatization often takes into account the part of speech (POS) of the word to determine its base form. For example, the lemma of "better" is different when it is an adjective ("good") compared to when it is used as an adverb ("well").

5. **Use of Lexicons:** Lemmatization typically relies on dictionaries or lexicons that contain information about words, including their base forms and associated POS tags. These resources help lemmatizers make accurate transformations.

6. **Applications:** Lemmatization is used in various NLP applications, such as text normalization, information retrieval, text classification, and machine translation. It is particularly valuable when the analysis or comparison of words in their base forms is essential.

Here's an example of lemmatization in action:

- Input Sentence: "The quick brown foxes are running quickly."

After lemmatization, the sentence might be transformed to something like:

- Lemmatized Sentence: "The quick brown fox be run quickly."

In this example, the inflected forms "foxes," "are," and "running" have been converted to their lemmas "fox," "be," and "run." The result is a normalized sequence of lemmas intended for analysis, not a grammatical sentence.

There are libraries and tools in various programming languages, such as NLTK (Natural Language Toolkit) in Python, that provide lemmatization capabilities. These libraries often include pre-built lexicons and algorithms for performing lemmatization on text data.
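The role of the lexicon and of the part of speech can be sketched with a toy dictionary-based lemmatizer. This is only an illustration: the `LEXICON` entries, the `lemmatize` helper, and the hand-assigned POS tags are assumptions for this example, whereas real lemmatizers rely on large dictionaries and morphological rules.

```python
# Toy lexicon mapping (word, POS) to lemma; entries are illustrative only.
LEXICON = {
    ("foxes", "NOUN"): "fox",
    ("are", "AUX"): "be",
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; fall back to the lowercased word."""
    key = word.lower()
    return LEXICON.get((key, pos), key)

# POS tags are assigned by hand here; normally a POS tagger supplies them.
tagged = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("foxes", "NOUN"),
          ("are", "AUX"), ("running", "VERB"), ("quickly", "ADV")]
print(" ".join(lemmatize(word, pos) for word, pos in tagged))
# the quick brown fox be run quickly

# The same surface form can have different lemmas depending on its POS.
print(lemmatize("better", "ADJ"), lemmatize("better", "ADV"))
# good well
```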

### 6. Explain Stemming?

Ans:-Stemming is a natural language processing (NLP) technique used to reduce words to their base or root form, known as the "stem." The primary goal of stemming is to remove suffixes or prefixes from words so that related words with the same root are treated as the same word, even if they have different inflected forms. Stemming is a simpler and more rule-based process compared to lemmatization.

Here are the key aspects of stemming:

1. **Base Form:** In stemming, words are transformed into their base or root form by removing affixes. This root form may not always be a valid word, but it is a common substring shared by related words.

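2. **Language Dependence:** Stemming rules are written for a particular language (the Porter algorithm, for example, targets English). Because they rely on affix patterns rather than dictionaries, stemmers need fewer language resources than lemmatizers, and rule sets exist for many languages.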

3. **Suffix Stripping:** Stemming algorithms use suffix-stripping rules to remove suffixes from words. These rules are typically heuristic and rule-based, and they aim to remove common suffixes to obtain the stem.

4. **Simplicity:** Stemming is a relatively simple and fast process compared to lemmatization, making it suitable for applications where efficiency is critical.

5. **Overstemming and Understemming:** Stemming may produce stems that are not actual words, and it can overstem by removing too much, so that unrelated words collapse to the same stem. For example, the Porter stemmer reduces both "university" and "universe" to "univers," even though the two words have different meanings. The opposite error, understemming, leaves related words with different stems.

6. **Applications:** Stemming is used in various NLP applications, such as information retrieval, search engines, text classification, and document clustering, where reducing words to their common base form can aid in text analysis and indexing.

Here's an example of stemming in action:

- Input Word: "jumping"

After stemming using a common stemming algorithm, the word might be reduced to its root form:

- Stemmed Word: "jump"

In this example, the suffix "-ing" has been removed from the word "jumping" to obtain the stem "jump."

Popular Stemming Algorithms:
- **Porter Stemming Algorithm:** Published by Martin Porter in 1980, the Porter stemming algorithm is one of the most widely used stemming algorithms for English. It applies a series of heuristic rules to reduce words to their stems.
- **Snowball (Porter2) Stemming Algorithm:** Snowball is a framework, also created by Martin Porter, for writing stemming algorithms. Its English stemmer, often called Porter2, refines the original Porter rules, and the project provides stemmers for many other languages.
- **Lancaster Stemming Algorithm:** The Lancaster stemming algorithm is another popular stemming algorithm that uses more aggressive stemming rules compared to Porter.

Stemming is suitable for tasks where simplicity and speed are more important than linguistic precision. However, for applications that require more accurate word normalization and where understanding the context of words is crucial, lemmatization is often preferred over stemming.
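To illustrate suffix stripping, here is a deliberately crude stemmer. It is a sketch only and does not implement the Porter, Snowball, or Lancaster algorithms: the `SUFFIX_RULES` list, the `simple_stem` function, and the minimum stem length are assumptions chosen for this example.

```python
# Suffix rules tried in order; illustrative only, NOT the Porter rule set.
SUFFIX_RULES = [("ing", ""), ("ies", "i"), ("ed", ""), ("es", ""), ("s", "")]

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix whose remaining stem is long enough."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word

for w in ["jumping", "jumps", "jumped", "ponies", "running", "sing"]:
    print(w, "->", simple_stem(w))
# jumping -> jump
# jumps -> jump
# jumped -> jump
# ponies -> poni
# running -> runn
# sing -> sing
```

The outputs "poni" and "runn" show that stems need not be valid words, and the length check is what keeps "sing" from being cut down to "s".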

### 7. Explain Part-of-speech (POS) tagging?

Ans:-Part-of-speech (POS) tagging, also known as grammatical tagging or word-category disambiguation, is a fundamental natural language processing (NLP) task that involves assigning a specific part-of-speech tag to each word in a given text. The primary goal of POS tagging is to determine the syntactic category or grammatical role of each word in a sentence, which helps in understanding the structure of the text and its grammatical relationships.

Here are the key aspects of POS tagging:

1. **Part-of-Speech Tags:** POS tags are linguistic labels or codes that represent the grammatical category of a word. Common POS tags include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. Additionally, specific subcategories or finer-grained tags exist to capture more detailed information about word usage, such as tense, number, gender, and case.

2. **Contextual Disambiguation:** POS tagging involves disambiguating words that may have multiple grammatical roles based on their context within a sentence. For example, the word "bank" can be a noun ("I went to the bank") or a verb ("I will bank the money"). Context helps determine the correct tag.

3. **Ambiguity Handling:** Some words are inherently ambiguous and can belong to different parts of speech depending on usage. POS taggers employ context clues, word order, and neighboring words to resolve such ambiguities.

4. **POS Tag Sets:** Different languages and POS tagging schemes may use distinct sets of POS tags. For example, the Penn Treebank POS tag set is commonly used for English, while other languages have their own sets of tags.

5. **Applications:** POS tagging is used in various NLP applications, including:
   - Information retrieval: To improve search and information retrieval results by considering the grammatical roles of words.
   - Machine translation: To aid in language translation by preserving grammatical structures.
   - Sentiment analysis: To understand how the grammatical structure of text affects sentiment.
   - Named entity recognition: To help identify proper nouns like names, locations, and organizations, since a proper-noun tag is a useful signal that a word belongs to a named entity.
   - Dependency parsing: As a preprocessing step in syntactic and dependency parsing, which analyze sentence structure.

6. **POS Tagging Models:** POS tagging can be performed using rule-based systems, statistical models (such as Hidden Markov Models), or machine learning techniques, including conditional random fields (CRF) and neural networks (e.g., recurrent neural networks and transformers). Modern deep learning models, like Bidirectional LSTMs and BERT, have achieved state-of-the-art results in POS tagging.

Here's an example of POS tagging in a sentence:

- Input Sentence: "She is reading a book."

- POS Tagged Sentence (Universal POS tags): "She (PRON) is (AUX) reading (VERB) a (DET) book (NOUN)."

In this example, each word in the sentence has been assigned a POS tag in parentheses, indicating its grammatical role.

POS tagging is a crucial preprocessing step in many NLP tasks and plays a vital role in understanding the syntactic and grammatical structure of text, making it easier to extract meaningful information and perform more advanced linguistic analysis.
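A very simple statistical tagger assigns each word the tag it carried most often in annotated training data. The sketch below implements that baseline on a tiny hand-made training set. The data, the function names, and the choice of `NOUN` as the fallback tag for unknown words are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made training data of (word, tag) pairs; purely illustrative.
TRAIN = [
    [("she", "PRON"), ("is", "AUX"), ("reading", "VERB"), ("a", "DET"), ("book", "NOUN")],
    [("he", "PRON"), ("will", "AUX"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "AUX"), ("long", "ADJ")],
]

def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram_tagger(TRAIN)
print(tag(model, "She is reading a book".split()))
# [('She', 'PRON'), ('is', 'AUX'), ('reading', 'VERB'), ('a', 'DET'), ('book', 'NOUN')]

# The baseline ignores context, so the verb use of "book" is tagged as a noun.
print(tag(model, "He will book a flight".split()))
```

The second call shows why contextual disambiguation matters: models such as HMMs, CRFs, and neural taggers look at neighboring words to resolve cases like "will book."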

### 8. Explain Chunking or shallow parsing?

Ans:-Chunking, also known as shallow parsing, is a natural language processing (NLP) technique used to identify and group words in a sentence or text into meaningful chunks or phrases based on their grammatical structures and relationships. Unlike full syntactic parsing, which builds a complete parse tree for a sentence, chunking focuses on identifying higher-level phrases, such as noun phrases (NP) and verb phrases (VP), without specifying their internal syntactic structure.

Here are the key aspects of chunking (shallow parsing):

1. **Chunks:** Chunks are groups of words that form meaningful phrases or syntactic units within a sentence. These phrases often represent grammatical constructs or convey some level of semantic meaning.

2. **Phrase Types:** Chunking typically identifies several types of phrases, including:
   - **Noun Phrases (NP):** These are phrases centered around a noun and can include articles, adjectives, and other modifiers. For example, "the big red apple."
   - **Verb Phrases (VP):** These are phrases centered around a verb and can include adverbs, objects, and other verb-related elements. For example, "quickly eat dinner."
   - **Prepositional Phrases (PP):** These are phrases that begin with a preposition and often contain an object of the preposition. For example, "in the park."

3. **Contextual Analysis:** Chunking uses contextual information and grammatical rules to identify and extract meaningful phrases from text. It relies on patterns and relationships between words to determine chunk boundaries.

4. **Chunking Models:** Chunking can be performed using various techniques, including rule-based approaches, regular expressions, and machine learning models. Machine learning-based chunkers are trained on annotated data to learn patterns and relationships between words that indicate chunk boundaries.

5. **Applications:** Chunking is used in a variety of NLP tasks, including information extraction, named entity recognition, and text summarization. It simplifies the representation of text structure while preserving essential grammatical relationships.

6. **Example:**
   - Sentence: "The quick brown fox jumps over the lazy dog."
   - Chunked Result: 
     - (NP The/DT quick/JJ brown/JJ fox/NN) (VP jumps/VBZ) (PP over/IN) (NP the/DT lazy/JJ dog/NN).

In the example above, the sentence has been chunked into two noun phrases (NP), a verb phrase (VP), and a prepositional phrase (PP), with each chunk enclosed in parentheses and labeled with its respective phrase type. In this flat representation the PP chunk contains only the preposition, and its object forms a separate NP chunk.

Chunking provides a simplified and higher-level representation of the syntactic structure of a sentence or text, making it easier to analyze and extract information. It serves as an intermediate step in various NLP tasks and is particularly useful when you want to capture essential linguistic units without the complexity of full syntactic parsing.
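A rule-based chunker can be sketched by mapping each POS tag to a coarse chunk label and grouping consecutive tokens that share a label. The rules in `chunk_type` are simplifying assumptions made for this example (for instance, two adjacent noun phrases would be merged into one), and the input is assumed to be already tagged with Penn Treebank tags.

```python
from itertools import groupby

def chunk_type(tag):
    """Map a Penn Treebank POS tag to a coarse chunk label (simplified rules)."""
    if tag in {"DT", "JJ"} or tag.startswith("NN"):
        return "NP"
    if tag.startswith("VB"):
        return "VP"
    if tag == "IN":
        return "PP"
    return None  # punctuation and anything else stays outside chunks

def shallow_parse(tagged):
    """Group consecutive tokens that share a chunk label."""
    chunks = []
    for label, group in groupby(tagged, key=lambda pair: chunk_type(pair[1])):
        words = [word for word, _ in group]
        if label is not None:
            chunks.append((label, words))
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]
for label, words in shallow_parse(tagged):
    print(label, words)
# NP ['The', 'quick', 'brown', 'fox']
# VP ['jumps']
# PP ['over']
# NP ['the', 'lazy', 'dog']
```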

### 9. Explain Noun Phrase (NP) chunking?

Ans:-Noun phrase (NP) chunking is the most common form of chunking (shallow parsing). Rather than labeling every phrase type in a sentence, it identifies only the noun phrases: groups of words centered on a noun, together with its determiners, adjectives, and other modifiers. As with chunking in general, it does not build a complete parse tree and does not describe the internal structure of each phrase.

Here are the key aspects of NP chunking:

1. **Noun Phrases:** A noun phrase is centered around a noun and can include articles, adjectives, and other modifiers. For example, "the big red apple."

2. **Base Noun Phrases:** NP chunkers usually extract flat, non-overlapping noun phrases (often called base NPs) and do not nest one noun phrase inside another. In "the dog in the park," an NP chunker marks "the dog" and "the park" as two separate chunks.

3. **POS Tags as Input:** NP chunking typically runs on POS-tagged text. A simple rule describes a noun phrase as an optional determiner, followed by any number of adjectives, followed by one or more nouns.

4. **IOB Labels:** Chunks are often encoded per token with IOB labels: B-NP marks the first word of a noun phrase, I-NP marks a word inside one, and O marks a word outside any noun phrase.

5. **Chunking Models:** NP chunking can be performed with rule-based approaches, regular expressions over POS tags, or machine learning models trained on annotated data to predict chunk boundaries.

6. **Applications:** NP chunking is used in information extraction, named entity recognition, and text summarization, because noun phrases usually name the things a text is about.

7. **Example:**
   - Sentence: "The quick brown fox jumps over the lazy dog."
   - NP-Chunked Result: 
     - (NP The/DT quick/JJ brown/JJ fox/NN) jumps/VBZ over/IN (NP the/DT lazy/JJ dog/NN).

In the example above, only the two noun phrases are enclosed in parentheses and labeled NP; the verb and the preposition are left outside any chunk.

NP chunking gives a simple, high-level view of the things mentioned in a sentence and is useful when you want to capture these units without the complexity of full syntactic parsing.
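For noun phrases specifically, a common rule-based sketch matches a tag pattern (optional determiner, any adjectives, one or more nouns) against the POS tag sequence and labels each token B-NP (begins a noun phrase), I-NP (inside one), or O (outside). The pattern and the helper names below are assumptions for this example, and the input is assumed to carry Penn Treebank tags.

```python
import re

# NP pattern over POS tags: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")

def np_chunk_iob(tagged):
    """Return (word, tag, label) triples with B-NP / I-NP / O labels."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Character offset where each token's tag starts inside tag_string.
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets
    labels = ["O"] * len(tagged)
    for match in NP_PATTERN.finditer(tag_string):
        inside = [i for i, s in enumerate(starts) if match.start() <= s < match.end()]
        for k, i in enumerate(inside):
            labels[i] = "B-NP" if k == 0 else "I-NP"
    return [(w, t, label) for (w, t), label in zip(tagged, labels)]

def noun_phrases(iob):
    """Collect the words of each B-NP/I-NP run into a phrase string."""
    phrases, current = [], []
    for word, _, label in iob:
        if label == "I-NP" and current:
            current.append(word)
            continue
        if current:
            phrases.append(" ".join(current))
        current = [word] if label == "B-NP" else []
    if current:
        phrases.append(" ".join(current))
    return phrases

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]
iob = np_chunk_iob(tagged)
print([label for _, _, label in iob])
print(noun_phrases(iob))
# ['The quick brown fox', 'the lazy dog']
```

Matching on the tag sequence rather than on the words keeps the rule general: any determiner, adjective, and noun combination is covered by one pattern.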

### 10. Explain Named Entity Recognition?

Ans:-Named Entity Recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a natural language processing (NLP) technique that focuses on identifying and classifying named entities in text. Named entities are words or phrases that represent specific objects, locations, people, dates, numerical values, organizations, and other structured information within a text.

Here are the key aspects of Named Entity Recognition (NER):

1. **Named Entity Types:** NER classifies named entities into predefined categories or types. Common named entity types include:
   - **Person:** Names of individuals, such as "John Smith."
   - **Location:** Names of places, cities, countries, etc., like "New York City."
   - **Organization:** Names of companies, institutions, agencies, etc., e.g., "Google."
   - **Date:** Expressions of dates and times, such as "January 1, 2023."
   - **Number:** Numerical values, including integers and decimals, such as "100 million."
   - **Miscellaneous:** Other named entities that do not fit into the standard categories, like "iPhone."

2. **Contextual Analysis:** NER algorithms analyze the context of words within a sentence or document to determine whether a word or phrase is a named entity and to classify it into the appropriate category. Contextual features may include capitalization, neighboring words, and grammatical structures.

3. **Challenges:** NER faces several challenges, including handling ambiguous words (e.g., "Paris" can refer to the city or a person's name), recognizing named entities in various languages, and dealing with complex entity names (e.g., "United States of America").

4. **Named Entity Recognition Models:** NER can be performed using different approaches and models, including rule-based systems, statistical models, and machine learning techniques. Statistical sequence models such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) were long the standard approach, while deep learning models like Bidirectional LSTMs and Transformer-based models (e.g., BERT) have achieved state-of-the-art results in NER tasks.

5. **Applications:** Named Entity Recognition is used in various NLP applications and information extraction tasks, including:
   - **Information Retrieval:** To improve search results and retrieval of documents containing specific entities.
   - **Question Answering:** To identify and extract answers from text documents.
   - **Entity Linking:** To link named entities to external knowledge bases or databases for additional information.
   - **Text Summarization:** To extract important named entities for generating summaries.
   - **Language Translation:** To ensure proper translation of named entities in machine translation systems.
   - **Geospatial Analysis:** To extract location information for geospatial applications.

6. **Example:**
   - Input Sentence: "Apple Inc. was founded by Steve Jobs in Cupertino, California."
   - NER Output: (ORG) Apple Inc. (PERSON) Steve Jobs (LOCATION) Cupertino, California

In the example above, the NER system has identified and classified named entities into their respective types, such as "Apple Inc." as an organization, "Steve Jobs" as a person, and "Cupertino, California" as a location.

Named Entity Recognition plays a critical role in extracting structured information from unstructured text data, enabling various downstream NLP applications and facilitating the extraction of valuable insights from large volumes of textual information.
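The simplest rule-based approach to NER is a gazetteer lookup: a list of known entity names is matched against the text. The sketch below is only an illustration of that idea. The `GAZETTEER` contents and the `find_entities` helper are hypothetical, and exact string matching cannot resolve ambiguous names or find entities missing from the list, which is why statistical and neural models are used in practice.

```python
# Hypothetical gazetteer of known entity strings and their types.
GAZETTEER = {
    "Apple Inc.": "ORG",
    "Steve Jobs": "PERSON",
    "Cupertino, California": "LOCATION",
}

def find_entities(text, gazetteer):
    """Return (start, end, surface, label) for every gazetteer entry found in text."""
    entities = []
    for surface, label in gazetteer.items():
        start = text.find(surface)
        while start != -1:
            entities.append((start, start + len(surface), surface, label))
            start = text.find(surface, start + 1)
    # Sort by position so entities come out in reading order.
    return sorted(entities)

sentence = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
entities = find_entities(sentence, GAZETTEER)
for start, end, surface, label in entities:
    print(f"({label}) {surface}")
# (ORG) Apple Inc.
# (PERSON) Steve Jobs
# (LOCATION) Cupertino, California
```

Returning character offsets alongside the label makes it possible to highlight entities in the original text or link them to a knowledge base later.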