In [None]:
# balance of catego

In [4]:
# deep learning for sentiment analysis 

# Lemmatization

Lemmatization is the process of reducing a word to its base or root form, which is known as its lemma. This process involves understanding the context in which a word is used and ensuring that the lemma corresponds to a valid word in the language. The algorithm used for lemmatization typically relies on several components and can vary depending on the specific implementation. However, the most common and widely used lemmatization algorithms include:

### 1. **Rule-Based Lemmatization**
Rule-based lemmatizers use a set of manually crafted rules to identify the base form of a word. These rules often include:
- Morphological analysis to identify suffixes and prefixes.
- A dictionary or lexicon to look up the lemma of a word.
- Part-of-speech tagging to ensure the correct lemma is chosen based on the word's grammatical role.

### 2. **Dictionary-Based Lemmatization**
This approach relies heavily on a pre-compiled dictionary or lexicon that maps inflected forms of words to their base forms. The algorithm:
- Looks up the word in the dictionary.
- Identifies its lemma based on the dictionary entry.

### 3. **Statistical or Machine Learning-Based Lemmatization**
In more advanced approaches, statistical models or machine learning techniques are used. These models are trained on large corpora of text and learn to predict the base form of a word based on its context. Components include:
- Training data consisting of annotated text where each word is labeled with its lemma.
- A statistical model (e.g., Hidden Markov Model, Conditional Random Field) or a machine learning model (e.g., neural networks).
- Features such as the word itself, its context, and part-of-speech tags.

### Common Lemmatization Tools and Libraries
Several libraries and tools implement these algorithms. Notable examples include:

1. **WordNet Lemmatizer (NLTK)**: This is part of the Natural Language Toolkit (NLTK) and relies on the WordNet lexical database to find lemmas.
2. **spaCy**: An industrial-strength NLP library that provides a high-performance lemmatizer as part of its pipeline.
3. **Stanford CoreNLP**: A comprehensive NLP toolkit that includes a lemmatizer with support for multiple languages.
4. **TextBlob**: A simple NLP library built on top of NLTK and Pattern that includes lemmatization.

### Example of Rule-Based Lemmatization
Here's a simple example of how a rule-based lemmatizer might work for the word "running":

1. **Input Word**: running
2. **Morphological Analysis**: Identify suffix "-ing".
3. **Part-of-Speech Tagging**: Determine that "running" is a verb.
4. **Apply Rules**: Remove the suffix "-ing" and check for the base form "run".
5. **Output Lemma**: run
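
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library works: the `LEXICON` set, the `VERB_RULES` list, and the `lemmatize_verb` helper are all hypothetical names invented for this example, and the word is assumed to have already been tagged as a verb.

```python
# Toy lexicon of valid base forms (hypothetical); a real lemmatizer would
# consult a full dictionary.
LEXICON = {"run", "walk", "make", "study"}

# (suffix, replacement) rules for verbs, tried in order. Illustrative only.
VERB_RULES = [("ies", "y"), ("ing", ""), ("ing", "e"), ("ed", ""), ("s", "")]


def lemmatize_verb(word: str) -> str:
    """Strip a known suffix and accept the first candidate found in LEXICON."""
    for suffix, replacement in VERB_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)] + replacement
        candidates = [stem]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])
        for candidate in candidates:
            if candidate in LEXICON:
                return candidate
    # No rule produced a valid word, so return the input unchanged.
    return word


for w in ["running", "walked", "making", "studies", "ran"]:
    print(w, "->", lemmatize_verb(w))
```

The irregular form "ran" comes back unchanged because no suffix rule applies to it, which is why rule-based lemmatizers are usually paired with a dictionary of exceptions.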

### Example of Dictionary-Based Lemmatization
Using a dictionary, the process might be:

1. **Input Word**: ran
2. **Dictionary Lookup**: Find "ran" in the dictionary with an entry pointing to "run".
3. **Output Lemma**: run

### Example of Statistical Lemmatization
For a machine learning-based approach:

1. **Input Word**: running
2. **Contextual Analysis**: Consider the surrounding words and part-of-speech tags.
3. **Model Prediction**: Use the trained model to predict the lemma "run".
4. **Output Lemma**: run

In summary, the lemma-finding process used in lemmatization depends on the chosen algorithm and the implementation details of the tool or library being used. Whether rule-based, dictionary-based, or statistical, each approach has its strengths and applications.

## Algorithm of spaCy and NLTK


### spaCy
spaCy uses a combination of rule-based and dictionary-based methods, tightly integrated with part-of-speech tagging. Here is a step-by-step outline of the algorithm:

1. **Part-of-Speech Tagging**: First, spaCy assigns a part-of-speech tag to each word in the text. This helps understand the grammatical role of the word (e.g., noun, verb, adjective).

2. **Rule-Based Morphological Analysis**: spaCy applies a set of predefined rules to transform the word into its base form. These rules cover common morphological changes, such as:
   - Removing suffixes like "-ing", "-ed", "-s".
   - Handling regular forms of verbs and nouns.

3. **Lookup in Lexical Resources**: spaCy uses its internal lexical database (word lists) to find the lemma. This database includes mappings from various word forms to their base forms. For example, it maps irregular forms like "went" to "go".

4. **Combining Rules and Lexical Resources**: The algorithm combines the results from the rule-based transformations and lexical lookup to determine the most accurate lemma.

5. **Output Lemma**: Finally, spaCy outputs the lemma for each word based on the applied rules and dictionary lookup.

### NLTK (WordNet Lemmatizer)
NLTK's WordNet Lemmatizer relies primarily on the WordNet lexical database. Here's a detailed breakdown of the algorithm:

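1. **Part-of-Speech Tagging**: The WordNet lemmatizer needs the part-of-speech tag of each word to select the correct lemma. It does not tag text itself: `lemmatize()` treats every word as a noun unless the caller passes a `pos` argument, so the tag must come from the user or an external POS tagger.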

2. **WordNet Lookup**: The lemmatizer looks up the word in the WordNet database using the provided part-of-speech tag. WordNet contains extensive mappings of words to their base forms.

3. **Simple Rule-Based Transformations**: For regular inflections, the lemmatizer applies basic suffix-detachment rules and accepts a candidate only if WordNet contains it for the given part of speech. These rules cover:
   - Removing common suffixes.
   - Handling plurals and regular verb forms.

4. **Fallback Mechanism**: If the word is not found in the dictionary and cannot be transformed using rules, the lemmatizer returns the word unchanged.

5. **Output Lemma**: The algorithm outputs the lemma based on the dictionary lookup or the rule-based transformation.
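
The flow described above (exceptions, lookup, suffix rules, fallback) can be mimicked with a small self-contained sketch. It is a simplified model and not NLTK's actual code: `LEXICON`, `EXCEPTIONS`, `RULES`, and `lemmatize` are hypothetical stand-ins for WordNet's data and lookup routine, holding only a handful of entries.

```python
# Hypothetical miniature stand-ins for WordNet's data; not the real database.
LEXICON = {
    "n": {"run", "running", "study", "dog"},
    "v": {"run", "study", "go"},
}
# Irregular or doubled-consonant forms mapped straight to their lemma.
EXCEPTIONS = {
    "n": {},
    "v": {"ran": "run", "went": "go", "running": "run"},
}
# (suffix, replacement) detachment rules per part of speech (small subset).
RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}


def lemmatize(word: str, pos: str = "n") -> str:
    """Return a lemma for `word`; `pos` defaults to noun, as in NLTK."""
    # 1. Irregular forms come from the exception list.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. The word may already be a base form for this part of speech.
    if word in LEXICON[pos]:
        return word
    # 3. Try suffix rules, keeping only candidates the lexicon knows.
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    # 4. Fallback: return the input unchanged.
    return word


print(lemmatize("running"))           # treated as a noun by default
print(lemmatize("running", pos="v"))  # treated as a verb
print(lemmatize("studies", pos="v"))
print(lemmatize("xyzzy"))
```

The first two calls show why the part-of-speech tag matters: with the default noun tag, "running" is already a valid noun and is returned as is, while the verb tag yields "run".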

### Example: spaCy Algorithm in Code
```python
import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Process some text
doc = nlp("running runs ran")

# Print the base forms (lemmas) of the words
lemmas = [token.lemma_ for token in doc]
print(lemmas)  # Expected: ['run', 'run', 'run'] (depends on how the model tags each token)
```
In this example, spaCy:
1. Tags "running", "runs", and "ran" as verbs.
2. Applies rules to remove "-ing" and "-s".
3. Uses its internal dictionary to map "ran" to "run".
4. Outputs the lemma "run" for all forms.

### Example: NLTK Algorithm in Code
```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize words with part-of-speech tags
lemmas = [
    lemmatizer.lemmatize("running", pos=wordnet.VERB),
    lemmatizer.lemmatize("runs", pos=wordnet.VERB),
    lemmatizer.lemmatize("ran", pos=wordnet.VERB)
]
print(lemmas)  # Output: ['run', 'run', 'run']
```
In this example, NLTK:
1. Receives the part-of-speech tags for "running", "runs", and "ran" as verbs.
2. Looks up each word in the WordNet database.
3. Applies simple rules if necessary.
4. Outputs the lemma "run" for all forms.

### Summary
- **spaCy** uses a combination of rules and a dictionary, guided by part-of-speech tags, to find the lemma.
- **NLTK** primarily relies on the WordNet dictionary, supplemented by simple rules and part-of-speech tags, to determine the lemma.

# Comparison of stemming and lemmatization


Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root forms, but they differ significantly in their approaches and outcomes. Here’s a detailed comparison:

### Stemming
**Definition**: Stemming is the process of reducing a word to its base or root form, usually by removing suffixes. The root form is not necessarily a valid word in the language.

**Algorithm**: Stemming algorithms, like the Porter Stemmer, use a set of simple, rule-based transformations. These rules are designed to strip common suffixes such as "ing", "ly", "es", "s", etc.

**Example**:
- "running" -> "run"
- "happily" -> "happili"
- "studies" -> "studi"

**Advantages**:
- **Speed**: Stemming is usually faster because it applies simple rules.
- **Simplicity**: The algorithms are relatively straightforward and easy to implement.

**Disadvantages**:
- **Accuracy**: The resulting root form may not be a valid word. For example, the Porter Stemmer turns "happily" into "happili".
- **Over-stemming and Under-stemming**: Stemming can sometimes remove too much (over-stemming) or too little (under-stemming), leading to incorrect root forms.
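
To make these failure modes concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, only a sketch of the general idea; the `SUFFIXES` list and the `naive_stem` helper are hypothetical and chosen for illustration.

```python
# Deliberately naive suffix stripper; this is NOT the Porter algorithm.
SUFFIXES = ["ing", "ly", "es", "s"]  # tried in this order


def naive_stem(word: str) -> str:
    """Remove the first matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


for w in ["running", "happily", "studies", "news", "ran"]:
    print(w, "->", naive_stem(w))
```

In this sketch "running" becomes "runn" and "studies" becomes "studi" (neither is a valid word), "news" is over-stemmed to "new" (a different word), and "ran" is under-stemmed because it is never linked to the same stem as "running".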

### Lemmatization
**Definition**: Lemmatization is the process of reducing a word to its base or dictionary form (lemma), ensuring that the base form is a valid word. It considers the context and part of speech of the word.

**Algorithm**: Lemmatization algorithms use complex rules and dictionaries. They often rely on part-of-speech tagging to choose the correct base form.

**Example**:
- "running" -> "run"
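- "happily" -> "happily" (an adverb is already its own dictionary form; lemmatization removes inflection, not derivation)
- "studies" -> "study" (whether tagged as a noun or as a verb)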

**Advantages**:
- **Accuracy**: The result is always a valid word, making lemmatization more accurate.
- **Context Awareness**: Lemmatization considers the context and grammatical role, leading to better results for words that have multiple forms.

**Disadvantages**:
- **Speed**: Lemmatization is usually slower because it involves more complex processing, including part-of-speech tagging.
- **Complexity**: The algorithms are more complex and require more resources (like lexical databases).

### Which is Better?
**Use Stemming When**:
- You need a fast and simple solution.
- You’re working with applications where perfect accuracy is not critical (e.g., search engines, where rough matches are acceptable).
- Computational resources are limited.

**Use Lemmatization When**:
- You need higher accuracy and the base form must be a valid word.
- The context and grammatical structure of the words are important (e.g., text analysis, language translation, and advanced NLP tasks).
- You have sufficient computational resources and time.

### Practical Examples
#### Stemming with NLTK:
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "happily", "studies"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # Output: ['run', 'happili', 'studi']
```

#### Lemmatization with NLTK:
```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
# Pair each word with its WordNet part of speech. With the default tag (noun),
# "running" would be returned unchanged.
tagged = [("running", wordnet.VERB), ("happily", wordnet.ADV), ("studies", wordnet.VERB)]
lemmas = [lemmatizer.lemmatize(word, pos=pos) for word, pos in tagged]
print(lemmas)  # Output: ['run', 'happily', 'study']
```

#### Lemmatization with spaCy:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("running happily studies")
lemmas = [token.lemma_ for token in doc]
print(lemmas)  # e.g. ['run', 'happily', 'study']; adverbs keep their form, and results depend on the model's tags
```

#### Summary
- **Stemming** is faster and simpler but less accurate.
- **Lemmatization** is more accurate and context-aware but slower and more complex.

Choosing between stemming and lemmatization depends on the specific requirements of your NLP task, balancing the trade-offs between speed and accuracy.

# Different ways to create a DTM


## 1. Binary DTM

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]
vectorizer = CountVectorizer(binary=True)
binary_dtm = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

binary_dtm =  pd.DataFrame(binary_dtm.toarray(), columns=terms)
binary_dtm

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0,1,1,1,0,0,1,0,1
1,0,1,0,1,0,1,1,0,1
2,1,0,0,1,1,0,1,1,1


## 2. Frequency DTM (Bag of Words)

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]
vectorizer = CountVectorizer()
freq_dtm = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

freq_dtm =  pd.DataFrame(freq_dtm.toarray(), columns=terms)
freq_dtm

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0,1,1,1,0,0,1,0,1
1,0,2,0,1,0,1,1,0,1
2,1,0,0,1,1,0,1,1,1
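
The only difference between the binary and frequency matrices is how repeated terms are recorded. A short check, assuming the same three-document corpus as above, makes that relationship explicit:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

counts = CountVectorizer().fit_transform(corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()

# Thresholding the counts at zero reproduces the binary matrix.
print(np.array_equal(binary, (counts > 0).astype(int)))

# (row, column) positions where the two matrices differ: these are the
# terms that occur more than once within a single document.
print(np.argwhere(counts != binary))
```

Here the single differing cell is the term "document" in the second document, which has a count of 2 in the frequency DTM and 1 in the binary DTM.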


## 3. TF-IDF DTM

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first document", "document second document", "And third one"]
vectorizer = TfidfVectorizer()
tfidf_dtm = vectorizer.fit_transform(corpus)

terms = vectorizer.get_feature_names_out()

tfidf_dtm =  pd.DataFrame(tfidf_dtm.toarray(), columns=terms)
tfidf_dtm

Unnamed: 0,and,document,first,one,second,third
0,0.0,0.605349,0.795961,0.0,0.0,0.0
1,0.0,0.835592,0.0,0.0,0.549351,0.0
2,0.57735,0.0,0.0,0.57735,0.0,0.57735



1. **Term Frequency (TF)**: This measures how often a term (word) appears in a document. It is calculated as the ratio of the count of the term to the total number of terms in the document. It helps to give more weight to terms that occur more frequently in the document.

   $$ TF_{t,d} = \frac{{\text{Number of times term } t \text{ appears in document } d}}{{\text{Total number of terms in document } d}} $$

2. **Inverse Document Frequency (IDF)**: This measures how important a term is within the entire corpus. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. This helps to penalize words that occur too frequently across all documents.

   $$ IDF_{t} = \log\left(\frac{{\text{Total number of documents}}}{{\text{Number of documents containing term } t}}\right) $$

Combining TF and IDF, we get the TF-IDF score for a term $ t$ in a document $ d$:

$$ \text{TF-IDF}_{t,d} = TF_{t,d} \times IDF_{t} $$
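
The textbook formulas above translate directly into plain Python. This is a minimal sketch: the helper names `tf`, `idf`, and `tf_idf` are chosen for this example, the natural logarithm is assumed for $\log$, and documents are tokenized by simple whitespace splitting.

```python
import math

# Pre-tokenized toy corpus (lowercased, split on whitespace).
docs = [
    "first document".split(),
    "document second document".split(),
    "and third one".split(),
]


def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of documents / documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(round(tf_idf("document", docs[1], docs), 4))
print(round(tf_idf("second", docs[1], docs), 4))
```

In the second document, "second" scores higher than "document" even though it occurs only once, because "document" also appears in another document and therefore has a lower IDF.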

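A manual calculation with these formulas will yield different values from the scikit-learn output above, because `TfidfVectorizer` by default uses the raw term count as TF, a smoothed IDF of $\ln\frac{1 + N}{1 + \text{df}_t} + 1$ (where $N$ is the number of documents and $\text{df}_t$ the number of documents containing term $t$), and then scales each document vector to unit L2 norm.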
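
To see where scikit-learn's numbers come from, the sketch below recomputes them by hand with NumPy, assuming the library defaults (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). The variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["first document", "document second document", "And third one"]

# Raw term counts serve as TF in scikit-learn's default setting.
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
print(manual.round(6))
```

The first row of `manual` contains 0.605349 for "document" and 0.795961 for "first", matching the table above.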


Multiplying Term Frequency (TF) by Inverse Document Frequency (IDF) combines two complementary signals:

1. **Term Frequency (TF)**:
   - **Purpose**: Measures the frequency of a term in a document.
   - **Significance**: It helps to identify terms that are important within a specific document. A higher TF value indicates that the term is more significant within that particular document.

2. **Inverse Document Frequency (IDF)**:
   - **Purpose**: Measures the importance of a term across the entire document collection.
   - **Significance**: It helps to identify terms that are unique to fewer documents and therefore more significant in distinguishing one document from another. A higher IDF value indicates that the term is less common across all documents, making it more useful for distinguishing between documents.

### Why Multiply TF and IDF?

When you multiply TF and IDF, you achieve the following:

1. **Adjust for Term Frequency**:
   - Terms that occur frequently within a document get a higher score from the TF component. This highlights their importance within that specific document.

2. **Adjust for Common Terms**:
   - The IDF component down-weights terms that are common across many documents. This prevents common terms (e.g., "the", "is", "and") from being considered too important, even if they appear frequently in a document.

### Combined Effect: TF-IDF

The combined TF-IDF score effectively balances these two aspects:

$$ \text{TF-IDF}_{t,d} = TF_{t,d} \times IDF_{t} $$

- **High TF, High IDF**: A term that appears frequently in a document but not in many other documents will have a high TF-IDF score, indicating it's very relevant to that document.
- **High TF, Low IDF**: A term that appears frequently in a document but also in many other documents will have a lower TF-IDF score, indicating it's less distinctive.
- **Low TF, High IDF**: A term that appears infrequently in a document and is also rare across other documents will have a moderate TF-IDF score.
- **Low TF, Low IDF**: A term that appears infrequently in a document and frequently in other documents will have a low TF-IDF score.


### Used for:

TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used in natural language processing and information retrieval tasks. Here's when you might want to use TF-IDF:

### 1. Text Classification:
- **Use Case**: When you have a set of documents and you want to classify them into different categories or topics.
- **Why TF-IDF**: TF-IDF helps in identifying the most relevant words or features that differentiate one category from another. It does this by giving higher weights to terms that are frequent in a document but rare across the entire corpus, thus capturing the discriminative power of words.

### 2. Information Retrieval:
- **Use Case**: When you want to retrieve relevant documents from a large collection based on a user's query.
- **Why TF-IDF**: TF-IDF helps in ranking documents based on their relevance to the query. Documents containing terms that match the query but are rare in the overall corpus receive higher scores, indicating higher relevance.
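
A minimal retrieval sketch, assuming the same toy corpus used elsewhere in this notebook and a made-up query string, ranks documents by the cosine similarity between their TF-IDF vectors and the query's vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]
query = "second document"  # hypothetical user query

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

# The query must be projected into the vocabulary learned from the corpus.
query_vector = vectorizer.transform([query])

# One similarity score per document, then indices from best to worst match.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The second document ranks first because it contains both query terms, and the third document scores zero because it shares no term with the query.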

### 3. Search Engine Optimization (SEO):
- **Use Case**: When optimizing web content to improve its visibility in search engine results.
- **Why TF-IDF**: TF-IDF can help in identifying and incorporating relevant keywords into web content. By understanding the importance of words in a document relative to the entire corpus, SEO practitioners can create content that is more likely to rank well in search engine results pages.

### 4. Text Summarization:
- **Use Case**: When generating concise summaries of long documents.
- **Why TF-IDF**: TF-IDF can be used to identify the most important sentences or phrases in a document based on the frequency of important terms. This helps in extracting key information and generating informative summaries.

### 5. Text Mining and Sentiment Analysis:
- **Use Case**: When analyzing large volumes of text data to extract insights or sentiments.
- **Why TF-IDF**: TF-IDF can be used to identify important terms or phrases in the text data, which are then used as features for further analysis. By focusing on terms that are both frequent in a document and rare across the entire corpus, TF-IDF helps in identifying significant patterns and sentiments.

In summary, TF-IDF is particularly useful in tasks where the goal is to identify important words or features in a document collection, and where the relative importance of terms needs to be considered. It's widely used in various applications across natural language processing, information retrieval, and text analysis.


## 4. Normalized Term Frequency DTM

Each entry is the frequency of a term in a document normalized by the document length (total number of terms).

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')  # norm can also be 'l2' (Euclidean)
normalized_tf_dtm = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

normalized_tf_dtm =  pd.DataFrame(normalized_tf_dtm.toarray(), columns=terms)
normalized_tf_dtm

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0.0,0.2,0.2,0.2,0.0,0.0,0.2,0.0,0.2
1,0.0,0.333333,0.0,0.166667,0.0,0.166667,0.166667,0.0,0.166667
2,0.166667,0.0,0.0,0.166667,0.166667,0.0,0.166667,0.166667,0.166667


### L1 Normalization

L1 normalization, also known as L1 norm or Manhattan norm, is a normalization technique that ensures the sum of the absolute values of the vector elements equals 1. In the context of a Document-Term Matrix (DTM), L1 normalization adjusts the term frequencies such that the sum of the normalized term frequencies for each document equals 1. This is useful for comparing documents of different lengths on a common scale.

The formula for L1 normalization for a given term $ t $ in a document $ d $ is:

$$ \text{Normalized TF}_{L1}(t, d) = \frac{\text{Frequency of } t \text{ in } d}{\sum_{t' \in d} \text{Frequency of } t' \text{ in } d} $$

### Example of L1 Normalization

Let's use scikit-learn's `TfidfVectorizer` with `use_idf=False` and `norm='l1'` to create a DTM with L1 normalization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one."
]

# Initialize the vectorizer with L1 normalization
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')

# Fit and transform the corpus to create the normalized term frequency DTM
normalized_tf_dtm = vectorizer.fit_transform(corpus)

# Get feature names (terms)
terms = vectorizer.get_feature_names_out()

# Convert the DTM to an array and print it
normalized_tf_array = normalized_tf_dtm.toarray()

print("Terms:", terms)
print("Normalized Term Frequency DTM (L1):\n", normalized_tf_array)
```

### Other Normalization Techniques

Besides L1 normalization, there are other normalization methods, such as L2 normalization and max normalization. Each has its specific use cases and advantages.

### L2 Normalization

L2 normalization, also known as L2 norm or Euclidean norm, adjusts the term frequencies such that the sum of the squares of the vector elements equals 1. Each document becomes a unit-length vector, so this technique is often used when the direction of the vectors matters more than their magnitude, for example with cosine similarity.

The formula for L2 normalization for a given term $ t $ in a document $ d $ is:

$$ \text{Normalized TF}_{L2}(t, d) = \frac{\text{Frequency of } t \text{ in } d}{\sqrt{\sum_{t' \in d} (\text{Frequency of } t' \text{ in } d)^2}} $$
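
One practical consequence of this formula is that the dot product of two L2-normalized vectors equals their cosine similarity. The sketch below checks this with NumPy on two count vectors taken from the frequency DTM of the sample corpus; `l2_normalize` is a helper defined only for this example.

```python
import numpy as np

# Term counts for the first two documents of the sample corpus
# (columns: and, document, first, is, one, second, the, third, this).
a = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1], dtype=float)
b = np.array([0, 2, 0, 1, 0, 1, 1, 0, 1], dtype=float)


def l2_normalize(v):
    """Divide a vector by its Euclidean length."""
    return v / np.linalg.norm(v)


cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = l2_normalize(a) @ l2_normalize(b)

print(np.isclose(cosine, dot_of_normalized))
print(round(float(np.sum(l2_normalize(b) ** 2)), 6))  # squares sum to 1
```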

### Example of L2 Normalization

Let's use scikit-learn's `TfidfVectorizer` with `use_idf=False` and `norm='l2'` to create a DTM with L2 normalization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer with L2 normalization
vectorizer = TfidfVectorizer(use_idf=False, norm='l2')

# Fit and transform the corpus to create the normalized term frequency DTM
normalized_tf_dtm = vectorizer.fit_transform(corpus)

# Convert the DTM to an array and print it
normalized_tf_array = normalized_tf_dtm.toarray()

print("Normalized Term Frequency DTM (L2):\n", normalized_tf_array)
```

### Max Normalization

Max normalization scales the term frequencies by the maximum term frequency in the document. This method ensures that the term with the highest frequency in each document gets a value of 1.

The formula for max normalization for a given term $ t$ in a document $ d $ is:

$$ \text{Normalized TF}_{\text{max}}(t, d) = \frac{\text{Frequency of } t \text{ in } d}{\max_{t' \in d} (\text{Frequency of } t' \text{ in } d)} $$

### Example of Max Normalization

The documented `norm` options of `TfidfVectorizer` are `'l1'`, `'l2'`, and `None`, and recent scikit-learn releases validate this parameter and may reject `norm='max'`. Max normalization can instead be applied to the raw counts with `sklearn.preprocessing.normalize`, which supports `norm='max'`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Raw term counts for the corpus defined in the L1 example
counts = CountVectorizer().fit_transform(corpus)

# Divide each row by its largest count so the most frequent term scores 1
normalized_tf_dtm = normalize(counts, norm='max')

# Convert the DTM to an array and print it
normalized_tf_array = normalized_tf_dtm.toarray()

print("Normalized Term Frequency DTM (Max):\n", normalized_tf_array)
```

### Summary

- **L1 Normalization**: Ensures the sum of the absolute values of the vector elements is 1. Suitable for comparing documents of different lengths on a common scale.
- **L2 Normalization**: Ensures the sum of the squares of the vector elements is 1. Often used when documents are compared by direction, as with cosine similarity.
- **Max Normalization**: Scales term frequencies by the maximum term frequency in the document. Ensures that the term with the highest frequency in each document gets a value of 1.

Each normalization method has its specific use cases and should be chosen based on the particular requirements of your text processing task.
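
The three formulas can be compared side by side on a single row of counts. The sketch below applies each one with NumPy and checks the result against `sklearn.preprocessing.normalize`; the row used is the second document of the sample corpus.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Counts for "This document is the second document."
# (columns: and, document, first, is, one, second, the, third, this).
counts = np.array([[0, 2, 0, 1, 0, 1, 1, 0, 1]], dtype=float)

l1 = counts / np.abs(counts).sum(axis=1, keepdims=True)
l2 = counts / np.sqrt((counts ** 2).sum(axis=1, keepdims=True))
mx = counts / np.abs(counts).max(axis=1, keepdims=True)

for name, manual in [("l1", l1), ("l2", l2), ("max", mx)]:
    # scikit-learn's row-wise normalize should agree with the manual result.
    assert np.allclose(manual, normalize(counts, norm=name))
    print(name, manual.round(3))
```

For the term "document" (count 2), the three methods give roughly 0.333, 0.707, and 1.0 respectively.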

### When to use which norm


### L1 Normalization:
- **What It Does**: It divides every value in a document's row by the sum of the absolute values, so the row adds up to 1.
- **Use Case**: Imagine you have a bunch of numbers (like the counts of words in a document). L1 normalization turns them into shares of the total. It's like dividing a cake into slices, where each slice represents one word's share, and all the slices together fill up the whole cake.
- **Why You'd Use It**: If you want each value to read as a proportion of the document, regardless of how long the document is, use L1 normalization. It should not be confused with L1 regularization, which is the technique that pushes unimportant feature weights to zero.

### L2 Normalization:
- **What It Does**: It divides every value by the row's Euclidean length, so that when you add up the squares of the values and take the square root, the total is 1.
- **Use Case**: Similar to L1, but the row ends up as a unit-length vector. Only its direction (the mix of terms) is kept, and the dot product of two such rows equals their cosine similarity.
- **Why You'd Use It**: If you plan to compare documents with cosine similarity or feed the matrix to distance-based or linear models, go for L2 normalization. It is the default in scikit-learn's `TfidfVectorizer`.

### Max Norm Normalization:
- **What It Does**: It divides every value in a row by the largest absolute value in that row, so the maximum becomes exactly 1.
- **Use Case**: Let's say one word appears far more often than the others in a document. After max normalization that word scores 1 and every other word is expressed as a fraction of it.
- **Why You'd Use It**: If you want term weights measured relative to each document's most frequent term, use max normalization. Note that it rescales values rather than capping them, so a single very frequent term shrinks all the other entries in its row.

### Choosing Between Them:
- **L1**: Use it if you want each row to read as proportions that sum to 1.
- **L2**: Go for it if you will compare documents by cosine similarity or other Euclidean measures.
- **Max Norm**: Use it if you want every term scored relative to the document's most frequent term.

In short, each method helps you manage the size or importance of different pieces of data in a way that makes sense for your analysis or model.

## Objective of using TF-IDF and Normalization

In both TF-IDF and normalization techniques for Document-Term Matrix (DTM), the main objective is to ensure that the representation of terms in documents is not biased by their frequency alone. Let's explore why this is important:

### 1. TF-IDF Method for DTM:
In TF-IDF, the idea is to weight terms based on their importance in the corpus. This is achieved through two main components:

- **Term Frequency (TF)**: Measures how often a term occurs in a document.
- **Inverse Document Frequency (IDF)**: Measures how unique or rare a term is across all documents in the corpus.

The product of TF and IDF results in a score that prioritizes terms that are frequent in a document but rare across the corpus. This is important because it helps in identifying terms that are more discriminative and descriptive of the content of a document.

If a particular word has a high TF-IDF value, it means that it is important within that document but relatively rare across the entire corpus. This ensures that common words (like "the", "and", etc.) don't dominate the representation.

### 2. Normalization in DTM:
Normalization techniques, such as L2 normalization or maximum frequency normalization, are used to scale down the raw term frequencies in a document. The main objectives are:

- **Fair Comparison**: By normalizing term frequencies, we make it easier to compare documents of different lengths. Otherwise, longer documents would naturally have higher raw term frequencies, which could bias comparisons.

- **Reduce Noise**: In some cases, certain terms may have very high frequencies due to the nature of the document (e.g., repeated occurrences of the same word). Normalization helps to reduce the impact of such noisy terms.

- **Stabilize Model**: Normalization can stabilize the model and prevent it from being overly sensitive to outliers or extreme values in term frequencies.

### Conclusion:
In both cases, the goal is to ensure that the representation of terms in the DTM is meaningful and not skewed by factors like document length or common words. By using TF-IDF and normalization techniques, we can create more robust and informative representations of textual data, which is crucial for tasks like information retrieval, text classification, and clustering.

## 5. Binary DTM with N-grams

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one."
]

# Initialize CountVectorizer with binary=True and ngram_range=(2, 2) for bigrams
vectorizer = CountVectorizer(binary=True, ngram_range=(2,2))

# Fit and transform the corpus to create the binary with bigrams DTM
binary_bigram_dtm = vectorizer.fit_transform(corpus)

# Get feature names (bigrams)
bigrams = vectorizer.get_feature_names_out()

# Convert the DTM to a DataFrame for display
binary_bigram_array = pd.DataFrame(binary_bigram_dtm.toarray(),columns = bigrams)

print("Bigrams:", bigrams)
binary_bigram_array

Bigrams: ['and this' 'document is' 'first document' 'is the' 'second document'
 'the first' 'the second' 'the third' 'third one' 'this document'
 'this is']


Unnamed: 0,and this,document is,first document,is the,second document,the first,the second,the third,third one,this document,this is
0,0,0,1,1,0,1,0,0,0,0,1
1,0,1,0,1,1,0,1,0,0,1,0
2,1,0,0,1,0,0,0,1,1,0,1


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document."]
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
binary_ngram_dtm = vectorizer.fit_transform(corpus)
# Get feature names (unigrams and bigrams)
bigrams = vectorizer.get_feature_names_out()

# Convert the DTM to a DataFrame for display
binary_ngram_dtm = pd.DataFrame(binary_ngram_dtm.toarray(),columns = bigrams)

binary_ngram_dtm


Unnamed: 0,document,first,first document,is,is the,the,the first,this,this is
0,1,1,1,1,1,1,1,1,1


## 6. Frequency with N-Grams

Each entry is the count of the n-gram's occurrences in the document.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]
vectorizer = CountVectorizer(ngram_range=(1, 2))
freq_ngram_dtm = vectorizer.fit_transform(corpus)
print(freq_ngram_dtm.toarray())


[[0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 0 0 1 0 1]
 [0 0 2 1 0 0 1 1 0 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1]]
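
Because the array above is printed without column labels, it can help to see how the n-gram counts for one document arise. The sketch below is a plain-Python approximation, not scikit-learn's implementation: `ngram_counts` is a hypothetical helper that lowercases the text, keeps tokens of two or more word characters (mirroring `CountVectorizer`'s default token pattern), and counts every unigram and bigram.

```python
import re
from collections import Counter


def ngram_counts(text, n_min=1, n_max=2):
    """Count all n-grams of length n_min..n_max in `text`."""
    # Lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts


counts = ngram_counts("This document is the second document.")
for term in sorted(counts):
    print(f"{term!r}: {counts[term]}")
```

The result has ten distinct n-grams with "document" counted twice, which matches the ten non-zero entries in the second row of the array above.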


## 7. TF-IDF with N-Grams

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_ngram_dtm = vectorizer.fit_transform(corpus)

terms = vectorizer.get_feature_names_out()

tfidf_ngram_dtm =  pd.DataFrame(tfidf_ngram_dtm.toarray(), columns=terms)
tfidf_ngram_dtm

Unnamed: 0,and,and this,document,document is,first,first document,is,is the,one,second,second document,the,the first,the second,the third,third,third one,this,this document,this is
0,0.0,0.0,0.322764,0.0,0.424396,0.424396,0.250655,0.250655,0.0,0.0,0.0,0.250655,0.424396,0.0,0.0,0.0,0.0,0.250655,0.0,0.322764
1,0.0,0.0,0.515421,0.338858,0.0,0.0,0.200135,0.200135,0.0,0.338858,0.338858,0.200135,0.0,0.338858,0.0,0.0,0.0,0.200135,0.338858,0.0
2,0.354136,0.354136,0.0,0.0,0.0,0.0,0.209158,0.209158,0.354136,0.0,0.0,0.209158,0.0,0.0,0.354136,0.354136,0.354136,0.209158,0.0,0.269329
