# Preface

**Natural Language Processing**, commonly abbreviated as **``NLP``**, involves the computational handling and interpretation of human language, be it in spoken or written form. This field, rooted in ``linguistics``, has evolved significantly over the past half-century, especially with the advent of computer technology. In this lesson, we will delve into the essence of natural language processing and its significance.

When dealing with textual data in the realm of **computer engineering**, it's essential to master three core domains:

- **Text Preprocessing**: This encompasses the tasks of loading, analyzing, filtering, and refining text data before any computational modeling.
- **Text Representation**: Beyond the traditional bag-of-words model, it's crucial to understand the advanced distributed representations like word embeddings.
- **Text Generation**: This domain covers a spectrum of intriguing challenges, from generating image captions to facilitating machine translation.

# Data Preparation


Transitioning from ``raw text`` to building a ``machine learning`` or ``deep learning`` model isn't a direct process.

Initially, you need to **preprocess your text**, which involves tasks like breaking it into words, handling punctuation, and addressing letter casing.

In fact, there's a range of text preparation techniques that may be necessary, and the specific methods you choose will depend on your natural language processing objectives. In this section, we will explore how to clean and preprocess text effectively to make it ready for machine learning modeling. By the end of this section, you will have learned:

- The basics of creating your own simple text cleaning tools.
- How to progress to using more advanced techniques available in the [NLTK library](https://www.nltk.org/).
- Key considerations when preparing text for natural language processing models.

## Metamorphosis by Franz Kafka

First, we'll begin by choosing a dataset.

In this section, we'll be working with the text from the book **``Metamorphosis``** by [Franz Kafka](https://www.amazon.com.br/Metamorfose-Edi%C3%A7%C3%A3o-Exclusiva-Amazon/dp/6580210001/). There isn't a specific reason for this choice other than the fact that it's relatively short, I personally enjoy it, and you might too. It's one of those classic pieces of literature that many students encounter in school. The complete text of **``Metamorphosis``** is freely available from [Project Gutenberg](https://www.gutenberg.org/). You can download the plain text version of the book with the code below:

In [None]:
import requests

# URL of the file you want to download
url = "http://www.gutenberg.org/cache/epub/5200/pg5200.txt"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Open a local file to save the response content
    with open("pg5200.txt", "wb") as file:
        file.write(response.content)
    print("Download completed successfully. The file has been saved as 'pg5200.txt'")
else:
    print(f"Error downloading the file. Status code: {response.status_code}")

In [None]:
# rename the file
!mv pg5200.txt metamorphosis.txt

Download the file and save it to your current working directory as **``metamorphosis.txt``**. This file includes header and footer details, specifically copyright and license information, which we don't need.

- Open the file, remove the header and footer sections. The beginning of the cleaned-up file should appear as:

> One morning, when Gregor Samsa woke from troubled dreams, he found himself
transformed in his bed into a horrible vermin.

The file should end with:

> And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.

In [None]:
# delete lines 1 to 44
!sed -i '1,44d' metamorphosis.txt

In [None]:
# delete lines 1861 to 2225
!sed -i '1861,2225d' metamorphosis.txt

In [None]:
# load text (the file contains typographic quotes, so read it as UTF-8)
filename = 'metamorphosis.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

In [None]:
text
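The hard-coded line numbers in the `sed` commands above only fit one particular copy of the file. As a hedged alternative, the sketch below keeps only what lies between the `*** START OF` and `*** END OF` marker lines that Project Gutenberg files usually contain. The helper name `strip_gutenberg_boilerplate` and the sample text are hypothetical, and the exact marker wording is an assumption that may differ between files. Title and translator lines that follow the start marker would still need to be removed separately.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the Gutenberg start and end marker lines.

    Assumes marker lines begin with '*** START OF' and '*** END OF'.
    If a marker is missing, that side of the text is left untouched.
    """
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()


# Tiny made-up sample that mimics the layout of a Gutenberg file
sample = "\n".join([
    "The Project Gutenberg eBook of Metamorphosis",
    "*** START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***",
    "One morning, when Gregor Samsa woke from troubled dreams, he found himself",
    "transformed in his bed into a horrible vermin.",
    "*** END OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***",
    "License text follows here.",
])

body = strip_gutenberg_boilerplate(sample)
print(body)
```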

## Text Cleaning is Task-Specific

Once you've acquired your text data, the initial step in tidying it up is to have a clear objective in mind. In this light, assess your text to pinpoint what might be beneficial. Spend a moment examining the text. What stands out to you? From my perspective:

- The content is in plain text, eliminating the need to process any markup.
- The content has been translated from the original German and utilizes UK English conventions, such as "travelling."
- Text lines break roughly after every 70 characters.
- I couldn't spot any blatant typos or spelling errors.
- Various punctuation marks, like commas, apostrophes, quotation marks, and question marks, are present.
- Hyphenated descriptors, such as "armour-like," are used.
- Em dashes are frequently employed to extend sentences—perhaps consider substituting with commas?
- Names like "Mr. Samsa" are mentioned.


There don't appear to be any numbers that need handling, such as "1999". There are section markers, for instance "II" and "III". A careful reader will likely spot more details. In this section, we'll walk through the basic steps of text cleaning. While doing so, keep in mind the specific goals we might have for this document. For instance:

- If we wanted to build a Kafkaesque language model, we might keep all the case, quotes, and other punctuation.
- If we wanted to classify documents as "Kafka" or "Not Kafka", we might strip case and punctuation, and even reduce words to their stems.

Use your task's objectives as the guide when preparing your text data.
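One quick way to examine a text is to count the characters that are neither letters, digits, nor white space. The sketch below does this on a short made-up sample (an assumption for illustration; on the real data you would pass the loaded `text` instead). It makes typographic quotes, hyphens, and similar characters visible before you decide how to treat them.

```python
from collections import Counter

# Short sample in the style of the book (illustrative only)
sample = "“What’s happened to me?” he thought. It wasn’t a dream. He lay on his armour-like back."

# Count every character that is not alphanumeric and not white space
symbol_counts = Counter(
    ch for ch in sample if not ch.isalnum() and not ch.isspace()
)

for symbol, count in symbol_counts.most_common():
    print(repr(symbol), count)
```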

## Manual Tokenization

Text cleaning typically means turning the raw text into a list of words or tokens that our machine learning models can work with. A straightforward way to do this is to split the document on white space, which includes spaces, newlines, and tabs. In Python, this can be done with the **``split()``** method on the loaded string.

In [None]:
# split into words by white space
words = text.split()
print(words[:100])

Executing the given sample breaks the document into a long list of words and displays the first 100. Punctuation inside words remains intact, as seen in **``"wasn't"``** and **``"armour-like"``**. However, the punctuation marking the end of a sentence also stays attached to the last word, as in **``"thought."``**
```python
['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he',
'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He',
'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little',
'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by',
'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover',
'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,',
'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved',
'about', 'helplessly', 'as', 'he', 'looked.', '“What’s', 'happened', 'to', 'me?”', 'he',
'thought.', 'It', 'wasn’t', 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
```

Another approach might be to use the [regular expression module (re)](https://docs.python.org/3/library/re.html) and split the document into words by selecting for strings of alphanumeric characters **``(a-z, A-Z, 0-9 and _)``**. For example:

In [None]:
import re

# split based on words only
words = re.split(r'\W+', text)
print(words[:100])

```python
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself',
'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like',
'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly',
'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was',
'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His',
'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved',
'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It',
'wasn', 't', 'a', 'dream', 'His', 'room']
```

Upon executing the sample once more, we obtain our desired list of words. Now, it's evident that **``"armour-like"``** has been split into two separate words: **``"armour"``** and **``"like"``** (which is satisfactory). However, contractions such as **``"What's"``** have also been divided into **``"What"``** and **``"s"``** (which isn't quite optimal).
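If keeping contractions and hyphenated words intact matters for your task, one possible pattern matches runs of letters that may be joined by an apostrophe or a hyphen. This is a minimal sketch, not a complete tokenizer: the pattern and the sample sentence are assumptions for illustration, and the pattern only covers unaccented Latin letters.

```python
import re

# Letters, optionally joined by ', ’ or - to further letters
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

sample = "“What’s happened to me?” he thought. It wasn’t a dream. He lay on his armour-like back."

words_kept_whole = WORD_PATTERN.findall(sample)
print(words_kept_whole)
```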

We may want the words without punctuation such as commas and quotes, while keeping contractions as a single unit. One approach is to split the document on white space and then use a regular expression to delete every punctuation character from each word. Python offers a constant named **``string.punctuation``** that lists the ASCII punctuation characters; typographic characters, such as the curly quotes in this file, are not part of it and are left in place. For instance:

In [None]:
import string

# split into words by white space
words = text.split()

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

In [None]:
string.punctuation
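An alternative to the regular expression above is a translation table built with `str.maketrans()`, which deletes a fixed set of characters in one pass. The sketch below also adds the typographic quotes to the set, since `string.punctuation` only contains ASCII characters. The `extra_punctuation` string and the sample sentence are assumptions chosen for this text; extend them for your own data.

```python
import string

sample = "“What’s happened to me?” he thought. It wasn’t a dream."

# Typographic quotes are not in string.punctuation, so add them explicitly
extra_punctuation = "“”‘’"

# Third argument of maketrans lists the characters to delete
table = str.maketrans("", "", string.punctuation + extra_punctuation)

stripped_words = [word.translate(table) for word in sample.split()]
print(stripped_words)
```

Note that removing the apostrophe turns a contraction such as "wasn’t" into the single token "wasnt".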


It's a typical practice to unify all words to a single case. While this action reduces the vocabulary's size, it can lead to the loss of certain nuances (a classic illustration being the distinction between 'Apple' the corporation and 'apple' the fruit). We can transform all words to lowercase by invoking the **lower()** method on each word. As an illustration:

In [None]:
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])
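To see the effect of case folding on vocabulary size, you can compare the number of distinct tokens before and after lowercasing. The sample sentence below is made up for illustration; on the real data you would use the `words` list instead.

```python
sample = "The bedding slid off the bed and the room was quiet"
sample_words = sample.split()

# Distinct tokens with the original casing
vocab_cased = set(sample_words)

# Distinct tokens after lowercasing ("The" and "the" collapse into one entry)
vocab_lower = {word.lower() for word in sample_words}

print(len(vocab_cased), len(vocab_lower))
```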

Text cleaning can be challenging, is often tailored to specific problems, and comes with its compromises. It's essential to keep in mind that simplicity is key. Opt for more straightforward text data, streamlined models, and smaller vocabularies. Complexity can always be added later to determine if it enhances the model's performance. Up next, we'll explore tools within the [NLTK library](https://www.nltk.org/) that provide functionalities beyond basic string division.

## Tokenization and Cleaning with NLTK

The Natural Language Toolkit, abbreviated as NLTK, is a Python library designed for text processing and modeling. It offers valuable tools for importing and preprocessing text, preparing it for use with machine learning and deep learning models.

> You can download NLTK via your preferred package manager, like pip. For machines compatible with POSIX, the command would be:

```
pip install -U nltk
```

Once installed, you'll need to set up the data associated with the library, which includes a comprehensive collection of documents useful for testing other NLTK tools. There are several methods to achieve this, one of which is through a script:

```python
import nltk
nltk.download()
```

Or from command line:

In [None]:
!python -m nltk.downloader all

In [None]:
import nltk
nltk.__version__

### Sentence Segmentation




An initial beneficial step is to break the text down into individual sentences. Some modeling techniques, like **``Word2Vec``**, often work better with inputs formatted as paragraphs or sentences.

You can begin by segmenting your **text into sentences**, further dividing each sentence into words, and then saving each sentence as a separate line in a file. NLTK offers the **``sent_tokenize()``** function for this purpose. In the following example, we load the **``metamorphosis.txt``** file, segment it into sentences, and display the first sentence.

In [None]:
from nltk import sent_tokenize

# load data
filename = 'metamorphosis.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

# split into sentences
sentences = sent_tokenize(text)
print(sentences[0])
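To see why a trained sentence tokenizer is useful, compare it with a naive rule that splits after every period, question mark, or exclamation mark. The sketch below uses only the standard library and a made-up sample; it wrongly breaks after the abbreviation "Mr.", which is the kind of case a dedicated sentence tokenizer is designed to handle.

```python
import re

sample = "Mr. Samsa stood up. He looked around."

# Naive rule: split on white space that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
```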

### Split to words

NLTK offers a function named **``word_tokenize()``** that divides **strings into tokens**, which are typically **words**. This function segments tokens considering white space and punctuation. As such, punctuation marks like commas and periods are treated as individual tokens.

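> Additionally, contractions are divided (for instance, "What's" is tokenized into "What" and "'s"). Because this edition uses the typographic apostrophe (’) rather than the ASCII one, the output below shows "What’s" split into three tokens instead: "What", "’", and "s". Quotation marks are kept as separate tokens. Here's an example: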

In [None]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)
print(tokens[:100])

Executing the code reveals that punctuation marks are tokenized, giving us the option to filter them out if desired.


```python
['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he',
'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.',
'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a',
'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and',
'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly',
'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.',
'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the',
'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '“',
'What', '’', 's', 'happened']
```

### Remove Punctuation




We can exclude tokens that aren't of interest, including standalone punctuation marks. This can be achieved by iterating through each token and retaining only those that are entirely alphabetic. Python provides the **``isalpha()``** string method for this purpose. Note that this filter also drops hyphenated tokens such as "armour-like" and any token containing digits. Here's an example:

In [None]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)

# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

```python
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself',
 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and',
'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly',
'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly',
'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs',
'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about',
'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a',
 'dream', 'His', 'room', 'a', 'proper']
```
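The behaviour of `isalpha()` is easy to check on a handful of tokens. The list below is hand-written for illustration (it is not tokenizer output) and shows which kinds of tokens survive the filter.

```python
# Hand-picked tokens covering the interesting cases
sample_tokens = ["Gregor", "armour-like", "’", "s", "1999", ",", "room"]

# Keep only tokens made up entirely of letters
alphabetic_only = [token for token in sample_tokens if token.isalpha()]
print(alphabetic_only)
```

The hyphenated word, the number, and the punctuation are removed, while the stray "s" left over from a contraction is kept.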

### Eliminate Stop Words (and Integration Process)



**Stop words** are typically words that don't add significant meaning to a sentence. They include common words like "the," "a," and "is." In tasks like document classification, it can be beneficial to remove these stop words. NLTK offers lists of commonly used stop words for various languages, including English. Here's how you can load them:

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

```python
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it',
"it's",'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that',
"that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below',
 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",  
'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'no',
'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
"isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
```

NLTK provides stop word lists for other languages as well. The following code prints the names of the languages that the stop words corpus supports:

In [None]:
from nltk.corpus import stopwords

# List all languages supported by the stopwords corpus
supported_languages = stopwords.fileids()

print(supported_languages)

The provided stop words are all in lowercase and, apart from the apostrophes in contractions such as "don't", free of punctuation. When comparing your tokens to these stop words for filtering, it's crucial to preprocess your text consistently. Here's a brief guide on setting up a text preparation pipeline:

- Load the raw text.
- Tokenize the text.
- Convert tokens to lowercase.
- Strip each token of punctuation.
- Retain only alphabetic tokens.
- Exclude tokens identified as stop words.

In [None]:
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# load data
filename = 'metamorphosis.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

# split into words
tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])


### Stemming Words



Stemming is the process of reducing words to their root or base form. For instance, **``fishing``** and **``fished``** both stem to **``fish``** (how a related word such as **``fisher``** is treated depends on the algorithm). In tasks like document classification, stemming can be advantageous as it not only condenses the vocabulary but also emphasizes the sentiment or general intent of a document over its intricate meaning. Various stemming techniques exist, with the **Porter Stemming** algorithm being one of the most renowned and enduring. This algorithm can be accessed in NLTK using the **``PorterStemmer``** class. Here's a demonstration:

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# load data
filename = 'metamorphosis.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

# split into words
tokens = word_tokenize(text)

# stemming of words
porter = PorterStemmer()

stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

```python
['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself',
 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'he', 'lay', 'on', 'hi', 'armour-lik',
'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he','could', 'see', 'hi', 'brown', 'belli',
',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli',
'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg',
',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about',
'helplessli', 'as', 'he', 'look', '.', '“', 'what', '’', 's', 'happen']
```

Upon executing the example, it's evident that words are truncated to their stems, for instance, **``troubled``** becomes **``troubl``**. The stems are not always dictionary words, as **``wa``** (from "was") and **``hi``** (from "his") show. Additionally, the stemming process also converts tokens to lowercase, presumably for efficient internal referencing in word tables.
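To make the idea of suffix stripping concrete, here is a deliberately naive stemmer written with the standard library only. It is not the Porter algorithm and is far less careful; the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
# Suffixes are checked in this order; only the first match is removed
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3


def naive_stem(word):
    """Lowercase the word and strip one common suffix, if the stem stays long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


sample_words = ["Fishing", "fished", "legs", "was", "troubled"]
naive_stems = [naive_stem(word) for word in sample_words]
print(naive_stems)
```

The minimum length check keeps short words such as "was" unchanged, a design choice made here only to keep the toy example readable.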

### Expanding on Text Cleaning: A Comprehensive Overview



In the context of our preliminary discussion, it is pertinent to note that the source text utilized was relatively uncomplicated. However, when delving into real-world applications, several additional layers of complexity emerge in text cleaning. The following outlines some critical considerations:

- **Large-Scale Documents**: How does one efficiently handle extensive documents or vast collections of textual data that exceed memory constraints?
- **Structured Document Extraction**: There's a challenge in gleaning text from formats like HTML, PDF, or other structured document types.
- **Transliteration Concerns**: The task of converting characters from non-Latin scripts to the English alphabet poses its unique set of problems.
- **Unicode Handling**: Text may need to be decoded from a known encoding, such as UTF-8, and its Unicode characters normalized to a consistent form.
- **Domain-Specific Lexical Units**: Specialized terminologies, acronyms, and phrases in certain fields may necessitate specialized treatment.
- **Numerical Entities**: Handling or omitting numerics, which can range from dates to quantities, requires careful consideration.
- **Typographical Errors**: Identifying and rectifying common typographical errors and misspellings is a crucial step.
- **Additional Factors**: The list is by no means exhaustive, and there are myriad other considerations based on specific project requirements.

It's imperative to understand that achieving an entirely **``clean``** text, devoid of all inconsistencies, is a formidable challenge. The notion of **``cleanliness``** in text is often contextual, determined by the objectives and parameters of a specific project. It is recommended to consistently evaluate your data post each transformation phase. The practice of storing intermediary data post every transformation can be beneficial for in-depth analysis, as nuances and issues often become apparent upon closer inspection.

The overarching principle, as emphasized in this discourse, is that effective text cleaning is a meticulous balance of resources, time, and domain knowledge.
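For the large-document case mentioned above, one common pattern is to stream the file line by line instead of reading it into memory at once. The sketch below is a minimal illustration under simple assumptions: white space tokenization, a hypothetical helper named `iter_tokens`, and a small temporary file standing in for a large corpus.

```python
import os
import tempfile
from collections import Counter


def iter_tokens(path, encoding="utf-8"):
    """Yield lowercased, white space separated tokens one line at a time."""
    with open(path, "rt", encoding=encoding) as handle:
        for line in handle:          # only one line is held in memory
            for token in line.lower().split():
                yield token


# Create a small temporary file to stand in for a large corpus
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("One morning Gregor woke\nfrom troubled dreams\none morning\n")
    tmp_path = tmp.name

# The Counter consumes the generator without building a full token list
token_counts = Counter(iter_tokens(tmp_path))
os.remove(tmp_path)

print(token_counts.most_common(2))
```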

## Text Data Preparation Using Scikit-learn: A Guide

Textual data necessitates unique pre-processing steps to make it suitable for predictive modeling.

> Initially, it involves parsing the text to separate and identify individual words, a process known as **tokenization**.

Subsequently, these words must be translated into numeric formats like integers or floating-point values. This transformation facilitates their use in machine learning models and is termed as **feature extraction** or **vectorization**.

The scikit-learn library in Python simplifies this procedure by providing intuitive tools for both tokenization and feature extraction. In this section, we'll delve deep into how you can ready your text data for predictive modeling harnessing the power of scikit-learn. Upon completion, you'll be adept at:

- Transforming text into word count vectors using **CountVectorizer**.
- Transitioning text into word frequency vectors via **TfidfVectorizer**.


### CountVectorizer in Scikit-learn: A Simplified Guide


The **CountVectorizer** in **``scikit-learn``** offers a straightforward mechanism to achieve three core objectives with text data:

- tokenize a set of text documents
- establish a vocabulary of identified words
- encode new documents using this vocabulary


Here's a concise step-by-step on its usage:

- Instantiate the **``CountVectorizer``** class.
- Utilize the **``fit()``** method to derive a vocabulary from your **text corpus**.
- As and when required, use the **``transform()``** method to vectorize documents based on the previously established vocabulary.

> The **outcome** is an encoded vector whose length corresponds to the entire vocabulary.

Each entry in this vector signifies the occurrence count of its respective word in the document. Given that many of these counts are zero (the word doesn't appear in the document), **such vectors are typically sparse**.

Python's **``scipy.sparse``** package facilitates efficient handling of these sparse vectors.

Notably, the vectors produced post **``transform()``** are sparse. If you wish to visualize these vectors for a more intuitive understanding, you can revert them to standard NumPy arrays using the **``toarray()``** method.

Here's a hands-on example showcasing the utility of **``CountVectorizer``** for tokenization, vocabulary creation, and document encoding:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

```python
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2,
'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}

(1, 8)

<class 'scipy.sparse._csr.csr_matrix'>

[[1 1 1 1 1 1 1 2]]
```


From our observations, it's evident that the tokenization process, by default, **converts all words to lowercase and disregards punctuation**.

While these are the standard settings, the tokenization process offers various customizable options.

> I strongly recommend perusing the API documentation to familiarize yourself with these configurations.

Upon executing the example, the vocabulary is displayed initially, followed by the dimensions of the encoded document.

> The vocabulary consists of 8 unique words, which means the encoded vectors span a length of 8.

Notably, **the encoded vector manifests as a sparse matrix**. When viewed in its array format, it's evident that each word, with the exception of **``the``** (indexed as 7), has been counted once. The word **``the``** registers a count of 2.

> It's crucial to note that if a document contains words not present in the established vocabulary, the vectorizer will still function.

**Words not found in the vocabulary are simply disregarded**, and their counts in the resultant vector are omitted. To illustrate, let's use the aforementioned vectorizer to encode a document comprising both a word from the vocabulary and a word outside of it.

In [None]:
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())

When executing this example, the output is the array representation of the encoded **sparse vector**. It indicates a single occurrence of the word present in the vocabulary, while the word absent from the vocabulary is entirely overlooked.
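To make the fit and transform steps less of a black box, the sketch below reimplements the idea with the standard library. It is an illustration of the concept, not scikit-learn's implementation: the helper names are hypothetical, and the token pattern (lowercased runs of two or more word characters) is an assumption modeled on the documented default.

```python
import re
from collections import Counter


def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({token for doc in docs for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Encode each document as a list of counts; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, index in vocabulary.items():
            row[index] = counts[term]   # Counter returns 0 for absent terms
        rows.append(row)
    return rows


vocabulary = fit_vocabulary(["The quick brown fox jumped over the lazy dog."])
first_encoding = transform(["The quick brown fox jumped over the lazy dog."], vocabulary)
second_encoding = transform(["the puppy"], vocabulary)

print(vocabulary)
print(first_encoding)
print(second_encoding)
```

In this sketch the word "puppy" has no column in the vocabulary, so it contributes nothing to the second vector.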

### Word Frequencies with TfidfVectorizer

In the realm of text data analysis, word counts offer a rudimentary technique for representing textual information. However, a salient challenge with relying solely on raw word counts is that commonplace words—often referred to as **``stop words``** like **``the``**—typically emerge frequently.

Consequently, these high frequencies might inadvertently overshadow more pertinent terms in the document, rendering the vectors less informative. To address this concern, a more sophisticated measure, known as **Term Frequency-Inverse Document Frequency (TF-IDF)**, has been propounded.

**TF-IDF** **assigns a weight** to each word in a document based on two key metrics:

- **Term Frequency (TF)**: This metric computes the recurrence of a word within a specific document. Essentially, it provides a measure of the word's significance within that individual document.
- **Inverse Document Frequency (IDF)**: This metric serves to attenuate the weight of words that manifest ubiquitously across multiple documents, thereby ensuring that words that are pervasive and perhaps less meaningful do not unduly dominate.

To elucidate further without delving into intricate mathematical derivations, TF-IDF scoring aims to accentuate words that hold significant value within a particular document but aren't necessarily prevalent across an entire corpus of documents.

The **``TfidfVectorizer``** in scikit-learn serves a dual purpose:

1. It not only tokenizes documents but also discerns the vocabulary and computes the IDF weightings.
2. Subsequently, it facilitates the encoding of novel documents in accordance with this learned knowledge.

For instances where a pre-existing **``CountVectorizer``** has been utilized, the **``TfidfTransformer``** can be seamlessly integrated to solely determine the IDF values and subsequently embark on the encoding process. The procedural steps encompassing creation, fitting, and transformation, akin to the **``CountVectorizer``**, remain consistent.

The following example shows how to use the **``TfidfVectorizer``**: it learns the vocabulary and IDF weightings from three short documents and then encodes the first of them.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The Fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# gives us a mapping of each term to its index in the vector
print(vectorizer.vocabulary_)

# provides the computed inverse document frequency (IDF) values for each term
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

In [None]:
vectorizer.transform([text[1]]).toarray()

In [None]:
vectorizer.transform([text[2]]).toarray()

**Mathematical Breakdown of TF-IDF**

**1. Term Frequency (TF)**

The term frequency for a term `t` in a document `d` is defined as:

$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$

**2. Inverse Document Frequency (IDF)**

The inverse document frequency for a term `t` is defined as:

$\text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) + 1$

**3. TF-IDF**

The TF-IDF score is the product of TF and IDF:

$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$


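Please note:

In practice, TF-IDF implementations differ in the details. By default, the scikit-learn **``TfidfVectorizer``** uses the raw term count for TF (without dividing by the document length), applies a smoothed IDF, $\ln\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1$, where $N$ is the number of documents and $\text{df}(t)$ the number of documents containing $t$, and then scales each document vector to unit Euclidean (L2) length. Stop words are removed only if you pass the ``stop_words`` argument.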
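The weights can also be worked out by hand for the three example documents. The sketch below uses only the standard library and follows my understanding of the scikit-learn defaults (raw counts for TF, smoothed IDF, and L2 normalization); treat it as an illustration of the arithmetic rather than the library's code. The tokenization rule is an assumption modeled on the documented default pattern.

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The Fox",
]


def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(doc) for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(tokens):
    """Multiply raw counts by IDF, then scale the vector to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw


for term in vocab:
    print(term, round(idf[term], 4))

first_vector = tfidf_vector(tokenized[0])
print([round(value, 4) for value in first_vector])
```

The term "the" occurs in all three documents and therefore receives the lowest IDF, while terms that occur in a single document receive the highest.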

## How to Prepare Text Data With Keras

In the context of **deep learning**, raw text cannot be directly utilized. Instead, it needs to be converted into numerical format, like word embeddings, to be effectively employed as input or output for machine learning and deep learning algorithms. The Keras library comes equipped with fundamental tools for this purpose. In this section, you'll delve into the ways Keras aids in text data preparation. Upon completion, you'll be familiar with:

- Handy methods provided by Keras for streamlining text data preparation.
- The Tokenizer API, which can be tailored based on training data and subsequently employed to encode training, validation, and test data.
- Four distinct document encoding methods available via the Tokenizer API.

### Convert Text to Individual Words Using the ``text_to_word_sequence`` Method

When dealing with text, an initial beneficial approach is to segment it into individual words. These individual words are referred to as tokens, and the act of dividing text into these tokens is termed tokenization. Keras offers the **``text_to_word_sequence()``** method to assist in breaking down text into word lists. By its default settings, the function performs three actions:

- Splits words on spaces (`split=' '`).
- Removes punctuation (the characters listed in the `filters` argument).
- Converts the text to lowercase (`lower=True`).

These default behaviors can be modified by providing specific arguments to the method. Here's a demonstration of how the **``text_to_word_sequence()``** function can be employed to segment a given text (in this context, a straightforward string) into its constituent words.

In [None]:
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# tokenize the document
result = text_to_word_sequence(text)
print(result)

This initial step is beneficial, yet additional pre-processing is essential before the text is ready for use.
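
To make the three default steps concrete, here is a rough pure-Python approximation of them. The function name `simple_word_sequence` and the `FILTERS` constant are hypothetical names chosen for this sketch; it is meant to illustrate the behavior described above, not to reproduce the Keras implementation.

```python
# punctuation and whitespace characters to strip (assumed to mirror the default filter set)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    if lower:
        text = text.lower()
    # replace every filtered character with the split character
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # split on the separator and drop the empty strings left behind
    return [tok for tok in text.split(split) if tok]

print(simple_word_sequence('The quick brown fox jumped over the lazy dog.'))
print(simple_word_sequence('Hello, World!', lower=False))
```

Passing `lower=False` keeps the original casing, which shows how the defaults can be overridden one argument at a time.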

### Encoding with one_hot

A common approach is to represent a document as a sequence of integers, one per word. Keras offers the **``one_hot()``** function, which tokenizes and integer-encodes a text document in a single step.

> Contrary to what its name implies, this function doesn't produce a one-hot encoding of the document.

Instead, it acts as a **wrapper** for the **``hashing_trick()``** function, which will be discussed subsequently. This function outputs an integer-encoded version of the document.

> Due to the utilization of a hash function, **there might be instances of collisions**, implying that some words might not get distinct integer values.

Similar to the **``text_to_word_sequence()``** function mentioned earlier, **``one_hot()``** converts the text to lowercase, removes punctuation, and separates words based on spaces.

Furthermore, when using this function, the vocabulary size (the total number of unique words) has to be specified. This could be the number of unique words in the document, or larger if you plan to encode other documents that contain additional words. The vocabulary size determines the size of the hash space into which words are mapped. By default, Python's built-in **``hash``** function is used. However, as the next section shows, a different hash function can be chosen when calling the **``hashing_trick()``** function directly.

To dissect the document into individual words, we can utilize the **``text_to_word_sequence()``** function from earlier and then employ a **``set``** to display only the distinct words within the document. The count of this set can provide an estimate of the vocabulary size for a single document. For instance:

In [None]:
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

We can integrate this with the **``one_hot()``** function to encode the words within the document. The entire example is provided below. To reduce the likelihood of collisions during word hashing, the vocabulary size has been expanded by 30%.

In [None]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

### Hash Encoding with ``hashing_trick``

> Integer and count-based encodings have a constraint: they necessitate the maintenance of a vocabulary along with its corresponding integer mappings.

One way to circumvent this is by employing a **``one-way hash function``** to transform words into integers. This method eliminates the need for vocabulary tracking, making it quicker and more memory-efficient.

Keras offers the **``hashing_trick()``** function, which both tokenizes and integer-encodes the document, just like the **``one_hot()``** function. It offers more flexibility: you can choose the hash function, with `'hash'` (Python's built-in function) as the default. Because the built-in hash of a string is not guaranteed to be the same across Python runs, the encoding it produces may change from run to run; the alternative built-in option, **``md5``**, is stable. You can also pass your own function. Here's a demonstration of how a document can be integer-encoded using the **``md5``** hash function.

In [None]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)
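
The idea behind hash encoding, and the way collisions arise, can be sketched without Keras. In the snippet below, `stable_hash` and `hash_encode` are hypothetical helper names, and mapping words into the range `1..n-1` (leaving 0 unused) is an assumption of this sketch; the actual Keras functions may differ in detail.

```python
import hashlib
import re

def stable_hash(word):
    # md5 yields the same integer in every Python process
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hash_encode(text, n):
    words = re.findall(r"[a-z]+", text.lower())
    # map each word into 1..n-1, leaving index 0 unused
    return [stable_hash(w) % (n - 1) + 1 for w in words]

text = 'The quick brown fox jumped over the lazy dog.'
words = re.findall(r"[a-z]+", text.lower())

# hash space: unique word count expanded by 30%
n = round(len(set(words)) * 1.3)
encoded = hash_encode(text, n)
print(encoded)

# group words by bucket; a bucket holding more than one distinct word is a collision
buckets = {}
for w, idx in zip(words, encoded):
    buckets.setdefault(idx, set()).add(w)
collisions = {idx: ws for idx, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

Repeated words always land in the same bucket, while distinct words may share one; enlarging `n` lowers the chance of that happening at the cost of a larger index range.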

### Tokenizer API


Up to this point, we have explored straightforward convenience methods in **``Keras``** for text preparation. However, **``Keras``** offers a more **``advanced API``** for text preparation, which can be employed effectively across multiple text documents.

This approach is particularly well suited to larger projects. **``Keras``** provides the **``Tokenizer``** class for preparing text documents for deep learning. The **``Tokenizer``** must first be constructed and then fit on either raw text documents or integer-encoded text documents. For instance:

```python
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!','Good work','Great effort','nice work','Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
```

Once fit, the **``Tokenizer``** offers four attributes that let you query what it has learned about your documents:

- **Word Counts** (`word_counts`): A dictionary that maps each word to the number of times it occurred in the documents the Tokenizer was fit on.
- **Word Docs** (`word_docs`): A dictionary that maps each word to the number of documents it appeared in.
- **Word Index** (`word_index`): A dictionary that assigns a unique integer to each word.
- **Document Count** (`document_count`): An integer giving the total number of documents that were used to fit the Tokenizer.

```python
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
```

In [None]:
from keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(docs)

# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
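
To clarify what these four attributes hold, the following pure-Python sketch derives equivalent statistics for the same five documents. The `tokenize` helper is a hypothetical stand-in for the Tokenizer's own text cleaning, and ranking words by descending count with ties kept in first-seen order is an assumption of this sketch.

```python
from collections import Counter
import re

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def tokenize(text):
    # assumed cleaning: lowercase, keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())

word_counts = Counter()   # word -> total occurrences
word_docs = Counter()     # word -> number of documents containing it
for doc in docs:
    tokens = tokenize(doc)
    word_counts.update(tokens)
    word_docs.update(set(tokens))

# a single integer: how many documents were seen
document_count = len(docs)

# most frequent word gets index 1; index 0 is left unused
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(dict(word_counts))
print(document_count)
print(word_index)
print(dict(word_docs))
```

Note that `document_count` is a plain integer, whereas the other three are dictionaries keyed by word.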

Once the **``Tokenizer``** has been trained on the training data, you can employ it to encode documents within both the training and test datasets. The **``texts_to_matrix()``** function provided by the **``Tokenizer``** enables the creation of one vector for each document in the input. The length of these vectors is the vocabulary size plus one, because index 0 is reserved and never assigned to a word.

This function offers a range of standard text encoding schemes commonly used in **``bag-of-words``** models. You can specify your desired encoding scheme by providing a **``mode``** argument to the function. The available modes are as follows:

1. **Binary**: This mode indicates whether each word is present or absent in the document. It is the default encoding scheme.

2. **Count**: In this mode, the encoding represents the count of each word in the document.

3. **TF-IDF (Term Frequency-Inverse Document Frequency)**: This mode calculates the TF-IDF score for each word in the document, providing a measure of the word's importance within the document and across the entire corpus.

4. **Frequency (Freq)**: The frequency mode represents the frequency of each word as a ratio of its occurrences to the total number of words within each document.

In [None]:
# encoding represents the count of each word in the document
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

In [None]:
# indicates whether each word is present or absent in the document
encoded_docs = t.texts_to_matrix(docs, mode='binary')
print(encoded_docs)

In [None]:
# calculates the TF-IDF score for each word in the document
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')
print(encoded_docs)

In [None]:
# frequency of each word as a ratio of its occurrences
# to the total number of words within each document
encoded_docs = t.texts_to_matrix(docs, mode='freq')
print(encoded_docs)
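
The binary, count, and frequency modes can be sketched by hand as follows. The `encode` function is a hypothetical helper, the `word_index` mapping is hard-coded for the five example documents, and dividing by the number of known words in the document is an assumption of this sketch; the TF-IDF mode is left out because its exact weighting is implementation-specific.

```python
import re

# word -> column index; column 0 is left unused (assumed layout for this sketch)
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

def encode(doc, mode="binary"):
    # keep only words that are in the vocabulary
    known = [w for w in re.findall(r"[a-z]+", doc.lower()) if w in word_index]
    row = [0.0] * (len(word_index) + 1)
    for w in known:
        row[word_index[w]] += 1.0                      # raw counts
    if mode == "count":
        return row
    if mode == "binary":
        return [1.0 if c > 0 else 0.0 for c in row]    # presence or absence
    if mode == "freq":
        return [c / len(known) if known else 0.0 for c in row]
    raise ValueError(f"unsupported mode: {mode}")

for mode in ("binary", "count", "freq"):
    print(mode, encode("Good work", mode))
```

Each row has nine positions: one per vocabulary word plus the unused position 0.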

## Bag of Words

The bag-of-words model serves as a method for representing text data when employing machine learning algorithms for text-related tasks.

It is characterized by its simplicity in both comprehension and implementation, and it has proven to be highly effective in various applications, including language modeling and document classification.

This section aims to introduce you to the bag-of-words model as a feature extraction technique in the field of natural language processing. By the end of this section, you will gain insights into the following key aspects:

- **Understanding the bag-of-words model** and recognizing its significance in text representation.
- **Developing** a bag-of-words model for a collection of documents.
- **Utilizing** various techniques to construct a vocabulary and assign scores to words within the model.





### The Challenge with Text Data


One of the challenges associated with modeling text data is its **inherent messiness**. Machine learning algorithms, in particular, thrive on structured, fixed-length inputs and outputs. They are not equipped to handle raw text directly; instead, text must undergo a transformation into numerical representations, specifically, vectors of numbers.

In the realm of natural language processing, these vectors (represented as 'x') are derived from textual data to capture various linguistic characteristics of the text. This process is commonly known as **feature extraction** or **feature encoding**. One widely used and straightforward approach for **feature extraction** with text data is referred to as the **``bag-of-words``** model.

### Exploring the Bag-of-Words Model


The **Bag-of-Words (BoW)** model, a fundamental concept in text analysis, serves as a robust method for **feature extraction** from textual data, particularly in the context of machine learning algorithms. This approach is renowned for its inherent simplicity and adaptability, offering a versatile framework for feature extraction from diverse documents.

Essentially, a bag-of-words represents text by elucidating the presence of words within a given document, relying on two pivotal elements:

- **A Vocabulary of Known Words**: This component encompasses a collection of words that are recognized within the context of the analysis.
- **A Measure of the Presence of Known Words**: The BoW model quantifies the occurrence of these known words within the document.

The nomenclature **``bag-of-words``** is aptly chosen, signifying the deliberate disregard for any information pertaining to the arrangement or structure of words within the document. Instead, the model's primary concern lies in ascertaining whether known words are present in the document, without regard to their specific location within the text.

In its most common form, the model builds a histogram of the words in the text, treating each word count as an individual feature.

> The underlying intuition of the BoW model rests on the premise that documents sharing similar content tend to exhibit analogous characteristics. Furthermore, it posits that by scrutinizing content in isolation, meaningful insights about the document's semantic content can be gleaned.

It is important to note that the complexity of the Bag-of-Words model can be tailored to specific needs. This complexity manifests in the decision-making processes surrounding the design of the vocabulary of known words (or tokens) and the methodology employed to evaluate the presence of these known words. Both of these considerations warrant careful examination and will be explored in greater detail.

### Illustration of the Bag-of-Words Model



To provide a tangible demonstration of the Bag-of-Words model, we will walk through a practical example.

**Step 1: Data Collection**

For this demonstration, we will begin by excerpting the initial lines of text from the renowned literary work **``A Tale of Two Cities``** authored by *Charles Dickens*. These excerpts have been sourced from **Project Gutenberg**.

```text
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
```

In this compact example, we will consider each line as an individual document, and collectively, these four lines constitute our entire corpus of documents.

**Step 2: Vocabulary Construction**

We can proceed by creating a list that encompasses all the words within our model's vocabulary. In this context, we consider the unique words, disregarding variations in case and excluding punctuation.

```text
it
was
the
best
of
times
worst
age
wisdom
foolishness
```

This results in a vocabulary of 10 words, derived from a corpus consisting of a total of 24 words.

**Step 3: Generating Document Vectors**

The subsequent phase entails assigning scores to the words contained within each document. The objective is to transform each free-text document into a vector, which can serve as input or output for a machine learning model. Given our knowledge of the vocabulary comprising 10 words, we can establish a fixed-length representation of each document, consisting of 10 positions within the vector to score each word.

The most straightforward scoring technique involves denoting the presence or absence of words as binary values: 0 for absence and 1 for presence.

Employing the arbitrary sequence of words provided earlier in our vocabulary, we can systematically process the initial document, **``It was the best of times,``** and translate it into a binary vector. The scoring for this document would manifest as follows:

```python
it = 1
was = 1
the = 1
best = 1
of = 1
times = 1
worst = 0
age = 0
wisdom = 0
foolishness = 0
```

As a binary vector, this would look as follows:

```python
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The other three documents would look as follows:

```text
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
```

Word order is discarded in every case, which gives us a consistent way of extracting features from any document in our corpus and makes the result ready for modeling. New documents that overlap with the vocabulary of known words but also contain words outside it can still be encoded: only the occurrences of known words are scored, and unknown words are ignored. The same procedure extends naturally to larger vocabularies and longer documents.
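
The three steps above can be sketched in a few lines of pure Python. The `tokenize` and `binary_vector` helpers are hypothetical names introduced for this illustration, and the sentence used as a "new document" is chosen only to show how unknown words are handled.

```python
import re

corpus = ["It was the best of times,",
          "it was the worst of times,",
          "it was the age of wisdom,",
          "it was the age of foolishness,"]

def tokenize(text):
    # ignore case and punctuation
    return re.findall(r"[a-z]+", text.lower())

# build the vocabulary in first-seen order
vocab = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def binary_vector(text):
    # 1 if the vocabulary word is present in the text, 0 otherwise
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]

print(vocab)
for doc in corpus:
    print(binary_vector(doc))

# words outside the vocabulary ("spring", "hope") are simply ignored
print(binary_vector("it was the spring of hope"))
```

Every vector has the same length as the vocabulary, regardless of how long the encoded document is.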

### Managing the Vocabulary



> As the size of the vocabulary expands, the vector representation of documents grows proportionally.

In the previous example, the length of the document vector equated to the number of known words. When dealing with a considerably extensive corpus, such as thousands of books, it is conceivable that the vector's length may extend to thousands or even millions of positions. Additionally, **it's common for each document to contain only a small subset of the known words from the vocabulary**.

This results in a vector with many zero scores, called a **sparse vector** or **sparse representation**. Stored naively, such long vectors require more memory and computation during modeling, and the very large number of positions or dimensions can make modeling challenging for traditional algorithms. Consequently, there is a strong incentive to reduce the vocabulary size when using a bag-of-words model.

There exist uncomplicated text preprocessing techniques that can serve as an initial step in this endeavor, including:

- Ignoring letter case.
- Disregarding punctuation.
- Omitting frequently occurring words with low information content, known as stop words (e.g., "a," "of," etc.).
- Correcting misspelled words.
- Reducing words to their root form, known as **stemming**, using appropriate stemming algorithms.
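
As a small illustration of how the first three techniques shrink a vocabulary, the sketch below compares a naive whitespace-split vocabulary with a cleaned one. The `STOP_WORDS` set is a tiny hypothetical list made up for this example, not a standard stop word list, and the helper names are likewise assumptions.

```python
import re

# tiny illustrative stop word list (hypothetical, for this example only)
STOP_WORDS = {"a", "of", "the", "it", "was"}

corpus = ["It was the best of times,",
          "it was the worst of times,",
          "it was the age of wisdom,",
          "it was the age of foolishness,"]

def raw_vocab(docs):
    # naive vocabulary: split on whitespace, keep case and punctuation
    return {w for d in docs for w in d.split()}

def cleaned_vocab(docs):
    # ignore case, drop punctuation, and remove stop words
    vocab = set()
    for d in docs:
        for w in re.findall(r"[a-z]+", d.lower()):
            if w not in STOP_WORDS:
                vocab.add(w)
    return vocab

print(len(raw_vocab(corpus)), sorted(raw_vocab(corpus)))
print(len(cleaned_vocab(corpus)), sorted(cleaned_vocab(corpus)))
```

In the naive vocabulary, "It" and "it" count as different entries, as do tokens with trailing commas; cleaning merges or removes them.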

> For a more intricate strategy, one can opt to construct a vocabulary of grouped words.

This approach not only alters the scope of the vocabulary but also allows the bag-of-words model to capture a modicum of additional semantic meaning from the documents. In this context, each word or token is referred to as a **``gram``**.

> Formulating a vocabulary of two-word pairs is specifically known as a **bigram model**.

It's important to note that only the bigrams that manifest within the corpus are modeled, rather than considering all possible bigrams.

> An **n-gram** represents a sequence of **``n tokens``** within a text. To illustrate, a 2-gram, often referred to as a **bigram**, consists of two-word sequences such as **``please turn``**, **``turn your``**, or **``your homework``**. On the other hand, a **3-gram**, commonly known as a **``trigram``**, encompasses three-word sequences like **``please turn your``** or **``turn your homework``**.

As an illustration, consider the bigrams within the initial line of text presented in the preceding section, **``It was the best of times``**. These bigrams are as follows:

```text
it was
was the
the best
best of
of times
```
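
Extracting n-grams from a token list takes only a sliding window. The `ngrams` function below is a hypothetical helper written for this illustration:

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text with `k` tokens yields `k - n + 1` n-grams, so the six-word line produces five bigrams and four trigrams.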

> A **``bag-of-bigrams``** representation is often considerably more powerful than a plain bag-of-words, and in many cases it proves hard to beat.

### Evaluating Word Occurrences



After the selection of a vocabulary, the next crucial step involves scoring the occurrence of words within sample documents. In the previously demonstrated example, we have already encountered one straightforward scoring method: binary scoring, which involves marking the presence or absence of words. Additionally, there are several other uncomplicated scoring approaches, including:

- **Counts**: This method involves tallying the number of times each word appears within a document.
- **Frequencies**: Here, we calculate the frequency of each word's occurrence within a document relative to the total number of words in that document.

### Constraints of the Bag-of-Words Model



The bag-of-words model is simple to understand and implement, offers considerable flexibility for customization to specific text data, and has been used with great success on prediction tasks such as language modeling and document classification. Nevertheless, it has limitations, including the following:

- **Vocabulary**: The design of the vocabulary necessitates meticulous consideration, especially concerning its size. The vocabulary size directly impacts the sparsity of document representations.

- **Sparsity**: Sparse representations pose formidable challenges, both in terms of computational complexity (space and time) and information utilization. The difficulty lies in effectively harnessing limited information within an expansive representational space.

- **Semantic Meaning**: Disregarding word order, the bag-of-words model overlooks the context and, consequently, the semantics of words within a document. The context and meaning can contribute substantially to the model's understanding, potentially distinguishing between differently arranged words (e.g., **``this is interesting``** vs. **``is this interesting``**), identifying synonyms (e.g., **``old bike``** vs. **``used bike``**), and addressing a myriad of other linguistic nuances.

# Additional Resources: Exploring the Development of Deep Learning Models with Keras


Creating and evaluating deep learning neural networks becomes intuitive in Python using Keras, but it's crucial to adhere to a specific model life-cycle. In this lesson, we will walk you through the detailed **``life-cycle``** steps for formulating, training, and assessing deep learning neural networks in Keras. Furthermore, we'll guide you on predicting using a trained model and delve into the functional API for more versatile model design.

By the end of this section, you will have insights into:

- Defining, compiling, fitting, and evaluating a deep learning neural network using Keras.
- Adopting standard defaults for both regression and classification predictive modeling tasks.
- Utilizing the functional API to craft standard Multilayer Perceptrons, convolutional, and recurrent neural networks.

## Creating a Neural Network Model in Keras



Keras is a high-level neural networks API written in Python. It has historically been capable of running on top of TensorFlow, CNTK, or Theano. Here's a step-by-step guide on how to create a neural network model using Keras:

**1. Define the Network**

First, you need to define the architecture of your neural network. This involves specifying the number of layers, the number of neurons in each layer, and the activation function for each neuron.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_dim=8, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
```


**2. Compile the Network**

After defining the architecture, you need to compile the model. This step involves specifying the optimizer, loss function, and evaluation metric(s).

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

**3. Fit the Network**

Now, you can train the model using your training data. This involves feeding the input data and the corresponding target values to the model, specifying the number of epochs, and optionally, the batch size.

```python
# Assuming X_train and y_train are your input data and labels respectively
model.fit(X_train, y_train, epochs=10, batch_size=32)
```

**4. Evaluate the Network**

After training, you should evaluate the performance of your model using a separate dataset (often called the test dataset).

```python
# Assuming X_test and y_test are your test data and labels
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")
```

**5. Make Predictions**

Finally, you can use the trained model to make predictions on new, unseen data.

```python
# Assuming X_new is your new data
predictions = model.predict(X_new)
```

And that's it! You've successfully created, trained, and evaluated a neural network model in Keras.



## Keras Functional Models

In Keras, the functional API provides a more flexible approach to defining models compared to the sequential API. While the sequential API allows you to stack layers in a linear fashion, the functional API allows you to define more complex architectures, such as multi-input, multi-output, and shared layers models.

### Basics of Functional API

Instead of starting with a **``Sequential model``**, you'll begin by defining placeholder tensors for your inputs. You'll then define a series of layer calls, treating these layers as functions that process the input tensors and generate output tensors.

**Example: Simple Feedforward Network**

With the sequential API, a simple feedforward network can be defined as:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_shape=(784,), activation='relu'))
model.add(Dense(10, activation='softmax'))
```


With the functional API, the same model can be defined as:

```python
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
x = Dense(32, activation='relu')(inputs)
predictions = Dense(10, activation='softmax')(x)

model = Model(inputs=inputs, outputs=predictions)
```

Notice how in the functional approach, the layers are used as functions that take tensors and return tensors.

**Example: Multi-input Model**

Imagine you want a model with two different inputs. For instance, one input could be an image, and the other input could be a description of that image.

```python
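from keras.layers import Input, Dense, Flatten, concatenate
from keras.models import Model

# Image input: flatten the 128x128x3 tensor so that both branches
# produce 2-D (batch, features) outputs that can be concatenated
image_input = Input(shape=(128,128,3))
x1 = Flatten()(image_input)
x1 = Dense(128, activation='relu')(x1)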

# Text input
text_input = Input(shape=(100,))
x2 = Dense(32, activation='relu')(text_input)

# Merge the outputs of the two branches
merged = concatenate([x1, x2])

# Final dense layer
output = Dense(1, activation='sigmoid')(merged)

# Model
model = Model(inputs=[image_input, text_input], outputs=output)
```

This kind of architecture allows for flexibility in feeding different kinds of data into the model and processing them differently.