**Text summarization** is the process of distilling the most important information from a text document to produce a shorter version, while still retaining the key points and meaning. There are generally two approaches to text summarization: extractive and abstractive.

1.  **Extractive Summarization**: In this approach, sentences or phrases are selected directly from the original text and assembled to form a summary. This method typically involves identifying important sentences based on features such as word frequency, sentence length, and position in the document.
    
2.  **Abstractive Summarization**: In this approach, a model generates new sentences that capture the main ideas of the original text, often rephrasing and synthesizing information. This method requires more advanced natural language processing techniques, such as sequence-to-sequence models and attention mechanisms.
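
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones. The function names, the tiny stopword list, and the regex-based sentence splitter are illustrative assumptions, not part of any library, and the scoring uses word frequency only (sentence length and position are ignored).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is", "it",
             "of", "on", "that", "the", "this", "to", "was", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def frequency_summary(text, num_sentences=1):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average (not summed) frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)

sample = ("Summarization shortens text. "
          "Extractive summarization selects sentences from the text. "
          "Lunch was late today.")
print(frequency_summary(sample, num_sentences=1))
```

Because every sentence in the result is copied verbatim from the input, the output can never contain wording that the source lacks, which is the defining property of the extractive approach.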

The sample code below illustrates **extractive summarization** using Gensim's `summarize` function.

In [4]:
## Example for Extractive Summarization

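# Note: the gensim.summarization module was removed in Gensim 4.0,
# so this import requires gensim < 4.0 (for example 3.8.3).
from gensim.summarization import summarize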

# Sample text for summarization
text = """
Natural language processing (NLP) is a field of artificial intelligence that deals with the interaction between computers and humans through natural language. It enables computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP techniques are used in a wide range of applications, including machine translation, sentiment analysis, and text summarization.

Text summarization is the process of distilling the most important information from a text document to produce a shorter version, while still retaining the key points and meaning. There are generally two approaches to text summarization: extractive and abstractive.

Extractive summarization involves selecting important sentences or phrases directly from the original text and assembling them to form a summary. This method is relatively straightforward and relies on features such as word frequency, sentence length, and position in the document.

Abstractive summarization, on the other hand, involves generating new sentences that capture the main ideas of the original text, often rephrasing and synthesizing information. This method requires more advanced natural language processing techniques, such as sequence-to-sequence models and attention mechanisms.

Gensim is a popular Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides implementations of several text summarization algorithms, including the TextRank algorithm, which is an extractive summarization technique based on graph theory.

In this example, we'll use Gensim's summarize function to perform extractive text summarization on the sample text provided above.
"""

# Perform extractive text summarization
summary = summarize(text)

# Print the summary
print(summary)


Extractive summarization involves selecting important sentences or phrases directly from the original text and assembling them to form a summary.
Abstractive summarization, on the other hand, involves generating new sentences that capture the main ideas of the original text, often rephrasing and synthesizing information.
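
The sample text above names TextRank as a graph-based extractive technique. The sketch below shows the general idea in plain Python: sentences become graph nodes, edges are weighted by sentence similarity, and a PageRank-style iteration ranks the nodes. It is a simplified illustration, not Gensim's implementation. The Jaccard word-overlap similarity, the function names, and the default parameters are all assumptions made for this example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_set(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Jaccard overlap between the word sets of two sentences (assumed measure)."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    # Weighted, undirected sentence graph stored as an adjacency matrix.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return "\n".join(sentences[i] for i in chosen)

sample = ("Neural networks learn from data. "
          "Deep neural networks learn features from data. "
          "Bananas are yellow.")
print(textrank_summary(sample, num_sentences=2))
```

The unrelated third sentence shares no words with the others, so it receives no rank from its neighbours and is left out of the two-sentence summary.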


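The sample code below illustrates **abstractive summarization** with a character-level sequence-to-sequence (encoder-decoder LSTM) model trained on four toy sentence pairs.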

In [1]:
## Prep the dataset first

import numpy as np

# Sample data for training
input_texts = ['Natural language processing (NLP) is a field of artificial intelligence.',
               'It deals with the interaction between computers and humans through natural language.',
               'NLP enables computers to understand, interpret, and generate human language.',
               'Sequence-to-sequence models are commonly used for abstractive summarization.']

target_texts = ['NLP is a field of AI.',
                'NLP deals with interaction between computers and humans through language.',
                'NLP enables computers to understand, interpret, and generate human language.',
                'Seq2Seq models are used for abstractive summarization.']

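# Tokenize input and target texts at the character level.
# Each target is wrapped in a start token ('\t') and an end token ('\n'),
# which the decoding loop in the testing section looks up.
input_texts = [list(text) for text in input_texts]
target_texts = [list('\t' + text + '\n') for text in target_texts]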

print('Tokenized Text : ',input_texts[0]) # See output to get an idea

# Create vocabulary sets
input_chars = set()
target_chars = set()

for input_text, target_text in zip(input_texts, target_texts):
    input_chars.update(input_text)
    target_chars.update(target_text)

input_chars = sorted(list(input_chars))
target_chars = sorted(list(target_chars))

print('Vocab Set : ',input_chars) # See output to get an idea

num_encoder_tokens = len(input_chars)
num_decoder_tokens = len(target_chars)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

# Create character-to-index dictionaries
input_token_index = dict([(char, i) for i, char in enumerate(input_chars)])
target_token_index = dict([(char, i) for i, char in enumerate(target_chars)])

print('Token index : ',input_token_index) # See output to get an idea

# Create encoder input data
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')

for i, input_text in enumerate(input_texts):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.

print('The encoder_input_data is a 3d array of size (4, X, 34), here 4 is num of sequences in input_texts, X is the max length of a sequence and 34 is num of vocab chars, hence you get  one hot vector where chars are activated for each sequence')
print("Encoder input for the 1st sequence, 1st char 'N' (index 8 is active): ", encoder_input_data[0][0]) # See output to get an idea

# Create decoder input and target data
decoder_input_data = np.zeros((len(target_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32') # This is similar to encoder_input_data
decoder_target_data = np.zeros((len(target_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, target_text in enumerate(target_texts):
    for t, char in enumerate(target_text):
        # Decoder input data holds every character of the target sequence
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # Decoder target data is one timestep ahead of decoder input data,
            # so it never contains the first character of the sequence
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.



Tokenized Text :  ['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', '(', 'N', 'L', 'P', ')', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'i', 'e', 'l', 'd', ' ', 'o', 'f', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', '.']
Vocab Set :  [' ', '(', ')', ',', '-', '.', 'I', 'L', 'N', 'P', 'S', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']
Token index :  {' ': 0, '(': 1, ')': 2, ',': 3, '-': 4, '.': 5, 'I': 6, 'L': 7, 'N': 8, 'P': 9, 'S': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'l': 20, 'm': 21, 'n': 22, 'o': 23, 'p': 24, 'q': 25, 'r': 26, 's': 27, 't': 28, 'u': 29, 'v': 30, 'w': 31, 'y': 32, 'z': 33}
The encoder_input_data is a 3d array of size (4, X, 34), here 4 is num of sequences in input_texts, X is the max length of a 
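
The relationship between decoder input and decoder target is easier to see on a single short string. The sketch below is a plain-Python illustration under the assumption that `'\t'` marks the start and `'\n'` the end of a summary; the helper names `teacher_forcing_pairs` and `one_hot` are hypothetical and only mirror the array layout used in the cell above.

```python
START, END = "\t", "\n"  # assumed start and end markers

def teacher_forcing_pairs(summary):
    """Return (decoder_input, decoder_target) character lists for one summary."""
    seq = START + summary + END
    decoder_input = list(seq)        # every character, beginning with START
    decoder_target = list(seq[1:])   # same characters shifted one step ahead
    return decoder_input, decoder_target

def one_hot(chars, vocab):
    """Encode each character as a list with a single 1.0 at its vocabulary index."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [[1.0 if index[ch] == j else 0.0 for j in range(len(vocab))]
            for ch in chars]

dec_in, dec_out = teacher_forcing_pairs("NLP")
vocab = sorted(set(dec_in))

# At each step the decoder sees the left character and must predict the right one.
for step, (x, y) in enumerate(zip(dec_in, dec_out)):
    print(step, repr(x), "->", repr(y))

print(one_hot(["N"], vocab))
```

The final input step (the end marker) has no target, which corresponds to an all-zero row at the end of `decoder_target_data`.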

In [2]:
## Model arch building and training
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.utils import plot_model

# Define encoder architecture
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Define decoder architecture
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

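# Visualize the model architecture.
# plot_model needs the optional pydot and Graphviz dependencies. Without a working
# install it fails (in this run: AttributeError: module 'pydot' has no attribute
# 'InvocationException'), so fall back to a text summary.
try:
    plot_model(model, to_file='model.png', show_shapes=True)
except (ImportError, AttributeError) as err:
    print('plot_model unavailable:', err)
    model.summary()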

In [None]:
# Compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Train the model
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)


In [None]:
## Testing section

# Inference mode (sampling)
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to something readable
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

# Decode function
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length or find stop token
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence

# Test the model on every training sequence (there are only four)
for seq_index in range(len(input_texts)):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', ''.join(input_texts[seq_index]))
    print('Decoded sentence:', decoded_sentence)