#### Case Study: Automatic Text Summarization with NLP

##### Background

* Text summarization is a crucial task in Natural Language Processing (NLP) that involves creating a short, accurate, and fluent summary of a longer text document. Automatic text summarization can help users quickly understand large volumes of information by extracting the most important points from the text.

* In this project, we implement extractive text summarization using the TextRank algorithm. TextRank is an unsupervised, graph-based ranking algorithm inspired by PageRank, the algorithm Google Search uses to rank web pages. It builds a graph in which each sentence is a node and each edge is weighted by the similarity between two sentences, then ranks the sentences by their PageRank scores. The highest-scoring sentences form the summary.

* We will use Python with the nltk, numpy, and networkx libraries to build the summarizer.
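
To make the ranking step concrete, here is a minimal pure-Python sketch of weighted PageRank computed by power iteration on a toy similarity matrix. The function name `pagerank_scores` and the toy weights are illustrative assumptions, not part of the project code. The summarizer itself delegates this step to `nx.pagerank`, which may handle edge cases such as isolated nodes differently.

```python
def pagerank_scores(weights, damping=0.85, iterations=100, tol=1e-9):
    """Weighted PageRank by power iteration (illustrative sketch).

    weights[i][j] is the similarity between sentences i and j.
    Simplification: nodes with no edges only receive the teleport share.
    """
    n = len(weights)
    # Total edge weight leaving each node, used to normalize transitions
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into node i from every connected node j
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical similarities among four sentences (diagonal is zero)
toy_similarity = [
    [0.0, 0.6, 0.5, 0.1],
    [0.6, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.2],
    [0.1, 0.0, 0.2, 0.0],
]

scores = pagerank_scores(toy_similarity)
# Sentence indices ordered from most to least central
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy graph, sentence 0 has the strongest connections overall and sentence 3 the weakest, so they land at the top and bottom of the ranking.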

In [1]:
# Import necessary libraries
import nltk
import numpy as np
import networkx as nx

# Download necessary NLTK resources
# (recent NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joelfuentes/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joelfuentes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Sample text for summarization
text = """
Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable.
By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
In this rapidly evolving field, advancements are being made continuously, enabling computers to understand and process human language with increasing accuracy.
As a result, NLP plays a critical role in supporting applications like voice-operated GPS systems, digital assistants, speech-to-text dictation software, chatbots, and many more.
With the ever-growing amount of unstructured data, the importance of NLP continues to rise, making it a key area of research and development in the tech industry.
"""

# Function to generate a summary of the text
def summarize_text(text, summary_ratio=0.3):
    # Split text into sentences
    sentences = sent_tokenize(text)
    total_sentences = len(sentences)
    
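    # Lowercase and tokenize each sentence; stop words are removed later,
    # inside sentence_similarity. A set makes those membership checks fast.
    stop_words = set(stopwords.words('english'))
    sentence_tokens = [word_tokenize(sentence.lower()) for sentence in sentences]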
    
    # Build the similarity matrix
    similarity_matrix = np.zeros((total_sentences, total_sentences))
    
    for idx1 in range(total_sentences):
        for idx2 in range(total_sentences):
            if idx1 != idx2:
                similarity = sentence_similarity(sentence_tokens[idx1], sentence_tokens[idx2], stop_words)
                similarity_matrix[idx1][idx2] = similarity
    
    # Build the graph and rank sentences using PageRank algorithm
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)
    
    # Sort sentences by their score
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    
    # Select the top sentences for the summary
    summary_length = max(1, int(total_sentences * summary_ratio))
    summary_sentences = [ranked_sentences[i][1] for i in range(summary_length)]
    
    # Combine summary sentences
    summary = ' '.join(summary_sentences)
    return summary

# Function to compute the similarity between two sentences
def sentence_similarity(sent1, sent2, stop_words=None):
    if stop_words is None:
        stop_words = []
    
    sent1 = [w for w in sent1 if w not in stop_words]
    sent2 = [w for w in sent2 if w not in stop_words]
    
    all_words = list(set(sent1 + sent2))
    
    # Create word vectors
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    
    # Build the vector for the first sentence
    for w in sent1:
        vector1[all_words.index(w)] += 1
    
    # Build the vector for the second sentence
    for w in sent2:
        vector2[all_words.index(w)] += 1
    
    # Compute cosine similarity
    similarity = cosine_similarity(vector1, vector2)
    return similarity

# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    else:
        return dot_product / (norm_vec1 * norm_vec2)

# Generate and print the summary
summary = summarize_text(text, summary_ratio=0.4)
print("Original Text:\n", text)
print("\nSummary:\n", summary)

Original Text:
 
Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable.
By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
In this rapidly evolving field, advancements are being made continuously, enabling computers to understand and process human language with increasing accuracy.
As a result, NLP plays a critical role in supporting applications like voice-operated GPS systems, digital assistants, speech-to-text dictation software, chatbots, and many more.
With the ever-growing amount of unstructured data, the importance of NLP continues to rise, making it a ke

##### Content Preservation
The summary includes sentences that cover the importance of NLP, its applications, and its increasing relevance in technology, effectively capturing the essence of the original text.

##### Sentence Selection
PageRank gives a high score to sentences that are similar to many other sentences in the graph, so the selected sentences tend to be the ones most central to the main topic.
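
The edge weight behind this selection can be worked through on a small example. The sketch below computes bag-of-words cosine similarity for hand-written token lists. The helper name `bow_cosine` and the token lists are hypothetical, and the sketch uses `collections.Counter` instead of the index-based vectors in the notebook, though it aims to compute the same quantity.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sentences contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already-cleaned token lists
a = ["nlp", "helps", "computers", "understand", "language"]
b = ["computers", "process", "human", "language"]
c = ["stock", "prices", "fell"]

print(round(bow_cosine(a, b), 3))  # two shared words out of 5 and 4 -> 0.447
print(bow_cosine(a, c))            # no shared words -> 0.0
```

Sentences that share vocabulary with many others accumulate strong edges, which is what lifts their PageRank score.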

##### Coherence
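Each extracted sentence is fluent on its own, but the summary as a whole may read less smoothly: summarize_text joins sentences in ranking order rather than document order, and an extractive method cannot rewrite the transitions between them.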
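
One common way to improve flow in an extractive summary is to keep the top-ranked sentences but emit them in their original document order. The following is a minimal sketch of that idea, not the notebook's behavior. The helper `select_in_document_order` and the sample scores are hypothetical, and `scores` is assumed to be a dict mapping sentence index to score, which is the shape `nx.pagerank` returns for a graph built from an array.

```python
def select_in_document_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore document order."""
    by_score = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top_indices = sorted(by_score[:k])  # back to original positions
    return [sentences[i] for i in top_indices]


# Hypothetical sentences and scores for illustration
sentences = ["A opens the topic.", "B adds detail.", "C gives the conclusion."]
scores = {0: 0.30, 1: 0.25, 2: 0.45}

print(" ".join(select_in_document_order(sentences, scores, k=2)))
# A opens the topic. C gives the conclusion.
```

Joining in ranking order would instead put sentence C before sentence A.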

##### Length Control
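The summary_ratio parameter controls the length of the summary. The number of selected sentences is max(1, int(total_sentences * summary_ratio)), so the product is rounded down and at least one sentence is always kept. In this example, a ratio of 0.4 applied to the six-sentence sample text yields a two-sentence summary.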
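
The rounding behavior is easy to check in isolation. This short sketch reuses the length formula from summarize_text in a standalone helper. The name `summary_length` is introduced here only for illustration.

```python
def summary_length(total_sentences, summary_ratio):
    """Number of sentences kept: the product rounded down, with a floor of one."""
    return max(1, int(total_sentences * summary_ratio))


# Six sentences, as in the sample text
for ratio in (0.1, 0.3, 0.4, 0.5, 1.0):
    print(ratio, summary_length(6, ratio))
```

Because of the truncation, small changes in the ratio may not change the summary at all for short texts: with six sentences, any ratio from 0.34 up to just under 0.5 selects two sentences.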
