<a href="https://colab.research.google.com/github/mahekkothari/SLM_Summary/blob/main/SLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: Create an SLM script that summarizes long texts by extracting their highest-scoring sentences.

In [15]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from heapq import nlargest

def summarize_text(text, num_sentences=3):
    sentences = sent_tokenize(text)  # split the text into sentences
    words = word_tokenize(text.lower())  # split the lowercased text into word tokens

    # Remove stopwords, i.e. very common words that carry little meaning
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]

    word_freq = FreqDist(words)  # frequency of each remaining token in the text
    sentence_scores = {}  # each sentence receives a score based on word frequency

    for sentence in sentences:
        # Only sentences with fewer than 20 space-separated words are scored
        if len(sentence.split(' ')) >= 20:
            continue
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                # Add this token's frequency to the sentence's running score
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[word]

    # Select the top sentences with the highest scores
    summarized_sentences = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)

    # Join the selected sentences to create the summary
    summary = ' '.join(summarized_sentences)

    return summary



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
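
One detail of `summarize_text` worth knowing: `word_tokenize` returns punctuation marks such as `,` and `.` as separate tokens, and because they are not in the stopword list they are counted in `word_freq` and add to sentence scores. The sketch below shows one possible way to drop them before counting. It is not part of the notebook: `content_word_counts` is a hypothetical helper, `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass), and the token list and stopword set are hand-written stand-ins for the NLTK calls so the example runs without any downloads.

```python
from collections import Counter

def content_word_counts(tokens, stop_words):
    """Count tokens, skipping stopwords and tokens with no letter or digit."""
    kept = [
        tok for tok in tokens
        if tok not in stop_words and any(ch.isalnum() for ch in tok)
    ]
    return Counter(kept)

# Hand-written tokens standing in for word_tokenize(text.lower()) output
tokens = ["llms", "are", "useful", ",", "and", "llms", "are", "popular", "."]
stop_words = {"are", "and"}  # tiny stand-in for stopwords.words("english")

counts = content_word_counts(tokens, stop_words)
print(counts.most_common())
```

Applying a filter like this inside `summarize_text` would change the scores, so the summary it prints could differ from the one recorded in this notebook.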


In [14]:
# Example usage
text = """
If you’ve followed the hype, then you’re likely familiar with LLMs such as ChatGPT. These generative AIs are hugely interesting across academic, industrial and consumer segments. That’s primarily due to their ability to perform relatively complex interactions in the form of speech communication.
Currently, LLM tools are being used as an intelligent machine interface to knowledge available on the internet. LLMs distill relevant information on the Internet, which has been used to train it, and provide concise and consumable knowledge to the user. This is an alternative to searching a query on the Internet, reading through thousands of Web pages and coming up with a concise and conclusive answer.
Indeed, ChatGPT is the first consumer-facing use case of LLMs, which previously were limited to OpenAI’s GPT and Google’s BERT technology.
Recent iterations, including but not limited to ChatGPT, have been trained and engineered on programming scripts. Developers use ChatGPT to write complete program functions – assuming they can specify the requirements and limitations via the text user prompt adequately. (Raza 2024) (https://www.splunk.com/en_us/blog/learn/language-models-slm-vs-llm.html)
"""
summary = summarize_text(text)
print("Summary:")
print(summary)


Summary:
Recent iterations, including but not limited to ChatGPT, have been trained and engineered on programming scripts. 
If you’ve followed the hype, then you’re likely familiar with LLMs such as ChatGPT. Currently, LLM tools are being used as an intelligent machine interface to knowledge available on the internet.
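
`nlargest` returns sentences ranked by score, which is why the summary above opens with a sentence from the last paragraph of the source text. If the summary should read in source order instead, the selected sentences can be re-sorted by position. The sketch below is one way to do that under stated assumptions: `top_sentences_in_order` is a hypothetical helper that is not in the notebook, and the sentences and scores are made-up toy data rather than output from `summarize_text`.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, sentence_scores, num_sentences=3):
    """Pick the highest-scoring sentences, then restore document order."""
    # Rank by score, the same nlargest call that summarize_text uses
    top = set(nlargest(num_sentences, sentence_scores, key=sentence_scores.get))

    # Walk the original sentence list so the result follows source order;
    # `seen` keeps a repeated sentence from being emitted twice
    seen = set()
    ordered = []
    for sentence in sentences:
        if sentence in top and sentence not in seen:
            ordered.append(sentence)
            seen.add(sentence)
    return ordered

# Toy data, made up for illustration
sentences = ["A first point.", "A minor aside.", "The key claim.", "A closing note."]
scores = {"A first point.": 5, "A minor aside.": 1, "The key claim.": 9, "A closing note.": 3}

print(' '.join(top_sentences_in_order(sentences, scores, num_sentences=2)))
# prints: A first point. The key claim.
```

Inside `summarize_text`, the same idea would mean passing the `sentences` list and `sentence_scores` dictionary to the helper in place of the direct `nlargest` call.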
