<a href="https://colab.research.google.com/github/mahekkothari/SLM_Summary/blob/main/SLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: Create an SLM script that summarizes long written texts.

In [11]:
!pip install nltk
import nltk
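nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # English stopword list
# Note: newer NLTK releases (3.9 and later) look for 'punkt_tab' instead; if
# sent_tokenize raises a LookupError, also run nltk.download('punkt_tab').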

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from heapq import nlargest

def summarize_text(text, num_sentences=3):
    """Return an extractive summary made of the highest-scoring sentences."""
    sentences = sent_tokenize(text)      # split the text into sentences
    words = word_tokenize(text.lower())  # split the lowercased text into word tokens

    # Drop stopwords (very common words such as "the" and "of").
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]

    word_freq = FreqDist(words)  # frequency of each remaining token in the text
    sentence_scores = {}         # sentence -> sum of the frequencies of its tokens

    for sentence in sentences:
        # Only score sentences shorter than 20 space-separated words.
        if len(sentence.split(' ')) >= 20:
            continue
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[word]

    # Select the sentences with the highest scores (returned in score order).
    summarized_sentences = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)

    # Join the selected sentences to create the summary.
    summary = ' '.join(summarized_sentences)

    return summary



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
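
One detail of the scoring above: `word_tokenize` emits punctuation marks such as `,` and `.` as separate tokens, and the English stopword list does not contain them, so they are counted in `word_freq` and add to every sentence's score. The following is a minimal, dependency-free sketch of the same frequency scoring with punctuation filtered out. It is not the notebook's method: the regex tokenizer, the tiny `STOP_WORDS` set, and the names `content_words` and `score_sentences` are all assumptions made for illustration, standing in for NLTK's tokenizers and stopword corpus.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def content_words(text):
    """Lowercase the text and keep only alphabetic tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())  # punctuation and digits are dropped
    return [t for t in tokens if t not in STOP_WORDS]

def score_sentences(sentences, max_words=20):
    """Score each sentence by summing the document-wide frequency of its content words."""
    word_freq = Counter(w for s in sentences for w in content_words(s))
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) >= max_words:
            continue  # length cap mirroring the one in summarize_text
        words = content_words(sentence)
        if words:
            scores[sentence] = sum(word_freq[w] for w in words)
    return scores

sentences = [
    "Small models are cheap to run.",
    "Small models, when tuned, are accurate.",
    "Hello!",
]
scores = score_sentences(sentences)
for sentence, score in scores.items():
    print(score, sentence)
# 6 Small models are cheap to run.
# 7 Small models, when tuned, are accurate.
# 1 Hello!
```

With punctuation excluded, the commas in the second sentence no longer contribute to its score; only the repeated content words `small` and `models` raise it above the others.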


In [12]:
# Example usage
text = """
    We can view LLMs and SLMs as two ends of a spectrum with overlap in between. Overall SLMs distinguish themselves from LLMs in one or more of the following ways. SLMs are more fine-tuned because vendors or companies train them on detailed, domain-specific data, for example to assist complex data engineering tasks. They enrich user prompts, for example by injecting domain-specific data into a user’s question to make the response more accurate. Data pipeline vendors are building SLMs with these capabilities now, often alongside LLMs, to help companies tackle specialized data engineering problems with better governance. This will help data teams boost productivity while reducing risks related to data quality, fairness, and explainability. We should get ready for a boom of small language models in data engineering and many other fields.
- Kevin Petrie in Should AI Bots Build Your Data Pipelines? Part III: The Emergence of Small Language Models for Data Engineering June 21, 2023
(Blog)
"""
summary = summarize_text(text)
print("Summary:")
print(summary)


Summary:
This will help data teams boost productivity while reducing risks related to data quality, fairness, and explainability. Part III: The Emergence of Small Language Models for Data Engineering June 21, 2023
(Blog) We should get ready for a boom of small language models in data engineering and many other fields.
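
The summary above lists sentences in score order rather than document order, which is why the attribution fragment appears before the passage's closing sentence. A minimal sketch of one way to restore the original order follows, assuming the sentence list and score dictionary are available separately. The helper name `top_sentences_in_order` and the sample scores are hypothetical, chosen only to illustrate the idea.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, num_sentences=3):
    """Pick the highest-scoring sentences, then restore their original order."""
    top = set(nlargest(num_sentences, scores, key=scores.get))
    return [s for s in sentences if s in top]

sentences = ["Intro sentence.", "Key finding.", "Minor detail.", "Conclusion."]
# Made-up scores for illustration only.
scores = {"Intro sentence.": 4, "Key finding.": 7, "Minor detail.": 1, "Conclusion.": 9}

by_score = nlargest(2, scores, key=scores.get)
in_order = top_sentences_in_order(sentences, scores, num_sentences=2)

print(" ".join(by_score))  # Conclusion. Key finding.
print(" ".join(in_order))  # Key finding. Conclusion.
```

Inside `summarize_text`, the same effect could be had by filtering `sentences` against the `nlargest` result before joining. Note that a sentence repeated verbatim in the text would then appear more than once in the summary.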
