<a href="https://colab.research.google.com/github/kanikachitnis1018/Summarizer/blob/main/flowchart_converter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers
!pip install torch



In [3]:
import torch
import re


In [4]:
text_to_be_split = """ Kasparov was a fiercely aggressive chess player who thrived on energy and confidence. My father wrote a book called Mortal Games about Garry, and during the years surrounding the 1990 Kasparov-Karpov match, we both spent quite a lot of time with him.

At one point, after Kasparov had lost a big game and was feeling dark and fragile, my father asked Garry how he would handle his lack of confidence in the next game. Garry responded that he would try to play the chess moves that he would have played if he were feeling confident. He would pretend to feel confident, and hopefully trigger the state.

Kasparov was an intimidator over the board. Everyone in the chess world was afraid of Garry and he fed on that reality. If Garry bristled at the chessboard, opponents would wither. So if Garry was feeling bad, but puffed up his chest, made aggressive moves, and appeared to be the manifestation of Confidence itself, then opponents would become unsettled. Step by step, Garry would feed off his own chess moves, off the created position, and off his opponent's building fear, until soon enough the confidence would become real and Garry would be in flow…

He was not being artificial. Garry was triggering his zone by playing Kasparov chess """

paragraphs = text_to_be_split.split("\n\n")

paragraphs

[' Kasparov was a fiercely aggressive chess player who thrived on energy and confidence. My father wrote a book called Mortal Games about Garry, and during the years surrounding the 1990 Kasparov-Karpov match, we both spent quite a lot of time with him.',
 'At one point, after Kasparov had lost a big game and was feeling dark and fragile, my father asked Garry how he would handle his lack of confidence in the next game. Garry responded that he would try to play the chess moves that he would have played if he were feeling confident. He would pretend to feel confident, and hopefully trigger the state.',
 "Kasparov was an intimidator over the board. Everyone in the chess world was afraid of Garry and he fed on that reality. If Garry bristled at the chessboard, opponents would wither. So if Garry was feeling bad, but puffed up his chest, made aggressive moves, and appeared to be the manifestation of Confidence itself, then opponents would become unsettled. Step by step, Garry would feed of

In [5]:
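cleaned_paras = []

for para in paragraphs:
    # Trim leading and trailing whitespace.
    para = para.strip()

    # Collapse runs of whitespace (including newlines) into single spaces.
    para = re.sub(r'\s+', ' ', para)

    # Keep only ASCII letters, digits, spaces, and basic punctuation.
    # Note: this also drops characters such as the ellipsis and accented letters.
    para = re.sub(r'[^a-zA-Z0-9.,;:!?()\'" -]', '', para)

    cleaned_paras.append(para)

for i, para in enumerate(cleaned_paras, 1):
    print(f"Paragraph {i}: {para}\n")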

Paragraph 1: Kasparov was a fiercely aggressive chess player who thrived on energy and confidence. My father wrote a book called Mortal Games about Garry, and during the years surrounding the 1990 Kasparov-Karpov match, we both spent quite a lot of time with him.

Paragraph 2: At one point, after Kasparov had lost a big game and was feeling dark and fragile, my father asked Garry how he would handle his lack of confidence in the next game. Garry responded that he would try to play the chess moves that he would have played if he were feeling confident. He would pretend to feel confident, and hopefully trigger the state.

Paragraph 3: Kasparov was an intimidator over the board. Everyone in the chess world was afraid of Garry and he fed on that reality. If Garry bristled at the chessboard, opponents would wither. So if Garry was feeling bad, but puffed up his chest, made aggressive moves, and appeared to be the manifestation of Confidence itself, then opponents would become unsettled. Ste
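
The cleaning loop can be checked on a small input. The sketch below wraps the same three steps in a helper; the name `clean_paragraph` and the sample strings are assumptions made for illustration. It shows that the character whitelist removes the ellipsis and any non-ASCII letters, which may or may not be what you want for other texts.

```python
import re

def clean_paragraph(para: str) -> str:
    """Apply the same three cleaning steps to a single paragraph (hypothetical helper)."""
    para = para.strip()                  # trim the ends
    para = re.sub(r"\s+", " ", para)     # collapse whitespace runs
    # Drop everything outside the ASCII whitelist.
    return re.sub(r"[^a-zA-Z0-9.,;:!?()'\" -]", "", para)

print(clean_paragraph("  Garry   would be in\nflow…  "))  # Garry would be in flow
print(clean_paragraph("café"))                            # caf
```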

In [6]:
import nltk
from nltk.tokenize import sent_tokenize
from itertools import chain
from transformers import BertTokenizer, BertModel

nltk.download("punkt_tab")

sentences = list(chain.from_iterable(map(sent_tokenize, cleaned_paras)))



In [7]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

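# Convert sentences to padded batches of token IDs.
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Run BERT without tracking gradients to get one embedding per token.
with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state  # (num_sentences, seq_len, hidden_size)

# Mean-pool token embeddings into sentence embeddings, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).type_as(token_embeddings)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Regroup the sentence embeddings by paragraph.
sentences_per_para = list(map(sent_tokenize, cleaned_paras))
para_lengths = list(map(len, sentences_per_para))

sentence_groups = torch.split(sentence_embeddings, para_lengths)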
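
Mean pooling over a padded batch gives different results depending on whether padding positions are counted. The toy sketch below uses plain Python lists instead of tensors to make the difference visible; the helper name `mean_pool` and the numbers are illustrative assumptions, not values from the notebook.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors; if a mask is given, skip positions where it is 0."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens followed by two padding positions (toy values).
tokens = [[2.0, 4.0], [4.0, 8.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

print(mean_pool(tokens))        # [6.0, 7.5]  padding pulls the average away
print(mean_pool(tokens, mask))  # [3.0, 6.0]  only real tokens contribute
```

With sentences of very different lengths in one batch, the shortest sentences are affected most, because most of their positions are padding.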



In [8]:
import torch.nn.functional as F

def summarize_all_paragraphs(sentence_groups, sentences_per_para, k=2):
    """For each paragraph, keep the k sentences closest to the paragraph centroid, in original order."""
    def summarize_one(group, para_sents):
        if group.size(0) == 0:
            return ""

        # Score each sentence by cosine similarity to the mean sentence embedding.
        para_embedding = group.mean(dim=0, keepdim=True)
        scores = F.cosine_similarity(group, para_embedding)

        # Never ask for more sentences than the paragraph contains.
        k_safe = min(k, group.size(0))
        top_idx = scores.topk(k_safe).indices

        # Restore document order before joining.
        return " ".join(para_sents[i] for i in sorted(top_idx.tolist()))

    return [summarize_one(g, s) for g, s in zip(sentence_groups, sentences_per_para)]
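
The selection rule in `summarize_all_paragraphs` (score each sentence against the paragraph centroid, take the top k, restore original order) can be traced by hand on a tiny example. The sketch below is a plain-Python stand-in, not the notebook's tensor code; `cosine`, `top_k_by_centroid`, and the 2-D vectors are hypothetical names and values chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_by_centroid(vectors, sentences, k):
    """Return the k sentences whose vectors are closest to the centroid, in original order."""
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    scores = [cosine(v, centroid) for v in vectors]
    k = min(k, len(vectors))  # same guard as k_safe
    best = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy sentence embeddings
sentences = ["A", "B", "C"]
print(top_k_by_centroid(vectors, sentences, k=2))  # ['A', 'B']
```

Sentence "C" points away from the other two, so it scores lowest against the centroid and is dropped. When k is at least the number of sentences, the paragraph comes back unchanged.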


In [9]:
local_summaries = summarize_all_paragraphs(sentence_groups, sentences_per_para, k=4)

local_summaries

['Kasparov was a fiercely aggressive chess player who thrived on energy and confidence. My father wrote a book called Mortal Games about Garry, and during the years surrounding the 1990 Kasparov-Karpov match, we both spent quite a lot of time with him.',
 'At one point, after Kasparov had lost a big game and was feeling dark and fragile, my father asked Garry how he would handle his lack of confidence in the next game. Garry responded that he would try to play the chess moves that he would have played if he were feeling confident. He would pretend to feel confident, and hopefully trigger the state.',
 "Everyone in the chess world was afraid of Garry and he fed on that reality. If Garry bristled at the chessboard, opponents would wither. So if Garry was feeling bad, but puffed up his chest, made aggressive moves, and appeared to be the manifestation of Confidence itself, then opponents would become unsettled. Step by step, Garry would feed off his own chess moves, off the created positi

In [10]:
from transformers import BartTokenizer, BartForConditionalGeneration

bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = bart_tokenizer(
    local_summaries,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=1024
)

summary_ids = bart_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # tell the model which positions are padding
    num_beams=4,
    max_length=80,
    min_length=20,
    length_penalty=2.0,
    early_stopping=True
)

final_summaries = bart_tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

print(final_summaries)



['Kasparov was a fiercely aggressive chess player who thrived on energy and confidence. My father wrote a book called Mortal Games about Garry.', 'At one point, after Kasparov had lost a big game and was feeling dark and fragile, my father asked Garry how he would handle his lack of confidence in the next game. Garry responded that he would try to play the chess moves that he Would have played if he were feeling confident. He would pretend to feel confident, and hopefully trigger the state.', 'Everyone in the chess world was afraid of Garry and he fed on that reality. If Garry bristled at the chessboard, opponents would wither. So if Garry was feeling bad, but puffed up his chest, made aggressive moves, and appeared to be the manifestation of Confidence itself, then opponents would become unsettled. Step by step, Garry would feed off his own chess', 'He was not being artificial. Garry was triggering his zone by playing Kasparov chess.']


In [11]:
!pip install yake



In [12]:
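import yake

# Extract the single best keyphrase (up to 2 words) from each summary to use as its node title.
kw_extractor = yake.KeywordExtractor(lan="en", n=2, top=1)

def top_keyword(summary):
    keywords = kw_extractor.extract_keywords(summary)  # list of (keyword, score) pairs
    # Falling back to the summary itself is an assumption for the case where nothing is extracted.
    return keywords[0][0] if keywords else summary

titles = [top_keyword(summary) for summary in final_summaries]

print(titles)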


['fiercely aggressive', 'big game', 'Garry', 'Kasparov chess']


In [13]:
from graphviz import Digraph

def make_flowchart(titles):
    dot = Digraph(format="png")
    dot.attr(rankdir="LR", size="8")

    # Add one node per title, using its position as the node ID.
    for i, title in enumerate(titles):
        dot.node(str(i), title)

    # Connect each node to the next one in sequence.
    for i in range(len(titles) - 1):
        dot.edge(str(i), str(i + 1))

    return dot

flow = make_flowchart(titles)
flow.render("flowchart", view=True)


'flowchart.png'
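
To inspect the chain structure without rendering an image, the same nodes and edges can be written out as DOT text by hand. This is a minimal sketch: `chain_to_dot` is a hypothetical helper and not part of the `graphviz` package, and the output is only meant to approximate what `make_flowchart` describes.

```python
def chain_to_dot(titles):
    """Build DOT source for a left-to-right chain of labelled nodes (hypothetical helper)."""
    lines = ["digraph {", "    rankdir=LR"]
    for i, title in enumerate(titles):
        label = title.replace('"', '\\"')  # escape quotes inside labels
        lines.append(f'    {i} [label="{label}"]')
    for i in range(len(titles) - 1):
        lines.append(f"    {i} -> {i + 1}")  # edge between consecutive nodes
    lines.append("}")
    return "\n".join(lines)

print(chain_to_dot(["fiercely aggressive", "big game", "Garry", "Kasparov chess"]))
```

For n titles this yields n nodes and n - 1 edges, so a single title produces one node with no edges.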