In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2023-07-13 23:37:51--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-07-13 23:37:51--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-07-13 23:37:51--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [5]:
import numpy as np

# Load the 100-dimensional GloVe vectors into a dict: word -> float32 array
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)

400000
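
The loader above relies on the GloVe text format, where each line holds a token followed by its space-separated vector components. A minimal sketch of the same parsing on an in-memory stand-in for the file is shown here; the helper name `load_vectors` and the two 3-dimensional lines are made up for illustration and are not part of the notebook or the real GloVe data.

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse GloVe-style lines ("word v1 v2 ...") into a dict of float32 arrays."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

# Hypothetical 3-dimensional vectors standing in for glove.6B.100d.txt
fake_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy_embeddings = load_vectors(fake_file)
print(len(toy_embeddings), toy_embeddings['cat'].shape)  # 2 (3,)
```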

In [6]:
import re
import nltk
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('stopwords')

def extractive_summarization(text, n):
    # Split the text into sentences
    sentences = nltk.sent_tokenize(text)

    # Replace punctuation, digits, and special characters with spaces
    sentences_cleaned = [re.sub(r'[^a-zA-Z]', ' ', sentence) for sentence in sentences]

    # Lowercase each sentence
    sentences_cleaned = [sentence.lower() for sentence in sentences_cleaned]

    # Tokenize on whitespace and drop English stopwords
    stop_words = set(stopwords.words('english'))
    sentences_cleaned = [[word for word in sentence.split() if word not in stop_words] for sentence in sentences_cleaned]

    # Represent each sentence as the mean of the GloVe vectors of its known words;
    # a sentence with no known words keeps an all-zero vector
    sentence_vectors = []
    for sentence in sentences_cleaned:
        vector = np.zeros((100,))
        count = 0
        for word in sentence:
            if word in word_embeddings:
                vector += word_embeddings[word]
                count += 1
        if count != 0:
            vector /= count
        sentence_vectors.append(vector)

    # Calculate cosine similarity between sentence vectors
    similarity_matrix = cosine_similarity(sentence_vectors, sentence_vectors)

    # Score each sentence by its total similarity to every sentence (itself included)
    scores = np.sum(similarity_matrix, axis=1)

    # Sort sentences based on scores in descending order
    ranked_sentences_indices = np.argsort(scores)[::-1]

    # Take the indices of the n highest-scoring sentences
    top_indices = ranked_sentences_indices[:n]

    # Emit the selected sentences in their original document order
    restored_sentences = [sentences[i] for i in sorted(top_indices)]

    # Combine restored sentences into the final summary
    summary = ' '.join(restored_sentences)

    return summary


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
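
The ranking step inside `extractive_summarization` can be hard to follow on 100-dimensional vectors, so here is a small sketch of the same idea on hand-written 2-dimensional vectors: build the cosine similarity matrix, sum each row to get a score, keep the top n, and return them in original order. The helper `rank_by_similarity` and the toy vectors are assumptions for illustration only; the sketch uses plain NumPy rather than `sklearn`'s `cosine_similarity`.

```python
import numpy as np

def rank_by_similarity(sentence_vectors, n):
    """Return the indices of the n highest-scoring vectors, in original order."""
    vectors = np.asarray(sentence_vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero vectors stay zero instead of dividing by 0
    unit = vectors / norms
    similarity = unit @ unit.T           # pairwise cosine similarity
    scores = similarity.sum(axis=1)      # total similarity per sentence
    top = np.argsort(scores)[::-1][:n]   # n best indices, best first
    return sorted(int(i) for i in top)   # back to document order

# Toy "sentence vectors": rows 0, 1, and 3 point in similar directions, row 2 is an outlier
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.2],
]
selected = rank_by_similarity(toy_vectors, 2)
print(selected)  # [1, 3]
```

Rows 1 and 3 win because they sit closest to the bulk of the other vectors, which is the sense in which this scoring favors "central" sentences.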


In [7]:
text = '''Education is the transmission of knowledge, skills, and character traits. There are many debates about its precise definition, for example, about which aims it tries to achieve. A further issue is whether part of the meaning of education is that the change in the student is an improvement. Some researchers stress the role of critical thinking in distinguishing education from indoctrination. These disagreements affect how to identify, measure, and improve forms of education. The term can also refer to the mental states and qualities of educated people. Additionally, it can mean the academic field studying education.
There are many types of education. Formal education happens in a complex institutional framework, like public schools. Non-formal education is also structured but happens outside the formal schooling system. Informal education is unstructured learning through daily experiences. Formal and non-formal education are divided into levels. They include early childhood education, primary education, secondary education, and tertiary education. Other classifications focus on the teaching method, like teacher-centered and student-centered education. Forms of education can also be distinguished by subject, like science education, language education, and physical education.
Education socializes children into society by teaching cultural values and norms. It equips them with the skills needed to become productive members of society. This way, it stimulates economic growth and raises awareness of local and global problems. Organized institutions affect many aspects of education. For example, governments set education policies. They determine when school classes happen, what is taught, and who can or must attend. International organizations, like UNESCO, have been influential in promoting primary education for all children.
Many factors influence whether education is successful. Psychological factors include motivation, intelligence, and personality. Social factors, like socioeconomic status, ethnicity, and gender, are often linked to discrimination. Further factors include educational technology, teacher quality, and parent involvement.
The main field investigating education is called education studies. It examines what education is and what aims it has. It also studies how it happens, what effects it has, and how to improve it. It has many subfields, like philosophy of education, psychology of education, sociology of education, economics of education, and comparative education. It also discusses the history of education. In prehistory, education happened informally through oral communication and imitation. With the rise of ancient civilizations, writing was invented, and the amount of knowledge grew. This caused a shift from informal to formal education. Initially, formal education was mainly available to elites and religious groups. The invention of the printing press in the 15th century made books more widely available. This increased general literacy. Beginning in the 18th and 19th centuries, public education became more important. It led to the worldwide process of making primary education available to all, free of charge, and compulsory up to a certain age.'''

In [8]:
extractive_summarization(text, 2)

'A further issue is whether part of the meaning of education is that the change in the student is an improvement. Forms of education can also be distinguished by subject, like science education, language education, and physical education.'

In [25]:
! gdown --id 1_-JauywBQBxQgbLXA-4492-qzSSQmCQr

Downloading...
From: https://drive.google.com/uc?id=1_-JauywBQBxQgbLXA-4492-qzSSQmCQr
To: /content/test.csv.zip
100% 18.9M/18.9M [00:00<00:00, 88.8MB/s]


In [27]:
!unzip test.csv.zip

Archive:  test.csv.zip
  inflating: test.csv


In [28]:
import pandas as pd
from rouge_score import rouge_scorer
import nltk

# Load the CSV file
df = pd.read_csv('test.csv')

# Set N to the number of sentences in each reference summary (the 'highlights' column)
df['Summary_sentences'] = df['highlights'].apply(nltk.sent_tokenize)
df['N'] = df['Summary_sentences'].apply(len)

# Initialize the RougeScorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Initialize a dict of running totals for each ROUGE metric
rouge_scores = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}

# Loop through each row in the dataframe
for _, row in df.iterrows():
    article = row['article']
    summary = row['highlights']
    n = row['N']

    # Generate the summary using extractive summarization
    generated_summary = extractive_summarization(article, n)

    # Calculate Rouge scores
    scores = scorer.score(summary, generated_summary)

    # Accumulate the F-measure of each ROUGE metric
    for metric in rouge_scores:
        rouge_scores[metric] += scores[metric].fmeasure

num_examples = len(df)

# Calculate average Rouge scores
for metric in rouge_scores:
    rouge_scores[metric] /= num_examples

# Print average Rouge scores
print("Average Rouge Scores:")
for metric, score in rouge_scores.items():
    print(f"{metric}: {score}")

Average Rouge Scores:
rouge1: 0.307775614345949
rouge2: 0.10497020798066341
rougeL: 0.18745117715400356
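
To make the `rouge1` number easier to interpret, the sketch below computes a simplified ROUGE-1 F-measure by hand: clipped unigram overlap, then precision, recall, and their harmonic mean. It is an approximation under stated assumptions, not the `rouge_score` implementation: it lowercases and splits on whitespace only, with no stemming, and the function name `rouge1_f` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-measure: whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap)
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 4 of 6 unigrams match in each direction, so precision = recall = 2/3
example_f = rouge1_f("the cat sat on the mat", "the cat lay on a mat")
print(round(example_f, 4))  # 0.6667
```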
