# PySpark Summarization Work
### Submitted by: Kunal Malhan

## 0.1. Installing required packages

In [1]:
!pip install "lxml[html_clean]"
!pip install newspaper3k



## 0.2. Import statements

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.functions import explode, split, lower, regexp_replace, size, desc, array_distinct, trim, udf
from pyspark.sql.types import ArrayType, StringType
from newspaper import Article
from transformers import pipeline

## 0.3. Getting news article and parsing

In [3]:
spark = SparkSession.builder.appName("NLP for Summarization").getOrCreate()

In [4]:
url     = "https://apnews.com/article/mideast-wars-israel-gaza-lebanon-5dbfc18c7311a6b3eb16c89981bd3dfb"
article = Article(url)
article.download()
article.parse()
text    = article.text

print(text[:600])

The International Criminal Court issued arrest warrants on Thursday for Israeli Prime Minister Benjamin Netanyahu, his former defense minister and a Hamas military leader, accusing them of war crimes and crimes against humanity. The announcement came as health officials in the Gaza Strip said the death toll from the 13-month-old war between Israel and Hamas has surpassed 44,000.

The warrant marked the first time that a sitting leader of a major Western ally has been accused of war crimes and crimes against humanity by a global court of justice. The ICC panel said there were reasonable grounds


## 0.4. Create Spark DataFrame

In [5]:
data     = [(text,)]
columns  = ["article_text"]
df       = spark.createDataFrame(data,columns)
df.show(truncate=40)

+----------------------------------------+
|                            article_text|
+----------------------------------------+
|The International Criminal Court issu...|
+----------------------------------------+



## 0.5. Cleaning the dataset

Splitting the article into sentences

In [6]:
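# Split on every literal period (raw string avoids an invalid-escape warning).
# Note: this also cuts abbreviations such as "U.N." into one-letter fragments.
df    = df.withColumn("sentence", explode(split("article_text", r"\.")))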
df.show(truncate=30)

+------------------------------+------------------------------+
|                  article_text|                      sentence|
+------------------------------+------------------------------+
|The International Criminal ...|The International Criminal ...|
|The International Criminal ...| The announcement came as h...|
|The International Criminal ...|\n\nThe warrant marked the ...|
|The International Criminal ...| The ICC panel said there w...|
|The International Criminal ...|\n\nIsrael’s war has caused...|
|The International Criminal ...|3 million people from their...|
|The International Criminal ...|\n\nIsrael launched its war...|
|The International Criminal ...| 7, 2023, killing some 1,20...|
|The International Criminal ...| Around 100 hostages are st...|
|The International Criminal ...|\n\n___\n\nHere’s the Lates...|
|The International Criminal ...|                             N|
|The International Criminal ...|     deputy special envoy says|
|The International Criminal ...|        
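The one-letter row `N` followed by `deputy special envoy says` most likely comes from an abbreviation such as "U.N." being cut at each period. The following is a minimal plain-Python sketch (not Spark) on a made-up sample string. It compares the naive split with a split that only breaks on a period followed by whitespace and an uppercase letter. The sample text and the alternative pattern are assumptions for illustration, and the heuristic still fails when an abbreviation is followed by a capitalized word.

```python
import re

# Made-up sample text (assumption), containing two abbreviations.
sample = "The U.N. envoy spoke on Thursday. Israel launched its war on Oct. 7, 2023. Talks continue."

# Naive approach: break on every period, as in the cell above.
naive = re.split(r"\.", sample)

# Heuristic alternative: break only where a period is followed by
# whitespace and then an uppercase letter.
better = re.split(r"(?<=\.)\s+(?=[A-Z])", sample)

print(naive)
print(better)
```

Spark's `split` takes a Java regular expression, so a similar lookaround pattern should be usable there as well, though that was not tested in this notebook.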

Removing non-alphabetic characters and lower-casing the sentences

In [7]:
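# Step 1: drop everything except ASCII letters and whitespace, then lower-case.
df    = df.withColumn("Clean_sentence", lower(regexp_replace("sentence", r"[^a-zA-Z\s]", "")))
# Step 2: turn each whitespace character (newline, tab) into a space and trim the ends.
# Note: "[\s+]" is a character class, so runs of spaces are NOT collapsed; r"\s+" would collapse them.
df    = df.withColumn("Clean_sentence", trim(regexp_replace("Clean_sentence", r"[\s+]", " ")))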
df.select("Clean_sentence").show(truncate=80)

+--------------------------------------------------------------------------------+
|                                                                  Clean_sentence|
+--------------------------------------------------------------------------------+
|the international criminal court issued arrest warrants on thursday for israe...|
|the announcement came as health officials in the gaza strip said the death to...|
|the warrant marked the first time that a sitting leader of a major western al...|
|the icc panel said there were reasonable grounds to believe that both netanya...|
|israels war has caused heavy destruction across gaza decimated parts of the t...|
|        million people from their homes leaving most dependent on aid to survive|
|israel launched its war in gaza after hamasled militants stormed into souther...|
|                     killing some  people mostly civilians and abducting another|
|around  hostages are still inside gaza at least a third of whom are believed ...|
|her
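The double spaces visible in rows such as `killing some  people` appear because removed digits leave two neighbouring spaces behind, and the pattern `[\s+]` replaces whitespace one character at a time. Here is a small plain-Python sketch of the same two regular expressions on a fragment taken from the output above. Python's `re` is used in place of Spark here, and the two engines are assumed to behave alike for these simple patterns.

```python
import re

raw = "7, 2023, killing some 1,200 people"

# Step 1: keep only letters and whitespace, then lower-case.
letters_only = re.sub(r"[^a-zA-Z\s]", "", raw).lower()

# Character class: each whitespace character becomes one space (runs survive).
per_char = re.sub(r"[\s+]", " ", letters_only).strip()

# Quantifier: each run of whitespace becomes a single space.
collapsed = re.sub(r"\s+", " ", letters_only).strip()

print(repr(per_char), per_char.split(" "))
print(repr(collapsed), collapsed.split(" "))
```

With the character-class version, splitting on a single space yields an empty-string token, which then counts as a word in later steps.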

Splitting the clean sentences into tokens

In [8]:
df     = df.withColumn("Tokens", split("Clean_sentence", " "))
df.select("Clean_sentence", "Tokens").show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                      Clean_sentence|                                                                                              Tokens|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|the international criminal court issued arrest warrants on thursday for israeli prime minister be...|[the, international, criminal, court, issued, arrest, warrants, on, thursday, for, israeli, prime...|
|the announcement came as health officials in the gaza strip said the death toll from the monthold...|[the, announcement, came, as, health, officials, in, the, gaza, strip, said, the, 

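## 0.6. Word Frequency Calculation

`Word_Freq` here is the number of tokens in each sentence (its length in words), not the frequency of individual words.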

In [9]:
df     = df.withColumn("Word_Freq", size("Tokens"))
df.select("Clean_sentence", "Word_Freq").show(truncate=100)

+----------------------------------------------------------------------------------------------------+---------+
|                                                                                      Clean_sentence|Word_Freq|
+----------------------------------------------------------------------------------------------------+---------+
|the international criminal court issued arrest warrants on thursday for israeli prime minister be...|       33|
|the announcement came as health officials in the gaza strip said the death toll from the monthold...|       24|
|the warrant marked the first time that a sitting leader of a major western ally has been accused ...|       31|
|the icc panel said there were reasonable grounds to believe that both netanyahu and his exdefense...|       37|
|israels war has caused heavy destruction across gaza decimated parts of the territory and driven ...|       20|
|                            million people from their homes leaving most dependent on aid to su

## 0.7. Selecting sentences with at least the average word count, sorted by word count in descending order

In [10]:
average_word_freq = df.agg({"Word_Freq": "avg"}).collect()[0][0]
summary_df        = df.filter(df.Word_Freq >= average_word_freq).orderBy(desc("Word_Freq"))
summary_df.select("Clean_sentence", "Word_Freq").show(truncate=100)

+----------------------------------------------------------------------------------------------------+---------+
|                                                                                      Clean_sentence|Word_Freq|
+----------------------------------------------------------------------------------------------------+---------+
|the eu is wracked by members divisions over how peace should come about in the middle east  inter...|       82|
|eu foreign policy chief says icc arrest warrants are binding on all bloc members  the european un...|       68|
|the leaked documents are said to have formed the basis of a widely discredited article in the lon...|       61|
|hamas welcomes warrants against netanyahu and gallant  hamas has welcomed the decision by the int...|       55|
|turkeys ruling party welcomes warrant against netanyahu  ankara  turkish president recep tayyip e...|       51|
|heres the latest  un official warns that increasing airstrikes in syria particularly by israel 
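The selection rule in this cell (keep sentences whose token count is at least the average, then sort longest first) can be checked by hand on a toy list. The sketch below is plain Python rather than Spark, and the four sample sentences are invented for illustration.

```python
# Invented sample sentences (assumption), already cleaned and lower-cased.
sentences = [
    "the court issued arrest warrants on thursday",
    "n",
    "health officials said the death toll has surpassed previous estimates",
    "talks continue",
]

# Token count per sentence, mirroring size(split(..., " ")).
lengths = [len(s.split(" ")) for s in sentences]
average = sum(lengths) / len(lengths)

# Keep sentences at or above the average, longest first.
summary = sorted(
    [(s, n) for s, n in zip(sentences, lengths) if n >= average],
    key=lambda pair: pair[1],
    reverse=True,
)

print(average)
for sentence, n in summary:
    print(n, sentence)
```

Because fragments such as `n` pull the average down, the threshold ends up fairly permissive; filtering out very short rows first would raise it.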

# 1. Identify Unique Words

In [11]:
df     = df.withColumn("Word_Unique_Count", size(array_distinct("Tokens")))
df.select("Clean_sentence", "Word_Unique_Count").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------------+
|                                                                                      Clean_sentence|Word_Unique_Count|
+----------------------------------------------------------------------------------------------------+-----------------+
|the international criminal court issued arrest warrants on thursday for israeli prime minister be...|               30|
|the announcement came as health officials in the gaza strip said the death toll from the monthold...|               21|
|the warrant marked the first time that a sitting leader of a major western ally has been accused ...|               25|
|the icc panel said there were reasonable grounds to believe that both netanyahu and his exdefense...|               32|
|israels war has caused heavy destruction across gaza decimated parts of the territory and driven ...|               18|
|                            mil
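As a sanity check on what `size(array_distinct("Tokens"))` measures, the same count can be reproduced in plain Python. The token list below is a made-up example, and `dict.fromkeys` is used only to drop duplicates while keeping first-occurrence order.

```python
# Made-up token list (assumption) with the word "the" repeated three times.
tokens = "the death toll from the war in the gaza strip".split(" ")

# Drop duplicates while preserving first-occurrence order.
distinct = list(dict.fromkeys(tokens))

print(len(tokens), len(distinct))
print(distinct)
```

Note that an empty-string token, if present, is also counted once as a distinct "word".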

# 2. Find Sentences Containing a Specific Word

In [12]:
summary_df = df.filter(df.Clean_sentence.ilike('%Israeli%'))
summary_df.select("Clean_sentence", "Word_Unique_Count").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------------+
|                                                                                      Clean_sentence|Word_Unique_Count|
+----------------------------------------------------------------------------------------------------+-----------------+
|the international criminal court issued arrest warrants on thursday for israeli prime minister be...|               30|
|in the current escalating climate rochdi said once again israeli airstrikes on syria have increas...|               20|
|he pointed to wednesdays strike near palmyra that killed dozens which was likely the deadliest is...|               19|
|he said israeli strikes on residential areas in the capital damascus as well as on bridges roads ...|               29|
|israeli strikes in lebanon kill more than  people nationwide  beirut  israeli strikes killed at l...|               26|
|in eastern lebanon intensified 

# 3. Compute Average Sentence Length

In [13]:
average_word_freq = df.agg({"Word_Freq": "avg"}).collect()[0][0]
print("Average Sentence Length is ", average_word_freq)

Average Sentence Length is  20.1


# 4. Find the Longest Sentence

In [14]:
summary_df        = df.orderBy(desc("Word_Freq"))
summary_df.limit(1).select("Clean_sentence", "Word_Freq").show(truncate=100)

+----------------------------------------------------------------------------------------------------+---------+
|                                                                                      Clean_sentence|Word_Freq|
+----------------------------------------------------------------------------------------------------+---------+
|the eu is wracked by members divisions over how peace should come about in the middle east  inter...|       82|
+----------------------------------------------------------------------------------------------------+---------+



# 5. Filter Short Sentences

In [15]:
summary_df        = df.filter(df.Word_Freq < 5).orderBy("Word_Freq")
summary_df.select("Clean_sentence", "Word_Freq").show(truncate=100)

+--------------+---------+
|Clean_sentence|Word_Freq|
+--------------+---------+
|             n|        1|
|             n|        1|
|             n|        1|
|             n|        1|
|             n|        1|
|             n|        1|
|             n|        1|
|             s|        1|
|             u|        1|
|             s|        1|
|           gen|        1|
|           gen|        1|
|             n|        1|
|              |        1|
|              |        1|
|              |        1|
|              |        1|
|             u|        1|
|             s|        1|
|              |        1|
+--------------+---------+
only showing top 20 rows



# 6. Word Co-occurrence

In [16]:
# Each RDD element is one sentence's token list; there is no need to
# collect to the driver and parallelize again.
Word_Rdd    = df.select("Tokens").rdd.map(lambda row: row["Tokens"])
# Pair each token with its immediate successor within the same sentence.
result_rdd  = Word_Rdd.flatMap(lambda words: [(words[i], words[i+1]) for i in range(len(words) - 1)])
counts_rdd  = result_rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

counts_df   = counts_rdd.toDF(["Value", "Count"])
counts_df.orderBy(desc("Count")).show(truncate=False)

+-------------------------+-----+
|Value                    |Count|
+-------------------------+-----+
|{in, the}                |28   |
|{of, the}                |15   |
|{the, international}     |14   |
|{, people}               |14   |
|{arrest, warrants}       |14   |
|{international, criminal}|12   |
|{health, ministry}       |11   |
|{benjamin, netanyahu}    |10   |
|{prime, minister}        |10   |
|{netanyahu, and}         |10   |
|{, the}                  |10   |
|{and, the}               |10   |
|{to, the}                |10   |
|{minister, benjamin}     |9    |
|{the, war}               |9    |
|{in, gaza}               |9    |
|{have, been}             |9    |
|{criminal, court}        |8    |
|{on, thursday}           |8    |
|{the, gaza}              |8    |
+-------------------------+-----+
only showing top 20 rows
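Pairs such as `{, people}` in the table contain an empty-string token, which is what splitting on a single space produces when a sentence holds two consecutive spaces. The sketch below repeats the adjacent-pair count in plain Python with `collections.Counter` on invented token lists, once as-is and once after dropping empty tokens. The helper name `adjacent_pairs` is hypothetical.

```python
from collections import Counter

# Invented token lists (assumption); the second one contains an empty token.
token_lists = [
    ["in", "the", "gaza", "strip"],
    ["killing", "some", "", "people"],
    ["in", "the", "war"],
]

def adjacent_pairs(tokens):
    """Return each token paired with its immediate successor."""
    return list(zip(tokens, tokens[1:]))

# Counting as in the cell above: empty tokens take part in pairs.
counts = Counter(p for tokens in token_lists for p in adjacent_pairs(tokens))

# Same count after removing empty tokens first.
counts_clean = Counter(
    p for tokens in token_lists for p in adjacent_pairs([t for t in tokens if t])
)

print(counts.most_common(3))
print(counts_clean.most_common(3))
```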



# 7. Generate Bigrams

In [17]:
@udf(returnType=ArrayType(StringType()))
def bigrams(words):
    return [" ".join(pair) for pair in zip(words[:-1], words[1:])]

df = df.withColumn("bigrams", bigrams("Tokens"))
df.select("Clean_sentence", "bigrams").show(truncate=80)

+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                  Clean_sentence|                                                                         bigrams|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|the international criminal court issued arrest warrants on thursday for israe...|[the international, international criminal, criminal court, court issued, iss...|
|the announcement came as health officials in the gaza strip said the death to...|[the announcement, announcement came, came as, as health, health officials, o...|
|the warrant marked the first time that a sitting leader of a major western al...|[the warrant, warrant marked, marked the, the first, first time, time that, t...|
|the icc panel s
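The body of the `bigrams` UDF is ordinary Python, so its edge cases can be checked without Spark. The sketch below copies the function body and adds a hypothetical `bigrams_safe` variant that returns an empty list for `None`, which is what a UDF receives for a null array. The `Tokens` column in this notebook is never null, so the guard is purely defensive.

```python
def bigrams(words):
    # Same body as the UDF above, without the Spark decorator.
    return [" ".join(pair) for pair in zip(words[:-1], words[1:])]

def bigrams_safe(words):
    # Hypothetical variant: tolerate None (a null array in Spark).
    if not words:
        return []
    return bigrams(words)

print(bigrams(["the", "icc", "panel"]))  # two bigrams
print(bigrams(["n"]))                    # single token: no bigrams
print(bigrams_safe(None))                # null input: no bigrams
```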

# 8. Remove Stopwords

In [18]:
remover         = StopWordsRemover(inputCol="Tokens", outputCol="Clean_sentence_No_Stopwords")
df              = remover.transform(df)
df.select("Clean_sentence", "Clean_sentence_No_Stopwords").show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                      Clean_sentence|                                                                         Clean_sentence_No_Stopwords|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|the international criminal court issued arrest warrants on thursday for israeli prime minister be...|[international, criminal, court, issued, arrest, warrants, thursday, israeli, prime, minister, be...|
|the announcement came as health officials in the gaza strip said the death toll from the monthold...|[announcement, came, health, officials, gaza, strip, said, death, toll, monthold, 
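Conceptually, stop-word removal is a membership filter over each token list. The following plain-Python sketch uses a tiny hand-picked stop list, which is an assumption; Spark's `StopWordsRemover` ships a much longer default English list. The comparison is case-insensitive, matching the remover's default `caseSensitive=False` setting.

```python
# Tiny illustrative stop list (assumption, not Spark's default list).
STOP_WORDS = {"the", "on", "for", "as", "in", "from", "a", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "international", "criminal", "court", "issued",
          "arrest", "warrants", "on", "thursday"]
print(remove_stopwords(tokens))
```

Empty-string tokens are not stop words, so they pass through this filter unchanged.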

# Sentence-wise summary

In [19]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

sentences  = [row["Clean_sentence"] for row in df.collect()]
for sentence in sentences:
    input_length = len(sentence.split())
    if input_length >=10:
        print("******************************************************************************************************")
        summary = summarizer(sentence, max_length=min(50, input_length), min_length=10, do_sample=False)
        print(f"Original: {sentence}")
        print(f"Summary: {summary[0]['summary_text']}")


******************************************************************************************************
Original: the international criminal court issued arrest warrants on thursday for israeli prime minister benjamin netanyahu his former defense minister and a hamas military leader accusing them of war crimes and crimes against humanity
Summary: International criminal court issued arrest warrants on th Thursday for israeli prime minister benjamin netanyahu. Former defense minister and a hamas military leader accused
******************************************************************************************************
Original: the announcement came as health officials in the gaza strip said the death toll from the monthold war between israel and hamas has surpassed
Summary: Health officials in the gaza strip said the death toll from the monthold war between israel and ham
******************************************************************************************************
Original: th