# **Downloading Resources**
**Stopwords:** Stopwords are common words in a language (e.g., "the", "is", "and") that occur frequently but carry little meaning. In many NLP tasks, including text classification, sentiment analysis, and information retrieval, removing stopwords can improve performance by reducing noise and focusing on the more informative words in the text.

**WordNet:** WordNet is a lexical database of English that records semantic relationships between words, such as synonyms, hypernyms, hyponyms, and meronyms. It is used in NLP tasks such as word sense disambiguation, semantic similarity measurement, and lemmatization. In the preprocessing step below, lemmatization reduces words to their base or dictionary form, which standardizes words and improves text analysis accuracy.

**Punkt Tokenizer:** Punkt is a pre-trained model, built with an unsupervised algorithm, for sentence tokenization: it segments text into sentences. Sentence tokenization is essential for many NLP tasks, including text summarization, machine translation, and information extraction. Punkt handles difficult cases such as abbreviations when identifying sentence boundaries, which is crucial for downstream tasks. NLTK's word_tokenize, used in the tokenization step, depends on it.

In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# **Load Dataset**

In [None]:
import pandas as pd

# Load the data from the CSV file into a DataFrame
df = pd.read_csv("sample_data/FinalFile.csv")
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0   Each of Tara Marklin’s three sons have vastly...  
1  “How much plastic will you have for dinner, si...  
2  In this season of the podcast Chasing Life Wit...  


# Drop rows with missing values in the 'Article ' column (the column name includes a trailing space)

In [None]:
df.dropna(subset=['Article '], inplace=True)

# Preprocessing steps

# 1. Convert text to lowercase

In [None]:
df['Article '] = df['Article '].str.lower()
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0   each of tara marklin’s three sons have vastly...  
1  “how much plastic will you have for dinner, si...  
2  in this season of the podcast chasing life wit...  


# 2. Remove unwanted characters like punctuation, special symbols, etc.

In [None]:
import re
df['Article '] = df['Article '].apply(lambda x: re.sub(r'[^\w\s]', '', str(x)))  # Handle NaN values by converting to string
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0   each of tara marklins three sons have vastly ...  
1  how much plastic will you have for dinner sir ...  
2  in this season of the podcast chasing life wit...  
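
To make the effect of the pattern above concrete, here is a minimal sketch that applies the same regular expression to a few short strings. The helper name `strip_punctuation` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as in the cell above: delete anything that is neither
# a word character (\w) nor whitespace (\s).
PUNCT_PATTERN = re.compile(r"[^\w\s]")


def strip_punctuation(text: str) -> str:
    """Remove punctuation and symbols, keeping letters, digits, underscores, and spaces."""
    return PUNCT_PATTERN.sub("", text)


samples = [
    "tara marklin’s three sons",  # the curly apostrophe is removed
    "3-in-5 families, really?",   # hyphens and punctuation go, digits stay
    "snake_case & co.",           # the underscore stays, '&' and '.' go
]
for sample in samples:
    print(repr(strip_punctuation(sample)))
```

Note that removed characters are deleted rather than replaced by a space, so hyphenated words are fused together and a removed symbol between two spaces leaves a double space behind.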


# 3. Tokenize the text

In [None]:
from nltk.tokenize import word_tokenize
df['Article '] = df['Article '].apply(word_tokenize)
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0  [each, of, tara, marklins, three, sons, have, ...  
1  [how, much, plastic, will, you, have, for, din...  
2  [in, this, season, of, the, podcast, chasing, ...  


# 4. Remove stop words

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df['Article '] = df['Article '].apply(lambda x: [word for word in x if word not in stop_words])
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0  [tara, marklins, three, sons, vastly, differen...  
1  [much, plastic, dinner, sir, maam, may, seem, ...  
2  [season, podcast, chasing, life, dr, sanjay, g...  


# 5. Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['Article '] = df['Article '].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0  [tara, marklins, three, son, vastly, different...  
1  [much, plastic, dinner, sir, maam, may, seem, ...  
2  [season, podcast, chasing, life, dr, sanjay, g...  


# Display the preprocessed DataFrame

In [None]:
print(df.head(3))

                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   

                               Published Date  Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024  Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024  Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024     Andrea Kane   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   

                                            Article   
0  [tara, marklins, three, son, vastly, different...  
1  [much, plastic, dinner, sir, maam, may, seem, ...  
2  [season, podcast, chasing, life, dr, sanjay, g...  


# Save the preprocessed DataFrame to a new CSV file

In [None]:
df.to_csv("preprocessed_data.csv", index=False)


# **SUMMARY**
The preprocessing steps performed above are common **text preprocessing techniques** used to clean and prepare textual data for natural language processing tasks. Here's a summary of each step:

**Convert text to lowercase:** This step ensures consistency by converting all text to lowercase. It helps in treating words with the same characters but different cases as identical.

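**Remove unwanted characters:** Using a regular expression (re.sub() with the pattern [^\w\s]), every character that is not a word character or whitespace is removed, which strips punctuation and special symbols. Letters, digits, and underscores are kept. This step helps eliminate noise from the text data, though it also deletes apostrophes, so "marklin’s" becomes "marklins".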

**Tokenization:** Tokenization is the process of breaking down a text into individual words or tokens. In the provided code, word_tokenize() from the NLTK library is used to tokenize the text.

**Remove stop words:** Stop words are common words that typically do not carry significant meaning and are often removed from text data. Examples include "the", "is", "and", etc. By removing these words, we can focus on the more important words in the text. The NLTK library is used to obtain a set of stop words in English, and then these stop words are removed from the tokenized text.

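**Lemmatization:** Lemmatization reduces words to their base or dictionary form, called a lemma, so that different forms of the same word are treated as one. NLTK's WordNetLemmatizer is used in the code. Called without a part-of-speech argument, as it is here, it treats every token as a noun, so plurals are reduced ("sons" becomes "son") but verb forms such as "running" and "ran" are left unchanged; they reduce to "run" only when lemmatize() is called with pos='v'.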
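
The first four steps can be sketched as one function using only the standard library. This is a rough illustration under stated assumptions: a whitespace split stands in for word_tokenize, `DEMO_STOP_WORDS` is a small hand-picked set rather than NLTK's English list, the sample sentence is made up, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, hand-picked stop list for illustration only; the notebook
# uses the much larger set from nltk.corpus.stopwords.words("english").
DEMO_STOP_WORDS = frozenset({"the", "is", "and", "of", "in", "this", "have", "each", "a"})


def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, split into tokens, and drop stop words."""
    text = text.lower()                       # 1. lowercase
    text = re.sub(r"[^\w\s]", "", text)       # 2. strip punctuation and symbols
    tokens = text.split()                     # 3. whitespace split (stand-in for word_tokenize)
    return [t for t in tokens if t not in stop_words]  # 4. drop stop words


# Made-up sentence, not taken from the dataset.
print(preprocess("Each of the three sons is a picky eater, and dinner is chaos!"))
```

Storing the stop words in a set keeps each membership check constant-time, which matters when every token of every article is tested.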

In [None]:
!pip install transformers torch


Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch)
  Using cached nvidia_cusolver_cu12-

In [None]:
import pandas as pd
from transformers import pipeline

# Load the preprocessed data from the CSV file into a DataFrame
df = pd.read_csv("/content/preprocessed_data.csv")

# Convert token list back to text
df['Article '] = df['Article '].apply(lambda tokens: ' '.join(eval(tokens)))

# Initialize the summarization pipeline with a specific model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Function to summarize text with truncation
def summarize_text(text, max_length=1024):
    # Keep only the first max_length characters (a character slice, not a token count)
    truncated_text = text[:max_length]
    # Summarize the text with the pipeline
    summary = summarizer(truncated_text, max_length=150, min_length=40, do_sample=False)
    return summary[0]['summary_text']

# Apply the summarization function to the 'Article' column
df['Summary'] = df['Article '].apply(lambda x: summarize_text(x))

# Save the DataFrame with the summaries to a new CSV file
df.to_csv("summarized_articles.csv", index=False)

# Display the first few rows of the DataFrame to verify
print(df.head())


Your max_length is set to 150, but your input_length is only 129. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=64)
Your max_length is set to 150, but your input_length is only 146. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=73)
Your max_length is set to 150, but your input_length is only 149. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=74)
Your max_length is set to 150, but your input_length is only 149. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=74)


                                URL Link Source Page  \
0  https://edition.cnn.com/2024/04/23/health/pick...   
1  https://edition.cnn.com/2024/04/22/health/plas...   
2  https://edition.cnn.com/2024/04/19/health/joy-...   
3  https://edition.cnn.com/2024/04/18/health/impo...   
4  https://edition.cnn.com/2024/04/02/health/rest...   

                               Published Date     Article Writer  \
0   Published 1:16 PM EDT, Mon April 22, 2024     Kristen Rogers   
1  Published 10:53 AM EDT, Mon April 22, 2024     Sandee LaMotte   
2  Published 12:02 PM EDT, Fri April 19, 2024        Andrea Kane   
3  Published 12:40 PM EDT, Thu April 18, 2024     Sandee LaMotte   
4  \nPublished 1:49 PM EDT, Tue April 2, 2024  Madeline Holcombe   

                                     Article Heading  \
0  3 in 5 families are short-order cooks for pick...   
1  Which foods have the most plastics? You may be...   
2                  5 ways to add joy into your meals   
3  Fresh and frozen imported s

**Install Dependencies:** Ensure transformers and torch libraries are installed.

**Load Preprocessed Data:** Load the preprocessed CSV file into a DataFrame.

**Convert Tokens to Text:** Convert the token lists back into text strings using the join method.

**Initialize Summarization Pipeline:** Use the facebook/bart-large-cnn model for summarization.

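**Summarize Text:** Define a function summarize_text to summarize the text. The function keeps only the first 1024 characters of each article before summarizing. This is a character slice, not a token count, so the input stays well below the model's 1024-token limit (the logged input lengths above are 129 to 149 tokens) and only the opening of each article is summarized.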

**Apply Summarization:** Apply the summarization function to each article in the DataFrame.

**Save Summarized Data:** Save the DataFrame with the summaries to a new CSV file.
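
The warnings logged during summarization appear when max_length (150) is larger than the tokenized input, and they suggest about half the input length instead (64 for 129 tokens, 73 for 146, 74 for 149). A minimal sketch of choosing the bounds per article follows. The helper name `pick_summary_lengths` is hypothetical, and the token count is passed in as a plain integer; obtaining it from the model's tokenizer is left out here.

```python
def pick_summary_lengths(input_length, max_cap=150, min_floor=40):
    """Return (min_length, max_length) for an input of input_length tokens."""
    # Mirror the warning's suggestion: aim for about half the input length,
    # never exceeding the cap used in the notebook (150).
    max_length = max(1, min(max_cap, input_length // 2))
    # min_length must not exceed max_length, so clamp it as well.
    min_length = min(min_floor, max_length)
    return min_length, max_length


for n in (129, 146, 149, 900):
    print(n, pick_summary_lengths(n))
```

The returned pair could then be passed as the min_length and max_length arguments of the summarizer call in place of the fixed 40 and 150.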

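The eval function converts the string representation of each list, which is how the token lists were stored in the CSV file, back into an actual list. Because eval executes arbitrary Python expressions, ast.literal_eval is a safer choice for files from untrusted sources, since it parses literals only.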
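
The round trip from token list to stored string and back can be sketched with the standard library alone. This is an illustration that uses ast.literal_eval as a stand-in for the notebook's eval call; the variable names are assumptions.

```python
import ast

tokens = ["season", "podcast", "chasing", "life"]

# Writing a list to CSV stores its string form, which is what comes back on load.
stored = str(tokens)  # "['season', 'podcast', 'chasing', 'life']"

# literal_eval parses Python literals only; anything else raises ValueError
# instead of being executed.
restored = ast.literal_eval(stored)
text = " ".join(restored)
print(text)

try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```

For lists of plain strings the result is the same as with eval, so the swap does not change the joined text.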

# Model used

The summarization pipeline from Hugging Face's transformers library wraps pre-trained models for text summarization. The code above passes model="facebook/bart-large-cnn" explicitly; if no model is specified, the pipeline falls back to a library-chosen default checkpoint instead. facebook/bart-large-cnn is a popular summarization model based on the BART (Bidirectional and Auto-Regressive Transformers) architecture, fine-tuned on the CNN/DailyMail dataset.

In [None]:
!pip install tensorflow
import tensorflow as tf

# Path to your saved model directory
saved_model_dir = "C:\\Users\\hinaa\\Desktop\\Article_summarization"

# Convert the model to TFLite format
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

# Save the TFLite model to a file
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
