# News Aggregator

## Features
- **Top Articles Retrieval**: Scrapes top articles from predefined news sources.
- **Article Summarization**: Summarizes articles to provide concise representations.
- **Sentiment Detection**: Determines the sentiment of articles based on sentiment analysis.
- **Topic Modeling**: Describes the topic of each article in three words.


In [4]:
import newspaper
import json

newspaper_list = [ 'https://time.com/',
        'https://www.theguardian.com/europe',
        'https://edition.cnn.com/' ]

### Top Articles Retrieval
The script scrapes the top articles from the news sources listed in `newspaper_list`. The `get_top_articles` function uses the `newspaper` library to build a newspaper object from a URL, then downloads and parses a specified number of its articles (5 by default) and returns them paired with their URLs. The loop that follows extracts each article's title, publication date, text, and URL into a dictionary, so that recent articles from several sources end up in a single list.

In [5]:
def get_top_articles(newspaper_url, num_articles=5):
    paper = newspaper.build(newspaper_url,number_threads=3)
    top_articles = []

    article_urls = [article.url for article in paper.articles[:num_articles]]

    for article in paper.articles[:num_articles]:
        article.download()
        article.parse()
        top_articles.append(article)
    
    return zip(article_urls, top_articles)

articles_info = []

for news_url in newspaper_list:
    top_articles = get_top_articles(news_url, num_articles=5)
  
    # Collect the fields of interest for each article
    for url, article in top_articles:
        single_info = {
            "newspaper": str(news_url),
            "title": article.title,
            "date": article.publish_date.strftime('%Y-%m-%d') if article.publish_date else None,
            "text" : article.text,
            "url": str(url)
        }
        articles_info.append(single_info)

# Serialize the list of records to a JSON string
articles_json = json.dumps(articles_info)

# Print the first record to show it is correctly structured
print(articles_info[0])

{'newspaper': 'https://time.com/', 'title': 'Parent Company of Saks Fifth Avenue to Buy Rival Neiman Marcus', 'date': '2024-07-04', 'text': "The parent company of Saks Fifth Avenue has signed a deal to buy upscale rival Neiman Marcus for $2.65 billion.\n\nThe new entity would be called Saks Global, which will comprise the Saks Fifth Avenue and Saks OFF 5TH brands, Neiman Marcus and Bergdorf Goodman, as well as the real estate assets of Neiman Marcus Group and HBC, a holding company that purchased Saks in 2013.\n\nThe pact was announced Thursday after months of rumors that the department store chains had been negotiating a deal.\n\nThe Wall Street Journal first reported the impending deal Wednesday.\n\nBoth Saks and Neiman Marcus have struggled as shoppers have been pulling back on buying high-end goods and shifting their spending toward experiences, like travel and upscale restaurants. The two iconic luxury purveyors have also faced stiffer competition from luxury brands, which are inc
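The record-building loop above assumes every article downloaded and parsed successfully. As a minimal sketch of a more defensive variant, the same step could be factored into helpers that skip articles with empty text. The names `article_to_record`, `collect_records`, and the stand-in class `FakeArticle` are hypothetical and not part of `newspaper`; they only mirror the attributes the notebook reads.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FakeArticle:
    """Hypothetical stand-in exposing the attributes read from a parsed article."""
    title: str
    text: str
    url: str
    publish_date: Optional[datetime] = None


def article_to_record(newspaper_url, article):
    """Build one JSON-serializable record; a missing date becomes None."""
    date = article.publish_date
    return {
        "newspaper": str(newspaper_url),
        "title": article.title,
        "date": date.strftime("%Y-%m-%d") if date else None,
        "text": article.text,
        "url": str(article.url),
    }


def collect_records(newspaper_url, articles):
    """Keep only articles with non-empty text (e.g. skip failed downloads)."""
    return [article_to_record(newspaper_url, a) for a in articles if a.text.strip()]
```

Any object with `title`, `text`, `url`, and `publish_date` attributes should work here, so the parsed articles from `get_top_articles` could be passed in place of `FakeArticle`.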

### Article Summarization
The `summarize` function takes an article URL and uses the `newspaper` library to download and parse the article and run its natural language processing (NLP) step. The resulting summary condenses the article's main points, so readers can grasp the essential information without reading the full text.

In [6]:
def summarize(article_url):
    article = newspaper.article(article_url)
    article.download()
    article.parse()
    article.nlp()
    return article.summary

# Example usage
summarize(articles_info[0]['url'])

'The parent company of Saks Fifth Avenue has signed a deal to buy upscale rival Neiman Marcus for $2.65 billion.\nThe new entity would be called Saks Global, which will comprise the Saks Fifth Avenue and Saks OFF 5TH brands, Neiman Marcus and Bergdorf Goodman, as well as the real estate assets of Neiman Marcus Group and HBC, a holding company that purchased Saks in 2013.\nBoth Saks and Neiman Marcus have struggled as shoppers have been pulling back on buying high-end goods and shifting their spending toward experiences, like travel and upscale restaurants.\nThe two iconic luxury purveyors have also faced stiffer competition from luxury brands, which are increasingly opening their own stores.\nNeiman Marcus filed for bankruptcy protection in May 2020 during the first months of the coronavirus pandemic but emerged in September of that year.'
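The summary above is extractive: it consists of sentences taken verbatim from the article. To illustrate the general idea, here is a simplified sketch that scores each sentence by the average frequency of its words and keeps the best ones in document order. This is an assumption-laden illustration, not the algorithm `newspaper` uses internally, and `extractive_summary` is a hypothetical name.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A real summarizer would typically also weigh the title, sentence position, and stopwords; the sketch leaves these out to stay short.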

### Sentiment Detection
The script includes a `detect_sentiment` function that classifies an article as 'positive', 'negative', or 'neutral'. It uses the `transformers` library to load `nlptown/bert-base-multilingual-uncased-sentiment`, a multilingual model that rates text from 1 to 5 stars. The function reads the star count from the predicted label: more than 3 stars is 'positive', fewer than 3 is 'negative', and exactly 3 is 'neutral'. The example passes only the first 511 characters of the article to keep the input short.

In [7]:
from transformers import pipeline

def detect_sentiment(text):
    # Load the multilingual star-rating sentiment model by name
    sentiment_model = pipeline(model="nlptown/bert-base-multilingual-uncased-sentiment")

    # Perform sentiment analysis on the text
    sentiment = sentiment_model(text)

    # The label has the form 'N stars'; extract the star count
    sentiment_label = sentiment[0]['label']
    sentiment_score = float(sentiment_label.split()[0])

    if sentiment_score > 3:
        return 'positive'
    elif sentiment_score < 3:
        return 'negative'
    else:
        return 'neutral'
    
detect_sentiment(articles_info[0]['text'][:511])


'negative'
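`detect_sentiment` reloads the model on every call, which is slow when many articles are processed. One possible refactoring, sketched here under the assumption that the model is any callable returning a list like `[{'label': '4 stars', ...}]`, separates the star-to-sentiment mapping from the model call so the pipeline can be created once and reused. The names `stars_to_sentiment` and `detect_sentiment_with` are hypothetical.

```python
def stars_to_sentiment(label):
    """Map a label such as '4 stars' to 'positive', 'negative', or 'neutral'."""
    stars = int(label.split()[0])
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return "neutral"


def detect_sentiment_with(model, text, max_chars=511):
    """Classify text with a caller-supplied model that is loaded only once.

    `model` is assumed to be a callable returning [{'label': 'N stars', ...}],
    the shape read by detect_sentiment in the cell above.
    """
    result = model(text[:max_chars])
    return stars_to_sentiment(result[0]["label"])
```

With this split, the pipeline from the cell above could be built once, stored in a variable, and passed as `model` for each article.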

### Topic Modeling
The script includes a `get_topic` function that describes the main topic of an article in three words. The `preprocess_text` function cleans the text by replacing non-alphabetic characters with spaces, converting to lowercase, and removing stopwords. `CountVectorizer` then transforms the cleaned text into a document-term matrix, on which a single-topic Latent Dirichlet Allocation (LDA) model from `sklearn` is fitted. Finally, `extract_topics` returns the three highest-weighted words of that topic, giving a quick indication of what the article is about.


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

# Preprocessing function
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove non-alphabet characters
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

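# Return the top words of the topic as a space-separated string
# (with several topics, only the last one is returned)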
def extract_topics(model, feature_names, no_top_words):
    topicString = ""
    for topic in model.components_:
        topicString = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
    return topicString



def get_topic(text_to_process):
    # Preprocess the document
    preprocessed_text = preprocess_text(text_to_process)

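    # Vectorization. With a single document every term has a document
    # frequency of 1, so these max_df/min_df thresholds keep all terms.
    vectorizer = CountVectorizer(max_df=2, min_df=0.85, stop_words='english')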
    dtm = vectorizer.fit_transform([preprocessed_text])

    # LDA Model
    lda = LatentDirichletAllocation(n_components=1, random_state=42)
    lda.fit(dtm)

    number_top_words = 3
    return extract_topics(lda, vectorizer.get_feature_names_out(), number_top_words)

get_topic(articles_info[0]['text'])



'saks stores luxury'
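Because `get_topic` fits a single topic on a single document, its top words should largely follow raw term counts. A plain frequency count can therefore serve as a quick sanity check on the LDA output. The sketch below is an illustration under that assumption, not a replacement for the model; `top_terms` and the tiny `SMALL_STOPWORDS` set are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead.
SMALL_STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "has"}


def top_terms(text, n=3):
    """Return the n most frequent non-stopword terms as a space-separated string."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in SMALL_STOPWORDS)
    return " ".join(word for word, _ in counts.most_common(n))
```

If the two approaches disagree noticeably on the same article, the vectorizer settings or the preprocessing are a reasonable place to look first.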