## Ethics in Data Science

**The AI News Summarizer**

In [None]:
!pip install newspaper3k transformers torch nltk shap

In [None]:
!pip install lxml_html_clean

In [6]:
import nltk
from newspaper import Article

# Download the NLTK sentence-tokenizer data that article.nlp() relies on
nltk.download('punkt')
nltk.download('punkt_tab')  # also needed by recent NLTK releases

def fetch_article(url):
    """Fetch and preprocess a news article."""
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    return article.text, article.title, article.authors, article.publish_date

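# Example usage. Note: this URL is a section front page, not a single article,
# so very little body text is extracted; replace it with a specific article URL.
url = "https://timesofindia.indiatimes.com/news"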
news_text, title, authors, publish_date = fetch_article(url)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
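Because a front-page URL yields only a few lines of body text, it may help to check the extracted text before passing it to a model. The following is a minimal sketch, not part of the original notebook: the helper name `has_enough_text` and the 60-word threshold are assumptions chosen for illustration.

```python
def has_enough_text(text, min_words=60):
    """Return True if `text` looks long enough to be worth summarizing.

    `min_words` is an arbitrary illustrative threshold (an assumption),
    not a value recommended by newspaper3k or transformers.
    """
    if not text:
        return False
    return len(text.split()) >= min_words


# A short teaser-style excerpt is rejected; a longer body passes.
teaser = "What next for the Chiefs after Super Bowl blowout?"
long_body = " ".join(["word"] * 200)
print(has_enough_text(teaser))     # False
print(has_enough_text(long_body))  # True
```

A check like this could gate the call to `summarize_text(news_text)` so that near-empty inputs are reported rather than summarized.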


In [7]:
from transformers import pipeline

# Load the summarization model
summarizer = pipeline("summarization")

def summarize_text(text, max_length=150):
    """Generate an AI-based summary of the news article."""
    summary = summarizer(text, max_length=max_length, min_length=50, do_sample=False)
    return summary[0]['summary_text']

summary = summarize_text(news_text)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu
Your max_length is set to 150, but your input_length is only 80. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)
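The warning above appears because the fixed `max_length=150` is larger than the 80-token input. One possible way to avoid it is to derive the length bounds from the input size. This is a sketch under assumptions: the function name `summary_length_bounds`, the 0.5 ratio, and the floor and cap values are illustrative choices, not part of the original notebook or of the transformers API.

```python
def summary_length_bounds(input_tokens, ratio=0.5, floor=10, cap=150):
    """Return (min_length, max_length) scaled to the input size.

    All defaults are illustrative assumptions: the summary is allowed at
    most `ratio` of the input length, never fewer than `floor` tokens and
    never more than `cap` tokens.
    """
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    max_length = max(floor, min(cap, int(input_tokens * ratio)))
    # Keep min_length at or below the notebook's original value of 50.
    min_length = max(1, min(50, max_length // 2))
    return min_length, max_length


print(summary_length_bounds(80))    # (20, 40)
print(summary_length_bounds(1000))  # (50, 150)
```

The resulting pair could then be passed as `min_length` and `max_length` to the `summarizer(...)` call. The input token count would have to come from the model's tokenizer, for example via `len(summarizer.tokenizer(text)["input_ids"])`.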


In [8]:
from transformers import pipeline

sentiment_analyzer = pipeline("sentiment-analysis")

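def detect_bias(text):
    """Run sentiment analysis as a rough proxy for potential bias.

    Sentiment polarity describes tone, not bias itself, so treat the
    result as a starting point for review rather than a verdict.
    """
    return sentiment_analyzer(text)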

bias_result = detect_bias(summary)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
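The sentiment pipeline returns a list of dictionaries with a `label` and a `score`. Since a negative tone is not the same thing as bias, it may be clearer to present the result with that caveat attached. The sketch below assumes that output shape; the helper name `describe_sentiment` and the 0.9 threshold are hypothetical, and the sample value is the one this notebook prints in its final cell.

```python
def describe_sentiment(result, threshold=0.9):
    """Turn a sentiment-pipeline result into a hedged, readable sentence.

    `result` is assumed to look like [{'label': 'NEGATIVE', 'score': 0.99}].
    `threshold` is an arbitrary cut-off between "strongly" and "weakly".
    """
    top = result[0]
    label, score = top["label"], top["score"]
    strength = "strongly" if score >= threshold else "weakly"
    return (
        f"Tone reads as {strength} {label.lower()} (score {score:.3f}); "
        "this reflects wording, not necessarily bias."
    )


sample = [{"label": "NEGATIVE", "score": 0.9997221827507019}]
print(describe_sentiment(sample))
```

A report about a lost game will naturally score as negative, which is why the wording separates tone from bias.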


In [10]:
import nltk
from newspaper import Article
from transformers import pipeline
import shap  # imported but not used in this cell

# Dependencies are installed in the first cells of this notebook.

# Download the NLTK tokenizer data (a no-op if it is already present)
nltk.download('punkt')

def fetch_article(url):
    """Fetch and parse a news article; return four None values on failure."""
    article = Article(url)
    try:
        article.download()
        article.parse()
        article.nlp()
        return article.text, article.title, article.authors, article.publish_date
    except Exception as e:
        print(f"Error fetching article: {e}")
        return None, None, None, None

def summarize_text(text, max_length=150):
    """Generate an AI-based summary; return None on failure."""
    try:
        # `summarizer` is the module-level pipeline created below
        summary = summarizer(text, max_length=max_length, min_length=50, do_sample=False)
        return summary[0]['summary_text']
    except Exception as e:
        print(f"Error summarizing text: {e}")
        return None

def detect_bias(text):
    """Run sentiment analysis as a rough proxy for bias (simplified approach)."""
    try:
        # `sentiment_analyzer` is the module-level pipeline created below
        return sentiment_analyzer(text)
    except Exception as e:
        print(f"Error analyzing sentiment: {e}")
        return None

def display_summary():
    """Display the summarized news with transparency features."""
    print(f"Title: {title}")
    print(f"Authors: {', '.join(authors) if authors else 'Unknown'}")
    print(f"Published Date: {publish_date if publish_date else 'Unknown'}")
    # Limit the original text shown to the first 500 characters
    print("\nOriginal News (first 500 characters):\n", news_text[:500], "...")
    print("\nAI-Generated Summary:\n", summary if summary else "Summary unavailable.")
    print("\nBias Analysis:\n", bias_result if bias_result else "Analysis unavailable.")

# Example usage. Note: this URL is a section front page, not a single article;
# replace it with the URL of the article you want to summarize.
url = "https://www.bbc.com/news"
news_text, title, authors, publish_date = fetch_article(url)

if news_text:  # Proceed only if article fetching was successful
    summarizer = pipeline("summarization")
    summary = summarize_text(news_text)

    sentiment_analyzer = pipeline("sentiment-analysis")
    # Skip sentiment analysis when no summary could be generated
    bias_result = detect_bias(summary) if summary else None

    display_summary()




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Your max_length is set to 150, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Title: BBC News
Authors: Unknown
Published Date: Unknown

Original News (first 500 characters):
 End of a dynasty? What next for the Chiefs after Super Bowl blowout?

After the worst game of Patrick Mahomes' career, and a second Super Bowl blowout defeat, how do the Kansas City Chiefs come back from here? ...

AI-Generated Summary:
  The Chiefs lost to the Patriots in the Super Bowl for the second time in a row . It was a second Super Bowl blowout defeat for the Chiefs . The loss was the worst game of Patrick Mahomes' career for the first time in his career .

Bias Analysis:
 [{'label': 'NEGATIVE', 'score': 0.9997221827507019}]
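The generated summary names the Patriots as the opponent, although that team does not appear anywhere in the extracted source text. A simple check can surface this kind of unsupported detail. The sketch below is a rough heuristic rather than a real faithfulness metric: the helper name `unsupported_names` is hypothetical, it only looks at capitalized words that occur mid-sentence, and it ignores paraphrase and multi-word names.

```python
import re


def unsupported_names(source, summary):
    """Return mid-sentence capitalized words in `summary` missing from `source`.

    Heuristic only: a word counts as "mid-sentence" when it directly follows
    a lowercase letter, comma, or semicolon plus a space, which skips
    ordinary sentence-initial words such as "The" or "It".
    """
    source_words = set(re.findall(r"[a-z]+", source.lower()))
    flagged = []
    for word in re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", summary):
        if word.lower() not in source_words and word not in flagged:
            flagged.append(word)
    return flagged


# Source excerpt and summary copied from the output above.
source = (
    "End of a dynasty? What next for the Chiefs after Super Bowl blowout? "
    "After the worst game of Patrick Mahomes' career, and a second Super Bowl "
    "blowout defeat, how do the Kansas City Chiefs come back from here?"
)
summary = (
    "The Chiefs lost to the Patriots in the Super Bowl for the second time in a row . "
    "It was a second Super Bowl blowout defeat for the Chiefs . "
    "The loss was the worst game of Patrick Mahomes' career for the first time in his career ."
)
print(unsupported_names(source, summary))  # ['Patriots']
```

Flagged words are candidates for human review, not proof of an error. Here the flag points to a detail the model introduced on its own, which is exactly the kind of issue a transparency feature should expose to readers.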
