**1-Setup**

In [3]:
!pip install nltk spacy gensim pyLDAvis transformers sentence-transformers googletrans==4.0.0rc1 textblob
!python -m spacy download en_core_web_sm

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

**2-Imports**

In [7]:
import pandas as pd
import numpy as np
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from googletrans import Translator
from textblob import TextBlob
import warnings
warnings.filterwarnings("ignore")

nlp = spacy.load("en_core_web_sm")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

**3-Load Midterm Dataset & Model**

In [12]:
import pandas as pd
df = pd.read_csv('/content/newsbot_dataset.csv')
df.head()

Unnamed: 0,content,category
0,The government passed a new policy today.,Politics
1,The football team won their final match.,Sports
2,A major tech company released a new smartphone.,Technology
3,The stock market saw significant growth this w...,Business
4,The actor won an award for best performance.,Entertainment


**4-Preprocessing Function**

In [13]:
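def preprocess(text):
    # Lowercase, tokenize with spaCy, drop stop words and non-alphabetic tokens,
    # then lemmatize each remaining token with WordNet (noun lemmas by default,
    # so verb forms such as "passed" are left unchanged).
    text = text.lower()
    doc = nlp(text)
    tokens = []
    for tok in doc:
        if tok.text not in stop_words and tok.is_alpha:
            tokens.append(lemmatizer.lemmatize(tok.text))
    return " ".join(tokens)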

df["clean"] = df["content"].apply(preprocess)
df.head()

Unnamed: 0,content,category,clean
0,The government passed a new policy today.,Politics,government passed new policy today
1,The football team won their final match.,Sports,football team final match
2,A major tech company released a new smartphone.,Technology,major tech company released new smartphone
3,The stock market saw significant growth this w...,Business,stock market saw significant growth week
4,The actor won an award for best performance.,Entertainment,actor award best performance
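
`preprocess` calls `text.lower()`, which raises an `AttributeError` when a cell holds a missing value (pandas reads empty CSV cells as float `NaN`). The sketch below is one possible guard, not part of the original notebook; `to_text` is a hypothetical helper name.

```python
import math


def to_text(value):
    """Return a string for any cell value; missing values become ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# Hypothetical usage with the notebook's own objects:
# df["clean"] = df["content"].map(to_text).apply(preprocess)
print(repr(to_text(float("nan"))))  # ''
```

Empty strings pass through `preprocess` unchanged, so rows with missing content end up with an empty `clean` value instead of stopping the run.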


**5-Topic Modeling (LDA)**

In [14]:
vectorizer_lda = TfidfVectorizer(max_features=3000)
X = vectorizer_lda.fit_transform(df["clean"])

lda = LatentDirichletAllocation(n_components=6, random_state=42)
lda.fit(X)

def show_topics(model, vectorizer, n_words=10):
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(model.components_):
        print(f"\nTopic #{i+1}:")
        # argsort is ascending, so the strongest term is printed last.
        top_idx = topic.argsort()[-n_words:]
        print(" ".join(words[j] for j in top_idx))

show_topics(lda, vectorizer_lda)



Topic #1:
state weekend happening election best actor award performance major new

Topic #2:
state weekend happening election best actor award performance major new

Topic #3:
stock company released smartphone tech happening election several state weekend

Topic #4:
warn flu season doctor case rising government policy passed today

Topic #5:
major upset victory basketball celebrate fan final football match team

Topic #6:
study transforming technology ai modern advance award performance best actor
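
The topic lines above list terms in ascending weight, because `argsort()` sorts ascending and the slice keeps the tail. If a strongest-first listing is preferred, a minimal sketch could look like this. `top_words` is a hypothetical helper, and the toy weights are made up for illustration.

```python
def top_words(weights, vocabulary, n_words=10):
    """Return the n_words highest-weighted terms, strongest first."""
    ranked = sorted(range(len(weights)), key=lambda idx: weights[idx], reverse=True)
    return [vocabulary[idx] for idx in ranked[:n_words]]


# Toy example (weights are illustrative, not taken from the fitted model)
vocab = ["team", "match", "policy", "award"]
weights = [0.9, 0.7, 0.1, 0.3]
print(top_words(weights, vocab, n_words=3))  # ['team', 'match', 'award']

# Hypothetical usage with the notebook's objects:
# words = vectorizer_lda.get_feature_names_out()
# for topic in lda.components_:
#     print(" ".join(top_words(topic, words)))
```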


**6-Text Summarization (Transformers)**

In [15]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text):
    result = summarizer(text, max_length=120, min_length=40, do_sample=False)
    return result[0]['summary_text']


Device set to use cpu


In [16]:
sample = df["content"].iloc[0]
print("Original:", sample[:400])
print("\nSummary:", summarize(sample))


Your max_length is set to 120, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


Original: The government passed a new policy today.

Summary: The government passed a new policy today. The new policy is aimed at improving the quality of life in the country. The policy was passed by the House of Lords on Tuesday. The bill will now go to the Senate for further consideration.
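
The warning above is worth taking seriously. With `min_length=40` and a 10-token input, the model has to generate text beyond the source, and the summary adds details (the House of Lords, the Senate) that the original sentence does not contain. One possible mitigation, sketched under assumptions, is to skip very short inputs and scale the length bounds to the input. The helper names, the 30-token threshold, and the 0.5 ratio are all illustrative choices, not values from the notebook.

```python
def summary_bounds(n_tokens, max_cap=120, min_floor=5, ratio=0.5):
    """Pick (min_length, max_length) proportional to the input length."""
    max_length = max(min_floor, min(max_cap, int(n_tokens * ratio)))
    min_length = max(1, min(40, max_length // 2))
    return min_length, max_length


def should_summarize(n_tokens, threshold=30):
    """Assumed rule of thumb: inputs below the threshold are returned as-is."""
    return n_tokens >= threshold


print(summary_bounds(10))    # (2, 5)
print(summary_bounds(1000))  # (40, 120)

# Hypothetical use inside summarize(text):
# n_tokens = len(summarizer.tokenizer(text)["input_ids"])
# if not should_summarize(n_tokens):
#     return text
# min_len, max_len = summary_bounds(n_tokens)
# return summarizer(text, min_length=min_len, max_length=max_len,
#                   do_sample=False)[0]["summary_text"]
```

For the one-sentence articles in this dataset, `should_summarize` would return `False`, so the original text would be shown instead of a generated expansion.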


**7-Semantic Search (Embedding Model)**

In [17]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

corpus = df["clean"].tolist()
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

def semantic_search(query, top_k=5):
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=top_k)[0]
    results = []
    for h in hits:
        results.append(df.iloc[h['corpus_id']]["content"])
    return results



In [18]:
semantic_search("ai innovation")


['Advances in AI are transforming modern technology.',
 'A major tech company released a new smartphone.',
 'The government passed a new policy today.',
 'The stock market saw significant growth this week.',
 'The actor won an award for best performance.']
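
`semantic_search` always returns `top_k` results, so weak matches such as the policy and award articles appear for an AI query. Each hit returned by `util.semantic_search` is a dict with `corpus_id` and `score` keys, so one option is to drop hits below a similarity cutoff. The sketch below assumes a cutoff of 0.3, which would need tuning on real data; `filter_hits` is a hypothetical helper and the example scores are made up.

```python
def filter_hits(hits, min_score=0.3):
    """Keep hits whose similarity score reaches min_score, best first."""
    kept = [h for h in hits if h["score"] >= min_score]
    return sorted(kept, key=lambda h: h["score"], reverse=True)


# Made-up scores for illustration only
example_hits = [
    {"corpus_id": 5, "score": 0.62},
    {"corpus_id": 2, "score": 0.21},
    {"corpus_id": 0, "score": 0.08},
]
print([h["corpus_id"] for h in filter_hits(example_hits)])  # [5]

# Hypothetical use inside semantic_search():
# hits = filter_hits(util.semantic_search(query_emb, corpus_embeddings, top_k=top_k)[0])
```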

**8-Multilingual Detection & Translation**

In [19]:
translator = Translator()

def detect_language(text):
    return translator.detect(text).lang

def translate_to_english(text):
    if detect_language(text) != "en":
        return translator.translate(text, dest="en").text
    return text


In [20]:
translate_to_english("La technologie avance rapidement.")


'Technology advances rapidly.'
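
`translate_to_english` makes two network calls per query (`detect`, then `translate`) and raises if the service is unreachable. A possible variant, sketched here, uses a single `translate` call and reads the detected source language from the result's `src` attribute, falling back to the original text on any error. `translate_or_passthrough` is a hypothetical name, and the stub classes exist only so the example runs offline.

```python
def translate_or_passthrough(text, translator, dest="en"):
    """Translate text to dest; return it unchanged if already in dest or on failure."""
    try:
        result = translator.translate(text, dest=dest)
    except Exception:
        # Network or parsing errors: fall back to the untranslated text.
        return text
    # The result object reports the detected source language as .src
    if result.src == dest:
        return text
    return result.text


# Offline stand-ins for googletrans objects (illustration only)
class FakeResult:
    def __init__(self, src, text):
        self.src = src
        self.text = text


class FakeTranslator:
    def translate(self, text, dest="en"):
        return FakeResult("fr", "Technology advances rapidly.")


print(translate_or_passthrough("La technologie avance rapidement.", FakeTranslator()))
```

With the notebook's objects the call would be `translate_or_passthrough(query, translator)`.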

**9-Conversational Query Engine**

In [21]:
def newsbot_query(query):
    query_en = translate_to_english(query)
    results = semantic_search(query_en)
    summaries = [summarize(r) for r in results]
    return summaries
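
`newsbot_query` returns only the summaries, which makes it hard to check each one against the article it came from. A minimal sketch of an alternative that keeps the two together follows. `answer_query` is a hypothetical name, and the three pipeline steps are passed in as callables so the example runs without the models.

```python
def answer_query(query, translate, search, summarize):
    """Run translate -> search -> summarize, keeping each article with its summary."""
    query_en = translate(query)
    return [
        {"article": article, "summary": summarize(article)}
        for article in search(query_en)
    ]


# Stand-in callables for illustration only
demo = answer_query(
    "stocks",
    translate=lambda q: q,
    search=lambda q: ["The stock market saw significant growth this week."],
    summarize=lambda text: text,
)
print(demo[0]["article"])

# Hypothetical use with the notebook's functions:
# answer_query(query, translate_to_english, semantic_search, summarize)
```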


In [22]:
newsbot_query("Find business articles about stocks")


Your max_length is set to 120, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)

['The stock market saw significant growth this week. The Dow Jones Industrial Average and the S&P 500 were both up more than 1%. The Nasdaq was up about 1.5%. The NASDAQ was down about 0.7%.',
 "A major tech company released a new smartphone. Here's what you need to know about the new iPhone 7 Plus. Here are some tips on how to get the most out of your phone. Here is a look at how to use the new phone.",
 '.com will feature iReporter photos in a weekly Travel Snapshots gallery. Visit CNN.com/Travel each week for a new gallery of snapshots from around the world. Visit iReport.com for more travel snapshots.',
 'Doctors warn about rising flu cases this season. Doctors warn about rise in flu cases in the U.S. and around the world. Doctors say they are seeing an increase in the number of people with the flu.',
 "Basketball fans celebrate a major upset victory over the Knicks. The game was played in New York's Madison Square Garden. It was the first time the Knicks had won a game in the city

**10-System Integration Demo**

In [23]:
def full_demo(text_query):
    print("🔍 USER QUERY:", text_query)

    translated = translate_to_english(text_query)
    print("\n🌐 English Version:", translated)

    print("\n📌 Top matching articles:")
    matches = semantic_search(translated)

    for i, article in enumerate(matches):
        print(f"\n--- Article #{i+1} ---")
        print(article[:400], "...")
        print("\n📝 Summary:")
        print(summarize(article))

    print("\n🎭 Sentiment of query:")
    print(TextBlob(translated).sentiment)

full_demo("Muéstrame noticias sobre inteligencia artificial en tecnología")


🔍 USER QUERY: Muéstrame noticias sobre inteligencia artificial en tecnología


Your max_length is set to 120, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)



🌐 English Version: Show me news about artificial intelligence in technology

📌 Top matching articles:

--- Article #1 ---
Advances in AI are transforming modern technology. ...

📝 Summary:
.com will feature iReporter photos in a weekly Travel Snapshots gallery. Visit CNN.com/Travel each week for a new gallery of snapshots from around the world. Visit iReport.com for more travel snapshots.

--- Article #2 ---
A major tech company released a new smartphone. ...

📝 Summary:
A major tech company released a new smartphone. Here's what you need to know about the new iPhone 7 Plus. Here are some tips on how to get the most out of your phone. Here is a look at how to use the new phone.

--- Article #3 ---
The government passed a new policy today. ...

📝 Summary:
The government passed a new policy today. The new policy is aimed at improving the quality of life in the country. The policy was passed by the House of Lords on Tuesday. The bill will now go to the Senate for further consideration.

--- Article #4 ---
The stock market saw significant growth this week. ...

📝 Summary:
The stock market saw significant growth this week. The Dow Jones Industrial Average and the S&P 500 were both up more than 1%. The Nasdaq was up about 1.5%. The NASDAQ was down about 0.7%.

--- Article #5 ---
Doctors warn about rising flu cases this season. ...

📝 Summary:
Doctors warn about rising flu cases this season. Doctors warn about rise in flu cases in the U.S. and around the world. Doctors say they are seeing an increase in the number of people with the flu.

🎭 Sentiment of query:
Sentiment(polarity=-0.6, subjectivity=1.0)
