# Text Classification, Spam Detection, Topic Modeling, and Document Classification

## Text Classification

Text classification assigns predefined categories, or labels, to a piece of text based on its content. Typical inputs include news articles, emails, and customer reviews.

**Examples:**

### Sentiment Analysis

Classifying customer reviews as positive, negative, or neutral.

**Review:** "The product quality is amazing and I love it!"  
**Classified as:** Positive

### Email Filtering

Classifying emails as work-related, personal, promotional, etc.

**Email:** "Meeting at 3 PM tomorrow to discuss the project status."  
**Classified as:** Work-related

## Spam Detection

Spam detection is a binary text classification task: each email or message is labeled either spam (unwanted or irrelevant) or not spam, often called ham, so that spam can be filtered out.

**Examples:**

### Spam Filtering

Classifying emails as spam or not spam (ham).

**Email:** "Congratulations! You've won a $1000 gift card. Click here to claim."  
**Classified as:** Spam

**Email:** "Hi, can we reschedule our meeting to 3 PM?"  
**Classified as:** Not Spam

## Topic Modeling

Topic modeling is an unsupervised learning technique that discovers the abstract themes, or topics, running through a collection of documents. Unlike classification, it needs no labeled examples: the topics are inferred from which words tend to occur together.

**Examples:**

### Analyzing Research Papers

Identifying topics in a collection of research papers.

**Papers:** Collection of papers on AI, data science, and machine learning.  
**Topics:** AI Ethics, Deep Learning Techniques, Natural Language Processing, etc.

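**Latent Dirichlet Allocation (LDA):** A common topic modeling algorithm. It treats each document as a mixture of topics and each topic as a probability distribution over words.

**Document 1:** "Deep learning models have achieved state-of-the-art results in image classification."  
**Document 2:** "Natural language processing techniques are improving rapidly with new transformer models."  
**Identified Topics:** "Deep Learning" and "Natural Language Processing". LDA itself returns only weighted word lists; names like these are assigned by a person reading those lists.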

## Document Classification

Document classification is text classification applied to whole documents, which are often long: articles, reports, or legal documents. Each document is assigned, as a unit, to one of a set of predefined classes.

**Examples:**

### News Article Classification

Categorizing news articles into topics like sports, politics, technology, etc.

**Article:** "The new AI model developed by researchers shows promising results in various benchmarks."  
**Classified as:** Technology

### Legal Document Classification

Classifying legal documents as contracts, agreements, patents, etc.

**Document:** "This agreement is made between the two parties and outlines the terms and conditions of the partnership."  
**Classified as:** Agreement


In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data for training
reviews = ["I love this product!", "This is the worst thing I've ever bought.", "Absolutely fantastic! Highly recommend.",
           "Not good, very disappointing.", "Pretty good, but could be better.", "Terrible, I want a refund.",
           "This product is amazing!", "I hate it, very bad experience.", "It's okay, not great.", 
           "I am very satisfied with this purchase.", "Not worth the money.", "Highly satisfied with this product.",
           "It broke after one use, very poor quality.", "Best thing I ever bought!", "Waste of money, do not buy.",
           "Fantastic quality, exceeded my expectations."]

# Labels: positive, negative, or neutral
labels = ["positive", "negative", "positive", "negative", "neutral", "negative", "positive", "negative", 
          "neutral", "positive", "negative", "positive", "negative", "positive", "negative", "positive"]

# Create a model
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(reviews, labels)

# Predict the sentiment of new reviews
new_reviews = ["The product is okay, but I've seen better.", "Absolutely love it! Best purchase ever.", 
               "Would not recommend, waste of money."]
predictions = model.predict(new_reviews)

# Display the results
for review, prediction in zip(new_reviews, predictions):
    print(f'Review: "{review}" - Sentiment: {prediction}')


Review: "The product is okay, but I've seen better." - Sentiment: neutral
Review: "Absolutely love it! Best purchase ever." - Sentiment: positive
Review: "Would not recommend, waste of money." - Sentiment: negative
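
The cell above imports `train_test_split` and `accuracy_score` but trains on all 16 reviews and never measures accuracy. A minimal sketch of how those two helpers could be used follows. The 25% test size, the stratified split, and `random_state=0` are assumptions chosen for illustration. With this little data the resulting number is not a meaningful estimate of real-world performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same reviews and labels as in the cell above
reviews = ["I love this product!", "This is the worst thing I've ever bought.",
           "Absolutely fantastic! Highly recommend.", "Not good, very disappointing.",
           "Pretty good, but could be better.", "Terrible, I want a refund.",
           "This product is amazing!", "I hate it, very bad experience.",
           "It's okay, not great.", "I am very satisfied with this purchase.",
           "Not worth the money.", "Highly satisfied with this product.",
           "It broke after one use, very poor quality.", "Best thing I ever bought!",
           "Waste of money, do not buy.", "Fantastic quality, exceeded my expectations."]
labels = ["positive", "negative", "positive", "negative", "neutral", "negative",
          "positive", "negative", "neutral", "positive", "negative", "positive",
          "negative", "positive", "negative", "positive"]

# Hold out a quarter of the reviews; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0
)

# Train only on the training portion, then score on the held-out portion
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy on {len(X_test)} reviews: {accuracy:.2f}")
```

Scoring on reviews the model has not seen during training is what makes the accuracy figure informative; scoring on the training reviews themselves would overstate it.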


In [2]:
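from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample data for training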
emails = ["Congratulations, you have won a free lottery ticket!", 
          "Meeting rescheduled to 3 PM tomorrow.", 
          "You have been selected for a free vacation!", 
          "Don't forget to submit your report by end of day.", 
          "Claim your free prize now!", "This is not spam, it's a genuine email.", 
          "Your account needs verification, click here.", "Free coupons available, redeem now!", 
          "Please review the attached document.", "You are a lucky winner!"]

# Labels: spam or not spam
labels = ["spam", "not spam", "spam", "not spam", "spam", "not spam", "spam", "spam", "not spam", "spam"]

# Create a model
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(emails, labels)

# Predict the spam status of new emails
new_emails = ["Join us for the webinar on AI advancements.", "You have won a free cruise ticket!", 
              "Please review the attached document for tomorrow's meeting.", "Exclusive offer just for you!"]
predictions = model.predict(new_emails)

# Display the results
for email, prediction in zip(new_emails, predictions):
    print(f'Email: "{email}" - Status: {prediction}')


Email: "Join us for the webinar on AI advancements." - Status: spam
Email: "You have won a free cruise ticket!" - Status: spam
Email: "Please review the attached document for tomorrow's meeting." - Status: not spam
Email: "Exclusive offer just for you!" - Status: spam
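
In the output above, the webinar invitation is labeled spam even though it reads like a legitimate message. One way to investigate a prediction like this is to look at the class probabilities rather than only the final label. The sketch below retrains the same pipeline and prints the learned class priors and the per-class probabilities for that email. Most of the email's words never occur in the ten training emails, so a plausible (but unverified) explanation is that the prediction leans heavily on the prior, which favors spam because 6 of the 10 training emails are spam.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same training data as in the cell above
emails = ["Congratulations, you have won a free lottery ticket!",
          "Meeting rescheduled to 3 PM tomorrow.",
          "You have been selected for a free vacation!",
          "Don't forget to submit your report by end of day.",
          "Claim your free prize now!", "This is not spam, it's a genuine email.",
          "Your account needs verification, click here.", "Free coupons available, redeem now!",
          "Please review the attached document.", "You are a lucky winner!"]
labels = ["spam", "not spam", "spam", "not spam", "spam", "not spam",
          "spam", "spam", "not spam", "spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# make_pipeline names each step after its lowercased class name
nb = model.named_steps["multinomialnb"]
classes = list(nb.classes_)
priors = np.exp(nb.class_log_prior_)  # fraction of training emails per class
print("Class priors:", dict(zip(classes, priors.round(2))))

# Per-class probabilities for the email that was flagged above
new_emails = ["Join us for the webinar on AI advancements."]
probabilities = model.predict_proba(new_emails)
for email, probs in zip(new_emails, probabilities):
    print(f'Email: "{email}"')
    for cls, p in zip(classes, probs):
        print(f"  P({cls}) = {p:.2f}")
```

A probability close to the prior suggests the model had little evidence from the text itself, which is a hint that more (and more varied) training emails are needed.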


In [3]:
from gensim import corpora, models
from gensim.parsing.preprocessing import preprocess_string

# Sample documents for training
documents = ["Deep learning models have achieved state-of-the-art results in image classification.",
             "Natural language processing techniques are improving rapidly with new transformer models.",
             "AI ethics and responsible AI are becoming critical areas of research.",
             "Reinforcement learning has seen significant advancements in recent years.",
             "AI applications in healthcare are revolutionizing diagnostics and treatment.",
             "The use of AI in finance is creating new opportunities.",
             "Natural language processing is key to many AI applications.",
             "Ethical concerns in AI are being widely discussed.",
             "Recent advancements in reinforcement learning are impressive.",
             "Healthcare is being transformed by AI innovations."]

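# Preprocess the documents: preprocess_string lowercases, strips punctuation and
# numbers, removes stopwords and very short tokens, and stems each word
# (which is why the topics below contain stems such as "healthcar" and "languag")
texts = [preprocess_string(doc) for doc in documents]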

# Create a dictionary and a corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

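# Train an LDA model. Training is randomized, so topics can differ between runs;
# pass random_state=<int> to LdaModel for reproducible topics.
lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)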

# Print the topics
topics = lda_model.print_topics(num_words=4)
for topic in topics:
    print(topic)

# New documents for testing
new_documents = ["The new technology in healthcare is saving lives.",
                 "AI and machine learning are transforming industries.",
                 "Political debates are heating up over the new policies."]

# Preprocess the new documents
new_texts = [preprocess_string(doc) for doc in new_documents]

# Convert new documents to bag-of-words format
# (words that are not in the training dictionary are silently dropped)
new_corpus = [dictionary.doc2bow(text) for text in new_texts]

# Infer topics for the new documents
for doc, bow in zip(new_documents, new_corpus):
    topic_distribution = lda_model.get_document_topics(bow)
    print(f'Document: "{doc}" - Topic Distribution: {topic_distribution}')


(0, '0.074*"healthcar" + 0.072*"learn" + 0.072*"advanc" + 0.072*"recent"')
(1, '0.050*"new" + 0.050*"ethic" + 0.049*"area" + 0.049*"critic"')
(2, '0.056*"process" + 0.056*"natur" + 0.056*"languag" + 0.056*"model"')
Document: "The new technology in healthcare is saving lives." - Topic Distribution: [(0, 0.44974118), (1, 0.42265034), (2, 0.12760845)]
Document: "AI and machine learning are transforming industries." - Topic Distribution: [(0, 0.7641553), (1, 0.11201374), (2, 0.123830885)]
Document: "Political debates are heating up over the new policies." - Topic Distribution: [(0, 0.16774409), (1, 0.6407644), (2, 0.19149147)]
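
Each topic distribution above is a list of `(topic_id, probability)` pairs. If a single topic per document is needed, one simple option is to pick the pair with the highest probability. The sketch below does that in plain Python, using the numbers copied from the output above; the helper name `dominant_topic` is made up for this example and is not part of gensim.

```python
# Topic distributions copied from the output above: (topic_id, probability) pairs
distributions = {
    "The new technology in healthcare is saving lives.":
        [(0, 0.44974118), (1, 0.42265034), (2, 0.12760845)],
    "AI and machine learning are transforming industries.":
        [(0, 0.7641553), (1, 0.11201374), (2, 0.123830885)],
    "Political debates are heating up over the new policies.":
        [(0, 0.16774409), (1, 0.6407644), (2, 0.19149147)],
}


def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])


for doc, distribution in distributions.items():
    topic_id, probability = dominant_topic(distribution)
    print(f'Document: "{doc}" - Dominant topic: {topic_id} ({probability:.2f})')
```

Reducing a distribution to one topic discards information. For the healthcare document, topics 0 and 1 are nearly tied (about 0.45 versus 0.42), so the full distribution is the more honest summary.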


In [25]:
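from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample data for training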
articles = ["The new AI model developed by researchers shows promising results in various benchmarks.",
            "The sports team won their match with a last-minute goal.",
            "Political tensions are rising in the region due to the new policies.",
            "Recent advancements in technology are transforming industries.",
            "Health experts are emphasizing the importance of vaccination.",
            "The new tech startup is revolutionizing the industry.",
            "The athlete broke the world record in the recent competition.",
            "The government has announced new economic policies.",
            "Tech giants are investing heavily in AI research.",
            "New health guidelines have been issued for the pandemic."]

# Labels: technology, sports, politics, health
labels = ["technology", "sports", "politics", "technology", "health", "technology", "sports", "politics", "technology", "health"]

# Create a model
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(articles, labels)

# Predict the categories of new articles
new_articles = ["The latest advancements in AI are remarkable.", 
                "The team played an excellent match yesterday.",
                "Health officials are warning about a new virus outbreak.",
                "The US government has announced new education policies",
                "Technology is rapidly evolving with new innovations."]
predictions = model.predict(new_articles)

# Display the results
for article, prediction in zip(new_articles, predictions):
    print(f'Article: "{article}" - Category: {prediction}')


Article: "The latest advancements in AI are remarkable." - Category: technology
Article: "The team played an excellent match yesterday." - Category: sports
Article: "The US government has announced new education policies" - Category: politics
Article: "Technology is rapidly evolving with new innovations." - Category: technology
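
Document classification often deals with longer texts, where raw word counts grow with document length. A variation that is sometimes used in that setting swaps `CountVectorizer` for `TfidfVectorizer`, which down-weights words that appear in many documents and length-normalizes each document vector. The sketch below applies that swap to the same articles. It is an alternative to the pipeline above rather than the method used in this notebook, and on a dataset this small it is not expected to be clearly better.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same training data as in the cell above
articles = ["The new AI model developed by researchers shows promising results in various benchmarks.",
            "The sports team won their match with a last-minute goal.",
            "Political tensions are rising in the region due to the new policies.",
            "Recent advancements in technology are transforming industries.",
            "Health experts are emphasizing the importance of vaccination.",
            "The new tech startup is revolutionizing the industry.",
            "The athlete broke the world record in the recent competition.",
            "The government has announced new economic policies.",
            "Tech giants are investing heavily in AI research.",
            "New health guidelines have been issued for the pandemic."]
labels = ["technology", "sports", "politics", "technology", "health",
          "technology", "sports", "politics", "technology", "health"]

# TF-IDF features instead of raw counts; the classifier is unchanged
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(articles, labels)

new_articles = ["The latest advancements in AI are remarkable.",
                "The team played an excellent match yesterday.",
                "Health officials are warning about a new virus outbreak.",
                "The US government has announced new education policies",
                "Technology is rapidly evolving with new innovations."]
predictions = model.predict(new_articles)

for article, prediction in zip(new_articles, predictions):
    print(f'Article: "{article}" - Category: {prediction}')
```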
