#### Case Study: NLP Text Classification of News Articles using Scikit-Learn

* Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning categories or labels to text documents. It has applications in spam detection, sentiment analysis, news categorization, and more. In this project, we will build a text classification model to categorize news articles into different topics using Python and the Scikit-Learn library.

* We will use the 20 Newsgroups dataset, a popular benchmark for experimenting with text classification. The dataset comprises around 18,000 newsgroup posts (18,846 when the training and test subsets are loaded together, as they are here) on 20 different topics, ranging from sports and politics to science and technology.

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset (training and test subsets combined)
newsgroups = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

# Extract features and labels
X = newsgroups.data
y = newsgroups.target
target_names = newsgroups.target_names

# Split the dataset into training and testing sets,
# stratifying so each topic keeps the same share in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize TfidfVectorizer: drop English stop words and
# terms that appear in more than 70% of documents
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit on the training data only, then transform both sets
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Initialize the classifier
classifier = LogisticRegression(max_iter=1000, random_state=42)

# Train the classifier
classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test data
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}\n")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
```

```
Accuracy: 0.9228

Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.92      0.94      0.93       160
           comp.graphics       0.83      0.92      0.87       195
 comp.os.ms-windows.misc       0.91      0.86      0.88       197
comp.sys.ibm.pc.hardware       0.82      0.88      0.85       196
   comp.sys.mac.hardware       0.92      0.90      0.91       193
          comp.windows.x       0.92      0.93      0.92       198
            misc.forsale       0.85      0.87      0.86       195
               rec.autos       0.94      0.94      0.94       198
         rec.motorcycles       0.98      0.96      0.97       199
      rec.sport.baseball       0.96      0.97      0.97       199
        rec.sport.hockey       0.98      0.97      0.97       200
               sci.crypt       0.98      0.95      0.96       198
         sci.electronics       0.90      0.90      0.90       197
                 sci.med       0.9
```
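
The pipeline above delegates feature weighting to `TfidfVectorizer`. To make that step less of a black box, here is a minimal pure-Python sketch of the weighting scheme scikit-learn applies by default (raw term counts, smoothed IDF, L2 normalization). The function name `tfidf_vectors` and the toy sentences are hypothetical and for illustration only; this is not the library's implementation and it skips tokenization, stop-word removal, and `max_df` filtering.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (idf, vectors) for a list of already-tokenized documents.

    Illustrative only: mirrors scikit-learn's default formula
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, vectors


# Hypothetical toy corpus, not taken from 20 Newsgroups.
toy_docs = [
    "the goalie stopped the puck".split(),
    "the pitcher threw the ball".split(),
    "the team won the game".split(),
]
idf, vectors = tfidf_vectors(toy_docs)
for term in ("the", "puck"):
    print(f"idf[{term!r}] = {idf[term]:.3f}")
```

In this toy corpus "the" occurs in every document and so receives the minimum IDF of 1.0, while "puck" occurs in one document and receives a higher IDF. Because "the" is also repeated within each sentence, its final weight is still not negligible, which is one reason the pipeline additionally removes stop words and caps `max_df`.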

##### High-Performing Categories
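Several categories reach 0.95 or higher on both precision and recall in the report above, for example rec.motorcycles (0.98 / 0.96), rec.sport.baseball (0.96 / 0.97), rec.sport.hockey (0.98 / 0.97), and sci.crypt (0.98 / 0.95), suggesting that the model distinguishes these topics effectively.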
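
For readers who want to see where the per-category numbers come from, the following is a small sketch of how precision, recall, and F1 can be derived from true and predicted labels. The helper `per_class_metrics` and the toy labels are hypothetical; in the notebook itself, `classification_report` already performs this calculation.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from label pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical toy labels, not real model output.
toy_true = ["hockey", "hockey", "crypt", "crypt", "med", "med"]
toy_pred = ["hockey", "hockey", "crypt", "med", "med", "med"]

metrics = per_class_metrics(toy_true, toy_pred)
# Keep only categories above 0.9 on both precision and recall.
high_performers = [
    label for label, (p, r, _) in metrics.items() if p > 0.9 and r > 0.9
]
print(high_performers)
```

The same threshold filter could be applied to the real results by passing `y_test` and `y_pred` from the notebook, assuming the integer targets are mapped back to names through `target_names`.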

##### Confusion Between Categories
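The weakest rows shown are comp.sys.ibm.pc.hardware (precision 0.82), comp.graphics (precision 0.83), and misc.forsale (precision 0.85). Lower scores like these may indicate confusion between similar topics (e.g., 'comp.sys.mac.hardware' vs. 'comp.sys.ibm.pc.hardware'), plausibly because the computing groups share much of their vocabulary. The classification report alone cannot confirm which pairs are confused; that requires looking at the individual misclassifications.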
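
One way to check which topics are actually being mixed up is to count misclassified (true, predicted) pairs. The sketch below does this with a plain `Counter`; the helper name `most_confused_pairs` and the toy labels are hypothetical. `sklearn.metrics.confusion_matrix` gives the same counts in matrix form if a full table is preferred.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_n=3):
    """Return the top_n most frequent (true, predicted) error pairs."""
    errors = Counter(
        (truth, pred) for truth, pred in zip(y_true, y_pred) if truth != pred
    )
    return errors.most_common(top_n)


# Hypothetical toy labels, not real model output.
toy_true = ["mac", "mac", "mac", "pc", "pc", "pc", "autos"]
toy_pred = ["pc", "pc", "mac", "mac", "pc", "pc", "autos"]

pairs = most_confused_pairs(toy_true, toy_pred)
for (truth, pred), count in pairs:
    print(f"true={truth:<6} predicted={pred:<6} count={count}")
```

Applied to the notebook's `y_test` and `y_pred` (with indices mapped through `target_names`), this would show whether the Mac and PC hardware groups are in fact the dominant source of errors.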

##### Overall Performance
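An overall accuracy of about 92% is a strong result for a 20-class problem, where guessing uniformly at random would score about 5%. This run keeps each post's headers, footers, and quoted replies, however, and these contain strong topic cues. The scikit-learn documentation recommends passing remove=('headers', 'footers', 'quotes') to fetch_20newsgroups for a more realistic estimate, which typically lowers the score.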
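
To put an accuracy figure in context, it helps to compare it with a trivial baseline that always predicts the most common class. The following sketch uses a hypothetical, perfectly balanced set of 20 classes; the real dataset is only roughly balanced (supports in the report range from 160 to 200), so its baseline would differ slightly. The function names are illustrative.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical balanced labels: 20 classes with 5 examples each.
toy_true = [label for label in range(20) for _ in range(5)]
# Hypothetical predictions: correct except for the first 8 examples,
# which are shifted to the next class.
toy_pred = [(t + 1) % 20 if i < 8 else t for i, t in enumerate(toy_true)]

baseline = majority_baseline_accuracy(toy_true)
model_acc = accuracy(toy_true, toy_pred)
print(f"baseline={baseline:.2f} model={model_acc:.2f}")
```

The gap between the two numbers, rather than the raw accuracy alone, indicates how much the classifier has learned beyond the class distribution.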