
In [2]:
!pip install polars


In [46]:
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np

In [33]:
# Load the dataset
bunch = fetch_20newsgroups(subset='all', categories=['sci.space', 'comp.graphics'])

# Get the list of categories
categories = bunch.target_names

# Print the list of categories
print(categories)

# Create a dictionary with the data and target fields
data_dict = {
    'data': bunch.data,
    'target': bunch.target
}

# Create a dataframe from the dictionary
df = pl.from_dict(data_dict)

# Keep the first 1000 rows of the DataFrame
df = df.head(1000)

# Display the first 5 rows of the resulting DataFrame
print(df.head(5))

['comp.graphics', 'sci.space']
shape: (5, 2)
┌───────────────────────────────────┬────────┐
│ data                              ┆ target │
│ ---                               ┆ ---    │
│ str                               ┆ i64    │
╞═══════════════════════════════════╪════════╡
│ From: henry@zoo.toronto.edu (Hen… ┆ 1      │
│ From: leech@cs.unc.edu (Jon Leec… ┆ 1      │
│ From: jscotti@lpl.arizona.edu (J… ┆ 1      │
│ From: dchien@hougen.seas.ucla.ed… ┆ 1      │
│ From: robert@slipknot.rain.com (… ┆ 0      │
└───────────────────────────────────┴────────┘


## Split the data into training and test sets

In [34]:
# Store the labels as a one-dimensional NumPy array: y
y = df['target'].to_numpy()

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['data'], y, test_size=0.33, random_state=42)

## CountVectorizer

This technique counts how often each word occurs in a document and uses these counts as features for modeling. In other words, it creates a document-term matrix where each row represents a document and each column represents a word in the vocabulary. This approach is simple and effective for many text classification tasks, but it treats every word as equally informative, so a word that appears in almost every document carries as much weight as a distinctive one.

For tasks where word frequency alone is a good indicator of importance, such as spam detection or sentiment analysis, CountVectorizer may be sufficient.

In [35]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the training documents and transform them: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test documents with the vocabulary learned from the training set: count_test
count_test = count_vectorizer.transform(X_test)

print(count_vectorizer.get_feature_names_out()[:10])

['00' '000' '0000' '00000' '000005102000' '000021' '000050' '00041032'
 '0004136' '00043819']


## TfidfVectorizer

This technique builds the same document-term matrix as CountVectorizer, but it reweights each count by how rare the word is across the dataset. Each word receives a term frequency-inverse document frequency (TF-IDF) score: its frequency within the document, scaled down as the number of documents containing it grows. Words that occur frequently in a particular document but rarely in other documents have a higher TF-IDF score and are treated as more important for modeling.

TfidfVectorizer often performs better than CountVectorizer in text classification tasks because it down-weights words that appear in most documents, but this is not guaranteed: in the results below, the count-based features score slightly higher (0.979 vs. 0.970 accuracy).

In [36]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Transform the training data: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])

# Print the first 5 rows as a dense array
print(tfidf_train[:5].toarray())

['00' '000' '0000' '00000' '000005102000' '000021' '000050' '00041032'
 '0004136' '00043819']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Training and testing a classification model

### Multinomial Naive Bayes (MNB)

Multinomial Naive Bayes (MNB) is a popular algorithm for text classification tasks. It is a probabilistic model that uses Bayes' theorem to calculate the probability of a document belonging to a certain class based on the words in it.
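
As a rough illustration of that calculation, the sketch below scores one new document by hand: for each class it adds the log prior to the sum of word counts times per-word log probabilities, then picks the class with the highest score. The toy documents, the labels (1 for a space-like class, 0 for a graphics-like class), and the names `train_docs`, `vec`, `clf`, and `manual_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = space-like, 0 = graphics-like
train_docs = [
    "rocket launch orbit",
    "orbit satellite launch",
    "image pixel render",
    "render image polygon",
]
train_labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_labels)

# Word counts for a new document (words outside the vocabulary are ignored)
new_doc = vec.transform(["satellite image of a rocket launch"]).toarray()

# log P(class) + sum over words of count * log P(word | class)
manual_scores = clf.class_log_prior_ + new_doc @ clf.feature_log_prob_.T
manual_pred = clf.classes_[np.argmax(manual_scores, axis=1)]

print(manual_pred, clf.predict(new_doc))
```

The hand-computed prediction should agree with `clf.predict`, since the classifier picks the class with the largest joint log likelihood.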

In [37]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [43]:
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the count-based training features
nb_classifier.fit(count_train, y_train.ravel())

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: acc
acc = metrics.accuracy_score(y_test, pred)
print(acc)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred)
print(cm)

0.9787878787878788
[[161   2]
 [  5 162]]


In [44]:
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the TF-IDF training features
nb_classifier.fit(tfidf_train, y_train.ravel())

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: acc
acc = metrics.accuracy_score(y_test, pred)
print(acc)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred)
print(cm)

0.9696969696969697
[[156   7]
 [  3 164]]


## Improving the model

alpha is a smoothing parameter that is added to the count of each feature (word) within each class. The purpose of smoothing is to avoid zero probabilities: Naive Bayes multiplies the per-word probabilities together, so a single word that never appears in a class's training documents would drive that class's probability for the whole document to zero. A non-zero alpha value ensures that every feature has a non-zero probability in every class.
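
The effect of alpha can be reproduced by hand. The sketch below uses a tiny made-up count matrix (the names `X`, `y`, `clf`, and `manual` are illustrative) and computes the smoothed probability of each word in class 0 as `(count + alpha) / (total count + alpha * number of features)`, then compares it with the value the fitted classifier stores.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 3 documents (rows) over a 3-word vocabulary (columns)
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 0, 3]])
y = np.array([0, 0, 1])

alpha = 0.1
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Total count of each word in class 0; the third word never occurs there
class0_counts = X[y == 0].sum(axis=0)

# Additive smoothing: add alpha to every count, then renormalize
manual = (class0_counts + alpha) / (class0_counts.sum() + alpha * X.shape[1])

print(manual)
print(np.exp(clf.feature_log_prob_[0]))
```

The third word has a count of zero in class 0, yet its smoothed probability is small but positive, so a document containing it is not ruled out for that class.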

In [50]:
# Create the list of alphas: alphas (alpha=0 means no smoothing at all)
alphas = np.arange(0, 1, 0.1)

# Define train_predict()
def train_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha, force_alpha=True)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train.ravel())
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_predict(alpha))
    print()

Alpha:  0.0
Score:  0.6515151515151515

Alpha:  0.1
Score:  0.9818181818181818

Alpha:  0.2
Score:  0.9757575757575757

Alpha:  0.30000000000000004
Score:  0.9696969696969697

Alpha:  0.4
Score:  0.9696969696969697

Alpha:  0.5
Score:  0.9696969696969697

Alpha:  0.6000000000000001
Score:  0.9696969696969697

Alpha:  0.7000000000000001
Score:  0.9696969696969697

Alpha:  0.8
Score:  0.9696969696969697

Alpha:  0.9
Score:  0.9696969696969697




In [58]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Pair each feature with its log probability under the first class and sort ascending: feat_with_weights
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[0], feature_names))

# Print the first class label and its 5 least likely features (lowest log probabilities)
print(class_labels[0], feat_with_weights[:5])

# Print the same class label and its 5 most likely features (highest log probabilities)
print(class_labels[0], feat_with_weights[-5:])

0 [(-9.96182570161473, '00000'), (-9.96182570161473, '000021'), (-9.96182570161473, '000050'), (-9.96182570161473, '00041032'), (-9.96182570161473, '0004136')]
