- Load and understand the **20 newsgroups** dataset.
- Use clustering to find the group of each document, but this time use a train/test split to evaluate the results.
- Compare the results with two different vectorization approaches: CountVectorizer and TfidfVectorizer.
- Compare the results with another clustering model called LDA: https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
- Compare your clustering predictions with a classic classification approach. Which one works better? Why?
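
The second task asks for an evaluation that uses the train/test split. A minimal sketch of that idea follows: the vectorizer and KMeans are fitted on the training documents only, and ARI is then computed on held-out documents. The toy corpus, its labels and the variable names are assumptions made for illustration; they are not the 20 newsgroups data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus with two obvious topics (not the real dataset).
docs = [
    "the team won the hockey game", "hockey players skate on ice",
    "the goalie saved the hockey puck", "ice hockey season starts soon",
    "the gpu renders graphics fast", "graphics card and gpu drivers",
    "new gpu for 3d graphics", "graphics software uses the gpu",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels
)

# Fit the vocabulary and the clusters on the training documents only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X_train)

# Assign held-out documents to the nearest learned centroid and score them.
test_clusters = kmeans.predict(X_test)
ari_test = adjusted_rand_score(y_test, test_clusters)
print(f"Held-out ARI: {ari_test:.3f}")
```

The same pattern should carry over to the real data by swapping in the 20 newsgroups documents and 20 clusters.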

In [10]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# CountVectorizer: raw term counts (vocabulary fitted on the training set only)
count_vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# TfidfVectorizer: TF-IDF weighted terms (vocabulary fitted on the training set only)
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

n_clusters = len(newsgroups.target_names)

kmeans_count = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans_tfidf = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)

y_pred_count = kmeans_count.fit_predict(X_train_count)
y_pred_tfidf = kmeans_tfidf.fit_predict(X_train_tfidf)

ari_count = adjusted_rand_score(y_train, y_pred_count)
ari_tfidf = adjusted_rand_score(y_train, y_pred_tfidf)

print(f"ARI CountVectorizer: {ari_count}")
print(f"ARI Tf-IDF Vectorizer: {ari_tfidf}")

# ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
# Note: the scores above are computed on the training documents, not on the test split.

ARI CountVectorizer: 4.423780062323558e-05
ARI Tf-IDF Vectorizer: 0.0783284740004759
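
To make the ARI formula in the comment above concrete, here is a small pure-Python sketch of its usual pair-counting form. The function name `adjusted_rand_index` is hypothetical and this is not scikit-learn's implementation; it only illustrates why a score near 0 means "no better than chance" and why cluster ids do not need to match label ids.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts: (index - expected) / (max - expected)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare

    # Pairs grouped together in both labelings, in the true one, in the predicted one.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Same partition with swapped ids: perfect agreement.
print(f"{adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]):.2f}")  # 1.00
# Every true group split evenly across the predicted clusters: worse than chance.
print(f"{adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]):.2f}")  # -0.50
```

Read this way, the CountVectorizer score of about 0.00004 indicates that those KMeans clusters are essentially unrelated to the newsgroups.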


In [11]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = len(newsgroups.target_names)  # one topic per newsgroup

lda_count = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_tfidf = LatentDirichletAllocation(n_components=n_topics, random_state=42)

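# Note: LDA models word counts, so the TF-IDF variant is included only for comparison.
X_train_lda_count = lda_count.fit_transform(X_train_count)
X_train_lda_tfidf = lda_tfidf.fit_transform(X_train_tfidf)

# Assign each document to its most probable topic.
y_pred_lda_count = X_train_lda_count.argmax(axis=1)
y_pred_lda_tfidf = X_train_lda_tfidf.argmax(axis=1)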

ari_lda_count = adjusted_rand_score(y_train, y_pred_lda_count)
ari_lda_tfidf = adjusted_rand_score(y_train, y_pred_lda_tfidf)

print(f"ARI CountVectorizer: {ari_lda_count}")
print(f"ARI Tf-IDF Vectorizer: {ari_lda_tfidf}")

ARI CountVectorizer: 0.17751061161166298
ARI Tf-IDF Vectorizer: 0.09613244948287275
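
The LDA scores above are measured on the training documents. As a hedged sketch of how held-out documents could be assigned to topics, the snippet below fits LDA on a hypothetical toy training set and calls `transform` on unseen documents. The documents and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (not the real dataset).
train_docs = [
    "hockey game on ice", "hockey team wins game", "ice hockey players",
    "gpu graphics card", "graphics drivers for gpu", "gpu renders graphics",
]
test_docs = ["hockey players on ice", "new graphics gpu"]

# LDA is fitted on raw counts; the vocabulary comes from the training set only.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# transform() returns one topic distribution per document (rows sum to 1).
doc_topic = lda.transform(X_test)
test_topics = doc_topic.argmax(axis=1)
print(np.round(doc_topic, 2))
print(test_topics)
```

The resulting `test_topics` can then be scored against the test labels with `adjusted_rand_score`, in the same way as the training predictions.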


In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# CountVectorizer
nb_count = MultinomialNB()
nb_count.fit(X_train_count, y_train)
y_pred_count_nb = nb_count.predict(X_test_count)

# Tf-IDF Vectorizer
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf_nb = nb_tfidf.predict(X_test_tfidf)

accuracy_count = accuracy_score(y_test, y_pred_count_nb)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf_nb)

classification_report_count = classification_report(y_test, y_pred_count_nb, target_names=newsgroups.target_names)
classification_report_tfidf = classification_report(y_test, y_pred_tfidf_nb, target_names=newsgroups.target_names)

print("Accuracy CountVectorizer Naive Bayes:", accuracy_count)
print("Accuracy Tf-IDF Naive Bayes:", accuracy_tfidf)

print("Classification Report CountVectorizer Naive Bayes:")
print(classification_report_count)

print("Classification Report Tf-IDF Naive Bayes:")
print(classification_report_tfidf)

Accuracy CountVectorizer Naive Bayes: 0.6363395225464191
Accuracy Tf-IDF Naive Bayes: 0.6835543766578249
Classification Report CountVectorizer Naive Bayes:
                          precision    recall  f1-score   support

             alt.atheism       0.47      0.61      0.53       151
           comp.graphics       0.53      0.64      0.58       202
 comp.os.ms-windows.misc       0.78      0.04      0.07       195
comp.sys.ibm.pc.hardware       0.44      0.69      0.54       183
   comp.sys.mac.hardware       0.61      0.65      0.63       205
          comp.windows.x       0.70      0.74      0.72       215
            misc.forsale       0.80      0.63      0.71       193
               rec.autos       0.66      0.75      0.70       196
         rec.motorcycles       0.37      0.70      0.48       168
      rec.sport.baseball       0.77      0.78      0.78       211
        rec.sport.hockey       0.94      0.74      0.83       198
               sci.crypt       0.85      0.67      

Which one works better? Why?

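MultinomialNB seems to work better because it has the list of newsgroups: it is trained on the true labels, whereas KMeans and LDA never see them. Note that ARI and accuracy are different metrics, so the comparison is qualitative.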
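
Since ARI and accuracy are on different scales, one possible way to put clustering and classification side by side is to map each cluster to its majority label and then compute an accuracy. The sketch below uses hypothetical helper names (`majority_label_map`, `mapped_accuracy`) and made-up cluster ids. The mapping uses the training labels, so this is a diagnostic rather than a purely unsupervised score.

```python
from collections import Counter, defaultdict


def majority_label_map(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}


def mapped_accuracy(cluster_ids, true_labels, mapping):
    """Accuracy after translating cluster ids to labels; unmapped clusters count as errors."""
    hits = sum(mapping.get(c) == y for c, y in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Hypothetical cluster assignments and labels, for illustration only.
train_clusters = [2, 2, 2, 0, 0, 1, 1, 1]
train_labels = [7, 7, 3, 3, 3, 5, 5, 7]
mapping = majority_label_map(train_clusters, train_labels)
print(mapping)  # {2: 7, 0: 3, 1: 5}

test_clusters = [2, 0, 1, 1]
test_labels = [7, 3, 5, 3]
print(mapped_accuracy(test_clusters, test_labels, mapping))  # 0.75
```

With the notebook's variables, the mapping would be built from `y_pred_tfidf` and `y_train`, then applied to `kmeans_tfidf.predict(X_test_tfidf)` and compared with the Naive Bayes accuracy.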


conda create --solver=libmamba -n rapids-23.08 -c rapidsai -c conda-forge -c nvidia rapids=23.08 python=3.10 cuda-version=12.0