## Text Mining Lab - Kaichen MA

## Application to classification: opinion analysis

### Implementing the classifier

In [1]:
import os.path as op
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import *
import re

In [2]:
###############################################################################
# Load data
print("Loading dataset")

from glob import glob
filenames_neg = sorted(glob(op.join('.', 'data', 'imdb1', 'neg', '*.txt')))
filenames_pos = sorted(glob(op.join('.', 'data', 'imdb1', 'pos', '*.txt')))

texts_neg = [open(f).read() for f in filenames_neg]
texts_pos = [open(f).read() for f in filenames_pos]
texts = texts_neg + texts_pos
# Label 1 = positive, 0 = negative (np.int was removed from NumPy, so use the builtin int)
y = np.ones(len(texts), dtype=int)
y[:len(texts_neg)] = 0

print("%d documents" % len(texts))


Loading dataset
2000 documents


### Question 1: Complete the count_words function, which counts the number of occurrences of each word in a list of strings and returns the vocabulary.

In [3]:
def count_words(texts,stop_words=None):
    """
    Vectorize text : return count of each word in the text snippets

    Parameters
    ----------
    texts : list of str
        The texts
    stop_words : iterable of str, optional
        Words to ignore. If None, every word is kept.

    Returns
    -------
    vocabulary : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """

    # Treat "no stop words" as an empty set so one code path handles both cases
    stop_words = set() if stop_words is None else set(stop_words)

    # Tokenize each text once, dropping stop words
    tokenized = [[w for w in re.findall(r"\w+", text) if w not in stop_words]
                 for text in texts]

    # Assign one column index to each distinct word
    words = set()
    for tokens in tokenized:
        words.update(tokens)
    vocabulary = dict(zip(words, range(len(words))))

    # Count the occurrences of each word in each text
    counts = np.zeros((len(texts), len(words)), dtype=int)
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            counts[i, vocabulary[word]] += 1

    return vocabulary, counts

In [5]:
vocabulary, X = count_words(texts)
X.shape

(2000, 39696)
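
To make the two returned structures concrete, here is a minimal sketch on a hypothetical two-document corpus. It does not call the `count_words` function above: it uses `collections.Counter` and sorts the vocabulary so that the column order is reproducible, both of which are choices made only for this illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the IMDB data
docs = ["a good good film", "a bad film"]

# Tokenize with the same \w+ pattern as count_words
tokenized = [re.findall(r"\w+", d) for d in docs]

# Sorted vocabulary, so that column indices are deterministic
all_words = sorted({w for tokens in tokenized for w in tokens})
vocabulary = {w: i for i, w in enumerate(all_words)}

# One row per document, one column per vocabulary word
counts = [[0] * len(vocabulary) for _ in docs]
for row, tokens in zip(counts, tokenized):
    for word, n in Counter(tokens).items():
        row[vocabulary[word]] = n

print(vocabulary)  # {'a': 0, 'bad': 1, 'film': 2, 'good': 3}
print(counts)      # [[1, 0, 1, 2], [1, 1, 1, 0]]
```

Each row of `counts` is the bag-of-words vector of one document; on the real data this gives the 2000-row matrix shown above.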

### Question 2: Explain how the positive and negative classes were assigned to the movie reviews

Reviews were labeled positive or negative from an explicit rating found in the text, such as a fraction (e.g. 8/10), a number of stars (e.g. 4/5), or a letter grade (A, B, C, D...).

For a rating out of 5, documents rated 3.5 or higher are considered positive and those rated 2 or lower are considered negative.

For a four-star system, ratings of 3 or higher are classified as positive, and ratings of 1.5 or lower are considered negative.

For a letter-grade system, B or above is marked positive, and C or below is classified as negative.
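
The thresholds above can be sketched as a small labeling function. This is an illustration only, not the script used to build the dataset: the function name, the `scale` argument, the inclusive treatment of the boundaries, and the restriction to plain letters A to D are all assumptions.

```python
def label_from_rating(value, scale):
    """Return 1 (positive), 0 (negative) or None (review not used).

    scale is 5 or 4 for star systems, or "letter" for grades A-D.
    Hypothetical helper: thresholds follow the rules described above.
    """
    if scale == 5:
        if value >= 3.5:
            return 1
        if value <= 2:
            return 0
    elif scale == 4:
        if value >= 3:
            return 1
        if value <= 1.5:
            return 0
    elif scale == "letter":
        if value in ("A", "B"):
            return 1
        if value in ("C", "D"):
            return 0
    # Ratings in the middle of the scale carry no clear polarity
    return None


print(label_from_rating(4, 5))           # 1
print(label_from_rating(2, 4))           # None
print(label_from_rating("D", "letter"))  # 0
```

Ratings that fall between the two thresholds are left unlabeled, which is why the resulting corpus contains only clearly positive and clearly negative reviews.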

### Question 3: Complete the NB class so that it implements the Naive Bayes classifier

In [6]:
class NB(BaseEstimator, ClassifierMixin):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        pass

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        self.log_prior_ = np.zeros(n_classes)
        self.log_condprob_ = np.zeros((n_features, n_classes))

        # Index columns by position k, not by label value, so any labels work
        for k, c in enumerate(self.classes_):
            X_class = X[y == c]
            # Prior P(c): fraction of training documents in class c
            self.log_prior_[k] = np.log(X_class.shape[0] / n_samples)
            # Word counts in class c, with add-one smoothing
            tct = X_class.sum(axis=0) + 1
            self.log_condprob_[:, k] = np.log(tct / tct.sum())
        return self

    def predict(self, X):
        # scores[i, k] = log P(c_k) + sum_t X[i, t] * log P(t | c_k)
        # The prior is added to every score (the earlier loop overwrote it)
        scores = X @ self.log_condprob_ + self.log_prior_
        # Map the best column back to the original class label
        return self.classes_[np.argmax(scores, axis=1)]

    def score(self, X, y):
        return np.mean(self.predict(X) == y)


In [7]:
clf_nb = NB()
clf_nb.fit(X[::2], y[::2])
print('Classification score with NB class is: %s' % (clf_nb.score(X[1::2], y[1::2])))

Classification score with NB class is: 0.812
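
For reference, the multinomial Naive Bayes decision rule that the class is meant to implement can be written as follows. The notation is introduced here and is not part of the original assignment: $N_c$ is the number of training documents in class $c$, $N$ the total number of training documents, $T_{ct}$ the number of occurrences of word $t$ in the training documents of class $c$, and $x_t$ the count of word $t$ in the document to classify.

$$
P(c) = \frac{N_c}{N}, \qquad
P(t \mid c) = \frac{T_{ct} + 1}{\sum_{t'} \left( T_{ct'} + 1 \right)}, \qquad
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{t} x_t \log P(t \mid c) \right]
$$

The $+1$ is the add-one smoothing applied in `fit`. Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.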


### Question 4: Evaluate the performance of your classifier with 5-fold cross-validation

In [8]:
from sklearn.model_selection import cross_val_score

In [9]:
score_nb = cross_val_score(clf_nb, X, y, cv = 5, n_jobs=-1)
for i, s in enumerate(score_nb):
    print("The score of cross validation fold", i + 1, "is:", s)

print("The score of the NB class with 5-fold cross-validation is:", score_nb.mean())

The score of cross validation fold 1 is: 0.81
The score of cross validation fold 2 is: 0.825
The score of cross validation fold 3 is: 0.8125
The score of cross validation fold 4 is: 0.83
The score of cross validation fold 5 is: 0.7925
The score of the NB class with 5-fold cross-validation is: 0.814
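
With an integer `cv` and a classifier, `cross_val_score` uses stratified folds, so each fold keeps the same class proportions as the full dataset. The sketch below illustrates that idea in pure Python with a simple round-robin assignment. It is a simplification for illustration, not scikit-learn's actual algorithm, and `stratified_folds` and `y_toy` are hypothetical names.

```python
from collections import Counter


def stratified_folds(y, k):
    """Split indices 0..len(y)-1 into k test folds with balanced classes.

    Simplified round-robin assignment; not scikit-learn's algorithm.
    """
    folds = [[] for _ in range(k)]
    seen = Counter()  # number of samples of each class already placed
    for idx, label in enumerate(y):
        folds[seen[label] % k].append(idx)
        seen[label] += 1
    return folds


# Hypothetical labels: 10 negative documents followed by 10 positive ones
y_toy = [0] * 10 + [1] * 10
folds = stratified_folds(y_toy, 5)
for fold in folds:
    # Each fold holds 2 documents of each class
    print(fold, dict(Counter(y_toy[i] for i in fold)))
```

Each fold serves once as the test set while the other four are used for training, which gives the five scores printed above.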


### Question 5: Modify the count_words function so that it ignores stop words

In [10]:
# Split on any whitespace and store as a set for fast membership tests
stopWords = set(open('./data/english.stop').read().split())

In [12]:
vocabulary_reduced, X_reduced = count_words(texts,stop_words=stopWords)
X_reduced.shape

(2000, 39195)

In [13]:
clf_nb = NB()
clf_nb.fit(X_reduced[::2], y[::2])
print('Classification score with NB class without stop words is: %s' % (clf_nb.score(X_reduced[1::2], y[1::2])))

Classification score with NB class without stop words is: 0.804


The performance of our NB classifier drops slightly (from 0.812 to 0.804) once stop words are removed. The stop-word file turns out to include some sentiment-bearing words such as awfully, best, better, clearly and definitely. Removing them may make the reviews harder to identify as negative or positive.
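
A minimal sketch of this effect, using a hypothetical excerpt of a stop list (the real `english.stop` file is much longer) and a made-up review sentence:

```python
import re

# Hypothetical excerpt of a stop list that, like the file discussed above,
# contains sentiment-bearing words such as "best" and "awfully"
stop_words = {"the", "is", "of", "this", "best", "awfully"}

# Made-up review sentence, not taken from the dataset
review = "this is the best film of the year"

tokens = re.findall(r"\w+", review)
kept = [t for t in tokens if t not in stop_words]

print(kept)  # ['film', 'year']
```

The only word carrying the opinion ("best") is discarded along with the function words, so the remaining tokens no longer indicate the polarity of the sentence.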

## Using scikit-learn

### Question 1: Compare your implementation with scikit-learn

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [15]:
# Character unigrams: analyzer='char' for CountVectorizer
pipe_NB = Pipeline([('countVectorizer',CountVectorizer(analyzer='char')),\
                    ('clf',MultinomialNB())])
pipe_NB.fit(texts[::2],y[::2])
print("Classification score with NB class and analyzer='char' is :", \
      pipe_NB.score(texts[1::2],y[1::2]))

Classification score with NB class and analyzer='char' is : 0.606


In [16]:
# Character unigrams and bigrams: analyzer='char', ngram_range=(1,2)
pipe_NB = Pipeline([('countVectorizer',CountVectorizer(analyzer='char', ngram_range=(1,2))),\
                    ('clf',MultinomialNB())])
pipe_NB.fit(texts[::2],y[::2])
print("Classification score with NB class and analyzer='char', ngram_range=(1,2) is :", \
      pipe_NB.score(texts[1::2],y[1::2]))

Classification score with NB class and analyzer='char', ngram_range=(1,2) is : 0.658


In [17]:
# Set analyzer = 'word' for CountVectorizer
pipe_NB = Pipeline([('countVectorizer',CountVectorizer(analyzer='word')),\
                    ('clf',MultinomialNB())])
pipe_NB.fit(texts[::2],y[::2])
print("Classification score with NB class and analyzer='word' is :",\
      pipe_NB.score(texts[1::2],y[1::2]))

Classification score with NB class and analyzer='word' is : 0.813


In [18]:
# Word unigrams and bigrams: analyzer='word', ngram_range=(1,2)
pipe_NB = Pipeline([('countVectorizer',CountVectorizer(analyzer='word',ngram_range=(1,2))),\
                    ('clf',MultinomialNB())])
pipe_NB.fit(texts[::2],y[::2])
print("Classification score with NB class and analyzer='word', ngram_range=(1,2) is :", \
      pipe_NB.score(texts[1::2],y[1::2]))

Classification score with NB class and analyzer='word', ngram_range=(1,2) is : 0.841


In the experiments above, the CountVectorizer "word" analyzer performs far better than the "char" analyzer (0.813 vs 0.606 with unigrams only). In addition, allowing bigrams as well as unigrams (ngram_range=(1,2)) improves the classifier for both analyzers. The best result is Naive Bayes with word unigrams and bigrams, which reaches 0.841.

Comparing the scikit-learn classifier with the one written for this lab, the two scores are very close (0.813 vs 0.812).
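
To see why the two analyzers behave so differently, the sketch below lists the features that word and character n-grams produce for a short phrase. It is a pure-Python illustration with a hypothetical `ngrams` helper. It does not reproduce CountVectorizer's preprocessing (lowercasing, token pattern) or its feature ordering.

```python
def ngrams(items, n_min, n_max):
    """Return all contiguous n-grams of items for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(items) - n + 1):
            out.append(tuple(items[i:i + n]))
    return out


text = "not good"

# Word-level unigrams and bigrams
word_features = [" ".join(g) for g in ngrams(text.split(), 1, 2)]
# Character-level unigrams and bigrams (the space counts as a character)
char_features = ["".join(g) for g in ngrams(list(text), 1, 2)]

print(word_features)       # ['not', 'good', 'not good']
print(len(char_features))  # 15
```

The word bigram "not good" captures a negation that the unigrams "not" and "good" miss, while single characters and character pairs carry very little information about sentiment.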


### Question 2: Test another algorithm from the scikit-learn library

In [19]:
from sklearn.svm import LinearSVC
pipe_SVC = Pipeline([('countVectorizer',CountVectorizer()),\
                    ('clf',LinearSVC())])
pipe_SVC.fit(texts[::2],y[::2])
print("Classification score with LinearSVC is :", \
      pipe_SVC.score(texts[1::2],y[1::2]))

Classification score with LinearSVC is : 0.81


In [20]:
from sklearn.svm import LinearSVC
pipe_SVC = Pipeline([('countVectorizer',CountVectorizer(ngram_range=(1,2))),\
                    ('clf',LinearSVC())])
pipe_SVC.fit(texts[::2],y[::2])
print("Classification score with LinearSVC and ngram_range=(1,2) is :", \
      pipe_SVC.score(texts[1::2],y[1::2]))

Classification score with LinearSVC and ngram_range=(1,2) is : 0.825


In [21]:
from sklearn.linear_model import LogisticRegression

pipe_LR = Pipeline([('countVectorizer', CountVectorizer()),\
                    ('clf', LogisticRegression())])
pipe_LR.fit(texts[::2],y[::2])
print("Classification score with LogisticRegression is :", \
      pipe_LR.score(texts[1::2],y[1::2]))

Classification score with LogisticRegression is : 0.831


In [22]:
from sklearn.linear_model import LogisticRegression

pipe_LR = Pipeline([('countVectorizer', CountVectorizer(analyzer='word',ngram_range=(1,2))),\
                    ('clf', LogisticRegression())])
pipe_LR.fit(texts[::2],y[::2])
print("Classification score with LogisticRegression and ngram_range=(1,2) is :", \
      pipe_LR.score(texts[1::2],y[1::2]))

Classification score with LogisticRegression and ngram_range=(1,2) is : 0.828


Overall, LogisticRegression performs slightly better than LinearSVC here. However, adding bigrams (ngram_range=(1,2)) slightly lowers the LogisticRegression score (0.831 without bigrams vs 0.828 with bigrams), whereas it raises the LinearSVC score (0.81 to 0.825).

### Question 3: Use the NLTK library to perform stemming

In [23]:
import nltk
from nltk.stem import SnowballStemmer 

stemmer = SnowballStemmer("english")

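def tokenizer_stemmer(text):
    # Split on whitespace and basic punctuation, then stem each non-empty token
    # (raw string avoids invalid escape sequence warnings)
    words = re.split(r'[\s.:;,_]', text)
    return [stemmer.stem(word) for word in words if word != '']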


In [24]:
# This is a test for function 'tokenizer_stemmer'
tokenizer_stemmer("testing")

['test']

In [25]:
pipe_stem = Pipeline([('countVectorizer', CountVectorizer(analyzer=tokenizer_stemmer)),\
                          ('clf', MultinomialNB())])

pipe_stem.fit(texts[::2],y[::2])
print("Classification score of Naive Bayes with stemming is", \
      pipe_stem.score(texts[1::2],y[1::2]))

Classification score of Naive Bayes with stemming is 0.807


In [26]:
pipe_stem = Pipeline([('countVectorizer', CountVectorizer(analyzer=tokenizer_stemmer)),\
                          ('clf', LogisticRegression())])

pipe_stem.fit(texts[::2],y[::2])
print("Classification score of LogisticRegression with stemming is", \
      pipe_stem.score(texts[1::2],y[1::2]))

Classification score of LogisticRegression with stemming is 0.829


Classifier performance stays roughly unchanged (0.807 for Naive Bayes and 0.829 for LogisticRegression). The benefit of stemming is that it reduces the number of features, and therefore the size of the count matrix and the memory or storage it requires.
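
The reduction in vocabulary size can be illustrated with a toy example. The `naive_stem` function below is a deliberately crude suffix stripper written for this sketch. It is not the Snowball algorithm used above, and the word list is made up.

```python
def naive_stem(word):
    """Strip one common English suffix (toy rule, NOT the Snowball algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Made-up word list with several inflected forms of two stems
words = ["test", "tests", "tested", "testing", "act", "acts", "acted"]

vocab_raw = set(words)
vocab_stemmed = {naive_stem(w) for w in words}

print(len(vocab_raw), len(vocab_stemmed))  # 7 2
```

Seven distinct surface forms collapse to two stems, so the count matrix needs two columns instead of seven for this list.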


### Question 4: Filter words by part of speech (POS) and keep only nouns, verbs, adverbs and adjectives for classification.

In [27]:
from nltk import pos_tag

In [28]:
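def tokenizer_tag(text):
    # Split on whitespace and basic punctuation, dropping empty tokens
    words = [word for word in re.split(r'[\s.:;,_]', text) if word != '']
    tags = pos_tag(words)
    # Keep only the exact Penn Treebank tags NN, VBN, RB and JJ;
    # other noun, verb, adverb and adjective tags (NNS, VB, VBD, RBR, JJR, ...) are dropped
    return [word for word, tag in tags if tag in ('NN', 'VBN', 'RB', 'JJ')]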

In [29]:
# This is a test for function 'tokenizer_tag'
tokenizer_tag("good")

['good']

In [30]:
pipe_tag = Pipeline([('countVect', CountVectorizer(analyzer=tokenizer_tag)),\
                          ('naiveBayes', MultinomialNB())])


pipe_tag.fit(texts[::2],y[::2])
print("Classification score of Naive Bayes with POS filtering is", \
      pipe_tag.score(texts[1::2],y[1::2]))

Classification score of Naive Bayes with POS filtering is 0.816


In [31]:
pipe_tag = Pipeline([('countVect', CountVectorizer(analyzer=tokenizer_tag)),\
                          ('clf', LogisticRegression())])


pipe_tag.fit(texts[::2],y[::2])
print("Classification score of LogisticRegression with POS filtering is", \
      pipe_tag.score(texts[1::2],y[1::2]))

Classification score of LogisticRegression with POS filtering is 0.83


Filtering words by part of speech slightly improves Naive Bayes (0.816 vs 0.813 with the default word analyzer) and leaves LogisticRegression essentially unchanged (0.83 vs 0.831).
Extracting nouns, verbs, adverbs and adjectives keeps the words most likely to carry meaning, so the features focus on key words. It also greatly reduces the vocabulary size and, with it, the cost of the downstream processing.
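
The tokenizer above matches four exact tags. In the Penn Treebank tag set, every noun tag starts with NN, every verb tag with VB, every adverb tag with RB and every adjective tag with JJ, so a prefix test is one possible way to keep all four categories. The sketch below applies that idea to a hand-tagged sentence. The `keep_content_words` helper and the tagged example are hypothetical, and the scores reported in this section were not obtained with this variant.

```python
# Tag prefixes for nouns, verbs, adverbs and adjectives (Penn Treebank)
KEPT_PREFIXES = ("NN", "VB", "RB", "JJ")


def keep_content_words(tagged):
    """Keep the words whose POS tag starts with one of KEPT_PREFIXES."""
    return [word for word, tag in tagged if tag.startswith(KEPT_PREFIXES)]


# Hand-written (word, tag) pairs in the format returned by nltk.pos_tag
tagged = [("the", "DT"), ("actors", "NNS"), ("were", "VBD"),
          ("really", "RB"), ("good", "JJ"), (".", ".")]

print(keep_content_words(tagged))  # ['actors', 'were', 'really', 'good']
```

With the exact-match list used in the cell above, "actors" (NNS) and "were" (VBD) would be dropped, so the two filters select noticeably different vocabularies.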
