# TP1
References:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

https://github.com/makcedward/nlp/blob/master/sample/nlp-bag_of_words.ipynb




In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
import matplotlib.pyplot as plt

## Load Data

In [3]:
import pickle
with open(r'C:\Users\Lenovo\Documents\cours_dataminig\newsgroups_data\newsgroups_data.pickle', 'rb') as f:
    newsgroups_data_df = pickle.load(f)
newsgroups_data_df.shape

(6641, 2)

In [4]:
newsgroups_data_df.head()

| | message | group_label |
|---|---|---|
| 0 | Hello folks, I've a super scope 6 for sale, it... | misc.forsale |
| 1 | I have for sale the following:\n\n\tHewlett Pa... | misc.forsale |
| 2 | \nThis is definitely the wrong newsgroup for t... | sci.electronics |
| 3 | Acorn Software, Inc. has 3 tape drives (curren... | misc.forsale |
| 4 | \n[KAAN] Who the hell is this guy David Davidi... | talk.politics.mideast |


In [5]:
newsgroups_data_df.group_label.value_counts()

sci.med                  990
rec.autos                990
sci.space                987
sci.electronics          984
misc.forsale             975
talk.politics.mideast    940
talk.politics.misc       775
Name: group_label, dtype: int64

******

## Vector representation of documents using simple BOW

We will do this using the CountVectorizer class from the `sklearn` library.

**Questions - part 1**

Execute the cells below, then answer the following questions based on the output of these cells. You do not need to write new code.

1. What is the size of the document-term matrix `dtm1`, and what is the fraction of non-zero values in it?
2. What is the maximum value in `dtm1`, and what does this value mean?
3. How many words does the vocabulary contain?
4. Give the words that correspond to the first 10 columns of `dtm1`.
5. Are the words in `vocab1` ordered arbitrarily or in some specific order?
6. Read the documentation of the `CountVectorizer` class using the command `?CountVectorizer`. Without writing any code, what will happen if we set the parameter `binary=True`?
7. Repeat the previous question for the parameter `lowercase=False`.

In [6]:
## Read documentation of the CountVectorizer class's constructor.
# Notice the default values of its parameters.
?CountVectorizer

In [7]:
### Set values of control parameters of the Vectorizer classes that we will use

max_features = 5000
max_df = 0.9
min_df = 3

In [8]:
# create an instance of the class with the desired parameter values.
vect1 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', binary = False)

# create vocabulary based on given corpus by calling the fit() method.
vect1.fit(newsgroups_data_df.message)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=3,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]:
# create the document-term matrix (DTM) of a given corpus by calling the transform() method
dtm1 = vect1.transform(newsgroups_data_df.message)
dtm1.shape

(6641, 18939)

In [10]:
# what is the data type of this matrix?
#  dtm is stored in a special data structure (class) called 'scipy.sparse.csr.csr_matrix'
type(dtm1)

scipy.sparse.csr.csr_matrix

In [11]:
# what is the number of non-zero values in this matrix?
dtm1.nnz

406455

In [12]:
# the minimum and maximum values
dtm1.min(),dtm1.max()

(0, 185)

In [13]:
## let's store the vocabulary in a list for convenience
# the first element of vocab corresponds to the first column of dtm
# the second element of vocab corresponds to the second column of dtm
# etc.
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out())
vocab1 = vect1.get_feature_names()

In [14]:
print(type(vocab1))
print(len(vocab1))

<class 'list'>
18939


In [15]:
# The first 50 words of the vocabulary
print(vocab1[0:50])

['00', '000', '0000', '00000', '000k', '001', '00pm', '01', '011', '015', '01730', '02', '02106', '02115', '02173', '03', '030', '0300', '04', '05', '0500', '06', '0663', '0674', '07', '08', '08003', '081', '09', '095', '0l', '10', '100', '1000', '1000w', '100hz', '100k', '100mhz', '100ns', '101', '102', '1024x768', '103', '1031', '104', '1047', '105', '106', '107', '1070']


In [16]:
# The last 50 words of the vocabulary
print(vocab1[-50:])

['yugo', 'yugoslav', 'yugoslavia', 'yunusova', 'yup', 'yuppie', 'yuppies', 'yuri', 'yusuf', 'yxy4145', 'yyz', 'z28', 'z80', 'za', 'zabitlari', 'zahal', 'zangezour', 'zangibasar', 'zaphod', 'zbib', 'zeal', 'zealand', 'zen', 'zener', 'zenith', 'zeos', 'zero', 'zeus', 'zhiguli', 'zilfi', 'zilkade', 'zillion', 'zinc', 'zion', 'zionism', 'zionist', 'zionists', 'zip', 'zivin', 'zodiacal', 'zoloft', 'zombies', 'zone', 'zones', 'zoo', 'zoom', 'zuma', 'zur', 'zwischen', 'zx']


#Your answers

#Q1.
`dtm1` has shape (6641, 18939): 6641 documents by 18939 terms. It holds 406455 non-zero values, so the fraction of non-zero entries is 406455 / (6641 × 18939) ≈ 0.00323, i.e. about 0.32%.
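
The fraction of non-zero values can be checked with a few lines of arithmetic. This is only an illustrative sketch: `density` is a hypothetical helper name, and the numbers are the `shape` and `nnz` values printed in cells 9 and 11.

```python
def density(nnz, shape):
    """Return the fraction of non-zero cells in a matrix of the given shape."""
    n_rows, n_cols = shape
    return nnz / (n_rows * n_cols)


# Values reported for dtm1 in this notebook (cells 9 and 11).
dtm1_density = density(406455, (6641, 18939))
print(f"{dtm1_density:.5f}")  # 0.00323
```

For a SciPy sparse matrix `m`, the same quantity is `m.nnz / (m.shape[0] * m.shape[1])`.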

#Q2.
The maximum value is 185: the most repeated word in any single document occurs 185 times in that document.

#Q3.
The vocabulary contains 18939 words, one per column of `dtm1`.

In [17]:
#Q4. The first 10 words of the vocabulary:
print(vocab1[0:10])

['00', '000', '0000', '00000', '000k', '001', '00pm', '01', '011', '015']


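#Q5. The words are sorted in alphabetical (lexicographic) order, with digits before letters, as cells 15 and 16 show.

#Q6. With `binary=True`, every non-zero count is set to 1, so the DTM records only whether a word is present in a document (values 0 or 1), not how often it occurs.

#Q7. With `lowercase=False`, the text is not converted to lowercase before tokenizing, so 'Space' and 'space' become distinct vocabulary entries and the vocabulary grows.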
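
The effect of `binary` and `lowercase` can be seen on a tiny corpus. The sketch below is illustrative: the two sentences and the helper `vocab_and_counts` are assumptions made up for this example, and the vocabulary is read from the fitted `vocabulary_` attribute, whose sorted keys follow the column order of the matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption, not part of the newsgroups data).
toy_corpus = ["The cat saw the other cat", "A dog saw The Cat"]


def vocab_and_counts(**params):
    """Fit a CountVectorizer with the given parameters on the toy corpus."""
    vect = CountVectorizer(**params)
    dtm = vect.fit_transform(toy_corpus)
    # Sorted vocabulary keys are in the same order as the DTM columns.
    return sorted(vect.vocabulary_), dtm.toarray().tolist()


vocab_default, counts_default = vocab_and_counts()
vocab_binary, counts_binary = vocab_and_counts(binary=True)
vocab_cased, counts_cased = vocab_and_counts(lowercase=False)

print(vocab_default, counts_default)
# ['cat', 'dog', 'other', 'saw', 'the'] [[2, 0, 1, 1, 2], [1, 1, 0, 1, 1]]
print(counts_binary)
# [[1, 0, 1, 1, 1], [1, 1, 0, 1, 1]]
print(vocab_cased)
# ['Cat', 'The', 'cat', 'dog', 'other', 'saw', 'the']
```

The single-letter word "A" is absent in all three cases because the default `token_pattern` only keeps tokens of two or more characters.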

**Questions - part 2**

In each of the following questions, you need to re-build the vocabulary and the DTM matrix. You can copy the code above and modify it as desired.

1. Change the `token_pattern` parameter of the original call to `CountVectorizer()` so that only words that contain the letters 'a' to 'z' and the '-' character are accepted in the vocabulary. How many words does the new vocabulary contain and what are the first 10 words of this vocabulary?

2. Add the following parameter to the original call to `CountVectorizer()`: ngram_range=(1,2). Determine the number of words of the new vocabulary, and give 10 words in this vocabulary that are not in `vocab1`.

3. Change the parameters of the original call to `CountVectorizer()` using the new values below, but modify each parameter separately (not simultaneously). For each modification, explain how and why the vocabulary and DTM have changed compared to `vocab1` and `dtm1`.

       stop_words = None
       max_df = 0.7
       min_df = 1
       max_features = 2000

In [18]:
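## Q1
# A token is one or more letters, an optional single '-', then one or more letters.
# Note: the pattern has no word boundaries, so it also matches letter runs inside
# mixed tokens (e.g. 'mhz' in '100mhz').
vect2 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', binary=False, token_pattern='[a-zA-Z]+-?[a-zA-Z]+')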



In [19]:

vect2.fit(newsgroups_data_df.message)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=3,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='[a-zA-Z]+-?[a-zA-Z]+',
        tokenizer=None, vocabulary=None)

In [20]:
dtm2 = vect2.transform(newsgroups_data_df.message)

In [21]:
dtm2.shape


(6641, 17978)

In [22]:
# what is the number of non-zero values in this matrix?
dtm2.nnz

382235

In [23]:
# the minimum and maximum values
dtm2.min(),dtm2.max()

(0, 153)

In [24]:

vocab2 = vect2.get_feature_names()

In [25]:
print(type(vocab2))
print(len(vocab2))

<class 'list'>
17978


In [26]:
# The first 10 words of the vocabulary
print(vocab2[0:10])

['aa', 'aaa', 'aamir', 'aaron', 'aas', 'ab', 'abandon', 'abandoned', 'abandoning', 'abbey']


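#Q1. The new vocabulary (`vocab2`) contains 17978 words. Its first 10 words are 'aa', 'aaa', 'aamir', 'aaron', 'aas', 'ab', 'abandon', 'abandoned', 'abandoning' and 'abbey' (cell 26).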
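
`CountVectorizer` tokenizes by applying the `token_pattern` regular expression to each (lowercased) document, so the pattern can be explored with the standard `re` module alone. The sketch below is illustrative: the sample sentence is made up, and `strict_pattern` is a hypothetical stricter alternative that was not used for any result in this notebook.

```python
import re

# Pattern passed as token_pattern for vect2 in this notebook.
notebook_pattern = re.compile(r"[a-zA-Z]+-?[a-zA-Z]+")

# Hypothetical alternative (an assumption): letters with inner hyphens only,
# and the token must not touch a digit, letter, underscore or hyphen.
strict_pattern = re.compile(r"(?<![\w-])[a-z]+(?:-[a-z]+)*(?![\w-])")

# Made-up sample text, lowercased as CountVectorizer does by default.
sample = "Well-known 100MHz state-of-the-art a b2b".lower()

notebook_tokens = notebook_pattern.findall(sample)
strict_tokens = strict_pattern.findall(sample)

print(notebook_tokens)  # ['well-known', 'mhz', 'state-of', 'the-art']
print(strict_tokens)    # ['well-known', 'state-of-the-art', 'a']
```

The notebook pattern keeps the letters of '100mhz' and splits words with more than one hyphen, while the stricter variant rejects mixed tokens entirely but lets single letters through. Which behaviour is preferable depends on how the question is read.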

In [27]:
#Q2.
vect3 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', binary = False,ngram_range=(1,2))

In [28]:

vect3.fit(newsgroups_data_df.message)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=3,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [29]:
dtm3 = vect3.transform(newsgroups_data_df.message)

In [30]:
dtm3.shape


(6641, 42246)

In [31]:
# what is the number of non-zero values in this matrix?
dtm3.nnz

517520

In [32]:
# the minimum and maximum values
dtm3.min(),dtm3.max()

(0, 185)

In [33]:

vocab3 = vect3.get_feature_names()

In [34]:
print(type(vocab3))
print(len(vocab3))

<class 'list'>
42246


In [35]:
idx=set(vocab3)-set(vocab1)
print('the number of non common words are',len(idx),"example of 10 words",list(idx)[0:10])

the number of non common words are 23307 example of 10 words ['bari unknown', 'harbord anadolu', 'second round', 'know secretive', 'having difficulty', 'real estate', 'woke dro', 'teblig olunacak', 'ken mitchell', 'rider old']
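
With `ngram_range=(1,2)` the vocabulary holds single words and pairs of consecutive tokens. Stop words are removed before the pairs are formed, which is why pairs such as 'know secretive' appear. The following is a minimal from-scratch sketch of that idea, not scikit-learn's own code; `word_ngrams` and the token list are assumptions for illustration.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of consecutive tokens for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Made-up token list, assumed to be already lowercased and stop-word filtered.
tokens = ["real", "estate", "agent"]
unigrams_and_bigrams = word_ngrams(tokens)
print(unigrams_and_bigrams)
# ['real', 'estate', 'agent', 'real estate', 'estate agent']
```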


In [36]:
# lowercase = False
vect4 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', binary = False,lowercase = False)
vect4.fit(newsgroups_data_df.message)
dtm4 = vect4.transform(newsgroups_data_df.message)
vocab4 = vect4.get_feature_names()
print('NNZ',dtm4.nnz)
print('size',dtm4.shape)
print('min',dtm4.min(),'max',dtm4.max())
print('type',type(vocab4))
print('length',len(vocab4))

NNZ 439339
size (6641, 22288)
min 0 max 185
type <class 'list'>
length 22288


With `lowercase=False`, the vocabulary grows by 22288 - 18939 = 3349 words, because uppercase and lowercase forms of the same word are now counted as separate tokens.

In [37]:
#Q3.
#max_df = 0.7
max_df1 = 0.7
vect5 = CountVectorizer(max_df=max_df1, min_df=min_df, stop_words='english', binary = False)
vect5.fit(newsgroups_data_df.message)
dtm5 = vect5.transform(newsgroups_data_df.message)
vocab5 = vect5.get_feature_names()
print('NNZ',dtm5.nnz)
print('size',dtm5.shape)
print('min',dtm5.min(),'max',dtm5.max())
print('type',type(vocab5))
print('length',len(vocab5))

NNZ 406455
size (6641, 18939)
min 0 max 185
type <class 'list'>
length 18939


In [38]:
idx=set(vocab5)-set(vocab1)
print('the number of non common words are',len(idx),"example of 10 words",list(idx)[0:10])

the number of non common words are 0 example of 10 words []


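No change: lowering `max_df` from 0.9 to 0.7 has no effect on this dataset. Once the English stop words are removed, no remaining word occurs in more than 70% of the documents, so `vocab5` and `dtm5` are identical to `vocab1` and `dtm1`.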

In [39]:
#min_df
min_df1 = 1
vect6 = CountVectorizer(max_df=max_df, min_df=min_df1, stop_words='english', binary = False)
vect6.fit(newsgroups_data_df.message)
dtm6 = vect6.transform(newsgroups_data_df.message)
vocab6 = vect6.get_feature_names()
print('NNZ',dtm6.nnz)
print('size',dtm6.shape)
print('min',dtm6.min(),'max',dtm6.max())
print('type',type(vocab6))
print('length',len(vocab6))

NNZ 446163
size (6641, 51354)
min 0 max 185
type <class 'list'>
length 51354


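The vocabulary grows by 51354 - 18939 = 32415 words because lowering `min_df` from 3 to 1 also keeps the words that occur in only one or two documents. The number of non-zero values rises from 406455 to 446163.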
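
The `min_df` and `max_df` filters both act on the document frequency of a word, i.e. the number of documents that contain it. An integer `min_df` is an absolute count and a float `max_df` is a proportion of the corpus. Below is a from-scratch sketch of that rule under those assumptions; the helper `filter_by_document_frequency` and the toy documents are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms with min_df <= document frequency <= max_df * n_docs.

    min_df is an absolute document count, max_df a proportion in [0, 1].
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)


# Made-up corpus: 'apple' is in all 4 documents, 'banana' in 2, the rest in 1.
docs = [["apple", "banana"], ["apple", "cherry"],
        ["apple", "banana", "date"], ["apple"]]

strict_vocab = filter_by_document_frequency(docs, min_df=2, max_df=0.9)
loose_vocab = filter_by_document_frequency(docs, min_df=1, max_df=0.9)
print(strict_vocab)  # ['banana']
print(loose_vocab)   # ['banana', 'cherry', 'date']
```

Lowering `min_df` readmits the rare words, which is the same effect observed between `vocab1` and `vocab6`.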

In [40]:
#max_features = 2000
vect7 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', binary = False, max_features = 2000)
vect7.fit(newsgroups_data_df.message)
dtm7= vect7.transform(newsgroups_data_df.message)
vocab7 = vect7.get_feature_names()
print('NNZ',dtm7.nnz)
print('size',dtm7.shape)
print('min',dtm7.min(),'max',dtm7.max())
print('type',type(vocab7))
print('length',len(vocab7))

NNZ 237615
size (6641, 2000)
min 0 max 185
type <class 'list'>
length 2000


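The vocabulary shrinks by 18939 - 2000 = 16939 words because `max_features=2000` keeps only the 2000 most frequent words of the corpus, whereas `vect1` uses the default `max_features=None` (no limit). The number of non-zero values drops to 237615.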
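
A toy run makes the `max_features` selection visible: only the words with the highest total count across the corpus survive. The two-sentence corpus below is an assumption made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: total counts are space=4, car=2, zinc=1.
toy_corpus = ["space space space car", "space car zinc"]

full_vocab = sorted(CountVectorizer().fit(toy_corpus).vocabulary_)
top2_vocab = sorted(CountVectorizer(max_features=2).fit(toy_corpus).vocabulary_)

print(full_vocab)  # ['car', 'space', 'zinc']
print(top2_vocab)  # ['car', 'space']
```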

In [41]:
#stop_words=None
vect8 = CountVectorizer(max_df=max_df, min_df=min_df, stop_words=None, binary = False)
vect8.fit(newsgroups_data_df.message)
dtm8= vect8.transform(newsgroups_data_df.message)
vocab8 = vect8.get_feature_names()
print('NNZ',dtm8.nnz)
print('size',dtm8.shape)
print('min',dtm8.min(),'max',dtm8.max())
print('type',type(vocab8))
print('length',len(vocab8))

NNZ 611258
size (6641, 19242)
min 0 max 716
type <class 'list'>
length 19242


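With `stop_words=None` the English stop words are no longer removed. The vocabulary grows by 19242 - 18939 = 303 words, the number of non-zero values rises from 406455 to 611258, and the maximum count in a single document jumps from 185 to 716, since stop words are among the most frequent words in any text.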
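
The same effect can be reproduced on a tiny corpus. This is an illustrative sketch: the two sentences and the helper `summarize` are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Made-up corpus in which 'the', 'is' and 'in' are English stop words.
toy_corpus = ["the rocket is in the sky", "the car is in the garage"]


def summarize(stop_words):
    """Return the sorted vocabulary and the largest count in the DTM."""
    vect = CountVectorizer(stop_words=stop_words)
    dtm = vect.fit_transform(toy_corpus)
    return sorted(vect.vocabulary_), int(dtm.max())


vocab_all, max_all = summarize(None)
vocab_filtered, max_filtered = summarize("english")

print("the" in ENGLISH_STOP_WORDS)   # True
print(vocab_all, max_all)
# ['car', 'garage', 'in', 'is', 'rocket', 'sky', 'the'] 2
print(vocab_filtered, max_filtered)
# ['car', 'garage', 'rocket', 'sky'] 1
```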

****

## Vector representation of documents using tfidf BOW

We will do this using the TfidfVectorizer class from the `sklearn` library.

**Questions**

1. Read the documentation of the `TfidfVectorizer` class.
2. Call the constructor of this class using the following parameter values: `max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english'`. Put the result in a variable called `vect2`.
3. Call the `fit_transform` method in order to create the vocabulary and the DTM matrix in one step using the text documents in `newsgroups_data_df.message`. Put the result in a variable called `dtm2`.
4. Determine the minimum and maximum values of `dtm2`.
5. Is the vocabulary of `vect2` different from `vocab1`? You can use set operations for this.

In [42]:
# Read documentation of the TfidfVectorizer class
?TfidfVectorizer

In [43]:
#Q2.
vect2=TfidfVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english')

In [44]:
#Q3.
# fit_transform() builds the vocabulary and returns the DTM in one step
dtm2 = vect2.fit_transform(newsgroups_data_df.message)


In [45]:
#Q4.
print('min',dtm2.min(),'max',dtm2.max())

min 0.0 max 1.0


In [46]:
vocab2 = vect2.get_feature_names()

print('NNZ',dtm2.nnz)
print('size',dtm2.shape)

print('type',type(vocab2))
print('length',len(vocab2))


NNZ 317215
size (6641, 5000)
type <class 'list'>
length 5000


In [47]:
idx=set(vocab2) & set(vocab1)
print('the number of common words are',len(idx))


the number of common words are 5000


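All 5000 words of `vocab2` are also in `vocab1`, so the tf-idf vocabulary is a subset of `vocab1`.

The vocabulary shrinks by 18939 - 5000 = 13939 words because of `max_features=5000`, and the number of non-zero values drops to 317215.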
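
The maximum of 1.0 found in cell 45 follows from the L2 normalization that `TfidfVectorizer` applies to each row by default: a document with a single vocabulary word has one weight, which is scaled to 1. The sketch below recomputes tf-idf from scratch using the default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1. It is an illustration under those assumptions, not scikit-learn's code, and `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    rows = []
    for tokens in tokenized_docs:
        term_freq = Counter(tokens)
        weights = {t: term_freq[t] * idf[t] for t in term_freq}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Guard against empty documents, which have a zero norm.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Made-up tokenized documents; the first one contains a single word.
docs = [["orbit"], ["orbit", "engine"], ["engine", "engine", "brake"]]
rows = tfidf_rows(docs)
print(round(rows[0]["orbit"], 6))  # 1.0
```

Every non-empty row has unit Euclidean length, so all weights lie between 0 and 1.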

******

## Document Classification

**Questions**

1. Copy `dtm1` into a new variable called `X` and copy `newsgroups_data_df.group_label` into a new variable called `y`.
2. Split `X` and `y` into train and test sets with a test set size of 30%.
3. Build a logistic regression model using the training set and calculate its accuracy on the test set.
4. Repeat the first 3 questions with `dtm2` instead of `dtm1`. Is the accuracy better or worse than with `dtm1`?

In [48]:
#Q1.
X=dtm1.copy()
y=newsgroups_data_df.group_label.copy()

In [49]:
#Q2.
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X,y, test_size=0.3, random_state=0)

In [50]:
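#Q3.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# fit a logistic regression classifier (default parameters) on the training set
clf = LogisticRegression()
clf.fit(X_train1, y_train1)
# predict the labels of the test set and compute the accuracy
test_predictions = clf.predict(X_test1)
acc = accuracy_score(y_test1, test_predictions)
print("accuracy", acc)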



accuracy 0.806823883592574


In [51]:
#Q4. Same steps with the tf-idf matrix dtm2 (5000 features)
X1 = dtm2.copy()
y1 = newsgroups_data_df.group_label.copy()
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)

In [52]:
clf=LogisticRegression()
clf.fit(X_train, y_train)
test_pred = clf.predict(X_test)
acc = accuracy_score(y_test, test_pred)
print("accuracy",acc)

accuracy 0.8183642749623683


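With the tf-idf matrix `dtm2`, the accuracy is 0.818, slightly better than the 0.807 obtained with the raw counts in `dtm1` (a gain of about 1.2 percentage points), even though `dtm2` uses only 5000 features instead of 18939.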
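
In the cells above, the vectorizers are fitted on the whole corpus before the train/test split. One possible variant, sketched here on a made-up two-class corpus, chains the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is learned from the training texts only. The texts, labels and parameter choices (`max_iter=1000`, `stratify`) are assumptions for illustration, and the printed accuracy says nothing about the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy two-class corpus (an assumption made up for illustration).
texts = [
    "engine brake tire", "car engine oil", "tire wheel brake",
    "car wheel oil", "engine wheel car", "brake oil tire",
    "orbit rocket moon", "rocket launch orbit", "moon lunar orbit",
    "launch rocket moon", "lunar launch orbit", "moon rocket lunar",
]
labels = ["autos"] * 6 + ["space"] * 6

# Split the raw texts first, keeping the class balance in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only,
# then applies the same vocabulary to the test texts.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts_train, y_train)

accuracy = accuracy_score(y_test, model.predict(texts_test))
print("accuracy", accuracy)
```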

***