# TP1
References:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

https://github.com/makcedward/nlp/blob/master/sample/nlp-bag_of_words.ipynb




In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
import matplotlib.pyplot as plt

## Load Data

In [None]:
import pickle
with open('newsgroups_data.pickle', 'rb') as f:
    newsgroups_data_df = pickle.load(f)
newsgroups_data_df.shape

In [None]:
newsgroups_data_df.head()

In [None]:
newsgroups_data_df.group_label.value_counts()

******

## Vector representation of documents using simple BOW

We will do this using the `CountVectorizer` class from the `sklearn` library.

**Questions - part 1**

Execute the cells below, then answer the following questions based on the output of these cells. You do not need to write new code.

1. What is the size of the document-term matrix `dtm1`, and what is the fraction of non-zero values in it?
2. What is the maximum value in `dtm1`, and what does this value mean?
3. How many words does the vocabulary contain?
4. Give the words that correspond to the first 10 columns of `dtm1`.
5. Are the words in `vocab1` ordered arbitrarily or in some specific order?
6. Read the documentation of the `CountVectorizer` class using the command `?CountVectorizer`. Without writing any code, what will happen if we set the parameter `binary=True`?
7. Repeat the previous question for the parameter setting `lowercase=False`.

In [None]:
## Read documentation of the CountVectorizer class's constructor.
# Notice the default values of its parameters.
?CountVectorizer

In [None]:
### Set values of control parameters of the Vectorizer classes that we will use

max_features = 5000
max_df = 0.9
min_df = 3

In [None]:
# create an instance of the class with the desired parameter values.
vect1 = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english', binary = False)

# create vocabulary based on given corpus by calling the fit() method.
vect1.fit(newsgroups_data_df.message)

In [None]:
# create the document-term matrix (DTM) of a given corpus by calling the transform() method
dtm1 = vect1.transform(newsgroups_data_df.message)
dtm1.shape

In [None]:
# what is the data type of this matrix?
# the DTM is stored in a sparse matrix class, scipy.sparse.csr_matrix
# (the exact module path that is printed depends on the SciPy version)
type(dtm1)

In [None]:
# what is the number of non-zero values in this matrix?
dtm1.nnz

In [None]:
# the minimum and maximum values
dtm1.min(),dtm1.max()

In [None]:
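## let's store the vocabulary in a list for convenience
# the first element of vocab1 corresponds to the first column of dtm1
# the second element of vocab1 corresponds to the second column of dtm1
# etc.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() returns a NumPy array, so we convert it to a list
vocab1 = list(vect1.get_feature_names_out())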

In [None]:
print(type(vocab1))
print(len(vocab1))

In [None]:
# The first 50 words of the vocabulary
print(vocab1[0:50])

In [None]:
# The last 50 words of the vocabulary
print(vocab1[-50:])

In [None]:
# Your answers


**Questions - part 2**

In each of the following questions, you need to re-build the vocabulary and the DTM matrix. You can copy the code above and modify it as desired.

1. Change the `token_pattern` parameter of the original call to `CountVectorizer()` so that only words consisting entirely of the letters 'a' to 'z' and the '-' character are accepted in the vocabulary. How many words does the new vocabulary contain, and what are the first 10 words of this vocabulary?

2. Add the following parameter to the original call to `CountVectorizer()`: `ngram_range=(1,2)`. Determine the number of terms in the new vocabulary, and give 10 terms in this vocabulary that are not in `vocab1`.

3. Change the parameters of the original call to `CountVectorizer()` using the new values below, but modify each parameter separately (not simultaneously). For each modification, explain how and why the vocabulary and DTM have changed compared to `vocab1` and `dtm1`.

       stop_words = None
       max_df = 0.7
       min_df = 1
       max_features = 2000

In [None]:
## Q1



In [None]:
## Q2



In [None]:
## Q3



In [None]:
# lowercase = False


In [None]:
# max_df = 0.7


In [None]:
# min_df = 1


In [None]:
# max_features = 2000


In [None]:
# stop_words = None



****

## Vector representation of documents using tfidf BOW

We will do this using the `TfidfVectorizer` class from the `sklearn` library.

**Questions**

1. Read the documentation of the `TfidfVectorizer` class.
2. Call the constructor of this class using the following parameter values: `max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english'`. Put the result in a variable called `vect2`.
3. Call the `fit_transform` method to create the vocabulary and the DTM in one step using the text documents in `newsgroups_data_df.message`. Put the result in a variable called `dtm2`.
4. Determine the minimum and maximum values of `dtm2`.
5. Is the vocabulary of `vect2` different from `vocab1`? You can use set operations for this.

In [None]:
# Read documentation of the TfidfVectorizer class
?TfidfVectorizer

******

## Document Classification

**Questions**

1. Copy `dtm1` into a new variable called `X` and copy `newsgroups_data_df.group_label` into a new variable called `y`.
2. Split `X` and `y` into train and test sets, with a test set size of 30%.
3. Build a logistic regression model using the training set and calculate its accuracy on the test set.
4. Repeat the first 3 questions with `dtm2` instead of `dtm1`. Is the accuracy better or worse than with `dtm1`?
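
A minimal sketch of this workflow on a made-up, two-class corpus is shown below. The documents, labels, `random_state`, `stratify`, and `max_iter` settings are all assumptions chosen for illustration; they are not prescribed by the questions above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 documents per made-up class
sport_docs = [f"the team won match {i} with a late goal" for i in range(10)]
space_docs = [f"the rocket reached orbit {i} after launch" for i in range(10)]
toy_docs = sport_docs + space_docs
toy_labels = ["sport"] * 10 + ["space"] * 10

# Document-term matrix and labels
X = CountVectorizer().fit_transform(toy_docs)
y = toy_labels

# Hold out 30% of the documents for testing
# (random_state and stratify are assumptions, added for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit on the training set only, then evaluate on the held-out test set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```

Swapping `CountVectorizer` for `TfidfVectorizer` in this sketch gives the corresponding tf-idf variant.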

***