# Lab Work 2: Text Processing: Preparation of texts

Use this notebook for the subsequent parts of the exercise.

## 6.2.1 Load the data and CountVectorize them
You will find the files on Ilias: [sherlock.zip](https://www.ili.fh-aachen.de/goto_elearning_file_815003_download.html).
Download the zip file and adapt the file paths in the next cell accordingly.

In [1]:
import numpy as np

filenames = [r"Sherlock/Sherlock.txt", 
             r"Sherlock/Sherlock_blanched.txt",
             r"Sherlock/Sherlock_black.txt",
             r"Sherlock/Sherlock_blue.txt",
             r"Sherlock/Sherlock_card.txt"]

Now we create a `CountVectorizer`. The parameter `input="filename"` tells the `CountVectorizer` that its methods operate on a list of file names rather than on the texts themselves.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input="filename")

Now generate the Bag of Words with the CountVectorizer and check:
* the total number of different words
* the total number of words per document
* the total number of occurrences of each word

``fit()`` analyzes the content of the files listed in `filenames`.
This step extracts the vocabulary from the files and assigns an index to each unique term.

In [3]:
vectorizer.fit(filenames)
print(vectorizer.vocabulary_)
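What `fit()` learns with `input="filename"` can be seen on a toy example. The following is a minimal sketch: the two files and their contents are made up for illustration and written to a temporary directory, so it does not depend on the Sherlock files.

```python
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy files, created only for this illustration.
texts = {"a.txt": "the hound barked", "b.txt": "the hound slept"}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in texts.items():
        path = Path(tmp) / name
        path.write_text(text, encoding="utf-8")
        paths.append(str(path))

    # With input="filename", fit() opens and reads each path itself.
    toy_vectorizer = CountVectorizer(input="filename")
    toy_vectorizer.fit(paths)

# vocabulary_ maps each word to its column index (assigned in alphabetical order).
toy_vocabulary = dict(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
```

The values in the dictionary are column positions in the document-term matrix, not word counts.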



``get_feature_names_out()`` returns the feature names (words) of the vocabulary as an array. <br> The length of this array is therefore the total number of different words.

In [4]:
vocabulary = vectorizer.get_feature_names_out()
len(vocabulary)

8879

The total number of words per document: <br>
``fit_transform()`` learns the vocabulary dictionary and returns the document-term matrix. <br>
After converting the matrix to an array, ``np.sum()`` over each row gives the total number of words in the corresponding document.

In [5]:
X = vectorizer.fit_transform(filenames)
bow_matrix = X.toarray()
for i, row in enumerate(bow_matrix, start=1):
    print("Total number of words in Document", i, ":", np.sum(row))

Total number of words in Document 1 : 107416
Total number of words in Document 2 : 7258
Total number of words in Document 3 : 7775
Total number of words in Document 4 : 7497
Total number of words in Document 5 : 8242
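For larger corpora it can be useful to avoid the dense copy made by `toarray()`. As a hedged sketch (the two in-memory documents are invented, and the default `input="content"` is used instead of file names), the row sums can be taken directly on the sparse matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up in-memory documents (default input="content").
docs = ["the hound barked at the moon", "the hound slept"]

X_toy = CountVectorizer().fit_transform(docs)

# Summing each row of the sparse matrix gives the token count per document.
words_per_doc = np.asarray(X_toy.sum(axis=1)).ravel()

for number, total in enumerate(words_per_doc, start=1):
    print("Total number of words in Document", number, ":", total)
```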


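Total number of occurrences of each word: <br>
We build a dataframe from the document-term matrix, with one row per document and one column per word. After transposing, each row is a word, and summing over the documents (``axis=1``) gives its total count. Note that ``sort_values(by=0)`` runs before the sum, so the rows are ordered by the counts in the first document rather than by the totals. This is why the tail of the output is not in descending order.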

In [38]:
import pandas as pd

matrix = vectorizer.fit_transform(filenames)
counts = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())

dataframe = counts.T.sort_values(by=0, ascending=False).sum(axis=1)
dataframe

the             7975
and             3819
of              3640
to              3431
that            2754
                ... 
covent             3
remonstrance       1
inwardly           1
involuntary        1
117                2
Length: 8879, dtype: int64

## 6.2.2 Which word occurs the most?

This must be done in three steps, because `vectorizer.vocabulary_` is organized as a dictionary whose values indicate the position of each word in the array:
1. Find out the highest count of a word
2. Find out the position of this count
3. Find out the word at this position

Find out the highest count of a word: <br>
``vectorizer.vocabulary_``- First retrieve the vocabulary dictionary from the CountVectorizer object and assign it to the variable `dictionary`. This dictionary contains **words as keys** and their **corresponding indices as values**.<br>

Then create a new list (`inv`) of `(index, word)` tuples by swapping the keys and values of the dictionary.<br>

``max(inv)[0]`` returns the largest index in `inv`. Because the values of `vocabulary_` are column positions and not counts, the result is the index of the last word in the vocabulary (the vocabulary size minus one), not the highest count of a word.

In [7]:
dictionary = vectorizer.vocabulary_
inv = [(value, key) for key, value in dictionary.items()]
print(max(inv)[0])

8878


Find out the position of this count: <br> 
1. The keys of the dictionary (the words) are converted into a list.
2. ``max(dictionary)`` returns the largest key, i.e. the word that comes last in lexicographic order, because `max` on a dictionary compares its keys.
3. ``index()`` then gives the position of that word in the list of keys. This position reflects the key order of the dictionary and is not related to how often the word occurs.

In [8]:
list(dictionary.keys()).index(max(dictionary))

5577

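Find out the word at this position:<br>
``max(dictionary)`` returns the lexicographically largest word in the vocabulary. This is not a statement about frequency: the frequency table in 6.2.1 lists `the` at the top with 7975 occurrences.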

In [9]:
max(dictionary)

'zoo'

# 6.3 Improving using stop words, n-grams and tf-idf
The feature space is vast, with nearly 9000 dimensions. Hence we should try to reduce the number of dimensions by:

1. using only words that occur in a minimum number of documents (minimum document frequency, `min_df`)
2. removing stop words (like 'a', 'and', 'the'), as they carry little information for classification, and/or 
3. removing words that occur in many documents (maximum document frequency, `max_df`)

Experiment with the values of min_df and max_df and see how the size of the vocabulary is changing.

Implement all three options and check their separate outcomes and their combinations.

In [43]:
print("Original Size of Vocabulary:", len(dictionary))

# keep only words that occur in at least min_df of the documents (minimum document frequency);
# with 5 documents, min_df=0.2 corresponds to 1 document, so no word is removed
vectorizer_mindf = CountVectorizer(input="filename", min_df=0.2)
vectorizer_mindf.fit_transform(filenames)
print("Size of Vocabulary after min_df:", len(vectorizer_mindf.vocabulary_))

# drop words that occur in more than max_df of the documents (maximum document frequency)
vectorizer_maxdf = CountVectorizer(input="filename", max_df=0.5)
vectorizer_maxdf.fit_transform(filenames)
print("Size of Vocabulary after max_df:", len(vectorizer_maxdf.vocabulary_))

# remove stop words (like 'a', 'and', 'the')
vectorizer_stopwords = CountVectorizer(input="filename", stop_words=['a', 'and', 'the'])
vectorizer_stopwords.fit_transform(filenames)
print("Size of Vocabulary after using stop_words:", len(vectorizer_stopwords.vocabulary_))

# combine all 3 options
vectorizer2 = CountVectorizer(input="filename", min_df=0.2, stop_words=['a', 'and', 'the'], max_df=0.5)
vectorizer2.fit_transform(filenames)
print("Size of Vocabulary after:", len(vectorizer2.vocabulary_))

Original Size of Vocabulary: 8879
Size of Vocabulary after min_df: 8879
Size of Vocabulary after max_df: 7349
Size of Vocabulary after using stop_words: 8877
Size of Vocabulary after: 7349
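How the thresholds act can be checked on a small corpus where the document frequencies are easy to count by hand. The sketch below uses five invented one-line documents and a hypothetical helper `vocabulary_size`. A float threshold is a fraction of the documents, so with five documents `min_df=0.2` means "at least one document" and removes nothing, whereas the integer `min_df=2` means "at least two documents".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus of five documents: "the" occurs in all five,
# "hound" and "rose" in two each, every other word in one.
docs = [
    "the hound barked",
    "the hound slept",
    "the moon rose",
    "the fog rose",
    "the cab left",
]


def vocabulary_size(**options):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    return len(CountVectorizer(**options).fit(docs).vocabulary_)


sizes = {
    "no filtering": vocabulary_size(),
    "min_df=0.2": vocabulary_size(min_df=0.2),  # 0.2 * 5 = 1 document
    "min_df=2": vocabulary_size(min_df=2),      # at least 2 documents
    "max_df=0.5": vocabulary_size(max_df=0.5),  # at most 2.5 documents
    "stop_words=['the']": vocabulary_size(stop_words=["the"]),
}

for setting, size in sizes.items():
    print(setting, "->", size)
```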


# 6.4 Rescaling the data using term frequency inverse document frequency
Here, term frequency is the number of occurrences of a term (word) $t$ in a document $d$. 

$\operatorname{tf}(t, d) = f_{t, d}$ 

Sometimes tf is normalized by the length of $d$.

The inverse document frequency idf measures the amount of information a term $t$ carries: a term that occurs in few documents carries a lot of information, whereas a term that occurs in many documents carries little. The idf is computed as 

$\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$

where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain the term $t$. Hence, the tf-idf is the product of the two terms:

$\text{tf-idf(t,d)}=\text{tf(t,d)} \cdot \text{idf(t)}$
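As a worked example, the idf formula above can be evaluated for a corpus of $n = 5$ documents, the size used in this lab. The helper `smoothed_idf` is a hypothetical name introduced here, and the natural logarithm is assumed, as scikit-learn uses it.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """idf(t) = log((1 + n) / (1 + df(t))) + 1, using the natural logarithm."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


# A term that occurs in only one of five documents carries a high weight ...
idf_rare = smoothed_idf(5, 1)
# ... while a term that occurs in all five documents gets the minimum weight of 1.
idf_common = smoothed_idf(5, 5)

print(round(idf_rare, 6), idf_common)
```

The two results, roughly 2.098612 and exactly 1.0, are the largest and smallest idf values possible for five documents under this formula.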

scikit-learn supports this in the `TfidfTransformer` when using the following parameters: `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`. Refer to the scikit-learn documentation for the parameters and how they change the formula.

Combining Bag of Words and tf-idf can be done using the `TfidfVectorizer`.
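The following minimal sketch illustrates this on three invented in-memory documents: with default parameters, applying a `CountVectorizer` followed by a `TfidfTransformer` should give the same matrix as a single `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents, used only for illustration.
docs = ["the hound barked at the moon", "the hound slept", "a quiet moon"]

# Two steps: Bag of Words first, then tf-idf rescaling.
toy_counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# One step: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(docs)

same_result = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(same_result)
```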

# 6.4.1 Find the maximum value of each feature over the dataset

1. Here we create a `TfidfVectorizer`. <br>
Parameters given: 
- ``input = "filename"`` - tells the `TfidfVectorizer` that its methods operate on a list of file names
- ``use_idf = True`` - enables inverse-document-frequency reweighting 
- ``smooth_idf = True`` - smooths the idf weights by adding one to the document frequencies, as if an extra document containing every term of the collection exactly once had been seen. This prevents divisions by zero.

2. Learn the vocabulary and idf from the training set with ``fit()``.
3. Create a dataframe with the idf value of each term, indexed by the feature names, and sort it in descending order.

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfid_vectorizer = TfidfVectorizer(input="filename", use_idf=True, smooth_idf=True)
# fit() returns the fitted vectorizer itself, so x and tfid_vectorizer are the same object
x = tfid_vectorizer.fit(filenames)
# one row per term, holding its idf value
counts = pd.DataFrame(x.idf_, index=tfid_vectorizer.get_feature_names_out(), columns=["idf_value"])
dataframe = counts.sort_values(by=["idf_value"], ascending=False)
dataframe

Unnamed: 0,idf_value
117,2.098612
neutralized,2.098612
napoleon,2.098612
naples,2.098612
nameless,2.098612
...,...
behind,1.000000
short,1.000000
have,1.000000
night,1.000000
