# Class 6: Text Basics - Solution

## 1.0 Word Presence Vectorizer

In this lab session, we will practice the full pipeline:

0) reading and cleaning data
1) tokenization
2) preprocessing
3) vectorization
4) model analysis and inference

We will work with speeches in the Danish parliament and use a binary vectorization with a set of preprocessing steps. We will not work through all possible preprocessing steps or ways of vectorizing. However, you should be able to adapt the pipeline to new applications after today.

## Setup 

* Modules
* Working Directory

In [2]:
# # # # Import modules # # # #
import os
import spacy
import numpy as np
import pandas as pd

from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# # # # Working Directory # # # #

# Change directory
# wd = '/home/rask/Dropbox/teaching/css_fall2023'                              
wd = 'C:/Users/au535365/Dropbox/teaching/css_fall2023'
os.chdir(wd)

# Confirm that the working directory is as intended 
os.getcwd()

'C:\\Users\\au535365\\Dropbox\\teaching\\css_fall2023'

#### Exercise 1.0: Reading in Data

We start by reading in data. We work with the same data as in class05.

Instead of reading the data in from our local directories, we read the data directly from GitHub. See the notebook `class05-filereading.ipynb` for details.

In [4]:
# Generate file ids
files = ['20001', 
         '20011',
         '20012',
         '20021',
         '20031',
         '20041',
         '20042',
         '20051',
         '20061',
         '20071',
         '20072',
         '20081',
         '20091',
         '20101',
         '20102',
         '20111',
         '20121',
         '20131',
         '20141',
         '20142',
         '20151',
         '20161',
         '20171',
         '20181',
         '20182',
         '20191',
         '20201',
         '20211']

# Specify base url
base_url = 'https://raw.githubusercontent.com/mraskj/css_fall2023/master/data/ft-speeches/'

#### Solution 1.0

In [5]:
# Read in data. Solution here:
df = pd.DataFrame()
for file in tqdm(files):
    df_term = pd.read_csv(base_url + file + '.csv')
    df = pd.concat([df, df_term])
df.reset_index(drop=True, inplace=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [01:05<00:00,  2.35s/it]
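
Growing a dataframe inside the loop copies all previously read rows on every iteration. A minimal sketch of a common alternative collects the per-term dataframes in a list and concatenates once. The two in-memory CSV strings and their column names below are made up so that the example runs without network access; they are not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two downloaded CSV files (not the real data)
toy_files = {
    "20001": "speaker,text\nA,første tale\nB,anden tale\n",
    "20011": "speaker,text\nC,tredje tale\n",
}

# Read each file into its own dataframe and collect them in a list
frames = [pd.read_csv(io.StringIO(content)) for content in toy_files.values()]

# Concatenate once; ignore_index=True gives a fresh 0..n-1 index
toy_df = pd.concat(frames, ignore_index=True)
print(toy_df)
```

With `ignore_index=True` the separate `reset_index(drop=True)` step is no longer needed.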


#### Exercise 1.1: Random Sampling

Since this is just an exercise, we don't need to work with all the data we have available. 

Use the code you used in `class05-exercise` to randomly sample $N=500$ speeches from the dataframe `df` from _exercise 1.0_.

Filter the dataframe `df` based on the sampled indices and save the new dataframe in an object called `sample_df`. Remember to reset the index using `.reset_index()`, but this time specify `drop=False`. This creates a new column called `index`, which allows us to locate the sampled speeches in the original dataframe `df`.

Remember to set a seed to be able to replicate your results: `np.random.seed(10)`

#### Solution 1.1

In [6]:
# Solution here:

# Set seed
np.random.seed(10)

# Define number of samples
n_samples = 500

# Random sampling
sample_indices = np.random.choice(len(df), size=n_samples, replace=False)

# Filter df and save to sample_df
sample_df = df.loc[sample_indices].reset_index(drop=False)
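
To see what `drop=False` buys us, here is a small sketch on a made-up dataframe (the contents are hypothetical). It uses the same `np.random.choice` call as the solution and checks that the `index` column points back to the right rows of the original dataframe.

```python
import numpy as np
import pandas as pd

# Made-up dataframe standing in for `df`
toy_df = pd.DataFrame({"text": [f"tale {i}" for i in range(10)]})

np.random.seed(10)
toy_indices = np.random.choice(len(toy_df), size=3, replace=False)

# drop=False keeps the old row labels in a new column called `index`
toy_sample = toy_df.loc[toy_indices].reset_index(drop=False)

# Look the sampled rows up again in the original dataframe
recovered = toy_df.loc[toy_sample["index"], "text"].tolist()
print(toy_sample)
print(recovered == toy_sample["text"].tolist())
```

Note that `.loc` selects by label, so this only works as intended because the index of the original dataframe was reset to 0..n-1 beforehand.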

#### Exercise 1.2: Load spaCy Model

We'll work with the spaCy library (https://spacy.io/) when tokenizing and preprocessing our text.

Since we use Danish text, we need to load a Danish model (https://spacy.io/models/da): `da_core_news_sm`. Load the model with:

    spacy.load('da_core_news_sm')

and save it to an object called `spacy_pipeline_da`.

If you cannot load the model, download it first with:

    !python -m spacy download da_core_news_sm
    
   
After you have downloaded/loaded the model, define a list object called `texts`, which is the `text` column from `sample_df`. Remember to convert the object to a list.

#### Solution 1.2:

In [7]:
# Solution here:

# Load the model "da_core_news_sm"
spacy_pipeline_da = spacy.load("da_core_news_sm")

# Define a list called `texts` based on the `text` column in the sample_df dataframe
texts = list(sample_df['text'])

#### Exercise 1.3: Define a customized spaCy tokenizer

We want to create a custom spaCy tokenizer that takes a string as input and returns a list of tokens (each token's text) with punctuation filtered out.

To do this, define a function called `spacy_tokenizer`. See `class06-tutorial` for the syntax of creating your own functions in Python. The function should return the text of each token.

#### Solution 1.3:

In [8]:
# Solution here:

# Define custom tokenizer that removes punctuation
def spacy_tokenizer(doc):
    toks = [t for t in spacy_pipeline_da(doc) if not t.is_punct]
    return [t.text for t in toks]

#### Exercise 1.4: Binary Vectorizer

Once you have written the function, instantiate the `CountVectorizer` class from sklearn and save the instance to an object called `vectorizer`.

Pass your tokenizer `spacy_tokenizer` to the `tokenizer` parameter.

Besides passing your tokenizer, use the following arguments:

   - `binary=True`
   - `decode_error='ignore'`
   - `token_pattern=None`

Try reading the documentation to figure out what these parameters do.

#### Solution 1.4:

In [10]:
# Solution here:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, binary=True, decode_error='ignore', token_pattern=None)
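
As a rough illustration of what `binary=True` changes, the sketch below fits two vectorizers on made-up documents, one with counts and one with word presence. `str.split` is used as a simple stand-in for `spacy_tokenizer` so the example runs without the Danish model. Passing `token_pattern=None` avoids the warning sklearn otherwise raises because the default pattern is unused when a custom tokenizer is given, and `decode_error` only matters when the input is bytes rather than strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; "skat" occurs twice in the first one
toy_docs = ["skat skat regering", "regering"]

# str.split is a simple stand-in for the spaCy tokenizer
count_vec = CountVectorizer(tokenizer=str.split, token_pattern=None)
binary_vec = CountVectorizer(tokenizer=str.split, token_pattern=None, binary=True)

counts = count_vec.fit_transform(toy_docs).toarray()
presence = binary_vec.fit_transform(toy_docs).toarray()

print(count_vec.vocabulary_)  # token -> column index
print(counts)                 # term counts: "skat" is counted twice
print(presence)               # word presence: every non-zero count becomes 1
```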

#### Exercise 1.5: Fit Vectorizer

When you have instantiated the binary BoW vectorizer as the object `vectorizer`, fit it to the texts in the list `texts`. Write:

    vectorizer.fit(texts)

#### Solution 1.5:

In [11]:
# Fit the vectorizer to the raw `texts`; the vectorizer applies `spacy_tokenizer` to each speech
vectorizer.fit(texts)

In [1]:
# If you instead pass a list of token lists (e.g. an object `tokens` with one list of tokens per speech)
# to the vectorizer, you get an error saying:
#    `AttributeError: 'list' object has no attribute 'lower'`

# Can you figure out what's wrong? Try to solve the problem. Use the hints if you cannot solve it. Save the result to a new list called `processed_text`.

# *Hints:* The problem is that the `tokens` object contains nested lists. `CountVectorizer` expects every document to be a string, which it lower-cases and then passes to the tokenizer you provided. Because the list is nested, the vectorizer receives a list instead of a string for every document, hence the error. The solution is to flatten each nested list, that is, to join its tokens into a single string using:
#
#    `' '.join()`

# Note that you must wrap it inside a list comprehension!

# Join together tokens for each document
# processed_text = [' '.join(x) for x in tokens]

In [None]:
# Now we can apply the binary vectorizer to the list `processed_text`
#vectorizer.fit(processed_text)
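
The error described in the hint can be reproduced on toy data. The sketch below is only an illustration: the token lists are made up and `str.split` stands in for `spacy_tokenizer`. It first passes nested lists to the vectorizer, catches the resulting `AttributeError`, and then applies the `' '.join()` fix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already tokenized documents (one list of tokens per document)
toy_tokens = [["skat", "er", "vigtig"], ["regeringen", "foreslår", "skat"]]

toy_vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, binary=True)

# Passing lists instead of strings fails when the vectorizer tries to lower-case them
try:
    toy_vectorizer.fit(toy_tokens)
    error_message = None
except AttributeError as err:
    error_message = str(err)
print(error_message)

# Join each token list into one string per document, then fit again
toy_processed = [" ".join(doc) for doc in toy_tokens]
toy_vectorizer.fit(toy_processed)
print(sorted(toy_vectorizer.vocabulary_))
```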

#### Exercise 1.6: Inspect the Vocabulary

Now that we've fitted the speeches to the vectorizer (i.e. we have created our vocabulary), we can inspect the result.


1) Print out the vocabulary from the `vectorizer` object using the `vocabulary_` attribute.
2) Compute the length of the vocabulary.
3) Sort the vocabulary by index (since the vocabulary is a dictionary, you need to pass a lambda function to the `key` parameter of the `sorted()` function).

#### Solution 1.6:

In [None]:
# Write solutions for three bullets in three cells below here:

In [12]:
# 1) Print vocab
print(vectorizer.vocabulary_)

{'beslutningsforslaget': 1045, 'går': 3536, 'jo': 4339, 'ud': 8953, 'på': 6896, 'at': 755, 'folketinget': 2542, 'skal': 7642, ' ': 0, 'opfordre': 6211, 'regeringen': 7029, 'til': 8573, 'give': 3396, 'indsigt': 4153, 'i': 3971, 'en': 2053, 'række': 7347, 'nærmere': 6034, 'præciserede': 6845, 'dokumenter': 1750, 'den': 1638, 'sag': 7365, 'som': 7929, 'er': 2135, 'årsag': 9863, 'har': 3605, 'adskillige': 272, 'gange': 3294, 'været': 9780, 'genstand': 3365, 'for': 2564, 'opmærksomhed': 6271, 'både': 1417, 'her': 3693, 'og': 6110, 'form': 2778, 'af': 279, 'spørgsmål': 8001, 'skiftende': 7714, 'ministre': 5690, 'pressen': 6728, ' \xa0\xa0\xa0\xa0\xa0': 7, 'baggrunden': 789, 'skattemyndighederne': 7669, '1993': 86, 'tilsidesatte': 8659, '10-mands-projekter': 27, 'værende': 9779, 'uden': 8999, 'realitet': 6982, 'investorerne': 4283, 'projekterne': 6833, 'opfattelse': 6206, 'herved': 3717, 'skærpede': 7808, 'praksis': 6713, 'anlagt': 569, 'mod': 5733, 'skatteministeriet': 7667, 'disse': 1736, '

In [13]:
# 2) Compute number of unique tokens
len(vectorizer.vocabulary_)

9963

In [14]:
# 3) Sort vocabulary 
sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])

[(' ', 0),
 ('  ', 1),
 ('   ', 2),
 ('     ', 3),
 ('       ', 4),
 ('           \xa0\xa0\xa0\xa0\xa0', 5),
 ('     \xa0\xa0\xa0\xa0\xa0', 6),
 (' \xa0\xa0\xa0\xa0\xa0', 7),
 (' \xa0\xa0\xa0\xa0\xa0\xa0', 8),
 ('-afgrøder', 9),
 ('0,1', 10),
 ('0,25', 11),
 ('0,5', 12),
 ('0,8', 13),
 ('0,9', 14),
 ('1', 15),
 ('1,4', 16),
 ('1,5', 17),
 ('1,6', 18),
 ('1,8', 19),
 ('1-diabetes', 20),
 ('1-årig', 21),
 ('1.', 22),
 ('1.000', 23),
 ('1.100', 24),
 ('10', 25),
 ('10-15', 26),
 ('10-mands-projekter', 27),
 ('10-mands-projekterne', 28),
 ('10-årige', 29),
 ('10.000', 30),
 ('100', 31),
 ('100.000', 32),
 ('104', 33),
 ('10:13', 34),
 ('11', 35),
 ('11.', 36),
 ('111', 37),
 ('114', 38),
 ('115', 39),
 ('116', 40),
 ('12', 41),
 ('12.', 42),
 ('12.00', 43),
 ('12.000', 44),
 ('120', 45),
 ('125', 46),
 ('12:33', 47),
 ('13', 48),
 ('130', 49),
 ('134', 50),
 ('137', 51),
 ('14', 52),
 ('140', 53),
 ('148', 54),
 ('15', 55),
 ('15.05', 56),
 ('16', 57),
 ('16-års-valgret', 58),
 ('160', 59)
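
The sorting step can be illustrated on a made-up dictionary in the same token-to-index format as `vocabulary_`. `sorted()` receives `(token, index)` pairs from `.items()`, and the lambda picks the second element of each pair as the sort key. The inverted dictionary at the end is an optional extra for looking up which token sits behind a given column.

```python
# Made-up vocabulary in the same token -> column index format as `vectorizer.vocabulary_`
toy_vocabulary = {"skat": 2, "regering": 1, "dagpenge": 0}

# Sort the (token, index) pairs by the index, i.e. by the second element of each pair
sorted_by_index = sorted(toy_vocabulary.items(), key=lambda item: item[1])
print(sorted_by_index)

# Invert the mapping to look up the token behind a given column
index_to_token = {index: token for token, index in toy_vocabulary.items()}
print(index_to_token[2])
```

In the real output above, the first entries are whitespace-only tokens, which suggests that an extra cleaning step for whitespace would be worthwhile.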

#### Exercise 1.7: Transform the Documents

We have now created and inspected the vocabulary. The next step is to transform it to create a document-term-matrix. 

Use the `.transform()` method on the `processed_text` list you generated in exercise 1.5 and save the result to an object called `binary_bow`.

When this is done, convert the matrix to a NumPy array using the `.toarray()` method of your `binary_bow` object. The array should be saved to an object called `binary_bow_array`.

Compute the shape of `binary_bow_array`. What's the expected dimension? Does it match your expectation?

#### Solution 1.7:

In [None]:
# Solution here:

# Transform (the vectorizer was fitted on `texts` in solution 1.5, where `processed_text` is left commented out)
binary_bow = vectorizer.transform(texts)

# Numpy array
binary_bow_array = binary_bow.toarray()

# Compute shape of the array - what's the expected dimension?
binary_bow_array.shape
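
The expected dimension is (number of documents, size of the vocabulary). With 500 sampled speeches and the 9963 unique tokens found in exercise 1.6, that is $(500, 9963)$. The sketch below checks the same relationship on made-up documents, again with `str.split` standing in for the spaCy tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents with six distinct tokens in total
toy_docs = ["skat er vigtig", "regeringen foreslår skat", "dagpenge"]

toy_vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, binary=True)
toy_vectorizer.fit(toy_docs)

toy_bow = toy_vectorizer.transform(toy_docs)  # sparse document-term matrix
toy_bow_array = toy_bow.toarray()             # dense NumPy array

# One row per document, one column per vocabulary entry
expected_shape = (len(toy_docs), len(toy_vectorizer.vocabulary_))
print(toy_bow_array.shape, expected_shape)
```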

#### Exercise 1.8: Query Search

We have now tokenized, preprocessed, and vectorized our corpus. The final step is analysis and inference. Here we restrict ourselves to a simple but powerful application: a search engine that computes the similarity between documents (i.e. speeches) and a search query.

To do this, we define a query word and save it in a list called `query`. You can choose whatever word you like, but I have used "dagpenge". We use a list even though we have only one word because the `vectorizer` object expects an iterable of documents.

When you have done this, transform the `query` list into a vector so that we can numerically compare the query with our speeches. Use the `.transform()` method of the existing `vectorizer` and save the result to an object called `query_vector`.

Why do we transform the query using the existing vectorizer object and not a new one?


#### Solution 1.8:

In [None]:
# Solution here:

# Transform the query to a binary vector using the `vectorizer` and the `.transform()` method
query = ["dagpenge"]
query_vector = vectorizer.transform(query)
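
One way to think about the question in the exercise: the fitted vectorizer fixes which column belongs to which token, so a query transformed with it lives in the same space as the speeches. A sketch on made-up documents (with `str.split` as a stand-in tokenizer) shows that the query vector has one column per vocabulary entry, and that a word outside the vocabulary becomes an all-zero vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents with six distinct tokens in total
toy_docs = ["dagpenge er vigtige", "regeringen foreslår skat"]

toy_vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, binary=True)
toy_vectorizer.fit(toy_docs)

# A query word that is in the vocabulary and one that is not
known_query = toy_vectorizer.transform(["dagpenge"]).toarray()
unknown_query = toy_vectorizer.transform(["klima"]).toarray()

print(known_query.shape)                        # same number of columns as the vocabulary
print(known_query.sum(), unknown_query.sum())   # number of matched vocabulary entries
```

A freshly fitted vectorizer would instead build a one-word vocabulary, and the resulting vector could not be compared with the document-term matrix.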

#### Exercise 1.9: Query Similarity

We can now compute the cosine similarity between the `binary_bow_array` object and the `query_vector`. Use the `cosine_similarity` function imported from sklearn at the top of the notebook. Flatten the result using `.flatten()`, which converts the array from shape $(500, 1)$ to $(500,)$. Save the result to an object called `cos_sim`.


1. Verify the shape of `cos_sim` after flattening
2. Return the $k=5$ indices of the speeches with the highest cosine similarity
3. Inspect the cosine similarity for the top $k=5$ speeches
4. Print the text of the speech with the highest similarity by subsetting the `texts` list with the top index


I have provided a function below that helps you with step 2: `top_k`

In [None]:
# Define function that returns the top k indices
def top_k(arr, k):
    kth_largest = (k + 1) * -1
    return np.argsort(arr)[:kth_largest:-1]
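
To see how `top_k` works, here is a small check on made-up scores. `np.argsort` returns the indices that sort the array in ascending order, and the slice `[:kth_largest:-1]` walks backwards from the end and stops before position `-(k + 1)`, which leaves the indices of the `k` largest values in descending order.

```python
import numpy as np

# Same function as above, repeated so that the example is self-contained
def top_k(arr, k):
    kth_largest = (k + 1) * -1
    return np.argsort(arr)[:kth_largest:-1]

# Made-up similarity scores
toy_scores = np.array([0.1, 0.9, 0.3, 0.7])

top_two = top_k(toy_scores, 2)
print(top_two)              # indices of the two largest scores, largest first
print(toy_scores[top_two])  # the corresponding scores
```

When several speeches have the same similarity, for example many zeros, their relative order is whatever `np.argsort` happens to return.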

#### Solution 1.9

In [None]:
# Solution here: 

# Compute cosine similarity
cos_sim = cosine_similarity(binary_bow_array, query_vector).flatten()

# 1) Verify shape
print(cos_sim.shape)

# 2) Top k=5 speeches
top_related_indices = top_k(cos_sim, 5)
print(top_related_indices)

# 3) Cosine similarity for k=5
print(cos_sim[top_related_indices])

# 4) Top match
print(texts[top_related_indices[0]])
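
It helps to know what cosine similarity means for binary vectors when interpreting the ranking. The dot product counts the tokens two vectors share, and each norm is the square root of the number of distinct tokens. With a one-word query, the similarity is therefore $1/\sqrt{n}$ for a speech with $n$ distinct tokens that contains the word, and 0 otherwise. The sketch below computes this by hand with NumPy on made-up vectors.

```python
import numpy as np

# Made-up binary document vectors over a five-token vocabulary
doc_short = np.array([1, 1, 0, 0, 0])  # contains the query token, 2 distinct tokens
doc_long = np.array([1, 1, 1, 1, 0])   # contains the query token, 4 distinct tokens
doc_other = np.array([0, 0, 1, 1, 1])  # does not contain the query token
query_vec = np.array([1, 0, 0, 0, 0])  # one-word query

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, doc in [("short", doc_short), ("long", doc_long), ("other", doc_other)]:
    print(name, round(cosine(doc, query_vec), 3))
```

A consequence is that a one-word query ranks the shortest speeches containing the word highest, which is worth keeping in mind when reading the top match.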

#### Exercise 1.10: Advanced Preprocessing

The customized spaCy tokenizer we provided was very simple. We need to do more in actual applications.

In this exercise, you are encouraged to experiment with possible preprocessing steps and see how they affect the query similarity exercise.

Adapt the `spacy_tokenizer` you wrote in exercise 1.3.

Do the following:

- Remove punctuation (as already done in exercise 1.3)
- Lemmatize (can be done by returning the `lemma_` attribute of each token)
- Lower-casing
- Removal of stopwords (define a list of stopwords from `spacy_pipeline_da.Defaults.stop_words`)

You are free to use more preprocessing steps if you like. 

#### Solution 1.10

In [None]:
# Write your solution in the cells below here:

In [None]:
# Define a list with Danish stopwords
stop_words = sorted(list(spacy_pipeline_da.Defaults.stop_words))

In [None]:
# Customize tokenizer
def spacy_tokenizer(doc):
    toks = [t for t in spacy_pipeline_da(doc) if not t.is_punct]
    toks = [t.lemma_ for t in toks]
    toks = [t.lower() for t in toks]
    toks = [t for t in toks if t not in stop_words]
    return toks
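
The order of the steps in the tokenizer matters. The sketch below assumes a lower-case stop word list and uses a made-up mini list plus made-up lemmas instead of spaCy output, so it runs without the Danish model. It compares filtering before and after lower-casing. It also stores the stop words in a set, which makes each membership test constant-time, whereas a list is scanned element by element.

```python
# Hypothetical mini stop word list; the lab uses spacy_pipeline_da.Defaults.stop_words instead
toy_stop_words = {"og", "i", "at"}

# Stand-in for lemmas returned by spaCy (made up for illustration)
toy_lemmas = ["Og", "regering", "I", "Folketinget"]

# Filtering before lower-casing misses capitalised stop words
filtered_first = [t.lower() for t in toy_lemmas if t not in toy_stop_words]

# Lower-casing first, as in the tokenizer above, removes them
lowered = [t.lower() for t in toy_lemmas]
lowered_first = [t for t in lowered if t not in toy_stop_words]

print(filtered_first)
print(lowered_first)
```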

In [None]:
# Instantiate CountVectorizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, binary=True, decode_error='ignore', token_pattern=None)

# Fit binary vectorizer to the raw texts (the vectorizer applies `spacy_tokenizer` to each speech)
vectorizer.fit(texts)

# Transform the documents using .transform()
binary_bow = vectorizer.transform(texts)

# Convert to np array
binary_bow_array = binary_bow.toarray()

# Transform the query to a binary vector using the `vectorizer` and the `.transform()` method
query = ["dagpenge"]
query_vector = vectorizer.transform(query)

# Compute cosine similarity
cos_sim = cosine_similarity(binary_bow_array, query_vector).flatten()

# Verify shape
print(cos_sim.shape)

# Top k=5 speeches
top_related_indices = top_k(cos_sim, 5)
print(top_related_indices)

# Cosine similarity for k=5
print(cos_sim[top_related_indices])

# Top match
print(texts[top_related_indices[0]])