### To Use the Code in Google Colab

To run the notebooks and access the data in Google Colab, follow these steps:

1. **Create a Google account** if you don't already have one. This gives you access to both Colab and Google Drive.

2. **Use the shared link** provided to you to copy (clone) the notebooks into your own Google Drive.

3. **Download the corpora and exercise files** from the link provided, and upload them to your Google Drive.

4. **Grant Colab access to your Google Drive** by running the following code:

```python
from google.colab import drive
drive.mount('/content/drive')
```


5. **Read files from your Google Drive** using standard file paths. Once the Drive is mounted, its contents appear under `/content/drive/MyDrive`.
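
As a minimal sketch of step 5, the helper below lists the text files in a Drive folder. The folder name `corpora` and the helper name `list_text_files` are hypothetical; substitute the folder you actually uploaded to.

```python
from pathlib import Path

# After drive.mount('/content/drive'), "My Drive" is exposed at this path in Colab.
DRIVE_ROOT = Path("/content/drive/MyDrive")


def list_text_files(folder_name, root=DRIVE_ROOT):
    """Return the sorted .txt files in a folder under the Drive root.

    Returns an empty list when the folder does not exist, for example
    when the Drive has not been mounted yet.
    """
    folder = Path(root) / folder_name
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.txt"))


# "corpora" is an assumed folder name, used here only for illustration.
for path in list_text_files("corpora"):
    print(path.name)
```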

---

# Exercises for session 2

## Exercise 2.1

In this exercise, you will build a document-feature matrix (also called a document-term matrix) from a collection of text files.

### Tasks:

1. **Read the corpus**
   - Load all text files from the specified folder.

2. **Create a DataFrame**
   - Each row should represent one document.
   - Include a variable (column) that stores the full text of each document.

3. **Create a Document-Feature Matrix (DFM)**
   - Use a vectorizer (e.g., `CountVectorizer`) to convert the text into numerical features.
   - Each row should represent a document, and each column a word.
   - The values should reflect how often each word appears in each document.

## Exercise 2.2

In this exercise, you will build your own custom **text preprocessor** and pass it to `CountVectorizer()`.

### Tasks:

1. **Write a custom preprocessing function in Python.**
   - This function will take a raw text string as input and return a list of cleaned tokens.

2. **The function should include the following steps:**
   - ✅ Tokenize using **NLTK's tokenizer**
   - ✅ Remove punctuation
   - ✅ Remove stopwords
   - ✅ Apply **lemmatization** to each token

3. **Use your custom function with `CountVectorizer()`**
   - Pass it via the `analyzer` argument (not `preprocessor`, which expects a function that returns a string) to create a document-term matrix.
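
A rough sketch of how such a function could be organized follows. The helper names `make_analyzer` and `make_nltk_analyzer` are hypothetical. The tokenizer, lemmatizer, and stopword list are passed in as arguments so the cleaning logic can be tried out with simple stand-ins; the NLTK wiring assumes the tokenizer, stopword, and WordNet resources have already been fetched with `nltk.download`.

```python
def make_analyzer(tokenize, lemmatize, stop_words):
    """Build a function that maps raw text to a list of cleaned tokens."""
    stop_words = set(stop_words)

    def analyzer(text):
        tokens = tokenize(text.lower())
        kept = [
            token
            for token in tokens
            # Drop stopwords and tokens made only of punctuation.
            if token not in stop_words and any(ch.isalnum() for ch in token)
        ]
        return [lemmatize(token) for token in kept]

    return analyzer


def make_nltk_analyzer():
    """Plug NLTK components into make_analyzer (assumes NLTK data is installed)."""
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    return make_analyzer(word_tokenize, lemmatizer.lemmatize, stopwords.words("english"))
```

The result can then be handed to the vectorizer as `CountVectorizer(analyzer=make_nltk_analyzer())`. When `analyzer` is a callable, scikit-learn skips its own lowercasing, tokenization, and stopword handling, so the function has to do all of that itself.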


## Exercise 2.3

In this exercise, you will convert raw text into a **Term Frequency (TF) matrix**.

### Task:

1. **Convert the text into a Term Frequency matrix**
   - Use `CountVectorizer()` and divide each row by its total count, or use `TfidfVectorizer(use_idf=False, norm='l1')`, to compute **normalized term frequencies**.
   - Make sure your vectorizer uses the custom preprocessor you built in the previous exercise.
   - The resulting matrix should contain **relative term frequencies** (not raw counts).
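
The normalization matters here: by default `TfidfVectorizer` scales each row to unit Euclidean length (`norm='l2'`), which does not give proportions, whereas `norm='l1'` makes each row sum to 1. A small sketch on toy sentences, with `simple_analyzer` as a stand-in (an assumption) for the function from Exercise 2.2:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def simple_analyzer(text):
    """Stand-in for the custom analyzer from Exercise 2.2 (assumption)."""
    return text.lower().split()


docs = ["We the people", "The people the state"]

# use_idf=False switches off the IDF weighting; norm="l1" divides each row
# by its sum, so every entry is a word's share of the document's tokens.
tf_vectorizer = TfidfVectorizer(analyzer=simple_analyzer, use_idf=False, norm="l1")
tf_matrix = tf_vectorizer.fit_transform(docs)

the_column = tf_vectorizer.vocabulary_["the"]
print(tf_matrix.toarray()[1, the_column])  # share of "the" in the second document
```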


## Exercise 2.4

In this exercise, you will convert your corpus into a **TF-IDF matrix**.

### Task:

1. **Convert the texts to a TF-IDF matrix**
   - Use `TfidfVectorizer()` from `sklearn.feature_extraction.text`.
   - Make sure to pass your custom preprocessor (from Exercise 2.2) via the `analyzer` argument.
   - This matrix will reflect both:
     - **Term Frequency** (how often a word appears in a document)
     - **Inverse Document Frequency** (how rare the word is across all documents)

The result will be a matrix where each value represents how important a word is in a specific document, relative to the entire corpus.


## Exercise 2.5

In this exercise, you will practice reading and explaining code by writing clear and helpful comments.

### Task:

1. **Write comments on a given piece of code**
   - For each major step or line, explain **what** the code does and **why** it is needed.
   - Use complete sentences or clear phrases.
   - Write comments in a way that someone new to text processing could understand.

This exercise will help you build a deeper understanding of the logic behind text preprocessing and matrix construction.


## Exercise 2.6

In this exercise, you will perform **sentiment analysis** on the **preambles of constitutions**.

### Task:

1. **Analyze the sentiment of each preamble**
   - Use a sentiment analysis tool such as **TextBlob** or **VADER**.
   - Apply it to the full text of each preamble to calculate sentiment scores (e.g., TextBlob's polarity or VADER's compound score).

2. **Answer the following questions:**
   - 🟢 **Which preamble has the highest sentiment score?**
   - 🔴 **Which preamble has the lowest sentiment score?**

Make sure to include both the **country name** (or document ID) and the **sentiment score** in your answers.
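
One way to organize this is to keep the scoring function separate from the ranking logic, so that TextBlob and VADER can be swapped freely. In the sketch below, the helper names and the dictionary layout (country name mapped to preamble text) are assumptions, and `textblob_polarity` assumes the `textblob` package is installed.

```python
def textblob_polarity(text):
    """Polarity score in [-1, 1] from TextBlob (assumes textblob is installed)."""
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def sentiment_extremes(preambles, score_fn):
    """Return (country, score) pairs for the highest and lowest sentiment.

    preambles maps a country name (or document ID) to the preamble text;
    score_fn maps a text to a single sentiment score.
    """
    scores = {country: score_fn(text) for country, text in preambles.items()}
    highest = max(scores, key=scores.get)
    lowest = min(scores, key=scores.get)
    return (highest, scores[highest]), (lowest, scores[lowest])
```

With the preambles loaded into a dictionary, `sentiment_extremes(preambles, textblob_polarity)` returns both answers together with their scores.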


## Homework

In his paper *Constitutional Archetypes*, constitutional law scholar **David Law** argues that constitutional preambles can be grouped into three main categories:

- **Liberal**
- **Statist**
- **Universalist**

Law provides linguistic and conceptual evidence supporting these three archetypes based on word choice and framing.

### Task:

1. **Read the paper by David Law carefully**
   - Focus especially on the sections that define and describe the three preamble types.

2. **Create dictionaries for each archetype**
   - Based on the paper, build a **dictionary of representative words** for each category:
     - Liberal
     - Statist
     - Universalist

3. **Apply the dictionary approach**
   - Write Python code that scans each preamble and calculates the number (or proportion) of words matching each dictionary.
   - Assign or score each preamble according to how strongly it aligns with each archetype.

4. **Optional:**
   - Visualize the distribution of preamble types across countries or regions.
   - Explore whether certain types are more common in specific eras or continents.
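
The dictionary approach in step 3 could be sketched as follows. The word lists are placeholders invented for illustration and are not taken from Law's paper; replace them with the dictionaries you build in step 2. The function names are likewise hypothetical.

```python
import re

# Placeholder word lists for illustration only; NOT taken from Law's paper.
ARCHETYPE_WORDS = {
    "liberal": {"liberty", "freedom", "rights"},
    "statist": {"state", "nation", "sovereignty"},
    "universalist": {"humanity", "international", "peace"},
}


def tokenize(text):
    """Lowercase the text and keep runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def archetype_scores(text, dictionaries=ARCHETYPE_WORDS):
    """Return, per archetype, the share of tokens that match its dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return {name: 0.0 for name in dictionaries}
    return {
        name: sum(token in words for token in tokens) / len(tokens)
        for name, words in dictionaries.items()
    }


def best_archetype(text, dictionaries=ARCHETYPE_WORDS):
    """Return the archetype with the highest share (the first one wins ties)."""
    scores = archetype_scores(text, dictionaries)
    return max(scores, key=scores.get)
```

Using proportions rather than raw counts keeps long and short preambles comparable. Matching on exact word forms is a simplification; lemmatizing the tokens first, as in Exercise 2.2, would also catch inflected forms.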