### To Use the Code in Google Colab

To run the notebooks and access the data in Google Colab, follow these steps:

1. **Create a Google account** if you don't already have one. This gives you access to both Colab and Google Drive.

2. **Use the shared link** provided to you to copy (clone) the notebooks into your own Google Drive.

3. **Download the corpora and exercise files** from the link provided, and upload them to your Google Drive.

4. **Grant Colab access to your Google Drive** by running the following code:

```python
from google.colab import drive
drive.mount('/content/drive')
```


5. **You can now read files from your Google Drive** using standard file paths. After mounting, the contents of "My Drive" appear under `/content/drive/MyDrive/`.
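As a minimal sketch of what "standard file paths" means once the Drive is mounted, the helper below builds paths under the mount point. The folder name `text_analysis` and the file name `example.txt` are hypothetical placeholders; substitute the locations you actually used when uploading.

```python
from pathlib import Path

# Mount point of "My Drive" after drive.mount('/content/drive') in Colab.
DRIVE_ROOT = Path("/content/drive/MyDrive")

def drive_path(*parts):
    """Build a path inside the mounted Drive from folder and file names."""
    return DRIVE_ROOT.joinpath(*parts)

# Hypothetical location of one uploaded file (adjust to your own folders).
example_file = drive_path("text_analysis", "preamble", "example.txt")
print(example_file)

# Only read the file if it is really there (e.g., when running inside Colab).
if example_file.exists():
    print(example_file.read_text(encoding="utf-8")[:200])
```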

---

# Text and Its Features

## Exercise 1.1

In this exercise, you will prepare your environment for text analysis using Python.

### Steps:

1. **Upload files to Google Colab**
   - Upload the provided Jupyter notebooks along with the required corpora and datasets (e.g., `.txt` and `.csv` files) so that they are accessible from Google Colab, for example through your mounted Google Drive.

2. **Open a notebook**
   - Open the exercise notebook where you will perform all tasks.

3. **Import necessary Python packages**
   - Import all libraries commonly used for text analysis, such as:
     - `pandas`
     - `nltk`
     - `sklearn` (installed under the package name `scikit-learn`)
     - `re`, `string`
     - `matplotlib`, `seaborn` (for visualization, if needed)
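
One possible way to confirm that the environment is ready is to try importing each package by name and report what is missing. This is only a sketch; the `available` dictionary is a helper introduced here for illustration.

```python
import importlib

# Import names from the list above ("sklearn" is the import name of scikit-learn).
PACKAGES = ["pandas", "nltk", "sklearn", "re", "string", "matplotlib", "seaborn"]

available = {}
for name in PACKAGES:
    try:
        module = importlib.import_module(name)
        available[name] = True
        # Some modules do not expose a __version__ attribute.
        version = getattr(module, "__version__", "no version attribute")
        print(f"{name}: OK ({version})")
    except ImportError:
        available[name] = False
        print(f"{name}: missing")
```

Once every entry reports OK, the usual imports (`import pandas as pd`, `import nltk`, `import matplotlib.pyplot as plt`, and so on) can go at the top of the notebook.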


## Exercise 1.2

In this exercise, you will load and combine two sources of data: the preambles of constitutions and metadata about countries.

### Steps:

1. The subfolder `'preamble'` in the exercise directory contains the **preambles of constitutions currently in force**, each stored as a separate text file.

2. **Load these preambles into a DataFrame**
   - Each row should represent one country or document.
   - Include a column for the country code or file name, and another for the full text of the preamble.

3. **Load the metadata CSV file**
   - This file contains country-level information (e.g., region, legal system, population).

4. **Merge the two DataFrames**
   - Merge them on a common key (e.g., country code) to create one unified DataFrame.
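
The steps above could be sketched as follows. The column names `country_code`, `text`, and `region`, as well as the toy files and metadata, are assumptions made for illustration; in the exercise the folder would be the `preamble` subfolder and the metadata would come from `pd.read_csv` on the provided CSV file.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_preambles(folder):
    """Read every .txt file in `folder` into a DataFrame, one row per file."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append({
            "country_code": path.stem,  # file name without the extension
            "text": path.read_text(encoding="utf-8"),
        })
    return pd.DataFrame(rows, columns=["country_code", "text"])

# Toy demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AAA.txt").write_text("We, the people of A.", encoding="utf-8")
    (Path(tmp) / "BBB.txt").write_text("We, the people of B.", encoding="utf-8")
    preambles = load_preambles(tmp)

# Made-up metadata; in the exercise this would be loaded with pd.read_csv.
metadata = pd.DataFrame({
    "country_code": ["AAA", "BBB"],
    "region": ["North", "South"],
})

# A left merge keeps every preamble, even if its metadata row is missing.
merged = preambles.merge(metadata, on="country_code", how="left",
                         validate="one_to_one")
print(merged)
```

Using `how="left"` is a design choice: it keeps documents without metadata visible as rows with missing values, which makes mismatched keys easier to spot than an inner merge would.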


## Exercise 1.3

In this exercise, you will analyze the constitutional preambles loaded in the previous exercise.

### Task:

1. **Answer the following questions:**

   1. 📝 **Which country has the lengthiest preamble?**
      - Use character count or word count to determine the length of each preamble.

   2. 📚 **Which country has the most difficult-to-read preamble?**
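      - Use a readability metric such as Flesch Reading Ease, Gunning Fog Index, or another suitable readability formula. Mind the direction of the scale: a lower Flesch Reading Ease score indicates harder text, whereas a higher Gunning Fog Index does.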
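
A rough sketch of both measurements is shown below, using word count for length and a hand-written Flesch Reading Ease score. The syllable counter is a crude vowel-group heuristic and an assumption of this sketch, so scores will differ somewhat from those of a dedicated readability package such as `textstat`. The two example texts and the column names are made up.

```python
import re

import pandas as pd

def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return float("nan")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

# Made-up texts standing in for the preambles.
df = pd.DataFrame({
    "country_code": ["AAA", "BBB"],
    "text": [
        "We the people want peace. We want law.",
        "Recognising the inalienable sovereignty of the nation, the "
        "constituent assembly solemnly promulgates this fundamental "
        "constitutional instrument.",
    ],
})

df["n_words"] = df["text"].str.split().str.len()
df["flesch"] = df["text"].apply(flesch_reading_ease)

longest = df.loc[df["n_words"].idxmax(), "country_code"]
hardest = df.loc[df["flesch"].idxmin(), "country_code"]  # lowest score = hardest
print(longest, hardest)
```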


## Exercise 1.4

In this exercise, you will **tokenize** the preambles of constitutions.

### Task:

1. **Tokenize each document**
   - Use a tokenizer from `nltk` (e.g., `nltk.word_tokenize`) or another reliable library.
   - Tokenization means splitting each document into individual words or tokens.
   - Store the tokenized version in a new column of your DataFrame.

This step prepares the text for further processing such as stopword removal, lemmatization, or frequency analysis.
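
As a minimal illustration, the sketch below tokenizes a made-up sentence with `wordpunct_tokenize`, a regex-based tokenizer from `nltk` that needs no additional data download. This is a swapped-in choice for the sake of a self-contained example; `nltk.word_tokenize` can be applied in the same way, but it requires the Punkt tokenizer data to be downloaded first with `nltk.download` (the resource is named `punkt` or `punkt_tab`, depending on the installed version). Lowercasing before tokenizing is also an assumption of this sketch.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Made-up text standing in for a preamble.
df = pd.DataFrame({
    "country_code": ["AAA"],
    "text": ["We, the people, adopt this Constitution."],
})

# Lowercase first so that "We" and "we" end up as the same token,
# then split into alphanumeric runs and punctuation runs.
df["tokens"] = df["text"].str.lower().apply(wordpunct_tokenize)
print(df.loc[0, "tokens"])
```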


## Exercise 1.5

In this exercise, you will clean your tokenized documents by removing unwanted elements.

### Tasks:

1. **Remove stopwords and digits**
   - Use a stopword list from `nltk.corpus.stopwords` or a custom list.
   - Remove numeric tokens or any tokens that contain digits.

2. **Define and remove domain-specific words**
   - Identify a list of domain-specific or overused words that are not helpful for analysis (e.g., "preamble", "constitution", "state").
   - Create a custom set of these words and remove them from your tokenized documents.

The cleaned tokens will be more meaningful for later analysis steps such as frequency counting or sentiment analysis.
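
The cleaning steps could look roughly like the sketch below. The tiny `STOPWORDS` set is a hand-written stand-in so the example runs without downloads; in the exercise it could be replaced by `set(nltk.corpus.stopwords.words("english"))` after `nltk.download("stopwords")`. The function name `clean_tokens` and the example tokens are hypothetical.

```python
# Hand-written stand-in for a real stopword list (assumption of this sketch).
STOPWORDS = {"we", "the", "of", "this", "in", "and", "a", "to"}

# Domain-specific words taken from the examples in the task description.
DOMAIN_WORDS = {"preamble", "constitution", "state"}

def clean_tokens(tokens, stopwords=STOPWORDS, domain_words=DOMAIN_WORDS):
    """Drop stopwords, domain words, and any token that contains a digit."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in stopwords or token in domain_words:
            continue
        if any(ch.isdigit() for ch in token):
            continue
        cleaned.append(token)
    return cleaned

tokens = ["we", "the", "people", "of", "the", "state", "adopt",
          "this", "constitution", "in", "1991", "article2"]
print(clean_tokens(tokens))
```

With a DataFrame, the same function can be applied to a token column, for example `df["tokens"].apply(clean_tokens)`.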


## Exercise 1.6

In this exercise, you will apply **lemmatization** and **stemming** to your tokenized documents.

### Tasks:

1. **Lemmatization**
   - Use a lemmatizer such as `WordNetLemmatizer` from `nltk.stem`.
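   - Lemmatization reduces words to their base or dictionary form (e.g., "running" → "run" when treated as a verb). Note that `WordNetLemmatizer` treats every word as a noun unless you pass a part-of-speech tag such as `pos="v"`.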

2. **Stemming**
   - Use a stemmer such as `PorterStemmer` or `SnowballStemmer`.
   - Stemming cuts off word suffixes to reduce them to a common root, which need not be a dictionary word (e.g., "governing" → "govern").

👉 You can store the lemmatized and stemmed versions of your documents in separate columns to compare their effects.
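
A short sketch of both techniques follows. The stemmers run as-is; the lemmatizer is wrapped in a function (`lemmatize_tokens`, a hypothetical helper) that is defined but not called here, because `WordNetLemmatizer` needs the WordNet data, which has to be fetched once with `nltk.download("wordnet")`. The example tokens are made up.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

tokens = ["running", "governing", "laws", "rights"]

# Stemming: rule-based suffix stripping, no extra data required.
porter = PorterStemmer()
snowball = SnowballStemmer("english")
porter_stems = [porter.stem(t) for t in tokens]
snowball_stems = [snowball.stem(t) for t in tokens]
print(porter_stems)
print(snowball_stems)

def lemmatize_tokens(tokens, pos="n"):
    """Lemmatize tokens with WordNet; run nltk.download('wordnet') once first.

    `pos` is the WordNet part-of-speech tag: "n" noun, "v" verb,
    "a" adjective, "r" adverb.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t, pos=pos) for t in tokens]

# Example call, once the WordNet data is available:
# lemmas = lemmatize_tokens(tokens, pos="v")
```

In a DataFrame, each result can go into its own column (for instance via `df["tokens"].apply(...)`) so the stemmed and lemmatized versions can be compared side by side.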