<span>
<b>Methods Workshop in Quantitative Text Analysis </b><br/>     
<b>Author:</b> <a href="https://github.com/jisukimmmm">Jisu Kim</a><br/>
<b>Python version:</b>  >=3.6<br/>
<b>Last update:</b> 21/04/2024
</span>

<a id='top'></a>
# *Introduction to text analysis - Exercises*

Here, we'll dive into various text processing techniques, using the beloved classic "Alice's Adventures in Wonderland" by Lewis Carroll as our playground. This exercise set guides you through a series of hands-on tasks aimed at developing your skills in string manipulation, text normalization, tokenization, and more advanced topics such as Named Entity Recognition (NER) and word embeddings. Each exercise is designed to help you understand the underlying principles of text analysis while engaging with the whimsical and imaginative text of "Alice's Adventures in Wonderland".

Source of the text: https://www.geeksforgeeks.org/

### Built-in string manipulation functions
1. Find and print the first and last occurrence of the word "Alice" in the book.
2. Split the text of the first chapter.
3. Replace the name "Alice" with your own name.
### Lowercasing
1. Convert the entire text of the first chapter to lowercase.
2. Compare two sentences from different parts of the book case-insensitively to check if they are the same.
### Removing Punctuation, Numbers, and Special Characters
1. Remove all punctuation from the first paragraph of the book.
2. Write a function to remove any numbers and special characters found in the text of the first chapter.
### Handling Contractions
1. Expand all contractions in the famous quote "We're all mad here." from the book using a dictionary of contractions.
### Spell Checking and Correction
1. Identify and correct any intentionally misspelled words in a given excerpt from the book.
### Tokenization
1. Tokenize the first paragraph into sentences and words.
2. Write a function that tokenizes a sentence into words without using any library functions.
### Regular Expressions
1. Extract all email addresses (if any) from the text using regular expressions. (You can add fake email addresses for fun.)
2. Find all uppercase words in the first chapter using regular expressions.
### Normalization
1. Normalize a given excerpt by converting it to lowercase and removing punctuation.
2. Write a function that normalizes whitespace in the text (e.g., convert multiple spaces to a single space).
### Stemming
1. Apply stemming to a list of words from a random page in the book using the Porter Stemmer.
2. Compare the results of different stemming algorithms on the same text excerpt.
### Lemmatization
1. Lemmatize a list of words from the book using NLTK's WordNet lemmatizer.
2. Compare the output of lemmatization and stemming on the same set of words.
### Part of Speech (POS) Tagging
1. Return the POS tags of all words in a selected paragraph.
2. Count the number of nouns, verbs, and adjectives in the selected paragraph. 
### Named Entity Recognition (NER)
1. Extract all the names of characters mentioned in the book using NER.
### Bag of Words (BoW)
1. Create a Bag of Words representation for the first chapter.
### TF-IDF
1. Compute the TF-IDF scores for a collection of chapters.
2. Write a function that returns the top 5 words with the highest TF-IDF scores in a specific chapter.
### Word Embeddings
1. Find the most similar words to "Rabbit".
2. Write a function that computes the cosine similarity between two words, such as "Alice" and "Wonderland".
### N-grams
1. Generate and print all bigrams and trigrams from the Mad Hatter's tea party scene.
2. Write a function that counts the frequency of each n-gram in the first chapter.
### Text Length and Complexity
1. Calculate the average word length and sentence length in the first chapter.
2. Write a function that computes the Flesch Reading Ease score of the book's first chapter.
### Collocation and Terminology Extraction
1. Identify and print the most common collocations in the conversation between Alice and the Cheshire Cat.
2. Write a function that extracts key terms from a selected chapter based on their frequency and context.
3. Plot the frequency distribution of these key terms using matplotlib.
4. Remove punctuation and stopwords and redo the plot.

# Answers

### Built-in string manipulation functions
1. Find and print the first and last occurrence of the word "Alice" in the book.

2. Split the text of the first chapter.

3. Replace the name "Alice" with your own name.
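
The three tasks above can be sketched with Python's built-in string methods. This is a minimal sketch, assuming an abridged stand-in excerpt in place of the full book text; the variable names and the placeholder name "Sam" are illustrative only.

```python
# Abridged stand-in excerpt; in the notebook this would be the full book text.
text = (
    "Alice was beginning to get very tired of sitting by her sister on the bank. "
    "Suddenly a White Rabbit with pink eyes ran close by her. "
    "There was nothing so very remarkable in that, nor did Alice think it so very much out of the way."
)

# 1. Character offsets of the first and last "Alice" (-1 would mean "not found").
first_index = text.find("Alice")
last_index = text.rfind("Alice")
print(f"First occurrence at {first_index}, last occurrence at {last_index}")

# 2. Split into sentences on ". " and into words on whitespace.
sentences = text.split(". ")
words = text.split()
print(len(sentences), "sentences,", len(words), "words")

# 3. Replace every "Alice" with another name ("Sam" is a placeholder).
renamed = text.replace("Alice", "Sam")
print(renamed)
```

Note that `str.find` and `str.rfind` return character offsets, not word positions, and that splitting on `". "` is only a rough approximation of sentence boundaries.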

### Lowercasing
1. Convert the entire text of the first chapter to lowercase.

2. Compare two sentences from different parts of the book case-insensitively to check if they are the same.
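
A possible sketch for both lowercasing tasks, assuming a one-sentence stand-in for the first chapter. The helper name `same_ignoring_case` is hypothetical; `str.casefold` is used for the comparison because it is a more aggressive form of lowercasing intended for caseless matching.

```python
# One-sentence stand-in for the text of the first chapter.
chapter_1 = "Alice was beginning to get very tired of sitting by her sister on the bank."

# 1. Lowercase the whole text.
chapter_1_lower = chapter_1.lower()
print(chapter_1_lower)


# 2. Case-insensitive comparison of two sentences.
def same_ignoring_case(first: str, second: str) -> bool:
    """Return True if the two strings are equal once case is ignored."""
    return first.casefold() == second.casefold()


print(same_ignoring_case("Oh dear! I shall be late!", "OH DEAR! I SHALL BE LATE!"))
```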

### Removing Punctuation, Numbers, and Special Characters
1. Remove all punctuation from the first paragraph of the book.

2. Write a function to remove any numbers and special characters found in the text of the first chapter.
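
One way to approach both tasks with the standard library is sketched below. It assumes a short excerpt of the first paragraph; the function name `remove_numbers_and_special` is illustrative. `string.punctuation` covers ASCII punctuation only, so curly quotes found in some editions of the text would need to be added separately.

```python
import re
import string

# Short excerpt standing in for the first paragraph.
first_paragraph = (
    'it had no pictures or conversations in it, "and what is the use of a book," '
    'thought Alice "without pictures or conversations?"'
)

# 1. Delete every ASCII punctuation character.
no_punct = first_paragraph.translate(str.maketrans("", "", string.punctuation))
print(no_punct)


# 2. Keep only ASCII letters and whitespace.
def remove_numbers_and_special(text: str) -> str:
    """Drop digits, punctuation, and any other non-letter, non-space character."""
    return re.sub(r"[^A-Za-z\s]", "", text)


print(remove_numbers_and_special("Chapter 1: Down the Rabbit-Hole!"))
```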

### Handling Contractions
1. Expand all contractions in the famous quote "We're all mad here." from the book using the `contractions` library.
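
As an alternative to the `contractions` library, the dictionary-based approach named in the exercise list can be sketched as follows. The mapping below is deliberately tiny and hand-written, and `expand_contractions` is a hypothetical helper, not a library function.

```python
import re

# A deliberately tiny mapping; a real one would list many more contractions.
CONTRACTIONS = {
    "we're": "we are",
    "you're": "you are",
    "i'm": "i am",
    "can't": "cannot",
    "don't": "do not",
}

# One pattern that matches any key as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions, keeping a leading capital letter."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones

    def _expand(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        if word[0].isupper():
            return expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_expand, text)


print(expand_contractions("We're all mad here."))
```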

### Spell Checking and Correction
1. Identify and correct any intentionally misspelled words in a given excerpt from the book.
The 13th paragraph has been purposely misspelled.
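
A rough sketch of dictionary-based correction using only the standard library's `difflib`. This swaps in fuzzy matching against a small, hypothetical vocabulary rather than a dedicated spell-checking package, and the misspelled sample sentence is made up for illustration; it is not the notebook's 13th paragraph.

```python
from difflib import get_close_matches

# Hypothetical reference vocabulary; in practice this could be built from a clean copy of the book.
VOCABULARY = ["alice", "rabbit", "pocket", "watch", "hurried", "waistcoat"]


def correct_word(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the word unchanged if nothing is close enough."""
    lowered = word.lower()
    if lowered in VOCABULARY:
        return word
    matches = get_close_matches(lowered, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str) -> str:
    """Correct a whitespace-separated text word by word (punctuation is not handled)."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_text("the rabit took a wach"))
```

The `cutoff` of 0.8 is an arbitrary choice: lower values correct more aggressively but also risk changing words that were already right.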

### Tokenization
1. Tokenize the first paragraph into sentences and words.

2. Write a function that tokenizes a sentence into words without using any library functions.
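
Both tokenization tasks can be approximated without NLTK, as in the sketch below. The regular expressions are a naive stand-in for a trained sentence tokenizer (abbreviations such as "Mr." would be split incorrectly), the sample lines are a short stand-in for the first paragraph, and `tokenize` is a hypothetical helper that relies only on built-in string methods.

```python
import re

# Short stand-in lines from the first chapter.
paragraph = (
    "Alice was beginning to get very tired of sitting by her sister on the bank. "
    "Oh dear! I shall be late!"
)

# 1. Sentences: split on whitespace that follows ., ! or ?. Words: runs of letters and apostrophes.
sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
words = re.findall(r"[A-Za-z']+", paragraph)
print(sentences)
print(words)


# 2. A hand-rolled word tokenizer that imports nothing.
def tokenize(sentence: str) -> list:
    """Collect runs of letters, digits, and apostrophes; everything else is a separator."""
    tokens = []
    current = ""
    for ch in sentence:
        if ch.isalnum() or ch == "'":
            current += ch
        elif current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(tokenize("We're all mad here."))
```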

### Regular Expressions
1. Extract all email addresses (if any) from the text using regular expressions. (You can add fake email addresses for fun.)

2. Find all uppercase words in the first chapter using regular expressions.
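
A minimal sketch of both regular-expression tasks. The sample text and the two email addresses are invented for the exercise (they use the reserved `.example` domain), and the email pattern is a simplified one that does not cover every valid address.

```python
import re

# Invented sample: chapter heading, two fake addresses, and an all-caps label from chapter 1.
text = (
    "CHAPTER I. Down the Rabbit-Hole. "
    "Write to alice@wonderland.example or white.rabbit@court.example. "
    "The jar was labelled 'ORANGE MARMALADE'."
)

# 1. Simplified email pattern: local part, "@", domain, and a dot followed by 2+ letters.
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)
print(emails)

# 2. Words written entirely in capitals, at least two letters long
#    (this skips the pronoun "I" and Roman numeral "I").
uppercase_words = re.findall(r"\b[A-Z]{2,}\b", text)
print(uppercase_words)
```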

### Normalization
1. Normalize a given excerpt by converting it to lowercase and removing punctuation.

2. Write a function that normalizes whitespace in the text (e.g., convert multiple spaces to a single space). Use the same `misspelled_text` as above.
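
The two normalization steps could look like the sketch below. The function names are illustrative, and the sample strings stand in for the notebook's own excerpt, which is not reproduced here.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def normalize_whitespace(text: str) -> str:
    """Collapse any run of spaces, tabs, or newlines into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Oh dear! Oh dear!"))
print(normalize_whitespace("  Down,   down,\n down. "))
```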

### Stemming
1. Apply stemming to a list of words from a random page in the book using the Porter Stemmer.

2. Compare the results of different stemming algorithms on the same text excerpt.

### Lemmatization
1. Lemmatize a list of words from the book using NLTK's WordNet lemmatizer.

2. Compare the output of lemmatization and stemming on the same set of words.

### Part of Speech (POS) Tagging
1. Return the POS tags of all words in a selected paragraph.

2. Count the number of nouns, verbs, and adjectives in the selected paragraph. 

### Named Entity Recognition (NER)
1. Identify and classify named entities in the first chapter using NLTK.

### Bag of Words (BoW)
1. Create a Bag of Words representation for the first chapter.
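
A Bag of Words is simply a count of how often each word occurs, ignoring order. A minimal sketch with `collections.Counter`, assuming a single line of the first chapter as input; scikit-learn's `CountVectorizer` expresses the same idea as a document-term matrix.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each lowercased word to the number of times it occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# One line standing in for the first chapter.
bow = bag_of_words("Down, down, down. Would the fall never come to an end?")
print(bow.most_common(3))
```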

### TF-IDF (Term Frequency-Inverse Document Frequency)
1. Compute the TF-IDF scores for a collection of chapters.

2. Write a function that returns the top 5 words with the highest TF-IDF scores in a specific chapter.
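
The sketch below computes TF-IDF by hand to make the arithmetic visible: term frequency is a word's share of its chapter, and inverse document frequency is `log(N / df)`. The three one-line "chapters" are made-up stand-ins, and the function names are hypothetical. Libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(chapters):
    """Return one {word: score} dictionary per chapter."""
    tokenized = [tokenize(chapter) for chapter in chapters]
    n_docs = len(tokenized)
    # Document frequency: in how many chapters does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {word: (count / total) * math.log(n_docs / doc_freq[word])
             for word, count in counts.items()}
        )
    return scores


def top_words(chapter_scores, n=5):
    """Return the n (word, score) pairs with the highest TF-IDF score."""
    return sorted(chapter_scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up one-line stand-ins for three chapters.
chapters = [
    "the rabbit ran down the hole",
    "the queen shouted at the gardeners",
    "the hatter poured the tea",
]
scores = tf_idf(chapters)
print(top_words(scores[0]))
```

A word that appears in every chapter, such as "the" here, gets a score of zero because `log(N / N) = 0`.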

### Word Embeddings
1. Find the most similar words to "Rabbit".

2. Write a function that computes the cosine similarity between two words, such as "Alice" and "Wonderland".
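
Cosine similarity itself needs no model: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses tiny made-up vectors purely to exercise the functions; they are not real embeddings, and in the exercise the vectors would come from a trained word-embedding model. `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, n=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors[word]
    ranked = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:n]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "rabbit": [0.9, 0.1, 0.0],
    "hare": [0.8, 0.2, 0.1],
    "queen": [0.1, 0.9, 0.2],
    "tea": [0.0, 0.2, 0.9],
}
print(most_similar("rabbit", toy_vectors))
```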

### N-grams
1. Generate and print all bigrams and trigrams from the Mad Hatter's tea party scene.

2. Write a function that counts the frequency of each n-gram in the first chapter.
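
An n-gram is a run of `n` consecutive tokens, so both tasks reduce to sliding a window over the token list. A minimal sketch, assuming a single line from the tea-party scene as input and a simple regex tokenizer; the function names are illustrative.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(text, n):
    """Count how often each n-gram occurs in the lowercased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(ngrams(tokens, n))


# One line from the tea-party scene, standing in for the full passage.
scene = "No room! No room! they cried out when they saw Alice coming."
tokens = re.findall(r"[a-z']+", scene.lower())

print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(ngram_counts(scene, 2).most_common(1))
```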

### Text Length and Complexity
1. Calculate the average word length and sentence length in the first chapter.

2. Write a function that computes the Flesch Reading Ease score of the book's first chapter.
   
Note: The Flesch Reading Ease score is one of the Flesch–Kincaid readability tests, which indicate how difficult a passage in English is to understand. Scores typically fall between 0 and 100, and a higher score means the text is easier to read; very simple or very complex passages can score outside this range. The formula is:

$$\text{Flesch Reading Ease} = 206.835 - (1.015 \times \text{ASL}) - (84.6 \times \text{ASW})$$

Where:

ASL = (Total number of words) / (Total number of sentences)

ASW = (Total number of syllables) / (Total number of words)

206.835: This is a base score from which deductions are made. It represents the starting point for the readability score before adjusting for sentence length and word complexity.

1.015: This constant is multiplied by the average sentence length (ASL). It represents the weight or impact that sentence length has on readability. Longer sentences typically make the text harder to read, so higher ASL values reduce the overall score.

84.6: This constant is multiplied by the average number of syllables per word (ASW). It signifies the weight of word complexity on readability. Words with more syllables are generally harder to read and understand, so a higher ASW decreases the readability score significantly.
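
The averages and the formula above can be sketched as follows. Sentence and word splitting use simple regular expressions, and `count_syllables` is a rough vowel-group heuristic introduced here as an assumption; it will miscount some English words, so the resulting score is an approximation.

```python
import re


def split_sentences(text):
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text):
    return re.findall(r"[A-Za-z']+", text)


def average_word_length(text):
    words = split_words(text)
    return sum(len(word) for word in words) / len(words)


def average_sentence_length(text):
    """ASL: total number of words divided by total number of sentences."""
    return len(split_words(text)) / len(split_sentences(text))


def count_syllables(word):
    """Heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    words = split_words(text)
    asl = average_sentence_length(text)
    asw = sum(count_syllables(word) for word in words) / len(words)  # ASW
    return 206.835 - (1.015 * asl) - (84.6 * asw)


sample = "The cat sat. The dog ran."
print(average_word_length(sample), average_sentence_length(sample))
print(round(flesch_reading_ease(sample), 2))
```

The sample consists of three-word sentences of one-syllable words, so its score comes out above 100, which illustrates that the scale is not strictly bounded.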

### Collocation and Terminology Extraction
1. Identify and print the most common collocations in Chapter VII: A Mad Tea-Party.

2. Write a function that extracts key terms from a selected chapter based on their frequency and context.

3. Plot the frequency distribution of these key terms using matplotlib.

4. Remove punctuation and stopwords and redo the plot.
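
A frequency-based sketch of the key-term extraction, with punctuation and stopwords removed as in the last task. It ignores the "context" part of the exercise, and the stopword set is a small hand-picked stand-in for a fuller list such as NLTK's. Plotting is left out so that the block depends only on the standard library.

```python
import re
from collections import Counter

# Small hand-picked stopword list, a stand-in for a fuller one.
STOPWORDS = {
    "a", "all", "and", "at", "but", "he", "his", "in", "is", "it", "like",
    "of", "on", "the", "this", "to", "very", "was", "why",
}


def key_terms(text, n=10):
    """Return the n most frequent words that are not stopwords."""
    # Keeping only alphabetic runs also drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# One line from the tea-party chapter, standing in for the full chapter.
line = (
    "The Hatter opened his eyes very wide on hearing this; "
    "but all he said was, Why is a raven like a writing-desk?"
)
print(key_terms(line, 5))
```

The resulting `(term, count)` pairs could then be unpacked and passed to matplotlib's `plt.bar` to draw the frequency plot.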