***
# **Exercise 1: Text Analysis Tool**

***
### [**John Mike Asuncion**](https://github.com/johnmikx)
![BSCPE 2–2](https://img.shields.io/badge/BSCPE_2–2-18BCF2?style=for-the-badge&logo=Home%20Assistant&logoColor=white)
![CMPE 201 DSA](https://img.shields.io/badge/[CMPE_201]_Data_Structures_and_Algorithms-FF3621?style=for-the-badge&logo=Databricks&logoColor=white)

#### **Professor: Engr. Godofredo T. Avena**
##### *September 16, 2025*
***

## **Table of Contents**

* [**1. Objective**](#1)
* [**2. Problem Description**](#2)
* [**3. Requirements**](#3)
* [**4. Sample Text**](#4)
* [**5. Expected Output**](#5)
* [**6. Solutions**](#6)
  * [6.1 Cleaning the Text](#6.1)
  * [6.2 Counting Words](#6.2)
  * [6.3 Counting Sentences](#6.3)
  * [6.4 Counting Paragraphs](#6.4)
  * [6.5 Most Common Word](#6.5)
  * [6.6 Average Word Length](#6.6)
  * [6.7 Finding Long Words](#6.7)
* [**7. Generate Final Text Analysis Report**](#7)
***

## **1. Objective**  <a class="anchor" id="1"></a>
Practice string manipulation, loops, conditional statements, and functions.

## **2. Problem Description** <a class="anchor" id="2"></a>
Create a text analysis tool that can analyze a given text and provide various statistics.

## **3. Requirements** <a class="anchor" id="3"></a>
1. Create functions to analyze a text string:
   - `count_words(text)`: Count total number of words
   - `count_sentences(text)`: Count sentences (assume sentences end with '.', '!', or '?')
   - `count_paragraphs(text)`: Count paragraphs (separated by double newlines)
   - `most_common_word(text)`: Find the most frequently used word (ignore case)
   - `average_word_length(text)`: Calculate average length of words
   - `find_long_words(text, min_length)`: Find all words longer than the specified length

2. Clean the text by removing punctuation when counting words
3. Handle case sensitivity appropriately

## **4. Sample Text** <a class="anchor" id="4"></a>

In [119]:
sample_text = """
Python is a high-level programming language. It is known for its simplicity and readability.
Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.

Many developers choose Python for its extensive libraries and frameworks.
The language is widely used in web development, data science, artificial intelligence, and automation.
Python's philosophy emphasizes code readability and simplicity!
"""

## **5. Expected Output** <a class="anchor" id="5"></a>
```
=== TEXT ANALYSIS REPORT ===
Total Words: 56
Total Sentences: 6
Total Paragraphs: 2
Most Common Word: "python" (appears 4 times)
Average Word Length: 6.2 characters
Words longer than 8 characters: ['programming', 'language', 'simplicity', 'readability', ...]
```

## **6. Solutions** <a class="anchor" id="6"></a>

### **6.1 Cleaning the Text (`clean_text`)** <a class="anchor" id="6.1"></a>

1. Define a string of punctuation marks.
2. For each punctuation mark, remove it from the text.
3. Convert text to lowercase.
4. Return the cleaned text.

In [104]:
def clean_text(text):
  """Remove punctuation and convert text to lowercase."""
  punctuation = ".,!?;:'\"-()[]{}<>@#$%^&*_~\\/"

  for p in punctuation:
    text = text.replace(p, "")

  return text.lower()

In [105]:
clean_text(sample_text)

'\npython is a highlevel programming language it is known for its simplicity and readability\npython supports multiple programming paradigms including procedural objectoriented and functional programming\n\nmany developers choose python for its extensive libraries and frameworks \nthe language is widely used in web development data science artificial intelligence and automation\npythons philosophy emphasizes code readability and simplicity\n'
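
The same cleaning can also be sketched with `str.translate`, which deletes every unwanted character in one pass instead of calling `replace` once per punctuation mark. This variant assumes that `string.punctuation` (the standard library's ASCII punctuation set) is an acceptable stand-in for the hand-written list; the name `clean_text_translate` is illustrative only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)

def clean_text_translate(text):
    """Remove ASCII punctuation in a single pass and convert to lowercase."""
    return text.translate(_REMOVE_PUNCTUATION).lower()

print(clean_text_translate("Python's high-level, object-oriented design!"))
# pythons highlevel objectoriented design
```

Note that `string.punctuation` also covers `+`, `=`, `|`, and the backtick, which the hand-written list omits. Like the original, this sketch joins hyphenated words (`high-level` becomes `highlevel`).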

### **6.2 Counting Words (`count_words`)** <a class="anchor" id="6.2"></a>

1. Clean the text (remove punctuation, lowercase).
2. Split the text on whitespace (spaces, tabs, and newlines) into a list of words.
3. Count the number of words in the list.

In [106]:
def count_words(text):
  """Return the number of whitespace-separated words in the cleaned text."""
  words = clean_text(text).split()

  return len(words)

In [107]:
count_words(sample_text)

56
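
The code above calls `split()` with no argument, which treats any run of whitespace as one separator and never returns empty strings. The short illustration below shows why that matters for multi-line text such as `sample_text`; the string `demo` is made up for this example.

```python
demo = "one  two\nthree\n\nfour "

# No argument: any run of whitespace is a single separator, no empty strings.
print(demo.split())
# ['one', 'two', 'three', 'four']

# Explicit " ": only single spaces separate, so newlines stay inside tokens
# and doubled or trailing spaces produce empty strings.
print(demo.split(" "))
# ['one', '', 'two\nthree\n\nfour', '']
```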

### **6.3 Counting Sentences (`count_sentences`)** <a class="anchor" id="6.3"></a>

1. Initialize an empty list `sentences`.
2. Traverse each character `char` in the text.
3. Add characters to a temporary string `temp`.
4. If a character is `.`, `!`, or `?`, save `temp` as a sentence and reset it.
5. After the loop, if `temp` still holds non-whitespace text, save it as a final sentence.
6. Return the total number of sentences.

In [108]:
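def count_sentences(text):
  """Count sentences, treating each '.', '!', or '?' as a sentence end."""
  sentences = []
  temp = ""

  for char in text:
    temp += char
    # A terminator closes the current sentence.
    if char in ".!?":
      sentences.append(temp.strip())
      temp = ""

  # Count trailing text that has no terminator.
  if temp.strip():
    sentences.append(temp.strip())

  return len(sentences)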

In [109]:
count_sentences(sample_text)

6
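
Because every terminator closes a sentence, the loop above counts repeated punctuation more than once: for `"Wait... what?!"` it returns 5. If runs of terminators should count as a single sentence end, one possible sketch uses `re.split`. This is an alternative under that assumption, not part of the exercise requirements, and the name `count_sentences_runs` is hypothetical.

```python
import re

def count_sentences_runs(text):
    """Count sentences, treating a run of '.', '!' or '?' as one terminator."""
    # Split on one or more consecutive terminators, then keep non-blank pieces.
    pieces = re.split(r"[.!?]+", text)
    return sum(1 for piece in pieces if piece.strip())

print(count_sentences_runs("Wait... what?!"))  # 2
```

Like the original, this sketch still miscounts decimals and abbreviations such as `3.14` or `e.g.`, since it has no notion of context around the period.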

### **6.4 Counting Paragraphs (`count_paragraphs`)** <a class="anchor" id="6.4"></a>

1. Remove leading and trailing whitespace.
2. Split the text by two consecutive newlines (`\n\n`).
3. Count non-empty paragraphs.

In [110]:
def count_paragraphs(text):
  """Return the number of non-empty paragraphs separated by blank lines."""
  paragraphs = []

  for p in text.strip().split("\n\n"):
    if p.strip():
      paragraphs.append(p)

  return len(paragraphs)

In [111]:
count_paragraphs(sample_text)

2
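
Splitting on the exact string `"\n\n"` misses a separator line that contains spaces or tabs. A minimal sketch that tolerates such lines, assuming any whitespace-only line should count as a paragraph break, is shown below; `count_paragraphs_blank_lines` is an illustrative name.

```python
import re

def count_paragraphs_blank_lines(text):
    """Count paragraphs separated by one or more blank (whitespace-only) lines."""
    # \n\s*\n matches a newline, any whitespace (including more newlines), then a newline.
    blocks = re.split(r"\n\s*\n", text.strip())
    return sum(1 for block in blocks if block.strip())

print(count_paragraphs_blank_lines("First paragraph.\n   \nSecond paragraph."))  # 2
```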

### **6.5 Most Common Word (`most_common_word`)** <a class="anchor" id="6.5"></a>

1. Clean the text and split into words.
2. Use a dictionary `freq` to count each word’s occurrences.
3. Loop through dictionary items to find the word with the highest count.
4. Return the word and its count.

In [112]:
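def most_common_word(text):
  """Return (word, count) for the most frequent word, ignoring case."""
  words = clean_text(text).split()
  freq = {}

  # Tally occurrences of each word.
  for word in words:
    freq[word] = freq.get(word, 0) + 1

  max_word = None
  max_count = 0

  # Strict > keeps the first word that reaches the highest count.
  for word, count in freq.items():
    if count > max_count:
      max_word, max_count = word, count

  return max_word, max_count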

In [113]:
most_common_word(sample_text)

('and', 5)
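
This result differs from the expected output in Section 5, which reports "python" with 4 occurrences. One way to arrive at that figure, offered only as a guess at the intent, is to ignore common function words and fold the possessive `Python's` into `python`. Neither rule appears in the requirements, so the `STOP_WORDS` set and the `'s` handling below are assumptions; `collections.Counter` stands in for the manual dictionary.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the exercise does not define one.
STOP_WORDS = {"a", "and", "for", "in", "is", "it", "its", "the"}

def most_common_content_word(text):
    """Return (word, count) for the most frequent word that is not a stop word."""
    # Lowercase and drop a trailing possessive 's (an assumption, see above).
    lowered = re.sub(r"'s\b", "", text.lower())
    # Extract alphabetic words, keeping internal hyphens.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", lowered)
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]

sample = """
Python is a high-level programming language. It is known for its simplicity and readability.
Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.

Many developers choose Python for its extensive libraries and frameworks.
The language is widely used in web development, data science, artificial intelligence, and automation.
Python's philosophy emphasizes code readability and simplicity!
"""

print(most_common_content_word(sample))  # ('python', 4)
```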

### **6.6 Average Word Length (`average_word_length`)** <a class="anchor" id="6.6"></a>

1. Clean the text and split into words.
2. Count total letters across all words.
3. Divide total letters by total words.
4. Round the result to 1 decimal place.

***Mathematically...***

If $W = (w_1, w_2, \ldots, w_n)$ is the list of cleaned words, repeats included, and $|w_i|$ is the length of word $w_i$:

$$
\text{Average Word Length} = \frac{\sum_{i=1}^{n} {|w_i|}}{n}
$$

In [114]:
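def average_word_length(text):
  """Return the mean word length of the cleaned text, rounded to 1 decimal place."""
  words = clean_text(text).split()

  # Guard against division by zero on empty input.
  if not words:
    return 0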

  total_length = 0

  for word in words:
    total_length += len(word)

  return round(total_length / len(words), 1)

In [115]:
average_word_length(sample_text)

6.8
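
As a quick check of the formula on a small input, the sketch below computes the same average with `sum` and a generator expression. The four-word phrase is made up for illustration and is assumed to be already cleaned.

```python
words = "the quick brown fox".split()

# Sum of |w_i| over all words, divided by n.
total_length = sum(len(word) for word in words)   # 3 + 5 + 5 + 3 = 16
average = round(total_length / len(words), 1)     # 16 / 4 = 4.0

print(average)  # 4.0
```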

### **6.7 Finding Long Words (`find_long_words`)** <a class="anchor" id="6.7"></a>

1. Clean the text and split into words.
2. For each word, check if its length is greater than `min_length`.
3. If yes, add it to the list if not already included.
4. Sort the final list alphabetically.

In [116]:
def find_long_words(text, min_length):
  """Return the sorted unique words with more than min_length characters."""
  words = clean_text(text).split()
  long_words = []

  for word in words:
    if len(word) > min_length and word not in long_words:
      long_words.append(word)

  return sorted(long_words)

In [117]:
find_long_words(sample_text, 8)

['artificial',
 'automation',
 'developers',
 'development',
 'emphasizes',
 'extensive',
 'frameworks',
 'functional',
 'highlevel',
 'including',
 'intelligence',
 'libraries',
 'objectoriented',
 'paradigms',
 'philosophy',
 'procedural',
 'programming',
 'readability',
 'simplicity']
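
The duplicate check `word not in long_words` scans the list each time it runs. A set comprehension gives the same sorted, de-duplicated result with average constant-time membership; the sketch below assumes the input is a list of already cleaned words, and `find_long_words_set` is an illustrative name.

```python
def find_long_words_set(words, min_length):
    """Return the sorted unique words whose length exceeds min_length."""
    # The set removes duplicates; sorted() returns an alphabetical list.
    return sorted({word for word in words if len(word) > min_length})

demo_words = "python programming language programming simplicity".split()
print(find_long_words_set(demo_words, 8))
# ['programming', 'simplicity']
```

`language` has exactly 8 characters, so the strict `>` comparison leaves it out, which is also why it is absent from the result above.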

## **7. Generate Final Text Analysis Report** <a class="anchor" id="7"></a>

In [120]:
#################
# Collect results
#################

total_words = count_words(sample_text)
total_sentences = count_sentences(sample_text)
total_paragraphs = count_paragraphs(sample_text)
common_word, freq = most_common_word(sample_text)
avg_length = average_word_length(sample_text)
long_words = find_long_words(sample_text, 8)

print("=== TEXT ANALYSIS REPORT ===")
print(f"Total Words: {total_words}")
print(f"Total Sentences: {total_sentences}")
print(f"Total Paragraphs: {total_paragraphs}")
print(f'Most Common Word: "{common_word}" (appears {freq} times)')
print(f"Average Word Length: {avg_length} characters")
print(f"Words longer than 8 characters: {long_words}")

=== TEXT ANALYSIS REPORT ===
Total Words: 56
Total Sentences: 6
Total Paragraphs: 2
Most Common Word: "and" (appears 5 times)
Average Word Length: 6.8 characters
Words longer than 8 characters: ['artificial', 'automation', 'developers', 'development', 'emphasizes', 'extensive', 'frameworks', 'functional', 'highlevel', 'including', 'intelligence', 'libraries', 'objectoriented', 'paradigms', 'philosophy', 'procedural', 'programming', 'readability', 'simplicity']
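
To make the report reusable for other texts, the formatting can be separated from the computation. The sketch below builds the report string from a dictionary of precomputed statistics; the function name `format_report` and the dictionary keys are hypothetical and chosen only for this example.

```python
def format_report(stats, min_length=8):
    """Build the report text from a dict of precomputed statistics."""
    word, count = stats["most_common"]
    lines = [
        "=== TEXT ANALYSIS REPORT ===",
        f"Total Words: {stats['words']}",
        f"Total Sentences: {stats['sentences']}",
        f"Total Paragraphs: {stats['paragraphs']}",
        f'Most Common Word: "{word}" (appears {count} times)',
        f"Average Word Length: {stats['average_length']} characters",
        f"Words longer than {min_length} characters: {stats['long_words']}",
    ]
    return "\n".join(lines)

# Values copied from the cell outputs above; the long-word list is shortened here.
demo_stats = {
    "words": 56,
    "sentences": 6,
    "paragraphs": 2,
    "most_common": ("and", 5),
    "average_length": 6.8,
    "long_words": ["artificial", "automation"],
}

report = format_report(demo_stats)
print(report)
```

Returning a string instead of printing directly also makes the output easy to compare against an expected report in a test.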
