### Portuguese Word Analysis

Portuguese word structure depends on how vowels (V) and consonants (C) combine into syllables. The syllable is the rhythmic unit of speech, and it almost always has a vowel as its nucleus. Understanding this vowel-consonant interaction helps in choosing strong guesses in the game Termo (the Portuguese version of Wordle).

In [24]:
import os
from collections import Counter
import unicodedata 

In [25]:
FILENAME = 'palavras_5letras.txt' 
TARGET_LENGTH = 5                 

##### Removes accents from a character

In [26]:
def normalize_char(char):
    return ''.join(c for c in unicodedata.normalize('NFD', char)
                   if unicodedata.category(c) != 'Mn')
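
As a quick check of what `normalize_char` does, the sketch below repeats the function and applies it to a few accented letters. The sample characters are illustrative choices, not taken from the word file.

```python
import unicodedata

def normalize_char(char):
    # Decompose the character (NFD), then drop combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', char)
                   if unicodedata.category(c) != 'Mn')

samples = ['á', 'ã', 'ê', 'ç', 'z']
folded = {ch: normalize_char(ch) for ch in samples}
print(folded)
```

Note that `ç` decomposes into `c` plus a combining cedilla, so it is folded to `c` like any other accented letter. After normalization a word therefore never contains `ç`, and the `ç` entry in the alphabet defined in the next cell is never matched on normalized text.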

##### Defining the vowels and consonants

In [27]:
BASE_VOWELS = set('aeiou')
ALPHABET_PT_FULL = set('abcdefghijklmnopqrstuvwxyzç')
CONSONANTS = {char for char in ALPHABET_PT_FULL if char not in BASE_VOWELS}

##### Variables and Counters

In [28]:
max_distinct_vowels = 0
words_with_max_vowels = []

consonant_counter = Counter()

total_words_analyzed = 0
total_consonants_counted = 0
error_message = None # Store error messages

##### Reads the word file, performs vowel and consonant analysis and updates global variables with the results.

In [29]:
all_original_valid_words = []

def analyze_words():
    
    # Declare global variables to be modified
    global max_distinct_vowels, words_with_max_vowels, consonant_counter
    global total_words_analyzed, total_consonants_counted, error_message
    global all_original_valid_words 

    # Clear the list of all valid words for the new run
    all_original_valid_words.clear()
    max_distinct_vowels = 0
    words_with_max_vowels = []
    consonant_counter.clear() 
    total_words_analyzed = 0
    total_consonants_counted = 0
    error_message = None 

    # Check if the input file exists
    if not os.path.exists(FILENAME):
        error_message = f"Critical Error: The file '{FILENAME}' was not found in the current directory."
        print(error_message)
        return False

    try:
        with open(FILENAME, 'r', encoding='utf-8') as f:
            for line in f:
                original_word = line.strip()

                if original_word and len(original_word) == TARGET_LENGTH:
                    normalized_word = "".join(normalize_char(c) for c in original_word.lower())

                    # Additional validation: check if all normalized characters are valid letters
                    if all(c in ALPHABET_PT_FULL for c in normalized_word) and len(normalized_word) == TARGET_LENGTH:
                        all_original_valid_words.append(original_word)
                        total_words_analyzed += 1
                        vowels_in_word = set() # Set to store distinct vowels found in this word
                        
                        for letter in normalized_word:
                            # Vowel Analysis
                            if letter in BASE_VOWELS:
                                vowels_in_word.add(letter)
                            # Consonant Analysis
                            elif letter in CONSONANTS:
                                consonant_counter[letter] += 1
                                total_consonants_counted += 1 

                        # Update the results for the highest number of distinct vowels found
                        num_distinct_vowels = len(vowels_in_word)
                        if num_distinct_vowels > max_distinct_vowels:
                            max_distinct_vowels = num_distinct_vowels
                            words_with_max_vowels = [original_word] # New maximum found, reset list
                        elif num_distinct_vowels == max_distinct_vowels:
                            words_with_max_vowels.append(original_word) # Tied for maximum, add to list

        # Print completion message after processing the file
        if total_words_analyzed > 0:
            print(f"Analysis complete. {total_words_analyzed} valid {TARGET_LENGTH} letter words were processed and stored.")
        else:
            print(f"Analysis complete. No valid {TARGET_LENGTH}-letter words found in '{FILENAME}'.")
            error_message = f"No valid {TARGET_LENGTH}-letter words found." # Set an informative error message
        return True # Indicate analysis finished successfully (even if no words found)

    except FileNotFoundError:
        error_message = f"Critical Error: The file '{FILENAME}' was not found."
        print(error_message)
        return False 
    except Exception as e:
        error_message = f"An unexpected error occurred during file processing: {e}"
        print(error_message)
        return False


In [30]:
analysis_completed = analyze_words()

Analysis complete. 19082 valid 5 letter words were processed and stored.


##### Insights

Now, let's move on to finding the best starting words for the game. The goal is to identify words that maximize our chances of guessing letters correctly on the first try.

To do this, we'll combine some insights. We're looking for words that not only contain a good number of distinct vowels (to test which vowels are present) but also include some of the most frequently occurring consonants we identified earlier. This strategy aims to gather the most information possible with the initial guess.

##### Prints the results of the distinct vowel analysis

In [31]:
def display_vowel_results():
    
    if error_message and not total_words_analyzed: # Show error only if analysis didn't even start
        print(f"Cannot display results due to a critical error: {error_message}")
    elif total_words_analyzed == 0:
        print(f"No valid {TARGET_LENGTH}-letter words were found or analyzed in the file '{FILENAME}'.")
    else:
        print(f"Total {TARGET_LENGTH}-letter words analyzed: {total_words_analyzed}")
        print("-" * 40)
        if words_with_max_vowels:
            print(f"Highest number of distinct vowels found: {max_distinct_vowels}")
            print(f"Words with {max_distinct_vowels} distinct vowels ({len(words_with_max_vowels)} found):")
            # Print words, 15 per line
            for i in range(0, len(words_with_max_vowels), 15):
                print(", ".join(words_with_max_vowels[i:i+15]))
        else:
            print("No words containing vowels were found.")
        if error_message:
             print(f"\nNote: An error occurred during processing: {error_message}")

In [32]:
display_vowel_results()

Total 5-letter words analyzed: 19082
----------------------------------------
Highest number of distinct vowels found: 4
Words with 4 distinct vowels (132 found):
aboei, aboie, acoei, acuei, adioe, adiou, adoei, aduei, afeio, afeou, afiou, aguei, aiemo, aigue, aioes
airou, aiues, aiune, aiuno, ajoie, aleio, aleou, aliou, aluei, aluio, ameio, ameou, amoie, amuei, aneio
aneou, aoqui, apeio, apeou, apoie, apuei, apuie, aqueo, areio, areou, ariou, ateio, ateou, atoei, atuei
audio, aueti, aueto, aunei, aurei, aureo, ausio, aveio, aviou, avoei, azoei, baiou, caiou, caiue, cuiao
ecoai, eguai, eicou, eimou, eivao, eivou, eixou, eluia, eolia, equio, euria, eurio, faiou, feiao, gaiou
guaie, guaio, guiao, iameu, iaque, iauos, iaupe, iauvo, ideou, iogue, iolau, iuane, iucea, lauie, maeio
maiou, meiao, mueia, odeia, ofaie, oguei, oigue, oleai, oleia, opaie, oquea, oquei, oquie, oreai, oreia
ouari, ougai, ourai, ourei, ousai, ousei, ousia, ousie, outai, outei, ouvia, oviua, ozeai, ozeia, queia
raiou

##### Prints the results of the consonant frequency analysis

In [33]:
def display_consonant_results():

    if error_message and not total_words_analyzed: 
        print(f"Cannot display results due to a critical error: {error_message}")
    elif total_words_analyzed == 0:
         print(f"No valid {TARGET_LENGTH}-letter words were found or analyzed in the file '{FILENAME}'.")
    else:
        print(f"Total {TARGET_LENGTH}-letter words analyzed: {total_words_analyzed}")
        print(f"Total consonants counted: {total_consonants_counted}")
        print("-" * 40)

        if consonant_counter:
            top_10_consonants = consonant_counter.most_common(10)
            print("The 10 most frequent consonants are:")
            print(f"{'Rank':<6} {'Consonant':<12} {'Frequency':<12}")
            print("-" * 40)
            for rank, (consonant, frequency) in enumerate(top_10_consonants, 1):
                print(f"{rank:<6} {consonant:<12} {frequency:<12}")
        elif total_consonants_counted == 0 and total_words_analyzed > 0:
            print("No consonants were found in the analyzed words.")
        else:
            print("No consonants were counted.")
        if error_message: # Display non-critical errors
            print(f"\nNote: An error occurred during processing: {error_message}")


In [34]:
display_consonant_results()

Total 5-letter words analyzed: 19082
Total consonants counted: 48302
----------------------------------------
The 10 most frequent consonants are:
Rank   Consonant    Frequency   
----------------------------------------
1      s            6375        
2      r            5981        
3      m            4282        
4      l            4028        
5      c            3886        
6      n            3465        
7      t            3432        
8      p            2498        
9      b            2471        
10     d            2408        
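
To put these counts in perspective, the sketch below turns the reported frequencies into shares of all 48,302 consonant occurrences. The dictionary is copied by hand from the table above, and the variable names are illustrative.

```python
# Frequencies copied from the table above
top_10 = {'s': 6375, 'r': 5981, 'm': 4282, 'l': 4028, 'c': 3886,
          'n': 3465, 't': 3432, 'p': 2498, 'b': 2471, 'd': 2408}
total_consonants = 48302  # "Total consonants counted" in the output above

top_10_total = sum(top_10.values())
top_10_share = top_10_total / total_consonants

# Share of each consonant among all consonant occurrences
for letter, count in top_10.items():
    print(f"{letter}: {count / total_consonants:.1%}")
print(f"Top 10 combined: {top_10_total} ({top_10_share:.1%})")
```

The ten letters account for 38,826 of the 48,302 occurrences, roughly 80%, so scoring candidate words on the top 10 alone ignores only about a fifth of the consonant occurrences.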


In [35]:
words_with_max_vowels_and_s = []


if analysis_completed and words_with_max_vowels:
    target_letter = 's'

    for word in words_with_max_vowels:
        normalized_word = "".join(normalize_char(c) for c in word.lower())

        # Check if the letter 's' is in the normalized word
        if target_letter in normalized_word:
            words_with_max_vowels_and_s.append(word) # Add the original word to the list

    print(f"\nFiltering: Words with {max_distinct_vowels} Vowels and the Letter '{target_letter.upper()}'")
    if words_with_max_vowels_and_s:
        print(f"Found {len(words_with_max_vowels_and_s)} words (from the original list of {len(words_with_max_vowels)}) :")
        # Print the found words, 15 per line
        for i in range(0, len(words_with_max_vowels_and_s), 15):
            print(", ".join(words_with_max_vowels_and_s[i:i+15]))
    else:
        # Message if no matching words were found
        print(f"No words in the list of {len(words_with_max_vowels)} words with {max_distinct_vowels} distinct vowels contain the letter '{target_letter}'.")

# Error handling messages if prerequisites are not met
elif error_message:
     print(f"\nCannot filter words due to a previous error: {error_message}")
else:
     print("\nThere are no words in the 'words_with_max_vowels' list to filter.")


Filtering: Words with 4 Vowels and the Letter 'S'
Found 9 words (from the original list of 132) :
aioes, aiues, ausio, iauos, ousai, ousei, ousia, ousie, uaios


In [36]:
if analysis_completed and all_original_valid_words and consonant_counter:

    print("Searching for the best words (5 distinct letters, high top-10 consonant score) for each vowel:")

    # Dictionary with frequencies of the 10 most common consonants
    top_10_freq_dict = dict(consonant_counter.most_common(10))

    for target_vowel in sorted(BASE_VOWELS):

        print(f"\n\n Words with 5 Distinct Letters Containing Vowel: '{target_vowel.upper()}'")

        candidate_scores_for_vowel = [] # List to store scores for this vowel

        # Iterate through all valid words found in the initial analysis
        for candidate_word in all_original_valid_words:
            # Normalize the candidate word
            normalized_candidate = "".join(normalize_char(c) for c in candidate_word.lower())
            letters_in_candidate_set = set(normalized_candidate) # Set of unique letters

            # Must contain the target vowel for the current iteration
            # Must have exactly 5 distinct letters
            if (target_vowel not in normalized_candidate or
                len(letters_in_candidate_set) != TARGET_LENGTH):
                continue # Skip to the next word if criteria are not met
                
            current_score = 0
            # Get the unique consonants in the candidate word
            distinct_consonants_in_candidate = {c for c in normalized_candidate if c in CONSONANTS}
            # Sum the frequency (if top 10) for each distinct consonant
            for cons in distinct_consonants_in_candidate:
                current_score += top_10_freq_dict.get(cons, 0) # .get() returns 0 if not top 10

            # Store the negated score so that an ascending sort puts the highest scores first
            candidate_scores_for_vowel.append((-current_score, candidate_word))

        candidate_scores_for_vowel.sort() # Sorts by score

        N = 10 # Number of results to show per vowel
        print(f"Top {N} candidates (ordered by Score Descending):")
        print(f"{'Rank':<6} {'Word':<10} {'Score':<8} {'Letters'}")
        print("-" * 40)

        if candidate_scores_for_vowel:
            rank = 1
            # Show the top words found for this vowel
            for neg_score, word in candidate_scores_for_vowel[:N]:
                score = -neg_score # Reverse the score to positive for display
                # Show all letters in the word
                norm_w = "".join(normalize_char(c) for c in word.lower())
                letters_str = "".join(sorted(set(norm_w))) # Get unique letters and sort
                print(f"{rank:<6} {word:<10} {score:<8} {letters_str}")
                rank += 1
        else:
            print(f"No words found with 5 distinct letters containing '{target_vowel.upper()}'.")

elif not analysis_completed:
    print(f"\nCannot analyze words by vowel. Initial analysis failed: {error_message}")
elif not all_original_valid_words:
    print("\nCannot analyze words by vowel. The list of all valid words is empty.")
elif not consonant_counter:
    print("\nCannot analyze words by vowel. The consonant counter is empty (initial analysis failed?).")

Searching for the best words (5 distinct letters, high top-10 consonant score) for each vowel:


 Words with 5 Distinct Letters Containing Vowel: 'A'
Top 10 candidates (ordered by Score Descending):
Rank   Word       Score    Letters
----------------------------------------
1      trans      19253    anrst
2      brcas      18713    abcrs
3      prans      18319    anprs
4      armes      16638    aemrs
5      ermas      16638    aemrs
6      esmar      16638    aemrs
7      irmas      16638    aimrs
8      mares      16638    aemrs
9      maris      16638    aimrs
10     marso      16638    amors


 Words with 5 Distinct Letters Containing Vowel: 'E'
Top 10 candidates (ordered by Score Descending):
Rank   Word       Score    Letters
----------------------------------------
1      trens      19253    enrst
2      perns      18319    enprs
3      armes      16638    aemrs
4      ermas      16638    aemrs
5      ermos      16638    emors
6      esmar      16638    aemrs
7      mares     

Our initial analysis focused on finding starting words with the maximum number of distinct vowels (four out of five, like AUREI). The idea was to follow up with a second word containing the single missing vowel and hopefully several useful, distinct consonants.

However, this "4+1" vowel strategy has a drawback when we consider the combined information from the first two guesses. A five-letter word with four distinct vowels leaves at most one slot for a consonant. This severely limits our ability to test high-frequency consonants in the very first guess.

Subsequently, finding an optimal second word becomes difficult. We need a valid 5-letter word that:

1. Contains the specific missing vowel.
2. Contains four other letters.
3. Ideally, these four letters are distinct, high-frequency consonants not used in the first word.

Satisfying all these conditions simultaneously is challenging. We might find that the available second words either repeat letters from the first guess, use less frequent consonants, or simply don't exist, forcing a suboptimal second guess. The constraint of having only one consonant in the first word limits the overall efficiency of the pair.
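
A minimal sketch of this difficulty, using only a handful of words that appear in the outputs above rather than the full dictionary. The helper `second_word_candidates` is hypothetical: it keeps words that supply every vowel missing from the first guess and share no letter with it.

```python
VOWELS = set('aeiou')

def second_word_candidates(first_word, words):
    """Words that cover the vowels missing from first_word and reuse none of its letters."""
    first_letters = set(first_word)
    missing_vowels = VOWELS - first_letters
    return [w for w in words
            if missing_vowels <= set(w)          # supplies every missing vowel
            and not (set(w) & first_letters)]    # repeats no letter from guess 1

# Small sample taken from the result tables above (unaccented forms)
sample = ['ermos', 'morus', 'muros', 'marso', 'trens', 'perns']

strict = second_word_candidates('aurei', sample)
print(strict)
```

On this sample the strict filter returns an empty list: every word that contains `o` also reuses `r` or another letter of `aurei`. The full dictionary may still contain valid candidates, so this only illustrates how quickly the constraints shrink the pool.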

##### Exploring a 3+2 Vowel Split Strategy

This suggests that maximizing vowel coverage in the first word alone might not be the best path for a two-word opening strategy. An alternative approach is to balance the vowel distribution across the two words.

Consider a "3+2" vowel split:

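1. Word 1 contains 3 distinct vowels and 2 consonants (VVVCC).
2. Word 2 contains the remaining 2 distinct vowels and 3 consonants (VVCCC).

Here V stands for a base vowel (A, E, I, O, U) and C for one of the five most frequent consonants (S, R, M, L, C). The patterns describe which letters a word contains, not their order.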
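
The split can be checked mechanically. The sketch below counts distinct vowels and consonants per word and tests one pair taken from the results reported at the end of this notebook. The helper names `composition` and `is_3_plus_2_pair` are hypothetical, and the words are assumed to be already unaccented.

```python
VOWELS = set('aeiou')

def composition(word):
    """Return (distinct vowels, distinct consonants) for an unaccented word."""
    letters = set(word)
    vowels = letters & VOWELS
    return len(vowels), len(letters - vowels)

def is_3_plus_2_pair(word1, word2):
    """True if word1 is 3V+2C, word2 is 2V+3C, and no letter repeats across the pair."""
    return (composition(word1) == (3, 2)
            and composition(word2) == (2, 3)
            and not (set(word1) & set(word2)))

print(composition('acelo'), composition('muris'))
print(is_3_plus_2_pair('acelo', 'muris'))
print(is_3_plus_2_pair('aurei', 'trens'))
```

Because the two words share no letter and together hold exactly five vowels, a pair that passes this check covers all five base vowels and tests five distinct consonants in ten letters.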

##### Why Use a Graph-Inspired Method for Finding Word Pairs?

Our goal is to find pairs of 5-letter words that, when combined, contain all 5 distinct base vowels (A, E, I, O, U) and also test a high number of distinct, frequently used consonants.

A straightforward approach would be to simply check every possible pair of words in our dictionary. For a dictionary of N valid 5-letter words, there are N×(N−1) possible ordered pairs (or N×(N−1)/2 if the order doesn't matter). Given our analysis found over 19,000 valid words (N≈19,000), checking every single pair means performing calculations roughly 19,000^2 ≈ 361,000,000 times. This brute-force O(N^2) approach is computationally very expensive and time-consuming for large word lists.
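
The exact counts for the 19,082 words reported earlier can be worked out directly. This is only arithmetic, using `math.comb` from the standard library; the variable names are illustrative.

```python
import math

n_words = 19082  # valid words reported by analyze_words()

ordered_pairs = n_words * (n_words - 1)   # (word1, word2) differs from (word2, word1)
unordered_pairs = math.comb(n_words, 2)   # order ignored

print(f"Ordered pairs:   {ordered_pairs:,}")
print(f"Unordered pairs: {unordered_pairs:,}")
```

That gives about 364 million ordered pairs or 182 million unordered pairs, in line with the rough 19,000² estimate.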

This is where thinking in terms of graphs helps: it lets us discard incompatible words in bulk before any individual pair is scored.

Nodes (vertices): each node is a distinct set of vowels found in the 5-letter words of our dictionary.

Edges: an edge connects two vowel-set nodes, say VowelSet1 and VowelSet2, if the union of the two sets contains all 5 base vowels (VowelSet1 ∪ VowelSet2 = {'a', 'e', 'i', 'o', 'u'}). The edge represents the complementarity needed to satisfy our primary vowel-coverage requirement.
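
As a toy illustration of this idea (not the full search), the sketch below groups a few words that appear in this notebook's outputs by vowel set and lists the complementary node pairs. The names `groups` and `edges` are illustrative, and the words are assumed to be already unaccented.

```python
from collections import defaultdict
from itertools import combinations

ALL_VOWELS = frozenset('aeiou')

# A few unaccented words taken from this notebook's outputs
words = ['acelo', 'muris', 'sumir', 'trens', 'aurei', 'morus']

# Nodes: one per distinct vowel set, holding the words that share it
groups = defaultdict(list)
for word in words:
    groups[frozenset(word) & ALL_VOWELS].append(word)

# Edges: pairs of nodes whose union covers all five vowels
edges = [(v1, v2) for v1, v2 in combinations(groups, 2)
         if v1 | v2 == ALL_VOWELS]

for v1, v2 in edges:
    print(sorted(v1), groups[v1], '<->', sorted(v2), groups[v2])
```

Six words collapse into five nodes, and only three node pairs are complementary. Word-level scoring then only needs to run inside node pairs joined by an edge, which is what the next cell does on the full word list.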

In [40]:
if analysis_completed and all_original_valid_words and consonant_counter:

    all_vowels_set = set('aeiou')
    word_pairs_analysis_optimized = []
    top_10_freq_dict = dict(consonant_counter.most_common(10))

    # Group Words by Vowel Set 
    # The dictionary key will be a frozenset (immutable set) of the word's vowels,
    # and the value will be a list of original words that contain exactly those distinct vowels.
    words_by_vowel_set = {}

    for word in all_original_valid_words:
        normalized_word = "".join(normalize_char(c) for c in word.lower())
        word_vowels_set = frozenset({c for c in normalized_word if c in BASE_VOWELS}) # Use frozenset as key

        if word_vowels_set not in words_by_vowel_set:
            words_by_vowel_set[word_vowels_set] = []

        words_by_vowel_set[word_vowels_set].append(word) 

    print(f"Grouping complete. Found {len(words_by_vowel_set)} distinct vowel sets in {TARGET_LENGTH}-letter words.")

    # Find Complementary Set Pairs and Iterate Through Words
    # Iterate over all unique pairs of vowel sets (V1, V2)
    # Iterating directly over the dictionary keys gives us the "nodes" of the vowel complementarity graph.
    # Searching for V2 such that V1 | V2 = all_vowels_set is "traversing" this graph.

    processed_vowel_set_pairs = set()

    for v1_set in words_by_vowel_set:
        list_w1 = words_by_vowel_set[v1_set]

        for v2_set in words_by_vowel_set:
            if (v1_set | v2_set) == all_vowels_set:
                pair_vowel_set_key = frozenset({v1_set, v2_set})

                if pair_vowel_set_key in processed_vowel_set_pairs:
                    continue 
                    
                processed_vowel_set_pairs.add(pair_vowel_set_key)
                list_w2 = words_by_vowel_set[v2_set]

                # Now, iterate through ALL words in list_w1 and ALL in list_w2
                # and calculate the pair score

                for word1 in list_w1:
                    normalized_word1 = "".join(normalize_char(c) for c in word1.lower())
                    word1_letters_set = set(normalized_word1)

                    for word2 in list_w2:
                        if word1 == word2: 
                            continue

                        normalized_word2 = "".join(normalize_char(c) for c in word2.lower())
                        word2_letters_set = set(normalized_word2)

                        # Calculate Consonant Score for the Pair
                        # Score based on the distinct consonants in the combined pair
                        # that are in the top 10 most frequent consonants.

                        combined_letters = word1_letters_set | word2_letters_set
                        combined_consonants = {c for c in combined_letters if c in CONSONANTS}

                        pair_consonant_score = 0
                        for cons in combined_consonants:
                            pair_consonant_score += top_10_freq_dict.get(cons, 0)

                        word_pairs_analysis_optimized.append((-pair_consonant_score, word1, word2))

    word_pairs_analysis_optimized.sort() # Sort by score

    N_pairs = 10 # Number of best pairs to show
    print(f"\nTop Word Pairs Found ({len(word_pairs_analysis_optimized)} total)")
    print(f"{'Rank':<6} {'Word 1':<10} {'Word 2':<10} {'Score':<8} {'Distinct Letters in Pair'}")
    print("-" * 80)

    if word_pairs_analysis_optimized:
        rank = 1
        for neg_score, word1, word2 in word_pairs_analysis_optimized[:N_pairs]:
            score = -neg_score
            normalized_pair_letters = set("".join(normalize_char(c) for c in (word1 + word2).lower()))
            display_letters = sorted([c for c in normalized_pair_letters if c in all_vowels_set or c in top_10_freq_dict])
            letters_str = "".join(display_letters)
            print(f"{rank:<6} {word1:<10} {word2:<10} {score:<8} {letters_str}")
            rank += 1
    else:
        print("No word pairs found in the list that, together, contain all 5 base vowels.")

# Error Handling Messages
elif not analysis_completed:
    print(f"\nCannot search for word pairs. Initial analysis failed: {error_message}")
elif not all_original_valid_words:
    print("\nCannot search for word pairs. The list of valid words is empty.")
elif not consonant_counter:
     print("\nCannot search for word pairs. The consonant counter is empty.")

Grouping complete. Found 31 distinct vowel sets in 5-letter words.

Top Word Pairs Found (8590088 total)
Rank   Word 1     Word 2     Score    Distinct Letters in Pair
--------------------------------------------------------------------------------
1      acelo      muris      24552    aceilmorsu
2      acelo      rumis      24552    aceilmorsu
3      acelo      sumir      24552    aceilmorsu
4      acelo      surim      24552    aceilmorsu
5      acero      lusmi      24552    aceilmorsu
6      acero      milus      24552    aceilmorsu
7      acero      musli      24552    aceilmorsu
8      acile      morus      24552    aceilmorsu
9      acile      muros      24552    aceilmorsu
10     acile      murso      24552    aceilmorsu
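
The top score can be cross-checked by hand from the consonant table reported earlier. The sketch below re-implements the pair score using only those reported frequencies; `pair_score` is a hypothetical helper for this check, not the notebook's own code, and it assumes unaccented input.

```python
# Top-10 consonant frequencies as reported earlier in this notebook
top_10 = {'s': 6375, 'r': 5981, 'm': 4282, 'l': 4028, 'c': 3886,
          'n': 3465, 't': 3432, 'p': 2498, 'b': 2471, 'd': 2408}
VOWELS = set('aeiou')

def pair_score(word1, word2):
    """Sum the top-10 frequencies of the distinct consonants in the pair."""
    consonants = (set(word1) | set(word2)) - VOWELS
    return sum(top_10.get(c, 0) for c in consonants)

# Highest total reachable with five distinct consonants
best_possible_5 = sum(sorted(top_10.values(), reverse=True)[:5])

print(pair_score('acelo', 'muris'), best_possible_5)
```

Both values are 24552. A pair that covers all five vowels has at most five letter slots left for consonants, and the leading pairs fill them with exactly the five most frequent ones (s, r, m, l, c), so they reach the ceiling of this metric. That also explains the ten-way tie at the top of the table.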
