### **Performing ETL on the [Wordle](https://www.nytimes.com/games/wordle/index.html) game**

Before we start, let's manually inspect the source code to find the variables where each word `list` is stored.

###############################################################################<br>
1st `list`:

```js script
var mo = [
        "cigar",
        "rebut",
        "sissy",
        "humph",
        "awake",
        // ... (lines omitted) ...
        "sooth",
        "unset",
        "unlit",
        "vomit",
        "fanny",
    ],
```
################################################################################<br>
2nd `list`: 

```js script
    fo = [
        "aahed",
        "aalii",
        "aargh",
        "aarti",
        "abaca",
        // ... (lines omitted) ...
        "zuzim",
        "zygal",
        "zygon",
        "zymes",
        "zymic",
    ],
```

Analysing the entire code *(**wordle.js** file)*, we can find two lists of words. Apparently, there is a **valid answers** `list` and a **valid guesses** `list`.

We can see that the words are stored in variables called `mo` and `fo`, respectively. They are written over multiple lines of code.
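
Rather than scrolling through the file by hand, the declaration lines could also be located programmatically. This is only a minimal sketch, assuming the source has already been split into a list of lines; `find_declaration_lines` and the sample lines are hypothetical and made up for illustration:

```python
import re


def find_declaration_lines(lines: list, var_name: str) -> list:
    """Return the 1-based line numbers where `var_name = [` appears."""

    # \b keeps "mo" from matching inside longer identifiers such as "demo"
    pattern = re.compile(rf"\b{re.escape(var_name)}\s*=\s*\[")

    return [number for number, line in enumerate(lines, start=1) if pattern.search(line)]


# Tiny stand-in for the real file (hypothetical sample)
sample_lines = [
    'var demo = 1,',
    '    mo = [',
    '        "cigar",',
    '    ],',
    '    fo = [',
    '        "aahed",',
    '    ],',
]

mo_lines = find_declaration_lines(sample_lines, "mo")
fo_lines = find_declaration_lines(sample_lines, "fo")
print(mo_lines, fo_lines)
```

The word boundary in the pattern is a design choice: two-letter names such as `mo` are short enough to appear inside other identifiers.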

So, let's extract them!

In [7]:
# Import modules
import unidecode
import re

# Define constants
NUMBER_OF_PREVIEW_ELEMENTS = 10
DIVIDER = '#' * 120

# Init variables
words_lists = [None] * 2


def print_words_lists_preview() -> None:
    """Helper function to print a preview of each words list and its length"""

    # Iterate over each words list
    for words_list in words_lists:

        # Check if the words list is not empty
        if words_list:
            print(f"first:\t{words_list[:NUMBER_OF_PREVIEW_ELEMENTS]}") # Print the first words
            print(f"last:\t{words_list[-NUMBER_OF_PREVIEW_ELEMENTS:]}") # Print the last words
            print(f"length:\t{len(words_list)}")                        # Print the length of the words list
            print() 

Let's load the JavaScript source code as a `list` of lines and preview it.

In [8]:
def load_file(filepath: str) -> list:
    """Function to output a list of lines from a file"""

    with open(filepath, "r", encoding="utf-8") as f:
        return f.read().splitlines()


def print_code_preview(src_code: list) -> None:
    """Helper function to print a preview of the code"""

    print(DIVIDER)
    print("FIRST:")

    # Iterate over the first lines
    for line in src_code[:NUMBER_OF_PREVIEW_ELEMENTS]:
        print(f"\t{line}")

    print(DIVIDER)
    print("LAST:")

    # Iterate over the last lines
    for line in src_code[-NUMBER_OF_PREVIEW_ELEMENTS:]:
        print(f"\t{line}")
    
    print(DIVIDER)
    print(f"LENGTH: {len(src_code)}") # Print the code's length


# Load the code
src_code = load_file(filepath="data/src_code/wordle.js")

# Preview the first and last lines of code
print_code_preview(src_code=src_code)

########################################################################################################################
FIRST:
	(this.wordle = this.wordle || {}),
	    (this.wordle.bundle = (function (e) {
	        "use strict";
	        function t(e, t) {
	            var n = Object.keys(e);
	            if (Object.getOwnPropertySymbols) {
	                var a = Object.getOwnPropertySymbols(e);
	                t &&
	                    (a = a.filter(function (t) {
	                        return Object.getOwnPropertyDescriptor(e, t).enumerable;
########################################################################################################################
LAST:
	            (e.GameThemeManager = E),
	            (e.GameTile = w),
	            (e.GameToast = ho),
	            (e.NYTIcon = Fi),
	            (e.NavIcon = $i),
	            (e.NavModal = Ii),
	            Object.defineProperty(e, "__esModule", { value: !0 }),
	            e
	        );
	    })({}));
###########

Let's extract both word lists:

+ We can detect their start based on their variable name and their opening char: `'['`
+ We can detect their end based on their closing char: `']'`

In [9]:
def extract_words_from_list(src_code: list, var_name: str, opening_char: str, closing_char: str) -> list:
    """Function to extract words from a list in the source code"""

    # Init variables
    words_list = []     # Output list of words
    store_words = False # Flag: whether to store words or not

    # Iterate over the source code's lines
    for line in src_code:
        
        # If store_words flag is set, find words in the line
        if store_words:
            result = re.findall(r"\w+", line)

            # Check if the RegEx result contains words
            if result:

                # Grab the first word
                # Remove accented and special characters, convert it to lowercase
                # And add it to the words list
                words_list.append(unidecode.unidecode(result[0]).lower())
        
        # If line contains the variable name and the opening char, set the flag to store words
        if var_name in line and opening_char in line:
            store_words = True

        # If line contains the closing char, set the flag to not store words
        if closing_char in line:
            store_words = False

    # Return the output words list
    return words_list

# Extract each words list
words_lists[0] = extract_words_from_list(src_code=src_code, var_name="mo =", opening_char="[", closing_char="]")
words_lists[1] = extract_words_from_list(src_code=src_code, var_name="fo =", opening_char="[", closing_char="]")

# Print the words lists previews
print_words_lists_preview()

first:	['cigar', 'rebut', 'sissy', 'humph', 'awake', 'blush', 'focal', 'evade', 'naval', 'serve']
last:	['hydro', 'liege', 'octal', 'ombre', 'payer', 'sooth', 'unset', 'unlit', 'vomit', 'fanny']
length:	2309

first:	['aahed', 'aalii', 'aargh', 'aarti', 'abaca', 'abaci', 'abacs', 'abaft', 'abaka', 'abamp']
last:	['zulus', 'zupan', 'zupas', 'zuppa', 'zurfs', 'zuzim', 'zygal', 'zygon', 'zymes', 'zymic']
length:	10665
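
For comparison, the same extraction can be sketched with a single regular expression over the whole source text instead of a line-by-line flag. This is an alternative technique, not the approach used above, and it assumes the array literal holds only double-quoted strings with no nested brackets; `extract_words_with_regex` and the sample snippet are hypothetical:

```python
import re


def extract_words_with_regex(src_text: str, var_name: str) -> list:
    """Return the quoted words of the array literal assigned to `var_name`."""

    # Capture everything between "var_name = [" and the first closing bracket
    match = re.search(rf"\b{re.escape(var_name)}\s*=\s*\[(.*?)\]", src_text, flags=re.DOTALL)

    if match is None:
        return []

    # Pull out every double-quoted token and convert it to lowercase
    return [word.lower() for word in re.findall(r'"(\w+)"', match.group(1))]


# Hypothetical sample mimicking the layout of the real file
sample_text = """
var mo = [
        "cigar",
        "rebut",
    ],
    fo = [
        "aahed",
        "aalii",
    ],
"""

answers = extract_words_with_regex(sample_text, "mo")
guesses = extract_words_with_regex(sample_text, "fo")
print(answers)
print(guesses)
```

With the real file, the list of lines returned by `load_file` would first have to be joined back into one string, for example with `"\n".join(src_code)`.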



Let's make sure there are no duplicated words inside each list. Converting each `list` into a `set` gets rid of any duplicates.

Then, we will check if there are duplicates across both lists.

And, finally, each `set` will be converted back to a `list`.

*(Duplicated words would unbalance the probabilities of each word being randomly picked during the game execution)*

In [10]:
words_lists[0] = set(words_lists[0])
words_lists[1] = set(words_lists[1])

print(f"Duplicated words across lists: {list(words_lists[0] & words_lists[1])}")

words_lists[0] = list(words_lists[0])
words_lists[1] = list(words_lists[1])

Duplicated words across lists: []


Great, no words are duplicated across lists.
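
Note that the `set` conversion drops repeats inside a list silently, without reporting them. A minimal sketch of how they could be listed, meant to run on the raw lists before the conversion; `find_duplicates` and the sample data are hypothetical:

```python
from collections import Counter


def find_duplicates(words: list) -> list:
    """Return the words that occur more than once, sorted alphabetically."""

    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count > 1)


# Hypothetical sample with one repeated word
sample_words = ["cigar", "rebut", "cigar", "sissy"]

duplicates = find_duplicates(sample_words)
print(f"Duplicated words inside the list: {duplicates}")
```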

Since the **valid guesses** list will always be larger than the **valid answers** list, we can tell them apart by length: the shorter first list, `words_lists[0]`, holds our **valid answers**, and the longer second list, `words_lists[1]`, holds our **valid guesses**.

Now we can store them *(alphabetically sorted)* in `.txt` files.

In [11]:
def write_file(filepath: str, words: list) -> None:
    """Function to write a list of words to a file"""

    with open(filepath, "w", encoding="utf-8") as f:
        for word in words:
            f.write(f"{word}\n")

write_file("data/words/valid_answers.txt", sorted(words_lists[0]))
write_file("data/words/valid_guesses.txt", sorted(words_lists[1]))
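
To confirm that nothing is lost on the way to disk, the written files can be read back and compared with the in-memory lists. The sketch below shows the idea on a temporary file so it can run anywhere; `read_words`, the sample words, and the file name are hypothetical:

```python
import os
import tempfile


def read_words(filepath: str) -> list:
    """Read one word per line from a text file."""

    with open(filepath, "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Hypothetical sample standing in for one of the real words lists
sample_words = sorted(["rebut", "cigar", "sissy"])

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_words.txt")

    # Same one-word-per-line layout as the files written above
    with open(sample_path, "w", encoding="utf-8") as f:
        for word in sample_words:
            f.write(f"{word}\n")

    # Read the file back while the temporary directory still exists
    round_trip = read_words(sample_path)

print(round_trip == sample_words)
```

The same comparison could be applied to `valid_answers.txt` and `valid_guesses.txt` against `sorted(words_lists[0])` and `sorted(words_lists[1])`.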