# **Performing ETL on [Term.ooo](https://term.ooo) game** *(Brazilian Portuguese variant of [Wordle](https://www.nytimes.com/games/wordle/index.html))*

Before we start, let's look through the source code by hand for the variables in which each word `list` is stored.

###############################################################################<br>
1st `list`:

```js script
 4352 >>>     Kf = new Set([
 4353 >>>         "ababa",
 4354 >>>         "abacá",
 4355 >>>         "abada",
 4356 >>>         "abade",
 4357 >>>         "abado",
                     ˄
                     ˅
13494 >>>         "úropo",
13495 >>>         "úsnea",
13496 >>>         "úvico",
13497 >>>         "úvido",
13498 >>>         "úvula",
13499 >>>     ]),
```
###############################################################################<br>
2nd `list`: 

```js script
13500 >>>    Xf = {
13501 >>>        abaca: "abacá",
13502 >>>        abara: "abará",
13503 >>>        abare: "abaré",
13504 >>>        abebe: "abebé",
13505 >>>        abece: "abecê",
                           ˄
                           ˅
15638 >>>        uteis: "úteis",
15639 >>>        utero: "útero",
15640 >>>        uvico: "úvico",
15641 >>>        uvido: "úvido",
15642 >>>        uvula: "úvula",
15643 >>>    },
```

################################################################################<br>
3rd `list`: 

```js script
15644 >>>    Zf = [
15645 >>>        "termo",
15646 >>>        "suíte",
15647 >>>        "ávido",
15648 >>>        "festa",
15649 >>>        "bebia",
                    ˄
                    ˅
17083 >>>        "sósia",
17084 >>>        "local",
17085 >>>        "gemer",
17086 >>>        "saber",
17087 >>>        "visar",
17088 >>>    ],
```

Analysing the entire code *(**termo.js** file)*, we can find 3 collections of words. Apparently, there is a **valid guesses** `Set`, a `dict/map` of **words without accents** to **words with accents**, and a **valid answers** `list`.

They are stored in variables called `Kf`, `Xf` and `Zf`, respectively, each written over multiple lines of code.
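
As a quick illustration of how the `Xf` keys relate to their values, here is a minimal sketch that strips accents with the standard-library `unicodedata` module. This is a stand-in for the `unidecode` package used later in this notebook, and `strip_accents` is a hypothetical helper name. The sample entries are copied from the `Xf` excerpt above.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# A few entries taken from the Xf excerpt: unaccented key -> accented value
xf_sample = {"abaca": "abacá", "abece": "abecê", "utero": "útero"}

# Stripping the accents from each value should give back its key
stripped = {accented: strip_accents(accented) for accented in xf_sample.values()}
print(stripped)
```

If the mapping really goes from unaccented to accented spellings, every stripped value should equal its key, which is what the extraction step relies on when it normalizes all words.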

So, let's extract them!

In [1]:
import pandas as pd
import unidecode
import re

## Helper functions

In [2]:
def print_preview(words_lists: list) -> None:
    """Helper function to print a preview of each words list and its length"""

    # Iterate over each words list
    for words_list in words_lists:

        # Check if the words list is not empty
        if words_list:
            print(f"first:\t{words_list[:10]}")     # Print the first words
            print(f"last:\t{words_list[-10:]}")     # Print the last words
            print(f"length:\t{len(words_list)}")    # Print the length of the words list
            print()  

Let's load the JavaScript source code as a `list` of lines and preview it.

In [3]:
# Load the source code
filepath = "./data/0.src_code/termo.js"

with open(filepath, "r", encoding="utf-8") as f:
    src_code = f.read().splitlines()

In [4]:
# Preview the source code
print("#" * 120)
print("FIRST LINES:")

for line in src_code[:10]:
    print(f"\t{line}")

print("#" * 120)
print("LAST LINES:")

for line in src_code[-10:]:
    print(f"\t{line}")

print("#" * 120)
print(f"TOTAL LENGTH: {len(src_code)}")

########################################################################################################################
FIRST LINES:
	function a(a, e) {
	    var o = Object.keys(a);
	    if (Object.getOwnPropertySymbols) {
	        var r = Object.getOwnPropertySymbols(a);
	        e &&
	            (r = r.filter(function (e) {
	                return Object.getOwnPropertyDescriptor(a, e).enumerable;
	            })),
	            o.push.apply(o, r);
	    }
########################################################################################################################
LAST LINES:
	                                })
	                                .join(", ")
	                        )
	                    ),
	                Dh(1).then(function () {
	                    dx() || (xw.header.showBar(), gx(1));
	                }));
	        })(a, Fx);
	    for (var n = 0, s = a; n < s.length; n++) s[n].newLine();
	});
##############################################################

## Extracting the words

+ We can detect their start based on their variable name and their opening char: `'['` or `'{'`
+ We can detect their end based on their closing char: `']'` or `'}'`
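
The two rules above can be tried out on a toy snippet first. The sketch below is not the notebook's extraction function: `block_between` is a hypothetical helper, and `sample` is a hand-written stand-in for a few lines of **termo.js**. It assumes each variable name appears only once and that the closing char does not occur inside the block.

```python
import re

# Toy stand-in for a few lines of termo.js (not the real file contents)
sample = '''
    Kf = new Set([
        "ababa",
        "abacá",
    ]),
    Xf = {
        abaca: "abacá",
    },
    Zf = [
        "termo",
        "suíte",
    ],
'''


def block_between(text: str, var_name: str, opening_char: str, closing_char: str) -> str:
    """Return the text between the opening char after var_name and the next closing char."""
    start = text.index(var_name)                    # Where the variable is assigned
    start = text.index(opening_char, start) + 1     # Just past the opening char
    end = text.index(closing_char, start)           # First closing char after that
    return text[start:end]


# Quoted words for the Set and the list, bare keys for the dict/map
kf_words = re.findall(r'"(\w+)"', block_between(sample, "Kf =", "[", "]"))
xf_keys = re.findall(r"(\w+):", block_between(sample, "Xf =", "{", "}"))
zf_words = re.findall(r'"(\w+)"', block_between(sample, "Zf =", "[", "]"))

print(kf_words, xf_keys, zf_words)
```

In Python 3, `\w` matches accented letters in `str` patterns, so words such as `"abacá"` are captured whole.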

In [5]:
def extract_words_from_list(src_code: list, var_name: str, opening_char: str, closing_char: str) -> list:
    """Function to extract words from a list in the source code"""

    # Init variables
    words_list = []     # Output list of words
    store_words = False # Flag to whether store words or not

    # Iterate over the source code's lines
    for line in src_code:
        
        # If store_words flag is set, find words in the line
        if store_words:
            result = re.findall(r"\w+", line)

            # Check if the RegEx result contains words
            if len(result):

                # Grab the first word
                # Remove accented and special characters, convert it to lowercase
                # And add it to the words list
                words_list.append(unidecode.unidecode(result[0]).lower())
        
        # If line contains the variable name and the opening char, set the flag to store words
        if var_name in line and opening_char in line:
            store_words = True

        # If line contains the closing char, set the flag to not store words
        if closing_char in line:
            store_words = False

    # Return the output words list
    return words_list

# Extract each words list
words_lists = [None] * 3

words_lists[0] = extract_words_from_list(src_code=src_code, var_name="Kf =", opening_char="[", closing_char="]")
words_lists[1] = extract_words_from_list(src_code=src_code, var_name="Zf =", opening_char="[", closing_char="]")
words_lists[2] = extract_words_from_list(src_code=src_code, var_name="Xf =", opening_char="{", closing_char="}")

# Print the words lists previews
print_preview(words_lists)

first:	['ababa', 'abaca', 'abada', 'abade', 'abado', 'abafa', 'abafe', 'abafo', 'abaju', 'abala']
last:	['unsia', 'uraco', 'urano', 'urceo', 'urico', 'uropo', 'usnea', 'uvico', 'uvido', 'uvula']
length:	9147

first:	['termo', 'suite', 'avido', 'festa', 'bebia', 'honra', 'ouvir', 'pesco', 'fungo', 'pagam']
last:	['quica', 'aviao', 'retro', 'dores', 'credo', 'hinos', 'capim', 'tango', 'voces', 'jurar']
length:	1443

first:	['abaca', 'abara', 'abare', 'abebe', 'abece', 'abede', 'abico', 'abobo', 'abofe', 'aboco']
last:	['urano', 'urceo', 'urico', 'uropo', 'usnea', 'uteis', 'utero', 'uvico', 'uvido', 'uvula']
length:	2142



## Discarding duplicates

The words contained in the `dict/map` *(3rd list)* object are very likely the same ones found in either the **valid guesses** or the **valid answers** `lists` *(first 2 lists)*.

Let's make sure every word in the `dict/map` object is already in at least one of the other 2 lists.

In [6]:
# Iterate over the 3rd words list
for word in words_lists[2]:

    # If the word is in neither of the first 2 lists, print a message and break
    if word not in words_lists[0] and word not in words_lists[1]:
        print(f"The word '{word}' is not in any of the first 2 lists")
        break

# If loop is not broken, print message
else:
    print("All words are in either one of the first 2 lists")

All words are in either one of the first 2 lists
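
The same membership check can also be written with set operations, which avoids scanning the lists once per word. This is a minimal sketch on made-up sample data: `missing_words` and the three sample lists are hypothetical, not part of the notebook.

```python
def missing_words(candidates: list, *known_lists: list) -> list:
    """Return the candidates that appear in none of the known lists, sorted."""
    known = set().union(*known_lists)           # All words from every known list
    return sorted(set(candidates) - known)      # Set difference = words never seen


# Hypothetical sample data standing in for the 3 extracted lists
guesses = ["ababa", "abaca", "uvula"]
answers = ["termo", "suite"]
accent_keys = ["abaca", "suite", "uvula"]

missing = missing_words(accent_keys, guesses, answers)
print(missing)
```

An empty result means every candidate is already covered. Unlike the loop with `break`, this version would list all missing words instead of only the first one.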


Great, there are no new words in the 3rd list, so we can discard it.

In [7]:
try:
    del words_lists[2]

except IndexError:
    print("There is no 3rd words list\n")

print_preview(words_lists)

first:	['ababa', 'abaca', 'abada', 'abade', 'abado', 'abafa', 'abafe', 'abafo', 'abaju', 'abala']
last:	['unsia', 'uraco', 'urano', 'urceo', 'urico', 'uropo', 'usnea', 'uvico', 'uvido', 'uvula']
length:	9147

first:	['termo', 'suite', 'avido', 'festa', 'bebia', 'honra', 'ouvir', 'pesco', 'fungo', 'pagam']
last:	['quica', 'aviao', 'retro', 'dores', 'credo', 'hinos', 'capim', 'tango', 'voces', 'jurar']
length:	1443



Let's make sure there are no duplicated words inside each list. Converting each `list` into a `set` gets rid of any duplicates.

Then, we will check whether any word appears in both lists.

Finally, each `set` will be converted back to a `list`.

*(Duplicated words would unbalance the probability of each word being randomly picked during the game execution)*
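
Converting to a `set` drops duplicates silently. To see which words were repeated, and how a repeat skews a uniform random pick, something like the following sketch could be run first. `find_duplicates` and the sample list are hypothetical.

```python
from collections import Counter


def find_duplicates(words: list) -> dict:
    """Return each word that occurs more than once, with its count."""
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}


# Hypothetical list with one repeated word
sample = ["termo", "festa", "termo", "saber"]

duplicates = find_duplicates(sample)
print(duplicates)

# Chance of drawing "termo" uniformly at random, with and without the repeat
p_with_duplicate = sample.count("termo") / len(sample)
p_deduplicated = 1 / len(set(sample))
print(p_with_duplicate, p_deduplicated)
```

With the repeat, "termo" is picked half of the time. After deduplication, each of the three words has the same one-in-three chance.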

In [8]:
words_lists[0] = set(words_lists[0])
words_lists[1] = set(words_lists[1])

print(f"Duplicated words across lists: {list(words_lists[0] & words_lists[1])}")

words_lists[0] = list(words_lists[0])
words_lists[1] = list(words_lists[1])

Duplicated words across lists: []


Great, no words are duplicated across lists.

The **valid guesses** list is expected to be the larger of the two, so the words in the first list `words_lists[0]` are our **valid guesses** and the ones in the second list `words_lists[1]` are our **valid answers**.

We can now store them *(alphabetically sorted)* in `.csv` files.

In [9]:
def write_csv(filepath: str, words: list) -> None:
    """Function to write a list of words to a csv file"""

    df = pd.DataFrame(words, columns=["word"])
    df.to_csv(filepath, index=False)


write_csv("./data/1.raw/valid_guesses.csv", sorted(words_lists[0]))
write_csv("./data/1.raw/valid_answers.csv", sorted(words_lists[1]))
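
A round-trip check is a cheap way to confirm that a written file has the single `word` column and keeps the sorted order. The sketch below uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies. The sample words and the temporary path are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample standing in for one of the sorted words lists
words = sorted(["termo", "festa", "saber"])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "valid_answers.csv"

    # Write a header row followed by one word per row
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word"])
        writer.writerows([word] for word in words)

    # Read the file back using the header as the column name
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        loaded = [row["word"] for row in csv.DictReader(f)]

print(loaded == words)
```

The same comparison could be done with `pd.read_csv` on the real output files, reading the `word` column back into a list.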