# Website Localization Prep
This script copies a downloaded website and leaves out the non-translatable files. It also creates an Excel sheet that lists all the files of the website and indicates which files are translatable or non-translatable, based on a provided list of extensions. This list should be reviewed manually to make sure that nothing translatable has been missed.

Documenting which files are translatable and which are not is an essential step in clarifying the scope of the project. Once the client approves this list of translatable files, it protects you in case the client decides to change the scope later.

## Generate Loc Kit
Our first step is to generate a loc kit. We can do this easily with a script. To set up the script, we import pathlib, pandas, and shutil. We also have a function to convert bytes into a readable format.

In [70]:
from pathlib import Path
import pandas as pd
import shutil

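def convert_bytes(num):
    """Convert a size in bytes to a human-readable string (KB, MB, GB, ...)."""
    for unit in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, unit)
        num /= 1024.0
    # Sizes of 1024 TB or more are reported in PB instead of returning None
    return "%3.1f %s" % (num, 'PB')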

We create a "Prep" folder and set up our constants: the translatable extensions and the column names.

In [65]:
WEBSITE_FOLDER = Path(r"www.havasu-falls.com")
PREP = Path("Prep")
PREP.mkdir(exist_ok=True)

TRANSLATABLE = [".htm", ".html"]
COLUMNS = ["Filepath", "Filename", "Extension", "Size", "Translatable"]
ROWS = []

We walk through each file and folder in the website folder and determine whether each file is translatable. This is done by comparing the file's extension with our predefined list of translatable extensions. If the file is translatable, we copy it into the Prep folder; if not, it is not copied. For every file, translatable or not, we append a row with its path, name, extension, size, and translatable flag to the `ROWS` list, which later becomes a dataframe.

In [66]:
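for p in WEBSITE_FOLDER.rglob("*"):
    # Mirror the path inside the Prep folder
    relative = p.relative_to(WEBSITE_FOLDER)
    prep_path = PREP / relative
    if p.is_dir():
        # parents=True guards against a child being visited before its parent
        prep_path.mkdir(parents=True, exist_ok=True)
    elif p.is_file():
        filename = p.name
        extension = p.suffix
        size_bytes = p.stat().st_size  # avoid shadowing the built-in `bytes`
        size = convert_bytes(size_bytes)
        translatable = extension in TRANSLATABLE
        if translatable:
            # Only translatable files are copied into the Prep folder
            prep_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(p, prep_path)
        row = p, filename, extension, size, translatable
        ROWS.append(row)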
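
One thing to watch for in the comparison above: `p.suffix` preserves case, so a file named `INDEX.HTML` would be marked non-translatable. A minimal sketch of a case-insensitive check follows. The helper name `is_translatable` is hypothetical and not part of the original script.

```python
from pathlib import PurePath

TRANSLATABLE = [".htm", ".html"]

def is_translatable(path, extensions=TRANSLATABLE):
    """Hypothetical helper: compare the lowercased suffix with the extension list."""
    return PurePath(path).suffix.lower() in extensions

# Example file names are made up for illustration.
print(is_translatable("www.havasu-falls.com/INDEX.HTML"))
print(is_translatable("www.havasu-falls.com/logo.png"))
```

Whether to lowercase depends on the site: if the server is case-sensitive and the file names must be preserved exactly, only the comparison needs to change, not the copied file names.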

We write the file information to an Excel sheet.

In [68]:
df = pd.DataFrame(ROWS, columns=COLUMNS, dtype=object)
df.to_excel("File List.xlsx", index=False)
print(df.head(5))

                                            Filepath  \
0              www.havasu-falls.com\applewebkit.html   
1               www.havasu-falls.com\contact-us.html   
2          www.havasu-falls.com\has_js=1; path=.html   
3  www.havasu-falls.com\havasu-canyon-waterfalls....   
4  www.havasu-falls.com\havasu-falls-information....   

                        Filename Extension     Size Translatable  
0               applewebkit.html     .html  10.8 KB         True  
1                contact-us.html     .html  53.2 KB         True  
2           has_js=1; path=.html     .html  10.8 KB         True  
3  havasu-canyon-waterfalls.html     .html  57.5 KB         True  
4  havasu-falls-information.html     .html  60.4 KB         True  
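
To support the manual review of the file list, it can help to count the skipped files per extension, so that unexpected non-translatable types (for example, script or XML files that contain visible text) stand out. The sketch below uses `collections.Counter` on made-up sample rows with the same shape as `ROWS`. The function name `count_skipped_extensions` is an assumption for illustration.

```python
from collections import Counter

# Sample rows shaped like ROWS: (filepath, filename, extension, size, translatable).
# These values are illustrative only, not taken from the real site.
sample_rows = [
    ("site/index.html", "index.html", ".html", "10.8 KB", True),
    ("site/contact-us.html", "contact-us.html", ".html", "53.2 KB", True),
    ("site/js/menu.js", "menu.js", ".js", "2.1 KB", False),
    ("site/img/logo.png", "logo.png", ".png", "4.0 KB", False),
    ("site/img/banner.png", "banner.png", ".png", "88.3 KB", False),
]

def count_skipped_extensions(rows):
    """Count non-translatable files per extension."""
    return Counter(ext for _, _, ext, _, translatable in rows if not translatable)

skipped = count_skipped_extensions(sample_rows)
for ext, count in skipped.most_common():
    # Files without a suffix have an empty extension string
    print(f"{ext or '(no extension)'}: {count}")
```

Passing the real `ROWS` list instead of `sample_rows` gives a quick checklist of extensions to inspect by hand.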


## Assess Word Count
Next, we load the prep folder into memoQ using the "Import Folder Structure" option. This option can itself include or exclude certain file types, so if the CAT tool you're using has such a feature, the previous step of generating a loc kit may be unnecessary. Not all CAT tools and TMSs have it, though, so generating a loc kit is sometimes still necessary.  
  
Get the word count of each file with the "Statistics" button on the "Documents" tab of the ribbon and download it as a CSV file.  
![image_2.png](screenshots/image_2.png)
![image_1.png](screenshots/image_1.png)
![image_3.png](screenshots/image_3.png)

## Merge Word Count with File List
After downloading the statistics as CSV, paste the data from the entire CSV into the second tab of the file list Excel workbook. With all the pasted cells selected, click the Name Box in the upper-left corner (the box that shows `A1`) and name the range `word_count`. Then replace the leading part of each pasted path with the leading part of the original path listed on Sheet1, so that the paths on both sheets match exactly.

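Then, in a new column next to the `Translatable` column in Sheet1, enter this formula:  
`=IF(E2,VLOOKUP(A2,word_count,84,FALSE),"")`  

In my case, `E2` is the TRUE/FALSE value in the `Translatable` column, `word_count` is the named range holding the data from the CSV file, and `84` is the index of the column in that range that contains the total word count I'd like to display. The final `FALSE` asks for an exact match on the file path; without it, `VLOOKUP` does an approximate match, which is only reliable when the lookup column is sorted. If `E2` is FALSE, the formula displays an empty string `""`.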

Fill the formula down the sheet by selecting the cells you'd like to fill, starting with the cell that contains the formula, and pressing `Ctrl+D`.  

![image_4.png](screenshots/image_4.png)
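
The same lookup can also be done in pandas instead of Excel. The sketch below is an alternative to the workflow above, not a replacement for it, and it rests on several assumptions: the column names `Path` and `Total words` are hypothetical stand-ins for whatever headers your statistics export actually uses, the prefix `C:/memoQ/Prep` is a placeholder for the leading part of the exported paths, and the rows and word count are made-up sample data.

```python
import pandas as pd

# File list as produced earlier (illustrative rows only).
files = pd.DataFrame({
    "Filepath": [
        "www.havasu-falls.com/contact-us.html",
        "www.havasu-falls.com/logo.png",
    ],
    "Translatable": [True, False],
})

# Stand-in for the statistics CSV; "Path" and "Total words" are hypothetical headers.
stats = pd.DataFrame({
    "Path": ["C:/memoQ/Prep/contact-us.html"],
    "Total words": [412],
})

# Replace the leading part of the exported path so it matches the file list.
stats["Filepath"] = stats["Path"].str.replace(
    "C:/memoQ/Prep", "www.havasu-falls.com", regex=False
)

# A left merge keeps every file; files without statistics get a missing word count.
merged = files.merge(stats[["Filepath", "Total words"]], on="Filepath", how="left")
print(merged)
```

A left merge plays the role of the `IF`/`VLOOKUP` pair: non-translatable files stay in the list with an empty word count rather than being dropped.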