### Translation task
The European Union has 24 official languages. Although few European countries recognize English as an official language, it is widely understood across most of the continent. Moreover, English has become a necessity for those working in the IT industry: it is the predominant language of technical communication and facilitates global collaboration.

Currently, our dataset mixes several European languages across the collected samples. A single language, English, is ideal for the dataset, so in this notebook we detect the language of each row and, if the ad is not in English, translate it into English to ensure data consistency.
The final dataset contains all job postings in English, allowing for uniform text analysis.

### Importing libraries

In [1]:
import pandas as pd
from langdetect import detect, DetectorFactory
from deep_translator import GoogleTranslator
import json
import os

### Defining paths
The translation task is performed one iteration at a time, which keeps the number of rows per run manageable. For each iteration we read "li_jobs.json" from the data folder indicated below and write the output to that same iteration folder.

In [3]:
main_file = "li_jobs.json"
output = "tr_li_jobs.json"
data_folder = "data/5"
main_file_path = os.path.join(data_folder, main_file)
output_path = os.path.join(data_folder, output)

In [4]:
# Load the JSON file and check the number of rows
df_initial = pd.read_json(main_file_path)
row_count = df_initial.shape[0]
print(row_count)

1507


In [5]:
# checking the first 10 rows of the dataframe
df_initial.head(10)

Unnamed: 0,Title,Description,Primary Description,Detail URL,Location,Skill,Insight,Job State,Poster Id,Company Name,Company Logo,Created At,Scraped At
0,Cyber Security Architect,Job Description:\n\nSummary\n\nAre you a Cyber...,"Airbus Defence and Space · Getafe, Community o...",https://www.linkedin.com/jobs/view/4227633207,"Getafe, Community of Madrid, Spain","Skills: Information Security, Cybersecurity, +...",,LISTED,327764974,Airbus Defence and Space,https://media.licdn.com/dms/image/v2/C4E0BAQHi...,2025-05-14T13:43:44.000Z,2025-05-21T11:52:16.090Z
1,Advanced Penetration Tester (m/f/d),"Since its foundation in 1925, the DEKRA promis...","DEKRA Digital & Product Solutions · Málaga, An...",https://www.linkedin.com/jobs/view/4231914435,"Málaga, Andalusia, Spain","Skills: Reverse Engineering, Penetration Testi...",,LISTED,648251543,DEKRA Digital & Product Solutions,https://media.licdn.com/dms/image/v2/D4E0BAQEN...,2025-05-21T10:42:44.000Z,2025-05-21T11:52:16.090Z
2,Cybersecurity Specialist,A Snapshot of Your Day\n\nJoin our dynamic tea...,"Siemens Energy · Zamudio, Basque Country, Spai...",https://www.linkedin.com/jobs/view/4229281486,"Zamudio, Basque Country, Spain",Skills: Security Information and Event Managem...,,LISTED,139324179,Siemens Energy,https://media.licdn.com/dms/image/v2/D4E0BAQGs...,2025-05-16T18:46:59.000Z,2025-05-21T11:52:16.091Z
3,DevSecOps,¿QUÉ HACEMOS EN EL EQUIPO? En la unidad de Con...,Telefónica Tech · Greater Madrid Metropolitan ...,https://www.linkedin.com/jobs/view/4229104270,Greater Madrid Metropolitan Area,"Skills: Security, Security Architecture Design...",,LISTED,688504198,Telefónica Tech,https://media.licdn.com/dms/image/v2/D4D0BAQFJ...,2025-05-13T15:25:33.000Z,2025-05-21T11:52:16.092Z
4,Cyber Security Intern,"Welcome to the future of nuclear energy, where...",Westinghouse Electric Company · San Sebastián ...,https://www.linkedin.com/jobs/view/4195203580,"San Sebastián de los Reyes, Community of Madri...",Skills: Security Information and Event Managem...,,LISTED,358350140,Westinghouse Electric Company,https://media.licdn.com/dms/image/v2/C560BAQFW...,2025-04-01T16:14:40.000Z,2025-05-21T11:52:16.093Z
5,Técnico/a en Ciberseguridad y Administración d...,Técnico/a en Ciberseguridad y Administración d...,"MECIDES · Santa Cruz de Tenerife, Canary Islan...",https://www.linkedin.com/jobs/view/4226086243,"Santa Cruz de Tenerife, Canary Islands, Spain","Skills: Team Leadership, Team Building, +8 more",,LISTED,781476502,MECIDES,https://media.licdn.com/dms/image/v2/C560BAQG7...,2025-05-13T08:31:24.000Z,2025-05-21T11:52:16.093Z
6,Jr. KAM - Ciberseguridad,Desde Robert Walters estamos colaborando con u...,"Robert Walters · Zaragoza, Aragon, Spain (On-s...",https://www.linkedin.com/jobs/view/4218688853,"Zaragoza, Aragon, Spain","Skills: Account Management, Business-to-Busine...",,LISTED,610885222,Robert Walters,https://media.licdn.com/dms/image/v2/D4E0BAQG7...,2025-04-28T08:05:31.000Z,2025-05-21T11:52:16.093Z
7,Consultor/a Ciberseguridad Madrid,"En EY, tendrás la oportunidad de construir una...","EY · Madrid, Community of Madrid, Spain (On-site)",https://www.linkedin.com/jobs/view/4081110521,"Madrid, Community of Madrid, Spain","Skills: Information Security, Cybersecurity, +...",,LISTED,214206,EY,https://media.licdn.com/dms/image/v2/C510BAQGp...,2024-11-22T13:33:39.000Z,2025-05-21T11:52:16.093Z
8,Consultoría | Pentester Web - Ciberseguridad,Job Description & Summary\n\nSi quieres hacer ...,"PwC España · Madrid, Community of Madrid, Spai...",https://www.linkedin.com/jobs/view/4093842268,"Madrid, Community of Madrid, Spain","Skills: Mathematics, Presentations, +8 more",,LISTED,766401024,PwC España,https://media.licdn.com/dms/image/v2/D4D0BAQFg...,2024-12-06T17:00:20.000Z,2025-05-21T11:52:16.094Z
9,Administrador de seguridad (Cyberark),"En ChangeTheBlock, buscamos un/a Administrador...","ChangeTheBlock · Madrid, Community of Madrid, ...",https://www.linkedin.com/jobs/view/4231132768,"Madrid, Community of Madrid, Spain","Skills: Cyberark, English",,LISTED,694606844,ChangeTheBlock,https://media.licdn.com/dms/image/v2/D4D0BAQE4...,2025-05-20T07:36:24.000Z,2025-05-21T11:52:16.094Z


### Translation Function
The function below detects the language of the specified columns and translates non-English text into English. Its parameters are the DataFrame containing the job listings, the number of rows to process, and a boolean indicating whether to keep the original text in new columns for comparison. It returns the modified DataFrame with translations in the selected columns.

In [6]:
DetectorFactory.seed = 0  # Ensures consistent language detection

In [7]:
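def detect_and_translate(df, sample_size=None, show_original=False):
    """Translate non-English text in the job-listing columns to English."""
    # Default to every row; a default of row_count would be frozen at definition time
    if sample_size is None:
        sample_size = len(df)

    # Columns to translate (columns missing from df are skipped below)
    translate_cols = ['Title', 'Description', 'Primary Description', 'Specialties', 'Company Description']
    
    # Sample for validation (note: sample() also shuffles the row order)
    sample_df = df.sample(n=sample_size, random_state=42).copy()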

    def translate_text(text):
        # Detects language and translates to English, splitting if >5000 chars due to an API limitation.
        if not isinstance(text, str) or text.strip() == "":
            return text  # Skip NaN, non-string, or empty values
        try:
            lang = detect(text)
            if lang != "en":
                if len(text) > 5000:
                    # Split into chunks under 5000 chars
                    chunks = []
                    start = 0
                    while start < len(text):
                        end = min(start + 4999, len(text))  # Max chunk size 4999
                        # Try to split at a natural boundary (period or space)
                        if end < len(text):
                            last_period = text.rfind('.', start, end)
                            last_space = text.rfind(' ', start, end)
                            split_point = max(last_period, last_space)
                            if split_point > start:
                                end = split_point + 1  # Keep the delimiter in this chunk
                        chunks.append(text[start:end])
                        start = end
                    # Translate each chunk and join
                    translated_chunks = []
                    for chunk in chunks:
                        if chunk.strip():  # Only translate non-empty chunks
                            translated_chunks.append(GoogleTranslator(source=lang, target="en").translate(chunk))
                        else:
                            translated_chunks.append(chunk)
                    return ' '.join(translated_chunks)
                else:
                    return GoogleTranslator(source=lang, target="en").translate(text)
        except Exception as e:
            print(f"Translation error: {e}")
            return text  # Return original text if translation fails
        return text  # Return original text if English

    # Apply translation function
    for col in translate_cols:
        if col in sample_df.columns:
            if show_original:
                # Store original text in a new column
                sample_df[f"{col}_original"] = sample_df[col]
            # Replace with translated text
            sample_df[col] = sample_df[col].apply(translate_text)

    return sample_df
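
The chunking step inside the function can be checked on its own, without calling the translation API. The following is a minimal sketch of that idea as a standalone helper; the name `split_into_chunks` and the `max_len` parameter are hypothetical and not part of the notebook. As a design choice of this sketch, the delimiter stays with the preceding chunk, so concatenating the chunks reproduces the input exactly.

```python
def split_into_chunks(text, max_len=4999):
    """Split text into pieces of at most max_len characters.

    Hypothetical helper: it prefers to cut just after the last period or
    space inside the current window and falls back to a hard cut when the
    window contains no such boundary.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Look for the last natural boundary inside the window.
            boundary = max(text.rfind(".", start, end), text.rfind(" ", start, end))
            if boundary > start:
                end = boundary + 1  # keep the delimiter with this chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each returned chunk stays under the 5000-character limit mentioned in the function's comment, so every piece could be passed to the translator separately and the results joined afterwards.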


### Storing results
Finally, the results are saved in the folder of the current iteration. To complete the dataset, the translated files from all iterations must still be merged; this is done in the last step of the merge-data.ipynb notebook.
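
Once every iteration has been translated, the merge could look roughly like the sketch below. This is only an illustration, not the code in merge-data.ipynb: the function name `merge_translated` is hypothetical, and it assumes each iteration folder sits directly under one root (as "data/5" does here) and contains a "tr_li_jobs.json" file holding a list of records.

```python
import json
from pathlib import Path


def merge_translated(data_root, file_name="tr_li_jobs.json"):
    """Concatenate the records of every <data_root>/<iteration>/<file_name>.

    Hypothetical helper: assumes each file holds a JSON list of records,
    as written by json.dump(..., orient="records"-style data).
    """
    merged = []
    # Sort the paths so the result does not depend on filesystem listing order.
    # The sort is lexicographic, so "data/10" comes before "data/5".
    for path in sorted(Path(data_root).glob(f"*/{file_name}")):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged
```

Calling `merge_translated("data")` would then return a single list of records that can be written back out with `json.dump`, in the same way as the per-iteration output below.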

In [8]:
translated_df = detect_and_translate(df_initial, sample_size=row_count, show_original=False)

translated_df = translated_df.to_dict(orient="records")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(translated_df, f, ensure_ascii=False, indent=2)

Translation error: Descripción

¿Te apasiona la tecnología? ¿Quieres desarrollarte en una empresa dinámica y en constante crecimiento? ¡Te estamos buscando! En Nunsys Group estamos ampliando nuestro equipo, buscamos un/a Ingeniero/a de Ciberseguridad en Remoto!

Nuestro viaje empezó en 2007 y tenemos una meta: ser la empresa española líder en tecnología. Nos apasiona la transformación digital y contamos con un gran portfolio de soluciones tecnológicas. Gracias a esto garantizamos que podemos afrontar cualquier reto que nos planteen nuestros clientes.

Actualmente somos más de 2.600 profesionales y tenemos presencia en 25 ciudades españolas, Portugal, USA, Colombia y Ecuador. ¡Y esto es solo el principio!

Y el proyecto ¿en qué consiste? El día a día es muy variado, pero algunas de las funciones principales en las que participarás son:

Realizarás auditorías de ciberseguridad, incluyendo auditorías de conformidad y análisis de seguridad de sistemas y redes. Implementarás soluciones de E