### Translation task
The European Union has 24 official languages. Although few European countries recognize English as an official language, it is widely understood across most of the continent. Moreover, English has become a necessity for those working in the IT industry: it is the predominant language of technical communication and facilitates global collaboration.

Currently, our dataset mixes several European languages across the collected samples. A single-language dataset is preferable, so in this notebook we detect the language of each row and, if the ad is not in English, translate it into English to ensure data consistency.
The final dataset contains all job postings in English, allowing for uniform text analysis.
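
Before translating, it can help to see how many rows are in each language. The sketch below tallies detected language codes for a list of texts. The detector is passed in as a parameter, so in this notebook it could be `detect` from `langdetect`; the helper `language_counts` and the stand-in `toy_detect` are hypothetical names used only for illustration.

```python
from collections import Counter


def language_counts(texts, detect_fn):
    """Tally the detected language code of each text (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Missing or blank values cannot be detected, so count them separately
        if not isinstance(text, str) or not text.strip():
            counts["empty"] += 1
            continue
        try:
            counts[detect_fn(text)] += 1
        except Exception:
            # A detector may raise on text it cannot classify
            counts["unknown"] += 1
    return counts


# Toy stand-in detector (an assumption, only to keep the example self-contained).
def toy_detect(text):
    return "fr" if "recherche" in text.lower() else "en"


sample = ["iKe recherche un consultant", "AI Tutor", "", None]
print(language_counts(sample, toy_detect))
```

In the notebook itself, a call such as `language_counts(df_initial["Description"], detect)` would give the same kind of summary for a real column.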

### Importing libraries

In [1]:
import pandas as pd
from langdetect import detect, DetectorFactory
from deep_translator import GoogleTranslator
import json
import os

### Defining paths
The translation task is performed one iteration at a time, which keeps the number of rows per run manageable. For each iteration we read "li_jobs.json" from the indicated data folder and write the output to the folder of that same iteration.
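
As a minimal sketch, the paths for any iteration could be built by a small helper. It assumes every iteration lives in a `data/<iteration>` folder, as `data/7` suggests; the function name `iteration_paths` is hypothetical.

```python
import os


def iteration_paths(iteration, base_folder="data",
                    main_file="li_jobs.json", output_file="tr_li_jobs.json"):
    """Return (input_path, output_path) for one iteration folder, e.g. data/7."""
    # Assumption: each iteration has its own sub-folder named after its number
    folder = os.path.join(base_folder, str(iteration))
    return os.path.join(folder, main_file), os.path.join(folder, output_file)


main_file_path, output_path = iteration_paths(7)
print(main_file_path, output_path)
```

With this helper, moving to the next iteration only means changing the argument.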

In [2]:
main_file = "li_jobs.json"
output = "tr_li_jobs.json"
data_folder = "data/7"
main_file_path = os.path.join(data_folder, main_file)
output_path = os.path.join(data_folder, output)

In [3]:
# Load the JSON file and check the number of rows
df_initial = pd.read_json(main_file_path)
row_count = df_initial.shape[0]
print(row_count)

3944


In [4]:
# checking the first 10 rows of the dataframe
df_initial.head(10)

Unnamed: 0,Title,Description,Primary Description,Detail URL,Location,Skill,Insight,Job State,Poster Id,Company Name,Company Logo,Created At,Scraped At
0,Délégué/Déléguée à la Protection des Données P...,CFL multimodal regroupant les services transve...,CFL - Société Nationale des Chemins de Fer Lux...,https://www.linkedin.com/jobs/view/4221073209,"Dudelange, Luxembourg, Luxembourg",Skills: General Data Protection Regulation (GD...,,LISTED,372922919.0,CFL - Société Nationale des Chemins de Fer Lux...,https://media.licdn.com/dms/image/v2/D4E0BAQHE...,2025-05-02T11:04:25.000Z,2025-05-28T10:14:56.869Z
1,NSI - Junior Security Engineer - FR/AN,Description de l'offre d'emploi\n\nAfin de ren...,"NSI Luxembourg PSF · Bertrange, Luxembourg, Lu...",https://www.linkedin.com/jobs/view/4235664393,"Bertrange, Luxembourg, Luxembourg","Skills: Linux, Security Information and Event ...",,LISTED,616927101.0,NSI Luxembourg PSF,https://media.licdn.com/dms/image/v2/C4D0BAQH5...,2025-05-27T13:19:58.000Z,2025-05-28T10:14:56.869Z
2,Information Technology Support Specialist (m/f/d),"About us \nAt RTL AdAlliance, we create simpli...","RTL AdAlliance · Luxembourg, Luxembourg (Hybrid)",https://www.linkedin.com/jobs/view/4232510933,"Luxembourg, Luxembourg","Skills: Windows 10, IT Operations, +8 more",,LISTED,9147426.0,RTL AdAlliance,https://media.licdn.com/dms/image/v2/D4E0BAQGo...,2025-05-21T12:50:10.000Z,2025-05-28T10:14:56.870Z
3,Data Protection & AI Compliance - Freelance,On E-Frontiers we're on the lookout for a Data...,E-Frontiers · European Union (Remote),https://www.linkedin.com/jobs/view/4221192811,European Union,"Skills: English, Artificial Intelligence (AI),...",,LISTED,339255538.0,E-Frontiers,https://media.licdn.com/dms/image/v2/D4D0BAQFF...,2025-05-05T09:40:19.000Z,2025-05-28T10:14:56.870Z
4,AI Tutor,"The AI Tutor role is a remote, full-time, temp...",xAI · EMEA (Remote),https://www.linkedin.com/jobs/view/4198823312,EMEA,"Skills: Analytical Skills, Professional Writin...",,LISTED,664570951.0,xAI,https://media.licdn.com/dms/image/v2/D560BAQFG...,2025-04-03T17:31:51.000Z,2025-05-28T10:14:56.871Z
5,Network Security Administrator,iKe recherche un consultant Network & Security...,"iKe Solutions · Luxembourg, Luxembourg (On-site)",https://www.linkedin.com/jobs/view/4239351140,"Luxembourg, Luxembourg","Skills: Ticketing Systems, McAfee, +8 more",,LISTED,409547717.0,iKe Solutions,https://media.licdn.com/dms/image/v2/D4E0BAQHg...,2025-05-28T08:57:57.000Z,2025-05-28T10:14:56.871Z
6,Administrative Assistant,AVEGA is a full-service provider based in the ...,"AVEGA Luxembourg · Kirchberg, Luxembourg, Luxe...",https://www.linkedin.com/jobs/view/4216342656,"Kirchberg, Luxembourg, Luxembourg","Skills: Administrative Assistance, English, +8...",,LISTED,,AVEGA Luxembourg,https://media.licdn.com/dms/image/v2/C560BAQEg...,2025-04-24T09:29:50.000Z,2025-05-28T10:14:56.871Z
7,Data Protection Specialist [m/w/d] / [m/f/d],Um unser Team am Standort Howald zu verstärken...,"Losch Luxembourg · Luxembourg, Luxembourg (On-...",https://www.linkedin.com/jobs/view/4228285731,"Luxembourg, Luxembourg","Skills: Data Privacy, General Data Protection ...",,LISTED,473137745.0,Losch Luxembourg,https://media.licdn.com/dms/image/v2/D4E0BAQFF...,2025-05-12T11:29:55.000Z,2025-05-28T10:14:56.872Z
8,Director of Sales - EMEA Overlay,Director of Sales - EMEA Overlay Location: 100...,Hubscale · EMEA (Remote),https://www.linkedin.com/jobs/view/4233940752,EMEA,"Skills: Overlay, Sales Processes, +8 more",,LISTED,636641157.0,Hubscale,https://media.licdn.com/dms/image/v2/D4E0BAQFK...,2025-05-23T16:33:00.000Z,2025-05-28T10:14:56.872Z
9,Threat Intelligence Consultant,🔎 Now Hiring: Senior Cyber Threat Intelligence...,"Stott and May · Luxembourg, Luxembourg (Hybrid)",https://www.linkedin.com/jobs/view/4218853949,"Luxembourg, Luxembourg",Skills: Security Information and Event Managem...,,LISTED,172671873.0,Stott and May,https://media.licdn.com/dms/image/v2/D4E0BAQEk...,2025-04-30T17:00:39.000Z,2025-05-28T10:14:56.872Z


### Translation Function
The function below detects the language of the specified columns and translates non-English text into English. It takes three parameters: the DataFrame containing the job listings, the number of rows to process, and a boolean indicating whether to keep the original text in new columns for comparison. It returns a modified copy of the DataFrame with translations in the target columns.

In [5]:
DetectorFactory.seed = 0  # Ensures consistent language detection

In [6]:
def detect_and_translate(df, sample_size=row_count, show_original=False):
    # Columns to translate
    translate_cols = ['Title', 'Description', 'Primary Description', 'Specialties', 'Company Description']
    
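    # Random sample for validation (note: sampling also shuffles the row order,
    # even when sample_size equals the full row count)
    sample_df = df.sample(n=sample_size, random_state=42).copy()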

    def translate_text(text):
        # Detects language and translates to English, splitting if >5000 chars due to an API limitation.
        if pd.isna(text) or text.strip() == "":
            return text  # Skip empty or NaN values
        try:
            lang = detect(text)
            if lang != "en":
                if len(text) > 5000:
                    # Split into chunks under 5000 chars
                    chunks = []
                    start = 0
                    while start < len(text):
                        end = min(start + 4999, len(text))  # Max chunk size 4999
                        # Try to split at a natural boundary (period or space)
                        if end < len(text):
                            last_period = text.rfind('.', start, end)
                            last_space = text.rfind(' ', start, end)
                            # Cut just after a period (so it stays in the chunk) or at a space
                            split_point = max(last_period + 1, last_space)
                            if split_point > start:
                                end = split_point
                        chunks.append(text[start:end])
                        # Skip the space we split on; chunks are re-joined with spaces later
                        start = end + 1 if end < len(text) and text[end] == ' ' else end
                    # Translate each chunk and join
                    translated_chunks = []
                    for chunk in chunks:
                        if chunk.strip():  # Only translate non-empty chunks
                            translated_chunks.append(GoogleTranslator(source=lang, target="en").translate(chunk))
                        else:
                            translated_chunks.append(chunk)
                    return ' '.join(translated_chunks)
                else:
                    return GoogleTranslator(source=lang, target="en").translate(text)
        except Exception as e:
            print(f"Translation error: {e}")
            return text  # Return original text if translation fails
        return text  # Return original text if English

    # Apply translation function
    for col in translate_cols:
        if col in sample_df.columns:
            if show_original:
                # Store original text in a new column
                sample_df[f"{col}_original"] = sample_df[col]
            # Replace with translated text
            sample_df[col] = sample_df[col].apply(translate_text)

    return sample_df
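
The chunking step is easier to check on its own. Below is a standalone sketch of that step as a function, with a hypothetical name `split_into_chunks` and a configurable `max_len`. It is a variant rather than a copy of the code above: it cuts just after a period instead of on it, so the punctuation stays in the chunk.

```python
def split_into_chunks(text, max_len=4999):
    """Split text into pieces of at most max_len characters,
    preferring to cut just after a period or at a space."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            last_period = text.rfind(".", start, end)
            last_space = text.rfind(" ", start, end)
            # last_period + 1 keeps the period inside the current chunk
            split_point = max(last_period + 1, last_space)
            if split_point > start:
                end = split_point
        chunks.append(text[start:end])
        # Drop the single space we cut on, if any
        start = end + 1 if end < len(text) and text[end] == " " else end
    return chunks


demo = "First sentence. Second sentence is a bit longer. Third."
for piece in split_into_chunks(demo, max_len=20):
    print(repr(piece))
```

If the text has no period or space within the window, the function falls back to a hard cut at `max_len`, so every chunk stays within the limit.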


### Storing results
Finally, the results are saved in the folder of the current iteration. The translated files from all iterations still need to be merged to complete the dataset; this is done in the last step of the merge-data.ipynb notebook.
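
For orientation, the merge could look like the sketch below. This is not the code from merge-data.ipynb, which is not shown here. It assumes each iteration folder under `data` holds a `tr_li_jobs.json` file containing a list of records; the function name `merge_translated` is hypothetical.

```python
import glob
import json
import os


def merge_translated(base_folder="data", file_name="tr_li_jobs.json"):
    """Concatenate the records of every <base_folder>/<iteration>/<file_name>."""
    merged = []
    pattern = os.path.join(base_folder, "*", file_name)
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))  # each file is assumed to hold a list of records
    return merged


all_jobs = merge_translated()
print(len(all_jobs))
```
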

In [7]:
translated_df = detect_and_translate(df_initial, sample_size=row_count, show_original=False)

translated_df = translated_df.to_dict(orient="records")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(translated_df, f, ensure_ascii=False, indent=2)

Translation error: Senior Security DevOps Engineer
🏫: PPC Romania👥: Cyber Security📍: Bucuresti📝: Perioada nedeterminata

#CarieraTa incepe cu a face ceea ce iti place 😊

Suntem mereu in cautare de colegi talentati si motivati care sa ni se alature si, impreuna, sa contribuim la crearea unui viitor sustenabil, bazat pe incluziune, empatie, respect si egalitate de sanse.

Iti cream contextul sa te dezvolti, fiind responsabil/a de:Definerea cerințelor și implementarea componentele de integrare și livrare legate de configurarea și validarea securității;Managementul infrastructurii AWS: Expertiză demonstrată în infrastructură ca serviciu (IaaS) pe AWS. Responsabil pentru proiectarea, implementarea și gestionarea resurselor AWS pentru a asigura performanța maximă, securitatea și optimizarea costurilor.Administrarea GitLab, asigurând configurarea, întreținerea și asistența optime, utilizându-l în același timp pentru controlul versiunilor și proiectarea și menținerea conductelor CI/CD robuste.