### Translation task
The European Union has 24 official languages. Although few European countries recognize English as an official language, it is widely understood across most of the continent. Moreover, English has become a necessity for those working in the IT industry: it is the predominant language of communication in the technical domain and facilitates global collaboration.

Our dataset currently mixes several European languages across the collected samples. A single-language dataset is preferable, so this notebook detects the language of each row and, when an ad is not in English, translates it into English to keep the data consistent.
The final dataset contains all job postings in English, allowing for uniform text analysis.

### Importing libraries

In [1]:
import pandas as pd
from langdetect import detect, DetectorFactory
from deep_translator import GoogleTranslator
import json
import os

### Defining paths
The translation task is performed one iteration at a time, which keeps the number of rows per run manageable. The input is the "li_jobs.json" file in the data folder indicated below, and the output is written to the folder of the corresponding iteration.

In [3]:
main_file="li_jobs.json"
output = "tr_li_jobs.json"
data_folder="data/6"
main_file_path = os.path.join(data_folder, main_file)
output_path = os.path.join(data_folder, output)
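
If the same notebook is run for several iterations, the path construction above could be wrapped in a small helper. The following is only a sketch: `iteration_paths` and its parameters are hypothetical names, and it assumes that each iteration lives in a numbered subfolder of "data", as "data/6" suggests.

```python
from pathlib import Path


def iteration_paths(iteration, data_root="data",
                    main_file="li_jobs.json", output="tr_li_jobs.json"):
    """Return (input_path, output_path) for one iteration folder.

    Hypothetical helper: assumes the layout <data_root>/<iteration>/<file>.
    """
    folder = Path(data_root) / str(iteration)
    return folder / main_file, folder / output


# Paths for iteration 6, equivalent to the os.path.join calls above
in_path, out_path = iteration_paths(6)
print(in_path)
print(out_path)
```

Both pd.read_json and open accept Path objects, so the returned values can be used in place of the string paths.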

In [4]:
# Load the JSON file and check the number of rows
df_initial = pd.read_json(main_file_path)
row_count = df_initial.shape[0]
print(row_count)

7182


In [5]:
# checking the first 10 rows of the dataframe
df_initial.head(10)

Unnamed: 0,Title,Description,Primary Description,Detail URL,Location,Skill,Insight,Job State,Poster Id,Company Name,Company Logo,Created At,Scraped At
0,Vulnerability Analyst,Vulnerability Analyst Junior\nAlfa Group è una...,"Alfa Group · Rome, Latium, Italy (Hybrid)",https://www.linkedin.com/jobs/view/4233218439,"Rome, Latium, Italy",Skills: Python (Programming Language),,LISTED,974757775,Alfa Group,https://media.licdn.com/dms/image/v2/D4D0BAQFF...,2025-05-22T12:38:10.000Z,2025-05-22T12:46:03.839Z
1,Helpdesk IT,"Anemocyte, Biotech Manufacturing Organization ...","ANEMOCYTE · Gerenzano, Lombardy, Italy (On-site)",https://www.linkedin.com/jobs/view/4224479861,"Gerenzano, Lombardy, Italy","Skills: Troubleshooting, Computer Repair, +8 more",,LISTED,122533242,ANEMOCYTE,https://media.licdn.com/dms/image/v2/D4D0BAQEm...,2025-05-09T12:31:14.000Z,2025-05-22T12:46:03.840Z
2,"Cyber Security Analyst, Italy",The IT/Cyber Security Analyst is a global role...,"ION · Pisa, Tuscany, Italy (On-site)",https://www.linkedin.com/jobs/view/4023610804,"Pisa, Tuscany, Italy","Skills: Linux, Security Information and Event ...",,LISTED,79075618,ION,https://media.licdn.com/dms/image/v2/D4E0BAQEj...,2024-09-11T17:18:24.000Z,2025-05-22T12:46:03.841Z
3,NOC Engineer Junior,"In iliad, abbiamo rivoluzionato il mercato del...","iliad · Milan, Lombardy, Italy (On-site)",https://www.linkedin.com/jobs/view/4230031910,"Milan, Lombardy, Italy","Skills: Network Operations Center (NOC), Troub...",,LISTED,984227560,iliad,https://media.licdn.com/dms/image/v2/D4D0BAQHV...,2025-05-14T15:50:52.000Z,2025-05-22T12:46:03.842Z
4,"2025 Security Specialist Intern, DC Security","Description\n\nAWS is growing rapidly, and we ...","Amazon Web Services (AWS) · Milan, Lombardy, I...",https://www.linkedin.com/jobs/view/4188426286,"Milan, Lombardy, Italy","Skills: Security Operations, Criminology, +8 more",,LISTED,355872060,Amazon Web Services (AWS),https://media.licdn.com/dms/image/v2/D4E0BAQE0...,2025-03-18T21:13:12.000Z,2025-05-22T12:46:03.842Z
5,Cyber Defense Center SOC Analyst,Con oltre 1.000 specialisti IT presenti in 6 P...,"Würth Phoenix · Milan, Lombardy, Italy (Hybrid)",https://www.linkedin.com/jobs/view/4233626307,"Milan, Lombardy, Italy","Skills: Security Operations, Tracking Systems,...",,LISTED,416073494,Würth Phoenix,https://media.licdn.com/dms/image/v2/C4E0BAQFR...,2025-05-20T14:47:18.000Z,2025-05-22T12:46:03.843Z
6,Cybersecurity Analyst,Cybersecurity Analyst con esperienza nella ges...,"agap2 Italia · Verona, Veneto, Italy (Hybrid)",https://www.linkedin.com/jobs/view/4225641811,"Verona, Veneto, Italy","Skills: Computer Network Operations, Security ...",,LISTED,1221573374,agap2 Italia,https://media.licdn.com/dms/image/v2/C4D0BAQHs...,2025-05-12T09:33:41.000Z,2025-05-22T12:46:03.844Z
7,Cybersecurity Junior - Automotive,"🟠🔵Teoresi S.p.A. , 35+ anni di storia , 6 soci...","Teoresi Group · Turin, Piedmont, Italy (On-site)",https://www.linkedin.com/jobs/view/4195446112,"Turin, Piedmont, Italy","Skills: CAN bus, Internet of Things (IoT), +3 ...",,LISTED,662759130,Teoresi Group,https://media.licdn.com/dms/image/v2/D4D0BAQFO...,2025-03-31T09:10:18.000Z,2025-05-22T12:46:03.844Z
8,Junior Risk & Compliance Analyst,Rad è alla ricerca di un Junior Security Consu...,RAD Cyber Security · Greater Milan Metropolita...,https://www.linkedin.com/jobs/view/4221531535,Greater Milan Metropolitan Area,"Skills: KYC Verification, Anti-Money Launderin...",,LISTED,632403450,RAD Cyber Security,https://media.licdn.com/dms/image/v2/C4D0BAQFD...,2025-05-05T15:28:03.000Z,2025-05-22T12:46:03.845Z
9,SOC Analyst,Per conto di prestigioso cliente in ambito Cyb...,"Azienda Riservata Italia · Emilia-Romagna, Ita...",https://www.linkedin.com/jobs/view/4233005171,"Emilia-Romagna, Italy","Skills: Security Operations, System on a Chip ...",,LISTED,629194346,Azienda Riservata Italia,,2025-05-22T12:33:19.000Z,2025-05-22T12:46:03.846Z


### Translation Function
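The function below detects the language of the specified columns and translates non-English text into English. It takes the DataFrame containing the job listings, the number of rows to process (sample_size), and a boolean (show_original) indicating whether to keep the original text in additional "_original" columns for comparison. Columns listed in translate_cols that are missing from the DataFrame are skipped. Rows are drawn with df.sample using a fixed seed, so the returned DataFrame is in shuffled order even when all rows are processed. The function returns the modified DataFrame with the translations in place of the original text.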

In [6]:
DetectorFactory.seed = 0  # Ensures consistent language detection
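
One detail of the detector output is worth handling: langdetect reports Chinese as "zh-cn" or "zh-tw", while the GoogleTranslator language list spells these codes "zh-CN" and "zh-TW". Passing the detected code straight through as the source language can therefore fail with a "No support for the provided language" error. A minimal sketch of a normalization step follows; `LANGDETECT_TO_GOOGLE` and `normalize_lang_code` are hypothetical names that are not part of the original notebook, and the mapping covers only the two Chinese codes.

```python
# Hypothetical mapping from langdetect codes to the spelling used in
# GoogleTranslator's language list. Only the two Chinese variants are covered.
LANGDETECT_TO_GOOGLE = {
    "zh-cn": "zh-CN",
    "zh-tw": "zh-TW",
}


def normalize_lang_code(code):
    """Return the mapped code if one exists, otherwise the input unchanged."""
    return LANGDETECT_TO_GOOGLE.get(code, code)


print(normalize_lang_code("zh-cn"))
print(normalize_lang_code("it"))
```

The result could be passed as the source argument instead of the raw detected code. Another option is to let the translator detect the language itself with source="auto" and use langdetect only to decide whether a text needs translating.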

In [7]:
def detect_and_translate(df, sample_size=row_count, show_original=False):
    # Columns to translate
    translate_cols = ['Title', 'Description', 'Primary Description', 'Specialties', 'Company Description']
    
    # Sample for validation
    sample_df = df.sample(n=sample_size, random_state=42).copy()

    def translate_text(text):
        # Detects language and translates to English, splitting if >5000 chars due to an API limitation.
        if pd.isna(text) or text.strip() == "":
            return text  # Skip empty or NaN values
        try:
            lang = detect(text)
            if lang != "en":
                if len(text) > 5000:
                    # Split into chunks under 5000 chars
                    chunks = []
                    start = 0
                    while start < len(text):
                        end = min(start + 4999, len(text))  # Max chunk size 4999
                        # Try to split at a natural boundary (period or space)
                        if end < len(text):
                            last_period = text.rfind('.', start, end)
                            last_space = text.rfind(' ', start, end)
                            split_point = max(last_period, last_space)
                            if split_point > start:
                                # Cut just after the boundary so a period stays with its chunk
                                end = split_point + 1
                        chunks.append(text[start:end])
                        start = end  # Next chunk starts where this one ended
                    # Translate each chunk and join
                    translated_chunks = []
                    for chunk in chunks:
                        if chunk.strip():  # Only translate non-empty chunks
                            translated_chunks.append(GoogleTranslator(source=lang, target="en").translate(chunk))
                        else:
                            translated_chunks.append(chunk)
                    return ' '.join(translated_chunks)
                else:
                    return GoogleTranslator(source=lang, target="en").translate(text)
        except Exception as e:
            print(f"Translation error: {e}")
            return text  # Return original text if translation fails
        return text  # Return original text if English

    # Apply translation function
    for col in translate_cols:
        if col in sample_df.columns:
            if show_original:
                # Store original text in a new column
                sample_df[f"{col}_original"] = sample_df[col]
            # Replace with translated text
            sample_df[col] = sample_df[col].apply(translate_text)

    return sample_df
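
The splitting step inside translate_text can also be isolated and checked on its own, without calling the translation API. The sketch below is a simplified variant rather than a copy of the code above: it keeps the boundary character in each chunk, so the chunks concatenate back to the original text. `split_into_chunks` and `max_len` are names introduced for illustration.

```python
def split_into_chunks(text, max_len=4999):
    """Split text into pieces of at most max_len characters.

    Prefers to cut right after the last period or space inside the window
    and falls back to a hard cut when no such boundary is found.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            boundary = max(text.rfind(".", start, end),
                           text.rfind(" ", start, end))
            if boundary > start:
                end = boundary + 1  # keep the boundary character in this chunk
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "word " * 3000  # 15000 characters
parts = split_into_chunks(sample)
print(len(parts), max(len(p) for p in parts))
```

Because no character is dropped, "".join(parts) reproduces the input, which makes the helper easy to test before any text is sent to the translator.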


### Storing results
Finally, the results are saved in the folder of the current iteration. The translated files from all iterations still need to be merged to complete the dataset; this is done in the last step of the merge-data.ipynb file.
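
The actual merge lives in merge-data.ipynb and is not shown here. Purely as an illustration of what such a step might look like, the sketch below concatenates the records of every translated file. It assumes one subfolder per iteration under "data", each holding a "tr_li_jobs.json" list of records; `merge_translated` is a hypothetical name.

```python
import json
from pathlib import Path


def merge_translated(data_root="data", filename="tr_li_jobs.json"):
    """Concatenate the records of every <data_root>/<iteration>/<filename>.

    Folders are visited in lexicographic order of their paths.
    """
    merged = []
    for path in sorted(Path(data_root).glob(f"*/{filename}")):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged
```

The returned list of dictionaries can be written back with json.dump or loaded into a DataFrame with pd.DataFrame(merged).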

In [8]:
translated_df = detect_and_translate(df_initial, sample_size=row_count, show_original=False)

translated_df = translated_df.to_dict(orient="records")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(translated_df, f, ensure_ascii=False, indent=2)

Translation error: Consultant  art.1 l.68/99 (categoria protetta) --> No translation was found using the current translator. Try another translator?
Translation error: zh-cn --> No support for the provided language.
Please select on of the supported languages:
{'afrikaans': 'af', 'albanian': 'sq', 'amharic': 'am', 'arabic': 'ar', 'armenian': 'hy', 'assamese': 'as', 'aymara': 'ay', 'azerbaijani': 'az', 'bambara': 'bm', 'basque': 'eu', 'belarusian': 'be', 'bengali': 'bn', 'bhojpuri': 'bho', 'bosnian': 'bs', 'bulgarian': 'bg', 'catalan': 'ca', 'cebuano': 'ceb', 'chichewa': 'ny', 'chinese (simplified)': 'zh-CN', 'chinese (traditional)': 'zh-TW', 'corsican': 'co', 'croatian': 'hr', 'czech': 'cs', 'danish': 'da', 'dhivehi': 'dv', 'dogri': 'doi', 'dutch': 'nl', 'english': 'en', 'esperanto': 'eo', 'estonian': 'et', 'ewe': 'ee', 'filipino': 'tl', 'finnish': 'fi', 'french': 'fr', 'frisian': 'fy', 'galician': 'gl', 'georgian': 'ka', 'german': 'de', 'greek': 'el', 'guarani': 'gn', 'gujarati': 'gu'