### About the Dataset source

LinkedIn is a well-known employment-oriented platform that hosts job advertisements of all kinds, including many in the cybersecurity field. To ensure the dataset reflects current real-world demand, up-to-date job postings were collected on a weekly basis, using different search keywords to obtain representative cybersecurity job ads, with a focus on Europe.

After some research, a useful browser extension was identified:
**ScrapeJob – LinkedIn Jobs Scraper** (https://linkedin.scrapejob.net/). With proper setup, this tool extracts job listings and their details from LinkedIn postings in JSON format, which is suitable for this data science project.

### Building a Dataset
This project started with a small number of cybersecurity job samples scraped from LinkedIn. Exploration and preprocessing were conducted, but more data was needed to make the dataset representative. The whole process involved:
- Scraping data from LinkedIn weekly, using different searches (e.g. cybersecurity analyst, cybersecurity architect, cybersecurity assistant, cybersecurity manager, and more). Some country filters were also applied in order to get more data, because the tool has a limitation: only 1K records can be scraped per search.
- As this project was reviewed bi-weekly, I worked in bi-weekly iterations. There is a folder named "data" with sub-folders named "1", "2", "3"..., where the number corresponds to each iteration performed.
- Inside each iteration folder, there are folders named after the date when the data was scraped, each containing the JSON files corresponding to each search. In the root of every iteration folder, there is a "li_jobs.json" file, which contains all the records from that iteration, along with the translated version of these records (produced in another notebook).
- The current notebook merges the records per iteration, and it also merges these iteration files into a main one called "data.json", which is in the root of the data folder and contains the translated version of all the records that need to be considered.
- This merging process is needed to build a comprehensive and compact dataset, as more records are collected weekly and need to be integrated into the "main" dataset. I tried to do this in a simple way that avoids re-running previous code and pre-processing the same rows multiple times, especially for the translation task, which is time-consuming.
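
To make the folder layout described above easier to inspect, here is a minimal sketch that reports, for each numbered iteration folder, whether the merged and translated files already exist. The helper name `iteration_status` is hypothetical and not part of the notebook; the file names "li_jobs.json" and "tr_li_jobs.json" are the ones used in the cells further down.

```python
import os


def iteration_status(data_root="data"):
    """Map each numbered iteration folder to the merged files it already has."""
    status = {}
    # Sort numerically so that "10" comes after "2"
    numbered = sorted((n for n in os.listdir(data_root) if n.isdigit()), key=int)
    for name in numbered:
        folder = os.path.join(data_root, name)
        if not os.path.isdir(folder):
            continue
        status[name] = {
            # Merged raw records for the iteration
            "merged": os.path.isfile(os.path.join(folder, "li_jobs.json")),
            # Translated version produced by the translation notebook
            "translated": os.path.isfile(os.path.join(folder, "tr_li_jobs.json")),
        }
    return status
```

A check like this could help spot iterations that were merged but not yet translated before the final merge is run.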

In [3]:
# Import necessary libraries
import os
import pandas as pd
import json

### Defining file paths
The main file is placed inside the iteration folder and contains the merged results for that iteration. The data_folder variable indicates the iteration folder to process. Iterations are processed one at a time, so as they progress, data_folder should always be updated to the latest one.
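
The notebook sets the iteration folder by hand. As an optional alternative, the sketch below picks the highest-numbered sub-folder automatically. The helper name `latest_iteration_folder` is hypothetical, and it assumes iteration folders are named with plain integers as described earlier.

```python
import os


def latest_iteration_folder(data_root="data"):
    """Return the path of the highest-numbered iteration folder in data_root."""
    numbered = [
        name for name in os.listdir(data_root)
        if name.isdigit() and os.path.isdir(os.path.join(data_root, name))
    ]
    if not numbered:
        raise FileNotFoundError(f"No numbered iteration folders in {data_root}")
    # Compare as integers so that "10" is considered later than "9"
    return os.path.join(data_root, max(numbered, key=int))
```

Setting the folder manually, as the next cell does, remains the safer choice when an older iteration needs to be re-processed.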

In [4]:
main_file="li_jobs.json"
data_folder="data/5"
main_file_path = os.path.join(data_folder, main_file)

### Merging files
Merges all JSON files from subfolders in the data (iteration) folder into the main li_jobs.json file (for each iteration).

In [6]:
def merge_json_files(data_folder=data_folder, main_file=main_file):
    
    main_file_path = os.path.join(data_folder, main_file)
    main_data = []
    
    # Walk through the data iteration folder and its subfolders
    for root, _, files in os.walk(data_folder):
        for file in files:
            # Process JSON files 
            if file.endswith(".json") and file not in [main_file, "tr_li_jobs.json"]:
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")
                # Load the JSON file
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        sub_data = json.load(f)
                    # Append the data to the main list
                    if isinstance(sub_data, list):
                        main_data.extend(sub_data)
                    else:
                        main_data.append(sub_data)  # Handle case where sub_data is a single dict
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    
    # Save the merged data back to the main file with UTF-8 encoding and no escaping
    with open(main_file_path, 'w', encoding='utf-8') as f:
        json.dump(main_data, f, ensure_ascii=False, indent=2)
    
    print(f"Merged data saved to {main_file_path}. Total records: {len(main_data)}")
    return

In [7]:
# merge the JSON files in the specified folder, storing the result in the main file indicated per folder
merge_json_files()

Processing file: data/5\jobsDetails_20250521140657.json
Processing file: data/5\jobsDetails_20250521141623.json
Merged data saved to data/5\li_jobs.json. Total records: 2000


### Checking and removing duplicates
As the samples are obtained through different LinkedIn searches and filters, some records may appear multiple times because they match more than one filter or keyword. Moreover, some promoted ads may always appear first in the results. Duplicated data is not needed, so let's keep only one sample per ad and save the result back to the main file per iteration.

This duplicate check uses only the first 5 columns of the dataset, because columns such as scraped_date or scraped_time, and even the job poster id, might differ for the same advertisement, while the first columns, which hold the job details, remain identical.
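
To illustrate the idea of comparing only the leading fields, here is a small standard-library sketch that works on a list of dicts instead of a DataFrame. It is not the notebook's method (the cells below use pandas), and the field names in the sample records are hypothetical, since the actual column names are not listed here.

```python
def drop_duplicate_records(records, n_key_fields=5):
    """Keep the first record for each distinct combination of the leading fields."""
    if not records:
        return []
    # Assumption: every record lists its fields in the same order, so the
    # first record's leading keys identify the comparison columns.
    key_fields = list(records[0])[:n_key_fields]
    seen = set()
    unique = []
    for record in records:
        # Assumption: the compared values are hashable scalars (str, int, None)
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: the field names below are illustrative only
sample = [
    {"title": "SOC Analyst", "company": "Acme", "location": "Madrid", "scraped_date": "2025-05-14"},
    {"title": "SOC Analyst", "company": "Acme", "location": "Madrid", "scraped_date": "2025-05-21"},
    {"title": "SOC Analyst", "company": "Acme", "location": "Lisbon", "scraped_date": "2025-05-21"},
]
deduplicated = drop_duplicate_records(sample, n_key_fields=3)
print(len(deduplicated))
```

The first two sample records differ only in scraped_date, so they count as the same ad and only the earlier one is kept.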

In [9]:
# Check for duplicates in the merged JSON file
df = pd.read_json(main_file_path, encoding="utf-8")
duplicate_rows = df.duplicated(subset=df.columns[:5].tolist()).sum()
print("Number of duplicate rows:", duplicate_rows)
print("Number of rows in the DataFrame:", df.shape[0])

Number of duplicate rows: 493
Number of rows in the DataFrame: 2000


In [10]:
# Drop duplicates based on the first 5 columns and keep the first occurrence
if duplicate_rows: 
    df = df.drop_duplicates(subset=df.columns[:5].tolist(), keep='first')
    print("Number of rows in the DataFrame after dropping duplicates", df.shape[0])
    # Convert to a list of dicts under a separate name so df stays a DataFrame
    records = df.to_dict(orient="records")
    # Save the cleaned data back to the main file
    with open(main_file_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

Number of rows in the DataFrame after dropping duplicates 1507


### Merge translated version of the results per iteration
In the previous step, we obtained the merged results per iteration. However, we can't load the dataset directly into the preprocessing and exploration part, for two reasons: the ads are in different languages, so a translation task is required, and a complete version containing all the records in a single file is needed. For the translation step, check the translate notebook contained in this repo. That notebook translates the results of each iteration and puts the translated version (English) into the corresponding folder. These translated versions are the ones used for the final merge to obtain a comprehensive dataset.

### Defining paths

In [1]:
# Define the range of subfolders to process (iteration folders)
subfolder_range = range(1, 6)  # this can be adjusted as needed

# Generate file paths dynamically
file_paths = [f"data/{i}/tr_li_jobs.json" for i in subfolder_range]

print("File paths to process:", file_paths)


File paths to process: ['data/1/tr_li_jobs.json', 'data/2/tr_li_jobs.json', 'data/3/tr_li_jobs.json', 'data/4/tr_li_jobs.json', 'data/5/tr_li_jobs.json']


### Loading files and merging the content

In [8]:
# Initialize an empty list to store the merged data
merged_data = []

# Loop through the file paths, load the data, and merge
for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding="utf-8") as f:
            data = json.load(f)
            if isinstance(data, list):  # Ensure the data is a list before merging
                merged_data.extend(data)
            else:
                print(f"Warning: {file_path} does not contain a list. Skipping.")
    except FileNotFoundError:
        print(f"File not found: {file_path}. Skipping.")
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

### Storing merged results
From here, we output the translated, merged version that will be used for the data science process in this project. The data/data.json file is the main file containing the full dataset, and it is read in the main notebook. See main.ipynb for more.

In [9]:
# Define the output file path
output_file = 'data/data.json'

# Ensure the output directory exists
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# Write the merged data to the output file
with open(output_file, 'w', encoding="utf-8") as f_out:
    json.dump(merged_data, f_out, ensure_ascii=False, indent=4)

print(f"Merged data saved to {output_file}. Total records: {len(merged_data)}")

Merged data saved to data/data.json. Total records: 23691
