## **Creating Data Lakes**

### **Introduction**

Data lakes are centralized repositories that allow you to store all your structured and unstructured data at any scale.

They make it easier to store, manage, and analyze large volumes of data.

This document explains how the `api_crawler` module creates data lakes by automatically logging every interaction with data sources.

### **Logging Function Interactions**

The `api_crawler` module creates a data lake for every function that has the `log_io_to_json` decorator on top of it.

This decorator automatically logs the inputs, outputs, execution time, and errors (if any) of a function and saves them to a JSON file named after the function.

**Note**: The JSON file is saved in the directory specified by the `LAKES_BASE_DIR` environment variable. You can set this variable in the `.env` file or at runtime through `os.environ['LAKES_BASE_DIR']`.
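As a minimal sketch, the variable can be set from Python before any decorated function runs. The directory name `./lakes` is a placeholder chosen for illustration, and creating the directory up front is a precaution rather than something the module is known to require.

```python
import os
from pathlib import Path

# Hypothetical location for the lake files; replace it with your own directory.
lakes_dir = Path("./lakes").resolve()

# Create the directory up front (a precaution; the module may also handle this).
lakes_dir.mkdir(parents=True, exist_ok=True)

# Set the variable for the current process before calling any decorated function.
os.environ["LAKES_BASE_DIR"] = str(lakes_dir)
```

In a `.env` file, the equivalent entry would be a single line of the form `LAKES_BASE_DIR=/path/to/lakes`.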


### **Decorator: `log_io_to_json`**

#### **Purpose**

The `log_io_to_json` decorator logs the input and output of a function to a JSON file. This helps in creating a data lake by capturing all interactions with data sources.

The goal was to keep the structure of the lake versatile while logging all data that may be useful to store.

#### **How It Works**

1. **Unique Identifier**: Generates a unique identifier for each function call.
2. **Timing**: Captures the start and end time of the function execution.
3. **Function Execution**: Executes the original function and captures its output. If an error occurs, it logs the error.
4. **Argument Binding**: Binds the passed arguments to the function's signature and serializes them.
5. **Logging**: Stores the input, output, and timing information in a JSON file.
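The five steps above can be sketched as a small decorator. This is an illustrative stand-in, not the module's actual implementation: the name `log_io_to_json_sketch`, the `error` key, the choice to re-raise exceptions, and the file name derived from `__qualname__` are all assumptions made for the example.

```python
import functools
import inspect
import json
import os
import uuid
from datetime import datetime
from pathlib import Path


def log_io_to_json_sketch(func):
    """Illustrative stand-in for log_io_to_json; not the module's actual code."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_id = str(uuid.uuid4())              # 1. unique identifier
        start = datetime.now()                   # 2. timing (start)
        output, error = None, None
        try:
            output = func(*args, **kwargs)       # 3. run the original function
            return output
        except Exception as exc:
            error = repr(exc)                    # 3. record the error, then re-raise (assumed)
            raise
        finally:
            end = datetime.now()                 # 2. timing (end)
            # 4. bind arguments to parameter names; repr() anything JSON cannot encode
            bound = inspect.signature(func).bind(*args, **kwargs)
            entry = {
                "id": call_id,
                "start_time": start.isoformat(timespec="seconds"),
                "end_time": end.isoformat(timespec="seconds"),
                "input": {"args": json.loads(json.dumps(bound.arguments, default=repr))},
                "output": json.loads(json.dumps(output, default=repr)),
                "error": error,
            }
            # 5. append the entry to <LAKES_BASE_DIR>/<qualified function name>.json
            base_dir = Path(os.environ.get("LAKES_BASE_DIR", "."))
            base_dir.mkdir(parents=True, exist_ok=True)
            path = base_dir / (func.__qualname__.replace(".", "_") + ".json")
            entries = json.loads(path.read_text()) if path.exists() else []
            entries.append(entry)
            path.write_text(json.dumps(entries, indent=2))

    return wrapper
```

Binding through `inspect.signature` is what lets positional arguments appear under their parameter names in the log, with any extra keyword arguments grouped under the name of the function's `**kwargs` parameter.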

## **Example**

```python
from api_crawler import LinkedInAPI

linkedin_api = LinkedInAPI()

# Each call to a decorated method is logged automatically.
job_posting_data = linkedin_api.get_job_postings_data(
    'Data Analyst', location='New York', n_listings=2, days_ago=7
)
```

The call to this function, along with its inputs, outputs, and other details, will be saved in the file `LinkedInAPI_get_job_postings_data.json`, located in the directory specified by `LAKES_BASE_DIR`.

In this example, the logged entry is as follows:

```json
{
  "id": "c58afbdb-0535-4bbd-a9a5-a8eae845ed64",
  "start_time": "2024-06-14T11:46:05",
  "end_time": "2024-06-14T11:46:40",
  "input": {
    "args": {
      "self": "<api_crawler.data_sources.linked_in.LinkedInAPI object>",
      "search_query": "Data Analyst",
      "n_listings": 30,
      "kwargs": {
        "location": "New York",
        "days_ago": 7
      }
    }
  },
  "output": [
    {
      "title": "Data Analyst",
      "company": "Stripe",
      "location": "New York, United States",
      "time": "2024-06-12",
      "link": "https://www.linkedin.com/jobs/view/data-analyst-at-stripe-3824634046?position=1&pageNum=0&refId=JrOmBu3qWObVHJInLHlD%2BQ%3D%3D&trackingId=Qcv8NYThUZK1APMjmy4zYQ%3D%3D&trk=public_jobs_jserp-result_search-card"
    },
    {
      "title": "Data Analyst",
      "company": "New York Islanders",
      "location": "Floral Park, NY",
      "time": "2024-06-12",
      "link": "https://www.linkedin.com/jobs/view/data-analyst-at-new-york-islanders-3948895425?position=2&pageNum=0&refId=JrOmBu3qWObVHJInLHlD%2BQ%3D%3D&trackingId=mj%2FQFvrETnNX%2FTX8oKiJ9Q%3D%3D&trk=public_jobs_jserp-result_search-card"
    }
  ]
}
```

## **Accessing Your Logs**

To access and read the logs created by the `log_io_to_json` decorator, you can use the `read_log` utility function. This function reads the content of a specified JSON file and returns the data as a list of dictionaries.

### **How to Use `read_log`**

1. **Import the Function**: First, ensure you import the `read_log` function from the `api_crawler.data_lake` module.
2. **Specify the File Path**: Provide the path to the JSON file you want to read. This path should be relative to the base directory specified by the `LAKES_BASE_DIR` environment variable.
3. **Read the Log**: Call the `read_log` function with the file path to get the logged data.
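To make the steps concrete, here is a minimal sketch of what such a reader could look like, together with a small helper that works on the returned list. Both `read_log_sketch` and `call_durations` are hypothetical names introduced for illustration; the real `read_log` may resolve paths or parse the file differently.

```python
import json
import os
from datetime import datetime
from pathlib import Path


def read_log_sketch(file_path):
    """Illustrative stand-in for read_log; not the module's actual code."""
    # Resolve the path against LAKES_BASE_DIR, as described in step 2.
    base_dir = Path(os.environ.get("LAKES_BASE_DIR", "."))
    with open(base_dir / file_path, encoding="utf-8") as fh:
        return json.load(fh)  # expected to be a list of dictionaries


def call_durations(entries):
    """Return {entry id: elapsed seconds} computed from the ISO timestamps."""
    return {
        e["id"]: (
            datetime.fromisoformat(e["end_time"])
            - datetime.fromisoformat(e["start_time"])
        ).total_seconds()
        for e in entries
    }
```

Because each entry is a plain dictionary, the logged fields (`id`, `start_time`, `end_time`, `input`, `output`) can be filtered or aggregated with ordinary Python, as `call_durations` does for execution time.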

### **Example**

Here is an example of how to use the `read_log` function:


```python
from api_crawler.data_lake import read_log

# `base_file_path` must be defined beforehand; its value is not shown in this example.
read_log(f'{base_file_path}/LinkedInAPI_get_job_postings_data.json')
```

```text
[{'id': 'c58afbdb-0535-4bbd-a9a5-a8eae845ed64',
  'start_time': '2024-06-14T11:46:05',
  'end_time': '2024-06-14T11:46:40',
  'input': {'args': {'self': '<api_crawler.data_sources.linked_in.LinkedInAPI object at 0x7be546563430>',
    'search_query': 'Data Analyst',
    'n_listings': 30,
    'kwargs': {'location': 'New York', 'days_ago': 7}}},
  'output': [{'title': 'Data Analyst',
    'company': 'Stripe',
    'location': 'New York, United States',
    'time': '2024-06-12',
    'link': 'https://www.linkedin.com/jobs/view/data-analyst-at-stripe-3824634046?position=1&pageNum=0&refId=JrOmBu3qWObVHJInLHlD%2BQ%3D%3D&trackingId=Qcv8NYThUZK1APMjmy4zYQ%3D%3D&trk=public_jobs_jserp-result_search-card'},
   {'title': 'Data Analyst',
    'company': 'New York Islanders',
    'location': 'Floral Park, NY',
    'time': '2024-06-12',
    'link': 'https://www.linkedin.com/jobs/view/data-analyst-at-new-york-islanders-3948895425?position=2&pageNum=0&refId=JrOmBu3qWObVHJInLHlD%2BQ%3D%3D&trackingId=mj%2FQFv
```

### **Conclusion**

By using the `log_io_to_json` decorator, you can effortlessly log and store function interactions, allowing you to leverage the full potential of your data.