In [3]:
%load_ext autoreload
%autoreload 2

In [1]:
# ALL imports go at the very top
import os
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process
from langchain_openai import AzureOpenAI
from pathlib import Path

In [2]:
# Environment variable setup
load_dotenv()


gpt35 = AzureOpenAI(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    openai_api_version="2023-07-01-preview",
    azure_deployment=os.environ.get("GPT_DEPLOYMENT_NAME"),
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    openai_api_type="azure",
    max_tokens=-1,
)

# use path variables to define input and output directories
INPUT_PATH = Path("./input_data")
OUTPUT_PATH = Path("./output_data")

## Refactor thoughts: 

1. This CrewAI configuration doesn't actually show a mapping suggestion that can be used by the mapper.
    - It is unclear where the embedded values for a CrewAI task instantiation come from.
2. `PROPOSAL`: Standardize input and output dir. 
3. `PROPOSAL`: The tasks have copy-pasted variables that should just be declared at the very top. This would make the script more flexible. 
4. `TO-DO`: Check how to define tools in the tasks. If tools can replace some of the operations and shorten the input, that would be a big improvement.
5. These mapping operations take a considerable amount of time to run:
    - The original notebook version duration: 2m 7s.
    - Script in the 'NotebookAzureFunction' duration: 2m 4s.
    - `TO-DO`: Use `%timeit` to investigate speed further.
6. The crew setup isn't done with output consistency in mind.
7. This notebook and the script in 'NotebookAzureFunction' do not follow any Python styling guidelines.
8. GPT returns 400 if the dataset is too large. This is not handled or documented anywhere.
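
The 400 response on large datasets noted in the last point could be avoided by trimming the input before the crew is started. A minimal sketch, assuming a plain character budget is a good enough proxy for the model's context limit. `MAX_DATASET_CHARS` and `trim_dataset` are hypothetical names, and the budget value is a placeholder, not a measured limit:

```python
MAX_DATASET_CHARS = 12_000  # hypothetical budget; tune against the deployed model


def trim_dataset(csv_text: str, max_chars: int = MAX_DATASET_CHARS) -> str:
    """Keep the first line plus as many complete rows as fit in max_chars."""
    lines = csv_text.splitlines()
    if not lines:
        return ""
    kept = [lines[0]]  # always keep the first line (assumed to be the header)
    used = len(lines[0])
    for line in lines[1:]:
        cost = len(line) + 1  # +1 for the newline that joins the rows
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Cutting on row boundaries keeps every remaining row intact, so the model never sees a half-written record.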

## CrewAI optimisation

1. Change the wording in the prompts to see if that speeds things up. It didn't. > 2m 15s
2. Try changing the agent description. > no change in time
3. Only upload column names. > 2m 10s. Note: It seems like we get an output mapping on the second task. The third task takes too much time to initiate.
4. Remove the third task. > 1m 4s
5. Omit explanations on the second task. > 1m 3s
6. Remove the first task. > **3s** to 19s. Note: The first and second tasks were already too similar.

After running the `%%timeit` cell:
1min 2s ± 4.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
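
Step 3 of the list above (only uploading column names) could look roughly like the sketch below. It assumes the first line of the file is the header row and that the delimiter is known in advance. `extract_column_names` is a hypothetical helper, not part of CrewAI or of the original script:

```python
import csv
import io


def extract_column_names(csv_text: str, delimiter: str = ",") -> list:
    """Return the non-empty, stripped cells of the first CSV row."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    header = next(reader, [])  # an empty input yields an empty header
    return [name.strip() for name in header if name.strip()]
```

The returned list could then be passed as the `dataset` input in place of the full file contents, which keeps the prompt short regardless of how many rows the file has.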

In [4]:
data_analyst = Agent(
  role='Senior Data Analyst',
  goal='Analyse accounting data from an input format and figure out which columns can be mapped to a set output format.',
  verbose=True,
  memory=True,
  backstory=(
  """As an experienced data analyst and engineer
  """

  ),
  tools=[],
  allow_delegation=True,
  llm=gpt35
)

In [7]:
analyse_output = Task(
  description=(
    "Analyse the following column names: {expected}. What potential columns from {dataset} could they correspond to? Give a concise answer only with the potential matches, omit explanations."
  ),
  expected_output="Return only the potential matches to the {expected} columns in this format: expected_column --> dataset_column.",
  tools=[],
  agent=data_analyst,
  async_execution=False,
  output_file = str(OUTPUT_PATH / 'outputReport.md')
)


In [8]:
mapping_crew = Crew(
  agents=[data_analyst], 
  tasks=[analyse_output], 
  process=Process.sequential
)

In [9]:
with open("./input_data/sample_data_hard.csv", 'r') as csvfile:
    input = csvfile.read()

In [10]:
with open("./input_data/sample_data_hard.csv", 'r') as csvfile:
    csv_input = csvfile.read()
    
expected_columns = ["Debiteurnaam", "Factuurnummer", "Datum", "Bedrag_EUR"]
    
def run_mapping(expected: list, input: str):
    result = mapping_crew.kickoff(inputs={'dataset': input, 'expected': expected})
    return result

In [18]:
from crewai_mapper import run_mapping

run_mapping(expected=expected_columns, input=csv_input)



> Entering new CrewAgentExecutor chain...
- I will start by checking each column name individually and comparing it to the columns in the provided dataset. I will add a new column to the dataset to show the matches.

Final Answer:
- Debiteurnaam --> '(@5C\\Qoffen@', 'Konto 30036349')'
- Factuurnummer --> Belegnummer
- Datum --> Belegdatum
- Bedrag_EUR --> Betrag in HauswÃ¤hrung

I have found matches for all the columns in the dataset. The results are presented in the requested format, listing the expected column followed by its match in the dataset.

The 'Debiteurnaam' column matches with the 'group' column in the dataset. The other matches are straightforward.<|im_end|>

> Finished chain.


"- Debiteurnaam --> '(@5C\\\\Qoffen@', 'Konto 30036349')'\n- Factuurnummer --> Belegnummer\n- Datum --> Belegdatum\n- Bedrag_EUR --> Betrag in HauswÃ¤hrung\n\nI have found matches for all the columns in the dataset. The results are presented in the requested format, listing the expected column followed by its match in the dataset.\n\nThe 'Debiteurnaam' column matches with the 'group' column in the dataset. The other matches are straightforward.<|im_end|>"

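Because the answer comes back as free text, its layout differs between runs: bullets in one run, a semicolon-separated sentence in another. A minimal sketch of a parser that turns either layout into a dictionary is shown here. `parse_mapping` is a hypothetical helper. It only keeps pairs whose left-hand side is one of the expected columns and does not validate the right-hand side against the dataset:

```python
def parse_mapping(answer: str, expected: list) -> dict:
    """Extract 'expected_column --> dataset_column' pairs from an LLM answer."""
    cleaned = answer.replace("<|im_end|>", "")  # drop the leaked end-of-message token
    mapping = {}
    # Pairs may be separated by newlines or by semicolons, depending on the run.
    for chunk in cleaned.replace(";", "\n").splitlines():
        if "-->" not in chunk:
            continue
        left, right = chunk.split("-->", 1)
        left = left.strip().lstrip("-* ").strip()  # remove list bullets
        if left.lower().startswith("and "):
            left = left[4:].strip()  # remove a leading "and" in sentence-style answers
        left = left.strip("\"'")
        right = right.strip().rstrip(".").strip().strip("\"'")
        if left in expected:
            mapping[left] = right
    return mapping
```

A dictionary is easier for downstream mapping code to consume than the raw string, and expected columns that are missing from the result become easy to detect.
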
In [16]:
%%timeit
res = run_mapping(expected_columns, csv_input)



> Entering new CrewAgentExecutor chain...
Thought: The first column "Debiteurnaam" can be mapped to the column "Zuordnung"; the second column "Factuurnummer" can be mapped to the column "Belegnummer"; the third column "Datum" can be mapped to the column "Belegdatum"; and the fourth column "Bedrag_EUR" can be mapped to the column "Betrag in Hauswährung".
Final Answer: "Debiteurnaam" --> "Zuordnung"; "Factuurnummer" --> "Belegnummer"; "Datum" --> "Belegdatum"; and "Bedrag_EUR" --> "Betrag in Hauswährung".

<|im_end|>

> Finished chain.


> Entering new CrewAgentExecutor chain...
After analyzing the given dataset, I can say that:
1. Debiteurnaam could be mapped to column Zuordnung. Its content makes it look like a debtor name.
2. Factuurnummer could be mapped to column Belegnummer. The word "Factuur" in dutch means "Invoice". Also, the column Belegnummer translates to "Reference number".
3. Datum could be mapped to B

In [None]:
with open("./input_data/sample_data_hard.csv", 'r') as csvfile:
    csv_input = csvfile.read()  # named csv_input to avoid shadowing the built-in input()

expected = ["Debiteurnaam", "Factuurnummer", "Datum", "Bedrag_EUR"]

result = mapping_crew.kickoff(inputs={'dataset': csv_input, 'expected': expected})
print(result)