# Indexing Data
---
Prepare the policy data and the employee single view data for the AI agent.

In [1]:
from dotenv import load_dotenv

In [2]:
load_dotenv(override=True)

True

## Policy Data
---
I will load the policy data into a vector database, which can then be exposed to the AI agent as a retrieval tool.

In [17]:
with open('../data/input/attendance_policy.md', 'r', encoding='utf-8') as f:
    policy_text = f.read()
print(len(policy_text))

9485


In [18]:
print(policy_text[:1000])

# Company Attendance Policy

**1. Purpose and Scope**    
The Attendance Policy aims to set clear expectations for employee attendance, promote a productive work environment, and ensure fairness and consistency across the organization. Attendance is a critical factor that impacts the workflow, team dynamics, and overall company performance. This policy applies to all full-time, part-time, and contract employees, regardless of department or seniority, and ensures that all individuals are aware of their responsibilities regarding attendance.

**2. General Expectations**    
Employees are expected to maintain regular, punctual attendance to ensure smooth workflow and minimal disruption to team activities. Consistent attendance is essential for both individual performance and overall team productivity. Employees should report to work on time, be ready to fulfill their responsibilities during assigned hours, and demonstrate reliability by maintaining a predictable work schedule. If an emplo

In [19]:
import re

In [20]:
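# Each section starts with a bold numbered heading such as "**1. Purpose and Scope**";
# capture the heading plus the body text that follows, up to the next "*".
pattern = r'(\*\*\d+\. [^*]+\*\*[^*]+(?:\n[^*]+)*)'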
matches = re.findall(pattern, policy_text)

In [21]:
# expect 12 matches, one per numbered policy section
len(matches)

12
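
To see what the pattern captures, here is a minimal sketch that runs the same expression on a tiny made-up policy string. The sample text and the `split_sections` helper are illustrative and not part of the notebook. Each match starts at a bold numbered heading and runs until the next `*` character, so a section body that itself contains an asterisk (a `*` bullet or inline bold text, for example) would be cut short. Checking the chunk lengths is worthwhile if the policy format ever changes.

```python
import re

# Same pattern as in the notebook: a bold numbered heading followed by its body.
PATTERN = r'(\*\*\d+\. [^*]+\*\*[^*]+(?:\n[^*]+)*)'


def split_sections(text: str) -> list[str]:
    """Return one stripped chunk per '**N. Title**' section (hypothetical helper)."""
    return [m.strip() for m in re.findall(PATTERN, text)]


# Made-up sample that mimics the layout of the policy file.
sample = (
    "# Sample Policy\n\n"
    "**1. Purpose**    \nExplains why the policy exists.\n\n"
    "**2. Scope**    \nApplies to all employees.\n"
)

chunks = split_sections(sample)
for chunk in chunks:
    print(chunk)
    print("-" * 20)
```

The top-level `# Sample Policy` title is skipped because it has no bold numbered heading, which matches how the document title is dropped from the real chunks.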

In [22]:
for m in matches[:3]:
    print(m.strip())
    print("-" * 20)

**1. Purpose and Scope**    
The Attendance Policy aims to set clear expectations for employee attendance, promote a productive work environment, and ensure fairness and consistency across the organization. Attendance is a critical factor that impacts the workflow, team dynamics, and overall company performance. This policy applies to all full-time, part-time, and contract employees, regardless of department or seniority, and ensures that all individuals are aware of their responsibilities regarding attendance.
--------------------
**2. General Expectations**    
Employees are expected to maintain regular, punctual attendance to ensure smooth workflow and minimal disruption to team activities. Consistent attendance is essential for both individual performance and overall team productivity. Employees should report to work on time, be ready to fulfill their responsibilities during assigned hours, and demonstrate reliability by maintaining a predictable work schedule. If an employee fores

In [23]:
policy_chunks = [m.strip() for m in matches]

In [24]:
from langchain_core.documents import Document

In [25]:
# wrap each chunk in a LangChain Document
policy_docs = [Document(page_content=c) for c in policy_chunks]

print(len(policy_docs))

12


In [26]:
policy_docs[0]

Document(metadata={}, page_content='**1. Purpose and Scope**    \nThe Attendance Policy aims to set clear expectations for employee attendance, promote a productive work environment, and ensure fairness and consistency across the organization. Attendance is a critical factor that impacts the workflow, team dynamics, and overall company performance. This policy applies to all full-time, part-time, and contract employees, regardless of department or seniority, and ensures that all individuals are aware of their responsibilities regarding attendance.')
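
The documents above carry empty metadata. As a possible refinement, the section number and title could be parsed from each chunk's heading and stored alongside the text, which may help with filtering or citing sections later. The sketch below uses only the standard library; `section_metadata` is a hypothetical helper, and the sample chunk is shortened from the first policy section.

```python
import re

# Matches the '**N. Title**' heading at the start of a chunk.
HEADING = re.compile(r'^\*\*(\d+)\. ([^*]+)\*\*')


def section_metadata(chunk: str) -> dict:
    """Extract the section number and title from a chunk's heading (hypothetical helper)."""
    match = HEADING.match(chunk)
    if match is None:
        return {}  # no recognizable heading: leave metadata empty
    return {"section": int(match.group(1)), "title": match.group(2).strip()}


chunk = "**1. Purpose and Scope**    \nThe Attendance Policy aims to set clear expectations."
metadata = section_metadata(chunk)
print(metadata)

# In the notebook, the dict could then be passed along, for example:
# Document(page_content=chunk, metadata=metadata)
```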

In [31]:
from langchain_openai import AzureOpenAIEmbeddings

In [32]:
# initialize the Azure OpenAI embedding client (credentials come from the environment)
openai_api_version = "2024-02-01"
embedding_azure_deployment = "text-embedding-3-large"
embeddings = AzureOpenAIEmbeddings(
    azure_deployment=embedding_azure_deployment,
    openai_api_version=openai_api_version,
)
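
The constructor above receives no key or endpoint, so it presumably relies on the environment variables loaded earlier by `load_dotenv`. A small pre-flight check along these lines can make a missing setting easier to spot. It assumes the default variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`; a setup that authenticates with an Azure AD token or passes explicit arguments would need a different list. `missing_env_vars` is a hypothetical helper.

```python
import os

# Variable names AzureOpenAIEmbeddings reads by default (assumed to be set in .env).
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Azure OpenAI settings found.")
```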

In [33]:
from langchain_chroma import Chroma

In [47]:
vector_store = Chroma(
    collection_name="attendance_policy",
    embedding_function=embeddings,
    persist_directory="./vector_store",
)

In [48]:
vector_store.add_documents(policy_docs)

['4d389fa8-1085-4bc3-82cd-9ffa18ae8a5c',
 'ec33b588-55a6-4e94-8989-bf6d8d8d1549',
 'aa9ecfb3-ceaa-4bce-92dc-816121a2928a',
 '305741c7-771b-4161-ad63-fb8b8381e25f',
 '111259e3-3cd7-4181-b8ac-cd069372c28c',
 'b595413b-f0c8-44bf-9a13-dfecd1a242f6',
 '8adccb15-adb5-433e-ab00-afc26f3075ff',
 '22624255-715e-4e4e-8293-1538efad53f8',
 'f2baf522-1155-44b6-8e36-fc6df085c182',
 '3a09373d-8c77-4812-8edf-c4f9f0fb2b21',
 '96a99e9e-5bdb-4973-b2b7-0472ae345770',
 'e962f1a0-2621-46fa-b6d9-def0bb7ba73c']

In [49]:
# _collection is a private attribute of the Chroma wrapper; used here only as a quick sanity check
vector_store._collection.count()

12
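
Called without explicit IDs, `add_documents` assigns a fresh random UUID to each document, so running that cell again against the same `persist_directory` would likely store a second copy of every chunk. One way to guard against this is to derive the IDs from the chunk text, as sketched below. `stable_ids` is a hypothetical helper, the sample chunks are made up, and whether re-adding an existing ID overwrites the entry is an assumption about the Chroma wrapper that should be verified against the installed version.

```python
import uuid


def stable_ids(texts, namespace=uuid.NAMESPACE_URL):
    """Derive a repeatable UUID from each chunk's text (hypothetical helper)."""
    return [str(uuid.uuid5(namespace, text)) for text in texts]


# Made-up stand-ins for the policy chunks.
chunks = [
    "**1. Purpose and Scope**\nSample body one.",
    "**2. General Expectations**\nSample body two.",
]

ids = stable_ids(chunks)
print(ids == stable_ids(chunks))  # the same text always maps to the same ID

# In the notebook this might be used as (not executed here):
# vector_store.add_documents(policy_docs, ids=stable_ids(policy_chunks))
```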

## Employee Single View
---
LangChain recommends querying CSV data through SQL, so I will create a SQLite database for the employee single view data.

In [38]:
import pandas as pd

In [39]:
employee_df = pd.read_csv('../data/input/employee_single_view.csv')
employee_df

Unnamed: 0,EmployeeID,Age,Gender,MaritalStatus,EducationLevel,Dependents,TenureMonths,Department,JobRole,EmploymentType,...,TeamEngagementIndex,WellnessProgramParticipation,AbsenceDaysLast6Months,SickLeaveFrequency,LateArrivalsFrequency,CommuteTimeMinutes,JobMarketIndex,Resigned,AttritionScore,AttritionCategory
0,1,50,Female,Single,Bachelor,3,168,Sales,Manager,Full-time,...,4.26,0,7,5,5,95,0.94,0,0.15,Low
1,2,36,Female,Married,Bachelor,0,130,Sales,Consultant,Contract,...,4.11,1,7,1,5,108,0.53,1,0.81,High
2,3,29,Male,Married,PhD,0,84,Finance,Consultant,Full-time,...,4.04,1,11,1,3,15,0.73,0,0.11,Low
3,4,42,Female,Divorced,Bachelor,1,65,IT,Developer,Full-time,...,1.75,1,0,3,14,58,0.92,1,0.81,High
4,5,40,Female,Divorced,PhD,3,63,Sales,Developer,Contract,...,1.35,0,10,0,14,68,0.98,1,0.76,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,196,40,Female,Married,Master,4,101,Finance,Developer,Contract,...,4.45,1,1,3,3,25,0.05,1,0.72,Medium
196,197,41,Male,Divorced,Bachelor,1,149,Finance,Analyst,Part-time,...,2.26,0,14,8,1,101,0.62,0,0.23,Low
197,198,53,Male,Divorced,PhD,3,90,Marketing,Manager,Part-time,...,3.15,0,4,6,5,109,0.21,0,0.14,Low
198,199,28,Female,Married,Bachelor,1,110,IT,Consultant,Full-time,...,4.36,1,19,2,2,11,0.59,0,0.10,Low


In [40]:
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

In [41]:
# SQLite connects to file-based databases
engine = create_engine("sqlite:///employee.db")

In [42]:
employee_df.to_sql('employee_single_view', con=engine, index=False)

200
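
`to_sql` defaults to `if_exists="fail"`, so re-running the cell above once `employee.db` already contains the table raises a `ValueError`. The sketch below shows one way to make the load repeatable and to confirm the row count afterwards. It uses a made-up three-row DataFrame and an in-memory SQLite connection rather than the notebook's engine, so nothing is written to disk.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for the employee single view data.
toy_df = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [50, 36, 29]})

# In-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")

# Load twice: if_exists="replace" drops and recreates the table each time.
for _ in range(2):
    toy_df.to_sql("employee_single_view", con=con, index=False, if_exists="replace")

# Confirm the table holds exactly the rows of the DataFrame.
row_count = con.execute("SELECT COUNT(*) FROM employee_single_view").fetchone()[0]
print(row_count, len(toy_df))
```

Using `if_exists="replace"` trades safety for convenience: it silently discards whatever the table held before, which is acceptable for a derived table like this one but not for data that exists only in the database.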

In [43]:
db = SQLDatabase(engine=engine)

In [46]:
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM employee_single_view WHERE Age > 50 LIMIT 1;"))

sqlite
['employee_single_view']
[(10, 57, 'Male', 'Divorced', 'PhD', 2, 203, 'Sales', 'Specialist', 'Contract', 'Hybrid', 115112, 4.11, 1, 20, 4, 4, 1, 30, 0, 'Rotational', 4.17, 4.27, 2.0, 2.04, 0, 6, 2, 1, 77, 0.93, 1, 0.35, 'Low')]
