# Fake Employee Generator Using Python and Faker Library

<img src="image.png">


## Project Description:
>Generating fake or synthetic data serves various important purposes for data engineers, especially in scenarios where access to real data might be limited, sensitive, or unavailable due to privacy concerns. Some key reasons for generating fake data include:

* **Testing and Development**: Fake data allows data engineers to develop and test systems, applications, or algorithms without using real, potentially sensitive, or limited data. It helps in verifying the functionality of software or systems before deploying them in a real environment.

* **Privacy and Security**: In situations where handling sensitive information is involved (such as healthcare or financial data), generating fake data helps in ensuring privacy and compliance with data protection regulations (like GDPR or HIPAA). It prevents exposure of real personal information while still allowing testing and development.

* **Data Quality and Quantity**: Generating synthetic data assists in creating datasets with specific characteristics, distributions, or patterns that might not exist in real data. It helps in assessing the robustness of algorithms or systems by providing diverse and comprehensive datasets.

* **Training and Education**: Synthetic datasets are valuable for educational purposes, training machine learning models, or teaching data analysis techniques without using actual sensitive data. It enables students or professionals to practice data-related tasks without compromising real data integrity.

* **Scenario Simulation**: Simulating various scenarios or edge cases using synthetic data helps in understanding how systems or models perform under different conditions. It aids in predicting system behavior in unforeseen circumstances.

* **Benchmarking and Performance Testing**: Generating large volumes of fake data helps in benchmarking the performance of databases, applications, or systems under heavy loads. It allows engineers to assess scalability and performance metrics.

* **Data Anonymization Techniques**: Fake data can be used alongside anonymization techniques to create surrogate keys or mask sensitive information, enabling the sharing of datasets for collaboration or research while protecting privacy.

* **Filling Gaps in Real Data**: In cases where real datasets have missing values or incomplete records, synthetic data can be used to fill these gaps for more comprehensive analysis and testing.


## Objectives:
>The aim of this project is to create a versatile, reliable, and comprehensive dataset that meets the requirements for analysis, testing, development, and compliance while safeguarding sensitive information.

## Tools used:
* Python Programming Language
* Postgres Database
* DBeaver
* Canvas (for design)

## Modules / Libraries Used:
* random
* Faker
* psycopg2
* Pandas

## Data Dictionary / Model

<img src="fake_employee_node.png" width="200" height="100">

Creating a robust employee dataset for data analysis and visualization involves several key fields that capture different aspects of an employee's information. Here's a list of fields you might consider including:

1. Employee ID: A unique identifier for each employee.
2. Name: First name and last name of the employee.
3. Gender: Male, female, non-binary, etc.
4. Date of Birth: Birthdate of the employee.
5. Email Address: Contact email of the employee.
6. Phone Number: Contact number of the employee.
7. Address: Home or work address of the employee.
8. Department: The department the employee belongs to (e.g., HR, Marketing, Engineering, etc.).
9. Job Title: The specific job title of the employee.
10. Manager ID: ID of the employee's manager.
11. Hire Date: Date when the employee was hired.
12. Salary: Employee's salary or compensation.
13. Employment Status: Full-time, part-time, contractor, etc.
14. Employee Type: Regular, temporary, contract, etc.
15. Education Level: Highest level of education attained by the employee.
16. Certifications: Any relevant certifications the employee holds.
17. Skills: Specific skills or expertise possessed by the employee.
18. Performance Ratings: Ratings or evaluations of employee performance.
19. Work Experience: Previous work experience of the employee.
20. Benefits Enrollment: Information on benefits chosen by the employee (e.g., healthcare plan, retirement plan, etc.).
21. Work Location: Physical location where the employee works.
22. Work Hours: Regular working hours or shifts of the employee.
23. Employee Status: Active, on leave, terminated, etc.
24. Emergency Contact: Contact information of the employee's emergency contact person.
25. Employee Satisfaction Survey Responses: Data from employee satisfaction surveys, if applicable.
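
Some of these fields constrain one another. For example, Hire Date should not fall before the employee reached working age. The sketch below shows one way to pick a hire date that is consistent with a given birth date, using only the standard library. The helper names `add_years` and `pick_hire_date`, the minimum working age of 18, and the five-year hiring window are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

MIN_WORKING_AGE = 18       # assumption: nobody is hired before turning 18
HIRING_WINDOW_YEARS = 5    # assumption: all hires happened in the last 5 years


def add_years(d, years):
    """Return d shifted by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # Feb 29 in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def pick_hire_date(date_of_birth, today, rng):
    """Pick a random hire date inside the hiring window and on or after
    the employee's 18th birthday."""
    window_start = add_years(today, -HIRING_WINDOW_YEARS)
    earliest = max(window_start, add_years(date_of_birth, MIN_WORKING_AGE))
    if earliest > today:
        raise ValueError("employee is younger than the minimum working age")
    span_days = (today - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))


rng = random.Random(0)
today = date(2024, 1, 1)
hire_date = pick_hire_date(date(2004, 6, 15), today, rng)
print(hire_date)  # some date between 2022-06-15 and 2024-01-01
```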

### Step 1: Install Necessary Libraries

In [None]:
!pip install Faker
!pip install psycopg2

The libraries `Faker` and `psycopg2` are used in this code to generate fake employee data and to interact with a PostgreSQL database. Here's why each one is used:

**Faker Library:**

* The Faker library is used to generate realistic and random fake data for attributes like names, email addresses, addresses, and more. It makes it easy to create synthetic data that closely resembles real data, which is important for testing and simulating real-world scenarios.

* Without `Faker`, you would need to write custom code to generate fake data for each attribute, which can be time-consuming and may not produce as realistic results.
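
To make that second point concrete, here is a rough sketch of what hand-rolled generation could look like for just two attributes. The word lists and the helper names `handmade_name` and `handmade_email` are invented for this example; they are not part of the project.

```python
import random

# Hypothetical, deliberately tiny word lists. Faker ships much larger,
# locale-aware datasets for the same purpose.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton"]
DOMAINS = ["example.com", "example.org"]


def handmade_name(rng):
    """Combine a random first name with a random last name."""
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def handmade_email(name, rng):
    """Derive an email address from a name and a random domain."""
    local_part = name.lower().replace(" ", ".")
    return f"{local_part}@{rng.choice(DOMAINS)}"


rng = random.Random(7)
name = handmade_name(rng)
email = handmade_email(name, rng)
print(name, email)
```

Every further attribute (addresses, phone numbers, job titles) would need its own list and formatting rules, which is the work that `fake.name()`, `fake.email()`, and the other Faker calls in Step 3 take care of.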

**psycopg2:**

* The `psycopg2` library is used to connect to and interact with a PostgreSQL database. It provides functions and classes that simplify database operations, including connecting to the database, executing SQL queries, and committing changes.

* Without `psycopg2`, you would have to write low-level code to establish a connection to the database and handle database operations, which can be complex and error-prone.

### Step 2: Import Libraries and Variables

In [28]:
import random
import psycopg2
from faker import Faker
import pandas as pd

# Change this to the desired number of employees
NUMBER_OF_EMPLOYEES = 1000

all_employees = []  # all employee list
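
Because the generator relies on `random`, every run produces a different dataset. If repeatable output is wanted (for example, in tests), one option is to seed the random number generator first. The sketch below shows the effect using only the standard library; the seed value and the `pick_departments` helper are illustrative assumptions. Faker offers a comparable `Faker.seed()` class method for its own providers.

```python
import random

DEPARTMENTS = ['HR', 'Marketing', 'Engineering', 'Finance', 'IT']


def pick_departments(seed, count):
    """Return `count` random departments from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.choice(DEPARTMENTS) for _ in range(count)]


first_run = pick_departments(seed=42, count=5)
second_run = pick_departments(seed=42, count=5)
print(first_run == second_run)  # prints True: same seed, same sequence
```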

### Step 3: Define Functions:

In [29]:
# Create employees Table
def create_employee_tbl(cursor):
    create_table_sql = """
        CREATE TABLE IF NOT EXISTS employees (
          employee_id integer,
          name varchar(225),
          gender varchar(10),
          dates_of_birth date,
          email varchar(225),
          phone_number varchar(225),
          address varchar(225),
          department varchar(150),
          job_titles varchar(150),
          manager_id integer,
          hire_date date,
          salary float,
          employment_status varchar(225),
          employee_type varchar(225),
          education_level varchar(225),
          certifications text,
          skills text,
          performance_ratings integer,
          work_experience text,
          benefits_enrollment text,
          city varchar(225),
          work_hours varchar(225),
          employee_status varchar(225),
          emergency_contacts varchar(225)
        );
    """
    cursor.execute(create_table_sql)
    print("Table employees created!")

In [30]:
# Create one Faker instance up front and reuse it for every employee
fake = Faker()

# Create a function to generate a fake employee:
def generate_fake_employee():
    
    # Generate one value per employee attribute
    employee_ids = random.randint(1000, 9999)
    names = fake.name()
    genders = random.choice(['Male', 'Female', 'Non-Binary'])
    dates_of_births = fake.date_of_birth(minimum_age=18, maximum_age=65)
    emails = fake.email()
    phone_numbers = fake.phone_number() 
    addresses = fake.address()
    departments = random.choice(['HR', 'Marketing', 'Engineering', 'Finance', 'IT'])
    job_titles = fake.job()
    manager_ids = random.randint(1, 10)
    hire_dates = fake.date_between(start_date='-5y', end_date='today')
    salaries = random.randint(40000, 120000)
    employment_status = random.choice(['Full-time', 'Part-time', 'Contractor'])
    employee_type = random.choice(['Regular', 'Temporary', 'Contract']) 
    education_level = random.choice(['High School', 'Associate Degree', 'Bachelor\'s Degree', 'Master\'s Degree', 'PhD'])
    certifications = ', '.join(fake.words(nb=random.randint(1, 3))) 
    skills = ', '.join(fake.words(nb=random.randint(2, 5)))
    performance_ratings = random.randint(1, 5) 
    work_experience = fake.paragraph(nb_sentences=2)
    benefits_enrollment = ', '.join(fake.words(nb=random.randint(1, 3)))
    city = fake.city()
    work_hours = random.choice(['9-5', '12-8', 'Night Shift']) 
    employee_status = random.choice(['Active', 'On Leave', 'Terminated'])
    emergency_contacts = fake.name() + ': ' + fake.phone_number()
    
    return (
        employee_ids, names, genders, dates_of_births, emails,
        phone_numbers, addresses, departments, job_titles, manager_ids,
        hire_dates, salaries, employment_status, employee_type,
        education_level, certifications, skills, performance_ratings,
        work_experience, benefits_enrollment, city, work_hours,
        employee_status, emergency_contacts)
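
One thing to watch in the function above: `random.randint(1000, 9999)` can return the same `employee_id` more than once, and with 1,000 draws from 9,000 possible values repeats are all but certain. In addition, `manager_id` (1 to 10) never matches any `employee_id` (1000 to 9999). A possible alternative, sketched here with assumed helper names, is to draw all IDs up front with `random.sample`, which samples without replacement, and to pick managers from that same pool.

```python
import random


def make_unique_ids(count, low=1000, high=9999, rng=random):
    """Draw `count` distinct IDs from the inclusive range [low, high]."""
    return rng.sample(range(low, high + 1), count)


def assign_managers(employee_ids, manager_count=10, rng=random):
    """Map each employee ID to a manager ID taken from the same ID pool.

    Assumption for this sketch: the first `manager_count` IDs act as
    managers, and managers themselves have no manager (None).
    """
    managers = employee_ids[:manager_count]
    return {
        emp_id: (None if emp_id in managers else rng.choice(managers))
        for emp_id in employee_ids
    }


ids = make_unique_ids(1000)
manager_of = assign_managers(ids)
print(len(ids) == len(set(ids)))  # prints True: no ID is repeated
```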


In [31]:
# Create a function to insert the fake employee record into the database:
def insert_employee_record(cursor, employee):
    insert_sql = """
            INSERT INTO employees(employee_id,
                                  name,
                                  gender,
                                  dates_of_birth,
                                  email,
                                  phone_number,
                                  address,
                                  department,
                                  job_titles,
                                  manager_id,
                                  hire_date,
                                  salary,
                                  employment_status,
                                  employee_type,
                                  education_level,
                                  certifications,
                                  skills,
                                  performance_ratings,
                                  work_experience,
                                  benefits_enrollment,
                                  city,
                                  work_hours,
                                  employee_status,
                                  emergency_contacts)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);
    """
    cursor.execute(insert_sql, employee)
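
The column list appears in three places in this notebook (the `CREATE TABLE` statement, the `INSERT` statement, and the CSV step), so the copies can drift apart. One option, sketched below, is to keep the names in a single tuple and build the statement from it. `EMPLOYEE_COLUMNS` and `build_insert_sql` are hypothetical names introduced for this example; the column names themselves are copied from the table definition above.

```python
EMPLOYEE_COLUMNS = (
    "employee_id", "name", "gender", "dates_of_birth", "email",
    "phone_number", "address", "department", "job_titles", "manager_id",
    "hire_date", "salary", "employment_status", "employee_type",
    "education_level", "certifications", "skills", "performance_ratings",
    "work_experience", "benefits_enrollment", "city", "work_hours",
    "employee_status", "emergency_contacts",
)


def build_insert_sql(table, columns):
    """Build a parameterized INSERT using %s placeholders (psycopg2 style).

    Only identifiers are formatted into the string, and they come from a
    fixed tuple in the code, never from user input. Row values are still
    passed separately to cursor.execute().
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders});"


insert_sql = build_insert_sql("employees", EMPLOYEE_COLUMNS)
print(insert_sql.count("%s"))  # prints 24, one placeholder per column
```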

## Step 4: Create a Connection to the Database

In [32]:
conn = None

try:
    # connect to the PostgreSQL server
    print('Connecting to the PostgreSQL database...')
    conn = psycopg2.connect(host="localhost",
                        port="5432",
                        database="fake_db",
                        user="fake_user",
                        password="fake_password")

    # set session to autocommit
    conn.set_session(autocommit=True)
    print("Postgres session set to Autocommit")


    #create a cursor
    curr = conn.cursor()

    # execute a statement
    print('PostgreSQL database version:')
    curr.execute('SELECT version()')

    # display the PostgreSQL database server version
    db_version = curr.fetchone()
    print(db_version)

    # create schema
    curr.execute("CREATE SCHEMA IF NOT EXISTS raw;")
    curr.execute("SET SCHEMA 'raw';")
    
    # Check whether the employees table already exists in the raw schema.
    # fetchone() returns None when no row matches, so only index into the
    # result after confirming that a row came back.
    curr.execute("""SELECT * FROM information_schema.tables 
                WHERE table_schema=%s AND table_name=%s""", ('raw', 'employees'))
    row = curr.fetchone()

    if row:
        print(row[0])
        curr.execute("TRUNCATE employees;")
        print("Truncated Table Employee")
    else:
        # Create table
        create_employee_tbl(curr)

except(Exception, psycopg2.DatabaseError) as error:
    print(error)
        
        

Connecting to the PostgreSQL database...
Postgres session set to Autocommit
PostgreSQL database version:
('PostgreSQL 16.0 on aarch64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.102), 64-bit',)
fake_db
Truncated Table Employee
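
The connection parameters above are hardcoded in the cell. A common alternative is to read them from environment variables so that credentials stay out of the notebook. The sketch below uses the variable names that PostgreSQL client tools conventionally read (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`). The helper name `connection_params` is an assumption, and the defaults simply mirror the values used in this project.

```python
import os


def connection_params(env=os.environ):
    """Collect connection settings, falling back to this project's defaults."""
    return {
        "host": env.get("PGHOST", "localhost"),
        "port": env.get("PGPORT", "5432"),
        "database": env.get("PGDATABASE", "fake_db"),
        "user": env.get("PGUSER", "fake_user"),
        "password": env.get("PGPASSWORD", "fake_password"),
    }


# A plain dict stands in for the environment here; the host is hypothetical
params = connection_params({"PGHOST": "db.internal"})
print(params["host"], params["database"])  # prints: db.internal fake_db
```

The resulting dictionary has the same keys as the keyword arguments in Step 4, so it could be unpacked into the call as `psycopg2.connect(**connection_params())`.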


## Step 5: Generate Data and Insert to Database

In [33]:
for _ in range(NUMBER_OF_EMPLOYEES):
    
    employee = generate_fake_employee()
    insert_employee_record(curr, employee)
    
    all_employees.append(employee)

print(f"{NUMBER_OF_EMPLOYEES} generated and inserted into Database successfully")

10 generated and inserted into Database successfully
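
The loop above sends one `INSERT` per employee. The Python DB-API also defines `cursor.executemany()`, which takes the statement once plus a sequence of parameter tuples. The sketch below demonstrates the call shape with the standard library's `sqlite3` module standing in for PostgreSQL, so that it runs without a server. The three-column table and the sample rows are made up for the example. Note that `sqlite3` uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3

# Made-up sample rows: (employee_id, name, department)
rows = [
    (1001, "Alex Doe", "HR"),
    (1002, "Sam Roe", "IT"),
    (1003, "Kim Poe", "Finance"),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
cur = conn.cursor()
cur.execute(
    "CREATE TABLE employees (employee_id INTEGER, name TEXT, department TEXT)"
)

# One call inserts every row; the statement itself is written only once
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
conn.commit()

cur.execute("SELECT COUNT(*) FROM employees")
inserted = cur.fetchone()[0]
print(inserted)  # prints 3
conn.close()
```

With psycopg2, the corresponding call would be `curr.executemany(insert_sql, all_employees)` once the full list has been generated. Whether that is faster than the loop depends on the driver; psycopg2 also provides `psycopg2.extras.execute_values` for bulk inserts.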


## Step 6: Close Connections

In [34]:
# close connections
curr.close()
conn.close()
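
If an earlier cell raises an exception, the cell above may never run and the connection stays open. One way to guarantee cleanup is `contextlib.closing`, which calls `.close()` on exit whether or not an error occurred. The sketch uses `sqlite3` as a stand-in so that it runs anywhere; the same pattern applies to any object with a `close()` method, including psycopg2 connections and cursors.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 1 + 1")
        result = cur.fetchone()[0]

# Both the cursor and the connection are closed at this point,
# so any further use of the connection is rejected.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(result, still_open)  # prints: 2 False
```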

## Step 7: Save Data to CSV

In [40]:
# convert the employee records into a dictionary of columns:
employee_dict = {
    'employee_id': [],
    'name': [],
    'gender': [],
    'dates_of_birth': [],
    'email': [],
    'phone_number': [],
    'address': [],
    'department': [],
    'job_titles': [],
    'manager_id': [],
    'hire_date': [],
    'salary': [],
    'employment_status': [],
    'employee_type': [],
    'education_level': [],
    'certifications': [],
    'skills': [],
    'performance_ratings': [],
    'work_experience': [],
    'benefits_enrollment': [],
    'city': [],
    'work_hours': [],
    'employee_status': [],
    'emergency_contacts': []
}

for record in all_employees:
    employee_dict['employee_id'].append(record[0])
    employee_dict['name'].append(record[1])
    employee_dict['gender'].append(record[2])
    employee_dict['dates_of_birth'].append(record[3])
    employee_dict['email'].append(record[4])
    employee_dict['phone_number'].append(record[5])
    employee_dict['address'].append(record[6])
    employee_dict['department'].append(record[7])
    employee_dict['job_titles'].append(record[8])
    employee_dict['manager_id'].append(record[9])
    employee_dict['hire_date'].append(record[10])
    employee_dict['salary'].append(record[11])
    employee_dict['employment_status'].append(record[12])
    employee_dict['employee_type'].append(record[13])
    employee_dict['education_level'].append(record[14])
    employee_dict['certifications'].append(record[15])
    employee_dict['skills'].append(record[16])
    employee_dict['performance_ratings'].append(record[17])
    employee_dict['work_experience'].append(record[18])
    employee_dict['benefits_enrollment'].append(record[19])
    employee_dict['city'].append(record[20])
    employee_dict['work_hours'].append(record[21])
    employee_dict['employee_status'].append(record[22])
    employee_dict['emergency_contacts'].append(record[23])

    
# create dataframe
employee_df = pd.DataFrame(employee_dict)

# save to CSV (index=False leaves out the DataFrame's row index column)
employee_df.to_csv('fake_employee.csv', index=False)

In [41]:
employee_df.shape

(10, 24)
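
The 24 `append` calls in Step 7 transpose a list of row tuples into a dictionary of columns. The same transposition can be written with `zip`, as in the sketch below. It uses three made-up columns to stay short; with the real data, the full list of 24 column names would be supplied in table order.

```python
columns = ["employee_id", "name", "department"]

# Made-up rows in the same positional layout as `columns`
records = [
    (1001, "Alex Doe", "HR"),
    (1002, "Sam Roe", "IT"),
]

# zip(*records) regroups the values by position: all IDs, all names, ...
employee_columns = {
    column: list(values) for column, values in zip(columns, zip(*records))
}
print(employee_columns["name"])  # prints ['Alex Doe', 'Sam Roe']
```

pandas can also take the row tuples directly: `pd.DataFrame(all_employees, columns=column_names)`, where `column_names` is a hypothetical list of the 24 names, builds the same frame without the intermediate dictionary.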