# Data Extraction & Loading

### Data Extraction and Loading Steps

1. **Data Extraction**:
   - **Connect to the Database**: Use appropriate libraries (e.g., `psycopg2` for PostgreSQL) to establish a connection.
   - **Retrieve Data**: Write SQL queries to extract the necessary tables or data subsets.
   - **Export Data**: Optionally, save the extracted data into CSV files for further processing.

2. **Data Loading**:
   - **Load Data into DataFrames**: Use libraries like `pandas` to load the extracted CSV files or data directly from the database into DataFrames for manipulation.
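
The extract, export, and load steps above can be sketched end to end as follows. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server; the table contents, the `demo_` names, and the temporary staging directory are hypothetical.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Stand-in source database (assumption: SQLite instead of PostgreSQL).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE "Employee" (emp_id TEXT, designation TEXT)')
demo_conn.executemany(
    'INSERT INTO "Employee" VALUES (?, ?)',
    [("E1", "SOFTWARE_ENGINEER"), ("E2", "SOLUTION_ENABLER")],
)

# Extract: read the table into a DataFrame.
extracted = pd.read_sql_query('SELECT * FROM "Employee"', demo_conn)
demo_conn.close()

# Export: write the DataFrame to a CSV file in a temporary staging directory.
demo_staging_dir = tempfile.mkdtemp()
csv_path = os.path.join(demo_staging_dir, "Employee.csv")
extracted.to_csv(csv_path, index=False)

# Load: read the staged CSV back into a DataFrame.
loaded = pd.read_csv(csv_path)
print(loaded)
```

Staging to CSV decouples later cleaning steps from the database, at the cost of losing column types, which then have to be restored after loading.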


In [8]:
%pip install pandas  psycopg2

Note: you may need to restart the kernel to use updated packages.





In [9]:
# Import necessary libraries
import pandas as pd
import psycopg2
import os

# Define your database connection parameters
db_params = {
    'database': 'Emp_course_management',
    'user': 'postgres',
    'password': '965335',
    'host': 'localhost',
    'port': '5432'
}

# Connect to PostgreSQL database
conn = psycopg2.connect(**db_params)

# Create a cursor object
cur = conn.cursor()

# Fetch all table names from the public schema
cur.execute("""
    SELECT table_name 
    FROM information_schema.tables 
    WHERE table_schema='public';
""")
tables = cur.fetchall()

# Define staging directory
staging_dir = 'staging'
os.makedirs(staging_dir, exist_ok=True)  # Create staging directory if it doesn't exist

# Loop through each table and export to CSV
for table in tables:
    table_name = table[0]
    print(f"Exporting table: {table_name}")
    
    # Read table into a DataFrame
    df = pd.read_sql_query(f'SELECT * FROM public."{table_name}";', conn)
    
    # Define the path for the CSV file
    csv_file_path = os.path.join(staging_dir, f"{table_name}.csv")
    
    # Export DataFrame to CSV
    df.to_csv(csv_file_path, index=False)
    print(f"Table {table_name} exported to {csv_file_path}")

# Close the cursor and connection
cur.close()
conn.close()

Exporting table: _prisma_migrations
Table _prisma_migrations exported to staging\_prisma_migrations.csv
Exporting table: Employee
Table Employee exported to staging\Employee.csv
Exporting table: Course
Table Course exported to staging\Course.csv
Exporting table: CourseEnrollment
Table CourseEnrollment exported to staging\CourseEnrollment.csv
Exporting table: User
Table User exported to staging\User.csv
Exporting table: QuestionBank
Table QuestionBank exported to staging\QuestionBank.csv
Exporting table: Questions
Table Questions exported to staging\Questions.csv
Exporting table: CourseEngageLogs
Table CourseEngageLogs exported to staging\CourseEngageLogs.csv
Exporting table: Notifications
Table Notifications exported to staging\Notifications.csv
Exporting table: LearningPathMap
Table LearningPathMap exported to staging\LearningPathMap.csv
Exporting table: LearningPath
Table LearningPath exported to staging\LearningPath.csv
Exporting table: Prerequisites
Table Prerequisites exported to 



# Data Cleaning & Transformation

### Data Cleaning Steps

1. **Remove Duplicates**: 
   - Identify and remove duplicate records to ensure data integrity.

2. **Handle Missing Values**:
   - Decide on a strategy for missing data (e.g., imputation, removal, or using a placeholder).
   - Implement the strategy based on your analysis needs.

3. **Data Type Conversion**:
   - Ensure all columns have the correct data types (e.g., integers, floats, dates).
   - Convert categorical variables to a suitable format (e.g., using one-hot encoding).

4. **Standardization and Normalization** (data science):
   - Standardize numerical columns to a common scale, if necessary.
   - Normalize data for specific algorithms that require it.
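
The four cleaning steps above can be sketched on a small example. This is an illustrative sketch, not the notebook's own cleaning code: the `raw` records are hypothetical, and filling missing scores with 0 is just one possible strategy.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing score, string dates.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "score": [80.0, 80.0, None, 40.0],
    "joined": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-03-15"],
})

# 1. Remove duplicates on the key column.
clean = raw.drop_duplicates(subset="id").copy()

# 2. Handle missing values (placeholder strategy: fill with 0).
clean["score"] = clean["score"].fillna(0)

# 3. Convert data types (strings to datetimes).
clean["joined"] = pd.to_datetime(clean["joined"])

# 4. Min-max normalize the numeric column to the range [0, 1].
score_min, score_max = clean["score"].min(), clean["score"].max()
clean["score_scaled"] = (clean["score"] - score_min) / (score_max - score_min)

print(clean)
```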

### Feature Engineering and Data Preparation Steps

1. **Feature Engineering**:
   - Create new features that may be beneficial for prediction (e.g., extracting year from a date, combining features).
   - Encode categorical variables using techniques like label encoding or one-hot encoding (data science).

2. **Aggregation and Grouping**:
   - Aggregate data to a desired level (e.g., total sales per month).
   - Group data based on relevant categories to simplify analysis.

3. **Outlier Detection and Treatment**:
   - Identify and handle outliers based on domain knowledge or statistical methods.
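
As a rough illustration of the preparation steps above, the sketch below derives a year feature, one-hot encodes a category, aggregates per group, and flags outliers with the 1.5 × IQR rule. The `events` data is hypothetical, and the IQR rule is one common statistical choice rather than the method this notebook prescribes.

```python
import pandas as pd

# Hypothetical engagement records.
events = pd.DataFrame({
    "level": ["BEGINNER", "EXPERT", "BEGINNER", "EXPERT", "BEGINNER"],
    "started": pd.to_datetime(
        ["2023-10-23", "2024-10-07", "2024-10-07", "2024-10-08", "2024-10-08"]
    ),
    "seconds": [10, 12, 11, 9, 4000],
})

# Feature engineering: derive the year from a date column.
events["start_year"] = events["started"].dt.year

# Encoding: one-hot encode a categorical column.
encoded = pd.get_dummies(events, columns=["level"])

# Aggregation: total seconds per difficulty level.
seconds_per_level = events.groupby("level")["seconds"].sum()

# Outlier detection: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = events["seconds"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (events["seconds"] < q1 - 1.5 * iqr) | (events["seconds"] > q3 + 1.5 * iqr)

print(encoded.columns.tolist())
print(seconds_per_level)
print(events[is_outlier])
```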


In [10]:
# EMPLOYEE - TABLE

# Step 1: Load the data from the CSV file (assumed to be already extracted)
employee_data = pd.read_csv('./staging/Employee.csv')

# Step 2: Extract relevant columns
cleaned_employee_data = employee_data[['emp_id', 'email', 'emp_name', 'designation']]

# Step 3: Remove duplicates
cleaned_employee_data = cleaned_employee_data.drop_duplicates(subset='emp_id')

# Step 4: Convert columns to the pandas string dtype
cleaned_employee_data['emp_id'] = cleaned_employee_data['emp_id'].astype('string')
cleaned_employee_data['email'] = cleaned_employee_data['email'].astype('string')
cleaned_employee_data['emp_name'] = cleaned_employee_data['emp_name'].astype('string')
cleaned_employee_data['designation'] = cleaned_employee_data['designation'].astype('string')

# Step 5: Provide information about the cleaned table
print(cleaned_employee_data.info())
print(cleaned_employee_data.head())  # Show the first few rows of the cleaned data

# Optionally, save the cleaned data to a new CSV file
os.makedirs('./prep', exist_ok=True)  # Create the output directory if it doesn't exist
cleaned_employee_data.to_csv('./prep/cleaned_employee_data.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   emp_id       103 non-null    string
 1   email        103 non-null    string
 2   emp_name     103 non-null    string
 3   designation  103 non-null    string
dtypes: string(4)
memory usage: 3.3 KB
None
   emp_id                 email         emp_name  \
0  JMD128  JMD128@jmangroup.com    Perry Dibbert   
1  JMD129  JMD129@jmangroup.com  Cristina Reilly   
2  JMD130  JMD130@jmangroup.com  Cary Kerluke MD   
3  JMD131  JMD131@jmangroup.com      Walter Veum   
4  JMD132  JMD132@jmangroup.com      Rene Kirlin   

                     designation  
0               SOLUTION_ENABLER  
1            SOLUTION_CONSULTANT  
2  TECHNOLOGY_SOLUTION_ARCHITECT  
3   PRINCIPAL_SOLUTION_ARCHITECT  
4              SOFTWARE_ENGINEER  


In [11]:
# COURSE - TABLE

# Step 1: Load the data from the CSV file
courses_data = pd.read_csv('./staging/Course.csv')

# Step 2: Extract relevant columns
cleaned_courses_data = courses_data[['course_id', 'course_name', 'description', 'duration', 'difficulty_level']]

# Step 3: Remove duplicates
cleaned_courses_data = cleaned_courses_data.drop_duplicates(subset='course_id')

# Step 4: Convert duration to weeks
def duration_to_weeks(duration):
    if 'months' in duration:
        return int(duration.split()[0]) * 4  # Assuming 1 month = 4 weeks
    elif 'years' in duration:
        return int(duration.split()[0]) * 52  # Assuming 1 year = 52 weeks
    elif 'weeks' in duration:
        return int(duration.split()[0])
    else:
        return 1  # Handle any unexpected format

cleaned_courses_data['duration_in_weeks'] = cleaned_courses_data['duration'].apply(duration_to_weeks)

# Step 5: Clean the DataFrame by dropping the original duration column
cleaned_courses_data = cleaned_courses_data.drop(columns=['duration'])

# Step 6: Rename columns
cleaned_courses_data.rename(columns={'description': 'course_description', 'difficulty_level' : 'course_difficulty_level', 'duration_in_weeks' : 'course_duration_in_weeks'}, inplace=True)

# Step 7: Provide information about the cleaned table
print(cleaned_courses_data.info())
print(cleaned_courses_data.head())  # Show the first few rows of the cleaned data

# Optionally, save the cleaned data to a new CSV file
cleaned_courses_data.to_csv('./prep/cleaned_courses_data.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   course_id                 103 non-null    int64 
 1   course_name               103 non-null    object
 2   course_description        103 non-null    object
 3   course_difficulty_level   103 non-null    object
 4   course_duration_in_weeks  103 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 4.2+ KB
None
   course_id                            course_name  \
0        100      Commissioning editor Fundamentals   
1        101              Neurosurgeon Fundamentals   
2        102      Merchandiser, retail Fundamentals   
3        103  Arts development officer Fundamentals   
4        104    Embryologist, clinical Fundamentals   

                                  course_description course_difficulty_level  \
0  Loss give employee ball. Eye level popular app...
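
The `duration_to_weeks` helper above matches only the plural unit names and silently returns 1 for anything else. A more defensive variant might look like the sketch below. It assumes durations are strings such as `'3 months'` or `'1 week'` (the actual values in `Course.csv` are not shown here) and keeps the same conversions of 4 weeks per month and 52 weeks per year. The name `parse_duration_weeks` is hypothetical.

```python
import re

# Same conversion assumptions as the cell above: 1 month = 4 weeks, 1 year = 52 weeks.
WEEKS_PER_UNIT = {"week": 1, "month": 4, "year": 52}

def parse_duration_weeks(duration):
    """Return the duration in weeks, or None if the text is not recognized."""
    match = re.fullmatch(r"\s*(\d+)\s*(week|month|year)s?\s*", str(duration).lower())
    if match is None:
        return None  # Surface unexpected formats instead of guessing a value.
    count, unit = match.groups()
    return int(count) * WEEKS_PER_UNIT[unit]

print(parse_duration_weeks("3 months"))  # 12
print(parse_duration_weeks("1 year"))    # 52
print(parse_duration_weeks("soon"))      # None
```

Returning `None` for unrecognized input makes bad rows visible in `info()` as missing values, rather than hiding them behind a default of one week.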

In [12]:
# LEARNING_PATH - TABLE
learning_path_data = pd.read_csv('./staging/LearningPath.csv')

cleaned_learningPath = learning_path_data[['learning_path_id', 'description', 'path_name']]

cleaned_learningPath = cleaned_learningPath.drop_duplicates(subset='learning_path_id')

cleaned_learningPath.rename(columns={'path_name' : 'learning_path_name', 'description' : 'learning_path_description'}, inplace=True)

print(cleaned_learningPath.info())
print(cleaned_learningPath.head())

cleaned_learningPath.to_csv('./prep/cleaned_learning_paths.csv', index=False)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   learning_path_id           19 non-null     int64 
 1   learning_path_description  19 non-null     object
 2   learning_path_name         19 non-null     object
dtypes: int64(1), object(2)
memory usage: 588.0+ bytes
None
   learning_path_id                          learning_path_description  \
0                 1  A machine learning (ML) learning path is a str...   
1                 2  An Artificial Intelligence (AI) learning path ...   
2                 3  The Full Stack Learning Path equips learners w...   
3                 4  The Frontend Learning Path focuses on the desi...   
4               100    Master the fundamentals of software development   

        learning_path_name  
0         Machine Learning  
1  Artificial Intelligence  
2               Full Stack  
3 

In [13]:
# LearningPathMap - TABLE

learning_path_map_data = pd.read_csv('./staging/LearningPathMap.csv')

cleaned_learningPathMap = learning_path_map_data[['course_id', 'learning_path_id']]

cleaned_learningPathMap = cleaned_learningPathMap.drop_duplicates(subset=['course_id', 'learning_path_id'], keep='first')

print(cleaned_learningPathMap.info())
print(cleaned_learningPathMap.head())

cleaned_learningPathMap.to_csv('./prep/cleaned_learning_paths_map.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   course_id         262 non-null    int64
 1   learning_path_id  262 non-null    int64
dtypes: int64(2)
memory usage: 4.2 KB
None
   course_id  learning_path_id
0         12                 1
1         12                 2
2         10                 3
3         10                 4
4          9                 3


In [14]:
# CourseEnrollment - TABLE

Course_Enrollment_data = pd.read_csv('./staging/CourseEnrollment.csv')

cleaned_course_enrollment_data = Course_Enrollment_data[['enroll_id', 'emp_id', 'course_id', 'current_page', 'total_pages', 'test_score', 'course_certificate_url', 'createdAt']]

cleaned_course_enrollment_data = cleaned_course_enrollment_data.drop_duplicates(subset=['enroll_id', 'course_id'], keep='first')

print(cleaned_course_enrollment_data.info())
# Replace missing values without using inplace
cleaned_course_enrollment_data['current_page'] = cleaned_course_enrollment_data['current_page'].fillna(0)
cleaned_course_enrollment_data['total_pages'] = cleaned_course_enrollment_data['total_pages'].fillna(100)
cleaned_course_enrollment_data['test_score'] = cleaned_course_enrollment_data['test_score'].fillna(0)
# Create a boolean column 'course_certificate_generated': True when a non-empty certificate URL exists
cleaned_course_enrollment_data['course_certificate_generated'] = cleaned_course_enrollment_data['course_certificate_url'].apply(lambda x: isinstance(x, str) and bool(x.strip()))
cleaned_course_enrollment_data = cleaned_course_enrollment_data.drop(columns=['course_certificate_url'])

# Normalize current_page based on total_pages
cleaned_course_enrollment_data['completion_rate'] = cleaned_course_enrollment_data['current_page'] / cleaned_course_enrollment_data['total_pages']
# Normalize test_score (assuming the max score is 100)
cleaned_course_enrollment_data['test_score_normalized'] = cleaned_course_enrollment_data['test_score'] / 100

cleaned_course_enrollment_data = cleaned_course_enrollment_data.drop(columns=['current_page', 'total_pages', 'test_score'])

print(cleaned_course_enrollment_data.info())
print(cleaned_course_enrollment_data.head())

cleaned_course_enrollment_data.to_csv('./prep/cleaned_courseEnrollment.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enroll_id               296 non-null    int64  
 1   emp_id                  296 non-null    object 
 2   course_id               296 non-null    int64  
 3   current_page            294 non-null    float64
 4   total_pages             294 non-null    float64
 5   test_score              293 non-null    float64
 6   course_certificate_url  195 non-null    object 
 7   createdAt               296 non-null    object 
dtypes: float64(3), int64(2), object(3)
memory usage: 18.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   enroll_id                     296 non-null    int64  
 1   emp_id      

In [15]:
# CourseEngageLogs - TABLE  

CourseEngageLogs = pd.read_csv('./staging/CourseEngageLogs.csv')
cleaned_course_engageLogs_data = CourseEngageLogs[['enroll_id', 'start_time', 'time_spent_in_sec']]

cleaned_course_engageLogs_data = cleaned_course_engageLogs_data.drop_duplicates(subset=['enroll_id', 'start_time'], keep='first')

print(cleaned_course_engageLogs_data.info())
print(cleaned_course_engageLogs_data.head())

cleaned_course_engageLogs_data.to_csv('./prep/cleaned_courseEngageLogs.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 791 entries, 0 to 790
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   enroll_id          791 non-null    int64 
 1   start_time         791 non-null    object
 2   time_spent_in_sec  791 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 18.7+ KB
None
   enroll_id               start_time  time_spent_in_sec
0          2  2024-10-07 04:33:08.446                  8
1          4  2024-10-07 04:33:22.656                  4
2          3  2024-10-07 05:08:09.332                  4
3          5  2024-10-08 09:14:53.354                 10
4          3  2023-10-23 21:03:47.279               4191


In [16]:
#  Notifications - TABLE

notifications_data = pd.read_csv('./staging/Notifications.csv')

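# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
cleaned_notifications_data = notifications_data[['notification_id', 'enroll_id', 'status', 'user_viewed', 'created_date']].copy()

print(cleaned_notifications_data.info())    # status contains a null: the admin has not yet decided, so treat it as False

# Cast to the nullable boolean dtype before filling, then convert to plain bool
cleaned_notifications_data['status'] = cleaned_notifications_data['status'].astype('boolean').fillna(False).astype(bool)
cleaned_notifications_data.rename(columns={'status': 'certificate_status'}, inplace=True)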

print(cleaned_notifications_data.info())    
print(cleaned_notifications_data.head())

cleaned_notifications_data.to_csv('./prep/cleaned_Notifications.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1116 entries, 0 to 1115
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   notification_id  1116 non-null   int64 
 1   enroll_id        1116 non-null   int64 
 2   status           1115 non-null   object
 3   user_viewed      1116 non-null   bool  
 4   created_date     1116 non-null   object
dtypes: bool(1), int64(2), object(2)
memory usage: 36.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1116 entries, 0 to 1115
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   notification_id     1116 non-null   int64 
 1   enroll_id           1116 non-null   int64 
 2   certificate_status  1116 non-null   bool  
 3   user_viewed         1116 non-null   bool  
 4   created_date        1116 non-null   object
dtypes: bool(2), int64(2), object(1)
memory usage: 28.5+ KB
None
   noti



# Data Integration & Storage

### Data Integration and Storage Steps

1. **Join Tables**:
   - Merge or join different tables to create a unified dataset that includes all necessary features for analysis.
   - Ensure that the join keys are appropriate and that the merging process retains the relevant data.

2. **Data Storage**:
   - Create Final Tables:
     - Organize the cleaned and transformed data into final tables that are structured for analysis and modeling.
     - Save these final tables as CSV files or store them in a database for easy access.
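
One way to check that join keys are appropriate is to let pandas validate the merge. The sketch below uses hypothetical `enrollments` and `courses` frames: `validate="many_to_one"` raises an error if the right-hand key is not unique, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical tables: course 99 has no entry in `courses`.
enrollments = pd.DataFrame({"enroll_id": [1, 2, 3], "course_id": [10, 11, 99]})
courses = pd.DataFrame({"course_id": [10, 11], "course_name": ["SQL", "Python"]})

# validate="many_to_one" raises MergeError if course_id is duplicated in `courses`.
# indicator=True adds a `_merge` column recording where each row came from.
joined = enrollments.merge(
    courses, on="course_id", how="left", validate="many_to_one", indicator=True
)

# Rows tagged "left_only" had no matching course.
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched)
```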


In [17]:
# Load the CSV files into DataFrames
cleaned_employee_data = pd.read_csv('prep/cleaned_employee_data.csv')
cleaned_course_enrollment_data = pd.read_csv('prep/cleaned_courseEnrollment.csv')
cleaned_courses_data = pd.read_csv('prep/cleaned_courses_data.csv')
cleaned_learning_paths_map = pd.read_csv('prep/cleaned_learning_paths_map.csv')
cleaned_learning_paths_data = pd.read_csv('prep/cleaned_learning_paths.csv')
cleaned_course_engage_logs = pd.read_csv('prep/cleaned_courseEngageLogs.csv')
cleaned_notifications = pd.read_csv('prep/cleaned_Notifications.csv')


In [18]:
success_rate_df = cleaned_notifications.groupby('enroll_id').agg(
    total_attempts=('certificate_status', 'size'),  # Total attempts
    accepted_attempts=('certificate_status', 'sum'),  # Count of accepted attempts (True counts as 1)
).reset_index()

# Calculate success rate
success_rate_df['success_rate'] = success_rate_df['accepted_attempts'] / success_rate_df['total_attempts']

# Display the results
print(success_rate_df)

     enroll_id  total_attempts  accepted_attempts  success_rate
0            2              10                  0      0.000000
1            3               9                  3      0.333333
2            4              10                  3      0.300000
3            5               7                  0      0.000000
4          100               4                  1      0.250000
..         ...             ...                ...           ...
289        385               2                  1      0.500000
290        386               2                  1      0.500000
291        387               4                  1      0.250000
292        388               5                  1      0.200000
293        389               3                  0      0.000000

[294 rows x 4 columns]


In [19]:
# Group by enroll_id and calculate total time spent
total_time_spent_df = cleaned_course_engage_logs.groupby('enroll_id')['time_spent_in_sec'].sum().reset_index()

# Display the results
print(total_time_spent_df)

     enroll_id  time_spent_in_sec
0            2              23788
1            3              25305
2            4              15029
3            5              21124
4          100              10498
..         ...                ...
289        385               2746
290        386               2697
291        387               5573
292        388               7283
293        389               3064

[294 rows x 2 columns]


In [20]:
# Merge the tables
merged_data = (
    cleaned_course_enrollment_data
    .merge(cleaned_employee_data, on='emp_id', how='left')  # Add employee details to each enrollment
    .merge(cleaned_courses_data, on='course_id', how='left')  # Add course details
    .merge(cleaned_learning_paths_map, on='course_id', how='left')  # Add the learning paths each course belongs to (one row per path)
    .merge(cleaned_learning_paths_data, on='learning_path_id', how='left')  # Add learning path details
    .merge(total_time_spent_df, on='enroll_id', how='left')  # Add total time spent (from CourseEngageLogs)
    .merge(success_rate_df, on='enroll_id', how='left')  # Add success rate (from Notifications)
)

# Display the merged data
print(merged_data)

os.makedirs('./reporting', exist_ok=True)  # Create the output directory if it doesn't exist
merged_data.to_csv('./reporting/merged.csv', index=False)

     enroll_id  emp_id  course_id                createdAt  \
0          100  JMD100        125  2024-08-19 14:25:07.025   
1          100  JMD100        125  2024-08-19 14:25:07.025   
2          101  JMD100         10  2024-08-15 08:12:56.833   
3          101  JMD100         10  2024-08-15 08:12:56.833   
4          102  JMD100        128  2024-04-29 19:10:54.473   
..         ...     ...        ...                      ...   
705        389  JMD199        171  2024-02-02 12:27:02.552   
706        389  JMD199        171  2024-02-02 12:27:02.552   
707          7  JMD101        107  2024-10-14 09:11:39.053   
708          7  JMD101        107  2024-10-14 09:11:39.053   
709          7  JMD101        107  2024-10-14 09:11:39.053   

     course_certificate_generated  completion_rate  test_score_normalized  \
0                            True             0.13                   0.13   
1                            True             0.13                   0.13   
2                       

In [21]:
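# Group by emp_id to aggregate the features.
# Note: merged_data holds one row per (enrollment, learning path) pair, so an enrollment whose
# course maps to several learning paths contributes to the sums below more than once.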
employee_performance = merged_data.groupby('emp_id').agg({
    'time_spent_in_sec': 'sum',  # Total Time Spent
    'completion_rate': 'mean',  # Average Course Completion
    'test_score_normalized': 'mean',  # Average Test Score
    'course_certificate_generated': lambda x: (x == True).sum(),  # Count of Generated Certificates
}).reset_index()

# Rename the columns for clarity
employee_performance.rename(columns={
    'time_spent_in_sec': 'total_time_spent',
    'completion_rate': 'average_completion_rate',
    'test_score_normalized': 'average_test_score',
    'course_certificate_generated' : 'total_certificates'
}, inplace=True)

# Display the employee performance data
print(employee_performance.head())


   emp_id  total_time_spent  average_completion_rate  average_test_score  \
0  JMD001           53846.0                 0.616928              0.5000   
1  JMD002           25305.0                 1.000000              0.6000   
2  JMD003           21124.0                 0.557994              0.0000   
3  JMD100           65672.0                 0.590000              0.2675   
4  JMD101           34973.0                 0.075000              0.3500   

   total_certificates  
0                   2  
1                   1  
2                   0  
3                   8  
4                   9  
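
Because `merged_data` was joined against the course-to-learning-path mapping, an enrollment whose course belongs to several learning paths appears on several rows (enrollment 100 appears twice in the merged output), so sums over `merged_data` count it more than once. The sketch below illustrates the effect on hypothetical data, along with one possible way to avoid it by deduplicating on `enroll_id` before summing.

```python
import pandas as pd

# Hypothetical data: course 10 belongs to two learning paths.
enroll = pd.DataFrame({
    "enroll_id": [1, 2],
    "emp_id": ["A", "A"],
    "course_id": [10, 11],
    "time_spent_in_sec": [100, 50],
})
path_map = pd.DataFrame({"course_id": [10, 10, 11], "learning_path_id": [1, 2, 1]})

# The join repeats enrollment 1 once per learning path.
fanned = enroll.merge(path_map, on="course_id", how="left")

# Summing the joined rows counts enrollment 1 twice.
naive_total = fanned.groupby("emp_id")["time_spent_in_sec"].sum()

# Keeping one row per enrollment before summing avoids the double count.
dedup_total = (
    fanned.drop_duplicates(subset="enroll_id")
    .groupby("emp_id")["time_spent_in_sec"]
    .sum()
)

print(naive_total["A"], dedup_total["A"])
```

Averages such as `completion_rate` are affected as well, since repeated enrollments carry extra weight in the mean.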


In [22]:
# Drop duplicates based on emp_id and course_id to ensure each course is counted once per employee
unique_courses = merged_data.drop_duplicates(subset=['emp_id', 'course_id'])
# Create a mapping of difficulty levels to counts, grouping by emp_id and course_difficulty_level
difficulty_distribution = unique_courses.groupby(['emp_id', 'course_difficulty_level']).size().unstack(fill_value=0)

# Rename columns for clarity
difficulty_distribution.columns = [f'completed_courses_{level}' for level in difficulty_distribution.columns]

# Merge difficulty distribution with employee performance data
employee_performance = employee_performance.merge(difficulty_distribution, on='emp_id', how='left')

# Fill NaN values with 0 for completed courses
employee_performance.fillna(0, inplace=True)

# Display the final employee performance data
print(employee_performance.head())
employee_performance.to_csv('./reporting/employee_performance.csv', index=False)


   emp_id  total_time_spent  average_completion_rate  average_test_score  \
0  JMD001           53846.0                 0.616928              0.5000   
1  JMD002           25305.0                 1.000000              0.6000   
2  JMD003           21124.0                 0.557994              0.0000   
3  JMD100           65672.0                 0.590000              0.2675   
4  JMD101           34973.0                 0.075000              0.3500   

   total_certificates  completed_courses_BEGINNER  completed_courses_EXPERT  \
0                   2                           1                         0   
1                   1                           0                         0   
2                   0                           0                         0   
3                   8                           1                         1   
4                   9                           1                         4   

   completed_courses_INTERMEDIATE  
0                               

In [23]:
# Average each metric per employee and learning path
learning_path_performance = (
    merged_data.groupby(['emp_id', 'learning_path_id'])
    .agg({
        'completion_rate': 'mean',  # Average completion percentage
        'success_rate': 'mean',
        'time_spent_in_sec': 'mean',
        'test_score_normalized': 'mean'        
    })
    .reset_index()
)

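# Bounds for min-max normalization of the average time spent
min_time_spent = learning_path_performance['time_spent_in_sec'].min()
max_time_spent = learning_path_performance['time_spent_in_sec'].max()

# Combined score: weighted sum of completion rate (0.2), min-max scaled time spent (0.2),
# success rate (0.15), and normalized test score (0.45); the weights total 1.0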
learning_path_performance['combined_score'] = (
    (learning_path_performance['completion_rate'] * 0.2) + 
    ((
        (learning_path_performance['time_spent_in_sec'] - min_time_spent) /
        (max_time_spent - min_time_spent)
    )* 0.2) + 
    (learning_path_performance['success_rate'] * 0.15) + 
    (learning_path_performance['test_score_normalized'] * 0.45)
)

# Get the learning path with the highest combined score for each employee
best_learning_paths = learning_path_performance.loc[learning_path_performance.groupby('emp_id')['combined_score'].idxmax()]

# Merge with learning path details to get the path names
best_learning_paths = best_learning_paths.merge(cleaned_learning_paths_data[['learning_path_id', 'learning_path_name']], on='learning_path_id', how='left')

# Display the final recommendations
print(best_learning_paths[['emp_id', 'learning_path_id', 'learning_path_name', 'combined_score']])
best_learning_paths.to_csv('./reporting/best_learning_paths.csv', index=False)


     emp_id  learning_path_id        learning_path_name  combined_score
0    JMD001                 4                  Frontend        0.636564
1    JMD002                 3                Full Stack        0.720000
2    JMD003                 3                Full Stack        0.277651
3    JMD100               113                  Big Data        0.573127
4    JMD101               103             Cybersecurity        0.475254
..      ...               ...                       ...             ...
98   JMD195               101           Cloud Computing        0.394146
99   JMD196               108                Blockchain        0.575945
100  JMD197               107   AI and Machine Learning        0.739434
101  JMD198               101           Cloud Computing        0.652786
102  JMD199               109  Internet of Things (IoT)        0.553414

[103 rows x 4 columns]
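
The scoring step above can be restated compactly. The sketch below is an illustration on hypothetical data, not a replacement for the cell above: it uses the same weights, adds a guard for the case where every time value is equal (which would otherwise divide by zero), and picks the top-scoring path per employee. The helper name `min_max_scale` and the `perf` frame are assumptions.

```python
import pandas as pd

def min_max_scale(series):
    """Scale a numeric Series to [0, 1]; return all zeros if every value is equal."""
    value_range = series.max() - series.min()
    if value_range == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

# Hypothetical per-employee, per-path averages.
perf = pd.DataFrame({
    "emp_id": ["A", "A", "B"],
    "learning_path_id": [1, 2, 1],
    "completion_rate": [0.5, 1.0, 0.0],
    "time_spent_in_sec": [100.0, 300.0, 200.0],
    "success_rate": [0.0, 1.0, 0.5],
    "test_score_normalized": [0.2, 0.8, 0.4],
})
perf["time_scaled"] = min_max_scale(perf["time_spent_in_sec"])

# Same weights as the cell above; they total 1.0.
weights = {
    "completion_rate": 0.2,
    "time_scaled": 0.2,
    "success_rate": 0.15,
    "test_score_normalized": 0.45,
}
perf["combined_score"] = sum(perf[col] * w for col, w in weights.items())

# Keep the highest-scoring learning path for each employee.
best = perf.loc[perf.groupby("emp_id")["combined_score"].idxmax()]
print(best[["emp_id", "learning_path_id", "combined_score"]])
```

Keeping the weights in a dictionary makes it easy to confirm that they sum to 1 and to adjust them in one place.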
