<a href="https://colab.research.google.com/github/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/blob/main/Intern_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[To-do] Create a requirements.txt file to preserve the current library versions and ensure replicability!

In [None]:
!pip install pyECLAT numpy pandas plotly

Save the current versions of all libraries installed in the Python environment to a requirements.txt file using pip.

In [None]:
pip freeze > requirements.txt

# [Manual] Dummy data

## Preparing Dataset

### Creating test data

For now, the test dataset contains only random values: each assessment column is drawn from a normal distribution, and the performance rating is picked at random between "Poor" and "Excellent".

In [None]:
import pandas as pd
import numpy as np

columns = ["Performance Appraisal Rating", "Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust"]
num_rows = 300

# Generate dataset w/ each column having a normal distribution where M = 5.5 and std = 2.5
dataset = {column: [np.random.choice(["Poor", "Excellent"]) if column == "Performance Appraisal Rating" else np.random.normal(loc=5.5, scale=2.5) for _ in range(num_rows)] for column in columns}

# Ensure values are within the range 1-10
for column in columns[1:]:
    dataset[column] = [max(1, min(10, abs(int(val)))) for val in dataset[column]]

df_raw = pd.DataFrame(dataset)
df_raw.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability,Collaboration,Communication,Conflict Resolution,Decision Making,Drive,Emotional Balance,Empathy,Influence,Innovation,Learning Mindset,Ownership,Positivity,Problem Solving,Resilience,Self Awareness,Sincerity,Structure,Trust
0,Poor,6,3,6,3,5,4,4,5,5,6,5,3,3,3,7,3,4,4,7
1,Excellent,5,5,7,1,1,6,10,7,4,10,4,2,5,5,5,5,4,2,6
2,Excellent,8,4,4,10,1,7,8,3,7,1,3,4,3,3,6,5,8,1,3
3,Excellent,4,6,5,3,6,2,1,5,9,9,5,2,5,6,6,3,2,6,10
4,Excellent,4,4,3,6,9,9,10,8,2,3,5,1,1,7,6,5,9,4,4
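
Because the dummy data is random, every run of the cell above produces a different table. The sketch below shows one way to make it repeatable, assuming a seeded NumPy generator. The seed value, the shortened `attribute_names` list, and the name `df_demo` are illustrative choices, not part of the notebook. It also rounds to the nearest integer before clipping, whereas the cell above truncates with `int()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seed value is an arbitrary, hypothetical choice

attribute_names = ["Adaptability", "Collaboration", "Communication"]  # shortened for illustration
num_rows = 300

# Draw from N(5.5, 2.5), round to the nearest integer, then clip to the 1-10 scale
scores = rng.normal(loc=5.5, scale=2.5, size=(num_rows, len(attribute_names)))
scores = np.clip(np.rint(scores), 1, 10).astype(int)

df_demo = pd.DataFrame(scores, columns=attribute_names)
# Random binary rating, mirroring the "Poor"/"Excellent" choice used above
df_demo.insert(0, "Performance Appraisal Rating", rng.choice(["Poor", "Excellent"], size=num_rows))
```

Re-running this block with the same seed yields the same table, which makes downstream results easier to compare between sessions.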


### Add `Total Score` column

In [None]:
# Calculate the total score for each person.
df = df_raw.copy()
attributes = ["Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust"]
df['Total Score'] = df[attributes].sum(axis=1)
df.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability,Collaboration,Communication,Conflict Resolution,Decision Making,Drive,Emotional Balance,Empathy,Influence,...,Learning Mindset,Ownership,Positivity,Problem Solving,Resilience,Self Awareness,Sincerity,Structure,Trust,Total Score
0,Poor,6,3,6,3,5,4,4,5,5,...,5,3,3,3,7,3,4,4,7,86
1,Excellent,5,5,7,1,1,6,10,7,4,...,4,2,5,5,5,5,4,2,6,94
2,Excellent,8,4,4,10,1,7,8,3,7,...,3,4,3,3,6,5,8,1,3,89
3,Excellent,4,6,5,3,6,2,1,5,9,...,5,2,5,6,6,3,2,6,10,95
4,Excellent,4,4,3,6,9,9,10,8,2,...,5,1,1,7,6,5,9,4,4,100


### Distributions

In [None]:
import plotly.subplots as sp
import plotly.graph_objects as go

# update columns to include the newly added 'Total Score'
columns = ["Performance Appraisal Rating", "Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust", "Total Score"]

# Number of rows and columns for the subplot grid
nrows = 5
ncols = 4

fig = sp.make_subplots(rows=nrows, cols=ncols)

# For each column except "Performance Appraisal Rating"
for i, column in enumerate(columns[1:], start=1):
    # Calculate row and column indices for the current subplot
    row = (i-1) // ncols + 1
    col = (i-1) % ncols + 1

    # Create histogram
    hist = go.Histogram(x=df[column], nbinsx=10, name=column)

    # Add histogram to subplot
    fig.add_trace(hist, row=row, col=col)

fig.update_layout(height=600, width=1000, title_text="Distribution of Each Assessment Result")
fig.show()

### Categorizing scores

We will categorize the total score according to a bell curve, where:

`Total Score low` = scores from (mean - 3std) to (mean - std)

`Total Score med` = scores from (mean - std) to (mean + std)

`Total Score high` = scores from (mean + std) to (mean + 3std)

![bell curve](https://cdn-images-1.medium.com/max/1600/1*U28PNb2fL8lvN6NRoEtf6Q.jpeg)
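
A minimal sketch of this binning on a small, made-up Series of totals (`scores` and `bin_total_score` are hypothetical names, not part of the notebook). Note one deliberate variation: with edges at mean ± 3 std, `pd.cut` returns `NaN` for any total outside that range, so this sketch uses open-ended outer edges to keep every total labelled. That substitution is an assumption about what is desirable, not what the notebook's cell does.

```python
import numpy as np
import pandas as pd

def bin_total_score(scores: pd.Series) -> pd.Series:
    """Label totals as low/med/high relative to mean +/- one standard deviation."""
    mean, std = scores.mean(), scores.std()
    # Open-ended outer edges: nothing falls outside the bins
    bins = [-np.inf, mean - std, mean + std, np.inf]
    labels = ["Total Score low", "Total Score med", "Total Score high"]
    return pd.cut(scores, bins=bins, labels=labels)

scores = pd.Series([60, 95, 100, 105, 110, 190])  # illustrative totals only
binned = bin_total_score(scores)
print(binned.tolist())
```

Here the mean is 110 and the sample standard deviation is about 43, so only 60 is labelled low and only 190 is labelled high.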

In [None]:
# Let's create a function to map each score into a new bin.
def get_data_categorical(data):
    cat = data.copy()  # work on a copy of the DataFrame passed in, not the global df
    for i in cat.columns:
        if i == 'Performance Appraisal Rating':
            cat[i] = cat[i].map({'Excellent': 'Outstanding', 'Poor': 'Needs some improvement'})
        elif i == 'Total Score':
            mean = cat[i].mean()
            std = cat[i].std()
            bins = [mean - 3*std, mean - std, mean + std, mean + 3*std]
            labels = [i+' low', i+' med', i+' high']
            cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
            cat = cat.drop([i], axis=1)
        else:
            bins = [0, 4, 7, 10] # defining the boundary for each bin
            labels = [i+' low', i+' med', i+' high']
            cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
            cat = cat.drop([i], axis=1)
    return cat

dataset = get_data_categorical(df)
dataset.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability_Cat,Collaboration_Cat,Communication_Cat,Conflict Resolution_Cat,Decision Making_Cat,Drive_Cat,Emotional Balance_Cat,Empathy_Cat,Influence_Cat,...,Learning Mindset_Cat,Ownership_Cat,Positivity_Cat,Problem Solving_Cat,Resilience_Cat,Self Awareness_Cat,Sincerity_Cat,Structure_Cat,Trust_Cat,Total Score_Cat
0,Needs some improvement,Adaptability med,Collaboration low,Communication med,Conflict Resolution low,Decision Making med,Drive low,Emotional Balance low,Empathy med,Influence med,...,Learning Mindset med,Ownership low,Positivity low,Problem Solving low,Resilience med,Self Awareness low,Sincerity low,Structure low,Trust med,Total Score med
1,Outstanding,Adaptability med,Collaboration med,Communication med,Conflict Resolution low,Decision Making low,Drive med,Emotional Balance high,Empathy med,Influence low,...,Learning Mindset low,Ownership low,Positivity med,Problem Solving med,Resilience med,Self Awareness med,Sincerity low,Structure low,Trust med,Total Score med
2,Outstanding,Adaptability high,Collaboration low,Communication low,Conflict Resolution high,Decision Making low,Drive med,Emotional Balance high,Empathy low,Influence med,...,Learning Mindset low,Ownership low,Positivity low,Problem Solving low,Resilience med,Self Awareness med,Sincerity high,Structure low,Trust low,Total Score med
3,Outstanding,Adaptability low,Collaboration med,Communication med,Conflict Resolution low,Decision Making med,Drive low,Emotional Balance low,Empathy med,Influence high,...,Learning Mindset med,Ownership low,Positivity med,Problem Solving med,Resilience med,Self Awareness low,Sincerity low,Structure med,Trust high,Total Score med
4,Outstanding,Adaptability low,Collaboration low,Communication low,Conflict Resolution med,Decision Making high,Drive high,Emotional Balance high,Empathy high,Influence low,...,Learning Mindset med,Ownership low,Positivity low,Problem Solving med,Resilience med,Self Awareness med,Sincerity high,Structure low,Trust low,Total Score med
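
To make the per-attribute boundaries concrete, the sketch below applies the same `[0, 4, 7, 10]` edges to a handful of hand-picked scores. The Series `raw` and its values are illustrative assumptions. It shows that scores 1 to 4 map to low, 5 to 7 to med, and 8 to 10 to high.

```python
import pandas as pd

raw = pd.Series([1, 4, 5, 7, 8, 10], name="Drive")  # hand-picked example scores
labels = ["Drive low", "Drive med", "Drive high"]

# Right-closed intervals: [0, 4], (4, 7], (7, 10]
binned = pd.cut(raw, bins=[0, 4, 7, 10], labels=labels, include_lowest=True)
print(binned.tolist())
# ['Drive low', 'Drive low', 'Drive med', 'Drive med', 'Drive high', 'Drive high']
```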


# [Manual] Pilot data

## Data pre-processing

### Loading the data

In [12]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
# Change the path below to the folder where 'Sample.csv' is stored on your Drive
%cd '/content/drive/MyDrive/ECAs/WhiteBox HR/'
pd.set_option('display.max_columns', None)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/ECAs/WhiteBox HR


In [13]:
import pandas as pd
df_raw = pd.read_csv('Sample.csv')
df_raw = df_raw.drop(df_raw.columns[0], axis=1) # drop "performance score"
df_raw.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability,Collaboration,Communication,Conflict Resolution,Decision Making,Drive,Emotional Balance,Empathy,Influence,Innovation,Learning Mindset,Ownership,Positivity,Problem Solving,Resilience,Self Awareness,Sincerity,Structure,Trust
0,Excellent,5,6,5,3,4,10,3,2,6,7,4,3,5,2,5,4,6,8,4
1,Excellent,5,7,7,5,7,8,5,3,5,5,3,6,6,4,5,5,8,7,5
2,Excellent,8,6,8,6,6,8,7,5,7,9,6,7,7,5,6,7,7,9,6
3,Excellent,7,8,7,8,6,9,8,9,8,8,4,8,10,7,8,6,6,9,6
4,Excellent,9,10,9,7,7,8,9,4,8,6,7,8,7,5,6,8,9,8,7


### Add `Total Score` column

In [14]:
# Calculate the total score for each person.
df = df_raw.copy()
attributes = ["Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust"]
df['Total Score'] = df[attributes].sum(axis=1)
df.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability,Collaboration,Communication,Conflict Resolution,Decision Making,Drive,Emotional Balance,Empathy,Influence,Innovation,Learning Mindset,Ownership,Positivity,Problem Solving,Resilience,Self Awareness,Sincerity,Structure,Trust,Total Score
0,Excellent,5,6,5,3,4,10,3,2,6,7,4,3,5,2,5,4,6,8,4,92
1,Excellent,5,7,7,5,7,8,5,3,5,5,3,6,6,4,5,5,8,7,5,106
2,Excellent,8,6,8,6,6,8,7,5,7,9,6,7,7,5,6,7,7,9,6,130
3,Excellent,7,8,7,8,6,9,8,9,8,8,4,8,10,7,8,6,6,9,6,142
4,Excellent,9,10,9,7,7,8,9,4,8,6,7,8,7,5,6,8,9,8,7,142


### Distributions

In [15]:
import plotly.subplots as sp
import plotly.graph_objects as go

# update columns to include the newly added 'Total Score'
columns = ["Performance Appraisal Rating", "Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust", "Total Score"]

# define the grid
nrows = 5
ncols = 4

fig = sp.make_subplots(rows=nrows, cols=ncols)

# For each column except "Performance Appraisal Rating"
for i, column in enumerate(columns[1:], start=1):
    # Calculate row and column indices for the current subplot
    row = (i-1) // ncols + 1
    col = (i-1) % ncols + 1

    hist = go.Histogram(x=df[column], nbinsx=10, name=column)     # Creates histogram

    fig.add_trace(hist, row=row, col=col)     # Add histogram to subplot

fig.update_layout(height=600, width=1200, title_text="Distribution of Each Assessment Result")
fig.show()

### Categorizing scores

We will categorize the total score according to a bell curve, where:

`Total Score low` = scores from (mean - 3std) to (mean - std)

`Total Score med` = scores from (mean - std) to (mean + std)

`Total Score high` = scores from (mean + std) to (mean + 3std)

![bell curve](https://cdn-images-1.medium.com/max/1600/1*U28PNb2fL8lvN6NRoEtf6Q.jpeg)

In [16]:
# Let's create a function to map each score into a new bin.
def get_data_categorical(data):
    cat = data.copy()  # work on a copy of the DataFrame passed in, not the global df
    for i in cat.columns:
        if i == 'Performance Appraisal Rating':
            cat[i] = cat[i].map({'Outstanding': 'Excellent Performance', 'Excellent':'Excellent Performance', 'Good': 'Good Performance', 'Needs some improvement': 'Poor Performance'}) # map from old name -> new name
        elif i == 'Total Score':
            mean = cat[i].mean()
            std = cat[i].std()
            bins = [mean - 3*std, mean - std, mean + std, mean + 3*std]
            labels = [i+' low', i+' med', i+' high']
            cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
            cat = cat.drop([i], axis=1)
        else:
            bins = [0, 4, 7, 10] # defining the boundary for each bin
            labels = [i+' low', i+' med', i+' high']
            cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
            cat = cat.drop([i], axis=1)
    return cat

dataset = get_data_categorical(df)
dataset.head()

Unnamed: 0,Performance Appraisal Rating,Adaptability_Cat,Collaboration_Cat,Communication_Cat,Conflict Resolution_Cat,Decision Making_Cat,Drive_Cat,Emotional Balance_Cat,Empathy_Cat,Influence_Cat,Innovation_Cat,Learning Mindset_Cat,Ownership_Cat,Positivity_Cat,Problem Solving_Cat,Resilience_Cat,Self Awareness_Cat,Sincerity_Cat,Structure_Cat,Trust_Cat,Total Score_Cat
0,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution low,Decision Making low,Drive high,Emotional Balance low,Empathy low,Influence med,Innovation med,Learning Mindset low,Ownership low,Positivity med,Problem Solving low,Resilience med,Self Awareness low,Sincerity med,Structure high,Trust low,Total Score low
1,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy low,Influence med,Innovation med,Learning Mindset low,Ownership med,Positivity med,Problem Solving low,Resilience med,Self Awareness med,Sincerity high,Structure med,Trust med,Total Score med
2,Excellent Performance,Adaptability high,Collaboration med,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy med,Influence med,Innovation high,Learning Mindset med,Ownership med,Positivity med,Problem Solving med,Resilience med,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
3,Excellent Performance,Adaptability med,Collaboration high,Communication med,Conflict Resolution high,Decision Making med,Drive high,Emotional Balance high,Empathy high,Influence high,Innovation high,Learning Mindset low,Ownership high,Positivity high,Problem Solving med,Resilience high,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
4,Excellent Performance,Adaptability high,Collaboration high,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance high,Empathy low,Influence high,Innovation med,Learning Mindset med,Ownership high,Positivity med,Problem Solving med,Resilience med,Self Awareness high,Sincerity high,Structure high,Trust med,Total Score med


### Formatting into a transaction dataset

We need each row to represent an individual and each column to represent a score.
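
As a rough illustration of the transaction view, the sketch below turns a tiny categorical table into one list of items per person. The table `wide` and its values are made up for the example, and a missing score is simply left out of that person's transaction.

```python
import pandas as pd

# Toy categorical table; column names and values are illustrative only
wide = pd.DataFrame({
    "Performance Appraisal Rating": ["Good Performance", "Poor Performance"],
    "Drive_Cat": ["Drive high", "Drive low"],
    "Trust_Cat": ["Trust med", None],  # a missing score
})

# One transaction per person: the list of non-missing items in that row
transactions = [row.dropna().tolist() for _, row in wide.iterrows()]
print(transactions)
# [['Good Performance', 'Drive high', 'Trust med'], ['Poor Performance', 'Drive low']]
```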

In [17]:
# Reset the row index to a clean 0..n-1 range
dataset = dataset.reset_index(drop=True)

# Set column headers to be an index from 0 to n
dataset.columns = range(len(dataset.columns))

dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution low,Decision Making low,Drive high,Emotional Balance low,Empathy low,Influence med,Innovation med,Learning Mindset low,Ownership low,Positivity med,Problem Solving low,Resilience med,Self Awareness low,Sincerity med,Structure high,Trust low,Total Score low
1,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy low,Influence med,Innovation med,Learning Mindset low,Ownership med,Positivity med,Problem Solving low,Resilience med,Self Awareness med,Sincerity high,Structure med,Trust med,Total Score med
2,Excellent Performance,Adaptability high,Collaboration med,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy med,Influence med,Innovation high,Learning Mindset med,Ownership med,Positivity med,Problem Solving med,Resilience med,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
3,Excellent Performance,Adaptability med,Collaboration high,Communication med,Conflict Resolution high,Decision Making med,Drive high,Emotional Balance high,Empathy high,Influence high,Innovation high,Learning Mindset low,Ownership high,Positivity high,Problem Solving med,Resilience high,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
4,Excellent Performance,Adaptability high,Collaboration high,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance high,Empathy low,Influence high,Innovation med,Learning Mindset med,Ownership high,Positivity med,Problem Solving med,Resilience med,Self Awareness high,Sincerity high,Structure high,Trust med,Total Score med


In [18]:
from pyECLAT import ECLAT
# Load df to ECLAT class
eclat = ECLAT(data=dataset)

# Get the binary representation of the df
eclat.df_bin.head()

Unnamed: 0,Emotional Balance med,Adaptability low,Drive med,Innovation med,Sincerity med,Trust high,Poor Performance,Communication low,Collaboration med,Innovation low,Emotional Balance low,Learning Mindset low,Positivity high,Learning Mindset med,Ownership low,Resilience high,Communication high,Sincerity low,Problem Solving med,Adaptability high,Empathy med,Decision Making med,Excellent Performance,Resilience med,Emotional Balance high,Ownership high,Trust low,Ownership med,Drive high,Conflict Resolution med,Trust med,Total Score low,Good Performance,Positivity low,Drive low,Sincerity high,Influence high,Structure med,Innovation high,Decision Making high,Self Awareness med,Self Awareness high,Collaboration low,Adaptability med,Decision Making low,Communication med,Influence low,Structure low,Total Score med,Resilience low,Empathy high,Empathy low,Structure high,Learning Mindset high,Total Score high,Problem Solving high,Problem Solving low,Influence med,Positivity med,Self Awareness low,Collaboration high,Conflict Resolution low,Conflict Resolution high
0,0,0,0,1,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,1,1,1,1,0,1,0
1,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0
2,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,1
4,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0


## Visualizing the frequent itemset

Let's look at the distribution of each item (e.g., "bread"). The higher an item's frequency, the more likely it is to appear in an itemset after we apply the ECLAT algorithm (e.g., {"bread","milk"}).

If we see that items rarely appear, we should not expect ECLAT to find frequent itemsets later on.

**NOTE:** *The methodology for applying the ECLAT algorithm in this section is based on this [tutorial](https://hands-on.cloud/implementation-of-eclat-algorithm-using-python/).*
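
The link between item frequency and support can be sketched on a toy one-hot matrix. The frame `df_bin_demo` and its values are illustrative assumptions, standing in for `eclat.df_bin`. An item's support is the fraction of transactions that contain it, and a pair can never have higher support than its rarer member.

```python
import pandas as pd

# Toy one-hot matrix: rows are people, columns are items (illustrative values)
df_bin_demo = pd.DataFrame({
    "Drive high": [1, 1, 0, 1],
    "Trust med": [1, 0, 0, 1],
    "Poor Performance": [0, 0, 1, 0],
})

item_counts = df_bin_demo.sum(axis=0)           # transactions containing each item
item_support = item_counts / len(df_bin_demo)   # fraction of transactions containing each item
print(item_support.sort_values(ascending=False))
```

An item with support below the chosen minimum cannot be part of any frequent itemset, so this check gives an early view of what ECLAT can return.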

In [19]:
# row by row, count the number of non-NaN cells present
count = eclat.df_bin.count(axis=1)
print(count)

0      63
1      63
2      63
3      63
4      63
       ..
415    63
416    63
417    63
418    63
419    63
Length: 420, dtype: int64


In [20]:
pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 600)

In [21]:
# for each column, are there any NaN values present?
eclat.df_bin.isna().sum()

Emotional Balance med       0
Adaptability low            0
Drive med                   0
Innovation med              0
Sincerity med               0
Trust high                  0
Poor Performance            0
Communication low           0
Collaboration med           0
Innovation low              0
Emotional Balance low       0
Learning Mindset low        0
Positivity high             0
Learning Mindset med        0
Ownership low               0
Resilience high             0
Communication high          0
Sincerity low               0
Problem Solving med         0
Adaptability high           0
Empathy med                 0
Decision Making med         0
Excellent Performance       0
Resilience med              0
Emotional Balance high      0
Ownership high              0
Trust low                   0
Ownership med               0
Drive high                  0
Conflict Resolution med     0
Trust med                   0
Total Score low             0
Good Performance            0
Positivity

In [22]:
# count how many transactions contain each item (column sums of the binary matrix)
items_total = eclat.df_bin.astype(int).sum(axis=0)
items_total

Emotional Balance med       220
Adaptability low             23
Drive med                   171
Innovation med              230
Sincerity med               223
Trust high                   94
Poor Performance             18
Communication low            23
Collaboration med           186
Innovation low               36
Emotional Balance low        32
Learning Mindset low        117
Positivity high             134
Learning Mindset med        224
Ownership low                35
Resilience high             117
Communication high          177
Sincerity low                42
Problem Solving med         214
Adaptability high           221
Empathy med                 232
Decision Making med         197
Excellent Performance        14
Resilience med              250
Emotional Balance high      168
Ownership high              198
Trust low                    92
Ownership med               187
Drive high                  201
Conflict Resolution med     269
Trust med                   234
Total Sc

In [23]:
# count items in each row (i.e., the number of items recorded for each individual)
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)
items_per_transaction

0      21
1      21
2      21
3      21
4      21
5      20
6      20
7      21
8      21
9      21
10     21
11     21
12     21
13     21
14     21
15     21
16     21
17     21
18     21
19     21
20     21
21     21
22     21
23     21
24     21
25     21
26     21
27     21
28     21
29     21
30     21
31     21
32     21
33     21
34     21
35     21
36     21
37     21
38     21
39     21
40     21
41     21
42     21
43     21
44     21
45     21
46     21
47     21
48     21
49     21
50     21
51     21
52     21
53     21
54     21
55     21
56     21
57     21
58     21
59     21
60     21
61     21
62     21
63     21
64     21
65     21
66     21
67     21
68     21
69     21
70     21
71     21
72     21
73     21
74     21
75     21
76     21
77     21
78     21
79     21
80     21
81     21
82     21
83     21
84     21
85     21
86     21
87     21
88     21
89     21
90     21
91     21
92     21
93     21
94     21
95     21
96     21
97     21
98     21
99     21


Use these Series to visualize the item distribution.

In [24]:
import pandas as pd
# Load the per-item transaction counts into a DataFrame
# (named item_counts_df so the scores DataFrame `df` is not overwritten)
item_counts_df = pd.DataFrame({'items': items_total.index, 'transactions': items_total.values})
# Sort a copy by frequency for display
df_table = item_counts_df.sort_values("transactions", ascending=False)
# Top 5 most frequent items
df_table.head(5).style.background_gradient(cmap='Blues')

Unnamed: 0,items,transactions
32,Good Performance,388
48,Total Score med,285
29,Conflict Resolution med,269
58,Positivity med,261
52,Structure high,257


## Apply ECLAT algo

ECLAT finds frequent itemsets: sets of items that appear together in a significant number of transactions. It does not produce association rules.

### Set parameters

**Minimum support** – the minimum fraction of transactions in which an itemset must appear (e.g., 0.01 for 1%)

**Minimum combinations** – the minimum number of items in an itemset

**Maximum combinations** – the maximum number of items in an itemset
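
To show what these parameters control, here is a small pure-Python sketch of the ECLAT idea for two-item itemsets only. It stores, for each item, the set of transaction ids that contain it, then intersects those sets to get pair supports. The function `eclat_pairs` and the `demo` transactions are hypothetical, and this is not how pyECLAT is implemented internally.

```python
from itertools import combinations

def eclat_pairs(transactions, min_support):
    """Return {frozenset(pair): support} for item pairs meeting min_support."""
    n = len(transactions)
    # Vertical layout: map each item to the ids of the transactions containing it
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}
    for a, b in combinations(sorted(tidsets), 2):
        # Support of the pair = size of the tid-set intersection / number of transactions
        support = len(tidsets[a] & tidsets[b]) / n
        if support >= min_support:
            frequent[frozenset((a, b))] = support
    return frequent

demo = [
    ["Good Performance", "Drive high", "Trust med"],
    ["Good Performance", "Drive high"],
    ["Poor Performance", "Drive low"],
    ["Good Performance", "Trust med"],
]
pairs = eclat_pairs(demo, min_support=0.5)
print(pairs)
```

With `min_support=0.5`, only the two pairs that occur in at least two of the four transactions survive. Lowering the threshold admits rarer pairs at the cost of a longer result list.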

In [25]:
# minimum fraction of transactions an itemset has to appear in to be considered frequent.
min_support = 0.01 # the set must appear in at least 1% of the whole dataset

min_combination = 2  # we don't want single-item itemsets
# max_combination = max(items_per_transaction)  # alternative: allow itemsets as large as the longest transaction
max_combination = 2  # itemsets must contain no more than 2 items

### Find frequent itemset for whole dataset

In [26]:
whole_dataset = ECLAT(data=dataset) # initialize the algo w our dataset

rule_indices, rule_supports = whole_dataset.fit(min_support=min_support,
                                                 min_combination=min_combination,
                                                 max_combination=max_combination,
                                                 separator=' → ',
                                                 verbose=True)

Combination 2 by 2


1953it [00:25, 77.09it/s] 


In [27]:
result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

Unnamed: 0,Item,Support
1184,Good Performance → Total Score med,0.626190
1095,Conflict Resolution med → Good Performance,0.585714
1194,Good Performance → Positivity med,0.573810
1188,Good Performance → Structure high,0.552381
910,Resilience med → Good Performance,0.545238
...,...,...
1323,Decision Making high → Problem Solving low,0.011905
1326,Decision Making high → Self Awareness low,0.011905
1350,Self Awareness high → Resilience low,0.011905
887,Excellent Performance → Ownership high,0.011905


### Subsets

Filter to only keep sets that have an item relating to performance (poor, good, excellent)
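
The three per-label filters can also be written as a single pass. The sketch below assumes the `' → '` separator used in `fit()` above and a made-up `demo` frame in place of `result`. It splits each itemset and keeps it if any part is exactly one of the performance labels.

```python
import pandas as pd

demo = pd.DataFrame({
    "Item": [
        "Drive med → Poor Performance",
        "Drive high → Trust med",
        "Good Performance → Trust med",
    ],
    "Support": [0.014, 0.30, 0.55],  # illustrative values
})

labels = ["Poor Performance", "Good Performance", "Excellent Performance"]
# True when any part of the itemset is exactly one of the performance labels
mask = demo["Item"].apply(lambda itemset: any(label in itemset.split(" → ") for label in labels))
performance_sets = demo[mask]
print(performance_sets)
```

Matching on whole split parts avoids accidental hits on attribute names that merely contain a word such as "Good".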

In [28]:
mask = result['Item'].apply(lambda x: any(word in x for word in ["Poor"])) # returns True if word is found
df_filtered = result[mask]

df_filtered

Unnamed: 0,Item,Support
5,Emotional Balance med → Poor Performance,0.028571
96,Drive med → Poor Performance,0.014286
151,Innovation med → Poor Performance,0.014286
206,Sincerity med → Poor Performance,0.016667
296,Poor Performance → Collaboration med,0.019048
297,Poor Performance → Learning Mindset low,0.014286
298,Poor Performance → Positivity high,0.014286
299,Poor Performance → Learning Mindset med,0.019048
300,Poor Performance → Resilience high,0.011905
301,Poor Performance → Communication high,0.021429


In [29]:
mask = result['Item'].apply(lambda x: any(word in x for word in ["Good"]))
df_filtered = result[mask]

df_filtered

Unnamed: 0,Item,Support
29,Emotional Balance med → Good Performance,0.483333
77,Adaptability low → Good Performance,0.05
120,Drive med → Good Performance,0.388095
176,Innovation med → Good Performance,0.511905
231,Sincerity med → Good Performance,0.495238
276,Trust high → Good Performance,0.209524
346,Communication low → Good Performance,0.052381
383,Collaboration med → Good Performance,0.404762
423,Innovation low → Good Performance,0.078571
459,Emotional Balance low → Good Performance,0.071429


In [30]:
mask = result['Item'].apply(lambda x: any(word in x for word in ["Excellent"]))
df_filtered = result[mask]

df_filtered

Unnamed: 0,Item,Support
20,Emotional Balance med → Excellent Performance,0.011905
166,Innovation med → Excellent Performance,0.021429
221,Sincerity med → Excellent Performance,0.019048
373,Collaboration med → Excellent Performance,0.019048
489,Learning Mindset low → Excellent Performance,0.011905
573,Learning Mindset med → Excellent Performance,0.021429
674,Communication high → Excellent Performance,0.014286
737,Problem Solving med → Excellent Performance,0.02381
775,Adaptability high → Excellent Performance,0.016667
809,Empathy med → Excellent Performance,0.019048


### Find rules for each performance group

In [31]:
# separate the rules for each performance rating
excellent_rules = result[result['Item'].str.contains('Excellent Performance')]
good_rules = result[result['Item'].str.contains('Good Performance')]
poor_rules = result[result['Item'].str.contains('Poor Performance')]

# find the items that are in 'Good Performance' or 'Poor Performance' itemsets
good_poor_items = set(good_rules['Item'].str.split(' → ').str[0]).union(set(poor_rules['Item'].str.split(' → ').str[0]))

# filter 'Excellent Performance' itemsets to only include those where the other item is not in 'good_poor_items'
excellent_rules = excellent_rules[~excellent_rules['Item'].str.split(' → ').str[0].isin(good_poor_items)]
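
The cell above compares only the first element of each pair (`.str[0]`), so when the performance label comes first, the label itself is collected instead of the attribute. A position-independent variant could look like the sketch below. It assumes two-item itemsets and the `' → '` separator; `other_items` and the `demo` frame are hypothetical names used for illustration.

```python
import pandas as pd

SEP = " → "  # same separator passed to fit() in the cells above

def other_items(rules: pd.DataFrame, label: str) -> set:
    """Collect the non-label item of every itemset that contains `label`."""
    found = set()
    for itemset in rules["Item"]:
        parts = itemset.split(SEP)
        if label in parts:
            found.update(p for p in parts if p != label)
    return found

demo = pd.DataFrame({
    "Item": [
        "Drive med → Good Performance",
        "Good Performance → Trust med",
        "Excellent Performance → Trust med",
        "Excellent Performance → Influence high",
    ],
    "Support": [0.39, 0.55, 0.02, 0.01],  # illustrative values
})

shared = other_items(demo, "Good Performance") | other_items(demo, "Poor Performance")
unique_to_excellent = other_items(demo, "Excellent Performance") - shared
print(unique_to_excellent)
```

In this toy data only "Influence high" co-occurs with "Excellent Performance" without also co-occurring with a good or poor rating.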


In [32]:
print(good_poor_items)

{'Emotional Balance med', 'Sincerity med', 'Drive med', 'Innovation med', 'Trust high', 'Adaptability low', 'Poor Performance', 'Communication low', 'Collaboration med', 'Innovation low', 'Emotional Balance low', 'Learning Mindset low', 'Positivity high', 'Learning Mindset med', 'Sincerity low', 'Resilience high', 'Communication high', 'Ownership low', 'Problem Solving med', 'Adaptability high', 'Empathy med', 'Decision Making med', 'Resilience med', 'Emotional Balance high', 'Ownership high', 'Total Score low', 'Trust low', 'Ownership med', 'Drive high', 'Conflict Resolution med', 'Trust med', 'Good Performance'}


In [33]:
print(excellent_rules)

                                                Item   Support
885           Excellent Performance → Resilience med  0.023810
886   Excellent Performance → Emotional Balance high  0.019048
887           Excellent Performance → Ownership high  0.011905
888            Excellent Performance → Ownership med  0.019048
889               Excellent Performance → Drive high  0.028571
890  Excellent Performance → Conflict Resolution med  0.026190
891                Excellent Performance → Trust med  0.021429
892           Excellent Performance → Sincerity high  0.011905
893           Excellent Performance → Influence high  0.014286
894       Excellent Performance → Self Awareness med  0.026190
895         Excellent Performance → Adaptability med  0.014286
896        Excellent Performance → Communication med  0.019048
897          Excellent Performance → Total Score med  0.030952
898           Excellent Performance → Structure high  0.026190
899            Excellent Performance → Influence med  0

# [Automated] Pilot data

## Data preprocessing

In [1]:
# get data
!wget https://raw.githubusercontent.com/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/main/Sample.csv
# get the Python environment
!wget https://raw.githubusercontent.com/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/main/requirements.txt

!pip install -r requirements.txt

df_raw = 'https://raw.githubusercontent.com/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/main/Sample.csv'

--2023-12-28 06:53:05--  https://raw.githubusercontent.com/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/main/Sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21341 (21K) [text/plain]
Saving to: ‘Sample.csv’


2023-12-28 06:53:05 (53.0 MB/s) - ‘Sample.csv’ saved [21341/21341]

--2023-12-28 06:53:05--  https://raw.githubusercontent.com/nooralteneiji/Machine_Learning-Candidate_Performance-Prototype/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9674 (9.4K) [text/plain]
Saving to: ‘re

In [21]:
import pandas as pd

def preprocess_data(file_path, drop_column_index, attributes):
    # Load the data
    df_raw = pd.read_csv(file_path)

    # Drop unnecessary columns
    df_raw = df_raw.drop(df_raw.columns[drop_column_index], axis=1)

    # Calculate total score
    df = df_raw.copy()
    df['Total Score'] = df[attributes].sum(axis=1)

    # Binning the scores and renaming performance appraisal rating
    def get_data_categorical(data):
        cat = data.copy()  # work on a copy of the DataFrame passed in
        for i in cat.columns:
            if i == 'Performance Appraisal Rating':
                cat[i] = cat[i].map({'Outstanding': 'Excellent Performance', 'Excellent':'Excellent Performance', 'Good': 'Good Performance', 'Needs some improvement': 'Poor Performance'})
            elif i == 'Total Score':
                mean = cat[i].mean()
                std = cat[i].std()
                bins = [mean - 3*std, mean - std, mean + std, mean + 3*std]
                labels = [i+' low', i+' med', i+' high']
                cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
                cat = cat.drop([i], axis=1)
            else:
                bins = [0, 4, 7, 10]
                labels = [i+' low', i+' med', i+' high']
                cat[i+'_Cat'] = pd.cut(cat[i], bins=bins, labels=labels, include_lowest=True)
                cat = cat.drop([i], axis=1)
        return cat

    dataset = get_data_categorical(df)

    # Resetting the index to a clean 0..n-1 range
    dataset = dataset.reset_index(drop=True)

    # Renaming the columns
    dataset.columns = range(len(dataset.columns))

    return dataset

In [22]:
drop_column_index = 0
attributes = ["Adaptability", "Collaboration", "Communication", "Conflict Resolution", "Decision Making", "Drive", "Emotional Balance", "Empathy", "Influence", "Innovation", "Learning Mindset", "Ownership", "Positivity", "Problem Solving", "Resilience", "Self Awareness", "Sincerity", "Structure", "Trust"]

preprocessed_data = preprocess_data(df_raw, drop_column_index, attributes)
preprocessed_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution low,Decision Making low,Drive high,Emotional Balance low,Empathy low,Influence med,...,Learning Mindset low,Ownership low,Positivity med,Problem Solving low,Resilience med,Self Awareness low,Sincerity med,Structure high,Trust low,Total Score low
1,Excellent Performance,Adaptability med,Collaboration med,Communication med,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy low,Influence med,...,Learning Mindset low,Ownership med,Positivity med,Problem Solving low,Resilience med,Self Awareness med,Sincerity high,Structure med,Trust med,Total Score med
2,Excellent Performance,Adaptability high,Collaboration med,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance med,Empathy med,Influence med,...,Learning Mindset med,Ownership med,Positivity med,Problem Solving med,Resilience med,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
3,Excellent Performance,Adaptability med,Collaboration high,Communication med,Conflict Resolution high,Decision Making med,Drive high,Emotional Balance high,Empathy high,Influence high,...,Learning Mindset low,Ownership high,Positivity high,Problem Solving med,Resilience high,Self Awareness med,Sincerity med,Structure high,Trust med,Total Score med
4,Excellent Performance,Adaptability high,Collaboration high,Communication high,Conflict Resolution med,Decision Making med,Drive high,Emotional Balance high,Empathy low,Influence high,...,Learning Mindset med,Ownership high,Positivity med,Problem Solving med,Resilience med,Self Awareness high,Sincerity high,Structure high,Trust med,Total Score med


## Generating association rules

In [None]:
## Model