## <center>Datahouse Take Home Assessment</center>

## IMPORTANT NOTE:

For this assessment, I take one input in the form of JSON and produce one output in the form of JSON. My job is to formulate a "compatibility score" in the range [0, 1] for each applicant as I see fit.

The assignment only tells us that there is a single JSON input containing a team and applicants; it does not tell us the names of these keys or the order in which they will be presented. Moreover, we are not told which information about each person will be present.

To simplify my solution while attempting to account for general scenarios, I will make the following assumptions:

    1.) The JSON string will contain a "team" object first (exact name match) and an "applicants" object second (exact name match).

    2.) Each of these objects will hold a list of objects representing the individuals belonging to that group.

    3.) For each individual, there will be a "name" string and an "attributes" object filled with integer characteristics (on a 1-10 scale) that we are interested in observing.

    4.) The "attributes" object will contain the same attributes and be in the same order between both the "team" and "applicants" objects.

These assumptions allow for a concise solution while still being broad enough to incorporate any additional attributes the company decides to value in the future. I look forward to discussing these assumptions further with the team.
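
The four assumptions can also be checked in code before any scoring happens. The following is a minimal sketch rather than part of the submitted solution; `validate_input` is a hypothetical helper name, and it relies on Python dictionaries preserving the key order of the parsed JSON.

```python
def validate_input(data):
    """Return a list of problems found; an empty list means the assumptions hold.

    Hypothetical helper: the checks mirror assumptions 1-4 listed above.
    """
    problems = []
    # Assumption 1: "team" comes first and "applicants" second
    if list(data)[:2] != ["team", "applicants"]:
        problems.append('expected "team" then "applicants" as the first two keys')
        return problems

    reference_attributes = None
    for group in ("team", "applicants"):
        # Assumption 2: each group is a list of individuals
        if not isinstance(data[group], list):
            problems.append(f'"{group}" is not a list')
            continue
        for person in data[group]:
            # Assumption 3: a "name" string plus integer attributes on a 1-10 scale
            name = person.get("name") if isinstance(person, dict) else None
            attributes = person.get("attributes") if isinstance(person, dict) else None
            if not isinstance(name, str) or not isinstance(attributes, dict):
                problems.append(f'malformed individual in "{group}": {person!r}')
                continue
            for key, value in attributes.items():
                is_integer = isinstance(value, int) and not isinstance(value, bool)
                if not is_integer or not 1 <= value <= 10:
                    problems.append(f"{name}.{key} is not an integer from 1 to 10")
            # Assumption 4: same attributes, in the same order, for everyone
            if reference_attributes is None:
                reference_attributes = list(attributes)
            elif list(attributes) != reference_attributes:
                problems.append(f"{name} has different attributes or attribute order")
    return problems


sample = {
    "team": [{"name": "Eddie", "attributes": {"intelligence": 1, "strength": 5}}],
    "applicants": [{"name": "John", "attributes": {"intelligence": 4, "strength": 5}}],
}
print(validate_input(sample))  # []
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a given input file in a single pass.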

## 1.) Data Loading and Transformation

In [1]:
# I'll begin with general imports for packages and libraries
import pandas as pd
import json 
import os

In [2]:
# Next, import the json file ASSUMING STORED ON DESKTOP

# Get the path to your desktop
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')

# Specify the filename
file_name = "DHInput.json"

# Combine the desktop path and filename
file_path = os.path.join(desktop_path, file_name)

# Open the file in read mode
with open(file_path, "r") as json_file:
    # Load the JSON data
    input_data = json.load(json_file)

# Print the JSON data
input_data

{'team': [{'name': 'Eddie',
   'attributes': {'intelligence': 1,
    'strength': 5,
    'endurance': 3,
    'spicyFoodTolerance': 1}},
  {'name': 'Will',
   'attributes': {'intelligence': 9,
    'strength': 4,
    'endurance': 1,
    'spicyFoodTolerance': 6}},
  {'name': 'Mike',
   'attributes': {'intelligence': 3,
    'strength': 2,
    'endurance': 9,
    'spicyFoodTolerance': 5}}],
 'applicants': [{'name': 'John',
   'attributes': {'intelligence': 4,
    'strength': 5,
    'endurance': 2,
    'spicyFoodTolerance': 1}},
  {'name': 'Jane',
   'attributes': {'intelligence': 7,
    'strength': 4,
    'endurance': 3,
    'spicyFoodTolerance': 2}},
  {'name': 'Joe',
   'attributes': {'intelligence': 1,
    'strength': 1,
    'endurance': 1,
    'spicyFoodTolerance': 10}}]}

In [3]:
# Next, transition to separate DataFrames for transformations

def flatten_people(records):
    # Flatten a list of {"name", "attributes"} records into one row per person
    df = pd.json_normalize(records)
    # Remove the "attributes." header - not necessary, just done for aesthetic nature + slimming the output frame
    # Split only on the first "." so attribute names that contain dots stay intact
    new_columns = {
        column: column.split(".", 1)[1]
        for column in df.columns
        if column.startswith("attributes.")
    }
    return df.rename(columns=new_columns)

# First, the current employees
employee_df = flatten_people(input_data['team'])

# Print to verify
print(employee_df)

# Next the applicants
applicants_df = flatten_people(input_data['applicants'])

# Print to verify
print(applicants_df)

    name  intelligence  strength  endurance  spicyFoodTolerance
0  Eddie             1         5          3                   1
1   Will             9         4          1                   6
2   Mike             3         2          9                   5
   name  intelligence  strength  endurance  spicyFoodTolerance
0  John             4         5          2                   1
1  Jane             7         4          3                   2
2   Joe             1         1          1                  10


Now that the data is fully transferred and formatted, we should determine our scoring approach.

Since I don't know how many attributes there will be, I don't want to hardcode any specific names or weight one attribute above another based only on the sample input. However, I still believe weighting is necessary, as some attributes on a job application matter more for compatibility than others; the presence of a quirkier attribute such as "spicyFoodTolerance" demonstrates this.

Therefore, I will assume that the attributes are listed in descending order of importance.

## 2.) Methodology and Scoring:

With "n" categories, I weight them in descending order: the first category receives a weight of n, the second n - 1, and so on down to 1.

For example, if there are 5 categories, the first will be weighted as 5, the next 4, and so on.
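
As a small illustration of this weighting rule applied to the four attributes in the sample input, here is a sketch using a hypothetical helper named `descending_weights`:

```python
def descending_weights(attribute_names):
    """Map each attribute to a weight: n for the first, n - 1 for the next, down to 1."""
    n = len(attribute_names)
    return {name: n - position for position, name in enumerate(attribute_names)}


weights = descending_weights(["intelligence", "strength", "endurance", "spicyFoodTolerance"])
print(weights)
# {'intelligence': 4, 'strength': 3, 'endurance': 2, 'spicyFoodTolerance': 1}

# The highest weighted total anyone can reach is every attribute at 10
print(sum(weights.values()) * 10)  # 100
```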

Key Assumption:

    1. We want to hire applicants that are most similar to the current team members

To do this, I will compute the "average employee score" after weighting and then measure how far each applicant's weighted score is from that average.
The closer the two are, the closer to 1.0 the applicant's final score will be.
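
To make the arithmetic concrete, the sketch below walks through the sample data in plain Python. The variable names and the `weighted_score` helper are illustrative only. One assumption is labeled in the comments: the distance from the team average is normalised by the larger of the two distances to the ends of the [0, 1] range, which for the sample data equals 1 minus the team average.

```python
team = {
    "Eddie": [1, 5, 3, 1],
    "Will": [9, 4, 1, 6],
    "Mike": [3, 2, 9, 5],
}
applicants = {
    "John": [4, 5, 2, 1],
    "Jane": [7, 4, 3, 2],
    "Joe": [1, 1, 1, 10],
}
weights = [4, 3, 2, 1]  # descending importance for the four sample attributes


def weighted_score(values):
    """Weighted sum scaled by the best possible total (every attribute at 10)."""
    return sum(v * w for v, w in zip(values, weights)) / (sum(weights) * 10)


team_average = sum(weighted_score(v) for v in team.values()) / len(team)

# Assumption of this sketch: normalise by the larger of the two distances from the
# team average to the ends of the [0, 1] range, so the result cannot go negative.
# For the sample data this is 1 - team_average.
largest_gap = max(team_average, 1 - team_average)

similarity = {
    name: round(1 - abs(weighted_score(values) - team_average) / largest_gap, 2)
    for name, values in applicants.items()
}
print(round(team_average, 2))  # 0.41
print(similarity)              # {'John': 0.92, 'Jane': 0.88, 'Joe': 0.63}
```

For example, John's weighted total is 4*4 + 5*3 + 2*2 + 1*1 = 36 out of a possible 100, which is 0.05 away from the team average of 0.41.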

In [4]:
# Extract attribute column names
attribute_names = [col for col in employee_df.columns if col != "name"]

# Calculate weights based on the number of attributes
num_attributes = len(attribute_names)
attribute_weights = {name: num_attributes - i for i, name in enumerate(attribute_names)}

# Create a new DataFrame with the weighted values
weighted_employees_df = employee_df.copy()
for attribute, weight in attribute_weights.items():
    weighted_employees_df[attribute] *= weight

# Calculate the theoretical maximum score
theoretical_max_score = sum(attribute_weights.values()) * 10

# Calculate the individual scores for each employee
individual_scores = weighted_employees_df[attribute_names].sum(axis=1) / theoretical_max_score

# Add the individual scores to a new temp DataFrame
scores_temp_df = pd.DataFrame()
scores_temp_df['individual_score'] = individual_scores

# Calculate the average employee score
average_team_score = scores_temp_df['individual_score'].mean()

In [7]:
# Next, we need to calculate each applicant's score and find its similarity to the average employee score
# We can reuse some variables since we assume the attributes are the same between the team and applicants

# Create a new DataFrame with the weighted values
weighted_applicants_df = applicants_df.copy()
for attribute, weight in attribute_weights.items():
    weighted_applicants_df[attribute] *= weight

# Calculate the individual scores for each applicant
individual_scores = weighted_applicants_df[attribute_names].sum(axis=1) / theoretical_max_score

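# Find how similar each applicant is to the average employee, since this is what we're prioritizing
# We do this by expressing each applicant's distance from the team average as a fraction of the largest
# possible distance, then subtracting this "% of max difference" from our highest potential candidate score of 1.0
# Weighted scores are bounded by 0 and 1, so the largest possible distance is to whichever end of that
# range is farther from the team average; this keeps every final score within [0, 1]
# We round the final result for brevity
maximum_possible_difference = max(average_team_score, 1 - average_team_score)

final_applicant_scores = round(1 - abs(individual_scores - average_team_score) / maximum_possible_difference, 2)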

## 3.) Output

In [6]:
# We now need to write our output to a list of dictionaries then dump it as a string.

# Create a list of dictionaries for scored applicants
scoredApplicants = []

# Iterate over each row of the DataFrame
for index, row in applicants_df.iterrows():
    scoredApplicants.append({"name": row['name'], "score": final_applicant_scores[index]})

# Convert the list of dictionaries to a JSON string
output_data = json.dumps({"scoredApplicants": scoredApplicants}, indent=2)

print(output_data)

# The following shows how to overwrite the initial input file. I have left it commented out until we confirm that editing the input file is desired.
# It's also commented out line by line rather than as a multi-line string, because a notebook will still print a multi-line string as cell output

# """
# # Get the path to your desktop
# desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')

# # Combine the desktop path and filename
# file_path = os.path.join(desktop_path, file_name)

# # Write the JSON string to the file
# with open(file_path, "w") as file:
#     file.write(output_data)
# """

{
  "scoredApplicants": [
    {
      "name": "John",
      "score": 0.92
    },
    {
      "name": "Jane",
      "score": 0.88
    },
    {
      "name": "Joe",
      "score": 0.63
    }
  ]
}


## 4.) Potential Improvements

There is a long list of things that could be improved in my code. Below are just the thoughts that immediately come to mind:

The primary improvement would be removing the hardcoded "team" and "applicants" keys and instead indexing through the JSON.
I kept the hardcoded names because I thought they would help with readability and make the code easier to follow.
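
One way to drop the hardcoded key names is sketched below. It assumes that the team is still the first list in the input and the applicants the second. `split_groups` is a hypothetical helper, and the key names in the sample are made up for illustration.

```python
def split_groups(data):
    """Return (first_group, second_group) by position instead of by key name.

    Hypothetical helper; assumes the first list-valued entry is the team and
    the second is the applicants.
    """
    groups = [value for value in data.values() if isinstance(value, list)]
    if len(groups) < 2:
        raise ValueError("expected at least two list-valued entries in the input")
    return groups[0], groups[1]


sample = {
    "currentStaff": [{"name": "Eddie"}],
    "candidates": [{"name": "John"}, {"name": "Jane"}],
}
staff, candidates = split_groups(sample)
print(len(staff), len(candidates))  # 1 2
```

The trade-off is that the code no longer documents which group is which, so a swapped input would be scored silently in the wrong direction.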

I also have a hunch that the employee scoring and averaging process could be more concise.

In terms of methodology, a company might want to hire based on "what they're missing," not on "what they have." 
Therefore, I could have found "gaps" in the weighted categories and then looked for applicants who fill those gaps.
Instead, I chose the similarity methodology because in a smaller, tight-knit community (which I assumed the workplace to be), 
you'd want people to get along as well as possible, and I figured similarity scoring would be the most effective way to achieve this.
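
For comparison, a gap-based score could look roughly like the sketch below. This is only one possible interpretation and not the method used in this notebook. The `gap_scores` function, its `scale_max` parameter, and the choice to measure a gap as the team's shortfall from the top of the scale are all assumptions, and the importance weights are left out for brevity.

```python
def gap_scores(team, applicants, scale_max=10):
    """Score applicants by how strongly they cover the team's weakest attributes.

    Hypothetical scoring rule, not the method used in this notebook.
    """
    names = list(next(iter(team.values())))
    # Gap per attribute: how far the team's average falls short of the scale maximum
    gaps = {
        attr: scale_max - sum(person[attr] for person in team.values()) / len(team)
        for attr in names
    }
    total_gap = sum(gaps.values())
    if total_gap == 0:
        # A team that is already maxed out everywhere has no gaps to fill
        return {name: 0.0 for name in applicants}
    # An applicant scoring scale_max on every attribute would receive exactly 1.0
    return {
        name: round(sum(gaps[a] * attrs[a] for a in names) / (total_gap * scale_max), 2)
        for name, attrs in applicants.items()
    }


team = {"A": {"x": 10, "y": 2}}  # strong on x, weak on y
applicants = {"P": {"x": 1, "y": 10}, "Q": {"x": 10, "y": 1}}
print(gap_scores(team, applicants))  # {'P': 1.0, 'Q': 0.1}
```

Here applicant P, who is strong exactly where the team is weak, scores highest, which is the opposite of what similarity scoring would reward.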