## Importing and Characterizing Data

Each user has the following characteristics: Squad_number, UserID, Squad_status, TimeZone, Location, available_days, no_hrs_per_week, Role, Skill (array), Division (1-3), Rank (1-8).

For each group, we want to track the average division and rank.
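
As a rough illustration of tracking the average division and rank per group, here is a minimal sketch. The sample rows are hypothetical (the random data generated below does not include Division or Rank), and the column names are simply taken from the characteristics listed above.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the characteristics listed above
members = pd.DataFrame({
    'Squad_number': [1, 1, 1, 2, 2],
    'UserID': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'Division': [1, 2, 3, 2, 2],
    'Rank': [4, 6, 8, 1, 3],
})

# Mean division and rank for each squad
squad_levels = members.groupby('Squad_number')[['Division', 'Rank']].mean()
print(squad_levels)
```

A plain mean treats Division and Rank as interval scales; if they are only ordinal, a median may be a safer summary.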

The below code was written by Nihar to create random groups.

# Group Recommendation Engine

This engine selects a few recommended groups for a new user to join upon making an account.

In particular, we try to match availability, location, and average skill level within groups. However, it's still important to have an even distribution of skills within the team. For example, it's not very effective to have 5 experts in Python but none in machine learning.

Note: this is more of a data engineering problem, since there are different approaches to optimizing each variable. A more classical machine learning approach would also make sense, but the curse of dimensionality (5 days + 5 skill proficiencies + timezone = 11 dimensions) makes it hard to generate meaningful suggestions. Also, there is little opportunity for feedback.

In [None]:
import random
import numpy as np
import pandas as pd
from scipy.spatial import distance

# List of skills and career roles for random data generation
all_skills = ['Python', 'JavaScript', 'UI/UX Design', 'Photoshop', 'Data Science']
all_career_roles = ['Developer', 'Designer', 'Data Scientist', 'Project Manager']

# Helper function to generate random user data
def generate_random_user(user_id):
    k=random.randint(1, 3)
    d=random.randint(2, 5)

    user = {
        'name': f'User{user_id}',
        'timezone': random.choice(['GMT-5', 'GMT-6', 'GMT-7']),
        'squad_number': None,
        'squad_status': 'open',
        'location': random.choice(['New York', 'Los Angeles', 'Chicago']),
        'availability': random.sample(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], d),
        'hours': random.randint(10, 30),
        'career_role': random.choice(all_career_roles),
        'skills': random.sample(all_skills, k),
        'proficiency': [random.uniform(0.5, 1.0) for _ in range(k)]
    }
    return user

# Generate 70 users
users = [generate_random_user(i) for i in range(1, 71)]

# Function to check if a squad is open or closed
def is_squad_open(squad):
    return squad['squad_status'] == 'open' and len(squad['members']) < 3

# Initialize list of squads
squads = [{'squad_number': i, 'squad_status': 'open', 'members': []} for i in range(1, 41)]

# Shuffle the users to randomize the assignment
random.shuffle(users)

# Assign users to squads
assigned_users = set()
for user in users:
    available_squads = [squad for squad in squads if is_squad_open(squad)]
    if not available_squads:
        # If all open squads are full, assign to a closed squad
        available_squads = [squad for squad in squads if not is_squad_open(squad)]

    selected_squad = random.choice(available_squads)
    # Skip users who are already assigned to a squad
    if user['name'] in assigned_users:
        continue
    selected_squad['members'].append(user['name'])
    user['squad_number'] = selected_squad['squad_number']
    assigned_users.add(user['name'])

    # Close the squad once it reaches three members
    if len(selected_squad['members']) == 3:
        selected_squad['squad_status'] = 'closed'

print("Generated Users:")
for user in users:
    print(user)

print("\nGenerated Squads:")
for squad in squads:
    print(squad)

Generated Users:
{'name': 'User32', 'timezone': 'GMT-7', 'squad_number': 38, 'squad_status': 'open', 'location': 'Los Angeles', 'availability': ['Wednesday', 'Monday', 'Friday'], 'hours': 10, 'career_role': 'Developer', 'skills': ['UI/UX Design', 'Data Science', 'Photoshop'], 'proficiency': [0.6201636666492625, 0.9853525605476935, 0.9553475350494937]}
{'name': 'User20', 'timezone': 'GMT-7', 'squad_number': 4, 'squad_status': 'open', 'location': 'Chicago', 'availability': ['Thursday', 'Wednesday'], 'hours': 14, 'career_role': 'Designer', 'skills': ['Data Science', 'Python', 'JavaScript'], 'proficiency': [0.7491885868023832, 0.8913708791443895, 0.6963774563669498]}
{'name': 'User39', 'timezone': 'GMT-7', 'squad_number': 9, 'squad_status': 'open', 'location': 'Los Angeles', 'availability': ['Friday', 'Tuesday', 'Monday', 'Thursday'], 'hours': 22, 'career_role': 'Designer', 'skills': ['Photoshop', 'UI/UX Design'], 'proficiency': [0.7405449299111124, 0.5755306589040083]}
{'name': 'User57', 

## Step 1: User information

Calculate values of:
  - Average proficiency (divide sum by 5)
  - Convert timezone to value (1,2,3)
  - Convert availability to binary array (size 5)
  

In [None]:
# Convert the list of user dicts to a DataFrame (written by Jyoti)
users_df = pd.DataFrame.from_dict(users)

def user_engineer(users_info):
  # Average proficiency, computed from the frame passed in (not the global users_df)
  users_info['avg_proficiency'] = users_info['proficiency'].apply(lambda x: sum(x) / len(all_skills))
  # Here I've divided by the total number of skills, since that accounts for lack of breadth,
  # which is a weakness in some sense.
  # An alternative is to divide by the number of skills the user has.

  # Convert timezone to value
  zone_to_number = {
      'GMT-5': 1,
      'GMT-6': 2,
      'GMT-7': 3,
      # This numerical mapping helps discourage larger time differences
  }
  users_info['timezone_int'] = users_info['timezone'].map(zone_to_number)

  # Convert availability to binary array
  weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

  users_info['availability_bin'] = users_info['availability'].apply(lambda x: [1 if item in x else 0 for item in weekdays])

  return users_info

users_df = user_engineer(users_df)
users_df

Unnamed: 0,name,timezone,squad_number,squad_status,location,availability,hours,career_role,skills,proficiency,avg_proficiency,timezone_int,availability_bin
0,User32,GMT-7,38,open,Los Angeles,"[Wednesday, Monday, Friday]",10,Developer,"[UI/UX Design, Data Science, Photoshop]","[0.6201636666492625, 0.9853525605476935, 0.955...",0.512173,3,"[1, 0, 1, 0, 1]"
1,User20,GMT-7,4,open,Chicago,"[Thursday, Wednesday]",14,Designer,"[Data Science, Python, JavaScript]","[0.7491885868023832, 0.8913708791443895, 0.696...",0.467387,3,"[0, 0, 1, 1, 0]"
2,User39,GMT-7,9,open,Los Angeles,"[Friday, Tuesday, Monday, Thursday]",22,Designer,"[Photoshop, UI/UX Design]","[0.7405449299111124, 0.5755306589040083]",0.263215,3,"[1, 1, 0, 1, 1]"
3,User57,GMT-5,31,open,New York,"[Thursday, Tuesday, Wednesday, Monday, Friday]",13,Data Scientist,"[Python, Photoshop, UI/UX Design]","[0.9505456761162054, 0.7745459737039511, 0.592...",0.463438,1,"[1, 1, 1, 1, 1]"
4,User42,GMT-7,17,open,Los Angeles,"[Wednesday, Thursday, Monday, Friday, Tuesday]",21,Designer,"[Photoshop, JavaScript]","[0.6156355933382485, 0.5188795102376939]",0.226903,3,"[1, 1, 1, 1, 1]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,User17,GMT-5,23,open,Chicago,"[Thursday, Wednesday, Tuesday]",28,Developer,[Python],[0.70549888667227],0.141100,1,"[0, 1, 1, 1, 0]"
66,User11,GMT-6,21,open,New York,"[Monday, Tuesday, Thursday, Friday]",24,Project Manager,"[Python, UI/UX Design, Photoshop]","[0.5477744214100839, 0.8412097352109398, 0.973...",0.472497,2,"[1, 1, 0, 1, 1]"
67,User8,GMT-7,34,open,Chicago,"[Friday, Monday, Tuesday, Thursday]",12,Project Manager,"[Photoshop, Python]","[0.7054330152142543, 0.6152205380797853]",0.264131,3,"[1, 1, 0, 1, 1]"
68,User41,GMT-6,35,open,New York,"[Monday, Friday, Wednesday, Tuesday, Thursday]",19,Project Manager,"[Photoshop, UI/UX Design]","[0.6373184510188062, 0.8791086795309131]",0.303285,2,"[1, 1, 1, 1, 1]"


## Step 2: Squad Information

Calculate values of:
- Overall availability (across squad members)
- Average squad proficiency (the mean of the members' average proficiencies)
- Timezone average

Note: if we change the method of calculating average user proficiency, we may need to recalculate the average squad proficiencies as well.

We additionally calculated the timezone disparity and the total working hours just as some metadata on the teams. This won't be used in the matching process though.

Additionally, the skill proficiency array shows how much average proficiency the team has in any given skill.
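
To make the availability aggregation concrete, here is a small worked sketch with made-up availability rows. It shows the per-day fraction of members who are free (the quantity stored in the schedule column) next to the binary alternative mentioned in the code comments, where a day counts as soon as any member is free.

```python
import numpy as np

# Hypothetical binary availability rows (Monday..Friday) for a three-member squad
member_availability = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
])

# Fraction of members free on each day; rewards days when everyone overlaps
fraction_available = member_availability.mean(axis=0)

# Binary alternative: a day counts if at least one member is free
any_available = member_availability.max(axis=0)

print(fraction_available)
print(any_available)
```

The fractional form distinguishes "all three on Monday" from "one person on Tuesday", which the binary form cannot.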

In [None]:
squads_df = pd.DataFrame.from_dict(squads)

# Overall availability
    ## This is a bit of clumsy programming, rework when you have the chance.
    ## It places priority on members being available at the same time as much as possible
    ## (i.e. 3 members on Monday is better than 2 members on Monday and 2 on Tuesday)
    ## Could also stay binary and just take a superset of available days
def calculate_overall_avail(users_list):
    total_avail = [0,0,0,0,0]
    users = users_df[users_df['name'].isin(users_list)]
    user_avail_array = users['availability_bin'].values

    for user_avail in user_avail_array:
      total_avail = np.add(total_avail, user_avail)

    return np.divide(total_avail, len(users_list))

squads_df["schedule"] = squads_df["members"].apply(calculate_overall_avail)

# Average squad proficiency
def calculate_average_proficiency(users_list):
    users = users_df[users_df['name'].isin(users_list)]
    return users['avg_proficiency'].mean()

squads_df['squad_proficiency_avg'] = squads_df['members'].apply(calculate_average_proficiency)

# Timezone average
def calculate_timezone_avg(users_list):
    users = users_df[users_df['name'].isin(users_list)]
    return users['timezone_int'].mean()

squads_df['timezone_avg'] = squads_df['members'].apply(calculate_timezone_avg)

# Average skill proficiency
def calculate_skill_proficiency(users_list):
  users = users_df[users_df['name'].isin(users_list)]
  # One row per (skill, proficiency) pair across all squad members
  skills_proficiencies = users.explode(['skills', 'proficiency'], ignore_index=True)
  skills_proficiencies = skills_proficiencies.astype({'proficiency': float})
  # Sum per skill; reindex adds unrepresented skills with proficiency 0, in alphabetical order
  # (DataFrame.append was removed in pandas 2.0)
  skill_proficiency_sum = (
      skills_proficiencies.groupby('skills')['proficiency'].sum()
      .reindex(sorted(all_skills), fill_value=0)
  )
  # Divide by squad size to get the average proficiency per skill
  skill_proficiency_array = np.divide(skill_proficiency_sum.to_numpy(), len(users_list))

  return skill_proficiency_array

squads_df['skill_proficiency'] = squads_df['members'].apply(calculate_skill_proficiency)

# Timezone disparity
def calculate_timezone_disparity(users_list):
    users = users_df[users_df['name'].isin(users_list)]
    return users['timezone_int'].max()-users['timezone_int'].min()

squads_df['timezone_disparity'] = squads_df['members'].apply(calculate_timezone_disparity)

# Total working hours
def calculate_hours(users_list):
    users = users_df[users_df['name'].isin(users_list)]
    return users['hours'].sum()

squads_df['total_hours'] = squads_df['members'].apply(calculate_hours)

squads_df

## Step 3: Search for similarity matches

We want to match new users to groups based on similar average proficiency, timezones, and availability.

We will minimize the Euclidean distance between these different values and return the top n candidates.

This step would do well with some weighting of the different variables. Currently proficiency is weighted very heavily, because it is multiplied by 10 before the distance is computed.
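
One way such weighting could look is a weighted Euclidean distance, sketched below. The function name `weighted_distance`, the example vectors, and the weights are all hypothetical; choosing real weights would need experimentation.

```python
import numpy as np

def weighted_distance(user_vec, squad_vec, weights):
    """Euclidean distance with a per-feature weight on each squared difference."""
    diff = np.asarray(user_vec, dtype=float) - np.asarray(squad_vec, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights, dtype=float) * diff ** 2)))

# Feature order: [avg_proficiency, timezone, availability cost]; values are made up
user = [0.25, 2.0, 0.0]
squad = [0.30, 1.0, 1.25]

unweighted = weighted_distance(user, squad, [1.0, 1.0, 1.0])
# Hypothetical weights that emphasise availability over timezone
weighted = weighted_distance(user, squad, [1.0, 0.5, 2.0])

print(unweighted, weighted)
```

Scaling a feature by 10 before computing an unweighted distance is equivalent to giving it a weight of 100 in this formulation, since the weight applies to the squared difference.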

In [None]:
new_user = [generate_random_user("RadicalX_user")]
new_user = user_engineer(pd.DataFrame.from_dict(new_user))

In [None]:
# Rank squads by Euclidean distance to the new user (nearest-neighbour matching)
num_best = 10
squad_capacity = 3
squads_df = squads_df.dropna()
# Could consider evaluating this by the size of the member set.
squads_df = squads_df[squads_df['squad_status'] == 'open']

user_availability = new_user['availability_bin'][0]

def euclidean_distance(squad1, squad2):
    # Skip the first element of each vector (the name / squad_number identifier)
    return distance.euclidean(squad1[1:], squad2[1:])

def find_similar_squads(user, df, n=5):
    df['Distance'] = df.apply(lambda row: euclidean_distance(user, row), axis=1)
    similar_squads = df.sort_values(by='Distance').head(n)
    return similar_squads

# Copy so that adding columns does not write into a slice of squads_df
squads_eval = squads_df.loc[:, ['squad_number', 'squad_proficiency_avg', 'timezone_avg']].copy()
# Scale proficiency by 10 (this weights it heavily in the distance)
squads_eval['squad_proficiency_avg'] = squads_eval['squad_proficiency_avg'] * 10
# Availability cost: smaller when the squad's schedule overlaps more with the user's days
squads_eval['availability_similarity'] = squads_df['schedule'].apply(lambda x: 5/np.dot(x, user_availability))

new_user_eval = new_user.loc[:, ['name', 'avg_proficiency', 'timezone_int']].values
new_user_eval[0][1] = new_user_eval[0][1]*10
# The user's target availability cost is 0
new_user_eval = np.append(new_user_eval, 0)

print(new_user_eval)
top_set = find_similar_squads(new_user_eval, squads_eval, num_best)
print(top_set)


['UserRadicalX_user' 2.4991913609286467 2 0]
    squad_number  squad_proficiency_avg  timezone_avg  \
15            16               2.984904           1.5   
37            38               2.631755           1.0   
25            26               2.365885           2.5   
9             10               3.206203           1.0   
23            24               3.379069           2.0   
21            22               2.888331           1.0   
22            23               2.911008           1.0   
1              2               1.319579           1.0   
30            31               1.371270           2.0   
29            30               1.293823           2.0   

    availability_similarity  Distance  
15                 1.428571  1.589570  
37                 1.250000  1.606261  
25                 1.666667  1.745150  
9                  1.428571  1.881670  
23                 1.666667  1.884665  
21                 1.666667  1.982223  
22                 1.666667  1.986799  
1      

## Step 4: Ensure skill distribution

Having minimized the distance between variables in the last step, we now want to maximize the diversity of skills within the team.

To do so, we build an ordered list of skill proficiencies (corresponding to the all_skills set in alphabetical order). We compare the user's list with each team's average list and keep the teams with the highest Euclidean distance, i.e. those whose skill profile differs most from the user's.
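
A toy example of this ranking, with made-up proficiency vectors and hypothetical squad names, may help. A squad whose strengths overlap with the user's ends up with a small distance, while a complementary squad ends up with a large one and is ranked first.

```python
import numpy as np

# Alphabetical skill order: Data Science, JavaScript, Photoshop, Python, UI/UX Design
user_skills = np.array([0.9, 0.0, 0.0, 0.8, 0.0])

# Hypothetical per-skill average proficiencies for two squads
squad_skills = {
    'squad_a': np.array([0.8, 0.0, 0.0, 0.9, 0.0]),  # overlaps with the user
    'squad_b': np.array([0.0, 0.7, 0.6, 0.0, 0.5]),  # complements the user
}

# Euclidean distance between the user's vector and each squad's vector
skill_distance = {
    name: float(np.linalg.norm(user_skills - vec))
    for name, vec in squad_skills.items()
}

# Largest distance first: the most complementary squad leads the list
ranked = sorted(skill_distance, key=skill_distance.get, reverse=True)
print(ranked)
```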

In [None]:
def get_user_skills(user):
  # One row per (skill, proficiency) pair for the user
  user_sp = user.explode(['skills', 'proficiency'], ignore_index=True)
  user_sp = user_sp.astype({'proficiency': float})
  # Sum per skill; reindex adds missing skills with proficiency 0, in alphabetical order
  # (DataFrame.append was removed in pandas 2.0)
  user_sp_sum = (
      user_sp.groupby('skills')['proficiency'].sum()
      .reindex(sorted(all_skills), fill_value=0)
  )
  skillset = user_sp_sum.tolist()

  return skillset

def find_disparate_skills(user, df, n=5):
    # Work on a copy to avoid SettingWithCopyWarning on a filtered DataFrame
    df = df.copy()
    # Compare all five skills; the euclidean_distance helper skips the first element,
    # which would drop 'Data Science' here
    df['skill_distance'] = df['skill_proficiency'].apply(lambda x: distance.euclidean(user, x))
    # Largest distance first: these squads differ most from the user's skill profile
    disparate_squads = df.sort_values(by='skill_distance', ascending=False).head(n)
    return disparate_squads

candidates = squads_df[squads_df['squad_number'].isin(top_set['squad_number'])]
user_skills = get_user_skills(new_user)
top_matches = find_disparate_skills(user_skills, candidates, 5)

print(new_user)
print(top_matches)



                name timezone squad_number squad_status location  \
0  UserRadicalX_user    GMT-7         None         open  Chicago   

                                     availability  hours      career_role  \
0  [Tuesday, Friday, Monday, Thursday, Wednesday]     12  Project Manager   

                                  skills  \
0  [Data Science, Photoshop, JavaScript]   

                                         proficiency  avg_proficiency  \
0  [0.7329805987174245, 0.661578138448387, 0.8089...         0.512173   

   timezone_int availability_bin  
0             3  [1, 1, 1, 1, 1]  
    squad_number squad_status           members                   schedule  \
37            38         open          [User67]  [1.0, 1.0, 1.0, 1.0, 1.0]   
21            22         open          [User36]  [1.0, 0.0, 1.0, 1.0, 1.0]   
25            26         open  [User25, User55]  [1.0, 0.5, 0.5, 1.0, 0.5]   
23            24         open          [User60]  [1.0, 0.0, 0.0, 1.0, 1.0]   
9           


Remaining questions:

- What constraints do we want to put on working hours? Should we make sure there's always a full-time person, or try to match hours among squad members?
- Where should we store information about rank?
- What is the method for starting new groups? Under this engine, no one will ever be recommended an empty group.