# People

What are they? How can we represent people in a meaningful way that doesn't get any of us fired for exfiltrating data?

## Faker

Let's generate some fake data! https://faker.readthedocs.io/en/master/




In [40]:
from faker import Faker
fake = Faker()

fake.profile() # lots of great stuff in here!

{'job': 'Planning and development surveyor',
 'company': 'Mcdonald, Moses and Patel',
 'ssn': '158-81-6202',
 'residence': 'PSC 4035, Box 3123\nAPO AA 79578',
 'current_location': (Decimal('-71.5527815'), Decimal('32.933502')),
 'blood_group': 'O-',
 'website': ['http://lloyd-morales.com/',
  'https://wolfe-adams.com/',
  'http://www.james.com/',
  'https://armstrong-wall.com/'],
 'username': 'rmoss',
 'name': 'Barbara Cowan',
 'sex': 'F',
 'address': 'Unit 4293 Box 1745\nDPO AA 97094',
 'mail': 'phamcatherine@hotmail.com',
 'birthdate': datetime.date(2005, 12, 26)}

This is a good start, but it's kind of wonky: we get people all over the world with so many different jobs! Let's keep the spirit of this but put our own limits on the fields, so things line up with what we'd expect a company org to look like.

In [41]:
import numpy as np
import random


def choose_a_few(
    options: list[str],
    weights: list[int | float] | None = None,
    max_choices: int | None = None,
    min_choices: int = 0,
) -> set[str]:
    """A helpful function to pick a random number of choices from a list of options

    The number of draws is skewed toward `min_choices`. Draws are made with
    replacement, so the returned set can hold fewer items than the number of draws."""
    max_choices = int(np.clip(max_choices or len(options), min_choices, len(options)))

    # how many choices will we make this time? smaller counts get larger weights
    counts = list(range(min_choices, max_choices + 1))
    raw_weights = list(range(max_choices, min_choices - 1, -1))
    # normalize by the actual sum so the probabilities add up to 1 for any min_choices
    total = sum(raw_weights)
    k_weights = [w / total for w in raw_weights]
    n_choices = int(np.random.choice(counts, p=k_weights))

    # make the choices (duplicates collapse when converted to a set)
    choices = random.choices(options, weights=weights, k=n_choices)
    return set(choices)
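To see how the number of draws is skewed, here is a minimal stdlib-only sketch of the count weighting used in `choose_a_few`. The helper name `count_weights` is hypothetical; it normalizes by the sum of the raw weights so the probabilities add up to 1, and it uses `Fraction` only to keep the arithmetic exact.

```python
from fractions import Fraction


def count_weights(max_choices: int, min_choices: int = 0) -> dict[int, Fraction]:
    """Map each possible number of draws to its probability (hypothetical helper)."""
    counts = range(min_choices, max_choices + 1)
    # raw weights run from max_choices down to min_choices: fewer draws are likelier
    raw = range(max_choices, min_choices - 1, -1)
    total = sum(raw)
    return {n: Fraction(w, total) for n, w in zip(counts, raw)}


weights_0_to_4 = count_weights(max_choices=4)
for n, p in weights_0_to_4.items():
    print(n, p)
```

Note that with `min_choices=0` the largest count gets a raw weight of 0, so that many draws never happens.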


To ground us in this task, let's define a new `Person` object that we can fill up with info:

In [42]:
from dataclasses import dataclass, field
from typing import Literal
from enum import Enum, auto
import datetime

class timezone(str, Enum):
    # explicit string values, so plain strings such as "EST" compare equal to the members
    EST = "EST"
    PST = "PST"
    UTC = "UTC"

@dataclass
class Location:
    city: str
    tz: timezone
    country: str

@dataclass
class Person:
    name: str
    hire_date: datetime.date
    status: Literal["Full Time", "Part Time", "Contract"]
    languages: list[str] = field(default_factory=list)
    manager: str | None = None
    team: str | None = None
    title: str | None = None
    location: Location | None = None
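Because `Person` holds a nested `Location`, it helps to know what a flattened record looks like before these objects go into a table. The following is a minimal sketch with trimmed-down copies of the two dataclasses (`tz` and `status` are plain strings here, which is a simplification) and made-up example values.

```python
import datetime
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Location:
    city: str
    tz: str
    country: str


@dataclass
class Person:
    name: str
    hire_date: datetime.date
    status: str
    languages: list = field(default_factory=list)
    location: Optional[Location] = None


# made-up example values, not generated data
person = Person(
    name="Example Person",
    hire_date=datetime.date(2021, 1, 4),
    status="Full Time",
    languages=["Python"],
    location=Location(city="Dublin", tz="UTC", country="IRL"),
)

record = asdict(person)  # nested dataclasses become nested dicts
print(record["location"])
```

`asdict` recurses into nested dataclasses, so `location` ends up as a plain dict inside the record.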


Now to make some people: 

In [43]:
import pandas as pd
import numpy as np
from faker import Faker
import random

# create some fake data
fake = Faker()

employment = {"Full Time": 0.7, "Part Time": 0.05, "Contract": 0.3}
languages = {
    "Python": 0.25,
    "Scala": 0.1,
    "Go": 0.08,
    "JavaScript": 0.3,
    "Java": 0.3,
    "Typescript": 0.17,
    "Erlang": 0.01,
    "Elixir": 0.001,
}

def make_person() -> Person:
    return Person(
        name = fake.name(),
        hire_date = fake.date_between(start_date="-3y", end_date="today"),
        status = choose_a_few(list(employment), max_choices=1, min_choices=1),
        languages = choose_a_few(list(languages.keys()), weights=list(languages.values())),
        team = None, # hrmmmm this is harder
        title = None, # let's be smarter with this
        location = None, # let's also be smarter with this
    )

make_person()

Person(name='Amanda Perez', hire_date=datetime.date(2020, 11, 21), status={'Contract'}, languages={'Java', 'Go', 'Python'}, manager=None, team=None, title=None, location=None)
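In the output above, `status` comes back as a one-element set, and the `employment` weights are never passed to `choose_a_few`. If a plain string drawn with those weights is preferred, one possible sketch looks like this. The helper name `choose_status` is hypothetical, and the seed is there only to make the sketch repeatable.

```python
import random

employment = {"Full Time": 0.7, "Part Time": 0.05, "Contract": 0.3}


def choose_status(rng: random.Random) -> str:
    """Draw one employment status using the relative weights (hypothetical helper)."""
    # random.choices treats weights as relative, so they need not sum to 1
    return rng.choices(list(employment), weights=list(employment.values()), k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch repeatable
sample = [choose_status(rng) for _ in range(1000)]
print({status: sample.count(status) for status in employment})
```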

Now we can generate more complex attributes in a smart way. Let's set up some rules about where offices are and which teams sit in which offices, then pick titles based on other info (e.g. developers probably know at least one language, and executives are probably full time?).

In [44]:

TEAM_TITLES: dict[str, list[str]] = {
    "DevX": ["Engineer", "Engineer", "Engineer", "Engineer", "Engineer", "AVP"],
    "DevOps": ["Engineer", "Senior Engineer", "Manager"],
    "Sales": ["Associate"],
    "Support": ["Analyst", "Manager"],
    "Platform": ["Engineer", "Senior Engineer", "Managing Engineer", "AVP", "VP"],
    "Product": ["Engineer", "Manager", "Product Owner", "AVP", "VP"],
    "Internal Tools": ["Engineer", "Senior Engineer", "Manager", "AVP", "VP"],
    "Business": ["Analyst", "Associate", "Vice President", "Director", "Managing Director"]
}


def title_city_team():
    # just a few locations
    offices = {
        location.city: location
        for location in [
            Location("New York", tz="EST", country="USA"),
            Location("Seattle", tz="PST", country="USA"),
            Location("Toronto", tz="EST", country="CAN"),
            Location("London", tz="UTC", country="GBR"),
            Location("Fort Lauderdale", tz="EST", country="USA"),
            Location("Dublin", tz="UTC", country="IRL"),
        ]
    }
    # codify the hierarchical structure
    allowed_teams_per_office = {
        "New York": ["Sales", "Product", "Business"],
        "Toronto": ["Platform", "Product", "Internal Tools", "Sales", "Business"],
        "Fort Lauderdale": ["DevX"],
        "Dublin": ["DevOps", "Support"],
        "London": ["Sales", "Business"],
        "Seattle": ["Internal Tools", "Product", "Platform"],
    }
    allowed_titles_per_team = TEAM_TITLES

    city = random.choice(list(offices))
    team = random.choice(allowed_teams_per_office[city])
    title = choose_a_few(
        allowed_titles_per_team[team], max_choices=1, min_choices=1
    ).pop()
    
    return {
        "location": Location(city=city, tz=offices[city].tz, country=offices[city].country),
        "title": title,
        "team": team,
    }


title_city_team()


{'location': Location(city='Dublin', tz='UTC', country='IRL'),
 'title': 'Manager',
 'team': 'Support'}
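Since the office, team, and title rules live in separate dicts, a typo in one of them would only show up as a `KeyError` on an unlucky random draw. A small consistency check can catch that earlier. This is a sketch on an excerpt of the dicts above; `teams_without_titles` is a hypothetical helper name.

```python
def teams_without_titles(
    allowed_teams_per_office: dict, team_titles: dict
) -> dict:
    """Return, per office, the teams that have no title list (hypothetical helper)."""
    problems = {}
    for office, teams in allowed_teams_per_office.items():
        missing = [team for team in teams if team not in team_titles]
        if missing:
            problems[office] = missing
    return problems


# excerpt of the configuration above, shortened for the sketch
team_titles = {
    "DevOps": ["Engineer", "Senior Engineer", "Manager"],
    "Support": ["Analyst", "Manager"],
    "Sales": ["Associate"],
}
offices_ok = {"Dublin": ["DevOps", "Support"], "London": ["Sales"]}
# "Finance" is a deliberately wrong team name to show what a failure looks like
offices_bad = {"Dublin": ["DevOps", "Finance"]}

print(teams_without_titles(offices_ok, team_titles))
print(teams_without_titles(offices_bad, team_titles))
```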

After running this we should have a better balanced org in terms of region + titles. Then we just need to add the connections in -- i.e. who's the boss?!

In [45]:
def make_person() -> Person:
    title_city_team_ = title_city_team()
    technical = 1 if "Engineer" in title_city_team_["title"] else 0
    return Person(
        name = fake.name(),
        hire_date = fake.date_between(start_date="-3y", end_date="today"),
        status = choose_a_few(list(employment), max_choices=1, min_choices=1),
        languages = choose_a_few(list(languages.keys()), weights=list(languages.values()), min_choices=technical),
        **title_city_team_,
    )


In [46]:
people_df = pd.DataFrame((make_person() for _ in range(150)))
people_df.head()

| | name | hire_date | status | languages | manager | team | title | location |
|---|---|---|---|---|---|---|---|---|
| 0 | Adam Alexander | 2021-04-28 | {Full Time} | {Java, Typescript, Python} | | Sales | Associate | {'city': 'London', 'tz': 'UTC', 'country': 'GBR'} |
| 1 | Christina Christian | 2021-02-12 | {Part Time} | {Java} | | DevX | Engineer | {'city': 'Fort Lauderdale', 'tz': 'EST', 'coun... |
| 2 | Leonard Moore | 2021-09-30 | {Part Time} | {Java} | | Support | Manager | {'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'} |
| 3 | Austin Sparks | 2021-06-23 | {Contract} | {Go} | | Support | Analyst | {'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'} |
| 4 | Amy Martin | 2022-05-25 | {Full Time} | {Go, Python} | | Product | Engineer | {'city': 'New York', 'tz': 'EST', 'country': '... |


So, let's group by Team and then pick a manager for everyone:

In [47]:
ranks = {team: {title: rank for rank,title in enumerate(titles)} for team, titles in TEAM_TITLES.items()}
for team in ranks:
    people_df.loc[people_df.team==team, "rank"] = people_df.loc[people_df.team==team].title.map(ranks[team])
people_df = people_df.sort_values(by=["team","rank"])
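One detail of the rank mapping is worth seeing in isolation: a title list that repeats an entry, like the `"DevX"` list, keeps only the last index for that title when built with a dict comprehension. The sketch below shows this, plus a de-duplicated variant in case dense ranks are wanted; the variable names are illustrative only.

```python
titles = ["Engineer", "Engineer", "Engineer", "Engineer", "Engineer", "AVP"]

# same construction as the ranks cell: later duplicates overwrite earlier ones
ranks = {title: rank for rank, title in enumerate(titles)}
print(ranks)  # {'Engineer': 4, 'AVP': 5}

# optional variant: drop duplicates (keeping first-seen order) before ranking
unique_titles = list(dict.fromkeys(titles))
dense_ranks = {title: rank for rank, title in enumerate(unique_titles)}
print(dense_ranks)  # {'Engineer': 0, 'AVP': 1}
```

Either way the relative order of titles within a team is the same, so the sort by `["team", "rank"]` is unaffected.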

In [48]:
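def naivereportsto(row, df):
    # supervisor: among rows with a smaller index label, the last one with a strictly lower rank
    supervisor = (
        df[(df.index < row.name)].query(f"""rank <= {row["rank"]}-1""").tail(1)["name"]
    )
    supervisor = supervisor.item() if not supervisor.empty else None
    # peer: among rows with a smaller index label, the first one with the same rank
    peer = df[(df.index < row.name)].query(f"""rank  == {row["rank"]}""").head(1)["name"]
    peer = peer.item() if not peer.empty else None
    # fall back to the person themselves when nobody else qualifies
    return supervisor or peer or row["name"]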


def reportsto(df):
    return df.assign(manager=df.apply(naivereportsto, df=df, axis=1))


def supervisors(df):
    df["peoplemanager"] = df.apply(naivereportsto, df=df, axis=1)
    df = df.groupby("team", group_keys=False).apply(reportsto).reset_index(drop=True)
    return df


people_df = people_df.pipe(supervisors)
people_df.head(5)


| | name | hire_date | status | languages | manager | team | title | location | rank | peoplemanager |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jorge Young | 2020-06-18 | {Contract} | {JavaScript, Typescript, Python} | Jorge Young | Business | Analyst | {'city': 'London', 'tz': 'UTC', 'country': 'GBR'} | 0.0 | Gregory Gibson |
| 1 | Jon Hunter | 2022-07-03 | {Full Time} | {Java, Scala} | Jorge Young | Business | Analyst | {'city': 'London', 'tz': 'UTC', 'country': 'GBR'} | 0.0 | Jorge Young |
| 2 | Jennifer Hinton | 2022-04-06 | {Contract} | {Java} | Jorge Young | Business | Analyst | {'city': 'London', 'tz': 'UTC', 'country': 'GBR'} | 0.0 | Jorge Young |
| 3 | Mary Brown | 2022-05-22 | {Contract} | {} | Jorge Young | Business | Analyst | {'city': 'New York', 'tz': 'EST', 'country': '... | 0.0 | Jorge Young |
| 4 | Sharon Quinn | 2022-03-30 | {Contract} | {JavaScript, Typescript, Python} | Jorge Young | Business | Analyst | {'city': 'New York', 'tz': 'EST', 'country': '... | 0.0 | Jorge Young |


Now we just need a CEO for all the team leads to report to. We set the CEO's manager to themselves, which will help us out later.

In [49]:
CEO = make_person().__dict__ | {"team":"CEO", "title":"CEO", "status":"Full Time"}
CEO["location"] = CEO["location"].__dict__
people_df = pd.concat([people_df, pd.DataFrame([CEO])])
CEO_mask = people_df.name==CEO["name"]
people_df.loc[(people_df.manager == people_df.name) | CEO_mask ,"manager"]=CEO["name"]
people_df.loc[CEO_mask, "rank"] = people_df["rank"].max()+1
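Having the CEO report to themselves gives a natural stopping point when walking up the reporting chain. Here is a minimal sketch of such a walk over a plain name-to-manager mapping. The function name `chain_to_top` and the names in the mapping are made up, and the sketch assumes names are unique, which generated names do not guarantee.

```python
def chain_to_top(name: str, manager_of: dict) -> list:
    """Follow manager links until someone manages themselves (hypothetical helper)."""
    chain = [name]
    seen = {name}
    while manager_of[chain[-1]] != chain[-1]:
        boss = manager_of[chain[-1]]
        if boss in seen:
            # a loop that never reaches a self-managed person
            raise ValueError(f"reporting cycle detected at {boss!r}")
        chain.append(boss)
        seen.add(boss)
    return chain


# made-up mapping: the CEO is the only person who manages themselves
manager_of = {
    "Example CEO": "Example CEO",
    "Lead A": "Example CEO",
    "Analyst B": "Lead A",
}

print(chain_to_top("Analyst B", manager_of))
```

With the real data, a mapping like this could be built from the `name` and `manager` columns, for example via `dict(zip(people_df.name, people_df.manager))`.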

Cool .. this seems reasonably distributed?

In [51]:
expanded_df = people_df.assign(**people_df.location.apply(pd.Series))

In [104]:
import plotly.express as px
fig = px.bar(expanded_df, x="title", color="team", hover_data=["team","tz","city"], facet_col="country", template="plotly_dark")

fig.update_xaxes(matches=None, title_text=None)