TL;DR: This data was downloaded from my "saved jobs" on LinkedIn. This notebook processes the data and sorts it into the groups for the experiment.

# Summary
I'm running an experiment on how well two different resumes perform. Does a professionally written resume perform better than one I've written myself?

# Hypothesis
$H_{0}: \text{The professionally written resume and cover letter provided by Haley Stock have the same interview invitation rate as my own self-written resume and cover letter.}$

$H_{a}: \text{The professionally written resume and cover letter have a different interview invitation rate than the self-written resume and cover letter.}$
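
One way these hypotheses could be evaluated once results are in is a two-sided two-proportion z-test. The sketch below is a minimal illustration using only the standard library; the function name `two_proportion_ztest` and the counts passed to it are hypothetical and are not experiment results.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(invites_a, apps_a, invites_b, apps_b):
    """Two-sided z-test for a difference in interview invitation rates."""
    rate_a = invites_a / apps_a
    rate_b = invites_b / apps_b
    # Pooled rate under H0 (both resumes share one invitation rate)
    pooled = (invites_a + invites_b) / (apps_a + apps_b)
    se = sqrt(pooled * (1 - pooled) * (1 / apps_a + 1 / apps_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts for illustration only, not experiment results
z, p_value = two_proportion_ztest(invites_a=8, apps_a=100, invites_b=13, apps_b=100)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A small p-value would count as evidence against $H_{0}$; the significance threshold itself is a choice to fix before looking at the data.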

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |
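
The P1 and P2 values in the table imply a required sample size per group. The sketch below uses the standard normal-approximation formula for comparing two proportions; the significance level (`alpha=0.05`) and power (`power=0.80`) are assumptions on my part, since the notebook does not state them here, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# P1 = 8% and P2 = 13% come from the table; alpha and power are assumed
n_per_group = sample_size_per_group(p1=0.08, p2=0.13)
print(f"Applications needed per group: {n_per_group}")
```

Raising the assumed power or shrinking the detectable difference increases the required number of applications.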


# Data Collection Process
**LinkedIn:** I searched for "data" roles that are remote and based in the US. I then saved the roles that met the criteria below and exported them to a CSV (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (too different from my experience)<br>No Product Manager roles (not interested in product management)<br>No Director-level roles (largely above my level) |
| **Type**                | Full-time and contract                                                                                                                                               |
| **Duplicate Job Roles** | Sometimes companies post multiple versions of a job (by location or reposts after a few weeks have gone by). Will do my best to dedupe these in the final "saved jobs" list. |
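
The deduplication described in the last row could be sketched as follows. This is a rough illustration, not the procedure actually used on the saved-jobs list: the sample rows are hypothetical, the `normalize` helper and `dedupe_key` column are my own names, and treating "remote" tags and punctuation as noise is an assumption.

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the saved-jobs CSV
jobs = pd.DataFrame({
    "company": ["Acme", "Acme", "acme ", "Globex"],
    "role_title": [
        "Senior Data Scientist",
        "Senior Data Scientist - Remote",
        "senior data scientist (Remote)",
        "Lead Data Analyst",
    ],
    "days_since_post": [30, 7, 14, 7],
})


def normalize(text: pd.Series) -> pd.Series:
    """Lowercase, drop 'remote' tags and punctuation, collapse whitespace."""
    return (
        text.str.lower()
        .str.replace(r"\bremote\b", " ", regex=True)
        .str.replace(r"[^a-z0-9]+", " ", regex=True)
        .str.strip()
    )


jobs["dedupe_key"] = normalize(jobs["company"]) + "|" + normalize(jobs["role_title"])

# Keep the most recent posting within each company/title key
deduped = (
    jobs.sort_values("days_since_post")
    .drop_duplicates(subset="dedupe_key", keep="first")
    .sort_index()
)
print(deduped[["company", "role_title", "days_since_post"]])
```

Keeping the most recent posting is one possible tie-break; a manual review of near-duplicates is still worthwhile, since titles can differ in ways a simple key will miss.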


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** Randomizing at cluster level (company + job title).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see item 8 below)
4) *Ensuring the ceteris paribus assumption* (that is, ensuring all else is the same between treatment and control besides the treatment). **Discussion:** Use the exact resume and cover letter provided, unchanged, for both groups.
5) *Cross-contamination for treatment and control groups.* **Discussion:** I don't think we'll have to worry about this, unless a recruiter leaves one company and goes to another.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only running one testing period, so this point is moot. If I did use graduated roll-outs, I could simply analyze the final period.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem if I applied for a role with a company before conducting the experiment. I could adjust for this by running the experiment longer, but that's not practical given my current situation. Instead, I will include a categorical segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.
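
The post-assignment check from items 3 and 8 could look something like the sketch below: for each segment, compare the share of jobs that landed in each group. The rows are hypothetical; the column names (`is_repeat_company`, `job_posted_pin`, `ab_split`) match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical assignments for illustration; column names match the notebook
assigned = pd.DataFrame({
    "is_repeat_company": [0, 0, 0, 0, 1, 1, 1, 1],
    "job_posted_pin": ["7days", "7days", "30days", "30days"] * 2,
    "ab_split": [0, 1, 0, 1, 0, 1, 1, 0],
})

for segment in ["is_repeat_company", "job_posted_pin"]:
    # Share of each segment level landing in group 0 vs. group 1
    shares = pd.crosstab(assigned[segment], assigned["ab_split"], normalize="index")
    print(f"\n{segment}")
    print(shares.round(2))
```

Rows far from an even split would flag a segment that random assignment failed to balance.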

# Data Processing

In [None]:
!pwd

In [19]:
import os
import numpy as np
import pandas as pd
import random
from IPython.display import FileLink

#from sklearn.model_selection import StratifiedShuffleSplit <--this didn't work, because some groupings only have 1 job role.

#import the job sampling data
data_folder = "/notebooks/resume_experiment/data"
dataf = "modified_data.csv"
data_file = os.path.join(data_folder, dataf)

print(data_file)

/notebooks/resume_experiment/data/modified_data.csv


In [2]:
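raw_df = pd.read_csv(data_file, low_memory=False)
# Drop rows without a role title and rows flagged for deletion
raw_df = raw_df.dropna(axis=0, subset=["role_title"])
raw_df = raw_df.loc[raw_df['DELETE Row'] != True].copy()  # copy so the new column is added to an independent frame
# Stratification key: repeat/new company + posting-recency bin + role category
raw_df['sorting_hat_col'] = raw_df['is_repeat_company'].astype('str')+"_"+raw_df['job_posted_pin']+"_"+raw_df['role_cat']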

display(raw_df.sample(5))
display(raw_df.loc[raw_df.loc[:,"sorting_hat_col"].isnull()].head(10))

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,Applied Jobs Check,DELETE Row,Check Unique Job/Co,job_posted_pin,role_cat,sorting_hat_col
146,Staff Data Scientist - Remote,Empower,"11/11/24, 1:49 AM",http://www.linkedin.com/jobs/view/3980785534,90.0,0.0,8/17/2024,,,1,More30,Data Scientist,0.0_More30_Data Scientist
359,Lead Insights Analyst,Billtrust,"11/9/24, 10:14 AM",http://www.linkedin.com/jobs/view/4029850547,14.0,0.0,11/1/2024,,,1,14days,Manager,0.0_14days_Manager
77,"Lead Data Scientist, Personalization Analytics",Movable Ink,"11/9/24, 10:31 AM",http://www.linkedin.com/jobs/view/4049955359,7.0,0.0,11/8/2024,,,1,7days,Manager,0.0_7days_Manager
383,Lead Data Scientist,Better Life Partners,"11/5/24, 9:30 AM",http://www.linkedin.com/jobs/view/4031469435,30.0,0.0,10/16/2024,,,1,30days,Manager,0.0_30days_Manager
341,Sr. Data Scientist - Product Analytics (Remote...,Smartsheet,"11/9/24, 10:14 AM",http://www.linkedin.com/jobs/view/4034556325,7.0,0.0,11/8/2024,,,1,7days,Data Scientist,0.0_7days_Data Scientist


Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,Applied Jobs Check,DELETE Row,Check Unique Job/Co,job_posted_pin,role_cat,sorting_hat_col


In [None]:
#build a list of the companies that are in our dataset. include new vs. repeat as a dimension
companies = raw_df.copy().filter(['company','is_repeat_company','days_since_post'])
companies['counts'] = 1
agg_mets = {'counts':"count", 'days_since_post':"mean"}
companies = companies.groupby(['company','is_repeat_company']).agg(agg_mets).reset_index()
display(companies.head())

In [None]:
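#ensure company lists are mutually exclusive
# Assumption: new_cluster / repeat_cluster are the new- and repeat-company rows of `companies`
new_cluster = companies.loc[companies['is_repeat_company'] == 0]
repeat_cluster = companies.loc[companies['is_repeat_company'] == 1]
new_companies = set(new_cluster['company'])
repeat_companies = set(repeat_cluster['company'])

# Compare the two sets of company names
is_mutually_exclusive = new_companies.isdisjoint(repeat_companies)
assert is_mutually_exclusive, "Companies overlap between new and repeat."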

In [33]:
def sortingHat(data, sampling_col, stratification_col, split_percent=0.5, seed=42):
    """Randomly assign each row to group 0 or 1 within each stratum.

    Note: sampling_col and split_percent are accepted but not used yet; every
    row gets an independent 50/50 draw. Adds an 'ab_split' column to `data` in place.
    """
    uni_strats = data[stratification_col].unique()
    rng = np.random.default_rng(seed)

    # Create a new column for the splits
    data['ab_split'] = np.nan

    for s in uni_strats:
        sorting_df = data.loc[data[stratification_col] == s]
        nrows = sorting_df.shape[0]
        # Draw 0 or 1 for each row in this stratum (endpoint=True makes high inclusive)
        split = pd.Series(rng.integers(low=0, high=1, size=nrows, endpoint=True), index=sorting_df.index)
        # Assign the split values to the DataFrame
        data.loc[sorting_df.index, 'ab_split'] = split

    # Show how many roles landed in each group
    display(data.filter(['ab_split','role_title']).groupby('ab_split').count())

    return data

df = sortingHat(raw_df, 'company', 'sorting_hat_col', split_percent=0.5)
split_path = os.path.join(data_folder,"split.csv")
df.to_csv(split_path)
    

Unnamed: 0_level_0,role_title
ab_split,Unnamed: 1_level_1
0.0,142
1.0,135
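
Because `sortingHat` draws each row independently, group sizes within a stratum can drift apart. Below is a minimal sketch of one alternative, not the method used in this notebook: alternate the two groups within each stratum and shuffle, so counts differ by at most one per stratum, including strata with a single job. The function name `balanced_split` and the toy data are hypothetical, and the sketch assumes the stratification column has no missing values.

```python
import numpy as np
import pandas as pd


def balanced_split(data, stratification_col, seed=42):
    """Return a 0/1 Series split as evenly as possible within each stratum."""
    rng = np.random.default_rng(seed)
    split = pd.Series(np.nan, index=data.index, name="ab_split")
    for _, idx in data.groupby(stratification_col).groups.items():
        n = len(idx)
        # Alternate groups, starting from a random one so odd-sized strata
        # (including strata with a single job) do not always favor one group
        arms = (np.arange(n) + rng.integers(0, 2)) % 2
        split.loc[idx] = rng.permutation(arms)
    return split.astype(int)


# Toy strata of size 6, 3, and 1 for illustration only
toy = pd.DataFrame({"sorting_hat_col": ["a"] * 6 + ["b"] * 3 + ["c"]})
toy["ab_split"] = balanced_split(toy, "sorting_hat_col")
print(toy.groupby(["sorting_hat_col", "ab_split"]).size())
```

This trades fully independent draws for tighter balance; it still assigns individual rows rather than company-level clusters.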


In [24]:
#display(FileLink('./split.csv'))

In [None]:


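# Cluster-level assignment sketch. Assumptions: a cluster is a unique
# (company, role_title) pair, and the seed and output file names are placeholders.
cluster_cols = ['company', 'role_title']
rng = np.random.default_rng(42)
clusters = raw_df[cluster_cols].drop_duplicates().reset_index(drop=True)

# Generate a random number, 1 or 2, for each cluster: 1 = control, 2 = treatment
draws = rng.integers(low=1, high=2, size=len(clusters), endpoint=True)
clusters['group'] = np.where(draws == 1, 'control', 'treatment')
assigned = raw_df.merge(clusters, on=cluster_cols, how='left')

# Save the assignment to one file per group
for group in ['control', 'treatment']:
    assigned.loc[assigned['group'] == group].to_csv(os.path.join(data_folder, f"{group}.csv"), index=False)

# Report how many jobs and how many clusters are in each group
print(assigned['group'].value_counts())
print(clusters['group'].value_counts())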