tldr - This data has been downloaded from my "saved jobs" on LinkedIn. This notebook helps process and sort the data into our groups for the experiment.

# Summary
I'm running an experiment on how well two different resume's perform. Does a professionally written resume perform better than one I've written myself?			

# Hypothesis
$H_{0} = \text{The resume and cover letter provided by Haley Stock perform the same as my own self-written resume and cover letter.}$

$H_{a} = \text{The professionally-written resume and cover letter have a different interview invitation rate than self-written resume and cover letter.}$

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |


# Data Collection Process
**LinkedIn:** I searched for "data" and remote in US roles. I then saved the roles that met the criteria below to a csv (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (significantly different than my experience would allow)<br>No Product Manager roles (not interested in product management)<br>No Director level roles (largely above my level) |
| **Type**                | Full time and contracted                                                                                                                                             |
| **Duplicate Job Roles** | Sometimes companies post multiple versions of a job (by location or reposts after a few weeks have gone by). Will do my best to dedupe these in the final "saved jobs" list. |


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** We will randomizing at cluster level (repeat/not yet applied company + Analyst/Scientist/or Manager role).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see "8)" below)
   - Role type: Analyst, Scientist, or Manager
4) *Ensure the Ceteris Parabus assumption* (that is, ensure all else is same between treatment/control besides the treatment). **Discussion:** Use the exact resume and cover letter provided for both.
5) *Cross-contamination for treatment and control groups.* **Discussion:** This could be an issue if different role types have the same recruiter.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only performing one testing period, so this point is moot. Could simply look at final period if so.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem, if I previously applied for a role with a company before conducting the experiment. I can adjust for this by running the experiment longer, but that's not really helpful given my current situation. Instead, I will include a category segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.

# Data Processing

In [1]:
_ = """
To-Do:
- sortingHat: make it sort at stratification cluster level.


"""

In [30]:
import os
import numpy as np
import pandas as pd
import random
from IPython.display import FileLink

#from sklearn.model_selection import StratifiedShuffleSplit <--this didn't work, because some groupings only have 1 job role.

#import the job sampling data
data_folder = "/notebooks/resume_experiment/data"
dataf = "modified_data.csv"
data_file = os.path.join(data_folder, dataf)

print(data_file)

/notebooks/resume_experiment/data/modified_data.csv


In [50]:
raw_df = pd.read_csv(data_file, low_memory=False)
raw_df = raw_df.dropna(axis=0,subset="role_title")
raw_df = raw_df.loc[raw_df.loc[:,'DELETE Row'] != True]
#raw_df['sorting_hat_col'] = raw_df['is_repeat_company'].astype('str')+"_"+raw_df['job_posted_pin']+"_"+raw_df['role_cat']
raw_df['sorting_hat_col'] = raw_df['is_repeat_company'].astype('str')+"_"+raw_df['role_cat']
raw_df = raw_df.drop(["Applied Jobs Check", "DELETE Row", "Check Unique Job/Co"], axis=1)

display(raw_df.sample(5))
display(raw_df.loc[raw_df.loc[:,"sorting_hat_col"].isnull()].head(10))

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col
175,Lead Data Scientist,Level Data,"11/9/24, 10:43 AM",http://www.linkedin.com/jobs/view/3977077193,120.0,0.0,7/18/2024,More30,Manager,0.0_Manager
321,"Manager, Data Strategy + Operations",brightwheel,"11/8/24, 8:34 AM",http://www.linkedin.com/jobs/view/4026720902,14.0,0.0,11/1/2024,14days,Manager,0.0_Manager
229,Staff Data Scientist - ML Engineer,SentiLink,"9/19/24, 6:56 PM",http://www.linkedin.com/jobs/view/4026706612,14.0,0.0,11/1/2024,14days,Data Scientist,0.0_Data Scientist
231,Data Scientist,Stripe,"11/5/24, 9:02 AM",http://www.linkedin.com/jobs/view/4032373471,7.0,1.0,11/8/2024,7days,Data Scientist,1.0_Data Scientist
89,Senior Data Scientist - School,McGraw Hill,"11/9/24, 10:37 AM",http://www.linkedin.com/jobs/view/4048889493,7.0,0.0,11/8/2024,7days,Data Scientist,0.0_Data Scientist


Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col


In [51]:
#build a list of the companies that are in our dataset. include new vs. repeat as a dimension
companies = raw_df.copy().filter(['company','is_repeat_company','days_since_post'])
companies['counts'] = 1
agg_mets = {'counts':"count", 'days_since_post':"mean"}
companies = companies.groupby(['company','is_repeat_company']).agg(agg_mets).reset_index()
display(companies.head())

Unnamed: 0,company,is_repeat_company,counts,days_since_post
0,6sense,0.0,1,7.0
1,8x8,0.0,1,14.0
2,Abercrombie & Fitch Co.,0.0,1,14.0
3,Abnormal Security,0.0,1,14.0
4,Age of Learning,0.0,1,6.0


In [52]:
#ensure company lists are mutually exclusive
new_companies = set(new_cluster['company'])
repeat_companies = set(repeat_cluster['company'])

is_mutually_exclusive = new_companies.isdisjoint(repeat_cluster)
assert is_mutually_exclusive, "Companies overlapp between new and repeat."

NameError: name 'new_cluster' is not defined

In [58]:
def sortingHat(data, stratification_col, split_percent=0.5, seed=42):
    #Create a list of unique stratification segments.
    uni_strats = pd.DataFrame(data[stratification_col].unique(), columns=['sorting_hat_col'])
    
    #Setting our seed
    rng = np.random.default_rng(seed)
    
    # Create a new column for the splits
    for s in uni_strats:
        strat_df = data.loc[data.loc[:,stratification_col]==s]
        nrows = strat_df.shape[0]
        strat_df['ab_split'] = pd.Series(rng.integers(low=0, high=1, size=nrows, endpoint=True), index=strat_df.index)
    #uni_strats['ab_split'] = pd.Series(rng.integers(low=0, high=1, size=uni_strats.shape[0], endpoint=True), index=uni_strats.index)
    
    return(uni_strats)

sort_split = sortingHat(raw_df, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
display(sort_split)



    

Unnamed: 0,sorting_hat_col
3,0.0_Data Analyst
1,0.0_Data Scientist
0,0.0_Manager
5,1.0_Data Analyst
2,1.0_Data Scientist
4,1.0_Manager


In [54]:
split_df = pd.merge(raw_df, sort_split, how='outer', on='sorting_hat_col').sort_values(by="sorting_hat_col")
display(split_df)

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col,ab_split
0,Sr. Data Analyst,InStride,"11/9/24, 10:42 AM",http://www.linkedin.com/jobs/view/4064739945,7.0,0.0,11/8/2024,7days,Data Analyst,0.0_Data Analyst,0
28,Customer Success Analytics Analyst,GitLab,"11/7/24, 10:21 AM",http://www.linkedin.com/jobs/view/4070530323,7.0,0.0,11/8/2024,7days,Data Analyst,0.0_Data Analyst,0
29,Data Analyst/Sr Data Analyst,Mobile Health,"11/9/24, 10:28 AM",http://www.linkedin.com/jobs/view/4044609003,30.0,0.0,10/16/2024,30days,Data Analyst,0.0_Data Analyst,0
30,Senior Product Data Analyst,Customer.io,"11/9/24, 10:23 AM",http://www.linkedin.com/jobs/view/4011266366,7.0,0.0,11/8/2024,7days,Data Analyst,0.0_Data Analyst,0
31,Senior Product Analyst,Greenlight,"11/8/24, 8:39 AM",http://www.linkedin.com/jobs/view/4055136369,21.0,0.0,10/25/2024,30days,Data Analyst,0.0_Data Analyst,0
...,...,...,...,...,...,...,...,...,...,...,...
264,Client Analytics Manager,Mozilla,"11/9/24, 10:43 AM",http://www.linkedin.com/jobs/view/4006108718,1.0,1.0,11/14/2024,7days,Manager,1.0_Manager,0
262,Senior Manager - Data Science,McAfee,"11/9/24, 10:32 AM",http://www.linkedin.com/jobs/view/4041683824,14.0,1.0,11/1/2024,14days,Manager,1.0_Manager,0
275,"Lead, Advanced Analytics, Global Markets Insights",Airbnb,"10/24/24, 9:38 PM",http://www.linkedin.com/jobs/view/4053479748,7.0,1.0,11/8/2024,7days,Manager,1.0_Manager,0
268,"Senior Manager, Analytics - Product & Revenue",Spring Health,"11/7/24, 10:21 AM",http://www.linkedin.com/jobs/view/4034906029,7.0,1.0,11/8/2024,7days,Manager,1.0_Manager,0


In [55]:
split_path = os.path.join(data_folder,"split.csv")
split_df.to_csv(split_path)
#print(len(grps))
#print(grps)

# Evaluate the Splits
Counting distribution of each split:
- Companies: applying again vs. never applied before.
- Role types: Analyst, Scientist, Manager
- Time since job post.

In [56]:
def evalSplit(data, split_col):
    df = data.filter([split_col,'role_title','ab_split']).groupby([split_col, 'ab_split']).count()
    return(df)

total_split = evalSplit(split_df, 'sorting_hat_col').reset_index()
total_split = total_split.pivot(index="sorting_hat_col", columns="ab_split", values="role_title")
display(total_split)

company_split = evalSplit(split_df, 'is_repeat_company')
display(company_split)

role_type_split = evalSplit(split_df, 'role_cat')
display(role_type_split)

time_since_split = evalSplit(split_df, 'job_posted_pin')
display(time_since_split)

ab_split,0,1
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,51.0,
0.0_Data Scientist,,97.0
0.0_Manager,80.0,
1.0_Data Analyst,,10.0
1.0_Data Scientist,,24.0
1.0_Manager,15.0,


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0,131
0.0,1,97
1.0,0,15
1.0,1,34


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0,51
Data Analyst,1,10
Data Scientist,1,121
Manager,0,95


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1
14days,0,41
14days,1,33
30days,0,27
30days,1,12
7days,0,74
7days,1,82
More30,0,4
More30,1,4
