TL;DR: This data was downloaded from my "saved jobs" on LinkedIn. This notebook processes the job postings and sorts them into the two groups for the experiment.

# Summary
I'm running an experiment on how well two different resumes perform. Does a professionally written resume perform better than one I've written myself?

# Hypothesis
$H_{0}: \text{The professionally written resume and cover letter (provided by Haley Stock) have the same interview invitation rate as my self-written resume and cover letter.}$

$H_{a}: \text{The professionally written resume and cover letter have a different interview invitation rate than my self-written resume and cover letter.}$

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |
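The P1 and P2 values in the table imply a required sample size per group. Below is a minimal sketch of that calculation, assuming a two-sided two-proportion z-test with a 5% significance level and 80% power. Both of those settings are assumptions, not values taken from this notebook, and `sample_size_two_proportions` is a hypothetical helper name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (normal approximation).

    alpha and power are assumed defaults, not values from the notebook.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under H0
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# P1 = 8% and P2 = 13% come from the table above.
n_per_group = sample_size_two_proportions(0.08, 0.13)
print(f"Applications needed per group: {n_per_group}")
```

A smaller detectable difference (P2 closer to P1) raises the required sample size quickly, which is why cutting sample collection short is a risk.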


# Data Collection Process
**LinkedIn:** I searched for "data" roles that are remote and based in the US. I then saved the roles that met the criteria below and exported them to a CSV (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (too far from my experience)<br>No Product Manager roles (not interested in product management)<br>No Director-level roles (largely above my level)<br>Minimum salary of $100k/yr |
| **Type**                | Full-time and contract                                                                                                                                               |
| **Duplicate Job Roles** | Companies sometimes post multiple versions of a job (one per location, or reposts after a few weeks). I will dedupe these as best I can in the final "saved jobs" list. |
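The dedupe step in the last row could be sketched as follows. This assumes a duplicate means "same company and same role title after trimming whitespace and lowercasing", and that the most recently saved copy is the one to keep. Both rules are assumptions, and the rows are made-up examples that only reuse the `company`, `role_title`, and `date_saved` column names.

```python
import pandas as pd

# Hypothetical saved-jobs rows; only the column names mirror the real data.
jobs = pd.DataFrame({
    "company": ["Acme", "Acme", "Acme", "Globex"],
    "role_title": ["Senior Data Analyst", "senior data analyst ",
                   "Data Scientist", "Senior Data Analyst"],
    "date_saved": ["2024-12-01", "2024-12-10", "2024-12-05", "2024-12-03"],
})

deduped = (
    jobs.assign(
        # Normalized keys so that case and stray spaces do not hide duplicates.
        company_key=jobs["company"].str.strip().str.lower(),
        title_key=jobs["role_title"].str.strip().str.lower(),
        saved_at=pd.to_datetime(jobs["date_saved"]),
    )
    .sort_values("saved_at")
    # Keep the most recently saved copy of each (company, title) pair.
    .drop_duplicates(subset=["company_key", "title_key"], keep="last")
    .drop(columns=["company_key", "title_key", "saved_at"])
    .sort_index()
)
print(deduped)
```

Reposts with a reworded title would slip past this rule, so a manual pass over the remaining list is still worthwhile.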


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** We will randomize at the cluster level (repeat vs. not-yet-applied company, combined with Analyst, Scientist, or Manager role).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see "8)" below)
   - Role type: Analyst, Scientist, or Manager
4) *Ensure the ceteris paribus assumption* (that is, ensure all else is the same between treatment and control besides the treatment). **Discussion:** Use the exact resume and cover letter provided for both.
5) *Cross-contamination for treatment and control groups.* **Discussion:** This could be an issue if different role types have the same recruiter.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only performing one testing period, so this point is moot. If there were several periods, I could simply look at the final one.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem if I applied for a role with a company before conducting the experiment. I could adjust for this by running the experiment longer, but that's not really helpful given my current situation. Instead, I will include a category segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.
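One way to "equally distribute" segments across A/B, as items 3 and 8 call for, is to shuffle a balanced set of labels inside each segment instead of flipping a coin per row. The sketch below is an illustration under that assumption and is not the `sortingHat` function used later in this notebook. `balanced_assign` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd


def balanced_assign(df, strata_col, seed=42):
    """Return a 0/1 Series whose group counts differ by at most one per stratum."""
    rng = np.random.default_rng(seed)
    groups = pd.Series(-1, index=df.index, dtype=int)
    for _, idx in df.groupby(strata_col).groups.items():
        # Alternating labels give an even split (or one extra 0 when n is odd).
        labels = np.arange(len(idx)) % 2
        # A coin flip decides which group receives the extra row.
        if rng.integers(0, 2) == 1:
            labels = 1 - labels
        # Shuffle so that row order within the stratum does not decide the group.
        groups.loc[idx] = rng.permutation(labels)
    return groups


# Hypothetical postings: five in one stratum, four in another.
jobs = pd.DataFrame(
    {"sorting_hat_col": ["0.0_Manager"] * 5 + ["1.0_Data Analyst"] * 4}
)
jobs["ab_split"] = balanced_assign(jobs, "sorting_hat_col")
counts = jobs.groupby(["sorting_hat_col", "ab_split"]).size().unstack(fill_value=0)
print(counts)
```

With independent coin flips, small strata can end up lopsided by chance. Shuffled balanced labels cap the imbalance at one row per stratum while keeping the assignment random.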

# Data Processing

In [4]:
_ = """
To-Do:
- (FIXED!) sortingHat: make it sort at stratification cluster level.


"""

In [5]:
# Import our python packages.
import os
import numpy as np
import pandas as pd
import random
from IPython.display import FileLink, Markdown, display

# sklearn's StratifiedShuffleSplit didn't work here, because some groupings only have 1 job role.
#from sklearn.model_selection import StratifiedShuffleSplit

# Import the job sampling data.
data_folder = "/notebooks/resume_experiment/data"
dataf = "modified_data.csv"
dataf2 = "modified_data II.csv"
dataf3 = "modified_data III.csv"
dataf4 = "modified_data IV.csv"
dataf5 = "modified_data V.csv"
dataf6 = "modified_data VI.csv"

data_file = os.path.join(data_folder, dataf)
data_file2 = os.path.join(data_folder, dataf2)
data_file3 = os.path.join(data_folder, dataf3)
data_file4 = os.path.join(data_folder, dataf4)
data_file5 = os.path.join(data_folder, dataf5)
data_file6 = os.path.join(data_folder, dataf6)

print(data_file)

/notebooks/resume_experiment/data/modified_data.csv


In [6]:
def procData(fname):
    """Load one saved-jobs CSV, drop unusable rows, and build the stratification column."""
    data_file = os.path.join(data_folder, fname)
    # Read in our job postings data from the given file.
    raw_df = pd.read_csv(data_file, low_memory=False)
    # Drop rows without a role title and rows flagged for deletion.
    raw_df = raw_df.dropna(axis=0, subset=["role_title"])
    raw_df = raw_df.loc[raw_df['DELETE Row'] != True].copy()
    # Base the assignment on whether a repeat company and the type of role (data scientist, analyst, or manager/lead).
    raw_df['sorting_hat_col'] = raw_df['is_repeat_company'].astype('str') + "_" + raw_df['role_cat']
    # Remove the manual check columns that are no longer needed after cleaning.
    raw_df = raw_df.drop(["Applied Jobs Check", "DELETE Row", "Check Unique Job/Co"], axis=1)
    return raw_df

In [7]:
# Load and clean each data file with procData (defined above).

raw_df = procData(dataf)
raw_df2 = procData(dataf2)
raw_df3 = procData(dataf3)
raw_df4 = procData(dataf4)
raw_df5 = procData(dataf5)
raw_df6 = procData(dataf6)
display(raw_df6.head())

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col
0,"Senior Data Analyst, Member Experience",Spring Health,"12/11/24, 11:18 AM",http://www.linkedin.com/jobs/view/4093698040,,1.0,,,Data Analyst,1.0_Data Analyst
1,Senior Manager Analytics,Brigit,"12/5/24, 12:58 PM",http://www.linkedin.com/jobs/view/4091232536,,0.0,,,Manager,0.0_Manager
2,"Senior Analyst, Credit Analytics",Upstart,"12/11/24, 11:17 AM",http://www.linkedin.com/jobs/view/4069178854,,1.0,,,Data Analyst,1.0_Data Analyst
3,"Senior Analyst, Data Science Underwriting Modeler",Liberty Mutual Insurance,"12/11/24, 11:10 AM",http://www.linkedin.com/jobs/view/4094951267,,0.0,,,Data Scientist,0.0_Data Scientist
4,Senior Data Analyst (Performance Marketing),Acceleration Partners,"12/11/24, 11:18 AM",http://www.linkedin.com/jobs/view/4094953764,,0.0,,,Data Analyst,0.0_Data Analyst


In [8]:
#build a list of the companies that are in our dataset. include new vs. repeat as a dimension
companies = raw_df.copy().filter(['company','is_repeat_company','days_since_post'])
companies['counts'] = 1
agg_mets = {'counts':"count", 'days_since_post':"mean"}
companies = companies.groupby(['company','is_repeat_company']).agg(agg_mets).reset_index()
display(companies.head())

Unnamed: 0,company,is_repeat_company,counts,days_since_post
0,6sense,0.0,1,7.0
1,8x8,0.0,1,14.0
2,Abercrombie & Fitch Co.,0.0,1,14.0
3,Abnormal Security,0.0,1,14.0
4,Age of Learning,0.0,1,6.0


In [9]:
# Create a function to sort our job postings into two groups using a stratification column.
def sortingHat(data, stratification_col, split_percent=0.5, seed=42):
    """
    docstr by ChatGPT
    Randomly assign job postings to two groups within each stratification group.

    For each unique value in the stratification column, every row with that value
    gets an independent, equal-probability random draw of 0 or 1 (an A/B split)
    from a generator seeded with `seed`. Group sizes within a stratum are
    therefore only approximately equal, not guaranteed to match.

    Args:
        data (pd.DataFrame): The input DataFrame containing job postings. It is
            modified in place.
        stratification_col (str): The column name to use for stratification. Rows
            are assigned one stratification group at a time.
        split_percent (float, optional): Validated to lie between 0 and 1, but not
            otherwise used; the draw is always 50/50 (default is 0.5).
        seed (int, optional): The random seed for reproducibility (default is 42).

    Returns:
        pd.DataFrame: The same DataFrame with an additional column `'ab_split'`,
        indicating group assignment (0.0 or 1.0) for each row.

    Raises:
        KeyError: If `stratification_col` is not a column in the input DataFrame.
        ValueError: If `split_percent` is not between 0 and 1.

    Examples:
        >>> import pandas as pd
        >>> import numpy as np
        >>> raw_df = pd.DataFrame({
        ...     "job_id": [1, 2, 3, 4, 5],
        ...     "department": ["HR", "Engineering", "HR", "Engineering", "HR"]
        ... })
        >>> sorted_df = sortingHat(raw_df, "department", split_percent=0.5)
        >>> list(sorted_df.columns)
        ['job_id', 'department', 'ab_split']
        >>> bool(sorted_df["ab_split"].isin([0.0, 1.0]).all())
        True
    """
    # Ensure the stratification column exists in the DataFrame
    if stratification_col not in data.columns:
        raise KeyError(f"Column '{stratification_col}' not found in DataFrame.")

    # Validate split_percent
    if not (0 <= split_percent <= 1):
        raise ValueError("split_percent must be between 0 and 1.")
    
    # Create a list of unique stratification segments.
    uni_strats = data[stratification_col].unique()

    # Seed the random number generator.
    rng = np.random.default_rng(seed)

    # Create a new column for the splits.
    data['ab_split'] = np.nan

    for s in uni_strats:
        # Filter rows for the current stratification group.
        sorting_df = data.loc[data[stratification_col] == s]
        nrows = sorting_df.shape[0]

        # Draw 0 or 1 for each row (high=1 is inclusive because endpoint=True) and align indices.
        split = pd.Series(rng.integers(low=0, high=1, size=nrows, endpoint=True), index=sorting_df.index)

        # Assign the split values to the DataFrame.
        data.loc[sorting_df.index, 'ab_split'] = split

    return data

sort_split = sortingHat(raw_df, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
sort_split2 = sortingHat(raw_df2, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
sort_split3 = sortingHat(raw_df3, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
sort_split4 = sortingHat(raw_df4, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
sort_split5 = sortingHat(raw_df5, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")
sort_split6 = sortingHat(raw_df6, 'sorting_hat_col', split_percent=0.5).sort_values(by="sorting_hat_col")

display(sort_split6)



    

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col,ab_split
118,Analytics and Reporting Engineer,Five9,"12/11/24, 11:10 AM",http://www.linkedin.com/jobs/view/4079269829,,0.0,,,Data Analyst,0.0_Data Analyst,1.0
62,Senior Product Analyst,Maze,"12/11/24, 11:11 AM",http://www.linkedin.com/jobs/view/4093483506,,0.0,,,Data Analyst,0.0_Data Analyst,1.0
172,Senior Reporting Analyst (Remote),ICF,"12/11/24, 11:19 AM",http://www.linkedin.com/jobs/view/4088781214,,0.0,,,Data Analyst,0.0_Data Analyst,0.0
59,Claims Intelligence Data Insights Analyst,Allstate,"12/10/24, 2:41 PM",http://www.linkedin.com/jobs/view/4093121198,,0.0,,,Data Analyst,0.0_Data Analyst,1.0
174,Vertical Market Data Analyst,Telit Cinterion,"12/9/24, 10:26 AM",http://www.linkedin.com/jobs/view/4078019556,,0.0,,,Data Analyst,0.0_Data Analyst,1.0
...,...,...,...,...,...,...,...,...,...,...,...
138,"Manager, Fraud Analytics",SeatGeek,"11/25/24, 6:15 PM",http://www.linkedin.com/jobs/view/4070132055,,1.0,,,Manager,1.0_Manager,0.0
290,Business Intelligence Lead,Humana,"11/14/24, 10:49 AM",http://www.linkedin.com/jobs/view/4069772691,,1.0,,,Manager,1.0_Manager,1.0
95,Marketing Analytics Manager,Webflow,"12/11/24, 10:34 AM",http://www.linkedin.com/jobs/view/4092497698,,1.0,,,Manager,1.0_Manager,1.0
186,Lead Data Scientist - Remote (Data Scientist IV),Myriad Genetics,"12/11/24, 11:09 AM",http://www.linkedin.com/jobs/view/4094267552,,1.0,,,Manager,1.0_Manager,0.0


In [24]:
#split_df = pd.merge(raw_df, sort_split, how='outer', on='sorting_hat_col').sort_values(by="sorting_hat_col")
#display(split_df)

In [10]:
split_path = os.path.join(data_folder,"split.csv")
sort_split.to_csv(split_path)

split_path2 = os.path.join(data_folder, "split2.csv")
sort_split2.to_csv(split_path2)

split_path3 = os.path.join(data_folder, "split3.csv")
sort_split3.to_csv(split_path3)

split_path4 = os.path.join(data_folder, "split4.csv")
sort_split4.to_csv(split_path4)

split_path5 = os.path.join(data_folder, "split5.csv")
sort_split5.to_csv(split_path5)

split_path6 = os.path.join(data_folder, "split6.csv")
sort_split6.to_csv(split_path6)

# Evaluate the Splits
Count how the postings in each split are distributed across:
- Companies: applying again vs. never applied before.
- Role types: Analyst, Scientist, or Manager.
- Time since job post.
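Raw counts are easier to compare once they are turned into a share per segment. The sketch below is one way to do that, where `split_balance` is a hypothetical helper. The demo rows rebuild two of the strata counts reported for the first split in the output further down (7 vs. 3 and 8 vs. 7).

```python
import pandas as pd


def split_balance(df, segment_col, split_col="ab_split"):
    """Counts per A/B group and the share of rows in group 1, per segment."""
    table = pd.crosstab(df[segment_col], df[split_col])
    # The row total is computed before the new column is added.
    table["share_group_1"] = table[1] / table.sum(axis=1)
    return table


# Rebuild two strata from their reported counts.
rows = (
    [("1.0_Data Analyst", 0)] * 7 + [("1.0_Data Analyst", 1)] * 3
    + [("1.0_Manager", 0)] * 8 + [("1.0_Manager", 1)] * 7
)
demo = pd.DataFrame(rows, columns=["sorting_hat_col", "ab_split"])
balance = split_balance(demo, "sorting_hat_col")
print(balance)
```

A share far from 0.5 in a small segment, such as the 3 of 10 here, flags a stratum where the random draw came out lopsided.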

In [11]:
split_df = pd.read_csv(split_path,low_memory=False)
split_df2 = pd.read_csv(split_path2,low_memory=False)
split_df3 = pd.read_csv(split_path3, low_memory=False)
split_df4 = pd.read_csv(split_path4, low_memory=False)
split_df5 = pd.read_csv(split_path5, low_memory=False)
split_df6 = pd.read_csv(split_path6, low_memory=False)

In [12]:
def evalSplit(data, split_col):
    """Count job postings for each (split_col, ab_split) combination."""
    df = data.filter([split_col,'role_title','ab_split']).groupby([split_col, 'ab_split']).count()
    return df

def showEvalSplit(data):
    """Display A/B counts by stratum, repeat-company flag, role type, and posting recency."""
    # Strata as rows, A/B groups as columns.
    total_split = evalSplit(data, 'sorting_hat_col').reset_index()
    total_split = total_split.pivot(index="sorting_hat_col", columns="ab_split", values="role_title")
    display(total_split)

    company_split = evalSplit(data, 'is_repeat_company')
    display(company_split)

    role_type_split = evalSplit(data, 'role_cat')
    display(role_type_split)

    # groupby drops rows with a missing job_posted_pin, so this table can be empty.
    time_since_split = evalSplit(data, 'job_posted_pin')
    display(time_since_split)
    

In [13]:
showEvalSplit(split_df)
showEvalSplit(split_df2)
showEvalSplit(split_df3)
display(Markdown("## Split 4"))
showEvalSplit(split_df4)
display(Markdown("## Split 5"))
showEvalSplit(split_df5)
display(Markdown("## Split 6"))
showEvalSplit(split_df6)


ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,27,24
0.0_Data Scientist,46,51
0.0_Manager,39,41
1.0_Data Analyst,7,3
1.0_Data Scientist,15,9
1.0_Manager,8,7


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,112
0.0,1.0,116
1.0,0.0,30
1.0,1.0,19


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,34
Data Analyst,1.0,27
Data Scientist,0.0,61
Data Scientist,1.0,60
Manager,0.0,47
Manager,1.0,48


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1
14days,0.0,40
14days,1.0,34
30days,0.0,19
30days,1.0,20
7days,0.0,79
7days,1.0,77
More30,0.0,4
More30,1.0,4


ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,5.0,6.0
0.0_Data Scientist,15.0,19.0
0.0_Manager,6.0,5.0
1.0_Data Analyst,2.0,4.0
1.0_Data Scientist,2.0,2.0
1.0_Manager,,1.0


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,26
0.0,1.0,30
1.0,0.0,4
1.0,1.0,7


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,7
Data Analyst,1.0,10
Data Scientist,0.0,17
Data Scientist,1.0,21
Manager,0.0,6
Manager,1.0,6


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1


ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,13,15
0.0_Data Scientist,13,19
0.0_Manager,5,9
1.0_Data Analyst,10,3
1.0_Data Scientist,8,13
1.0_Manager,3,2


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,31
0.0,1.0,43
1.0,0.0,21
1.0,1.0,18


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,23
Data Analyst,1.0,18
Data Scientist,0.0,21
Data Scientist,1.0,32
Manager,0.0,8
Manager,1.0,11


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1


## Split 4

ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,20,14
0.0_Data Scientist,34,38
0.0_Manager,24,26
1.0_Data Analyst,6,11
1.0_Data Scientist,10,9
1.0_Manager,6,4


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,78
0.0,1.0,78
1.0,0.0,22
1.0,1.0,24


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,26
Data Analyst,1.0,25
Data Scientist,0.0,44
Data Scientist,1.0,47
Manager,0.0,30
Manager,1.0,30


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1


## Split 5

ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,18,21
0.0_Data Scientist,21,29
0.0_Manager,6,4
1.0_Data Analyst,5,1
1.0_Data Scientist,7,7
1.0_Manager,1,3


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,45
0.0,1.0,54
1.0,0.0,13
1.0,1.0,11


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,23
Data Analyst,1.0,22
Data Scientist,0.0,28
Data Scientist,1.0,36
Manager,0.0,7
Manager,1.0,7


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1


## Split 6

ab_split,0.0,1.0
sorting_hat_col,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0_Data Analyst,46.0,37.0
0.0_Data Scientist,37.0,39.0
0.0_Manager,31.0,31.0
0.0_Other,,1.0
1.0_Data Analyst,16.0,20.0
1.0_Data Scientist,36.0,30.0
1.0_Manager,24.0,24.0
1.0_Other,1.0,


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
is_repeat_company,ab_split,Unnamed: 2_level_1
0.0,0.0,114
0.0,1.0,108
1.0,0.0,77
1.0,1.0,74


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
role_cat,ab_split,Unnamed: 2_level_1
Data Analyst,0.0,62
Data Analyst,1.0,57
Data Scientist,0.0,73
Data Scientist,1.0,69
Manager,0.0,55
Manager,1.0,55
Other,0.0,1
Other,1.0,1


Unnamed: 0_level_0,Unnamed: 1_level_0,role_title
job_posted_pin,ab_split,Unnamed: 2_level_1
