This data has been downloaded from my "saved jobs" on LinkedIn. This notebook helps process and sort the data into our groups for the experiment.

# Summary
I'm running an experiment on how well two different resumes perform. Does a professionally written resume perform better than one I've written myself?

# Hypothesis
$H_{0}: \text{The professionally written resume and cover letter (provided by Haley Stock) have the same interview invitation rate as my self-written resume and cover letter.}$

$H_{a}: \text{The professionally written resume and cover letter have a different interview invitation rate than my self-written resume and cover letter.}$
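
One common way to test a two-sided hypothesis like this is a two-proportion z-test on the interview invitation rates. The sketch below is an assumption about how the comparison could be run, not a record of the analysis actually used; the function name and arguments are hypothetical, and it treats applications as independent, which ignores the company-level clustering.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(invites_a, apps_a, invites_b, apps_b):
    """Two-sided z-test for a difference in interview invitation rates."""
    rate_a = invites_a / apps_a
    rate_b = invites_b / apps_b
    # Pooled rate under H0 (both resumes have the same invitation rate)
    pooled = (invites_a + invites_b) / (apps_a + apps_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / apps_a + 1 / apps_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

The function takes the number of invites and applications for each group and returns the z statistic and p-value. It is undefined when neither group has any invites, or when every application in both groups is invited, because the pooled standard error is zero.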

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |
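
The P1 and P2 values in the table imply a required sample size. The sketch below uses the standard normal-approximation formula for comparing two proportions. The significance level (0.05), power (0.80), equal group sizes, and independence between applications are all assumptions made here for illustration; they are not stated in this notebook, so the result is a rough guide and not a replacement for the power analysis.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Applications needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# P1 and P2 come from the table above; alpha and power are assumed defaults.
n_per_group = sample_size_per_group(p1=0.08, p2=0.13)
print(n_per_group)
```

Because assignment happens at the company level, the effective sample size is likely smaller than the raw job count, so this number is best read as a lower bound.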


# Data Collection Process
**LinkedIn:** I searched for "data" roles that are remote within the US. I then saved the roles that met the criteria below and exported them to a CSV (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (too different from my experience)<br>No Product Manager roles (not interested in product management)<br>No Director-level roles (largely above my level) |
| **Type**                | Full time and contracted                                                                                                                                             |
| **Duplicate Job Roles** | Sometimes companies post multiple versions of a job (by location or reposts after a few weeks have gone by). Will do my best to dedupe these in the final "saved jobs" list. |
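
A minimal sketch of how the dedupe step could be automated with pandas. It assumes the column names `company`, `role_title`, and `days_since_post` from the CSV header, and it assumes that two rows with the same company and title (ignoring case and extra whitespace) are the same job. Both are guesses; reposts with reworded titles would still need a manual check. The function name `dedupe_postings` is hypothetical.

```python
import pandas as pd


def dedupe_postings(jobs: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (company, role title), preferring the newest posting."""
    jobs = jobs.copy()
    # Normalized key: case- and whitespace-insensitive company + title
    for col in ("company", "role_title"):
        jobs[f"{col}_key"] = (
            jobs[col].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
        )
    key_cols = ["company_key", "role_title_key"]
    deduped = (
        jobs.sort_values("days_since_post")  # newest posting first
        .drop_duplicates(subset=key_cols, keep="first")
        .drop(columns=key_cols)
    )
    # Restore the original row order
    return deduped.sort_index()
```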


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** Randomizing at cluster level (company + job title).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see "8)" below)
4) *Ensure the ceteris paribus assumption* (that is, ensure all else is the same between treatment/control besides the treatment). **Discussion:** Use the exact resume and cover letter provided for both.
5) *Cross-contamination for treatment and control groups.* **Discussion:** I don't think we'll have to worry about this, unless a recruiter leaves one company and goes to another.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only performing one testing period, so this point is moot. If there were multiple periods, I could simply look at the final one.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem if I previously applied for a role with a company before conducting the experiment. I could adjust for this by running the experiment longer, but that isn't practical given my current situation. Instead, I will include a category segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.
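
The post-assignment check described in pitfalls 3 and 8 could look something like the sketch below. It assumes a company-level table with the columns `company`, `counts`, `is_repeat_company`, and `days_since_post`, plus a hypothetical `group` column holding `control` or `treatment`. The function name `balance_table` is also made up for illustration.

```python
import pandas as pd


def balance_table(assignments: pd.DataFrame) -> pd.DataFrame:
    """Summarize each group so imbalances in heavy segments are easy to spot."""
    return assignments.groupby("group").agg(
        companies=("company", "nunique"),
        jobs=("counts", "sum"),
        # Share of companies in the group that I've applied to before
        share_repeat=("is_repeat_company", "mean"),
        # Average posting age across companies (a mean of per-company means)
        mean_days_since_post=("days_since_post", "mean"),
    )
```

If the two rows differ noticeably in `share_repeat` or `mean_days_since_post`, one option is to re-draw the assignment, or to randomize separately within the repeat and new segments.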

# Data Processing

In [8]:
!pwd

/notebooks/resume_experiment


In [1]:
import os
import numpy as np
import pandas as pd
import random

#import the job sampling data
data_folder = "/notebooks/resume_experiment/data"
dataf = "res_xp_data.csv"
data_file = os.path.join(data_folder, dataf)

print(data_file)

/notebooks/resume_experiment/data/res_xp_data.csv


In [10]:
raw_df = pd.read_csv(data_file, low_memory=False)
raw_df = raw_df.loc[raw_df.loc[:,'DELETE Row'] != True]
display(raw_df.head(10))

Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,Applied Jobs Check,DELETE Row,Check Unique Job/Co
0,Data Analytics Manager,Panasonic North America,"11/9/24, 10:32 AM",http://www.linkedin.com/jobs/view/4022291235,7.0,0.0,11/8/2024,,,2
1,Arity- Data Scientist Senior Manager - Arity,Arity,"11/9/24, 10:42 AM",http://www.linkedin.com/jobs/view/4064869221,7.0,0.0,11/8/2024,,,1
2,Data Science Lead Associate,Protective Life,"11/9/24, 10:31 AM",http://www.linkedin.com/jobs/view/4040898277,14.0,0.0,11/1/2024,,,1
3,Data Science Senior Manager,CLARA Analytics,"11/12/24, 11:06 AM",http://www.linkedin.com/jobs/view/4074304576,3.0,0.0,11/12/2024,,,1
4,Lead Data Scientist,Apollo GraphQL,"11/9/24, 10:49 AM",http://www.linkedin.com/jobs/view/4070255734,7.0,0.0,11/8/2024,,,1
5,Senior Product Data Scientist,"Propel, Inc","11/9/24, 10:49 AM",http://www.linkedin.com/jobs/view/4068137912,7.0,0.0,11/8/2024,,,1
6,"Staff Data Scientist (Remote, US)",Grafana Labs,"11/9/24, 10:31 AM",http://www.linkedin.com/jobs/view/4034580358,7.0,1.0,11/8/2024,,,1
8,"Data Scientist, Core Products","Coalition, Inc.","11/9/24, 10:29 AM",http://www.linkedin.com/jobs/view/4025947473,14.0,0.0,11/1/2024,,,1
10,Sr. Data Analyst,InStride,"11/9/24, 10:42 AM",http://www.linkedin.com/jobs/view/4064739945,7.0,0.0,11/8/2024,,,1
11,Senior Manager - Data Science,McAfee,"11/9/24, 10:32 AM",http://www.linkedin.com/jobs/view/4041683824,14.0,1.0,11/1/2024,,,1


In [20]:
#build a list of the companies that are in our dataset. include new vs. repeat as a dimension
companies = raw_df.copy().filter(['company','is_repeat_company','days_since_post'])
companies['counts'] = 1
agg_mets = {'counts':"count", 'days_since_post':"mean"}
companies = companies.groupby(['company','is_repeat_company']).agg(agg_mets).reset_index()
print(companies.shape)

(220, 4)


In [25]:
#Assign companies to clusters
new_cluster = companies.loc[companies.loc[:,'is_repeat_company']==False]
print(new_cluster.shape)

repeat_cluster = companies.loc[companies.loc[:,'is_repeat_company']==True]
print(repeat_cluster.shape)

(196, 4)
(24, 4)


In [29]:
#ensure company lists are mutually exclusive
new_companies = set(new_cluster['company'])
repeat_companies = set(repeat_cluster['company'])

#compare the two sets of company names (not the repeat_cluster DataFrame)
is_mutually_exclusive = new_companies.isdisjoint(repeat_companies)
assert is_mutually_exclusive, "Companies overlap"

AssertionError: 

In [None]:
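#Randomly assign clusters (companies) to groups
random.seed(42)  #arbitrary fixed seed so the assignment is reproducible

#coin flip per company: 1 = control, 2 = treatment
clusters = pd.concat([new_cluster, repeat_cluster], ignore_index=True)
draws = [random.randint(1, 2) for _ in range(len(clusters))]
clusters['group'] = np.where(np.array(draws) == 1, 'control', 'treatment')

#save the assignment to one file per group (file names are placeholders)
for group in ['control', 'treatment']:
    out_file = os.path.join(data_folder, f"{group}_assignment.csv")
    clusters.loc[clusters['group'] == group].to_csv(out_file, index=False)

#tell me how many companies and jobs are in each treatment and control cluster
summary = clusters.groupby(['group', 'is_repeat_company'])['counts'].agg(['size', 'sum'])
print(summary.rename(columns={'size': 'companies', 'sum': 'jobs'}))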