This data was downloaded from my "saved jobs" on LinkedIn. This notebook processes the data and sorts it into the groups for the experiment.

# Summary
I'm running an experiment on how well two different resumes perform. Does a professionally written resume perform better than one I've written myself?

# Hypothesis
$H_{0}$: The professionally written resume and cover letter provided by Haley Stock have the same interview invitation rate as my own self-written resume and cover letter.

$H_{a}$: The professionally written resume and cover letter have a different interview invitation rate than my self-written resume and cover letter.

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |
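
As a rough check on what the P1 (8%) and P2 (13%) values imply for sample size, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The significance level (0.05, two-sided) and power (0.80) are assumptions, not values stated in this notebook, and the helper name `n_per_group` is hypothetical. The formula also assumes equal group sizes and independent applications, so it does not account for cluster-level randomization.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate applications needed per group for a two-sided two-proportion z-test."""
    # Critical values for the assumed significance level and power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Pooled rate under the null hypothesis (equal group sizes assumed)
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# P1 and P2 come from the metrics table; alpha and power are assumed defaults
n = n_per_group(0.08, 0.13)
print(f"Applications needed per group: {n}")
```

Changing `alpha` or `power` in the call shows how sensitive the required sample size is to those assumptions.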


# Data Collection Process
**LinkedIn:** I searched for "data" roles that are remote in the US. I then saved the roles that met the criteria below to a CSV (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (too different from my experience)<br>No Product Manager roles (not interested in product management)<br>No Director-level roles (mostly above my level) |
| **Type**                | Full-time and contract                                                                                                                                               |
| **Duplicate Job Roles** | Sometimes companies post multiple versions of a job (by location, or as reposts after a few weeks have gone by). I'll do my best to dedupe these in the final "saved jobs" list. |
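
The deduplication step could be sketched as follows. This is a minimal illustration, not the notebook's actual method: it treats two rows as duplicates when `company` and `role title` match after trimming whitespace and lowercasing, and it assumes both columns are populated. The helper name `dedupe_saved_jobs` and the toy rows are hypothetical. Location-specific variants with different titles would still need a manual pass.

```python
import pandas as pd


def dedupe_saved_jobs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row for each (company, role title) pair, ignoring case and extra whitespace."""
    key = (
        df["company"].str.strip().str.lower()
        + "|"
        + df["role title"].str.strip().str.lower()
    )
    # Keep only the first occurrence of each normalized key
    return df.loc[~key.duplicated(keep="first")].reset_index(drop=True)


# Toy rows for illustration only (not taken from the real export)
toy = pd.DataFrame(
    {
        "company": ["Acme", "acme ", "Acme", "Globex"],
        "role title": [
            "Data Scientist",
            "data scientist",
            "Senior Data Analyst",
            "Data Scientist",
        ],
    }
)

deduped = dedupe_saved_jobs(toy)
print(deduped)
```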


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** Randomizing at cluster level (company + job title).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see item 8 below)
4) *Ensure the ceteris paribus assumption* (that is, ensure everything besides the treatment is the same between treatment and control). **Discussion:** Use the exact resume and cover letter provided for both.
5) *Cross-contamination for treatment and control groups.* **Discussion:** I don't think we'll have to worry about this, unless a recruiter leaves one company and goes to another.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only performing one testing period, so this point is moot. If I did roll out in stages, I could simply analyze the final period.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem if I applied for a role with a company before conducting the experiment. I could adjust for this by running the experiment longer, but that isn't practical given my current situation. Instead, I will include a category segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.
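
The assignment described in items 2, 3, and 8 could be sketched as below: randomize at the cluster level (company + job title) and balance repeat and new companies across the two groups. This is a sketch under assumptions, not the final procedure. The `company_status` column, the `assign_groups` helper, the seed, and the toy rows are all hypothetical; only `company` and `role title` are column names from the export. Recency of job posting is not balanced here.

```python
import numpy as np
import pandas as pd


def assign_groups(df: pd.DataFrame, strata_col: str = "company_status", seed: int = 42) -> pd.DataFrame:
    """Assign each (company, role title) cluster to group A or B, balanced within each stratum."""
    rng = np.random.default_rng(seed)
    cluster_cols = ["company", "role title"]

    # One row per cluster so every posting in a cluster gets the same group
    clusters = (
        df[cluster_cols + [strata_col]]
        .drop_duplicates(cluster_cols)
        .reset_index(drop=True)
    )
    clusters["group"] = ""

    labels = np.array(["A", "B"])
    for _, idx in clusters.groupby(strata_col).indices.items():
        shuffled = rng.permutation(idx)  # random order within the stratum
        start = rng.integers(2)          # randomize which group gets any extra cluster
        clusters.loc[shuffled, "group"] = labels[(np.arange(len(shuffled)) + start) % 2]

    # Attach the cluster-level assignment back onto every row
    return df.merge(clusters[cluster_cols + ["group"]], on=cluster_cols, how="left")


# Toy rows for illustration only (not taken from the real export)
toy = pd.DataFrame(
    {
        "company": ["Acme", "Acme", "Globex", "Initech", "Umbrella",
                    "Hooli", "Stark", "Wayne", "Wonka"],
        "role title": ["Data Scientist", "Data Scientist", "Data Scientist",
                       "Senior Data Analyst", "Data Scientist", "Data Scientist",
                       "Senior Data Analyst", "Data Scientist", "Data Scientist"],
        "company_status": ["repeat"] * 5 + ["new"] * 4,
    }
)

assigned = assign_groups(toy)
print(assigned.drop_duplicates(["company", "role title"])
      .groupby(["company_status", "group"]).size())
```

Shuffling within each stratum and then alternating labels keeps the A/B counts within one of each other per stratum, which makes the post-assignment balance check straightforward.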

# Data Processing

In [8]:
!pwd

/notebooks/resume_experiment


In [9]:
import os
import numpy as np
import pandas as pd

# Path to the CSV exported from my LinkedIn "saved jobs"
data_folder = "/notebooks/resume_experiment/data"
dataf = "resume_experiment2024.11.csv"
data_file = os.path.join(data_folder, dataf)

print(data_file)

/notebooks/resume_experiment/data/resume_experiment2024.11.csv


In [10]:
raw_df = pd.read_csv(data_file, low_memory=False)
display(raw_df.head(3))

Unnamed: 0,role title,company,date saved,posting url,date posted,listed function
0,Principal Data Scientist - Service Enablement ...,Atlassian,"11/5/24, 10:34 AM",http://www.linkedin.com/jobs/view/4065282699,,
1,"Senior Manager, Product Insights",Included Health,"11/8/24, 8:38 AM",http://www.linkedin.com/jobs/view/4000281134,,
2,Data Scientist,"Reddit, Inc.","11/4/24, 5:56 PM",http://www.linkedin.com/jobs/view/3998254278,,
