tldr - This data has been downloaded from my "saved jobs" on LinkedIn. This notebook helps process and sort the data into our groups for the experiment.

# Summary
I'm running an experiment on how well two different resumes perform. Does a professionally written resume perform better than one I've written myself?

# Hypothesis
$H_{0} = \text{The resume and cover letter provided by Haley Stock perform the same as my own self-written resume and cover letter.}$

$H_{a} = \text{The professionally-written resume and cover letter have a different interview invitation rate than self-written resume and cover letter.}$

| Name                       | Definition                                                                                                   | Example Format | P1   | P2 (from detectable difference below) |
|----------------------------|-------------------------------------------------------------------------------------------------------------|----------------|------|----------------------------------------|
| Interview Invitation Rate  | This is the percent of applications that receive an interview invite. It shows how interested companies are in my resume. | 0.50%         | 8%   | 13%                                    |
| Application Response Rate  | This is the percent of applications that receive a rejection/interview invite within 48 hrs. It shows how confident companies are in my qualifications. | 1%             |      |                                        |


# Data Collection Process
**LinkedIn:** I searched for "data" roles that are remote in the US. I then saved the roles that met the criteria below to a csv (via LinkedIn's data request form).

# Inclusion Criteria
Here are the kinds of jobs I'm interested in. I'm going to include these roles in the experiment based on the logic listed below.

| **Category**            | **Details**                                                                                                                                                          |
|:------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Location**            | Remote work only                                                                                                                                                     |
| **Minimum Job Roles**   | Data Scientist, Senior Data/Product Analyst, or Manager/Lead<br>No Applied Scientist, Decision Science, Analytics Engineer, or Researcher roles (significantly different than my experience would allow)<br>No Product Manager roles (not interested in product management)<br>No Director level roles (largely above my level) <br>Minimum salary of $100k/yr|
| **Type**                | Full time and contracted                                                                                                                                             |
| **Duplicate Job Roles** | Sometimes companies post multiple versions of a job (by location or reposts after a few weeks have gone by). Will do my best to dedupe these in the final "saved jobs" list. |
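
One possible way to handle the duplicate postings described in the table is sketched below. It is a minimal illustration in plain Python, not the method actually used for this experiment: the helper names `posting_key` and `dedupe_postings` and the example rows are hypothetical, and it assumes that two postings count as duplicates when company and role title match after normalizing case and punctuation.

```python
import re

def posting_key(company, role_title):
    """Build a normalized key so trivially different reposts compare equal."""
    def norm(text):
        # Lowercase and collapse any run of non-alphanumerics into one space.
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return (norm(company), norm(role_title))

def dedupe_postings(postings):
    """Keep the first posting seen for each (company, role_title) key."""
    seen = set()
    unique = []
    for post in postings:
        key = posting_key(post["company"], post["role_title"])
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

# Hypothetical rows for illustration only (not from the real saved-jobs file).
saved = [
    {"company": "Acme Corp", "role_title": "Senior Data Analyst"},
    {"company": "ACME Corp.", "role_title": "Senior Data Analyst "},
    {"company": "Acme Corp", "role_title": "Data Scientist"},
]
unique_saved = dedupe_postings(saved)
print(len(unique_saved))
```

A key like this would not catch titles that differ by location (for example, a city appended to the title), so a manual pass over the remaining list would likely still be needed.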


# Potential Pitfalls and Adjustments
1) *Cutting sample collection short.* **Discussion:** See power analysis below. Peer review pending.
2) *Confirming random assignment.* **Discussion:** We will randomize at the cluster level (repeat/not yet applied company + Analyst/Scientist/or Manager role).
3) *Random assignment fails to equally distribute "heavy" segments across A/B.* **Discussion:** Equally distribute the following segments and assess post-assignment:
   - Recency of job posting
   - "Repeat" companies (see "8)" below)
   - Role type: Analyst, Scientist, or Manager
4) *Ensure the Ceteris Paribus assumption* (that is, ensure all else is the same between treatment/control besides the treatment). **Discussion:** Use the exact resume and cover letter provided for both.
5) *Cross-contamination for treatment and control groups.* **Discussion:** This could be an issue if different role types have the same recruiter.
6) *Multiple comparisons can lead to higher false positive rate.* **Discussion:** Only making one comparison -- professionally written resume vs. self-written resume.
7) *Simpson's paradox due to graduated roll-outs.* **Discussion:** I'm only performing one testing period, so this point is moot. Could simply look at final period if so.
8) *Primacy or novelty effect.* **Discussion:** This could be a problem if I previously applied for a role with a company before conducting the experiment. I can adjust for this by running the experiment longer, but that's not really helpful given my current situation. Instead, I will include a category segment across treatment/control for repeat and new companies.
    - "Repeat" = companies that I've applied to before. I'll attempt to equally distribute these during assignment.
    - "New" = companies that I have not applied to before. I'll attempt to equally distribute these during assignment.
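
The cluster-level randomization described in points 2, 3, and 8 could look roughly like the sketch below. It is an assumption about the approach, not the code that produced `ab_split`: the function name `assign_within_strata` is hypothetical, and the stratum keys are only modeled on the `sorting_hat_col` values that appear in the data (repeat-company flag plus role category). Within each stratum, rows are shuffled and then alternated between control (0) and treatment (1), so the two groups differ by at most one row per stratum.

```python
import random
from collections import defaultdict

def assign_within_strata(strata, seed=0):
    """Return a 0/1 label per row, balanced within each stratum.

    strata: list of stratum keys, one per job posting.
    0 = control, 1 = treatment.
    """
    rng = random.Random(seed)
    rows_by_stratum = defaultdict(list)
    for row, key in enumerate(strata):
        rows_by_stratum[key].append(row)

    labels = [None] * len(strata)
    for rows in rows_by_stratum.values():
        rng.shuffle(rows)            # random order within the stratum
        offset = rng.randint(0, 1)   # randomize which arm gets the odd row
        for position, row in enumerate(rows):
            labels[row] = (position + offset) % 2
    return labels

# Hypothetical stratum keys, shaped like the sorting_hat_col values.
strata = ["0.0_Data Analyst"] * 5 + ["1.0_Manager"] * 4 + ["0.0_Manager"] * 3
labels = assign_within_strata(strata, seed=42)
print(labels)
```

With a DataFrame, the list of keys could come from the stratum column and the returned labels could be written back as a new column. Balance on posting recency would still need to be checked after assignment, since it is not part of the stratum key here.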

# Data Analysis

I have compiled the data with responses into a file `'data/total_split.csv'`. It contains every applied job posting along with a date that I received an interview. 

We've reached the original sample size of 350 for control and treatment. However, the response rate is far lower than expected. I will still perform the power analysis on this collection of data, but it's very likely the difference won't be statistically significant.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, t

#Loading and renaming cols
raw_samples = pd.read_csv('data/total_split.csv')
display(raw_samples.sample(2))

Unnamed: 0.1,Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col,ab_split,Link,Applied Date \n(Blank if not applied),Deletion details,Interview Invite
627,15,Senior Data Analyst,GAC Solutions,"12/9/24, 9:22 AM",http://www.linkedin.com/jobs/view/4082859628,,0,,,Data Analyst,0.0_Data Analyst,1,https://www.linkedin.com/jobs/view/4082859628/...,12/9/2024,,
351,99,Data Analyst,QuinStreet,"12/3/24, 10:30 AM",http://www.linkedin.com/jobs/view/4017596454,,0,,,Data Analyst,0.0_Data Analyst,1,https://www.quinstreet.com/careers/?gh_jid=623...,12/3/2024,,


In [61]:
def calcRR(df):
    df = df.copy()[['ab_split','response_yn']].groupby('ab_split').agg(['sum','count']).reset_index()
    df.columns = ['ab_split','sum','count']
    df['response_rate'] = df['sum'].div(df['count'])
    return(df)

class PowerAnalysis:
    """A class to represent a power analysis."""
    
    def __init__(self,raw_data):
        """Initialize the power analysis object."""
        self.raw_data = raw_data
        
    def proc_rawdata(self, rm_null_rows="Applied Date \n(Blank if not applied)", observation_gran=['role_title','company'], 
                     control_treatment="ab_split", response_var="Interview Invite", segmentation=None, 
                     verbose=True, new_col_names=['role_title','company','ab_split',"applied_date",'invite']):
        """Return self, with a processed dataframe (self.samples) that shows the row wise response variable.
            segmentation arg expects a (column_name, value) tuple, e.g. ('role_cat','Manager').
        """
        self.control_treatment = control_treatment
        self.verbose=verbose
        
        if segmentation is None:
            df = self.raw_data.copy()
        else:
            df = self.raw_data.copy()
            df = df.loc[df.loc[:,segmentation[0]]==segmentation[1]]
        
        #remove rows where rm_null_rows is empty; delete anything marked with DELETE in that col.
        samples = df.loc[~df.loc[:,rm_null_rows].isnull()]
        samples = samples.loc[samples.loc[:,rm_null_rows]!="DELETE"]
        #shrink the df to only use the needed columns.
        filter_cols = observation_gran + [control_treatment] + [rm_null_rows] + [response_var]
        #if we have a segmentation column, keep it as well.
        if segmentation:
            filter_cols.append(segmentation[0])
        #filter; copy so the column assignments below don't write to a slice of raw_data
        samples = samples[filter_cols].copy()
        #rename the columns, keeping the segmentation column under its own name
        if segmentation:
            samples.columns = new_col_names + [segmentation[0]]
        else:
            samples.columns = new_col_names
        #an application counts as a response when the invite column is filled in
        samples['response_yn'] = np.where(samples['invite'].isnull(), 0, 1)
        
        self.samples = samples
        return(self)

    def ctrl_treat_split(self):
        """Return a control and treatment df with the proper Response Rate columns.
        This function expects the control_treatment column to have 0 for control and 1 for treatment.
        """
        control = self.samples.loc[self.samples.loc[:,self.control_treatment]==0]
        treatment = self.samples.loc[self.samples.loc[:,self.control_treatment]==1]
        ctrl_rr = calcRR(control)
        trmt_rr = calcRR(treatment)
        return(ctrl_rr, trmt_rr)
    
#Split into control and treatment.
#control = samples.loc[samples.loc[:,'ab_split']==0]
#treat = samples.loc[samples.loc[:,'ab_split']==1]



#ctrl_rr = calcRR(control)
#display(ctrl_rr.head(1))

#trmt_rr = calcRR(treat)
#display(trmt_rr.head(1))

In [62]:
#Adding response variable
samples = PowerAnalysis(raw_samples)
proc_samples = samples.proc_rawdata()
all_in_samples = proc_samples.samples
#display(all_in_samples.sample(2))

ctrl, trt = proc_samples.ctrl_treat_split()
display(ctrl.head(1))
display(trt.head(1))

Unnamed: 0,ab_split,sum,count,response_rate
0,0,9,323,0.027864


Unnamed: 0,ab_split,sum,count,response_rate
0,1,8,363,0.022039


The formula to determine the z statistic for this analysis, where $\hat{p}$ is the pooled response rate across both groups, is:

$Z = \frac{(\hat{p_1}-\hat{p_2}) - 0}{\sqrt{\hat{p}\cdot(1-\hat{p})\cdot(\frac{1}{n_1} + \frac{1}{n_2})}}$
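
As a hand check, the observed counts from the tables above (9 invites from 323 control applications, 8 invites from 363 treatment applications) can be plugged into this formula. The intermediate figures are rounded, so the last digits may differ slightly from a computed value:

$\hat{p} = \frac{9 + 8}{323 + 363} = \frac{17}{686} \approx 0.02478$

$Z \approx \frac{0.027864 - 0.022039}{\sqrt{0.02478 \cdot 0.97522 \cdot (\frac{1}{323} + \frac{1}{363})}} \approx \frac{0.005825}{0.01189} \approx 0.49$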

In [63]:
#Compute the two-proportion z statistic using the pooled response rate
def Zstat(df1, df2):
    Z_numer = (df1['response_rate'] - df2['response_rate']) - 0
    phat = (df1['sum'] + df2['sum'])/(df1['count'] + df2['count'])
    Z_denom = np.sqrt(phat * (1-phat) * (1/df1['count'] + 1/df2['count']))
    Z = Z_numer/Z_denom
    return(Z)

Z = Zstat(ctrl, trt)

print(f"Z statistic: {Z[0]}")

Z statistic: 0.4898810818653188


In [64]:
from scipy.stats import norm

def Pvalue(Z):
    #two-tailed test: use |Z| so a negative Z can't produce a pvalue above 1
    pvalue = 2 * (1-norm.cdf(abs(Z)))
    return(pvalue)

print(f"Our pvalue: {Pvalue(Z)[0]}")

Our pvalue: 0.6242180507005721


Since our pvalue (0.62) is not lower than 0.05, we fail to reject the null hypothesis. In other words, we could easily have gotten a difference this size by random chance.

# Where do we go from here?

The best solution is to collect more samples. This, however, would take time. I've been averaging 100 applications per day if I only submit job applications.

Knowing our response rate now, how long would we need to get enough sample for a stat sig result?

Our formula for the required sample size per group ($n$), given the response rates observed so far:

$n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (\hat{p_1}(1-\hat{p_1}) + \hat{p_2}(1-\hat{p_2}))}{(\hat{p_1} - \hat{p_2})^2}$

In [65]:
from scipy.stats import norm

def SampleSize(p1, p2, alpha=0.05, beta=0.2, two_tailed_alpha=True):
    #calculate Zstatistic for alpha
    if two_tailed_alpha:
        alpha_zstat = norm.ppf(1-alpha/2)
    else:
        alpha_zstat = norm.ppf(1-alpha)
    #calculate Zstatistic for beta.
    beta_zstat = norm.ppf(1-beta)
    nnumer = (alpha_zstat + beta_zstat)**2 * (p1 * (1-p1) + p2 * (1-p2))
    ndenom = (p1 - p2)**2
    n = nnumer/ndenom
    return(n)

n = SampleSize(p1=0.027864,p2=0.022039, alpha=0.1, beta=0.3)
combined_sample = n*2
print(f"Our required sample size: {combined_sample}")

current_sample = (ctrl['count'] + trt['count'])[0]
print(f"We have {current_sample} now.")

remaining_sample = combined_sample - current_sample
print(f"We would have to collect an additional {remaining_sample} samples")

print(f"If 100 apps can be submitted in 1 day, it will take '{remaining_sample/100}' days to collect the required sample")

Our required sample size: 13491.505955749142
We have 686 now.
We would have to collect an additional 12805.505955749142 samples
If 100 apps can be submitted in 1 day, it will take '128.05505955749143' days to collect the required sample


Keeping our alpha and beta at 10% and 30% respectively, the required sample is too large to collect myself: roughly 12,800 additional applications, or about 128 days.
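
To see why the requirement grew so much, the same sample size formula can be evaluated at the planning rates from the Hypothesis table (8% vs 13%) and at the observed rates. This is a sketch under two assumptions: that the original plan used the same alpha of 10% and beta of 30% (the notebook does not show that calculation), and that Python's standard-library `statistics.NormalDist` is an acceptable stand-in for `scipy.stats.norm` here. The function name `sample_size_per_group` is hypothetical.

```python
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.10, beta=0.30):
    """Per-group n for a two-tailed, two-proportion test (formula above)."""
    z = NormalDist()  # standard normal, mean 0 and sd 1
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

planned = sample_size_per_group(0.08, 0.13)            # rates assumed at design time
observed = sample_size_per_group(0.027864, 0.022039)   # rates measured above
print(f"Planned rates: about {planned:.0f} per group")
print(f"Observed rates: about {observed:.0f} per group")
```

Under these assumptions the planning rates give a figure close to the 350 per group that was collected, while the observed rates require several thousand per group. The required sample scales with the inverse square of the difference in rates, and that difference shrank from 5 percentage points to roughly 0.6.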

We can also look at the individual segments to see if there were any significant results by segment instead.

# Analysis by Segment

In [66]:
display(samples.raw_data.sample(3))

Unnamed: 0.1,Unnamed: 0,role_title,company,date_saved,posting_url,days_since_post,is_repeat_company,date_posted,job_posted_pin,role_cat,sorting_hat_col,ab_split,Link,Applied Date \n(Blank if not applied),Deletion details,Interview Invite
116,97,Data Scientist,Exact Sciences,"11/9/24, 10:34 AM",http://www.linkedin.com/jobs/view/4020468338,14.0,0,11/1/2024,14days,Data Scientist,0.0_Data Scientist,0,https://exactsciences.wd1.myworkdayjobs.com/en...,11/21/2024,,
421,31,"Senior Data Scientist, Analytics",Duetto,"12/3/24, 10:20 AM",http://www.linkedin.com/jobs/view/4083244662,,1,,,Data Scientist,1.0_Data Scientist,1,https://job-boards.greenhouse.io/duettoresearc...,12/4/2024,,
123,68,Senior Data Scientist - Telematics,Tiger Analytics,"11/9/24, 10:50 AM",http://www.linkedin.com/jobs/view/4045025337,30.0,0,10/16/2024,30days,Data Scientist,0.0_Data Scientist,0,https://www.linkedin.com/jobs/view/4045025337/...,11/21/2024,,


In [71]:
#segment: role
segmented_sample = samples.proc_rawdata(segmentation=('role_cat','Manager'))
ctrl_mgr, treat_mgr = segmented_sample.ctrl_treat_split()
display(ctrl_mgr)
display(treat_mgr)
mgr_z = Zstat(ctrl_mgr, treat_mgr)
print(f"Z stat for Manager segment: {mgr_z[0]:.6f}")
print(f"Our pvalue: {Pvalue(mgr_z)[0]}")

Unnamed: 0,ab_split,sum,count,response_rate
0,0,5,69,0.072464


Unnamed: 0,ab_split,sum,count,response_rate
0,1,4,85,0.047059


Z stat for Manager segment: 0.668355
Our pvalue: 0.5039068525320394


Segmenting by manager roles suffers from the same issue--low sample. As a result we cannot reject the null hypothesis. It's possible the control resume and cover letter perform better, but it's also possible we got this result due to random chance.
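
Because looking at several segments means making several comparisons (pitfall 6 above), one hedged option is to compare each segment's pvalue to a Bonferroni-adjusted cutoff. The sketch below is illustrative rather than part of the original analysis: `two_prop_pvalue` is a hypothetical helper that repeats the pooled z-test using only the standard library, and the "Non-manager" counts are derived by subtracting the Manager counts from the overall counts reported earlier.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-tailed pvalue for a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# (control invites, control n, treatment invites, treatment n)
segments = {
    "Manager": (5, 69, 4, 85),
    # Overall counts (9/323 and 8/363) minus the Manager counts.
    "Non-manager": (9 - 5, 323 - 69, 8 - 4, 363 - 85),
}

alpha = 0.05
threshold = alpha / len(segments)  # Bonferroni-adjusted cutoff
for name, counts in segments.items():
    p = two_prop_pvalue(*counts)
    print(f"{name}: pvalue = {p:.3f}, below {threshold:.3f}? {p < threshold}")
```

Neither segment comes close to the adjusted cutoff, which is consistent with the low-sample conclusion above.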