In [15]:
import pandas as pd
from testing_scripts.constants import *

# Section 1: Creating a labeled dataframe
This section reads in the resumes, filters out entries whose CVs are too short, assigns each remaining entry to the positive or negative class, and drops entries that belong to neither class.

## From resumes parquet
This subsection assumes the resumes parquet file exists. It processes that file and exports the labeled dataframe.

In [5]:
# Read in the parquet
RESUMES_PARQUET_INPUT_FILENAME = "data/resumes.parquet"
raw_df = pd.read_parquet(RESUMES_PARQUET_INPUT_FILENAME, engine='pyarrow')  # raw dataframe

# Filter the dataframe by minimum cv length
MIN_CV_LENGTH = 500
# .str.len() gives NaN for missing CVs; NaN >= MIN_CV_LENGTH is False, so the mask stays aligned with raw_df
filtered_df = raw_df.loc[raw_df['CV'].str.len() >= MIN_CV_LENGTH]

# Add a true label column based on the specified keywords
POSITIVE_POSITION = "Project Manager"
POSITIVE_KEYWORD = "Project Manager"
NEGATIVE_POSITION = "QA Engineer"   # "Java Developer"
NEGATIVE_KEYWORD = "QA"             # "Java"

import testing_scripts.label_resumes
testing_scripts.label_resumes.add_true_label_column(filtered_df, POSITIVE_POSITION, POSITIVE_KEYWORD, NEGATIVE_POSITION, NEGATIVE_KEYWORD)
labeled_df = filtered_df            # alias

# Filter out entries whose true label is NA (i.e. belongs to neither class)
labeled_df = labeled_df[labeled_df["True Label"].notna()]

# Export the labeled dataframe
LABELED_DATAFRAME_OUTPUT_FILENAME = "data/labeled_df.csv"
labeled_df.to_csv(LABELED_DATAFRAME_OUTPUT_FILENAME)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[TRUE_LABEL_COLUMN_NAME] = df.apply(label, axis = 1)
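
The warning above appears because the label column is assigned on a dataframe that was itself produced by filtering another dataframe. One common way to avoid it is to take an explicit `.copy()` after filtering. The sketch below uses a toy frame; the `Position` column name and the `label` rule are assumptions made for illustration, since the real logic of `add_true_label_column` in `testing_scripts.label_resumes` is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the resumes; the "Position" column name is an assumption
toy_df = pd.DataFrame({
    "Position": ["Project Manager", "QA Engineer", "Designer"],
    "CV": ["Project Manager for ad campaigns", "QA tester", "UI work"],
})

def label(row):
    """Hypothetical rule: 1 for the positive class, 0 for the negative class, NA otherwise."""
    if row["Position"] == "Project Manager" and "Project Manager" in row["CV"]:
        return 1
    if row["Position"] == "QA Engineer" and "QA" in row["CV"]:
        return 0
    return pd.NA

# .copy() makes the filtered frame independent of toy_df,
# so adding a column to it no longer triggers the warning
toy_filtered = toy_df.loc[toy_df["CV"].str.len() >= 5].copy()
toy_filtered["True Label"] = toy_filtered.apply(label, axis=1)

# Drop the rows that belong to neither class
toy_labeled = toy_filtered[toy_filtered["True Label"].notna()]
print(toy_labeled)
```

The same idea would apply to `filtered_df` in the cell above, provided the helper assigns the column in place as the warning text suggests.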


## From import
If the labeled_df.csv file already exists, run this cell instead to import it.

In [8]:
LABELED_DATAFRAME_INPUT_FILENAME = "data/labeled_df.csv"
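# to_csv wrote the row index as the first column; index_col=0 restores it instead of adding an "Unnamed: 0" column
labeled_df = pd.read_csv(LABELED_DATAFRAME_INPUT_FILENAME, index_col=0)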
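
Because `to_csv` writes the row index as the first column by default, the way the file is read back determines whether that index is restored. The following is a small sketch with a toy frame (all names here are illustrative and not part of the notebook), using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame with a non-default index, as left behind by row filtering
toy_df = pd.DataFrame({"True Label": [1, 0]}, index=[3, 7])
csv_text = toy_df.to_csv()  # with no path, to_csv returns the CSV as a string

# Default read: the saved index comes back as an ordinary column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(restored.index))
```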

## True label playground
This subsection contains some light code for examining the true label

In [16]:
# The size of the positive and negative classes
value_counts = labeled_df["True Label"].value_counts()
print(value_counts)

positiveClassSize = value_counts.get(POSITIVE_LABEL, default=0)
negativeClassSize = value_counts.get(NEGATIVE_LABEL, default=0)
print(f"Proportion of positives = {positiveClassSize / (positiveClassSize + negativeClassSize)}")

True Label
1    6753
0    6379
Name: count, dtype: int64
Proportion of positives = 0.5142400243679561


In [31]:
# Example positive entry
examplePositiveEntry = labeled_df.loc[labeled_df["True Label"] == POSITIVE_LABEL].iloc[0]
examplePositiveCV: str = examplePositiveEntry.to_dict()["CV"]
print(f"Truncated positive CV:\n====================\n {examplePositiveCV[:1000]}")

Truncated positive CV:
 High levels of self-organization, structure, and attention to detail have helped build a successful career in advertising, as evidenced by hundreds of successfully completed projects, and train dozens of specialists. Previous experience is similar to project management methodologies used in the IT industry, including budgeting, planning, stakeholder management, risk mitigation, and effective communication. Creating new products inspires and motivates further development.
Account director
2018 - 2021
Management and development of client portfolio. 
Control over project development and progress. 
Planning and budgeting based on client portfolio. 
Analysis of project effectiveness and profitability. 
Operational management: organizing, coordinating, and controlling the work of the account team (planning and task allocation). 
Ensuring effective interaction of the account managers team between agency departments.

Senior account manager 
2017 - 2018
Communication wi

In [30]:
# Example negative entry 
exampleNegativeEntry = labeled_df.loc[labeled_df["True Label"] == NEGATIVE_LABEL].iloc[10]
exampleNegativeCV = exampleNegativeEntry.to_dict()["CV"]
print(f"Truncated negative CV:\n====================\n {exampleNegativeCV[:1000]}")

Truncated negative CV:
 
June/2022 - Present
- Experience with QA/Web tools (bug-reports, check-lists, documentation writing, writing/ updating test cases, testing with a database(postgresql), testing API requests, GitHub, TeamCity);
- Experience with a Regression tests, Integration, Functional tests, End-to-end, Acceptance, Smoke, Stress;
- Experience with Automation tools (JS/ Playwright, test coverage (UI, API, Database));
- Experience and understanding of Agile Development methodologies especially Scrum.

December/2021 - June/2022
- Experience with QA/Web tools (bug-reports, check-lists, writing/updating test cases, testing API requests, GitHub);
- Experience with Automation tools (JS/Cypress, test coverage (UI, API));
- Experience with a Regression tests, Integration, Functional tests, End-to-end, Acceptance, Smoke;
- Experience and understanding of Agile Development methodologies especially Kanban.

November/2021 - December/2021
- Experience with QA/mobile tools(bug-reports, chec

# Section 2: Marking samples for Experiments
This section involves marking samples in the labeled dataframe for experiments. This allows us to experiment on a few samples at a time, rather than all entries at once.

## From labeled_df
This subsection assumes the labeled_df object already exists in this notebook. It marks the samples and exports the marked dataframe.

In [32]:
# How many samples from each class we want to mark for experiments
NUM_POSITIVE_SAMPLES = 100
NUM_NEGATIVE_SAMPLES = 100

# Create a new column "Marked for Experiments" and deterministically mark 
# the first NUM_POSITIVE_SAMPLES positive entries and the first NUM_NEGATIVE_SAMPLES negative entries True and all others false
labeled_df["Marked for Experiments"] = False
positive_sample_indices = labeled_df[labeled_df["True Label"] == POSITIVE_LABEL].index[:NUM_POSITIVE_SAMPLES]
negative_sample_indices = labeled_df[labeled_df["True Label"] == NEGATIVE_LABEL].index[:NUM_NEGATIVE_SAMPLES]
labeled_df.loc[positive_sample_indices, "Marked for Experiments"] = True
labeled_df.loc[negative_sample_indices, "Marked for Experiments"] = True
marked_df = labeled_df          # alias

# Export the marked dataframe
MARKED_DATAFRAME_OUTPUT_FILENAME = "data/marked_df.csv"
marked_df.to_csv(MARKED_DATAFRAME_OUTPUT_FILENAME)
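
Taking the first entries of each class is deterministic but depends on row order. If a reproducible random selection were preferred, one alternative (not what the cell above does) is a seeded per-class sample. This is a minimal sketch on a toy frame, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`; `N_PER_CLASS` and the seed are illustrative values:

```python
import pandas as pd

# Toy labeled frame: five positives followed by five negatives
toy_df = pd.DataFrame({"True Label": [1] * 5 + [0] * 5})
N_PER_CLASS = 2  # illustrative; the notebook uses 100 per class

# Draw N_PER_CLASS rows from each class with a fixed seed for reproducibility
sampled_indices = (
    toy_df.groupby("True Label").sample(n=N_PER_CLASS, random_state=0).index
)

toy_df["Marked for Experiments"] = False
toy_df.loc[sampled_indices, "Marked for Experiments"] = True

# Number of marked rows per class
print(toy_df.groupby("True Label")["Marked for Experiments"].sum())
```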

## From import
If the marked_df.csv file already exists, run this cell instead to import it.

In [33]:
MARKED_DATAFRAME_INPUT_FILENAME = "data/marked_df.csv"
# index_col=0 restores the row index that to_csv saved as the first column
marked_df = pd.read_csv(MARKED_DATAFRAME_INPUT_FILENAME, index_col=0)

## Mark playground
This subsection contains some light code for examining the marked samples

In [38]:
# The total number of marked entries (should match NUM_POSITIVE_SAMPLES + NUM_NEGATIVE_SAMPLES)
value_counts = marked_df["Marked for Experiments"].value_counts()
print(f"Number of samples = {value_counts.get(True)}")

Number of samples = 200
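
The total alone does not show whether the marked samples are split across the two classes as intended. A per-class count can be sketched as follows, here on a toy frame with the same column names (the label values 1 and 0 are assumed from the value counts printed earlier):

```python
import pandas as pd

# Toy marked frame: two positives and one negative are marked
toy_marked = pd.DataFrame({
    "True Label": [1, 1, 1, 0, 0, 0],
    "Marked for Experiments": [True, True, False, True, False, False],
})

# Keep only the marked rows, then count them per class
per_class = toy_marked.loc[
    toy_marked["Marked for Experiments"], "True Label"
].value_counts()

num_marked_positive = per_class.get(1, 0)
num_marked_negative = per_class.get(0, 0)
print(f"Marked positives = {num_marked_positive}, marked negatives = {num_marked_negative}")
```

On `marked_df`, the two counts would be expected to equal `NUM_POSITIVE_SAMPLES` and `NUM_NEGATIVE_SAMPLES`.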
