### START HERE IF SOURCING FROM df_manual_FOR_training
### PLEASE SET CORRECT DIRECTORY PATHS BELOW


# Descriptives and visualization


In [1]:
import os # type:ignore # isort:skip # fmt:skip # noqa # nopep8
import sys # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from pathlib import Path # type:ignore # isort:skip # fmt:skip # noqa # nopep8

mod = sys.modules[__name__]

code_dir = None
code_dir_name = 'Code'
unwanted_subdir_name = 'Analysis'

cwd = Path.cwd()
if code_dir_name not in cwd.name:
    # Search up to five parent directories for the code directory
    for parent in list(cwd.parents)[:5]:
        if (code_dir_name in parent.name) and (unwanted_subdir_name not in parent.name):
            code_dir = str(parent)
            break
else:
    code_dir = str(cwd)

# Fail early instead of appending None to sys.path
if code_dir is None:
    raise FileNotFoundError(f'No parent directory containing "{code_dir_name}" found from {cwd}')
sys.path.append(code_dir)

# %load_ext autoreload
# %autoreload 2


In [2]:
from setup_module.imports import * # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from setup_module import researchpy_fork as rp # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from setup_module.statannotations_fork.Annotator import Annotator # type:ignore # isort:skip # fmt:skip # noqa # nopep8


Using MPS



In [3]:
try:
    df_sectors_all = pd.read_pickle(f'{table_save_path}Sectors Output from script.pkl')
except FileNotFoundError:
    cbs_notebook = '\\'.join(f'{scraped_data}CBS/CBS.ipynb')
    %run $cbs_notebook import df_sectors_all # type:ignore # isort:skip # fmt:skip # noqa # nopep8


# Functions

In [4]:
def show_and_close_plots():
    plt.show()
    plt.clf()
    plt.cla()
    plt.close()
    plt.rc('font', **font)
    plt.rcParams['font.family'] = font['family']


In [5]:
def close_plots():
    plt.clf()
    plt.cla()
    plt.close()
    plt.rc('font', **font)
    plt.rcParams['font.family'] = font['family']


# Analysis plan:

1. ## [Descriptives and tables](./1.%20descriptives_and_tables.ipynb)
2. ## [Visualization](./2.%20visualization.ipynb)
3. ## [Frequencies and Normality tests](./2.%20frequencies_and_normality_test.ipynb)
   1. ### Frequencies, histograms, and QQ plots
      * Normal test
      * Kurtosis test
      * Shapiro
      * Anderson
      * Bartlett
   2. ### Correlation between independent variables (IVs) and control variables and Multicollinearity test
      * Pearson's R
      * VIF
     - ***ivs_dummy*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
     - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
     - ***% Sector per Workforce*** (continuous ratio) = Sector percentage per workforce (0-100)
     - ***num_words*** (continuous ratio) = Number of words in job description
     - ***English Requirement in Job Ad*** (binary nominal) = English requirement in job description (0 vs. 1)
     - ***Dutch Requirement in Job Ad*** (binary nominal) = Dutch requirement in job description (0 vs. 1)
     - ***Platform*** (binary dummy) = LinkedIn (0 vs. 1), Indeed (0 vs. 1), Glassdoor (0 vs. 1)

4. ## [ANOVA and Chi-square (Pearson's R)](./3.%20chisqt_and_anova.ipynb)

   1. ### Chi-square
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)

   2. ### One-way ANOVA, interactions, and post-hoc test
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
          - If Levene's test is *not significant*, use classic ANOVA and Tukey's post hoc test
          - If Levene's test is *significant*, use Welch's ANOVA, the Kruskal-Wallis test, and the Games-Howell post hoc test
      * **df_jobs:**
         - ***dvs_prob*** (continuous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
         - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
           - If Levene's test is *not significant*, use classic ANOVA and Tukey's post hoc test
           - If Levene's test is *significant*, use Welch's ANOVA, the Kruskal-Wallis test, and the Games-Howell post hoc test

5. ## [Regression Analysis](./3.%20regression_analysis.ipynb)
   1. ### Logistic Regression with all interactions (smf):
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
   2. ### OLS Regression with all interactions:
      * **df_jobs:**
        - ***dvs_prob*** (continuous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
   3. ### Multilevel OLS Regression with all interactions:
      * **df_jobs:**
        - ***dvs_prob*** (continuous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)

6. ## [Specification Curve Analysis](./4.%20specification_curve_analysis.ipynb)

   1. ### Logistic Specification Curve Analysis:
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
   2. ### OLS Specification Curve Analysis:
      * **df_jobs:**
        - ***dvs_prob*** (continuous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continuous ratio) = Social category percentage per sector (0-100)
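
The Levene-based rule in step 4.2 of the plan can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `choose_omnibus_test`, the `alpha` threshold, and the toy groups are made up here, and the heteroscedastic branch runs the Kruskal-Wallis test alone. Welch's ANOVA and the Tukey or Games-Howell post hoc tests are left out.

```python
import numpy as np
from scipy import stats


def choose_omnibus_test(groups, alpha=0.05):
    """Pick an omnibus test based on Levene's test (hypothetical helper)."""
    _, levene_p = stats.levene(*groups, center='median')
    if levene_p >= alpha:
        # Levene not significant: variances look homogeneous, use classic ANOVA
        name = 'classic ANOVA'
        stat, p = stats.f_oneway(*groups)
    else:
        # Levene significant: use the rank-based Kruskal-Wallis test instead
        name = 'Kruskal-Wallis'
        stat, p = stats.kruskal(*groups)
    return {'levene_p': levene_p, 'test': name, 'statistic': stat, 'pvalue': p}


# Toy groups (assumption): same spread with shifted means vs. very different spreads
base = np.linspace(-1.0, 1.0, 50)
equal_spread = [base, base + 0.5, base + 1.0]
unequal_spread = [base * 0.1, base, base * 10.0]

equal_result = choose_omnibus_test(equal_spread)
unequal_result = choose_omnibus_test(unequal_spread)
print(equal_result['test'], '|', unequal_result['test'])
```

In the notebooks the groups would be the dependent variable split by social category designation (Female, Male, Mixed Gender) rather than synthetic arrays.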


# READ DATA

In [6]:
with open(f'{data_dir}df_manual_len.txt', 'r') as f:
    df_manual_len = int(f.read())

df_manual = pd.read_pickle(f'{df_save_dir}df_manual_for_training.pkl')
assert len(df_manual) == df_manual_len, f'DATAFRAME MISSING DATA! DF SHOULD BE OF LENGTH {df_manual_len} BUT IS OF LENGTH {len(df_manual)}'
print(f'Dataframe loaded with shape: {df_manual.shape}')
df_manual = categorize_df_gender_age(df_manual)


Dataframe loaded with shape: (5947, 75)


In [7]:
with open(f'{data_dir}df_jobs_for_analysis_len.txt', 'r') as f:
    df_jobs_len = int(f.read())

df_jobs = pd.read_pickle(f'{df_save_dir}df_jobs_for_analysis.pkl')
assert len(df_jobs) == df_jobs_len, f'DATAFRAME MISSING DATA! DF SHOULD BE OF LENGTH {df_jobs_len} BUT IS OF LENGTH {len(df_jobs)}'
print(f'Dataframe loaded with shape: {df_jobs.shape}')
df_jobs = categorize_df_gender_age(df_jobs)


Dataframe loaded with shape: (308583, 101)


## Set dataframes

#### Dataframes dict

In [8]:
dataframes = {
    'df_jobs': df_jobs,
    # 'df_manual': df_manual,
}


# Descriptives

### All info

In [9]:
df_jobs.head()


Unnamed: 0,Search Keyword,Platform,Job ID,Job Title,Company Name,Location,Job Description,Rating,Employment Type,Company URL,Job URL,Job Age,Job Age Number,Collection Date,Data Row,Tracking ID,Industry,Job Date,Type of ownership,Language,Dutch Requirement in Job Ad,English Requirement in Job Ad,Dutch Requirement in Job Ad_No,Dutch Requirement in Job Ad_Yes,English Requirement in Job Ad_No,English Requirement in Job Ad_Yes,Sector Code,Sector,Keywords Count,Gender_Female_n,Gender_Female_% per Sector,Gender_Female_% per Social Category,Gender_Female_% per Workforce,Gender_Male_n,Gender_Male_% per Sector,Gender_Male_% per Social Category,Gender_Male_% per Workforce,Gender,Age_Older_n,Age_Older_% per Sector,Age_Older_% per Social Category,Age_Older_% per Workforce,Age_Younger_n,Age_Younger_% per Sector,Age_Younger_% per Social Category,Age_Younger_% per Workforce,Age,Sector_n,% Sector per Workforce,Sector Job Advertisement Count,Sector Gender Designation Job Advertisement Count,Sector Age Designation Job Advertisement Count,Gender_Female,Gender_Male,Gender_Mixed,Age_Mixed,Age_Older,Age_Younger,Gender_Num,Age_Num,Interaction_Female_Older_% per Sector,Interaction_Female_Younger_% per Sector,Interaction_Male_Older_% per Sector,Interaction_Male_Younger_% per Sector,Platform_Num,Platform_LinkedIn,Platform_Indeed,Platform_Glassdoor,Job Description spacy_sentencized,Job Description spacy_sentencized_num_words,Job Description spacy_sentencized_num_unique_words,Job Description spacy_sentencized_num_chars,Job Description spacy_sentencized_num_chars_no_whitespact_and_punt,Job Description spacy_sentencized_num_punctuations,Job Description_num_words,Job Description_num_unique_words,Job Description_num_chars,Job Description_num_chars_no_whitespact_and_punt,Job Description_num_punctuations,Job Description spacy_sentencized_lower,Dutch Requirement in Sentence,English Requirement in Sentence,Dutch Requirement in Sentence_No,Dutch Requirement in Sentence_Yes,English Requirement in Sentence_No,English Requirement in Sentence_Yes,Job Description spacy_tokenized,Job Description spacy_sentencized_cleaned,Job Description nltk_tokenized,Job Description gensim_tokenized,Job Description bert_tokenized,Warmth,Warmth_Probability,Competence,Competence_Probability,Warmth_actual,Competence_actual,Warmth_predicted,Warmth_Probability_predicted,Competence_predicted,Competence_Probability_predicted
0,wholesale,Indeed,pj_da9f2c12243d7031,Transaction Monitoring Expert,Michael Page,Amsterdam,About Our Client\nThe Global KYC organisation ...,-1.0,-1,https://indeed.nl/rc/clk?jk=da9f2c12243d7031&f...,https://nl.indeed.com/vacature-bekijken/pagead...,2 dagen geleden,2 dagen geleden,2021-01-24,,,,,,en,No,No,1.0,0.0,1.0,0.0,G,Commercial services,11.0,3421.0,43.13,28.47,13.54,4510.0,56.87,34.04,17.85,Mixed Gender,2704.0,34.09,25.44,10.7,5228.0,65.92,35.73,20.69,Mixed Age,7931.0,31.39,1787.0,6664.0,11464.0,0.0,0.0,1.0,1.0,0.0,0.0,1,1,1470.63,2843.37,1938.77,3748.49,1.0,0.0,1.0,0.0,About Our Client,3.0,3.0,16.0,14.0,0.0,558.0,320.0,3876.0,3240.0,23.0,about our client,No,No,1.0,0.0,1.0,0.0,"[about, our, client]",about our client,[client],[client],"[about, our, client]",0,0.01,0,0.02,,,,,,
1,wholesale,Indeed,pj_da9f2c12243d7031,Transaction Monitoring Expert,Michael Page,Amsterdam,About Our Client\nThe Global KYC organisation ...,-1.0,-1,https://indeed.nl/rc/clk?jk=da9f2c12243d7031&f...,https://nl.indeed.com/vacature-bekijken/pagead...,2 dagen geleden,2 dagen geleden,2021-01-24,,,,,,en,No,No,1.0,0.0,1.0,0.0,G,Commercial services,11.0,3421.0,43.13,28.47,13.54,4510.0,56.87,34.04,17.85,Mixed Gender,2704.0,34.09,25.44,10.7,5228.0,65.92,35.73,20.69,Mixed Age,7931.0,31.39,1787.0,6664.0,11464.0,0.0,0.0,1.0,1.0,0.0,0.0,1,1,1470.63,2843.37,1938.77,3748.49,1.0,0.0,1.0,0.0,The Global KYC organisation is part of ING's C...,10.0,10.0,56.0,45.0,1.0,558.0,320.0,3876.0,3240.0,23.0,the global kyc organisation is part of ing's c...,No,No,1.0,0.0,1.0,0.0,"[the, global, kyc, organisation, is, part, of,...",the global kyc organisation is part of ing 's ...,"[global, kyc, organisation, part, ing, 's, coo...","[global, kyc, organis, ing, coo, domain]","[the, global, ky, ##c, organisation, is, part,...",0,0.01,0,0.06,,,,,,
2,wholesale,Indeed,pj_da9f2c12243d7031,Transaction Monitoring Expert,Michael Page,Amsterdam,About Our Client\nThe Global KYC organisation ...,-1.0,-1,https://indeed.nl/rc/clk?jk=da9f2c12243d7031&f...,https://nl.indeed.com/vacature-bekijken/pagead...,2 dagen geleden,2 dagen geleden,2021-01-24,,,,,,en,No,No,1.0,0.0,1.0,0.0,G,Commercial services,11.0,3421.0,43.13,28.47,13.54,4510.0,56.87,34.04,17.85,Mixed Gender,2704.0,34.09,25.44,10.7,5228.0,65.92,35.73,20.69,Mixed Age,7931.0,31.39,1787.0,6664.0,11464.0,0.0,0.0,1.0,1.0,0.0,0.0,1,1,1470.63,2843.37,1938.77,3748.49,1.0,0.0,1.0,0.0,Its purpose is Enabling people and organisatio...,20.0,19.0,131.0,111.0,1.0,558.0,320.0,3876.0,3240.0,23.0,its purpose is enabling people and organisatio...,No,No,1.0,0.0,1.0,0.0,"[its, purpose, is, enabling, people, and, orga...",its purpose is enabling people and organisatio...,"[purpose, enabling, people, organisations, use...","[purpos, enabl, peopl, organis, us, bank, serv...","[its, purpose, is, enabling, people, and, orga...",0,0.29,1,0.64,,,,,,
3,wholesale,Indeed,pj_da9f2c12243d7031,Transaction Monitoring Expert,Michael Page,Amsterdam,About Our Client\nThe Global KYC organisation ...,-1.0,-1,https://indeed.nl/rc/clk?jk=da9f2c12243d7031&f...,https://nl.indeed.com/vacature-bekijken/pagead...,2 dagen geleden,2 dagen geleden,2021-01-24,,,,,,en,No,No,1.0,0.0,1.0,0.0,G,Commercial services,11.0,3421.0,43.13,28.47,13.54,4510.0,56.87,34.04,17.85,Mixed Gender,2704.0,34.09,25.44,10.7,5228.0,65.92,35.73,20.69,Mixed Age,7931.0,31.39,1787.0,6664.0,11464.0,0.0,0.0,1.0,1.0,0.0,0.0,1,1,1470.63,2843.37,1938.77,3748.49,1.0,0.0,1.0,0.0,Our Global KYC organisation is a first line of...,34.0,31.0,239.0,203.0,1.0,558.0,320.0,3876.0,3240.0,23.0,our global kyc organisation is a first line of...,No,No,1.0,0.0,1.0,0.0,"[our, global, kyc, organisation, is, a, first,...",our global kyc organisation is a first line of...,"[global, kyc, organisation, first, line, defen...","[global, kyc, organis, line, defenc, depart, p...","[our, global, ky, ##c, organisation, is, a, fi...",0,0.16,1,0.92,,,,,,
4,wholesale,Indeed,pj_da9f2c12243d7031,Transaction Monitoring Expert,Michael Page,Amsterdam,About Our Client\nThe Global KYC organisation ...,-1.0,-1,https://indeed.nl/rc/clk?jk=da9f2c12243d7031&f...,https://nl.indeed.com/vacature-bekijken/pagead...,2 dagen geleden,2 dagen geleden,2021-01-24,,,,,,en,No,No,1.0,0.0,1.0,0.0,G,Commercial services,11.0,3421.0,43.13,28.47,13.54,4510.0,56.87,34.04,17.85,Mixed Gender,2704.0,34.09,25.44,10.7,5228.0,65.92,35.73,20.69,Mixed Age,7931.0,31.39,1787.0,6664.0,11464.0,0.0,0.0,1.0,1.0,0.0,0.0,1,1,1470.63,2843.37,1938.77,3748.49,1.0,0.0,1.0,0.0,In our Global KYC organisation you will be wor...,18.0,18.0,128.0,109.0,1.0,558.0,320.0,3876.0,3240.0,23.0,in our global kyc organisation you will be wor...,No,No,1.0,0.0,1.0,0.0,"[in, our, global, kyc, organisation, you, will...",in our global kyc organisation you will be wor...,"[global, kyc, organisation, working, many, col...","[global, kyc, organis, work, colleagu, differ,...","[in, our, global, ky, ##c, organisation, you, ...",1,0.85,0,0.06,,,,,,


In [10]:
# All info
analysis_columns = [
    'Warmth',
    'Competence'
]

for df_name, df in dataframes.items():
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    df = categorize_df_gender_age(df)

    df.info()


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308583 entries, 0 to 308582
Columns: 101 entries, Search Keyword to Competence_Probability_predicted
dtypes: category(4), float64(64), int64(4), object(29)
memory usage: 229.5+ MB


In [11]:
# Keep only columns that contain neither list nor string values (one pass per column)
non_list_columns = [c for c in df_jobs.columns if not df_jobs[c].progress_apply(lambda x: isinstance(x, (list, str))).any()]
non_list_columns = df_jobs.columns.get_indexer(non_list_columns)


In [12]:
dfSummary(df_jobs.iloc[:, non_list_columns], is_collapsible = True)


No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Rating [float64],Mean (sd) : -0.4 (1.6) min < med < max: -1.0 < -1.0 < 5.0 IQR (CV) : 0.0 (-0.2),26 distinct values,,"142,752 (46.3%)"
2,Data Row [float64],Mean (sd) : 359.7 (284.7) min < med < max: 1.0 < 291.0 < 1000.0 IQR (CV) : 467.0 (1.3),999 distinct values,,"167,401 (54.2%)"
3,Dutch Requirement in Job Ad_No [float64],1. 1.0 2. 0.0,"298,863 (96.9%) 9,720 (3.1%)",,0 (0.0%)
4,Dutch Requirement in Job Ad_Yes [float64],1. 0.0 2. 1.0,"298,863 (96.9%) 9,720 (3.1%)",,0 (0.0%)
5,English Requirement in Job Ad_No [float64],1. 1.0 2. 0.0,"290,923 (94.3%) 17,660 (5.7%)",,0 (0.0%)
6,English Requirement in Job Ad_Yes [float64],1. 0.0 2. 1.0,"290,923 (94.3%) 17,660 (5.7%)",,0 (0.0%)
7,Keywords Count [float64],1. 11.0 2. 9.0 3. 7.0 4. 4.0 5. 6.0 6. 5.0 7. 8.0 8. 3.0 9. 1.0 10. 2.0,"55,320 (17.9%) 48,801 (15.8%) 35,078 (11.4%) 34,069 (11.0%) 33,933 (11.0%) 30,873 (10.0%) 26,234 (8.5%) 25,057 (8.1%) 10,359 (3.4%) 8,859 (2.9%)",,0 (0.0%)
8,Gender_Female_n [float64],Mean (sd) : 656.5 (1046.9) min < med < max: 7.0 < 226.0 < 3970.0 IQR (CV) : 329.0 (0.6),19 distinct values,,0 (0.0%)
9,Gender_Female_% per Sector [float64],Mean (sd) : 45.4 (19.5) min < med < max: 12.5 < 43.1 < 84.3 IQR (CV) : 37.5 (2.3),18 distinct values,,0 (0.0%)
10,Gender_Female_% per Social Category [float64],Mean (sd) : 5.5 (8.7) min < med < max: 0.1 < 1.9 < 33.0 IQR (CV) : 2.7 (0.6),19 distinct values,,0 (0.0%)


In [13]:
# non_list_columns holds column positions computed from df_jobs, so summarize df_jobs directly
skim(df_jobs.iloc[:, non_list_columns])


## Sentence Level

### All Gender and Age info at Sentence Level
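
`get_df_info` is imported from `setup_module` and its body is not shown in this notebook. The `Name: count` and `Name: proportion` labels in the output are consistent with pandas `value_counts`, so a minimal sketch of how such counts and percentages can be produced is given here. The toy `gender` series is made up for illustration and does not come from `df_jobs`.

```python
import pandas as pd

# Toy categorical column (assumption): 5 Mixed Gender, 3 Male, 2 Female
gender = pd.Series(['Mixed Gender'] * 5 + ['Male'] * 3 + ['Female'] * 2, name='Gender')

# Absolute counts per category, sorted from most to least frequent
counts = gender.value_counts()

# Shares per category as percentages, rounded to one decimal
percentages = (gender.value_counts(normalize=True) * 100).round(1)

print(counts)
print(percentages)
```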

In [14]:
# Gender and Age info by sentence
def run_descriptives_ivs_all_sent(df_name, df, ivs_all=ivs_all):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Gender and Age info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=ivs_all)
    print('-'*30)


In [15]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_ivs_all_sent_interact(df_name):
        run_descriptives_ivs_all_sent(df_name, dataframes[df_name])
else:
    run_descriptives_ivs_all_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender and Age info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308583 entries, 0 to 308582
Columns: 101 entries, Search Keyword to Competence_Probability_predicted
dtypes: category(4), float64(64), int64(4), object(29)
memory usage: 229.5+ MB
Gender:
--------------------
Gender Counts:
Gender
Mixed Gender    117398
Male            112854
Female           78331
Name: count, dtype: int64
--------------------
Gender Percentages:
Gender
Mixed Gender   38.00
Male           36.60
Female         25.40
Name: proportion, dtype: float64
--------------------
Gender not available.
Gender_Num:
--------------------
Gender_Num Counts:
Gender_Num
1    117398
2    112854
0     78331
Name: count, dtype: int64
--------------------
Gender_Num Percentages:
Gender_Num
1   38.00
2   36.60
0   25.40
Name: proportion, dtype: float64
--------------------
Min Gender_Num value: 0.0
Max Gender_Num 

### % Gender and Age info at Sentence Level

In [16]:
def run_descriptives_iv_percs_sent(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    for iv_perc in ivs_perc:
        min_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].min()].values[0]
        max_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].max()].values[0]
        mean = df[iv_perc].mean().round(2).astype(float)
        std = df[iv_perc].std().round(2).astype(float)
        print(f'{iv_perc}:\nMin Sector: {df[iv_perc].min():.1f}% in {min_sector}\nMax Sector: {df[iv_perc].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
        print('-'*20)


In [17]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_iv_percs_sent_interact(df_name):
        run_descriptives_iv_percs_sent(df_name, dataframes[df_name])
else:
    run_descriptives_iv_percs_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender_Female_% per Sector:
Min Sector: 12.5% in Construction
Max Sector: 84.3% in Health and social work activities
Mean: 45.36
Standard Deviation: 19.48

--------------------
Gender_Male_% per Sector:
Min Sector: 15.6% in Health and social work activities
Max Sector: 87.5% in Construction
Mean: 54.59
Standard Deviation: 19.51

--------------------
Age_Older_% per Sector:
Min Sector: 18.9% in Accommodation and food serving
Max Sector: 58.3% in Water supply and waste management
Mean: 40.85
Standard Deviation: 10.11

--------------------
Age_Younger_% per Sector:
Min Sector: 44.4% in Water supply and waste management
Max Sector: 80.8% in Accommodation and food serving
Mean: 59.05
Standard Deviation: 9.98

--------------------
CPU times: user 12 ms, sys: 5.84 ms, total: 17.8 ms
Wall time: 18.8 ms


### All Warmth and Competence info at Sentence Level
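
Because `Warmth` and `Competence` are coded 0 or 1, their mean is the share of sentences coded 1, and the standard deviation follows directly from that share. The short check here recomputes the Warmth mean and standard deviation from the counts reported in this subsection. The helper name `binary_mean_sd` is made up for illustration, and the sample formula (n - 1 in the denominator) is used because that is the pandas default for `.std()`.

```python
import math


def binary_mean_sd(n_zero, n_one):
    """Mean and sample standard deviation of a 0/1 variable from its counts."""
    n = n_zero + n_one
    p = n_one / n
    # Sample variance of 0/1 data: n * p * (1 - p) / (n - 1)
    sd = math.sqrt(n * p * (1 - p) / (n - 1))
    return p, sd


# Counts taken from the Warmth and Competence output in this subsection
warmth_mean, warmth_sd = binary_mean_sd(n_zero=234366, n_one=74217)
competence_mean, _ = binary_mean_sd(n_zero=161987, n_one=146596)

print(round(warmth_mean, 2), round(warmth_sd, 2))  # 0.24 0.43
print(round(competence_mean * 100, 1))             # 47.5
```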

In [18]:
# Warmth and Competence percentages info by sentence
def run_descriptives_dvs_sent(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Warmth and Competence info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=dvs_all + dvs_all_predicted)
    print('-'*30)


In [19]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_dvs_sent_interact(df_name):
        run_descriptives_dvs_sent(df_name, dataframes[df_name])
else:
    run_descriptives_dvs_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308583 entries, 0 to 308582
Columns: 101 entries, Search Keyword to Competence_Probability_predicted
dtypes: category(4), float64(64), int64(4), object(29)
memory usage: 229.5+ MB
Warmth:
--------------------
Warmth Counts:
Warmth
0    234366
1     74217
Name: count, dtype: int64
--------------------
Warmth Percentages:
Warmth
0   75.90
1   24.10
Name: proportion, dtype: float64
--------------------
Min Warmth value: 0.0
Max Warmth value: 1.0
--------------------
Warmth Mean: 0.24
--------------------
Warmth Standard Deviation: 0.43
Competence:
--------------------
Competence Counts:
Competence
0    161987
1    146596
Name: count, dtype: int64
--------------------
Competence Percentages:
Competence
0   52.50
1   47.50
Name: proportion, dtype: float64
--------------------
Min Competence value: 0.0
Max 

## Job Ad Level
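
The cells in this section collapse the sentence-level dataframe to one row per advertisement with `df.groupby(['Job ID']).first()`. A minimal sketch on made-up data (the `sentences` frame and its values are illustrative, not taken from `df_jobs`) shows two consequences. First, `Job ID` moves into the index, which is why the job-ad `info()` output lists 100 columns instead of 101. Second, `first()` returns the first non-null value of each column, so sentence-level columns such as `Warmth` take their value from a single sentence rather than an aggregate over the ad.

```python
import pandas as pd

# Toy sentence-level frame (assumption): one row per sentence, ad-level fields repeated
sentences = pd.DataFrame({
    'Job ID': ['a1', 'a1', 'a1', 'b2', 'b2'],
    'Sector': [None, 'Construction', 'Construction', 'Education', 'Education'],
    'Warmth': [0, 1, 0, 1, 1],
})

# One row per job ad; 'Job ID' becomes the index
job_ads = sentences.groupby(['Job ID']).first()

print(job_ads)
print(job_ads.shape)
```

If an ad-level summary of a sentence-level variable is wanted instead, an aggregate such as the per-ad mean would be one alternative; the notebook itself uses `first()`.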

### All Gender and Age info at Job Ad Level

In [20]:
# Gender and Age info by job ad
def run_descriptives_ivs_all_job(df_name, df, ivs_all=ivs_all):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Gender and Age info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all=ivs_all)
    print('-'*30)


In [21]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_ivs_all_job_interact(df_name):
        run_descriptives_ivs_all_job(df_name, dataframes[df_name])
else:
    run_descriptives_ivs_all_job(list(dataframes.keys())[0], list(dataframes.values())[0])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender and Age info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16135 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 100 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16135 non-null  object  
 1   Platform                                                            16135 non-null  object  
 2   Job Title                                                           16135 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16135 non-null  object  
 5   Job Description                          

### % Gender and Age info at Job Ad Level

In [22]:
def run_descriptives_iv_percs_job(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')
    df = df.groupby(['Job ID']).first()

    for iv_perc in ivs_perc:
        min_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].min()].values[0]
        max_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].max()].values[0]
        mean = df[iv_perc].mean().round(2).astype(float)
        std = df[iv_perc].std().round(2).astype(float)
        print(f'{iv_perc}:\nMin Sector: {df[iv_perc].min():.1f}% in {min_sector}\nMax Sector: {df[iv_perc].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
        print('-'*20)


In [23]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_iv_percs_job_interact(df_name):
        run_descriptives_iv_percs_job(df_name, dataframes[df_name])
else:
    run_descriptives_iv_percs_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender_Female_% per Sector:
Min Sector: 12.5% in Construction
Max Sector: 84.3% in Health and social work activities
Mean: 43.82
Standard Deviation: 18.86

--------------------
Gender_Male_% per Sector:
Min Sector: 15.6% in Health and social work activities
Max Sector: 87.5% in Construction
Mean: 56.13
Standard Deviation: 18.89

--------------------
Age_Older_% per Sector:
Min Sector: 18.9% in Accommodation and food serving
Max Sector: 58.3% in Water supply and waste management
Mean: 40.61
Standard Deviation: 10.23

--------------------
Age_Younger_% per Sector:
Min Sector: 44.4% in Water supply and waste management
Max Sector: 80.8% in Accommodation and food serving
Mean: 59.26
Standard Deviation: 10.14

--------------------
CPU times: user 264 ms, sys: 12.5 ms, total: 277 ms
Wall time: 285 ms


### All Warmth and Competence info at Job Ad Level

In [24]:
# Warmth and Competence info by job ad
def run_descriptives_dvs_job(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Warmth and Competence info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all=dvs_all + dvs_all_predicted)
    print('-'*30)


In [25]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_dvs_job_interact(df_name):
        run_descriptives_dvs_job(df_name, dataframes[df_name])
else:
    run_descriptives_dvs_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16135 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 100 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16135 non-null  object  
 1   Platform                                                            16135 non-null  object  
 2   Job Title                                                           16135 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16135 non-null  object  
 5   Job Description                   

### All Job Ad string info at Job Ad Level

In [26]:
# Get average, longest, and shortest job description lengths
def run_job_desc_lengths(df_name, df, text_col=None, num_words_col=None):
    # Use the given column, or fall back to the defaults for each dataframe
    if text_col is not None:
        text_cols = [text_col]
    elif df_name == 'df_jobs':
        text_cols = ['Job Description spacy_sentencized', 'Job Description']
    else:
        text_cols = ['Job Description spacy_sentencized']
    if num_words_col is not None:
        num_words_cols = [num_words_col]
    elif df_name == 'df_jobs':
        num_words_cols = ['Job Description spacy_sentencized_num_words', 'Job Description_num_words']
    else:
        num_words_cols = ['Job Description spacy_sentencized_num_words']
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Job Description Length at Sentence Level')
    print('-'*30)
    for text_col in text_cols:
        print(f'Analyzing {text_col}')
        print('='*30)
        # Character lengths, computed once per text column (kept for inspection, not printed)
        char_lens = df[text_col].dropna().progress_apply(len)
        len_average_char = char_lens.mean()
        average_char = df[text_col].loc[(char_lens - len_average_char).abs().idxmin()]
        len_longest_char = char_lens.max()
        longest_char = df[text_col].loc[char_lens.idxmax()]
        len_shortest_char = char_lens.min()
        shortest_char = df[text_col].loc[char_lens.idxmin()]

    for num_words_col in num_words_cols:
        print(f'Analyzing {num_words_col}')
        print('='*30)
        len_average = df[num_words_col].mean()
        len_average_sd = df[num_words_col].std()
        len_longest = df[num_words_col].max()
        len_shortest = df[num_words_col].min()

        print(f'Average length: {len_average}')
        print('-'*30)
        print(f'Standard deviation: {len_average_sd}')
        print('-'*30)
        print(f'Longest: {len_longest}')
        print('-'*30)
        print(f'Shortest: {len_shortest}')
        print('-'*30)


In [27]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_job_desc_lengths_interact(df_name):
        run_job_desc_lengths(df_name, dataframes[df_name])
else:
    run_job_desc_lengths(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Job Description Length at Sentence Level
------------------------------
Analyzing Job Description spacy_sentencized
Analyzing Job Description
Analyzing Job Description spacy_sentencized_num_words
Average length: 17.663730665655592
------------------------------
Standard deviation: 16.437469916730734
------------------------------
Longest: 349.0
------------------------------
Shortest: 3.0
------------------------------
Analyzing Job Description_num_words
Average length: 613.2867323883133
------------------------------
Standard deviation: 524.4305847258732
------------------------------
Longest: 10385.0
------------------------------
Shortest: 4.0
------------------------------
CPU times: user 766 ms, sys: 38.9 ms, total: 805 ms
Wall time: 827 ms


# Controls

## Sentence Level

### Controls all info at Sentence Level

In [28]:
# Control variables info by sentence
def run_descriptives_controls_sent(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print(f'Control variables info at Sentence Level: {controls_}')
    print('-'*30)
    get_df_info(df, ivs_all = controls_)
    print('-'*30)


In [29]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_controls_sent_interact(df_name):
        run_descriptives_controls_sent(df_name, dataframes[df_name])
else:
    run_descriptives_controls_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Control variables info at Sentence Level: ['Job Description spacy_sentencized_num_words', '% Sector per Workforce', 'Sector Job Advertisement Count', 'Keywords Count', 'English Requirement in Job Ad_Yes', 'Dutch Requirement in Job Ad_Yes']
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308583 entries, 0 to 308582
Columns: 101 entries, Search Keyword to Competence_Probability_predicted
dtypes: category(4), float64(64), int64(4), object(29)
memory usage: 229.5+ MB
Job Description spacy_sentencized_num_words:
--------------------
Min Job Description spacy_sentencized_num_words value: 3.0
Max Job Description spacy_sentencized_num_words value: 349.0
--------------------
Job Description spacy_sentencized_num_words Mean: 17.66
--------------------
Job Description spacy_sentencized_num_words Standard Deviation: 16.44
% Sector per Workforce:
--------------------
Min % Sector per Workforce value: 0.11

### All % Sector per Workforce info at Sentence Level

In [30]:
def run_descriptives_sectors_all_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Sector info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=['% Sector per Workforce'])
    print('-'*30)


In [31]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_sectors_all_job_interact(df_name):
        run_descriptives_sectors_all_job(df_name, dataframes[df_name])
else:
    run_descriptives_sectors_all_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Sector info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308583 entries, 0 to 308582
Columns: 101 entries, Search Keyword to Competence_Probability_predicted
dtypes: category(4), float64(64), int64(4), object(29)
memory usage: 229.5+ MB
% Sector per Workforce:
--------------------
Min % Sector per Workforce value: 0.115
Max % Sector per Workforce value: 31.385
--------------------
% Sector per Workforce Mean: 5.41
--------------------
% Sector per Workforce Standard Deviation: 8.76


------------------------------
CPU times: user 5.77 ms, sys: 1.51 ms, total: 7.29 ms
Wall time: 7.52 ms


### % Sector per Workforce at Sentence Level

In [32]:
def run_descriptives_sectors_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    min_sector = df['Sector'].loc[df['% Sector per Workforce'] == df['% Sector per Workforce'].min()].values[0]
    max_sector = df['Sector'].loc[df['% Sector per Workforce'] == df['% Sector per Workforce'].max()].values[0]
    mean = df['% Sector per Workforce'].mean().round(2).astype(float)
    std = df['% Sector per Workforce'].std().round(2).astype(float)
    print(f'"% Sector per Workforce":\nMin Sector: {df["% Sector per Workforce"].min():.1f}% in {min_sector}\nMax Sector: {df["% Sector per Workforce"].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
    print('-'*20)


In [33]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_sectors_job_interact(df_name):
        run_descriptives_sectors_job(df_name, dataframes[df_name])
else:
    run_descriptives_sectors_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

"% Sector per Workforce":
Min Sector: 0.1% in Energy supply
Max Sector: 31.4% in Commercial services
Mean: 5.41
Standard Deviation: 8.76

--------------------
CPU times: user 4.22 ms, sys: 2.07 ms, total: 6.28 ms
Wall time: 8.8 ms
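
The function above finds the smallest and largest sectors by comparing every row against the column minimum and maximum. A hedged alternative sketch uses `idxmin` and `idxmax` to fetch the matching row directly. The toy frame below is illustrative only; its sector names and shares are not the CBS figures.

```python
import pandas as pd

# Toy data; sector names and shares are illustrative, not the CBS figures.
toy = pd.DataFrame({
    'Sector': ['A', 'A', 'B', 'C'],
    '% Sector per Workforce': [2.0, 2.0, 10.5, 0.4],
})

col = '% Sector per Workforce'

# idxmin/idxmax return the index label of the first extreme value,
# so a single .loc lookup retrieves the whole matching row.
min_row = toy.loc[toy[col].idxmin()]
max_row = toy.loc[toy[col].idxmax()]

print(f'Min Sector: {min_row[col]:.1f}% in {min_row["Sector"]}')
print(f'Max Sector: {max_row[col]:.1f}% in {max_row["Sector"]}')
```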


### IVs and Controls Correlation Matrix

In [34]:
def run_corr_ivs_controls_sent(df_name, df, ivs_=None, controls_=None):
    if ivs_ is None:
        ivs_ = ivs_dummy_perc_and_perc_interactions
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    considered_features = controls_[:1] + ivs_[:]
    corr_df = df[considered_features].corr()
    print('-'*20)
    # print(f'Correlation Matrix for {df_name}')
    # print(corr_df)
    print('-'*20)
    print('Highly correlated variables:\n')
    print('-'*20)
    print(corr_df[(corr_df > 0.5) & (corr_df != 1)].stack().sort_values(ascending=False).drop_duplicates())
    print('-'*20)


In [35]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_dummy_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector               Interaction_Female_Younger_% per Sector   0.93
Interaction_Female_Older_% per Sector    Gender_Female_% per Sector                0.90
                                         Gender_Female                             0.89
Interaction_Male_Younger_% per Sector    Gender_Male_% per Sector                  0.86
Interaction_Male_Older_% per Sector      Gender_Male_% per Sector                  0.83
Gender_Female                            Gender_Female_% per Sector                0.81
Gender_Male_% per Sector                 Gender_Male                               0.80
Age_Older                                Interaction_Male_Older_% per Sector       0.78
Gender_Male                              Interaction_Male_Older_% per Sector       0.77
Interaction_Female_Older_% per Sector    Interaction_Female_Youn

In [36]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_dummy_and_perc)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_and_perc)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female  Gender_Female_% per Sector   0.81
Gender_Male    Gender_Male_% per Sector     0.80
Age_Older      Age_Older_% per Sector       0.66
Age_Younger    Age_Younger_% per Sector     0.63
Gender_Mixed   Age_Younger                  0.55
               Age_Younger_% per Sector     0.53
dtype: float64
--------------------
CPU times: user 81.1 ms, sys: 8.12 ms, total: 89.3 ms
Wall time: 103 ms


In [37]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector             Interaction_Female_Younger_% per Sector   0.93
                                       Interaction_Female_Older_% per Sector     0.90
Gender_Male_% per Sector               Interaction_Male_Younger_% per Sector     0.86
                                       Interaction_Male_Older_% per Sector       0.83
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.68
Age_Older_% per Sector                 Interaction_Male_Older_% per Sector       0.58
dtype: float64
--------------------
CPU times: user 58 ms, sys: 5.42 ms, total: 63.4 ms
Wall time: 67.9 ms


In [38]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.68
dtype: float64
--------------------
CPU times: user 20.8 ms, sys: 2.08 ms, total: 22.9 ms
Wall time: 21.4 ms
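
The correlation cells above remove mirrored pairs with `drop_duplicates()`, which compares coefficient values only. Two different variable pairs that share the same coefficient would therefore collapse into one row. The sketch below shows one possible alternative that keeps each pair once by masking everything except the upper triangle of the matrix. `high_corr_pairs` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.5):
    """List each column pair once when its correlation exceeds `threshold`."""
    corr = df.corr()
    # Boolean mask that keeps only cells strictly above the diagonal,
    # so (a, b) is kept and its mirror (b, a) is blanked out.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: b and c both copy a, so three distinct pairs tie on the same value.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, 2, 3, 4],
    'c': [1, 2, 3, 4],
    'd': [4, 1, 3, 2],
})
result = high_corr_pairs(toy)
print(result)
```

All three tied pairs survive here, whereas a value-based `drop_duplicates()` would keep only one of them. Note that the notebook also filters out coefficients equal to 1, which this sketch does not do.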


## Job Ad Level

### All Controls info at Job Ad Level

In [39]:
# Control variables info by job ad
def run_descriptives_controls_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Control variables info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all = controls_)
    print('-'*30)
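
The job ad level views in this section collapse the sentence-level frame with `groupby(['Job ID']).first()`. In pandas, `first()` returns the first non-null value of each column within a group, which is not always the first row. The toy frame below is an illustrative sketch of that difference; only the column names `Job ID` and `Keywords Count` are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy sentence-level frame: one row per sentence, job-level fields repeated.
sentences = pd.DataFrame({
    'Job ID': ['j1', 'j1', 'j2'],
    'Keywords Count': [np.nan, 3.0, 5.0],
    'Sentence': ['s1', 's2', 's3'],
})

# First non-null value per column within each job ad.
first_non_null = sentences.groupby('Job ID').first()

# First row of each job ad exactly as stored, missing values included.
first_row = sentences.groupby('Job ID').head(1).set_index('Job ID')

print(first_non_null)
print(first_row)
```

When job-level fields are identical across all sentences of an ad, both approaches give the same result. They differ only when the first sentence row has a missing value.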


In [40]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_controls_job_interact(df_name):
        run_descriptives_controls_job(df_name, dataframes[df_name])
else:
    run_descriptives_controls_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Control variables info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16135 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 100 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16135 non-null  object  
 1   Platform                                                            16135 non-null  object  
 2   Job Title                                                           16135 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16135 non-null  object  
 5   Job Description                       

In [41]:
def run_corr_ivs_controls_job(df_name, df, ivs_=None, controls_=None):
    if ivs_ is None:
        ivs_ = ivs_dummy_perc_and_perc_interactions
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    df = df.groupby(['Job ID']).first()

    considered_features = controls_[:1] + ivs_[:]
    corr_df = df[considered_features].corr()
    print('-'*20)
    # print(f'Correlation Matrix for {df_name}')
    # print(corr_df)
    print('-'*20)
    print('Highly correlated variables:\n')
    print('-'*20)
    print(corr_df[(corr_df > 0.5) & (corr_df != 1)].stack().sort_values(ascending=False).drop_duplicates())
    print('-'*20)


In [42]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_job(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Younger_% per Sector  Gender_Female_% per Sector                0.93
Interaction_Female_Older_% per Sector    Gender_Female_% per Sector                0.89
Gender_Female                            Interaction_Female_Older_% per Sector     0.87
Interaction_Male_Younger_% per Sector    Gender_Male_% per Sector                  0.84
Interaction_Male_Older_% per Sector      Gender_Male_% per Sector                  0.83
Gender_Male                              Gender_Male_% per Sector                  0.82
Interaction_Male_Older_% per Sector      Gender_Male                               0.79
                                         Age_Older                                 0.78
Gender_Female                            Gender_Female_% per Sector                0.78
Age_Older                                Age_Older_% per Sector 

In [43]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_and_perc)
else:
    run_corr_ivs_controls_job(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_and_perc)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Male    Gender_Male_% per Sector     0.82
Gender_Female  Gender_Female_% per Sector   0.78
Age_Older      Age_Older_% per Sector       0.69
Age_Younger    Age_Younger_% per Sector     0.63
Gender_Mixed   Age_Younger_% per Sector     0.54
               Age_Younger                  0.50
dtype: float64
--------------------
CPU times: user 262 ms, sys: 11 ms, total: 273 ms
Wall time: 272 ms


In [44]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_job(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector             Interaction_Female_Younger_% per Sector   0.93
                                       Interaction_Female_Older_% per Sector     0.89
Gender_Male_% per Sector               Interaction_Male_Younger_% per Sector     0.84
                                       Interaction_Male_Older_% per Sector       0.83
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.66
Age_Older_% per Sector                 Interaction_Male_Older_% per Sector       0.64
dtype: float64
--------------------
CPU times: user 265 ms, sys: 14 ms, total: 279 ms
Wall time: 298 ms


In [45]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_interactions)
else:
    run_corr_ivs_controls_job(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.66
dtype: float64
--------------------
CPU times: user 276 ms, sys: 20.5 ms, total: 297 ms
Wall time: 353 ms


## Imbalance Ratios

In [46]:
# Imbalance Ratio
all_imbalance_ratio_dict = {}
def run_imbalance_ratio(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    # Ratio of positive (1) to negative (0) labels for each dependent variable
    warmth_imbalance_ratio = (
        df['Warmth'].loc[df['Warmth'] == 1].count()
        / df['Warmth'].loc[df['Warmth'] == 0].count()
    )
    competence_imbalance_ratio = (
        df['Competence'].loc[df['Competence'] == 1].count()
        / df['Competence'].loc[df['Competence'] == 0].count()
    )

    all_imbalance_ratio_dict[f'{df_name} Warmth'] = warmth_imbalance_ratio
    all_imbalance_ratio_dict[f'{df_name} Competence'] = competence_imbalance_ratio

    print('='*20)
    print('Imbalance Ratios')
    print('-'*10)
    print(f'Warmth IR: {warmth_imbalance_ratio:.2f}')
    print(f'Competence IR: {competence_imbalance_ratio:.2f}')
    print('='*20)

    with open(f'{data_dir}{df_name}_all_imbalance_ratio_dict.json', 'w') as f:
        json.dump(all_imbalance_ratio_dict, f)


In [47]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_imbalance_ratio_interact(df_name):
        run_imbalance_ratio(df_name, dataframes[df_name])
else:
    run_imbalance_ratio(list(dataframes.keys())[0], list(dataframes.values())[0])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Imbalance Ratios
----------
Warmth IR: 0.32
Competence IR: 0.90
CPU times: user 6.25 ms, sys: 3.23 ms, total: 9.48 ms
Wall time: 10.4 ms
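
The imbalance ratio used here is the number of positive labels divided by the number of negative labels. As a worked check, the sketch below recomputes both ratios from the sentence-level counts reported in the Tables section (74,217 Warmth and 146,596 Competence sentences out of 308,583). It assumes every sentence is labeled either 0 or 1 with no missing values; `imbalance_ratio` is a hypothetical helper.

```python
def imbalance_ratio(n_positive, n_total):
    """Positives divided by negatives, assuming labels are only 0 or 1."""
    return n_positive / (n_total - n_positive)


n_sentences = 308583  # sentence rows in df_jobs

warmth_ir = imbalance_ratio(74217, n_sentences)
competence_ir = imbalance_ratio(146596, n_sentences)

print(f'Warmth IR: {warmth_ir:.2f}')
print(f'Competence IR: {competence_ir:.2f}')
```

Under that assumption the results round to 0.32 and 0.90, in line with the output above.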


# Tables

In [48]:
def save_desc_excel(
    df_desc,
    index_var,
    title_prefix,
    file_save_path,
    sheet_name=None,
    startrow=None,
    startcol=None,
):
    if sheet_name is None:
        sheet_name = 'All'
    if startrow is None:
        startrow = 1
    if startcol is None:
        startcol = 1

    # index = df_desc.index.to_frame().reset_index(drop=True)
    df_desc = df_desc.reset_index(drop=False, col_level=1, col_fill=f'{title_prefix} Job Advertisements')

    # Define last rows and cols locs
    header_range = len(df_desc.columns.levels)
    endrow = startrow + header_range + df_desc.shape[0]
    endcol = startcol + df_desc.shape[1]

    # Write
    writer = pd.ExcelWriter(f'{file_save_path}.xlsx', engine='xlsxwriter') # the formatting calls below use the xlsxwriter API
    df_desc.to_excel(writer, sheet_name=sheet_name, merge_cells=True, startrow=startrow, startcol=startcol)
    workbook  = writer.book
    worksheet = writer.sheets[sheet_name]
    worksheet.set_row(startrow + header_range, None, None, {'hidden': True}) # hide the empty row that appears after the headers
    worksheet.set_column(startcol, startcol, None, None, {'hidden': True}) # hide the index column

    # MAIN BODY
    # Format column headers
    for i, (col_num, col_value) in tqdm_product(range(header_range), (enumerate(df_desc.columns.values))):
        row_to_write = startrow + i
        col_to_write = startcol + 1 + col_num # 1 is for index
        header_formats = {'bold': False, 'font_name': 'Times New Roman', 'font_size': 12, 'font_color': 'black', 'align': 'center', 'top': True, 'bottom': True, 'left': False, 'right': False}

        if col_value[i] in ['n', 'M', 'SD']:
            header_formats |= {'italic': True}

        if col_value[i] == '95% Conf.':
            worksheet.set_column(col_to_write, col_to_write, 8.5)

        if col_value[i] == index_var:
            worksheet.set_column(col_to_write, col_to_write, 10)
            header_formats['align'] = 'left'
            header_formats |= {'text_wrap': True}
            worksheet.merge_range(row_to_write, col_to_write, header_range, col_to_write, index_var, workbook.add_format(header_formats))
        else:
            worksheet.write(row_to_write, col_to_write, col_value[i], workbook.add_format(header_formats))

    # Format body columns
    num = [col_num for col_num, value in enumerate(df_desc.columns.values) if value[-1] == 'n']
    perc = [col_num for col_num, value in enumerate(df_desc.columns.values) if value[-1] == '%']
    body_max_row_idx, body_max_col_idx = df_desc.shape

    for c, r in tqdm_product(range(body_max_col_idx), range(body_max_row_idx)):
        row_to_write = startrow + header_range + 1 + r # 1 is for the hidden empty row under the header
        col_to_write = startcol + 1 + c # 1 is for index
        body_formats = {'num_format': '0.00', 'font_name': 'Times New Roman', 'font_size': 12, 'font_color': 'black', 'align': 'center', 'text_wrap': True, 'left': False, 'right': False}

        if r == body_max_row_idx-1:
            body_formats |= {'bottom': True}

        if c == 0:
            body_formats |= {'align': 'left'}

        if c in num:
            body_formats |= {'num_format': '0'}

        if c in perc:
            body_formats |= {'num_format': '0.0'}

        worksheet.write(row_to_write, col_to_write, df_desc.iloc[r, c], workbook.add_format(body_formats))

    writer.close()


In [49]:
def make_df_desc(df, df_name, vars_list, var_name, index_var, sentence_level=False, continous_var_names_list=None):

    if continous_var_names_list is None:
        continous_var_names_list = ['Probabilities', 'Percentages']

    if df_name == 'df_manual':
        title_prefix = 'Manually Annotated Dataset'
    elif df_name == 'df_jobs':
        title_prefix = 'Classifier Labeled'

    if sentence_level:
        level = 'Sentence'
    else:
        level = 'Job Advertisement'
        df = df.groupby('Job ID').first()

    # Warmth and Competence Categorical df
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        df_cat = rp.summary_cat(df[vars_list], ascending= True).round(2)
        df_cat['Variable'] = df_cat['Variable'].replace('', np.nan).ffill()
        df_cat = df_cat.loc[df_cat['Outcome'] == 1].drop(columns=['Outcome'])
        totals = pd.DataFrame(df_cat.sum(numeric_only=True)).transpose()
        totals.insert(0, 'Variable', 'Total')
        df_cat = df_cat.fillna('-')
        df_cat = pd.concat([df_cat, totals], axis='index', ignore_index=True)

    # Warmth and Competence Continuous df
    df_cont = rp.summary_cont(df[vars_list], conf = 0.95, decimals = 2)

    # Merged df
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        df_desc = df_cat.merge(df_cont, on='Variable', how='outer')
        df_desc = df_desc.fillna('-')
    else:
        df_desc = df_cont

    # Rename variable columns
    df_desc['Variable'] = df_desc['Variable'].progress_apply(
        lambda var_name: f'{var_name.split("_")[1]}-dominated'.replace('_', ' ').strip()
        if '_' in var_name and 'Mixed' not in var_name and '%' not in var_name and 'Probability' not in var_name
        else f'{var_name.split("_")[1]} Gender'.replace('_', ' ')
        if '_' in var_name and 'Mixed' in var_name and '%' not in var_name and 'Probability' not in var_name
        else " ".join(var_name.split("_")[1:]).split()[0]
        if '_' in var_name and 'Mixed' not in var_name and '%' in var_name and 'Probability' not in var_name
        else f'{var_name.split("_")[0]} Probability'.replace('_', ' ')
        if '_' in var_name and 'Mixed' not in var_name and '%' not in var_name and 'Probability' in var_name
        else var_name
    )

    # Clean up df and set index
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        drop_columns = ['N', 'SE', '95% Conf.', 'Interval']
        rename_dict = {'Variable': index_var, 'Count': 'n', 'Percent': '%', 'Mean': 'M'}
    else:
        drop_columns = ['N', 'SE']
        rename_dict = {'Variable': index_var, 'Mean': 'M', 'SD': 'SD', '95% Conf. Int.': '95% CI'}

    df_desc = df_desc.drop(columns=drop_columns)
    df_desc = df_desc.rename(columns=rename_dict)
    df_desc = df_desc.set_index(keys=[index_var], drop=True)

    # Make into MultiIndex
    df_desc.columns = pd.MultiIndex.from_product([[level], df_desc.columns])

    return df_desc
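
The chained conditional that renames the `Variable` column is hard to follow, and it labels every mixed category as "Mixed Gender", which is why that label also appears in the age table. The sketch below expresses a similar mapping as a named function and, as a deliberate deviation, labels mixed sectors by their own prefix. `label_variable` is a hypothetical helper, the column name `Age_Mixed` is assumed by analogy with `Gender_Mixed`, and the function is not a drop-in replacement for the lambda.

```python
def label_variable(var_name):
    """Map a raw column name to a table row label (illustrative helper)."""
    if '_' not in var_name:
        return var_name
    prefix, rest = var_name.split('_', 1)
    if 'Probability' in var_name:
        # 'Competence_Probability_predicted' -> 'Competence Probability'
        return f'{prefix} Probability'
    if '%' in var_name:
        # 'Gender_Female_% per Sector' -> 'Female'
        return rest.split('_')[0]
    if 'Mixed' in var_name:
        # Assumption: label mixed sectors by their own prefix,
        # so 'Age_Mixed' becomes 'Mixed Age' rather than 'Mixed Gender'.
        return f'Mixed {prefix}'
    # 'Gender_Female' -> 'Female-dominated'
    return f'{rest}-dominated'


for name in ['Gender_Female', 'Gender_Mixed', 'Age_Mixed',
             'Gender_Female_% per Sector', 'Competence_Probability_predicted']:
    print(f'{name} -> {label_variable(name)}')
```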


In [50]:
vars_dict = {
    'Categorical Gender Sector Designation': ivs_gender_dummy,
    'Categorical Age Sector Designation': ivs_age_dummy,
    'Percentages of Gender per Sector (%)': ivs_gender_perc,
    'Percentages of Age per Sector (%)': ivs_age_perc,
    'Warmth and Competence Categorical Coding': dvs,
    'Warmth and Competence Probabilities': dvs_prob,
}


In [51]:
def make_desc_tables(df_name, df, var_name, vars_list):
    if df_name == 'df_manual':
        title_prefix = 'Manually Annotated Dataset'
    elif df_name == 'df_jobs' and 'Warmth and Competence' not in var_name:
        title_prefix = 'Collected Dataset'
    elif df_name == 'df_jobs':
        title_prefix = 'Classifier Labeled'

    # Set index variable name
    if 'Warmth and Competence' in var_name:
        index_var = 'Stereotype-related frames'
    elif 'Percentages' in var_name:
        index_var = 'Percentages per Sector (PPS)'
    else:
        index_var = 'Sectors'

    with contextlib.suppress(KeyError):
        # Categorical DF on job ad level
        df_desc_cat_jobad = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=False)

        # Categorical DF on sentence level
        df_desc_cat_sent = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=True)

        # Merge Categorical dfs
        df_desc_cat = df_desc_cat_jobad.merge(df_desc_cat_sent, on=index_var)

        # Continuous DF on job ad level
        df_desc_cont_jobad = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=False)

        # Continuous DF on sentence level
        df_desc_cont_sent = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=True)

        # Merge Continuous dfs
        df_desc_cont = df_desc_cont_jobad.merge(df_desc_cont_sent, on=index_var)

        # Collect dfs in list
        df_desc_list = [df_desc_cat, df_desc_cont]

        for df_desc in df_desc_list:
            levels_with_title = [[f'{title_prefix} Job Advertisements']]
            # Add title prefix
            levels_with_title.extend(
                list(df_desc.columns.get_level_values(i).unique())
                    for i in range(len(df_desc.columns.levels))
            )
            # levels_with_title.insert(0, )
            if 'Warmth and Competence' not in var_name:
                levels_with_title.insert(1, [var_name])

            # Make into MultiIndex
            df_desc.columns = pd.MultiIndex.from_product(levels_with_title)

            # Save Tables
            # File save path
            file_save_path = f'{table_save_path}descriptives {df_name} {title_prefix} {var_name} - Job Advertisement'
            # CSV
            df_desc.to_csv(f'{file_save_path}.csv', index=True)
            # PKL
            df_desc.to_pickle(f'{file_save_path}.pkl')
            # TEX
            with pd.option_context('max_colwidth', 10000000000):
                df_desc.style.to_latex(
                    f'{file_save_path}.tex',
                    convert_css=True,
                    environment='longtable',
                    hrules=True,
                    # escape=True,
                    # multicolumn=True,
                    multicol_align='c',
                    position='H',
                    caption=f'{var_name} Descriptives', label='Descriptives'
                )
            # MD
            df_desc.to_markdown(f'{file_save_path}.md', index=True)
            # EXCEL
            save_desc_excel(df_desc, index_var, title_prefix, file_save_path)

        print('\n')
        print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')
        print(f'{var_name} Descriptives')
        if df_desc_list[0].equals(df_desc_list[1]):
            print(df_desc_list[0])
        else:
            print(df_desc_list[0])
            print(df_desc_list[1])
        print('\n')


In [52]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys(), var_name=vars_dict.keys())
    def make_desc_tables_interact(df_name, var_name):
        make_desc_tables(df_name, dataframes[df_name], var_name, vars_dict[var_name])
else:
    for (df_name, df), (var_name, vars_list) in tqdm_product(dataframes.items(), vars_dict.items()):
        make_desc_tables(df_name, df, var_name, vars_list)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Categorical Gender Sector Designation Descriptives
                  Collected Dataset Job Advertisements                                           
                 Categorical Gender Sector Designation                                           
                           Job Advertisement                             Sentence                
                                   n                     %     M    SD      n       %    M    SD 
Sectors                                                                                          
Female-dominated                 3475.00                21.54 0.22 0.41  78331.00 25.38 0.25 0.44
Mixed Gender                     6301.00                39.05 0.39 0.49 117398.00 38.04 0.38 0.49
Male-dominated                   6359.00                39.41 0.39 0.49 112854.00 36.57 0.37 0.48
Total                           16135.00               100.00    -    - 308583.00 99.99    -    -

++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Categorical Age Sector Designation Descriptives
                  Collected Dataset Job Advertisements                                           
                   Categorical Age Sector Designation                                            
                           Job Advertisement                            Sentence                 
                                   n                     %    M    SD      n       %     M    SD 
Sectors                                                                                          
Older-dominated                  3605.00               22.34 0.22 0.42  62868.00  20.37 0.20 0.40
Mixed Gender                    10277.00               63.69 0.64 0.48 198012.00  64.17 0.64 0.48
Younger-dominated                2253.00               13.96 0.14 0.35  47703.00  15.46 0.15 0.36
Total                           16135.00               99.99    -    - 308583.00 100.00    -    -

++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Percentages of Gender per Sector (%) Descriptives
                             Collected Dataset Job Advertisements                                                           
                             Percentages of Gender per Sector (%)                                                           
                                      Job Advertisement                                    Sentence                         
                                              M                     SD  95% Conf. Interval    M       SD  95% Conf. Interval
Percentages per Sector (PPS)                                                                                                
Female                                      43.82                 18.86   43.53    44.11    45.36   19.48   45.29    45.43  
Male                                        56.13                 18.89   55.84    56.42    54.59   19.51   54.52    54.66  

++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Percentages of Age per Sector (%) Descriptives
                             Collected Dataset Job Advertisements                                                           
                              Percentages of Age per Sector (%)                                                             
                                      Job Advertisement                                    Sentence                         
                                              M                     SD  95% Conf. Interval    M       SD  95% Conf. Interval
Percentages per Sector (PPS)                                                                                                
Older                                       40.61                 10.23   40.45    40.76    40.85   10.11   40.81    40.89  
Younger                                     59.26                 10.14   59.10    59.41    59.05    9.98   59.02    59.09  

++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence Categorical Coding Descriptives
                          Classifier Labeled Job Advertisements                                          
                                    Job Advertisement                            Sentence                
                                            n                     %    M    SD      n       %    M    SD 
Stereotype-related frames                                                                                
Warmth                                   2808.00                17.40 0.17 0.38  74217.00 24.05 0.24 0.43
Competence                               6529.00                40.46 0.40 0.49 146596.00 47.51 0.48 0.50
Total                                    9337.00                57.86    -    - 220813.00 71.56    -    -

++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence Probabilities Descriptives
                          Classifier Labeled Job Advertisements                                                         
                                    Job Advertisement                                   Sentence                        
                                            M                    SD  95% Conf. Interval    M      SD  95% Conf. Interval
Stereotype-related frames                                                                                               
Warmth Probability                         0.26                 0.33    0.26     0.27     0.33   0.37    0.33     0.33  
Competence Probability                     0.39                 0.37    0.38     0.40     0.45   0.38    0.45     0.45  


CPU times: user 3.93 s, sys: 297 ms, total: 4.22 s
Wall time: 4.76 s
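
The 95% confidence intervals in the continuous tables can be sanity-checked from the reported mean, standard deviation, and sample size. The sketch below uses a normal approximation (z = 1.96), which is an assumption: the summary function may use a t quantile instead, although the two are practically identical at these sample sizes. Inputs are the rounded values for the Female row of the gender percentages table, so small rounding differences are possible in general.

```python
import math


def normal_ci(mean, sd, n, z=1.96):
    """Approximate 95% CI for a mean: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return round(mean - half_width, 2), round(mean + half_width, 2)


# Female share per sector, job ad level (n = 16135 job ads).
job_ci = normal_ci(43.82, 18.86, 16135)

# Female share per sector, sentence level (n = 308583 sentences).
sentence_ci = normal_ci(45.36, 19.48, 308583)

print(job_ci)
print(sentence_ci)
```

The sentence-level interval is much narrower only because the sample is about nineteen times larger. Sentences from the same ad are not independent observations, so the job ad level interval is the more conservative of the two.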
