### START HERE IF SOURCING FROM df_manual_FOR_training
### PLEASE SET CORRECT DIRECTORY PATHS BELOW


# Descriptives and visualization


In [1]:
import os # type:ignore # isort:skip # fmt:skip # noqa # nopep8
import sys # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from pathlib import Path # type:ignore # isort:skip # fmt:skip # noqa # nopep8

mod = sys.modules[__name__]

code_dir = None
code_dir_name = 'Code'
unwanted_subdir_name = 'Analysis'

if code_dir_name not in str(Path.cwd()).split('/')[-1]:
    for _ in range(5):

        parent_path = str(Path.cwd().parents[_]).split('/')[-1]

        if (code_dir_name in parent_path) and (unwanted_subdir_name not in parent_path):

            code_dir = str(Path.cwd().parents[_])

            if code_dir is not None:
                break
else:
    code_dir = str(Path.cwd())
sys.path.append(code_dir)

# %load_ext autoreload
# %autoreload 2


In [2]:
from setup_module.imports import * # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from setup_module import researchpy_fork as rp # type:ignore # isort:skip # fmt:skip # noqa # nopep8
from setup_module.statannotations_fork.Annotator import Annotator # type:ignore # isort:skip # fmt:skip # noqa # nopep8


Using MPS


0it [00:00, ?it/s]

<Figure size 640x480 with 0 Axes>

In [3]:
try:
    df_sectors_all = pd.read_pickle(f'{table_save_path}Sectors Output from script.pkl')
except FileNotFoundError:
    cbs_notebook = '\\'.join(f'{scraped_data}CBS/CBS.ipynb')
    %run $cbs_notebook import df_sectors_all # type:ignore # isort:skip # fmt:skip # noqa # nopep8


# Functions

In [4]:
def show_and_close_plots():
    plt.show()
    plt.clf()
    plt.cla()
    plt.close()


In [5]:
def close_plots():
    plt.clf()
    plt.cla()
    plt.close()


# Analysis plan:

1. ## [Descriptives and tables](./1.%20descriptives_and_tables.ipynb)
2. ## [Visualization](./2.%20visualization.ipynb)
3. ## [Frequencies and Normality tests](./2.%20frequencies_and_normality_test.ipynb)
   1. ### Frequencies, histograms, and QQ plots
      * Normal test
      * Kurtosis test
      * Shapiro
      * Anderson
      * Bartlett
   2. ### Correlation between independent variables (IVs) and control variables and Multicolinarity test
      * Pearson's R
      * VIF
     - ***ivs_dummy*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
     - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
     - ***% Sector per Workforce*** (continous ratio) = Sector percentage per worksforce (0-100)
     - ***num_words*** (continous ratio) = Number of words in job description
     - ***English Requirement in Job Ad*** (binary nominal) = English requirement in job description (0 vs. 1)
     - ***Dutch Requirement in Job Ad*** (binary nominal) = Dutch requirement in job description (0 vs. 1)
     - ***Platform*** (binary dummy) = LinkedIn (0 vs. 1), Indeed (0 vs. 1), Glassdoor (0 vs. 1)

4. ## [ANOVA and Chi-square (Pearson's R)](./3.%20chisqt_and_anova.ipynb)

   1. ### Chi-square
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)

   2. ### One-way ANOVA, interactions, and post-hoc test
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
          - If Levene's test is *not significant*, use classic ANOVA and Tukey's post hoc test
          - If Levene's test is *significant*, use Welch's and Kruskal-Wallis ANOVA and Games Howell's post hoc test
      * **df_jobs:**
         - ***dvs_prob*** (continous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
         - ***ivs*** (binary nominal) = Social category designation (Female, Male, Mixed Gender)
           - If Levene's test is *not significant*, use classic ANOVA and Tukey's post hoc test
           - If Levene's test is *significant*, use Welch's and Kruskal-Wallis ANOVA and Games Howell's post hoc test

5. ## [Regression Analysis](./3.%20regression_analysis.ipynb)
   1. ### Logistic Regression  with all interaction (smf):
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
   2. ### OLS Regression with all interaction:
      * **df_jobs:**
        - ***dvs_prob*** (continous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
   3. ### Multilevel OLS Regression with all interaction:
      * **df_jobs:**
        - ***dvs_prob*** (continous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)

6. ## [Specification Curve Analysis](./4.%20specification_curve_analysis.ipynb)

   1. ### Logistic Specification Curve Analysis:
      * **df_manual:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
      * **df_jobs:**
        - ***dvs*** (binary nominal) = 'Warmth' and 'Competence' (0 vs. 1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)
   2. ### OLS Specification Curve Analysis:
      * **df_jobs:**
        - ***dvs_prob*** (continous ratio) = 'Warmth' and 'Competence' probabilities (0-1)
        - ***ivs_perc*** (continous ratio) = Social category percentage per sector (0-100)


# READ DATA

In [6]:
with open(f'{data_dir}df_manual_len.txt', 'r') as f:
    df_manual_len = int(f.read())

df_manual = pd.read_pickle(f'{df_save_dir}df_manual_for_analysis.pkl')
assert len(df_manual) == df_manual_len, f'DATAFRAME MISSING DATA! DF SHOULD BE OF LENGTH {df_manual_len} BUT IS OF LENGTH {len(df_manual)}'
print(f'Dataframe loaded with shape: {df_manual.shape}')
df_manual = categorize_df_gender_age(df_manual)


Dataframe loaded with shape: (5947, 76)


In [7]:
with open(f'{data_dir}df_jobs_for_analysis_len.txt', 'r') as f:
    df_jobs_len = int(f.read())

df_jobs = pd.read_pickle(f'{df_save_dir}df_jobs_for_analysis.pkl')
assert len(df_jobs) == df_jobs_len, f'DATAFRAME MISSING DATA! DF SHOULD BE OF LENGTH {df_jobs_len} BUT IS OF LENGTH {len(df_manual)}'
print(f'Dataframe loaded with shape: {df_jobs.shape}')
df_jobs = categorize_df_gender_age(df_jobs)


Dataframe loaded with shape: (309438, 79)


## Set dataframes

#### Dataframes dict

In [8]:
dataframes = {
    'df_jobs': df_jobs,
    # 'df_manual': df_manual,
}


# Descriptives

### All info

In [9]:
# All info
analysis_columns = [
    'Warmth',
    'Competence'
]

for df_name, df in dataframes.items():
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    df = categorize_df_gender_age(df)

    df.info()


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

<class 'pandas.core.frame.DataFrame'>
Index: 309438 entries, 0 to 309445
Data columns (total 79 columns):
 #   Column                                                              Non-Null Count   Dtype   
---  ------                                                              --------------   -----   
 0   Search Keyword                                                      309438 non-null  object  
 1   Platform                                                            309438 non-null  object  
 2   Job ID                                                              309438 non-null  object  
 3   Job Title                                                           309438 non-null  object  
 4   Company Name                                                        309438 non-null  object  
 5   Location                                                            309438 non-null  object  
 6   Dutch Requirement in Job Ad                   

In [10]:
non_list_columns = [c for c in df_jobs.columns if not df_jobs[c].apply(lambda x: isinstance(x, list)).any()]
non_list_columns = df_jobs.columns.get_indexer(non_list_columns)


In [11]:
dfSummary(df_jobs.iloc[:, non_list_columns], is_collapsible = True)


No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Search Keyword [object],1. education 2. application developer 3. health 4. culture 5. career development specialist 6. accommodation 7. information 8. storage 9. staff 10. financial institution 11. other,"30,921 (10.0%) 21,720 (7.0%) 18,058 (5.8%) 17,874 (5.8%) 13,431 (4.3%) 12,503 (4.0%) 10,983 (3.5%) 10,812 (3.5%) 10,421 (3.4%) 8,016 (2.6%) 154,699 (50.0%)",,0 (0.0%)
2,Platform [object],1. Indeed 2. LinkedIn 3. Glassdoor,"141,882 (45.9%) 141,318 (45.7%) 26,238 (8.5%)",,0 (0.0%)
3,Job ID [object],1. 2302104114 2. p_f813a8cc3e2997f9 3. p_861b1385e97ed936 4. 4039450758 5. p_01c9d06d50571f1f 6. p_adc3da29712d95cf 7. p_fa09d35effab434a 8. p_b18aeaa60401f77a 9. p_6117bcaabf406d32 10. 3783743044 11. other,"277 (0.1%) 186 (0.1%) 173 (0.1%) 167 (0.1%) 141 (0.0%) 136 (0.0%) 132 (0.0%) 130 (0.0%) 127 (0.0%) 123 (0.0%) 307,846 (99.5%)",,0 (0.0%)
4,Job Title [object],1. Assistant Professor of Public 2. International Manager Payroll 3. Inbound Marketer 4. Youth Activities Counselor 5. Software Engineer 6. General Internship Application 7. Data Architect | Business Inte 8. Senior Mechanical Engineer 9. Project Manager 10. Business Analist RPA 11. other,"1,191 (0.4%) 1,085 (0.4%) 1,039 (0.3%) 987 (0.3%) 979 (0.3%) 918 (0.3%) 869 (0.3%) 828 (0.3%) 767 (0.2%) 767 (0.2%) 300,008 (97.0%)",,0 (0.0%)
5,Company Name [object],1. Talent 2. Bookingcom 3. Philips 4. Werkzoekennl 5. ING 6. eBay Inc 7. PVH 8. Accenture 9. ABN AMRO Bank 10. IamExpat 11. other,"5,739 (1.9%) 3,734 (1.2%) 3,099 (1.0%) 2,731 (0.9%) 2,710 (0.9%) 2,631 (0.9%) 2,240 (0.7%) 2,145 (0.7%) 2,141 (0.7%) 2,068 (0.7%) 280,200 (90.6%)",,0 (0.0%)
6,Location [object],"1. Amsterdam 2. Amsterdam, North Holland, Neth 3. Rotterdam, South Holland, Neth 4. Amsterdam 5. Amsterdam Centrum 6. Utrecht, Utrecht, Netherlands 7. Eindhoven, North Brabant, Neth 8. RotterdamThe Hague metropolita 9. Hoofddorp 10. Amsterdam Zuid 11. other","95,155 (30.8%) 54,614 (17.6%) 11,720 (3.8%) 10,952 (3.5%) 8,693 (2.8%) 8,000 (2.6%) 6,570 (2.1%) 5,846 (1.9%) 4,442 (1.4%) 4,012 (1.3%) 99,434 (32.1%)",,0 (0.0%)
7,Dutch Requirement in Job Ad [object],1. No 2. Yes,"299,693 (96.9%) 9,745 (3.1%)",,0 (0.0%)
8,English Requirement in Job Ad [object],1. No 2. Yes,"291,678 (94.3%) 17,760 (5.7%)",,0 (0.0%)
9,Dutch Requirement in Job Ad_No [float64],1. 1.0 2. 0.0,"299,693 (96.9%) 9,745 (3.1%)",,0 (0.0%)
10,Dutch Requirement in Job Ad_Yes [float64],1. 0.0 2. 1.0,"299,693 (96.9%) 9,745 (3.1%)",,0 (0.0%)


In [12]:
for df_name, df in dataframes.items():
    skim(df_jobs.iloc[:, non_list_columns])


## Sentence Level

### All Gender and Age info at Sentence Level

In [13]:
# Gender and Age info by sentence
def run_descriptives_ivs_all_sent(df_name, df, ivs_all=ivs_all):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Gender and Age info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=ivs_all)
    print('-'*30)


In [14]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_ivs_all_sent_interact(df_name):
        run_descriptives_ivs_all_sent(df_name, dataframes[df_name])
else:
    run_descriptives_ivs_all_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender and Age info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 309438 entries, 0 to 309445
Data columns (total 79 columns):
 #   Column                                                              Non-Null Count   Dtype   
---  ------                                                              --------------   -----   
 0   Search Keyword                                                      309438 non-null  object  
 1   Platform                                                            309438 non-null  object  
 2   Job ID                                                              309438 non-null  object  
 3   Job Title                                                           309438 non-null  object  
 4   Company Name                                                        309438 non-null  object  
 5   Location                                                         

### % Gender and Age info at Sentence Level

In [15]:
def run_descriptives_iv_percs_sent(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    for iv_perc in ivs_perc:
        min_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].min()].values[0]
        max_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].max()].values[0]
        mean = df[iv_perc].mean().round(2).astype(float)
        std = df[iv_perc].std().round(2).astype(float)
        print(f'{iv_perc}:\nMin Sector: {df[iv_perc].min():.1f}% in {min_sector}\nMax Sector: {df[iv_perc].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
        print('-'*20)


In [16]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_iv_percs_sent_interact(df_name):
        run_descriptives_iv_percs_sent(df_name, dataframes[df_name])
else:
    run_descriptives_iv_percs_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender_Female_% per Sector:
Min Sector: 12.5% in Construction
Max Sector: 84.3% in Health and social work activities
Mean: 45.36
Standard Deviation: 19.47

--------------------
Gender_Male_% per Sector:
Min Sector: 15.6% in Health and social work activities
Max Sector: 87.5% in Construction
Mean: 54.59
Standard Deviation: 19.5

--------------------
Age_Older_% per Sector:
Min Sector: 18.9% in Accommodation and food serving
Max Sector: 58.3% in Water supply and waste management
Mean: 40.84
Standard Deviation: 10.11

--------------------
Age_Younger_% per Sector:
Min Sector: 44.4% in Water supply and waste management
Max Sector: 80.8% in Accommodation and food serving
Mean: 59.06
Standard Deviation: 9.98

--------------------
CPU times: user 12.9 ms, sys: 4.42 ms, total: 17.3 ms
Wall time: 19.9 ms


### All Warmth and Competence info at Sentence Level

In [17]:
# Warmth and Competence percentages info by sentence
def run_descriptives_dvs_sent(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Warmth and Competence info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=dvs_all + dvs_all_predicted)
    print('-'*30)


In [18]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_dvs_sent_interact(df_name):
        run_descriptives_dvs_sent(df_name, dataframes[df_name])
else:
    run_descriptives_dvs_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 309438 entries, 0 to 309445
Data columns (total 79 columns):
 #   Column                                                              Non-Null Count   Dtype   
---  ------                                                              --------------   -----   
 0   Search Keyword                                                      309438 non-null  object  
 1   Platform                                                            309438 non-null  object  
 2   Job ID                                                              309438 non-null  object  
 3   Job Title                                                           309438 non-null  object  
 4   Company Name                                                        309438 non-null  object  
 5   Location                                                  

## Job Ad Level

### All Gender and Age info at Job Ad Level

In [19]:
# Gender and Age info by job ad
def run_descriptives_ivs_all_job(df_name, df, ivs_all=ivs_all):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Gender and Age info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all=ivs_all)
    print('-'*30)


In [20]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_ivs_all_job_interact(df_name):
        run_descriptives_ivs_all_job(df_name, dataframes[df_name])
else:
    run_descriptives_ivs_all_job(list(dataframes.keys())[0], list(dataframes.values())[0])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender and Age info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16134 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 78 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16134 non-null  object  
 1   Platform                                                            16134 non-null  object  
 2   Job Title                                                           16134 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16134 non-null  object  
 5   Dutch Requirement in Job Ad               

### % Gender and Age info at Job Ad Level

In [21]:
def run_descriptives_iv_percs_job(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')
    df = df.groupby(['Job ID']).first()

    for iv_perc in ivs_perc:
        min_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].min()].values[0]
        max_sector = df['Sector'].loc[df[iv_perc] == df[iv_perc].max()].values[0]
        mean = df[iv_perc].mean().round(2).astype(float)
        std = df[iv_perc].std().round(2).astype(float)
        print(f'{iv_perc}:\nMin Sector: {df[iv_perc].min():.1f}% in {min_sector}\nMax Sector: {df[iv_perc].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
        print('-'*20)


In [22]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_iv_percs_job_interact(df_name):
        run_descriptives_iv_percs_job(df_name, dataframes[df_name])
else:
    run_descriptives_iv_percs_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender_Female_% per Sector:
Min Sector: 12.5% in Construction
Max Sector: 84.3% in Health and social work activities
Mean: 43.82
Standard Deviation: 18.86

--------------------
Gender_Male_% per Sector:
Min Sector: 15.6% in Health and social work activities
Max Sector: 87.5% in Construction
Mean: 56.13
Standard Deviation: 18.89

--------------------
Age_Older_% per Sector:
Min Sector: 18.9% in Accommodation and food serving
Max Sector: 58.3% in Water supply and waste management
Mean: 40.61
Standard Deviation: 10.23

--------------------
Age_Younger_% per Sector:
Min Sector: 44.4% in Water supply and waste management
Max Sector: 80.8% in Accommodation and food serving
Mean: 59.26
Standard Deviation: 10.14

--------------------
CPU times: user 213 ms, sys: 9.54 ms, total: 222 ms
Wall time: 222 ms


### All Warmth and Competence info at Job Ad Level

In [23]:
# Warmth and Competence info by job ad
def run_descriptives_dvs_job(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Warmth and Competence info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all=dvs_all + dvs_all_predicted)
    print('-'*30)


In [24]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_dvs_job_interact(df_name):
        run_descriptives_dvs_job(df_name, dataframes[df_name])
else:
    run_descriptives_dvs_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16134 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 78 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16134 non-null  object  
 1   Platform                                                            16134 non-null  object  
 2   Job Title                                                           16134 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16134 non-null  object  
 5   Dutch Requirement in Job Ad        

### All Job Ad string info at Job Ad Level

In [25]:
# Get longest and shortest sentence
def run_job_desc_lengths(df_name, df, text_col=None, num_words_col=None):
    if text_col is None:
        text_col = 'Job Description spacy_sentencized'
    if num_words_col is None:
        num_words_col = 'Job Description spacy_sentencized_num_words'
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Job Description Length at Sentence Level')
    print('-'*30)
    len_average_char = df[text_col].apply(len).mean()
    average_char = df[text_col].loc[(df[text_col].apply(len) - len_average_char).abs().idxmin()]
    len_longest_char = df[text_col].apply(len).max()
    longest_char = df[text_col].loc[df[text_col].apply(len).idxmax()]
    len_shortest_char = df[text_col].apply(len).min()
    shortest_char = df[text_col].loc[df[text_col].apply(len).idxmin()]

    len_average = df[num_words_col].mean()
    len_longest = df[num_words_col].max()
    len_shortest = df[num_words_col].min()

    print(f'Average Sentence length: {len_average}')
    print('-'*30)
    print(f'Longest Sentence length: {len_longest}')
    print('-'*30)
    print(f'Shortest Sentence length: {len_shortest}')
    print('-'*30)


In [26]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_job_desc_lengths_interact(df_name):
        run_job_desc_lengths(df_name, dataframes[df_name])
else:
    run_job_desc_lengths(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Job Description Length at Sentence Level
------------------------------
Average Sentence length: 17.632743877610377
------------------------------
Longest Sentence length: 349.0
------------------------------
Shortest Sentence length: 1.0
------------------------------
CPU times: user 221 ms, sys: 10.6 ms, total: 231 ms
Wall time: 233 ms


# Controls

## Sentence Level

### Controls all info at Sentence Level

In [27]:
# Control variables info by sentence
def run_descriptives_controls_sent(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print(f'Control varibales info at Sentence Level: {controls_}')
    print('-'*30)
    get_df_info(df, ivs_all = controls_)
    print('-'*30)


In [28]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_controls_sent_interact(df_name):
        run_descriptives_controls_sent(df_name, dataframes[df_name])
else:
    run_descriptives_controls_sent(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Control varibales info at Sentence Level: ['% Sector per Workforce', 'Job Description spacy_sentencized_num_words', 'English Requirement in Job Ad_Yes', 'Dutch Requirement in Job Ad_Yes']
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 309438 entries, 0 to 309445
Data columns (total 79 columns):
 #   Column                                                              Non-Null Count   Dtype   
---  ------                                                              --------------   -----   
 0   Search Keyword                                                      309438 non-null  object  
 1   Platform                                                            309438 non-null  object  
 2   Job ID                                                              309438 non-null  object  
 3   Job Title                                                           309438 non-null  object  
 4   Company Name  

### All info % Sector per Workforce at Sentence Level

In [29]:
def run_descriptives_sectors_all_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Sector info at Sentence Level')
    print('-'*30)
    get_df_info(df, ivs_all=['% Sector per Workforce'])
    print('-'*30)


In [30]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_sectors_all_job_interact(df_name):
        run_descriptives_sectors_all_job(df_name, dataframes[df_name])
else:
    run_descriptives_sectors_all_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Sector info at Sentence Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 309438 entries, 0 to 309445
Data columns (total 79 columns):
 #   Column                                                              Non-Null Count   Dtype   
---  ------                                                              --------------   -----   
 0   Search Keyword                                                      309438 non-null  object  
 1   Platform                                                            309438 non-null  object  
 2   Job ID                                                              309438 non-null  object  
 3   Job Title                                                           309438 non-null  object  
 4   Company Name                                                        309438 non-null  object  
 5   Location                                                            30943

### % Sector per Workforce at Sentence Level

In [31]:
def run_descriptives_sectors_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    min_sector = df['Sector'].loc[df['% Sector per Workforce'] == df['% Sector per Workforce'].min()].values[0]
    max_sector = df['Sector'].loc[df['% Sector per Workforce'] == df['% Sector per Workforce'].max()].values[0]
    mean = df['% Sector per Workforce'].mean().round(2).astype(float)
    std = df['% Sector per Workforce'].std().round(2).astype(float)
    print(f'"% Sector per Workforce":\nMin Sector: {df["% Sector per Workforce"].min():.1f}% in {min_sector}\nMax Sector: {df["% Sector per Workforce"].max():.1f}% in {max_sector}\nMean: {mean}\nStandard Deviation: {std}\n')
    print('-'*20)


In [32]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_sectors_job_interact(df_name):
        run_descriptives_sectors_job(df_name, dataframes[df_name])
else:
    run_descriptives_sectors_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

"% Sector per Workforce":
Min Sector: 0.1% in Energy supply
Max Sector: 31.4% in Commercial services
Mean: 5.41
Standard Deviation: 8.75

--------------------
CPU times: user 3.51 ms, sys: 1.2 ms, total: 4.71 ms
Wall time: 3.7 ms


### IVs and Controls Correlation Matrix

In [33]:
def run_corr_ivs_controls_sent(df_name, df, ivs_=None, controls_=None):
    if ivs_ is None:
        ivs_ = ivs_dummy_perc_and_perc_interactions
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    considered_features = controls_[:2] + ivs_[:]
    corr_df = df[considered_features].corr()
    print('-'*20)
    # print(f'Correlation Matrix for {df_name}')
    # print(corr_df)
    print('-'*20)
    print('Highly correlated variables:\n')
    print('-'*20)
    print(corr_df[(corr_df > 0.5) & (corr_df != 1)].stack().sort_values(ascending=False).drop_duplicates())
    print('-'*20)


In [34]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_dummy_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector               Interaction_Female_Younger_% per Sector   0.93
Interaction_Female_Older_% per Sector    Gender_Female_% per Sector                0.90
                                         Gender_Female                             0.89
Interaction_Male_Younger_% per Sector    Gender_Male_% per Sector                  0.86
Interaction_Male_Older_% per Sector      Gender_Male_% per Sector                  0.83
Gender_Female                            Gender_Female_% per Sector                0.81
Gender_Male_% per Sector                 Gender_Male                               0.80
Age_Older                                Interaction_Male_Older_% per Sector       0.78
Gender_Male                              Interaction_Male_Older_% per Sector       0.77
Interaction_Female_Older_% per Sector    Interaction_Female_Youn

In [35]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_dummy_and_perc)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_dummy_and_perc)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female  Gender_Female_% per Sector   0.81
Gender_Male    Gender_Male_% per Sector     0.80
Age_Older      Age_Older_% per Sector       0.66
Age_Younger    Age_Younger_% per Sector     0.63
Gender_Mixed   Age_Younger                  0.55
               Age_Younger_% per Sector     0.53
dtype: float64
--------------------
CPU times: user 95.6 ms, sys: 7.79 ms, total: 103 ms
Wall time: 105 ms


In [36]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector             Interaction_Female_Younger_% per Sector   0.93
                                       Interaction_Female_Older_% per Sector     0.90
Gender_Male_% per Sector               Interaction_Male_Younger_% per Sector     0.86
                                       Interaction_Male_Older_% per Sector       0.83
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.68
Age_Older_% per Sector                 Interaction_Male_Older_% per Sector       0.58
dtype: float64
--------------------
CPU times: user 67.3 ms, sys: 5.99 ms, total: 73.3 ms
Wall time: 72.6 ms


In [37]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_sent_interact(df_name):
        run_corr_ivs_controls_sent(df_name, dataframes[df_name], ivs_=ivs_perc_interactions)
else:
    run_corr_ivs_controls_sent(list(dataframes.keys())[0], list(dataframes.values())[0], ivs_=ivs_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.68
dtype: float64
--------------------
CPU times: user 28.7 ms, sys: 3.98 ms, total: 32.7 ms
Wall time: 36.3 ms


## Job Ad Level

### All Controls info at Job Ad Level

In [38]:
# Control variables info by job ad
def run_descriptives_controls_job(df_name, df, controls_=None):
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    print('='*30)
    print('Control varibales info at Job Advertisement Level')
    print('-'*30)
    get_df_info(df.groupby(['Job ID']).first(), ivs_all = controls_)
    print('-'*30)


In [39]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_descriptives_controls_job_interact(df_name):
        run_descriptives_controls_job(df_name, dataframes[df_name])
else:
    run_descriptives_controls_job(list(dataframes.keys())[0], dataframes[list(dataframes.keys())[0]])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Control varibales info at Job Advertisement Level
------------------------------

DF INFO:

<class 'pandas.core.frame.DataFrame'>
Index: 16134 entries, 1254300802 to pj_fff1ad3ab60d874b
Data columns (total 78 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   Search Keyword                                                      16134 non-null  object  
 1   Platform                                                            16134 non-null  object  
 2   Job Title                                                           16134 non-null  object  
 3   Company Name                                                        16134 non-null  object  
 4   Location                                                            16134 non-null  object  
 5   Dutch Requirement in Job Ad            

In [40]:
def run_corr_ivs_controls_job(df_name, df, ivs_=None, controls_=None):
    if ivs_ is None:
        ivs_ = ivs_dummy_perc_and_perc_interactions
    if controls_ is None:
        controls_ = controls

    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    df = df.groupby(['Job ID']).first()

    considered_features = controls_[:2] + ivs_[:]
    corr_df = df[considered_features].corr()
    print('-'*20)
    # print(f'Correlation Matrix for {df_name}')
    # print(corr_df)
    print('-'*20)
    print('Highly correlated variables:\n')
    print('-'*20)
    print(corr_df[(corr_df > 0.5) & (corr_df != 1)].stack().sort_values(ascending=False).drop_duplicates())
    print('-'*20)


In [41]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Younger_% per Sector  Gender_Female_% per Sector                0.93
Interaction_Female_Older_% per Sector    Gender_Female_% per Sector                0.89
Gender_Female                            Interaction_Female_Older_% per Sector     0.87
Interaction_Male_Younger_% per Sector    Gender_Male_% per Sector                  0.84
Interaction_Male_Older_% per Sector      Gender_Male_% per Sector                  0.83
Gender_Male                              Gender_Male_% per Sector                  0.82
Interaction_Male_Older_% per Sector      Gender_Male                               0.79
                                         Age_Older                                 0.78
Gender_Female                            Gender_Female_% per Sector                0.78
Age_Older                                Age_Older_% per Sector 

In [42]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_and_perc)
else:
    run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_dummy_and_perc)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Male    Gender_Male_% per Sector     0.82
Gender_Female  Gender_Female_% per Sector   0.78
Age_Older      Age_Older_% per Sector       0.69
Age_Younger    Age_Younger_% per Sector     0.63
Gender_Mixed   Age_Younger_% per Sector     0.54
               Age_Younger                  0.50
dtype: float64
--------------------
CPU times: user 237 ms, sys: 10.6 ms, total: 248 ms
Wall time: 248 ms


In [43]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_and_perc_interactions)
else:
    run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_and_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Gender_Female_% per Sector             Interaction_Female_Younger_% per Sector   0.93
                                       Interaction_Female_Older_% per Sector     0.89
Gender_Male_% per Sector               Interaction_Male_Younger_% per Sector     0.84
                                       Interaction_Male_Older_% per Sector       0.83
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.66
Age_Older_% per Sector                 Interaction_Male_Older_% per Sector       0.64
dtype: float64
--------------------
CPU times: user 220 ms, sys: 6.78 ms, total: 226 ms
Wall time: 227 ms


In [44]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_corr_ivs_controls_job_interact(df_name):
        run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_interactions)
else:
    run_corr_ivs_controls_job(df_name, dataframes[df_name], ivs_=ivs_perc_interactions)


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

--------------------
--------------------
Highly correlated variables:

--------------------
Interaction_Female_Older_% per Sector  Interaction_Female_Younger_% per Sector   0.66
dtype: float64
--------------------
CPU times: user 212 ms, sys: 4.08 ms, total: 216 ms
Wall time: 215 ms


## Imbalance Ratios

In [45]:
# Imbalance Ratio
all_imbalance_ratio_dict = {}
def run_imbalance_ratio(df_name, df):
    print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')

    warmth_imbalance_ratio = df['Warmth'].loc[
        df['Warmth'] == 1].count()/df['Warmth'].loc[df['Warmth'] == 0
    ].count()
    competence_imbalance_ratio = df['Competence'].loc[
        df['Competence'] == 1].count()/df['Competence'].loc[df['Competence'] == 0
    ].count()

    all_imbalance_ratio_dict[f'{df_name} Warmth'] = warmth_imbalance_ratio
    all_imbalance_ratio_dict[f'{df_name} Competence'] = competence_imbalance_ratio

    print('='*20)
    print('Imabalance Ratios')
    print('-'*10)
    print(f'Warmth IR: {warmth_imbalance_ratio:.2f}')
    print(f'Competence IR: {competence_imbalance_ratio:.2f}')
    print('='*20)

    with open(f'{data_dir}{df_name}_all_imbalance_ratio_dict.json', 'w') as f:
        json.dump(all_imbalance_ratio_dict, f)


In [46]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys())
    def run_imbalance_ratio_interact(df_name):
        run_imbalance_ratio(df_name, dataframes[df_name])
else:
    run_imbalance_ratio(list(dataframes.keys())[0], list(dataframes.values())[0])


++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Imabalance Ratios
----------
Warmth IR: 0.34
Competence IR: 1.23
CPU times: user 5.43 ms, sys: 1.39 ms, total: 6.81 ms
Wall time: 6.59 ms


# Tables

In [47]:
def save_desc_excel(
    df_desc,
    index_var,
    title_prefix,
    file_save_path,
    sheet_name=None,
    startrow=None,
    startcol=None,
):
    if sheet_name is None:
        sheet_name = 'All'
    if startrow is None:
        startrow = 1
    if startcol is None:
        startcol = 1

    # index = df_desc.index.to_frame().reset_index(drop=True)
    df_desc = df_desc.reset_index(drop=False, col_level=1, col_fill=f'{title_prefix} Job Advertisements')

    # Define last rows and cols locs
    header_range = len(df_desc.columns.levels)
    endrow = startrow + header_range + df_desc.shape[0]
    endcol = startcol + df_desc.shape[1]

    # Write
    writer = pd.ExcelWriter(f'{file_save_path}.xlsx')
    df_desc.to_excel(writer, sheet_name=sheet_name, merge_cells=True, startrow=startrow, startcol=startcol)
    workbook  = writer.book
    worksheet = writer.sheets[sheet_name]
    worksheet.set_row(startrow + header_range, None, None, {'hidden': True}) # hide the empty row that appears after the headers
    worksheet.set_column(startrow, 1, None, None, {'hidden': True}) # hide the index column

    # MAIN BODY
    # Format column headers
    for i, (col_num, col_value) in tqdm_product(range(header_range), (enumerate(df_desc.columns.values))):
        row_to_write = startrow + i
        col_to_write = startcol + 1 + col_num # 1 is for index
        header_formats = {'bold': False, 'font_name': 'Times New Roman', 'font_size': 12, 'font_color': 'black', 'align': 'center', 'top': True, 'bottom': True, 'left': False, 'right': False}

        if col_value[i] in ['n', 'M', 'SD']:
            header_formats |= {'italic': True}

        if col_value[i] == '95% Conf.':
            worksheet.set_column(col_to_write, col_to_write, 8.5)

        if col_value[i] == index_var:
            worksheet.set_column(col_to_write, col_to_write, 10)
            header_formats['align'] = 'left'
            header_formats |= {'text_wrap': True}
            worksheet.merge_range(row_to_write, col_to_write, header_range, col_to_write, index_var, workbook.add_format(header_formats))
        else:
            worksheet.write(row_to_write, col_to_write, col_value[i], workbook.add_format(header_formats))

    # Format body columns
    num = [col_num for col_num, value in enumerate(df_desc.columns.values) if value[-1] == 'n']
    perc = [col_num for col_num, value in enumerate(df_desc.columns.values) if value[-1] == '%']
    body_max_row_idx, body_max_col_idx = df_desc.shape

    for c, r in tqdm_product(range(body_max_col_idx), range(body_max_row_idx)):
        row_to_write = startrow + header_range + 1 + r # 1 is for the hidden empty column under the header
        col_to_write = startcol + 1 + c # 1 is for index
        body_formats = {'num_format': '0.00', 'font_name': 'Times New Roman', 'font_size': 12, 'font_color': 'black', 'align': 'center', 'text_wrap': True, 'left': False, 'right': False}

        if r == body_max_row_idx-1:
            body_formats |= {'bottom': True}

        if c == 0:
            body_formats |= {'align': 'left'}

        if c in num:
            body_formats |= {'num_format': '0'}

        if c in perc:
            body_formats |= {'num_format': '0.0'}

        worksheet.write(row_to_write, col_to_write, df_desc.iloc[r, c], workbook.add_format(body_formats))

    writer.close()


In [48]:
def make_df_desc(df, df_name, vars_list, var_name, index_var, sentence_level=False, continous_var_names_list=None):

    if continous_var_names_list is None:
        continous_var_names_list = ['Probabilities', 'Percentages']

    if df_name == 'df_manual':
        title_prefix = 'Manually Annotated Dataset'
    elif df_name == 'df_jobs':
        title_prefix = 'Classifier Labeled'

    if sentence_level == False:
        level = 'Job Advertisement'
        df = df.groupby('Job ID').first()
    if sentence_level == True:
        level = 'Sentence'

    # Warmth and Competence Categorical df
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        df_cat = rp.summary_cat(df[vars_list], ascending= True).round(2)
        df_cat['Variable'] = df_cat['Variable'].replace('', np.nan).fillna(method='ffill')
        df_cat = df_cat.loc[df_cat['Outcome'] == 1].drop(columns=['Outcome'])
        totals = pd.DataFrame(df_cat.sum(numeric_only=True)).transpose()
        totals.insert(0, 'Variable', 'Total')
        df_cat = df_cat.fillna('')
        df_cat = pd.concat([df_cat, totals], axis='index', ignore_index=True)

    # Warmth and Competence Continuous df
    df_cont = rp.summary_cont(df[vars_list], conf = 0.95, decimals = 2)

    # Merged df
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        df_desc = df_cat.merge(df_cont, on='Variable', how='outer')
        df_desc = df_desc.fillna('')
    else:
        df_desc = df_cont

    # Rename variable columns
    df_desc['Variable'] = df_desc['Variable'].apply(
        lambda var_name: f'{var_name.split("_")[1]}-dominated'.replace('_', ' ').strip()
        if '_' in var_name and 'Mixed' not in var_name and '%' not in var_name and 'Probability' not in var_name
        else f'{var_name.split("_")[1]} Gender'.replace('_', ' ')
        if '_' in var_name and 'Mixed' in var_name and '%' not in var_name and 'Probability' not in var_name
        else " ".join(var_name.split("_")[1:]).split()[0]
        if '_' in var_name and 'Mixed' not in var_name and '%' in var_name and 'Probability' not in var_name
        else f'{var_name.split("_")[0]} Probability'.replace('_', ' ')
        if '_' in var_name and 'Mixed' not in var_name and '%' not in var_name and 'Probability' in var_name
        else var_name
    )

    # Clean up df and set index
    if len(set(var_name.split()).intersection(continous_var_names_list)) == 0:
        drop_columns = ['N', 'SE', '95% Conf.', 'Interval']
        rename_dict = {'Variable': index_var, 'Count': 'n', 'Percent': '%', 'Mean': 'M'}
    else:
        drop_columns = ['N', 'SE']
        rename_dict = {'Variable': index_var, 'Mean': 'M', 'SD': 'SD', '95% Conf. Int.': '95% CI'}

    df_desc = df_desc.drop(columns=drop_columns)
    df_desc = df_desc.rename(columns=rename_dict)
    df_desc = df_desc.set_index(keys=[index_var], drop=True)

    # Make into MultiIndex
    df_desc.columns = pd.MultiIndex.from_product([[level], df_desc.columns])

    return df_desc


In [49]:
vars_dict = {
    'Gender Categorical Designation of Sector': ivs_gender_dummy,
    'Age Categorical Designation of Sector': ivs_age_dummy,
    'Gender Percentages per Sector (%)': ivs_gender_perc,
    'Age Percentages per Sector (%)': ivs_age_perc,
    'Warmth and Competence Categorical Coding': dvs,
    'Warmth and Competence Probabilities': dvs_prob,
}


In [50]:
def make_desc_tables(df_name, df, var_name, vars_list):
    if df_name == 'df_manual':
        title_prefix = 'Manually Annotated Dataset'
    elif df_name == 'df_jobs' and 'Warmth and Competence' not in var_name:
        title_prefix = 'Collected Dataset'
    elif df_name == 'df_jobs':
        title_prefix = 'Classifier Labeled'

    # Set index varaible name
    if 'Warmth and Competence' in var_name:
        index_var = 'Stereotype-related frames'
    elif 'Percentages' in var_name:
        index_var = 'Percentages per Sector (PPS)'
    else:
        index_var = 'Sectors'

    with contextlib.suppress(KeyError):
        # Categorical DF on job ad level
        df_desc_cat_jobad = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=False)

        # Categorical DF on sentence level
        df_desc_cat_sent = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=True)

        # Merge Categorical dfs
        df_desc_cat = df_desc_cat_jobad.merge(df_desc_cat_sent, on=index_var)

        # Continuous DF on job ad level
        df_desc_cont_jobad = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=False)

        # Continuous DF on sentence level
        df_desc_cont_sent = make_df_desc(df, df_name, vars_list=vars_list, var_name=var_name, index_var=index_var, sentence_level=True)

        # Merge Continuous dfs
        df_desc_cont = df_desc_cont_jobad.merge(df_desc_cont_sent, on=index_var)

        # Collect dfs in list
        df_desc_list = [df_desc_cat, df_desc_cont]

        for df_desc in df_desc_list:
            levels_with_title = [[f'{title_prefix} Job Advertisements']]
            # Add title prefix
            levels_with_title.extend(
                list(df_desc.columns.get_level_values(i).unique())
                    for i in range(len(df_desc.columns.levels))
            )
            # levels_with_title.insert(0, )
            if 'Warmth and Competence' not in var_name:
                levels_with_title.insert(1, [var_name])

            # Make into MultiIndex
            df_desc.columns = pd.MultiIndex.from_product(levels_with_title)

            # Save Tables
            # File save path
            file_save_path = f'{table_save_path}descriptives {df_name} {title_prefix} {var_name} - Job Advertisement'
            # CSV
            df_desc.to_csv(f'{file_save_path}.csv', index=True)
            # PKL
            df_desc.to_pickle(f'{file_save_path}.pkl')
            # TEX
            with pd.option_context('max_colwidth', 10000000000):
                df_desc.style.to_latex(
                    f'{file_save_path}.tex',
                    convert_css=True,
                    environment='longtable',
                    hrules=True,
                    # escape=True,
                    # multicolumn=True,
                    multicol_align='c',
                    position='H',
                    caption=f'{var_name} Descriptives', label='Descriptives'
                )
            # MD
            df_desc.to_markdown(f'{file_save_path}.md', index=True)
            # EXCEL
            save_desc_excel(df_desc, index_var, title_prefix, file_save_path)

        print('\n')
        print(f'{"+"*20} {df_name.upper()} {"+"*20}\n')
        print(f'{var_name} Descriptives')
        if df_desc_list[0].equals(df_desc_list[1]):
            print(df_desc_list[0])
        else:
            print(df_desc_list[0])
            print(df_desc_list[1])
        print('\n')


In [51]:
%%time
if len(dataframes) > 1:
    @interact(df_name=dataframes.keys(), var_name=vars_dict.keys())
    def make_desc_tables_interact(df_name, var_name):
        make_desc_tables(df_name, dataframes[df_name], var_name, vars_dict[var_name])
else:
    for (df_name, df), (var_name, vars_list) in tqdm_product(dataframes.items(), vars_dict.items()):
        make_desc_tables(df_name, df, var_name, vars_list)


  0%|          | 0/6 [00:00<?, ?it/s]











0it [00:00, ?it/s]

  0%|          | 0/36 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/36 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender Categorical Designation of Sector Descriptives
                   Collected Dataset Job Advertisements                                             
                 Gender Categorical Designation of Sector                                           
                            Job Advertisement                               Sentence                
                                    n                       %     M    SD      n       %    M    SD 
Sectors                                                                                             
Female-dominated                  3475.00                  21.54 0.22 0.41  78480.00 25.36 0.25 0.44
Mixed Gender                      6300.00                  39.05 0.39 0.49 117967.00 38.12 0.38 0.49
Male-dominated                    6359.00                  39.41 0.39 0.49 112991.00 36.51 0.37 0.48
Total                            16134.00                 100.00           309438.00 

0it [00:00, ?it/s]

  0%|          | 0/36 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/36 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Age Categorical Designation of Sector Descriptives
                   Collected Dataset Job Advertisements                                           
                  Age Categorical Designation of Sector                                           
                            Job Advertisement                            Sentence                 
                                    n                     %    M    SD      n       %     M    SD 
Sectors                                                                                           
Older-dominated                   3605.00               22.34 0.22 0.42  62959.00  20.35 0.20 0.40
Mixed Gender                     10276.00               63.69 0.64 0.48 198252.00  64.07 0.64 0.48
Younger-dominated                 2253.00               13.96 0.14 0.35  48227.00  15.59 0.16 0.36
Total                            16134.00               99.99           309438.00 100.01          







0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Gender Percentages per Sector (%) Descriptives
                             Collected Dataset Job Advertisements                                                           
                              Gender Percentages per Sector (%)                                                             
                                      Job Advertisement                                    Sentence                         
                                              M                     SD  95% Conf. Interval    M       SD  95% Conf. Interval
Percentages per Sector (PPS)                                                                                                
Female                                      43.82                 18.86   43.53    44.11    45.36   19.47   45.29    45.43  
Male                                        56.13                 18.89   55.84    56.42    54.59   19.50   54.52    54.66  












0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Age Percentages per Sector (%) Descriptives
                             Collected Dataset Job Advertisements                                                           
                                Age Percentages per Sector (%)                                                              
                                      Job Advertisement                                    Sentence                         
                                              M                     SD  95% Conf. Interval    M       SD  95% Conf. Interval
Percentages per Sector (PPS)                                                                                                
Older                                       40.61                 10.23   40.45    40.76    40.84   10.11   40.80    40.88  
Younger                                     59.26                 10.14   59.10    59.41    59.06    9.98   59.03    59.10  












0it [00:00, ?it/s]

  0%|          | 0/27 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/27 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence Categorical Coding Descriptives
                          Classifier Labeled Job Advertisements                                          
                                    Job Advertisement                            Sentence                
                                            n                     %    M    SD      n       %    M    SD 
Stereotype-related frames                                                                                
Warmth                                    3165.00               19.62 0.20 0.40  77867.00 25.16 0.25 0.43
Competence                                8013.00               49.67 0.50 0.50 170586.00 55.13 0.55 0.50
Total                                    11178.00               69.29           248453.00 80.29          












0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]



++++++++++++++++++++ DF_JOBS ++++++++++++++++++++

Warmth and Competence Probabilities Descriptives
                          Classifier Labeled Job Advertisements                                                         
                                    Job Advertisement                                   Sentence                        
                                            M                    SD  95% Conf. Interval    M      SD  95% Conf. Interval
Stereotype-related frames                                                                                               
Warmth Probability                         0.92                 0.12    0.92     0.92     0.91   0.12    0.91     0.91  
Competence Probability                     0.90                 0.12    0.90     0.90     0.88   0.13    0.88     0.88  


CPU times: user 3.33 s, sys: 306 ms, total: 3.64 s
Wall time: 3.97 s
