## Dataset descriptions

prod_sent (`product_sentiment_machine_hack`):
* Tabular:  ["Product_Type"]
* Text: ["Product_Description"]

wine (`wine_reviews`):
* Tabular: ["points", "price"]
* Text: ['country', 'description', 'province']

fake (`fake_job_postings2`):
* Tabular: ['required_experience','required_education']
* Text: ["title", "description", "salary_range"]

kick (`kick_starter_funding`):
* Tabular: ['goal', 'disable_communication', 'country', 'currency', 'deadline', 'created_at']
* Text: ['name', 'desc', 'keywords']

jigsaw (`jigsaw_unintended_bias100K`):
* Tabular: ['asian', 'atheist', 'bisexual', 'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu', 'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability', 'jewish', 'latino', 'male', 'muslim', 'other_disability', 'other_gender', 'other_race_or_ethnicity', 'other_religion', 'other_sexual_orientation', 'physical_disability', 'psychiatric_or_mental_illness', 'transgender', 'white', 'funny', 'wow', 'sad', 'likes', 'disagree']
* Text: ['comment_text']



In [3]:
from datasets import Dataset, DatasetDict, load_dataset
from auto_mm_bench.datasets import dataset_registry
from sklearn.preprocessing import OrdinalEncoder
from src.utils import get_dataset_info


How do I want to deal with missing values?
* Text: replace with the string "None"
* Categorical: ordinal-encode the column, so missing values become -1
* Numerical: maybe just leave these as they are

So for the all-text datasets:
* Replace missing values with "None"

For the mixed (ordinal) datasets:
* Missing categorical values are encoded as -1
* Missing text values are swapped for "None"

In [13]:
dataset_name = "product_sentiment_machine_hack"
# ["wine_reviews" ,"fake_job_postings2" , "product_sentiment_machine_hack", "kick_starter_funding", "jigsaw_unintended_bias100K"]

for dataset_name in [
    # "wine_reviews",
    "fake_job_postings2",
    # "product_sentiment_machine_hack",
    # "kick_starter_funding",
    # "jigsaw_unintended_bias100K",
    # "imdb_genre_prediction",
]:
    di = get_dataset_info(dataset_name)
    train_dataset = dataset_registry.create(dataset_name, "train")
    test_dataset = dataset_registry.create(dataset_name, "test")
    cols = train_dataset.feature_columns + train_dataset.label_columns

    train_txt = train_dataset.data[cols]
    test_txt = test_dataset.data[cols]

    # # Fill missing values with "None"
    # train_txt = train_txt.fillna("None")
    # test_txt = test_txt.fillna("None")

    # load dataset from dataframe
    train_ds = Dataset.from_pandas(train_txt)
    train_ds = train_ds.class_encode_column(train_dataset.label_columns[0])
    test_ds = Dataset.from_pandas(test_txt)
    test_ds = test_ds.class_encode_column(train_dataset.label_columns[0])

    train_ds = train_ds.train_test_split(
        test_size=0.15, seed=42, stratify_by_column=train_dataset.label_columns[0]
    )

    ds = DatasetDict(
        {"train": train_ds["train"], "validation": train_ds["test"], "test": test_ds}
    )

    # Now we have made the split but still need to deal with missing values, and that depends on the column type

    # All as text
    train_all_text = ds["train"].to_pandas()
    val_all_text = ds["validation"].to_pandas()
    test_all_text = ds["test"].to_pandas()

    # train_all_text[train_dataset.feature_columns] = train_all_text[train_dataset.feature_columns].fillna("None")
    # val_all_text[train_dataset.feature_columns] = val_all_text[train_dataset.feature_columns].fillna("None")
    # test_all_text[train_dataset.feature_columns] = test_all_text[train_dataset.feature_columns].fillna("None")

    ds_all_text = DatasetDict(
        {
            "train": Dataset.from_pandas(train_all_text),
            "validation": Dataset.from_pandas(val_all_text),
            "test": Dataset.from_pandas(test_all_text),
        }
    )

    ds_all_text.push_to_hub(dataset_name + "_all_text")

    # Not all as text
    train = ds["train"].to_pandas()
    val = ds["validation"].to_pandas()
    test = ds["test"].to_pandas()

    # train[di.text_cols] = train[di.text_cols].fillna("None")
    # val[di.text_cols] = val[di.text_cols].fillna("None")
    # test[di.text_cols] = test[di.text_cols].fillna("None")

    # ds.push_to_hub(dataset_name)
    if len(di.categorical_cols) > 0:
        train[di.categorical_cols] = train[di.categorical_cols].astype("category")

        enc = OrdinalEncoder(encoded_missing_value=-1)
        train[di.categorical_cols] = enc.fit_transform(train[di.categorical_cols])

        val[di.categorical_cols] = val[di.categorical_cols].astype("category")
        val[di.categorical_cols] = enc.transform(val[di.categorical_cols])

        test[di.categorical_cols] = test[di.categorical_cols].astype("category")
        test[di.categorical_cols] = enc.transform(test[di.categorical_cols])

    ds2 = DatasetDict(
        {
            "train": Dataset.from_pandas(train),
            "validation": Dataset.from_pandas(val),
            "test": Dataset.from_pandas(test),
        }
    )

    ds2.push_to_hub(dataset_name + "_ordinal")



In [10]:
test.dtypes

title                   object
salary_range            object
description             object
required_experience    float64
required_education     float64
fraudulent               int64
dtype: object

In [5]:
dataset = load_dataset("james-burton/fake_job_postings2_ordinal")

Found cached dataset parquet (/home/james/.cache/huggingface/datasets/james-burton___parquet/james-burton--fake_job_postings2_ordinal-5cf31f78073ab818/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████| 3/3 [00:00<00:00, 275.53it/s]


In [11]:
# Create a mapping function that takes any value of None and swaps it for a string value of "None"
def map_none_to_string(example):
    for k, v in example.items():
        if v is None:
            example[k] = "None"
    return example


dataset = dataset.map(map_none_to_string)
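
The function above rewrites `None` in every column. If only the text columns should be touched, a variant that takes the column list could be used. This is a sketch with a plain dict standing in for one dataset row; `none_to_string` is a hypothetical name.

```python
def none_to_string(example, columns):
    """Replace None with the string "None", but only in the listed columns."""
    for col in columns:
        if example[col] is None:
            example[col] = "None"
    return example


# Plain dict standing in for one dataset row (made-up values).
row = {"title": "Chef", "salary_range": None, "required_experience": None}
result = none_to_string(dict(row), columns=["title", "salary_range"])
print(result)
```

With a `datasets` object this could be applied as `dataset.map(none_to_string, fn_kwargs={"columns": [...]})`.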


                                                                    

In [12]:
dataset["test"][0]


{'title': 'Senior Software Engineers, C++ for AUTOMOTIVE',
 'salary_range': 'None',
 'description': 'Software Competitiveness International (SOFT COM INTERNATIONAL), is a rapidly growing company, specializing in Software Research &amp; Development and Information &amp; Communications Technologies Services, located in Athens, and headquartered in Crete. The skills, the experience and the methodologies\xa0 of the company and its experts, most of them with a long presence and a high recognition internationally, provide to its clients, both locally and internationally,\xa0 technical excellence and valuable services, and to its employees the working conditions to further develop their technological expertise within a multi-national environment. Currently the company expands its activities further, continuing the expansion of a very promising cooperation with the German Automotive Market. \xa0\xa0 \xa0 \xa0 \xa0Currently we are looking for\xa0Senior Software Engineers, C++ for AUTOMOTIVE\xa0

In [3]:
di = get_dataset_info("fake")

for dataset_name in ["fake_job_postings2"]:
    train_dataset = dataset_registry.create(dataset_name, "train")
    test_dataset = dataset_registry.create(dataset_name, "test")
    cols = train_dataset.feature_columns + train_dataset.label_columns

    train_txt = train_dataset.data[cols]
    # train_txt[di.categorical_cols] = train_txt[di.categorical_cols].astype("category")
    test_txt = test_dataset.data[cols]
    # test_txt[di.categorical_cols] = test_txt[di.categorical_cols].astype("category")

    # load dataset from dataframe
    train_ds = Dataset.from_pandas(train_txt)
    train_ds = train_ds.class_encode_column(train_dataset.label_columns[0])
    test_ds = Dataset.from_pandas(test_txt)
    test_ds = test_ds.class_encode_column(train_dataset.label_columns[0])

    train_ds = train_ds.train_test_split(
        test_size=0.15, seed=42, stratify_by_column=train_dataset.label_columns[0]
    )
    ds2 = DatasetDict(
        {"train": train_ds["train"], "validation": train_ds["test"], "test": test_ds}
    )


                                                                         

In [4]:
ds2

DatasetDict({
    train: Dataset({
        features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
        num_rows: 10816
    })
    validation: Dataset({
        features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
        num_rows: 1909
    })
    test: Dataset({
        features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
        num_rows: 3182
    })
})
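
The train and validation sizes above (10,816 and 1,909) come from `test_size=0.15` applied to the 12,725 original training rows. For a quick check of the stratification idea outside `datasets`, here is a sketch that uses scikit-learn's `train_test_split` as a stand-in for `Dataset.train_test_split`; the toy labels are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows, 20% positive labels (made-up data).
toy = pd.DataFrame(
    {"text": [f"row {i}" for i in range(100)], "fraudulent": [1] * 20 + [0] * 80}
)

train_part, val_part = train_test_split(
    toy, test_size=0.15, random_state=42, stratify=toy["fraudulent"]
)

# Stratifying keeps the positive rate the same in both parts.
print(len(train_part), len(val_part))
print(train_part["fraudulent"].mean(), val_part["fraudulent"].mean())
```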

In [5]:
from datasets import load_dataset

ds2 = load_dataset("james-burton/fake_job_postings2")
train, val, test = ds2["train"], ds2["validation"], ds2["test"]

# enc = di.cat_encoder

train = train.to_pandas()
train[di.categorical_cols] = train[di.categorical_cols].astype("category")

enc = OrdinalEncoder(encoded_missing_value=-1)
train[di.categorical_cols] = enc.fit_transform(train[di.categorical_cols])

val = val.to_pandas()
val[di.categorical_cols] = val[di.categorical_cols].astype("category")
val[di.categorical_cols] = enc.transform(val[di.categorical_cols])

test = test.to_pandas()
test[di.categorical_cols] = test[di.categorical_cols].astype("category")
test[di.categorical_cols] = enc.transform(test[di.categorical_cols])

ds3 = DatasetDict(
    {
        "train": Dataset.from_pandas(train),
        "validation": Dataset.from_pandas(val),
        "test": Dataset.from_pandas(test),
    }
)

ds3.push_to_hub("fake_job_postings2_ord")


Found cached dataset parquet (/home/james/.cache/huggingface/datasets/james-burton___parquet/james-burton--fake_job_postings2-2d8f0884d0193a2d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████| 3/3 [00:00<00:00, 628.86it/s]


In [10]:
train[di.text_cols].fillna("NaN").astype(str).dtypes

title           object
description     object
salary_range    object
dtype: object

In [14]:
train[di.text_cols].fillna("NaN").iloc[0]["salary_range"]


'NaN'

In [17]:
ds2["train"][3]

{'title': 'Senior Supply Network Planner',
 'salary_range': None,
 'description': 'Supply network planning; being responsible for the optimal fulfilment of the company supply planning requirements.Inventory days optimization; monitoring stock days and executing corrective action plan whenever required.OOS management; defining safety stocks based on accuracy issues like forecasting, supply etc in order to minimize out-of-stock situations and optimize demand and supply balance.Demand planning; utilizing cross-functional business input and personal business knowledge to monitor sales trends and drive process improvements for all markets and channels.',
 'required_experience': None,
 'required_education': None,
 'fraudulent': 0}

In [18]:
ds3["train"][3]


{'title': 'Senior Supply Network Planner',
 'salary_range': None,
 'description': 'Supply network planning; being responsible for the optimal fulfilment of the company supply planning requirements.Inventory days optimization; monitoring stock days and executing corrective action plan whenever required.OOS management; defining safety stocks based on accuracy issues like forecasting, supply etc in order to minimize out-of-stock situations and optimize demand and supply balance.Demand planning; utilizing cross-functional business input and personal business knowledge to monitor sales trends and drive process improvements for all markets and channels.',
 'required_experience': -1.0,
 'required_education': -1.0,
 'fraudulent': 0}

In [108]:
ds2.values()

dict_values([Dataset({
    features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
    num_rows: 1909
}), Dataset({
    features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
    num_rows: 10816
}), Dataset({
    features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
    num_rows: 3182
})])

In [116]:
train

Unnamed: 0,title,salary_range,description,required_experience,required_education,fraudulent
0,Creative Director - Art,,Kettle is hiring a Creative DirectorJob Locati...,,,0
1,UI Developer For Rails App,,Hello Show is transforming the way real estate...,Mid-Senior level,,0
2,Customer Service Associate - Part Time,,The Customer Service Associate will be based i...,Entry level,High School or equivalent,0
3,Lead Developer (Freelance),,Role Purpose: To be responsible for maintenanc...,Mid-Senior level,Associate Degree,0
4,Front-end Developer,2000-5000,"You are versatile with javascript, CSS, HTML5 ...",Entry level,Associate Degree,0
...,...,...,...,...,...,...
1904,Inside Sales Representative(Entry Level),,Handi Ramp is currently seeking an Inside Sale...,Associate,Bachelor's Degree,0
1905,Software Support Representative,25000-30000,Providing reliable library automation solution...,Entry level,Unspecified,0
1906,Leeds Apprentice Web Developer Under NAS 16-18...,,Government funding is only available for 16-18...,Not Applicable,High School or equivalent,0
1907,"Sales, Assistant Manager & Market Manager Posi...",45000-67000,"We are Argenta Field Solutions, a rapidly expa...",Entry level,Unspecified,0


In [118]:
train

Unnamed: 0,title,salary_range,description,required_experience,required_education,fraudulent
0,Creative Director - Art,,Kettle is hiring a Creative DirectorJob Locati...,-1.0,-1.0,0
1,UI Developer For Rails App,,Hello Show is transforming the way real estate...,5.0,-1.0,0
2,Customer Service Associate - Part Time,,The Customer Service Associate will be based i...,2.0,4.0,0
3,Lead Developer (Freelance),,Role Purpose: To be responsible for maintenanc...,5.0,0.0,0
4,Front-end Developer,2000-5000,"You are versatile with javascript, CSS, HTML5 ...",2.0,0.0,0
...,...,...,...,...,...,...
1904,Inside Sales Representative(Entry Level),,Handi Ramp is currently seeking an Inside Sale...,0.0,1.0,0
1905,Software Support Representative,25000-30000,Providing reliable library automation solution...,2.0,9.0,0
1906,Leeds Apprentice Web Developer Under NAS 16-18...,,Government funding is only available for 16-18...,6.0,4.0,0
1907,"Sales, Assistant Manager & Market Manager Posi...",45000-67000,"We are Argenta Field Solutions, a rapidly expa...",2.0,9.0,0


In [111]:
train[di.categorical_cols]

Unnamed: 0,required_experience,required_education
0,,
1,Mid-Senior level,
2,Entry level,High School or equivalent
3,Mid-Senior level,Associate Degree
4,Entry level,Associate Degree
...,...,...
1904,Associate,Bachelor's Degree
1905,Entry level,Unspecified
1906,Not Applicable,High School or equivalent
1907,Entry level,Unspecified


In [101]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

enc = OrdinalEncoder(
    categories=[
        np.array(
            [
                "Associate",
                "Director",
                "Entry level",
                "Executive",
                "Internship",
                "Mid-Senior level",
                "Not Applicable",
                np.nan,
            ],
            dtype=object,
        ),
        np.array(
            [
                "Associate Degree",
                "Bachelor's Degree",
                "Certification",
                "Doctorate",
                "High School or equivalent",
                "Master's Degree",
                "Professional",
                "Some College Coursework Completed",
                "Some High School Coursework",
                "Unspecified",
                "Vocational",
                "Vocational - Degree",
                "Vocational - HS Diploma",
                np.nan,
            ],
            dtype=object,
        ),
    ],
    encoded_missing_value=-1,
)
enc.fit_transform(train_txt[di.categorical_cols])

array([[-1., -1.],
       [-1., -1.],
       [ 0.,  1.],
       ...,
       [ 0.,  1.],
       [ 2.,  4.],
       [ 5.,  9.]])

In [102]:
enc = OrdinalEncoder(encoded_missing_value=-1)
enc.fit_transform(train_txt[di.categorical_cols])

array([[-1., -1.],
       [-1., -1.],
       [ 0.,  1.],
       ...,
       [ 0.,  1.],
       [ 2.,  4.],
       [ 5.,  9.]])

In [92]:
train_txt[di.categorical_cols]

Unnamed: 0,required_experience,required_education
0,,
1,,
2,Associate,Bachelor's Degree
3,Associate,Bachelor's Degree
4,Mid-Senior level,Bachelor's Degree
...,...,...
12720,,
12721,Associate,Associate Degree
12722,Associate,Bachelor's Degree
12723,Entry level,High School or equivalent


In [87]:
import pandas as pd

# cast the categorical columns of both splits so the .cat accessor is available
train_txt[di.categorical_cols] = train_txt[di.categorical_cols].astype("category")
test_txt[di.categorical_cols] = test_txt[di.categorical_cols].astype("category")

# extract the category values and the per-row codes from the train dataset
cat_values = {}
cat_codes = {}
for col in di.categorical_cols:
    cat_values[col] = train_txt[col].cat.categories
    cat_codes[col] = train_txt[col].cat.codes

# rebuild each test column with the train categories so both splits share the same codes
for col in di.categorical_cols:
    test_txt[col] = pd.Categorical(
        test_txt[col], categories=cat_values[col], ordered=False
    )
    # test_txt[col] = test_txt[col].cat.codes

In [88]:
pd.Categorical(
    test_txt[di.categorical_cols[0]],
    categories=[
        "Associate",
        "Director",
        "Entry level",
        "Executive",
        "Internship",
        "Mid-Senior level",
        "Not Applicable",
    ],
    ordered=False,
)

[NaN, 'Director', 'Entry level', 'Mid-Senior level', 'Entry level', ..., 'Entry level', 'Mid-Senior level', 'Not Applicable', 'Associate', 'Not Applicable']
Length: 3182
Categories (7, object): ['Associate', 'Director', 'Entry level', 'Executive', 'Internship', 'Mid-Senior level', 'Not Applicable']
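
Building the test column with the train categories means both splits share one code table, and any value outside those categories (including missing ones) gets code -1. A small pandas-only sketch with made-up test values:

```python
import pandas as pd

train_categories = ["Associate", "Director", "Entry level"]
test_values = pd.Series(["Director", None, "Internship", "Associate"])

test_cat = pd.Categorical(test_values, categories=train_categories, ordered=False)

# Values not in train_categories become NaN, which pandas encodes as -1.
test_codes = list(test_cat.codes)
print(test_codes)
```

Note that an unseen category and a missing value end up with the same code here.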

In [79]:
cat_values

{'required_experience': Index(['Associate', 'Director', 'Entry level', 'Executive', 'Internship',
        'Mid-Senior level', 'Not Applicable'],
       dtype='object'),
 'required_education': Index(['Associate Degree', 'Bachelor's Degree', 'Certification', 'Doctorate',
        'High School or equivalent', 'Master's Degree', 'Professional',
        'Some College Coursework Completed', 'Some High School Coursework',
        'Unspecified', 'Vocational', 'Vocational - Degree',
        'Vocational - HS Diploma'],
       dtype='object')}

In [84]:
type(cat_values["required_experience"])


pandas.core.indexes.base.Index

In [83]:
cat_codes

{'required_experience': 0       -1
 1       -1
 2        0
 3        0
 4        5
         ..
 12720   -1
 12721    0
 12722    0
 12723    2
 12724    5
 Length: 12725, dtype: int8,
 'required_education': 0       -1
 1       -1
 2        1
 3        1
 4        1
         ..
 12720   -1
 12721    0
 12722    1
 12723    4
 12724    9
 Length: 12725, dtype: int8}

In [80]:
test_txt

Unnamed: 0,title,salary_range,description,required_experience,required_education,fraudulent
0,"Senior Software Engineers, C++ for AUTOMOTIVE",,Software Competitiveness International (SOFT C...,-1,-1,0
1,Creative Director,,Frequency540 (FQ540) is an independent agency ...,1,-1,0
2,Assistant Personal Chef,,"Maria's Gourmet Kitchen, Houston's first of it...",2,-1,0
3,Environmental Health & Safety Compliance Manager,,Start a career in beer...Our client is one Nor...,5,1,0
4,Machine Operator,,DescriptionThe machine operator will be respon...,2,-1,0
...,...,...,...,...,...,...
3177,Sales and Management Training,,We Are Looking For Full Time Entry Level Reps ...,2,4,0
3178,Account Executive,,"This is who we are: Network Closing Services, ...",5,4,0
3179,UI UX Designer,50000-100000,TradeGecko is a VC-backed fast-growing startup...,6,9,0
3180,Web Developer,,Experienced Web Developer/ProgrammerGraphic Mo...,0,-1,0


In [73]:
train_ds.class_encode_column(di.categorical_cols[0])["train"]


                                                                                        

Dataset({
    features: ['title', 'salary_range', 'description', 'required_experience', 'required_education', 'fraudulent'],
    num_rows: 10816
})

In [76]:
train_ds["train"]["required_education"]


['Vocational',
 'Unspecified',
 "Bachelor's Degree",
 None,
 'Unspecified',
 None,
 'Some College Coursework Completed',
 'Unspecified',
 None,
 None,
 'High School or equivalent',
 "Bachelor's Degree",
 'High School or equivalent',
 'High School or equivalent',
 'High School or equivalent',
 "Bachelor's Degree",
 None,
 'High School or equivalent',
 None,
 "Bachelor's Degree",
 None,
 'Unspecified',
 None,
 None,
 "Bachelor's Degree",
 None,
 'Professional',
 None,
 None,
 None,
 None,
 None,
 "Bachelor's Degree",
 'Professional',
 None,
 None,
 None,
 "Bachelor's Degree",
 'Some High School Coursework',
 'High School or equivalent',
 None,
 None,
 "Bachelor's Degree",
 "Bachelor's Degree",
 None,
 "Bachelor's Degree",
 "Master's Degree",
 None,
 None,
 None,
 None,
 "Bachelor's Degree",
 "Bachelor's Degree",
 None,
 None,
 None,
 None,
 'High School or equivalent',
 None,
 'High School or equivalent',
 'High School or equivalent',
 None,
 None,
 None,
 'High School or equivalent',
 '

In [65]:
# train_txt[di.categorical_cols] = train_txt[di.categorical_cols].astype('category').cat.codes
train_txt[di.categorical_cols].apply(lambda x: x.cat.codes)


Unnamed: 0,required_experience,required_education
0,,
1,,
2,Associate,Bachelor's Degree
3,Associate,Bachelor's Degree
4,Mid-Senior level,Bachelor's Degree
...,...,...
12720,,
12721,Associate,Associate Degree
12722,Associate,Bachelor's Degree
12723,Entry level,High School or equivalent


In [66]:
train_txt["required_experience"]

0                     NaN
1                     NaN
2               Associate
3               Associate
4        Mid-Senior level
               ...       
12720                 NaN
12721           Associate
12722           Associate
12723         Entry level
12724    Mid-Senior level
Name: required_experience, Length: 12725, dtype: category
Categories (7, object): ['Associate', 'Director', 'Entry level', 'Executive', 'Internship', 'Mid-Senior level', 'Not Applicable']

In [69]:
dict(zip(train_txt["required_experience"].cat.codes, train_txt["required_experience"]))

# Do the same but for a whole dataframe: train_txt[di.categorical_cols]

{-1: nan,
 0: 'Associate',
 5: 'Mid-Senior level',
 2: 'Entry level',
 6: 'Not Applicable',
 1: 'Director',
 4: 'Internship',
 3: 'Executive'}
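
For the note about doing the same for a whole dataframe, one option is to build one code-to-label dictionary per column from `.cat.categories`. A sketch on a toy frame; `code_maps` is just an illustrative name.

```python
import pandas as pd

# Made-up frame with the two categorical columns cast to category dtype.
toy = pd.DataFrame(
    {
        "required_experience": ["Entry level", None, "Associate"],
        "required_education": ["Unspecified", "Doctorate", None],
    }
).astype("category")

# One {code: label} dict per column; missing values have code -1 in .cat.codes
# and are not listed in .cat.categories.
code_maps = {col: dict(enumerate(toy[col].cat.categories)) for col in toy.columns}
print(code_maps)
```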

In [67]:
import pandas as pd

categories = pd.unique(train_txt[di.categorical_cols].to_numpy().ravel())

In [68]:
categories


array([nan, 'Associate', "Bachelor's Degree", 'Mid-Senior level',
       'Entry level', 'High School or equivalent', 'Not Applicable',
       'Unspecified', 'Director', "Master's Degree", 'Internship',
       'Some College Coursework Completed', 'Associate Degree',
       'Certification', 'Professional', 'Executive', 'Doctorate',
       'Vocational', 'Some High School Coursework',
       'Vocational - HS Diploma', 'Vocational - Degree'], dtype=object)

In [58]:
import shap

explainer = shap.explainers.Partition(
    model=print, masker=train_txt[di.categorical_cols]
)

TypeError: can only concatenate str (not "int") to str

In [51]:
train_txt[di.categorical_cols].cat.codes


AttributeError: 'DataFrame' object has no attribute 'cat'

In [57]:
import scipy as sp
from shap.utils import hclust

# sp.cluster.hierarchy.complete(
#     sp.spatial.distance.pdist(
#         train_txt[di.categorical_cols].fillna(
#             train_txt[di.categorical_cols].median()).values.T,
#         metric="correlation",
#     )
# )
hclust(train_txt[di.categorical_cols].values, metric="correlation")


TypeError: unsupported operand type(s) for +: 'int' and 'str'
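
The error above is consistent with the columns still holding strings, which cannot be averaged or correlated. A sketch of the commented-out SciPy route on integer codes instead, using a toy frame with three made-up columns (with only two columns the clustering would be a single merge):

```python
import pandas as pd
from scipy.cluster.hierarchy import complete
from scipy.spatial.distance import pdist

toy = pd.DataFrame(
    {
        "required_experience": ["Entry level", "Associate", "Director", "Entry level"],
        "required_education": ["Unspecified", "Doctorate", "Doctorate", "Certification"],
        # Hypothetical third column, added only so there is something to cluster.
        "employment_type": ["Full-time", "Part-time", "Full-time", "Contract"],
    }
)

# Convert each column to integer category codes so distances are defined.
codes = toy.apply(lambda s: s.astype("category").cat.codes)

# Correlation distance between columns, then complete-linkage clustering.
distances = pdist(codes.values.T, metric="correlation")
linkage = complete(distances)
print(linkage.shape)
```

Treating category codes as numbers imposes an arbitrary order, so the resulting clusters are only a rough guide.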

In [44]:
ds2["train"]["required_education"]

['Vocational',
 'Unspecified',
 "Bachelor's Degree",
 None,
 'Unspecified',
 None,
 'Some College Coursework Completed',
 'Unspecified',
 None,
 None,
 'High School or equivalent',
 "Bachelor's Degree",
 'High School or equivalent',
 'High School or equivalent',
 'High School or equivalent',
 "Bachelor's Degree",
 None,
 'High School or equivalent',
 None,
 "Bachelor's Degree",
 None,
 'Unspecified',
 None,
 None,
 "Bachelor's Degree",
 None,
 'Professional',
 None,
 None,
 None,
 None,
 None,
 "Bachelor's Degree",
 'Professional',
 None,
 None,
 None,
 "Bachelor's Degree",
 'Some High School Coursework',
 'High School or equivalent',
 None,
 None,
 "Bachelor's Degree",
 "Bachelor's Degree",
 None,
 "Bachelor's Degree",
 "Master's Degree",
 None,
 None,
 None,
 None,
 "Bachelor's Degree",
 "Bachelor's Degree",
 None,
 None,
 None,
 None,
 'High School or equivalent',
 None,
 'High School or equivalent',
 'High School or equivalent',
 None,
 None,
 None,
 'High School or equivalent',
 '

In [40]:
train_ds["train"].to_pandas().isna().sum()

title                     0
salary_range           9032
description               1
required_experience    4029
required_education     5061
fraudulent                0
dtype: int64
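
As fractions of the 10,816 training rows, these counts are easier to compare across columns. On a dataframe, `df.isna().mean()` gives the fractions directly; the sketch below just redoes the arithmetic from the counts above.

```python
import pandas as pd

# Missing-value counts copied from the output above; 10,816 is the train split size.
missing_counts = pd.Series(
    {
        "salary_range": 9032,
        "description": 1,
        "required_experience": 4029,
        "required_education": 5061,
    }
)
missing_fraction = missing_counts / 10816
print(missing_fraction.round(3))
```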

In [30]:
import numpy as np

np.array([ds2["train"]["required_education"]])

array([['Vocational', 'Unspecified', "Bachelor's Degree", ...,
        "Bachelor's Degree", None, None]], dtype=object)

In [31]:
np.array(
    " | ".join(
        [
            f"{col}: {val}"
            for col, val in zip(
                ["required_education"], np.array([ds2["train"]["required_education"]])
            )
        ]
    ),
    dtype="<U512",
)


array('required_education: [\'Vocational\' \'Unspecified\' "Bachelor\'s Degree" ... "Bachelor\'s Degree"\n None None]',
      dtype='<U512')
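
The output above shows the whole column being formatted into one string, because `zip` pairs the column name with the entire array. If the aim is one `column: value` string per row (an assumption here), a row-wise version could look like this, on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "required_experience": ["Entry level", None],
        "required_education": ["Vocational", "Unspecified"],
    }
)
cols = ["required_experience", "required_education"]

# Fill missing values first, then build one "col: val | col: val" string per row.
filled = toy[cols].fillna("None")
row_texts = [
    " | ".join(f"{col}: {val}" for col, val in zip(cols, row))
    for row in filled.itertuples(index=False, name=None)
]
print(row_texts)
```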

In [32]:
np.array([train_txt["required_education"]])

array([[nan, nan, "Bachelor's Degree", ..., "Bachelor's Degree",
        'High School or equivalent', 'Unspecified']], dtype=object)

In [6]:
train_txt[train_dataset.feature_columns].isna().sum() / len(train_txt)

comment_text                           0.00000
asian                                  0.77492
atheist                                0.77492
bisexual                               0.77492
black                                  0.77492
buddhist                               0.77492
christian                              0.77492
female                                 0.77492
heterosexual                           0.77492
hindu                                  0.77492
homosexual_gay_or_lesbian              0.77492
intellectual_or_learning_disability    0.77492
jewish                                 0.77492
latino                                 0.77492
male                                   0.77492
muslim                                 0.77492
other_disability                       0.77492
other_gender                           0.77492
other_race_or_ethnicity                0.77492
other_religion                         0.77492
other_sexual_orientation               0.77492
physical_disa

In [11]:
# inspect column dtypes
train_txt.dtypes


comment_text                            object
asian                                  float64
atheist                                float64
bisexual                               float64
black                                  float64
buddhist                               float64
christian                              float64
female                                 float64
heterosexual                           float64
hindu                                  float64
homosexual_gay_or_lesbian              float64
intellectual_or_learning_disability    float64
jewish                                 float64
latino                                 float64
male                                   float64
muslim                                 float64
other_disability                       float64
other_gender                           float64
other_race_or_ethnicity                float64
other_religion                         float64
other_sexual_orientation               float64
physical_disa

In [79]:
dataset = dataset_registry.create("wine_reviews", "train")
print(dataset.label_columns)
print()

['variety']



In [1]:
# use these keys to specify which dataset to load
print(dataset_registry.list_keys())

train_dataset = dataset_registry.create("product_sentiment_machine_hack", "train")
test_dataset = dataset_registry.create("product_sentiment_machine_hack", "test")
print(train_dataset.data)


['product_sentiment_machine_hack', 'jigsaw_unintended_bias', 'jigsaw_unintended_bias100K', 'google_qa_label', 'google_qa_answer_helpful', 'google_qa_answer_plausible', 'google_qa_answer_type_procedure', 'google_qa_answer_type_reason_explanation', 'google_qa_question_type_reason_explanation', 'google_qa_answer_satisfaction', 'women_clothing_review', 'melbourne_airbnb', 'mercari_price_suggestion', 'ae_price_prediction', 'mercari_price_suggestion100K', 'imdb_genre_prediction', 'fake_job_postings', 'kick_starter_funding', 'jc_penney_products', 'wine_reviews', 'news_popularity', 'news_channel', 'news_popularity2', 'fake_job_postings2', 'bookprice_prediction', 'data_scientist_salary', 'california_house_price']
Downloading /home/james/.auto_mm_bench/datasets/machine_hack_sentiment_analysis/train.csv from https://automl-mm-bench.s3.amazonaws.com/machine_hack_product_sentiment/train.csv...


100%|██████████| 611k/611k [00:00<00:00, 1.62MiB/s]


Downloading /home/james/.auto_mm_bench/datasets/machine_hack_sentiment_analysis/test.csv from https://automl-mm-bench.s3.amazonaws.com/machine_hack_product_sentiment/dev.csv...


100%|██████████| 154k/154k [00:00<00:00, 571kiB/s] 

      Unnamed: 0  Text_ID                                Product_Description  \
0           5743     2333  #techcrunch #google This Post Has Nothing to d...   
1           3042     3448  Data is the new oil. (Companies like Google an...   
2           4359      720  my sister is throwing the Google sxsw party to...   
3           5685     2328  Clear +succinct visions make for great UX (thi...   
4           2078     4955  40% of Google Maps use is mobile marissamayer ...   
...          ...      ...                                                ...   
5086        1096     3235  I'm eyeing the grilled cheese stand for after ...   
5087        3519     2728  Let's make that a temp-to-hire position RT @me...   
5088         212     1370  @mention -&gt; RT @mention New #UberSocial for...   
5089        2786     2530  @mention Did you find out the hours of the app...   
5090        3942     8162  &quot;At SXSW, Apple schools the marketing exp...   

      Product_Type  Sentiment  
0      




feature_columns ['Product_Description', 'Product_Type']
feature_types ['text', 'categorical']
label_columns ['Sentiment']
label_types ['categorical']
metric acc
problem_type multiclass


In [None]:
from datasets import Dataset
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)


In [16]:
train_dataset.label_columns[0]

'Sentiment'

In [83]:
ds["train"].value_counts()

AttributeError: 'Dataset' object has no attribute 'value_counts'
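
`Dataset` has no `value_counts`, so the label distribution has to be computed some other way, for example with `ds["train"].to_pandas()[label].value_counts()` or by counting the column values. A sketch of the counting route, with a plain list standing in for `ds["train"][label]`:

```python
from collections import Counter

# Stand-in for a label column read from a Dataset (made-up values).
labels = [0, 1, 0, 0, 1, 0]

counts = Counter(labels)
print(counts.most_common())
```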

In [54]:
train_dataset.feature_columns

['country', 'description', 'points', 'price', 'province']

In [34]:
info_cols = [
    # 'splits',
    "feature_columns",
    "feature_types",
    "label_columns",
    "label_types",
    #  'data',
    "metric",
    "problem_type",
]

for col in info_cols:
    print(col, getattr(train_dataset, col))


feature_columns ['title', 'salary_range', 'description', 'required_experience', 'required_education']


AttributeError: 'FakeJobPostings2' object has no attribute 'feature_types'
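
Not every registry dataset seems to define every attribute in `info_cols`, so the loop stops at the first missing one. Passing a default to `getattr` lets it carry on. A sketch with a stand-in class (`ToyDataset` is made up, not part of `auto_mm_bench`):

```python
class ToyDataset:
    """Stand-in for a registry dataset that lacks some attributes."""

    feature_columns = ["title", "description"]
    label_columns = ["fraudulent"]


info_cols = ["feature_columns", "feature_types", "label_columns"]

# The third argument to getattr is returned when the attribute is missing.
info = {col: getattr(ToyDataset(), col, None) for col in info_cols}
print(info)
```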

In [38]:
train_dataset.feature_columns


['title',
 'salary_range',
 'description',
 'required_experience',
 'required_education']

In [85]:
train_dataset.data[train_dataset.label_columns[0]].value_counts()

Pinot Noir                    10617
Chardonnay                     9402
Cabernet Sauvignon             7577
Red Blend                      7157
Bordeaux-style Red Blend       5532
Riesling                       4151
Sauvignon Blanc                3973
Syrah                          3314
Rosé                           2851
Merlot                         2482
Nebbiolo                       2243
Zinfandel                      2171
Sangiovese                     2166
Malbec                         2122
Portuguese Red                 1973
White Blend                    1888
Sparkling Blend                1722
Tempranillo                    1448
Rhône-style Red Blend          1177
Pinot Gris                     1164
Champagne Blend                1117
Cabernet Franc                 1082
Grüner Veltliner               1076
Portuguese White                927
Bordeaux-style White Blend      853
Pinot Grigio                    842
Gamay                           820
Gewürztraminer              

In [45]:
train_dataset.data["required_education"].value_counts()

Bachelor's Degree                    3425
High School or equivalent            1428
Unspecified                          1082
Master's Degree                       323
Associate Degree                      215
Certification                         123
Some College Coursework Completed      81
Professional                           59
Vocational                             40
Some High School Coursework            23
Doctorate                              19
Vocational - HS Diploma                 9
Vocational - Degree                     5
Name: required_education, dtype: int64

In [40]:
train_dataset.feature_columns


['title',
 'salary_range',
 'description',
 'required_experience',
 'required_education']

In [26]:
from datasets import load_dataset

my_ds = load_dataset("james-burton/product_sentiment_machine_hack")

HfHubHTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/api/datasets/james-burton/product_sentiment_machine_hack