# Bidding engine development

# Contents

- [Imports](#imports)
    - [Libraries](#libraries)
    - [Raw data](#raw-data)
- [Processing](#processing)
    - [Missing values](#missing-values)
        - [Filling with forward fill](#filling-with-forward-fill)
        - [Filling with most frequent](#filling-with-most-frequent)
        - [Filling with nearest neighbors](#filling-with-neighbours)
    - [Converting object features to numerical](#converting-object-features-to-numerical)
    - [Scaling metrics features](#scaling-metrics-features)
    - [Downcasting numerical features](#downcasting-numerical-features)
    - [Dummy variable for broad match modifier](#dummy-variable-for-broad-match-modifier)
    - [Profiling interim data](#profiling-interim-data)
    - [Convert CPC to GBP](#convert-cpc-to-gbp)
- [Quick look into keywords](#quick-look-into-keywords)
    - [Most sessions](#most-sessions)
    - [Least sessions](#least-sessions)
    - [Most impressions](#most-impressions)
    - [Least impressions](#least-impressions)
- [Processed data](#processed-data)
- [Keyword clustering](#keyword-clustering)
    - [Embedding](#embedding)
    - [k-means clustering](#k-means-clustering)
    - [Agglomerative clustering](#agglomerative-clustering)
- [Topic extraction](#topic-extraction)
- [Work in progress](#work-in-progress)

# Imports

## Libraries


In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dotenv import load_dotenv

SEED = 10
np.random.seed(seed=SEED)
load_dotenv()

## Raw data

In [None]:
raw_df = pd.read_feather(
    os.path.join(os.environ['RAW_DATA_PATH'], 'bidding_data.feather'))
raw_df.info()

In [None]:
raw_df.isna().sum()

In [None]:
sum(raw_df.duplicated())

In [None]:
raw_df.head()

In [None]:
raw_df.describe().T.style.background_gradient()

Observations

1. The `Cost` column has 39 NaNs (out of 587,593 rows). Because the number is negligible, these rows can be dropped. Alternatively, the values can be interpolated (forward fill) or imputed.

2. Judging by `.head()` and the column names, the following columns can be converted to numerical (float) type:
```
8   AbsoluteTopImpressionPercentage   587593 non-null  object        
9   TopImpressionPercentage           587593 non-null  object        
10  SearchImpressionShare             587593 non-null  object        
11  SearchTopImpressionShare          587593 non-null  object        
12  SearchRankLostTopImpressionShare  587593 non-null  object        
```

# Processing
[top](#contents)

In [None]:
interim_df = raw_df.copy(deep=True)
interim_df_sorted = interim_df.sort_values(
    by=['CriterionId', 'Date']).reset_index(drop=True)


In [None]:
def plot_keyword(df, criterion, features):
    '''
    Plot the given features over time for a single keyword.

    Parameters
    ----------
        df
            Dataframe with 'CriterionId', 'Date', 'Criteria' and the feature columns
        criterion
            CriterionId of the keyword to plot
        features
            List of column names to plot
    '''
    for feature in features:
        plt.plot(df.loc[df['CriterionId'] == criterion, 'Date'],
                 df.loc[df['CriterionId'] == criterion, feature],
                 label=feature)
    # Read the keyword text from the dataframe passed in, not the global raw_df.
    plt.title('Keyword ID {}\nUnique keywords {}'.format(
        criterion, df.loc[df['CriterionId'] == criterion,
                          'Criteria'].unique()))
    plt.xticks(rotation=45)
    if len(features) == 1:
        plt.ylabel(feature)
    else:
        plt.legend()
    plt.show()

In [None]:
def plot_filled(df, criterion, features):
    '''
    Plot the given features over time for a single keyword.
    Features with 'filled' in their name are drawn as markers only at the dates where 'Cost' is missing.
    If there is only one feature, the y-axis is labelled with its name.

    Parameters
    ----------
        df
            Dataframe with 'CriterionId', 'Date', 'Criteria', 'Cost' and the feature columns
        criterion
            CriterionId of the keyword to plot
        features
            List of column names to plot
    '''
    from matplotlib.markers import MarkerStyle

    markers = list(MarkerStyle.markers.keys())[1:len(features) + 1]
    plt.figure()
    for feature, marker in zip(features, markers):
        if 'filled' in feature:
            plt.plot(df.loc[(df['CriterionId'] == criterion)
                            & (df['Cost'].isna()), 'Date'],
                     df.loc[(df['CriterionId'] == criterion) &
                            (df['Cost'].isna()), feature],
                     marker,
                     label=feature)
        else:
            plt.plot(df.loc[df['CriterionId'] == criterion, 'Date'],
                     df.loc[df['CriterionId'] == criterion, feature],
                     label=feature)
    plt.title('Keyword ID {}\nUnique keywords {}'.format(
        criterion,
        df.sort_values(
            by=['CriterionId', 'Date']).loc[df['CriterionId'] == criterion,
                                            'Criteria'].unique()))
    plt.xticks(rotation=45)
    if len(features) == 1:
        plt.ylabel(feature)
    else:
        plt.legend()
    plt.show()

## Missing values

In [None]:
missing_criterionids = raw_df.loc[raw_df['Cost'].isnull(),
                                  'CriterionId'].unique()
print(f'unique keywords with missing values: {len(missing_criterionids)}')

In [None]:
# random_criterion_with_missing_cost = np.random.choice(missing_criterionids)
# features = ['Cost']
# plot_keyword(df=interim_df_sorted,
#              criterion=random_criterion_with_missing_cost,
#              features=features)


### Filling with forward fill

In [None]:
# Forward fill within each keyword so values do not leak across CriterionIds.
# interim_df_sorted['Cost_forward_filled'] = interim_df_sorted.groupby(
#     'CriterionId')['Cost'].ffill()


### Filling with most frequent

In [None]:
from sklearn.impute import SimpleImputer

frequency_imp = SimpleImputer(strategy='most_frequent')
interim_df_sorted['Cost_frequency_filled'] = interim_df_sorted['Cost'].copy(
    deep=True)
for criterion in missing_criterionids:
    interim_df_sorted.loc[
        interim_df_sorted['CriterionId'] == criterion,
        'Cost_frequency_filled'] = frequency_imp.fit_transform(
            interim_df_sorted.loc[
                interim_df_sorted['CriterionId'] == criterion,
                'Cost_frequency_filled'].values.reshape(-1, 1)).reshape(-1)

In [None]:
# interim_df_sorted['Cost_mode_filled'] = interim_df_sorted['Cost'].copy(
#     deep=True)
# modes = []
# for criterion in missing_criterionids:
#     mode = interim_df_sorted.loc[interim_df_sorted['CriterionId'] == criterion,
#                                  'Cost_mode_filled'].mode().values[0]
#     interim_df_sorted.loc[interim_df_sorted['CriterionId'] == criterion,
#                           'Cost_mode_filled'] = interim_df_sorted.loc[
#                               interim_df_sorted['CriterionId'] == criterion,
#                               'Cost_mode_filled'].fillna(value=mode)
#     modes.append(mode)


### Filling with neighbours

In [None]:
# from sklearn.impute import KNNImputer

# knn_imp = KNNImputer(n_neighbors=5, weights='distance')
# interim_df_sorted['Cost_knn_filled'] = interim_df_sorted['Cost'].copy(
#     deep=True)
# for criterion in missing_criterionids:
#     interim_df_sorted.loc[
#         interim_df_sorted['CriterionId'] == criterion,
#         'Cost_knn_filled'] = knn_imp.fit_transform(interim_df_sorted.loc[
#             interim_df_sorted['CriterionId'] == criterion,
#             'Cost_knn_filled'].values.reshape(-1, 1)).reshape(-1)

Comparison of the filling methods on the random keyword

In [None]:
# filled = interim_df_sorted.loc[interim_df_sorted['Cost'].isna(), [
#     'Cost', 'Cost_knn_filled', 'Cost_forward_filled', 'Cost_frequency_filled',
#     'Cost_mode_filled'
# ]]


In [None]:
# plot_filled(
#     df=interim_df_sorted,
#     #   uncomment to see filling on a random keyword
#     # criterion=np.random.choice(missing_criterionids),
#     #   uncomment below to see filling on the same keyword as above
#     criterion=random_criterion_with_missing_cost,
#     features=[
#         'Cost', 'Cost_forward_filled', 'Cost_frequency_filled',
#         'Cost_knn_filled', 'Cost_mode_filled'
#     ])
# print('keyword {}. mode: {}'.format(
#     criterion,
#     interim_df_sorted.loc[interim_df_sorted['CriterionId'] == criterion,
#                           'Cost'].mode()[0]))


Conclusion on filling NaNs

Because the data is sparse, NaNs in `Cost` are filled with the most frequent value (mode) of each keyword.

In [None]:
interim_df_sorted_no_nans = interim_df_sorted.copy(deep=True)
interim_df_sorted_no_nans['Cost'] = interim_df_sorted_no_nans[
    'Cost_frequency_filled']
interim_df_sorted_no_nans.drop(
    columns=[
        # 'Cost_forward_filled',
        'Cost_frequency_filled',
        # 'Cost_knn_filled',
        # 'Cost_mode_filled',
    ],
    inplace=True)
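
The per-keyword mode fill chosen above can also be written with a pandas `groupby`. The following is a minimal sketch on toy data, not the notebook's actual pipeline; the `toy` frame and the helper `fill_with_group_mode` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two keywords, each with one missing Cost value.
toy = pd.DataFrame({
    'CriterionId': [1, 1, 1, 1, 2, 2, 2],
    'Cost': [0.0, 0.0, 5.0, np.nan, 3.0, np.nan, 3.0],
})


def fill_with_group_mode(costs):
    # mode() ignores NaNs and returns sorted values, so iloc[0] is the
    # smallest mode when there are ties. A group with no observed values
    # has no mode and is returned unchanged.
    modes = costs.mode()
    return costs.fillna(modes.iloc[0]) if not modes.empty else costs


toy['Cost_mode_filled'] = toy.groupby('CriterionId')['Cost'].transform(
    fill_with_group_mode)
print(toy)
```

Unlike a loop over keywords, this version leaves a keyword whose `Cost` is missing on every row as NaN instead of raising, so such keywords can be inspected afterwards.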


## Converting object features to numerical

In [None]:
object_columns = [
    'AbsoluteTopImpressionPercentage', 'TopImpressionPercentage',
    'SearchImpressionShare', 'SearchTopImpressionShare',
    'SearchRankLostTopImpressionShare'
]

for col in object_columns:
    interim_df_sorted_no_nans[col] = interim_df_sorted_no_nans[col].str.lstrip(
        '<').str.rstrip('%').astype(float)
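
The conversion above assumes that every value becomes a valid number once `<` and `%` are stripped. If the columns also contain non-numeric placeholders (for example `--`, which is an assumption about the export format and not something checked in this notebook), a more defensive sketch on toy data could look like this:

```python
import pandas as pd

# Toy values (hypothetical): a plain share, a censored share and a placeholder.
raw = pd.Series(['12.5%', '< 10%', '--', '100%'])

# Remove '<', '%' and whitespace, then coerce anything non-numeric to NaN.
cleaned = raw.str.replace(r'[<%\s]', '', regex=True)
as_float = pd.to_numeric(cleaned, errors='coerce')
print(as_float)
```

Note that `< 10%` is mapped to `10.0` here, as in the cell above, so the information that the true value lies below 10% is lost.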

## Scaling metrics features

See the ranges of the metrics in the [Google Ads API](https://developers.google.com/google-ads/api/fields/v11/metrics) documentation.

In [None]:
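for col in object_columns:
    # Min-max scaling to [0, 1]: divide by the range (max - min), not by max.
    col_min = interim_df_sorted_no_nans[col].min()
    col_max = interim_df_sorted_no_nans[col].max()
    interim_df_sorted_no_nans[col] = (interim_df_sorted_no_nans[col] -
                                      col_min) / (col_max - col_min)
    # Percentage metrics are kept on a 0-100 scale.
    if 'Percentage' in col:
        interim_df_sorted_no_nans[col] *= 100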
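
As a reference for the scaling step, here is a minimal sketch of min-max scaling on toy values (the numbers are made up). It maps a column onto [0, 1] via `(x - min) / (max - min)`; dividing by `max` alone reaches 1 only when the minimum is 0.

```python
import pandas as pd

# Toy impression shares in percent (hypothetical values).
shares = pd.Series([10.0, 25.0, 40.0, 100.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (shares - shares.min()) / (shares.max() - shares.min())

# Dividing by the maximum instead of the range tops out below 1
# whenever the minimum is greater than 0.
divided_by_max = (shares - shares.min()) / shares.max()

print(scaled.tolist())
print(divided_by_max.max())
```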

## Downcasting numerical features

In [None]:
fcols = interim_df_sorted_no_nans.select_dtypes('float').columns
icols = interim_df_sorted_no_nans.select_dtypes('integer').columns

interim_df_sorted_no_nans[fcols] = interim_df_sorted_no_nans[fcols].apply(
    pd.to_numeric, downcast='float')
interim_df_sorted_no_nans[icols] = interim_df_sorted_no_nans[icols].apply(
    pd.to_numeric, downcast='integer')


## Dummy variable for broad match modifier

For the same row, `AdGroupName` can contain "BMM" while `Criteria` has no `+`. When the two disagree, the `+` in `Criteria` takes precedence, so the dummy variable is derived from `Criteria` only.

In [None]:
interim_df_sorted_no_nans['BMM'] = interim_df_sorted_no_nans[
    'Criteria'].str.contains('+', regex=False).astype(np.int8)
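
To see how often the two BMM signals disagree, they can be cross-tabulated. This is a small sketch on a toy frame; the ad group names and keywords are invented for illustration.

```python
import pandas as pd

# Toy rows (hypothetical): the ad group name and the keyword text can disagree.
toy = pd.DataFrame({
    'AdGroupName': ['Loans BMM', 'Loans BMM', 'Loans Exact', 'Cards Exact'],
    'Criteria': ['+cheap +loans', 'cheap loans', 'loans', '+credit +card'],
})

plus_in_criteria = toy['Criteria'].str.contains('+', regex=False)
bmm_in_ad_group = toy['AdGroupName'].str.contains('BMM', regex=False)

# Rows: BMM in the ad group name; columns: '+' in the keyword text.
disagreement = pd.crosstab(bmm_in_ad_group, plus_in_criteria)
n_mismatched = int((plus_in_criteria != bmm_in_ad_group).sum())
print(disagreement)
print(n_mismatched)
```

The off-diagonal cells of the table count the rows where only one of the two signals is present.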


In [None]:
interim_df_sorted_no_nans[
    interim_df_sorted_no_nans['AdGroupName'].str.contains('BMM')]


In [None]:
interim_df_sorted_no_nans[interim_df_sorted_no_nans['Criteria'].str.contains(
    '+', regex=False)]


<!-- ## Replacing non alphanum in keywords -->

In [None]:
# all(interim_df_sorted_no_nans['Criteria'].str.islower())

In [None]:
# interim_df_sorted_no_nans['Criteria'] = interim_df_sorted_no_nans[
#     'Criteria'].str.replace(r'[^\w\s]+', '', regex=True)


## Profiling interim data

In [None]:
# Create the output directory first so that os.listdir cannot fail.
os.makedirs(os.environ['DATA_PROFILES_PATH'], exist_ok=True)
if 'interim_data.html' not in os.listdir(os.environ['DATA_PROFILES_PATH']):
    from ydata_profiling import ProfileReport

    interim_data_profile = ProfileReport(interim_df_sorted_no_nans,
                                         title='Interim Data Profile')
    interim_data_profile.to_file(
        os.path.join(os.environ['DATA_PROFILES_PATH'], 'interim_data.html'))
    del interim_data_profile


<!-- ## Add ROI -->

In [None]:
# interim_df_sorted_no_nans['ROI_gbp'] = interim_df_sorted_no_nans[
#     'Margin'].divide(interim_df_sorted_no_nans['Cost_gbp'],
#                      fill_value=0).replace({
#                          np.inf: 0,
#                          np.nan: 0
#                      })
# interim_df_sorted_no_nans['ROI'] = interim_df_sorted_no_nans['Margin'].divide(
#     interim_df_sorted_no_nans['Cost'], fill_value=0).replace({
#         np.inf: 0,
#         np.nan: 0
#     })


## Convert CPC to GBP

In [None]:
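# Ratio of Cost to Cost_gbp; drop undefined (0/0), infinite and zero ratios
# explicitly instead of relying on the position of the first unique value.
cost_ratio = interim_df_sorted_no_nans['Cost'].div(
    interim_df_sorted_no_nans['Cost_gbp']).replace({
        np.inf: np.nan,
        0: np.nan
    }).dropna()
micros_to_gbp = cost_ratio.unique().mean()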


In [None]:
interim_df_sorted_no_nans[
    'CpcBid_gbp'] = interim_df_sorted_no_nans['CpcBid'] / micros_to_gbp
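
The conversion can be illustrated on toy data. The sketch below assumes that `Cost` and `CpcBid` share the same unit and that `Cost_gbp` is the same cost in GBP; the numbers are invented, and the median is used here instead of the mean of unique ratios as a more outlier-robust alternative, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): Cost and CpcBid in the same unit, Cost_gbp in GBP.
toy = pd.DataFrame({
    'Cost': [0.0, 1_200_000.0, 2_400_000.0, 600_000.0],
    'Cost_gbp': [0.0, 1.0, 2.0, 0.5],
    'CpcBid': [600_000.0, 1_200_000.0, 300_000.0, 2_400_000.0],
})

# 0 / 0 gives NaN and x / 0 gives inf, so keep only finite, positive ratios.
ratio = toy['Cost'] / toy['Cost_gbp']
ratio = ratio[np.isfinite(ratio) & (ratio > 0)]

# The median is less sensitive to a few odd rows than the mean.
units_per_gbp = ratio.median()
toy['CpcBid_gbp'] = toy['CpcBid'] / units_per_gbp
print(units_per_gbp)
print(toy['CpcBid_gbp'].tolist())
```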


# Quick look into keywords
[top](#contents)

## Unique keyword ids

In [None]:
len(interim_df_sorted_no_nans['CriterionId'].unique())

## Unique keyword ids with positive margin and sessions

In [None]:
len(interim_df_sorted_no_nans.loc[(interim_df_sorted_no_nans['Margin'] > 0) &
                                  (interim_df_sorted_no_nans['Sessions'] > 0),
                                  'CriterionId'].unique())


## Negative margins

In [None]:
# negative_margins = interim_df_sorted_no_nans.loc[
#     interim_df_sorted_no_nans['Margin'] < 0, 'CriterionId'].unique()


In [None]:
# for criterion in negative_margins:
#     plot_keyword(df=interim_df_sorted_no_nans,
#                  criterion=criterion,
#                  features=['Sessions', 'Cost_gbp'])
#     print('Negative margins: {} at {}'.format(interim_df_sorted_no_nans.loc[
#         (interim_df_sorted_no_nans['CriterionId'] == criterion) &
#         (interim_df_sorted_no_nans['Margin'] < 0), 'Margin'].values, interim_df_sorted_no_nans.loc[
#         (interim_df_sorted_no_nans['CriterionId'] == criterion) &
#         (interim_df_sorted_no_nans['Margin'] < 0), 'Date'].dt.date))


In this context, negative margins are unlikely to represent a loss, so the absolute value of `Margin` is used.

In [None]:
interim_df_sorted_no_nans['Margin'] = abs(interim_df_sorted_no_nans['Margin'])

## Most sessions

In [None]:
most_sessions = interim_df_sorted_no_nans.sort_values(
    by=['Sessions'],
    ascending=False).loc[interim_df_sorted_no_nans['Margin'] > 0,
                         'CriterionId'][:75].unique()
print(most_sessions)
interim_df_sorted_no_nans.loc[
    interim_df_sorted_no_nans['CriterionId'].isin(most_sessions),
    'Criteria'].unique()

## Least sessions

In [None]:
least_sessions = interim_df_sorted_no_nans.sort_values(
    by=['Sessions'], ascending=True)['CriterionId'].unique()[:10]
print(least_sessions)
interim_df_sorted_no_nans.loc[
    interim_df_sorted_no_nans['CriterionId'].isin(least_sessions),
    'Criteria'].unique()

## Most impressions

In [None]:
most_impressions = interim_df_sorted_no_nans.sort_values(
    by=['Impressions'], ascending=False)['CriterionId'].unique()[:10]
print(most_impressions)
interim_df_sorted_no_nans.loc[
    interim_df_sorted_no_nans['CriterionId'].isin(most_impressions),
    'Criteria'].unique()

## Least impressions

In [None]:
least_impressions = interim_df_sorted_no_nans.sort_values(
    by=['Impressions'], ascending=True)['CriterionId'].unique()[:10]
print(least_impressions)
interim_df_sorted_no_nans.loc[
    interim_df_sorted_no_nans['CriterionId'].isin(least_impressions),
    'Criteria'].unique()

# Processed data
[top](#contents)

In [None]:
processed_df = interim_df_sorted_no_nans.copy(deep=True)
fcols = processed_df.select_dtypes('float').drop(
    columns=['Cost', 'CpcBid']).columns.to_list()

In [None]:
processed_df = processed_df[
    ['CriterionId', 'Criteria', 'Date', 'Impressions', 'BMM'] + fcols]


In [None]:
processed_df.info()

In [None]:
processed_df.describe().T.style.background_gradient()

In [None]:
processed_df.head()

## Profiling processed data

In [None]:
# Create the output directory first so that os.listdir cannot fail.
os.makedirs(os.environ['DATA_PROFILES_PATH'], exist_ok=True)
if 'processed_data.html' not in os.listdir(os.environ['DATA_PROFILES_PATH']):
    from ydata_profiling import ProfileReport

    processed_data_profile = ProfileReport(processed_df,
                                           title='Processed Data Profile')
    processed_data_profile.to_file(
        os.path.join(os.environ['DATA_PROFILES_PATH'], 'processed_data.html'))
    del processed_data_profile

## Adding Criteria_bmm_ignored

In [None]:
processed_df['Criteria_bmm_ignored'] = processed_df['Criteria'].str.replace(
    r'[^\w\s]+', '', regex=True)


# Keyword clustering
[top](#contents)

## Embedding

In [None]:
del raw_df, interim_df_sorted_no_nans, interim_df_sorted

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v1')


In [None]:
kwds = processed_df['Criteria'].unique()
print(len(kwds))
kwds

In [None]:
kwds_no_bmm = processed_df['Criteria'].str.replace(r'[^\w\s]+', '',
                                                   regex=True).unique()
print(len(kwds_no_bmm))
kwds_no_bmm

In [None]:
kwds_embeddings = model.encode(kwds_no_bmm)
kwds_embeddings.shape

## k-means clustering

In [None]:
def get_clusters(clusters, return_clustered=False):
    clustered = {}
    for i, cluster_id in enumerate(clusters):
        if cluster_id not in clustered:
            clustered[cluster_id] = []
        clustered[cluster_id].append(kwds_no_bmm[i])

    for i, kw in sorted(clustered.items()):
        print('Cluster {}, {} elements\n{}\n'.format(i + 1, len(kw), kw))

    if return_clustered:
        return clustered

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

n_clusters = 5
kmeans_model = KMeans(n_clusters=n_clusters,
                      max_iter=500,
                      n_init=100,
                      init='k-means++',
                      random_state=SEED)
kmeans_model.fit(normalize(kwds_embeddings))
kmeans_clusters = kmeans_model.labels_
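
The embeddings are L2-normalised before k-means because, for unit vectors, the squared Euclidean distance equals `2 * (1 - cosine similarity)`. Euclidean k-means on normalised vectors therefore orders neighbours the same way as cosine distance. A quick numerical check on random vectors (not on the actual embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two random 8-dimensional vectors standing in for keyword embeddings.
a, b = rng.normal(size=(2, 8))

# L2-normalise both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

squared_euclidean = np.sum((a_unit - b_unit) ** 2)
cosine_similarity = a_unit @ b_unit

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
print(np.isclose(squared_euclidean, 2 - 2 * cosine_similarity))
```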

In [None]:
get_clusters(kmeans_clusters)

## Agglomerative clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

agglomerative_model = AgglomerativeClustering(n_clusters=None,
                                              metric='cosine',
                                              linkage='complete',
                                              distance_threshold=0.9)
agglomerative_model.fit(kwds_embeddings)
agglo_clusters = agglomerative_model.labels_

In [None]:
clustered = get_clusters(agglo_clusters, return_clustered=True)

In [None]:
clustered

In [None]:
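# Invert the cluster dictionary once instead of scanning it for every row.
keyword_to_cluster = {
    kw: cluster_id for cluster_id, kws in clustered.items() for kw in kws
}
processed_df['Agglomerated_cluster'] = processed_df[
    'Criteria_bmm_ignored'].map(keyword_to_cluster)
processed_df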


In [None]:
processed_df['Agglomerated_cluster'].value_counts()

In [None]:
processed_df.columns

In [None]:
# Sum impressions per date and cluster. Pivoting keeps every date that occurs
# in any cluster; combinations without data are NaN.
impressions_df = processed_df.pivot_table(index='Date',
                                          columns='Agglomerated_cluster',
                                          values='Impressions',
                                          aggfunc='sum')
impressions_df.columns = impressions_df.columns.astype(str)


In [None]:
impressions_df

In [None]:
for col in impressions_df.columns:
    plt.plot(impressions_df.loc[:, col], label=f'cluster {col}')
plt.xticks(rotation=45)
plt.ylabel('Impressions')
plt.legend()
plt.show()

# Forecasting

# Topic extraction

In [None]:
# from bertopic import BERTopic

# topic_model = BERTopic(min_topic_size=50, top_n_words=5)
# topics, probs = topic_model.fit_transform(kwds_no_bmm)

In [None]:
# topic_model.get_topic_info()

In [None]:
# topic_model.visualize_topics()

In [None]:
# all_topics = topic_model.get_topics()
# all_topics

## Visualise clusters with PCA
Following the [sklearn example](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)

In [None]:
# from sklearn.decomposition import PCA

# reduced_data = PCA(n_components=2).fit_transform(kwds_embeddings)
# kmeans_model_reduced = KMeans(n_clusters=n_clusters,
#                               max_iter=500,
#                               n_init=100,
#                               init='k-means++',
#                               random_state=SEED)
# kmeans_model_reduced.fit(reduced_data)

# h = 0.02
# x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
# y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
# xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Z = kmeans_model_reduced.predict(np.c_[xx.ravel(),
#                                        yy.ravel()].astype(np.float32))
# Z = Z.reshape(xx.shape)
# centroids = kmeans_model_reduced.cluster_centers_
# plt.imshow(Z,
#            interpolation='nearest',
#            extent=(xx.min(), xx.max(), yy.min(), yy.max()),
#            cmap=plt.cm.Pastel2)
# plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# plt.scatter(centroids[:, 0],
#             centroids[:, 1],
#             marker='x',
#             s=169,
#             linewidths=3,
#             color='w',
#             zorder=10)


In [None]:
# import umap
# umap_embeddings = umap.UMAP(n_neighbors=15,
#                             n_components=5,
#                             metric='cosine').fit_transform(kwds_embeddings)

In [None]:
# import hdbscan
# cluster = hdbscan.HDBSCAN(min_cluster_size=15,
#                           metric='euclidean',
#                           cluster_selection_method='eom').fit(umap_embeddings)

In [None]:
# import matplotlib.pyplot as plt

# # Prepare data
# umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(kwds_embeddings)
# result = pd.DataFrame(umap_data, columns=['x', 'y'])
# result['labels'] = cluster.labels_


In [None]:
# result

In [None]:
# # Visualize clusters
# fig, ax = plt.subplots()
# outliers = result.loc[result.labels == -1, :]
# clustered = result.loc[result.labels != -1, :]
# plt.scatter(outliers.x, outliers.y, color='#BDBDBD')
# plt.scatter(clustered.x, clustered.y, c=clustered.labels,cmap='viridis_r')
# plt.colorbar()

In [None]:
# import gensim.downloader as api

# glove = api.load('glove-wiki-gigaword-50')

In [None]:
# processed_df['Criteria_tokens'] = processed_df['Criteria'].map(lambda x: x.split())

In [None]:
# max_length = processed_df['Criteria_tokens'].map(lambda x: len(x)).unique().max()
# max_length

In [None]:
# glove.index_to_key

In [None]:
# in_glove = []
# for kw in processed_df['Criteria'].unique():
#     for ikw in kw.split():
#         in_glove.append(ikw in glove.index_to_key)

In [None]:
# [x.split() for x in processed_df['Criteria'].unique()]

In [None]:
# len(glove.index_to_key)

In [None]:
# processed_df['in_glove'] = processed_df['Criteria_tokens'].map(
#     lambda x: all(xi in glove.key_to_index for xi in x))


In [None]:
# processed_df['in_glove']

In [None]:
# all(token in glove.key_to_index for kw in processed_df['Criteria'].unique() for token in kw.split())

In [None]:
# processed_df['Criteria_tokens'].map(
#     lambda x: all(xi in glove.key_to_index for xi in x))


In [None]:
# vocab = []
# embeddings = {}
# with open(os.path.join(os.environ['GLOVE_PATH'], 'glove.6B.50d.txt'),
#           'rt',
#           encoding='utf-8') as f:
#     for line in f:
#         values = line.split()
#         word = values[0]
#         vocab.append(word)
#         vector = np.asarray(values[1:], 'float32')
#         embeddings[word] = vector

In [None]:
# interim_df['Criteria_tokens'] = interim_df['Criteria'].str.split()

In [None]:
# from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [None]:
# interim_df['Criteria_tokens'] = interim_df['Criteria'].map(
#     lambda x: tokenizer.tokenize(x, padding='max_length', truncation=True))


In [None]:
# interim_df['Criteria_tokens']

In [None]:
# interim_df['Criteria_ids'] = interim_df['Criteria_tokens'].map(
#     lambda x: tokenizer.convert_tokens_to_ids(x))


In [None]:
# interim_df['Criteria_ids']

In [None]:
# from transformers import DistilBertModel

# model = DistilBertModel.from_pretrained('distilbert-base-uncased',
#                                         output_hidden_states=True)
# model.eval()