<a href="https://colab.research.google.com/github/jasonsgraham/nlp_notes/blob/main/nlp_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup #

In [2]:
!wget -q https://gist.githubusercontent.com/jasonsgraham/f63e1737121e2154ee3ad228398137e2/raw/7f0bb2c8e756e43cabcd90027dedfb79670132ae/setup_colab.py -O colab_setup.py
%run colab_setup.py

Loading WANDB api key.


In [4]:
%%sh
pip install -q --upgrade transformers
pip install -q --upgrade wandb
pip install -q --upgrade mlflow

Import `mlflow`, installing it first if it is missing.

In [5]:
try:
    import mlflow
except ImportError as e:
    !pip install mlflow
    import mlflow

In [6]:
import json
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse
from tqdm import tqdm

pd.options.display.width = 180
pd.options.display.max_colwidth = 120

In [7]:
import sys
GOOGLE_COLAB = 'google.colab' in sys.modules

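if GOOGLE_COLAB:
  data_dir = Path('/content/drive/MyDrive/Colab Notebooks/input/AI4Code')
  output_dir = Path('/content/drive/MyDrive/Colab Notebooks/output/AI4Code')
  train_parquet_file = data_dir / 'train.parquet'
else:
  data_dir = Path('../input/AI4Code')
  output_dir = Path('./')
  # The input directory may be read-only (as on Kaggle), so cache the
  # parsed notebooks in the output directory instead.
  train_parquet_file = output_dir / 'train.parquet'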

# Ordering the Cells #

In [None]:
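# `squeeze=True` was removed from `read_csv` in pandas 2.0;
# `.squeeze('columns')` turns the one-column frame into a Series instead.
df_orders = (
    pd.read_csv(data_dir / 'train_orders.csv', index_col='id')
    .squeeze('columns')
    .str.split()  # Split the string representation of cell_ids into a list
)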

df_orders

In [10]:
NUM_TRAIN = 10000


def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )

if train_parquet_file.exists():
  df = pd.read_parquet(train_parquet_file)
else:
  paths_train = list((data_dir / 'train').glob('*.json'))[:NUM_TRAIN]
  notebooks_train = [
      read_notebook(path) for path in tqdm(paths_train, desc='Train NBs')
  ]
  df = (
      pd.concat(notebooks_train)
      .set_index('id', append=True)
      .swaplevel()
      .sort_index(level='id', sort_remaining=False)
  )
  df.to_parquet(train_parquet_file)

In [15]:
df_orders.shape

(139256,)

In [None]:
df

In [19]:
nb_id = df.index.unique('id')[6]
# Get the correct order
cell_order = df_orders.loc[nb_id]
nb = df.loc[nb_id, :].copy()  # copy so a 'rank' column can be added without touching df
print("The ordered notebook:")
#nb.loc[cell_order, :]
cell_order

The ordered notebook:


['3e551fb7',
 '45049ad8',
 '8bb41691',
 '123b4f4c',
 '0b92cb59',
 '5a8b6e2d',
 'df963df4',
 '3c7d19bc',
 '0f3db81b',
 'eadf5c66',
 '33ff3073',
 '6cfbe868',
 '88cc83b2',
 '818c4c15']

In [20]:
def get_ranks(base, derived):
    """Return the position in `base` of each element of `derived`."""
    return [base.index(d) for d in derived]

cell_ranks = get_ranks(cell_order, list(nb.index))
nb.insert(0, 'rank', cell_ranks)

nb

cell_id,rank,cell_type,source
3e551fb7,0,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
45049ad8,1,code,"train_data = pd.read_csv(""/kaggle/input/titanic/train.csv"")\ntest_data = pd.read_csv(""/kaggle/input/titanic/test.csv"")"
123b4f4c,3,code,import plotly.express as px
0b92cb59,4,code,train_data.head(20)
df963df4,6,code,train_data.isnull().sum() #checking out which column has most no. of NaN Values
0f3db81b,8,code,"px.bar(data_frame=train_data, x='Sex', y='Survived',color='Sex',facet_row_spacing=0, title=""Relation between Gender ..."
33ff3073,10,code,"total_passengers = train_data['Sex'].count()\ncount_males = 0\ncount_females = 0\nfor i,j in zip(train_data['Sex'], ..."
818c4c15,13,code,"from sklearn.ensemble import RandomForestClassifier\n\n\ny = train_data[""Survived""]\n\nfeatures = [""Pclass"", ""Sex"", ..."
6cfbe868,11,markdown,## Survival Rate for Male Passenger is : 12.235 %\n\n## Survival Rate for Female Passenger is : 26.150 %
eadf5c66,9,markdown,## Who has more luck in here? \n\n\nFrom the above data we can find out that females had more survival rate on Titan...


In [22]:
from pandas.testing import assert_frame_equal

assert_frame_equal(nb.loc[cell_order, :], nb.sort_values('rank'))
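
To make the rank bookkeeping concrete, here is a small sketch of `get_ranks` on made-up cell ids (the ids are hypothetical and only for illustration). It shows that sorting the stored cells by their rank recovers the correct order, which is what the assertion above checks on a real notebook.

```python
def get_ranks(base, derived):
    """Return the position in `base` of each element of `derived`."""
    return [base.index(d) for d in derived]


# Hypothetical cell ids, for illustration only.
correct_order = ['a1', 'b2', 'c3', 'd4']  # ground-truth order
shuffled = ['c3', 'a1', 'd4', 'b2']       # order in which the cells are stored

ranks = get_ranks(correct_order, shuffled)
print(ranks)  # [2, 0, 3, 1]

# Sorting the stored ids by rank recovers the correct order.
restored = [cell for _, cell in sorted(zip(ranks, shuffled))]
print(restored == correct_order)  # True
```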

In [23]:
df_orders_ = df_orders.to_frame().join(
    df.reset_index('cell_id').groupby('id')['cell_id'].apply(list),
    how='right',
)

ranks = {}
for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {'cell_id': cell_id, 'rank': get_ranks(cell_order, cell_id)}

df_ranks = (
    pd.DataFrame
    .from_dict(ranks, orient='index')
    .rename_axis('id')
    .apply(pd.Series.explode)
    .set_index('cell_id', append=True)
)

df_ranks

id,cell_id,rank
00001756c60be8,1862f0a6,0
00001756c60be8,2a9e43d6,2
00001756c60be8,038b763d,4
00001756c60be8,2eefe0ef,6
00001756c60be8,0beab1cd,8
...,...,...
12b925c525495d,84762508,17
12b925c525495d,bb270083,21
12b925c525495d,473e430f,14
12b925c525495d,71181d6d,4


# Splits #

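The `train_ancestors.csv` file identifies groups of notebooks derived from a common origin, that is, notebooks belonging to the same forking tree. Notebooks from the same tree tend to share content, so we keep each tree entirely within either the training set or the validation set.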

In [24]:
df_ancestors = pd.read_csv(data_dir / 'train_ancestors.csv', index_col='id')

In [25]:
df_ancestors

id,ancestor_id,parent_id
00001756c60be8,945aea18,
00015c83e2717b,aa2da37e,317b65d12af9df
0001bdd4021779,a7711fde,
0001daf4c2c76d,090152ca,
0002115f48f982,272b483a,
...,...,...
fffc30d5a0bc46,6aed207b,
fffc3b44869198,a6aaa8d7,
fffc63ff750064,0a1b5b65,
fffcd063cda949,d971e960,


In [26]:
from sklearn.model_selection import GroupShuffleSplit

NVALID = 0.1  # fraction of ancestor groups held out for validation

splitter = GroupShuffleSplit(n_splits=1, test_size=NVALID, random_state=0)

# Split, keeping notebooks with a common origin (ancestor_id) together
ids = df.index.unique('id')
ancestors = df_ancestors.loc[ids, 'ancestor_id']
ids_train, ids_valid = next(splitter.split(ids, groups=ancestors))
ids_train, ids_valid = ids[ids_train], ids[ids_valid]

df_train = df.loc[ids_train, :]
df_valid = df.loc[ids_valid, :]
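
As a quick sanity check of the grouped split, the sketch below runs `GroupShuffleSplit` on a handful of made-up notebook and ancestor ids (all hypothetical) and confirms that no ancestor ends up on both sides. Note that with a float `test_size`, scikit-learn takes that proportion of *groups*, not of individual notebooks.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical notebook ids and ancestor ids, for illustration only.
toy_ids = np.array(['nb0', 'nb1', 'nb2', 'nb3', 'nb4', 'nb5', 'nb6', 'nb7'])
toy_ancestors = np.array(['A', 'A', 'B', 'C', 'C', 'D', 'E', 'E'])

toy_splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(toy_splitter.split(toy_ids, groups=toy_ancestors))

train_groups = set(toy_ancestors[train_idx])
valid_groups = set(toy_ancestors[valid_idx])

# No ancestor appears on both sides of the split.
print(train_groups & valid_groups)  # set()
```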

In [32]:
df_valid

id,cell_id,cell_type,source
000757b90aaca0,8f84d7a9,code,import pandas as pd\nimport spacy\nimport networkx as nx # a really useful network analysis l...
000757b90aaca0,eb6ca769,code,nlp = spacy.load('en_core_web_lg') # A more detailed model (with higher-dimension word vectors) - 13s to l...
000757b90aaca0,bc595bc2,code,"plt.rcParams['figure.figsize'] = [10, 10] # makes the output plots large enough to be useful"
000757b90aaca0,93cceeef,code,rowlimit = 500 # this limits the tweets to a manageable number\ndata = pd.read_csv('../input/ExtractedT...
000757b90aaca0,3cb3d383,code,data.head(6)
...,...,...,...
1292c88558dbc8,15290200,markdown,# 2. Import Datasets
1292c88558dbc8,affda817,markdown,# Data Dictionary
1292c88558dbc8,0d9947c2,markdown,<b>We notice from the plot most the word frequancies are the common word. and that's will not help us to understand ...
1292c88558dbc8,af4b2ad7,markdown,## 3.5 Print Selective Rows from Non-Toxic Comments


# Feature Engineering #

Let's generate [tf-idf features](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to use with our ranking model. These features will help our model learn what kinds of words tend to occur most often at various positions within a notebook.

## AI4Code Extract all functions, variables... names


See the original notebook: https://www.kaggle.com/code/haithamaliryan/ai4code-extract-all-functions-variables-names

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Training set
tfidf = TfidfVectorizer(min_df=0.01)
X_train = tfidf.fit_transform(df_train['source'].astype(str))
# Rank of each cell within the notebook
y_train = df_ranks.loc[ids_train].to_numpy()
# Number of cells in each notebook
groups = df_ranks.loc[ids_train].groupby('id').size().to_numpy()
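
To see what the vectorizer produces, here is a minimal sketch on three made-up cell sources (hypothetical text, not taken from the dataset). Each cell becomes one sparse row, and each column corresponds to a token in the learned vocabulary. The toy example uses the default `min_df`; in the cell above, the float `min_df=0.01` drops tokens that occur in fewer than 1% of the cells.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cell sources, for illustration only.
toy_cells = [
    'import pandas as pd',
    'df = pd.read_csv("train.csv")',
    '# Conclusion: the model works',
]

toy_tfidf = TfidfVectorizer()
X_toy = toy_tfidf.fit_transform(toy_cells)

print(X_toy.shape[0])                 # 3 (one row per cell)
print('pd' in toy_tfidf.vocabulary_)  # True
```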

Now let's add the code cell ordering as a feature. We'll append a column that enumerates the code cells in the correct order, like `1, 2, 3, 4, ...`, while having the dummy value `0` for all markdown cells. This feature will help the model learn to put the code cells in the correct order.

In [33]:
code = nb.loc[nb.cell_type == 'code']

In [34]:
code

cell_id,rank,cell_type,source
3e551fb7,0,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
45049ad8,1,code,"train_data = pd.read_csv(""/kaggle/input/titanic/train.csv"")\ntest_data = pd.read_csv(""/kaggle/input/titanic/test.csv"")"
123b4f4c,3,code,import plotly.express as px
0b92cb59,4,code,train_data.head(20)
df963df4,6,code,train_data.isnull().sum() #checking out which column has most no. of NaN Values
0f3db81b,8,code,"px.bar(data_frame=train_data, x='Sex', y='Survived',color='Sex',facet_row_spacing=0, title=""Relation between Gender ..."
33ff3073,10,code,"total_passengers = train_data['Sex'].count()\ncount_males = 0\ncount_females = 0\nfor i,j in zip(train_data['Sex'], ..."
818c4c15,13,code,"from sklearn.ensemble import RandomForestClassifier\n\n\ny = train_data[""Survived""]\n\nfeatures = [""Pclass"", ""Sex"", ..."


In [None]:
import tokenize
import io

code.loc['33ff3073','source']

code_text = tokenize.generate_tokens(io.StringIO(code.loc['33ff3073','source']).readline)
[tok for tok in code_text]

In [None]:
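# Extract only names (functions, variables, keywords) and comments; these can then be joined.
# Symbolic constants are used because numeric token codes differ between Python versions.
code_text = tokenize.generate_tokens(io.StringIO(code.loc['33ff3073','source']).readline)
[tok.string for tok in code_text if tok.type in (tokenize.NAME, tokenize.COMMENT)]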
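
The same token filtering can be tried on a one-line snippet. This is only a sketch: the source string is made up, and it assumes the goal is to keep `NAME` and `COMMENT` tokens while dropping operators and punctuation.

```python
import io
import tokenize

# Hypothetical cell source, for illustration only.
toy_source = "total = count_rows(df)  # number of rows\n"

tokens = tokenize.generate_tokens(io.StringIO(toy_source).readline)
kept = [
    tok.string
    for tok in tokens
    if tok.type in (tokenize.NAME, tokenize.COMMENT)
]
print(kept)  # ['total', 'count_rows', 'df', '# number of rows']

# Joining the kept tokens gives a compact text that could be fed to tf-idf.
print(' '.join(kept))
```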

In [42]:
# Add code cell ordering
X_train = sparse.hstack((
    X_train,
    np.where(
        df_train['cell_type'] == 'code',
        df_train.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

print(X_train.shape)

(416710, 282)
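
Here is the code-cell ordering feature on a tiny made-up frame (two hypothetical notebooks), to show what the extra column looks like: code cells are numbered `1, 2, 3, ...` within each notebook and markdown cells get `0`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-notebook frame, for illustration only.
toy = pd.DataFrame({
    'id': ['nb1', 'nb1', 'nb1', 'nb1', 'nb2', 'nb2'],
    'cell_type': ['code', 'code', 'markdown', 'code', 'markdown', 'code'],
})

code_position = np.where(
    toy['cell_type'] == 'code',
    # Running count of cells of the same type within each notebook, starting at 1.
    toy.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
    0,
)
print(code_position.tolist())  # [1, 2, 0, 3, 0, 1]
```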


# Train #

We'll use the ranking algorithm provided by XGBoost.

In [44]:
from xgboost import XGBRanker

model = XGBRanker(
    min_child_weight=10,
    subsample=0.5,
    tree_method='hist',
)

model.fit(X_train, y_train, group=groups)

XGBRanker(min_child_weight=10, subsample=0.5, tree_method='hist')

# Evaluate #

Now let's see how well our model learned to order Kaggle notebook cells. We'll evaluate predictions on the validation set with a variant of the Kendall tau correlation.

## Validation set ##

First we'll create features for the validation set just like we did for the training set.

In [45]:
# Validation set
X_valid = tfidf.transform(df_valid['source'].astype(str))
# The metric uses cell ids
y_valid = df_orders.loc[ids_valid]

X_valid = sparse.hstack((
    X_valid,
    np.where(
        df_valid['cell_type'] == 'code',
        df_valid.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

Here we'll use the model to predict the rank of each cell within its notebook and then convert these ranks into a list of ordered cell ids.

In [46]:
y_pred = pd.DataFrame({'rank': model.predict(X_valid)}, index=df_valid.index)
y_pred = (
    y_pred
    .sort_values(['id', 'rank'])  # Sort the cells in each notebook by their rank.
                                  # The cell_ids are now in the order the model predicted.
    .reset_index('cell_id')  # Convert the cell_id index into a column.
    .groupby('id')['cell_id'].apply(list)  # Group the cell_ids for each notebook into a list.
)
y_pred.head(10)

id
000757b90aaca0    [8f84d7a9, eb6ca769, bc595bc2, 93cceeef, 3cb3d383, 6e3a3d90, abc159f0, b20690ef, 20f10a90, e301d5a4, 744648dd, 1fa48...
000890decea38e    [86a09baa, 6c87eeba, d757b392, fb49d33e, 52a99293, 5a4a2bbc, 77c0fb61, 966ceb67, 1ff9852b, bea84840, 2a4a5ac7, 8afe2...
0009acaa9aa47e    [304ad2c3, 5f4ae1f5, 1d217b81, 510ff074, 9b289685, b1714b44, 716dad85, d3c01a47, a77df155, 2dc4f6f6, 02f9f02a, 9bdaa...
001106f5f235f6    [2d035cf0, 37e65ad4, 7208c7b5, 7761e223, f05252da, f7738626, 3d7d3b06, ba791708, 12a04565, 3148a4aa, 4d978208, b32dd...
00181d9eb98d2c    [cc6e1157, 3ef648a4, 1f1aa782, 48778ea3, 2e25881f, 56159531, bace8369, c7375d07, 22d0985f, 60667602, 864238e0, 2606c...
001c0599b0a3e5    [3f3f7be1, 600f419a, 0f0b9bfc, 10ffeb58, 919e192a, 2b97a048, e5b38898, 9c5a9aa7, 0b15d473, 1dce54a8, 18198b77, 2c770...
0026ce20e23778    [a94a3251, 9c2d055b, 11bd3b3f, 450680b8, daad77d0, 19a850ac, 94a8181f, e1f15510, fc64940d, e905e109, c161e485, 4711e...
002ba502bdac45    [b71dfd9b, 33
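
The conversion from predicted scores to ordered cell ids can be followed step by step on a toy example. The notebook ids, cell ids, and scores below are made up; only the chain of pandas calls mirrors the cell above.

```python
import pandas as pd

# Hypothetical predicted scores for two small notebooks, for illustration only.
toy_index = pd.MultiIndex.from_tuples(
    [('nb1', 'a'), ('nb1', 'b'), ('nb1', 'c'), ('nb2', 'x'), ('nb2', 'y')],
    names=['id', 'cell_id'],
)
toy_pred = pd.DataFrame({'rank': [2.5, 0.1, 1.7, 0.9, 0.3]}, index=toy_index)

toy_order = (
    toy_pred
    .sort_values(['id', 'rank'])           # lowest predicted rank first, per notebook
    .reset_index('cell_id')                # cell_id becomes a regular column
    .groupby('id')['cell_id'].apply(list)  # one list of cell ids per notebook
)
print(toy_order.to_dict())  # {'nb1': ['b', 'c', 'a'], 'nb2': ['y', 'x']}
```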

Now let's examine a notebook to see how the model did.

In [47]:
nb_id = df_valid.index.get_level_values('id').unique()[8]

display(df.loc[nb_id])
display(df.loc[nb_id].loc[y_pred.loc[nb_id]])

cell_id,cell_type,source
b437301e,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
a127110f,code,import warnings\nwarnings.filterwarnings('ignore')
a8c42629,code,!pip install recordlinkage
94679249,code,import recordlinkage\nfrom recordlinkage.datasets import load_febrl1
dead433d,code,# Loading and using the in-built dataset \ndf = load_febrl1()\ndf.head()
9fbe6800,code,list(df.columns)
134e3b58,code,separator = recordlinkage.BlockIndex(on='given_name')\npairs = separator.index(df)\nprint(len(pairs))
5bdc5bd9,code,pairs
f8c15bb7,code,# Comparing every field with every field of each pair - similarity scores\n# We can see that the given name has 1 si...
3dd3baaa,code,"# Select all features except blocking key\ndf_f = features.drop(['given_name'], axis=1)\ndf_f.head()"


cell_id,cell_type,source
b437301e,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
a127110f,code,import warnings\nwarnings.filterwarnings('ignore')
a8c42629,code,!pip install recordlinkage
94679249,code,import recordlinkage\nfrom recordlinkage.datasets import load_febrl1
dead433d,code,# Loading and using the in-built dataset \ndf = load_febrl1()\ndf.head()
9fbe6800,code,list(df.columns)
134e3b58,code,separator = recordlinkage.BlockIndex(on='given_name')\npairs = separator.index(df)\nprint(len(pairs))
5bdc5bd9,code,pairs
f8c15bb7,code,# Comparing every field with every field of each pair - similarity scores\n# We can see that the given name has 1 si...
6bb45576,markdown,<h2>1.1 Blocking</h2>\n\nWe will first make our process efficient by minimizing the total comparison window. If we h...


## Metric ##

This competition uses a variant of the [Kendall tau correlation](https://www.kaggle.com/competitions/AI4Code/overview/evaluation), which will measure how close to the correct order our predicted orderings are. See this notebook for more on this metric: [Competition Metric - Kendall Tau Correlation](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation/notebook).

In [None]:
from bisect import bisect


def count_inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j], using insertion into a sorted list."""
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


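def kendall_tau(ground_truth, predictions):
    """Score predicted orderings against the ground truth: 1 is a perfect order, -1 a fully reversed one."""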
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max
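
A small worked example may help with reading the score. The sketch below repeats `count_inversions` so that it runs on its own, cross-checks it against a brute-force count (a helper added here only for illustration), and evaluates the single-notebook form of the formula, `1 - 4 * inversions / (n * (n - 1))`.

```python
from bisect import bisect
from itertools import combinations


def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def count_inversions_brute(a):
    # O(n^2) reference: count the pairs that appear in the wrong order.
    return sum(1 for x, y in combinations(a, 2) if x > y)


def single_score(ranks):
    n = len(ranks)
    return 1 - 4 * count_inversions(ranks) / (n * (n - 1))


ranks = [2, 0, 3, 1]
print(count_inversions(ranks), count_inversions_brute(ranks))  # 3 3
print(single_score(ranks))         # 0.0
print(single_score([0, 1, 2, 3]))  # 1.0  (perfect order)
print(single_score([3, 2, 1, 0]))  # -1.0 (fully reversed)
```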

Let's test the metric with a dummy submission created from the ids of the shuffled notebooks.

In [None]:
y_dummy = df_valid.reset_index('cell_id').groupby('id')['cell_id'].apply(list)
dummy_score = kendall_tau(y_valid, y_dummy)

mlflow.log_metric("Dummy_Score", dummy_score)

Comparing this to the score on the predictions, we can see that our model was indeed able to improve the cell ordering somewhat.

In [None]:
prediction_score = kendall_tau(y_valid, y_pred)
mlflow.log_metric("Prediction_Score", prediction_score)

# Submission #

To create a submission for this competition, we'll apply our model to the notebooks in the test set. Note that this is a **Code Competition**, which means that the test data we see here is only a small sample. When we submit our notebook for scoring, this example data will be replaced with the full test set of about 20,000 notebooks.

First we load the data.

In [None]:
paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [
    read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
]
df_test = (
    pd.concat(notebooks_test)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

Then create the tf-idf and code cell features.

In [None]:
X_test = tfidf.transform(df_test['source'].astype(str))
X_test = sparse.hstack((
    X_test,
    np.where(
        df_test['cell_type'] == 'code',
        df_test.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

And then create predictions on the test set.

In [None]:
y_infer = pd.DataFrame({'rank': model.predict(X_test)}, index=df_test.index)
y_infer = y_infer.sort_values(['id', 'rank']).reset_index('cell_id').groupby('id')['cell_id'].apply(list)
y_infer

The `sample_submission.csv` file shows what a correctly formatted submission must look like. We'll just use it as a visual check, but you might prefer to modify the values of the sample submission directly. (This would help prevent failed submissions due to missing notebook ids or incorrectly named columns, for instance.)
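
As a rough sketch of that last step, the predicted lists can be joined back into strings and written out as CSV. The example uses made-up predictions, and the column name `cell_order` with space-separated cell ids is an assumption here; check it against `sample_submission.csv` before submitting.

```python
import pandas as pd

# Hypothetical predictions, for illustration only.
toy_infer = pd.Series(
    [['b', 'c', 'a'], ['y', 'x']],
    index=pd.Index(['nb1', 'nb2'], name='id'),
)

# Assumed format: an `id` column plus a `cell_order` column holding
# space-separated cell ids (verify against sample_submission.csv).
toy_submit = toy_infer.apply(' '.join).rename('cell_order')
csv_text = toy_submit.to_csv()  # pass a path such as 'submission.csv' to write a file
print(csv_text)
```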