# Readability Metrics


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

https://pypi.org/project/py-readability-metrics/
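
As a rough illustration of what a readability metric computes, here is a minimal standard-library sketch of the Flesch-Kincaid grade level. It is not the code of the package linked above. The helper names `count_syllables` and `flesch_kincaid_grade`, the vowel-group syllable heuristic, and the regex-based sentence and word splitting are all assumptions made for this example.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, drop one for a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic (assumption): a trailing 'e' is usually silent, except in '-le' endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple_score = flesch_kincaid_grade("The cat sat. The dog ran.")
hard_score = flesch_kincaid_grade(
    "Comprehensive institutional considerations necessitate extraordinary deliberation."
)
```

Once `df` has been loaded, something like `df['excerpt'].apply(flesch_kincaid_grade)` would score every excerpt. A maintained library is preferable for real use, because the syllable heuristic here is only approximate.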


In [2]:
df = pd.read_csv(r"https://raw.githubusercontent.com/jjschueder/SMUCaptsoneA/main/train.csv")

# Model Building

## BERT Transformer Model

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2834 entries, 0 to 2833
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2834 non-null   object 
 1   url_legal       830 non-null    object 
 2   license         830 non-null    object 
 3   excerpt         2834 non-null   object 
 4   target          2834 non-null   float64
 5   standard_error  2834 non-null   float64
dtypes: float64(2), object(4)
memory usage: 133.0+ KB


In [4]:
!pip install multimodal_transformers
# References for combining tabular data with Hugging Face transformers:
# https://medium.com/georgian-impact-blog/how-to-incorporate-tabular-data-with-huggingface-transformers-b70ac45fcfb4
# https://multimodal-toolkit.readthedocs.io/en/latest/notes/introduction.html#how-to-initialize-transformer-with-tabular-models
# https://github.com/georgian-io/Multimodal-Toolkit/blob/master/main.py

Collecting multimodal_transformers
  Downloading multimodal_transformers-0.1.4a0.tar.gz (18 kB)
Collecting transformers==3.1
  Downloading transformers-3.1.0-py3-none-any.whl (884 kB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tokenizers==0.8.1.rc2
  Downloading tokenizers-0.8.1rc2-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Building wheels for collected packages: multimodal-transformers
  Building wheel for multimodal-transformers (setup.py) ... done
  Created wheel for multimodal-transformers: filename=multimodal_transformers-0.1.4a0-py3-none-any.whl

In [5]:
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data

text_cols = ['excerpt']              # free-text column passed to the tokenizer
label_col = 'target'                 # regression target
numerical_cols = ['standard_error']  # numerical feature combined with the text

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

torch_dataset = load_data(
    df,
    text_cols,
    tokenizer,
    numerical_cols=numerical_cols,
    sep_text_token_str=tokenizer.sep_token,
    label_col=label_col,
)

In [6]:
from multimodal_transformers.model import AutoModelWithTabular, TabularConfig
from transformers import AutoConfig

config = AutoConfig.from_pretrained('bert-base-uncased')
tabular_config = TabularConfig(
    num_labels=1, #1 for regression
    numerical_feat_dim=torch_dataset.numerical_feats.shape[1],
    combine_feat_method='weighted_feature_sum_on_transformer_cat_and_numerical_feats',
)
config.tabular_config = tabular_config

model = AutoModelWithTabular.from_pretrained('bert-base-uncased', config=config)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertWithTabular: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertWithTabular from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertWithTabular from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertWithTabular were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifi

In [7]:

from transformers import Trainer, TrainingArguments

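training_args = TrainingArguments(
    output_dir="./logs/model_name",
    logging_dir="./logs/runs",
    overwrite_output_dir=True,
    do_train=True,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    # No eval_dataset is passed to Trainer below, so evaluation during
    # training is switched off; set this to True once one is supplied.
    evaluate_during_training=False,
    logging_steps=25,
)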

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=torch_dataset
)

trainer.train()

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/89 [00:00<?, ?it/s]

RuntimeError: ignored
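
Evaluating during training requires a held-out set handed to `Trainer` as `eval_dataset`. The sketch below produces row positions for such a split using only the standard library. The function name `split_indices`, the 80/20 ratio, and the seed are illustrative assumptions and do not come from the notebook. It is not presented as a fix for the error above.

```python
import random


def split_indices(n_rows: int, val_fraction: float = 0.2, seed: int = 42):
    """Return (train_positions, val_positions) as two disjoint, sorted lists."""
    if n_rows < 2:
        raise ValueError("need at least two rows to split")
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    positions = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(positions)
    # Clamp so that both sides of the split are non-empty.
    n_val = min(n_rows - 1, max(1, round(n_rows * val_fraction)))
    return sorted(positions[n_val:]), sorted(positions[:n_val])


# 2834 is the row count reported by df.info() earlier in the notebook.
train_idx, val_idx = split_indices(2834)
```

With these positions, `df.iloc[train_idx]` and `df.iloc[val_idx]` could each be passed through `load_data`, and the second dataset supplied to `Trainer` as `eval_dataset`.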

In [None]:
df.iloc[2216,3]

"Ludwig was only four years old when he began to study music. Like children of today he shed many a tear over the first lessons. In the beginning his father taught him piano and violin, and forced him to practice. At school he learned, just as we do today, reading, writing, arithmetic, and later on, Latin.\nNever again after thirteen, did Ludwig go to school for he had to work and earn his living.\nDo you wonder what kind of a boy he was?\nWe are told that he was shy and quiet. He talked little and took no interest in the games that his boy and girl companions played.\nWhile Ludwig was in school he played at a concert for the first time. He was then eight years old. Two years later he had composed quite a number of pieces. One of these was printed. It was called Variations on Dressler's March."