# Language Model: flat data, without context

**Note**: The default model is `MOSTLY_AI/LSTMFromScratch-3m`, a lightweight LSTM model trained from scratch (**GPU strongly recommended**). You can also use a pre-trained Hugging Face model by setting, for example, `model="microsoft/phi-1.5"` (**GPU required**).


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mostly-ai/mostlyai-engine/blob/main/examples/language.ipynb)
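
Since the note above ties the model choice to GPU availability, a small pre-flight check can be useful before fitting. The sketch below is illustrative only: `gpu_available` and `pick_model` are hypothetical helper names, not part of `mostlyai.engine`, and the check assumes PyTorch is installed and that only CUDA devices count as a GPU. Switching to the Hugging Face model whenever a GPU is found is a design assumption, not a recommendation from the library.

```python
DEFAULT_MODEL = "MOSTLY_AI/LSTMFromScratch-3m"  # lightweight LSTM, trained from scratch
HF_MODEL = "microsoft/phi-1.5"                  # pre-trained model, GPU required


def gpu_available() -> bool:
    """Return True if PyTorch is importable and reports a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the backend used for GPU detection
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_model(has_gpu: bool) -> str:
    """Hypothetical helper: use the pre-trained model only when a GPU is present."""
    return HF_MODEL if has_gpu else DEFAULT_MODEL


model_name = pick_model(gpu_available())
print(f"Selected model: {model_name}")
```

The resulting `model_name` could then be passed as the `model` argument of `LanguageModel`. Without a GPU, the LSTM model may still train, but expect it to be slow.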

In [None]:
import pandas as pd
from mostlyai.engine import LanguageModel

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/arxiv"
trn_df = pd.read_parquet(f"{url}/synthetic-data-papers.parquet")[['category', 'title']]

# create and fit the model
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",  # lightweight LSTM model, trained from scratch (GPU strongly recommended)
    # model="microsoft/phi-1.5",           # alternatively, a pre-trained LLM hosted on Hugging Face (GPU required)
    max_training_time=10,                  # limit training to 10 minutes for demo purposes
    tgt_encoding_types={
        'category': 'LANGUAGE_CATEGORICAL',
        'title': 'LANGUAGE_TEXT',
    },
    verbose=1,
)
lm.fit(trn_df)

# generate synthetic samples
syn_tgt_df = lm.sample(n_samples=100)

In [None]:
syn_tgt_df.head(5)
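
Beyond eyeballing the first rows, one simple sanity check is to compare how often each `category` value appears in the original and in the synthetic data. The following is a minimal sketch using plain pandas; `compare_category_shares` is a hypothetical helper, and the two toy frames are made-up stand-ins so the snippet runs on its own. In the notebook you would pass `trn_df` and `syn_tgt_df` instead.

```python
import pandas as pd


def compare_category_shares(trn_df, syn_df, column="category"):
    """Return the relative frequency of each value in both frames, side by side."""
    shares = pd.concat(
        [
            trn_df[column].value_counts(normalize=True).rename("original"),
            syn_df[column].value_counts(normalize=True).rename("synthetic"),
        ],
        axis=1,
    ).fillna(0.0)  # a value missing from one frame has a share of 0 there
    shares["abs_diff"] = (shares["original"] - shares["synthetic"]).abs()
    return shares.sort_values("original", ascending=False)


# toy stand-ins (illustrative values only, not taken from the real dataset)
toy_trn = pd.DataFrame({"category": ["cs.LG", "cs.LG", "cs.CV", "stat.ML"]})
toy_syn = pd.DataFrame({"category": ["cs.LG", "cs.CV", "cs.CV", "cs.CL"]})

report = compare_category_shares(toy_trn, toy_syn)
print(report)
```

Large values in `abs_diff`, or categories that appear in only one of the two frames, may indicate that the model needs more training time than the 10-minute demo limit allows.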