# Creating and Logging Vocabulary as a Weights & Biases Artifact

The Python script below builds a **vocabulary** from the **``train_data.csv``** dataset and logs each step of the process to **Weights & Biases (wandb)**:

1. **Initialization**:
The script starts by downloading the Natural Language Toolkit (NLTK) stopwords to filter out common words that do not contribute significantly to the sentiment analysis.

2. **Wandb Run Setup**:
A new wandb run is initiated within the specified project and job type. This run will record all operations and log outputs related to generating the vocabulary.

3. **Artifact Download**:
The **``train_data``** artifact, which contains the pre-cleaned training data in **``train_data.csv``**, is fetched from wandb using **``run.use_artifact``** with the **``latest``** alias. The artifact's content is then downloaded to a local directory for processing.

4. **Data Loading**:
The training data is read into a Pandas DataFrame from the CSV file. The text data needed for the vocabulary is extracted into a list using the **``load_data_from_dataframe``** function.

5. **Vocabulary Construction**:
The **``add_docs_to_vocab``** function iterates over each document in the text data, cleans it, and updates the vocab Counter object with the resulting tokens.

6. **Token Cleaning**:
The **``clean_doc``** function processes each document to remove punctuation, non-alphabetic tokens, stopwords, and very short tokens to produce a list of meaningful tokens.

7. **Vocabulary Logging**:
The initial size of the vocabulary is logged to wandb, capturing the number of unique tokens before filtering.

8. **Filtering Tokens**:
Tokens that occur at least twice (**``min_occurrence = 2``**) are retained, so the vocabulary contains only words that appear more than once in the corpus.

9. **Filtered Vocabulary Logging**:
The size of the filtered vocabulary, i.e., the number of tokens that meet the minimum occurrence criterion, is also logged to wandb.

10. **Vocabulary Saving**:
The filtered list of tokens is saved to a local file, **``vocabulary.txt``**, with one token per line, ready to be uploaded to wandb.

11. **Artifact Creation and Uploading**:
A new artifact named **``vocab``** of type **``Vocab``** is created, and the **``vocabulary.txt``** file is added to this artifact. The artifact is then logged to the wandb run, which uploads it to the wandb server.

12. **Run Completion**:
The wandb run is concluded using **``run.finish()``**, marking the end of the vocabulary generation and logging process.

This script facilitates an automated, reproducible approach to generating a vocabulary from text data and ensures that the results are tracked and stored in a structured manner within the wandb platform. It showcases how to use wandb for artifact management, from data retrieval and preprocessing to artifact creation and logging.

## Install, load libraries and set up wandb

In [None]:
!pip install wandb

In [2]:
# Login to Weights & Biases
!wandb login --relogin

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [12]:
import wandb
import pandas as pd
import string
import re
from collections import Counter
from nltk.corpus import stopwords
import nltk
import os

In [4]:
# Ensure that NLTK Stopwords are downloaded
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Initialization, Wandb Run Setup and Artifact Download

In [8]:
# Initialize wandb run
run = wandb.init(project='sentiment_analysis', job_type='generate_vocab')

# Download the train_data.csv artifact
artifact = run.use_artifact('train_data:latest')
train_data_path = artifact.download()

[34m[1mwandb[0m:   1 of 1 files downloaded.  


## Data Loading, Vocabulary Construction, Token Cleaning, Vocabulary Saving

In [10]:
# load text data into memory from a Pandas DataFrame
def load_data_from_dataframe(df):
    return df['text'].tolist()

# compile the punctuation filter and load the stop words once, not per document
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stop_words = set(stopwords.words('english'))

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# turn documents into clean tokens
def add_docs_to_vocab(texts, vocab):
    for doc in texts:
        tokens = clean_doc(doc)
        vocab.update(tokens)

# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # write the text; the with-block closes the file even if writing fails
    with open(filename, 'w') as file:
        file.write(data)
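
To make the cleaning steps concrete, here is a small offline sketch that applies the same five filters to one sentence. It is an illustration only: `clean_doc_demo` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny hand-picked stop word set stands in for the NLTK English list so that the example runs without a download.

```python
import re
import string

# Assumption: a tiny stand-in for nltk.corpus.stopwords.words('english'),
# chosen only so this sketch runs offline.
DEMO_STOP_WORDS = {"the", "was", "a", "and", "it", "but"}

# Matches any single ASCII punctuation character.
RE_PUNC = re.compile("[%s]" % re.escape(string.punctuation))

def clean_doc_demo(doc, stop_words=DEMO_STOP_WORDS):
    tokens = doc.split()                                  # split on whitespace
    tokens = [RE_PUNC.sub("", w) for w in tokens]         # strip punctuation
    tokens = [w for w in tokens if w.isalpha()]           # drop non-alphabetic tokens
    tokens = [w for w in tokens if w not in stop_words]   # drop stop words
    tokens = [w for w in tokens if len(w) > 1]            # drop one-letter tokens
    return tokens

sample = "the movie was great, but a 2nd viewing felt so-so!"
demo_tokens = clean_doc_demo(sample)
print(demo_tokens)  # ['movie', 'great', 'viewing', 'felt', 'soso']
```

Two side effects are visible here: the hyphenated "so-so" collapses into "soso" because punctuation is removed inside tokens, and "2nd" is dropped entirely. The stop word comparison is also case-sensitive, so a capitalized "The" would survive unless the text has already been lowercased upstream.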

## All together

In [13]:
# Build the path to train_data.csv inside the downloaded artifact directory
full_train_data_path = os.path.join(train_data_path, 'train_data.csv')

# Load the training data
train_data_df = pd.read_csv(full_train_data_path)

# Load text data
texts = load_data_from_dataframe(train_data_df)

# Define vocab
vocab = Counter()

# Add all docs to vocab
add_docs_to_vocab(texts, vocab)

# Log the size of the vocab
wandb.log({'initial_vocab_size': len(vocab)})

# Keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
wandb.log({'filtered_vocab_size': len(tokens)})

# Save tokens to a vocabulary file
save_list(tokens, 'vocabulary.txt')

# Create a new artifact for the vocabulary file
vocab_artifact = wandb.Artifact(
    name='vocab',
    type='Vocab',
    description='Vocabulary from training data'
)

# Add the vocabulary file to the artifact
vocab_artifact.add_file('vocabulary.txt')

# Log the new artifact to wandb
run.log_artifact(vocab_artifact)

<Artifact vocab>
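
The two logged sizes come from the same `Counter`: the initial size is the number of distinct tokens, and the filtered size is the number of tokens whose count reaches `min_occurrence`. A toy sketch with made-up token lists (not taken from the training data) shows the effect of the `>=` comparison:

```python
from collections import Counter

# Made-up token lists standing in for three cleaned documents.
toy_docs = [["good", "film"], ["good", "plot"], ["bad", "film"]]

toy_vocab = Counter()
for doc_tokens in toy_docs:
    toy_vocab.update(doc_tokens)

min_occurrence = 2
# Keep tokens seen at least min_occurrence times ("plot" and "bad" occur once).
kept = [tok for tok, count in toy_vocab.items() if count >= min_occurrence]

print(len(toy_vocab), kept)  # 4 ['good', 'film']
```

Because the comparison is `>=`, a token that appears exactly twice is kept; only tokens that appear once are discarded.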

In [14]:
# Finish the wandb run and upload the artifacts to cloud
run.finish()

Run summary:
filtered_vocab_size: 25769
initial_vocab_size: 44332
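
Once the run has finished, a later run can consume the logged vocabulary. The following is a hedged sketch of how that might look, not part of the original notebook: `load_vocab` and `fetch_vocab` are hypothetical helper names, and the artifact name `vocab:latest` and file name `vocabulary.txt` assume the values used above.

```python
import os

def load_vocab(path):
    """Read a one-token-per-line file into a set, skipping blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def fetch_vocab(run, artifact_name="vocab:latest", filename="vocabulary.txt"):
    """Download the vocab artifact through an active wandb run and load it.

    Assumes `run` is the object returned by wandb.init() and that the
    artifact contains `filename` at its root.
    """
    artifact = run.use_artifact(artifact_name)
    vocab_dir = artifact.download()  # returns a local directory path
    return load_vocab(os.path.join(vocab_dir, filename))
```

Loading into a set rather than a list keeps membership checks fast when the vocabulary is later used to filter tokens in other documents.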
