# Dataset Splitting and Uploading to Weights & Biases (wandb)

In this exercise, the goal is to **split a pre-cleaned dataset** into **training** and **testing** subsets and upload these subsets back to **``wandb``** as new artifacts. The dataset, stored in a **``wandb``** artifact named **``clean_data``**, consists of text files grouped by sentiment into two directories named **``pos``** and **``neg``**. The steps carried out in the script are detailed below:



1. **Data Loading**:

A function named **``load_data``** is defined to read the text files from the **``pos``** and **``neg``** directories. It collects the text in one list and the labels in another, assigning 1 to positive and 0 to negative sentiment.

2. **Data Splitting**:

A function named **``split_data``** is defined to split the loaded data into training and testing subsets. It uses the **``train_test_split``** function from scikit-learn with a configurable training fraction (90% by default) and a random seed for reproducibility. The split is stratified by label, so both subsets keep the class balance of the full dataset.

3. **Wandb Run Initialization**:

A wandb run is initialized using **``wandb.init``** under the project **``sentiment_analysis``** with a job type of **``data_segregation``**. The newly created artifacts are logged to this run.

4. **Artifact Retrieval and Download**:

The **``clean_data``** artifact is retrieved using **``run.use_artifact``** and its content is downloaded to a local directory using **``artifact.download``**.

5. **Data Conversion and Saving**:

The training and testing subsets are converted to Pandas DataFrames and saved to CSV files using the **``to_csv``** method. This creates local CSV files containing the text data and corresponding labels.

6. **Artifact Creation**:

Two new wandb artifacts named **``train_data``** and **``test_data``** are created to house the training and testing subsets, respectively. These artifacts are assigned types **``TrainData``** and **``TestData``**.

7. **File Addition to Artifacts**:

The local CSV files are added to the respective artifacts using the **``add_file``** method. This prepares the artifacts with the data for uploading to wandb.

8. **Artifact Uploading**:

The **``train_data``** and **``test_data``** artifacts are uploaded to wandb using **``run.log_artifact``**. This makes the split datasets available on wandb for further use.

9. **Wandb Run Termination**:

Finally, the wandb run is closed using **``wandb.finish``**, which marks the data splitting and uploading process as complete.

This script automates splitting a dataset into training and testing subsets and uploading them to wandb as new artifacts. The split datasets are then versioned, shareable, and readily accessible for subsequent analysis or modeling tasks within the wandb environment.

## Install, load libraries and set up wandb

In [None]:
!pip install wandb

In [None]:
# Login to Weights & Biases
!wandb login --relogin

In [None]:
import os
import wandb
import pandas as pd
from sklearn.model_selection import train_test_split

## Wandb Run Initialization, Artifact Retrieval and Download

In [None]:
# Initialize wandb run
run = wandb.init(project='sentiment_analysis', job_type='data_segregation')

# Get the clean_data artifact
artifact = run.use_artifact('clean_data:latest')

# Download the content of the artifact to the local directory
data_path = artifact.download()

## Load and Split Data

In [None]:
# Function to load data from the 'pos' and 'neg' directories
def load_data(data_path):
    data = []
    labels = []
    for sentiment in ['pos', 'neg']:
        sentiment_path = os.path.join(data_path, sentiment)
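        # Sort the file names: os.listdir returns them in arbitrary order,
        # which would make the seeded split differ between machines
        for file_name in sorted(os.listdir(sentiment_path)):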
            with open(os.path.join(sentiment_path, file_name), 'r', encoding='utf-8') as file:
                data.append(file.read())
                labels.append(1 if sentiment == 'pos' else 0)
    return data, labels

# Function to split data into training and testing sets
def split_data(data, labels, train_size=0.9, random_state=None):
    x_train, x_test, y_train, y_test = train_test_split(
        data, labels, train_size=train_size, random_state=random_state, stratify=labels)
    return x_train, x_test, y_train, y_test
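
As a quick sanity check of the loading step, the sketch below runs the same logic as `load_data` on a throwaway directory. The file names and review texts are hypothetical and exist only for illustration. The copy of `load_data` used here sorts the file names so that the output order is deterministic.

```python
import os
import tempfile


def load_data(data_path):
    """Same logic as the notebook's load_data, repeated so the sketch is self-contained."""
    data, labels = [], []
    for sentiment in ['pos', 'neg']:
        sentiment_path = os.path.join(data_path, sentiment)
        # Sorted so the result does not depend on the file system's listing order
        for file_name in sorted(os.listdir(sentiment_path)):
            with open(os.path.join(sentiment_path, file_name), 'r', encoding='utf-8') as file:
                data.append(file.read())
                labels.append(1 if sentiment == 'pos' else 0)
    return data, labels


def make_toy_dataset(root):
    """Write a few tiny review files (hypothetical names and contents) under root/pos and root/neg."""
    samples = {
        'pos': {'a.txt': 'great film', 'b.txt': 'loved it'},
        'neg': {'c.txt': 'dull plot'},
    }
    for sentiment, files in samples.items():
        os.makedirs(os.path.join(root, sentiment))
        for name, text in files.items():
            with open(os.path.join(root, sentiment, name), 'w', encoding='utf-8') as fh:
                fh.write(text)


with tempfile.TemporaryDirectory() as tmp_dir:
    make_toy_dataset(tmp_dir)
    toy_texts, toy_labels = load_data(tmp_dir)

print(toy_texts)   # texts from 'pos' first, then 'neg'
print(toy_labels)  # 1 for each 'pos' file, 0 for each 'neg' file
```

The two returned lists are aligned by position, which is what `split_data` relies on when it receives them as `data` and `labels`.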

## All together

In [None]:
# Load data
data, labels = load_data(data_path)

# Split data (default is 90% training, 10% testing, with a random state for reproducibility)
x_train, x_test, y_train, y_test = split_data(data, labels, random_state=42)
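
To illustrate what `stratify=labels` and a fixed `random_state` do, here is a minimal sketch on a made-up, perfectly balanced toy dataset (20 hypothetical reviews, not the real data). It calls `train_test_split` with the same arguments that `split_data` forwards.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative "reviews"
toy_texts = [f'review {i}' for i in range(20)]
toy_labels = [1] * 10 + [0] * 10

# Same arguments that split_data passes through
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, train_size=0.9, random_state=42, stratify=toy_labels)

# Repeat the split with the same seed to check reproducibility
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    toy_texts, toy_labels, train_size=0.9, random_state=42, stratify=toy_labels)

print(len(x_tr), len(x_te))         # sizes of the two subsets
print(Counter(y_tr), Counter(y_te)) # label counts per subset
print(x_te == x_te2)                # identical seed and input order give the same split
```

With a balanced input, stratification keeps both subsets balanced as well. The seed only guarantees the same split when the input order is also the same, which is why the order in which the files are loaded matters.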

In [None]:
# Convert split data to DataFrames
train_data = pd.DataFrame({'text': x_train, 'label': y_train})
test_data = pd.DataFrame({'text': x_test, 'label': y_test})

# Log the shapes of the training and testing datasets
wandb.log({'train_data_shape': train_data.shape,
           'test_data_shape': test_data.shape})

In [None]:
train_data.head()

In [None]:
train_data.shape

In [None]:
test_data.head()

In [None]:
test_data.shape

In [None]:
# Save split data to CSV files
train_data.to_csv('train_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

# Create new artifacts for train and test data
train_artifact = wandb.Artifact(
    name='train_data',
    type='TrainData',
    description='Training data split from clean_data'
)
test_artifact = wandb.Artifact(
    name='test_data',
    type='TestData',
    description='Testing data split from clean_data'
)

# Add CSV files to the artifacts
train_artifact.add_file('train_data.csv')
test_artifact.add_file('test_data.csv')

# Log the new artifacts to wandb
run.log_artifact(train_artifact)
run.log_artifact(test_artifact)
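
Since downstream steps will read the uploaded CSV files back with pandas, it can be worth checking that the text survives the `to_csv` / `read_csv` round trip. The sketch below does this on a small hypothetical DataFrame whose texts contain commas, quotes and line breaks. The file name `toy_data.csv` is an assumption used only for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows with characters that need quoting in CSV
toy = pd.DataFrame({
    'text': ['good, but long', 'he said "no"', 'line one\nline two'],
    'label': [1, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_data.csv')
    toy.to_csv(csv_path, index=False)   # same call as in the notebook
    restored = pd.read_csv(csv_path)

# Compare the restored columns with the originals
texts_match = restored['text'].tolist() == toy['text'].tolist()
labels_match = restored['label'].tolist() == toy['label'].tolist()
print(texts_match, labels_match)
```

One caveat: with the default settings, `read_csv` treats an empty field as a missing value, so a review consisting of an empty string would come back as `NaN` rather than `''`.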

In [None]:
# Finish the wandb run
# Note: when logging through the run, the files are only updated after finish() has executed
wandb.finish()