# Data Verification for Cleaned Sentiment Analysis Dataset

In this Python exercise, a series of tests verifies the integrity and structure of a cleaned dataset downloaded from a **``Weights & Biases (wandb)``** artifact named **``clean_data``**. The dataset consists of text files split by sentiment into two directories, **``pos``** and **``neg``**. The tests confirm that the data has the expected layout and meets a few basic criteria before any further analysis or modeling.

Here’s a summary of the tests conducted:

1. **Directory Existence**:
Checked whether the directories **``pos``** and **``neg``** exist within the downloaded artifact.

2. **Instance Count**:
Verified that there are at least 500 instances (files) in each of the **``pos``** and **``neg``** directories.

3. **Duplicate Verification**:
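Checked that there are no duplicate files within or across the **``pos``** and **``neg``** directories by comparing filenames. Filenames within a single directory are unique by construction, so the informative part of this test is the cross-directory comparison. The test assumes that unique filenames correspond to unique content; files with different names but identical text would not be detected.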

4. **File Non-emptiness**:
Checked that each file in the **``pos``** and **``neg``** directories is not empty, ensuring that every file contains some data.
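
Because the duplicate check in item 3 compares filenames only, a content-based variant may be worth considering. The following is a minimal sketch, assuming the same **``pos``**/**``neg``** layout; the helper name `find_duplicate_contents` is hypothetical and is not part of the original exercise.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_contents(root, folders=("pos", "neg")):
    """Group files under root/<folder> by the SHA-256 digest of their bytes.

    Returns a list of groups; each group is a sorted list of relative paths
    whose contents are byte-for-byte identical.
    Hypothetical helper, not part of the original test file.
    """
    by_hash = defaultdict(list)
    for folder in folders:
        folder_path = os.path.join(root, folder)
        for name in sorted(os.listdir(folder_path)):
            with open(os.path.join(folder_path, name), "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            by_hash[digest].append(os.path.join(folder, name))
    # Keep only digests shared by more than one file
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Inside a test this could be used as `assert find_duplicate_contents(data) == []`, where `data` is the path returned by the fixture.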

These tests are written as a Python script using the **pytest** framework. The script starts a **wandb** run (job type **``data_checks``**) and uses a **fixture** to download the **``clean_data``** artifact from **wandb**, which gives the tests a local path to the data. Each test is defined as a separate function. The script does not invoke pytest itself; a **``pytest``** command in a later notebook cell runs all the tests and prints a detailed report of the results.

This verification step matters because it confirms that the cleaned dataset is well-structured, free of filename duplicates, and ready for subsequent analysis or machine learning tasks. Because the checks run inside a **wandb** run that consumes the artifact, there is a traceable record of which dataset version was verified, which contributes to the reproducibility and reliability of the analysis pipeline.

## Install, load libraries and set up wandb

In [None]:
!pip install wandb pytest pytest-sugar

In [2]:
# Login to Weights & Biases
!wandb login --relogin

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [5]:
import wandb

## Pytest


### How pytest discovers tests


**pytest** uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to discover tests automatically:

1. files with tests should be called **``test_*.py``** or **``*_test.py``**
2. test function names should start with **``test_``**
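
The two naming rules above can be illustrated with a small sketch. It only mimics the default patterns with `fnmatch` and `str.startswith`; it is not how pytest implements discovery, and pytest applies further rules (for example, the file patterns can be changed through the `python_files` configuration option). The helper names are hypothetical.

```python
from fnmatch import fnmatch


def looks_like_test_file(filename):
    """True if the filename matches one of the two default file patterns."""
    return fnmatch(filename, "test_*.py") or fnmatch(filename, "*_test.py")


def looks_like_test_function(name):
    """True if the function name carries the test_ prefix."""
    return name.startswith("test_")


# Check a few candidate filenames against the file-naming rule
for candidate in ["test_data.py", "data_test.py", "data_checks.py"]:
    print(candidate, looks_like_test_file(candidate))
```

Under these rules, `test_data.py` and `data_test.py` would be picked up, while `data_checks.py` would be ignored unless it is passed to pytest explicitly.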

### Fixture



An important aspect of using **``pytest``** is understanding how a fixture's scope works.

A fixture's scope can take several values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We consider only **``session``** and **``function``**. With **``session``**, the fixture runs once per pytest session and the value it returns is shared by every test that requests it. With **``function``** (the default), the fixture runs again for each test function, so every test gets a fresh copy of the data. This is useful, for example, when a test modifies its input in a way that would make other tests fail.

## Create and run a test file

In [3]:
%%file test_data.py
import pytest
import wandb
import os

# This is global so all tests are collected under the same run
run = wandb.init(project="sentiment_analysis", job_type="data_checks")

@pytest.fixture(scope="session")
def data():
    # Download the clean_data artifact
    local_path = run.use_artifact("clean_data:latest").download()
    return local_path

def test_directory_existence(data):
    """
    Test that the 'pos' and 'neg' directories exist
    """
    assert os.path.isdir(os.path.join(data, 'pos'))
    assert os.path.isdir(os.path.join(data, 'neg'))

def test_instance_count(data):
    """
    Test that there are at least 500 instances in 'pos' and 'neg' directories
    """
    assert len(os.listdir(os.path.join(data, 'pos'))) >= 500
    assert len(os.listdir(os.path.join(data, 'neg'))) >= 500

def test_no_duplicates(data):
    """
    Test that there are no duplicate files within and across 'pos' and 'neg' directories
    """
    pos_files = set(os.listdir(os.path.join(data, 'pos')))
    neg_files = set(os.listdir(os.path.join(data, 'neg')))
    # No duplicates within directories (os.listdir never returns the same
    # name twice, so these two asserts hold by construction)
    assert len(pos_files) == len(os.listdir(os.path.join(data, 'pos')))
    assert len(neg_files) == len(os.listdir(os.path.join(data, 'neg')))
    # No duplicates across directories
    assert len(pos_files.intersection(neg_files)) == 0

def test_non_empty_files(data):
    """
    Test that each file in 'pos' and 'neg' directories is not empty
    """
    for folder in ['pos', 'neg']:
        for file in os.listdir(os.path.join(data, folder)):
            assert os.path.getsize(os.path.join(data, folder, file)) > 0

Writing test_data.py
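
The checks themselves only need a local path, so they can be tried without a wandb connection. The following is a minimal sketch under that assumption: it builds a throwaway **``pos``**/**``neg``** dataset with `tempfile` and applies checks that mirror `test_data.py`. The helpers `make_toy_dataset` and `check_dataset`, the file naming scheme, and the lowered threshold of 3 files are all hypothetical and not part of the original exercise.

```python
import os
import tempfile


def make_toy_dataset(root, n_per_class=3):
    """Create root/pos and root/neg, each holding n_per_class small text files."""
    for label in ("pos", "neg"):
        folder = os.path.join(root, label)
        os.makedirs(folder, exist_ok=True)
        for i in range(n_per_class):
            with open(os.path.join(folder, f"{label}_{i}.txt"), "w") as fh:
                fh.write(f"{label} review number {i}\n")
    return root


def check_dataset(root, min_instances):
    """Apply checks mirroring test_data.py to a local directory."""
    names = {}
    for label in ("pos", "neg"):
        folder = os.path.join(root, label)
        # Directory existence
        assert os.path.isdir(folder), f"missing directory: {label}"
        names[label] = os.listdir(folder)
        # Instance count
        assert len(names[label]) >= min_instances, f"too few files in {label}"
        # File non-emptiness
        for name in names[label]:
            assert os.path.getsize(os.path.join(folder, name)) > 0, f"empty file: {name}"
    # No filename shared across the two classes
    assert not set(names["pos"]) & set(names["neg"]), "filename shared across classes"


with tempfile.TemporaryDirectory() as tmp:
    check_dataset(make_toy_dataset(tmp), min_instances=3)
    print("all checks passed")
```

A dry run like this helps confirm that the assertions behave as intended before pointing them at the real artifact.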


In [4]:
# run tests
!pytest . -vv

Test session starts (platform: linux, Python 3.10.12, pytest 7.4.3, pytest-sugar 0.9.7)
cachedir: .pytest_cache
rootdir: /content
plugins: sugar-0.9.7, anyio-3.7.1
collected 4 items

 test_data.py::test_directory_existence ✓                                             25%
 test_data.py::test_instance_count ✓                                                  50%
 test_data.py::test_no_duplicates ✓                                                   75%
 test_data.py::test_non_empty_files ✓                                                100%

In [6]:
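# Optionally, finish any wandb run still active in this notebook
# (the run created in test_data.py lives in the pytest process and ends with it)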
wandb.finish()