[Go to Home Page](https://weaviate.oneblink.ai)



# Toxic Comment Classifier

This Jupyter Notebook is part of the Toxic Comment Classifier project. The original project can be found [here](https://github.com/weaviate/weaviate-examples/tree/main/weaviate-toxic-comment-classifier).

## Project Overview
This project uses semantic search in Weaviate to classify comments from the Toxic Comment Classification dataset. Each comment in the dataset carries a binary label indicating whether it is toxic. The notebook shows how to index the dataset in Weaviate and then use semantic search to classify comments.
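
The notebook does not show the classification step itself, so the following is only a minimal sketch of the idea. It assumes that a new comment is labeled by a majority vote over its nearest labeled neighbors in vector space. The three-dimensional vectors and the names `cosine_similarity` and `classify_by_neighbors` are made up for illustration; in the project, Weaviate produces the vectors and performs the search.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_neighbors(query_vector, labeled_vectors, k=3):
    """Return the most common label among the k most similar vectors.

    labeled_vectors is a list of (vector, label) pairs.
    """
    ranked = sorted(
        labeled_vectors,
        key=lambda pair: cosine_similarity(query_vector, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy data (hypothetical): real Contextionary vectors have 300 dimensions.
indexed = [
    ([0.9, 0.1, 0.0], "Toxic"),
    ([0.8, 0.2, 0.1], "Toxic"),
    ([0.1, 0.9, 0.2], "Non Toxic"),
    ([0.0, 0.8, 0.3], "Non Toxic"),
    ([0.2, 0.7, 0.1], "Non Toxic"),
]

print(classify_by_neighbors([0.85, 0.15, 0.05], indexed, k=3))  # Toxic
```

Two of the three closest vectors carry the label `Toxic`, so the vote returns `Toxic`. A larger `k` smooths out noisy neighbors at the cost of blurring small clusters.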

### Contextual Classification
The project uses contextual classification, a technique that makes predictions from the context of the text rather than from pre-existing training data. It is particularly effective for data with strong semantic connections.

## Technology Stack
- Python
- Weaviate
- Streamlit

### Weaviate Modules/Models
We use the [text2vec-contextionary (Contextionary) vectorizer](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-contextionary), which builds 300-dimensional vectors using a Weighted Mean of Word Embeddings (WMOWE) technique, trained on Wiki and CommonCrawl data.
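
To make the weighted-mean idea concrete, here is a small sketch. It assumes that each word has an embedding and a weight, and that the text vector is the weight-normalized average of the word vectors. The embeddings, the weights, and the function name `weighted_mean_embedding` are invented for illustration and use 3 dimensions instead of 300; how Contextionary actually derives its weights is not covered in this notebook.

```python
def weighted_mean_embedding(words, embeddings, weights):
    """Average the vectors of the known words, weighted per word."""
    known = [w for w in words if w in embeddings]
    if not known:
        raise ValueError("no known words to embed")
    total_weight = sum(weights.get(w, 1.0) for w in known)
    dims = len(embeddings[known[0]])
    return [
        sum(weights.get(w, 1.0) * embeddings[w][i] for w in known) / total_weight
        for i in range(dims)
    ]


# Hypothetical toy vocabulary.
embeddings = {
    "you": [0.0, 1.0, 0.0],
    "are": [0.0, 0.0, 1.0],
    "awful": [1.0, 0.0, 0.0],
}
# Hypothetical weights: the more informative word counts double.
weights = {"you": 1.0, "are": 1.0, "awful": 2.0}

vector = weighted_mean_embedding("you are awful".split(), embeddings, weights)
print(vector)  # [0.5, 0.25, 0.25]
```

Because "awful" carries twice the weight of the other two words, the resulting vector leans toward its direction, which is the effect a weighted mean is meant to achieve.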


In [None]:

# Import necessary libraries
import pandas as pd
import weaviate
import weaviate.classes as wvc
from IPython.display import display, Image

# Displaying a demo image (if available)
# Image("demo.gif")  # Uncomment this line if the demo.gif is available in the directory

# Connect to a local Weaviate instance (HTTP on port 8081, gRPC on port 50052)
client = weaviate.connect_to_local(port=8081, grpc_port=50052)



## Setting Up the Weaviate Client and Collection
Before indexing the data, we create a collection named 'Comments' using the client connected in the first cell. If the collection already exists, it is deleted first so that we start fresh.


In [None]:

# pandas, weaviate, and wvc were imported in the first cell,
# and `client` is already connected, so they are reused here.

# Define the collection name
COLLECTION_NAME = 'Comments'

# If collection already exists, delete it
if client.collections.exists(COLLECTION_NAME):
    client.collections.delete(COLLECTION_NAME)

# Create the collection
# (in the released v4 Python client, schema helpers live under wvc.config)
client.collections.create(
    name=COLLECTION_NAME,
    properties=[
        wvc.config.Property(
            name="comment",
            data_type=wvc.config.DataType.TEXT,
            description="The text of the comment"
        ),
        wvc.config.Property(
            name="label",
            data_type=wvc.config.DataType.TEXT,
            description="The label of the comment"
        ),
    ],
    description="Comments written by people",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_contextionary(),
)



## Data Preparation
Next, we load and preprocess the Toxic Comment Classification dataset. This involves shuffling the data, renaming columns, and converting binary labels into text labels for better readability.


In [None]:

# Load the dataset
data = pd.read_csv("./train.csv")

# Shuffle the dataset
data = data.sample(frac=1, random_state=42)

# Rename the dataframe columns to match the property names in the collection definition
data = data.rename(columns={'comment_text': 'comment', 'toxic': 'label'})

# Turn the binary label into a text label
data['label'] = data['label'].replace({1: 'Toxic', 0: 'Non Toxic'})



## Data Ingestion into Weaviate
We then ingest the prepared data into the Weaviate collection. Weaviate vectorizes each comment on insertion, which is what makes the data searchable via semantic search.


In [None]:

# Fetch CRUD collection object
comments = client.collections.get(COLLECTION_NAME)

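# Prepare objects: keep only the two properties defined in the collection
# and take the first 1000 rows (increase or decrease this number as needed)
records = data[['comment', 'label']].head(1000).to_dict(orient='records')
objects_to_add = [wvc.data.DataObject(properties=rec) for rec in records]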

# Insert data into Weaviate
response = comments.data.insert_many(objects_to_add)

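# Check for errors during insertion
# (response.errors maps the index of each failed object to its error)
if response.has_errors:
    for index, error in response.errors.items():
        print(f"Object {index}: {error.message}")
else:
    print("Data added successfully")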
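
The cell above sends all objects in a single `insert_many` call. If you raise the limit well beyond 1000 entries, it may be safer to send the records in smaller chunks. The helper below is a minimal sketch of that idea; `chunked` and the chunk size of 200 are hypothetical choices, not part of the Weaviate client.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in records; in the notebook these would be the prepared objects.
records = [{"comment": f"comment {i}", "label": "Non Toxic"} for i in range(450)]

chunk_sizes = [len(chunk) for chunk in chunked(records, 200)]
print(chunk_sizes)  # [200, 200, 50]
```

In the notebook, each chunk of `objects_to_add` could be passed to `comments.data.insert_many` in a loop, checking `has_errors` on every response so that a failure in one chunk is reported without hiding the others.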



## Running the Streamlit App
To interact with the classifier, we use a Streamlit app. The app lets users enter a comment and returns its classification. Streamlit runs as its own server process, so the app is best started outside the notebook, from a terminal:

1. Open a terminal.
2. Navigate to the directory containing the `app.py` file.
3. Run the command `streamlit run app.py`.
4. A local server will start, and you can view the app in your browser at `http://localhost:8501`.


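In [None]:
# Alternatively, launch the app from the notebook.
# This cell keeps running until the Streamlit server is stopped.
!python3.11 -m streamlit run app.py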

