This project provides a simple web application for finding similar words using a gensim Word2Vec model. It consists of two main components:
- FastAPI Backend (api.py): A robust REST API that loads a Word2Vec model, accepts a word as input, and returns a list of the 5 most similar words. It includes input validation and graceful error handling.
- Streamlit Frontend (app.py): A user-friendly web interface that allows users to enter a word, interact with the FastAPI backend, and view the results or any error messages.
Additionally, the script used to train the model (train.py) is provided; its usage is described at the end of this README.
/word-similarity-app
|-- api.py # FastAPI backend server
|-- app.py # Streamlit frontend application
|-- train.py # model training file
|-- requirements.txt # Project dependencies
|-- word2vec.model # (Generated) Word2Vec model file
|-- README.md # This file
Before you begin, ensure you have the following installed:
- Python 3.8+
- pip (Python package installer)
- Git
Follow these steps to set up the project locally.
First, clone the repository to your local machine using Git.
git clone https://github.com/joygeo007/python_dev_task.git
cd python_dev_task

It is highly recommended to use a virtual environment to manage project dependencies.
- On macOS / Linux:

  python3 -m venv venv
  source venv/bin/activate

- On Windows:

  python -m venv venv
  .\venv\Scripts\activate
Install all the required Python packages from the requirements.txt file.
pip install -r requirements.txt

The requirements.txt file lists the following packages:

fastapi
uvicorn
gensim
streamlit
requests
nest_asyncio
pandas
pyarrow
nltk
argparse
You will need to run two processes in two separate terminal windows: the API server and the Streamlit app.
In your first terminal, run the following command to start the Uvicorn server for the API. The --reload flag will automatically restart the server when you make changes to the code.
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

You should see output indicating that the server is running, typically on http://127.0.0.1:8000.
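For orientation, api.py is roughly structured like the sketch below. This is an illustrative reconstruction rather than the exact source: the model filename, the pydantic v1-style validator, and the messages for empty or punctuation-only input are assumptions, while the /similar_words route, the topn=5 behaviour, and the numeric-input and vocabulary error messages mirror the examples in the next section.

```python
# Hypothetical sketch of api.py; the real implementation may differ in details.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator
from gensim.models import Word2Vec

app = FastAPI()

# Assumption: the trained model sits next to api.py as word2vec.model.
model = Word2Vec.load("word2vec.model")


class WordRequest(BaseModel):
    word: str

    @validator("word")
    def validate_word(cls, value):
        value = value.strip()
        if not value:
            raise ValueError("Input must not be empty.")
        if value.isdigit():
            raise ValueError(
                "Numeric-only input is not allowed. Please provide an alphanumeric word."
            )
        if not any(ch.isalnum() for ch in value):
            raise ValueError("Input must contain at least one alphanumeric character.")
        return value


@app.post("/similar_words")
def similar_words(request: WordRequest):
    word = request.word
    if word not in model.wv.key_to_index:
        raise HTTPException(
            status_code=404,
            detail=f"Word '{word}' not found in the vocabulary.",
        )
    # topn=5 gives the "5 most similar words" behaviour described above.
    results = model.wv.most_similar(word, topn=5)
    return {"similar_words": [w for w, _score in results]}
```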
You can interact with the API directly using tools like curl or Postman. The server must be running on http://localhost:8000.
This is an example of a valid request where the word exists in the model's vocabulary.
Request:
curl -X POST "http://localhost:8000/similar_words" \
-H "Content-Type: application/json" \
-d '{"word": "computer"}'Expected Response:
{
"similar_words": [
"user",
"interface",
"system",
"time",
"graph"
]
}

(Note: The exact words will depend on the training data of your Word2Vec model.)
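If you prefer Python over curl, the same call can be made with the requests library, which is already listed in requirements.txt:

```python
import requests

# Assumes the API server is running locally on port 8000.
response = requests.post(
    "http://localhost:8000/similar_words",
    json={"word": "computer"},
    timeout=10,
)
print(response.status_code)  # 200 when the word is in the vocabulary
print(response.json())       # e.g. {"similar_words": ["user", "interface", ...]}
```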
This happens when the word is valid but is not present in the model's vocabulary.
Request:
curl -X POST "http://localhost:8000/similar_words" \
-H "Content-Type: application/json" \
-d '{"word": "hellow"}'Expected Response:
{
"detail": "Word 'hellow' not found in the vocabulary."
}

This error occurs when the input fails the validation rules (e.g., it contains only numbers, punctuation, or is empty).
Request:
curl -X POST "http://localhost:8000/similar_words" \
-H "Content-Type: application/json" \
-d '{"word": "123"}'Expected Response:
{
"detail": [
{
"loc": [
"body",
"word"
],
"msg": "Numeric-only input is not allowed. Please provide an alphanumeric word.",
"type": "value_error"
}
]
}

In a second terminal (ensure your virtual environment is still active), run the following command to launch the Streamlit application.
streamlit run app.py

Streamlit will automatically open a new tab in your web browser. If it doesn't, navigate to the URL shown in the terminal (usually http://localhost:8501).
- Make sure both the FastAPI server and the Streamlit app are running.
- Open your web browser and go to the Streamlit URL (http://localhost:8501).
- Enter a single, alphabetic word into the input box (e.g., "computer", "human", "graph").
- Click the "Find Similar Words" button.
- The results or any relevant error messages will be displayed on the page.
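For reference, app.py might look roughly like the sketch below; it is an illustration, not the project's exact code, and assumes the backend is reachable at http://localhost:8000/similar_words.

```python
# Hypothetical sketch of app.py; the real implementation may differ in details.
import requests
import streamlit as st

API_URL = "http://localhost:8000/similar_words"  # assumed backend address

st.title("Word Similarity Finder")

word = st.text_input("Enter a word (e.g. computer, human, graph):")

if st.button("Find Similar Words"):
    try:
        response = requests.post(API_URL, json={"word": word}, timeout=10)
        if response.status_code == 200:
            st.write("Most similar words:")
            for similar in response.json()["similar_words"]:
                st.write(f"- {similar}")
        else:
            # Surface the API's error detail (word not in vocabulary,
            # or a validation error for invalid input).
            st.error(response.json().get("detail"))
    except requests.exceptions.ConnectionError:
        st.error("Could not reach the API. Is the FastAPI server running?")
```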
The train.py script is designed to process text data that contains a mix of natural language and source code. It uses a 'code-aware' preprocessing pipeline to generate high-quality tokens before training a gensim Word2Vec model.
The preprocessing logic (a rough sketch follows this list) is specifically tuned to:
- Preserve important code entities like function_names, variable_names, and module.method calls.
- Convert natural language to lowercase while maintaining the original case for code.
- Remove common stop words and standalone numbers that add little value.
- Filter out very short words and variables.
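The authoritative implementation is in train.py; the snippet below is only an illustrative sketch of this kind of code-aware tokenization. The regular expression, the stop-word source, and the length threshold are assumptions, not the script's actual values.

```python
# Illustrative sketch of a code-aware tokenizer; train.py's actual rules may differ.
import re

from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

# Matches plain words, snake_case identifiers, and dotted module.method calls;
# standalone numbers never match, so they are dropped automatically.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")


def looks_like_code(token):
    """Heuristic: underscores, dots, or mixed case suggest a code identifier."""
    return "_" in token or "." in token or (token != token.lower() and token != token.upper())


def preprocess(document):
    """Turn one raw document into a list of training tokens."""
    tokens = []
    for token in TOKEN_RE.findall(document):
        if len(token) < 3:             # drop very short words and variables
            continue
        if not looks_like_code(token):
            token = token.lower()      # lowercase natural language, keep code casing
            if token in STOP_WORDS:    # drop common stop words
                continue
        tokens.append(token)
    return tokens
```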
To use this script, your dataset must meet the following criteria:
- File Format: The dataset must be a Parquet file (.parquet).
- Column Name: The data must be loadable into a pandas DataFrame that contains a column named exactly "answer". This column should hold the raw text documents (strings) you want to train the model on.
You can run the script from your terminal. It requires two command-line arguments to specify the locations of your dataset and where to save the final model.

Command-line Arguments
- --dataset-path (Required): The full path to your input .parquet dataset file.
- --model-path (Required): The full path where the trained model file will be saved. It's conventional to name this file word2vec.model.
Example Usage
Here is an example command to run the script. Make sure you have activated your Python environment and installed the necessary dependencies.
python train.py --dataset-path /path/to/your/tech_dataset.parquet --model-path ./word2vec.model

After running, the script will log its progress to the console, and upon completion, the trained word2vec.model file will be available at the specified output path.
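For orientation, the overall flow of train.py is roughly the sketch below. The preprocess function stands in for the code-aware tokenizer sketched in the preprocessing section, and the Word2Vec hyperparameters are placeholders rather than the script's actual settings.

```python
# Rough outline of train.py's flow; hyperparameters are illustrative only.
import argparse
import logging

import pandas as pd
from gensim.models import Word2Vec

logging.basicConfig(level=logging.INFO)


def preprocess(text):
    """Placeholder for the code-aware tokenizer sketched earlier in this README."""
    return text.split()


def main():
    parser = argparse.ArgumentParser(description="Train a Word2Vec model on a Parquet dataset.")
    parser.add_argument("--dataset-path", required=True, help="Path to the input .parquet file")
    parser.add_argument("--model-path", required=True, help="Where to save the trained model")
    args = parser.parse_args()

    # The dataset must contain an "answer" column holding the raw text documents.
    df = pd.read_parquet(args.dataset_path)
    sentences = [preprocess(text) for text in df["answer"].astype(str)]

    # Placeholder hyperparameters; the real script may use different values.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, workers=4)
    model.save(args.model_path)
    logging.info("Model saved to %s", args.model_path)


if __name__ == "__main__":
    main()
```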