# **Notes**

### **Project Overview**

* **Goal:** Implement text detection using three different OCR technologies and compare their performance using a specific similarity metric.
* **Environment:** The project is executed in **Google Colab**, utilizing data stored in **Google Drive**.

### **Data Preparation**

* **Dataset:** A custom dataset comprising images containing text (e.g., "Collect moments not things").
* **Labeling:** The filename of each image corresponds exactly to the text written in the image (ground truth).
* **Source:** Images were sourced from Pexels.
* **Alternatives:** While complex datasets like **COCO-Text** exist, this tutorial uses simpler images with clear text to make the metric comparison more straightforward.
* **Setup:**
    * Data is zipped and uploaded to Google Drive.
    * Drive is mounted in Colab, and the data is unzipped for local access.
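
The setup steps above might look like the following in a Colab cell. This is a minimal sketch, not the original notebook code: the helper names and the paths in the usage comment are hypothetical, and the Drive mount is wrapped in a `try` so the helpers can also be imported outside Colab.

```python
import zipfile
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def unzip_dataset(zip_path, dest_dir):
    """Extract the dataset archive and return the extracted file paths, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


# Hypothetical usage in Colab (both paths are placeholders):
# mount_drive()
# images = unzip_dataset("/content/drive/MyDrive/data.zip", "/content/data")
```

Unzipping to the Colab VM's local disk rather than reading each image from the mounted Drive usually makes repeated reads faster.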



### **Dependencies**

* The following Python libraries are installed:
    * `pytesseract` (Python wrapper for Tesseract; the Tesseract engine itself must be installed separately)
    * `easyocr`
    * `boto3` (for AWS Textract)
    * `pillow` (image processing)



### **1. Tesseract OCR Implementation**

* **Wrapper:** `pytesseract`, a Python wrapper that calls the Tesseract engine under the hood.
* **Method:** `pytesseract.image_to_string(image, lang='eng')`.
* **Observation:** The out-of-the-box performance on raw scene images was very poor.
* **Note:** Tesseract generally requires significant image preprocessing and configuration tuning to work well, but this experiment tested it on raw input intentionally.
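
A minimal sketch of the call described above, assuming `pytesseract`, `pillow`, and the Tesseract binary are installed. The function name `ocr_with_tesseract` is hypothetical, and no preprocessing is applied, matching the raw-input setup of this experiment.

```python
def ocr_with_tesseract(image_path, lang="eng"):
    """Run Tesseract on one image file and return the raw recognized text."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        # No preprocessing on purpose: this mirrors the raw-input experiment.
        return pytesseract.image_to_string(image, lang=lang)


# Hypothetical usage (the path is a placeholder):
# text = ocr_with_tesseract("/content/data/Collect moments not things.jpg")
```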

### **2. EasyOCR Implementation**

* **Method:**
    * Initialize a `Reader` object for English (`['en']`).
    * Call `reader.readtext(image_path)`.
* **Output:** Returns a list of results, one per detected text region, each containing the bounding box, detected text, and confidence score.
* **Parsing:** The code iterates through the results to concatenate detected words into a single string.
* **Observation:** Performed significantly better than raw Tesseract but still had minor inaccuracies.
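
The method and parsing steps above can be sketched as follows. The helper names are hypothetical; the sketch assumes the default `readtext` output, where each result is a `(bounding_box, text, confidence)` tuple.

```python
def join_easyocr_text(results):
    """Concatenate the text field of EasyOCR results into one string."""
    # Each result is a (bounding_box, text, confidence) tuple.
    return " ".join(text for _bbox, text, _confidence in results)


def ocr_with_easyocr(image_path, reader=None):
    """Read one image with EasyOCR; build an English reader if none is given."""
    if reader is None:
        import easyocr  # lazy import: loading the models is slow

        reader = easyocr.Reader(["en"])
    return join_easyocr_text(reader.readtext(image_path))
```

When looping over many images, creating the `Reader` once and passing it in avoids reloading the models on every call.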

### **3. AWS Textract Implementation**

* **AWS Setup:**
    * Create an **IAM User** with the `AmazonTextractFullAccess` policy.
    * Generate an **Access Key** and a **Secret Access Key** (security warning: keep these private).
    * *Note:* This service is not free; it is pay-per-use.
* **Code:**
    * Use `boto3.client('textract')` with credentials and region (e.g., `us-east-1`).
    * Call `client.detect_document_text`, passing the image bytes.
* **Parsing:** Iterate through the response blocks; if the `BlockType` is `'LINE'`, append the text.
* **Observation:** Provided the most accurate raw detections among the three.
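
A rough sketch of the Textract call and the `LINE` filtering described above. The helper names are hypothetical. As an assumption that differs from passing keys directly in code, this version lets `boto3` pick up credentials from the environment (for example `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`), which keeps the keys out of the notebook.

```python
def make_textract_client(region_name="us-east-1"):
    """Create a Textract client; credentials are read from the environment."""
    import boto3  # lazy import so the parsing helper works without boto3

    return boto3.client("textract", region_name=region_name)


def lines_from_textract(response):
    """Join the text of every LINE block in a Textract response."""
    return " ".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def ocr_with_textract(image_path, client):
    """Send one image's bytes to Textract and return the detected lines."""
    with open(image_path, "rb") as handle:
        image_bytes = handle.read()
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_from_textract(response)
```

Filtering on `LINE` matters because the response also contains `PAGE` and `WORD` blocks; including `WORD` blocks as well would repeat every word.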

### **Performance Comparison Methodology**

* **Metric:** **Jaccard Similarity Index**.
    * Logic: (Intersection of words) / (Union of words).
    * Measures how many unique words the prediction and ground truth share, relative to the total number of unique words across both.
    * A function was generated using ChatGPT to compute this metric.
* **Preprocessing for Evaluation:**
    * Both ground truth (filename) and predictions were converted to **lowercase**.
    * Special characters (periods, question marks, exclamation points, newlines) were removed to ensure fair comparison.
* **Execution:** The script iterates through all 100 images, calculates the score for each tool, and computes the average.
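
The preprocessing and averaging steps above could be sketched as below. All function names are hypothetical, and `engines` is assumed to be a dict mapping a tool name to a callable that takes an image path and returns text. One labeled assumption: newlines are replaced with a space rather than deleted outright, so that words on separate lines do not fuse; the original script may have handled this differently.

```python
import os


def ground_truth_from_path(image_path):
    """The filename without its extension is the ground-truth text."""
    return os.path.splitext(os.path.basename(image_path))[0]


def normalize(text):
    """Lowercase, drop periods/question marks/exclamation points, tidy spacing."""
    text = text.lower()
    for char in ".?!":
        text = text.replace(char, "")
    # Assumption: newlines become spaces so words on separate lines stay apart.
    return " ".join(text.split())


def jaccard(a, b):
    """Word-level Jaccard similarity (same logic as the Jaccard Similarity section)."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def average_scores(image_paths, engines):
    """Return {engine_name: mean Jaccard score} over all images."""
    totals = {name: 0.0 for name in engines}
    for path in image_paths:
        truth = normalize(ground_truth_from_path(path))
        for name, run_ocr in engines.items():
            totals[name] += jaccard(truth, normalize(run_ocr(path)))
    count = len(image_paths)
    return {name: (total / count if count else 0.0) for name, total in totals.items()}
```

Normalizing both sides with the same function is what keeps the comparison fair: a prediction is not penalized for casing or trailing punctuation that the filename lacks.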

### **Final Results**

The average Jaccard Similarity scores for this specific dataset (raw scene text without preprocessing) were:

* **Tesseract:** ~0.01 (Lowest performance on raw data).
* **EasyOCR:** ~0.21 (Moderate performance).
* **AWS Textract:** ~0.34 (Highest performance).

**Conclusion:** For this specific experiment using raw images without preprocessing, **AWS Textract** outperformed the open-source alternatives. Tesseract, however, is reported to perform well when images are properly preprocessed, which this experiment did not test.

---

## **Jaccard Similarity**

```python
def jaccard_similarity(sentence1, sentence2):
    # Tokenize sentences into sets of words
    set1 = set(sentence1.lower().split())
    set2 = set(sentence2.lower().split())

    # Calculate Jaccard similarity
    intersection_size = len(set1.intersection(set2))
    union_size = len(set1.union(set2))

    # Avoid division by zero if both sets are empty
    similarity = intersection_size / union_size if union_size != 0 else 0.0

    return similarity

# Example usage:
sentence1 = "This is a sample sentence"
sentence2 = "Sample sentence for testing"

similarity = jaccard_similarity(sentence1, sentence2)
print(f"Jaccard Similarity: {similarity}")

```

### How the answer (0.2857...) was calculated

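The Jaccard Similarity formula is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Here A and B are the sets of unique lowercased words in each sentence.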

1. **The Sets (Unique Words):**
    * **Sentence 1:** `{'this', 'is', 'a', 'sample', 'sentence'}` (5 words)
    * **Sentence 2:** `{'sample', 'sentence', 'for', 'testing'}` (4 words)
2. **Intersection (Shared Words):**
    * The words appearing in **both** sets are `'sample'` and `'sentence'`.
    * **Count = 2**
3. **Union (Total Unique Words):**
    * Combining both sets and removing duplicates: `{'this', 'is', 'a', 'sample', 'sentence', 'for', 'testing'}`.
    * **Count = 7**
4. **The Math:** $$\frac{2}{7} \approx 0.285714...$$

