## Text Distance

Text distance (also called string distance) is a quantitative measure of how different two strings are from each other. The best-known family, edit distance, counts the minimum number of single-character operations needed to transform one string into another, which helps us determine the lexical similarity between words. String similarity is the inverse view: the smaller the distance, the higher the similarity.

### How It Works

The most common text distance algorithm is **Levenshtein distance**, which counts three types of single-character edits:
- **Insertions**: Adding a character (`hos` → `hose`)
- **Deletions**: Removing a character (`house` → `hose`)
- **Substitutions**: Replacing a character (`hose` → `home`)

For example, the distance between `hose` and `house` is 1 (one insertion of 'u'), indicating high similarity and likely a typo.
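To make the counting concrete, here is a minimal pure-Python sketch of Levenshtein distance using the standard dynamic-programming approach. The function name `levenshtein` is chosen for illustration only and is not part of any library; it keeps just two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] = distance between the already-processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("hose", "house"))      # 1
print(levenshtein("kitten", "sitting"))  # 3
```

Each cell only depends on its left, upper, and upper-left neighbours, which is why two rows are enough.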

### Common Algorithms

- **Levenshtein Distance**: Counts insertions, deletions, and substitutions equally
- **Damerau-Levenshtein**: Also accounts for transpositions (`teh` → `the`)
- **Hamming Distance**: Measures substitutions only (strings must be same length)
- **Jaro-Winkler**: A similarity score (0 to 1) based on matching characters and transpositions, with a bonus for a shared prefix (good for names)
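The difference between these measures is easiest to see on a swapped pair of letters. The sketch below is a pure-Python illustration, not any library's implementation: `hamming` follows the classical equal-length definition, and `osa_distance` is the *optimal string alignment* variant of Damerau-Levenshtein (a restricted form that is simpler to write; the unrestricted algorithm can return smaller values for some inputs). Both function names are made up for this example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance that also allows swapping two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions
    for j in range(cols):
        d[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "eh" -> "he" counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(hamming("teh", "the"))       # 2 (two positions differ)
print(osa_distance("teh", "the"))  # 1 (one transposition)
```

Plain Levenshtein would also report 2 for `teh` vs `the`, because it has to model the swap as two substitutions.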

### Practical Applications

1. **Spell Checkers**: Detecting typos like `recieve` → `receive` (Levenshtein distance: 2; Damerau-Levenshtein distance: 1, because `ie` → `ei` is a single transposition)
2. **Search Engines**: Finding results even with misspelled queries
3. **Autocomplete**: Suggesting corrections as users type
4. **Data Deduplication**: Identifying similar records in databases (`John Smith` vs `Jon Smith`)
5. **DNA Sequencing**: Comparing genetic sequences for mutations
6. **Plagiarism Detection**: Finding similar text passages

### Example Comparisons

- `cat` vs `bat`: Distance = 1 (one substitution)
- `kitten` vs `sitting`: Distance = 3 (substitute k→s, e→i, insert g)
- `hello` vs `helo`: Distance = 1 (one deletion)
- `algorithm` vs `altorithm`: Distance = 1 (one substitution, g→t)

### Implementation Considerations

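- **Performance**: Computing Levenshtein distance with the standard dynamic-programming algorithm takes O(m×n) time, where m and n are the string lengths
- **Threshold**: Set a maximum distance for matches (e.g., ≤2 for typo detection)
- **Normalization**: Distances are often divided by the length of the longer string to get a 0-1 score, with similarity being 1 minus that value
- **Case Sensitivity**: Comparisons are case-sensitive, so strings are usually converted to lowercase first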
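The considerations above can be combined in a few lines. This is a hedged sketch rather than a recommended implementation: the helper names `similarity` and `within_threshold` are hypothetical, the threshold of 2 is just the example value from the list, and normalizing by the longer string's length is one common convention among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(a, b) / longest


def within_threshold(a: str, b: str, max_distance: int = 2) -> bool:
    """True if the case-insensitive distance does not exceed max_distance."""
    return levenshtein(a.casefold(), b.casefold()) <= max_distance


print(within_threshold("recieve", "Receive"))  # True
print(within_threshold("cat", "house"))        # False
```

For long strings or large vocabularies, a compiled library is usually a better choice than this quadratic pure-Python loop.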

-------

There are multiple Python libraries that help us use various text distance methods, including `textdistance`, `python-Levenshtein`, and `fuzzywuzzy`. For this tutorial, we will use **`jellyfish`** because it offers a clean API, good performance (the algorithms are compiled: written in C in older releases and in Rust since version 1.0), and support for a wide range of phonetic and distance algorithms.

### Installation
```bash
pip install jellyfish
```

### Why Jellyfish?

- **Fast Performance**: Core algorithms are compiled for speed (C in older releases, Rust since version 1.0)
- **Comprehensive**: Includes both edit distance and phonetic matching
- **Zero Dependencies**: Lightweight with no external requirements
- **Well-Maintained**: Active development and good documentation
- **Easy to Use**: Simple, intuitive function calls

### Supported Distance Methods

**Distance and Similarity Functions:**
- `levenshtein_distance()` - Classic edit distance
- `damerau_levenshtein_distance()` - Includes transpositions
- `hamming_distance()` - For equal-length strings
- `jaro_similarity()` - Similarity score (0-1)
- `jaro_winkler_similarity()` - Jaro with prefix bonus

**Phonetic Algorithms** (sound-based matching):
- `soundex()` - Classic phonetic algorithm
- `metaphone()` - Improved phonetic encoding
- `nysiis()` - New York State Identification and Intelligence System phonetic code
- `match_rating_codex()` - Codex from the Match Rating Approach, designed for name matching

> For more details and advanced usage, see the [official documentation](https://jellyfish.readthedocs.io/en/latest/comparison.html).
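To give a feel for what "sound-based matching" means, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration written for this tutorial, not jellyfish's implementation, and the function name `soundex_sketch` is hypothetical; in practice you would call the library function instead.

```python
# Consonant groups that share a Soundex digit; vowels, h, w, and y have no digit
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex_sketch(name: str) -> str:
    """Return a 4-character code: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated digit is kept
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


print(soundex_sketch("Robert"))  # R163
print(soundex_sketch("Rupert"))  # R163
```

Names that sound alike receive the same code even when their spelling differs by more than a small edit distance, which is what makes phonetic codes a useful complement to the distance functions above.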

--------------------

### Levenshtein Distance

```python
import jellyfish

jellyfish.levenshtein_distance("jellyfish", "sellyfihs")
```

Output: `3`

-------------

### Hamming Distance

```python
jellyfish.hamming_distance("mohamed", "mohamef")
```

Output: `1`

---------------

### Damerau-Levenshtein Distance

```python
jellyfish.damerau_levenshtein_distance("mohamed zahran", "mohamd zahran")
```

Output: `1`

------------

> ## `Great Job`

-------