SciPy provides a wide range of distance metrics for computing pairwise distances between points in a dataset, primarily through the `scipy.spatial.distance` module. Here's an overview of the different types of distances available:

---

### 1. **Euclidean and Related Metrics**
- **Euclidean distance**: 
  - Straight-line distance between two points in Euclidean space.
  - Formula: \(\sqrt{\sum (x_i - y_i)^2}\)
  - Metric: `'euclidean'`

- **Squared Euclidean distance**:
  - Sum of squared differences (without taking the square root).
  - Metric: `'sqeuclidean'`

- **Chebyshev distance** (Maximum distance):
  - Maximum absolute difference along any dimension.
  - Formula: \(\max(|x_i - y_i|)\)
  - Metric: `'chebyshev'`
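
As a minimal sketch of the three formulas above, the following computes each distance for two made-up points (the values of `x` and `y` are assumptions chosen so the arithmetic is easy to check by hand):

```python
import numpy as np
from scipy.spatial.distance import chebyshev, euclidean, sqeuclidean

# Two illustrative points (assumed values, not taken from any dataset)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

d_euc = euclidean(x, y)     # sqrt(3**2 + 4**2 + 0**2) = 5.0
d_sq = sqeuclidean(x, y)    # 3**2 + 4**2 + 0**2 = 25.0
d_cheb = chebyshev(x, y)    # max(3, 4, 0) = 4.0

print(d_euc, d_sq, d_cheb)
```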

---

### 2. **Manhattan and Minkowski Metrics**
- **Manhattan (Cityblock) distance**:
  - Sum of absolute differences between points.
  - Formula: \(\sum |x_i - y_i|\)
  - Metric: `'cityblock'`

- **Minkowski distance**:
  - Generalization of Euclidean and Manhattan distances.
  - Formula: \((\sum |x_i - y_i|^p)^{1/p}\)
  - Metric: `'minkowski'`
  - Parameter: \(p\) (power parameter).
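
A short sketch of how \(p\) connects the two special cases, using assumed example vectors. With `pdist`, the power parameter is passed as the keyword argument `p`:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, minkowski, pdist

# Illustrative vectors (assumed values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

d_p1 = minkowski(x, y, p=1)   # same as cityblock(x, y): 3 + 4 + 0 = 7.0
d_p2 = minkowski(x, y, p=2)   # same as euclidean(x, y): 5.0
d_p3 = minkowski(x, y, p=3)   # (3**3 + 4**3) ** (1/3)

# pdist forwards p to the metric
d_pair = pdist(np.vstack([x, y]), metric='minkowski', p=3)

print(d_p1, d_p2, d_p3, d_pair)
```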

---

### 3. **Cosine Distance**
- One minus the cosine of the angle between two vectors, so it depends on direction only and not on magnitude.
- Formula: \(1 - \frac{x \cdot y}{\|x\| \|y\|}\)
- Metric: `'cosine'`
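
The following sketch, with assumed example vectors, illustrates the range of the cosine distance: about 0 for vectors pointing the same way (whatever their length), 1 for orthogonal vectors, and 2 for opposite vectors:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Illustrative vector (assumed values)
x = np.array([1.0, 2.0, 3.0])

d_scaled = cosine(x, 10 * x)   # same direction, different magnitude
d_orth = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # orthogonal
d_opp = cosine(x, -x)          # opposite direction

print(d_scaled, d_orth, d_opp)
```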

---

### 4. **Correlation Distance**
- Measures dissimilarity based on correlation between vectors.
- Formula: \(1 - \text{correlation}(x, y)\)
- Metric: `'correlation'`
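
As a sketch of what the formula means in practice, the correlation distance can be checked against two equivalent computations: one minus the Pearson correlation coefficient, and the cosine distance of the mean-centered vectors. The vectors below are assumed example values:

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine

# Illustrative vectors (assumed values)
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 6.0])

d_corr = correlation(x, y)

# Equivalent 1: one minus the Pearson correlation coefficient
d_pearson = 1.0 - np.corrcoef(x, y)[0, 1]

# Equivalent 2: cosine distance after subtracting each vector's mean
d_centered = cosine(x - x.mean(), y - y.mean())

print(d_corr, d_pearson, d_centered)
```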

---

### 5. **Hamming Distance**
- Fraction of elements that differ between two vectors.
- Suitable for binary or categorical data.
- Formula: \(\frac{\text{Number of differing elements}}{\text{Total number of elements}}\)
- Metric: `'hamming'`

---

### 6. **Jaccard Distance**
- Measures dissimilarity between two sets, which SciPy represents as boolean vectors.
- Formula: \(1 - \frac{|x \cap y|}{|x \cup y|}\)
- Metric: `'jaccard'`
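
A minimal sketch of the set interpretation, assuming two sets encoded as boolean membership vectors over a five-item universe (the values are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Membership vectors over a 5-item universe (illustrative values)
a = np.array([True, True, False, False, True])
b = np.array([True, False, True, False, True])

d_jac = jaccard(a, b)

# Manual check with set-style counts
intersection = np.sum(a & b)            # items in both sets: 2
union = np.sum(a | b)                   # items in either set: 4
manual = 1.0 - intersection / union     # 1 - 2/4 = 0.5

print(d_jac, manual)
```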

---

### 7. **Bray-Curtis Distance**
- Measures dissimilarity based on the sum of absolute differences.
- Formula: \(\frac{\sum |x_i - y_i|}{\sum |x_i + y_i|}\)
- Metric: `'braycurtis'`

---

### 8. **Mahalanobis Distance**
- Distance that accounts for the correlation between variables.
- Formula: \(\sqrt{(x - y)^T \Sigma^{-1} (x - y)}\), where \(\Sigma\) is the covariance matrix.
- Metric: `'mahalanobis'`
- Parameter: `VI`, the inverse covariance matrix \(\Sigma^{-1}\).
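
The sketch below shows one way to supply the inverse covariance matrix, which SciPy expects as the argument `VI`. The data is randomly generated purely for illustration, and estimating \(\Sigma\) with `np.cov` is an assumption about how the covariance would be obtained:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis, pdist, squareform

# Synthetic correlated 2-D data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])

# Inverse of the sample covariance matrix (features in columns)
VI = np.linalg.inv(np.cov(X, rowvar=False))

d = mahalanobis(X[0], X[1], VI)

# Manual check against the formula sqrt((x - y)^T VI (x - y))
diff = X[0] - X[1]
manual = np.sqrt(diff @ VI @ diff)

# Pairwise version: VI is passed as a keyword argument
D = squareform(pdist(X, metric='mahalanobis', VI=VI))

print(d, manual, D[0, 1])
```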

---

### 9. **Canberra Distance**
- Weighted sum of absolute differences.
- Formula: \(\sum \frac{|x_i - y_i|}{|x_i| + |y_i|}\)
- Metric: `'canberra'`
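
To see why the per-coordinate normalization matters, the sketch below uses assumed values where both coordinates differ by exactly 1, once near zero and once near 100. The same absolute difference contributes far more when the values themselves are small:

```python
import numpy as np
from scipy.spatial.distance import canberra

# Both coordinates differ by 1, but at very different magnitudes (assumed values)
x = np.array([1.0, 100.0])
y = np.array([2.0, 101.0])

# Per-coordinate terms |x_i - y_i| / (|x_i| + |y_i|)
terms = np.abs(x - y) / (np.abs(x) + np.abs(y))   # [1/3, 1/201]

d_can = canberra(x, y)   # sum of the terms above

print(terms, d_can)
```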

---

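### 10. **Other Specialized Metrics**
- **Chi-square distance** (for non-negative data):
  - Formula: \(\sum \frac{(x_i - y_i)^2}{x_i + y_i}\)
  - Not a built-in metric name in `scipy.spatial.distance`; pass a custom callable as `metric` to `pdist` or `cdist` instead.
  
- **Kulsinski distance**: Boolean dissimilarity metric (`'kulsinski'`). It is deprecated in newer SciPy releases, so check your installed version before relying on it.
- **Rogers-Tanimoto distance**: Boolean dissimilarity metric (`'rogerstanimoto'`) that weights mismatches twice as heavily as matches.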
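
A minimal sketch of a chi-square distance supplied as a custom callable, on the assumption that no built-in metric string for it is available in `scipy.spatial.distance`. The function name `chi_square` and the histogram values are hypothetical; bins where both values are zero are skipped to avoid dividing by zero:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def chi_square(x, y):
    """Chi-square distance sum((x_i - y_i)^2 / (x_i + y_i)) for non-negative vectors."""
    denom = x + y
    mask = denom > 0   # skip bins that are empty in both vectors
    return np.sum((x[mask] - y[mask]) ** 2 / denom[mask])

# Three small histograms (illustrative values)
H = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 2.0],
    [0.0, 0.0, 6.0],
])

# pdist accepts any callable that takes two 1-D arrays and returns a number
D = squareform(pdist(H, metric=chi_square))
print(D)
```

A Python callable is much slower than the built-in string metrics, so this approach is best kept for small datasets.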

---

The choice of distance metric depends on the type of data, its properties, and the application.

---

### 1. **Euclidean Distance**
- **When to Use**: 
  - For continuous numerical data where the magnitude of the difference matters.
  - Ideal when features are on the same scale or after normalization.
- **Applications**:
  - Clustering (e.g., K-Means).
  - Nearest neighbor search.

---

### 2. **Squared Euclidean Distance**
- **When to Use**: 
  - When computational efficiency is preferred, as it avoids the square root operation.
  - For algorithms that care more about relative distances than actual values.
- **Applications**:
  - Multidimensional scaling.
  - Certain machine learning algorithms (e.g., kernel methods).

---

### 3. **Chebyshev Distance**
- **When to Use**: 
  - For problems where the maximum difference in any dimension matters more than the total distance.
  - Useful in chessboard-like environments.
- **Applications**:
  - Warehouse logistics.
  - Grid-based games.

---

### 4. **Manhattan (Cityblock) Distance**
- **When to Use**:
  - For continuous or ordinal data when the absolute differences are meaningful.
  - In high-dimensional spaces where Euclidean distance becomes less effective.
- **Applications**:
  - L1-regularized models (e.g., Lasso).
  - Movement costs in grid-based pathfinding.

---

### 5. **Minkowski Distance**
- **When to Use**:
  - Generalization of Euclidean and Manhattan distances.
  - Use for datasets where you can tune the parameter \(p\) to balance between Manhattan (\(p=1\)) and Euclidean (\(p=2\)) distances.
- **Applications**:
  - Algorithms where flexible distance definitions are beneficial.

---

### 6. **Cosine Distance**
- **When to Use**:
  - For high-dimensional, sparse data where direction matters more than magnitude.
  - For measuring similarity rather than spatial distance.
- **Applications**:
  - Text similarity (e.g., TF-IDF vectors).
  - Recommendation systems.

---

### 7. **Correlation Distance**
- **When to Use**:
  - For datasets where you want to measure how patterns correlate, regardless of magnitude.
  - Equivalent to cosine distance computed on mean-centered vectors, so it ignores both a constant offset and the overall scale.
- **Applications**:
  - Time-series analysis.
  - Genomic data comparisons.

---

### 8. **Hamming Distance**
- **When to Use**:
  - For binary or categorical data.
  - Measures the fraction of attributes that differ (SciPy normalizes the count by the vector length).
- **Applications**:
  - Error correction (e.g., binary strings or DNA sequences).
  - Comparing categorical labels.

---

### 9. **Jaccard Distance**
- **When to Use**:
  - For binary or set-based data.
  - To measure dissimilarity between two sets.
- **Applications**:
  - Document comparison.
  - Image segmentation.

---

### 10. **Bray-Curtis Distance**
- **When to Use**:
  - For non-negative data where relative differences matter.
  - Sensitive to small differences in low values.
- **Applications**:
  - Ecology (e.g., species abundance data).
  - Proportion-based datasets.

---

### 11. **Mahalanobis Distance**
- **When to Use**:
  - For datasets with correlated features.
  - Accounts for the covariance structure of the data.
- **Applications**:
  - Outlier detection.
  - Multivariate analysis.

---

### 12. **Canberra Distance**
- **When to Use**:
  - For data with varying scales.
  - Penalizes small differences more strongly.
- **Applications**:
  - Comparing profiles with small values (e.g., spectrograms).
  - Environmental science data.

---

### 13. **Chi-Square Distance**
- **When to Use**:
  - For non-negative data or frequencies.
  - Highlights differences in small counts.
- **Applications**:
  - Histograms in computer vision.
  - Goodness-of-fit tests.

---

### Summary Table for Quick Reference

| **Metric**          | **Best for**                          | **Data Type**      |
|----------------------|---------------------------------------|--------------------|
| Euclidean           | Geometric similarity                  | Continuous         |
| Squared Euclidean   | Relative distances, efficiency        | Continuous         |
| Chebyshev           | Max deviation, grid-based problems    | Continuous         |
| Manhattan           | Absolute differences                  | Continuous/Ordinal |
| Minkowski           | Generalization of Euclidean/Manhattan | Continuous         |
| Cosine              | Vector similarity                     | Sparse/High-dim.   |
| Correlation         | Pattern relationships                 | Continuous         |
| Hamming             | Binary differences                    | Binary/Categorical |
| Jaccard             | Set similarity                        | Binary/Set-based   |
| Bray-Curtis         | Relative abundances                   | Non-negative       |
| Mahalanobis         | Correlated features                   | Continuous         |
| Canberra            | Small differences in varying scales   | Positive values    |
| Chi-Square          | Frequency comparisons                 | Non-negative       |

### Key Considerations:
- Normalize or standardize your data if the metric is sensitive to scale (e.g., Euclidean, Manhattan).
- For categorical data, one-hot encoding or other preprocessing steps may be necessary.
- Sparse or high-dimensional data often benefits from cosine or correlation distance.
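
The first consideration can be illustrated with a small sketch. The two features below (a height in metres and an income in currency units) are assumed values chosen so that one feature dwarfs the other. Before standardization the Euclidean distances are determined almost entirely by the large-scale feature:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Columns: height in metres, income in currency units (assumed values)
X = np.array([
    [1.70, 60000.0],
    [1.80, 61000.0],
    [1.60, 60500.0],
])

# Raw distances: the income column dominates, height barely matters
d_raw = pdist(X, metric='euclidean')

# Z-score standardization: zero mean and unit variance per column
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = pdist(Xz, metric='euclidean')

print(d_raw)
print(d_std)
```

In the raw data, row 2 looks equally far from rows 0 and 1 because only the income gap counts. After standardization the height difference contributes as well, and the two distances separate.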

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder 
from sklearn.metrics import pairwise_distances

from scipy.spatial.distance import hamming, euclidean, pdist, squareform

## Soybean (Small) dataset

A dataset of soybean plant observations, including information on plants infested by one of four diseases.

- Dataset contains 47 instances.
- Each instance represents a single plant.
- Characterized by 35 attributes.
- Attributes are categorical.
- Attributes mostly capture various symptoms like leaf spots, root rot, mold growth, seed damage, etc.

Citation:

- Michalski, R. (1987). Soybean (Small). UCI Machine Learning Repository. https://doi.org/10.24432/C5DS3P.

First 3 features:

1. **Date**: The time when the soybean sample was collected. May be represented as a date or as the day of the year.
  
2. **Hail**: Indicates whether the plants have been affected by hail, generally a binary "yes" or "no."

3. **Germination**: Describes the rate of germination.

In [3]:
soybean_path = 'soybean_data_use.csv'

soy_df = pd.read_csv(soybean_path)

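# Keep the first three features for rows 0 through 10.
# Label-based .loc slicing includes the end label, so this selects 11 rows.
soy_df = soy_df.loc[:10, ['date', 'hail', 'germination']]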

soy_df

Unnamed: 0,date,hail,germination
0,august,no,lt-80%
1,september,yes,lt-80%
2,july,yes,80-89%
3,october,yes,90-100%
4,august,yes,lt-80%
5,september,yes,90-100%
6,july,yes,80-89%
7,july,yes,lt-80%
8,october,yes,80-89%
9,october,yes,lt-80%


In [4]:
soy_df['date'].unique()

array(['august', 'september', 'july', 'october'], dtype=object)

In [5]:
soy_df['hail'].unique()

array(['no', ' yes'], dtype=object)

In [6]:
soy_df['germination'].unique()

array(['lt-80%', '80-89%', ' 90-100%'], dtype=object)

## Calculating Hamming Distance

In [7]:
soy_df.loc[:1, :]

Unnamed: 0,date,hail,germination
0,august,no,lt-80%
1,september,yes,lt-80%


In [8]:
hamming(soy_df.loc[0].to_numpy(), soy_df.loc[1].to_numpy())

np.float64(0.6666666666666666)
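
The value above is a fraction, not a count. As a sketch, the same result can be reproduced by hand and converted back to the number of differing attributes. The two rows are retyped from the table above rather than read from the CSV file:

```python
import numpy as np
from scipy.spatial.distance import hamming

# Rows 0 and 1 of the table above, retyped by hand
row0 = np.array(['august', 'no', 'lt-80%'], dtype=object)
row1 = np.array(['september', 'yes', 'lt-80%'], dtype=object)

frac = hamming(row0, row1)        # fraction of positions that differ
manual = np.mean(row0 != row1)    # same quantity computed directly

# Multiply by the number of attributes to recover the count of mismatches
n_diff = int(round(frac * len(row0)))

print(frac, manual, n_diff)
```

Here `date` and `hail` differ while `germination` matches, giving 2 mismatches out of 3 attributes.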

In [10]:
from sklearn.preprocessing import LabelEncoder
from scipy.spatial.distance import pdist, squareform
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'hail': ['Yes', 'No', 'Yes'],
    'germination': [95, 90, 85]
})

# Apply LabelEncoder to each column
encoded_df = df.apply(lambda col: LabelEncoder().fit_transform(col.astype(str)))

# Compute pairwise Hamming distances
dst = pdist(encoded_df.to_numpy(), metric='hamming')
dst_matrix = squareform(dst)

# Convert to DataFrame
distance_df = pd.DataFrame(dst_matrix)
print(distance_df)


          0    1         2
0  0.000000  1.0  0.666667
1  1.000000  0.0  1.000000
2  0.666667  1.0  0.000000


In [9]:
dst = pdist(soy_df.to_numpy(), metric='hamming')
dst_matrix = squareform(dst)
pd.DataFrame(dst_matrix)

ValueError: Unsupported dtype object
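
`pdist` needs a numeric array, which is why the call above fails on strings. Encoding the columns is one way around this. Another option, sketched here for small data only, is to compare the raw values directly with NumPy broadcasting. The three rows are retyped from the table above, and the approach builds an intermediate array of shape `(n, n, n_features)`, so it is an illustration rather than a replacement for `pdist`:

```python
import numpy as np

# First three rows of the table above, retyped by hand
vals = np.array([
    ['august', 'no', 'lt-80%'],
    ['september', 'yes', 'lt-80%'],
    ['july', 'yes', '80-89%'],
], dtype=object)

# Compare every row with every other row, attribute by attribute
mismatch = vals[:, None, :] != vals[None, :, :]   # boolean, shape (3, 3, 3)

# Averaging over the attribute axis gives the normalized Hamming distance
D = mismatch.mean(axis=2)
print(D)
```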

In [11]:
soy_df

Unnamed: 0,date,hail,germination
0,august,no,lt-80%
1,september,yes,lt-80%
2,july,yes,80-89%
3,october,yes,90-100%
4,august,yes,lt-80%
5,september,yes,90-100%
6,july,yes,80-89%
7,july,yes,lt-80%
8,october,yes,80-89%
9,october,yes,lt-80%


In [12]:
or_encoder = OrdinalEncoder() 
soy_df_enc = or_encoder.fit_transform(soy_df)
soy_df_enc

array([[0., 1., 2.],
       [3., 0., 2.],
       [1., 0., 1.],
       [2., 0., 0.],
       [0., 0., 2.],
       [3., 0., 0.],
       [1., 0., 1.],
       [1., 0., 2.],
       [2., 0., 1.],
       [2., 0., 2.],
       [2., 1., 0.]])

In [13]:
dst = pdist(soy_df_enc, metric='hamming')
dst

array([0.66666667, 1.        , 1.        , 0.33333333, 1.        ,
       1.        , 0.66666667, 1.        , 0.66666667, 0.66666667,
       0.66666667, 0.66666667, 0.33333333, 0.33333333, 0.66666667,
       0.33333333, 0.66666667, 0.33333333, 1.        , 0.66666667,
       0.66666667, 0.66666667, 0.        , 0.33333333, 0.33333333,
       0.66666667, 1.        , 0.66666667, 0.33333333, 0.66666667,
       0.66666667, 0.33333333, 0.33333333, 0.33333333, 0.66666667,
       0.66666667, 0.33333333, 0.66666667, 0.33333333, 1.        ,
       0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
       0.33333333, 0.33333333, 0.66666667, 1.        , 0.66666667,
       0.33333333, 1.        , 0.33333333, 0.66666667, 0.66666667])

In [14]:
dst_matrix = squareform(dst)
pd.DataFrame(dst_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,0.666667,1.0,1.0,0.333333,1.0,1.0,0.666667,1.0,0.666667,0.666667
1,0.666667,0.0,0.666667,0.666667,0.333333,0.333333,0.666667,0.333333,0.666667,0.333333,1.0
2,1.0,0.666667,0.0,0.666667,0.666667,0.666667,0.0,0.333333,0.333333,0.666667,1.0
3,1.0,0.666667,0.666667,0.0,0.666667,0.333333,0.666667,0.666667,0.333333,0.333333,0.333333
4,0.333333,0.333333,0.666667,0.666667,0.0,0.666667,0.666667,0.333333,0.666667,0.333333,1.0
5,1.0,0.333333,0.666667,0.333333,0.666667,0.0,0.666667,0.666667,0.666667,0.666667,0.666667
6,1.0,0.666667,0.0,0.666667,0.666667,0.666667,0.0,0.333333,0.333333,0.666667,1.0
7,0.666667,0.333333,0.333333,0.666667,0.333333,0.666667,0.333333,0.0,0.666667,0.333333,1.0
8,1.0,0.666667,0.333333,0.333333,0.666667,0.666667,0.333333,0.666667,0.0,0.333333,0.666667
9,0.666667,0.333333,0.666667,0.333333,0.333333,0.666667,0.666667,0.333333,0.333333,0.0,0.666667


In [15]:
dst_matrix1 = pairwise_distances(soy_df_enc, metric='hamming')
pd.DataFrame(dst_matrix1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,0.666667,1.0,1.0,0.333333,1.0,1.0,0.666667,1.0,0.666667,0.666667
1,0.666667,0.0,0.666667,0.666667,0.333333,0.333333,0.666667,0.333333,0.666667,0.333333,1.0
2,1.0,0.666667,0.0,0.666667,0.666667,0.666667,0.0,0.333333,0.333333,0.666667,1.0
3,1.0,0.666667,0.666667,0.0,0.666667,0.333333,0.666667,0.666667,0.333333,0.333333,0.333333
4,0.333333,0.333333,0.666667,0.666667,0.0,0.666667,0.666667,0.333333,0.666667,0.333333,1.0
5,1.0,0.333333,0.666667,0.333333,0.666667,0.0,0.666667,0.666667,0.666667,0.666667,0.666667
6,1.0,0.666667,0.0,0.666667,0.666667,0.666667,0.0,0.333333,0.333333,0.666667,1.0
7,0.666667,0.333333,0.333333,0.666667,0.333333,0.666667,0.333333,0.0,0.666667,0.333333,1.0
8,1.0,0.666667,0.333333,0.333333,0.666667,0.666667,0.333333,0.666667,0.0,0.333333,0.666667
9,0.666667,0.333333,0.666667,0.333333,0.333333,0.666667,0.666667,0.333333,0.333333,0.0,0.666667


The cell above uses `pairwise_distances` from `sklearn.metrics` with the Hamming metric to calculate the same distance matrix.



### Explanation:
1. **Encoding the Data**:
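   - Use `OrdinalEncoder` (or `LabelEncoder` applied column by column) to transform non-numeric columns (e.g., `date` and `hail`) into numeric codes. The codes are arbitrary labels; the Hamming distance only checks whether two codes are equal.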
   - Make sure all columns are numeric before passing them to `pairwise_distances`.

2. **Pairwise Distances**:
   - `pairwise_distances` calculates the Hamming distance between all rows of the encoded DataFrame.
   - The result is a square matrix where the entry `(i, j)` represents the Hamming distance between row `i` and row `j`.

3. **DataFrame Conversion**:
   - Convert the resulting distance matrix to a `DataFrame` for better visualization.


This is the pairwise normalized Hamming distance matrix. Each value represents the fraction of differing features between the rows.

In [16]:
np.array_equal(dst_matrix, dst_matrix1)

True

## Calculating Euclidean Distance

In [17]:
oh_encoder = OneHotEncoder(sparse_output=False) 
soy_df_oh_enc = oh_encoder.fit_transform(soy_df)

In [18]:
soy_df.nunique()

date           4
hail           2
germination    3
dtype: int64

In [19]:
soy_df_oh_enc

array([[1., 0., 0., 0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1., 1., 0., 0.]])

In [20]:
soy_df_oh_enc.shape

(11, 9)

In [21]:
euclidean(soy_df_oh_enc[0,:], soy_df_oh_enc[1,:])

2.0
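
On one-hot encoded data the Euclidean distance is a rescaling of the Hamming distance: each attribute that differs flips exactly two one-hot entries, so the squared Euclidean distance equals twice the number of mismatches. The sketch below checks this on the first four rows of the ordinal and one-hot arrays printed earlier in this notebook (retyped by hand):

```python
import numpy as np
from scipy.spatial.distance import pdist

# First four rows of the ordinal-encoded array shown above
ordinal = np.array([
    [0, 1, 2],
    [3, 0, 2],
    [1, 0, 1],
    [2, 0, 0],
], dtype=float)

# First four rows of the one-hot-encoded array shown above
one_hot = np.array([
    [1, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
], dtype=float)

n_features = ordinal.shape[1]   # 3 original categorical attributes

ham = pdist(ordinal, metric='hamming')      # fraction of differing attributes
euc = pdist(one_hot, metric='euclidean')    # distance in one-hot space

# Each mismatch adds 2 to the squared Euclidean distance
euc_from_ham = np.sqrt(2 * n_features * ham)

print(euc)
print(euc_from_ham)
```

For rows 0 and 1 the Hamming distance is 2/3, which gives sqrt(2 * 3 * 2/3) = 2.0, matching the output above.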

In [22]:
dst = pdist(soy_df_oh_enc, metric='euclidean')
dst_matrix = squareform(dst)
pd.DataFrame(dst_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,2.0,2.44949,2.44949,1.414214,2.44949,2.44949,2.0,2.44949,2.0,2.0
1,2.0,0.0,2.0,2.0,1.414214,1.414214,2.0,1.414214,2.0,1.414214,2.44949
2,2.44949,2.0,0.0,2.0,2.0,2.0,0.0,1.414214,1.414214,2.0,2.44949
3,2.44949,2.0,2.0,0.0,2.0,1.414214,2.0,2.0,1.414214,1.414214,1.414214
4,1.414214,1.414214,2.0,2.0,0.0,2.0,2.0,1.414214,2.0,1.414214,2.44949
5,2.44949,1.414214,2.0,1.414214,2.0,0.0,2.0,2.0,2.0,2.0,2.0
6,2.44949,2.0,0.0,2.0,2.0,2.0,0.0,1.414214,1.414214,2.0,2.44949
7,2.0,1.414214,1.414214,2.0,1.414214,2.0,1.414214,0.0,2.0,1.414214,2.44949
8,2.44949,2.0,1.414214,1.414214,2.0,2.0,1.414214,2.0,0.0,1.414214,2.0
9,2.0,1.414214,2.0,1.414214,1.414214,2.0,2.0,1.414214,1.414214,0.0,2.0


In [23]:
dst_matrix1 = pairwise_distances(soy_df_oh_enc, metric='euclidean')
pd.DataFrame(dst_matrix1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,2.0,2.44949,2.44949,1.414214,2.44949,2.44949,2.0,2.44949,2.0,2.0
1,2.0,0.0,2.0,2.0,1.414214,1.414214,2.0,1.414214,2.0,1.414214,2.44949
2,2.44949,2.0,0.0,2.0,2.0,2.0,0.0,1.414214,1.414214,2.0,2.44949
3,2.44949,2.0,2.0,0.0,2.0,1.414214,2.0,2.0,1.414214,1.414214,1.414214
4,1.414214,1.414214,2.0,2.0,0.0,2.0,2.0,1.414214,2.0,1.414214,2.44949
5,2.44949,1.414214,2.0,1.414214,2.0,0.0,2.0,2.0,2.0,2.0,2.0
6,2.44949,2.0,0.0,2.0,2.0,2.0,0.0,1.414214,1.414214,2.0,2.44949
7,2.0,1.414214,1.414214,2.0,1.414214,2.0,1.414214,0.0,2.0,1.414214,2.44949
8,2.44949,2.0,1.414214,1.414214,2.0,2.0,1.414214,2.0,0.0,1.414214,2.0
9,2.0,1.414214,2.0,1.414214,1.414214,2.0,2.0,1.414214,1.414214,0.0,2.0
