###### Refer to E-Book 1

## Section 2.1

### 2.1.1. Nominal attributes

#### Nominal Attributes:

Nominal attributes are categorical variables that represent distinct categories or labels with no inherent order or ranking among them. These attributes classify data into various groups based on qualitative differences. Examples include colors, gender, types of animals, and more.

#### Practical Python Code for Handling Nominal Attributes:

Let's create a simple Python code snippet using the pandas library to work with a dataset containing nominal attributes. We'll load a sample dataset, explore nominal attributes, and perform one-hot encoding.

In [None]:
import pandas as pd

# Sample dataset with nominal attributes
data = {
    'Animal': ['Dog', 'Cat', 'Fish', 'Bird', 'Snake'],
    'Color': ['Brown', 'Black', 'Gold', 'Blue', 'Green'],
    'Habitat': ['Forest', 'Home', 'Aquarium', 'Sky', 'Jungle']
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)
print("\n")

# Explore the nominal attributes
nominal_attributes = ['Animal', 'Color', 'Habitat']

# Display the unique values in each nominal attribute
for attribute in nominal_attributes:
    print(f"Unique values in {attribute}: {df[attribute].unique()}")

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=nominal_attributes)

# Display the dataset after one-hot encoding
print("\nDataset after One-Hot Encoding:")
print(df_encoded)

In this example:

    We create a pandas DataFrame with three nominal attributes: 'Animal', 'Color', and 'Habitat'.
    We explore the unique values in each nominal attribute.
    We use pd.get_dummies() to perform one-hot encoding, converting each nominal attribute into binary columns.
    The resulting DataFrame (df_encoded) is displayed after one-hot encoding.

### 2.1.2. Binary attributes

#### Binary Attributes:

Binary attributes are categorical variables that can take on one of two possible values, typically representing the presence or absence of a certain characteristic. These attributes are fundamental in many datasets, and examples include Yes/No, True/False, 1/0, or any other two distinct categories.

#### Practical Python Code for Handling Binary Attributes:

Let's create a Python code snippet using the pandas library to work with a dataset containing binary attributes. We'll load a sample dataset, explore binary attributes, and perform basic operations on them.

In [None]:
import pandas as pd

# Sample dataset with binary attributes
data = {
    'StudentID': [1, 2, 3, 4, 5],
    'PassedExam': [1, 0, 1, 1, 0],
    'EnrolledInCourse': [1, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)
print("\n")

# Explore the binary attributes
binary_attributes = ['PassedExam', 'EnrolledInCourse']

# Display the count of each unique value in binary attributes
for attribute in binary_attributes:
    print(f"Counts for {attribute}:\n{df[attribute].value_counts()}\n")

# Perform basic operations on binary attributes
df['TotalAttributes'] = df['PassedExam'] + df['EnrolledInCourse']

# Display the dataset after the operation
print("Dataset after performing an operation on binary attributes:")
print(df)


In this example:

    We create a pandas DataFrame with two binary attributes: 'PassedExam' and 'EnrolledInCourse'.
    We explore the counts of each unique value in binary attributes using value_counts().
    We perform a basic operation (addition) on binary attributes to create a new attribute, 'TotalAttributes'.
    The resulting DataFrame (df) is displayed after these operations.

### 2.1.3. Ordinal attributes

#### Ordinal Attributes:

Ordinal attributes are categorical variables with a meaningful order or ranking among the categories. Unlike nominal attributes, ordinal attributes have a clear, meaningful sequence, but the intervals between them are not necessarily uniform or well-defined. Examples of ordinal attributes include education levels (e.g., elementary, high school, college), customer satisfaction ratings, or socioeconomic classes.

#### Practical Python Code for Handling Ordinal Attributes:

Let's create a Python code snippet using the pandas library to work with a dataset containing ordinal attributes. We'll load a sample dataset, explore ordinal attributes, and demonstrate how to encode them to preserve the ordinal relationship.

In [None]:
import pandas as pd

# Sample dataset with ordinal attributes
data = {
    'StudentID': [1, 2, 3, 4, 5],
    'EducationLevel': ['High School', 'College', 'Elementary', 'College', 'High School'],
    'SatisfactionRating': [3, 5, 2, 4, 1]
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)
print("\n")

# Explore the ordinal attributes
ordinal_attributes = ['EducationLevel', 'SatisfactionRating']

# Display the unique values in each ordinal attribute
for attribute in ordinal_attributes:
    print(f"Unique values in {attribute}: {df[attribute].unique()}")

# Encode ordinal attributes with meaningful numerical values
education_level_mapping = {'Elementary': 1, 'High School': 2, 'College': 3}
df['EducationLevelEncoded'] = df['EducationLevel'].map(education_level_mapping)

# Display the dataset after encoding ordinal attributes
print("\nDataset after encoding ordinal attributes:")
print(df)


In this example:

    We create a pandas DataFrame with two ordinal attributes: 'EducationLevel' and 'SatisfactionRating'.
    We explore the unique values in each ordinal attribute.
    We encode ordinal attributes with meaningful numerical values using the map() function to create a new attribute, 'EducationLevelEncoded'.
    The resulting DataFrame (df) is displayed after these operations.
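
As an alternative to a hand-written mapping, pandas can store the ranking on the column itself. The following is a minimal sketch, assuming an ordered `CategoricalDtype` suits the data; the variable names are illustrative only.

```python
import pandas as pd

# Category order, lowest to highest (same order as the mapping above)
levels = ['Elementary', 'High School', 'College']
education = pd.Series(['High School', 'College', 'Elementary', 'College', 'High School'])

# An ordered categorical keeps the labels and knows their ranking
ordered = education.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Integer codes follow the category order (0 = lowest)
codes = ordered.cat.codes
print(codes.tolist())

# Order-aware comparisons work directly on the labels
at_least_high_school = ordered >= 'High School'
print(at_least_high_school.tolist())
```

Unlike a plain integer column, the ordered categorical keeps the original labels visible while still supporting comparisons, sorting, and min/max.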

### 2.1.4. Numeric attributes

#### Numeric Attributes:

Numeric attributes represent quantities and can take on numerical values. There are two main types of numeric attributes: discrete and continuous. Discrete numeric attributes can only take on distinct, separate values (e.g., the number of bedrooms in a house), while continuous numeric attributes can take on any value within a range (e.g., height, weight).

#### Practical Python Code for Handling Numeric Attributes:

Let's create a Python code snippet using the pandas library to work with a dataset containing numeric attributes. We'll load a sample dataset, explore numeric attributes, and perform basic operations on them.

In [None]:
import pandas as pd

# Sample dataset with numeric attributes
data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Age': [21, 19, 22, 20, 23],
    'Height (cm)': [175, 160, 180, 165, 185],
    'Score': [85, 92, 78, 89, 95]
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)
print("\n")

# Explore the numeric attributes
numeric_attributes = ['Age', 'Height (cm)', 'Score']

# Display summary statistics for numeric attributes
print("Summary Statistics for Numeric Attributes:")
print(df[numeric_attributes].describe())

# Perform basic operations on numeric attributes
df['NormalizedScore'] = (df['Score'] - df['Score'].mean()) / df['Score'].std()

# Display the dataset after the operation
print("\nDataset after performing an operation on numeric attributes:")
print(df)


In this example:

    We create a pandas DataFrame with three numeric attributes: 'Age', 'Height (cm)', and 'Score'.
    We explore summary statistics for numeric attributes using describe().
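    We standardize the 'Score' attribute (z-score: subtract the mean, then divide by the standard deviation) to create a new attribute, 'NormalizedScore'. Note that the pandas std() method uses the sample standard deviation (ddof=1) by default.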
    The resulting DataFrame (df) is displayed after these operations.

### 2.1.5. Discrete vs. continuous attributes

#### Discrete Attributes:

Definition: Discrete attributes can only take on distinct, separate values.
Examples: The number of bedrooms in a house, the count of items in a shopping cart, the number of students in a class.
Nature: These attributes are often counted in whole numbers and have clear boundaries between values.

#### Continuous Attributes:

Definition: Continuous attributes can take on any value within a range.
Examples: Height, weight, temperature, and any measurement that can have decimal values.
Nature: These attributes have a continuous and infinite set of possible values, making them suitable for measurement.

#### Practical Python Code for Discrete vs. Continuous Attributes:

Let's create a Python code snippet using the pandas library to work with a dataset containing both discrete and continuous attributes. We'll load a sample dataset, explore the nature of each type, and perform basic operations.

In [None]:
import pandas as pd

# Sample dataset with discrete and continuous attributes
data = {
    'StudentID': [1, 2, 3, 4, 5],
    'NumCourses': [4, 5, 3, 6, 4],  # Discrete attribute (number of courses)
    'GPA': [3.5, 4.0, 3.2, 3.8, 3.9],  # Continuous attribute (GPA)
    'Income': [25000, 30000, 20000, 35000, 32000]  # Continuous attribute (income in dollars)
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)
print("\n")

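# Explore the nature of attributes.
# Note: the dtype is only a rough heuristic. An identifier such as StudentID
# is stored as an integer without being a quantity, and Income is stored as
# whole numbers here even though it is conceptually continuous.
print("Nature of Attributes (dtype-based heuristic):")
for column in df.columns:
    if pd.api.types.is_integer_dtype(df[column]):
        print(f"{column} is stored as integers (discrete by this heuristic).")
    elif pd.api.types.is_float_dtype(df[column]):
        print(f"{column} is stored as floats (continuous by this heuristic).")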

# Perform basic operations on continuous attributes
df['ScaledIncome'] = df['Income'] / 1000  # Scale income for readability

# Display the dataset after the operation
print("\nDataset after performing an operation on continuous attributes:")
print(df)


In this example:

    We create a pandas DataFrame with three attributes: 'NumCourses' (discrete), 'GPA' (continuous), and 'Income' (continuous).
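    We infer the nature of each attribute from its data type. This is only a heuristic: 'Income' is stored as integers and is therefore reported as discrete, even though income is conceptually continuous, and the identifier 'StudentID' is reported as discrete as well.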
    We perform a basic operation (scaling) on a continuous attribute ('Income') to create a new attribute, 'ScaledIncome'.
    The resulting DataFrame (df) is displayed after these operations.

## Section 2.2

### 2.2.1. Measuring the central tendency

Measuring central tendency is a way to summarize a set of data by identifying the central or average value. There are three common measures of central tendency: mean, median, and mode.

#### Mean (Average): 
It is calculated by summing up all the values in a dataset and dividing the sum by the number of values.

#### Median: 
The median is the middle value of a dataset when it is sorted in ascending or descending order. If the dataset has an even number of values, the median is the average of the two middle values.

#### Mode: 
The mode is the value that appears most frequently in a dataset.

#### Practical Python Code for Measuring the central tendency:

In [None]:
import numpy as np

# Example dataset
data = np.array([15, 18, 2, 36, 12, 25, 18, 40, 28, 22])

# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

# Median
median_value = np.median(data)
print(f"Median: {median_value}")

# Mode (np.bincount counts occurrences and requires non-negative integers)
mode_value = np.argmax(np.bincount(data))
print(f"Mode: {mode_value}")


    Make sure to install NumPy if you haven't already by running pip install numpy in your Python environment.

    This code uses NumPy functions to calculate the mean, median, and mode of the given dataset. You can replace the data array with your own dataset.
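
The three definitions above can also be checked with Python's standard `statistics` module, which handles non-integer data and ties in the mode. This is a small sketch on the same example values; `multimode` returns every most-frequent value rather than just one.

```python
import statistics

# Same example values as in the NumPy cell above
data = [15, 18, 2, 36, 12, 25, 18, 40, 28, 22]

# Mean: sum of the values divided by their count (216 / 10)
mean_value = statistics.mean(data)

# Median: average of the two middle values (18 and 22) of the sorted data
median_value = statistics.median(data)

# Mode(s): all values that occur most often
modes = statistics.multimode(data)

print(f"Mean: {mean_value}, Median: {median_value}, Modes: {modes}")
```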

### 2.2.2. Measuring the dispersion of data

Measuring the dispersion of data is important in understanding how spread out or clustered the values in a dataset are. There are several measures of dispersion, including range, variance, and standard deviation.

#### Range: 
The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of how spread out the values are.

#### Variance: 
Variance measures the average squared difference of each value from the mean of the dataset. A higher variance indicates greater dispersion.

#### Standard Deviation: 
The standard deviation is the square root of the variance. It provides a more interpretable measure of the spread of data, as it is in the same unit as the original data.

#### Practical Python Code for Measuring the dispersion of data:

In [None]:
import numpy as np

# Example dataset
data = np.array([15, 18, 2, 36, 12, 25, 18, 40, 28, 22])

# Range
data_range = np.ptp(data)
print(f"Range: {data_range}")

# Variance
data_variance = np.var(data)
print(f"Variance: {data_variance}")

# Standard Deviation
data_stddev = np.std(data)
print(f"Standard Deviation: {data_stddev}")


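    In this code, np.ptp calculates the range, np.var calculates the variance, and np.std calculates the standard deviation. By default, np.var and np.std compute the population versions (dividing by n); pass ddof=1 to obtain the sample versions (dividing by n - 1). Replace the data array with your own dataset.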
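
The choice of divisor matters for small datasets. The sketch below, on the same example values, compares the population variance (divide by n) with the sample variance (divide by n - 1) using NumPy's `ddof` argument.

```python
import numpy as np

data = np.array([15, 18, 2, 36, 12, 25, 18, 40, 28, 22])

# Population variance: mean squared deviation from the mean (divides by n)
population_variance = np.var(data)

# Sample variance: divides by n - 1 instead
sample_variance = np.var(data, ddof=1)

# The standard deviation is the square root of the matching variance
sample_std = np.std(data, ddof=1)

print(f"Population variance: {population_variance:.2f}")
print(f"Sample variance:     {sample_variance:.2f}")
print(f"Sample std. dev.:    {sample_std:.2f}")
```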

### 2.2.3. Covariance and correlation analysis

Covariance and correlation analysis are statistical measures that describe the degree to which two variables change together.

#### Covariance:

Covariance measures the extent to which the values of two variables change in relation to each other.
A positive covariance indicates that as one variable increases, the other tends to increase as well; a negative covariance indicates that as one variable increases, the other tends to decrease.
However, the scale of covariance is not standardized, making it challenging to interpret the strength of the relationship.

#### Correlation:

Correlation is a standardized measure of the strength and direction of the linear relationship between two variables.
The correlation coefficient ranges from -1 to 1.
A correlation coefficient of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation.

#### Practical Python Code for Covariance and correlation analysis:

In [None]:
import numpy as np
import pandas as pd

# Example dataset
data = {
    'Variable1': [15, 18, 2, 36, 12, 25, 18, 40, 28, 22],
    'Variable2': [10, 12, 5, 30, 8, 20, 15, 35, 25, 18]
}

df = pd.DataFrame(data)

# Covariance matrix
cov_matrix = np.cov(df, rowvar=False)
print(f"Covariance Matrix:\n{cov_matrix}")

# Correlation matrix
correlation_matrix = df.corr()
print(f"Correlation Matrix:\n{correlation_matrix}")


    This code calculates the covariance matrix using np.cov (which returns the sample covariance, dividing by n - 1, by default) and the Pearson correlation matrix using the corr method of a Pandas DataFrame. Replace the Variable1 and Variable2 columns with your own variables.
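
Correlation is covariance rescaled by the two standard deviations, which is what removes the dependence on units. A minimal sketch of that relationship, using the same example values (the variable names are illustrative):

```python
import numpy as np

x = np.array([15, 18, 2, 36, 12, 25, 18, 40, 28, 22], dtype=float)
y = np.array([10, 12, 5, 30, 8, 20, 15, 35, 25, 18], dtype=float)

# Sample covariance between x and y (off-diagonal entry of the 2x2 matrix)
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation: covariance divided by the product of the
# sample standard deviations (same ddof as np.cov uses)
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in correlation for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Covariance: {cov_xy:.3f}")
print(f"Correlation (manual): {r_manual:.6f}")
print(f"Correlation (NumPy):  {r_numpy:.6f}")
```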

### 2.2.4. Graphic displays of basic statistics of data

Graphic displays of basic statistics are essential for visualizing the characteristics of a dataset. These displays can help in gaining insights into the distribution, central tendency, and dispersion of the data.

#### Histograms:

Explanation: Histograms provide a visual representation of the distribution of a dataset by dividing it into bins and displaying the frequency or probability of values falling into each bin.

#### Box Plots:

Explanation: Box plots (box-and-whisker plots) provide a graphical summary of the distribution of a dataset, including the median, quartiles, and potential outliers.

#### Scatter Plots:

Explanation: Scatter plots display individual data points in a two-dimensional space, making it easy to identify patterns, relationships, and outliers between two variables.

#### Practical Python Code for Graphic displays of basic statistics of data:

In [None]:
# Histogram

import matplotlib.pyplot as plt
import numpy as np

# Example dataset
data = np.random.randn(1000)  # Replace with your own dataset

# Create a histogram
plt.hist(data, bins=20, color='blue', alpha=0.7)
plt.title('Histogram of the Dataset')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()


In [None]:
# Box Plot

import matplotlib.pyplot as plt
import numpy as np

# Example dataset
data = np.random.randn(1000)  # Replace with your own dataset

# Create a box plot
plt.boxplot(data, vert=False)
plt.title('Box Plot of the Dataset')
plt.xlabel('Values')
plt.show()


In [None]:
# Scatter plot

import matplotlib.pyplot as plt
import numpy as np

# Example datasets
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)  # Replace with your own datasets

# Create a scatter plot
plt.scatter(x, y, color='red', alpha=0.7)
plt.title('Scatter Plot of Two Variables')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()


    Histograms, box plots, and scatter plots can enhance the understanding of data distributions, relationships, and outliers.

## Section 2.3

### 2.3.1. Data matrix vs. dissimilarity matrix

In data mining, understanding the concepts of data matrix and dissimilarity matrix is crucial.

#### Data Matrix:

Explanation: A data matrix is a structured representation of a dataset, where rows correspond to individual observations or instances, and columns represent different attributes or features. Each cell in the matrix contains the value of a specific attribute for a particular instance.

#### Dissimilarity Matrix:

Explanation: A dissimilarity matrix represents the dissimilarity or similarity between pairs of instances in a dataset. It quantifies how different or similar two instances are based on some measure of dissimilarity, such as distance or dissimilarity scores.

In [None]:
# Data Matrix

import pandas as pd

# Example data matrix
data = {
    'ID': [1, 2, 3, 4],
    'Feature1': [10, 15, 20, 25],
    'Feature2': [0.5, 1.2, 0.8, 1.0]
}

df = pd.DataFrame(data)
print("Data Matrix:")
print(df)


In [None]:
# Dissimilarity Matrix

from scipy.spatial.distance import pdist, squareform
import pandas as pd

# Example data matrix
data = {
    'ID': [1, 2, 3, 4],
    'Feature1': [10, 15, 20, 25],
    'Feature2': [0.5, 1.2, 0.8, 1.0]
}

df = pd.DataFrame(data)

# Calculate Euclidean distance matrix
dissimilarity_matrix = squareform(pdist(df[['Feature1', 'Feature2']], metric='euclidean'))
print("Dissimilarity Matrix:")
print(pd.DataFrame(dissimilarity_matrix, index=df['ID'], columns=df['ID']))


    In this example, the pdist function from SciPy calculates the pairwise Euclidean distances between instances based on selected features. The squareform function converts the condensed distance matrix to a square form.

    Data matrices provide a structured representation of raw data, while dissimilarity matrices capture the pairwise dissimilarities between instances, which is useful in various data mining applications like clustering, classification, and similarity search.

### 2.3.2. Proximity measures for nominal attributes

Proximity measures for nominal attributes are used to quantify the similarity or dissimilarity between instances when dealing with categorical or nominal attributes. Unlike numerical attributes, nominal attributes don't have a natural ordering, making traditional distance metrics unsuitable. Proximity measures address this issue by focusing on the agreement or disagreement between categorical values.

#### Example Proximity Measure: Jaccard Similarity

#### Explanation:

Jaccard Similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. In the context of nominal attributes, each instance is one-hot encoded into a binary vector (1 if the instance has a given attribute value, 0 otherwise), and the vectors are compared as sets of present values.

In [None]:
import pandas as pd
from sklearn.metrics import jaccard_score

# Example dataset with nominal attributes: one row per instance
# (the column names are illustrative)
data = pd.DataFrame(
    [['Red', 'Square', 'Large'],
     ['Blue', 'Circle', 'Small']],
    index=['Instance1', 'Instance2'],
    columns=['Color', 'Shape', 'Size']
)

# Convert nominal attributes to binary values (one-hot encoding)
binary_data = pd.get_dummies(data, dtype=int)
binary_data


In [None]:
# Calculate Jaccard Similarity between the two instances,
# and between an instance and itself
jaccard_similarity = jaccard_score(binary_data.iloc[0], binary_data.iloc[1])
jaccard_similarity_1 = jaccard_score(binary_data.iloc[0], binary_data.iloc[0])
print(f"Jaccard Similarity (Instance1 vs. Instance2): {jaccard_similarity}")
print(f"Jaccard Similarity (Instance1 vs. itself): {jaccard_similarity_1}")

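    In this example, the nominal attributes ('Red', 'Square', 'Large') and ('Blue', 'Circle', 'Small') are converted into binary vectors using one-hot encoding. The Jaccard Similarity is then calculated using the jaccard_score function from scikit-learn. Because the two instances share no attribute values, their similarity is 0, while an instance compared with itself scores 1.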

    Proximity measures for nominal attributes are essential for clustering, classification, and other data mining tasks involving categorical data. The practical example illustrates the application of Jaccard Similarity to quantify the similarity between instances with nominal attributes.
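
A simpler measure that works directly on the category labels, without one-hot encoding, is the simple matching dissimilarity d = (p - m) / p, where p is the number of attributes and m is the number of attributes on which the two instances agree. The function below is a minimal sketch; its name is illustrative.

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of nominal attributes on which two instances disagree."""
    if len(a) != len(b):
        raise ValueError("Both instances must have the same number of attributes.")
    p = len(a)                                      # total number of attributes
    m = sum(1 for u, v in zip(a, b) if u == v)      # number of matching attributes
    return (p - m) / p


no_match = simple_matching_dissimilarity(['Red', 'Square', 'Large'],
                                         ['Blue', 'Circle', 'Small'])
partial_match = simple_matching_dissimilarity(['Red', 'Square', 'Large'],
                                              ['Red', 'Circle', 'Large'])
print(f"No shared values: {no_match}")
print(f"Two of three shared: {partial_match:.3f}")
```

The corresponding similarity is 1 - d, i.e. the fraction of matching attributes.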

### 2.3.3. Proximity measures for binary attributes

Proximity measures for binary attributes are employed when dealing with datasets that consist of binary (0/1) attributes. These measures assess the similarity or dissimilarity between instances based on the presence or absence of specific binary features.

#### Example Proximity Measure: Hamming Distance

#### Explanation:

Hamming Distance measures the dissimilarity between two binary strings by counting the number of positions at which the corresponding bits are different.

In [None]:
import numpy as np
from scipy.spatial.distance import hamming

# Example dataset with binary attributes
data = {
    'Instance1': [1, 0, 1, 0],
    'Instance2': [0, 1, 1, 0]
}

# Convert binary attributes to NumPy arrays
binary_data = np.array([data['Instance1'], data['Instance2']])

# Calculate Hamming Distance
hamming_distance = hamming(binary_data[0], binary_data[1])
print(f"Hamming Distance: {hamming_distance}")


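    In this example, instances are represented by binary vectors [1, 0, 1, 0] and [0, 1, 1, 0]. The Hamming Distance between these vectors is computed using the hamming function from SciPy. Note that SciPy returns the proportion of differing positions (here 2 out of 4, i.e. 0.5) rather than the raw count; multiply by the vector length to obtain the count.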

    Proximity measures for binary attributes are crucial for tasks such as clustering, pattern recognition, and classification when dealing with datasets consisting of binary features. The Hamming Distance example demonstrates how to quantify dissimilarity between instances based on binary attributes.
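
Hamming distance treats 1/1 and 0/0 agreements alike. When a shared absence (0/0) carries little information, an asymmetric measure that ignores those positions is often preferred. The sketch below counts the four agreement/disagreement cases and derives both variants; the function name and the q, r, s, t labels are illustrative.

```python
import numpy as np


def binary_dissimilarities(x, y):
    """Return (symmetric, asymmetric) dissimilarity for two 0/1 vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    q = np.sum(x & y)        # both 1
    r = np.sum(x & ~y)       # 1 in x, 0 in y
    s = np.sum(~x & y)       # 0 in x, 1 in y
    t = np.sum(~x & ~y)      # both 0
    symmetric = (r + s) / (q + r + s + t)   # equals SciPy's normalized Hamming
    # Asymmetric variant ignores 0/0 matches (1 - Jaccard similarity);
    # it is undefined when both vectors are all zeros.
    asymmetric = (r + s) / (q + r + s)
    return float(symmetric), float(asymmetric)


symmetric, asymmetric = binary_dissimilarities([1, 0, 1, 0], [0, 1, 1, 0])
print(f"Symmetric: {symmetric}, Asymmetric: {asymmetric:.3f}")
```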

### 2.3.4. Dissimilarity of numeric data: Minkowski distance

The Minkowski distance is a dissimilarity measure used to assess the similarity or dissimilarity between two points in a multidimensional space. It is a generalization of other distance measures, including Euclidean distance and Manhattan distance, and is defined by the following formula:

$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}$$

where x and y are vectors representing the numeric data points, n is the number of dimensions, and p is a parameter that determines the order of the Minkowski distance. When p=1, the Minkowski distance is equivalent to the Manhattan distance, and when p=2, it is equivalent to the Euclidean distance.

In [None]:
from scipy.spatial.distance import minkowski

# Example dataset with numeric attributes
data1 = [2, 3, 5, 7]
data2 = [1, 4, 6, 8]

# Calculate Minkowski distance with p=2 (Euclidean distance)
minkowski_distance = minkowski(data1, data2, p=2)
print(f"Minkowski Distance (p=2): {minkowski_distance}")

# Calculate Minkowski distance with p=1 (Manhattan distance)
manhattan_distance = minkowski(data1, data2, p=1)
print(f"Minkowski Distance (p=1): {manhattan_distance}")


    In this example, the Minkowski distance is calculated between two numeric vectors [2, 3, 5, 7] and [1, 4, 6, 8] using both the Euclidean distance (p=2) and the Manhattan distance (p=1). The minkowski function from SciPy is used for the calculation.

    The Minkowski distance is a flexible measure that adapts to different scenarios based on the chosen value of p. It is applicable in various data mining tasks such as clustering, classification, and outlier detection, especially when dealing with datasets containing numeric attributes.
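
To make the formula concrete, the sketch below implements it directly with NumPy for the same two vectors. As p grows, the distance approaches the largest single coordinate difference (often called the Chebyshev or supremum distance), which is handled here as a special case. The function name is illustrative.

```python
import numpy as np


def minkowski_distance(x, y, p):
    """Minkowski distance of order p; p = inf gives the maximum difference."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float(np.sum(diff ** p) ** (1.0 / p))


data1 = [2, 3, 5, 7]
data2 = [1, 4, 6, 8]

manhattan = minkowski_distance(data1, data2, p=1)
euclidean = minkowski_distance(data1, data2, p=2)
supremum = minkowski_distance(data1, data2, p=np.inf)
print(f"p=1: {manhattan}, p=2: {euclidean}, p=inf: {supremum}")
```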

### 2.3.5. Proximity measures for ordinal attributes

Proximity measures for ordinal attributes are used when dealing with datasets that contain attributes with an inherent order or ranking. Unlike nominal attributes, ordinal attributes have a meaningful order, but the intervals between values may not be uniform.

#### Example Proximity Measure: Spearman Rank Correlation Coefficient

#### Explanation:

The Spearman Rank Correlation Coefficient measures the strength and direction of the monotonic relationship between two ordinal variables. It is based on the ranks of the values rather than their actual values.

In [None]:
import numpy as np
from scipy.stats import spearmanr

# Example dataset with ordinal attributes
data = {
    'Instance1': [2, 3, 1, 4],
    'Instance2': [1, 4, 2, 3]
}

# Calculate Spearman Rank Correlation Coefficient
spearman_corr, _ = spearmanr(data['Instance1'], data['Instance2'])
print(f"Spearman Rank Correlation Coefficient: {spearman_corr}")


    In this example, instances are represented by ordinal attributes [2, 3, 1, 4] and [1, 4, 2, 3]. The Spearman Rank Correlation Coefficient is calculated using the spearmanr function from SciPy.

    Proximity measures for ordinal attributes are important when dealing with data that has a meaningful order but lacks a clear numerical scale. The Spearman Rank Correlation Coefficient is particularly useful for assessing the monotonic relationship between ordinal variables.
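
Spearman's coefficient compares two whole variables. To measure the dissimilarity between individual instances on an ordinal attribute, a common approach is to replace each value by its rank r in 1..M and normalize it to z = (r - 1) / (M - 1), after which any numeric distance can be applied. The sketch below assumes the three education levels used earlier in this section; the names are illustrative.

```python
import numpy as np

# Ordered categories, lowest to highest, mapped to ranks 1..M
levels = ['Elementary', 'High School', 'College']
rank = {level: i + 1 for i, level in enumerate(levels)}
M = len(levels)


def normalized_rank(value):
    """Map an ordinal value onto [0, 1] using its rank."""
    return (rank[value] - 1) / (M - 1)


values = ['High School', 'College', 'Elementary']
z = np.array([normalized_rank(v) for v in values])

# Pairwise dissimilarity: absolute difference of the normalized ranks
dissimilarity = np.abs(z[:, None] - z[None, :])
print(z)
print(dissimilarity)
```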

### 2.3.6. Dissimilarity for attributes of mixed types

Handling dissimilarity for attributes of mixed types is a common challenge in data mining, as datasets often include a combination of numeric, categorical, and ordinal attributes. It's crucial to employ appropriate dissimilarity measures that can accommodate the diverse nature of these attribute types.

#### Example Dissimilarity Measure: Gower's Distance

#### Explanation:

Gower's Distance is a dissimilarity measure designed for datasets with mixed types of attributes. It computes a dissimilarity between 0 and 1 for each attribute, using a measure suited to the attribute's type (for example, the range-normalized absolute difference for numeric attributes and a simple match/mismatch score for nominal and binary attributes), and then averages these per-attribute values.

In [None]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

# Example dataset with mixed attribute types
data = {
    'Numeric': [2.5, 1.8, 3.2, 4.0],
    'Binary': [1, 0, 1, 1],
    'Nominal': ['A', 'B', 'A', 'C'],
    'Ordinal': [2, 1, 3, 2]
}

# Standardize numeric attributes
numeric_data = np.array([data['Numeric']]).T
numeric_data = StandardScaler().fit_transform(numeric_data)

# Binary attributes are already 0/1, so use them as a column vector
binary_data = np.array([data['Binary']]).T

# Convert nominal attributes to binary values (one-hot encoding)
nominal_data = pd.get_dummies(pd.DataFrame(data['Nominal'], columns=['Nominal']), dtype=int)

# Combine all attribute types
mixed_data = np.concatenate((numeric_data, binary_data, nominal_data, np.array([data['Ordinal']]).T), axis=1)

# Calculate pairwise distances.
# Note: Euclidean distance on the encoded matrix is a simplified stand-in,
# not Gower's Distance proper, which averages per-attribute dissimilarities.
gower_distance = squareform(pdist(mixed_data, metric='euclidean'))
print("Mixed-Type Distance Matrix (Euclidean approximation):")
print(gower_distance)


    In this example, a distance matrix is calculated for a dataset with numeric, binary, nominal, and ordinal attributes. Numeric attributes are standardized, binary and nominal attributes are represented as binary values, and all attribute types are combined into a single matrix. The pdist function from SciPy is then used to calculate pairwise Euclidean distances, and the squareform function converts the condensed distance matrix to a square form. Strictly speaking, this is a Euclidean approximation rather than Gower's Distance, which averages per-attribute dissimilarities that each lie between 0 and 1.

    The Gower's Distance is a versatile measure for dissimilarity in datasets with mixed attribute types, providing a comprehensive solution for handling various data characteristics.
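
Gower's Distance itself can be sketched as an average of per-attribute dissimilarities, each scaled to the range 0 to 1. The version below is a minimal illustration under stated assumptions: numeric and ordinal attributes use the range-normalized absolute difference, binary and nominal attributes use a simple mismatch score, all attributes are weighted equally, and there are no missing values. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Numeric': [2.5, 1.8, 3.2, 4.0],
    'Binary': [1, 0, 1, 1],
    'Nominal': ['A', 'B', 'A', 'C'],
    'Ordinal': [2, 1, 3, 2]
})


def range_normalized_difference(values):
    """Pairwise |a - b| divided by the attribute's range (result in [0, 1])."""
    v = np.asarray(values, dtype=float)
    return np.abs(v[:, None] - v[None, :]) / (v.max() - v.min())


def mismatch(values):
    """Pairwise 0/1 score: 0 if the two values match, 1 otherwise."""
    v = np.asarray(values)
    return (v[:, None] != v[None, :]).astype(float)


# One n x n dissimilarity matrix per attribute
per_attribute = [
    range_normalized_difference(df['Numeric']),
    mismatch(df['Binary']),
    mismatch(df['Nominal']),
    range_normalized_difference(df['Ordinal']),
]

# Gower's Distance: equal-weight average over the attributes
gower_matrix = np.mean(per_attribute, axis=0)
print(np.round(gower_matrix, 3))
```

Because every per-attribute score lies between 0 and 1, no single attribute can dominate the result simply because of its scale.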

### 2.3.7. Cosine similarity

Cosine similarity is a metric used to measure the similarity between two non-zero vectors of an inner product space. In the context of data mining, it is often employed to assess the similarity between documents in natural language processing or to compare feature vectors in recommendation systems. The cosine similarity ranges from -1 to 1, where 1 indicates that the vectors point in the same direction, 0 indicates that they are orthogonal (no similarity), and -1 indicates that they point in opposite directions. For non-negative vectors such as word counts, the value lies between 0 and 1.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example dataset with text documents
documents = [
    "Data mining is an exciting field of study.",
    "Machine learning techniques are used for data analysis.",
    "Python programming is widely used in data science applications."
]

# Convert documents to a bag-of-words representation
vectorizer = CountVectorizer()
document_matrix = vectorizer.fit_transform(documents)

# Calculate cosine similarity
cosine_similarities = cosine_similarity(document_matrix, document_matrix)
print("Cosine Similarity Matrix:")
print(cosine_similarities)


    In this example, the CountVectorizer from scikit-learn is used to convert a collection of text documents into a matrix of token counts (bag-of-words representation). The cosine_similarity function then computes the cosine similarity between the document vectors.

    Cosine similarity is particularly useful in scenarios where the magnitude of the vectors is not crucial, and the focus is on the direction of the vectors. It is widely applied in text analysis, document retrieval, and collaborative filtering systems. The provided example demonstrates how to calculate cosine similarity for a set of text documents.
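
The underlying computation is the dot product of the two vectors divided by the product of their lengths. The sketch below implements that directly with NumPy on small hand-picked vectors (illustrative values, not taken from the documents above) to show the three reference cases.

```python
import numpy as np


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


same_direction = cosine([1, 2, 3], [2, 4, 6])   # second vector is a scaled copy
orthogonal = cosine([1, 0], [0, 1])             # no shared non-zero component
opposite = cosine([1, 2], [-1, -2])             # second vector is the negation

print(f"Same direction: {same_direction:.3f}")
print(f"Orthogonal:     {orthogonal:.3f}")
print(f"Opposite:       {opposite:.3f}")
```

Scaling a vector does not change the result, which is why cosine similarity is insensitive to document length.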

### 2.3.8. Measuring similar distributions: the Kullback-Leibler divergence

The Kullback-Leibler (KL) Divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. In the context of data mining, it is often used to quantify the difference or similarity between two probability distributions. KL Divergence is not a true distance metric as it is not symmetric and does not satisfy the triangle inequality.

In [None]:
import numpy as np
from scipy.stats import entropy

# Example probability distributions
distribution1 = np.array([0.2, 0.3, 0.5])
distribution2 = np.array([0.1, 0.6, 0.3])

# Calculate Kullback-Leibler Divergence
kl_divergence = entropy(distribution1, distribution2)
print(f"Kullback-Leibler Divergence: {kl_divergence}")


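    In this example, two probability distributions are represented by arrays distribution1 and distribution2. When given two arguments, the entropy function from SciPy calculates the KL Divergence of distribution1 from distribution2, using the natural logarithm by default (so the result is in nats).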

    KL Divergence is often used in information theory and statistics to measure the difference between two probability distributions. It finds applications in various areas, including natural language processing, machine learning, and pattern recognition.
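
For discrete distributions, the divergence of P from Q is the sum over all outcomes of p_i * log(p_i / q_i). The sketch below computes it by hand with NumPy for the same two distributions and evaluates both directions to illustrate the asymmetry. It assumes q_i > 0 wherever p_i > 0; the function name is illustrative.

```python
import numpy as np


def kl_divergence(p, q):
    """KL divergence D(P || Q) in nats for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


distribution1 = [0.2, 0.3, 0.5]
distribution2 = [0.1, 0.6, 0.3]

kl_pq = kl_divergence(distribution1, distribution2)
kl_qp = kl_divergence(distribution2, distribution1)
print(f"D(P || Q) = {kl_pq:.4f}")
print(f"D(Q || P) = {kl_qp:.4f}")
```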

### 2.3.9. Capturing hidden semantics in similarity measures

Capturing hidden semantics in similarity measures involves finding meaningful patterns, relationships, or similarities in data that may not be apparent at first glance. This process often requires more advanced techniques that go beyond simple distance or similarity metrics. Methods like word embeddings, semantic similarity, or advanced neural network models are employed to capture and leverage hidden semantics in the data.

In [None]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Example dataset with text documents
documents = [
    "Data mining is an exciting field of study.",
    "Machine learning techniques are used for data analysis.",
    "Python programming is widely used in data science applications."
]

# Tokenize the documents into lowercase words, dropping the trailing period
tokenized_documents = [doc.lower().rstrip('.').split() for doc in documents]

# Train a Word2Vec model on the tokenized documents
model = Word2Vec(tokenized_documents, vector_size=10, window=3, min_count=1, workers=4)

# Calculate cosine similarity using word embeddings
embedding_similarity = cosine_similarity([model.wv['data']], [model.wv['mining']])
print(f"Cosine Similarity using Word Embeddings: {embedding_similarity[0][0]}")


    In this example, the Gensim library is used to train a Word2Vec model on a small dataset of text documents. Word embeddings capture semantic relationships between words, and cosine similarity is then used to measure the similarity between the word vectors of two words ('data' and 'mining'). With a corpus this small, the learned vectors are close to random, so the resulting similarity value is only illustrative.

    Capturing hidden semantics often involves leveraging advanced techniques such as word embeddings, neural networks, or deep learning models. These methods are particularly useful in applications like natural language processing, recommendation systems, and image analysis where capturing the underlying meaning or semantics is crucial.