1. **Data Reduction**
- Introduces the chapter objectives, which include identifying dimensionality based on features, cases, and value reduction techniques. It also explains the advantages of data reduction in preprocessing for data mining.

2. **Dimensions of Large Data Sets**
- Discusses the three main dimensions of preprocessed data sets as columns (features), rows (cases), and feature values. It states that the basic operations in data reduction involve deleting a column, deleting a row, or reducing feature values.
- Provides an example of how two features can be replaced by one composite feature to reduce dimensions. It also notes that data reduction does not necessarily reduce mining quality and can sometimes improve results.
- Lists parameters for analyzing data reduction tradeoffs: computing time, accuracy, representation of models, and simplicity of representation.

3. **Feature Reduction**
- Introduces feature selection and feature composition as the two standard tasks for feature reduction, and defines each task and its objective.
- Lists the three main feature selection methods: filter, wrapper, and embedded. It also explains the difference between feature selection and feature extraction.

**Feature Selection**
- Discusses a technique for feature selection based on comparing means and variances of feature values between classes. It provides equations to formalize the technique.
- Demonstrates the feature selection technique on a sample dataset and identifies which feature is a candidate for reduction.
- Techniques for feature selection include: forward selection, backward elimination, and stepwise selection. It also notes that feature selection can be used to reduce the number of features in a dataset.
- The following equations formalize the test, where A and B are the sets of values of one feature measured for two different classes, and n_1 and n_2 are the corresponding numbers of samples:

$$ SE(A-B) = \sqrt{\frac{\mathrm{var}(A)}{n_1} + \frac{\mathrm{var}(B)}{n_2}} $$

$$ \text{TEST: } \frac{|\mathrm{mean}(A) - \mathrm{mean}(B)|}{SE(A-B)} > \text{threshold value} $$

In [54]:
# read data from csv file
import pandas as pd

""" df = pd.read_csv('data_preprocessed.csv')
df.head()

# select columns
df = df[['name', 'age_edp_binned_means', 'height', 'weight', 'genders', 'income_edp_binned_means']]

# rename columns
df = df.rename(columns={'age_edp_binned_means': 'age', 'income_edp_binned_means': 'income'})

# drop rows with missing values
df = df.dropna()

# select random sample of 20 rows
df = df.sample(n=20)

df.to_csv('data_reduced.csv', index=False) """
df = pd.read_csv('data_reduced.csv')
df.head()

Unnamed: 0,name,age,height,weight,genders,income
0,Ethan,87.0,316,26.0,F,H
1,Alexander,53.65,154,140.0,M,L
2,Henry,87.0,185,187.0,M,L
3,Xiao,29.94,216,129.0,M,L
4,Adam,29.94,213,90.0,M,L


In [55]:
# Feature Selection by given function
"""
The following equations formalize the test, where A and B are the values of one feature measured for two different classes, and n1 and n2 are the corresponding numbers of samples:
$$ SE(A-B) = \sqrt{\frac{var(A)}{n1} + \frac{var(B)}{n2}} $$
$$ TEST: |mean(A) - mean(B)| / SE(A-B) > thresholdvalue $$
"""
threshold = 1.5
# calculate SE(A-B)
def mean(A): # average of A
    return sum(A) / len(A)

# NOTE: this returns the sum of squared deviations from the mean. It is not
# divided by n - 1, so it is larger than the sample variance var(A) in the formula.
def variance(A):
    m = mean(A)  # compute the mean once instead of once per element
    return sum((i - m)**2 for i in A)

def sqrt(A): # square root of A
    return A**(1/2)

def SE(A, B): # standard error of A-B    
    return sqrt(variance(A)/len(A) + variance(B)/len(B))

# calculate TEST
def TEST(A, B): # True when the means of A and B differ by more than `threshold` standard errors; a feature whose class means do NOT differ this much is the candidate for removal
    return abs(mean(A) - mean(B)) / SE(A, B) > threshold

age = df['age'].tolist()
height = df['height'].tolist()
weight = df['weight'].tolist()
# replace F with 1 and M with 0
gender = []
for g in df['genders'].tolist():
    if g == 'F':
        gender.append(1)
    else:
        gender.append(0)

print('age and height are significant: ', TEST(age, height))
print('age and weight are significant: ', TEST(age, weight))
print('age and gender are significant: ', TEST(age, gender))
print('height and weight are significant: ', TEST(height, weight))
print('height and gender are significant: ', TEST(height, gender))
print('weight and gender are significant: ', TEST(weight, gender))

age and height are significant:  True
age and weight are significant:  False
age and gender are significant:  True
height and weight are significant:  False
height and gender are significant:  True
weight and gender are significant:  True
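
The test is defined for the values of one feature split by class, not for two different features. A minimal sketch of that usage follows, assuming `var` means the sample variance (divided by n - 1). The function names, the toy values, and the threshold of 0.5 are illustrative assumptions, not taken from the data set above.

```python
import math
import statistics

def se_diff(a, b):
    """SE(A-B) = sqrt(var(A)/n1 + var(B)/n2), using the sample variance."""
    return math.sqrt(statistics.variance(a) / len(a) +
                     statistics.variance(b) / len(b))

def means_differ(a, b, threshold):
    """TEST: |mean(A) - mean(B)| / SE(A-B) > threshold."""
    return abs(statistics.mean(a) - statistics.mean(b)) / se_diff(a, b) > threshold

# Hypothetical toy data: each feature's values, split by class.
x_class_a, x_class_b = [0.3, 0.6, 0.5], [0.2, 0.7, 0.4]
y_class_a, y_class_b = [0.7, 0.6, 0.5], [0.9, 0.7, 0.9]

threshold = 0.5  # assumed threshold value
x_separates = means_differ(x_class_a, x_class_b, threshold)
y_separates = means_differ(y_class_a, y_class_b, threshold)
print("X separates the classes:", x_separates)
print("Y separates the classes:", y_separates)
```

In this sketch X fails the test (its class means are less than 0.5 standard errors apart), so X would be the candidate for reduction, while Y passes and would be kept.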


**Feature Composition**
- Provides an overview of feature composition using principal component analysis to merge features while retaining original information.
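
As a rough illustration of feature composition, the sketch below projects two strongly correlated features onto a single principal component using NumPy. The function name `pca`, the toy matrix, and the choice of an eigendecomposition of the covariance matrix are assumptions made for this example only.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(Xc, rowvar=False)            # covariance between features
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # share of variance per component
    return Xc @ eigvecs[:, :n_components], explained

# Hypothetical data: the second feature is roughly twice the first.
X = [[2.0, 4.1], [3.0, 6.0], [4.0, 7.9], [5.0, 10.1]]
Z, explained = pca(X, 1)
print(Z.shape, explained)
```

Because the two columns are almost collinear, the first component carries nearly all of the variance, so one composite feature can stand in for both originals with little loss.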

**Entropy Measure For Ranking Features**
- Introduces an unsupervised feature selection technique based on entropy measure, including calculations and algorithms.
- Provides an example of the entropy measure technique on a sample dataset.
- Notes that the entropy measure technique can be used to rank features and select the top k features.
- The normalized Euclidean distance measures the distance between two feature vectors x and y, where max_i and min_i are the largest and smallest values of feature i:

$$ d(x,y) = \sqrt{\sum_{i=1}^{n} \left(\frac{x_i - y_i}{\max_i - \min_i}\right)^2} $$

- The entropy measure is used to calculate the entropy of a feature vector, where p_i is the probability of the i-th distinct value:

$$ \mathrm{entropy}(x) = -\sum_{i=1}^{n} p_i \log_2 p_i $$

- For nominal features, similarity is derived from the Hamming distance: it is the fraction of features on which the two feature vectors agree, where $|x_i = y_i|$ is 1 if $x_i = y_i$ and 0 otherwise:

$$ S(x,y) = \frac{1}{n} \sum_{i=1}^{n} |x_i = y_i| $$

- For binary features, similarity can be calculated with the Jaccard coefficient: the number of features that are 1 in both feature vectors divided by the number of features that are 1 in at least one of them:

$$ J(x,y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} (x_i + y_i - x_i y_i)} $$
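
The three measures above can be sketched in plain Python. Here the nominal measure is written as a similarity (the fraction of matching features) and the Jaccard coefficient as the usual ratio of "1 in both" to "1 in at least one". The function names and sample vectors are illustrative assumptions, and the Euclidean version assumes that max and min differ for every feature.

```python
import math

def normalized_euclidean(x, y, mins, maxs):
    """Euclidean distance with each feature scaled by its range (max - min)."""
    return math.sqrt(sum(((a - b) / (hi - lo)) ** 2
                         for a, b, lo, hi in zip(x, y, mins, maxs)))

def hamming_similarity(x, y):
    """Fraction of nominal features on which x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Features that are 1 in both vectors over features that are 1 in either."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0

d = normalized_euclidean([1.0, 10.0], [3.0, 20.0], mins=[0.0, 0.0], maxs=[4.0, 20.0])
s = hamming_similarity(["A", "X", "1"], ["A", "Y", "1"])
j = jaccard([1, 0, 1, 1], [1, 1, 0, 1])
print(d, s, j)
```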

- Entropy calculation and a sequential backward ranking algorithm are used to rank features. The steps of the algorithm are listed in the code cells that follow.

In [56]:
# Create a similarity matrix for each row of the dataset

# calculate similarity between two rows

def similarity(row1, row2):
    # similarity is the number of columns where the two rows have the same value,
    # returned as the string "matches / columns"
    matches = 0
    for i in range(len(row1)):
        # positional access via .iloc; row1[i] on a label-indexed Series is deprecated
        if row1.iloc[i] == row2.iloc[i]:
            matches += 1
    return str(matches) + " / " + str(len(row1))

# calculate similarity matrix for the dataset
def similarity_matrix(df):
    # calculate similarity matrix for the dataset
    similarity_matrix = []
    for i in range(len(df)):
        row = []
        for j in range(len(df)):
            row.append(similarity(df.iloc[i], df.iloc[j]))
        similarity_matrix.append(row)
    return similarity_matrix

# calculate similarity matrix for the dataset
# (stored under a new name so the similarity_matrix function is not overwritten)
sim_matrix = similarity_matrix(df)

# print similarity matrix with row and column labels using pandas
df_similarity_matrix = pd.DataFrame(sim_matrix, columns=df['name'].tolist(), index=df['name'].tolist())
df_similarity_matrix


Unnamed: 0,Ethan,Alexander,Henry,Xiao,Adam,Michael,Elizabeth,Abigail,Luke,Varun,Jason,Penelope,Jayden,Sofia,Jacob,Daniel,Joseph,Benjamin,Matthew,Aubrey
Ethan,6 / 6,0 / 6,1 / 6,0 / 6,0 / 6,0 / 6,2 / 6,1 / 6,0 / 6,2 / 6,2 / 6,0 / 6,0 / 6,1 / 6,0 / 6,2 / 6,2 / 6,2 / 6,0 / 6,2 / 6
Alexander,0 / 6,6 / 6,2 / 6,2 / 6,2 / 6,3 / 6,0 / 6,1 / 6,1 / 6,1 / 6,1 / 6,2 / 6,3 / 6,0 / 6,1 / 6,0 / 6,0 / 6,1 / 6,2 / 6,1 / 6
Henry,1 / 6,2 / 6,6 / 6,2 / 6,2 / 6,2 / 6,1 / 6,1 / 6,1 / 6,2 / 6,0 / 6,1 / 6,2 / 6,0 / 6,1 / 6,0 / 6,0 / 6,2 / 6,2 / 6,2 / 6
Xiao,0 / 6,2 / 6,2 / 6,6 / 6,3 / 6,2 / 6,0 / 6,0 / 6,2 / 6,1 / 6,0 / 6,1 / 6,2 / 6,1 / 6,2 / 6,1 / 6,1 / 6,1 / 6,3 / 6,1 / 6
Adam,0 / 6,2 / 6,2 / 6,3 / 6,6 / 6,2 / 6,0 / 6,0 / 6,2 / 6,1 / 6,0 / 6,1 / 6,2 / 6,1 / 6,2 / 6,1 / 6,1 / 6,1 / 6,3 / 6,1 / 6
Michael,0 / 6,3 / 6,2 / 6,2 / 6,2 / 6,6 / 6,0 / 6,1 / 6,1 / 6,1 / 6,1 / 6,2 / 6,3 / 6,0 / 6,1 / 6,0 / 6,0 / 6,1 / 6,2 / 6,1 / 6
Elizabeth,2 / 6,0 / 6,1 / 6,0 / 6,0 / 6,0 / 6,6 / 6,2 / 6,1 / 6,1 / 6,1 / 6,1 / 6,0 / 6,2 / 6,1 / 6,1 / 6,1 / 6,1 / 6,0 / 6,1 / 6
Abigail,1 / 6,1 / 6,1 / 6,0 / 6,0 / 6,1 / 6,2 / 6,6 / 6,1 / 6,0 / 6,2 / 6,2 / 6,1 / 6,2 / 6,1 / 6,1 / 6,1 / 6,0 / 6,0 / 6,0 / 6
Luke,0 / 6,1 / 6,1 / 6,2 / 6,2 / 6,1 / 6,1 / 6,1 / 6,6 / 6,1 / 6,0 / 6,2 / 6,1 / 6,2 / 6,3 / 6,1 / 6,1 / 6,1 / 6,2 / 6,1 / 6
Varun,2 / 6,1 / 6,2 / 6,1 / 6,1 / 6,1 / 6,1 / 6,0 / 6,1 / 6,6 / 6,1 / 6,1 / 6,1 / 6,0 / 6,1 / 6,1 / 6,1 / 6,3 / 6,1 / 6,3 / 6


In [57]:
"""
■ From information theory, we know that entropy is a global measure, and that it is less for ordered configurations and higher for disordered configurations.
■ The proposed technique compares the entropy measure for a given data set before and after removal of a feature. If the two measures are close, then the reduced set of features will satisfactorily approximate the original set.
■ For a data set of N samples, the entropy measure is
    E = − Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} ( S_ij × log S_ij + (1 − S_ij) × log(1 − S_ij) )
■ where S_ij is the similarity between samples x_i and x_j. This measure is computed in each of the iterations as a basis for deciding the ranking of features. We rank features by gradually removing the feature that is least important for maintaining the order in the configurations of data. The steps of the algorithm are based on sequential backward ranking, and they have been successfully tested on several real-world applications.
"""
# Entropy Measure For Ranking Features:
# 1. Start with the initial full set of features F.
# 2. For each feature f ∈ F, remove f from F to obtain a subset F_f. Find the difference between the entropy for F and the entropy for every F_f. In our example, we have to compare the differences (E_F − E_{F−F1}), (E_F − E_{F−F2}), and (E_F − E_{F−F3}).
# 3. Let f_k be the feature for which the difference between the entropy for F and the entropy for F_{f_k} is minimum.
# 4. Update the set of features F = F − {f_k}, where − is the difference operation on sets. In our example, if the difference (E_F − E_{F−F1}) is minimum, then the reduced set of features is {F2, F3}, and F1 goes to the bottom of the ranked list.
# 5. Repeat steps 2-4 until there is only one feature in F.

features = df.columns.tolist()
features.remove('name')
features

['age', 'height', 'weight', 'genders', 'income']

In [64]:
# 2. For each feature f ∈ F, remove f from F to obtain a subset F_f. Find the difference between the entropy for F and the entropy for every F_f. In our example, we have to compare the differences (E_F − E_{F−F1}), (E_F − E_{F−F2}), and (E_F − E_{F−F3}).

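def entropy(df):
    # NOTE: simplified stand-in for the entropy measure. It sums the number of
    # matching values over all ordered pairs of rows (each row is also paired
    # with itself) instead of evaluating the S*log(S) + (1-S)*log(1-S) formula.
    total = 0
    for i in range(len(df)):
        for j in range(len(df)):
            sim = int(similarity(df.iloc[i], df.iloc[j]).split(' / ')[0])
            total += sim
    return total

def entropy_difference(df, feature):
    # difference in the measure between the full dataset and the dataset without `feature`
    # (drop returns a new DataFrame, so df itself is not modified)
    return entropy(df) - entropy(df.drop(columns=[feature]))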

# calculate entropy difference for the dataset when feature is removed
entropy_differences = []
for feature in features:
    entropy_differences.append(entropy_difference(df, feature))

# print entropy difference with feature labels using pandas
df_entropy_difference = pd.DataFrame(entropy_differences, columns=['entropy differences'], index=features)
df_entropy_difference



Unnamed: 0,entropy differences
age,136
height,24
weight,26
genders,218
income,134


In [59]:
# # 3. Let fk be a feature such that the difference between entropy for F and entropy for Ffk is minimum.
# 
# # find feature with minimum entropy difference
# 
# # find feature with minimum entropy difference
# min_entropy_difference_feature = df_entropy_difference['entropy differences'].idxmin()
# print("feature with minimum entropy difference: ", min_entropy_difference_feature)
# 
# # 4. Update the set of features F = F − {Fk}, where − is a difference operation on sets. In our example, if the difference (EF − EF−F1) is minimum, then the reduced set of features is {F2, F3}. F1 becomes the bottom of the ranked list.
# 
# # update set of features
# features.remove(min_entropy_difference_feature)
# 
# # 5. Repeat steps 2-4 until there is only one feature in F.
# 
# # repeat steps 2-4 until there is only one feature in F
# while len(features) > 1:
#     # restrict the data to the features still in F, so each pass works on the reduced set
#     df_current = df[['name'] + features]
#     entropy_differences = []
#     for feature in features:
#         entropy_differences.append(entropy_difference(df_current, feature))
#     df_entropy_difference = pd.DataFrame(entropy_differences, columns=['entropy differences'], index=features)
#     min_entropy_difference_feature = df_entropy_difference['entropy differences'].idxmin()
#     print("feature with minimum entropy difference: ", min_entropy_difference_feature)
#     features.remove(min_entropy_difference_feature)
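
The ranking steps can also be sketched end to end with the pairwise entropy measure E = − Σ_{i<j} (S_ij log S_ij + (1 − S_ij) log(1 − S_ij)). This is a minimal sketch under stated assumptions: all features are nominal, S_ij is the fraction of matching features, the logarithm is natural, the absolute difference is used for comparison, and the five sample rows are hypothetical.

```python
import math

def similarity(x, y):
    """Fraction of features on which two nominal samples agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def entropy_measure(rows):
    """E = -sum over pairs i<j of S*log(S) + (1-S)*log(1-S)."""
    total = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            s = similarity(rows[i], rows[j])
            for p in (s, 1.0 - s):
                if p > 0.0:               # treat 0*log(0) as 0
                    total += p * math.log(p)
    return -total

def rank_features(rows, names):
    """Sequential backward ranking: names ordered from least to most important."""
    remaining = list(range(len(names)))
    ranked = []
    while len(remaining) > 1:
        full = entropy_measure([[r[k] for k in remaining] for r in rows])

        def diff(f):
            kept = [k for k in remaining if k != f]
            return abs(full - entropy_measure([[r[k] for k in kept] for r in rows]))

        least = min(remaining, key=diff)  # removal changes the entropy the least
        ranked.append(names[least])
        remaining.remove(least)
    ranked.append(names[remaining[0]])
    return ranked

# Hypothetical nominal samples with three features.
rows = [["A", "X", "1"], ["B", "Y", "2"], ["C", "Y", "2"],
        ["B", "X", "1"], ["C", "Z", "3"]]
ranking = rank_features(rows, ["F1", "F2", "F3"])
print(ranking)
```

Unlike the commented-out loop above, each pass here recomputes the entropy on the features that are still in F, which is what step 5 of the algorithm asks for.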

**ChiMerge Technique**
- ChiMerge is an automated discretization algorithm that analyzes the quality of multiple intervals for a given feature by using the χ2 statistic.
- The algorithm determines similarities between distributions of data in two adjacent intervals based on output classification of samples.
- If the conclusion of the χ2 test is that the output class is independent of the feature's intervals, then the intervals should be merged; otherwise, it indicates that the difference between intervals is statistically significant, and no merger will be performed.
- ChiMerge algorithm consists of three basic steps for discretization:
    1. Sort the data for the given feature in ascending order.
    2. Define initial intervals so that every value of the feature is in a separate interval.
    3. Repeat until no χ2 of any two adjacent intervals is less than the threshold value.
- After each merger, the χ2 tests for the remaining intervals are recalculated, and the two adjacent intervals with the smallest χ2 value are found. If this χ2 value is less than the threshold, the intervals are merged. If no merge is possible and the number of intervals is still greater than the user-defined maximum, the threshold value is increased.
- The χ2 test or contingency-table test is used in the methodology for determining the independence of two adjacent intervals.
- When the data are summarized in a contingency table, the χ2 test is given by the formula:

\begin{equation}
\chi^2 = \sum_{i=1}^2 \sum_{j=1}^k \frac{(A_{ij} - E_{ij})^2}{E_{ij}}
\end{equation}

where
\begin{align*}
 k &= \text{the number of classes}, \\
 A_{ij} &= \text{the number of instances in the }i\text{-th interval, }j\text{-th class}, \\
 E_{ij} &= \text{the expected frequency of } A_{ij}, \text{ which is computed as } (R_i \cdot C_j) / N, \\
 R_i &= \text{the number of instances in the }i\text{-th interval} = \sum_{j=1}^k A_{ij}, \\
 C_j &= \text{the number of instances in the }j\text{-th class} = \sum_{i=1}^2 A_{ij}, \\
 N &= \text{the total number of instances} = \sum_{i=1}^2 R_i.
\end{align*}

- The χ2 test determines whether the difference between the observed and expected frequencies is statistically significant. The null hypothesis is that the output class is independent of the two intervals. If the χ2 value is less than the threshold value, the null hypothesis is not rejected and the intervals are merged; if it is greater, the difference between the intervals is significant and they are kept separate.
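
The statistic for one pair of adjacent intervals can be sketched as follows. The input is a 2 x k table of counts A_ij (two intervals, k classes). The function name `chi2_adjacent` is illustrative, and replacing a zero expected frequency with 0.1 is an assumed convention to avoid division by zero.

```python
def chi2_adjacent(counts):
    """Chi-square statistic for a 2 x k contingency table of class counts."""
    k = len(counts[0])
    R = [sum(row) for row in counts]                      # instances per interval
    C = [counts[0][j] + counts[1][j] for j in range(k)]   # instances per class
    N = sum(R)                                            # total instances
    chi2 = 0.0
    for i in range(2):
        for j in range(k):
            E = R[i] * C[j] / N                           # expected frequency E_ij
            if E == 0:
                E = 0.1                                   # assumed small-value convention
            chi2 += (counts[i][j] - E) ** 2 / E
    return chi2

# Two intervals with one sample each, belonging to different classes.
print(chi2_adjacent([[1, 0], [0, 1]]))   # 2.0
# Two intervals with identical class proportions.
print(chi2_adjacent([[2, 2], [1, 1]]))   # 0.0
```

With two classes there is one degree of freedom, so a value of 2.0 is below the commonly used threshold of 2.706 (0.90 confidence level) and such a pair would be merged.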


In [68]:
# ChiMerge discretization algorithm
# 1. Sort the values of the attribute in ascending order and get the corresponding class labels.
# 2. Apply the ChiMerge algorithm to combine adjacent intervals with similar class distributions.
# 3. Calculate the ChiSquare value for each pair of adjacent intervals.
# 4. Merge the pair of intervals with the minimum ChiSquare value.
# 5. Repeat steps 3 and 4 until the stopping criterion is met.
# 6. Discretize the attribute based on the intervals obtained from the previous step.

sorted_df = df.sort_values(by=['income'])

# calculate class frequencies
class_frequencies = {}
for i in range(len(sorted_df)):
    income = sorted_df.iloc[i]['income']
    if income not in class_frequencies:
        class_frequencies[income] = 1
    else:
        class_frequencies[income] += 1
class_frequencies


{'H': 7, 'L': 7, 'M': 6}
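
The merge loop itself (steps 3 to 5 in the comments above) might look like the sketch below. It is an illustration under assumptions rather than a full implementation: it starts with one interval per distinct value, stops when the smallest χ2 reaches the threshold, and leaves out the user-defined maximum number of intervals. The names `chi2` and `chimerge` and the sample data are hypothetical.

```python
from collections import Counter

def chi2(a, b, classes):
    """Chi-square statistic for two adjacent intervals given as class Counters."""
    row_totals = [sum(a.values()), sum(b.values())]
    n = sum(row_totals)
    total = 0.0
    for counts, r in zip((a, b), row_totals):
        for c in classes:
            expected = r * (a[c] + b[c]) / n
            if expected == 0:
                expected = 0.1            # assumed small-value convention
            total += (counts[c] - expected) ** 2 / expected
    return total

def chimerge(values, labels, threshold):
    """Return the lower bound of each interval left after merging."""
    classes = sorted(set(labels))
    intervals = []                        # each entry: [lower bound, Counter of classes]
    for v, c in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][c] += 1      # same value: same initial interval
        else:
            intervals.append([v, Counter({c: 1})])
    while len(intervals) > 1:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:     # every adjacent pair differs significantly
            break
        intervals[best][1] += intervals[best + 1][1]   # merge the most similar pair
        del intervals[best + 1]
    return [lower for lower, _ in intervals]

bounds = chimerge([1, 2, 3, 4, 10, 11, 12, 13], list("aaaabbbb"), threshold=2.706)
print(bounds)   # [1, 10]
```

In the notebook's setting, `values` would be the numeric feature to discretize and `labels` the class column, for example the `income` categories counted above.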