# Notes

#### What do I do with categorical data that has a large number of unique categories?

When dealing with string data that represents categorical variables and has a large number of unique categories, such as email domains, directly one-hot encoding them can lead to a very high-dimensional and sparse dataset, which may not be ideal for model training due to the curse of dimensionality.

Instead, you can consider the following strategies:

1. Grouping or Aggregating Categories: You can group similar categories together based on some criteria. For example, for email domains, you could group them by the top-level domain (TLD), such as ".com", ".org", ".net", etc. This reduces the number of categories while still capturing important information.

2. Frequency-Based Encoding: Encode the categories based on their frequency of occurrence in the dataset. For example, you could replace each email domain with the count of how many times it appears in the dataset.

3. Target Encoding: Encode the categories based on the mean of the target variable (e.g., the average response rate) for each category. This can be helpful if there is a relationship between the categorical variable and the target variable.

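4. Dimensionality Reduction Techniques: After encoding the categories numerically (for example, with one-hot encoding), apply a technique such as PCA (Principal Component Analysis) or truncated SVD to compress the encoded columns while preserving most of the information. t-SNE (t-distributed Stochastic Neighbor Embedding) is better suited to visualization than to building model features, because standard t-SNE does not learn a mapping that can be applied to new data.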

5. Embedding: Represent each category as a dense, low-dimensional vector. The vectors can be learned jointly with a neural network (an embedding layer), or with word-embedding methods (e.g., Word2Vec, GloVe) when the categories occur in sequences or contexts. Embeddings can capture similarity between categories while keeping dimensionality low.

6. Hashing: Hash the categorical variables into a fixed number of bins using techniques like feature hashing. This reduces the dimensionality while still preserving some information about the categories.
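As a minimal sketch of the first strategy, the snippet below groups email domains by TLD and, as a second variant, lumps infrequent domains into a single bucket. The sample domains, the `min_count` threshold, and the `"other"` label are illustrative assumptions, not values from a real dataset.

```python
import pandas as pd

# Hypothetical sample of email domains (illustrative data only).
domains = pd.Series([
    "gmail.com", "gmail.com", "gmail.com", "yahoo.com", "yahoo.com",
    "example.org", "school.edu", "tinyshop.net",
])

# Variant A: group by top-level domain (the text after the last dot).
tld = domains.str.rsplit(".", n=1).str[-1]

# Variant B: keep domains seen at least `min_count` times, lump the rest.
min_count = 2  # assumed threshold; tune it for your data
counts = domains.value_counts()
frequent = counts[counts >= min_count].index
grouped = domains.where(domains.isin(frequent), other="other")

print(domains.nunique(), tld.nunique(), grouped.nunique())  # 5 4 3
```

Either variant can then be one-hot encoded with far fewer columns than the raw domains would need.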

#### What is Frequency-Based Encoding?

Frequency-based encoding, also known as count encoding, is a technique used to encode categorical variables based on the frequency (or count) of each category in the dataset. Instead of assigning a unique numerical value to each category, this method replaces each category with the count of how many times it appears in the dataset. Here's a more detailed explanation of how frequency-based encoding works:

1. Calculate Frequency: For each unique category in the categorical variable, calculate the frequency or count of occurrences in the dataset.

2. Replace Categories: Replace each category with its corresponding frequency value. This effectively transforms the categorical variable into a numerical variable.

3. Handling New Categories: Decide how to handle categories that are not present in the training dataset but may appear in the test or validation datasets. You can either assign them a default value (e.g., 0) or impute them based on some other criterion (e.g., median frequency).

4. Use Case: Frequency-based encoding is useful when the frequency of occurrence of each category is informative and correlated with the target variable. For example, in some datasets, categories that occur more frequently may be associated with higher or lower target values.

5. Example:
- Original Categorical Variable: ["A", "B", "A", "C", "B", "B"]
- Frequency-Based Encoded Variable: [2, 3, 2, 1, 3, 3]

```python
# Python implementation
import pandas as pd

# Sample dataset
data = {"category": ["A", "B", "A", "C", "B", "B"]}
df = pd.DataFrame(data)

# Calculate frequency of each category
frequency_map = df["category"].value_counts().to_dict()

# Replace categories with frequencies
df["frequency_encoded"] = df["category"].map(frequency_map)
```
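
Step 3 (handling new categories) could look like the following sketch. The train/test split and the default value of 0 are assumptions chosen for illustration; a median count or another fallback may suit your data better.

```python
import pandas as pd

train = pd.DataFrame({"category": ["A", "B", "A", "C", "B", "B"]})
test = pd.DataFrame({"category": ["B", "D", "A"]})  # "D" never appears in train

# Learn the counts from the training split only.
frequency_map = train["category"].value_counts()

# Categories missing from the map come back as NaN; fill them with a default.
test["frequency_encoded"] = (
    test["category"].map(frequency_map).fillna(0).astype(int)
)
print(test["frequency_encoded"].tolist())  # [3, 0, 2]
```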


#### Does frequency-based encoding introduce a hierarchical relationship into nominal data when it converts strings to numbers?

Frequency-based encoding does not assign an arbitrary order the way label encoding does, but it does not leave the data purely nominal either. Each category is replaced by its count, so the model sees the categories ordered by how common they are, and two categories with the same count become indistinguishable. This is harmless, and even useful, when frequency carries signal; otherwise the ordering is an artifact of the encoding.

However, it's crucial to be aware of potential issues that may arise when using frequency-based encoding:

- Magnitude of Numbers: The numerical values assigned to categories are based on their frequency, which can vary by orders of magnitude depending on the distribution of categories. This matters for scale-sensitive models (linear models, k-nearest neighbors, neural networks), where scaling or log-transforming the counts can help.

- Overfitting: In some cases, frequency-based encoding may lead to overfitting, especially if the target variable is correlated with the frequency of certain categories. Models may learn to rely too heavily on the frequency information, which may not generalize well to new data.

- Handling New Categories: Categories that appear infrequently or are absent from the training dataset may pose challenges during prediction. It's essential to decide how to handle such cases, either by assigning a default value or using alternative imputation methods.

- Sparse Categories: Categories with very low frequencies may not provide meaningful information and may even introduce noise into the model. It's essential to evaluate the impact of these sparse categories on model performance.
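
One more effect is worth checking on your own data: categories that share a count also share an encoded value. The sketch below, on made-up data, lists the categories that collide in this way.

```python
import pandas as pd

s = pd.Series(["A", "B", "A", "C", "B", "D"])
counts = s.value_counts()
encoded = s.map(counts)

# One encoded value per category, then keep the values that occur more than once.
codes_per_category = encoded.groupby(s).first()
collisions = codes_per_category[codes_per_category.duplicated(keep=False)]
print(collisions.to_dict())  # {'A': 2, 'B': 2, 'C': 1, 'D': 1}
```

Here "A" and "B" are both encoded as 2, and "C" and "D" as 1, so the model can no longer tell the members of each pair apart.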

#### Can I use categorical data with Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is typically applied to numeric data, so categorical variables need to be converted into numeric format before applying PCA.

There are two common approaches to handle categorical variables before PCA:

- One-Hot Encoding: This approach converts categorical variables into a binary format where each category becomes a separate binary feature (0 or 1). One-hot encoding is suitable when the categories do not have an inherent ordinal relationship.

- Label Encoding: Label encoding assigns a unique integer to each category, thereby converting categorical variables into ordinal numeric format. However, it's essential to note that label encoding may introduce ordinality where none exists, which can mislead the PCA algorithm.

It is also worth asking whether the categorical variables belong in the PCA at all. PCA is unsupervised and looks only at variance among the features, so one-hot columns for rare categories contribute little variance and can be dominated by numeric features on a larger scale.
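
The one-hot route can be sketched with pandas and NumPy alone. The `color` column is made-up data, and PCA is computed here through an SVD of the mean-centered matrix rather than through a library class.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue", "red"]})

# One-hot encode: one 0/1 column per category.
X = pd.get_dummies(df["color"]).to_numpy(dtype=float)

# PCA via SVD of the mean-centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal components.
X_pca = X_centered @ Vt[:2].T
print(X_pca.shape)  # (6, 2)
print(explained_variance_ratio.round(3))
```

Because the one-hot columns of a single variable always sum to 1, the last component carries essentially no variance: k categories yield at most k - 1 informative components.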

#### What is Target Encoding?

Target encoding, also known as mean encoding or likelihood encoding, encodes a categorical variable using the target variable. Unlike one-hot encoding or label encoding, which create binary columns or arbitrary integer codes, target encoding replaces each category with the mean (or another statistic) of the target for that category. It applies to regression, where the value is the category's mean target, and to binary classification, where it is the proportion of positive labels in the category.

Here's a step-by-step explanation of how target encoding works:

1. Calculate the Mean Target for Each Category: For each unique category in the categorical variable, compute the mean of the target variable (e.g., the proportion of positive class labels).

2. Replace Categories with Mean Target Values: Replace each category in the categorical variable with its corresponding mean target value. This effectively transforms the categorical variable into a numeric feature.

3. Handle Rare Categories: Deal with rare categories by applying smoothing techniques to prevent overfitting. This involves blending the mean target value with the overall mean of the target variable or using other regularization methods.

4. Validation Strategy: Compute the encoding from the training data only, ideally out-of-fold (each row is encoded with target means computed from the other folds), and evaluate it with cross-validation or a holdout set. Otherwise a row's own target leaks into its feature value, and validation scores look better than they should.

5. Apply Encoding to Test Data: Use the target encoding mappings learned from the training data to transform the categorical variables in the test dataset. Avoid re-calculating mean target values using the test set to maintain consistency.
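
The steps above can be sketched with pandas as follows. The data, the column names, and the smoothing strength `m` are assumptions for illustration; the smoothing shown is simple additive shrinkage toward the global mean, which is one of several possible regularization choices.

```python
import pandas as pd

train = pd.DataFrame({
    "domain": ["a.com", "a.com", "a.com", "a.com", "b.org", "b.org", "c.net"],
    "target": [1, 1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"domain": ["a.com", "c.net", "new.io"]})

# Step 1: per-category mean and count, computed on the training data only.
global_mean = train["target"].mean()
stats = train.groupby("domain")["target"].agg(["mean", "count"])

# Step 3: shrink each category mean toward the global mean.
m = 2.0  # assumed smoothing strength; larger values shrink rare categories more
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Steps 2 and 5: map the learned values; unseen categories get the global mean.
train["domain_te"] = train["domain"].map(smoothed)
test["domain_te"] = test["domain"].map(smoothed).fillna(global_mean)
print(test)
```

Note that "c.net", seen once with target 1, is encoded as roughly 0.71 rather than 1.0. For simplicity the training rows here are encoded in-sample; an out-of-fold scheme would reduce leakage further.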

Target encoding offers several advantages:

- It captures information about the relationship between the categorical variable and the target variable.
- It reduces dimensionality compared to one-hot encoding, which is particularly useful for high-cardinality categorical variables.
- It can be effective in cases where there is a strong correlation between the categorical variable and the target variable.

However, target encoding also has limitations and considerations:

- It may lead to overfitting, especially for categories with few observations. Smoothing techniques are often applied to address this issue.
- It may not perform well for categories with noisy or sparse data.
- It requires careful validation and handling of rare categories to avoid data leakage and ensure generalization to unseen data.


#### What is Hashing?

In the context of machine learning (ML), hashing is often used as a technique to convert categorical features into a format that can be used by machine learning algorithms. Here's how hashing works in the context of ML:

1. Categorical Features: In many real-world datasets, you'll encounter categorical features - variables that represent categories or labels rather than numerical values. Examples include "color," "city," or "product category."

2. Encoding Categorical Features: Machine learning algorithms typically require numerical input data. Therefore, categorical features need to be encoded into numerical values before they can be used for training models.

3. One-Hot Encoding vs. Hashing: One common approach to encode categorical features is one-hot encoding, where each category is converted into a binary vector with a 1 in the corresponding position and 0s elsewhere. However, one-hot encoding can lead to high-dimensional and sparse feature representations, especially if the categorical features have a large number of unique categories.

4. Hashing Trick: The hashing trick offers an alternative approach to encoding categorical features. Instead of creating a separate binary feature for each category, a hash function maps each category to one of a fixed number of hash buckets, and the bucket index becomes the feature column. The number of columns is therefore chosen up front and does not grow with the number of categories, which keeps dimensionality bounded.

5. Hash Functions: Hash functions used in the hashing trick map each category to a hash bucket deterministically. Different categories may be mapped to the same hash bucket, resulting in collisions. However, in practice, the effect of collisions is often mitigated by using a sufficiently large number of hash buckets.

6. Advantages and Considerations: The hashing trick can be computationally efficient and memory-efficient, especially when dealing with high-cardinality categorical features. However, it may lead to some loss of information due to collisions, and the resulting feature representations may not be as interpretable as those produced by one-hot encoding.

7. Usage in ML Pipelines: In ML pipelines, the hashing trick is often applied as a preprocessing step to encode categorical features before training machine learning models. It allows handling of categorical features with a large number of categories without significantly increasing the dimensionality of the feature space.
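
To make the mechanism concrete, here is a minimal sketch of the hashing trick written from scratch. The helper names `hash_bucket` and `hash_encode`, the choice of MD5, and the bucket count are illustrative assumptions, not the method of any particular library.

```python
import hashlib
import numpy as np

def hash_bucket(category: str, n_buckets: int) -> int:
    """Map a category string to a bucket index deterministically.

    hashlib is used because Python's built-in hash() for strings is
    randomized per process and would not be reproducible across runs.
    """
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(categories, n_buckets):
    """Return an (n_samples, n_buckets) indicator matrix."""
    X = np.zeros((len(categories), n_buckets), dtype=np.int8)
    for row, category in enumerate(categories):
        X[row, hash_bucket(category, n_buckets)] = 1
    return X

cities = ["Paris", "Lagos", "Lima", "Paris", "Osaka"]
X = hash_encode(cities, n_buckets=4)
print(X.shape)  # (5, 4)
```

The two "Paris" rows always land in the same column, and with only 4 buckets some distinct cities may share a column as well. In practice a library implementation such as scikit-learn's `FeatureHasher` is the usual choice.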


#### Should I hash my data before PCA?

Yes, you can use hashing to encode categorical data and then apply Principal Component Analysis (PCA) to the resulting feature space. However, there are a few considerations to keep in mind:

1. Dimensionality Reduction: The primary goal of PCA is to reduce the dimensionality of the feature space while preserving most of the variance in the data. Hashing can help reduce the dimensionality of categorical features, especially if the number of unique categories is large. This can be beneficial when dealing with high-dimensional data.

2. Sparse vs. Dense Representation: Both one-hot encoding and feature hashing produce sparse matrices (mostly zeros); hashing simply caps the number of columns. Standard PCA centers the data, which turns a sparse matrix into a dense one and can exhaust memory when there are many columns. For sparse hashed features, truncated SVD, which skips centering and works directly on sparse input, is often the more practical choice.

3. Information Loss: Hashing may lead to collisions, where different categories are mapped to the same hash bucket. This can result in some loss of information, especially if the number of hash buckets is relatively small compared to the number of unique categories. PCA will operate on this transformed data, so the potential information loss should be considered.

4. Interpretability: The interpretability of the resulting principal components may be affected by the hashing process. While PCA provides a lower-dimensional representation of the data, the individual components may not directly correspond to specific original features, especially if hashing has been applied.
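
A possible hash-then-reduce pipeline is sketched below with scikit-learn. It uses `TruncatedSVD` rather than `PCA` because the hashed matrix is sparse; the domain list, `n_features=16`, and `n_components=3` are arbitrary illustrative choices.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher

domains = ["gmail.com", "yahoo.com", "example.org", "gmail.com",
           "school.edu", "yahoo.com", "tinyshop.net", "gmail.com"]

# Each sample is a list of string tokens; here there is one token per row.
hasher = FeatureHasher(n_features=16, input_type="string")
X_hashed = hasher.transform([[d] for d in domains])  # SciPy sparse matrix

# TruncatedSVD accepts sparse input and does not center the data.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X_hashed)
print(X_hashed.shape, X_reduced.shape)  # (8, 16) (8, 3)
```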


#### PCA - on entire dataset or subset?

PCA is typically fitted on the full feature matrix rather than on separate subsets of features. Categorical features must first be converted to numbers with an encoding such as one-hot, label, or target encoding, and numeric features should be standardized so that columns with large units do not dominate the components. Once every feature is numeric and on a comparable scale, PCA can be applied to the combined dataset.

Performing PCA separately on different subsets of the dataset (e.g., numeric and categorical features) and then combining them afterward is not a common approach. PCA is designed to find linear combinations of all features that capture the maximum variance in the data, and applying it separately to different subsets may not effectively capture the relationships between features.
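
A combined workflow might look like the scikit-learn sketch below, which scales the numeric columns, one-hot encodes the categorical one, and fits a single PCA on the result. The column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 38],
    "income": [32000, 58000, 71000, 45000, 90000, 62000],
    "city": ["Paris", "Lima", "Paris", "Osaka", "Lima", "Paris"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense array, which PCA expects
)
pipeline = Pipeline([("preprocess", preprocess), ("pca", PCA(n_components=2))])
components = pipeline.fit_transform(df)
print(components.shape)  # (6, 2)
```

Wrapping the steps in one pipeline means the same scaling, encoding, and projection learned on the training data are reused unchanged on new data.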

#### PCA - will it be computationally costly if the dataset is very large?

1. Computational Cost: PCA on a large dataset can require significant computational resources. For n samples and d features, forming the covariance matrix costs on the order of n * d^2 operations and its eigendecomposition on the order of d^3, so the cost grows quickly with the number of features. Running PCA on subsets of features reduces the cost of each run, but you still need to perform a separate PCA per subset, which can be time-consuming.

2. Dimensionality Reduction: The primary goal of PCA is to reduce the dimensionality of the data while preserving the maximum amount of variance. Running PCA separately on subsets of features may not effectively capture the overall variance structure of the dataset, potentially leading to loss of important information.

3. Interpretability: PCA aims to find orthogonal linear combinations of features that explain the variance in the data. Combining PCA results from different subsets may complicate the interpretation of the principal components, as they may not be directly comparable or meaningful when considered together.

Alternative Approaches: If computational cost is a concern, consider more scalable variants and alternatives, such as incremental (mini-batch) PCA, PCA with a randomized SVD solver, or Random Projection. t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are nonlinear methods used mainly for visualization; t-SNE in particular is usually slower than PCA on large datasets, so it is not a remedy for computational cost.
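
Two scalable options available in scikit-learn are sketched below on synthetic data: `IncrementalPCA`, which processes mini-batches, and `PCA` with the randomized solver. The matrix size, batch count, and number of components are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))  # synthetic stand-in for a large matrix

# Option 1: fit in mini-batches so the full matrix never has to be in memory.
ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)
X_ipca = ipca.transform(X)

# Option 2: randomized SVD approximates only the leading components.
rpca = PCA(n_components=5, svd_solver="randomized", random_state=0)
X_rpca = rpca.fit_transform(X)
print(X_ipca.shape, X_rpca.shape)  # (2000, 5) (2000, 5)
```

In a real out-of-core setting the batches would be read from disk one at a time instead of being split from an in-memory array.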

#### Recall - How can we improve recall?

##### Data-Level Techniques

1. Oversampling:

This is a common approach for imbalanced datasets. Duplicate samples from the minority class to create a more balanced class distribution. This forces the model to pay more attention to the minority class during training. However, be cautious of overfitting, as replicating data points can lead to the model memorizing the training data instead of learning generalizable patterns.

2. SMOTE (Synthetic Minority Oversampling Technique):

This is a more advanced technique for oversampling. It creates synthetic data points for the minority class by interpolating between existing minority class samples. This can be more effective than simple duplication as it introduces some variation into the minority class data.

3. Undersampling:

This involves randomly removing data points from the majority class until the classes are closer in size. It is simple to implement, but it discards real data, which can hurt model performance when the dataset is already small.
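
The three techniques can be sketched with pandas and NumPy as follows. The data is synthetic, and `smote_like` is a simplified, hypothetical illustration of the interpolation idea behind SMOTE, not the reference implementation. Resampling should be applied to the training split only, so that evaluation data keeps the real class distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "label": [1] * 10 + [0] * 90,  # 10% minority class
})
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Random oversampling: draw minority rows with replacement.
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Random undersampling: keep a random subset of majority rows.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

def smote_like(X, n_new, k=3, rng=rng):
    """Interpolate between minority points and their k nearest minority neighbors."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X[base] + gap * (X[picked] - X[base])

synthetic = smote_like(minority[["x1", "x2"]].to_numpy(), n_new=80)
print(oversampled["label"].value_counts().to_dict())   # {0: 90, 1: 90}
print(undersampled["label"].value_counts().to_dict())  # {0: 10, 1: 10}
print(synthetic.shape)                                 # (80, 2)
```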

##### Model-Level Techniques

1. Cost-Sensitive Learning:

CatBoost supports class weighting, which lets you assign a higher cost to misclassifying minority class samples during training. This tells the model that it's more critical to avoid missing those cases, even if it means sacrificing some accuracy on the majority class.

2. Hyperparameter Tuning:

Experiment with different CatBoost hyperparameters that can influence recall. Some parameters to consider include:
- loss_function: Changing from the default Logloss to a custom function that penalizes false negatives more heavily might be beneficial.
- class_weights: Assign higher weights to the minority class samples during training. This is how the cost-sensitive learning described above is configured in practice.
- n_estimators (number of trees; an alias of CatBoost's iterations parameter): Increasing the number of trees can sometimes improve recall, but be mindful of overfitting.

3. Threshold Adjustment:

By default, CatBoost predicts a class based on the probability threshold of 0.5. Experiment with adjusting this threshold. Lowering the threshold can increase recall by classifying more samples as the positive class (minority class in your case). However, this will also increase false positives. You'll need to find a balance between recall and precision based on your specific needs.
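
The trade-off can be seen in a small NumPy sketch. The labels and probabilities below are made up; with a fitted classifier, `y_prob` would typically come from something like `predict_proba(X)[:, 1]`.

```python
import numpy as np

# Hypothetical true labels and predicted positive-class probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.45, 0.3, 0.2, 0.1, 0.05])

def precision_recall_at(threshold):
    """Classify as positive when probability >= threshold, then score."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.5, 0.35, 0.15):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.50 precision=0.67 recall=0.50
# threshold=0.35 precision=0.60 recall=0.75
# threshold=0.15 precision=0.50 recall=1.00
```

Choose the threshold on a validation set, not on the test set, so the reported recall is not optimistic.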

##### Evaluation Metrics

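When dealing with imbalanced data, accuracy alone can be misleading: a model that always predicts the majority class scores high accuracy with zero recall. Consider F1-score, which combines precision and recall, or AUC-ROC, which measures how well the model ranks positives above negatives. With heavy imbalance, the precision-recall curve (and its area, PR-AUC) is often more informative than AUC-ROC because it focuses on the positive class.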
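
A small worked example, on made-up labels, shows why accuracy can mislead: a model that never predicts the positive class still looks accurate.

```python
# 1,000 samples, 5% positive; the model predicts "negative" for everything.
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, recall, f1)  # 0.95 0.0 0.0
```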