### STRUCTURE FOR THE CLUSTERING MODEL

1. EDA: data exploration and visualisation, to see what we have and what we can do with it. Explore data types, outliers (boxplot, or a log transform for skewed variables), and missing values. Visualize them.
2. Feature engineering: make the features acceptable for the model, e.g. normalize them to [0, 1].
3. Model testing and evaluation
4. Model run and prediction
5. Tuning

When stuck on what to do next, always go back to the description here.<br>

Preparing variables for K-means clustering involves several important steps to ensure that the clustering process is effective and meaningful. Here are the key steps to prepare your variables for K-means:

### Preparing Data
#### 1. Data Collection and Cleaning:

Gather your dataset, ensuring that it is complete and free of errors.<br>
Handle any missing values in your dataset using techniques like imputation or removal of incomplete records.<br>
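
A minimal sketch of the two options mentioned above (removal and imputation), assuming pandas is used. The toy data and column names are hypothetical, and median imputation is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "view_item": [10, 25, np.nan, 40, 5],
    "add_to_cart": [1, np.nan, 3, 8, np.nan],
    "purchase": [0, 1, 1, np.nan, 0],
})

# Share of missing values per column, to help decide between imputation and removal.
missing_share = df.isna().mean()
print(missing_share)

# Option 1: drop rows that have any missing value.
df_dropped = df.dropna()

# Option 2: fill each gap with the column median (less sensitive to outliers than the mean).
df_imputed = df.fillna(df.median(numeric_only=True))
```

Dropping rows is only reasonable when few records are affected; in this toy frame it would discard most of the data, so imputation would be the safer route.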

#### 2. Feature Selection:

Carefully select the features (variables) that are relevant to your clustering task. Including irrelevant features can negatively impact the results.<br>

#### 3. Normalization/Standardization:

Normalize or standardize your data to ensure that all features have similar scales. This is particularly important because K-means relies on distance measures, and differences in feature scales can bias the clustering.<br>
Common methods include z-score standardization (subtracting the mean and dividing by the standard deviation) or min-max scaling (scaling features to a specific range, e.g., [0, 1] or [-1, 1]).<br>
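
Both scaling methods can be sketched as follows, assuming scikit-learn is available (the source does not name a library, and the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (e.g. event counts vs. revenue).
X = np.array([
    [1.0, 200.0],
    [2.0, 1500.0],
    [3.0, 800.0],
    [4.0, 12000.0],
])

# z-score standardization: each column gets mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X)

# min-max scaling: each column is mapped to the [0, 1] range.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Without scaling, the second column would dominate every Euclidean distance simply because its values are larger.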

#### 4. Outlier Detection and Handling:

Identify and address outliers in your dataset, as they can significantly affect K-means clustering results. You can remove, transform, or down-weight outliers based on the nature of your data and domain knowledge.<br>
Sometimes it is difficult to decide whether points should be treated as outliers. Some rules of thumb:
- if they make up less than 1% of cases, deletion could be a good option
- if they make up ~10%, they more likely reflect a real type of behavior OR fraud. If in doubt, consult the client about the strange behavior; they may know whether it is standard.
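
To apply the rules of thumb above, you first need the share of outliers. A rough sketch, assuming Tukey's 1.5 * IQR rule as the detection criterion (one common choice, not the only one) and synthetic data:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed variable: mostly moderate values plus a few extreme ones.
rng = np.random.default_rng(0)
values = pd.Series(
    np.concatenate([rng.normal(50, 5, 995), [500, 650, 700, 900, 1200]])
)

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

outlier_share = is_outlier.mean()
print(f"outlier share: {outlier_share:.2%}")

# A small share can simply be dropped; alternatively, log1p compresses the long tail.
values_clean = values[~is_outlier]
values_log = np.log1p(values.clip(lower=0))
```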

#### 5. Feature Engineering:

Consider creating new features or transformations of existing features that may enhance the clustering process. Feature engineering can help capture underlying patterns in the data.<br>
Check for collinearity between the variables. If there is any, consider removing variables or transforming them.<br>
Try to understand the meaning of the variables in order to build new ones. For example, it could be useful to derive `add_to_cart_per_view` from the total `add_to_cart` and total `view_item` events.
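
A short sketch of the `add_to_cart_per_view` example together with a quick collinearity check, assuming pandas and a hypothetical per-user table. Mapping zero-view users to 0 is an assumption, not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical per-user event totals.
df = pd.DataFrame({
    "view_item": [100, 40, 0, 250, 12],
    "add_to_cart": [10, 8, 0, 5, 6],
})

# Ratio feature; users with zero views get 0 instead of a division by zero.
df["add_to_cart_per_view"] = np.where(
    df["view_item"] > 0,
    df["add_to_cart"] / df["view_item"].replace(0, np.nan),
    0.0,
)

# Pairwise Pearson correlations as a quick collinearity check.
corr = df.corr()
print(corr.round(2))
```

Pairs with an absolute correlation close to 1 are candidates for removal or for being combined into a single feature.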

#### 6. Dimensionality Reduction:

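If you have a high-dimensional dataset, consider applying dimensionality reduction techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the number of features while retaining the most important information. Keep in mind that t-SNE is mainly a visualization tool: it does not preserve global distances, so clusters found on its output should be interpreted with care.<br>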
Another good embedding tool is the LightFM library, which learns embeddings with a hybrid matrix factorization model (you can find an example in this notebook).<br>
Embeddings can also be a good tool to search for underlying patterns, and they can reduce multicollinearity.<br>
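
A minimal PCA sketch, assuming scikit-learn and synthetic data with deliberately correlated columns. The 95% variance threshold is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 6 features, where columns 4-6 nearly duplicate columns 1-3.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(200, 3))])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.round(3))

# The resulting components are uncorrelated with each other.
print(np.corrcoef(X_reduced, rowvar=False).round(3))
```

The reduced matrix `X_reduced` can then be passed to K-means in place of the original features.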

#### 7. Encoding Categorical Variables:

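If your dataset contains categorical variables, you need to encode them into numerical values. Common methods include one-hot encoding (for nominal categories) or label encoding (better suited to ordered categories, because integer codes imply an order and a distance), depending on the nature of the categorical data.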
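
Both encodings can be sketched with pandas. The columns and the category order below are hypothetical:

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal categorical column.
df = pd.DataFrame({
    "device": ["mobile", "desktop", "tablet", "mobile"],
    "plan": ["free", "pro", "basic", "pro"],
})

# One-hot encoding for nominal categories (no inherent order).
df_encoded = pd.get_dummies(df, columns=["device"], dtype=int)

# Explicit integer mapping for ordinal categories (the order carries meaning).
plan_order = {"free": 0, "basic": 1, "pro": 2}
df_encoded["plan"] = df_encoded["plan"].map(plan_order)

print(df_encoded)
```

An explicit mapping is used for the ordinal column so that the numeric order matches the intended order rather than the alphabetical one.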

### Understanding K-means output
After the data is prepared, we can run K-means.<br>
However, to do so, we need to choose the number of clusters.<br>
Several steps are required to do that.

### THE NUMBERS

#### 1. Selection of Distance Metric:

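Choose an appropriate distance metric based on the nature of your data. Euclidean distance is the default and the most widely used: standard K-means minimizes squared Euclidean distances and computes centroids as means. Other metrics such as Manhattan distance, cosine distance, or custom distance functions may suit specific data types better, but they usually require a variant of the algorithm (e.g. K-medoids) rather than plain K-means.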

#### 2. Determining the Number of Clusters (K):

Decide on the number of clusters you want to identify. You can use methods like the elbow method, silhouette score, or domain knowledge to help determine the optimal K value.
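
A sketch of the elbow method and the silhouette score side by side, assuming scikit-learn and synthetic blobs as a stand-in for the prepared data. The range of K values is arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for prepared data.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow method)
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

For the elbow method, look for the K after which inertia stops dropping sharply; the silhouette score gives a second opinion, and the two do not always agree.
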
#### 3. Initialization:

Choose an initialization method for the cluster centroids. K-means++ is a common choice, as it tends to lead to better convergence. You can also experiment with random initialization.

#### 4. Convergence Criteria:

Set a stopping criterion to determine when the algorithm has converged. Common criteria include a maximum number of iterations or when the centroids no longer change significantly.
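
Initialization and stopping criteria map onto constructor arguments in scikit-learn's `KMeans` (assuming that library is used). The values below are the kind of settings one might start from, not recommendations from the source:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for prepared data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

kmeans = KMeans(
    n_clusters=3,
    init="k-means++",  # use init="random" to compare against random seeding
    n_init=10,         # number of restarts; the run with the lowest inertia is kept
    max_iter=300,      # hard cap on iterations per run
    tol=1e-4,          # stop early once the centroids barely move between iterations
    random_state=1,
)
kmeans.fit(X)

print("iterations used:", kmeans.n_iter_)
print("inertia:", round(kmeans.inertia_, 2))
```

If `n_iter_` equals `max_iter`, the run hit the cap before converging and a higher limit may be worth trying.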

#### 5. Metrics.
Use metrics such as `calinski_harabasz_score` or `davies_bouldin_score` (as they are named in scikit-learn) to decide which number of clusters works best.<br>
The higher the Calinski-Harabasz score, the better; the lower the Davies-Bouldin score, the better.
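
A sketch of comparing both metrics across several values of K, assuming scikit-learn and synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic stand-in for prepared data.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (
        calinski_harabasz_score(X, labels),  # higher is better
        davies_bouldin_score(X, labels),     # lower is better
    )

for k, (ch, db) in scores.items():
    print(f"k={k}: calinski_harabasz={ch:.1f}, davies_bouldin={db:.3f}")
```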

### THE VISUALIZATION

#### 1. Construct the plot
Plotting will give you a sense of whether the chosen number of clusters is actually meaningful in the chosen dimensions.
The statistics are calculated on distances and cannot determine the best number of clusters with 100% certainty, so you have to look at the candidate solutions on a plot. Compare several outputs.
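
One way to compare several outputs is to plot candidate solutions side by side. This sketch assumes matplotlib and scikit-learn, uses synthetic data, and projects to two principal components purely for display:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 5-dimensional data standing in for the prepared feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# Project to 2D only for plotting; the clustering itself uses all features.
X_2d = PCA(n_components=2).fit_transform(X)

candidate_ks = [3, 4, 5]
fig, axes = plt.subplots(1, len(candidate_ks), figsize=(12, 4))
for ax, k in zip(axes, candidate_ks):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap="tab10")
    ax.set_title(f"k = {k}")
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.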

### THE INTERPRETATION

#### 1. Use clusters to see the data
Add the cluster labels to the data and explore the differences between the clusters. Do they differ from each other? Can the differences be explained? Is there meaning behind the clusters? Do you need other variables to explain them better?
Interpretation is the final part, where you decide whether the chosen clusters are good enough or whether you need more data to re-run the algorithm.
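
A simple starting point for this comparison is a per-cluster profile. The sketch below assumes pandas and scikit-learn; the data and column names are hypothetical:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features.
df = pd.DataFrame({
    "view_item": [5, 7, 6, 120, 150, 130, 60, 55, 65],
    "add_to_cart": [0, 1, 0, 30, 45, 35, 2, 3, 1],
})

# Cluster on scaled features, then attach the labels to the original rows.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster sizes and per-cluster averages in the original (unscaled) units.
profile = df.groupby("cluster").agg(
    size=("view_item", "size"),
    mean_view_item=("view_item", "mean"),
    mean_add_to_cart=("add_to_cart", "mean"),
)
print(profile)
```

Reading the averages in original units makes it easier to give each cluster a plain-language description, or to notice that two clusters are not meaningfully different.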

### NOTE

Data Visualization:

Before running K-means, consider visualizing your data using scatter plots, PCA plots, or other techniques to gain insights into the data structure.

Data Splitting:

Consider splitting your data into a training set for clustering and a separate test set for evaluating the results. This helps detect overfitting and check that your clustering model generalizes well to new data.
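
A sketch of the splitting idea, assuming scikit-learn and synthetic data. Comparing silhouette scores on the two parts is one possible stability check, chosen here for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for prepared data.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the centroids on the training part only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Assign held-out points to the nearest learned centroid.
test_labels = kmeans.predict(X_test)

# Similar scores on both parts suggest the cluster structure is stable.
print("train silhouette:", round(silhouette_score(X_train, kmeans.labels_), 3))
print("test silhouette: ", round(silhouette_score(X_test, test_labels), 3))
```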