# Unsupervised Learning - Project

In this project, we carry out a complete unsupervised machine learning workflow on the "Wholesale Data" dataset. The dataset describes clients of a wholesale distributor and records their annual spending, in monetary units (m.u.), on diverse product categories.

[Kaggle Link](https://www.kaggle.com/datasets/binovi/wholesale-customers-data-set)

### Notebook: [MachineLearning_UnSupervised_Fonseca_Leonardo.ipynb](https://github.com/leoaugusto1976/LHL-Unsupervised-Learning-Project/blob/main/notebooks/MachineLearning_UnSupervised_Fonseca_Leonardo.ipynb)

> The file contains dedicated code sections for each part below, making it easy to identify and locate each item. Each part is highlighted for quick and convenient reference.

# Part I - EDA: Exploratory Data Analysis & Pre-processing

The dataset records the annual spending of a wholesale distributor's clients across several product categories. An exploratory data analysis (EDA) of this dataset can cover the following tasks:

- Data Import: Import the dataset into a statistical software tool such as Python or R.
- Data Cleaning: Check the dataset for any missing or incorrect data and clean the dataset accordingly. This may involve removing or imputing missing data or correcting any obvious errors.
- Data Description: Generate summary statistics such as mean, median, and standard deviation for each column of the dataset. This will help in understanding the distribution of data in each column.
- Data Visualization: Create various visualizations such as histograms, box plots, scatter plots, and heatmaps to understand the relationships and trends between the different variables in the dataset. For example, we can create a scatter plot between the "Fresh" and "Milk" variables to see if there is any correlation between them.
- Outlier Detection: Check for any outliers in the dataset and determine whether they are valid or erroneous data points.
- Correlation Analysis: Calculate the correlation between different variables in the dataset to determine which variables are highly correlated and which ones are not. For example, we can calculate the correlation between "Grocery" and "Detergents_Paper" to see if there is any relationship between these two variables.
- Data Transformation: If necessary, transform the data by standardizing or normalizing the variables to make them comparable across different scales.
- Feature Selection: Identify the features that contribute the most to the overall variance in the dataset. Principal component analysis (PCA) can show which combinations of features carry the most variance; supervised techniques such as random forest feature importance apply only when a target variable is available.
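
The cleaning, description, outlier, correlation, and transformation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it runs on a synthetic stand-in DataFrame, and the column names, the IQR outlier rule, and the choice of `StandardScaler` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real data. In practice the Kaggle CSV would be
# loaded with pd.read_csv; the column names below are assumed, not verified.
rng = np.random.default_rng(0)
columns = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
df = pd.DataFrame(rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6)), columns=columns)

# Data cleaning: count missing values per column.
missing_per_column = df.isna().sum()

# Data description: count, mean, std, min, quartiles, max for each column.
summary = df.describe()

# Outlier detection: flag values beyond 1.5 * IQR from the quartiles
# (one common rule of thumb, assumed here).
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outliers_per_column = outlier_mask.sum()

# Correlation analysis: pairwise Pearson correlation between spending columns.
corr = df.corr()

# Data transformation: standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df)

print(summary.round(1))
print(outliers_per_column)
```

Flagged outliers are only candidates for review; whether a large order is an error or a genuinely big customer is a judgment call that depends on the data.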

# Part II - KMeans Clustering

The objective of the analysis is to group similar customers into clusters based on their annual spending on fresh, milk, grocery, frozen, detergents_paper, and delicatessen products. To perform the k-means clustering analysis, you will need to pre-process the dataset, determine the optimal number of clusters, initialize the centroids, assign data points to clusters, update the centroids, and repeat until convergence.
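
As a rough sketch of that procedure, the snippet below uses scikit-learn's `KMeans`, which handles centroid initialization, assignment, and updating internally. It runs on synthetic data standing in for the six spending columns; the range of K values, `n_init=10`, and `random_state=42` are assumptions for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 customers x 6 spending features (assumed shape).
rng = np.random.default_rng(42)
X_raw = rng.lognormal(mean=8.0, sigma=1.0, size=(300, 6))

# Pre-process: put every feature on the same scale before measuring distances.
X = StandardScaler().fit_transform(X_raw)

# Elbow method: record the inertia (within-cluster sum of squares) for each K.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")

# Fit the final model with the K read off the elbow plot (4 is an example value).
chosen_k = 4
final_model = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit(X)
labels = final_model.labels_              # cluster index for each customer
centroids = final_model.cluster_centers_  # one centroid per cluster
```

Plotting `inertias` against K gives the elbow curve; the point where the curve flattens is the usual candidate for the number of clusters.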

# Part III - Hierarchical Clustering 

Hierarchical clustering is a popular unsupervised machine learning algorithm that groups similar data points into a hierarchy of nested clusters. It works by iteratively merging (agglomerative) or splitting (divisive) clusters according to a similarity measure, and the resulting hierarchy is visualized as a dendrogram.

To perform hierarchical clustering analysis, you will need to pre-process the dataset and determine the optimal number of clusters, for example by inspecting the dendrogram.
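
One possible way to do this is with SciPy's agglomerative clustering routines, sketched below on synthetic data. The Ward linkage and the cut into at most 4 clusters are assumptions chosen for the example, not values confirmed by the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 customers x 6 spending features (assumed shape).
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.lognormal(8.0, 1.0, size=(60, 6)))

# Build the merge hierarchy bottom-up; Ward linkage merges the pair of
# clusters that gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Compute the dendrogram layout without drawing it. Drop no_plot=True
# (with matplotlib installed) to render the tree and look for large gaps.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into at most 4 flat clusters (4 is an example value).
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(labels)[1:])  # number of customers in each cluster
```

A common heuristic is to cut the dendrogram where the vertical distance between successive merges is largest, since that gap marks clusters that are far apart.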

# Part IV - PCA

In this section, you are going to perform principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since PCA finds the dimensions that maximize variance, we will find which combinations of features best describe customers.
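
A minimal sketch of this analysis with scikit-learn's `PCA` is shown below, again on synthetic data. The 90% variance threshold is an assumed example value; the notebook may use a different criterion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 customers x 6 spending features (assumed shape).
rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.lognormal(8.0, 1.0, size=(200, 6)))

# Fit PCA with all components kept so the full variance breakdown is visible.
pca = PCA().fit(X)

# Share of total variance captured by the first 1, 2, ..., 6 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the assumed 90% threshold.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)

# Each row of components_ holds the feature weights of one principal
# component, i.e. which combination of spending categories it represents.
loadings = pca.components_

print(cumulative.round(3))
print("components needed for 90% variance:", n_components)
```

Features with large absolute weights in the same row of `loadings` vary together along that component, which is what makes the components interpretable as customer spending patterns.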

# Part V - Conclusion

From the model you developed and the exploratory data analysis (EDA) conducted, generate four bullet points as your findings.

- The correlation heatmap shows the correlation coefficients between the features in the dataset.

    - **Grocery and Detergents_Paper (0.85):**

        A correlation coefficient of 0.85 indicates a strong positive linear relationship between "Grocery" and "Detergents_Paper".
        It suggests that customers who spend more on groceries also tend to spend more on detergents and paper products, and vice versa.
        These two categories may have similar purchasing patterns or might often be bought together.

    - **Grocery and Milk (0.73):**

        A correlation coefficient of 0.73 indicates a moderately strong positive linear relationship between "Grocery" and "Milk".
        It suggests that there's some level of association between spending on groceries and spending on milk products, but it's not as strong as the correlation between groceries and detergents/paper.

    - **Detergents_Paper and Milk (0.68):**

        A correlation coefficient of 0.68 indicates a moderate positive linear relationship between "Detergents_Paper" and "Milk".
        It suggests a somewhat similar trend in purchasing patterns between these two categories, but the relationship is weaker than the one between groceries and detergents/paper.

    These correlations can help in understanding how certain categories of products might be related in consumer purchasing behavior. The stronger the positive correlation, the more likely those categories are bought together or exhibit similar patterns in purchases. It's important to note that correlation doesn't imply causation; it only shows the strength and direction of a linear relationship between two variables.
 
 - **K-Means Clustering:** Based on the elbow method, the optimal number of clusters, K, appears to be around 4 or 5. Beyond this point the reduction in inertia slows down noticeably, so adding more clusters yields diminishing returns. I therefore selected K=4 as the number of clusters for my analysis.

 - **Hierarchical Clustering:** The visualization suggests that the clusters are well separated in the feature space, each occupying its own distinct range of X-axis and Y-axis values. This indicates that the algorithm grouped the data into distinct clusters based on their spatial distribution.

 - **PCA:** The plot shows that adding more components raises the cumulative explained variance, meaning that more of the variance in the original dataset is retained as the number of components increases. The number of components is generally chosen as a trade-off between capturing enough variance and keeping the component count small, so that the most critical information is retained while avoiding overfitting.