# Part (b): Introduction to Data Analytics
### UE22CS342AA2 - Data Analytics 



This assignment contains the following problems:
- Problem 1
- Problem 2
- Problem 3
- Problem 4
- Problem 5
- Problem 6

*Snippet to install a package cleanly*
```
if (!requireNamespace("tidyverse", quietly = TRUE)) {
    install.packages("tidyverse")
}
```
*Load a package*

```
library(tidyverse)
```


# About The Dataset
The following is a sample dataset on customer satisfaction, based on customers' experience with a purchased product.

- CustomerID: Unique identifier of each customer.
- Age: Customer's age.
- Gender: Gender of the customer.
- Satisfaction Score: On a scale from 1 to 10.
- Purchase Frequency.
- Feedback: Feedback by the customer.

*Problem 1*

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. In the customer satisfaction dataset, can you classify the columns with missing data into different categories? (MCAR, MNAR or MAR) (1 point)



In [None]:
# Your answer here.

In [None]:
data <- read.csv("/kaggle/input/customer-satisfaction/customer_satisfaction.csv")
data

In [None]:
# Define a custom function to count missing values (NA or empty string).
count_na_or_empty <- function(x) {
  sum(is.na(x) | x == "")
}

# Apply the function to each column and count missing values
missing_counts <- sapply(data, count_na_or_empty)

# Print the missing values count
(missing_counts)

- *Purchase Frequency*: MCAR. Values are missing for products with good reviews as well as for those with bad reviews.
- *Feedback*: MNAR. The reviews with a low satisfaction score are the ones with missing feedback.
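
One way to probe such a classification is to compare an observed column across rows with and without the missing value. The sketch below uses a small made-up data frame; the column names `Satisfaction` and `Feedback` and all values are assumptions for illustration, not the actual contents of the CSV.

```r
# Hypothetical toy data (names and values are assumptions, not the real dataset)
toy <- data.frame(
  Satisfaction = c(2, 3, 9, 8, 1, 7, 10, 4),
  Feedback     = c("", NA, "Great", "Good", "", "Fine", "Loved it", NA),
  stringsAsFactors = FALSE
)

# Flag rows where the feedback is missing (NA or empty string)
toy$feedback_missing <- is.na(toy$Feedback) | toy$Feedback == ""

# Mean satisfaction score for rows with and without feedback
mean_by_missing <- tapply(toy$Satisfaction, toy$feedback_missing, mean)
print(mean_by_missing)
```

A large gap between the two group means suggests that the missingness is related to the satisfaction score rather than being completely at random.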

# About The Dataset

- The dataset below contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.
- It has been taken from [here](https://archive.ics.uci.edu/dataset/109/wine).

*Features*
- Alcohol: Percentage of alcohol in the wine (vol. %).
- Malic Acid: Concentration of malic acid (g/dm³).
- Ash: Ash content (g/dm³).
- Alcalinity of Ash: Alcalinity of ash (in terms of NaOH) (g/dm³).
- Magnesium: Magnesium content (mg/dm³).
- Total Phenols: Total phenol content (g/dm³).
- Flavanoids: Flavanoid content (g/dm³).
- Nonflavanoid Phenols: Non-flavanoid phenol content (g/dm³).
- Proanthocyanins: Proanthocyanin content (g/dm³).
- Color Intensity: Color intensity of the wine (arbitrary units).
- Hue: Hue of the wine (arbitrary units).
- OD280/OD315 of Diluted Wines: Ratio of optical densities at 280 nm and 315 nm (arbitrary units).
- Proline: Proline content (mg/dm³).

*`quality`*:
The target class of the wine.


In [None]:
if (!require(dplyr)) {
    install.packages("dplyr")
}

*Problem 2*

The mean values of the columns `Flavanoids` and `Total_phenols` in the wine dataset are 2.03 and 2.29, respectively. Although these means are close, does this imply that the distributions of these two variables are practically equivalent?

To answer this, create a histogram and overlay the probability density curve for each of the 2 variables. Discuss your findings, particularly focusing on the shape and spread of the distributions. Do add a note on the peaks of the histogram. (2 points)


In [None]:
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)


In [None]:
# Your answer here.

In [None]:
# Loading the wine dataset
data <- read.csv("/kaggle/input/wine-quality/wine_quality_combined.csv")

In [None]:
# Show the first 6 rows (the default for head()).
head(data)

In [None]:
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(moments)

# Calculate mean and standard deviation for Flavanoids
mean_f <- mean(data$Flavanoids, na.rm = TRUE)
std_dev_f <- sd(data$Flavanoids, na.rm = TRUE)

# Calculate mean and standard deviation for Total_phenols
mean_tp <- mean(data$Total_phenols, na.rm = TRUE)
std_dev_tp <- sd(data$Total_phenols, na.rm = TRUE)

paste("The mean and standard deviation of Flavanoids are: ", mean_f, std_dev_f)
paste("The mean and standard deviation of total phenols are: ", mean_tp, std_dev_tp)

# Create histograms with a KDE curve (red) and a fitted normal curve (blue) for both columns
p1 <- ggplot(data, aes(x = Flavanoids)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = 'skyblue', alpha = 0.7, color = 'black') +
  geom_density(color = 'red') +
  stat_function(fun = dnorm, args = list(mean = mean_f, sd = std_dev_f), color = 'blue', linewidth = 1) +
  labs(title = "Histogram with Normal Distribution Curve (Flavanoids)",
       x = "Flavanoids",
       y = "Density") +
  theme_minimal()

p2 <- ggplot(data, aes(x = Total_phenols)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = 'skyblue', alpha = 0.7, color = 'black') +
  geom_density(color = 'red') +
  stat_function(fun = dnorm, args = list(mean = mean_tp, sd = std_dev_tp), color = 'blue', linewidth = 1) +
  labs(title = "Histogram with Normal Distribution Curve (Total Phenols)",
       x = "Total Phenols",
       y = "Density") +
  theme_minimal()

# Stack the plots vertically (single column)
grid.arrange(p1, p2, ncol = 1)


From the histograms and the overlaid density curves, we observe that although the means are close to each other, the two distributions do not resemble each other. Their standard deviations differ: Flavanoids is more spread out than the total phenol content.

Both histograms have two peaks, indicating that values concentrate around two regions, although this is more apparent for the total phenol content.
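
The peaks and the spread can also be checked numerically instead of by eye. The sketch below is a minimal illustration on simulated bimodal data (not the wine columns): it locates local maxima of the kernel density estimate and reports two spread measures. The 10% height cut-off used to ignore tiny bumps is an arbitrary assumption.

```r
# Simulated bimodal sample (illustrative only, not the wine data)
set.seed(42)
x <- c(rnorm(200, mean = 1, sd = 0.3), rnorm(200, mean = 3, sd = 0.3))

# Kernel density estimate on a regular grid
d <- density(x)

# A local maximum is where the slope changes from positive to negative
turning <- which(diff(sign(diff(d$y))) == -2) + 1

# Ignore tiny bumps: keep peaks at least 10% as high as the tallest one
turning <- turning[d$y[turning] >= 0.1 * max(d$y)]
peaks <- d$x[turning]

print(round(peaks, 2))
cat("Standard deviation:", sd(x), " IQR:", IQR(x), "\n")
```

The same steps could be applied to `data$Flavanoids` and `data$Total_phenols` after dropping missing values.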

*Problem 3*

For each quality class of wine, compare the distributions of color intensity. Add a note on the outliers and skewness of each category. You can make use of the box plot. (1 + 1 points)



In [None]:
# Your answer here

In [None]:
qualities <- unique(data$quality)

# Create an empty list to store plots
plots <- list()

# Iterate over each quality type
for (q in qualities) {
  # Subset data for the current quality
  subset <- data %>% filter(quality == q)
  
  # Calculate skewness for color intensity
  skewness_value <- skewness(subset$Color_intensity, na.rm = TRUE)
  
  # Create a histogram with KDE
  p_hist <- ggplot(subset, aes(x = Color_intensity)) +
    geom_histogram(aes(y = after_stat(density)), bins = 10, fill = 'skyblue', color = 'black', alpha = 0.7) +
    geom_density(color = 'red', linewidth = 1) +
    labs(title = paste("Histogram of Color Intensity (Quality", q, ")"),
         subtitle = paste("Skewness:", round(skewness_value, 2)),
         x = "Color Intensity", y = "Density") +
    theme_minimal()
  
  # Create a box plot
  p_box <- ggplot(subset, aes(x = factor(quality), y = Color_intensity)) +
    geom_boxplot(fill = 'lightgreen') +
    labs(title = paste("Box Plot of Color Intensity (Quality", q, ")"),
         x = "Quality", y = "Color Intensity") +
    theme_minimal()
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- p_hist
  plots[[length(plots) + 1]] <- p_box
}

# Arrange plots in a grid
grid.arrange(grobs = plots, ncol = 2)


The distribution of color intensity for each quality class is displayed above. All three distributions are positively skewed, which can also be seen in the box plots. The skewness values for wines of quality 1, 2 and 3 are 0.57, 1.02 and 0.29, respectively.

Wines of quality 1 and 2 have outliers, whereas wines of quality 3 have none.
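
To make the two notions concrete, the sketch below flags outliers and computes skewness by hand on a small made-up vector. It assumes the usual box plot convention of flagging points more than 1.5 times the box height beyond the box, and the moment-based definition of skewness; the values are illustrative only.

```r
# Toy values (illustrative only)
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 15)

# Points lying more than 1.5 times the hinge spread beyond the hinges,
# as reported by base R's boxplot.stats()
outliers <- boxplot.stats(x)$out

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
skew <- m3 / m2^(3 / 2)

print(outliers)
print(skew)
```

A positive value indicates a longer right tail, which in a box plot shows up as a longer upper whisker or outliers above the box.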

*Problem 4*

Explain Dimensionality Reduction. Perform PCA on the dataset and extract the proportion of variance explained by each principal component. How many principal components should be retained based on the `scree plot`? Examine the loadings of the first two principal components. Which variables contribute most to these components? (2 points)

**Hint**:
Scree Plot is a common method for determining the number of PCs to be retained through a graphical representation. A Scree Plot is a simple line segment plot that shows the eigenvalues for each individual PC. 

You can learn more about a scree plot [here](https://sanchitamangale12.medium.com/scree-plot-733ed72c8608)

**Hint**:
In PCA, the contribution of each feature to a principal component is called the `loading`. Loadings are compared by their absolute values.



In [None]:
# Your answer here

In [None]:
head(data)

In [None]:
# Remove the target variable, PCA focuses only on the features.
data_modified <- (data %>% select(-quality))

# Perform PCA
# center = TRUE subtracts each column's mean from its values.
# scale. = TRUE rescales each column to unit variance.
pca_result <- prcomp(data_modified, center = TRUE, scale. = TRUE)

# Summary of PCA - check the cumulative variance.
# Notice the point where the cumulative variance saturates.
summary(pca_result)




In [None]:
eigenvalues <- pca_result$sdev^2

# Create a scree plot between component number and eigenvalues
plot(eigenvalues, type = "b", xlab = "Principal Component", ylab = "Eigenvalue", main = "Scree Plot")

Based on the proportion of variance in the PCA summary and the eigenvalues in the scree plot, we can retain 5 PCs for the reduced dataset. The scree plot forms an elbow at PC5, and the first 5 PCs account for a cumulative variance of more than 80%. Beyond this point, each additional component adds little to the cumulative variance.
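
The cumulative-variance criterion can be computed directly from the eigenvalues. The sketch below uses R's built-in `USArrests` data purely as a stand-in (it is not the wine dataset), and the 80% threshold is an assumed cut-off.

```r
# Built-in dataset used only for illustration (not the wine data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues_demo <- pca_demo$sdev^2

# Proportion of variance explained by each component, and its running total
prop_var <- eigenvalues_demo / sum(eigenvalues_demo)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative variance reaches the threshold
threshold <- 0.80  # assumed cut-off
n_keep <- which(cum_var >= threshold)[1]

print(round(prop_var, 3))
print(round(cum_var, 3))
print(n_keep)
```

A threshold rule like this is a complement to the scree plot, not a replacement: the elbow and the threshold do not always agree.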

In [None]:
# Extract the loadings of the first two principal components
loadings <- pca_result$rotation[, 1:2]

# Display the loadings
loadings

Higher absolute values in the loadings indicate stronger contribution of those variables to the component.

The columns `Flavanoids` and `Total_phenols` contribute the most to PC1.

The columns `Color_intensity` and `Alcohol` contribute the most to PC2.
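
Rather than reading the loading table by eye, the variables can be ranked by absolute loading. The sketch below again uses the built-in `USArrests` data as a stand-in; `top_contributors` is a hypothetical helper name introduced here for illustration.

```r
# Built-in dataset used only for illustration (not the wine data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Hypothetical helper: names of the n variables with the largest
# absolute loading on the given principal component
top_contributors <- function(rotation, pc, n = 2) {
  ranked <- sort(abs(rotation[, pc]), decreasing = TRUE)
  names(ranked)[seq_len(n)]
}

print(top_contributors(pca_demo$rotation, "PC1"))
print(top_contributors(pca_demo$rotation, "PC2"))
```

With the wine PCA, the same helper could be called as `top_contributors(pca_result$rotation, "PC1")`.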

*Problem 5* 

Now, for the selected components, find the percentage contribution of each feature to each of the components. (Hint: First find the squares of the loadings, also known as the cos2 values, then compute the percentage contribution from these values.) (2 points)

In [None]:
# Your answer here

In [None]:
# Load necessary libraries
library(ggplot2)
library(reshape2)

# Extract the loadings for the first 5 principal components
loadings <- pca_result$rotation[, 1:5]

# Calculate squared loadings (cos2) for the first 5 principal components
cos2 <- loadings^2

# Calculate the percentage contribution of each feature to each of the 5 components
# (the result is transposed: components in rows, features in columns)
percentage_contribution <- t(apply(cos2, 2, function(x) x / sum(x) * 100))
print(cos2)
print(percentage_contribution)
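
A quick sanity check on such a table is that the contributions to each component add up to 100%. The sketch below shows this on the built-in `USArrests` data (a stand-in, not the wine dataset), keeping features in rows and components in columns.

```r
# Built-in dataset used only for illustration (not the wine data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings of the first two components
sq_loadings <- pca_demo$rotation[, 1:2]^2

# Each column of the rotation matrix is a unit vector,
# so the squared loadings of a component sum to 1
print(colSums(sq_loadings))

# Percentage contribution of each feature to each component
contrib <- sweep(sq_loadings, 2, colSums(sq_loadings), "/") * 100
print(round(contrib, 1))
print(colSums(contrib))
```

Because each column of squared loadings already sums to 1, the division only guards against rounding; the percentages are effectively the squared loadings times 100.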


*Problem 6*

Generate a plot using the `fviz_pca_var` function to visualize how well each feature is represented by the principal components in the PCA analysis. (Pay attention to the length of the arrows in the fviz_pca_var plot.) (1 point)

In [None]:
# Your answer here.

In [None]:
# Load the factoextra package
library(factoextra)


# Cos2 plot for variables
fviz_pca_var(pca_result, col.var = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
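
The arrow lengths can be reproduced by hand. The sketch below assumes that each arrow runs from the origin to the variable's coordinates on the first two components, where a coordinate is the loading multiplied by the component's standard deviation; this is my understanding of the default plot, not something taken from the package source. It uses the built-in `USArrests` data as a stand-in.

```r
# Built-in dataset used only for illustration (not the wine data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates on PC1 and PC2: loading times the component's
# standard deviation (for scaled data, the variable-component correlation)
coords <- sweep(pca_demo$rotation[, 1:2], 2, pca_demo$sdev[1:2], "*")

# Quality of representation on the PC1-PC2 plane and the implied arrow length
cos2_plane <- rowSums(coords^2)
arrow_length <- sqrt(cos2_plane)

print(round(cbind(coords, cos2 = cos2_plane, length = arrow_length), 3))
```

An arrow length close to 1 means the variable is almost fully captured by the first two components; a short arrow means most of its variance lies in the later components.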



The retained principal components are driven mainly by Flavanoids, Total_phenols, Alcohol, Color_intensity and a few more features.

PCA is typically performed when a dataset has many features. The reduced dataset takes considerably less space and less time to train classifiers or regression models, because it keeps only the directions that carry most of the variance.

*fin*