In [None]:
# Load required libraries
library(dplyr)      # For data manipulation
library(ggplot2)    # For data visualization
library(corrplot)   # For correlation plots

# Data Exploration and Cleaning

In [None]:
# Load the data (adjust the path to your dataset)
data <- read.csv("/kaggle/input/sustainable-fashion-eco-friendly-trends/sustainable_fashion_trends_2024.csv")

In [None]:
# Overview of the data
str(data)          # Structure of the dataset

In [None]:
summary(data)      # Summary statistics for numerical columns

In [None]:
head(data)         # View the first few rows

In [None]:
# Check for missing values
colSums(is.na(data))

In [None]:
# Clean the data if necessary
data <- na.omit(data)  # Remove rows with missing values
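
Because `na.omit()` drops every row that contains an `NA`, it can be worth checking how many rows would be lost. The sketch below is illustrative only: `report_missing` and the `toy` data frame are hypothetical names, and the values are made up.

```r
# Hypothetical helper: summarise what na.omit() would remove.
report_missing <- function(df) {
  incomplete <- !complete.cases(df)   # TRUE for rows with at least one NA
  list(
    rows_total   = nrow(df),
    rows_dropped = sum(incomplete),
    na_by_column = colSums(is.na(df))
  )
}

# Toy stand-in for `data`; use the real data frame in the notebook.
toy <- data.frame(
  Average_Price_USD = c(120, NA, 85, 60),
  Market_Trend      = c("Growing", "Stable", NA, "Declining")
)

missing_report <- report_missing(toy)
print(missing_report)
```

Note that `read.csv()` by default reads blank character fields as empty strings rather than `NA`, so such fields would not be counted by this check.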

# Exploratory Data Analysis (EDA)

> a) Sustainability Ratings Distribution

In [None]:
# Count of sustainability ratings
ggplot(data, aes(x = Sustainability_Rating)) +
  geom_bar(fill = "skyblue") +
  ggtitle("Distribution of Sustainability Ratings") +
  xlab("Sustainability Rating") +
  ylab("Count")

*The distribution is fairly even: ratings A, B, C, and D each account for approximately the same number of items.*
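
The visual impression of an even distribution can be checked with a chi-square goodness-of-fit test. This is a minimal sketch that uses a simulated `ratings` vector as a stand-in for `data$Sustainability_Rating`; the simulated values are an assumption, not the real data.

```r
# Simulated stand-in for data$Sustainability_Rating (hypothetical values).
set.seed(1)
ratings <- factor(sample(c("A", "B", "C", "D"), size = 400, replace = TRUE))

rating_counts <- table(ratings)

# With no `p` argument, the null hypothesis is that all categories are equally likely.
gof <- chisq.test(rating_counts)

print(rating_counts)
print(gof)
```

A large p-value would be consistent with the four ratings being equally frequent. In the notebook, the equivalent call would be `chisq.test(table(data$Sustainability_Rating))`.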

> b) Country-wise Analysis of Sustainability Ratings

In [None]:
# Sustainability ratings by country
country_ratings <- data %>%
  group_by(Country, Sustainability_Rating) %>%
  summarise(Count = n(), .groups = "drop")  # Drop grouping to avoid the regrouping message

ggplot(country_ratings, aes(x = Country, y = Count, fill = Sustainability_Rating)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Sustainability Ratings by Country") +
  xlab("Country") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The countries included are Australia, Brazil, China, France, Germany, India, Italy, Japan, UK, and USA. The sustainability ratings are categorized into four groups: A, B, C, and D. The height of each bar represents the number of items in that country that fall into a particular rating category.

We can observe that:
1. Rating A is the most common rating for most countries, with the exception of Germany and Italy.
2. Rating B is also relatively common in most countries.
3. Rating C is less common than A and B for most countries.
4. Rating D is the least common rating for all countries.
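
Bar heights show raw counts, so countries with more items look larger across every rating. Comparing within-country shares can make the observations above easier to verify. A small sketch on simulated data follows; the `toy` frame and its three countries are placeholders rather than the real dataset.

```r
# Simulated stand-in for the Country and Sustainability_Rating columns.
set.seed(2)
toy <- data.frame(
  Country               = sample(c("France", "Japan", "USA"), 300, replace = TRUE),
  Sustainability_Rating = sample(c("A", "B", "C", "D"), 300, replace = TRUE)
)

counts <- table(toy$Country, toy$Sustainability_Rating)

# margin = 1 normalises each row, so every country's shares sum to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```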

> c) Carbon Footprint and Water Usage

In [None]:
# Scatter plot of Carbon Footprint vs Water Usage
ggplot(data, aes(x = Carbon_Footprint_MT, y = Water_Usage_Liters, color = Sustainability_Rating)) +
  geom_point(alpha = 0.7) +
  ggtitle("Carbon Footprint vs Water Usage") +
  xlab("Carbon Footprint (MT)") +
  ylab("Water Usage (Liters)") +
  theme_minimal()

The plot displays the relationship between carbon footprint and water usage for items with different sustainability ratings (A, B, C, and D). Each dot represents an item, and its position on the graph indicates its carbon footprint (x-axis) and water usage (y-axis). The color of the dot corresponds to the item's sustainability rating.

Here are some observations from the plot:

* No clear pattern: There doesn't seem to be a strong correlation between carbon footprint and water usage. The dots are scattered across the graph, indicating that items with high carbon footprints can have both high and low water usage, and vice versa.
* Overlap in ratings: Items with different sustainability ratings are clustered together in some areas of the graph, suggesting that carbon footprint and water usage alone might not be sufficient to distinguish between the ratings.
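
The "no clear pattern" reading of the scatter plot can be quantified with a rank correlation. The sketch below runs a Spearman test on simulated columns; the value ranges are assumptions chosen only for illustration.

```r
# Simulated stand-ins for the two environmental metrics (hypothetical ranges).
set.seed(3)
toy <- data.frame(
  Carbon_Footprint_MT = runif(200, 1, 500),
  Water_Usage_Liters  = runif(200, 50000, 5000000)
)

# exact = FALSE uses the asymptotic p-value, which avoids warnings when the data contain ties.
ct <- cor.test(toy$Carbon_Footprint_MT, toy$Water_Usage_Liters,
               method = "spearman", exact = FALSE)

print(ct$estimate)  # Spearman's rho
print(ct$p.value)
```

An estimate near zero with a large p-value would support the observation that the two metrics are not related in a monotonic way.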

> d) Average Price and Market Trend

In [None]:
# Boxplot of Average Price by Market Trend
ggplot(data, aes(x = Market_Trend, y = Average_Price_USD, fill = Market_Trend)) +
  geom_boxplot() +
  ggtitle("Average Price by Market Trend") +
  xlab("Market Trend") +
  ylab("Average Price (USD)")

Overall Observations:

* The median prices for all three market trends are close to each other, suggesting that the typical price isn't drastically different across these trends.
* The boxes representing the interquartile range (IQR) are roughly similar in size, indicating that the spread of the middle 50% of prices within each trend is comparable.
* There are outliers present in all three trends, represented by the points beyond the whiskers.

Specific Observations:

* Declining Trend: The median price for declining products is slightly higher than the other two trends. There's a somewhat wider range of prices, as indicated by the longer box and whiskers.
* Growing Trend: The median price for growing products is slightly lower than the declining trend but similar to the stable trend. The distribution is relatively symmetrical.
* Stable Trend: The median price for stable products is similar to the growing trend. The distribution is also relatively symmetrical, with a slightly narrower range than the declining trend.

Key Takeaway:

Judged visually from the boxplot, typical prices do not differ much across the three market trends, but the spread of prices varies: declining products tend to have a wider range of prices than growing and stable products.
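
Whether the price differences between trends are more than noise can be checked with a rank-based test, which does not assume normally distributed prices. This is a sketch using the Kruskal-Wallis test on simulated data; the `toy` frame is a placeholder for `data`.

```r
# Simulated stand-in for Market_Trend and Average_Price_USD (hypothetical values).
set.seed(4)
toy <- data.frame(
  Market_Trend      = factor(rep(c("Declining", "Growing", "Stable"), each = 100)),
  Average_Price_USD = runif(300, 20, 500)
)

# Null hypothesis: prices in all three trends come from the same distribution.
kw <- kruskal.test(Average_Price_USD ~ Market_Trend, data = toy)
print(kw)
```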

# Correlation Analysis

In [None]:
# Subset numerical columns
numerical_data <- data %>%
  select(Carbon_Footprint_MT, Water_Usage_Liters, Waste_Production_KG, Product_Lines, Average_Price_USD)

# Calculate correlation matrix
cor_matrix <- cor(numerical_data)

# Plot the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black")

# Logistic Regression for Predictive Modeling

> a) Prepare the Data and Fit the Model

In [None]:
# Convert sustainability rating to binary: 1 = High (A/B), 0 = Low (C/D)
data$Sustainability_Binary <- ifelse(data$Sustainability_Rating %in% c("A", "B"), 1, 0)

# Logistic Regression
model <- glm(Sustainability_Binary ~ Carbon_Footprint_MT + Water_Usage_Liters + 
              Waste_Production_KG + Product_Lines + Average_Price_USD,
             data = data, family = binomial)

# Summary of the model
summary(model)


* None of the predictors (carbon footprint, water usage, waste production, product lines, and price) is a statistically significant predictor of a high sustainability rating.
* The p-values for all coefficients are greater than 0.05, meaning there is no strong evidence that these variables are associated with sustainability ratings.
* The model has very little predictive power: the deviance barely changes (from 6931.4 for the null model to 6928.7 for the fitted model), and the AIC (6940.7) is no better than that of an intercept-only model.
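
The deviance figures quoted above can be turned into a likelihood-ratio test of the full model against an intercept-only model. The sketch below works directly from the reported numbers, so it runs without the dataset; the variable names are illustrative.

```r
# Deviances as reported by summary(model) above.
null_deviance     <- 6931.4
residual_deviance <- 6928.7
n_predictors      <- 5   # five slope coefficients in the model formula

# The drop in deviance is approximately chi-square distributed under the null.
lr_stat <- null_deviance - residual_deviance
p_value <- pchisq(lr_stat, df = n_predictors, lower.tail = FALSE)

# For a 0/1 response, AIC = deviance + 2 * (number of coefficients).
aic_null  <- null_deviance + 2 * 1
aic_model <- residual_deviance + 2 * (n_predictors + 1)

cat("LR statistic:", lr_stat, " p-value:", round(p_value, 3), "\n")
cat("AIC (intercept only):", aic_null, " AIC (full model):", aic_model, "\n")
```

With the fitted object available, `anova(update(model, . ~ 1), model, test = "Chisq")` should give the same comparison without copying numbers by hand.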

> b) Model Predictions and Performance

In [None]:
# Predict the probability of high sustainability rating
data$Prediction_Prob <- predict(model, type = "response")

# Set threshold for classification
data$Prediction <- ifelse(data$Prediction_Prob > 0.5, 1, 0)

# Confusion Matrix
table(Predicted = data$Prediction, Actual = data$Sustainability_Binary)
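
A confusion matrix is easier to judge next to a baseline. The sketch below computes accuracy from the table and compares it with always predicting the majority class. The `actual` and `predicted` vectors are made-up stand-ins for `data$Sustainability_Binary` and `data$Prediction`.

```r
# Hypothetical labels and predictions (not taken from the dataset).
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 1, 1, 1, 1, 0, 1, 1)

# Fixing the levels keeps the table 2 x 2 even if one class is never predicted.
cm <- table(
  Predicted = factor(predicted, levels = c(0, 1)),
  Actual    = factor(actual, levels = c(0, 1))
)

accuracy <- sum(diag(cm)) / sum(cm)
baseline <- max(table(actual)) / length(actual)   # always predict the majority class

print(cm)
cat("Accuracy:", accuracy, " Majority-class baseline:", baseline, "\n")
```

Because the predictions above are made on the same rows the model was fitted to, the resulting accuracy is an in-sample figure and tends to be optimistic.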

# Analysis of Environmental Impact vs Sustainability Ratings

> a) Carbon Footprint, Water Usage, and Waste Production by Rating

In [None]:
# Boxplots for environmental metrics by Sustainability Rating
par(mfrow = c(1, 3))  # Arrange plots side by side

# Carbon Footprint
boxplot(Carbon_Footprint_MT ~ Sustainability_Rating, data = data,
        main = "Carbon Footprint by Sustainability Rating",
        xlab = "Sustainability Rating", ylab = "Carbon Footprint (MT)", col = "lightblue")

# Water Usage
boxplot(Water_Usage_Liters ~ Sustainability_Rating, data = data,
        main = "Water Usage by Sustainability Rating",
        xlab = "Sustainability Rating", ylab = "Water Usage (Liters)", col = "lightgreen")

# Waste Production
boxplot(Waste_Production_KG ~ Sustainability_Rating, data = data,
        main = "Waste Production by Sustainability Rating",
        xlab = "Sustainability Rating", ylab = "Waste Production (KG)", col = "lightpink")

par(mfrow = c(1, 1))  # Restore the default single-plot layout

Looking at the carbon footprint plot, we observe that the median carbon footprint is relatively similar across all four ratings. However, there is a wider spread in the carbon footprint for ratings B and C compared to A and D.

Similarly, in the water usage plot, the median water usage is consistent across all ratings. The spread of water usage is also relatively similar for all ratings.

In the waste production plot, the median waste production is again similar across all ratings. However, the spread of waste production is wider for ratings B and C compared to A and D.
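
The medians and spreads read off the boxplots can also be tabulated. This is a base-R sketch using `aggregate()` on simulated data; the `toy` frame stands in for `data`, and the same pattern would apply to the water and waste columns.

```r
# Simulated stand-in for Sustainability_Rating and Carbon_Footprint_MT.
set.seed(5)
toy <- data.frame(
  Sustainability_Rating = factor(rep(c("A", "B", "C", "D"), each = 50)),
  Carbon_Footprint_MT   = runif(200, 1, 500)
)

# One row per rating, with the median and interquartile range of the metric.
spread_by_rating <- aggregate(
  Carbon_Footprint_MT ~ Sustainability_Rating,
  data = toy,
  FUN  = function(x) c(median = median(x), IQR = IQR(x))
)
print(spread_by_rating)
```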

# Brand Price Analysis

> Relationship Between Price and Sustainability Rating

In [None]:
# Boxplot for Average Price by Sustainability Rating
ggplot(data, aes(x = Sustainability_Rating, y = Average_Price_USD, fill = Sustainability_Rating)) +
  geom_boxplot() +
  ggtitle("Average Price by Sustainability Rating") +
  xlab("Sustainability Rating") +
  ylab("Average Price (USD)") +
  theme_minimal()

# Market Trend vs Sustainability Ratings

In [None]:
# Proportion of Sustainability Ratings within Market Trends
market_trend_ratings <- data %>%
  group_by(Market_Trend, Sustainability_Rating) %>%
  summarise(Count = n(), .groups = "drop")

ggplot(market_trend_ratings, aes(x = Market_Trend, y = Count, fill = Sustainability_Rating)) +
  geom_bar(stat = "identity", position = "fill") +
  ggtitle("Proportion of Sustainability Ratings by Market Trend") +
  xlab("Market Trend") +
  ylab("Proportion") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()


# Recycling Programs and Environmental Metrics

In [None]:
# Compare Carbon Footprint for brands with and without recycling programs
ggplot(data, aes(x = Recycling_Programs, y = Carbon_Footprint_MT, fill = Recycling_Programs)) +
  geom_boxplot() +
  ggtitle("Carbon Footprint by Recycling Programs") +
  xlab("Recycling Programs") +
  ylab("Carbon Footprint (MT)") +
  theme_minimal()

# Water Usage
ggplot(data, aes(x = Recycling_Programs, y = Water_Usage_Liters, fill = Recycling_Programs)) +
  geom_boxplot() +
  ggtitle("Water Usage by Recycling Programs") +
  xlab("Recycling Programs") +
  ylab("Water Usage (Liters)") +
  theme_minimal()
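
The boxplots compare the two groups visually; a Wilcoxon rank-sum test gives a numeric complement. The sketch assumes `Recycling_Programs` has exactly two levels (written here as "Yes" and "No", which is an assumption about the dataset) and uses simulated values.

```r
# Simulated stand-in; the "Yes"/"No" labels are assumed, not confirmed by the data.
set.seed(6)
toy <- data.frame(
  Recycling_Programs  = factor(rep(c("No", "Yes"), each = 100)),
  Carbon_Footprint_MT = runif(200, 1, 500)
)

# Null hypothesis: carbon footprint has the same distribution in both groups.
wt <- wilcox.test(Carbon_Footprint_MT ~ Recycling_Programs, data = toy)
print(wt)
```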

In [None]:
# Load required libraries
library(rpart)        # For Decision Trees
library(rpart.plot)   # To plot Decision Trees
library(randomForest) # For Random Forests
library(caret)        # For model evaluation

# Convert Sustainability_Binary to a factor
data$Sustainability_Binary <- as.factor(data$Sustainability_Binary)

# Split data into training and testing sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(data$Sustainability_Binary, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]


# Decision Tree Model

In [None]:
# Train a Decision Tree
decision_tree <- rpart(Sustainability_Binary ~ Carbon_Footprint_MT + Water_Usage_Liters +
                        Waste_Production_KG + Product_Lines + Average_Price_USD,
                       data = train_data, method = "class", cp = 0.01)

# Plot the Decision Tree
rpart.plot(decision_tree, type = 4, extra = 101, main = "Decision Tree for Sustainability Rating")

In [None]:
# Predict on test data
pred_tree <- predict(decision_tree, test_data, type = "class")

# Confusion Matrix
confusionMatrix(pred_tree, test_data$Sustainability_Binary, positive = "1")  # Treat High (1) as the positive class

# Random Forest Model

In [None]:
# Train a Random Forest model
set.seed(123)  # For reproducibility
random_forest <- randomForest(Sustainability_Binary ~ Carbon_Footprint_MT + Water_Usage_Liters +
                                Waste_Production_KG + Product_Lines + Average_Price_USD,
                              data = train_data, importance = TRUE, ntree = 500)

# Print the model summary
print(random_forest)


In [None]:
# Variable Importance Plot
importance(random_forest)
varImpPlot(random_forest, main = "Variable Importance in Random Forest")

In [None]:
# Predict on test data
pred_rf <- predict(random_forest, test_data)

# Confusion Matrix
confusionMatrix(pred_rf, test_data$Sustainability_Binary, positive = "1")  # Treat High (1) as the positive class

# Compare Model Performance

In [None]:
# Compare Accuracy of Decision Tree and Random Forest
accuracy_tree <- confusionMatrix(pred_tree, test_data$Sustainability_Binary)$overall['Accuracy']
accuracy_rf <- confusionMatrix(pred_rf, test_data$Sustainability_Binary)$overall['Accuracy']

# Print Accuracy
cat("Decision Tree Accuracy:", accuracy_tree, "\n")
cat("Random Forest Accuracy:", accuracy_rf, "\n")

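Both models have relatively low accuracy on the test set, indicating that neither is effective at predicting the binary sustainability rating from these five features. This is consistent with the logistic regression above, where no predictor was significant. The Decision Tree model performs slightly better than the Random Forest model in this case, but a small gap on a single train/test split should not be over-interpreted. Further analysis, such as adding more informative features or tuning the models, may be necessary to achieve better performance.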
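
To judge whether an accuracy figure is distinguishable from guessing, a binomial confidence interval can help. The counts below are hypothetical placeholders, and the comparison value `p = 0.5` assumes the two classes are roughly balanced.

```r
# Hypothetical test-set results; substitute the counts from confusionMatrix().
n_test    <- 1500
n_correct <- 765

# Exact binomial test of the accuracy against 0.5 (assumes balanced classes).
acc_test <- binom.test(n_correct, n_test, p = 0.5)
accuracy <- n_correct / n_test

cat("Accuracy:", accuracy, "\n")
cat("95% CI:", round(acc_test$conf.int, 3), "\n")
```

If the interval contains 0.5, the model cannot be distinguished from chance on that test set. The `confusionMatrix()` output also reports a 95% CI and the No Information Rate, which serve the same purpose.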