In [2]:
import pandas as pd

# Load the csv files
# These csv files represent the different product categories from the NewChic dataset
data_frame_accessories = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\accessories.csv')
data_frame_bags = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\bags.csv')
data_frame_beauty = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\beauty.csv')
data_frame_house = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\house.csv')
data_frame_jewelry = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\jewelry.csv')
data_frame_kids = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\kids.csv')
data_frame_men = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\men.csv')
data_frame_shoes = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\shoes.csv')
data_frame_women = pd.read_csv(r'C:\Users\Rishab Manokaran\Downloads\A1_2024_Released\A1_2024_Released\women.csv')

# We combine all the datasets/dataframe in to a single dataset/dataframe
data_frame_combined = pd.concat([data_frame_accessories, data_frame_bags, data_frame_beauty, data_frame_house, data_frame_jewelry, data_frame_kids, data_frame_men, data_frame_shoes, data_frame_women], ignore_index=True)

# Selection of only the columns that are relevant for the purpose of analysis because they contribute to understanding product performance.
#This in turn also means that certain columns that are irrelevant to find the top-7 categories and top -10 products are being dropped. 
chosen_columns = ['category', 'name', 'current_price', 'likes_count', 'discount', 'is_new']
data_frame_preprocessed = data_frame_combined[chosen_columns]

# Missing-data handling
# Some products might not have any likes or discount or the value might be missing hence we fill the missing values in the columns likes_count and discount with 0
# we fill the values missing in the column current_price with the median value, which helps to mitigate the impact of outliers
data_frame_preprocessed['likes_count'].fillna(0, inplace=True)
data_frame_preprocessed['discount'].fillna(0, inplace=True)
data_frame_preprocessed['current_price'].fillna(data_frame_preprocessed['current_price'].median(), inplace=True)

# The dataset is filtered based on the column likes_count. This focuses on the products with significant number of likes.
# We have made the threshold value to be 50.
data_frame_preprocessed = data_frame_preprocessed[data_frame_preprocessed['likes_count'] > 50]

# To find the categories with highest average pricing we aggregate the data and find the average price per category.
premium_categories = data_frame_preprocessed.groupby('category')['current_price'].mean().sort_values(ascending=False)

# We selected these top 7 best categories because of the premium pricing strategy of these categories which is basically based on average price.
best_7_categories = premium_categories.head(7).index.tolist()

# The dataset is filtered to include products only from the top 7 categories
data_frame_filtered = data_frame_preprocessed[data_frame_preprocessed['category'].isin(best_7_categories)]

# Findind the top 10 best products from the top 7 categories which are ranked based on likes_count, current_price, discount and is_new column
best_10_products = data_frame_filtered.sort_values(
    by=['likes_count', 'current_price', 'discount', 'is_new'], 
    ascending=[False, False, False, False]
).head(10)

# Display the best 10 products from the selected best 7 categories
print("Top 7 Categories (based on average price):", best_7_categories)
print("Top 10 Products in the Top 7 Categories:\n", best_10_products)

# Matched Columns:
#'category': Essential for classifying and sorting products by category.
#'name': Serves to identify individual products.
#'current_price': Provides a glimpse into pricing tactics.
#'likes_count': Shows customer interest and product popularity.
#'discount': Indicates the attractiveness of the pricing.
#'is_new': Highlights the novelty of the product, which can impact customer appeal.

# Removed Columns:
#Columns not included in the selected list were omitted as they didn't directly aid in identifying top products or evaluating category performance.
#Columns like URLs, images, etc were excluded since they don't offer valuable insights for this analysis.

Top 7 Categories (based on average price): ['shoes', 'men', 'bags', 'women', 'kids', 'beauty', 'house']
Top 10 Products in the Top 7 Categories:
       category                                               name  \
55676    shoes  Chaussures Plats Décontractées En Suède Mocass...   
70482    women               Blouse Large Couleur Pure pour Femme   
70457    women                   Robe Longue avec Boutons Chinois   
63939    women  Gracila Femme Maxi Robe Irrégulier Vêtement Vi...   
56378    shoes  Chaussures De Grande Taille Semelle Souple À E...   
60200    women  Soutien-gorge Sexy à Décollecté Plongeant sans...   
50188    shoes               Bottines Plates Doublées de Fourrure   
63442    women  Soutien-gorge Sexy Antichoc Sans Armature Ling...   
54981    shoes  SOCOFY Sandales Confortables Plates Avec Bride...   
74381    women        Manteau imprimé floral à feuilles à capuche   

       current_price  likes_count  discount  is_new  
55676          14.99        21547       

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data_frame_preprocessed['likes_count'].fillna(0, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame_preprocessed['likes_count'].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or 