# Product Analysis for "Seblak" in Tokopedia (an online shopping website)


# Introduction

**Web Scraping for Product Analysis Project**
<p style="font-size:18px;">
Made By  : Brenda Kwan<br>
Objective: This program was created to perform web scraping for data analysis of the product "seblak" (an Indonesian snack) on Tokopedia to find out the potential of the seblak market and the income that can be obtained from dropshipping "seblak".
</p>



# A) Web Scraping
- Retrieving Product Name, Product Price, Seller, Store City, Number of Sales, and Product Rating data from the Tokopedia website for seblak products
- Creating a dataframe using pandas to store the extracted data

In [1]:
# Import packages
# Selenium will open the web browser
from selenium import webdriver
import time
import pandas as pd

# Beautifulsoup will extract the data from the browser
from bs4 import BeautifulSoup

# Create a DataFrame to store the results of web scraping
df = pd.DataFrame()

# Start the web browser
driver = webdriver.Chrome()

# Initialize lists to store data
listPrice = []
listName = []
listSeller = []
listCity = []
listPurchases = []
listRating = []

# Loop through 100 pages
for page in range(1, 101):
    # Set the URL with placeholder {page} to indicate the page number
    url = f"https://www.tokopedia.com/search?navsource=&page={page}&q=seblak&search_id=20240725142332A392199A864D2909F4MR&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st="
    
    # Open the URL
    driver.get(url)

    # Wait 1 second for the page to load before reading its HTML
    time.sleep(1)
    
    # Get the HTML
    html = driver.page_source
    
    # Parse the HTML
    soup = BeautifulSoup(html, "html.parser")
    
    # Extract product details
    products = soup.find_all("div", class_="bYD8FcVCFyOBiVyITwDj1Q==")
    for product in products:

        # Extract discounted price 
        price_element = product.find("div", class_="_67d6E1xDKIzw+i2D2L0tjw== t4jWW3NandT5hvCFAiotYg==")
        if not price_element:
            # If no discounted price, get the regular price
            price_element = product.find("div", class_="XvaCkHiisn2EZFq0THwVug==")
        # Extract text if price exists, else set None
        price_text = price_element.get_text() if price_element else None
        listPrice.append(price_text)

        # Extract name
        name_element = product.find("div", class_="_6+OpBPVGAgqnmycna+bWIw==")
        name_text = name_element.get_text() if name_element else None
        listName.append(name_text)
        
        # Extract seller
        seller_element = product.find("span", class_="T0rpy-LEwYNQifsgB-3SQw== pC8DMVkBZGW7-egObcWMFQ== flip")
        seller_text = seller_element.get_text() if seller_element else None
        listSeller.append(seller_text)
        
        # Extract city
        city_element = product.find("span", class_="pC8DMVkBZGW7-egObcWMFQ== flip")
        city_text = city_element.get_text() if city_element else None
        listCity.append(city_text)
        
        # Extract purchases
        purchases_element = product.find("span", class_="se8WAnkjbVXZNA8mT+Veuw==")
        purchases_text = purchases_element.get_text() if purchases_element else None
        listPurchases.append(purchases_text)
        
        # Extract rating if available on main page
        rating_element = product.find("span", class_="_9jWGz3C-GX7Myq-32zWG9w==")
        rating_text = rating_element.get_text() if rating_element else None

        # If rating is missing, open the product page
        if not rating_text:
            product_link_tag = product.find("a", href=True)
            if product_link_tag:
                product_link = product_link_tag["href"]

                try:
                    driver.get(product_link)
                    time.sleep(3)  # Allow time for page to load
                    
                    product_soup = BeautifulSoup(driver.page_source, "html.parser")
                    rating_element = product_soup.find("span", {"data-testid": "lblPDPDetailProductRatingNumber"})
                    rating_text = rating_element.get_text() if rating_element else None

                except Exception as e:
                    print(f"Error fetching rating for {name_text}: {e}")
                    rating_text = None

                finally:
                    driver.back()  # Return to main search page
                    time.sleep(2)

        listRating.append(rating_text)



# Close the browser once scraping is finished
driver.quit()

# Create the DataFrame, set column values to list values
df['Product Name'] = listName
df['Product Price'] = listPrice
df['Seller'] = listSeller
df['Store City'] = listCity
df['Quantity Sold'] = listPurchases
df['Product Rating'] = listRating

df

Unnamed: 0,Product Name,Product Price,Seller,Store City,Quantity Sold,Product Rating
0,Kylafood Paket Paket 2 Seblak Cup + 4 Seblak A...,Rp83.400,,,25 terjual,
1,Kylafood Paket 2 Seblak Cup + 2 Basreng Original,Rp63.600,,,2 terjual,
2,Kylafood Seblak Cup isi 2 pcs,Rp31.800,,,100+ terjual,
3,"Kylafood Paket (Seblak cup, Seblak rempah, Bas...",Rp60.600,kylafood,Bandung,23 terjual,
4,"Seblak Rafael, Seblak Coet Instan Halal",Rp25.000,Brother Meat Shop,Jakarta Selatan,750+ terjual,
...,...,...,...,...,...,...
1025,cemilan seblak bastik pedas daun jeruk 250gram,Rp7.500,mazenith snack,Kab.Ciamis,,
1026,Kerupuk Seblak Bulat | 100 Gram | Banna Foody,Rp6.999,BannaFoody,Banda Aceh,,
1027,KRUPUK SEBLAK CIKRUH KENCUR,Rp3.400,Kairashop.id,Surabaya,50+ terjual,
1028,PROMO SEBLAK INSTAN ( BELI 3 GRATIS 1 BASRENG ),Rp45.000,rumahcemilan82,Kab. Cianjur,100+ terjual,
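
The search URL used in the scraping loop carries several session-specific parameters (such as search_id). As a minimal sketch, and assuming that the search page only needs the q and page parameters (an untested assumption, not something confirmed by this notebook), the URL for each page could be built with the standard library. The helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.tokopedia.com/search"


def build_search_url(query, page):
    """Return the search URL for one results page (hypothetical helper)."""
    # Assumption: only 'page' and 'q' are required; the session-specific
    # parameters from the original URL are deliberately left out.
    params = {"page": page, "q": query}
    return f"{BASE_URL}?{urlencode(params)}"


# Build the URLs for the first three result pages
urls = [build_search_url("seblak", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

urlencode also escapes spaces and special characters, so a multi-word search term does not have to be encoded by hand.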


# B) Data Preparation
- Display several rows of data with conditions
- Display a summary of the created dataframe
- Check missing values in the dataframe

In [2]:
# Show several rows with the condition Product Rating == 4.8
df_new  = df.loc[(df['Product Rating'] == '4.8')]
df_new

Unnamed: 0,Product Name,Product Price,Seller,Store City,Quantity Sold,Product Rating
9,SEBLAK VIRAL/MIX SEBLAK CAMPURAN/MIX VIRAL RENYAH,Rp14.999,putri raja ngemil,Kab. Bandung,90+ terjual,4.8
22,Makaroni bantet pedas daun jeruk 1kg makroni b...,Rp15.760,camilanqu_shop,Kab. Majalengka,29 terjual,4.8
35,kerupuk seblak rafael 200gr,Rp10.999,Bunda Qiana Store,Bandung,100+ terjual,4.8
36,kerupuk seblak kering mangar manggar jaat akar...,Rp11.500,Qie snack,Kab. Jember,15 terjual,4.8
46,SEBRING KRUPUK KERUPUK SEBLAK KERING PEDAS DAU...,Rp16.000,Aydaa Snack,Surakarta,100+ terjual,4.8
48,Davinsi Cemilan Mix Seblak Kering Kemasan 500g,Rp30.000,Makaroni Davinsi Official,Cimahi,24 terjual,4.8
122,1CUP Minyak Bawang PLUS DAUNJERUK serbaguna un...,Rp10.000,gaiagarut,Kab. Garut,100+ terjual,4.8
166,Seblak instan original,Rp15.000,"Rumah Seblak, Bandung",Kab. Bekasi,100+ terjual,4.8
169,Toping Seblak Baso Aci PAKET MIX 100pcs Cuanki...,Rp31.000,gaiagarut,Kab. Garut,500+ terjual,4.8
185,KERUPUK KRUPUK SEBLAK JABLAY PELANGI PEDAS TER...,Rp26.000,Golden Cakery,Jakarta Pusat,100+ terjual,4.8


## i) Display summary data to check for missing values

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product Name    1030 non-null   object
 1   Product Price   1030 non-null   object
 2   Seller          1000 non-null   object
 3   Store City      1000 non-null   object
 4   Quantity Sold   827 non-null    object
 5   Product Rating  290 non-null    object
dtypes: object(6)
memory usage: 48.4+ KB


- There are missing values in the Seller, Store City, Quantity Sold, and Product Rating columns

## ii) Handling missing and duplicate values

In [4]:
# Drop rows with missing values
df_null_dropped = df.dropna()

# Drop duplicate rows
df = df_null_dropped.drop_duplicates()

# Reset the index of the rows while dropping the previous index 
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Product Name,Product Price,Seller,Store City,Quantity Sold,Product Rating
0,Kylafood Seblak Rempah Authentik,Rp12.900,kylafood,Bandung,3rb+ terjual,4.9
1,SEBLAK VIRAL/MIX SEBLAK CAMPURAN/MIX VIRAL RENYAH,Rp14.999,putri raja ngemil,Kab. Bandung,90+ terjual,4.8
2,[BELI LOKAL] SEBLAK CAMPUR/MIX CAMPUR KERUPUK ...,Rp24.999,putri raja ngemil,Kab. Bandung,500+ terjual,4.9
3,[Beli Lokal] seblak as beton kerupuk pedas dau...,Rp26.999,putri raja ngemil,Kab. Bandung,90+ terjual,4.5
4,Kylafood Seblak Original,Rp22.500,kylafood,Bandung,10rb+ terjual,4.9
...,...,...,...,...,...,...
283,cuanki CIPET MINI isi 50pcs toping Baso aci cu...,Rp13.500,gaiagarut,Kab. Garut,30+ terjual,3.6
284,Cuanki lidah Toping baso aci seblak isi 5 pcs,Rp6.000,Grosir Putra Bdg,Kab. Bandung,100+ terjual,4.6
285,SEBLAK INSTAN,Rp14.000,Nunina Frozen Food,Jakarta Timur,50+ terjual,4.9
286,SOMAY KERING BAHAN SEBLAK,Rp8.000,TokoKu Melisa,Bandung,24 terjual,4.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product Name    288 non-null    object
 1   Product Price   288 non-null    object
 2   Seller          288 non-null    object
 3   Store City      288 non-null    object
 4   Quantity Sold   288 non-null    object
 5   Product Rating  288 non-null    object
dtypes: object(6)
memory usage: 13.6+ KB


- Because the non-null count of every column equals the number of rows (288), there are no missing values left

## iii) Data cleaning (converting from string to float)
- For the Product Price and Quantity Sold columns, the non-numeric text (such as 'Rp', thousands separators, and 'terjual') is removed and the remaining number is converted to float; Product Rating only needs to be cast to float
- Create a function to remove the non-numeric text

In [6]:
# Import numpy package
import numpy as np

# Create a function that strips the non-numeric text and returns a float
def clean_numerical(x):
    # Only string values need cleaning; other values are returned unchanged
    if isinstance(x, str):
        # Remove the 'Rp' prefix, the '.' thousands separators, the ' terjual' suffix and '+', then trim whitespace
        x = x.replace('Rp', '').replace('.', '').replace(' terjual', '').replace('+','').strip()
        # In 'Quantity Sold', 'rb' (ribu) means thousand: remove it and multiply the remaining number by 1000
        if 'rb' in x:
            x = x.replace('rb', '').strip()
            x = float(x) * 1000
        else:
            x = float(x)
    return x

# Applying the clean_numerical method to eliminate string values and convert the column value to float
df['Product Price'] = df['Product Price'].apply(clean_numerical).astype('float')
df['Product Rating'] = df['Product Rating'].astype('float')
df['Quantity Sold'] = df['Quantity Sold'].apply(clean_numerical).astype('float')

df


Unnamed: 0,Product Name,Product Price,Seller,Store City,Quantity Sold,Product Rating
0,Kylafood Seblak Rempah Authentik,12900.0,kylafood,Bandung,3000.0,4.9
1,SEBLAK VIRAL/MIX SEBLAK CAMPURAN/MIX VIRAL RENYAH,14999.0,putri raja ngemil,Kab. Bandung,90.0,4.8
2,[BELI LOKAL] SEBLAK CAMPUR/MIX CAMPUR KERUPUK ...,24999.0,putri raja ngemil,Kab. Bandung,500.0,4.9
3,[Beli Lokal] seblak as beton kerupuk pedas dau...,26999.0,putri raja ngemil,Kab. Bandung,90.0,4.5
4,Kylafood Seblak Original,22500.0,kylafood,Bandung,10000.0,4.9
...,...,...,...,...,...,...
283,cuanki CIPET MINI isi 50pcs toping Baso aci cu...,13500.0,gaiagarut,Kab. Garut,30.0,3.6
284,Cuanki lidah Toping baso aci seblak isi 5 pcs,6000.0,Grosir Putra Bdg,Kab. Bandung,100.0,4.6
285,SEBLAK INSTAN,14000.0,Nunina Frozen Food,Jakarta Timur,50.0,4.9
286,SOMAY KERING BAHAN SEBLAK,8000.0,TokoKu Melisa,Bandung,24.0,4.0
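
The cleaning step above can also be sketched as two small parsers that tolerate missing values. This is only an illustrative variant, not the function used in this notebook: the names parse_price and parse_quantity_sold are hypothetical, and the handling of a decimal comma (for example '1,5rb+ terjual') is an assumption about how such values might be written, since no such value appears in the scraped data shown here.

```python
def parse_price(text):
    """Convert a price such as 'Rp83.400' to a float; None stays None."""
    if text is None:
        return None
    # 'Rp' is the currency prefix and '.' is the thousands separator
    return float(text.replace("Rp", "").replace(".", "").strip())


def parse_quantity_sold(text):
    """Convert e.g. '100+ terjual' or '3rb+ terjual' to a float; None stays None."""
    if text is None:
        return None
    cleaned = text.replace("terjual", "").replace("+", "").strip().lower()
    multiplier = 1.0
    if cleaned.endswith("rb"):
        # 'rb' (ribu) means thousand
        multiplier = 1000.0
        cleaned = cleaned[:-2].strip()
    # Assumption: '.' is a thousands separator and ',' a decimal comma
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned) * multiplier


for raw in ["25 terjual", "100+ terjual", "3rb+ terjual", None]:
    print(raw, "->", parse_quantity_sold(raw))
```

Returning None for missing values means the parsers could be applied before dropping incomplete rows, if that order is ever preferred.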


In [7]:
# Check that the data type of Quantity Sold, Product Rating, and Product Price is now float64
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Product Name    288 non-null    object 
 1   Product Price   288 non-null    float64
 2   Seller          288 non-null    object 
 3   Store City      288 non-null    object 
 4   Quantity Sold   288 non-null    float64
 5   Product Rating  288 non-null    float64
dtypes: float64(3), object(3)
memory usage: 13.6+ KB


# C)  Business Understanding/Problem Statement
Utilise the SMART Framework and make a Problem Statement

#### SPECIFIC
- Gain data-driven insights to maximize income from dropshipping seblak (a currently viral product) by analyzing seblak data scraped from Tokopedia and assessing its sales potential.

#### MEASURABLE
- Increase income from seblak dropshipping by 5% in one year

#### ACHIEVABLE
- With web scraping tools, at least 50 seblak product records can be extracted from Tokopedia to assess whether seblak can increase my income by 5% in one year. The data will be analyzed with descriptive and inferential statistics, using tools such as scipy, matplotlib, and seaborn.

#### RELEVANT:
- Analyzing the potential profit from selling seblak with web scraping and statistical analysis is important because it shows me what consumers currently prefer, so that I can maximize my income by acting on those preferences

#### TIME-BOUND:
- The goal to increase my income by 5% can be achieved after one year

#### PROBLEM STATEMENT:
- Increase my income by 5% in one year by extracting viral seblak product data from Tokopedia and performing statistical analysis on seblak data so that I can find out the potential profit that will be obtained from seblak dropshipping.

# D) Analysis

In [8]:
# Select only the columns 'Product Price','Product Rating', 'Quantity Sold' for data analysis
df_analysis = df[['Product Price','Product Rating', 'Quantity Sold']]
df_analysis

Unnamed: 0,Product Price,Product Rating,Quantity Sold
0,12900.0,4.9,3000.0
1,14999.0,4.8,90.0
2,24999.0,4.9,500.0
3,26999.0,4.5,90.0
4,22500.0,4.9,10000.0
...,...,...,...
283,13500.0,3.6,30.0
284,6000.0,4.6,100.0
285,14000.0,4.9,50.0
286,8000.0,4.0,24.0


## i) Check the distribution and outliers of price data, quantity sold, rating: mean, median, standard deviation, skewness, and kurtosis.

Standard deviation:
- Measure of how dispersed data is relative to the mean
- High standard deviation (>= 2, measured in the units of the column, so it is best judged relative to the mean)

Skewness: 
- A measure of asymmetry: it quantifies how far a distribution is pushed to the left or right
- Data is extremely skewed (> 1)
    - There are outliers or large number of extreme values at the right side of the distribution
    - Majority of data clustered at lower end with a few very high values stretching the distribution's tail far to the right

Kurtosis:
- Quantification of how much of the distribution is in the tails, used for testing normality
- Very high kurtosis value (> 3; note that pandas' .kurtosis() reports excess kurtosis, for which a normal distribution scores 0)
    - Extreme outliers far beyond the range of typical values
    - Data distribution dominated by a few extremely large values, making tails very heavy
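
To make the two definitions above concrete, here is a minimal standard-library sketch that computes skewness and excess kurtosis from the central moments. It uses the simple population formulas, whereas pandas applies bias-corrected sample estimators, so the numbers will differ slightly from .skew() and .kurtosis() on the same data. The function name and the sample values are illustrative only.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population moment formulas."""
    n = len(values)
    mean = fmean(values)
    # Second, third and fourth central moments
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    # Subtracting 3 makes a normal distribution score 0
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 100]  # one extreme value stretches the right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness, while the single extreme value in the second sample pushes the skewness above 1, which is the pattern described in the bullets above.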

In [15]:
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate central tendency, dispersion, and determining normality (mean, median, standard deviation, skewness, kurtosis)
mean_price = df_analysis['Product Price'].mean()
mean_purchases = df_analysis['Quantity Sold'].mean()
mean_rating = df_analysis['Product Rating'].mean()
median_price = df_analysis['Product Price'].median()
median_purchases = df_analysis['Quantity Sold'].median()
median_rating = df_analysis['Product Rating'].median()
std_price = df_analysis['Product Price'].std()
std_purchases = df_analysis['Quantity Sold'].std()
std_rating = df_analysis['Product Rating'].std()
skewness_price = df_analysis['Product Price'].skew()
skewness_purchases = df_analysis['Quantity Sold'].skew()
skewness_rating = df_analysis['Product Rating'].skew()
kurtosis_price = df_analysis['Product Price'].kurtosis()
kurtosis_purchases = df_analysis['Quantity Sold'].kurtosis()
kurtosis_rating = df_analysis['Product Rating'].kurtosis()
print(f"Average price of seblak: Rp. {mean_price:.2f}")
print(f"Average number of seblak sold: {mean_purchases:.2f}")
print(f"Average rating of seblak: {mean_rating:.2f}")
print(f"Median price of seblak: Rp. {median_price:.2f}")
print(f"Median number of seblak sold: {median_purchases:.2f}")
print(f"Median rating of seblak: {median_rating:.2f}")
print(f"Standard deviation of price of seblak: Rp. {std_price:.2f}")
print(f"Standard deviation of number of seblak sold: {std_purchases:.2f}")
print(f"Standard deviation of rating of seblak: {std_rating:.2f}")
print(f"Skewness of price seblak: {skewness_price:.2f}")
print(f"Skewness of the amount of seblak sold: {skewness_purchases:.2f}")
print(f"Skewness of the rating of seblak: {skewness_rating:.2f}")
print(f"Kurtosis of the price of seblak: {kurtosis_price:.2f}")
print(f"Kurtosis of the amount of seblak sold: {kurtosis_purchases:.2f}")
print(f"Kurtosis of the rating of seblak: {kurtosis_rating:.2f}")

Average price of seblak: Rp. 23701.84
Average number of seblak sold: 212.27
Average rating of seblak: 4.87
Median price of seblak: Rp. 15055.00
Median number of seblak sold: 23.00
Median rating of seblak: 5.00
Standard deviation of price of seblak: Rp. 32285.12
Standard deviation of number of seblak sold: 902.36
Standard deviation of rating of seblak: 0.25
Skewness of price seblak: 6.58
Skewness of the amount of seblak sold: 8.38
Skewness of the rating of seblak: -3.07
Kurtosis of the price of seblak: 66.05
Kurtosis of the amount of seblak sold: 80.47
Kurtosis of the rating of seblak: 11.02


PRODUCT PRICE:
- The distribution of product prices is not normal (skewed), where there are several seblak products that are very high priced, but the majority of seblak products have prices around the median price (Rp. 15,055).
- Because the average price of seblak (Rp 23,701.84) is greater than the median price of seblak (Rp. 15,055), the price distribution is right-skewed
- The right-skewed price distribution can be seen from the skewness of the product price (6.58 > 1) which means there are outliers on the higher prices side
- Most of the prices are on the left side, indicating that the product is cheap, meaning lower cost to sell the product
- High kurtosis (66.05) indicates a sharp peak and thick tail and there are outliers on the high price side

QUANTITY SOLD:
- The distribution of the number of seblak sold is not normal (skewed), because there is a very large variation in the Quantity Sold
- Because the average Quantity Sold (212) is greater than the median Quantity Sold (23), the distribution of the Quantity Sold is right-skewed
- The right-skewed distribution of the Quantity Sold can be seen from the skewness of the Quantity Sold (8.38) which means there are many outliers with the Quantity Sold being very high
- High kurtosis (80.47) indicates many outliers with high sales, sharp peaks and thick tails

PRODUCT RATING:
- The average rating of seblak (4.87) is close to the median rating (5), but the distribution of product ratings is not normal
- The skewness of the product rating (-3.07) indicates a strongly left-skewed distribution: there are a few outliers on the lower rating side while most products are rated very highly, which indicates that seblak has an overall very good rating
- High kurtosis (11.02) indicates a sharp peak and thick tails, with most ratings concentrated at very high values

CONCLUSION:
- All columns show a skewed distribution and do not follow a normal distribution. The "Product Price" and "Quantity Sold" columns are skewed to the right with many outliers that have high values, while the "Product Rating" column is skewed to the left with most of the rating values being very high, indicating that consumers are satisfied with the seblak product

## ii) Confidence Interval: To get insight into the potential income that can be earned from seblak dropshipping

Since the distributions of product price and quantity sold are both right-skewed rather than normal, the earnings column (price multiplied by quantity sold) will also have a skewed, non-normal distribution. A normal-theory interval such as stats.norm.interval(conf_level, loc=average, scale=width_of_the_distribution_around_the_average) is therefore not suitable, so an approximate interval around the median is used instead:
- Upper = Median + 1.7 * ((1.25*IQR )/ (1.35* np.sqrt(N)))
- Lower = Median - 1.7 * ((1.25*IQR) / (1.35* np.sqrt(N)))
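
The two formulas above can be tried on a small list with the standard library alone. This is a minimal sketch rather than the notebook's own code: median_interval is a hypothetical helper, the sample values are made up, and the "inclusive" quartile method is chosen because it should correspond to the linear interpolation that pandas' quantile() uses by default.

```python
from math import sqrt
from statistics import median, quantiles


def median_interval(values):
    """Approximate interval around the median, following the formulas above."""
    n = len(values)
    # Quartiles with linear interpolation between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    half_width = 1.7 * ((1.25 * iqr) / (1.35 * sqrt(n)))
    centre = median(values)
    return centre - half_width, centre + half_width


# Made-up earnings values, for illustration only
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lower, upper = median_interval(sample)
print(f"Lower: {lower:.2f}, Upper: {upper:.2f}")
```

Because the half-width shrinks with the square root of N, the interval narrows as more products are scraped; it describes the uncertainty of the median, not the spread of individual products.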

In [16]:
# Calculate earnings for each product seblak by multiplying the price with units sold and creating a new column 'Earnings' in the dataframe
df_analysis['Earnings'] = df_analysis['Product Price'] * df_analysis['Quantity Sold']

# Find the standard deviation for the Earnings column
std = df_analysis['Earnings'].std()

# Find median 
median = df_analysis['Earnings'].median()

# Length of the dataframe (number of records)
N = len(df_analysis)
# Upper and lower quantile of Earnings
q1 = df_analysis['Earnings'].quantile(0.25)
q3 = df_analysis['Earnings'].quantile(0.75)

# Find the interquartile range
iqr = q3 - q1

# Find the upper and lower value for Earnings
upper = median + 1.7 * ((1.25*iqr) / (1.35* np.sqrt(N)))
lower = median - 1.7 * ((1.25*iqr) / (1.35* np.sqrt(N)))


print(f'Lower Limit (Earnings): Rp {lower:.2f}')
print(f'Upper Limit (Earnings): Rp {upper:.2f}')

Lower Limit (Earnings): Rp 197597.59
Upper Limit (Earnings): Rp 432402.41


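A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_analysis['Earnings'] = df_analysis['Product Price'] * df_analysis['Quantity Sold']

Note: this pandas warning appears because df_analysis was created as a slice of df. Creating it with df[['Product Price','Product Rating', 'Quantity Sold']].copy() avoids the warning; the Earnings values printed above are not affected.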


Because the price of most seblak products is low, the potential earnings are rather low: the interval around the median earnings per product ranges from Rp 197,597.59 to Rp 432,402.41.

## iii) Hypothesis Testing (Two-Sample Independent Test): Are the prices of goods in Jabodetabek and outside Jabodetabek different?

- Jabodetabek: Jakarta, Tangerang, South Tangerang, Bogor, Depok, Bekasi
- Hypothesis of Two-Sample Independent Test:
    - Two-sample test because it compares two different groups to see whether they differ; here, the price of seblak in Jabodetabek is compared with the price outside Jabodetabek
    - Independent because one group does not affect the other: stores in Jabodetabek and stores outside Jabodetabek are different populations

- Hypothesis tested:
    - H0: There is no difference in the average price of seblak between Jabodetabek and outside Jabodetabek
    - H1: There is a difference in the average price of seblak between Jabodetabek and outside Jabodetabek

We have to create a 'Region' column to indicate whether the store city is in Jabodetabek or not.

In [17]:
# Define Jabodetabek cities
jabodetabek_cities = ['Jakarta', 'Tangerang', 'Bogor', 'Depok', 'Bekasi']

# Create a new column 'Region' based on city names
df['Region'] = df['Store City'].apply(lambda x: 'Jabodetabek' if any(city in x for city in jabodetabek_cities) else 'Non-Jabodetabek')
df


Unnamed: 0,Product Name,Product Price,Seller,Store City,Quantity Sold,Product Rating,Region
0,Kylafood Seblak Rempah Authentik,12900.0,kylafood,Bandung,3000.0,4.9,Non-Jabodetabek
1,SEBLAK VIRAL/MIX SEBLAK CAMPURAN/MIX VIRAL RENYAH,14999.0,putri raja ngemil,Kab. Bandung,90.0,4.8,Non-Jabodetabek
2,[BELI LOKAL] SEBLAK CAMPUR/MIX CAMPUR KERUPUK ...,24999.0,putri raja ngemil,Kab. Bandung,500.0,4.9,Non-Jabodetabek
3,[Beli Lokal] seblak as beton kerupuk pedas dau...,26999.0,putri raja ngemil,Kab. Bandung,90.0,4.5,Non-Jabodetabek
4,Kylafood Seblak Original,22500.0,kylafood,Bandung,10000.0,4.9,Non-Jabodetabek
...,...,...,...,...,...,...,...
283,cuanki CIPET MINI isi 50pcs toping Baso aci cu...,13500.0,gaiagarut,Kab. Garut,30.0,3.6,Non-Jabodetabek
284,Cuanki lidah Toping baso aci seblak isi 5 pcs,6000.0,Grosir Putra Bdg,Kab. Bandung,100.0,4.6,Non-Jabodetabek
285,SEBLAK INSTAN,14000.0,Nunina Frozen Food,Jakarta Timur,50.0,4.9,Jabodetabek
286,SOMAY KERING BAHAN SEBLAK,8000.0,TokoKu Melisa,Bandung,24.0,4.0,Non-Jabodetabek


In [12]:
# Create two new dataframes to separate Jabodetabek and Non-Jabodetabek records (filtering) for two-sample test
jabodetabek_df = df[df['Region'] == 'Jabodetabek']
non_jabodetabek_df = df[df['Region'] == 'Non-Jabodetabek']

In [13]:
# Perform two-sample independent t-test
t_stat, p_val = stats.ttest_ind(jabodetabek_df['Product Price'], non_jabodetabek_df['Product Price'])

print('T-Statistic:', t_stat)
print('P-value:', p_val)

T-Statistic: 0.8726719709124907
P-value: 0.38357396258653265


H0 is not rejected because the p-value (0.38) > 0.05: even if the price of raw materials may differ between the two locations, the data show no statistically significant difference in the average price of seblak between Jabodetabek and outside Jabodetabek.
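
By default, stats.ttest_ind assumes that both groups have equal variances. Since the price data is strongly skewed, that assumption may not hold, and Welch's version of the test (available in scipy as stats.ttest_ind(..., equal_var=False)) is a common alternative. The sketch below only illustrates how Welch's statistic and its degrees of freedom are computed with the standard library; welch_t is a hypothetical helper and the two price lists are made up, so it does not reproduce the result above.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (fmean(sample_a) - fmean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up prices in rupiah, for illustration only
jabodetabek_prices = [14000.0, 25000.0, 26000.0, 15000.0, 31000.0]
other_prices = [12900.0, 14999.0, 24999.0, 26999.0, 22500.0, 8000.0]

t_stat, dof = welch_t(jabodetabek_prices, other_prices)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which is what the scipy call handles.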

## iv) Correlation between the Product Price and Quantity Sold of Seblak?

Because both Product Price (skewness 6.58) and Quantity Sold (skewness 8.38) were found above to be strongly skewed (not normal), the Spearman rank correlation is used instead of the Pearson correlation.

In [14]:
# Import scipy for correlation analysis
from scipy import stats
corr_rho, pval_s = stats.spearmanr(df['Product Price'], df['Quantity Sold'])
print(f"rho-correlation: {corr_rho}")
print(f"p-value: {pval_s}")

rho-correlation: -0.09437608835430066
p-value: 0.10999423588567625


The p-value indicates whether the observed correlation could plausibly be due to chance: if it is above 0.05 the correlation is not statistically significant, otherwise it is.

- P-value (0.11) > 0.05, so the correlation between product price and sales volume is not statistically significant
- The Spearman correlation (-0.09) is very close to zero, so there is essentially no monotonic relationship between product price and sales volume
    - In this data, a product's price is therefore not a good predictor of the number of seblak sales
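
Spearman's rho is the Pearson correlation computed on the ranks of the values rather than on the values themselves, which is why extreme prices or sales figures do not dominate it. The following standard-library sketch shows the idea on a handful of price and quantity pairs taken from the table above; it is an illustration of the technique under that small subset, not a recomputation of the result for the full dataset, and the helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Five (price, quantity sold) pairs from the cleaned table, illustration only
prices = [6000.0, 12900.0, 14000.0, 22500.0, 26999.0]
sold = [100.0, 3000.0, 50.0, 10000.0, 90.0]
rho = spearman_rho(prices, sold)
print(f"rho = {rho:.3f}")
```

Only the ordering of the values matters here, so replacing 10000 with any other value larger than 3000 would leave rho unchanged.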

# E) Conclusion

Based on the available data, the price of seblak varies, with some extremely high prices, and the number of sales varies very widely. The average product rating is nevertheless good, which shows that customers are satisfied with seblak products and is one of the supporting reasons for dropshipping seblak. However, the upper and lower limits for earnings show a large variation in income potential, so income can fluctuate. Moreover, the median number of sales is only 23, which indicates that seblak products, although "viral", do not sell in large quantities. There was no significant price difference between Jabodetabek and outside Jabodetabek, and no significant correlation between price and number of sales, so dropshipping can be done from any city in Indonesia. A strategy to handle income fluctuations and ensure a healthy profit margin would increase the chances of success in dropshipping seblak.

# Recommendation

Overall, seblak does not seem to sell well enough to be a good product for me to dropship, given its low median sales and the small profit margin implied by its low price. Because I want to maximize the earnings from dropshipping and achieve the goal of increasing income by 5%, I can analyze the potential income from dropshipping other viral products in the same way as I have done here (web scraping, data cleaning, descriptive statistics, and inferential statistics). If another product has greater market potential, then that is the product I will dropship.