# COGS 108 - Data Checkpoint

# Names

- Zhoutianning Pan
- Kelly Huang
- Demeng Zhang
- Duoduo Fu

<a id='research_question'></a>
# Research Question

Is there a statistically significant relationship between whether a seller paid to boost their product on the e-commerce platform Wish and the product's sales revenue? Additionally, we will test the same relationship within each price range by disaggregating the data into price-range groups.
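
One way to make "statistically significant" concrete is a two-sample permutation test on the difference in mean revenue between boosted and non-boosted products. The sketch below is illustrative only: the choice of test, the helper name `permutation_test`, and the toy revenue values are assumptions, not the analysis this project has committed to.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothetical helper: returns the observed difference mean(a) - mean(b)
    and the share of label shuffles whose absolute difference is at least
    as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then split them back into two groups
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Toy revenues, made up for illustration (not taken from the dataset)
boosted = [900.0, 1500.0, 300.0, 2500.0, 700.0]
not_boosted = [400.0, 800.0, 600.0, 1200.0]

observed, p_value = permutation_test(boosted, not_boosted, n_permutations=2000)
print(f"difference in means: {observed:.1f}, p-value: {p_value:.3f}")
```

The same helper could be applied separately to the rows of each price range to address the second part of the question.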

# Dataset(s)

- Dataset Name: Sales of summer clothes in E-commerce Wish
- Link to the dataset: https://www.kaggle.com/datasets/jmmvutu/summer-products-and-sales-in-ecommerce-wish
- Number of observations: 1573

This dataset records sales of summer clothing on the e-commerce platform Wish and is stored as a CSV file. It includes a variable called uses_ad_boosts, which indicates whether a product uses the platform's paid ad boosts and is therefore the key variable for our research question.


# Setup

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import plotly.express as px

# Data Cleaning

We first read the CSV file in with pandas. Next, we listed the unique values of the currency_buyer column to check whether all prices are in the same currency. Then, because we want to investigate the relationship between the use of ad boosts and sales revenue, we kept the three relevant columns: price, units_sold, and uses_ad_boosts. Finally, we calculated the total revenue of each product by multiplying price by units_sold.
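
The steps above can also be bundled into one helper so that the currency check fails loudly instead of only printing a message. This is a minimal sketch: the function name `prepare_sales` and the toy CSV contents (including the `title` column) are assumptions made for illustration; only the column names `currency_buyer`, `price`, `units_sold`, and `uses_ad_boosts` come from this notebook.

```python
import io
import pandas as pd

def prepare_sales(csv_source):
    """Load listing data and keep only the columns used in the analysis."""
    raw = pd.read_csv(csv_source)
    currencies = raw['currency_buyer'].unique()
    if len(currencies) != 1:
        # Revenue figures are only comparable when prices share one currency
        raise ValueError(f"Expected a single currency, found: {list(currencies)}")
    # .copy() gives an independent frame, so adding a column is safe
    cleaned = raw[['price', 'units_sold', 'uses_ad_boosts']].copy()
    cleaned['total_revenue'] = cleaned['price'] * cleaned['units_sold']
    return cleaned

# Toy CSV standing in for the real file (values are illustrative)
toy_csv = io.StringIO(
    "title,price,units_sold,uses_ad_boosts,currency_buyer\n"
    "dress,16.0,100,0,EUR\n"
    "shirt,8.0,20000,1,EUR\n"
)
toy = prepare_sales(toy_csv)
print(toy)
```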

In [30]:
## import data
df = pd.read_csv('summer-products-with-rating-and-performance_2020-08.csv')

## check whether all retail prices are in the same currency
unique_currencies = df['currency_buyer'].unique()

if len(unique_currencies) == 1:
    print("All retail prices in the dataset are in the same currency:", unique_currencies[0])
else:
    print("Retail prices in the dataset are in different currencies.")

## keep only the relevant columns (copy so later column assignments do not raise SettingWithCopyWarning)
df = df[['price','units_sold','uses_ad_boosts']].copy()

## take a look at the data
df.head()


All retail prices in the dataset are in the same currency: EUR


|   | price | units_sold | uses_ad_boosts |
|---|---|---|---|
| 0 | 16.0 | 100 | 0 |
| 1 | 8.0 | 20000 | 1 |
| 2 | 8.0 | 100 | 0 |
| 3 | 8.0 | 5000 | 1 |
| 4 | 2.72 | 100 | 1 |


In [31]:
## calculate the total_revenue
df['total_revenue'] = df['price'] * df['units_sold']
df

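|   | price | units_sold | uses_ad_boosts | total_revenue |
|---|---|---|---|---|
| 0 | 16.00 | 100 | 0 | 1600.0 |
| 1 | 8.00 | 20000 | 1 | 160000.0 |
| 2 | 8.00 | 100 | 0 | 800.0 |
| 3 | 8.00 | 5000 | 1 | 40000.0 |
| 4 | 2.72 | 100 | 1 | 272.0 |
| ... | ... | ... | ... | ... |
| 1568 | 6.00 | 10000 | 1 | 60000.0 |
| 1569 | 2.00 | 100 | 1 | 200.0 |
| 1570 | 5.00 | 100 | 0 | 500.0 |
| 1571 | 13.00 | 100 | 0 | 1300.0 |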


In [35]:
# display the distribution of prices
price_hist = px.histogram(df, x='price', histnorm='probability', marginal='box',
                          nbins=10, opacity=0.7, title="Distribution of Prices")
price_hist.show()
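
Price ranges like the ones used in the next step could also be produced with `pd.cut`, which keeps the bin edges and labels in one place. A sketch under assumptions: the edges 0, 5, 10, and 15 mirror the cutoffs used in this notebook, while the toy prices, the variable names, and the label for the open-ended top bin are illustrative.

```python
import numpy as np
import pandas as pd

# Toy prices for illustration (not the full dataset)
prices = pd.Series([2.72, 5.0, 8.0, 13.0, 16.0], name='price')

# right=True (the default) makes each bin include its upper edge, e.g. (5, 10]
bins = [0, 5, 10, 15, np.inf]
labels = ['0 - 5', '5 - 10', '10 - 15', 'over 15']
price_groups = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(price_groups.tolist())
```

Because the result is an ordered categorical, grouped output should list the ranges in bin order rather than alphabetically.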

In [34]:
# assign each price to a price range category
def price_range(price):
    if price <= 5:
        return '0 - 5'
    elif 5 < price <= 10:
        return '5 - 10'
    elif 10 < price <= 15:
        return '10 - 15'
    else:
        # every price above 15 lands here, so the label is 'over 15'
        return 'over 15'

df['price_range'] = df['price'].apply(price_range)
df

|   | price | units_sold | uses_ad_boosts | total_revenue | price_range |
|---|---|---|---|---|---|
| 0 | 16.00 | 100 | 0 | 1600.0 | over 15 |
| 1 | 8.00 | 20000 | 1 | 160000.0 | 5 - 10 |
| 2 | 8.00 | 100 | 0 | 800.0 | 5 - 10 |
| 3 | 8.00 | 5000 | 1 | 40000.0 | 5 - 10 |
| 4 | 2.72 | 100 | 1 | 272.0 | 0 - 5 |
| ... | ... | ... | ... | ... | ... |
| 1568 | 6.00 | 10000 | 1 | 60000.0 | 5 - 10 |
| 1569 | 2.00 | 100 | 1 | 200.0 | 0 - 5 |
| 1570 | 5.00 | 100 | 0 | 500.0 | 0 - 5 |
| 1571 | 13.00 | 100 | 0 | 1300.0 | 10 - 15 |


In [33]:
# pivot table of mean total_revenue by price_range and by the uses_ad_boosts flag
df.pivot_table(index='price_range',
               columns='uses_ad_boosts',
               values='total_revenue',
               aggfunc='mean')

| price_range | uses_ad_boosts = 0 | uses_ad_boosts = 1 |
|---|---|---|
| 0 - 5 | 18377.108099 | 12199.491713 |
| 10 - 15 | 41383.81746 | 50046.585366 |
| 5 - 10 | 37307.235948 | 37664.334983 |
| over 15 | 48345.948718 | 45812.727273 |
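
A mean can be pulled upward by a few very high-revenue products, so it may help to report the median and the number of products in each cell alongside it. A minimal sketch on toy rows (the values are illustrative and the name `summary` is our own):

```python
import pandas as pd

# Toy rows standing in for the cleaned data frame
toy = pd.DataFrame({
    'price_range': ['0 - 5', '0 - 5', '5 - 10', '5 - 10', '5 - 10'],
    'uses_ad_boosts': [0, 1, 0, 1, 1],
    'total_revenue': [500.0, 272.0, 800.0, 160000.0, 40000.0],
})

# One row per (price_range, uses_ad_boosts) pair with three summary statistics
summary = (
    toy.groupby(['price_range', 'uses_ad_boosts'])['total_revenue']
       .agg(['mean', 'median', 'count'])
)
print(summary)
```

Cells with very few products are a signal to interpret the corresponding means with caution.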


# Ethics & Privacy (Updated)

We do not anticipate any privacy or terms-of-use issues, because we are not collecting any personal data. However, the data we chose may contain some bias because it includes self-reported ratings. Some sellers may also have found ways to fake their figures, but we believe such cases are rare. The data is publicly available online, comes from multiple sources, and reflects real listings on Wish.com; because the platform has its own policies for sellers, we do not expect these issues to affect our analysis.
However, our research results may lead companies to advertise more heavily to boost their products, which may confuse and inconvenience consumers when they are choosing products. In addition, if one price range turns out to generate the most sales revenue, companies may reposition their products to fit that range, leaving fewer choices for consumers who prefer products in other price ranges.

# Expectation (Updated)

Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Each member is expected to respond promptly in the group chat and to communicate actively with other members.*
* *Each member is expected to attend the scheduled group meetings. If someone cannot attend the meeting, she should inform us one day prior and catch up by reading the agenda.*
* *Each member is expected to check our group repository on a daily basis so that everyone is on the same page.*
* *Each member is expected to know the deadlines of different components of the group project clearly and remind others.*
* *Each member is expected to complete her assigned part on the scheduled date. If someone has other assignments or exams that conflict with the work, she should communicate with the group and find someone to help cover the work.*
* *Each member is expected to work on a relatively even amount of content.*
* *Each member is expected to stay up to date with class announcements about the project.*
* *Each member is expected to check the grade and comments once released and make adjustments accordingly.*