In [None]:
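# for development: pick up edits to web/utils/utils.py without restarting the kernel
import importlib
import sys
if 'web.utils.utils' in sys.modules:
    print("reloading web.utils.utils")
    # reload via sys.modules so this also works when `utils` is not yet bound in this namespace
    utils = importlib.reload(sys.modules['web.utils.utils'])
else:
    from web.utils import utils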

In [None]:
df = utils.load_data()

In [None]:
df = utils.clean_data(df)

In [None]:
import altair as alt
alt.data_transformers.disable_max_rows()


## Rubric
_In this assignment you will be using your final project dataset and exploring it with your tool of choice from the course (d3/vegalite/altair/Tableau). Begin by inspecting the available data without visualizing the data, and write down three hypotheses. Next, investigate each of your hypotheses by visualizing relevant variables (including derived variables, if that helps) in multiple ways. Look for correlations, clusters, outliers, or any other patterns. See if you can find evidence for or against each hypothesis. As you explore, retain multiple sheets in your workbook that show the development of your analysis. Try to find something unexpected in the data. For only one of the hypotheses, describe your exploration process, noting changes and refinements you made to the visualizations as you went along, as well as what worked or didn’t work during your exploration process. We expect to see at least three steps in this refinement process. For the remaining two hypotheses, list each hypothesis and conclusion, provide the beginning and final visualization._


## User Profile and Background

We are looking at the "FBA" - Fulfilled by Amazon - product selling model. An individual or small company finds manufacturers (probably in China) from whom to buy existing products, has them shipped to Amazon warehouses, prepares product pages as an Amazon Seller, but pays Amazon to do all aspects of fulfillment. JungleScout is a data and services seller to the market of these Amazon FBA sellers and would-be sellers. Our data is drawn from their database.

Our hypothetical user is someone who is inexperienced as an Amazon Seller, and is looking for existing products that are good opportunities. Such a user has read pages like this one https://www.junglescout.com/find-products-to-sell/, which give guidance like this:

Characteristics of a Good Product
1. Retail price between 25 – 50 USD
2. Low seasonality.
3. Less than 200 reviews for the top sellers (less than 100 is excellent!)
4. Small (fits in a shoebox) and Lightweight
5. Can be improved.
6. Simple to manufacture.



## Database

I built the dataset with a query that synthesizes JungleScout's (JS) advice on what makes a good product choice. The query below yielded slightly over 5k results, which I downloaded by hand in pages of 100 (this took about 15 minutes).
* Price: between 20 and 100 USD
* Minimum Net Profit estimate: 15 USD
* Minimum monthly estimated sales: 200
* Maximum reviews: 50
* Maximum "Listing Quality Score" (LQS): 6 (Scale is 1-10)
* Exclude difficult categories: Electronics, Food
* Exclude "Top Brands"
* Include only items that JS marked as Fulfilled-by-Amazon (FBA)

In addition, I chose to eliminate "Clothing" because it accounted for about half of the total listings, and I wanted to keep the size of my dataset manageable.
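
Since the downloaded data turns out not to respect some of these filters, a quick audit is useful. The sketch below is my own illustration, not anything JS provides: it counts, per filter, how many rows break it. Column names are the ones used in the chart tooltips in this notebook; `RULES` and `count_violations` are hypothetical helpers, the toy rows are made up, and the price filter is left out because I have not confirmed the name of the price column.

```python
import pandas as pd

# Hypothetical rules mirroring the query filters; each returns True where a row complies.
RULES = {
    "Net >= 15": lambda d: d["Net"] >= 15,
    "Est_Monthly_Sales >= 200": lambda d: d["Est_Monthly_Sales"] >= 200,
    "Reviews <= 50": lambda d: d["Reviews"] <= 50,
    "LQS <= 6": lambda d: d["LQS"] <= 6,
    "Category not excluded": lambda d: ~d["Category"].isin(
        ["Electronics", "Food", "Clothing"]
    ),
}


def count_violations(frame: pd.DataFrame) -> pd.Series:
    """Return, for each rule, the number of rows that do not comply.

    Rows with a missing value in a numeric column count as violations,
    because comparisons against NaN evaluate to False.
    """
    counts = {name: int((~rule(frame)).sum()) for name, rule in RULES.items()}
    return pd.Series(counts, name="violations")


# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "Net": [20.0, 16.5, 30.0],
        "Est_Monthly_Sales": [250, 900, 45000],
        "Reviews": [12, 870, 3],
        "LQS": [4, 8, 5],
        "Category": ["Toys & Games", "Home & Kitchen", "Food"],
    }
)
violations = count_violations(toy)
print(violations)
```

On the real data this would be called as `count_violations(df)` right after loading, before any rows are dropped.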

The data returned from JS has the following schema.


In [None]:
df.dtypes

### Additional Data Cleaning



Data problem: some records have review counts that are ridiculously high. This is especially worrisome because I requested only products that had 50 or fewer reviews. More on that later.


In [None]:
alt.Chart(df[df['Reviews'] > 50]).mark_bar().encode(
    x=alt.X('Reviews:Q', bin=alt.Bin(maxbins=100)),
    y='count(*):Q',
    tooltip=['count(*):Q', 'Reviews']
).properties(width=1200, height=400)

For now, I've decided to drop them, keeping a copy aside for later inspection.

In [None]:
questionable_review_records = df[df['Reviews'] > 50]
questionable_review_records.shape

In [None]:
df = df[df['Reviews'] <= 50]

There's also a data problem with ridiculously high estimated monthly sales. 

In [None]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Est_Monthly_Sales:Q', bin=alt.Bin(maxbins=100)),
    y='count(*):Q',
    tooltip=['count(*):Q', 'Est_Monthly_Sales']
).properties(width=1200, height=400)

For now, let's just drop everything above 40,000 estimated monthly sales. We should do more research into this.

In [None]:
df = df[df['Est_Monthly_Sales'] <= 40000]
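
Because these records deserve more research, it may be worth setting them aside rather than discarding them, the same way the high-review records were kept. A minimal sketch, assuming a hypothetical helper `split_on_cap` and made-up toy rows; the 40000 cap is the one used above:

```python
import pandas as pd


def split_on_cap(frame: pd.DataFrame, column: str, cap: float):
    """Split a frame into (kept, set_aside) around a cap on one column.

    Rows with column <= cap are kept; everything else, including rows
    with a missing value, is set aside. This matches the behaviour of
    `frame[frame[column] <= cap]` for the kept part.
    """
    keep = frame[column] <= cap
    return frame[keep], frame[~keep]


# Made-up rows, only to show the behaviour.
toy = pd.DataFrame(
    {
        "Product_Name": ["a", "b", "c"],
        "Est_Monthly_Sales": [300, 1200, 250000],
    }
)
kept, questionable_sales_records = split_on_cap(toy, "Est_Monthly_Sales", 40000)
print(questionable_sales_records)
```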

## Hypotheses

My initial general hypothesis is that there are very few if any really good products to choose from, because this sounds to me like a "get-rich-quick" scam that's too good to be true. If there were easy opportunities, someone would have already taken them.

Specifically:

1. Products that have high monetary potential (Net or Estimated Monthly Revenue or some combination) also have high competition (Reviews, Sellers).
2. JungleScout's proprietary "Listing Quality Score" (LQS) is intended to capture the quality of the product marketing. If I feel that I'm a good marketer, then products with low LQS are an opportunity for me. Hypothesis: products with low LQS also have low monetary potential -- they're just crap products anyway, so marketing is just lipstick on a pig.
3. I expect some correlations:
* sellers and reviews
* Est_Monthly_Sales and Est_Monthly_Revenue
* LQS and reviews

Let's look at correlations first.
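
As a numeric complement to the scatterplots, the three expected correlations can be computed directly. The sketch below uses Spearman rank correlation through pandas' `DataFrame.corr`, which is my choice here because it is less sensitive to extreme values than Pearson; `PAIRS` and `pair_correlations` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

# The three pairs listed under hypothesis 3.
PAIRS = [
    ("Sellers", "Reviews"),
    ("Est_Monthly_Sales", "Est_Monthly_Revenue"),
    ("LQS", "Reviews"),
]


def pair_correlations(frame: pd.DataFrame, pairs=PAIRS, method: str = "spearman") -> pd.Series:
    """Return one correlation coefficient per (column, column) pair."""
    columns = sorted({name for pair in pairs for name in pair})
    matrix = frame[columns].corr(method=method)
    return pd.Series({f"{a} ~ {b}": matrix.loc[a, b] for a, b in pairs})


# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "Sellers": [1, 3, 2, 5],
        "Reviews": [10, 40, 5, 20],
        "LQS": [3, 6, 4, 5],
        "Est_Monthly_Sales": [200, 400, 600, 800],
        "Est_Monthly_Revenue": [5000, 9000, 20000, 21000],
    }
)
correlations = pair_correlations(toy)
print(correlations)
```

A coefficient near 0 does not rule out a relationship, so the charts are still worth reading alongside these numbers.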

Est_Monthly_Sales and Est_Monthly_Revenue seem correlated:

In [None]:
alt.Chart(df).mark_circle().encode(
    x='Est_Monthly_Sales:Q',
    y='Est_Monthly_Revenue:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=400)

Sellers and Reviews do not seem correlated, at least not at first glance:

In [None]:
alt.Chart(df).mark_circle().encode(
    y='Sellers:Q',
    x='Reviews:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=400)

JS tells us that review count is an indicator of competition among sellers. But then, what does the "Sellers" field mean? Perhaps it is computed differently, and reviews are a better indication of competitiveness than sellers?

In [None]:
alt.Chart(df).mark_circle().encode(
    x='Net:Q',
    y='Reviews:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=1200)

The above graphic seems to indicate that, if we assume that "Net" is a proxy for potential profit and "Reviews" is a proxy for competitiveness, then my initial hypothesis does not seem to be true. It seems that there are profitable products at every level of competitiveness.

Side note: a slider would be good for the number of reviews, since that only ranges between 0 and 50.

Let's look at Estimated Monthly Sales against Reviews:

In [None]:
alt.Chart(df).mark_circle().encode(
    x='Est_Monthly_Sales:Q',
    y='Reviews:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=1200)

Again, my initial assumption seems incorrect, since there are high-grossing products at every level of reviews.

Let's turn to my other hypothesis, about JS's "LQS" score. This is a proprietary value that JS advertises as a measure of marketing quality.

In [None]:
alt.Chart(df).mark_circle().encode(
    x='Net:Q',
    y='LQS:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=400)

Again, my hypothesis seems incorrect: there are plenty of products at every level of profitability and at each level of LQS. So perhaps there is room to out-market other sellers.

But I should have noticed something: I asked for a maximum LQS of 6, so why am I getting so many products with a higher LQS?

In [None]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('LQS:Q', bin=alt.Bin(maxbins=10)),
    y='count(*):Q',
    tooltip=['count(*):Q', 'LQS']
)

Let's try a different view of LQS and Net, a boxplot:

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x = 'LQS:Q',
    y = 'Net:Q'
)


The above is more convincing evidence that there are "good" products at all levels of LQS. Box plots are probably helpful for other views as well.
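
The same comparison can be read off as numbers. This is a small sketch of a per-level summary of Net, using the quartiles that a box plot draws; `net_by_lqs` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def net_by_lqs(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count, min, quartiles and max of Net for each LQS level."""
    summary = frame.groupby("LQS")["Net"].describe()
    return summary[["count", "min", "25%", "50%", "75%", "max"]]


# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "LQS": [3, 3, 3, 5, 5],
        "Net": [15.0, 20.0, 40.0, 18.0, 60.0],
    }
)
summary = net_by_lqs(toy)
print(summary)
```

If the medians and upper quartiles stay roughly level across LQS values on the real data, that supports the reading of the box plot.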

The following two are less successful, but they explore the limits.

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x = 'LQS:Q',
    y = 'Est_Monthly_Revenue:Q'
)


In [None]:
alt.Chart(df).mark_boxplot().encode(
    x = 'Reviews:Q',
    y = 'Net:Q'
).properties(width=1200, height=400)


Here are some histograms that might also be interesting for a user.

In [None]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Reviews:Q', bin=alt.Bin(maxbins=50)),
    y='count(*):Q',
    tooltip=['count(*):Q', 'Reviews']
)

This viz is intended to explore the relation between Net and Monthly Revenue. It surprises me that they do not correlate much. This raises the question, what does "Net" really mean? We know we need several grains of salt because it does _not_ include the purchase price. So what is it? We need to think a bit more.

In [None]:
alt.Chart(df).mark_circle().encode(
    x='Net:Q',
    y='Est_Monthly_Revenue:Q',
    tooltip=['Sellers', 'LQS', 'Reviews', 'Rank', 'Fees', 'Net', 'Est_Monthly_Sales','Est_Monthly_Revenue', 'Category', 'Product_Name']
).properties(width=800, height=400)
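
One possible reading, which is only an assumption on my part: "Net" looks like a per-unit figure (the query asked for a minimum of 15 USD on items priced between 20 and 100 USD), while Est_Monthly_Revenue is a monthly total, so the two need not move together. The sketch below derives two columns that would put them on a comparable footing. `add_monthly_views`, `Unit_Price_Est`, and `Monthly_Net_Est` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd


def add_monthly_views(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an estimated unit price and an estimated monthly net.

    Assumes Net is per unit and Est_Monthly_Revenue is a monthly total;
    neither assumption has been confirmed against JS documentation.
    """
    out = frame.copy()
    # Avoid dividing by zero: treat zero sales as missing.
    units = out["Est_Monthly_Sales"].replace(0, float("nan"))
    out["Unit_Price_Est"] = out["Est_Monthly_Revenue"] / units
    out["Monthly_Net_Est"] = out["Net"] * out["Est_Monthly_Sales"]
    return out


# Made-up rows, only to show the derived columns.
toy = pd.DataFrame(
    {
        "Net": [15.0, 30.0],
        "Est_Monthly_Sales": [1000, 200],
        "Est_Monthly_Revenue": [30000.0, 12000.0],
    }
)
derived = add_monthly_views(toy)
print(derived[["Unit_Price_Est", "Monthly_Net_Est"]])
```

Plotting `Monthly_Net_Est` against `Est_Monthly_Revenue`, or `Net` against `Unit_Price_Est`, would show whether this reading holds up.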

## Takeaways

* It would be nice to have an interactive scatterplot that lets the user choose which dimension to place on each axis -- and perhaps another one for color. I think Tableau is good at that. Can this be done with Altair?
* There are more problems with the data than I expected. We need to do more exploration and thinking about data quality.

My initial hypotheses were mostly incorrect - if I can trust the data quality. This makes me think that we should frame our final project as:
* Who: naive JS user
* What: the results of a search on JS (the db I used is only one example)
* Why: discover products that fit these interesting niches: high profitability/low competition, high profitability/poor marketing, etc.

The existing JS interface returns just a list, which you can sort. We should think of ourselves as designing and prototyping a better results page.