## Project Description
Sports clothing and athleisure attire is a huge industry, worth approximately <a href="https://www.statista.com/statistics/254489/total-revenue-of-the-global-sports-apparel-market/">$193 billion in 2021</a> with a strong growth forecast over the next decade!

In this notebook, I analyze the database of an online sports clothing company. I dive into product data such as pricing, reviews, descriptions, and ratings, as well as revenue and website traffic, to find ways to improve revenue.

## Data definition

<p>The <code>sports</code> database contains five tables, with <code>product_id</code> as the primary key in each of them.</p>

<h4 id="info"><code>info</code></h4>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_name</code></td>
<td><code>varchar</code></td>
<td>Name of the product</td>
</tr>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
</tr>
<tr>
<td><code>description</code></td>
<td><code>varchar</code></td>
<td>Description of the product</td>
</tr>
</tbody>
</table>
<h4 id="finance"><code>finance</code></h4>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
</tr>
<tr>
<td><code>listing_price</code></td>
<td><code>float</code></td>
<td>Listing price for product</td>
</tr>
<tr>
<td><code>sale_price</code></td>
<td><code>float</code></td>
<td>Price of the product when on sale</td>
</tr>
<tr>
<td><code>discount</code></td>
<td><code>float</code></td>
<td>Discount, as a decimal, applied to the sale price</td>
</tr>
<tr>
<td><code>revenue</code></td>
<td><code>float</code></td>
<td>Amount of revenue generated by each product, in US dollars</td>
</tr>
</tbody>
</table>
<h4 id="reviews"><code>reviews</code></h4>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_name</code></td>
<td><code>varchar</code></td>
<td>Name of the product</td>
</tr>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
</tr>
<tr>
<td><code>rating</code></td>
<td><code>float</code></td>
<td>Product rating, scored from <code>1.0</code> to <code>5.0</code></td>
</tr>
<tr>
<td><code>reviews</code></td>
<td><code>float</code></td>
<td>Number of reviews for the product</td>
</tr>
</tbody>
</table>
<h4 id="traffic"><code>traffic</code></h4>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
</tr>
<tr>
<td><code>last_visited</code></td>
<td><code>timestamp</code></td>
<td>Date and time the product was last viewed on the website</td>
</tr>
</tbody>
</table>
<h4 id="brands"><code>brands</code></h4>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
</tr>
<tr>
<td><code>brand</code></td>
<td><code>varchar</code></td>
<td>Brand of the product</td>
</tr>
</tbody>
</table>

<p>I deal with missing data as well as numeric, string, and timestamp data types to draw insights about the products in the online store. Let's start by initializing the connection to the local PostgreSQL database.</p>

### Initializing connection with the database

In [30]:
# Loading sql extension using ipython-sql
%load_ext sql


In [31]:
# Initializing sql user, host, and database
%sql postgresql://postgres:Postgresql_pr0@localhost/sports
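
The connection URL above embeds the password directly in the notebook. As a minimal sketch of one alternative, the URL could be assembled in Python from an environment variable. The helper name `build_connection_url` and the variable name `SPORTS_DB_PASSWORD` are hypothetical, chosen only for illustration; how the resulting string is then handed to `%sql` depends on the ipython-sql version and is not shown here.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(user, password, host, database):
    """Assemble a PostgreSQL URL, percent-encoding the password.

    Encoding matters when the password contains characters such as
    '@' or '/', which would otherwise be read as URL delimiters.
    """
    return f"postgresql://{user}:{quote_plus(password)}@{host}/{database}"


# SPORTS_DB_PASSWORD is a hypothetical variable name; set it in the shell
# before starting Jupyter so the password never appears in the notebook.
password = os.environ.get("SPORTS_DB_PASSWORD", "")
connection_url = build_connection_url("postgres", password, "localhost", "sports")
```

The user, host, and database names mirror the ones used in the cell above; only the password source changes.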

### Counting missing values

In [32]:
%%sql

-- Count the non-null values in each key column across the joined tables
SELECT
    COUNT(i.product_id) AS total_rows
    , COUNT(f.listing_price) AS count_list_price
    , COUNT(f.discount) AS count_discount
    , COUNT(f.revenue) AS count_revenue
    , COUNT(t.last_visited) AS count_last_visit
    , COUNT(r.rating) AS count_ratings
    , COUNT(r.reviews) AS count_reviews
FROM
    info AS i
  JOIN
    finance AS f
  USING(product_id)
  JOIN
    traffic AS t
  USING(product_id)
  JOIN
    reviews AS r
  USING(product_id);

 * postgresql://postgres:***@localhost/sports
1 rows affected.


total_rows,count_list_price,count_discount,count_revenue,count_last_visit,count_ratings,count_reviews
3179,3120,3120,3120,2928,3120,3120


We can see the database contains 3,179 products in total, but listing price, discount, revenue, rating, and review count are each missing for 59 products (3,179 - 3,120). Moreover, <code>last_visited</code> is missing for 251 products, around 8% of the total.
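
The missing-value figures follow directly from the counts returned by the query. A small sketch of that arithmetic, using the numbers from the output above (the function name `missing_summary` is only illustrative):

```python
# Non-null counts copied from the query output above.
counts = {
    "total_rows": 3179,
    "count_list_price": 3120,
    "count_discount": 3120,
    "count_revenue": 3120,
    "count_last_visit": 2928,
    "count_ratings": 3120,
    "count_reviews": 3120,
}


def missing_summary(counts, total_key="total_rows"):
    """Return {column: (missing_count, missing_percent)} for each column."""
    total = counts[total_key]
    return {
        column: (total - non_null, round(100 * (total - non_null) / total, 1))
        for column, non_null in counts.items()
        if column != total_key
    }


summary = missing_summary(counts)
for column, (missing, percent) in summary.items():
    print(f"{column}: {missing} missing ({percent}%)")
```

For `count_last_visit` this gives 251 missing rows, which is where the "around 8%" figure comes from.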

Now let's turn our attention to pricing.

### Brands - Nike vs Adidas pricing

How do the price points of Nike and Adidas products differ? This can help us analyse the company's stock range and customer market.

Let's assign labels to different price ranges, grouping by <code>brand</code> and <code>price_category</code>, also including the total <code>revenue</code> for each price range and brand.

In [33]:
%%sql

-- Exploring the revenue generated by each brand in different price ranges
SELECT
    b.brand
    , CASE
        WHEN f.listing_price::INTEGER < 42 THEN 'Budget'
        WHEN 42 <= f.listing_price::INTEGER
                AND f.listing_price::INTEGER < 74 THEN 'Average'
        WHEN 74 <= f.listing_price::INTEGER
                AND f.listing_price::INTEGER < 129 THEN 'Expensive'
        ELSE 'Elite'
    END AS price_category
    , COUNT(f.product_id)
    , ROUND(SUM(f.revenue::NUMERIC), 2) AS total_revenue
FROM
    brands AS b
    INNER JOIN
    finance AS f
    USING(product_id)
WHERE
    b.brand IS NOT NULL
GROUP BY
    b.brand,
    price_category
ORDER BY
    total_revenue DESC;

 * postgresql://postgres:***@localhost/sports
8 rows affected.


brand,price_category,count,total_revenue
Adidas,Expensive,849,4626980.07
Adidas,Average,1060,3233661.06
Adidas,Elite,307,3014316.83
Adidas,Budget,359,651661.12
Nike,Budget,357,595341.02
Nike,Elite,82,128475.59
Nike,Expensive,90,71843.15
Nike,Average,16,6623.5
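
To make the bucketing rule in the `CASE` expression easier to check by hand, here is a rough Python equivalent. It is a sketch, not the query itself: the function name `price_category` is illustrative, and Python's `round()` stands in for PostgreSQL's `::INTEGER` cast, which rounds a float to the nearest integer rather than truncating it.

```python
def price_category(listing_price):
    """Mirror the CASE expression used in the query above.

    A missing price (None) falls through to 'Elite', just as a NULL
    listing_price fails every WHEN comparison in SQL and reaches ELSE.
    """
    if listing_price is None:
        return "Elite"
    price = round(listing_price)  # stand-in for the ::INTEGER cast
    if price < 42:
        return "Budget"
    if price < 74:
        return "Average"
    if price < 129:
        return "Expensive"
    return "Elite"


for example in (29.99, 42, 73.99, 128.4, 129, None):
    print(example, "->", price_category(example))
```

Two details are worth noticing: a price such as 73.99 rounds up to 74 and so lands in `Expensive`, and a NULL price would be labelled `Elite`. The category counts in the output add up to 3,120, which suggests rows without a listing price were already excluded by the `brand IS NOT NULL` filter in this dataset.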


Grouping products by brand and price range allows us to see that products from **Adidas generate more total revenue regardless of price category**.

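Although the <code>Expensive</code> category brings in the most total revenue for Adidas, products from the <code>Elite</code> category (listed at \$129 or more) earn the most per product: about \$9,800 each on average, compared with about \$5,400 for <code>Expensive</code>. The company can therefore potentially increase revenue by **shifting its stock to have a larger proportion of Elite products!**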
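
The comparison between categories is easier to judge when total revenue is divided by the number of products in each group. The sketch below does that with the rows from the query output above; the function name `revenue_per_product` is illustrative, and the average is a simple total-over-count figure rather than anything computed in the database.

```python
# (brand, price_category, product_count, total_revenue) from the output above.
rows = [
    ("Adidas", "Expensive", 849, 4626980.07),
    ("Adidas", "Average", 1060, 3233661.06),
    ("Adidas", "Elite", 307, 3014316.83),
    ("Adidas", "Budget", 359, 651661.12),
    ("Nike", "Budget", 357, 595341.02),
    ("Nike", "Elite", 82, 128475.59),
    ("Nike", "Expensive", 90, 71843.15),
    ("Nike", "Average", 16, 6623.5),
]


def revenue_per_product(rows):
    """Return {(brand, category): average revenue per product}."""
    return {
        (brand, category): round(total_revenue / count, 2)
        for brand, category, count, total_revenue in rows
    }


averages = revenue_per_product(rows)
# Print groups from highest to lowest average revenue per product.
for (brand, category), average in sorted(
    averages.items(), key=lambda item: item[1], reverse=True
):
    print(f"{brand:<7}{category:<10}{average:>10.2f}")
```

Ranked this way, Adidas `Elite` comes out on top even though Adidas `Expensive` has the larger total, because the `Expensive` revenue is spread over far more products.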