### Interest Analysis
---
    1. Which interests have been present in all month_year dates in our dataset?

In [1]:
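SELECT
    interest_name,
    COUNT(*) AS time  -- counts rows per name, not distinct months
FROM interest_metrics me
JOIN interest_map ma ON me.interest_id = ma.id
GROUP BY interest_name
-- Keep names with at least as many rows as there are distinct months.
HAVING COUNT(*) >= (SELECT COUNT(DISTINCT month_year) FROM interest_metrics)
ORDER BY [time] DESC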

interest_name,time
Pizza Lovers,18
Plus Size Women,14
Pool and Spa Researchers,14
Portugal Trip Planners,14
Powerboat Purchasers,14
Pre-Measured Grocery Shoppers,14
Premier League Fans,14
Preppy Clothing Shoppers,14
Price Conscious Home Shoppers,14
Professional Chefs,14
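
A `time` of 18 for Pizza Lovers is higher than the 14 distinct months in the data, which suggests `COUNT(*)` is counting rows rather than months. Rows with a NULL `month_year`, or several ids sharing one `interest_name`, would both inflate it. As a hedged alternative, the sketch below counts distinct months per `interest_id` and keeps only the ids whose count equals the number of distinct months. The CTE name `months_per_interest` is made up for illustration; the tables and columns are the ones used in the query above.

```sql
-- Sketch: one row per interest_id with its number of distinct, non-NULL months.
WITH months_per_interest AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM interest_metrics
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
)
SELECT
    ma.interest_name,
    mp.interest_id,
    mp.total_months
FROM months_per_interest mp
JOIN interest_map ma ON mp.interest_id = ma.id
-- Present in every month: the count matches the number of distinct months.
WHERE mp.total_months = (SELECT COUNT(DISTINCT month_year) FROM interest_metrics)
ORDER BY ma.interest_name;
```

Grouping by `interest_id` instead of `interest_name` keeps two ids that share a name as separate rows, so no count can exceed the number of months.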


    2. Using this same `total_months` measure - calculate the cumulative percentage of all records starting at 14 months - which `total_months` value passes the 90% cumulative percentage value?

In [2]:
WITH tbl1 AS (
    SELECT  
        interest_id , 
        COUNT(DISTINCT month_year) as total_months
    FROM interest_metrics i1
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
),
tbl2 AS (  
    SELECT 
        total_months,
        CAST (COUNT(DISTINCT interest_id) AS FLOAT) as interest_count
    FROM tbl1
    GROUP BY total_months
)
SELECT
    total_months,
    interest_count,
    -- Running share of interests, accumulated from 14 months downward.
    FORMAT(SUM(interest_count) OVER (ORDER BY total_months DESC) / SUM(interest_count) OVER (), 'P') AS cumulative_percent
FROM tbl2
ORDER BY total_months DESC

total_months,interest_count,cumulative_percent
14,480,39.93%
13,82,46.76%
12,65,52.16%
11,94,59.98%
10,86,67.14%
9,95,75.04%
8,67,80.62%
7,90,88.10%
6,33,90.85%
5,38,94.01%
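
Reading the table, the cumulative percentage first passes 90% at `total_months = 6` (90.85%), since 7 months only reaches 88.10%. A minimal sketch of how the same threshold could be picked out in the query itself is shown below. The CTE names `months_per_interest`, `counts`, and `cumulative` are illustrative, and the percentage is kept numeric instead of being formatted so that it can be filtered.

```sql
WITH months_per_interest AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM interest_metrics
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
),
counts AS (
    SELECT
        total_months,
        COUNT(*) AS interest_count
    FROM months_per_interest
    GROUP BY total_months
),
cumulative AS (
    SELECT
        total_months,
        interest_count,
        -- 100.0 forces decimal arithmetic; the running sum starts at 14 months.
        100.0 * SUM(interest_count) OVER (ORDER BY total_months DESC)
              / SUM(interest_count) OVER () AS cumulative_percent
    FROM counts
)
-- First total_months value, going downward, whose running share reaches 90%.
SELECT TOP (1)
    total_months,
    cumulative_percent
FROM cumulative
WHERE cumulative_percent >= 90
ORDER BY total_months DESC;
```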


    3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?

In [3]:
WITH tbl AS (   
    SELECT  
        interest_id, 
        COUNT(DISTINCT month_year) as total_months
    FROM interest_metrics i1
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
    HAVING COUNT(DISTINCT month_year) < 6
)
-- Every interest_id in tbl comes from interest_metrics, so an inner join is enough.
SELECT COUNT(*) AS data_remove
FROM interest_metrics me
JOIN tbl ON tbl.interest_id = me.interest_id


data_remove
400
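
As a cross-check on this figure, the sketch below summarises the interests below the 6-month threshold in a single pass, without joining back to `interest_metrics`. It is an illustration only: the CTE and column aliases are hypothetical, and unlike the query above it ignores rows whose `month_year` is NULL.

```sql
WITH months_per_interest AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months,
        COUNT(*) AS row_count            -- rows with a non-NULL month_year
    FROM interest_metrics
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
)
SELECT
    COUNT(*)          AS interests_removed,  -- how many interests fall below 6 months
    SUM(total_months) AS months_removed,     -- distinct interest-month pairs
    SUM(row_count)    AS rows_removed        -- data points removed
FROM months_per_interest
WHERE total_months < 6;
```

If `months_removed` and `rows_removed` agree, each removed interest has exactly one row per month, and `rows_removed` can be compared directly with `data_remove`.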


    4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective.

In [4]:
WITH REMOVE_CTE AS (    
    SELECT  
        interest_id, 
        COUNT(DISTINCT month_year) as total_months
    FROM interest_metrics i1
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
    HAVING COUNT(DISTINCT month_year) < 6
),
REMOVE_Month_CTE AS (
    SELECT
        month_year,
        COUNT(*) AS total_remove
    FROM interest_metrics i
    JOIN REMOVE_CTE r ON i.interest_id = r.interest_id
    GROUP BY month_year
),
ORIGINAL_CTE AS (       
    SELECT 
        month_year, 
        count(*) as total_original
    FROM interest_metrics i 
    WHERE month_year is not null 
    GROUP BY month_year 
)
SELECT
    O.month_year,
    total_original,
    total_remove,
    FORMAT(CAST(total_remove AS FLOAT) / total_original, 'P') AS remove_percent
FROM ORIGINAL_CTE O
JOIN REMOVE_Month_CTE R ON O.month_year = R.month_year
ORDER BY O.month_year;

month_year,total_original,total_remove,remove_percent
2018-07-01,729,20,2.74%
2018-08-01,767,15,1.96%
2018-09-01,780,6,0.77%
2018-10-01,857,4,0.47%
2018-11-01,928,3,0.32%
2018-12-01,995,9,0.90%
2019-01-01,973,7,0.72%
2019-02-01,1121,49,4.37%
2019-03-01,1136,58,5.11%
2019-04-01,1099,64,5.82%


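- The removed rows are a small share of each month's records: at most 5.82% (2019-04) in the output above, and under 1% in several months. Dropping the interests with fewer than 6 months of data, which sit beyond the 90% cumulative mark, therefore loses little information. From a segment perspective, an interest that is present in only a few months has too short a history to show a trend, so removing it lets the analysis focus on interests that can be compared across the full period.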
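
The question also asks for a concrete comparison between an interest with all 14 months and a removed one. A hedged sketch of how such a pair could be pulled is shown below. It picks the lowest `interest_id` in each group purely for illustration, and the names `months_per_interest`, `labelled`, `picked`, `status`, `first_month`, and `last_month` are not from the original queries.

```sql
WITH months_per_interest AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months,
        MIN(month_year) AS first_month,
        MAX(month_year) AS last_month
    FROM interest_metrics
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
),
labelled AS (
    -- Keep only the two groups being compared: removed (< 6) and fully present (14).
    SELECT
        CASE WHEN total_months < 6 THEN 'removed' ELSE 'all 14 months' END AS status,
        interest_id,
        total_months,
        first_month,
        last_month
    FROM months_per_interest
    WHERE total_months < 6 OR total_months = 14
),
picked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY status ORDER BY interest_id) AS rn
    FROM labelled
)
-- One example interest per group.
SELECT status, interest_id, total_months, first_month, last_month
FROM picked
WHERE rn = 1
ORDER BY total_months DESC;
```

Comparing `first_month` and `last_month` for the two rows shows how much shorter the observed window of the removed interest is, which is the core of the segment argument.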

    5. After removing these interests - how many unique interests are there for each month?

In [5]:
WITH KEEP_CTE AS (
    -- Interests that appear in at least 6 distinct months are kept.
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM interest_metrics i1
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
    HAVING COUNT(DISTINCT month_year) >= 6
)
SELECT
    month_year,
    COUNT(*) AS interest_month_cnt
FROM interest_metrics i
JOIN KEEP_CTE k ON k.interest_id = i.interest_id
WHERE month_year IS NOT NULL
GROUP BY month_year
ORDER BY month_year


month_year,interest_month_cnt
2018-07-01,709
2018-08-01,752
2018-09-01,774
2018-10-01,853
2018-11-01,925
2018-12-01,986
2019-01-01,966
2019-02-01,1072
2019-03-01,1078
2019-04-01,1035
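
These counts line up with the previous question: for 2018-07-01, 729 original rows minus 20 removed leaves 709, and the other months shown reconcile the same way. A minimal sketch that puts the three figures side by side in one query, so the reconciliation can be checked for every month, might look like this. The CTE name and the `total_kept` and `total_removed` aliases are illustrative.

```sql
WITH months_per_interest AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM interest_metrics
    WHERE month_year IS NOT NULL
    GROUP BY interest_id
)
SELECT
    i.month_year,
    COUNT(*) AS total_original,
    -- Conditional sums split each month's rows by the 6-month threshold.
    SUM(CASE WHEN m.total_months >= 6 THEN 1 ELSE 0 END) AS total_kept,
    SUM(CASE WHEN m.total_months <  6 THEN 1 ELSE 0 END) AS total_removed
FROM interest_metrics i
JOIN months_per_interest m ON m.interest_id = i.interest_id
WHERE i.month_year IS NOT NULL
GROUP BY i.month_year
ORDER BY i.month_year;
```

For every row, `total_kept + total_removed` equals `total_original` by construction, so any mismatch with the separate queries would point to NULL handling or duplicate rows.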
