### Data Exploration and Cleansing
---
    1. Update the `interest_metrics` table by modifying the `month_year` column to be a date data type with the start of the month

In [2]:
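-- Widen the column so it can hold a 'yyyy-mm-dd' string
ALTER TABLE interest_metrics
ALTER COLUMN month_year VARCHAR(10);

-- Prefix day 01 to each value and parse it as dd/mm/yyyy
UPDATE interest_metrics
SET month_year = CONVERT(DATE, '01/' + REPLACE(month_year, '-', '/'), 103);

ALTER TABLE interest_metrics
ALTER COLUMN month_year DATE;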
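
To see what the conversion does to individual values, the expression can be tried on a few literals first. This is a minimal sketch that assumes the raw values look like `07-2018` (month, dash, year); the table variable `@sample` and its rows are made up for illustration, and the shorter `'01-' + month_year` form with style 105 is an equivalent variant, not the statement used above.

```sql
-- Hypothetical sample values in the assumed 'mm-yyyy' format
DECLARE @sample TABLE (month_year VARCHAR(10));
INSERT INTO @sample (month_year)
VALUES ('07-2018'), ('12-2018'), (NULL);

-- Style 105 parses dd-mm-yyyy, so prefixing '01-' yields the first day of the month
SELECT
    month_year AS raw_value,
    CONVERT(DATE, '01-' + month_year, 105) AS month_start
FROM @sample;
```

A null input stays null, because concatenating a string with NULL gives NULL under the default `CONCAT_NULL_YIELDS_NULL` setting.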

    2. What is the count of records in the `interest_metrics` table for each `month_year` value, sorted in chronological order (earliest to latest) with the null values appearing first?

In [3]:
SELECT
    month_year,
    COUNT(*) as records
FROM interest_metrics
GROUP BY month_year
ORDER BY month_year

month_year,records
,1194
2018-07-01,729
2018-08-01,767
2018-09-01,780
2018-10-01,857
2018-11-01,928
2018-12-01,995
2019-01-01,973
2019-02-01,1121
2019-03-01,1136
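
The plain `ORDER BY month_year` puts the null row first because SQL Server sorts NULL before all other values in ascending order. If that ordering should not depend on the engine's default, it can be stated explicitly. A sketch of the same query, run against the same table, with an explicit null-first sort key:

```sql
SELECT
    month_year,
    COUNT(*) AS records
FROM interest_metrics
GROUP BY month_year
-- 0 for the null group, 1 for every dated group, then chronological order
ORDER BY
    CASE WHEN month_year IS NULL THEN 0 ELSE 1 END,
    month_year;
```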


    3. What do you think we should do with these null values in the `interest_metrics` table?

In [4]:
SELECT 
    ROUND(100*CAST(COUNT(*) as float)/(SELECT COUNT(*) FROM interest_metrics),2) AS null_percentage
FROM interest_metrics
WHERE month_year is NULL

null_percentage
8.37


For the null values in the `month_year` column, we might consider dropping them because:

- They account for only 8.37% of all rows, so dropping them should have little effect on the final result.
- Rows with a null `month_year` also have nulls in other columns, so dropping them removes those nulls as well.
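
Both points can be checked before anything is deleted. The sketch below is one possible way to do that, not part of the original solution: the first query counts non-null values in a few other columns among the rows with a null `month_year`, and the second tries the delete inside a transaction that is rolled back, so the table is left unchanged.

```sql
-- COUNT(column) counts only non-null values, so zeros here mean
-- those columns are null wherever month_year is null
SELECT
    COUNT(*) AS null_month_year_rows,
    COUNT(_month) AS non_null_month,
    COUNT(_year) AS non_null_year,
    COUNT(interest_id) AS non_null_interest_id
FROM interest_metrics
WHERE month_year IS NULL;

-- Trial run of the delete; ROLLBACK undoes it
BEGIN TRANSACTION;
DELETE FROM interest_metrics WHERE month_year IS NULL;
SELECT COUNT(*) AS remaining_rows FROM interest_metrics;
ROLLBACK TRANSACTION;
```

Replacing `ROLLBACK` with `COMMIT` would make the deletion permanent. The outputs later in this section appear to have been produced with the null rows still present, so committing would change them.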

    4. How many `interest_id` values exist in the `interest_metrics` table but not in the `interest_map` table? What about the other way around?

-  `interest_id` values that exist in the `interest_metrics` table but not in the `interest_map` table:

In [5]:
SELECT DISTINCT 
    interest_id
FROM interest_metrics me 
LEFT JOIN interest_map ma ON ma.id = me.interest_id
WHERE ma.id is NULL

interest_id
""


-  `interest_id` values that exist in the `interest_map` table but not in the `interest_metrics` table:

In [6]:
SELECT DISTINCT 
    id
FROM interest_metrics me 
RIGHT JOIN interest_map ma ON ma.id = me.interest_id
WHERE me.interest_id is NULL

id
19598
35964
40185
40186
42010
42400
47789
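
Since the question asks how many values there are, the two listings can also be reduced to counts. This is a sketch using `NOT EXISTS` instead of the outer joins above; it should be equivalent for this purpose, but it is an alternative formulation rather than the original queries.

```sql
-- Distinct interest_id values in interest_metrics with no match in interest_map
-- (COUNT(DISTINCT ...) ignores NULLs, so a NULL interest_id is not counted)
SELECT COUNT(DISTINCT me.interest_id) AS metrics_only_ids
FROM interest_metrics me
WHERE NOT EXISTS (
    SELECT 1 FROM interest_map ma WHERE ma.id = me.interest_id
);

-- Distinct id values in interest_map with no match in interest_metrics
SELECT COUNT(DISTINCT ma.id) AS map_only_ids
FROM interest_map ma
WHERE NOT EXISTS (
    SELECT 1 FROM interest_metrics me WHERE me.interest_id = ma.id
);
```

If the listings above are complete, the first count should come out as 0 (only the null `interest_id` is unmatched) and the second as 7.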


    5. Summarise the id values in the `interest_map` by its total record count in this table

In [7]:
SELECT
    COUNT(*) AS records
FROM interest_map

records
1209
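
A related check, offered here as an optional sketch, is whether each `id` appears only once in `interest_map`. That matters for the join in the next question, because duplicate ids would multiply rows from `interest_metrics`.

```sql
-- If records and distinct_ids are equal, every id is unique in interest_map
SELECT
    COUNT(*) AS records,
    COUNT(DISTINCT id) AS distinct_ids
FROM interest_map;
```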


    6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where `interest_id` = 21246 in your joined output and include all columns from `interest_metrics` and all columns from `interest_map` except from the id column.

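- We should LEFT JOIN `interest_metrics` to `interest_map`: this keeps every row of `interest_metrics` and attaches the matching interest details from `interest_map` where they exist.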

In [8]:
SELECT 
    me.*,
    ma.interest_name,
    ma.interest_summary,
    ma.created_at,
    ma.last_modified
FROM interest_metrics me 
LEFT JOIN interest_map ma ON ma.id = me.interest_id
WHERE interest_id = 21246 

_month,_year,month_year,interest_id,composition,index_value,ranking,percentile_ranking,interest_name,interest_summary,created_at,last_modified
7.0,2018.0,2018-07-01,21246,2.26,0.65,722,0.96,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
8.0,2018.0,2018-08-01,21246,2.13,0.59,765,0.26,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
9.0,2018.0,2018-09-01,21246,2.06,0.61,774,0.77,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
10.0,2018.0,2018-10-01,21246,1.74,0.58,855,0.23,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
11.0,2018.0,2018-11-01,21246,2.25,0.78,908,2.16,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
12.0,2018.0,2018-12-01,21246,1.97,0.7,983,1.21,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
1.0,2019.0,2019-01-01,21246,2.05,0.76,954,1.95,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
2.0,2019.0,2019-02-01,21246,1.84,0.68,1109,1.07,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
3.0,2019.0,2019-03-01,21246,1.75,0.67,1123,1.14,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
4.0,2019.0,2019-04-01,21246,1.58,0.63,1092,0.64,Readers of El Salvadoran Content,People reading news from El Salvadoran media sources.,2018-06-11 17:50:04.0000000,2018-06-11 17:50:04.0000000
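
One way to sanity-check the join choice is to compare row counts across join types. The query below is a sketch against the same two tables; the column aliases are illustrative.

```sql
SELECT
    (SELECT COUNT(*) FROM interest_metrics) AS metrics_rows,
    (SELECT COUNT(*)
     FROM interest_metrics me
     LEFT JOIN interest_map ma ON ma.id = me.interest_id) AS left_join_rows,
    (SELECT COUNT(*)
     FROM interest_metrics me
     INNER JOIN interest_map ma ON ma.id = me.interest_id) AS inner_join_rows;
```

If every `id` in `interest_map` is unique, `left_join_rows` should equal `metrics_rows`. Any gap between `left_join_rows` and `inner_join_rows` is the number of metrics rows without a matching interest, such as those with a null `interest_id`.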


    7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `interest_map` table? Do you think these values are valid and why?

In [9]:
SELECT 
    me.*,
    ma.interest_name,
    ma.interest_summary,
    ma.created_at,
    ma.last_modified
FROM interest_metrics me 
LEFT JOIN interest_map ma ON ma.id = me.interest_id
WHERE month_year < created_at

_month,_year,month_year,interest_id,composition,index_value,ranking,percentile_ranking,interest_name,interest_summary,created_at,last_modified
7,2018,2018-07-01,32701,4.23,1.41,483,33.74,Womens Equality Advocates,People visiting sites advocating for womens equal rights.,2018-07-06 14:35:03.0000000,2018-07-06 14:35:03.0000000
7,2018,2018-07-01,32702,3.56,1.18,580,20.44,Romantics,People reading about romance and researching ideas for planning romantic moments.,2018-07-06 14:35:04.0000000,2018-07-06 14:35:04.0000000
7,2018,2018-07-01,32703,5.53,1.8,375,48.56,School Supply Shoppers,Consumers shopping for classroom supplies for K-12 students.,2018-07-06 14:35:04.0000000,2018-07-06 14:35:04.0000000
7,2018,2018-07-01,32704,8.04,2.27,225,69.14,Major Airline Customers,People visiting sites for major airline brands to plan and view travel itinerary.,2018-07-06 14:35:04.0000000,2018-07-06 14:35:04.0000000
7,2018,2018-07-01,32705,4.38,1.34,505,30.73,Certified Events Professionals,Professionals reading industry news and researching products and services for event management.,2018-07-06 14:35:04.0000000,2018-07-06 14:35:04.0000000
7,2018,2018-07-01,33191,3.99,2.11,283,61.18,Online Shoppers,People who spend money online,2018-07-17 10:40:03.0000000,2018-07-17 10:46:58.0000000
8,2018,2018-08-01,33957,2.01,0.84,704,8.21,Call of Duty Enthusiasts,People reading news and product releases for Call of Duty games and merchandise.,2018-08-02 16:05:03.0000000,2018-08-02 16:05:03.0000000
8,2018,2018-08-01,33958,1.88,0.73,740,3.52,Astrology Enthusiasts,People reading daily horoscopes and astrology content.,2018-08-02 16:05:03.0000000,2018-08-02 16:05:03.0000000
8,2018,2018-08-01,33959,2.54,1.86,67,91.26,Boston Bruins Fans,People reading news about the Boston Bruins and watching games. These consumers are more likely to spend money on team gear.,2018-08-02 16:05:03.0000000,2018-08-02 16:05:03.0000000
8,2018,2018-08-01,33960,2.68,1.67,118,84.62,Chicago Blackhawks Fans,People reading news about the Chicago Blackhawks and watching games. These consumers are more likely to spend money on team gear.,2018-08-02 16:05:03.0000000,2018-08-02 16:05:03.0000000


- There are a total of 188 rows with a `month_year` value before `created_at`. These values are valid because `month_year` was converted to a date set to the first day of the month, so a record for an interest created partway through that month appears to precede its `created_at`.
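
This reasoning can be tested by truncating `created_at` to the first day of its month and comparing again. The sketch below assumes SQL Server 2012 or later, where `DATEFROMPARTS` is available; the column aliases are illustrative.

```sql
SELECT
    -- Rows where month_year is earlier than the full created_at timestamp
    COUNT(*) AS rows_before_created_at,
    -- Of those, rows where month_year is also earlier than the month created_at falls in
    SUM(CASE
            WHEN me.month_year < DATEFROMPARTS(YEAR(ma.created_at), MONTH(ma.created_at), 1)
            THEN 1 ELSE 0
        END) AS rows_before_created_month
FROM interest_metrics me
INNER JOIN interest_map ma ON ma.id = me.interest_id
WHERE me.month_year < ma.created_at;
```

If the explanation holds, `rows_before_created_month` should be 0, meaning every flagged record falls in the same month its interest was created.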