# SQL EDA of All-The-News-2.1

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Purpose" data-toc-modified-id="Purpose-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Import-Libraries-and-Set-Settings" data-toc-modified-id="Import-Libraries-and-Set-Settings-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Libraries and Set Settings</a></span></li><li><span><a href="#SQL-EDAs" data-toc-modified-id="SQL-EDAs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>SQL EDAs</a></span><ul class="toc-item"><li><span><a href="#Check-the-current-state-of-the-AllTheNews21-Table" data-toc-modified-id="Check-the-current-state-of-the-AllTheNews21-Table-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Check the current state of the <code>AllTheNews21</code> Table</a></span></li><li><span><a href="#How-many-article-entries-are-in-this-dataset?" data-toc-modified-id="How-many-article-entries-are-in-this-dataset?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>How many article entries are in this dataset?</a></span></li><li><span><a href="#What-is-the-range-of-the-years-in-this-dataset?" data-toc-modified-id="What-is-the-range-of-the-years-in-this-dataset?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>What is the range of the years in this dataset?</a></span></li><li><span><a href="#How-many-article-entries-per-year?" data-toc-modified-id="How-many-article-entries-per-year?-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>How many article entries per year?</a></span></li><li><span><a href="#What-is-the-Average-article-entries-per-year?" data-toc-modified-id="What-is-the-Average-article-entries-per-year?-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>What is the Average article entries per year?</a></span></li><li><span><a href="#How-many-entries-per-month-for-each-year?" 
data-toc-modified-id="How-many-entries-per-month-for-each-year?-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>How many entries per month for each year?</a></span></li><li><span><a href="#What-is-the-Average-article-entries-per-month-overall?" data-toc-modified-id="What-is-the-Average-article-entries-per-month-overall?-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>What is the Average article entries per month overall?</a></span></li><li><span><a href="#Which-month-is-overall-the-highest?-Which-month-is-the-lowest?" data-toc-modified-id="Which-month-is-overall-the-highest?-Which-month-is-the-lowest?-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span>Which month is overall the highest? Which month is the lowest?</a></span></li><li><span><a href="#How-many-months-are-in-this-dataset-in-total?" data-toc-modified-id="How-many-months-are-in-this-dataset-in-total?-3.9"><span class="toc-item-num">3.9&nbsp;&nbsp;</span>How many months are in this dataset in total?</a></span></li></ul></li></ul></div>

## Purpose

- The purpose of this notebook is to run some quick SQL EDA on the article entries in the PostgreSQL database
- There are **2,584,165** article entries and 26 publishers loaded into the database

## Import Libraries and Set Settings

In [1]:
from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from dotenv import load_dotenv         # conda install -c conda-forge python-dotenv
import os                              # Python default package
import pandas as pd

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [2]:
sns.set_theme(style="whitegrid")

In [3]:
pd.options.display.max_rows = 1000

In [4]:
load_dotenv() # => True if at least one variable was loaded from the .env file

True

In [5]:
# Load secrets from the .env file
db_name = os.getenv("db_name")
db_username = os.getenv("db_username")
db_password = os.getenv("db_password")
db_table_schema = os.getenv("db_table_schema")
# SQLAlchemy 1.4+ only accepts the "postgresql://" dialect name ("postgres://" was removed)
connection_string = f"postgresql://{db_username}:{db_password}@localhost:5432/{db_name}"
engine = create_engine(connection_string)
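
If the password loaded from `.env` contains characters such as `@`, `/`, or `:`, interpolating it directly into the URL can produce a connection string that is parsed incorrectly. A minimal sketch of one way to guard against that, assuming the same URL layout as above; the credential values here are hypothetical placeholders, not the real secrets:

```python
from urllib.parse import quote_plus

# Hypothetical placeholder credentials (the notebook reads the real ones from .env)
db_username = "news_user"
db_password = "p@ss/word:1"
db_name = "news"

# Percent-encode the user and password so reserved URL characters
# cannot be mistaken for the host or port separators
connection_string = (
    f"postgresql://{quote_plus(db_username)}:{quote_plus(db_password)}"
    f"@localhost:5432/{db_name}"
)
print(connection_string)
```

The encoded string can then be passed to `create_engine` as usual.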

In [6]:
# List of available tables in the DB
q = """
SELECT * 
FROM information_schema.tables
WHERE table_catalog = '{db_name}'
AND table_schema = '{db_table_schema}';
""".format(
    db_name = db_name,
    db_table_schema = db_table_schema
)

pd.read_sql(q, con=engine)[["table_name"]]

Unnamed: 0,table_name
0,AllTheNews21
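
The query above builds its `WHERE` clause with `str.format`, which is fine for trusted values but does not escape them. A sketch of the same lookup with bound parameters instead, using an in-memory SQLite table as a stand-in for `information_schema.tables` so it runs without a database server; the table name `tables_demo` and the sample rows are assumptions made for illustration:

```python
import sqlite3

# In-memory stand-in for information_schema.tables (hypothetical sample rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tables_demo (table_catalog TEXT, table_schema TEXT, table_name TEXT)"
)
conn.executemany(
    "INSERT INTO tables_demo VALUES (?, ?, ?)",
    [
        ("news", "public", "AllTheNews21"),
        ("news", "pg_catalog", "pg_class"),
        ("other_db", "public", "unrelated"),
    ],
)

# Named placeholders: the driver binds the values, so no string formatting is needed
q = """
SELECT table_name
FROM tables_demo
WHERE table_catalog = :db_name
AND table_schema = :db_table_schema;
"""
params = {"db_name": "news", "db_table_schema": "public"}

table_names = [row[0] for row in conn.execute(q, params)]
print(table_names)
conn.close()
```

With the notebook's engine, the equivalent would likely be wrapping the query in `sqlalchemy.text` and passing the dictionary through the `params` argument of `pd.read_sql`.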


## SQL EDAs

### Check the current state of the `AllTheNews21` Table

In [7]:
q = """
SELECT *
FROM public."AllTheNews21"
LIMIT 3;
"""
pd.read_sql(q, con=engine)

Unnamed: 0,index,date,year,month,day,author,title,article,url,section,publication,title_length,article_length
0,190194,2017-04-17 18:40:00,2017,4,17,Jack Slack,Whittaker versus Jacare: Pace Over Power,One slip in the middleweight division of 2017 ...,https://www.vice.com/en_us/article/wnmzb9/whit...,Sports,Vice,40,5962
1,190206,2019-06-19 21:48:32,2019,6,19,Nicole Einbinder,College admissions scandal students suing for ...,A group of students and their parents filed a ...,https://www.businessinsider.com/students-rejec...,,Business Insider,62,3070
2,190229,2017-05-24 11:30:01,2017,5,24,Emily Todd VanDerWerff,"The Handmaid’s Tale season 1, episode 7: “The ...","Every week, a few members of the Vox Culture t...",https://www.vox.com/culture/2017/5/24/15682174...,,Vox,111,11035


### How many article entries are in this dataset?

In [8]:
q = """
SELECT COUNT(*) AS articles_count
FROM public."AllTheNews21";
"""
pd.read_sql(q, con=engine)

Unnamed: 0,articles_count
0,2584165


Answer: 2,584,165

### What is the range of the years in this dataset?

In [9]:
q = """
SELECT 
    MIN(year) AS min_year,
    MAX(year) AS max_year
FROM public."AllTheNews21";
"""
pd.read_sql(q, con=engine)

Unnamed: 0,min_year,max_year
0,2016,2020


Answer: 2016-2020

### How many article entries per year?

In [10]:
q = """
SELECT
    year,
    COUNT(*) AS articles_count
FROM public."AllTheNews21"
GROUP BY year
ORDER BY year;
"""
article_count_per_year = pd.read_sql(q, con=engine)

In [19]:
# Visualize
fig = px.bar(
    article_count_per_year, 
    x="year", 
    y="articles_count",
    color="articles_count",
    labels={
        "year":"Publication Year",
        "articles_count": "Articles Count"
    },
    title="Distribution of Articles Per Year"
)
fig.show()

### What is the Average article entries per year?

In [12]:
q = """
with cte AS (
    SELECT
        year,
        COUNT(*) AS articles_count
    FROM public."AllTheNews21"
    GROUP BY year
    ORDER BY year
)
SELECT AVG(cte.articles_count) AS mean_articles_count_per_year
FROM cte;
"""
pd.read_sql(q, con=engine)

Unnamed: 0,mean_articles_count_per_year
0,516833.0


Answer: 516,833 Articles per year
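
As a quick sanity check, the same figure follows from the totals already reported: 2,584,165 articles spread over the 5 calendar years 2016 to 2020. Note that this treats the partial year 2020 as a full year, so the mean understates a typical full year. A small sketch of the arithmetic:

```python
# Totals taken from the query outputs in this notebook
total_articles = 2_584_165
years_in_dataset = 2020 - 2016 + 1   # 2016 through 2020, inclusive

mean_per_year = total_articles / years_in_dataset
print(years_in_dataset, mean_per_year)
```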

### How many entries per month for each year?

- Recall that our Min-Max years are 2016-2020
- 2020 covers only its first quarter plus part of April
  - Its count for April is much lower

In [13]:
q = """
SELECT
    year,
    month,
    COUNT(*) AS entries
FROM public."AllTheNews21"
GROUP BY year, month
ORDER BY year, month;
"""
articles_per_month_per_year = pd.read_sql(q, con=engine)

In [21]:
# The years and the colors we assign to each one
years = [
    (2016, "#f26633"),
    (2017, "#35faba"),
    (2018, "#a643f7"),
    (2019, "#e3d044"),
    (2020, "#939afa")
]

# Start the figure
fig = go.Figure()

# Create a bar trace for each year, with one bar per month
for yr, yr_color in years:

    # Select this year's rows once, ordered by month
    yr_data = articles_per_month_per_year[articles_per_month_per_year["year"] == yr].sort_values(by="month")

    fig.add_trace(go.Bar(
        x=yr_data["month"],
        y=yr_data["entries"],
        name=str(yr),
        marker_color=yr_color
    ))

# Create as stacked groups
fig.update_layout(
    title="Distribution of Article Per Month For Each Year",
    xaxis=dict(
        # title.font replaces the deprecated `titlefont` attribute
        title=dict(text='Months', font=dict(size=16)),
        tickfont_size=14,
        tickmode='linear'
    ),
    yaxis=dict(
        title=dict(text='Count of Articles', font=dict(size=16)),
        tickfont_size=14,
    ),
    legend=dict(
        x=1.0,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    ),
    barmode='stack', # stack or group
    #bargap=0.15, # gap between bars of adjacent location coordinates.
    #bargroupgap=0.1 # gap between bars of the same location coordinate.
)

fig.show()

### What is the Average article entries per month overall?

In [15]:
q = """
with cte AS (
    SELECT
        year,
        month,
        COUNT(*) AS articles_count
    FROM public."AllTheNews21"
    GROUP BY year, month
    ORDER BY year, month
)
SELECT AVG(cte.articles_count) AS mean_articles_count_per_month
FROM cte;
"""
pd.read_sql(q, con=engine)

Unnamed: 0,mean_articles_count_per_month
0,49695.480769


Answer: 49,695 Articles per month, averaged over the 52 months in the dataset (2016 through April 2020)

### Which month is overall the highest? Which month is the lowest?

- **However, remember that the totals for Jan, Feb, March, and April would also include the extra articles from 2020**
- **To compare the months on equal footing, exclude the articles from 2020**

In [23]:
# We are excluding the entries in 2020 in order to compare on equal footing
q = """
SELECT
    month,
    COUNT(*) AS articles_count
FROM public."AllTheNews21"
WHERE year != 2020
GROUP BY month
ORDER BY articles_count DESC;
"""
pd.read_sql(q, con=engine)

Unnamed: 0,month,articles_count
0,10,216639
1,3,209225
2,5,208956
3,6,206399
4,11,203962
5,7,202720
6,8,199507
7,9,197975
8,4,197207
9,1,190515
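
Because the query sorts by `articles_count DESC`, the first row answers the "highest" half of the question. The listing above shows only ten rows, and February and December are not among them, so the lowest month cannot be read from it. A small sketch that reads both facts off the rows shown, with the counts copied from the output above:

```python
import calendar

# month -> articles_count, copied from the rows shown in the output (2020 excluded)
shown_counts = {
    10: 216639, 3: 209225, 5: 208956, 6: 206399, 11: 203962,
    7: 202720, 8: 199507, 9: 197975, 4: 197207, 1: 190515,
}

# Month with the largest count among the rows shown
top_month = max(shown_counts, key=shown_counts.get)
top_name = calendar.month_name[top_month]

# Months that do not appear in the rows shown
missing_months = sorted(set(range(1, 13)) - set(shown_counts))

print(top_name, shown_counts[top_month], missing_months)
```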


### How many months are in this dataset in total?

In [17]:
q = """
with cte  AS (
    SELECT 
        year,
        COUNT(DISTINCT month) AS months_in_year
    FROM public."AllTheNews21"
    GROUP BY year
)
SELECT SUM(months_in_year) AS months_in_dataset
FROM cte;
"""
pd.read_sql(q, con=engine)

Unnamed: 0,months_in_dataset
0,52.0


In [18]:
q = """
SELECT 
    year,
    COUNT(DISTINCT month) AS months_in_year
FROM public."AllTheNews21"
GROUP BY year
"""
pd.read_sql(q, con=engine)

Unnamed: 0,year,months_in_year
0,2016,12
1,2017,12
2,2018,12
3,2019,12
4,2020,4
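
The per-year breakdown accounts for the total of 52 months, and dividing the article count by that total reproduces the monthly mean computed earlier. A short sketch of the arithmetic, using only numbers from the outputs in this notebook:

```python
# year -> distinct months present, copied from the output above
months_in_year = {2016: 12, 2017: 12, 2018: 12, 2019: 12, 2020: 4}
total_articles = 2_584_165

months_in_dataset = sum(months_in_year.values())
mean_per_month = total_articles / months_in_dataset

print(months_in_dataset, round(mean_per_month, 6))
```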
