<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Flagging-Good-Articles" data-toc-modified-id="Flagging-Good-Articles-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Flagging Good Articles</a></span></li><li><span><a href="#Flagging-Hold-out-Set" data-toc-modified-id="Flagging-Hold-out-Set-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Flagging Hold-out Set</a></span></li></ul></div>

In this notebook, we will flag some entries in the database as `is_holdout` so we can use them as a holdout set.

But before we do that, we must select the good data from our database. Our database contains some articles that have little to no textual content (bad data). We want to filter those out.

On average, news articles are 400-700 words long. However, to err on the side of inclusion, we want to keep anything that has at least one paragraph (~100 words).

First, create a new `is_good_article` column on `AllTheNews21`.

**Only run this once in pgAdmin**

```sql
-- Create a new "is_good_article" column: Default value is False
ALTER TABLE public."AllTheNews21"
ADD COLUMN is_good_article BOOLEAN DEFAULT FALSE;
```

Also, create a new `is_holdout` column on `AllTheNews21`. 

**Only run this once in pgAdmin**

```sql
-- Create a new "is_holdout" column: Default value is False
ALTER TABLE public."AllTheNews21"
ADD COLUMN is_holdout BOOLEAN DEFAULT FALSE;
```

In [1]:
from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from dotenv import load_dotenv         # conda install -c conda-forge python-dotenv
import os                              # Python default package
import pandas as pd

In [2]:
pd.options.display.max_rows = 1000
pd.set_option('max_colwidth', 400)

In [3]:
load_dotenv() # => True if no error

True

In [10]:
# Load secrets from the .env file
db_name = os.getenv("db_name")
db_username = os.getenv("db_username")
db_password = os.getenv("db_password")
db_table_schema = os.getenv("db_table_schema")
# SQLAlchemy 1.4+ only accepts the "postgresql" dialect name ("postgres" was removed)
connection_string = f"postgresql://{db_username}:{db_password}@localhost:5432/{db_name}"
engine = create_engine(connection_string)
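
If the database password contains characters such as `@`, `:` or `/`, interpolating it directly into the URL can break parsing. A minimal sketch of one way to guard against that, using only the standard library; the helper name `build_postgres_url` and the credentials shown are hypothetical and not part of this project:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, database, host="localhost", port=5432):
    """Build a PostgreSQL connection URL, percent-encoding the credentials."""
    # quote_plus escapes characters such as "@", ":" and "/" that would
    # otherwise be misread as URL delimiters.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
print(build_postgres_url("news_user", "p@ss:word/1", "news_db"))
# → postgresql://news_user:p%40ss%3Aword%2F1@localhost:5432/news_db
```

The resulting string could then be passed to `create_engine` in place of the hand-built one.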

## Flagging Good Articles

We will consider an article good if it has at least 90 words.

In [None]:
# Select all articles with at least 90 words

q = f"""
SELECT index, category
FROM public."AllTheNews21"
WHERE word_count >= 90
"""
good_articles = pd.read_sql(q, con=engine)

How are they currently distributed across the categories?

In [None]:
good_articles.groupby("category").count()

Now, flag those articles.

In [None]:
# # Finally, update the database for those articles
# for index in good_articles["index"]:
    
#     q = f"""
#     UPDATE public."AllTheNews21"
#     SET is_good_article = true
#     WHERE index = '{index}'
#     """
    
#     engine.execute(q)
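
The loop above issues one `UPDATE` per article, which can be slow on a large table. Since the selection rule is just a word-count threshold, the same flagging could probably be done in a single statement. The sketch below demonstrates the idea on an in-memory SQLite table that stands in for `public."AllTheNews21"`; the toy rows and the `MIN_WORDS` name are assumptions, and against PostgreSQL the statement would be sent through the project's engine with that driver's parameter style.

```python
import sqlite3

MIN_WORDS = 90  # threshold used in this notebook

# In-memory SQLite table standing in for public."AllTheNews21" (toy rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "AllTheNews21" ('
    '"index" INTEGER PRIMARY KEY, word_count INTEGER, '
    "is_good_article BOOLEAN DEFAULT FALSE)"
)
conn.executemany(
    'INSERT INTO "AllTheNews21" ("index", word_count) VALUES (?, ?)',
    [(1, 12), (2, 90), (3, 540), (4, 89)],
)

# One set-based UPDATE replaces the per-row loop
cursor = conn.execute(
    'UPDATE "AllTheNews21" SET is_good_article = TRUE WHERE word_count >= ?',
    (MIN_WORDS,),
)
conn.commit()

flagged = [
    row[0]
    for row in conn.execute(
        'SELECT "index" FROM "AllTheNews21" WHERE is_good_article ORDER BY "index"'
    )
]
print(cursor.rowcount, flagged)
# → 2 [2, 3]
```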

## Flagging Hold-out Set

Among all the good data, here is the distribution of our label:

[insert distribution here]

We will randomly select 200 articles per category as holdout. We have 21 categories in total, so that will give us 4200 entries for our holdout set.
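
The per-category sampling plan can be sketched in plain Python. This is only an illustration of the idea, not the method used in the notebook (which samples in SQL); the function name `sample_per_category`, the fixed seed, and the toy rows are assumptions. Note that a category with fewer articles than the requested size contributes fewer rows, so 4200 is the total only if every category has at least 200 good articles.

```python
import random
from collections import defaultdict


def sample_per_category(rows, n_per_category, seed=0):
    """Pick up to n_per_category random article ids from each category.

    rows is an iterable of (article_id, category) pairs.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_category = defaultdict(list)
    for article_id, category in rows:
        by_category[category].append(article_id)

    holdout = {}
    for category, ids in sorted(by_category.items()):
        k = min(n_per_category, len(ids))  # small categories give fewer rows
        holdout[category] = rng.sample(ids, k)
    return holdout


# Toy data: 3 categories with 5, 5 and 2 articles
rows = [(i, "food") for i in range(5)]
rows += [(i, "sports") for i in range(5, 10)]
rows += [(i, "wealth") for i in range(10, 12)]

holdout = sample_per_category(rows, n_per_category=3)
print({category: len(ids) for category, ids in holdout.items()})
# → {'food': 3, 'sports': 3, 'wealth': 2}

print(21 * 200)  # → 4200
```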

In [20]:
# List of distinct categories in the DB
categories = [
    "arts and entertainment",
    "automobiles",
    "business",
    "climate and environment",
    "energy",
    "finance and economics",
    "food",
    "global healthcare",
    "health and wellness",
    "legal and crimes",
    "life",
    "markets and investments",
    "personal finance",
    "politics",
    "real estate",
    "science and technology",
    "sports",
    "travel and transportation",
    "us",
    "wealth",
    "world"
]

In [25]:
# Select random good articles per category to use as holdout set
frames = []

for cat in categories:

    q = f"""
    SELECT 
        index,
        category
    FROM public."AllTheNews21"
    WHERE category = '{cat}'
      AND is_good_article = true
    ORDER BY RANDOM()
    LIMIT 200
    """
    frames.append(pd.read_sql(q, con=engine))

# DataFrame.append was removed in pandas 2.0; concatenate the per-category frames instead
holdout_articles = pd.concat(frames, ignore_index=True)

In [28]:
# # Finally, update the database for those articles
# for index in holdout_articles["index"]:
    
#     q = f"""
#     UPDATE public."AllTheNews21"
#     SET is_holdout = true
#     WHERE index = '{index}'
#     """
    
#     engine.execute(q)
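
As an alternative to one query per category followed by one `UPDATE` per article, the selection and the flagging could be combined into a single statement using a window function. The sketch below runs on an in-memory SQLite table standing in for `public."AllTheNews21"`; the toy rows and the `N_PER_CATEGORY` name are assumptions, and the statement has not been run against the project's PostgreSQL database.

```python
import sqlite3

N_PER_CATEGORY = 2  # the notebook uses 200; kept small for the toy table

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "AllTheNews21" ('
    '"index" INTEGER PRIMARY KEY, category TEXT, '
    "is_good_article BOOLEAN DEFAULT FALSE, is_holdout BOOLEAN DEFAULT FALSE)"
)
# Ten toy articles in two categories; article 1 is marked as not good
rows = [(i, "food" if i % 2 else "sports", i != 1) for i in range(1, 11)]
conn.executemany(
    'INSERT INTO "AllTheNews21" ("index", category, is_good_article) '
    "VALUES (?, ?, ?)",
    rows,
)

# Rank good articles randomly within each category, then flag the top N
conn.execute(
    """
    UPDATE "AllTheNews21"
    SET is_holdout = TRUE
    WHERE "index" IN (
        SELECT "index"
        FROM (
            SELECT "index",
                   ROW_NUMBER() OVER (
                       PARTITION BY category ORDER BY RANDOM()
                   ) AS rn
            FROM "AllTheNews21"
            WHERE is_good_article
        ) AS ranked
        WHERE rn <= ?
    )
    """,
    (N_PER_CATEGORY,),
)
conn.commit()

counts = dict(
    conn.execute(
        'SELECT category, COUNT(*) FROM "AllTheNews21" '
        "WHERE is_holdout GROUP BY category"
    )
)
print(counts)
```

Each category ends up with at most `N_PER_CATEGORY` flagged rows, and articles that are not marked as good are never selected.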