# Load BBC to PostgreSQL DB

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Purpose" data-toc-modified-id="Purpose-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Import-Libraries-and-Settings" data-toc-modified-id="Import-Libraries-and-Settings-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Libraries and Settings</a></span></li><li><span><a href="#Read-Text-Files-Into-Dataframe" data-toc-modified-id="Read-Text-Files-Into-Dataframe-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Read Text Files Into Dataframe</a></span></li><li><span><a href="#Export-to-CSV" data-toc-modified-id="Export-to-CSV-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Export to CSV</a></span></li><li><span><a href="#Load-into-DB" data-toc-modified-id="Load-into-DB-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Load into DB</a></span></li></ul></div>

## Purpose

- The purpose of this notebook is to load the contents of the BBC and BBCSport datasets into a PostgreSQL database, keeping them as close to the raw source as possible.
- The **BBC dataset** consists of **2225 documents** from the BBC news website corresponding to stories in five topical areas from **2004-2005**.
  - business
  - entertainment
  - politics
  - sport
  - tech
- The **BBCSport dataset** consists of **737 documents** from the BBC Sport website corresponding to sports news articles in five topical areas from **2004-2005**.
  - athletics
  - cricket
  - football
  - rugby
  - tennis

## Import Libraries and Settings

In [1]:
from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from dotenv import load_dotenv         # conda install -c conda-forge python-dotenv
import os                              # Python default package
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm         # Provides progress bar for long tasks
import glob

In [2]:
pd.options.display.max_rows = 1000

In [3]:
load_dotenv() # => True if a .env file was found and loaded

True

In [4]:
# Load secrets from the .env file
db_name = os.getenv("db_name")
db_username = os.getenv("db_username")
db_password = os.getenv("db_password")
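# SQLAlchemy 1.4+ only accepts the "postgresql://" scheme; "postgres://" raises NoSuchModuleError
connection_string = f"postgresql://{db_username}:{db_password}@localhost:5432/{db_name}"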
engine = create_engine(connection_string)
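
If the password in `.env` contains characters such as `@`, `:` or `/`, the f-string above produces a URL that cannot be parsed correctly. A minimal sketch of one alternative, assuming SQLAlchemy 1.4 or newer, is to let `URL.create` do the escaping. The helper name `build_db_url` and the sample credentials are hypothetical and only for illustration.

```python
from sqlalchemy.engine import URL


def build_db_url(username, password, database, host="localhost", port=5432):
    """Build a PostgreSQL URL object; special characters are escaped when rendered."""
    return URL.create(
        drivername="postgresql",
        username=username,
        password=password,
        host=host,
        port=port,
        database=database,
    )


# Hypothetical credentials, not the ones stored in .env
demo_url = build_db_url("demo_user", "p@ss:word", "demo_db")
# hide_password=True masks the password so the URL is safe to print
print(demo_url.render_as_string(hide_password=True))
```

The returned object can be passed to `create_engine` in place of the connection string.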

## Read Text Files Into Dataframe

**Only run this if restarting from raw data; this cell takes a long time to run. Otherwise, use the exported `bbc.csv` to reset the DB.**

Let's do the same for the BBC Sport dataset
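
The cells that read the raw files are not shown in this export. A minimal sketch of how the reading step could look, assuming the raw data is unpacked as one folder per category with one `.txt` file per article (for example `bbc/business/001.txt`). The function name `read_articles`, the column names, and the `latin-1` encoding are assumptions, not taken from the original notebook.

```python
import glob
import os

import pandas as pd


def read_articles(root_dir, encoding="latin-1"):
    """Read every <root_dir>/<category>/<file>.txt into one DataFrame row.

    latin-1 is an assumption: it decodes any byte sequence, so the read
    never fails, but the true encoding of the raw files may differ.
    """
    rows = []
    pattern = os.path.join(root_dir, "*", "*.txt")
    for path in sorted(glob.glob(pattern)):
        # The parent folder name is used as the category label
        category = os.path.basename(os.path.dirname(path))
        with open(path, encoding=encoding) as f:
            text = f.read()
        rows.append(
            {"category": category, "filename": os.path.basename(path), "text": text}
        )
    return pd.DataFrame(rows, columns=["category", "filename", "text"])
```

Under the same assumptions, `read_articles("bbc")` and `read_articles("bbcsport")` would produce one dataframe per dataset, with the text stored unmodified.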

## Export to CSV

**Only run this if restarting from raw data; this cell takes a long time to run. Otherwise, use the exported `bbc.csv` to reset the DB.**

Let's export the dataframe to a CSV file so it can be reloaded later without re-reading the raw text files.
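
The export cell is not shown in this export. A minimal sketch of the round trip with `DataFrame.to_csv` and `pd.read_csv`, using a tiny stand-in dataframe rather than the real data. The column names and the file name `bbc_sample.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def export_and_reload(df, csv_path):
    """Write df to CSV without the index column, then read it back."""
    # index=False keeps the pandas row index out of the file
    df.to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)


# Stand-in rows; article text often contains commas, quotes and newlines
sample = pd.DataFrame(
    {
        "category": ["business", "tech"],
        "text": ['Profits rise, shares "soar"\n\nBody text', "New phone launched"],
    }
)
csv_path = os.path.join(tempfile.gettempdir(), "bbc_sample.csv")
reloaded = export_and_reload(sample, csv_path)
```

Fields containing commas, quotes or line breaks are quoted on write, so the text column should survive the round trip unchanged.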

## Load into DB

**If re-running this, make sure to drop the table first**
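
The load cell is not shown in this export. A minimal sketch of one way to make the load safe to re-run, using `DataFrame.to_sql` with `if_exists="replace"`, which drops and recreates the table. To keep the example runnable without a database server, it uses an in-memory SQLite engine as a stand-in for the PostgreSQL `engine` created earlier. The helper name `load_table`, the table name `bbc`, and the sample rows are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine, text


def load_table(df, table_name, engine):
    """Replace table_name with the contents of df and return its row count."""
    # if_exists="replace" drops any existing table before inserting,
    # so re-running does not duplicate rows
    df.to_sql(table_name, engine, if_exists="replace", index=False)
    with engine.connect() as conn:
        # table_name is interpolated directly; only use trusted names here
        return conn.execute(text(f"SELECT COUNT(*) FROM {table_name}")).scalar()


# In-memory SQLite stands in for the PostgreSQL engine in this sketch
demo_engine = create_engine("sqlite://")
sample = pd.DataFrame(
    {"category": ["business", "tech"], "text": ["First article", "Second article"]}
)
first_count = load_table(sample, "bbc", demo_engine)
second_count = load_table(sample, "bbc", demo_engine)  # re-run: still two rows
```

With `if_exists="replace"` pandas infers the column types each time; if the table schema is managed by hand, dropping the table explicitly and using `if_exists="append"` may be the better fit.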