<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Purpose" data-toc-modified-id="Purpose-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#About-The-Dataset" data-toc-modified-id="About-The-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>About The Dataset</a></span></li><li><span><a href="#Preliminary-Analysis" data-toc-modified-id="Preliminary-Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preliminary Analysis</a></span></li><li><span><a href="#Import-Libraries-and-Set-Settings" data-toc-modified-id="Import-Libraries-and-Set-Settings-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Import Libraries and Set Settings</a></span></li><li><span><a href="#First-attempt-to-correct-the-problematic-entries-in-the-CSV" data-toc-modified-id="First-attempt-to-correct-the-problematic-entries-in-the-CSV-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>First attempt to correct the problematic entries in the CSV</a></span></li><li><span><a href="#Loading-the-Data-into-the-Database" data-toc-modified-id="Loading-the-Data-into-the-Database-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Loading the Data into the Database</a></span></li></ul></div>

## Purpose

- The purpose of this notebook is to load the contents of the `All-The-News-21.csv` dataset into a PostgreSQL database, keeping it as close to the raw source as possible
- This dataset initially contains 2.7 million news articles and essays from 27 American publishers
- We keep only the entries whose `article` column is not `NULL`
- After some initial cleanups as done in this notebook, we end up with **2,584,165** article entries and 26 publishers loaded into the database

## About The Dataset

- Name: All The News 2.1
- Source: https://components.one/datasets/all-the-news-2-news-articles-dataset/

This dataset contains 2.7 million news articles and essays from 27 American publications

- Includes:
  - date
  - title
  - publication
  - article text
  - publication name
  - year
  - month
  - URL (for some)
  
Articles mostly span from 2013 to early 2020

## Preliminary Analysis

- After some initial cleanups done here, we end up with **2,584,165** article entries loaded into the DB
- We skipped problematic entries that were not formatted correctly
- The database now has the following table structure for the `AllTheNews21` table

Columns | Data Type | Source | Has `NULL` or Empty Values | How many `NULL`/Empty Values? | Description
:-|:-:|:-:|:-:|:-:|:-
`index`|`bigint`|Original|No|N/A|Used as Primary Key
`date`|`text`|Original|No|N/A|Date of publication of the news article
`year`|`bigint`|Original|No|N/A|Year-part of the date of publication of the news article
`month`|`bigint`|Original|No|N/A|Month-part of the date of publication of the news article
`day`|`bigint`|Original|No|N/A|Day-part of the date of publication of the news article
`author`|`text`|Original|Yes|924,621|Author of the news article
`title`|`text`|Original|Yes|16|Title of the news article
`article`|`text`|Original|No|N/A|Content of the news article
`url`|`text`|Original|No|N/A|URL of the news article
`section`|`text`|Original|Yes|830,754|The section category of the news article
`publication`|`text`|Original|No|N/A|The publisher of the news article
`title_length`|`bigint`|Calculated|No|N/A|The string length of the title for the news article
`article_length`|`bigint`|Calculated|No|N/A|The string length of the content for the news article

**NOTES - TODOS**
- **A lot of cleaning needs to be done for `section`: Approach would be to clean by Publication**
- **`publication` is clean and contains 26 unique publishers**
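
The notebook does not show how the two `Calculated` columns were derived. A minimal sketch of one plausible way to compute them with pandas follows; the sample rows are made up, and treating a missing title as length 0 is an assumption (it is consistent with the table listing no `NULL` values for `title_length`, but the original method is not confirmed).

```python
import pandas as pd

# Made-up sample rows; the second title is missing, as happens for 16 rows in the table
df = pd.DataFrame({
    "title": ["Short title", None],
    "article": ["Body text here", "Another body"],
})

# Assumption: a missing title counts as an empty string, so its length is 0
df["title_length"] = df["title"].fillna("").str.len().astype("int64")

# `article` has no missing values after the load, so no fill is needed
df["article_length"] = df["article"].str.len().astype("int64")

print(df[["title_length", "article_length"]])
```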

## Import Libraries and Set Settings

In [1]:
from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from dotenv import load_dotenv         # conda install -c conda-forge python-dotenv
import os                              # Python default package
import pandas as pd
from tqdm.notebook import tqdm         # Provides progress bar for long tasks
import glob

In [2]:
pd.options.display.max_rows = 1000

In [3]:
load_dotenv() # => True if no error

True

In [5]:
# Load secrets from the .env file
db_name = os.getenv("db_name")
db_username = os.getenv("db_username")
db_password = os.getenv("db_password")
# Note: SQLAlchemy 1.4+ only accepts the "postgresql://" scheme, not "postgres://"
connection_string = f"postgresql://{db_username}:{db_password}@localhost:5432/{db_name}"
engine = create_engine(connection_string)
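
If the password contains characters such as `@`, `/` or `:`, a plain f-string produces a URL that cannot be parsed correctly. The sketch below percent-encodes the credentials with the standard library and uses the `postgresql://` scheme that current SQLAlchemy versions expect. The helper name `build_connection_string` and the example credentials are hypothetical, not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, db_name, host="localhost", port=5432):
    """Build a PostgreSQL URL with percent-encoded credentials (hypothetical helper)."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"postgresql://{user}:{pwd}@{host}:{port}/{db_name}"


# Made-up credentials, for illustration only
example = build_connection_string("news_user", "p@ss/word:1", "news_db")
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` above.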

## First attempt to correct the problematic entries in the CSV

This attempt did not work, but I am keeping the code here for reference in case I need to come back to it later

In [6]:
# content = open("datasets/all-the-news-2-1/parts/all-the-news-2-1_0.csv", "r", encoding="utf8").read().replace('\n','')

# with open("datasets/all-the-news-2-1/parts/all-the-news-2-1_0.csv__copy.csv", "w", encoding="utf8") as g:
#     g.write(content)

# df = pd.read_csv(
#     "datasets/all-the-news-2-1/parts/all-the-news-2-1_0.csv", 
#     engine="python", 
#     encoding="utf-8",
#     error_bad_lines=False,
#     nrows=1000
# )
# print(df.shape)
# df

In [7]:
# df = df.drop(columns=["Unnamed: 0.1"])
# df = df.rename(columns={"Unnamed: 0": "index"})
# print(df.shape)
# df

In [8]:
# df["month"] = df["month"].astype("int64")
# df

In [9]:
# df = df.dropna(subset=['article'])
# df.info()

## Loading the Data into the Database

This section contains the code that successfully loads all the article data into the database

In [10]:
# This cell fixes the "field larger than field limit (131072)" error that the csv module
# (used by pandas when engine="python") raises for CSVs with very long fields

# Import libraries
import sys
import csv

# Settings
maxInt = sys.maxsize

# Main Loop
while True:
    
    # Decrease maxInt by a factor of 10 for as long as the OverflowError occurs
    # (sys.maxsize can be too large for a C long on some platforms, e.g. Windows)
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = maxInt // 10
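
To illustrate what the cell above works around, here is a small self-contained sketch using only the standard library. It parses one oversized field under the default limit of 131072 characters, then again after raising the limit. The 200,000-character field and the 1,000,000 limit are arbitrary values chosen for the demonstration.

```python
import csv
import io

# One row with a single quoted field that is longer than the default limit
big_field = "x" * 200_000
sample = io.StringIO(f'"{big_field}"\n')

# Force the default limit for the demo; the call returns the previous limit
previous_limit = csv.field_size_limit(131072)

try:
    next(csv.reader(sample))
    failed = False
except csv.Error as exc:
    failed = True
    print("Parsing failed:", exc)

# Raise the limit and parse the same data again
csv.field_size_limit(1_000_000)
sample.seek(0)
row = next(csv.reader(sample))
print("Parsed field length:", len(row[0]))

# Restore whatever limit was active before the demo
csv.field_size_limit(previous_limit)
```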

Now, loading the dataset into the database

In [11]:
# All the CSV parts
# parts_csv = glob.glob("datasets/all-the-news-2-1/parts/*.csv")
# 
# for part in parts_csv:
#     print("\n>>> Working on", part, "...")

with engine.connect() as db:
    
    # Import the whole csv
    news_raw = pd.read_csv(
        "raw-datasets/all-the-news-21/original-massive-data/all-the-news-2-1.csv", 
        engine="python", 
        encoding="utf-8",
        on_bad_lines="skip",  # pandas >= 1.3; replaces error_bad_lines=False, removed in pandas 2.0
        chunksize = 10000
    )
    
    # Loop over chunks of news_raw
    for chunk in tqdm(news_raw):
        
        # Remove/Rename some columns
        chunk = chunk.drop(columns=["Unnamed: 0.1"])
        chunk = chunk.rename(columns={"Unnamed: 0": "index"})

        # Drop records that have no article text
        chunk = chunk.dropna(subset=['article'])
        
        # Append the chunk to the table
        chunk.to_sql(
            'AllTheNews21', # The table name
            db, # The database
            if_exists = 'append', 
            index = False # Do not include the pandas index column as PK
        )

    # When all is done, print done
    print('--- Task done. ---')

0it [00:00, ?it/s]

--- Task done. ---
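
The chunked load above can be tried out on a small scale before running it against the full file. The sketch below applies the same per-chunk steps (drop `Unnamed: 0.1`, rename `Unnamed: 0` to `index`, drop rows without `article`) to a tiny in-memory CSV. As assumptions for the demonstration, it uses an in-memory SQLite database in place of PostgreSQL, made-up rows, and a hypothetical helper named `clean_chunk`.

```python
import io
import sqlite3

import pandas as pd

# Tiny stand-in for the real CSV: same two unnamed index columns,
# and one row (the second) with no article text
sample_csv = io.StringIO(
    "Unnamed: 0,Unnamed: 0.1,title,article,publication\n"
    "0,0,First,Some text,Pub A\n"
    "1,1,Second,,Pub B\n"
    "2,2,Third,More text,Pub C\n"
)


def clean_chunk(chunk):
    """Apply the same cleanup steps as the loading loop (hypothetical helper)."""
    chunk = chunk.drop(columns=["Unnamed: 0.1"])
    chunk = chunk.rename(columns={"Unnamed: 0": "index"})
    return chunk.dropna(subset=["article"])


# Assumption: SQLite in memory stands in for the PostgreSQL database
conn = sqlite3.connect(":memory:")

for chunk in pd.read_csv(sample_csv, chunksize=2):
    clean_chunk(chunk).to_sql("AllTheNews21", conn, if_exists="append", index=False)

# Count the rows that made it into the table
loaded = conn.execute('SELECT COUNT(*) FROM "AllTheNews21"').fetchone()[0]
print("Rows loaded:", loaded)
```

Running a count like this against the real table after the load is a simple way to compare the result with the expected number of article entries.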
