## Articles Project: First Attempt at Pipeline

This is an annotated version of the ```dockingfile.py``` file. Most of the scraping, cleaning and loading code lives in the ```scraping_functions.py```, ```cleaning_functions.py``` and ```loading_functions.py``` modules. To view the code alongside the notebook **in Jupyter Lab**, I recommend selecting any function or module, right-clicking, and then choosing "Show Contextual Help".

In [1]:
import os
import pandas as pd

os.chdir("/Users/margaritabozhinova/Desktop/ArticlesProject/Pipeline/")

#### Step 1: Scraping the articles

The scraping functions include:
* a ```get_links``` function that returns all the elements with the "article-card" class found on the page whose URL is passed as an argument (here, the National Post's website)
* functions that get the article IDs (```get_article_ids```) and article URLs (```get_article_urls```) from the article elements
* a function that scrapes the text that can be found in the article URLs (```get_article_text```)
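
The element-filtering and extraction steps listed above can be sketched with the standard library alone. This is not the code in ```scraping_functions.py```, which is not shown here: the names ```find_article_cards```, ```extract_ids``` and ```extract_urls```, the ```data-article-id``` attribute and the sample HTML are all assumptions made for illustration, and the sketch parses an inline string instead of fetching the live page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleCardParser(HTMLParser):
    """Collect the attributes of every tag whose class list contains 'article-card'."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if "article-card" in classes:
            self.cards.append(attr_map)


def find_article_cards(html):
    """Hypothetical stand-in for get_links: works on an HTML string, not a URL."""
    parser = ArticleCardParser()
    parser.feed(html)
    return parser.cards


def extract_ids(elements):
    # Assumption: each card stores its ID in a 'data-article-id' attribute.
    return [element.get("data-article-id") for element in elements]


def extract_urls(elements, base_url="https://nationalpost.com"):
    # Assumption: each card carries a site-relative 'href'.
    return [urljoin(base_url, element.get("href", "")) for element in elements]


# Made-up markup used only to exercise the sketch.
sample_html = """
<div>
  <a class="article-card featured" data-article-id="abc-123" href="/news/example-one">One</a>
  <a class="nav-link" href="/about">About</a>
  <a class="article-card" data-article-id="def-456" href="/opinion/example-two">Two</a>
</div>
"""

elements = find_article_cards(sample_html)
ids = extract_ids(elements)
urls = extract_urls(elements)
```

Splitting the class attribute before testing membership means a card with extra classes still matches, while a class such as "article-card-footer" would not.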

In [13]:
import scraping_functions

elements = scraping_functions.get_links("https://nationalpost.com/")

ids = scraping_functions.get_article_ids(elements)

urls = scraping_functions.get_article_urls(elements)

articles_text = scraping_functions.get_article_text(urls)

#### Step 2: Cleaning the text

The ```clean_article_text``` function is used to clean the article text. It carries out two steps: 
* use regex to remove any HTML code surrounding the text
* remove common strings that are likely not article text (ex.: every article ends with a copyright statement and the National Post's address)

The ```clean_article_text``` function returns a clean version of the articles_text list yielded by ```get_article_text```. 
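
A minimal sketch of the two cleaning steps, assuming each scraped article is one raw HTML string. The helper names (```strip_html```, ```remove_known_strings```, ```clean_articles```) and the sample inputs are hypothetical and are not taken from ```cleaning_functions.py```. Stripping tags with a regex is crude compared with a real HTML parser, but it matches the approach described above.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def strip_html(raw):
    """Replace every tag with a space, then collapse runs of whitespace."""
    text = TAG_PATTERN.sub(" ", raw)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def remove_known_strings(text, boilerplate):
    """Delete each known non-article phrase from the text."""
    for phrase in boilerplate:
        text = text.replace(phrase, " ")
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def clean_articles(raw_articles, boilerplate):
    """Apply both steps to every article and return the cleaned list."""
    return [remove_known_strings(strip_html(raw), boilerplate) for raw in raw_articles]


# Made-up inputs; the two boilerplate phrases are copied from the frequency table in this notebook.
raw_articles = [
    "<p>OTTAWA. First paragraph.</p><p>ADVERTISEMENT</p><p>Second paragraph.</p>",
    "<div><p>Another story.</p><span>365 Bloor Street East, Toronto, Ontario, M4W 3L4</span></div>",
]
boilerplate = ["ADVERTISEMENT", "365 Bloor Street East, Toronto, Ontario, M4W 3L4"]

cleaned = clean_articles(raw_articles, boilerplate)
```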

In [3]:
import cleaning_functions

clean_text = cleaning_functions.clean_article_text(articles_text)

article_df = pd.DataFrame(
    list(zip(ids, urls, clean_text)), columns=["id", "url", "text"]
)

As a quick clean-up step, strings that appear with a frequency > 2 are removed from the articles, since repeating elements are unlikely to be article text. To give an idea of the types of strings that are removed, the ```check_common_str``` function produces a table of the strings with frequencies > 2.

In [12]:
cleaning_functions.check_common_str(articles_text)

|  | string | frequency |
|---|---|---|
| 0 |  | 10 |
| 1 | © 2023 National Post, a division of Postmedia... | 22 |
| 2 | Ikigai | 3 |
| 3 | ADVERTISEMENT | 18 |
| 4 | Edit your picks to remove vehicles if you want... | 5 |
| 5 | © 2023 Driving, a division of Postmedia Netwo... | 5 |
| 6 |  | 465 |
| 7 | Don't have an account? Create Account | 22 |
| 8 | © 2023 The GrowthOp, a division of Postmedia ... | 3 |
| 9 | 365 Bloor Street East, Toronto, Ontario, M4W 3L4 | 33 |
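
The frequency rule itself can be sketched with ```collections.Counter```. This assumes each article has already been split into a list of text fragments and that frequencies are counted across all articles. Both are assumptions, as are the names ```common_strings``` and ```drop_common_strings```, so the counts need not match what ```check_common_str``` reports.

```python
from collections import Counter


def common_strings(articles, min_count=3):
    """Return fragments seen at least min_count times (that is, frequency > 2)."""
    counts = Counter(
        fragment.strip() for article in articles for fragment in article
    )
    return {text: n for text, n in counts.items() if n >= min_count}


def drop_common_strings(articles, common):
    """Remove every fragment that was flagged as common."""
    return [
        [fragment for fragment in article if fragment.strip() not in common]
        for article in articles
    ]


# Made-up fragments used only to exercise the sketch.
articles = [
    ["Story one body.", "ADVERTISEMENT", "Don't have an account? Create Account"],
    ["Story two body.", "ADVERTISEMENT", "Don't have an account? Create Account"],
    ["Story three body.", "ADVERTISEMENT", "Don't have an account? Create Account", "ADVERTISEMENT"],
]

common = common_strings(articles)
filtered = drop_common_strings(articles, common)
```

A threshold this low can also catch legitimate text, as the "Ikigai" row in the table suggests, so it is worth inspecting the table before trusting the filter.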


#### Step 3: Load into SQLite Database

First, a SQLite database called dbfile.db, containing a 'natpostarticles' table, is created if dbfile.db does not already exist. This is done using the ```create_table``` function.

Then the extracted data (ids, urls and cleaned text) are loaded into the database using the ```load_data``` function.
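
For readers without access to ```loading_functions.py```, the create-then-load pattern might look roughly like the sketch below. The table name and the column names (```id```, ```link```, ```artxt```) follow the query output in this notebook. The ```PRIMARY KEY``` constraint, the use of ```INSERT OR IGNORE```, the function names and the plain tuples used in place of the DataFrame are all assumptions.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def create_articles_table(db_path):
    """Create the database file and the natpostarticles table."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS natpostarticles ("
                "id TEXT PRIMARY KEY, link TEXT, artxt TEXT)"
            )


def insert_articles(db_path, rows):
    """Insert (id, link, artxt) tuples, skipping IDs that are already stored."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.executemany(
                "INSERT OR IGNORE INTO natpostarticles (id, link, artxt) "
                "VALUES (?, ?, ?)",
                rows,
            )


def count_articles(db_path):
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM natpostarticles").fetchone()[0]


# Made-up rows written to a throwaway database in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "dbfile.db")
    if not os.path.isfile(db_path):
        create_articles_table(db_path)
    rows = [
        ("abc-123", "https://nationalpost.com/news/example-one", "First text."),
        ("def-456", "https://nationalpost.com/opinion/example-two", "Second text."),
    ]
    insert_articles(db_path, rows)
    insert_articles(db_path, rows)  # re-running the load adds no duplicates
    row_count = count_articles(db_path)
```

Keying the table on the article ID is one way to make repeated pipeline runs safe, since the same front-page articles are likely to be scraped more than once.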

In [5]:
import loading_functions

dbfilepath = "/Users/margaritabozhinova/Desktop/ArticlesProject/Database/dbfile.db"

if not os.path.isfile(dbfilepath):
    loading_functions.create_table(dbfilepath)

loading_functions.load_data(dbfilepath, article_df)

As a quick check of what's in the SQLite database, the first five rows are retrieved.

In [14]:
import sqlite3 

conn = sqlite3.connect(dbfilepath)

articles_head = pd.read_sql("select * from natpostarticles LIMIT 5;", conn)

conn.close()

articles_head

|  | id | link | artxt |
|---|---|---|---|
| 0 | 91edfe48-6867-496e-99e8-57e536037779 | https://nationalpost.com/opinion/barbara-kay-w... | Antisemitism in the United States is on the ri... |
| 1 | d219a338-06dc-45de-b12b-355d2d3d39e6 | https://nationalpost.com/news/canada/tiktok-sp... | OTTAWA — The Broadbent Institute is keeping Ti... |
| 2 | d660cdc5-bb61-4dbf-8dd3-00bc5f352fe6 | https://nationalpost.com/news/tiktok-could-be-... | Two U.S. senators plan to introduce legislatio... |
| 3 | 11547b50-e681-441c-9266-331496c7d3dd | https://nationalpost.com/news/weekend-posted-w... | Here’s your Weekend Posted. A bit of fine news... |
| 4 | dcabbb65-1311-4379-87c6-886335226b6d | https://nationalpost.com/news/canada/canadian-... | TORONTO — Geraldine “Geri” Smith, a long-time ... |
