#### Project group 35 - Marcus Braunschweig Andersen, Øyvin Moxness Konglevoll

In [21]:
import numpy as np
import pandas as pd 
import re
from collections import Counter 
import itertools
import matplotlib.pyplot as plt
import csv
import string
import psycopg2
from IPython.display import Image

# Task 2

<img src="datascience.png">

To demonstrate that we have a working database, we use the handed-out function for executing SQL queries in Python.

In [14]:
# Function to access the database locally and execute a query
# Make sure to change the username, database and password
def execQuery(query):
    # Initialise both handles so the finally block can always test them
    connection = None
    cursor = None
    try:
        connection = psycopg2.connect(user="postgres",
                                      password="dataScience20",
                                      host="localhost",
                                      port="5432",
                                      database="fakenews")
        cursor = connection.cursor()
        cursor.execute(query)
        record = cursor.fetchall()
        return record
    except (Exception, psycopg2.Error) as error:
        print("Error while connecting to PostgreSQL", error)
    finally:
        # Close whatever was opened, even when the query itself failed
        if cursor:
            cursor.close()
        if connection:
            connection.close()
            print("Executed query and closed connection.")
            
numberOfArticles = execQuery("""SELECT count(articleid)
                                FROM articles;""")
print(numberOfArticles)

articleDomainsCount = execQuery("""SELECT count(domainID)
                                   FROM articleDomains;""")
print(articleDomainsCount)

differentTypes = execQuery("""SELECT *
                              FROM Typer; """)
print(differentTypes)

Executed query and closed connection.
[(979937,)]
Executed query and closed connection.
[(979937,)]
Executed query and closed connection.
[(0, 'rumor'), (1, 'hate'), (6, 'unreliable'), (10, 'conspiracy'), (14, 'clickbait'), (15, 'satire'), (27, 'fake'), (42, 'reliable'), (132, 'bias'), (136, 'political'), (351, 'junksci'), (397, 'NULL'), (628, 'unknown')]


As shown above, our database contains 979937 articles and supports simple queries, such as listing the different types that occur among the articles. We used the '1mio-raw.csv' file, but during cleaning we discarded all articles with faults in the articleID.
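
The cleaning step mentioned above could look roughly like the following. This is only a minimal sketch, not our actual cleaning code: the sample rows and the column names `id` and `domain` are assumptions made for illustration, and "faulty" is taken to mean "not a non-negative integer".

```python
import csv
import io

# Hypothetical sample standing in for a few rows of '1mio-raw.csv';
# the column names 'id' and 'domain' are assumptions for illustration.
raw = io.StringIO(
    "id,domain\n"
    "1,example.com\n"
    "abc,broken.org\n"
    ",missing.net\n"
    "4,another.com\n"
)

def has_valid_id(row):
    """Return True when the 'id' field consists only of digits."""
    value = (row.get("id") or "").strip()
    return value.isdigit()

rows = list(csv.DictReader(raw))
clean_rows = [row for row in rows if has_valid_id(row)]

print(len(rows), "rows read,", len(clean_rows), "rows kept")
```

Filtering before loading into the database keeps the articleID column usable as a key.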

# Task 3

## 3.1
### Relational algebra
\begin{align*}
    A &:= Articles\\
    B &:= ArticleDomains\\
    D &:= \pi_{articleID}(A \bowtie_{scrapedID = timeID} \sigma_{timstamp \geq \text{'2018-01-15 00:00:00.000000'}}(Timestamps)) \\
    C &:= \pi_{DomainID}(D \bowtie B)\\
    E &:= \pi_{typeID}(\sigma_{type=\text{'reliable'}}(Typer)) \\
    F &:= \pi_{Domain,DomainID}(Domains \bowtie E)
\end{align*}
The reliable domains of news articles scraped on or after January 15, 2018 can now be found with
\begin{align*}
\pi_{Domain}(F \bowtie C)
\end{align*}
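
To make the intermediate relations D, C, E and F more concrete, the steps can be traced on toy data with Python sets. This is a sketch only: every row below is made up, and the tuple layouts are assumptions chosen for illustration (only the type IDs 42 and 27 are taken from the Typer table shown in Task 2).

```python
# Toy relations; all rows are made up for illustration.
articles = [(1, 10), (2, 11), (3, 12)]  # (articleID, scrapedID)
timestamps = {  # timeID -> timstamp
    10: "2018-01-10 08:00:00.000000",
    11: "2018-01-15 00:00:00.000000",
    12: "2018-02-01 12:30:00.000000",
}
article_domains = [(1, 100), (2, 101), (3, 102)]  # (articleID, domainID)
typer = {42: "reliable", 27: "fake"}  # typeID -> type
domains = [  # (domainID, domain, typeID)
    (100, "old-reliable.example", 42),
    (101, "new-reliable.example", 42),
    (102, "new-fake.example", 27),
]

CUTOFF = "2018-01-15 00:00:00.000000"

# D: IDs of articles scraped at or after the cutoff.
# Fixed-width timestamp strings compare correctly as text.
D = {a for a, scraped in articles if timestamps[scraped] >= CUTOFF}
# C: domain IDs of those articles.
C = {d for a, d in article_domains if a in D}
# E: type IDs labelled 'reliable'.
E = {t for t, name in typer.items() if name == "reliable"}
# F: (domain, domainID) pairs of reliable domains.
F = {(name, d) for d, name, t in domains if t in E}
# Final projection: reliable domains with at least one recent article.
result = {name for name, d in F if d in C}
print(result)
```

Only the domain that is both reliable and has a recently scraped article survives the final join.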

### SQL
```sql
SELECT DISTINCT domainet
FROM domains
WHERE typeid IN (SELECT typeid FROM Typer WHERE typen = 'reliable')
  AND domainID IN (
      SELECT domainID FROM articleDomains
      WHERE articleID IN (
          SELECT articleID FROM articles
          WHERE scrapedid IN (
              SELECT timeid FROM timestamps
              WHERE timstamp >= '2018-01-15 00:00:00.000000')));
```
Running the SQL query:

In [17]:
domains = execQuery("""SELECT DISTINCT domainet FROM domains 
WHERE typeid IN 
(SELECT typeid FROM Typer WHERE typen = 'reliable') AND 
domainID IN (SELECT domainID FROM articleDomains WHERE articleID IN 
(SELECT articleID FROM articles WHERE scrapedid IN 
(SELECT timeid FROM timestamps WHERE timstamp >= '2018-01-15 00:00:00.000000')));""")
print(domains)

Executed query and closed connection.
[('christianpost.com',), ('consortiumnews.com',), ('nutritionfacts.org',)]


## 3.2
### Extended relational algebra
\begin{align*}
    A &:= Articles\\
    B &:= ArticleDomains\\
    W &:= Writers\\
    X &:= \pi_{typeID}(\sigma_{type=\text{'fake'}}(Typer))\\
    D &:= \pi_{DomainID}(Domains \bowtie X)\\
    E &:= \pi_{articleID}(B \bowtie D) \\
    G &:= \gamma_{authorID,\, count(articleID)\rightarrow countA}(W \bowtie E) \\
    H &:= \gamma_{max(countA)\rightarrow maxA}(G) \\
    I &:= \pi_{authorID}(\sigma_{countA = maxA}(G \times H))
\end{align*}

The name(s) of the most prolific author(s) of fake news articles can now be found with:
\begin{align*}\pi_{authorName}(Authors \bowtie I)\end{align*}
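
The grouping and maximum steps (G, H and I) can be illustrated on toy data with `collections.Counter`. This is a sketch under assumptions: all rows and author names are invented, and `fake_articles` simply stands in for the relation E.

```python
from collections import Counter

# Toy relations; all rows are made up for illustration.
writers = [  # (authorID, articleID)
    (1, 10), (1, 11), (2, 12), (2, 13), (3, 14), (1, 15),
]
fake_articles = {10, 11, 12, 13, 14}  # stands in for E
authors = {1: "alice", 2: "bob", 3: "carol"}  # authorID -> authorName

# G: number of fake articles per author.
G = Counter(author for author, article in writers if article in fake_articles)
# H: the largest count.
H = max(G.values())
# I: every author whose count reaches the maximum, so ties are kept.
I = {author for author, count in G.items() if count == H}

names = sorted(authors[a] for a in I)
print(names)
```

Because I is built by comparing against the maximum rather than taking a single top row, several authors are returned when they tie.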

### SQL
```sql
SELECT author_name FROM authors
WHERE authorID IN (
    SELECT authorID
    FROM (SELECT authorID, count(articleID)
          FROM writers
          WHERE articleID IN (
                  SELECT articleID FROM articleDomains WHERE domainID IN (
                      SELECT domainID FROM domains WHERE typeID =
                          (SELECT typeID FROM Typer WHERE typen = 'fake')))
            AND authorID > 0
          GROUP BY authorID) AS mycount
    WHERE count = (
        SELECT max(count)
        FROM (SELECT authorID, count(articleID)
              FROM writers
              WHERE articleID IN (
                      SELECT articleID FROM articleDomains WHERE domainID IN (
                          SELECT domainID FROM domains WHERE typeID =
                              (SELECT typeID FROM Typer WHERE typen = 'fake')))
                AND authorID > 0
              GROUP BY authorID) AS mycount));
```
Running the SQL query:

In [16]:
authors = execQuery("""Select author_name from authors where authorID in (Select authorID from (Select authorID, count(articleid)
	From writers
	Where articleID in (SELECT ARTICLEID as X FROM articledomains WHERE DOMAINID IN 
			(SELECT DOMAINID FROM DOMAINs WHERE TYPEID = 
				(SELECT TYPEID FROM TYPER WHERE TYPEN = 'fake'))) AND authorID > 0
	Group by authorID) as mycount
	where count = (Select max(count) from (Select authorID, count(articleid)
	From writers
	Where articleID in (SELECT ARTICLEID as X FROM articledomains WHERE DOMAINID IN 
			(SELECT DOMAINID FROM DOMAINs WHERE TYPEID = 
				(SELECT TYPEID FROM TYPER WHERE TYPEN = 'fake'))) AND authorID > 0
	Group by authorID) as mycount));""")
print(authors)

Executed query and closed connection.
[('john rolls',)]
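
The query above repeats the per-author count subquery twice. A common table expression lets it be written once. The sketch below is an alternative formulation, not the query we ran: it uses an in-memory SQLite database with made-up rows so that it is self-contained, it reuses our table and column names, and it replaces the nested `IN` subqueries with joins, which assumes each article appears in only one row of `articledomains`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and rows; the data is made up for illustration.
conn.executescript("""
CREATE TABLE typer (typeid INTEGER, typen TEXT);
CREATE TABLE domains (domainid INTEGER, domainet TEXT, typeid INTEGER);
CREATE TABLE articledomains (articleid INTEGER, domainid INTEGER);
CREATE TABLE writers (articleid INTEGER, authorid INTEGER);
CREATE TABLE authors (authorid INTEGER, author_name TEXT);

INSERT INTO typer VALUES (27, 'fake'), (42, 'reliable');
INSERT INTO domains VALUES (1, 'fake.example', 27), (2, 'good.example', 42);
INSERT INTO articledomains VALUES (10, 1), (11, 1), (12, 1), (13, 2);
INSERT INTO writers VALUES (10, 1), (11, 1), (12, 2), (13, 2);
INSERT INTO authors VALUES (1, 'alice'), (2, 'bob');
""")

# The per-author count is defined once and used twice.
query = """
WITH fake_counts AS (
    SELECT w.authorid, COUNT(w.articleid) AS n_articles
    FROM writers w
    JOIN articledomains ad ON ad.articleid = w.articleid
    JOIN domains d ON d.domainid = ad.domainid
    JOIN typer t ON t.typeid = d.typeid
    WHERE t.typen = 'fake' AND w.authorid > 0
    GROUP BY w.authorid
)
SELECT a.author_name
FROM authors a
JOIN fake_counts fc ON fc.authorid = a.authorid
WHERE fc.n_articles = (SELECT MAX(n_articles) FROM fake_counts);
"""
top_authors = conn.execute(query).fetchall()
conn.close()
print(top_authors)
```

The `WITH` syntax used here is also supported by PostgreSQL, so the same query text should be usable with `execQuery`, although we have not run it against our database.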


## 3.3
We did not manage to write this query successfully, but this is our attempt. It finds pairs of article IDs that share a meta_keywordID and also shows the meta_keywordID they share.
### SQL
```sql
WITH tags_small AS (SELECT * FROM articlemeta_keywords
                    WHERE articleid <= 500 AND meta_keywordID > 0),
     articles_small AS (SELECT DISTINCT articleid FROM tags_small)
SELECT a1.meta_keywordID, a1.articleID AS a1, a2.articleID AS a2
FROM tags_small a1
JOIN tags_small a2 ON a1.articleID <> a2.articleID AND a1.meta_keywordID = a2.meta_keywordID;
```
Running this query:

In [20]:
set_equi_join = execQuery("""WITH tags_small AS (SELECT * FROM articlemeta_keywords WHERE articleid <= 500 and meta_keywordID > 0),
     articles_small AS (SELECT DISTINCT articleid FROM tags_small)
	SELECT a1.meta_keywordID, a1.articleID AS a1, a2.articleID AS a2 FROM tags_small a1 JOIN tags_small a2 ON a1.articleID <> a2.articleID and a1.meta_keywordID = a2.meta_keywordID;""")
print(set_equi_join)

Executed query and closed connection.
[(12, 19, 20), (13, 19, 20), (14, 19, 20), (15, 19, 20), (16, 19, 20), (17, 19, 20), (18, 19, 20), (19, 19, 20), (12, 20, 19), (13, 20, 19), (14, 20, 19), (15, 20, 19), (16, 20, 19), (17, 20, 19), (18, 20, 19), (19, 20, 19), (27, 33, 99), (27, 33, 98), (37, 73, 138), (38, 73, 483), (38, 73, 138), (27, 98, 99), (27, 98, 33), (27, 99, 98), (27, 99, 33), (60, 109, 167), (61, 109, 167), (37, 138, 73), (38, 138, 483), (38, 138, 73), (64, 138, 483), (60, 167, 109), (61, 167, 109), (38, 483, 138), (38, 483, 73), (64, 483, 138)]
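
The attempt above returns one row per shared keyword. Assuming the intended result was a set equi-join, as the variable name `set_equi_join` suggests (pairs of articles whose keyword sets are exactly equal), the missing step can be sketched in Python. This is not a working SQL solution, and the rows below are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy (articleID, meta_keywordID) rows; the data is made up.
article_keywords = [
    (1, 12), (1, 13),
    (2, 12), (2, 13),
    (3, 12),
    (4, 27),
    (5, 27),
]

# Collect the full keyword set of each article.
keywords_by_article = defaultdict(set)
for article_id, keyword_id in article_keywords:
    keywords_by_article[article_id].add(keyword_id)

# Group articles by their exact keyword set.
articles_by_keyword_set = defaultdict(list)
for article_id, keywords in keywords_by_article.items():
    articles_by_keyword_set[frozenset(keywords)].append(article_id)

# Every unordered pair within a group has identical keyword sets.
pairs = sorted(
    pair
    for group in articles_by_keyword_set.values()
    for pair in combinations(sorted(group), 2)
)
print(pairs)
```

Article 3 shares keyword 12 with articles 1 and 2 but is not paired with them, because its keyword set is only a subset of theirs.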


# Task 5
## 5.1 Spider
In order to scrape Wikipedia for articles, we used the Scrapy framework. Below is the code for our scrapy.Spider, which scrapes the article and obtains its HTML code. Then