## Description of Notebook

This notebook performs specific tasks depending on which cells you run. Each cell is labeled with its purpose. It is intended as a simpler way to pull data from the database and to run more targeted queries and searches. The main script, by contrast, collects all the articles from each RSS feed, preprocesses them, and stores them in the DB for future analysis.

*Note that this notebook assumes a remote database is set up and accessible.*

__________________________________________________________________________________________________________________________________________________________________________
## Imports 

**RUN THIS CELL - needed for imports and initializing needed variables**

*What this cell does:*

1. Imports required libraries and custom classes
2. Initializes a Database connection
3. Updates the remote Database with any new tags from the tags Excel file
4. Initializes all the feeds
    - Creates the RSS_Feed object (ex. BleepingComputerRSS, CensysRSS, etc)
    - Gets all the new articles for the feeds
    - Gets the content for these articles
    - Preprocesses the content for these articles
5. Gets a list of all tags from the Database
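
The feed initialization step passes the article titles already stored in the database into each feed constructor (see the `getAllArticleTitles` calls in the cell below), which suggests the feeds use them to skip articles that were collected earlier. The following is a minimal sketch of that idea, not the project's implementation: `FeedEntry` and `select_new_entries` are hypothetical names, and the real feed classes may filter differently.

```python
from dataclasses import dataclass


@dataclass
class FeedEntry:
    """Hypothetical stand-in for one parsed RSS item (not the project's RSS_Article)."""
    title: str
    link: str


def select_new_entries(entries: list[FeedEntry], known_titles: list[str]) -> list[FeedEntry]:
    """Return only the entries whose titles are not already stored in the DB."""
    known = set(known_titles)  # set membership checks are fast
    return [e for e in entries if e.title not in known]


entries = [
    FeedEntry("Old story", "https://example.com/old"),
    FeedEntry("New story", "https://example.com/new"),
]
new_entries = select_new_entries(entries, known_titles=["Old story"])
print([e.title for e in new_entries])
```

Filtering before fetching content means the slower steps (downloading and preprocessing article bodies) only run for articles the database has not seen.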

In [1]:
from data_analysis.ClusteringTechniques import *
from FP_Classes.RSS_DB_Connection import RSS_DB_Connection
import json 

from FP_Classes.RSS_Feed import RSS_Feed, RSS_Article
from FP_Classes.Tag import Tag
from FP_Classes.Feeds.BleepingComputer import BleepingComputerRSS   # BleepingComputer
from FP_Classes.Feeds.Censys import CensysRSS                       # Censys (general)
from FP_Classes.Feeds.CensysDir import CensysDirRSS                 # Censys (director)
from FP_Classes.Feeds.DefenseDepartment import DefenseDeptRSS       # Department of Defense
from FP_Classes.Feeds.Microsoft import MicrosoftRSS                 # Microsoft 
from FP_Classes.Feeds.NationalVulnDatabase import NVD_RSS           # National Vulnerability Database
from FP_Classes.Feeds.NIST import NIST_RSS                          # NIST 
from FP_Classes.Feeds.StateDepartment import StateDeptRSS           # State Department (multiple feeds) 
from FP_Classes.Feeds.TheHackerNews import HackerNewsRSS            # Hacker News (news articles)


# Initialization of all variables
configDir = "config/"   # Change ONLY if you changed the default file hierarchy 

# Load the config and DB credentials (the files are closed after reading)
with open(configDir + 'config.json') as f:
    config = json.load(f)
with open(configDir + config['db-creds-json-path']) as f:
    db_creds = json.load(f)

dbConn:RSS_DB_Connection = RSS_DB_Connection(
                                username=db_creds['username'],
                                password=db_creds['password'],
                                host=db_creds['host']
                            )

# Update the remote DB with new tags if there are any
if dbConn.newTagsFromExcel(configDir + config['update-tags-filepath']):
    print("[+] NOTICE: Added new tags to remote database.")
else: print("[+] ERROR: Failed to add new tags to the remote database. Moving on.")

# Initialize all Feed objects 
bleepingComputerRss = BleepingComputerRSS(dbConn.getAllArticleTitles(feed_title=BleepingComputerRSS.BC_FeedTitle))
censysRss = CensysRSS(dbConn.getAllArticleTitles(feed_title=CensysRSS.CS_FeedTitle))
censysDirRss = CensysDirRSS(dbConn.getAllArticleTitles(feed_title=CensysDirRSS.CS_FeedTitle))
defenseDeptRss = DefenseDeptRSS(dbConn.getAllArticleTitles(feed_title=DefenseDeptRSS.DoD_FeedTitle))
microsoftRss = MicrosoftRSS(dbConn.getAllArticleTitles(feed_title=MicrosoftRSS.MS_FeedTitle))
nvdRss = NVD_RSS(dbConn.getAllArticleTitles(feed_title=NVD_RSS.NVD_FeedTitle))
nistRss = NIST_RSS(dbConn.getAllArticleTitles(feed_title=NIST_RSS.NIST_FeedTitle))
stateDeptRss = StateDeptRSS(dbConn.getAllArticleTitles(feed_title=StateDeptRSS.SD_FeedTitle)) 
hackernewsRss = HackerNewsRSS(dbConn.getAllArticleTitles(feed_title=HackerNewsRSS.HN_FeedTitle))

# Create a list of all the RSS Feed objects 
allFeeds:list[RSS_Feed] = [
    bleepingComputerRss,
    censysRss,
    censysDirRss,
    defenseDeptRss,
    microsoftRss,
    nvdRss,
    nistRss,
    stateDeptRss,
    hackernewsRss
]

allTags:list[Tag] = dbConn.getAllTags()


NOTICE in RSS_DB_Connection.newTagsFromExcel(): called newTagsFromExcel() - beginning process.
NOTICE in RSS_DB_Connection.newTagsFromExcel(): excel sheet read and DB connection established successfully. Formatting query...
NOTICE in RSS_DB_Connection.newTagsFromExcel(): new tag queries formatted and executed successfully. Terminating connections and quitting.
SUCCESS.
[+] NOTICE: Added new tags to remote database.
[+] Initializing feed: BleepingComputer | https://www.bleepingcomputer.com/feed/
[+] INIT article "BleepingComputer - Google to fight hackers with weekly Chrome security updates"
	[+] Getting article content...
	[+] Preprocessing content...

[+] INIT article "BleepingComputer - Preventative medicine for securing IoT tech in healthcare organizations"
	[+] Getting article content...
	[+] Preprocessing content...

[+] INIT article "BleepingComputer - EvilProxy phishing campaign targets 120,000 Microsoft 365 users"
	[+] Getting article content...
	[+] Preprocessing content...



__________________________________________________________________________________________________________________________________________________________________________
## Interacting with the Database

### [+] Get Information/Data from the Database

##### Get a list of all tags (as Tag objects)

In [None]:
allTags:list[Tag] = dbConn.getAllTags()   # Get tags
for t in allTags: print(t.toString())     # Print results

##### Get the titles for all tracked feeds

In [None]:
allFeedTitles:list[str] = dbConn.getAllFeeds()    # Get feed titles
for t in allFeedTitles: print(t)                  # Print the results

##### Get lists of articles

*For a specific feed*

In [None]:
# Change feed title
feedTitle:str = "BleepingComputer"                          

allArticles:list[RSS_Article] = dbConn.getAllArticles(feedTitle=feedTitle) # Get all the articles
for a in allArticles: print(a.toString())                                  # Print the results

*No specific feed - all articles in the DB (subject to the SQL query limit, believed to be 1000 rows)*

In [None]:
allArticles:list[RSS_Article] = dbConn.getAllArticles()
for a in allArticles: print(a.toString())

*All articles for a list of tags*

In [None]:
# CHANGE list_of_tags !!
list_of_tags:list[str] = ['CVE', 'APT', 'Lazarus']      # List of strings (tag names) 

articles_for_tags:dict[str, RSS_Article] = dbConn.getArticlesForTags(list_of_tags)
for a in articles_for_tags.values(): print(a.toString(), a.tags)

### [+] Sending new information to the Database

It is recommended to run both of these cells. The first updates only the ARTICLE table, and the second updates only the TAG_FOR_ARTICLE table. They are kept separate in case you only wish to do one or the other.

The TAG_FOR_ARTICLE table in the database has a foreign key constraint on article_title, so it is important to run "Update Articles for RSS Feeds in Database" BEFORE running "Classify All New Articles and update tags in Database". The INSERT statements into the TAG_FOR_ARTICLE table will fail if the article title does not exist in the ARTICLE table.
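
To see why the order matters, here is a small self-contained sketch of a foreign key on article_title. It uses an in-memory SQLite database as a stand-in for the remote database, and the `tag_name` column and `add_tag` helper are assumptions made for the example; the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE ARTICLE (article_title TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE TAG_FOR_ARTICLE ("
    " article_title TEXT NOT NULL REFERENCES ARTICLE(article_title),"
    " tag_name TEXT NOT NULL)"  # tag_name is a hypothetical column
)


def add_tag(title: str, tag: str) -> bool:
    """Try to tag an article; return False if the foreign key check rejects it."""
    try:
        conn.execute("INSERT INTO TAG_FOR_ARTICLE VALUES (?, ?)", (title, tag))
        return True
    except sqlite3.IntegrityError:
        return False


tag_before = add_tag("Example article", "CVE")  # article row missing: rejected
conn.execute("INSERT INTO ARTICLE VALUES (?)", ("Example article",))
tag_after = add_tag("Example article", "CVE")   # article row exists: accepted
print(tag_before, tag_after)
```

The first insert is rejected because no matching row exists in ARTICLE yet; once the article is added, the same insert succeeds.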

##### Update Articles for RSS Feeds in Database

This cell updates the remote database with the new articles from the Imports cell (cell 5). Note that it does not classify the articles; it only adds each Article object to the "ARTICLE" table in the database.

In [2]:
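print("[+] Starting updates to database for all new articles...\n")

all_ok = True   # Tracks whether every feed was updated without error
for feed in allFeeds: 
    if dbConn.addArticles(feed.articles): print(f"\tSuccessfully added articles for {feed.feed_title}.")
    else:
        all_ok = False
        print(f"\tThere was some error adding the articles for \"{feed.feed_title}\". Moving on.")

# Final status message (only report full success if no feed failed)
if all_ok: print("\n[+] DONE: All articles were updated in the remote database.")
else: print("\n[+] DONE: Finished, but some feeds could not be updated (see messages above).")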


[+] Starting updates to database for all new articles...

NOTICE: Articles added successfully.
	Successfully added articles for BleepingComputer.
NOTICE: Articles added successfully.
	Successfully added articles for Censys Global Reach.
NOTICE: Articles added successfully.
	Successfully added articles for Censys Director Blog.
NOTICE: Articles added successfully.
	Successfully added articles for Defense-gov Explore Feed.
NOTICE: Articles added successfully.
	Successfully added articles for MSRC Security Update Guide.
NOTICE: Articles added successfully.
	Successfully added articles for National Vulnerability Database.
NOTICE: Articles added successfully.
	Successfully added articles for NIST Cybersecurity and IT news and events.
NOTICE: Articles added successfully.
	Successfully added articles for United States Department of State.
NOTICE: Articles added successfully.
	Successfully added articles for Hacker News.

[+] DONE: All articles were updated in the remote database.


##### Classify All New Articles and update tags in Database

This will classify the new articles and send the tag updates to the database.


In [3]:
for feed in allFeeds: 
    if dbConn.updateArticles(feed, allTags): print(f"\n[+] Successfully classified and added articles for feed \"{feed.feed_title}\". Continuing...")
    else: print(f"\n[+] ERROR: There was some error classifying and updating articles for \"{feed.feed_title}\". Moving on...")

NOTICE in RSS_DB_Connection.__testFeedExists__(): test query found at least one result for "BleepingComputer". Proceeding.
[+] Classifying and formatting queries for articles from feed: BleepingComputer

[+] Updating database with articles for feed BleepingComputer
[+] Updating database with tags for articles from feed: BleepingComputer
NOTICE in RSS_DB_Connection.updateArticles(): new articles and tags for BleepingComputer added to the DB successfully. Closing cursor.


[+] Successfully classified and added articles for feed "BleepingComputer". Continuing...
NOTICE in RSS_DB_Connection.__testFeedExists__(): test query found at least one result for "Censys Global Reach". Proceeding.
[+] Classifying and formatting queries for articles from feed: Censys Global Reach

[+] Updating database with articles for feed Censys Global Reach
[+] Updating database with tags for articles from feed: Censys Global Reach
NOTICE in RSS_DB_Connection.updateArticles(): new articles and tags for Censys Glob

__________________________________________________________________________________________________________________________________________________________________________
## Latent Dirichlet Allocation (LDA) Analysis

**Purpose:** 
LDA is a natural language processing technique that groups (clusters) related articles based on their content and on the importance of common words and phrases in the context of the overall article.

**NOTE:** 
Run the hyperparameters cell before any others or they will not work.
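
For intuition about what LDA does, the sketch below runs a tiny collapsed Gibbs sampler on a toy corpus using only the standard library. It is an illustration under stated assumptions, not the method used by `LDA_Article_Clustering`: the function name `lda_gibbs`, the toy documents, and the `alpha`/`beta` values are all made up for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # total words assigned to topic k
    assignments = []

    # Start from a random topic for every word occurrence
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic total + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic


docs = [
    "phishing email credentials phishing campaign".split(),
    "phishing campaign targets email users".split(),
    "patch vulnerability cve chrome update".split(),
    "chrome update fixes cve vulnerability".split(),
]
doc_topic = lda_gibbs(docs, num_topics=2)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: dominant topic {counts.index(max(counts))}")
```

Each word occurrence is repeatedly reassigned to a topic that is popular both in its document and for that word, so documents sharing vocabulary tend to end up dominated by the same topic. The `num_topics` argument plays the same role as `lda_NUM_TOPICS` below.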

#### [+] LDA Hyperparameters 

**RUN THIS CELL**

*Note: this cell may take a few minutes, depending on how many articles are returned and the lda_LIMIT*

In [None]:
''' 
lda_NUM_TOPICS - The number of clusters in the final result
    
    NOTE: A higher num_topics can be beneficial but may take longer.
          Too high of a value for num_topics could overfit the data and result in irrelevant topics or clusters 
          Too low of a value for num_topics could underfit the data and result in clusters that are hard to interpret

lda_LIMIT - The limit on the number of articles to get from the DB.

    NOTE: A higher limit will take longer, but will include more articles and the groupings will likely be more meaningful.
          A limit of 0 -> no limit, get all articles in the DB

lda_FEED - Specify a specific feed to get articles from (the feed title).

    NOTE: Make sure the feed exists in the DB. If you are not sure, run the "Get the titles for all tracked feeds" cell to list all feed titles in the DB
          Empty string ("") -> no specific feed, get articles for all feeds
          
'''

lda_NUM_TOPICS:int  = 10
lda_LIMIT:int       = 100
lda_FEED:str        = ""

# --------------------------------------------------------------------------------------------------------- #
# DO NOT CHANGE BELOW THIS LINE

# Get other parameters based on the above hyperparameters
lda_articles:list[RSS_Article] = dbConn.getAllArticles(feedTitle=lda_FEED)
lda:LDA_Article_Clustering = LDA_Article_Clustering(lda_articles, num_topics=lda_NUM_TOPICS, limit=lda_LIMIT)


#### [+] Print LDA Results

In [None]:
print("\n---------------------------\nLDA Topic Assignments:\n" + lda.strAllTopicAssignments())

# Concatenate the details for every topic
s = "".join(lda.strInfoForTopic(i) for i in range(len(lda.topics_dict)))
print("\n---------------------------\nLDA Topic Details and Articles:\n\n" + s)