* Note: requires massive refactor for updated application design.

# The Tweet Summarizer

This application takes in raw tweet objects from a database, then extracts summaries and 
entities from them using rule-based and machine learning-based approaches. 

*This is a demonstration attempt at detecting "events" from a hand-picked selection of 
Twitter accounts. Therefore, this demo makes limited use of advanced spam- and noise-detection 
techniques, since it is assumed that this dataset will be free from such things.*

### The Process

In order to filter raw tweet objects for noise, then summarize them and extract their 
named entities, we will use the following approach.

1. Reading in the Dataset
2. Pre-processing the Data
3. Vectorizing the Data
4. Feature Selection
5. Building the Machine Learning Classifiers

Let's begin!

## 1. Reading in the Dataset
Before we begin the feature extraction process, we need to connect to the database 
and pull down information. Two things to note as we begin this process: 
* database connection information is held in a separate configuration file, `config.ini`, in the 
"Common" folder inside "/Users/$User/Quantum/Event Detector/Twitter Event Detector/"
* since the goal is to gather real-time data, the database would ideally be polled every 
15-45 seconds. However, since we are merely testing the concept here, initial polls will be 
conducted every 120-300 seconds. 
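
The polling schedule described above could be sketched as follows. This is a minimal illustration, not part of the application: the names `polling_intervals` and `poll` are hypothetical, and drawing each wait at random from the 120-300 second range is an assumption (a fixed interval inside that range would work just as well).

```python
import random
import time
from typing import Callable, Iterator


def polling_intervals(low: float = 120, high: float = 300) -> Iterator[float]:
    """Yields an endless stream of wait times, each drawn uniformly from [low, high] seconds."""
    while True:
        yield random.uniform(low, high)


def poll(read_rows: Callable[[], list], max_polls: int,
         low: float = 120, high: float = 300,
         sleep: Callable[[float], None] = time.sleep) -> list:
    """Calls read_rows() max_polls times, sleeping a random interval between calls."""
    batches = []
    intervals = polling_intervals(low, high)
    for poll_number in range(max_polls):
        batches.append(read_rows())
        # no need to wait after the final poll
        if poll_number < max_polls - 1:
            sleep(next(intervals))
    return batches
```

Passing a database reader as `read_rows` would collect one batch per poll. Switching to the real-time schedule would only mean calling `poll(..., low=15, high=45)`.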

Here, we connect to the database and read the last 100 entries (changeable in the first variable).

In [4]:
number_of_rows = 100

import configparser
import os
import psycopg2

def config_file_reader(API_caller: str) -> tuple:
    """
    A common configuration file reader.
    
    Reads data from a common configuration file, determining which fields to call depending 
    on the API caller passed to it.
    
    :param API_caller:(str) the name of the service calling this API
     
    :return: (tuple) a tuple of strings of each configuration returned for the called service
    """
    # "/Users/$User/Quantum/Event Detector/Twitter Event Detector/". 
    home_directory_path = os.path.expanduser("~")
    logger_directory_path = os.path.join(home_directory_path, "Quantum", "Event Detector", 
                                         "Twitter Event Detector", "Logs")
    config_directory_path = os.path.join(home_directory_path, "Quantum", "Event Detector", 
                                         "Twitter Event Detector", "Common")
    config_file_path = os.path.join(config_directory_path, "config.ini")

    # instantiates the configuration parser
    config = configparser.ConfigParser()
    
    # if the config file exists, proceed; else, create the directory structure (if missing) and warn
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    else:
        # exist_ok=True avoids a FileExistsError when the directory exists but the file does not
        os.makedirs(config_directory_path, exist_ok=True)
        print("No config file found in " + config_directory_path + 
              ". Please place a configuration file into this directory and try again.")
        
    if API_caller == "data_access_object":
        database_type = config["DATABASE"]["type"]
        database_host = config["DATABASE"]["host"]
        database_name = config["DATABASE"]["database_name"]
        database_user = config["DATABASE"]["user"]
        database_password = config["DATABASE"]["password"]
        database_instance_id = config["DATABASE"]["database_instance_id"]
        database_port = config["DATABASE"]["database_port"]
        return database_type, database_host, database_name, database_user, database_password, \
               database_instance_id, database_port
    elif API_caller == "logger_setup":
        return logger_directory_path,
    elif API_caller == "languages":
        languages = config["LANGUAGES"]["supported_languages"]
        return languages
    elif API_caller == "account_metadata_importer":
        # this API call only requires the directory path to the config file (which stores a CSV file necessary)
        return config_directory_path,
    else:
        print("Error on reading config file: no API caller specified")
        
def raw_tweet_database_connector():
    """
    Creates and returns a connection object to a PostgreSQL database.
    
    :return: (psycopg2.connect) a PostgreSQL connection object
    """
    config = config_file_reader("data_access_object")
    database_type, database_host, database_name, database_user, database_password, \
        database_instance_id, database_port = config
    
    try:
        connection = psycopg2.connect(host=database_host, dbname=database_name, user=database_user, 
                                      password=database_password, port=database_port)
        return connection
    except psycopg2.OperationalError:
        print('Database connection error')
        # re-raises so callers do not receive None in place of a connection
        raise
        
def raw_tweet_database_reader() -> list:
    """
    Reads the most recent entries in the Raw Tweet Database (number_of_rows entries, 100 by default). 
    
    :return: (list) a list of tuples, one per row, holding the most recent entries in the Raw Tweet Database 
    """
    # calls the database connector
    connection = raw_tweet_database_connector()
    cursor = connection.cursor()
    
    sql = "SELECT tweet_time_created, tweet_uid, tweet_text, tweet_source, reply_tweet_uid, reply_tweet_count, " \
      "quote_tweet, quote_tweet_uid, quote_tweet_text, quote_tweet_count, retweet_tweet_status,  " \
      "retweet_tweet_count, tweet_language, user_uid, user_name, user_screen_name, user_description, " \
      "user_verification, user_follower_count, user_friends_count, user_statuses_count, user_time_created, " \
      "tweet_coordinates, tweet_place, tweet_place_country_code, tweet_place_bounding_box, " \
      "tweet_hashtags, tweet_urls, tweet_symbols, tweet_user_mentions, user_location FROM twitter_posts " \
      "ORDER BY tweet_time_created DESC LIMIT %s;"
    
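    cursor.execute(sql, (number_of_rows,))
    rows = cursor.fetchall()
    # releases the cursor and the connection once the rows are in memory
    cursor.close()
    connection.close()
    return rows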

fetched_results = raw_tweet_database_reader()
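
For reference, `config_file_reader` expects `config.ini` to contain a `DATABASE` section and a `LANGUAGES` section with the keys it looks up. The sketch below shows one possible layout, parsed from a string so that no file is needed. The section and key names come from the code above; every value is a made-up placeholder, and the name `example_config` is hypothetical.

```python
import configparser

# hypothetical config.ini contents: every value below is a placeholder
EXAMPLE_CONFIG = """
[DATABASE]
type = postgresql
host = localhost
database_name = example_db
user = example_user
password = example_password
database_instance_id = example_instance
database_port = 5432

[LANGUAGES]
supported_languages = en
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_CONFIG)

# the same section that config_file_reader reads for the "data_access_object" caller
database_settings = dict(example_config["DATABASE"])
print(database_settings["host"], database_settings["database_port"])
```

Note that `configparser` returns every value as a string, so the port arrives as `"5432"` rather than as an integer.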


In [5]:
# shows one row of data from the database
print(*fetched_results[2], sep='\n\n')

2020-04-05 02:31:05

1246626381577977857

Undocumented workers among those hit first — and worst — by the coronavirus shut down https://t.co/xSHunyCcDx

<a href="http://www.socialflow.com" rel="nofollow">SocialFlow</a>

None

0

False

None

None

0

None

0

en

2467791

The Washington Post

washingtonpost

Breaking news, analysis, and opinion. Founded in 1877. Our staff on Twitter: https://twitter.com/washingtonpost/lists/washington-post-people

True

15357213

1660

354175

2007-03-27 11:19:39

None

None

None

None

None

['https://wapo.st/39JcYKz']

None

None

Washington, DC


We now need to initialize spaCy and load all of its dependencies. 

In [6]:
import spacy

# loads the medium-sized English-language spaCy trained pipeline, with vectors
nlp = spacy.load('en_core_web_md')

As the first step of our text processing, we need to extract all of the named 
entities from the tweet text. We will do this by first running the entire collected stream 
through a spaCy pipeline. 

We start by pairing each tweet's text with its user location, then running those pairs through spaCy.

In [7]:
# we collect the data we want spaCy to ingest from the tweet_text and user_location fields
fetched_tweet_text = []
fetched_user_location = []
fetched_data_tuples = []
for row in fetched_results:
    # tweet_text is the third column of each row; user_location is the last
    fetched_tweet_text.append(row[2])
    fetched_user_location.append(row[-1])

# create a list of tuples of (tweet_text, {'user_location': user_location})
for text, location in zip(fetched_tweet_text, fetched_user_location):
    fetched_data_tuples.append((text, {'user_location': location}))

# nlp.pipe processes the texts as a stream; with as_tuples=True it yields a (Doc, context) pair for each input tuple
docs = list(nlp.pipe(fetched_data_tuples, as_tuples=True))

# prints the text, context, and entities of the first Doc (reuses docs instead of running the pipe again)
for doc, context in docs:
    print(doc.text)
    print('\t', context)
    for ent in doc.ents:
        print('\t', ent.text, ent.label_)
    print('---\n')
    break

Every March, a rattlesnake festival brings about $8.3 million into the local economy of Sweetwater, Texas. The tradition dates back to 1958. 

Photographer Lizzie Chen wanted to explore why it has remained for 62 years. https://t.co/iGiOZ6dlg2 https://t.co/UYFfmKwgRu
	 {'user_location': 'None'}
	 Every March DATE
	 about $8.3 million MONEY
	 Sweetwater GPE
	 Texas GPE
	 1958 DATE
	 Lizzie Chen PERSON
	 62 years DATE
	 https://t.co/iGiOZ6dlg2 https://t.co/UYFfmKwgRu PERSON
---



All of the processing on the Doc objects has already been done: all that's left now is 
to use the data. 

We need to be able to assign a location to tweets whose text does not mention one. Many 
tweets name a place in their text, but many do not. There are a few reasons for 
this:

* tweet context
If a tweet is about a person, or about a well-known event, location data is not 
necessary, nor is it necessarily helpful. 

* user context
If a tweet is from a small source, a local newspaper, or even a national source, often 
the context is that the _source_ is local. 

For the first reason, we must make sure that the context of the tweet overrides the 
location data, even when location data is provided. This matters because of the steps 
we are going to take for the second reason. 

For the second reason, we can inject the tweet user's location data if no other 
location data exists in the tweet text. 
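
The fallback rule for the second reason can be sketched without spaCy. This is an illustration only: `choose_location` is a hypothetical helper, entities are represented as plain `(text, label)` tuples, and treating the string `'None'` as a missing location is an assumption based on the context value printed in the output above. It does not address the first reason (tweet context overriding location data).

```python
from typing import List, Optional, Tuple

Entity = Tuple[str, str]  # (entity text, entity label), e.g. ("Texas", "GPE")


def choose_location(entities: List[Entity], user_location: Optional[str]) -> Optional[str]:
    """Prefers a place named in the tweet text; falls back to the user's location."""
    places = [text for text, label in entities if label == "GPE"]
    if places:
        # the tweet text names a place, so the account's location is ignored
        return ", ".join(places)
    # missing locations may arrive as None, as an empty string, or as the string 'None'
    if user_location in (None, "", "None"):
        return None
    return user_location


tweet_entities = [("Sweetwater", "GPE"), ("Texas", "GPE"), ("1958", "DATE")]
print(choose_location(tweet_entities, "Washington, DC"))                 # Sweetwater, Texas
print(choose_location([("Lizzie Chen", "PERSON")], "Washington, DC"))    # Washington, DC
```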

In this code block, we try to complete the entire merging process, making the context Doc part of the tweet_text Doc.

In [8]:
def adding_user_location():
    """
    Checks each (Doc, context) pair in docs for a GPE (geopolitical entity). 
    If the tweet text contains none, the pair is passed to doc_reconstructor, 
    which builds a new Doc with the user's location added; otherwise the 
    original Doc is kept.
    
    :return: (list) a list of spaCy Docs, one per tweet
    """
    # list of Doc objects with context-added location data
    docs_with_gpe = []
    
    for doc, context in docs:
        # True if any entity spaCy detected in the tweet text is a GPE
        contains_GPE = any(ent.label_ == "GPE" for ent in doc.ents)
        if contains_GPE:
            docs_with_gpe.append(doc)
        else:
            docs_with_gpe.append(doc_reconstructor((doc, context)))
    return docs_with_gpe
        
        
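def doc_reconstructor(original_doc):
    """
    Takes a (Doc, context) pair and returns a new Doc built from the tweet text with the 
    user's location appended. If no user location is available, the original Doc is returned.
    """
    doc, context = original_doc
    user_location = context['user_location']
    # missing locations can arrive as None or as the string 'None'; appending them would only add noise
    if user_location in (None, 'None'):
        return doc
    return nlp(doc.text + '. ' + str(user_location) + '.')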

docs_with_gpe = adding_user_location()


In [9]:
for doc in docs_with_gpe:
    print(doc)
    print(type(doc))
    print('\n---\n')
    break

Every March, a rattlesnake festival brings about $8.3 million into the local economy of Sweetwater, Texas. The tradition dates back to 1958. 

Photographer Lizzie Chen wanted to explore why it has remained for 62 years. https://t.co/iGiOZ6dlg2 https://t.co/UYFfmKwgRu
<class 'spacy.tokens.doc.Doc'>

---



## 2. Pre-processing the Text

Here we attempt a bag of words approach to each tweet, seeing what results we get.

In [10]:
def token_processor():
    # list that holds processed tokens in string form
    processed_docs = []

    for doc in docs_with_gpe:    
        # creates list per doc
        doc_list = []
        # flag that is True when the previous token was the '#' character
        is_hashtag = False
        for token in doc:
            # a '#' token starts a hashtag: skip it and flag the next token as hashtag text
            if token.text == '#':
                is_hashtag = True
                continue
            # skips the hashtag text that follows a '#', then resets the flag
            if is_hashtag:
                is_hashtag = False
                continue
            # keeps only alphabetic tokens (drops numerals and punctuation) that are 
            # neither stop words nor URL-like
            if token.is_alpha and not token.is_stop and not token.like_url:
                # stores the lowercased lemma of the token
                doc_list.append(token.lemma_.lower())
                
        processed_docs.append(doc_list)

    return processed_docs

## 3. Vectorizing the Data

Here we build a bag-of-words representation of each tweet using gensim.

### Creating and querying a corpus

In [11]:
from gensim.corpora.dictionary import Dictionary

def gensim_processor(processed_docs):
    # creates a Dictionary from the tokens in processed_docs
    dictionary = Dictionary(processed_docs)
    # converts each document in processed_docs into a bag-of-words vector: 
    # a list of (token_id, count) tuples, with ids taken from the dictionary
    corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    return corpus, dictionary
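
To make the `(token_id, count)` format concrete, here is a dependency-free sketch that mimics what the dictionary and `doc2bow` step produce. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical names, and assigning ids in first-seen order is an assumption made for readability: gensim may number tokens differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(tokenized_docs: List[List[str]]) -> Dict[str, int]:
    """Assigns an integer id to every distinct token, in first-seen order."""
    token_to_id: Dict[str, int] = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(doc: List[str], token_to_id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Converts one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


example_docs = [["new", "york", "case"], ["new", "case", "new"]]
vocabulary = build_vocabulary(example_docs)
example_corpus = [to_bag_of_words(doc, vocabulary) for doc in example_docs]
print(vocabulary)      # {'new': 0, 'york': 1, 'case': 2}
print(example_corpus)  # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (2, 1)]]
```

Word order is discarded: the second document keeps only the fact that token 0 ("new") occurs twice and token 2 ("case") once.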

### Vectorizing Example 1: Using gensim

#### Creating a bag-of-words 

Here we use the gensim corpus and dictionary to see the most common terms per document and across all documents. We can also use the dictionary to look up the terms. 

In [12]:
import itertools
from collections import defaultdict

def bag_of_words(corpus, dictionary):
    # we use the first tweet in our tweet corpus as our reference point
    doc = corpus[0]

    # sorting the doc for frequency
    bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

    # print the top 10 words in the document along with the count
    print('\n\ntop 10 words in tweet')
    for word_id, word_count in bow_doc[:10]:
        print(dictionary.get(word_id), word_count)

    # creates a defaultdict - defaultdict assigns default values to non-existent keys
    # and by supplying the argument 'int', we ensure that nonexistent keys are automatically 
    # assigned a default value of 0, making it ideal for storing the counts of words
    total_word_count = defaultdict(int)
    # itertools.chain.from_iterable() allows us to iterate through a set of sequences as 
    # if they were one continuous sequence - this lets us iterate through our 'corpus' 
    # object, which is a list of lists
    for word_id, word_count in itertools.chain.from_iterable(corpus):
        total_word_count[word_id] += word_count

    # creates a sorted list from the defaultdict 
    sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

    # prints the top 10 words across all documents, along with their count
    print('\n\ntop 10 words across entire tweet corpus')
    for word_id, word_count in sorted_word_count[:10]:
        print(dictionary.get(word_id), word_count)

#### TF-IDF with gensim

In [13]:
from gensim.models.tfidfmodel import TfidfModel

def tfidf(corpus, dictionary):
    # creates a new TfidfModel using the corpus
    tfidf_model = TfidfModel(corpus)
    
    # the weights for each token in the first tweet, corpus[0]
    tfidf_weights = tfidf_model[corpus[0]]
    
    # sorts the weights from highest to lowest
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
    
    # prints the top 10 weighted words, using the dictionary to look up each term
    print('\n\ntop 10 TF-IDF weights in tweet')
    for term_id, weight in sorted_tfidf_weights[:10]:
        print(dictionary.get(term_id), weight)

#### Example 1 Runner

In [14]:
processed_docs = token_processor()

corpus, dictionary = gensim_processor(processed_docs)

bag_of_words(corpus, dictionary)

tfidf(corpus, dictionary)



top 10 words in tweet
bring 1
chen 1
date 1
economy 1
explore 1
festival 1
lizzie 1
local 1
march 1
million 1


top 10 words across entire tweet corpus
new 39
coronavirus 33
york 22
case 11
city 11
death 10
health 9
say 9
pandemic 9
world 8


top 10 TF-IDF weights in tweet
chen 0.25622879300369217
date 0.25622879300369217
explore 0.25622879300369217
festival 0.25622879300369217
lizzie 0.25622879300369217
march 0.25622879300369217
photographer 0.25622879300369217
rattlesnake 0.25622879300369217
remain 0.25622879300369217
sweetwater 0.25622879300369217
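
The identical weights above are what one would expect when several terms each occur once in the tweet and in no other tweet of the corpus. The sketch below reproduces that behaviour by hand. It assumes gensim's documented defaults (raw term frequency, a base-2 logarithm for the inverse document frequency, and scaling to unit length), so check the `TfidfModel` documentation for the installed version. The name `tfidf_by_hand` is hypothetical, and the document passed in must be one of the documents in the corpus.

```python
import math
from typing import Dict, List, Tuple

BowDoc = List[Tuple[int, int]]  # (token_id, count) pairs for one document


def tfidf_by_hand(doc: BowDoc, corpus: List[BowDoc]) -> List[Tuple[int, float]]:
    """Weights one document as tf * log2(N / df), then scales the vector to unit length."""
    total_docs = len(corpus)
    # document frequency: the number of documents each token id appears in
    doc_freq: Dict[int, int] = {}
    for other in corpus:
        for token_id, _count in other:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    raw = [(token_id, count * math.log2(total_docs / doc_freq[token_id]))
           for token_id, count in doc]
    # tokens found in every document get a weight of zero and are dropped
    raw = [(token_id, weight) for token_id, weight in raw if weight > 0]
    norm = math.sqrt(sum(weight ** 2 for _, weight in raw))
    return [(token_id, weight / norm) for token_id, weight in raw] if norm else []


# tokens 0 and 1 appear only in the first document; token 2 appears in all three
toy_corpus = [[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1)], [(2, 1), (3, 1)]]
print(tfidf_by_hand(toy_corpus[0], toy_corpus))
```

In the toy corpus, tokens 0 and 1 receive the same weight because they have the same count and the same document frequency, while token 2 disappears because it carries no distinguishing information.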


### Vectorizing Example 2: Using scikit-learn

## 4. Feature Selection

Since we are trying to determine an unlimited, unpredictable number of topics in the tweets, we must use an unsupervised learning algorithm to get the necessary results.

More to come.