# The Part of Speech Tweet Summarizer

This application takes in raw tweet objects from a database, then extracts summaries and
entities from them using spaCy part-of-speech tools.

## Concept

### The Process

In order to filter raw tweet objects for noise, then summarize them and extract their
natural entities, we will use the following approach.

1. Reading in the Dataset
2. Pre-processing the Data
3. Output

## Execution

### 1. Reading in the Dataset
Before we begin the feature extraction process, we need to connect to the database
and pull down the data. Two things to note as we begin:
* Database connection information is held in a separate configuration file, `config.ini`, inside
"/Users/$User/Quantum/Event Detector/Twitter Event Detector/Common/".
* Since the goal is to gather real-time data, the database would eventually be polled every
15-45 seconds. However, since we are merely testing the concept here, initial polls will be
conducted every 120-300 seconds.
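
The polling schedule described above could be sketched as follows. This is only an illustration, not part of the notebook's pipeline: the `poll` helper and its parameters are hypothetical names, and the `sleep` argument exists solely so the wait can be swapped out when experimenting.

```python
import random
import time
from typing import Callable, List


def poll(fetch: Callable[[], list], polls: int, min_wait: float = 120.0,
         max_wait: float = 300.0, sleep: Callable[[float], None] = time.sleep) -> List[list]:
    """Calls `fetch` a fixed number of times, waiting a random interval between calls."""
    batches = []
    for i in range(polls):
        batches.append(fetch())
        # no need to wait after the final poll
        if i < polls - 1:
            # hypothetical default window of 120-300 seconds, matching the test settings above
            sleep(random.uniform(min_wait, max_wait))
    return batches
```

Calling `poll(raw_tweet_database_reader, polls=3)` would fetch three batches with a random 120-300 second pause between them. Moving to the real-time window would be a matter of passing `min_wait=15, max_wait=45`.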

Here, we connect to the database and read the last 100 entries (the count is set by the `number_of_rows` variable).

#### Options

We also have an option here for pulling data from a CSV file, bypassing the database collection steps, for when we have
specific data that we want to explore using our NLP code.

To do this, we set the `database` flag to False, and the code will read the CSV files contained in the `data/inputs`
directory under the project root.

Note: this CSV _must_ be formatted in the same manner as the database data, or else the code will break. Thus, this
is not meant to be a general parser, only a way to analyze specific saved data from the original database.
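
One way to catch a badly formatted CSV early is to check the width of every row before processing it. The sketch below is an assumption about how that could be done, not part of the original code: `find_malformed_rows` is a hypothetical helper, and the expected width of 31 is taken from the number of fields selected by the database query in this notebook.

```python
# assumption: the database query in this notebook selects 31 fields per tweet
EXPECTED_COLUMNS = 31


def find_malformed_rows(rows: list, expected_columns: int = EXPECTED_COLUMNS) -> list:
    """Returns (row_index, column_count) for every row with an unexpected width."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected_columns]
```

An empty result suggests the rows have the right shape. It does not verify that the columns are in the right order.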

In [18]:
# this cell sets up the database connections to pull data directly

import configparser
import os
import psycopg2

def config_file_reader(API_caller: str) -> tuple:
    """
    A common configuration file reader.

    Reads data from a common configuration file, determining which fields to call depending
    on the API caller passed to it.

    :param API_caller:(str) the name of the service calling this API

    :return: (tuple) a tuple of strings of each configuration returned for the called service
    """
    home_directory_path = os.path.expanduser("~")
    logger_directory_path = os.path.join(home_directory_path, "Quantum", "Event Detector",
                                         "Twitter Event Detector", "Logs")
    config_directory_path = os.path.join(home_directory_path, "Quantum", "Event Detector",
                                         "Twitter Event Detector", "Common")
    config_file_path = os.path.join(config_directory_path, "config.ini")

    # instantiates the configuration parser
    config = configparser.ConfigParser()

    # if config file exists, proceed: else, create directory structure, then fail with a clear error
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    else:
        # exist_ok avoids a second error when the directory is present but the file is missing
        os.makedirs(config_directory_path, exist_ok=True)
        raise FileNotFoundError("No config file found in " + config_directory_path +
                                ". Please place a configuration file into this directory and try again.")

    if API_caller == "data_access_object":
        database_type = config["DATABASE"]["type"]
        database_host = config["DATABASE"]["host"]
        database_name = config["DATABASE"]["database_name"]
        database_user = config["DATABASE"]["user"]
        database_password = config["DATABASE"]["password"]
        database_instance_id = config["DATABASE"]["database_instance_id"]
        database_port = config["DATABASE"]["database_port"]
        return database_type, database_host, database_name, database_user, database_password, \
               database_instance_id, database_port
    elif API_caller == "logger_setup":
        return logger_directory_path,
    elif API_caller == "languages":
        languages = config["LANGUAGES"]["supported_languages"]
        return languages
    elif API_caller == "account_metadata_importer":
        # this API call only requires the directory path to the config file (which stores a CSV file necessary)
        return config_directory_path,
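    else:
        # raises instead of printing, so that callers never receive None in place of a tuple
        raise ValueError("Error on reading config file: unknown API caller '" + API_caller + "'")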

def raw_tweet_database_connector():
    """
    Creates and returns a connection object to a PostgreSQL database.

    :return: (psycopg2.connect) a PostgreSQL connection object
    """
    config = config_file_reader("data_access_object")
    database_type, database_host, database_name, database_user, database_password, \
        database_instance_id, database_port = config

    try:
        connection = psycopg2.connect(host=database_host, dbname=database_name, user=database_user,
                                      password=database_password, port=database_port)
        return connection
    except psycopg2.OperationalError:
        print('Database connection error')
        # re-raises so that callers do not receive None in place of a connection
        raise

def raw_tweet_database_reader() -> list:
    """
    Reads the most recent entries in the Raw Tweet Database.

    The number of entries is taken from the `number_of_rows` variable set in a later cell.

    :return: (list) a list of row tuples, ordered from newest to oldest
    """
    # calls the database connector
    connection = raw_tweet_database_connector()
    cursor = connection.cursor()

    sql = "SELECT tweet_time_created, tweet_uid, tweet_text, tweet_source, reply_tweet_uid, reply_tweet_count, " \
      "quote_tweet, quote_tweet_uid, quote_tweet_text, quote_tweet_count, retweet_tweet_status,  " \
      "retweet_tweet_count, tweet_language, user_uid, user_name, user_screen_name, user_description, " \
      "user_verification, user_follower_count, user_friends_count, user_statuses_count, user_time_created, " \
      "tweet_coordinates, tweet_place, tweet_place_country_code, tweet_place_bounding_box, " \
      "tweet_hashtags, tweet_urls, tweet_symbols, tweet_user_mentions, user_location FROM twitter_posts " \
      "ORDER BY tweet_time_created DESC LIMIT %s;"

    cursor.execute(sql, (number_of_rows,))
    return cursor.fetchall()
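
For reference, the configuration reader above expects a `config.ini` with a `DATABASE` section and a `LANGUAGES` section. The sketch below generates a template with those keys using `configparser`. Every value is a hypothetical placeholder; only the section and key names come from `config_file_reader`.

```python
import configparser
import io

# hypothetical placeholder values; only the section and key names come from config_file_reader
template = configparser.ConfigParser()
template["DATABASE"] = {
    "type": "postgresql",
    "host": "localhost",
    "database_name": "example_db",
    "user": "example_user",
    "password": "example_password",
    "database_instance_id": "example_instance",
    "database_port": "5432",
}
template["LANGUAGES"] = {"supported_languages": "en"}

# writes the template to an in-memory buffer rather than to disk
buffer = io.StringIO()
template.write(buffer)
config_text = buffer.getvalue()
print(config_text)
```

Saving the printed text as `config.ini` in the `Common` directory, with real values substituted, should satisfy the reader.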



In [19]:
# this cell sets up the CSV reader to read data from disk

import os
import csv

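def import_csv(headers: bool) -> list:
    """
    Reads every CSV file in data/inputs into a single list of rows.

    :param headers: (bool) True if each file starts with a header row that should be skipped

    :return: (list) a list of lists, one per tweet
    """
    # list of lists that will contain all tweet data
    results = []

    # makes list of all CSV files in the input directory (ignores other files, such as hidden system files)
    csv_directory_path = os.path.join('data', 'inputs')
    csv_file_names = [name for name in os.listdir(csv_directory_path) if name.endswith('.csv')]

    for file_name in csv_file_names:
        # newline='' lets the csv module handle line endings inside quoted fields
        with open(os.path.join(csv_directory_path, file_name), mode='r', newline='') as csv_file:
            csv_reader = csv.reader(csv_file)
            if headers:
                # skips the header row; the None default avoids StopIteration on an empty file
                next(csv_reader, None)
            for line in csv_reader:
                results.append(list(line))
    return results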

In [20]:
number_of_rows = 100
database = False
headers = True

if database is True:
    fetched_results = raw_tweet_database_reader()
else:
    fetched_results = import_csv(headers)


We now need to initialize spaCy and load all of its dependencies.

In [21]:
import spacy

# loads the medium-sized English-language spaCy trained model, with word vectors
nlp = spacy.load('en_core_web_md')

### 2. Text Pre-processing

As the first step of our text pre-processing, we need to extract all of the named
entities from the tweet text. We will do this by first running the entire collected stream
through a spaCy pipeline.

We start by creating a list of only the tweet text data, then running that list through spaCy.

In [22]:
# builds the (text, context) tuples that spaCy will ingest from the tweet_text and user_location fields
# tweet_text sits at index 2 of each tweet object, and user_location is the final field
fetched_tweet_text = [row[2] for row in fetched_results]
fetched_user_location = [row[-1] for row in fetched_results]

# create a list of tuples of (tweet_text, {'user_location': user_location})
fetched_data_tuples = [(text, {'user_location': location})
                       for text, location in zip(fetched_tweet_text, fetched_user_location)]

# creates a spaCy pipe, which processes input text data as a stream; with as_tuples=True it
# yields a (Doc, context) tuple for each input (text, context) tuple
docs = list(nlp.pipe(fetched_data_tuples, as_tuples=True))

# prints out Doc data - only way to show context is to print during pipe creation.
# for doc, context in nlp.pipe(fetched_data_tuples, as_tuples=True):
#     print(doc.text)
#     print('\t', context)
#     for ent in doc.ents:
#         print('\t', ent.text, ent.label_)
#     print('---\n')
#     break

All of the processing on the Doc objects has already been done: all that's left now is
to use the data.

We need to be able to assign location data to tweets whose text doesn't contain it. Many
tweets mention a location in their text, but many don't. There are a few reasons for
this:

* tweet context
If a tweet is about a person, or about a well known event, location data is not
necessary, nor is it necessarily helpful.

* user context
If a tweet is from a small source, a local newspaper, or even a national source, often
the context is that the _source_ is local.

For the first reason, we must make sure that the context of the tweet overrides the
user's location data, even when that data is provided. This matters because of the
step we take for the second reason.

For the second reason, we can inject the tweet user's location data if no other
location data exists in the tweet text.

In the next cell, we complete the merging process: for each tweet in which spaCy found no GPE entity, the user's
location from the context is appended to the tweet text and the result is parsed into a new Doc.

In [23]:
def adding_user_location() -> list:
    """
    Makes sure each tweet Doc carries location data where possible.

    Iterates over the global `docs` list of (Doc, context) tuples. If spaCy found no GPE
    entity in a tweet, the tweet is rebuilt with the user's location appended to its text;
    otherwise the original Doc is kept.

    :return: (list) a list of spaCy Doc objects
    """
    # list of Doc objects with context-added location data
    docs_with_gpe = []

    for doc, context in docs:
        # determines whether the entities spaCy detected include a GPE
        contains_gpe = any(ent.label_ == "GPE" for ent in doc.ents)
        if contains_gpe:
            docs_with_gpe.append(doc)
        else:
            docs_with_gpe.append(doc_reconstructor((doc, context)))
    return docs_with_gpe


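def doc_reconstructor(original_doc):
    """
    Takes the data from the old Doc (text and context) and combines it to make and return a new Doc.

    :param original_doc: (tuple) a (Doc, context) pair, where context holds a 'user_location' key
    :return: (spaCy Doc) a Doc built from the tweet text plus the user location
    """
    location = original_doc[1].get('user_location')
    # keeps the original Doc when no location exists, so the literal string 'None' is never appended
    if not location:
        return original_doc[0]
    combined_text_and_context = original_doc[0].text + '. ' + str(location) + '.'
    return nlp(combined_text_and_context)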

docs_with_gpe = adding_user_location()
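
The fallback rule used in the cell above can also be sketched independently of spaCy, which makes it easier to reason about. The version below is a simplified illustration under stated assumptions: `resolve_location` is a hypothetical helper that works on plain `(text, label)` pairs rather than Doc objects, and it returns a location string instead of rebuilding a Doc.

```python
from typing import List, Optional, Tuple


def resolve_location(entities: List[Tuple[str, str]], user_location: Optional[str]) -> Optional[str]:
    """Prefers a GPE entity from the tweet text; falls back to the user's location."""
    for text, label in entities:
        if label == "GPE":
            return text
    # no GPE in the tweet text, so the user's profile location is the only signal left;
    # an empty string is treated the same as a missing location
    return user_location or None
```

For example, `resolve_location([("Acme", "ORG")], "Boston")` falls back to the user's location, while a tweet that already names a GPE keeps its own.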

## 2. Pre-processing the Text

Here we have the option of extracting the tokens from each tweet text instance, or
extracting entire parts of speech from each of them.

In [24]:
# extracting tokens

def token_processor():
    # list that holds processed tokens in string form
    processed_docs = []

    for doc in docs_with_gpe:
        # creates list per doc
        doc_list = []
        # flag to determine if the current token follows a hashtag character
        follows_hashtag = False
        for token in doc:
            # a '#' character marks the next token as hashtag text; skips the '#' itself
            if token.text == '#':
                follows_hashtag = True
                continue
            # skips the hashtag text without adding it to the processed token list, then resets the flag
            if follows_hashtag:
                follows_hashtag = False
                continue
            # keeps only alphabetic tokens (removes numerals and punctuation) that are
            # neither stop words nor URL-like
            if token.is_alpha and not token.is_stop and not token.like_url:
                # lemmatizes the token, then lowercases the lemma
                doc_list.append(token.lemma_.lower())

        processed_docs.append(doc_list)

    return processed_docs
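
The hashtag handling in the cell above is meant to drop both the `#` character and the token that follows it. That logic can be checked in isolation on plain strings. The sketch below assumes, as the cell does, that the tokenizer emits `#` as its own token; `strip_hashtags` is a hypothetical helper, not part of the notebook.

```python
def strip_hashtags(tokens: list) -> list:
    """Drops every '#' token together with the token that immediately follows it."""
    kept = []
    skip_next = False
    for token in tokens:
        if token == '#':
            # marks the following token as hashtag text and drops the '#' itself
            skip_next = True
            continue
        if skip_next:
            # drops the hashtag text and resets the flag
            skip_next = False
            continue
        kept.append(token)
    return kept
```

With this rule, `['Fire', 'in', '#', 'Boston', 'today']` keeps only `'Fire'`, `'in'`, and `'today'`.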

In [25]:
# extracting parts of speech
def part_of_speech_processor(docs: list) -> list:
    """
    The eventual idea is to parse out the five Ws: who, what, where, when, and why. This begins the
    process by parsing the text into its constituent parts of speech, starting at a basic level.

    The output is a list with one entry per Doc. Each entry is itself a list holding, in order:
        text:   tweet text
        who:    any named entities that correspond with people or organizations
        what:   verbs and noun chunks that could correspond with events
        where:  any named entities that correspond with places (if none available, defaults to tweet user location)
        when:   any named entities that correspond with time (if none available, defaults to tweet timestamp)
        why:    TBD
        uncategorized: named entities that fit none of the categories above
        noun chunks:   the noun chunks found in the Doc

    :param      docs: a list of spaCy Doc objects, each containing a sequence of tokens and their linguistic annotations

    :returns    parts_of_speech: list containing, for each Doc, the tweet text plus rule-based parsing of who, what, where, when, and why
    """

    # categorizes spaCy entity types (which themselves come from the OntoNotes 5 corpus: details at https://spacy.io/api/annotation#named-entities)
    who_types = ['NORP', 'ORG', 'PERSON']
    what_types = ['EVENT', 'LAW']
    where_types = ['GPE', 'LOC']
    when_types = ['DATE', 'TIME']
    why_types = []
    uncat_types = ['CARDINAL', 'FAC', 'LANGUAGE', 'MONEY', 'ORDINAL', 'PERCENT', 'PRODUCT', 'QUANTITY', 'WORK_OF_ART']

    parts_of_speech = []

    for doc in docs:
        who_tokens = []
        what_tokens = []
        where_tokens = []
        when_tokens = []
        why_tokens = []
        uncat_tokens = []

        noun_chunks = []

        ents = [(ent.text, ent.label_) for ent in doc.ents]
        if ents:
            for e in ents:
                if e[1] in who_types:
                    who_tokens.append(e[0])
                elif e[1] in what_types:
                    what_tokens.append(e[0])
                elif e[1] in where_types:
                    where_tokens.append(e[0])
                elif e[1] in when_types:
                    when_tokens.append(e[0])
                elif e[1] in why_types:
                    why_tokens.append(e[0])
                elif e[1] in uncat_types:
                    uncat_tokens.append([e[0], e[1]])
                else:
                    uncat_tokens.append(['UNKNOWN CATEGORY', e[0], e[1]])

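        # stores the text of each noun chunk in a flat list, rather than nesting a list of Span objects
        noun_chunks.extend(chunk.text for chunk in doc.noun_chunks)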

        parts_of_speech.append([doc, who_tokens, what_tokens, where_tokens, when_tokens, why_tokens,
                                uncat_tokens, noun_chunks])

    return parts_of_speech
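
The chain of `elif` checks above amounts to a lookup from entity label to category. As a possible alternative, the mapping could live in a single dictionary. The sketch below is a simplified illustration, not the notebook's method: it operates on plain `(text, label)` pairs, `categorize_entities` is a hypothetical helper, and every label outside the four populated categories lands in one `uncategorized` bucket.

```python
# maps each entity label used in the cell above to its five-Ws category
CATEGORY_BY_LABEL = {
    'NORP': 'who', 'ORG': 'who', 'PERSON': 'who',
    'EVENT': 'what', 'LAW': 'what',
    'GPE': 'where', 'LOC': 'where',
    'DATE': 'when', 'TIME': 'when',
}


def categorize_entities(ents: list) -> dict:
    """Groups (text, label) pairs into who/what/where/when buckets."""
    buckets = {'who': [], 'what': [], 'where': [], 'when': [], 'uncategorized': []}
    for text, label in ents:
        # labels without a mapping fall through to the 'uncategorized' bucket
        buckets[CATEGORY_BY_LABEL.get(label, 'uncategorized')].append(text)
    return buckets
```

A dictionary keeps each label in exactly one place, which avoids the kind of list-literal slip that is easy to make when categories are spread across several lists.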

In [34]:
extraction_type = 'pos'

if extraction_type == 'pos':
    summarized_tweets = part_of_speech_processor(docs_with_gpe)
elif extraction_type == 'token':
    summarized_tweets = token_processor()

## 3. Output

Since this is simply a testbed for using spaCy rules to summarize text inputs, what we have
now will be output to a CSV file for analysis and use in other applications.

The output CSV will be in `data/outputs`.

In [35]:
import datetime as dt

now_time = dt.datetime.now().strftime("%d-%m-%Y:%H:%M:%S")

output_directory = os.path.join('data', 'outputs')
# creates the output directory if it does not exist yet
os.makedirs(output_directory, exist_ok=True)
output_file = os.path.join(output_directory, now_time + '.csv')

# newline='' prevents the csv module from writing blank lines between rows on some platforms
with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['tweet_text', 'who_tokens', 'what_tokens', 'where_tokens', 'when_tokens', 'why_tokens',
                                'uncat_tokens', 'noun_chunks'])
    writer.writerows(summarized_tweets)
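
Because `csv.writer` converts each non-string cell with `str()`, the list-valued columns above are written as Python list representations. If a flatter format is preferred for downstream tools, each list could be joined into a single string before writing. The sketch below is one possible approach under that assumption; `flatten_row` and the `'; '` separator are hypothetical choices, and the sample row is made up for illustration.

```python
import csv
import io


def flatten_row(row: list, separator: str = '; ') -> list:
    """Joins list-valued cells into one string; other cells are converted with str()."""
    return [separator.join(str(item) for item in cell) if isinstance(cell, list) else str(cell)
            for cell in row]


# writes one made-up row to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(flatten_row(['Fire in Boston', ['FBI', 'Boston PD'], []]))
print(buffer.getvalue())
```

Applied to this notebook, the last line of the output cell would become `writer.writerows(flatten_row(row) for row in summarized_tweets)`.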