# Looking through N-Grams as Factors
### (Started July 2, 2019)

## Introduction
After seeing the potentially strong results from filtering the articles by "phase" then by a second keyword, it became clear that there could be some other interesting groupings of words.

The intuition is that there are likely to be certain groups of words that could result in statistically significant risk-adjusted returns.

The high-level approach will be:
1. Reduce the words in the corpus of text as much as possible. The key here is to remove as many irrelevant words as possible.
2. For each set of n-grams:
    * Filter the article DataFrame using the words in the n-gram
    * Get the return metrics for the filtered articles
3. Calculate and sort by the metrics
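
The steps above can be sketched on toy data as follows. This is only an illustration, not the notebook's implementation: the `ret` column (a return attached to each article) and the choice of the mean as the metric are assumptions made for the example.

```python
from itertools import combinations

import pandas as pd

# Toy data: one row per article, one boolean column per word, and a
# hypothetical `ret` column holding the return that followed each article.
toy = pd.DataFrame({
    "phase":  [True, True, False, True],
    "result": [True, False, False, True],
    "award":  [False, True, True, False],
    "ret":    [0.04, -0.01, 0.00, 0.02],
})
words = ["phase", "result", "award"]

rows = []
for ngram in combinations(words, 2):
    # Step 2a: keep only articles whose title contains every word of the n-gram.
    mask = toy[list(ngram)].all(axis=1)
    subset = toy.loc[mask, "ret"]
    if subset.empty:
        continue
    # Step 2b: compute a return metric for the filtered articles.
    rows.append({"ngram": ngram, "n_articles": len(subset), "mean_ret": subset.mean()})

# Step 3: sort the n-grams by the metric.
summary = pd.DataFrame(rows).sort_values("mean_ret", ascending=False)
print(summary)
```

In practice the metric would be a risk-adjusted one and `n_articles` would be used to discard n-grams backed by too few observations.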

## Table of Contents 

1. ["Imports, Settings and Data Loading"](#1)
2. ["Text Cleaning and Feature Reduction"](#2)
3. ["Build N-Gram Functionality"](#3)

<a id="1"></a>
## Imports, Settings and Data Loading

Note: All of this section came from the previous notebook.

In [30]:
# Imports

# Standard Libraries
from itertools import combinations

# Numerical Libraries
import numpy as np
from scipy.stats import skew, kurtosis
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Visual Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Local Package Libraries
import sys
sys.path.append("../..")

from src.data.make_dataset import clean_and_open_business_wire_data_01, get_raw_data
from src.features.general_helper_functions import GetPrices
from src.features.nlp_functions import *

In [2]:
# Settings

# Silence pandas' chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

%matplotlib inline

In [3]:
_, watchlist_raw, stock_prices_raw = get_raw_data()

In [16]:
article_df = clean_and_open_business_wire_data_01(None)
article_df.time = pd.to_datetime(article_df.time)

# Watchlist

# 0: Create a copy of the data
watchlist_df = watchlist_raw.copy()
print("Original size: ", watchlist_df.shape)

# 1: Get a list of the unique companies that have scraped article data
unique_companies = article_df.ticker.unique()

# 2: Keep only the companies from the list
watchlist_df = watchlist_df.loc[watchlist_df.Ticker.isin(unique_companies)]
print("Final size: ", watchlist_df.shape)

watchlist_df.columns = ["ticker", "marketcap", "sector", "exchange"]

watchlist_df.head()


# Stock Prices

# 0: Make a copy of the stock prices here
prices_df = stock_prices_raw.copy()
print("Original size: ", prices_df.shape)

# 1: Reduce the new copy of prices to only the companies under our scope
prices_df = prices_df[unique_companies]
print("Final size: ", prices_df.shape)

# 2: Ensure index is datetime object
prices_df.index = pd.to_datetime(prices_df.index)

# 3: Sort by date (after conversion, so the order is chronological rather than lexical)
prices_df = prices_df.sort_index()

prices_df.tail()

Original size:  (721, 4)
Final size:  (197, 4)
Original size:  (5402, 208)
Final size:  (5402, 197)


| | ACAD | ACHC | ACOR | ADUS | AERI | AGIO | AIMT | AKCA | AKRX | ALDR | ... | VRAY | VREX | WMGI | WVE | XENT | XLRN | XNCR | XON | YI | ZGNX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-06-17 | 25.42 | 34.0 | 7.68 | 72.56 | 33.29 | 50.66 | 20.54 | 24.39 | 4.34 | 12.45 | ... | 8.95 | 28.21 | 31.31 | 27.43 | 24.45 | 38.36 | 32.55 | 7.6 | 6.75 | 40.3 |
| 2019-06-18 | 25.93 | 34.2 | 7.66 | 72.92 | 33.84 | 51.91 | 20.2 | 23.98 | 4.48 | 11.73 | ... | 9.13 | 28.95 | 31.69 | 27.92 | 23.97 | 40.62 | 34.54 | 8.6 | 6.62 | 39.79 |
| 2019-06-19 | 26.62 | 34.04 | 7.54 | 75.89 | 32.89 | 51.42 | 20.07 | 22.8 | 4.65 | 11.55 | ... | 9.11 | 29.26 | 31.88 | 26.93 | 24.16 | 39.92 | 34.65 | 8.55 | 6.62 | 40.37 |
| 2019-06-20 | 25.73 | 34.52 | 7.28 | 74.91 | 32.13 | 51.3 | 20.24 | 22.9 | 4.74 | 11.54 | ... | 8.92 | 29.57 | 30.48 | 26.86 | 23.62 | 40.85 | 35.1 | 8.01 | 6.62 | 40.91 |
| 2019-06-21 | 26.0 | 34.72 | 7.49 | 74.63 | 30.91 | 50.78 | 20.0 | 24.26 | 4.78 | 11.6 | ... | 8.72 | 29.21 | 30.09 | 26.18 | 23.15 | 40.18 | 35.16 | 7.57 | 6.49 | 40.62 |


<a id="2"></a>
## Text Cleaning and Feature Reduction

Note: The first block also comes from the previous notebook. These functions should probably be moved into src.

In [17]:
def clean_text(df, column_name):
    # Run the cleaning steps in order on df[column_name] (modifies df in place)
    df[column_name] = df[column_name].apply(remove_white_spaces)
    df[column_name] = df[column_name].apply(remove_non_alphanumeric)
    df[column_name] = df[column_name].apply(remove_numbers)
    df[column_name] = df[column_name].apply(remove_stop_words)
    df[column_name] = df[column_name].apply(lemmatize_text)
    return df

def calculate_word_frequency(word_list, df):
    # Fraction of titles containing each word. Note: `word in article` is a
    # substring test, so "market" also matches "researchandmarkets".
    n_articles = df.shape[0]
    d = {word: sum(word in article for article in df.title.values) / n_articles for word in word_list}

    return pd.Series(d).sort_values(ascending=False)

def get_list_of_words(articles, cut_off):
    # Vocabulary: unique words from all titles, longer than 3 characters
    combined_titles = " ".join(articles.title.values.tolist())
    set_of_words = list(set(combined_titles.split(" ")))
    set_of_words = [word for word in set_of_words if len(word) > 3]

    # Keep only words that appear in more than `cut_off` (a fraction) of the titles
    word_frequency = calculate_word_frequency(set_of_words, articles)
    return word_frequency.loc[word_frequency > cut_off].index

In [18]:
article_df = clean_text(article_df, "title")

list_of_words = get_list_of_words(article_df, cut_off=0.01)

print(len(list_of_words))

254
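
Because the frequency above is computed with a substring test, short words can be credited with titles that only contain them inside a longer word. A minimal sketch of a whole-token alternative is shown below; `document_frequency` is a hypothetical helper, not part of the notebook or of `src`, and the titles are toy data.

```python
from collections import Counter

def document_frequency(titles, min_length=4):
    """Share of titles containing each token, matching whole tokens only."""
    counts = Counter()
    for title in titles:
        # set comprehension: a word repeated within one title is counted once
        counts.update({tok for tok in title.split() if len(tok) >= min_length})
    n_titles = len(titles)
    return {word: count / n_titles for word, count in counts.items()}

titles = [
    "researchandmarkets global report",
    "market phase result",
    "global phase study",
    "award",
]
freq = document_frequency(titles)

# Substring matching credits "market" with the first title as well.
substring_share = sum("market" in t for t in titles) / len(titles)
print(freq["market"], substring_share)  # 0.25 0.5

kept = sorted(w for w, f in freq.items() if f > 0.25)
print(kept)  # ['global', 'phase']
```

Switching to whole-token matching would change the vocabulary size reported above, so the two approaches are not interchangeable without rerunning the notebook.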


Now we can go through the article titles and filter out every word that is not in list_of_words.

In [19]:
def drop_word_from_text(text, list_of_words):
    return " ".join([word for word in text.split(" ") if word in list_of_words])

In [20]:
article_df.title = article_df.title.apply(drop_word_from_text, args=(list_of_words,))

We can drop the "ticker" and "article" columns, as they won't be needed.

It will also be useful to have one boolean column per word, set to True when the word appears in the title and False otherwise.

In [24]:
article_df = article_df.drop(["ticker", "article"], axis=1)

In [28]:
for word in list_of_words:
    article_df[word] = article_df.title.str.contains(word)
    
article_df.shape

(8433, 256)

In [29]:
article_df.head()

| | time | title | mark | market | search | research | chan | hand | researchandmarkets | global | ... | micro | american | unite | stage | cure | strategy | light | next | receive | administration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-04 | pharmaceutical present annual global healthcar... | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 2019-05-18 | pharmaceutical present phase result treatment ... | False | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | False | False | False |
| 2 | 2019-05-15 | grow company award | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 2019-05-07 | pharmaceutical present america health care con... | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 2019-05-02 | disease pipeline review insight researchandmar... | True | True | True | True | True | True | True | False | ... | False | False | False | False | False | False | False | False | False | False |
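
In row 4, the columns "mark", "market", "search", "research", "chan" and "hand" are all True because `str.contains` matches substrings, and each of those words occurs inside "researchandmarkets". If whole-word indicators are preferred, one option is a word-boundary regex. The sketch below uses toy titles and is an assumption about what the columns should mean, not a change the notebook makes.

```python
import re

import pandas as pd

# Toy titles, already cleaned and lower-cased.
titles = pd.Series([
    "disease pipeline review insight researchandmarkets",
    "global market report",
])

word = "market"

# Substring match: True wherever the characters appear, even inside a longer word.
substring_hit = titles.str.contains(word, regex=False)

# Whole-word match: \b anchors the pattern at word boundaries.
whole_word_hit = titles.str.contains(rf"\b{re.escape(word)}\b", regex=True)

print(substring_hit.tolist())   # [True, True]
print(whole_word_hit.tolist())  # [False, True]
```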


<a id="3"></a>

## Build N-Gram Functionality

In [36]:
def n_gram(words, n):
    return combinations(words, n)

In [37]:
def find_set_of_articles_for_ngram(articles, ngram_tuple):
    temp_articles = articles.copy()
    for word in ngram_tuple:
        temp_articles = temp_articles.loc[temp_articles[word]]
    return temp_articles.index.tolist()
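
The same lookup can be written without copying the DataFrame for every n-gram, by combining the boolean word columns into a single mask. This is a sketch of one possible alternative; `articles_matching_all` is a hypothetical name and the data is a toy example.

```python
import pandas as pd

def articles_matching_all(articles, ngram_tuple):
    """Index labels of the rows where every word column in the n-gram is True."""
    mask = articles[list(ngram_tuple)].all(axis=1)
    return articles.index[mask.to_numpy()].tolist()

toy = pd.DataFrame(
    {"phase": [True, True, False], "result": [True, False, False]},
    index=[10, 11, 12],
)
print(articles_matching_all(toy, ("phase", "result")))  # [10]
```

Selecting only the n-gram's columns keeps each call proportional to the number of words in the n-gram rather than to the full width of the DataFrame.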

In [55]:
%%time
ngram = n_gram(list_of_words, 2)

dict_linking_ngram_to_indexes = {word_tuple: find_set_of_articles_for_ngram(article_df, word_tuple) for word_tuple in ngram}

Wall time: 4min 48s


This is a very large number of pairs: 254 words give 254 × 253 / 2 = 32,131 unordered pairs. Is it necessary to check them all? How could we further reduce the number of words? One option is to return to the idea of frequency and drop every pair of words that appears together in too few articles.
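
One way to apply the frequency idea to pairs is to count co-occurrences for all pairs at once with a matrix product over the boolean word columns, then keep only the pairs above a threshold. The sketch below uses a toy matrix, and `min_articles` is a hypothetical threshold chosen for the example.

```python
from itertools import combinations

import numpy as np

words = ["phase", "result", "award"]

# Rows are articles, columns follow `words`; 1 means the word is in the title.
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# co[i, j] = number of articles containing both words[i] and words[j].
co = X.T @ X

min_articles = 2  # hypothetical cut-off
frequent_pairs = [
    (words[i], words[j])
    for i, j in combinations(range(len(words)), 2)
    if co[i, j] >= min_articles
]
print(frequent_pairs)  # [('phase', 'result')]
```

With the real data, `X` could be taken from the boolean word columns of `article_df`, and only the surviving pairs would need their return metrics computed.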

In [56]:
len(dict_linking_ngram_to_indexes)

32131
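
Since `n_gram` is built on `itertools.combinations`, the "2-grams" here are unordered word pairs rather than adjacent words, and their count is the binomial coefficient C(254, 2). A quick check of that arithmetic, and of how fast the count grows for triples:

```python
from itertools import combinations
from math import comb

n_words = 254

# Number of unordered pairs: 254 * 253 / 2
print(comb(n_words, 2))  # 32131

# Same count obtained by enumerating the pairs, as n_gram does
print(sum(1 for _ in combinations(range(n_words), 2)))  # 32131

# Unordered triples grow much faster: 254 * 253 * 252 / 6
print(comb(n_words, 3))  # 2699004
```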