# Looking through N-Grams as Factors
### (Started July 2, 2019)

## Introduction
After seeing the potentially strong results from filtering the articles by "phase" then by a second keyword, it became clear that there could be some other interesting groupings of words.

The intuition is that there are likely to be certain groups of words that could result in statistically significant risk-adjusted returns.

The high-level approach will be:
1. Reduce the vocabulary of the corpus as much as possible. The key here is to remove as many irrelevant words as we can.
2. For each set of n-grams:
    * Filter the article DataFrame using the words in the n-gram
    * Get the return metrics for the filtered articles
3. Calculate and sort by the metrics
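
The three steps above can be sketched on toy data. The DataFrame, its column names (`title`, `ret`) and all values below are hypothetical and are not the notebook's data; the sketch only illustrates the filter-then-aggregate loop.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: one row per article, with a cleaned title and a
# made-up holding-period return.
toy_articles = pd.DataFrame({
    "title": ["phase result positive", "phase trial start",
              "result positive trial", "annual conference"],
    "ret": [0.10, -0.02, 0.04, 0.00],
})
toy_words = ["phase", "result", "positive", "trial"]

# Split each title into tokens once, so words are matched exactly.
tokens = toy_articles["title"].str.split()

# Step 2: for every 2-gram, keep the articles whose title contains both words.
rows = {}
for pair in combinations(toy_words, 2):
    mask = tokens.apply(lambda t: all(w in t for w in pair))
    selected = toy_articles.loc[mask, "ret"]
    if not selected.empty:
        rows[pair] = {"mean_ret": selected.mean(), "n_articles": len(selected)}

# Step 3: collect the metrics and sort them.
toy_summary = pd.DataFrame(rows).T.sort_values("mean_ret", ascending=False)
print(toy_summary)
```

With real data, the mean would be replaced by the full set of return metrics, and pairs backed by only one or two articles would need to be treated with caution.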

## Table of Contents 

1. ["Imports, Settings and Data Loading"](#1)
2. ["Text Cleaning and Feature Reduction"](#2)
3. ["Build N-Gram Functionality"](#3)
4. ["Iterate and Calculate Metrics per N-Gram"](#4)

<a id="1"></a>
## Imports, Settings and Data Loading

Note: All of this section came from the previous notebook.

In [1]:
# Imports

# Standard Libraries
from itertools import combinations

# Numerical Libraries
import numpy as np
from scipy.stats import skew, kurtosis
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Visual Libraries
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Local Package Libraries
import sys
sys.path.append("../..")

from src.data.make_dataset import *
from src.features.general_helper_functions import *
from src.features.text_cleaning import *

In [2]:
# Settings

# Silence the pandas chained-assignment warnings
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

%matplotlib inline

In [3]:
_, watchlist_raw, stock_prices_raw = get_raw_data()

*(Added the cleaning and formatting functions to make_dataset.py - July 2, 2019)*

In [4]:
article_df = clean_and_open_business_wire_data_01(None)
article_df.time = pd.to_datetime(article_df.time)

# Watchlist
watchlist_df = clean_and_format_watchlist(watchlist_raw, article_df.ticker.unique())


# Stock Prices
prices_df = clean_and_format_prices(stock_prices_raw, article_df.ticker.unique())

# Return Window
return_window = compute_return_window(article_df, prices_df, n_window=30)

return_window.head()

            ACAD    ACHC  ACOR  ADUS  AERI  AGIO  AIMT  AKCA  AKRX  ALDR  \
1998-01-02   NaN  3.5958   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1998-01-05   NaN  3.4911   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1998-01-06   NaN  3.5958   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1998-01-07   NaN  3.5958   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1998-01-08   NaN  3.5958   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

            ...   VRAY  VREX  WMGI  WVE  XENT  XLRN  XNCR  XON  YI  ZGNX  
1998-01-02  ...    NaN   NaN   NaN  NaN   NaN   NaN   NaN  NaN NaN   NaN  
1998-01-05  ...    NaN   NaN   NaN  NaN   NaN   NaN   NaN  NaN NaN   NaN  
1998-01-06  ...    NaN   NaN   NaN  NaN   NaN   NaN   NaN  NaN NaN   NaN  
1998-01-07  ...    NaN   NaN   NaN  NaN   NaN   NaN   NaN  NaN NaN   NaN  
1998-01-08  ...    NaN   NaN   NaN  NaN   NaN   NaN   NaN  NaN NaN   NaN  

[5 rows x 197 columns]

Unnamed: 0,R_0,R_1,R_2,R_3,R_4,R_5,R_6,R_7,R_8,R_9,...,R_19,R_20,R_21,R_22,R_23,R_24,R_25,R_26,R_27,R_28
0,0.016568,-0.017357,0.019329,0.012229,0.010651,0.008679,0.011045,-0.042998,0.002761,0.02288,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,-0.010794,-0.035081,-0.065921,-0.01542,-0.03084,-0.02313,0.000771,0.00771,-0.052043,-0.040093,...,-0.011951,-0.064765,-0.020046,-0.000386,0.026214,-0.008096,0.002313,,,
3,0.013173,0.006587,0.001162,-0.045719,-0.008136,0.005037,-0.005812,-0.030221,-0.061217,-0.010461,...,-0.00155,-0.03487,0.001162,-0.005812,-0.007361,-0.009299,-0.006974,-0.060054,-0.01511,0.004649
4,0.04,0.050588,0.012157,0.02549,0.018824,0.013333,-0.034118,0.003922,0.017255,0.006275,...,-0.059216,-0.043529,-0.005882,0.010588,-0.023137,0.013333,0.006275,0.004706,0.002745,0.005098


<a id="2"></a>

## Text Cleaning and Feature Reduction

Note: The first block is also from the previous Notebook. Should probably add these to src.

*(Added to src: nlp_functions.py - July 2, 2019)*

In [5]:
def calculate_word_frequency(word_list, df):
    # Share of titles containing each word. Note that `word in title` is a
    # substring test, so "mark" also counts titles that contain "market".
    n_articles = df.shape[0]
    d = {word: sum(word in title for title in df.title.values) / n_articles for word in word_list}

    return pd.Series(d).sort_values(ascending=False)

def get_list_of_words(articles, cut_off):
    # Unique space-separated tokens across all titles
    combined_titles = " ".join(articles.title.values.tolist())
    set_of_words = set(combined_titles.split(" "))

    # Drop short tokens (3 characters or fewer)
    set_of_words = [word for word in set_of_words if len(word) > 3]

    # Keep only the words that appear in more than `cut_off` of the titles
    word_frequency = calculate_word_frequency(set_of_words, articles)
    return word_frequency.loc[word_frequency > cut_off].index

In [6]:
article_df = clean_text(article_df, "title")

list_of_words = get_list_of_words(article_df, cut_off=0.01)

print(len(list_of_words))

254


Now we can go through the article titles and filter out every word that is not in list_of_words.

In [7]:
def keep_sublist_words(text, list_of_words):
    return " ".join([word for word in text.split(" ") if word in list_of_words])

In [8]:
article_df.title = article_df.title.apply(keep_sublist_words, args=(list_of_words,))

We can drop the columns "ticker" and "article", as they won't be needed.

Further, it will be useful to have one boolean column per word, indicating whether that word appears in the title.

In [9]:
article_df = article_df.drop(["ticker", "article"], axis=1)

In [10]:
for word in list_of_words:
    article_df[word] = article_df.title.str.contains(word)
    
article_df.shape

(8433, 256)

In [11]:
article_df.head()

Unnamed: 0,time,title,mark,market,search,research,chan,hand,researchandmarkets,global,...,american,stage,unite,micro,strategy,cure,administration,light,receive,next
0,2019-06-04,pharmaceutical present annual global healthcar...,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
1,2019-05-18,pharmaceutical present phase result treatment ...,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
2,2019-05-15,grow company award,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2019-05-07,pharmaceutical present america health care con...,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,2019-05-02,disease pipeline review insight researchandmar...,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
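
One caveat about these columns: `str.contains` matches substrings (and treats the pattern as a regular expression by default), so a short word is flagged whenever it appears inside a longer one. This is consistent with row 4 above, where `mark`, `search`, `chan` and `hand` are all True for a title containing `researchandmarkets`. A minimal sketch of the difference, using hypothetical titles and a token-level alternative that is not part of the original notebook:

```python
import pandas as pd

# Hypothetical titles, chosen to mimic the cleaned titles above.
titles = pd.Series(["pipeline review researchandmarkets", "market update"])

# Substring match (what str.contains does): "mark" is found inside
# "researchandmarkets" and inside "market".
substring_flag = titles.str.contains("mark", regex=False)

# Token match: split on whitespace and test for the exact word.
token_flag = titles.str.split().apply(lambda tokens: "mark" in tokens)

print(substring_flag.tolist())  # [True, True]
print(token_flag.tolist())      # [False, False]
```

Whether substring matching is desirable depends on the goal; with it, pairs such as (`mark`, `market`) largely describe the same set of articles.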


<a id="3"></a>

## Build N-Gram Functionality

In [12]:
def get_n_gram(words, n):
    return combinations(words, n)

def get_ngram_articles(articles, ngram_tuple):
    # Keep the rows where every word of the n-gram is flagged True
    mask = articles[list(ngram_tuple)].all(axis=1)
    return articles.index[mask].tolist()

def get_dict_ngram_articles(articles, list_of_words):
    ngram = get_n_gram(list_of_words, 2)
    # Use the `articles` argument rather than the global article_df
    return {word_tuple: get_ngram_articles(articles, word_tuple) for word_tuple in ngram}

In [13]:
%%time
dict_linking_ngram_to_indexes = get_dict_ngram_articles(article_df, list_of_words)


print(len(dict_linking_ngram_to_indexes))

32131
Wall time: 1min 56s
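
The dictionary holds one entry per unordered pair of the 254 retained words, which is C(254, 2) = 32131 and matches the printed length. If only the number of articles per pair is needed, the per-pair loop could be replaced by a single matrix product over the boolean word columns. The sketch below uses a small hypothetical indicator matrix rather than `article_df`; it is an alternative shown for illustration, not the notebook's method.

```python
from math import comb

import numpy as np

# Number of unordered word pairs for the 254 retained words.
print(comb(254, 2))  # 32131

# Hypothetical indicator matrix: rows are articles, columns are words,
# 1 means the word appears in the title.
indicators = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
])

# Entry (i, j) counts the articles in which word i and word j both appear.
co_occurrence = indicators.T @ indicators

# The upper triangle (excluding the diagonal) lists each pair exactly once.
i_idx, j_idx = np.triu_indices(indicators.shape[1], k=1)
pair_counts = {(int(i), int(j)): int(co_occurrence[i, j])
               for i, j in zip(i_idx, j_idx)}
print(pair_counts)  # {(0, 1): 2, (0, 2): 2, (1, 2): 1}
```

Pairs with a count of zero could then be skipped before any return metrics are computed.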


<a id="4"></a>

## Iterate and Calculate Metrics per N-Gram

For each n-gram, we need to collect the stock returns over our window and then calculate the return metrics.

In [18]:
# Original function is from Notebook 02, converted to work with n-grams

def get_return_details_per_word(article_df, return_df, word_list, holding_period, cut_off=0.05):
    # Note: cut_off is accepted but not used anywhere in this function.
    set_of_ngrams = get_n_gram(word_list, 2)
    
    res_dict = {}
    for ngram in set_of_ngrams:
        event_id_list = get_ngram_articles(article_df, ngram)
        # Positional lookup: assumes article_df has a default 0..n-1 index
        # that lines up with the rows of return_df.
        returns = return_df["R_{}".format(holding_period - 1)].iloc[event_id_list].dropna()
        res_dict[ngram] = [
            np.mean(returns), 
            np.std(returns), 
            skew(returns.values), 
            kurtosis(returns.values),
            returns.shape[0]/article_df.shape[0]
        ]
    
    cols = ["return", "dev", "skew", "kurt", "freq_occurance"]
    
    return pd.DataFrame(res_dict, index=cols).T

def sharpe_ratio(row, holding_period, annual_risk_free_rate):
    scale_param = 252 / holding_period # This will be used to annualize the expected return 
                                       # and the deviation
    num = (scale_param * row["return"] - annual_risk_free_rate) 
    den = (np.sqrt(scale_param) * row["dev"])
    return num / den
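
As a worked example of the annualization, the sketch below repeats the formula on a single hypothetical row. With k = 252 / holding_period, the ratio is (k * return - risk_free) / (sqrt(k) * dev). The return, deviation and risk-free rate are made-up values, and the function is restated only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def sharpe_ratio(row, holding_period, annual_risk_free_rate):
    # Same formula as above: annualize the mean and the deviation, then divide.
    scale_param = 252 / holding_period
    num = scale_param * row["return"] - annual_risk_free_rate
    den = np.sqrt(scale_param) * row["dev"]
    return num / den

# Hypothetical 20-day metrics for one n-gram.
row = pd.Series({"return": 0.02, "dev": 0.10})
value = sharpe_ratio(row, holding_period=20, annual_risk_free_rate=0.02)

# k = 12.6, numerator = 0.252 - 0.02 = 0.232, denominator = sqrt(12.6) * 0.10
print(round(value, 3))  # 0.654
```

Assuming a results frame with one row per n-gram, the same function could be applied row-wise with `DataFrame.apply(sharpe_ratio, axis=1, args=(20, 0.02))`.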

In [20]:
res_df = get_return_details_per_word(article_df, 
                                     return_window, 
                                     list_of_words, 
                                     holding_period=20, 
                                     cut_off=0.05)

In [21]:
res_df.head(10)

Unnamed: 0_level_0,mark,mark,mark,mark,mark,mark,mark,mark,mark,mark,...,cure,cure,cure,cure,administration,administration,administration,light,light,receive
Unnamed: 0_level_1,market,search,research,chan,hand,researchandmarkets,global,line,port,pipeline,...,administration,light,receive,next,light,receive,next,receive,next,next
return,0.106122,0.105561,0.105561,0.07581,0.07581,0.07581,0.109309,0.113945,0.110854,0.113865,...,,,,,,0.530292,,,,0.525436
dev,0.2232,0.222447,0.222447,0.200197,0.200197,0.200197,0.226605,0.226717,0.221694,0.226786,...,,,,,,0.0,,,,0.0
skew,0.735344,0.740837,0.740837,1.00575,1.00575,1.00575,0.703906,0.673792,0.678931,0.674687,...,,,,,,0.0,,,,0.0
kurt,-0.769131,-0.755797,-0.755797,0.001682,0.001682,0.001682,-0.84556,-0.914813,-0.854988,-0.913864,...,,,,,,-3.0,,,,-3.0
freq_occurance,0.426894,0.411004,0.411004,0.263607,0.263607,0.263607,0.21238,0.149413,0.104115,0.149176,...,0.0,0.0,0.0,0.0,0.0,0.000119,0.0,0.0,0.0,0.000119
