# Using Keywords as a Factor

Continuing from Notebook 01-FirstDataExploration, I will begin analyzing the article titles for keywords that could have interesting statistical properties.

## Information from the Previous Notebook

Now, it would be good to look through the keywords from the article titles and see whether any of them are indicative of positive or negative results.

The initial look will be very basic and intuitive, as we are just taking a first look.

Hypothesis:
* Words like "positive" and "negative" should indicate which direction the stock will go. Exploration should also turn up other words that help determine the direction (and possibly the magnitude).

Collect all of the "important" words (using NLP practices) for a bag of words.

Separate the events into classes; in this case I will use quartiles (this can be changed later on). For each word that has "enough" frequency, we want to get as much information as possible about the probability of a company ending in each quartile bracket.
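
As a minimal sketch of the quartile idea, `pd.qcut` can assign each event to a return quartile, and a row-normalised crosstab then estimates the probability of each quartile given that a word is present or absent. The returns and the keyword indicator below are randomly generated placeholders, not data from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical event returns; in the notebook these would come from the
# return window built further down.
rng = np.random.default_rng(0)
event_returns = pd.Series(rng.normal(0.0, 0.05, size=200), name="ret")

# pd.qcut splits on sample quantiles, so each class holds 25% of the events.
quartile = pd.qcut(event_returns, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Hypothetical keyword indicator: True where the title contains the word.
has_word = pd.Series(rng.random(200) < 0.3, name="has_word")

# Row-normalised crosstab: estimated P(quartile | word present or absent).
quartile_probs = pd.crosstab(has_word, quartile, normalize="index")
print(quartile_probs)
```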

## Table of Contents

1. ["Imports and Settings"](#1)
2. ["Importing and Cleaning the Data"](#2)
3. ["NLP Feature Exploration"](#3)

## Imports and Settings
<a id="1"></a>

In [79]:
# Imports

# Numerical Libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Visual Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Local Package Libraries
import sys
sys.path.append("../..")

from src.data.make_dataset import clean_and_open_business_wire_data_01, get_raw_data
from src.features.general_helper_functions import GetPrices
from src.features.nlp_functions import *

In [33]:
# Settings

# Stop the warnings for chain in pandas...
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Importing and Cleaning the Data
<a id="2"></a>

In [3]:
_, watchlist_raw, stock_prices_raw = get_raw_data()

In [4]:
clinical_trial_df = clean_and_open_business_wire_data_01(None)
clinical_trial_df.time = pd.to_datetime(clinical_trial_df.time)

In [7]:
# Watchlist

# 0: Create a copy of the data
watchlist_df = watchlist_raw.copy()
print("Original size: ", watchlist_df.shape)

# 1: Get a list of the unique companies that have scraped article data
unique_companies = clinical_trial_df.ticker.unique()

# 2: Keep only the companies from the list
watchlist_df = watchlist_df.loc[watchlist_df.Ticker.isin(unique_companies)]
print("Final size: ", watchlist_df.shape)

watchlist_df.columns = ["ticker", "marketcap", "sector", "exchange"]

watchlist_df.head()


# Stock Prices

# 0: Make a copy of the stock prices here
prices_df = stock_prices_raw.copy()
print("Original size: ", prices_df.shape)

# 1: Reduce the new copy of prices to only the companies under our scope
prices_df = prices_df[unique_companies]
print("Final size: ", prices_df.shape)

# 2: Sort by date
prices_df.sort_index(inplace=True)

# 3: Ensure index is datetime object
prices_df.index = pd.to_datetime(prices_df.index)

prices_df.tail()

Original size:  (721, 4)
Final size:  (197, 4)
Original size:  (5402, 208)
Final size:  (5402, 197)


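| | ACAD | ACHC | ACOR | ADUS | AERI | AGIO | AIMT | AKCA | AKRX | ALDR | ... | VRAY | VREX | WMGI | WVE | XENT | XLRN | XNCR | XON | YI | ZGNX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-06-17 | 25.42 | 34.0 | 7.68 | 72.56 | 33.29 | 50.66 | 20.54 | 24.39 | 4.34 | 12.45 | ... | 8.95 | 28.21 | 31.31 | 27.43 | 24.45 | 38.36 | 32.55 | 7.6 | 6.75 | 40.3 |
| 2019-06-18 | 25.93 | 34.2 | 7.66 | 72.92 | 33.84 | 51.91 | 20.2 | 23.98 | 4.48 | 11.73 | ... | 9.13 | 28.95 | 31.69 | 27.92 | 23.97 | 40.62 | 34.54 | 8.6 | 6.62 | 39.79 |
| 2019-06-19 | 26.62 | 34.04 | 7.54 | 75.89 | 32.89 | 51.42 | 20.07 | 22.8 | 4.65 | 11.55 | ... | 9.11 | 29.26 | 31.88 | 26.93 | 24.16 | 39.92 | 34.65 | 8.55 | 6.62 | 40.37 |
| 2019-06-20 | 25.73 | 34.52 | 7.28 | 74.91 | 32.13 | 51.3 | 20.24 | 22.9 | 4.74 | 11.54 | ... | 8.92 | 29.57 | 30.48 | 26.86 | 23.62 | 40.85 | 35.1 | 8.01 | 6.62 | 40.91 |
| 2019-06-21 | 26.0 | 34.72 | 7.49 | 74.63 | 30.91 | 50.78 | 20.0 | 24.26 | 4.78 | 11.6 | ... | 8.72 | 29.21 | 30.09 | 26.18 | 23.15 | 40.18 | 35.16 | 7.57 | 6.49 | 40.62 |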


### Filter for "Phase"

In [14]:
df_filtered_for_phase = clinical_trial_df.loc[clinical_trial_df.title.str.contains("phase")]
print("Filtered size: ", df_filtered_for_phase.shape)

Filtered size:  (226, 4)


### Filter the Stock Prices

In [16]:
# Get the stock prices for 30 days following each event
price_window = GetPrices(
    df_filtered_for_phase, 
    prices_df, 
    n_window=30
).add_prices_to_frame()

#.dropna(axis=0, inplace=True)

def perc_return(value_matrix, i, j):
    return (value_matrix[j][i] / value_matrix[j][0]) - 1


return_values = np.array([
    np.array([
        perc_return(price_window.values, i, j) for i in range(1, price_window.values.shape[1])
    ]) for j in range(price_window.values.shape[0])
])


cols = ["R_{}".format(i) for i in range(return_values.shape[1])]

return_window = pd.DataFrame(return_values, index=price_window.index, columns=cols)

return_window.head()

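| | R_0 | R_1 | R_2 | R_3 | R_4 | R_5 | R_6 | R_7 | R_8 | R_9 | ... | R_19 | R_20 | R_21 | R_22 | R_23 | R_24 | R_25 | R_26 | R_27 | R_28 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 0.029448 | 0.027403 | -0.01636 | -0.032311 | 0.042945 | 0.084663 | 0.095706 | 0.055624 | 0.06953 | 0.062577 | ... | 0.036401 | 0.061759 | 0.069121 | 0.005726 | 0.018405 | -0.018814 | -0.002454 | 0.03681 | 0.053988 | 0.018814 |
| 12 | 0.015244 | 0.023247 | 0.028201 | 0.041159 | 0.042302 | 0.023628 | 0.071646 | 0.031631 | 0.016387 | 0.020579 | ... | -0.068216 | -0.040777 | -0.042683 | -0.08346 | -0.098323 | -0.028201 | 0.010671 | 0.02096 | -0.016387 | -0.00343 |
| 25 | 0.147844 | 0.166838 | 0.11191 | 0.123203 | 0.026694 | -0.011807 | -0.031828 | -0.031828 | -0.085729 | -0.100103 | ... | 0.033368 | -0.021561 | 0.042094 | -0.033881 | -0.029261 | -0.065195 | -0.035934 | -0.046715 | -0.051335 | -0.079569 |
| 22 | -0.0373 | -0.056838 | -0.052102 | -0.078745 | -0.055654 | -0.088218 | -0.075784 | -0.024867 | -0.015394 | -0.018354 | ... | -0.153345 | -0.194198 | -0.224393 | -0.207815 | -0.234458 | -0.205447 | -0.171107 | -0.156306 | -0.127294 | -0.107164 |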
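
The nested list comprehension above divides each later price in a row by the first price in that row and subtracts 1. The same calculation can be sketched in vectorised form with `DataFrame.div`. The price window below is a small made-up example, and it assumes, as the loop does, that column 0 holds the base price for each event.

```python
import numpy as np
import pandas as pd

# Hypothetical price window: one row per event, column 0 is the base price.
price_window = pd.DataFrame(
    [[10.0, 11.0, 9.0, 12.0],
     [20.0, 20.0, 25.0, 15.0]],
    index=[6, 12],
)

# Divide every later price by the first price in its row, then subtract 1.
base = price_window.iloc[:, 0]
returns = price_window.iloc[:, 1:].div(base, axis=0) - 1

# Same naming as above: R_0 is the return at the first step after the base.
returns.columns = ["R_{}".format(i) for i in range(returns.shape[1])]
print(returns)
```

Rows whose base price is missing simply come out as NaN, which matches the all-NaN first row in the table above.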


## NLP Feature Exploration
<a id="3"></a>

In this section we will clean up the titles in df_filtered_for_phase using:

* remove_white_spaces
* remove_non_alphanumeric
* remove_numbers
* remove_stop_words

In [19]:
# Work on a copy so the cleaning steps do not write into a slice of clinical_trial_df
articles = df_filtered_for_phase.copy()
articles.title = articles.title.apply(remove_white_spaces)
articles.title = articles.title.apply(remove_non_alphanumeric)
articles.title = articles.title.apply(remove_numbers)
articles.title = articles.title.apply(remove_stop_words)

articles.head()

| | time | title | ticker | article |
| --- | --- | --- | --- | --- |
| 1 | 2019-05-18 | acadia pharmaceuticals present phase clarity ... | ACAD | san diego--(business wire)--acadia pharmaceut... |
| 6 | 2019-04-25 | acadia pharmaceuticals initiates phase clarit... | ACAD | san diego--(business wire)--acadia pharmaceut... |
| 12 | 2019-03-27 | positive phase study results trofinetide pedi... | ACAD | san diego & cincinnati & melbourne, australia... |
| 25 | 2018-10-31 | acadia pharmaceuticals announces positive top ... | ACAD | san diego--(business wire)--acadia pharmaceut... |
| 22 | 2019-01-17 | acorda announces lancet neurology publication ... | ACOR | ardsley, n.y.--(business wire)--acorda therap... |


Looking at the titles again, I noticed that the company's name is also generally in the title. It would be best to remove it, as the company will already be its own feature implicitly.

In [58]:
company_names = watchlist_df.loc[watchlist_df.ticker.isin(unique_companies)].index.tolist()
articles.title = articles.title.apply(remove_company_name, args=(company_names,))

We will now need our set of unique keywords.

In [65]:
combined_titles = " ".join(articles.title.values.tolist())

set_of_words = list(set(combined_titles.split(" ")))
set_of_words.remove("")

len(set_of_words)

857

We will also want to remove "phase", since every title was selected for containing it, and all "words" that have 2 or fewer characters:

In [125]:
set_of_words.remove("phase")

In [122]:
set_of_words = [word for word in set_of_words if len(word) > 2]

len(set_of_words)

828

We need a function that, given a data frame (or sub data frame) with the article titles, returns the fraction of titles in which each word occurs.

In [131]:
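def calculate_word_frequency(word_list, df):
    '''returns the fraction of titles in df that contain each word, sorted in descending order.'''
    # Note: `word in article` is a substring test on the title string, not a whole-word match
    d = {word: sum([1 if word in article else 0 for article in df.title.values])/df.shape[0] for word in word_list}
    
    return pd.Series(d, index = d.keys()).sort_values(ascending=False)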

In [132]:
word_frequency = calculate_word_frequency(set_of_words, articles)
word_frequency.head(20)

trial        0.469027
clinical     0.314159
announce     0.309735
patient      0.305310
study        0.292035
results      0.269912
announces    0.256637
com          0.252212
patients     0.247788
pre          0.238938
ted          0.207965
line         0.199115
data         0.176991
treatment    0.172566
present      0.163717
positive     0.154867
met          0.150442
initiate     0.132743
cancer       0.123894
anal         0.119469
dtype: float64
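
Because `word in article` is a substring test on the whole title, short fragments are also counted when they occur inside longer words: "pre" matches "presents" and "anal" matches "analysis", which probably inflates the frequencies of entries such as pre, ted, and anal above. A minimal sketch of a whole-token alternative follows. The function name `calculate_token_frequency` and the sample titles are hypothetical, and this is not the version used for the results in this notebook.

```python
import pandas as pd

def calculate_token_frequency(word_list, titles):
    """Share of titles that contain each word as a whole token."""
    # Split each title once, so membership tests compare whole tokens.
    token_sets = [set(title.split()) for title in titles]
    freq = {
        word: sum(word in tokens for tokens in token_sets) / len(token_sets)
        for word in word_list
    }
    return pd.Series(freq).sort_values(ascending=False)

# Hypothetical cleaned titles, for illustration only.
titles = [
    "company presents positive trial data",
    "company reported pre clinical results",
    "trial enrollment completed",
]

token_freq = calculate_token_frequency(["pre", "trial", "ted"], titles)
print(token_freq)
```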

So, what we are looking for are words that are in a *substantial* number of documents and provide *enough* classification information.

By substantial, in this case, I mean a word that appears in more than 5% of the titles (an absolute cut-off).

In [141]:
set_of_words = word_frequency.loc[word_frequency > 0.05].index
set_of_words

Index(['trial', 'clinical', 'announce', 'patient', 'study', 'results',
       'announces', 'com', 'patients', 'pre', 'ted', 'line', 'data',
       'treatment', 'present', 'positive', 'met', 'initiate', 'cancer', 'anal',
       'research', 'initiates', 'market', 'markets', 'report', 'first',
       'analysis', 'med', 'chi', 'iii', 'cell', 'pipeline', 'advanced',
       'annual', 'car', 'disease', 'top', 'end', 'enroll', 'meeting',
       'researchandmarkets', 'develop', 'age', 'update', 'tumor',
       'development', 'society', 'reports', 'presents', 'anti', 'complete',
       'enrollment', 'part', 'point', 'pivotal', 'tumors', 'iga', 'europe',
       'carcinoma'],
      dtype='object')

#### Positive and Negative Frequency Ratio

For each word, these numbers give the number of positive-return (or negative-return) events whose title contains the word, divided by the total number of positive-return (or negative-return) events.

I will build the function to take the holding period as an argument, so that I can later test different days for optimality.
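
Written out, with notation introduced here only for illustration: let $P$ be the set of titles whose return over the chosen holding period is greater than 0, and $N$ the set of titles whose return is less than or equal to 0. For example, if 3 of 20 positive-return titles contain a word $w$, then $f_{\text{pos}}(w) = 3/20 = 0.15$.

$$
f_{\text{pos}}(w) = \frac{\left|\{\, t \in P : w \in t \,\}\right|}{|P|},
\qquad
f_{\text{neg}}(w) = \frac{\left|\{\, t \in N : w \in t \,\}\right|}{|N|}
$$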

In [127]:
def split_return_window(ret_df, holding_period=0):
    '''returns the indices of the positive- and negative-return events, in that order.'''
    pos = ret_df.loc[ret_df["R_{}".format(holding_period)] > 0].index
    neg = ret_df.loc[ret_df["R_{}".format(holding_period)] <= 0].index
    
    return pos, neg

In [130]:
pos_ind, neg_ind = split_return_window(return_window, 0)

pos_articles = articles.loc[pos_ind]
neg_articles = articles.loc[neg_ind]

In [142]:
pos_word_freq = calculate_word_frequency(set_of_words, pos_articles)
neg_word_freq = calculate_word_frequency(set_of_words, neg_articles)

In [149]:
sent_df = pd.DataFrame([pos_word_freq, neg_word_freq], index = ["pos", "neg"]).T
sent_df.head(10)

| | pos | neg |
| --- | --- | --- |
| trial | 0.468198 | 0.448622 |
| patient | 0.326855 | 0.325815 |
| announce | 0.325088 | 0.300752 |
| study | 0.298587 | 0.273183 |
| clinical | 0.291519 | 0.290727 |
| announces | 0.273852 | 0.260652 |
| patients | 0.266784 | 0.253133 |
| com | 0.257951 | 0.303258 |
| results | 0.234982 | 0.243108 |
| line | 0.226148 | 0.223058 |


Let's take a look at the difference between the columns. This would indicate a "lean" towards one direction or the other.

In [150]:
sent_df["diff"] = sent_df.pos - sent_df.neg

In [159]:
sent_df.sort_values("diff", inplace=True, ascending=False)

top_10 = sent_df.iloc[:10]
# Note: this reversed slice stops before position -10, so it returns the 9 lowest rows
bottom_10 = sent_df.iloc[:-10:-1]

In [160]:
top_10.T

| | positive | top | initiate | initiates | first | study | announce | tumors | treatment | pivotal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pos | 0.160777 | 0.09364 | 0.14311 | 0.123675 | 0.125442 | 0.298587 | 0.325088 | 0.077739 | 0.171378 | 0.070671 |
| neg | 0.122807 | 0.062657 | 0.112782 | 0.095238 | 0.097744 | 0.273183 | 0.300752 | 0.055138 | 0.150376 | 0.050125 |
| diff | 0.03797 | 0.030983 | 0.030328 | 0.028437 | 0.027697 | 0.025404 | 0.024336 | 0.022601 | 0.021002 | 0.020546 |


In [161]:
bottom_10.T

| | present | pre | com | annual | data | end | ted | pipeline | disease |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pos | 0.120141 | 0.19788 | 0.257951 | 0.04947 | 0.151943 | 0.058304 | 0.183746 | 0.097173 | 0.060071 |
| neg | 0.182957 | 0.255639 | 0.303258 | 0.090226 | 0.18797 | 0.092732 | 0.218045 | 0.12782 | 0.090226 |
| diff | -0.062816 | -0.057759 | -0.045308 | -0.040756 | -0.036026 | -0.034428 | -0.0343 | -0.030646 | -0.030155 |
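
The bottom table has nine columns rather than ten because a reversed slice that stops at -10 excludes position -10 itself. A small sketch on made-up scores shows the slice behaviour and an alternative, `Series.nlargest` and `Series.nsmallest`, that states the intended count directly.

```python
import pandas as pd

# Hypothetical scores for 12 words, already sorted in descending order.
diff = pd.Series(
    range(12, 0, -1),
    index=["w{}".format(i) for i in range(12)],
    dtype=float,
)

# The reversed slice walks from the last row up to, but not including, position -10.
print(len(diff.iloc[:-10:-1]))  # 9
print(len(diff.iloc[:-11:-1]))  # 10

# nlargest / nsmallest return exactly the requested number of rows.
top_10 = diff.nlargest(10)
bottom_10 = diff.nsmallest(10)
print(bottom_10.index.tolist())
```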


Alright, so now we have a way to get the top- and bottom-scoring words.

I would now like to convert this portion of the notebook into a piece of production code. This will allow further investigation into which keywords are important, as I can look at longer holding periods.

Remember, for now, I am simply looking for keywords as factors. They will likely be pushed in as contains_word = {True, False} later on.
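
One way the contains_word = {True, False} factors might look is one boolean column per keyword. This is only a sketch: the titles, the keyword list, and the `contains_` column prefix are hypothetical, and it matches whole tokens rather than substrings.

```python
import pandas as pd

# Hypothetical cleaned titles and keyword list, for illustration only.
titles = pd.Series(
    ["positive top line results", "initiates pivotal trial", "annual meeting data"],
    index=[6, 12, 25],
)
keywords = ["positive", "pivotal", "data"]

# Split each title once into a set of tokens.
token_sets = titles.map(lambda title: set(title.split()))

# One boolean column per keyword, aligned with the article index.
features = pd.DataFrame(
    {
        "contains_{}".format(word): token_sets.map(lambda tokens, w=word: w in tokens)
        for word in keywords
    },
    index=titles.index,
)
print(features)
```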

In [179]:
def calculate_word_frequency(word_list, df):
    d = {word: sum([1 if word in article else 0 for article in df.title.values])/df.shape[0] for word in word_list}
    
    return pd.Series(d, index = d.keys()).sort_values(ascending=False)

def split_return_window(ret_df, holding_period=0):
    '''returns the indices of the positive- and negative-return events, in that order.'''
    pos = ret_df.loc[ret_df["R_{}".format(holding_period)] > 0].index
    neg = ret_df.loc[ret_df["R_{}".format(holding_period)] <= 0].index
    
    return pos, neg

def calculate_top_and_bottom_keywords(article_df, return_df, holding_period, cut_off=0.05):
    '''returns the highest and lowest pos - neg frequency differences, in that order.'''
    # Use the article_df argument rather than the global articles frame
    combined_titles = " ".join(article_df.title.values.tolist())

    set_of_words = list(set(combined_titles.split(" ")))
    set_of_words.remove("")
    set_of_words.remove("phase")
    
    set_of_words = [word for word in set_of_words if len(word) > 2]

    word_frequency = calculate_word_frequency(set_of_words, article_df)
    
    # Keep only the words that appear in more than cut_off of the titles
    set_of_words = word_frequency.loc[word_frequency > cut_off].index
    
    pos_ind, neg_ind = split_return_window(return_df, holding_period)

    pos_articles = article_df.loc[pos_ind]
    neg_articles = article_df.loc[neg_ind]
    
    pos_word_freq = calculate_word_frequency(set_of_words, pos_articles)
    neg_word_freq = calculate_word_frequency(set_of_words, neg_articles)
    
    sent_df = pd.DataFrame([pos_word_freq, neg_word_freq], index = ["pos", "neg"]).T
    
    sent_df["diff"] = sent_df.pos - sent_df.neg
    
    sent_df.sort_values("diff", inplace=True, ascending=False)

    # Note: the reversed slice stops before position -10, so the second Series has 9 entries
    return sent_df["diff"].iloc[:10], sent_df["diff"].iloc[:-10:-1]

In [180]:
# Pass the full articles frame; the function reads its title column
calculate_top_and_bottom_keywords(articles, return_window, 0, 0.05)

(positive     0.037970
 top          0.030983
 initiate     0.030328
 initiates    0.028437
 first        0.027697
 study        0.025404
 announce     0.024336
 tumors       0.022601
 treatment    0.021002
 pivotal      0.020546
 Name: diff, dtype: float64, present    -0.062816
 pre        -0.057759
 com        -0.045308
 annual     -0.040756
 data       -0.036026
 end        -0.034428
 ted        -0.034300
 pipeline   -0.030646
 disease    -0.030155
 Name: diff, dtype: float64)

Great! It works. I will add it to src and refactor it there.

Further, we can create a dataframe that tracks which words are important through time.
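
A rough sketch of such a dataframe, reading "through time" as "across holding periods" (an assumption on my part): rows are words, columns are holding periods, and each value is the positive-minus-negative frequency difference. The function name `keyword_diff_by_period` and the data are hypothetical, and words are matched as whole tokens.

```python
import pandas as pd

def keyword_diff_by_period(titles, return_df, words, periods):
    """Rows are words, columns are holding periods, values are pos - neg frequency."""
    token_sets = titles.map(lambda title: set(title.split()))
    columns = {}
    for k in periods:
        ret = return_df["R_{}".format(k)]
        # Events with a missing return fall into neither group.
        pos, neg = token_sets[ret > 0], token_sets[ret <= 0]
        columns[k] = pd.Series(
            {
                w: pos.map(lambda t: w in t).mean() - neg.map(lambda t: w in t).mean()
                for w in words
            }
        )
    return pd.DataFrame(columns)

# Hypothetical data, for illustration only.
titles = pd.Series(
    ["positive results", "positive data", "annual meeting", "annual data"],
    index=[1, 2, 3, 4],
)
return_df = pd.DataFrame(
    {"R_0": [0.02, 0.01, -0.03, -0.01], "R_1": [0.05, -0.02, 0.01, -0.04]},
    index=[1, 2, 3, 4],
)

diff_by_period = keyword_diff_by_period(titles, return_df, ["positive", "annual"], [0, 1])
print(diff_by_period)
```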