# First Run of Data Exploration

In this notebook I will be:
1. creating the functionality to open the raw data
2. filtering for clinical trial results
3. reading through a few of the results and taking notes on what I see and, where possible, what I do not see

## Table of Contents

1. [Importing Modules and Libraries for Exploration](#1)
2. [Opening the Raw Data](#2)
3. [Data Cleaning](#3)
4. [Filter for Keywords](#4)

## Importing Modules and Libraries for Exploration
<a id="1"></a>

In [10]:
# Imports
import numpy as np
import pandas as pd
import sys
sys.path.append("../../src/data/")
from make_dataset import get_raw_data

sys.path.append("../../src/features")
from nlp_functions import remove_non_english_articles

In [61]:
# Settings

# Silence pandas' chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Opening the Raw Data
<a id="2"></a>

Using `get_raw_data` from `src/data`, open the three datasets that have been collected so far.

In [4]:
business_wire_raw, watchlist_raw, stock_prices_raw = get_raw_data()

Let's take a quick look to confirm that everything loaded properly.

In [5]:
business_wire_raw.head()

Unnamed: 0,link,time,title,ticker,article
0,http://www.businesswire.com/news/home/20190604...,"June 04, 2019",ACADIA Pharmaceuticals to Present at the Goldm...,ACAD,SAN DIEGO--(BUSINESS WIRE)--ACADIA Pharmaceut...
1,http://www.businesswire.com/news/home/20190518...,"May 18, 2019",ACADIA Pharmaceuticals to Present Phase 2 CLAR...,ACAD,SAN DIEGO--(BUSINESS WIRE)--ACADIA Pharmaceut...
2,http://www.businesswire.com/news/home/20190515...,"May 15, 2019",Fastest Growing Companies/Startups in San Fran...,ACAD,"BOULDER, Colo.--(BUSINESS WIRE)--Growjo annou..."
3,http://www.businesswire.com/news/home/20190507...,"May 07, 2019",ACADIA Pharmaceuticals to Present at the Bank ...,ACAD,SAN DIEGO--(BUSINESS WIRE)--ACADIA Pharmaceut...
4,http://www.businesswire.com/news/home/20190502...,"May 02, 2019","Alzheimer's Disease: Pipeline Review, Develope...",ACAD,"DUBLIN--(BUSINESS WIRE)--The ""Alzheimer's Dis..."


In [6]:
watchlist_raw.head()

Company Name,Ticker,Market Cap,Sector,Exchange
Abeona Therapeutics Inc.,ABEO,310.2,Medical,NSDQ
"ARCA biopharma, Inc.",ABIO,6.37,Medical,NSDQ
"ABIOMED, Inc.",ABMD,15897.32,Medical,NSDQ
Arbutus Biopharma Corporation,ABUS,210.26,Medical,NSDQ
ACADIA Pharmaceuticals Inc.,ACAD,2867.97,Medical,NSDQ


In [7]:
stock_prices_raw.head()

Unnamed: 0,OMER,VRAY,ICPT,AUTL,AXGN,PDLI,HZNP,ASMB,SGRY,ANAB,...,PBYI,WVE,ENDP,VREX,RDNT,HSKA,RETA,XON,TCMD,ARNA
2009-10-08,8.73,,,,3.58,3.202,,,,,...,,,23.95,,2.9,4.2749,,,,41.6
2009-10-09,8.46,,,,3.62,3.2833,,,,,...,,,23.82,,3.06,4.5664,,,,42.1
2009-10-12,8.41,,,,3.53,3.291,,,,,...,,,23.97,,3.24,4.2749,,,,40.8
2009-10-13,7.47,,,,3.62,3.3569,,,,,...,,,23.83,,3.18,4.2701,,,,41.3
2009-10-14,7.44,,,,3.79,3.4343,,,,,...,,,23.88,,3.2,4.1778,,,,41.5


## Data Cleaning
<a id="3"></a>

### Business Wire Articles

From previous explorations of this data set, I know of a few issues with the Business Wire article data:

1. A few rows contain NaN values; these samples need to be removed.
2. Some articles are not in English; these samples need to be removed as well.
3. To make searching easier, I will convert the text to lower case.
4. For this exploration, the "link" feature is not needed and can be dropped.
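
`remove_non_english_articles` is imported from `src/features` and its implementation is not shown in this notebook. Purely as a rough stand-in for step 2, a language check could be approximated with a stopword heuristic. The names `ENGLISH_STOPWORDS`, `english_stopword_ratio`, and `looks_english`, as well as the 0.15 threshold, are assumptions made for illustration and are not the project's actual code.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real filter would use a
# proper language detector or a much larger vocabulary.
ENGLISH_STOPWORDS = frozenset(
    {"the", "and", "of", "to", "in", "for", "is", "that", "with", "on"}
)


def english_stopword_ratio(text):
    """Return the share of ASCII word tokens that are common English stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # No ASCII word tokens at all (e.g. non-Latin scripts)
        return 0.0
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens)


def looks_english(text, threshold=0.15):
    """Heuristic: treat text as English if enough tokens are stopwords."""
    return english_stopword_ratio(text) >= threshold


english_sample = "the company announced that results of the study will be presented to investors in new york"
german_sample = "das unternehmen gab heute bekannt, dass die ergebnisse der studie vorgestellt werden"

print(looks_english(english_sample))
print(looks_english(german_sample))
```

On a data frame, a check like this could be applied row by row, for example `clinical_trial_df[clinical_trial_df.article.apply(looks_english)]`. Short articles and text heavy with names or numbers can easily be misclassified, so the threshold would need tuning against real samples.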

### Watchlist

While scraping the Business Wire articles, I only covered companies within a certain market-capitalization range. To keep only the companies that were actually scraped (without errors), I will take the unique tickers from the article data and use them to filter the Watchlist. I will also rename the columns to match the other data frames.

In [11]:
# Business Wire Articles

# 0: Create a copy of the data
clinical_trial_df = business_wire_raw.copy()
print("Original size: ", clinical_trial_df.shape)

# 1: Remove NaN
clinical_trial_df.dropna(inplace=True)
print("Size after removing NaN: ", clinical_trial_df.shape)

# 2: Remove non-English articles
clinical_trial_df = remove_non_english_articles(clinical_trial_df)
print("Size after removing non-English articles: ", clinical_trial_df.shape)

# 3: Set all strings to lower case in "title" and "article" columns
clinical_trial_df["article"] = clinical_trial_df["article"].str.lower()
clinical_trial_df["title"] = clinical_trial_df["title"].str.lower()

# 4: Drop "link" column
clinical_trial_df.drop("link", inplace=True, axis=1)

clinical_trial_df.head()

Original size:  (8806, 5)
Size after removing NaN:  (8802, 5)
Size after removing non-English articles:  (8435, 5)


Unnamed: 0,time,title,ticker,article
0,"June 04, 2019",acadia pharmaceuticals to present at the goldm...,ACAD,san diego--(business wire)--acadia pharmaceut...
1,"May 18, 2019",acadia pharmaceuticals to present phase 2 clar...,ACAD,san diego--(business wire)--acadia pharmaceut...
2,"May 15, 2019",fastest growing companies/startups in san fran...,ACAD,"boulder, colo.--(business wire)--growjo annou..."
3,"May 07, 2019",acadia pharmaceuticals to present at the bank ...,ACAD,san diego--(business wire)--acadia pharmaceut...
4,"May 02, 2019","alzheimer's disease: pipeline review, develope...",ACAD,"dublin--(business wire)--the ""alzheimer's dis..."


In [31]:
# Watchlist

# 0: Create a copy of the data
watchlist_df = watchlist_raw.copy()
print("Original size: ", watchlist_df.shape)

# 1: Get a list of the unique companies that have scraped article data
unique_companies = clinical_trial_df.ticker.unique()

# 2: Keep only the companies from the list
watchlist_df = watchlist_df.loc[watchlist_df.Ticker.isin(unique_companies)]
print("Final size: ", watchlist_df.shape)

watchlist_df.columns = ["ticker", "marketcap", "sector", "exchange"]

watchlist_df.head()

Original size:  (721, 4)
Final size:  (197, 4)


Company Name,ticker,marketcap,sector,exchange
ACADIA Pharmaceuticals Inc.,ACAD,2867.97,Medical,NSDQ
"Acadia Healthcare Company, Inc.",ACHC,2444.69,Medical,NSDQ
"Acorda Therapeutics, Inc.",ACOR,614.92,Medical,NSDQ
Addus HomeCare Corporation,ADUS,908.35,Medical,NSDQ
"Aerie Pharmaceuticals, Inc.",AERI,1819.98,Medical,NSDQ


Wow, that is a lot of companies that were dropped: 524 of 721. I may have to go back and check the data to see whether this makes sense. Recall, though, that I only scraped companies within a selected market-cap range.
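
One way to check whether the drop makes sense is to compare the two ticker sets directly. The sketch below is a minimal illustration; `ticker_coverage` is a hypothetical helper, and the input lists are toy data (the first five Watchlist tickers shown above, plus a made-up ticker `ZZZZ`) rather than the full data set.

```python
def ticker_coverage(watchlist_tickers, scraped_tickers):
    """Split tickers into kept, dropped, and unmatched groups.

    kept:      on the watchlist and present in the scraped articles
    dropped:   on the watchlist but with no scraped articles
    unmatched: scraped but missing from the watchlist
    """
    watchlist = set(watchlist_tickers)
    scraped = set(scraped_tickers)
    return {
        "kept": sorted(watchlist & scraped),
        "dropped": sorted(watchlist - scraped),
        "unmatched": sorted(scraped - watchlist),
    }


# Toy inputs; "ZZZZ" is a made-up ticker used only to show the unmatched case
report = ticker_coverage(
    ["ABEO", "ABIO", "ABMD", "ABUS", "ACAD"],
    ["ACAD", "ZZZZ"],
)
for group, tickers in report.items():
    print(group, len(tickers), tickers)
```

With the real data, the inputs would be `watchlist_raw.Ticker` and `clinical_trial_df.ticker.unique()`. Comparing the market caps of the dropped group against the scraping bounds would then show whether the drop is explained by the market-cap filter alone.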

In [15]:
# Will also make a copy of the prices here
prices_df = stock_prices_raw.copy()

## Filter for Keywords
<a id="4"></a>

### "Clinical" in the Article

In [76]:
df_filtered_for_clinical = clinical_trial_df.loc[clinical_trial_df.article.str.contains("clinical")]
print("Filtered size: ", df_filtered_for_clinical.shape)

Filtered size:  (4519, 4)


That leaves about half of the cleaned articles (4,519 of 8,435). Let's take a quick look at some of them.
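
Note that `str.contains("clinical")` is a plain substring match here, so words such as "preclinical" also count as hits. The sketch below illustrates the difference between substring and whole-word matching and reproduces the share of articles kept. `mentions` is a hypothetical helper introduced only for this illustration.

```python
import re


def mentions(text, keyword, whole_word=False):
    """Return True if keyword occurs in text.

    With whole_word=True the keyword must sit between word boundaries,
    so "preclinical" no longer matches "clinical" (but "pre-clinical" does,
    because the hyphen is a word boundary).
    """
    if whole_word:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        return re.search(pattern, text) is not None
    return keyword in text


print(mentions("preclinical models", "clinical"))
print(mentions("preclinical models", "clinical", whole_word=True))
print(mentions("pre-clinical characterizations", "clinical", whole_word=True))

# Share of cleaned articles that mention "clinical" (sizes printed above)
share_kept = 4519 / 8435
print(round(share_kept, 3))
```

Whole-word matching alone would not separate trial results from other announcements, but it shows how loose the current filter is.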

In [82]:
# Create a function that will help display the article with its metadata
def display_text(article_row):
    watchlist_row = watchlist_df.loc[watchlist_df.ticker == article_row.ticker]
    
    line_1 = "{} - {} - {}".format(watchlist_row.index.values[0], article_row.ticker, article_row.time)
    line_2 = article_row.title
    line_3 = article_row.article
    line_4 = "-" * 30
    line_5 = "\n"
    
    return "\n\n".join([line_1, line_2, line_3, line_4, line_5])

In [78]:
sample_df = df_filtered_for_clinical.sample(20)

In [83]:
for _, row in sample_df.iterrows():
    print(display_text(row))

MacroGenics, Inc. - MGNX - February 28, 2019

data from incyte’s cancer research portfolio to be featured in seven abstracts at the aacr annual meeting 2019

 wilmington, del.--(business wire)--incyte corporation (nasdaq:incy) announces that seven abstracts showcasing data from its cancer research portfolio will be presented at the upcoming american association for cancer research (aacr) annual meeting 2019. the meeting will be held march 29 – april 3, 2019, at the georgia world congress center in atlanta, georgia. “in particular, we are pleased to present, for the first time at a major medical meeting, early data on our oral pd-l1 inhibitor program—whose lead candidate, incb86550, recently entered clinical trials.” accepted abstracts feature data from clinical studies involving incyte’s anti-pd-1 monoclonal antibody, incmga00121, in patients with advanced solid tumors, as well as pre-clinical characterizations of the company’s oral, small molecule pd-l1 inhibitor program and its pd-l1

Not very fruitful.

Scanning through the samples, though, searching the titles for "phase" might work better.

### "Phase" in the Title

In [88]:
df_filtered_for_trial = clinical_trial_df.loc[clinical_trial_df.title.str.contains("phase")]
print("Filtered size: ", df_filtered_for_trial.shape)

Filtered size:  (226, 4)
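
A plain search for "phase" would also match titles that use the word in a non-trial sense. As a possible refinement, the sketch below pulls out the phase number only when "phase" is followed by a digit or Roman numeral. `PHASE_PATTERN`, `ROMAN_TO_INT`, and `extract_phase` are hypothetical names, and the pattern is an assumption about how titles are written, not something verified against the full data set.

```python
import re

# "phase" followed by 1-4 or i/ii/iii/iv, with an optional a/b suffix (e.g. "1b")
PHASE_PATTERN = re.compile(r"\bphase\s+([1-4]|i{1,3}|iv)([ab])?\b")
ROMAN_TO_INT = {"i": 1, "ii": 2, "iii": 3, "iv": 4}


def extract_phase(title):
    """Return the trial phase (1-4) named in a title, or None if there is none."""
    match = PHASE_PATTERN.search(title.lower())
    if match is None:
        return None
    token = match.group(1)
    return ROMAN_TO_INT[token] if token in ROMAN_TO_INT else int(token)


print(extract_phase("pfizer initiates pivotal phase 3 program for investigational hemophilia b gene therapy"))
print(extract_phase("company enters a new phase of growth"))
```

Applied to the data frame, this could look like `clinical_trial_df.title.apply(extract_phase)`, with rows that return `None` filtered out. Combined phases such as "phase 1/2" are reported by their first number only.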


In [89]:
sample_df = df_filtered_for_trial.sample(20)

for _, row in sample_df.iterrows():
    print(display_text(row))

Spark Therapeutics, Inc. - ONCE - July 16, 2018

pfizer initiates pivotal phase 3 program for investigational hemophilia b gene therapy

 new york & philadelphia--(business wire)--pfizer inc. (nyse:pfe) and spark therapeutics (nasdaq:once) announced today that pfizer initiated a phase 3 open-label, multi-center, lead-in study (nct03587116) to evaluate the efficacy and safety of current factor ix prophylaxis replacement therapy in the usual care setting. the factor ix prophylaxis efficacy data obtained in the lead-in study will serve as the within-subject control group for those patients that enroll into the next part of the phase 3 study, which will evaluate the investigational gene therapy fidanacogene elaparvovec for the treatment of hemophilia b. the interventional portion of this pivotal phase 3 study will enroll patients who have completed at least six months in the lead-in study. fidanacogene elaparvovec is the official united states adopted name (usan) and will become the recomm