# Overview
The following is an example to get you familiar with the data, as well as some basic formatting.  It can be run using Google Colab.

There are definitely more efficient ways to achieve the following, particularly ways to load all the data at once.

Start by exploring the data and thinking about the observational and predictive analyses we discussed, as well as any other analyses you think would be interesting.  Jot down any questions or ideas you have so we can discuss them in our first meeting.

# Brief overview of the data sources

Price Data - daily data available back to 2018, plus monthly data going back to 2014.  The link to Drug Specifics is the NDC code.

Drug Specifics - additional detail about the drugs, including therapeutic class, manufacturer, volume, etc.  The link to Price Data is the NDC code.

News Items - I've pulled two broad searches from AlphaSense to start: 1) drug price mentions and 2) drug shortage mentions.  Note that we have the flexibility to run more refined searches as needed.  I also believe you can view articles from the Excel file by clicking the link in the "Read in AlphaSense" column.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import zipfile as z
from datetime import date
from datetime import datetime

In [3]:
## Merge zip files
zips = [str(i)+'_prices.zip' for i in range(2017,2021)]

# Open the first zip file in append mode, then read every
# subsequent zip file and append its members to the first one.
# Note: this modifies zips[0] ('2017_prices.zip') in place.
with z.ZipFile(zips[0], 'a') as z1:
    for fname in zips[1:]:
        # Use a context manager so each source archive is closed after reading.
        with z.ZipFile(fname, 'r') as zf:
            for n in zf.namelist():
                z1.writestr(n, zf.read(n))
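
As an alternative to appending everything into the first archive, the yearly archives can be read one after another without modifying any of them. This is only a minimal sketch: the helper name `iter_zip_members` is hypothetical, and the demo builds two tiny throwaway archives in a temporary folder so that it runs without the real price files.

```python
import os
import tempfile
import zipfile


def iter_zip_members(zip_paths):
    """Yield (archive_path, member_name, raw_bytes) for every file in every archive."""
    for path in zip_paths:
        with zipfile.ZipFile(path, 'r') as archive:
            for name in archive.namelist():
                yield path, name, archive.read(name)


# Demo with two throwaway archives (hypothetical file names and contents).
tmp_dir = tempfile.mkdtemp()
demo_zips = []
for year in (2017, 2018):
    zip_path = os.path.join(tmp_dir, f'{year}_prices.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr(f'demo_{year}.csv', 'WAC\n1.0\n')
    demo_zips.append(zip_path)

members = [name for _, name, _ in iter_zip_members(demo_zips)]
print(members)
```

Each yielded `raw_bytes` value could be wrapped in `io.BytesIO` and passed to `pd.read_csv`, which would keep the original archives untouched.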

In [2]:
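iqvia = 'IQVIA Additional Drug Detail v72020.xlsx'
# Assumes the merged archive from the previous cell has been saved as
# '2017_2020_prices.zip' (the merge itself writes into '2017_prices.zip').
pricingData = z.ZipFile('2017_2020_prices.zip')
df_IQVIA_Data = pd.read_excel(iqvia)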

In [5]:
# It's important to format the drug NDCs as 11-character strings, padding with
# leading 0s where needed.  Leading 0s are often dropped when files are loaded
# from CSVs, because the codes get parsed as numbers.

# Keep only the code itself if any text follows it after a space.
df_IQVIA_Data['NDC'] = df_IQVIA_Data.NDC.str.split(' ').str[0]
# Pad with 0s on the left until the length is 11, and assign the result back.
df_IQVIA_Data['NDC'] = df_IQVIA_Data.NDC.astype(str).str.zfill(11)
df_IQVIA_Data

Unnamed: 0,NDC,Product,Product Launch Date,Estimated LOE Date,ATC4,Major Class,Acute/Chronic,Prod Form,Pack Size,Pack Quantity,...,Oct 2019\nTRx,Nov 2019\nTRx,Dec 2019\nTRx,Jan 2020\nTRx,Feb 2020\nTRx,Mar 2020\nTRx,Apr 2020\nTRx,May 2020\nTRx,Jun 2020\nTRx,Jul 2020\nTRx
0,00002060440,SEROMYCIN 11/1977 TCC,1977-11-01 00:00:00,Unspecified,"J04A1 ANTI-TB, SINGLE INGRED",ANTITUBERCULARS,CHRONIC,ORALS,1,40.0,...,,,,,,,,,,
1,00002105202,DIETHYLSTILBESTROL 07/1975 LLY,1975-07-01 00:00:00,Unspecified,"G03C0 OESTROG EX G3A,G3E,G3F","SEX HORMONES (ANDROGENS, OESTROGENS, PROGESTOG...",CHRONIC,ORALS,1,100.0,...,,,,,,,,,,
2,00002143301,TRULICITY 10/2014 LLY,2014-10-01 00:00:00,09/2026,A10S0 GLP-1 AGONIST ANTIDIAB,ANTIDIABETICS,CHRONIC,INJECTABLES,1,0.5,...,585.0,577.0,613.0,671.0,695.0,762.0,721.0,708.0,821.0,752.0
3,00002143380,TRULICITY 10/2014 LLY,2014-10-01 00:00:00,09/2026,A10S0 GLP-1 AGONIST ANTIDIAB,ANTIDIABETICS,CHRONIC,INJECTABLES,4,0.5,...,215492.0,205008.0,219208.0,224008.0,214856.0,249341.0,232536.0,229835.0,238035.0,242731.0
4,00002143401,TRULICITY 10/2014 LLY,2014-10-01 00:00:00,09/2026,A10S0 GLP-1 AGONIST ANTIDIAB,ANTIDIABETICS,CHRONIC,INJECTABLES,1,0.5,...,578.0,576.0,669.0,623.0,611.0,629.0,592.0,684.0,790.0,843.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12664,99207051114,LIDEX 11/1971 B.U,1971-11-01 00:00:00,Unspecified,D07A0 TOP.CORTICOSTEROID PLAIN,DERMATOLOGICS,ACUTE,DERMATOLOGICALS,1,30.0,...,,,,,,,,,,
12665,99207051117,LIDEX 11/1971 B.U,1971-11-01 00:00:00,Unspecified,D07A0 TOP.CORTICOSTEROID PLAIN,DERMATOLOGICS,ACUTE,DERMATOLOGICALS,1,60.0,...,,,,,,,,,,
12666,99207051413,LIDEX 11/1971 B.U,1971-11-01 00:00:00,Unspecified,D07A0 TOP.CORTICOSTEROID PLAIN,DERMATOLOGICS,ACUTE,DERMATOLOGICALS,1,15.0,...,,,,,,,,,,
12667,99207051746,LIDEX 11/1971 B.U,1971-11-01 00:00:00,Unspecified,D07A0 TOP.CORTICOSTEROID PLAIN,DERMATOLOGICS,ACUTE,DERMATOLOGICALS,1,60.0,...,,,,,,,,,,
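
The NDC clean-up above can be wrapped in a small reusable function so the same rule is applied to both the drug detail and the pricing tables. This is a sketch under assumptions: the name `normalize_ndc` and the sample values are made up for illustration.

```python
import pandas as pd


def normalize_ndc(series, width=11):
    """Return NDCs as zero-padded strings of length `width`."""
    return (
        series.astype(str)
        .str.split(' ').str[0]   # keep only the code if text follows it
        .str.zfill(width)        # restore dropped leading zeros
    )


# Hypothetical values: an int that lost its zeros, a code with trailing text,
# and a code that is already well formed.
demo = pd.Series([2143301, '00002143380 extra', '99207051114'])
print(normalize_ndc(demo).tolist())
```

Applying one function to both join keys reduces the chance that the two sides of the merge end up formatted differently.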


In [6]:
'''
The following loops over all CSVs in the zip file and assigns a Date column
taken from the date portion of each file name.
'''

price_dates = []

for f in pricingData.namelist():
    df = pd.read_csv(pricingData.open(f))
    # Since we are only focusing on Brand, filter out the generics.
    # .copy() avoids a SettingWithCopyWarning when the Date column is added.
    df = df.loc[df['Brand/Generic'] == 'Brand'].copy()
    # The date is the third underscore-separated token of the file name;
    # parse it and assign it to a Date column.
    df['Date'] = datetime.strptime(f.split('_')[2],'%Y%m%d')
    df = df[['Drug Identifier','Drug Group','Brand/Generic','Manufacturer','WAC','Date']]
    # Restore leading 0s so the identifier matches the 11-character NDC.
    df['Drug Identifier'] = df['Drug Identifier'].astype(str).str.zfill(11)
    price_dates.append(df)

df_Pricing_Data = pd.concat(price_dates,ignore_index=True)
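
Pulling the date out of the file name by token position works as long as every file follows the same naming pattern. A slightly more defensive sketch is shown here; the function name `date_from_filename` and the example file names are hypothetical, since the real member names are not shown in this notebook.

```python
import re
from datetime import datetime


def date_from_filename(name):
    """Return the first standalone 8-digit YYYYMMDD run in `name` as a datetime, or None."""
    match = re.search(r'(?<!\d)(\d{8})(?!\d)', name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%Y%m%d')


# Hypothetical file names; the real archive members may be named differently.
print(date_from_filename('wac_prices_20200115_daily.csv'))
print(date_from_filename('notes.txt'))
```

Returning `None` for names without a date makes it possible to skip unexpected files in the archive instead of stopping the whole loop.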

In [8]:
'''IQVIA data is the left table in an inner merge, so the result contains only drugs with complete data (price + detail).
For initial analyses, it may be worth looking at just the Pricing Data to get a larger set of data
to base our drug "universe" on.  There are a few options.
'''

merge_ = pd.merge(df_IQVIA_Data,df_Pricing_Data,left_on='NDC',right_on='Drug Identifier',how='inner')
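
Before settling on a merge type, it can help to see how many drugs appear in only one of the two tables. The sketch below uses toy stand-ins for the detail and pricing tables (all values are made up) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column naming the source of each row.

```python
import pandas as pd

# Toy stand-ins for the detail and pricing tables (hypothetical values).
detail = pd.DataFrame({'NDC': ['00000000001', '00000000002'],
                       'Product': ['A', 'B']})
prices = pd.DataFrame({'Drug Identifier': ['00000000002', '00000000003'],
                       'WAC': [10.0, 20.0]})

# An outer merge with indicator=True labels where each row came from:
# 'left_only' (detail only), 'right_only' (price only), or 'both'.
coverage = pd.merge(detail, prices, left_on='NDC',
                    right_on='Drug Identifier', how='outer', indicator=True)
counts = coverage['_merge'].value_counts().to_dict()
print(counts)
```

Rows labelled `both` are the ones an inner merge keeps; the `right_only` rows are priced drugs that would be part of a larger, pricing-only universe.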

In [3]:
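# Read the identifier columns as strings so leading 0s are not dropped on reload.
merge_ = pd.read_csv('merge_with_iqvia_2017_10_2020_09.csv', dtype={'NDC': str, 'Drug Identifier': str})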

In [4]:
columns = ['Aug 2014\nTRx', 'Sep 2014\nTRx', 'Oct 2014\nTRx',
       'Nov 2014\nTRx', 'Dec 2014\nTRx', 'Jan 2015\nTRx', 'Feb 2015\nTRx',
       'Mar 2015\nTRx', 'Apr 2015\nTRx', 'May 2015\nTRx', 'Jun 2015\nTRx',
       'Jul 2015\nTRx', 'Aug 2015\nTRx', 'Sep 2015\nTRx', 'Oct 2015\nTRx',
       'Nov 2015\nTRx', 'Dec 2015\nTRx', 'Jan 2016\nTRx', 'Feb 2016\nTRx',
       'Mar 2016\nTRx', 'Apr 2016\nTRx', 'May 2016\nTRx', 'Jun 2016\nTRx',
       'Jul 2016\nTRx', 'Aug 2016\nTRx', 'Sep 2016\nTRx', 'Oct 2016\nTRx',
       'Nov 2016\nTRx', 'Dec 2016\nTRx', 'Jan 2017\nTRx', 'Feb 2017\nTRx',
       'Mar 2017\nTRx', 'Apr 2017\nTRx', 'May 2017\nTRx', 'Jun 2017\nTRx',
       'Jul 2017\nTRx', 'Aug 2017\nTRx', 'Sep 2017\nTRx', 'Oct 2017\nTRx',
       'Nov 2017\nTRx', 'Dec 2017\nTRx', 'Jan 2018\nTRx', 'Feb 2018\nTRx',
       'Mar 2018\nTRx', 'Apr 2018\nTRx', 'May 2018\nTRx', 'Jun 2018\nTRx',
       'Jul 2018\nTRx', 'Aug 2018\nTRx', 'Sep 2018\nTRx', 'Oct 2018\nTRx',
       'Nov 2018\nTRx', 'Dec 2018\nTRx', 'Jan 2019\nTRx', 'Feb 2019\nTRx',
       'Mar 2019\nTRx', 'Apr 2019\nTRx', 'May 2019\nTRx', 'Jun 2019\nTRx',
       'Jul 2019\nTRx', 'Aug 2019\nTRx', 'Sep 2019\nTRx', 'Oct 2019\nTRx',
       'Nov 2019\nTRx', 'Dec 2019\nTRx', 'Jan 2020\nTRx', 'Feb 2020\nTRx',
       'Mar 2020\nTRx', 'Apr 2020\nTRx', 'May 2020\nTRx', 'Jun 2020\nTRx',
       'Jul 2020\nTRx']

In [5]:
merge_.Date = pd.to_datetime(merge_.Date)

In [None]:
merge_['Year'] = merge_.Date.dt.year

In [None]:
merge_['Month'] = merge_.Date.dt.month

In [None]:
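def fun(row):
    # Build the TRx column name for the row's month, e.g. 'Jan 2020\nTRx',
    # and return that month's value (NaN if there is no column for that month).
    str_ = dict_[row['Month']]+' '+str(row['Year'])+'\nTRx'
    return row.get(str_, float('nan'))

In [None]:
# Map month numbers to the abbreviations used in the TRx column names.
dict_={1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
merge_['TRx'] = merge_.apply(fun, axis=1)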
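
The row-wise `apply` above calls a Python function once per row, which can be slow on a large daily panel. One possible alternative is to reshape the monthly TRx columns to long form and join on the column label. This is a sketch on toy data, not the notebook's method; the frames `wide` and `daily` and their values are made up.

```python
import pandas as pd

# Toy wide table: one row per NDC, one column per month (hypothetical values).
wide = pd.DataFrame({
    'NDC': ['00000000001', '00000000002'],
    'Jan 2020\nTRx': [100.0, 300.0],
    'Feb 2020\nTRx': [110.0, 310.0],
})
# Toy daily prices; the last row falls in a month with no TRx column.
daily = pd.DataFrame({
    'NDC': ['00000000001', '00000000001', '00000000002'],
    'Date': pd.to_datetime(['2020-01-15', '2020-02-03', '2020-03-01']),
    'WAC': [5.0, 5.5, 9.0],
})

# Reshape to long form: one row per (NDC, month label).
long = wide.melt(id_vars='NDC', var_name='label', value_name='TRx')

# Build the same label for every daily row in one vectorized step.
month_names = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
               7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
daily['label'] = (daily['Date'].dt.month.map(month_names) + ' '
                  + daily['Date'].dt.year.astype(str) + '\nTRx')

# A left join keeps every daily row; months without TRx data become NaN.
result = daily.merge(long, on=['NDC', 'label'], how='left')
print(result[['NDC', 'Date', 'WAC', 'TRx']])
```

With this layout the wide monthly columns never need to be repeated on every daily price row, which may also ease memory pressure.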

In [14]:
# Drop the wide monthly TRx columns now that the TRx column holds the matching month's value.
merge_ = merge_.loc[:, ~merge_.columns.isin(columns)]

In [10]:
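# index=False avoids writing the row index, which would come back as an 'Unnamed: 0' column on reload.
merge_.to_csv('merge_with_iqvia_2017_10_2020_09.csv', index=False)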

In [12]:
merge_

MemoryError: 
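
The `MemoryError` above suggests the merged table is close to the limit of available memory. One common way to shrink a frame with heavily repeated text columns is the pandas `category` dtype. The sketch below uses a toy frame with made-up values to compare memory use before and after the conversion; how much it helps on the real data is not known.

```python
import pandas as pd

# Toy frame with heavily repeated strings, as in a daily price panel (hypothetical values).
frame = pd.DataFrame({
    'Manufacturer': ['LLY', 'B.U'] * 50_000,
    'WAC': [1.0, 2.0] * 50_000,
})
before = frame.memory_usage(deep=True).sum()

# With a categorical column, each distinct label is stored once
# and the rows hold small integer codes instead of full strings.
frame['Manufacturer'] = frame['Manufacturer'].astype('category')
after = frame.memory_usage(deep=True).sum()
print(before, after)
```

Columns such as `Drug Group`, `Manufacturer`, and `Major Class` repeat across many rows and may be reasonable candidates for the same conversion.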
