# Cleaning messy PDF data with pandas and Jupyter notebooks
## Part 3 - DEA ARCOS Report 5: Statistical Summary for Retail Drug Purchases

### Background

#### What is ARCOS?
The DEA publishes data annually from its Automation of Reports and Consolidated Orders System, or ARCOS. According to the DEA's website, ARCOS "monitors the flow of DEA controlled substances from their point of manufacture through commercial distribution channels to point of sale or distribution at the dispensing/retail level - hospitals, retail pharmacies, practitioners, mid-level practitioners, and teaching institutions....these transactions...are then summarized into reports which give investigators in Federal and state government agencies information which can then be used to identify the diversion of controlled substances into illicit channels of distribution. The information on drug distribution is used throughout the United States (U.S.) by U.S. Attorneys and DEA investigators to strengthen criminal cases in the courts."

So, ARCOS exists to help the government identify patterns in the manufacture and distribution of controlled substances that might indicate that these substances are being sold illegally. Annual ARCOS reports are publicly available on the DEA's website, dating back to the year 2000, but unfortunately they are only available in PDF form and are dozens or even hundreds of pages long.

#### What's in this notebook?
I was interested in doing some data analysis and visualization on the distribution of oxycodone, an opioid painkiller that is one of the main drivers of the current prescription pain pill (and arguably heroin) addiction epidemic in the United States.

Aside from a wealth of fascinating (and sometimes disturbing, sad, and frightening) data to explore, the ARCOS data also presents a great data cleansing challenge, given that it is distributed in PDFs - the perfect opportunity to practice your pandas skills, for example. Luckily, the files tend to have nearly identical formatting, aside from a shift in report formatting in 2006 and a few anomalies here and there.

This notebook is also meant to show the functionality of pandas and Jupyter notebooks for data cleaning - working with this data was a great project for me to improve my pandas skills and I'm sharing the code here so others can learn and practice. 

This is part 3 - analysis of Report 5. 

In [1]:
# load libraries, the drug code dict, and list of territories
import pandas as pd
import numpy as np
import pickle

drug_codes = pickle.load(open("drug_codes.pickle", "rb"))
geos = pickle.load(open("geographies.pickle", "rb"))

# make a list of all possible parts of state names for later validation
l = [x.split(" ") for x in geos]
flat_geos = [i for sublist in l for i in sublist]
flat_geos = set(flat_geos)

# activity codes dictionary & flat list
activity_codes = {'A': 'PHARMACIES', 
                  'B': 'HOSPITALS', 
                  'C': 'PRACTITIONERS', 
                  'D': 'TEACHING INSTITUTIONS',
                  'M': 'MID-LEVEL PRACTITIONERS', 
                  'N-U': 'NARCOTIC TREATMENT PROGRAMS'}
l = [x.split(" ") for x in activity_codes.values()]
flat_activity_codes = [i for sublist in l for i in sublist]
flat_activity_codes = set(flat_activity_codes)

del l

### Notes on the data 

Get the raw data (in PDF....!)
You can find the ARCOS reports here: https://www.deadiversion.usdoj.gov/arcos/retail_drug_summary/index.html

There are six ARCOS reports published each year and I chose to work with three of them in particular:
* Report 1:  Retail Drug Distribution by Zip Code for Each State - total drug amounts (in grams) distributed to retail registrants in each state, by 'gateway' zip code (the first three numbers of the zip), on a quarterly basis
* Report 3: Quarterly Distribution in Grams per 100K Population - quarterly drug consumption in grams per 100,000 population, by state
* Report 5: Statistical Summary for Retail Drug Purchases - average annual purchases by drug by business activity (pharmacy, hospital, etc.)


A few notes: 

* For years before 2006, the reports are lumped together into one giant PDF (700+ pages long). In more recent years the DEA has published a separate PDF for each report.

* I tried several approaches for simply getting the text out of the PDF - for a variety of reasons (in particular the unwieldy nature of the pre-2006 PDFs), it was easiest and quickest to just copy-paste the entire contents of the PDF into a text file. This was an OK solution for me since there aren't that many of them - if you were doing this with hundreds of files you would want to find another way. Another problem I ran into right away was the length of the title running onto multiple lines in the txt file and causing a lot of formatting challenges in a dataframe, so I manually adjusted the title text in each txt file. 

* For the pre-2006 reports, I (manually and carefully) removed the report content I wasn't interested in from the text file, and then used pandas to clean what remained. 

In this notebook we will be working with Report 5, the distribution by retail activity. 

#### Step 1 - Getting from PDF to pandas in the notebook


What to consider and experiment with:
* How will you pull the data out of the PDF? How much of the formatting (columns, headers, etc) will you be able to preserve?
* What delimiter works best?
* If the number of PDF files is small, are there any steps you can perform right in the txt or spreadsheet file that will make things easier?

There are different options for getting data from a PDF into a format you can interact with more directly. I ended up just copy-pasting the full contents of each file, since the PDF-to-spreadsheet converters and similar tools I looked at didn't seem likely to save me much time.

I tried several text editors and spreadsheet applications, looking for something that would do a relatively good job delimiting the data based on the PDF files. Sublime is one of my favorites and that's what I used in the end. 

Tips
* Try a couple different editors and delimit options, and read each one into pandas to see how the structure of the data looks. Choose one that will minimize the amount of cleaning you need to do
* Keep your .txt file open as you begin cleaning in pandas
* Never save over your raw .txt file! This is a trial-and-error process and you will likely end up losing some data at one point or another. If you've saved over the starting point you will have to go back to your PDF...
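
As a rough illustration of the first tip, the sketch below reads a few lines in the style of the 2000 report from an in-memory string instead of a file. The sample text and the letter column names are made up for the example. Passing `names` together with `header=None` is an assumption on my part, not what this notebook does: it is one way to let pandas pad short rows with NaN rather than relying on the title line to set the number of columns.

```python
import io
import pandas as pd

# A few made-up lines in the style of the copy-pasted report text
sample = """STATE: ALASKA BUSINESS ACTIVITY: PHARMACIES
DL-AMPHETAMINE BASE 1100B 75 2359.66 31.46
OXYCODONE 9143 82 65876.54 803.37
"""

# Split on runs of whitespace; rows with fewer tokens are padded with NaN
preview = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    header=None,
    names=list("ABCDEF"),
    dtype=str,  # keep everything as text until the layout is understood
)

print(preview)
```

Looking at a preview like this makes it easy to see that the drug code lands in column C for a two-word drug name and in column B for a one-word name.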

The Report 5 files were by far the nastiest in terms of formatting, and I ended up with many versions of the cleaning function. When the function failed to process a file correctly, I ran it line by line, found and fixed the issue, and saved the result as a separate new function. You could take the extra step of refactoring them all into one version.

In [20]:
# sep=r'\s+' splits on runs of whitespace (equivalent to the deprecated delim_whitespace=True)
activity_2000 = pd.read_csv('activity_2000.txt', sep=r'\s+')
activity_2000.head(10)

Unnamed: 0,ARCOS,2,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,
1,STATE:,ALASKA,BUSINESS,ACTIVITY:,PHARMACIES,,,,,,
2,NUMBER,OF,TOTAL,GRAMS,AVERAGE,,,,,,
3,DRUG,REGISTRANTS,SOLD,TO,PURCHASE,PER,,,,,
4,DRUG,NAME,CODE,SOLD,TO,REGISTRANTS,REGISTRANT,,,,
5,----------------------------------------------...,,,,,,,,,,
6,DL-AMPHETAMINE,BASE,1100B,75,2359.66,31.46,,,,,
7,D-AMPHETAMINE,BASE,1100D,79,7791.91,98.63,,,,,
8,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,OXYCODONE,9143,82,65876.54,803.37,,,,,,


We can already see this is going to be a lot harder to deal with - just for a start, the drug names come before the codes and are spread over multiple columns, which also shifts over much of the numeric data.

#### Step 2 - rename columns & add columns for readability

First, we can make things a little easier by renaming the columns and adding in the new columns we'll need.

I chose to rename the existing columns A-F to make it easier to keep track of them while working out the routines to access and format the data they hold. 

In [21]:
activity_2000.rename(columns={'ARCOS': "A", 
                   '2':'B', 
                   '-': 'C', 
                   'REPORT': 'D', 
                   '5':'E', 
                   'RETAIL':'TOTAL', 
                   'STATISTICAL': 'F'}, 
          inplace=True)


activity_2000.insert(column='Year', loc=0, value=2000)
activity_2000.insert(column='State', loc=1, value=None)
activity_2000.insert(column='Business Activity', loc=2, value=None)
activity_2000.insert(column='Drug', loc=3, value=None)

activity_2000.head()

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
0,2000,,,,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,
1,2000,,,,STATE:,ALASKA,BUSINESS,ACTIVITY:,PHARMACIES,,,,,,
2,2000,,,,NUMBER,OF,TOTAL,GRAMS,AVERAGE,,,,,,
3,2000,,,,DRUG,REGISTRANTS,SOLD,TO,PURCHASE,PER,,,,,
4,2000,,,,DRUG,NAME,CODE,SOLD,TO,REGISTRANTS,REGISTRANT,,,,
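
If the title text differs between files, renaming by position instead of by the title words may be less fragile. The sketch below is only an illustration on a toy frame: it maps every existing column to a letter in order, which is a different naming scheme from the one used in this notebook (where only some columns become letters).

```python
import string
import pandas as pd

# Toy frame whose column names came from a title line (illustrative only)
df = pd.DataFrame([["STATE:", "ALASKA", "BUSINESS"]],
                  columns=["ARCOS", "2", "-"])

# Map each existing column to a letter by position: first -> A, second -> B, ...
letters = dict(zip(df.columns, string.ascii_uppercase))
df = df.rename(columns=letters)

print(df.columns.tolist())
```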


#### Step 3 - Get state names

First step is to try to get the state names, which we can see should be in column B when "STATE:" is in column A. I chose to do this by creating a rule for each case of state name length. First we can check whether there's any irregular data that would not follow the assumptions the cleaning routine is based on. 

In [12]:
# are there any values in col B following a col A = STATE:
# that are not part or whole of a state name

# any values that print below will be something that doesn't belong

for i in activity_2000['B'].loc[activity_2000['A']=='STATE:'].unique().tolist():
    if i not in flat_geos:
        print("Unexpected value {}".format(i))

Unexpected value DISTRICT
Unexpected value NORTH
Unexpected value NEW
Unexpected value PUERTO
Unexpected value RHODE
Unexpected value SOUTH
Unexpected value TRUST
Unexpected value VIRGIN
Unexpected value WEST


In [13]:
# also want to check if there are any values aside from
# "BUSINESS" or partial state names in column C
# where col A = "STATE:"

activity_2000['C'].loc[activity_2000['A']=='STATE:'].unique()

array(['BUSINESS', 'OF', 'CAROLINA', 'DAKOTA', 'HAMPSHIRE', 'JERSEY',
       'MEXICO', 'YORK', 'RICO', 'ISLAND', 'TERRITORIES', 'ISLANDS',
       'VIRGINIA'], dtype=object)

In [14]:
# lastly want to check if any values other than 
# 'ACTIVITY:', 'BUSINESS' or partial state names
# show up in column D when col A = STATE:

activity_2000['D'].loc[activity_2000['A']=='STATE:'].unique()

array(['ACTIVITY:', 'COLUMBIA', 'BUSINESS', '(GUAM)'], dtype=object)

In [22]:
# if "STATE:" is present in column A, 
# we know at least the first part of the state name will be in column B
activity_2000.loc[activity_2000['A']=='STATE:', 'State'] = activity_2000['B']

# rule for dealing with two-word names
activity_2000.loc[(activity_2000['A']=="STATE:") 
                  & (activity_2000["C"]!='BUSINESS'), 'State'] = activity_2000["B"]+" "+activity_2000['C']

# rule for dealing with three-word names
activity_2000.loc[(activity_2000['A']=="STATE:") 
                  & (activity_2000["C"]!='BUSINESS') 
                  & (activity_2000["D"]!='ACTIVITY:') 
                  & (activity_2000["D"]!='BUSINESS'), 'State'] = activity_2000["B"]+" "+activity_2000['C']+" "+activity_2000['D']

activity_2000['State'].unique()



array([None, 'ALASKA', 'ALABAMA', 'ARKANSAS', 'ARIZONA', 'CALIFORNIA',
       'COLORADO', 'CONNECTICUT', 'DISTRICT OF COLUMBIA', 'DELAWARE',
       'FLORIDA', 'GEORGIA', 'HAWAII', 'IOWA', 'IDAHO', 'ILLINOIS',
       'INDIANA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MASSACHUSETTS',
       'MARYLAND', 'MAINE', 'MICHIGAN', 'MINNESOTA', 'MISSOURI',
       'MISSISSIPPI', 'MONTANA', 'NEBRASKA', 'NORTH CAROLINA',
       'NORTH DAKOTA', 'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO',
       'NEVADA', 'NEW YORK', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA',
       'PUERTO RICO', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA',
       'TENNESSEE', 'TRUST TERRITORIES (GUAM)', 'TEXAS', 'UTAH',
       'VIRGINIA', 'VIRGIN ISLANDS', 'VERMONT', 'WASHINGTON', 'WISCONSIN',
       'WEST VIRGINIA', 'WYOMING'], dtype=object)
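
The three rules above could also be written as a single row-wise helper. This is a sketch of an alternative, not the approach used in this notebook: the hypothetical `state_from_row` function joins the tokens in columns B to D until it reaches the 'BUSINESS' marker, and the small `demo` frame is made up for the example.

```python
import pandas as pd

def state_from_row(row, cols=("B", "C", "D")):
    """Join the tokens after 'STATE:' until the 'BUSINESS' marker appears."""
    parts = []
    for c in cols:
        token = row[c]
        if pd.isnull(token) or token == "BUSINESS":
            break
        parts.append(token)
    return " ".join(parts)

# Made-up rows covering one-, two- and three-word names plus a data row
demo = pd.DataFrame({
    "A": ["STATE:", "STATE:", "STATE:", "OXYCODONE"],
    "B": ["ALASKA", "NEW", "DISTRICT", "9143"],
    "C": ["BUSINESS", "YORK", "OF", "82"],
    "D": ["ACTIVITY:", "BUSINESS", "COLUMBIA", "65876.54"],
})

# Only header rows get a state; data rows stay empty for a later forward fill
is_header = demo["A"] == "STATE:"
demo["State"] = None
demo.loc[is_header, "State"] = demo[is_header].apply(state_from_row, axis=1)

print(demo["State"].tolist())
```

A row-wise `apply` is slower than the vectorized rules, but for files of this size the difference is unlikely to matter, and the logic does not need a new rule per name length.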

#### Step 4 - get activities

Next step is to pull out the retail activities that are included in the report. 

For later reports, there's actually a letter code (similar to the drug codes) that we can use, but in this older format we don't have that, so the rules will be very similar to how we got the state names.

Again we want to check some assumptions before running the rules. 

In [27]:
# check for unexpected values in the first column with activity names
# accounting for all state name lengths
# since activity follows the state name

#activity_2000['E'].loc[activity_2000['D']=='ACTIVITY:'].unique()
#activity_2000['F'].loc[activity_2000['E']=='ACTIVITY:'].unique()
activity_2000['SUMMARY'].loc[activity_2000['F']=='ACTIVITY:'].unique()

array(['PHARMACIES', 'HOSPITALS', 'PRACTITIONERS'], dtype=object)

In [29]:
# check for unexpected values in the second column with activity names
# account for all state name lengths

#activity_2000['F'].loc[activity_2000['D']=='ACTIVITY:'].unique()
#activity_2000['SUMMARY'].loc[activity_2000['E']=='ACTIVITY:'].unique()
activity_2000['FOR'].loc[activity_2000['F']=='ACTIVITY:'].unique()

array([nan])

In [35]:
# check for unexpected values in the last column with activity names
# note that this is to check on the only 3-word activity,
# narcotic treatment programs
# which don't show up in earlier reporting years


#activity_2000['SUMMARY'].loc[activity_2000['D']=='ACTIVITY:'].unique()
#activity_2000['FOR'].loc[activity_2000['E']=='ACTIVITY:'].unique()
#activity_2000['TOTAL'].loc[activity_2000['F']=='ACTIVITY:'].unique()

In [36]:
activity_2000[activity_2000['F']=='ACTIVITY:']

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
433,2000,DISTRICT OF COLUMBIA,,,STATE:,DISTRICT,OF,COLUMBIA,BUSINESS,ACTIVITY:,PHARMACIES,,,,
449,2000,DISTRICT OF COLUMBIA,,,STATE:,DISTRICT,OF,COLUMBIA,BUSINESS,ACTIVITY:,HOSPITALS,,,,
465,2000,DISTRICT OF COLUMBIA,,,STATE:,DISTRICT,OF,COLUMBIA,BUSINESS,ACTIVITY:,PRACTITIONERS,,,,
2662,2000,TRUST TERRITORIES (GUAM),,,STATE:,TRUST,TERRITORIES,(GUAM),BUSINESS,ACTIVITY:,PHARMACIES,,,,
2678,2000,TRUST TERRITORIES (GUAM),,,STATE:,TRUST,TERRITORIES,(GUAM),BUSINESS,ACTIVITY:,HOSPITALS,,,,
2694,2000,TRUST TERRITORIES (GUAM),,,STATE:,TRUST,TERRITORIES,(GUAM),BUSINESS,ACTIVITY:,PRACTITIONERS,,,,


In [37]:
# once we are sure everything is regular
# run the cleaning routine
    
activity_2000.loc[activity_2000['D']=='ACTIVITY:', 'Business Activity'] = activity_2000['E']
activity_2000.loc[(activity_2000['D']=='ACTIVITY:') 
                  & (pd.notnull(activity_2000['F'])), 'Business Activity'] = activity_2000['E']+" "+activity_2000['F']


activity_2000.loc[activity_2000['E']=='ACTIVITY:', 'Business Activity'] = activity_2000['F']
activity_2000.loc[(activity_2000['E']=='ACTIVITY:') 
                  & (pd.notnull(activity_2000['SUMMARY'])), 'Business Activity'] = activity_2000['F']+" "+activity_2000['SUMMARY']


activity_2000.loc[activity_2000['F']=='ACTIVITY:', 'Business Activity'] = activity_2000['SUMMARY']
activity_2000.loc[(activity_2000['F']=='ACTIVITY:')
                 & (pd.notnull(activity_2000['FOR'])), 'Business Activity'] = activity_2000['SUMMARY']+" "+activity_2000['FOR']



activity_2000.head(20)

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
0,2000,,,,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,
1,2000,ALASKA,PHARMACIES,,STATE:,ALASKA,BUSINESS,ACTIVITY:,PHARMACIES,,,,,,
2,2000,,,,NUMBER,OF,TOTAL,GRAMS,AVERAGE,,,,,,
3,2000,,,,DRUG,REGISTRANTS,SOLD,TO,PURCHASE,PER,,,,,
4,2000,,,,DRUG,NAME,CODE,SOLD,TO,REGISTRANTS,REGISTRANT,,,,
5,2000,,,,----------------------------------------------...,,,,,,,,,,
6,2000,,,,DL-AMPHETAMINE,BASE,1100B,75,2359.66,31.46,,,,,
7,2000,,,,D-AMPHETAMINE,BASE,1100D,79,7791.91,98.63,,,,,
8,2000,,,,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,,,,OXYCODONE,9143,82,65876.54,803.37,,,,,,
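
Because the activity name always follows the 'ACTIVITY:' token, another option is to locate that token in each row and take whatever comes after it. The sketch below assumes it is applied to the raw token columns only (before any derived columns are added); `activity_from_row` and the `demo` rows are hypothetical and exist only for this example.

```python
import pandas as pd

def activity_from_row(row):
    """Return the words that follow the 'ACTIVITY:' token, or None if absent."""
    tokens = [t for t in row.tolist() if pd.notnull(t)]
    if "ACTIVITY:" not in tokens:
        return None
    start = tokens.index("ACTIVITY:") + 1
    return " ".join(tokens[start:])

# Made-up header rows with one- and two-word activities, plus a data row
demo = pd.DataFrame(
    [
        ["STATE:", "ALASKA", "BUSINESS", "ACTIVITY:", "PHARMACIES", None, None],
        ["STATE:", "NEW", "YORK", "BUSINESS", "ACTIVITY:", "MID-LEVEL", "PRACTITIONERS"],
        ["OXYCODONE", "9143", "82", "65876.54", "803.37", None, None],
    ],
    columns=list("ABCDEFG"),
)

activities = demo.apply(activity_from_row, axis=1)
print(activities.tolist())
```

This handles any state name length and any activity name length in one pass, at the cost of a slower row-wise loop.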


#### Step 5 - forward fill state names and activity names

Use the pandas forward-fill method to fill in the state and activity names, so that every data row carries the state and activity from the most recent header row above it.

In [38]:
# .ffill() is the current spelling of fillna(method='ffill'), which newer pandas versions deprecate
activity_2000['State'] = activity_2000['State'].ffill()
activity_2000['Business Activity'] = activity_2000['Business Activity'].ffill()

activity_2000.head(20)

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
0,2000,,,,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,
1,2000,ALASKA,PHARMACIES,,STATE:,ALASKA,BUSINESS,ACTIVITY:,PHARMACIES,,,,,,
2,2000,ALASKA,PHARMACIES,,NUMBER,OF,TOTAL,GRAMS,AVERAGE,,,,,,
3,2000,ALASKA,PHARMACIES,,DRUG,REGISTRANTS,SOLD,TO,PURCHASE,PER,,,,,
4,2000,ALASKA,PHARMACIES,,DRUG,NAME,CODE,SOLD,TO,REGISTRANTS,REGISTRANT,,,,
5,2000,ALASKA,PHARMACIES,,----------------------------------------------...,,,,,,,,,,
6,2000,ALASKA,PHARMACIES,,DL-AMPHETAMINE,BASE,1100B,75,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,,D-AMPHETAMINE,BASE,1100D,79,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,,OXYCODONE,9143,82,65876.54,803.37,,,,,,
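
To see what forward fill does in isolation, here is a minimal sketch on a made-up Series. Each missing value takes the last non-missing value above it, and anything before the first real value stays missing (which is why row 0 in the output above still has no state).

```python
import pandas as pd

# Made-up column: values appear only on "header" rows
states = pd.Series([None, "ALASKA", None, None, "ALABAMA", None])

# Carry each value down until the next non-missing value
filled = states.ffill()

print(filled.tolist())
```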


#### Step 6 - get drug names

We will continue to use the drug codes dictionary to get drug names. The challenge here is that the code follows the drug name, so we may have to look several columns over to find the code. 

Regex will be useful here to check where drug codes are appearing (and if there are any we are missing in the dict), since we know all the codes are either 4 digits or 4 digits plus a capital letter. 
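
One detail worth checking when writing this pattern is anchoring. pandas `str.match` follows `re.match`, which only anchors at the start of the string, so a long number that happens to begin with four digits can look like a code. The sketch below uses the standard `re` module on made-up tokens to show the difference between a start-anchored match and a whole-token match.

```python
import re

code_pattern = re.compile(r"[0-9]{4}[A-Z]?")

# Made-up tokens: two codes, a count, a word, and a gram total
candidates = ["9143", "1100B", "82", "BASE", "12855.76"]

# fullmatch requires the whole token to fit the pattern
strict = [c for c in candidates if code_pattern.fullmatch(c)]

# match only anchors at the start, so the gram total slips through
loose = [c for c in candidates if code_pattern.match(c)]

print(strict)  # ['9143', '1100B']
print(loose)   # ['9143', '1100B', '12855.76']
```

In pandas, ending the pattern with `$` (or using `Series.str.fullmatch`, available in recent versions) gives the stricter behavior.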

In [86]:
cols = ['B', 'C', 'D', 'E', 'F', 'SUMMARY', 'FOR', 'TOTAL', 'DRUG', 'PURCHASES' ]
code_columns = {}

for c in cols:
    print("Checking column {}".format(c))
    if pd.isnull(activity_2000[c]).all():
        print("Null column.")
    else:
        # anchor with $ so that longer numbers such as 12855.76 are not mistaken for codes
        vals = activity_2000[c].loc[activity_2000[c].str.match(r'^[0-9]{4}[A-Z]?$', na=False)].unique().tolist()
        if len(vals)>0:
            code_columns[c] = []
            print("Drug codes found in column {}:".format(c))
            print(vals)

            for v in vals:
                code_columns[c].append(v)
                if v not in drug_codes.keys():
                    print("Unexpected code found: {}".format(v))
        else:
            print("No drug codes found.")
    print()

Checking column B
Drug codes found in column B:
['1724', '9143', '9193']

Checking column C
Drug codes found in column C:
['1100B', '1100D']

Checking column D
No drug codes found.

Checking column E
No drug codes found.

Checking column F
No drug codes found.

Checking column SUMMARY
No drug codes found.

Checking column FOR
Null column.

Checking column TOTAL
Null column.

Checking column DRUG
Null column.

Checking column PURCHASES
Null column.



Now we have a dictionary of column names with the corresponding drug codes to pull from. 

In [87]:
code_columns

{'B': ['1724', '9143', '9193'], 'C': ['1100B', '1100D']}

In [88]:
for col in code_columns.keys():
    for code in code_columns[col]:
        activity_2000.loc[activity_2000[col]==code, 'Drug'] = drug_codes[code]
        
activity_2000.head(20)

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
0,2000,,,,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,
1,2000,ALASKA,PHARMACIES,,STATE:,ALASKA,BUSINESS,ACTIVITY:,PHARMACIES,,,,,,
2,2000,ALASKA,PHARMACIES,,NUMBER,OF,TOTAL,GRAMS,AVERAGE,,,,,,
3,2000,ALASKA,PHARMACIES,,DRUG,REGISTRANTS,SOLD,TO,PURCHASE,PER,,,,,
4,2000,ALASKA,PHARMACIES,,DRUG,NAME,CODE,SOLD,TO,REGISTRANTS,REGISTRANT,,,,
5,2000,ALASKA,PHARMACIES,,----------------------------------------------...,,,,,,,,,,
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,DL-AMPHETAMINE,BASE,1100B,75,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,D-AMPHETAMINE,BASE,1100D,79,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,OXYCODONE,OXYCODONE,9143,82,65876.54,803.37,,,,,,


#### Step 7 - Remove extra rows

The last steps will be a lot easier if we can take a moment to drop out some rows that contain only header data and similar. 

At this point, if we have a row that has something other than a drug name in column A, we don't need it anymore, since we've pulled out all the state names and business activity names. 

In [105]:
activity_2000['A'].unique()

array(['REPORTING', 'STATE:', 'NUMBER', 'DRUG',
       '--------------------------------------------------------------------------------------------',
       'DL-AMPHETAMINE', 'D-AMPHETAMINE', 'METHYLPHENIDATE', 'OXYCODONE',
       'HYDROCODONE', 'DATE:', 'PAGE:', 'ARCOS', 'STATISTICAL'],
      dtype=object)

In [108]:
# to be extra sure, we can add a regex to check for numeric data in column C
# and a report if any is found when the rows are being dropped
# the dashes might vary in length for future years so we will get those later

drops = ['REPORTING', 'STATE:', 'NUMBER', 'DRUG',
         'DATE:', 'PAGE:', 'ARCOS', 'STATISTICAL']

for d in drops:
    check_df = activity_2000.loc[(activity_2000['A']==d)
                                & activity_2000['C'].str.match(r'[-+]?[0-9,]*\.?[0-9]+?$', na=False)]
    if len(check_df)>0:
        print("Dropping rows with numeric data:")
        print(check_df)
    activity_2000 = activity_2000.drop(activity_2000[activity_2000['A']==d].index)
    
activity_2000.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
5,2000,ALASKA,PHARMACIES,,----------------------------------------------...,,,,,,,,,,
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,DL-AMPHETAMINE,BASE,1100B,75.0,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,D-AMPHETAMINE,BASE,1100D,79.0,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,OXYCODONE,OXYCODONE,9143,82,65876.54,803.37,,,,,,
10,2000,ALASKA,PHARMACIES,HYDROCODONE,HYDROCODONE,9193,82,23932.85,291.86,,,,,,
21,2000,ALASKA,HOSPITALS,,----------------------------------------------...,,,,,,,,,,
22,2000,ALASKA,HOSPITALS,DL-AMPHETAMINE BASE,DL-AMPHETAMINE,BASE,1100B,14.0,270.17,19.29,,,,,
23,2000,ALASKA,HOSPITALS,D-AMPHETAMINE BASE,D-AMPHETAMINE,BASE,1100D,19.0,1226.84,64.57,,,,,
24,2000,ALASKA,HOSPITALS,METHYLPHENIDATE,METHYLPHENIDATE,1724,30,3939.0,131.3,,,,,,


In [109]:
# finally, drop the rows with NaNs and dashes
activity_2000 = activity_2000.drop(activity_2000.loc[pd.isnull(activity_2000['D'])].index)
activity_2000.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,DL-AMPHETAMINE,BASE,1100B,75.0,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,D-AMPHETAMINE,BASE,1100D,79.0,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,OXYCODONE,OXYCODONE,9143,82,65876.54,803.37,,,,,,
10,2000,ALASKA,PHARMACIES,HYDROCODONE,HYDROCODONE,9193,82,23932.85,291.86,,,,,,
22,2000,ALASKA,HOSPITALS,DL-AMPHETAMINE BASE,DL-AMPHETAMINE,BASE,1100B,14.0,270.17,19.29,,,,,
23,2000,ALASKA,HOSPITALS,D-AMPHETAMINE BASE,D-AMPHETAMINE,BASE,1100D,19.0,1226.84,64.57,,,,,
24,2000,ALASKA,HOSPITALS,METHYLPHENIDATE,METHYLPHENIDATE,1724,30,3939.0,131.3,,,,,,
25,2000,ALASKA,HOSPITALS,OXYCODONE,OXYCODONE,9143,37,8336.11,225.3,,,,,,
26,2000,ALASKA,HOSPITALS,HYDROCODONE,HYDROCODONE,9193,38,2660.7,70.01,,,,,,
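
If you don't need the per-label report on numeric data, the same rows can be removed in one pass with a boolean mask. The sketch below is a simplified variant on a made-up frame, not the routine used above: it drops the known header labels with `isin` and catches dashed separator lines of any length with a regex, instead of relying on the NaN check.

```python
import pandas as pd

# Made-up rows: headers, a dashed separator, and two data rows
demo = pd.DataFrame({
    "A": ["REPORTING", "STATE:", "-----", "OXYCODONE", "PAGE:", "HYDROCODONE"],
    "B": ["PERIOD:", "ALASKA", None, "9143", "12", "9193"],
})

drops = ["REPORTING", "STATE:", "NUMBER", "DRUG",
         "DATE:", "PAGE:", "ARCOS", "STATISTICAL"]

# A separator line is one or more dashes and nothing else
is_rule = demo["A"].str.match(r"^-+$")

kept = demo[~demo["A"].isin(drops) & ~is_rule]

print(kept["A"].tolist())  # ['OXYCODONE', 'HYDROCODONE']
```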


#### Step 8 - move over shifted data

The next step is to move over the shifted numeric data. 

If the cleaning has all gone properly so far, we should only have numeric data following a drug code and name.

For any drug where the code was in column B, we'll leave it alone. For the rest, we want to move the data over according to how many columns over from column B it has been displaced. 

In [117]:
activity_2000.insert(column='Registrants', loc=4, value=None)
activity_2000.insert(column='Total grams sold', loc=5, value=None)
activity_2000.insert(column='Avg grams/registrant', loc=6, value=None)


activity_2000.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,,,,DL-AMPHETAMINE,BASE,1100B,75.0,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,,,,D-AMPHETAMINE,BASE,1100D,79.0,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,,,,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,OXYCODONE,,,,OXYCODONE,9143,82,65876.54,803.37,,,,,,
10,2000,ALASKA,PHARMACIES,HYDROCODONE,,,,HYDROCODONE,9193,82,23932.85,291.86,,,,,,


In [119]:
shift = ['B', 'C', 'D', 'E', 'F', 'SUMMARY', 'FOR', 'TOTAL', 'DRUG', 'PURCHASES' ]

for col in code_columns.keys():
        i = shift.index(col)
        for code in code_columns[col]:
            activity_2000.loc[activity_2000[col]==code, 'Registrants'] = activity_2000[shift[i+1]]
            activity_2000.loc[activity_2000[col]==code, 'Total grams sold'] = activity_2000[shift[i+2]]
            activity_2000.loc[activity_2000[col]==code, 'Avg grams/registrant'] = activity_2000[shift[i+3]]


            
activity_2000.head(20)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant,A,B,C,D,E,F,SUMMARY,FOR,TOTAL,DRUG,PURCHASES
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,75,2359.66,31.46,DL-AMPHETAMINE,BASE,1100B,75.0,2359.66,31.46,,,,,
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,79,7791.91,98.63,D-AMPHETAMINE,BASE,1100D,79.0,7791.91,98.63,,,,,
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,80,12855.76,160.69,METHYLPHENIDATE,1724,80,12855.76,160.69,,,,,,
9,2000,ALASKA,PHARMACIES,OXYCODONE,82,65876.54,803.37,OXYCODONE,9143,82,65876.54,803.37,,,,,,
10,2000,ALASKA,PHARMACIES,HYDROCODONE,82,23932.85,291.86,HYDROCODONE,9193,82,23932.85,291.86,,,,,,
22,2000,ALASKA,HOSPITALS,DL-AMPHETAMINE BASE,14,270.17,19.29,DL-AMPHETAMINE,BASE,1100B,14.0,270.17,19.29,,,,,
23,2000,ALASKA,HOSPITALS,D-AMPHETAMINE BASE,19,1226.84,64.57,D-AMPHETAMINE,BASE,1100D,19.0,1226.84,64.57,,,,,
24,2000,ALASKA,HOSPITALS,METHYLPHENIDATE,30,3939.0,131.3,METHYLPHENIDATE,1724,30,3939.0,131.3,,,,,,
25,2000,ALASKA,HOSPITALS,OXYCODONE,37,8336.11,225.3,OXYCODONE,9143,37,8336.11,225.3,,,,,,
26,2000,ALASKA,HOSPITALS,HYDROCODONE,38,2660.7,70.01,HYDROCODONE,9193,38,2660.7,70.01,,,,,,
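
The idea behind the shift can also be sketched row by row: find the first token that looks like a drug code and take the three values that follow it. This is an illustrative alternative on made-up rows, assuming the code is the first code-shaped token in each row and that three values always follow it; `values_after_code` is a hypothetical helper, not part of this notebook.

```python
import re
import pandas as pd

CODE = re.compile(r"[0-9]{4}[A-Z]?")

def values_after_code(row, cols):
    """Return the three values that follow the first drug-code-shaped token."""
    tokens = [row[c] for c in cols]
    for i, token in enumerate(tokens):
        if isinstance(token, str) and CODE.fullmatch(token):
            return tokens[i + 1:i + 4]
    return [None, None, None]

cols = ["B", "C", "D", "E", "F"]
demo = pd.DataFrame(
    [
        ["DL-AMPHETAMINE", "BASE", "1100B", "75", "2359.66", "31.46"],
        ["OXYCODONE", "9143", "82", "65876.54", "803.37", None],
    ],
    columns=["A"] + cols,
)

measures = ["Registrants", "Total grams sold", "Avg grams/registrant"]
aligned = pd.DataFrame(
    [values_after_code(row, cols) for _, row in demo.iterrows()],
    columns=measures,
    index=demo.index,
)

print(aligned)
```

Both rows end up with their three measures in the same columns, regardless of how many words the drug name took up.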


#### Step 9 - Last few cleanups

* Simplify references to Guam
* Drop extra columns we don't need
* Convert key columns to float datatype

In [127]:
activity_2000.loc[activity_2000['State']=='TRUST TERRITORIES (GUAM)', 'State'] = "GUAM"

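cols = ['Total grams sold', 'Avg grams/registrant']
for col in cols:
    activity_2000[col] = activity_2000[col].str.replace(",","").astype(float)

# convert the registrant counts from their own column, not the leftover loop variable `col`
# (kept as float, matching the output below)
activity_2000['Registrants'] = activity_2000['Registrants'].str.replace(",","").astype(float)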

activity_2000 = activity_2000[['Year', 'State', 'Business Activity', 'Drug',
                              'Registrants', 'Total grams sold', 'Avg grams/registrant']]
activity_2000.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,75.0,2359.66,31.46
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,79.0,7791.91,98.63
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,80.0,12855.76,160.69
9,2000,ALASKA,PHARMACIES,OXYCODONE,82.0,65876.54,803.37
10,2000,ALASKA,PHARMACIES,HYDROCODONE,82.0,23932.85,291.86
22,2000,ALASKA,HOSPITALS,DL-AMPHETAMINE BASE,14.0,270.17,19.29
23,2000,ALASKA,HOSPITALS,D-AMPHETAMINE BASE,19.0,1226.84,64.57
24,2000,ALASKA,HOSPITALS,METHYLPHENIDATE,30.0,3939.0,131.3
25,2000,ALASKA,HOSPITALS,OXYCODONE,37.0,8336.11,225.3
26,2000,ALASKA,HOSPITALS,HYDROCODONE,38.0,2660.7,70.01
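
The numeric conversion can be tried out on its own before running it on the full dataframe. The sketch below uses made-up strings with thousands separators. Using `pd.to_numeric` here is my substitution for the `astype` calls above, chosen because it infers an integer or float dtype from the cleaned text.

```python
import pandas as pd

# Made-up values as they might appear in the pasted text
grams = pd.Series(["2,359.66", "803.37", "65,876.54"])
counts = pd.Series(["75", "1,204"])

# Remove the commas as literal characters, then parse the numbers
grams_num = pd.to_numeric(grams.str.replace(",", "", regex=False))
counts_num = pd.to_numeric(counts.str.replace(",", "", regex=False))

print(grams_num.dtype, counts_num.dtype)
```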


### Refactor the cleaning & checking code

Below is the refactored code to be reused on the remaining files, first for check functions to find irregular data patterns and second to actually clean up the dataframe. 

Note there are several versions of the function needed to handle all the various formats - as mentioned these were messy and inconsistent from year to year. To adjust the function, I went through it line by line on a file where it didn't work properly to identify the issues, and then adjusted it to reuse from there. 

In [575]:
def check_activity_old(df, flat_geos, flat_activity_codes):
    df.rename(columns={'ARCOS': "A", 
               '2':'B', 
               '-': 'C', 
               'REPORT': 'D', 
               '5':'E', 
               'RETAIL':'TOTAL', 
               'STATISTICAL': 'F'}, 
      inplace=True)
    # check for unusual values in column A - numeric values
    check_a = df['A'].loc[df['A'].str.match('^[0-9]+', na=False)].unique().tolist()
    if len(check_a)>0:
        print("Unusual values in column A:")
        print(check_a)
        print()
    if "STATE:" not in df['A'].unique().tolist():
        print("Data missing - no STATE: header in column A")
    
    # check for valid geography names
    for i in df['B'].loc[df['A']=='STATE:'].unique().tolist():
        if i not in flat_geos:
            print("Unexpected value in column B for states: {}".format(i))
    
            
    # check for unusual values in columns C and D
    for i in df['C'].loc[df['A']=='STATE:'].unique().tolist():
        if i!= 'BUSINESS' and i not in flat_geos:
            print("Unexpected value in column C for states: {}".format(i))

    for i in df['D'].loc[df['A']=='STATE:'].unique().tolist():
        if i!='BUSINESS' and i!='ACTIVITY:' and i not in flat_geos:
            print("Unexpected value in column D for states: {}".format(i))

            
    # check for unexpected values in columns with activity names
    for i in df['E'].loc[df['D']=='ACTIVITY:'].unique().tolist():
        if i not in flat_activity_codes:
            print("Unexpected value in column E for activities: {}".format(i))  
    for i in df['F'].loc[df['D']=='ACTIVITY:'].unique().tolist():
        if i not in flat_activity_codes:
            print("Unexpected value in column F for activities: {}".format(i))
    for i in df['SUMMARY'].loc[df['F']=='ACTIVITY:'].unique().tolist():
        if i not in flat_activity_codes:
            print("Unexpected value in column 'SUMMARY' for activities: {}".format(i))

            
    # check for drug codes
    cols = ['B', 'C', 'D', 'E', 'F', 'SUMMARY', 'FOR', 'TOTAL', 'DRUG', 'PURCHASES' ]
    code_columns = {}
    for c in cols:
        print("Checking column {}".format(c))
        if pd.isnull(df[c]).all():
            print("Null column.")
        else:
            # anchor with $ so that longer numbers are not mistaken for drug codes
            vals = df[c].loc[df[c].str.match(r'^[0-9]{4}[A-Z]?$', na=False)].unique().tolist()
            if len(vals)>0:
                code_columns[c] = []
                print("Drug codes found in column {}:".format(c))
                print(vals)

                for v in vals:
                    code_columns[c].append(v)
                    if v not in drug_codes.keys():
                        print("Unexpected code found: {}".format(v))
            else:
                print("No drug codes found.")
            check = df[c].loc[~df[c].str.match('^[0-9]+', na=False)].unique()
            print("non-numeric values in column {}:".format(c))
            print(check)

        print()

In [576]:
def clean_activity_old(df, year, drug_codes, flat_geos):
    """
    Use for years 2000 to 2005 inclusive. 
    """
    # rename columns
    df.rename(columns={'ARCOS': "A", 
                   '2':'B', 
                   '-': 'C', 
                   'REPORT': 'D', 
                   '5':'E', 
                   'RETAIL':'TOTAL', 
                   'STATISTICAL': 'F'}, 
          inplace=True)

    # insert new columns
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='State', loc=1, value=None)
    df.insert(column='Business Activity', loc=2, value=None)
    df.insert(column='Drug', loc=3, value=None)

    # get state names
    df.loc[df['A']=='STATE:', 'State'] = df['B']
    
    df.loc[(df['A']=="STATE:") 
                      & (df["C"]!='BUSINESS'), 'State'] = df["B"]+" "+df['C']

    df.loc[(df['A']=="STATE:") 
           & (df["C"]!='BUSINESS') 
           & (df["D"]!='ACTIVITY:') 
           & (df["D"]!='BUSINESS'), 'State'] = df["B"]+" "+df['C']+" "+df['D']

    # get the business activity names
    df.loc[df['D']=='ACTIVITY:', 'Business Activity'] = df['E']
    df.loc[(df['D']=='ACTIVITY:') 
           & (pd.notnull(df['F'])), 'Business Activity'] = df['E']+" "+df['F']
    df.loc[df['E']=='ACTIVITY:', 'Business Activity'] = df['F']
    df.loc[(df['E']=='ACTIVITY:') 
           & (pd.notnull(df['SUMMARY'])), 'Business Activity'] = df['F']+" "+df['SUMMARY']
    df.loc[df['F']=='ACTIVITY:', 'Business Activity'] = df['SUMMARY']
    df.loc[(df['F']=='ACTIVITY:')
           & (pd.notnull(df['FOR'])), 'Business Activity'] = df['SUMMARY']+" "+df['FOR']

    # forward fill state and business activity
    df['State'] = df['State'].ffill()
    df['Business Activity'] = df['Business Activity'].ffill()

    # get drug code & update
    # same code as in the checking function but quiet
    cols = ['B', 'C', 'D', 'E', 'F', 'SUMMARY', 'FOR', 'TOTAL', 'DRUG', 'PURCHASES' ]
    code_columns = {}

    for c in cols:
        if pd.isnull(df[c]).all():
            continue
        else:
            vals = df[c].loc[df[c].str.match('^[0-9]{4}[A-Z]?', na=False)].unique().tolist()
            if len(vals)>0:
                code_columns[c] = []
                for v in vals:
                    code_columns[c].append(v)
                    if v not in drug_codes.keys():
                        print("Unexpected drug code found: {}".format(v))
            else:
                continue
    for col in code_columns.keys():
        for code in code_columns[col]:
            df.loc[df[col]==code, 'Drug'] = drug_codes[code]
        
    # drop unnecessary rows
    drops = ['REPORTING', 'STATE:', 'NUMBER', 'DRUG',
         'DATE:', 'PAGE:', 'ARCOS', 'STATISTICAL']
    for d in drops:
        check_df = df.loc[(df['A']==d)
                          & df['C'].str.match(r'[-+]?[0-9,]*\.?[0-9]+?$', na=False)]
        if len(check_df)>0:
            print("Dropping rows with numeric data:")
            print(check_df)
        df = df.drop(df[df['A']==d].index)
    df = df.drop(df.loc[pd.isnull(df['D'])].index)
    
    # add the last few columns we need
    df.insert(column='Registrants', loc=4, value=None)
    df.insert(column='Total grams sold', loc=5, value=None)
    df.insert(column='Avg grams/registrant', loc=6, value=None)
    
    # shift data
    shift = ['B', 'C', 'D', 'E', 'F', 'SUMMARY', 'FOR', 'TOTAL', 'DRUG', 'PURCHASES' ]
    for col in code_columns.keys():
        i = shift.index(col)
        for code in code_columns[col]:
            df.loc[df[col]==code, 'Registrants'] = df[shift[i+1]]
            df.loc[df[col]==code, 'Total grams sold'] = df[shift[i+2]]
            df.loc[df[col]==code, 'Avg grams/registrant'] = df[shift[i+3]]

    # final cleanup
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = "GUAM"
    repl = ['Total grams sold', 'Avg grams/registrant']
    for col in repl:
        df[col] = df[col].str.replace(",","").astype(float)
    df['Registrants'] = df['Registrants'].str.replace(",","").astype(int)
    df = df[['Year', 'State', 'Business Activity', 'Drug',
             'Registrants', 'Total grams sold', 'Avg grams/registrant']]
    return df
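
The "shift data" step in the function above finds whichever column holds the drug code and reads the three values to its right. The sketch below shows that idea on a toy frame, one row at a time. The `rows` frame, `CODE_RE`, and `extract_values` are hypothetical names for illustration, and the row values are only modeled on the 2000 output, so treat this as a simplified stand-in for the notebook's cleaning function rather than a replacement.

```python
import re

import pandas as pd

# Toy rows modeled on the whitespace-split report lines. A multi-word drug
# name pushes the code, and the three numbers after it, one column right.
rows = pd.DataFrame(
    [
        ["OXYCODONE", "9143", "82", "65,876.54", "803.37", None],
        ["D-AMPHETAMINE", "BASE", "1100D", "79", "7,791.91", "98.63"],
    ],
    columns=["A", "B", "C", "D", "E", "F"],
)

# Four digits plus an optional letter; fullmatch() requires the whole cell
# to match, which is stricter than the start-anchored pattern used above.
CODE_RE = re.compile(r"[0-9]{4}[A-Z]?")


def extract_values(cells):
    """Find the first drug code in a row and read the three cells after it."""
    for pos, cell in enumerate(cells):
        if isinstance(cell, str) and CODE_RE.fullmatch(cell):
            registrants, total, avg = cells[pos + 1 : pos + 4]
            return {
                "Code": cell,
                "Registrants": int(registrants.replace(",", "")),
                "Total grams sold": float(total.replace(",", "")),
                "Avg grams/registrant": float(avg.replace(",", "")),
            }
    # No code in this row: it is a header or other non-data line
    return {"Code": None, "Registrants": None,
            "Total grams sold": None, "Avg grams/registrant": None}


tidy = pd.DataFrame(
    [extract_values(cells) for cells in rows.itertuples(index=False, name=None)]
)
print(tidy)
```

Because the scan stops at the first match from the left, a four-digit registrant count to the right of the code is never mistaken for a code.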

In [577]:
activity_2000 = pd.read_csv('activity_2000.txt', sep=r'\s+')
check_activity_old(activity_2000, flat_geos, flat_activity_codes)

Unexpected value in column B for states: TRUST
Unexpected value in column C for states: TERRITORIES
Unexpected value in column D for states: (GUAM)
Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1724', '9143', '9193']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'NAME' nan 'BASE' 'ENFORCEMENT'
 'SUMMARY' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'HAWAII' 'IOWA'
 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE'
 'TRUST' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT' 'WASHINGTON'
 'WISCONSIN' 'WEST' 'WYOMING']

Checking column C
Drug codes found in column C:
['1100B', '1100D']
non-numeric values in column C:
['BUSINESS' 'TOTAL'

In [578]:
activity_2000 = clean_activity_old(df=activity_2000, year=2000, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2000.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2000,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,75,2359.66,31.46
7,2000,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,79,7791.91,98.63
8,2000,ALASKA,PHARMACIES,METHYLPHENIDATE,80,12855.76,160.69
9,2000,ALASKA,PHARMACIES,OXYCODONE,82,65876.54,803.37
10,2000,ALASKA,PHARMACIES,HYDROCODONE,82,23932.85,291.86
22,2000,ALASKA,HOSPITALS,DL-AMPHETAMINE BASE,14,270.17,19.29
23,2000,ALASKA,HOSPITALS,D-AMPHETAMINE BASE,19,1226.84,64.57
24,2000,ALASKA,HOSPITALS,METHYLPHENIDATE,30,3939.0,131.3
25,2000,ALASKA,HOSPITALS,OXYCODONE,37,8336.11,225.3
26,2000,ALASKA,HOSPITALS,HYDROCODONE,38,2660.7,70.01


In [579]:
activity_2001 = pd.read_csv('activity_2001.txt', sep=r'\s+')
check_activity_old(activity_2001, flat_geos, flat_activity_codes)

Unexpected value in column B for states: TRUST
Unexpected value in column C for states: TERRITORIES
Unexpected value in column D for states: (GUAM)
Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1105D', '1724', '2165', '9041L', '9050', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9655', '9739', '9740', '9737', '9652', '9743', '9058', '9180L', '9104', '9190', '9603', '9170', '9317', '1105L']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'NAME' nan 'BASE' '(SCHEDULE'
 '(PETHIDINE)' 'POWDERED' 'ENFORCEMENT' 'SUMMARY' 'ALABAMA' 'ARKANSAS'
 'ARIZONA' 'CALIFORNIA' '(PCP)' 'COLORADO' 'CONNECTICUT' 'DISTRICT'
 'DELAWARE' 'FLORIDA' 'GEORGIA' 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS'
 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND'
 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI' 'MISSISSIPPI' 'MONTANA'
 'NEBRASKA' 'NORTH' 'NEW' 'NEVADA' 'OHIO' 'OKLAHOMA' 'OREGON'
 'PENNSYLVANIA' 'PUERTO' 'RHODE' '

In [580]:
activity_2001 = clean_activity_old(df=activity_2001, year=2001, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2001.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2001,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,79,2736.82,34.64
7,2001,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,83,7486.18,90.19
8,2001,ALASKA,PHARMACIES,D-METHAMPHETAMINE,13,65.2,5.01
9,2001,ALASKA,PHARMACIES,METHYLPHENIDATE,85,16899.97,198.82
10,2001,ALASKA,PHARMACIES,AMOBARBITAL (SCHEDULE 2),2,9.1,4.55
11,2001,ALASKA,PHARMACIES,BUTALBITAL,70,7005.0,100.07
12,2001,ALASKA,PHARMACIES,SECOBARBITAL (SCHEDULE 2),16,292.85,18.3
13,2001,ALASKA,PHARMACIES,COCAINE,5,28.88,5.77
14,2001,ALASKA,PHARMACIES,CODEINE,84,32748.63,389.86
15,2001,ALASKA,PHARMACIES,DIHYDROCODEINE,6,11.66,1.94


In [581]:
activity_2002 = pd.read_csv('activity_2002.txt', sep=r'\s+')
check_activity_old(activity_2002, flat_geos, flat_activity_codes)

Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1105D', '1724', '9041L', '9050', '9064', '9143', '9150', '9193', '9250B', '9300']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'NAME' nan 'BASE' '(PETHIDINE)'
 'ENFORCEMENT' 'SUMMARY' 'ALABAMA' 'ARKANSAS' 'AMERICAN' 'ARIZONA'
 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA'
 'GEORGIA' 'GUAM' 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS'
 'KENTUCKY' 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN'
 'MINNESOTA' 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW'
 'NEVADA' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING']

Checking column C
Drug codes found in column C:
['1100B', '1100D', '9230', '9801']
non-numeric values in column C:
['BUSINESS' 'TOTAL' 'SOLD' 'CODE' nan 'DEPARTMENT' 'ADMINISTRATION' 

In [582]:
activity_2002 = clean_activity_old(df=activity_2002, year=2002, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2002.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2002,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,81,3527.03,43.54
7,2002,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,82,7719.42,94.13
8,2002,ALASKA,PHARMACIES,D-METHAMPHETAMINE,9,48.81,5.42
9,2002,ALASKA,PHARMACIES,METHYLPHENIDATE,84,19843.46,236.23
10,2002,ALASKA,PHARMACIES,COCAINE,2,6.59,3.29
11,2002,ALASKA,PHARMACIES,CODEINE,85,32053.56,377.1
12,2002,ALASKA,PHARMACIES,BUPRENORPHINE,4,0.0,0.0
13,2002,ALASKA,PHARMACIES,OXYCODONE,86,100029.28,1163.13
14,2002,ALASKA,PHARMACIES,HYDROMORPHONE,62,2544.27,41.03
15,2002,ALASKA,PHARMACIES,HYDROCODONE,85,32334.35,380.4


In [583]:
activity_2003 = pd.read_csv('activity_2003.txt', sep=r'\s+')
check_activity_old(activity_2003, flat_geos, flat_activity_codes)

Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1105D', '1724', '9041L', '9050', '9143', '9150', '9193', '9250B', '9300']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'NAME' nan 'BASE' '(PETHIDINE)'
 'ENFORCEMENT' 'SUMMARY' 'ALABAMA' 'ARKANSAS' 'AMERICAN' 'ARIZONA'
 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA'
 'GEORGIA' 'GUAM' 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS'
 'KENTUCKY' 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN'
 'MINNESOTA' 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW'
 'NEVADA' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING']

Checking column C
Drug codes found in column C:
['1100B', '1100D', '9230', '9801']
non-numeric values in column C:
['BUSINESS' 'TOTAL' 'SOLD' 'CODE' nan 'DEPARTMENT' 'ADMINISTRATION' '-'
 'FO

In [584]:
activity_2003 = clean_activity_old(df=activity_2003, year=2003, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2003.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2003,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,82,3617.96,44.12
7,2003,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,83,7352.22,88.58
8,2003,ALASKA,PHARMACIES,D-METHAMPHETAMINE,10,44.01,4.4
9,2003,ALASKA,PHARMACIES,METHYLPHENIDATE,86,19864.27,230.97
10,2003,ALASKA,PHARMACIES,COCAINE,4,15.51,3.87
11,2003,ALASKA,PHARMACIES,CODEINE,85,31933.2,375.68
12,2003,ALASKA,PHARMACIES,OXYCODONE,89,88212.61,991.15
13,2003,ALASKA,PHARMACIES,HYDROMORPHONE,61,1961.41,32.15
14,2003,ALASKA,PHARMACIES,HYDROCODONE,88,36279.31,412.26
15,2003,ALASKA,PHARMACIES,MEPERIDINE (PETHIDINE),79,19871.25,251.53


In [585]:
activity_2004 = pd.read_csv('activity_2004.txt', sep=r'\s+')
check_activity_old(activity_2004, flat_geos, flat_activity_codes)

Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1105D', '1724', '9041L', '9050', '9143', '9150', '9193', '9250B', '9300']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'TRADE' nan 'BASE' '(PETHIDINE)'
 'ENFORCEMENT' 'SUMMARY' 'ALABAMA' 'ARKANSAS' 'AMERICAN' 'ARIZONA'
 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA'
 'GEORGIA' 'GUAM' 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS'
 'KENTUCKY' 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN'
 'MINNESOTA' 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW'
 'NEVADA' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING']

Checking column C
Drug codes found in column C:
['1100B', '1100D', '9230', '9801']
non-numeric values in column C:
['BUSINESS' 'TOTAL' 'SOLD' 'NAME' nan 'DEPARTMENT' 'ADMINISTRATION,' '-'
 '

In [586]:
activity_2004 = clean_activity_old(df=activity_2004, year=2004, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2004.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2004,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,82,4147.54,50.57
7,2004,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,82,7603.11,92.72
8,2004,ALASKA,PHARMACIES,D-METHAMPHETAMINE,8,22.01,2.75
9,2004,ALASKA,PHARMACIES,METHYLPHENIDATE,83,21297.96,256.6
10,2004,ALASKA,PHARMACIES,COCAINE,2,1.06,0.53
11,2004,ALASKA,PHARMACIES,CODEINE,92,30516.76,331.7
12,2004,ALASKA,PHARMACIES,OXYCODONE,87,89450.06,1028.16
13,2004,ALASKA,PHARMACIES,HYDROMORPHONE,57,1743.55,30.58
14,2004,ALASKA,PHARMACIES,HYDROCODONE,91,38206.1,419.84
15,2004,ALASKA,PHARMACIES,MEPERIDINE (PETHIDINE),78,17938.42,229.97


In [587]:
activity_2005 = pd.read_csv('activity_2005.txt', sep=r'\s+')
check_activity_old(activity_2005, flat_geos, flat_activity_codes)

Unexpected value in column F for activities: nan
Checking column B
Drug codes found in column B:
['1724', '9041L', '9050', '9064', '9143', '9150', '9193', '9250B', '9300']
non-numeric values in column B:
['PERIOD:' 'ALASKA' 'OF' 'REGISTRANTS' 'NAME' nan 'BASE' '(PETHIDINE)'
 'ENFORCEMENT' 'SUMMARY' 'ALABAMA' 'ARKANSAS' 'AMERICAN' 'ARIZONA'
 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA'
 'GEORGIA' 'GUAM' 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS'
 'KENTUCKY' 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN'
 'MINNESOTA' 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW'
 'NEVADA' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING']

Checking column C
Drug codes found in column C:
['1100B', '1100D', '9230', '9801']
non-numeric values in column C:
['BUSINESS' 'TOTAL' 'SOLD' 'CODE' nan 'DEPARTMENT' 'ADMINISTRATION,' '-'
 'FO

In [588]:
activity_2005 = clean_activity_old(df=activity_2005, year=2005, 
                                   drug_codes=drug_codes, flat_geos=flat_geos)

activity_2005.head(10)

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
6,2005,ALASKA,PHARMACIES,DL-AMPHETAMINE BASE,81,4177.83,51.57
7,2005,ALASKA,PHARMACIES,D-AMPHETAMINE BASE,84,7208.45,85.81
8,2005,ALASKA,PHARMACIES,METHYLPHENIDATE,85,20836.45,245.13
9,2005,ALASKA,PHARMACIES,COCAINE,2,4.89,2.44
10,2005,ALASKA,PHARMACIES,CODEINE,90,28050.89,311.67
11,2005,ALASKA,PHARMACIES,BUPRENORPHINE,32,424.41,13.26
12,2005,ALASKA,PHARMACIES,OXYCODONE,88,81238.51,923.16
13,2005,ALASKA,PHARMACIES,HYDROMORPHONE,67,1610.07,24.03
14,2005,ALASKA,PHARMACIES,HYDROCODONE,89,39198.21,440.42
15,2005,ALASKA,PHARMACIES,MEPERIDINE (PETHIDINE),75,14308.89,190.78


#### Moving on to the newer-format files

As with the other reports, the files from 2006 onward follow a slightly different format, so we need to adjust the checking and cleaning code to account for that.

In [589]:
activity_2006 = pd.read_csv('activity_2006.txt', sep=r'\s+')
activity_2006.head(10)

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
0,REPORTING,PERIOD:,01/01/2006,TO,12/31/2006,,,,,,,,,
1,Run,Date:,03/27/2015,,,,,,,,,,,
2,STATE:ALABAMA,BUSINESS,ACTIVITY:A,-,PHARMACIES,,,,,,,,,
3,DRUG,NAME,DRUG,CODE,BUYERS,TOTAL,GRAMS,AVG,GRAMS,,,,,
4,AMPHETAMINE,1100,1215,193908.24,159.6,,,,,,,,,
5,D-METHAMPHETAMINE,1105D,55,244.11,4.44,,,,,,,,,
6,METHYLPHENIDATE,1724,1219,278888.64,228.78,,,,,,,,,
7,BUTALBITAL,2165,927,47630,51.38,,,,,,,,,
8,PENTOBARBITAL,(SCHEDULE,2),2270,2,27.33,13.67,,,,,,,
9,SECOBARBITAL,(SCHEDULE,2),2315,12,183,15.25,,,,,,,
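
The `STATE:ALABAMA` row in the output above shows the main change: the state name is now fused to the `STATE:` token, and the business activity arrives as a letter code fused to `ACTIVITY:`. As a minimal sketch of how such a header line can be pulled apart, the example below joins the cells back into one string and applies a single regular expression. Only the first toy row is copied from the output; the other two rows, and the names `headers`, `HEADER_RE`, and `parse_header`, are assumptions for illustration.

```python
import re

import pandas as pd

# Toy header rows in the 2006+ layout. Row 0 comes from the output above;
# rows 1 and 2 are assumed examples of a two-word state name and of a
# letter code that was split off into the next column.
headers = pd.DataFrame(
    [
        ["STATE:ALABAMA", "BUSINESS", "ACTIVITY:A", "-", "PHARMACIES"],
        ["STATE:NEW", "YORK", "BUSINESS", "ACTIVITY:B", "-"],
        ["STATE:ALASKA", "BUSINESS", "ACTIVITY:", "C", "-"],
    ],
    columns=["A", "B", "C", "D", "E"],
)

# Everything between "STATE:" and " BUSINESS" is the state name; the first
# non-space token after "ACTIVITY:" is the activity code.
HEADER_RE = re.compile(r"STATE:(.+?) BUSINESS ACTIVITY:\s*(\S+)")


def parse_header(cells):
    """Return (state, activity code) for one header row, or None."""
    joined = " ".join(c for c in cells if isinstance(c, str))
    match = HEADER_RE.search(joined)
    if match is None:
        return None
    return match.group(1), match.group(2)


parsed = [parse_header(cells)
          for cells in headers.itertuples(index=False, name=None)]
print(parsed)
```

Joining the cells first sidesteps the question of which column the `BUSINESS` token landed in, at the cost of processing one row at a time.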


In [590]:
def check_activity_new(df, flat_geos, flat_activity_codes):
    df.rename(columns={'ARCOS': "A", 
                   '3':'B', 
                   '-': 'C', 
                   'REPORT': 'D', 
                   '5':'E', 
                   'STATISTICAL':'F', 
                   'SUMMARY': 'G'}, 
          inplace=True)
    # check for unusual values in column A - 
    # we don't expect any numeric values here
    check_a = df['A'].loc[df['A'].str.match('^[0-9]+', na=False)].unique().tolist()
    if len(check_a)>0:
        print("Unusual numerical values in column A:")
        print(check_a)
        print()
    
    # check for unexpected geography names
    l = df['A'].loc[df['A'].str.match('STATE:', na=False)].unique().tolist()
    l = [x.split(":")[1] for x in l]
    for i in l:
        if i not in flat_geos:
            print("Unexpected value in column A for states: {}".format(i))
    
    # check for unusual activity codes
    for c in ['C', 'D', 'E']:
        codes = df[c].loc[df[c].str.match("ACTIVITY:", 
                                              na=False)].str.split(":", 
                                                                   expand=True)[1].unique().tolist()
        for code in codes:
            if code not in activity_codes.keys():
                print("Unexpected activity code {} in column {}:".format(code, c))
            

    # we don't expect any numeric values to show up until col C
    for c in ['A', 'B']:
        check = df[c].loc[(df[c].str.match('^[0-9]+', 
                                           na=False))
                           & (~df[c].str.match('^[0-9]{4}[A-Z]?', 
                                               na=False))].unique().tolist()
        if len(check)>0:
            print("Unusual numerical values in column {}".format(c))
            print(check)
            print()
          
    # check for unusual values in columns B and C
    for i in df['B'].loc[df['A'].str.match('STATE:', na=False)].unique().tolist():
        if i not in flat_geos:
            print("Unexpected value in column B for states: {}".format(i))

    for i in df['C'].loc[df['A'].str.match('STATE:', na=False)].unique().tolist():
        if i not in flat_geos:
            print("Unexpected value in column C for states: {}".format(i))

            
    # check for drug codes
    cols = ['B', 'C', 'D', 'E', 'F', 'G', 'FOR', 'RETAIL', 
            'DRUG', 'PURCHASES', 'BY', 'GRAMS', 'WT']
    code_columns = {}
    for c in cols:
        print("Checking column {}".format(c))
        if pd.isnull(df[c]).all():
            print("Null column.")
        else:
            try:
                vals = df[c].loc[df[c].str.match('^[0-9]{4}[A-Z]?', 
                                                 na=False)].unique().tolist()
            except AttributeError:
                vals = df[c].loc[df[c].astype(str).str.match('^[0-9]{4}[A-Z]?', 
                                                             na=False)].unique().tolist()
            if len(vals)>0:
                code_columns[c] = []
                print("Drug codes found in column {}:".format(c))
                print(vals)

                for v in vals:
                    code_columns[c].append(v)
                    if v not in drug_codes.keys():
                        print("Unexpected code found: {}".format(v))
            else:
                print("No drug codes found.")
            try:
                check = df[c].loc[~df[c].str.match('^[0-9]+', na=False)].unique()
            except AttributeError:
                check = df[c].loc[~df[c].astype(str).str.match('^[0-9]+', na=False)].unique()
            print("non-numeric values in column {}:".format(c))
            print(check)

        print()

In [591]:
check_activity_new(activity_2006, flat_geos=flat_geos, flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['14-HYDROXYCODEINONE']

Unusual numerical values in column A
['14-HYDROXYCODEINONE']

Unusual numerical values in column B
['3']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1724', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9411', '9652', '9739', '9737', '9665', '9655', '7285', '9190', '9180L', '7370', '9273D', '9333']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' '(SCHEDULE' '(PETHIDINE)' 'POWDERED'
 'BASE' 'SUMMARY' 'SAMOA' 'OF' nan '(PCP

In [637]:
def clean_activity_new(df, year, drug_codes, flat_geos, activity_codes):
    """
    Use for years 2006 onward.
    """
    # rename columns
    df.rename(columns={'ARCOS': "A", 
                   '3':'B', 
                   '-': 'C', 
                   'REPORT': 'D', 
                   '5':'E', 
                   'STATISTICAL':'F', 
                   'SUMMARY': 'G'}, 
          inplace=True)

    # insert new columns
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='State', loc=1, value=None)
    df.insert(column='Business Activity', loc=2, value=None)
    df.insert(column='Drug', loc=3, value=None)

    # get state names
    df.loc[(df['A'].str.match("STATE:", na=False)) 
           &(df['B']=='BUSINESS'), 'State'] = df['A'].str.split(":", expand=True)[1]
    df.loc[(df['A'].str.match("STATE:", na=False)) 
           &(df['C']=='BUSINESS'), 'State'] = df['A'].str.split(":", expand=True)[1]+" "+df['B']
    df.loc[(df['A'].str.match("STATE:", na=False)) 
           &(df['D']=='BUSINESS'), 'State'] = df['A'].str.split(":", expand=True)[1]+" "+df['B']+" "+df['C']
    
    # add a line to specifically handle the 2017 file
    df.loc[(df['A'].str.match("ENTIRE", na=False)) 
           &(df['B']=='UNITED'), 'State'] = 'UNITED STATES'

    # get business activity codes
    # will use the activity code dict later to get the name
    cols = ['C', 'D', 'E']
    cols_shift = ['D', 'E', 'F']
    for c in cols:
        i = cols.index(c)
        df.loc[df[c].str.match("ACTIVITY:[A-Z]", na=False), 
                      'Business Activity'] = df[c].str.split(":", expand=True)[1]
        # in a few cases, the letter code is not concatenated
        # and appears in the next column
        df.loc[df[c].str.match("ACTIVITY:$", na=False), 
          'Business Activity'] = df[cols_shift[i]]

    # forward fill state and business activity codes
    df['State'] = df['State'].ffill()
    df['Business Activity'] = df['Business Activity'].ffill()

    # get drug code & update
    # same code as in the checking function but quiet
    cols = ['B', 'C', 'D', 'E', 'F', 'G', 
            'FOR', 'RETAIL', 'DRUG', 
            'PURCHASES', 'BY', 'GRAMS', 'WT']    
    code_columns = {}

    for c in cols:
        if pd.isnull(df[c]).all():
            continue
        else:
            try:
                vals = df[c].loc[df[c].str.match('^[0-9]{4}[A-Z]?', 
                                                 na=False)].unique().tolist()
            except AttributeError:
                vals = df[c].loc[df[c].astype(str).str.match('^[0-9]{4}[A-Z]?', 
                                                             na=False)].unique().tolist()            
            if len(vals)>0:
                code_columns[c] = []
                for v in vals:
                    code_columns[c].append(v)
                    if v not in drug_codes.keys():
                        print("Unexpected drug code found: {}".format(v))
            else:
                continue
    for col in code_columns.keys():
        for code in code_columns[col]:
            df.loc[df[col]==code, 'Drug'] = drug_codes[code]
        
    # drop unnecessary rows
    df = df.drop(df.loc[df['A'].str.match("STATE:", na=False)].index)
    df = df.drop(df.loc[pd.isnull(df['Drug'])].index)

    
    # add the last few columns we need
    df.insert(column='Registrants', loc=4, value=None)
    df.insert(column='Total grams sold', loc=5, value=None)
    df.insert(column='Avg grams/registrant', loc=6, value=None)
    
    # shift data
    shift = ['B', 'C', 'D', 'E', 'F', 
            'G', 'FOR', 'RETAIL', 'DRUG', 
            'PURCHASES', 'BY', 'GRAMS', 'WT']  
    for col in code_columns.keys():
        i = shift.index(col)
        for code in code_columns[col]:
            df.loc[df[col]==code, 'Registrants'] = df[shift[i+1]]
            df.loc[df[col]==code, 'Total grams sold'] = df[shift[i+2]]
            df.loc[df[col]==code, 'Avg grams/registrant'] = df[shift[i+3]]

    # final cleanup
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = "GUAM"
    repl = ['Total grams sold', 'Avg grams/registrant']
    
    # in a few cases there are files that come in 
    # having mixed datatypes within columns 
    # or formatted as floats in a column where we only want ints
    # exceptions catch these and convert everything to strings first
    for col in repl:
        try:
            df[col] = df[col].str.replace(",","").astype(float)
        except (AttributeError, TypeError, ValueError):
            # non-string column (no .str accessor) or a value that won't parse
            df[col] = df[col].astype(str)
            df[col] = df[col].str.replace(",","").astype(float)
    
    try:
        df['Registrants'] = df['Registrants'].str.replace(",","").astype(int)
    except (AttributeError, TypeError, ValueError):
        df['Registrants'] = df['Registrants'].astype(str)
        try:
            df['Registrants'] = df['Registrants'].str.replace(",","").astype(int)
        except ValueError:
            # values such as "927.0" have to pass through float first
            df['Registrants'] = df['Registrants'].str.replace(",","").astype(float).astype(int)
    
    # get business activity names
    df['Business Activity'] = df['Business Activity'].apply(lambda x: activity_codes[x])
    df = df[['Year', 'State', 'Business Activity', 'Drug',
             'Registrants', 'Total grams sold', 'Avg grams/registrant']]
    return df
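
The nested `try`/`except` blocks at the end of the function above exist because some files arrive with mixed datatypes in a column. One possible alternative, sketched here under the assumption that unparseable cells may safely become missing values, is to convert everything to strings up front and let `pd.to_numeric` do the parsing. The helper name `to_number` and the `mixed` series are hypothetical. Note that this version rounds floats before casting to integers, whereas the function above truncates them.

```python
import pandas as pd


def to_number(series, as_int=False):
    """Strip thousands separators and convert a column to a numeric dtype."""
    # astype(str) gives every cell a string form, so "1,215", 55 and 927.0
    # all go through the same code path
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    # errors="coerce" turns anything unparseable (e.g. "None") into NaN
    numbers = pd.to_numeric(cleaned, errors="coerce")
    if as_int:
        # nullable Int64 keeps missing values instead of raising
        return numbers.round().astype("Int64")
    return numbers


# A column with the kinds of mixed values described in the comments above
mixed = pd.Series(["1,215", 55, 927.0, None])
result = to_number(mixed, as_int=True)
print(result)
```

The trade-off is that `errors="coerce"` hides bad values silently, so it is worth counting the missing values afterwards if the notebook's louder failure mode is not wanted.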

In [593]:
activity_2006 = clean_activity_new(activity_2006, year=2006, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2006.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2006,ALABAMA,PHARMACIES,AMPHETAMINE,1215,193908.24,159.6
5,2006,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,55,244.11,4.44
6,2006,ALABAMA,PHARMACIES,METHYLPHENIDATE,1219,278888.64,228.78
7,2006,ALABAMA,PHARMACIES,BUTALBITAL,927,47630.0,51.38
8,2006,ALABAMA,PHARMACIES,PENTOBARBITAL (SCHEDULE 2),2,27.33,13.67


In [487]:
activity_2007 = pd.read_csv('activity_2007.txt', sep=r'\s+')
check_activity_new(activity_2007, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column B
['3']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9411', '9652', '9739', '9737', '9655', '9180L', '9273D']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' '(SCHEDULE' '(PETHIDINE)' 'POWDERED'
 'BASE' 'SUMMARY' 'HYDROXYBUTYRIC' 'SAMOA' 'OF' nan 'IN' '(PCP)'
 'HAMPSHIRE' 'JERSEY' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' 'RICO' 'ISLAND'
 'ISLANDS' 'VIRGINIA']

Checking column C
Drug codes found

In [488]:
activity_2007 = clean_activity_new(activity_2007, year=2007, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2007.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2007,ALABAMA,PHARMACIES,AMPHETAMINE,1238,240152.17,193.98
5,2007,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,53,283.08,5.34
6,2007,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,908,16325.77,17.98
7,2007,ALABAMA,PHARMACIES,METHYLPHENIDATE,1244,318954.58,256.39
8,2007,ALABAMA,PHARMACIES,BUTALBITAL,945,51850.0,54.87


In [489]:
activity_2008 = pd.read_csv('activity_2008.txt', sep=r'\s+')
check_activity_new(activity_2008, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['3,4-METHYLENEDIOXYAMPHETAMINE']

Unusual numerical values in column A
['3,4-METHYLENEDIOXYAMPHETAMINE']

Unusual numerical values in column B
['3']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9737', '9739', '1615', '7379', '9668', '9190', '9180L', '2285', '2885', '9170', '9655', '7370', '2765', '9273D']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' '(SCHEDULE' '(PETHIDINE)' 'POWDER

In [490]:
activity_2008 = clean_activity_new(activity_2008, year=2008, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2008.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2008,ALABAMA,PHARMACIES,AMPHETAMINE,1267,228044.03,179.99
5,2008,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,50,196.43,3.93
6,2008,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1180,70740.36,59.95
7,2008,ALABAMA,PHARMACIES,METHYLPHENIDATE,1266,268348.52,211.97
8,2008,ALABAMA,PHARMACIES,BUTALBITAL,797,33050.0,41.47


In [491]:
activity_2009 = pd.read_csv('activity_2009.txt', sep=r'\s+')
check_activity_new(activity_2009, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['9041', '16,896,952.51']

Unusual numerical values in column A
['16,896,952.51']

Unusual numerical values in column B
['3', '1']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9737', '9739', '7379', '2885', '9668', '2765', '2783', '9655', '1615', '9273D', '9180L']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' '(SCHEDULE' '(PETHIDINE)' 'POWDERED'
 'BASE' 'SUMMARY' 'SAMOA' 'HYDR

First unusual values to check out:
* '9041' and '16,896,952.51' in column A - the first looks like a drug code, and the second looks like our recurring wrapped large value problem
* '3' and '1' in column B - could be page numbers or some other header data

In [492]:
activity_2009[activity_2009['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
480,9041,1,0,0,,,,,,,,,,
698,9041,1,0,0,,,,,,,,,,
959,9041,1,0,0,,,,,,,,,,
3413,9041,1,0,0,,,,,,,,,,


In [493]:
drug_codes['9041']

'COCAINE'

In [494]:
activity_2009.iloc[478:482]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
478,DIAZEPAM,2765,2,0.0,0.0,,,,,,,,,
479,NABILONE,7379,1,0.02,0.02,,,,,,,,,
480,9041,1,0,0.0,,,,,,,,,,
481,COCAINE,9041L,115,346.73,3.02,,,,,,,,,


In [495]:
activity_2009.iloc[957:961]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
957,ZOLPIDEM,2783,5,1.8,0.36,,,,,,,,,
958,LORAZEPAM,2885,2,1.2,0.6,,,,,,,,,
959,9041,1,0,0.0,,,,,,,,,,
960,COCAINE,9041L,17,33.21,1.95,,,,,,,,,


In [496]:
activity_2009.iloc[3411:3415]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
3411,BUTALBITAL,2165,1,30.0,30.0,,,,,,,,,
3412,PENTOBARBITAL,(SCHEDULE,2),2270.0,153.0,133542.77,872.83,,,,,,,
3413,9041,1,0,0.0,,,,,,,,,,
3414,COCAINE,9041L,11,11.43,1.04,,,,,,,,,


This looks to be another case of the dueling cocaine codes. Again, since all the rows of data associated with the 9041 code are empty, I will just let them drop out of the final data in the cleaning function.
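
The "all the 9041 rows are empty" claim can also be checked in code rather than by eye. Below is a minimal sketch on a made-up frame in the same raw layout. The helper name `rows_are_empty` is hypothetical (it is not part of this notebook), and the assumption that the gram values of the shifted 9041 rows land in columns `C` and `D` is read off the output above.

```python
import pandas as pd

# Toy frame mimicking the raw layout above (values are strings, as read from the text file)
raw = pd.DataFrame({
    'A': ['NABILONE', '9041', 'COCAINE'],
    'B': ['7379', '1', '9041L'],
    'C': ['1', '0', '115'],
    'D': ['0.02', '0', '346.73'],
    'E': ['0.02', None, '3.02'],
})

def rows_are_empty(df, code='9041', gram_cols=('C', 'D')):
    """Return True if every row whose column A equals `code` is zero in all `gram_cols`."""
    subset = df.loc[df['A'] == code, list(gram_cols)]
    # Strip thousands separators before converting, since large values are comma formatted
    numeric = subset.apply(
        lambda col: pd.to_numeric(col.astype(str).str.replace(',', '', regex=False),
                                  errors='coerce')
    )
    # Missing values count as empty here
    return bool((numeric.fillna(0) == 0).all().all())

print(rows_are_empty(raw))
```

If this ever printed `False` for a real file, the 9041 rows would deserve a closer look before letting the cleaning function drop them.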

In [497]:
# next examine what looks like a wrapped value
activity_2009[activity_2009['A']=='16,896,952.51']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2546,16896952.51,,,,,,,,,,,,,


In [498]:
activity_2009.iloc[2544:2548]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2544,METHYLPHENIDATE,1724,1229,412906.88,335.97,,,,,,,,,
2545,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012.0,1.0,16896952.51,,,,,,,
2546,16896952.51,,,,,,,,,,,,,
2547,BUTALBITAL,2165,810,42270,52.19,,,,,,,,,


Definitely a wrapped value. Even if this occurs in other files, the column shift is likely to vary, so I'll just fix it manually. No need to worry about row 2546; the cleaning function will drop it regardless.

In [499]:
# it's important that we have not dropped any rows yet,
# so that loc and iloc can be used interchangeably for the indexing here
activity_2009.loc[2545, 'FOR'] = activity_2009.loc[2546, 'A']
activity_2009.iloc[2544:2548]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2544,METHYLPHENIDATE,1724,1229,412906.88,335.97,,,,,,,,,
2545,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012.0,1.0,16896952.51,16896952.51,,,,,,
2546,16896952.51,,,,,,,,,,,,,
2547,BUTALBITAL,2165,810,42270,52.19,,,,,,,,,
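
If the same wrap turned up often enough, the manual fix could be folded into a small helper. This is only a sketch under assumptions: `fix_wrapped_value` is a hypothetical function, not part of this notebook, and it assumes the wrapped value always belongs in the first empty column of the row directly above it, which matches the case shown here but may not hold for every file.

```python
import numpy as np
import pandas as pd

def fix_wrapped_value(df, wrapped_idx):
    """Copy the lone value in column A of row `wrapped_idx` into the first
    empty column of the row above it, and return that column's name.

    Assumes a default integer index with no rows dropped yet, so that
    index labels and positions agree. Raises IndexError if the row above
    has no empty column.
    """
    prev_idx = wrapped_idx - 1
    prev_row = df.loc[prev_idx]
    # First column in the previous row that is still empty
    target_col = prev_row[prev_row.isna()].index[0]
    df.loc[prev_idx, target_col] = df.loc[wrapped_idx, 'A']
    return target_col

# Toy data: the last value of row 0 has wrapped onto row 1
toy = pd.DataFrame(
    [['GAMMA', 'HYDROXYBUTYRIC', '2012', '1', '16,896,952.51', np.nan],
     ['16,896,952.51', np.nan, np.nan, np.nan, np.nan, np.nan]],
    columns=['A', 'B', 'C', 'D', 'E', 'F'],
    dtype=object,
)
col = fix_wrapped_value(toy, 1)
print(col, toy.loc[0, col])
```

The leftover row holding only the wrapped value is left in place, on the expectation that a later cleaning step removes it.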


In [500]:
activity_2009[activity_2009['B']=='3']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
33,ARCOS,3,-,REPORT,5,,,,,,,,,
69,ARCOS,3,-,REPORT,5,,,,,,,,,
97,ARCOS,3,-,REPORT,5,,,,,,,,,
132,ARCOS,3,-,REPORT,5,,,,,,,,,
168,ARCOS,3,-,REPORT,5,,,,,,,,,
198,ARCOS,3,-,REPORT,5,,,,,,,,,
231,ARCOS,3,-,REPORT,5,,,,,,,,,
267,ARCOS,3,-,REPORT,5,,,,,,,,,
300,ARCOS,3,-,REPORT,5,,,,,,,,,
334,ARCOS,3,-,REPORT,5,,,,,,,,,


In [501]:
activity_2009[activity_2009['B']=='1']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
480,9041,1,0,0,,,,,,,,,,
698,9041,1,0,0,,,,,,,,,,
959,9041,1,0,0,,,,,,,,,,
3413,9041,1,0,0,,,,,,,,,,


Now we can clean the file; after cleaning you can verify that the problematic rows were dropped.

In [502]:
activity_2009 = clean_activity_new(activity_2009, year=2009, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2009.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2009,ALABAMA,PHARMACIES,AMPHETAMINE,1251,259801.9,207.68
5,2009,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,41,196.83,4.8
6,2009,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1204,121313.71,100.76
7,2009,ALABAMA,PHARMACIES,METHYLPHENIDATE,1259,263010.46,208.9
8,2009,ALABAMA,PHARMACIES,BUTALBITAL,775,30270.0,39.06


In [503]:
# uncomment to see that the rows with irregular data were dropped
# use loc here: iloc would return whichever row now sits at that position,
# while the row that used to have that index label is gone,
# so each of these will raise a KeyError

#activity_2009.loc[480]
#activity_2009.loc[698]
#activity_2009.loc[959]
#activity_2009.loc[3413]
#activity_2009.loc[2546]
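
The difference between `loc` and `iloc` after rows are dropped is easy to see on a tiny example. This sketch uses a made-up three-row frame, not the ARCOS data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['AMPHETAMINE', '9041', 'COCAINE']})

# Before dropping anything, label 1 and position 1 are the same row
assert df.loc[1, 'A'] == df.iloc[1]['A'] == '9041'

# Drop the 9041 row; index labels 0 and 2 remain
cleaned = df[df['A'] != '9041']

# iloc is positional: position 1 now holds the row labelled 2
print(cleaned.iloc[1]['A'])

# loc is label based: label 1 no longer exists
try:
    cleaned.loc[1]
except KeyError:
    print('label 1 was dropped')
```

This is why the manual fixes above are done before cleaning, and the verification afterwards uses `loc`.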

In [504]:
activity_2010 = pd.read_csv('activity_2010.txt', delim_whitespace=True)
check_activity_new(activity_2010, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['9041', '4-HYDROXY-3-METHOXY-METHAMPHETAMINE(H', '16,120,104.19']

Unusual numerical values in column A
['4-HYDROXY-3-METHOXY-METHAMPHETAMINE(H', '16,120,104.19']

Unusual numerical values in column B
['3', '1']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9739', '9780', '9737', '2783', '1615', '9041', '9190', '9655', '7444', '9180L', '9273D', '2885', '9743']
non-numeric values in column B:
['PERIO

Similar story here with some values to check:

* '9041' and '16,120,104.19' in column A
* '3' and '1' in column B

In [505]:
activity_2010[activity_2010['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
920,9041,1,0,0,,,,,,,,,,


In [506]:
activity_2010.iloc[918:922]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
918,PENTOBARBITAL,(SCHEDULE,2),2270.0,77.0,21916.8,284.63,,,,,,,
919,SECOBARBITAL,(SCHEDULE,2),2315.0,1.0,18.31,18.31,,,,,,,
920,9041,1,0,0.0,,,,,,,,,,
921,COCAINE,9041L,256,2023.73,7.91,,,,,,,,,


Another 9041 cocaine entry that's empty; as before we'll let it drop in the cleaning step.

In [507]:
activity_2010[activity_2010['A']=='16,120,104.19']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2539,16120104.19,,,,,,,,,,,,,


In [508]:
activity_2010.iloc[2537:2541]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2537,METHYLPHENIDATE,1724,1214,446324.88,367.65,,,,,,,,,
2538,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012.0,1.0,16120104.19,,,,,,,
2539,16120104.19,,,,,,,,,,,,,
2540,AMOBARBITAL,(SCHEDULE,2),2125,1.0,0.46,0.46,,,,,,,


In [509]:
# same fix as before
# again remember you can only do it this way if you have not dropped any rows yet 
# so that we can use loc and iloc interchangeably for the indexing here
# row 2539 will get dropped in the cleaning
activity_2010.loc[2538, 'FOR'] = activity_2010.loc[2539, 'A']
activity_2010.iloc[2537:2540]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2537,METHYLPHENIDATE,1724,1214,446324.88,367.65,,,,,,,,,
2538,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012.0,1.0,16120104.19,16120104.19,,,,,,
2539,16120104.19,,,,,,,,,,,,,


In [510]:
# this is just header data
#activity_2010[activity_2010['B']=='3'].head()

In [511]:
# this is the cocaine 9041 row
#activity_2010[activity_2010['B']=='1']

In [512]:
activity_2010 = clean_activity_new(activity_2010, year=2010, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2010.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2010,ALABAMA,PHARMACIES,AMPHETAMINE,1259,295394.9,234.63
5,2010,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,31,85.59,2.76
6,2010,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,35,161.89,4.63
7,2010,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1219,153102.27,125.6
8,2010,ALABAMA,PHARMACIES,METHYLPHENIDATE,1264,268138.96,212.14


In [513]:
# uncomment to check that the irregular rows were dropped if you like
#activity_2010.loc[2539]
#activity_2010.loc[920]

In [514]:
activity_2011 = pd.read_csv('activity_2011.txt', delim_whitespace=True)
check_activity_new(activity_2011, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['17,577,800.06', '9041']

Unusual numerical values in column A
['17,577,800.06']

Unusual numerical values in column B
['3', '2', '1', '9']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9737', '9739', '9041', '9668', '1615', '9600', '2783', '2885', '9743', '9046']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(SCHEDULE' '(PETHIDINE)'
 'POWDERED' 'BASE' 'SUMM

To check out:
* '17,577,800.06' and '9041' in column A (seeing a pattern yet...?)
* '3', '2', '1', '9' in column B

In [515]:
activity_2011[activity_2011['A']=='17,577,800.06']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2539,17577800.06,,,,,,,,,,,,,


In [516]:
activity_2011.iloc[2537:2541]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2537,METHYLPHENIDATE,1724,1209,571305.10,472.54,,,,,,,,,
2538,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPRO,2012.0,1.0,17577800.06,,,,,,,
2539,17577800.06,,,,,,,,,,,,,
2540,BUTALBITAL,2165,883,61075,69.17,,,,,,,,,


In [517]:
# fix this one as before and let it drop in the cleaning
activity_2011.loc[2538, 'FOR'] = activity_2011.loc[2539, 'A']
activity_2011.iloc[2537:2541]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2537,METHYLPHENIDATE,1724,1209,571305.10,472.54,,,,,,,,,
2538,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPRO,2012.0,1.0,17577800.06,17577800.06,,,,,,
2539,17577800.06,,,,,,,,,,,,,
2540,BUTALBITAL,2165,883,61075,69.17,,,,,,,,,


In [518]:
activity_2011[activity_2011['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2573,9041,2,0,0,,,,,,,,,,
2989,9041,1,0,0,,,,,,,,,,
3054,9041,9,0,0,,,,,,,,,,
4334,9041,2,0,0,,,,,,,,,,
4815,9041,1,0,0,,,,,,,,,,


More 9041 cocaine data - however, these look a little different because some of them have more than 1 registrant associated. Still, they all end up being blank, so we'll continue to let them drop. 

In [519]:
# uncomment to see the surrounding data for each of the rows
#activity_2011.iloc[2572:2575]
#activity_2011.iloc[2988:2991]
#activity_2011.iloc[3053:3056]
#activity_2011.iloc[4333:4336]
#activity_2011.iloc[4814:4817]

In [520]:
# this is header data
#activity_2011[activity_2011['B']=='3']

In [521]:
# these are all rows from the 9041 cocaine data

#activity_2011[activity_2011['B']=='2']
#activity_2011[activity_2011['B']=='1']
#activity_2011[activity_2011['B']=='9']

In [522]:
activity_2011 = clean_activity_new(activity_2011, year=2011, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2011.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2011,ALABAMA,PHARMACIES,AMPHETAMINE,1273,328351.81,257.94
5,2011,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,39,154.31,3.96
6,2011,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,16,73.91,4.62
7,2011,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1242,188868.16,152.07
8,2011,ALABAMA,PHARMACIES,METHYLPHENIDATE,1271,285745.75,224.82


In [523]:
# uncomment to verify the dropped rows

#activity_2011.loc[2539]
#activity_2011.loc[2573]
#activity_2011.loc[2989]
#activity_2011.loc[3054]
#activity_2011.loc[4334]
#activity_2011.loc[4815]

In [524]:
activity_2012 = pd.read_csv('activity_2012.txt', delim_whitespace=True)
check_activity_new(activity_2012, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['5-METHOXY-N,N', '5-METHOXY-N,N-DIISOPROPYLTRYPTAMINE(5-M', '9041']

Unusual numerical values in column A
['5-METHOXY-N,N', '5-METHOXY-N,N-DIISOPROPYLTRYPTAMINE(5-M']

Unusual numerical values in column B
['3', '4-METHOXYMETHCATHINONE;ME', '2']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '1724', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9737', '9780', '9739', '9041', '9668', '9743', '7439', '7438', '7370', '1615', '9200', '9056']
non-numeric valu

Just some more of the usual suspects to check:
* '9041' in column A
* '3', '2' in column B

In [525]:
activity_2012[activity_2012['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
3518,9041,3,0,0,,,,,,,,,,
3938,9041,2,0,0,,,,,,,,,,


In [526]:
activity_2012.iloc[3517:3520]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
3517,PSILOCIN,7438,1,0.01,0.01,,,,,,,,,
3518,9041,3,0,0.0,,,,,,,,,,
3519,COCAINE,9041L,4,7.04,1.76,,,,,,,,,


In [527]:
activity_2012.iloc[3937:3940]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
3937,PENTOBARBITAL,(SCHEDULE,2),2270.0,4.0,13240.82,3310.2,,,,,,,
3938,9041,2,0,0.0,,,,,,,,,,
3939,CODEINE,9050,1,4.42,4.42,,,,,,,,,


This is a little unusual in that we don't see an entry for 9041L cocaine, which is the one that usually has data associated. However, these are still empty rows, so we will let them drop. 
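
To find 9041 rows that are not followed by a 9041L cocaine row without scanning the output manually, one option is to compare each row with the one below it using `shift`. A minimal sketch on made-up rows that mirror the two cases above:

```python
import pandas as pd

toy = pd.DataFrame({
    'A': ['PSILOCIN', '9041', 'COCAINE', 'PENTOBARBITAL', '9041', 'CODEINE'],
    'B': ['7438', '3', '9041L', '(SCHEDULE', '2', '9050'],
})

is_9041 = toy['A'] == '9041'
# shift(-1) lines each row up with the value from the row below it
followed_by_9041L = toy['B'].shift(-1) == '9041L'

# 9041 rows with no 9041L row immediately after them
print(toy.loc[is_9041 & ~followed_by_9041L])
```

This relies on the drug code of a normal row sitting in column `B`, as it does in the rows shown above.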

In [528]:
# this is header data
#activity_2012[activity_2012['B']=='3']

# this is the 9041 cocaine row
#activity_2012[activity_2012['B']=='2']

In [529]:
activity_2012 = clean_activity_new(activity_2012, year=2012, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2012.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2012,ALABAMA,PHARMACIES,AMPHETAMINE,1308,374199.35,286.09
5,2012,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,36,188.47,5.24
6,2012,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,6,29.32,4.89
7,2012,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1277,222824.02,174.49
8,2012,ALABAMA,PHARMACIES,METHYLPHENIDATE,1304,292500.1,224.31


In [530]:
# uncomment to verify the dropped rows
#activity_2012.loc[3938]
#activity_2012.loc[3518]

In [531]:
activity_2013 = pd.read_csv('activity_2013.txt', delim_whitespace=True)
check_activity_new(activity_2013, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['9041']

Unusual numerical values in column B
['3', '1', '8', '32', '2', '11', '9', '4']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '4187', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9041', '9737', '9739', '7379', '9668', '1615', '9056', '9743']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)'
 '(SCHEDULE' '(PETHIDINE)' 'POWDERED' 'BASE' 'SAMOA' 'OF' '(PCP)' nan
 'HYDROXYBUTYRIC' 'HAMPSHIRE' 'JERSE

To check:

* '9041' in column A
* '3', '1', '8', '32', '2', '11', '9', '4' in column B

In [532]:
activity_2013[activity_2013['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
65,9041,1,0,0,,,,,,,,,,
340,9041,1,0,0,,,,,,,,,,
748,9041,1,0,0,,,,,,,,,,
824,9041,1,0,0,,,,,,,,,,
2696,9041,1,0,0,,,,,,,,,,
2723,9041,1,0,0,,,,,,,,,,
3168,9041,8,0,0,,,,,,,,,,
3198,9041,32,0,0,,,,,,,,,,
3285,9041,2,0,0,,,,,,,,,,
3309,9041,3,0,0,,,,,,,,,,


So many 9041 rows! Uncomment below to see the surrounding data for each. 

In [533]:
#activity_2013.iloc[64:67]
#activity_2013.iloc[339:342]
#activity_2013.iloc[747:750]
#activity_2013.iloc[823:826]
#activity_2013.iloc[2695:2698]
#activity_2013.iloc[2722:2725]
#activity_2013.iloc[3167:3170]
#activity_2013.iloc[3197:3200]
#activity_2013.iloc[3284:3287]
#activity_2013.iloc[3308:3311]
#activity_2013.iloc[3466:3469]
#activity_2013.iloc[3493:3496]
#activity_2013.iloc[3582:3585]
#activity_2013.iloc[3687:3690]
#activity_2013.iloc[3710:3713]
#activity_2013.iloc[4035:4038]
#activity_2013.iloc[4060:4063]
#activity_2013.iloc[4218:4221]
#activity_2013.iloc[4246:4249]
#activity_2013.iloc[4886:4889]
#activity_2013.iloc[4908:4911]
#activity_2013.iloc[5002:5005]

In [534]:
# this is header data
#activity_2013[activity_2013['B']=='3']

# these are all 9041 rows
#activity_2013[activity_2013['B']=='1']
#activity_2013[activity_2013['B']=='8']
#activity_2013[activity_2013['B']=='32']
#activity_2013[activity_2013['B']=='2']
#activity_2013[activity_2013['B']=='11']
#activity_2013[activity_2013['B']=='9']
#activity_2013[activity_2013['B']=='4']

In [535]:
activity_2013 = clean_activity_new(activity_2013, year=2013, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2013.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2013,ALABAMA,PHARMACIES,AMPHETAMINE,1319,399689.13,303.02
5,2013,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,39,165.16,4.23
6,2013,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,5,16.07,3.21
7,2013,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1292,234639.23,181.61
8,2013,ALABAMA,PHARMACIES,METHYLPHENIDATE,1315,277243.0,210.83


In [536]:
# verify that the rows have been dropped

#activity_2013.loc[65]
#activity_2013.loc[340]
#activity_2013.loc[748]
#activity_2013.loc[824]
#activity_2013.loc[2696]
#activity_2013.loc[2723]
#activity_2013.loc[3168]
#activity_2013.loc[3198]
#activity_2013.loc[3285]
#activity_2013.loc[3309]
#activity_2013.loc[3467]
#activity_2013.loc[3494]
#activity_2013.loc[3583]
#activity_2013.loc[3688]
#activity_2013.loc[3711]
#activity_2013.loc[4036]
#activity_2013.loc[4061]
#activity_2013.loc[4219]
#activity_2013.loc[4247]
#activity_2013.loc[4887]
#activity_2013.loc[4909]
#activity_2013.loc[5003]

In [537]:
activity_2014 = pd.read_csv('activity_2014.txt', delim_whitespace=True)
check_activity_new(activity_2014, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['9041']

Unusual numerical values in column B
['3', '1', '18', '5', '4', '2', '16', '13']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '7379', '9041L', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9737', '9739', '9668', '9041', '9743', '9170']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)'
 '(SCHEDULE' '(PETHIDINE)' 'POWDERED' 'BASE' 'HYDROXYBUTYRIC' 'SAMOA' 'OF'
 nan 'HAMPSHIRE' 'JERSEY' 'MEXICO' 'YORK' 'CAR

Things to check:
* '9041' in column A
* '3', '1', '18', '5', '4', '2', '16', '13' in column B

In [538]:
activity_2014[activity_2014['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
39,9041,1,0,0,,,,,,,,,,
63,9041,1,0,0,,,,,,,,,,
207,9041,1,0,0,,,,,,,,,,
256,9041,18,0,0,,,,,,,,,,
532,9041,3,0,0,,,,,,,,,,
558,9041,3,0,0,,,,,,,,,,
660,9041,5,0,0,,,,,,,,,,
1005,9041,3,0,0,,,,,,,,,,
1202,9041,1,0,0,,,,,,,,,,
1669,9041,4,0,0,,,,,,,,,,


Again, a lot of 9041 rows. There are too many to bother printing each one out manually, so a small helper function does it in a loop - this isn't as pretty, but it does the job. 

In [539]:
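def show_9041_rows(df, i):
    # show the row at position i plus its neighbors above and below (first 6 columns)
    # assumes no rows have been dropped yet, so index labels match positions
    return df.iloc[max(i-1, 0):i+2, 0:6]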

for i in activity_2014[activity_2014['A']=='9041'].index:
    print(show_9041_rows(activity_2014, i))
    print()

                A          B   C       D    E      F
38  PENTOBARBITAL  (SCHEDULE  2)    2270    6  699.5
39           9041          1   0       0  NaN    NaN
40        COCAINE      9041L  44  202.42  4.6    NaN

                A          B   C       D     E          F
62  PENTOBARBITAL  (SCHEDULE  2)    2270    73  47,575.93
63           9041          1   0       0   NaN        NaN
64        COCAINE      9041L  11  105.99  9.64        NaN

            A      B  C     D     E    F
206  NABILONE   7379  2  0.75  0.38  NaN
207      9041      1  0     0   NaN  NaN
208   COCAINE  9041L  6    20  3.33  NaN

                 A          B   C       D     E           F
255  PENTOBARBITAL  (SCHEDULE  2)    2270   158  187,699.96
256           9041         18   0       0   NaN         NaN
257        COCAINE      9041L  21  160.13  7.63         NaN

                 A          B   C       D    E         F
531  PENTOBARBITAL  (SCHEDULE  2)    2270   15  1,933.99
532           9041          3   0 

In a few of these cases it looks like cocaine 9041L may not be in the dataset (for certain states), but all the rows for 9041 still have zero gram data so we will let them drop as before. 

The unusual values in column B are related to these 9041 rows as well as header data. 

In [540]:
for i in ['3', '1', '18', '5', '4', '2', '16', '13']:
    print(activity_2014[['A', 'B', 'C', 'D', 'E', 'F']].loc[activity_2014['B']==i])
    print()

          A  B  C       D    E            F
34    ARCOS  3  -  REPORT    5  STATISTICAL
71    ARCOS  3  -  REPORT    5  STATISTICAL
99    ARCOS  3  -  REPORT    5  STATISTICAL
136   ARCOS  3  -  REPORT    5  STATISTICAL
170   ARCOS  3  -  REPORT    5  STATISTICAL
201   ARCOS  3  -  REPORT    5  STATISTICAL
238   ARCOS  3  -  REPORT    5  STATISTICAL
273   ARCOS  3  -  REPORT    5  STATISTICAL
306   ARCOS  3  -  REPORT    5  STATISTICAL
343   ARCOS  3  -  REPORT    5  STATISTICAL
374   ARCOS  3  -  REPORT    5  STATISTICAL
409   ARCOS  3  -  REPORT    5  STATISTICAL
445   ARCOS  3  -  REPORT    5  STATISTICAL
479   ARCOS  3  -  REPORT    5  STATISTICAL
513   ARCOS  3  -  REPORT    5  STATISTICAL
532    9041  3  0       0  NaN          NaN
550   ARCOS  3  -  REPORT    5  STATISTICAL
558    9041  3  0       0  NaN          NaN
582   ARCOS  3  -  REPORT    5  STATISTICAL
615   ARCOS  3  -  REPORT    5  STATISTICAL
652   ARCOS  3  -  REPORT    5  STATISTICAL
683   ARCOS  3  -  REPORT    5  
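
Since the printout gets long, the same check can be boiled down to a single question: is there any flagged row in column B that is neither a page header nor a 9041 row? A minimal sketch on made-up rows (the real frame and the real list of unusual values would be substituted in):

```python
import pandas as pd

toy = pd.DataFrame({
    'A': ['ARCOS', '9041', 'COCAINE', 'ARCOS', '9041'],
    'B': ['3', '3', '9041L', '3', '18'],
})

unusual = ['3', '18']
flagged = toy[toy['B'].isin(unusual)]
# Anything flagged in column B that is neither a page header nor a 9041 row
unexplained = flagged[~flagged['A'].isin(['ARCOS', '9041'])]
print(len(unexplained))
```

A count of zero means every unusual value is accounted for; anything else is worth printing and inspecting.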

In [541]:
activity_2014 = clean_activity_new(activity_2014, year=2014, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2014.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2014,ALABAMA,PHARMACIES,AMPHETAMINE,1341,444407.49,331.4
5,2014,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,38,171.19,4.5
6,2014,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,2,1.61,0.8
7,2014,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1297,244512.6,188.52
8,2014,ALABAMA,PHARMACIES,METHYLPHENIDATE,1333,287724.18,215.85


In [542]:
activity_2015 = pd.read_csv('activity_2015.txt', delim_whitespace=True)
check_activity_new(activity_2015, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['9041']

Unusual numerical values in column B
['3', '1', '12', '2']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '7379', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9041L', '9737', '9739', '9041', '9668', '9600', '7315D', '7370', '7377', '7381', '7433', '7437', '9010', '9020', '9200', '9313']
non-numeric values in column B:
['PERIOD:' 'Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)'
 '(SCHEDULE' '(PETHIDINE)' 'POWDERED' 'BASE' 'SAMOA' 'HYDROXYBUTYRIC' 

In [543]:
# checking for 9041 cocaine data

activity_2015[activity_2015['A']=='9041']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
342,9041,1,0,0,,,,,,,,,,
412,9041,12,0,0,,,,,,,,,,
1173,9041,2,0,0,,,,,,,,,,
3757,9041,2,0,0,,,,,,,,,,
4390,9041,1,0,0,,,,,,,,,,


In [544]:
# uncomment to check 

#activity_2015.iloc[341:344]
#activity_2015.iloc[411:414]
#activity_2015.iloc[1172:1175]
#activity_2015.iloc[3756:3759]
#activity_2015.iloc[4389:4392]

In [545]:
# check that these values are just header and 9041-related ['3', '1', '12', '2']
#activity_2015[activity_2015['B']=='3'].head()
#activity_2015[activity_2015['B']=='1']
#activity_2015[activity_2015['B']=='12']
#activity_2015[activity_2015['B']=='2']

In [546]:
activity_2015 = clean_activity_new(activity_2015, year=2015, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2015.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
4,2015,ALABAMA,PHARMACIES,AMPHETAMINE,1360,475334.85,349.51
5,2015,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,31,118.95,3.84
6,2015,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,13,24.1,1.85
7,2015,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1325,265397.62,200.3
8,2015,ALABAMA,PHARMACIES,METHYLPHENIDATE,1347,285554.42,211.99


In [547]:
activity_2016 = pd.read_csv('activity_2016.txt', delim_whitespace=True)
check_activity_new(activity_2016, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['25,314,852.67']

Unusual numerical values in column A
['25,314,852.67']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9041L', '9737', '9739', '9668', '7379']
non-numeric values in column B:
['PERIOD:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)' 'ACID'
 '(PETHIDINE)' 'POWDERED' 'BASE' 'HYDROXYBUTYRIC' '(SCHEDULE' 'Date:'
 'SUMMARY' 'SAMOA' 'TINCTURE' 'OF' nan '(PCP)' 'HAMPSHIRE' 'JERSEY'
 'MEXICO' 'YORK' 'CAROLIN

In [548]:
activity_2016[activity_2016['A']=='25,314,852.67']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2387,25314852.67,,,,,,,,,,,,,


In [549]:
activity_2016.iloc[2385:2389]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2385,METHYLPHENIDATE,(DL;D;L;ISOMERS),1724,1343,517387.64,385.25,,,,,,,,
2386,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012,1,25314852.67,,,,,,,
2387,25314852.67,,,,,,,,,,,,,
2388,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21C,2100.0,7.0,110.0,15.71,,,


In [550]:
activity_2016.loc[2386, 'FOR'] = activity_2016.loc[2387, 'A']
activity_2016.iloc[2385:2389]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2385,METHYLPHENIDATE,(DL;D;L;ISOMERS),1724,1343,517387.64,385.25,,,,,,,,
2386,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012,1,25314852.67,25314852.67,,,,,,
2387,25314852.67,,,,,,,,,,,,,
2388,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21C,2100.0,7.0,110.0,15.71,,,


In [551]:
activity_2016 = clean_activity_new(activity_2016, year=2016, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2016.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
3,2016,ALABAMA,PHARMACIES,AMPHETAMINE,1365,499133.36,365.67
4,2016,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,9,32.95,3.66
5,2016,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,27,86.77,3.21
6,2016,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1331,281761.32,211.69
7,2016,ALABAMA,PHARMACIES,METHYLPHENIDATE,1343,283227.95,210.89


In [563]:
activity_2017 = pd.read_csv('activity_2017.txt', delim_whitespace=True)
check_activity_new(activity_2017, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['30,411,958.46']

Unexpected activity code  in column E:
Unusual numerical values in column A
['30,411,958.46']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9041L', '9737', '9739', '7379', '9668', '9333', '9056']
non-numeric values in column B:
['Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)' 'ACID' 'IN'
 '(PETHIDINE)' 'POWDERED' 'BASE' 'COMBINATION' '(SCHEDULE' 'RANGE:'
 'SUMMARY' 'SAMOA' 'TINCTURE' 'HYDROX

It looks like there are some differences with this file - the overall format is the same, but some of the numeric values are formatted differently. For example, the unusual value '3133.15' in column 'RETAIL' is getting picked up as a drug code because it has 4 digits in a row with no comma separator, whereas numeric values of that size in the data typically include one. 

Things to check:

* '30,411,958.46' in column A
* '1308.13(C)(3)]' in column FOR
* '3133.15' in column RETAIL
* 'PRODU' in column DRUG
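
To illustrate why a missing comma matters, here is a sketch of how a "four digits in a row" test would treat a few values. The pattern below is a hypothetical stand-in; the actual logic in the check and cleaning functions may differ.

```python
import re

# Hypothetical pattern: four consecutive digits at the start of the token.
# The notebook's actual drug code detection may be written differently.
looks_like_code = re.compile(r'^\d{4}')

for token in ['9041L', '3133.15', '3,133.15', '37,597.82']:
    print(token, bool(looks_like_code.match(token)))
```

Under that assumption, '3133.15' is indistinguishable from a drug code, while the comma formatted values are not matched.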

In [564]:
activity_2017[activity_2017['A']=='30,411,958.46']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2450,30411958.46,,,,,,,,,,,,,


In [565]:
activity_2017.iloc[2448:2452]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2448,METHYLPHENIDATE,(DL;D;L;ISOMERS),1724,1318,515769.49,391.33,,,,,,,,
2449,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012,1,30411958.46,,,,,,,
2450,30411958.46,,,,,,,,,,,,,
2451,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21C,2100.0,378.0,15805.0,41.81,,,


In [566]:
activity_2017.loc[2449, 'FOR'] = activity_2017.loc[2450, 'A']
activity_2017.iloc[2448:2452]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
2448,METHYLPHENIDATE,(DL;D;L;ISOMERS),1724,1318,515769.49,391.33,,,,,,,,
2449,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED,2012,1,30411958.46,30411958.46,,,,,,
2450,30411958.46,,,,,,,,,,,,,
2451,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21C,2100.0,378.0,15805.0,41.81,,,


In [567]:
activity_2017[activity_2017['RETAIL']=='3133.15']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5140,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED-CSA,III),2012,12,37597.82,3133.15,,,,,


For whatever reason, this value is formatted without a comma separator. This will cause an error if we let it get picked up by the cleaning function as a drug code. There are many ways to deal with this, but since it's just one value I'm choosing to handle it by manually adding a comma so that it will be treated like the rest of the data.

In [568]:
activity_2017.loc[5140, 'RETAIL'] = '3,133.15'
activity_2017.iloc[5139:5141]

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5139,METHYLPHENIDATE,(DL;D;L;ISOMERS),1724,5376,606458.81,112.81,,,,,,,,
5140,GAMMA,HYDROXYBUTYRIC,ACID(FDA,APPROVED-CSA,III),2012.0,12.0,37597.82,3133.15,,,,,
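
As an alternative to typing the comma by hand, the value could be re-formatted with Python's thousands-separator format specifier. This is only a sketch of one of the "many ways" mentioned above, not what the notebook does, and `add_thousands_separator` is a hypothetical helper.

```python
def add_thousands_separator(value):
    """Format a numeric string such as '3133.15' with comma separators.

    Always writes two decimal places, which matches the values shown
    in this file but may not suit every column.
    """
    return '{:,.2f}'.format(float(value))

fixed = add_thousands_separator('3133.15')
print(fixed)
```

For a single value the manual fix is just as good; a helper like this would mainly pay off if many values were affected.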


The check function also picked up this unusual value as a drug code in column 'FOR'. You might recognize it as part of a drug name, but it doesn't hurt to check it anyway and confirm that the actual drug code will get picked up in that line. 

In [569]:
activity_2017[activity_2017['FOR']=='1308.13(C)(3)]']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5099,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21CFR,1308.13(C)(3)],2100,18623,596769.0,32.04,,
5141,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21CFR,1308.13(C)(3)],2100,168,2009.9,11.96,,
5180,BARBITURIC,ACID,DERIVIATIVE,OR,SALT,[PER,21CFR,1308.13(C)(3)],2100,6,93.5,15.58,,


Same situation here: because it's not a code that's in the dictionary, it will throw a KeyError in the cleaning process. I will fudge it here by adding a comma, so that it no longer looks like a four-digit drug code. 

In [570]:
activity_2017.loc[activity_2017['FOR']=='1308.13(C)(3)]', 'FOR'] = '1,308.13(C)(3)]'

In [571]:
# this is just part of a drug name
activity_2017[activity_2017['DRUG']=='PRODU']

Unnamed: 0,A,B,C,D,E,F,G,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5104,DRONABINOL,IN,AN,ORAL,SOLUTION,IN,FDA,APPROVED,DRUG,PRODU,7365,191.0,341.25,1.79


In [572]:
activity_2017 = clean_activity_new(activity_2017, year=2017, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2017.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
3,2017,ALABAMA,PHARMACIES,AMPHETAMINE,1371,513195.41,374.32
4,2017,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,32,96.04,3.0
5,2017,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,4,14.86,3.72
6,2017,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1328,282505.33,212.73
7,2017,ALABAMA,PHARMACIES,METHYLPHENIDATE,1352,283489.85,209.68


In [612]:
# Define a new checksum function - in this case need to check 
# on the average values, no sums to check
def avgs_check(df):
    df['check'] = df['Total grams sold'].div(df['Registrants'], axis=0)
    df['diff'] = df['Avg grams/registrant'] - df['check']
    issues = df.loc[(df['diff'].abs())>0.2]
    if issues.empty:
        print('Averages check passed')
    else:
        print('Averages issues:')
        print(issues)
    df.drop(['check', 'diff'], axis=1, inplace=True)

# Define a new duplicates check function

def repeats_check_activity(df):
    df['check'] = df['Year'].astype(str)+df['State']+df['Business Activity']+df['Drug']
    checks = pd.Series(data=df['check'].value_counts())
    errors = checks.loc[checks!=1]
    if errors.empty:
        print('Repeats checks passed')
    else:
        print('Repeats errors:')
        print(errors)
    df.drop(['check'], axis=1, inplace=True)
    
def check_states(df, geos):
    """
    Compare the states present in the df with those we expect to find.
    """
    in_df = df['State'].unique()
    diff = set(geos).symmetric_difference(set(in_df))
    if diff:
        print('State values not matching:', diff)
    else:
        print("All expected state values present")
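
As a usage sketch, the averages check can also be written so that it returns the offending rows instead of printing them, and without adding temporary columns to the DataFrame. `avgs_issues` and the toy data are hypothetical; the 0.2 tolerance mirrors `avgs_check` above, and the third toy row is deliberately wrong.

```python
import pandas as pd

def avgs_issues(df, tol=0.2):
    """Return rows whose reported average differs from
    total grams / registrants by more than tol."""
    expected = df['Total grams sold'] / df['Registrants']
    diff = (df['Avg grams/registrant'] - expected).abs()
    return df.loc[diff > tol]

toy = pd.DataFrame({
    'Registrants': [4, 32, 10],
    'Total grams sold': [14.86, 96.04, 50.0],
    # first two averages are consistent; the last one is made up to fail
    'Avg grams/registrant': [3.72, 3.0, 9.0],
})

print(avgs_issues(toy))
```

The tolerance absorbs the rounding in the published averages, so only genuinely inconsistent rows are returned.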

In [601]:
activity_dfs = {'2000': activity_2000, '2001': activity_2001, 
                '2002': activity_2002, '2003': activity_2003, 
                '2004': activity_2004, '2005': activity_2005, 
                '2006': activity_2006, '2007': activity_2007, 
                '2008': activity_2008, '2009': activity_2009, 
                '2010': activity_2010, '2011': activity_2011, 
                '2012': activity_2012, '2013': activity_2013, 
                '2014': activity_2014, '2015': activity_2015, 
                '2016': activity_2016, '2017': activity_2017}

for f in activity_dfs.keys():
    print('Checking {} file...'.format(f))
    avgs_check(activity_dfs[f])
    repeats_check_activity(activity_dfs[f])
    check_states(activity_dfs[f], geos)
    print()
    print()

Checking 2000 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES', 'AMERICAN SAMOA'}


Checking 2001 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES', 'AMERICAN SAMOA'}


Checking 2002 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2003 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2004 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2005 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2006 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2007 file...
Averages check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}


Checking 2008 file...
Averages check passed
Repeats checks p

Averages check passed
Repeats errors:
2017WYOMINGPRACTITIONERSHYDROCODONE                             2
2017WYOMINGPHARMACIESCODEINE                                    2
2017WYOMINGHOSPITALSOPIUM POWDERED                              2
2017WYOMINGHOSPITALSHYDROMORPHONE                               2
2017WYOMINGHOSPITALSBUPRENORPHINE                               2
2017WYOMINGPHARMACIESDL-METHAMPHETAMINE RACEMIC BASE            2
2017WYOMINGHOSPITALSPENTOBARBITAL (SCHEDULE 2)                  2
2017WYOMINGPHARMACIESREMIFENTANIL                               2
2017WYOMINGHOSPITALSMETHADONE                                   2
2017WYOMINGMID-LEVEL PRACTITIONERSOXYCODONE                     2
2017WYOMINGPRACTITIONERSPENTOBARBITAL (SCHEDULE 2)              2
2017WYOMINGPHARMACIESLISDEXAMFETAMINE                           2
2017WYOMINGPHARMACIESBUPRENORPHINE                              2
2017WYOMINGPHARMACIESBUTALBITAL                                 2
2017WYOMINGPHARMACIESTAPENTADOL       

The repeats checking function turned up a lot to investigate, most of which is probably related to the 9041 / 9041L cocaine codes, but there are some other things to look at. 
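
Because the printed keys are concatenated strings, it can be easier to pull the full duplicated rows when investigating. The sketch below uses `DataFrame.duplicated` with `keep=False` to do that; `duplicated_rows` is a hypothetical helper, and the toy rows reuse a few of the 2010 Oregon values from this notebook.

```python
import pandas as pd

KEY_COLS = ['Year', 'State', 'Business Activity', 'Drug']

def duplicated_rows(df, key_cols=KEY_COLS):
    """Return every row whose key columns appear more than once."""
    return df[df.duplicated(subset=key_cols, keep=False)]

toy = pd.DataFrame({
    'Year': [2010, 2010, 2010],
    'State': ['OREGON', 'OREGON', 'OREGON'],
    'Business Activity': ['PRACTITIONERS', 'PRACTITIONERS', 'HOSPITALS'],
    'Drug': ['COCAINE', 'COCAINE', 'COCAINE'],
    'Registrants': [1, 20, 49],
    'Total grams sold': [0.0, 89.85, 502.04],
})

# Both PRACTITIONERS rows share the same key, so both are returned.
print(duplicated_rows(toy))
```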

Starting with 2010: Oregon-practitioners-cocaine and California-practitioners-cocaine. 

In [605]:
activity_2010.loc[(activity_2010['State']=='OREGON')
             & (activity_2010['Drug']=='COCAINE')]

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
3770,2010,OREGON,PHARMACIES,COCAINE,14,156.45,11.18
3801,2010,OREGON,HOSPITALS,COCAINE,49,502.04,10.25
3826,2010,OREGON,PRACTITIONERS,COCAINE,1,0.0,0.0
3827,2010,OREGON,PRACTITIONERS,COCAINE,20,89.85,4.49


In [606]:
check_2010 = pd.read_csv('activity_2010.txt', delim_whitespace=True)
check_2010.iloc[3825:3829]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
3825,PENTOBARBITAL,(SCHEDULE,2),2270.0,130.0,107185.27,824.5,,,,,,,
3826,COCAINE,9041,1,0.0,0.0,,,,,,,,,
3827,COCAINE,9041L,20,89.85,4.49,,,,,,,,,
3828,CODEINE,9050,62,1128.29,18.2,,,,,,,,,


In [607]:
activity_2010.loc[(activity_2010['State']=='CALIFORNIA')
             & (activity_2010['Drug']=='COCAINE')]

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
403,2010,CALIFORNIA,PHARMACIES,COCAINE,82,734.1,8.95
433,2010,CALIFORNIA,HOSPITALS,COCAINE,424,4669.99,11.01
464,2010,CALIFORNIA,PRACTITIONERS,COCAINE,1,0.0,0.0
465,2010,CALIFORNIA,PRACTITIONERS,COCAINE,122,392.49,3.22
491,2010,CALIFORNIA,TEACHING INSTITUTIONS,COCAINE,1,4.46,4.46


In [608]:
check_2010.iloc[463:466]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
463,PENTOBARBITAL,(SCHEDULE,2),2270.0,507.0,1745954.32,3443.7,,,,,,,
464,COCAINE,9041,1,0.0,0.0,,,,,,,,,
465,COCAINE,9041L,122,392.49,3.22,,,,,,,,,


Since both duplicates come from the 9041 code and report zero grams, I will drop them. 

In [609]:
activity_2010 = activity_2010.drop(labels = [3826, 464], axis='index')
del check_2010

In [613]:
repeats_check_activity(activity_2010)

Repeats checks passed


There are so many results for 2011 - 2015 that are related to cocaine that it is easiest to drop out any rows that are cocaine but have zero grams distributed, and then run the repeats check again. 

In [617]:
activity_2011 = activity_2011.drop(activity_2011.loc[(activity_2011['Drug']=='COCAINE')
                 & (activity_2011['Total grams sold']==0)].index)

In [618]:
repeats_check_activity(activity_2011)

Repeats checks passed


In [619]:
activity_2012 = activity_2012.drop(activity_2012.loc[(activity_2012['Drug']=='COCAINE')
                 & (activity_2012['Total grams sold']==0)].index)

In [620]:
repeats_check_activity(activity_2012)

Repeats checks passed


In [623]:
activity_2013 = activity_2013.drop(activity_2013.loc[(activity_2013['Drug']=='COCAINE')
                 & (activity_2013['Total grams sold']==0)].index)

In [624]:
repeats_check_activity(activity_2013)

Repeats checks passed


In [625]:
activity_2014 = activity_2014.drop(activity_2014.loc[(activity_2014['Drug']=='COCAINE')
                 & (activity_2014['Total grams sold']==0)].index)

In [626]:
repeats_check_activity(activity_2014)

Repeats checks passed


In [628]:
activity_2015 = activity_2015.drop(activity_2015.loc[(activity_2015['Drug']=='COCAINE')
                 & (activity_2015['Total grams sold']==0)].index)

In [629]:
repeats_check_activity(activity_2015)

Repeats checks passed
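
The five near-identical cells above could be collapsed into one small helper. This is a sketch under the same rule (drop rows that are cocaine and report zero grams); `drop_zero_cocaine` is a hypothetical name, and unlike the cells above it returns a filtered copy rather than calling `drop` on an index.

```python
import pandas as pd

def drop_zero_cocaine(df):
    """Return a copy of df without COCAINE rows that report zero grams."""
    mask = (df['Drug'] == 'COCAINE') & (df['Total grams sold'] == 0)
    return df.loc[~mask].copy()

toy = pd.DataFrame({
    'Drug': ['COCAINE', 'COCAINE', 'CODEINE'],
    'Total grams sold': [0.0, 89.85, 0.0],
})

# Only the zero-gram COCAINE row is removed; the zero-gram CODEINE row stays.
print(drop_zero_cocaine(toy))
```

Each year's DataFrame would still need to be reassigned to the result, followed by a rerun of `repeats_check_activity`.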


For the 2017 data, there are repeating rows for a number of drugs that are not cocaine, and they are all in the state of Wyoming, so we need to examine them more carefully.

In [631]:
activity_2017.loc[(activity_2017['State']=='WYOMING')
                 & (activity_2017['Drug']=='CODEINE')
                 & (activity_2017['Business Activity']=='PHARMACIES')]

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
5028,2017,WYOMING,PHARMACIES,CODEINE,119,14898.72,125.2
5107,2017,WYOMING,PHARMACIES,CODEINE,65917,15427479.78,234.04


In [632]:
check_2017 = pd.read_csv('activity_2017.txt', delim_whitespace=True)
check_2017.iloc[5026:5030]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5026,BUTALBITAL,2165,46,1605.0,34.89,,,,,,,,,
5027,COCAINE,9041L,2,21.14,10.57,,,,,,,,,
5028,CODEINE,9050,119,14898.72,125.2,,,,,,,,,
5029,ETORPHINE,9056,1,57.0,57.0,,,,,,,,,


In [633]:
check_2017.iloc[5105:5110]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5105,NABILONE,7379,65,13.75,0.21,,,,,,,,,
5106,COCAINE,9041L,206,1408.73,6.84,,,,,,,,,
5107,CODEINE,9050,65917,15427479.78,234.04,,,,,,,,,
5108,ETORPHINE,9056,1,57.0,57.0,,,,,,,,,
5109,BUPRENORPHINE,9064,59274,3179903.93,53.65,,,,,,,,,


This looks like it may actually be a forward fill issue.

In [636]:
check_2017.iloc[5000:5030]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5000,OPIUM,POWDERED,9639,7,37.08,5.3,,,,,,,,
5001,OXYMORPHONE,9652,10,10.03,1,,,,,,,,,
5002,ALFENTANIL,9737,14,4.06,0.29,,,,,,,,,
5003,REMIFENTANIL,9739,15,0.88,0.06,,,,,,,,,
5004,SUFENTANIL,BASE,9740,3,0.01,0,,,,,,,,
5005,FENTANYL,BASE,9801,337,48.78,0.14,,,,,,,,
5006,STATE:WISCONSIN,BUSINESS,ACTIVITY:M,-,MID-LEVEL,PRACTITIONERS,,,,,,,,
5007,DRUG,NAME,DRUG,CODE,BUYERS,TOTAL,GRAMS,AVG,GRAMS,,,,,
5008,PENTOBARBITAL,(SCHEDULE,2),2270,21,12552.62,597.74,,,,,,,
5009,CODEINE,9050,1,2.23,2.23,,,,,,,,,


In [635]:
check_2017.iloc[5067:5107]

Unnamed: 0,ARCOS,3,-,REPORT,5,STATISTICAL,SUMMARY,FOR,RETAIL,DRUG,PURCHASES,BY,GRAMS,WT
5067,FENTANYL,BASE,9801,45,84.7,1.88,,,,,,,,
5068,STATE:WYOMING,BUSINESS,ACTIVITY:C,-,PRACTITIONERS,,,,,,,,,
5069,DRUG,NAME,DRUG,CODE,BUYERS,TOTAL,GRAMS,AVG,GRAMS,,,,,
5070,PENTOBARBITAL,(SCHEDULE,2),2270,38,40259.73,1059.47,,,,,,,
5071,CODEINE,9050,13,218.35,16.8,,,,,,,,,
5072,BUPRENORPHINE,9064,58,2.04,0.04,,,,,,,,,
5073,OXYCODONE,9143,1,9.75,9.75,,,,,,,,,
5074,HYDROMORPHONE,9150,40,18.51,0.46,,,,,,,,,
5075,HYDROCODONE,9193,26,186.32,7.17,,,,,,,,,
5076,MEPERIDINE,(PETHIDINE),9230,4,10.02,2.51,,,,,,,,


Turns out this was a forward fill problem, so the duplicate checker worked exactly as intended. In row 5091 you can see that we are getting US total data, but its header didn't get picked up properly in cleaning, so the Wyoming label was carried forward and all those rows were marked as Wyoming data.

The reason for this is that the 2017 file deviated from the expected format of having "STATE:" concatenated with the geography name in the first column. 
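
The toy example below illustrates how a missed header turns into duplicates once the state label is forward filled. It is a sketch only: the rule "a header row starts with 'STATE:'" and the layout of the unrecognized US total header are assumptions, since neither the cleaning function nor the exact 2017 header is reproduced in this section.

```python
import pandas as pd

raw = pd.DataFrame({
    'A': ['STATE:WISCONSIN', 'CODEINE', 'STATE:WYOMING', 'CODEINE',
          'UNITED', 'CODEINE'],          # 'UNITED' row: hypothetical header
    'B': [None, '9050', None, '9050', 'STATES', '9050'],
})

# Assumed rule: only rows whose first column starts with 'STATE:' are headers.
is_header = raw['A'].str.startswith('STATE:')

# Take the state name from header rows, then carry it down to the data rows.
raw['State'] = raw['A'].where(is_header).str.replace('STATE:', '', regex=False)
raw['State'] = raw['State'].ffill()

# The header without 'STATE:' is missed, so the final CODEINE row is
# labelled WYOMING and duplicates the genuine Wyoming row.
print(raw.loc[raw['A'] == 'CODEINE', ['A', 'B', 'State']])
```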

The cleaning function for new files now includes a line to handle this specific case. Below I'll rerun the 2017 cleaning.

In [638]:
activity_2017 = pd.read_csv('activity_2017.txt', delim_whitespace=True)
check_activity_new(activity_2017, flat_geos=flat_geos, 
                   flat_activity_codes=flat_activity_codes)

Unusual numerical values in column A:
['30,411,958.46']

Unexpected activity code  in column E:
Unusual numerical values in column A
['30,411,958.46']

Unexpected value in column B for states: BUSINESS
Unexpected value in column C for states: ACTIVITY:A
Unexpected value in column C for states: ACTIVITY:B
Unexpected value in column C for states: ACTIVITY:C
Unexpected value in column C for states: ACTIVITY:D
Unexpected value in column C for states: ACTIVITY:M
Unexpected value in column C for states: ACTIVITY:N-U
Unexpected value in column C for states: BUSINESS
Checking column B
Drug codes found in column B:
['1100', '1105D', '1205', '2165', '9050', '9064', '9120', '9143', '9150', '9193', '9220L', '9250B', '9300', '9652', '9780', '9041L', '9737', '9739', '7379', '9668', '9333', '9056']
non-numeric values in column B:
['Date:' 'BUSINESS' 'NAME' 'RACEMIC' '(DL;D;L;ISOMERS)' 'ACID' 'IN'
 '(PETHIDINE)' 'POWDERED' 'BASE' 'COMBINATION' '(SCHEDULE' 'RANGE:'
 'SUMMARY' 'SAMOA' 'TINCTURE' 'HYDROX

In [639]:
activity_2017.loc[2449, 'FOR'] = activity_2017.loc[2450, 'A']
activity_2017.loc[5140, 'RETAIL'] = '3,133.15'
activity_2017.loc[activity_2017['FOR']=='1308.13(C)(3)]', 'FOR'] = '1,308.13(C)(3)]'
activity_2017 = clean_activity_new(activity_2017, year=2017, 
                                   drug_codes=drug_codes, flat_geos=flat_geos, 
                                   activity_codes=activity_codes)
activity_2017.head()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
3,2017,ALABAMA,PHARMACIES,AMPHETAMINE,1371,513195.41,374.32
4,2017,ALABAMA,PHARMACIES,DL-METHAMPHETAMINE RACEMIC BASE,32,96.04,3.0
5,2017,ALABAMA,PHARMACIES,D-METHAMPHETAMINE,4,14.86,3.72
6,2017,ALABAMA,PHARMACIES,LISDEXAMFETAMINE,1328,282505.33,212.73
7,2017,ALABAMA,PHARMACIES,METHYLPHENIDATE,1352,283489.85,209.68


In [640]:
avgs_check(activity_2017)
repeats_check_activity(activity_2017)
check_states(activity_2017, geos)

Averages check passed
Repeats checks passed
All expected state values present


In [641]:
activity_dfs = {'2000': activity_2000, '2001': activity_2001, 
                '2002': activity_2002, '2003': activity_2003, 
                '2004': activity_2004, '2005': activity_2005, 
                '2006': activity_2006, '2007': activity_2007, 
                '2008': activity_2008, '2009': activity_2009, 
                '2010': activity_2010, '2011': activity_2011, 
                '2012': activity_2012, '2013': activity_2013, 
                '2014': activity_2014, '2015': activity_2015, 
                '2016': activity_2016, '2017': activity_2017}


activity_all = pd.concat(list(activity_dfs.values()), ignore_index=True)

# to be consistent with the other data, we'll drop the US totals
activity_all = activity_all.drop(activity_all[activity_all['State']=='UNITED STATES'].index)


activity_all.to_csv('distribution_by_activity.csv', index=False)

Data cleansing finished! On to some exploratory data analysis and visualization.

In [643]:
activity_all.tail()

Unnamed: 0,Year,State,Business Activity,Drug,Registrants,Total grams sold,Avg grams/registrant
61347,2017,WYOMING,TEACHING INSTITUTIONS,BUPRENORPHINE,1,0.0,0.0
61348,2017,WYOMING,MID-LEVEL PRACTITIONERS,PENTOBARBITAL (SCHEDULE 2),2,177.72,88.86
61349,2017,WYOMING,MID-LEVEL PRACTITIONERS,CODEINE,1,10.8,10.8
61350,2017,WYOMING,MID-LEVEL PRACTITIONERS,OXYCODONE,1,0.54,0.54
61351,2017,WYOMING,MID-LEVEL PRACTITIONERS,HYDROCODONE,1,0.36,0.36
