# Cleaning messy PDF data with pandas and Jupyter notebooks
## Part 1 - DEA ARCOS Report 1: Retail Drug Distribution by Zip Code for Each State

### Background

#### What is ARCOS?
The DEA publishes data annually from its Automation of Reports and Consolidated Orders System, or ARCOS. According to the DEA's website, ARCOS "monitors the flow of DEA controlled substances from their point of manufacture through commercial distribution channels to point of sale or distribution at the dispensing/retail level - hospitals, retail pharmacies, practitioners, mid-level practitioners, and teaching institutions....these transactions...are then summarized into reports which give investigators in Federal and state government agencies information which can then be used to identify the diversion of controlled substances into illicit channels of distribution. The information on drug distribution is used throughout the United States (U.S.) by U.S. Attorneys and DEA investigators to strengthen criminal cases in the courts."

So, ARCOS exists to help the government identify patterns in the manufacture and distribution of controlled substances that might indicate that these substances are being sold illegally. Annual ARCOS reports are publicly available on the DEA's website, dating back to the year 2000, but unfortunately they are only available in PDF form and are dozens or even hundreds of pages long. 

#### What's in this notebook?
I was interested in doing some data analysis and visualization on the distribution of oxycodone, an opioid painkiller that is one of the main drivers of the current prescription pain pill (and arguably heroin) addiction epidemic in the United States. 

Aside from a wealth of fascinating (and sometimes disturbing, sad, and frightening) data to explore, the ARCOS data also presents a great data cleansing challenge, given that it is distributed in PDFs - the perfect opportunity to practice your pandas skills, for example. Luckily, the files tend to have nearly identical formatting, aside from a shift in report formatting in 2006 and a few anomalies here and there.

This notebook is meant to demo the functionality of pandas and Jupyter notebooks for data cleaning - working with this data was a great project for me to improve my pandas skills and I'm sharing the code here so others can learn and practice. 

This is the demo / walkthrough version of the notebook - I made a workbook version that you can use to work through the cleaning steps yourself, referencing this copy to see the solutions I found for tackling the data cleansing challenges in this data.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import pickle

### Notes on the data 

Get the raw data (in PDF....!)
You can find the ARCOS reports here: https://www.deadiversion.usdoj.gov/arcos/retail_drug_summary/index.html

There are six ARCOS reports published each year and I chose to work with three of them in particular:
* Report 1:  Retail Drug Distribution by Zip Code for Each State - total drug amounts (in grams) distributed to retail registrants in each state, by 'gateway' zip code (the first three numbers of the zip), on a quarterly basis
* Report 3: Quarterly Distribution in Grams per 100K Population - quarterly drug consumption in grams per 100,000 population, by state
* Report 5: Statistical Summary for Retail Drug Purchases - average annual purchases by drug by business activity (pharmacy, hospital, etc.)


A few notes: 

* For years before 2006, the reports are lumped together into one giant PDF (700+ pages long). In more recent years they have elected to publish a separate PDF for each report. 

* I tried several approaches for simply getting the text out of the PDF - for a variety of reasons (in particular the unwieldy nature of the pre-2006 PDFs), it was easiest and quickest to just copy-paste the entire contents of the PDF into a text file. This was an OK solution for me since there aren't that many of them - if you were doing this with hundreds of files you would want to find another way. Another problem I ran into right away was the length of the title running onto multiple lines in the txt file and causing a lot of formatting challenges in a dataframe, so I manually adjusted the title text in each txt file. 

* For the pre-2006 reports, I (manually and carefully) removed the report content I wasn't interested in from the text file, and then used pandas to clean what remained. 

### Cleaning Report 1 - Retail Drug Distribution by Zip Code for Each State

These reports I'll refer to as the "zip reports" as they are the only ones at the gateway zip code level (others are at state level). Overall, these were definitely the cleanest and easiest to work with of the three reports, and a good place to start. 

#### Step 1 - Getting from PDF into pandas in the notebook

What to consider and experiment with:
* How will you pull the data out of the PDF? How much of the formatting (columns, headers, etc) will you be able to preserve?
* What delimiter works best?
* If the number of PDF files is small, are there any steps you can perform right in the txt or spreadsheet file that will make things easier?

There are different options for getting data from a PDF into a format you can interact with more directly. I ended up just copy-pasting the full contents of each file as it didn't seem that some of the PDF-to-spreadsheet/other tools out there would really save me that much time. 

I tried several text editors and spreadsheet applications, looking for something that would do a relatively good job delimiting the data based on the PDF files. Sublime is my favorite and that's what I used in the end. 

Tips
* Try a couple different editors and delimit options, and read each one into pandas to see how the structure of the data looks. Choose one that will minimize the amount of cleaning you need to do
* Keep your .txt file open as you begin cleaning in pandas
* Never save over your raw .txt file! This is a trial-and-error process and you will likely end up losing some data at one point or another. If you've saved over the starting point you will have to go back to your PDF...

In [40]:
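# I experimented with different delimiters and found whitespace to require the least amount of additional cleaning
# sep=r'\s+' splits on runs of whitespace (delim_whitespace=True does the same but is deprecated in recent pandas)
zip_2000 = pd.read_csv('../data/report-1-zipcode/zip_2000.txt', sep=r'\s+')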
zip_2000.head(10)

Unnamed: 0,ARCOS,2,-,REPORT,1,RETAIL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,,,
1,STATE:,ALASKA,,,,,,,,,,,,
2,ZIP,CODE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE,
3,----------------------------------------------...,,,,,,,,,,,,,
4,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,


Not too pretty. I removed some extra text from the start of the report and moved the title all onto one line so that there would be enough columns when delimiting on spaces - I did this when I created the text files but you could do it as part of your pandas workflow instead. 
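
If you would rather handle the column-count problem inside pandas than edit the title line by hand, one option is to pass an explicit list of column names that is at least as wide as the longest row. This is only a sketch under assumptions: the sample text is made up to resemble the pasted report, and the width of 14 is a guess that would need checking against each real file.

```python
import io
import pandas as pd

# Made-up sample shaped like the pasted report text: rows have different token counts
sample = """STATE: ALASKA
ZIP CODE 1ST QUARTER 2ND QUARTER 3RD QUARTER 4TH QUARTER TOTAL TO DATE
995 416.16 396.63 433.46 423.54 1669.79
"""

# Supplying `names` together with header=None fixes the column count up front,
# so the first line of the file no longer has to be the widest one.
n_cols = 14  # assumption: at least as many columns as the longest row (13 tokens here)
ragged = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None,
                     names=list(range(n_cols)), dtype=str)

print(ragged.shape)
print(ragged.iloc[:, :6])
```

Short rows are padded with NaN on the right, which is the same shape the hand-edited title line produces.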

I dumped a couple more of the older years into .txt files and checked to make sure they looked essentially the same, then I worked on constructing a function that would clean this first file (and hopefully all the others). This involved *a lot* of trial and error - which is where the notebook shines. Take advantage of the ability to run small pieces of code, and check the effects on your data frame at each step to make sure you haven't lost or overwritten data where you didn't mean to. 


#### Step 2 - begin cleaning, starting with readability
First, let's start with renaming the columns. We can already see that the data we really want is in the first five columns, so we'll name them according to what data is mostly in each column. 

This is a good time to make sure you really understand how pandas does operations on a dataframe, and what it returns when it does so. 

Typically, when you apply an operation (like the column renaming below), pandas does not apply it directly to the dataframe. Instead, it essentially returns a copy of the df with the operation applied. This might seem strange and can cause confusion at first, but it protects you from accidentally modifying your data in a way that you can't reverse. 

So, if you're sure you want to perform an operation, you can handle it in two ways - either by telling pandas to do it "in place" (as below), or by replacing the df (shown in commented-out code).
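
To see the difference between the two approaches on something small, here is a sketch with a made-up two-row dataframe (the names `toy` and `new_names` are just for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'ARCOS': ['995', '996'], '2': ['416.16', '102.76']})
new_names = {'ARCOS': 'Zip', '2': 'Q1'}

# rename() hands back a new dataframe; `toy` itself is untouched
renamed = toy.rename(columns=new_names)
cols_before = list(toy.columns)       # still the old names
cols_renamed = list(renamed.columns)  # the new names

# with inplace=True, `toy` is modified and the call returns None,
# so the result should not be assigned back to the variable
returned = toy.rename(columns=new_names, inplace=True)
cols_after = list(toy.columns)

print(cols_before, cols_renamed, returned, cols_after)
```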

In [41]:
zip_2000.rename(columns={'ARCOS': "Zip", 
                         '2':'Q1', 
                         '-': 'Q2', 
                         'REPORT': 'Q3', 
                         '1':'Q4', 
                         'RETAIL':'TOTAL'},
                inplace=True)

# alternate way to update the df: reassign the returned copy
# (leave out inplace=True here - with it, rename() returns None)
# zip_2000 = zip_2000.rename(columns={'ARCOS': "Zip", 
#                         '2':'Q1', 
#                         '-': 'Q2', 
#                         'REPORT': 'Q3', 
#                         '1':'Q4', 
#                         'RETAIL':'TOTAL'})
# you could also use the above and save the updated df off with a different name

zip_2000.head(15)

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,,,
1,STATE:,ALASKA,,,,,,,,,,,,
2,ZIP,CODE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE,
3,----------------------------------------------...,,,,,,,,,,,,,
4,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,


#### Step 3 - check for unusual / irregular lines of data

Before proceeding to move anything around or drop any data, it's a good idea to check for anything weird that might be going on. As we will see later, although this data looks fairly good, it's so large that it's impossible to guarantee that you'd visually spot any irregularities.

The best approach is to start with an assumption about the kind of data you think should *mostly* be present in each column and look for anything that doesn't match that. 

For this data:

* "Zip" column should mostly be three-digit numbers
* "Q1", "Q2", "Q3", "Q4", and "TOTAL" should all mostly be numbers with two decimal places
* The rest of the columns should mostly be NaNs

All of this is a little harder because right now the whole dataframe is string values, so it's a good time to practice your regular expressions!
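
Before running the patterns against the whole dataframe, it can help to try them on a handful of strings. The sketch below uses plain `re` and a made-up list of sample values. pandas' `str.match` anchors at the start of each string the same way `re.match` does.

```python
import re

# The two patterns used in the checks in this step
not_all_digits = r'(?!^\d+$)^.+$'          # anything that is NOT purely digits
decimal_number = r'[-+]?[0-9,]*\.[0-9]+$'  # optional sign, digits/commas, then a decimal part

samples = ['995', 'STATE:', '416.16', '1,669.79', '2', '12/26/2002']

# Values the "Zip" check would flag as not being a plain number
flagged_as_non_zip = [s for s in samples if re.match(not_all_digits, s)]

# Values the quarter/total checks would accept as decimals
looks_decimal = [s for s in samples if re.match(decimal_number, s)]

print(flagged_as_non_zip)
print(looks_decimal)
```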

In [42]:
# check for any cells that don't contain only numbers
# return just the set of unique values since there will be many rows with the same contents
zip_2000['Zip'].loc[zip_2000['Zip'].str.match(r'(?!^\d+$)^.+$')].unique()

array(['REPORTING', 'STATE:', 'ZIP',
       '------------------------------------------------------------------------------------------------------------------------------------',
       'DRUG', 'STATE', 'DATE:', 'ARCOS', 'RETAIL'], dtype=object)

In [43]:
# check for any cells that don't end in a decimal
# allow commas before the decimal
zip_2000['Q1'].loc[~zip_2000['Q1'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array(['PERIOD:', 'ALASKA', 'CODE', nan, 'CODE:', 'TOTAL', '12/26/2002',
       'ENFORCEMENT', '2', 'DRUG', 'ALABAMA', 'ARKANSAS', 'ARIZONA',
       'CALIFORNIA', 'COLORADO', 'CONNECTICUT', 'DISTRICT', 'DELAWARE',
       'FLORIDA', 'GEORGIA', 'HAWAII', 'IOWA', 'IDAHO', 'ILLINOIS',
       'INDIANA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MASSACHUSETTS',
       'MARYLAND', 'MAINE', 'MICHIGAN', 'MINNESOTA', 'MISSOURI',
       'MISSISSIPPI', 'MONTANA', 'NEBRASKA', 'NORTH', 'NEW', 'NEVADA',
       'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 'PUERTO', 'RHODE',
       'SOUTH', 'TENNESSEE', 'TRUST', 'TEXAS', 'UTAH', 'VIRGINIA',
       'VIRGIN', 'VERMONT', 'WASHINGTON', 'WISCONSIN', 'WEST', 'WYOMING'],
      dtype=object)

The only thing that might be odd here is the '2' value, so check to see what those rows look like - turns out it's just header data from the report, nothing to worry about. 

In [44]:
zip_2000[zip_2000['Q1']=='2']

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
34,ARCOS,2,-,REPORT,1,,,,,,,,,
49,ARCOS,2,-,REPORT,1,,,,,,,,,
94,ARCOS,2,-,REPORT,1,,,,,,,,,
139,ARCOS,2,-,REPORT,1,,,,,,,,,
180,ARCOS,2,-,REPORT,1,,,,,,,,,
222,ARCOS,2,-,REPORT,1,,,,,,,,,
264,ARCOS,2,-,REPORT,1,,,,,,,,,
289,ARCOS,2,-,REPORT,1,,,,,,,,,
331,ARCOS,2,-,REPORT,1,,,,,,,,,
371,ARCOS,2,-,REPORT,1,,,,,,,,,


In [45]:
zip_2000['Q2'].loc[~zip_2000['Q2'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array(['01/01/2000', nan, '1ST', '1100B', '1100D', '1724', '9143',
       'DEPARTMENT', 'ADMINISTRATION', '-', 'DISTRIBUTION', '9193', 'OF',
       'CAROLINA', 'DAKOTA', 'HAMPSHIRE', 'JERSEY', 'MEXICO', 'YORK',
       'RICO', 'ISLAND', 'TERRITORIES', 'ISLANDS', 'VIRGINIA'],
      dtype=object)

In [46]:
zip_2000['Q3'].loc[~zip_2000['Q3'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array(['TO', nan, 'QUARTER', 'DRUG', 'OF', 'REPORT', 'BY', 'COLUMBIA',
       '(GUAM)'], dtype=object)

In [47]:
zip_2000['Q4'].loc[~zip_2000['Q4'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array(['12/31/2000', nan, '2ND', 'NAME:', 'JUSTICE', '1', 'ZIP'],
      dtype=object)

In [48]:
zip_2000['TOTAL'].loc[~zip_2000['TOTAL'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array([nan, 'QUARTER', 'DL-AMPHETAMINE', 'D-AMPHETAMINE',
       'METHYLPHENIDATE', 'OXYCODONE', 'PAGE:', 'CODE', 'HYDROCODONE'],
      dtype=object)

In [49]:
# because of the shifted over data, 
# there should be a good number of numeric values still in this column
zip_2000['DRUG'].loc[~zip_2000['DRUG'].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()

array([nan, '3RD', 'BASE', '2', 'FOR', '3', '4', '5', '6', '7', '8', '9',
       '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
       '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31',
       '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42',
       '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53',
       '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64',
       '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75',
       '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86',
       '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97',
       '98', '99', '100', '101', '102', '103', '104', '105', '106', '107',
       '108', '109', '110', '111', '112', '113', '114', '115', '116',
       '117', '118', '119', '120', '121', '122', '123', '124', '125',
       '126', '127', '128', '129', '130', '131', '132', '133', '134',
       '135', '136', '137', '138', '139', '140', '141', '

We aren't really expecting to see so many numbers in this format in this column. Notice they are sequential... Pick one to check... and it turns out they are page numbers from the header lines in the original PDF. 

In [50]:
zip_2000[zip_2000['DRUG']=='11']

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
369,DATE:,12/26/2002,DEPARTMENT,OF,JUSTICE,PAGE:,11,,,,,,,


In [51]:
# for these last cols, we really just want to check the non-null values
col_checks = ['DISTRIBUTION', 'BY', 'ZIP', 'CODE', 'FOR', 'EACH', 'STATE']

for c in col_checks:
    print("checking column {}".format(c))
    print(zip_2000[c].loc[pd.notnull(zip_2000[c])].unique())


checking column DISTRIBUTION
['QUARTER' 'EACH']
checking column BY
['4TH' 'STATE']
checking column ZIP
['QUARTER']
checking column CODE
['TOTAL']
checking column FOR
['TO']
checking column EACH
['DATE']
checking column STATE
[]


Even though everything looks fine here, that doesn't mean it will be true for all the files. So it's a good idea to package up this code and reuse it each time we read in a new file. Check farther down in the notebook for some interesting stuff these checks will catch. 

In [52]:
def check_data_old(df):
    """
    Use this to check data quality for files from 2000-2005 inclusive.
    """
    df.rename(columns={'ARCOS': "Zip", 
                         '2': 'Q1',
                         '-': 'Q2', 
                         'REPORT': 'Q3', 
                         '1':'Q4', 
                         'RETAIL':'TOTAL'},
                inplace=True)
    
    print("Zip column should contain mostly 3-digit gateway zip codes.")
    print(df['Zip'].loc[df['Zip'].str.match(r'(?!^\d+$)^.+$')].unique())
    print()
    print()
    print("Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values to two decimal places.")
    cols1 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    for c in cols1:
        print("Checking column {}".format(c))
        print(df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique())
        print()
    
    print()
    print("For remaining columns, mostly nulls are expected.")
    cols2 = ['DISTRIBUTION', 'BY', 'ZIP', 'CODE', 'FOR', 'EACH', 'STATE']

    for c in cols2:
        print("checking column {}".format(c))
        print(df[c].loc[pd.notnull(df[c])].unique())
        print()

#### Step 3 - fix "shifted" data
The next obvious problem is that in each of the state total rows (see row 10 above, for example), the data has been bumped over by one column. 

Challenge:
* It's not as simple as replacing any occurrence of "STATE:" in the first column with the value in the next column - don't forget we have states whose names are more than one word (and therefore cell) long!

Notes:

* This could be handled many ways, but obviously we want to avoid looping over a large dataframe, so we want to take advantage of operations that pandas can do over the entire dataframe. There's likely a more elegant solution, but the below will work just fine. 

* The third line is making use of a great pandas idiom that we'll use frequently for if-then assignment of values into a column (without looping), and is also a good place to check your understanding of .loc and .iloc. 
    * loc selects by index label - so the label could be numeric, or not
    * iloc selects by integer position, regardless of the labels

    The idiom does if-then assignment on one column, like so:
    df.loc[df.AAA >= 5, 'BBB'] = -1


    Read the details of the idiom and related variations in the cookbook:
    https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html
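
A small sketch of the difference between .loc and .iloc, and of the if-then idiom, on a made-up dataframe (the column names AAA and BBB follow the cookbook example; the index values are arbitrary):

```python
import pandas as pd

# Toy frame with a non-default index so that labels and positions differ
df = pd.DataFrame({'AAA': [4, 5, 6], 'BBB': [10, 20, 30]}, index=[10, 20, 30])

by_label = df.loc[20, 'AAA']   # the row whose index label is 20
by_position = df.iloc[0, 0]    # the first row and first column, whatever their labels

# if-then assignment without a loop:
# wherever AAA >= 5, set BBB to -1
df.loc[df['AAA'] >= 5, 'BBB'] = -1

print(by_label, by_position)
print(df)
```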

In [53]:
# set the zipcode value to "TOTAL" for the rows in question 
zip_2000.loc[zip_2000['Q1']=='TOTAL', 'Zip'] = zip_2000['Q1']

# iterate through the following columns and shift the values in each row over by one 
shift = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
for i in range(0,5):
    zip_2000.loc[zip_2000['Zip']=='TOTAL', shift[i]] = zip_2000[shift[i+1]]


zip_2000.head(15)

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,,,
1,STATE:,ALASKA,,,,,,,,,,,,
2,ZIP,CODE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE,
3,----------------------------------------------...,,,,,,,,,,,,,
4,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,
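
The first rows shown here don't include a state total row, so the effect of the shift is easier to see on a toy example. The sketch below uses two made-up rows shaped like the real data and, as one possible alternative to the column-by-column loop, moves all the values in a single assignment. It is an illustration under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# One ordinary row and one "STATE TOTAL" row whose numbers sit one column too far right
toy = pd.DataFrame({
    'Zip':   ['999', 'STATE'],
    'Q1':    ['4.28', 'TOTAL'],
    'Q2':    ['4.59', '670.99'],
    'Q3':    ['2.93', '606.22'],
    'Q4':    ['10.63', '645.30'],
    'TOTAL': ['22.43', '707.32'],
    'DRUG':  [None, '2629.83'],
})

is_total = toy['Q1'] == 'TOTAL'
toy.loc[is_total, 'Zip'] = 'TOTAL'

cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
# Copy each value one column to the left in a single step;
# to_numpy() drops the column labels so the values are assigned by position
toy.loc[is_total, cols[:-1]] = toy.loc[is_total, cols[1:]].to_numpy()

print(toy)
```

As with the loop, the last column ('DRUG') keeps its old value, so the total appears twice in the shifted row.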


#### Step 4 - Getting the state names 

I know I want a column to indicate the state, year, and drug, so I'll add those. The data for each of these columns is all there, but it's jumbled around so organizing it is the next step.

By examining the raw data, I can see that state names always appear in a certain consistent way: the space delimiting has split any state names that are more than one word into separate sequential columns, and all the rest of the cells in the row are NaNs. That means pulling the state names out is relatively simple, as long as we take into account those longer names (and don't forget "District of Columbia"!). Again you could tackle this in many ways; I wrote a simple rule for each case.

In [54]:
# Insert year, state, and drug columns
zip_2000.insert(column='Year', loc=0, value=2000)
zip_2000.insert(column='State', loc=1, value=None)
zip_2000.insert(column='Drug', loc=2, value=None)
zip_2000.insert(column='Drug Code', loc=3, value=None)


# If "STATE:" is in the "Zip" column, I can get the state name from this row
# note - be careful - there were also cells in this column with the value "STATE"...
# The state name will be in the cell(s) following, 
# and all the rest of the cells in the row should be NaNs
zip_2000.loc[zip_2000['Zip']=='STATE:', 'State'] = zip_2000['Q1']

# If "STATE:" is in the "Zip" column but "Q2" column isn't a NaN, 
# then it's a two-word state
zip_2000.loc[(zip_2000['Zip']=="STATE:") & 
             (pd.notnull(zip_2000['Q2'])), 'State'] = zip_2000["State"]+" "+zip_2000['Q2']

# If "STATE:" is in the "Zip" column and "Q3" isn't NaN either, 
# then it's a three-word state (the second word was already appended above)
zip_2000.loc[(zip_2000['Zip']=="STATE:") & 
             (pd.notnull(zip_2000['Q3'])), 'State'] = zip_2000["State"]+" "+zip_2000['Q3']

zip_2000.head(15)

Unnamed: 0,Year,State,Drug,Drug Code,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
0,2000,,,,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,,,
1,2000,ALASKA,,,STATE:,ALASKA,,,,,,,,,,,,
2,2000,,,,ZIP,CODE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE,
3,2000,,,,----------------------------------------------...,,,,,,,,,,,,,
4,2000,,,,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,2000,,,,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,2000,,,,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,2000,,,,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,2000,,,,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,2000,,,,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,
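
As one example of the "many ways" mentioned above, the three rules could also be collapsed into a single step that joins whatever non-null words follow "STATE:". This is a sketch on a made-up dataframe, assuming (as in this file) that a state name never spans more than three cells:

```python
import pandas as pd

toy = pd.DataFrame({
    'Zip': ['STATE:', 'STATE:', 'STATE:', '995'],
    'Q1':  ['ALASKA', 'NEW', 'DISTRICT', '416.16'],
    'Q2':  [None, 'YORK', 'OF', '396.63'],
    'Q3':  [None, None, 'COLUMBIA', '433.46'],
})

is_state = toy['Zip'] == 'STATE:'

# For each "STATE:" row, join the non-null words in the next three cells
names = toy.loc[is_state, ['Q1', 'Q2', 'Q3']].apply(
    lambda row: ' '.join(row.dropna()), axis=1)

# Assignment aligns on the index, so non-state rows are left empty
toy['State'] = names

print(toy[['Zip', 'State']])
```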


This is a great place to start incorporating a little data validation - we know what state names we should be expecting to see, and how many....

In [55]:
# Check the state names and number of state names present in the column

print(zip_2000['State'].unique())
print(len(zip_2000['State'].unique()))

[None 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT OF COLUMBIA' 'DELAWARE' 'FLORIDA' 'GEORGIA'
 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY'
 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA'
 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH CAROLINA'
 'NORTH DAKOTA' 'NEW HAMPSHIRE' 'NEW JERSEY' 'NEW MEXICO' 'NEVADA'
 'NEW YORK' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO RICO'
 'RHODE ISLAND' 'SOUTH CAROLINA' 'SOUTH DAKOTA' 'TENNESSEE'
 'TRUST TERRITORIES (GUAM)' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN ISLANDS'
 'VERMONT' 'WASHINGTON' 'WISCONSIN' 'WEST VIRGINIA' 'WYOMING']
55


55 states?

On your first pass-through, this simple validation step would alert you to several things about your data that you might not have noticed:

* DC is included as "District of Columbia" - at first, I was only capturing 2-word state names, so I was finding "District of" in the list
* There are non-state territories listed here including Guam and the Virgin Islands - there are others that sometimes show up as well for some years
* Especially for territories, sometimes the name is slightly different year-to-year. For example, Guam is sometimes listed as "Trust Territories (Guam)," and sometimes just as "Guam." We will want to address that in a consistent way
* The presence of "None" isn't a big worry here since we haven't really started cleaning the data that much yet, but this would be a good check to do again later

In [56]:
# Change the references to Guam
zip_2000.loc[zip_2000['State']=='TRUST TERRITORIES (GUAM)', 'State']='GUAM'

# Check the state names and number of state names present in the column
print(zip_2000['State'].unique())
print(len(zip_2000['State'].unique()))

[None 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT OF COLUMBIA' 'DELAWARE' 'FLORIDA' 'GEORGIA'
 'HAWAII' 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY'
 'LOUISIANA' 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA'
 'MISSOURI' 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH CAROLINA'
 'NORTH DAKOTA' 'NEW HAMPSHIRE' 'NEW JERSEY' 'NEW MEXICO' 'NEVADA'
 'NEW YORK' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO RICO'
 'RHODE ISLAND' 'SOUTH CAROLINA' 'SOUTH DAKOTA' 'TENNESSEE' 'GUAM' 'TEXAS'
 'UTAH' 'VIRGINIA' 'VIRGIN ISLANDS' 'VERMONT' 'WASHINGTON' 'WISCONSIN'
 'WEST VIRGINIA' 'WYOMING']
55


In [57]:
# make a list of the possible state and territory names
# and save it off for later data cleaning
# a few names beyond those in the list above are added to make it reusable for other files
geos = ['ALASKA', 'ALABAMA', 'ARKANSAS', 'ARIZONA',
        'CALIFORNIA', 'COLORADO', 'CONNECTICUT', 
        'DISTRICT OF COLUMBIA', 'DELAWARE', 'FLORIDA', 
        'GEORGIA', 'HAWAII', 'IOWA', 'IDAHO', 
        'ILLINOIS', 'INDIANA', 'KANSAS', 'KENTUCKY',
        'LOUISIANA', 'MASSACHUSETTS', 'MARYLAND', 'MAINE',
        'MICHIGAN', 'MINNESOTA', 'MISSOURI', 'MISSISSIPPI', 
        'MONTANA', 'NEBRASKA', 'NORTH CAROLINA', 'NORTH DAKOTA', 
        'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO', 'NEVADA',
        'NEW YORK', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 
        'PUERTO RICO', 'RHODE ISLAND', 'SOUTH CAROLINA', 
        'SOUTH DAKOTA', 'TENNESSEE', 'GUAM', 'TEXAS', 'UTAH', 
        'VIRGINIA', 'VIRGIN ISLANDS', 'VERMONT', 'WASHINGTON', 
        'WISCONSIN', 'WEST VIRGINIA', 'WYOMING', 'UNITED STATES', 
        'AMERICAN SAMOA']

with open('../data/geographies.pickle', 'wb') as f:
    pickle.dump(geos, f)
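
One way to put the saved list to work later is to load it back and flag any name that isn't in it. The sketch below uses a short stand-in list, a temporary file rather than the real path, and made-up "found" values, so all of those names are assumptions for illustration:

```python
import os
import pickle
import tempfile

# Short stand-in for the full list of geographies
geos_demo = ['ALASKA', 'ALABAMA', 'DISTRICT OF COLUMBIA', 'GUAM']

# Round-trip the list through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'geographies.pickle')
    with open(path, 'wb') as f:
        pickle.dump(geos_demo, f)
    with open(path, 'rb') as f:
        expected = pickle.load(f)

# Hypothetical values as they might come out of a 'State' column
found = ['ALASKA', 'DISTRICT OF', 'TRUST TERRITORIES (GUAM)', None]

# Anything non-null that is not in the expected list needs a closer look
unexpected = sorted(s for s in found if s is not None and s not in expected)
print(unexpected)
```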

#### Step 5 - drop some extraneous rows

From looking at the data, we can already determine at this point that there's a lot here we won't need. For example, there are headers on most pages of the PDF reports with info like the reporting period, the name of the report, etc. We wouldn't expect there to be any useful data in those rows and we can confirm this before dropping these rows out to clean up the df and reduce size a little. 

With messy data where you can't necessarily rely on the data consistently being in a particular column, be very careful when dropping rows like this. I checked every row I was dropping (see cell below) to make sure I wasn't accidentally removing any of the real data. 

In [58]:
# to confirm for yourself as I did when I was doing the cleaning, 
# uncomment any of the lines below and see the rows we'll be dropping out

#zip_2000[zip_2000['Zip']=='ENFORCEMENT']
#zip_2000[zip_2000['Zip']=='REPORTING']
#zip_2000[zip_2000['Zip']=='RETAIL']
#zip_2000[zip_2000['Zip']=='DATE:']
#zip_2000[zip_2000['Zip']=='ZIP']
#zip_2000[zip_2000['Zip']=='ARCOS']
#zip_2000[zip_2000['Q1']=='ENFORCEMENT']
#zip_2000[zip_2000['Q1']=='REPORTING']
#zip_2000[zip_2000['Q1']=='RETAIL']
#zip_2000[zip_2000['Q1']=='DATE:']
#zip_2000[zip_2000['Q1']=='ZIP']
#zip_2000[zip_2000['Q1']=='ARCOS']

In [59]:
# Now we can drop all that garbage!
# This will make it easier to look at the remaining data to see what still needs to be addressed

drops = ['ENFORCEMENT', 'REPORTING', 'RETAIL', 'DATE:', 'ZIP', 'ARCOS']

for d in drops:
    zip_2000 = zip_2000.drop(zip_2000[zip_2000['Zip']==d].index)
    zip_2000 = zip_2000.drop(zip_2000[zip_2000['Q1']==d].index)
    
zip_2000.head(10)

Unnamed: 0,Year,State,Drug,Drug Code,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
1,2000,ALASKA,,,STATE:,ALASKA,,,,,,,,,,,,
3,2000,,,,----------------------------------------------...,,,,,,,,,,,,,
4,2000,,,,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,2000,,,,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,2000,,,,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,2000,,,,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,2000,,,,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,2000,,,,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,
10,2000,,,,TOTAL,670.99,606.22,645.30,707.32,2629.83,2629.83,,,,,,,
11,2000,,,,DRUG,CODE:,1100D,DRUG,NAME:,D-AMPHETAMINE,BASE,,,,,,,
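
The same inspect-then-drop pattern can be written without a loop by building one boolean mask with `isin`. This is a sketch on made-up rows, offered as an alternative rather than as what the cell above does:

```python
import pandas as pd

toy = pd.DataFrame({
    'Zip': ['REPORTING', 'STATE:', 'DATE:', '995', 'DRUG'],
    'Q1':  ['PERIOD:', 'ALASKA', '12/26/2002', '416.16', 'ENFORCEMENT'],
})
drops = ['ENFORCEMENT', 'REPORTING', 'RETAIL', 'DATE:', 'ZIP', 'ARCOS']

# True for rows where either column holds one of the header keywords
is_header = toy['Zip'].isin(drops) | toy['Q1'].isin(drops)

print(toy[is_header])      # inspect the rows first...
cleaned = toy[~is_header]  # ...then keep everything else

print(cleaned)
```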


#### Step 6 - Getting the drug names
The next sticky issue is how to get the drug names into our drug column. If you look through the data, you'll see that they vary a lot in length, and the multi-word names have again been split up into multiple columns. It would not be easy to define rules to cover the majority of cases, so I needed a different approach from the one I used for the state names. 

I eventually used the drug codes instead of the names of the drugs themselves - they're much more uniform and tend to show up consistently in particular columns. You could find a programmatic way to do this, but I ended up just manually compiling the list. It didn't take very long and certainly less time than trying to formulate rules to handle all the cases. 

I've put the drug codes dictionary below for this first notebook, but it is tidier to save it off in a file and load it into a dict when you are ready for it. We'll also use this same dictionary to cover the other reports, so it's good to have it portable between notebooks.

In some cases, there are sub-codes / variations of codes, and I've preserved those here and kept to the original raw data as much as possible. For example, there are two codes, 9041 and 9041L, that both refer simply to "cocaine" in the reports. I have not been able to find out from anywhere on the DEA website how these are different (if you find out let me know!) and so I have left it alone. Although the distribution values for one of the codes are usually zero, there is still data associated with it in some cases (e.g., the report detailing registrants), so without knowing what it is it's better not to delete it. Keeping the drug codes as part of the data means we can distinguish between the two.

Once armed with the codes, we can easily extract the drug names! 
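
A minimal sketch of both ideas, saving the dictionary to a file and using the codes to look up names. Everything here is an assumption for illustration: the two-entry stand-in dictionary, the JSON format, the file name `drug_codes.json`, and the toy rows. It shows one possible approach, not necessarily the one used in the rest of this notebook.

```python
import json
import os
import tempfile

import pandas as pd

# Two-entry stand-in for the full code-to-name dictionary
codes_demo = {'1100B': 'DL-AMPHETAMINE BASE', '9143': 'OXYCODONE'}

# Save the dictionary to a JSON file and load it back (temporary path for the demo)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'drug_codes.json')
    with open(path, 'w') as f:
        json.dump(codes_demo, f)
    with open(path) as f:
        loaded = json.load(f)

# Toy rows shaped like the report: on a "DRUG CODE:" row the code sits in Q2
toy = pd.DataFrame({
    'Zip': ['DRUG', '995', 'DRUG'],
    'Q1':  ['CODE:', '416.16', 'CODE:'],
    'Q2':  ['1100B', '396.63', '9143'],
})

is_code_row = (toy['Zip'] == 'DRUG') & (toy['Q1'] == 'CODE:')

# Keep Q2 only on the code rows, then translate each code to its name
toy['Drug Code'] = toy['Q2'].where(is_code_row)
toy['Drug'] = toy['Drug Code'].map(loaded)

print(toy)
```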

In [60]:
drug_codes = {'1100': 'AMPHETAMINE',
              '1100B': 'DL-AMPHETAMINE BASE',
              '1100D': 'D-AMPHETAMINE BASE',
              '1105B': 'DL-METHAMPHETAMINE RACEMIC BASE',
              '1105D': 'D-METHAMPHETAMINE',
              '1105L': 'LEVOMETHAMPHETAMINE',
              '1205': 'LISDEXAMFETAMINE',
              '1248': 'MEPHEDRONE; 4-METHOXYMETHCATHINONE',
              '1615': 'PHENDIMETRAZINE',
              '1724': 'METHYLPHENIDATE',
              '2010': 'GAMMA HYDROXYBUTYRIC ACID',
              '2012': 'GAMMA HYDROXYBUTYRIC ACID PREPARATIONS',
              '2100': 'BARBITURIC ACID DERIVATIVE OR SALT',
              '2125': 'AMOBARBITAL (SCHEDULE 2)',
              '2165': 'BUTALBITAL',
              '2270': 'PENTOBARBITAL (SCHEDULE 2)',
              '2285': 'PHENOBARBITAL',
              '2315': 'SECOBARBITAL (SCHEDULE 2)',
              '2765': 'DIAZEPAM',
              '2783': 'ZOLPIDEM',
              '2885': 'LORAZEPAM',
              '4187': 'TESTOSTERONE',
              '7285': 'KETAMINE',
              '7315D': 'LYSERGIDE(D-LSD)',
              '7365': 'DRONABINOL IN AN ORAL SOLUTION IN FDA APPROVED DRUG PRODUCT (SYNDROS - CII)',
              '7369': 'DRONABINOL IN SESAME OIL',
              '7360': 'MARIJUANA PLANT(CANNABIS,CANNABIGEROL,CANNABIDIOL',
              '7370': 'TETRAHYDROCANNABINOL,SYNTHETIC',
              '7377': 'CANNABICYCLOL',
              '7379': 'NABILONE',
              '7381': 'MESCALINE',
              '7400': '3,4-METHYLENEDIOXYAMPHETAMINE (3,4-MD',
              '7431': '5-METHOXY-N,N DIMETHYLTRYPTAMINE',
              '7433': 'BUFOTENINE',
              '7437': 'PSILOCYBIN',
              '7438': 'PSILOCIN',
              '7439': '5-METHOXY-N,N-DIISOPROPYLTRYPTAMINE(5',
              '7540': 'METHYLONE (3,4-METHYLENEDIOXY-N-METHYLCATHINONE)',
              '7444': '4-HYDROXY-3-METHOXY-METHAMPHETAMINE',
              '7455': 'ETICYCLIDINE (PCE)',
              '7471': 'PHENCYCLIDINE (PCP)',
              '9010': 'ALPHAPRODINE',
              '9020': 'ANILERIDINE',
              '9041': 'COCAINE',
              '9041L': 'COCAINE',
              '9046': 'NORCOCAINE',
              '9050': 'CODEINE',
              '9104': 'NORCODEINE',
              '9056': 'ETORPHINE',
              '9058': 'DIPRENORPHINE',
              '9064': 'BUPRENORPHINE',
              '9120': 'DIHYDROCODEINE',
              '9143': 'OXYCODONE',
              '9150': 'HYDROMORPHONE',
              '9168': 'DIFENOXIN(I.E.DIPHENOXYLIC ACID)',
              '9170': 'DIPHENOXYLATE',
              '9180': 'ECGONINE',
              '9180L': 'ECGONINE',
              '9190': 'ETHYLMORPHINE',
              '9193': 'HYDROCODONE',
              '9200': 'HEROIN',
              '9220L': 'LEVORPHANOL',
              '9230': 'MEPERIDINE (PETHIDINE)',
              '9273D': 'DEXTROPROPOXYPHENE',
              '9250B': 'METHADONE',
              '9300': 'MORPHINE',
              '9313': 'NORMORPHINE',
              '9317': 'NALTREXONE',
              '9333': 'THEBAINE',
              '9336': 'MORPHINE-3-ETHEREAL SULFATE',
              '9411': 'NALOXONE',
              '9600': 'OPIUM',
              '9603': 'ALPHAACETYLMETHADOL',
              '9630': 'OPIUM TINCTURE',
              '9639': 'OPIUM POWDERED',
              '9652': 'OXYMORPHONE',
              '9655': 'PAREGORIC/OPIUM',
              '9665': '14-HYDROXYCODEINONE',
              '9668': 'NOROXYMORPHONE',
              '9670': 'CONCENTRATE OF POPPY STRAW',
             '9737': 'ALFENTANIL',
             '9739': 'REMIFENTANIL',
             '9740': 'SUFENTANIL BASE',
             '9743': 'CARFENTANIL',
             '9780': 'TAPENTADOL',
             '9801': 'FENTANYL BASE',
              '9809': 'OPIUM COMBINATION PRODUCT (C-III)'
             }

In [61]:
# save the dict off to use later
with open('../data/drug_codes.pickle', 'wb') as f:
    pickle.dump(drug_codes, f)

In [62]:
# iterate through drug codes, which are the keys in the dict
# find all rows in the df where that code appears in the cell next to "CODE:"
# assign the "Drug" column for those rows to the value associated to the drug code key

for key in drug_codes.keys():
    zip_2000.loc[(zip_2000['Q1']=='CODE:')&(zip_2000['Q2']==key), 'Drug'] = drug_codes[key]
    zip_2000.loc[(zip_2000['Q1']=='CODE:')&(zip_2000['Q2']==key), 'Drug Code'] = key


# check how it looks
zip_2000.head(10)

Unnamed: 0,Year,State,Drug,Drug Code,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
1,2000,ALASKA,,,STATE:,ALASKA,,,,,,,,,,,,
3,2000,,,,----------------------------------------------...,,,,,,,,,,,,,
4,2000,,DL-AMPHETAMINE BASE,1100B,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,2000,,,,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,2000,,,,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,2000,,,,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,2000,,,,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,2000,,,,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,
10,2000,,,,TOTAL,670.99,606.22,645.30,707.32,2629.83,2629.83,,,,,,,
11,2000,,D-AMPHETAMINE BASE,1100D,DRUG,CODE:,1100D,DRUG,NAME:,D-AMPHETAMINE,BASE,,,,,,,


Looking better! The names of drugs that were spread across columns are now in a single cell. 

It's a good time to check how the data is looking generally - are there any odd things going on? It should be starting to look more uniform. You can see there are still some non-numeric values in the columns where we should only have gram amounts, but those are just things we'll drop later (rows containing the state and drug names). You might notice the duplicated values in the "TOTAL" and "DRUG" columns (like row 10) - remember that's from shifting those values over, and we will drop that duplicate column later. 
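
As an aside, the per-key loop above could also be written as a single lookup. This is only a sketch on a toy frame, not the code used in this notebook: `toy`, the two-entry `drug_codes` dict, and the made-up code `'0000X'` are assumptions for illustration. A side benefit is that header rows whose code is missing from the dictionary are easy to list.

```python
import pandas as pd

# Toy stand-ins for the real objects (values are illustrative only)
drug_codes = {'1100B': 'DL-AMPHETAMINE BASE', '1100D': 'D-AMPHETAMINE BASE'}
toy = pd.DataFrame({
    'Q1': ['CODE:', '416.16', 'CODE:', 'CODE:'],
    'Q2': ['1100B', '396.63', '1100D', '0000X'],  # '0000X' is a made-up code
})
toy['Drug'] = None
toy['Drug Code'] = None

# Header rows are the ones with "CODE:" in Q1
is_header = toy['Q1'] == 'CODE:'
known = is_header & toy['Q2'].isin(list(drug_codes))

# One assignment per column instead of one per dictionary key
toy.loc[known, 'Drug Code'] = toy.loc[known, 'Q2']
toy.loc[known, 'Drug'] = toy.loc[known, 'Q2'].map(drug_codes)

# Header rows whose code is not in the dictionary are worth a look
unmatched = sorted(toy.loc[is_header & ~known, 'Q2'].unique())
print(unmatched)  # ['0000X']
```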

#### Step 7 - Fill forward states and drug names

I did not use multilevel indexing with this data, so I needed to fill down the state and drug info into each row. 

pandas has a convenient method to fill blanks / nulls (in this case forward fill) that works great here.

In [63]:
zip_2000['State'] = zip_2000['State'].ffill()
zip_2000['Drug'] = zip_2000['Drug'].ffill()
zip_2000['Drug Code'] = zip_2000['Drug Code'].ffill()


zip_2000.head(15)

Unnamed: 0,Year,State,Drug,Drug Code,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,FOR,EACH,STATE
1,2000,ALASKA,,,STATE:,ALASKA,,,,,,,,,,,,
3,2000,ALASKA,,,----------------------------------------------...,,,,,,,,,,,,,
4,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,,,
5,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,995,416.16,396.63,433.46,423.54,1669.79,,,,,,,,
6,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,996,102.76,100.63,88.24,108.29,399.92,,,,,,,,
7,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,997,114.37,85.54,92.30,128.31,420.52,,,,,,,,
8,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,998,33.42,18.83,28.37,36.55,117.17,,,,,,,,
9,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,999,4.28,4.59,2.93,10.63,22.43,,,,,,,,
10,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,TOTAL,670.99,606.22,645.30,707.32,2629.83,2629.83,,,,,,,
11,2000,ALASKA,D-AMPHETAMINE BASE,1100D,DRUG,CODE:,1100D,DRUG,NAME:,D-AMPHETAMINE,BASE,,,,,,,


#### Final cleaning steps
The last few steps are to drop out remaining junk and convert the numeric data since it's not reading in as floats.

In [64]:
# drop any rows where the "Zip" column contains "DRUG" or "STATE" - these were headers in the PDF
zip_2000=zip_2000.drop(zip_2000[zip_2000['Zip']=='DRUG'].index)
zip_2000=zip_2000.drop(zip_2000[zip_2000['Zip']=='STATE:'].index)

# select just the columns with data we want to keep 
# this effectively drops all the other columns
# of course you would first check that the other columns were all NaNs or unneeded data!
zip_2000 = zip_2000[['Year', 'State', 'Drug', 'Drug Code', 'Zip', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']]

# drop any rows where the "TOTAL" column is blank
zip_2000 = zip_2000.drop(zip_2000.loc[pd.isnull(zip_2000['TOTAL'])].index)

# for the numeric columns, these are currently string values
# remove any commas and set the datatype to float
cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
for col in cols:
    zip_2000[col]=zip_2000[col].str.replace(",","").astype(float)

# check how it looks
zip_2000.head()

Unnamed: 0,Year,State,Drug,Drug Code,Zip,Q1,Q2,Q3,Q4,TOTAL
5,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,995,416.16,396.63,433.46,423.54,1669.79
6,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,996,102.76,100.63,88.24,108.29,399.92
7,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,997,114.37,85.54,92.3,128.31,420.52
8,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,998,33.42,18.83,28.37,36.55,117.17
9,2000,ALASKA,DL-AMPHETAMINE BASE,1100B,999,4.28,4.59,2.93,10.63,22.43


And there's a cleaned dataframe! These steps can be refactored into a function that can be reused, assuming your txt files are in the same format. Be careful with this assumption...
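
One cheap way to test that assumption before cleaning a new file is to check that the raw header tokens are the ones the cleaning code expects. The sketch below assumes the raw column names are those renamed in the cleaning code ('ARCOS', '2', '-', 'REPORT', '1', 'RETAIL'); `missing_columns` and `OLD_FORMAT_COLUMNS` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Raw header tokens that the pre-2006 cleaning code renames
OLD_FORMAT_COLUMNS = ['ARCOS', '2', '-', 'REPORT', '1', 'RETAIL']

def missing_columns(df, expected=OLD_FORMAT_COLUMNS):
    """Return the expected raw column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Toy frame whose header has '3' where the older files have '2'
toy = pd.DataFrame(columns=['ARCOS', '3', '-', 'REPORT', '1', 'RETAIL'])
print(missing_columns(toy))  # ['2']
```

An empty list means the header matches; anything else is a hint that the file needs a closer look before it goes through the same function.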


### Refactor the code into a function to be reused on other dataframes
My refactored code is below - I ended up needing two versions of the function because the report formats changed a bit starting in 2006.

In [65]:
# Here's the refactored code to use on older data (before 2006)
def clean_zip_old(df, year, drug_codes):
    """
    Use to clean files for years 2000-2005 inclusive. 
    """
    df.rename(columns={'ARCOS': "Zip", 
                       '2':'Q1', 
                       '-': 'Q2', 
                       'REPORT': 'Q3', 
                       '1':'Q4', 
                       'RETAIL':'TOTAL'}, 
                  inplace=True)
    
    # Fix the shifted cells
    shift = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    df.loc[df['Q1']=='TOTAL', 'Zip'] = df['Q1']
    for i in range(0,5):
        df.loc[df['Zip']=='TOTAL', shift[i]] = df[shift[i+1]]
   
    # Insert new columns
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='State', loc=1, value=None)
    df.insert(column='Drug', loc=2, value=None)
    df.insert(column='Drug Code', loc=2, value=None)

    # Get the state names
    df.loc[df['Zip']=='STATE:', 'State'] = df['Q1']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q2'])), 'State'] = df["State"]+" "+df['Q2']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q3'])), 'State'] = df["State"]+" "+df['Q3']

    # Change the references to Guam
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'        

    # Drop unnecessary data    
    drops = ['ENFORCEMENT', 'REPORTING', 'RETAIL', 'DATE:', 'ZIP', 'ARCOS']
    for d in drops:
        df = df.drop(df[df['Zip']==d].index)
        df = df.drop(df[df['Q1']==d].index)
    
    # Pull out the drug name & code
    for key in drug_codes.keys():
        df.loc[(df['Q1']=='CODE:') &
               (df['Q2']==key), 'Drug'] = drug_codes[key]
    for key in drug_codes.keys():
        df.loc[(df['Q1']=='CODE:') &
               (df['Q2']==key), 'Drug Code'] = key

    # Forward fill the states and drugs
    df['State'] = df['State'].ffill()
    df['Drug'] = df['Drug'].ffill()
    df['Drug Code'] = df['Drug Code'].ffill()


    # Final cleanup
    df=df.drop(df[df['Zip']=='DRUG'].index)
    df=df.drop(df[df['Zip']=='STATE:'].index)
    df = df[['Year', 'State', 'Drug', 'Drug Code',
             'Zip', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']]
    df = df.drop(df.loc[pd.isnull(df['TOTAL'])].index)
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df

### Quality and sense checks
Given that this data has a lot of rows and it's not realistic to check them all manually, it's a good idea to run some sense checks at this point. For example...

* Do the quarterly columns sum up to the total? This could help identify areas where the data got jumbled or shifted in some unintended way.
* Do we have all our states? Did we miss any inconsistencies in the naming?
* Did the forward filling produce any issues? For example, this might present as two rows with the same year-state-drug-zip combination where we should only have one.
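
To make the third point concrete, here is a toy sketch of how forward filling can go wrong. It assumes a hypothetical situation where the second drug header was not matched to any code, so its rows inherit the previous drug's code. The frame and values are invented for illustration.

```python
import pandas as pd

# Two drug sections; the second header row was never matched to a code
toy = pd.DataFrame({
    'Zip':       ['DRUG', '995', 'DRUG', '995'],
    'Drug Code': ['9143', None, None, None],
})

# Forward fill carries '9143' down into the second section as well
toy['Drug Code'] = toy['Drug Code'].ffill()

# Drop the header rows, then look for repeated code-zip combinations
data_rows = toy[toy['Zip'] != 'DRUG']
dupes = data_rows[data_rows.duplicated(subset=['Drug Code', 'Zip'], keep=False)]
print(dupes)
```

Both data rows come back as duplicates, which is the kind of symptom a repeats check is meant to catch.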

In [66]:
# check functions
def quarterly_check(df):
    """
    Check to see if the quarterly values in each row sum up to the total.
    """
    df['check'] = df[['Q1', 'Q2', 'Q3', 'Q4']].sum(axis=1)
    df['diff'] = df['TOTAL'] - df['check']
    issues = df.loc[(df['diff'].abs())>0.2]
    df.drop(columns=['check', 'diff'], inplace=True)
    if issues.empty:
        print('Quarterly sums check passed')
    else:
        return issues


def repeats_check_zip(df):
    """
    Check to see if any rows of data may be repeated; in particular, we should have only one row for each 
    combination of year-state-drugcode-zipcode.
    """
    df['check'] = df['Year'].astype(str)+df['State']+df['Drug Code']+df['Zip']
    checks = pd.Series(data=df['check'].value_counts())
    errors = checks.loc[checks!=1]
    df.drop(columns=['check'], inplace=True)
    if errors.empty:
        print('Repeats checks passed')
    else:
        return errors

    
def check_states(df):
    """
    Compare the states present in the df with those we expect to find.
    """
    states = ['ALASKA', 'ALABAMA', 'AMERICAN SAMOA', 'ARKANSAS', 'ARIZONA', 'CALIFORNIA',
       'COLORADO', 'CONNECTICUT', 'DISTRICT OF COLUMBIA', 'DELAWARE',
       'FLORIDA', 'GEORGIA','GUAM', 'HAWAII', 'IOWA', 'IDAHO', 'ILLINOIS',
       'INDIANA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MASSACHUSETTS',
       'MARYLAND', 'MAINE', 'MICHIGAN', 'MINNESOTA', 'MISSOURI',
       'MISSISSIPPI', 'MONTANA', 'NEBRASKA', 'NORTH CAROLINA',
       'NORTH DAKOTA', 'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO',
       'NEVADA', 'NEW YORK', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA',
       'PUERTO RICO', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA',
       'TENNESSEE', 'TEXAS', 'UTAH', 'VIRGINIA', 'VIRGIN ISLANDS',
       'VERMONT', 'WASHINGTON', 'WISCONSIN', 'WEST VIRGINIA', 'WYOMING']
    in_df = df['State'].unique()
    diff = set(states).symmetric_difference(set(in_df))
    if diff:
        print('State values not matching:', diff)
    else:
        print("All expected state values present")


In [67]:
quarterly_check(zip_2000)

Quarterly sums check passed


In [68]:
repeats_check_zip(zip_2000)

Repeats checks passed


In [69]:
# note that American Samoa isn't included in any of the pre-2006 data,
# so it will show up as an error for those years
check_states(zip_2000)

State values not matching: {'AMERICAN SAMOA'}


Now read in and check the rest of the pre-2006 data files. I've commented out the data checks here for brevity because none of them turned up anything unusual for these files. 

In [71]:
zip_2000 = pd.read_csv('../data/report-1-zipcode/zip_2000.txt', delim_whitespace=True)
#check_data_old(zip_2000)

In [72]:
zip_2001 = pd.read_csv('../data/report-1-zipcode/zip_2001.txt', delim_whitespace=True)
#check_data_old(zip_2001)

In [73]:
zip_2002 = pd.read_csv('../data/report-1-zipcode/zip_2002.txt', delim_whitespace=True)
#check_data_old(zip_2002)

In [74]:
zip_2003 = pd.read_csv('../data/report-1-zipcode/zip_2003.txt', delim_whitespace=True)
#check_data_old(zip_2003)

In [75]:
zip_2004 = pd.read_csv('../data/report-1-zipcode/zip_2004.txt', delim_whitespace=True)
#check_data_old(zip_2004)

In [76]:
zip_2005 = pd.read_csv('../data/report-1-zipcode/zip_2005.txt', delim_whitespace=True)
#check_data_old(zip_2005)
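
The six cells above differ only in the year, so they could be collapsed into a loop. This is a sketch rather than what the notebook does: `load_zip_reports` is a hypothetical helper, and it assumes the files follow the `zip_<year>.txt` naming used above.

```python
from pathlib import Path

import pandas as pd

def load_zip_reports(data_dir, years):
    """Read zip_<year>.txt for each year into a dict keyed by year."""
    frames = {}
    for year in years:
        path = Path(data_dir) / 'zip_{}.txt'.format(year)
        # sep=r'\s+' splits on runs of whitespace, like delim_whitespace=True
        frames[year] = pd.read_csv(path, sep=r'\s+')
    return frames

# Hypothetical usage, assuming the same directory layout as above:
# old_raw = load_zip_reports('../data/report-1-zipcode', range(2000, 2006))
```

Keeping the frames in a dict also makes it easier to run the same checks over every year later on.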

Assuming all the data checks came up clean and there's no other cleaning you need to do, you can go ahead and process all those files now. 

In [77]:
print("Processing 2000 file...")
zip_2000 = clean_zip_old(zip_2000, 2000, drug_codes)

print("Processing 2001 file...")
zip_2001 = clean_zip_old(zip_2001, 2001, drug_codes)

print("Processing 2002 file...")
zip_2002 = clean_zip_old(zip_2002, 2002, drug_codes)

print("Processing 2003 file...")
zip_2003 = clean_zip_old(zip_2003, 2003, drug_codes)

print("Processing 2004 file...")
zip_2004 = clean_zip_old(zip_2004, 2004, drug_codes)

print("Processing 2005 file...")
zip_2005 = clean_zip_old(zip_2005, 2005, drug_codes)
print("Done.")

Processing 2000 file...
Processing 2001 file...
Processing 2002 file...
Processing 2003 file...
Processing 2004 file...
Processing 2005 file...
Done.


In [78]:
old_zips = {'2000': zip_2000,
            '2001': zip_2001, 
            '2002': zip_2002, 
            '2003': zip_2003, 
            '2004': zip_2004, 
            '2005': zip_2005}

for f in old_zips.keys():
    print('Checking {} file...'.format(f))
    quarterly_check(old_zips[f])
    repeats_check_zip(old_zips[f])
    check_states(old_zips[f])
    print()
    print()

Checking 2000 file...
Quarterly sums check passed
Repeats checks passed
State values not matching: {'AMERICAN SAMOA'}


Checking 2001 file...
Quarterly sums check passed
Repeats checks passed
State values not matching: {'AMERICAN SAMOA'}


Checking 2002 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present


Checking 2003 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present


Checking 2004 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present


Checking 2005 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present




### Cleaning the newer reports - dealing with a different format

Moving on to the data files from 2006 onward. These are published as separate reports, but each is still often ~40 pages long, so I continued dumping the PDF contents into a .txt file. 

I was able to reuse most of the cleaning and checking functions with a few small tweaks, going through the same step-by-step process on one file to build out the data checking and data cleansing steps and then refactoring it into the below. 

I won't show that whole process here for the sake of brevity, but if you want to practice your pandas skills it would be a good exercise to do on your own.

Key differences for checking code:
* A big difference with these files was the presence of a lot more data reported without any decimal places - so the decimal place requirement is not so useful in the check function. 

Key differences for cleaning code:
* Drug codes are mishmashed a lot more, showing up in a couple of different columns and often as part of a string with the drug name itself or the words "drug" or "code" - so I used a different, slightly fuzzier matching technique to get those names
* There is one special case that has to be handled here. There are two drug codes "1100" and "1100D" and in a few cases the word "DRUG" is concatenated with "1100" - so the matching is going to detect that as "1100D" even though it is not. 
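
The matching code itself isn't shown at this point, so the following is only one possible way to handle the "1100" + "DRUG" case, not necessarily the technique used in the refactored function. The idea is to blank out the literal labels before looking for a code. `extract_drug_code` and the small `known` set are hypothetical.

```python
import re

def extract_drug_code(text, known_codes):
    """Return the first known drug code found in text, or None."""
    # Blank out the literal labels first, so "1100DRUG" becomes "1100 "
    # rather than looking like code 1100D followed by "RUG"
    cleaned = re.sub(r'DRUG|CODE:|NAME:', ' ', text.upper())
    # Codes in the dictionary are four digits plus an optional letter
    for token in re.findall(r'\b\d{4}[A-Z]?\b', cleaned):
        if token in known_codes:
            return token
    return None

known = {'1100', '1100B', '1100D', '9143'}
print(extract_drug_code('1100DRUG', known))        # 1100
print(extract_drug_code('CODE:1100DDRUG', known))  # 1100D
print(extract_drug_code('CODE:9143', known))       # 9143
```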

Let's first note and explore some anomalies that showed up in a lot of these files. 

In [169]:
def check_data_new(df):
    """
    Use this to check data quality for files from 2006-2018 inclusive.
    """
    df.rename(columns={'ARCOS': "Zip", 
                         '3': 'Q1',
                         '-': 'Q2', 
                         'REPORT': 'Q3', 
                         '1':'Q4', 
                         'RETAIL':'TOTAL'},
                inplace=True)
    
    print("Zip column should contain mostly 3-digit gateway zip codes.")
    print(df['Zip'].loc[df['Zip'].str.match(r'(?!^\d+$)^.+$')].unique())
    print()
    print()
    print("Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals")
    cols1 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    for c in cols1:
        print("Checking column {}".format(c))
        print(df[c].loc[~df[c].str.match('[-+]?[0-9,.]*$', na=False)].unique())
        print()
    
    print()
    print("For remaining columns, mostly nulls are expected.")
    cols2 = ['DISTRIBUTION', 'BY', 'ZIP', 'CODE', 'WITHIN', 'STATE', 'BY.1', 'GRAMS', 'WT']

    for c in cols2:
        print("checking column {}".format(c))
        print(df[c].loc[pd.notnull(df[c])].unique())
        print()

In [222]:
zip_2006 = pd.read_csv('../data/report-1-zipcode/zip_2006.txt', delim_whitespace=True)
check_data_new(zip_2006)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:COLORADO' 'STATE:ILLINOIS' 'STATE:KANSAS' 'STATE:MAINE'
 'STATE:MICHIGAN' 'STATE:MONTANA' 'STATE:NEW' 'STATE:OHIO'
 'STATE:PENNSYLVANIA' 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON'
 'STATE:WYOMING' 'STATE:CALIFORNIA' 'STATE:DISTRICT' 'STATE:KENTUCKY'
 'STATE:WEST' 'STATE:AMERICAN' 'STATE:IDAHO' 'STATE:IOWA'
 'STATE:MASSACHUSETTS' 'STATE:MISSISSIPPI' 'STATE:NEVADA' 'STATE:NORTH'
 'STATE:OREGON' 'STATE:PUERTO' 'STATE:VIRGINIA' '12,262,909.02'
 'STATE:ARIZONA' 'STATE:NEBRASKA' 'STATE:ALABAMA' 'STATE:GEORGIA'
 'STATE:MINNESOTA' 'STATE:UTAH' 'STATE:INDIANA' 'STATE:SOUTH'
 'STATE:FLORIDA' 'STATE:MARYLAND' 'STATE:MISSOURI' 'STATE:OKLAHOMA'
 'STATE:ALASKA' 'STATE:WISCONSIN' 'STATE:LOUISIANA' 'STATE:DELAWARE'
 'STATE:HAWAII' 'STATE:ARKANSAS' 'STATE:VERMONT']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / dec

Carefully reading through all that, there are a number of odd things that show up to investigate:

1. A long numeric value showing up in the zip code column: '12,262,909.02'
2. 'NAME:ONE' in column Q4
3. '\**' in column TOTAL
4. 'DELETE' in column DRUG
5. '2)' and 'THIS' in column DISTRIBUTION
6. 'RECORD' in column BY
7. 'NOT' in column CODE
8. 'CO' in column WITHIN

One by one let's investigate these.
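
Since each investigation boils down to "find the odd value, then look at the rows around it", a small helper can save some typing. This is a sketch under the assumption that the index labels are unique (true for a freshly read file); `show_context` is a hypothetical name.

```python
import pandas as pd

def show_context(df, column, value, before=3, after=3):
    """Return a list of small frames, one per match, with neighbouring rows."""
    windows = []
    for label in df.index[df[column] == value]:
        pos = df.index.get_loc(label)  # assumes index labels are unique
        start = max(pos - before, 0)
        windows.append(df.iloc[start:pos + after + 1])
    return windows

# Toy frame standing in for one of the report dataframes
toy = pd.DataFrame({'Zip': ['630', 'TOTAL', '12,262,909.02', 'STATE:', 'ZIP', '100']})
for window in show_context(toy, 'Zip', '12,262,909.02', before=1, after=1):
    print(window)
```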

In [223]:
zip_2006[zip_2006['Zip']=='12,262,909.02']

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
3301,12262909.02,,,,,,,,,,,,,,,


This looks like a value that got wrapped around from the previous line, so let's check the rows around it. 

In [224]:
# reusable code to do this fix in the future
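def fix_wrapped_value(df, str_value):
    """
    Move a total that wrapped onto its own line back into the TOTAL column of the row above.
    Assumes the default integer index, so that the label ix-1 is the previous row.
    """
    # this modifies the df in place, but returns just the two rows in question
    # so that they print nicely in notebook
    ix = df[df['Zip']==str_value].index.values[0]
    df.loc[ix-1, 'TOTAL'] = df.loc[ix, 'Zip']
    # use loc (labels) throughout rather than mixing in iloc (positions)
    return df.loc[ix-1:ix]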

In [225]:
zip_2006.iloc[3296:3302]

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
3296,DRUG,CODE:,2010,DRUG,NAME:,GAMMA,HYDROXYBUTYRIC,ACID,,,,,,,,
3297,STATE:,MISSOURI,,,,,,,,,,,,,,
3298,ZIP,CODE,QUARTER,1,QUARTER,2,QUARTER,3,QUARTER,4.0,TOTAL,GRAMS,,,,
3299,630,2104289.28,4615652.97,2446273.29,3096693.48,12262909.02,,,,,,,,,,
3300,TOTAL,2104289.28,4615652.97,2446273.29,3096693.48,,,,,,,,,,,
3301,12262909.02,,,,,,,,,,,,,,,


So, it is definitely a value that got accidentally moved to the next line. With only one instance in this file, it would be reasonable to fix it by hand. However, this turns out to be a common issue across these reports, so I wrote a little bit of code (the `fix_wrapped_value` function above) to fix it more efficiently. It's still good practice to manually review each case before changing it, to make sure it is what you think it is. 

Also, this data itself looks pretty strange - these are *huge* numbers compared to the other data in this file, and without any other context I don't know why there would be such huge amounts of GHB being transacted within a single zipcode in Missouri. It looks almost like a data entry error (e.g., milligrams vs grams, or similar). Make a note of it and investigate later. 

In [226]:
# manually adjust the total value, and check it
# note that if you want to update the value this way 
# it's necessary to use loc here instead of iloc
# it's not good to use these methods interchangeably
# but here I haven't modified anything related to the index or dropped any rows
# so they will return the same rows of data

zip_2006.loc[3300, 'TOTAL'] = zip_2006.loc[3301, 'Zip']
zip_2006.loc[3300:3302]

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
3300,TOTAL,2104289.28,4615652.97,2446273.29,3096693.48,12262909.02,,,,,,,,,,
3301,12262909.02,,,,,,,,,,,,,,,
3302,STATE:,NEW,YORK,,,,,,,,,,,,,
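
The comment above about `loc` versus `iloc` is worth a quick illustration. On a toy frame (invented for this example), the two agree while the default index is intact and diverge as soon as a row is dropped:

```python
import pandas as pd

toy = pd.DataFrame({'Zip': ['630', 'TOTAL', '12,262,909.02', 'STATE:']})

# With the default index, label 1 and position 1 are the same row
same_before = toy.loc[1, 'Zip'] == toy.iloc[1, 0]

# After dropping the first row, labels keep their values but positions shift
trimmed = toy.drop(index=0)
by_label = trimmed.loc[1, 'Zip']     # the row labelled 1
by_position = trimmed.iloc[1, 0]     # the second row that remains

print(same_before, by_label, by_position)  # True TOTAL 12,262,909.02
```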


In [227]:
zip_2006[zip_2006['Q4']=='NAME:ONE']

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
19748,DRUG,CODE:,9411,DRUG,NAME:ONE,**,DELETE,THIS,RECORD,-,NOT,CO,,,,


This is quite a strange thing to find - but it also addresses a few of the other results from the list. 
Let's check what is in the surrounding rows. 

In [228]:
zip_2006.iloc[19745:19752]

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
19745,ZIP,CODE,QUARTER,1,QUARTER,2,QUARTER,3,QUARTER,4,TOTAL,GRAMS,,,,
19746,232,100,0,0,0,100,,,,,,,,,,
19747,TOTAL,100,0,0,0,100,,,,,,,,,,
19748,DRUG,CODE:,9411,DRUG,NAME:ONE,**,DELETE,THIS,RECORD,-,NOT,CO,,,,
19749,STATE:,ALABAMA,,,,,,,,,,,,,,
19750,ZIP,CODE,QUARTER,1,QUARTER,2,QUARTER,3,QUARTER,4,TOTAL,GRAMS,,,,
19751,350,36.6,42.96,55.8,58.56,193.92,,,,,,,,,,


From the drug codes dictionary, we can determine that this is referring to Naloxone, in Alabama specifically. Referring back to the original PDF, the text in this row is in fact cut off (Naloxone being cut off to "one"), so we can't really get any more insight into why this record is marked for deletion. 

Naloxone is a life-saving drug that is used to reverse the effects of opioid overdose in a matter of seconds. You may have heard of it under the brand name Narcan. It has been a controlled substance (Schedule II) in the past, but today it is not, and can in fact be obtained without a prescription in almost all states. 

It's not clear why this record is marked this way. In this situation I'd investigate other files to see if there was any similar occurrence, and then decide what to do.
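
A quick way to look for similar occurrences in other files is to search every column for the marker word, since the text can land in different columns from one year to the next. A minimal sketch, assuming the raw frames hold strings and nulls as they do here; `rows_containing` is a hypothetical helper and the toy frame is modeled loosely on the row above.

```python
import pandas as pd

def rows_containing(df, token):
    """Return the rows where any cell is exactly equal to token."""
    mask = df.eq(token).any(axis=1)
    return df[mask]

toy = pd.DataFrame(
    [['TOTAL', '100', '0', '0', '0', '100', None],
     ['DRUG', 'CODE:', '9411', 'DRUG', 'NAME:ONE', '**', 'DELETE']],
    columns=['Zip', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG'],
)
print(rows_containing(toy, 'DELETE'))
```

Running the same call over each year's raw dataframe would show whether the "DELETE THIS RECORD" marker is a one-off or a recurring pattern.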

In [229]:
# zip_2006[zip_2006['DISTRIBUTION']=='2)']
# turns out just to be some text labeling the drugs as Schedule 2

In [230]:
zip_2007 = pd.read_csv('../data/report-1-zipcode/zip_2007.txt', delim_whitespace=True)
check_data_new(zip_2007)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:ARKANSAS' 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS'
 'STATE:MICHIGAN' 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO'
 'STATE:PENNSYLVANIA' 'STATE:SOUTH' 'STATE:TEXAS' 'STATE:WASHINGTON'
 'STATE:WYOMING' 'STATE:CALIFORNIA' 'STATE:KENTUCKY' 'STATE:MONTANA'
 'STATE:WEST' 'STATE:ARIZONA' 'STATE:IOWA' 'STATE:MASSACHUSETTS'
 'STATE:VIRGINIA' 'STATE:WISCONSIN' 'STATE:LOUISIANA' 'STATE:ALABAMA'
 'STATE:NORTH' 'STATE:UTAH' 'STATE:IDAHO' 'STATE:OREGON' 'STATE:MAINE'
 'STATE:OKLAHOMA' 'STATE:VIRGIN' 'STATE:COLORADO' 'STATE:HAWAII'
 'STATE:INDIANA' 'STATE:DISTRICT' 'STATE:MISSISSIPPI' 'STATE:NEVADA'
 'STATE:PUERTO' 'STATE:MARYLAND' 'STATE:GEORGIA' 'STATE:TENNESSEE'
 'STATE:MINNESOTA' 'STATE:ALASKA' 'STATE:CONNECTICUT' 'STATE:GUAM'
 'STATE:NEBRASKA' 'STATE:DELAWARE' 'STATE:RHODE']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including

Uh oh....more weird stuff...

(Sesame and oil might jump out to you but that's actually part of a drug name - Dronabinol in sesame oil)

* '3', 'DELETE' in column DISTRIBUTION
* 'THIS' in column BY
* '4', 'RECORD' in column ZIP
* 'WT', '-' in column CODE
* 'CO' in column STATE

In [231]:
# this is fine just a reference to the third quarter
# zip_2007[zip_2007['DISTRIBUTION']=='3']

In [232]:
zip_2007[zip_2007['DISTRIBUTION']=='DELETE']

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
21180,DRUG,CODE:,9411,DRUG,NAME:,NALOXONE,**,DELETE,THIS,RECORD,-,NOT,CO,,,


More naloxone data for Alabama that is marked to be deleted but made it into the report. 

In [233]:
zip_2007.iloc[21178:21184]

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
21178,831,107.21,112.06,106.82,137.73,463.82,,,,,,,,,,
21179,TOTAL,9278.60,9263.44,9344.81,10279.15,38166,,,,,,,,,,
21180,DRUG,CODE:,9411,DRUG,NAME:,NALOXONE,**,DELETE,THIS,RECORD,-,NOT,CO,,,
21181,STATE:,ALABAMA,,,,,,,,,,,,,,
21182,ZIP,CODE,QUARTER,1,QUARTER,2,QUARTER,3,QUARTER,4,TOTAL,GRAMS,,,,
21183,350,69,94.2,103.2,132.06,398.46,,,,,,,,,,


Looks like the same situation again - data for Naloxone distribution in Alabama is marked again for deletion. 

In [234]:
# another reference to Q4
# zip_2007[zip_2007['ZIP']=='4']

In [235]:
# header data
# zip_2007[zip_2007['CODE']=='WT']

In [236]:
zip_2008 = pd.read_csv('../data/report-1-zipcode/zip_2008.txt', delim_whitespace=True)
check_data_new(zip_2008)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:ARKANSAS' 'STATE:CALIFORNIA' 'STATE:FLORIDA' 'STATE:ILLINOIS'
 'STATE:IOWA' 'STATE:LOUISIANA' 'STATE:MICHIGAN' 'STATE:MISSOURI'
 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA' 'STATE:SOUTH' 'STATE:TEXAS'
 'STATE:VIRGINIA' 'STATE:WYOMING' 'STATE:CONNECTICUT' 'STATE:IDAHO'
 'STATE:KANSAS' 'STATE:MASSACHUSETTS' 'STATE:WASHINGTON' 'STATE:ALASKA'
 'STATE:DELAWARE' 'STATE:KENTUCKY' 'STATE:MISSISSIPPI' 'STATE:NEVADA'
 'STATE:OREGON' 'STATE:RHODE' 'STATE:WISCONSIN' 'STATE:ALABAMA'
 'STATE:GEORGIA' 'STATE:INDIANA' 'STATE:MARYLAND' 'STATE:MINNESOTA'
 'STATE:NEBRASKA' 'STATE:NORTH' 'STATE:OKLAHOMA' 'STATE:VERMONT'
 'STATE:WEST' 'STATE:PUERTO' 'STATE:COLORADO' 'STATE:MAINE'
 'STATE:MONTANA' 'STATE:TENNESSEE' 'STATE:UTAH' '7,259,359.69'
 'STATE:AMERICAN' '4,225,666.27' 'STATE:HAWAII' 'STATE:ARIZONA'
 'STATE:DISTRICT']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should h

Main issue with this report looks like more large values that got wrapped onto a new line.

* '7,259,359.69', '4,225,666.27' in Zip column
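
Because the same pattern keeps recurring, the candidates could also be listed automatically and then reviewed by hand before fixing. The sketch below is an assumption about what a wrapped total looks like, based on the 2006 example: a comma-formatted number alone on its row, directly below a TOTAL row whose TOTAL cell is empty. `find_wrapped_totals` is a hypothetical helper and the toy frame is abbreviated.

```python
import pandas as pd

def find_wrapped_totals(df):
    """Return index labels of rows that look like a total wrapped onto its own line."""
    # Zip cell looks like a large number with thousands separators
    looks_numeric = df['Zip'].astype(str).str.match(r'^\d{1,3}(?:,\d{3})+(?:\.\d+)?$')
    # The wrapped value sits alone on its row
    rest_empty = df.drop(columns='Zip').isna().all(axis=1)
    # The row above is a TOTAL row that is missing its total
    prev_is_total = df['Zip'].shift(1) == 'TOTAL'
    prev_missing = df['TOTAL'].shift(1).isna()
    mask = looks_numeric & rest_empty & prev_is_total & prev_missing
    return df.index[mask.to_numpy()].tolist()

toy = pd.DataFrame({
    'Zip':   ['630', 'TOTAL', '12,262,909.02', 'STATE:'],
    'Q1':    ['2104289.28', '2104289.28', None, 'NEW'],
    'TOTAL': ['12262909.02', None, None, None],
})
print(find_wrapped_totals(toy))  # [2]
```

Each label returned could then be inspected and, if it checks out, passed through `fix_wrapped_value` using the value in its Zip cell.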

In [237]:
#zip_2008[zip_2008['Zip']=='7,259,359.69']
#zip_2008.iloc[13697:13700]
fix_wrapped_value(zip_2008, '7,259,359.69')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
13698,TOTAL,1615427.33,1736695.06,1860881.03,2046356.27,7259359.69,,,,,,,,,,
13699,7259359.69,,,,,,,,,,,,,,,


In [238]:
#zip_2008[zip_2008['Zip']=='4,225,666.27']
#zip_2008.iloc[16084:16088]
fix_wrapped_value(zip_2008, '4,225,666.27')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
16086,TOTAL,1024983.62,1043936.82,1047646.89,1109098.94,4225666.27,,,,,,,,,,
16087,4225666.27,,,,,,,,,,,,,,,


In [239]:
zip_2009 = pd.read_csv('../data/report-1-zipcode/zip_2009.txt', delim_whitespace=True)
check_data_new(zip_2009)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:ARKANSAS' 'STATE:CALIFORNIA' 'STATE:FLORIDA' 'STATE:ILLINOIS'
 'STATE:IOWA' 'STATE:LOUISIANA' 'STATE:MICHIGAN' 'STATE:MISSOURI'
 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA' 'STATE:SOUTH' 'STATE:TEXAS'
 'STATE:VIRGINIA' 'STATE:WYOMING' 'STATE:CONNECTICUT' 'STATE:IDAHO'
 'STATE:MONTANA' 'STATE:WISCONSIN' 'STATE:ARIZONA' 'STATE:NORTH'
 'STATE:WASHINGTON' 'STATE:KANSAS' 'STATE:MAINE' 'STATE:COLORADO'
 'STATE:GEORGIA' 'STATE:INDIANA' 'STATE:MARYLAND' 'STATE:MISSISSIPPI'
 'STATE:NEBRASKA' 'STATE:OREGON' 'STATE:KENTUCKY' 'STATE:UTAH'
 'STATE:TENNESSEE' 'STATE:ALABAMA' 'STATE:MASSACHUSETTS' 'STATE:NEVADA'
 'STATE:HAWAII' 'STATE:MINNESOTA' 'STATE:WEST' 'STATE:ALASKA'
 '9,457,780.58' 'STATE:OKLAHOMA' 'STATE:VERMONT' 'STATE:PUERTO'
 'STATE:DELAWARE' 'STATE:DISTRICT' 'STATE:GUAM' 'STATE:AMERICAN']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeri

Another wrapped around value to fix. 

* '9,457,780.58' in column Zip

In [240]:
#zip_2009[zip_2009['Zip']=='9,457,780.58']
#zip_2009.iloc[13230:13233]
fix_wrapped_value(zip_2009, '9,457,780.58')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
13231,TOTAL,1995780.42,2205348.38,2445864.99,2810786.79,9457780.58,,,,,,,,,,
13232,9457780.58,,,,,,,,,,,,,,,


In [241]:
zip_2010 = pd.read_csv('../data/report-1-zipcode/zip_2010.txt', delim_whitespace=True)
check_data_new(zip_2010)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:ARKANSAS' 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS'
 'STATE:MICHIGAN' 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO'
 'STATE:PENNSYLVANIA' 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON'
 'STATE:WYOMING' 'STATE:CALIFORNIA' 'STATE:NORTH' 'STATE:UTAH'
 'STATE:ALABAMA' 'STATE:INDIANA' 'STATE:MARYLAND' 'STATE:MISSISSIPPI'
 'STATE:WISCONSIN' 'STATE:IOWA' 'STATE:LOUISIANA' 'STATE:SOUTH'
 'STATE:VIRGINIA' 'STATE:COLORADO' 'STATE:GEORGIA' 'STATE:MINNESOTA'
 'STATE:OKLAHOMA' 'STATE:VERMONT' 'STATE:WEST' 'STATE:NEBRASKA'
 'STATE:OREGON' 'STATE:RHODE' 'STATE:ARIZONA' 'STATE:ALASKA'
 'STATE:DELAWARE' 'STATE:IDAHO' 'STATE:KENTUCKY' 'STATE:MASSACHUSETTS'
 'STATE:MONTANA' '4,611,079.22' '4,246,523.19' 'STATE:MAINE'
 'STATE:DISTRICT' 'STATE:CONNECTICUT' 'STATE:HAWAII' 'STATE:NEVADA']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, includ

More wrapped values to fix. 

* '4,611,079.22' '4,246,523.19' in column Zip

In [242]:
#zip_2010[zip_2010['Zip']=='4,611,079.22']
#zip_2010.iloc[16100:16103]
fix_wrapped_value(zip_2010, '4,611,079.22')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
16101,TOTAL,1055820.55,1182977.23,1165125.62,1207155.82,4611079.22,,,,,,,,,,
16102,4611079.22,,,,,,,,,,,,,,,


In [243]:
#zip_2010[zip_2010['Zip']=='4,246,523.19']
#zip_2010.iloc[17024:17026]
fix_wrapped_value(zip_2010, '4,246,523.19')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
17024,TOTAL,1030506.72,1071757.82,1070444.74,1073813.91,4246523.19,,,,,,,,,,
17025,4246523.19,,,,,,,,,,,,,,,


In [244]:
zip_2011 = pd.read_csv('../data/report-1-zipcode/zip_2011.txt', delim_whitespace=True)
check_data_new(zip_2011)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL'
 'STATE:ARKANSAS' 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS'
 'STATE:MICHIGAN' 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO'
 'STATE:PENNSYLVANIA' 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON'
 'STATE:WYOMING' 'STATE:CALIFORNIA' 'STATE:DISTRICT' 'STATE:MAINE'
 'STATE:MINNESOTA' 'STATE:NORTH' 'STATE:CONNECTICUT' 'STATE:OKLAHOMA'
 'STATE:ARIZONA' 'STATE:IDAHO' 'STATE:IOWA' 'STATE:KENTUCKY'
 'STATE:MASSACHUSETTS' 'STATE:OREGON' 'STATE:SOUTH' 'STATE:VIRGINIA'
 'STATE:WISCONSIN' 'STATE:LOUISIANA' '17,577,800.07' 'STATE:MARYLAND'
 'STATE:NEBRASKA' 'STATE:GEORGIA' 'STATE:COLORADO' 'STATE:DELAWARE'
 'STATE:VERMONT' 'STATE:WEST' 'STATE:ALABAMA' 'STATE:INDIANA'
 'STATE:MONTANA' '4,701,107.52' 'STATE:UTAH' 'STATE:GUAM'
 'STATE:MISSISSIPPI' 'STATE:NEVADA' 'STATE:ALASKA']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Wrapped values to fix. 

* '17,577,800.07', '4,701,107.52' in column Zip

In [245]:
#zip_2011[zip_2011['Zip']=='17,577,800.07']
#zip_2011.iloc[5057:5060]
fix_wrapped_value(zip_2011, '17,577,800.07')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
5058,TOTAL,4258659.46,4421310.91,4646749.25,4251080.45,17577800.07,,,,,,,,,,
5059,17577800.07,,,,,,,,,,,,,,,


In [246]:
#zip_2011[zip_2011['Zip']=='4,701,107.52']
#zip_2011.iloc[16332:16335]
fix_wrapped_value(zip_2011, '4,701,107.52')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
16333,TOTAL,1140519.34,1138598.39,1139138.23,1282851.56,4701107.52,,,,,,,,,,
16334,4701107.52,,,,,,,,,,,,,,,


In [247]:
zip_2012 = pd.read_csv('../data/report-1-zipcode/zip_2012.txt', delim_whitespace=True)
check_data_new(zip_2012)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'RETAIL' 'Run'
 'STATE:ARKANSAS' 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS'
 'STATE:MICHIGAN' 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO'
 'STATE:PENNSYLVANIA' 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON'
 'STATE:WYOMING' 'STATE:CALIFORNIA' 'STATE:DELAWARE' 'STATE:LOUISIANA'
 'STATE:MINNESOTA' 'STATE:OREGON' 'STATE:VIRGINIA' 'STATE:ARIZONA'
 'STATE:CONNECTICUT' 'STATE:INDIANA' 'STATE:MARYLAND' 'STATE:NEVADA'
 'STATE:NORTH' 'STATE:IOWA' 'STATE:SOUTH' 'STATE:DISTRICT' 'STATE:IDAHO'
 'STATE:KENTUCKY' 'STATE:MASSACHUSETTS' 'STATE:MISSISSIPPI' 'STATE:RHODE'
 'STATE:WISCONSIN' '17,723,212.99' 'STATE:ALABAMA' 'STATE:GEORGIA'
 'STATE:NEBRASKA' 'STATE:OKLAHOMA' 'STATE:WEST' 'STATE:ALASKA'
 'STATE:VERMONT' 'STATE:MAINE' 'STATE:MONTANA' '4,622,183.22'
 '4,500,723.77' 'STATE:COLORADO' 'STATE:UTAH' 'STATE:HAWAII']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Three wrapped values to fix:

* '17,723,212.99', '4,622,183.22', '4,500,723.77' in column Zip

In [248]:
#zip_2012[zip_2012['Zip']=='17,723,212.99']
#zip_2012.iloc[4967:4970]
fix_wrapped_value(zip_2012, '17,723,212.99')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
4968,TOTAL,6486887.81,4353842.88,5198530.75,1683951.55,17723212.99,,,,,,,,,,
4969,17723212.99,,,,,,,,,,,,,,,


In [249]:
#zip_2012[zip_2012['Zip']=='4,622,183.22']
#zip_2012.iloc[12863:12866]
fix_wrapped_value(zip_2012, '4,622,183.22')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
12864,TOTAL,1135358.93,1143402.32,1157021.65,1186400.32,4622183.22,,,,,,,,,,
12865,4622183.22,,,,,,,,,,,,,,,


In [250]:
#zip_2012[zip_2012['Zip']=='4,500,723.77']
#zip_2012.iloc[16162:16165]
fix_wrapped_value(zip_2012, '4,500,723.77')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
16163,TOTAL,1137975.64,1109341.11,1106728.14,1146678.88,4500723.77,,,,,,,,,,
16164,4500723.77,,,,,,,,,,,,,,,


In [251]:
zip_2013 = pd.read_csv('../data/report-1-zipcode/zip_2013.txt', delim_whitespace=True)
check_data_new(zip_2013)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'STATE:ARKANSAS'
 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS' 'STATE:MICHIGAN'
 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA'
 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON' 'STATE:WYOMING'
 'STATE:CALIFORNIA' 'STATE:DELAWARE' 'STATE:LOUISIANA' 'STATE:MINNESOTA'
 'STATE:NORTH' 'STATE:VIRGINIA' 'STATE:MISSISSIPPI' 'STATE:SOUTH'
 'STATE:WISCONSIN' 'STATE:IOWA' '22,131,149.17' 'STATE:CONNECTICUT'
 'STATE:MASSACHUSETTS' 'STATE:ALASKA' 'STATE:INDIANA' 'STATE:KENTUCKY'
 'STATE:NEVADA' 'STATE:OKLAHOMA' 'STATE:PUERTO' 'STATE:ARIZONA'
 'STATE:VIRGIN' 'STATE:WEST' 'STATE:GEORGIA' 'STATE:NEBRASKA'
 'STATE:COLORADO' 'STATE:MARYLAND' 'STATE:UTAH' 'STATE:ALABAMA'
 'STATE:MONTANA' 'STATE:VERMONT' '4,450,609.92' '5,067,676.02'
 '4,560,721.18' 'STATE:HAWAII' 'STATE:OREGON' 'STATE:AMERICAN'
 'STATE:IDAHO' 'STATE:RHODE' 'STATE:DISTRICT']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Wrapped values to fix:

* '22,131,149.17', '4,450,609.92', '5,067,676.02', '4,560,721.18' in column Zip

In [252]:
#zip_2013[zip_2013['Zip']=='22,131,149.17']
#zip_2013.iloc[4728:4731]
fix_wrapped_value(zip_2013, '22,131,149.17')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
4729,TOTAL,6892141.82,5497232.83,5202097.34,4539677.18,22131149.17,,,,,,,,,,
4730,22131149.17,,,,,,,,,,,,,,,


In [253]:
#zip_2013[zip_2013['Zip']=='4,450,609.92']
#zip_2013.iloc[12786:12789]
fix_wrapped_value(zip_2013, '4,450,609.92')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
12787,TOTAL,1137570.64,1107515.42,1103390.59,1102133.27,4450609.92,,,,,,,,,,
12788,4450609.92,,,,,,,,,,,,,,,


In [254]:
#zip_2013[zip_2013['Zip']=='5,067,676.02']
#zip_2013.iloc[15109:15112]
fix_wrapped_value(zip_2013, '5,067,676.02')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
15110,TOTAL,1254573.67,1271112.68,1276613.51,1265376.16,5067676.02,,,,,,,,,,
15111,5067676.02,,,,,,,,,,,,,,,


In [255]:
#zip_2013[zip_2013['Zip']=='4,560,721.18']
#zip_2013.iloc[15987:15990]
fix_wrapped_value(zip_2013, '4,560,721.18')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
15988,TOTAL,1160896.04,1153155.13,1138483.39,1108186.62,4560721.18,,,,,,,,,,
15989,4560721.18,,,,,,,,,,,,,,,


In [256]:
zip_2014 = pd.read_csv('../data/report-1-zipcode/zip_2014.txt', delim_whitespace=True)
check_data_new(zip_2014)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'STATE:ARKANSAS'
 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS' 'STATE:MICHIGAN'
 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA'
 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON' 'STATE:WYOMING'
 'STATE:CALIFORNIA' 'STATE:CONNECTICUT' 'STATE:IDAHO' 'STATE:KENTUCKY'
 'STATE:NEVADA' 'STATE:OKLAHOMA' 'STATE:SOUTH' 'STATE:VERMONT'
 'STATE:ALABAMA' 'STATE:GEORGIA' 'STATE:INDIANA' 'STATE:MARYLAND'
 'STATE:MINNESOTA' 'STATE:NEBRASKA' 'STATE:NORTH' 'STATE:WEST'
 'STATE:COLORADO' 'STATE:MONTANA' '25,086,813.7' 'STATE:WISCONSIN'
 'STATE:MAINE' 'STATE:MASSACHUSETTS' 'STATE:MISSISSIPPI' 'STATE:VIRGINIA'
 'STATE:ARIZONA' 'STATE:HAWAII' 'STATE:AMERICAN' 'STATE:DELAWARE'
 'STATE:PUERTO' 'STATE:ALASKA' 'STATE:LOUISIANA' 'STATE:IOWA'
 'STATE:OREGON' 'STATE:DISTRICT']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Just one wrapped value to fix here:
 * '25,086,813.7' in column Zip

In [257]:
#zip_2014[zip_2014['Zip']=='25,086,813.7']
#zip_2014.iloc[4679:4682]
fix_wrapped_value(zip_2014, '25,086,813.7')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
4680,TOTAL,10652890.18,2499586.56,11212102.08,722234.88,25086813.7,,,,,,,,,,
4681,25086813.7,,,,,,,,,,,,,,,


In [258]:
zip_2015 = pd.read_csv('../data/report-1-zipcode/zip_2015.txt', delim_whitespace=True)
check_data_new(zip_2015)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'ARCOS' 'STATE:ARKANSAS'
 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS' 'STATE:MICHIGAN'
 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA'
 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON' 'STATE:WYOMING'
 'STATE:CALIFORNIA' 'STATE:DELAWARE' 'STATE:MINNESOTA' 'STATE:OREGON'
 'STATE:VIRGINIA' 'STATE:ALASKA' 'STATE:COLORADO' 'STATE:INDIANA'
 'STATE:WISCONSIN' 'STATE:IOWA' 'STATE:LOUISIANA' 'STATE:SOUTH'
 '25,390,271.23' 'STATE:NORTH' 'STATE:MASSACHUSETTS' 'STATE:HAWAII'
 'STATE:KENTUCKY' 'STATE:MISSISSIPPI' 'STATE:NEVADA' 'STATE:ARIZONA'
 'STATE:VIRGIN' 'STATE:WEST' 'STATE:MAINE' 'STATE:MONTANA'
 'STATE:OKLAHOMA' 'STATE:RHODE' 'STATE:IDAHO' '4,206,759.39'
 'STATE:CONNECTICUT' 'STATE:NEBRASKA' 'STATE:PUERTO' 'STATE:DISTRICT'
 'STATE:MARYLAND' 'STATE:GEORGIA']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Wrapped values to fix:

* '25,390,271.23', '4,206,759.39' in column Zip

In [259]:
#zip_2015[zip_2015['Zip']=='25,390,271.23']
#zip_2015.iloc[4955:4958]
fix_wrapped_value(zip_2015, '25,390,271.23')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
4956,TOTAL,9935262.14,7352603.71,4881029.76,3221375.62,25390271.23,,,,,,,,,,
4957,25390271.23,,,,,,,,,,,,,,,


In [260]:
#zip_2015[zip_2015['Zip']=='4,206,759.39']
#zip_2015.iloc[12278:12281]
fix_wrapped_value(zip_2015, '4,206,759.39')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
12279,TOTAL,1040101.13,1054084.97,1057771.58,1054801.71,4206759.39,,,,,,,,,,
12280,4206759.39,,,,,,,,,,,,,,,


In [261]:
zip_2016 = pd.read_csv('../data/report-1-zipcode/zip_2016.txt', delim_whitespace=True)
check_data_new(zip_2016)

Zip column should contain mostly 3-digit gateway zip codes.
['REPORTING' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'Run' 'RETAIL' 'STATE:ARKANSAS'
 'STATE:FLORIDA' 'STATE:ILLINOIS' 'STATE:KANSAS' 'STATE:MICHIGAN'
 'STATE:MISSOURI' 'STATE:NEW' 'STATE:OHIO' 'STATE:PENNSYLVANIA'
 'STATE:TENNESSEE' 'STATE:TEXAS' 'STATE:WASHINGTON' 'STATE:WYOMING'
 'STATE:CONNECTICUT' 'STATE:IOWA' 'STATE:NORTH' 'STATE:ALABAMA'
 'STATE:CALIFORNIA' 'STATE:GEORGIA' 'STATE:MASSACHUSETTS' 'STATE:NEBRASKA'
 'STATE:OREGON' 'STATE:VIRGINIA' 'STATE:INDIANA' 'STATE:KENTUCKY'
 'STATE:MARYLAND' 'STATE:MINNESOTA' 'STATE:OKLAHOMA' 'STATE:VERMONT'
 'STATE:WEST' 'STATE:COLORADO' 'STATE:NEVADA' 'STATE:UTAH' 'STATE:HAWAII'
 'STATE:RHODE' 'STATE:DELAWARE' 'STATE:IDAHO' 'STATE:LOUISIANA'
 'STATE:WISCONSIN' 'STATE:ARIZONA' 'STATE:SOUTH' '4,103,570.97'
 '4,085,382.79' 'STATE:MAINE' 'STATE:MONTANA' 'STATE:MISSISSIPPI'
 'STATE:DISTRICT' 'STATE:GUAM' 'STATE:ALASKA']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals

Two values to fix:
* '4,103,570.97', '4,085,382.79' in column Zip

In [262]:
#zip_2016[zip_2016['Zip']=='4,103,570.97']
#zip_2016.iloc[10264:10267]
fix_wrapped_value(zip_2016, '4,103,570.97')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
10265,TOTAL,1006965.21,1021674.52,1034125.83,1040805.41,4103570.97,,,,,,,,,,
10266,4103570.97,,,,,,,,,,,,,,,


In [263]:
#zip_2016[zip_2016['Zip']=='4,085,382.79']
#zip_2016.iloc[12349:12352]
fix_wrapped_value(zip_2016, '4,085,382.79')

Unnamed: 0,Zip,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,ZIP,CODE,WITHIN,STATE,BY.1,GRAMS,WT
12350,TOTAL,1023720.42,1015520.11,1018495.71,1027646.55,4085382.79,,,,,,,,,,
12351,4085382.79,,,,,,,,,,,,,,,


In [264]:
zip_2017 = pd.read_csv('../data/report-1-zipcode/zip_2017.txt', delim_whitespace=True)
check_data_new(zip_2017)

Zip column should contain mostly 3-digit gateway zip codes.
['Run' 'DRUG' 'STATE:' 'ZIP' 'TOTAL' 'DATE' 'RETAIL' 'STATE:VERMONT'
 'STATE:NEW' 'STATE:GEORGIA' 'STATE:CALIFORNIA']


Q1, Q2, Q3, Q4, TOTAL, and DRUG columns should have mostly numeric values, including commas / decimals
Checking column Q1
['Date:' 'CODE:' 'ALABAMA' 'CODE' 'ALASKA' 'AMERICAN' 'RANGE:' 'DRUG'
 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE'
 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS'
 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND'
 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA'
 'NEBRASKA' 'NEVADA' 'NEW' 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON'
 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH'
 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN' 'WYOMING'
 'CODE:1205DRUG' nan 'CODE:2270DRUG' 'YORK' 'CODE:2315DRUG'
 'CODE:7365DRUG' 'CODE:7379DRUG' 'CODE:9041LDRUG' 'CODE:9250BDRUG'
 'CODE:9652DRUG'

No wrapped values to fix in 2017, but a few odd values are worth a look:

* 'AN' in column Q4
* 'CODE' in column TOTAL
* 'AN', 'PRODUCT' in column DISTRIBUTION
* 'AP' in column ZIP
* 'AP' in column STATE

In [265]:
# part of a drug name
#zip_2017[zip_2017['Q4']=='AN']

In [266]:
# just header info
#zip_2017[zip_2017['TOTAL']=='CODE']

In [267]:
# part of a drug name
#zip_2017[zip_2017['DISTRIBUTION']=='AN']

In [268]:
# part of a drug name
#zip_2017[zip_2017['DISTRIBUTION']=='PRODUCT']

In [269]:
# both part of a drug name
#zip_2017[zip_2017['ZIP']=='AP']
#zip_2017[zip_2017['STATE']=='AP']

Now we are ready to process and quality-check the reports in the newer format (2006-2017).

The cleaning function below is adapted from the one used for the older reports - read the comments to understand the differences.

In [270]:
def clean_zip_new(df, year, drug_codes):
    """
    Use this function to clean files from 2006-2017 inclusive.
    """
    # rename columns
    df.rename(columns={'ARCOS': "Zip", 
                       '3':'Q1', 
                       '-': 'Q2', 
                       'REPORT': 'Q3', 
                       '1':'Q4', 
                       'RETAIL':'TOTAL'}, 
              inplace=True)
    
    # insert new columns
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='State', loc=1, value=None)
    df.insert(column='Drug', loc=2, value=None)
    # create 'Drug Code' up front so it exists even if no code is matched below
    df.insert(column='Drug Code', loc=3, value=None)
    
    # build the state name from the pieces that follow 'STATE:'
    # (multi-word names are split across the Q1, Q2, and Q3 columns)
    df.loc[df['Zip']=='STATE:', 'State']=df['Q1']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q2'])), 'State']=df["State"]+" "+df['Q2']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q3'])), 'State']=df["State"]+" "+df['Q3']

    # drop unnecessary rows
    # one small change - starting with the 2017 report
    # the headers say "DATE RANGE" instead of "REPORTING PERIOD"
    drops = ['REPORTING', 'RETAIL', 'Run', 'ZIP', 'ARCOS', "DATE"]
    for d in drops:
        df = df.drop(df[df['Zip']==d].index)
    
    # get the drug names
    # the method is different here because the data is less organized
    # iterate over the keys in sorted order, so that a code such as '1100'
    # is handled before a longer code that starts with it ('1100D');
    # otherwise a partial match could overwrite the right drug name with the wrong one
    # accidentally matching on a numeric gram value would be a concern,
    # except that those values are all formatted as strings with commas
    # and all drug codes are 4 digits or 4 digits with a letter at the end
    for key in sorted(drug_codes.keys()):
        # na=False treats missing cells as non-matches
        match = (df['Q1'].str.contains(key, na=False) |
                 df['Q2'].str.contains(key, na=False))
        df.loc[match, 'Drug'] = drug_codes[key]
        df.loc[match, 'Drug Code'] = key
        
    # address the special case of 1100 and 1100D
    # do this after the loop above to overwrite any wrong values with the right ones
    df.loc[df['Q1'].str.contains('1100DRUG', na=False), 'Drug'] = drug_codes['1100']
    df.loc[df['Q1'].str.contains('1100DRUG', na=False), 'Drug Code'] = '1100'

    
    # forward fill state and drug names
    # (.ffill() replaces the deprecated fillna(method='ffill'))
    df['State'] = df['State'].ffill()
    df['Drug'] = df['Drug'].ffill()
    df['Drug Code'] = df['Drug Code'].ffill()

    
    # drop out more unneeded rows
    df=df.drop(df[df['Zip']=='DRUG'].index)
    df=df.drop(df[df['Zip']=='STATE:'].index)
    
    # select just what we need to keep
    df = df[['Year', 'State', 'Drug', 'Drug Code', 
             'Zip', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']]
    df = df.drop(df.loc[pd.isnull(df['TOTAL'])].index)
    
    # convert to floats in the numeric columns
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df
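
To see why the key order and the 1100 special case matter, here is a small stand-alone sketch of the matching step. Everything in it is illustrative: `demo_codes` is a stand-in for `drug_codes` with invented names, `tag_codes` is a hypothetical helper, and I am assuming the header text for code 1100D would appear as "CODE:1100DDRUG", following the pattern of the other lettered codes.

```python
import pandas as pd

# Placeholder dictionary: the names are invented for this illustration.
demo_codes = {'1100D': 'NAME FOR 1100D', '1100': 'NAME FOR 1100'}

# Header fragments as they might land in the Q1 column (assumed format),
# plus a gram value and a missing cell that should never match.
q1 = pd.Series(['CODE:1100DDRUG', 'CODE:1100DRUG', '1,100.25', None])

def tag_codes(series, codes):
    """Tag each entry with the last matching key, visiting keys in sorted order."""
    tagged = pd.Series([None] * len(series), index=series.index, dtype=object)
    for key in sorted(codes):
        tagged[series.str.contains(key, regex=False, na=False)] = key
    return tagged

tagged = tag_codes(q1, demo_codes)
before_fix = tagged.copy()
# 'CODE:1100DRUG' contains the substring '1100D', so it is mis-tagged at this point.
print(before_fix.tolist())

# The special case: '1100DRUG' is code 1100 followed by the word DRUG.
tagged[q1.str.contains('1100DRUG', regex=False, na=False)] = '1100'
print(tagged.tolist())
```

Sorted order makes the longer key win for a genuine 1100D header, and the follow-up step repairs the one case where the longer key wins by accident. The gram value '1,100.25' is never tagged because the comma breaks up the digits.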

In [271]:
# be sure not to re-read in the CSVs 
# if you have already done the manual corrections for wrapped values!

print("Processing 2006 file...")
zip_2006 = clean_zip_new(zip_2006, 2006, drug_codes)
print("Done.")

print("Processing 2007 file...")
zip_2007 = clean_zip_new(zip_2007, 2007, drug_codes)
print("Done.")

print("Processing 2008 file...")
zip_2008 = clean_zip_new(zip_2008, 2008, drug_codes)
print("Done.")

print("Processing 2009 file...")
zip_2009 = clean_zip_new(zip_2009, 2009, drug_codes)
print("Done.")

print("Processing 2010 file...")
zip_2010 = clean_zip_new(zip_2010, 2010, drug_codes)
print("Done.")

print("Processing 2011 file...")
zip_2011 = clean_zip_new(zip_2011, 2011, drug_codes)
print("Done.")

print("Processing 2012 file...")
zip_2012 = clean_zip_new(zip_2012, 2012, drug_codes)
print("Done.")

print("Processing 2013 file...")
zip_2013 = clean_zip_new(zip_2013, 2013, drug_codes)
print("Done.")

print("Processing 2014 file...")
zip_2014 = clean_zip_new(zip_2014, 2014, drug_codes)
print("Done.")

print("Processing 2015 file...")
zip_2015 = clean_zip_new(zip_2015, 2015, drug_codes)
print("Done.")

print("Processing 2016 file...")
zip_2016 = clean_zip_new(zip_2016, 2016, drug_codes)
print("Done.")

print("Processing 2017 file...")
zip_2017 = clean_zip_new(zip_2017, 2017, drug_codes)
print("Done.")

Processing 2006 file...
Done.
Processing 2007 file...
Done.
Processing 2008 file...
Done.
Processing 2009 file...
Done.
Processing 2010 file...
Done.
Processing 2011 file...
Done.
Processing 2012 file...
Done.
Processing 2013 file...
Done.
Processing 2014 file...
Done.
Processing 2015 file...
Done.
Processing 2016 file...
Done.
Processing 2017 file...
Done.


We need one additional checking function, to make sure that no wrapped total values were missed. 

In [272]:
import itertools
def check_state_totals(df):
    """
    Use to check if there are any state-drugcode combinations that don't have a row of total values.
    Slow to run due to needing to loop over a large number of state-drugcode pairs.
    Primarily needed for files for years 2006-2017 inclusive.
    """
    i = 0
    for state in df['State'].unique():
        for pair in itertools.product([state], df.loc[df['State']==state, 'Drug Code'].unique().tolist()):
            if len(df.loc[(df['State']==pair[0]) 
                           & (df['Zip']=='TOTAL')
                           & (df['Drug Code']==pair[1])])<1:
                print("No state total for {} ({}) in {}.".format(pair[1], drug_codes[pair[1]], pair[0]))
                i+=1
    if i==0:
        print("No state total issues found.")
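
Since the loop above is slow, a grouped version may be worth trying. This is a sketch of an alternative rather than a drop-in replacement: `missing_state_totals` is a hypothetical name, it returns the problem pairs instead of printing them, and `groupby` silently skips rows where State or Drug Code is missing.

```python
import pandas as pd

def missing_state_totals(df):
    """Hypothetical alternative: list (State, Drug Code) pairs with no 'TOTAL' row."""
    is_total = df['Zip'] == 'TOTAL'
    # For each state-drugcode pair, is there at least one TOTAL row?
    has_total = is_total.groupby([df['State'], df['Drug Code']]).any()
    return has_total[~has_total].index.tolist()

# Made-up example: the second Maine drug code has no TOTAL row.
demo = pd.DataFrame({
    'State': ['MAINE', 'MAINE', 'MAINE', 'OHIO', 'OHIO'],
    'Drug Code': ['9143', '9143', '9050', '9143', '9143'],
    'Zip': ['040', 'TOTAL', '041', '430', 'TOTAL'],
})
print(missing_state_totals(demo))
```

An empty list corresponds to the "No state total issues found." message from the original function.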

In [273]:
new_zips = {'2006': zip_2006, '2007': zip_2007, 
            '2008': zip_2008, '2009': zip_2009, 
            '2010': zip_2010, '2011': zip_2011,
            '2012': zip_2012, '2013': zip_2013, 
            '2014': zip_2014, '2015': zip_2015, 
            '2016': zip_2016, '2017': zip_2017}

for f in new_zips.keys():
    print('Checking {} file...'.format(f))
    quarterly_check(new_zips[f])
    repeats_check_zip(new_zips[f])
    check_states(new_zips[f])
    check_state_totals(new_zips[f])
    print("Done.")
    print()
    print()

Checking 2006 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2007 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2008 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2009 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2010 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2011 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.


Checking 2012 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present
No state total issues found.
Done.



Everything looks good here. The first time I ran these checks, though, the state totals check flagged problems with drug code 1100, which is how I discovered that rows were being wrongly tagged as 1100D when the header text was concatenated as "1100DRUG". It's a good illustration of why you should build in as many checks as possible.

### Wrap up the zip files

Finally, we are done!

Now you can package them all into one big dataframe and save it off to .csv, or to whatever format you'd like to use.

In [275]:
zip_all = pd.concat(list(old_zips.values())+list(new_zips.values()), 
                    ignore_index=True)
zip_all.to_csv('../data/report-1-zipcode/retail_distribution_by_zipcode.csv', index=False)
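
One thing to keep in mind when reading the saved file back in: 3-digit zip prefixes such as '040' can lose their leading zero if pandas decides the column is numeric. In the full file the Zip column also holds 'TOTAL' rows, so it will probably stay as text, but once you filter down to zip-level rows and save again, the problem can appear. The sketch below uses a tiny made-up frame and an in-memory buffer to show the difference that `dtype={'Zip': str}` makes.

```python
import io
import pandas as pd

# Small stand-in for zip-level rows of the combined dataframe (values are made up).
demo = pd.DataFrame({'State': ['MAINE', 'MAINE'],
                     'Zip': ['040', '041'],
                     'TOTAL': [12.5, 7.25]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = demo.to_csv(index=False)

# Default type inference turns the prefixes into integers.
naive = pd.read_csv(io.StringIO(csv_text))
# Forcing the column to str keeps the leading zeros.
typed = pd.read_csv(io.StringIO(csv_text), dtype={'Zip': str})

print(naive['Zip'].tolist())
print(typed['Zip'].tolist())
```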