# Cleaning messy PDF data with pandas and Jupyter notebooks
## Part 2 - DEA ARCOS Report 3: Quarterly Distribution in Grams per 100K Population 

### Background

#### What is ARCOS?
The DEA publishes data annually from its Automation of Reports and Consolidated Orders System, or ARCOS. According to the DEA's website, ARCOS "monitors the flow of DEA controlled substances from their point of manufacture through commercial distribution channels to point of sale or distribution at the dispensing/retail level - hospitals, retail pharmacies, practitioners, mid-level practitioners, and teaching institutions....these transactions...are then summarized into reports which give investigators in Federal and state government agencies information which can then be used to identify the diversion of controlled substances into illicit channels of distribution. The information on drug distribution is used throughout the United States (U.S.) by U.S. Attorneys and DEA investigators to strengthen criminal cases in the courts."

So, ARCOS exists to help the government identify patterns in the manufacture and distribution of controlled substances that might indicate that these substances are being sold illegally. Annual ARCOS reports are publicly available on the DEA's website, dating back to the year 2000, but unfortunately they are only available in PDF form and are dozens or even hundreds of pages long.

#### What's in this notebook?
I was interested in doing some data analysis and visualization on the distribution of oxycodone, an opioid painkiller that is one of the main drivers of the current prescription pain pill (and arguably heroin) addiction epidemic in the United States.

Aside from a wealth of fascinating (and sometimes disturbing, sad, and frightening) information to explore, the ARCOS data also presents a great data cleaning challenge because it is distributed as PDFs, which makes it a perfect opportunity to practice your pandas skills. Luckily, the files have nearly identical formatting, aside from a shift in report formatting in 2006 and a few anomalies here and there.

This notebook is meant to demo the functionality of pandas and Jupyter notebooks for data cleaning. Working with this data was a great project for improving my pandas skills, and I'm sharing the code here so others can learn and practice.

This is part 2 - cleaning and processing of Report 3. 

In [603]:
# load the libraries we need, and the drug codes dictionary and geographies list
import pandas as pd
import numpy as np
import pickle

# use context managers so the pickle files are closed after loading
with open("drug_codes.pickle", "rb") as f:
    drug_codes = pickle.load(f)
with open("geographies.pickle", "rb") as f:
    geos = pickle.load(f)

### Notes on the data 

Get the raw data (in PDF....!)
You can find the ARCOS reports here: https://www.deadiversion.usdoj.gov/arcos/retail_drug_summary/index.html

There are six ARCOS reports published each year and I chose to work with three of them in particular:
* Report 1:  Retail Drug Distribution by Zip Code for Each State - total drug amounts (in grams) distributed to retail registrants in each state, by 'gateway' zip code (the first three numbers of the zip), on a quarterly basis
* Report 3: Quarterly Distribution in Grams per 100K Population - quarterly drug consumption in grams per 100,000 population, by state
* Report 5: Statistical Summary for Retail Drug Purchases - average annual purchases by drug by business activity (pharmacy, hospital, etc.)


A few notes: 

* For years before 2006, the reports are lumped together into one giant PDF (700+ pages long). In more recent years the DEA has published a separate PDF for each report.

* I tried several approaches for getting the text out of the PDFs. For a variety of reasons (in particular the unwieldy nature of the pre-2006 PDFs), it was easiest and quickest to copy-paste the entire contents of each PDF into a text file. This was an OK solution for me since there aren't that many files; if you were doing this with hundreds of files you would want to find another way. Another problem I ran into right away was that the long report title ran onto multiple lines in the txt file, which caused a lot of formatting problems in the dataframe, so I manually adjusted the title text in each txt file.

* For the pre-2006 reports, I (manually and carefully) removed the report content I wasn't interested in from the text file, and then used pandas to clean what remained. 

#### Step 1 - Getting from PDF into pandas in the notebook

What to consider and experiment with:
* How will you pull the data out of the PDF? How much of the formatting (columns, headers, etc) will you be able to preserve?
* What delimiter works best?
* If the number of PDF files is small, are there any steps you can perform right in the txt or spreadsheet file that will make things easier?

There are different options for getting data from a PDF into a format you can interact with more directly. I ended up copy-pasting the full contents of each file, since it didn't seem that the PDF-to-spreadsheet and similar tools out there would save me much time.

I tried several text editors and spreadsheet applications, looking for something that would do a relatively good job delimiting the data based on the PDF files. Sublime is one of my favorites and that's what I used in the end. 

Tips
* Try a couple different editors and delimit options, and read each one into pandas to see how the structure of the data looks. Choose one that will minimize the amount of cleaning you need to do
* Keep your .txt file open as you begin cleaning in pandas
* Never save over your raw .txt file! This is a trial-and-error process and you will likely end up losing some data at one point or another. If you've saved over the starting point you will have to go back to your PDF...
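As a minimal sketch of that experiment, the snippet below reads a few whitespace-delimited lines into pandas from an in-memory string rather than a file. The sample text is partly made up for illustration (it only imitates the layout of the pasted report text), and using `sep=r'\s+'` together with `dtype=str` is an assumption about what works well here, not the only reasonable choice.

```python
import io
import pandas as pd

# Illustrative sample that imitates the pasted report text.
# The NEW YORK figures are invented for the example.
sample_text = """ARCOS 2 - REPORT 3 QUARTERLY DISTRIBUTION
DRUG CODE: 1100B DRUG NAME: DL-AMPHETAMINE
ALASKA 105.20 95.05 101.17 110.90 412.33
NEW YORK 1.00 2.00 3.00 4.00 10.00
"""

# Split on runs of whitespace and keep every value as text, so nothing
# is converted before we have had a chance to inspect it.
sample = pd.read_csv(io.StringIO(sample_text), sep=r'\s+', dtype=str)

print(sample.shape)
print(sample.head())
```

The first line of the text becomes the column names, shorter rows are padded with NaN, and a two-word name such as NEW YORK pushes its numbers one column to the right. That is the structure the cleaning steps later in this notebook have to undo.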


I'm going to use much the same approach as in Part 1 for cleaning up the files containing Report 3, the quarterly distribution per 100K population. These files were quite a bit messier, so the preprocessing is more complex, and some of the files had format differences that I discovered via the testing techniques.

In [275]:
pop_2000 = pd.read_csv('population_2000.txt', sep=r'\s+')
pop_2000.head(15)

Unnamed: 0,ARCOS,2,-,REPORT,3,QUARTERLY,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
1,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,
2,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,----------------------------------------------...,,,,,,,,,,,
4,ALASKA,105.20,95.05,101.17,110.90,412.33,,,,,,
5,ALABAMA,172.96,149.64,168.43,187.49,678.54,,,,,,
6,ARKANSAS,174.08,160.97,171.89,190.19,697.14,,,,,,
7,ARIZONA,134.19,131.48,127.38,138.19,531.25,,,,,,
8,CALIFORNIA,50.79,52.67,49.06,54.98,207.53,,,,,,
9,COLORADO,60.53,52.67,61.75,71.06,246.02,,,,,,


#### Step 2 - check for unusual or irregular data

As with Report 1, in order to do the data cleaning in bulk we will be making assumptions about the format of the data when it comes in. So before we can safely process each file, we want to check those assumptions.
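The core of the check is a regular expression that recognizes "numeric with a decimal point" strings, so that everything else stands out. Here is that idea in isolation on a toy Series. The helper name `non_numeric_values` is hypothetical, and unlike the function in this notebook it drops missing values before listing the leftovers.

```python
import pandas as pd

# Optional sign, digits and thousands separators, a decimal point, then digits.
NUMERIC_PATTERN = r'[-+]?[0-9,]*\.[0-9]+$'

def non_numeric_values(series):
    """Return the distinct non-missing values that do not look like decimals."""
    looks_numeric = series.str.match(NUMERIC_PATTERN, na=False)
    return series[~looks_numeric].dropna().unique().tolist()

toy = pd.Series(['105.20', 'CODE:', '1,234.50', None, '1ST', '95.05'])
odd = non_numeric_values(toy)
print(odd)
```

Anything this returns is either a header fragment you expect (such as `CODE:`) or a sign that the columns are not lined up the way you assumed.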

In [276]:
def check_pop_data(df):
    df.rename(columns={'ARCOS': "State", 
                   '2':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)
    
    print("Checking State column - we expect mostly state names here, as well as DRUG.")
    print(df['State'].unique())
    print()
    print()
    print('For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.')
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    
    for c in cols:
        vals = df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()
        print("Checking column {} for unusual values.".format(c))
        if c=='Q1':
            print("We expect to see CODE: but not DRUG, and no drug codes.")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='Q2':
            print("We expect to see drug codes here, but should not see DRUG or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        if c=='Q3':
            print("We expect to see DRUG here, but no drug codes.")
        if c=='Q4':
            print("We expect to see NAME: here, but not DRUG and no drug codes.")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='TOTAL':
            print("We expect to see partial/full drug names, but no codes and not the words DRUG, NAME, or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        print(vals)
        print()
        print()
    
    cols2 = ['DISTRIBUTION', 'IN', 'GRAMS', 'PER', '100,000', 'POPULATION']
    print()
    print("For the rest of the columns, we expect mostly NaNs and some partial drug names.")
    for c in cols2:
        print("Checking column {} for unusual values.".format(c))
        print(df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique())
        print()


In [277]:
pop_2000.head()

Unnamed: 0,ARCOS,2,-,REPORT,3,QUARTERLY,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
1,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,
2,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,----------------------------------------------...,,,,,,,,,,,
4,ALASKA,105.20,95.05,101.17,110.90,412.33,,,,,,


In [278]:
check_pop_data(pop_2000)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'HAWAII' 'IOWA'
 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'DATE:' 'ARCOS'
 'QUARTERLY' 'SOUTH' 'TENNESSEE' 'TRUST' 'TEXAS' 'UTAH' 'VIRGINIA'
 'VIRGIN' 'VERMONT' 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
['PERIOD:' 'CODE:' '1ST' n

#### Step 3 - determine if we can reuse old code

Having checked for irregular data, we can move on to cleaning it.

This data looks really similar to what we had in the other zipcode-level report, so we should be able to reuse most of the code from the finished cleaning functions there.

The old code is below, with comments describing what we'll probably need to change.

In [279]:
# DON'T NEED TO RUN THIS CELL
# Here's the cleaning code for the zipcode reports, from the previous notebook
def clean_zip_old(df, year):
    # this step can remain largely similar
    df.rename(columns={'ARCOS': "Zip", 
                       '2':'Q1', 
                       '-': 'Q2', 
                       'REPORT': 'Q3', 
                       '1':'Q4', 
                       'RETAIL':'TOTAL'}, 
                  inplace=True)
    
    # we don't have to deal with shifted state totals
    # though we will have to deal with shifted data for long state names
    start = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    shift = ['Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    df.loc[df['Zip']=='STATE','Zip'] = df['Q1']
    for i in range(0,5):
        df.loc[df['Zip']=='TOTAL', start[i]] = df[shift[i]]
   
    # this will remain, though we already have state taken care of
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='State', loc=1, value=None)
    df.insert(column='Drug', loc=2, value=None)

    # No need to get the state names except for the long ones - they are already there
    df.loc[df['Zip']=='STATE:', 'State'] = df['Q1']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q2'])), 'State'] = df["State"]+" "+df['Q2']
    df.loc[(df['Zip']=="STATE:") & 
           (pd.notnull(df['Q3'])), 'State'] = df["State"]+" "+df['Q3']

    # will need to check what states are present
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'        

    # some version of this will probably be needed 
    drops = ['ENFORCEMENT', 'REPORTING', 'RETAIL', 'DATE:', 'ZIP', 'ARCOS']
    for d in drops:
        df = df.drop(df[df['Zip']==d].index)
        df = df.drop(df[df['Q1']==d].index)
    
    #  this looks like it will be very similar
    for key in drug_codes.keys():
        df.loc[(df['Q1']=='CODE:') &
               (df['Q2']==key), 'Drug'] = drug_codes[key]

    # this should be reusable
    df['State'] = df['State'].ffill()
    df['Drug'] = df['Drug'].ffill()

    # May vary slightly
    df=df.drop(df[df['Zip']=='DRUG'].index)
    df=df.drop(df[df['Zip']=='STATE:'].index)
    df = df[['Year', 'State', 'Drug', 'Zip', 
             'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']]
    df = df.drop(df.loc[pd.isnull(df['TOTAL'])].index)
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df

#### Step 4 - readability and column names

Next, rename the key columns to make them easier to work with.

In [280]:
pop_2000.rename(columns={'ARCOS': "State", 
                   '2':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)

pop_2000.head(10)

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
1,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,
2,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,----------------------------------------------...,,,,,,,,,,,
4,ALASKA,105.20,95.05,101.17,110.90,412.33,,,,,,
5,ALABAMA,172.96,149.64,168.43,187.49,678.54,,,,,,
6,ARKANSAS,174.08,160.97,171.89,190.19,697.14,,,,,,
7,ARIZONA,134.19,131.48,127.38,138.19,531.25,,,,,,
8,CALIFORNIA,50.79,52.67,49.06,54.98,207.53,,,,,,
9,COLORADO,60.53,52.67,61.75,71.06,246.02,,,,,,


#### Step 5 - getting the drug names

First we'll add columns for year and drug name, then fill in the drug names.

We can get the drug names in the same way we did in Part 1 with the old files, by using the dict of drug codes and names.

Assumption check! Is it an accurate assumption that the drug codes are only appearing in column Q2, and that they are always preceded by "DRUG" in the state column? You can easily check this without having to scan through every row by looking at the unique values that are currently present in those columns. 
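One way to run that check is sketched below on a toy frame, so it can be tried without the real files. The frame and the `toy_codes` dict are stand-ins for `pop_2000` and `drug_codes`, and their values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for pop_2000 and drug_codes (illustrative values only).
toy = pd.DataFrame({
    'State': ['DRUG', 'ALASKA', 'DRUG', 'ALABAMA'],
    'Q1':    ['CODE:', '105.20', 'CODE:', '172.96'],
    'Q2':    ['1100B', '95.05', '9143', '149.64'],
})
toy_codes = {'1100B': 'DL-AMPHETAMINE BASE', '9143': 'OXYCODONE'}

# What sits in Q2 on the rows where State is DRUG?
codes_on_drug_rows = set(toy.loc[toy['State'] == 'DRUG', 'Q2'])

# Does any known code show up in Q2 on a row that is NOT a DRUG row?
codes_elsewhere = set(toy.loc[toy['State'] != 'DRUG', 'Q2']) & set(toy_codes)

# Are there values on DRUG rows that the dictionary does not know about?
unknown = codes_on_drug_rows - set(toy_codes)

print(codes_on_drug_rows, codes_elsewhere, unknown)
```

If `codes_elsewhere` or `unknown` is non-empty on the real data, the assumption does not hold and the lookup that follows will miss some rows.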

In [281]:
with open('drug_codes.pickle', 'rb') as f:
    drug_codes = pickle.load(f)

pop_2000.insert(column='Year', loc=0, value=2000)
pop_2000.insert(column='Drug', loc=2, value=None)
for key in drug_codes.keys():
    pop_2000.loc[(pop_2000['State']=='DRUG')&(pop_2000['Q2']==key), 'Drug'] = drug_codes[key]

pop_2000.head(10)

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,2000,REPORTING,,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
1,2000,DRUG,DL-AMPHETAMINE BASE,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,
2,2000,STATE,,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,2000,----------------------------------------------...,,,,,,,,,,,,
4,2000,ALASKA,,105.20,95.05,101.17,110.90,412.33,,,,,,
5,2000,ALABAMA,,172.96,149.64,168.43,187.49,678.54,,,,,,
6,2000,ARKANSAS,,174.08,160.97,171.89,190.19,697.14,,,,,,
7,2000,ARIZONA,,134.19,131.48,127.38,138.19,531.25,,,,,,
8,2000,CALIFORNIA,,50.79,52.67,49.06,54.98,207.53,,,,,,
9,2000,COLORADO,,60.53,52.67,61.75,71.06,246.02,,,,,,


In [282]:
pop_2000['Drug'].unique()

array([None, 'DL-AMPHETAMINE BASE', 'D-AMPHETAMINE BASE',
       'METHYLPHENIDATE', 'OXYCODONE', 'HYDROCODONE'], dtype=object)

It looks good, so we'll continue with forward filling the drug names, and go ahead and drop the rows with just drug names in them. 

First, double check that there's nothing in the drug column we don't expect to be there.

In [283]:
pop_2000['Drug'].unique()

array([None, 'DL-AMPHETAMINE BASE', 'D-AMPHETAMINE BASE',
       'METHYLPHENIDATE', 'OXYCODONE', 'HYDROCODONE'], dtype=object)

In [284]:
# forward fill the drug names 
pop_2000['Drug'] = pop_2000['Drug'].ffill()

# drop the drug name rows
pop_2000 = pop_2000.drop(pop_2000[pop_2000['State']=='DRUG'].index)

pop_2000.head(10)

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,2000,REPORTING,,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
2,2000,STATE,DL-AMPHETAMINE BASE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,2000,----------------------------------------------...,DL-AMPHETAMINE BASE,,,,,,,,,,,
4,2000,ALASKA,DL-AMPHETAMINE BASE,105.20,95.05,101.17,110.90,412.33,,,,,,
5,2000,ALABAMA,DL-AMPHETAMINE BASE,172.96,149.64,168.43,187.49,678.54,,,,,,
6,2000,ARKANSAS,DL-AMPHETAMINE BASE,174.08,160.97,171.89,190.19,697.14,,,,,,
7,2000,ARIZONA,DL-AMPHETAMINE BASE,134.19,131.48,127.38,138.19,531.25,,,,,,
8,2000,CALIFORNIA,DL-AMPHETAMINE BASE,50.79,52.67,49.06,54.98,207.53,,,,,,
9,2000,COLORADO,DL-AMPHETAMINE BASE,60.53,52.67,61.75,71.06,246.02,,,,,,
10,2000,CONNECTICUT,DL-AMPHETAMINE BASE,92.78,99.15,93.85,110.78,396.58,,,,,,
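The forward-fill-then-drop pattern used in this step is easier to see on a toy frame. This is a small sketch with made-up rows: the drug name appears only on the DRUG row, gets carried down to the state rows beneath it, and then the DRUG rows themselves are removed.

```python
import pandas as pd

# Toy frame: the drug name is only present on the DRUG header rows.
toy = pd.DataFrame({
    'State': ['DRUG', 'ALASKA', 'ALABAMA', 'DRUG', 'ALASKA'],
    'Drug':  ['OXYCODONE', None, None, 'HYDROCODONE', None],
})

# Carry each drug name down until the next one appears.
toy['Drug'] = toy['Drug'].ffill()

# The header rows have done their job, so drop them.
toy = toy[toy['State'] != 'DRUG']

print(toy)
```

The order matters: dropping the DRUG rows before filling would throw away the only rows that carry the names.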


#### Step 6 - dealing with long state names

Next we have to deal with the long state names that got split out into multiple columns. This is pretty clean data so a couple simple rules will work.

Again, take a moment to check your assumptions about how the data is formatted - use a unique values check to be sure you're right.
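To make the rule concrete, here is a sketch on a three-row toy frame with invented numbers for the multi-word names. It relies on the same assumption as this step: a value in the `DISTRIBUTION` column means the name has at least two words, and a value in `IN` means it has three.

```python
import pandas as pd

cols = ['State', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION', 'IN']
toy = pd.DataFrame(
    [['ALASKA', '105.20', '95.05', '101.17', '110.90', '412.33', None, None],
     ['NEW', 'YORK', '1.00', '2.00', '3.00', '4.00', '10.00', None],
     ['DISTRICT', 'OF', 'COLUMBIA', '1.00', '2.00', '3.00', '4.00', '10.00']],
    columns=cols)

# Overflow into DISTRIBUTION: the second word of the name is sitting in Q1.
toy.loc[toy['DISTRIBUTION'].notna(), 'State'] = toy['State'] + ' ' + toy['Q1']

# Overflow into IN as well: the third word of the name is sitting in Q2.
toy.loc[toy['IN'].notna(), 'State'] = toy['State'] + ' ' + toy['Q2']

print(toy['State'].tolist())
```

Only the State column is repaired here; the numbers are still shifted to the right and need a separate fix.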

In [285]:
pop_2000.loc[pd.notnull(pop_2000['DISTRIBUTION']), 'State']=pop_2000["State"]+" "+pop_2000['Q1']
pop_2000.loc[pd.notnull(pop_2000['IN']), 'State']=pop_2000["State"] +" "+pop_2000['Q2']

pop_2000.head(10)

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,2000,REPORTING,,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
2,2000,STATE 1ST QUARTER,DL-AMPHETAMINE BASE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,2000,----------------------------------------------...,DL-AMPHETAMINE BASE,,,,,,,,,,,
4,2000,ALASKA,DL-AMPHETAMINE BASE,105.20,95.05,101.17,110.90,412.33,,,,,,
5,2000,ALABAMA,DL-AMPHETAMINE BASE,172.96,149.64,168.43,187.49,678.54,,,,,,
6,2000,ARKANSAS,DL-AMPHETAMINE BASE,174.08,160.97,171.89,190.19,697.14,,,,,,
7,2000,ARIZONA,DL-AMPHETAMINE BASE,134.19,131.48,127.38,138.19,531.25,,,,,,
8,2000,CALIFORNIA,DL-AMPHETAMINE BASE,50.79,52.67,49.06,54.98,207.53,,,,,,
9,2000,COLORADO,DL-AMPHETAMINE BASE,60.53,52.67,61.75,71.06,246.02,,,,,,
10,2000,CONNECTICUT,DL-AMPHETAMINE BASE,92.78,99.15,93.85,110.78,396.58,,,,,,


#### Step 7 - dealing with shifted state data

Next, the data that was shifted over by the longer state names needs to be fixed. This can be done again with a couple simple rules to handle the case of 2- and 3-word state names.
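A toy version of the shift, with invented numbers, shows why the columns have to be copied left to right. In this sketch the two masks are computed once up front, which is a small variation on the cell in this step; the column list is the same.

```python
import pandas as pd

cols = ['State', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION', 'IN']
toy = pd.DataFrame(
    [['ALASKA', '105.20', '95.05', '101.17', '110.90', '412.33', None, None],
     ['NEW YORK', 'YORK', '1.00', '2.00', '3.00', '4.00', '10.00', None],
     ['DISTRICT OF COLUMBIA', 'OF', 'COLUMBIA', '1.00', '2.00', '3.00', '4.00', '10.00']],
    columns=cols)

shift = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION', 'IN']
three_words = toy['IN'].notna()                              # shifted by two
two_words = toy['DISTRIBUTION'].notna() & toy['IN'].isna()   # shifted by one

# Copy left to right so each source column is read before it is overwritten.
for i in range(5):
    toy.loc[three_words, shift[i]] = toy[shift[i + 2]]
for i in range(5):
    toy.loc[two_words, shift[i]] = toy[shift[i + 1]]

print(toy[['State', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']])
```

Rows with one-word names match neither mask and are left alone.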

In [286]:
shift = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION', 'IN']
for i in range(0,5):
    pop_2000.loc[((pd.notnull(pop_2000['DISTRIBUTION'])) 
            & (pd.notnull(pop_2000['IN']))), shift[i]] = pop_2000[shift[i+2]]

for i in range(0,5):
    pop_2000.loc[((pd.notnull(pop_2000['DISTRIBUTION'])) 
            & (pd.isnull(pop_2000['IN']))), shift[i]] = pop_2000[shift[i+1]]

pop_2000.head(10)   

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,2000,REPORTING,,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
2,2000,STATE 1ST QUARTER,DL-AMPHETAMINE BASE,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,2000,----------------------------------------------...,DL-AMPHETAMINE BASE,,,,,,,,,,,
4,2000,ALASKA,DL-AMPHETAMINE BASE,105.20,95.05,101.17,110.90,412.33,,,,,,
5,2000,ALABAMA,DL-AMPHETAMINE BASE,172.96,149.64,168.43,187.49,678.54,,,,,,
6,2000,ARKANSAS,DL-AMPHETAMINE BASE,174.08,160.97,171.89,190.19,697.14,,,,,,
7,2000,ARIZONA,DL-AMPHETAMINE BASE,134.19,131.48,127.38,138.19,531.25,,,,,,
8,2000,CALIFORNIA,DL-AMPHETAMINE BASE,50.79,52.67,49.06,54.98,207.53,,,,,,
9,2000,COLORADO,DL-AMPHETAMINE BASE,60.53,52.67,61.75,71.06,246.02,,,,,,
10,2000,CONNECTICUT,DL-AMPHETAMINE BASE,92.78,99.15,93.85,110.78,396.58,,,,,,


#### Step 8 - drop any unneeded or empty rows

Now we can go through and remove any rows that are empty, contain only header data, etc. A good way to do this is to drop any row whose value in the State column is not in the universe of possible state or territory names. That should take care of almost everything, and any other issues will pop up when we do the type conversion of the numeric columns.
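Here is the whitelist idea on a toy frame. `toy_geos` is a hypothetical, shortened stand-in for the `geos` list loaded at the top of the notebook, and the rows are made up.

```python
import pandas as pd

# Shortened stand-in for the real list of valid geographies.
toy_geos = ['ALASKA', 'ALABAMA', 'GUAM']

toy = pd.DataFrame({
    'State': ['REPORTING', 'ALASKA', '-----', 'TRUST TERRITORIES (GUAM)', 'ALABAMA'],
    'Q1':    ['PERIOD:', '105.20', None, '1.00', '172.96'],
})

# Standardize names first, otherwise valid rows get dropped with the junk.
toy.loc[toy['State'] == 'TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'

keep = toy['State'].isin(toy_geos)
dropped = toy.loc[~keep, 'State'].unique().tolist()
toy = toy[keep]

print('Dropping:', dropped)
print(toy)
```

Printing what is about to be dropped is the important part: a misspelled or unexpected territory name shows up in that list instead of silently disappearing.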

In [287]:
# change references to Guam before doing these drops
pop_2000.loc[pop_2000['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'
# give yourself a view into what's getting dropped
print("Dropping rows with the following values in the State column:")
print(pop_2000['State'].loc[~(pop_2000['State'].isin(geos))].unique())
pop_2000 = pop_2000.drop(pop_2000.loc[~(pop_2000['State'].isin(geos))].index)

Dropping rows with the following values in the State column:
['REPORTING' 'STATE 1ST QUARTER'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'DATE: 12/24/2002' 'ARCOS' 'QUARTERLY DISTRIBUTION']


#### Step 9 - final clean-up

Last few things to do - drop out the extra columns, and convert the numeric columns to floats.

In [289]:
pop_2000.head()

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
4,2000,ALASKA,DL-AMPHETAMINE BASE,105.2,95.05,101.17,110.9,412.33,,,,,,
5,2000,ALABAMA,DL-AMPHETAMINE BASE,172.96,149.64,168.43,187.49,678.54,,,,,,
6,2000,ARKANSAS,DL-AMPHETAMINE BASE,174.08,160.97,171.89,190.19,697.14,,,,,,
7,2000,ARIZONA,DL-AMPHETAMINE BASE,134.19,131.48,127.38,138.19,531.25,,,,,,
8,2000,CALIFORNIA,DL-AMPHETAMINE BASE,50.79,52.67,49.06,54.98,207.53,,,,,,


In [291]:
pop_2000 = pop_2000[['Year', 'State', 'Drug', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']].copy()
cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
for col in cols:
    pop_2000[col] = pop_2000[col].str.replace(",","").astype(float)
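If the `astype(float)` conversion raises an error, it helps to find out which values are responsible. The sketch below uses `pd.to_numeric` with `errors='coerce'` as a diagnostic on a toy frame; this is an alternative approach for illustration, not what the cell above does, and the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({'Q1': ['105.20', '1,234.50', 'N/A'],
                    'TOTAL': ['412.33', '5,000.00', '7.5']})

# Remove thousands separators, then convert; unparseable values become NaN.
cleaned = toy['Q1'].str.replace(',', '', regex=False)
as_number = pd.to_numeric(cleaned, errors='coerce')

# The original strings that failed to convert.
bad_values = toy.loc[as_number.isna(), 'Q1'].tolist()

print(as_number.tolist())
print('Could not convert:', bad_values)
```

Once `bad_values` comes back empty for every numeric column, the stricter `astype(float)` conversion should go through without complaint.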

### Refactor the code

There was a little more variation in these file formats. As I processed each of the files I updated the function code for the different cases. I won't go through it here for brevity - taking the first function and modifying it yourself to clean each new file by trial & error and checking assumptions would be a great exercise to improve your pandas skills.
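Since each cleaning function in this section states the years it covers in its docstring, one way to keep track of which function goes with which file is a small lookup like the sketch below. The helper `pick_cleaner_name` is hypothetical, and the file naming simply assumes every year follows the `population_2000.txt` pattern.

```python
def pick_cleaner_name(year):
    """Map a report year to the name of the cleaning function for that layout."""
    if 2000 <= year <= 2004:
        return 'clean_pop_old'
    if 2005 <= year <= 2006:
        return 'clean_pop_oldv2'
    if 2007 <= year <= 2017:
        return 'clean_pop_new'
    raise ValueError('No cleaning function covers year {}'.format(year))

# Assumed file naming, following the population_2000.txt pattern.
plan = {year: ('population_{}.txt'.format(year), pick_cleaner_name(year))
        for year in range(2000, 2018)}

for year in (2000, 2005, 2017):
    print(year, plan[year])
```

Raising an error for an uncovered year is deliberate: a new report layout should stop the run rather than be cleaned with the wrong rules.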

In [207]:
pop_2000.head()

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DISTRIBUTION,IN,GRAMS,PER,"100,000",POPULATION
0,REPORTING,PERIOD:,01/01/2000,TO,12/31/2000,,,,,,,
1,DRUG,CODE:,1100B,DRUG,NAME:,DL-AMPHETAMINE,BASE,,,,,
2,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,TO,DATE
3,----------------------------------------------...,,,,,,,,,,,
4,ALASKA,105.20,95.05,101.17,110.90,412.33,,,,,,


In [604]:
def clean_pop_old(df, year, drug_codes, geos):
    """
    Use for years 2000-2004 inclusive of ARCOS Report 3 data. 
    """
    # rename columns
    df.rename(columns={'ARCOS': "State", 
                   '2':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)
    
    # add new columns for year and drug
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='Drug', loc=2, value=None)
    
    # check for new substance codes
    codes = df['Q2'].loc[df['Q1'].str.contains('CODE', na=False)].unique().tolist()
    for c in codes:
        if c not in drug_codes.keys():
            print("Found new substance code: {}".format(c))
    
    # get the drug names
    for key in drug_codes.keys():
        df.loc[(df['State']=='DRUG') 
               & (df['Q2']==key), 'Drug'] = drug_codes[key]
    df['Drug'] = df['Drug'].ffill()
    df = df.drop(df[df['State']=='DRUG'].index)

    # fix longer state names
    df.loc[pd.notnull(df['DISTRIBUTION']), 'State'] = df["State"]+" "+df['Q1']
    df.loc[pd.notnull(df['IN']), 'State'] = df["State"] +" "+df['Q2']

    # fix shifted data
    shift = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION', 'IN']
    for i in range(0,5):
        df.loc[(pd.notnull(df['DISTRIBUTION']) 
                & pd.notnull(df['IN'])), shift[i]] = df[shift[i+2]]

    for i in range(0,5):
        df.loc[(pd.notnull(df['DISTRIBUTION']) 
                & pd.isnull(df['IN'])), shift[i]] = df[shift[i+1]]      

    # standardize some geography names
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'
    df.loc[df['State']=='UNITED STATES TOTAL', 'State'] = 'UNITED STATES'
    df.loc[df['State']=='U.S. TOTAL', 'State'] = 'UNITED STATES'

    # drop out unnecessary rows (headers, etc.)
    print("Dropping rows with the following values in the State column:")
    print(df['State'].loc[~(df['State'].isin(geos))].unique())
    df = df.drop(df.loc[~(df['State'].isin(geos))].index)
    
    # keep only what we need
    df = df[['Year', 'State', 'Drug', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']].copy()
    
    # change the datatype of the numeric columns
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df

# A change in the text/titling of the reports means a slightly different 
# function is needed for some of the older files

def clean_pop_oldv2(df, year, drug_codes, geos):
    """
    Use for years 2005 and 2006 of ARCOS Report 3 data.
    """
    # rename columns
    df.rename(columns={'ARCOS': "State", 
                       '2':'Q1', 
                       '-': 'Q2', 
                       'REPORT': 'Q3', 
                       '3':'Q4', 
                       'QUARTERLY':'TOTAL'}, 
              inplace=True)
    
    # add new columns for year and drug
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='Drug', loc=2, value=None)
    
    # check for new substance codes
    codes = df['Q2'].loc[df['Q1'].str.contains('CODE', na=False)].unique().tolist()
    for c in codes:
        if c not in drug_codes.keys():
            print("Found new substance code: {}".format(c))
    
    # get the drug names
    for key in drug_codes.keys():
        df.loc[(df['State']=='DRUG')&(df['Q2']==key), 'Drug'] = drug_codes[key]
    df['Drug'] = df['Drug'].ffill()
    df = df.drop(df[df['State']=='DRUG'].index)

    # fix longer state names
    df.loc[pd.notnull(df['DRUG']), 'State'] = df["State"]+" "+df['Q1']
    df.loc[pd.notnull(df['DISTRIBUTION']), 'State'] = df["State"] +" "+df['Q2']
    
    # fix shifted data
    start1 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    shift1 = ['Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    for i in range(0,5):
        df.loc[pd.notnull(df['DRUG']), start1[i]] = df[shift1[i]]
        
    start2 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    shift2 = ['Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION']
    for i in range(0,5):
        df.loc[pd.notnull(df['DISTRIBUTION']), start2[i]] = df[shift2[i]]        

    # standardize geography names    
    df = df.drop(df[df['State']=='DRUG'].index)
    df = df.drop(df[df['State']=='DRUG CODE:'].index)
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'
    df.loc[df['State']=='UNITED STATES TOTAL', 'State'] = 'UNITED STATES'
    df.loc[df['State']=='U.S. TOTAL', 'State'] = 'UNITED STATES'

    # drop rows with unnecessary data (e.g., headers)
    print("Dropping rows with the following values in the State column:")
    print(df['State'].loc[~(df['State'].isin(geos))].unique())
    df = df.drop(df.loc[~(df['State'].isin(geos))].index)
    
    # keep only what we need
    df = df[['Year', 'State', 'Drug', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']].copy()
    
    # change the data type of numeric columns
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df

def clean_pop_new(df, year, drug_codes, geos):
    """
    Use for years 2007-2017 inclusive of ARCOS Report 3 data. 
    """
    # rename columns
    df.rename(columns={'ARCOS': "State", 
                       '3':'Q1', 
                       '-': 'Q2', 
                       'REPORT': 'Q3', 
                       '3.1':'Q4', 
                       'QUARTERLY':'TOTAL'}, 
              inplace=True)

    # insert new columns for drug and year
    df.insert(column='Year', loc=0, value=year)
    df.insert(column='Drug', loc=2, value=None)
    
    # check for new substance codes
    codes = df['Q1'].loc[df['Q1'].str.contains('CODE', na=False)].unique().tolist()
    for c in codes:
        if c.split(':')[1] not in drug_codes.keys():
            print("Found new substance code: {}".format(c))
    
    # the drug codes are coming in with a different format pattern in these files
    for key in drug_codes.keys():
        df.loc[(df['State']=='DRUG')&(df['Q1'].str[5:]==key), 'Drug'] = drug_codes[key]
    df['Drug'] = df['Drug'].ffill()
    df = df.drop(df[df['State']=='DRUG'].index)

    # fix longer state names
    df.loc[pd.notnull(df['DRUG']), 'State']=df["State"]+" "+df['Q1']
    df.loc[pd.notnull(df['DISTRIBUTION']), 'State']=df["State"] +" "+df['Q2']

    # fix shifted data
    start1 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    shift1 = ['Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG']
    for i in range(0,5):
        df.loc[pd.notnull(df['DRUG']), start1[i]] = df[shift1[i]]
        
    start2 = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    shift2 = ['Q2', 'Q3', 'Q4', 'TOTAL', 'DISTRIBUTION']
    for i in range(0,5):
        df.loc[pd.notnull(df['DISTRIBUTION']), start2[i]] = df[shift2[i]]        

        
    #df = df.drop(df[df['State']=='DRUG'].index)
    #df = df.drop(df[df['State']=='DRUG CODE:'].index)
    # update territory references
    df.loc[df['State']=='TRUST TERRITORIES (GUAM)', 'State'] = 'GUAM'
    df.loc[df['State']=='UNITED STATES TOTAL', 'State'] = 'UNITED STATES'
    df.loc[df['State']=='U.S. TOTAL', 'State'] = 'UNITED STATES'

    # drop out unnecessary rows (e.g., headers)
    print("Dropping rows with the following values in the State column:")
    print(df['State'].loc[~(df['State'].isin(geos))].unique())
    df = df.drop(df.loc[~(df['State'].isin(geos))].index)
    
    # keep only what we need
    df = df[['Year', 'State', 'Drug', 'Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']].copy()
    
    # update datatype of numeric columns
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    for col in cols:
        df[col]=df[col].str.replace(",","").astype(float)
    return df

#### Load in and check the rest of the files

Next step is to load in the rest of the files and check them for irregularities. Just like with the zip files, there are differences in some of the file formats so I made a few additional versions of the function that checks for irregular data. You could probably combine them into a single function that would handle all cases but I found it easier to break out this way. 

In [165]:
def check_pop_data_v2(df):
    """
    Use for checking population data files for years 2004 and 2005 inclusive.
    """
    df.rename(columns={'ARCOS': "State", 
                   '2':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)
    
    print("Checking State column - we expect mostly state names here, as well as DRUG.")
    print(df['State'].unique())
    print()
    print()
    print('For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.')
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    
    for c in cols:
        vals = df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()
        print("Checking column {} for unusual values.".format(c))
        if c=='Q1':
            print("We expect to see CODE: but not DRUG, and no drug codes.")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='Q2':
            print("We expect to see drug codes here, but should not see DRUG or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        if c=='Q3':
            print("We expect to see DRUG here, but no drug codes.")
        if c=='Q4':
            print("We expect to see NAME: here, but not DRUG and no drug codes.")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='TOTAL':
            print("We expect to see partial/full drug names, but no codes and not the words DRUG, NAME, or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        print(vals)
        print()
        print()
    
    cols2 = ['DRUG','DISTRIBUTION','BY','STATE',
             'PER','100,000','POPULATION','BY.1',
             'GRAMS','WEIGHT']
    print()
    print("For the rest of the columns, we expect mostly NaNs and some partial drug names.")
    for c in cols2:
        print("Checking column {} for unusual values.".format(c))
        if pd.isnull(df[c]).all():
            print("Column {} is empty.".format(c))
        else:
            print(df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique())
        print()

        
def check_pop_data_v3(df):
    """
    Use for checking population data files for years 2006 to 2010 and 2012 to 2017, inclusive.
    """
    df.rename(columns={'ARCOS': "State", 
                   '3':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3.1':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)
    
    print("Checking State column - we expect mostly state names here, as well as DRUG.")
    print(df['State'].unique())
    print()
    print()
    print('For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.')
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    
    for c in cols:
        vals = df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()
        print("Checking column {} for unusual values.".format(c))
        if c=='Q1':
            print("We expect to see CODE: concatenated with drug codes.")
        if c=='Q2':
            print("We should see DRUG here, but not CODE or NAME.")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        if c=='Q3':
            print("We expect to see NAME: here, but no drug codes.")
        if c=='Q4':
            print("We expect to see full/partial drug names. ")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='TOTAL':
            print("We expect to see partial/full drug names, but no codes and not the words DRUG, NAME, or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        print(vals)
        print()
        print()
    
    cols2 = ['DRUG','DISTRIBUTION','BY','STATE',
             'PER','100K','POPULATION','BY.1',
             'GRAM','WT']
    
    print()
    print("For the rest of the columns, we expect mostly NaNs and some partial drug names.")
    for c in cols2:
        print("Checking column {} for unusual values.".format(c))
        if pd.isnull(df[c]).all():
            print("Column {} is empty.".format(c))
        else:
            print(df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique())
        print()      
        

def check_pop_data_v4(df):
    """
    Use for checking population data files for year 2011 only.
    """
    df.rename(columns={'ARCOS': "State", 
                   '3':'Q1', 
                   '-': 'Q2', 
                   'REPORT': 'Q3', 
                   '3.1':'Q4', 
                   'QUARTERLY':'TOTAL'}, 
          inplace=True)
    
    print("Checking State column - we expect mostly state names here, as well as DRUG.")
    print(df['State'].unique())
    print()
    print()
    print('For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.')
    cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    
    for c in cols:
        vals = df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique()
        print("Checking column {} for unusual values.".format(c))
        if c=='Q1':
            print("We expect to see CODE: concatenated with drug codes.")
        if c=='Q2':
            print("We should see DRUG here, but not CODE or NAME.")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        if c=='Q3':
            print("We expect to see NAME: here, but no drug codes.")
        if c=='Q4':
            print("We expect to see full/partial drug names. ")
            if 'DRUG' in vals:
                print("Irregular value: DRUG")
        if c=='TOTAL':
            print("We expect to see partial/full drug names, but no codes and not the words DRUG, NAME, or CODE.")
            if 'DRUG' in vals:
                print("Irregular value: 'DRUG'")
            if 'NAME' in vals:
                print("Irregular value: 'NAME'")
            if 'CODE' in vals:
                print("Irregular value: 'CODE'")
        print(vals)
        print()
        print()
    
    cols2 = ['DRUG', 'DISTRIBUTION', 'BY',
             'STATE', '/', '100K', 'POPULATION',
             'BY.1', 'MILIGRAM', 'WT']
    
    print()
    print("For the rest of the columns, we expect mostly NaNs and some partial drug names.")
    for c in cols2:
        print("Checking column {} for unusual values.".format(c))
        if pd.isnull(df[c]).all():
            print("Column {} is empty.".format(c))
        else:
            print(df[c].loc[~df[c].str.match(r'[-+]?[0-9,]*\.[0-9]+$', na=False)].unique())
        print()      
        

In [568]:
pop_2000 = pd.read_csv('population_2000.txt', delim_whitespace=True)
check_pop_data(pop_2000)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'HAWAII' 'IOWA'
 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'DATE:' 'ARCOS'
 'QUARTERLY' 'SOUTH' 'TENNESSEE' 'TRUST' 'TEXAS' 'UTAH' 'VIRGINIA'
 'VIRGIN' 'VERMONT' 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
['PERIOD:' 'CODE:' '1ST' n

In [569]:
pop_2001 = pd.read_csv('pop_2001.txt', delim_whitespace=True)
check_pop_data(pop_2001)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'HAWAII' 'IOWA'
 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NEBRASKA' 'NORTH' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'DATE:' 'ARCOS'
 'QUARTERLY' 'SOUTH' 'TENNESSEE' 'TRUST' 'TEXAS' 'UTAH' 'VIRGINIA'
 'VIRGIN' 'VERMONT' 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
['PERIOD:' 'CODE:' '1ST' n

In [570]:
pop_2002 = pd.read_csv('pop_2002.txt', delim_whitespace=True)
check_pop_data(pop_2002)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII'
 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'DATE:' 'ARCOS' 'QUARTERLY'
 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED' 'AMERICAN']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
['PERIOD:' 'CODE

In [571]:
pop_2003 = pd.read_csv('pop_2003.txt', delim_whitespace=True)
check_pop_data(pop_2003)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII'
 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'DATE:' 'ARCOS' 'QUARTERLY'
 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED' 'AMERICAN']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
['PERIOD:' 'CODE

In [572]:
pop_2004 = pd.read_csv('pop_2004.txt', delim_whitespace=True)
check_pop_data_v2(pop_2004)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII'
 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'DATE:' 'ARCOS' 'QUARTERLY'
 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED' 'AMERICAN']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
Irregular value:

One thing to check - 'DRUG' showed up in column Q1 where we don't expect it.

In [573]:
pop_2004[pop_2004['Q1']=='DRUG']

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,PER,"100,000",POPULATION,BY.1,GRAMS,WEIGHT
48,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
70,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
119,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
141,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
190,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
209,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
258,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
280,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
329,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,
350,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100000,POPULATION,BY,GRAMS,WEIGHT,,,,,


This turns out to be header data (the report title line, which recurs throughout the file), so it will not cause any issues for the drug name checking.

In [574]:
pop_2005 = pd.read_csv('pop_2005.txt', delim_whitespace=True)
check_pop_data_v2(pop_2005)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'ALASKA' 'ALABAMA' 'ARKANSAS' 'ARIZONA' 'CALIFORNIA' 'COLORADO'
 'CONNECTICUT' 'DISTRICT' 'DELAWARE' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII'
 'IOWA' 'IDAHO' 'ILLINOIS' 'INDIANA' 'KANSAS' 'KENTUCKY' 'LOUISIANA'
 'MASSACHUSETTS' 'MARYLAND' 'MAINE' 'MICHIGAN' 'MINNESOTA' 'MISSOURI'
 'MISSISSIPPI' 'MONTANA' 'NORTH' 'NEBRASKA' 'NEW' 'NEVADA' 'OHIO'
 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'DATE:' 'ARCOS' 'QUARTERLY'
 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VIRGINIA' 'VIRGIN' 'VERMONT'
 'WASHINGTON' 'WISCONSIN' 'WEST' 'WYOMING' 'UNITED' 'AMERICAN']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: but not DRUG, and no drug codes.
Irregular value:

In [575]:
# same thing to check here and it's just header data
#pop_2005[pop_2005['Q1']=='DRUG']

In [576]:
pop_2006 = pd.read_csv('pop_2006.txt', delim_whitespace=True)
check_pop_data_v3(pop_2006)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE'
 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS'
 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND'
 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA'
 'NEBRASKA' 'NEVADA' 'NEW' 'NORTH' 'ARCOS' 'QUARTERLY' 'OHIO' 'OKLAHOMA'
 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN'
 'WYOMING' 'U.S.' 'AMERICAN' 'SUBSTANCE']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'OF' 'HAMPSHIRE' 'JERSEY'
 'MEXICO' 'YORK' 'CAROLINA' '3' 'DRUG' 'DAKOTA' 'RICO' 'ISLAND'

In [577]:
# these are all just odd-looking values, but legitimate
#pop_2006[pop_2006['Q1']=='1,645']
#pop_2006[pop_2006['Q1']=='0']
#pop_2006[pop_2006['Q1']=='1,835']
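
The values in the commented-out lookups above are flagged only because the checker's pattern requires a decimal point, so whole numbers such as '0' or '1,645' land in the "unusual" list even though they are valid. A small standalone illustration using the same pattern (the helper name `looks_decimal` is made up for this example):

```python
import re

# Same pattern the check functions pass to Series.str.match,
# which anchors at the start of the string like re.match.
NUMERIC = re.compile(r'[-+]?[0-9,]*\.[0-9]+$')


def looks_decimal(value):
    """True when the string is a number with a decimal part, e.g. '1,234.56'."""
    return NUMERIC.match(value) is not None


for value in ['10.76', '1,459,041.39', '0', '1,645', 'CODE:1100']:
    print(value, looks_decimal(value))
```

The first two values print `True` and the remaining three print `False`, which is why legitimate whole-number entries still need a quick manual look.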

In [578]:
# this is the US total value
#pop_2006[pop_2006['DRUG']=='6,131']

In [579]:
pop_2007 = pd.read_csv('pop_2007.txt', delim_whitespace=True)
check_pop_data_v3(pop_2007)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'ARCOS' 'QUARTERLY'
 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA'
 'WASHINGTON' 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.' 'SUBSTANCE']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE'
 'JERSEY' 'MEXICO' 'YORK' '3' 'DRUG' 'CAROLINA' 'DAKOTA' 'RICO' 

In [580]:
# this is part of a drug name
#pop_2007[pop_2007['BY']=='III)']

In [581]:
pop_2008 = pd.read_csv('pop_2008.txt', delim_whitespace=True)
check_pop_data_v3(pop_2008)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'NORTH' 'ARCOS'
 'QUARTERLY' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA'
 'WASHINGTON' 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE'
 'JERSEY' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' '3' 'DRUG' 'RICO' 'ISLAND'
 'I

In [582]:
pop_2009 = pd.read_csv('pop_2009.txt', delim_whitespace=True)
check_pop_data_v3(pop_2009)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'ARCOS' 'QUARTERLY'
 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA'
 'WASHINGTON' 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' '1,017' 'OF'
 'HAMPSHIRE' 'JERSEY' 'MEXICO' 'YORK' '3' 'DRUG' 'CAROLINA' 'DAKOTA'
 'RICO' 'IS

In [583]:
pop_2010 = pd.read_csv('pop_2010.txt', delim_whitespace=True)
check_pop_data_v3(pop_2010)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'NORTH' 'OHIO' 'OKLAHOMA'
 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'ARCOS' 'QUARTERLY' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON'
 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE'
 'JERSEY' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' 'RICO' 'ISLAND' '3' 'DRUG'
 'I

In [584]:
pop_2011 = pd.read_csv('pop_2011.txt', delim_whitespace=True)
check_pop_data_v4(pop_2011)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'NORTH' 'OHIO' 'OKLAHOMA'
 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'UTAH' 'ARCOS' 'QUARTERLY' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON'
 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.' '1,459,041.39']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' '1,506,492'
 'HAMPSHIRE' 'JERSEY' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' 'RI

First, note that for this year the data is reported in milligrams (the column header reads 'MILIGRAM')! This applies to 2011 only, so I will fix it manually at the end.
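
One way that manual fix could look, as a minimal sketch rather than the conversion actually used later: once a file has been cleaned (so that `Q1` through `TOTAL` are floats and a `Year` column exists, as in the cleaning functions above), divide the 2011 rows by 1,000. The function name `milligrams_to_grams` is hypothetical, and the demo values are made up.

```python
import pandas as pd

VALUE_COLS = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']


def milligrams_to_grams(df, year=2011, cols=VALUE_COLS):
    """Return a copy of df with the given year's values converted from mg to g."""
    out = df.copy()
    mask = out['Year'] == year
    out.loc[mask, cols] = out.loc[mask, cols] / 1000.0
    return out


# Tiny made-up frame to show the effect (values are not real ARCOS figures).
demo = pd.DataFrame({
    'Year': [2010, 2011],
    'State': ['OHIO', 'OHIO'],
    'Q1': [1.5, 1500.0], 'Q2': [2.0, 2000.0], 'Q3': [2.5, 2500.0],
    'Q4': [3.0, 3000.0], 'TOTAL': [9.0, 9000.0],
})
print(milligrams_to_grams(demo))
```

Working on a copy keeps the original frame untouched, so the conversion cannot accidentally be applied twice to the same data.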

It also looks like we might be dealing with a wrapped value: '1,459,041.39' shows up in the State column.

In [585]:
#pop_2011[pop_2011['State']=='1,459,041.39']
pop_2011.iloc[1518:1522]

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,/,100K,POPULATION,BY.1,MILIGRAM,WT
1518,NEW,MEXICO,104951.92,102720.78,112510.5,116879.64,437062.84,,,,,,,,,
1519,NEW,YORK,206942.49,233855.98,267066.16,295988.41,1003853.05,,,,,,,,,
1520,NORTH,CAROLINA,297253.62,340783.46,390634.12,430370.18,,,,,,,,,,
1521,1459041.39,,,,,,,,,,,,,,,


It is a wrapped value. Note that since NORTH CAROLINA is a two-word state and we haven't fixed the shifted values yet, its total belongs in the column currently called "DRUG," so that is where we need to move the value.

In [586]:
ix = pop_2011[pop_2011['State']=='1,459,041.39'].index.values[0]
pop_2011.loc[ix-1, 'DRUG'] = pop_2011.loc[ix, 'State']
pop_2011.iloc[ix-1:ix+1]

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,/,100K,POPULATION,BY.1,MILIGRAM,WT
1520,NORTH,CAROLINA,297253.62,340783.46,390634.12,430370.18,1459041.39,,,,,,,,,
1521,1459041.39,,,,,,,,,,,,,,,
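
The same repair can be written so that it does not depend on knowing the wrapped number in advance. The sketch below is a hypothetical generalization (the name `merge_wrapped_totals` and its arguments are assumptions, not part of this notebook): it looks for rows whose State cell is a decimal number, copies that number into the first empty value column of the row above, and leaves the leftover row in place to be dropped later with the other non-state rows. It assumes the default integer index that `read_csv` produces.

```python
import pandas as pd

NUMERIC_PATTERN = r'[-+]?[0-9,]*\.[0-9]+$'


def merge_wrapped_totals(df, value_cols):
    """Copy numbers that wrapped into the State column back onto the previous row."""
    out = df.copy()
    wrapped = out['State'].astype(str).str.match(NUMERIC_PATTERN).to_numpy()
    for ix in out.index[wrapped]:
        prev = ix - 1  # assumes a default RangeIndex
        if prev not in out.index:
            continue
        empty = [c for c in value_cols if pd.isnull(out.loc[prev, c])]
        if empty:
            out.loc[prev, empty[0]] = out.loc[ix, 'State']
    return out


# Made-up rows shaped like the 2011 excerpt above (two-word state, so values are shifted).
demo = pd.DataFrame({
    'State': ['NORTH', '1,459,041.39'],
    'Q1': ['CAROLINA', None], 'Q2': ['297,253.62', None],
    'Q3': ['340,783.46', None], 'Q4': ['390,634.12', None],
    'TOTAL': ['430,370.18', None], 'DRUG': [None, None],
})
fixed = merge_wrapped_totals(demo, ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL', 'DRUG'])
print(fixed.loc[0, 'DRUG'])
```

Because it fills the first empty column, the same helper would place the value in TOTAL for a one-word state and in DRUG for a shifted two-word state.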


In [587]:
pop_2012 = pd.read_csv('pop_2012.txt', delim_whitespace=True)
check_pop_data_v3(pop_2012)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Population' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA'
 'AMERICAN' 'ARIZONA' 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT'
 'DELAWARE' 'DISTRICT' 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO'
 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE'
 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI'
 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW' 'ARCOS' 'QUARTERLY'
 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA'
 'WASHINGTON' 'WEST' 'WISCONSIN' 'WYOMING' 'U.S.' 'METHYLCATHINONE)']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Year:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' '566' 'OF'
 'HAMPSHIRE' 'JERSEY' 'MEXICO' 'YORK' '3' 'DRUG' 'CAROLINA' 'DA

Looks like part of a drug name might be getting wrapped here:

* METHYLCATHINONE) in column State


In [588]:
#pop_2012[pop_2012['State']=='METHYLCATHINONE)']
pop_2012.iloc[650:653]

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY.1,GRAM,WT
650,U.S.,TOTAL,0,0,0,0,0.0,,,,,,,,,
651,DRUG,CODE:7540,DRUG,NAME:,METHYLONE,"(3,4-METHYLENEDIOXY-N-",,,,,,,,,,
652,METHYLCATHINONE),,,,,,,,,,,,,,,


It is a drug name getting wrapped, but the code is properly formatted, so we don't need to do anything here. Row 652 will get dropped anyway later in the cleaning routine. 
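
Both kinds of wrap seen so far (the 2011 total and this 2012 drug name) leave a row in which only the State column is populated. A quick way to list such rows before cleaning is sketched here; the helper name `state_only_rows` is hypothetical, and other single-token lines in a file may show up in the result as well.

```python
import pandas as pd


def state_only_rows(df):
    """Return rows where every column except State is null."""
    others = df.drop(columns=['State'])
    return df[others.isnull().all(axis=1)]


# Made-up rows shaped like the 2012 excerpt above.
demo = pd.DataFrame({
    'State': ['U.S.', 'DRUG', 'METHYLCATHINONE)'],
    'Q1': ['TOTAL', 'CODE:7540', None],
    'Q2': ['0', 'DRUG', None],
})
print(state_only_rows(demo))
```

Reviewing this short list by eye is usually enough to decide whether a fragment is a harmless piece of a drug name or a number that needs to be moved back.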

In [589]:
pop_2013 = pd.read_csv('pop_2013.txt', delim_whitespace=True)
check_pop_data_v3(pop_2013)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA' 'AMERICAN' 'ARIZONA'
 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE' 'DISTRICT'
 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS' 'INDIANA' 'IOWA'
 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND' 'MASSACHUSETTS'
 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA' 'NEBRASKA'
 'NEVADA' 'NEW' 'ARCOS' 'Population' 'QUARTERLY' 'NORTH' 'OHIO' 'OKLAHOMA'
 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN'
 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE' 'JERSEY' '3'
 'Year:' 'DRUG' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' 'RICO' 'ISLAND'
 'I

In [590]:
pop_2014 = pd.read_csv('pop_2014.txt', delim_whitespace=True)
check_pop_data_v3(pop_2014)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA' 'AMERICAN' 'ARIZONA'
 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE' 'DISTRICT'
 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS' 'INDIANA' 'IOWA'
 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND' 'MASSACHUSETTS'
 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA' 'NEBRASKA'
 'NEVADA' 'NEW' 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'ARCOS' 'Population'
 'QUARTERLY' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN'
 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'CODE:1100' '1ST' 'SAMOA' '655' 'OF' 'HAMPSHIRE' 'JERSEY'
 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' '3' 'Year:' 'DRUG' 'RICO' 'ISLAND'
 'ISLANDS' 

In [591]:
pop_2015 = pd.read_csv('pop_2015.txt', delim_whitespace=True)
check_pop_data_v3(pop_2015)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA' 'AMERICAN' 'ARIZONA'
 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE' 'DISTRICT'
 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS' 'INDIANA' 'IOWA'
 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND' 'MASSACHUSETTS'
 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA' 'NEBRASKA'
 'NEVADA' 'NEW' 'ARCOS' 'Population' 'QUARTERLY' 'NORTH' 'OHIO' 'OKLAHOMA'
 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS'
 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN'
 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' '1,336' 'HAMPSHIRE'
 'JERSEY' '3' 'Year:' 'DRUG' 'MEXICO' 'YORK' 'CAROLINA' 'DAKOTA' 'RICO'
 'IS

In [592]:
pop_2016 = pd.read_csv('pop_2016.txt', delim_whitespace=True)
check_pop_data_v3(pop_2016)

Checking State column - we expect mostly state names here, as well as DRUG.
['REPORTING' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA' 'AMERICAN' 'ARIZONA'
 'ARKANSAS' 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE' 'DISTRICT'
 'FLORIDA' 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS' 'INDIANA' 'IOWA'
 'KANSAS' 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND' 'MASSACHUSETTS'
 'MICHIGAN' 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA' 'NEBRASKA'
 'NEVADA' 'NEW' 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO'
 'RHODE' 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'Population' 'Run'
 'ARCOS' 'QUARTERLY' 'VIRGIN' 'VIRGINIA' 'WASHINGTON' 'WEST' 'WISCONSIN'
 'WYOMING' 'U.S.' '1308.13(C)(3)]']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['PERIOD:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE' 'JERSEY' 'MEXICO'
 'YORK' 'CAROLINA' 'DAKOTA' 'RICO' 'ISLAND' 'Year:' 'Date

A possible wrapped drug name to check out:

* 1308.13(C)(3)] in column State

In [593]:
# they are both part of a drug name that will be properly captured, so nothing else is needed
#pop_2016[pop_2016['State']=='1308.13(C)(3)]']
#pop_2016.iloc[338:341]
#pop_2016.iloc[388:391]

In [594]:
pop_2017 = pd.read_csv('pop_2017.txt', delim_whitespace=True)
check_pop_data_v3(pop_2017)

Checking State column - we expect mostly state names here, as well as DRUG.
['Run' 'DRUG' 'STATE' 'ALABAMA' 'ALASKA' 'AMERICAN' 'ARIZONA' 'ARKANSAS'
 'CALIFORNIA' 'COLORADO' 'CONNECTICUT' 'DELAWARE' 'DISTRICT' 'FLORIDA'
 'GEORGIA' 'GUAM' 'HAWAII' 'IDAHO' 'ILLINOIS' 'INDIANA' 'IOWA' 'KANSAS'
 'KENTUCKY' 'LOUISIANA' 'MAINE' 'MARYLAND' 'MASSACHUSETTS' 'MICHIGAN'
 'MINNESOTA' 'MISSISSIPPI' 'MISSOURI' 'MONTANA' 'NEBRASKA' 'NEVADA' 'NEW'
 'NORTH' 'OHIO' 'OKLAHOMA' 'OREGON' 'PENNSYLVANIA' 'PUERTO' 'RHODE'
 'SOUTH' 'TENNESSEE' 'TEXAS' 'UTAH' 'VERMONT' 'VIRGIN' 'VIRGINIA'
 'WASHINGTON' 'WEST' 'DATE' 'Population' 'ARCOS' 'QUARTERLY' 'WISCONSIN'
 'WYOMING' 'U.S.']


For columns Q1, Q2, Q3, Q4, and TOTAL we expect mostly numeric values with decimal accuracy.
Checking column Q1 for unusual values.
We expect to see CODE: concatenated with drug codes.
['Date:' 'CODE:1100' '1ST' 'SAMOA' 'OF' 'HAMPSHIRE' 'JERSEY' 'MEXICO'
 'YORK' 'CAROLINA' 'DAKOTA' 'RICO' 'ISLAND' '2,866' 'ISLANDS' 'VIRGINIA'
 'RANGE:

A few odd-looking things with this one:

* 'PRODUCT' in column Q1
* '(SYNDROS' in column Q2
* '1308.13(C)(3)]' in column 100K


Those first two turn out to be in the same row, and it looks like a drug name might have gotten wrapped around to a new line.

In [595]:
#pop_2017[pop_2017['Q1']=='PRODUCT']
pop_2017[pop_2017['Q2']=='(SYNDROS']

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY.1,GRAM,WT
579,DRUG,PRODUCT,(SYNDROS,-,CII),,,,,,,,,,,


In [596]:
pop_2017.iloc[575:583]

Unnamed: 0,State,Q1,Q2,Q3,Q4,TOTAL,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY.1,GRAM,WT
575,WASHINGTON,3.4,2.59,2.31,2.45,10.76,,,,,,,,,,
576,WISCONSIN,0.16,0,0,0,0.16,,,,,,,,,,
577,U.S.,TOTAL,0.86,0.76,0.79,0.91,3.32,,,,,,,,,
578,DRUG,CODE:7365,DRUG,NAME:,DRONABINOL,IN,AN,ORAL,SOLUTION,IN,FDA,APPROVED,,,,
579,DRUG,PRODUCT,(SYNDROS,-,CII),,,,,,,,,,,
580,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,GRAMS,,,,,
581,ALABAMA,0,0,0.03,0.06,0.09,,,,,,,,,,
582,ARIZONA,0,0,0.23,0.11,0.34,,,,,,,,,,


This is in fact a new drug product for the 2017 reporting year - a liquid form of dronabinol, which is a synthetic version of THC. 
https://www.drugdevelopment-technology.com/comment/syndros-first-fda-approved-liquid-dronabinol-launched-long-journey/

I did most of this analysis before the 2017 data was released, and hadn't done a manual review of the drug codes in that file, so this irregular line was how I discovered the addition of this new drug to the ARCOS data! I did not catch this in the 2017 zip-code data review. It is in the drug codes dictionary from part 1, so you don't need to worry about adding it, but I left this in to highlight the importance of always checking your data.

After finding this value, I went through the file to check if there were any other new additions to the list, and there was another: code 9809, known as 'OPIUM COMBINATION PRODUCT (C-III).' This one is also in the drug codes dictionary.
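The manual pass through the file for new codes can also be sketched in code. This is a minimal sketch, not the check that was later added to the cleaning function: `find_unknown_codes` is a hypothetical helper, the toy frame only mimics the raw rows shown above, and it assumes the drug codes dictionary is keyed by the code as a string.

```python
import pandas as pd

def find_unknown_codes(raw_df, known_codes):
    """Return drug codes found in a raw report that are missing from known_codes."""
    # In the raw layout shown above, the second column holds tokens like 'CODE:7365'
    second_col = raw_df.iloc[:, 1].astype(str)
    is_code = second_col.str.startswith('CODE:')
    found = second_col[is_code].str.replace('CODE:', '', regex=False)
    # set(known_codes) uses the dictionary keys (assumed to be code strings)
    return sorted(set(found) - set(known_codes))

# toy frame mimicking a few raw rows (illustrative only)
raw = pd.DataFrame({
    'State': ['U.S.', 'DRUG', 'DRUG', 'ALABAMA', 'DRUG'],
    'Q1': ['TOTAL', 'CODE:7365', 'PRODUCT', '0', 'CODE:9809'],
})
known = {'7365': 'toy entry standing in for the real dictionary'}

unknown = find_unknown_codes(raw, known)
print(unknown)
```

Run against each raw file before cleaning, an empty list would suggest the dictionary already covers every code in that year's report.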

In [597]:
# this is just part of a drug name (that's not new :))
#pop_2017[pop_2017['100K']=='1308.13(C)(3)]']

In [598]:
# load in and clean all the files

print("Cleaning 2000 file...")
pop_2000 = clean_pop_old(pop_2000, 2000, drug_codes, geos)
print("Done.")

print("Cleaning 2001 file...")
pop_2001 = clean_pop_old(pop_2001, 2001, drug_codes, geos)
print("Done.")

print("Cleaning 2002 file...")
pop_2002 = clean_pop_old(pop_2002, 2002, drug_codes, geos)
print("Done.")

print("Cleaning 2003 file...")
pop_2003 = clean_pop_old(pop_2003, 2003, drug_codes, geos)
print("Done.")

print("Cleaning 2004 file...")
pop_2004 = clean_pop_oldv2(pop_2004, 2004, drug_codes, geos)
print("Done.")

print("Cleaning 2005 file...")
pop_2005 = clean_pop_oldv2(pop_2005, 2005, drug_codes, geos)
print("Done.")

print("Cleaning 2006 file...")
pop_2006 = clean_pop_oldv2(pop_2006, 2006, drug_codes, geos)
print("Done.")

print("Cleaning 2007 file...")
pop_2007 = clean_pop_new(pop_2007, 2007, drug_codes, geos)
print("Done.")

print("Cleaning 2008 file...")
pop_2008 = clean_pop_new(pop_2008, 2008, drug_codes, geos)
print("Done.")

print("Cleaning 2009 file...")
pop_2009 = clean_pop_new(pop_2009, 2009, drug_codes, geos)
print("Done.")

print("Cleaning 2010 file...")
pop_2010 = clean_pop_new(pop_2010, 2010, drug_codes, geos)
print("Done.")


# here we will fix the milligram data issue
# fix the extra large value which reads in as a string
print("Cleaning 2011 file...")
pop_2011 = clean_pop_new(pop_2011, 2011, drug_codes, geos)
# convert from mg to g
print("Converting milligrams to grams...")
pop_2011[['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']] = pop_2011[['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']].divide(1000, axis=0)
print("Done.")

print("Cleaning 2012 file...")
pop_2012 = clean_pop_new(pop_2012, 2012, drug_codes, geos)
print("Done.")

print("Cleaning 2013 file...")
pop_2013 = clean_pop_new(pop_2013, 2013, drug_codes, geos)
print("Done.")

print("Cleaning 2014 file...")
pop_2014 = clean_pop_new(pop_2014, 2014, drug_codes, geos)
print("Done.")

print("Cleaning 2015 file...")
pop_2015 = clean_pop_new(pop_2015, 2015, drug_codes, geos)
print("Done.")

print("Cleaning 2016 file...")
pop_2016 = clean_pop_new(pop_2016, 2016, drug_codes, geos)
print("Done.")

print("Cleaning 2017 file...")
pop_2017 = clean_pop_new(pop_2017, 2017, drug_codes, geos)
print("Done.")

Cleaning 2000 file...
Dropping rows with the following values in the State column:
['REPORTING' 'STATE 1ST QUARTER'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'DATE: 12/24/2002' 'ARCOS' 'QUARTERLY DISTRIBUTION']
Done.
Cleaning 2001 file...
Dropping rows with the following values in the State column:
['REPORTING' 'STATE 1ST QUARTER'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'DATE: 12/19/2002' 'ARCOS' 'QUARTERLY DISTRIBUTION']
Done.
Cleaning 2002 file...
Dropping rows with the following values in the State column:
['REPORTING' 'STATE 1ST QUARTER'
 '------------------------------------------------------------------------------------------------------------------------------------'
 'DATE: 02/12/2004' 'ARCOS' 'QUARTERLY DISTRIBUTION']
Done.
Cleaning 2003 file...
Dropping rows with the following values in

There aren't any odd values getting dropped that we didn't already know about, so everything looks good.

You will notice that a couple of new drug codes were found - I added something to the cleaning function to check for this after going through everything below, so read on to see how these issues surfaced before I had that check in place.

### Quality and sense checks
Just like before, it's important to check and validate the cleaned data. 

Big issues that require refactoring the code (like the mg/g conversion mentioned above, or other format changes) will likely surface as soon as you start running files through your cleaning function, because it will throw errors. 

These checks are to look for things that might not have thrown an error. We can reuse the functions from the first file, with one addition for checksums. 

In [480]:
# Can reuse two of the checking functions
# But need a new one to do checksums

# check functions
def quarterly_check(df):
    """
    Check to see if the quarterly values in each row sum up to the total.
    """
    # work on a copy so the helper columns never end up on the caller's df,
    # even when we return early with the problem rows
    checked = df.assign(check=df[['Q1', 'Q2', 'Q3', 'Q4']].sum(axis=1))
    checked['diff'] = checked['TOTAL'] - checked['check']
    issues = checked.loc[(checked['diff'].abs())>0.2]
    if issues.empty:
        print('Quarterly sums check passed')
    else:
        return issues

def repeats_check_zip(df):
    """
    Check to see if any rows of data may be repeated; in particular, we should have only one row for each 
    combination of year-state-drug-zipcode.
    """
    # build the key as a standalone Series so no temporary column is left on df
    # when we return early with errors
    check = df['Year'].astype(str)+df['State']+df['Drug']+df['Zip']
    checks = check.value_counts()
    errors = checks.loc[checks!=1]
    if errors.empty:
        print('Repeats checks passed')
    else:
        return errors
    
def check_states(df, geos):
    """
    Compare the states present in the df with those we expect to find.
    """
    in_df = df['State'].unique()
    diff = set(geos).symmetric_difference(set(in_df))
    if diff:
        print('State values not matching:', diff)
    else:
        print("All expected state values present")


def repeats_check_pop(df):
    df['check'] = df['Year'].astype(str)+df['State']+df['Drug']
    checks = pd.Series(data=df['check'].value_counts())
    errors = checks.loc[checks!=1]
    if errors.empty:
        print('Repeats checks passed')
    else:
        print("Repeating values found:")
        print(errors)
    df.drop(['check'], axis=1, inplace=True)
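
As an aside, the same repeat check can be expressed without building a concatenated string key. The sketch below swaps in pandas' `DataFrame.duplicated` for the key-and-count approach used above; `find_repeats` and the toy frame are hypothetical and only there for illustration.

```python
import pandas as pd

def find_repeats(df, keys):
    """Return every row whose combination of key columns appears more than once."""
    # keep=False marks all members of a repeated group, not only the later ones
    mask = df.duplicated(subset=keys, keep=False)
    return df[mask].sort_values(keys)

# toy frame (illustrative values)
toy = pd.DataFrame({
    'Year': [2009, 2009, 2009],
    'State': ['CALIFORNIA', 'CALIFORNIA', 'ALABAMA'],
    'Drug': ['COCAINE', 'COCAINE', 'COCAINE'],
    'TOTAL': [0.0, 17.46, 10.47],
})

repeats = find_repeats(toy, ['Year', 'State', 'Drug'])
print(repeats)
```

One possible advantage is that this returns the full repeated rows, so the values can be compared side by side without a second lookup.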

In [478]:
# housing the dfs into a dictionary makes it easier to iterate through them here
# but keep in mind the dict stores references to the current objects: reassigning a name
# (e.g. pop_2013 = pop_2013.drop(...)) does not update the version stored in this dict
# so when we want to use it again later, we need to re-instantiate it
population_dfs = {'2000': pop_2000, '2001': pop_2001, 
                  '2002': pop_2002, '2003': pop_2003, 
                  '2004': pop_2004, '2005': pop_2005, 
                  '2006': pop_2006, '2007': pop_2007, 
                  '2008': pop_2008, '2009': pop_2009,
                  '2010': pop_2010, '2011': pop_2011, 
                  '2012': pop_2012, '2013': pop_2013, 
                  '2014': pop_2014, '2015': pop_2015, 
                  '2016': pop_2016, '2017': pop_2017}

for f in population_dfs.keys():
    print('Checking {} file...'.format(f))
    quarterly_check(population_dfs[f])
    repeats_check_pop(population_dfs[f])
    check_states(population_dfs[f], geos)
    print()
    print()

Checking 2003 file...
Quarterly sums check passed
Repeats checks passed
All expected state values present


Checking 2009 file...
Quarterly sums check passed
Repeating values found:
2009NORTH CAROLINACOCAINE    2
2009CALIFORNIACOCAINE        2
2009FLORIDACOCAINE           2
2009UNITED STATESCOCAINE     2
2009CONNECTICUTCOCAINE       2
Name: check, dtype: int64
All expected state values present


Checking 2014 file...
Quarterly sums check passed
Repeating values found:
2014MISSISSIPPICOCAINE             2
2014TENNESSEECOCAINE               2
2014DISTRICT OF COLUMBIACOCAINE    2
2014WISCONSINCOCAINE               2
2014NEW JERSEYCOCAINE              2
2014WEST VIRGINIACOCAINE           2
2014MICHIGANCOCAINE                2
2014CALIFORNIACOCAINE              2
2014KANSASCOCAINE                  2
2014SOUTH DAKOTACOCAINE            2
2014TEXASCOCAINE                   2
2014DELAWARECOCAINE                2
2014INDIANACOCAINE                 2
2014OREGONCOCAINE                  2
2014FLORI

Repeats checks passed
All expected state values present




This time the checking functions turn up something interesting. 

There are repeating values for cocaine for some states in years 2009-2015 inclusive.

Also, there's a repeating value for LSD in 2015 and for PCP in 2013 at the national level.

Regarding cocaine, it appears that there are two codes for cocaine - 9041 and 9041L. I wasn't able to find any DEA documentation that explains the difference between the two, and the reports themselves give both codes the same drug name, simply "Cocaine". In any case, in all of the files where both codes are included, one of them always contains all zeros. 
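
The claim that one of the two cocaine rows is always all zeros can be checked across a whole file rather than by eye. The following is a minimal sketch under stated assumptions: `zero_twin_summary` is a hypothetical helper, it assumes the cleaned column names shown in this notebook (`Year`, `State`, `Drug`, `Q1`-`Q4`, `TOTAL`), and the toy frame reuses a few 2009 values purely for illustration.

```python
import pandas as pd

def zero_twin_summary(df, drug='COCAINE'):
    """For each repeated Year/State pair of one drug, count rows and rows with any non-zero value."""
    value_cols = ['Q1', 'Q2', 'Q3', 'Q4', 'TOTAL']
    subset = df[df['Drug'] == drug].copy()
    # True when at least one quarterly or total value differs from zero
    subset['nonzero'] = subset[value_cols].ne(0).any(axis=1)
    summary = subset.groupby(['Year', 'State'])['nonzero'].agg(
        rows='count', nonzero_rows='sum')
    # keep only the keys that actually repeat
    return summary[summary['rows'] > 1]

toy = pd.DataFrame({
    'Year': [2009, 2009, 2009],
    'State': ['CALIFORNIA', 'CALIFORNIA', 'ALABAMA'],
    'Drug': ['COCAINE', 'COCAINE', 'COCAINE'],
    'Q1': [0.0, 4.33, 2.4],
    'Q2': [0.0, 4.52, 2.43],
    'Q3': [0.0, 5.23, 3.66],
    'Q4': [0.0, 3.38, 1.99],
    'TOTAL': [0.0, 17.46, 10.47],
})

summary = zero_twin_summary(toy)
print(summary)
```

If `nonzero_rows` never exceeds 1 for any repeated key, dropping the all-zero rows should not discard real data.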

In [481]:
# check for each year if you like
pop_2009[pop_2009['Drug']=='COCAINE']

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL
663,2009,CALIFORNIA,COCAINE,0.0,0.0,0.0,0.0,0.0
664,2009,CONNECTICUT,COCAINE,0.0,0.0,0.0,0.0,0.0
665,2009,FLORIDA,COCAINE,0.0,0.0,0.0,0.0,0.0
666,2009,NORTH CAROLINA,COCAINE,0.0,0.0,0.0,0.0,0.0
667,2009,UNITED STATES,COCAINE,0.0,0.0,0.0,0.0,0.0
670,2009,ALABAMA,COCAINE,2.4,2.43,3.66,1.99,10.47
671,2009,ALASKA,COCAINE,6.56,3.7,6.68,2.35,19.3
672,2009,ARIZONA,COCAINE,4.73,4.53,4.78,3.82,17.86
673,2009,ARKANSAS,COCAINE,3.22,2.42,3.68,1.75,11.07
674,2009,CALIFORNIA,COCAINE,4.33,4.52,5.23,3.38,17.46


I'm choosing to deal with these issues after merging all the dataframes - I'll drop the extra rows reporting all zeros for cocaine, and I'm also going to remove the national total rows for the sake of consistency with my other data file, although you might choose to keep them. The totals could also provide an opportunity for an additional checksum function. 
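
One possible shape for such a check is sketched below. Because these figures are rates per 100K population, state values do not simply add up to the national value, so this sketch only tests an upper bound: it assumes the national figure is a population-weighted rate over the listed geographies and therefore should not exceed the largest state figure. That assumption, the `national_bound_check` name, the rounding tolerance, and the deliberately broken `TOY DRUG` rows are all illustrative and not from the DEA documentation.

```python
import pandas as pd

def national_bound_check(df, national='UNITED STATES', tol=0.01):
    """Flag Year/Drug groups where the national TOTAL exceeds the largest state TOTAL."""
    is_national = df['State'] == national
    state_max = (df[~is_national].groupby(['Year', 'Drug'])['TOTAL']
                 .max().rename('state_max'))
    nat = (df[is_national].groupby(['Year', 'Drug'])['TOTAL']
           .max().rename('national'))
    # align the two Series on (Year, Drug); drop groups missing either side
    both = pd.concat([nat, state_max], axis=1).dropna()
    return both[both['national'] > both['state_max'] + tol]

toy = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2015],
    'State': ['KENTUCKY', 'UNITED STATES', 'OHIO', 'UNITED STATES'],
    'Drug': ['LYSERGIDE(D-LSD)', 'LYSERGIDE(D-LSD)', 'TOY DRUG', 'TOY DRUG'],
    'TOTAL': [182.52, 2.53, 1.0, 5.0],  # the TOY DRUG rows are made up to fail
})

flagged = national_bound_check(toy)
print(flagged)
```

A check like this would need to run before the national rows are dropped from the merged dataframe.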

Now to see what's up with PCP in 2013 and LSD in 2015, starting with PCP. It looks pretty odd - the two repeated national rows are not consecutive in the dataframe.

In [617]:
pop_2013[pop_2013['Drug']=='PHENCYCLIDINE (PCP)']

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL
695,2013,COLORADO,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0
696,2013,OHIO,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.01
697,2013,UNITED STATES,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0
705,2013,MICHIGAN,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0
706,2013,UNITED STATES,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0


To see what's going on, let's read in a new copy of the raw 2013 data and see what's present in some of these rows. 

In [486]:
check_2013 = pd.read_csv('pop_2013.txt', delim_whitespace=True)
check_2013.iloc[690:708]

Unnamed: 0,ARCOS,3,-,REPORT,3.1,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY.1,GRAM,WT
690,WASHINGTON,0,0,0,0,0,,,,,,,,,,
691,WISCONSIN,0.01,0,0,0,0.01,,,,,,,,,,
692,U.S.,TOTAL,0,0,0,0,0,,,,,,,,,
693,DRUG,CODE:7471,DRUG,NAME:,PHENCYCLIDINE,(PCP),,,,,,,,,,
694,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,GRAMS,,,,,
695,COLORADO,0,0,0,0,0,,,,,,,,,,
696,OHIO,0,0,0,0,0.01,,,,,,,,,,
697,U.S.,TOTAL,0,0,0,0,0,,,,,,,,,
698,ARCOS,3,-,REPORT,3,,,,,,,,,,,
699,Population,Year:,2010,,,,,,,,,,,,,


Here's something truly strange - a drug code, 9003, that has no name associated with it in the file and that doesn't appear in the DEA list of controlled substances:
https://www.deadiversion.usdoj.gov/schedules/orangebook/d_cs_drugcode.pdf

In any case, the transacted amount is reported as all zeros for Michigan (the only state with data for this drug) and at the national total level, so I'm making the choice to drop this data entirely - rows 705 and 706. 

In [618]:
pop_2013 = pop_2013.drop(labels = [705, 706], axis='index')
pop_2013[pop_2013['Drug']=='PHENCYCLIDINE (PCP)']

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL
695,2013,COLORADO,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0
696,2013,OHIO,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.01
697,2013,UNITED STATES,PHENCYCLIDINE (PCP),0.0,0.0,0.0,0.0,0.0


In [489]:
# moving on to look at LSD in 2015
pop_2015[pop_2015['Drug']=='LYSERGIDE(D-LSD)']

Unnamed: 0,Year,State,Drug,Q1,Q2,Q3,Q4,TOTAL
598,2015,MICHIGAN,LYSERGIDE(D-LSD),0.0,0.0,0.0,0.0,0.0
599,2015,UNITED STATES,LYSERGIDE(D-LSD),0.0,0.0,0.0,0.0,0.0
607,2015,KENTUCKY,LYSERGIDE(D-LSD),0.0,0.0,0.0,182.52,182.52
608,2015,UNITED STATES,LYSERGIDE(D-LSD),0.0,0.0,0.0,2.53,2.53


In [490]:
check_2015 = pd.read_csv('pop_2015.txt', delim_whitespace=True)
check_2015.iloc[595:615]

Unnamed: 0,ARCOS,3,-,REPORT,3.1,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY.1,GRAM,WT
595,U.S.,TOTAL,0.95,0.85,0.74,0.75,3.29,,,,,,,,,
596,DRUG,CODE:7315D,DRUG,NAME:,LYSERGIDE(D-LSD),,,,,,,,,,,
597,STATE,1ST,QUARTER,2ND,QUARTER,3RD,QUARTER,4TH,QUARTER,TOTAL,GRAMS,,,,,
598,MICHIGAN,0,0,0,0,0,,,,,,,,,,
599,U.S.,TOTAL,0,0,0,0,0,,,,,,,,,
600,ARCOS,3,-,REPORT,3,,,,,,,,,,,
601,Population,Year:,2010,,,,,,,,,,,,,
602,QUARTERLY,DRUG,DISTRIBUTION,BY,STATE,PER,100K,POPULATION,BY,GRAM,WT,,,,,
603,REPORTING,PERIOD:,01/01/2015,TO,12/31/2015,,,,,,,,,,,
604,Run,Date:,03/07/2016,,,,,,,,,,,,,


Another new drug code that I had missed - this was now happening often enough that I added something into the cleaning function so that it reports any new codes it finds. You could also add this into a check function, and it might even make more sense to include it there in the future.

I've left this in to keep highlighting the importance of rigorous review of your data when working on something like this - the code is also included in the drug codes dictionary now. Below I'll fix up the 2015 file by re-running the load with the new version of the cleaning function and the latest drug codes dictionary. 

In [605]:
pop_2015 = pd.read_csv('pop_2015.txt', delim_whitespace=True)
pop_2015 = clean_pop_new(pop_2015, 2015, drug_codes, geos)

Dropping rows with the following values in the State column:
['REPORTING' 'Run' 'STATE 1ST QUARTER' 'ARCOS' 'Population'
 'QUARTERLY DRUG DISTRIBUTION']


In [606]:
# drop these unneeded dfs
del check_2013
del check_2015

In [621]:
# re-instantiate the dict of dfs
population_dfs = {'2000': pop_2000, '2001': pop_2001, 
                  '2002': pop_2002, '2003': pop_2003, 
                  '2004': pop_2004, '2005': pop_2005, 
                  '2006': pop_2006, '2007': pop_2007, 
                  '2008': pop_2008, '2009': pop_2009,
                  '2010': pop_2010, '2011': pop_2011, 
                  '2012': pop_2012, '2013': pop_2013, 
                  '2014': pop_2014, '2015': pop_2015, 
                  '2016': pop_2016, '2017': pop_2017}

pop_all = pd.concat(list(population_dfs.values()), ignore_index=True)

# Drop out the totals rows
drops = ['UNITED STATES']
for d in drops:
    pop_all=pop_all.drop(pop_all[pop_all['State']==d].index)

# Drop out the reporting on the extra cocaine code
pop_all = pop_all.drop(pop_all.loc[(pop_all['TOTAL']==0)
                                   &(pop_all['Drug']=="COCAINE")].index)

pop_all.to_csv('distribution_by_100K_pop.csv', index=False)
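
The dictionary is rebuilt above because `pop_2013` and `pop_2015` were reassigned to new dataframes after the first dictionary was created. The toy example below (hypothetical names and values) is a small sketch of that behaviour: an in-place change is visible through the dict, while rebinding the name is not.

```python
import pandas as pd

pop_a = pd.DataFrame({'State': ['MICHIGAN', 'OHIO'], 'TOTAL': [0.0, 0.01]})
dfs = {'a': pop_a}

# in-place change: the dict entry is the same object, so it sees the new column
pop_a['Year'] = 2013
sees_inplace_change = 'Year' in dfs['a'].columns

# rebinding: drop() returns a new DataFrame, so the dict still holds the old one
pop_a = pop_a.drop(labels=[0], axis='index')
rows_in_dict = len(dfs['a'])
rows_in_name = len(pop_a)

print(sees_inplace_change, rows_in_dict, rows_in_name)
```

Rebuilding the dict right before `pd.concat` is a simple way to make sure the merged file uses the latest version of every year.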

In [622]:
# it is worth running these checks again to make sure nothing went wrong
# e.g., accidentally passing an old version of a df into the concat

quarterly_check(pop_all)
repeats_check_pop(pop_all)
check_states(pop_all, geos)

Quarterly sums check passed
Repeats checks passed
State values not matching: {'UNITED STATES'}
