# <center>Week 5 Assignment</center>

This week you will be retrieving and cleansing data from a survey generated by the National Center for Immunization and Respiratory Diseases about National Immunizations in Children. In completing this assignment, you will be able to combine topics discussed in several of our prior FTEs.

The files needed to complete this assignment are located in the data_5 folder:
* NISPUF14_CODEBOOK.PDF
* nispuf14.dat

Assignment Requirements:
* Retrieve all of the data within nispuf14.dat and store it in a more <i> accessible format</i>
* <i> Accessible format </i> can be any of the following:
    - csv file
    - json file
    - relational database
* For this assignment, feel free to use a dataframe for intermediate steps. 

<hr>

### What's in these two files?
I'm glad you asked that! And to be honest, you probably are not going to like the answer.

NISPUF14_CODEBOOK.PDF is a PDF that contains a description of the format for the data in nispuf14.dat. In other words, the PDF tells you how to read the data in nispuf14.dat.

Why would we need a PDF to tell us how to read our data? Well, this data file is stored in a positional (fixed-width) format. This means that both the value and the relative position of each character provide meaning within the dataset.

Here's what the data in nispuf14.dat looks like.
<img align="left" style="padding-right:10px;" src="figures_5/positional_data.jpeg" width = 800><br>

Ugly? Yes! And very much so. However, data in this format is not all that uncommon. Mainframe computers commonly store and exchange data in positional formats.

Q - Who still uses mainframe computers?<br>
A - Mainframes are more prevalent than you'd think. Any industry that has a large volume of daily mathematical calculations to do most likely uses a mainframe computer as part of its normal operations. The banking industry is a good example: the website and customer-facing applications are certainly not run on a mainframe, but the nightly accounting processes probably are.

The following article walks through the history of the mainframe computer and how it has evolved over the years. 
https://www.thocp.net/hardware/mainframe.htm

<hr>

### How are we supposed to read that?
This is where NISPUF14_CODEBOOK.PDF comes into the picture. Section 1 of the PDF contains the description of the positional formatting information for each data field. Here's how it works!

As an example, let's say that our data file looked like this:<br>
CAT  FLUFFY410<br>
DOG  FIDO  522<br>
BIRD CHIRP 2 1<br>

At a glance, we can determine that each line contains information about animals. We can see a field representing an animal_type and perhaps an animal_name.  However, we have little to no information about what the numerics at the end of each line mean. Or even how many fields the numeric group is representing. The last line is leading us to believe that there might be more than one field represented, but we are not confident at this point.


### Does this come with a 'Magical Decoder Ring'?
Short of an actual magical ring, I'd settle for a description of each field and its position in the line. It would be even better if the description were written down for future reference.

Let's look at the above animal dataset in conjunction with the  following description:<br>
Type 1 5<br>
Name 6 11<br>
Age 12 12<br>
Weight 13 14<br>

Aaahhhhh! Now everything is starting to come together!!! We can now confirm that the first field is indeed animal_type, and the second is animal_name. However, we now know that the numeric grouping is really two fields, animal_age and animal_weight. We can also see that animal_age is a single digit, and animal_weight is a 2-digit numeric. We are also able to determine at this point that the animal_name on the first line is actually 'FLUFFY' and not 'FLUFFY410'.

Time to add a little code to this example.

<hr>

In [None]:
# Load the sample data into a list
animal_data = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1']

# process each animal_data line, slicing it at the documented positions
for line in animal_data:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]

    print(f'{animal_name} is a {animal_type} that is {age} years old and weighs {weight} pound(s)')


Hopefully, things are looking less scary at this point.

Retrieving data from a positionally formatted file is just a matter of chunking the larger string up into smaller pieces. The trick is in determining where to make those chunks. 
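One way to avoid hard-coding those chunk boundaries is to keep the layout description as data and derive the slices from it. The sketch below assumes the start and end numbers in the description are 1-based and inclusive, which matches the animal example. The names `layout` and `parse_line` are illustrative only and do not come from the assignment files.

```python
# Layout taken from the description above: (field name, first column, last column),
# with 1-based, inclusive column numbers.
layout = [('Type', 1, 5), ('Name', 6, 11), ('Age', 12, 12), ('Weight', 13, 14)]


def parse_line(line, layout):
    """Cut one positional line into a dict of field name -> raw text."""
    # Column `first` (1-based) lives at index `first - 1`; the inclusive last
    # column is also the exclusive end of the Python slice.
    return {name: line[first - 1:last] for name, first, last in layout}


record = parse_line('CAT  FLUFFY410', layout)
print(record)
```

The raw values still carry their fill characters (for example `'CAT  '`), so any trimming is a separate step.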

The key to all this is the 'magical decoder' description because there are no other clues in the file itself. Unlike a CSV file, a positionally formatted file doesn't have a delimiter to help identify individual data elements.

That being said, positional formats do account for every character within a row, meaning that even unused characters are given a value. In our example above, a blank character (' ') was used to fill unused positions. The fill value can be literally anything. For example, if '-' were used instead of ' ', our sample data would look like:

CAT--FLUFFY410<br>
DOG--FIDO--522<br>
BIRD-CHIRP-2-1<br>

Let's see if our code above still works.

In [None]:
# Load the sample data into a list
animal_data2 = ['CAT--FLUFFY410', 'DOG--FIDO--522', 'BIRD-CHIRP-2-1']

# process each animal_data2 line, slicing it at the same positions as before
for line in animal_data2:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]

    print(f'{animal_name} is a {animal_type} that is {age} years old and weighs {weight} pound(s)')

Aside from changing the initial list that contains our dataset, no coding changes were needed. 

Our output looks a little different, but that's because of the different unused character representation. Both of the above examples have their respective unused character values in the data elements.  It's just easier to see in the second example over the first.

Let's try stripping out the unused characters in both examples.

In [None]:
# working with the second dataset, animal_data2, first.

# process each line, stripping the '-' fill character from every field
for line in animal_data2:
    animal_type = line[0:5].strip('-')
    animal_name = line[5:11].strip('-')
    age = line[11].strip('-')
    weight = line[12:14].strip('-')

    print(f'{animal_name} is a {animal_type} that is {age} years old and weighs {weight} pound(s)')

In [None]:
# repeat the same thing with the first set, animal_data.

# this dataset is filled with blanks rather than '-', so strip whitespace instead
for line in animal_data:
    animal_type = line[0:5].strip()
    animal_name = line[5:11].strip()
    age = line[11].strip()
    weight = line[12:14].strip()

    print(f'{animal_name} is a {animal_type} that is {age} years old and weighs {weight} pound(s)')

<div class="alert alert-success">
Success!! The two outputs match!
</div>
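Because the two cells above differ only in the fill character, the stripping can be folded into one helper that takes the fill character as a parameter. This is a minimal sketch, not part of the assignment; `parse_animal` and its `fill` parameter are hypothetical names, and the conversion of age and weight to `int` is an extra step added for illustration.

```python
def parse_animal(line, fill=' '):
    """Slice one animal line and strip the given fill character from each field."""
    return {
        'type': line[0:5].strip(fill),
        'name': line[5:11].strip(fill),
        # age and weight are numeric fields, so convert them after stripping
        'age': int(line[11:12].strip(fill)),
        'weight': int(line[12:14].strip(fill)),
    }


blank_filled = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1']
dash_filled = ['CAT--FLUFFY410', 'DOG--FIDO--522', 'BIRD-CHIRP-2-1']

animals_blank = [parse_animal(line) for line in blank_filled]
animals_dash = [parse_animal(line, fill='-') for line in dash_filled]

# Both encodings should describe the same three animals
print(animals_blank == animals_dash)
```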

<hr>

### Back to our assignment
Section 1 of NISPUF14_CODEBOOK.PDF contains the description of the positional format for nispuf14.dat.

<div class="alert alert-block alert-info">
<b>Helpful Hint::</b> Combining PyPDF2 and Tabula works well for parsing the information in Section 1 of NISPUF14_CODEBOOK.PDF: PyPDF2 to retrieve Section 1 of the PDF, and Tabula to get the positional formatting information off the PDF and into a pandas dataframe.
</div>

Installation reminders from the Week 3 FTE:
<div class="alert alert-block alert-success">
<b>Installation - PyPDF2::</b> PyPDF2 can be installed as normal using pip.
</div>

<div class="alert alert-block alert-success">
<b>Installation - Tabula::</b> To install the tabula package, you can use pip as shown before. https://pypi.org/project/tabula-py/
</div>

<div class="alert alert-block alert-success">
<b>Installation - Java::</b> Note: to use tabula, you need to have the latest version of Java installed. https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py has some useful information if you need help getting Java installed on your machine.
</div>

#### Assignment Approach

<div class="alert alert-block alert-warning">
<b>One possible solution: </b> Students are encouraged to define their approach when completing any assignment in this class.  Below, I have shared my approach to the assignment for this week.  Feel free to use some or all of this design, if you'd like.
</div>



    for each line in the file:
        data_line = new list
        for each variable (row) found in the dataframe:
            create a dictionary with the variable name as key,
            use the start / end position numbers as a slice of the line to get the dictionary's value
            append dictionary to data_line
        write data_line to CSV file
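The outline above can be sketched in plain Python as follows. Several things here are assumptions made for illustration: the layout is limited to the first three codebook variables (SEQNUMC, SEQNUMHH, PDAT), the two sample lines are just the first 12 characters implied by those variables' values for the first two records, the output goes to an in-memory buffer rather than a real file, and each line becomes a single dictionary (one CSV row) instead of a list of one-key dictionaries.

```python
import csv
import io

# (variable name, first column, last column), 1-based and inclusive
layout = [('SEQNUMC', 1, 6), ('SEQNUMHH', 7, 11), ('PDAT', 12, 12)]

# Stand-in for the lines of nispuf14.dat (first 12 characters only)
sample_lines = ['000011000012', '000021000021']

out = io.StringIO()  # swap for open('some_file.csv', 'w', newline='') to write a file
writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in layout])
writer.writeheader()

for line in sample_lines:
    # one dictionary per line: variable name -> sliced, trimmed value
    row = {name: line[first - 1:last].strip() for name, first, last in layout}
    writer.writerow(row)

csv_text = out.getvalue()
print(csv_text)
```

Using `csv.DictWriter` keeps the column order tied to the layout, so adding the remaining variables only means extending `layout`.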

<hr>

In [None]:
!pip install tabula-py

In [1]:
import PyPDF2
from tabula import read_pdf as trp
import pandas as pd
import numpy as np

In [2]:
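# Open the codebook and peek at the text on its fifth page (page index 4)
# NOTE: PdfReader, .pages and .extract_text() are the PyPDF2 2.x/3.x names for
# the older PdfFileReader, .getPage() and .extractText()
with open('./data/NISPUF14_CODEBOOK.PDF', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    page_text = reader.pages[4].extract_text()
page_text.split('\n')[1]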


'Position Position Section Variable Label SEQNUMC 1 6 1 UNIQUE CHILD IDENTIFIER SEQNUMHH 7 11 1 UNIQUE HOUSEHOLD IDENTIFIER PDAT 12 12 1 CHILD HAS ADEQUATE PROVIDER DATA PROVWT_D 13 31 1 FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES TERRITORIES) PROVWT_D_TERR 32 50 1 FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT INCLUDING '

In [4]:
target = "./data/NISPUF14_CODEBOOK.PDF"

df = trp(target, pages='5-21')
df = df.dropna()

In [5]:
df.head()

Unnamed: 0,Variable Name,Position,Position.1,Section,Variable Label
0,SEQNUMC,1,6,1,UNIQUE CHILD IDENTIFIER
1,SEQNUMHH,7,11,1,UNIQUE HOUSEHOLD IDENTIFIER
2,PDAT,12,12,1,CHILD HAS ADEQUATE PROVIDER DATA
3,PROVWT_D,13,31,1,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES
5,PROVWT_D_TERR,32,50,1,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT INCLUDING


In [7]:
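# 'Position' holds each field's first column and 'Position.1' its last column
# (both 1-based); repeated header rows cannot be converted and become NaN
df['Start'] = pd.to_numeric(df['Position'], errors='coerce')
df['Len'] = pd.to_numeric(df['Position.1'], errors='coerce')

In [8]:
# Convert to 0-based, end-exclusive slice bounds
df['Start'] = df['Start'] - 1          # 0-based start index
df['Len'] = df['Len'] - df['Start']    # field width = last column - 0-based start
df['End'] = df['Start'] + df['Len']    # exclusive end index (equals the last column)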
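To sanity-check that arithmetic, here is a small worked example using the PROVWT_D row from the codebook (columns 13 to 31). The `ruler` string is a made-up test line, not real survey data: its character at 1-based column c is the digit c % 10, which makes it easy to see which columns a slice picked up.

```python
# A ruler line whose character at column c (1-based) is the digit c % 10
ruler = ''.join(str(col % 10) for col in range(1, 51))

first_col, last_col = 13, 31   # PROVWT_D positions from the codebook

start = first_col - 1          # 0-based start index
length = last_col - start      # field width
end = start + length           # exclusive end index, equal to last_col

field = ruler[start:end]
print(start, length, end, len(field))
print(field[0], field[-1])     # digits for column 13 and column 31
```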

In [9]:
df.isnull().sum(axis = 0)

Variable Name      0
Position           0
Position.1         0
Section            0
Variable Label     0
Start             16
Len               16
End               16
dtype: int64

In [10]:
# find the positional indexes of the rows where Start is still NaN
index = df['Start'].index[df['Start'].apply(np.isnan)]
df_index = df.index.values.tolist()
del_index = [df_index.index(i) for i in index]
del_index

[34, 60, 76, 93, 108, 123, 137, 162, 200, 239, 276, 310, 349, 388, 427, 466]

In [11]:
# looking at one of the bad rows
df.iloc[60]

Variable Name      Variable Name
Position                Position
Position.1              Position
Section                  Section
Variable Label    Variable Label
Start                        NaN
Len                          NaN
End                          NaN
Name: 86, dtype: object

In [12]:
# creating an index of all the rows that contain "Position" in the Position field (repeated table headers)
indexNames = df[ df['Position'] == 'Position'].index
indexNames

Int64Index([41, 86, 136, 184, 233, 284, 334, 381, 421, 460, 497, 539, 578, 617,
            656, 695],
           dtype='int64')

In [13]:
# dropping the rows listed in indexNames from the dataframe
df.drop(indexNames, inplace=True)

In [14]:
#Checking to see if there are any remaining Nulls 
df.isnull().sum(axis = 0)

Variable Name     0
Position          0
Position.1        0
Section           0
Variable Label    0
Start             0
Len               0
End               0
dtype: int64

In [15]:
df['Len'] = pd.to_numeric(df['Len'], downcast='integer')
df['Start'] = pd.to_numeric(df['Start'], downcast='integer')
df['End'] = pd.to_numeric(df['End'], downcast='integer')

In [16]:
df.dtypes

Variable Name     object
Position          object
Position.1        object
Section           object
Variable Label    object
Start              int16
Len                 int8
End                int16
dtype: object

In [None]:
# Creating a list of Variable Names for the layout
col_names = df['Variable Name']
col_names

In [27]:
# read in data
lines_list = []
file = './data/nispuf14.dat'

with open(file) as f:
    for line in f:
        # remove only the line ending; leading blanks are part of the positional layout
        lines_list.append(line.rstrip('\n'))

In [28]:
data = [row for row in lines_list]

24897


24897

In [30]:
from tqdm import tqdm

final_list = []
for x in tqdm(data):
    key_dict = {}
    # slice the current line once per variable described in the layout dataframe
    for index, row in df.iterrows():
        key = row['Variable Name']
        if key not in key_dict:
            key_dict[key] = []
        # Start/End are 0-based, end-exclusive slice bounds
        key_dict[key].append(x[row.Start:row.End])
    final_list.append(key_dict)


100%|██████████| 24897/24897 [29:57<00:00, 13.85it/s] 
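Much of that half hour likely goes to calling `df.iterrows()` again for every one of the 24,897 lines. A possible speed-up, sketched here under assumptions, is to pull the names and slice bounds out of the dataframe once and reuse them for every line. The three fields below are the first three codebook variables with their 0-based `Start` and exclusive `End`; in the notebook, `fields` could instead be built with `list(zip(df['Variable Name'], df['Start'], df['End']))`. The sample lines are stand-ins holding only the first 12 characters of two records.

```python
# (variable name, 0-based start, exclusive end), extracted from the layout once
fields = [('SEQNUMC', 0, 6), ('SEQNUMHH', 6, 11), ('PDAT', 11, 12)]

# Build the slice objects a single time instead of once per line
slices = [(name, slice(start, end)) for name, start, end in fields]

# Stand-in for the lines read from nispuf14.dat
sample_lines = ['000011000012', '000021000021']

# One flat dictionary per line; values are plain strings, not one-element lists
records = [{name: line[sl] for name, sl in slices} for line in sample_lines]
print(records)
```

Storing plain strings rather than one-element lists also changes the shape of the JSON written later, so any downstream cell that expects lists would need the same adjustment.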


In [32]:
import json

# write the list of records as JSON; json.dump serializes straight to the file
with open('nispuf14.fixed', 'w') as out_file:
    json.dump(final_list, out_file)


In [33]:
testdf = pd.read_json('nispuf14.fixed')
testdf

Unnamed: 0,SEQNUMC,SEQNUMHH,PDAT,PROVWT_D,PROVWT_D_TERR,RDDWT_D,RDDWT_D_TERR,STRATUM,YEAR,AGECPOXR,...,XVRCTY7,XVRCTY8,XVRCTY9,INS_1,INS_2,INS_3,INS_3A,INS_4_5,INS_6,INS_11
0,[000011],[00001],[2],[ . ],[ . ],[ 218.30024855484000],[ 218.30024855484000],[1022],[2014],[.],...,[ ],[ ],[ ],[ .],[ .],[ .],[ .],[ .],[ .],[ .]
1,[000021],[00002],[1],[ 806.84601169505000],[ 806.84601169505000],[ 454.86041741251200],[ 454.86041741251200],[2036],[2014],[.],...,[ ],[ ],[ ],[ 2],[ .],[ .],[ 2],[ 2],[ 2],[ .]
2,[000031],[00003],[2],[ . ],[ . ],[ 30.54542540283290],[ 30.54542540283290],[1072],[2014],[.],...,[ ],[ ],[ ],[ .],[ .],[ .],[ .],[ .],[ .],[ .]
3,[000041],[00004],[1],[ 63.44868567610260],[ 63.44868567610260],[ 36.96593137368630],[ 36.96593137368630],[2016],[2014],[.],...,[ ],[ ],[ ],[ 1],[ 2],[ 2],[ .],[ 2],[ 2],[ 2]
4,[000051],[00005],[1],[ 94.87263225744540],[ 94.87263225744540],[ 64.62020426239790],[ 64.62020426239790],[1073],[2014],[.],...,[ ],[ ],[ ],[ 2],[ 1],[ 1],[ .],[ 2],[ 2],[77]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24892,[238771],[23877],[2],[ . ],[ . ],[ 503.27830948320800],[ 503.27830948320800],[2046],[2014],[.],...,[ ],[ ],[ ],[ .],[ .],[ .],[ .],[ .],[ .],[ .]
24893,[238781],[23878],[2],[ . ],[ . ],[ 148.34276841124900],[ 148.34276841124900],[2029],[2014],[.],...,[ ],[ ],[ ],[ .],[ .],[ .],[ .],[ .],[ .],[ .]
24894,[238791],[23879],[2],[ . ],[ . ],[ 74.63170096114980],[ 74.63170096114980],[2063],[2014],[.],...,[ ],[ ],[ ],[ 1],[ .],[ .],[ 2],[ 2],[ 2],[ 2]
24895,[238801],[23880],[1],[ 21.91889835428250],[ 21.91889835428250],[ 22.26043136519570],[ 22.26043136519570],[1049],[2014],[.],...,[ ],[ ],[ ],[ 2],[ .],[ .],[ 1],[ 2],[ 2],[ 2]
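Every value in the table above is wrapped in a one-element list because the extraction loop appends each slice to a list. A small post-processing step can unwrap and trim the values. This is a sketch only: `flatten_record` is an illustrative name, and the sample record copies a few values from the first displayed row (the exact amount of padding in the real fields may differ from what the display shows).

```python
def flatten_record(record):
    """Turn {'PDAT': ['2']} into {'PDAT': '2'}, trimming blank fill characters."""
    return {name: values[0].strip() for name, values in record.items()}


# A few fields from the first displayed row, in the list-wrapped shape
first_record = {
    'SEQNUMC': ['000011'],
    'SEQNUMHH': ['00001'],
    'PDAT': ['2'],
    'YEAR': ['2014'],
    'INS_1': [' .'],
}

flat = flatten_record(first_record)
print(flat)
```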


In [None]:
import csv

In [None]:
# with open('./data/nispuf14.dat', 'r') as f:
#     csv_reader = csv.DictReader(f)
    
#     with open('./data/nispuf14.fixed', 'w') as newfile:
#         fieldnames = list(key_dict2.keys)
#         csv_writer = csv.DictWriter(newfile, fieldnames=fieldnames, delimiter=',')
#         for line in csv_reader:
#             print(line)


In [None]:
lines_list = []
file = ('./data/nispuf14.dat')

with open(file) as f:
    for line in f:
        lines_list.append(line.strip())
for line in lines_list:
    key_dict3 = {}
    for k,v in key_dict2:
        print(k)

In [None]:
# test_dict2 = {}
# for index, row in df.iterrows():
#     key = row.Start
#     if key not in test_dict2:
#         test_dict2[key] = []
#     test_dict2[key].append(row.End)
    
# test_dict2

In [None]:
test_dict[1]