<a href="https://colab.research.google.com/github/j-buss/wi-dpi-analysis/blob/development/eda/1.0_Get_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Salary and Education in Wisconsin

This notebook describes an analysis of teacher salaries, using staff data published by the Wisconsin Department of Public Instruction (DPI).


## Data Download



The source files come from the DPI published staff data page: https://dpi.wi.gov/cst/data-collections/staff/published-data

In [0]:
!pip install --upgrade google-cloud-storage

Collecting google-cloud-storage
  Downloading https://files.pythonhosted.org/packages/9c/aa/048f5b3950f78c9e6afdb05e3667abb7a7ca4463bfde002257acd1874c3f/google_cloud_storage-1.15.0-py2.py3-none-any.whl (64kB)
Installing collected packages: google-cloud-storage
  Found existing installation: google-cloud-storage 1.13.2
    Uninstalling google-cloud-storage-1.13.2:
      Successfully uninstalled google-cloud-storage-1.13.2
Successfully installed google-cloud-storage-1.15.0


In [0]:
from google.cloud import storage

In [0]:
def upload_blob(project_id, bucket_name, string, destination_blob_name):
    """Uploads a string as the contents of a blob in the bucket."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_string(string)

    #print('String uploaded to {}.'.format(destination_blob_name))

In [0]:
def rename_blob(project_id, bucket_name, blob_name, new_name):
    """Renames a blob."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)

    new_blob = bucket.rename_blob(blob, new_name)

    print('Blob {} has been renamed to {}'.format(
        blob.name, new_blob.name))

In [0]:
project_id='wi-dpi-010'
raw_data_bucket_name='landing-009'
source_name='all_staff_report'
year='2017_2018'
filename='AllStaffReportPublic__04152019_194414.csv'
full_filename=raw_data_bucket_name + '/' + source_name + '/' + year + '/' + filename

landing_dataset_name='landing'
landing_table_name=source_name
landing_bq_fullname=landing_dataset_name + '.' + landing_table_name

refined_dataset_name='refined'
refined_table_name=source_name
refined_bq_fullname=refined_dataset_name + '.' + refined_table_name



In [0]:
# Authenticate to GCS.
from google.colab import auth
auth.authenticate_user()

In [0]:
# School-year labels from '1995_1996' through '2015_2016'.
school_years = [(str(i) + '_' + str(i+1)) for i in range(1995,2016)]

In [0]:
# Upload an empty object whose name ends in '/' so each school year shows up as a folder.
for x in school_years:
  upload_blob(project_id, raw_data_bucket_name, '', 'all_staff_report/' + x + '/')

In [0]:
file_dict = [
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/95staff.txt",'new_name':"all_staff_report/1995_1996/95staff.txt",'landing_tablename':"1995",'file_type':"fixed"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/96staff.txt",'new_name':"all_staff_report/1996_1997/96staff.txt",'landing_tablename':"1996",'file_type':"fixed"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/97staff.txt",'new_name':"all_staff_report/1997_1998/97staff.txt",'landing_tablename':"1997",'file_type':"fixed"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/98staff.txt",'new_name':"all_staff_report/1998_1999/98staff.txt",'landing_tablename':"1998",'file_type':"fixed"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/99STAFF.DAT",'new_name':"all_staff_report/1999_2000/99STAFF.DAT",'landing_tablename':"1999",'file_type':"fixed"},
    
    
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/00staff.dat",'new_name':"all_staff_report/2000_2001/00staff.dat",'landing_tablename':"2000"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/01staff.dat",'new_name':"all_staff_report/2001_2002/01staff.dat",'landing_tablename':"2001"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/02staff.txt",'new_name':"all_staff_report/2002_2003/02staff.txt",'landing_tablename':"2002"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/03staff.txt",'new_name':"all_staff_report/2003_2004/03staff.txt",'landing_tablename':"2003"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/04staff.dat",'new_name':"all_staff_report/2004_2005/04staff.dat",'landing_tablename':"2004"},
                                                                                                                                                 
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/05staff.txt",'new_name':"all_staff_report/2005_2006/05staff.txt",'landing_tablename':"2005"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/06staff.txt",'new_name':"all_staff_report/2006_2007/06staff.txt",'landing_tablename':"2006"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/07staff.txt",'new_name':"all_staff_report/2007_2008/07staff.txt",'landing_tablename':"2007"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/08STAFF.TXT",'new_name':"all_staff_report/2008_2009/08STAFF.TXT",'landing_tablename':"2008"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/09STAFF.TXT",'new_name':"all_staff_report/2009_2010/09STAFF.TXT",'landing_tablename':"2009"},
                                                                                                                                                 
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/10STAFF.TXT",'new_name':"all_staff_report/2010_2011/10STAFF.TXT",'landing_tablename':"2010"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/11STAFF.txt",'new_name':"all_staff_report/2011_2012/11STAFF.txt",'landing_tablename':"2011"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/12STAFF.txt",'new_name':"all_staff_report/2012_2013/12STAFF.txt",'landing_tablename':"2012"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/13staff.txt",'new_name':"all_staff_report/2013_2014/13staff.txt",'landing_tablename':"2013"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/14staff.txt",'new_name':"all_staff_report/2014_2015/14staff.txt",'landing_tablename':"2014"},
    
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/2015.csv",'new_name':"all_staff_report/2015_2016/2015.csv",'landing_tablename':"2015"},
    {'old_name':"all_staff_report/temp/AllStaff_Open_Files/2016.csv",'new_name':"all_staff_report/2016_2017/2016.csv",'landing_tablename':"2016"}
      
]
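
The destination paths in the table above follow a regular pattern: the leading digits of each file name give the starting year of the school year. As a minimal sketch, assuming that pattern holds for every file, the paths could be derived rather than typed by hand. The helper `school_year_path` and the 19xx/20xx cutoff are assumptions for illustration and are not part of the original notebook.

```python
import re

def school_year_path(filename, root="all_staff_report"):
    """Return '<root>/<YYYY>_<YYYY+1>/<filename>' for names like '95staff.txt' or '2015.csv'."""
    match = re.match(r"(\d{4}|\d{2})", filename)
    if match is None:
        raise ValueError("no leading year in {!r}".format(filename))
    digits = match.group(1)
    if len(digits) == 4:
        start = int(digits)
    else:
        # Assumption: two-digit years 90-99 mean 19xx, everything else means 20xx.
        yy = int(digits)
        start = 1900 + yy if yy >= 90 else 2000 + yy
    return "{}/{}_{}/{}".format(root, start, start + 1, filename)

print(school_year_path("95staff.txt"))
print(school_year_path("00staff.dat"))
print(school_year_path("2015.csv"))
```

Each printed path matches the corresponding `new_name` entry in the table.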

In [0]:
# Skip the first entry (95staff.txt) and rename the remaining files.
for rename in file_dict[1:]:
  rename_blob(project_id, raw_data_bucket_name, rename['old_name'], rename['new_name'])

Blob all_staff_report/temp/AllStaff_Open_Files/96staff.txt has been renamed to all_staff_report/1996_1997/96staff.txt
Blob all_staff_report/temp/AllStaff_Open_Files/97staff.txt has been renamed to all_staff_report/1997_1998/97staff.txt
Blob all_staff_report/temp/AllStaff_Open_Files/98staff.txt has been renamed to all_staff_report/1998_1999/98staff.txt
Blob all_staff_report/temp/AllStaff_Open_Files/99STAFF.DAT has been renamed to all_staff_report/1999_2000/99STAFF.DAT
Blob all_staff_report/temp/AllStaff_Open_Files/00staff.dat has been renamed to all_staff_report/2000_2001/00staff.dat
Blob all_staff_report/temp/AllStaff_Open_Files/01staff.dat has been renamed to all_staff_report/2001_2002/01staff.dat
Blob all_staff_report/temp/AllStaff_Open_Files/02staff.txt has been renamed to all_staff_report/2002_2003/02staff.txt
Blob all_staff_report/temp/AllStaff_Open_Files/03staff.txt has been renamed to all_staff_report/2003_2004/03staff.txt
Blob all_staff_report/temp/AllStaff_Open_Files/04staff.d

In [0]:
from google.cloud.storage import Blob

client = storage.Client(project=project_id)
bucket = client.get_bucket(raw_data_bucket_name)
#encryption_key = 'c7f32af42e45e85b9848a6a14dd2a8f6'
#blob = Blob('secure-data', bucket, encryption_key=encryption_key)
# Blob needs a name and a bucket; reuse the name from the commented-out example.
blob = Blob('secure-data', bucket)
blob.upload_from_string('my secret message.')
#with open('/tmp/my-secure-file', 'wb') as file_obj:
#    blob.download_to_file(file_obj)

## Data Preparation

### Load libraries
Install the following packages in order to load data into BigQuery.

*Note: this requires restarting the runtime.*

In [0]:
!pip install gcsfs
!pip install pandas-gbq -U
import gcsfs

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/30/7b/bb9dd860c64f15a06fdefdd3ea6c30ae336f3f5524f800cac59592769bf7/gcsfs-0.2.1.tar.gz (51kB)
Building wheels for collected packages: gcsfs
  Building wheel for gcsfs (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/58/b5/19/7b0e8a870ef16e1c0b8eee819c511c789be5cde308e59f2752
Successfully built gcsfs
Installing collected packages: gcsfs
Successfully installed gcsfs-0.2.1
Collecting pandas-gbq
  Downloading https://files.pythonhosted.org/packages/6a/65/bc46678a5550c0cef1700d7292319deae716751af3f6158250d6a3a454ed/pandas_gbq-0.10.0-py2.py3-none-any.whl
Collecting pydata-google-auth (from pandas-gbq)
  Downloading https://files.pythonhosted.org/packages/89/c5/03b68c114bc2c2bcaa2e40fdf269a14361fa75b70a09415e8bad65413b75/pydata_google_auth-0.1.3-py2.py3-none-any.whl
Collecting google-cloud-bigquery>=1.9.0 (from pandas-gbq)

### Import Libraries

In [0]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 5)
import seaborn as sns
import matplotlib.pyplot as plt

from google.cloud import bigquery

In [0]:
%matplotlib inline
plt.style.use('bmh')

The project, bucket, and dataset names used below were set in the configuration cell above.

### Functions

In [0]:
def create_dataset(client, project_id, dataset_name):
  """Create a BigQuery dataset named dataset_name in project_id, located in the US."""
  dataset_id = "{}.{}".format(project_id, dataset_name)
  dataset = bigquery.Dataset(dataset_id)
  dataset.location = "US"

  dataset = client.create_dataset(dataset)
  print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

In [0]:
def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',','').replace('$', '')
    return float(new_val)
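
As a quick check of the conversion, here is a small sketch that repeats the function so it runs on its own and applies it to a few made-up strings. The sample values are assumptions, not rows from the DPI data.

```python
def convert_currency(val):
    """Same logic as the notebook function: strip '$' and ',' and cast to float."""
    return float(val.replace(',', '').replace('$', ''))

# Made-up inputs for illustration only.
samples = ['$45,000.50', '1,200', '300']
converted = [convert_currency(s) for s in samples]
print(converted)  # [45000.5, 1200.0, 300.0]
```

Note that the function expects a string. A missing value such as `None` or `NaN` would raise an error, so the caller has to guard against those.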

In [0]:
def prep_name(val):
  """
  Convert a name to title case: first letter of each word capitalized, the rest lowercase
  """
  new_val  = val.lower().title()
  return new_val
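
The sketch below shows how title casing behaves on a few made-up names. The function is repeated so the block runs on its own, and the names are illustrative assumptions rather than values from the data.

```python
def prep_name(val):
    """Same logic as the notebook function: lowercase, then title-case."""
    return val.lower().title()

# Made-up names for illustration only.
for raw in ['WILSON', "O'BRIEN", 'SMITH-JONES', 'MCDONALD']:
    print(raw, '->', prep_name(raw))
```

`str.title` capitalizes the first letter after any non-letter character, so `O'BRIEN` becomes `O'Brien` and `SMITH-JONES` becomes `Smith-Jones`, while `MCDONALD` becomes `Mcdonald`. Names with internal capitals would need separate handling if that matters for the analysis.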

### Load Data

In [0]:
client = bigquery.Client(project_id)

**Load data from Google Cloud Storage Bucket to BigQuery Landing**

In [0]:
create_dataset(client, project_id, landing_dataset_name)

In [0]:
print (full_filename)

landing-009/all_staff_report/2017_2018/AllStaffReportPublic__04152019_194414.csv


In [0]:
import gcsfs

In [0]:
fs = gcsfs.GCSFileSystem(project=project_id)


In [0]:
i = file_dict[0]

In [0]:
storage_client = storage.Client(project=project_id)
bucket = storage_client.get_bucket(raw_data_bucket_name)
blob = bucket.get_blob(i['new_name'])
data = blob.download_as_string()

In [0]:
print (data[0:1000])

b'000016044WILSON              JUDITH          FW1962419851995R                                    01 000000003300009888409125264N0802695H  001    NON-EDUCATIONAL AGENCY        NON-EDUCATIONAL AGENCY                                           33                                                                                                                                                                                                     WI                           WI\r\n000003685STARK               LISA                       1995R                                    01 00000070030999630003E120464N1000070OT 001    ALGOMA SCH DIST               ALGOMA SCH DIST               00731KEWAUNEE COUNTY               031715 DIVISION ST              ALGOMA WI  54201                                            1715 DIVISION ST              ALGOMA WI  54201                                            ALGOMA           WI54201     ALGOMA           WI54201     414-487-7001000780DALE N LARSON\r\n000014696
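
The output above shows that each record is a fixed-width line with no delimiters, so fields have to be cut out by character position. The following is a minimal sketch of that idea in plain Python. The field names and offsets in `LAYOUT` are hypothetical and would need to be checked against the DPI file layout documentation, and the record is synthetic.

```python
# Hypothetical layout: (field name, start, end) with half-open offsets.
# These positions are illustrative assumptions, not the documented DPI layout.
LAYOUT = [
    ("id", 0, 9),
    ("last_name", 9, 29),
    ("first_name", 29, 45),
]

def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}

# Synthetic record built to match the hypothetical layout above.
record = "000000001" + "DOE".ljust(20) + "JANE".ljust(16) + "1995R"
print(parse_fixed_width(record))
```

`pd.read_fwf` accepts the same kind of information through its `colspecs` argument, as a list of `(start, end)` half-open intervals, which is usually more reliable than `colspecs='infer'` once the layout is known.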

In [0]:
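import io

# read_fwf expects a path or file-like object, not raw bytes, so wrap the decoded text.
# Assumptions: latin-1 encoding; no header row (the first line above is a data record).
df = pd.read_fwf(io.StringIO(data.decode('latin-1')), colspecs='infer', header=None)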

In [0]:
fn = raw_data_bucket_name + '/' + i['new_name']
print(fn)

landing-009/all_staff_report/1995_1996/95staff.txt


In [0]:
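import io

with fs.open(fn) as f:
    # read_fwf expects a path or file-like object, not raw bytes, so wrap the decoded text.
    # Assumptions: latin-1 encoding; no header row (the first line is a data record).
    df = pd.read_fwf(io.StringIO(f.read().decode('latin-1')), colspecs='infer', header=None)
    #df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
    #df.to_gbq('landing.' + i['landing_tablename'],project_id=project_id,if_exists='replace')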

In [0]:
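for i in file_dict:
  # gcsfs paths need the bucket name as a prefix.
  # Note: entries with file_type 'fixed' are fixed-width and will not parse correctly with read_csv.
  with fs.open(raw_data_bucket_name + '/' + i['new_name']) as f:
    df = pd.read_csv(f, skiprows=1)
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '', regex=False).str.replace(')', '', regex=False)
    df.to_gbq('landing.' + i['landing_tablename'],project_id=project_id,if_exists='replace')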

In [0]:

with fs.open(full_filename) as f:
  df = pd.read_csv(f, skiprows=1)

Get data from "Landing" table

In [0]:
query = '''
  SELECT
    *
  FROM
    `{project}.{dataset}.{table}`
 '''.format(project=project_id, dataset=landing_dataset_name, table=landing_table_name)

In [0]:
# The backtick-quoted table reference is standard SQL, so set the dialect explicitly.
df = pd.read_gbq(query, project_id=project_id, dialect='standard', reauth=True)


Minimal formatting to prepare the data for analysis:


1.   Convert ***total_salary*** and ***total_fringe*** from string to float
2.   Standardize ***first_name*** and ***last_name*** to title case (first letter capitalized)

In [0]:
df['total_salary'] = df['total_salary'].apply(lambda x: convert_currency(x) if pd.notnull(x) else x)
df['total_fringe'] = df['total_fringe'].apply(lambda x: convert_currency(x) if pd.notnull(x) else x)
df['last_name'] = df['last_name'].apply(lambda x: prep_name(x) if pd.notnull(x) else x)
df['first_name'] = df['first_name'].apply(lambda x: prep_name(x) if pd.notnull(x) else x)
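
The `pd.notnull` guard in the cell above can also be factored into a small helper so that every cleaning function skips missing values the same way. This is a sketch only: `is_missing` and `apply_if_present` are hypothetical names, and `is_missing` is a plain-Python stand-in for `pd.isnull` on scalars so that the block runs without pandas.

```python
import math

def is_missing(val):
    """Stand-in for pd.isnull on scalars: True for None and float NaN."""
    return val is None or (isinstance(val, float) and math.isnan(val))

def apply_if_present(func, val):
    """Apply func to val unless val is missing, in which case return it unchanged."""
    return val if is_missing(val) else func(val)

# Made-up values for illustration only.
print(apply_if_present(str.title, 'WILSON'))
print(apply_if_present(str.title, None))
```

With such a helper, a column could be cleaned with `df['last_name'].apply(lambda x: apply_if_present(prep_name, x))`.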

Create the dataset for "Refined"

In [0]:
create_dataset(client, project_id, refined_dataset_name)

In [0]:
df.to_gbq(refined_bq_fullname,project_id=project_id,if_exists='replace')

1it [00:05,  5.63s/it]


At this point we have a refined table with a minimal set of cleaned-up attributes, which makes analysis easier.

**Create Table for focused analysis**

Now let's take the newly created refined table and filter it down to only full-time teachers. This gives us a focused subset of data to use for the bulk of our initial analysis.

In [0]:
query = '''
  SELECT
    *
  FROM
    `{project}.{dataset}.{table}`
 '''.format(project=project_id, dataset=refined_dataset_name, table=refined_table_name)
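
The query above selects every row, so a filter is still needed to narrow the table to full-time teachers. The following is a rough sketch of how such a filter could be added in standard SQL. The helper `build_filtered_query` is hypothetical, and the column name and value in the example are placeholders, not the actual schema of the refined table.

```python
def build_filtered_query(project, dataset, table, filters):
    """Build a standard SQL SELECT with equality filters joined by AND.

    `filters` maps column names to string values; all names are placeholders.
    """
    conditions = ' AND '.join(
        "{} = '{}'".format(col, val) for col, val in filters.items()
    )
    return 'SELECT * FROM `{}.{}.{}` WHERE {}'.format(project, dataset, table, conditions)

# Hypothetical column name and value; check the refined table's actual schema first.
sql = build_filtered_query(
    'wi-dpi-010', 'refined', 'all_staff_report',
    {'position_description': 'Teacher'},
)
print(sql)
```

Because the table reference uses backticks, the query would need `dialect='standard'` when passed to `pd.read_gbq`. The values are formatted directly into the string, which is acceptable for fixed constants in a notebook but should be replaced with query parameters for any untrusted input.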