# Guidelines for ETL Project

This document contains guidelines, requirements, and suggestions for Project 1.

## Project Proposal

Before you start writing any code, remember that you only have one week to complete this project. View this project as a typical assignment from work. Imagine a bunch of data came in and you and your team are tasked with migrating it to a production database.

Take advantage of your Instructor and TA support during office hours and class project work time. They are a valuable resource and can help you stay on track.

## Finding Data

Your project must use 2 or more sources of data. We recommend the following sites to use as sources of data:

* [data.world](https://data.world/)

* [Kaggle](https://www.kaggle.com/)

You can also use APIs or data scraped from the web. However, get approval from your instructor first. Again, there is only a week to complete this!

## Data Cleanup & Analysis

Once you have identified your datasets, perform ETL on the data. Make sure to plan and document the following:

* The sources of data that you will extract from.

* The type of transformation needed for this data (cleaning, joining, filtering, aggregating, etc.).

* The type of final production database to load the data into (relational or non-relational).

* The final tables or collections that will be used in the production database.

You will be required to submit a final technical report with the above information and steps required to reproduce your ETL process.
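The planning items above can be sketched end to end as a very small pipeline. This is only an illustration under stated assumptions: the inline CSV text, the `industry_counts` table name, and the in-memory SQLite database are hypothetical stand-ins for your real sources and production database.

```python
import io
import sqlite3

import pandas as pd

# Extract: read a source into a DataFrame (inline CSV stands in for a real file).
raw_csv = io.StringIO("code;label;count\n11;Agriculture;3311\n21;Mining;\n22;Utilities;5883\n")
raw_df = pd.read_csv(raw_csv, sep=";")

# Transform: drop rows with a missing count and cast the remaining counts to integers.
clean_df = raw_df.dropna(subset=["count"]).astype({"count": "int64"})

# Load: write the cleaned table to a database (in-memory SQLite in this sketch).
connection = sqlite3.connect(":memory:")
clean_df.to_sql("industry_counts", connection, index=False, if_exists="replace")

# Confirm how many rows reached the database.
loaded_rows = connection.execute("SELECT COUNT(*) FROM industry_counts").fetchone()[0]
print(loaded_rows)
```

Writing the three stages as separate, commented steps also gives you the outline of the technical report: each stage maps to one section.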

## Project Report

At the end of the week, your team will submit a Final Report that describes the following:

* **E**xtract: your original data sources and how the data was formatted (CSV, JSON, MySQL, etc.).

* **T**ransform: what data cleaning or transformation was required.

* **L**oad: the final database, tables/collections, and why this was chosen.

Please upload the report to GitHub and submit a link to Bootcampspot.

## Data Sets

* `eeoc_db.reveal_eeo1_for_2016`

* `eeoc_db.year16_state_nac2`

# Project 2 - ETL with EEOC
# Khrizel Solano and Kelli Okuji Wilson

## Extract

In [2]:
#pip install mysqlclient
import fnmatch
# from regexdict import regexdict
import re
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [6]:
# Store CSV into DataFrame

In [3]:
csv_file = "./Resources/year16_nac2.csv"
eeoc_nac2_df = pd.read_csv(csv_file, sep=';')
eeoc_nac2_df.head()

#eeoc_db.reveal_eeo1_for_2016.csv
#eeoc_db.year16_state_nac2.csv

Unnamed: 0,NAC2_code,TOTAL_UNITS,TOTAL10,WHM1,WHM2,WHM3,WHM4,WHM5,WHM6,WHM7,...,TOMRF7,TOMRF8,TOMRF9,TOMRF1_2,NAC2_Label,i,SUMCOUNT,MISSCOUNT,SMALLEST,INDEX
0,11,1408,321995,3311,7254,2369,4740,2561,5636,12178,...,68,871.0,35,43,"Agriculture, Forestry, Fishing and Hunting",141,4,2,3.0,84.0
1,21,1935,346103,5039,37459,14647,2088,3999,58589,43730,...,32,11.0,8,94,"Mining, Quarrying, and Oil and Gas Extraction",141,9,6,,
2,22,2564,466109,5883,64738,25510,1645,9568,87506,21627,...,20,9.0,4,148,Utilities,141,6,4,,
3,23,9254,1602496,26632,83888,37250,21458,24350,393797,91062,...,88,196.0,52,263,Construction,141,0,0,,
4,31,6878,1714563,15623,39393,18702,39607,17628,83772,174938,...,1300,1630.0,755,531,Manufacturing,141,0,0,,


## Transform

* Transform the EEOC table from wide to narrow so it's easier to join with the Kaggle data set

* Create data dictionaries for race, gender, and job category in EEOC data set to match the Kaggle data set values

* Clean up the values in the job category in the Kaggle data set so it matches with the EEOC data set values

* Create new tables by joining the EEOC and Kaggle data sets
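Each EEOC column name packs three attributes into one string, which is what the data dictionaries in this section have to unpack. As a rough sketch, assuming every code has the form race prefix + `M`/`F` + one job-category digit (as in `WHM1` or `TOMRF9`), a single code can be split as shown below. The helper name `split_eeoc_code` is hypothetical, and the labels are copied from the mappings that appear later in this notebook.

```python
import re

# Labels copied from the mappings that appear later in this notebook.
RACE_LABELS = {
    "WH": "White",
    "BLK": "Black_or_African_American",
    "HISP": "Hispanic_or_Latino",
    "ASIAN": "Asian",
    "AIAN": "American_Indian_Alaskan_Native",
    "NHOPI": "Native_Hawaiian_or_Pacific_Islander",
    "TOMR": "Two_or_more_races",
}
GENDER_LABELS = {"M": "Male", "F": "Female"}
JOB_LABELS = {
    "1": "Senior OFF AND MGRS", "2": "PROF", "3": "TECH", "4": "SALE", "5": "CLERICALS",
    "6": "CRAFT", "7": "OPER", "8": "LABORS", "9": "Service",
}

# Assumed layout: race prefix, then M or F, then one job-category digit.
CODE_PATTERN = re.compile(r"^(" + "|".join(RACE_LABELS) + r")([MF])([1-9])$")


def split_eeoc_code(code):
    """Return (race, gender, job_category) labels for one EEOC column name."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Unrecognized EEOC code: {code!r}")
    race, gender, job = match.groups()
    return RACE_LABELS[race], GENDER_LABELS[gender], JOB_LABELS[job]


print(split_eeoc_code("WHM1"))
print(split_eeoc_code("TOMRF9"))
```

Anchoring the pattern at both ends matters here: the `M` inside `TOMR` is part of the race prefix, not a gender letter.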

In [76]:
# Create new data sets with select columns

In [18]:
eeoc_nac2_All_df = eeoc_nac2_df[['NAC2_code','NAC2_Label','WHM1','WHM2','WHM3','WHM4','WHM5','WHM6','WHM7','WHM8','WHM9','WHF1','WHF2','WHF3','WHF4','WHF5','WHF6','WHF7','WHF8','WHF9','BLKM1','BLKM2','BLKM3','BLKM4','BLKM5','BLKM6','BLKM7','BLKM8','BLKM9','BLKF1','BLKF2','BLKF3','BLKF4','BLKF5','BLKF6','BLKF7','BLKF8','BLKF9','HISPM1','HISPM2','HISPM3','HISPM4','HISPM5','HISPM6','HISPM7','HISPM8','HISPM9','HISPF1','HISPF2','HISPF3','HISPF4','HISPF5','HISPF6','HISPF7','HISPF8','HISPF9','ASIANM1','ASIANM2','ASIANM3','ASIANM4','ASIANM5','ASIANM6','ASIANM7','ASIANM8','ASIANM9','ASIANF1','ASIANF2','ASIANF3','ASIANF4','ASIANF5','ASIANF6','ASIANF7','ASIANF8','ASIANF9','AIANM1','AIANM2','AIANM3','AIANM4','AIANM5','AIANM6','AIANM7','AIANM8','AIANM9','AIANF1','AIANF2','AIANF3','AIANF4','AIANF5','AIANF6','AIANF7','AIANF8','AIANF9','NHOPIM1','NHOPIM2','NHOPIM3','NHOPIM4','NHOPIM5','NHOPIM6','NHOPIM7','NHOPIM8','NHOPIM9','NHOPIF1','NHOPIF2','NHOPIF3','NHOPIF4','NHOPIF5','NHOPIF6','NHOPIF7','NHOPIF8','NHOPIF9','TOMRM1','TOMRM2','TOMRM3','TOMRM4','TOMRM5','TOMRM6','TOMRM7','TOMRM8','TOMRM9','TOMRF1','TOMRF2','TOMRF3','TOMRF4','TOMRF5','TOMRF6','TOMRF7','TOMRF8','TOMRF9']]
eeoc_nac2_All_df.head()

Unnamed: 0,NAC2_code,NAC2_Label,WHM1,WHM2,WHM3,WHM4,WHM5,WHM6,WHM7,WHM8,...,TOMRM9,TOMRF1,TOMRF2,TOMRF3,TOMRF4,TOMRF5,TOMRF6,TOMRF7,TOMRF8,TOMRF9
0,11,"Agriculture, Forestry, Fishing and Hunting",3311,7254,2369,4740,2561,5636,12178,15711,...,71,6,58,26,35.0,100,5,68,871.0,35
1,21,"Mining, Quarrying, and Oil and Gas Extraction",5039,37459,14647,2088,3999,58589,43730,15476,...,12,6,322,71,10.0,214,13,32,11.0,8
2,22,Utilities,5883,64738,25510,1645,9568,87506,21627,6961,...,45,13,600,63,22.0,682,28,20,9.0,4
3,23,Construction,26632,83888,37250,21458,24350,393797,91062,114587,...,173,43,529,101,251.0,1436,274,88,196.0,52
4,31,Manufacturing,15623,39393,18702,39607,17628,83772,174938,80811,...,731,37,621,213,462.0,816,128,1300,1630.0,755


In [None]:
# Convert wide to narrow table using melt

In [19]:
eeoc_nac2_All_unpivot_df=pd.melt(eeoc_nac2_All_df, id_vars=['NAC2_code','NAC2_Label'], var_name="EEOC_Code", value_name="count")
eeoc_nac2_All_unpivot_df.head()

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0
2,22,Utilities,WHM1,5883.0
3,23,Construction,WHM1,26632.0
4,31,Manufacturing,WHM1,15623.0
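A quick way to check a wide-to-narrow reshape like the one above is to confirm that the row count and the grand total behave as expected. The sketch below uses a toy two-row, two-code table (values copied from the output above); the names `wide_df` and `narrow_df` are illustrative only.

```python
import pandas as pd

# Toy wide table: one row per industry, one column per EEOC code.
wide_df = pd.DataFrame({
    "NAC2_code": [11, 21],
    "NAC2_Label": [
        "Agriculture, Forestry, Fishing and Hunting",
        "Mining, Quarrying, and Oil and Gas Extraction",
    ],
    "WHM1": [3311, 5039],
    "WHM2": [7254, 37459],
})

narrow_df = pd.melt(
    wide_df,
    id_vars=["NAC2_code", "NAC2_Label"],
    var_name="EEOC_Code",
    value_name="count",
)

# Each input row yields one output row per melted column.
expected_rows = len(wide_df) * (wide_df.shape[1] - 2)
print(len(narrow_df) == expected_rows)

# Reshaping must not change the grand total.
print(narrow_df["count"].sum() == wide_df[["WHM1", "WHM2"]].to_numpy().sum())
```

The same arithmetic applies to the full table: 126 code columns (7 race prefixes, 2 genders, 9 job categories) melted over 24 industry rows would give 3,024 rows, which matches the length reported further down in this notebook.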


In [11]:
# Create dictionaries for race, gender, and job_categories

In [7]:
# race_dictionary = {
#     **dict.fromkeys(['WHM1','WHM2','WHM3','WHM4','WHM5','WHM6','WHM7','WHM8','WHM9','WHF1','WHF2','WHF3','WHF4','WHF5','WHF6','WHF7','WHF8','WHF9'], "White"),
#     **dict.fromkeys(['BLKM1','BLKM2','BLKM3','BLKM4','BLKM5','BLKM6','BLKM7','BLKM8','BLKM9','BLKF1','BLKF2','BLKF3','BLKF4','BLKF5','BLKF6','BLKF7','BLKF8','BLKF9'], "Black_or_African_American"),
#     **dict.fromkeys(['HISPM1','HISPM2','HISPM3','HISPM4','HISPM5','HISPM6','HISPM7','HISPM8','HISPM9','HISPF1','HISPF2','HISPF3','HISPF4','HISPF5','HISPF6','HISPF7','HISPF8','HISPF9'], "Hispanic_or_Latino"),
#     **dict.fromkeys(['ASIANM1','ASIANM2','ASIANM3','ASIANM4','ASIANM5','ASIANM6','ASIANM7','ASIANM8','ASIANM9','ASIANF1','ASIANF2','ASIANF3','ASIANF4','ASIANF5','ASIANF6','ASIANF7','ASIANF8','ASIANF9'], "Asian"),
#     **dict.fromkeys(['AIANM1','AIANM2','AIANM3','AIANM4','AIANM5','AIANM6','AIANM7','AIANM8','AIANM9','AIANF1','AIANF2','AIANF3','AIANF4','AIANF5','AIANF6','AIANF7','AIANF8','AIANF9'], "American_Indian_Alaskan_Native"),
#     **dict.fromkeys(['NHOPIM1','NHOPIM2','NHOPIM3','NHOPIM4','NHOPIM5','NHOPIM6','NHOPIM7','NHOPIM8','NHOPIM9','NHOPIF1','NHOPIF2','NHOPIF3','NHOPIF4','NHOPIF5','NHOPIF6','NHOPIF7','NHOPIF8','NHOPIF9'], "Native_Hawaiian_or_Pacific_Islander"),
#     **dict.fromkeys(['TOMRM1','TOMRM2','TOMRM3','TOMRM4','TOMRM5','TOMRM6','TOMRM7','TOMRM8','TOMRM9','TOMRF1','TOMRF2','TOMRF3','TOMRF4','TOMRF5','TOMRF6','TOMRF7','TOMRF8','TOMRF9'], "Two_or_more_races")
#                   }
# eeoc_nac2_All_unpivot_df['race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(race_dictionary)

# eeoc_nac2_All_unpivot_df.head()

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count,race
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,White
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,White
2,22,Utilities,WHM1,5883.0,White
3,23,Construction,WHM1,26632.0,White
4,31,Manufacturing,WHM1,15623.0,White


In [9]:
# 1-Senior OFF AND MGRS/WHITE/MALE
# 2-PROF/WHITE/MALE
# 3-TECH/WHITE/MALE
# 4-SALE/WHITE/MALE
# 5-CLERICALS/WHITE/MALE
# 6-CRAFT/WHITE/MALE
# 7-OPER/WHITE/MALE
# 8-LABORS/WHITE/MALE
# 9-Service/WHITE/MALE

# 'Executives'
# 'Managers'
# 'Professionals'
# 'Technicians'
# 'Sales workers'
# 'Administrative support'
# 'Craft workers'
# 'operatives'
# 'laborers and helpers'
# 'Service workers'


SyntaxError: invalid syntax (<ipython-input-9-ae2cb05f097a>, line 1)

In [20]:
eeoc_nac2_All_unpivot_df['job_category_coded'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[-1].replace({'1' : 'Senior OFF AND MGRS', '2' : 'PROF', '3' : 'TECH', '4' : 'SALE', '5' : 'CLERICALS', '6' : 'CRAFT', '7' : 'OPERS','8' : 'LABORS', '9' : 'Service'})

In [21]:
eeoc_nac2_All_unpivot_df['race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[:-2].replace({'WH' : 'White', 'BLK' : 'Black', 'HISP': 'Hispanic_or_Latino','ASIAN' : 'Asian','AIAN' : 'American_Indian_Alaskan_Native','NHOPI' : 'Native_Hawaiian_or_Pacific_Islander','TOMR' : 'Two_or_more_races'})
eeoc_nac2_All_unpivot_df.sample(5)

SyntaxError: invalid syntax (<ipython-input-21-42bb29e57f16>, line 1)

In [10]:
# KS still working on it
# job_category_dictionary = {
#     **dict.fromkeys(['WHF1','BLKF1','HISPF1','ASIANF1','AIANF1','NHOPIF1','TOMRF1'],'')

SyntaxError: unexpected EOF while parsing (<ipython-input-10-99a8cc1ef74e>, line 2)

In [11]:
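# NOTE: despite its name, this dictionary maps each code to a race label, so job_category below repeats the race column
job_category_dictionary = {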
    **dict.fromkeys(['WHM1','WHM2','WHM3','WHM4','WHM5','WHM6','WHM7','WHM8','WHM9','WHF1','WHF2','WHF3','WHF4','WHF5','WHF6','WHF7','WHF8','WHF9'], "White"),
    **dict.fromkeys(['BLKM1','BLKM2','BLKM3','BLKM4','BLKM5','BLKM6','BLKM7','BLKM8','BLKM9','BLKF1','BLKF2','BLKF3','BLKF4','BLKF5','BLKF6','BLKF7','BLKF8','BLKF9'], "Black_or_African_American"),
    **dict.fromkeys(['HISPM1','HISPM2','HISPM3','HISPM4','HISPM5','HISPM6','HISPM7','HISPM8','HISPM9','HISPF1','HISPF2','HISPF3','HISPF4','HISPF5','HISPF6','HISPF7','HISPF8','HISPF9'], "Hispanic_or_Latino"),
    **dict.fromkeys(['ASIANM1','ASIANM2','ASIANM3','ASIANM4','ASIANM5','ASIANM6','ASIANM7','ASIANM8','ASIANM9','ASIANF1','ASIANF2','ASIANF3','ASIANF4','ASIANF5','ASIANF6','ASIANF7','ASIANF8','ASIANF9'], "Asian"),
    **dict.fromkeys(['AIANM1','AIANM2','AIANM3','AIANM4','AIANM5','AIANM6','AIANM7','AIANM8','AIANM9','AIANF1','AIANF2','AIANF3','AIANF4','AIANF5','AIANF6','AIANF7','AIANF8','AIANF9'], "American_Indian_Alaskan_Native"),
    **dict.fromkeys(['NHOPIM1','NHOPIM2','NHOPIM3','NHOPIM4','NHOPIM5','NHOPIM6','NHOPIM7','NHOPIM8','NHOPIM9','NHOPIF1','NHOPIF2','NHOPIF3','NHOPIF4','NHOPIF5','NHOPIF6','NHOPIF7','NHOPIF8','NHOPIF9'], "Native_Hawaiian_or_Pacific_Islander"),
    **dict.fromkeys(['TOMRM1','TOMRM2','TOMRM3','TOMRM4','TOMRM5','TOMRM6','TOMRM7','TOMRM8','TOMRM9','TOMRF1','TOMRF2','TOMRF3','TOMRF4','TOMRF5','TOMRF6','TOMRF7','TOMRF8','TOMRF9'], "Two_or_more_races")
                  }
eeoc_nac2_All_unpivot_df['job_category'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(job_category_dictionary)

eeoc_nac2_All_unpivot_df.head()

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count,race,gender,job_category
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,White,Male,White
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,White,Male,White
2,22,Utilities,WHM1,5883.0,White,Male,White
3,23,Construction,WHM1,26632.0,White,Male,White
4,31,Manufacturing,WHM1,15623.0,White,Male,White


In [23]:
# A = "vhtdhd"
# print(A[-1])

d


In [12]:
eeoc_nac2_All_unpivot_df['job_category_coded'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[-1]
# Expand both gender letters; replacing only "M" would leave "F" unexpanded
eeoc_nac2_All_unpivot_df['gender_category_coded'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[-2].replace({"M": "Male", "F": "Female"})
eeoc_nac2_All_unpivot_df['race_category_coded'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[:-2]


eeoc_nac2_All_unpivot_df.sample(10)



Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count,race,gender,job_category,job_category_coded,gender_category_coded,race_category_coded
3010,48,Transportation and Warehousing,TOMRF9,2547.0,Two_or_more_races,Female,Two_or_more_races,9,Female,TOMR
1086,33,Manufacturing,HISPF1,634.0,Hispanic_or_Latino,Female,Hispanic_or_Latino,1,Female,HISP
137,56,Administrative and Support and Waste Managemen...,WHM6,56912.0,White,Male,White,6,Male,WH
2669,32,Manufacturing,TOMRM4,767.0,Two_or_more_races,Male,Two_or_more_races,4,Male,TOMR
550,81,Other Services (except Public Administration),BLKM5,4662.0,Black_or_African_American,Male,Black_or_African_American,5,Male,BLK
1103,92,Public Administration,HISPF1,19.0,Hispanic_or_Latino,Female,Hispanic_or_Latino,1,Female,HISP
2853,72,Accommodation and Food Services,TOMRF2,547.0,Two_or_more_races,Female,Two_or_more_races,2,Female,TOMR
1260,51,Information,HISPF8,1046.0,Hispanic_or_Latino,Female,Hispanic_or_Latino,8,Female,HISP
111,54,"Professional, Scientific, and Technical Services",WHM5,85180.0,White,Male,White,5,Male,WH
2823,54,"Professional, Scientific, and Technical Services",TOMRF1,442.0,Two_or_more_races,Female,Two_or_more_races,1,Female,TOMR


In [13]:
# d = regexdict({'^W':'White', '^H':'Hispanic'})
# eeoc_nac2_All_unpivot_df['job_category'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)

# eeoc_nac2_All_unpivot_df.head()

NameError: name 'regexdict' is not defined

In [307]:
# Create a race dictionary
race_dictionary = {'WHM1':'White','WHM2':'White','WHM3':'White','WHM4':'White','WHM5':'White','WHM6':'White','WHM7':'White','WHM8':'White','WHM9':'White','WHF1':'White','WHF2':'White','WHF3':'White','WHF4':'White','WHF5':'White','WHF6':'White','WHF7':'White','WHF8':'White','WHF9':'White','BLKM1':'Black_or_African_American','BLKM2':'Black_or_African_American','BLKM3':'Black_or_African_American','BLKM4':'Black_or_African_American','BLKM5':'Black_or_African_American','BLKM6':'Black_or_African_American','BLKM7':'Black_or_African_American','BLKM8':'Black_or_African_American','BLKM9':'Black_or_African_American','BLKF1':'Black_or_African_American','BLKF2':'Black_or_African_American','BLKF3':'Black_or_African_American','BLKF4':'Black_or_African_American','BLKF5':'Black_or_African_American','BLKF6':'Black_or_African_American','BLKF7':'Black_or_African_American','BLKF8':'Black_or_African_American','BLKF9':'Black_or_African_American'}
                  # ,'HISPM1','HISPM2','HISPM3','HISPM4','HISPM5','HISPM6','HISPM7','HISPM8','HISPM9','HISPF1','HISPF2','HISPF3','HISPF4','HISPF5','HISPF6','HISPF7','HISPF8','HISPF9','ASIANM1','ASIANM2','ASIANM3','ASIANM4','ASIANM5','ASIANM6','ASIANM7','ASIANM8','ASIANM9','ASIANF1','ASIANF2','ASIANF3','ASIANF4','ASIANF5','ASIANF6','ASIANF7','ASIANF8','ASIANF9','AIANM1','AIANM2','AIANM3','AIANM4','AIANM5','AIANM6','AIANM7','AIANM8','AIANM9','AIANF1','AIANF2','AIANF3','AIANF4','AIANF5','AIANF6','AIANF7','AIANF8','AIANF9','NHOPIM1','NHOPIM2','NHOPIM3','NHOPIM4','NHOPIM5','NHOPIM6','NHOPIM7','NHOPIM8','NHOPIM9','NHOPIF1','NHOPIF2','NHOPIF3','NHOPIF4','NHOPIF5','NHOPIF6','NHOPIF7','NHOPIF8','NHOPIF9','TOMRM1','TOMRM2','TOMRM3','TOMRM4','TOMRM5','TOMRM6','TOMRM7','TOMRM8','TOMRM9','TOMRF1','TOMRF2','TOMRF3','TOMRF4','TOMRF5','TOMRF6','TOMRF7','TOMRF8','TOMRF9'}
eeoc_nac2_All_unpivot_df['race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(race_dictionary)

eeoc_nac2_All_unpivot_df

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,Count,race
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,White
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,White
2,22,Utilities,WHM1,5883.0,White
3,23,Construction,WHM1,26632.0,White
4,31,Manufacturing,WHM1,15623.0,White
5,32,Manufacturing,WHM1,29496.0,White
6,33,Manufacturing,WHM1,72333.0,White
7,42,Wholesale Trade,WHM1,26030.0,White
8,44,Retail Trade,WHM1,20530.0,White
9,45,Retail Trade,WHM1,6785.0,White


In [None]:
# Create a gender dictionary
gender_dictionary = {'M':'Male','F':'Female'}
# The gender letter is the second-to-last character of each code
eeoc_nac2_All_unpivot_df['gender'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[-2].map(gender_dictionary)

eeoc_nac2_All_unpivot_df

In [319]:
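# Indexing the Series by its own values looks up labels such as 'WHM1' in the integer index; none exist, so every result below is NaN
eeoc_nac2_All_unpivot_df['EEOC_Code'][eeoc_nac2_All_unpivot_df['EEOC_Code']]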

EEOC_Code
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM1      NaN
WHM2      NaN
WHM2      NaN
WHM2      NaN
WHM2      NaN
WHM2      NaN
WHM2      NaN
         ... 
TOMRF8    NaN
TOMRF8    NaN
TOMRF8    NaN
TOMRF8    NaN
TOMRF8    NaN
TOMRF8    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
TOMRF9    NaN
Name: EEOC_Code, Length: 3024, dtype: object

In [329]:
###YES

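# A tuple key such as ('WHM1','WHM2') never equals a single code string, so .map() finds no match and leaves race empty
race_dictionary = {('WHM1','WHM2'): "White"}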
eeoc_nac2_All_unpivot_df['race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(race_dictionary)

eeoc_nac2_All_unpivot_df
    
#     'WHM1':'White','WHM2':'White','WHM3':'White','WHM4':'White','WHM5':'White','WHM6':'White','WHM7':'White','WHM8':'White','WHM9':'White','WHF1':'White','WHF2':'White','WHF3':'White','WHF4':'White','WHF5':'White','WHF6':'White','WHF7':'White','WHF8':'White','WHF9':'White','BLKM1':'Black_or_African_American','BLKM2':'Black_or_African_American','BLKM3':'Black_or_African_American','BLKM4':'Black_or_African_American','BLKM5':'Black_or_African_American','BLKM6':'Black_or_African_American','BLKM7':'Black_or_African_American','BLKM8':'Black_or_African_American','BLKM9':'Black_or_African_American','BLKF1':'Black_or_African_American','BLKF2':'Black_or_African_American','BLKF3':'Black_or_African_American','BLKF4':'Black_or_African_American','BLKF5':'Black_or_African_American','BLKF6':'Black_or_African_American','BLKF7':'Black_or_African_American','BLKF8':'Black_or_African_American','BLKF9':'Black_or_African_American'}
#                   # ,'HISPM1','HISPM2','HISPM3','HISPM4','HISPM5','HISPM6','HISPM7','HISPM8','HISPM9','HISPF1','HISPF2','HISPF3','HISPF4','HISPF5','HISPF6','HISPF7','HISPF8','HISPF9','ASIANM1','ASIANM2','ASIANM3','ASIANM4','ASIANM5','ASIANM6','ASIANM7','ASIANM8','ASIANM9','ASIANF1','ASIANF2','ASIANF3','ASIANF4','ASIANF5','ASIANF6','ASIANF7','ASIANF8','ASIANF9','AIANM1','AIANM2','AIANM3','AIANM4','AIANM5','AIANM6','AIANM7','AIANM8','AIANM9','AIANF1','AIANF2','AIANF3','AIANF4','AIANF5','AIANF6','AIANF7','AIANF8','AIANF9','NHOPIM1','NHOPIM2','NHOPIM3','NHOPIM4','NHOPIM5','NHOPIM6','NHOPIM7','NHOPIM8','NHOPIM9','NHOPIF1','NHOPIF2','NHOPIF3','NHOPIF4','NHOPIF5','NHOPIF6','NHOPIF7','NHOPIF8','NHOPIF9','TOMRM1','TOMRM2','TOMRM3','TOMRM4','TOMRM5','TOMRM6','TOMRM7','TOMRM8','TOMRM9','TOMRF1','TOMRF2','TOMRF3','TOMRF4','TOMRF5','TOMRF6','TOMRF7','TOMRF8','TOMRF9'}
# eeoc_nac2_All_unpivot_df['race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(race_dictionary)

# eeoc_nac2_All_unpivot_df

# d = {
#     ('John', 'Blue', 1): 100,
#     ('Bill', 'Green', 5): 200,
#     ('Paul', 'Blue', 4): 300,
#     ('Bill', 'Green', 7): 400
# }


Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,Count,race
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,
2,22,Utilities,WHM1,5883.0,
3,23,Construction,WHM1,26632.0,
4,31,Manufacturing,WHM1,15623.0,
5,32,Manufacturing,WHM1,29496.0,
6,33,Manufacturing,WHM1,72333.0,
7,42,Wholesale Trade,WHM1,26030.0,
8,44,Retail Trade,WHM1,20530.0,
9,45,Retail Trade,WHM1,6785.0,


In [324]:
# '*F*' is a glob pattern, not a valid regular expression, and re.search expects a single string.
# Read the gender letter (second-to-last character) directly instead.
gender_dictionary = {'M':'Male','F':'Female'}
eeoc_nac2_All_unpivot_df['gender'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str[-2].map(gender_dictionary)
eeoc_nac2_All_unpivot_df

In [None]:
import fnmatch

print(fnmatch.fnmatch('V34', 'V*'))  # True
rank_dict = {'V*': 1, 'A*': 2, 'V': 3,'A': 4}
checker = 'V30'
for k, v in rank_dict.items():
    if fnmatch.fnmatch(checker, k):
        print(v)  # prints 1

In [236]:
searchrace = []    
for values in eeoc_nac2_All_unpivot_df['EEOC_Code']:
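    # NOTE: the square brackets make this a character class, so each match is a single character (hence the 'W' values below); 'NOHOPI' is also a misspelling of the 'NHOPI' prefix
    searchrace.append(re.search("[WH|TOMR|BLK|ASIAN|HISP|AIAN|NOHOPI]", values).group())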

eeoc_nac2_All_unpivot_df['Race'] = searchrace
eeoc_nac2_All_unpivot_df.head(3000)


Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,Count,Race,Gender
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,W,W
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,W,W
2,22,Utilities,WHM1,5883.0,W,W
3,23,Construction,WHM1,26632.0,W,W
4,31,Manufacturing,WHM1,15623.0,W,W
5,32,Manufacturing,WHM1,29496.0,W,W
6,33,Manufacturing,WHM1,72333.0,W,W
7,42,Wholesale Trade,WHM1,26030.0,W,W
8,44,Retail Trade,WHM1,20530.0,W,W
9,45,Retail Trade,WHM1,6785.0,W,W
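The single-letter `W` values in the `Race` column above follow from how the pattern is written. The sketch below contrasts a character class with a group of alternatives on one sample code; the variable names are illustrative.

```python
import re

code = "TOMRF9"

# Square brackets define a character class: the pattern matches ONE character from the set.
single_char = re.search(r"[WH|TOMR|BLK|ASIAN|HISP|AIAN|NHOPI]", code).group()

# Parentheses with | define alternatives: the pattern matches a whole prefix.
prefix = re.match(r"(WH|TOMR|BLK|ASIAN|HISP|AIAN|NHOPI)", code).group()

print(single_char)  # T
print(prefix)       # TOMR
```

Using `re.match` rather than `re.search` also pins the alternatives to the start of the code, so a prefix cannot be picked up from the middle of the string.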


In [248]:
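searchgender = []    
for values in eeoc_nac2_All_unpivot_df['EEOC_Code']:
    # Anchor to the letter just before the trailing digit; a bare (F|M) would match the "M" in "TOMR" first
    searchgender.append(re.search(r"(F|M)(?=\d$)", values).group())

eeoc_nac2_All_unpivot_df['Gender'] = searchgender
# df['color'] = np.where(df['Set']=='Z', 'green', 'red')  # pasted reference snippet; df is not defined here

eeoc_nac2_All_unpivot_df.head(3000) #[F-M][^1^2^3^4^5^6^7^8^9]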

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,Count,Race,Gender,Job_Category
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,W,M,1
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,W,M,1
2,22,Utilities,WHM1,5883.0,W,M,1
3,23,Construction,WHM1,26632.0,W,M,1
4,31,Manufacturing,WHM1,15623.0,W,M,1
5,32,Manufacturing,WHM1,29496.0,W,M,1
6,33,Manufacturing,WHM1,72333.0,W,M,1
7,42,Wholesale Trade,WHM1,26030.0,W,M,1
8,44,Retail Trade,WHM1,20530.0,W,M,1
9,45,Retail Trade,WHM1,6785.0,W,M,1


In [301]:
# Take the gender letter that sits just before the trailing job-category digit
eeoc_nac2_All_unpivot_df['Gender'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].str.extract(r'([MF])\d$', expand=False)

# eeoc_nac2_All_unpivot_df['EEOC_Code'].str.split("[M|F]", n = 1, expand = True)[1] 
eeoc_nac2_All_unpivot_df

# df_users['EMAIL'].replace('@.*$', '@newcompany.com', inplace=True, regex=True)


# np.where(eeoc_nac2_All_unpivot_df['EEOC_Code'].isin(
#     [re.search(".F",eeoc_nac2_All_unpivot_df['EEOC_Code'])]), 'Female', 'Male')

# eeoc_nac2_All_unpivot_df


# new = data["Name"].str.split(" ", n = 1, expand = True) 

# # making separate first name column from new data frame 
# data["First Name"]= new[0] 

# # making separate last name column from new data frame 
# data["Last Name"]= new[1] 

# # Dropping old Name columns 
# data.drop(columns =["Name"], inplace = True)

In [300]:
eeoc_nac2_All_unpivot_df['Gender']=re.search("/d",  eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code']])
eeoc_nac2_All_unpivot_df

# for line in re.findall("WH*", eeoc_nac2_All_unpivot_df['EEOC_Code']):
#     print(line)
    
#     m = re.search('(?<=abc)def', 'abcdef')
    

KeyError: "['WHM1' 'WHM1' 'WHM1' ... 'TOMRF9' 'TOMRF9' 'TOMRF9'] not in index"

In [292]:
eeoc_nac2_All_unpivot_df['Job_Category'] =re.contains('M', eeoc_nac2_All_unpivot_df['EEOC_Code'])
#np.where(eeoc_nac2_All_unpivot_df['EEOC_Code'].isin(['WHM1','TOMRF1','WHM2']), 'Executive', 'meh')
#eeoc_nac2_All_unpivot_df

# eeoc_nac2_All_unpivot_df['Job_Category']=np.where(searchjob.append(re.search("\d", 
#                             eeoc_nac2_All_unpivot_df['EEOC_Code']),"Executives","meh")
   # else: "meh"

#df['color'] = np.where(df['Set']=='Z', 'green', 'red')
# searchjob = []    
# for values in eeoc_nac2_All_unpivot_df['EEOC_Code']:
#     if searchjob.append(re.search("\d", values).group())==1: "Executives"
#     else: "meh"

# eeoc_nac2_All_unpivot_df['Job_Category'] = searchjob
# eeoc_nac2_All_unpivot_df.head(3000) 


# # if b > a:
# #   print("b is greater than a")
# # elif a == b:
# #   print("a and b are equal")

AttributeError: module 're' has no attribute 'contains'

In [None]:
m = re.search("(vi.*)", value)

In [56]:
# As a regex, 'WH*' means "W followed by zero or more H"; startswith states the intended prefix test directly
eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.startswith('WH')]
#eeoc_nac2_All_unpivot_df['race']

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0
2,22,Utilities,WHM1,5883.0
3,23,Construction,WHM1,26632.0
4,31,Manufacturing,WHM1,15623.0
5,32,Manufacturing,WHM1,29496.0
6,33,Manufacturing,WHM1,72333.0
7,42,Wholesale Trade,WHM1,26030.0
8,44,Retail Trade,WHM1,20530.0
9,45,Retail Trade,WHM1,6785.0


In [None]:
df['raw'].str.contains('....-..-..', regex=True)

In [66]:
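# fnmatch is a module, so it cannot be called directly; test each code with fnmatch.fnmatch(name, pattern)
race_patterns = {'W*':'White'}
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(lambda code: next((label for pattern, label in race_patterns.items() if fnmatch.fnmatch(code, pattern)), None))
eeoc_nac2_All_unpivot_df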

In [45]:
# Create the dictionary
# Build one entry per code (prefix + M/F + 1-9) instead of typing all 126 keys by hand
race_prefixes = {'WH':'White', 'BLK':'Black_or_African_American', 'HISP':'Hispanic_or_Latino', 'ASIAN':'Asian', 'AIAN':'American_Indian_Alaskan_Native', 'NHOPI':'Native_Hawaiian_or_Pacific_Islander', 'TOMR':'Two_or_more_races'}
race_dictionary = {f'{prefix}{gender}{job}': label for prefix, label in race_prefixes.items() for gender in 'MF' for job in range(1, 10)}
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(race_dictionary)

eeoc_nac2_All_unpivot_df

Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count,Race
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,White
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,White
2,22,Utilities,WHM1,5883.0,White
3,23,Construction,WHM1,26632.0,White
4,31,Manufacturing,WHM1,15623.0,White
5,32,Manufacturing,WHM1,29496.0,White
6,33,Manufacturing,WHM1,72333.0,White
7,42,Wholesale Trade,WHM1,26030.0,White
8,44,Retail Trade,WHM1,20530.0,White
9,45,Retail Trade,WHM1,6785.0,White


In [None]:
race_dictionary = {'WH':'White', 'BLK':'Black_or_African_American',
                   'HISP':'Hispanic_or_Latino', 'ASIAN':'Asian',
                  'AIAN':'American_Indian_Alaskan_Native',
                  'NHOPI': 'Native_Hawaiian_or_Pacific_Islander',
                  'TOMR': 'Two_or_more_races'}

In [None]:
eeoc_nac2_All_clean_df['race']

In [None]:

eeoc_nac2_All_unpivot_df['race'] = 'other'
eeoc_nac2_All_unpivot_df.loc[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.startswith('WH'), 'race'] = 'WH'
eeoc_nac2_All_unpivot_df.loc[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.startswith('BLK'), 'race'] = 'BLK'


def func(row):
    # Row-wise equivalent, for use with DataFrame.apply(func, axis=1)
    if row['EEOC_Code'].startswith('WH'):
        return 'WH'
    elif row['EEOC_Code'].startswith('BLK'):
        return 'BLK'
    else:
        return 'other'


In [11]:
#nac2 label for company category
# .copy() makes each filtered result an independent DataFrame, so adding a column no longer raises SettingWithCopyWarning.
# Each reassignment still discards the previous filter and its new column; only the last one survives.
eeoc_nac2_All_clean_df = eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.contains('WH')].copy()
eeoc_nac2_All_clean_df['race'] = 'WH'
# eeoc_nac2_All_clean_df = eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.contains('BLK')]
# eeoc_nac2_All_clean_df['race'] = 'BLK'
# eeoc_nac2_All_clean_df = eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.contains('HISP')]
# eeoc_nac2_All_clean_df['race'] = 'HISP'
# Match "M" only when it precedes the final digit; a plain 'M' also matches the TOMR prefix
eeoc_nac2_All_clean_df = eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.contains(r'M\d$')].copy()
eeoc_nac2_All_clean_df['gender'] = 'M'
eeoc_nac2_All_clean_df = eeoc_nac2_All_unpivot_df[eeoc_nac2_All_unpivot_df['EEOC_Code'].str.contains('1')].copy()
eeoc_nac2_All_clean_df['job_category'] = 'Executives'
eeoc_nac2_All_clean_df.head()


Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count,job_category
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311.0,Executives
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039.0,Executives
2,22,Utilities,WHM1,5883.0,Executives
3,23,Construction,WHM1,26632.0,Executives
4,31,Manufacturing,WHM1,15623.0,Executives
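Only `job_category` appears in the output above because each filter is assigned to the same variable, so the last one wins. One alternative, sketched here under the assumption that every code ends in a gender letter followed by a job digit, derives all three parts on the full table in one pass with `str.extract`. The toy DataFrame and the column names `race_code`, `gender_code`, and `job_code` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy narrow table; counts are copied from the first industry row of the source data.
codes_df = pd.DataFrame({
    "EEOC_Code": ["WHM1", "WHM2", "TOMRF9"],
    "count": [3311.0, 7254.0, 35.0],
})

# Named groups become column names. No rows are filtered out,
# so nothing is assigned to a slice of another DataFrame.
parts = codes_df["EEOC_Code"].str.extract(
    r"^(?P<race_code>[A-Z]+)(?P<gender_code>[MF])(?P<job_code>\d)$"
)
labeled_df = codes_df.join(parts)
print(labeled_df)
```

The extracted codes can then be passed through dictionaries such as `{'M': 'Male', 'F': 'Female'}` with `.map()` to get readable labels.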


In [None]:
#WH

In [77]:
eeoc_nac2_WF_df = eeoc_nac2_df[['NAC2_code','NAC2_Label','WHM1','WHM2','WHM3','WHM4','WHM5','WHM6','WHM7','WHM8','WHM9']]
eeoc_nac2_WF_df.head()

Unnamed: 0,NAC2_code,NAC2_Label,WHM1,WHM2,WHM3,WHM4,WHM5,WHM6,WHM7,WHM8,WHM9
0,11,"Agriculture, Forestry, Fishing and Hunting",3311,7254,2369,4740,2561,5636,12178,15711,1552
1,21,"Mining, Quarrying, and Oil and Gas Extraction",5039,37459,14647,2088,3999,58589,43730,15476,868
2,22,Utilities,5883,64738,25510,1645,9568,87506,21627,6961,2600
3,23,Construction,26632,83888,37250,21458,24350,393797,91062,114587,6759
4,31,Manufacturing,15623,39393,18702,39607,17628,83772,174938,80811,13006


In [126]:
eeoc_nac2_WF_unpivot_df=pd.melt(eeoc_nac2_WF_df, id_vars=['NAC2_code','NAC2_Label'], var_name="EEOC_Code", value_name="count")
eeoc_nac2_WF_unpivot_df.head(10)


Unnamed: 0,NAC2_code,NAC2_Label,EEOC_Code,count
0,11,"Agriculture, Forestry, Fishing and Hunting",WHM1,3311
1,21,"Mining, Quarrying, and Oil and Gas Extraction",WHM1,5039
2,22,Utilities,WHM1,5883
3,23,Construction,WHM1,26632
4,31,Manufacturing,WHM1,15623
5,32,Manufacturing,WHM1,29496
6,33,Manufacturing,WHM1,72333
7,42,Wholesale Trade,WHM1,26030
8,44,Retail Trade,WHM1,20530
9,45,Retail Trade,WHM1,6785
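
A quick way to confirm that `pd.melt` reshaped the data without losing anything is to compare row counts and totals before and after. The sketch below uses a two-row stand-in built from values shown above; the variable names are hypothetical.

```python
import pandas as pd

# Two rows and two count columns copied from the output above.
wide_df = pd.DataFrame({
    "NAC2_code": [11, 21],
    "NAC2_Label": [
        "Agriculture, Forestry, Fishing and Hunting",
        "Mining, Quarrying, and Oil and Gas Extraction",
    ],
    "WHM1": [3311, 5039],
    "WHM2": [7254, 37459],
})

id_columns = ["NAC2_code", "NAC2_Label"]
value_columns = [c for c in wide_df.columns if c not in id_columns]

long_df = pd.melt(
    wide_df,
    id_vars=id_columns,
    value_vars=value_columns,
    var_name="EEOC_Code",
    value_name="count",
)

# One output row per (input row, count column) pair, and the grand total is unchanged.
rows_match = len(long_df) == len(wide_df) * len(value_columns)
totals_match = long_df["count"].sum() == wide_df[value_columns].to_numpy().sum()
```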


In [10]:
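# Requires eeoc_nac2_WF_unpivot_df from the pd.melt cell above; running this
# cell first in a fresh kernel raises the NameError shown below.
# The three conditions are combined into one mask so that all three new columns
# end up on the same DataFrame; .copy() avoids SettingWithCopyWarning.
codes = eeoc_nac2_WF_unpivot_df['EEOC_Code']
mask = codes.str.contains('WH') & codes.str.contains('M') & codes.str.contains('1')
white = eeoc_nac2_WF_unpivot_df[mask].copy()
white['race'] = 'WH'
white['gender'] = 'M'
white['job_category'] = 'Executives'
white.head(1000)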

NameError: name 'eeoc_nac2_WF_unpivot_df' is not defined

In [37]:
csv_file = "./Resources/customer_data.csv"
customer_data_df = pd.read_csv(csv_file)
customer_data_df.head()

Unnamed: 0,id,first_name,last_name,email,gender,car
0,1,Benetta,Cancott,bcancott0@studiopress.com,Female,Scion
1,2,Lilyan,Cherry,lcherry1@deliciousdays.com,Female,Chrysler
2,3,Ezekiel,Benasik,ebenasik2@wikia.com,Male,Mercedes-Benz
3,4,Kennedy,Atlay,katlay3@so-net.ne.jp,Male,Buick
4,5,Sanford,Salmen,ssalmen4@reuters.com,Male,Lincoln


In [9]:
# Create a new DataFrame with only the selected columns

In [10]:
new_customer_data_df = customer_data_df[['id', 'first_name', 'last_name']].copy()
new_customer_data_df.head()

Unnamed: 0,id,first_name,last_name
0,1,Benetta,Cancott
1,2,Lilyan,Cherry
2,3,Ezekiel,Benasik
3,4,Kennedy,Atlay
4,5,Sanford,Salmen
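
If the extra columns are never needed, the selection could also happen at read time. The sketch below passes `usecols` to `pd.read_csv`; it reads from an in-memory string holding two rows from the output above, rather than from `./Resources/customer_data.csv`.

```python
import io

import pandas as pd

# Two rows copied from the CSV preview above, held in memory for the example.
csv_text = """id,first_name,last_name,email,gender,car
1,Benetta,Cancott,bcancott0@studiopress.com,Female,Scion
2,Lilyan,Cherry,lcherry1@deliciousdays.com,Female,Chrysler
"""

# usecols tells the parser to keep only the named columns.
selected_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "first_name", "last_name"],
)
```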


In [11]:
# Store JSON data into a DataFrame

In [6]:
json_file = "./Resources/customer_location.json"
customer_location_df = pd.read_json(json_file)
customer_location_df.head()

Unnamed: 0,address,id,latitude,longitude,us_state
0,043 Mockingbird Place,1,39.1682,-86.5186,Indiana
1,4 Prentice Point,2,41.0938,-85.0707,Indiana
2,46 Derek Junction,3,32.7673,-96.7776,Texas
3,11966 Old Shore Place,4,39.035,-94.3567,Missouri
4,5 Evergreen Circle,5,40.7808,-73.9772,New York


In [None]:
# Clean DataFrame

In [17]:
new_customer_location_df = customer_location_df[["id", "address", "us_state"]].copy()
new_customer_location_df.head()

Unnamed: 0,id,address,us_state
0,1,043 Mockingbird Place,Indiana
1,2,4 Prentice Point,Indiana
2,3,46 Derek Junction,Texas
3,4,11966 Old Shore Place,Missouri
4,5,5 Evergreen Circle,New York
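
Before loading, it may be worth checking that the cleaned frame has no duplicate keys or missing values. The helper below is a hypothetical sketch; `check_ready_to_load` is not part of pandas, and the sample rows come from the output above.

```python
import pandas as pd

location_df = pd.DataFrame({
    "id": [1, 2, 3],
    "address": ["043 Mockingbird Place", "4 Prentice Point", "46 Derek Junction"],
    "us_state": ["Indiana", "Indiana", "Texas"],
})


def check_ready_to_load(df, key):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df[key].duplicated().any():
        problems.append(f"duplicate values in '{key}'")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems


problems = check_ready_to_load(location_df, "id")

# A deliberately broken frame, to show what the helper reports.
broken_df = pd.DataFrame({"id": [1, 1], "address": ["a", None]})
broken_problems = check_ready_to_load(broken_df, "id")
```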


In [18]:
# Connect to local database

In [25]:
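from sqlalchemy import create_engine
# NOTE: the credentials are hard-coded for this local demo; keep them out of version control
rds_connection_string = "root:mysqlp5ssw0rd*@127.0.0.1/customer_db"
engine = create_engine(f'mysql://{rds_connection_string}')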
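
One way to keep the password out of the notebook is to read it from environment variables. The sketch below is a minimal example under that assumption; the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, and `DB_NAME` and the helper `build_connection_string` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble 'user:password@host/database' from environment-style settings."""
    user = env.get("DB_USER", "root")
    # URL-encode the password so characters such as '@' cannot break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "127.0.0.1")
    database = env.get("DB_NAME", "customer_db")
    return f"{user}:{password}@{host}/{database}"


# Demonstration with a plain dict standing in for the real environment.
demo_connection_string = build_connection_string({"DB_PASSWORD": "p@ss"})
```

The result could then be passed to `create_engine(f"mysql://{build_connection_string()}")` in place of the hard-coded string.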

In [26]:
# Check for tables

In [27]:
from sqlalchemy import inspect  # Engine.table_names() was removed in SQLAlchemy 2.0
inspect(engine).get_table_names()

['customer_location', 'customer_name']

In [None]:
# Use pandas to load the CSV-converted DataFrame into the database

In [28]:
new_customer_data_df.to_sql(name='customer_name', con=engine, if_exists='append', index=False)

In [29]:
# Use pandas to load the JSON-converted DataFrame into the database

In [30]:
new_customer_location_df.to_sql(name='customer_location', con=engine, if_exists='append', index=False)
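
Because both loads use `if_exists='append'`, running those cells a second time would add the same rows again, or fail if the MySQL table defines a primary key (the table definitions are not shown here). The sketch below illustrates the difference between `'append'` and `'replace'` using an in-memory SQLite database as a stand-in for MySQL; `row_count` is a hypothetical helper.

```python
import sqlite3

import pandas as pd

names_df = pd.DataFrame({
    "id": [1, 2],
    "first_name": ["Benetta", "Lilyan"],
    "last_name": ["Cancott", "Cherry"],
})

# In-memory SQLite stands in for the MySQL database in this example.
conn = sqlite3.connect(":memory:")


def row_count(connection, table):
    """Count the rows currently stored in a table."""
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Appending twice stores every row twice when the table has no key constraint.
names_df.to_sql(name="customer_name", con=conn, if_exists="append", index=False)
names_df.to_sql(name="customer_name", con=conn, if_exists="append", index=False)
count_after_two_appends = row_count(conn, "customer_name")

# 'replace' drops the table and recreates it from the DataFrame.
names_df.to_sql(name="customer_name", con=conn, if_exists="replace", index=False)
count_after_replace = row_count(conn, "customer_name")
```

Note that `'replace'` recreates the table from the DataFrame's inferred types, so any constraints defined on the original table would be lost.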

In [31]:
# Confirm data has been added by querying the customer_name table
# NOTE: can also check using MySQL Workbench (pgAdmin is for PostgreSQL)

In [32]:
pd.read_sql_query('select * from customer_name', con=engine).head()

Unnamed: 0,id,first_name,last_name
0,1,Benetta,Cancott
1,2,Lilyan,Cherry
2,3,Ezekiel,Benasik
3,4,Kennedy,Atlay
4,5,Sanford,Salmen


In [33]:
# Confirm data has been added by querying the customer_location table

In [34]:
pd.read_sql_query('select * from customer_location', con=engine).head()

Unnamed: 0,id,address,us_state
0,1,043 Mockingbird Place,Indiana
1,2,4 Prentice Point,Indiana
2,3,46 Derek Junction,Texas
3,4,11966 Old Shore Place,Missouri
4,5,5 Evergreen Circle,New York
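
Since both tables share the `id` column, a join is one more way to confirm that the two loads line up. The sketch below runs such a query against an in-memory SQLite database standing in for MySQL, using three rows from the outputs above.

```python
import sqlite3

import pandas as pd

names_df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Benetta", "Lilyan", "Ezekiel"],
    "last_name": ["Cancott", "Cherry", "Benasik"],
})
locations_df = pd.DataFrame({
    "id": [1, 2, 3],
    "address": ["043 Mockingbird Place", "4 Prentice Point", "46 Derek Junction"],
    "us_state": ["Indiana", "Indiana", "Texas"],
})

# In-memory SQLite stands in for the MySQL database in this example.
conn = sqlite3.connect(":memory:")
names_df.to_sql("customer_name", conn, index=False)
locations_df.to_sql("customer_location", conn, index=False)

query = """
    SELECT n.id, n.first_name, n.last_name, l.us_state
    FROM customer_name AS n
    JOIN customer_location AS l ON n.id = l.id
    ORDER BY n.id
"""
joined_df = pd.read_sql_query(query, conn)

# Every customer should have exactly one matching location row.
all_matched = len(joined_df) == len(names_df)
```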
