# Annual Business Survey 2019
## Group Assessment - Module 08

  *****

## ETL Report
#### Introduction

- In this portion, we will discuss the problems or questions that we are trying to solve or answer. 
    We will identify the sources of our data and describe why the data needs to be transformed.
    
> The United States Census Bureau provides access to an extensive database with hundreds of datasets that detail national <br>
  characteristics. The data is stored in a way that can be quite confusing to many people, but there are many resources to <br>
  assist interested researchers with gleaning new insights from the data.
  
> For this study, we looked specifically at the 2019 Annual Business Survey, which reports data for reference year 2018. This survey provides<br>
  information on a variety of characteristics for businesses and business owners.
  
> **These are the questions that we are interested in answering:**<br>
>> 1. When considering the Finance and Insurance Industries as well as the Education Industry, what are the reasons that an owner <br>
      was motivated to create the business?
      
>> 2. When considering the highest average pay for each state, are there any noticeable trends when accounting for the Industry, <br>
      the Race of the Owner, or the Gender of the Owner?
      
>> 3. Does the rate of utilizing technology solutions have any correlation with family ownership status?

>> 4. What industries are most and least likely to utilize tech solutions in their business?
    
>> Other Questions

> As mentioned previously, the data is stored in a way that can be quite confusing, and it is necessary to perform a variety of<br>
  transformations on it before it can be utilized. Keep in mind that there are multiple series of "total" categories and<br>
  multiple variables stored in a single column, and that each call to the API must request every needed variable, along with its<br>
  supporting variables, in order to return complete and accurate results.
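
To illustrate why the "total" categories matter, here is a minimal sketch with made-up firm counts (the numbers and the tiny table are assumptions for illustration, not survey results). It shows how summing a column that mixes "Total" rows with detail rows counts every firm twice.

```python
import pandas as pd

# Hypothetical firm counts for one state and one industry; the numbers are made up.
firms = pd.DataFrame({
    "gender": ["Total", "Male", "Female", "Equally male/female"],
    "firmpdemp": [100, 55, 30, 15],
})

# Summing every row counts each firm twice, because "Total" already
# contains the three detail categories.
naive_sum = firms["firmpdemp"].sum()

# Keep only the detail rows before aggregating.
detail = firms[firms["gender"] != "Total"]
detail_sum = detail["firmpdemp"].sum()

print(naive_sum, detail_sum)
```

The same idea applies to any column that carries its own "Total" or "All firms" category: filter to either the total rows or the detail rows before aggregating, never both.
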
#### Data Sources

- In this portion we will be more specific about where we found the data, when we accessed it, and we will cite any sources.


>US Census Bureau. (2021b, October 14). *Annual Business Survey (ABS) APIs.* Census.Gov.<br>
>Retrieved April 22, 2022, from https://www.census.gov/data/developers/data-sets/abs.2019.html

#### Extraction

- In this portion we will show how we got our data, show aspects of the data consumption process and indicate the order of the process.

In [1]:
import pandas as pd
from utils import explain # Helper for clarifying variable information (for questions about this, type "help(explain)" in a cell)

tech_variables = 'https://api.census.gov/data/2018/abstcb/variables.html'
owner_variables = 'https://api.census.gov/data/2018/abscbo/variables.html'
characteristics_variables = 'https://api.census.gov/data/2018/abscb/variables.html'
company_summary_variables = 'https://api.census.gov/data/2018/abscs/variables.html'

### Get Tables of Variables

In [2]:
tech_vars = pd.read_html(tech_variables)[0]
owner_vars = pd.read_html(owner_variables)[0]
characteristic_vars = pd.read_html(characteristics_variables)[0]
company_summary_vars = pd.read_html(company_summary_variables)[0]

# Keep only the first two columns (Name and Label); .copy() avoids chained-assignment warnings on the renames below
tech_vars = tech_vars.iloc[:, :2].copy()
owner_vars = owner_vars.iloc[:, :2].copy()
characteristic_vars = characteristic_vars.iloc[:, :2].copy()
company_summary_vars = company_summary_vars.iloc[:, :2].copy()

tech_vars.rename(columns = {'Label': 'Tech Labels'}, inplace = True)
owner_vars.rename(columns = {'Label': 'Owner Labels'}, inplace = True)
characteristic_vars.rename(columns = {'Label': 'Characteristic Labels'}, inplace = True)
company_summary_vars.rename(columns = {'Label': 'Company Summary Labels'}, inplace = True)

In [3]:
grouped_tables = pd.merge(tech_vars, owner_vars, on = 'Name', how = 'outer')
grouped_tables = pd.merge(grouped_tables, characteristic_vars, on = 'Name', how = 'outer')
grouped_tables = pd.merge(grouped_tables, company_summary_vars, on = 'Name', how = 'outer')

grouped_tables.fillna("-", inplace = True)

grouped_tables = grouped_tables[(~grouped_tables['Tech Labels'].str.contains('Standard error|standard error'))]
grouped_tables = grouped_tables[(~grouped_tables['Owner Labels'].str.contains('Standard error|standard error'))]
grouped_tables = grouped_tables[(~grouped_tables['Characteristic Labels'].str.contains('Standard error|standard error'))]
grouped_tables = grouped_tables[(~grouped_tables['Company Summary Labels'].str.contains('Standard error|standard error'))]

grouped_tables.reset_index(drop = True, inplace = True)
grouped_tables = grouped_tables[:-2].sort_values(by = 'Name')
grouped_tables = grouped_tables[(~grouped_tables['Name'].str.contains('variables'))].reset_index(drop = True)

# Build a Common_Vars column that maps equivalent variable names to one shared name.
# For example, OWNER_ETH and ETH_GROUP both describe ethnicity, so both map to ETH_GROUP.
def clean_var_names(var_name):
    if 'OWNER_' in var_name:
        var_name = var_name.replace('OWNER_',"")
    elif ('OWN' in var_name and 'CHAR' not in var_name and 'PDEMP' not in var_name):
        var_name = var_name.replace('OWN',"")
    
    if var_name in ['ETH','RACE','SEX','VET']:
        var_name = var_name + "_GROUP"
        
    return var_name

grouped_tables['Common_Vars'] = grouped_tables['Name'].apply(clean_var_names)
grouped_tables = grouped_tables[['Name','Common_Vars','Tech Labels','Owner Labels','Characteristic Labels','Company Summary Labels']]
grouped_tables.sort_values(by = 'Common_Vars', inplace = True)
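
As a quick check of the renaming rules, the sketch below repeats the `clean_var_names` helper from the cell above so that it runs on its own, then applies it to a handful of variable names that appear in the tables.

```python
def clean_var_names(var_name):
    # Same logic as the notebook's helper, repeated so the example is self-contained.
    if 'OWNER_' in var_name:
        var_name = var_name.replace('OWNER_', "")
    elif 'OWN' in var_name and 'CHAR' not in var_name and 'PDEMP' not in var_name:
        var_name = var_name.replace('OWN', "")

    if var_name in ['ETH', 'RACE', 'SEX', 'VET']:
        var_name = var_name + "_GROUP"

    return var_name

examples = ['OWNER_ETH', 'OWNER_SEX', 'SEX', 'RACE_GROUP', 'OWNPDEMP', 'OWNCHAR']
mapping = {name: clean_var_names(name) for name in examples}

for name, common in mapping.items():
    print(f'{name:12} -> {common}')
```

`OWNER_ETH` and `OWNER_SEX` collapse onto the `_GROUP` names used by the other tables, while `OWNPDEMP` and `OWNCHAR` are left alone because of the `PDEMP` and `CHAR` exclusions.
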

##### Define which variables are common and could be used to merge.

In [4]:
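# Demographic variables, plus any variable that has a label in every table (no "-" placeholder in the row)
is_demographic = grouped_tables['Common_Vars'].str.contains("SEX|RACE|ETH|VET")
in_every_table = grouped_tables.apply(lambda x: not x.str.contains("-").any(), axis = 1)
possible_merge_options = grouped_tables[is_demographic | in_every_table]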
common_vars = [_.lower() for _ in possible_merge_options.Common_Vars.unique().tolist()]

possible_merge_options

Unnamed: 0,Name,Common_Vars,Tech Labels,Owner Labels,Characteristic Labels,Company Summary Labels
22,OWNER_ETH,ETH_GROUP,-,Ethnicity code,-,-
5,ETH_GROUP,ETH_GROUP,Ethnicity code,-,Ethnicity code,Ethnicity code
10,GEOCOMP,GEOCOMP,GEO_ID Component,GEO_ID Component,GEO_ID Component,GEO_ID Component
11,GEO_ID,GEO_ID,Geographic identifier code,Geographic identifier code,Geographic identifier code,Geographic identifier code
18,NAICS2017,NAICS2017,2017 NAICS code,2017 NAICS code,2017 NAICS code,2017 NAICS code
19,NATION,NATION,Geography,Geography,Geography,Geography
23,OWNER_RACE,RACE_GROUP,-,Race code,-,-
31,RACE_GROUP,RACE_GROUP,Race code,-,Race code,Race code
24,OWNER_SEX,SEX_GROUP,-,Sex code,-,-
35,SEX,SEX_GROUP,Sex code,-,Sex code,Sex code


In [5]:
print("\n------ Displaying top 10 rows of variable tables. ------\n")
grouped_tables.head(10)


------ Displaying top 10 rows of variable tables. ------



Unnamed: 0,Name,Common_Vars,Tech Labels,Owner Labels,Characteristic Labels,Company Summary Labels
0,BUSCHAR,BUSCHAR,-,-,Business characteristic code,-
1,CBSA,CBSA,-,Geography,Geography,Geography
2,EMP,EMP,Number of employees,-,Number of employees,Number of employees
3,EMPSZFI,EMPSZFI,-,-,Employment size of firms code,Employment size of firms code
4,EMP_PCT,EMP_PCT,Percent of employees (%),-,Percent of employees (%),-
22,OWNER_ETH,ETH_GROUP,-,Ethnicity code,-,-
5,ETH_GROUP,ETH_GROUP,Ethnicity code,-,Ethnicity code,Ethnicity code
6,FACTORS_P,FACTORS_P,Factors adversely affecting technology product...,-,-,-
7,FACTORS_U,FACTORS_U,Factors adversely affecting technology use code,-,-,-
8,FIRMPDEMP,FIRMPDEMP,Number of employer firms,-,Number of employer firms,Number of employer firms


In [22]:
grouped_tables

Unnamed: 0,Name,Common_Vars,Tech Labels,Owner Labels,Characteristic Labels,Company Summary Labels
0,BUSCHAR,BUSCHAR,-,-,Business characteristic code,-
1,CBSA,CBSA,-,Geography,Geography,Geography
2,EMP,EMP,Number of employees,-,Number of employees,Number of employees
3,EMPSZFI,EMPSZFI,-,-,Employment size of firms code,Employment size of firms code
4,EMP_PCT,EMP_PCT,Percent of employees (%),-,Percent of employees (%),-
22,OWNER_ETH,ETH_GROUP,-,Ethnicity code,-,-
5,ETH_GROUP,ETH_GROUP,Ethnicity code,-,Ethnicity code,Ethnicity code
6,FACTORS_P,FACTORS_P,Factors adversely affecting technology product...,-,-,-
7,FACTORS_U,FACTORS_U,Factors adversely affecting technology use code,-,-,-
8,FIRMPDEMP,FIRMPDEMP,Number of employer firms,-,Number of employer firms,Number of employer firms


In [68]:
grouped_tables.iloc[31]['Tech Labels']

'Percent of sales, value of shipments, or revenue of employer firms (%)'

##### Define which variables will be requested from API

In [52]:
vars_of_interest = [
    'NAICS2017',
    'YIBSZFI',
    'SEX',
    'QDESC',
    'NSFSZFI',
    'GEO_ID',
    'RACE_GROUP',
    'BUSCHAR',
    'OWNER_RACE',
    'OWNER_SEX',
    'OWNPDEMP',
    'FIRMPDEMP',
    'OWNCHAR',
    'TECHUSE',
    'RCPPDEMP',
    'PAYANN',
    'EMP'
]
target_subset = grouped_tables[(grouped_tables['Name'].isin(vars_of_interest))]
target_subset = target_subset[['Name','Company Summary Labels','Characteristic Labels','Owner Labels','Tech Labels']]

### Build variable strings to pass to the API call


In [56]:
no_label = [
    'OWNPDEMP_LABEL','GEO_ID_LABEL','FIRMPDEMP_LABEL','STATE_LABEL','PAYANN_LABEL',
    'EMP_LABEL','RCPPDEMP_LABEL',
]
variable_dict = {}
for i,label in enumerate(target_subset.columns[1:]):
    variable_list = []
    for item in target_subset[(target_subset[label] != "-")].Name.tolist():
        variable_list.append(item)
        variable_list.append(f'{item}_LABEL')
    variable_list = [_ for _ in variable_list if _ not in no_label]
    in_table = "NAME," + ",".join(variable_list)
    variable_dict[i] = in_table
variable_dict


{0: 'NAME,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,PAYANN,RACE_GROUP,RACE_GROUP_LABEL,RCPPDEMP,SEX,SEX_LABEL,YIBSZFI,YIBSZFI_LABEL',
 1: 'NAME,BUSCHAR,BUSCHAR_LABEL,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,PAYANN,QDESC,QDESC_LABEL,RACE_GROUP,RACE_GROUP_LABEL,RCPPDEMP,SEX,SEX_LABEL,YIBSZFI,YIBSZFI_LABEL',
 2: 'NAME,GEO_ID,NAICS2017,NAICS2017_LABEL,OWNCHAR,OWNCHAR_LABEL,OWNPDEMP,QDESC,QDESC_LABEL,OWNER_RACE,OWNER_RACE_LABEL,OWNER_SEX,OWNER_SEX_LABEL',
 3: 'NAME,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,NSFSZFI,NSFSZFI_LABEL,PAYANN,RACE_GROUP,RACE_GROUP_LABEL,RCPPDEMP,SEX,SEX_LABEL,TECHUSE,TECHUSE_LABEL'}
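
The same string-building idea can be wrapped in small helpers. The sketch below is one possible way to do it, not the code we ran: `build_get` and `build_url` are hypothetical names, and the variables passed in are just a short sample. It relies only on the standard library's `urlencode`.

```python
from urllib.parse import urlencode

BASE = 'https://api.census.gov/data/2018'

def build_get(variables, no_label=()):
    """Return 'NAME,VAR,VAR_LABEL,...', skipping any label listed in no_label."""
    parts = ['NAME']
    for var in variables:
        parts.append(var)
        label = f'{var}_LABEL'
        if label not in no_label:
            parts.append(label)
    return ','.join(parts)

def build_url(dataset, variables, geography, no_label=(), **predicates):
    """Assemble a query URL; extra keyword arguments become filter predicates."""
    params = {'get': build_get(variables, no_label), 'for': geography, **predicates}
    # Keep commas, colons and asterisks readable instead of percent-encoding them.
    return f'{BASE}/{dataset}?' + urlencode(params, safe=',:*')

url = build_url('abscs', ['EMP', 'NAICS2017', 'SEX'], 'state:*',
                no_label={'EMP_LABEL'}, NAICS2017='00')
print(url)
```

Passing filters as keyword arguments keeps each predicate as its own `key=value` pair in the query string, which makes the final URL easier to read when a request fails.
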

In [57]:
# industry_code = '61'
# qdesc1 = 'B27'

links = [
    f'https://api.census.gov/data/2018/abscs?get={variable_dict[0]}&for=state:*',
    f'https://api.census.gov/data/2018/abscb?get={variable_dict[1]}&for=state:*',
    f'https://api.census.gov/data/2018/abscbo?get={variable_dict[2]}&for=us:*&for=QDESC_LABEL=YRACQBUS',
    f'https://api.census.gov/data/2018/abstcb?get={variable_dict[3]}&for=state:*',
    f'https://api.census.gov/data/2018/abscbo?get={variable_dict[2]}&for=state:*&OWNCHAR=CG&NAICS2017=00&QDESC=O02'
]

def get_data_frame(url):
    print("If the operation fails, click the link to see the error.")
    print(url,'\n')
    return pd.read_csv(url)
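
The API responds with a JSON array of rows in which the first row holds the column names and every value is a string. We read it with `pd.read_csv`, which is why brackets and quotes have to be stripped later. As an alternative sketch (not what this notebook does), the response text could be parsed as JSON directly. The payload below is made up, and `response_to_frame` is a hypothetical helper.

```python
import json
import pandas as pd

# A made-up response in the shape the API returns: a JSON array of rows,
# where the first row holds the column names and every value is a string.
sample_response = '''
[["NAME","NAICS2017","FIRMPDEMP","state"],
 ["Minnesota","00","1000","27"],
 ["Iowa","00","500","19"]]
'''

def response_to_frame(text, numeric_columns=()):
    """Turn the JSON text into a DataFrame with lower-case column names."""
    rows = json.loads(text)
    frame = pd.DataFrame(rows[1:], columns=[c.lower() for c in rows[0]])
    # Values arrive as strings, so convert the count columns explicitly.
    for column in numeric_columns:
        frame[column] = pd.to_numeric(frame[column])
    return frame

df = response_to_frame(sample_response, numeric_columns=['firmpdemp'])
print(df)
```

With this approach the header needs no bracket cleanup, and code columns such as `naics2017` and `state` keep their leading zeros because they stay as strings.
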

### The cell below is where the dataframes are first stored.

In [58]:
comp_sum_df = get_data_frame(links[0]) # Company Summary
bus_char_df = get_data_frame(links[1]) # Business Characteristics
bus_own_df = get_data_frame(links[2]) # Business Owners (National Level)
bus_tech_df = get_data_frame(links[3]) # Business Tech   
slbo = get_data_frame(links[4]) # Business Owners (State Level)

If the operation fails, click the link to see the error.
https://api.census.gov/data/2018/abscs?get=NAME,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,PAYANN,RACE_GROUP,RACE_GROUP_LABEL,RCPPDEMP,SEX,SEX_LABEL,YIBSZFI,YIBSZFI_LABEL&for=state:* 

If the operation fails, click the link to see the error.
https://api.census.gov/data/2018/abscb?get=NAME,BUSCHAR,BUSCHAR_LABEL,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,PAYANN,QDESC,QDESC_LABEL,RACE_GROUP,RACE_GROUP_LABEL,RCPPDEMP,SEX,SEX_LABEL,YIBSZFI,YIBSZFI_LABEL&for=state:* 

If the operation fails, click the link to see the error.
https://api.census.gov/data/2018/abscbo?get=NAME,GEO_ID,NAICS2017,NAICS2017_LABEL,OWNCHAR,OWNCHAR_LABEL,OWNPDEMP,QDESC,QDESC_LABEL,OWNER_RACE,OWNER_RACE_LABEL,OWNER_SEX,OWNER_SEX_LABEL&for=us:*&for=QDESC_LABEL=YRACQBUS 

If the operation fails, click the link to see the error.
https://api.census.gov/data/2018/abstcb?get=NAME,EMP,FIRMPDEMP,GEO_ID,NAICS2017,NAICS2017_LABEL,NSFSZFI,NSFSZFI_LABEL,PAYANN,RACE_GRO

In [59]:
df_collection = [comp_sum_df, bus_char_df, bus_own_df, bus_tech_df, slbo]    
df_names = ['comp_sum_df', 'bus_char_df', 'bus_own_df', 'bus_tech_df','slbo']   

### Clean DataFrames

In [60]:
drop_list = [
    'race_group','sex','yibszfi','qdesc','buschar',
    'owner_race','owner_sex','us','ownchar'
]

for df in df_collection:
    # Strip the JSON brackets and quotes left in the header by read_csv, then lower-case the names
    new_column_names = [_.replace("[[","").replace('"',"").replace("]","").lower() for _ in df.columns]

    df.columns = new_column_names
    df.drop(columns = [_ for _ in new_column_names if ('unnamed' in _ or _ in drop_list)],inplace = True)
    df['name'] = df['name'].apply(lambda x: x.replace("[","").replace('"',""))
    
    if 'sex_label' in df.columns:
        df.rename(columns = {'sex_label': 'gender'}, inplace = True)
    
    if 'owner_sex_label' in df.columns:
        df.rename(columns = {'owner_sex_label': 'gender'}, inplace = True)
    
    if 'naics2017_label' in df.columns:
        df.rename(columns = {'naics2017_label': 'industry'}, inplace = True)
        
    if 'naics2017' in df.columns:
        df.rename(columns = {'naics2017': 'industry_code'}, inplace = True)
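
To make the cleaning rules easier to follow, here is a small standalone sketch that applies them to a list of column names. The raw names are illustrative of what a JSON header looks like after being read as CSV, and `clean_column` is a hypothetical helper, not part of the notebook.

```python
def clean_column(name):
    """Strip JSON brackets and quotes from a column name and lower-case it."""
    return name.replace('[', '').replace(']', '').replace('"', '').lower()

# Column names of the kind produced when a JSON array is parsed as CSV (illustrative).
raw_columns = ['[["NAME"', 'NAICS2017_LABEL', 'SEX_LABEL', 'state]', 'Unnamed: 5']

# Friendlier names for the label columns, matching the renames in the cell above.
renames = {'sex_label': 'gender', 'naics2017_label': 'industry', 'naics2017': 'industry_code'}

cleaned = []
for raw in raw_columns:
    name = clean_column(raw)
    if 'unnamed' in name:
        continue  # drop the empty trailing column
    cleaned.append(renames.get(name, name))

print(cleaned)
```
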

### Display some info about dataframes and save data

In [61]:
# # First display all columns together
# print("(This group indicates common columns...)")
# for i,df in enumerate(df_collection):
#     print(f'DATAFRAME: {df_names[i]}\n COMMON: {", ".join(_ for _ in sorted(df.columns.tolist()) if _ in common_vars)}\n')

# print("\n(Columns are sorted alphabetically...)")
# for i,df in enumerate(df_collection):
#     print(f'DATAFRAME: {df_names[i]}\n  COLUMNS: {", ".join(sorted(df.columns.tolist()))}\n')
    
for i,df in enumerate(df_collection):
    try:
        df.to_csv(f'data/{df_names[i]}.csv', index = False)
    except OSError:
        print("Data directory is not present - skipping save")
        
    print('\n############# NEW DATAFRAME ################')
    print('Displaying column value counts where there are fewer than 10 unique values in the column.')
    print(f'\n---  DataFrame: {df_names[i]} ---------------------')
    print(f'Columns: {", ".join(df.columns.tolist())}')
    for column in df:
        if len(df[column].unique().tolist()) < 10:
            print(df[column].value_counts())
            print("")
    print('############# END OF DATAFRAME INFO ################\n\n')


############# NEW DATAFRAME ################
Displaying column value counts where there are fewer than 10 unique values in the column.

---  DataFrame: comp_sum_df ---------------------
Columns: name, emp, firmpdemp, geo_id, industry_code, industry, payann, race_group_label, rcppdemp, gender, yibszfi_label, state
Total                                         27311
White                                         11327
Nonminority                                   11320
Minority                                       9285
Asian                                          7748
Black or African American                      6433
Equally minority/nonminority                   6133
American Indian and Alaska Native              4509
Native Hawaiian and Other Pacific Islander     2126
Name: race_group_label, dtype: int64

0    86192
Name: rcppdemp, dtype: int64

Total                  44001
Male                   15588
Female                 13737
Equally male/female    12866
Name: gender, dtype: 
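
The save step in the cell above skips writing when the `data` directory is missing. A possible alternative, sketched below under the assumption that creating the directory is acceptable, is to make it first. `save_frames` is a hypothetical helper, and the example writes a tiny made-up frame to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_frames(frames, directory):
    """Write each DataFrame to <directory>/<name>.csv, creating the directory first."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = out_dir / f'{name}.csv'
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths

# Demonstrated in a temporary directory with a tiny made-up frame.
with tempfile.TemporaryDirectory() as tmp:
    written = save_frames({'example_df': pd.DataFrame({'name': ['Minnesota']})},
                          Path(tmp) / 'data')
    saved_names = [p.name for p in written]
    all_exist = all(p.exists() for p in written)

print(saved_names, all_exist)
```
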

## All Tables Merge
 - This merge was difficult because it was hard to find ways in which the fully merged data would be useful. <br><br> 
     In this merge set, we look at reported use of artificial intelligence across all sectors in Minnesota, <br>
     and how it relates to how long the company has been in business and whether it is a family-owned business.

### Steps:

1. Limit table data to "Total for all sectors".

In [45]:
total_bus_tech_df = bus_tech_df[(bus_tech_df['industry'] == 'Total for all sectors')]
total_bus_own_df = bus_own_df[(bus_own_df['industry'] == 'Total for all sectors')]
total_comp_sum_df = comp_sum_df[(comp_sum_df['industry'] == 'Total for all sectors')]
total_bus_char_df = bus_char_df[(bus_char_df['industry'] == 'Total for all sectors')]
total_slbo = slbo[(slbo['industry'] == 'Total for all sectors')]

2. Further filter the table data and remove unnecessary columns.

In [59]:
robotics_in_mn = total_bus_tech_df[(~total_bus_tech_df['techuse_label'].str.contains('Total')) & (total_bus_tech_df['name'] == 'Minnesota') & (total_bus_tech_df['techuse_label'].str.contains('Artificial'))]
robotics_in_mn = robotics_in_mn[['name','techuse_label','firmpdemp']]

years_in_biz_mn = total_comp_sum_df[(total_comp_sum_df.name == 'Minnesota') & (total_comp_sum_df['yibszfi_label'] != 'All firms') & (total_comp_sum_df['gender'] != 'Total')]
years_in_biz_mn = years_in_biz_mn[['name','gender','yibszfi_label','firmpdemp']].drop_duplicates()

bus_char_mn = total_bus_char_df[(total_bus_char_df.name == 'Minnesota') & (total_bus_char_df['buschar_label'] != 'All firms')& (total_bus_char_df['buschar_label'] != 'Total reporting') & (total_bus_char_df['qdesc_label'] == 'FAMOWN')]
bus_char_mn = bus_char_mn[['name','qdesc_label','buschar_label','firmpdemp']]

total_slbo = total_slbo[(total_slbo.name == 'Minnesota')]
total_slbo = total_slbo[['name','ownchar_label','ownpdemp']]

3. Rename similar columns so that they are meaningful after the merge.

In [59]:
years_in_biz_mn.rename(columns = {'firmpdemp': 'yib_num_firms'}, inplace = True)
robotics_in_mn.rename(columns = {'firmpdemp': 'robin_mn_num_firms'}, inplace = True)
bus_char_mn.rename(columns = {'firmpdemp': 'bus_char_num_firms'}, inplace = True)

4. Merge tables using outer merge to retain all records.

In [59]:
joined_tables = pd.merge(years_in_biz_mn, robotics_in_mn, on = 'name', how = 'outer')
joined_tables = pd.merge(joined_tables, bus_char_mn, on = 'name', how = 'outer')
joined_tables = pd.merge(joined_tables, total_slbo, on = 'name', how = 'outer')

# Just to display a subset of the data
joined_tables.iloc[1870:1876]

Unnamed: 0,name,gender,yibszfi_label,yib_num_firms,techuse_label,robin_mn_num_firms,qdesc_label,buschar_label,bus_char_num_firms,ownchar_label,ownpdemp
1870,Minnesota,Equally male/female,Firms with less than 2 years in business,0,Artificial Intelligence: Don't know,4105,FAMOWN,Item not reported,2293,Before 1980,3210
1871,Minnesota,Equally male/female,Firms with less than 2 years in business,0,Artificial Intelligence: Don't know,4105,FAMOWN,Not applicable,3158,Before 1980,3210
1872,Minnesota,Equally male/female,Firms with 16 or more years in business,272,Artificial Intelligence: Did not use,95053,FAMOWN,Family-owned,17863,Before 1980,3210
1873,Minnesota,Equally male/female,Firms with 16 or more years in business,272,Artificial Intelligence: Did not use,95053,FAMOWN,Not family-owned,40870,Before 1980,3210
1874,Minnesota,Equally male/female,Firms with 16 or more years in business,272,Artificial Intelligence: Did not use,95053,FAMOWN,Item not reported,2293,Before 1980,3210
1875,Minnesota,Equally male/female,Firms with 16 or more years in business,272,Artificial Intelligence: Did not use,95053,FAMOWN,Not applicable,3158,Before 1980,3210
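
One thing to keep in mind when reading the merged table: each filtered table contains only rows where `name` is `Minnesota`, so merging on `name` alone pairs every row of one table with every row of the others. The sketch below uses tiny made-up frames (the labels `a`, `x`, `p` and so on are placeholders) to show how the row count multiplies.

```python
import pandas as pd

# Tiny made-up frames that share a single key value, like the Minnesota-only tables above.
years = pd.DataFrame({'name': ['Minnesota'] * 3, 'yibszfi_label': ['a', 'b', 'c']})
tech = pd.DataFrame({'name': ['Minnesota'] * 2, 'techuse_label': ['x', 'y']})
family = pd.DataFrame({'name': ['Minnesota'] * 4, 'buschar_label': ['p', 'q', 'r', 's']})

joined = years.merge(tech, on='name', how='outer').merge(family, on='name', how='outer')

# Every row pairs with every row of the other frames: 3 * 2 * 4 combinations.
print(len(joined))
```

This is why the same count can repeat across many rows in the output above. Each count column still describes its own source table, so a row should not be read as a joint count across all of the labels in it.
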
