## Clustering Counties Within a State Based on Industry Characteristics

There are four wage determination (WD) construction types:

- Residential
- Highway
- Building
- Heavy

The Quarterly Census of Employment and Wages (QCEW) provides the following characteristics by industry and county (the list is not exhaustive):

- Average annual number of establishments
- Average annual employment
- Average weekly wage
- Annual contributions 

We can map NAICS to WD type as follows:

- **Residential:** NAICS 2361 Residential Building Construction
- **Highway:** NAICS 2373 Highway, Street, and Bridge Construction
- **Building:** NAICS 2362 Nonresidential Building Construction
- **Heavy:** NAICS 2379 Other Heavy Construction

With this mapping, the counties within a state can be compared to one another on the QCEW characteristics of the industry that corresponds to each WD type.
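
The mapping above can be written down as a small lookup table. This is only a sketch: the names `wd_type_to_naics` and `naics_to_wd_type` are hypothetical and are not used elsewhere in this notebook.

```julia
# WD construction type => 4-digit NAICS code, as listed above.
wd_type_to_naics = Dict(
    "Residential" => 2361,
    "Highway"     => 2373,
    "Building"    => 2362,
    "Heavy"       => 2379,
)

# Reverse lookup: NAICS code => WD construction type.
naics_to_wd_type = Dict(code => wd for (wd, code) in wd_type_to_naics)

println(naics_to_wd_type[2373])
```

The reverse lookup is convenient because the functions below take the NAICS code, not the WD type, as their `industry` argument.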

In [3]:
using CSV
using Clustering
using DataFrames
using DataFramesMeta
using Distances
using Interact
using StatsBase
using URIParser
using VegaLite
using WebIO

#### Create generic functions that return a county map for any industry and state

In [4]:
const industries = Dict(
    2361 => "2018.annual 2361 NAICS 2361 Residential building construction.csv",
    2373 => "2018.annual 2373 NAICS 2373 Highway, street, and bridge construction.csv",
    2362 => "2018.annual 2362 NAICS 2362 Nonresidential building construction.csv",
    2379 => "2018.annual 2379 NAICS 2379 Other heavy construction.csv"
)

function industry_df(industry::Int)
    @linq DataFrame(CSV.File("data/$(industries[industry])", normalizenames=true)) |>
    where(:agglvl_title .== "County, NAICS 4-digit -- by ownership sector") |>
    where(:own_title .== "Private")
end

industry_df (generic function with 1 method)

In [5]:
# Zero-pad the FIPS codes to two characters so they match the first two characters of :area_fips
const states_abbrevs_fips = @linq DataFrame(CSV.File("data/states_abbrebs_fips.csv")) |>
    transform(fips = @. lpad(string(:fips), 2, "0"));

|    | state | abbrev | fips |
|----|-------|--------|------|
|    | String | String | String |
| 1  | alabama | AL | 01 |
| 2  | alaska | AK | 02 |
| 3  | arizona | AZ | 04 |
| 4  | arkansas | AR | 05 |
| 5  | california | CA | 06 |
| 6  | colorado | CO | 08 |
| 7  | connecticut | CT | 09 |
| 8  | delaware | DE | 10 |
| 9  | district-of-columbia | DC | 11 |
| 10 | florida | FL | 12 |


In [7]:
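function create_groups_fuzzy_cmeans(state::String, industry::Int, C::Int=4, m::Float64=2.75)
    # Keep only the counties whose FIPS code starts with the state's two-digit code
    industry_data = industry_df(industry)
    counties = @linq industry_data |> where(first.(:area_fips, 2) .== state)
    # Build a features × counties matrix (one column per county)
    matrix = Matrix(hcat(
        counties.annual_avg_estabs_count,
        counties.annual_avg_emplvl,
        counties.annual_avg_wkly_wage,
        counties.annual_contributions
    )')
    # Standardize each feature (row) to zero mean and unit standard deviation
    matrix_normalized = StatsBase.transform(fit(ZScoreTransform, matrix, dims=2), convert.(Float64, matrix))
    # weights[i, j] is the membership of county i in group j
    weights = fuzzy_cmeans(matrix_normalized, C, m).weights
    # Assign each county to the group in which its membership weight is largest
    return DataFrame(
        fips = counties.area_fips,
        county = counties.area_title,
        group = [argmax(weights[i, :]) for i in 1:size(weights, 1)]
    )
end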

create_groups_fuzzy_cmeans (generic function with 3 methods)
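
The two data-handling steps in this function, standardizing each feature and then turning fuzzy memberships into a single group label, can be illustrated on toy data using only the standard library. This is a sketch under stated assumptions: the numbers are made up, `zscore_rows` is a hypothetical helper intended to mirror what `ZScoreTransform` with `dims=2` computes, and `W` stands in for the `weights` matrix that `fuzzy_cmeans` returns.

```julia
using Statistics  # standard library: mean, std

# Toy feature matrix: rows are features, columns are counties (made-up values).
# Row 1 is the number of establishments, row 2 is employment.
X = Float64[10 20 30; 100 200 300]

# Z-score each feature (row) across counties so that large-valued features
# such as employment do not dominate the distance calculation.
zscore_rows(X) = (X .- mean(X, dims=2)) ./ std(X, dims=2)

Z = zscore_rows(X)

# Made-up membership weights: one row per county, one column per group.
W = [0.7 0.2 0.1;
     0.1 0.3 0.6;
     0.4 0.4 0.2]

# Hard assignment: pick the group with the largest membership weight.
# argmax returns the first maximal index, so ties go to the lower-numbered group.
groups = [argmax(W[i, :]) for i in 1:size(W, 1)]
```

After standardization both rows of `Z` are on the same scale, even though employment was ten times larger than the establishment count. The third county shows the information lost by hard assignment: it belongs equally to groups 1 and 2, but only one label reaches the map.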

In [8]:
function show_state_groups(state::String, industry::Int, C::Int=4, m::Float64=2.75)
    # Look up the state's row once by its two-digit FIPS code, then build the TopoJSON URL
    state_row = states_abbrevs_fips[states_abbrevs_fips.fips .== state, :]
    link = "https://raw.githubusercontent.com/mthelm85/topojson/master/countries/us-states/$(state_row.abbrev[1])-$(state)-$(state_row.state[1])-counties.json"
    # Both sliders start at the values passed to the function
    return @manipulate for C = slider(2:20, value=C, label="Number of Groups"), m = slider(1.1:0.1:10.0, value=m, label="Fuzziness Factor")
        df = create_groups_fuzzy_cmeans(state, industry, C, m)
        @vlplot(width=1200, height=900) + 
        @vlplot(
            mark={ 
                :geoshape,
                stroke=:black
            },
            data={
                url=URI(link),
                format={
                    type=:topojson,
                    feature=Symbol("cb_2015_$(states_abbrevs_fips[states_abbrevs_fips.fips .== state, :state][1])_county_20m")
                }
            },
            transform=[
                {
                    lookup="properties.GEOID",
                    from={
                        data=df,
                        key=:fips,
                        fields=["group"]
                    }
                }
            ],
            color={
                "group:n",
                legend={title="Group"}
            },
            projection={
                typ=:naturalEarth1
            }
        ) +
        @vlplot(
            :text,
            data={
                url=URI(link),
                format={
                    type=:topojson,
                    feature=Symbol("cb_2015_$(states_abbrevs_fips[states_abbrevs_fips.fips .== state, :state][1])_county_20m")
                }
            },
            transform=[
                {
                    calculate="geoCentroid(null, datum)",
                    as="centroid"
                },
                {
                    calculate="datum.centroid[0]",
                    as="centroidx"
                },
                {
                    calculate="datum.centroid[1]",
                    as="centroidy"
                }
            ],
            text={field="properties.NAME", type=:nominal},
            longitude="centroidx:q",
            latitude="centroidy:q"
        )
    end
end

show_state_groups (generic function with 3 methods)

In [9]:
show_state_groups("40", 2362)
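
The "Number of Groups" and "Fuzziness Factor" sliders correspond to the parameters $C$ and $m$ of fuzzy c-means. For reference, the standard textbook formulation is given below; it is stated as background, and the internals of the `fuzzy_cmeans` implementation used here were not checked against it. With $N$ counties, standardized feature vectors $x_i$, and group centers $c_j$, the algorithm minimizes

$$
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} w_{ij}^{\,m} \, \lVert x_i - c_j \rVert^2,
\qquad \sum_{j=1}^{C} w_{ij} = 1 \ \text{for every } i,
$$

and alternates center updates with the membership update

$$
w_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}.
$$

As $m$ approaches 1 the memberships become nearly hard, as in k-means. As $m$ grows, every $w_{ij}$ drifts toward $1/C$, so the largest-weight assignment drawn on the map becomes less decisive.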