# Project File Summary

# Business Goal: 
### Determine the best locales for a startup craft brewery within the Denver Metro area.  
A desirable locale will have a sparse number of surrounding breweries (competition) and a population capable of sustaining regular customers for a **startup** brewery.

1. Discover population densities and demographics within the Denver Metro area through the latest available US Census Bureau population information.
    - Review and trim data as necessary to ensure it is reflective of the Denver Metro area
2. Locate surrounding established breweries using the Foursquare API, Google and any other relevant sources.

# Hypothesis:
$H_1$: The consumption of craft beer in the approximate Denver-Lakewood-Aurora Metropolitan Statistical Area (MSA) is influenced by a combination of demographic factors, including the number of existing breweries, education level, household income, age distribution, gender composition, ethnic diversity, and specific occupational profiles. We anticipate that areas with a higher concentration of breweries, a well-educated population, higher household incomes, a diverse age range, a balanced gender distribution, and specific occupational profiles, such as construction, will exhibit a higher propensity for craft beer consumption.

## Project Jupyter Notebook Files (In Order)
1) DenverCraftBreweries_START_HERE
2) DenverCraftBreweries_geographies
3) DenverCraftBreweries_demographics
4) DenverCraftBreweries_breweries 
5) DenverCraftBreweries_folium
6) DenverCraftBreweries_EDA
7) DenverCraftBreweries_models

### `DenverCraftBreweries_START_HERE`

* Overview file meant to summarize project work, broken down according to Jupyter Notebook files used in the analysis.  Project folders where files reside are listed under each file below.

### `DenverCraftBreweries_geographies`

#### DIRECTORIES & FILES:
* /DenverCraftBreweries/data/US Census Bureau/American Community Survey
    - `gdf_msa_places.pkl`
* /DenverCraftBreweries/data/US Census Bureau/tl_rd22_08_prisecroads
    - `gdf_msa_roads.pkl`
* /DenverCraftBreweries/data/US Census Bureau/tl_2022_us_mil
    - `gdf_msa_mil.pkl`
* /DenverCraftBreweries/data/US Census Bureau/tl_rd22_08_arealm
    - `gdf_msa_arealm.pkl`
* /DenverCraftBreweries/data/US Census Bureau/tl_rd22_08_pointlm
    - `gdf_msa_pointlm.pkl`

#### SUMMARY
* Exploration of US Census Bureau TIGER Shapefile geometries
* The appropriate level of geographic detail (state vs. county vs. Census Tract vs. Census Block vs. Census Place) is determined.  **Census Places** are used; they fall between Census Tracts (highly detailed sub-city level) and Census Blocks (not necessarily organized to cities/city-level).
    - Census Places capture cities, towns, villages, etc. and are a natural level of detail for community introspection; a finer level of detail may entangle the analysis
* Creation of `gdf_msa_places.pkl` pickled GeoPandas GeoDataFrame which contains Census Places for the approximate Denver-Lakewood-Aurora MSA.  Additional places NW of Denver have been incorporated whereas places to the extreme south of Denver beyond Parker, CO have been dropped from the original Census Bureau dataset.
* Creation of `gdf_msa_roads.pkl` which represents a Census Bureau geographic interpretation (latitude, longitude) of major roads which run through the MSA
* Creation of `gdf_msa_mil.pkl` which represents a Census Bureau geographic interpretation (latitude, longitude) of military bases found w/in the MSA
* Creation of `gdf_msa_arealm.pkl` which represents a Census Bureau geographic interpretation (latitude, longitude) of area landmarks (polygons) found in the MSA
    - Area landmarks are used to determine if Census Bureau Places have colleges or medical facilities w/in their boundaries
    - this dataset appears to have the more complete listing of colleges & medical facilities (vs Point Landmarks)
* Creation of `gdf_msa_pointlm.pkl` which represents a Census Bureau geographic interpretation (latitude, longitude) of place landmarks (points) found in the MSA
    - Point landmarks are used to determine if Census Bureau Places have airports w/in their boundaries
    - this dataset appears to have the more complete listing of airports (vs Area Landmarks)
* Census Places have a unique geographic identifier (GEO_ID) which can be used to match geospatial areas to demographics.

### `DenverCraftBreweries_demographics`

#### DIRECTORIES & FILES:
* /DenverCraftBreweries/data/US Census Bureau/American Community Survey
    - `acs_2021_5yr_data_dictionary.xlsx`
    - `variableMapping_acs_2021_5yr_data_dictionary.xlsx`
    - `colorado_place_demographics.pkl`
    
* /DenverCraftBreweries/data/US Census Bureau/American Community Survey/backups
    - `acs_data_dictionary_soup.pickle`
    - `2021_ACS_RAW_All_DataProfiles.pkl`
    - `df_var_map.pkl`
    
#### SUMMARY
* Procure US Census Bureau American Community Survey (ACS) demographic profile information for specific places (cities, towns, etc.) w/in the approximate Denver-Lakewood-Aurora MSA (some additional areas NW of Denver metro have been added due to continuity of a population core)
    - includes social, economic, housing, and demographic characteristics
* Creation of data dictionary `acs_2021_5yr_data_dictionary.xlsx` from the US Census Bureau website. It is used to interpret the data profiles, which are collections of variables packaged together by overarching theme (income, housing, etc.)
* Creation of mapping file `variableMapping_acs_2021_5yr_data_dictionary.xlsx` and DataFrame `df_var_map.pkl`, both derived from the data dictionary and used to determine the pertinent variables needed for the project
    - see section below on `Craft Beer Drinker Demographic Research`
* All GET return results for select Census Data Profiles (DP02, DP03, DP04, DP05) are assembled into a dataframe and saved as `2021_ACS_RAW_All_DataProfiles.pkl`
* Creation of `colorado_place_demographics.pkl` Pandas DataFrame which captures demographics across the state of Colorado (this includes all Census Places w/in the State of Colorado)
* Exploration of occupations associated w/ higher than average alcohol consumption involving a review of published material.  Implications include local places possibly associated w/ higher alcohol consumption including those areas close to hospitals.
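
The Data Profile requests described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the public Census Bureau API endpoint for the 2021 ACS 5-year Data Profiles, the `group(...)` query syntax, and state FIPS code `08` for Colorado. The helper name `build_profile_url` is hypothetical, and an API key (which the Census Bureau may require at higher request volumes) is omitted.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2021 ACS 5-year Data Profiles.
BASE_URL = "https://api.census.gov/data/2021/acs/acs5/profile"


def build_profile_url(profile: str, state_fips: str = "08") -> str:
    """Build a GET URL for every variable in one Data Profile,
    for all Census Places in one state (08 = Colorado)."""
    params = {
        "get": f"NAME,group({profile})",
        "for": "place:*",
        "in": f"state:{state_fips}",
    }
    # Leave the characters used by the Census query syntax unescaped.
    return f"{BASE_URL}?{urlencode(params, safe='():*,')}"


# One URL per Data Profile named in the summary above.
urls = {p: build_profile_url(p) for p in ("DP02", "DP03", "DP04", "DP05")}
print(urls["DP05"])
```

Each URL could then be passed to an HTTP client and the JSON response stored before being combined into a single dataframe.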

## Craft Beer Drinker Demographic Research
* /DenverCraftBreweries/Beer Industry Research
    - Numerous publications and/or articles have been saved here
    - `Research Conclusions.ods` summarizes important findings while referencing specific documents
* Publications from the Brewers Association, a craft beer trade group, become important in identifying an average craft beer drinker. They also reference and summarize Harris Polls conducted in the last several years.
    - Craft Beer Insights Poll (CIP Survey) conducted by Harris Poll May 18th – 26th, 2019 (numbers presented via article Power Hour: Nielsen Shares 2019 Craft Beer Consumer Insights)
        * Avg weekly craft beer drinker: male, 21 – 44 y.o., making $\$$75k - $\$$99k annually
        * weekly craft drinkers represented 45% of respondents
        * monthly craft drinkers represented 55% of respondents
        * 43% of legal drinking age consumers drink craft beer, up from 35% in 2015
        * 56% of men, 31% of women said they drink craft beer
        * more than half of 21 – 44 y.o. said they drink craft beer

### `DenverCraftBreweries_breweries`

#### DIRECTORY & FILES
* /DenverCraftBreweries/data/Breweries/
    - `api_coords.html`
    - `vendor_comparison.html`

* /DenverCraftBreweries/data/Breweries/Foursquare
    - `gdf_msa_foursquare_breweries.pkl`

* /DenverCraftBreweries/data/Breweries/Foursquare/backups
    - numerous API return requests are written to JSON files here which follow the convention "foursquare_result_#.json"
    - JSON files are later opened and concatenated into a single dataframe

* /DenverCraftBreweries/data/Breweries/Google Maps Platform
    - `gdf_msa_goolge_breweries.pkl`

* /DenverCraftBreweries/data/Breweries/Google Maps Platform/backups
    - numerous API return requests are written to JSON files here which follow the convention "google_result_#_[1st|2nd|3rd].json"
    - 1st|2nd|3rd denotes a page in a given call; Google limits each page to 20 results but, for a given point, will provide up to 60 results (3 pages) if that many results exist.
    - JSON files are later opened and concatenated into a single dataframe
    
* /DenverCraftBreweries/data/Breweries/coloradobrewerylist
    - `gdf_msa_coloradobrewerylist.pkl`

* /DenverCraftBreweries/data/Breweries/coloradobrewerylist/backups
    - colorado_brewery_list.pkl
    - geocoded_colorado_brewery_list.pkl

#### SUMMARY
* Derivation of brewery data from 3 major sources which helps solidify the accuracy of the dataset:
    - FourSquare **API**
        - GET return results limited to 50-places per call. Backups saved in a dedicated directory noted above.
    - Google Maps Platform / Places **API**
        - GET return results limited to 60-places per call, w/ 20 results per 'page'
        - A token is provided in the 1st return result for the 1st 20-places, which must be used to grab any additional pages (up to pages 2 & 3) & is accomplished by coding a loop in Python (while loop).  Backups saved in a dedicated directory noted above.
    - **coloradobrewerylist.com**
        - most comprehensive source.  Webpages are written in interactive JavaScript, so both the Selenium and BeautifulSoup libraries are used to locate and extract information.  While the page lists all known craft breweries w/in the Denver Metro area, it does not include latitude and longitude coordinates; therefore Google's geocoder API is used to convert addresses to latitude/longitude pairs.
* The API query coordinates used to work around the per-call result limits of both FourSquare and Google are plotted to ensure adequate coverage, where a 5-mile radius is used -- see `api_coords.html`
* Data quality & coverage is subsequently analyzed w/ a *Folium plot* of all datasets & it's determined that coloradobrewerylist.com is the best singular comprehensive data source for breweries within our MSA/geospatial area
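
The paging loop described for the Google Places results can be sketched as below. This is an illustration under stated assumptions rather than the notebook's implementation: `collect_pages` and `fake_fetch` are hypothetical names, the response keys `results` and `next_page_token` are assumed to match the Places API payload, and the fetch function is a stand-in so the example runs without network access or an API key.

```python
import time
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict],
                  max_pages: int = 3, delay_s: float = 0.0) -> list:
    """Gather results across pages by following the page token."""
    results, token, pages = [], None, 0
    while pages < max_pages:
        payload = fetch_page(token)           # first call uses token=None
        results.extend(payload.get("results", []))
        pages += 1
        token = payload.get("next_page_token")
        if not token:                         # no further pages available
            break
        time.sleep(delay_s)                   # a real token may need a short wait
    return results


def fake_fetch(token: Optional[str]) -> dict:
    """Stand-in for an API call: 20 + 20 + 5 results over three pages."""
    pages = {
        None: {"results": list(range(20)), "next_page_token": "p2"},
        "p2": {"results": list(range(20, 40)), "next_page_token": "p3"},
        "p3": {"results": list(range(40, 45))},
    }
    return pages[token]


places = collect_pages(fake_fetch)
print(len(places))
```

In a real run, `fetch_page` would wrap the HTTP request for one query coordinate and each payload could be written to its own JSON backup file before moving on.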

### `DenverCraftBreweries_folium`

#### DIRECTORY & FILES
* /DenverCraftBreweries/data/US Census Bureau/American Community Survey
    - `gdf_place_demographics.pkl`
    - `KNNImputation_hhinc_median.png`
    
* /DenverCraftBreweries/finalized_figures
    - `MSA_geographic_demographics.html`

#### SUMMARY

* Important step in visually exploring the demographics of our MSA geographies and understanding their implications for choice of brewery location
* Choropleth `MSA_geographic_demographics` is created for the categories listed below, which are derived from `variableMapping_acs_2021_5yr_data_dictionary.xlsx`.  One important variable, median household income (`hhinc_median`), has 3 missing values that are imputed using k-nearest neighbors.  Categories:
    - Population
    - Households
    - Education
    - Occupation
    - Income
    - Select (incorporates *select* variables from above categories)
* Merges `gdf_msa_places.pkl` & `colorado_place_demographics.pkl` together to arrive at important demographics for places w/in our geographic area
    - Creation of file `gdf_place_demographics` based off of merge mentioned above, which also includes important derived variables. Derived variables:
        - population density, pop. >= 21 years old
        - target occupations
        - target age group
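
The merge and one of the derived variables can be sketched as follows. This is a minimal example with made-up data, not the project's actual frames: the `GEO_ID` values, place names, the `tot_pop` column, and the use of a land-area column named `ALAND` (in square meters) are all assumptions for illustration, and plain Pandas DataFrames stand in for the GeoDataFrame.

```python
import pandas as pd

# Hypothetical stand-in for gdf_msa_places (geometry column omitted).
places = pd.DataFrame({
    "GEO_ID": ["1600000US0800001", "1600000US0800002"],
    "NAME": ["Place A", "Place B"],
    "ALAND": [2_589_988, 5_179_976],   # assumed land area in square meters
})

# Hypothetical stand-in for the statewide demographics table.
demographics = pd.DataFrame({
    "GEO_ID": ["1600000US0800001", "1600000US0800002", "1600000US0800003"],
    "tot_pop": [1000, 500, 750],
})

# A left merge keeps only places inside the study area; validate guards
# against duplicated identifiers on either side.
merged = places.merge(demographics, on="GEO_ID", how="left",
                      validate="one_to_one")

# Derived variable: population density in people per square mile.
SQ_M_PER_SQ_MI = 2_589_988.11
merged["pop_density"] = merged["tot_pop"] / (merged["ALAND"] / SQ_M_PER_SQ_MI)
print(merged[["NAME", "pop_density"]])
```

The same pattern would extend to the other derived variables by summing the relevant age or occupation columns before dividing by a population total.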

### `DenverCraftBreweries_EDA`

#### DIRECTORY & FILES
* /DenverCraftBreweries/finalized_figures
    - a series of barplots are created:
        - 15 Least Densely Populated MSA Places.png
        - barplot_Bottom 15 Median Household Incomes.png
        - barplot_Largest 15 Census Places: Total Population 21 and Over.png
        - barplot_Smallest 15 Census Places: Total Population 21 and Over.png
        - barplot_Top 10 Places: Different Residence 1-Year Prior.png
        - barplot_Top 15 Census Places for Target Occupations.png
        - barplot_Top 15 Census Places: HS Diploma or Higher Education.png
        - barplot_Top 15 Census Places: HS Diploma or Higher Education, Percent of Total.png
        - barplot_Top 15 Median Household Incomes.png
        - breweries_per_place.png
        - barplot_Top 20 Census Place Brewery Densities.png
        - Top 15 Most Densely Populated MSA Places.png
        - Total Population by Age Group.png
        - Total Population by Household Income Band.png
    - a series of boxplots are created:
        - boxplot_brewery_density.png
        - boxplot_commute_wfh.png
        - boxplot_diff_res_1yr.png
        - boxplot_hhinc_median.png
        - boxplot_target_hhinc.png
        - boxplot_target_occs.png
        - boxplot_tot_pop_>=21.png
        - KNNImputation_hhinc_median.png
        - Total Population 18 and over: Men and Women.png
    - a series of histograms are created:
        - histogram_Brewery Density.png
        - histogram_Median Household Income.png
        - histogram_Target Household Income.png
        - histogram_Target Occupations Population.png
        - histogram_Total Population 21 and Over.png

* /DenverCraftBreweries/models/PCA
    - a series of PC1 loadings, where the original dataframe has Census Places for indices, saved as Excel workbooks following the convention "PCA_places_Loadings_ROUND#.xlsx"
        - 6 workbooks total
    - a series of PC1 vs PC2 plots, where the original dataframe has variables for indices, saved as Excel workbooks following the convention "PCA_vars_Loadings_ROUND#.xlsx"
        - 6 images total
    - a series of Loadings plots, where the original dataframe has Census Places for indices, saved as PNG images following the convention "PCA_vars_Loadings_ROUND#.xlsx"
        - 6 images total
    - a series of Loadings plots, where the original dataframe has variables for indices, saved as PNG images following the convention "PCA_vars_PC1vsPC2_ROUND1.png"
        - 6 images total
    - `gdf_pca_demo_vars.pkl`
    - `pca_scaled_gdf_demographics.pkl`
    - variableSets.txt



#### SUMMARY
* Histograms, Kernel Density Estimates (KDEs), statistical summaries (the extended five-number summary available from Pandas), and bar plots are used to examine the data, with an emphasis on visualizations; select output is saved where it has relevance to craft beer drinker demographics/hypothesis, or to PCA and subsequent modeling. See finalized_figures directory listing above.

* Principal Component Analysis (PCA) is conducted.  Results of this analysis are used to determine groupings of variables capturing similar explained variance, which enables dimensionality reduction.  The PCA projections into lower-dimensional space are not used; rather, the original variables are kept.
    - When features are found to have a similar explained variance for PC1, preference is given to features identified as important for craft beer drinkers by the trade association or in polling
    - Scree plots are created 
    - Cumulative Explained Variance Plots are constructed
    - PC1 vs PC2 plots are created
    - PCA Loadings Summaries are written to file (Excel)
* A GeoDataFrame containing select Census demographics and their geometries is pickled as `gdf_pca_demo_vars`.
    - A scaled version is also pickled as `pca_scaled_gdf_demographics` which will aid clustering models
* A text file is saved which demonstrates 3 possible feature sets, including their origins.  
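
The use of PCA to spot redundant variables, rather than to project the data, can be sketched as below. This is a NumPy-only illustration on synthetic data, not the notebook's code: the variable names and the random data are assumptions, and PCA is computed here through a singular value decomposition of the standardized matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features: education nearly duplicates income, age is unrelated.
income = rng.normal(size=n)
education = income + 0.1 * rng.normal(size=n)
age = rng.normal(size=n)
X = np.column_stack([income, education, age])
names = ["income", "education", "age"]

# Standardize so each feature contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes (loadings).
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)      # values for a scree plot
cumulative = np.cumsum(explained_ratio)    # cumulative explained variance
pc1_loadings = dict(zip(names, Vt[0]))

print(explained_ratio.round(3))
print({k: round(float(v), 3) for k, v in pc1_loadings.items()})
```

Features with near-identical PC1 loadings (here `income` and `education`) are candidates for keeping only one, and the original, unprojected columns are what would carry forward into modeling.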

### `DenverCraftBreweries_models`

#### DIRECTORY & FILES
* /DenverCraftBreweries/finalized_figures
    - Geospatial Clustering Plots of Census Places:
        - kmeans_geospatial_clustering.png
        - agglomerative_geospatial_clustering.png
        - dbscan_geospatial_clustering.png
    - Geospatial Clustering Plot of Breweries:
        - folium_dbscan_brewery_clustering.html
    - Plots for determining optimal number of clusters (all methods):
        - Census Places
            - kmeans_wcss.png
            - agglomerative_hierarchical_dendrogram.png
            - dbscan_places_2distance.png
        - Breweries
            - dbscan_breweries_2distance.png

#### SUMMARY
* Geospatial Cluster plots representing similar Census Places across the approximate Denver-Lakewood-Aurora MSA are created and evaluated using the Silhouette Score for clustering.  
* Discussion of clustering similarities and differences for Census Places
* Discussion of brewery clustering for the approximate Denver MSA
* Discussion of Census Places whose demographics are favorable to craft brewing and which have few breweries at present
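
For readers unfamiliar with the Silhouette Score mentioned above, the sketch below computes it from scratch with NumPy on toy points. It illustrates the metric only and is not the notebook's code: the function name `mean_silhouette`, the toy coordinates, and both labelings are assumptions, and the function expects at least two clusters and distinct points.

```python
import numpy as np


def mean_silhouette(X, labels) -> float:
    """Mean silhouette coefficient using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():
            scores.append(0.0)             # singleton cluster scores 0
            continue
        a = dist[i, same].mean()           # cohesion: own cluster
        b = min(dist[i, labels == other].mean()   # separation: nearest other
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Two well-separated groups of toy points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])   # matches the groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])    # mixes the groups
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, which is why the metric is useful for comparing the output of different clustering methods on the same data.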