# Assignment 1
### Understanding Uncertainty
### Due 9/5

### Author: Garret Knapp 
### ID: nbz3de

1. Create a new public repo on Github under your account. Include a readme file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [1]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Downloading course data
Download complete
Extracting data files...
Data extracted


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T )
    9. `va_procurement.csv`: Public spending by the state of Virginia

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want that are not available in your data?

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to Github and submit the link on Canvas in the assignment tab.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/iowa.csv')
df.head()

Unnamed: 0,Invoice/Item Number,Date,Store Number,Store Name,Zip Code,Category Name,Vendor Name,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars)
0,INV-59108400026,06/06/2023,3723,J D SPIRITS LIQUOR,51040,STRAIGHT RYE WHISKIES,INFINIUM SPIRITS,27102,TEMPLETON RYE 4YR,750,18.09,27.14,2,54.28
1,S16879800057,01/16/2014,3926,LIQUOR DOWNTOWN / IOWA CITY,52240,VODKA 80 PROOF,HEAVEN HILL BRANDS,35416,BURNETT'S VODKA 80 PRF,750,4.84,7.26,12,87.12
2,INV-05301100019,06/05/2017,3829,GARY'S FOODS / MT VERNON,52314,CANADIAN WHISKIES,DIAGEO AMERICAS,11296,CROWN ROYAL,750,15.59,23.39,6,135.66
3,INV-40973500083,10/14/2021,5102,WILKIE LIQUORS,52314,AMERICAN SCHNAPPS,JIM BEAM BRANDS,82787,DEKUYPER BUTTERSHOTS,1000,7.87,11.81,12,141.72
4,INV-17022500013,01/18/2019,2560,HY-VEE FOOD STORE / MARION,52302,WHISKEY LIQUEUR,SAZERAC COMPANY INC,64863,FIREBALL CINNAMON WHISKEY,200,2.5,3.75,12,45.0


In [58]:
missing1 = np.sum(df['Bottle Volume (ml)'].isna())
print(f"Bottle Volume (ml) Missing Values: {missing1}")
print(f"\nBottle Volume (ml) Value Description: {df['Bottle Volume (ml)'].describe()}")

Bottle Volume (ml) Missing Values: 0

Bottle Volume (ml) Value Description: count    159904.000000
mean        869.592737
std         513.812818
min          20.000000
25%         750.000000
50%         750.000000
75%        1000.000000
max        6000.000000
Name: Bottle Volume (ml), dtype: float64


In [57]:
missing2 = np.sum(df['Category Name'].isna())
print(f"Category Name Missing Values: {missing2}")
print(f"\nCategory Name Value Counts: {df['Category Name'].value_counts()}")

Category Name Missing Values: 133

Category Name Value Counts: Category Name
AMERICAN VODKAS              16611
CANADIAN WHISKIES            15280
STRAIGHT BOURBON WHISKIES    10416
WHISKEY LIQUEUR               7789
SPICED RUM                    7379
                             ...  
ROCK & RYE                      14
LOW PROOF VODKA                 11
ANISETTE                         7
WHITE CREME DE MENTHE            6
AMARETTO - IMPORTED              2
Name: count, Length: 92, dtype: int64


In [64]:
missing3 = np.sum(df['State Bottle Cost'].isna())
print(f"State Bottle Cost Missing Values: {missing3}")
print(f"\nState Bottle Cost Description: {df['State Bottle Cost'].describe()}")

State Bottle Cost Missing Values: 0

State Bottle Cost Description: count    159904.000000
mean         10.980340
std          11.399802
min           0.000000
25%           5.780000
50%           8.660000
75%          13.250000
max        2298.840000
Name: State Bottle Cost, dtype: float64


## Variables
Var 1: Bottle Volume (ml) <br>
Type: Numeric <br>
Missing Values: 0 <br>

Var 2: Category Name <br>
Type: Categorical <br>
Missing Values: 133 <br>

Var 3: State Bottle Cost <br>
Type: Numeric <br>
Missing Values: 0<br>
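
The summary above could also be built programmatically. The following is a minimal sketch, not the method used in the cells above: the helper `summarize_columns` and the toy rows are hypothetical, and the type label comes from the pandas dtype alone.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_columns(df, columns):
    """Label each column Numeric or Categorical by dtype and count missing values.

    Hypothetical helper for illustration only.
    """
    rows = []
    for col in columns:
        rows.append({
            "variable": col,
            "type": "Numeric" if is_numeric_dtype(df[col]) else "Categorical",
            "missing": int(df[col].isna().sum()),
        })
    return pd.DataFrame(rows)


# Toy rows made up for illustration; they are not taken from iowa.csv.
toy = pd.DataFrame({
    "Bottle Volume (ml)": [750, 1000, 200],
    "Category Name": ["STRAIGHT RYE WHISKIES", None, "WHISKEY LIQUEUR"],
    "State Bottle Cost": [18.09, 4.84, 2.5],
})

summary = summarize_columns(
    toy, ["Bottle Volume (ml)", "Category Name", "State Bottle Cost"]
)
print(summary)
```

A dtype check is only a first pass. A column such as `Store Number` is stored as a number but is categorical in meaning, so the label still needs a human check.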

## Questions from the Data

One question we could ask of the data is: what is the most purchased liquor category in each ZIP code each month? With that answer, we could build a prediction tool that forecasts the most popular liquor categories by ZIP code in the following months. Liquor store owners would be the key stakeholders, since these forecasts would help them plan inventory and purchasing around expected demand in the areas they serve.
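
As a starting point for that question, the historical answer can be computed with a group-by. This is a rough sketch under stated assumptions: popularity is measured by `Bottles Sold`, `Date` is formatted as month/day/year as in the rows shown by `df.head()`, and the function name and toy rows are hypothetical. It describes past sales and is not a forecast.

```python
import pandas as pd


def top_category_by_zip_month(df):
    """Return the category with the most bottles sold in each (ZIP code, month)."""
    out = df.copy()
    # Assumes dates look like 06/06/2023, as in the preview rows.
    out["Month"] = pd.to_datetime(out["Date"], format="%m/%d/%Y").dt.to_period("M")
    # Total bottles per ZIP code, month, and category.
    # Rows with a missing Category Name are dropped by groupby's default settings.
    totals = out.groupby(
        ["Zip Code", "Month", "Category Name"], as_index=False
    )["Bottles Sold"].sum()
    # idxmax returns the row label of the largest total within each (ZIP, month).
    idx = totals.groupby(["Zip Code", "Month"])["Bottles Sold"].idxmax()
    return totals.loc[idx].reset_index(drop=True)


# Toy rows made up for illustration; they are not taken from iowa.csv.
toy = pd.DataFrame({
    "Date": ["06/06/2023", "06/20/2023", "06/21/2023", "07/01/2023"],
    "Zip Code": [52314, 52314, 52314, 52314],
    "Category Name": [
        "CANADIAN WHISKIES", "AMERICAN VODKAS",
        "CANADIAN WHISKIES", "AMERICAN VODKAS",
    ],
    "Bottles Sold": [6, 10, 6, 3],
})

top = top_category_by_zip_month(toy)
print(top)
```

On the full dataset, `Zip Code` may need cleaning first (for example, mixed string and numeric values), which this sketch does not handle.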

However, this raises concerns around data privacy and informed consent. Even when data is anonymized, using detailed geographic data such as ZIP codes can still pose privacy risks. Consumers are often unaware that their purchasing behavior is being analyzed in this way, which may conflict with principles of transparency and consent. To mitigate these concerns, data could be aggregated at a broader geographic level, or used with clearer communication to consumers about how their data is collected and applied.

Another question we could ask of the data is: which vendors generate the highest sales revenue in each store each quarter? With that answer, we could build a prediction tool that forecasts the top-performing vendors for each store in upcoming quarters. Store managers and purchasing teams would be the key stakeholders, since these forecasts would help them negotiate better deals with top-performing vendors and keep adequate inventory of those vendors' products.
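
The historical version of this ranking can be sketched the same way. The example below assumes revenue is the sum of `Sale (Dollars)` and that `Date` is formatted as month/day/year; the function name, the `n` parameter, and the toy rows are hypothetical, and the result describes past quarters only.

```python
import pandas as pd


def top_vendors_by_store_quarter(df, n=1):
    """Return the n highest-revenue vendors for each (store, quarter)."""
    out = df.copy()
    # Assumes dates look like 01/18/2019, as in the preview rows.
    out["Quarter"] = pd.to_datetime(out["Date"], format="%m/%d/%Y").dt.to_period("Q")
    # Total revenue per store, quarter, and vendor.
    revenue = out.groupby(
        ["Store Number", "Quarter", "Vendor Name"], as_index=False
    )["Sale (Dollars)"].sum()
    # Sort so the largest revenue comes first within each store and quarter.
    revenue = revenue.sort_values(
        ["Store Number", "Quarter", "Sale (Dollars)"],
        ascending=[True, True, False],
    )
    # head(n) keeps the first n rows of each group after sorting.
    return revenue.groupby(["Store Number", "Quarter"]).head(n).reset_index(drop=True)


# Toy rows made up for illustration; they are not taken from iowa.csv.
toy_sales = pd.DataFrame({
    "Date": ["01/18/2019", "02/10/2019", "03/05/2019", "04/12/2019"],
    "Store Number": [2560, 2560, 2560, 2560],
    "Vendor Name": [
        "SAZERAC COMPANY INC", "DIAGEO AMERICAS",
        "SAZERAC COMPANY INC", "DIAGEO AMERICAS",
    ],
    "Sale (Dollars)": [45.0, 60.0, 30.0, 135.66],
})

top_vendors = top_vendors_by_store_quarter(toy_sales, n=1)
print(top_vendors)
```

Setting `n` above 1 returns a short ranked list per store and quarter, which may be more useful for negotiation than a single winner.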

This tool raises a fairness concern. Prediction tools trained on historical sales data can unintentionally reinforce existing biases by favoring established vendors or popular categories, potentially sidelining smaller or minority-owned businesses. If stores stock mostly what the model ranks highly, smaller vendors get fewer chances to build the sales history they would need to be ranked highly in the future.

In addition to sales data, I would also want access to each store's current inventory of an item after a purchase is made. This would help us analyze whether certain stores frequently sell out of popular products, allowing for more responsive inventory adjustments. Additionally, we could track how quickly specific items sell out during peak periods, such as Christmas, to ensure sufficient stock is available in advance of seasonal demand.

Another valuable piece of data would be whether an item was sold at a discounted price or as part of a promotion. This information could help identify products that are primarily purchased when on sale, enabling more strategic pricing decisions and better-timed promotions. Vendors could also avoid overstocking items that sell mainly on promotion or on sale, which would help maximize profits.