## Coffee Analysis for Sourcing

### Introduction
Coffee is one of the most important drinks humanity has ever discovered, and yet I feel it is also one of the most underappreciated by the broader population. After watching a bunch of TikTok and YouTube Shorts creators attest to how much "better-tasting" coffee could be, I decided to take a leap of faith and try it at home for myself. And oh boy, this was a rabbit hole I never thought I would go down, let alone enjoy immensely.

At first, I was immediately able to taste the difference: after dialing in the right grind size, the flavors were smooth, non-bitter, and fruity. It was absolutely mind-bending. "How could coffee taste this good?!" However, I had a really hard time distinguishing the actual flavor profiles mentioned, due to my lack of tasting experience. Over time I was slowly able to differentiate some of the flavors (not all), and I have been brewing coffee ever since my first buy (5 months to date). But there was always a thought in the back of my mind asking, "What more can you do to brew a better cup?"

This question led me to confront the most obvious factor in the entire process of making coffee: the origin, or to use a better term, "sourcing". Even though I am still a beginner in the space, it is widely known that the growing and processing methods at coffee farms affect both the taste and the quality of the result. However, after some light digging, I was left somewhat unsatisfied. Most of the articles out there are vague, don't explain much about how these factors come into play, or at times feel like complete pseudoscience.

To combat this, I decided to take matters into my own hands and try to model something myself to nail down which factors actually contribute to a good-tasting cup of coffee. I told myself, "Welp, I am in a data science program after all, why not apply the skills I have learned so far and take a stab at it?" Never once did I think I would be applying skills from my career path in this personal a manner. Damn it! It's just a hobby, bro.

Before we delve deeper into how I acquired and processed the data, here's some quick info on my gear for those who are interested!

### At Home Setup

- Chestnut C3 Grinder - TIMEMORE
- V60 Dripper - HARIO
- Origami Dripper Size M - ORIGAMI
- Origami Dripper Air Size S - ORIGAMI
- Respective V60 and Origami Filter Papers

### Methodology

Below, I follow the standard steps of a data science project to define and map out the process I will take.

**Objective/Question**
1. What factors contribute to a good-tasting cup of coffee?
2. Which specific regions can I narrow my search down to when buying coffee?

**General Steps**
1. Acquiring Data
2. Exploratory Data Analysis (EDA)
3. Data Cleaning
4. Data Modeling + Evaluation
5. Visualization of Results


In [None]:
# Basic libs
import pandas as pd
import numpy as np
from scipy.stats import zscore  # only zscore is used below; avoids a wildcard import

# Plotting libs
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns


### Data

After some extensive searching, it appears that a lot of the coffee data that goes into extreme detail is locked behind paywalls, especially data from more official organizations like the Coffee Quality Institute (CQI). I found an equivalent to what I was looking for on Kaggle. Because it was scraped from CQI, it may not be 100% reliable, but it seemed like the most reliable data I could get.

Here is the direct link to the data: [Coffee Quality Data (CQI)](https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi)
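
Since the data was scraped, a quick sanity check before any analysis seems worthwhile. The snippet below is only a sketch of what such a check could look like: `quality_report` is a hypothetical helper, and the toy values are made up for illustration (only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped CQI table (values are made up for illustration)
toy_df = pd.DataFrame({
    "Country.of.Origin": ["Ethiopia", "Ethiopia", "Brazil", None, "Colombia"],
    "altitude_high_meters": [2000.0, 2000.0, np.nan, 1200.0, 190164.0],
    "Total.Cup.Points": [89.0, 89.0, 82.5, 0.0, 83.2],
})

def quality_report(df):
    """Hypothetical helper: missing values, exact duplicate rows, numeric ranges."""
    missing = df.isna().sum()                     # missing values per column
    n_duplicates = int(df.duplicated().sum())     # rows identical to an earlier row
    numeric_ranges = df.select_dtypes("number").agg(["min", "max"])
    return missing, n_duplicates, numeric_ranges

missing, n_duplicates, numeric_ranges = quality_report(toy_df)
print(missing)
print(f"Duplicate rows: {n_duplicates}")
print(numeric_ranges)
```

Implausible minimums or maximums in the range table (an altitude far above any mountain, a cup score of zero) are the kind of thing I would expect a scrape to produce, so they are worth looking at before trusting any plot.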

In [None]:
# reading in the dataset
cqi_df = pd.read_csv("../Data/cqi_2018/merged_data_cleaned.csv")
cqi_df.drop(columns=['Unnamed: 0'], inplace=True)

# Describing dataset
cqi_shape = "{} rows x {} columns".format(cqi_df.shape[0], cqi_df.shape[1])
print("Data Frame Shape: \n" + cqi_shape + " \n")
print("Columns: \n")
print(cqi_df.columns)

In [None]:
# Preview of data
cqi_df.head(2)

### Exploratory Data Analysis

In [None]:
# Helper function for discovery of individual variables and their distributions
def quick_hist(df, col_name, asc, head, g_title=None):
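    # rename_axis names the index so the value column is called col_name
    # after reset_index, regardless of the installed pandas version
    temp_counts = (
        df[col_name]
        .value_counts()
        .rename_axis(col_name)
        .reset_index(name='Counts')
        .sort_values(['Counts'], ascending=asc)
    )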

    if head:
        temp_counts = temp_counts.head(head)
    
    #print("Tabling")
    #print(temp_counts)
    
    # Matplot lib
    #temp_counts.plot.bar(x=col_name, y='count', rot=0)
    #plt.show()

    # Plotly
    if not g_title:
        g_title = f'{col_name} Distribution'

    f = px.bar(temp_counts, x=col_name, y='Counts', text='Counts', title=g_title, color=col_name)
    f.update_layout(width=800, height=600)

    f.show()

In [None]:
# Tester
#quick_hist(cqi_df, 'Species', False, False)
#quick_hist(cqi_df, 'Owner', False, 10, 'Top 10 Owner Distribution')

# Showcase
quick_hist(cqi_df, 'Altitude', False, 10, 'Top 10 Altitude Distribution')
quick_hist(cqi_df, 'Harvest.Year', False, 10, 'Top 10 Harvest Years')
quick_hist(cqi_df, 'Country.of.Origin', False, 10, 'Top 10 Countries of Origin')
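
To see what actually gets plotted, the counts table can be built on its own, without Plotly. This is a minimal sketch: `top_counts` is a hypothetical helper that mirrors the counting step of `quick_hist`, and the toy column is made up for illustration.

```python
import pandas as pd

# Toy column standing in for cqi_df['Country.of.Origin'] (values are made up)
toy_df = pd.DataFrame({
    "Country.of.Origin": ["Mexico", "Brazil", "Mexico", "Ethiopia", "Mexico", "Brazil"]
})

def top_counts(df, col_name, head=None, asc=False):
    """Hypothetical helper: two-column table of values and counts, sorted by count."""
    counts = (
        df[col_name]
        .value_counts()
        .rename_axis(col_name)        # name the index so reset_index yields col_name
        .reset_index(name="Counts")
        .sort_values("Counts", ascending=asc)
    )
    # Keep only the first `head` rows when a limit is given
    return counts.head(head) if head else counts

top2 = top_counts(toy_df, "Country.of.Origin", head=2)
print(top2)
```

Note that `value_counts` treats every distinct string as its own category, so a free-text column like `Altitude` will split the same altitude across several bars if it was written in different ways.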


In [None]:
# Plotting cup points vs altitude with labels on country of origin

# Quick outlier removal for better viewing
point_alt_view = cqi_df.copy()

z_score_alt = zscore(point_alt_view["altitude_high_meters"], nan_policy='omit')
z_score_p = zscore(point_alt_view["Total.Cup.Points"], nan_policy='omit')
thresh = 3

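# Build one combined mask over the original rows so both z-score arrays stay
# aligned with the frame; abs() catches outliers on both sides, and rows with
# a missing value (NaN z-score) are dropped because the comparison is False
keep_rows = (np.abs(z_score_alt) <= thresh) & (np.abs(z_score_p) <= thresh)
point_alt_view = point_alt_view[keep_rows]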

# Plotting
fig = px.scatter(point_alt_view, x='altitude_high_meters', y='Total.Cup.Points', title='Altitude High vs Total Cup Points by Country', color='Country.of.Origin')
fig.update_layout(width=800, height=600)
# Show the plot
fig.show()
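
The z-score filter used for the scatter plot can be checked on a small example. This is a sketch under assumptions: `zscore_mask` is a hypothetical helper, and the toy data is made up (20 plausible altitudes, one obvious data-entry error, and one missing value).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy data (made up): 20 plausible altitudes, one data-entry error, one missing value
toy_df = pd.DataFrame({
    "altitude_high_meters": list(np.linspace(1000, 2000, 20)) + [190164.0, np.nan],
    "Total.Cup.Points": np.linspace(80, 88, 22),
})

def zscore_mask(values, thresh=3):
    """Hypothetical helper: True where |z| <= thresh, False for outliers and NaN."""
    z = zscore(values.to_numpy(dtype=float), nan_policy="omit")
    return np.abs(z) <= thresh  # NaN compares as False, so missing rows drop out

# Both masks are computed on the full frame and combined before filtering
keep = zscore_mask(toy_df["altitude_high_meters"]) & zscore_mask(toy_df["Total.Cup.Points"])
filtered = toy_df[keep]
print(f"{len(toy_df)} rows before, {len(filtered)} rows after")
```

One caveat with this approach: a single huge value inflates the mean and standard deviation, so the z-score of an outlier is capped by the sample size, and with very few rows nothing can ever exceed a threshold of 3. With a dataset the size of the CQI one this should not be a concern.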

### Data Cleaning

### Data Modeling

### Visualizations

### Thoughts & Conclusions