# **Introduction** 
Coffee is one of the most popular and universal beverages today. According to Statista, approximately 166.63 million 60 kg bags of coffee were consumed worldwide in 2020-2021 (Statista 2023). As coffee is a necessity in many regions around the world, it has gained not only economic but also societal and cultural importance. Cafes, for example, have become trendy, with many students in particular enjoying the aesthetics of studying with a ‘for-here’ mug of good coffee. Most importantly, the largest contributor to whether coffee is profitable is the quality of the beans consumed. Although coffee can be farmed locally, the major coffee-producing regions are located along the ‘Bean Belt’, which includes Central and South America, Africa, and the Middle East and South Asia (Halo Coffee, 2020). Two main species of coffee bean are used today, Arabica and Robusta, with Arabica being the more popular.

Our aim for this exploratory analysis, as fellow coffee enthusiasts, is to identify the most versatile and profitable coffee bean that cafes should use in order to be the most successful. To identify this bean, we will use the country of origin and the region in which it was produced. By versatility, we mean the coffee bean that is the most flavourful, long-lasting, and resilient. This will be defined by specified ranges of the variables we will be testing, of which the qualitative ones are already scored from 0-10. All of this will allow us to find the bean of highest quality.

**Our question:**

What is the most profitable and versatile coffee bean that cafes should use? (as identified by the country of origin and region produced)



**Our Dataset:**

We are using a dataset from the Coffee Quality Institute made in 2018. The dataset was collected by James LeDoux, a data scientist at Buzzfeed, and refined by Diego Volpatto, a Scientific Developer at the National Laboratory for Scientific Computing. The data covers both Arabica and Robusta coffee beans (the majority being Arabica, as it is the more popular bean), rated on a 0-100 scale based on characteristics such as aroma, sweetness, acidity, body, etc. The country and region of origin are also recorded for each coffee, alongside its species, harvest and expiration dates, defects, etc. There are 1340 observations in the dataset with 43 variables (18 numeric variables and 25 categorical variables).


# **Preliminary exploratory data analysis**

First, we can load the packages necessary for cleaning and wrangling the dataset we are working with. This includes pandas, altair, scipy, datetime, and sklearn. We will be using more packages later in the analysis.

In [None]:
import pandas as pd
import altair as alt 
import scipy 
import datetime
import sklearn

In [None]:
url = "https://github.com/nazie23/dsci-100-project-group-9/blob/main/merged_data_cleaned.csv?raw=true"
coffee = pd.read_csv(url, index_col=0)
coffee=coffee[["Species",'Country.of.Origin','Region','Harvest.Year','Processing.Method','Aroma','Flavor','Aftertaste','Acidity','Balance','Uniformity','Moisture','Category.One.Defects','Category.Two.Defects','Expiration','altitude_mean_meters']]
coffee['Expiration'] =[i[-4:] for i in coffee['Expiration']]

# coffee = coffee[coffee['altitude_mean_meters']>1200]

coffee

Unnamed: 0,Species,Country.of.Origin,Region,Harvest.Year,Processing.Method,Aroma,Flavor,Aftertaste,Acidity,Balance,Uniformity,Moisture,Category.One.Defects,Category.Two.Defects,Expiration,altitude_mean_meters
0,Arabica,Ethiopia,guji-hambela,2014,Washed / Wet,8.67,8.83,8.67,8.75,8.42,10.00,0.12,0,0,2016,2075.0
1,Arabica,Ethiopia,guji-hambela,2014,Washed / Wet,8.75,8.67,8.50,8.58,8.42,10.00,0.12,0,1,2016,2075.0
2,Arabica,Guatemala,,,,8.42,8.50,8.42,8.42,8.42,10.00,0.00,0,0,2011,1700.0
3,Arabica,Ethiopia,oromia,2014,Natural / Dry,8.17,8.58,8.42,8.42,8.25,10.00,0.11,0,2,2016,2000.0
4,Arabica,Ethiopia,guji-hambela,2014,Washed / Wet,8.25,8.50,8.25,8.50,8.33,10.00,0.12,0,2,2016,2075.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1334,Robusta,Ecuador,"san juan, playas",2016,,7.75,7.58,7.33,7.58,7.83,10.00,0.00,0,1,2017,
1335,Robusta,Ecuador,"san juan, playas",2016,,7.50,7.67,7.75,7.75,5.25,10.00,0.00,0,0,2017,40.0
1336,Robusta,United States,"kwanza norte province, angola",2014,Natural / Dry,7.33,7.33,7.17,7.42,7.17,9.33,0.00,0,6,2015,795.0
1337,Robusta,India,,2013,Natural / Dry,7.42,6.83,6.75,7.17,7.00,9.33,0.10,20,1,2015,



*Table 1.1- Preview of the coffee dataset* 

Our dataset is a *.csv* file so we can use the *read_csv* function and filter it to only provide columns relevant to our criteria. 


We can also gain a better understanding of our raw data's spread and median values by collecting some summary statistics.

In [None]:
summary_table = coffee.describe()[1:]
summary_table

Unnamed: 0,Aroma,Flavor,Aftertaste,Acidity,Balance,Uniformity,Moisture,Category.One.Defects,Category.Two.Defects,altitude_mean_meters
mean,7.566706,7.520426,7.401083,7.535706,7.518013,9.834877,0.088379,0.479462,3.556385,1775.030545
std,0.37756,0.398442,0.404463,0.379827,0.408943,0.554591,0.048287,2.549683,5.312541,8668.62608
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,7.42,7.33,7.25,7.33,7.33,10.0,0.09,0.0,0.0,1100.0
50%,7.58,7.58,7.42,7.58,7.5,10.0,0.11,0.0,2.0,1310.64
75%,7.75,7.75,7.58,7.75,7.75,10.0,0.12,0.0,4.0,1600.0
max,8.75,8.83,8.67,8.75,8.75,10.0,0.28,63.0,55.0,190164.0


*Table 1.2 - A summary table displaying a statistical analysis of the entire dataset*

In [None]:
arabica = len(coffee[coffee['Species']=="Arabica"])
robusta = len(coffee[coffee['Species']=="Robusta"])
print(arabica,robusta)

1311 28


Since the coffee beans come in two species, we can get a sense of how evenly *Arabica* and *Robusta* are represented in comparison to one another.

In [None]:
chart_1= alt.Chart(coffee).mark_bar().encode(x=alt.X("Species"), y=alt.Y("count()"),color=alt.Color("Species",scale=alt.Scale(scheme="rainbow")), tooltip=['Moisture','Aftertaste','Region', 'Country.of.Origin']).interactive()
chart_1

*Chart 1.1 - A bar chart displaying the distribution of species of beans in the dataset*

As shown above, there is a much greater number of *Arabica* beans compared to *Robusta*, which gives us a sense that there is a high probability our ideal bean will be of the species *Arabica*.

# **Methods**

We have a set of criteria, based on research, that gives an optimal range for each of the variables we are using to answer our question. Afterwards, based on the ranked priorities, we will construct a cost function. The most versatile bean will be identified by this custom cost function, which weights the different characteristics according to our preferences. The function takes a weighted sum of all these factors, and the bean that maximizes it is the one we identify as the best relative to all other beans in the subset. These criteria can be seen below:

**Defined Ranges for High Coffee Quality:**

Category 1/2 Defects (number):

*The fewer defects, the higher the quality*

Mean Altitude (meters): 

*A minimum of 1200 meters is needed for higher quality; generally, the higher the altitude the better (Fleisher 2017)*

Flavour(scale 0-10):

*Greater than or equal to 8*

Processing method (type):

*The washed process is the best method as it is “able to highlight the true character of single origin beans like no other process” (Jack Mormon Coffee Co. 2021)*


Expiration (years before expiration): 

*The longer it lasts, the more versatile the bean*

Balance(scale 0-10): 

*Greater than or equal to 8*

Acidity(scale 0-10):

*Dependent on desired flavour type; thus, is still desired (greater than 0) but not dependent on score*

Aftertaste(scale 0-10):

*Greater than or equal to 8*

Aroma(scale 0-10): 

*Greater than or equal to 8*

Uniformity(scale 0-10):  

*The more uniform the coffee beans are, the more reliable*


Other variables that might play a role by helping with class identification of the bean:

*Species: (a variable, does not affect prediction)*

*Country of Origin*

*Region*
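
The weighted cost function described in this section is not written out in the report, so the following is only a minimal sketch of one way it could look. The weights, the helper name `quality_score`, and the toy rows are hypothetical placeholders chosen for illustration, not values used in our analysis.

```python
import pandas as pd

# Hypothetical weights: positive for traits to maximise, negative for defects.
WEIGHTS = {
    "Flavor": 0.25,
    "Aroma": 0.20,
    "Aftertaste": 0.20,
    "Balance": 0.15,
    "Uniformity": 0.10,
    "Category.One.Defects": -0.05,
    "Category.Two.Defects": -0.05,
}


def quality_score(df: pd.DataFrame, weights: dict = WEIGHTS) -> pd.Series:
    """Weighted sum of min-max normalised columns; higher is better."""
    cols = list(weights)
    # Replace zero spans with 1 so constant columns do not divide by zero.
    spans = (df[cols].max() - df[cols].min()).replace(0, 1)
    normalised = (df[cols] - df[cols].min()) / spans
    # The weight Series aligns with the DataFrame columns by name.
    return (normalised * pd.Series(weights)).sum(axis=1)


# Toy rows (made up), using the same column names as the coffee data.
toy = pd.DataFrame({
    "Flavor": [8.8, 7.5, 7.0],
    "Aroma": [8.7, 7.6, 7.2],
    "Aftertaste": [8.6, 7.4, 7.1],
    "Balance": [8.4, 7.5, 7.0],
    "Uniformity": [10.0, 10.0, 9.33],
    "Category.One.Defects": [0, 1, 3],
    "Category.Two.Defects": [0, 2, 6],
})
toy["score"] = quality_score(toy)
best_row = toy["score"].idxmax()  # index of the highest-scoring bean
```

Min-max normalisation is used here only so that the weights act on comparable scales; any other scaling would change the ranking, so the choice would need to be justified alongside the weights.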



# **Impacts** 

Our exploratory analysis can help cafes identify which coffee bean of the highest quality to buy and use in order to be the most profitable. It will also give us an idea of the ideal conditions for coffee growth and production, as shown by the region and country of origin.

**Further Questions**

This could lead to further questions about the costs of high-quality coffee production, and about which affordable coffee bean is the most resilient and can withstand less-than-ideal conditions.



We can start by importing the packages that we need to perform our analysis. The first step is to scale all the variables so that they are comparable with one another. To do this, we make a preprocessor with our desired variables. Note that we are not using Mean Altitude yet. Since higher Mean Altitude is preferred, we will generate our prediction using the model and, at the very end, assess whether our result makes sense based on the mean altitude of the predicted region.

# **Scaling the Data**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn import cluster, datasets, metrics
from sklearn.cluster import KMeans


preprocessor = make_column_transformer(
    (StandardScaler(), ["Aroma", "Flavor","Aftertaste", "Acidity","Balance",	"Uniformity"])
)

preprocessor

We have made a preprocessor that will scale our numerical predictors. Now we can make a new object containing our scaled data by fitting our original dataset to the preprocessor made above and specifying the columns that we are using. Using dropna(), rows containing NaN values are dropped so that we only have numeric values.
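
To make the scaling step concrete, here is a small sketch of what *StandardScaler* does to a single column: it subtracts the column mean and divides by the population standard deviation. The four scores below are made up for illustration and are not rows from our dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scores (made up); the notebook itself scales six cupping-score columns.
scores = np.array([[7.42], [7.58], [7.75], [8.67]])

scaled = StandardScaler().fit_transform(scores)

# Same computation by hand: StandardScaler uses the population
# standard deviation (ddof=0), which is NumPy's default for .std().
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)
```

After scaling, each column has mean 0 and standard deviation 1, which is why a value of about 3 later stands for a score roughly three standard deviations above average.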

In [None]:
# X_train = coffee_train["Species", "Country.of.Origin", "Region"]
# y_train = coffee_train["Aroma", "Flavor", "Aftertaste", "Acidity", "Balance	Uniformity",	"Moisture", "Category.One.Defects",	"Category.Two.Defects",	"altitude_mean_meters", "Expiration"]

coffee_scaled = pd.DataFrame(preprocessor.fit_transform(coffee),columns=["Aroma", "Flavor","Aftertaste", "Acidity","Balance",	"Uniformity"])
coffee_scaled = coffee_scaled.dropna() 



#split data into training and testing set 


After having scaled our data, we will split it into both a training and test set. 70% training and 30% testing is a reasonable split because it can allow us to train our model using the training set and then assess the accuracy using the test set.  
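
As a rough illustration of the 70/30 split described above, the sketch below uses *train_test_split* on a small made-up frame. The frame `toy` and the seed value are assumptions for the example only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the scaled coffee data (values are made up).
toy = pd.DataFrame({
    "Aroma": np.linspace(7.0, 8.8, 20),
    "Flavor": np.linspace(7.1, 8.9, 20),
})

# test_size=0.30 gives the 70/30 split described in the text;
# random_state is an arbitrary seed so the split is repeatable.
toy_train, toy_test = train_test_split(toy, test_size=0.30, random_state=1234)
```

The two pieces share no rows, so anything measured on `toy_test` comes from observations the model never saw during training.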

The first method we will be using to answer our question is clustering. If we make clusters based on the most desired values of our criteria, we can narrow down the selection of beans so that the most desirable cluster is the one most likely to contain the most profitable bean. Our reference points for this will be flavour, aroma, aftertaste and acidity, as these are the variables that we would like to maximize for the highest-quality and most profitable bean to be used and served in a cafe. We will only be using the 70% training data when performing the clustering so that we do not bias our model. Then we will extract the most desirable clusters based on the best k-values.

# **Clustering**

In [None]:
coffee_ks=pd.DataFrame({"k":range(1, 70)})
coffee_clustering= coffee_ks.assign(models=coffee_ks["k"].apply(lambda x: KMeans(n_clusters=x, random_state=1234).fit(coffee_scaled)))
coffee_clustering



Unnamed: 0,k,models
0,1,"KMeans(n_clusters=1, random_state=1234)"
1,2,"KMeans(n_clusters=2, random_state=1234)"
2,3,"KMeans(n_clusters=3, random_state=1234)"
3,4,"KMeans(n_clusters=4, random_state=1234)"
4,5,"KMeans(n_clusters=5, random_state=1234)"
...,...,...
64,65,"KMeans(n_clusters=65, random_state=1234)"
65,66,"KMeans(n_clusters=66, random_state=1234)"
66,67,"KMeans(n_clusters=67, random_state=1234)"
67,68,"KMeans(n_clusters=68, random_state=1234)"


We tested values of k up to about 70 because this seemed a reasonable range: it is wide enough to show where adding clusters stops helping, without going so far that the clusters simply overfit the available data. We will perform clustering on the 2-3 top values of k. For each cluster we obtain, we will split the data once again into 80% training and 20% test sets, so that we train our model and perform cross-validation using only the training set and evaluate accuracy on the 20% of test data.

In [None]:
coffee_model_stats=coffee_clustering.assign(model_statistics=coffee_clustering["models"].apply(lambda x: x.inertia_))
coffee_model_stats

Unnamed: 0,k,models,model_statistics
0,1,"KMeans(n_clusters=1, random_state=1234)",8034.000000
1,2,"KMeans(n_clusters=2, random_state=1234)",5716.531503
2,3,"KMeans(n_clusters=3, random_state=1234)",3722.327865
3,4,"KMeans(n_clusters=4, random_state=1234)",2961.493466
4,5,"KMeans(n_clusters=5, random_state=1234)",2422.233044
...,...,...,...
64,65,"KMeans(n_clusters=65, random_state=1234)",596.672375
65,66,"KMeans(n_clusters=66, random_state=1234)",589.187002
66,67,"KMeans(n_clusters=67, random_state=1234)",584.218781
67,68,"KMeans(n_clusters=68, random_state=1234)",583.350012


Next, K-means clustering is tested with roughly 1 to 70 clusters to see how the inertia changes with k. Because inertia always shrinks as clusters are added, we do not look for the smallest value; instead we create an elbow plot, from which we can visually choose the value of k after which the reduction in inertia levels off.


In [None]:
coffee_plot=alt.Chart(coffee_model_stats[["k","model_statistics"]]).mark_line(point=True).encode(x="k",y="model_statistics")
coffee_plot

*Chart 2.1 - An elbow plot displaying which k value would be the best number of clusters to use on our dataset*

From the elbow plot, the graph has its most prominent "elbow" at k=5, so 5 clusters is the most suitable choice.
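
The elbow can also be inspected numerically by looking at how much inertia each extra cluster removes. The sketch below does this on synthetic blobs rather than on the coffee data; the dataset, the range of k, and the name `gains` are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups (not the coffee data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=1234)

# Fit one K-means model per candidate k and record its inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1234).fit(X)
    inertias[k] = model.inertia_

# Drop in inertia gained by adding one more cluster; the elbow is
# roughly where these gains become small compared with earlier ones.
gains = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
```

Reading the elbow from `gains` is still a judgement call, but it makes the choice easier to defend than the plot alone.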

In [None]:
coffee_cluster_k5 = KMeans(n_clusters=5, random_state=1234)
tidy_coffee_cluster_k5 = coffee_cluster_k5.fit(coffee_scaled).predict(coffee_scaled)
tidy_coffee_cluster_k5_df = coffee_scaled.assign(cluster=tidy_coffee_cluster_k5)
tidy_coffee_cluster_k5_df["Species"]=coffee["Species"]
tidy_coffee_cluster_k5_df["Country of Origin"]=coffee["Country.of.Origin"]
tidy_coffee_cluster_k5_df["Region"]=coffee["Region"]
tidy_coffee_cluster_k5_df



Unnamed: 0,Aroma,Flavor,Aftertaste,Acidity,Balance,Uniformity,cluster,Species,Country of Origin,Region
0,2.923259,3.287965,3.138457,3.198164,2.206476,0.29785,3,Arabica,Ethiopia,guji-hambela
1,3.135225,2.886251,2.717990,2.750424,2.206476,0.29785,3,Arabica,Ethiopia,guji-hambela
2,2.260865,2.459430,2.520123,2.329022,2.206476,0.29785,3,Arabica,Guatemala,
3,1.598472,2.660287,2.520123,2.329022,1.790615,0.29785,3,Arabica,Ethiopia,oromia
4,1.810438,2.459430,2.099656,2.539723,1.986314,0.29785,3,Arabica,Ethiopia,guji-hambela
...,...,...,...,...,...,...,...,...,...,...
1334,0.485650,0.149574,-0.175812,0.116661,0.763194,0.29785,4,Robusta,Ecuador,"san juan, playas"
1335,-0.176744,0.375538,0.862989,0.564400,-5.548106,0.29785,0,Robusta,Ecuador,"san juan, playas"
1336,-0.627172,-0.478104,-0.571545,-0.304742,-0.851325,-0.91070,0,Robusta,United States,"kwanza norte province, angola"
1337,-0.388710,-1.733461,-1.610346,-0.963182,-1.267185,-0.91070,0,Robusta,India,


*Table 2.1 - The coffee dataset scaled using our StandardScaler() and assigned a cluster based on the k value found from our elbow plot*


After plotting different combinations of our standardised variables to create a scatter matrix, we can see that there are a total of 5 clusters. By colour-coding each cluster, it is most evident that cluster 3, coloured in magenta, is most often identified towards the top right, which indicates the highest values of our variables. Since the majority of the variables require points higher than or equal to 8 as determined by the criteria, cluster 3 will most likely contain the bean that scores highest on all our criteria and therefore be the most profitable.  


In [None]:
tidy_coffee_cluster_k5_plot=alt.Chart(tidy_coffee_cluster_k5_df).mark_point(opacity=0.25).encode(
    x=alt.X(alt.repeat("row"),type="quantitative"),
    y=alt.Y(alt.repeat("column"),type="quantitative"),
    color=alt.Color("cluster",scale=alt.Scale(scheme="accent")),
   tooltip =["Species","Country of Origin","Region","Aroma", "Flavor","Aftertaste", "Acidity","Balance",	"Uniformity","cluster"]).repeat(
        row=["Aroma", "Flavor","Aftertaste", "Acidity"],
        column=["Balance",	"Uniformity"]
    ).interactive()

tidy_coffee_cluster_k5_df_untouched=tidy_coffee_cluster_k5_df

tidy_coffee_cluster_k5_plot

*Chart 2.2 - A scatter matrix displaying each of the five clusters plotted according to their numerical predictors*
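
The visual choice of the best cluster can be cross-checked by averaging the standardised scores within each cluster. This is a minimal sketch on made-up values; in the notebook the same idea would apply to the cluster-labelled data frame and all six score columns.

```python
import pandas as pd

# Toy standardised scores with cluster labels (values are made up).
toy = pd.DataFrame({
    "Aroma":   [2.1, 1.8, -0.2, 0.1, -1.5, -1.2],
    "Flavor":  [2.4, 2.0,  0.0, 0.3, -1.7, -1.0],
    "cluster": [3,   3,    1,   1,    0,    0],
})

# Mean standardised score per cluster for each variable.
cluster_means = toy.groupby("cluster")[["Aroma", "Flavor"]].mean()

# Average across variables and rank clusters from best to worst.
overall = cluster_means.mean(axis=1).sort_values(ascending=False)
best_cluster = overall.index[0]
```

A cluster that ranks first here is the one sitting furthest towards the top right of the scatter matrix on average.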

In [None]:
tidy_coffee_cluster_k5_df=tidy_coffee_cluster_k5_df[tidy_coffee_cluster_k5_df["cluster"]==3]
tidy_coffee_cluster_k5_df=tidy_coffee_cluster_k5_df.dropna()

Based on the data points available in the cluster, we will run a KNN classification analysis to find the bean with the most desirable traits. We start by setting up our classifier and forming a new pipeline from our preprocessor and the KNN classifier. Afterwards, a GridSearch is set up over k values between 1 and 69, and we fit the GridSearch model to the training data. Plotting the results gives a line plot of each k value in our range against its mean test score.

# **KNN-Classification Analysis**

Moving on to the next part of our analysis, we will perform the KNN-Classification based on the k nearest neighbours to predict the classification of our most ideal bean. 

We begin by setting a *random seed* so that our steps are reproducible. Then we create a pipeline to fit our training data into, using both our preprocessor for the scaled data and the *KNeighborsClassifier*.
Once again, we will test values of k up to about 70 and perform 5-fold cross-validation, which keeps the computation manageable.



In [None]:
# Assessing the best k for the KNN classifier
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(1234)

coffee_train, coffee_test = train_test_split(tidy_coffee_cluster_k5_df,test_size=0.25)
X_train = coffee_train[["Aroma", "Flavor", "Aftertaste", "Acidity", "Balance", "Uniformity"]]
y_train = coffee_train[["Species", "Country of Origin", "Region"]]
param_grid = { "kneighborsclassifier__n_neighbors": range(1, 70)}

pipeline = make_pipeline(preprocessor,KNeighborsClassifier())
# v_fold_score = cross_validate(estimator=pipeline, X=X_train, y=coffee_train["Country of Origin"], cv=5, return_train_score=True)

knn_tune_grid = GridSearchCV(
     pipeline, param_grid=param_grid, cv=5
 )


knn_model_grid = knn_tune_grid.fit(X_train, coffee_train["Country of Origin"])
accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)

cross_val_plot = (
     alt.Chart(accuracies_grid, title="Grid Search")
     .mark_line(point=True)
     .encode(
         x=alt.X(
             "param_kneighborsclassifier__n_neighbors",
             title="Neighbors",
             scale=alt.Scale(zero=False),
         ),
         y=alt.Y(
             "mean_test_score", 
            title="Mean Test Score", 
             scale=alt.Scale(zero=False)
         ),
     )
     .configure_axis(labelFontSize=10, titleFontSize=15)
     .properties(width=400, height=300)
 )
cross_val_plot





*Chart 3.2 - A GridSearch generated plot that demonstrates the optimal k value*

From the line plot above, a k value of 10 is the most accurate, as it is the point with the greatest mean test score. In the next sections, we will be using k=10 to generate our predicted most profitable bean.
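
Rather than reading the best k off the plot, it can also be taken directly from the fitted grid search through its `best_params_` and `best_score_` attributes. The sketch below uses the iris dataset that ships with scikit-learn purely as a stand-in; the data and the range of k are assumptions, not our coffee data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris is used only as a stand-in dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": range(1, 31)}

# 5-fold cross-validation over every candidate k.
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5).fit(X, y)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
best_score = grid.best_score_  # mean cross-validated accuracy at best_k
```

Taking k from `best_params_` avoids a mismatch between the value read from the chart and the value later typed into the classifier.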

As shown below, the data is very scattered and widely spread. This suggests that the accuracy of our model will likely be a lot lower than expected, simply because of how widespread our data points are.

In [None]:
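# Count the cluster-3 beans from a few major producing countries.
# (The dataset spells the country "Colombia", and Ethiopia is a country, not a region.)
tidy_coffee_cluster_k5_df[tidy_coffee_cluster_k5_df["Country of Origin"]=="Ethiopia"].count()
tidy_coffee_cluster_k5_df[tidy_coffee_cluster_k5_df["Country of Origin"]=="Colombia"].count()
tidy_coffee_cluster_k5_df[tidy_coffee_cluster_k5_df["Country of Origin"]=="Brazil"].count()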
alt.Chart(tidy_coffee_cluster_k5_df).mark_point(opacity=0.25).encode(
    x=alt.X(alt.repeat("row"),type="quantitative"),
    y=alt.Y(alt.repeat("column"),type="quantitative"),
    color=alt.Color("Country of Origin"),
   tooltip =["Species","Country of Origin","Region","Aroma", "Flavor","Aftertaste", "Acidity","Balance",	"Uniformity","cluster"]).repeat(
        row=["Aroma", "Flavor","Aftertaste", "Acidity"],
        column=["Balance",	"Uniformity"]
    ).interactive()


*Chart 3.3 - A plot that shows us what the most ideal bean would be according to the highest possible values of our numerical predictors, colour-coded based on the country of origin*

# **Best Bean Prediction**

The ideal bean would score highest on all the variables, so the new observation we predict for scores 3, approximately the highest value after standardizing the data, in every category. The class identification will be based on region and species, since the country of origin is determined by the region.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_spec = KNeighborsClassifier(n_neighbors=18) 

np.random.seed(1234)

X_train = coffee_train[["Aroma", "Flavor", "Aftertaste", "Acidity", "Balance", "Uniformity"]]
y_train = coffee_train["Region"]  # 1-D target avoids sklearn's column-vector warning

knn_fit = knn_spec.fit(X_train, y_train)
new_obs = pd.DataFrame([[3,3,3,3,3,3]], columns=["Aroma", "Flavor", "Aftertaste", "Acidity", "Balance", "Uniformity"]
)
class_prediction = knn_fit.predict(new_obs)
class_prediction



array(['antioquia'], dtype=object)

As shown above, the KNN classification analysis with 18 nearest neighbours (the value of *n_neighbors* set in the cell above) predicts that a bean of the species Arabica grown in the region of Antioquia, which is in Colombia, will be the most profitable bean.
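
Because the ideal observation sits at the very edge of the data, it can help to look at which training beans actually cast the votes behind a prediction. The sketch below does this with the classifier's `kneighbors` method on a made-up toy set; the values, labels, and choice of 3 neighbours are illustrative assumptions.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up): standardised scores and a region label.
X_toy = pd.DataFrame({
    "Aroma":  [2.5, 2.2, 2.0, 0.5, 0.3, 0.1],
    "Flavor": [2.6, 2.1, 1.9, 0.4, 0.2, 0.0],
})
y_toy = pd.Series(["antioquia", "antioquia", "huila", "kona", "kona", "huila"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# The "ideal" bean: a high standardised score on every variable.
ideal = pd.DataFrame([[3, 3]], columns=["Aroma", "Flavor"])

prediction = knn.predict(ideal)[0]
distances, indices = knn.kneighbors(ideal)  # nearest first
neighbour_labels = y_toy.iloc[indices[0]].tolist()
```

If the listed neighbours are far from the query or split almost evenly between regions, the predicted region should be treated with caution.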



---

# **Discussion**

  Our predictive question was which bean is the most profitable, which we defined by criteria based on research; the bean with the most desired traits was predicted and identified by species, region and country of origin. Although this made it easier to quantify how the bean was determined, and is a more holistic and realistic approach to the complex question of 'quality', the sheer number of variables used to rank the beans could have introduced error into our model compared to using 1 or 2, which would have less noise and overall error.

  First we performed a cluster analysis to group the beans into clusters based on how high they scored in each of the variables that require the greatest scores. This was done by first generating an elbow plot and then determining the best number of clusters based on the k value at the most prominent elbow in inertia. From this we found the cluster most likely to contain the most profitable bean by noting which cluster sits furthest towards the top right, meaning it has high standardized scores for both of the plotted variables. A possible limitation of the clustering method is that the algorithm assumes roughly spherical clusters with a roughly equal number of data points; of course, this is an ideal and is not always the case.

Next, using only the data points available in that cluster, we zoned in on our desired parameters and generated a GridSearch line plot that would indicate the most ideal k value within the range of 1-70 to perform a KNN classification analysis and predict based on the nearest neighbours. A potential limitation is that the KNN classifier has a difficult time extrapolating and predicting beyond the given set of data.

Our end result was a bean classified with the species "Arabica" and the region "Antioquia", which is in the country of origin "Colombia". This result is plausible, since a number of sources describe Colombia as the most popular coffee-selling country. Additionally, according to the government of Antioquia, the region has elevations up to 1360 m, which is above the desired 1200 meters outlined in our criteria, so this prediction is reasonable. Lastly, according to Perk Coffee Singapore, Arabica beans are often considered 'superior' in taste compared to Robusta.

Our model could allow cafes to make the best decision about where exactly their coffee beans are sourced from and which species is the most reliable. This could allow them to make the most profit and potentially expand into multi-million-dollar corporations in the future, as well as promote creativity in the world of coffee and innovation in our staple beverages. The exploration could also demonstrate the optimal conditions for the highest-quality coffee bean to grow, all of which could be inferred from the region and country of its harvest. In the future we could consider the economic factors that make cafes profitable based on the bean, and create a model that explores the success of a business based on the main demographic of its region.

**Assessing the accuracy of our model**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_spec = KNeighborsClassifier(n_neighbors=15)

X_test = coffee_test[["Aroma", "Flavor", "Aftertaste", "Acidity", "Balance", "Uniformity"]]
y_test = coffee_test[["Region"]]

knn_fit = knn_spec.fit(X_train, y_train)
# coffee_test=coffee_test.reset_index()
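# Reuse the test set's index so that join() lines predictions up with the right rows.
class_prediction = pd.DataFrame(knn_fit.predict(X_test), columns=["Region_pred"], index=coffee_test.index)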
coffee_join=coffee_test.join(class_prediction)
coffee_join

# correct_species_preds = coffee_join[
#     coffee_join['Species'] == coffee_join['Species_pred']
# ]
# Species_accuracy= correct_species_preds.shape[0]/coffee_join.shape[0]
# Species_accuracy



Unnamed: 0,level_0,index,Aroma,Flavor,Aftertaste,Acidity,Balance,Uniformity,cluster,Species,Country of Origin,Region,Region_pred
0,0,36,1.360010,1.630894,1.481323,1.012140,0.763194,0.29785,3,Arabica,United States (Hawaii),kona,apaneca
1,1,108,0.273684,0.777252,1.481323,0.116661,0.983356,0.29785,3,Arabica,Taiwan,natou county,huila
2,2,63,0.697616,1.204073,1.481323,0.564400,1.594916,0.29785,3,Arabica,El Salvador,"ataco, apaneca - ilamatepec mountain range",kona
3,3,1315,2.022404,0.777252,1.060856,0.564400,0.567495,0.29785,3,Robusta,Uganda,luwero central region,sul de minas - carmo de minas
4,4,196,0.697616,0.777252,0.442522,0.564400,0.567495,0.29785,3,Arabica,China,yunnan,huila
...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,61,159,0.936078,0.576395,0.665122,0.564400,0.567495,0.29785,3,Arabica,Uganda,eastern uganda,huila
62,62,202,0.697616,0.576395,0.862989,0.564400,0.567495,0.29785,3,Arabica,"Tanzania, United Republic Of",mbeya,huila
63,63,47,0.485650,1.003216,1.060856,1.433542,1.594916,0.29785,3,Arabica,Colombia,tolima,kona
64,64,110,0.485650,1.003216,1.060856,0.564400,1.179055,0.29785,3,Arabica,Colombia,huila,kona


In [None]:
correct_region_preds = coffee_join[
    coffee_join['Region'] == coffee_join['Region_pred']
]
Region_accuracy= correct_region_preds.shape[0]/coffee_join.shape[0]
Region_accuracy

0.09090909090909091

# **Accuracy**

As shown above, we tested the accuracy of our model and obtained a high predicted accuracy for species (Arabica) of 98%. This makes sense, as a large majority of our available data points were Arabica. For Region, however, we obtained a much lower accuracy of approximately 9%. A number of reasons could have caused this, the most probable being that we used a large number of variables and their combinations in coming up with our final prediction. This could have allowed more 'noise' from the quantities we worked with, and an accumulation of standard deviation and error. It is important to note that although our variables have been quantified on a point scale, they are still largely subjective. For instance, "Flavour" and "Aroma" might differ from person to person. Additionally, some people may prefer coffee beans with lower acidity and higher aftertaste, which is difficult to model without our earlier assumption that every variable should rank high.
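
To put the species accuracy in context, it helps to compare it with a baseline that always answers "Arabica" without using any model. The small calculation below uses the species counts printed earlier in this notebook for the full dataset; the variable name `baseline_accuracy` is our own.

```python
# Species counts printed earlier in this notebook (full dataset).
arabica, robusta = 1311, 28

# Accuracy of a rule that always predicts "Arabica", with no model at all.
baseline_accuracy = arabica / (arabica + robusta)
print(f"{baseline_accuracy:.3f}")  # prints 0.979
```

A species accuracy near 98% is therefore about what this trivial rule would achieve on the full dataset, so it says little about how well the model separates the two species.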

Overall, however, we are satisfied with the approach of our model, and in the future we would likely use a maximum of two variables to predict the class of the most profitable coffee bean. This would make the model simpler and less overwhelming when prioritizing and ranking the beans. All in all, other sources support the coffee bean from Antioquia, Colombia, of the species Arabica, being the highest quality and therefore the most profitable.

COST FUNCTION : INDIVIDUAL PREFERENCES
K CLASSIFIER : TO FIND THE VERSATILE OR THE BEST BEAN, REGION, COUNTRY
CLUSTER : TO NARROW A RANGE OF BEANS IN CASE THE BEST BEAN IS NOT AVAILABLE 


Bibliography: 


---



  M. Ridder. “Global Coffee Consumption 2020/21”, Statista, 2023. Available at:
https://www.statista.com/statistics/292595/global-coffee-consumption/ (Accessed: March 10, 2023). 


  Fleisher, Judy. “The Effect of Altitude on Coffee Flavor.” Scribblers Coffee Co., 16 Oct. 2017.


  Jack Mormon Coffee Co. “The types of coffee processing”, 2021. Available at: https://jackmormoncoffee.com/blogs/news/the-types-of-coffee-processing#:~:text=The%20washed%20process%20is%20able,produces%20the%20highest%20quality%20coffees. (Accessed: March 9, 2023).  


  Pashley, Tom. “Roaster Guide: Why Is Green Bean Moisture Content Important?” Perfect Daily Grind, Perfect Daily Grind, 13 July 2021.  


  “Coffee Regions: A Quick Guide to Coffee Growing Regions.” Halo Coffee, https://halo.coffee/blogs/discover/coffee-regions-a-quick-guide-to-coffee-growing-regions#:~:text=Globally%2C%20there%20are%20three%20primary,as%20the%20%22Bean%20Belt%22.  

  “Geografía.” Government of Antioquia, 4 Mar. 2016, web.archive.org/web/20160304032148/antioquia.gov.co/index.php/registrar/9790-geografia. Accessed 9 Apr. 2023. 

  Perk Coffee. “Arabica Beans vs Robusta Beans. What’s the Difference?” Perk Coffee Singapore, 8 Nov. 2017, perkcoffee.co/sg/arabica-beans-vs-robusta-beans-whats-difference/#:~:text=Despite%20containing%20less%20caffeine%20than. 


