Codecademy ML Fundamentals Portfolio Project

This project uses data from the Great American Coffee Taste Test, in which participants were sent four coffees and asked to record their impressions of each one. My goal is to use the data in reverse: to see whether the tasters' feedback can be used to classify the four different coffees.

Dataset is here: https://bit.ly/gacttCSV+

A YouTube video with some explanation of the test is here: https://www.youtube.com/watch?v=1fN_z4-EcOU



In [115]:
# Import Libraries

import pandas as pd
from sklearn.cluster import KMeans



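# Import the dataset from the CSV file, which has been downloaded and added to the repo:
coffee = pd.read_csv('GACTT_RESULTS_ANONYMIZED_v2.csv')
# Drop columns 1-65, keeping 'Submission ID' and everything from the coffee-strength question onward
coffee = coffee.drop(coffee.columns[1:66], axis=1)
# Drop the fourth column recorded for each coffee (positions 8, 12, 16, 20),
# keeping only bitterness, acidity, and personal preference
coffee = coffee.drop(coffee.columns[[8, 12, 16, 20]], axis=1)
# Keep only the first 20 remaining columns
coffee = coffee.drop(coffee.columns[20:], axis=1)
# Drop every row that has a missing value
coffee = coffee.dropna()
print(coffee.columns)
print(coffee.head())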

Index(['Submission ID', 'How strong do you like your coffee?',
       'What roast level of coffee do you prefer?',
       'How much caffeine do you like in your coffee?',
       'Lastly, how would you rate your own coffee expertise?',
       'Coffee A - Bitterness', 'Coffee A - Acidity',
       'Coffee A - Personal Preference', 'Coffee B - Bitterness',
       'Coffee B - Acidity', 'Coffee B - Personal Preference',
       'Coffee C - Bitterness', 'Coffee C - Acidity',
       'Coffee C - Personal Preference', 'Coffee D - Bitterness',
       'Coffee D - Acidity', 'Coffee D - Personal Preference',
       'Between Coffee A, Coffee B, and Coffee C which did you prefer?',
       'Between Coffee A and Coffee D, which did you prefer?',
       'Lastly, what was your favorite overall coffee?'],
      dtype='object')
   Submission ID How strong do you like your coffee?  \
15        Zd694B                              Medium   
16        QAeYZY                     Somewhat strong   
17        QA5JY

In [116]:
# Re-format the data for the purpose of using the ratings to classify the coffees

# Extract relevant columns for each coffee
coffee_a_cols = ['Coffee A - Bitterness', 'Coffee A - Acidity', 'Coffee A - Personal Preference']
coffee_b_cols = ['Coffee B - Bitterness', 'Coffee B - Acidity', 'Coffee B - Personal Preference']
coffee_c_cols = ['Coffee C - Bitterness', 'Coffee C - Acidity', 'Coffee C - Personal Preference']
coffee_d_cols = ['Coffee D - Bitterness', 'Coffee D - Acidity', 'Coffee D - Personal Preference']

# Create a new DataFrame with four rows (one per coffee) for each Submission ID
new_data = []
for _, row in coffee.iterrows():
    for label, cols in zip(['A', 'B', 'C', 'D'], [coffee_a_cols, coffee_b_cols, coffee_c_cols, coffee_d_cols]):
        new_row = {
            'Submission ID': row['Submission ID'],
            'Bitterness': row[cols[0]],
            'Acidity': row[cols[1]],
            'Personal Preference': row[cols[2]],
            'Lastly, how would you rate your own coffee expertise?': row['Lastly, how would you rate your own coffee expertise?'],
            'How strong do you like your coffee?': row['How strong do you like your coffee?'],
            'What roast level of coffee do you prefer?': row['What roast level of coffee do you prefer?'],
            'How much caffeine do you like in your coffee?': row['How much caffeine do you like in your coffee?'],
            'Between Coffee A, Coffee B, and Coffee C which did you prefer?': row['Between Coffee A, Coffee B, and Coffee C which did you prefer?'],
            'Between Coffee A and Coffee D, which did you prefer?': row['Between Coffee A and Coffee D, which did you prefer?'],
            'Lastly, what was your favorite overall coffee?': row['Lastly, what was your favorite overall coffee?'],
            'Label': label
        }
        new_data.append(new_row)

# Create the new DataFrame
coffee = pd.DataFrame(new_data)


# Print the new DataFrame
print(coffee.head())


  Submission ID  Bitterness  Acidity  Personal Preference  \
0        Zd694B         1.0      1.0                  1.0   
1        Zd694B         1.0      1.0                  1.0   
2        Zd694B         1.0      1.0                  1.0   
3        Zd694B         1.0      1.0                  1.0   
4        QAeYZY         3.0      3.0                  3.0   

   Lastly, how would you rate your own coffee expertise?  \
0                                               10.0       
1                                               10.0       
2                                               10.0       
3                                               10.0       
4                                                7.0       

  How strong do you like your coffee?  \
0                              Medium   
1                              Medium   
2                              Medium   
3                              Medium   
4                     Somewhat strong   

  What roast level of cof
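
The same wide-to-long reshape can also be done without `iterrows`, by slicing one block of columns per coffee and stacking the blocks. The sketch below runs on a tiny made-up frame (the two tasters and their ratings are hypothetical, not taken from the dataset) and only illustrates the idea. Unlike the loop above, it orders rows by coffee rather than by taster.

```python
import pandas as pd

# Toy stand-in for the survey data: two tasters, two coffees (hypothetical values).
wide = pd.DataFrame({
    "Submission ID": ["s1", "s2"],
    "Coffee A - Bitterness": [1.0, 3.0],
    "Coffee A - Acidity": [2.0, 4.0],
    "Coffee B - Bitterness": [5.0, 2.0],
    "Coffee B - Acidity": [3.0, 1.0],
})

# Build one block of rows per coffee, then stack the blocks.
blocks = []
for label in ["A", "B"]:
    prefix = f"Coffee {label} - "
    rating_cols = [c for c in wide.columns if c.startswith(prefix)]
    block = wide[["Submission ID"] + rating_cols].copy()
    # Strip the "Coffee X - " prefix so every block shares the same column names.
    block.columns = ["Submission ID"] + [c[len(prefix):] for c in rating_cols]
    block["Label"] = label
    blocks.append(block)

long_df = pd.concat(blocks, ignore_index=True)
print(long_df)
```

Each taster ends up with one row per coffee, and the `Label` column records which coffee the ratings belong to.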

Split the categorical data into dummy variables with one-hot encoding:

In [117]:
encoded_columns = coffee.columns[5:11]
coffee = pd.get_dummies(coffee, columns=encoded_columns)
print(coffee.columns)


Index(['Submission ID', 'Bitterness', 'Acidity', 'Personal Preference',
       'Lastly, how would you rate your own coffee expertise?', 'Label',
       'How strong do you like your coffee?_Medium',
       'How strong do you like your coffee?_Somewhat light',
       'How strong do you like your coffee?_Somewhat strong',
       'How strong do you like your coffee?_Very strong',
       'How strong do you like your coffee?_Weak',
       'What roast level of coffee do you prefer?_Blonde',
       'What roast level of coffee do you prefer?_Dark',
       'What roast level of coffee do you prefer?_French',
       'What roast level of coffee do you prefer?_Italian',
       'What roast level of coffee do you prefer?_Light',
       'What roast level of coffee do you prefer?_Medium',
       'What roast level of coffee do you prefer?_Nordic',
       'How much caffeine do you like in your coffee?_Decaf',
       'How much caffeine do you like in your coffee?_Full caffeine',
       'How much caffeine d
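
As a small illustration of what `pd.get_dummies` does to a single categorical column, here is a sketch on a made-up frame. The short column name `Roast` and the three rows are assumptions for the example, not part of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame: one numeric rating and one categorical answer.
toy = pd.DataFrame({
    "Bitterness": [1.0, 3.0, 5.0],
    "Roast": ["Light", "Dark", "Light"],
})

# Each distinct category becomes its own indicator column named "<column>_<category>".
encoded_toy = pd.get_dummies(toy, columns=["Roast"])
print(encoded_toy.columns.tolist())
print(encoded_toy)
```

The original `Roast` column is replaced by one indicator column per category, which is the same pattern as the long `..._Medium`, `..._Dark` column names in the output above.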

Scale the numeric variables to the range 0-1 so that they are on the same scale as the one-hot encoded categorical variables:

In [118]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler_columns = coffee.columns[1:5]
coffee[scaler_columns] = scaler.fit_transform(coffee[scaler_columns])
print(coffee.head(30))

   Submission ID  Bitterness  Acidity  Personal Preference  \
0         Zd694B        0.00     0.00                 0.00   
1         Zd694B        0.00     0.00                 0.00   
2         Zd694B        0.00     0.00                 0.00   
3         Zd694B        0.00     0.00                 0.00   
4         QAeYZY        0.50     0.50                 0.50   
5         QAeYZY        0.50     0.50                 0.50   
6         QAeYZY        0.50     0.50                 0.50   
7         QAeYZY        0.50     0.50                 0.50   
8         QA5JYA        0.50     0.50                 0.50   
9         QA5JYA        0.50     0.50                 0.50   
10        QA5JYA        0.50     0.50                 0.50   
11        QA5JYA        0.50     0.50                 0.50   
12        ylqbBg        0.50     0.50                 0.75   
13        ylqbBg        0.50     0.50                 1.00   
14        ylqbBg        0.75     0.50                 0.50   
15      
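
For reference, `MinMaxScaler` maps each column to (x - min) / (max - min). The sketch below checks that against a manual calculation on a made-up column of ratings from 1 to 5. That range is an assumption, though it is consistent with the 0.00, 0.25, 0.50, 0.75, 1.00 steps in the output above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single column of ratings on a 1-5 scale.
ratings = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Library result.
scaled = MinMaxScaler().fit_transform(ratings)

# Manual result using (x - min) / (max - min), computed per column.
col_min = ratings.min(axis=0)
col_max = ratings.max(axis=0)
manual = (ratings - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the minimum and maximum are taken from the data itself, a column is only stretched to the full 0-1 range if both extremes actually occur in it.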

Now I need to split the data into features and labels

In [119]:
features = coffee.drop(['Label', 'Submission ID'], axis=1)
labels = coffee['Label']

print(features)
print(labels)

       Bitterness  Acidity  Personal Preference  \
0            0.00     0.00                 0.00   
1            0.00     0.00                 0.00   
2            0.00     0.00                 0.00   
3            0.00     0.00                 0.00   
4            0.50     0.50                 0.50   
...           ...      ...                  ...   
14683        0.25     1.00                 0.25   
14684        0.25     0.75                 1.00   
14685        0.75     0.00                 0.50   
14686        0.50     0.50                 0.50   
14687        0.25     0.75                 0.75   

       Lastly, how would you rate your own coffee expertise?  \
0                                               1.000000       
1                                               1.000000       
2                                               1.000000       
3                                               1.000000       
4                                               0.666667       
...

In this section I define the KMeans model, fit it, and predict a cluster for each row:

In [120]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(features)
predictions = kmeans.predict(features)

print(predictions)

[2 2 2 ... 1 1 1]


  super()._check_params_vs_input(X, default_n_init=10)
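
The warning fragment above appears to be the tail of scikit-learn's notice about the `n_init` default, which goes away when `n_init` is set explicitly. K-means also starts from random centroids, so the cluster numbers can change between runs unless `random_state` is fixed. A minimal sketch on synthetic data (the two blobs and the seed values are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (hypothetical data, not the coffee ratings).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])

# n_init is set explicitly; random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = model.fit_predict(X)

print(cluster_ids)
print(model.cluster_centers_)
```

The integers in `cluster_ids` are arbitrary names for the clusters. Nothing ties cluster 0 to Coffee A, which is why the predictions cannot be compared to the labels with plain accuracy.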


Next I need an evaluation metric. Since I do have the true labels (albeit as letters rather than cluster numbers), I use the adjusted Rand score, which compares two groupings regardless of how the groups are named:

In [121]:
from sklearn.metrics import adjusted_rand_score

rand_score = adjusted_rand_score(labels, predictions)

print(rand_score)

-0.00019325724284518787
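
To help read that number: the adjusted Rand score is 1.0 when the clusters match the labels exactly (whatever the clusters are called) and close to 0 when the agreement is no better than chance, which is where the result above lands. A small sketch with made-up labels shows both ends, plus a cross-tabulation, which is one way to see how labels and clusters line up:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth and two hypothetical clusterings.
true_labels = ["A", "A", "B", "B", "C", "C"]
perfect_but_renamed = [2, 2, 0, 0, 1, 1]   # same grouping, different names
all_one_cluster = [0, 0, 0, 0, 0, 0]       # carries no information about the labels

score_perfect = adjusted_rand_score(true_labels, perfect_but_renamed)
score_uninformative = adjusted_rand_score(true_labels, all_one_cluster)
print(score_perfect, score_uninformative)

# Rows are true labels, columns are cluster numbers, cells are counts.
table = pd.crosstab(
    pd.Series(true_labels, name="Label"),
    pd.Series(perfect_but_renamed, name="Cluster"),
)
print(table)
```

One possible reason for the near-zero score in this notebook, offered as a guess rather than a finding: the taster-level answers (expertise, strength, roast, caffeine, favorites) are copied unchanged into all four rows for a taster, so the clusters may be grouping tasters rather than coffees. Running `pd.crosstab(labels, predictions)` on the real data would be a quick way to check.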
