## Milestone 4
Milestone 4 adds more content to what will become the group's final notebook. In addition to your refined problem statement and EDA, the milestone notebook should now include the group's baseline model, its pipeline, and an interpretation of the initial results.

To complete Milestone 4, students must submit a well-organized, markdown-annotated Jupyter notebook with all relevant output visible.
Helper utility `.py` files used by the notebook may also be submitted.

## Problem statement
The Coffee Quality Institute evaluates coffees using expert tasters, who score each coffee on attributes such as acidity, body, and balance, plus one subjective 'overall' score. What contributes to this subjective component? We set out to determine which features lead a coffee to receive a higher rating, and whether other variables, such as coffee origin, also contribute.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# sklearn imports
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [3]:
df = pd.read_csv('merged_data_cleaned.csv')
df = df.iloc[:,1:]

# drop test datapoint
df = df.drop(df[df['Harvest.Year'] == 'TEST'].index)

print("Dataset shape: ",df.shape)

Dataset shape:  (1338, 43)


## Explore and Visualize Data
Conduct exploratory data analysis to understand the underlying patterns and relationships in the data. Visualizations can help identify trends and outliers. Make sure that the EDA you present explains the feature engineering choices you made. When we read through your notebook, we expect to understand why you chose your particular baseline model and why you engineered your features the way you did. This section is a good place to provide that reasoning.
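One way to tie the EDA to feature choices is to tabulate how much of each column is missing and how strongly each numeric column correlates with the target. The sketch below runs on a small synthetic frame, not the coffee data: the column names are borrowed from this dataset, but the values and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (values are made up; only the column names come from the dataset)
rng = np.random.default_rng(0)
n = 300
flavor = rng.normal(7.5, 0.4, n)
toy = pd.DataFrame({
    "Flavor": flavor,
    "Moisture": rng.normal(0.1, 0.03, n),
    "Cupper.Points": 0.8 * flavor + rng.normal(0, 0.2, n),
})
toy.loc[toy.index[:30], "Moisture"] = np.nan  # simulate 10% missing values

# Share of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Correlation of every numeric column with the target, strongest first
target_corr = (
    toy.select_dtypes("number")
    .corr()["Cupper.Points"]
    .drop("Cupper.Points")
    .sort_values(key=np.abs, ascending=False)
)

print(missing_share)
print(target_corr)
```

Applied to the real `df`, the same two tables would show which columns need imputation and which are plausible predictors.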

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1338 entries, 0 to 1338
Data columns (total 43 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Species                1338 non-null   object 
 1   Owner                  1331 non-null   object 
 2   Country.of.Origin      1337 non-null   object 
 3   Farm.Name              980 non-null    object 
 4   Lot.Number             276 non-null    object 
 5   Mill                   1021 non-null   object 
 6   ICO.Number             1179 non-null   object 
 7   Company                1130 non-null   object 
 8   Altitude               1112 non-null   object 
 9   Region                 1279 non-null   object 
 10  Producer               1107 non-null   object 
 11  Number.of.Bags         1338 non-null   int64  
 12  Bag.Weight             1338 non-null   object 
 13  In.Country.Partner     1338 non-null   object 
 14  Harvest.Year           1291 non-null   object 
 15  Grading.D

### Data cleaning and preprocessing
- Clean text data
- Feature selection
- Imputation
- Train/test split

In [5]:
df.head(2)

Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,Region,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
1,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,1,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0


In [13]:
predictors = [
    'Species', 'Country.of.Origin',  # encoding required for country
    'altitude_mean_meters', 'Number.of.Bags', 'Harvest.Year',  # requires processing/imputing
    'Processing.Method',  # also needs imputing
    'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
    'Clean.Cup', 'Sweetness', 'Total.Cup.Points',
    'Moisture', 'Category.One.Defects', 'Category.Two.Defects',
]
# TODO: find out what ICO.Number represents

In [7]:
# split into numeric and categorical features
# (.copy() so that columns added later do not trigger SettingWithCopyWarning)
numeric = df.select_dtypes(include=[np.number]).copy()
categorical = df.select_dtypes(exclude=[np.number]).copy()
print('NUMERIC:')
print(numeric.columns)
print('\nCATEGORICAL:')
print(categorical.columns)

NUMERIC:
Index(['Number.of.Bags', 'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body',
       'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', 'Cupper.Points',
       'Total.Cup.Points', 'Moisture', 'Category.One.Defects', 'Quakers',
       'Category.Two.Defects', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')

CATEGORICAL:
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Bag.Weight', 'In.Country.Partner', 'Harvest.Year', 'Grading.Date',
       'Owner.1', 'Variety', 'Processing.Method', 'Color', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement'],
      dtype='object')


In [8]:
# PROCESS BAG WEIGHT DATA
print('Unique Bag.Weight values before processing:')
print(categorical['Bag.Weight'].unique())
def fix_bagweight(text):
    vals = text.split()
    bag_weight = int(vals[0])
    # convert to kilograms
    if len(vals) > 1:
        if vals[1] == 'lbs':
            bag_weight *= 0.453592
    return round(bag_weight)

# add bag weight to numeric, drop from categorical
numeric['Bag.Weight'] = categorical['Bag.Weight'].apply(fix_bagweight)
categorical = categorical.drop(columns='Bag.Weight')

print('\nUnique Bag.Weight values after processing:')
print(numeric['Bag.Weight'].unique())

Unique Bag.Weight values before processing:
['60 kg' '1' '30 kg' '69 kg' '1 kg' '2 kg,lbs' '6' '3 lbs' '50 kg' '2 lbs'
 '100 lbs' '15 kg' '2 kg' '2' '70 kg' '19200 kg' '5 lbs' '1 kg,lbs' '6 kg'
 '0 lbs' '46 kg' '40 kg' '20 kg' '34 kg' '1 lbs' '660 kg' '18975 kg'
 '12000 kg' '35 kg' '66 kg' '80 kg' '132 lbs' '5 kg' '25 kg' '59 kg'
 '18000 kg' '150 lbs' '9000 kg' '18 kg' '10 kg' '29 kg' '1218 kg' '4 lbs'
 '0 kg' '13800 kg' '1500 kg' '24 kg' '80 lbs' '8 kg' '3 kg' '350 kg'
 '67 kg' '4 kg' '55 lbs' '100 kg' '130 lbs']

Unique Bag.Weight values after processing:
[   60     1    30    69     2     6    50    45    15    70 19200     0
    46    40    20    34   660 18975 12000    35    66    80     5    25
    59 18000    68  9000    18    10    29  1218 13800  1500    24    36
     8     3   350    67     4   100]
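The same conversion can be written as a vectorized pandas operation that also tolerates missing values. This is a sketch, not the notebook's method: `bag_weight_to_kg` and `LB_TO_KG` are hypothetical names, and, like the cell above, it treats the ambiguous `'kg,lbs'` unit and unit-less values as kilograms.

```python
import pandas as pd

LB_TO_KG = 0.453592  # same conversion factor as in the cell above

def bag_weight_to_kg(raw: pd.Series) -> pd.Series:
    """Parse strings such as '60 kg', '100 lbs', '2 kg,lbs', or '6' into whole kilograms."""
    parts = raw.str.extract(r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-z,]*)", expand=True)
    value = pd.to_numeric(parts["value"], errors="coerce")
    # Only an exact 'lbs' unit is converted; 'kg,lbs' and missing units are assumed to be kg
    is_lbs = parts["unit"].eq("lbs")
    return value.where(~is_lbs, value * LB_TO_KG).round()

sample = pd.Series(["60 kg", "100 lbs", "2 kg,lbs", "6", None])
kg = bag_weight_to_kg(sample)
print(kg.tolist())
```

Either version leaves values such as `19200 kg` untouched. These may be lot weights entered in the bag-weight field, so they are worth inspecting before modelling.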


In [9]:
# PROCESS HARVEST YEAR DATA
print('Unique Harvest.Year values before processing:')
print(categorical['Harvest.Year'].unique())
def fix_harvestyear(text):
    # return the first four-digit year found in the text, or None if there is none
    text = str(text)
    yr_pattern = re.compile(r'\d{4}(?=\D|$)')
    match = yr_pattern.search(text)
    if match:
        year = match.group()
        return year
    else:
        return None

# add harvest year to numeric, drop from categorical
numeric['Harvest.Year'] = categorical['Harvest.Year'].apply(fix_harvestyear)
categorical = categorical.drop(columns='Harvest.Year')

print('\nUnique Harvest.Year values after processing:')
print(numeric['Harvest.Year'].unique())

Unique Harvest.Year values before processing:
['2014' nan '2013' '2012' 'March 2010' 'Sept 2009 - April 2010'
 'May-August' '2009/2010' '2015' '2011' '2016' '2015/2016' '2010'
 'Fall 2009' '2017' '2009 / 2010' '2010-2011' '2009-2010' '2009 - 2010'
 '2013/2014' '2017 / 2018' 'mmm' 'December 2009-March 2010' '2014/2015'
 '2011/2012' 'January 2011' '4T/10' '2016 / 2017' '23 July 2010'
 'January Through April' '1T/2011' '4t/2010' '4T/2010'
 'August to December' 'Mayo a Julio' '47/2010' 'Abril - Julio' '4t/2011'
 'Abril - Julio /2011' 'Spring 2011 in Colombia.' '3T/2011' '2016/2017'
 '1t/2011' '2018' '4T72010' '08/09 crop']

Unique Harvest.Year values after processing:
['2014' None '2013' '2012' '2010' '2009' '2015' '2011' '2016' '2017'
 '2018']
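The extracted years are still strings (note the quotes in the output above), with `None` for unparseable entries. A minimal sketch of converting them to a nullable integer column follows. `extract_year` is a hypothetical stand-in that uses the same regular expression as `fix_harvestyear`, and the sample strings are taken from the unique values printed above.

```python
import re
import pandas as pd

YEAR_PATTERN = re.compile(r"\d{4}(?=\D|$)")

def extract_year(text):
    """Return the first four-digit run not followed by another digit, or None."""
    match = YEAR_PATTERN.search(str(text))
    return match.group() if match else None

raw = pd.Series(["2014", "Sept 2009 - April 2010", "4T72010", "08/09 crop", None])

# to_numeric turns None into NaN; Int64 keeps whole years while allowing missing values
years = pd.to_numeric(raw.apply(extract_year), errors="coerce").astype("Int64")
print(years.tolist())
```

For ranges such as `'Sept 2009 - April 2010'` this keeps the first year only, which matches the behaviour of the cell above.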


In [10]:
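# prepare numeric data for linear regression
# NOTE: the imputer is fit on all rows before the train/test split,
# so column means from the test rows leak into the training data
#train = df[predictors]
imputer = SimpleImputer(strategy='mean')
# drop the target (Cupper.Points) and the aggregate score (Total.Cup.Points)
train = numeric.drop(columns=['Cupper.Points','Total.Cup.Points'])
train = imputer.fit_transform(train)
y = df['Cupper.Points']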

# # one-hot encode species
# ohe = OneHotEncoder(drop='first')
# train['Species'] = ohe.fit_transform(train[['Species']]).toarray().ravel()

# # one-hot encode processing method (prob an easier way to just use one ohe)
# ohe_proc = OneHotEncoder(drop='first')
# proc_method = ohe_proc.fit_transform(train[['Processing.Method']]).toarray()

X_train, X_test, y_train, y_test = train_test_split(train, y, 
                                                    test_size = 0.2,
                                                    random_state=0, 
                                                    shuffle=True)
# scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('X_train shape:',X_train.shape)
print('y_train shape:',y_train.shape)
print('X_test shape:',X_test.shape)
print('y_test shape:',y_test.shape)

X_train shape: (1070, 19)
y_train shape: (1070,)
X_test shape: (268, 19)
y_test shape: (268,)
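In the cell above, the imputer is fit on every row before the split, so its column means include test rows. One way to avoid this is to split first and wrap imputation, scaling, and the model in a scikit-learn `Pipeline`, which fits each step on the training rows only. The sketch below uses synthetic data with column names borrowed from this dataset. The values, the target relationship, and the `baseline` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric feature table (values are invented)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Flavor": rng.normal(7.5, 0.4, 200),
    "Aftertaste": rng.normal(7.4, 0.4, 200),
    "altitude_mean_meters": rng.normal(1500, 300, 200),
})
y = 0.6 * X["Flavor"] + 0.4 * X["Aftertaste"] + rng.normal(0, 0.1, 200)
# Knock out 10% of the altitude values to mimic missing data
X.loc[X.sample(frac=0.1, random_state=0).index, "altitude_mean_meters"] = np.nan

# Split first, then let the pipeline learn imputation means and scaling from train only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
baseline.fit(X_train, y_train)
test_r2 = baseline.score(X_test, y_test)
print(f"test R^2: {test_r2:.3f}")
```

Calling `baseline.score(X_test, y_test)` applies the training-set means and scales to the test rows automatically.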


## Baseline Model
Select an appropriate machine learning model or statistical technique to solve the problem at hand. Train and evaluate the model using appropriate metrics and techniques. This will act as your baseline model, against which you will compare your final model.

In [11]:
# simple Linear Regression Model (.score() returns R^2)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
print('Linear regression train score:', lr.score(X_train_scaled, y_train))
print('Linear regression test score:', lr.score(X_test_scaled, y_test))

Linear regression train score: 0.5715957951394057
Linear regression test score: 0.8631784914874205
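The scores above are R² values, and here the test score is higher than the train score. That is unusual and worth checking against a different `random_state` or with cross-validation. R² alone is also hard to judge without a reference point. A rough sketch of one option: report RMSE and MAE next to R², and compare against a `DummyRegressor` that always predicts the training mean. The data below is synthetic and `report` is a hypothetical helper.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (invented values, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, 0.3, 0.0, -0.2]) + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def report(model):
    """Fit on the training split and return test-set R^2, RMSE, and MAE."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": mean_absolute_error(y_test, pred),
    }

dummy_scores = report(DummyRegressor(strategy="mean"))  # always predicts the train mean
lr_scores = report(LinearRegression())
print("dummy :", dummy_scores)
print("linear:", lr_scores)
```

RMSE and MAE are in the units of the target, so on the real data they would read directly as points of `Cupper.Points`.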


In [12]:
print('LR Coefficients')
name_coef = zip(imputer.get_feature_names_out(),lr.coef_)
for name, coef in sorted(name_coef, key=lambda x:np.abs(x[1]), reverse=True):
    print(f"{name:<25} {round(coef,2)}")

LR Coefficients
altitude_mean_meters      2447413293717.47
altitude_low_meters       -1223835486445.01
altitude_high_meters      -1223707364078.23
Flavor                    0.14
Aftertaste                0.11
Balance                   0.05
Category.One.Defects      0.03
Acidity                   0.03
Harvest.Year              0.03
Moisture                  -0.02
Clean.Cup                 0.02
Category.Two.Defects      -0.01
Aroma                     0.01
Number.of.Bags            -0.01
Sweetness                 -0.01
Uniformity                0.0
Quakers                   -0.0
Bag.Weight                0.0
Body                      0.0
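The three altitude coefficients above are around 10¹² and nearly cancel each other, which is the usual signature of perfectly collinear columns. The two rows shown by `df.head(2)` have `altitude_mean_meters` (2075) equal to the average of `altitude_low_meters` (1950) and `altitude_high_meters` (2200), so the mean column appears to be a linear combination of the other two. The sketch below reproduces the situation on synthetic data, under that assumption, and shows that keeping only the mean column restores full rank and interpretable coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 'mean' is exactly the average of 'low' and 'high' (assumed structure)
rng = np.random.default_rng(0)
n = 200
low = rng.normal(1400, 300, n)
high = low + rng.uniform(0, 200, n)
mean = (low + high) / 2
flavor = rng.normal(7.5, 0.4, n)
y = 0.5 * flavor + rng.normal(0, 0.1, n)  # target depends on flavor only

X_all = StandardScaler().fit_transform(np.column_stack([low, high, mean, flavor]))
X_reduced = StandardScaler().fit_transform(np.column_stack([mean, flavor]))

# Four columns but only three independent directions: the design matrix is rank-deficient
rank_all = np.linalg.matrix_rank(X_all)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("rank with all altitude columns:", rank_all)
print("rank with mean altitude only:  ", rank_reduced)

# With the redundant columns removed, the coefficients are small and stable
coef_reduced = LinearRegression().fit(X_reduced, y).coef_
print("coefficients (mean altitude, flavor):", np.round(coef_reduced, 3))
```

In the notebook, the equivalent change would be dropping `altitude_low_meters` and `altitude_high_meters` before fitting. A regularized model such as ridge regression is another common remedy.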


## Interpret the result
Analyze the results of the model and communicate the findings. This may involve creating visualizations or presenting the results in a clear and concise manner. The findings here must lead to your choice of the final model.

## Final Model Pipeline
By this step, you should clearly understand, and be able to justify, your choice of a particular machine learning technique. You are only expected to choose a technique and set up the pipeline so that you can train it and run the required experiments. You are not required to tune your model to get optimal results.
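As a starting point for such a pipeline, the sketch below combines numeric and categorical preprocessing in a `ColumnTransformer`, so that columns like `Country.of.Origin` and `Processing.Method` can be one-hot encoded alongside the scaled numeric scores. The column subset, the toy rows, and the choice of `LinearRegression` as the final estimator are all assumptions for illustration, not the group's chosen model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Flavor", "altitude_mean_meters"]             # illustrative subset
categorical_cols = ["Country.of.Origin", "Processing.Method"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # ignore categories at predict time that were not seen during fit
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# Toy rows with invented values, including missing entries in both column types
toy = pd.DataFrame({
    "Flavor": [7.5, 7.8, 7.2, 8.0, 7.6, 7.4],
    "altitude_mean_meters": [2075.0, np.nan, 1500.0, 1800.0, 1200.0, 1650.0],
    "Country.of.Origin": ["Ethiopia", "Ethiopia", "Brazil", "Colombia", "Brazil", np.nan],
    "Processing.Method": ["washed", "natural", np.nan, "washed", "natural", "washed"],
})
target = pd.Series([7.6, 7.9, 7.1, 8.1, 7.5, 7.3], name="Cupper.Points")

model.fit(toy, target)
pred = model.predict(toy.head(2))
print(pred)
```

Swapping the `"regressor"` step for another estimator leaves the preprocessing untouched, which keeps later experiments comparable with the baseline.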