# Project Deliverable 2 Code
For ease of use, I uploaded the csv to github, so the code can be ran easily without downloading the dataset or inserting
kaggle api key.

## Data Preprocessing and Exploration
This section removes any invalid and duplicate entries in our dataset, as well as removing A_id, as it will not be useful for our model. (It is only an identifier.)


In [2]:
import pandas as pd
# load the dataset
url = 'https://raw.githubusercontent.com/johnxminimo/cs577applequalityproject/main/apple_quality.csv'

dataset = pd.read_csv(url)
dataset.head()
dataset.describe()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,1999.5,-0.503015,-0.989547,-0.470479,0.985478,0.512118,0.498277
std,1154.844867,1.928059,1.602507,1.943441,1.402757,1.930286,1.874427
min,0.0,-7.151703,-7.149848,-6.894485,-6.055058,-5.961897,-5.864599
25%,999.75,-1.816765,-2.01177,-1.738425,0.062764,-0.801286,-0.771677
50%,1999.5,-0.513703,-0.984736,-0.504758,0.998249,0.534219,0.503445
75%,2999.25,0.805526,0.030976,0.801922,1.894234,1.835976,1.766212
max,3999.0,6.406367,5.790714,6.374916,7.619852,7.364403,7.237837


In [3]:
# Let's perform some cleaning/preprocessing (removing duplicates, null/invalid records, and features )
# lets first remove duplicates, from our dataset
# by looking at the data, we don't need apple_id, as this is just an identifier
dataset.drop_duplicates(inplace=True)
dataset.drop("A_id", axis=1, inplace=True)
dataset.describe()

Unnamed: 0,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,-0.503015,-0.989547,-0.470479,0.985478,0.512118,0.498277
std,1.928059,1.602507,1.943441,1.402757,1.930286,1.874427
min,-7.151703,-7.149848,-6.894485,-6.055058,-5.961897,-5.864599
25%,-1.816765,-2.01177,-1.738425,0.062764,-0.801286,-0.771677
50%,-0.513703,-0.984736,-0.504758,0.998249,0.534219,0.503445
75%,0.805526,0.030976,0.801922,1.894234,1.835976,1.766212
max,6.406367,5.790714,6.374916,7.619852,7.364403,7.237837


In [4]:
# As shown here, our dataset rates our apple as either good or bad.
print(dataset["Quality"])

0       good
1       good
2        bad
3       good
4       good
        ... 
3996    good
3997     bad
3998    good
3999    good
4000     NaN
Name: Quality, Length: 4001, dtype: object


In [5]:
# Instead we should use 1 for good and 0 for bad
dataset["Quality"].replace(("good", "bad"), [1,0], inplace = True)
print(dataset["Quality"])



0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
3996    1.0
3997    0.0
3998    1.0
3999    1.0
4000    NaN
Name: Quality, Length: 4001, dtype: float64


In [6]:
dataset['Acidity']=pd.to_numeric(dataset.Acidity,errors='coerce')
dataset.dropna(inplace=True)

In [7]:
from sklearn.model_selection import train_test_split
# now lets begin by splitting our dataset into training and testing
X = dataset.drop("Quality", axis = 1)
y = dataset["Quality"]
dataset.info()
print(X)

<class 'pandas.core.frame.DataFrame'>
Index: 4000 entries, 0 to 3999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Size         4000 non-null   float64
 1   Weight       4000 non-null   float64
 2   Sweetness    4000 non-null   float64
 3   Crunchiness  4000 non-null   float64
 4   Juiciness    4000 non-null   float64
 5   Ripeness     4000 non-null   float64
 6   Acidity      4000 non-null   float64
 7   Quality      4000 non-null   float64
dtypes: float64(8)
memory usage: 281.2 KB
          Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness  \
0    -3.970049 -2.512336   5.346330    -1.012009   1.844900  0.329840   
1    -1.195217 -2.839257   3.664059     1.588232   0.853286  0.867530   
2    -0.292024 -1.351282  -1.738429    -0.342616   2.838636 -0.038033   
3    -0.657196 -2.271627   1.324874    -0.097875   3.637970 -3.413761   
4     1.364217 -1.296612  -0.384658    -0.553006   3.030874 -1.303849   

# Preparing and training our first model (Logistic Reg)
For the first model, I opted to train logistic regression, since it is a simple model and could serve as a "baseline".

The methodolgy I chose is to split the data into training and testing set using a 70/30 split. The reason why I chose this split is becaues we have quite a bit of entries (4000), and should be large enough to where we don't need to add more into our testing set or to use cross validation.

I then split the training set into training and tuning, with 20% of the training set going to tuning.

As for hyperparamters, I will be using gridsearch in order to test:
l1: lasso reg
l2: ridge reg

regulaization strengths (C): 
10^-x for x = -5 to 5 (same parameter set from previous homework)

solvers:
liblinear

As for determining whether a model is good, I will use precision since we want to ensure that false positives are at a minimum.


In [19]:
from numpy import loadtxt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn import metrics
from sklearn.metrics import make_scorer, precision_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
paramGridLR = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear'],
}

precision_scorer = make_scorer(precision_score, pos_label=1)
logRes = LogisticRegression()
logResGridSearch = GridSearchCV(estimator=logRes, param_grid = paramGridLR, verbose=1, scoring=precision_scorer)
logResGridSearch.fit(X_train, y_train)

bestLogResModel = logResGridSearch.best_estimator_
precisionOnTest = precision_score(y_test, bestLogResModel.predict(X_test), pos_label=1)


print("Best parameters for logistic regression found by GridSearch" + str(logResGridSearch.best_params_))
print("Precision for log reg on validation set using best parameters found by gridSearch: " + str(logResGridSearch.best_score_))
print("Precision score on test set: " + str(precisionOnTest))
print("Accuracy score on test set:" + str(bestLogResModel.score(X_test, y_test)))


Fitting 5 folds for each of 12 candidates, totalling 60 fits


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Best parameters for logistic regression found by GridSearch{'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}
Precision for log reg on validation set using best parameters found by gridSearch: 0.7673342533707068
Precision score on test set: 0.7689393939393939
Accuracy score on test set:0.7308333333333333


## LogReg Results
As for the logistic regression results, it seems that the best parameters are: C = 0.01, penalty: l1 (lasso reg).

### Test Set
Precision score: 0.768
Accuracy: 0.73


# Training Second Model Random Forest
For the second model to test, I opted to use Random Forest.

I also chose to use gridsearch inorder to test different hyper parameters like in our logistic regression model.

In [44]:
from sklearn.ensemble import RandomForestClassifier

randForestParam = {
    'n_estimators': [100, 200, 500],
    'max_features': ['auto', 'log2'],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
    'criterion': ['mse', 'log_loss']
}

randomForestGrid = GridSearchCV(estimator=RandomForestClassifier(), param_grid=randForestParam, cv=5, n_jobs = -1)
randomForestGrid.fit(X_train, y_train)

bestRFModel = randomForestGrid.best_estimator_
rfPrecisionOnTest = precision_score(y_test, randomForestGrid.predict(X_test), pos_label=1)


print("Best parameters for random forest found by GridSearch" + str(randomForestGrid.best_params_))
print("Precision for random forest on validation set using best parameters found by gridSearch: " + str(randomForestGrid.best_score_))
print("Precision score on test set: " + str(rfPrecisionOnTest))
print("Accuracy score on test set:" + str(bestRFModel.score(X_test, y_test)))




2430 fits failed out of a total of 3240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
974 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/johnminimo/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/johnminimo/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1144, in wrapper
    estimator._validate_params()
  File "/Users/johnminimo/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "/Users/johnminimo/anaconda3/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95,

Best parameters for random forest found by GridSearch{'bootstrap': True, 'criterion': 'log_loss', 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Precision for random forest on validation set using best parameters found by gridSearch: 0.875
Precision score on test set: 0.8887070376432079
Accuracy score on test set:0.89


## Results on Random Forest
For random forest results, it seems that the best parameters are:{'bootstrap': True, 'criterion': 'log_loss', 'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}.

### Test Set
Precision score: 0.89
Accuracy: 0.89

Our precision score and accuracy score both do better compared to our first baseline model.

With a precision score of 0.76, we have reduced the amount of false positives substantially, from our initial score of 0.73 with logistic regression.

Our accuracy also increased, which means that we are getting a larger amount of our test set correct, and not just getting less false positives. Our model performs better as a whole.


# Training Third Model, ANN

Since ANN is computationally extensive, I am opting not to implement gridsearch, since it will increase the complexity of our code since we need to create a separaate model builder function to work with gridsearch, and also would take long to train + test due to the different parameters and combinations gridsearch would use.

Instead, I am opting to use relu for hidden layers, and then for our final output layer, a sigmoid unit. This is similar to the approach we took in our homework.


In [46]:
dataset.dtypes

A_id           float64
Size           float64
Weight         float64
Sweetness      float64
Crunchiness    float64
Juiciness      float64
Ripeness       float64
Acidity        float64
Quality        float64
dtype: object

In [20]:
from keras.losses import BinaryCrossentropy
from keras.optimizers import Adam
from keras import backend
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(9, activation='relu', input_shape=(7,)))
model.add(Dense(15, activation='relu'))
model.add(Dense(1, activation='sigmoid'))   

model.summary()

model.compile(optimizer=Adam(), loss=BinaryCrossentropy(), metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)



Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_27 (Dense)            (None, 9)                 72        
                                                                 
 dense_28 (Dense)            (None, 15)                150       
                                                                 
 dense_29 (Dense)            (None, 1)                 16        
                                                                 
Total params: 238
Trainable params: 238
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
 1/88 [..............................] - ETA: 23s - loss: 0.6898 - accuracy: 0.6562

2024-04-21 20:31:10.846940: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 7

Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x2bf2fab90>

In [22]:
model.evaluate(X_test, y_test)
yPredict = model.predict(X_test)



2024-04-21 20:33:31.116422: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


In [23]:
print(yPredict) # if > .5, make = 1

[[0.5376683 ]
 [0.8730905 ]
 [0.9414666 ]
 ...
 [0.6869172 ]
 [0.45628667]
 [0.5662097 ]]


In [28]:
finalPredict = [1 if y > 0.5 else 0 for y in yPredict]
#print(finalPredict)

print(classification_report(y_test, finalPredict))

              precision    recall  f1-score   support

         0.0       0.73      0.76      0.74       593
         1.0       0.75      0.73      0.74       607

    accuracy                           0.74      1200
   macro avg       0.74      0.74      0.74      1200
weighted avg       0.74      0.74      0.74      1200



# ANN Results on Test
Accuracy of 0.74
Precision of 0.73

# Conclusion
The best model for our problem based on the testing completed, seems to be random forests. With a high accuracy of 0.89 and a precision of 0.89, this model best suited our needs.

With such a high precision, we know that we are finding less false positives, which means that bad apples are not able to get through in our predictions, while our accuracy is also great, further proving that the model is not a fluke.