### Problem Statement:
Build a classification model that predicts the quality rating (`good` or `bad`) of a fruit from its measured attributes.

### Dataset Information:

1. A_id: Unique identifier for each fruit
2. Size: Size of the fruit
3. Weight: Weight of the fruit
4. Sweetness: Degree of sweetness of the fruit
5. Crunchiness: Texture indicating the crunchiness of the fruit
6. Juiciness: Level of juiciness of the fruit
7. Ripeness: Stage of ripeness of the fruit
8. Acidity: Acidity level of the fruit
9. Quality: Overall quality of the fruit

In [1]:
import pandas as pd                       # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np                        # linear algebra
import seaborn as sns                     # for plotting
import matplotlib.pyplot as plt           # plotting
import warnings
warnings.filterwarnings('ignore')         #Suppress warnings in Python.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [2]:
df=pd.read_csv(r'apple_quality.csv')
df.head()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good


In [3]:
df.shape

(4001, 9)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4001 entries, 0 to 4000
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4001 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB
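
The `info()` output above lists `Acidity` as `object` with 4001 non-null entries while every other column has 4000, which hints at a stray non-numeric value. A minimal sketch of how such entries could be located, using a small made-up frame (the values and the `NOTE_ROW` string are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "Size": [0.5, -1.2, None],
    "Acidity": ["-0.49", "2.62", "NOTE_ROW"],  # last entry is not a number
})

# Coerce to numeric: anything that cannot be parsed becomes NaN.
parsed = pd.to_numeric(toy["Acidity"], errors="coerce")

# Rows that held a value originally but failed to parse are the offenders.
bad_mask = parsed.isna() & toy["Acidity"].notna()
bad_rows = toy[bad_mask]
print(bad_rows)
```

Inspecting the offending rows before dropping them makes it easier to confirm that only non-data rows are removed.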


In [5]:
df.describe()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,1999.5,-0.503015,-0.989547,-0.470479,0.985478,0.512118,0.498277
std,1154.844867,1.928059,1.602507,1.943441,1.402757,1.930286,1.874427
min,0.0,-7.151703,-7.149848,-6.894485,-6.055058,-5.961897,-5.864599
25%,999.75,-1.816765,-2.01177,-1.738425,0.062764,-0.801286,-0.771677
50%,1999.5,-0.513703,-0.984736,-0.504758,0.998249,0.534219,0.503445
75%,2999.25,0.805526,0.030976,0.801922,1.894234,1.835976,1.766212
max,3999.0,6.406367,5.790714,6.374916,7.619852,7.364403,7.237837


In [9]:
for i in df.columns:
    print('columns: ',i)
    print('Nunique: ',df[i].nunique(),'\n')
    print('Unique: ',df[i].unique())
    print(30*'==')

columns:  A_id
Nunique:  4000 

Unique:  [0.000e+00 1.000e+00 2.000e+00 ... 3.998e+03 3.999e+03       nan]
columns:  Size
Nunique:  4000 

Unique:  [-3.97004852 -1.19521719 -0.29202386 ... -4.00800374  0.27853965
         nan]
columns:  Weight
Nunique:  4000 

Unique:  [-2.51233638 -2.83925653 -1.35128199 ... -1.77933711 -1.71550503
         nan]
columns:  Sweetness
Nunique:  4000 

Unique:  [ 5.34632961  3.66405876 -1.73842916 ...  2.36639697  0.12121725
         nan]
columns:  Crunchiness
Nunique:  4000 

Unique:  [-1.01200871  1.58823231 -0.34261593 ... -0.20032937 -1.15407476
         nan]
columns:  Juiciness
Nunique:  4000 

Unique:  [1.84490036 0.8532858  2.83863551 ... 2.16143512 1.2666774         nan]
columns:  Ripeness
Nunique:  4000 

Unique:  [ 0.3298398   0.86753008 -0.03803333 ...  0.21448838 -0.77657147
         nan]
columns:  Acidity
Nunique:  4001 

Unique:  ['-0.491590483' '-0.722809367' '2.621636473' ... '-2.229719806'
 '1.599796456' 'Created_by_Nidula_Elgiriyewithana

In [92]:
df[['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness','Acidity']]

Unnamed: 0,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity
0,-1.798424,-0.950373,2.993421,-1.424150,0.690545,-0.089872,-0.269415
1,-0.359060,-1.154404,2.127698,0.429746,0.176767,0.197020,-0.378997
2,0.109445,-0.225759,-0.652507,-0.946892,1.205422,-0.286156,1.206044
3,-0.079977,-0.800146,0.923916,-0.772399,1.619575,-2.087320,0.338315
4,0.968573,-0.191640,0.044164,-1.096894,1.305025,-0.961548,0.201472
...,...,...,...,...,...,...,...
3995,0.291729,-0.048594,-1.669449,-0.365345,0.614425,0.931482,0.028866
3996,0.108878,1.834105,0.137124,-1.159058,-0.252634,-0.846326,0.842347
3997,-1.105655,-0.716904,-1.013784,-0.234036,0.874379,2.275957,-0.668950
3998,-1.818112,-0.492908,1.459901,-0.845446,0.854549,-0.151419,-1.093171


### Data Cleaning

In [11]:
df.isnull().sum()

A_id           1
Size           1
Weight         1
Sweetness      1
Crunchiness    1
Juiciness      1
Ripeness       1
Acidity        0
Quality        1
dtype: int64

In [17]:
df[df.isnull().any(axis=1)]

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
4000,,,,,,,,Created_by_Nidula_Elgiriyewithana,


In [18]:
df.dropna(inplace=True)

In [19]:
df.isnull().sum()

A_id           0
Size           0
Weight         0
Sweetness      0
Crunchiness    0
Juiciness      0
Ripeness       0
Acidity        0
Quality        0
dtype: int64

In [22]:
df[df.duplicated()]

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality


In [23]:
df['Acidity']=pd.to_numeric(df.Acidity,errors='coerce')
df.dtypes

A_id           float64
Size           float64
Weight         float64
Sweetness      float64
Crunchiness    float64
Juiciness      float64
Ripeness       float64
Acidity        float64
Quality         object
dtype: object

In [24]:
df.Acidity.isnull().sum()

0

In [26]:
df.columns

Index(['A_id', 'Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness',
       'Ripeness', 'Acidity', 'Quality'],
      dtype='object')

In [29]:
df.drop('A_id',axis=1,inplace=True)
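
The cleaning cells above can be collected into one function so the same steps are easy to rerun on a fresh copy of the data. This is only a sketch: `clean_fruit_frame` is a hypothetical helper name, the toy rows are made up, and it coerces `Acidity` before dropping missing rows (the cells above do it in the opposite order, which gives the same result here).

```python
import pandas as pd

def clean_fruit_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning cells above."""
    out = raw.copy()
    # Convert Acidity first so that non-numeric text becomes NaN ...
    out["Acidity"] = pd.to_numeric(out["Acidity"], errors="coerce")
    # ... then a single dropna removes every incomplete row.
    out = out.dropna()
    # The identifier carries no predictive signal, so drop it.
    out = out.drop(columns="A_id")
    return out.reset_index(drop=True)

# Made-up rows: two valid records plus one trailing non-data row.
toy = pd.DataFrame({
    "A_id": [0.0, 1.0, None],
    "Size": [-3.97, -1.20, None],
    "Acidity": ["-0.49", "-0.72", "credit line"],
    "Quality": ["good", "bad", None],
})
cleaned = clean_fruit_frame(toy)
print(cleaned.dtypes)
```

Working on a copy instead of using `inplace=True` leaves the raw frame untouched, which helps when cells are rerun out of order.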

### Model Building

In [51]:
label_encoder = LabelEncoder()
df['Quality'] = label_encoder.fit_transform(df['Quality'])
df.head()

Unnamed: 0,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.49159,1
1,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809,1
2,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636,0
3,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723,1
4,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984,1
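
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `good` appears as 1 and `bad` as 0 in the output above. A short sketch on a made-up label list shows how the mapping can be read back from the fitted encoder:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["good", "good", "bad", "good"])

# classes_ is sorted, so each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into readable labels.
labels = encoder.inverse_transform([0, 1])
print(list(labels))
```

Keeping the fitted encoder around makes it possible to report predictions as `good`/`bad` instead of 1/0.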


In [52]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity']] = scaler.fit_transform(df[['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity']])

In [53]:
df.head()

Unnamed: 0,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,-1.798424,-0.950373,2.993421,-1.42415,0.690545,-0.089872,-0.269415,1
1,-0.35906,-1.154404,2.127698,0.429746,0.176767,0.19702,-0.378997,1
2,0.109445,-0.225759,-0.652507,-0.946892,1.205422,-0.286156,1.206044,0
3,-0.079977,-0.800146,0.923916,-0.772399,1.619575,-2.08732,0.338315,1
4,0.968573,-0.19164,0.044164,-1.096894,1.305025,-0.961548,0.201472,1
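
Note that the scaler above is fit on the full frame before the train/test split, so statistics from the test rows influence the scaling. One common alternative is to put the scaler inside a pipeline so it is fit on the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the fruit features; it is not the notebook's actual workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 7 numeric features and a binary target.
X, y = make_classification(n_samples=400, n_features=7, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when transforming the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Tree-based models are insensitive to feature scaling, so this mainly matters for Logistic Regression.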


In [84]:
x=df.drop('Quality',axis=1)
x.head()

Unnamed: 0,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity
0,-1.798424,-0.950373,2.993421,-1.42415,0.690545,-0.089872,-0.269415
1,-0.35906,-1.154404,2.127698,0.429746,0.176767,0.19702,-0.378997
2,0.109445,-0.225759,-0.652507,-0.946892,1.205422,-0.286156,1.206044
3,-0.079977,-0.800146,0.923916,-0.772399,1.619575,-2.08732,0.338315
4,0.968573,-0.19164,0.044164,-1.096894,1.305025,-0.961548,0.201472


In [85]:
y=pd.DataFrame(df['Quality'])
y.head()

Unnamed: 0,Quality
0,1
1,1
2,0
3,1
4,1


In [86]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
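
The split above is random and not stratified. With roughly balanced classes that is usually fine, but passing `stratify` keeps the class proportions identical in both parts. A small sketch on made-up labels (60 of one class, 40 of the other):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up binary labels: 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 60/40 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(len(y_te), int(y_te.sum()))
```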

In [87]:
# List of classifiers to try

classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
}

In [88]:
# List to store accuracies
accuracies = []

In [89]:
# Train and evaluate each classifier
for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    
    accuracies.append(accuracy)
    
    print(f"\n{30*'='}\n{name}\n{30*'='}")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))


Logistic Regression
Accuracy: 0.7650
Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.72      0.76       409
           1       0.74      0.81      0.77       391

    accuracy                           0.77       800
   macro avg       0.77      0.77      0.76       800
weighted avg       0.77      0.77      0.76       800


Decision tree
Accuracy: 0.8525
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.85      0.86       409
           1       0.85      0.85      0.85       391

    accuracy                           0.85       800
   macro avg       0.85      0.85      0.85       800
weighted avg       0.85      0.85      0.85       800


Random Forest
Accuracy: 0.8925
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.89      0.89       409
           1       0.88      0.90      0.89       391

    accurac

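Overall Observations: <br>
On the 800-sample test split, accuracy was 0.7650 for Logistic Regression, 0.8525 for the Decision tree, and 0.8925 for Random Forest, with per-class F1 scores close to the accuracy in each case.<br>
Random Forest outperformed the other models, which suggests that an ensemble of trees captures non-linear patterns in these features that a linear model such as Logistic Regression misses.<br>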
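
The accuracies above come from a single 80/20 split with unseeded tree models, so the numbers can shift between runs. Cross-validation gives a steadier comparison. The sketch below assumes synthetic data from `make_classification` in place of the fruit features, and the `random_state` and `n_estimators` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 7 numeric features and a binary target.
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Mean 5-fold accuracy per model, with the spread across folds.
cv_means = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the ranking of the models holds across folds, the conclusion that Random Forest performs best rests on firmer ground than a single split.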