# **Wine Quality Prediction using Support Vector Machine**

# **Objective**

### This project aims to predict the quality of white wine from its physicochemical properties using a Support Vector Machine (SVM). The predictions are meant to support better decision-making for winemakers, sellers, and consumers.

**The dataset includes the following columns:**

Fixed acidity

Volatile acidity

Citric acid

Residual sugar

Chlorides

Free sulfur dioxide

Total sulfur dioxide

Density

pH

Sulphates

Alcohol

Quality (score between 0 and 10)

# **Import Library**

In [None]:
import pandas as pd
import numpy as np

# **Import CSV as Dataframe**

In [None]:
df=pd.read_csv(r'https://raw.githubusercontent.com/Lorddhaval/Dataset/main/WhiteWineQuality.csv', sep=';')

# **Get the first five rows of Dataframe**

In [None]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


# **Get information of Dataframe**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


# **Get the Summary Statistics**

In [None]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.12102 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 2.0 | 9.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.3 | 0.21 | 0.27 | 1.7 | 0.036 | 23.0 | 108.0 | 0.991723 | 3.09 | 0.41 | 9.5 | 5.0 |
| 50% | 6.8 | 0.26 | 0.32 | 5.2 | 0.043 | 34.0 | 134.0 | 0.99374 | 3.18 | 0.47 | 10.4 | 6.0 |
| 75% | 7.3 | 0.32 | 0.39 | 9.9 | 0.05 | 46.0 | 167.0 | 0.9961 | 3.28 | 0.55 | 11.4 | 6.0 |
| max | 14.2 | 1.1 | 1.66 | 65.8 | 0.346 | 289.0 | 440.0 | 1.03898 | 3.82 | 1.08 | 14.2 | 9.0 |


# **Get Column Names**

In [None]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

# **Get shape of Dataframe**

In [None]:
df.shape

(4898, 12)

# **Get Unique values (Class or Label) in Y variable**

In [None]:
df['quality'].value_counts()

| quality | count |
|---|---|
| 6 | 2198 |
| 5 | 1457 |
| 7 | 880 |
| 8 | 175 |
| 4 | 163 |
| 3 | 20 |
| 9 | 5 |
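
The counts above are heavily skewed toward the middle scores. As a quick check, the following sketch computes each class's share of the dataset. The counts are copied by hand from the output above, and the variable names are illustrative only:

```python
# Class counts copied from the value_counts() output above.
counts = {6: 2198, 5: 1457, 7: 880, 8: 175, 4: 163, 3: 20, 9: 5}

total = sum(counts.values())  # should equal the number of rows in df

# Share of each quality score, ordered by score.
shares = {quality: n / total for quality, n in sorted(counts.items())}
for quality, share in shares.items():
    print(f"quality {quality}: {share:.1%}")

# Accuracy of a trivial model that always predicts the most common score.
majority_share = max(shares.values())
print(f"majority-class baseline: {majority_share:.1%}")
```

Any classifier trained on this data should be compared against that majority-class baseline.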


In [None]:
df.groupby('quality').mean()

| quality | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 7.6 | 0.33325 | 0.336 | 6.3925 | 0.0543 | 53.325 | 170.6 | 0.994884 | 3.1875 | 0.4745 | 10.345 |
| 4 | 7.129448 | 0.381227 | 0.304233 | 4.628221 | 0.050098 | 23.358896 | 125.279141 | 0.994277 | 3.182883 | 0.476135 | 10.152454 |
| 5 | 6.933974 | 0.302011 | 0.337653 | 7.334969 | 0.051546 | 36.432052 | 150.904598 | 0.995263 | 3.168833 | 0.482203 | 9.80884 |
| 6 | 6.837671 | 0.260564 | 0.338025 | 6.441606 | 0.045217 | 35.650591 | 137.047316 | 0.993961 | 3.188599 | 0.491106 | 10.575372 |
| 7 | 6.734716 | 0.262767 | 0.325625 | 5.186477 | 0.038191 | 34.125568 | 125.114773 | 0.992452 | 3.213898 | 0.503102 | 11.367936 |
| 8 | 6.657143 | 0.2774 | 0.326514 | 5.671429 | 0.038314 | 36.72 | 126.165714 | 0.992236 | 3.218686 | 0.486229 | 11.636 |
| 9 | 7.42 | 0.298 | 0.386 | 4.12 | 0.0274 | 33.4 | 116.0 | 0.99146 | 3.308 | 0.466 | 12.18 |


# **Define Target (Y) and features (X)**

In [113]:
y = df['quality']

In [114]:
y.shape

(4898,)

In [115]:
y

| | quality |
|---|---|
| 0 | 6 |
| 1 | 6 |
| 2 | 6 |
| 3 | 6 |
| 4 | 6 |
| 5 | 6 |
| 6 | 6 |
| 7 | 6 |
| 8 | 6 |
| 9 | 6 |


In [116]:
x = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol']]


In [117]:
# Equivalent to the explicit column list above: keep every column except the target
x = df.drop(['quality'], axis=1)

In [118]:
x.shape

(4898, 11)

In [119]:
x

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 |


# **Get Train Test Split**

In [120]:
from sklearn.model_selection import train_test_split

In [134]:
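# test_size=0.8 holds out 80% of the rows for testing, so only 979 of the 4898 samples are used for training
x_train, x_test, y_train, Y_test = train_test_split(x, y, test_size=0.8, stratify=y, random_state=2529)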

In [136]:
x_train.shape,x_test.shape,y_train.shape,Y_test.shape

((979, 11), (3919, 11), (979,), (3919,))
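
The shapes above show that `test_size=0.8` leaves only 979 rows for training and holds out 3919 for testing. A more common arrangement is the reverse, training on 80% of the data. The sketch below illustrates this with `test_size=0.2` on a small synthetic dataset; the `demo_*` names and the data are made up for illustration and are not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 100 rows, 11 features, 3 classes.
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 11))
demo_y = np.repeat([5, 6, 7], [30, 50, 20])

# test_size=0.2 keeps 80% of the rows for training; stratify preserves class ratios.
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.2, stratify=demo_y, random_state=2529
)

print(demo_x_train.shape, demo_x_test.shape)
```

With the same change applied to the wine data, the model would see roughly four times as many training samples as it does here.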

# **Select Model**

In [128]:
from sklearn.svm import SVC


In [129]:
model=SVC()

# **Get Model Train**

In [137]:
model.fit(x_train, y_train)
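
The model here is trained on the raw features, whose ranges differ by orders of magnitude (compare total sulfur dioxide and density in the summary statistics). Because the default RBF kernel of `SVC` depends on distances between samples, a common practice is to standardize the features first. A minimal sketch of that idea with a scikit-learn pipeline on synthetic data follows; the `scaled_model` name and the data are assumptions for illustration, and no claim is made about how much this would change the results on the wine data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the wine features.
demo_x, demo_y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)
demo_x[:, 0] *= 1000  # exaggerate one feature's scale to mimic mixed units

demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)

# StandardScaler is fit on the training split only, then applied before the SVC.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(demo_x_train, demo_y_train)

demo_score = scaled_model.score(demo_x_test, demo_y_test)
print(demo_score)
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics.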


# **Get Model Prediction**

In [138]:
y_pred=model.predict(x_test)

In [139]:
x_test

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2368 | 7.4 | 0.34 | 0.30 | 14.90 | 0.037 | 70.0 | 169.0 | 0.99698 | 3.25 | 0.37 | 10.40 |
| 3509 | 6.4 | 0.26 | 0.25 | 10.70 | 0.046 | 66.0 | 179.0 | 0.99606 | 3.17 | 0.55 | 9.90 |
| 3742 | 7.0 | 0.15 | 0.28 | 14.70 | 0.051 | 29.0 | 149.0 | 0.99792 | 2.96 | 0.39 | 9.00 |
| 69 | 7.4 | 0.24 | 0.29 | 10.10 | 0.050 | 21.0 | 105.0 | 0.99620 | 3.13 | 0.35 | 9.50 |
| 928 | 6.5 | 0.25 | 0.35 | 12.00 | 0.055 | 47.0 | 179.0 | 0.99800 | 3.58 | 0.47 | 10.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4618 | 6.1 | 0.44 | 0.28 | 4.25 | 0.032 | 43.0 | 132.0 | 0.99160 | 3.26 | 0.47 | 11.30 |
| 2249 | 6.3 | 0.27 | 0.18 | 7.70 | 0.048 | 45.0 | 186.0 | 0.99620 | 3.23 | 0.47 | 9.00 |
| 3179 | 8.5 | 0.20 | 0.40 | 1.10 | 0.046 | 31.0 | 106.0 | 0.99194 | 3.00 | 0.35 | 10.50 |
| 3387 | 7.6 | 0.36 | 0.49 | 11.30 | 0.046 | 87.0 | 221.0 | 0.99840 | 3.01 | 0.43 | 9.20 |


In [140]:
y_pred.shape

(3919,)

In [141]:
y_pred

array([6, 6, 6, ..., 6, 5, 6])

# **Get Model Evaluation**

In [146]:
from sklearn.metrics import confusion_matrix, classification_report


In [149]:
print(confusion_matrix(Y_test, y_pred))


[[   0    0    5   11    0    0    0]
 [   0    0    4  126    0    0    0]
 [   0    0   75 1091    0    0    0]
 [   0    0   76 1683    0    0    0]
 [   0    0   10  694    0    0    0]
 [   0    0    0  140    0    0    0]
 [   0    0    0    4    0    0    0]]
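
Each row of the confusion matrix is a true quality score (3 to 9) and each column is a predicted score. As a sketch, the following NumPy snippet copies the matrix from the output above and derives the accuracy and the number of predictions per class; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true quality 3..9, columns: predicted quality 3..9).
cm = np.array([
    [0, 0,  5,   11, 0, 0, 0],
    [0, 0,  4,  126, 0, 0, 0],
    [0, 0, 75, 1091, 0, 0, 0],
    [0, 0, 76, 1683, 0, 0, 0],
    [0, 0, 10,  694, 0, 0, 0],
    [0, 0,  0,  140, 0, 0, 0],
    [0, 0,  0,    4, 0, 0, 0],
])
labels = [3, 4, 5, 6, 7, 8, 9]

accuracy = np.trace(cm) / cm.sum()    # correct predictions lie on the diagonal
predicted_per_class = cm.sum(axis=0)  # column sums: how often each class is predicted

print(f"accuracy: {accuracy:.4f}")
print(dict(zip(labels, predicted_per_class.tolist())))
```

The column sums make it easy to see which quality scores the model ever predicts.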


In [151]:
print(classification_report(Y_test, y_pred))


              precision    recall  f1-score   support

           3       0.00      0.00      0.00        16
           4       0.00      0.00      0.00       130
           5       0.44      0.06      0.11      1166
           6       0.45      0.96      0.61      1759
           7       0.00      0.00      0.00       704
           8       0.00      0.00      0.00       140
           9       0.00      0.00      0.00         4

    accuracy                           0.45      3919
   macro avg       0.13      0.15      0.10      3919
weighted avg       0.33      0.45      0.31      3919
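
The zeros for qualities 3, 4, 7, 8, and 9 arise because the model never predicts those classes, so their precision is undefined and scikit-learn reports it as 0 with a warning. The `zero_division` parameter of `classification_report` makes that choice explicit. A small sketch on made-up labels (the `toy_*` lists are hypothetical and not taken from the wine data):

```python
from sklearn.metrics import classification_report

# Toy labels: the "model" always predicts 6, so classes 3, 5, and 7 are never predicted.
toy_true = [3, 5, 6, 6, 7]
toy_pred = [6, 6, 6, 6, 6]

# zero_division=0 reports undefined precision as 0 without emitting a warning.
report = classification_report(toy_true, toy_pred, zero_division=0, output_dict=True)

print(report["3"]["precision"], report["6"]["recall"], report["accuracy"])
```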





# **Accuracy**

In [152]:
from sklearn.metrics import accuracy_score

In [153]:
accuracy_score(Y_test,y_pred)

0.4485838224036744

# **Explanation**

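Interpreting the Accuracy of 0.45
An accuracy of 0.45 (45%) means the model predicts the correct quality score for about 45% of the test wines. The confusion matrix shows why this is weak: the model predicts quality 6 for 3749 of the 3919 test samples and quality 5 for the remaining 170, and it never predicts any other class. Since quality 6 accounts for 1759 of the 3919 test samples (about 45%), the model does no better than always predicting the majority class. Possible reasons for the low accuracy include: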

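**Feature Engineering:** The features are passed to the model unscaled, although their ranges differ widely (total sulfur dioxide reaches 440 while density stays close to 1). An SVM with the default RBF kernel is sensitive to feature scale, so the features likely need more preprocessing.

**Model Choice:** `SVC()` is used with its default hyperparameters, which might not be suitable for this problem.

**Data Quality:** The target classes are heavily imbalanced: qualities 5, 6, and 7 make up about 93% of the samples, while qualities 3 and 9 have only 20 and 5 samples.

**Overfitting/Underfitting:** The collapse to the majority class suggests underfitting rather than overfitting. Training on only 979 samples (20% of the data) may contribute.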


## **Improvement Strategies**
**Feature Engineering:** Try creating new features or transforming existing ones to better capture the underlying patterns.

**Model Tuning:** Experiment with different models and hyperparameters.

**Cross-Validation:** Use cross-validation to better evaluate model performance.

**Ensemble Methods:** Combine multiple models to improve performance.

**Data Augmentation:** Augment the dataset, or rebalance it to address the imbalance in the target classes.
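
Several of these strategies can be combined. The sketch below, on synthetic data, tunes `C` and `gamma` for a scaled `SVC` with stratified cross-validation and uses `class_weight='balanced'` to counter class imbalance. The parameter grid, the scoring choice, and all names are assumptions for illustration, not settings validated on the wine data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic 3-class data standing in for the wine features.
demo_x, demo_y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),  # reweights classes inversely to their frequency
])

# Small illustrative grid; a real search would likely cover more values.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1_macro",  # macro F1 weighs every class equally
)
search.fit(demo_x, demo_y)

print(search.best_params_, round(search.best_score_, 3))
```

Macro-averaged F1 is used here instead of accuracy so that rare classes count as much as common ones during tuning.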

# **Conclusion**
Achieving higher accuracy involves iterating over the steps of data preprocessing, model selection, and evaluation. Continuous refinement of the approach can lead to better performance and more accurate predictions.