# NAIVE BAYES CLASSIFIER FOR RED WINE QUALITY PREDICTION

This note analyzes a red wine dataset and builds a naive Bayes classifier to predict the quality of a red wine.

## Section I: Definition

### 1. Input Attributes

The quality of red wine (or of wine in general) is determined by 11 attributes, listed below:

1. fixed acidity (`fixed`)
2. volatile acidity (`volatile`)
3. citric acid (`citric`)
4. residual sugar (`rsugar`)
5. chlorides (`chlorides`)
6. free sulfur dioxide (`free-sulfur-dioxide`)
7. total sulfur dioxide (`total-sulfur-dioxide`)
8. density (`density`)
9. pH (`ph`)
10. sulphates (`sulphates`)
11. alcohol (`alcohol`)

### 2. Output Classification

The quality of wine is generally scored from 0 to 10, where 0 is the worst and 10 is the best.

So, given measured values for the input attributes, the predicted quality score of a red wine will be in the range 0..10.

### 3. Sample Dataset

Following is a sample of 10 records. Each record lists the 11 attributes in the order above, followed by the quality score in the last column.

```csv
7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7.9,0.6,0.06,1.6,0.069,15,59,0.9964,3.3,0.46,9.4,5
7.3,0.65,0,1.2,0.065,15,21,0.9946,3.39,0.47,10,7
7.8,0.58,0.02,2,0.073,9,18,0.9968,3.36,0.57,9.5,7
7.5,0.5,0.36,6.1,0.071,17,102,0.9978,3.35,0.8,10.5,5
```
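
To make the column layout concrete, here is a small sketch that pairs one sample record with the attribute names from the list above. The `COLUMNS` list, the `quality` label, and the `label_record` helper are naming assumptions for illustration only; the sample rows themselves carry no header.

```python
# Column names follow the attribute list above; "quality" is an assumed
# name for the last column, which holds the score.
COLUMNS = [
    "fixed", "volatile", "citric", "rsugar", "chlorides",
    "free-sulfur-dioxide", "total-sulfur-dioxide", "density",
    "ph", "sulphates", "alcohol", "quality",
]

def label_record(line):
    """Turn one CSV line into a {column name: float value} dictionary."""
    values = [float(x) for x in line.split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(
            "expected {0} values, got {1}".format(len(COLUMNS), len(values)))
    return dict(zip(COLUMNS, values))

record = label_record("7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5")
print(record["alcohol"], record["quality"])
```

The rest of this note works with plain lists of floats rather than dictionaries, so the class is always the last element of a row.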

## Section II: Analysis

The analysis proceeds in the following order:

1. **Build data**: that is to split the input dataset into two parts:

  - **Training Set**: to train the model.
  - **Test Set**: to evaluate the prediction.

2. **Summarize data**: compute everything needed before making predictions: the classes (in this analysis, the class is the wine quality) and, for each class, the mean and standard deviation of every attribute.

  - **Group data by class**.
  - **Compute mean**.
  - **Compute standard deviation**.
  - **Summarize dataset**.
  - **Summarize attributes**.

3. **Make single prediction**: using the values computed in the previous step, predict the wine quality score for a single record of red wine attributes. (**INTERESTING!!!**)

  - **Compute the Gaussian distribution**.
  - **Compute class probabilities**.
  - **Predict the wine quality score**.

4. **Make complete prediction**: at this step, we make a prediction for every record in the test set.

5. **Evaluate accuracy**: after prediction, we compare the predicted scores with the actual scores in the test set to see how accurate the model is. The value is in the range 0%..100%.

With that in mind, let's dive into the first step.

### 1. Build Data

The given dataset is in CSV format, and any other dataset used with this note should be in CSV format as well.

First, import dataset:

In [1]:
import csv

def load_csv(filename):
    dataset = list()
    with open(filename) as csv_file:
        rows = csv.reader(csv_file, delimiter=',')
        for row in rows:
            dataset.append([float(x) for x in row])
    return dataset


Next, we split the dataset into two parts, a **Training Set** and a **Test Set**, with a ratio of 67% : 33%.

In [2]:
import random

def split_dataset(dataset, split_ratio, is_random=False):
    train_size = int(len(dataset) * split_ratio)
    train_set = []
    if is_random:
        copy = list(dataset)
        while len(train_set) < train_size:
            index = random.randrange(len(copy))
            train_set.append(copy.pop(index))
        return [train_set, copy]
    return [dataset[:train_size], dataset[train_size:]]
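
When a random split is wanted, it often helps to make it repeatable so that the accuracy can be compared between runs. The sketch below is one possible variant, not the function used in this note: `split_dataset_shuffled` and its `seed` parameter are hypothetical names, and it shuffles a copy of the rows with a private `random.Random` generator before cutting at the ratio.

```python
import random

def split_dataset_shuffled(dataset, split_ratio, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut at split_ratio."""
    rows = list(dataset)                 # leave the caller's list untouched
    random.Random(seed).shuffle(rows)    # private generator, repeatable order
    train_size = int(len(rows) * split_ratio)
    return rows[:train_size], rows[train_size:]

# Toy rows: [attribute, class]
toy = [[i, i % 3] for i in range(9)]
train, test = split_dataset_shuffled(toy, 0.67)
print(len(train), len(test))
```

Calling it twice with the same seed gives the same two parts, while a different seed gives a different partition of the same rows.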

### 2. Summarize Data

It's time to do some grouping and to compute the prerequisite components, including the **mean** and **standard deviation**.

#### Step 2.1: Group by Class

In [3]:
def separate_by_class(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
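
The same grouping can be sketched with `collections.defaultdict`, which removes the explicit "is this class already a key?" check. The name `separate_by_class_dd` is hypothetical, and the three rows are shortened toy records (two attributes plus the class), not rows from the dataset.

```python
from collections import defaultdict

def separate_by_class_dd(dataset):
    """Group rows by their last element (the class value)."""
    separated = defaultdict(list)
    for vector in dataset:
        separated[vector[-1]].append(vector)
    return dict(separated)   # plain dict, so missing keys raise KeyError again

rows = [[7.4, 9.4, 5.0], [11.2, 9.8, 6.0], [7.8, 9.8, 5.0]]
groups = separate_by_class_dd(rows)
print(sorted(groups))        # the class values that were seen
print(len(groups[5.0]))      # number of rows with quality 5
```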

#### Step 2.2: Compute Mean

The mean is the sum of all the numbers divided by how many numbers there are.

In [4]:
def mean(numbers):
	return sum(numbers)/float(len(numbers))

#### Step 2.3: Compute Standard Deviation

The standard deviation describes how spread out the data is, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The variance is calculated as the average of the squared differences between each attribute value and the mean, and the standard deviation is its square root. Note that we are using the N-1 method (the sample variance), which subtracts 1 from the number of attribute values when calculating the variance.

In [5]:
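import math

def stdev(numbers):
    # Sample standard deviation (N-1 in the denominator); needs at least two values.
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)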
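
As a worked check of the N-1 method, the sketch below computes the mean and standard deviation by hand for the alcohol values of the first five sample records and compares the result with Python's standard `statistics` module, where `statistics.stdev` divides by N-1 and `statistics.pstdev` divides by N.

```python
import math
import statistics

# Alcohol column of the first five sample records shown earlier.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]

avg = sum(alcohol) / len(alcohol)
# Sum of squared differences from the mean, divided by N-1.
sample_var = sum((x - avg) ** 2 for x in alcohol) / (len(alcohol) - 1)
sample_std = math.sqrt(sample_var)

print(avg, sample_std)
print(statistics.stdev(alcohol))    # N-1 method, should agree with sample_std
print(statistics.pstdev(alcohol))   # N method, slightly smaller
```

Here the mean is about 9.64 and the N-1 standard deviation is about 0.219, while dividing by N gives about 0.196. The gap between the two shrinks as the number of values grows.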

#### Step 2.4: Summarize Dataset

Let's group the computed components into (mean, standard deviation) pairs, one per attribute, for further computation.

In [6]:
def summarize(dataset):
	summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
	del summaries[-1]
	return summaries

#### Step 2.5: Summarize by Class

So, it's time to apply the summarization to the whole training set, which is done by computing the per-attribute summaries separately for each class.


In [7]:
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries
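
To show the shape of the resulting model, here is a compact sketch of the same idea built on the standard `statistics` module. The name `summarize_by_class_stats` and the four toy rows (two attributes plus the class) are assumptions for illustration. Note that `statistics.stdev` raises an error when a class has fewer than two rows.

```python
import statistics
from collections import defaultdict

def summarize_by_class_stats(dataset):
    """Map each class value to a list of (mean, stdev) pairs, one per attribute."""
    grouped = defaultdict(list)
    for row in dataset:
        grouped[row[-1]].append(row[:-1])   # drop the class column up front
    return {
        class_value: [(statistics.mean(col), statistics.stdev(col))
                      for col in zip(*rows)]
        for class_value, rows in grouped.items()
    }

toy = [
    [0.7, 9.4, 5.0],
    [0.9, 9.8, 5.0],
    [0.3, 9.8, 6.0],
    [0.5, 10.2, 6.0],
]
model = summarize_by_class_stats(toy)
for class_value, pairs in sorted(model.items()):
    print(class_value, pairs)
```

Each class ends up with one pair per attribute, so a model trained on the wine data holds 11 pairs for every quality score that occurs in the training set.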

### 3. Make Single Prediction

At this point, we have all the prerequisite inputs for computing the probabilities. As outlined in the analysis steps above, we will apply the Gaussian distribution, also called the [**Normal Distribution**](https://en.wikipedia.org/wiki/Normal_distribution), to calculate the probabilities.

#### Step 3.1: Compute Gaussian Probability Density

In [8]:
import math

def calculate_probability(x, x_mean, x_stdev):
    exponent = math.exp(-(math.pow(x - x_mean, 2) / (2 * math.pow(x_stdev, 2))))
    return (1 / (math.sqrt(2*math.pi) * x_stdev)) * exponent
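
As a sanity check on the formula, the sketch below writes the density out again and compares it with `statistics.NormalDist` from the standard library (available in Python 3.8 and later). The name `gaussian_pdf` and the values of `x`, `mu` and `sigma` are illustrative assumptions, not numbers fitted to the dataset.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

x, mu, sigma = 9.4, 9.64, 0.22   # illustrative values only
by_hand = gaussian_pdf(x, mu, sigma)
reference = NormalDist(mu, sigma).pdf(x)
print(by_hand, reference)

# The value at the mean is a density, not a probability, so it can exceed 1.
print(gaussian_pdf(mu, mu, sigma))
```

Because these are densities, the products computed in the next step are best read as relative scores between classes rather than as probabilities that sum to 1.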

#### Step 3.2: Compute Class Probabilities

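Following the naive Bayes formula, which treats the attributes as independent given the class, we can compute the probability of a class by multiplying together the probability of each attribute value under that class. Note that the function below multiplies only the per-attribute densities and leaves out the class prior, so every class is treated as equally likely before the attributes are seen.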

In [9]:
def calculate_class_probabilities(summaries, input_vector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            xmean, xstdev = classSummaries[i]
            x = input_vector[i]
            probabilities[classValue] *= calculate_probability(x, xmean, xstdev)
    return probabilities
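
Multiplying 11 small densities can underflow toward zero, and the product above does not weight classes by how common they are. One possible variant, sketched here under assumptions, sums log-densities and adds the log of a class prior. The names `log_gaussian_pdf`, `class_log_scores` and `priors` are hypothetical, and the toy summaries and priors are made up; in practice the priors could be taken as the fraction of training rows in each class.

```python
import math

def log_gaussian_pdf(x, mu, sigma):
    """Natural log of the Gaussian density at x."""
    return (-math.log(sigma * math.sqrt(2 * math.pi))
            - ((x - mu) ** 2) / (2 * sigma ** 2))

def class_log_scores(summaries, priors, input_vector):
    """log(prior) plus the sum of per-attribute log-densities, per class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        score = math.log(priors[class_value])
        for i, (mu, sigma) in enumerate(class_summaries):
            score += log_gaussian_pdf(input_vector[i], mu, sigma)
        scores[class_value] = score
    return scores

# Toy model: two classes, two attributes, made-up (mean, stdev) pairs.
summaries = {5.0: [(0.6, 0.1), (9.5, 0.3)], 7.0: [(0.4, 0.1), (11.0, 0.5)]}
priors = {5.0: 0.8, 7.0: 0.2}
scores = class_log_scores(summaries, priors, [0.62, 9.4])
print(max(scores, key=scores.get))
```

The class with the highest log-score is the same one that would win on the product of prior and densities, since the logarithm preserves ordering.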

#### Step 3.3: Predict Quality Score

To predict the quality score, we first compute the probability of each class (here, each wine quality score) and then pick the class with the highest probability.


In [10]:
def predict(summaries, input_vector):
    probabilities = calculate_class_probabilities(summaries, input_vector)
    best_label, best_prob = None, -1
    for classValue, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = classValue
    return best_label
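
The selection loop can also be sketched with the built-in `max` and a `key` function. The helper name `pick_best_label` and the toy scores are assumptions for illustration. On ties, `max` returns the first key it meets in dictionary order, which matches the strict `>` comparison in the loop above.

```python
def pick_best_label(probabilities):
    """Return the key with the largest value, or None for an empty dict."""
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)

print(pick_best_label({5.0: 0.012, 6.0: 0.031, 7.0: 0.004}))
```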

### 4. Make Complete Prediction

So, we compute a prediction (the estimate from our model so far) for each record in the test set and collect the results in a list for the final evaluation.


In [11]:
def get_predictions(summaries, test_set):
    predictions = []
    for i in range(len(test_set)):
        result = predict(summaries, test_set[i])
        predictions.append(result)
    return predictions

### 5. Evaluate Accuracy

The final step is to compare our predictions with the actual scores in the test set to see how well our model is working so far.



In [12]:
def get_accuracy(test_set, predictions):
    correct = 0
    for x in range(len(test_set)):
        if test_set[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(test_set))) * 100.0
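
A single accuracy number hides which quality scores get confused with each other. As a complementary sketch, the counts of (actual, predicted) pairs can be tallied with `collections.Counter`. The name `confusion_counts` and the toy rows and predictions are assumptions for illustration.

```python
from collections import Counter

def confusion_counts(test_set, predictions):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter((row[-1], predicted)
                   for row, predicted in zip(test_set, predictions))

# Toy rows: [attribute, actual class], with made-up predictions.
test_rows = [[0.7, 5.0], [0.6, 5.0], [0.3, 6.0], [0.4, 7.0]]
predicted = [5.0, 6.0, 6.0, 6.0]
counts = confusion_counts(test_rows, predicted)
for (actual, guess), n in sorted(counts.items()):
    print("actual {0} predicted {1}: {2}".format(actual, guess, n))
```

Pairs where the two values match are the correct predictions, so summing those counts and dividing by the number of rows gives the same accuracy as above, as a fraction.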

## Section III: Execution

We have completed the prediction model. Let's see how well it can determine the wine quality.

In [None]:
def main():
    filename = 'red-wine-quality.csv'
    split_ratio = 0.67
    dataset = load_csv(filename)
    training_set, test_set = split_dataset(dataset, split_ratio, False)
    print('Split {0} rows into train = {1} and test = {2} rows'.format(len(dataset), len(training_set), len(test_set)))
    # prepare model
    model = summarize_by_class(training_set)
    # test model
    predictions = get_predictions(model, test_set)
    accuracy = get_accuracy(test_set, predictions)
    print('Accuracy: {0}'.format(accuracy))

main()

**RESULT**:

Here is the result of prediction accuracy.

```
Split 1599 rows into train = 1071 and test = 528 rows
Accuracy: 40.15151515151515
```

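An accuracy of about 40% is low. It looks like the dataset may be too small, and a bigger input might give better mean and stdev estimates for the computation. Note also that this run used `is_random=False`, so the test set is simply the last 528 rows of the file rather than a random sample.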