<a href="https://colab.research.google.com/github/nehagoyal09/Python_Main_Topics/blob/main/Day_19_central_tendency_wine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **`Introduction`**
This Jupyter notebook is part of your learning experience in the study of central tendency

We will work with a simple data set that contains details of wine quality

In this exercise, we will perform the following tasks:

1 - Load and study the data

2 - View the distributions of the various features in the data set and calculate their central tendencies

3 - Create a new Pandas Series that contains the details of the representative factor for quality



### **`Task 1`**
Load the data and study its features such as:

- fixed acidity

- volatile acidity

- citric acid etc.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('/content/Wine Quality Dataset.csv')
data.head(10)

In [None]:
data.info()

**Observations from Task 1**

There are 4898 rows and 12 columns in the data. Each row contains the acid measurements of one white wine sample together with its quality score

The features in the data set are:

- The different acid measurements (such as fixed acidity, volatile acidity and citric acid) and the quality score

### **`Task 2`**
View the distributions of the various features in the data set and calculate their central tendencies

We will now look at the distributions of the various features in the data set

We will also calculate appropriate measures of central tendency for these features

In [None]:
# Create a histogram of the "Fixed acidity" feature
plt.figure(figsize= (9,4))

sns.histplot(data = data, x= 'fixed acidity', color = 'orange',
             edgecolor= 'linen', alpha = 0.9, bins = 5)

plt.title("Histogram of Fixed Acidity")
plt.xlabel("Fixed Acidity")
plt.ylabel("Count")
plt.show()


### **Observations**

We observe that the histogram is roughly symmetric and bell-shaped, which suggests an approximately normal distribution.

Most values of fixed acidity lie between 6 and 8.

Let's see the measures of central tendency in action!

- Mean
- Median
- Mode
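
Before applying these measures to the wine data, here is a minimal sketch of the three of them on a small made-up sample. The values below are hypothetical and are not taken from the data set:

```python
import pandas as pd

# Hypothetical toy sample, not taken from the wine data
values = pd.Series([6.0, 6.5, 6.5, 7.0, 9.0])

mean_value = values.mean()            # arithmetic average of all values
median_value = values.median()        # middle value after sorting
mode_value = values.mode().iloc[0]    # most frequent value (first one if there is a tie)

print(mean_value, median_value, mode_value)
```

Note how the single large value (9.0) pulls the mean above the median, while the median and the mode stay at the most typical value.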

In [None]:
# calculate the mean of 'fixed acidity'

print(round(data['fixed acidity'].mean(),2))

In [None]:
# calculate the median

data['fixed acidity'].median()

In [None]:
# Create a histogram of the "fixed acidity" feature and also show the mean and the median
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'fixed acidity', color= 'orange',
             edgecolor= 'linen', alpha= 0.9, bins =5)

plt.title("Histogram of Fixed Acidity")
plt.xlabel("Fixed Acidity")
plt.ylabel("Count")

plt.vlines(data['fixed acidity'].mean(), ymin= 0, ymax= 4000, color= 'blue', label= 'Mean')
plt.vlines(data['fixed acidity'].median(), ymin= 0, ymax= 4000, color= 'red', label= 'Median')
plt.legend()
plt.show()


**Observations**

We can see that the mean and the median are both clear representatives of the data.

The mean and the median are very close to each other.

We can choose either of them, say the mean, as the measure of central tendency.

In [None]:
# Create a histogram of the "volatile acidity" feature
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'volatile acidity', color= 'green',
             edgecolor= 'linen', alpha = 0.7, bins =5)

plt.title("Histogram of Volatile Acidity")
plt.xlabel("Volatile Acidity")
plt.ylabel("Count")
plt.show()

`Observations`

We observe that this histogram is not symmetric; it is skewed a little towards the right.

Since we see some skewness, we can check the shape of the distribution more closely with a density plot, that is, a histogram overlaid with a smooth density curve.



![](https://th.bing.com/th/id/OIP.Ugu0QJrmFX7AaKhnEL_E-gHaEK?w=306&h=180&c=7&r=0&o=5&dpr=1.25&pid=1.7)

### **Skewness**:
Some distributions of data, such as the bell curve or normal distribution, are symmetric. This means that the right and the left of the distribution are perfect mirror images of one another. Not every distribution of data is symmetric. Sets of data that are not symmetric are said to be asymmetric. The measure of how asymmetric a distribution can be is called skewness.

The mean, median and mode are all measures of the center of a set of data. The skewness of the data can be determined by how these quantities are related to one another.

- If Skewness (S) = 0, then the distribution is symmetric (the normal distribution is one example).

- If Skewness (S) > 0, then the distribution is positively skewed.

- If Skewness (S) < 0, then the distribution is negatively skewed.
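
The three cases above can be illustrated with a short sketch on synthetic samples. The samples, the seed and the variable names are assumptions made for illustration only; none of them come from the wine data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples, not the wine data
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
left_skewed = -right_skewed  # mirroring a sample flips the sign of its skewness

# Print skewness, mean and median for each sample
for name, sample in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(name, round(sample.skew(), 2),
          round(sample.mean(), 2), round(sample.median(), 2))
```

In the right-skewed sample the mean sits above the median, and in the left-skewed sample it sits below, which is the relationship between the measures of centre and skewness described above.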

In [None]:
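# Plot the distribution of the 'volatile acidity' feature
# (sns.distplot is deprecated; histplot with kde=True draws the same kind of plot)
plt.figure(figsize= (9,4))

sns.histplot(data['volatile acidity'], kde= True, stat= 'density', color= 'blue')

plt.title("Distribution of Volatile Acidity")
plt.xlabel("Volatile Acidity")
plt.ylabel("Density")
plt.show()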


`Observation:`

The above plot is roughly bell-shaped, but its right tail is longer than its left tail, so it is not a perfect normal distribution.

For comparison, the normal distribution is described by the mean and the standard deviation.

The normal distribution is often referred to as a 'bell curve' because of its shape:

- The median and mean are equal
- It has only one mode
- It is symmetric, meaning it decreases by the same amount on the left and the right of the centre

We can also calculate the skewness with the pandas `skew()` method.

In [None]:
# Calculate skewness of 'Volatile Acidity'
print(round(data["volatile acidity"].skew(),2))

`Observation`

The skewness value is greater than 1, which is well above 0, hence the distribution is positively skewed.

In [None]:
# Calculate the mean of volatile acidity
print(round(data['volatile acidity'].mean(),2))

In [None]:
# calculate the median of volatile acidity
data['volatile acidity'].median()

In [None]:
# Create a histogram of the "Volatile Acidity" feature and also show the mean and the median
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'volatile acidity', color= 'green',
             edgecolor= 'linen', alpha= 0.7, bins= 5)
plt.title("Histogram of Volatile Acidity")
plt.xlabel("Volatile Acidity")
plt.ylabel("Count")

plt.vlines(data['volatile acidity'].mean(), ymin=0, ymax=4000, color= 'blue', label= 'mean')
plt.vlines(data['volatile acidity'].median(), ymin=0, ymax=4000, color= 'red', label= 'median')
plt.legend()
plt.show()

**Observations**

The mean and the median are close to each other and the difference between them is very small.

We can safely choose the mean as the measure of the central tendency here.

In [None]:
# Create a histogram of citric acid
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'citric acid', color= 'purple',
             edgecolor= 'linen', alpha= 0.5, bins= 5)
plt.title("Histogram of Citric Acid")
plt.xlabel("Citric Acid")
plt.ylabel("Count")
plt.show()

**Observation**

We observe that this histogram is not symmetric; it is skewed a little towards the right.

In [None]:
# Calculate the mean of the "Citric Acid" feature
print(round(data['citric acid'].mean(),2))

In [None]:
# Calculate the median of the "Citric Acid" feature
data['citric acid'].median()

In [None]:
# Create a histogram of the "Citric Acid" feature and also show the mean and the median
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'citric acid', color= 'purple',
             edgecolor= 'linen', alpha= 0.5, bins = 5)
plt.title("Histogram of Citric Acid")
plt.xlabel("Citric Acid")
plt.ylabel("Count")

plt.vlines(data['citric acid'].mean(), ymin=0, ymax=4000, color= 'blue', label= 'mean')
plt.vlines(data['citric acid'].median(), ymin=0, ymax=4000, color= 'red', label= 'median')
plt.legend()
plt.show()

**Observation**

The mean and the median are close to each other and the difference between them is very small.

We can safely choose the mean as the measure of the central tendency here.

In [None]:
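# Plot the distribution of the 'citric acid' feature
# (sns.distplot is deprecated; histplot with kde=True draws the same kind of plot)
plt.figure(figsize= (11,6))

sns.histplot(data['citric acid'], kde= True, stat= 'density', color = 'blue')
plt.title("Distribution of Citric Acid")
plt.xlabel("Citric Acid")
plt.ylabel("Density")
plt.show()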

We can follow the same procedure for the other numerical columns to get their mean and median (measures of central tendency)
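
Rather than repeating the cells above for every column, the same calculation can be sketched as a small helper. The function name `summarize_central_tendency` and the stand-in DataFrame `demo` (including its column names and values) are hypothetical and only serve to make the sketch runnable on its own:

```python
import pandas as pd

def summarize_central_tendency(frame, columns):
    """Return mean, median and skewness for each requested numeric column."""
    return pd.DataFrame({
        "mean": frame[columns].mean(),
        "median": frame[columns].median(),
        "skew": frame[columns].skew(),
    }).round(3)

# Hypothetical stand-in for the wine DataFrame, used only so the sketch runs on its own
demo = pd.DataFrame({
    "residual sugar": [1.0, 2.0, 3.0, 4.0, 20.0],
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4],
})

print(summarize_central_tendency(demo, ["residual sugar", "pH"]))
```

In this notebook, a call such as `summarize_central_tendency(data, ['fixed acidity', 'volatile acidity', 'citric acid'])` should give the same means and medians as the individual cells above.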

In [None]:
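# Count each quality score; naming the column explicitly keeps df['count'] valid across pandas versions
df = data['quality'].value_counts().rename('count').to_frame()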
df.index

In [None]:
# Create a bar plot of the counts of the 'quality' feature
plt.figure(figsize= (9,4))

sns.barplot(x= df.index, y= df['count'])

plt.title("Bar Plot of Quality")
plt.xlabel("Quality")
plt.ylabel("Count")
plt.show()

**Observation**

It is quite clear from the bar plot that quality 6 has the highest count, whereas the count for quality 9 is negligible.

In [None]:
# Count the number of occurrences of different categories of the "Quality" feature
data['quality'].value_counts()

In [None]:
# Calculate the mode of the "quality" feature
data['quality'].mode()

`Observations from Task 2`

We saw the distributions of the various features in the data set using appropriate plots

We calculated central tendency measures like mean, median and mode for the various features

The mean and the median for all the above features were similar, so we can choose the mean in these cases

The mode of the "Quality" feature can be chosen as a representative value
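
One detail worth keeping in mind: the pandas `mode()` method returns a Series rather than a single number, because several values can be tied for most frequent. A minimal sketch, using hypothetical scores that are not taken from the wine data:

```python
import pandas as pd

# Hypothetical quality scores with a tie between 5 and 6
scores = pd.Series([5, 5, 6, 6, 7])

modes = scores.mode()        # Series holding every most-frequent value, in sorted order
first_mode = modes.iloc[0]   # a single scalar, if one representative value is needed

print(modes.tolist(), first_mode)
```

When there is only one mode, as the counts above suggest for "Quality", `mode().iloc[0]` simply extracts that value as a scalar.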



### **`Task 3`**
Create a new Pandas Series that contains the representative value of each acid feature and of the quality

We will now create a Pandas Series that contains the representative values for each of the features

In [None]:
# Create a new Pandas Series called "rep_acid" that contains the representative value of each acid feature and of the quality
# mode() returns a Series, so .iloc[0] is used to take the (first) mode as a single value

rep_acid = pd.Series(index = ['fixed acidity', 'volatile acidity', 'citric acid', 'quality'],
                     data = [data['fixed acidity'].mean(), data['volatile acidity'].mean(),
                             data['citric acid'].mean(), data['quality'].mode().iloc[0]])

rep_acid

**Observations from Task 3**

The representative values are as follows:

- The mean value of the fixed acidity is about 6.854

- The mean value of the volatile acidity is about 0.2782

- The mean value of citric acid is about 0.3341

- The representative (most frequent) quality is 6




**Final Conclusions**
- From the given data, we can use simple visualisations to get a sense of how data are distributed.

- We can use various measures of central tendency such as mean, median and mode to represent a group of observations.

- The type of central tendency measure to use depends on the type and the distribution of the data
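
The reasoning used throughout this notebook (take the mode for a categorical feature, and take the mean when it lies close to the median) can be sketched as a small helper. The function name `pick_central_measure`, the `categorical` flag and the `tolerance` cut-off are all hypothetical choices made for illustration; they are not a standard statistical rule:

```python
import pandas as pd

def pick_central_measure(series, categorical=False, tolerance=0.25):
    """Suggest a measure of central tendency for one column.

    tolerance is an illustrative cut-off: the mean is suggested when it lies
    within tolerance * standard deviation of the median.
    """
    if categorical:
        # Categories and rating scores: use the most frequent value
        return "mode", series.mode().iloc[0]
    mean, median, std = series.mean(), series.median(), series.std()
    if std == 0 or abs(mean - median) <= tolerance * std:
        return "mean", mean
    # Mean and median disagree noticeably, so the median is the safer choice
    return "median", median

# Hypothetical samples, not taken from the wine data
print(pick_central_measure(pd.Series([1, 2, 3, 4, 5])))
print(pick_central_measure(pd.Series([1, 1, 1, 1, 100])))
print(pick_central_measure(pd.Series([5, 6, 6, 7]), categorical=True))
```

A different tolerance, or a rule based on the skewness value instead, would be an equally reasonable design; the point is only that the choice depends on the type and the shape of the data.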

