# Wine Quality Analysis

This notebook performs an exploratory data analysis (EDA) on the Red Wine Quality dataset.

## 1. Importing Libraries
First, let's import the necessary libraries for data manipulation, analysis, and visualization.

In [None]:
import numpy as np
import sklearn
import pandas as pd
import matplotlib.pyplot as plt

## 2. Loading the Data
Now, we'll load the dataset from the CSV file into a pandas DataFrame and display the first few rows to get a glimpse of the data.

In [None]:
df_wine = pd.read_csv('../../Datasets/winequality-red.csv', sep=',')
df_wine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |



## 3. Initial Data Exploration
In this section, we'll perform some initial checks to understand the structure and properties of the dataset.

### 3.1. Dataset Shape
Let's start by checking the dimensions of the dataset (number of rows and columns).

In [None]:
print(df_wine.shape)

The dataset has 1599 rows and 12 columns. This means there are 11 features (attributes) for each wine sample, plus one target variable ('quality').
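As a minimal sketch of the feature/target split described above, the snippet below separates the `quality` column from the remaining columns. The `sample` frame is a stand-in built from a few columns of the rows shown by `df_wine.head()`; the names `sample`, `X`, and `y` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in for df_wine: a few columns from the rows shown by df_wine.head().
# In the notebook, use df_wine itself instead of `sample`.
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "citric acid": [0.0, 0.0, 0.04, 0.56, 0.0],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape

# Every column except 'quality' is a feature; 'quality' is the target.
X = sample.drop(columns="quality")
y = sample["quality"]

print(f"{n_rows} rows, {X.shape[1]} features, 1 target")
```

Applied to `df_wine`, the same split should leave 11 feature columns, matching the count given above.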

### 3.2. Data Types and Non-Null Values
Next, we'll examine the data types of each column and check for any missing values.

In [None]:
df_wine.info()

The `.info()` output shows that all columns have the `float64` data type, except for the 'quality' column, which is `int64`. This is expected, as quality is a discrete score. Importantly, there are no missing values in any of the columns.
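The reading of `.info()` above can also be checked programmatically rather than by eye. The sketch below assumes a stand-in frame `sample` (a few columns from the rows shown by `df_wine.head()`); the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for df_wine: a few columns from the rows shown by df_wine.head().
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# How many columns there are of each dtype.
dtype_counts = sample.dtypes.value_counts()

# Names of the columns that are not float64.
non_float_columns = sample.select_dtypes(exclude="float64").columns.tolist()

# Non-null count per column, the same figure .info() reports.
non_null_counts = sample.notna().sum()

print(dtype_counts)
print(non_float_columns)
print(non_null_counts)
```

For `df_wine`, `non_float_columns` would be expected to contain only `quality`, and every non-null count should equal the number of rows.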

### 3.3. Checking for Missing or Special Values
Although `.info()` reported no nulls, it's good practice to explicitly check for missing values (NA), zeros, or other special values that might indicate data quality issues.

Question 4: How many missing values do we have?

Every cell in the DataFrame has a value, so there are no missing values. No cell contains -1 either. A few cells contain 0, but these are plausible for the attribute: in the summary statistics, `citric acid` is the only column with a minimum of 0.0.
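The notebook does not show the code behind this check, so the following is only one possible way to do it. It is a minimal sketch, assuming a hypothetical helper `count_special_values` and a stand-in frame `sample` built from a few columns of the rows shown by `df_wine.head()`.

```python
import pandas as pd

def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing, zero, and -1 entries in each column (hypothetical helper)."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "n_zero": (df == 0).sum(),
        "n_minus_one": (df == -1).sum(),
    })

# Stand-in for df_wine: a few columns from the rows shown by df_wine.head().
# In the notebook, call count_special_values(df_wine) instead.
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "citric acid": [0.0, 0.0, 0.04, 0.56, 0.0],
    "quality": [5, 5, 5, 6, 5],
})

special_counts = count_special_values(sample)
print(special_counts)
```

Each row of the result is a column of the input, so any column with a nonzero `n_missing` or `n_minus_one` stands out at a glance.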


In [None]:
print(df_wine.nunique())

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64
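The `nunique()` output shows that `quality` takes only 6 distinct values, so it can be tabulated directly. The sketch below counts samples per score on a stand-in Series holding the five `quality` values shown by `df_wine.head()`; in the notebook, `df_wine['quality']` would take its place.

```python
import pandas as pd

# Stand-in for df_wine['quality']: the values shown by df_wine.head().
quality = pd.Series([5, 5, 5, 6, 5], name="quality")

# Number of samples per quality score, ordered by score rather than by count.
quality_counts = quality.value_counts().sort_index()

print(quality_counts)
```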


In [None]:
df_wine.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |
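In the summary above, some maxima sit far above the 75th percentile: `residual sugar` has a 75% value of 2.6 and a maximum of 15.5. One possible follow-up, not part of the original notebook, is to count values outside the usual 1.5 × IQR fences. The sketch below assumes a hypothetical helper `iqr_outlier_counts` and a small illustrative frame: the first five rows come from `df_wine.head()`, and the sixth is a made-up row that pairs the `residual sugar` maximum with the `pH` median.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column names.
    return ((df < lower) | (df > upper)).sum()

# Illustrative data only: five rows from df_wine.head() plus one made-up row.
sample = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.9, 15.5],
    "pH": [3.51, 3.2, 3.26, 3.16, 3.51, 3.31],
})

outlier_counts = iqr_outlier_counts(sample)
print(outlier_counts)
```

The factor `k = 1.5` is the conventional box-plot choice and is an assumption here; a stricter or looser fence may suit the data better.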
