<a href="https://colab.research.google.com/github/kirisame-ame/GCI_AI-Course/blob/main/ENG_hw2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 2: Cleaning Data Using Pandas

## Question

Your task is to use a dataset [1] on the quality of Portuguese "Vinho Verde" red wine. The dataset consists of 11 columns describing each wine sample's features (pH, acidity, etc.) and 1 column rating the wine quality on a scale from 1 to 10.

Both the [dataset itself](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) and its [description](https://archive.ics.uci.edu/dataset/186/wine+quality) are available online.

Your task is as follows:
1. Load the dataset as a `pd.DataFrame`.
2. Group the data based on the `volatile acidity` column into $n$ **equally-sized** groups.
    - $n$ is a natural number (i.e., a positive integer) that does not exceed the number of data points and for which no splitting point falls exactly on duplicate values.
    - If the number of data points is not perfectly divisible by $n$, rely on the behavior of `pd.qcut()` to handle the grouping.
3. For each of the groups you created, select only the rows where the value in the `quality` column is equal to 5.
4. Calculate the mean of the `alcohol` values within these filtered rows for each group.
5. Return the minimum of the mean alcohol values among all the groups.
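
The grouping behavior described in step 2 can be sketched on toy data. The values below are invented for illustration and are not taken from the wine dataset; the sketch only shows how `pd.qcut()` sizes its groups and what happens when splitting points coincide because of duplicate values.

```python
import pandas as pd

# Hypothetical toy data: ten distinct values (not the wine dataset)
values = pd.Series(range(1, 11))

# 10 points into 5 quantile bins: every bin holds exactly 2 points
even_sizes = pd.qcut(values, 5).value_counts().tolist()

# 10 points into 4 bins: not divisible, so sizes are as close to equal as possible
uneven_sizes = sorted(pd.qcut(values, 4).value_counts().tolist())

# Heavily duplicated values make several quantile edges identical,
# which pd.qcut() rejects with a ValueError by default
duplicated = pd.Series([1, 1, 1, 1, 2])
try:
    pd.qcut(duplicated, 4)
    duplicate_edges_error = False
except ValueError:
    duplicate_edges_error = True

print(even_sizes, uneven_sizes, duplicate_edges_error)
```

This is why the task restricts $n$ to values for which no splitting point lands on duplicates.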

**Submission Guidelines:**<br>
When submitting your solution, submit only the complete `homework()` function. Select this week's assignment in the Omnicampus homework section, paste the function into the submission area, and then click [Submit Python Code].

Please pay attention to the following points when submitting.
- Remove the `!!WRITE ME!!` placeholder before submitting
- The return value of the function should be a numeric type
- Write your answer as a single function

## Deadline
2/9 (Sun) 23:59

## 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame

## 2. Downloading the Dataset
We will download the dataset using the `wget` command.

In [2]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

--2025-03-05 14:56:34--  http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘winequality-red.csv’

winequality-red.csv     [  <=>               ]  82.23K   403KB/s    in 0.2s    

2025-03-05 14:56:35 (403 KB/s) - ‘winequality-red.csv’ saved [84199]



In [3]:
url_winequality_data = './winequality-red.csv'

You can now load the dataset using:

```python
pd.read_csv('./winequality-red.csv', sep=';')
```
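
As a small illustration of why `sep=';'` is needed, the sketch below parses an in-memory excerpt written in the same semicolon-delimited layout. The excerpt is hypothetical: it keeps only three of the columns and two rows, so it is not a substitute for the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt: three columns, two rows, semicolon-delimited
sample = io.StringIO(
    '"volatile acidity";"alcohol";"quality"\n'
    '0.7;9.4;5\n'
    '0.28;9.8;6\n'
)

# sep=';' tells pandas to split fields on semicolons instead of commas
df_sample = pd.read_csv(sample, sep=';')
columns = list(df_sample.columns)
print(columns, df_sample.shape)
```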

## 3. Solution

In [5]:
df = pd.read_csv('./winequality-red.csv', sep=';')

In [6]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [12]:
# Replace the numeric column with its quantile bin (10 bins of nearly equal size)
df['volatile acidity'] = pd.qcut(df['volatile acidity'], 10)
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | (0.66, 0.745] | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | (0.745, 1.58] | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | (0.745, 1.58] | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | (0.119, 0.31] | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | (0.66, 0.745] | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [13]:
# Keep only the rows whose quality is exactly 5
high_q = df[df["quality"] == 5]
high_q.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | (0.66, 0.745] | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | (0.745, 1.58] | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | (0.745, 1.58] | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 4 | 7.4 | (0.66, 0.745] | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 5 | 7.4 | (0.61, 0.66] | 0.0 | 1.8 | 0.075 | 13.0 | 40.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [15]:
bins = high_q['volatile acidity'].unique()
print(bins)

[(0.66, 0.745], (0.745, 1.58], (0.61, 0.66], (0.57, 0.61], (0.47, 0.52], (0.52, 0.57], (0.37, 0.415], (0.415, 0.47], (0.31, 0.37], (0.119, 0.31]]
Categories (10, interval[float64, right]): [(0.119, 0.31] < (0.31, 0.37] < (0.37, 0.415] <
                                            (0.415, 0.47] ... (0.57, 0.61] < (0.61, 0.66] <
                                            (0.66, 0.745] < (0.745, 1.58]]
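
The cell above recovers the bins by collecting the unique interval labels. As an alternative sketch, `pd.qcut()` can also return the bin edges directly through its `retbins=True` argument. The toy series below is invented for illustration and is not the wine data.

```python
import pandas as pd

# Hypothetical toy data: ten distinct values (not the wine dataset)
values = pd.Series(range(1, 11))

# retbins=True returns the interval labels together with the bin edges
binned, edges = pd.qcut(values, 4, retbins=True)
edges_list = [float(e) for e in edges]
print(edges_list)
```

Four bins are delimited by five edges, so the length of `edges` is always one more than the number of bins.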


In [20]:
# observed=True keeps only the bins that actually occur in the filtered rows
grouped = high_q.groupby("volatile acidity", observed=True)["alcohol"].mean()
# Display every bin except the last one
grouped[:-1]


| volatile acidity | alcohol |
|---|---|
| (0.119, 0.31] | 10.243478 |
| (0.31, 0.37] | 10.131818 |
| (0.37, 0.415] | 9.766 |
| (0.415, 0.47] | 9.912987 |
| (0.47, 0.52] | 9.822222 |
| (0.52, 0.57] | 9.936364 |
| (0.57, 0.61] | 9.894253 |
| (0.61, 0.66] | 9.834574 |
| (0.66, 0.745] | 9.786082 |


In [24]:
print(grouped.min())

9.766
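
To check the whole pipeline by hand, the same steps can be run on a tiny invented table where the answer is easy to verify. This is only a sketch under stated assumptions: the `toy` values are hypothetical, and the bins are kept in a separate variable rather than overwriting the column, which is a variation on the cells above.

```python
import pandas as pd

# Hypothetical toy table (values invented for illustration)
toy = pd.DataFrame({
    'volatile acidity': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'quality':          [5, 5, 6, 5, 5, 6, 5, 5],
    'alcohol':          [9.0, 10.0, 12.0, 11.0, 9.5, 13.0, 10.5, 11.5],
})

# Step 2: two quantile bins of four rows each
bins = pd.qcut(toy['volatile acidity'], 2)

# Step 3: keep only the quality-5 rows
is_five = toy['quality'] == 5

# Step 4: mean alcohol per bin among the filtered rows
means = toy.loc[is_five, 'alcohol'].groupby(bins[is_five], observed=True).mean()

# Step 5: smallest of the per-bin means, as a plain Python float
result = float(means.min())
print(result)
```

The lower bin averages 9.0, 10.0, and 11.0, and the upper bin averages 9.5, 10.5, and 11.5, so the minimum of the two means is the lower bin's 10.0.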


In [None]:
# Feel free to use this code block for testing
def homework(url_winequality_data, n):
    pass  # Replace with your implementation

In [31]:
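def homework(url_winequality_data, n):
    df = pd.read_csv(url_winequality_data, sep=';')
    # Split `volatile acidity` into n quantile bins of (nearly) equal size
    df['volatile acidity'] = pd.qcut(df['volatile acidity'], n)
    # Keep only the rows whose quality is exactly 5
    five_q = df[df['quality'] == 5]
    # Mean alcohol per bin; observed=True skips bins with no quality-5 rows
    grouped = five_q.groupby('volatile acidity', observed=True)['alcohol'].mean()
    # Return the smallest mean as a plain Python float
    return float(grouped.min())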

Once you've finished checking, copy and paste your `homework()` function into Omnicampus and submit your solution. If you see a `1.0`, it means that your answer was correct.

In [30]:
print(homework(url_winequality_data,10))
print(type(homework(url_winequality_data,10)))

9.766
<class 'float'>
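
The submission guidelines ask for a numeric return value. One possible way to check this locally is sketched below; the helper `is_numeric` is a hypothetical name introduced here, and how Omnicampus actually validates the type is not stated in this notebook.

```python
import numbers
import numpy as np

def is_numeric(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, numbers.Real) and not isinstance(value, bool)

# A Python float and a NumPy float both count as numeric; a string does not
checks = [is_numeric(9.766), is_numeric(np.float64(9.766)), is_numeric("9.766")]
print(checks)
```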


### References

[1] Cortez P, Cerdeira A, Almeida F, Matos T, Reis J. Wine Quality [dataset]. 2009. UCI Machine Learning Repository. Available from: https://doi.org/10.24432/C56S3T.