# Drawing Conclusions Using Groupby

In this notebook, you'll investigate two questions about the wine quality data using pandas' [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function.

Use `winequality_edited.csv`. You should've created this data file in the previous section: *Appending Data (cont.)*.

In [1]:
import pandas as pd

%matplotlib inline

In [2]:
# Load dataset

df = pd.read_csv("winequality_edited.csv")
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,acidity_levels
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,high
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,medium
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,medium
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,moderately_high
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,moderately_high


### Is a certain type of wine associated with higher quality?

*Tip: For this question, use groupby to compare the average quality of red wine with the average quality of white wine. To do this, group by color and then find the mean quality of each group.*

In [3]:
# Find the mean quality of each wine type (red and white) with groupby

In [4]:
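# numeric_only=True skips text columns such as acidity_levels,
# which recent pandas versions refuse to average
df.groupby("color").mean(numeric_only=True)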

color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
red,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
white,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909


In [5]:
df.groupby("color")["quality"].mean()

color
red      5.636023
white    5.877909
Name: quality, dtype: float64

In [6]:
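# Same result as the previous cell, but averages every numeric column before selecting one
df.groupby("color").mean(numeric_only=True)["quality"]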

color
red      5.636023
white    5.877909
Name: quality, dtype: float64
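To judge whether the gap between the two means is meaningful, it can help to see the group sizes next to them. The following is a minimal sketch on a small made-up DataFrame (the `toy` data and the `summary` and `gap` names are hypothetical, not part of the notebook); with the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for winequality_edited.csv
toy = pd.DataFrame({
    "color": ["red", "red", "red", "white", "white"],
    "quality": [5, 6, 7, 6, 8],
})

# Mean quality and number of wines per color in a single groupby
summary = toy.groupby("color")["quality"].agg(["mean", "count"])
print(summary)

# Difference in mean quality (white minus red)
gap = summary.loc["white", "mean"] - summary.loc["red", "mean"]
print(gap)
```

Selecting the `quality` column before aggregating keeps the computation limited to the one column the question is about.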

### What level of acidity receives the highest average rating?

This question is trickier because, unlike color, which has clear categories you can group by (red and white), pH is a quantitative variable without clear categories. However, there is a simple fix: you can create a categorical variable from a quantitative one by defining your own categories. [pandas' cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function lets you "cut" data into groups. Using it, create a new column called `acidity_levels` with these categories:

Acidity Levels:
* High: Lowest 25% of pH values
* Moderately High: 25% - 50% of pH values
* Medium: 50% - 75% of pH values
* Low: 75% - max pH value

Here, the data is being split at the 25th, 50th, and 75th percentile. Remember, you can get these numbers with pandas' describe()! After you create these four categories, you'll be able to use groupby to get the mean quality rating for each acidity level.
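As a rough illustration of how `pd.cut` maps values to labels, here is a small sketch using made-up pH values and bin edges (the `ph`, `edges`, and `names` variables are hypothetical and chosen only for this example). Each value is assigned to the interval it falls in, and by default each interval includes its right edge but not its left edge.

```python
import pandas as pd

# Hypothetical pH values and edges, chosen only to illustrate the mapping
ph = pd.Series([3.0, 3.1, 3.2, 3.4, 3.6])
edges = [2.9, 3.1, 3.3, 3.5, 3.7]
names = ["high", "moderately_high", "medium", "low"]

# Five edges define four intervals: (2.9, 3.1], (3.1, 3.3], (3.3, 3.5], (3.5, 3.7]
levels = pd.cut(ph, bins=edges, labels=names)
print(levels.tolist())
```

Note that 3.1 sits exactly on an edge and lands in the first interval, because intervals are closed on the right.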

In [7]:
# View the min, 25%, 50%, 75%, and max pH values with pandas' describe

In [8]:
df.describe()["pH"]

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64
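Instead of copying the five numbers by hand, the bin edges could also be pulled straight out of the `describe()` result. This is a sketch on made-up values (the `ph` Series is hypothetical; in the notebook it would be `df["pH"]`), not the approach the notebook itself takes.

```python
import pandas as pd

# Hypothetical pH values; in the notebook this would be df["pH"]
ph = pd.Series([2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 3.6, 3.8, 4.0])

# describe() returns a Series indexed by statistic name
stats = ph.describe()

# Select the five statistics that serve as bin edges, in ascending order
bin_edges = stats.loc[["min", "25%", "50%", "75%", "max"]].tolist()
print(bin_edges)
```

Reading the edges programmatically avoids transcription mistakes and keeps them in sync if the data changes.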

In [9]:
# Bin edges that will be used to "cut" the data into groups

In [10]:
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]  # The five values you just found: min, 25%, 50%, 75%, max

In [11]:
# Labels for the four acidity level groups

In [12]:
bin_names = ["high", "moderately_high", "medium", "low"]  # Name each acidity level category

In [13]:
# Creates acidity_levels column

In [14]:
df['acidity_levels'] = pd.cut(df['pH'], bins=bin_edges, labels=bin_names)
df.head().T

Unnamed: 0,0,1,2,3,4
fixed_acidity,7,6.3,8.1,7.2,7.2
volatile_acidity,0.27,0.3,0.28,0.23,0.23
citric_acid,0.36,0.34,0.4,0.32,0.32
residual_sugar,20.7,1.6,6.9,8.5,8.5
chlorides,0.045,0.049,0.05,0.058,0.058
free_sulfur_dioxide,45,14,30,47,47
total_sulfur_dioxide,170,132,97,186,186
density,1.001,0.994,0.9951,0.9956,0.9956
pH,3,3.3,3.26,3.19,3.19
sulphates,0.45,0.49,0.44,0.4,0.4
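One detail worth checking: by default `pd.cut` builds intervals that exclude their left edge, so a value exactly equal to the lowest bin edge (here the minimum pH, 2.72) is not assigned to any category and becomes `NaN`. The sketch below uses a few made-up pH values (the `ph` Series is hypothetical) to show the effect and how the `include_lowest=True` parameter changes it. Whether this matters for the real data depends on how many rows sit at the minimum.

```python
import pandas as pd

# Hypothetical pH values, including one equal to the lowest bin edge
ph = pd.Series([2.72, 3.0, 3.15, 3.25, 4.01])
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]
bin_names = ["high", "moderately_high", "medium", "low"]

# Default behavior: the first interval is (2.72, 3.11], so 2.72 is left out
default_cut = pd.cut(ph, bins=bin_edges, labels=bin_names)
print(default_cut.isna().sum())

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(ph, bins=bin_edges, labels=bin_names, include_lowest=True)
print(inclusive_cut.isna().sum())
```

Counting missing values with `isna().sum()` right after cutting is a quick way to confirm that every row received a category.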


In [15]:
# Find the mean quality of each acidity level with groupby

In [16]:
# Select quality before averaging so the text column (color) is never averaged
df.groupby("acidity_levels")["quality"].mean()

acidity_levels
high               5.783343
moderately_high    5.784540
medium             5.850832
low                5.859593
Name: quality, dtype: float64
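To answer the question directly, the group with the highest mean can be read off with `idxmax`. The sketch below rebuilds the result as a Series from the values printed above (the `mean_quality` and `best_level` names are hypothetical); in the notebook you would call the same methods on the groupby result.

```python
import pandas as pd

# Mean quality per acidity level, copied from the output above
mean_quality = pd.Series(
    {
        "high": 5.783343,
        "moderately_high": 5.784540,
        "medium": 5.850832,
        "low": 5.859593,
    },
    name="quality",
)

# Label of the acidity level with the highest mean quality
best_level = mean_quality.idxmax()
print(best_level)

# Full ranking, highest mean first
print(mean_quality.sort_values(ascending=False))
```

The spread between the highest and lowest group is under 0.08 quality points, so the ranking should be read as a small difference rather than a strong effect.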

In [17]:
# Save changes for the next section
df.to_csv('winequality_edited.csv', index=False)