# Drawing Conclusions Using Query

In the notebook below, you're going to investigate two questions about the wine quality data using pandas' [`DataFrame.query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) method.

In [1]:
import pandas as pd

%matplotlib inline

In [2]:
# Load 'winequality_edited.csv', a file you previously created
# in this workspace and worked with in the concepts
# "Appending Data(cont.)" and "Exploring with Visuals"

In [3]:
df = pd.read_csv("winequality_edited.csv")
df.head().T

Unnamed: 0,0,1,2,3,4
fixed_acidity,7,6.3,8.1,7.2,7.2
volatile_acidity,0.27,0.3,0.28,0.23,0.23
citric_acid,0.36,0.34,0.4,0.32,0.32
residual_sugar,20.7,1.6,6.9,8.5,8.5
chlorides,0.045,0.049,0.05,0.058,0.058
free_sulfur_dioxide,45,14,30,47,47
total_sulfur_dioxide,170,132,97,186,186
density,1.001,0.994,0.9951,0.9956,0.9956
pH,3,3.3,3.26,3.19,3.19
sulphates,0.45,0.49,0.44,0.4,0.4


### Do wines with higher alcoholic content receive better ratings?


Tip: To answer this question, use `query` to create two groups of wine samples:

* Low alcohol (samples with an alcohol content less than the median)
* High alcohol (samples with an alcohol content greater than or equal to the median)

Then, find the mean quality rating of each group.
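The tip above can be sketched on a small made-up DataFrame. The `toy` data and the names `threshold`, `low`, and `high` are assumptions for illustration only, not part of the wine dataset. Inside a query string, prefixing a name with `@` refers to a Python variable, so the median does not have to be typed in by hand.

```python
import pandas as pd

# Toy data standing in for the wine samples (values are made up for illustration).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "quality": [5, 5, 6, 6, 7],
})

threshold = toy["alcohol"].median()

# "@threshold" refers to the Python variable, so the value is not hard-coded.
low = toy.query("alcohol < @threshold")
high = toy.query("alcohol >= @threshold")

low_mean = low["quality"].mean()
high_mean = high["quality"].mean()
print(low_mean, high_mean)
```

Using `<` for one group and `>=` for the other means a sample that sits exactly at the median lands in the high group and in no other.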

In [4]:
# get the median amount of alcohol content

In [5]:
alcohol_median = df["alcohol"].median()
alcohol_median

10.300000000000001

In [6]:
# select samples with alcohol content less than the median

In [7]:
low_alcohol = df.query("alcohol < @alcohol_median")  # @ refers to the Python variable
low_alcohol.head().T

Unnamed: 0,0,1,2,3,4
fixed_acidity,7,6.3,8.1,7.2,7.2
volatile_acidity,0.27,0.3,0.28,0.23,0.23
citric_acid,0.36,0.34,0.4,0.32,0.32
residual_sugar,20.7,1.6,6.9,8.5,8.5
chlorides,0.045,0.049,0.05,0.058,0.058
free_sulfur_dioxide,45,14,30,47,47
total_sulfur_dioxide,170,132,97,186,186
density,1.001,0.994,0.9951,0.9956,0.9956
pH,3,3.3,3.26,3.19,3.19
sulphates,0.45,0.49,0.44,0.4,0.4


In [8]:
# select samples with alcohol content greater than or equal to the median

In [9]:
high_alcohol = df.query("alcohol >= @alcohol_median")
high_alcohol.head().T

Unnamed: 0,9,10,12,13,15
fixed_acidity,8.1,8.1,7.9,6.6,6.6
volatile_acidity,0.22,0.27,0.18,0.16,0.17
citric_acid,0.43,0.41,0.37,0.4,0.38
residual_sugar,1.5,1.45,1.2,1.5,1.5
chlorides,0.044,0.033,0.04,0.044,0.032
free_sulfur_dioxide,28,11,16,48,28
total_sulfur_dioxide,129,63,75,143,112
density,0.9938,0.9908,0.992,0.9912,0.9914
pH,3.22,2.99,3.18,3.54,3.25
sulphates,0.45,0.56,0.63,0.52,0.55


In [10]:
# ensure these queries included each sample exactly once

In [11]:
num_samples = df.shape[0]
num_samples == low_alcohol['quality'].count() + high_alcohol['quality'].count() # should be True

True
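Note that `Series.count()` counts only non-null values, so the check above implicitly assumes the `quality` column has no missing entries. A slightly stricter check, sketched here on made-up data (the `toy` frame and the variable names are assumptions for illustration), compares row labels directly:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "quality": [5, 6, 6, 7],
})

cutoff = toy["alcohol"].median()
low = toy.query("alcohol < @cutoff")
high = toy.query("alcohol >= @cutoff")

# No row label appears in both groups ...
no_overlap = low.index.intersection(high.index).empty
# ... and together the groups account for every row.
full_coverage = len(low) + len(high) == len(toy)

print(no_overlap and full_coverage)
```

If the column being split contained missing values, those rows would satisfy neither comparison and `full_coverage` would be `False`, which is the kind of problem this check is meant to surface.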

In [12]:
# get mean quality rating for the low alcohol and high alcohol groups

In [13]:
low_alcohol["quality"].mean()

5.475920679886686

In [14]:
high_alcohol["quality"].mean()

6.1460843373493974

### Do sweeter wines receive better ratings?
Similarly, use the median to split the samples into two groups by residual sugar and find the mean quality rating of each group.
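Because the same median split is applied to a second column, the steps could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `mean_quality_by_median_split` is a hypothetical name, the toy data is made up, and the f-string assumes the column name is a valid Python identifier (as `residual_sugar` and `alcohol` are).

```python
import pandas as pd


def mean_quality_by_median_split(df, column, target="quality"):
    """Split df at the median of `column`; return the mean of `target` per group."""
    median = df[column].median()
    # @median refers to the local variable defined above.
    low = df.query(f"{column} < @median")
    high = df.query(f"{column} >= @median")
    return {
        "median": median,
        "low": low[target].mean(),
        "high": high[target].mean(),
    }


# Toy data for illustration only.
toy = pd.DataFrame({
    "residual_sugar": [1.0, 2.0, 3.0, 4.0],
    "quality": [5, 6, 6, 7],
})

result = mean_quality_by_median_split(toy, "residual_sugar")
print(result)
```

With the real data, the same call could be made once with `"alcohol"` and once with `"residual_sugar"`, assuming `df` has been loaded as in the cells above.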

In [15]:
# get the median amount of residual sugar

In [16]:
df["residual_sugar"].median()

3.0

In [17]:
# select samples with residual sugar less than the median

In [18]:
low_sugar = df.query(" residual_sugar < 3 ")
low_sugar.head().T

Unnamed: 0,1,8,9,10,12
fixed_acidity,6.3,6.3,8.1,8.1,7.9
volatile_acidity,0.3,0.3,0.22,0.27,0.18
citric_acid,0.34,0.34,0.43,0.41,0.37
residual_sugar,1.6,1.6,1.5,1.45,1.2
chlorides,0.049,0.049,0.044,0.033,0.04
free_sulfur_dioxide,14,14,28,11,16
total_sulfur_dioxide,132,132,129,63,75
density,0.994,0.994,0.9938,0.9908,0.992
pH,3.3,3.3,3.22,2.99,3.18
sulphates,0.49,0.49,0.45,0.56,0.63


In [19]:
# select samples with residual sugar greater than or equal to the median

In [20]:
high_sugar = df.query(" residual_sugar >= 3 ")
high_sugar.head().T

Unnamed: 0,0,2,3,4,5
fixed_acidity,7,8.1,7.2,7.2,8.1
volatile_acidity,0.27,0.28,0.23,0.23,0.28
citric_acid,0.36,0.4,0.32,0.32,0.4
residual_sugar,20.7,6.9,8.5,8.5,6.9
chlorides,0.045,0.05,0.058,0.058,0.05
free_sulfur_dioxide,45,30,47,47,30
total_sulfur_dioxide,170,97,186,186,97
density,1.001,0.9951,0.9956,0.9956,0.9951
pH,3,3.26,3.19,3.19,3.26
sulphates,0.45,0.44,0.4,0.4,0.44


In [21]:
# ensure these queries included each sample exactly once
num_samples == low_sugar['quality'].count() + high_sugar['quality'].count() # should be True

True

In [22]:
# get mean quality rating for the low sugar and high sugar groups

In [23]:
low_sugar["quality"].mean()

5.8088007437248219

In [24]:
high_sugar["quality"].mean()

5.8278287461773699