# Important links for students
1. [jakevdp](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html)
2. [pandas query function](http://jose-coto.com/query-method-pandas)

## Drawing Conclusions Using Query
In the notebook below, you're going to investigate two questions about this data using [Pandas' query function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html). 


Here are tips for answering each question:

### Q1: Do wines with higher alcoholic content receive better ratings?
To answer this question, use `query` to create two groups of wine samples:

- Low alcohol: samples with an alcohol content less than the median
- High alcohol: samples with an alcohol content greater than or equal to the median

Then find the mean quality rating of each group.

### Q2: Do sweeter wines (more residual sugar) receive better ratings?
Similarly, use the median to split the samples into two groups by residual sugar and find the mean quality rating of each group.

In [1]:
import pandas as pd
df = pd.read_csv('winequality_edited.csv')

In [2]:
df.shape

(6497, 14)

## Do wines with higher alcoholic content receive better ratings?

In [3]:
median = df['alcohol'].median()
median

10.3

### Error
`low_alcohol = df.query('alcohol < median')`

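UndefinedVariableError: name 'median' is not defined.  
See this Stack Overflow answer for [how to solve it](https://stackoverflow.com/questions/29085544/undefinedvariableerror-when-querying-pandas-dataframe).

Inside a query string, bare names are resolved as column names of `df`, so `median` is looked up as a column and not found. To refer to a Python variable from the surrounding scope, prefix it with `@` (`df.query('alcohol < @median')`), or compute the median from the column inside the expression, as the next cell does.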
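One way to reference a local Python variable inside a query string is the `@` prefix. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not taken from the wine data):

```python
import pandas as pd

# Toy data for illustration only; not from winequality_edited.csv
toy = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 12.0],
                    'quality': [5, 5, 6, 7]})

median = toy['alcohol'].median()

# '@median' tells query to look up the Python variable named median
# instead of searching for a column with that name
low = toy.query('alcohol < @median')
print(low)
```

Without the `@`, the same expression raises the `UndefinedVariableError` shown above, because `toy` has no column called `median`.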

In [4]:
low_alcohol = df.query('alcohol < alcohol.median()')
low_alcohol.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color,acidity_levels
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Red,Low
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Red,Moderately High
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,Red,Medium
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Red,Moderately High
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Red,Low


In [5]:
low_alcohol.shape

(3177, 14)

In [6]:
low_alcohol2 = df[df['alcohol'] < median]
low_alcohol2.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color,acidity_levels
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Red,Low
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Red,Moderately High
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,Red,Medium
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Red,Moderately High
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Red,Low
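The `query` call and the boolean-mask selection are meant to return the same rows. A quick way to confirm this, sketched here on made-up data (the `toy` frame and its values are assumptions for illustration), is `DataFrame.equals`:

```python
import pandas as pd

# Toy data for illustration only; not from winequality_edited.csv
toy = pd.DataFrame({'alcohol': [9.4, 9.8, 10.5, 11.2, 10.3],
                    'quality': [5, 5, 6, 7, 6]})
median = toy['alcohol'].median()

# Same filter written two ways
via_query = toy.query('alcohol < alcohol.median()')
via_mask = toy[toy['alcohol'] < median]

# equals compares shape, index, and values
same = via_query.equals(via_mask)
print(same)
```

On the real data, the equivalent check would be `low_alcohol.equals(low_alcohol2)`.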


In [7]:
# select samples with alcohol content greater than or equal to the median
high_alcohol = df.query('alcohol >= alcohol.median()')
high_alcohol.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color,acidity_levels
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,Red,Low
11,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,Red,Low
16,8.5,0.28,0.56,1.8,0.092,35.0,103.0,0.9969,3.3,0.75,10.5,7,Red,Medium
31,6.9,0.685,0.0,2.5,0.105,22.0,37.0,0.9966,3.46,0.57,10.6,6,Red,Low
36,7.8,0.6,0.14,2.4,0.086,3.0,15.0,0.9975,3.42,0.6,10.8,6,Red,Low


In [8]:
high_alcohol.shape

(3320, 14)

In [9]:
# ensure these queries included each sample exactly once
num_samples = df.shape[0]
num_samples == low_alcohol['quality'].count() + high_alcohol['quality'].count() # should be True

True
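Comparing counts shows the two groups add up to the right total, but not by itself that no row landed in both. A slightly stricter check, sketched on made-up data (the `toy` frame is an assumption for illustration), compares the row labels directly:

```python
import pandas as pd

# Toy data for illustration only; not from winequality_edited.csv
toy = pd.DataFrame({'alcohol': [9.4, 9.8, 10.5, 11.2, 10.3]})

low = toy.query('alcohol < alcohol.median()')
high = toy.query('alcohol >= alcohol.median()')

# No row label appears in both groups
no_overlap = set(low.index).isdisjoint(high.index)
# Together the groups contain every row label of the original frame
covers_all = set(low.index) | set(high.index) == set(toy.index)
print(no_overlap, covers_all)
```

This assumes the index labels are unique, which holds for the default integer index that `read_csv` creates.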

In [10]:
# get mean quality rating for the low alcohol and high alcohol groups
print('mean quality rating for the low alcohol : ',low_alcohol['quality'].mean())
print('mean quality rating for the high alcohol : ',high_alcohol['quality'].mean())

mean quality rating for the low alcohol :  5.475920679886686
mean quality rating for the high alcohol :  6.146084337349397


The high-alcohol group has the higher mean quality rating (about 6.15 versus 5.48), so wines with higher alcohol content tend to receive better ratings.
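The same comparison can also be done in one step with `groupby` instead of two separate queries. This is an alternative sketch on made-up data, not the method the notebook uses; the names `toy` and `group` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not from winequality_edited.csv
toy = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 12.0],
                    'quality': [5, 5, 6, 7]})
median = toy['alcohol'].median()

# Label each row 'low' or 'high' relative to the median
group = np.where(toy['alcohol'] < median, 'low', 'high')

# Mean quality per label
means = toy.groupby(group)['quality'].mean()
print(means)
```

Because every row gets exactly one label, this form cannot count a sample twice or drop one.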

## Do sweeter wines receive better ratings?

Similarly, use the median to split the samples into two groups by residual sugar and find the mean quality rating of each group.

In [11]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'color', 'acidity_levels'],
      dtype='object')

In [12]:
df = df.rename(columns = {'residual sugar' : 'residual_sugar'})
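Renaming the column is one way to make it usable in a query string. In pandas 0.25 and later, a column name containing a space can also be wrapped in backticks inside the query. A minimal sketch on made-up data (the `toy` frame and `median_rs` variable are assumptions for illustration):

```python
import pandas as pd

# Toy data for illustration only; not from winequality_edited.csv
toy = pd.DataFrame({'residual sugar': [1.0, 2.0, 8.0, 10.0],
                    'quality': [6, 5, 6, 7]})

median_rs = toy['residual sugar'].median()

# Backticks quote the column name; '@' references the Python variable
low_rs = toy.query('`residual sugar` < @median_rs')
print(low_rs)
```

Renaming, as done above, keeps the later query strings shorter, so either approach is reasonable.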

In [13]:
# select samples with residual sugar less than the median
low_rs = df.query('residual_sugar < residual_sugar.median()')
low_rs.shape

(3227, 14)

In [14]:
high_rs = df.query('residual_sugar >= residual_sugar.median()')
high_rs.shape

(3270, 14)

In [15]:
# ensure these queries included each sample exactly once
num_samples = df.shape[0]
num_samples == low_rs['residual_sugar'].count() + high_rs['residual_sugar'].count() # should be True

True

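In [16]:
# get mean quality rating for the low sugar and high sugar groups
print('mean quality rating for the low residual sugar : ',low_rs['quality'].mean())
print('mean quality rating for the high residual sugar : ',high_rs['quality'].mean())

The figures originally shown for this cell (1.8264177254415837 and 9.012492354740075) were the mean `residual_sugar` of each group, because the cell averaged the `residual_sugar` column instead of `quality`. They confirm only that the split separated low-sugar from high-sugar samples; they say nothing about ratings. Run the corrected cell above and compare the two quality means before concluding whether sweeter wines receive higher ratings.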