# Processing Dataframes - Part 2

## This notebook explains how to process dataframes collectively: grouping data, aggregation, cross tabulation, and similar operations.

In [34]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt

### Grouping in Dataframes

In [35]:
dframe = DataFrame({'Students': ['Max', 'Lily', 'Charles', 'Ruby', 'Andrew', 'Kate'],
                    'Division': ['A', 'B', 'A', 'A', 'B', 'B'],
                    'Marks': [23, 30, 18, 26, 11, 15]})
dframe

| | Division | Marks | Students |
|---|---|---|---|
| 0 | A | 23 | Max |
| 1 | B | 30 | Lily |
| 2 | A | 18 | Charles |
| 3 | A | 26 | Ruby |
| 4 | B | 11 | Andrew |
| 5 | B | 15 | Kate |


#### Grouping all the students based on their division and finding their collective metrics:

The result is the average marks in each division.

In [36]:
dframe['Marks'].groupby(dframe['Division']).mean()

Division
A    22.333333
B    18.666667
Name: Marks, dtype: float64
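To make explicit what `groupby` is doing here, the following minimal sketch reproduces the same averages by hand: it selects the rows of each division with a boolean mask and takes the mean of their marks. The names `manual` and `grouped` are introduced only for this illustration.

```python
import pandas as pd

# Same data as the dataframe defined above
dframe = pd.DataFrame({'Students': ['Max', 'Lily', 'Charles', 'Ruby', 'Andrew', 'Kate'],
                       'Division': ['A', 'B', 'A', 'A', 'B', 'B'],
                       'Marks': [23, 30, 18, 26, 11, 15]})

# Split the rows by division with a boolean mask, then average the marks of each part
manual = pd.Series({division: dframe.loc[dframe['Division'] == division, 'Marks'].mean()
                    for division in sorted(dframe['Division'].unique())},
                   name='Marks')

# The groupby call performs the same split, apply and combine steps in one expression
grouped = dframe['Marks'].groupby(dframe['Division']).mean()

print(manual)
print(grouped)
```

Both printed series should hold the same two values, one per division.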

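Another method is to call `groupby` on the dataframe itself and pass the name of the grouping column.

To have a second numeric column to aggregate, we first add a column giving the age of each student.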

In [37]:
dframe['Age'] = [21,23,19,24,16,20]

In [38]:
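# numeric_only=True skips the non-numeric 'Students' column;
# recent pandas versions raise a TypeError without it
dframe.groupby('Division').mean(numeric_only=True)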

| Division | Marks | Age |
|---|---|---|
| A | 22.333333 | 21.333333 |
| B | 18.666667 | 19.666667 |
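Averaging only makes sense for numeric columns, and `Students` holds strings. As a hedged alternative, the sketch below selects the numeric columns explicitly before aggregating, which should give the same table regardless of how a given pandas version treats non-numeric columns. The variable name `means` is an assumption made for this example.

```python
import pandas as pd

# Same student data as above, including the Age column
dframe = pd.DataFrame({'Students': ['Max', 'Lily', 'Charles', 'Ruby', 'Andrew', 'Kate'],
                       'Division': ['A', 'B', 'A', 'A', 'B', 'B'],
                       'Marks': [23, 30, 18, 26, 11, 15]})
dframe['Age'] = [21, 23, 19, 24, 16, 20]

# Pick the numeric columns explicitly, then average within each division
means = dframe.groupby('Division')[['Marks', 'Age']].mean()
print(means)
```

Selecting columns first also avoids computing statistics that are never used when the dataframe is wide.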


#### Iterating over the groups

In [39]:
for division,info in dframe.groupby('Division'):
    print('This is division:',division)
    print(info)
    print('\n')


This is division: A
  Division  Marks Students  Age
0        A     23      Max   21
2        A     18  Charles   19
3        A     26     Ruby   24


This is division: B
  Division  Marks Students  Age
1        B     30     Lily   23
4        B     11   Andrew   16
5        B     15     Kate   20
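When only one group is needed, looping over all of them is unnecessary. A short sketch, using the same student data: `get_group` returns the rows of a single group, and `size` counts the rows in each group. The names `by_division`, `group_a`, and `sizes` are illustrative.

```python
import pandas as pd

dframe = pd.DataFrame({'Students': ['Max', 'Lily', 'Charles', 'Ruby', 'Andrew', 'Kate'],
                       'Division': ['A', 'B', 'A', 'A', 'B', 'B'],
                       'Marks': [23, 30, 18, 26, 11, 15],
                       'Age': [21, 23, 19, 24, 16, 20]})

by_division = dframe.groupby('Division')

# Rows belonging to division A only, with their original index labels
group_a = by_division.get_group('A')
print(group_a)

# Number of rows in each division
sizes = by_division.size()
print(sizes)
```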




### Aggregation

A definition found online:

"Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession, or income."

Let us work with a larger dataset, available at http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/. Download the CSV file, save it as `winequality.csv` (the name the next cell reads), and place it in the same directory as your IPython notebooks.

In [40]:
dframe = pd.read_csv('winequality.csv', sep=';')
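The `sep=';'` argument matters because this file separates fields with semicolons rather than commas. The sketch below illustrates the difference on a small in-memory sample instead of the downloaded file; the sample text is an assumption built from the first two rows shown further down, and the real file's header formatting may differ.

```python
import io
import pandas as pd

# Hypothetical three-line sample in the same semicolon-separated layout
sample = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.0;2.6;0.098;25.0;67.0;0.9968;3.2;0.68;9.8;5\n"
)

# With the default comma separator every line is read as a single field
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';' the twelve columns are split correctly
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
```

Checking `shape` right after loading is a quick way to catch a wrong separator before any grouping is attempted.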

#### Revisiting some basic dataframe operations:

In [41]:
dframe.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [42]:
dframe.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


#### Grouping by quality:

In [43]:
group1 = dframe.groupby('quality').mean()

In [44]:
group1.head()

| quality | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 8.36 | 0.8845 | 0.171 | 2.635 | 0.1225 | 11.0 | 24.9 | 0.997464 | 3.398 | 0.57 | 9.955 |
| 4 | 7.779245 | 0.693962 | 0.174151 | 2.69434 | 0.090679 | 12.264151 | 36.245283 | 0.996542 | 3.381509 | 0.596415 | 10.265094 |
| 5 | 8.167254 | 0.577041 | 0.243686 | 2.528855 | 0.092736 | 16.983847 | 56.51395 | 0.997104 | 3.304949 | 0.620969 | 9.899706 |
| 6 | 8.347179 | 0.497484 | 0.273824 | 2.477194 | 0.084956 | 15.711599 | 40.869906 | 0.996615 | 3.318072 | 0.675329 | 10.629519 |
| 7 | 8.872362 | 0.40392 | 0.375176 | 2.720603 | 0.076588 | 14.045226 | 35.020101 | 0.996104 | 3.290754 | 0.741256 | 11.465913 |
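`mean` is only one possible summary. As a sketch of a more general aggregation, `agg` can compute several statistics per group in one call. The small frame below is a hypothetical stand-in (its values are made up for illustration, not taken from the wine file), and the output column names `mean_alcohol`, `min_pH`, and `max_pH` are likewise assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: values are illustrative only
wine = pd.DataFrame({'quality': [5, 5, 6, 6, 7],
                     'alcohol': [9.4, 9.8, 9.8, 10.6, 11.4],
                     'pH': [3.51, 3.20, 3.16, 3.30, 3.29]})

# Named aggregation: each keyword becomes an output column,
# defined by a (source column, function) pair
summary = wine.groupby('quality').agg(mean_alcohol=('alcohol', 'mean'),
                                      min_pH=('pH', 'min'),
                                      max_pH=('pH', 'max'))
print(summary)
```

On the downloaded data, the same pattern would apply to `dframe.groupby('quality')` with whichever columns and statistics are of interest.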


Adding a new column, `density/pH`, that holds the ratio of density to pH:

In [56]:
dframe['density/pH'] = dframe['density']/dframe['pH']
dframe.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | density/pH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0.284274 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0.3115 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0.305828 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0.315823 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0.284274 |
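A derived column can be grouped like any other. The sketch below rebuilds the five rows shown above (only the three columns involved) and averages the new ratio per quality score; the names `sample` and `ratio_by_quality` are introduced for this example.

```python
import pandas as pd

# The density, pH and quality values of the five rows displayed above
sample = pd.DataFrame({'density': [0.9978, 0.9968, 0.997, 0.998, 0.9978],
                       'pH': [3.51, 3.2, 3.26, 3.16, 3.51],
                       'quality': [5, 5, 5, 6, 5]})

# Element-wise division creates the ratio column
sample['density/pH'] = sample['density'] / sample['pH']

# Average ratio for each quality score present in the sample
ratio_by_quality = sample.groupby('quality')['density/pH'].mean()
print(ratio_by_quality)
```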


## Thanks for reading.