<h1><font color="blue">
Aggregation and Grouping
</font>
</h1>


<b><i><font color = "green"> Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn). 
It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets, or exoplanets for short). 
It can be downloaded with a simple Seaborn command: </font></i></b>


In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
planets = sns.load_dataset('planets')
planets.shape

In [None]:
planets.head()

<h3><i><font color = "green">
 Simple Aggregation in Pandas. For a Series, an aggregation returns a single value:
    </font> </i></h3>

In [None]:
rng = np.random.RandomState(42)

ser = pd.Series(rng.rand(5))
ser

In [None]:
print("sum is", ser.sum())
print("mean is", ser.mean())

<h3><i><font color = "green">
  For a DataFrame, by default the aggregates return results within each column:
    </font> </i></h3>



In [None]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
print(df)
df.mean()

<h3><i><font color = "green">
  To compute the mean of each row instead, specify the axis argument:
    </font> </i></h3>


In [None]:

df.mean(axis='columns')
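
The difference between the two axis settings is easier to see on values that can be checked by hand. The following is a minimal sketch; the frame `small` and its numbers are made up for illustration and are not part of the Planets data.

```python
import pandas as pd

# Small hand-made frame so the results can be verified by eye.
small = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [3.0, 4.0, 5.0]})

col_means = small.mean()                # default: one value per column
row_means = small.mean(axis='columns')  # one value per row

print(col_means)
print(row_means)
```

The default aggregates down each column, while `axis='columns'` aggregates across the columns within each row.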

<h1>
<font color = "purple"> 
  There is a convenience method, describe(), that computes several common aggregates for each column and returns the result: </font>
 </h1>

In [None]:
planets.dropna().describe()


<h3><font color = "green">
The following summarizes some other built-in Pandas aggregations:</font>  </h3>
<p></p>
<font color="purple">
<p>1. count() - Total number of items</p>
<p>2. first(), last() - First and last item</p>
<p>3. mean(), median() - Mean and median</p>
<p>4. min(), max() - Minimum and maximum</p>
<p>5. std(), var() - Standard deviation and variance</p>
<p>6. mad() - Mean absolute deviation (removed in pandas 2.0)</p>
<p>7. prod() - Product of all items</p>
<p>8. sum() - Sum of all items</p></font>
<h3>These are all methods of DataFrame and Series objects.</h3> 
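
As a quick illustration of several of these aggregations, here is a minimal sketch on a hand-made Series (the values are arbitrary). Because mad() is no longer available in pandas 2.0 and later, the mean absolute deviation is computed by hand here; that manual formula is an illustrative substitute, not a pandas method.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 10])

summary = {
    'count': values.count(),
    'median': values.median(),
    'min': values.min(),
    'max': values.max(),
    'prod': values.prod(),
    'sum': values.sum(),
}
print(summary)

# Mean absolute deviation around the mean, written out explicitly.
mean_abs_dev = (values - values.mean()).abs().mean()
print(mean_abs_dev)
```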
  



<h3><i><font color = "green">
GroupBy: Split, Apply, Combine.
Often we would prefer to aggregate conditionally on some label or index; 
this is implemented in the so-called groupby operation.
  
    </font> </i></h3>


## Split, apply, combine

In [None]:

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

In [None]:
grouped_obj = df.groupby('key')
# groupby() returns a DataFrameGroupBy object; no computation happens yet.
print(type(grouped_obj))

# DataFrame.apply passes each column to the function as a Series.
df.apply(lambda x: print(type(x)))

# GroupBy.apply passes each group to the function as a DataFrame.
grouped_obj.apply(lambda x: print(type(x)))

# Any aggregation method (sum, min, max, mean, etc.) can be applied to the GroupBy object.
df.groupby('key').sum()
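
To make the three steps concrete, the sketch below performs the split, apply, and combine by hand and compares the outcome with groupby. The names `df_demo`, `pieces`, `sums`, and `manual` are illustrative only, and the manual version is a teaching aid rather than how pandas implements the operation internally.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Split: one sub-frame per distinct key.
pieces = {k: df_demo[df_demo['key'] == k] for k in df_demo['key'].unique()}

# Apply: reduce each piece to a single number.
sums = {k: int(piece['data'].sum()) for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series.
manual = pd.Series(sums, name='data')

# The same computation in a single groupby call.
builtin = df_demo.groupby('key')['data'].sum()

print(manual)
print(builtin)
```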

<h3><i><font color = "green">
 Column indexing:
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:
    </font> </i></h3>



In [None]:
print(planets.groupby('method'))
print(planets.groupby('method')['orbital_period'])
#Here we've selected a particular Series group from the original DataFrame group by reference to its column name. 
#As with the GroupBy object, no computation is done until we call some aggregate on the object:
planets.groupby('method')['orbital_period'].median()

In [None]:
planets.head()



<h3><i><font color = "green">
Iteration over groups:
The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame:
    </font> </i></h3>


In [None]:

for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))
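
The same iteration pattern can be tried on a small frame, where each group is easy to inspect. This is a minimal sketch with made-up data; `get_group` is shown as one way to pull out a single group without looping.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Iterating yields (group label, sub-frame) pairs.
shapes = {}
for key, group in df_demo.groupby('key'):
    shapes[key] = group.shape
print(shapes)

# Fetch one group directly by its label.
only_a = df_demo.groupby('key').get_group('A')
print(only_a)
```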

<h3> Dispatch method</h3>
<i><font color = "green">
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through 
and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method
of DataFrames to perform a set of aggregations that describe each group in the data:
  
 </font> </i>



In [None]:
planets.groupby('method')['year'].describe()

<h3><i><font color = "green">
 Aggregate, filter, transform, apply:
    </font> </i></h3>



In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

<h3><i><font color = "green">
  Aggregation
    </font> </i></h3>


In [None]:
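# String names avoid the warnings that recent pandas versions emit for np.median and the built-in max.
df.groupby('key').aggregate(['min', 'median', 'max'])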

In [None]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
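
When the output columns should carry descriptive names, named aggregation (keyword arguments of the form `new_name=(column, function)`) is one option. The sketch below assumes pandas 0.25 or later; the frame `df_demo` uses hand-entered values, and the output names `data1_min` and `data2_max` are chosen only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword names an output column: (input column, aggregation).
named = df_demo.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)
print(named)
```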

<h3><i><font color = "green">
Filtering:
A filtering operation allows you to drop data based on the group properties. For example, we might keep only the groups whose standard deviation exceeds some critical value.
    </font> </i></h3>


In [None]:
def filter_func(x):
    return x['data2'].std() > 4

print(df)
print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))
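
To see exactly which groups survive a filter, it helps to use fixed values instead of random ones. In this minimal sketch (hand-entered data, illustrative names), the per-group standard deviations of data2 are roughly 1.41 for A, 4.95 for B, and 4.24 for C, so a threshold of 4 drops group A.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Per-group standard deviation, shown for reference.
group_std = df_demo.groupby('key')['data2'].std()
print(group_std)

# The function returns one boolean per group; rows of failing groups are dropped.
kept = df_demo.groupby('key').filter(lambda g: g['data2'].std() > 4)
print(kept)
```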



  
 
<h3>Transformation:</h3>
<i><font color = "green">
While aggregation must return a reduced version of the data, transformation can return some transformed version of the 
full data to recombine. For such a transformation, the output is the same shape as the input. 
A common example is to center the data by subtracting the group-wise mean:
</font> </i>

In [None]:
df.groupby('key').transform(lambda x: x - x.mean())
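
One way to check that a centering transformation behaves as described is to confirm that the output keeps the input's shape and that every group mean becomes zero. This is a sketch on hand-entered data; the names `df_demo` and `centered` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Subtract each group's mean from its own rows.
centered = df_demo.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())
print(centered)

# After centering, the mean within every group should be (numerically) zero.
group_means_after = centered.groupby(df_demo['key']).mean()
print(group_means_after)
```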

<h3><i><font color = "green">
The apply() method
lets you apply an arbitrary function to the group results. The function should take a DataFrame and return either a Pandas object or a scalar.
    </font> </i></h3>


In [None]:
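def norm_by_data2(x):
    # x is a DataFrame of group values; copy it so the original data is not modified
    x = x.copy()
    # normalize data1 by the group's sum of data2
    x['data1'] /= x['data2'].sum()
    return x

df.groupby('key').apply(norm_by_data2)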
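
The same normalization can also be expressed without apply, by broadcasting each group's data2 total back to its rows with transform. This is an alternative sketch on hand-entered data, not a replacement for the approach above; `group_totals` and `normalized` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# transform('sum') returns one value per row: the total of data2 for that row's group.
group_totals = df_demo.groupby('key')['data2'].transform('sum')

# Divide data1 by its group's total.
normalized = df_demo['data1'] / group_totals
print(normalized)
```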


<h3><i><font color = "green">
A dictionary or series mapping index to group
    </font> </i></h3>


In [None]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
print(df2)
print(df2.groupby(mapping).sum())

<h3><i><font color = "green">
Any Python function:
You can pass any Python function that takes an index value and returns the group label. Further, any of the preceding key choices can be combined in a list to group on a multi-index.
    </font> </i></h3>


In [None]:
print(df2)
print(df2.groupby(str.lower).mean())

<h3><i><font color = "green">
  A list of valid keys
    </font> </i></h3>


In [None]:
df2.groupby([str.lower, mapping]).mean()
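
With fixed values, the multi-index produced by a list of keys is easy to inspect. The sketch below rebuilds a frame like df2 from hand-entered data (the names `df2_demo` and `result` are illustrative) and groups on both a function and a dictionary.

```python
import pandas as pd

df2_demo = pd.DataFrame({'data1': range(6),
                         'data2': [5, 0, 3, 3, 7, 9]},
                        index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'))
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Each key in the list becomes one level of the resulting MultiIndex.
result = df2_demo.groupby([str.lower, mapping]).mean()
print(result)
```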

<h2> <font color = 'red'>Grouping example  </font></h2>

As an example of this, in a couple of lines of Python code we can put all of these together and count discovered planets
by method and by decade:


In [None]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
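
The decade calculation can be traced step by step on a tiny table. The rows in `toy` below are invented purely for illustration and are not taken from the Planets dataset.

```python
import pandas as pd

# Hypothetical rows, used only to trace the computation.
toy = pd.DataFrame({'method': ['Transit', 'Transit',
                               'Radial Velocity', 'Radial Velocity'],
                    'year': [2009, 2011, 1995, 2009],
                    'number': [1, 2, 1, 3]})

# Integer division truncates each year to the start of its decade (2009 -> 2000).
decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Sum per (method, decade), pivot decades into columns, fill missing combinations with 0.
counts = toy.groupby(['method', decade])['number'].sum().unstack().fillna(0)
print(counts)
```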


<h3><i><font color = "green">   unstack() example:
  pivot one level of a Series MultiIndex into columns to produce a DataFrame. The level involved is automatically sorted.
    </font> </i></h3>
  


In [None]:
s = pd.Series([1, 2, 3, 4],
              index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))
s

In [None]:
s.unstack()
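
By default unstack() moves the innermost index level into the columns; passing `level` selects a different one. A minimal sketch, using the same four-element Series (renamed `s_demo` here so the block is self-contained):

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 4],
                   index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))

# Default: the last level ('a', 'b') becomes the columns.
wide_inner = s_demo.unstack()

# level=0: the first level ('one', 'two') becomes the columns instead.
wide_outer = s_demo.unstack(level=0)

print(wide_inner)
print(wide_outer)
```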

<h1> <font color="red">Example: Random Sampling and Permutation</font> </h1>

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for some application. There are a number of ways to perform the “draws”; here we use the sample method for Series.

To demonstrate, here’s a way to construct a deck of English-style playing cards:

In [None]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
# We now have a Series of length 52 whose index contains the card names and whose values
# are the ones used in Blackjack and other games (to keep things simple, the ace 'A' is 1).

In [None]:
deck[:13]

In [None]:
#Drawing a hand of five cards from the deck could be written as:
def draw(deck, n=5):
    return deck.sample(n)

draw(deck)

In [None]:
#Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, 
#we can group based on this and use apply:

get_suit = lambda card: card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n=2)

In [None]:
#Alternatively, we could write:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
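
For reproducible draws, a seed can be supplied through `random_state`. The sketch below rebuilds the deck and uses the GroupBy `sample` method, which assumes pandas 1.1 or later; the seed value 0 is arbitrary.

```python
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

def get_suit(card):
    # The last character of each card name is its suit.
    return card[-1]

# Draw two cards per suit; a fixed seed makes the draw repeatable.
hand = deck.groupby(get_suit).sample(n=2, random_state=0)
print(hand)

# Count how many cards were drawn from each suit.
per_suit = hand.groupby(get_suit).size()
print(per_suit)
```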

## Example: Filling Missing Values with Group-Specific Values
When cleaning up missing data, in some cases you will remove data observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data. fillna is the right tool to use; for example, here I fill in NA values with the mean:

In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

In [None]:
s.fillna(s.mean())

Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk. Here is some sample data on US states divided into eastern and western regions:

In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data

Note that the syntax ['East'] * 4 produces a list containing four copies of the elements in ['East']. Adding lists together concatenates them.

Let’s set some values in the data to be missing:

In [None]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
print(data)
data.groupby(group_key).mean()

We can fill the NA values using the group means like so:

In [None]:
fill_mean = lambda g: g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)

In another case, you might have predefined fill values in your code that vary by group. Since the groups have a name attribute set internally, we can use that:

In [None]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
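
Filling with group means can also be written without apply: transform broadcasts each group's mean back to the original index, and fillna then picks the matching value for every missing entry. This is an alternative sketch with hand-entered numbers (not the random values used above) so that the result can be checked directly.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4

# Fixed illustrative values; NaN marks the missing states.
data = pd.Series([1.0, 3.0, np.nan, 2.0, 4.0, np.nan, 6.0, np.nan], index=states)

# One mean per row, taken from the row's own group (NaN values are skipped).
group_means = data.groupby(group_key).transform('mean')

# Replace each missing value with its group's mean.
filled = data.fillna(group_means)
print(filled)
```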