### <span style = "color:red"> Axis Indexes with Duplicate Values</span>

In [2]:
import pandas as pd

In [3]:
from pandas import Series, DataFrame

In [9]:
import numpy as np

In [4]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s <span style = "color:olive">is_unique</span> property tells you whether its labels are unique:

In [5]:
obj.index.is_unique

False

Indexing a label that has multiple entries returns a Series, whereas a label with a single entry returns a scalar value:

In [6]:
obj['a']

a    0
a    1
dtype: int64

In [7]:
obj['c'] 

4
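
A minimal sketch of the same behavior, re-creating `obj` so it runs on its own. The `Index.duplicated()` call is an addition that the text above does not use; by default it flags every repeat of a label after its first occurrence.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# A duplicated label selects every matching entry, so the result is a Series.
dup_result = obj['a']
# A label that occurs only once yields a scalar.
single_result = obj['c']

print(type(dup_result).__name__)
print(single_result)

# Flag repeated labels; the first occurrence of each label is not flagged.
dup_mask = obj.index.duplicated()
print(dup_mask)
```

Because the return type depends on the data, code that indexes by label may need to handle both the Series and the scalar case.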

The same logic extends to indexing rows in a DataFrame:

In [10]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df 

Unnamed: 0,0,1,2
a,-0.525156,0.127277,-1.118542
a,-1.499203,-0.809403,2.498249
b,0.398071,-0.018446,0.742484
b,0.253497,-0.496054,-0.920012


In [11]:
df.loc['b']


Unnamed: 0,0,1,2
b,0.398071,-0.018446,0.742484
b,0.253497,-0.496054,-0.920012
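
As a rough, self-contained illustration of row selection with duplicate labels, the sketch below uses `.loc` for label-based access and `.iloc` for position-based access. The seeded generator is an assumption made for reproducibility, so the numbers differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same (arbitrary) values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 3)), index=['a', 'a', 'b', 'b'])

# Label-based selection returns every row whose index label is 'b'.
rows_b = df.loc['b']
print(rows_b)

# Position-based selection stays unambiguous even when labels repeat.
third_row = df.iloc[2]
print(third_row)
```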


### <span style = "color:red"> Summarizing and Computing Descriptive Statistics</span>

pandas objects are equipped with a set of common mathematical and statistical methods. The examples below use a small DataFrame that contains missing values:

In [13]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
                index=['a', 'b', 'c', 'd'],
                columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling the DataFrame’s sum method returns a Series containing column sums:

In [14]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis=1 sums across the columns instead, producing one value per row. NA values are excluded, so the all-NA row c sums to 0:

In [15]:
df.sum(axis=1) 

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [19]:
df.mean(axis=1) 

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
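
The outputs above skip NA values when reducing. The sketch below shows two keyword arguments that change this, `skipna` and `min_count`; neither appears in the text above, so treat their use here as an illustration rather than part of the original walkthrough.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default behavior: NA values are skipped within each row.
row_means = df.mean(axis=1)

# skipna=False: a single NA in a row makes that row's result NA.
strict_means = df.mean(axis=1, skipna=False)

# min_count=1: require at least one non-NA value per row, so the
# all-NA row 'c' yields NaN instead of 0.
row_sums = df.sum(axis=1, min_count=1)

print(strict_means)
print(row_sums)
```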

<span style = "color:darkcyan">idxmin</span> and <span style = "color:darkcyan">idxmax</span> return the index labels at which the minimum or maximum values are attained, not the values themselves:

In [20]:
df.idxmax()

one    b
two    d
dtype: object
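
To make the difference between labels and values concrete, this small sketch compares `idxmax` with `max` on the same DataFrame and then uses the returned label to look the value up again.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

labels = df.idxmax()  # index label of each column's maximum
values = df.max()     # the maximum values themselves

print(labels)
print(values)

# The label returned by idxmax can be fed back into .loc to get the value.
looked_up = df.loc[labels['one'], 'one']
print(looked_up)
```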

In [21]:
df.cumsum() 

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


<span style = "color:darkcyan">describe</span> produces multiple summary statistics in one shot:

In [22]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On non-numeric data, <span style = "color:darkcyan">describe</span> produces alternate summary statistics:

In [23]:
obj = Series(['a', 'a', 'b', 'c'] * 4)

In [24]:
obj.describe() 

count     16
unique     3
top        a
freq       8
dtype: object
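
The result of `describe` is itself a pandas object, so individual statistics can be picked out by label. A short sketch, re-creating both objects from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Numeric columns: rows of the summary are labeled by statistic name.
numeric_summary = df.describe()
print(numeric_summary.loc['mean', 'one'])

# Non-numeric data: the summary holds count, unique, top, and freq.
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
text_summary = obj.describe()
print(text_summary['top'], text_summary['freq'])
```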

####  <span style = "color:crimson"> Descriptive and summary statistics</span>

<span style = "color:crimson">count:</span>    Number of non-NA values

<span style = "color:crimson">describe:</span>   Compute set of summary statistics for Series or each DataFrame column 

<span style = "color:crimson">min, max:</span>   Compute minimum and maximum values 

<span style = "color:crimson">argmin, argmax:</span>   Compute index locations (integers) at which minimum or maximum value obtained, respectively 

<span style = "color:crimson">idxmin, idxmax:</span> Compute index values at which minimum or maximum value obtained, respectively 

<span style = "color:crimson">quantile:</span>  Compute sample quantile ranging from 0 to 1 

<span style = "color:crimson">sum:</span> Sum of values 

<span style = "color:crimson">mean:</span> Mean of values 

<span style = "color:crimson">median:</span> Arithmetic median (50% quantile) of values

<span style = "color:crimson">mad:</span> Mean absolute deviation from mean value (removed in pandas 2.0; it can be computed manually as `(x - x.mean()).abs().mean()`)

<span style = "color:crimson">var:</span> Sample variance of values 

<span style = "color:crimson">std:</span> Sample standard deviation of values 

<span style = "color:crimson">skew:</span> Sample skewness (3rd moment) of values 

<span style = "color:crimson">kurt:</span> Sample kurtosis (4th moment) of values 

<span style = "color:crimson">cumsum:</span> Cumulative sum of values 

<span style = "color:crimson">cummin, cummax:</span> Cumulative minimum or maximum of values, respectively 

<span style = "color:crimson">cumprod:</span> Cumulative product of values 

<span style = "color:crimson">diff:</span> Compute 1st arithmetic difference (useful for time series) 

<span style = "color:crimson">pct_change:</span> Compute percent changes
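
A few of the methods listed above, applied to a small Series whose values double at each step. The data is made up for illustration, and the mean absolute deviation is computed by hand on the assumption that the `mad` method may not be available in the installed pandas version.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

first_diff = s.diff()          # difference from the previous element
pct = s.pct_change()           # relative change from the previous element
running_product = s.cumprod()  # cumulative product
middle = s.quantile(0.5)       # 50% quantile, same as s.median()

# Mean absolute deviation from the mean, computed manually.
mean_abs_dev = (s - s.mean()).abs().mean()

print(first_diff.tolist())
print(pct.tolist())
print(running_product.tolist())
print(middle, mean_abs_dev)
```

The first element of `diff` and `pct_change` is NaN because there is no previous value to compare against.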

### <span style='color:red'>Unique Values, Value Counts, and Membership</span>

In [36]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) 

<span style='color:darkcyan'>unique</span> gives you an array of the unique values in a Series:

In [40]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

<span style='color:darkcyan'>value_counts</span> computes a Series containing value frequencies:

In [41]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [42]:
pd.value_counts(obj.values, sort=False)

b    2
d    1
c    3
a    3
dtype: int64
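
A brief sketch of the Series method form of `value_counts`. The `normalize=True` argument is not used in the text above; it is shown here as one way to get relative frequencies instead of raw counts.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Raw counts, sorted by frequency in descending order.
counts = obj.value_counts()

# Relative frequencies: each count divided by the number of values.
shares = obj.value_counts(normalize=True)

print(counts)
print(shares)
```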

<span style='color:darkcyan'>isin</span> performs a vectorized set membership check and can be very useful for filtering a data set down to a subset of values in a Series or a column in a DataFrame:

In [43]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [44]:
obj[mask] 

0    c
5    b
6    b
7    c
8    c
dtype: object
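
The boolean mask can also be inverted with `~` to keep everything outside the set. A small sketch along the same lines as the example above:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])

# Rows whose value is in the set.
inside = obj[mask]
# Rows whose value is NOT in the set, using the inverted mask.
outside = obj[~mask]

print(inside.tolist())
print(outside.tolist())
```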

<span style='color:darkcyan'>isin:</span> Compute boolean array indicating whether each Series value is contained in the passed sequence of values.

<span style='color:darkcyan'>unique:</span> Compute array of unique values in a Series, returned in the order observed. 

<span style='color:darkcyan'>value_counts:</span> Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order.


In some cases, you may want to compute a histogram on multiple related columns in a DataFrame:

In [45]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})

In [46]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


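Passing <span style='color:darkcyan'>pandas.value_counts</span> to this DataFrame’s apply function gives a table whose row labels are the distinct values found across all columns and whose entries are the counts of each value in each column: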

In [47]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
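
Recent pandas releases deprecate the top-level `pd.value_counts` function, so the sketch below gets the same kind of table by calling the Series method on each column inside a lambda. The lambda is an assumption about how one might adapt the call, not the form used above.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values per column; values absent from a column come back as NaN,
# which fillna(0) turns into a count of zero.
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```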


### <span style='color:red'>Handling Missing Data</span>

One of the goals in designing pandas was to make working with missing data as painless as possible.

pandas uses the floating point value <span style='color:deeppink'>NaN (Not a Number)</span> to represent missing data in both floating-point and non-floating-point arrays:

In [48]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [49]:
string_data 

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [50]:
string_data.isnull() 

0    False
1    False
2     True
3    False
dtype: bool
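
Python's built-in `None` is also treated as missing when it appears in a Series like this one. The sketch below is a small variation on the example above, with `None` added to the data as an assumption for illustration.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', None, np.nan, 'avocado'])

# Both None and NaN are reported as missing.
missing = string_data.isnull()
# notnull is the elementwise negation of isnull.
present = string_data.notnull()

print(missing.tolist())
print(present.tolist())
```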

<span style = "color:crimson">dropna:</span> Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.

<span style = "color:crimson">fillna:</span> Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. 

<span style = "color:crimson">isnull:</span> Return like-type object containing boolean values indicating which values are missing / NA. 

<span style = "color:crimson">notnull:</span> Negation of isnull.
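
As a quick illustration of the filling side of this list, the sketch below replaces missing values with a constant and then with the last valid observation. It uses the `ffill()` method as the forward-fill form, which is a choice made for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# Replace every missing value with a constant.
filled_zero = s.fillna(0)

# Propagate the last valid value forward into each gap.
filled_forward = s.ffill()

print(filled_zero.tolist())
print(filled_forward.tolist())
```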


### <span style='color:red'>Filtering Out Missing Data</span>
    

You have a number of options for filtering out missing data. While doing it by hand is always possible, <span style='color:darkcyan'>dropna</span> can be very helpful. On a Series, it returns a Series containing only the non-null data and the corresponding index values:

In [51]:
from numpy import nan as NA

In [52]:
data = Series([1, NA, 3.5, NA, 7])

In [53]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

The same result can be computed by boolean indexing with notnull:

In [54]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
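
A short check, under the same data as above, that the two approaches agree and that neither one modifies the original Series:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

via_dropna = data.dropna()
via_mask = data[data.notnull()]

# Both routes give the same values and the same index labels.
print(via_dropna.equals(via_mask))

# The original still has all five entries; dropna returned a new object.
print(len(data))
```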

With DataFrame objects there are more choices: you may want to drop rows or columns that are all NA, or only those containing any NAs. By default, <span style='color:darkcyan'>dropna</span> drops any row containing a missing value:

In [55]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])

In [56]:
cleaned = data.dropna()

In [57]:
data 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [58]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing <span style='color:deeppink'>how='all'</span> will only drop rows that are all NA:

In [59]:
data.dropna(how='all') 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


<span style='color:mediumvioletred'>Dropping columns in the same way is only a matter of passing axis=1: </span>

In [60]:
data[4] = NA

In [61]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [62]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
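
The row and column variants above can be compared side by side in one runnable sketch. The `subset` argument is an extra `dropna` parameter not covered in the text; it is included here as an assumption about what a reader might want next, restricting the NA check to the listed columns.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

any_na_dropped = data.dropna()           # drop rows with at least one NA
all_na_dropped = data.dropna(how='all')  # drop rows that are entirely NA
col0_required = data.dropna(subset=[0])  # drop rows where column 0 is NA

# Add an all-NA column, then drop columns that are entirely NA.
data[4] = NA
cols_dropped = data.dropna(axis=1, how='all')

print(list(any_na_dropped.index))
print(list(all_na_dropped.index))
print(list(col0_required.index))
print(list(cols_dropped.columns))
```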


Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the <span style='color:mediumvioletred'>thresh</span> argument:

In [63]:
df = DataFrame(np.random.randn(7, 3))

In [64]:
df.loc[:4, 1] = NA; df.loc[:2, 2] = NA


In [65]:
df

Unnamed: 0,0,1,2
0,2.393548,,
1,0.783063,,
2,0.38432,,
3,1.083807,,0.707912
4,-0.963181,,1.540551
5,-1.074034,-0.194629,2.190909
6,-1.348003,-0.829412,-1.182235


In [66]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2
5,-1.074034,-0.194629,2.190909
6,-1.348003,-0.829412,-1.182235
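
To see how `thresh` behaves at different settings, the sketch below rebuilds a frame with the same missing-value pattern. The seeded generator is an assumption for reproducibility, so the random values differ from those printed above, but the NA positions are the same.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)))

# Label-based slices are inclusive: rows 0-4 of column 1, rows 0-2 of column 2.
df.loc[:4, 1] = np.nan
df.loc[:2, 2] = np.nan

# Keep rows with at least 3 non-NA values (fully observed rows only).
kept_3 = df.dropna(thresh=3)
# Keep rows with at least 2 non-NA values.
kept_2 = df.dropna(thresh=2)

print(list(kept_3.index))
print(list(kept_2.index))
```

Lowering `thresh` from 3 to 2 also retains rows 3 and 4, which are missing only column 1.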
