<center>
  <a href="2.3.Intro-to-pandas_Dataframes.ipynb">Previous Page</a> | <a href="./">Content Page</a> | <a href="2.5-intro-to-munging.ipynb">Next Page</a>
</center>

# 2.4. Summary Statistics in DataFrames and Missing Data

Summary statistics methods such as sum() and mean() **exclude** missing values (NaN) by default.

This section looks in more depth at how to deal with missing values.

In [4]:
import pandas as pd
from pandas import DataFrame

import numpy as np

In [5]:
 

df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


##### By default, NaN values are skipped

In [53]:
df.sum()  # NaN values are skipped by default (skipna=True)

one    9.25
two   -5.80
dtype: float64

In [54]:
df.sum(skipna=True)

one    9.25
two   -5.80
dtype: float64

In [55]:
df.sum(skipna=False)

one   NaN
two   NaN
dtype: float64

https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sum.html

In [56]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


#### Setting axis=1 computes across the columns, giving one result per row

In [57]:
# axis=1 sums across the columns, so each row gets its own total

df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
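
Row c above sums to 0.00 even though both of its values are missing, because every NaN is skipped and the sum of nothing is zero. The sketch below shows one way to get NaN for such rows instead, assuming a reasonably recent pandas version in which `sum()` accepts the `min_count` parameter. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default behaviour: the all-NaN row "c" sums to 0.0 because every value is skipped.
row_sums = df.sum(axis=1)

# min_count=1 requires at least one non-missing value per row,
# so the all-NaN row "c" yields NaN instead of 0.0.
strict_row_sums = df.sum(axis=1, min_count=1)

print(row_sums)
print(strict_row_sums)
```

Rows that contain at least one real value give the same total in both versions.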

In [58]:
# skipna=False turns off NaN skipping, so any NaN in a column makes its mean NaN
df.mean(skipna=False)

one   NaN
two   NaN
dtype: float64

In [59]:
df.mean(skipna=False, axis=1)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

## Exercise 2.4

There are many other descriptive and summary statistics methods in pandas that make it easy to summarise our data as a first exploratory step. 

Try out these other methods on the data frame above: 

1. describe()
2. count()
3. median()
4. var()

In [6]:
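# Reminder: append ? to a method name to display its help (IPython/Jupyter)
df.describe?


##### Tip: type df. and press the Tab key to list the available methods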

##### Exercise 2.4.1: Descriptive statistics (df.describe())

In [60]:
df._______()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


##### Exercise 2.4.2: Count the non-missing values (df.count())

In [61]:
df.____()

one    3
two    2
dtype: int64
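
Since count() only counts non-missing values, the number of missing values per column is its complement. A minimal sketch, assuming a pandas version that provides `isna()` (older releases only have the equivalent `isnull()`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# isna() returns a boolean frame; summing it counts the True (missing) cells per column.
missing_per_column = df.isna().sum()

# count() reports the non-missing cells per column.
non_missing_per_column = df.count()

print(missing_per_column)
print(non_missing_per_column)
```

For every column, the two counts add up to the number of rows in the data frame.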

##### Exercise 2.4.3: Find the median (df.median())

In [62]:
df.____()

one    1.4
two   -2.9
dtype: float64

##### Exercise 2.4.4: Find the variance (df.var())

In [63]:
df.____()

one    12.205833
two     5.120000
dtype: float64

##### Exercise 2.4.5: Other statistics such as the mean and standard deviation (df.mean(), df.std())

In [9]:
df.____()

one    3.083333
two   -2.900000
dtype: float64

In [10]:
df.____()

one    3.493685
two    2.262742
dtype: float64
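
The variance and standard deviation above are sample statistics: pandas divides by n - 1 (`ddof=1`) by default, where n is the number of non-missing values in each column. The sketch below recomputes the variance by hand to make that visible, and shows `ddof=0` for the population variance. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default: sample variance, dividing by (n - 1) with NaN values skipped.
sample_var = df.var()

# The same quantity by hand: squared deviations from the column mean,
# summed (NaN skipped) and divided by the non-missing count minus one.
manual_var = ((df - df.mean()) ** 2).sum() / (df.count() - 1)

# ddof=0 divides by n instead, giving the population variance.
population_var = df.var(ddof=0)

print(sample_var)
print(manual_var)
print(population_var)
```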

## 2.4.2. What to do with missing data? 

#### Sometimes the easiest thing to do is to get rid of them 

In [64]:
from pandas import Series  # Series has not been imported yet in this notebook
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [65]:
from numpy import nan  # the same value as NA above; the alias is only a shorter name
data = Series([1, nan, 3.5, nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
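
On a Series, dropna() gives the same result as filtering with a boolean mask of the non-missing entries. A small sketch of that equivalence, assuming a pandas version that provides `notna()` (older releases use `notnull()`):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# notna() marks each entry True where a value is present and False where it is NaN.
mask = data.notna()

# Boolean indexing keeps only the entries where the mask is True.
kept = data[mask]

print(kept)
print(kept.equals(data.dropna()))
```

The original index labels (0, 2, 4) are preserved in both cases, so the result is not renumbered.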

In [66]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                  [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


##### dropna() by default drops any row containing a missing value 

In [67]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


##### Only drop rows in which every value is NA

In [68]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0
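
Between dropping any row with a missing value and dropping only fully empty rows, dropna() also offers a `thresh` parameter, and `axis=1` applies the same rules to columns. The sketch below is illustrative only. The extra empty column (label 3) is a hypothetical addition that is not part of the data above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]]
)

# thresh=2 keeps only the rows that have at least two non-missing values.
at_least_two = data.dropna(thresh=2)

# Hypothetical extra column that is entirely missing, added to demonstrate axis=1.
with_empty_column = data.copy()
with_empty_column[3] = np.nan

# axis=1 with how='all' drops the columns in which every value is missing.
without_empty_column = with_empty_column.dropna(axis=1, how="all")

print(at_least_two)
print(without_empty_column)
```

None of these calls modify `data` itself; each returns a new data frame.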


### To Sum up: Old Ways of Retrieving Data

In [69]:
import numpy as np
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])


In [70]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [72]:
data.ix['Colorado',['two','three']]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


two      5
three    6
Name: Colorado, dtype: int32

In [73]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [74]:
data.ix[:'Utah','two']



Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

#### Mixing label-based (loc) and position-based (iloc) selection

In [75]:
data.ix[['Colorado','Utah'], [3,0,1]]



Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9
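
The .ix indexer shown above was removed entirely in pandas 1.0, so these cells raise an error on current versions. The sketch below gives one possible translation of the three selections using .loc and .iloc. The variable names are illustrative, and the mixed case can be written in more than one way.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Label-based selection: one row, two columns.
by_label = data.loc["Colorado", ["two", "three"]]

# Label slices with .loc include the end label, so "Utah" is part of the result.
up_to_utah = data.loc[:"Utah", "two"]

# Mixed selection, option 1: turn the column positions into labels and use .loc.
mixed_loc = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# Mixed selection, option 2: turn the row labels into positions and use .iloc.
row_positions = data.index.get_indexer(["Colorado", "Utah"])
mixed_iloc = data.iloc[row_positions, [3, 0, 1]]

print(by_label)
print(up_to_utah)
print(mixed_loc)
```

Both mixed options return the same rows and columns as the `.ix` call above.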



#### More detailed information: 
https://pandas.pydata.org/pandas-docs/stable/dsintro.html<br>
https://discuss.analyticsvidhya.com/t/what-is-the-difference-between-pandas-series-and-python-lists/27373/2<br>
https://stackoverflow.com/questions/26047209/what-is-the-difference-between-a-pandas-series-and-a-single-column-dataframe

### Possible Solution:
#### Exercise 2.4:

In [2]:
import pandas as pd
import numpy as np
from pandas import DataFrame

df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [3]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [4]:
df.count()

one    3
two    2
dtype: int64

In [5]:
df.median()

one    1.4
two   -2.9
dtype: float64

In [6]:
df.var()

one    12.205833
two     5.120000
dtype: float64

In [7]:
df.mean()

one    3.083333
two   -2.900000
dtype: float64

In [8]:
df.std()

one    3.493685
two    2.262742
dtype: float64