In [35]:
import numpy as np
import pandas as pd

from IPython.display import display, HTML

# Pandas Basics II

#### Boolean Comparisons

Series and DataFrame have the binary comparison methods eq, ne, le, lt, ge, and gt whose behavior is vectorized:

eq (equivalent to ==): equal to

ne (equivalent to !=): not equal to

le (equivalent to <=): less than or equal to

lt (equivalent to <): less than

ge (equivalent to >=): greater than or equal to

gt (equivalent to >): greater than
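
As a small illustration of the equivalence listed above, the sketch below compares the method form with the operator form. The frame `frame` and the Series `thresholds` are made-up names and values used only for this example. It also shows one thing the method form can do that the operator cannot: take an `axis` argument when comparing against a Series.

```python
import pandas as pd

# Hypothetical data, chosen only to make the comparison easy to follow
frame = pd.DataFrame({"x": [1, 5, 3], "y": [4, 2, 6]})

# The method form and the operator form produce the same boolean DataFrame
via_method = frame.gt(3)
via_operator = frame > 3
print(via_method.equals(via_operator))

# With axis="index", each row is compared against the matching entry
# of a Series aligned on the row labels
thresholds = pd.Series([0, 10, 3])
row_wise = frame.ge(thresholds, axis="index")
print(row_wise)
```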


pd.read_csv()
The read_csv function is a simple way to read CSV (comma-separated values) files, a commonly used format for storing tabular data.
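
A minimal sketch of read_csv follows. To keep it self-contained, the CSV content is held in a string and wrapped in `io.StringIO` instead of being read from disk; the column names and values are invented for the example. Passing a file path works the same way.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object
scores = pd.read_csv(io.StringIO(csv_text))
print(scores)
print(scores.dtypes)
```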



In [4]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [5]:
df

Unnamed: 0,one,two,three
a,-0.948215,-0.57047,
b,-0.219192,-0.734155,0.815758
c,-1.7286,-2.053497,0.185963
d,,-1.855228,-1.477839


In [6]:
df2 = df.copy()

In [7]:
df.gt(df2)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [8]:
df.ne(df2)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False
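
The True entries in the result above sit exactly where df holds NaN, because NaN never compares equal to anything, including itself. A sketch of the consequence, using the made-up frames `left` and `right`: an element-wise check reports a difference even for an exact copy, while the equals() method treats NaNs in the same location as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, and an exact copy of it
left = pd.DataFrame({"v": [1.0, np.nan]})
right = left.copy()

# Element-wise ==: the NaN position compares as False
elementwise_all_equal = (left == right).all().all()

# equals() considers NaNs in matching positions to be equal
same_content = left.equals(right)

print(elementwise_all_equal, same_content)
```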


#### You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

In [9]:
(df > 0).all()

one      False
two      False
three    False
dtype: bool

In [10]:
(df > 0).any()

one      False
two      False
three     True
dtype: bool

In [11]:
(df > 0).any().any()

True
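
The empty property is listed among the reductions but not demonstrated. A short sketch with two hypothetical frames: empty is True only when the object has no elements at all, so a frame that contains nothing but NaN is still not empty.

```python
import pandas as pd

# Hypothetical frames: one with no rows, one with a single NaN
no_rows = pd.DataFrame({"one": []})
only_nan = pd.DataFrame({"one": [float("nan")]})

print(no_rows.empty)   # no elements at all
print(only_nan.empty)  # one element, even though it is missing
```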

*To evaluate single-element pandas objects in a boolean context, use the method bool():*

In [12]:
pd.Series([True]).bool()

True

In [13]:
pd.Series([False]).bool()

False

In [14]:
pd.DataFrame([[True]]).bool()

True

In [15]:
pd.DataFrame([[False]]).bool()

False
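
An explicit reduction is needed because using a pandas object directly in an `if` statement raises a ValueError. The sketch below shows that error and, as an alternative for a one-element Series, the item() method. Note that recent pandas releases deprecate bool(), so item(), any(), or all() may be the safer choice depending on the installed version. The variable names are illustrative only.

```python
import pandas as pd

flags = pd.Series([True, False])

# Truth-testing a Series directly is ambiguous and raises ValueError
message = None
try:
    if flags:
        pass
except ValueError as err:
    message = str(err)
print(message)

# For a Series with exactly one element, item() returns that element
single = pd.Series([True])
value = single.item()
print(value)
```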

### Objects comparison

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [16]:
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [36]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool
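
The same-length requirement matters: a sketch of what happens when it is violated, using two hypothetical Series of different lengths. The comparison raises a ValueError instead of silently padding or truncating.

```python
import pandas as pd

# Hypothetical Series of different lengths (and therefore different labels)
short = pd.Series(["foo", "bar"])
longer = pd.Series(["foo", "bar", "baz"])

mismatch_error = None
try:
    short == longer
except ValueError as err:
    # pandas refuses to compare Series that are not identically labeled
    mismatch_error = str(err)

print(mismatch_error)
```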

### Descriptive Statistics

A large number of methods compute descriptive statistics and related operations on Series and DataFrame. All of them are vectorized, and most are aggregations that produce a lower-dimensional result.

Generally speaking, these methods take an axis as an argument and the axis can be specified by name or integer:

In [37]:
display(df)

Unnamed: 0,one,two,three
a,-0.948215,-0.57047,
b,-0.219192,-0.734155,0.815758
c,-1.7286,-2.053497,0.185963
d,,-1.855228,-1.477839


In [21]:
# Aggregate mean for each column
df.mean(0)

one     -0.965336
two     -1.303338
three   -0.158706
dtype: float64

In [22]:
# Aggregate mean for each row
df.mean(1)

a   -0.759343
b   -0.045863
c   -1.198711
d   -1.666534
dtype: float64
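
The axis can also be given by name, as mentioned above. A minimal sketch with a made-up frame `grid`: axis="index" is the same as axis=0 and axis="columns" is the same as axis=1. The last lines illustrate why the means above are computed from the non-missing values only: NaN is skipped by default, which skipna=False turns off.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with easy-to-check values
grid = pd.DataFrame({"p": [1.0, 2.0], "q": [3.0, 5.0]}, index=["r1", "r2"])

col_means_by_int = grid.mean(axis=0)
col_means_by_name = grid.mean(axis="index")    # same as axis=0
row_means_by_name = grid.mean(axis="columns")  # same as axis=1
print(col_means_by_name)
print(row_means_by_name)

# Missing values are skipped unless skipna=False is passed
gappy = pd.Series([1.0, np.nan])
print(gappy.mean())
print(gappy.mean(skipna=False))
```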

By applying vectorized operations, we can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [23]:
ts_stand = (df - df.mean()) / df.std()

In [33]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64
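
The example above standardizes each column. Standardizing each row takes one more step, because the row statistics have to be aligned along the index; the sub() and div() methods accept an axis for this. The sketch below uses randomly generated data with a fixed seed, and the names `data` and `xs_stand` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 rows, 3 columns of standard normal draws
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((4, 3)), columns=["one", "two", "three"])

# Per-row mean and standard deviation
row_mean = data.mean(axis=1)
row_std = data.std(axis=1)

# axis=0 aligns the row statistics with the row labels
xs_stand = data.sub(row_mean, axis=0).div(row_std, axis=0)
print(xs_stand.std(axis=1))
```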

*Most popular descriptive statistics in Pandas*

| Function | Description |
| ----------- | ----------- |
| sum() | Return sum of values |
| count() | Return number of non-null observations |
| mean() | Return mean of values |
| median() | Return median of values |
| mode() | Return mode of values |
| std() | Return standard deviation of values |
| min() | Return minimum |
| max() | Return maximum |
| abs() | Return absolute value |
| prod() | Return product of values |
| cumsum() | Return cumulative sum |
| cumprod() | Return cumulative product |
| mad() | Mean absolute deviation (removed in pandas 2.0) |
| var() | Unbiased variance |
| skew() | Sample skewness |
| kurt() | Sample kurtosis |
| quantile() | Sample quantile |
| cummax() | Cumulative maximum |
| cummin() | Cumulative minimum |
| describe() | Return summary of descriptive statistics |
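
Not every function in the table reduces its input. A brief sketch with a made-up column `n`: sum() collapses to a scalar, cumsum() keeps the original length, and describe() returns a small table of several statistics at once.

```python
import pandas as pd

# Hypothetical single-column frame
sample = pd.DataFrame({"n": [1, 2, 3, 4]})

total = sample["n"].sum()       # aggregation: one scalar
running = sample["n"].cumsum()  # cumulative: same length as the input
summary = sample.describe()     # count, mean, std, min, quartiles, max

print(total)
print(running.tolist())
print(summary)
```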


### Iteration

In short, basic iteration (for i in object) produces:

Series: values
DataFrame: column labels

In [38]:
df = pd.DataFrame({'col1': np.random.randn(3),
                     'col2': np.random.randn(3)}, index=['a', 'b', 'c'])

In [39]:
for col in df:
    print(col)

col1
col2
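
For comparison, a sketch of the Series case summarized above: iterating a Series yields its values, not its labels, while iterating a DataFrame yields the column labels. The Series `ser` is a made-up example.

```python
import pandas as pd

# Hypothetical Series with string labels
ser = pd.Series([10, 20], index=["a", "b"])

values_seen = [v for v in ser]                # the values
labels_seen = [label for label in ser.index]  # labels need .index

# A DataFrame, by contrast, yields its column labels
frame_cols = [c for c in pd.DataFrame({"col1": [1], "col2": [2]})]

print(values_seen, labels_seen, frame_cols)
```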


In [40]:
# To iterate over the columns or rows of a DataFrame, you can use the following methods:

In [41]:
# items(): iterate over the columns as (column label, Series) pairs.
# iterrows(): iterate over the rows of a DataFrame as (index, Series) pairs.
#     This converts each row to a Series, which can change the dtypes and has some performance implications.
# itertuples(): iterate over the rows of a DataFrame as namedtuples of the values.
#     This is a lot faster than iterrows() and is in most cases the preferable way to iterate over the values of a DataFrame.

#### items

Consistent with the dict-like interface, items() iterates through key-value pairs:

Series: (index, scalar value) pairs
DataFrame: (column, Series) pairs

In [42]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [43]:
for label, ser in df.items():
    print(label)
    print(ser)

a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    a
1    b
2    c
Name: b, dtype: object





#### iterrows


iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row:

In [44]:
for row_index, row in df.iterrows():
    print(row_index, row, sep='\n')

0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object
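
Each row above comes back with dtype object, because a single Series has to hold both the integer and the string. The sketch below makes the same effect visible with numeric data: in a hypothetical frame `mixed` with an integer column and a float column, the integer is upcast to float once the row becomes a Series.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
mixed = pd.DataFrame({"ints": [1, 2], "floats": [1.5, 2.5]})

# Take only the first (index, row) pair
first_label, first_row = next(mixed.iterrows())

print(mixed["ints"].dtype)  # integer dtype in the DataFrame
print(first_row.dtype)      # the row Series uses one common dtype
print(first_row["ints"])    # the integer has become a float
```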


#### itertuples

itertuples() iterates over the rows of a DataFrame as namedtuples of the values:

In [45]:
for row in df.itertuples():
    print(row)


Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')
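
The fields of each namedtuple can be read by attribute, as this sketch shows. It also uses two itertuples() parameters: index=False drops the index from each tuple and name=None returns plain tuples instead of namedtuples. The frame `people` repeats the data used above under a different, illustrative name.

```python
import pandas as pd

people = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# Attribute access on each namedtuple; Index holds the row label
pairs = [(row.Index, row.a, row.b) for row in people.itertuples()]

# Plain tuples without the index
plain = list(people.itertuples(index=False, name=None))

print(pairs)
print(plain)
```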
