# Pandas

- The name pandas derives from "panel data" (multidimensional, structured data sets) and is also a play on "Python data analysis"
- Data analysis and modeling
- Supports manipulating tables and series
- Early releases shipped linear and panel regression routines; these were removed in 0.20.0
- Libraries such as statsmodels and scikit-learn now cover regression and more advanced modeling
- DataFrame supports data manipulation and indexing
- Reading/writing data files
- Reshaping and pivoting of data sets
- Slicing/indexing/subsetting large data sets
- Column insertion and deletion
- Group by support for split-apply-combine operations
- Data set merging and joining
- Hierarchical axis indexing for working with high-dimensional data in a lower-dimensional structure
- Time series support: frequency conversion, moving window statistics, date shifting and lagging
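
Several of the features listed above can be sketched in a few lines. This is a minimal illustration only: the `sales` and `managers` tables, their column names, and all values are invented for the example.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})

# Split-apply-combine: total revenue per region
totals = sales.groupby("region")["revenue"].sum()

# Reshaping: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")

# Merging: attach a manager name to each row through the shared "region" key
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")

# Time series: 2-day moving average over a daily DatetimeIndex
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=pd.date_range("2013-01-01", periods=4))
rolling_mean = ts.rolling(window=2).mean()

print(totals)
print(wide)
print(merged)
print(rolling_mean)
```

The first value of `rolling_mean` is NaN because a window of two needs two observations before it can produce a result.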

[Download Anaconda Distribution](https://www.anaconda.com/download/)  

## Pandas Docs

Pandas deals with three main data structures:
- **Series**: 1D labeled, homogeneously typed, size-immutable array
- **DataFrame**: 2D labeled, size-mutable tabular structure whose columns may have different types
- **Panel**: 3D labeled, size-mutable structure (deprecated in 0.20.0 and removed in 0.25.0)

[Docs: The pandas Tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)  
[Pandas Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)  

In [1]:
# execute this cell before any subsequent cells in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# https://docs.python.org/3/tutorial/datastructures.html
print("[10,20,30,40]")
l = [10,20,30,40]
print(type(l))
print(l)
print()

# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html
print("np.array([10,20,30,40])")
a = np.array([10,20,30,40])
print(type(a))
print(a)
print()

# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
print("pd.Series([10,20,30,40])")
s = pd.Series([10,20,30,40])
print(type(s))
print(s)
print()

# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
print("pd.DataFrame([10,20,30,40])")
d = pd.DataFrame([10,20,30,40])
print(type(d))
print(d)

[10,20,30,40]
<class 'list'>
[10, 20, 30, 40]

np.array([10,20,30,40])
<class 'numpy.ndarray'>
[10 20 30 40]

pd.Series([10,20,30,40])
<class 'pandas.core.series.Series'>
0    10
1    20
2    30
3    40
dtype: int64

pd.DataFrame([10,20,30,40])
<class 'pandas.core.frame.DataFrame'>
    0
0  10
1  20
2  30
3  40


## Series

[Docs: pandas.Series](http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.Series.html)

In [3]:
# Creating an empty series
s1 = pd.Series()
print(s1)
print(s1.shape)

# Add index and value
s1['A'] = 42
print(s1)
print(s1.shape)
print("---\n")

print("s1.dtypes", s1.dtypes)
print("s1.axes", s1.axes)
print("s1.ndim", s1.ndim)
print("s1.size", s1.size)
print("s1.shape", s1.shape)
print("s1.values", s1.values)
print("len(s1)", len(s1))

Series([], dtype: float64)
(0,)
A    42
dtype: int64
(1,)
---

s1.dtypes int64
s1.axes [Index(['A'], dtype='object')]
s1.ndim 1
s1.size 1
s1.shape (1,)
s1.values [42]
len(s1) 1


In [4]:
# Use concat() to append one Series to another; the original index labels are kept, so 0 and 1 each appear twice
s1 = pd.Series([1, 2])
print(s1)
print(s1.shape)
s2 = pd.Series([10, 20])
print(s2)
print(s2.shape)
s3 = pd.concat([s1, s2])
print(s3)
print(s3.shape)

0    1
1    2
dtype: int64
(2,)
0    10
1    20
dtype: int64
(2,)
0     1
1     2
0    10
1    20
dtype: int64
(4,)
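
The concatenated result above carries the labels 0 and 1 twice. Two common variations are sketched below: `ignore_index=True` to renumber the result, and `axis=1` to place the Series side by side, which aligns on index labels and behaves like an outer join by default. The Series `a` and `b` and their labels are invented for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([10, 20])

# ignore_index=True discards the original labels and renumbers from 0
s3 = pd.concat([s1, s2], ignore_index=True)
print(s3)

# axis=1 aligns on index labels; labels missing on one side become NaN
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```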


In [5]:
# Creating a Series by passing a list of values, letting pandas create a default integer index
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
print(s.shape)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
(6,)


In [6]:
# Create a Series by passing a list of numeric values and a list of integer index labels
# With an integer index, s[1] looks up the label 1, not the second position
s = pd.Series([1,3,5,np.nan,6,8], index = [1, 2, 4, 6, 8, 10])
print(s)
print(s[1])

1     1.0
2     3.0
4     5.0
6     NaN
8     6.0
10    8.0
dtype: float64
1.0


In [7]:
# Create a Series by passing an array of random floats (standard normal) and a list of character index labels
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s['a'])

a    0.341781
b    2.139182
c    1.156589
d   -0.646238
e   -0.081690
dtype: float64
0.3417807479175985


In [8]:
# Series can be instantiated from a dict
d = {'b' : 1, 'a' : 0, 'c' : 2}
s = pd.Series(d)
print(s)

a    0
b    1
c    2
dtype: int64


In [9]:
# If an index is passed, the values in data that correspond to the index labels are selected; labels with no match get NaN
d = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(d)
print(s)
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
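
The same label alignment applies to arithmetic between two Series. The sketch below uses a second Series `t`, invented for this example, whose labels only partly overlap with `s`.

```python
import pandas as pd

s = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
t = pd.Series({"b": 10.0, "c": 20.0, "d": 30.0})  # hypothetical second Series

# Values are matched by label, not by position; unmatched labels give NaN
total = s + t
print(total)

# add() with fill_value treats a label missing on one side as 0
filled = s.add(t, fill_value=0)
print(filled)
```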


In [10]:
# If data is scalar value, index must be provided and scalar value is repeated
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64


In [11]:
# A Series behaves much like an ndarray and is a valid argument to most NumPy functions
# A Series also supports ndarray-style slicing

s = pd.Series([1,3,5,np.nan,9,11])
print("s\n", end="");print(s, "\n")
print("s.values\n", end="");print(s.values, "\n")
print("s[0]\n", end="");print(s[0], "\n")
print("s[3]\n", end="");print(s[3], "\n")
print("s[:3]\n", end="");print(s[:3], "\n")
print("s.mean()", s.mean());print()
print("s > s.median()\n", end="");print(s > s.median(), "\n")
print("s[s > s.median()]\n", end="");print(s[s > s.median()], "\n")
print("s[[4, 3, 1]]\n", end="");print(s[[4, 3, 1]], "\n")
print("np.exp(s)\n", end="");print(np.exp(s), "\n")


s
0     1.0
1     3.0
2     5.0
3     NaN
4     9.0
5    11.0
dtype: float64 

s.values
[ 1.  3.  5. nan  9. 11.] 

s[0]
1.0 

s[3]
nan 

s[:3]
0    1.0
1    3.0
2    5.0
dtype: float64 

s.mean() 5.8

s > s.median()
0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool 

s[s > s.median()]
4     9.0
5    11.0
dtype: float64 

s[[4, 3, 1]]
4    9.0
3    NaN
1    3.0
dtype: float64 

np.exp(s)
0        2.718282
1       20.085537
2      148.413159
3             NaN
4     8103.083928
5    59874.141715
dtype: float64 
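
Note that `s.mean()` above returned 5.8 even though one entry is NaN: pandas reductions skip missing values by default. A short sketch of the usual ways to inspect and handle missing values, using the same data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 9, 11])

n_missing = s.isna().sum()            # count missing entries
dropped = s.dropna()                  # new Series without the NaN row
zero_filled = s.fillna(0)             # new Series with NaN replaced by 0
strict_mean = s.mean(skipna=False)    # NaN propagates when skipna is off

print(n_missing, len(dropped), zero_filled.mean(), strict_mean)
```

`dropna()` and `fillna()` return new objects, so `s` itself still contains the NaN afterwards.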



## DataFrames

[Docs: pandas.DataFrame](http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.html)

In [12]:
# Creating a single-column DataFrame from a DatetimeIndex, with a labeled column
dates = pd.date_range('20130101', periods=12)
print("\ndates\n", dates)
df = pd.DataFrame(dates, columns=["date"])
print("\ndf\n", df)
print("\ndf.head()\n",  df.head())
print("\ndf.tail()\n",  df.tail())
print("\ndf.head(3)\n", df.head(3))
print("\ndf.iloc[3]\n", df.iloc[3])
print("\ndf.shape\n",   df.shape)
df.describe()


dates
 DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12'],
              dtype='datetime64[ns]', freq='D')

df
          date
0  2013-01-01
1  2013-01-02
2  2013-01-03
3  2013-01-04
4  2013-01-05
5  2013-01-06
6  2013-01-07
7  2013-01-08
8  2013-01-09
9  2013-01-10
10 2013-01-11
11 2013-01-12

df.head()
         date
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05

df.tail()
          date
7  2013-01-08
8  2013-01-09
9  2013-01-10
10 2013-01-11
11 2013-01-12

df.head(3)
         date
0 2013-01-01
1 2013-01-02
2 2013-01-03

df.iloc[3]
 date   2013-01-04
Name: 3, dtype: datetime64[ns]

df.shape
 (12, 1)


                       date
count                    12
unique                   12
top     2013-01-09 00:00:00
freq                      1
first   2013-01-01 00:00:00
last    2013-01-12 00:00:00


In [13]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

print("\ndf")
print(df)

print("\ndf.dtypes")
print(df.dtypes)

#print(df[0]) # KeyError

print("\nselect row 0\tdf[0:1]")
print(df[0:1])

print("\nselect row 0\t\tdf.iloc[0]")
print(df.iloc[0])

print("\nselect column 'A'\tdf['A']")
print(df['A'])

print("\nselect row and column\tdf['A'][0]")
print(df['A'][0])


df
                   A         B         C         D
2013-01-01 -0.493489 -1.867854 -0.693528  0.051362
2013-01-02  0.065439 -0.189497  0.646952  0.743015
2013-01-03 -0.031777 -0.443974 -1.027095  1.902984
2013-01-04  0.132046  0.512629 -0.974647  1.126266
2013-01-05  0.440120  0.659647 -1.122417  0.907400
2013-01-06  0.781445  0.158817 -1.216605  0.463534

df.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object

select row 0	df[0:1]
                   A         B         C         D
2013-01-01 -0.493489 -1.867854 -0.693528  0.051362

select row 0		df.iloc[0]
A   -0.493489
B   -1.867854
C   -0.693528
D    0.051362
Name: 2013-01-01 00:00:00, dtype: float64

select column 'A'	df['A']
2013-01-01   -0.493489
2013-01-02    0.065439
2013-01-03   -0.031777
2013-01-04    0.132046
2013-01-05    0.440120
2013-01-06    0.781445
Freq: D, Name: A, dtype: float64

select row and column	df['A'][0]
-0.49348946457315995
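
The selections above mix plain `[]` indexing with `iloc`. As a complementary sketch, `loc` selects by label and `iloc` by position, and both accept a row and a column selector in one call. The frame below uses `np.arange` instead of random numbers so the values are predictable; that substitution is an assumption made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Single value: by label with loc, by position with iloc
first_a = df.loc[dates[0], "A"]
same = df.iloc[0, 0]

# Label slices include both endpoints; positional slices exclude the stop
block = df.loc["2013-01-02":"2013-01-04", ["A", "B"]]
block_pos = df.iloc[1:4, 0:2]

print(first_a, same)
print(block)
```

Using a single `loc` or `iloc` call is generally preferred over chained indexing such as `df['A'][0]`, because it states explicitly whether the lookup is by label or by position.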


In [14]:
# Creating a DataFrame by passing a dict of objects that can be converted to Series-like columns; scalar values are repeated for every row
df2 = pd.DataFrame({
    'A' : 1.,
    'B' : pd.Timestamp('20130102'),
    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
    'D' : np.array([3] * 4,dtype='int32'),
    'E' : pd.Categorical(["test","train","test","train"]),
    'F' : 'foo' })
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


In [15]:
dfHousing = pd.read_csv('data/Housing.csv', nrows=5)
print("dfHousing")
print(dfHousing)
print("\n---\n")
print("dfHousing[['bedrooms', 'price']]")
print(dfHousing[['bedrooms', 'price']])
print("\n---\n")
print("dfHousing.iloc[2][['bedrooms', 'price']]")
print(dfHousing.iloc[2][['bedrooms', 'price']])

dfHousing
   Unnamed: 0  price  lotsize  bedrooms  bathrms  stories driveway recroom  \
0           1  42000     5850         3        1        2      yes      no   
1           2  38500     4000         2        1        1      yes      no   
2           3  49500     3060         3        1        1      yes      no   
3           4  60500     6650         3        1        2      yes     yes   
4           5  61000     6360         2        1        1      yes      no   

  fullbase gashw airco  garagepl prefarea  
0      yes    no    no         1       no  
1       no    no    no         0       no  
2       no    no    no         0       no  
3       no    no    no         0       no  
4       no    no    no         0       no  

---

dfHousing[['bedrooms', 'price']]
   bedrooms  price
0         3  42000
1         2  38500
2         3  49500
3         3  60500
4         2  61000

---

dfHousing.iloc[2][['bedrooms', 'price']]
bedrooms        3
price       49500
Name: 2, dtype: object
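
The `Unnamed: 0` column in the output above appears because the first column of the file has no header. A minimal sketch of avoiding it with `index_col=0` follows; the inline CSV text is a stand-in for `data/Housing.csv` and copies only three columns of the first rows shown above.

```python
import io
import pandas as pd

# Stand-in for data/Housing.csv: the first header field is empty
csv_text = """,price,lotsize,bedrooms
1,42000,5850,3
2,38500,4000,2
3,49500,3060,3
"""

# Default behaviour: the unnamed first column becomes "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that column as the row index instead
housing = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=2)
print(housing)
```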

In [16]:
trades = pd.read_csv('data/Trades.csv')
print(trades.head())
trades['Sign'] = [1 if ls == 'Long' else -1 for ls in trades['Position']]
trades['Profit'] = trades['Quantity'] * trades['Sign'] * (trades['Close Price'] - trades['Open Price']) 
print(trades.head())
print("Total:  ", trades['Profit'].count())
print("Profit: ", round(trades['Profit'].sum(), 2))

  Position  Quantity            Open Time  Open Price           Close Time  \
0     Long      2400  2018-03-10 09:30:15     30.6035  2018-03-10 09:31:45   
1    Short      4700  2018-03-10 09:32:00     31.6353  2018-03-10 09:34:00   
2     Long      3500  2018-03-10 09:34:15     30.4790  2018-03-10 09:36:30   
3    Short      3100  2018-03-10 09:36:45     33.2328  2018-03-10 09:38:45   
4     Long      3500  2018-03-10 09:39:00     32.7397  2018-03-10 09:39:45   

   Close Price  
0      31.7985  
1      29.5608  
2      34.0210  
3      32.7242  
4      32.3310  
  Position  Quantity            Open Time  Open Price           Close Time  \
0     Long      2400  2018-03-10 09:30:15     30.6035  2018-03-10 09:31:45   
1    Short      4700  2018-03-10 09:32:00     31.6353  2018-03-10 09:34:00   
2     Long      3500  2018-03-10 09:34:15     30.4790  2018-03-10 09:36:30   
3    Short      3100  2018-03-10 09:36:45     33.2328  2018-03-10 09:38:45   
4     Long      3500  2018-03-10 09:39:
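
The `Sign` column in the cell above is built with a list comprehension. A vectorized alternative using `np.where` is sketched below. The two rows are hypothetical; only the column names are taken from `Trades.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical trades with the same column names as Trades.csv (values invented)
trades = pd.DataFrame({
    "Position": ["Long", "Short"],
    "Quantity": [100, 200],
    "Open Price": [10.0, 20.0],
    "Close Price": [11.0, 19.0],
})

# +1 for long positions, -1 for everything else, computed column-wise
trades["Sign"] = np.where(trades["Position"] == "Long", 1, -1)
trades["Profit"] = trades["Quantity"] * trades["Sign"] * (trades["Close Price"] - trades["Open Price"])

# Profit broken down by side
by_side = trades.groupby("Position")["Profit"].sum()
print(trades)
print(by_side)
```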

In [17]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
           'Capital': ['Brussels', 'New Delhi', 'Brasília'],
           'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data,
            columns=['Country', 'Capital', 'Population'])

print("\ndf")
print(df)

print("\ndf.iloc[[0],[0]]")
print(df.iloc[[0],[0]]) # Select by row & column position (returns a 1x1 DataFrame)

print("\ndf.loc[[0], ['Country']]")
print(df.loc[[0], ['Country']]) # Select by row & column label (returns a 1x1 DataFrame)

print("\ndf.iat[0,0]")
print(df.iat[0,0]) # Select single value by row & column position

print("\ndf.at[0, 'Country']")
print(df.at[0, 'Country']) # Select single value by row & column label

print("\ndf[df['Population']>12000000]")
print(df[df['Population']>12000000])  # Filter rows with a boolean mask

print("\ndf[1:]")
print(df[1:]) # Get subset of rows by slicing

print("\ndf.drop('Country', axis=1)")
print(df.drop('Country', axis=1)) # Drop a column (axis=1); returns a new DataFrame


df
   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528

df.iloc[[0],[0]]
   Country
0  Belgium

df.loc[[0], ['Country']]
   Country
0  Belgium

df.iat[0,0]
Belgium

df.at[0, 'Country']
Belgium

df[df['Population']>12000000]
  Country    Capital  Population
1   India  New Delhi  1303171035
2  Brazil   Brasília   207847528

df[1:]
  Country    Capital  Population
1   India  New Delhi  1303171035
2  Brazil   Brasília   207847528

df.drop('Country', axis=1)
     Capital  Population
0   Brussels    11190846
1  New Delhi  1303171035
2   Brasília   207847528
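
Filtering and column selection can also be combined in a single `.loc` call, and several conditions can be joined with `&` or `|`. The sketch below rebuilds a small country table so that it runs on its own; the upper population bound is an arbitrary threshold chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasília"],
    "Population": [11190846, 1303171035, 207847528],
})

# Rows by boolean mask, columns by label, in one .loc call
big = df.loc[df["Population"] > 12000000, ["Country", "Population"]]

# Combine conditions with & or |; each comparison needs its own parentheses
mid = df[(df["Population"] > 12000000) & (df["Population"] < 1000000000)]

# drop() returns a new DataFrame; df itself keeps all three columns
trimmed = df.drop(columns="Country")

print(big)
print(mid)
print(trimmed.columns.tolist())
```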
