# An introduction to Pandas

This package provides fundamental *routines* and *data structures* for doing **data analysis** and **manipulation** in Python. It is built on top of NumPy.

In [None]:
import pandas as pd
import numpy as np

## References

* Python for Data Analysis, Chapter 5, by Wes McKinney, O'REILLY
* [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)

## Data Structures

### Series

In [None]:
s = pd.Series([4, 7, -5, 3])
print(s)

This data structure is a one-dimensional ndarray with axis labels.

In [None]:
print(type(s.values))
print(s.values)

Indexing is similar to that of NumPy arrays:

In [None]:
s[1:3] = 22
print(s)
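The similarity with NumPy holds for positional slices, but slices by label behave slightly differently. The sketch below uses a hypothetical Series `t` (not part of the original notebook) to contrast `iloc`, which excludes the stop position as NumPy does, with `loc`, which includes the stop label.

```python
import pandas as pd

# Hypothetical labeled Series, used only for this illustration.
t = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'z'])

by_position = t.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = t.loc['x':'z']   # labels 'x' through 'z'; the stop label is included

print(by_position.tolist())
print(by_label.tolist())
```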

Index values must be hashable (like dictionary keys), but they do not have to be unique:

In [None]:
obj2 = pd.Series([4, 7, -5, 3, 12], index=['d', 'b', 'a', 'c', 'a'])
print(obj2)

Hence, a Series can be used much like a dictionary:

In [None]:
print("obj2['c'] : ", obj2['c'])
print('b' in obj2, 'and', 'f' in obj2)
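One difference from a dictionary is worth illustrating: since labels may repeat, a lookup does not always return a single value. The sketch below rebuilds the same data under the hypothetical name `dup` and compares a label that appears once with one that appears twice.

```python
import pandas as pd

# Same values and labels as above; 'a' appears twice in the index.
dup = pd.Series([4, 7, -5, 3, 12], index=['d', 'b', 'a', 'c', 'a'])

unique_hit = dup['c']     # label appears once: a single value
repeated_hit = dup['a']   # label appears twice: a Series holding both entries

print(unique_hit)
print(type(repeated_hit).__name__, repeated_hit.tolist())
print(dup.index.is_unique)
```

Checking `index.is_unique` first is one way to know which of the two behaviors to expect.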

It can also be used like a NumPy array:

In [None]:
obj3 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print((obj3 > 0) & (obj3 < 5))

In [None]:
obj3[(obj3 > 0) & (obj3 < 5)]

In [None]:
obj3 * 2

A Series can therefore be used in place of NumPy arrays in functions such as `numpy.exp`:

In [None]:
np.exp(obj3)
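As a small sketch of what such a call returns, the example below applies `numpy.exp` to a hypothetical Series `v` and checks that the result is again a Series carrying the original labels.

```python
import numpy as np
import pandas as pd

# Hypothetical Series used only for this illustration.
v = pd.Series([0.0, 1.0, 2.0], index=['p', 'q', 'r'])
w = np.exp(v)

print(type(w).__name__)   # the result is still a pandas Series
print(list(w.index))      # the labels are carried over unchanged
```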

Missing values appear as `NaN` (not a number):

In [None]:
obj4 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
print(obj4, end='\n\n')

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj5 = pd.Series(data, index=states)
print(obj5)

In [None]:
pd.isnull(obj5)

In [None]:
obj5[pd.notnull(obj5)]
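A side effect of `NaN` is that it is a floating-point value, so an integer Series that gains a missing entry is stored as floats. The sketch below shows this with a hypothetical dictionary `counts`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data used only for this illustration.
counts = {'Ohio': 35000, 'Texas': 71000}

complete = pd.Series(counts)
# 'California' has no entry in counts, so its value is NaN.
with_gap = pd.Series(counts, index=['California', 'Ohio', 'Texas'])

print(complete.dtype)              # an integer dtype
print(with_gap.dtype)              # a floating-point dtype, because of the NaN
print(int(with_gap.isna().sum()))  # number of missing values
```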

### DataFrame

A DataFrame is a data structure that represents a **data matrix**, i.e., a collection of columns, one for each variable. For example, with three columns and five rows:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
print(frame)

The column order and the row labels can be specified:

In [None]:
frame2 = pd.DataFrame(data, 
                     columns=['year', 'state', 'pop'],
                     index=['one', 'two', 'three', 'four', 'five'])
print(frame2)

In [None]:
frame2['state']

In [None]:
type(frame2['state'])

In [None]:
frame2.loc['three']

In [None]:
frame2['dept'] = 16.5
print(frame2)

In [None]:
frame2.loc['three', 'dept'] = 4.2
print(frame2)

In [None]:
frame2.iloc[1, 2] = 0.87
print(frame2)

In [None]:
frame2['var'] = np.arange(5)
print(frame2)

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['dept'] = val
print(frame2)

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)
del frame2['eastern']
print(frame2)

## Exercises

Using [pandas-datareader](http://pandas.pydata.org/pandas-docs/stable/remote_data.html), retrieve data about the AAPL, IBM, MSFT and GOOG stocks for the last nine months.

In [None]:
from pandas_datareader import data
help(data.DataReader)

In [None]:
from datetime import datetime
from dateutil.relativedelta import relativedelta

all_data = {}

start_date = datetime.today() - relativedelta(months=9)

indices = ['AAPL', 'IBM', 'MSFT', 'GOOG']

# ...
    
all_data['AAPL'].head()

Create a dataframe that contains the Close column of each of these stocks. Save this dataframe to a comma-separated values (CSV) file (see [pandas.DataFrame.to_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html)).

In [None]:
# ...

Determine, for each stock, the day with the largest absolute difference between the Open and Close values (see [idxmax](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html) and [abs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.abs.html)).

In [None]:
# ...

Translate the column names into another language (see the DataFrame's `columns` attribute).

In [None]:
# ...

Create a new column named *profit* that indicates whether the corresponding day is positive (i.e., the close value is greater than the open value).

In [None]:
# ...

Determine the 10 days with the highest close values for each stock (see [sort_values](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)).

In [None]:
# ...

Use [pandas.DataFrame.plot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) and [pandas.DataFrame.hist](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) to visualize the close and volume attributes of a given stock.

In [None]:
# ...

Create a new dataframe that indicates, for each stock, whether the gain was negative, small (<1), medium (<6), or large. For this purpose, use [pandas.DataFrame.apply](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html).

In [None]:
# ...

### Index Objects

In [None]:
x = pd.Series(range(3), index=['a', 'b', 'c'])
print(x)
print()
print(x.index)

What do you conclude from executing the following statement?

In [None]:
x.index[1] = 'd'

In [None]:
'c' in x.index

In [None]:
'd' in x.index

In [None]:
y = x.reindex(['c', 'b', 'a'])
print(y)

In [None]:
x.reindex(['c', 'b', 'a'])
print(x)

In [None]:
y['a'] = 32
print(x)

In [None]:
x.reindex(['a', 'b', 'c', 'd', 'e'])

Missing values can be handled as follows:

In [None]:
x.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
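Besides a constant `fill_value`, `reindex` also accepts a `method` argument that fills new labels from neighboring entries. The sketch below uses a hypothetical Series `colors` with a sorted integer index (an assumption: the `method` option needs a monotonic index) and forward-fills the gaps.

```python
import pandas as pd

# Hypothetical Series with a sorted integer index, used only for illustration.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Each new label takes the value of the closest preceding existing label.
filled = colors.reindex(range(6), method='ffill')

print(filled.tolist())
```

Forward filling suits ordered data such as time series, whereas a constant `fill_value` suits counts or totals where a missing entry means zero.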

## Exercises

With data frames, the `reindex` method can be applied to rows and columns. Reorder the columns of the following data frame alphabetically with the `reindex` method (see [reindex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)).

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
# ...

Add a row labeled 'b' to the data frame, with a value of 5 for each state.

In [None]:
# ...

Use the `drop` method (see [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)) to delete the last two columns of the dataframe. What is the purpose of the `inplace` parameter?

In [None]:
# ...

Why does the following sum yield a `Series` containing `NaN` values?

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2

Use the DataFrame's `add` method (see [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html)) to add these two data frames so that missing values are replaced by `0` (see [fillna](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)).

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1, '\n', df2)