from IPython.display import HTML
html1 = '<img src="pandas2.jpeg" width="400" height="400" align="center"/>'
HTML(html1)

#### Data structures in pandas
 
There are three main data structures in pandas:

* Series
* DataFrame
* Panel (deprecated in pandas 0.20 and removed in 0.25)


Series and dataframes form the core data model for Pandas in Python. 

The DataFrame represents the entire spreadsheet, whereas the Series is a single column of the DataFrame. 

A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.

Data sets are first read into DataFrames, and then various operations (e.g. group by, aggregation) can be applied easily to their columns. Row labels (the index) and column labels can be specified along with the data.
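
As a minimal sketch of these ideas, the snippet below builds a small DataFrame with explicit row labels, pulls out one column as a Series, and runs a group-by aggregation. The column names (`city`, `sales`), the row labels, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each dict value becomes one column of the DataFrame
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Austin", "Boston"],
        "sales": [10, 20, 30, 40],
    },
    index=["w", "x", "y", "z"],  # row labels supplied along with the data
)

# Selecting a single column returns a Series that shares the row labels
sales = df["sales"]
print(type(sales).__name__)

# Group by one column and aggregate another
totals = df.groupby("city")["sales"].sum()
print(totals)
```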

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Series
A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). It consists of a NumPy array coupled with an array of labels. The axis labels are collectively called the index.

##### Key Points
 
- Homogeneous data
- Size immutable
- Values of data mutable

The Pandas data structure known as Series is very similar to the numpy.ndarray. As a result, many methods and functions that operate on an ndarray will also operate on a Series. A Series may sometimes be referred to as a "vector".
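
The similarity to ndarray can be sketched as follows: a NumPy universal function is applied directly to a Series and the index labels are preserved. The values and labels here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# NumPy ufuncs operate element-wise and keep the index labels
roots = np.sqrt(s)
print(roots)

# Aggregations familiar from ndarray are available as methods too
print(s.sum(), s.max())
```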

#### Series creation
pd.Series( data, index, dtype, copy)

A Series can be created from various inputs, such as:

- Array
- Dict
- Scalar value or constant

In [4]:
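# Create an empty series (dtype given explicitly; without it, recent pandas
# versions create an empty Series with dtype object)
x = pd.Series(dtype='float64')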
print(x)

Series([], dtype: float64)


In [6]:
data = pd.Series([1,2,3,4])
data

0    1
1    2
2    3
3    4
dtype: int64

The output shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.

You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [7]:
data.values

array([1, 2, 3, 4])

In [8]:
data.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

We can also specify the index labels when creating a Series. The customized index values then appear in the output:

In [10]:
data1 = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
data1

a    1
b    2
c    3
d    4
dtype: int64

In [11]:
data1.index

Index(['a', 'b', 'c', 'd'], dtype='object')

You can use labels in the index when selecting values 

In [13]:
print(data1['a'])
print(data1[['c', 'a', 'd']])

1
c    3
a    1
d    4
dtype: int64


In [14]:
# ['a', 'c', 'd'] is interpreted as a list of index labels, even
# though it contains strings instead of integers.
print(data1[['a', 'c', 'd']])

a    1
c    3
d    4
dtype: int64


In [66]:
# Create a Series from ndarray
data = np.array([1,2,3,4])
x = pd.Series(data)
print(x)


0    1
1    2
2    3
3    4
dtype: int64


In [67]:
print(x.index)

RangeIndex(start=0, stop=4, step=1)


Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:
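
A minimal sketch of that dict-like behavior, using made-up labels and values: membership tests look at the index labels, `items()` yields label/value pairs, and `to_dict()` converts back to a plain dict.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Membership tests check the index labels (like dict keys), not the values
print("b" in s)
print(20 in s)

# Iterate over (label, value) pairs, as with dict.items()
for label, value in s.items():
    print(label, value)

# Convert to a plain dict
as_dict = s.to_dict()
print(as_dict)
```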

#### Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index (in insertion order in current pandas; older versions sorted them). If an index is passed, the values in data corresponding to the labels in the index are pulled out.

In [28]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
print(data.keys())
print('\n')
print(data.values())
x = pd.Series(data)
print('\n')
print(x)

print('index:', x.index)
### NOTE: Dictionary keys are used to construct index.

dict_keys(['a', 'b', 'c'])


dict_values([0.0, 1.0, 2.0])


a    0.0
b    1.0
c    2.0
dtype: float64
index: Index(['a', 'b', 'c'], dtype='object')


When you pass only a dict, the index of the resulting Series takes the dict's keys in insertion order (older pandas versions sorted them), as the output below shows:

In [8]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata1 = pd.Series(sdata)
sdata1

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

You can override this by passing an index with the dict keys in the order you want them to appear in the resulting Series:

In [32]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
sdata2 = pd.Series(sdata, index = states)
sdata2

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.
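
Missing values like the one above can be detected with the `isna()` and `notna()` methods. The sketch below uses a small hand-built Series rather than `sdata2`, so the labels and values are assumptions chosen to mirror the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [np.nan, 35000.0, 16000.0],
    index=["California", "Ohio", "Oregon"],
)

missing = s.isna()    # boolean mask, True where the value is missing
present = s.notna()   # the inverse mask
print(missing)
print(s[present])     # keep only the entries that have a value
```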

In [69]:
# Operations are automatically aligned on the index labels and vectorized
# This is similar to a join operation
sdata1+sdata2

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

#### Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [9]:
x = pd.Series(5, index=[0, 1, 2, 3])
#x = pd.Series(5)
print(x)

0    5
1    5
2    5
3    5
dtype: int64


In [35]:
# The number of items in a Series object can be determined by several techniques
x = pd.Series([0,1,2,3,4,5,6,7, np.nan])
x

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    NaN
dtype: float64

In [None]:
# length of a Series
len(x)

In [None]:
# Alternatively, the length can be determined using the .size property
x.size

In [None]:
# The .shape property returns a tuple whose first item is the number of items
x.shape

In [None]:
# The number of values that are not NaN can be found by using
# the .count() method
x.count()

In [10]:
# To determine the unique values
x = pd.Series([0,1,2,1, 4, 3, 5, 3,4,5,6,7, np.nan])
print('total elements:', x.size)
x.duplicated()
x.unique()
len(x.unique())

total elements: 13


9

In [11]:
# count of each unique (non-NaN) item in the Series
x.value_counts()

5.0    2
3.0    2
4.0    2
1.0    2
7.0    1
6.0    1
2.0    1
0.0    1
dtype: int64

In [12]:
x.head()   # first five values

0    0.0
1    1.0
2    2.0
3    1.0
4    4.0
dtype: float64

In [13]:
x.head(n=3)  # x.head(3)  # first three values

0    0.0
1    1.0
2    2.0
dtype: float64

In [14]:
x.tail()     # last five values

8     4.0
9     5.0
10    6.0
11    7.0
12    NaN
dtype: float64

In [15]:
x.tail(n= 3)  # x.tail(3)  # last three values

10    6.0
11    7.0
12    NaN
dtype: float64

In [16]:
# Not a number (NaN)
npd = np.array([1,2,3,4,5])
npd.mean()

3.0

In [17]:
# NumPy array with NaN: the mean of the array is NaN
npd = np.array([1,2,3,4,5, np.nan])
npd.mean()


nan

In [18]:
# pandas ignores NaN by default when computing the mean
pds = pd.Series([1,2,3,4,5, np.nan])
pds.mean()

3.0

In [19]:
# skipna=False propagates NaN instead of skipping it
pds.mean(skipna = False)   # skipna=True (the default) would skip NaN

nan

In [20]:
# skipna=True (the default) skips NaN values
pds.mean(skipna = True)

3.0

#### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [21]:
data = pd.Series(range(3), index=['a', 'b', 'c'])

In [22]:
idx = data.index
idx

Index(['a', 'b', 'c'], dtype='object')

In [23]:
idx[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [24]:
idx[1] = 'd'

TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures:

In [25]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [26]:
data2 = pd.Series([1.5, -2.5, 0], index=labels)
data2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [27]:
data2.index is labels

True

In [28]:
0 in data2.index

True

#### Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [29]:
data = pd.Series([4, 7, 8, 9], index=['d', 'b', 'a', 'c'])
data

d    4
b    7
a    8
c    9
dtype: int64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [30]:
data1 = data.reindex(['a', 'b', 'c', 'd', 'e'])
data1

a    8.0
b    7.0
c    9.0
d    4.0
e    NaN
dtype: float64
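
If NaN is not the desired placeholder, `reindex` accepts `fill_value` and `method` arguments. The sketch below reuses the Series from above for `fill_value`; the second Series (`steps`) is a made-up example, since forward filling needs a monotonically ordered index.

```python
import pandas as pd

data = pd.Series([4, 7, 8, 9], index=["d", "b", "a", "c"])

# fill_value replaces the NaN that a new label would otherwise get
filled = data.reindex(["a", "b", "c", "d", "e"], fill_value=0)
print(filled)

# method="ffill" carries the last valid value forward; it requires a
# monotonic index, so this uses a separate integer-indexed Series
steps = pd.Series(["low", "high"], index=[0, 2])
forward = steps.reindex(range(4), method="ffill")
print(forward)
```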

#### Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the
drop method will return a new object with the indicated value or values deleted from an axis:

In [31]:
data = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
data

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [32]:
new_data = data.drop('b')
new_data

a    0
c    2
d    3
e    4
dtype: int64

In [33]:
data

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [34]:
#Multiple values
new_data1 = data.drop(['b','c'])
new_data1

a    0
d    3
e    4
dtype: int64

Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object *in-place* without returning a new object:

**Be careful with inplace=True, as it destroys any data that is dropped**

In [35]:
data.drop('b', inplace = True)
data

a    0
c    2
d    3
e    4
dtype: int64

#### Retrieve Data Using Label (Index)
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.

In [36]:
x = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

print(x)
print('\n')
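# retrieve the first element by position (.iloc is explicit; plain x[0]
# on a label-indexed Series is deprecated in recent pandas versions)
print("first element:",x.iloc[0])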

print('\n')
print("first element:",x['a'])

a    1
b    2
c    3
d    4
e    5
dtype: int64


first element: 1


first element: 1


In [None]:
# If a label is not in the index, a KeyError is raised
print(x['g'])

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.
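
A short sketch of the difference, using the same kind of Series as above: a label slice includes its end label, while a positional slice with `.iloc` follows the usual Python rule and excludes the end position.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

by_label = x["b":"d"]       # labels b, c and d: the end label is included
by_position = x.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(by_label)
print(by_position)
```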

## Boolean selection

In [115]:
s = pd.Series(np.arange(0,10))
print(s)
indx = s>5
s[indx]

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


6    6
7    7
8    8
9    9
dtype: int64

In [116]:
s[s>5]

6    6
7    7
8    8
9    9
dtype: int64

In [56]:
print(s)

# are all items >=7
print((s >=7).all())

# similarly there is any function.
print((s<=8).any())

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
False
True


In [57]:
# How many values < 7
(s<7).sum()

7

In [61]:
# and operation

((s>5) & (s<8)).any()

True

In [62]:
((s>5) & (s<8)).all()

False

In [63]:
# or operation

((s>5) | (s<8)).any()

True

#### Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.

In [117]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [118]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [119]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
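
When NaN for non-overlapping labels is not wanted, the `add` method with `fill_value` is one option. This sketch reuses `s1` and `s2` from above and treats a label missing from one operand as 0; whether 0 is an appropriate default depends on the data.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# Treat a label that is missing from one operand as 0 instead of NaN
total = s1.add(s2, fill_value=0)
print(total)
```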

#### Sorting 
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [121]:
data = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [122]:
data.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

To sort a Series by its values, use sort_values method:

In [123]:
data.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [124]:
data.sort_values(ascending = False)

c    3
b    2
a    1
d    0
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [126]:
data = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [129]:
data.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
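
The placement of missing values can be changed with the `na_position` argument of `sort_values`. A minimal sketch, using the same data as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# na_position="first" moves the missing values to the front instead
nan_first = data.sort_values(na_position="first")
print(nan_first)
```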

#### Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. 

In [133]:
data = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [134]:
uniq_val = data.unique()
uniq_val

array(['c', 'a', 'd', 'b'], dtype=object)

**value_counts** computes a Series containing value frequencies:

In [135]:
data.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

*isin* performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series

In [141]:
m1 = data.isin(['b','d'])
m1

0    False
1    False
2     True
3    False
4    False
5     True
6     True
7    False
8    False
dtype: bool

In [142]:
data[m1]

2    d
5    b
6    b
dtype: object

Related to *isin* is the *Index.get_indexer* method, which returns, for each value in an array of possibly non-distinct values, its position in another array of distinct values:

In [143]:
ser1 = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [144]:
unique_ser = pd.Series(['c', 'b', 'a'])

In [145]:
pd.Index(unique_ser).get_indexer(ser1)

array([0, 2, 1, 1, 0, 2])

## IO Tools

The **Pandas I/O API** is a set of top level reader functions accessed like pd.read_csv() that generally return a Pandas object.

The two functions for reading text files (or the flat files) are **read_csv()** and **read_table()**. They both use the same parsing code to intelligently convert tabular data into a DataFrame object:

In [None]:
# .csv file
# filepath is a placeholder for the path to your CSV file
df = pd.read_csv(filepath, sep=',', delimiter=None, header='infer',
                 names=None, index_col=None, usecols=None, skiprows=2)
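
For a version that runs without a file on disk, the sketch below feeds `read_csv` from an in-memory string. The CSV content and the column names (`name`, `score`) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "name,score\nann,90\nbob,85\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# index_col picks a column to use as the row labels,
# usecols restricts which columns are parsed
scores = pd.read_csv(
    io.StringIO(csv_text),
    index_col="name",
    usecols=["name", "score"],
)
print(scores.loc["bob", "score"])
```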

In [None]:
## CHECK THIS

# Loading data from the web
# pandas.io.data was removed from pandas; DataReader now lives in the
# separate pandas-datareader package (pip install pandas-datareader).
# Note: the "yahoo" source may no longer be available.

from pandas_datareader.data import DataReader
from datetime import date
from dateutil.relativedelta import relativedelta

# read the last three months of the data for GOOG
end = date.today()
start = end - relativedelta(months=3)
goog = DataReader("GOOG", "yahoo", start, end)
