## Pandas
### The Series Data Structure

The NumPy array has an implicitly defined integer index used to access the values;
the Pandas Series has an explicitly defined index associated with the values.

In [None]:
import pandas as pd
pd.Series?

In [None]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

In [None]:
# specify the index explicitly
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data['b']

In [None]:
data.values

In [None]:
data.index

In [None]:
numbers = [1, 2, 3]
pd.Series(numbers)

In [None]:
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

NaN ("not a number") is a special floating-point value used to represent a result that is undefined or unrepresentable. <br>
NaN is also used as a placeholder for entries in a computation that do not have a value yet.

In [None]:
numbers = [1, 2, None]
pd.Series(numbers)
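
The effect of `None` on the two lists above can be checked by looking at the resulting dtypes. This is a minimal sketch; the variable names are chosen for illustration, and the exact dtype reported for the string Series may differ between pandas versions.

```python
import pandas as pd

# Strings plus None: the Series keeps a non-numeric dtype,
# and the gap is still reported as missing.
animals = pd.Series(['Tiger', 'Bear', None])

# Integers plus None: pandas upcasts to float64 so that
# the gap can be stored as NaN.
numbers = pd.Series([1, 2, None])

print(animals.dtype, numbers.dtype)
print(animals.isna().tolist())
print(numbers.isna().tolist())
```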

Keep in mind that in Python (and NumPy), NaN never compares equal to anything, including itself, whereas None does compare equal to None.

In [None]:
import numpy as np
np.nan == None

In [None]:
np.nan == np.nan

In [None]:
np.isnan(np.nan)

In [None]:
None == None
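
Because equality tests cannot find NaN, missing values in a Series are usually detected with `isna()` instead. A small sketch, using a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Equality never matches NaN, so this mask is False everywhere.
eq_mask = (s == np.nan).tolist()

# isna() is the reliable test for missing entries.
na_mask = s.isna().tolist()

# The top-level pd.isna() also treats None as missing.
none_is_missing = pd.isna(None)

print(eq_mask, na_mask, none_is_missing)
```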

### Think of a Pandas Series like a specialized Python dictionary:

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values; <br>
a Series is a structure that maps typed keys to a set of typed values. 

**type**: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
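
As a rough illustration of what the type information enables, the sketch below doubles every value in a small, hypothetical mapping, first with a dictionary comprehension and then with a single Series operation. It only shows that the results agree; it makes no timing claim.

```python
import pandas as pd

# Hypothetical mapping used only for this comparison.
counts = {'a': 10, 'b': 20, 'c': 30}

# With a plain dictionary, each value is handled in a Python-level loop.
doubled_dict = {key: value * 2 for key, value in counts.items()}

# With a Series, all values share one dtype, so the operation
# is applied to the whole underlying array at once.
counts_series = pd.Series(counts)
doubled_series = counts_series * 2

print(counts_series.dtype)
print(doubled_series.to_dict() == doubled_dict)
```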

In [None]:
## constructing a Series object directly from 
## a Python dictionary
population_dict = {'California': 38332521,
                    'Texas': 26448193,
                    'New York': 19651127,
                    'Florida': 19552860,
                    'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
#the index is drawn from the keys
population.index

In [None]:
#Unlike a dictionary, the Series also supports 
#array-style operations such as slicing
population['California':'Florida']

The general form of the constructor is pd.Series(data, index=index), where index is optional and data can be one of several kinds of object:

In [None]:
# list or NumPy array; index defaults to an integer sequence:
pd.Series([2, 4, 6])

In [None]:
# data can be a scalar, which is repeated to fill 
# specified index
pd.Series(121, index=[100, 200, 300])

In [None]:
# with a dictionary, an explicit index selects which keys to keep
# and in what order; labels not in the dictionary (here 6) get NaN:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2, 6])

### Querying a Series

In [None]:
#dictionary-like expressions
data['b']

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
data.index

In [None]:
data.values

In [None]:
list(data.items())

In [None]:
#modifiable as well
data['e'] = 1.25
data['a'] = 1
data

In [None]:
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2]

Note: <br>
When slicing with an explicit index (e.g., data['a':'c']), the final index is included in the slice; <br>
when slicing with an implicit index (e.g., data[0:2]), the final index is **NOT** included in the slice.
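
The two slicing rules in the note can be compared side by side. This sketch rebuilds a four-element Series and uses `iloc` for the positional slice, which is the explicit spelling of slicing by implicit index.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = s['a':'c']

# Position-based slice: the end position 2 is excluded.
by_position = s.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```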

In [None]:
# selection
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing: select specific ones
data[['a', 'e']]

Be careful if your Series has an explicit integer index: <br>
An indexing operation such as data[1] will use the explicit indices; <br>
a slicing operation like data[1:3] will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing: positions 1 and 2,
# i.e. the elements labelled 3 and 5
data[1:3]

**To avoid confusion, use the special indexers loc and iloc:**

In [None]:
#loc attribute always references the explicit
print(data.loc[1],'\n')
print(data.loc[1:3])

In [None]:
#iloc attribute always references the implicit
print(data.iloc[1],'\n')
print(data.iloc[1:3])

### The DataFrame Data Structure

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. 

A two-dimensional array can be viewed as an ordered sequence of aligned one-dimensional columns; a DataFrame can be viewed as a
sequence of aligned Series objects sharing the same index.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

population_dict = {'California': 38332521,
                    'Texas': 26448193,
                    'New York': 19651127,
                    'Florida': 19552860,
                    'Illinois': 12882135}
population = pd.Series(population_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
states

#change orders

In [None]:
states.index

In [None]:
# additional columns attribute
states.columns

In [None]:
#specialized dictionary
states['area']

In [None]:
##notice the difference for indexing
import numpy as np
a = np.random.randint(0, 10, (2,3))
print(a)
a[0]
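
To make the contrast concrete, the sketch below wraps a small fixed array in a DataFrame (the column names `x`, `y`, `z` are illustrative) and compares what a single index returns in each case.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])  # illustrative column names

# A single index on a 2-D array returns a row.
first_row_of_array = arr[0]

# A single index on a DataFrame returns a column.
column_x = df['x']

# To get a row from a DataFrame, use iloc (or loc);
# the result is a Series indexed by column name.
first_row_of_frame = df.iloc[0]

print(first_row_of_array)
print(column_x.tolist())
print(first_row_of_frame.tolist())
```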

### Construct DataFrame objects

In [None]:
# From a single Series object
pd.DataFrame(population, columns=['population'])

In [None]:
pd.DataFrame(population)

In [None]:
# From a list of dictionaries 
# list comprehension 
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data

In [None]:
pd.DataFrame(data)

In [None]:
#if some keys in the dictionary are missing
#Pandas will fill them in with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
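
One way to confirm where the NaN values landed is to count the missing entries per column. A minimal sketch on the same two records:

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

# 'b' appears in every record, so it has no gaps;
# 'a' and 'c' each have one gap and are upcast to float to hold NaN.
missing_per_column = df.isna().sum().to_dict()

print(missing_per_column)
print(df.dtypes)
```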

In [None]:
#From a dictionary of Series objects
pd.DataFrame({'population': population,'area': area})

In [None]:
{'population': population,'area': area}

In [None]:
#From a two-dimensional NumPy array
#with specified column and index names
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

In [None]:
# if omitted, an integer index will be used 
pd.DataFrame(np.random.rand(3, 2))

### Selection in DataFrame

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                    'New York': 141297, 'Florida': 170312,
                    'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                    'New York': 19651127, 'Florida': 19552860,
                    'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

In [None]:
# check the first few rows
data.head(2)

In [None]:
#dictionary-style indexing of the column name
data['area']

In [None]:
#attribute-style access with column names that are strings:
data.area

In [None]:
# avoid data.pop: pop() is a DataFrame method, so attribute access
# returns the method rather than the 'pop' column (this prints False)
data.pop is data['pop']
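
The name clash can be seen directly on a toy frame (the values below are made up): attribute access resolves to the method, while item access returns the column.

```python
import pandas as pd

df = pd.DataFrame({'area': [1.0, 2.0], 'pop': [10, 20]})  # toy values

# Attribute access finds the DataFrame.pop method, not the column.
attribute_is_method = callable(df.pop)

# Item access always returns the column.
column = df['pop']

print(attribute_is_method)
print(column.tolist())
```

For this reason, dictionary-style indexing is the safer habit whenever a column name might collide with an existing attribute or method.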

In [None]:
#DataFrame allows modification/addition
data['density'] = data['pop'] / data['area']
data

In [None]:
## the underlying data as a two-dimensional NumPy array
data.values

In [None]:
##transpose
data.T

In [None]:
##difference between array and DataFrame
# index accesses a row for DataFrame.values
data.values[0]

In [None]:
# "index" to a DataFrame accesses a column:
data['area']

In [None]:
#loc, iloc again
data.loc[:'Florida', :'pop']

In [None]:
#implicit iloc
data.iloc[:3, :2]

In [None]:
#selection using > < etc. (row); fancy indexing (column)
data['density'] = data['pop'] / data['area']
data.loc[data.density > 100, ['pop', 'density']]

In [None]:
#Note for []:
#indexing refers to columns, slicing refers to rows
print(data['Florida':'Illinois'],'\n')
print(data['area'])

In [None]:
# Note for []:
# refer to rows by implicit number rather than by index
data[1:3]

In [None]:
# Note for []:
# refer to rows for > < etc.
data[data.density > 100]

In [None]:
#mutable:
data.iloc[0, 2] = 90
data

In [None]:
copy_df = data.drop('Florida')
copy_df

In [None]:
#drop NA
copy_df.iloc[1,1] = None
copy_df

In [None]:
copy_df.dropna()

In [None]:
del copy_df['pop']
copy_df

In [None]:
# assigning None does not remove the column; it only blanks its values
copy_df['density'] = None
copy_df
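
The removal operations above differ in whether they modify the original object. The sketch below uses a small made-up frame to show that `drop` and `dropna` return new objects and leave the original untouched; `drop(columns=...)` is shown as a non-mutating alternative to `del`.

```python
import pandas as pd

df = pd.DataFrame({'area': [1.0, 2.0, 3.0], 'pop': [10.0, None, 30.0]},
                  index=['A', 'B', 'C'])  # toy values

# drop() returns a new frame without row 'B'; df keeps all three rows.
without_b = df.drop('B')

# dropna() returns a new frame without rows that contain a missing value.
complete_rows = df.dropna()

# drop(columns=...) returns a new frame without the column,
# whereas `del df['pop']` would change df itself.
without_pop = df.drop(columns='pop')

print(list(df.index), list(df.columns))
print(list(without_b.index), list(complete_rows.index), list(without_pop.columns))
```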

### Index alignment

#### Series
When two Series are combined in an arithmetic operation, the result contains the union of the indices of the two inputs; <br>
any label that is missing from one or the other is marked with NaN, or "Not a Number".

In [None]:
import pandas as pd
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                    'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                    'New York': 19651127}, name='population')
population / area
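
To see which labels were filled with NaN by the alignment, the result can be split into missing and non-missing entries. A small sketch on the same two Series (the names `density`, `missing` and `present` are illustrative):

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

density = population / area

# Labels found in only one of the inputs end up as NaN.
missing = sorted(density[density.isna()].index)

# Labels found in both inputs carry a real value.
present = sorted(density.dropna().index)

print(missing)
print(present)
```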

In [None]:
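# check on the indices: the result is built on the union of the two
# (older pandas also accepted area.index | population.index, but the
# operator form was deprecated and is not a set union in recent versions)
area.index.union(population.index)

In [None]:
# labels present in both Series (the older operator form was &)
area.index.intersection(population.index)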

In [None]:
# another example:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

If using NaN values is not the desired behavior, we can modify the fill value using appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:

In [None]:
A.add(B, fill_value=0)
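
The same `fill_value` argument is available on the other arithmetic methods, such as `mul`. The sketch below compares the plain operator with the method forms on the same `A` and `B`; choosing 1 as the fill value for multiplication is an illustrative choice, not something the text above prescribes.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Plain operator: labels 0 and 3 exist in only one input, so they become NaN.
plain = A + B

# Missing entries are treated as 0 before adding.
filled_sum = A.add(B, fill_value=0)

# Missing entries are treated as 1 before multiplying.
filled_product = A.mul(B, fill_value=1)

print(plain.tolist())
print(filled_sum.tolist())
print(filled_product.tolist())
```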