# Exploring Pandas

## Series data structure 

- A `Series` is a cross between a list and a dictionary.
    - Items are stored in order, and each item has a label that can be used to retrieve it.
    - The data type is inferred automatically when data is stored in a `Series`.
    - The data column has a name, which can be retrieved using the `.name` attribute.
    - To create a series, we can pass data, an index and a name.
        - If we don't pass an index or a name, Pandas creates a default integer index starting at 0 and sets the name to `None`.
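
As a minimal sketch of the points above, the snippet below builds one `Series` with the defaults and one with an explicit index and name. The values, the labels and the name `animal_by_country` are made up for illustration.

```python
import pandas as pd

# No index or name given: pandas uses 0, 1, 2, ... and the name is None.
default = pd.Series(['Tiger', 'Bear', 'Moose'])
print(list(default.index))  # [0, 1, 2]
print(default.name)         # None

# Explicit index labels and a name (illustrative values).
named = pd.Series(['Tiger', 'Bear', 'Moose'],
                  index=['India', 'America', 'Canada'],
                  name='animal_by_country')
print(named.name)           # animal_by_country
print(named.loc['Canada'])  # Moose
```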

In [24]:
import pandas as pd
pd.Series?

In [25]:
animals = ['Tiger', 'Bear', 'Dog', 'Cat']
pd.Series(animals)

0    Tiger
1     Bear
2      Dog
3      Cat
dtype: object

#### Internal storage
- Pandas stores the series values in a typed array from the numpy library, which makes processing faster than with plain Python lists.
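
One way to see the typed array is to ask a `Series` for its numpy representation. This is only a small sketch; the variable names `numbers` and `backing` are illustrative.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() exposes the values as a numpy array with a single dtype.
backing = numbers.to_numpy()
print(isinstance(backing, np.ndarray))  # True
print(numbers.dtype)                    # int64
```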

#### Missing data
- Python provides the `None` type to represent missing data.
- Because a `Series` is typed, Pandas performs some type conversion when data is missing:
    - For a list of strings with one missing element (indicated by `None`), Pandas keeps it as `None` and uses the type `object` for the underlying array.
    - For a list of integers or floating point numbers containing a `None` element, Pandas converts the `None` to a special floating point value, `NaN` (not a number).
        - Equality tests do not work on `NaN`; `NaN` compared to itself always gives `False`.
        - Special functions are needed to test for the presence of `NaN` (e.g., numpy's `isnan`).
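
The behaviour of `NaN` under comparison can be checked directly. The sketch below uses `np.isnan` for a single value and, as one possible alternative, the `isna` method for a whole `Series`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)      # False

# Use a dedicated test instead of ==.
print(np.isnan(np.nan))      # True

# isna() works element-wise on a Series; the None below is stored as NaN.
numbers = pd.Series([1, 2, None])
missing = numbers.isna()
print(missing.tolist())      # [False, False, True]
```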

In [3]:
# Examples:
animals = ['Tiger', 'Cat', None, 'Fox']
pd.Series(animals)

0    Tiger
1      Cat
2     None
3      Fox
dtype: object

In [4]:
numbers = [1, 2, 3, None]
pd.Series(numbers)

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

### Creating Series data

- We can create a Pandas `Series` by passing it any array-like data (in the previous examples, we passed in a `list` of elements).
- Tuples and dictionaries work as well.
    - Data often comes with labels, and we can use these labels to manipulate it.
- When we create a `Series` from `dictionary` data, the index is automatically assigned to keys of the dictionary.
    - By default, if no index info is given, incrementing integers are used as indices. 

In [5]:
sports = {'Archery' : 'Bhutan', 'Gold' : 'Scotland',
         'Sumo' : 'Japan', 'Taekwondo' : 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Gold            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

- Once the `Series` is created, we can get the index object using the `index` attribute

In [6]:
s.index

Index(['Archery', 'Gold', 'Sumo', 'Taekwondo'], dtype='object')

- We can also pass the index explicitly to the `Series` constructor.
    - The index is passed as a list of values.
- If the index values provided don't match the dictionary keys, Pandas uses exactly the index values provided by the programmer instead of building the index from the dictionary keys.
    - Dictionary keys that are not in the index are ignored, and any index value that is not a dictionary key gets a `None` or `NaN` value.

In [7]:
a = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])

In [8]:
a

India      Tiger
America     Bear
Canada     Moose
dtype: object

In [9]:
pd.Series(sports, index = ['Archery', 'Polo', 'Sumo'])

Archery    Bhutan
Polo          NaN
Sumo        Japan
dtype: object

- Notice how `Polo` was assigned `NaN` and `Taekwondo` was dropped, even though it is a key in the `sports` dictionary.

###  Querying a `Series`

- A Pandas series can be queried by either the index label or the index position.
    - To query by numeric position, use the `iloc` attribute.
    - To query by index label, use the `loc` attribute.
    - If you pass an integer directly to the indexing operator (`s[2]`) and the index is not integer-based, older Pandas versions treat it like `iloc`. Recent versions deprecate this positional fallback.
    - Similarly, if you pass a label directly (`s['Sumo']`), Pandas treats it like `loc`.
    - **It is safer to use `iloc` or `loc` explicitly than a bare index number or label.**
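
The difference matters most when the labels are themselves integers. The sketch below uses a made-up series `codes` whose integer labels do not match the positions, so label lookup and position lookup give different answers for the same number.

```python
import pandas as pd

# Integer labels that do not match positions (illustrative data).
codes = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = codes.loc[10]      # label 10
by_position = codes.iloc[0]   # position 0
last = codes.iloc[2]          # position 2
print(by_label, by_position, last)  # a a c

# There is no label 0, so a label lookup for 0 fails.
try:
    codes.loc[0]
    found = True
except KeyError:
    found = False
print(found)  # False
```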

In [10]:
s

Archery           Bhutan
Gold            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [11]:
s.iloc[2]

'Japan'

In [12]:
s.loc['Archery']

'Bhutan'

In [13]:
s[2]

'Japan'

In [14]:
s['Taekwondo']

'South Korea'

In [15]:
# Task: to iterate over all items in the series, add and get total
p = pd.Series([100, 120.0, 101.0, 35.50, 20])

In [16]:
total = 0
for item in p:
    total += item
print(total)

376.5


#### Vectorization
- Pandas and the underlying numpy library support [vectorization](https://en.wikipedia.org/wiki/Array_programming)
- Vectorization works with most functions in numpy

#### Broadcasting
- Applies an operation to every value in the series, changing the series
- Supported by numpy and pandas

    

In [17]:
import numpy as np

total = np.sum(p)
print(total)

376.5


#### Using Jupyter to profile

- Starting a cell with two `%` symbols invokes a cell "magic" function that applies to the whole cell.
- One example is `%%timeit`, which runs the cell repeatedly and reports the time per loop, as shown in the cells below.

In [18]:
s

Archery           Bhutan
Gold            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [19]:
tmp = pd.Series(np.random.randint(0,1000,10000))
tmp.head()

0    800
1    497
2    307
3    734
4    301
dtype: int32

In [20]:
len(tmp)

10000

In [21]:
%%timeit -n 100 
summary = 0 
for item in tmp:
    summary += item

100 loops, best of 3: 1.14 ms per loop


In [23]:
%%timeit -n 100
summary = np.sum(tmp)

100 loops, best of 3: 94.3 µs per loop
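
Cell magics only exist inside IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard-library `timeit` module. The names `data`, `loop_sum` and `vector_sum` are illustrative, and the timings depend on the machine, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum():
    # Add the items one at a time in a Python loop.
    total = 0
    for item in data:
        total += item
    return total

def vector_sum():
    # Let numpy add all items in one vectorized call.
    return np.sum(data)

# Each function is called 100 times, mirroring the -n 100 option.
loop_seconds = timeit.timeit(loop_sum, number=100)
vector_seconds = timeit.timeit(vector_sum, number=100)
print(f"loop:   {loop_seconds / 100 * 1e3:.3f} ms per call")
print(f"vector: {vector_seconds / 100 * 1e3:.3f} ms per call")
```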


In [26]:
tmp.head()

0    800
1    497
2    307
3    734
4    301
dtype: int32

In [27]:
tmp += 2
tmp.head()

0    802
1    499
2    309
3    736
4    303
dtype: int32
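
The `tmp += 2` cell above is broadcasting: the scalar is applied to every element without an explicit loop. As a rough sketch of what it replaces, the snippet below performs the same update with a loop over the labels and checks that both approaches agree. The series `values` is illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(5))   # 0, 1, 2, 3, 4

# Loop version: update one element at a time by label.
looped = values.copy()
for label in looped.index:
    looped.loc[label] = looped.loc[label] + 2

# Broadcast version: the scalar 2 is applied to every element at once.
broadcast = values + 2

print(broadcast.tolist())                     # [2, 3, 4, 5, 6]
print(looped.tolist() == broadcast.tolist())  # True
```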

### Manipulating the `Series` data

- The `.loc` attribute can be used not only to query data but also to add new data.
- If the label passed as index doesn't exist, a new entry is added.
- Indices can have different types; Pandas changes the type of the underlying numpy array accordingly.

- If the index values are not unique, a query by label returns another `Series` object.
    - This is similar to the relational database world, where the result of a query on a table is another table.

In [28]:
s = pd.Series([1,2,3])
s.loc['MyFavAnimal'] = 'Tiger'
s

0                  1
1                  2
2                  3
MyFavAnimal    Tiger
dtype: object

In [29]:
sports

{'Archery': 'Bhutan',
 'Gold': 'Scotland',
 'Sumo': 'Japan',
 'Taekwondo': 'South Korea'}

In [31]:
cricket_loving = pd.Series(['India', 'England', 'Australia', 'Sri Lanka'], index = ['Cricket', 'Cricket','Cricket', 'Cricket'])
sports_loving = pd.Series(sports)
# Series.append was removed in pandas 2.0; pd.concat is the replacement
all_sports_loving = pd.concat([sports_loving, cricket_loving])
all_sports_loving

Archery           Bhutan
Gold            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket            India
Cricket          England
Cricket        Australia
Cricket        Sri Lanka
dtype: object

In [32]:
c = all_sports_loving.loc['Cricket']

In [33]:
c

Cricket        India
Cricket      England
Cricket    Australia
Cricket    Sri Lanka
dtype: object

In [34]:
type(all_sports_loving)

pandas.core.series.Series

In [35]:
type(c)

pandas.core.series.Series

In [36]:
a = all_sports_loving.loc['Archery']
a

'Bhutan'

In [37]:
type(a)

str
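
The cells above can be condensed into one small check. The sketch below, using shortened illustrative data, confirms that combining two series produces a new object without changing the inputs, and that a label lookup returns a `Series` for a repeated label but a plain value for a unique one.

```python
import pandas as pd

first = pd.Series(['Bhutan', 'Japan'], index=['Archery', 'Sumo'])
second = pd.Series(['India', 'England'], index=['Cricket', 'Cricket'])

# pd.concat returns a new Series; first and second are left as they were.
combined = pd.concat([first, second])

print(len(first))                              # 2
print(len(combined))                           # 4
print(type(combined.loc['Cricket']).__name__)  # Series
print(type(combined.loc['Sumo']).__name__)     # str
```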