---|||
# Pandas Introduction

Pandas is a Python library used for data manipulation, cleaning, and processing.

In [None]:
import pandas as pd

In [None]:
from pandas import Series, DataFrame

#### The core data structures in pandas are Series (column-like) and DataFrame (table-like)
---|||

**Series**: a one-dimensional array-like object containing
- a sequence of values, **P**, and
- an associated array of data labels, called its **index**

> By default, the index ranges from 0 to len(P) - 1


In [None]:
# notice the associated indices
pd.Series([5, 10, 111, 4])

In [None]:
# create a series with custom index labels, using the index parameter

s = pd.Series([10, 20, 30, 40], index=['i', 'j', 'k', 'l'])
s

In [None]:
# obtain the index of a given series using the index attribute

s.index
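
As a small sketch of the default behaviour described above (the variable names here are illustrative and not part of the original notebook), the index of a Series created without labels runs from 0 to the number of values minus one:

```python
import pandas as pd

values = [5, 10, 111, 4]
default_indexed = pd.Series(values)

# without an explicit index, pandas builds a RangeIndex: 0, 1, ..., len(values) - 1
index_labels = list(default_indexed.index)
print(index_labels)

# to_numpy() exposes the underlying values without their labels
raw_values = default_indexed.to_numpy()
print(raw_values)
```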


---|||
##### The index of a Series can be used to select the data that corresponds to each label

The index labels act as keys into the data sequence.

In [None]:
s['k'], s['i']

##### NumPy-like operations can be used to manipulate a Series object

The link between each index label and its value is preserved after the operation is performed.

In [None]:
import numpy as np

In [None]:
s[s > 25]

In [None]:
s * 2

In [None]:
np.exp(s)
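
The cells above can be checked with a short sketch like the following, which rebuilds the same series and inspects the labels of each result. `np.sqrt` is used here only as one more example of a NumPy function.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['i', 'j', 'k', 'l'])

doubled = s * 2        # element-wise arithmetic keeps every label
filtered = s[s > 25]   # boolean filtering keeps only the labels that pass the test
rooted = np.sqrt(s)    # NumPy functions return a Series, not a bare array

print(list(doubled.index), list(filtered.index), list(rooted.index))
```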

In [None]:
# check whether the series contains a given index label

'j' in s

### A good mental model is to think of a Series object as a dictionary of keys and values

- the index labels are the keys
- the data are the values


In [None]:
# some states and their corresponding capitals in Nigeria
sdata = {'lagos': 'ikeja', 'ogun': 'abeokuta', 'adamawa': 'yola'}

# create a series object from the data
s = pd.Series(sdata)
s

In [None]:
# get the index
s.index

In [None]:
# check whether an index label is contained in the series

'lagos' in s, 'fct' in s
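
The dictionary analogy goes a little further than membership tests. The sketch below uses a smaller, illustrative series (the name `capitals` is not from the original notebook) to show `get` and `to_dict`, and that `in` looks at the labels rather than the values:

```python
import pandas as pd

capitals = pd.Series({'lagos': 'ikeja', 'ogun': 'abeokuta'})

# get() mirrors dict.get(): a default is returned for a missing label
known = capitals.get('lagos')
unknown = capitals.get('fct', 'not recorded')

# `in` checks the index labels, not the stored values
label_found = 'ikeja' in capitals
value_found = 'ikeja' in capitals.values

# to_dict() converts the series back into a plain dictionary
as_dict = capitals.to_dict()
print(known, unknown, label_found, value_found, as_dict)
```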


---|||

#### Alter the indices of a Series in-place

In [None]:
index = ['l', 'o', 'a']

# replace the existing index labels with the new list
s.index = index

In [None]:
s
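
One thing to keep in mind when assigning to the index: the new list needs one label per value. A minimal sketch, using an illustrative series rather than the one above:

```python
import pandas as pd

capitals = pd.Series({'lagos': 'ikeja', 'ogun': 'abeokuta', 'oyo': 'ibadan'})

# three values and three new labels: the assignment succeeds
capitals.index = ['l', 'og', 'oy']
labels_after_rename = list(capitals.index)

# three values but only two labels: pandas raises a ValueError
try:
    capitals.index = ['l', 'og']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(labels_after_rename, mismatch_rejected)
```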


---|||
#### Creating a Series from a dictionary uses the dictionary keys as the index

In pandas 0.23 and later (on Python 3.6+), the keys keep the order in which they appear in the dictionary; older versions sorted them. Either way, the order can be overridden by passing the same keys, in whatever order is wanted, to the index keyword.

> a key passed to index that doesn't exist in the dictionary gets a NaN value (a.k.a. missing data)

In [None]:
# adding more indices than data results in NaN values

# some states and their corresponding capitals in Nigeria
sdata = {'lagos': 'ikeja', 'ogun': 'abeokuta', 'adamawa': 'yola'}

# using the keys to order the values
s = pd.Series(sdata, index=['lagos', 'adamawa', 'ogun'])
s

In [None]:
# using a key that is not in the dictionary => 'fct'
s = pd.Series(sdata, index=['ogun', 'fct', 'adamawa', 'lagos'])

s
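
Here is a compact sketch of how the index keyword interacts with a dictionary, assuming a recent pandas version where the dictionary's own key order is kept. The dictionary is illustrative; note that a key left out of the index (`'ogun'` here) is dropped from the result.

```python
import pandas as pd

capitals = {'oyo': 'ibadan', 'lagos': 'ikeja', 'ogun': 'abeokuta'}

# without an index, the labels follow the order of the dictionary keys
in_dict_order = pd.Series(capitals)

# with an index, the labels follow the index; 'fct' has no value, so it is NaN
reordered = pd.Series(capitals, index=['lagos', 'fct', 'oyo'])

print(in_dict_order)
print(reordered)
```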


---|||
#### The **isnull** and **notnull** methods detect missing data

In [None]:
# flag the index labels whose data is missing
s.isnull()

In [None]:
# flag the index labels whose data is present

s.notnull()

In [None]:
# the function form and the instance-method form are equivalent
(pd.notnull(s) == s.notnull(), pd.isnull(s) == s.isnull())
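
The boolean results of these methods are mostly useful as masks. A minimal sketch, using an illustrative series with one missing value:

```python
import numpy as np
import pandas as pd

capitals = pd.Series({'ogun': 'abeokuta', 'fct': np.nan, 'lagos': 'ikeja'})

missing_mask = capitals.isnull()          # True where the value is missing
present = capitals[capitals.notnull()]    # keep only the labels that have data
missing_count = int(missing_mask.sum())   # each True counts as 1

print(present)
print(missing_count)
```

Filtering with `notnull()` this way gives the same result as calling `capitals.dropna()`.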


---|||
##### Arithmetic operations on Series with overlapping keys

When an arithmetic operation is performed on two Series, their keys are used to align the data first, and the operation is then applied element-wise to the aligned values.

In [None]:
adata = {'i': 11, 'j': 33, 'k': 23}
bdata = {'a': 100, 'j': 23, 'b': 73, 'k': 1000}

s1 = pd.Series(adata)
s2 = pd.Series(bdata)

In [None]:
s1

In [None]:
s2

In [None]:
# notice that not all the keys are the same

s1 + s2

Notice that keys present in only one of the two series produce NaN in the result.
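
If NaN is not what you want for the unmatched keys, the `add` method accepts a `fill_value` that stands in for the missing side. The sketch below reuses the same data under the illustrative names `left` and `right`:

```python
import pandas as pd

left = pd.Series({'i': 11, 'j': 33, 'k': 23})
right = pd.Series({'a': 100, 'j': 23, 'b': 73, 'k': 1000})

# plain addition: NaN wherever a key exists on one side only
aligned_sum = left + right

# add() with fill_value=0 treats the missing side as 0 instead
filled_sum = left.add(right, fill_value=0)

print(aligned_sum)
print(filled_sum)
```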

---
---|||
#### Naming a Series

A series object can be named either by setting its name attribute or by passing the name keyword argument when it is created.

In [None]:
sdata

In [None]:
s = pd.Series(sdata, name='States and Capital in Nigeria')

In [None]:
s

In [None]:
# get the name using the name attribute
s.name
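
A short sketch of where the name shows up in practice. The series and names below are illustrative; `rename` and `to_frame` are regular Series methods.

```python
import pandas as pd

capitals = pd.Series({'lagos': 'ikeja', 'ogun': 'abeokuta'}, name='capital')

# the index can carry its own name, independent of the series name
capitals.index.name = 'state'

# rename() with a plain string returns a copy under the new name
renamed = capitals.rename('state_capital')

# the series name becomes the column label when converting to a dataframe
frame = capitals.to_frame()

print(capitals.name, renamed.name, list(frame.columns))
```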


---|||
# DataFrame

The other core pandas object is the DataFrame, which represents tabular data (like an Excel spreadsheet). It contains

- an ordered collection of columns; each column can hold a different data type
- a row index and a column index

A good mental model is to think of a DataFrame as a dictionary containing

- **a key**: representing the column index
- **a value**: a Series object such that
  - the keys of the series represent the row index
  - the values of the series are the data, keyed by row and column index
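
The mental model above can be taken literally: a dataframe can be built from a dictionary whose values are Series. This is only a sketch with illustrative data; note how the row index ends up as the union of the labels of both series.

```python
import pandas as pd

capital = pd.Series({'lagos': 'ikeja', 'oyo': 'ibadan'})
region = pd.Series({'lagos': 'SW', 'oyo': 'SW', 'fct': 'NC'})

# each dictionary key becomes a column label;
# rows missing from one of the series are filled with NaN
frame = pd.DataFrame({'capital': capital, 'region': region})

# selecting a column hands back a Series keyed by the row index
capital_column = frame['capital']

print(frame)
```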

In [None]:
# creating a dataframe

data = {'states': ['lagos', 'fct', 'ondo', 'oyo', 'plateau'], 'capital': ['ikeja', 'abuja', 'akure', 'ibadan', 'jos']}

# the row index defaults to integers, but we supply our own labels here
df = pd.DataFrame(data, index=['i', 'j', 'k', 'l', 'm'])

df

In [None]:
# create by passing a list of row tuples
df = pd.DataFrame(
  [('lagos', 'ikeja', 'SW'),
   ('fct', 'abuja', 'NC'), 
   ('ondo', 'akure', 'SW'),
   ('oyo', 'ibadan', 'SW'),
   ('plateau', 'jos', 'SS')],
  columns=['states', 'capitals', 'geographic_region'], index=['a', 'b', 'c', 'd', 'e']
)

In [None]:
# we can specify the column index in the order we want the columns to appear

# notice the order of the column indices
data = {'states': ['lagos', 'fct', 'ondo', 'oyo', 'plateau'], 'capital': ['ikeja', 'abuja', 'akure', 'ibadan', 'jos'], 'geographic_region': ['SW', 'NC', 'SW', 'SW', 'SS']}

# notice that the population column is not contained in the data, hence the missing values
df = pd.DataFrame(data, columns=['geographic_region', 'states', 'capital', 'population'])
df


---|||
#### **head** and **tail**: selecting the first or last few rows

We can select a few rows from the beginning or the end of the dataframe using the **head** and **tail** methods.

By default they return the first or last five rows of the dataframe; however, we can pass in a number to indicate how many rows should be returned.

In [None]:
# select the first 3 rows of the dataframe
df.head(3)

In [None]:
df.tail(2)

In [None]:
# query the columns using the columns attribute
df.columns


---|||

# Indexing a DataFrame

Remember that the keys of the dataframe are its column indices.

When the dataframe object is indexed by a column name, a series containing the row index and the corresponding data is returned.

Indexing can be carried out in two ways:
- **dictionary indexing**: using the column name as a key
- **attribute indexing**: using the column name as an attribute

> the dictionary indexing format is more general, as it also works for column names that contain a *space*, i.e. names that are not valid Python variable names
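
The sketch below illustrates why dictionary indexing is the safer default. The dataframe is illustrative, and the `count` column is a hypothetical addition (not from the original data) chosen because it shares its name with the existing `DataFrame.count` method.

```python
import pandas as pd

frame = pd.DataFrame({
    'states': ['lagos', 'fct'],
    'geographic region': ['SW', 'NC'],  # label contains a space
    'count': [20, 1],                   # hypothetical label that clashes with a method name
})

by_key = frame['geographic region']   # dictionary indexing works for any label
by_attribute = frame.states           # attribute indexing needs a valid identifier

# frame.count is the DataFrame.count method, not the column
shadowed = frame.count
count_column = frame['count']

print(by_key.tolist(), by_attribute.tolist(), callable(shadowed))
```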

In [None]:
# get all the data in the capital column
df['capital']

In [None]:
# using attribute index
df.geographic_region

In [None]:
# get the population column

df.population

In [None]:
# assign values to the population column

df.population = [20, 10, 3, 4, 1]

In [None]:
# alternatively, generate the values with NumPy: 6, 5, 4, 3, 2
df.population = np.arange(6, 1, -1)

In [None]:
df


---|||

#### Adding a new series to an existing dataframe object

Add a new column by assigning a series to a new column name in the dataframe

- the series is aligned to the dataframe by index label, not by position, so its length does not have to match the number of rows
- index labels that match those in the dataframe are aligned
- dataframe rows with no matching label in the series are set to NaN
- series labels that do not exist in the dataframe are ignored


> Note, if the index length, those of the 

In [None]:
df.index = ['one', 'two', 'three', 'four', 'five']
df

In [None]:
# add a series


# the series is aligned to the dataframe by index label
# we are not accounting for the 'five' row index, so that row will be NaN
# there is no 'not-good' index in the dataframe, so that value is dropped
s = pd.Series(['no', 'yes', 'yes', 'yes', 'bad'], index=['two', 'one', 'four', 'three', 'not-good'])

# add a new column to the dataframe
# note this can only be created using dictionary key indexing
# using attribute indexing will not work
df['Visited'] = np.nan
df

In [None]:
# add the series to the visited column

# notice that the 'not-good' label has no matching row, and the 'five' row is NaN
df.Visited = s
df
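
To make the difference between label alignment and positional assignment concrete, here is a minimal sketch on a smaller, illustrative dataframe. A series is matched by label, while a plain list carries no labels and therefore needs exactly one item per row.

```python
import pandas as pd

frame = pd.DataFrame({'states': ['lagos', 'fct', 'ondo']},
                     index=['one', 'two', 'three'])

# a series is matched to rows by label, not by position
visited = pd.Series(['yes', 'no', 'maybe'], index=['two', 'one', 'not-good'])
frame['visited'] = visited

# a plain list is assigned by position and must have one item per row
frame['population'] = [20, 10, 3]

# a list of the wrong length raises a ValueError
try:
    frame['too_short'] = [1, 2]
    short_list_rejected = False
except ValueError:
    short_list_rejected = True

print(frame)
```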


---|||
#### Deleting a column from a dataframe

Using the **del** keyword, followed by the column selection, deletes that column from the dataframe.

In [140]:
df

      geographic_region   states capital  population  Visited
one                  SW    lagos   ikeja           6      NaN
two                  NC      fct   abuja           5      NaN
three                SW     ondo   akure           4      NaN
four                 SW      oyo  ibadan           3      NaN
five                 SS  plateau     jos           2      NaN


In [142]:
# delete the geographic_region column

del df['geographic_region']

In [145]:
# the column is deleted
df.columns

Index(['states', 'capital', 'population', 'Visited'], dtype='object')

In [146]:
# the column is no longer in the dataframe
df

        states capital  population  Visited
one      lagos   ikeja           6      NaN
two        fct   abuja           5      NaN
three     ondo   akure           4      NaN
four       oyo  ibadan           3      NaN
five   plateau     jos           2      NaN
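
As a point of comparison, `del` changes the dataframe itself, while the `drop` method returns a new dataframe and leaves the original as it was. The sketch below uses a small illustrative dataframe rather than `df`:

```python
import pandas as pd

frame = pd.DataFrame({
    'states': ['lagos', 'fct'],
    'capital': ['ikeja', 'abuja'],
    'population': [6, 5],
})

# drop() returns a new dataframe; frame still has all three columns afterwards
without_population = frame.drop(columns=['population'])
columns_after_drop = list(frame.columns)

# del removes the column from frame itself
del frame['capital']

print(list(without_population.columns), list(frame.columns))
```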
