---|||
# Pandas Introduction

pandas is a Python library used for data manipulation, cleaning, and processing.

In [None]:
import pandas as pd
import numpy as np

#### The core data structures in pandas are Series (column-like) and DataFrame (table-like)
---|||

**Series**: a one-dimensional, array-like object containing
-  a sequence of values, **P**
-  an associated array of data labels, called its **index**

> By default, the index ranges from 0 to len(P) - 1


In [None]:
# notice the associated indices
pd.Series([5, 10, 111, 4])

In [None]:
# create a series with modified index, using its index parameter

s = pd.Series([10, 20, 30, 40], index=['i', 'j', 'k', 'l'])
s

In [None]:
# obtain the index of a given series using the index attribute

s.index


---|||
##### The index of a Series can be used to select the data that corresponds to a label

The index labels act as keys into the sequence of values

In [None]:
s['k'], s['i']

##### NumPy-like operations can be used to manipulate a Series object

Each value keeps its index label after the operation is performed

In [None]:
s[s > 25]

In [None]:
s * 2

In [None]:
np.exp(s)

In [None]:
# query if the series contains a given index

'j' in s

### A good mental model is to think of a Series object as a dictionary of keys and values

- the index labels are the keys
- the data are the values


In [None]:
# some states and their corresponding capitals in Nigeria
sdata = {'lagos': 'ikeja', 'ogun': 'abeokuta', 'adamawa': 'yola'}

# create a series object from the data
s = pd.Series(sdata)
s

In [None]:
# get the index

s.index

In [None]:
# query if an index is contained in the series

'lagos' in s, 'fct' in s


---|||

#### Alter the indices of a Series in-place

In [None]:
index = ['l', 'o', 'a']

# change the indices to index
s.index = index

In [None]:
s.index


---|||
#### Creating a Series from a dictionary keeps the keys in insertion order

Older versions of pandas sorted the data by key; current versions keep the dictionary's insertion order. Either way, the order can be overridden by passing the same keys, in whatever order is wanted, to the index keyword

> a key passed in the index that is not in the dictionary gets a NaN value (i.e. missing data)

In [None]:
# passing more index labels than there is data results in NaN values (see the next cell)

# some states and their corresponding capitals in Nigeria
sdata = {'lagos': 'ikeja', 'ogun': 'abeokuta', 'adamawa': 'yola'}

# using the keys to order the values
s = pd.Series(sdata, index=['lagos', 'adamawa', 'ogun'])
s

In [None]:
# using a key that is not in the dictionary => 'fct'
# will add the key, but its value will be NaN


s = pd.Series(sdata, index=['ogun', 'fct', 'adamawa', 'lagos'])

s


---|||
#### The **isnull** and **notnull** methods detect missing data

In [None]:
# find index with missing data in a series

s.isnull()

In [None]:
# find index without missing data in the series

s.notnull()

In [None]:
# the query formats (functions and instance method) are equivalent


(
    pd.notnull(s) == s.notnull(),
    
    pd.isnull(s) == s.isnull()
)


---|||
##### Arithmetic operations on Series with overlapping keys

When an arithmetic operation is performed on two Series, their index labels are used to align the data before the operation is applied element-wise. Labels that appear in only one of the Series produce NaN in the result
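
The alignment can be sketched with the same two dictionaries used in this section. The example also shows `Series.add` with `fill_value=0`, which is one way (not the only one) to treat a label missing from one side as zero instead of getting NaN. The variable names `plain` and `filled` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series({'i': 11, 'j': 33, 'k': 23})
s2 = pd.Series({'a': 100, 'j': 23, 'b': 73, 'k': 1000})

# Plain addition: a label present in only one Series yields NaN
plain = s1 + s2

# add() with fill_value=0 substitutes 0 for the side where the label is missing
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Only the labels `j` and `k` exist in both Series, so only those two get a numeric result in `plain`.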

In [None]:
adata = {'i': 11, 'j': 33, 'k': 23}
bdata = {'a': 100, 'j': 23, 'b': 73, 'k': 1000} # keys j, k are in adata

s1 = pd.Series(adata)
s2 = pd.Series(bdata)

In [None]:
s1

In [None]:
s2

In [None]:
# notice that not all the keys are the same

# keys that are not present in both indexes ...
# will return NaN values

s1 + s2

Notice that keys that don't match have NaN values returned

---
---|||
#### Naming a Series 

It is possible to name a Series object using the **name** attribute or the **name** keyword argument

In [None]:
sdata

In [None]:
s = pd.Series(sdata, name='States and Capital in Nigeria')

In [None]:
s

In [None]:
# get the name using its name attribute

s.name


---|||
# Dataframe

The other core pandas object is the DataFrame, which represents tabular data (like an Excel spreadsheet). It contains

- an ordered collection of columns; each column can hold a different data type
- a row index and a column index
 
A good mental model is to think of a DataFrame as a dictionary containing

- **keys**: the column labels, which make up the column index
- **values**: Series objects such that
  - the keys of each Series represent the row index
  - the values in each Series are the data, addressed by row and column label
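
As a minimal sketch of this mental model, a DataFrame can be built from a dictionary whose values are Series. The two small Series below are illustrative and reuse state and capital names from this notebook.

```python
import pandas as pd

# Two Series that share the same row labels
states = pd.Series(['lagos', 'oyo'], index=['i', 'j'])
capitals = pd.Series(['ikeja', 'ibadan'], index=['i', 'j'])

# Dictionary keys become the column labels; the Series index becomes the row index
df = pd.DataFrame({'states': states, 'capital': capitals})

# Selecting one key gives back a Series keyed by the row index
col = df['capital']
print(type(col).__name__)
print(col['j'])
```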

In [None]:
# creating a dataframe

data = {'states': ['lagos', 'fct', 'ondo', 'oyo', 'plateau'], 'capital': ['ikeja', 'abuja', 'akure', 'ibadan', 'jos']}

# the row index would default to integers (0, 1, 2, ...)
# if we had used the dictionary as is,
# but here we pass one at initialisation

df = pd.DataFrame(data, index=['i', 'j', 'k', 'l', 'm'])

df

In [None]:
# create a dataframe by passing list
# and we can specify the column index...
# in the order we want them to appear

df = pd.DataFrame(
  [('lagos', 'ikeja', 'SW'),
   ('fct', 'abuja', 'NC'), 
   ('ondo', 'akure', 'SW'),
   ('oyo', 'ibadan', 'SW'),
   ('plateau', 'jos', 'NC')],
  columns=['states', 'capitals', 'geographic_region'], index=['a', 'b', 'c', 'd', 'e']
)

df

In [None]:
# if we passed this directly, current pandas keeps the columns in the ...
# order of the keys - (states, capital, geographic_region);
# older versions sorted the keys instead

data = {
    'states': ['lagos', 'fct', 'ondo', 'oyo', 'plateau'],
    'capital': ['ikeja', 'abuja', 'akure', 'ibadan', 'jos'],
    'geographic_region': ['SW', 'NC', 'SW', 'SW', 'NC']
    }

# notice the order of the columns indices (geographic.., states, capital)

# also notice that the population index in columns ... 
# is not contained in the data, so its values will be nan

df = pd.DataFrame(data, columns=['geographic_region', 'states', 'capital', 'population'])
df


---|||
#### **head** and **tail**: selecting the first or last few rows

We can select the first or last few rows of the dataframe using the **head** and **tail** methods.

By default they return the first or last five rows of the dataframe; we can pass in a number to change how many rows are returned

In [None]:
# select the first n values of a dataframe
df.head(3)

In [None]:
df.tail(2)

In [None]:
# get the column index using the columns attribute

df.columns


---|||

# Indexing a Dataframe

Remember that the keys of the dataframe are its column labels.

When the dataframe object is indexed by a column name, a Series containing the row index and the corresponding data is returned.

Indexing can be carried out in two ways:
- **dictionary indexing**: using the column name as a key
- **attribute indexing**: using the column name as an attribute

> dictionary indexing is more general, as it also works for column names that contain a *space*, which would otherwise be invalid as an attribute.
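
A short sketch of the difference between the two styles. The column name `'geographic region'` (with a space) is a hypothetical variation of the `geographic_region` column used in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'capital': ['ikeja', 'abuja'],
    'geographic region': ['SW', 'NC'],   # hypothetical column name containing a space
})

by_key = df['capital']             # dictionary-style indexing
by_attr = df.capital               # attribute-style indexing, same column
spaced = df['geographic region']   # only dictionary-style indexing works for this name

print(by_key.equals(by_attr))
print(spaced.tolist())
```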

In [None]:
# get all the data in the capital -> a series 

df['capital']

In [None]:
# using attribute index

df.geographic_region

In [None]:
# get the population column

df.population

In [None]:
# assign values to the population column

df.population = [20, 10, 3, 4, 1]
df

In [None]:
# remember broadcasting?

df.population = 1
df


---|||

#### Adding a new series to an existing dataframe object

Add a new column first, then assign a Series to that column.

When a Series is assigned to a column, pandas aligns it with the dataframe on the row index:

- the Series does not need to have the same length as the dataframe
- index labels that match those in the dataframe are aligned
- dataframe rows whose label is missing from the Series get NaN
- Series labels that are not in the dataframe's index are dropped
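
A minimal sketch of label alignment on assignment, assuming current pandas behaviour. The Series below is deliberately shorter than the dataframe and carries one label (`'not-good'`) that the dataframe does not have. The names `visited` and the three-row frame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'states': ['lagos', 'fct', 'ondo']},
                  index=['one', 'two', 'three'])

# Two values only, one of them under a label the dataframe lacks
visited = pd.Series(['yes', 'no'], index=['two', 'not-good'])

# Assignment aligns on the row labels rather than on position
df['visited'] = visited
print(df)
```

Only row `two` receives a value; rows `one` and `three` are filled with NaN and the `'not-good'` entry is ignored.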


> Note, if the index length, those of the 

In [None]:
# change the row index names

df.index = ['one', 'two', 'three', 'four', 'five']
df

In [None]:
# add a series


# the series does not need the same length as the dataframe;
# it is aligned on the row index

# the row index, 'five' is not in the new series index ... 
# hence the corresponding value will be NaN

# there is no 'not-good' index in the dataframe, ... 
# hence its value will be dropped

s = pd.Series(
    data=['no', 'yes', 'yes', 'yes', 'bad'], 
    index=['two', 'one', 'four', 'three', 'not-good']
    )

# add a new column to the dataframe
# note this can only be created using dictionary key indexing
# as using attribute indexing will not work

df['Visited'] = np.nan
df

In [None]:
# add the series to the visited column

# notice that the 'not-good' label doesn't match any row

df.Visited = s
df


---|||
#### Deleting a column from a dataframe

Using the **del** keyword followed by the column selection from the dataframe will delete the column from the dataframe

In [None]:
df

In [None]:
del df['Visited']

df


---||| 

#### Swapping Columns and Rows with transpose

Using the **transpose** method or the **T** attribute will:

- turn rows into columns
- turn columns into rows

In [None]:
df.T

In [None]:
df.transpose()


---|||

### Creating a new Dataframe from an existing Dataframe

By using an existing dataframe object, one can create a new dataframe.

If the index keyword is also specified, then

- rows of the existing dataframe whose labels appear in the new index are carried over with their column data

- labels that are new are filled with NaN
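
The behaviour described above can be sketched on a small frame. The example also calls `reindex` with the same labels for comparison, on the assumption that the two routes select and fill rows in the same way; the variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({'capital': ['ikeja', 'abuja', 'akure']},
                  index=['one', 'two', 'three'])

# Build a new frame from the old one, keeping 'one' and 'three' and adding 'ten'
via_constructor = pd.DataFrame(df, index=['one', 'ten', 'three'])

# reindex is the more explicit way to request the same rows
via_reindex = df.reindex(['one', 'ten', 'three'])

print(via_constructor)
print(via_reindex)
```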

In [None]:
# we only want the data in 'one, five and three' in the new dataframe

# in the new index added, 'ten, nine' will be assigned nan 

df2 = pd.DataFrame(df, index=['one', 'ten', 'five', 'nine', 'three'])

df2

In [None]:
# set the name attributes of the index and column of a dataframe

df2.columns.name = 'States in Nigeria Info'
df2.index.name = "Numbering"

df2

In [None]:
# return the values in a dataframe
# as a two-dimensional NumPy array, with one inner array per row

df.values


---|||

### Index Object

This is the pandas object that holds the labels of
- a Series' row index 
- a Dataframe's column or row index

The Index object:

- is immutable 
- can be shared among other data structures
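
A small sketch of both properties. Item assignment on an Index raises a `TypeError`, which is caught here so the example runs to the end, and the same Index is then passed to two different Series. The names `labels`, `s1` and `s2` are illustrative.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

# Item assignment is rejected because an Index is immutable
try:
    labels[0] = 'k'
except TypeError as err:
    message = str(err)
    print(message)

# One Index can label several data structures
s1 = pd.Series([1, 2, 3], index=labels)
s2 = pd.Series([10, 20, 30], index=labels)
print(s1.index.equals(s2.index))
```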

In [None]:
# working with a Series through its index object

s = pd.Series(
    data=['Mo', 'Usman', 'Kolawole'],
    index=['a', 'b', 'c']
    )
s

In [None]:
# get the index object

ind = s.index

ind

In [None]:
# index objects are immutable: item assignment raises a TypeError

try:
    ind[0] = 'k'
except TypeError as err:
    print(err)

In [None]:
#  create an index object

ind = pd.Index(data=['l', 'm', 'n', 'p'])

ind

In [None]:
# use the new index object in a series

s = pd.Series(np.arange(4), index=ind)

s

In [None]:
ind2 = ['l', 'm', 'n', 'p']

In [None]:
# remember the difference between (==) and (is)?

# compares element wise; remember vectorization?
s.index == ind2

In [None]:
# check that they are the same object in memory 

s.index is ind2

In [None]:
s.index is ind


---|||
### Index Object as a Container That Allows Duplicates

Since Index objects are immutable, they resemble a fixed (frozen) set in Python and support set logic. Unlike Python sets, however, they can contain duplicate values
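
A brief sketch of an Index that holds a duplicate label and still takes part in set-style operations. The labels are arbitrary and the variable names are made up for illustration.

```python
import pandas as pd

dup = pd.Index(['a', 'b', 'b', 'c'])   # the label 'b' appears twice
other = pd.Index(['b', 'c', 'd'])

has_dups = dup.has_duplicates          # True, because of the repeated 'b'
common = dup.intersection(other)       # labels found in both
missing = dup.difference(other)        # labels in dup that are not in other

print(has_dups)
print(common)
print(missing)
```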

In [None]:
# create two Index objects

ind1 = pd.Index(['a', 'b', 'c', 'd', 'm', 'f', 'g', 'l'],)
ind2 = pd.Index(['l', 'm', 'b', 'p', 'a', 'h', 'q', 'r'],)

In [None]:
# apply set logic to both

# concatenate two Index objects to create a new one

ind3 = ind2.append(ind1)

ind3

In [None]:
# compute the difference between two sets, A - B:
# the elements of A that are not also in B

ind4 = ind3.difference(ind2)

ind4

In [None]:
# compute the union of two or more

# remember this is set logic, so duplicates will not be allowed
# but an index object itself can contain duplicate values

ind5 = ind1.union(ind2)

ind5


---|||
### Re-indexing a Series 

The **reindex** method creates a new Series object, with the data of the old series **aligned** to a new index (labels of the old series that also appear in the new index keep their data).

> the data in the new series is ordered according to the order in which the labels are passed in

In [None]:
# create a series

s = pd.Series(
    data=np.arange(8),
    index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    )
s

In [None]:
# index

s.index

In [None]:
# change the index using the reindex method

# note there are new index values present, 
# so, index in **s** will be aligned to its data in the new series ...
# and new index values will be assigned NAN values

s.reindex(['a', 'i', 'j', 'k', 'b', 'l', 'm', 'c', 'n', 'p', 'd', 'q'])


---|||
### Filling the NaN values in the **reindex** method

When both old and new index values are passed as elements in the reindex method,
the new index values will by default be **aligned** to NaN values. To fill these NaN values, one can pass a fill method (this requires the old index to be sorted).

**Methods**

- ffill: forward fill; a new label takes the value of the closest preceding label in the old series,
- bfill: backward fill; a new label takes the value of the closest following label in the old series,
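
The two fill methods can be compared side by side on a small Series with an integer index. The values and the names `plain`, `forward` and `backward` are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

plain = s.reindex(new_index)                      # labels 1 and 3 become NaN
forward = s.reindex(new_index, method='ffill')    # 1 copies label 0, 3 copies label 2
backward = s.reindex(new_index, method='bfill')   # 1 copies label 2, 3 copies label 4

print(plain)
print(forward)
print(backward)
```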

In [None]:
# using the same series, from above

s.reindex(
    index=['a', 'i' , 'b', 'j', 'c', 'k', 'd', 'e'],
    method='ffill'
)


In [None]:
s.reindex?


---|||
### NumPy UFunc and Mappings 

Remember that ufuncs are NumPy functions that apply an operation to each element of an ndarray, with support for broadcasting.

In pandas, ufuncs can also be applied to DataFrame and Series objects; the index and column labels are preserved


In [None]:
df = pd.DataFrame(np.arange(24).reshape((6 , 4)))

df

In [None]:
# apply the sqrt ufunc to the table

np.sqrt(df)


---|||
### Apply Mapping Method



In [None]:
s = pd.Series(np.arange(1, 10))
df = pd.DataFrame(np.arange(36).reshape(9, 4))

In [None]:
f =  lambda x: x*x

def g(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [None]:
s

In [None]:
s.apply(f)

In [None]:
df.apply(f)


In [None]:
df

In [None]:
df.apply(g, axis=1)


---|||
### **applymap** method

This is similar to apply, except that it applies the passed-in function to each element in the dataframe, instead of applying it to a Series along a specific axis. In pandas 2.1 and later, **applymap** is deprecated in favour of the equivalent **DataFrame.map**.
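
A minimal sketch contrasting the two approaches. Because the element-wise method is named `map` in pandas 2.1 and later and `applymap` before that, the example picks whichever is available; the helper name `fmt` and the sample numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.23456, 2.5], 'y': [3.14159, 4.0]})

def fmt(value):
    # Works on a single number, not on a whole Series
    return '%.3f' % value

# Element-wise: use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(fmt)

# apply hands each column to the function as a Series, which suits reductions
col_max = df.apply(lambda col: col.max())

print(formatted)
print(col_max)
```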

In [None]:
data = np.random.random((5, 6))
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

df

In [None]:
fmt = lambda x: '%.3f' % x   # named fmt to avoid shadowing the built-in format

In [None]:
# using apply will try to execute fmt along the axis (0, by default),
# which fails, because fmt expects single values,
# but the apply method passes a Series object to it

try:
    df.apply(fmt)
except TypeError as err:
    print(err)

In [None]:
# to overcome this challenge, the applymap method is used
# (DataFrame.map in pandas 2.1 and later)

df.applymap(fmt)

In [None]:
df


---|||
### Sorting

Sorting can be done in two ways:

- **by values**: sort based on the value in the series or dataframe
- **by index**: sort based on the index of a Series, or on the row or column labels (chosen with axis) of a dataframe
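
Both kinds of sorting can be sketched on one small frame. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['c', 'a', 'b'], 'count': [2, 3, 1]},
                  index=[10, 30, 20])

# By index: rows are ordered by their row labels
by_index = df.sort_index()

# By values: rows are ordered by the contents of the 'count' column, largest first
by_value = df.sort_values(by='count', ascending=False)

print(by_index)
print(by_value)
```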

In [None]:
s = pd.Series(np.arange(1, 14, 2), index=['c', 'a', 'b', 'k', 'i', 'm', 'e'])

s

In [None]:
# sort by index in descending order

s = s.sort_index(ascending=False)
s

In [None]:
# sort by values in ascending

s.sort_values()

In [None]:
df = pd.DataFrame(np.arange(1, 41, 3).reshape(7, 2), index=[3, 1, 7, 11, 32, 9, 0])

df

In [None]:
# sort by index along axis = 1

df.sort_index(axis=1, ascending=False)

In [None]:
# sort by index along axis=0

df.sort_index(axis=0)

In [None]:
# sort by values along axis=1; by must then name a row label (here the row labelled 1)

df.sort_values(axis=1, by=1)


---|||
# Chapter 6: Data Loading, Storage, and File Formats

---|||
### Loading/Reading of Data into a Dataframe Object

The read_[type] functions (read_csv, read_table, ...) load data stored on disk into a Dataframe object.


In [None]:
# reading a csv formatted data

file_path = "../datasets/Employees.csv"

df = pd.read_csv(file_path)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# set the 'Unnamed: 0' column as the index
df = df.set_index('Unnamed: 0')

In [None]:
df.head()

In [None]:
# remove index name
df.index.name = ""

In [None]:
df.head()

In [None]:
# these multiple operations could be done directly when loading the file

df = pd.read_csv(file_path, index_col=0)

In [None]:
df.head()


---|||
# Using **read_table** as a general method for loading text data

While *read_csv* assumes comma-separated data by default, *read_table* assumes tab-separated data; the format of the file can be indicated with the optional separator keyword (sep).

The separator between data points in a CSV-formatted file is a comma (,).
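
The effect of the separator can be sketched without touching the files used in this notebook by reading from an in-memory string. The sample text is made up for illustration.

```python
import io
import pandas as pd

text = "name,sex,count\nMary,F,10\nAnna,F,7\n"

# Default separator for read_table is a tab, so every line lands in a single column
as_table = pd.read_table(io.StringIO(text))

# Telling read_table that the separator is a comma splits the fields properly
as_csv = pd.read_table(io.StringIO(text), sep=',')

print(as_table.shape)
print(as_csv.shape)
```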

In [None]:
# use read_table to open the Employee CSV data

df = pd.read_table(file_path)

In [None]:
df.head()

In [None]:
# specify the delimiter with the sep keyword
# remember the index column

df = pd.read_table(file_path, sep=',', index_col=0)

In [None]:
df.head()

In [None]:
file_path = "../datasets/pydata_datasets/babynames/yob1881.txt"

# suppose we don't know how the data is formatted, we use the read_table to peek

df = pd.read_table(file_path)

In [None]:
df.head()

In [None]:
# since it is comma separated, we use read_csv

df = pd.read_csv(file_path)
df.head()


---|||
### Using the header key


It specifies which row, if any, holds the column names (the header):

- its value is either a row number (usually 0) or None
- with header=None, the default column names (0, 1, 2, 3, ...) are assigned, based on the inferred number of columns 

When the **names** key is specified, header should be set to...

- None: if column names are not present in the file
- 0: if they are present, but will be renamed (using names=[val1, val2, ...])
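
The three combinations can be sketched with in-memory text instead of a file on disk. The sample rows and the header `n,s,c` are made up for illustration.

```python
import io
import pandas as pd

no_header = "Mary,F,10\nAnna,F,7\n"
with_header = "n,s,c\nMary,F,10\nAnna,F,7\n"
names = ['Name', 'Sex', 'Count']

# No header row in the file and no names given: columns are numbered 0, 1, 2
default_names = pd.read_csv(io.StringIO(no_header), header=None)

# No header row in the file, names supplied
named = pd.read_csv(io.StringIO(no_header), header=None, names=names)

# Header row present in the file, replaced by the supplied names
renamed = pd.read_csv(io.StringIO(with_header), header=0, names=names)

print(default_names.columns.tolist())
print(renamed)
```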

In [None]:
# something is amiss, the column names are part of the data
# so we specify that the header (=> column names) is/are absent

df = pd.read_csv(file_path, header=None)
df.head()


---|||
### Setting the column names

This can be achieved by passing a list of values to the optional **names** parameter of read_csv.

There are implications if the number of names passed exceeds the number of columns in the data file. In this case, the extra names are added as columns populated with NaN values

If fewer names are passed than there are columns in the data file, then the columns not accounted for are used to index the data 

In [None]:
# to be even more pragmatic, we specify the names we want
# instead of using the default integer column labels

df = pd.read_csv(file_path, header=None, names=['Name', 'Sex', 'Count'])
df.head()

In [None]:
df.tail()

In [None]:
file_path = "../datasets/pydata_datasets/haiti/Haiti.csv"

# peek into the data

df = pd.read_csv(file_path)
df.head(3)

In [None]:
# get the count of non-missing values in each column of the dataframe

df.count() 

# observe that the total length is 3593

In [None]:
# confirm the observation using the shape

df.shape

# its true; there are 3593 rows

In [None]:
# is the serial number a simple running number or something special?

# sort by the values in the Serial column
df.sort_values(by='Serial').head()

In [None]:
# change the values of Serial to start from 1, by subtracting 3

# to apply an operation to each value in the Serial column ...
# we could map a function such as f over it

f = lambda x: x - 3

In [None]:
# the vectorized subtraction below is equivalent to df['Serial'].map(f)

df['Serial'] = df['Serial'] - 3

In [None]:
# set Serial as the index

df.set_index('Serial', inplace=True)

In [None]:
df.sort_index(inplace=True)

In [None]:
df.head()

In [None]:
# an interesting data set
file_path = "../datasets/pydata_datasets/movielens/movies.dat"

# peek
pd.read_table(file_path).head()

Observations:

- the file has no header: set the column names ourselves (title, year, genre)
- there are three separators: :: delimits the title and genre, and () encloses the year
- we should replace the | between genres with something better, such as a space

- the index of the data should be the first column

In [None]:
df = pd.read_table(
    file_path,
    header=None,
    names=['num', "title", "year", "genre", ''],
    sep=r'(\d+)::([^:]+)\s\((\d+)\)::(.*)',
    engine='python',
    keep_default_na=False,
)

In [None]:
df.head()


---|||
## Handling Missing Values 

It is possible to pass a string or a sequence of strings to the **na_values** key, which marks any occurrence of the string(s) as a missing value (NaN) 

In [None]:
# let's set Able and Echo as sentinel values

df = pd.read_csv("../datasets/Employees.csv", na_values=['Able', 'Echo'])

df.tail()

In [None]:
# set the index

df.set_index('Unnamed: 0', inplace=True)

In [None]:
df.head()

In [None]:
# rename index to empty 
df.rename_axis(index="", inplace=True) 

In [None]:
df.head()


---|||
### **na_values** for all or specific column(s)

Values passed to the **na_values** key are treated as sentinels: wherever one of them occurs in the data, it is marked as missing.

It is also possible to restrict the matches to specific columns by passing a dictionary of **key-value** pairs to **na_values**, such that:

- the key is the name of the column in which certain data should be marked as sentinel

- the value(s) is/are the data to mark as sentinel in that column
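
A sketch of the per-column form using in-memory text. The rows are made up; note that the first row deliberately has `Echo` in the Department column to show that the sentinel only applies to the column named in the dictionary.

```python
import io
import pandas as pd

text = (
    "Department,Name,YearOfService\n"
    "Echo,Able,2\n"
    "IT,Echo,7\n"
    "HR,Mo,5\n"
)

df = pd.read_csv(
    io.StringIO(text),
    na_values={
        'Name': ['Able', 'Echo'],   # sentinels for the Name column only
        'YearOfService': [2],       # sentinel for the YearOfService column only
    },
)

print(df)
```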

In [None]:
# we could have set the sentinel by selecting the specific column

# let's set Able and Echo as sentinel values in the Name column
# and all the values less than four (4) in the YearOfService column

df = pd.read_csv(
    "../datasets/Employees.csv", 
    na_values={
        'Name': ['Able', 'Echo'],
        'YearOfService': [0, 1, 2, 3]       # values less than 4
        },

    # when names is specified, header should be set to...
    #   None: if column names are not present
    #   0: if they are present, but will be renamed (using names=[])
    header=0, # the first row holds the column names that names replaces
    names= ["", "Department", "Name", "YearOfService"], # rename columns
    index_col=0
)

df.sort_values(by=['Name'], ascending=True).tail()


---|||
### Converter Optional Parameter

This is an optional parameter of **read_csv** and **read_table** that takes a dictionary as its value.

Each **key is a column name** and each value is a **function/mapping f**; f is applied to every value in the specified column as the file is read
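
A minimal sketch of a converter, reading from in-memory text. The helper name `fill_name` and the sample rows are made up; the replacement value `'Mo'` follows the example used in this notebook.

```python
import io
import pandas as pd

text = "Department,Name,YearOfService\nSales,,2\nIT,Echo,7\n"

def fill_name(raw):
    # A converter receives the raw cell text, so an empty field arrives as ''
    return raw if raw else 'Mo'

df = pd.read_csv(io.StringIO(text), converters={'Name': fill_name})

print(df)
```

Because the converter sees the text before pandas decides what counts as missing, the empty Name field is replaced rather than turned into NaN.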

In [None]:
df.head()

In [None]:
df = pd.read_table(
    "../datasets/a.txt",
    header=0,
    names=["Department", "Name", "YearOfService"],
    sep=','
    # index_col=0
)
df

In [None]:
# based on the observation above, let's change the missing values in Name to 'Mo'

# converters receive the raw text of each cell, so an empty field arrives as ''
f = lambda x: ('Mo' if not x else x)
df = pd.read_table(
    "../datasets/a.txt",
    header=0,
    names=["Department", "Name", "YearOfService"],
    sep=',',
    converters={'Name': f}
    # index_col=0
)
df


---|||
### Reading Text Files in Pieces

When reading large files, it is often better to read only a small piece of the file OR to iterate through the file in smaller chunks

One way this can be achieved is to specify the number of rows to read

In [None]:
import pandas as pd
import numpy as np

In [None]:
# set the display to more compact format

pd.options.display.max_rows = 10

In [None]:
file_path = "../datasets/pydata_datasets/movielens/movies.dat"

# take a look at the resulting dataframe for example
df = pd.read_table(
    file_path,
    header=None,
    names=['num', "title", "year", "genre", ''],
    sep=r'(\d+)::([^:]+)\s\((\d+)\)::(.*)',
    engine='python',
    keep_default_na=False,
)

df.shape

There are 3883 rows.

Suppose the file were very large; it would then take much longer to parse the entire data into the dataframe. One way to avoid this is to specify the **number of rows** we want from the file.

When the **nrows** key is passed, read_[format] stops once it has read the specified number of rows. This is especially useful when one wants to quickly examine the data in a file.

In [None]:
# instead of reading the entire file, we could specify the number of rows

file_path = "../datasets/pydata_datasets/movielens/movies.dat"

df = pd.read_table(
    file_path,
    header=None,
    names=['num', "title", "year", "genre", ''],
    sep=r'(\d+)::([^:]+)\s\((\d+)\)::(.*)',
    engine='python',
    keep_default_na=False,
    nrows=3
)

df


---|||
### Reading in chunks, by specifying the chunksize

When the chunksize value is specified, an iterator over the chunks of data in the file is returned.

Here, if **chunksize == p**, then p rows are returned every time we call **next** on the iterator (or iterate with a for loop), until there is no more data to iterate over

More interestingly, we can call the **get_chunk(size=val)** method on the returned iterator, with a certain **val**, to

- read fewer rows than a regular iteration would return, **if val < p**

- read more rows than a regular iteration would return, **if val > p**

- in general, read exactly **val** rows in that call, regardless of the chunksize
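
The interplay of the chunk size and an explicit size can be sketched on 25 rows of in-memory text. The data and variable names are made up for illustration.

```python
import io
import pandas as pd

# One header line followed by the numbers 0..24, one per row
text = "value\n" + "\n".join(str(i) for i in range(25)) + "\n"

reader = pd.read_csv(io.StringIO(text), chunksize=10)

first = next(reader)            # a regular chunk of 10 rows
custom = reader.get_chunk(3)    # an explicit request for 3 rows

# Iterating over what is left goes back to the regular chunk size
rest = [len(piece) for piece in reader]

print(len(first), len(custom), rest)
```

After 13 rows have been consumed, the remaining 12 arrive as one full chunk and one shorter final chunk.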

In [None]:
file_path = "../datasets/pydata_datasets/movielens/movies.dat"

# specify the chunksize to get an iterator

chunk = pd.read_table(
    file_path,
    header=None,
    names=['num', "title", "year", "genre", ''],
    sep=r'(\d+)::([^:]+)\s\((\d+)\)::(.*)',
    engine='python',
    keep_default_na=False,
    chunksize=20
)

# in this iteration get the first 20 rows
next(chunk)

# in this iteration, we get 22 rows; more than the chunksize specified
print(chunk.get_chunk(size=22).shape)

# in this iteration, we get the default chunksize specified; 20
print((next(chunk).shape))

# in this iteration, we get 10 rows; less than the chunksize specified
print(chunk.get_chunk(size=10).shape)

In [None]:
file_path = "../datasets/pydata_datasets/movielens/movies.dat"

# specify the chunksize to get an iterator

chunk = pd.read_table(
    file_path,
    header=None,
    names=['num', "title", "year", "genre", ''],
    sep=r'(\d+)::([^:]+)\s\((\d+)\)::(.*)',
    engine='python',
    keep_default_na=False,
    chunksize=10
)

# print the number of iterations it takes to read the entire file
count = 0
tot_rows = 0
for piece  in chunk:
    count += 1
    tot_rows += piece.shape[0]
    # print((piece.shape))

'It took {0} iterations to read {1} number of rows'.format(count,  tot_rows)


---|||
### Writing Data To Text Format

Using the **to_[format]** methods, we can save the data in a dataframe to disk in the specified format


dataframe.to_[format]([name].[format])


- dataframe.to_excel([name].xlsx): to Excel format
- dataframe.to_csv([name].csv): to CSV format
- dataframe.to_json([name].json): to JSON format

In [None]:
file_path = "../datasets/Employees.csv"

df = pd.read_csv("../datasets/Employees.csv", index_col=0)
df

In [None]:

# to json; the records orient omits the index 

df.to_json("employee.json", orient="records")

In [None]:
# save to csv format, 
# but instead of the delimiter being a comma, we use |

df.to_csv("employee.csv", sep="|")

In [None]:
# observe the following data
file_path = "a.txt"

# since it is a .txt file, we use read_table
# actually, we could have used read_csv, since we know it is comma separated

df = pd.read_table(file_path, sep=',', index_col=0)

df

#### Observations

There are missing values in the Name column. To replace these missing values with another value, we could have used the **converters** parameter while reading the file.

However, we want to save this data to disk, and replace any missing values with **another name**.

To achieve this, we pass the **value** to the **na_rep** key. This value will stand in for the missing values in the written file 
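
The effect of **na_rep** can be checked without writing a file: when no path is given, `to_csv` returns the CSV text as a string. The two-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Able', np.nan], 'YearOfService': [2, 7]})

# No path is passed, so the CSV text is returned instead of being written to disk
text = df.to_csv(sep='|', na_rep='NULL', index=False)

print(text)
```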

In [None]:
df.to_csv("employee_NULL_for_NA.csv", sep='|', na_rep="NULL")

In [None]:
# we could also remove the column labels by passing a header as false
df.to_csv("employee_NULL_for_NA_w-o_header.csv", sep='|', na_rep="NULL", header=False)

In [None]:
# we could also remove the index by passing an index key as false
df.to_csv("employee_NULL_for_NA_w-o_header_w-o_index.csv", sep='|', na_rep="NULL", header=False, index=False)

In [None]:
# instead of writing the output to a file, by passing the file name,
# we could output the result into standard output, which will just print the result

# get the stdout
from sys import stdout

df.to_csv(stdout, sep='|', na_rep="NULL", header=False, index=False)

In [None]:
# we can also indicate which columns we are interested in

# and again drop the index by passing index as False
df.to_csv(stdout, sep='|', na_rep="NULL", index=False, columns=["Department", "YearOfService"])


---|||
### Binary Data Formats

One can take data held in memory (or read from another format) and store it in a binary format, a process known as serialization

The reverse is known as deserialization. 
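
A round trip through pickle can be sketched with a temporary directory, so nothing is left behind on disk. The file name `frame.pkl` and the small frame are made up for illustration. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')
    df.to_pickle(path)                 # serialization: dataframe to binary file
    restored = pd.read_pickle(path)    # deserialization: binary file to dataframe

print(restored.equals(df))
```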

In [None]:
# read a csv file and save it in binary format

f = lambda x: "NULL" if not x else x

df = pd.read_table("a.txt", index_col=['Unnamed: 0'], sep=',', converters={'Name': f})

df

In [None]:
# store it in binary (pickle) format; to_pickle writes the file and returns None

df.to_pickle("employee_NULL_w-o_header_to_binary")

In [None]:
# load the binary file

df = pd.read_pickle("employee_NULL_w-o_header_to_binary")
df


---|||
### Storage Formats

HDF5: Hierarchical Data Format

This is a file format used for **storing large quantities** of scientific array data. 

An HDF5 file can store multiple datasets, together with metadata, as key-value pairs. It also supports compressing the stored data, using **a variety of compression modes**, so that data with repeated patterns is stored efficiently.

In [89]:
# store a dataframe in HDF5 format

# create the dataframe
df = pd.DataFrame({'a': np.random.randn(100)})

# store as hdf
store = pd.HDFStore('data.h5')

In [90]:
# store the dataframe

store['obj1'] = df

In [93]:
store['obj1_col'] = df['a']

In [97]:
print(store.info())

<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/obj1                frame        (shape->[100,1])
/obj1_col            series       (shape->[100])  


In [98]:
# retrieve the df

store['obj1']

Unnamed: 0,a
0,-1.245699
1,0.326683
2,-1.405154
3,-0.338620
4,0.736357
...,...
95,-0.116920
96,-0.420591
97,0.437353
98,-0.391695


In [99]:
# retreive the data in col a

store['obj1_col']

0    -1.245699
1     0.326683
2    -1.405154
3    -0.338620
4     0.736357
        ...   
95   -0.116920
96   -0.420591
97    0.437353
98   -0.391695
99    0.990099
Name: a, Length: 100, dtype: float64


---|||
### Storage Schema

Two schemas (i.e. modes of storing data) are supported by HDFStore:

- fixed: fast, but doesn't support query operations

- table: slower, but supports query operations

In [104]:
df2 = pd.DataFrame({'b': np.random.randn(40), 'e': np.random.randn(40)})

store.put('obj2', value=df2, format='table')

In [106]:
# note, if the storage schema is "fixed", then this query operation will fail
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,b,e
10,0.021936,-0.729077
11,-0.738394,1.231876
12,0.927187,-1.120192
13,-0.121029,1.007038
14,1.172373,0.689595
15,0.593094,0.022941


In [107]:
# close the storage

store.close()

In [108]:
# confirm close

store['obj1']

ClosedFileError: data.h5 file is not open!

In [109]:
# performing the same operation using to_hdf and read_hdf
a = np.arange(20).reshape((10, 2))

df  = pd.DataFrame(a, columns=['a', 'b'])

df.to_hdf('rdata.h5', key='obj', format='table')

In [116]:
pd.read_hdf('rdata.h5', 'obj', where= ['index >=3', 'columns = a' ])

Unnamed: 0,a
3,6
4,8
5,10
6,12
7,14
8,16
9,18
