## Week 3: More Data Processing with Pandas 
#  Lecture 30: Storing dataframes

### Learning Objectives
* Load data from files into a pandas DataFrame and store a DataFrame back to disk


* CSV
* Excel
* Pickle
* HDF5
* PyArrow


After reviewing the various operations on pandas DataFrames, we can go ahead and learn how to load various file formats in Python using pandas and Jupyter.

Every file we load will be converted to a pandas DataFrame.

In [2]:
import os
os.listdir()

# list the files and folders in the current directory, so you can reference your file names

In [77]:
import numpy as np
import pandas as pd
%matplotlib inline

## CSV & Type Inference 



In [13]:
# One of the most common files to work with is a CSV file, or comma-separated values file:
# basically a text file where the values (columns) are separated by commas

# When you load a CSV file with read_csv, pandas assumes the values are separated by commas,
# so it knows how to split each line and extract the values, which makes CSV data fairly easy to load

# With pandas you don't have to specify which columns are numeric, integer, boolean or string
# When pandas reads your file, it does two things to your data: type inference and data conversion

# As pandas parses the data it breaks each line into tokens and stores each column as an array of strings.
# It then tries to convert each column to a more specific type such as integer, float, or boolean.
# If the conversion fails, the column is kept as strings with dtype object.

# The drawback of type inference on CSV data is that pandas has to guess the data types,
# and a guess can be wrong (for example, an ID with leading zeros read as an integer).
# Pandas allows you to explicitly define the types of the columns using the dtype parameter

df = pd.read_csv('supermarkets.csv')
df

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20
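
The difference between inferred and explicit types can be seen on a small in-memory table. This is a minimal sketch: the `zip` column and its values are made up for illustration and are not part of `supermarkets.csv`.

```python
import io
import pandas as pd

# Hypothetical data: the first zip code has a leading zero
csv_text = "id,zip,employees\n1,02134,8\n2,94114,15\n"

# Let pandas infer the types: zip is read as an integer and loses its leading zero
inferred = pd.read_csv(io.StringIO(csv_text))

# Declare zip as a string so the text is kept exactly as written
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(inferred["zip"].tolist())
print(explicit["zip"].tolist())
```

Passing `dtype` only for the columns you care about leaves inference in place for the rest.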


In [21]:
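# We could also have used read_table and specified the delimiter.
# read_table is the same parser as read_csv, but its default delimiter is a tab ('\t').
# This file is comma-separated, so sep=',' would reproduce the read_csv result;
# with sep='\t' no tabs are found and each line is read as a single column:

df = pd.read_table('supermarkets.csv', sep='\t')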
df

Unnamed: 0,"ID,Address,City,State,Country,Name,Employees"
0,"1,3666 21st St,San Francisco,CA 94114,USA,Made..."
1,"2,735 Dolores St,San Francisco,CA 94119,USA,Br..."
2,"3,332 Hill St,San Francisco,California 94114,U..."
3,"4,3995 23rd St,San Francisco,CA 94114,USA,Ben'..."
4,"5,1056 Sanchez St,San Francisco,California,USA..."
5,"6,551 Alvarado St,San Francisco,CA 94114,USA,R..."
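
For comparison, here is a minimal sketch of `read_table` on data that really is tab-separated. The text is built in memory and is only an assumption for illustration; it is not one of the lecture files.

```python
import io
import pandas as pd

# Hypothetical tab-separated text
tsv_text = "ID\tName\tEmployees\n1\tMadeira\t8\n2\tSanchez\t12\n"

# read_table uses a tab delimiter by default
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same frame once the delimiter is stated
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the wrong delimiter (the default comma) every line becomes one column
wrong = pd.read_csv(io.StringIO(tsv_text))

print(from_table.shape, from_csv.shape, wrong.shape)
```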


In [55]:
# A file will not always have a header row

# You have a couple of options.
# You can allow pandas to assign default column names, or you can specify names yourself.
# (supermarkets.csv does have a header row, so here it shows up as the first row of data.)

pd.read_csv('supermarkets.csv', header=None)

pd.read_csv('supermarkets.csv', names=['a', 'b', 'c', 'd', 'e', 'f', 'message'])

Unnamed: 0,a,b,c,d,e,f,message
0,ID,Address,City,State,Country,Name,Employees
1,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
2,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
3,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
4,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
5,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
6,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20
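
The two options are easier to compare on a file that truly has no header. A minimal sketch, assuming a small headerless table built in memory (the column names passed to `names` are our own choice):

```python
import io
import pandas as pd

# Hypothetical headerless data
csv_text = "1,Madeira,8\n2,Sanchez,12\n"

# Option 1: let pandas number the columns 0, 1, 2
default_names = pd.read_csv(io.StringIO(csv_text), header=None)

# Option 2: supply the column names yourself
custom_names = pd.read_csv(io.StringIO(csv_text), names=["ID", "Name", "Employees"])

print(list(default_names.columns))
print(list(custom_names.columns))
```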


In [57]:
# You can also set a column to be the index of the returned DataFrame

# either pass the position of the column (3 is the fourth column, since positions start at 0)
pd.read_csv('nba_draft.csv', index_col=3)


# or use a named column like 'message' with the index_col argument:
names = ['a', 'b', 'c', 'd', 'e', 'f', 'message']

pd.read_csv('supermarkets.csv', names=names, index_col='message')




Unnamed: 0_level_0,a,b,c,d,e,f
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Employees,ID,Address,City,State,Country,Name
8,1,3666 21st St,San Francisco,CA 94114,USA,Madeira
15,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop
25,3,332 Hill St,San Francisco,California 94114,USA,Super River
10,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop
12,5,1056 Sanchez St,San Francisco,California,USA,Sanchez
20,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley
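
Selecting the index by position or by name gives the same result when both point at the same column. A minimal sketch on an in-memory table (a made-up subset of columns, assumed for illustration):

```python
import io
import pandas as pd

csv_text = "ID,Name,Employees\n1,Madeira,8\n2,Bready Shop,15\n"

# Index chosen by column position (0 is the first column)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Index chosen by column name
by_name = pd.read_csv(io.StringIO(csv_text), index_col="ID")

# Rows can now be looked up by their ID label
print(by_name.loc[2, "Name"])
```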


In [68]:
# To form a hierarchical index from multiple columns,
# pass a list of column numbers or names to index_col:

##having trouble getting the right data for this example
data={('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

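# DataFrame.from_dict has no index_col argument, so build the frame first
# and then promote the two station columns to a hierarchical index with set_index
pd.DataFrame(data).set_index([('USAF', ''), ('WBAN', '')])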

# parsed = pd.read_csv('supermarkets.csv', index_col=['key1', 'key2'])
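
The commented-out call can be tried on a small in-memory table. This is a minimal sketch: the columns `key1`, `key2`, `value1`, and `value2` are assumed for illustration and do not exist in `supermarkets.csv`.

```python
import io
import pandas as pd

# Hypothetical table with two key columns
csv_text = "key1,key2,value1,value2\none,a,1,2\none,b,3,4\ntwo,a,5,6\n"

# A list passed to index_col produces a two-level (hierarchical) index
parsed = pd.read_csv(io.StringIO(csv_text), index_col=["key1", "key2"])

print(parsed.index.names)
# A row is addressed by a tuple with one label per index level
print(parsed.loc[("one", "b"), "value1"])
```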

In [69]:
# Another parser option is skiprows, which here skips the first, third, and fourth lines of the file:

# Sometimes a formatting issue in a few lines can prevent you from working with your data at all.
# Rather than repairing the values that cause the issue, you can simply skip those rows.
# Note that line 0 is the header line, so in the output below the first data row is used as the header.

pd.read_csv('supermarkets.csv', skiprows=[0, 2, 3])

Unnamed: 0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
0,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
1,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
2,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20
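
A typical use of `skiprows` is dropping stray note lines around the header. A minimal sketch, assuming a made-up file with two such lines:

```python
import io
import pandas as pd

# Hypothetical file: lines 0 and 2 are notes, line 1 is the real header
text = "# exported by some tool\na,b,c\n# units: none\n1,2,3\n4,5,6\n"

# Line numbers passed to skiprows are counted from 0
result = pd.read_csv(io.StringIO(text), skiprows=[0, 2])

print(result)
```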


In [70]:
## Handling missing values

# Missing data is usually either not present (empty string)
# or marked by some sentinel value.
# By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

result = pd.read_csv('supermarkets.csv')

pd.isnull(result)

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False


In [None]:
# The na_values option takes a list or set of strings to treat as missing values:

result = pd.read_csv('ch06/ex5.csv', na_values=['NULL'])

# Different NA sentinels can be specified for each column in a dict:

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

pd.read_csv('ch06/ex5.csv', na_values=sentinels)
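
The effect of per-column sentinels can be checked without the book's data file. A minimal sketch, assuming a small in-memory table whose contents are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical data: one empty cell and one literal NA
csv_text = "something,a,message\none,1,NA\ntwo,5,world\nthree,,foo\n"

# Default behaviour: the empty cell and the string NA become NaN
default = pd.read_csv(io.StringIO(csv_text))

# Extra sentinels per column, on top of the defaults
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
custom = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

print(default.isnull().sum().sum())
print(custom.isnull().sum().sum())
```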

In [79]:
# Another parameter, nrows, lets you read only a small number of rows (avoiding reading the entire file)

pd.read_csv('supermarkets.csv', nrows=5)

# make the pandas display settings more compact

pd.options.display.max_rows = 10


In [None]:
# To read a file in pieces, specify a chunksize as a number of rows:

chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
chunker
# <pandas.io.parsers.TextParser at 0x8398150>

# The TextParser object returned by read_csv (called TextFileReader in recent pandas versions)
# allows you to iterate over the parts of the file according to the chunksize.

chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    # add this chunk's counts to the running total
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
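
The same chunked aggregation can be run on generated data. A minimal sketch, assuming a ten-row table built in memory with a made-up `key` column, read four rows at a time:

```python
import io
import pandas as pd

# Hypothetical data: keys cycle through a, b, c over ten rows
rows = ["key,value"] + [f"{'abc'[i % 3]},{i}" for i in range(10)]
csv_text = "\n".join(rows)

tot = pd.Series([], dtype="float64")
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each piece is an ordinary DataFrame with at most four rows
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

Only one chunk is held in memory at a time, which is the point of `chunksize` for large files.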

In [None]:
# For any file with a single-character delimiter,
# you can use Python’s built-in csv module.

# To use it, pass any open file or file-like object to csv.reader:

import csv
with open('ch06/ex7.csv') as f:
    reader = csv.reader(f)
    # Iterating through the reader like a file yields a list of values for each line,
    # with any quote characters removed
    for line in reader:
        print(line)
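
What the reader yields can be seen with an in-memory file. A minimal sketch using only the standard library; the sample text is made up:

```python
import csv
import io

# Hypothetical text: the quoted field contains a comma
text = 'a,"b, with comma",c\n1,2,3\n'

# csv.reader accepts any file-like object and yields one list per line
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    print(row)
```

Note that every value comes back as a string; unlike pandas, the csv module does no type inference.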



In [None]:
# For a text file separated by whitespace or some other pattern, use pandas' read_table:

# A table might not have a fixed delimiter; in these cases, you can pass a regular expression as the delimiter for read_table.

##having trouble getting the right data for this example
result = pd.read_table('ch06/ex3.txt', sep=r'\s+')
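
A regular-expression delimiter can be tried on a small in-memory table. This is a minimal sketch: the text below is invented for illustration and is not the contents of `ch06/ex3.txt`.

```python
import io
import pandas as pd

# Hypothetical text: fields are separated by a varying number of spaces
text = "name  A   B\naaa 1.0   2.0\nbbb   4.0 5.0\n"

# r'\s+' matches one or more whitespace characters
spaced = pd.read_table(io.StringIO(text), sep=r"\s+")

print(spaced)
```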

## Excel 

In [81]:
# pandas reads Excel files with read_excel
# Normally you should include an extra parameter, sheet_name
# Some Excel files have multiple sheets in them, so you want to specify which sheet to load,
# either by name or by position; positions start from zero, so sheet_name=0 is the first sheet

pd.read_excel('supermarkets.xlsx', sheet_name='Sheet3')

In [82]:
# Column types can be explicitly specified; the keys of the dtype dict are column names

pd.read_excel('supermarkets.xlsx', index_col=0,
              dtype={'Name': str, 'Value': float})

Unnamed: 0_level_0,Address,City,State,Country,Supermarket Name,Number of Employees
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
3,332 Hill St,San Francisco,California 94114,USA,Super River,25
4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


## Pickle (what is this in python)

In [74]:
### BINARY DATA #### 


# Pickling converts a Python object hierarchy into a byte stream
# “Unpickling” is the inverse operation: a byte stream is converted back into an object hierarchy
# This has the advantage that there are no restrictions imposed by external standards such as JSON or XDR


# pickle is used for serializing and de-serializing a Python object structure.
# Pickle serializes the object before storing it to a file.
# Serialization means converting the object to a stream of bytes that can be stored or sent

# The easiest way to store data in binary format is using Python's built-in pickle serialization.
# The to_pickle method writes the data to disk in pickle format.
# The byte stream contains all the information necessary to reconstruct the object

# With the pickle module itself, you serialize an object hierarchy by creating a Pickler
# and calling its dump() method.
# To de-serialize a data stream, you create an Unpickler
# and call its load() method. pandas wraps these steps in to_pickle and read_pickle.
frame = pd.read_csv('supermarkets.csv')

# Write the pickle to its own file (the .pkl name is our choice).
# Writing it to 'supermarkets.csv' would overwrite the CSV with binary data,
# and a later read_csv would fail with
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
frame.to_pickle('supermarkets.pkl')

# read pickled objects using pandas.read_pickle

pd.read_pickle('supermarkets.pkl')

# Where should we use pickling?
# It is very useful when you want to pick up your work after a restart: dump an object while coding in the Python shell,
# then load the pickled file later and deserialize it

# Caution

# pickle is only recommended as a short-term storage format. It is not guaranteed that the format will be
# stable over time, as an object pickled today may not unpickle with a later version of the library.
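
A pickle round trip can be checked end to end in a temporary directory. A minimal sketch, assuming a small made-up frame and a file name of our own choosing:

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame standing in for the supermarkets data
frame = pd.DataFrame({"Name": ["Madeira", "Sanchez"], "Employees": [8, 12]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    frame.to_pickle(path)            # serialize to a binary file
    restored = pd.read_pickle(path)  # deserialize back into a DataFrame

# Values, column order, and dtypes survive the round trip
print(restored.equals(frame))
```

Only unpickle files you trust, since loading a pickle can run arbitrary code.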

## HDF5 (introduction in book)

In [4]:
## working with very large datasets ##

## for storing large quantities of scientific array data ##

# For datasets that won't fit into memory, you can efficiently read and write small sections of much larger arrays.
# pandas provides a high-level interface to HDF5 that simplifies storing Series and DataFrame objects
# (it relies on the PyTables package)


# "HDF" stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata.
# HDF5 supports on-the-fly compression with a variety of compression modes,
# enabling data with repeated patterns to be stored more efficiently

# The HDFStore class works like a dict and handles the low-level details:
frame = pd.DataFrame({'a':np.random.randn(100)})
store = pd.HDFStore('mydata.h5') 
store['obj1'] = frame 
store['obj1_col'] = frame['a']
store



<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

In [6]:
#objects contained in the HDF5 file can then be retrieved with the same dict-like API 
store['obj1'].head()

Unnamed: 0,a
0,0.232331
1,-1.046879
2,-1.091731
3,0.801573
4,0.289194


In [9]:
# HDFStore supports 2 storage schemas: 'fixed' and 'table'
# 'table' is generally slower, but it supports query operations like the where clause below
store.put('obj2', frame, format='table')

store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,-2.68464
11,0.950889
12,-1.071151
13,-1.184191
14,0.509601
15,0.822926


In [10]:
store.close()

In [12]:
# pandas.read_hdf and DataFrame.to_hdf are shortcuts to the same tools

frame.to_hdf('mydata.h5', key='obj3', format='table')
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])


Unnamed: 0,a
0,0.232331
1,-1.046879
2,-1.091731
3,0.801573
4,0.289194


In [None]:


# It's important to note that HDF5 is not a database.

# If you are processing data stored on a remote server, like Amazon S3 or HDFS,
# using a different binary format designed for distributed storage, like Apache Parquet, may be more suitable



## Pyarrow (the future)


In [None]:
# The main feature of Arrow is that two programs written in different languages
# that both speak Arrow can share information with low overhead (cross-language, cross-system communication)
# Arrow is a columnar in-memory analytics layer designed to accelerate big data. It houses a set of canonical in-memory
# representations of flat and hierarchical data along with multiple language bindings for structure manipulation.

# The equivalent of a pandas DataFrame in Arrow is a Table.
# A Table also supports nested columns,
# so it can represent more data than a DataFrame, and a full conversion is not always possible.

# Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas().
# The inverse is achieved by using pyarrow.Table.from_pandas().

# conda install -c conda-forge pyarrow

# With the design of pandas and Arrow described here, it is not possible to convert all column types unmodified.
# One of the main issues is that pandas has historically had no support for nullable columns of arbitrary type.
# Also, datetime64 was fixed to nanosecond resolution in pandas before version 2.0.


In [5]:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)



ModuleNotFoundError: No module named 'pyarrow'

In [None]:
# At the end you can export the DataFrame back to whichever format you need,
# for example with to_csv, to_excel, to_pickle, or to_hdf