## Cookbook: How to Load Data into Python
+ text files
+ CSV files
+ pickle files
+ JSON files
+ odds and ends


In [55]:
import os 
os.makedirs('temp_data', exist_ok=True)  # does not fail if the directory already exists

In [56]:
# Write two lines of text to a text file
file_path = './temp_data/test.txt'
with open(file_path, 'w') as f:  ## 'w' stands for 'write'
    f.write('this is my test file \n I am writting now in the second line')
    

In [57]:
# Read the lines back from the text file
with open(file_path, 'r') as f:  ## 'r' stands for 'read'
    x = f.readlines()
# the file comes back as a list, where every line is one element
print(x, type(x))

# join the entire list into one string
print(' '.join(x))


['this is my test file \n', ' I am writting now in the second line'] <class 'list'>
this is my test file 
  I am writting now in the second line
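
As the output above shows, `readlines()` keeps the trailing `\n` on each line. The following is a minimal sketch of two common alternatives: iterating over the file object and stripping newlines, or reading the whole file as one string. The scratch directory and the `demo.txt` file name are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical scratch location, so the sketch does not touch ./temp_data
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')

with open(demo_path, 'w') as f:
    f.write('first line\nsecond line\n')

# Iterating over the file object yields one line at a time;
# rstrip('\n') removes the trailing newline from each line
with open(demo_path, 'r') as f:
    stripped = [line.rstrip('\n') for line in f]

# read() returns the whole file as a single string
with open(demo_path, 'r') as f:
    whole = f.read()

print(stripped)
print(repr(whole))
```

Iterating line by line is usually the better choice for large files, since only one line is held in memory at a time.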


#### Loading and Saving Data with Pandas
pandas can read and write multiple data formats:
+ CSV (comma-separated values) is very common. It is not particularly fast, but it is a human-readable way to load and save data.
    + if you have TSV (tab-separated) data, use ```sep='\t'```
    + occasionally an encoding needs to be specified as an argument, e.g. ```encoding='utf-8'```
+ pickle is a serialized format. It preserves column types, but it is essentially specific to Python.
+ parquet is a relatively new format. It preserves column types and is very fast to read and write.
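
The `sep` and `encoding` arguments mentioned above can be sketched as follows. The data frame contents, the scratch directory and the `demo.tsv` file name are made up for illustration; only the `sep` and `encoding` arguments come from the notes above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical scratch location and sample data
tsv_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tsv_dir, 'demo.tsv')
demo_df = pd.DataFrame({'city': ['Zürich', 'Kraków'], 'count': [3, 5]})

# Write tab-separated text with an explicit encoding
demo_df.to_csv(tsv_path, sep='\t', index=False, encoding='utf-8')

# Read it back with the same separator and encoding
tsv_loaded = pd.read_csv(tsv_path, sep='\t', encoding='utf-8')
print(tsv_loaded)
```

Using the same `sep` and `encoding` on both the write and the read side avoids the most common loading errors with delimited text.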

In [58]:
import pandas as pd
import os
# create a data frame for testing
df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'y', None]})

path = './temp_data/df.csv'
# write to csv 
df.to_csv(path, header=True, index=False) # index=False means no row numbers; header=True writes the column names

# load from csv
df_loaded = pd.read_csv(path, header=0)
print(df_loaded)

# write to pickle
path = './temp_data/df.pkl'
df.to_pickle(path)

# load from pickle 
df_loaded = pd.read_pickle(path)
print(df_loaded)


     a    b
0  1.0    x
1  2.0    y
2  NaN  NaN
     a     b
0  1.0     x
1  2.0     y
2  NaN  None
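
The two outputs above differ slightly: the missing value in column `b` comes back as `NaN` from CSV but as `None` from pickle. This is because CSV stores only text, so pandas has to infer the types again on loading. A minimal sketch of the same effect with a date column is shown below; the column names, the scratch directory and the file names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical scratch location and sample data with a datetime column
dtype_dir = tempfile.mkdtemp()
csv_path = os.path.join(dtype_dir, 'typed.csv')
pkl_path = os.path.join(dtype_dir, 'typed.pkl')
typed_df = pd.DataFrame({'when': pd.to_datetime(['2020-01-01', '2020-01-02']),
                         'n': [1, 2]})

typed_df.to_csv(csv_path, index=False)
typed_df.to_pickle(pkl_path)

from_csv = pd.read_csv(csv_path)      # dates come back as plain text
from_pkl = pd.read_pickle(pkl_path)   # dates come back as datetimes

# parse_dates asks read_csv to convert the named columns back to datetimes
from_csv_parsed = pd.read_csv(csv_path, parse_dates=['when'])

print(from_csv.dtypes)
print(from_pkl.dtypes)
print(from_csv_parsed.dtypes)
```

If column types matter downstream, either use a format that stores them or state them explicitly when reading the CSV.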


#### Pandas CSV Writer
pandas writes a CSV file as plain text, one line per row, so the file can be opened and read like any other text file.

In [59]:
# show how pandas writes a csv file 
path = './temp_data/df.csv'
with open(path, 'r') as f:  ## 'r' stands for 'read'
    print(' '.join(f.readlines()))

a,b
 1.0,x
 2.0,y
 ,
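
Because the file is plain text, it can also be parsed without pandas. The sketch below uses the standard-library `csv` module on a string with the same content as the file printed above (without the leading spaces that `' '.join` added). Using `io.StringIO` in place of a real file is an assumption made to keep the example self-contained.

```python
import csv
import io

# Same text that pandas wrote to df.csv: a header line, then one line per row
csv_text = 'a,b\n1.0,x\n2.0,y\n,\n'

# csv.reader splits each line on commas and returns lists of strings
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

Note that every value, including the numbers, comes back as a string, and the missing values come back as empty strings.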




#### Load and Save Data in Excel Format
This requires packages to be installed first:
+ ```pip install openpyxl``` (for .xlsx files)
+ ```pip install xlrd``` (for legacy .xls files)

In [60]:
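# Write and read Excel data
# needs openpyxl to run, install with pip
p = './temp_data/df.xlsx'  # write into the temp_data directory created earlier
df.to_excel(p)  # the index is written too and reads back as 'Unnamed: 0'; pass index=False to skip it
print(pd.read_excel(p))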


   Unnamed: 0    a    b
0           0  1.0    x
1           1  2.0    y
2           2  NaN  NaN


#### JSON Data Is Essentially a Dictionary
JSON data is stored as a string, in a dictionary or dictionary-of-dictionaries format. It is pretty universal for things like web request arguments. Reading and writing it also requires the ```json``` package.




In [61]:
import json
# save a dictionary as JSON
d = {'name': 'USS YorkTown', 'size':{'displacement':[27,100], 'dims':[820, 93,34]}}
path = './temp_data/df.json'  # write into the temp_data directory created earlier
with open(path, 'w') as f:
    json.dump(d, f, ensure_ascii=False, indent=4)

# load it back into a dictionary
with open(path, 'r') as f:
    ship = json.load(f)
print(ship)
print(ship['name'])

{'name': 'USS YorkTown', 'size': {'displacement': [27, 100], 'dims': [820, 93, 34]}}
USS YorkTown
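
JSON does not have to go through a file. A minimal sketch of the string form, using `json.dumps` and `json.loads` from the standard library, is shown below. The `record` and `quirky` dictionaries are made-up examples.

```python
import json

record = {'name': 'USS YorkTown', 'dims': [820, 93, 34]}

# dumps() serializes a dictionary to a JSON string
as_text = json.dumps(record)
print(type(as_text))

# loads() parses a JSON string back into a dictionary
restored = json.loads(as_text)
print(restored == record)

# JSON has fewer types than Python: tuples come back as lists
# and non-string keys come back as strings
quirky = json.loads(json.dumps({1: (2, 3)}))
print(quirky)
```

The last lines show why a JSON round trip is not always lossless for arbitrary Python dictionaries.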


#### Directories with Many CSV Files
A common problem is that a directory may contain many data files that need to be used as one dataset.

In [62]:

## set up a directory with multiple data files
data_dir = './temp_data/data'
os.makedirs(data_dir, exist_ok=True)  # no error if the directory already exists
# create a data frame for testing
df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'y', None]})


# write two CSV files to the data_dir directory
# index=False means no row numbers; header=True writes the column names
df.to_csv(data_dir + '/f1.csv', header=True, index=False)
df.to_csv(data_dir + '/f2.csv', header=True, index=False)


os.listdir(data_dir) ## lists the two csv files 

['f1.csv', 'f2.csv']
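
`os.listdir` returns every entry in the directory, in arbitrary order. When a directory may also hold other files, one option is to filter by extension and sort the result. The sketch below does this with `pathlib`; the scratch directory and the file names in it are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch directory with two CSV files and one unrelated file
scratch = Path(tempfile.mkdtemp())
for name in ['f2.csv', 'f1.csv', 'notes.txt']:
    (scratch / name).write_text('a,b\n1,x\n')

# glob('*.csv') keeps only the CSV files; sorted() makes the order predictable
csv_paths = sorted(scratch.glob('*.csv'))
print([p.name for p in csv_paths])
```

A predictable order matters when the files are stacked later, because the row order of the combined data follows the file order.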

#### Use Generators to Load Files
+ given a list of file paths, a generator is used to load each file
+ calling ```list()``` on the generator loads every data frame, while ```next()``` loads files one at a time
+ ```pd.concat``` vertically stacks all the data together

In [63]:
# create a list of files to read in (sorted, since os.listdir returns names in arbitrary order)
file_list = [data_dir + '/' + str(f) for f in sorted(os.listdir(data_dir))]

print(file_list)
# create a loading generator
def load(file_list):
    for path in file_list:
        yield pd.read_csv(path, header=0)
# load all files into a list of data frames
df_list = list(load(file_list))
# vertically stack the data frames
pd.concat(df_list)

['./temp_data/data/f1.csv', './temp_data/data/f2.csv']


     a    b
0  1.0    x
1  2.0    y
2  NaN  NaN
0  1.0    x
1  2.0    y
2  NaN  NaN
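
In the stacked result above, the row labels 0, 1, 2 repeat once per file. The sketch below shows `next()` pulling a single file from the generator, and `ignore_index=True` renumbering the rows after stacking. The function name `load_frames`, the scratch directory and the `part0.csv` / `part1.csv` files are hypothetical and used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical scratch directory with two small CSV files
demo_dir = tempfile.mkdtemp()
demo_paths = []
for i in range(2):
    p = os.path.join(demo_dir, f'part{i}.csv')
    pd.DataFrame({'a': [i, i + 10]}).to_csv(p, index=False)
    demo_paths.append(p)

def load_frames(paths):
    # Each file is read only when the generator is advanced
    for p in paths:
        yield pd.read_csv(p)

frames = load_frames(demo_paths)
first = next(frames)    # only part0.csv has been read at this point
print(first)

rest = list(frames)     # reads the remaining files
# ignore_index=True replaces the repeated row labels with 0..n-1
combined = pd.concat([first] + rest, ignore_index=True)
print(combined)
```

Loading one file at a time with `next()` is useful for inspecting the first file before committing to reading the whole directory.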


In [64]:
import shutil
shutil.rmtree('temp_data')  ## cleans up and removes the temp_data directory on any operating system