<h1 align='center'>6.2 Reading and Writing Data in Some Other Formats</h1>

<b>JSON</b>

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications.

    JSON is very nearly valid Python code, with the exception of its null value null and some other 
    nuances (such as disallowing trailing commas at the end of lists). 

    The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an 
    object must be strings.
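
The differences from Python literals described in these notes can be checked with the standard-library json module. This is a small sketch; the sample strings are made up for illustration.

```python
import json

# JSON null becomes Python None; true/false become True/False.
parsed = json.loads('{"pet": null, "active": true}')
print(parsed)  # {'pet': None, 'active': True}

# A trailing comma is valid in a Python list literal but not in JSON.
try:
    json.loads('[1, 2, 3,]')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
print(trailing_comma_ok)  # False

# JSON object keys must be strings, so json.dumps converts an int key to a string.
encoded = json.dumps({1: "one"})
print(encoded)  # {"1": "one"}
```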

In [98]:
import json
import codecs
import pandas as pd
import numpy as np

To convert a JSON string to Python form, use json.loads:

In [44]:
obj = """
        {"name": "Wes", "places_lived": ["United States", "Spain", "Germany"],
        "pet": null, 
        "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},              
        {"name": "Katie", "age": 38,              
        "pets": ["Sixes", "Stache", "Cisco"]}]
        }
        """

In [45]:
res = json.loads(obj)

In [46]:
res

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

json.dumps, on the other hand, converts a Python object back to JSON:

In [47]:
asjson = json.dumps(res)
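
A quick way to see what json.dumps produces is to round-trip a small object. The sketch below uses a made-up record (a shortened stand-in for res) and only the standard library.

```python
import json

record = {"name": "Wes", "pet": None, "places_lived": ["United States", "Spain"]}

# Serialize to a JSON string; None is written as null.
as_json = json.dumps(record)
print(as_json)

# indent= produces a human-readable, multi-line layout.
print(json.dumps(record, indent=2))

# Parsing the string again gives back an equal Python object.
round_tripped = json.loads(as_json)
print(round_tripped == record)  # True
```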

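Converting to a DataFrame: you can pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields.

In [48]:
siblings = pd.DataFrame(res['siblings'], columns=['name', 'age', 'pets'])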

In [49]:
siblings

    name  age                    pets
0  Scott   30            [Zeus, Zuko]
1  Katie   38  [Sixes, Stache, Cisco]


If you need to export data from pandas to JSON, one way is to use the to_json methods on Series and DataFrame.
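
As a minimal sketch of to_json, the example below uses a small made-up frame (the names df, records_json, and restored are illustrative, not from the notebook) and shows the default orientation next to orient="records".

```python
import io
import pandas as pd

# Hypothetical small frame standing in for the siblings data above.
df = pd.DataFrame({"name": ["Scott", "Katie"], "age": [30, 38]})

# Default orientation: one JSON object per column, keyed by index label.
print(df.to_json())

# orient="records" writes a list of row objects instead.
records_json = df.to_json(orient="records")
print(records_json)

# read_json parses it back; wrapping the string in StringIO avoids the
# deprecation warning newer pandas versions raise for literal JSON strings.
restored = pd.read_json(io.StringIO(records_json))
print(restored)
```

The records orientation is usually the closest match to the list-of-dicts shape that was passed to the DataFrame constructor earlier.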

<b> XML | HTML</b>

In [70]:
tables = pd.read_html(r"C:\Users\Synergy_Stud\Desktop\curry.html")

In [71]:
tables

[     Season   Age   Tm   Lg  Pos    G   GS    MP    FG   FGA  ...    FT%  ORB  \
 0   2009-10  21.0  GSW  NBA   PG   80   77  36.2   6.6  14.3  ...  0.885  0.6   
 1   2010-11  22.0  GSW  NBA   PG   74   74  33.6   6.8  14.2  ...  0.934  0.7   
 2   2011-12  23.0  GSW  NBA   PG   26   23  28.2   5.6  11.4  ...  0.809  0.6   
 3   2012-13  24.0  GSW  NBA   PG   78   78  38.2   8.0  17.8  ...  0.900  0.8   
 4   2013-14  25.0  GSW  NBA   PG   78   78  36.5   8.4  17.7  ...  0.885  0.6   
 5   2014-15  26.0  GSW  NBA   PG   80   80  32.7   8.2  16.8  ...  0.914  0.7   
 6   2015-16  27.0  GSW  NBA   PG   79   79  34.2  10.2  20.2  ...  0.908  0.9   
 7   2016-17  28.0  GSW  NBA   PG   79   79  33.4   8.5  18.3  ...  0.898  0.8   
 8   2017-18  29.0  GSW  NBA   PG   51   51  32.0   8.4  16.9  ...  0.921  0.7   
 9   2018-19  30.0  GSW  NBA   PG   69   69  33.8   9.2  19.4  ...  0.916  0.7   
 10  2019-20  31.0  GSW  NBA   PG    5    5  27.8   6.6  16.4  ...  1.000  0.8   
 11   Career   N

Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:

In [91]:
from lxml import objectify

parsed = objectify.parse(open(r"C:\Users\Synergy_Stud\Desktop\xml.xml"))

root = parsed.getroot()



    root.INDICATOR returns a generator yielding each <INDICATOR> XML element. 
    
    For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a 
    few tags):

In [96]:
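data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

# Iterate over each <INDICATOR> record, as described above
for elt in root.INDICATOR:
    # Start a fresh dict per record; reusing one dict would make every
    # entry in data point to the same (last) record
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

# Inspect the last record that was processed
el_data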

{'AGENCY_NAME': '\nMetro-North Railroad\n',
 'INDICATOR_NAME': '\nEscalator Availability\n',
 'DESCRIPTION': '\nPercent of the time that escalators are operational  systemwide. \nThe availability rate is based on physical observations performed \n the morning of regular business days only. This is a new indicator\n the agency  began reporting in 2009.\n ',
 'PERIOD_YEAR': 2011,
 'PERIOD_MONTH': 12,
 'CATEGORY': 'Service Indicators\n ',
 'FREQUENCY': '\n M\n ',
 'INDICATOR_UNIT': '\n %\n ',
 'YTD_TARGET': 97.0,
 'YTD_ACTUAL': '\n ',
 'MONTHLY_TARGET': 97.0,
 'MONTHLY_ACTUAL': '\n '}
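
Many of the string values in this record carry leading and trailing newlines and spaces from the XML file. One possible cleanup step is sketched below; clean_record is a hypothetical helper, not part of pandas or lxml, and the sample dict copies a few fields from the output above.

```python
def clean_record(record):
    """Strip surrounding whitespace from string values; blank strings become None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned

# A few fields copied from the record shown above.
raw = {
    'AGENCY_NAME': '\nMetro-North Railroad\n',
    'PERIOD_YEAR': 2011,
    'FREQUENCY': '\n M\n ',
    'YTD_TARGET': 97.0,
    'YTD_ACTUAL': '\n ',
}

cleaned = clean_record(raw)
print(cleaned)
```

Applying such a helper to every dict in data before passing the list to pd.DataFrame would give one tidy row per record.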

<b> HDF5</b>

One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format:

    frame.to_pickle('examples/frame_pickle')

    You can read any “pickled” object stored in a file by using the built-in pickle directly, or even more conveniently 
    using pandas.read_pickle.

    pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that 
    the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
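
The two ways of reading a pickled object mentioned above can be sketched as follows. The frame contents and the temporary file location are made up for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "frame_pickle")  # hypothetical file name

    # Write the DataFrame to disk in pickle format.
    frame.to_pickle(path)

    # Option 1: read it back with the built-in pickle module.
    with open(path, "rb") as f:
        via_pickle = pickle.load(f)

    # Option 2: let pandas open and unpickle the file.
    via_pandas = pd.read_pickle(path)

print(via_pickle.equals(frame), via_pandas.equals(frame))
```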

The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata.

    HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a 
    file at any time, if multiple writers do so simultaneously, the file can become corrupted.

    If you are processing data that is stored on remote servers, like Amazon S3 or HDFS, using a different 
    binary format designed for distributed storage, like Apache Parquet, may be more suitable.

While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. 
    
    The HDFStore class works like a dict and handles the low-level details.
    
    Objects contained in the HDF5 file can then be retrieved with the same dict-like API.

In [99]:
frame = pd.DataFrame({'a': np.random.randn(100)})

store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

In [101]:
store['obj1'].head()

          a
0  0.225791
1  0.117257
2  0.057617
3  0.115853
4 -0.287917


HDFStore supports two storage schemas, 'fixed' and 'table'. 

The latter is generally slower, but it supports query operations using a special syntax:

In [102]:
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])

           a
10 -0.473700
11 -1.567397
12  0.814390
13 -0.110470
14  1.224221
15 -0.762832


In [105]:
store.close()

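put is an explicit version of the store['obj2'] = frame assignment, but it allows us to set other options like the storage format.

* Shortcut: DataFrame.to_hdf and pandas.read_hdf give direct access to these tools without creating an HDFStore first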

In [106]:
frame.to_hdf('mydata.h5', key='obj3', format='table')

In [107]:
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

          a
0  0.225791
1  0.117257
2  0.057617
3  0.115853
4 -0.287917


<b>XLSX</b>

In [109]:
xlsx = pd.ExcelFile(r"C:\Users\Synergy_Stud\Desktop\xlsx.xlsx")

In [110]:
pd.read_excel(xlsx, 'Sheet1')

   Unnamed: 0  Unnamed: 1  Unnamed: 2 Unnamed: 3
0         NaN         NaN         NaN        NaN
1         NaN         NaN         NaN        NaN
2         NaN         NaN         NaN       into


To write pandas data to Excel format, you can first create an ExcelWriter and then write data to it using pandas objects’ to_excel method. You can also pass a file path to to_excel and avoid the ExcelWriter, as in the cell below:

In [113]:
frame.to_excel(r"C:\Users\Synergy_Stud\Desktop\xlsx.xlsx")

In [114]:
xlsx = pd.ExcelFile(r"C:\Users\Synergy_Stud\Desktop\xlsx.xlsx")
pd.read_excel(xlsx, 'Sheet1')

    Unnamed: 0         a
0            0  0.225791
1            1  0.117257
2            2  0.057617
3            3  0.115853
4            4 -0.287917
..         ...       ...
95          95  1.892527
96          96 -1.573165
97          97 -0.202583
98          98 -1.276294
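
The extra Unnamed: 0 column in the output above appears because to_excel writes the index as the first column by default, and read_excel then reads it back as ordinary data. The sketch below shows two ways around this, and also uses an explicit ExcelWriter. It assumes an Excel engine such as openpyxl is installed; the file name and frame contents are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": np.random.randn(5)})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.xlsx")  # hypothetical file name

    # Explicit ExcelWriter; index=False leaves the row labels out of the sheet.
    with pd.ExcelWriter(path) as writer:
        frame.to_excel(writer, sheet_name="Sheet1", index=False)

    no_index = pd.read_excel(path, sheet_name="Sheet1")

    # Alternatively keep the index in the file and restore it when reading.
    frame.to_excel(path, sheet_name="Sheet1")
    with_index = pd.read_excel(path, sheet_name="Sheet1", index_col=0)

print(list(no_index.columns))    # ['a']
print(list(with_index.columns))  # ['a']
```

An ExcelWriter is mainly useful when several frames should go into different sheets of the same workbook; for a single sheet, passing the path to to_excel is enough.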
