# Educative - Advanced Pandas: Going Beyond the Basics
## Chapter 2 - Reading Different Input Data into pandas
___

In [2]:
import pandas as pd
pd.__version__

'1.5.0'

### Table of Contents
[(1) Read Data from the Web](#web)  
[(2) Read Data from Markup Language Files](#markup)  
[(3) Read Data from Statistical Software](#stats)  
[(4) Read Data from SQL Databases](#sql)  
[(5) Read Data from Binary Files](#binary)

___
<a class="anchor" id="web"></a>
## (1) Read Data from the Web

### (i) Read online files

#### CSV

In [2]:
url = 'https://raw.githubusercontent.com/kennethleungty/Simulated-Annealing-Feature-Selection/main/data/raw/train.csv'

In [3]:
df = pd.read_csv(url)

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
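
`read_csv` accepts a file-like object as well as a URL, so the same call can be tried offline. The sketch below is only illustrative: it inlines two rows copied from the output above, and the `usecols` and `index_col` arguments are additions that the original cell does not use.

```python
from io import StringIO

import pandas as pd

# Two rows copied from the output above, inlined so no network access is needed
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22.0
3,1,3,"Heikkinen, Miss. Laina",female,26.0
'''

# usecols limits the parsed columns; index_col promotes one of them to the index
df_small = pd.read_csv(
    StringIO(csv_text),
    usecols=["PassengerId", "Name", "Age"],
    index_col="PassengerId",
)
print(df_small)
```

The quoted `Name` field keeps its embedded comma, which is why the values are wrapped in double quotes in the raw text.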


#### AWS S3

Reading CSV files from S3 also requires the following packages:  
1) fsspec  
2) s3fs

In [5]:
# pip install fsspec
# pip install s3fs

In [5]:
s3_url = 's3://noaa-wcsd-pds/data/processed/SH1305/bottom/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv'

In [6]:
df = pd.read_csv(
    s3_url,
    storage_options={"anon": True} # Specify anonymous connection
    )

df.head()

Unnamed: 0,Ping_date,Ping_time,Ping_milliseconds,Latitude,Longitude,Position_status,Depth,Line_status,Ping_status,Altitude,GPS_UTC_time
0,2013-05-23,08:08:55,802.0,47.038269,-124.846214,1,148.039711,1,0,-0.55,-1
1,2013-05-23,08:08:57,302.0,47.038176,-124.846193,1,148.255721,1,0,-0.126019,-1
2,2013-05-23,08:08:58,802.0,47.03808,-124.846168,1,148.055539,1,0,0.62,-1
3,2013-05-23,08:09:00,302.0,47.03798,-124.846135,1,148.054853,1,0,1.05,-1
4,2013-05-23,08:09:01,802.0,47.037881,-124.846102,1,148.055645,1,0,0.84,-1


#### JSON

In [7]:
json_url = 'https://blockchain.info/latestblock'
pd.read_json(json_url)

Unnamed: 0,height,time,block_index,txIndexes,hash
0,756380,1664548699,1664548699,8199470655468017,00000000000000000006650e387060e75198e3b1ce07a6...
1,756380,1664548699,1664548699,3015277109181083,00000000000000000006650e387060e75198e3b1ce07a6...
2,756380,1664548699,1664548699,6353439046823471,00000000000000000006650e387060e75198e3b1ce07a6...
3,756380,1664548699,1664548699,3178692138026629,00000000000000000006650e387060e75198e3b1ce07a6...
4,756380,1664548699,1664548699,6726832768352653,00000000000000000006650e387060e75198e3b1ce07a6...
...,...,...,...,...,...
3292,756380,1664548699,1664548699,2671747665077032,00000000000000000006650e387060e75198e3b1ce07a6...
3293,756380,1664548699,1664548699,3137308048682972,00000000000000000006650e387060e75198e3b1ce07a6...
3294,756380,1664548699,1664548699,2747989026006833,00000000000000000006650e387060e75198e3b1ce07a6...
3295,756380,1664548699,1664548699,6177539568422872,00000000000000000006650e387060e75198e3b1ce07a6...
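
The repeated `height`, `time`, and `hash` values above appear to come from how a JSON object that mixes scalar fields with one list-valued field is turned into a DataFrame: the scalars are repeated for every list element. A minimal sketch with a hypothetical payload (the keys mirror the response above, the list values are made up):

```python
from io import StringIO

import pandas as pd

# Hypothetical payload: one scalar field plus one list-valued field
payload = '{"height": 756380, "txIndexes": [11, 22, 33]}'

# The scalar is broadcast so that every list element gets its own row
df_json = pd.read_json(StringIO(payload))
print(df_json)
```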


___
### (ii) Read HTML Tables from a Website

In [8]:
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
pd.read_html(url)

[                                               (USD)  \
 0  RevenueThe total amount of income generated by...   
 1  Operating expenseRepresents the total incurred...   
 2  Net incomeCompany’s earnings for a period ne...   
 3  Net profit marginMeasures how much net income ...   
 4  Earnings per shareRepresents the company's pro...   
 5  EBITDAEarnings before interest, taxes, depreci...   
 6  Effective tax rateThe percent of their income ...   
 
   Jun 2022infoFiscal Q2 2022 ended 6/30/22. Reported on 7/29/22. Y/Y change  
 0                                            121.23B                  7.21%  
 1                                             51.49B                 24.98%  
 2                                             -2.03B               -126.07%  
 3                                              -1.67               -124.27%  
 4                                              -0.20               -126.46%  
 5                                             12.91B               

In [9]:
pd.read_html('https://www.google.com/finance/quote/AMZN:NASDAQ')[0]

Unnamed: 0,(USD),Jun 2022infoFiscal Q2 2022 ended 6/30/22. Reported on 7/29/22.,Y/Y change
0,RevenueThe total amount of income generated by...,121.23B,7.21%
1,Operating expenseRepresents the total incurred...,51.49B,24.98%
2,Net incomeCompany’s earnings for a period ne...,-2.03B,-126.07%
3,Net profit marginMeasures how much net income ...,-1.67,-124.27%
4,Earnings per shareRepresents the company's pro...,-0.20,-126.46%
5,"EBITDAEarnings before interest, taxes, depreci...",12.91B,-17.97%
6,Effective tax rateThe percent of their income ...,23.90%,—


In [10]:
pd.read_html('https://www.google.com/finance/quote/AMZN:NASDAQ')[1]

Unnamed: 0,(USD),Jun 2022infoFiscal Q2 2022 ended 6/30/22. Reported on 7/29/22.,Y/Y change
0,Cash and short-term investmentsInvestments tha...,60.71B,-32.46%
1,Total assetsThe total amount of assets owned b...,419.73B,16.49%
2,Total liabilitiesSum of the combined debts a c...,288.33B,17.44%
3,Total equityThe value of subtracting the total...,131.40B,—
4,Shares outstandingTotal number of common share...,10.19B,—
5,Price to bookA ratio used to determine if a co...,8.90,—
6,Return on assetsA financial ratio that shows a...,2.00%,—
7,Return on capitalCompany’s return above the ...,2.91%,—


___
### (iii) Read from Clipboard
- Copy the managers table from https://en.wikipedia.org/wiki/Premier_League#Managers to the clipboard, then run the cell below

In [11]:
pd.read_clipboard()

Unnamed: 0,Nat.,Manager,Club,Appointed,Time as manager
0,Germany,Jürgen Klopp,Liverpool,8 October 2015,"6 years, 357 days"
1,Spain,Pep Guardiola,Manchester City,1 July 2016,"6 years, 91 days"
2,Denmark,Thomas Frank,Brentford,16 October 2018,"3 years, 349 days"
3,Austria,Ralph Hasenhüttl,Southampton,5 December 2018,"3 years, 299 days"
4,Northern Ireland,Brendan Rodgers,Leicester City,26 February 2019,"3 years, 216 days"
5,Spain,Mikel Arteta,Arsenal,20 December 2019,"2 years, 284 days"
6,Scotland,David Moyes,West Ham United,29 December 2019,"2 years, 275 days"
7,Portugal,Bruno Lage,Wolverhampton Wanderers,9 June 2021,"1 year, 113 days"
8,Portugal,Marco Silva,Fulham,1 July 2021,"1 year, 91 days"
9,France,Patrick Vieira,Crystal Palace,4 July 2021,"1 year, 88 days"


___
<a class="anchor" id="markup"></a>
## (2) Read Data from Markup Language Files

### (i) HTML

In [13]:
# pip install html5lib

In [14]:
html_path = '../scripts/continents.html'

In [15]:
df = pd.read_html(html_path)[0]

In [16]:
df

Unnamed: 0_level_0,Continent or Region,Area,Area,Area,Population,Population
Unnamed: 0_level_1,Continent or Region,km2,sq mi,% of total land,2021 estimate,% of total
0,Asia,44614000,17226000,29.8%,4.7 billion,60%
1,Africa,30365000,11724000,20.3%,1.4 billion,17%
2,North America,24230000,9360000,16.2%,600 million,7.6%
3,South America,17814000,6878000,11.9%,430 million,5.6%
4,Antarctica,14200000,5500000,9.5%,0,0%
5,Europe,10000000,3900000,6.7%,750 million,9.8%
6,Oceania,8510900,3286100,5.7%,44 million,0.54%


In [17]:
df = pd.read_html(html_path, attrs={'class': 'wikitable'})[0]

In [18]:
df

Unnamed: 0_level_0,Continent or Region,Area,Area,Area,Population,Population
Unnamed: 0_level_1,Continent or Region,km2,sq mi,% of total land,2021 estimate,% of total
0,Asia,44614000,17226000,29.8%,4.7 billion,60%
1,Africa,30365000,11724000,20.3%,1.4 billion,17%
2,North America,24230000,9360000,16.2%,600 million,7.6%
3,South America,17814000,6878000,11.9%,430 million,5.6%
4,Antarctica,14200000,5500000,9.5%,0,0%
5,Europe,10000000,3900000,6.7%,750 million,9.8%
6,Oceania,8510900,3286100,5.7%,44 million,0.54%
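
The table above comes back with a two-level column header. One possible way to flatten it into single-level names is sketched below; the small frame is rebuilt by hand from a few of the values above, and the `"top (sub)"` naming scheme is just an assumption for illustration.

```python
import pandas as pd

# Small frame with the same kind of two-level header as the table above
columns = pd.MultiIndex.from_tuples([
    ("Continent or Region", "Continent or Region"),
    ("Area", "km2"),
    ("Area", "sq mi"),
    ("Population", "2021 estimate"),
])
df_multi = pd.DataFrame(
    [["Asia", 44614000, 17226000, "4.7 billion"],
     ["Africa", 30365000, 11724000, "1.4 billion"]],
    columns=columns,
)

# Join the two levels, collapsing labels that repeat on both levels
df_multi.columns = [
    top if top == sub else f"{top} ({sub})"
    for top, sub in df_multi.columns
]
print(df_multi.columns.tolist())
```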


### (ii) XML

#### Customers Dataset

In [19]:
xml_path = '../scripts/customers.xml'

In [20]:
customer_df = pd.read_xml(xml_path)

In [21]:
customer_df

Unnamed: 0,customer_id,first_name,last_name,email,gender,country
0,1,Amabelle,Esposito,aesposito0@reverbnation.com,Female,Peru
1,2,Shaughn,Brothwell,sbrothwell1@aboutads.info,Genderqueer,Yemen
2,3,Merridie,Skowcraft,mskowcraft2@chron.com,Female,Albania
3,4,Susi,Burstowe,sburstowe3@amazon.co.jp,Female,Russia
4,5,Currey,Loughead,cloughead4@eepurl.com,Male,China


#### Students Dataset

In [22]:
xml_path = '../scripts/students.xml'
student_df = pd.read_xml(xml_path)
print(student_df)

                       Name  Table
0  University_Students_Info    NaN


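The basic `read_xml` call does not return the student records: the result is a single row with `Name` and `Table` columns, because the records are nested several levels below the root. The function below walks the namespaced `Workbook/Worksheet/Table/Row/Cell/Data` hierarchy with `lxml` instead.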

In [23]:
from lxml import etree

In [24]:
def xml_to_df(file_path):
    # recover=True lets lxml keep parsing past malformed markup
    with open(file_path, 'r', encoding='utf-8') as f:
        doc = etree.parse(f, parser=etree.XMLParser(recover=True))

    # Namespace prefixes used in the XPath expressions below
    namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
                  'x': 'urn:schemas-microsoft-com:office:excel',
                  'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

    records = []
    ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
    if len(ws) > 0:
        # Only the first table of the first worksheet is read
        tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
        if len(tables) > 0:
            rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
            for row in rows:
                cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
                records.append([cell.text for cell in cells])

    # Nothing found: return an empty frame instead of failing on records[0]
    if not records:
        return pd.DataFrame()

    # The first row holds the column names; the remaining rows hold the data
    df = pd.DataFrame(records[1:], columns=records[0])

    return df

In [25]:
students_df = xml_to_df(xml_path)

In [26]:
students_df

Unnamed: 0,student_id,first_name,last_name,country_of_birth,major,scholarship,expected_grad_year
0,1,James,Taylor,USA,Biology,No,2025
1,2,Michelle,Johnson,England,Chemistry,Yes,2026
2,3,Wilson,Lee,China,Geography,Yes,2025


___
<a class="anchor" id="stats"></a>
## (3) Read Data from Statistical Software (SAS, SPSS, Stata)

___
### (i) SAS
- Data source (Stockton 96): http://www.principlesofeconometrics.com/sas.htm
- Saved as .sas7bdat files (or .xpt files)

In [27]:
sas_file_path = '../data/sas/stockton96.sas7bdat'

In [28]:
df = pd.read_sas(sas_file_path)

In [29]:
df.head()

Unnamed: 0,PRICE,SQFT,AGE
0,69000.0,1204.0,25.0
1,125000.0,1543.0,19.0
2,130000.0,1708.0,11.0
3,29500.0,1067.0,93.0
4,230000.0,2030.0,30.0


In [30]:
# Passing chunksize returns a reader object that yields DataFrames of up to chunksize rows at a time
with pd.read_sas(sas_file_path, chunksize=100) as rdr:
    for chunk in rdr:
        print(chunk.shape)

(100, 3)
(100, 3)
(100, 3)
(100, 3)
(100, 3)
(100, 3)
(100, 3)
(100, 3)
(100, 3)
(40, 3)


___
### (ii) SPSS
- Data source (UIS): https://stats.oarc.ucla.edu/other/examples/asa2/
- Saved as .sav files

In [31]:
spss_file_path = '../data/spss/uis.sav'

In [32]:
df = pd.read_spss(spss_file_path)

In [33]:
df.head()

Unnamed: 0,ID,age,becktota,hercoc,ivhx,ndrugtx,race,treat,site,los,time,censor
0,1.0,39.0,9.0,Neither Heroin nor Cocain,recent,1.0,white,long,A,123.0,188.0,returned to drugs or lost to follow-up
1,2.0,33.0,34.0,Neither Heroin nor Cocain,previous,8.0,white,long,A,25.0,26.0,returned to drugs or lost to follow-up
2,3.0,33.0,10.0,Heroin only,recent,3.0,white,long,A,7.0,207.0,returned to drugs or lost to follow-up
3,4.0,32.0,20.0,Neither Heroin nor Cocain,recent,1.0,white,short,A,66.0,144.0,returned to drugs or lost to follow-up
4,5.0,24.0,5.0,Heroin only,never,5.0,non-white,long,A,173.0,551.0,otherwise


In [64]:
# Convert categoricals (Default)
df = pd.read_spss(spss_file_path,
                  usecols=["age", "hercoc", "race"],
                  convert_categoricals=True
                 )
df.head()

Unnamed: 0,age,hercoc,race
0,39.0,Neither Heroin nor Cocain,white
1,33.0,Neither Heroin nor Cocain,white
2,33.0,Heroin only,white
3,32.0,Neither Heroin nor Cocain,white
4,24.0,Heroin only,non-white


In [34]:
# Not converting categoricals
df = pd.read_spss(spss_file_path,
                  usecols=["age", "hercoc", "race"],
                  convert_categoricals=False
                 )
df.head()

Unnamed: 0,age,hercoc,race
0,39.0,4.0,0.0
1,33.0,4.0,0.0
2,33.0,2.0,0.0
3,32.0,4.0,0.0
4,24.0,2.0,1.0
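
Comparing the two outputs above suggests how codes and labels line up (for example `4.0` with `Neither Heroin nor Cocain`). If a file was loaded with `convert_categoricals=False`, the labels could be applied afterwards along these lines. The mapping below is inferred only from the rows shown above, so it is an assumption and the file may define further labels.

```python
import pandas as pd

# Codes as returned with convert_categoricals=False (values from the output above)
coded = pd.DataFrame({"hercoc": [4.0, 4.0, 2.0], "race": [0.0, 0.0, 1.0]})

# Code-to-label pairs inferred from the two outputs above (assumed, possibly incomplete)
labels = {
    "hercoc": {4.0: "Neither Heroin nor Cocain", 2.0: "Heroin only"},
    "race": {0.0: "white", 1.0: "non-white"},
}

labelled = coded.copy()
for col, mapping in labels.items():
    # map() swaps codes for labels; astype("category") stores them compactly
    labelled[col] = coded[col].map(mapping).astype("category")

print(labelled)
```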


___
### (iii) Stata
- Data source (Colombia Voucher): https://stats.oarc.ucla.edu/other/examples/asa2/
- Saved as .dta files

In [35]:
stata_file_path = '../data/stata/ch10_dee.dta'

In [36]:
df = pd.read_stata(stata_file_path)

In [37]:
df.head()

Unnamed: 0,schoolid,hispanic,college,black,otherrace,female,register,distance
0,1032,0,1,1,0,1,1,4.0
1,1032,0,1,1,0,1,1,4.0
2,1032,0,0,0,1,1,1,4.0
3,1032,0,0,1,0,0,0,4.0
4,1032,0,0,1,0,1,0,4.0


In [38]:
# Passing chunksize returns a StataReader object that yields DataFrames of up to chunksize rows at a time
with pd.read_stata(stata_file_path, chunksize=200) as reader:
    for chunk in reader:
        print(chunk.shape)

(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(200, 8)
(27, 8)
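
Chunked reading is mostly useful when each chunk is processed and then combined. The following self-contained sketch writes a hypothetical five-row frame to a temporary `.dta` file (the column names are borrowed from the dataset above, the values are made up), reads it back two rows at a time, and concatenates the pieces.

```python
import os
import tempfile

import pandas as pd

# Hypothetical five-row frame, written to a temporary .dta file
sample = pd.DataFrame({
    "schoolid": [1032, 1032, 1032, 1032, 1032],
    "female": [1, 1, 1, 0, 1],
    "distance": [4.0, 4.0, 4.0, 4.0, 4.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    dta_path = os.path.join(tmp_dir, "sample.dta")
    sample.to_stata(dta_path, write_index=False)

    # Each iteration yields a DataFrame with at most `chunksize` rows
    with pd.read_stata(dta_path, chunksize=2) as reader:
        chunks = [chunk for chunk in reader]

chunk_shapes = [chunk.shape for chunk in chunks]
print(chunk_shapes)

# Stitch the chunks back together with a fresh 0..n-1 index
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```

In practice each chunk would usually be filtered or aggregated before the `concat`, so that the full file never has to sit in memory at once.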


In [39]:
df = pd.read_stata(stata_file_path,
                  columns=['schoolid', 'college', 'female'])
df.head()

Unnamed: 0,schoolid,college,female
0,1032,1,1
1,1032,1,1
2,1032,0,1
3,1032,0,0
4,1032,0,1


___
<a class="anchor" id="sql"></a>
## (4) Read Data from SQL Databases

In [40]:
import sqlalchemy

In [41]:
uri = "mysql+pymysql://root:password@127.0.0.1:3306/adv_pandas_ch2"

engine = sqlalchemy.create_engine(uri)

In [42]:
with engine.connect() as conn, conn.begin():
    df = pd.read_sql("insured_cars", conn)

In [43]:
df

Unnamed: 0,insured_car_id,car_make,car_model,car_model_year
0,JN1AZ4EH6FM680796,Audi,Quattro,1993
1,WBA3R1C5XFK004020,Chevrolet,Monte Carlo,1973
2,3TMJU4GN9FM186368,Volkswagen,Eurovan,1994
3,WAUXL58E95A915334,Volvo,C70,2003
4,WAUTFAFH9AN241216,Mitsubishi,Eclipse,1989
...,...,...,...,...
95,1GD11XEG5FF119356,Saab,9-3,1999
96,JH4NA21643T540325,Ford,Escape,2002
97,1FTSW2B52AE850743,Hyundai,Genesis,2009
98,3N1BC1AP7BL153775,Hummer,H2,2003


In [44]:
# Establish connection (with context manager)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql_table('insured_cars', conn, 
                            columns=['car_make', 'car_model'],
                            index_col='insured_car_id')

In [45]:
df

Unnamed: 0_level_0,car_make,car_model
insured_car_id,Unnamed: 1_level_1,Unnamed: 2_level_1
JN1AZ4EH6FM680796,Audi,Quattro
WBA3R1C5XFK004020,Chevrolet,Monte Carlo
3TMJU4GN9FM186368,Volkswagen,Eurovan
WAUXL58E95A915334,Volvo,C70
WAUTFAFH9AN241216,Mitsubishi,Eclipse
...,...,...
1GD11XEG5FF119356,Saab,9-3
JH4NA21643T540325,Ford,Escape
1FTSW2B52AE850743,Hyundai,Genesis
3N1BC1AP7BL153775,Hummer,H2


In [46]:
from sqlalchemy.sql import text

In [47]:
sql = "SELECT * FROM insured_cars WHERE car_model_year >= 2010 AND car_make = 'Rolls-Royce';"

with engine.connect() as conn:
    result = conn.execute(text(sql))
    # Fetch the rows while the connection is still open
    df = pd.DataFrame(result.fetchall(), columns=list(result.keys()))

In [48]:
df

Unnamed: 0,insured_car_id,car_make,car_model,car_model_year
0,SAJWA4DC1AM015816,Rolls-Royce,Phantom,2012
1,1G6KD54Y55U074446,Rolls-Royce,Phantom,2013
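
The filter values in the query above are written directly into the SQL string. A parameterized query keeps them separate from the statement. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database from the standard library instead of the MySQL server, with three rows copied from the outputs above. SQLite uses `?` placeholders; other drivers use a different placeholder style.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server used above
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE insured_cars ("
    "insured_car_id TEXT PRIMARY KEY, car_make TEXT, "
    "car_model TEXT, car_model_year INTEGER)"
)
# Three rows copied from the outputs above
conn.executemany(
    "INSERT INTO insured_cars VALUES (?, ?, ?, ?)",
    [
        ("JN1AZ4EH6FM680796", "Audi", "Quattro", 1993),
        ("SAJWA4DC1AM015816", "Rolls-Royce", "Phantom", 2012),
        ("1G6KD54Y55U074446", "Rolls-Royce", "Phantom", 2013),
    ],
)
conn.commit()

# Values are passed via params instead of being formatted into the SQL text
query = "SELECT * FROM insured_cars WHERE car_model_year >= ? AND car_make = ?"
df_cars = pd.read_sql_query(query, conn, params=(2010, "Rolls-Royce"))
conn.close()

print(df_cars)
```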


___
<a class="anchor" id="binary"></a>
## (5) Read Data from Binary Files

### (i) Feather
- Data source: https://github.com/lmassaron/datasets
- Saved as .feather files

In [3]:
feather_file_path = '../data/feather/wine_quality.feather'

In [4]:
df = pd.read_feather(feather_file_path)

In [5]:
df

Unnamed: 0,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,red_wine
0,2,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,1
1,2,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,1
2,2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,1
3,3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,1
4,2,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,3,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,0
6493,2,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,0
6494,3,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,0
6495,4,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,0


In [6]:
df = pd.read_feather(feather_file_path,
                    columns=['quality', 'density', 'pH', 'red_wine'])

In [7]:
df.head()

Unnamed: 0,quality,density,pH,red_wine
0,2,0.9978,3.51,1
1,2,0.9968,3.2,1
2,2,0.997,3.26,1
3,3,0.998,3.16,1
4,2,0.9978,3.51,1


___
### (ii) HDF5
- Data source: https://github.com/HDFGroup/hdf5-examples
- Saved as .h5 (or .hdf5) files

In [8]:
hdf5_file_path = '../data/hdf5/dummy_hdf5_store.h5'

In [9]:
# Load .h5 file into HDFStore object
hdf = pd.HDFStore(hdf5_file_path)

In [10]:
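# walk() returns a generator of (path, subgroups, leaves) tuples, one per group;
# calling it on its own only shows the generator object, so loop over it to list the hierarchy
hdf.walk()

<generator object HDFStore.walk at 0x000001F1622C42E0>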

In [14]:
# Display group (aka key) names
hdf.keys()

['/part_1', '/part_2']

In [11]:
df1 = hdf.get('part_1')
hdf.close()  # Release the file handle once the data has been loaded

In [12]:
df1

Unnamed: 0,A,B,C,D
gMQHq2fnCb,1.186074,-1.357741,0.201251,1.479924
Y26xNpPaqV,-1.981245,-0.527744,-0.506953,1.570476
lPKTEASOAL,2.733528,-0.152921,-2.199485,0.467001
jN1vjvYzhI,-0.016582,0.101883,0.729991,0.597424
2MsUMCpUJO,1.219564,-2.692658,-1.94098,1.046695
oQHtVkd1dL,0.510922,0.429455,1.098722,0.85524
6BCMfprflI,-0.42899,1.906072,1.674477,-0.492251
aqgRN47TrE,0.038902,-1.424061,0.679096,1.016213
gTwxHKEcfb,-1.16081,1.265717,0.844393,-0.581767
yVyTUkigxh,-0.03641,-1.730592,-0.841512,0.777245


___

In [42]:
df2 = pd.read_hdf(hdf5_file_path, 'part_2')

In [44]:
df2.head()

Unnamed: 0,A,B,C,D
PgnwthUEJi,1.232732,-0.187266,-0.756517,1.664979
7ue55ovN1m,-0.935272,0.217103,0.377266,-0.771104
DZhupfBOKW,0.661486,-0.738875,2.652743,0.427305
KqhCSv1gSe,-0.803123,-1.887771,0.845601,-1.126714
8AAkXCBFhq,-1.337643,-1.084606,0.448301,0.277181


___
### (iii) ORC
- Data source: https://github.com/apache/orc/tree/main/examples
- Saved as .orc files

In [17]:
import pyorc

In [30]:
orc_file_path = '../data/orc/over1k_bloom.orc'

In [31]:
with open(orc_file_path, 'rb') as orc_file:
    reader = pyorc.Reader(orc_file) # Generate Reader object
    orc_data = reader.read() # Read data contents in Reader object
    orc_schema = reader.schema # Retrieve schema (optional)

df = pd.DataFrame(data=orc_data)

In [32]:
len(df)

2098

In [33]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,124.0,336.0,65664,4294967435,74.720001,42.47,True,bob davidson,2013-03-01 09:11:58.703302+00:00,45.4,b'\x01yard duty\x02'
1,19.0,442.0,65553,4294967380,26.43,37.77,True,alice zipper,2013-03-01 09:11:58.703217+00:00,29.62,b'\x01history\x02'
2,35.0,387.0,65619,4294967459,96.910004,18.86,False,katie davidson,2013-03-01 09:11:58.703079+00:00,27.32,b'\x01history\x02'
3,111.0,372.0,65656,4294967312,13.01,34.95,False,xavier quirinius,2013-03-01 09:11:58.703310+00:00,23.91,b'\x01topology\x02'
4,54.0,317.0,65547,4294967409,60.709999,2.09,False,nick robinson,2013-03-01 09:11:58.703103+00:00,90.21,b'\x01geology\x02'
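
The frame above has integer column labels because the rows arrive as plain tuples, and the last column holds raw bytes. A possible clean-up step is sketched below on two trimmed rows from the output above. The column names are hypothetical placeholders; the real names would come from the file's schema.

```python
import pandas as pd

# Two rows shaped like the tuples above (only four of the eleven fields kept)
rows = [
    (124.0, 336.0, "bob davidson", b"\x01yard duty\x02"),
    (19.0, 442.0, "alice zipper", b"\x01history\x02"),
]

# Hypothetical field names; the real ones are defined in the ORC schema
names = ["col_a", "col_b", "name", "tag"]
df_named = pd.DataFrame(rows, columns=names)

# Decode the bytes column and strip the control characters around the text
df_named["tag"] = df_named["tag"].str.decode("utf-8").str.strip("\x01\x02")
print(df_named)
```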


___
### (iv) Parquet
- Data source: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet
- Saved as .parquet files

In [77]:
parquet_file_path = '../data/parquet/userdata1.parquet'

In [78]:
df = pd.read_parquet(parquet_file_path)

In [79]:
df

Unnamed: 0,registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments
0,2016-02-03 07:55:29,1,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116,Indonesia,3/8/1971,49756.53,Internal Auditor,1E+02
1,2016-02-03 17:04:03,2,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,
2,2016-02-03 01:09:31,3,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597,Russia,2/1/1960,144972.51,Structural Engineer,
3,2016-02-03 00:36:21,4,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625,China,4/8/1997,90263.05,Senior Cost Accountant,
4,2016-02-03 05:05:31,5,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850,South Africa,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2016-02-03 10:30:59,996,Dennis,Harris,dharrisrn@eepurl.com,Male,178.180.111.236,374288806662929,Greece,7/8/1965,263399.54,Editor,
996,2016-02-03 17:16:53,997,Gloria,Hamilton,ghamiltonro@rambler.ru,Female,71.50.39.137,,China,4/22/1975,83183.54,VP Product Management,
997,2016-02-03 05:02:20,998,Nancy,Morris,nmorrisrp@ask.com,,6.188.121.221,3553564071014997,Sweden,5/1/1979,,Junior Executive,
998,2016-02-03 02:41:32,999,Annie,Daniels,adanielsrq@squidoo.com,Female,97.221.132.35,30424803513734,China,10/9/1991,18433.85,Editor,​
