# Data Structures in Pandas

Pandas has two main data structures:
* Series
* DataFrames

A Series represents data in one-dimensional form, while a DataFrame represents data in two-dimensional tabular form.

Series - a one-dimensional array with labels

Python dictionaries - data structures for storing key-value pairs; when a Series is built from a dictionary, the keys become the labels

In [1]:
import pandas as pd

In [2]:
# avoid naming the variable `dict`, which would shadow the built-in type
data = {'a': 3, 'b': 'cat', 'c': 2.5}
pd.Series(data)

a      3
b    cat
c    2.5
dtype: object
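
The `dtype: object` above appears because the dictionary mixes an integer, a string, and a float, so no single numeric type fits all the values. A minimal sketch of that behaviour, using a second made-up dictionary (`numeric`) that is an assumption for illustration only:

```python
import pandas as pd

# Same values as the cell above: int, str and float mixed together.
mixed = pd.Series({'a': 3, 'b': 'cat', 'c': 2.5})

# Hypothetical all-numeric variant, used only for comparison.
numeric = pd.Series({'a': 3, 'b': 4, 'c': 2.5})

print(mixed.dtype)    # generic object dtype
print(numeric.dtype)  # ints and floats are stored together as float64

# Values are looked up by label, much like dictionary keys.
print(numeric['c'])
```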

# Creating a Series from two lists

In [3]:
# the first list holds the values, the second (index=) holds the labels
val = pd.Series([100, 'cat', 310, 'gog', 500], index=['Amy', 'Bobby', 'Cat', 'Don', 'Emma'])
val

Amy      100
Bobby    cat
Cat      310
Don      gog
Emma     500
dtype: object

In [4]:
# loc is a label-location based indexer for selection by labels 
val.loc[['Cat', 'Emma']]

Cat     310
Emma    500
dtype: object

In [5]:
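# select by integer position with iloc; plain val[[0, 3, 4]] relies on a
# positional fallback that recent pandas versions deprecate for labelled Series
val.iloc[[0, 3, 4]]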

Amy     100
Don     gog
Emma    500
dtype: object

In [6]:
# iloc is primarily integer based (from 0 to length-1 of the axis)
val.iloc[1]

'cat'
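
The difference between `loc` and `iloc` is easiest to see when the labels are themselves integers that do not match the positions. The sketch below uses a made-up Series `s` (not part of the notebook's data) to illustrate it:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2.
s = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

by_label = s.loc[10]      # row whose label is 10
by_position = s.iloc[0]   # row at position 0 (the same row here)
last = s.iloc[-1]         # negative positions count from the end

# loc slices include the end label; iloc slices exclude the end position.
label_slice = s.loc[10:20]    # labels 10 and 20
position_slice = s.iloc[0:1]  # position 0 only

print(by_label, by_position, last)
print(len(label_slice), len(position_slice))
```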

In [7]:
# `in` tests the Series index labels (not the values), and the match is case-sensitive

'cat' in val

False

In [8]:
'Cat' in val

True
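
Because `in` only looks at the labels, checking whether a value is stored in the Series needs a different expression. A short sketch, rebuilding `val` from the cell above so that it runs on its own:

```python
import pandas as pd

val = pd.Series([100, 'cat', 310, 'gog', 500],
                index=['Amy', 'Bobby', 'Cat', 'Don', 'Emma'])

in_index = 'Cat' in val            # 'Cat' is an index label
in_index_lower = 'cat' in val      # no label is spelled 'cat'
in_values = 'cat' in val.tolist()  # 'cat' is one of the stored values
any_match = val.isin(['cat']).any()  # same check, done element-wise by pandas

print(in_index, in_index_lower, in_values, any_match)
```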

DataFrames - a 2D data structure that stores data in tabular form (rows and columns)
* <class 'pandas.core.frame.DataFrame'>

In [9]:
d = {'A': pd.Series([100., 200., 300.], index = ['apple','pear','orange']),
     'B': pd.Series([111.,222.,333.,444.], index = ['apple','pear','orange','melon'])}

In [10]:
df = pd.DataFrame(d)
print(df)

            A      B
apple   100.0  111.0
melon     NaN  444.0
orange  300.0  333.0
pear    200.0  222.0


In [11]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [12]:
df.index

Index(['apple', 'melon', 'orange', 'pear'], dtype='object')

In [13]:
df.columns

Index(['A', 'B'], dtype='object')

In [14]:
pd.DataFrame(df, index=['orange', 'melon','apple'], columns=['A'])

            A
orange  300.0
melon     NaN
apple   100.0
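
Selecting and reordering rows and columns by label, as in the cell above, can also be written with `reindex`. The sketch below is one possible alternative rather than the notebook's own method; the label `'kiwi'` is made up to show what happens with a label that does not exist:

```python
import pandas as pd

d = {'A': pd.Series([100., 200., 300.], index=['apple', 'pear', 'orange']),
     'B': pd.Series([111., 222., 333., 444.],
                    index=['apple', 'pear', 'orange', 'melon'])}
df = pd.DataFrame(d)

# reindex selects and orders rows and columns by label; missing cells stay NaN.
subset = df.reindex(index=['orange', 'melon', 'apple'], columns=['A'])
print(subset)

# A label absent from the frame ('kiwi', hypothetical) gives a row of NaN
# instead of raising an error.
with_new = df.reindex(index=['apple', 'kiwi'])
print(with_new)
```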


# Read csv files

In [15]:
df = pd.read_csv('Resp2.csv')

In [16]:
df.head()

   experience  respiration
0           0         3.94
1           0         4.26
2           0         4.16
3           0         3.76
4           0         4.07


In [17]:
# this .csv file separates its fields with semicolons rather than commas, so pass sep=';'
df2 = pd.read_csv('winequality-red.csv', sep =';')
df2.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
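
The effect of the `sep` argument can be checked without the file itself. The sketch below parses a small in-memory sample (three columns copied from the output above, wrapped in `io.StringIO` as a stand-in for the real file) with and without `sep=';'`:

```python
import io
import pandas as pd

# Two-row sample in the same semicolon-delimited layout (stand-in for the file).
text = 'fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n'

wrong = pd.read_csv(io.StringIO(text))           # default sep=','
right = pd.read_csv(io.StringIO(text), sep=';')  # matches the file's delimiter

print(wrong.shape)   # every line ends up in a single column
print(right.shape)   # three separate columns
print(right.columns.tolist())
```

If a freshly loaded frame has only one column whose name contains the delimiter, the `sep` value is the first thing worth checking.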


# Reading txt files

In [18]:
df3 = pd.read_csv('bostonTxt.txt', sep = '\t')
df3.head()

     MV  INDUS   NOX     RM  TAX    PT  LSTAT
0  24.0   2.31  53.8  6.575  296  15.3   4.98
1  21.6   7.07  46.9  6.421  242  17.8   9.14
2  34.7   7.07  46.9  7.185  242  17.8   4.03
3  33.4   2.18  45.8  6.998  222  18.7   2.94
4  36.2   2.18  45.8  7.147  222  18.7   5.33


# Reading xls files

An advantage of xls files over csv and txt files is that a single workbook can hold multiple sheets.

In [32]:
xl = pd.ExcelFile(r'C:\Users\hsripuram\Desktop\MLPractice\PreProcessing\boston1.xls')
print(xl.sheet_names)

['Sheet1', 'Sheet2']


In [34]:
df5 = xl.parse('Sheet1')

df5.head() 

Unnamed: 0,MV,INDUS,NOX,RM,TAX,PT,LSTAT,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,24.0,2.31,53.8,6.575,296,15.3,4.98,,,Subset of Boston housing tract
1,21.6,7.07,46.9,6.421,242,17.8,9.14,,,data of Harrison and Rubinfeld
2,34.7,7.07,46.9,7.185,242,17.8,4.03,,,(1978). Each case is one U.S.
3,33.4,2.18,45.8,6.998,222,18.7,2.94,,,Census tract in the Boston area.
4,36.2,2.18,45.8,7.147,222,18.7,5.33,,,
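
The sheet above carries extra `Unnamed: N` columns, which come from spreadsheet cells to the right of the data that have no header. One way to drop them after loading is sketched below on a small made-up frame (`raw`) that imitates this layout; it is an illustration, not a step from the original notebook:

```python
import pandas as pd

# Small stand-in frame: two data columns plus header-less spreadsheet columns.
raw = pd.DataFrame({
    'MV': [24.0, 21.6],
    'LSTAT': [4.98, 9.14],
    'Unnamed: 7': [None, None],
    'Unnamed: 9': ['note line 1', 'note line 2'],  # placeholder text
})

# Keep only the columns whose name does not start with 'Unnamed'.
keep = ~raw.columns.str.startswith('Unnamed')
clean = raw.loc[:, keep]

print(clean.columns.tolist())
```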


# Web Scraping

In [35]:
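# pd.read_html needs an HTML parser to be installed (lxml, or bs4 together with html5lib);
# this import only confirms that html5lib is available
import html5lib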

In [36]:
uss = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

In [37]:
print(type(uss)) # read_html returns a list of DataFrames, one per table found

<class 'list'>


In [38]:
u = uss[0]
print(u)

                            Name postal abbreviation[1]          Cities  \
                            Name postal abbreviation[1]         Capital   
0                        Alabama                     AL      Montgomery   
1                         Alaska                     AK          Juneau   
2                        Arizona                     AZ         Phoenix   
3                       Arkansas                     AR     Little Rock   
4                     California                     CA      Sacramento   
5                       Colorado                     CO          Denver   
6                    Connecticut                     CT        Hartford   
7                       Delaware                     DE           Dover   
8                        Florida                     FL     Tallahassee   
9                        Georgia                     GA         Atlanta   
10                        Hawaii                     HI        Honolulu   
11                       

In [39]:
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

df6 = pd.read_html(url)

print(type(df6))

<class 'list'>


In [40]:
df0 = df6[0]
print(df0)

                               Bank Name        City  ST   CERT  \
0       City National Bank of New Jersey      Newark  NJ  21111   
1                          Resolute Bank      Maumee  OH  58317   
2                  Louisa Community Bank      Louisa  KY  58112   
3                   The Enloe State Bank      Cooper  TX  10716   
4    Washington Federal Bank for Savings     Chicago  IL  30570   
..                                   ...         ...  ..    ...   
554                   Superior Bank, FSB    Hinsdale  IL  32646   
555                  Malta National Bank       Malta  OH   6629   
556      First Alliance Bank & Trust Co.  Manchester  NH  34264   
557    National State Bank of Metropolis  Metropolis  IL   3815   
558                     Bank of Honolulu    Honolulu  HI  21029   

                   Acquiring Institution       Closing Date       Updated Date  
0                        Industrial Bank   November 1, 2019   November 7, 2019  
1                     Buckeye Sta

In [41]:
newUrl = pd.read_html('https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_South_Korea')

In [42]:
newUr = newUrl[0]
print(newUr)

                                                 Site  Image  \
0                Seokguram Grotto and Bulguksa Temple    NaN   
1   Haeinsa Temple Janggyeong Panjeon, the Deposit...    NaN   
2                                      Jongmyo Shrine    NaN   
3                        Changdeokgung Palace Complex    NaN   
4                                   Hwaseong Fortress    NaN   
5            Gochang, Hwasun and Ganghwa Dolmen Sites    NaN   
6                             Gyeongju Historic Areas    NaN   
7                 Jeju Volcanic Island and Lava Tubes    NaN   
8                   Royal Tombs of the Joseon Dynasty    NaN   
9      Historic Villages of Korea: Hahoe and Yangdong    NaN   
10                                     Namhansanseong    NaN   
11                              Baekje Historic Areas    NaN   
12      Sansa, Buddhist Mountain Monasteries in Korea    NaN   
13             Seowon, Korean Neo-Confucian Academies    NaN   

                                       

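Each example above uses only the first table on the page (index 0). When a page contains more than one table, read_html returns one DataFrame per table, and each one is selected by its position in the list: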

In [44]:
pines = pd.read_html('https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_Philippines')
pine = pines[0]
print(pine)

                                                Name  Image  \
0  Baroque Churches of the Philippines: San Agust...    NaN   
1                       Tubbataha Reefs Natural Park    NaN   
2  Rice Terraces of the Philippine Cordilleras: B...    NaN   
3                             Historic City of Vigan    NaN   
4   Puerto Princesa Subterranean River National Park    NaN   
5          Mount Hamiguitan Range Wildlife Sanctuary    NaN   

                                           Location  \
0  City of Manila; Ilocos Sur; Ilocos Norte; Iloilo   
1                                           Palawan   
2                                            Ifugao   
3                                        Ilocos Sur   
4                                           Palawan   
5                                    Davao Oriental   

                              Regions  \
0  Manila; Ilocandia; Western Visayas   
1                            Mimaropa   
2                          Cordillera   
3       

In [45]:
pine1 = pines[1]
print(pine1)

                      Type (criteria)  \
0               Natural: (vii)(ix)(x)   
1       Cultural: (i)(iii)(iv)(v)(vi)   
2                               Mixed   
3              Cultural: (iii)(iv)(v)   
4                Natural: (vii)(viii)   
5                 Mixed: (iii)(ix)(x)   
6                    Natural: (ix)(x)   
7   Cultural: (i)(ii)(iii)(iv)(v)(vi)   
8                   Natural: (vii)(x)   
9                    Natural: (ix)(x)   
10              Natural: (vii)(ix)(x)   
11                   Natural: (ix)(x)   
12                   Natural: (ix)(x)   
13         Cultural: (ii)(iii)(iv)(v)   
14                   Natural: (ix)(x)   
15         Cultural: (ii)(iii)(iv)(v)   
16                    Cultural: (iii)   
17         Cultural: (ii)(iii)(iv)(v)   
18                   Natural: (ix)(x)   

                                                 Site  \
0                               Apo Reef Natural Park   
1   Baroque Churches of the Philippines (Extension...   
2       

In [46]:
pine2 = pines[2]
print(pine2)

  vteWorld Heritage Sites in the Philippines  \
0                                   Cultural   
1                                    Natural   

        vteWorld Heritage Sites in the Philippines.1  \
0  Baroque Churches of the Philippines (San Agust...   
1  Mount Hamiguitan Range Wildlife Sanctuary Puer...   

   vteWorld Heritage Sites in the Philippines.2  
0                                           NaN  
1                                           NaN  


# Read JSON data

In [50]:
# a raw string (r'...') keeps the backslashes of a Windows path literal; without the
# r prefix, '\U' starts a Unicode escape and Python raises
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
df7 = pd.read_json(r'C:\Users\hsripuram\Desktop\MLPractice\PreProcessing\example_1.json')

In [51]:
df8 = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')

In [52]:
df8.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,assignee,assignees,milestone,comments,created_at,updated_at,closed_at,author_association,pull_request,body
0,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/29840,528277651,MDExOlB1bGxSZXF1ZXN0MzQ1MzY5Nzc5,29840,DEPR: remove FrozenNDarray,...,,[],,0,2019-11-25 19:02:52+00:00,2019-11-25 19:02:52+00:00,NaT,MEMBER,{'url': 'https://api.github.com/repos/pandas-d...,Reboots #29335.
1,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/29839,528239797,MDU6SXNzdWU1MjgyMzk3OTc=,29839,PERF: improve Integer/BooleanArray astype to n...,...,,[],{'url': 'https://api.github.com/repos/pandas-d...,1,2019-11-25 17:45:38+00:00,2019-11-25 18:01:59+00:00,NaT,MEMBER,,"Currently, in the IntegerArray or BooleanArray..."
2,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/29838,528237789,MDU6SXNzdWU1MjgyMzc3ODk=,29838,PERF: improve conversion to BooleanArray from ...,...,,[],{'url': 'https://api.github.com/repos/pandas-d...,0,2019-11-25 17:41:38+00:00,2019-11-25 17:41:39+00:00,NaT,MEMBER,,"Currently, the creation of a BooleanArray from..."
3,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/29837,528235995,MDU6SXNzdWU1MjgyMzU5OTU=,29837,groupby() drops categorical columns when aggre...,...,,[],,2,2019-11-25 17:37:52+00:00,2019-11-25 18:16:28+00:00,NaT,NONE,,"#### Code Sample, a copy-pastable example if p..."
4,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/29836,528196106,MDExOlB1bGxSZXF1ZXN0MzQ1MzA1MDYw,29836,XLSB support,...,,[],,1,2019-11-25 16:26:33+00:00,2019-11-25 18:58:55+00:00,NaT,NONE,{'url': 'https://api.github.com/repos/pandas-d...,"Hey all, a moderately commonly requested featu..."
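
Some columns in the output above, such as `milestone` and `pull_request`, hold whole dictionaries in each cell because the JSON is nested. A sketch of flattening such data with `pd.json_normalize` follows; the records are made up to mimic the shape of the API response and are not real data:

```python
import pandas as pd

# Made-up records with one nested dictionary each (not real API output).
records = [
    {'number': 1, 'title': 'first issue',
     'milestone': {'url': 'https://example.com/m/1', 'title': 'v1'}},
    {'number': 2, 'title': 'second issue',
     'milestone': {'url': 'https://example.com/m/2', 'title': 'v2'}},
]

# Each nested key becomes its own column, named '<outer>.<inner>'.
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat['milestone.title'].tolist())
```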
