In this lecture, we focus on basic techniques for loading data into Python. Data loading is a big topic in any programming language, so here we only scratch the surface.

In [1]:
import numpy as np
import pandas as pd

We will focus on reading four types of data: CSV files, Excel files, HTML files, and SQL tables. To read all of these, the following libraries must be installed: 'sqlalchemy', 'lxml', 'html5lib', and 'BeautifulSoup4'. To install them, open the Anaconda prompt and run the following four commands one at a time:
1) conda install sqlalchemy
2) conda install lxml
3) conda install html5lib
4) conda install beautifulsoup4
conda installs each package unless it is already present. Afterwards, you can check that the libraries were installed properly with the command: conda list. Once the libraries are installed, restart your Jupyter notebook. If you don't have the Anaconda distribution, you can install the packages with 'pip install' instead.
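
As an alternative to scanning the output of conda list, the same check can be done from inside Python. The following is a minimal sketch and not part of the original lecture: it assumes the import names sqlalchemy, lxml, html5lib, and bs4 (bs4 is the name under which BeautifulSoup4 is imported), and the helper missing_modules is a hypothetical name chosen for illustration.

```python
import importlib.util

# Import names can differ from install names:
# the BeautifulSoup4 package is imported as 'bs4'.
REQUIRED = ["sqlalchemy", "lxml", "html5lib", "bs4"]

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

If anything is reported as missing, install it and restart the notebook before continuing.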

In [2]:
pwd # checking the default path

'C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\\Introduction to Data Science Using Python'

We first download the data set and move it to the appropriate folder. We then change the working directory to that folder and read in the files:

In [3]:
import os
path='C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\\Introduction to Data Science Using Python\\datasets'
os.chdir(path)

In [4]:
df = pd.read_csv('csv_example') # reading in the file
df

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


In [5]:
df.to_csv('ccsv_output_example', index=True) # writing the file; index=True also saves the row index (index=False omits it)
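
The effect of the index argument is easiest to see in a round trip. The sketch below is an illustration rather than part of the original notebook: it rebuilds a small frame with the same values as the CSV example and writes to in-memory strings instead of files, so the variable names here are assumptions.

```python
import io
import pandas as pd

# Same values as the CSV example: four columns a-d with a default integer index.
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
    columns=list("abcd"),
)

# With no path given, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv(index=True)      # row labels saved as an unnamed first column
without_index = df.to_csv(index=False)  # row labels left out

# Reading the first version back naively turns the labels into a data column.
naive = pd.read_csv(io.StringIO(with_index))
print(naive.columns.tolist())      # ['Unnamed: 0', 'a', 'b', 'c', 'd']

# index_col=0 restores them as the row index instead.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.columns.tolist())   # ['a', 'b', 'c', 'd']

# A file written with index=False needs no special handling.
clean = pd.read_csv(io.StringIO(without_index))
print(clean.columns.tolist())      # ['a', 'b', 'c', 'd']
```

In short, either write with index=False or read with index_col=0; mixing the defaults on both sides is what produces a stray 'Unnamed: 0' column.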

In [6]:
df2 = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1') # for Excel files, each sheet is read as a data frame
df2

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


In [7]:
df2.to_excel('Excel_Sample_output.xlsx',sheet_name='MySheet')

Now let's move on to HTML. Reading HTML relies on the libraries mentioned earlier, so you may need to install the following three: 'html5lib', 'lxml', and 'BeautifulSoup4'. Below is an example. This type of data import is used a lot in web scraping:

In [8]:
html_list = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
html_list

[                                             Bank Name                City  \
 0                  Washington Federal Bank for Savings             Chicago   
 1      The Farmers and Merchants State Bank of Argonia             Argonia   
 2                                  Fayette County Bank          Saint Elmo   
 3    Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee   
 4                                       First NBC Bank         New Orleans   
 5                                        Proficio Bank  Cottonwood Heights   
 6                        Seaway Bank and Trust Company             Chicago   
 7                               Harvest Community Bank          Pennsville   
 8                                          Allied Bank            Mulberry   
 9                         The Woodbury Banking Company            Woodbury   
 10                              First CornerStone Bank     King of Prussia   
 11                                  Trust Company B

Notice that the read_html() function reads the tables on a webpage and returns a list of 'DataFrame' objects, one for each table it finds. To retrieve the table we want, we index into the list:

In [9]:
df3=html_list[0]
df3.head(20)

,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,"December 15, 2017","February 1, 2019"
1,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,"October 13, 2017","February 21, 2018"
2,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb","May 26, 2017","January 29, 2019"
3,"Guaranty Bank, (d/b/a BestBank in Georgia & Mi...",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,"May 5, 2017","March 22, 2018"
4,First NBC Bank,New Orleans,LA,58302,Whitney Bank,"April 28, 2017","January 29, 2019"
5,Proficio Bank,Cottonwood Heights,UT,35495,Cache Valley Bank,"March 3, 2017","January 29, 2019"
6,Seaway Bank and Trust Company,Chicago,IL,19328,State Bank of Texas,"January 27, 2017","January 29, 2019"
7,Harvest Community Bank,Pennsville,NJ,34951,First-Citizens Bank & Trust Company,"January 13, 2017","May 18, 2017"
8,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","January 29, 2019"
9,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","December 13, 2018"
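
Indexing with [0] works when the table of interest happens to come first, but a page may contain several tables. One possible approach, sketched below, is to select the table by its column names. The helper pick_table is a hypothetical name, and the list tables is a hand-built stand-in for what read_html returns, filled with two rows copied from the output above, so that the example runs without network access.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame in `tables` that has all `required_columns`."""
    wanted = set(required_columns)
    for table in tables:
        if wanted.issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {sorted(wanted)}")

# Hypothetical stand-in for a read_html result from a page with two tables.
tables = [
    pd.DataFrame({"Menu": ["Home", "About"]}),
    pd.DataFrame({
        "Bank Name": ["Allied Bank", "Proficio Bank"],
        "City": ["Mulberry", "Cottonwood Heights"],
        "ST": ["AR", "UT"],
    }),
]

banks = pick_table(tables, ["Bank Name", "City"])
print(banks.shape)  # (2, 3)
```

read_html also accepts a match argument that keeps only tables containing text matching a given string or regular expression, which can narrow the list before any indexing is needed.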


Once we have read in the data, we often want to examine it at the metadata level, much like running PROC CONTENTS in SAS. In Python we can do this with the info() method:

In [10]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555 entries, 0 to 554
Data columns (total 7 columns):
Bank Name                555 non-null object
City                     555 non-null object
ST                       555 non-null object
CERT                     555 non-null int64
Acquiring Institution    555 non-null object
Closing Date             555 non-null object
Updated Date             555 non-null object
dtypes: int64(1), object(6)
memory usage: 30.4+ KB
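
info() prints its report rather than returning it. When the same metadata is needed as data, for example to check column types in a script, the attributes behind the report can be used directly. The sketch below is an illustration under stated assumptions: df3_sample is a three-row stand-in built from rows of the bank table, and describe_columns is a hypothetical helper name.

```python
import pandas as pd

# Small stand-in with a few rows copied from the bank table.
df3_sample = pd.DataFrame({
    "Bank Name": ["First NBC Bank", "Proficio Bank", "Allied Bank"],
    "ST": ["LA", "UT", "AR"],
    "CERT": [58302, 35495, 91],
})

def describe_columns(df):
    """Summarize each column: its dtype and its count of non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
    })

summary = describe_columns(df3_sample)
print(df3_sample.shape)  # (number of rows, number of columns)
print(summary)
```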


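SQL tables, the fourth data source named at the start of this lecture, can be read with pd.read_sql. The following is a minimal sketch, not the lecture's own code: to stay runnable without a database server it uses Python's built-in sqlite3 module and an in-memory database instead of SQLAlchemy, and the table name example_table and its contents are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so nothing needs to be installed or cleaned up.
conn = sqlite3.connect(":memory:")

# Hypothetical table name and data, used only for this illustration.
source = pd.DataFrame({"a": [0, 4, 8], "b": [1, 5, 9]})
source.to_sql("example_table", conn, index=False)

# Read the whole table back, then a filtered subset through a SQL query.
all_rows = pd.read_sql("SELECT * FROM example_table", conn)
subset = pd.read_sql("SELECT a, b FROM example_table WHERE a > 0", conn)
print(subset)

conn.close()
```

For other databases, an engine created with sqlalchemy.create_engine would be passed in place of conn; the connection string depends on the SQL dialect and driver in use.
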
Python can also handle other types of data, such as SAS datasets, SQL tables, PDFs, and Word documents. For SQL, 'sqlalchemy' provides the database connection that pandas uses to import and export data. You can find an overview of the supported drivers for each SQL dialect in the 'SQLAlchemy' documentation. Below is an example that reads a SAS dataset:

In [11]:
df4 = pd.read_sas('Medicare.sas7bdat')
df4.info()
print(df4.head(20))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 9 columns):
STATE       324 non-null float64
YEAR        324 non-null float64
TOT_CHG     324 non-null float64
COV_CHG     324 non-null float64
MED_REIB    324 non-null float64
TOT_D       324 non-null float64
NUM_DCHG    324 non-null float64
AVE_T_D     324 non-null float64
NMSTATE     324 non-null object
dtypes: float64(8), object(1)
memory usage: 22.9+ KB
    STATE  YEAR       TOT_CHG       COV_CHG      MED_REIB      TOT_D  \
0     1.0   1.0  2.211617e+09  2.170240e+09  9.727529e+08  1932673.0   
1     1.0   2.0  2.523987e+09  2.468264e+09  1.046016e+09  1936939.0   
2     1.0   3.0  2.975970e+09  2.922612e+09  1.205792e+09  2016354.0   
3     1.0   4.0  3.194595e+09  3.149746e+09  1.307983e+09  1948427.0   
4     1.0   5.0  3.417705e+09  3.384305e+09  1.376212e+09  1926335.0   
5     1.0   6.0  3.519375e+09  3.492636e+09  1.466221e+09  1847216.0   
6     2.0   1.0  6.474776e+07  6.224228e+07