# Data Acquisition in Python

---

### Essential Libraries

Let us begin by importing the essential Python libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  

In [4]:
# Basic Libraries
import numpy as np
import pandas as pd

---

### Pandas Dataframe

The `pandas` library in Python offers a core data structure for data science: the `DataFrame`.    
A `DataFrame` behaves much like a `dictionary` that maps column names to columns of values, so we will start by creating a `DataFrame` from a `dictionary`.

In [5]:
canteens_dict = {"Name" : ["North Spine", "Koufu", "Canteen 9", "North Hill", "Canteen 11"],
                 "Stalls" : [20, 15, 10, 12, 8],
                 "Rating" : [4.5, 4.2, 4.0, 3.7, 4.2]
                }

canteens_df = pd.DataFrame(canteens_dict)
print(canteens_df)

          Name  Stalls  Rating
0  North Spine      20     4.5
1        Koufu      15     4.2
2    Canteen 9      10     4.0
3   North Hill      12     3.7
4   Canteen 11       8     4.2
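
To make the dictionary analogy concrete, here is a minimal sketch using a shortened copy of the canteen data above. It only illustrates how keys map to column labels and how the table can be converted back; it is not a complete tour of the constructor.

```python
import pandas as pd

# Shortened copy of the canteen dictionary used above
canteens_dict = {"Name": ["North Spine", "Koufu", "Canteen 9"],
                 "Stalls": [20, 15, 10],
                 "Rating": [4.5, 4.2, 4.0]}
canteens_df = pd.DataFrame(canteens_dict)

# Dictionary keys become the column labels, in insertion order
print(list(canteens_df.columns))

# Each column gets a data type inferred from its values
print(canteens_df.dtypes)

# Converting back to a dictionary of lists recovers the original data
print(canteens_df.to_dict(orient="list") == canteens_dict)
```

All lists in the dictionary must have the same length, since each list becomes one column of the table.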


To access a column of the `DataFrame`, index it with the column name; the result is a Pandas `Series`.

In [6]:
canteens_df["Name"]

0    North Spine
1          Koufu
2      Canteen 9
3     North Hill
4     Canteen 11
Name: Name, dtype: object

You may also extract a single record or row from a `DataFrame` -- use `iloc` with the row's integer position (counting from 0).

In [7]:
canteens_df.iloc[0]

Name      North Spine
Stalls             20
Rating            4.5
Name: 0, dtype: object

Thus, a Pandas `DataFrame` is really like a table, with structured data accessible in two ways: by column (using the column name) and by row (using `iloc` with the row position).

In [8]:
canteens_df.iloc[1]

Name      Koufu
Stalls       15
Rating      4.2
Name: 1, dtype: object
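
The access patterns shown so far can be summarised in one short sketch. It reuses part of the canteen data; the `loc` lookup and the row slice are additions for illustration and do not appear in the cells above.

```python
import pandas as pd

canteens_df = pd.DataFrame({"Name": ["North Spine", "Koufu", "Canteen 9"],
                            "Stalls": [20, 15, 10],
                            "Rating": [4.5, 4.2, 4.0]})

# Column access by label returns a Series
ratings = canteens_df["Rating"]

# Row access by integer position also returns a Series
second_row = canteens_df.iloc[1]

# A single cell can be read by row label and column label with loc
koufu_rating = canteens_df.loc[1, "Rating"]

# Slicing rows by position returns a smaller DataFrame
first_two = canteens_df.iloc[0:2]

print(second_row["Name"], koufu_rating)
print(first_two)
```

Here the row labels happen to equal the row positions because the `DataFrame` uses the default integer index; with a custom index, `loc` (labels) and `iloc` (positions) can differ.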

---

### Import CSV file into a DataFrame

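If the dataset is in a standard CSV format (flat file), we may use the `read_csv` function from Pandas.   
Note that `header = None` tells Pandas that the file has no header row, so the column names in the first line of the file are read as an ordinary data row (row 0 in the output) and are counted in the number of rows.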

In [10]:
csv_data = pd.read_csv('./train.csv', header = None)
csv_data.head()

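|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| 1 | 1 | 60 | RL | 65 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 2 | 20 | RL | 80 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 3 | 60 | RL | 68 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 4 | 70 | RL | 60 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |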


In [12]:
print("Data type : ", type(csv_data))
print("Data dims : ", csv_data.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (1461, 81)
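
The effect of the `header` argument can be seen on a tiny in-memory CSV. The sketch below assumes a four-column excerpt built from values visible in the output above and uses `io.StringIO` in place of a file on disk, so it runs without `train.csv`.

```python
import io
import pandas as pd

# Small excerpt (assumed subset of columns) standing in for the real file
csv_text = """Id,MSZoning,LotArea,SalePrice
1,RL,8450,208500
2,RL,9600,181500
"""

# header=None: the first line is treated as data, columns are numbered 0, 1, 2, ...
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)      # (3, 4)

# Default (header=0): the first line supplies the column names
named = pd.read_csv(io.StringIO(csv_text))
print(named.shape)    # (2, 4)
print(named["SalePrice"].sum())
```

With the default setting the numeric columns are parsed as numbers, whereas with `header = None` every column also contains a column name and is therefore stored as text.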


---

### Import TXT file into a DataFrame

If the dataset is in a standard TXT format (flat file), we may use the `read_table` function from Pandas.   

In [None]:
# Raw string so that "\s" reaches the regex engine unchanged (one or more whitespace characters)
txt_data = pd.read_table('data/somedata.txt', sep = r"\s+", header = None)
txt_data.head()

In [None]:
print("Data type : ", type(txt_data))
print("Data dims : ", txt_data.shape)
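
Since `data/somedata.txt` is only a placeholder path, here is a minimal sketch of the same call on an in-memory text sample. The sample rows and the column names passed through `names` are assumptions made for illustration.

```python
import io
import pandas as pd

# Whitespace-separated sample; values must not contain spaces themselves,
# hence the underscores in the names
text = """North_Spine   20  4.5
Koufu          15  4.2
Canteen_9      10  4.0
"""

# sep=r"\s+" splits on any run of spaces or tabs;
# names supplies column labels because the sample has no header line
txt_data = pd.read_table(io.StringIO(text), sep=r"\s+", header=None,
                         names=["Name", "Stalls", "Rating"])

print(txt_data)
print("Data dims : ", txt_data.shape)
```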

---

### Import XLS file into a DataFrame

If the dataset is in a Microsoft XLS or XLSX format, we may use the `read_excel` function from Pandas.    
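However, `read_excel` relies on an additional module: `openpyxl` for `.xlsx` files, or `xlrd` for the legacy `.xls` format (recent versions of `xlrd` no longer read `.xlsx`). You can install either one using Anaconda.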

In [2]:
xls_data = pd.read_excel('data/somedata.xlsx', sheet_name = 'Sheet1', header = None)
xls_data.head()

In [1]:
print("Data type : ", type(xls_data))
print("Data dims : ", xls_data.shape)
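
As `data/somedata.xlsx` is a placeholder, the sketch below writes a small workbook to an in-memory buffer and reads it back. It assumes the `openpyxl` module is available and simply skips the round trip otherwise; the sample data is reused from the canteen example.

```python
import importlib.util
import io
import pandas as pd

sample = pd.DataFrame({"Name": ["North Spine", "Koufu"],
                       "Stalls": [20, 15]})

if importlib.util.find_spec("openpyxl") is None:
    # Without an Excel engine, pandas cannot read or write .xlsx files
    print("openpyxl is not installed; skipping the Excel round trip")
    xls_data = None
else:
    # Write the sample to an in-memory .xlsx workbook, then read it back
    buffer = io.BytesIO()
    sample.to_excel(buffer, sheet_name="Sheet1", index=False, engine="openpyxl")
    buffer.seek(0)
    xls_data = pd.read_excel(buffer, sheet_name="Sheet1", engine="openpyxl")
    print(xls_data)
    print("Data dims : ", xls_data.shape)
```

Passing `sheet_name` selects one worksheet by name; the first row of that sheet is used as the header unless `header = None` is given.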

---

### Import JSON file into a DataFrame

If the dataset is in a standard JSON format, we may use the `read_json` function from Pandas.    

In [None]:
json_data = pd.read_json('data/somedata.json')
json_data.head()

In [None]:
print("Data type : ", type(json_data))
print("Data dims : ", json_data.shape)
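
A minimal sketch of `read_json` on an in-memory document follows. The JSON text is an assumed example in the common "list of records" layout, wrapped in `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

# Each JSON object becomes one row; the keys become the column names
json_text = """[
  {"Name": "North Spine", "Stalls": 20, "Rating": 4.5},
  {"Name": "Koufu",       "Stalls": 15, "Rating": 4.2}
]"""

json_data = pd.read_json(io.StringIO(json_text))

print(json_data)
print("Data dims : ", json_data.shape)
```

Deeply nested JSON usually does not map onto a flat table directly and may need to be flattened first.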

---

### Import HTML table into a DataFrame

If the dataset is in a table format within an HTML web page, we may use the `read_html` function from Pandas, which returns a list of `DataFrame` objects, one per table found on the page.    
Let's try to get the Cast of Kung-Fu Panda : http://www.imdb.com/title/tt0441773/fullcredits/?ref_=tt_ov_st_sm

In [None]:
html_data = pd.read_html('http://www.imdb.com/title/tt0441773/fullcredits/?ref_=tt_ov_st_sm')

In [None]:
print("Data type : ", type(html_data))
print("HTML tables : ", len(html_data))

In [None]:
html_data[2].head()
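
Because the structure of a live web page can change, here is a minimal offline sketch of the same idea. The HTML snippet is an assumed example with two small tables, and the code assumes an HTML parser such as `lxml` is installed; if none is available, it falls back to an empty list.

```python
import io
import pandas as pd

# Two small tables in one HTML document (made-up markup for illustration)
html = """
<table>
  <thead><tr><th>Name</th><th>Stalls</th></tr></thead>
  <tbody>
    <tr><td>North Spine</td><td>20</td></tr>
    <tr><td>Koufu</td><td>15</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Name</th><th>Rating</th></tr></thead>
  <tbody>
    <tr><td>North Spine</td><td>4.5</td></tr>
    <tr><td>Koufu</td><td>4.2</td></tr>
  </tbody>
</table>
"""

try:
    # read_html returns a list with one DataFrame per <table> element
    tables = pd.read_html(io.StringIO(html))
except ImportError:
    # No HTML parser installed
    tables = []

print("HTML tables : ", len(tables))
if tables:
    print(tables[0])
```

This is why the result is indexed (as in `html_data[2]`) before calling `head()`: the list has to be narrowed down to the one table of interest.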