# Data and Database

In data science, once the problem has been formulated, obtaining and preprocessing the data makes up a major part of the work. Data can be collected from different sources and in different formats. For the purpose of this notebook, we can categorize the sources (based on format) into:

* Files: Files come in different formats and may be hosted online or saved on a local drive. The most common formats are Excel, CSV, TXT, and PDF.
* Web scraping: Python has tools to access and extract data from websites. Beautiful Soup is one of the well-known libraries for pulling data out of HTML and XML files.
* Application Programming Interfaces (APIs): An API is a software intermediary that allows two applications to talk to each other. APIs offer several advantages over static data downloads (such as CSV files), including the ability to work with rapidly changing data and to request only a small piece of a dataset (say, today's temperature, rather than a whole trove of weather data).
* Databases: There are several database types, such as relational, NoSQL, hierarchical, network, and object-oriented databases. This notebook focuses on relational databases.
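
To make the first three source types concrete, here is a minimal sketch that uses only the Python standard library. The sample strings (`CSV_TEXT`, `API_JSON`, `HTML_TEXT`) and all function names are hypothetical stand-ins for a local file, an API response, and a web page, so the code runs without network access. The standard-library `html.parser` module is used here in place of Beautiful Soup purely to keep the example dependency-free.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical sample data standing in for a local file, an API response,
# and a web page. None of it comes from the dataset used in this notebook.
CSV_TEXT = "city,temp_c\nOslo,4.5\nCairo,29.0\n"
API_JSON = '{"city": "Oslo", "today": {"temp_c": 4.5}}'
HTML_TEXT = "<html><body><h1>Weather</h1><p class='t'>Oslo: 4.5</p></body></html>"


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_api_temperature(payload):
    """Pull one small value out of a JSON payload, as an API client might."""
    return json.loads(payload)["today"]["temp_c"]


class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def scrape_paragraphs(html):
    """Return the text of each paragraph in an HTML document."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


rows = read_csv_rows(CSV_TEXT)
temp = read_api_temperature(API_JSON)
paragraphs = scrape_paragraphs(HTML_TEXT)
print(rows)
print(temp)
print(paragraphs)
```

Note that the CSV reader returns every value as a string, while the JSON payload keeps its numeric type. In practice, pandas functions such as `pd.read_csv`, `pd.read_excel`, and `pd.read_json` load these formats directly into a DataFrame.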

To pull data from a database, Python has libraries such as pyodbc and SQLAlchemy. The following is an example of pulling data from a Microsoft SQL Server database with pyodbc and SQLAlchemy, in conjunction with the pandas library.


In [1]:
import pandas as pd
import pyodbc
import sqlalchemy as sal
import urllib.parse

In [2]:
print(pd.__version__)
print(pyodbc.version)
print(sal.__version__)

1.3.4
4.0.0-unsupported
1.4.27


In [3]:
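# 1: pyodbc ---- read data from a SQL Server instance on the local machine

# Windows authentication (Trusted_Connection) is used, so no username or password is needed.
conn = pyodbc.connect("Driver={SQL Server};"
                      "Server=DESKTOP-USBNA55;"
                      "Database=SED;"
                      "Trusted_Connection=yes;")

# pandas runs the query on the open connection and returns the result as a DataFrame,
# so no explicit cursor is required.
df = pd.read_sql_query('SELECT * FROM dbo.SED', conn)

# Use the three field-classification columns as a hierarchical (MultiIndex) row index.
df = df.set_index(['S&E_Fields', 'Broad_Fields', 'Detailed_Fields'])
df.head()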

[Output: head of `df`, indexed by S&E_Fields, Broad_Fields, Detailed_Fields, with columns Column 0, Race_and_Ethnicity, Sex, Year, Number]


In [4]:
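# 2: SQLAlchemy (on top of pyodbc) ---- read data from a SQL Server instance on the local machine

# URL-encode the ODBC connection string so it can be embedded in the SQLAlchemy URL.
params = urllib.parse.quote_plus("Driver={SQL Server};"
                                 "Server=DESKTOP-USBNA55;"
                                 "Database=SED;"
                                 "Trusted_Connection=yes;")

# Create an engine for the mssql+pyodbc dialect, passing the encoded string via odbc_connect.
engine1 = sal.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))

conn1 = engine1.connect()

# Same query as above, this time through the SQLAlchemy connection.
df1 = pd.read_sql_query('SELECT * FROM dbo.SED', conn1)
df1 = df1.set_index(['S&E_Fields', 'Broad_Fields', 'Detailed_Fields'])
df1.head()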

[Output: head of `df1`, indexed by S&E_Fields, Broad_Fields, Detailed_Fields, with columns Column 0, Race_and_Ethnicity, Sex, Year, Number]
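
Both cells above follow the same pattern: open a connection, run a query, and collect the result. The sketch below repeats that pattern with two additions that are often useful, a parameterized query and explicit closing of the connection. It is an illustration under stated assumptions, not the notebook's method: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for SQL Server so that it runs anywhere, and the table `sed`, its columns, its rows, and the helper `load_rows` are all made up for the example.

```python
import sqlite3
from contextlib import closing


def load_rows(conn, min_year):
    """Return rows with year >= min_year as a list of dicts.

    The value is passed separately from the SQL text (a `?` placeholder),
    so it is never pasted into the query string.
    """
    sql = (
        "SELECT field, sex, year, number FROM sed "
        "WHERE year >= ? ORDER BY year, field"
    )
    with closing(conn.cursor()) as cur:
        cur.execute(sql, (min_year,))
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]


# closing() guarantees the connection is closed even if a query fails.
with closing(sqlite3.connect(":memory:")) as conn:
    # Hypothetical table and placeholder rows, not real SED figures.
    conn.execute(
        "CREATE TABLE sed (field TEXT, sex TEXT, year INTEGER, number INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sed VALUES (?, ?, ?, ?)",
        [
            ("Physics", "Female", 2018, 10),
            ("Physics", "Male", 2019, 20),
            ("Chemistry", "Female", 2020, 30),
        ],
    )
    conn.commit()
    recent = load_rows(conn, 2019)

print(recent)
```

pyodbc uses the same `?` placeholder style, and `pd.read_sql_query` accepts the values through its `params` argument, so the same idea should carry over to the SQL Server examples above.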
