# Overview and Setup 

In this section, we will discuss what SQL is and how to set up SQLite in a Python environment.

## What are SQL and Relational Databases? 

**SQL** stands for Structured Query Language and is used to retrieve, manipulate, and write data. It is most often used with **relational database management systems (RDBMS)**, which are databases whose tables are linked together. This means the tables can be joined to better organize data. We will talk about table relationships and joins in Module 6.

While SQL is traditionally associated with relational databases, it has remained popular enough to be implemented in some NoSQL ("Not only SQL") databases as well as "big data" platforms like Apache Spark and Trino. Even though it is roughly 50 years old, SQL continues to be a necessary skill for any data professional and a go-to language for working with data.

Among relational databases, there are many commercial and open-source platforms such as Oracle, Microsoft SQL Server, PostgreSQL, and MySQL. These platforms typically run a database on an on-site server or "in the cloud", which is rented server space operating remotely. All of these platforms use SQL, and the core SQL language features are shared across them, although each has its own extensions. To keep our environment simple, the platform we will use is SQLite.

## What is SQLite? 

**SQLite** is a database platform just like [Oracle](https://www.oracle.com/database/technologies/appdev/sql.html) or [Microsoft SQL Server](https://www.microsoft.com/en-us/sql-server). What makes it unique is that it does not require a server. Instead, the database is simply stored as a file on your local machine, and you use a library or user interface to open it. Python's standard library already includes the `sqlite3` module, so you do not have to install anything. The module also complies with the [DB-API 2.0 specification defined by PEP 249](https://docs.python.org/3/library/sqlite3.html). This means that packages for other database platforms that comply with this standard (including [Microsoft SQL Server](https://pypi.org/project/pymssql/) and [Oracle](https://pypi.org/project/cx-Oracle/)) can be worked with in much the same way we will use SQLite. Therefore, much of what you learn in this training applies to most major database platforms!
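As a minimal sketch of what the DB-API 2.0 workflow looks like, the snippet below uses an in-memory SQLite database (the special path `:memory:`) and a hypothetical `PLANET` table that is not part of this course's sample data. The `connect()`, `cursor()`, `execute()`, and `fetchall()` calls are the pieces defined by PEP 249 that other compliant drivers also provide, though connection arguments and parameter placeholder styles vary from driver to driver.

```python
import sqlite3

# ":memory:" creates a temporary database in RAM instead of a file on disk
demo_conn = sqlite3.connect(":memory:")
cursor = demo_conn.cursor()

# PLANET is a hypothetical table used only for this illustration
cursor.execute("CREATE TABLE PLANET (NAME TEXT, MOONS INTEGER)")
cursor.executemany(
    "INSERT INTO PLANET (NAME, MOONS) VALUES (?, ?)",
    [("Earth", 1), ("Mars", 2)],
)
demo_conn.commit()

# Run a query and collect every record as a tuple
cursor.execute("SELECT NAME, MOONS FROM PLANET ORDER BY NAME")
rows = cursor.fetchall()
print(rows)

demo_conn.close()
```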

> If you want to write SQL against a SQLite database with a graphical user interface, there are many tools that provide this. My personal favorites are [SQLiteOnline](https://sqliteonline.com/) and [SQLiteStudio](https://sqlitestudio.pl/).


## Setup 

As stated earlier, SQLite support is already built into Python 3. If you use other platforms like [Microsoft SQL Server](https://pypi.org/project/pymssql/) or [Oracle](https://pypi.org/project/cx-Oracle/), you will need to `pip install` the respective packages that comply with the DB-API 2.0 standard.

We do, however, need the SQLite file containing the sample database used in our examples. For convenience, we can download the file straight [from the GitHub repository](https://github.com/thomasnield/anaconda_intro_to_sql/) and put it in our working Python directory.


Download the `company_operations.db` file directly from the GitHub repository using `urllib.request.urlretrieve()`, which saves the file at the URL to the given local path.

```python
import urllib.request

# Download the sample database into the current working directory
urllib.request.urlretrieve(
    "https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true",
    "company_operations.db",
)
```

```
('company_operations.db', <http.client.HTTPMessage at 0x7f947ef84640>)
```

You should now have the `company_operations.db` file downloaded and ready to go for this notebook. 
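If you want to confirm that the download produced a real SQLite database rather than, say, an HTML error page, one quick check is to look at the first 16 bytes of the file, which for SQLite 3 databases are the text `SQLite format 3` followed by a null byte. The helper name `looks_like_sqlite()` below is hypothetical, and the sketch is demonstrated on a throwaway database so that it runs without the download.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file begins with this 16-byte header
SQLITE_HEADER = b"SQLite format 3\x00"

def looks_like_sqlite(path):
    """Return True if the file at `path` starts with the SQLite 3 header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_HEADER

# Demonstrate on a throwaway database so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")
    tmp_conn = sqlite3.connect(demo_path)
    tmp_conn.execute("CREATE TABLE T (X INTEGER)")
    tmp_conn.commit()
    tmp_conn.close()
    demo_ok = looks_like_sqlite(demo_path)
    missing_ok = looks_like_sqlite(os.path.join(tmp_dir, "missing.db"))

print(demo_ok, missing_ok)
```

After the download, calling `looks_like_sqlite("company_operations.db")` should return `True`.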

## Connecting to a Database 

To connect to a database using the DB-API 2.0 standard, first import the module for the desired database platform. For SQLite, we `import sqlite3`. Let's also bring in Pandas, as it will make it easier to display the results of a SQL query.

After importing the module for `sqlite3`, you can call its `connect()` function and pass the necessary arguments to connect to the database. SQLite requires only a string argument for the path to the database file. Since the SQLite file is already in the working directory, we only need to provide the name of the file. This will return a connection which we will save to a variable `conn`. 

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('company_operations.db')
```

If you are using Oracle or another database platform, you may need to provide further arguments to connect to the database. You will need to [read the documentation](https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html) for the given platform. This will clarify which parameters are required to connect to a database, and you can retrieve those parameters from your database administrator, who can provide a username, password, hostname, IP address, or other necessary information.
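One SQLite-specific behavior of `connect()` is worth knowing: if the path does not point to an existing file, SQLite creates a new, empty database there rather than raising an error. A quick way to check that you opened the intended file is to list its tables from SQLite's built-in `sqlite_master` catalog. The sketch below uses a throwaway in-memory database with a hypothetical `WIDGET` table, and the helper name `list_tables()` is also made up for illustration.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor.fetchall()]

check_conn = sqlite3.connect(":memory:")

# A brand-new database has no tables yet
empty_tables = list_tables(check_conn)

# WIDGET is a hypothetical table used only for this illustration
check_conn.execute("CREATE TABLE WIDGET (WIDGET_ID INTEGER PRIMARY KEY)")
widget_tables = list_tables(check_conn)

print(empty_tables, widget_tables)
check_conn.close()
```

If `list_tables(conn)` returns an empty list for the course database, the file name or working directory is probably wrong.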

Now that you have a connection, you can write a SQL query as a string and then pass it to Pandas' `read_sql()` function along with the connection. Pandas will then pass that SQL query to the connection and return the results as a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 

```python
sql = "SELECT * FROM CUSTOMER"

pd.read_sql(sql, conn)
```

| | CUSTOMER_ID | CUSTOMER_NAME | ADDRESS | CITY | STATE | ZIP | CATEGORY |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alpha Medical | 18745 Train Dr | Dallas | TX | 75021 | INDUSTRIAL |
| 1 | 2 | Oak Cliff Base | 2379 Cliff Ave | Abbevile | LA | 70510 | GOVERNMENT |
| 2 | 3 | Sports Unlimited | 1605 Station Dr | Alexandrai | LA | 71301 | COMMERCIAL |
| 3 | 4 | Riley Sporting Goods | 9854 Firefly Blvd | Austin | TX | 78701 | COMMERCIAL |
| 4 | 5 | Lite Industrial | 462 Roadrunner Blvd | Houston | TX | 77254 | INDUSTRIAL |
| 5 | 6 | Prairie Sports Center | 689 Stadium Way | Tulsa | OK | 74101 | COMMERCIAL |
| 6 | 7 | Facility 95 | 2396 Runway Dr | Oklahoma City | OK | 73101 | GOVERNMENT |
| 7 | 8 | Allen Stadium | 573 HIllcrest Rd | Allen | TX | 75002 | COMMERCIAL |
| 8 | 9 | Dent Research | 392 45th St | Waco | TX | 76700 | INDUSTRIAL |
| 9 | 10 | Gamma Solutions | 2752 27th St | Phoenix | AZ | 85001 | COMMERCIAL |


In the code and output above, `SELECT` is a SQL command that retrieves data, and we are using it to retrieve all columns and records from the `CUSTOMER` table. In a Python environment, SQL code is written as a string and passed to the connection, which returns the results. While we can iterate over the results manually (as shown [in this documentation](https://docs.python.org/3/library/sqlite3.html#tutorial)), it is more convenient for our purposes to let Pandas load the results into a `DataFrame` for us.
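To make the manual alternative concrete, here is a minimal sketch that uses only the `sqlite3` module. It assumes a small in-memory stand-in for the `CUSTOMER` table, with just two of its columns and two of the rows shown above, so that it runs without the sample file.

```python
import sqlite3

# In-memory stand-in for the CUSTOMER table (only two columns, two rows)
manual_conn = sqlite3.connect(":memory:")
manual_conn.execute(
    "CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT)"
)
manual_conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?)",
    [(1, "Alpha Medical"), (2, "Oak Cliff Base")],
)

# The SQL is an ordinary Python string handed to the connection
manual_sql = "SELECT CUSTOMER_ID, CUSTOMER_NAME FROM CUSTOMER ORDER BY CUSTOMER_ID"

# Iterating over the executed query yields one plain tuple per record
names = []
for customer_id, customer_name in manual_conn.execute(manual_sql):
    names.append(customer_name)
    print(customer_id, customer_name)

manual_conn.close()
```

Each record comes back as a tuple without column labels, which is why handing the query to Pandas is usually the more comfortable option in a notebook.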

In other sections, we will focus on learning SQL's functionality and continue using a Python environment to execute our queries and write operations.

If you want to use a given column as the index column in Pandas, you can specify it in the `read_sql()` function. 

```python
sql = "SELECT * FROM CUSTOMER"

pd.read_sql(sql, conn, index_col="CUSTOMER_ID")
```

| CUSTOMER_ID | CUSTOMER_NAME | ADDRESS | CITY | STATE | ZIP | CATEGORY |
|---|---|---|---|---|---|---|
| 1 | Alpha Medical | 18745 Train Dr | Dallas | TX | 75021 | INDUSTRIAL |
| 2 | Oak Cliff Base | 2379 Cliff Ave | Abbevile | LA | 70510 | GOVERNMENT |
| 3 | Sports Unlimited | 1605 Station Dr | Alexandrai | LA | 71301 | COMMERCIAL |
| 4 | Riley Sporting Goods | 9854 Firefly Blvd | Austin | TX | 78701 | COMMERCIAL |
| 5 | Lite Industrial | 462 Roadrunner Blvd | Houston | TX | 77254 | INDUSTRIAL |
| 6 | Prairie Sports Center | 689 Stadium Way | Tulsa | OK | 74101 | COMMERCIAL |
| 7 | Facility 95 | 2396 Runway Dr | Oklahoma City | OK | 73101 | GOVERNMENT |
| 8 | Allen Stadium | 573 HIllcrest Rd | Allen | TX | 75002 | COMMERCIAL |
| 9 | Dent Research | 392 45th St | Waco | TX | 76700 | INDUSTRIAL |
| 10 | Gamma Solutions | 2752 27th St | Phoenix | AZ | 85001 | COMMERCIAL |


## Why SQL Instead of Pandas? 

As we will be learning how to retrieve, filter, transform, aggregate, and join data, you might be wondering why not just use Pandas, since it can do all of those tasks too. SQL and Pandas are not competitors, but rather two different tools for two different environments. When you have many terabytes of data stored in a relational database, you will likely be unable to process that data locally on your machine using Pandas. It makes sense to let SQL do the heavy computation on the server side (which is optimized to process the data it is storing) and have Pandas simply receive the results. Conversely, SQL is less suited to machine learning tasks, merging disparate data sources, or running more elaborate algorithms, all of which Python and Pandas handle better.

Generally, it is good practice when working with a relational database to have the database server do the computation where possible and have the Python environment consume the results. Keep both tools in your back pocket, and use each where it makes sense.
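As a rough illustration of that division of labor, the sketch below computes the same count-by-category result two ways: once by pulling every row into Python and counting there, and once by letting the database aggregate with `GROUP BY` so that only the summary rows come back. It assumes an in-memory stand-in for the `CUSTOMER` table holding the first four `CATEGORY` values from the output shown earlier. On a table this small the difference is negligible, but the second pattern is the one that scales when the data lives on a remote server.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for the CUSTOMER table (first four rows, two columns)
agg_conn = sqlite3.connect(":memory:")
agg_conn.execute("CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CATEGORY TEXT)")
agg_conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?)",
    [(1, "INDUSTRIAL"), (2, "GOVERNMENT"), (3, "COMMERCIAL"), (4, "COMMERCIAL")],
)

# Option 1: transfer every row, then aggregate on the Python side
all_rows = agg_conn.execute("SELECT CATEGORY FROM CUSTOMER").fetchall()
client_counts = dict(Counter(category for (category,) in all_rows))

# Option 2: let the database aggregate and transfer only the summary
summary_rows = agg_conn.execute(
    "SELECT CATEGORY, COUNT(*) FROM CUSTOMER GROUP BY CATEGORY"
).fetchall()
server_counts = dict(summary_rows)

print(len(all_rows), "rows transferred vs", len(summary_rows))
print(client_counts == server_counts)
agg_conn.close()
```

The same `GROUP BY` query string could be passed to `pd.read_sql()` so that Pandas receives only the summary rows.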