# SQLite Database Exploration

This application takes a SQLite database and writes its results to a markdown report.

This notebook inspects the database to identify its tables and their structure, then optimizes the tables by adding indexes.

In [46]:
import pandas as pd
import sys
import os

# Add the path to utils/ directory, which is one level up from the /data directory
sys.path.append(os.path.abspath(os.path.join('..', 'utils')))

# Now you can import the db_utils module
import db_utils as db


In [25]:
db_path = "C:/Users/megan/OneDrive/Documents/GitHub/sqlite_to_analysis_app/data/combined_data.db"
conn = db.connect_to_db(db_path)

In [26]:
# identify names of tables in the database
db.run_query(conn,"SELECT name FROM sqlite_master WHERE type='table'")

[('CompanyClassification',), ('CompanyDataset',)]

In [28]:
# check whether CompanyDataset has any indexes
# (in sqlite_master, an index row stores its table in tbl_name; name is the index's own name)
print(db.run_query(conn,"SELECT * FROM sqlite_master WHERE type='index' AND tbl_name='CompanyDataset'"))

# check whether CompanyClassification has any indexes
print(db.run_query(conn,"SELECT * FROM sqlite_master WHERE type='index' AND tbl_name='CompanyClassification'"))

[]
[]
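
In `sqlite_master`, an index row keeps its own name in `name` and the table it belongs to in `tbl_name`, so indexes for a table are found by filtering on `tbl_name`. A minimal sketch of that lookup follows. It uses an in-memory database with a stand-in `CompanyDataset` table and a hypothetical index name, `idx_demo_website`; neither reflects the real schema.

```python
import sqlite3

def list_indexes(conn, table):
    """Return the names of all indexes attached to `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return [name for (name,) in rows]

# Stand-in table for illustration only (not the real schema)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CompanyDataset (CompanyName TEXT, Website TEXT)")

before = list_indexes(demo, "CompanyDataset")  # no indexes yet
demo.execute("CREATE INDEX idx_demo_website ON CompanyDataset (Website)")
after = list_indexes(demo, "CompanyDataset")   # the new index is now listed

print(before, after)
```

An empty list from this lookup means the table has no indexes at all, which is the situation the two queries above are meant to detect.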


In [38]:
import time

# Time queries to get a sense of performance
def time_query(conn, query):
    """Time the execution of a query on the SQLite database."""
    cursor = conn.cursor()
    start_time = time.perf_counter()  # monotonic, high-resolution clock
    cursor.execute(query)
    result = cursor.fetchall()  # fetched inside the timed region so row retrieval is included
    execution_time = time.perf_counter() - start_time

    return execution_time, result

In [39]:
print(time_query(conn, "SELECT count(*) FROM CompanyDataset"))
print(time_query(conn, "SELECT count(*) FROM CompanyClassification"))

TypeError: expected str, bytes or os.PathLike object, not sqlite3.Connection
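
A single timing is noisy, so it can help to repeat the query and keep the fastest run. The sketch below shows one way to do that against an in-memory stand-in table. The helper name `best_time` and the `repeats` parameter are assumptions for illustration, not part of `db_utils`.

```python
import sqlite3
import time

def best_time(conn, query, repeats=5):
    """Run `query` several times and return (fastest_seconds, rows)."""
    best = float("inf")
    rows = None
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(query).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows

# Stand-in data for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CompanyDataset (CompanyName TEXT)")
demo.executemany(
    "INSERT INTO CompanyDataset VALUES (?)",
    [("ibm",), ("ey",), ("accenture",)],
)

elapsed, rows = best_time(demo, "SELECT count(*) FROM CompanyDataset")
print(f"{elapsed:.6f}s", rows)
```

Running the same measurement before and after adding an index gives a rough comparison, though timings on a small table say little about behavior at full size.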

In [20]:
# check the table names in the database file
df = pd.read_sql_query("SELECT * FROM sqlite_master WHERE type = 'table'", conn)

# Verify that result of SQL query is stored in the dataframe
print(df.head())

In [43]:
# read tables 
company_dataset = pd.read_sql_query("SELECT * FROM CompanyDataset",conn)
company_classification = pd.read_sql_query("SELECT * FROM CompanyClassification",conn)

In [44]:
print(company_dataset.columns)
company_dataset.head()

Index(['Unnamed: 0', 'CompanyName', 'Website', 'year founded', 'industry',
       'size range', 'locality', 'country', 'linkedin url',
       'current employee estimate', 'total employee estimate'],
      dtype='object')


| | Unnamed: 0 | CompanyName | Website | year founded | industry | size range | locality | country | linkedin url | current employee estimate | total employee estimate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5872184 | ibm | ibm.com | 1911.0 | information technology and services | 10001+ | new york, new york, united states | united states | linkedin.com/company/ibm | 274047 | 716906 |
| 1 | 4425416 | tata consultancy services | tcs.com | 1968.0 | information technology and services | 10001+ | bombay, maharashtra, india | india | linkedin.com/company/tata-consultancy-services | 190771 | 341369 |
| 2 | 21074 | accenture | accenture.com | 1989.0 | information technology and services | 10001+ | dublin, dublin, ireland | ireland | linkedin.com/company/accenture | 190689 | 455768 |
| 3 | 2309813 | us army | goarmy.com | 1800.0 | military | 10001+ | alexandria, virginia, united states | united states | linkedin.com/company/us-army | 162163 | 445958 |
| 4 | 1558607 | ey | ey.com | 1989.0 | accounting | 10001+ | london, greater london, united kingdom | united kingdom | linkedin.com/company/ernstandyoung | 158363 | 428960 |


In [45]:
print(company_classification.columns)
company_classification.head()

Index(['Category', 'Website', 'CompanyName', 'homepage_text', 'h1', 'h2', 'h3',
       'nav_link_text', 'meta_keywords', 'meta_description'],
      dtype='object')


| | Category | Website | CompanyName | homepage_text | h1 | h2 | h3 | nav_link_text | meta_keywords | meta_description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Commercial Services & Supplies | bipelectric.com | bip dipietro electric inc | Electrici... | | | | | electricians vero beach, vero beach electrical... | Providing quality, reliable full service resid... |
| 1 | Healthcare | eliasmedical.com | elias medical | site map \| en español Elias Medical h... | Offering Bakersfield family medical care from ... | Welcome to ELIAS MEDICAL#sep#Family Medical Pr... | Get To Know Elias Medical#sep#Family Medical P... | | Elias Medical bakersfield ca family doctor med... | For the best value in Bakersfield skin care tr... |
| 2 | Commercial Services & Supplies | koopsoverheaddoors.com | koops overhead doors | Home About Us Garage Door Repair & Servi... | | Customer Reviews#sep#Welcome to Koops Overhead... | | | Koops Overhead Doors, Albany Garage Doors, Tro... | Koops Overhead Doors specializes in the sales,... |
| 3 | Healthcare | midtowneyes.com | midtown eyecare | 918-599-0202 Type Size... | | Welcome to our practice! | | | | We would like to welcome you to Midtown Eyecar... |
| 4 | Commercial Services & Supplies | reprosecurity.co.uk | repro security ltd | Simply fill out our form below... | | Welcome to REPRO SECURITY Ltd | | | | Repro Security provide a range of tailor made ... |


# Optimize tables by adding indexes

The columns to index are the ones I expect to use for filtering, joining, sorting, or aggregation.

That is, columns used in:
- WHERE
- JOIN
- ORDER BY
- GROUP BY
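
As a sketch of how such an index could be added and checked, the example below builds a small in-memory stand-in for the two tables, indexes `Website`, and uses `EXPLAIN QUERY PLAN` to confirm that SQLite picks the index for a `WHERE` filter. Choosing `Website` is an assumption: it appears in both tables and looks like a plausible join key. The index names are hypothetical.

```python
import sqlite3

# Stand-in tables for illustration only (a subset of the real columns)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CompanyDataset (CompanyName TEXT, Website TEXT, country TEXT)")
demo.execute("CREATE TABLE CompanyClassification (Category TEXT, Website TEXT)")

# IF NOT EXISTS makes the statements safe to re-run in a notebook.
demo.execute(
    "CREATE INDEX IF NOT EXISTS idx_dataset_website ON CompanyDataset (Website)"
)
demo.execute(
    "CREATE INDEX IF NOT EXISTS idx_classification_website "
    "ON CompanyClassification (Website)"
)

# The last column of each EXPLAIN QUERY PLAN row describes one step of the plan.
plan = demo.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM CompanyDataset WHERE Website = ?",
    ("ibm.com",),
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
```

If the plan text names the index, the filter is served by an index search rather than a full table scan. Indexes also add write overhead and disk space, so it is worth indexing only the columns that queries actually use.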

Neither CompanyDataset nor CompanyClassification has any indexes. In addition, CompanyDataset contains a column with no meaningful name (`Unnamed: 0`). In this case the column appears to be some kind of company ID, but future loads should raise an error whenever an unnamed column is found.

In [47]:
db.check_for_unnamed_columns(conn, 'CompanyDataset')

AttributeError: module 'db_utils' has no attribute 'check_for_unnamed_columns'
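
The `AttributeError` shows that `db_utils` does not yet define `check_for_unnamed_columns`. Below is a minimal sketch of what such a helper might look like, assuming it should raise when a column name is blank or follows the `Unnamed: N` placeholder pattern that pandas assigns to headerless columns. The signature, return value, and error type are guesses, not the project's actual design.

```python
import re
import sqlite3

def check_for_unnamed_columns(conn, table):
    """Raise ValueError if `table` has blank or 'Unnamed: N' style columns."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # quoted by hand; only pass trusted table names.
    quoted = '"' + table.replace('"', '""') + '"'
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    names = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
    unnamed = [
        n for n in names
        if not n.strip() or re.fullmatch(r"Unnamed: \d+(\.\d+)?", n)
    ]
    if unnamed:
        raise ValueError(f"{table} has unnamed columns: {unnamed}")
    return names

# Stand-in tables for illustration only
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE CompanyDataset ("Unnamed: 0" INTEGER, CompanyName TEXT)')
demo.execute("CREATE TABLE CompanyClassification (Category TEXT, Website TEXT)")

clean_names = check_for_unnamed_columns(demo, "CompanyClassification")

try:
    check_for_unnamed_columns(demo, "CompanyDataset")
    flagged = False
except ValueError as err:
    flagged = True
    print(err)
```

Raising instead of silently renaming keeps the problem visible, which matches the intent that an unnamed column should stop the run.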