# Introduction

In this notebook I import each sheet of the provided Excel workbook and store it as a table in a local SQLite database for my work. The notebook also includes some basic EDA on the joins and relationships between the three tables, which will be useful for later work.

## Imports and Settings

In [1]:
import pandas as pd
import sqlite3

In [2]:
working_dir = '/Users/jmbeck/Desktop/guild_eval'

## Data Load

In [3]:
# Build the workbook path once and reuse it for each sheet.
workbook_path = working_dir + '/data/Guild_Education_SQL_Workbook.xls'
contact_details = pd.read_excel(workbook_path, sheet_name='Contact_Details')
course_details = pd.read_excel(workbook_path, sheet_name='Course_Details')
oppt_details = pd.read_excel(workbook_path, sheet_name='Opportunity_Details')

Look at the basic structure of each table.

In [4]:
contact_details.head()

Unnamed: 0,Sf Contact ID,Sf Opportunity ID,Sf Course C ID
0,0033600000q4ADOAA2,00636000005dxASAAY,
1,0033600000BMcisAAD,00636000005eG7BAAU,
2,00336000009m01JAAQ,00636000005eGPxAAM,
3,0033600000BOI13AAH,00636000005eNIpAAM,
4,0033600000BNSXNAA5,00636000005eOcBAAU,


In [5]:
course_details.head()

Unnamed: 0,Sf Course C ID,Sf Course C Name,Sf Course C Course Start Date C Date,Sf Course C Course End Date C Date,Sf Course C Final Grade C
0,a1C36000009t3tGEAQ,High School Completion Program,NaT,NaT,
1,a1C36000009eJkMEAU,Management Training Program - 16 Week,NaT,NaT,
2,a1C36000009u2k9EAA,High School Completion Program,NaT,NaT,
3,a1C36000009fIoUEAU,High School Completion Program,NaT,NaT,
4,a1C36000005okzZEAQ,Management Training Program - 16 Week,NaT,NaT,


In [6]:
oppt_details.head()

Unnamed: 0,Sf Opportunity ID,Sf Opportunity Application Type C,Sf Opportunity Program Category,Sf Opportunity Program C
0,00636000005dxASAAY,Guild Education,Lead Gen - Post Secondary,Giving and Receiving Feedback
1,00636000005eG7BAAU,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
2,00636000005eGPxAAM,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
3,00636000005eNIpAAM,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
4,00636000005eOcBAAU,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)


## Create a Local SQLite Database

Create the local SQLite database file and store each pandas DataFrame as a table in it. We won't worry about indices, since the tables are small and the database won't be reused, but we will double-check that everything loaded properly.

In [7]:
conn = sqlite3.connect(working_dir + '/db/eval_db.sqlite')

In [8]:
contact_details.to_sql("CONTACT_DETAILS", conn, if_exists="replace", index=False)

In [9]:
course_details.to_sql("COURSE_DETAILS", conn, if_exists="replace", index=False)

In [10]:
oppt_details.to_sql("OPPORTUNITY_DETAILS", conn, if_exists="replace", index=False)

### Read tables back in and assert that row and column counts match

In [11]:
contact_details_reread = pd.read_sql('SELECT * FROM CONTACT_DETAILS', conn)

In [12]:
assert contact_details.shape == contact_details_reread.shape

In [13]:
course_details_reread = pd.read_sql('SELECT * FROM COURSE_DETAILS', conn)

In [14]:
assert course_details.shape == course_details_reread.shape

In [15]:
oppt_details_reread = pd.read_sql('SELECT * FROM OPPORTUNITY_DETAILS', conn)

In [16]:
assert oppt_details.shape == oppt_details_reread.shape

In [17]:
conn.close()
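
The write, reread, and compare steps above could also be condensed into a single loop. The following is only a minimal sketch under stated assumptions: the small DataFrames are made-up stand-ins for the three sheets, and the database is an in-memory one rather than the `eval_db.sqlite` file used in this notebook.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the three sheets (assumption: the real frames come from read_excel).
frames = {
    "CONTACT_DETAILS": pd.DataFrame({
        "Sf Contact ID": ["c1", "c1", "c2"],
        "Sf Opportunity ID": ["o1", None, None],
    }),
    "COURSE_DETAILS": pd.DataFrame({"Sf Course C ID": ["k1", "k2"]}),
    "OPPORTUNITY_DETAILS": pd.DataFrame({"Sf Opportunity ID": ["o1"]}),
}

checked = []
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
try:
    for table_name, frame in frames.items():
        # Write the frame, read it straight back, and compare shapes.
        frame.to_sql(table_name, conn, if_exists="replace", index=False)
        reread = pd.read_sql(f'SELECT * FROM "{table_name}"', conn)
        assert frame.shape == reread.shape, f"{table_name}: shape mismatch"
        checked.append(table_name)
finally:
    # Close the connection even if one of the assertions fails.
    conn.close()
```

Wrapping the loop in `try`/`finally` keeps the connection from being left open when a check fails, which the cell-by-cell version does not guarantee.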

## EDA

The EDA contained in this section will be pretty basic, but will look at the linkage between keys for each table.

### Contact Details

Basic structure and information on the CONTACT_DETAILS data set.

In [18]:
contact_details.shape

(27029, 3)

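There are roughly 27,000 rows in the data set. As shown below, a contact id can appear on more than one row, so this is a count of records rather than of distinct contacts.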

In [19]:
contact_details.head()

Unnamed: 0,Sf Contact ID,Sf Opportunity ID,Sf Course C ID
0,0033600000q4ADOAA2,00636000005dxASAAY,
1,0033600000BMcisAAD,00636000005eG7BAAU,
2,00336000009m01JAAQ,00636000005eGPxAAM,
3,0033600000BOI13AAH,00636000005eNIpAAM,
4,0033600000BNSXNAA5,00636000005eOcBAAU,


This table is the cornerstone for linking the three tables together. The primary identifier for this table is [Sf Contact ID], and the foreign keys are [Sf Opportunity ID], which links to the OPPORTUNITY_DETAILS table, and [Sf Course C ID], which links to the COURSE_DETAILS table.
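
One way to check that a foreign key really resolves to its parent table is a left join with `indicator=True`. This is a sketch on made-up rows, not a result from the workbook: the ids `c1`, `o1`, and so on are hypothetical, and only the column names are taken from the tables above.

```python
import pandas as pd

# Hypothetical rows; "o9" deliberately has no match in the parent table.
contacts = pd.DataFrame({
    "Sf Contact ID": ["c1", "c1", "c2", "c3"],
    "Sf Opportunity ID": ["o1", "o2", None, "o9"],
})
opportunities = pd.DataFrame({"Sf Opportunity ID": ["o1", "o2", "o3"]})

# Keep only rows that carry a foreign key, then left-join to the parent table.
linked = contacts.dropna(subset=["Sf Opportunity ID"]).merge(
    opportunities, on="Sf Opportunity ID", how="left", indicator=True
)

# "left_only" rows reference an opportunity id that is absent from the parent table.
orphans = linked[linked["_merge"] == "left_only"]
print(orphans["Sf Opportunity ID"].tolist())
```

The same pattern would apply to [Sf Course C ID] against COURSE_DETAILS by swapping the key column and the parent frame.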

There are multiple records for each contact id.

In [20]:
contact_details.groupby(['Sf Contact ID']).size().sort_values(ascending=False).head()

Sf Contact ID
0033600000VDyNSAA1    46
0033600000OkCxiAAF    43
0033600000fuScWAAU    42
0033600000Y2HfUAAV    40
0033600000iutHvAAI    36
dtype: int64

Each contact id can have multiple course ids associated with it, since a student can take many courses.

In [21]:
contact_details.groupby(['Sf Contact ID'])['Sf Course C ID'].nunique().sort_values(ascending=False).head()

Sf Contact ID
0033600000VDyNSAA1    44
0033600000OkCxiAAF    42
0033600000fuScWAAU    41
0033600000Y2HfUAAV    39
0033600000iutHvAAI    34
Name: Sf Course C ID, dtype: int64

There are instances where a single contact id is also associated with several programs/opportunities.

In [22]:
contact_details.groupby(['Sf Contact ID'])['Sf Opportunity ID'].nunique().sort_values(ascending=False).head()

Sf Contact ID
0033600000MyAVQAA3    11
0033600000UFuqbAAD    11
0033600000q4ADdAAM     9
0033600000kcMHIAA2     9
0033600000NoojwAAB     8
Name: Sf Opportunity ID, dtype: int64

The majority of contact ids (15,151 of the 18,504 distinct ids counted below) do not correspond to an opportunity id at all.

In [23]:
contact_details.groupby(['Sf Contact ID'])['Sf Opportunity ID'].nunique().value_counts()

0     15151
2      1925
1       923
4       215
3       135
6        62
5        61
8        17
7        11
11        2
9         2
Name: Sf Opportunity ID, dtype: int64
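
The zero bucket arises because `nunique()` ignores missing values by default, so a contact whose opportunity id is always empty is counted as having 0 distinct opportunities. A small sketch on made-up rows (the ids are hypothetical) shows how the share of such contacts could be computed directly:

```python
import pandas as pd

# Hypothetical rows: c2 and c3 never have an opportunity id.
contacts = pd.DataFrame({
    "Sf Contact ID": ["c1", "c1", "c2", "c3", "c3"],
    "Sf Opportunity ID": ["o1", "o2", None, None, None],
})

# nunique() drops missing values, so all-missing groups yield 0.
opps_per_contact = contacts.groupby("Sf Contact ID")["Sf Opportunity ID"].nunique()

# The mean of a boolean mask is the fraction of contacts with no opportunity.
share_without_opportunity = (opps_per_contact == 0).mean()
print(round(share_without_opportunity, 3))
```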

### Course Details

This section contains a basic overview of the structure and uniqueness of the COURSE_DETAILS data and its keys.


In [24]:
course_details.shape

(7572, 5)

There are roughly 7,600 instances of courses being taken in this data set.

In [25]:
course_details.head()

Unnamed: 0,Sf Course C ID,Sf Course C Name,Sf Course C Course Start Date C Date,Sf Course C Course End Date C Date,Sf Course C Final Grade C
0,a1C36000009t3tGEAQ,High School Completion Program,NaT,NaT,
1,a1C36000009eJkMEAU,Management Training Program - 16 Week,NaT,NaT,
2,a1C36000009u2k9EAA,High School Completion Program,NaT,NaT,
3,a1C36000009fIoUEAU,High School Completion Program,NaT,NaT,
4,a1C36000005okzZEAQ,Management Training Program - 16 Week,NaT,NaT,


In [26]:
course_details.groupby('Sf Course C ID').size().sort_values(ascending=False).head()

Sf Course C ID
a1C36000009vDXOEA2    1
a1C36000004u04hEAA    1
a1C36000004u04VEAQ    1
a1C36000004u04WEAQ    1
a1C36000004u04XEAQ    1
dtype: int64

This table contains one record per course id.  

In [27]:
course_details.groupby('Sf Course C Name').size().value_counts(ascending=False).head()

1    297
2    176
4     68
3     60
8     39
dtype: int64

The same course name can appear multiple times in the table, suggesting that each row is a single instance of a course taken by a student, along with that student's final grade.

### Opportunity Details

Basic structure and details of the OPPORTUNITY_DETAILS table.

In [28]:
oppt_details.shape

(6968, 4)

There are roughly 7,000 opportunities, each of which can be linked back to the CONTACT_DETAILS table.

In [29]:
oppt_details.head()

Unnamed: 0,Sf Opportunity ID,Sf Opportunity Application Type C,Sf Opportunity Program Category,Sf Opportunity Program C
0,00636000005dxASAAY,Guild Education,Lead Gen - Post Secondary,Giving and Receiving Feedback
1,00636000005eG7BAAU,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
2,00636000005eGPxAAM,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
3,00636000005eNIpAAM,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)
4,00636000005eOcBAAU,Western Governors University,University,B.A. in Interdisciplinary Studies (K-8)


In [30]:
oppt_details.groupby('Sf Opportunity ID').size().value_counts()

1    6968
dtype: int64

This table contains a single row per opportunity id, with details about the opportunity.
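
The one-row-per-key checks in the Course Details and Opportunity Details sections could also be written as assertions, so that a later reload fails loudly if a key stops being unique. The helper below, `assert_unique_key`, is a hypothetical name introduced for this sketch, and the frames are toy data rather than the real tables.

```python
import pandas as pd


def assert_unique_key(frame, key):
    """Raise AssertionError if `key` has missing or repeated values in `frame`."""
    assert frame[key].notna().all(), f"{key} contains missing values"
    assert frame[key].is_unique, f"{key} contains duplicates"


# Toy parent table: every opportunity id appears exactly once, so this passes silently.
opportunities = pd.DataFrame({"Sf Opportunity ID": ["o1", "o2", "o3"]})
assert_unique_key(opportunities, "Sf Opportunity ID")

# Toy contact table: "c1" repeats, so the helper raises.
contacts = pd.DataFrame({"Sf Contact ID": ["c1", "c1", "c2"]})
try:
    assert_unique_key(contacts, "Sf Contact ID")
except AssertionError as err:
    print(err)
```

`Series.is_unique` answers the same question as the `groupby(...).size().value_counts()` cells, but in a single boolean that is easy to assert on.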