# Programming Project - Unit 3,1
*by Igor A. Brandão and Leandro Antonio Feliciano da Silva*

**Goals**
The purpose of this project is to explore the following:

- Accessing database content;
- The full statistical content covered in the course;
- Chart generation;
- Geolocation analysis, with hypotheses explained in detail.

<hr>

# Global Imports section

Import the libraries needed to handle:

- Geocoding;
- Maps;
- File input;
- Heatmap;
- Bokeh charts;
- NumPy library;
- Tqdm progress bar.

In [None]:
### Library necessary to run this IPython Notebook
!pip install geocoder
!pip install folium
!pip install tqdm

In [95]:
# Import pandas
import pandas as pd

# Import google geocoder
import geocoder as gc

# Import numpy library
import numpy as np

# Import folium heatmap
import folium
from folium.plugins import HeatMap

# Import tqdm progress bar plugin
from tqdm import tqdm

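# Import bokeh libraries
# Note: bokeh.charts only exists in older Bokeh releases; it was later
# split out into the separate bkcharts package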
from bokeh.plotting import figure
from bokeh.charts import Bar, Histogram, Donut, BoxPlot, output_notebook, show
from bokeh.layouts import row, gridplot, column
from bokeh.models import HoverTool
from bokeh.charts.attributes import cat, color
from bokeh.charts.operations import blend

<hr>

# Database section

## Database connection

#### In the cell below, we connect to the local database.

#### **Important:** The database file must be in the same folder as the notebook; otherwise, its path must be referenced correctly.
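The path requirement matters because SQLite creates a new, empty database when the file is missing, so a wrong path only shows up later as a missing table. The sketch below is one way to guard against that, using only the standard library. The helper `list_tables` and the throwaway `demo.sqlite` file are hypothetical names introduced for illustration, not part of the project.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the table names of an existing SQLite file (hypothetical helper)."""
    # Fail early instead of letting SQLite create an empty database
    if not os.path.isfile(db_path):
        raise FileNotFoundError("Database file not found: " + db_path)

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Build a throwaway database so the example runs on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.sqlite")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE Grade (id INTEGER, subject TEXT)")
conn.commit()
conn.close()

tables = list_tables(demo_path)
print(tables)
```

Calling the same helper with the project's database file name before creating the engine would surface a wrong path immediately.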

In [3]:
# Import necessary module
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///py-students-database.sqlite')

In [4]:
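# What are the tables in the database?
from sqlalchemy import inspect

# Save the table names to a list: table_names
# (inspect() replaces engine.table_names(), which newer SQLAlchemy releases no longer provide)
table_names = inspect(engine).get_table_names()

# Print the table names to the shell
print(table_names)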

['Grade']


In [5]:
# Query string
sql = "PRAGMA table_info(Grade)"

# Execute query and store records in DataFrame: df
df = pd.read_sql_query(sql, engine)

# Print head of DataFrame
df.head()

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,BIGINT,0,,0
1,1,student_id,BIGINT,0,,0
2,2,period,FLOAT,0,,0
3,3,subject,TEXT,0,,0
4,4,situation,TEXT,0,,0


## Database query

Here we run a basic query against the database table to check that the connection works.

In [6]:
# Query string
sql = "select * from Grade"

# Execute query and store records in DataFrame: df
df = pd.read_sql_query(sql, engine)

# Print head of DataFrame
df.head()

Unnamed: 0,id,student_id,period,subject,situation,overall_grade,unitI_grade,unitII_grade,unitIII_grade
0,0,0,2014.1,IMD0012.0 - INTRODUÇÃO ÀS TÉCNICAS DE PROGRAMAÇÃO,APROVADO,7.0,4.9,9.0,7.0
1,1,1,2014.1,IMD0012.0 - INTRODUÇÃO ÀS TÉCNICAS DE PROGRAMAÇÃO,APROVADO,7.3,8.0,7.0,7.0
2,2,2,2014.1,IMD0012.0 - INTRODUÇÃO ÀS TÉCNICAS DE PROGRAMAÇÃO,APROVADO,9.3,9.5,8.3,10.0
3,3,3,2014.1,IMD0012.0 - INTRODUÇÃO ÀS TÉCNICAS DE PROGRAMAÇÃO,APROVADO,7.3,6.5,7.0,8.3
4,4,4,2014.1,IMD0012.0 - INTRODUÇÃO ÀS TÉCNICAS DE PROGRAMAÇÃO,APROVADO,7.5,5.5,8.0,9.0


## Data export [optional]

To inspect the data in Excel, the cell below exports it to a CSV file.

In [7]:
# Export the new dataSet to csv
df.to_csv('py-students-dataset.csv', encoding="utf-8")
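One detail of the export worth illustrating: by default `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on made-up rows. `df_demo` and its values are placeholders, not project data.

```python
import io

import pandas as pd

# Made-up rows standing in for the Grade table
df_demo = pd.DataFrame({
    "subject": ["PROGRAMAÇÃO", "CÁLCULO"],
    "overall_grade": [7.0, 8.5],
})

# Default behaviour: the index is written as an unnamed first column
with_index = df_demo.to_csv()

# index=False keeps only the real columns
without_index = df_demo.to_csv(index=False)

# Reading the default export back exposes the extra column
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
print(without_index.splitlines()[0])
```

If accented characters look wrong when the file is opened in Excel, writing with `encoding="utf-8-sig"` may help, since that adds a byte-order mark that Excel uses to detect UTF-8.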

<hr>

# Statistics section

#### This section covers the statistical analysis.

#### The idea is to use a ***top-down analysis***, moving from the most general context to the most specific one.

## 1) Subjects overall grades

The chart below suggests that students struggle more with technical subjects such as programming at the beginning of the course.

A likely explanation is that most students start the course without programming experience; as they progress, this skill develops, which makes later subjects such as **Programming II** easier.

The same pattern appears in more theoretical subjects such as **Data Structure I**, which requires a grounding in both programming and math and is fairly complex.
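To check a comparison like this numerically, the per-subject averages can be listed in sorted order together with the number of records behind each one. The sketch below does this on an in-memory SQLite table, with the SQL `AVG` query next to a pandas `groupby` equivalent. The subject names and grades are invented for illustration and are not taken from the project database.

```python
import sqlite3

import pandas as pd

# In-memory table with invented rows (not the project's data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Grade (subject TEXT, overall_grade REAL)")
conn.executemany(
    "INSERT INTO Grade VALUES (?, ?)",
    [
        ("Programming I", 5.0),
        ("Programming I", 7.0),
        ("Programming II", 8.0),
        ("Programming II", 9.0),
    ],
)
conn.commit()

# SQL version: average per subject, lowest first
sql_means = pd.read_sql_query(
    "SELECT subject, AVG(overall_grade) AS mean "
    "FROM Grade GROUP BY subject ORDER BY mean",
    conn,
)

# pandas version: same averages, plus how many grades each one is based on
raw = pd.read_sql_query("SELECT subject, overall_grade FROM Grade", conn)
summary = (
    raw.groupby("subject")["overall_grade"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
conn.close()

print(sql_means)
print(summary)
```

Including the count helps avoid reading too much into an average that rests on only a few grades.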

In [8]:
# =================================================================================
# Data selection
# =================================================================================

# Query string
sql = "select subject, AVG(overall_grade) as mean from Grade \
    Group by subject"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make a bar chart: p
p = Bar(data, values='mean', label='subject', color='subject',
            title='Subjects overall score',
            legend='bottom_right', background_fill_color="#E8DDCB",
            plot_width=900, plot_height=768, tools=TOOLS)

# Set the y axis label
p.yaxis.axis_label= 'Subject overall score'

# Call the output_notebook() 
output_notebook()
show(p)

<hr>

## 2) Subjects grades distribution

Across the three boxplots, unit 1 generally shows the **highest grades**
in **Introduction to Programming Techniques** and **Programming Techniques and Practices**, but at the same time
it shows the **lowest grades** in **Data Structure I and II**.

From our experience as BTI students, this can be explained by the unit 1 content of
**Data Structure**, which covers the most complex topics of those subjects (algorithm complexity analysis, sorting algorithms and,
sometimes, binary trees).
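The reading of the boxplots above can be backed by numbers by comparing the median grade of each unit per subject. The sketch below assumes the same column names as the `Grade` table (`unitI_grade`, `unitII_grade`, `unitIII_grade`). The subjects "A" and "B" and their grades are invented placeholders.

```python
import pandas as pd

# Invented grades for two placeholder subjects
grades = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "unitI_grade": [8.0, 9.0, 10.0, 3.0, 5.0, 7.0],
    "unitII_grade": [6.0, 7.0, 8.0, 6.0, 7.0, 8.0],
    "unitIII_grade": [7.0, 7.0, 7.0, 8.0, 8.0, 8.0],
})

unit_cols = ["unitI_grade", "unitII_grade", "unitIII_grade"]

# Median of each unit per subject (the line inside each box of a boxplot)
medians = grades.groupby("subject")[unit_cols].median()

# Unit with the highest and lowest median for each subject
best_unit = medians.idxmax(axis=1)
worst_unit = medians.idxmin(axis=1)

print(medians)
print(best_unit)
print(worst_unit)
```

Replacing `median()` with `describe()` would also give the quartiles that define the box edges.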

In [99]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list
sql = "select subject, unitI_grade, unitII_grade, unitIII_grade from Grade"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make a box plot: unit 1
b1 = BoxPlot(data, values='unitI_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 01)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b1.yaxis.axis_label='Subjects grades range'

# Make a box plot: unit 2
b2 = BoxPlot(data, values='unitII_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 02)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b2.yaxis.axis_label='Subjects grades range'

# Make a box plot: unit 3
b3 = BoxPlot(data, values='unitIII_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 03)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b3.yaxis.axis_label='Subjects grades range'

# Create a list containing plots b1, b2 and b3
row1 = [b1,b2,b3]

# Create a gridplot using row1: layout
layout = gridplot([row1],sizing_mode='scale_width', plot_height=900)

# Call the output_notebook() 
output_notebook()

# Display the plot
show(layout)

<hr>

## 3) Subjects situation

In [130]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list status
sql = "select subject, situation from Grade"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# Create a separate DataFrame for the subject status, with a helper column to count on
subjectStatus = data.copy()
subjectStatus["Count"] = 0

# Count the number of records per situation
subjectStatus = pd.DataFrame(subjectStatus.groupby(["situation"])['Count'].count()).reset_index()

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Donut chart
d = Donut(subjectStatus, label=['situation', 'Count'], values='Count',
          text_font_size='10pt', hover_text='situation', legend='top_left',
          tools=TOOLS, background_fill_color="#E8DDCB", title='Subjects Status', 
          color='situation', plot_width=900, plot_height=900)

# Print the chart
output_notebook()
show(d)
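The counts behind the donut chart can also be printed as a small table with percentages, which is easier to quote than chart slices. The sketch below uses `value_counts` on invented rows. The status `REPROVADO` is an assumed example value, since only `APROVADO` appears in the rows shown earlier.

```python
import pandas as pd

# Invented status rows; REPROVADO is an assumed value for illustration
status = pd.DataFrame({
    "situation": ["APROVADO", "APROVADO", "APROVADO", "REPROVADO"],
})

# Number of records per situation, largest first
counts = status["situation"].value_counts()

# Share of each situation as a percentage of all records
shares = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"Count": counts, "Percent": shares})
print(summary)
```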