# Programming Project - Unit 3,3
*by Igor A. Brandão and Leandro Antonio Feliciano da Silva*

**Goals**
The purpose of this project is to explore the following:

- Access to the Health Graph API (Runkeeper content);
- Full content of the statistical part seen in the course;
- Graph generation;
- Geolocation analysis, with hypotheses explained in detail;
- Web scraping.

<hr>

# Global Imports section

Import the libraries needed to handle:

- Geocoding;
- Maps;
- File input;
- Heatmap;
- Bokeh charts;
- NumPy library;
- tqdm progress bar;
- Requests;
- urlopen;
- HTTPError;
- BeautifulSoup;
- Regular expressions.

In [None]:
### Libraries needed to run this IPython Notebook
!pip install geocoder
!pip install folium
!pip install tqdm
!pip install tabulate

In [30]:
# Import pandas
import pandas as pd

# Import google geocoder
import geocoder as gc

# Import numpy library
import numpy as np

# Import folium heatmap
import folium
from folium.plugins import HeatMap

# Import tqdm progressing bar plugin
from tqdm import tqdm

# Import bokeh libraries
from bokeh.plotting import figure
from bokeh.charts import Bar, Histogram, Donut, BoxPlot, Line, output_notebook, show
from bokeh.layouts import row, gridplot, column
from bokeh.models import HoverTool
from bokeh.charts.attributes import cat, color
from bokeh.charts.operations import blend

# Import request libraries
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Import web scraping libraries
from bs4 import BeautifulSoup
import re # regular expression

# Import API libraries
import requests
import json
from pandas.io.json import json_normalize

<hr>

# API section

## API data retrieving

#### In the cell below, we connect to the Health Graph API (Runkeeper).

In [39]:
# Access token
ACCESS_TOKEN = '25bc30d6dd6f4b99bbeb48e8619103b4'

# Base URI
api_URI = "http://api.runkeeper.com/fitnessActivities"

# Number of results
pageSize = 300

# Final URI
url = '%s?pageSize=%s&access_token=%s' % \
    (api_URI, pageSize, ACCESS_TOKEN,)

# print(url)

# Receive the results from API
api_content = requests.get(url).json()

print(json.dumps(api_content, indent=1))

{
 "size": 245,
 "items": [
  {
   "utc_offset": -3,
   "duration": 1037,
   "start_time": "Fri, 23 Jun 2017 12:40:11",
   "total_calories": 148,
   "tracking_mode": "outdoor",
   "total_distance": 5345.75479639154,
   "entry_mode": "API",
   "has_path": true,
   "source": "RunKeeper",
   "type": "Cycling",
   "uri": "/fitnessActivities/1005956310"
  },
  {
   "utc_offset": -3,
   "duration": 1140,
   "start_time": "Fri, 23 Jun 2017 07:50:54",
   "total_calories": 143,
   "tracking_mode": "outdoor",
   "total_distance": 6081.16390011933,
   "entry_mode": "API",
   "has_path": true,
   "source": "RunKeeper",
   "type": "Cycling",
   "uri": "/fitnessActivities/1005843765"
  },
  {
   "utc_offset": -3,
   "duration": 1285,
   "start_time": "Thu, 22 Jun 2017 12:41:16",
   "total_calories": 184,
   "tracking_mode": "outdoor",
   "total_distance": 6589.61628038915,
   "entry_mode": "API",
   "has_path": true,
   "source": "RunKeeper",
   "type": "Cycling",
   "uri": "/fitnessActivities/10055

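As a minimal sketch (not the authors' code), the request URL used above could also be assembled with `urllib.parse.urlencode`, which escapes the query parameters, while the token is read from the environment instead of being written into the notebook. The helper name `build_activities_url` and the environment variable `RUNKEEPER_TOKEN` are assumptions made for this illustration.

```python
import os
from urllib.parse import urlencode

# Same endpoint as in the cell above
API_URI = "http://api.runkeeper.com/fitnessActivities"

def build_activities_url(token, page_size=300, base_uri=API_URI):
    """Return the fitnessActivities URL with encoded query parameters."""
    query = urlencode({"pageSize": page_size, "access_token": token})
    return "%s?%s" % (base_uri, query)

# RUNKEEPER_TOKEN is a hypothetical variable name; "demo-token" is a placeholder
token = os.environ.get("RUNKEEPER_TOKEN", "demo-token")
url = build_activities_url(token)
print(url)
```

The resulting `url` can then be passed to `requests.get(url)` as in the cell above.
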
## JSON to Data Frame conversion

To make the data easier to manipulate, the next cell converts the JSON content returned by the API into a pandas DataFrame.

In [63]:
# Perform a conversion from JSON to Data Frame
api_df = json_normalize(api_content['items'])

# Convert the duration from seconds to minutes, rounded to two decimals
api_df['duration'] = (api_df['duration'] / 60).round(2)

api_df

| | duration | entry_mode | has_path | source | start_time | total_calories | total_distance | tracking_mode | type | uri | utc_offset |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.28 | API | True | RunKeeper | Fri, 23 Jun 2017 12:40:11 | 148 | 5345.754796 | outdoor | Cycling | /fitnessActivities/1005956310 | -3 |
| 1 | 19.00 | API | True | RunKeeper | Fri, 23 Jun 2017 07:50:54 | 143 | 6081.163900 | outdoor | Cycling | /fitnessActivities/1005843765 | -3 |
| 2 | 21.42 | API | True | RunKeeper | Thu, 22 Jun 2017 12:41:16 | 184 | 6589.616280 | outdoor | Cycling | /fitnessActivities/1005535586 | -3 |
| 3 | 20.58 | API | True | RunKeeper | Thu, 22 Jun 2017 07:37:35 | 147 | 6162.944349 | outdoor | Cycling | /fitnessActivities/1005303953 | -3 |
| 4 | 19.30 | API | True | RunKeeper | Tue, 20 Jun 2017 12:35:21 | 156 | 5382.671145 | outdoor | Cycling | /fitnessActivities/1004255363 | -3 |
| 5 | 19.88 | API | True | RunKeeper | Tue, 20 Jun 2017 07:30:08 | 144 | 6086.986813 | outdoor | Cycling | /fitnessActivities/1004088987 | -3 |
| 6 | 17.02 | API | True | RunKeeper | Mon, 19 Jun 2017 12:36:20 | 141 | 5389.115056 | outdoor | Cycling | /fitnessActivities/1003621881 | -3 |
| 7 | 23.08 | API | True | RunKeeper | Mon, 19 Jun 2017 07:43:02 | 156 | 6143.735127 | outdoor | Cycling | /fitnessActivities/1003493993 | -3 |
| 8 | 50.43 | API | True | RunKeeper | Sat, 17 Jun 2017 14:47:50 | 310 | 8859.432268 | outdoor | Cycling | /fitnessActivities/1002663916 | -3 |
| 9 | 18.17 | API | True | RunKeeper | Wed, 14 Jun 2017 13:26:09 | 153 | 5753.597600 | outdoor | Cycling | /fitnessActivities/1000879327 | -3 |
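
The conversion in this section can be illustrated on a single record with the standard library only. The sketch below takes the first item of the API response shown earlier, converts the duration to minutes, parses `start_time`, and derives an average speed. The helper name `summarize_activity` is hypothetical, and treating `total_distance` as metres is an assumption about the API's units.

```python
from datetime import datetime

# First item of the API response shown earlier
record = {
    "duration": 1037,                    # seconds
    "start_time": "Fri, 23 Jun 2017 12:40:11",
    "total_distance": 5345.75479639154,  # assumed to be metres
    "type": "Cycling",
}

def summarize_activity(item):
    """Return duration in minutes, distance in km, average speed and start time."""
    km = item["total_distance"] / 1000
    hours = item["duration"] / 3600
    return {
        "type": item["type"],
        "start": datetime.strptime(item["start_time"], "%a, %d %b %Y %H:%M:%S"),
        "duration_min": round(item["duration"] / 60, 2),
        "distance_km": round(km, 2),
        "speed_kmh": round(km / hours, 1),
    }

summary = summarize_activity(record)
print(summary)
```
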


## Data export [optional]

To inspect the data in a spreadsheet application such as Excel, the cell below exports it to a CSV file.

In [57]:
# Export the new dataSet to csv
api_df.to_csv('dataSource.csv', encoding="utf-8")

<hr>

# Statistics section

#### This section handles the statistical analysis.

#### The idea is to use a ***top-down analysis***, moving from the most general context to the most specific one.

## 1) Exercises summary

## 2) Activities overall times

In [86]:
# =================================================================================
# Data selection
# =================================================================================

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make a bar chart: p
p = Bar(api_df, values='duration', label='type', agg='mean', color='type',
            legend='bottom_right', background_fill_color="#E8DDCB",
            plot_width=750, plot_height=500, tools=TOOLS)

# Set the y and x axis labels (duration was converted to minutes above)
p.yaxis.axis_label = 'Average activity time (minutes)'
p.xaxis.axis_label = 'Activity type'

# Set hover to bars
hover = p.select(dict(type=HoverTool))
hover.tooltips = [('Activity type', '@type'), ('Average time:', ' @height')]

# Configure visual properties on a plot's title attribute
p.title.text = "Overall time by activity"
p.title.align = "center"
p.title.text_font_size = "25px"

# Call the output_notebook() 
output_notebook()
show(p)
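
`agg='mean'` makes the chart show the mean duration per activity type. A minimal sketch of that aggregation, using only the standard library and a few made-up rows (the values and the function name `mean_by_type` are illustrative, not taken from the dataset):

```python
from collections import defaultdict
from statistics import mean

# Illustrative (type, duration in minutes) rows; not real data
rows = [
    ("Cycling", 17.0),
    ("Cycling", 19.0),
    ("Running", 30.0),
    ("Running", 40.0),
    ("Walking", 25.0),
]

def mean_by_type(pairs):
    """Group durations by activity type and average each group."""
    groups = defaultdict(list)
    for activity_type, minutes in pairs:
        groups[activity_type].append(minutes)
    return {name: mean(values) for name, values in groups.items()}

averages = mean_by_type(rows)
print(averages)
```

With pandas, the equivalent is `api_df.groupby('type')['duration'].mean()`.
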

<hr>

## 2) Calories burned by activity

In [85]:
# =================================================================================
# Data selection
# =================================================================================

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,hover,reset,save'

# Make a bar chart: p
p = Bar(api_df, values='total_calories', label='type', agg='sum', color='type',
            legend='bottom_right', background_fill_color="#E8DDCB",
            plot_width=750, plot_height=500, tools=TOOLS)

# Set the y and x axis labels
p.yaxis.axis_label = 'Activity total calories'
p.xaxis.axis_label = 'Activity type'

# Set hover to bars
hover = p.select(dict(type=HoverTool))
hover.tooltips = [('Activity type', '@type'), ('Total calories:', ' @height')]

# Configure visual properties on a plot's title attribute
p.title.text = "Total calories burned by activity"
p.title.align = "center"
p.title.text_font_size = "25px"

# Call the output_notebook() 
output_notebook()
show(p)

## 3) Subjects grades distribution

The three box plots suggest that, in general, unit 1 has the **highest grades** in
**Programming Technical Introduction** and **Programming Techniques and Practices**, but also
the **lowest grades** in **Data Structure I and II**.

From our experience as BTI students, this scenario can be explained by the unit 1 content of
**Data Structure**, which is the most complex part of those courses (algorithm complexity analysis,
sorting algorithms and, sometimes, binary trees).

In [42]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list
sql = "select subject, unitI_grade, unitII_grade, unitIII_grade from Grade"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make a box plot: unit 1
b1 = BoxPlot(data, values='unitI_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 01)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b1.yaxis.axis_label='Subjects grades range'

# Make a box plot: unit 2
b2 = BoxPlot(data, values='unitII_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 02)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b2.yaxis.axis_label='Subjects grades range'

# Make a box plot: unit 3
b3 = BoxPlot(data, values='unitIII_grade', label='subject', color='subject',
            title='Subjects grades range (grouped by unit 03)',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
b3.yaxis.axis_label='Subjects grades range'

# Create a list containing plots b1, b2 and b3
row1 = [b1,b2,b3]

# Create a gridplot using row1: layout
layout = gridplot([row1],sizing_mode='scale_width', plot_height=900)

# Call the output_notebook() 
output_notebook()

# Display the plot
show(layout)
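
Each box plot summarises a distribution through its quartiles. As a sketch of what is being drawn, the snippet below computes the quartiles and the 1.5 × IQR whisker limits for a made-up list of grades. The values and the function name `box_summary` are illustrative, and Bokeh's own quartile interpolation may differ slightly. The snippet relies on `statistics.quantiles`, available from Python 3.8.

```python
from statistics import quantiles

# Made-up grades for one subject; not taken from the Grade table
grades = [4.0, 5.5, 6.0, 7.0, 7.5, 8.0, 9.5]

def box_summary(values):
    """Return quartiles and whisker limits (1.5 * IQR rule) for a box plot."""
    q1, q2, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "lower_limit": q1 - 1.5 * iqr,
        "upper_limit": q3 + 1.5 * iqr,
    }

summary = box_summary(grades)
print(summary)
```

Values outside the two limits would be drawn as outliers.
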

<hr>

## 3) Subjects situation

In [51]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list status
sql = "select subject, situation from Grade"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# Copy the data and add a helper column used for counting
subjectStatus = data.copy()
subjectStatus["Count"] = 0

# Count how many records fall into each situation
subjectStatus = pd.DataFrame(subjectStatus.groupby(["situation"])['Count'].count()).reset_index()

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Donut chart
d = Donut(subjectStatus, label=['situation', 'Count'], values='Count',
          text_font_size='10pt', hover_text='situation', legend='top_left',
          tools=TOOLS, background_fill_color="#E8DDCB", title='Subjects Status', 
          color='situation', plot_width=900, plot_height=900)

# Configure visual properties on a plot's title attribute
d.title.text = "Subjects situation"
d.title.align = "center"
d.title.text_font_size = "25px"

# Print the chart
output_notebook()
show(d)
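
The `groupby(...).count()` step in this cell is a frequency count per situation. A standard-library sketch of the same idea, with made-up situation labels (the labels are hypothetical, not values from the `Grade` table):

```python
from collections import Counter

# Hypothetical situation labels
situations = ["Approved", "Failed", "Approved", "Cancelled", "Approved", "Failed"]

# Frequency of each situation
counts = Counter(situations)

# Percentage share of each situation, i.e. the size of each donut slice
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(counts.most_common())
print(shares)
```

With pandas, `data['situation'].value_counts()` yields the same counts.
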

<hr>

## 4) Overall grades by period

In [44]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list
sql = "select subject, overall_grade from Grade where period like '%2014.1%'"
sql2 = "select subject, overall_grade from Grade where period like '%2014.2%'"
sql3 = "select subject, overall_grade from Grade where period like '%2015.1%'"
sql4 = "select subject, overall_grade from Grade where period like '%2015.2%'"
sql5 = "select subject, overall_grade from Grade where period like '%2016.1%'"
sql6 = "select subject, overall_grade from Grade where period like '%2016.2%'"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)
data2 = pd.read_sql_query(sql2, engine)
data3 = pd.read_sql_query(sql3, engine)
data4 = pd.read_sql_query(sql4, engine)
data5 = pd.read_sql_query(sql5, engine)
data6 = pd.read_sql_query(sql6, engine)

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make a box plot: 2014.1
box1 = BoxPlot(data, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2014.1',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box1.yaxis.axis_label='Subjects grades range'

# Make a box plot: 2014.2
box2 = BoxPlot(data2, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2014.2',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box2.yaxis.axis_label='Subjects grades range'

# Make a box plot: 2015.1
box3 = BoxPlot(data3, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2015.1',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box3.yaxis.axis_label='Subjects grades range'

# Make a box plot: 2015.2
box4 = BoxPlot(data4, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2015.2',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box4.yaxis.axis_label='Subjects grades range'

# Make a box plot: 2016.1
box5 = BoxPlot(data5, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2016.1',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box5.yaxis.axis_label='Subjects grades range'

# Make a box plot: 2016.2
box6 = BoxPlot(data6, values='overall_grade', label='subject', color='subject',
            title='Overall grades in 2016.2',
            legend=False, background_fill_color="#E8DDCB", tools=TOOLS)

# Set the y axis label
box6.yaxis.axis_label='Subjects grades range'

# Create a list containing plots box1, box2 and box3
row1 = [box1,box2,box3]

# Create a list containing plots box4, box5 and box6
row2 = [box4,box5,box6]

# Create a gridplot using row1 and row2: layout
layout = gridplot([row1,row2],sizing_mode='scale_width', plot_height=900)

# Call the output_notebook() 
output_notebook()

# Display the plot
show(layout)

<hr>

## 5) Overall grades dispersion

In [50]:
# =================================================================================
# Data selection
# =================================================================================

# Select the subject list
sql = "select overall_grade from Grade where overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# Copy the data and add a helper column used for counting the grouped values
grades = data.copy()
grades["Count"] = 0

# Group the overall grades and count the group frequency
grades = pd.DataFrame(grades.groupby(["overall_grade"])['Count'].count()).reset_index()

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# The ranges are contiguous so that no grade (e.g. exactly 3.0 or 8.1) is left out

# 0.0 < Grades <= 3.0
range1 = (grades['overall_grade'] > 0.0) & (grades['overall_grade'] <= 3.0)
grades1 = grades.loc[range1, 'overall_grade']
frequency1 = grades.loc[range1, 'Count']

# 3.0 < Grades <= 5.0
range2 = (grades['overall_grade'] > 3.0) & (grades['overall_grade'] <= 5.0)
grades2 = grades.loc[range2, 'overall_grade']
frequency2 = grades.loc[range2, 'Count']

# 5.0 < Grades <= 8.0
range3 = (grades['overall_grade'] > 5.0) & (grades['overall_grade'] <= 8.0)
grades3 = grades.loc[range3, 'overall_grade']
frequency3 = grades.loc[range3, 'Count']

# 8.0 < Grades
range4 = grades['overall_grade'] > 8.0
grades4 = grades.loc[range4, 'overall_grade']
frequency4 = grades.loc[range4, 'Count']

# Create the figure: p
p = figure(x_axis_label='Grades', y_axis_label='Grades frequency', tools=TOOLS)

# Add the data to the plot
p.circle(grades1, frequency1, color="#990990", size=10, alpha=0.8)
p.circle(grades2, frequency2, color="#990000", size=10, alpha=0.8)
p.circle(grades3, frequency3, color="#009900", size=10, alpha=0.8)
p.circle(grades4, frequency4, color="#000099", size=10, alpha=0.8)

# Configure visual properties on a plot's title attribute
p.title.text = "Overall grades dispersion"
p.title.align = "center"
p.title.text_font_size = "25px"

# Call the output_notebook() 
output_notebook()

# Display the plot
show(p)
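
The scatter plot colours grades by range. One way to assign each grade to exactly one range, with no gaps between consecutive ranges, is to keep the upper bounds in a sorted list and look them up with `bisect`. This is a sketch under the assumption that the intended ranges are (0, 3], (3, 5], (5, 8] and above 8; the names `UPPER_BOUNDS`, `LABELS` and `grade_range` are illustrative.

```python
from bisect import bisect_left

UPPER_BOUNDS = [3.0, 5.0, 8.0]              # inclusive upper bound of each range
LABELS = ["0-3", "3-5", "5-8", "above 8"]   # one label per range

def grade_range(grade):
    """Return the label of the range that contains the grade."""
    return LABELS[bisect_left(UPPER_BOUNDS, grade)]

for g in (2.9, 3.0, 3.05, 5.0, 8.0, 8.1):
    print(g, grade_range(g))
```

Because `bisect_left` places a value equal to a bound in the lower range, boundary grades such as 3.0 are never dropped.
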

<hr>

## 6) Overall grades historic chart

In [86]:
# =================================================================================
# Data selection
# =================================================================================

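# Select the average overall grade per subject for each period
# (note: the `median` alias holds the arithmetic mean, because AVG is used)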
sql = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2014.1%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

sql2 = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2014.2%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

sql3 = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2015.1%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

sql4 = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2015.2%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

sql5 = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2016.1%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

sql6 = "select subject, period, AVG(overall_grade) as median from Grade where period like '%2016.2%' \
        and overall_grade IS NOT NULL and overall_grade <> '' \
        and overall_grade > 0.0 \
        Group by subject"

# Select the periods (except 2014.1 because the dataSet isn't complete)
sqlPeriod = "select distinct period from Grade where period not Like '%2014.1%'"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)
data2 = pd.read_sql_query(sql2, engine)
data3 = pd.read_sql_query(sql3, engine)
data4 = pd.read_sql_query(sql4, engine)
data5 = pd.read_sql_query(sql5, engine)
data6 = pd.read_sql_query(sql6, engine)
dataPeriod = pd.read_sql_query(sqlPeriod, engine)

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# (dict, OrderedDict, lists, arrays and DataFrames are valid inputs)
period = np.array(dataPeriod)

values = np.array([data2['median'], data3['median'], data4['median'], 
                   data5['median'], data6['median']])

# Make a line chart with the dataSet
line = Line(data=values, legend="top_left", background_fill_color="#E8DDCB", 
            ylabel='Overall grades', tools=TOOLS, plot_width=900)

# Configure visual properties on a plot's title attribute
line.title.text = "Overall grades historic chart"
line.title.align = "center"
line.title.text_font_size = "25px"

# Call the output_notebook() 
output_notebook()

# Display the plot
show(line)
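
The queries in this cell label the result of `AVG` as `median`, but `AVG` returns the arithmetic mean, and the two statistics can differ noticeably when grades are skewed. A small sketch with made-up grades (the values are illustrative only):

```python
from statistics import mean, median

# Made-up overall grades for one subject in one period
grades = [2.0, 7.0, 8.0, 9.0]

grade_mean = mean(grades)      # what SQL AVG computes
grade_median = median(grades)  # middle value of the sorted grades

print("mean:", grade_mean, "median:", grade_median)
```

A single low grade pulls the mean down while leaving the median almost unchanged, which matters when reading the line chart.
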

## 7) Overall students situation

In [48]:
from bokeh.charts import Bar, output_notebook
from bokeh.models import Legend, LegendItem
from bokeh.models import ColumnDataSource, ranges, LabelSet

# =================================================================================
# Data selection
# =================================================================================

# Select the subject list status
sql = "select subject, situation from Grade"

# Execute query and store records in DataFrame: data
data = pd.read_sql_query(sql, engine)

# Copy the data and add a helper column used for counting
subjectStatus = data.copy()
subjectStatus["Count"] = 0

# Count how many records fall into each situation
subjectStatus = pd.DataFrame(subjectStatus.groupby(["situation"])['Count'].count()).reset_index()

# =================================================================================
# Chart plotting
# =================================================================================

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Bar chart creation
p = Bar(subjectStatus, values='Count', label='situation', color='situation',
          tools=TOOLS, plot_width=900, plot_height=750)

# Configure visual properties on a plot's title attribute
p.title.text = "Overall students situation"
p.title.align = "center"
p.title.text_font_size = "25px"
p.yaxis.axis_label = 'Situation frequency'
p.xaxis.axis_label = 'Situation'
p.legend.location = "top_right"

# Add hover selection for each bar
hover = p.select(dict(type=HoverTool))
hover.tooltips = [('Situation name',' $x'),('Frequency',' $y')]

# Call the output_notebook()
output_notebook()
        
# Display the plot
show(p)