# Programming Project - Unit 2,4
*by Igor A. Brandão and Leandro Max*

**Goals**
The purpose of this project is to explore the following:

- Full content of the statistical part seen in the course;
- Graphs generation;
- Geolocation analysis and hypotheses should be explained in detail.

## Global Imports

Import the libraries needed to handle:

- Geocoding;
- Maps;
- File input;
- Heatmap;
- Bokeh charts;
- Numpy library;
- Tqdm progress bar

In [None]:
# Libraries required to run this IPython Notebook
!pip install geocoder
!pip install folium
!pip install tqdm

In [24]:
# Import pandas
import pandas as pd

# Import google geocoder
import geocoder as gc

# Import numpy library
import numpy as np

# Import folium heatmap
import folium
from folium.plugins import HeatMap

# Import tqdm progress bar plugin
from tqdm import tqdm

# Import bokeh libraries
from bokeh.charts import Bar, Histogram, Donut, output_notebook, show
from bokeh.layouts import row
from bokeh.models import HoverTool
from bokeh.charts.attributes import cat, color
from bokeh.charts.operations import blend

## Data importing

This cell imports the dataset from the Excel file and stores it in the variable **xl**.

In [25]:
# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'imd_student_blind.xlsx'

# Load spreadsheet: xl
xl = pd.ExcelFile(file)

# Print sheet names
print(xl.sheet_names)

['Sheet1']


## Data printing

Here the DataFrame receives the dataset from **Sheet1** via *xl.parse(0)*.

In [26]:
# Load a sheet into a DataFrame by index: df
df = xl.parse(0)

# Print the head of the DataFrame df
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado


In [27]:
df.columns

Index(['a_ID', 'CEP', 'ano_ingresso', 'periodo_ingresso', 'status',
       'ano_disciplina', 'periodo_disciplina', 'nota', 'disciplina_ID',
       'status.disciplina'],
      dtype='object')

In [28]:
df.shape

(4842, 10)

## 1) Geolocation handler section

In this section we handle the geolocation information. The idea is to convert each zipcode (CEP) into latitude and longitude and export the resulting dataset, since the conversion takes a long time to run.

After that, it is possible to pin the students' positions and generate the heatmap.

In [29]:
# df["GeoCod"] = df["CEP"]
df['lat'], df['long'] = [0, 0]

# Print new df
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina,lat,long
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado,0,0
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado,0,0
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado,0,0
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado,0,0
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado,0,0


### *Warning: Do not process this cell again!*

We've already converted every CEP into *lat/long*, so you can skip this cell.

Please use the **py-students-blind-with-lat-long.csv** file to generate the maps and save your time and your processor ;)

In [None]:
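# Retrieve the latitude and longitude related to each student
for i in tqdm(range(len(df))):
    # Pass the zipcode as text; the name `str` is avoided because it shadows the built-in
    cep = str(df.loc[i, 'CEP'])
    g = gc.google(cep)
    # Keep the default (0, 0) when the geocoder returns no result
    if g.lat is not None:
        # .loc replaces the deprecated .ix indexer
        df.loc[i, 'lat'] = g.lat
        df.loc[i, 'long'] = g.lng
print('Geocoding complete!')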


 42%|████▏     | 2045/4842 [25:04<33:03,  1.41it/s] 
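
Because many rows share the same CEP, one way to shorten this step is to geocode each distinct zipcode once and map the result back to every row. The following is only a sketch under assumptions: `geocode_unique` and `fake_lookup` are hypothetical names that are not part of the notebook, and `fake_lookup` stands in for the real geocoding call so the example runs offline.

```python
import pandas as pd


def geocode_unique(frame, lookup, cep_col="CEP"):
    """Geocode each distinct zipcode once, then map the results back to all rows.

    `lookup` is any callable taking the zipcode as text and returning (lat, lng).
    """
    coords = {cep: lookup(str(cep)) for cep in frame[cep_col].unique()}
    out = frame.copy()
    out["lat"] = out[cep_col].map(lambda c: coords[c][0])
    out["long"] = out[cep_col].map(lambda c: coords[c][1])
    return out


# Hypothetical stand-in for the real geocoder, recording how often it is called
lookup_calls = []


def fake_lookup(cep):
    lookup_calls.append(cep)
    known = {"59015430": (-5.816641, -35.200015)}
    return known.get(cep, (float("nan"), float("nan")))


sample = pd.DataFrame({"CEP": [59015430, 59015430, 59073120]})
result = geocode_unique(sample, fake_lookup)
```

With three rows but only two distinct zipcodes, the lookup runs twice instead of three times; zipcodes the lookup cannot resolve end up as NaN rather than 0.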

In [33]:
# Print df with latitude and longitude
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina,lat,long
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado,-5.816641,-35.200015
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado,-5.816641,-35.200015
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado,-5.853337,-35.252804
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado,-5.832998,-35.242542
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado,-5.872282,-35.2066


### Latitude and longitude export

To avoid unnecessary processing, we export the *lat/long* data returned by the Google API to a .csv file.

In [None]:
# Export the new dataSet to csv (don't run this cell again)
df.to_csv('py-students-blind-with-lat-long.csv', encoding="utf-8")

### Please, proceed from here :)

In [48]:
# Read the generated csv
geodata1 = pd.read_csv('py-students-blind-with-lat-long.csv', encoding="utf-8", index_col=0)
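
The "export once, then read the csv" workflow can also be expressed as a small guard, so the slow step only runs when the file is missing. This is a sketch, not the notebook's method: `load_or_build` and `slow_build` are hypothetical names, and a temporary directory is used here so the example does not touch the real csv.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build):
    """Return the cached csv if it exists; otherwise build the frame and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, encoding="utf-8", index_col=0)
    frame = build()
    frame.to_csv(csv_path, encoding="utf-8")
    return frame


calls = []


def slow_build():
    # Stand-in for the geocoding loop; records how many times it runs
    calls.append(1)
    return pd.DataFrame({"CEP": [59015430], "lat": [-5.816641], "long": [-35.200015]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "students-with-lat-long.csv")
    first = load_or_build(path, slow_build)   # builds and writes the file
    second = load_or_build(path, slow_build)  # reads the file, no rebuild
```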

In [59]:
# Retrieve the data related to geolocation
geodata = geodata1.filter(['a_ID','CEP','lat','long'], axis=1)
geodata = geodata.rename(columns = {'a_ID':'Aluno'})

# Reset the index, discarding the previous one
geodata = geodata.reset_index(drop=True)
geodata.head()

Unnamed: 0,Aluno,CEP,lat,long
0,0,59015430,44.200797,24.502298
1,0,59015430,44.200797,24.502298
2,1,59073120,44.473235,-73.217882
3,2,59072580,34.272328,-118.025521
4,3,59088150,32.876563,-84.326867


In [60]:
geodata.columns

Index(['Aluno', 'CEP', 'lat', 'long'], dtype='object')

In [61]:
geodata.shape

(4842, 4)

### Pin map

The idea here is to generate a map with pins indicating each student's location.

In [33]:
# Set map center and zoom level
mapc = [-5.788, -35.202]
zoom = 11

# Create map object
map_osm = folium.Map(location=mapc, zoom_start=zoom)

# Plot each of the locations that we geocoded
for j in tqdm(range(len(geodata))):
    # .loc replaces the deprecated .ix indexer
    folium.Marker([geodata.loc[j,'lat'], geodata.loc[j,'long']],
                  #popup=(geodata.loc[j,'Aluno'])
                 ).add_to(map_osm)
# Show the map
map_osm

100%|██████████| 4842/4842 [01:15<00:00, 64.14it/s]
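
The sample rows of `geodata` printed earlier include coordinates far from the map center used here (for example, a latitude of 44.2), so filtering the points before pinning them may be worthwhile. A minimal sketch, assuming a bounding box of one degree around `mapc`; the helper `within_box` and the box size are illustrative choices, not part of the notebook.

```python
import pandas as pd


def within_box(frame, center, half_width=1.0):
    """Keep only rows whose lat/long fall inside a square box around `center`."""
    lat0, lon0 = center
    mask = (
        frame["lat"].between(lat0 - half_width, lat0 + half_width)
        & frame["long"].between(lon0 - half_width, lon0 + half_width)
    )
    return frame[mask].reset_index(drop=True)


sample = pd.DataFrame({
    "Aluno": [0, 1, 2],
    "lat": [-5.816641, 44.473235, -5.872282],
    "long": [-35.200015, -73.217882, -35.2066],
})

# Same center as `mapc` in the cell above
nearby = within_box(sample, center=(-5.788, -35.202))
```

Rows with missing coordinates are dropped as well, since `between` evaluates to False for NaN.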


### Heatmap

To see the concentration of students in Natal more clearly, we generate a heatmap that represents it with colors.

Cold colors represent low concentration of students, and hot colors indicate high concentrations.

In [62]:
# Prepare the final data
dataFinal = geodata.copy()
dataFinal["Count"] = 0

# Print the dataSet head
dataFinal.head()

Unnamed: 0,Aluno,CEP,lat,long,Count
0,0,59015430,44.200797,24.502298,0
1,0,59015430,44.200797,24.502298,0
2,1,59073120,44.473235,-73.217882,0
3,2,59072580,34.272328,-118.025521,0
4,3,59088150,32.876563,-84.326867,0


In [66]:
# Count the number of students by zipcode
dataFinal = pd.DataFrame(dataFinal.groupby(["CEP"])['Count'].count()).reset_index()

# Add latitude and longitude to dataFinal
dataFinal["lat"] = geodata['lat']
dataFinal["long"] = geodata['long']

# Print the dataSet head
dataFinal.head()

Unnamed: 0,CEP,Count,lat,long
0,0,1,44.200797,24.502298
1,1507000,1,44.200797,24.502298
2,5021000,1,44.473235,-73.217882
3,5163000,1,34.272328,-118.025521
4,6321200,1,32.876563,-84.326867
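
Note that assigning `geodata['lat']` to the grouped frame aligns values by row index, not by CEP, so each zipcode receives the coordinates of whichever row happens to share its position. A sketch that keeps every zipcode paired with its own coordinates, assuming all rows of one CEP carry the same lat/long (so taking the first is safe); the sample values come from the geocoded rows shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    "Aluno": [0, 0, 1, 2],
    "CEP": [59015430, 59015430, 59073120, 59072580],
    "lat": [-5.816641, -5.816641, -5.853337, -5.832998],
    "long": [-35.200015, -35.200015, -35.252804, -35.242542],
})

# Aggregate the count and the coordinates in one groupby so they stay aligned
per_cep = (
    sample.groupby("CEP")
    .agg(Count=("Aluno", "count"), lat=("lat", "first"), long=("long", "first"))
    .reset_index()
)
```

`count` mirrors the original cell and counts rows; using `nunique` on `Aluno` instead would count distinct students per zipcode.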


In [72]:
dataFinal.shape

(627, 4)

In [73]:
# Set map center and zoom level
mapc = [-5.788, -35.202]
zoom = 11

# Initialize the coordinates array
coordinates = []

# Add each [lat, long, count] triple to the coordinates list
for i in range(len(dataFinal)):
    # Skip rows containing NaN values (.loc replaces the deprecated .ix indexer)
    point = [dataFinal.loc[i,'lat'], dataFinal.loc[i,'long'], dataFinal.loc[i,'Count']]
    if not np.isnan(point).any():
        coordinates.append(point)

# Create map object
htMap = folium.Map(location=mapc, zoom_start=zoom)

# Append the coordinates to the heatMap
HeatMap(coordinates).add_to(htMap)

# Print the heatMap
htMap
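
The row-by-row loop that builds `coordinates` could probably be replaced by a vectorized expression. A small sketch on made-up rows, assuming the same `lat`, `long` and `Count` columns as `dataFinal`:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "CEP": [59015430, 59072580, 59073120],
    "Count": [2, 1, 1],
    "lat": [-5.816641, np.nan, -5.853337],
    "long": [-35.200015, -35.242542, -35.252804],
})

# Drop rows with any missing value, then convert to a list of [lat, long, weight]
coordinates = sample[["lat", "long", "Count"]].dropna().values.tolist()
```

The resulting list has the same `[lat, long, weight]` shape that the loop produces and can be passed to `HeatMap` in the same way.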

## 2) Statistic handler section

In this section we handle the statistical information.

### Grades histogram

Here we analyse the grades by looking at their distribution.

In [5]:
# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Make the Histogram: hist_grade
hist_grade = Histogram(df, 'nota', title='Students grades distribution',
                       legend='top_left', tools=TOOLS, bins=50,
                       background_fill_color="#E8DDCB", color="#036564")

# Set axis labels
hist_grade.xaxis.axis_label = 'Grades (0 to 10)'
hist_grade.yaxis.axis_label = 'Grades frequency'

# Call the output_notebook() 
output_notebook()
show(hist_grade)

In [6]:
# Imports
import numpy as np
import scipy.special

from bokeh.layouts import gridplot
from bokeh.plotting import figure, show, output_notebook

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset, hover, save'

p1 = figure(title="Student grades - Normal Distribution (μ=0, σ=0.5)",tools=TOOLS,
            background_fill_color="#E8DDCB")

# Mean and standard deviation of the reference normal curve
mu, sigma = 0, 0.5

# Histogram settings
hist, edges = np.histogram(df['nota'], density=True, bins=50)

# Tendency line settings
x = np.linspace(-5, 10, 1000)

# Probability density function
pdf = 1/(sigma * np.sqrt(2*np.pi)) * np.exp(-(x-mu)**2 / (2*sigma**2))

# Cumulative distribution function
cdf = (1+scipy.special.erf((x-mu)/np.sqrt(2*sigma**2)))/2

p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="#036564", line_color="#033649")

# Add the lines
p1.line(x, pdf, line_color="#D95B43", line_width=8, alpha=0.7, legend="Probability density function")
p1.line(x, cdf, line_color="white", line_width=2, alpha=0.7, legend="Cumulative distribution function")

# Customs
p1.legend.location = "top_left"
p1.xaxis.axis_label = 'Grades (0 to 10)'
p1.yaxis.axis_label = 'Grades frequency'

# Print the distribution
output_notebook()
show(gridplot(p1, ncols=2, plot_width=800, plot_height=600, toolbar_location=None))
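
The normal curve in this cell uses fixed parameters (μ=0, σ=0.5). If the goal is a curve that follows the grades themselves, the parameters could instead be estimated from the data. A minimal sketch, using only the five `nota` values visible in `df.head()` as a stand-in for the full column:

```python
import numpy as np

# The five 'nota' values shown in df.head(); the full column would be df['nota']
grades = np.array([2.6, 8.0, 0.1, 6.1, 3.0])

# Sample mean and sample standard deviation (ddof=1)
mu = grades.mean()
sigma = grades.std(ddof=1)

# Evaluate the normal density over the grade range with the same formula as above
x = np.linspace(0, 10, 1000)
pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
```

Plotting this `pdf` over the density histogram shows how well a normal model describes the grades; whether it does is a question for the data, not something this sketch asserts.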

In [173]:
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina,lat,long,logradouro,bairro,cidade,uf
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado,,,Avenida Xavier da Silveira,Tirol,Natal,RN
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado,,,Avenida Xavier da Silveira,Tirol,Natal,RN
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado,,,Rua Santo Onofre,Planalto,Natal,RN
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado,,,Rua Ivo Furtado,Cidade Nova,Natal,RN
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado,,,Rua Itapecirica,Neópolis,Natal,RN


### Student status analysis

The idea here is to analyse the percentage of approvals versus failures.

In [216]:
# Prepare studentStatus
studentStatus = df.copy()
studentStatus["Count"] = 0

# Print the dataSet head
studentStatus.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina,Count
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado,0
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado,0
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado,0
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado,0
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado,0


In [217]:
# Count the records for each status
studentStatus = pd.DataFrame(studentStatus.groupby(["status.disciplina"])['Count'].count()).reset_index()

# Print the dataSet head
studentStatus.head()

Unnamed: 0,status.disciplina,Count
0,Aprovado,2766
1,Reprovado,2076


In [250]:
from bokeh.palettes import RdBu3

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Donut chart of approvals versus failures
d = Donut(studentStatus, label=['status.disciplina', 'Count'], values='Count',
          text_font_size='12pt', hover_text='status_count', legend='top_left',
          tools=TOOLS, background_fill_color="#E8DDCB", title='Total approvals and failures',
          color=RdBu3)

# Print the chart
output_notebook()
show(d)

### Discipline status analysis

Here we check the percentage of students who dropped out, cancelled, or completed the discipline.

In [None]:
# Prepare disciplineStatus
disciplineStatus = df.copy()
disciplineStatus["Count"] = 0

# Print the dataSet head
disciplineStatus.head()

In [None]:
# Count the records for each status
disciplineStatus = pd.DataFrame(disciplineStatus.groupby(["status"])['Count'].count()).reset_index()

# Print the dataSet head
disciplineStatus.head()

In [None]:
from bokeh.palettes import RdBu3

# Tools
TOOLS = 'box_zoom,box_select,crosshair,resize,reset,hover,save'

# Donut chart of the status counts
d = Donut(disciplineStatus, label=['status', 'Count'], values='Count',
          text_font_size='12pt', hover_text='status_count', legend='top_left', 
          tools=TOOLS, background_fill_color="#E8DDCB", title='Discipline status', 
          color=RdBu3)

# Print the chart
output_notebook()
show(d)

### Active students x Approvals/Failures

It is useful to know how the active students are distributed between those who passed and those who failed this discipline.
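
One possible way to get this distribution is a cross-tabulation of `status` against `status.disciplina`. The sketch below uses the five rows shown in `df.head()` as sample data; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# The status columns of the five rows shown in df.head()
sample = pd.DataFrame({
    "status": ["CANCELADO", "CANCELADO", "CANCELADO", "ATIVO", "ATIVO"],
    "status.disciplina": ["Reprovado", "Aprovado", "Reprovado", "Aprovado", "Reprovado"],
})

# Counts of approvals and failures for every student status
table = pd.crosstab(sample["status"], sample["status.disciplina"])

# Restrict to active students only
active_counts = table.loc["ATIVO"]
```

Passing `normalize="index"` to `pd.crosstab` would return row-wise proportions instead of counts.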