# Programming Project - Unit 2,4
*by Igor A. Brandão and Leandro Max*

**Goals**
The purpose of this project is to explore the following:

- Full content of the statistical part seen in the course;
- Graphs generation;
- Geolocation analysis and hypotheses should be explained in detail.

## Global Imports

Import the libraries needed for:

- Geocoding;
- Maps;
- File input;
- Heatmap;
- Bokeh charts;
- Numpy library;
- Tqdm progress bar

In [None]:
# Libraries required to run this IPython Notebook
!pip install geocoder
!pip install folium
!pip install tqdm

In [11]:
# Import pandas
import pandas as pd

# Import google geocoder
import geocoder as gc

# Import numpy library
import numpy as np

# Import folium heatmap
import folium
from folium.plugins import HeatMap

# Import tqdm progress bar plugin
from tqdm import tqdm

# Import bokeh libraries
from bokeh.charts import Bar, output_notebook, show
from bokeh.layouts import row
from bokeh.models import HoverTool
from bokeh.charts.attributes import cat, color
from bokeh.charts.operations import blend

In [10]:
# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'imd_student_blind.xlsx'

# Load spreadsheet: xl
xl = pd.ExcelFile(file)

# Print sheet names
print(xl.sheet_names)

['Sheet1']


## Data printing

In [15]:
# Load a sheet into a DataFrame by index: df
df = xl.parse(0)

# Print the head of the DataFrame df
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado


In [6]:
df.columns

Index(['a_ID', 'CEP', 'ano_ingresso', 'periodo_ingresso', 'status',
       'ano_disciplina', 'periodo_disciplina', 'nota', 'disciplina_ID',
       'status.disciplina'],
      dtype='object')

In [7]:
df.shape

(4842, 10)

# Geolocation handler section

In this section we handle the geolocation data. The idea is to convert each zip code (CEP) into latitude and longitude and export the resulting dataset, since the geocoding step takes a long time to run.

After that, it is possible to pin the students' positions on a map and generate the heatmap.
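Because several rows of the spreadsheet share the same CEP (the first two rows shown above, for instance), one way to shorten the geocoding step is to look up each distinct zip code once and map the result back to every row. The sketch below illustrates that idea under stated assumptions: `geocode_unique` and `fake_geocode` are hypothetical names, the coordinates are made up, and the stand-in geocoder replaces the real network call so the example runs offline.

```python
import pandas as pd


def geocode_unique(df, geocode, zip_col="CEP"):
    """Geocode each distinct zip code once and map the result back to every row.

    `geocode` is a hypothetical callable that takes a zip code string and
    returns a (lat, lng) tuple, or (None, None) when the lookup fails.
    """
    unique_zips = df[zip_col].drop_duplicates()
    # One lookup per distinct zip code
    lookup = {z: geocode(str(z)) for z in unique_zips}
    out = df.copy()
    out["lat"] = out[zip_col].map(lambda z: lookup[z][0])
    out["long"] = out[zip_col].map(lambda z: lookup[z][1])
    return out, len(lookup)


# Stand-in geocoder (assumption): records each call and returns made-up values
calls = []


def fake_geocode(zip_code):
    calls.append(zip_code)
    return (-5.79, -35.20) if zip_code == "59015430" else (None, None)


sample = pd.DataFrame({"a_ID": [0, 0, 1],
                       "CEP": [59015430, 59015430, 59073120]})
geocoded, n_requests = geocode_unique(sample, fake_geocode)
print(geocoded)
print("requests made:", n_requests)
```

With the real data, `geocode` could wrap the geocoder call used in this notebook; the number of requests then depends on how many distinct CEPs the dataset contains rather than on the number of rows.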

In [16]:
# Add placeholder columns for latitude and longitude
df['lat'], df['long'] = [0, 0]

# Print new df
df.head()

Unnamed: 0,a_ID,CEP,ano_ingresso,periodo_ingresso,status,ano_disciplina,periodo_disciplina,nota,disciplina_ID,status.disciplina,lat,long
0,0,59015430,2014,1,CANCELADO,2014,2,2.6,0,Reprovado,0,0
1,0,59015430,2014,1,CANCELADO,2015,1,8.0,0,Aprovado,0,0
2,1,59073120,2014,1,CANCELADO,2014,2,0.1,0,Reprovado,0,0
3,2,59072580,2014,1,ATIVO,2014,2,6.1,0,Aprovado,0,0
4,3,59088150,2014,1,ATIVO,2014,1,3.0,0,Reprovado,0,0


In [None]:
# Retrieve the latitude and longitude of each student's zip code (CEP)
for i in tqdm(range(len(df))):
    # Geocode the CEP rather than the anonymized student ID,
    # and avoid shadowing the built-in `str`
    cep = str(df.loc[i, 'CEP'])
    g = gc.google(cep)
    # Use .loc, since .ix was removed in pandas 1.0
    df.loc[i, 'lat'] = g.lat
    df.loc[i, 'long'] = g.lng
print('Geocoding complete!')


 50%|█████     | 2424/4842 [24:04<19:28,  2.07it/s]  

In [None]:
# Print df with latitude and longitude
df.head()

# Export the new dataSet to csv
df.to_csv('py-students-blind-with-lat-long.csv', encoding="utf-8")

In [None]:
# Read the generated CSV
geodata1 = pd.read_csv('py-students-blind-with-lat-long.csv', encoding="utf-8", index_col=0)

In [None]:
# Keep only the columns related to geolocation
geodata = geodata1.filter(['a_ID','lat','long'], axis=1)
geodata = geodata.rename(columns = {'a_ID':'Aluno'})

# Reset the index, discarding the previous one
geodata = geodata.reset_index(drop=True)
geodata.head()

## Pin map

The idea here is to generate a map with a pin marking each student's location.
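Since the dataset holds one row per student and course, the same student can appear several times, and rows whose geocoding failed have no coordinates. A minimal sketch of reducing the data to one pin per student is shown below; it assumes the `Aluno`, `lat` and `long` column names used in this notebook, while `one_pin_per_student` and the sample coordinates are hypothetical.

```python
import pandas as pd


def one_pin_per_student(geodata, id_col="Aluno"):
    """Return one row per student, skipping rows without coordinates."""
    located = geodata.dropna(subset=["lat", "long"])
    return located.drop_duplicates(subset=id_col).reset_index(drop=True)


# Made-up sample: student 0 appears twice, student 1 has no coordinates
sample = pd.DataFrame({
    "Aluno": [0, 0, 1, 2],
    "lat": [-5.79, -5.79, None, -5.81],
    "long": [-35.20, -35.20, None, -35.25],
})
pins = one_pin_per_student(sample)
print(pins)
```

Iterating over `pins` instead of the full table would place fewer markers and avoid passing missing coordinates to the map.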

In [None]:
# Set map center and zoom level
mapc = [-5.788, -35.202]
zoom = 11

# Create map object
map_osm = folium.Map(location=mapc, zoom_start=zoom)

# Plot each of the locations that we geocoded
for j in tqdm(range(len(geodata))):
    # Use .loc, since .ix was removed in pandas 1.0
    folium.Marker([geodata.loc[j,'lat'], geodata.loc[j,'long']],
                  #popup=(geodata.loc[j,'Unidade'])
                 ).add_to(map_osm)
# Show the map
map_osm

## Heatmap

To better visualize the concentration of students in Natal, we generate a heatmap that represents it with colors.

Cold colors represent low concentration of students, and hot colors indicate high concentrations.
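The heatmap layer takes a list of points, each given as `[lat, long]` or `[lat, long, weight]`, where the weight here is the number of students at that zip code. The sketch below shows one possible way to build that list from a table of counts while dropping incomplete rows. `heatmap_points` is a hypothetical helper, the sample values are made up, and the column names are assumed to match those used in this notebook.

```python
import pandas as pd


def heatmap_points(counts, lat_col="lat", lng_col="long", weight_col="Count"):
    """Build [lat, long, weight] triples, skipping rows with missing values."""
    clean = counts.dropna(subset=[lat_col, lng_col, weight_col])
    return clean[[lat_col, lng_col, weight_col]].astype(float).values.tolist()


# Made-up sample: the second zip code could not be geocoded
sample = pd.DataFrame({
    "CEP": [59015430, 59073120, 59072580],
    "lat": [-5.79, float("nan"), -5.81],
    "long": [-35.20, -35.22, -35.25],
    "Count": [3, 1, 2],
})
points = heatmap_points(sample)
print(points)
```

The resulting list has the shape expected by `HeatMap(...)`, so zip codes with more students contribute more to the hot areas of the map.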

In [None]:
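# Count the number of students per zip code (CEP).
# geodata has no 'CEP' column, so group the full geocoded dataset instead;
# nunique() counts each student once even if they appear in several rows
dataFinal = geodata1.groupby('CEP')['a_ID'].nunique().reset_index(name='Count')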

# Print the dataSet head
dataFinal.head()

In [None]:
# Retrieve the latitude and longitude
dataFinal['lat'], dataFinal['long'] = [0, 0]

for i in tqdm(range(len(dataFinal))):
    # Avoid shadowing the built-in `str`; use .loc, since .ix was removed in pandas 1.0
    cep = str(dataFinal.loc[i,'CEP'])
    g = gc.google(cep)
    dataFinal.loc[i,'lat'] = g.lat
    dataFinal.loc[i,'long'] = g.lng
print('Geocoding complete!')
dataFinal

In [None]:
# Set map center and zoom level
mapc = [-5.788, -35.202]
zoom = 11

# Initialize the coordinates array
coordinates = []

# Add each [lat, long, count] triple to the coordinates list
for i in range(len(dataFinal)):
    point = [dataFinal.loc[i,'lat'], dataFinal.loc[i,'long'], dataFinal.loc[i,'Count']]
    # Skip rows that contain a NaN value
    if not np.isnan(point).any():
        coordinates.append(point)

# Create map object
htMap = folium.Map(location=mapc, zoom_start=zoom)

# Append the coordinates to the heatMap
HeatMap(coordinates).add_to(htMap)

# Print the heatMap
htMap