# Enade data extraction
The National Student Performance Examination (Enade) is one of the assessment procedures of the National Higher Education Assessment System (Sinaes). Enade is administered by the Anísio Teixeira National Institute for Educational Studies and Research (Inep), an autonomous agency linked to the Brazilian Ministry of Education (MEC), following guidelines established by the National Higher Education Assessment Commission (Conaes), the collegiate body that coordinates and supervises Sinaes.
 
Enade is a mandatory curricular component of undergraduate courses, as determined by Law No. 10,861/2004. It is applied periodically to students in all undergraduate courses, during the first (entering) and last (graduating) year of the course.

## Extracting

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import zipfile
import requests
from io import BytesIO
import os

In [2]:
# Creating the folder to extract the data locally
os.makedirs("./enade2019", exist_ok=True)

In [3]:
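# Downloading the zip into memory.
# stream=True only defers the download; accessing .content still reads the whole body into memory.
# NOTE: the file name in this URL says "enem", while the cells below read Enade files;
# confirm the Enade 2019 link on the Inep microdata page before running.
url = "https://download.inep.gov.br/microdados/microdados_enem_2019.zip"

response = requests.get(url, stream=True)
response.raise_for_status()  # fail early on an HTTP error instead of unzipping an error page
filebytes = BytesIO(response.content)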
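
The comment above mentions streaming. With `requests`, `stream=True` only helps when the body is consumed in pieces, for example through `Response.iter_content`. The following is a minimal sketch under that assumption, not the notebook's original method. The names `write_chunks` and `download_to_file`, the 1 MiB chunk size, and the 60-second timeout are hypothetical choices made for illustration. The demonstration at the end runs offline with fake chunks.

```python
import os
import tempfile


def write_chunks(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path and return the bytes written."""
    total = 0
    with open(dest_path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                total += len(chunk)
    return total


def download_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream url to disk so the whole archive never sits in memory at once."""
    import requests  # imported lazily so write_chunks works without requests installed

    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), dest_path)


# Offline demonstration: fake chunks stand in for response.iter_content(...)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = write_chunks([b"abc", b"", b"defg"], demo_path)
print(written)
```

With a real link, `download_to_file(url, "./enade2019/enade.zip")` (a hypothetical destination path) would save the archive to disk, and `zipfile.ZipFile` can open it by path instead of from a `BytesIO` buffer.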

In [5]:
# Extracting the zip
myzip = zipfile.ZipFile(filebytes)
myzip.extractall("./enade2019")
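
Before extracting everything, it can help to list what the archive contains and locate the data file. The sketch below uses `ZipFile.namelist` and `ZipFile.read` on a tiny in-memory archive built only for the demonstration. The helper `find_members`, the member names, and the sample content are hypothetical stand-ins, not the real layout of the Inep archive.

```python
import zipfile
from io import BytesIO


def find_members(archive, suffix):
    """Return the names of archive members ending with suffix (case-insensitive)."""
    return [name for name in archive.namelist() if name.lower().endswith(suffix.lower())]


# Build a tiny in-memory archive as a stand-in for the downloaded file.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("1.LEIA-ME/readme.pdf", b"placeholder")
    zf.writestr("3.DADOS/sample.txt", "NU_ANO;CO_IES\n2019;1\n")

# Reopen it for reading, list the .txt members, and peek at the header line.
with zipfile.ZipFile(buffer) as zf:
    txt_members = find_members(zf, ".txt")
    first_line = zf.read(txt_members[0]).decode("utf-8").splitlines()[0]

print(txt_members)
print(first_line)
```

Applied to the notebook's `myzip` object, `find_members(myzip, ".txt")` would show the path to pass to `pd.read_csv` without browsing the extracted folder by hand.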

## Verifying the data with Pandas

In [7]:
# Reading file
enade = pd.read_csv("./enade2019/3.DADOS/microdados_enade_2019.txt",
                    sep=";",
                    decimal=",",
                    low_memory=False)
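
Reading the full file loads every column into memory. For a quick look, `pd.read_csv` also accepts `usecols` and `nrows`, which restrict the columns and rows that are parsed. The sketch below shows this on made-up rows that follow the same conventions as the call above (`;` separator, `,` decimal mark). The column `NOTA_EXEMPLO` and all values are hypothetical and exist only for the example.

```python
from io import StringIO

import pandas as pd

# Made-up rows in the same layout as the Enade file: ";" separator, "," decimal mark.
sample = StringIO(
    "NU_ANO;CO_IES;NU_IDADE;NOTA_EXEMPLO\n"
    "2019;1;23;45,5\n"
    "2019;1;31;60,25\n"
    "2019;2;27;38,0\n"
)

preview = pd.read_csv(
    sample,
    sep=";",
    decimal=",",
    usecols=["NU_ANO", "NU_IDADE", "NOTA_EXEMPLO"],  # parse only the columns needed
    nrows=2,  # read just the first two data rows
)

print(preview.shape)
print(preview["NOTA_EXEMPLO"].tolist())
```

The same two parameters can be added to the real call when only a handful of variables are needed, which should lower both load time and memory use.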

In [14]:
# Taking a look at the dataset
enade.head(10)

Unnamed: 0,NU_ANO,CO_IES,CO_CATEGAD,CO_ORGACAD,CO_GRUPO,CO_CURSO,CO_MODALIDADE,CO_MUNIC_CURSO,CO_UF_CURSO,CO_REGIAO_CURSO,...,QE_I59,QE_I60,QE_I61,QE_I62,QE_I63,QE_I64,QE_I65,QE_I66,QE_I67,QE_I68
0,2019,1,10002,10028,5710,3,1,5103403,51,5,...,2.0,5.0,1.0,1.0,2.0,5.0,8.0,7.0,1.0,2.0
1,2019,1,10002,10028,5710,3,1,5103403,51,5,...,1.0,4.0,2.0,2.0,2.0,5.0,4.0,4.0,2.0,2.0
2,2019,1,10002,10028,5710,3,1,5103403,51,5,...,3.0,4.0,4.0,3.0,3.0,4.0,1.0,1.0,1.0,4.0
3,2019,1,10002,10028,5710,3,1,5103403,51,5,...,3.0,5.0,2.0,2.0,2.0,3.0,3.0,4.0,3.0,3.0
4,2019,1,10002,10028,5710,3,1,5103403,51,5,...,,,,,,,,,,
5,2019,1,10002,10028,5710,3,1,5103403,51,5,...,1.0,6.0,1.0,1.0,1.0,8.0,1.0,1.0,1.0,1.0
6,2019,1,10002,10028,5710,3,1,5103403,51,5,...,1.0,4.0,1.0,3.0,3.0,6.0,6.0,3.0,6.0,4.0
7,2019,1,10002,10028,5710,3,1,5103403,51,5,...,8.0,6.0,3.0,2.0,2.0,6.0,8.0,1.0,1.0,3.0
8,2019,1,10002,10028,5710,3,1,5103403,51,5,...,5.0,5.0,1.0,1.0,1.0,5.0,1.0,4.0,3.0,1.0
9,2019,1,10002,10028,5710,3,1,5103403,51,5,...,4.0,4.0,1.0,3.0,3.0,4.0,8.0,3.0,4.0,4.0
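
Row 4 in the preview above has no values in the `QE_I` questionnaire columns. A quick way to measure how common this is would be to count missing values per column and flag rows where every questionnaire answer is missing. The sketch below does this on a small made-up frame. The values are hypothetical, and only the column-name pattern `QE_I` is taken from the output above.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the questionnaire columns, with one unanswered row.
sample = pd.DataFrame(
    {
        "CO_IES": [1, 1, 1],
        "QE_I59": [2.0, np.nan, 1.0],
        "QE_I60": [5.0, np.nan, 6.0],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# True for rows where all questionnaire (QE_I*) answers are missing.
rows_all_missing = sample.filter(like="QE_I").isna().all(axis=1)

print(missing_per_column.to_dict())
print(rows_all_missing.tolist())
```

On the real data, `enade.filter(like="QE_I").isna().all(axis=1).sum()` would give the number of students with no questionnaire answers at all.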


In [15]:
# Some additional information
enade.info()
dict(enade.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 433930 entries, 0 to 433929
Columns: 137 entries, NU_ANO to QE_I68
dtypes: float64(2), int64(33), object(102)
memory usage: 453.6+ MB


{'NU_ANO': dtype('int64'),
 'CO_IES': dtype('int64'),
 'CO_CATEGAD': dtype('int64'),
 'CO_ORGACAD': dtype('int64'),
 'CO_GRUPO': dtype('int64'),
 'CO_CURSO': dtype('int64'),
 'CO_MODALIDADE': dtype('int64'),
 'CO_MUNIC_CURSO': dtype('int64'),
 'CO_UF_CURSO': dtype('int64'),
 'CO_REGIAO_CURSO': dtype('int64'),
 'NU_IDADE': dtype('int64'),
 'TP_SEXO': dtype('O'),
 'ANO_FIM_EM': dtype('int64'),
 'ANO_IN_GRAD': dtype('float64'),
 'CO_TURNO_GRADUACAO': dtype('float64'),
 'TP_INSCRICAO_ADM': dtype('int64'),
 'TP_INSCRICAO': dtype('int64'),
 'NU_ITEM_OFG': dtype('int64'),
 'NU_ITEM_OFG_Z': dtype('int64'),
 'NU_ITEM_OFG_X': dtype('int64'),
 'NU_ITEM_OFG_N': dtype('int64'),
 'NU_ITEM_OCE': dtype('int64'),
 'NU_ITEM_OCE_Z': dtype('int64'),
 'NU_ITEM_OCE_X': dtype('int64'),
 'NU_ITEM_OCE_N': dtype('int64'),
 'DS_VT_GAB_OFG_ORIG': dtype('O'),
 'DS_VT_GAB_OFG_FIN': dtype('O'),
 'DS_VT_GAB_OCE_ORIG': dtype('O'),
 'DS_VT_GAB_OCE_FIN': dtype('O'),
 'DS_VT_ESC_OFG': dtype('O'),
 'DS_VT_ACE_OFG': dtype(