# **Assignment \#2**: Machine Learning MC886/MO444
University of Campinas (UNICAMP), Institute of Computing (IC)

Prof. Sandra Avila, 2022s2



In [None]:
# TODO: RA & Name 
print('RA:181980 ' + 'Bruno Martinez de Farias')
print('RA:220129 ' + 'Leonardo Mazzamboni Colussi')

## Objective

Explore **linear regression** and **logistic regression** alternatives and come up with the best possible model for the problems, avoiding overfitting. In particular, predict the performance of students from public schools in the state of São Paulo based on socioeconomic data from SARESP (School Performance Assessment System of the State of São Paulo, or Sistema de Avaliação de Rendimento Escolar do Estado de São Paulo) 2021.

### Dataset

The data were aggregated from the [Open Data Platform of the Secretary of Education of the State of São Paulo](https://dados.educacao.sp.gov.br/) (*Portal de Dados Abertos da Secretaria da Educação do Estado de São Paulo*). The dataset combines two data sources, the [SARESP questionnaire](https://dados.educacao.sp.gov.br/dataset/question%C3%A1rios-saresp) and the [SARESP test](https://dados.educacao.sp.gov.br/dataset/profici%C3%AAncia-do-sistema-de-avalia%C3%A7%C3%A3o-de-rendimento-escolar-do-estado-de-s%C3%A3o-paulo-saresp-por), both conducted in 2021 with students from the 5th and 9th years of Primary School and the 3rd year of High School. The questionnaire comprises 63 socioeconomic questions and is available at this [link](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing)); the test is composed of Portuguese, Mathematics, and Natural Sciences questions.


**Data Dictionary**:

- **CD_ALUNO**: Student ID;

- **CODESC**: School ID;

- **NOMESC**: School Name;

- **RegiaoMetropolitana**: Metropolitan region;

- **DE**: Name of the Education Board;

- **CODMUN**: City ID;

- **MUN**: City name;

- **SERIE_ANO**: School year;

- **TURMA**: Class;

- **TP_SEXO**: Sex (Female/Male);

- **DT_NASCIMENTO**: Birth date;

- **PERIODO**: Period of study (morning, afternoon, evening);

- **Tem_Nec**: Whether student has any special needs (1 = yes, 0 = no);

- **NEC_ESP_1** - **NEC_ESP_5**: Student disabilities;

- **Tipo_PROVA**: Exam type (A = Enlarged, B = Braille, C = Common);

- **QN**: Student's answer to question N (N = 1, ..., 63); see the questions in the [questionnaire](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing));

- **porc_ACERT_lp**: Percentage of correct answers in the Portuguese test;

- **porc_ACERT_MAT**: Percentage of correct answers in the Mathematics test;

- **porc_ACERT_CIE**: Percentage of correct answers in the Natural Sciences test;

- **nivel_profic_lp**: Proficiency level in the Portuguese test;

- **nivel_profic_mat**: Proficiency level in the Mathematics test;

- **nivel_profic_cie**:  Proficiency level in the Natural Sciences test.


---



You must respect the following training/test split:
- SARESP_train.csv
- SARESP_test.csv

## Linear Regression

This part of the assignment aims to predict students' performance on Portuguese, Mathematics, and Natural Sciences tests (target values: `porc_ACERT_lp`, `porc_ACERT_MAT`, and  `porc_ACERT_CIE`) based on their socioeconomic data. Then, at this point, you have to **drop the columns `nivel_profic_lp`, `nivel_profic_mat`** and **`nivel_profic_cie`**.

### Activities

1. (3.5 points) Perform Linear Regression. You should implement your solution and compare it with ```sklearn.linear_model.SGDRegressor``` (linear model fitted by minimizing a regularized empirical loss with SGD, http://scikit-learn.org). Keep in mind that friends don't let friends use testing data for training :-)

Note: Before we start an ML project, we always conduct a brief exploratory analysis :D 

Some factors to consider: Are there any outliers? Are there missing values? How will you handle categorical variables? Are there any features with low correlation with the target variables? What happens if you drop them?
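The checks listed above can be sketched on a single column before touching the real data. The example below is only an illustration on synthetic values (the arrays `feature` and `target` are made up, not SARESP columns) and uses NumPy alone so it runs standalone; in the notebook, `df.isna().mean()` and `df.corr()` give the pandas equivalents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical): one numeric feature and one target.
feature = rng.normal(loc=50.0, scale=10.0, size=200)
target = 0.5 * feature + rng.normal(scale=5.0, size=200)

# Inject a few missing values and one extreme value.
feature[[3, 17, 42]] = np.nan
feature[5] = 500.0

# 1. Missing values: fraction of NaNs in the column.
missing_fraction = np.isnan(feature).mean()

# 2. Outliers: observed points outside 1.5 * IQR of the quartiles.
observed_mask = ~np.isnan(feature)
observed = feature[observed_mask]
q1, q3 = np.percentile(observed, [25, 75])
iqr = q3 - q1
is_outlier = (observed < q1 - 1.5 * iqr) | (observed > q3 + 1.5 * iqr)

# 3. Correlation with the target on rows that are observed and not outliers.
clean_x = observed[~is_outlier]
clean_y = target[observed_mask][~is_outlier]
corr = np.corrcoef(clean_x, clean_y)[0, 1]

print(f"missing: {missing_fraction:.3f}, outliers: {is_outlier.sum()}, corr: {corr:.2f}")
```

A feature whose correlation with every target stays close to zero is a candidate for removal, but the effect of dropping it should still be confirmed on the validation split.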




In [1]:
# TODO: Load and preprocess your dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rand
import warnings
import zipfile
import os 
from datetime import datetime, date
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder , RobustScaler, MaxAbsScaler
%matplotlib inline

warnings.filterwarnings("ignore")

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
df = pd.read_csv('SARESP_train.csv')
df.shape, df.duplicated().sum()

((120596, 88), 17424)

In [4]:
df.drop_duplicates(inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103172 entries, 0 to 120594
Data columns (total 88 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   CD_ALUNO             103172 non-null  int64  
 1   NOMESC               103172 non-null  object 
 2   Q1                   103172 non-null  object 
 3   Q2                   103172 non-null  object 
 4   Q3                   103172 non-null  object 
 5   Q4                   103172 non-null  object 
 6   Q5                   103172 non-null  object 
 7   Q6                   103172 non-null  object 
 8   Q7                   103172 non-null  object 
 9   Q8                   103172 non-null  object 
 10  Q9                   103172 non-null  object 
 11  Q10                  103172 non-null  object 
 12  Q11                  103172 non-null  object 
 13  Q12                  103172 non-null  object 
 14  Q13                  103172 non-null  object 
 15  Q14              

In [17]:
headers = ['SG_UF', 'CO_ENTIDADE', 'CO_MUNICIPIO', 'QT_MAT_BAS']

# Name the handle `zf` so the built-in zip() is not shadowed in later cells
with zipfile.ZipFile('microdados_censo_escolar_2021.zip') as zf:
    with zf.open('2021/dados/microdados_ed_basica_2021.csv') as csv_file:
        df_mec = pd.read_csv(csv_file,
                             encoding = 'latin-1', 
                             sep = ';',
                             usecols = headers) 
        
df_mec_sp = df_mec[df_mec['SG_UF'] == 'SP'].dropna().reset_index(drop = True)

display(df_mec.head(), df_mec_sp.head())

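|   | SG_UF | CO_MUNICIPIO | CO_ENTIDADE | QT_MAT_BAS |
|---|-------|--------------|-------------|------------|
| 0 | RO | 1100015 | 11022558 | 8.0 |
| 1 | RO | 1100015 | 11024275 | 231.0 |
| 2 | RO | 1100015 | 11024291 | 10.0 |
| 3 | RO | 1100015 | 11024372 | 104.0 |
| 4 | RO | 1100015 | 11024666 | 173.0 |


|   | SG_UF | CO_MUNICIPIO | CO_ENTIDADE | QT_MAT_BAS |
|---|-------|--------------|-------------|------------|
| 0 | SP | 3500105 | 35030806 | 699.0 |
| 1 | SP | 3500105 | 35031045 | 557.0 |
| 2 | SP | 3500105 | 35031082 | 952.0 |
| 3 | SP | 3500105 | 35031100 | 239.0 |
| 4 | SP | 3500105 | 35031112 | 656.0 |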


In [18]:
df_mec_sp = df_mec_sp\
    .groupby('CO_MUNICIPIO')['QT_MAT_BAS'] \
    .agg(MAT_PER_MUN = 'sum', ESC_PER_MUN = 'count')\
    .reset_index() 

df_mec_sp.head()

|   | CO_MUNICIPIO | MAT_PER_MUN | ESC_PER_MUN |
|---|--------------|-------------|-------------|
| 0 | 3500105 | 7248.0 | 26 |
| 1 | 3500204 | 839.0 | 5 |
| 2 | 3500303 | 7287.0 | 26 |
| 3 | 3500402 | 1092.0 | 8 |
| 4 | 3500501 | 3429.0 | 19 |


In [19]:
df_mec = pd.merge(df_mec, 
                  df_mec_sp, 
                  how = 'left', 
                  on = 'CO_MUNICIPIO')

df_mec = df_mec[df_mec['SG_UF'] == 'SP'].drop('SG_UF', axis = 1)

df_mec.rename(columns = {'CO_ENTIDADE': 'CODESC'}, inplace = True)
df_mec['CODESC'] = df_mec['CODESC'].apply(lambda x: str(x)[2:]).astype(int)

In [24]:
df = pd.merge(df, df_mec, how = 'left', on = 'CODESC').drop(['CODMUN','CO_MUNICIPIO'], axis = 1)
df.shape, df.duplicated().sum()

((103172, 90), 0)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103172 entries, 0 to 120594
Data columns (total 88 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   CD_ALUNO             103172 non-null  int64  
 1   NOMESC               103172 non-null  object 
 2   Q1                   103172 non-null  object 
 3   Q2                   103172 non-null  object 
 4   Q3                   103172 non-null  object 
 5   Q4                   103172 non-null  object 
 6   Q5                   103172 non-null  object 
 7   Q6                   103172 non-null  object 
 8   Q7                   103172 non-null  object 
 9   Q8                   103172 non-null  object 
 10  Q9                   103172 non-null  object 
 11  Q10                  103172 non-null  object 
 12  Q11                  103172 non-null  object 
 13  Q12                  103172 non-null  object 
 14  Q13                  103172 non-null  object 
 15  Q14              

In [None]:
df['NEC_ESP_1'].unique(), df['NEC_ESP_2'].unique(), df['NEC_ESP_3'].unique(), df['NEC_ESP_4'].unique()

In [None]:
# cond_deficientes: students with any registered disability (giftedness excluded)
cond_deficientes = (
    (df['NEC_ESP_1'].notna() & (df['NEC_ESP_1'] != 'ALTAS HABILIDADES/SUPERDOTACAO'))
    | df['NEC_ESP_2'].notna()
    | df['NEC_ESP_3'].notna()
    | df['NEC_ESP_4'].notna()
)
# cond_superdotados: students registered as gifted
cond_superdotados = df['NEC_ESP_1'] == 'ALTAS HABILIDADES/SUPERDOTACAO'

idx_deficientes = df[cond_deficientes].index
idx_superdotados = df[cond_superdotados].index

In [None]:
df[['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

In [None]:
df[cond_deficientes][['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

In [None]:
df[cond_superdotados][['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

In [None]:
# A: has any disability
# B: is gifted
# C: has no disability

# Labels now follow the legend above (the original assigned 'C' to disabilities and 'A' to none)
df['disability'] = ['A' if i in idx_deficientes else 'B' if i in idx_superdotados else 'C' for i in df.index]

In [None]:
def age(birth_dt):
    today = date.today()
    return today.year - birth_dt.year - ((today.month, today.day) < (birth_dt.month, birth_dt.day))
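Because `age` depends on `date.today()`, the feature changes every time the notebook is rerun. A possible alternative, sketched here under the assumption that a fixed reference date is acceptable, is to compute the age on a chosen day. The helper `age_on` and the `REFERENCE` date are hypothetical; the actual SARESP 2021 test date is not stated in this notebook.

```python
from datetime import date

def age_on(birth_dt, reference):
    """Completed years between birth_dt and reference."""
    had_birthday = (reference.month, reference.day) >= (birth_dt.month, birth_dt.day)
    return reference.year - birth_dt.year - (0 if had_birthday else 1)

# Hypothetical reference date, chosen only for illustration.
REFERENCE = date(2021, 12, 1)

print(age_on(date(2010, 6, 15), REFERENCE))   # birthday already passed in 2021
print(age_on(date(2010, 12, 25), REFERENCE))  # birthday still to come in 2021
```

With a fixed reference the feature is reproducible, and all ages differ from the `date.today()` version by roughly the same offset, so a linear model is only affected through the intercept.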

In [None]:
df['Age'] = pd.to_datetime(df['DT_NASCIMENTO']).apply(age)
# CODMUN was already dropped after the merge, so only CODESC is cast to string here
df['CODESC'] = df['CODESC'].astype('str')

In [None]:
drop_cols_categ = ['NOMESC', 'MUN', 'DT_NASCIMENTO', 'NEC_ESP_1', 'NEC_ESP_2', 'NEC_ESP_3', 'NEC_ESP_4', 'TURMA', 'nivel_profic_lp', 'nivel_profic_mat', 'nivel_profic_cie']
drop_cols_numeric = ['CD_ALUNO', 'NEC_ESP_5','Tem_Nec']

df_categ = df.select_dtypes(include = 'object').drop(drop_cols_categ, axis = 1)
df_num = df.select_dtypes(include = 'number').drop(drop_cols_numeric, axis = 1)

df_categ.shape, df_num.shape

In [None]:
df_categ.describe()

In [None]:
df_num.describe()

In [None]:
f, axes = plt.subplots(2, 2)

sns.boxplot(  y='QT_MAT_BAS', data=df,  orient='v' , ax=axes[0][0])
sns.boxplot(  y='MAT_PER_MUN', data=df,  orient='v' , ax=axes[0][1]).set(yscale="log")
sns.boxplot(  y='ESC_PER_MUN', data=df,  orient='v' , ax=axes[1][0]).set(yscale="log")
sns.boxplot(  y='Age', data=df,  orient='v' , ax=axes[1][1])

In [None]:
f, axes = plt.subplots(1, 3)

sns.histplot(df.QT_MAT_BAS, alpha=0.4, kde=True, kde_kws={"cut": 3}, ax=axes[0])
sns.histplot(df.ESC_PER_MUN, alpha=0.4, kde=True, kde_kws={"cut": 3}, ax=axes[1])
sns.histplot(df.Age, alpha=0.7, kde=True, kde_kws={"cut": 3}, ax=axes[2])

In [None]:
def plot_corr(df, size=10):
    corr = df.corr()    
    fig, ax = plt.subplots(figsize = (size, size))
    ax.matshow(corr)  
    plt.xticks(range(len(corr.columns)), corr.columns) 
    plt.yticks(range(len(corr.columns)), corr.columns)  

In [None]:
plot_corr(df_num)

In [None]:
df_num.corr()

In [None]:
# OHE
categ_ohe_quests = ['Q9', 'Q22', 'Q24', 'Q34', 'Q59', 'Q61', 'Q62', 'Q63',
                    'RegiaoMetropolitana', 'SERIE_ANO', 'TP_SEXO', 'PERIODO', 
                    'Tipo_PROVA', 'disability']

# Ordinals:

## A < B < C < ... 
enc_greater = ['Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q19', 
               'Q20', 'Q21', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 
               'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41', 
               'Q50', 'Q51', 'Q52', 'Q53', 'Q54', 'Q55', 'Q56', 
               'Q57', 'Q58', 'Q60']

## A > B > C > ... 
enc_lower = ['Q1', 'Q10', 'Q11', 'Q12', 'Q13', 'Q14', 'Q15', 
             'Q16','Q17', 'Q18', 'Q23', 'Q25', 'Q26', 'Q33']

## Categ: A, D, B, C
categ_ADBC = ['Q43', 'Q44', 'Q45', 'Q46', 'Q47', 'Q48', 'Q49']

## Particular cases
#'Q32'
#'Q42' 
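The ordinal groups above differ only in the direction in which the answer letters are ranked. As a minimal sketch of the idea (the helper `ordinal_codes` is hypothetical and written in plain Python), an ordinal encoder with an explicit category list assigns each answer the position it has in that list, which is also how scikit-learn's `OrdinalEncoder(categories=...)` behaves.

```python
def ordinal_codes(answers, order):
    """Map each answer to its position in `order` (first item -> 0)."""
    position = {letter: idx for idx, letter in enumerate(order)}
    return [position[a] for a in answers]

answers = ["A", "C", "E", "B"]

# Ascending order: "A" receives the lowest code.
ascending = ordinal_codes(answers, ["A", "B", "C", "D", "E"])

# Descending order: "A" receives the highest code.
descending = ordinal_codes(answers, ["E", "D", "C", "B", "A"])

print(ascending, descending)
```

For a linear model the direction only flips the sign of the coefficient, but it is worth checking each list against the questionnaire so the codes are easy to interpret.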

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of SARESP_train.csv for validation (the *_test names refer to this held-out split)
x_train, x_test, y_train, y_test = train_test_split(
    df,
    df[['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp',
        'nivel_profic_lp', 'nivel_profic_mat', 'nivel_profic_cie']],
    test_size=0.2, random_state=1)

In [None]:
# The million-dollar preprocessing
from sklearn.preprocessing import OneHotEncoder, RobustScaler, OrdinalEncoder, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Build a ColumnTransformer that encodes and scales the features of interest
df_transform = ColumnTransformer([
    ('cat_encod', OneHotEncoder(drop = 'first'), categ_ohe_quests),
    ('Ordinal_lower', OrdinalEncoder(categories = [['A', 'B', 'C', 'D', 'E']] * len(enc_lower)), enc_lower),
    ('Ordinal_greater', OrdinalEncoder(categories = [['E', 'D', 'C', 'B', 'A']] * len(enc_greater)), enc_greater),
    ('Ordinal_ADBC', OrdinalEncoder(categories = [['A', 'D', 'B', 'C']] * len(categ_ADBC)), categ_ADBC),
    ('Ordinal_particular1', OrdinalEncoder(categories = [['D', 'A', 'B', 'C']]), ['Q32']),
    ('Ordinal_particular2', OrdinalEncoder(categories = [['D', 'C', 'E', 'B', 'A']]), ['Q42']),
    ('scale_robust', RobustScaler(), ['Age','QT_MAT_BAS']),
    ('scale_max', MaxAbsScaler(), ['ESC_PER_MUN'] )   
],
remainder = 'drop')

# Fit the transformer on the training split only, then reuse it unchanged on the held-out split
x_train_prepared = df_transform.fit_transform(x_train)
x_test_prepared = df_transform.transform(x_test)
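Encoders and scalers should learn their statistics from the training split only and then be applied unchanged to held-out data; refitting on the held-out rows would scale the two splits differently and leak information. The sketch below illustrates this with a hand-written robust scaler (median and interquartile range, the same statistics `RobustScaler` uses by default). The helper names and the toy ages are assumptions made for the example.

```python
import numpy as np

def fit_robust_scaler(column):
    """Learn the median and interquartile range from the training column."""
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return median, q3 - q1

def apply_robust_scaler(column, median, iqr):
    """Scale any column with statistics learned elsewhere."""
    return (column - median) / iqr

# Toy values (hypothetical), standing in for the Age feature.
train_age = np.array([10.0, 11.0, 11.0, 12.0, 15.0, 17.0, 18.0])
holdout_age = np.array([11.0, 16.0])

# Statistics come from the training split only.
median, iqr = fit_robust_scaler(train_age)

train_scaled = apply_robust_scaler(train_age, median, iqr)
holdout_scaled = apply_robust_scaler(holdout_age, median, iqr)

print(median, iqr, holdout_scaled)
```

The same pattern corresponds to calling `fit_transform` on the training split and `transform` on every other split.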

In [None]:
class Manual_Regression():
    
    '''
    Linear regression fitted by the Normal Equation or by batch Gradient Descent.
    required packages: numpy.
    '''

    def __init__(self):
        self.train = False

    def fit_normal_equation(self, X, y):
        '''
        inputs: X and y must be a np.array.
        return: linear regression parameters by Normal Equation.
        '''
        # Prepend a column of ones so thetas[0] acts as the intercept
        X = np.insert(X, 0, 1, 1)
        
        # Solve (X^T X) theta = X^T y once and reuse the result
        self.thetas = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
        self.train = True
        
        return self.thetas
        

    def fit_gradient(self, X, y, alpha = 0.01, iterations = 10**5, threshold = 10**(-12)):
        
        '''
        inputs: X and y must be a np.array.
        return: linear regression parameters by Gradient Descent.
        '''
        
        # Prepend a column of ones so thetas[0] acts as the intercept
        X = np.insert(X, 0, 1, 1)
                
        self.thetas = np.random.normal(size = X.shape[1])
        n = len(y)
        cost_func = 1/(2*n) * np.sum((np.dot(X, self.thetas) - y)**2)
         
        for k in range(iterations):
            # Vectorized gradient of the cost: (1/n) * X^T (X theta - y)
            residuals = np.dot(X, self.thetas) - y
            self.thetas = self.thetas - alpha * np.dot(X.T, residuals) / n

            new_cost_func = 1/(2*n) * np.sum((np.dot(X, self.thetas) - y)**2) 
            diff_gain = new_cost_func - cost_func
            cost_func = new_cost_func
            
            # Stop early once the cost has effectively stopped changing
            if k >= 5 and abs(diff_gain) <= threshold:
                break
        
        self.train = True
        return self.thetas
     
        
    def predict(self, X_test):
        '''
        inputs: X must be a np.array.
        return: y predicted values by the fitted model.
        '''

        if self.train:
            X_test = np.insert(X_test, 0, 1, 1)
            return np.dot(X_test, self.thetas)
        else:
            raise ValueError("You first must fit a linear regression model.")
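One way to gain confidence in a hand-written Normal Equation solver is to run it on synthetic data whose true parameters are known. The sketch below repeats the same linear-algebra steps outside the class (bias column, then `np.linalg.solve`) and compares the result with `np.linalg.lstsq`; the data and the parameter values are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with known parameters: y = 3 + 2*x1 - 1*x2 + small noise.
X = rng.normal(size=(500, 2))
true_thetas = np.array([3.0, 2.0, -1.0])
y = true_thetas[0] + X @ true_thetas[1:] + rng.normal(scale=0.01, size=500)

# Normal Equation with an explicit bias column.
X_b = np.insert(X, 0, 1, axis=1)
thetas = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Reference solution from NumPy's least-squares routine.
thetas_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("normal equation:", np.round(thetas, 3))
print("lstsq reference:", np.round(thetas_lstsq, 3))
```

If the recovered parameters match the ones used to generate the data, the same code path can be trusted on the prepared SARESP features, provided they are passed as dense arrays without missing values.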
 

In [None]:
# TODO: Linear Regression. You can use scikit-learn libraries.
from sklearn.linear_model import LinearRegression

reg_model = LinearRegression()
reg_model_mat = reg_model.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
reg_model_mat.intercept_, reg_model_mat.coef_

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(reg_model, hist = False, qqplot = True)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_MAT'])
visualizer.show()

In [None]:
# Use a separate estimator per target so earlier fitted models are not overwritten
reg_model_cie = LinearRegression().fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_CIE'])

visualizer = ResidualsPlot(reg_model_cie, hist = False, qqplot = True)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_CIE'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_CIE'])
visualizer.show()

In [None]:
# Use a separate estimator per target so earlier fitted models are not overwritten
reg_model_lp = LinearRegression().fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_lp'])

visualizer = ResidualsPlot(reg_model_lp, hist = False, qqplot = True)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_lp'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_lp'])
visualizer.show()


> What are the conclusions? (1-2 paragraphs)




2. (1 point) Use different Gradient Descent (GD) learning rates when optimizing. Compare the GD-based solutions with Normal Equation. What are the conclusions?
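The comparison asked for here can be prototyped on synthetic data before running it on the full dataset. The following sketch is an assumption-laden illustration, not the assignment solution: it runs batch gradient descent for a fixed number of iterations with three learning rates and compares the final cost with the cost of the Normal Equation solution. The function `gradient_descent` and the generated data are hypothetical.

```python
import numpy as np

def gradient_descent(X_b, y, alpha, iterations=500):
    """Batch gradient descent on the MSE cost; returns the thetas and the final cost."""
    n = len(y)
    thetas = np.zeros(X_b.shape[1])
    for _ in range(iterations):
        residuals = X_b @ thetas - y
        thetas = thetas - alpha * (X_b.T @ residuals) / n
    cost = np.sum((X_b @ thetas - y) ** 2) / (2 * n)
    return thetas, cost

# Synthetic regression problem with standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1.0 + X @ np.array([2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=300)
X_b = np.insert(X, 0, 1, axis=1)

# Closed-form reference: the Normal Equation gives the minimum of the cost.
thetas_ne = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
cost_ne = np.sum((X_b @ thetas_ne - y) ** 2) / (2 * len(y))

results = {}
for alpha in (0.001, 0.01, 0.1):
    _, cost_gd = gradient_descent(X_b, y, alpha)
    results[alpha] = cost_gd
    print(f"alpha={alpha}: cost={cost_gd:.4f} (normal equation: {cost_ne:.4f})")
```

On this toy problem the smallest learning rate is still far from the optimum after 500 iterations, while the largest one reaches the Normal Equation cost; on real data the usable range depends on how the features are scaled.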


In [None]:
# TODO: Gradient Descent (GD) with 3 different learning rates. You can use scikit-learn libraries.

from sklearn.linear_model import SGDRegressor

sgd_reg_model = SGDRegressor(max_iter = 1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg_model_mat = sgd_reg_model.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
sgd_reg_model_mat.intercept_,sgd_reg_model_mat.coef_

In [None]:
from yellowbrick.regressor import PredictionError

visualizer = PredictionError(sgd_reg_model)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_MAT'])
visualizer.show()


In [None]:
sgd_reg_model2 = SGDRegressor(max_iter = 1000, tol=1e-3, penalty=None, eta0=0.01)
sgd_reg_model2.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
sgd_reg_model2.intercept_,sgd_reg_model2.coef_

In [None]:
visualizer = PredictionError(sgd_reg_model2)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_MAT'])
visualizer.show()


In [None]:
sgd_reg_model3 = SGDRegressor(max_iter = 1000, tol=1e-3, penalty=None, eta0=0.001)
sgd_reg_model3.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
sgd_reg_model3.intercept_,sgd_reg_model3.coef_

In [None]:
visualizer = PredictionError(sgd_reg_model3)
visualizer.fit(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'])
visualizer.score(np.nan_to_num(x_test_prepared), y_test['porc_ACERT_MAT'])
visualizer.show()


3. (0.75 point) Sometimes we need a more complex function to make good predictions. Devise and evaluate a Polynomial Linear Regression model.


In [None]:
# TODO: Complex model. You can use scikit-learn libraries.
from sklearn.preprocessing import PolynomialFeatures

poly_prep = PolynomialFeatures(degree=3, include_bias=False)

x_poly_prep = poly_prep.fit_transform(x_train_prepared)
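Before expanding every prepared column to degree 3, it helps to estimate how many columns the expansion creates. For `n` input features and degree `d`, `PolynomialFeatures(include_bias=False)` produces `C(n + d, d) - 1` columns. The sketch below evaluates that count; the helper name and the feature counts in the loop are illustrative, since the width of `x_train_prepared` is not shown in this notebook.

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree 1..degree in n_features variables (no bias term)."""
    return comb(n_features + degree, degree) - 1

# Illustrative input widths; the real number of prepared columns may differ.
for n in (2, 10, 100):
    print(n, n_polynomial_features(n, 2), n_polynomial_features(n, 3))
```

Under the assumption of about 100 prepared columns, degree 3 already yields well over 100,000 features, so a lower degree or an expansion restricted to a few selected columns is probably the more practical starting point.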






 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

 


4. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model. 

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.
from sklearn.metrics import mean_squared_error

def plot_cost_functio(X, y, epochs, learning_rate):
    # Hold out 20% of the data to track the validation error
    X_train2, X_val, Y_train2, Y_val = train_test_split(X, y, test_size=0.2)
    train_error, val_error = [], []
    epoch_range = range(1, epochs)
    # Defined before the loop so it exists even when the loop body never runs
    font = {'family': 'serif',
            'color':  'darkblue',
            'weight': 'normal',
            'size': 12,
            }
    # Refit a fresh model for each epoch budget and record its train/validation MSE
    for epoch in epoch_range:
        model = SGDRegressor(max_iter = epoch, tol=1e-3, penalty=None, eta0=learning_rate)
        model.fit(X_train2, Y_train2)
        train_error.append(mean_squared_error(Y_train2, model.predict(X_train2)))
        val_error.append(mean_squared_error(Y_val, model.predict(X_val)))
    # Use the epoch counts on the x-axis instead of the list index
    plt.plot(epoch_range, train_error, "g-", linewidth=1, label="Train")
    plt.plot(epoch_range, val_error, "r:", linewidth=2, label="Val")
    plt.title("Cost Function vs. Number of Epochs", fontdict=font)
    plt.xlabel("Number of Epochs", fontdict=font)
    plt.ylabel("MSE", fontdict=font)
    plt.legend()

In [None]:
from sklearn.linear_model import SGDRegressor

plot_cost_functio(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'], 20, 0.1)

In [None]:
from sklearn.linear_model import SGDRegressor

plot_cost_functio(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'], 20, 0.01)

In [None]:
plot_cost_functio(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'], 20, 1)

In [None]:
plot_cost_functio(np.nan_to_num(x_train_prepared), y_train['porc_ACERT_MAT'], 20, 0.001)
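An alternative to refitting one model per epoch budget is to train a single model and record the error after each pass over the data. The sketch below does this with a small hand-written per-sample SGD loop on synthetic data; the function `sgd_learning_curves`, its defaults, and the data are assumptions made for illustration and do not reproduce `SGDRegressor` exactly (for instance, the learning rate is kept constant).

```python
import numpy as np

def sgd_learning_curves(X, y, epochs=20, eta=0.01, val_fraction=0.2, seed=0):
    """Run per-sample SGD on the squared error and record train/val MSE after every epoch."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(len(y) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    X_b = np.insert(X, 0, 1, axis=1)  # bias column
    X_tr, y_tr = X_b[train_idx], y[train_idx]
    X_val, y_val = X_b[val_idx], y[val_idx]

    thetas = np.zeros(X_b.shape[1])
    train_mse, val_mse = [], []
    for _ in range(epochs):
        # One epoch = one shuffled pass over the training rows.
        for i in rng.permutation(len(y_tr)):
            error = X_tr[i] @ thetas - y_tr[i]
            thetas -= eta * error * X_tr[i]
        train_mse.append(np.mean((X_tr @ thetas - y_tr) ** 2))
        val_mse.append(np.mean((X_val @ thetas - y_val) ** 2))
    return train_mse, val_mse

# Synthetic regression data (not SARESP data).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=400)

train_mse, val_mse = sgd_learning_curves(X, y)
print(f"final train MSE: {train_mse[-1]:.3f}, final val MSE: {val_mse[-1]:.3f}")
```

The two lists can be plotted against `range(1, len(train_mse) + 1)`; a validation curve that rises while the training curve keeps falling would suggest overfitting.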

 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

5. (0.25 point) Pick **your best model**, based on your validation set, and predict the target values for the test set.

## Logistic Regression

Now, this part of the assignment aims to predict students' proficiency level in Portuguese, Mathematics, and Natural Sciences (target values: `nivel_profic_lp`, `nivel_profic_mat` and `nivel_profic_cie`) based on their socioeconomic data. Then, you have to **drop the columns `porc_ACERT_lp`,  `porc_ACERT_MAT`** and  **`porc_ACERT_CIE`**.

### Activities

1. (2.75 points) Perform Multinomial Logistic Regression (_i.e._, softmax regression). It is a generalization of Logistic Regression to the case where we want to handle multiple classes. Try different combinations of features, dropping the ones less correlated to the target variables.
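To make the softmax formulation concrete, here is a minimal NumPy sketch of multinomial logistic regression trained with batch gradient descent on the cross-entropy loss. It is only an illustration on synthetic three-class data; the function names, the learning rate, and the number of epochs are assumptions, and the proficiency labels would first need to be mapped to integer class indices.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes, eta=0.1, epochs=300):
    """Batch gradient descent on cross-entropy; returns a (features + 1, classes) weight matrix."""
    X_b = np.insert(X, 0, 1, axis=1)   # bias column
    Y = np.eye(n_classes)[y]           # one-hot targets
    W = np.zeros((X_b.shape[1], n_classes))
    for _ in range(epochs):
        P = softmax(X_b @ W)
        W -= eta * X_b.T @ (P - Y) / len(y)
    return W

def predict_softmax(X, W):
    """Return the most probable class index for each row."""
    return softmax(np.insert(X, 0, 1, axis=1) @ W).argmax(axis=1)

# Synthetic three-class data: one Gaussian blob per class (not SARESP data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))

W = fit_softmax(X, y, n_classes=3)
accuracy = (predict_softmax(X, W) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The gradient `X_b.T @ (P - Y) / n` has the same form as in linear regression, with predicted probabilities in place of predicted values, which is why the two models can share most of the training loop.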

In [None]:
# TODO: Multinomial Logistic Regression. You can use scikit-learn libraries.

> What are the conclusions? (1-2 paragraphs)


2. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model. 

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.

> What are the conclusions? (1-2 paragraphs)


3. (0.75 point) Pick **your best model** and plot the confusion matrix in the **test set**. 
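As a reminder of what the confusion matrix contains, the sketch below builds one by hand with rows as true classes and columns as predicted classes, the same orientation used by `sklearn.metrics.confusion_matrix`. The label arrays are made up and assume the proficiency levels have already been encoded as integers.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Hypothetical integer-encoded labels for three proficiency levels.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(f"accuracy: {accuracy:.3f}, recall per class: {np.round(recall_per_class, 3)}")
```

Reading the rows shows which true levels are most often confused with which predictions, which is usually more informative than the overall accuracy when the classes are imbalanced.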


In [None]:
# TODO: Plot the confusion matrix. You can use scikit-learn, seaborn, matplotlib libraries.

> What are the conclusions? (1-2 paragraphs)


## Deadline

Monday, September 19, 11:59 pm. 

Penalty policy for late submission: You are not encouraged to submit your assignment after the due date. However, if you do, your grade will be penalized as follows:
- September 20, 11:59 pm : grade * 0.75
- September 21, 11:59 pm : grade * 0.5
- September 22, 11:59 pm : grade * 0.25


## Submission

On Google Classroom, submit your Jupyter Notebook (in Portuguese or English).

**This activity is NOT individual, it must be done in pairs (two-person group).**