# **Assignment \#2**: Machine Learning MC886/MO444
University of Campinas (UNICAMP), Institute of Computing (IC)

Prof. Sandra Avila, 2022s2



In [1]:
# TODO: RA & Name 
print('RA:181980 ' + 'Bruno Martinez de Farias')
print('RA:220129 ' + 'Leonardo Mazzamboni Colussi')

RA:181980 Bruno Martinez de Farias
RA:220129 Leonardo Mazzamboni Colussi


## Objective

Explore **linear regression** and **logistic regression** alternatives and come up with the best possible model for the problems, avoiding overfitting. In particular, predict the performance of students from public schools in the state of São Paulo based on socioeconomic data from SARESP (School Performance Assessment System of the State of São Paulo, or Sistema de Avaliação de Rendimento Escolar do Estado de São Paulo) 2021.

### Dataset

These data were aggregated from the [Open Data Platform of the Secretary of Education of the State of São Paulo](https://dados.educacao.sp.gov.br/) (*Portal de Dados Abertos da Secretaria da Educação do Estado de São Paulo*). The dataset combines two data sources, the [SARESP questionnaire](https://dados.educacao.sp.gov.br/dataset/question%C3%A1rios-saresp) and the [SARESP test](https://dados.educacao.sp.gov.br/dataset/profici%C3%AAncia-do-sistema-de-avalia%C3%A7%C3%A3o-de-rendimento-escolar-do-estado-de-s%C3%A3o-paulo-saresp-por), both conducted in 2021 with students from the 5th and 9th years of Primary School and the 3rd year of High School. The questionnaire comprises 63 socioeconomic questions and is available [here](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing)); the test consists of Portuguese, Mathematics, and Natural Sciences questions.


**Data Dictionary**:

- **CD_ALUNO**: Student ID;

- **CODESC**: School ID;

- **NOMESC**: School Name;

- **RegiaoMetropolitana**: Metropolitan region;

- **DE**: Name of the Education Board;

- **CODMUN**: City ID;

- **MUN**: City name;

- **SERIE_ANO**: School year (grade);

- **TURMA**: Class;

- **TP_SEXO**: Sex (Female/Male);

- **DT_NASCIMENTO**: Birth date;

- **PERIODO**: Period of study (morning, afternoon, evening);

- **Tem_Nec**: Whether student has any special needs (1 = yes, 0 = no);

- **NEC_ESP_1** - **NEC_ESP_5**: Student disabilities;

- **Tipo_PROVA**: Exam type (A = Enlarged, B = Braille, C = Common);

- **QN**: Student answer to question N (N = 1, ..., 63); see the questions in the [questionnaire](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing));

- **porc_ACERT_lp**: Percentage of correct answers in the Portuguese test;

- **porc_ACERT_MAT**: Percentage of correct answers in the Mathematics test;

- **porc_ACERT_CIE**: Percentage of correct answers in the Natural Sciences test;

- **nivel_profic_lp**: Proficiency level in the Portuguese test;

- **nivel_profic_mat**: Proficiency level in the Mathematics test;

- **nivel_profic_cie**:  Proficiency level in the Natural Sciences test.


---



You must respect the following training/test split:
- SARESP_train.csv
- SARESP_test.csv

## Linear Regression

This part of the assignment aims to predict students' performance on the Portuguese, Mathematics, and Natural Sciences tests (target values: `porc_ACERT_lp`, `porc_ACERT_MAT`, and `porc_ACERT_CIE`) based on their socioeconomic data. At this point, you must **drop the columns `nivel_profic_lp`, `nivel_profic_mat`** and **`nivel_profic_cie`**.

### Activities

1. (3.5 points) Perform Linear Regression. You should implement your solution and compare it with ```sklearn.linear_model.SGDRegressor``` (linear model fitted by minimizing a regularized empirical loss with SGD, http://scikit-learn.org). Keep in mind that friends don't let friends use testing data for training :-)

Note: Before we start an ML project, we always conduct a brief exploratory analysis :D 

Some factors to consider: Are there any outliers? Are there missing values? How will you handle categorical variables? Are there any features with low correlation with the target variables? What happens if you drop them?
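
The questions above can be checked with a few pandas calls. The following is a minimal sketch on a toy frame: the column names come from the data dictionary, but the values and the 1.5 × IQR outlier rule are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for SARESP_train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Q1": ["A", "B", "B", "A", "C", "B"],
    "Tem_Nec": [0, 0, 1, 0, 0, 0],
    "porc_ACERT_MAT": [20.8, 100.0, 37.5, np.nan, 41.7, 54.2],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Outliers flagged with the 1.5 * IQR rule (one common heuristic among several).
scores = toy["porc_ACERT_MAT"].dropna()
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

# Absolute Pearson correlation of each numeric feature with the target.
corr_with_target = (
    toy.select_dtypes("number")
    .corr()["porc_ACERT_MAT"]
    .drop("porc_ACERT_MAT")
    .abs()
)
```

Features whose absolute correlation stays near zero are candidates for removal, although the effect of dropping them should be confirmed on a validation split.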




In [2]:
# TODO: Load and preprocess your dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rand
import os 
from datetime import datetime, date
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
%matplotlib inline

os.getcwd()

'/content'

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [5]:
path = '/content/drive/MyDrive/MC886/Trab02/SARESP_train.csv'
df = pd.read_csv(path)



In [6]:
df.shape, df.duplicated().sum()

((120596, 88), 17424)

In [7]:
df.drop_duplicates(inplace = True)

In [8]:
df.head()

Unnamed: 0,CD_ALUNO,NOMESC,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,Q45,Q46,Q47,Q48,Q49,Q50,Q51,Q52,Q53,Q54,Q55,Q56,Q57,Q58,Q59,Q60,Q61,Q62,Q63,RegiaoMetropolitana,DE,CODMUN,MUN,CODESC,SERIE_ANO,TURMA,TP_SEXO,DT_NASCIMENTO,PERIODO,NEC_ESP_1,NEC_ESP_2,NEC_ESP_3,NEC_ESP_4,NEC_ESP_5,Tipo_PROVA,Tem_Nec,porc_ACERT_lp,porc_ACERT_MAT,porc_ACERT_CIE,nivel_profic_lp,nivel_profic_mat,nivel_profic_cie
0,26270013,JULIO FORTES,B,E,E,E,E,E,E,E,B,A,A,A,A,A,A,B,A,B,B,B,B,A,A,A,A,A,C,C,B,B,B,C,A,B,C,C,C,C,C,C,C,A,C,C,C,C,C,C,C,B,C,B,B,B,B,A,D,C,C,D,C,A,D,Região Metropolitana do Vale do Paraíba e Lito...,GUARATINGUETA,414,LAVRINHAS,901489,EM-3ª série,A,F,11/15/2003,MANHÃ,,,,,,C,0,41.7,20.8,20.8,Abaixo do Básico,Abaixo do Básico,Abaixo do Básico
1,30756614,MESSIAS FREIRE PROFESSOR,B,D,E,C,E,E,E,E,A,A,A,A,A,A,A,B,B,C,C,B,B,A,D,A,D,C,C,C,C,B,B,D,A,A,B,C,C,B,C,B,C,A,C,C,C,B,B,B,C,B,B,B,B,A,A,A,C,C,C,C,C,C,B,Região Metropolitana de São Paulo,SUL 1,100,SAO PAULO,37461,5º Ano EF,A,M,6/7/2010,MANHÃ,,,,,,C,0,83.3,100.0,66.7,Adequado,Avançado,Adequado
2,26014872,JOSE CONTI,B,E,B,D,E,B,D,C,A,A,A,A,B,A,B,C,B,B,A,A,A,A,A,A,A,D,C,B,A,A,B,B,A,B,B,C,B,C,C,C,B,A,B,B,C,C,B,B,C,D,C,C,B,C,B,A,E,B,C,B,D,C,C,Interior,JAU,348,IGARACU DO TIETE,25963,9º Ano EF,A,F,12/10/2006,MANHÃ,,,,,,C,0,58.3,37.5,54.2,Básico,Básico,Básico
3,25739025,NAPOLEAO DE CARVALHO FREIRE PROFESSOR,B,D,E,D,C,E,D,D,A,A,B,B,C,B,B,C,B,B,A,B,A,B,B,A,B,D,B,A,B,A,B,B,B,B,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,B,B,D,C,A,A,A,E,C,C,B,C,B,C,Região Metropolitana de São Paulo,CENTRO OESTE,100,SAO PAULO,3924,EM-3ª série,B,M,10/3/2003,MANHÃ,,,,,,C,0,29.2,29.2,16.7,Abaixo do Básico,Abaixo do Básico,Abaixo do Básico
4,27363009,RESIDENCIAL BORDON,B,D,E,E,E,E,E,C,A,A,A,A,C,A,B,C,B,B,A,B,A,A,B,A,B,C,C,A,C,B,B,B,B,A,C,C,B,C,C,C,C,A,C,B,C,C,B,B,B,B,B,B,B,B,B,A,E,B,C,A,D,A,D,Região Metropolitana de Campinas,SUMARE,671,SUMARE,576670,9º Ano EF,D,F,4/6/2007,MANHÃ,,,,,,C,0,79.2,41.7,50.0,Adequado,Abaixo do Básico,Básico


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103172 entries, 0 to 120594
Data columns (total 88 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   CD_ALUNO             103172 non-null  int64  
 1   NOMESC               103172 non-null  object 
 2   Q1                   103172 non-null  object 
 3   Q2                   103172 non-null  object 
 4   Q3                   103172 non-null  object 
 5   Q4                   103172 non-null  object 
 6   Q5                   103172 non-null  object 
 7   Q6                   103172 non-null  object 
 8   Q7                   103172 non-null  object 
 9   Q8                   103172 non-null  object 
 10  Q9                   103172 non-null  object 
 11  Q10                  103172 non-null  object 
 12  Q11                  103172 non-null  object 
 13  Q12                  103172 non-null  object 
 14  Q13                  103172 non-null  object 
 15  Q14              

In [10]:
df['NEC_ESP_1'].unique(), df['NEC_ESP_2'].unique(), df['NEC_ESP_3'].unique(), df['NEC_ESP_4'].unique()

(array([nan, 'INTELCTUAL', 'TRANSTORNO DESINTEGRATIVO DA INFANCIA',
        'FISICA-PARALISIA CEREBRAL', 'SURDEZ SEVERA OU PROFUNDA',
        'FISICA-OUTROS', 'MULTIPLA', 'SURDEZ LEVE OU MODERADA',
        'SINDROME DE DOWN', 'AUTISTA INFANTIL', 'SINDROME DE ASPERGER',
        'FISICA-CADEIRANTE', 'BAIXA VISAO', 'SINDROME DE RETT',
        'ALTAS HABILIDADES/SUPERDOTACAO'], dtype=object),
 array([nan, 'BAIXA VISAO', 'FISICA-OUTROS',
        'TRANSTORNO DESINTEGRATIVO DA INFANCIA',
        'FISICA-PARALISIA CEREBRAL', 'FISICA-CADEIRANTE',
        'SURDEZ LEVE OU MODERADA', 'AUTISTA INFANTIL',
        'SURDEZ SEVERA OU PROFUNDA', 'SINDROME DE ASPERGER'], dtype=object),
 array([nan, 'FISICA-CADEIRANTE', 'INTELCTUAL', 'FISICA-OUTROS',
        'SURDEZ LEVE OU MODERADA', 'FISICA-PARALISIA CEREBRAL'],
       dtype=object),
 array([nan, 'INTELCTUAL'], dtype=object))

In [11]:
cond_deficientes = ((df['NEC_ESP_1'].isna() == False) & (df['NEC_ESP_1'] != 'ALTAS HABILIDADES/SUPERDOTACAO')) | (df['NEC_ESP_2'].isna() == False) | (df['NEC_ESP_3'].isna() == False) | (df['NEC_ESP_4'].isna() == False)
cond_superdotados = df['NEC_ESP_1'] == 'ALTAS HABILIDADES/SUPERDOTACAO'

idx_deficientes = df[cond_deficientes].index
idx_superdotados = df[cond_superdotados].index

In [12]:
df[['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

porc_ACERT_MAT    52.685468
porc_ACERT_CIE    56.960714
porc_ACERT_lp     60.465446
dtype: float64

In [13]:
df[cond_deficientes][['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

porc_ACERT_MAT    35.190641
porc_ACERT_CIE    42.150064
porc_ACERT_lp     40.841859
dtype: float64

In [14]:
df[cond_superdotados][['porc_ACERT_MAT', 'porc_ACERT_CIE', 'porc_ACERT_lp']].mean()

porc_ACERT_MAT    68.629412
porc_ACERT_CIE    73.770588
porc_ACERT_lp     75.982353
dtype: float64

In [15]:
# A: has no disability
# B: is gifted
# C: has any disability

df['disability'] = ['C' if i in idx_deficientes else 'B' if i in idx_superdotados else 'A' for i in df.index]

In [16]:
def age(birth_dt):
    today = date.today()
    return today.year - birth_dt.year - ((today.month, today.day) < (birth_dt.month, birth_dt.day))

In [17]:
df['Age'] = pd.to_datetime(df['DT_NASCIMENTO']).apply(lambda x: age(x))
df[['CODMUN', 'CODESC']] = df[['CODMUN', 'CODESC']].astype('str')

In [18]:
drop_cols_categ = ['NOMESC', 'MUN', 'DT_NASCIMENTO', 'NEC_ESP_1', 'NEC_ESP_2', 'NEC_ESP_3', 'NEC_ESP_4', 'TURMA', 'nivel_profic_lp', 'nivel_profic_mat', 'nivel_profic_cie']
drop_cols_numeric = ['CD_ALUNO', 'NEC_ESP_5']

df_categ = df.select_dtypes(include = 'object').drop(drop_cols_categ, axis = 1)
df_num = df.select_dtypes(include = 'number').drop(drop_cols_numeric, axis = 1)

df_categ.shape, df_num.shape

((103172, 72), (103172, 5))

In [19]:
df_categ.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,Q45,Q46,Q47,Q48,Q49,Q50,Q51,Q52,Q53,Q54,Q55,Q56,Q57,Q58,Q59,Q60,Q61,Q62,Q63,RegiaoMetropolitana,DE,CODMUN,CODESC,SERIE_ANO,TP_SEXO,PERIODO,Tipo_PROVA,disability
count,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172,103172
unique,4,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3,3,3,3,2,4,2,4,4,3,3,3,3,3,4,4,2,3,3,3,3,3,3,3,5,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,3,3,4,4,4,4,7,65,351,2756,3,2,3,2,3
top,B,E,E,E,E,E,E,E,A,A,A,A,B,A,A,B,A,B,A,A,A,A,A,A,A,C,C,B,C,B,C,B,B,B,C,C,C,C,C,C,C,B,C,C,C,C,C,C,C,B,C,B,B,B,B,A,D,C,C,B,A,A,C,Região Metropolitana de São Paulo,LESTE 2,100,909166,9º Ano EF,F,MANHÃ,C,A
freq,63673,48965,56254,42496,41102,42629,45263,30243,71040,92007,51556,82649,40001,84354,51556,44303,57733,57423,75906,54796,73511,66637,57679,81489,60588,39779,59389,47277,51098,44563,53186,47755,43907,76036,69335,89804,76208,74628,88464,90499,80256,60063,74650,65269,60101,69496,53565,74094,66176,61424,52473,39708,82818,43757,50888,73427,40151,47625,85267,42435,38734,34781,59205,41289,4104,30111,273,48052,53451,71137,103089,101595


In [20]:
# OHE
categ_ohe_quests = ['Q9', 'Q22', 'Q24', 'Q34', 'Q59', 'Q61', 'Q62', 'Q63']
categ_ohe_noquests = ['RegiaoMetropolitana', 'SERIE_ANO', 'TP_SEXO', 'PERIODO', 'Tipo_PROVA', 'disability']

# Ordinals:

## A < B < C < ... 
enc_greater = ['Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q19', 
               'Q20', 'Q21', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 
               'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41', 
               'Q50', 'Q51', 'Q52', 'Q53', 'Q54', 'Q55', 'Q56', 
               'Q57', 'Q58', 'Q60']

## A > B > C > ... 
enc_lower = ['Q1', 'Q10', 'Q11', 'Q12', 'Q13', 'Q14', 'Q15', 
             'Q16','Q17', 'Q18', 'Q23', 'Q25', 'Q26', 'Q33']

## Categ: A, D, B, C
categ_ADBC = ['Q43', 'Q44', 'Q45', 'Q46', 'Q47', 'Q48', 'Q49']

## Particular cases
#'Q32'
#'Q42' 

In [21]:
ohe_quests = OneHotEncoder(drop = 'first')
ohe_noquests = OneHotEncoder(drop = 'first')

ohe_transf_quests = ohe_quests.fit_transform(df_categ[categ_ohe_quests])
ohe_transf_noquests = ohe_noquests.fit_transform(df_categ[categ_ohe_noquests])

In [22]:
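# The position of a letter in `categories` is the integer it is encoded as.
# Ordinal_lower maps A -> 0, B -> 1, ..., E -> 4.
Ordinal_lower = OrdinalEncoder(categories = [['A', 'B', 'C', 'D', 'E']] * len(enc_lower))
# Ordinal_greater maps E -> 0, D -> 1, ..., A -> 4.
Ordinal_greater = OrdinalEncoder(categories = [['E', 'D', 'C', 'B', 'A']] * len(enc_greater))
Ordinal_ADBC = OrdinalEncoder(categories = [['A', 'D', 'B', 'C']] * len(categ_ADBC))
# Q32 and Q42 each use their own answer ordering.
Ordinal_particular1 = OrdinalEncoder(categories = [['D', 'A', 'B', 'C']])
Ordinal_particular2 = OrdinalEncoder(categories = [['D', 'C', 'E', 'B', 'A']])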

In [23]:
cols_lower = Ordinal_lower.fit_transform(df_categ[enc_lower])
cols_greater = Ordinal_greater.fit_transform(df_categ[enc_greater])
cols_ADBC = Ordinal_ADBC.fit_transform(df_categ[categ_ADBC])
particular1 = Ordinal_particular1.fit_transform(df_categ['Q32'].values.reshape((df_categ.shape[0],1)))
particular2 = Ordinal_particular2.fit_transform(df_categ['Q42'].values.reshape((df_categ.shape[0],1)))
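
How the `categories` list sets the direction of an ordinal encoding can be checked on a toy column. This is a small sketch with invented answers; only the `OrdinalEncoder` call mirrors the cells above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Invented answers to a single hypothetical question.
answers = np.array([["A"], ["C"], ["E"], ["B"]], dtype=object)

# The position of each letter in `categories` is the integer it is mapped to.
ascending = OrdinalEncoder(categories=[["A", "B", "C", "D", "E"]])
descending = OrdinalEncoder(categories=[["E", "D", "C", "B", "A"]])

codes_ascending = ascending.fit_transform(answers).ravel()    # A -> 0, ..., E -> 4
codes_descending = descending.fit_transform(answers).ravel()  # E -> 0, ..., A -> 4
```

For a linear model, reversing the order of a single ordinal column only flips the sign of its coefficient, so the choice mostly affects how the weights are read.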

In [24]:
multiple_dfs = [pd.DataFrame(ohe_transf_quests.toarray(), columns = ohe_quests.get_feature_names_out(categ_ohe_quests)),
                pd.DataFrame(cols_lower, columns = enc_lower),
                pd.DataFrame(cols_greater, columns = enc_greater),
                pd.DataFrame(cols_ADBC, columns = categ_ADBC),
                pd.DataFrame(particular1, columns = ['Q32']),
                pd.DataFrame(particular2, columns = ['Q42'])]

df_categ_quests = pd.concat(multiple_dfs, axis=1)

In [25]:
quests_cols = sorted(df_categ_quests.columns.to_list(), key=lambda x: int("".join([i for i in x if i.isdigit()])))
df_categ_quests = df_categ_quests[quests_cols]

In [None]:
final_categ_dfs = [df_categ_quests[quests_cols],
                pd.DataFrame(ohe_transf_noquests.toarray(), columns = ohe_noquests.get_feature_names_out(categ_ohe_noquests))]

df_categ = pd.concat(final_categ_dfs, axis=1)
df_categ

In [None]:
df_num.describe()

In [None]:
def plot_corr(df, size=10):
    corr = df.corr()    
    fig, ax = plt.subplots(figsize = (size, size))
    ax.matshow(corr)  
    plt.xticks(range(len(corr.columns)), corr.columns) 
    plt.yticks(range(len(corr.columns)), corr.columns)  

In [None]:
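# df_categ was rebuilt with a fresh 0..n-1 index, while df_num keeps the row labels
# left after drop_duplicates; reset them so concat joins row by row instead of by label.
X = pd.concat([df_categ, df_num[['Tem_Nec', 'Age']].reset_index(drop = True)], axis = 1)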
plot_corr(X)
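
Because `pd.concat(..., axis=1)` aligns rows on index labels, a frame built from encoder output (fresh `RangeIndex`) and a frame that kept its labels after `drop_duplicates` need a common index before they are joined. A toy illustration with invented values:

```python
import pandas as pd

# After drop_duplicates the original row labels are kept, so they have gaps.
kept = pd.DataFrame({"Age": [18, 12, 15]}, index=[0, 2, 5])
# A frame rebuilt from a NumPy array gets a fresh 0..n-1 index.
encoded = pd.DataFrame({"Q1": [0.0, 1.0, 2.0]})

# Joining by label produces the union of both indexes and fills the gaps with NaN.
misaligned = pd.concat([encoded, kept], axis=1)
# Resetting the index first joins the frames row by row.
aligned = pd.concat([encoded, kept.reset_index(drop=True)], axis=1)
```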

In [None]:
X.corr()

In [None]:
class LinearRegression():

    '''
    Linear regression fitted by the Normal Equation or by batch Gradient Descent.
    required packages: numpy.
    Note: this name shadows sklearn.linear_model.LinearRegression imported earlier.
    '''

    def __init__(self):
        self.train = False

    def fit_normal_equation(self, X, y):
        '''
        inputs: X and y must be a np.array.
        return: linear regression parameters by Normal Equation.
        '''
        X = np.insert(X, 0, 1, 1)  # prepend the bias column

        # Solve (X^T X) theta = X^T y once and reuse the result.
        self.thetas = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
        self.train = True

        return self.thetas

    def fit_gradient(self, X, y, alpha = 0.01, iterations = 10**5, threshold = 10**(-12)):
        '''
        inputs: X and y must be a np.array.
        return: linear regression parameters by Gradient Descent.
        '''
        X = np.insert(X, 0, 1, 1)  # prepend the bias column
        n = len(y)

        self.thetas = np.random.normal(size = X.shape[1])
        cost_func = 1/(2*n) * np.sum((np.dot(X, self.thetas) - y)**2)

        for k in range(iterations):
            # Vectorized gradient of the cost with respect to every theta.
            gradients = 1/n * np.dot(X.T, np.dot(X, self.thetas) - y)
            self.thetas = self.thetas - alpha * gradients

            new_cost_func = 1/(2*n) * np.sum((np.dot(X, self.thetas) - y)**2)
            diff_gain = new_cost_func - cost_func
            cost_func = new_cost_func

            # Stop once the cost changes by no more than the threshold.
            if k >= 5 and abs(diff_gain) <= threshold:
                break

        self.train = True
        return self.thetas

    def predict(self, X_test):
        '''
        inputs: X must be a np.array.
        return: y predicted values by the fitted model.
        '''

        if self.train:
            X_test = np.insert(X_test, 0, 1, 1)
            return np.dot(X_test, self.thetas)
        else:
            raise ValueError("You first must fit a linear regression model.")
 

In [None]:
# TODO: Linear Regression. You can use scikit-learn libraries.
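
One possible shape for the comparison with `SGDRegressor` is sketched below on synthetic data rather than on the SARESP features. The data, the true weights, and the `max_iter`, `tol`, and `random_state` values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption: stands in for the encoded features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=500)

# Closed-form solution with an explicit bias column.
Xb = np.insert(X, 0, 1, axis=1)
theta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# SGD is sensitive to feature scale, so standardize with statistics
# computed on the training data only.
scaler = StandardScaler().fit(X)
sgd = SGDRegressor(max_iter=1000, tol=1e-6, random_state=0)
sgd.fit(scaler.transform(X), y)

# Mean squared error of both models on the same data.
mse_ne = np.mean((Xb @ theta_ne - y) ** 2)
mse_sgd = np.mean((sgd.predict(scaler.transform(X)) - y) ** 2)
```

On the real data the two errors should be compared on a held-out validation split, not on the rows used for fitting.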


> What are the conclusions? (1-2 paragraphs)




2. (1 point) Use different Gradient Descent (GD) learning rates when optimizing. Compare the GD-based solutions with Normal Equation. What are the conclusions?
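
A minimal sketch of such a comparison, assuming synthetic data and three arbitrary learning rates (0.001, 0.05, 0.5). The helper `gradient_descent` is hypothetical and written only for this example.

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations=500):
    """Batch GD on the squared-error cost; returns thetas and the cost per iteration."""
    Xb = np.insert(X, 0, 1, axis=1)  # prepend the bias column
    n = len(y)
    thetas = np.zeros(Xb.shape[1])
    costs = []
    for _ in range(iterations):
        residuals = Xb @ thetas - y
        costs.append(np.sum(residuals ** 2) / (2 * n))
        thetas = thetas - alpha * (Xb.T @ residuals) / n
    return thetas, costs

# Synthetic data with known weights (assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)

# Reference solution from the Normal Equation.
Xb = np.insert(X, 0, 1, axis=1)
theta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Distance between each GD solution and the closed-form one.
results = {alpha: gradient_descent(X, y, alpha) for alpha in (0.001, 0.05, 0.5)}
gaps = {alpha: np.linalg.norm(thetas - theta_ne) for alpha, (thetas, _) in results.items()}
```

In this setup the smallest rate is still far from the closed-form solution after 500 iterations, while the two larger rates reach it; a rate that is too large for the data would make the cost diverge instead.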


In [None]:
# TODO: Gradient Descent (GD) with 3 different learning rates. You can use scikit-learn libraries.


3. (0.75 point) Sometimes we need a more complex function to make good predictions. Devise and evaluate a Polynomial Linear Regression model.
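
One way to build such a model is to expand the features with `PolynomialFeatures` and fit an ordinary linear model on the result. The sketch below uses an invented one-feature quadratic problem; the degrees 1, 2, and 6 and the 200/100 split are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship (assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + 2.0 + rng.normal(scale=0.2, size=300)

# Hold out the last 100 rows as a validation set.
x_train, x_val = x[:200], x[200:]
y_train, y_val = y[:200], y[200:]

# Validation MSE for each polynomial degree.
val_mse = {}
for degree in (1, 2, 6):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x_train, y_train)
    val_mse[degree] = np.mean((model.predict(x_val) - y_val) ** 2)
```

With roughly a hundred encoded columns, the number of degree-2 terms grows quadratically, so restricting the expansion to a subset of features may be necessary on the real data.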


In [None]:
# TODO: Complex model. You can use scikit-learn libraries.

 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

 


4. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model.
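
The curves can be produced by recording the cost on both splits at every epoch. A minimal sketch on synthetic data, assuming batch gradient descent with a fixed learning rate of 0.1 and a hypothetical helper `cost`:

```python
import numpy as np

def cost(Xb, y, thetas):
    """Half mean squared error, the same cost minimized by gradient descent."""
    return np.sum((Xb @ thetas - y) ** 2) / (2 * len(y))

# Synthetic data split into training and validation rows (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=400)
Xb = np.insert(X, 0, 1, axis=1)
Xb_train, Xb_val, y_train, y_val = Xb[:300], Xb[300:], y[:300], y[300:]

thetas = np.zeros(Xb.shape[1])
train_costs, val_costs = [], []
for epoch in range(200):
    # Record both costs before the update; only the training rows drive the gradient.
    train_costs.append(cost(Xb_train, y_train, thetas))
    val_costs.append(cost(Xb_val, y_val, thetas))
    gradient = Xb_train.T @ (Xb_train @ thetas - y_train) / len(y_train)
    thetas -= 0.1 * gradient
```

Passing `train_costs` and `val_costs` to `plt.plot` gives the requested curves; a validation curve that rises while the training curve keeps falling would suggest overfitting.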

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.

In [None]:
 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

5. (0.25 point) Pick **your best model**, based on your validation set, and predict the target values for the test set.
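
When predicting on the test set, the preprocessing fitted on the training data should be reused as-is, with `transform` rather than `fit_transform`. A small sketch with `OneHotEncoder`: the category values `TARDE`, `NOITE`, and `INTEGRAL` are assumptions for illustration, and `handle_unknown="ignore"` is one possible policy for categories never seen in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented rows; "INTEGRAL" plays the role of a category absent from training.
train = pd.DataFrame({"PERIODO": ["MANHÃ", "TARDE", "NOITE", "MANHÃ"]})
test = pd.DataFrame({"PERIODO": ["TARDE", "INTEGRAL"]})

# Fit on the training split only, then apply the same mapping to the test split.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
test_encoded = encoder.transform(test).toarray()
```

The unseen category is encoded as a row of zeros, so the test frame keeps the same columns as the training frame.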

## Logistic Regression

Now, this part of the assignment aims to predict students' proficiency level in Portuguese, Mathematics, and Natural Sciences (target values: `nivel_profic_lp`, `nivel_profic_mat` and `nivel_profic_cie`) based on their socioeconomic data. You must therefore **drop the columns `porc_ACERT_lp`, `porc_ACERT_MAT`** and **`porc_ACERT_CIE`**.

### Activities

1. (2.75 points) Perform Multinomial Logistic Regression (_i.e._, softmax regression). It is a generalization of Logistic Regression to the case where we want to handle multiple classes. Try different combinations of features, dropping the ones less correlated to the target variables.
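
A minimal sketch of softmax regression with scikit-learn, assuming four synthetic clusters in place of the real features. The level names are taken from the dataset preview, while the cluster centers, the split, and `max_iter` are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Four synthetic, partly overlapping clusters, one per proficiency level.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
labels = np.array(["Abaixo do Básico", "Básico", "Adequado", "Avançado"])
y_idx = rng.integers(0, 4, size=600)
X = centers[y_idx] + rng.normal(scale=0.7, size=(600, 2))
y = labels[y_idx]

X_train, X_val, y_train, y_val = X[:450], X[450:], y[:450], y[450:]

# Scale with training statistics only, then fit the multiclass model.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

val_accuracy = clf.score(scaler.transform(X_val), y_val)
probabilities = clf.predict_proba(scaler.transform(X_val))  # one column per class
```

Each target (`nivel_profic_lp`, `nivel_profic_mat`, `nivel_profic_cie`) would need its own classifier fitted in the same way.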

In [None]:
# TODO: Multinomial Logistic Regression. You can use scikit-learn libraries.

> What are the conclusions? (1-2 paragraphs)


2. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model. 
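
To obtain a cost per epoch, softmax regression can be trained with plain gradient descent while logging the cross-entropy on both splits. This is a sketch under stated assumptions: synthetic three-class data, a learning rate of 0.1, 300 epochs, and hypothetical helpers `softmax` and `cross_entropy`.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max shift for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot):
    """Mean negative log-likelihood of the true classes."""
    return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))

# Synthetic three-class problem (assumption for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y = rng.integers(0, 3, size=450)
X = centers[y] + rng.normal(scale=0.8, size=(450, 2))
Xb = np.insert(X, 0, 1, axis=1)  # bias column
Y = np.eye(3)[y]                 # one-hot targets
Xb_tr, Xb_va, Y_tr, Y_va = Xb[:350], Xb[350:], Y[:350], Y[350:]

W = np.zeros((Xb.shape[1], 3))
train_loss, val_loss = [], []
for epoch in range(300):
    P_tr = softmax(Xb_tr @ W)
    train_loss.append(cross_entropy(P_tr, Y_tr))
    val_loss.append(cross_entropy(softmax(Xb_va @ W), Y_va))
    # Gradient of the mean cross-entropy with respect to W.
    W -= 0.1 * Xb_tr.T @ (P_tr - Y_tr) / len(Xb_tr)
```

With all weights at zero every class gets probability 1/3, so both curves start at log 3 and should fall from there.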

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.

> What are the conclusions? (1-2 paragraphs)


3. (0.75 point) Pick **your best model** and plot the confusion matrix in the **test set**. 
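
The matrix itself can be computed with `sklearn.metrics.confusion_matrix`. In the sketch below the true and predicted labels are invented, and the ordering of the four levels passed through `labels` is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed ordering of the proficiency levels, from lowest to highest.
labels = ["Abaixo do Básico", "Básico", "Adequado", "Avançado"]

# Invented true and predicted levels for six students.
y_true = np.array(["Básico", "Adequado", "Básico", "Avançado", "Abaixo do Básico", "Adequado"])
y_pred = np.array(["Básico", "Básico", "Básico", "Adequado", "Abaixo do Básico", "Adequado"])

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Fraction of each true class that was predicted correctly.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
```

Passing `cm` to `seaborn.heatmap(cm, annot=True, xticklabels=labels, yticklabels=labels)` is one way to plot it.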


In [None]:
# TODO: Plot the confusion matrix. You can use scikit-learn, seaborn, matplotlib libraries.

> What are the conclusions? (1-2 paragraphs)


## Deadline

Monday, September 19, 11:59 pm. 

Penalty policy for late submission: You are not encouraged to submit your assignment after the due date. However, if you do, your grade will be penalized as follows:
- September 20, 11:59 pm : grade * 0.75
- September 21, 11:59 pm : grade * 0.5
- September 22, 11:59 pm : grade * 0.25


## Submission

On Google Classroom, submit your Jupyter Notebook (in Portuguese or English).

**This activity is NOT individual, it must be done in pairs (two-person group).**