<a href="https://colab.research.google.com/github/gomesluiz/commonlit-readability-predicting/blob/main/make_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the readability of a text

Can Machine Learning (ML) and Natural Language Processing (NLP) techniques predict, from a text alone, the reading level a person needs in order to read it comfortably, and thereby help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages that offer the right level of challenge, they naturally develop reading skills.
<br/><br/>
In this notebook, we develop a supervised regression model to try to answer exactly that question. The model is built on the dataset used in the Kaggle competition [CommonLit Readability Prize](https://www.kaggle.com/competitions/commonlitreadabilityprize), which was sponsored by [CommonLit, Inc.](https://www.commonlit.org/en) and by the [Department of Linguistics at Georgia State University](https://alsl.gsu.edu/).

## Python packages
The following packages are required to run this notebook successfully:
* [NLTK](https://www.nltk.org/)
* [NumPy](https://numpy.org/)
* [Pandas](https://pandas.pydata.org/)
* [String](https://docs.python.org/pt-br/3/library/string.html) (part of the Python standard library, so it does not need to be installed)
* [scikit-learn](https://scikit-learn.org/stable/index.html) (imported as `sklearn`)


If these packages are not installed yet, run the following command on the Linux or Windows command line:
```
pip install nltk numpy pandas scikit-learn
```

or, in a [Jupyter Notebook](https://jupyter.org/) cell, run:
```
!pip install nltk numpy pandas scikit-learn
```
On [Google Colab](https://colab.research.google.com/), all of them are already available.
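
To see which modules are already importable before installing anything, a check along the following lines may help. This is a minimal sketch that uses only the standard library; the helper name `check_modules` is hypothetical and is not part of the original notebook. Note that scikit-learn is imported under the name `sklearn`, and that `string` ships with Python itself.

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names used by this notebook (hypothetical helper, illustrative only).
status = check_modules(["string", "nltk", "numpy", "pandas", "sklearn"])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

Any module reported as missing can then be installed with `pip`, using its package name (for example `scikit-learn` for `sklearn`).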

In [79]:
# Required packages
import string

import nltk
import numpy as np
import pandas as pd
import sklearn 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data collection

**Dataset description**

**Files**
* train.csv
* test.csv

In [11]:
full_train_set = pd.read_csv("data/train.csv", usecols=["excerpt", "target"])
full_train_set.head()

Unnamed: 0,excerpt,target
0,When the young people returned to the ballroom...,-0.340259
1,"All through dinner time, Mrs. Fayre was somewh...",-0.315372
2,"As Roger had predicted, the snow departed as q...",-0.580118
3,And outside before the palace a great garden w...,-1.054013
4,Once upon a time there were Three Bears who li...,0.247197


In [12]:
X_train, X_valid, y_train, y_valid = train_test_split(full_train_set["excerpt"], full_train_set["target"], test_size=.25, random_state=32)

In [68]:
X_train.head()

277     Carbon monoxide is produced from the partial o...
2564    Phœbus Apollo, clad in purple that glowed like...
2301    "O Father! Please to come to the door, and see...
760     The air we breathe is composed of lots of diff...
801     Tornadoes usually form from large thunderstorm...
Name: excerpt, dtype: object

## Data preprocessing

### 1. Data cleaning

### 2. Data normalization

### 3. Tokenization

### 4. Stop-word removal



In [63]:
def preprocessa_texto(texto):
  # Remove punctuation and lowercase the text.
  # Iterating over a string yields characters, so the filter works character by character.
  texto_normalizado = "".join([caractere.lower() for caractere in texto if caractere not in string.punctuation])

  # Tokenize the text.
  texto_tokenizado = nltk.tokenize.word_tokenize(texto_normalizado)

  # Remove stop words and join the remaining tokens back into a single string.
  texto_sem_stopwords = " ".join([palavra for palavra in texto_tokenizado if palavra not in stopwords])

  return texto_sem_stopwords
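
The four preprocessing steps can be followed one at a time on a short sentence. The sketch below is illustrative only and makes two simplifying assumptions so that it runs without any NLTK downloads: `TOY_STOPWORDS` is a tiny hand-picked set rather than the full NLTK English list, and `str.split()` stands in for `nltk.tokenize.word_tokenize`. The function name `preprocess_steps` is hypothetical.

```python
import string

# Assumption: a tiny illustrative stop-word set, not the NLTK English list.
TOY_STOPWORDS = frozenset({"the", "a", "an", "of", "to", "it", "was"})


def preprocess_steps(text, stopwords=TOY_STOPWORDS):
    """Return the intermediate result of each preprocessing step."""
    # Steps 1 and 2: drop punctuation characters and lowercase the rest.
    normalized = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Step 3: tokenize (whitespace split stands in for word_tokenize here).
    tokens = normalized.split()
    # Step 4: remove stop words and join the remaining tokens.
    cleaned = " ".join(tok for tok in tokens if tok not in stopwords)
    return normalized, tokens, cleaned


normalized, tokens, cleaned = preprocess_steps(
    "Instead of an interior scene, it was a winter landscape."
)
print(normalized)
print(tokens)
print(cleaned)
```

One consequence of deleting punctuation instead of replacing it with a space is that hyphenated words are glued together: with this approach, "snow-white" becomes "snowwhite".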

In [64]:
texto = full_train_set.iloc[0]['excerpt']
print(f"Original text:\n{texto}")
print(f"Preprocessed text:\n{preprocessa_texto(texto)}")

Original text:
When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.
The floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.
At each end of the room, on the wall, hung a beautiful bear-skin rug.
These rugs were for prizes, one for the girls and one for the boys. And this was the game.
The girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.
This would have been an easy matter, but each traveller was obliged to wear sn

In [74]:
X_train_preprocessado = X_train.apply(preprocessa_texto)
X_valid_preprocessado = X_valid.apply(preprocessa_texto)

In [75]:
X_train_preprocessado

277     carbon monoxide produced partial oxidation car...
2564    phœbus apollo clad purple glowed like radiance...
2301    father please come door see pretty everything ...
760     air breathe composed lots different chemical c...
801     tornadoes usually form large thunderstorms win...
                              ...                        
252     many people like learn family histories parent...
2136    entering parlour found honoured lady seated ar...
1334    bulgarian government left prejudices one side ...
1579    harriet beecher stowe born june 14 1811 charac...
2775    415 attack simultaneously british german destr...
Name: excerpt, Length: 2125, dtype: object

In [94]:

# Bag-of-words counts, restricted to the 1,000 most frequent terms.
vetorizador = CountVectorizer(max_features=1000)
X_train_vetorizado = vetorizador.fit_transform(X_train_preprocessado)
X_train_vetorizado = pd.DataFrame(X_train_vetorizado.toarray(), columns=vetorizador.get_feature_names_out())
# The DataFrame gets a fresh 0..n-1 index; row order still matches X_train_preprocessado.
X_train_vetorizado.head()

Unnamed: 0,10,ability,able,according,account,acid,across,act,action,active,...,writing,written,wrong,year,years,yellow,yes,yet,york,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [96]:
# Reuse the vocabulary learned on the training set: transform, not fit_transform.
X_valid_vetorizado = vetorizador.transform(X_valid_preprocessado)
X_valid_vetorizado = pd.DataFrame(X_valid_vetorizado.toarray(), columns=vetorizador.get_feature_names_out())
# The DataFrame gets a fresh 0..n-1 index; row order still matches X_valid_preprocessado.
X_valid_vetorizado.head()

Unnamed: 0,10,ability,able,according,account,acid,across,act,action,active,...,writing,written,wrong,year,years,yellow,yes,yet,york,young
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Model training

In [100]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
reg = linear_model.LinearRegression()
reg.fit(X_train_vetorizado, y_train)

## Model testing

In [101]:
y_pred = reg.predict(X_valid_vetorizado)

In [106]:
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_valid, y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_valid, y_pred))


Mean squared error: 1.00
Coefficient of determination: 0.02
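
To help read these two numbers, the sketch below computes both metrics by hand on made-up values. The function names and the toy data are illustrative assumptions, not part of the notebook. It shows that a model which always predicts the mean of the true targets gets a coefficient of determination of exactly 0, so a value near zero, such as the 0.02 printed above, means the model is doing only slightly better than that constant baseline on the validation set.

```python
def mse_manual(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets (illustrative values only).
y_true_toy = [-1.0, 0.0, 1.0, 2.0]

# Baseline: always predict the mean of the true targets.
mean_value = sum(y_true_toy) / len(y_true_toy)
baseline_pred = [mean_value] * len(y_true_toy)

print("Baseline MSE:", mse_manual(y_true_toy, baseline_pred))
print("Baseline R2:", r2_manual(y_true_toy, baseline_pred))
print("Perfect R2:", r2_manual(y_true_toy, y_true_toy))
```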
