<a href="https://colab.research.google.com/github/moriahsantiago/moriahsantiago/blob/main/week16_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 16: Assignment
Perform text analysis on a dataset containing language assessment essays using the Simple Transformers library. This project aims to introduce you to basic NLP concepts and practices.



# Before you Start the Assignment
> Please create a copy of this notebook in your Google Drive and complete the assignment in the copied Colab environment.

> Submit the link to your Colab file for grading.

> `File` > `Save a Copy in Drive`

In [None]:
#!pip install simpletransformers
import pandas as pd
import numpy as np

## Data Import

> You are provided with a dataset named ielts_writing_dataset.csv containing the following columns: `Task_Type`, `Question`, `Essay`, and `Overall` score.
> Data was downloaded from: https://www.kaggle.com/datasets/mazlumi/ielts-writing-scored-essays-dataset


## Task 1: Data Loading and Basic NLP Analysis
> Load the dataset using pandas.

> 1.1 Display the first five rows to understand its structure.

> 1.2 Count 1) the number of unique task types and calculate 2) the average `Overall` score.

> 1.3 Remove all the columns except `Task_Type`, `Question`, `Essay` and `Overall`.

> 1.4 Create a new `df` that keeps only the rows where `Task_Type == 1`.

> 1.5 Create a new `Score` column from the `Overall` column so that it has two levels
{`low`: 0 < y <= 6, `high`: 6 < y}

> 1.6 Report the final distribution of the `Score` column. (Hint: `df['Score'].value_counts().plot.bar()`)
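The steps in Task 1 can be sketched on a small stand-in table. This is only an illustration: the `toy` frame below is made up (it is not read from `ielts_writing_dataset.csv`), and `pd.cut` with an open upper bin is one possible way to build `Score`, not a required one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ielts_writing_dataset.csv (illustrative values only).
toy = pd.DataFrame({
    "Task_Type": [1, 2, 1, 2, 1],
    "Question": ["q1", "q2", "q1", "q2", "q3"],
    "Essay": ["essay a", "essay b", "essay c", "essay d", "essay e"],
    "Examiner_Commen": [None] * 5,
    "Overall": [5.5, 6.5, 5.0, 5.5, 7.0],
})

# 1.2: number of unique task types and the average Overall score
n_task_types = toy["Task_Type"].nunique()
mean_overall = toy["Overall"].mean()

# 1.3: keep only the four columns of interest
toy = toy[["Task_Type", "Question", "Essay", "Overall"]]

# 1.4: keep only Task_Type == 1 (copy so a new column can be added safely)
task1 = toy[toy["Task_Type"] == 1].copy()

# 1.5: right-closed bins give (0, 6] -> low and (6, inf] -> high
task1["Score"] = pd.cut(
    task1["Overall"], bins=[0, 6, np.inf], labels=["low", "high"]
)

# 1.6: distribution of Score; calling .plot.bar() on this draws the chart
score_counts = task1["Score"].value_counts()
print(n_task_types, mean_overall)
print(score_counts)
```

On the real data, the same calls would be applied to the `df` loaded with `pd.read_csv`.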

In [None]:
df = pd.read_csv('ielts_writing_dataset.csv')

Unnamed: 0,Task_Type,Question,Essay,Examiner_Commen,Task_Response,Coherence_Cohesion,Lexical_Resource,Range_Accuracy,Overall
0,1,The bar chart below describes some changes abo...,"Between 1995 and 2010, a study was conducted r...",,,,,,5.5
1,2,Rich countries often give money to poorer coun...,Poverty represents a worldwide crisis. It is t...,,,,,,6.5
2,1,The bar chart below describes some changes abo...,The left chart shows the population change hap...,,,,,,5.0
3,2,Rich countries often give money to poorer coun...,Human beings are facing many challenges nowada...,,,,,,5.5
4,1,The graph below shows the number of overseas v...,Information about the thousands of visits from...,,,,,,7.0




---

> Great work 🙂. Now let's move on to training a BERT model.

---



## Task 2: `Simple Transformers` for Writing Analysis
> Install the Simple Transformers library.

> Use a pre-trained `BERT` model, specifically `bert-base-uncased`, to train your classifier.


In [None]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.64.3-py3-none-any.whl (250 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting wandb>=0.10.32 (from sim

In [None]:
import logging
from simpletransformers.classification import ClassificationModel, ClassificationArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=1)  # increase the number of epochs if possible

# Load a pre-trained BERT model and pass the configuration to it
model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=2,
    args=model_args,
    use_cuda=False,  # set to True if a GPU is available
)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



> Randomly sample 20 responses each from `df` for training (`train_df`), evaluation (`eval_df`), and testing (`test_df`). (Hint: `df.sample()`)

> Recode `Score` to 0 (low) and 1 (high). (Hint: `df.replace({dic})`)

> Keep only the `Essay` and `Score` columns, and rename them to `text` and `labels`.
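One possible way to prepare the three data frames is sketched below, using `text` and `labels` as the column names because the prediction cell later reads `test_df['text']` and `test_df['labels']`. The `toy` frame is hypothetical, `.map()` is used here in place of the hinted `df.replace()`, and shuffling once before slicing is a design choice that keeps the three samples from sharing rows. It assumes the real data has at least 60 rows.

```python
import pandas as pd

# Hypothetical stand-in for the Task 1 data frame (60 rows).
toy = pd.DataFrame({
    "Essay": [f"essay text {i}" for i in range(60)],
    "Score": ["low" if i % 2 == 0 else "high" for i in range(60)],
})

# Keep two columns, recode low/high to 0/1, and rename for Simple Transformers.
prepared = (
    toy[["Essay", "Score"]]
    .assign(Score=lambda d: d["Score"].map({"low": 0, "high": 1}).astype(int))
    .rename(columns={"Essay": "text", "Score": "labels"})
)

# Shuffle once with a fixed seed, then slice into three disjoint samples of 20.
shuffled = prepared.sample(frac=1, random_state=42).reset_index(drop=True)
train_df = shuffled.iloc[:20]
eval_df = shuffled.iloc[20:40]
test_df = shuffled.iloc[40:60]

print(train_df.shape, eval_df.shape, test_df.shape)
```

Calling `df.sample(n=20)` three times also satisfies the instruction, but the same essay could then appear in more than one sample.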

In [None]:
# Prepare the input, output and evaluation data

> Train the model with `model.train_model(train_df)`

> Evaluate the model with `model.eval_model(eval_df)`

> Make predictions with `model.predict()` and check the performance on `test_df`.

> Change the value for `test_case_id` to experiment

In [None]:
test_case_id = 3
print('True label: ', test_df['labels'].iloc[test_case_id])
model.predict([test_df['text'].iloc[test_case_id]])

True label:  1


(array([1]), array([[-0.21299088,  0.03091701]]))
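
The tuple printed above holds the predicted labels followed by the raw model outputs (one score per class). As a rough sketch of how the two relate, the predicted label is the index of the largest raw score, and a softmax turns the scores into values that sum to 1. The numbers are copied from the output above; the variable names are illustrative.

```python
import numpy as np

# Raw outputs copied from the prediction printed above.
raw_outputs = np.array([[-0.21299088, 0.03091701]])

# The predicted class is the index of the largest score in each row.
predicted = raw_outputs.argmax(axis=1)

# Softmax: subtract the row maximum first for numerical stability.
shifted = raw_outputs - raw_outputs.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(predicted)
print(probs.round(3))
```

The two scores are close, so the model favors class 1 only slightly, which is plausible after a single epoch on 20 essays.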

## (Optional, Advanced) Task 3: Text Classification
> Increase the training and evaluation sample sizes to train a model.

> Evaluate the performance of your model.
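
A minimal sketch of one way to evaluate performance is to compare predicted labels with true labels and compute accuracy plus a confusion matrix. The arrays below are hypothetical; in the notebook, `y_true` would come from `test_df['labels']` and `y_pred` from the first element returned by `model.predict()`.

```python
import numpy as np

# Hypothetical labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: share of predictions that match the true label.
accuracy = (y_true == y_pred).mean()

# 2x2 confusion matrix: rows are true labels, columns are predicted labels.
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1

print(accuracy)
print(confusion)
```

The confusion matrix shows whether errors are concentrated in one class, which accuracy alone can hide when `low` and `high` are unevenly represented.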