# Natural Language Processing Challenge

## Introduction

Processing text is a core skill for data scientists. In this project, you will put that skill into practice by identifying whether a sentence was translated by a machine or by a human.

## Project Overview

In this repository you will find a dataset containing sentences in Spanish and their tags: 0 if the sentence was translated by a machine, 1 if it was translated by a professional translator. Your goal is to build a classifier that can distinguish between the two.

## Guidance
As in a real-life scenario, you are free to make your own choices about how to treat the text. Use the techniques you have learned and common packages to process the data and classify the text.

## Deliverables

1. **Python Code:** Provide well-documented Python code that conducts the analysis.
2. **Accuracy Estimation:** Provide the teacher with your estimate of how well your model will perform.
3. **Classified Dataset:** On Friday, you will receive a dataset without tags. Prepare your code so that it can tag that dataset.


## Approach

### 1. Preprocessing
Tokenization, lowercasing, stopword removal, and lemmatization.

### 2. Feature Extraction
- Traditional: TF-IDF or n-gram frequency analysis.
- Deep Learning: Word embeddings (FastText, Word2Vec, or pretrained ones like BERT).

### 3. Model Selection
- Traditional ML: Logistic Regression, Random Forest, or SVM (good for TF-IDF).
- Deep Learning: LSTM, CNN, or a Transformer-based model like DistilBERT.

### 4. Evaluation
Train/test split, cross-validation, and metrics like accuracy, precision, recall, and F1-score.
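
As one possible instance of the traditional route above, the sketch below chains TF-IDF features into a logistic regression and scores the pipeline with stratified cross-validation. The toy sentences, the n-gram range, and the fold count are placeholders chosen for illustration, not values taken from this project. On the real data you would pass the sentence and label columns instead of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); 0 = machine, 1 = professional translator.
sentences = [
    "El gato duerme en la casa .",
    "Ella canta muy bien hoy .",
    "Nosotros vamos al mercado pronto .",
    "El perro corre por el parque .",
    "Yo hacer canciones de la vida .",
    "Usted podría pensar mucho cosas .",
    "Este cantante tendrá largo tiempo .",
    "La gente todavía sentir humo .",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Word unigrams and bigrams. lowercase=False keeps capitalisation;
# the default tokenizer still drops punctuation and one-letter tokens.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=False),
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the 0/1 ratio the same in every split.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, sentences, labels, cv=cv, scoring="accuracy")
print(f"Accuracy per fold: {scores}")

# Fit on everything, then tag a new, unlabeled sentence.
pipeline.fit(sentences, labels)
predictions = pipeline.predict(["Ella corre por la casa ."])
print(predictions)
```

The mean of the per-fold scores is a reasonable starting point for the accuracy estimate requested in the deliverables, since each fold is scored on sentences the model did not see during fitting.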

# Preprocessing

In [17]:
# imports
import pandas as pd
import re
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

In [12]:
# Load the data (adjust the path as necessary)
file_path = 'TRAINING_DATA.TXT'
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'sentence'])

# Check the first few rows to ensure it loaded correctly
print(data.head())

   label                                           sentence
0      1  Cuando conocí a Janice en 2013 , una familia n...
1      0  Hwang habló en Sur de este año por Southwest M...
2      1  Usted podría pensar Katy Perry y Robert Pattin...
3      1  Cualquiera que haya volado los cielos del crea...
4      1  Bueno , este cantante tendrá un LARGO tiempo p...
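
Raw sentences in this dataset can contain stray double quotes (for instance `( " No hay humo en el pasillo .`), and a CSV parser with default quoting may treat such a character as the start of a quoted field. The sketch below shows quote-agnostic parsing with the standard library. The in-memory sample is made up for illustration and the helper name `read_tagged_lines` is hypothetical. The same idea is available in pandas by passing `quoting=csv.QUOTE_NONE` to `pd.read_csv`.

```python
import csv
import io
from collections import Counter

def read_tagged_lines(handle):
    """Parse 'label<TAB>sentence' lines without giving quotes special meaning."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        label, sentence = row
        rows.append((int(label), sentence))
    return rows

# Hypothetical two-line sample; the first sentence starts with a double quote.
sample = io.StringIO('1\t" No hay humo en el pasillo .\n0\tHola mundo .\n')
rows = read_tagged_lines(sample)

# Counting labels is a quick check that both classes loaded as expected.
label_counts = Counter(label for label, _ in rows)
print(rows)
print(label_counts)
```

Comparing the number of parsed rows with the number of lines in the file is a cheap way to confirm that no sentences were merged during loading.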


In [14]:
# Load pretrained Spanish BERT (no fine-tuning: the classification head is randomly initialized)
model_name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Ensure model is in evaluation mode
model.eval()

# Step 1: Clean Text
def clean_text(text):
    text = text.lower()  # Convert to lowercase (note: the checkpoint loaded above is a cased model)
    text = re.sub(r'[^a-záéíóúüñ\s]', '', text)  # Keep Spanish letters (incl. ü) and whitespace; drops digits and punctuation
    return text

# Function to Predict Label for a Given Sentence
def predict_label(sentence):
    cleaned_sentence = clean_text(sentence)
    inputs = tokenizer(cleaned_sentence, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    
    with torch.no_grad():  # No gradient computation (faster inference)
        outputs = model(**inputs)
    
    probabilities = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1).item()

    return predicted_class, probabilities.tolist()[0]  # Return label and confidence scores

# Step 2: Process the Entire Dataset
def process_dataset(dataset):
    predictions = []
    for sentence in dataset:
        label, confidence = predict_label(sentence)
        predictions.append({
            "sentence": sentence,
            "predicted_label": label,  # 0=Machine, 1=Professional
            "confidence": confidence
        })
    
    return pd.DataFrame(predictions)

# Step 3: Process and Get Predictions
predicted_df = process_dataset(data['sentence'])

# Print predictions for all sentences
print(predicted_df)

# Optionally, save the results to a CSV file
predicted_df.to_csv("predictions.csv", index=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


                                                sentence  predicted_label  \
0      Cuando conocí a Janice en 2013 , una familia n...                0   
1      Hwang habló en Sur de este año por Southwest M...                0   
2      Usted podría pensar Katy Perry y Robert Pattin...                0   
3      Cualquiera que haya volado los cielos del crea...                0   
4      Bueno , este cantante tendrá un LARGO tiempo p...                0   
...                                                  ...              ...   
14919   Yo voy a hacer canciones y voy a hablar de la...                0   
14920  Si la gente en ESA posición todavía se sintier...                0   
14921  ) De todos modos , después de eso , ella es ca...                0   
14922                    ( " No hay humo en el pasillo .                0   
14923   Ella repetía , como hemos luchado durante ves...                0   

                                      confidence  
0       [0.5526164770126
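
Calling the model once per sentence, with every input padded to 128 tokens, is slow for a dataset of roughly 15,000 rows. One common alternative is to score several sentences per forward pass. The sketch below shows only the batching logic. The names `batched`, `predict_in_batches`, and `fake_predict_batch` are hypothetical, and the stand-in predictor would be replaced by a function that tokenizes the batch and runs the model.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

def predict_in_batches(
    sentences: Sequence[str],
    predict_batch: Callable[[List[str]], List[int]],
    batch_size: int = 32,
) -> List[int]:
    """Apply `predict_batch` to each batch and concatenate the labels in order."""
    labels: List[int] = []
    for batch in batched(sentences, batch_size):
        labels.extend(predict_batch(batch))
    return labels

# Stand-in predictor (assumption) so the sketch runs without downloading a model.
# In the notebook, this function could instead call
#   tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
# and return the argmax over model(**inputs).logits as a list of ints.
def fake_predict_batch(batch: List[str]) -> List[int]:
    return [len(sentence) % 2 for sentence in batch]

example = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(batched(example, 2))
labels = predict_in_batches(example, fake_predict_batch, batch_size=2)
print(batches)
print(labels)
```

With `padding=True` the tokenizer pads each batch only to its longest sentence, which usually means far fewer padded positions than a fixed `max_length` of 128.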

In [19]:
true_labels = data['label'].tolist()

# Get predicted labels from the model
predicted_labels = predicted_df['predicted_label'].tolist()

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)

# Calculate precision, recall, and F1 score
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='binary')

# Calculate confusion matrix
conf_matrix = confusion_matrix(true_labels, predicted_labels)

# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.5015
Precision: 0.5374
Recall: 0.0202
F1 Score: 0.0390
Confusion Matrix:
[[7334  130]
 [7309  151]]
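
The reported numbers can be re-derived by hand from the confusion matrix, which also makes the problem visible: the model predicts class 0 for almost every sentence, so accuracy sits at the roughly 50% that always guessing one class would give on this balanced set. This is consistent with the warning above that the classification head was newly initialized rather than trained. The sketch assumes scikit-learn's layout for the matrix, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
tn, fp = 7334, 130   # true 0: predicted 0 / predicted 1
fn, tp = 7309, 151   # true 1: predicted 0 / predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of the sentences predicted 1, how many are truly 1
recall = tp / (tp + fn)             # of the sentences that are truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"Accuracy:          {accuracy:.4f}")
print(f"Precision:         {precision:.4f}")
print(f"Recall:            {recall:.4f}")
print(f"F1 Score:          {f1:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

Because the accuracy is essentially equal to the majority baseline, these figures describe an untrained model and should not be used as the accuracy estimate for the deliverables.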
