# Exploratory Data Analysis - Translation Classification

This notebook contains the exploratory analysis of the human vs. machine translation data.

## Machine Learning Models Results


## Introduction
This document presents the results of different machine learning models applied to the translation classification task, with the objective of distinguishing between human and machine translations in Spanish texts.


## Dataset
- Total samples: ~18,000 texts
- Class distribution: Balanced (50% human, 50% machine)
- Text characteristics:
 - Extracted features: character length, word count, average word length, punctuation, uppercase letters, numbers
 - Language: Spanish


## Preprocessing
Note: This preprocessing was applied to all models except the transformer models. For the transformer models, only extra spaces were removed and uppercase letters were converted to lowercase.
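The lighter cleaning used for the transformer models can be sketched as follows. This is a minimal illustration, not the notebook's actual code; the function name `clean_for_transformer` is hypothetical.

```python
def clean_for_transformer(text: str) -> str:
    """Collapse extra whitespace and lowercase the text (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading/trailing whitespace.
    return " ".join(text.split()).lower()


example = clean_for_transformer("  El   Gato\tNegro \n")
print(example)
```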

- Text cleaning: 
 - Space and punctuation normalization
 - Special characters removal
 - Lowercase conversion
 - Lemmatization using spaCy (es_core_news_md model)
 - Stopwords removal using NLTK


- Feature extraction:
 - Text length (characters)
 - Word count
 - Average word length
 - Punctuation count
 - Uppercase letters count
 - Numbers count
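
A minimal sketch of how these six surface features might be computed with the standard library only. The keys `char_length`, `word_count`, `avg_word_length`, `puntuacion`, and `mayusculas` follow names used elsewhere in this document; `numeros`, the function name, and the choice to count digit characters (rather than number tokens) are assumptions.

```python
import string

# ASCII punctuation plus the Spanish inverted marks (an assumption about
# what the original feature counted).
PUNCTUATION = set(string.punctuation + "¿¡")


def extract_features(text: str) -> dict:
    """Return simple surface features for one text (hypothetical helper)."""
    words = text.split()
    return {
        "char_length": len(text),
        "word_count": len(words),
        # Guard against empty texts to avoid division by zero.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "puntuacion": sum(ch in PUNCTUATION for ch in text),
        "mayusculas": sum(ch.isupper() for ch in text),
        "numeros": sum(ch.isdigit() for ch in text),  # counts digit characters
    }


features = extract_features("Hola, Mundo 42!")
print(features)
```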


- Vectorization:
 - TF-IDF with max_features=5000
 - Parameters:
   - min_df=2
   - max_df=0.95
   - ngram_range=(1,1) for unigrams
   - ngram_range=(2,2) for bigrams
   - ngram_range=(3,3) for trigrams


- Normalization (clustering methods only):
 - StandardScaler for numerical features
 - Text normalization using spaCy


- Data split:
 - Standard train/test split
 - Maintaining class balance
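
A class-balanced split can be obtained with the `stratify` argument of scikit-learn's `train_test_split`. In this sketch the placeholder texts, the 0/1 label encoding, and `random_state=42` are assumptions. The `test_size=0.2` value is also an assumption, although it is consistent with the confusion matrices in this document, which each total 3,576 samples (about 20% of the dataset).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 100 texts with perfectly balanced labels.
texts = [f"text {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # assumed encoding: 0 = human, 1 = machine

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # assumed hold-out fraction
    stratify=labels,   # keep the 50/50 class ratio in both splits
    random_state=42,
)

print(Counter(y_train), Counter(y_test))
```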


## Supervised Models


### Classical Models


#### Naive Bayes
- Model: MultinomialNB()
- Vectorization: TF-IDF (max_features=5000)
- Accuracy: 0.43
- Confusion Matrix: [[798, 990], [1056, 732]]
- Precision: 0.42
- Recall: 0.42
- F1-score: 0.42


#### Logistic Regression
- Model: LogisticRegression(max_iter=1000)
- Vectorization: TF-IDF (max_features=5000)
- Accuracy: 0.46
- Confusion Matrix: [[841, 947], [983, 805]]
- Precision: 0.46
- Recall: 0.46
- F1-score: 0.46


#### Linear SVM
- Model: LinearSVC(max_iter=1000)
- Vectorization: TF-IDF (max_features=5000)
- Accuracy: 0.44
- Confusion Matrix: [[793, 995], [1011, 777]]
- Precision: 0.44
- Recall: 0.44
- F1-score: 0.44
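
The three classical models share the same TF-IDF front end, so they can be compared in one loop. The following is a sketch under stated assumptions: the tiny example texts and labels are invented for illustration, and the real experiments used the preprocessed dataset rather than raw strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data (0 = human, 1 = machine is an assumed encoding).
train_texts = [
    "el informe fue entregado ayer por la tarde",
    "ayer por la tarde el informe ha sido entregado",
    "la reunión empezó tarde y terminó pronto",
    "la reunión ha comenzado tarde y ha terminado pronto",
]
train_labels = [0, 1, 0, 1]
test_texts = ["el informe ha sido entregado", "la reunión terminó pronto"]
test_labels = [1, 0]

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(max_iter=1000),
}

results = {}
for name, classifier in models.items():
    # Each pipeline fits its own TF-IDF vocabulary on the training texts only.
    pipeline = make_pipeline(TfidfVectorizer(max_features=5000), classifier)
    pipeline.fit(train_texts, train_labels)
    predictions = pipeline.predict(test_texts)
    results[name] = {
        "accuracy": accuracy_score(test_labels, predictions),
        "confusion_matrix": confusion_matrix(
            test_labels, predictions, labels=[0, 1]
        ).tolist(),
    }

for name, metrics in results.items():
    print(name, metrics)
```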


### Transformer Models


#### BETO
- Model: dccuchile/bert-base-spanish-wwm-uncased
- Parameters:
 - Batch size: 32
 - Epochs: 5
 - Learning rate: 2e-5
 - Optimizer: AdamW
 - Max sequence length: 512
- Accuracy: 0.592
- Final average loss: 0.203
- Confusion Matrix: [[981, 770], [678, 1147]]


#### RoBERTa-Spanish
- Model: BSC-TeMU/roberta-base-bne
- Parameters:
 - Batch size: 32
 - Epochs: 5
 - Learning rate: 2e-5
 - Optimizer: AdamW
 - Max sequence length: 512
- Accuracy: 0.62
- Final average loss: 0.13
- Confusion Matrix: [[1158, 593], [767, 1058]]


### N-grams Analysis


#### Unigrams
- Best features: TF-IDF with max_features=5000
- Parameters:
 - min_df=2
 - max_df=0.95
 - ngram_range=(1,1)
- Accuracy: 0.42


#### Bigrams
- Best features: TF-IDF with max_features=5000
- Parameters:
 - min_df=2
 - max_df=0.95
 - ngram_range=(2,2)
- Accuracy: 0.44


#### Trigrams
- Best features: TF-IDF with max_features=5000
- Parameters:
 - min_df=2
 - max_df=0.95
 - ngram_range=(3,3)
- Accuracy: 0.41
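
To make the three `ngram_range` settings concrete, the sketch below lists the word n-grams of one sentence. It splits on whitespace for simplicity, which is an assumption: scikit-learn's default tokenizer also drops single-character tokens. The helper name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens: list, n: int) -> list:
    """Return all contiguous word sequences of length n (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "la casa es muy grande".split()

unigrams = word_ngrams(tokens, 1)  # corresponds to ngram_range=(1,1)
bigrams = word_ngrams(tokens, 2)   # corresponds to ngram_range=(2,2)
trigrams = word_ngrams(tokens, 3)  # corresponds to ngram_range=(3,3)

print(unigrams)
print(bigrams)
print(trigrams)
```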


## Unsupervised Models


### K-means Clustering


#### Parameters:
- n_clusters: 2
- random_state: 42
- n_init: 10 (the default in scikit-learn versions before 1.4)
- Normalization: StandardScaler()
- Visualization: PCA (2 components)


#### Results by feature set:
- Basic (char_length, word_count):
 - Silhouette Score: 0.593
 - PCA Variance Explained: 1.000


- Length (char_length, word_count, avg_word_length):
 - Silhouette Score: 0.413
 - PCA Variance Explained: 0.998


- Punctuation (puntuacion, mayusculas):
 - Silhouette Score: 0.557
 - PCA Variance Explained: 1.000


- Complete (all features):
 - Silhouette Score: 0.357
 - PCA Variance Explained: 0.626
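
The K-means procedure (scale, cluster, score, project) can be sketched as follows. The synthetic two-column data stands in for `char_length` and `word_count` and is an assumption; it will not reproduce the scores reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for (char_length, word_count): short and long texts.
short_texts = rng.normal(loc=[80, 15], scale=[10, 2], size=(100, 2))
long_texts = rng.normal(loc=[400, 70], scale=[40, 6], size=(100, 2))
X = np.vstack([short_texts, long_texts])

# Standardize so both features contribute equally to Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, cluster_ids)

# Project to 2 components for plotting and report the variance retained.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(f"silhouette={score:.3f}, pca_variance_explained={explained:.3f}")
```

With only two input features, two PCA components necessarily retain all of the variance, which is why the two-feature sets above report 1.000.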


### DBSCAN Clustering


#### Tested Parameters:
- eps: [0.3, 0.5, 0.7]
- min_samples: [5, 10, 15]
- Normalization: StandardScaler()
- Visualization: PCA (2 components)


#### Best Configuration:
- eps: 0.7
- min_samples: 10
- Silhouette Score: 0.170
- Comparison with K-means: 0.232
- Number of identified clusters: 4 + noise
- Cluster characteristics:
 - Cluster 0: Average texts (16,228 samples)
 - Cluster 1: High punctuation texts (12 samples)
 - Cluster 2: Short texts with uppercase and numbers (10 samples)
 - Cluster 3: Very short texts with numbers (10 samples)
 - Noise: Atypical texts (1,617 samples)
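
A grid search over the tested `eps` and `min_samples` values might look like the sketch below. The synthetic data is an assumption, and so is the choice to compute the Silhouette Score on non-noise points only; the source does not say how noise was handled.

```python
import math
from itertools import product

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: two dense groups plus scattered outliers.
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(150, 2)),
    rng.normal([5, 5], 0.3, size=(150, 2)),
    rng.uniform(-3, 8, size=(20, 2)),
])
X_scaled = StandardScaler().fit_transform(X)

results = []
for eps, min_samples in product([0.3, 0.5, 0.7], [5, 10, 15]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    clustered = labels != -1  # DBSCAN marks noise points with label -1
    n_clusters = len(set(labels[clustered].tolist()))
    # The silhouette is only defined for two or more clusters.
    if n_clusters >= 2:
        score = silhouette_score(X_scaled[clustered], labels[clustered])
    else:
        score = float("nan")
    results.append({
        "eps": eps,
        "min_samples": min_samples,
        "n_clusters": n_clusters,
        "n_noise": int((~clustered).sum()),
        "silhouette": score,
    })

valid = [r for r in results if not math.isnan(r["silhouette"])]
best = max(valid, key=lambda r: r["silhouette"])
print(best)
```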


## Conclusions


1. Transformer models (BETO and RoBERTa-Spanish) achieved the best results with accuracies of 0.592 and 0.62 respectively.


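2. Among classical models, Logistic Regression performed best with an accuracy of 0.46, although all three classical models scored below the 0.50 expected from random guessing on a balanced binary task.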


3. The n-gram analysis showed bigrams scoring slightly higher (0.44) than unigrams (0.42) and trigrams (0.41).


4. In unsupervised clustering:
  - K-means worked best with the basic features (character length and word count), reaching a Silhouette Score of 0.593
  - DBSCAN identified interesting groups but with less cohesion than K-means (Silhouette Score of 0.170)
  - DBSCAN placed most texts (16,228) in one main cluster, with three small clusters of 10 to 12 texts for special cases


5. Supervised classification was more effective than unsupervised clustering for distinguishing between human and machine translations.


## Evaluation Metrics
### Description of metrics used:
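- Accuracy: Proportion of all predictions that are correct
- Precision: Proportion of positive predictions that are actually positive
- Recall: Proportion of actual positive cases that the model identifies
- F1-score: Harmonic mean of precision and recall
- Silhouette Score: Measure of cluster cohesion and separation (-1 to 1; higher is better)
- Adjusted Rand Score: Chance-corrected similarity between clusters and real labels (1 for a perfect match, about 0 for random assignment, negative for worse-than-random agreement)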
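
As a worked example, the classification metrics can be recomputed from a reported confusion matrix. The sketch uses the Naive Bayes matrix from this document and assumes the scikit-learn layout (rows are true labels, columns are predicted labels, second class treated as positive); the source does not state the layout, and the function name is hypothetical.

```python
def binary_metrics(cm: list) -> dict:
    """Compute accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    Assumes cm = [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


naive_bayes_cm = [[798, 990], [1056, 732]]
metrics = binary_metrics(naive_bayes_cm)
print(round(metrics["accuracy"], 2))  # (798 + 732) / 3576 -> 0.43
```

The precision, recall, and F1 values depend on which class is treated as positive and on the averaging scheme, so they may differ slightly from the figures reported above.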


## Limitations and Future Work
1. Identified limitations:
  - Difficulty in detecting high-quality translations
  - Extensive training time for transformer models
  - Need for significant computational resources


2. Possible improvements:
  - Experiment with other pre-trained models
  - Increase dataset with more examples
  - Incorporate additional linguistic features
  - Test ensemble learning techniques
