# Sentiment analysis and language identification

This is a project for the Natural Language Processing (NLP) class "IFT-7022 Techniques et applications du traitement automatique de la langue (TALN)" taught by [Luc Lamontagne](http://www2.ift.ulaval.ca/~lamontagne/).

## Tasks

This homework has two tasks:

- Task 1: classify a document as expressing positive or negative sentiment.
- Task 2: identify the language of a document.


## License

The current code is released under the **BSD-3-Clause license**. See `LICENSE.md`. Copyright 2018 Guillaume Chevalier.

Note: as this is a university school project, the licenses of the imported libraries, datasets, and other assets have not been checked.

## Using the provided data

The code expects the following folder structure:

(Note: all text files are filtered out with grep to keep the tree short, and the tree was captured before the src* folder was created. The data can be downloaded from the course website.)

In [1]:
!tree data | grep -v txt | grep -v text
!pwd

data
├── task1
│   └── Book
│       ├── neg_Bk
│       └── pos_Bk
└── task2
    └── identification_langue
        ├── corpus_entrainement
        └── corpus_test1

8 directories, 2044 files
/home/users_home/Documents/Session 7/NLP/TP2


The dependencies are listed in the `requirements.txt` file:

In [2]:
!cat requirements.txt

# Python 3.6
conv
numpy
scikit-learn
nltk
# nltk.download('stopwords')
# nltk.download('tagsets')
# nltk.download('sentiwordnet')
matplotlib



# Task 1: classify positive and negative emotion in a document.

In [3]:
# Python 3.6

import os
import glob

from src.pipeline_steps.nltk_word_tokenize import NLTKTokenizer
from src.pipeline_steps.to_lower_case import ToLowerCase
from src.pipeline_steps.remove_stop_words import RemoveStopWords
from src.pipeline_steps.keep_open_classes_only import KeepOpenClassesOnly
from src.pipeline_steps.sentiwordnet import SentiWordNetPosNegAttributes
from src.pipeline_steps.porter_stemmer import PorterStemmerStep
from src.pipeline_steps.data_shape_printer import ShapePrinter
from src.data_loading.task_1 import load_all_data_task_1
from src.text_classifier_pipelines.stop_words_open_class_stemmer.pipeline_factory import find_and_train_best_pipelines

[nltk_data] Downloading package stopwords to /home/gui/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /home/gui/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


In [4]:
neg_Bk_files = glob.glob(os.path.join(".", "data", "task1", "Book", "neg_Bk", "*.text"))
pos_Bk_files = glob.glob(os.path.join(".", "data", "task1", "Book", "pos_Bk", "*.text"))

X_train, y_train, X_test, y_test = load_all_data_task_1(neg_Bk_files, pos_Bk_files)

print(len(X_train), len(y_train), len(X_test), len(y_test))

1600 1600 400 400
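
The 1600/400 counts are consistent with holding out 20% of 2000 documents as a test set. How `load_all_data_task_1` performs the split is not shown here; the sketch below is one possible way to get such a split with scikit-learn's `train_test_split`. The toy documents, the 0/1 label encoding, the stratification, and the random seed are all assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the review texts (hypothetical data);
# the real project reads them from the *.text files.
neg_docs = ["negative review {}".format(i) for i in range(10)]
pos_docs = ["positive review {}".format(i) for i in range(10)]

X = neg_docs + pos_docs
# Assumed label encoding: 0 = negative, 1 = positive.
y = [0] * len(neg_docs) + [1] * len(pos_docs)

# Hold out 20% of the documents, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(y_train), len(X_test), len(y_test))
```

Stratifying on `y` keeps the positive/negative ratio identical in the training and test sets, which makes accuracy scores easier to compare.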


In [5]:
best_trained_pipelines = find_and_train_best_pipelines(X_train, y_train)

Will start Cross Validation for Logistic Classifiers.

Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ' (3-folds cross validation accuracy score=0.78375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'logistic_regressio

Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute ' (3-folds cross validation accuracy score=0.758125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'logistic_regression__C': 10000.0}

Cross-Validation Grid Search for: 'Logistic Class

Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words with_stemming '...
Best hyperparameters for 'Logistic Classifier remove_stop_words with_stemming ' (3-folds cross validation accuracy score=0.734375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'logistic_regression__C': 10000.0}

Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words '...
Best hyperparameters for 'Log

Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase all_attributes '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes ' (3-folds cross validation accuracy score=0.79125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'naive_bayes_multi__alpha': 0.1}

Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase

Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute with_stemming ' (3-folds cross validation accuracy score=0.79125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'naive_bayes_multi__alpha': 0.1}

Cross-Validation Grid Search for: 'Multi

Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute ' (3-folds cross validation accuracy score=0.77875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 50000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7fcd6caec598>, 'naive_bayes_multi__alpha': 0.1}

Cross-Validation Grid Search for: 'Multinomial Naive
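
The logs above come from a cross-validated grid search over pipeline hyperparameters. The sketch below shows roughly how such a search can be set up with scikit-learn's `Pipeline` and `GridSearchCV`. The step names, the `identity` tokenizer, the 3-fold setting, and the `C` value mirror the logged output; the toy documents, the reduced grid, and `max_iter` are assumptions for illustration, and the project's actual `find_and_train_best_pipelines` may work differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def identity(tokens):
    """Return already-tokenized documents unchanged."""
    return tokens


# Toy pre-tokenized documents (hypothetical data, not the course corpus).
X_toy = [
    ["great", "book"], ["loved", "this", "book"], ["wonderful", "story"],
    ["great", "story"], ["loved", "it"], ["wonderful", "book"],
    ["boring", "book"], ["hated", "this", "book"], ["awful", "story"],
    ["boring", "story"], ["hated", "it"], ["awful", "book"],
]
y_toy = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative (assumed encoding)

VECT = "count_vect_that_remove_unfrequent_words_and_stopwords"

pipeline = Pipeline([
    # The documents are token lists already, so tokenization and
    # lowercasing inside the vectorizer are bypassed.
    (VECT, CountVectorizer(tokenizer=identity, lowercase=False, token_pattern=None)),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
])

# A deliberately small grid; the real search likely covers more values.
param_grid = {
    VECT + "__ngram_range": [(1, 1), (1, 2)],
    "logistic_regression__C": [1.0, 10000.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Replacing `LogisticRegression` with `MultinomialNB` (under the step name `naive_bayes_multi`) and tuning `alpha` instead of `C` would give the Naive Bayes variant seen in the logs.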

In [6]:
print("The final test: classifying on test documents of full-length:")
print("")
print("Note: the test set was of 20% of full data, which was held-out of cross validation.")
print("")
max_score = 0
max_score_model = ""
for (model_name, model) in best_trained_pipelines.items():
    score = model.score(X_test, y_test) * 100
    if score > max_score: 
        max_score = score
        max_score_model = model_name
    print("Test set score for '{}': {}%".format(model_name, score))
print("")
print("Max score is by '{}': {}%".format(max_score_model, max_score))
print("")

The final test: classifying on test documents of full-length:

Note: the test set was of 20% of full data, which was held-out of cross validation.

Test set score for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ': 78.5%
Test set score for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute ': 77.25%
Test set score for 'Logistic Classifier with_lowercase all_attributes with_stemming ': 78.5%
Test set score for 'Logistic Classifier with_lowercase all_attributes ': 76.25%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming ': 77.75%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute ': 71.5%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_stemming ': 76.75%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words ': 70.5%
Test set score for 'Logistic Classifier with_lowercase k