# Language Identification and Profanity Detection

This notebook outlines the process for building a system that can identify languages and detect profanity in sentences. It consists of three main components:
1. **FNLI.py**: This module is responsible for generating dictionaries and training the language identification model.
2. **POSTagger.py**: This module implements the Part-of-Speech (POS) tagging using the Stanford POS Tagger.
3. **TAKLUBAN.py**: The main application that integrates the functionalities from the previous modules to process user inputs.

---

## 1. Setup

First, we will import the necessary libraries and define the global paths that will be used throughout the project.

In [1]:
# Cell 1: Import necessary libraries
import os
import csv
import joblib
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tag.stanford import StanfordPOSTagger

# Define global paths to avoid redundancy
model_path = "../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION/LanguageIdentification/saved_model.pkl"
dictionary_dir = "../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION/LanguageIdentification/Dictionary"
output_file = "../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION/POSdata.csv"


## 2. FNLI.py: Dictionary Generation and Model Training

In this section, we will implement the functionality to generate word frequency dictionaries and train the language identification model. 

### 2.1 Dictionary Generation
The `DictionaryGenerator` class will be responsible for creating frequency dictionaries for each language. It initializes noise words, loads English noise words, and removes noise words from sentences. 


In [2]:
# Cell 2: Dictionary Generation Code
class DictionaryGenerator:
    def __init__(self, preprocessed_dir, dictionary_dir, english_dict_path, language):
        base_path = "../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION"
        self.language = language  # Add language to the instance
        self.input_file = os.path.join(preprocessed_dir, f"preprocessed_{language}_sentence_profane.csv")
        self.output_file = os.path.join(preprocessed_dir, f"preprocessed_{language}.csv")
        self.english_dict_path = english_dict_path  # Corrected path for the English dictionary
        self.preprocessed_dir = preprocessed_dir  # Ensure the directory paths are available
        self.dictionary_dir = dictionary_dir
        os.makedirs(os.path.dirname(self.output_file), exist_ok=True)

        # Initialize noise words for all languages
        self.noise_words = self.initialize_noise_words()

    def initialize_noise_words(self):
        """Initialize common noise words for Tagalog, Bikol, Cebuano, and English."""
        noise_words = {
            'Tagalog': {"ang", "ng", "sa", "na", "ay", "mga", "yung", "itong", "dito", "iyan", "kay", "kina"},
            'Bikol': {"ang", "an", "na", "sa", "iyo", "kang", "hali", "ini", "ngani", "iyo", "baga", "si", "sinda"},
            'Cebuano': {"ang", "sa", "na", "ug", "ni", "kay", "kani", "adto", "kini", "katong", "ilang", "iyang"}
        }
        noise_words['English'] = self.load_english_noise_words()
        self.clean_english_noise_words(noise_words)
        return noise_words

    def load_english_noise_words(self):
        noise_words = set()
        try:
            with open(self.english_dict_path, 'r', encoding='utf-8') as infile:
                reader = csv.reader(infile)
                next(reader)  # Skip header
                for row in reader:
                    if row:  # Check if the row is not empty
                        word = row[0].strip()
                        noise_words.add(word.lower())
            print(f"Loaded {len(noise_words)} English noise words.")  # Debugging line
        except FileNotFoundError:
            print(f"Error: The file {self.english_dict_path} does not exist.")
        except Exception as e:
            print(f"An error occurred: {e}")
        return noise_words

    def clean_english_noise_words(self, noise_words):
        """Remove words from English noise words that exist in any of the other three language dictionaries."""
        tagalog_set = noise_words.get('Tagalog', set())
        bikol_set = noise_words.get('Bikol', set())
        cebuano_set = noise_words.get('Cebuano', set())

        common_words = (tagalog_set | bikol_set | cebuano_set) & noise_words['English']
        noise_words['English'] = noise_words['English'] - common_words

    def remove_noise(self, words, language):
        """Remove noise words from the list of words."""
        return [word for word in words if word.lower() not in self.noise_words[language.capitalize()]]

    def generate_dictionary(self, language):
        """Generate a word frequency dictionary from preprocessed sentences, excluding words found in the English dictionary."""
        word_count = Counter()
        preprocessed_file = os.path.join(self.preprocessed_dir, f"preprocessed_{language}_sentence_profane.csv")

        try:
            with open(preprocessed_file, 'r', encoding='utf-8') as infile:
                reader = csv.reader(infile)
                for row in reader:
                    if row:  # Check if the row is not empty
                        sentence = row[0]
                        words = sentence.split()
                        cleaned_words = [word for word in self.remove_noise(words, language)
                                         if word.lower() not in self.noise_words['English']]
                        word_count.update(cleaned_words)

            self.save_dictionary(word_count, language)

        except FileNotFoundError:
            print(f"Error: The file {preprocessed_file} does not exist.")
        except Exception as e:
            print(f"An error occurred: {e}")

    def save_dictionary(self, word_count, language):
        """Save the word frequency dictionary to a CSV file."""
        dict_file = os.path.join(self.dictionary_dir, f"{language}_dictionary.csv")
        with open(dict_file, 'w', newline='', encoding='utf-8') as dict_file:
            writer = csv.writer(dict_file)
            writer.writerow(['word', 'frequency'])
            for word, freq in sorted(word_count.items()):
                writer.writerow([word, freq])
        print(f"Dictionary saved at {dict_file}")


### 2.2 Model Training
The `ModelTraining` class will be responsible for training the language identification model using the generated dictionaries.


In [3]:
# Cell 3: Model Training Code
class ModelTraining:
    """This class is responsible for training the language identification model."""
    
    def __init__(self, dictionary_dir):
        self.dictionary_dir = dictionary_dir
        self.word_frequencies = self.load_dictionaries()

    def load_dictionaries(self):
        frequencies = {}
        for language in ['tagalog', 'bikol', 'cebuano']:
            dict_file = os.path.join(self.dictionary_dir, f"{language}_dictionary.csv")
            if os.path.exists(dict_file):
                df = pd.read_csv(dict_file)
                frequencies[language] = dict(zip(df['word'], df['frequency']))
                print(f"Loaded {language} dictionary with {len(frequencies[language])} entries.")
            else:
                print(f"Dictionary file for {language} not found.")
        return frequencies

    def train_model(self):
        # Prepare training data
        data = []
        labels = []
        for language, word_freq in self.word_frequencies.items():
            for word, freq in word_freq.items():
                if freq > 0 and isinstance(word, str):
                    data.extend([word] * freq)
                    labels.extend([language] * freq)

        if not data or not labels:
            raise ValueError("Training data or labels are empty. Please check your dictionary files.")

        # Split the data (60% training, 30% validation, 10% testing)
        X_train, X_temp, y_train, y_temp = train_test_split(data, labels, test_size=0.40, random_state=50)
        X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.25, random_state=50)

        print(f"Train size: {len(X_train)}, Validation size: {len(X_val)}, Test size: {len(X_test)}")

        # Create a pipeline with TfidfVectorizer for N-gram extraction and MultinomialNB with Laplace smoothing
        pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), MultinomialNB(alpha=1.0))

        # Hyperparameter tuning using GridSearchCV
        param_grid = {
            'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
            'multinomialnb__alpha': [0.1, 0.5, 1.0]
        }
        grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
        grid_search.fit(X_train, y_train)

        model = grid_search.best_estimator_
        model.fit(X_train, y_train)

        # Save the trained model to a file
        joblib.dump(model, model_path)
        print(f"Model saved at {model_path}")

        return model, X_test, y_test

class LanguageIdentification:
    """This class is responsible for predicting the language based on a pre-trained model."""
    
    def __init__(self, model, X_test, y_test):
        self.model = model
        self.X_test = X_test
        self.y_test = y_test

    def predict_language(self, sentence):
        return self.model.predict([sentence])[0]

    def determine_language(self, sentences):
        language_counter = Counter()
        for sentence in sentences:
            dominant_language = self.predict_language(sentence)
            language_counter[dominant_language] += 1
        return language_counter.most_common(1)[0][0] if language_counter else None

    def evaluate_model(self):
        """Evaluate the model and calculate accuracy, precision, recall, and F1 score."""
        predictions = [self.predict_language(sentence) for sentence in self.X_test]

        accuracy = accuracy_score(self.y_test, predictions)
        precision = precision_score(self.y_test, predictions, average='weighted', zero_division=0)
        recall = recall_score(self.y_test, predictions, average='weighted', zero_division=0)
        f1 = f1_score(self.y_test, predictions, average='weighted', zero_division=0)

        return accuracy, precision, recall, f1


## 3. POSTagger.py: Part-of-Speech Tagging

In this section, we will implement the functionality for Part-of-Speech (POS) tagging using the Stanford POS Tagger. The `POSTagger` class will serve as the interface for tagging sentences in the supported languages.


In [4]:
# Cell 4: POSTagger Code
class StanfordPOSTaggerWrapper:
    def __init__(self, language):
        base_path = "../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION"
        self.input_file = f"{base_path}/Results/preprocessed/preprocessed_{language}_sentence_profane.csv"
        self.output_file = f"{base_path}/Results/DATASETOFREGEX/Tagged_{language}.csv"

        os.makedirs(os.path.dirname(self.output_file), exist_ok=True)

        self.data = pd.read_csv(self.input_file, names=['sentence', 'label'])
        print(f"Loaded preprocessed data for {language}. Number of sentences: {len(self.data)}")

        self.tagger = StanfordPOSTagger(
            model_filename='Modules/FSPOST/filipino-left5words-owlqn2-distsim-pref6-inf2.tagger',
            path_to_jar='Modules/FSPOST/stanford-postagger-full-2020-11-17/stanford-postagger.jar',
            java_options='-mx6144m'  # Set maximum Java heap size to 6 GB
        )

    def pos_tag_text(self, text):
        try:
            tokens = text.split()
            pos_tags = self.tagger.tag(tokens)
            return ' '.join([f"{word}|{tag}" for word, tag in pos_tags])
        except Exception as e:
            print(f"Error during POS tagging: {e}")
            return text

    def pos_tag_sentences(self, batch_size=10):
        try:
            for i in range(0, len(self.data), batch_size):
                batch = self.data.iloc[i:i + batch_size].copy()
                batch['pos_tagged'] = batch['sentence'].apply(self.pos_tag_text)

                batch[['pos_tagged', 'label']].to_csv(self.output_file, mode='a', index=False, header=(i == 0))
                print(f"Processed batch {i//batch_size + 1} of {len(self.data) // batch_size + 1}")

            print(f"POS tagging complete. Results saved to {self.output_file}.")
        except Exception as e:
            print(f"An error occurred during POS tagging: {e}")

class POSTagger:
    def __init__(self, language):
        self.language = language.lower()
        self.output_file = 'POSTagging/POSTAGGER/POSData.csv'
        os.makedirs(os.path.dirname(self.output_file), exist_ok=True)

        if self.language in ['tagalog', 'cebuano', 'bikol']:
            self.stanford_tagger = StanfordPOSTaggerWrapper(language)

    @staticmethod
    def clean_token(token):
        token = token.lstrip('/')
        if '|' not in token:
            print(f"Invalid token format, no '|': {token}")
            return token
        word, tag = map(str.strip, token.split('|', 1))
        return f"{word}|{tag}"

    def language_rules(self, token):
        token = self.clean_token(token)
        word, current_tag = token.split('|', 1)

        patterns = {
            'cebuano': {
                'VB': r'\b(mag|nag|mi|mo|mu|mang|manag|man)[a-zA-Z]+\b',
                'NNC': r'\b([a-zA-Z]+on|[a-zA-Z]+an)\b',
                'NNCA': r'\b(ka|pang)[a-zA-Z]+an\b',
                'NNPL': r'\bmga\s+[a-zA-Z]+\b',
                'JJD': r'\b(ma|ka)[a-zA-Z]+an\b',
                'JJCM': r'\bmas\s+[a-zA-Z]+\b',
                'PRP': r'\bako|ikaw|siya|kami|kita|sila\b',
                'DT': r'\bang|bang|mga\b',
                'CCP': r'\bug|o|kundi\b',
            },
            'bikol': {
                'VB': r'\b(MA|MAG|NAG|MANG|PINAG|PA|KA)[a-zA-Z]+\b',
                'NN': r'\b[a-zA-Z]+on\b|\b[a-zA-Z]+an\b|\b[a-zA-Z]+(TA|HON|LAY|LI)[a-zA-Z]*\b',
                'JJ': r'\b(A|KA|MALA)[a-zA-Z]+on\b|\bPINAKA[a-zA-Z]+\b',
                'PRP': r'\bAKO|IKAW|SIYA|KAMI|KITA|SINDA|NIYA|NINDA|NIATO|NATO|SAINDO\b',
                'DT': r'\bANG|MGA|SI|SA|KAN|KUN\b',
                'RB': r'\b(DAKUL|GAD|HALA|DAI|MAYA|SIRA|SINYA|URUG)\b',
                'CC': r'\bOG|PERO|KUNDI\b',
                'IN': r'\bPARA|PAAGI|ASIN|KAN\b',
                'CD': r'\bSARO|DUWA|TULO|APAT|LIMA|ANOM|PITO|WALO|SIYAM|SAMPULO\b',
            }
        }

        language_patterns = patterns.get(self.language, {})
        for tag, pattern in language_patterns.items():
            if re.fullmatch(pattern, word, flags=re.IGNORECASE):
                print(f"Matched: {word} -> {tag}")
                return f"{word}|{tag}"

        return token

    def pos_tag_text(self, text):
        stanford_tagged_text = self.stanford_tagger.pos_tag_text(text)
        print(f"Stanford tagged text: {stanford_tagged_text}")

        if self.language in ['cebuano', 'bikol']:
            tokens = stanford_tagged_text.split()
            return ' '.join([self.language_rules(token) for token in tokens])

        return stanford_tagged_text


## 4. TAKLUBAN.py: Main Application

In this section, we will implement the main application that uses the previously defined classes to identify languages, perform POS tagging, and detect profanity in user-provided sentences.


In [6]:
# Cell 5: TAKLUBAN Code
# Function to check if the CSV file exists and create it if necessary
def initialize_csv():
    """Ensure that the CSV file exists and has a header."""
    if not os.path.exists(output_file):
        with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(['Language', 'Sentence', 'POS', 'Censored Sentence'])  # Header for CSV

# Centralized function for loading or training the language identification model
def load_or_train_model():
    """Load or train the language identification model."""
    if not os.path.exists(model_path):
        print(f"Model file {model_path} not found. Training a new model...")
        trainer = ModelTraining(dictionary_dir)
        model, X_test, y_test = trainer.train_model()
        joblib.dump(model, model_path)
        print(f"Model trained and saved at {model_path}.")
    else:
        print(f"Loading pre-saved model from {model_path}.")
        model = joblib.load(model_path)
        X_test, y_test = [], []  # Empty test sets as they aren't needed for prediction
    return LanguageIdentification(model=model, X_test=X_test, y_test=y_test)

# Function to get the appropriate POS tagger for the specified language
def get_pos_tagger(language):
    """Return the appropriate POS tagger from POSTagger.py for the given language."""
    if language in ['tagalog', 'bikol', 'cebuano']:
        return POSTagger(language)  # Create an instance of the POS tagger for the detected language
    return None

# Function for profanity detection and censorship
def predict_and_censor(sentence, best_model, threshold=0.5):
    """Perform profanity detection and censorship using SVM."""
    probas = best_model.predict_proba([sentence])[0]  # Predict probabilities using the SVM model
    
    is_profane = probas[1] >= threshold  # Only classify as profane if probability is above the threshold
    print(f"SVM Prediction: {'Profane' if is_profane else 'Not Profane'}")

    # If SVM says the sentence is profane, censor it
    if is_profane:
        print(f"Censoring the sentence based on its length.")
        censored_sentence = ' '.join(['*' * len(word) for word in sentence.split()])
        return censored_sentence, True  # Return censored sentence and True to indicate it's profane
    
    return sentence, False  # Return the original sentence and False to indicate it's not profane

# Central function for processing a sentence
def process_sentence(sentence, language_identifier):
    """Process the sentence to predict the language, POS tag it using POSTagger, apply regex rules, and censor if necessary."""
    predicted_language = language_identifier.predict_language(sentence)
    pos_tagger = get_pos_tagger(predicted_language)

    if pos_tagger and predicted_language in ['cebuano', 'bikol', 'tagalog']:
        # Perform initial POS tagging using Stanford POS Tagger
        pos_tagged_sentence = pos_tagger.pos_tag_text(sentence)  # Use the POS tagger from POSTagger.py
        
        profanity_model_path = f'../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION/{predicted_language}_trained_profane_model.pkl'
        
        if os.path.exists(profanity_model_path):
            best_model = joblib.load(profanity_model_path)
            censored_sentence, is_profane = predict_and_censor(sentence, best_model)
        else:
            censored_sentence = sentence
            is_profane = False

        return predicted_language, pos_tagged_sentence, censored_sentence, is_profane
    return "Unsupported language", None, sentence, False

def save_to_csv(language, sentence, pos_tagged, censored_sentence):
    """Save the language, sentence, POS tagged result, and censored sentence to a CSV file."""
    with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([language, sentence, pos_tagged, censored_sentence])

# Main function to run the program
def main():
    # Initialize CSV file and load or train the model
    initialize_csv()
    language_identifier = load_or_train_model()

    print("Welcome to Takluban Language Identifier! Enter your sentences below:")
    
    predictions = []
    true_labels = []

    while True:
        sentence = input("Enter a sentence (or type 'exit' to quit): ").strip()

        if sentence.lower() == 'exit':
            print("Exiting the program.")
            break

        predicted_language, pos_tagged_sentence, censored_sentence, is_profane = process_sentence(sentence, language_identifier)

        if predicted_language in ['cebuano', 'bikol', 'tagalog']:
            print(f"Detected language: {predicted_language}")
            print(f"POS Tagged Sentence: {pos_tagged_sentence}")
            print(f"{'Censored Sentence' if is_profane else 'Cleaned Sentence'}: {censored_sentence}")

            save_to_csv(predicted_language, sentence, pos_tagged_sentence, censored_sentence)

            # Asking the user for the true label (1 = Profane, 0 = Not Profane)
            true_label = int(input("Is the sentence profane? (1 for profane, 0 for not profane): "))
            predictions.append(1 if is_profane else 0)
            true_labels.append(true_label)

            print(f"Sentence '{sentence}' saved with the detected language, POS tagging result, and censored sentence.\n")
        else:
            print(f"Unsupported language detected: {predicted_language}. No POS tagging performed.")
    
    # Confusion matrix and performance metrics calculation
    if len(predictions) > 0:
        print("Confusion Matrix and Performance Metrics:")
        cm = confusion_matrix(true_labels, predictions)
        print(f"Confusion Matrix:\n{cm}")
        
        print("\nClassification Report:")
        print(classification_report(true_labels, predictions))
        
        # Plot the confusion matrix
        plt.figure(figsize=(6,6))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False, xticklabels=['Not Profane', 'Profane'], yticklabels=['Not Profane', 'Profane'])
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.title('Confusion Matrix')
        plt.show()

if __name__ == "__main__":
    main()


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Loading pre-saved model from ../TAKLUBAN-FILIPINO-NATIVE-LANGUAGE-PROFANE-DETECTION/LanguageIdentification/saved_model.pkl.
Welcome to Takluban Language Identifier! Enter your sentences below:
Loaded preprocessed data for tagalog. Number of sentences: 5074
Stanford tagged text: |NAGLALAKAD|VBTR |SIYA|PRS |NANG|RBW |MABAGAL|JJD
SVM Prediction: Profane
Censoring the sentence based on its length.
Detected language: tagalog
POS Tagged Sentence: |NAGLALAKAD|VBTR |SIYA|PRS |NANG|RBW |MABAGAL|JJD
Censored Sentence: ********** **** **** *******
Sentence 'naglalakad siya nang mabagal' saved with the detected language, POS tagging result, and censored sentence.

Loaded preprocessed data for tagalog. Number of sentences: 5074
Stanford tagged text: |ANG|DTC |BILIS|NNC |NIYA|PRS |TUMAKBO|VBAF
SVM Prediction: Not Profane
Detected language: tagalog
POS Tagged Sentence: |ANG|DTC |BILIS|NNC |NIYA|PRS |TUMAKBO|VBAF
Cleaned Sentence: ang bilis niya tumakbo
Sentence 'ang bilis niya tumakbo' saved with the

NameError: name 're' is not defined