🌟 **Welcome to the AI-Generated Text Detection Notebook** 🌟


### Inspiration and Credits 🙌
This notebook is inspired by the work of Prem Chotepanit, available at [this Kaggle project](https://www.kaggle.com/code/batprem/llm-daigt-analyse-edge-cases). I extend my gratitude to Prem Chotepanit for sharing their insights and code.

---

### 🚀 How the Notebook Works:

- **Data Loading:** Initial cell loads essential libraries and imports data from various CSV files.
  
- **Text Tokenization:** Utilizes Byte-Pair Encoding (BPE) for tokenization, creating a robust representation of text.

- **TF-IDF Vectorization:** Implements TF-IDF vectorization on the tokenized texts, capturing important features.

- **Model Training:** Constructs an ensemble of machine learning models (Multinomial Naive Bayes, SGD, LightGBM, CatBoost) to achieve optimal predictions.

- **Submission Generation:** Generates predictions and outputs a submission file ('submission.csv').

---

### 🙏 Thanks to the Competition Host:

- A heartfelt thanks to the competition host for providing the opportunity to work on this fascinating task.

---

### 📬 Feedback and Gratitude:

- Your feedback is invaluable! If you find this notebook helpful or have suggestions for improvement, please share your thoughts. Your insights contribute to a collaborative learning environment.

- Thanks for exploring this notebook. Good luck with your AI-generated text detection journey!

---

**Happy Coding!** 🚀

## 🖥️ Importing Libraries


In [1]:
# Import pandas, a library for data analysis and manipulation 🐼
import pandas as pd

# Import json, a library for working with JSON data format 📄
import json

# Import sys, a library for accessing system-specific parameters and functions 🖥️
import sys

# Import gc, a library for controlling the garbage collector 🗑️
import gc

# Import StratifiedKFold, a class for performing stratified k-fold cross-validation 🧮
from sklearn.model_selection import StratifiedKFold

# Import numpy, a library for scientific computing and linear algebra 🧮
import numpy as np

# Import roc_auc_score, a function for computing the area under the receiver operating characteristic curve 📈
from sklearn.metrics import roc_auc_score

# Import LGBMClassifier, a class for training and using LightGBM models 🌳
from lightgbm import LGBMClassifier

# Import TfidfVectorizer, a class for transforming text into TF-IDF features 📝
from sklearn.feature_extraction.text import TfidfVectorizer

# Import various classes and functions from the tokenizers library, which is used for creating and using custom tokenizers 🗣️
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# Import Dataset, a class for working with datasets in a standardized way 🗃️
from datasets import Dataset

# Import tqdm, a library for displaying progress bars ⏳
from tqdm.auto import tqdm

# Import PreTrainedTokenizerFast, a class for using fast tokenizers from the transformers library 🚀
from transformers import PreTrainedTokenizerFast

# Import SGDClassifier, a class for training and using stochastic gradient descent models 📉
from sklearn.linear_model import SGDClassifier

# Import MultinomialNB, a class for training and using multinomial naive Bayes models 🎲
from sklearn.naive_bayes import MultinomialNB

# Import VotingClassifier, a class for combining multiple classifiers into a single one 🗳️
from sklearn.ensemble import VotingClassifier




## Explanation:

The `edge_cases` dataframe contains essays that an earlier detection model found hard to classify. As the outputs below show, every row is labelled as AI-generated (`generated` = 1) yet received a predicted probability below 0.5, so these are AI-written essays that the model tended to mistake for human writing. 🤔

You can use this dataframe to test your own model on these edge cases and see how well it handles them. 🧪

In [2]:
# Import the edge_cases.csv file from the given path using pandas 🐼
edge_cases = pd.read_csv("/kaggle/input/llm-daigt-find-edge-case/edge_cases.csv")

# Display the edge_cases dataframe using pandas 🗃️
edge_cases


Unnamed: 0,id,text,prediction,generated
0,33895,First impressions are a crucial aspect of our...,0.410707,1
1,39951,"As an eighth-grade student, I possess a talent...",0.413961,1
2,26725,"Ummm... hey there! So, umm... Winston Churchi...",0.467152,1
3,35647,"""When you are doing something wrong and someo...",0.356944,1
4,27976,I believe that working 10 hours a day is more...,0.450531,1
...,...,...,...,...
165,26115,Cell phones have become a hot topic when it co...,0.490097,1
166,29732,Honesty is a virtue that is often associated ...,0.331009,1
167,36240,The advantages of limiting car usage are becom...,0.473156,1
168,39819,Drivers Should Not Use Cell Phones in Any Capa...,0.419902,1
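
As a rough illustration of how such a table can be used, the sketch below counts how many AI-generated essays a model would miss at a 0.5 decision threshold. The tiny dataframe and the threshold are assumptions made for this example; only the column names `prediction` and `generated` come from the output above.

```python
import pandas as pd

# Hypothetical stand-in for edge_cases.csv that reuses its column names
toy_cases = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "prediction": [0.41, 0.47, 0.62, 0.36],
        "generated": [1, 1, 1, 1],
    }
)

# Assumed decision threshold: scores below it are treated as "human-written"
THRESHOLD = 0.5

# Rows labelled AI-generated (1) that the model scored below the threshold
missed = toy_cases[
    (toy_cases["generated"] == 1) & (toy_cases["prediction"] < THRESHOLD)
]

print(f"Missed {len(missed)} of {len(toy_cases)} AI-generated essays")
```

Running the same filter on your own model's predictions gives a quick count of how many of these hard examples it still gets wrong.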


## Explanation

The next cell returns an array of the unique values in the `generated` column. In this dataset the column is a binary label (0 for human-written, 1 for AI-generated), so the result shows which classes are present among the edge cases.

- Use this information to check how many distinct classes appear in the data and how balanced they are. 🧮

**Sources:**
1. [Stack Overflow - Difference between an edge case and a corner case](https://stackoverflow.com/questions/47560177/what-is-the-difference-between-an-edge-case-and-a-corner-case).
2. [Stack Overflow - Edge cases in unit testing](https://stackoverflow.com/questions/4718862/are-these-the-sort-of-edge-cases-i-should-think-of-when-using-unit-testing).
3. [TestSigma - Understanding Edge Case Testing](https://testsigma.com/blog/edge-case-testing/).



In [3]:
# Use the unique method of pandas to get the unique values of the "generated" column in the edge_cases dataframe 🐼
edge_cases["generated"].unique()

array([1])

## Explanation

The next cell returns a tuple of four numbers: the minimum, mean, median, and maximum of the `prediction` column in the `edge_cases` dataframe.

- Use the tuple to see how far the predicted probabilities for the edge cases fall from the true label and how much they vary. 🧮

**Sources:**
1. [LogRocket - What is an edge case? Meaning, examples in software development](https://blog.logrocket.com/product-management/edge-case-software-development/).
2. [Stack Overflow - Difference between an edge case and a corner case](https://stackoverflow.com/questions/47560177/what-is-the-difference-between-an-edge-case-and-a-corner-case).
3. [Applause - How to Find and Test Edge Cases](https://www.applause.com/blog/how-to-find-test-edge-cases).
4. [Wikipedia - Edge case](https://en.wikipedia.org/wiki/Edge_case).
5. [Mindful QA - What Are Edge Cases in Software Testing?](https://www.mindfulqa.com/edge-cases/).


In [4]:
# Get the minimum, mean, median, and maximum of the "prediction" column in the edge_cases dataframe 🐼
edge_cases["prediction"].min(), edge_cases["prediction"].mean(), edge_cases["prediction"].median(), edge_cases["prediction"].max()


(0.0, 0.40149828802332427, 0.4246961745658604, 0.4985209873471804)
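
The same four statistics can also be gathered in a single call. A minimal sketch, assuming a small made-up Series in place of the real `prediction` column:

```python
import pandas as pd

# Hypothetical scores standing in for edge_cases["prediction"]
scores = pd.Series([0.0, 0.2, 0.4, 0.5], name="prediction")

# agg with a list of function names returns one labelled value per statistic
summary = scores.agg(["min", "mean", "median", "max"])
print(summary)
```

The labelled result is easier to read than a bare tuple when several columns are summarised side by side.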

## Explanation
### Analyzing Edge Cases Distribution in Prediction Scores

To see how the edge cases are distributed and how confident the model was, you can use the `pd.cut()` function from the pandas library. It assigns each prediction score to an interval (bin), and `value_counts()` then counts the scores in each interval. With six bin edges the result has five rows, one per interval.

#### Usage Example:

```python
import pandas as pd

# Hypothetical scores standing in for a 'prediction_scores' column in your dataframe
df = pd.DataFrame({"prediction_scores": [0.05, 0.15, 0.25, 0.35, 0.45]})

# Six bin edges define five right-closed intervals: (0.0, 0.1], ..., (0.4, 0.5]
bin_edges = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
df["prediction_bin"] = pd.cut(df["prediction_scores"], bins=bin_edges)

# Count how many scores fall within each interval
print(df["prediction_bin"].value_counts(sort=False))
```

This breakdown shows how the model's scores are spread across prediction ranges, which helps in judging model confidence and spotting areas for improvement.

📚 **Sources:**
1. [Stack Overflow - Pandas how to use pd.cut()](https://stackoverflow.com/questions/45751390/pandas-how-to-use-pd-cut)
2. [Pandas Documentation - pandas.cut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)
3. [javatpoint - Pandas DataFrame.cut()](https://www.javatpoint.com/pandas-dataframe-cut)
4. [Pandas Documentation (version 0.23.4) - pandas.cut()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)

In [5]:
# Use the cut method of pandas to bin the "prediction" column in the edge_cases dataframe into five equal-width intervals from 0.0 to 0.5 🐼
# The include_lowest argument is set to False, so the first interval is (0.0, 0.1]: open on the left, and a prediction of exactly 0.0 is not counted in any bin
# The value_counts method of pandas returns the frequency of each interval in the "prediction" column 📊
pd.DataFrame(pd.cut(edge_cases['prediction'], [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], include_lowest=False).value_counts())


prediction,count
"(0.4, 0.5]",109
"(0.3, 0.4]",43
"(0.2, 0.3]",12
"(0.1, 0.2]",2
"(0.0, 0.1]",0
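
The minimum prediction reported earlier is exactly 0.0, so at least one row falls outside every bin in the cell above. The sketch below, using made-up scores, shows how `include_lowest` changes what happens to a value that sits on the lowest bin edge.

```python
import pandas as pd

# Hypothetical scores, including one that sits exactly on the lowest edge
scores = pd.Series([0.0, 0.05, 0.1, 0.45])
edges = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

# Default behaviour: bins are right-closed, so 0.0 belongs to no bin (NaN)
open_left = pd.cut(scores, edges, include_lowest=False)

# include_lowest=True widens the first bin so that it also captures 0.0
closed_left = pd.cut(scores, edges, include_lowest=True)

print("Unbinned with include_lowest=False:", open_left.isna().sum())
print("Unbinned with include_lowest=True: ", closed_left.isna().sum())
```

If every edge case should appear in the counts, passing `include_lowest=True` is one option.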


## Explanation

### Understanding Model Performance Metrics with metrics.json

The `metrics.json` file records how the model that produced the edge cases performed overall. In this dataset it holds a single entry, the AUC (area under the ROC curve), as the output of the next cell shows. 🧮

#### Utilizing metrics.json for Model Evaluation:

Reading this file puts the edge cases in context: an AUC close to 1.0 means the model ranks almost all AI-generated essays above human-written ones, so the edge cases are exceptions rather than typical behaviour. 🚀

#### Example Usage:

```json
{
  "AUC": 0.9985770475470891
}
```

Keeping the metric next to the edge cases makes it easier to judge whether a change to the model improves its handling of hard examples without hurting overall performance.

📚 **Sources:**
1. [Stack Overflow - Prometheus json metrics](https://stackoverflow.com/questions/57844617/prometheus-json-metrics)
2. [Azure Documentation - Send metrics to the Azure Monitor metric database](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-store-custom-rest-api)
3. [Stack Overflow - Metrics reporting customization in Java](https://stackoverflow.com/questions/22803970/metrics-reporting-customization)
4. [Stack Overflow - Obtaining Spark driver metrics in JSON](https://stackoverflow.com/questions/28083597/how-to-get-the-spark-driver-metrics-json)

In [6]:
# Use the open function to read the file named "metrics.json" from the given path 📂
with open("/kaggle/input/llm-daigt-find-edge-case/metrics.json") as f:
    # Use the json library to load the file content as a Python dictionary 📄
    metrics = json.load(f)
# Use the print function to display the metrics dictionary on the screen 🖥️
print(metrics)


{'AUC': 0.9985770475470891}
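
To see why a model with an AUC this high can still produce edge cases, it may help to recall that AUC only measures ranking. The sketch below uses a hypothetical helper, `pairwise_auc`, with made-up labels and scores; it is a simplified illustration and not how the value in `metrics.json` was computed.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up data: every AI-generated essay (1) outranks every human essay (0),
# yet all AI-generated essays still score below 0.5
labels = [0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.45, 0.48, 0.30]

auc = pairwise_auc(labels, scores)
below_threshold = sum(1 for y, s in zip(labels, scores) if y == 1 and s < 0.5)

print("AUC:", auc)
print("AI-generated essays scored below 0.5:", below_threshold)
```

Because the metric depends only on ordering, a perfect ranking does not guarantee that positives clear a fixed 0.5 threshold.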


## Explanation

#### Files Overview:

1. **test_essays.csv:**
   - Contains the essays that need to be classified as human-written or machine-generated. The copy available while editing this notebook holds only 3 placeholder essays, as the tokenization progress bar further down shows. 📝

2. **sample_submission.csv:**
   - Demonstrates the expected submission file format, comprising two columns: `id` and `generated`. 🗃️

3. **train_v2_drcat_02.csv:**
   - A labelled training set of essays (0 for human, 1 for machine) used to train the detection models. 44,868 essays remain after the duplicate removal performed below. 🚀

📚 **Sources:**

- [How to use a train.csv, test.csv, and ground_truth.csv in machine learning models](https://stackoverflow.com/questions/39962836/how-to-use-a-train-csv-test-csv-and-ground-truth-csv-in-a-machine-learning-mod) 
- [Producing a Kaggle submission CSV file with specific entries](https://stackoverflow.com/questions/52411992/how-to-produce-a-kaggle-submission-csv-file-with-specific-entries) 
- [Using multiple CSV files as test and training sets for CNN](https://stackoverflow.com/questions/57148326/use-multiple-csv-files-as-test-and-training-set-for-cnn) 
- [Training a model from a CSV dataset with DeepDetect](https://www.deepdetect.com/server/docs/csv-training/) 
- [Kaggle Competition - Detect AI-generated Text](https://www.kaggle.com/kernels/scriptcontent/155514240/download) 
- [DeepDetect Example - Forest Type Dataset](http://www.deepdetect.com/dd/examples/all/forest_type/test.csv.tar.bz2) 




In [7]:
# Use the read_csv method of pandas to load the test_essays.csv file from the given path into a dataframe named test 🐼
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')

# Use the read_csv method of pandas to load the sample_submission.csv file from the given path into a dataframe named sub 🐼
sub = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')

# Use the read_csv method of pandas to load the train_v2_drcat_02.csv file from the given path into a dataframe named train, using a comma as the separator 🐼
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',')


## Explanation:

### Data Cleaning for Duplicate Essays

This code cell removes duplicate essays from the `train` dataframe so that repeated texts do not skew the machine learning model. 🧹

- The `drop_duplicates` method removes rows with duplicate essays, considering only the `text` column.
- Because `keep` is left at its default of `'first'`, the first occurrence of each duplicated essay is retained.
- `reset_index` is then applied with `drop=True` to give the dataframe a clean, continuous index after the removal.
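
The effect of these two calls can be checked on a toy dataframe. A minimal sketch, assuming made-up rows with the same `text` and `label` column names as the training data:

```python
import pandas as pd

# Hypothetical training rows; the third row repeats the text of the first
toy_train = pd.DataFrame(
    {"text": ["essay A", "essay B", "essay A"], "label": [0, 1, 0]}
)

# Keep the first occurrence of each text, then renumber the rows from 0
deduped = toy_train.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(deduped)
```

Without `reset_index(drop=True)` the surviving rows would keep their original index labels, leaving gaps where duplicates were removed.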

📚 **Sources:**

- [Resetting index after calling pandas drop_duplicates](https://stackoverflow.com/questions/28885073/resetting-index-after-calling-pandas-drop-duplicates) 
- [How to reset the indices of remaining dataframe values after removing duplicates](https://stackoverflow.com/questions/62155590/how-to-reset-the-indices-of-remaining-dataframe-values-after-removing-duplicate) 
- [Index of Data after train and test split](https://stackoverflow.com/questions/53638231/index-of-data-after-train-and-test-split) 
- [How to Drop Duplicated Index in a Pandas DataFrame - A Complete Guide](https://saturncloud.io/blog/how-to-drop-duplicated-index-in-a-pandas-dataframe-a-complete-guide/) 




In [8]:
# Use the drop_duplicates method of pandas to remove any rows in the train dataframe that have the same value in the "text" column 🐼
# The subset argument specifies which column(s) to consider for identifying duplicates
# drop_duplicates returns a new dataframe here (inplace is not used), so the result is assigned back to train
train = train.drop_duplicates(subset=['text'])

# Use the reset_index method of pandas to reset the index of the train dataframe to a sequential numerical index 🐼
# The drop argument is set to True, which means the old index is dropped and not added as a new column
# The inplace argument is set to True, which means the original train dataframe is modified and no new dataframe is returned
train.reset_index(drop=True, inplace=True)


In [9]:
# 🚨🚨🚨
# The following line of code sets the LOWERCASE flag to False.
# This means that the text will not be converted to lowercase before tokenization.
# 🚨🚨🚨
LOWERCASE = False

# 🚨🚨🚨
# The following line of code sets the VOCAB_SIZE to 140000.
# This means that the tokenizer vocabulary will contain at most 140,000 tokens.
# 🚨🚨🚨
VOCAB_SIZE = 140000


## 📑 Byte-Pair Encoding Tokenizer Training

**Explanation:**

1. **Create Tokenizer Object:**
   - Create a tokenizer using the Byte Pair Encoding (BPE) algorithm.
   - Define an unknown token as "[UNK]".

2. **Text Normalization:**
   - Normalize the text using Unicode Normalization Form C (NFC).
   - Optionally lowercase the text based on the condition.

3. **Pre-tokenization:**
   - Pre-tokenize the text by splitting it into bytes using the ByteLevel pre-tokenizer.

4. **Special Tokens:**
   - Define special tokens (e.g., "[PAD]", "[CLS]") for downstream tasks.

5. **Trainer Object:**
   - Create a trainer object for training the tokenizer.
   - Set the vocabulary size and special tokens.

6. **Load and Prepare Dataset:**
   - Load the test dataset from a pandas dataframe, selecting only the 'text' column.

7. **Batch Generation:**
   - Define a generator function to yield batches of text from the dataset.

8. **Tokenizer Training:**
   - Train the tokenizer on batches of text using the trainer object.

9. **Wrapper for HuggingFace:**
   - Wrap the raw tokenizer into a PreTrainedTokenizerFast object compatible with HuggingFace.
   - Define special tokens for compatibility.

10. **Tokenization of Test Set:**
    - Tokenize the texts in the test set using the tokenizer object.
    - Store the tokenized texts in the 'tokenized_texts_test' list.

11. **Tokenization of Train Set:**
    - Tokenize the texts in the train set using the same tokenizer.
    - Store the tokenized texts in the 'tokenized_texts_train' list.
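
The core idea behind BPE training can be sketched in a few lines of plain Python. This is a simplified, character-level illustration with hypothetical helper names (`most_frequent_pair`, `merge_pair`) and a made-up corpus; it does not reproduce the byte-level pre-tokenization or the internals of the `tokenizers` library used in this notebook.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is split into characters and mapped to its frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

# Learn two merges; a real trainer keeps merging until the vocabulary size is reached
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Each merge adds one new symbol to the vocabulary, which is why the vocabulary size passed to the trainer bounds the number of merges that can be learned.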

📚 **Sources:**

1. [Byte Pair Encoding (BPE) Algorithm](https://arxiv.org/abs/1508.07909)
2. [Unicode Normalization Forms](https://unicode.org/reports/tr15/)
3. [HuggingFace Tokenizers Documentation](https://huggingface.co/docs/tokenizers/)
4. [ByteLevel Pre-tokenizer](https://huggingface.co/docs/tokenizers/pretokenizers.html#bytetranslation-pretokenizer)
5. [HuggingFace BPE Trainer](https://huggingface.co/docs/tokenizers/trainers.html#bpetrainer)
6. [HuggingFace PreTrainedTokenizerFast](https://huggingface.co/docs/tokenizers/pretrained_tokenizer_fast.html)

In [10]:
# Create a tokenizer object using the Byte Pair Encoding (BPE) algorithm
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalize the text by applying Unicode Normalization Form C (NFC) and optionally lowercasing it
# The parentheses keep NFC in the sequence even when LOWERCASE is False
raw_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else []))

# Pre-tokenize the text by splitting it into bytes
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Define the special tokens that will be used for the downstream task
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Create a trainer object that will train the tokenizer on the given vocabulary size and special tokens
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

# Load the test dataset from a pandas dataframe and select only the text column
dataset = Dataset.from_pandas(test[['text']])

# Define a generator function that will yield batches of text from the dataset
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

# Train the tokenizer on the batches of text using the trainer object
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)

# Wrap the raw tokenizer object into a PreTrainedTokenizerFast object that is compatible with the HuggingFace library
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Initialize an empty list to store the tokenized texts for the test set
tokenized_texts_test = []

# Loop over the texts in the test set and tokenize them using the tokenizer object
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

# Initialize an empty list to store the tokenized texts for the train set
tokenized_texts_train = []

# Loop over the texts in the train set and tokenize them using the tokenizer object
for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))







  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/44868 [00:00<?, ?it/s]

In [11]:
# Access the second element (index 1) in the 'tokenized_texts_test' list 📚
tokenized_texts_test[1]


['ĠBbb', 'Ġccc', 'Ġddd', '.']

## 📑 TF-IDF Vectorization

### Explanation of the Code:

1. **Define a Dummy Function:**
   - A function named `dummy` is defined, which takes an input text and returns it as it is. This function will be used later in the TfidfVectorizer as a placeholder for tokenization.

2. **Create TfidfVectorizer Object:**
   - Instantiate a TfidfVectorizer object with the following configurations:
     - `ngram_range=(4, 4)`: Extract word n-grams of exactly 4 tokens.
     - `lowercase=False`: Do not convert text to lowercase.
     - `sublinear_tf=True`: Apply sublinear scaling to the term frequency.
     - `analyzer='word'`: Analyze words.
     - `tokenizer=dummy`: Use the previously defined dummy function for tokenization.
     - `preprocessor=dummy`: Use the dummy function as a preprocessor.
     - `token_pattern=None`: Do not use a specific token pattern.
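     - `strip_accents='unicode'`: Request Unicode accent stripping. Note that scikit-learn skips its built-in preprocessing (`strip_accents` and `lowercase`) when a custom `preprocessor` is supplied, so this setting has no effect here.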

3. **Fit Vectorizer on Test Set:**
   - Fit the vectorizer on the tokenized texts of the test set (`tokenized_texts_test`). This step generates a vocabulary of n-grams and their indices.

4. **Get Vocabulary:**
   - Obtain the vocabulary of the vectorizer, which is a dictionary mapping n-grams to their respective indices.

5. **Print Vocabulary:**
   - Print the obtained vocabulary.

6. **Create Another Vectorizer with the Same Vocabulary:**
   - Create a new TfidfVectorizer object with the same configurations as before but using the vocabulary obtained from the previous vectorizer.

7. **Fit and Transform on Train Set:**
   - Fit and transform the new vectorizer on the tokenized texts of the train set (`tokenized_texts_train`). This step produces a sparse matrix of tf-idf values for the train set.

8. **Transform on Test Set:**
   - Transform the new vectorizer on the tokenized texts of the test set, obtaining the sparse matrix of tf-idf values for the test set.

9. **Memory Management:**
   - Delete the vectorizer object to free up memory.
   - Invoke the garbage collector to reclaim any unused memory.
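
The way token n-grams and a test-derived vocabulary interact can be sketched without scikit-learn. The helper `word_ngrams` and the toy token lists below are assumptions for illustration (real tokens from the byte-level tokenizer carry a `Ġ` prefix); the sketch covers only n-gram extraction and vocabulary filtering, not the TF-IDF weighting itself.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical pre-tokenized documents (lists of tokens, not raw strings)
test_docs = [["Aaa", "bbb", "ccc", "."], ["Bbb", "ccc", "ddd", "."]]
train_docs = [["Aaa", "bbb", "ccc", ".", "Aaa", "bbb", "ccc", "."]]

N = 4

# Vocabulary built from the test documents only
vocab = sorted({gram for doc in test_docs for gram in word_ngrams(doc, N)})

# For each training document, count only the n-grams present in that vocabulary
train_counts = [
    Counter(gram for gram in word_ngrams(doc, N) if gram in vocab)
    for doc in train_docs
]

print(vocab)
print(train_counts)
```

Restricting the vocabulary this way keeps the feature matrix limited to n-grams that actually occur in the test essays, which reduces its width considerably.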

### Learning Sources:

#### TfidfVectorizer and Text Vectorization:
- **Scikit-Learn TfidfVectorizer Documentation**: [TfidfVectorizer Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
  - Understand the parameters and usage of TfidfVectorizer in Scikit-Learn.

- **Understanding TF-IDF**: [TF-IDF Explained](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/)
  - Learn about the concept of TF-IDF and its application in text vectorization.

#### Memory Management in Python:
- **Memory Management in Python**: [Real Python - Memory Management](https://realpython.com/python-memory-management/)
  - Explore how memory management works in Python and learn about garbage collection.

#### N-grams and Tokenization:
- **N-grams Explained**: [N-grams](https://en.wikipedia.org/wiki/N-gram)
  - Understand the concept of n-grams in natural language processing.

- **Tokenization in NLP**: [Natural Language Toolkit (NLTK) - Tokenization](https://www.nltk.org/book/ch03.html)
  - Learn about tokenization, a crucial step in text processing, using NLTK.


In [12]:
# Define a dummy function that returns the input text as it is
def dummy(text):
    return text

# Create a TfidfVectorizer object that will extract word 4-grams from the pre-tokenized text, without lowercasing or re-tokenizing it
vectorizer = TfidfVectorizer(ngram_range=(4, 4), lowercase=False, sublinear_tf=True, analyzer = 'word',
    tokenizer = dummy,
    preprocessor = dummy,
    token_pattern = None, strip_accents='unicode')

# Fit the vectorizer on the tokenized texts of the test set
vectorizer.fit(tokenized_texts_test)

# Get the vocabulary of the vectorizer, which is a dictionary of n-grams and their indices
vocab = vectorizer.vocabulary_

# Print the vocabulary
print(vocab)

# Create another TfidfVectorizer object with the same parameters, but using the vocabulary obtained from the previous vectorizer
vectorizer = TfidfVectorizer(ngram_range=(4, 4), lowercase=False, sublinear_tf=True, vocabulary=vocab,
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )

# Fit and transform the vectorizer on the tokenized texts of the train set, and get the sparse matrix of tf-idf values
tf_train = vectorizer.fit_transform(tokenized_texts_train)

# Transform the vectorizer on the tokenized texts of the test set, and get the sparse matrix of tf-idf values
tf_test = vectorizer.transform(tokenized_texts_test)

# Delete the vectorizer object to free up memory
del vectorizer

# Invoke the garbage collector to reclaim unused memory
gc.collect()


{'ĠAaa Ġbbb Ġccc .': 0, 'ĠBbb Ġccc Ġddd .': 1, 'ĠCCC Ġddd Ġeee .': 2}


53

In [13]:
y_train = train['label'].values

## Model Training and Prediction


### Explanation of the Code:

1. **Define Ensemble Model:**
   - A function named `get_model` is defined to create an ensemble model using four different classifiers: Multinomial Naive Bayes, Stochastic Gradient Descent, LightGBM, and CatBoost. The ensemble model is a VotingClassifier that combines these classifiers using soft voting.

2. **Import Libraries and Classifiers:**
   - The CatBoostClassifier is imported from the CatBoost library.
   - Multinomial Naive Bayes (MNB), Stochastic Gradient Descent (SGD), LightGBM (LGBM), and CatBoost classifiers are created with specific configurations.

3. **Configure Classifier Parameters:**
   - Parameters for the classifiers are set, including alpha for MNB, max_iter, tol, loss function for SGD, and various parameters for LightGBM and CatBoost.

4. **Define Voting Classifier:**
   - A list of weights for each classifier is defined.
   - A VotingClassifier is created, combining the four classifiers with soft voting. The ensemble model is configured to use parallel processing.

5. **Call the `get_model` Function:**
   - The `get_model` function is called, and the returned ensemble model is assigned to a variable named `model`.

6. **Print the Model:**
   - The ensemble model is printed.

7. **Check Length of Test Text Values:**
   - If the length of the test text values is less than or equal to 5, the code saves the submission dataframe to a CSV file and ends.

8. **Fit Model and Predict:**
   - If the length of the test text values is greater than 5, the model is fitted on the TF-IDF matrix of the train set and the target labels.
   - The garbage collector is invoked to reclaim any unused memory.
   - The model predicts the probabilities of the positive class for the test set using the TF-IDF matrix.
   - The predicted probabilities are assigned to a new column in the submission dataframe.

9. **Save and Display Results:**
   - The submission dataframe is saved to a CSV file.
   - If the length of the test text values is greater than 5, the submission dataframe is displayed.
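
Soft voting combines the classifiers by taking a weighted average of their predicted probabilities. The sketch below uses a hypothetical helper, `soft_vote`, and made-up per-model probabilities for a single essay; only the weights are taken from this notebook.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Weights in the order: Multinomial NB, SGD, LightGBM, CatBoost
weights = [0.1, 0.31, 0.31, 0.6]

# Hypothetical probabilities that one essay is AI-generated, one per model
model_probs = [0.90, 0.80, 0.60, 0.70]

combined = soft_vote(model_probs, weights)
print(round(combined, 4))
```

Because the average divides by the sum of the weights, they do not need to add up to 1; only their relative sizes matter.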

### Learning Sources:

#### Ensemble Learning and Classifiers:
- **Ensemble Learning - Scikit-Learn**: [Ensemble Methods](https://scikit-learn.org/stable/modules/ensemble.html)
  - Understand the concept of ensemble learning and how it can improve model performance.

- **CatBoost Documentation**: [CatBoost Documentation](https://catboost.ai/docs/)
  - Explore the official documentation for CatBoost to learn about its features and usage.

- **LightGBM Documentation**: [LightGBM Documentation](https://lightgbm.readthedocs.io/en/latest/)
  - Learn about LightGBM, a gradient boosting framework that is efficient and scalable.

- **Scikit-Learn - Multinomial Naive Bayes**: [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes)
  - Understand the Multinomial Naive Bayes classifier in Scikit-Learn.

- **Scikit-Learn - Stochastic Gradient Descent (SGD)**: [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)
  - Explore the SGDClassifier in Scikit-Learn, which is used for training linear classifiers.

#### Memory Management in Python:
- **Memory Management in Python**: [Real Python - Memory Management](https://realpython.com/python-memory-management/)
  - Learn about how memory management works in Python and the importance of garbage collection.



In [14]:
# Define a function that returns an ensemble model of four classifiers
def get_model():
    # Import the CatBoostClassifier from the catboost library
    from catboost import CatBoostClassifier
    
    # Create a Multinomial Naive Bayes classifier with a smoothing parameter of 0.0235
    clf = MultinomialNB(alpha=0.0235)
    
    # Create a Stochastic Gradient Descent classifier with a maximum of 9000 iterations, a tolerance of 3e-4, a modified huber loss function, and a random state of 6743
    sgd_model = SGDClassifier(max_iter=9000, tol=3e-4, loss="modified_huber", random_state=6743) 
    
    # Define a dictionary of parameters for a LightGBM classifier
    p6={'n_iter': 3000,'verbose': -1,'objective': 'cross_entropy','metric': 'auc',
        'learning_rate': 0.0031909898961407, 'colsample_bytree': 0.78,
        'colsample_bynode': 0.8,
       }
    
    # Set the random state of the LightGBM classifier to 6743
    p6["random_state"] = 6743
    
    # Create a LightGBM classifier with the given parameters
    lgb=LGBMClassifier(**p6)
    
    # Create a CatBoost classifier with 3000 iterations, a learning rate of 0.003599066836106983, a subsample of 0.4, a cross entropy loss function, and a random seed of 6543
    cat=CatBoostClassifier(iterations=3000,
                           verbose=0,
                           random_seed=6543,
                           learning_rate=0.003599066836106983,
                           subsample = 0.4,
                           allow_const_label=True,loss_function = 'CrossEntropy')
    
    # Define a list of weights for the four classifiers
    weights = [0.1,0.31,0.31,0.6]
 
    # Create a voting classifier that combines the four classifiers using soft voting and parallel processing
    ensemble = VotingClassifier(estimators=[('mnb',clf),
                                            ('sgd', sgd_model),
                                            ('lgb',lgb), 
                                            ('cat', cat)
                                           ],
                                weights=weights, voting='soft', n_jobs=-1)
    
    # Return the ensemble model
    return ensemble

# Call the get_model function and assign the returned model to a variable
model = get_model()

# Print the model
print(model)

# Check the length of the test text values
if len(test.text.values) <= 5:
    # If the length is less than or equal to 5, save the submission dataframe to a csv file
    sub.to_csv('submission.csv', index=False)
else:
    # Otherwise, fit the model on the tf-idf matrix of the train set and the target labels
    model.fit(tf_train, y_train)

    # Invoke the garbage collector to reclaim unused memory
    gc.collect()

    # Predict the probabilities of the positive class for the test set using the model
    final_preds = model.predict_proba(tf_test)[:,1]
    
    # Assign the predicted probabilities to the generated column of the submission dataframe
    sub['generated'] = final_preds
    
    # Save the submission dataframe to a csv file
    sub.to_csv('submission.csv', index=False)
    
    # Print the first rows of the submission dataframe (a bare `sub` inside an if/else block is not rendered by Jupyter)
    print(sub.head())


VotingClassifier(estimators=[('mnb', MultinomialNB(alpha=0.0235)),
                             ('sgd',
                              SGDClassifier(loss='modified_huber',
                                            max_iter=9000, random_state=6743,
                                            tol=0.0003)),
                             ('lgb',
                              LGBMClassifier(colsample_bynode=0.8,
                                             colsample_bytree=0.78,
                                             learning_rate=0.0031909898961407,
                                             metric='auc', n_iter=3000,
                                             objective='cross_entropy',
                                             random_state=6743, verbose=-1)),
                             ('cat',
                              <catboost.core.CatBoostClassifier object at 0x7ac3002e0d30>)],
                 n_jobs=-1, voting='soft', weights=[0.1, 0.31, 0.31, 0.6])
