# FastText Installation Guide for GitHub Codespaces

This guide provides step-by-step instructions to install FastText in GitHub Codespaces, preparing the environment for text classification tasks.

## Step 1: Update Packages and Install Dependencies

First, update your packages and install the necessary dependencies for building FastText:

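```bash
sudo apt-get update
sudo apt-get install -y build-essential cmake
```

## Step 2: Clone the FastText Repository

```bash
git clone https://github.com/facebookresearch/fastText.git
```

## Step 3: Navigate to the FastText Directory and Build

Enter the repository and build the `fasttext` binary, which the later steps run as `./fasttext`:

```bash
cd fastText
make
```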



# Getting and Preparing the IMDB Movie Dataset

In this section, we’ll download the IMDB movie review dataset and prepare it for FastText. FastText requires a specific format for labeled data, so we’ll convert it accordingly.

## Step 1: Download the IMDB Dataset

1. Go to the [IMDB Movie dataset on Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
2. Download the dataset file (`IMDB Dataset.csv`) and upload it to your working directory in GitHub Codespaces.

## Step 2: Prepare the Data for FastText

FastText requires each line of data to follow this format:

```text
__label__<label> <review text>
```

For example:

```text
__label__positive The movie was fantastic!
__label__negative Not worth watching.
```
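
The line format above can be sketched as a pair of small helpers. This is only an illustration: the function names `to_fasttext_line` and `parse_fasttext_line` are hypothetical and not part of FastText, and the sketch assumes the default `__label__` prefix.

```python
LABEL_PREFIX = "__label__"  # FastText's default label prefix


def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: prefixed label, a space, then the text."""
    # Collapse all whitespace (including newlines) so the example stays on one line.
    clean_text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {clean_text}"


def parse_fasttext_line(line: str) -> tuple[list[str], str]:
    """Split a line into its labels (prefix removed) and the remaining text."""
    tokens = line.split()
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, " ".join(words)


line = to_fasttext_line("positive", "The movie was\nfantastic!")
print(line)
print(parse_fasttext_line(line))
```

Collapsing whitespace matters because FastText reads one example per line, so a newline inside a review would split it into two examples.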


### Convert IMDB Dataset to FastText Format

Run the following Python script to convert the IMDB dataset into the required format. This will create `train.txt` (80% of the data) for training and `test.txt` (20%) for testing.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('IMDB Dataset.csv')

# Preprocess: replace newlines with spaces and lowercase the text
df['review'] = df['review'].str.replace('\n', ' ', regex=False).str.lower()
df['label'] = '__label__' + df['sentiment']
df['fasttext_format'] = df['label'] + ' ' + df['review']

# Split into train (80%) and test (20%)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Write one example per line as plain text. to_csv would wrap reviews that
# contain commas or quotes in double quotes, hiding the __label__ prefix.
for name, part in [('train.txt', train), ('test.txt', test)]:
    with open(name, 'w', encoding='utf-8') as f:
        f.write('\n'.join(part['fasttext_format']) + '\n')
```

- `train.txt`: Contains 40,000 reviews (80% of the data) for training.
- `test.txt`: Contains 10,000 reviews (20% of the data) for testing.
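
Before training, it can help to confirm that every line in the generated files really starts with a label. The following is a minimal sketch of such a check, using only the standard library. The helper name `count_labels` is hypothetical, and the file names are the ones used in this guide.

```python
from collections import Counter
from pathlib import Path


def count_labels(path: str, prefix: str = "__label__") -> tuple[Counter, int]:
    """Count leading labels and the non-empty lines that lack one."""
    counts: Counter = Counter()
    unlabeled = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens and tokens[0].startswith(prefix):
                counts[tokens[0]] += 1
            elif tokens:
                # e.g. a line that begins with a quote character instead of the prefix
                unlabeled += 1
    return counts, unlabeled


for name in ("train.txt", "test.txt"):
    if Path(name).exists():
        counts, unlabeled = count_labels(name)
        print(name, dict(counts), "lines without a leading label:", unlabeled)
```

If the second number is not zero, the affected lines are likely to be ignored or misread by FastText, so it is worth fixing the file before going further.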

# Training and Evaluating the Model

Train the FastText model on the prepared IMDB dataset, then evaluate its performance on the test set.

## Step 1: Train the Model

Use the `train.txt` file to train a supervised FastText model. This command will save the trained model as `model_imdb.bin`.

```bash
./fasttext supervised -input train.txt -output model_imdb
Read 9M words
Number of words:  390624
Number of labels: 2
Progress: 100.0% words/sec/thread: 1487245 lr:  0.000000 avg.loss:  0.693287 ETA:  0h 0m 0s
```

- `supervised`: Tells FastText to run in supervised mode, suitable for classification tasks.
- `-input train.txt`: Specifies `train.txt` as the training data.
- `-output model_imdb`: Sets the output prefix. FastText writes the model to `model_imdb.bin` and the word vectors to `model_imdb.vec`.

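## Step 2: Evaluate the Model

After training, evaluate the model on `test.txt` to assess its accuracy:

```bash
./fasttext test model_imdb.bin test.txt
N       276
P@1     0.562
R@1     0.562
```

- `test`: Tells FastText to evaluate the model on test data.
- `model_imdb.bin`: Specifies the trained model file.
- `test.txt`: Uses `test.txt` as the test data file.

In the output, `N` is the number of labeled examples that were evaluated, `P@1` is the precision of the top prediction, and `R@1` is its recall. An `N` of 276 is far below the 10,000 lines in `test.txt`, which suggests that most lines were not recognized as labeled in this run, for example because CSV quoting placed a `"` in front of `__label__`. If you see a low `N`, inspect the first lines of `test.txt`.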
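
To make the `P@1` and `R@1` figures concrete, here is a minimal sketch of how precision and recall at k can be computed. It is an illustration under simple assumptions, not FastText's own code: the function name and the toy predictions are hypothetical.

```python
def precision_recall_at_k(predictions, gold, k=1):
    """Compute P@k and R@k over parallel lists of ranked predictions and gold label sets."""
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    # Precision divides by the labels predicted, recall by the labels that exist.
    return n_correct / n_predicted, n_correct / n_gold


# Hypothetical toy data: four reviews, one prediction each.
predictions = [["positive"], ["positive"], ["negative"], ["positive"]]
gold = [{"positive"}, {"negative"}, {"negative"}, {"positive"}]
p_at_1, r_at_1 = precision_recall_at_k(predictions, gold, k=1)
print(f"P@1 {p_at_1:.3f}  R@1 {r_at_1:.3f}")  # P@1 0.750  R@1 0.750
```

When every example has exactly one label and k is 1, as in the IMDB task, both values equal plain accuracy. That is why the two numbers in the output match.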

## Step 3: Test Model Predictions on Sample Sentences

Positive review example:

```bash
echo "The movie was absolutely brilliant, with stunning visuals and an amazing storyline!" | ./fasttext predict model_imdb.bin -

__label__positive
```

Negative review example:

```bash
echo "I was really disappointed by the movie; it was boring and felt too long." | ./fasttext predict model_imdb.bin -

__label__positive
```

The second review should have been labeled negative. The model has not been trained well enough yet, so it gives the wrong output.
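
One hint that the first model learned very little is the `avg.loss` of 0.693287 reported in Step 1. As a rough illustration, assuming a standard cross-entropy loss over two labels, a model that always predicts 50/50 would score almost exactly that value:

```python
import math


def cross_entropy(prob_of_true_label: float) -> float:
    """Negative log-likelihood of the correct label for one example."""
    return -math.log(prob_of_true_label)


# A two-label model that always outputs 50/50 pays ln(2) per example.
chance_loss = cross_entropy(0.5)
print(f"{chance_loss:.6f}")  # 0.693147
```

A loss this close to ln(2) is consistent with a classifier that is still near chance level, which matches the wrong prediction above.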


## Step 4: Fine-tuning the Model

Raising the learning rate (`-lr`) from its default of 0.1 to 0.5 reduces `avg.loss`:

```bash
./fasttext supervised -input train.txt -output model_imdb -lr 0.5
Read 9M words
Number of words:  350082
Number of labels: 2
Progress: 100.0% words/sec/thread: 1552394 lr:  0.000000 avg.loss:  0.511963 ETA:   0h 0m 0s
```
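
The `lr:  0.000000` at the end of the progress line reflects that the `-lr` value is only the starting point: FastText lowers the learning rate linearly as training progresses. The sketch below illustrates that schedule. The function name and the token counts are hypothetical, chosen to roughly match the 9M words reported above.

```python
def learning_rate(initial_lr: float, tokens_seen: int, total_tokens: int) -> float:
    """Linearly decay the learning rate from initial_lr to 0 over training."""
    progress = tokens_seen / total_tokens
    return initial_lr * (1.0 - progress)


total = 9_000_000 * 5  # hypothetical: ~9M training words times 5 epochs
for fraction in (0.0, 0.5, 1.0):
    print(fraction, learning_rate(0.5, int(total * fraction), total))
```

Because the rate always decays to zero by the end, a larger starting value or more epochs gives the model more effective updates over the run.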

Increasing the number of epochs (`-epoch`, default 5) improves the loss further:

```bash
./fasttext supervised -input train.txt -output model_imdb -lr 0.5 -epoch 25
Read 9M words
Number of words:  350082
Number of labels: 2
Progress: 100.0% words/sec/thread: 1574460 lr:  0.000000 avg.loss:  0.120751 ETA:   0h 0m 0s
```

Fine-tuning gives much better results on the test set:

```bash
./fasttext test model_imdb.bin test.txt
N       276
P@1     0.804
R@1     0.804
```

## Step 5: Using Bigrams

A ***bigram*** is the concatenation of two consecutive tokens or words. More generally, an n-gram is the concatenation of any n consecutive tokens. The `-wordNgrams 2` option adds bigram features to the model, `-bucket` sets the number of hash buckets used for n-grams, and `-dim` sets the vector size:

```bash
./fasttext supervised -input train.txt -output model_imdb -lr 1 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50
Read 9M words
Number of words:  350082
Number of labels: 2
Progress: 100.0% words/sec/thread: 1381679 lr:  0.000000 avg.loss:  0.076925 ETA:   0h 0m 0s
```
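
To illustrate what word n-grams look like and why a bucket count is needed, here is a small sketch. It is not FastText's implementation: the helper names are hypothetical, and CRC32 is only a stand-in for whatever hash FastText uses internally.

```python
import zlib


def word_ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bucket_id(ngram: str, num_buckets: int) -> int:
    """Map an n-gram to a bucket. CRC32 is a stand-in, not FastText's own hash."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


tokens = "the movie was not good".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)
print([bucket_id(b, 200_000) for b in bigrams])
```

The bigram `not good` carries sentiment that the separate words `not` and `good` do not, which is why bigrams tend to help on reviews. Hashing into a fixed number of buckets keeps memory bounded, at the cost of occasional collisions between different n-grams.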

## Step 6: Using Hierarchical Softmax

Hierarchical softmax is a computational technique used to speed up training, especially useful for large datasets with many labels. It can be enabled by adding the `-loss hs` option when training the model:

```bash
./fasttext supervised -input train.txt -output model_imdb -lr 1 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs
Read 9M words
Number of words:  350082
Number of labels: 2
Progress: 100.0% words/sec/thread: 1455385 lr:  0.000000 avg.loss:  0.098376 ETA:   0h 0m 0s
```
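
The speedup comes from arranging the labels in a binary tree, so that scoring a label means walking a short path instead of visiting every label. The sketch below builds a Huffman tree from label frequencies and reports each label's depth. It is an illustration of the idea only: the function name and the label counts are hypothetical, and the details of FastText's own tree may differ.

```python
import heapq
import itertools


def huffman_depths(label_counts: dict[str, int]) -> dict[str, int]:
    """Return the depth of each label in a Huffman tree built from its frequency."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(count, next(tie_breaker), {label: 0}) for label, count in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every label in them one level deeper.
        merged = {label: depth + 1 for label, depth in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return heap[0][2]


# Hypothetical label counts for a task with several labels.
counts = {"drama": 5000, "comedy": 3000, "horror": 1000, "western": 500, "noir": 500}
print(huffman_depths(counts))
```

Frequent labels end up near the root and rare labels deeper in the tree. With only two labels, as in this IMDB task, each path has length one, so there is little to gain over the regular softmax.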

## Step 7: Multi-label Classification

A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with `-loss one-vs-all` or `-loss ova`:

```bash
./fasttext supervised -input train.txt -output model_imdb -lr 1 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss ova
Read 9M words
Number of words:  350082
Number of labels: 2
Progress: 100.0% words/sec/thread: 1455964 lr:  0.000000 avg.loss:  0.148587 ETA:   0h 0m 0s
```
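
The difference between a softmax and independent binary classifiers can be seen in a few lines. This is a conceptual sketch only: the labels and raw scores are hypothetical and do not come from the IMDB model.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Probabilities that compete with each other and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_vs_all(scores: list[float]) -> list[float]:
    """One independent sigmoid per label; the outputs need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


labels = ["comedy", "romance", "horror"]  # hypothetical multi-label task
scores = [2.0, 1.5, -3.0]                 # hypothetical raw model scores
print(dict(zip(labels, softmax(scores))))
print(dict(zip(labels, one_vs_all(scores))))
```

With the softmax, `comedy` and `romance` have to share the probability mass. With one-vs-all, both can score well above 0.5 at the same time, which is what a multi-label task needs. The IMDB reviews have exactly one label each, so here the option mainly serves as a demonstration.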

## Results and Evaluation of the Model

```bash
./fasttext test model_imdb.bin test.txt
N       276
P@1     0.815
R@1     0.815
```

Now rerun the sentence that the model previously labeled positive instead of negative:

```bash
echo "I was really disappointed by the movie; it was boring and felt too long." | ./fasttext predict model_imdb.bin -
__label__negative
```
