# GloVe (Global Vectors for Word Representation) model

`glove.6B.300d.txt` is a pre-trained word embedding file produced with the GloVe (Global Vectors for Word Representation) model, developed by researchers at Stanford. Its main properties are summarized below.

#### Key Features:

1. Pre-trained Word Embeddings:
    - GloVe embeddings are pre-trained on a large corpus of text to capture semantic meaning, word similarity, and relationships.
    - The 6B in the filename refers to the size of the training corpus: about 6 billion tokens drawn from Wikipedia 2014 and Gigaword 5.
    
2. Vector Dimensionality:
    - The 300d indicates the dimensionality of the word vectors. Each word is represented as a 300-dimensional numerical vector.
    
3. Format:
    - The file is in plain text, with each line containing:
    
        ```
        word v1 v2 v3 ... v300
        ```
    
    Here `word` is the vocabulary term, and `v1` to `v300` are the space-separated components of its 300-dimensional vector.

4. Vocabulary:
    - This particular file contains 400,000 unique tokens, all lowercased (the 6B vectors are uncased).
    
5. Applications:
    - Natural Language Processing (NLP) tasks such as text classification, sentiment analysis, question answering, machine translation, and more.
    - The embeddings are used as input to machine learning models to represent textual data numerically.
    
6. Advantages:
    - Captures both semantic (e.g., king-queen, man-woman) and syntactic (e.g., walking-walked, swimming-swam) relationships.
    - Useful for downstream tasks without requiring the training of embeddings from scratch.
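
The line format described in item 3 can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the GloVe distribution: the helper name `load_glove`, its `expected_dim` parameter, and the three-dimensional sample lines are assumptions made for illustration.

```python
import io


def load_glove(handle, expected_dim=None):
    """Parse GloVe-style text lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if expected_dim is not None and len(values) != expected_dim:
            continue  # skip lines whose vector has an unexpected length
        embeddings[word] = [float(v) for v in values]
    return embeddings


# Tiny in-memory stand-in for the real file (3 dimensions instead of 300)
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.3\n")
vectors = load_glove(sample, expected_dim=3)
print(sorted(vectors))
print(len(vectors["king"]))
```

For the real file, the same function could be given a handle opened with `open('glove.6B.300d.txt', encoding='utf-8')` and `expected_dim=300`. Holding 400,000 vectors as Python lists uses a lot of memory, so one NumPy array per word is the more common choice in practice.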
    



In [1]:
# Official GloVe website: https://nlp.stanford.edu/projects/glove/

# Install gensim (uncomment to run inside a notebook)
# !pip install gensim

## To create a GloVe-like file (glove.6B.300d.txt) with embeddings for your own words

To create a GloVe-like file with embeddings for your own words, you can take a pre-trained model (such as GloVe, Word2Vec, or FastText) and extract the vectors for the specific words you need.

The Word2Vec vectors used here can be obtained in two ways: download a pre-trained file, or train a model on your own corpus.

### Option 1: Download the Pre-Trained Word2Vec File

The Google News Word2Vec embeddings are widely used and publicly available.

1. Download the Pre-Trained File:
    - Visit the official GoogleNews-vectors repository or use a hosted copy from another reliable source such as Kaggle.
    - The file to download is `GoogleNews-vectors-negative300.bin.gz`, available from:
    - [kaggle](https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300?select=GoogleNews-vectors-negative300.bin.gz)
    - [github](https://github.com/mmihaltz/word2vec-GoogleNews-vectors/blob/master/GoogleNews-vectors-negative300.bin.gz)
    - [drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?resourcekey=0-wjGZdNAUop6WykTtMip30g)
    
2. Save the File:
    - Save the file to a local directory on your machine.

3. Use the File:
    - Load it into Python with the gensim library:
        
        ```Python
        from gensim.models import KeyedVectors
        model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
        ```
    

In [2]:
import os
from gensim.models import KeyedVectors

file_path = os.path.join('GoogleNews_vectors_negative300', '1', 'GoogleNews-vectors-negative300.bin.gz')
model = KeyedVectors.load_word2vec_format(file_path, binary=True)
print(f"model: {model}")

model: KeyedVectors<vector_size=300, 3000000 keys>
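
With a model loaded as above, the GloVe-like file for your own words is one `word v1 v2 ... vN` line per word. The following is a minimal sketch under stated assumptions: `write_glove_style` is a hypothetical helper, the six-decimal formatting is an arbitrary choice, and a plain dict stands in for the loaded model so the example runs without the 3,000,000-word file. A gensim `KeyedVectors` object supports the same `word in model` and `model[word]` lookups.

```python
import io


def write_glove_style(vectors, words, handle):
    """Write one 'word v1 v2 ... vN' line per known word; return the words not found."""
    missing = []
    for word in words:
        if word not in vectors:
            missing.append(word)  # out-of-vocabulary words are reported, not written
            continue
        values = " ".join(f"{float(x):.6f}" for x in vectors[word])
        handle.write(f"{word} {values}\n")
    return missing


# Toy stand-in for the loaded model (3 dimensions instead of 300)
toy_vectors = {"king": [0.1, 0.2, 0.3], "queen": [0.1, 0.25, 0.3]}
buffer = io.StringIO()
missing = write_glove_style(toy_vectors, ["king", "queen", "zzz"], buffer)
print(buffer.getvalue())
print(missing)
```

To produce an actual file, `buffer` could be replaced by a handle opened with `open('my_words.300d.txt', 'w', encoding='utf-8')` (a hypothetical file name) and `toy_vectors` by the loaded `model`.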


### Option 2: Train Your Own Word2Vec Model

If you prefer to train your own Word2Vec embeddings for a custom dataset:

#### Steps to Train Word2Vec

1. Prepare a Text Corpus:
    - Collect a large corpus of text data related to your domain. Save it as a .txt file with one sentence per line, since the training code below treats each line as one sentence.

2. Install Gensim:
    
    ```
    pip install gensim
    ```

3. Train the Word2Vec Model: Use the gensim library to train a Word2Vec model.

    ```Python
    from gensim.models import Word2Vec

    # Load your text corpus (one sentence per line)
    corpus_file = 'your_text_corpus.txt'  # Replace with your text corpus file
    with open(corpus_file, 'r', encoding='utf-8') as f:
        # Whitespace-tokenize each non-empty line into a list of words
        sentences = [line.split() for line in f if line.strip()]

    # Train Word2Vec model
    model = Word2Vec(
        sentences,
        vector_size=300,  # Number of dimensions for the embeddings
        window=5,         # Context window size
        min_count=5,      # Minimum word frequency to include in the vocabulary
        workers=4         # Number of threads
    )

    # Save the trained word vectors in word2vec binary format
    model.wv.save_word2vec_format('custom_word2vec.bin', binary=True)
    print("Word2Vec model saved as 'custom_word2vec.bin'")
    ```

In [None]:
from gensim.models import Word2Vec

# Load your text corpus (one sentence per line)
corpus_file = 'your_text_corpus.txt'  # Replace with your text corpus file
with open(corpus_file, 'r', encoding='utf-8') as f:
    # Whitespace-tokenize each non-empty line into a list of words
    sentences = [line.split() for line in f if line.strip()]

# Train Word2Vec model
model = Word2Vec(
    sentences,
    vector_size=300,  # Number of dimensions for the embeddings
    window=5,         # Context window size
    min_count=5,      # Minimum word frequency to include in the vocabulary
    workers=4         # Number of threads
)

# Save the trained word vectors in word2vec binary format
model.wv.save_word2vec_format('custom_word2vec.bin', binary=True)
print("Word2Vec model saved as 'custom_word2vec.bin'")
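
The file saved above is binary, whereas a GloVe file is plain text with no header. If the vectors are instead written with `model.wv.save_word2vec_format('custom_word2vec.txt', binary=False)`, gensim by default prepends a header line holding the vocabulary size and the vector dimension; dropping that line leaves GloVe-style lines. The following is a minimal sketch of that conversion, where `strip_word2vec_header`, the text file name, and the sample data are assumptions made for illustration.

```python
import io


def strip_word2vec_header(src, dst):
    """Copy word2vec text format to GloVe-style text by dropping the 'count dim' header."""
    header = src.readline().split()
    count, dim = int(header[0]), int(header[1])
    written = 0
    for line in src:
        if line.strip():  # ignore blank lines
            dst.write(line)
            written += 1
    return count, dim, written


# In-memory stand-in for a word2vec text file with 2 words of dimension 3
src = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.1 0.25 0.3\n")
dst = io.StringIO()
count, dim, written = strip_word2vec_header(src, dst)
print(count, dim, written)
print(dst.getvalue())
```

Comparing `written` with `count` gives a cheap check that no lines were lost during the conversion.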
