# GloVe (Global Vectors for Word Representation) model

glove.6B.300d.txt is a pre-trained word embedding file from the GloVe (Global Vectors for Word Representation) model, developed by researchers at Stanford. Here's what it is and what it contains:

#### Key Features:

1. Pre-trained Word Embeddings:
    - GloVe embeddings are pre-trained on a large corpus of text to capture semantic meaning, word similarity, and relationships.
    - The 6B in the filename refers to the size of the training corpus: about 6 billion tokens drawn from Wikipedia 2014 and Gigaword 5.
    
2. Vector Dimensionality:
    - The 300d indicates the dimensionality of the word vectors. Each word is represented as a 300-dimensional numerical vector.
    
3. Format:
    - The file is in plain text, with each line containing:
    
        ```
        word v1 v2 v3 ... v300
        ```
    
    Where word is the vocabulary term, and v1 to v300 are the 300-dimensional vector components.

4. Vocabulary:
    - This particular file includes 400,000 unique words or tokens.
    
5. Applications:
    - Natural Language Processing (NLP) tasks such as text classification, sentiment analysis, question answering, machine translation, and more.
    - The embeddings are used as input to machine learning models to represent textual data numerically.
    
6. Advantages:
    - Captures both semantic (e.g., king-queen, man-woman) and syntactic (e.g., walking-walked, swimming-swam) relationships.
    - Useful for downstream tasks without requiring the training of embeddings from scratch.
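
The file format described above is simple enough to parse by hand. The following is a minimal sketch, assuming one `word v1 ... vN` entry per line separated by single spaces; the helper names `load_glove_text` and `cosine_similarity` and the 3-dimensional toy values are made up for illustration, and a real run would open `glove.6B.300d.txt` with `encoding='utf-8'` instead of using the in-memory sample.

```python
import io
import math


def load_glove_text(handle):
    """Parse GloVe-style lines ("word v1 v2 ... vN") into a dict of float lists."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-in for glove.6B.300d.txt (values are made up).
sample = io.StringIO(
    "king 0.9 0.1 0.3\n"
    "queen 0.8 0.2 0.4\n"
    "apple -0.5 0.7 0.0\n"
)
vectors = load_glove_text(sample)

print(len(vectors["king"]))
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["apple"]))
```

Cosine similarity is the usual way to compare two word vectors: related words should score closer to 1 than unrelated ones.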
    



In [1]:
# The GloVe official website : https://nlp.stanford.edu/projects/glove/

# To install
# !pip install gensim 

## To create a GloVe-like file (glove.6B.300d.txt) with embeddings for your own words

To build a GloVe-like file with embeddings for your own words, load a pre-trained model (such as GloVe, Word2Vec, or FastText) and extract the vectors for the words you need.

The embeddings themselves come from one of two places: a pre-trained file that you download, or a Word2Vec model that you train yourself.
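
Once the vectors for your words are available, writing them in the GloVe text layout takes only a few lines. This is a rough sketch, not the notebook's own method: `write_glove_like`, the toy 3-dimensional vectors, and the output file name are all hypothetical, and in practice the dictionary would be filled from a pre-trained model.

```python
import os
import tempfile


def write_glove_like(path, embeddings):
    """Write {word: vector} pairs as 'word v1 v2 ... vN' lines, one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vector in embeddings.items():
            values = " ".join(f"{v:.6f}" for v in vector)
            f.write(f"{word} {values}\n")


# Hypothetical toy vectors; real ones would come from a pre-trained model.
my_vectors = {
    "albatross": [0.12, -0.40, 0.33],
    "sparrow": [0.05, 0.27, -0.18],
}

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "my_words.3d.txt")
write_glove_like(out_path, my_vectors)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```

Because the format separates fields with spaces, a word that itself contains a space would break parsing; joining multi-word names with underscores avoids that.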

### Option 1: Download the Pre-Trained Word2Vec File

The Google News Word2Vec embeddings are widely used and publicly available.

1. Download the Pre-Trained File:
    - Visit the official GoogleNews-vectors repository or use the hosted version from other reliable sources like Kaggle. 
    - Direct download link: GoogleNews-vectors-negative300.bin.gz
    - [kaggle](https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300?select=GoogleNews-vectors-negative300.bin.gz)
    - [github](https://github.com/mmihaltz/word2vec-GoogleNews-vectors/blob/master/GoogleNews-vectors-negative300.bin.gz)
    - [drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?resourcekey=0-wjGZdNAUop6WykTtMip30g)
    
2. Save the File:
    - Save the file to a local directory on your machine.

3. Use the File:
    - Load it into Python using the gensim library, as shown below:
        
        ```Python
        from gensim.models import KeyedVectors
        model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
        ```
    

In [2]:
import os
from gensim.models import KeyedVectors

file_path = os.path.join('GoogleNews_vectors_negative300', '1', 'GoogleNews-vectors-negative300.bin.gz')
model = KeyedVectors.load_word2vec_format(file_path, binary=True)
print(f"model: {model}")

model: KeyedVectors<vector_size=300, 3000000 keys>
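
Multi-word names such as bird class names are often missing from a pre-trained vocabulary as a single key. One common heuristic, shown here as a sketch rather than as the method this notebook uses, is to average the vectors of the tokens that are present. The helper `phrase_vector` and the toy lookup are hypothetical; a gensim `KeyedVectors` object should work in place of the dictionary, since it supports membership tests and indexing by key.

```python
def phrase_vector(phrase, vectors):
    """Average the vectors of the in-vocabulary tokens of `phrase`.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`.
    Returns None when no token of the phrase is in the vocabulary.
    """
    found = [vectors[tok] for tok in phrase.split() if tok in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy 2-dimensional lookup standing in for the loaded model (values are made up).
toy_vectors = {"blue": [1.0, 0.0], "jay": [0.0, 1.0]}

print(phrase_vector("blue jay", toy_vectors))
print(phrase_vector("zzz", toy_vectors))
```

Returning `None` for fully out-of-vocabulary phrases keeps the caller in charge of deciding whether to skip the word or fall back to another representation.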


### Create CUB_corpus.txt

In [15]:
import os
import glob

# for path in glob.glob(os.path.join('..', 'Dataset', 'CUB_KMean_Dataset', 'images', '*')):
#     # print(path.split('\\')[-1], end="', '")
#     print(path.split('\\')[-1])
#     # break

corpus_file = 'CUB_corpus_test.txt'
corpus_file_path = os.path.join('GoogleNews_vectors_negative300', 'corpus', corpus_file)

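# Write one class name per line, with underscores replaced by spaces.
# Mode "w" overwrites the file, so re-running the cell does not append duplicate lines.
with open(corpus_file_path, "w", encoding="utf-8") as f:
    for path in glob.glob(os.path.join('..', 'Dataset', 'CUB_KMean_Dataset', 'images', '*')):
        # os.path.basename uses the platform's separator, unlike split('\\')
        f.write(os.path.basename(path).replace('_', ' ') + '\n')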
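
The class folders are named like `001.Black_footed_Albatross`, so the lines written above keep the numeric prefix. If that prefix is unwanted in the corpus, a small helper can strip it. This is a sketch under that assumption; `class_dir_to_line` is a hypothetical name, and `ntpath` is used only so that the example handles both `\` and `/` separators on any platform.

```python
import ntpath
import re


def class_dir_to_line(path):
    """Turn '.../001.Black_footed_Albatross' into 'Black footed Albatross'."""
    name = ntpath.basename(path.rstrip("\\/"))  # last path component, '\\' or '/'
    name = re.sub(r"^\d+\.", "", name)          # drop a leading numeric prefix such as '001.'
    return name.replace("_", " ")


print(class_dir_to_line("..\\Dataset\\images\\001.Black_footed_Albatross"))
print(class_dir_to_line("../Dataset/images/017.Cardinal"))
```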
        

### Option 2: Train Your Own Word2Vec Model

If you prefer to train your own Word2Vec embeddings for a custom dataset:

#### Steps to Train Word2Vec

1. Prepare a Text Corpus:
    - Collect a large corpus of text data related to your domain. Save it as a .txt file.

2. Install Gensim:
    
    ```
    pip install gensim
    ```

3. Train the Word2Vec Model: Use the gensim library to train a Word2Vec model.

    ```Python
    from gensim.models import Word2Vec

    # Load your text corpus
    corpus_file = 'your_text_corpus.txt'  # Replace with your text corpus file
    with open(corpus_file, 'r', encoding='utf-8') as f:
        sentences = [line.strip().split() for line in f]  # Tokenize sentences

    # Train Word2Vec model
    model = Word2Vec(
        sentences,
        vector_size=300,  # Number of dimensions for the embeddings
        window=5,         # Context window size
        min_count=5,      # Ignore words that occur fewer than 5 times in the corpus
        workers=4         # Number of threads
    )

    # Save the model in binary format
    model.wv.save_word2vec_format('custom_word2vec.bin', binary=True)
    print("Word2Vec model saved as 'custom_word2vec.bin'")
    ```
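
The `min_count` setting matters a great deal on a small corpus, because every token that appears fewer than `min_count` times is dropped from the vocabulary. The sketch below imitates only that frequency filter with a plain `Counter`; it does not call gensim, and `surviving_vocab` and the three toy sentences are hypothetical.

```python
from collections import Counter


def surviving_vocab(sentences, min_count):
    """Return the tokens that occur at least `min_count` times across all sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}


# Toy corpus: one tokenized class name per "sentence".
sentences = [
    ["black", "footed", "albatross"],
    ["laysan", "albatross"],
    ["sooty", "albatross"],
]

print(sorted(surviving_vocab(sentences, min_count=1)))
print(sorted(surviving_vocab(sentences, min_count=3)))
print(sorted(surviving_vocab(sentences, min_count=5)))
```

With only a few hundred short lines, most tokens occur once or twice, so a lower `min_count` (even 1) may be needed to keep them in the trained model.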

In [16]:
import os
from gensim.models import Word2Vec

corpus_file_path = os.path.join('GoogleNews_vectors_negative300', 'corpus', 'CUB_corpus.txt')
save_word2vec_file_path = os.path.join('GoogleNews_vectors_negative300', 'corpus', 'custom_word2vec.bin')

# Load your text corpus
corpus_file = corpus_file_path  # Replace with your text corpus file
with open(corpus_file, 'r', encoding='utf-8') as f:
    sentences = [line.strip().split() for line in f]  # Tokenize sentences

# Train Word2Vec model
model = Word2Vec(
    sentences,
    vector_size=300,  # Number of dimensions for the embeddings
    window=5,         # Context window size
    min_count=5,      # Ignore words that occur fewer than 5 times in the corpus
    workers=4         # Number of threads
)

# # Build the vocabulary
# model.build_vocab(sentences)

# # Train the Word2Vec model
# model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

# Save the model in binary format
model.wv.save_word2vec_format(save_word2vec_file_path, binary=True)
print("Word2Vec model saved as 'custom_word2vec.bin'")


Word2Vec model saved as 'custom_word2vec.bin'


In [None]:
# [
#     'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
#     'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
#     'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
#     'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
#     'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
#     'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
#     'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
#     'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
#     'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
# ]

# ----

# Classes in Cluster 0: ['001.Black_footed_Albatross', '004.Groove_billed_Ani', '009.Brewer_Blackbird', '013.Bobolink', '014.Indigo_Bunting', '015.Lazuli_Bunting', '018.Spotted_Catbird', '020.Yellow_breasted_Chat', '021.Eastern_Towhee', '022.Chuck_will_Widow', '026.Bronzed_Cowbird', '029.American_Crow', '030.Fish_Crow', '036.Northern_Flicker', '044.Frigatebird', '059.California_Gull', '060.Glaucous_winged_Gull', '061.Heermann_Gull', '062.Herring_Gull', '068.Ruby_throated_Hummingbird', '070.Green_Violetear', '076.Dark_eyed_Junco', '077.Tropical_Kingbird', '085.Horned_Lark', '086.Pacific_Loon', '087.Mallard', '088.Western_Meadowlark', '092.Nighthawk', '096.Hooded_Oriole', '103.Sayornis', '106.Horned_Puffin', '107.Common_Raven', '111.Loggerhead_Shrike', '113.Baird_Sparrow', '114.Black_throated_Sparrow', '117.Clay_colored_Sparrow', '120.Fox_Sparrow', '121.Grasshopper_Sparrow', '123.Henslow_Sparrow', '125.Lincoln_Sparrow', '126.Nelson_Sharp_tailed_Sparrow', '135.Bank_Swallow', '136.Barn_Swallow', '158.Bay_breasted_Warbler', '160.Black_throated_Blue_Warbler', '161.Blue_winged_Warbler', '162.Canada_Warbler', '164.Cerulean_Warbler', '167.Hooded_Warbler', '169.Magnolia_Warbler', '170.Mourning_Warbler', '188.Pileated_Woodpecker', '189.Red_bellied_Woodpecker', '193.Bewick_Wren', '195.Carolina_Wren', '200.Common_Yellowthroat']
# Classes in Cluster 1: ['005.Crested_Auklet', '006.Least_Auklet', '010.Red_winged_Blackbird', '017.Cardinal', '023.Brandt_Cormorant', '028.Brown_Creeper', '031.Black_billed_Cuckoo', '034.Gray_crowned_Rosy_Finch', '037.Acadian_Flycatcher', '038.Great_Crested_Flycatcher', '039.Least_Flycatcher', '041.Scissor_tailed_Flycatcher', '045.Northern_Fulmar', '046.Gadwall', '047.American_Goldfinch', '049.Boat_tailed_Grackle', '050.Eared_Grebe', '051.Horned_Grebe', '054.Blue_Grosbeak', '055.Evening_Grosbeak', '058.Pigeon_Guillemot', '067.Anna_Hummingbird', '071.Long_tailed_Jaeger', '073.Blue_Jay', '074.Florida_Jay', '079.Belted_Kingfisher', '080.Green_Kingfisher', '081.Pied_Kingfisher', '084.Red_legged_Kittiwake', '089.Hooded_Merganser', '091.Mockingbird', '093.Clark_Nutcracker', '095.Baltimore_Oriole', '097.Orchard_Oriole', '099.Ovenbird', '100.Brown_Pelican', '101.White_Pelican', '102.Western_Wood_Pewee', '104.American_Pipit', '105.Whip_poor_Will', '108.White_necked_Raven', '109.American_Redstart', '110.Geococcyx', '112.Great_Grey_Shrike', '115.Brewer_Sparrow', '116.Chipping_Sparrow', '118.House_Sparrow', '119.Field_Sparrow', '122.Harris_Sparrow', '124.Le_Conte_Sparrow', '127.Savannah_Sparrow', '128.Seaside_Sparrow', '134.Cape_Glossy_Starling', '139.Scarlet_Tanager', '141.Artic_Tern', '142.Black_Tern', '143.Caspian_Tern', '148.Green_tailed_Towhee', '149.Brown_Thrasher', '151.Black_capped_Vireo', '152.Blue_headed_Vireo', '153.Philadelphia_Vireo', '183.Northern_Waterthrush', '185.Bohemian_Waxwing']

# ----

# sentences = ['Black_footed_Albatross', 'Laysan_Albatross', 'Sooty_Albatross', 'Groove_billed_Ani', 'Crested_Auklet', 'Least_Auklet', 'Parakeet_Auklet',
#               'Rhinoceros_Auklet', 'Brewer_Blackbird', 'Red_winged_Blackbird', 'Rusty_Blackbird', 'Yellow_headed_Blackbird', 'Bobolink', 
#               'Indigo_Bunting', 'Lazuli_Bunting', 'Painted_Bunting', 'Cardinal', 'Spotted_Catbird', 'Gray_Catbird', 'Yellow_breasted_Chat',
#                 'Eastern_Towhee', 'Chuck_will_Widow', 'Brandt_Cormorant', 'Red_faced_Cormorant', 'Pelagic_Cormorant', 'Bronzed_Cowbird', 'Shiny_Cowbird', 
#                 'Brown_Creeper', 'American_Crow', 'Fish_Crow', 'Black_billed_Cuckoo', 'Mangrove_Cuckoo', 'Yellow_billed_Cuckoo', 'Gray_crowned_Rosy_Finch', 
#                 'Purple_Finch', 'Northern_Flicker', 'Acadian_Flycatcher', 'Great_Crested_Flycatcher', 'Least_Flycatcher', 'Olive_sided_Flycatcher',
#                   'Scissor_tailed_Flycatcher', 'Vermilion_Flycatcher', 'Yellow_bellied_Flycatcher', 'Frigatebird', 'Northern_Fulmar', 'Gadwall',
#                     'American_Goldfinch', 'European_Goldfinch', 'Boat_tailed_Grackle', 'Eared_Grebe', 'Horned_Grebe', 'Pied_billed_Grebe', 'Western_Grebe', 
#                     'Blue_Grosbeak', 'Evening_Grosbeak', 'Pine_Grosbeak', 'Rose_breasted_Grosbeak', 'Pigeon_Guillemot', 'California_Gull', 'Glaucous_winged_Gull', 
#                     'Heermann_Gull', 'Herring_Gull', 'Ivory_Gull', 'Ring_billed_Gull', 'Slaty_backed_Gull', 'Western_Gull', 'Anna_Hummingbird', 'Ruby_throated_Hummingbird', 
#                     'Rufous_Hummingbird', 'Green_Violetear', 'Long_tailed_Jaeger', 'Pomarine_Jaeger', 'Blue_Jay', 'Florida_Jay', 'reen_Jay', 'Dark_eyed_Junco', 
#                     'Tropical_Kingbird', 'Gray_Kingbird', 'Belted_Kingfisher', 'Green_Kingfisher', 'Pied_Kingfisher', 'Ringed_Kingfisher', 'White_breasted_Kingfisher',
#                     'Red_legged_Kittiwake', 'Horned_Lark', 'Pacific_Loon', 'Mallard', 'Western_Meadowlark', 'Hooded_Merganser', 'Red_breasted_Merganser', 'Mockingbird', 
#                     'Nighthawk', 'Clark_Nutcracker', 'White_breasted_Nuthatch', 'Baltimore_Oriole', 'Hooded_Oriole', 'Orchard_Oriole', 'Scott_Oriole', 'Ovenbird', 
#                     'Brown_Pelican', 'White_Pelican', 'Western_Wood_Pewee', 'Sayornis', 'American_Pipit', 'Whip_poor_Will', 'Horned_Puffin', 'Common_Raven', 
#                     'White_necked_Raven', 'American_Redstart', 'Geococcyx', 'Loggerhead_Shrike', 'Great_Grey_Shrike', 'Baird_Sparrow', 'Black_throated_Sparrow', 
#                     'Brewer_Sparrow', 'Chipping_Sparrow', 'Clay_colored_Sparrow', 'House_Sparrow', 'Field_Sparrow', 'Fox_Sparrow', 'Grasshopper_Sparrow', 
#                     'Harris_Sparrow', 'Henslow_Sparrow', 'Le_Conte_Sparrow', 'Lincoln_Sparrow', 'Nelson_Sharp_tailed_Sparrow', 'Savannah_Sparrow', 
#                     'Seaside_Sparrow', 'Song_Sparrow', 'Tree_Sparrow', 'Vesper_Sparrow', 'White_crowned_Sparrow', 'White_throated_Sparrow', 
#                     'Cape_Glossy_Starling', 'Bank_Swallow', 'Barn_Swallow', 'Cliff_Swallow', 'Tree_Swallow', 'Scarlet_Tanager', 'Summer_Tanager', 
#                     'Artic_Tern', 'Black_Tern', 'Caspian_Tern', 'Common_Tern', 'Elegant_Tern', 'Forsters_Tern', 'Least_Tern', 'Green_tailed_Towhee', 
#                     'Brown_Thrasher', 'Sage_Thrasher', 'Black_capped_Vireo', 'Blue_headed_Vireo', 'Philadelphia_Vireo', 'Red_eyed_Vireo', 'Warbling_Vireo', 
#                     'White_eyed_Vireo', 'Yellow_throated_Vireo', 'Bay_breasted_Warbler', 'Black_and_white_Warbler', 'Black_throated_Blue_Warbler', 
#                     'Blue_winged_Warbler', 'Canada_Warbler', 'Cape_May_Warbler', 'Cerulean_Warbler', 'Chestnut_sided_Warbler', 'Golden_winged_Warbler', 
#                     'Hooded_Warbler', 'Kentucky_Warbler', 'Magnolia_Warbler', 'Mourning_Warbler', 'Myrtle_Warbler', 'Nashville_Warbler', 'Orange_crowned_Warbler', 
#                     'Palm_Warbler', 'Pine_Warbler', 'Prairie_Warbler', 'Prothonotary_Warbler', 'Swainson_Warbler', 'Tennessee_Warbler', 'Wilson_Warbler', 
#                     'Worm_eating_Warbler', 'Yellow_Warbler', 'Northern_Waterthrush', 'Louisiana_Waterthrush', 'Bohemian_Waxwing', 'Cedar_Waxwing', 
#                     'American_Three_toed_Woodpecker', 'Pileated_Woodpecker', 'Red_bellied_Woodpecker', 'Red_cockaded_Woodpecker', 'Red_headed_Woodpecker', 'Downy_Woodpecker', 
#              'Bewick_Wren', 'Cactus_Wren', 'Carolina_Wren', 'House_Wren', 'Marsh_Wren', 'Rock_Wren', 'Winter_Wren', 'Common_Yellowthroat']

# ----

# Model Vocabulary: ['Common_Yellowthroat', 'Ivory_Gull', 'Blue_Jay', 'Pomarine_Jaeger', 'Long_tailed_Jaeger', 'Green_Violetear', 'Rufous_Hummingbird', 'Ruby_throated_Hummingbird', 'Anna_Hummingbird', 'Western_Gull', 'Slaty_backed_Gull', 'Ring_billed_Gull', 'Herring_Gull', 'Winter_Wren', 'Heermann_Gull', 'Glaucous_winged_Gull', 'California_Gull', 'Pigeon_Guillemot', 'Rose_breasted_Grosbeak', 'Pine_Grosbeak', 'Evening_Grosbeak', 'Blue_Grosbeak', 'Western_Grebe', 'Pied_billed_Grebe', 'Florida_Jay', 'reen_Jay', 'Dark_eyed_Junco', 'Tropical_Kingbird', 'Scott_Oriole', 'Orchard_Oriole', 'Hooded_Oriole', 'Baltimore_Oriole', 'White_breasted_Nuthatch', 'Clark_Nutcracker', 'Nighthawk', 'Mockingbird', 'Red_breasted_Merganser', 'Hooded_Merganser', 'Western_Meadowlark', 'Mallard', 'Pacific_Loon', 'Horned_Lark', 'Red_legged_Kittiwake', 'White_breasted_Kingfisher', 'Ringed_Kingfisher', 'Pied_Kingfisher', 'Green_Kingfisher', 'Belted_Kingfisher', 'Gray_Kingbird', 'Horned_Grebe', 'Eared_Grebe', 'Boat_tailed_Grackle', 'Red_faced_Cormorant', 'Chuck_will_Widow', 'Eastern_Towhee', 'Yellow_breasted_Chat', 'Gray_Catbird', 'Spotted_Catbird', 'Cardinal', 'Painted_Bunting', 'Lazuli_Bunting', 'Indigo_Bunting', 'Bobolink', 'Yellow_headed_Blackbird', 'Rusty_Blackbird', 'Red_winged_Blackbird', 'Brewer_Blackbird', 'Rhinoceros_Auklet', 'Parakeet_Auklet', 'Least_Auklet', 'Crested_Auklet', 'Groove_billed_Ani', 'Sooty_Albatross', 'Laysan_Albatross', 'Brandt_Cormorant', 'Pelagic_Cormorant', 'European_Goldfinch', 'Bronzed_Cowbird', 'American_Goldfinch', 'Gadwall', 'Northern_Fulmar', 'Frigatebird', 'Yellow_bellied_Flycatcher', 'Vermilion_Flycatcher', 'Scissor_tailed_Flycatcher', 'Olive_sided_Flycatcher', 'Least_Flycatcher', 'Great_Crested_Flycatcher', 'Acadian_Flycatcher', 'Northern_Flicker', 'Purple_Finch', 'Gray_crowned_Rosy_Finch', 'Yellow_billed_Cuckoo', 'Mangrove_Cuckoo', 'Black_billed_Cuckoo', 'Fish_Crow', 'American_Crow', 'Brown_Creeper', 'Shiny_Cowbird', 'Ovenbird', 'Brown_Pelican', 
# 'White_Pelican', 'Pine_Warbler', 'Orange_crowned_Warbler', 'Nashville_Warbler', 'Myrtle_Warbler', 'Mourning_Warbler', 'Magnolia_Warbler', 'Kentucky_Warbler', 'Hooded_Warbler', 'Golden_winged_Warbler', 'Chestnut_sided_Warbler', 'Cerulean_Warbler', 'Cape_May_Warbler', 'Canada_Warbler', 'Blue_winged_Warbler', 'Black_throated_Blue_Warbler', 'Black_and_white_Warbler', 'Bay_breasted_Warbler', 'Yellow_throated_Vireo', 'White_eyed_Vireo', 'Warbling_Vireo', 'Red_eyed_Vireo', 'Philadelphia_Vireo', 'Palm_Warbler', 'Prairie_Warbler', 'Black_capped_Vireo', 'Prothonotary_Warbler', 'Rock_Wren', 'Marsh_Wren', 'House_Wren', 'Carolina_Wren', 'Cactus_Wren', 'Bewick_Wren', 'Downy_Woodpecker', 'Red_headed_Woodpecker', 'Red_cockaded_Woodpecker', 'Red_bellied_Woodpecker', 'Pileated_Woodpecker', 'American_Three_toed_Woodpecker', 'Cedar_Waxwing', 'Bohemian_Waxwing', 'Louisiana_Waterthrush', 'Northern_Waterthrush', 'Yellow_Warbler', 'Worm_eating_Warbler', 'Wilson_Warbler', 'Tennessee_Warbler', 'Swainson_Warbler', 'Blue_headed_Vireo', 'Sage_Thrasher', 'Western_Wood_Pewee', 'Lincoln_Sparrow', 'Henslow_Sparrow', 'Harris_Sparrow', 'Grasshopper_Sparrow', 'Fox_Sparrow', 'Field_Sparrow', 'House_Sparrow', 'Clay_colored_Sparrow', 'Chipping_Sparrow', 'Brewer_Sparrow', 'Black_throated_Sparrow', 'Baird_Sparrow', 'Great_Grey_Shrike', 'Loggerhead_Shrike', 'Geococcyx', 'American_Redstart', 'White_necked_Raven', 'Common_Raven', 'Horned_Puffin', 'Whip_poor_Will', 'American_Pipit', 'Sayornis', 'Le_Conte_Sparrow', 'Nelson_Sharp_tailed_Sparrow', 'Brown_Thrasher', 'Savannah_Sparrow', 'Green_tailed_Towhee', 'Least_Tern', 'Forsters_Tern', 'Elegant_Tern', 'Common_Tern', 'Caspian_Tern', 'Black_Tern', 'Artic_Tern', 'Summer_Tanager', 'Scarlet_Tanager', 'Tree_Swallow', 'Cliff_Swallow', 'Barn_Swallow', 'Bank_Swallow', 'Cape_Glossy_Starling', 'White_throated_Sparrow', 'White_crowned_Sparrow', 'Vesper_Sparrow', 'Tree_Sparrow', 'Song_Sparrow', 'Seaside_Sparrow', 'Black_footed_Albatross']

# ----

