**Language models** are mathematical models used to understand, generate, and analyze text in a language. Their main purpose is to represent sequences of words, sentences, or paragraphs probabilistically and to capture their meaning. They are used in natural language processing (NLP) applications and try to "understand" what a text says.
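
To make "probabilistically represent sequences of words" concrete, here is a minimal sketch of a bigram model, which scores a sentence as a product of next-word probabilities (the chain rule with a first-order Markov assumption). The toy corpus and the function names `train_bigram` and `sequence_probability` are made up for illustration; models such as BERT and GPT are far more elaborate.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); real models train on far more text.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran",
]

def train_bigram(sentences):
    """Count how often each word follows each context word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def sequence_probability(counts, sentence):
    """P(sentence) as a product of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        if total == 0 or counts[prev][curr] == 0:
            return 0.0  # unseen bigram; real models smooth instead of returning zero
        prob *= counts[prev][curr] / total
    return prob

bigram_counts = train_bigram(corpus)
print(sequence_probability(bigram_counts, "the cat sat on the mat"))
print(sequence_probability(bigram_counts, "mat the on sat cat the"))
```

A word order seen in the corpus receives a non-zero probability, while the scrambled sentence receives zero because it starts with a bigram the model never observed.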

**Sample Models: BERT and GPT**
1. BERT (Bidirectional Encoder Representations from Transformers):
   - What does it do?: BERT reads the context of each word in both directions (the preceding and the following words), so it is a bidirectional model.
   - Areas of use: Tasks such as text classification, question answering, and sentiment analysis.
   - Features: Because it is trained to infer meaning from the entire sentence, it captures the nuances of the language better.

2. GPT (Generative Pre-trained Transformer):
   - What does it do?: GPT is a model used to generate text. Given a starting text, it produces a continuation by repeatedly predicting what comes next.
   - Areas of use: Generative tasks such as text completion, chatbots, and creative writing.
   - Features: It works only from left to right; that is, it predicts the next word by looking only at the previous words.
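
The difference between "both directions" and "left to right" can be pictured as a visibility matrix over the tokens. The following is a simplified sketch, not the actual implementation of either model; the helper `attention_mask` and the example tokens are hypothetical.

```python
def attention_mask(num_tokens, causal):
    """Return a matrix where mask[i][j] == 1 means token i may look at token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

tokens = ["the", "cat", "sat", "down"]

bidirectional = attention_mask(len(tokens), causal=False)  # BERT-style: every token sees all tokens
left_to_right = attention_mask(len(tokens), causal=True)   # GPT-style: a token sees itself and earlier tokens

for name, mask in [("bidirectional", bidirectional), ("left-to-right", left_to_right)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the left-to-right case the matrix is lower triangular, which is why such a model can be used to predict the next word without seeing it.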

**Text Generation**

In [1]:
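from transformers import pipeline, set_seed
# load GPT-2 XL as a text-generation pipeline
generator = pipeline('text-generation', model='gpt2-xl')
# fix the random seed so the sampled continuations are reproducible
set_seed(42)
# generate 5 continuations, each capped at 30 tokens including the prompt
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)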



[{'generated_text': "Hello, I'm a language model, but what I do you need to know isn't that hard. But if you want to understand us, you"},
 {'generated_text': "Hello, I'm a language model, this is my first commit and I'd like to get some feedback to see if I understand this commit.\n"},
 {'generated_text': "Hello, I'm a language model, and I'll guide you on your journey!\n\nLet's get to it.\n\nBefore we start"},
 {'generated_text': 'Hello, I\'m a language model, not a developer." If everything you\'re learning about code is through books, you\'ll never get to know about'},
 {'generated_text': 'Hello, I\'m a language model, please tell me what you think!" – I started out on this track, and now I am doing a lot'}]

In [2]:
# reuse the `generator` pipeline loaded in the previous cell instead of loading the model again
set_seed(42)
generator("This is a NLP exercise", max_length=30, num_return_sequences=5)



[{'generated_text': 'This is a NLP exercise: what you do in this exercise is to change everything in your thoughts and attitudes in order to see those things through to'},
 {'generated_text': 'This is a NLP exercise to teach yourself the best and most effective ways to speak with people who are in a position to earn some money.\n'},
 {'generated_text': 'This is a NLP exercise, but it should remind you that words can help improve memory.\n\nNLP is all about using words as resources'},
 {'generated_text': 'This is a NLP exercise, one of tens of thousands of experiments in which AI is being used to predict the thoughts and intentions of people. These'},
 {'generated_text': 'This is a NLP exercise. I don\'t know anything about NLP. Just thought it was a nice exercise." – Dave Smith\n\n5'}]
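
The five continuations differ from one another, which suggests that each next token is drawn at random from the model's predicted distribution, and `set_seed(42)` makes those draws repeatable. A minimal sketch of that idea follows, using Python's `random` module and a hypothetical three-word distribution rather than anything produced by GPT-2.

```python
import random

# Hypothetical next-word distribution, standing in for a model's prediction.
next_word_probs = {"model": 0.5, "exercise": 0.3, "test": 0.2}

def sample_words(seed, count):
    """Draw `count` words from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return rng.choices(words, weights=weights, k=count)

first_run = sample_words(seed=42, count=5)
second_run = sample_words(seed=42, count=5)
print(first_run)
print(first_run == second_run)  # same seed, same draws
```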

**Sentiment Analysis**

In [3]:
from transformers import pipeline

# create a sentiment-analysis pipeline, naming the model explicitly
# (this is the checkpoint the pipeline selects by default when no model is given)
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)

# analyze the text
result = sentiment_analysis("I really love this movie!")

print(result)

[{'label': 'POSITIVE', 'score': 0.999879002571106}]
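
The `score` can be read as a probability over the labels. Assuming the classifier produces one raw score (logit) per label and normalizes them with a softmax, which is the usual setup for single-label classification, the calculation looks roughly like this. The logit values below are hypothetical and were not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels (NEGATIVE, POSITIVE); not taken from the model.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.3, 4.7]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": probs[best]})
```

A large gap between the two logits pushes the winning probability close to 1, which is consistent with the very high score reported for a clearly positive sentence.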


**Text Summarization**

In [5]:
from transformers import pipeline

# load a summarization pipeline with a specific checkpoint
summarizer = pipeline("summarization", model="Falconsai/text_summarization")

text = "Language models are mathematical models used to understand, produce and analyze texts in a language. Its main purpose is to probabilistically represent and give meaning to sequences of words, sentences or paragraphs in a language. These models are used in natural language processing (NLP) applications and try to understand what a text says."

# max_length and min_length bound the summary length in tokens; do_sample=False makes decoding deterministic
print(summarizer(text, max_length=1000, min_length=30, do_sample=False))







[{'summary_text': 'Language models are mathematical models used to understand, produce and analyze texts . Its main purpose is to probabilistically represent and give meaning to sequences of words, sentences or paragraphs in a language .'}]


**Sentence Similarity**

In [10]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# load the model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

sentences = ["The cat is sitting on the mat", "A dog is running in the park"]

# convert sentences to vectors
embeddings = model.encode(sentences)

# calculate the similarity between two sentences
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])

print("Similarity Score:", similarity[0][0])



Similarity Score: 0.06508216
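
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their magnitude. A minimal pure-Python sketch with made-up 3-dimensional vectors (the function name `cosine` is hypothetical; the cell above uses scikit-learn's `cosine_similarity`):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; real sentence embeddings have hundreds of dimensions.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine(a, b))
print(cosine(a, c))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, so a value near 0.065 indicates that the two sentences are treated as almost unrelated.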


Sources:

*   https://huggingface.co/openai-community/gpt2-large
*   https://huggingface.co/Falconsai/text_summarization
*   https://huggingface.co/sentence-transformers/all-mpnet-base-v2

