#### **Natural Language Processing**

`NLP` is a field of computational linguistics that focuses on enabling computers to understand, interpret and generate human language. `LLMs` are the most advanced type of model in NLP. NLP is commonly divided into tasks that models attempt to complete:
- `Classifying words`: Named Entity Recognition (NER), Part-of-Speech Tagging
- `Classifying sentences`: Sentiment Analysis
- `Generating text`: summarisation, autocompletion
- `Extracting answers`: Question Answering (QA)

> **What are LLMs?**
> **How do they differ from other NLP models?**

`LLMs` are NLP models characterised by their massive size, extensive training data and ability to complete a wide array of language tasks with minimal task-specific training. At the heart of LLMs is an architecture known as the `Transformer`.

##### **Characteristics of LLMs**
1. Scale
2. Emergence
3. General capabilities
4. In-context learning
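
In-context learning, the last item above, means the model picks up a task from examples placed directly in its prompt, with no extra training. The snippet below is a minimal sketch of how such a few-shot prompt could be assembled. The function name `build_few_shot_prompt`, the instruction wording and the example sentences are all illustrative assumptions, not part of any library.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: an instruction, labelled examples, then the query.

    `examples` is a list of (sentence, label) pairs. The layout used here is a
    hypothetical convention chosen for illustration.
    """
    lines = ["Classify the sentiment of each sentence as POSITIVE or NEGATIVE.", ""]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final entry is left unlabelled so the model completes it.
    lines.append(f"Sentence: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical demonstration data.
demo_examples = [
    ("I loved every minute of it", "POSITIVE"),
    ("This was a waste of time", "NEGATIVE"),
]
prompt = build_few_shot_prompt(demo_examples, "The course was excellent")
print(prompt)
```

The resulting string would be sent to a text-generation model, which is expected to continue it with a label.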

Code is written by people all over the world who build applications and websites, and who need to share that code, either to benefit others or to collaborate with colleagues. People working on NLP models have the same need. Sharing means that for almost any NLP task we do not have to build a model from scratch; we can use or fine-tune an existing one. Likewise, when we build a good model ourselves, we can publish it for others to use.

What `GitHub` is for developers and software engineers, Hugging Face (`🤗`) is for data scientists, AI engineers, ML engineers and just about anyone working on AI, Machine Learning and NLP. This central collection of state-of-the-art models, together with the `Transformers` library developed by 🤗, allows us to quickly accomplish NLP tasks with strong accuracy without training models from scratch.

At the heart of it all is the `pipeline`. Every NLP task is served by an ML model, and ML models need numeric inputs, which raw language is not. Converting the input (a word, phrase or sentence) into numbers the model can understand is `pre-processing`; running the model to get predictions is inference; converting those predictions back into human-readable output is `post-processing`. This process has been studied widely and can therefore be standardised into a series of steps. The pipeline houses all of these conversions, transformations and inference: your text goes in and a readable result comes out, without you ever having to explicitly design what happens in between.
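
To make the three stages concrete, here is a toy sketch in plain Python. It is not how `Transformers` is implemented: the vocabulary, the per-token weights and all function names are hypothetical stand-ins, chosen only to show text becoming numbers, numbers becoming scores, and scores becoming a label.

```python
import math

# Hypothetical toy vocabulary: known words map to integer ids; 0 means "unknown".
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}

# Hypothetical per-token weights for the two classes (NEGATIVE, POSITIVE).
WEIGHTS = {2: (-2.0, 2.0), 3: (2.0, -2.0)}
LABELS = ("NEGATIVE", "POSITIVE")


def preprocess(text):
    """Pre-processing: split on whitespace and map each token to a numeric id."""
    tokens = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]


def model_forward(token_ids):
    """Inference stand-in: sum the per-token weights into two raw scores (logits)."""
    logits = [0.0, 0.0]
    for tid in token_ids:
        neg, pos = WEIGHTS.get(tid, (0.0, 0.0))
        logits[0] += neg
        logits[1] += pos
    return logits


def postprocess(logits):
    """Post-processing: softmax the logits and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, mirroring the shape of a real pipeline call."""
    return postprocess(model_forward(preprocess(text)))


print(toy_pipeline("I love this course"))
print(toy_pipeline("I hate this!"))
```

A real pipeline replaces each stand-in with a trained tokenizer and a neural network, but the flow of data is the same.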



In [2]:
from transformers import pipeline

##### **Sentiment Analysis**

In [None]:
sent_class = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.










All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



Device set to use 0
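
The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so follows; the model id and revision are copied from that warning, while the names `PINNED_SENTIMENT` and `build_sentiment_pipeline` are illustrative. The import sits inside the function so that defining it does not trigger a download.

```python
# Model id and revision copied from the warning printed by pipeline("sentiment-analysis").
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_sentiment_pipeline():
    """Create the classifier from an explicit checkpoint (downloads on first use)."""
    # Imported lazily; requires the `transformers` package and network access.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `sent_class = build_sentiment_pipeline()` should then behave like the cell above, but without depending on whichever default the library ships.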


In [4]:
text = [
    "I have been waiting for this job for a really long time",
    "Anger is an amazing motivator if chanelled correctly",
    "The Kenyan government is very unpopular with its citizens",
    "I hate this so much!",
    "HuggingFace courses are amazing"
]

In [5]:
sent_class(text)

[{'label': 'NEGATIVE', 'score': 0.9700242877006531},
 {'label': 'POSITIVE', 'score': 0.9994468092918396},
 {'label': 'NEGATIVE', 'score': 0.9991427659988403},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'POSITIVE', 'score': 0.9998717308044434}]
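
The pipeline returns one dictionary per input, in the same order as the inputs. The sketch below pairs each sentence with its prediction and flags the less confident ones. The texts and scores are copied from the cells above; the helper name `summarize` and the 0.99 threshold are assumptions made for illustration.

```python
texts = [
    "I have been waiting for this job for a really long time",
    "Anger is an amazing motivator if chanelled correctly",
    "The Kenyan government is very unpopular with its citizens",
    "I hate this so much!",
    "HuggingFace courses are amazing",
]

# Output recorded from sent_class(text) above.
results = [
    {"label": "NEGATIVE", "score": 0.9700242877006531},
    {"label": "POSITIVE", "score": 0.9994468092918396},
    {"label": "NEGATIVE", "score": 0.9991427659988403},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
    {"label": "POSITIVE", "score": 0.9998717308044434},
]

CONFIDENCE_THRESHOLD = 0.99  # hypothetical cut-off for "worth a second look"


def summarize(texts, results, threshold):
    """Return one readable line per input, marking predictions below `threshold`."""
    rows = []
    for text, res in zip(texts, results):
        flag = "" if res["score"] >= threshold else " (review)"
        rows.append(f"{res['label']:<8} {res['score']:.3f}{flag}  {text}")
    return rows


for row in summarize(texts, results, CONFIDENCE_THRESHOLD):
    print(row)
```

With this threshold only the first sentence is flagged, which fits its ambiguous wording: waiting a long time for a job can be read as either eager or frustrated.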

##### **Text Generation**

In [8]:
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.



Device set to use 0


In [11]:
generator(
    "Some NLP projects that force any junior to learn the necessary skils to one day be a great professional are",
    num_return_sequences=4,
    max_length=120,
)

[{'generated_text': "Some NLP projects that force any junior to learn the necessary skils to one day be a great professional are NLP classes that put a junior in a class of only ten (10) that takes advantage of him/herself to find the proper skils. It's usually called a nrlclass, a very high-stakes nrl class to get a high ranking by the junior through a rigorous, thorough and hard-to-understand study.\n\nThere are NLP classes in the public university of any state, even with a private university. What exactly is a nrlclass"},
 {'generated_text': "Some NLP projects that force any junior to learn the necessary skils to one day be a great professional are: – the 'Inner Skil' programme, a comprehensive online workshop, which teaches new skil-related concepts and has been running for some 7 years. – (formerly known as the 'Orientation Centre for the Development of Skil Learning') the 'Skil Day' course, at which junior students are given to discover what works and isn't effective with the res

##### **Text to audio**

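Hugging Face also hosts text-to-audio models, and recent versions of `Transformers` expose a `text-to-audio` pipeline task for them. Here we instead use `pyttsx3`, a Python library that generates speech from text offline, using the speech engines installed on the operating system.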

In [1]:
import pyttsx3

In [2]:
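def speak(text):
    """Read `text` aloud through the system's default speech engine."""
    engine = pyttsx3.init()
    engine.setProperty('rate', 140)  # speaking rate in words per minute
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[0].id)  # use the first installed voice
    engine.setProperty('volume', 0.8)  # volume between 0.0 and 1.0
    engine.say(text)  # queue the utterance
    engine.runAndWait()  # block until the queued speech has finished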

In [3]:
speak("Fun isn't something one considers when balancing the fate of the universe, but this; I'm going to enjoy very much!")

In [2]:
engine = pyttsx3.init()