# Coding by Lake

# Algoritma

Algoritma Data Science School is an educational institution based in Indonesia that specializes in providing training and courses in the field of data science and analytics. It aims to equip individuals with the knowledge and skills needed to excel in the rapidly growing field of data science.

This Jupyter Notebook has been specifically designed for Algoritma Data Science School, providing a platform for students to explore, analyze, and visualize data using Python and its associated data science libraries. Jupyter Notebook serves as an interactive workspace where students can demonstrate their understanding of data science concepts, showcase their skills, and present their findings in a structured and organized manner. Please note that this Jupyter Notebook is intended for personal or educational use only. Kindly refrain from reproducing or distributing this notebook without prior permission. 


## Large Language Model Demo

A large language model is a type of artificial intelligence model that is designed to process and generate human-like text based on its understanding of language patterns and context. These models are typically trained on vast amounts of text data and use deep learning techniques to learn the statistical relationships between words and phrases.

Large language models, such as GPT-3 (Generative Pre-trained Transformer 3), are pre-trained without hand-labeled data, an approach usually described as unsupervised or self-supervised learning. They are exposed to a massive corpus of text from various sources, including books, articles, and websites. During training, the model learns to predict the next word in a sequence based on the context provided by the preceding words.

The training process involves optimizing the model's parameters to maximize the probability of generating coherent and contextually relevant text. This enables the model to learn grammar, syntax, semantics, and even some level of world knowledge based on the patterns it observes in the training data.
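To make the idea of next-word prediction concrete, here is a minimal sketch of a bigram model that estimates the probability of the next word from simple counts. It is a toy stand-in, not how GPT-3 is trained: the corpus and the helper names `train_bigram` and `next_word_probs` are made up for illustration, and a real language model conditions on far more context than one preceding word.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" and "fish" once each
```

Training a neural language model replaces the counting step with parameters that are adjusted so that the words actually observed in the corpus receive high probability.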

Large language models like GPT-3 have billions of parameters (GPT-3 itself has 175 billion), allowing them to capture intricate details and nuances of language. They can generate text in a wide range of styles, mimic human-like responses, and perform various language-related tasks such as translation, summarization, question-answering, and even creative writing.

These models have found applications in numerous areas, including natural language processing, chatbots, virtual assistants, content generation, language translation, and more. They have demonstrated impressive capabilities in understanding and generating human-like text, although it's important to note that they have limitations and can sometimes produce incorrect or nonsensical outputs.

It's worth mentioning that large language models are resource-intensive and require significant computational power and data for both training and inference. As technology advances, researchers and developers continue to explore and refine these models, pushing the boundaries of what can be achieved in natural language understanding and generation.

### Transformer

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. It has become a groundbreaking model in the field of natural language processing (NLP) and has achieved state-of-the-art results in various NLP tasks.

You can read the paper here: https://arxiv.org/abs/1706.03762

The key idea behind the Transformer architecture is self-attention. Unlike recurrent neural networks (RNNs), which process inputs one step at a time, the Transformer allows each word in an input sequence to attend to all other words, capturing the dependencies between them. This self-attention mechanism enables the model to capture contextual relationships effectively, weighing the importance of each word in the context of the others.
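The following is a minimal sketch of scaled dot-product self-attention in plain Python. It makes a strong simplification: the queries, keys, and values are the raw embeddings themselves, whereas a real Transformer first multiplies the embeddings by learned projection matrices. The tokens and the two-dimensional embeddings are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product attention with Q = K = V = embeddings (a simplification)."""
    d = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score the current word against every word in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights = softmax(scores)
        # The output is the attention-weighted sum of all embeddings.
        output = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(d)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["the", "cat", "sat"]  # hypothetical example sentence
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d embeddings
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sequence, including itself. Because every row is computed independently of the others, all positions can be processed in parallel rather than one step at a time.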

The Transformer architecture consists of two main components: the encoder and the decoder. The encoder processes the input sequence and generates a representation, which is then passed to the decoder. The decoder generates the output sequence based on the encoder's representation and previous outputs.

Here are the key components and concepts of the Transformer architecture:

1. Positional Encoding: Transformers incorporate positional encoding to provide the model with information about the order of words in the input sequence. Positional encoding is usually added to the input embeddings and allows the model to differentiate between the positions of words.

![1.png](attachment:1.png)

2. Self-Attention Mechanism: Self-attention allows each word in the input sequence to attend to all other words. It computes the attention weight between each pair of words and uses them to generate a weighted sum of the word embeddings. This mechanism enables the model to capture long-range dependencies and learn contextual relationships effectively.

![2.png](attachment:2.png)
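As an illustration of positional encoding, here is a minimal sketch of the sinusoidal scheme described in "Attention Is All You Need", where even dimensions use a sine and odd dimensions use a cosine of the position scaled by a dimension-dependent frequency. The function name `positional_encoding` and the small sizes are assumptions for this example; other models use different schemes, such as learned position embeddings.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: one vector of size d_model per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=6)

# Made-up word embeddings, one per position, all zeros for clarity.
embeddings = [[0.0] * 6 for _ in range(4)]
# The encoding is added element-wise to the embeddings.
encoded = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]
for pos, row in enumerate(encoded):
    print(pos, [round(x, 3) for x in row])
```

Every position receives a distinct vector, so two occurrences of the same word at different positions no longer look identical to the model.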

## LangChain

LangChain is a framework for developing applications powered by language models. It is designed around these principles:

1. Data-aware: connect a language model to other sources of data

2. Agentic: allow a language model to interact with its environment

### dotenv

The dotenv library is a popular Python library that simplifies the process of loading environment variables from a .env file into your Python application. It allows you to store configuration variables separately from your code, making it easier to manage sensitive information such as API keys, database credentials, or other environment-specific settings.

#### `.env` file

The .env file is a text file commonly used in software development projects to store environment-specific configuration variables. It follows a simple key-value format, where each line represents a single configuration variable.

Here are a few important points about the .env file:

1. Purpose: The primary purpose of the .env file is to separate sensitive or environment-specific information from your codebase. It allows you to store configuration variables such as API keys, database credentials, or other settings that may change based on the environment (e.g., development, staging, production).

2. File Format: The .env file is typically a plain text file without any special formatting. Each line in the file consists of a key-value pair, where the key and value are separated by an equal sign (=). For example:

```
API_KEY=abc123
DATABASE_URL=mysql://user:password@localhost/db
```

3. Environment Variables: Each line in the .env file represents an environment variable. The key is the name of the environment variable, and the value is the corresponding value for that variable. These variables can be accessed within your code to retrieve the associated values.

4. Loading Variables: To make use of the variables defined in the .env file, you need to load them into your application. This is typically done using a library like dotenv in Python. The library reads the .env file and sets the defined variables as environment variables that can be accessed within your code.

5. Security: It's essential to ensure the security of your .env file. The file may contain sensitive information, such as passwords or access tokens. Make sure to exclude the .env file from version control systems like Git and only share it with authorized individuals who require access to the environment-specific configuration.

Overall, the .env file provides a convenient and flexible way to manage configuration variables in your project, allowing you to keep sensitive information separate from your code and easily configure different environments.
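To show what "loading" a .env file means in practice, here is a simplified sketch that parses key-value lines and copies them into the process environment. It is only an illustration of the idea and not the dotenv library's actual implementation: the helper `load_env_file` is hypothetical, and it ignores features such as variable expansion and multi-line values. The demo writes a throwaway .env file into a temporary directory so that the example is self-contained.

```python
import os
import tempfile


def load_env_file(path):
    """Parse KEY=VALUE lines and copy them into os.environ (simplified sketch)."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without a key-value pair.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip().strip("'\"")
            # Do not overwrite variables that are already set in the environment.
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded


# Demo values only; never commit real credentials.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# demo values only\n")
        handle.write("API_KEY=abc123\n")
        handle.write("DATABASE_URL=mysql://user:password@localhost/db\n")
    loaded = load_env_file(env_path)

print(sorted(loaded))
print("API_KEY" in os.environ)
```

After loading, the rest of the application reads the values with `os.getenv("API_KEY")` and never needs to know where they came from.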


In [29]:
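import os
from dotenv import load_dotenv
# Note: these import paths match the LangChain release this notebook was written against;
# later releases may expose these classes from different modules.
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain
from langchain.document_loaders.csv_loader import CSVLoader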

load_dotenv()

True

## Conversing with SQL database

In this part we load the data. As an example, we use the Chinook database from our academy class. The connection URI must state explicitly which kind of database you are loading; for a SQLite file, the URI starts with `sqlite:///`.

Then you can load the database with `SQLDatabase` from `langchain`.

In [14]:
dburi = "sqlite:///data_input/chinook.db"
db = SQLDatabase.from_uri(dburi)

Next, we call the OpenAI API through the `OpenAI` class. Here we set only one parameter, `temperature`.

### Temperature

In the context of a language model, "temperature" refers to a parameter that influences the randomness and creativity of the generated text. It is used during the text generation process to control the diversity of the model's output.

When generating text using a language model, the model assigns probabilities to each possible word or token in the vocabulary based on the input context. The higher the probability assigned to a word, the more likely it is to be selected as the next word in the generated text. The temperature parameter affects how these probabilities are interpreted.

A higher temperature value, such as 1.0, increases the randomness of the generated text. This means that words with lower probabilities have a higher chance of being selected, leading to more varied and diverse output. Higher temperature values can result in more creative and surprising text, but they may also lead to less coherent or less contextually relevant output.

Conversely, a lower temperature value, such as 0.5, reduces the randomness of the generated text. Words with higher probabilities are more likely to be chosen, resulting in more focused and deterministic output. Lower temperature values tend to produce more conservative and predictable text, closely resembling the patterns and styles observed in the training data.

The choice of temperature depends on the desired balance between randomness and coherence in the generated text. Higher values like 1.0 encourage exploration and creativity, while lower values like 0.5 prioritize consistency and adherence to the training data patterns. Researchers and developers often experiment with different temperature values to find the right balance for their specific use cases and preferences.
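The effect of temperature can be sketched with a softmax over a handful of raw scores (logits): each logit is divided by the temperature before it is turned into a probability. The logits below are made-up values for three candidate tokens, and `softmax_with_temperature` is a hypothetical helper written for this illustration. Note that a temperature of exactly 0 cannot be plugged into this formula; it is commonly treated as always picking the most likely token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature, the probability mass concentrates on the top-scoring token; at the higher temperature, the distribution flattens and the less likely tokens are sampled more often.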

In [15]:
llm = OpenAI(temperature=0)

#### Chain

A chain is a deliberately generic concept: it refers to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.

The most commonly used type of chain is an LLMChain, which combines a PromptTemplate, a Model, and Guardrails to take user input, format it accordingly, pass it to the model and get a response, and then validate and fix (if necessary) the model output.
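The chain idea can be sketched in a few lines of plain Python: format a prompt, pass it to a model, then clean up the output. This is a conceptual illustration only and does not use LangChain's API. The names `fake_model`, `clean_output`, and `run_chain` are hypothetical, and the "model" is a canned function standing in for a real LLM call.

```python
def fake_model(prompt):
    """Hypothetical stand-in for an LLM call; returns a canned, messy reply."""
    return f"  ANSWER: (model reply to: {prompt})  "


def clean_output(raw):
    """Validate and tidy the raw model output."""
    text = raw.strip()
    if text.startswith("ANSWER:"):
        text = text[len("ANSWER:"):].strip()
    return text


def run_chain(steps, value):
    """Feed the output of each step into the next one."""
    for step in steps:
        value = step(value)
    return value


template = "Answer briefly: {question}"
chain = [
    lambda question: template.format(question=question),  # prompt template
    fake_model,                                            # model
    clean_output,                                          # output validation
]
result = run_chain(chain, "How many rows are in the tracks table?")
print(result)
```

Because each step only needs to accept the previous step's output, steps can be swapped or extended independently, which is what makes chains composable.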

In [None]:
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

db_chain.run("How many rows is in the tracks table of this db?")

As another example, we use a question from our dive deeper exercise:

> all sales in rock genre in 2012



> Entering new SQLDatabaseChain chain...
How many sales of rock genre at 2012 in invoice table
SQLQuery: SELECT COUNT(*) FROM invoices i INNER JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId INNER JOIN tracks t ON ii.TrackId = t.TrackId INNER JOIN genres g ON t.GenreId = g.GenreId WHERE g.Name = 'Rock' AND strftime('%Y', i.InvoiceDate) = '2012';
SQLResult: [(164,)]
Answer: There were 164 sales of rock genre in 2012.
> Finished chain.


'There were 164 sales of rock genre in 2012.'
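Since the chain prints the SQL it generated, the query can be checked by hand. The sketch below runs the same query with Python's built-in `sqlite3` module against a tiny in-memory database. The tables reuse the table and column names that appear in the generated query, but the rows are made up for illustration and are not the Chinook data, so the count differs from the 164 reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, GenreId INTEGER);
    CREATE TABLE invoices (InvoiceId INTEGER PRIMARY KEY, InvoiceDate TEXT);
    CREATE TABLE invoice_items (
        InvoiceLineId INTEGER PRIMARY KEY, InvoiceId INTEGER, TrackId INTEGER
    );
    -- Made-up rows for illustration only.
    INSERT INTO genres VALUES (1, 'Rock'), (2, 'Pop');
    INSERT INTO tracks VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO invoices VALUES
        (1, '2012-03-01 00:00:00'), (2, '2013-01-15 00:00:00');
    INSERT INTO invoice_items VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 1);
    """
)

# The query text is the one generated by the chain.
query = """
SELECT COUNT(*)
FROM invoices i
INNER JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
INNER JOIN tracks t ON ii.TrackId = t.TrackId
INNER JOIN genres g ON t.GenreId = g.GenreId
WHERE g.Name = 'Rock' AND strftime('%Y', i.InvoiceDate) = '2012';
"""
rock_sales_2012 = conn.execute(query).fetchone()[0]
print(rock_sales_2012)
conn.close()
```

Note that `COUNT(*)` after these joins counts invoice line items rather than distinct invoices: the single 2012 invoice in the toy data contributes two Rock rows. Reading the generated SQL this way helps confirm that the model interpreted "sales" the way you intended.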

> We want the returned DataFrame to contain only the Pop genre and only when the UnitPrice of the track is 0.99

### Conversing with CSV

You can do the same thing with a CSV file. This time we use another familiar dataset from our academy, `rice`.

In [39]:
filepath = "data_input/rice.csv"

In [40]:
from langchain.agents import create_csv_agent
agent = create_csv_agent(llm, filepath, verbose=True)

In [42]:
# Prompt in Indonesian: "give the details of the number of transactions that occur in each format"
agent.run("berikan detail banyaknya transaksi yang terjadi di setiap format")



> Entering new AgentExecutor chain...
Thought: saya harus menghitung jumlah transaksi yang terjadi di setiap format
Action: python_repl_ast
Action Input: df.groupby('format')['receipt_id'].count()
Observation: format
hypermarket     999
minimarket     7088
supermarket    3913
Name: receipt_id, dtype: int64
Thought: Saya sekarang tahu jawaban akhir
Final Answer: Hypermarket memiliki 999 transaksi, Minimarket memiliki 7088 transaksi, dan Supermarket memiliki 3913 transaksi.

> Finished chain.


'Hypermarket memiliki 999 transaksi, Minimarket memiliki 7088 transaksi, dan Supermarket memiliki 3913 transaksi.'

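In English, the agent's final answer reads: Hypermarket has 999 transactions, Minimarket has 7088 transactions, and Supermarket has 3913 transactions.

I hope this new knowledge serves you well!

For more tutorials on large language models and `langchain`, visit our course producer's GitHub repository: https://github.com/onlyphantom/llm-python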

