### Chat With SQL | Data Ingestion 

This notebook is a simple example of how to ingest data into a SQL database using [SQLAlchemy](https://www.sqlalchemy.org/). We use a local [Postgres DB](https://www.postgresql.org/) as our engine, but it should work with other engines such as DuckDB.

We start by importing all the libraries. You can install these dependencies from [requirements.txt](/chat-with-sql/requirements.txt).

In [1]:
import pandas as pd
from sqlalchemy import (
    create_engine, 
    MetaData,
    Table,
    Column,
    String,
    Integer,
    insert
)

Once the libraries are imported, we define the configuration required to connect to our Postgres database. You can use either a local database or a hosted one from one of the popular cloud providers. Edit the cell below to provide the credentials for the connection.

In [2]:
# PostgreSQL configuration

username = 'xxxxxx'      # Enter your user name here
password = 'xxxxxx'      # Enter the password here
host = 'localhost'       # Enter the host (localhost in our example)
port = '5432'            # Enter the port. For Postgres, the default port is 5432
database = 'xxxxxx'      # Enter the database name here

For this example, we use the [Netflix movies dataset](https://www.kaggle.com/datasets/rahulvyasm/netflix-movies-and-tv-shows) from Kaggle. The dataset is a single CSV file. We load it with pandas, then iterate over the rows of the dataframe and insert each one into our table. To test the code, simply run the cells below. You can also swap in any other dataset, as long as you change the schema so that the data ingests properly into your table.

In [3]:
# The dataframe contains many columns that are entirely NaN,
# so we keep only the columns we need.

required_columns = [
    'show_id', 
    'type', 
    'title', 
    'director', 
    'cast', 
    'country', 
    'date_added', 
    'release_year', 
    'rating', 
    'duration', 
    'listed_in', 
    'description'
]

df = pd.read_csv("netflix_titles.csv", encoding='latin1')[required_columns]
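Missing values in the CSV are loaded by pandas as `NaN`. If you would rather store them as SQL `NULL`, one option is to convert them to `None` before inserting. The snippet below is a minimal sketch on a made-up three-column sample (`sample_df` and `cleaned_df` are illustrative names, not part of the notebook); the same call could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for a few rows of the Netflix dataframe.
sample_df = pd.DataFrame(
    {
        "show_id": ["s1", "s2"],
        "director": ["Kirsten Johnson", np.nan],
        "release_year": [2020, 2021],
    }
)

# Cast to object first so that None is kept as None instead of being
# coerced back to NaN, then replace every missing cell with None.
cleaned_df = sample_df.astype(object).where(sample_df.notna(), None)

print(cleaned_df.to_dict(orient="records"))
```

Note that any later code that assumes every cell is a string (for example, text wrapping) would then need to handle `None`.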

In [4]:
# Provide the table name here and then define the table schema using sqlalchemy

table_name = "latest_netflix_movies"
metadata_obj = MetaData()

movies_table = Table(
    table_name,
    metadata_obj,
    Column("show_id", String, primary_key=True),
    Column("type", String),
    Column("title", String),
    Column("director", String),
    Column("cast", String),
    Column("country", String),
    Column("date_added", String),
    Column("release_year", Integer),
    Column("rating", String),
    Column("duration", String),
    Column("listed_in", String),
    Column("description", String)
)

Once the table schema is defined, we initialise the engine, which sets up the connection to our database. The code stays the same for almost every kind of database; only the connection URL changes. For example, here is the code for connecting to [DuckDB](https://duckdb.org/) (this assumes the `duckdb-engine` dialect package is installed).

```python
engine = create_engine("duckdb:///:memory:")
```

In [5]:
engine = create_engine(f"postgresql://{username}:{password}@{host}:{port}/{database}")
metadata_obj.create_all(engine)
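Building the connection string with an f-string works as long as the credentials contain no special characters such as `@`, `:` or `/`. As a hedged alternative, SQLAlchemy's `URL.create` can assemble the URL and escape those characters for you. The credentials below are placeholders chosen for illustration only.

```python
from sqlalchemy.engine import URL, make_url

# Placeholder credentials (assumptions for this sketch, not real values).
example_url = URL.create(
    drivername="postgresql",
    username="demo_user",
    password="p@ss:w/rd",   # contains characters that need escaping
    host="localhost",
    port=5432,
    database="demo_db",
)

# Render the full URL, including the escaped password.
rendered = example_url.render_as_string(hide_password=False)
print(rendered)

# Parsing the rendered string gives back the original password.
round_tripped = make_url(rendered)
print(round_tripped.password)
```

The resulting `URL` object can be passed directly to `create_engine` in place of the string.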

Finally, we iterate over each row of our dataframe and insert it as a new entry in our table.

In [6]:
from tqdm.auto import tqdm 


for index, row in tqdm(df.iterrows(), total=len(df)):
    stmt = insert(movies_table).values(
        show_id=row['show_id'],
        type=row['type'],
        title=row['title'],
        director=row['director'],
        cast=row['cast'],
        country=row['country'],
        date_added=row['date_added'],
        release_year=row['release_year'],
        rating=row['rating'],
        duration=row['duration'],
        listed_in=row['listed_in'],
        description=row['description']
    )
    with engine.begin() as connection:
        connection.execute(stmt)

100%|██████████| 8809/8809 [00:04<00:00, 2161.56it/s]
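The loop above opens one transaction per row, which is easy to follow but does more round trips than necessary. For larger datasets, a possible alternative is to pass a list of dictionaries to a single `execute` call so that all rows are inserted in one transaction. The sketch below uses an in-memory SQLite database and a small made-up table (`demo_movies`, `demo_df`) so that it runs without a Postgres server; the same pattern should apply to `movies_table` and `df`.

```python
import pandas as pd
from sqlalchemy import (
    Column, Integer, MetaData, String, Table,
    create_engine, func, insert, select,
)

# In-memory SQLite engine used only for this illustration.
demo_engine = create_engine("sqlite:///:memory:")
demo_metadata = MetaData()
demo_table = Table(
    "demo_movies",
    demo_metadata,
    Column("show_id", String, primary_key=True),
    Column("title", String),
    Column("release_year", Integer),
)
demo_metadata.create_all(demo_engine)

# Hypothetical rows standing in for the Netflix dataframe.
demo_df = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3"],
        "title": ["First", "Second", "Third"],
        "release_year": [2019, 2020, 2021],
    }
)

# One dict per row; keys must match the table's column names.
records = demo_df.to_dict(orient="records")

# A single transaction and a single execute call for all rows.
with demo_engine.begin() as connection:
    connection.execute(insert(demo_table), records)

# Count the rows to confirm the insert.
with demo_engine.connect() as connection:
    row_count = connection.execute(
        select(func.count()).select_from(demo_table)
    ).scalar_one()

print(row_count)
```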


In [None]:
# Print the first ten rows for a sanity check
import textwrap
from tabulate import tabulate

# A helper function that wraps long text cells (title, cast, description) across multiple lines.

def show(data, headers):
    wrapped_data = []
    for row in data:
        wrapped_row = []
        for col_idx, cell in enumerate(row):
            if headers[col_idx] in ['title', 'cast', 'description']:
                wrapped_row.append("\n".join(textwrap.wrap(cell, width=30)))  # Adjust width as needed
            else:
                wrapped_row.append(cell)
        wrapped_data.append(wrapped_row)

    # Print as a formatted table with custom wrapping
    print(tabulate(wrapped_data, headers=headers, tablefmt='plain'))


with engine.connect() as connection:
    result = connection.exec_driver_sql(f"SELECT * FROM {table_name} LIMIT 10")
    data = result.fetchall()
    show(data, headers=required_columns)

show_id    type     title                          director                        cast                            country                                                                date_added            release_year  rating    duration    listed_in                                                      description
s1         Movie    Dick Johnson Is Dead           Kirsten Johnson                 NaN                             United States                                                          September 25, 2021            2020  PG-13     90 min      Documentaries                                                  As her father nears the end of
                                                                                                                                                                                                                                                                                                                   his life, filmmaker Kirsten
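The sanity check above sends a raw SQL string to the driver. An alternative, sketched here under the assumption that the `Table` object is still in scope, is to build the query with `select(...)` and `limit(...)` so that no table name is interpolated into a string. The example uses an in-memory SQLite database and a made-up two-column table (`sketch_movies`) so that it is self-contained.

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine, insert, select

# In-memory SQLite engine and a small illustrative table.
sketch_engine = create_engine("sqlite:///:memory:")
sketch_metadata = MetaData()
sketch_table = Table(
    "sketch_movies",
    sketch_metadata,
    Column("show_id", String, primary_key=True),
    Column("title", String),
)
sketch_metadata.create_all(sketch_engine)

# Insert five hypothetical rows.
with sketch_engine.begin() as connection:
    connection.execute(
        insert(sketch_table),
        [{"show_id": f"s{i}", "title": f"Title {i}"} for i in range(1, 6)],
    )

# Build the query from the Table object instead of a raw SQL string.
stmt = select(sketch_table).order_by(sketch_table.c.show_id).limit(3)

with sketch_engine.connect() as connection:
    # .mappings() yields dict-like rows keyed by column name.
    first_rows = [dict(row) for row in connection.execute(stmt).mappings()]

print(first_rows)
```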
        

Congratulations on completing the ingestion pipeline! You are now ready to connect this database to our LLM using llama-index. You can check out our [cookbook tutorial](https://docs.premai.io/cookbook/chat-with-sql) for that, or directly [run the code](/chat-with-sql/main.py).