# Kaskada with CSV

**Goal:** Demonstrate Kaskada's new support for operating directly on CSV data. This work is scheduled to be demoed on 2/28/2023.

## Generating CSV Data

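The cells below generate a basic synthetic dataset: a fixed number of records per member ID, each with a random timestamp, name, amount, and string value.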

In [None]:
import random
import pandas

def generate_dataset(num_members, num_records_per_member):
    """Build a DataFrame with num_records_per_member random records for each member ID."""
    column_1_name = 'amount'
    column_2_name = 'random_col'
    records = []
    for member_id in range(num_members):
        for _ in range(num_records_per_member):
            records.append({
                'id': member_id,
                # Random integer; pandas.to_datetime below reads it as nanoseconds since the epoch
                'time': random.randint(1000000000000000, 9000000000000000),
                'name': f"my-cool-name-{random.randint(-100, 100)}",
                column_1_name: random.randint(-100, 100),
                column_2_name: f"some-value-{random.randint(0, 100)}"
            })

    df = pandas.DataFrame(records)
    # Convert the integer column to datetime64 (the default unit is nanoseconds)
    df['time'] = pandas.to_datetime(df['time'])
    return df

In [None]:
dataset1 = generate_dataset(100, 5)
dataset1.to_csv('dataset1.csv')
dataset1
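
A note on the `time` column: `pandas.to_datetime` interprets plain integers as nanoseconds since the Unix epoch by default, so the random integers drawn in `generate_dataset` all land in early 1970. The sketch below uses only the standard library to show the range those bounds map to. The helper name `ns_to_utc` is hypothetical and is not part of the notebook or of Kaskada.

```python
from datetime import datetime, timezone

# Bounds passed to random.randint for the 'time' field in generate_dataset.
LOW, HIGH = 1_000_000_000_000_000, 9_000_000_000_000_000

def ns_to_utc(ns):
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime.

    Hypothetical helper for illustration only.
    """
    return datetime.fromtimestamp(ns / 1_000_000_000, tz=timezone.utc)

print(ns_to_utc(LOW))   # 1970-01-12 13:46:40+00:00
print(ns_to_utc(HIGH))  # 1970-04-15 04:00:00+00:00
```

If later timestamps are wanted, one option is to draw larger integers or to pass an explicit `unit` to `pandas.to_datetime`.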

## Launch Kaskada

Kaskada needs no additional configuration to use CSV. Create a table (or use an existing one) and load the CSV data into it. Previously, the Python client library used PyArrow to convert CSV to Parquet before ingestion; that conversion step is no longer required.

In [None]:
from kaskada.api.session import LocalBuilder
session = LocalBuilder().download(False).build()

In [None]:
import kaskada.table
# Create a table named 'transactions' with 'time' as the time column and 'name' as the entity key column
kaskada.table.create_table('transactions', 'time', 'name')
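
Before loading a file, it can help to confirm that its header actually contains the columns the table was created with (`time` and `name` here). The following is a minimal sketch using only the standard library. The helper `missing_columns` is hypothetical, and the sample file is a stand-in whose header mirrors what `DataFrame.to_csv` writes by default (a leading unnamed index column); the row values are illustrative.

```python
import csv
import os
import tempfile

def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; not part of the Kaskada client.
    """
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    return [col for col in required if col not in header]

# Stand-in file: header layout as written by DataFrame.to_csv with default settings.
sample = (
    ",id,time,name,amount,random_col\n"
    "0,0,1970-02-01 00:00:00,my-cool-name-1,5,some-value-3\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as f:
        f.write(sample)
    ok = missing_columns(path, ["time", "name"])
    bad = missing_columns(path, ["time", "entity"])

print(ok)   # []
print(bad)  # ['entity']
```

An empty list means every required column is present in the header.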

In [None]:
# Load the CSV data into the table
kaskada.table.load('transactions', 'dataset1.csv')

In [None]:
# Get the table; its version should be incremented and its schema populated
kaskada.table.get_table('transactions')

## Run a query with Fenlmagic

In [None]:
%load_ext fenlmagic

In [None]:
%%fenl
transactions

## More Data

Load additional CSV files to add more data to the same table.

In [None]:
dataset2 = generate_dataset(100, 50)  # 100 members x 50 records = 5,000 rows
dataset2.to_csv('dataset2.csv')

In [None]:
kaskada.table.load('transactions', 'dataset2.csv')
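
As a sanity check on the queries that follow, the expected row count can be worked out from the generator arguments: 100 x 5 rows in the first file plus 100 x 50 in the second. This assumes each load appends to the table rather than replacing it. The sketch below also shows one way to count the data rows in a CSV file with the standard library; `count_data_rows` is a hypothetical helper and the in-memory file is a stand-in.

```python
import csv
import io

def count_data_rows(csv_file):
    """Count the rows after the header in an open CSV file object.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# Rows expected across both generated files, assuming loads are additive.
expected_rows = 100 * 5 + 100 * 50

# Stand-in for an on-disk file such as dataset1.csv.
demo = io.StringIO("id,time\n0,a\n1,b\n2,c\n")

print(count_data_rows(demo))  # 3
print(expected_rows)          # 5500
```

Against the real files, the same function could be called on `open('dataset1.csv', newline='')` and `open('dataset2.csv', newline='')`.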

In [None]:
%%fenl
transactions

In [None]:
%%fenl --output=csv
transactions