# Lector usage

This notebook shows usage of Lector's main functionality, reading (parsing) CSV files and inferring correct column data types. 

For the motivation behind yet another CSV reader, see the section "But why another CSV reader?" further below in this notebook.

# Setup

In [1]:
import io
import importlib
from pathlib import Path

import gdown
import humanize
import lector
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv

from lector.log import LOG, pformat, schema_view

%xmode Minimal

Exception reporting mode: Minimal


In [26]:
# Create some not so clean CSV example files to experiment with

example_1 = """
Some preamble content here
This is still "part of the metadata preamble"
id;genre;metric;count;content;website;tags;vecs;date
1234982348728374;a;0.1;1;; http://www.graphext.com;"[a,b,c]";"[1.3, 1.4, 1.67]";11/10/2022
;b;0.12;;"Natural language text is different from categorical data."; https://www.twitter.com;[d];"[0, 1.9423]";01/10/2022
9007199254740993;a;3.14;3;"The Project · Gutenberg » EBook « of Die Fürstin.";http://www.google.com;"['e', 'f']";["84.234, 12509.99"];13/10/2021

""".encode("ISO-8859-1")

example_2 = """
id;genre;metric;count;content;website;tags;vecs;date
1234982348728374;a;0.1;1;; http://www.graphext.com;"[a,b,c]";"[1.3, 1.4, 1.67]";11/10/2022
;b;0.12;;"Natural language text is different from categorical data."; https://www.twitter.com;[d];"[0, 1.9423]";01/10/2022
18446744073709551615;a;3.14;3;"The Project · Gutenberg » EBook « of Die Fürstin.";http://www.google.com;"['e', 'f']";["84.234, 12509.99"];13/10/2021
""".encode("ISO-8859-1")

with open("example_1.csv", "wb") as f:
    f.write(example_1)

with open("example_2.csv", "wb") as f:
    f.write(example_2)

fpm = "example_md.csv"
fpl = "example_lg.csv"

if not Path("./example_md.csv").exists():
    gdown.download("https://drive.google.com/uc?id=188lrt2psoru55bTog4mLDWd6DC_MUSQd", fpm)
    gdown.download("https://drive.google.com/uc?id=1fEN_Z2SbBesNiWdOf2ceXx2c-SRNHIZR", fpl)


# High-level API

The high-level functional API is the simplest way to use Lector. By default it reads CSVs into a pyarrow Table:

In [5]:
tbl = lector.read_csv("example_1.csv")

The first thing to notice is that this simply worked, unlike the pandas/arrow examples you'll find below. Let's do the same but with a bit more feedback to see what happened:

In [6]:
lector.LOG.setLevel("DEBUG")
tbl = lector.read_csv("example_1.csv", log=True)

18:54:20 INFO | lector | preambles.detect:66
'Fieldless' matches CSV buffer: detected 3 rows to skip.

18:54:20 INFO | lector | abc.analyze:147

  ─────────── CSV Format ────────────
   {
       'encoding': 'ISO-8859-1',
       'preamble': 3,
       'dialect': Dialect(
           delimiter=';',
           quote_char='"'

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "id" with converter
Number(threshold=0.95)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "genre" with converter
Category(threshold=0.0, max_cardinality=None)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "metric" with converter
Number(threshold=0.95)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "count" with converter
Number(threshold=0.95)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "content" with converter
Text(threshold=0.8)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "website" with converter
Url(threshold=0.8)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "tags" with converter
List(threshold=0.95, threshold_urls=0.8)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "vecs" with converter
List(threshold=0.95, threshold_urls=0.8)

18:54:20 DEBUG | lector | cast.cast_array:120
Converted column "date" with converter
Timestamp(threshold=0.95)

18:54:20 INFO | lector | cast.cast_table:83

 Changed types
  ─────────────────────────────────
   Column   Before   After
  ─────────────────────────────────
   id       string   int64
   gen

There is a lot to unpack in this log.

Firstly, Lector has inferred and constructed a CSV `Format`, which contains all the information a CSV parser needs to parse the file (including the encoding, the lines to skip, the separator etc.). Lector has then used this format to instruct `pyarrow.csv.read_csv` to read the file _without_ inferring any column types, simply leaving all columns with the original strings as found in the file (because, as shown below, arrow, like pandas, may otherwise import erroneous data silently and irrecoverably).
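To make the idea of a CSV `Format` more concrete, here is a minimal sketch of how a delimiter and a preamble length could be guessed from raw text. It is not Lector's actual detection code: the `sniff_format` helper, its list of candidate delimiters and its counting heuristic are assumptions made up for illustration, and the sketch ignores quoting entirely.

```python
from collections import Counter


def sniff_format(text, candidates=";,\t|"):
    """Guess delimiter and preamble length (hypothetical helper, not Lector's API)."""
    lines = text.split("\n")
    best = None  # (number of supporting lines, delimiter, preamble length)
    for delim in candidates:
        counts = [line.count(delim) for line in lines]
        nonzero = [c for c in counts if c > 0]
        if not nonzero:
            continue
        # The most frequent per-line count is taken as the "real" field count.
        modal, support = Counter(nonzero).most_common(1)[0]
        if best is None or support > best[0]:
            # Everything before the first line with that count is preamble.
            best = (support, delim, counts.index(modal))
    if best is None:
        return {"delimiter": ",", "preamble": 0}
    return {"delimiter": best[1], "preamble": best[2]}


sample = (
    "\n"
    "Some preamble content here\n"
    'This is still "part of the metadata preamble"\n'
    "id;genre;metric\n"
    "1;a;0.1\n"
    "2;b;0.12\n"
)
print(sniff_format(sample))  # {'delimiter': ';', 'preamble': 3}
```

A real detector has to cope with quoted fields containing delimiters, which is one reason this naive counting approach is only a starting point.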

Lector has then applied its `Autocast` strategy to infer and convert each column to the most appropriate data type. The log shows which converter has been found as most appropriate for each column, and with which configuration the conversion has been applied.

A table summarizes the final type for each column. Note that Lector not only inferred types for _all_ columns (including lists) and did so correctly (including the non-ISO date format), but also automatically identified the smallest possible numeric types. None of this is possible with pandas or arrow.
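The idea of picking the smallest numeric type can be sketched in a few lines of plain Python. This is an illustration only: `smallest_int_type` is a hypothetical helper, and Lector's actual selection rules may differ (for instance in when it prefers signed over unsigned types).

```python
def smallest_int_type(values):
    """Name the narrowest integer type that holds all non-null values (illustrative only)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("need at least one non-null value")
    lo, hi = min(present), max(present)
    for bits in (8, 16, 32, 64):
        # Non-negative data fits an unsigned type of this width ...
        if lo >= 0 and hi <= 2**bits - 1:
            return f"uint{bits}"
        # ... otherwise check the signed range of the same width.
        if lo < 0 and -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise OverflowError("values do not fit into 64 bits")


print(smallest_int_type([1, None, 3]))            # uint8
print(smallest_int_type([-5, 40000]))             # int32
print(smallest_int_type([18446744073709551615]))  # uint64
```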

Inspecting the arrow table's schema, we can see this in more detail:

In [8]:
LOG.info(pformat(schema_view(tbl.schema, title="Schema")))

18:55:26 INFO | lector | 1494666864.<module>:1

 Schema
  ─────────────────────────────────────────────────────────────────────
   Column   Type           Meta
  ─────────────────────────────────────────────────────────────────────
   id       int64          {'semantic': 'number[Int64]'}

 Note e.g. that Lector distinguishes between two types of "stringy" data types: `text` and `category`. The former is stored using arrow's efficient `string` type. In the metadata we indicate via the `semantic` field how the data is to be interpreted. In this case, this is "text", meaning natural language text that can be processed e.g. with NLP models. When a column's strings don't seem to be text-like, and when its cardinality is appropriate, it is inferred as categorical instead (e.g. the genre column). Also note that the website column was inferred as categorical, but that Lector has in fact recognized it as containing URLs, and so indicates this in its semantic type field.

Finally, neither Arrow nor pandas infer list types automatically, nor timestamps (correctly) without knowing the format beforehand.
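As a rough illustration of what inferring a list column involves, the following sketch parses the list-like strings from the example file into Python lists. The `parse_list` helper and its rules (strip brackets, split on commas, drop quotes, prefer floats) are assumptions chosen for this example and are not Lector's implementation.

```python
def parse_list(cell):
    """Parse a bracketed, comma-separated string into a list (illustrative only)."""
    s = cell.strip()
    if not (s.startswith("[") and s.endswith("]")):
        raise ValueError(f"not list-like: {cell!r}")
    inner = s[1:-1].strip()
    if not inner:
        return []
    # Split on commas and drop surrounding whitespace and quote characters.
    items = [item.strip().strip("'\"") for item in inner.split(",")]
    try:
        # Prefer a numeric list when every item parses as a float.
        return [float(item) for item in items]
    except ValueError:
        return items


print(parse_list("[a,b,c]"))               # ['a', 'b', 'c']
print(parse_list("['e', 'f']"))            # ['e', 'f']
print(parse_list("[0, 1.9423]"))           # [0.0, 1.9423]
print(parse_list('["84.234, 12509.99"]'))  # [84.234, 12509.99]
```

Splitting on every comma is a deliberate simplification: items that themselves contain commas would need a proper tokenizer.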

If you want the parsed result as a pandas DataFrame, simply pass the `to_pandas` option:

In [9]:
df = lector.read_csv("example_1.csv", to_pandas=True)
df

Unnamed: 0,id,genre,metric,count,content,website,tags,vecs,date
0,1234982348728374.0,a,0.1,1.0,,http://www.graphext.com,"[a, b, c]","[1.3, 1.4, 1.67]",2022-10-11
1,,b,0.12,,Natural language text is different from catego...,https://www.twitter.com,[d],"[0.0, 1.9423]",2022-10-01
2,9007199254740992.0,a,3.14,3.0,The Project · Gutenberg » EBook « of Die Fürstin.,http://www.google.com,"[e, f]","[84.234, 12509.99]",2021-10-13


Note that this doesn't use pyarrow's default conversion, which is not "correct" inasmuch as it converts all numeric columns containing nulls to float instead of using the appropriate nullable pandas dtype (nor does it let you configure manually _how_ to convert columns containing nulls). Lector, in contrast, maintains the most appropriate dtypes automatically:

In [10]:
df.dtypes

id                  Int64
genre            category
metric            float64
count               UInt8
content            string
website          category
tags               object
vecs               object
date       datetime64[ns]
dtype: object

# Customization

The full signature of the single-function high-level API looks like this:

``` python
def read_csv(
    fp: FileLike,
    encoding: str | EncodingDetector | None = None,
    dialect: dict | DialectDetector | None = None,
    preamble: int | PreambleRegistry | None = None,
    types: dict | Inference = Inference.Auto,
    strategy: CastStrategy | None = None,
    to_pandas: bool = False,
): 
    ...
```

I.e. for each feature of the CSV that can be inferred, and by default is (encoding, dialect, preamble and data types), you can specify either known values or an object that knows how to generate them.

The following, for example, side-steps inference of the encoding and preamble, and specifies the type of the `id` column explicitly:

In [11]:
tbl = lector.read_csv(
    "example_1.csv",
    encoding="ISO-8859-1",
    preamble=3,
    types={"id": "uint64"},
    log=False)

Note that specifying concrete types for individual columns will currently fall back to arrow's default type inference for the remaining columns.

To fully customize how data types are inferred you can use the `strategy` argument. This lets you decide:

- which _columns_ to infer data types for (remaining will be left as `string`)
- which _data types_ Lector is allowed to infer
- exactly _how_ each data type is inferred and how it casts the data

For example:

In [12]:
from lector.types import Autocast, Category, Timestamp, List, Number

strategy = Autocast(
    columns=["id", "metric", "tags"],
    converters=[
        Number(threshold=0.85),
        Timestamp(threshold=0.85),
        # List(threshold=0.95),
    ],
    fallback=Category(max_cardinality=1.0)
)

tbl = lector.read_csv("example_1.csv", strategy=strategy)
tbl

Got no matching converter for string column 'tags'. Will try fallback Category(max_cardinality=1.0).


pyarrow.Table
id: int64
genre: string
metric: double
count: string
content: string
website: string
tags: dictionary<values=string, indices=int32, ordered=0>
vecs: string
date: string
----
id: [[1234982348728374,null,9007199254740993]]
genre: [["a","b","a"]]
metric: [[0.1,0.12,3.14]]
count: [["1",null,"3"]]
content: [[null,"Natural language text is different from categorical data.","The Project · Gutenberg » EBook « of Die Fürstin."]]
website: [[" http://www.graphext.com"," https://www.twitter.com","http://www.google.com"]]
tags: [  -- dictionary:
["[a,b,c]","[d]","['e', 'f']"]  -- indices:
[0,1,2]]
vecs: [["[1.3, 1.4, 1.67]","[0, 1.9423]","["84.234, 12509.99"]"]]
date: [["11/10/2022","01/10/2022","13/10/2021"]]

The above will

- only infer data types for the 3 specified columns ("id", "metric", "tags")
- only infer numeric and timestamp types, leaving the rest as unadulterated string types
- try the specified fallback type (Category) for columns that could not be cast to any other type. Since the tags column could not be cast to number or timestamp, Lector tries the fallback. With Category's default parameters the column would not pass the type's validity tests and would be left uncast, because Category by default specifies a maximum cardinality of 0.1 (10%), while all rows here are unique. Setting `fallback=Category(max_cardinality=1.0)`, as done above, lets the type apply successfully, which is why `tags` ends up dictionary-encoded in the output.
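The cardinality test described in the last point can be sketched as follows. This is an assumption about the general idea (ratio of distinct to non-null values compared against a limit), not Lector's code; `passes_cardinality` is a hypothetical helper, and the 0.1 default is taken from the description above.

```python
def passes_cardinality(values, max_cardinality=0.1):
    """True if the share of distinct non-null values stays within the limit."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    return len(set(non_null)) / len(non_null) <= max_cardinality


tags = ["[a,b,c]", "[d]", "['e', 'f']"]

print(passes_cardinality(tags))                       # False: 3 distinct values in 3 rows
print(passes_cardinality(tags, max_cardinality=1.0))  # True
```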

Different data types may have additional parameters determining how and when they cast data. But all have a `threshold` parameter in common. The threshold indicates the proportion of non-null values in a column that have to be convertible to the given data type for the converter to apply. E.g. the converter `Number(threshold=0.85)` will only cast input data to a numeric data type if at least 85% of the original non-null values can be cast without error.

The default `Autocast` strategy used above uses this behaviour to try all allowed converters in specified order until one passes this validity test (and if none passes will either use the fallback converter or leave the column as `string`).
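The threshold logic lends itself to a compact sketch: try each converter in order, measure the proportion of non-null values it can handle, and apply the first one that reaches the threshold. The function names below, and the choice to turn unconvertible values into `None`, are assumptions for illustration and not Lector's implementation.

```python
def fraction_convertible(values, convert):
    """Proportion of non-null values that `convert` accepts without error."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    ok = 0
    for v in non_null:
        try:
            convert(v)
            ok += 1
        except ValueError:
            pass
    return ok / len(non_null)


def convert_or_none(convert, value):
    """Apply `convert`, mapping nulls and failures to None (an assumption of this sketch)."""
    if value is None:
        return None
    try:
        return convert(value)
    except ValueError:
        return None


def autocast(values, converters, threshold=0.95):
    """Apply the first converter whose success rate reaches the threshold."""
    for name, convert in converters:
        if fraction_convertible(values, convert) >= threshold:
            return name, [convert_or_none(convert, v) for v in values]
    return "string", list(values)  # no converter matched: leave as strings


converters = [("int", int), ("float", float)]
print(autocast(["1", None, "3"], converters))                       # ('int', [1, None, 3])
print(autocast(["0.1", "0.12", "n/a"], converters, threshold=0.6))  # ('float', [0.1, 0.12, None])
print(autocast(["a", "b"], converters))                             # ('string', ['a', 'b'])
```

The order of the converter list matters here: placing `float` before `int` would turn every integer column into floats.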

See the documentation for more details and try the above with different settings.

# But why another CSV reader?

Below we illustrate why pandas (and arrow) CSV reading is less than optimal. Specifically that...

### Pandas isn't very good at reading CSVs

- Doesn't know how without a lot of hand-holding
- Can be wrong, but you won't know it
- Doesn't infer the types pandas itself supports
- ...and yet is also slow
- ...and uses a lot of memory


### Arrow is somewhat better

- Fast
- Memory-efficient
- ...but also doesn't infer many of its supported types
- ...and makes similar errors to pandas

In addition, neither pandas nor arrow allows (real) customization of type inference. You can basically take it or leave it. But you cannot configure e.g. the inference of only certain types, or how a specific type is inferred or not.

Let's see this in action, or "how I wasted a morning trying to read a client's CSV file"...

## Pandas doesn't know how to read CSVs

Let's try simply reading the first example file with pandas:

In [3]:
fp = "example_1.csv"
df = pd.read_csv(fp)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 380: invalid start byte

Hm, seems this isn't utf-8. We'll try common alternatives until we find one that seems to match:

In [4]:
df = pd.read_csv(fp, encoding="ISO-8859-1")

ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 5


Still doesn't work. We have to open the file in a text editor, inspect it manually, and hopefully identify the reason. In this case, we have 3 lines of initial metadata in the file, let's ignore those.

In [5]:
pd.read_csv(fp, encoding="ISO-8859-1", skiprows=3)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,id;genre;metric;count;content;website;tags;vecs;date
"1234982348728374;a;0.1;1;; http://www.graphext.com;""[a",b,"c]"";""[1.3",1.4,"1.67]"";11/10/2022"
";b;0.12;;""Natural language text is different from categorical data.""; https://www.twitter.com;[d];""[0","1.9423]"";01/10/2022",,,
"9007199254740993;a;3.14;3;""The Project · Gutenberg » EBook « of Die Fürstin."";http://www.google.com;""['e'","'f']"";[""84.234","12509.99""];13/10/2021",,


Still looks wrong. Pandas hasn't detected the separator as being ";", so let's specify that manually as well:

In [6]:
df = pd.read_csv(fp, encoding="ISO-8859-1", skiprows=3, sep=";")
df

Unnamed: 0,id,genre,metric,count,content,website,tags,vecs,date
0,1234982000000000.0,a,0.1,1.0,,http://www.graphext.com,"[a,b,c]","[1.3, 1.4, 1.67]",11/10/2022
1,,b,0.12,,Natural language text is different from catego...,https://www.twitter.com,[d],"[0, 1.9423]",01/10/2022
2,9007199000000000.0,a,3.14,3.0,The Project · Gutenberg » EBook « of Die Fürstin.,http://www.google.com,"['e', 'f']","[""84.234, 12509.99""]",13/10/2021


This sort of "worked", but the imported data is still wrong:

- The id column has floats instead of ints (with wrong values)
- The URLs are not clean, they have initial spaces
- Pandas doesn't know about lists, of course, so these remain plain strings
- Dates haven't been automatically inferred

It's easier to see these limitations without pandas DataFrame formatting:

In [7]:
for col in ("id", "website", "tags", "vecs", "date"):
    print(f"{col}: {df[col].tolist()}")

id: [1234982348728374.0, nan, 9007199254740992.0]
website: [' http://www.graphext.com', ' https://www.twitter.com', 'http://www.google.com']
tags: ['[a,b,c]', '[d]', "['e', 'f']"]
vecs: ['[1.3, 1.4, 1.67]', '[0, 1.9423]', '["84.234, 12509.99"]']
date: ['11/10/2022', '01/10/2022', '13/10/2021']
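The wrong `id` value is a direct consequence of 64-bit floats carrying only 53 bits of integer precision. The following plain-Python check, independent of pandas, reproduces the effect for the two large ids used in the example files:

```python
original = 9007199254740993  # 2**53 + 1, the third id in example_1

as_float = float(original)   # what a float64 column has to store instead
print(int(as_float))              # 9007199254740992
print(int(as_float) == original)  # False

# The same happens at the upper end of the uint64 range (third id in example_2)
print(int(float(18446744073709551615)))  # 18446744073709551616
```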


Let's try helping pandas along by giving it the exact data types we want:

In [8]:
dtypes = {
    "id": "UInt64",
    "genre": "category",
    "metric": "float",
    "count": "UInt8", 
    "content": "string",
    "website": "category",
    "tags": "object",
    "vecs": "object"
}

df = pd.read_csv(
    fp,
    encoding="ISO-8859-1",
    skiprows=3,
    sep=";",
    dtype=dtypes,
    parse_dates=["date"],
    infer_datetime_format=True
)

display(df)
df.dtypes

  df = pd.read_csv(


Unnamed: 0,id,genre,metric,count,content,website,tags,vecs,date
0,1234982348728374.0,a,0.1,1.0,,http://www.graphext.com,"[a,b,c]","[1.3, 1.4, 1.67]",2022-11-10
1,,b,0.12,,Natural language text is different from catego...,https://www.twitter.com,[d],"[0, 1.9423]",2022-01-10
2,9007199254740992.0,a,3.14,3.0,The Project · Gutenberg » EBook « of Die Fürstin.,http://www.google.com,"['e', 'f']","[""84.234, 12509.99""]",2021-10-13


id                 UInt64
genre            category
metric            float64
count               UInt8
content            string
website          category
tags               object
vecs               object
date       datetime64[ns]
dtype: object

Ok, where to begin. Even though we specified the correct data types:

- The `id` column was indeed imported as integer, but apparently not without passing through a float type internally, which doesn't have enough precision to represent large integers, and so the value is plain wrong (the original, correct value is `9007199254740993`). If this `id` is the identifier of a row in a database, e.g., we just got majorly screwed. What's more, we've been screwed silently. If you use this table to join with another on the `id` column, for example, your whole analysis may be wrong without you realizing it until much later (if ever).
- The date column was converted to a date dtype, but unfortunately with wrong values. Note that the original strings in the CSV are '11/10/2022', '01/10/2022' and '13/10/2021', and thus the only consistent date format here is `day/month/year` (the only sane format of course). Yet pandas has used mixed formats and wrongly inferred most of them. It even warns us about inconsistent formats, but isn't clever enough to infer the correct format itself.
- The `website` URLs aren't clean still, but that was to be expected
- The list-like columns are also still and unsurprisingly `object` dtypes containing missing values and strings instead of lists
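The reasoning about the date format can be made explicit: a format is only acceptable if _every_ value in the column parses with it. The sketch below tests a small, hand-picked set of candidate formats using the standard library; the `consistent_formats` helper and its candidate list are assumptions for illustration, not how pandas or Lector work internally.

```python
from datetime import datetime

CANDIDATES = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")


def consistent_formats(values, candidates=CANDIDATES):
    """Return the candidate formats that parse every value in the column."""
    result = []
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not match this format
        result.append(fmt)
    return result


dates = ["11/10/2022", "01/10/2022", "13/10/2021"]
print(consistent_formats(dates))      # ['%d/%m/%Y']
print(consistent_formats(dates[:2]))  # ['%d/%m/%Y', '%m/%d/%Y']
```

The second call shows why looking at the whole column matters: the first two values alone are ambiguous, and only the third one rules out `month/day/year`.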

## Pandas is slow and memory-hungry

Let's check pandas on a medium-sized CSV file with 1,468,825 rows and 5 columns:

In [9]:
df = pd.read_csv(fpm, sep=";")
df.dtypes

departamento      object
provincia         object
distrito          object
fec_vacunacion    object
cantidad           int64
dtype: object

Pandas has only recognized the correct dtype for a single column, namely the int dtype for the `cantidad` column. Arrow behaves the same way, as does lector if you tell it to use Arrow's native type inference. Let's compare their performance:

In [10]:
%timeit pd.read_csv(fpm, sep=";")
%timeit pa.csv.read_csv(fpm, parse_options=pa.csv.ParseOptions(delimiter=";"))
%timeit lector.read_csv(fpm, types=lector.Inference.Native)
%timeit lector.read_csv(fpm)

412 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
33 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
59.5 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
345 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that

- pandas is at least 10x slower than arrow reading this file
- letting lector infer the CSV format adds an overhead of about 30 ms.

Crucially, letting lector infer the CSV format and _all_ column types (including dates, downcasting ints etc.), is still faster than pandas while producing a more useful result:

In [12]:
lector.read_csv(fpm, to_pandas=True).dtypes

departamento            category
provincia               category
distrito                category
fec_vacunacion    datetime64[ns]
cantidad                  uint16
dtype: object

Let's try an even bigger file, this time 707MB on disk, with 900,013 rows and 69 columns:

In [14]:
%timeit -n 1 -r 1 pd.read_csv(fpl, low_memory=False)
%timeit pa.csv.read_csv(fpl, parse_options=pa.csv.ParseOptions(invalid_row_handler=lambda r: "skip"))
%timeit lector.read_csv(fpl, types=lector.Inference.Native)
%timeit -n 1 -r 2 lector.read_csv(fpl)

10.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
466 ms ± 26.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
607 ms ± 17.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.01 s ± 15.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


Again, in terms of pure performance, pandas is 20x slower here than arrow, while lector with arrow's native type inference adds only a small overhead. But this difference is even more impressive if we look at what each reader actually produces.

Pandas, even though being rather slow, hasn't done anything at all really:

In [15]:
df = pd.read_csv(fpl, low_memory=False)
print(df.dtypes.value_counts())
humanize.naturalsize(df.memory_usage(index=True, deep=True).sum())

object     53
float64    16
dtype: int64


'3.0 GB'

Pandas has produced 53 object columns, which could contain anything at all really, and 16 float columns. Also note that the DataFrame occupies 3GB of memory (for a CSV of 707MB on disk)!

Arrow, in contrast, while being 20x faster, recognizes at least 6 columns as containing dates (though we were admittedly lucky here because these are stored in the only format arrow understands). Also, its table only uses 780MB of memory, less than a third of pandas, which means we can work with datasets at least 3x bigger on the same machine if we stick to arrow:

In [16]:
tbl = pa.csv.read_csv(fpl, parse_options=pa.csv.ParseOptions(invalid_row_handler=lambda r: "skip"))
print(tbl.to_pandas().dtypes.value_counts())
humanize.naturalsize(tbl.get_total_buffer_size())

object                 47
float64                16
datetime64[ns, UTC]     6
dtype: int64


'780.5 MB'

Finally, using lector with _smart_ type inference, we're still 2x faster than pandas, while producing a better, more useful, result than both pandas and arrow (we also occupy even less memory by inferring better types):

In [17]:
from lector.utils import as_pd

tbl = lector.read_csv(fpl)
df = as_pd(tbl)
print(df.dtypes.astype("string").value_counts())
humanize.naturalsize(tbl.get_total_buffer_size())

category               45
UInt32                  8
datetime64[ns, UTC]     6
float64                 3
string                  3
Int32                   1
UInt16                  1
UInt8                   1
object                  1
dtype: Int64


'316.0 MB'

Finally, although arrow with default type inference (or equivalently, lector with arrow's native type inference) may often seem to be enough, it in fact commits the same int vs. float conversion error as pandas, and doesn't infer all of its own data types. Using the CSV file `example_2` defined at the top of this notebook, e.g.:

In [29]:
tbl = pa.csv.read_csv(
    "example_1.csv",
    read_options=pa.csv.ReadOptions(encoding="ISO-8859-1", skip_rows=3),
    parse_options=pa.csv.ParseOptions(delimiter=";"),
    convert_options=pa.csv.ConvertOptions(strings_can_be_null=True)
)

In [28]:
print(tbl, "\n")
print(tbl.column("id").to_pylist())
print(tbl.column("tags").to_pylist())
print(tbl.column("vecs").to_pylist())

int(tbl.column("id")[2].as_py()) == 18446744073709551615

pyarrow.Table
id: int64
genre: string
metric: double
count: int64
content: string
website: string
tags: string
vecs: string
date: string
----
id: [[1234982348728374,null,9007199254740993]]
genre: [["a","b","a"]]
metric: [[0.1,0.12,3.14]]
count: [[1,null,3]]
content: [[null,"Natural language text is different from categorical data.","The Project · Gutenberg » EBook « of Die Fürstin."]]
website: [[" http://www.graphext.com"," https://www.twitter.com","http://www.google.com"]]
tags: [["[a,b,c]","[d]","['e', 'f']"]]
vecs: [["[1.3, 1.4, 1.67]","[0, 1.9423]","["84.234, 12509.99"]"]]
date: [["11/10/2022","01/10/2022","13/10/2021"]] 

[1234982348728374, None, 9007199254740993]
['[a,b,c]', '[d]', "['e', 'f']"]
['[1.3, 1.4, 1.67]', '[0, 1.9423]', '["84.234, 12509.99"]']


False

We see that:

- the `id` column has been parsed into wrong values. Its third value should be `18446744073709551615`, but the equality check above returns `False`. Again, this silent error could create major headaches if used in joins, e.g., which is to say, this is potentially much worse than a simple "rounding" type of error
- lists are not recognized (though arrow supports them)
- non ISO-formatted dates are also not recognized

Lector, on the other hand, imports this data correctly and conveniently, and without a lot of hand-holding:

In [37]:
df = lector.read_csv("example_2.csv", to_pandas=True)

display(df)
print(df.dtypes)
print("\nType of items in tags column:", type(df.tags.iloc[0]))

Unnamed: 0,id,genre,metric,count,content,website,tags,vecs,date
0,1234982348728374,a,0.1,1.0,,http://www.graphext.com,"[a, b, c]","[1.3, 1.4, 1.67]",2022-10-11
1,<NA>,b,0.12,,Natural language text is different from catego...,https://www.twitter.com,[d],"[0.0, 1.9423]",2022-10-01
2,18446744073709551615,a,3.14,3.0,The Project · Gutenberg » EBook « of Die Fürstin.,http://www.google.com,"[e, f]","[84.234, 12509.99]",2021-10-13


id                 UInt64
genre            category
metric            float64
count               UInt8
content            string
website          category
tags               object
vecs               object
date       datetime64[ns]
dtype: object

Type of items in tags column: <class 'numpy.ndarray'>


# Summary

Lector tries to offer a CSV reader that just works. Specifically, one that

1. does not require human inspection and inference of CSV parsing parameters
2. does not introduce value errors (no erroneous int to float coercion e.g.)
3. automatically infers a wide range of data types (more than pandas and arrow)

In the __best case__, lector is almost as fast as arrow, but also infers CSV formats (no more guessing of parser parameters).

In the __worst case__, lector is still faster than pandas, while

  - parsing data _correctly_
  - inferring "all" the types
  - using less memory (when keeping result in arrow)
  - being configurable
  - being hackable
  
By "hackable" we mean that lector is a small library with easily extensible or even replaceable parts. Its behaviour can easily be changed to fit anyone's needs, something which cannot be said about pandas (messy codebase) or arrow (C++).

# Contribute

- Repo: https://github.com/graphext/lector
- Docs: https://lector.readthedocs.io/en/latest/