
* Arrow automatically infers the most appropriate data type when reading in data or converting Python objects to Arrow objects.
* Data types can also be specified explicitly, which helps ensure interoperability with databases and data warehouse systems.
* This note illustrates:
 * Setting the data type of an Arrow Array
 * Setting the schema of a Table
 * Merging multiple schemas

## Setting the data type of an Arrow Array
* Change an existing array to a different data type with the cast() method

In [1]:
import pyarrow as pa

arr = pa.array([1, 2, 3, 4, 5])
print(arr.type)

int64


In [2]:
# Change to int8
arr = arr.cast(pa.int8())
print(arr.type)

# Change to int16
arr = arr.cast(pa.int16())
print(arr.type)

int8
int16


### Create an array of a specific data type by passing the type at array creation



In [4]:
arr = pa.array([1, 2, 3, 4, 5], type=pa.int32())
print(arr.type)

int32


## Setting the schema of a Table
* Tables contain multiple columns, each with its own name and type.
* A schema is the combination of those column names and types.

In [5]:
test_schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64())
])

In [6]:
test_schema

col1: int8
col2: string
col3: double

### Pass test_schema along with the data when creating the Arrow table, as shown below

In [7]:
table = pa.table([
    [1, 2, 3, 4, 5],
    ["a", "b", "c", "d", "e"],
    [1.0, 2.0, 3.0, 4.0, 5.0]
], schema=test_schema)

In [8]:
table

pyarrow.Table
col1: int8
col2: string
col3: double
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]
col3: [[1,2,3,4,5]]

### Like arrays, tables can be cast to a different schema
Illustration below:

In [9]:
schema_int32 = pa.schema([
    ("col1", pa.int32()),
    ("col2", pa.string()),
    ("col3", pa.float64())
])

table = table.cast(schema_int32)

In [10]:
table

pyarrow.Table
col1: int32
col2: string
col3: double
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]
col3: [[1,2,3,4,5]]

## Merging Multiple Schemas
* Data from multiple separate sources can be combined by unifying their schemas.
* Doing so creates a superset schema that applies to all data sources.
* Use unify_schemas() to combine multiple schemas into a single one.

In [11]:
first_schema = pa.schema([
    ("country", pa.string()),
    ("population", pa.int32())
])

second_schema = pa.schema([
    ("country_code", pa.string()),
    ("language", pa.string())
])

In [12]:
print(first_schema)
print(second_schema)

country: string
population: int32
country_code: string
language: string


In [14]:
# Use unify_schemas() to combine/merge the schemas
combine_schema = pa.unify_schemas([first_schema, second_schema])
print(combine_schema)

country: string
population: int32
country_code: string
language: string


### Schemas with overlapping columns can still be combined as long as the colliding columns (here, country_code) have the same type

In [15]:
third_schema = pa.schema([
    ("country_code", pa.string()),
    ("lat", pa.float32()),
    ("long", pa.float32()),
])

combined_schema = pa.unify_schemas([first_schema, second_schema, third_schema])

print(combined_schema)

country: string
population: int32
country_code: string
language: string
lat: float
long: float


* If a shared field has different types in the schemas being combined, the merge fails.
* For example, if country_code were numeric instead of a string, the schemas could not be unified, because second_schema already declares it as pa.string().

In [16]:
third_schema = pa.schema([
    ("country_code", pa.int32()),
    ("lat", pa.float32()),
    ("long", pa.float32()),
])

combined_schema = pa.unify_schemas([first_schema, second_schema, third_schema])


ArrowInvalid: ignored

In [18]:
# To print only the error message, catch the exception:
try:
    combined_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
except pa.ArrowInvalid as e:
    print(e)

Unable to merge: Field country_code has incompatible types: string vs int32
