# Lab 2 - Automatic Creation of `sqlalchemy` data type `dict`

So far, we have been manually constructing the `sqlalchemy` type `dict`, but this approach quickly becomes unwieldy.  Let's look at using the `pandas` types to programmatically construct the type `dict`.

In [62]:
import pandas as pd
from dfply import *

In [63]:
!rm databases/baseball.db

rm: databases/baseball.db: No such file or directory


## Case Study - People

Let's use the `People.csv` file from the [Lahman’s Baseball Database](http://www.seanlahman.com/baseball-archive/statistics/) as our motivating example, since 

1. It has examples of a number of types
2. It has lots of columns and would be annoying to manually construct the type `dict`.

In [64]:
people = pd.read_csv('~/Desktop/baseball/core/People.csv')
people.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


In [65]:
people.shape

(19370, 24)

## <font color="red"> Exercise 1 </font>

Inspect the types of the `people` table and make note of any necessary changes.

In [66]:
people.dtypes

playerID         object
birthYear       float64
birthMonth      float64
birthDay        float64
birthCountry     object
birthState       object
birthCity        object
deathYear       float64
deathMonth      float64
deathDay        float64
deathCountry     object
deathState       object
deathCity        object
nameFirst        object
nameLast         object
nameGiven        object
weight          float64
height          float64
bats             object
throws           object
debut            object
finalGame        object
retroID          object
bbrefID          object
dtype: object

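`playerID` should be the primary key. Every column is currently either `object` or `float64`. The birth and death year/month/day columns are `float64` (because of missing values) but hold whole numbers, so they should be integers or be combined into new date columns. `debut` and `finalGame` should be in datetime format.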

## Missing `Int64` columns

As mentioned in [Lecture 1/4](./pbpython/notebooks/1_4_more_on_pandas_data_types_key.ipynb), we need to use the most recent version of `pandas` (still in development as of this writing) to allow us to have integer columns with missing values.

In [67]:
assert pd.__version__.startswith('0.24'), "Please uncomment and run the pip command to upgrade pandas"
#!pip install --upgrade --pre pandas

## Correcting the `pandas` types

1. We pass `parse_dates` a list of date columns
2. We pass `dtype` a `dict` of types for the birth and death columns
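As a small illustration of these two arguments, here is a minimal sketch that uses a made-up two-row CSV held in memory rather than `People.csv`. The names `csv_text` and `demo` exist only for this example, and it assumes a `pandas` version that provides `pd.Int64Dtype`.

```python
import io
import pandas as pd

# A tiny made-up CSV (not the real People.csv); the first row has no deathYear.
csv_text = """playerID,birthYear,deathYear,debut
aardsda01,1981,,2004-04-06
aaronto01,1939,1984,1962-04-10
"""

demo = pd.read_csv(
    io.StringIO(csv_text),
    # Nullable integer type for the year columns
    dtype={'birthYear': pd.Int64Dtype(), 'deathYear': pd.Int64Dtype()},
    # Parse this column as datetimes instead of leaving it as strings
    parse_dates=['debut'],
)

# deathYear stays an integer column even though one value is missing.
print(demo.dtypes)
```

Without the `dtype` argument, the missing value would force `deathYear` to `float64`, which is what we saw in the first read of `People.csv`.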

#### Constructing the `dtype` `dict`

In [68]:
date_cols = ['debut', 'finalGame']

In [69]:
birth_death_date_cols = [prefix + time for prefix in ('birth', 'death') for time in ('Year', 'Month', 'Day')]

In [70]:
people_dtypes = {col:pd.Int64Dtype() for col in people.columns if col in birth_death_date_cols}
people_dtypes

{'birthYear': Int64Dtype(),
 'birthMonth': Int64Dtype(),
 'birthDay': Int64Dtype(),
 'deathYear': Int64Dtype(),
 'deathMonth': Int64Dtype(),
 'deathDay': Int64Dtype()}

## Rereading the csv with the correct types

In [71]:
people = pd.read_csv('~/Desktop/baseball/core/People.csv', dtype=people_dtypes, parse_dates=date_cols)
people.dtypes

playerID                object
birthYear                Int64
birthMonth               Int64
birthDay                 Int64
birthCountry            object
birthState              object
birthCity               object
deathYear                Int64
deathMonth               Int64
deathDay                 Int64
deathCountry            object
deathState              object
deathCity               object
nameFirst               object
nameLast                object
nameGiven               object
weight                 float64
height                 float64
bats                    object
throws                  object
debut           datetime64[ns]
finalGame       datetime64[ns]
retroID                 object
bbrefID                 object
dtype: object

In [72]:
people.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934,2,5,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954,9,8,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972,8,25,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


## <font color="red"> Exercise 2 </font>
**Goal:** Find a method/attribute of each `dtype`, which preferably returns something immutable like a `str`, that we can use to identify the general type.
**Tasks:**

1. Pull off an example `dtype`
2. Use `dir` to inspect the available methods
3. Test the methods/attributes to find a good candidate.

In [73]:
[m for m in dir(people.dtypes) if not m.startswith("_")]

['T',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'array',
 'as_matrix',
 'asfreq',
 'asof',
 'astype',
 'at',
 'at_time',
 'autocorr',
 'axes',
 'base',
 'bats',
 'bbrefID',
 'between',
 'between_time',
 'bfill',
 'birthCity',
 'birthCountry',
 'birthDay',
 'birthMonth',
 'birthState',
 'birthYear',
 'bool',
 'clip',
 'clip_lower',
 'clip_upper',
 'combine',
 'combine_first',
 'compound',
 'compress',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'data',
 'deathCity',
 'deathCountry',
 'deathDay',
 'deathMonth',
 'deathState',
 'deathYear',
 'debut',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'finalGame',
 'first',
 'first_valid_index',
 'flags'

In [74]:
[t.kind for t in people.dtypes]

['O',
 'i',
 'i',
 'i',
 'O',
 'O',
 'O',
 'i',
 'i',
 'i',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'f',
 'f',
 'O',
 'O',
 'M',
 'M',
 'O',
 'O']
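To see why `kind` is a convenient key, the sketch below looks at a few dtypes in isolation. The dtypes in `examples` are chosen here for illustration and are not pulled from `people`. Note that the nullable `Int64` type reports the same kind as a plain `int64`, so one mapping entry can cover both.

```python
import numpy as np
import pandas as pd

# A handful of illustrative dtypes, including the nullable integer type.
examples = {
    'float64': np.dtype('float64'),
    'int64': np.dtype('int64'),
    'object': np.dtype('object'),
    'datetime64[ns]': np.dtype('datetime64[ns]'),
    'Int64 (nullable)': pd.Int64Dtype(),
}

# kind is a one-character str, so it is hashable and safe to use as a dict key.
kinds = {name: dtype.kind for name, dtype in examples.items()}
print(kinds)
```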

## Creating a type conversion dictionary

1. The keys will be the `dtype.kind` strings
2. The values will be the associated `sqlalchemy` types

In [75]:
from sqlalchemy import Integer, Float, String, DateTime
DTYPES_TO_SQLALCHEMY_TYPES = {'O':String,
                              'i':Integer,
                              'f':Float,
                              'M':DateTime}
DTYPES_TO_SQLALCHEMY_TYPES

{'O': sqlalchemy.sql.sqltypes.String,
 'i': sqlalchemy.sql.sqltypes.Integer,
 'f': sqlalchemy.sql.sqltypes.Float,
 'M': sqlalchemy.sql.sqltypes.DateTime}
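These four kinds are enough for `People.csv`, but other tables may contain kinds that are not in the dictionary (for example `'b'` for `bool` columns), and a plain lookup would then fail with a `KeyError`. One possible safeguard, sketched below, is to list the uncovered columns before converting. The helper `unmapped_columns`, the constant `MAPPED_KINDS`, and the `demo` frame are hypothetical names introduced only for this example.

```python
import pandas as pd

# The kinds covered by the conversion dictionary above (assumed to be its keys).
MAPPED_KINDS = {'O', 'i', 'f', 'M'}

def unmapped_columns(df, kinds=MAPPED_KINDS):
    """Return {column: kind} for every column whose dtype.kind has no mapping."""
    return {col: dtype.kind
            for col, dtype in df.dtypes.items()
            if dtype.kind not in kinds}

# Made-up data with one bool column, which the four-entry mapping does not cover.
demo = pd.DataFrame({'name': ['a', 'b'], 'active': [True, False], 'n': [1, 2]})
print(unmapped_columns(demo))
```

An empty result means every column can be converted; otherwise the returned kinds show which entries would need to be added to the dictionary.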

## Use ALL CAPS for global constants

When dealing with global constants, we should

1. Define them at the top of the file.
2. Use an ALL CAPS name to make them stand out.

## <font color="red"> Exercise 3 </font>

Write a `dict` comprehension that uses our conversion `dict` to convert the `pandas` `dtypes` to `sqlalchemy` types.

In [76]:
from toolz import get
{col_name:get(col_type.kind, DTYPES_TO_SQLALCHEMY_TYPES) for col_name, col_type in zip(people.columns, people.dtypes)}


{'playerID': sqlalchemy.sql.sqltypes.String,
 'birthYear': sqlalchemy.sql.sqltypes.Integer,
 'birthMonth': sqlalchemy.sql.sqltypes.Integer,
 'birthDay': sqlalchemy.sql.sqltypes.Integer,
 'birthCountry': sqlalchemy.sql.sqltypes.String,
 'birthState': sqlalchemy.sql.sqltypes.String,
 'birthCity': sqlalchemy.sql.sqltypes.String,
 'deathYear': sqlalchemy.sql.sqltypes.Integer,
 'deathMonth': sqlalchemy.sql.sqltypes.Integer,
 'deathDay': sqlalchemy.sql.sqltypes.Integer,
 'deathCountry': sqlalchemy.sql.sqltypes.String,
 'deathState': sqlalchemy.sql.sqltypes.String,
 'deathCity': sqlalchemy.sql.sqltypes.String,
 'nameFirst': sqlalchemy.sql.sqltypes.String,
 'nameLast': sqlalchemy.sql.sqltypes.String,
 'nameGiven': sqlalchemy.sql.sqltypes.String,
 'weight': sqlalchemy.sql.sqltypes.Float,
 'height': sqlalchemy.sql.sqltypes.Float,
 'bats': sqlalchemy.sql.sqltypes.String,
 'throws': sqlalchemy.sql.sqltypes.String,
 'debut': sqlalchemy.sql.sqltypes.DateTime,
 'finalGame': sqlalchemy.sql.sqltypes.Da

## <font color="red"> Exercise 4 </font>

Package your expression in a `lambda` and refactor your code by adding helper functions to clean up the expression.

In [77]:
def get_sql_types_dict(df):
    get_sql_type = lambda col_type: get(col_type.kind, DTYPES_TO_SQLALCHEMY_TYPES)
    cols_types = lambda df: zip(df.columns, df.dtypes)
    return {col:get_sql_type(col_type) 
            for col, col_type in cols_types(df)}

In [78]:
get_sql_types_dict(people)

{'playerID': sqlalchemy.sql.sqltypes.String,
 'birthYear': sqlalchemy.sql.sqltypes.Integer,
 'birthMonth': sqlalchemy.sql.sqltypes.Integer,
 'birthDay': sqlalchemy.sql.sqltypes.Integer,
 'birthCountry': sqlalchemy.sql.sqltypes.String,
 'birthState': sqlalchemy.sql.sqltypes.String,
 'birthCity': sqlalchemy.sql.sqltypes.String,
 'deathYear': sqlalchemy.sql.sqltypes.Integer,
 'deathMonth': sqlalchemy.sql.sqltypes.Integer,
 'deathDay': sqlalchemy.sql.sqltypes.Integer,
 'deathCountry': sqlalchemy.sql.sqltypes.String,
 'deathState': sqlalchemy.sql.sqltypes.String,
 'deathCity': sqlalchemy.sql.sqltypes.String,
 'nameFirst': sqlalchemy.sql.sqltypes.String,
 'nameLast': sqlalchemy.sql.sqltypes.String,
 'nameGiven': sqlalchemy.sql.sqltypes.String,
 'weight': sqlalchemy.sql.sqltypes.Float,
 'height': sqlalchemy.sql.sqltypes.Float,
 'bats': sqlalchemy.sql.sqltypes.String,
 'throws': sqlalchemy.sql.sqltypes.String,
 'debut': sqlalchemy.sql.sqltypes.DateTime,
 'finalGame': sqlalchemy.sql.sqltypes.Da

## <font color="red"> Exercise 5 </font>

Add the `People.csv` to your `baseball.db`

In [79]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///baseball.db', echo=False)
people.to_sql('people', 
               con=engine, 
               dtype=get_sql_types_dict(people), 
               index=False,
               if_exists='replace')

In [80]:
from sqlalchemy import inspect
insp = inspect(engine)
insp.get_columns('people')

[{'name': 'playerID',
  'type': VARCHAR(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthYear',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthMonth',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthDay',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthCountry',
  'type': VARCHAR(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthState',
  'type': VARCHAR(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'birthCity',
  'type': VARCHAR(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'deathYear',
  'type': INTEGER(),
  'nullable': True,
  'default': Non

In [81]:
# Refactored helper: write a DataFrame to a table in a SQLite database
from sqlalchemy import create_engine

def file_to_db(df, table, db):
    """Write df to `table` in the SQLite file `<db>.db`, replacing any existing table."""
    engine = create_engine('sqlite:///' + db + '.db', echo=False)
    df.to_sql(table,
              con=engine,
              dtype=get_sql_types_dict(df),  # column types derived from the pandas dtypes
              index=False,
              if_exists='replace')

file_to_db(people, "people", "baseball")

## <font color="red"> Exercise 6 </font>

Set up a similar automatic conversion process for `pyspark` `DataFrame`s. You need to build a structure like this:

```python
schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])
```

In [None]:
from pyspark.sql.types import StructType
from pyspark.sql.types import DoubleType, StringType, IntegerType

hero_schema = (StructType()
  .add('Id', IntegerType(), False)
  .add('name', StringType(), True)
  .add('Gender', StringType(), True)
  .add('Eye color', StringType(), True)
  .add('Race', StringType(), True)
  .add('Hair color', StringType(), True)
  .add('Height', DoubleType(), True)
  .add('Publisher', StringType(), True)
  .add('Skin color', StringType(), True)
  .add('Alignment', StringType(), True)
  .add('Weight', DoubleType(), True))

heros = spark.read.csv('./data/heroes_information.csv', header=True, schema=hero_schema, nullValue='-')
heros

In [None]:
from pyspark.sql.types import StringType, IntegerType, FloatType, TimestampType
# TimestampType is the Spark SQL type for datetimes (DatetimeConverter is not a data type)
DTYPES_TO_PYSPARK_TYPES = {'O':StringType,
                           'i':IntegerType,
                           'f':FloatType,
                           'M':TimestampType}
DTYPES_TO_PYSPARK_TYPES

In [None]:
def get_pyspark_types_dict(df):
    get_pyspark_type = lambda col_type: get(col_type.kind, DTYPES_TO_PYSPARK_TYPES)
    cols_types = lambda df: zip(df.columns, df.dtypes)
    return {col:get_pyspark_type(col_type) 
            for col, col_type in cols_types(df)}
get_pyspark_types_dict(people)

In [None]:
nullable_helper = lambda col: col != "ID"  # every column except "ID" is nullable

In [83]:
from pyspark.sql import *

In [84]:
spark = SparkSession.builder.appName('Ops').getOrCreate()

In [93]:
def get_pyspark_types_dict(df):
    # Despite the name, this returns a list of (column, type class, nullable) tuples.
    get_pyspark_type = lambda col_type: get(col_type.kind, DTYPES_TO_PYSPARK_TYPES)
    cols_types = lambda df: zip(df.columns, df.dtypes)
    nullable_helper = lambda col: col != "playerID"  # only the key column is non-nullable
    # Remaining work: wrap each tuple in a StructField (instantiating the type class),
    # pass the resulting list to StructType, and use that schema in spark.read.csv.
    return [(col, get_pyspark_type(col_type), nullable_helper(col))
            for col, col_type in cols_types(df)]

get_pyspark_types_dict(people)

[('playerID', pyspark.sql.types.StringType, False),
 ('birthYear', pyspark.sql.types.IntegerType, True),
 ('birthMonth', pyspark.sql.types.IntegerType, True),
 ('birthDay', pyspark.sql.types.IntegerType, True),
 ('birthCountry', pyspark.sql.types.StringType, True),
 ('birthState', pyspark.sql.types.StringType, True),
 ('birthCity', pyspark.sql.types.StringType, True),
 ('deathYear', pyspark.sql.types.IntegerType, True),
 ('deathMonth', pyspark.sql.types.IntegerType, True),
 ('deathDay', pyspark.sql.types.IntegerType, True),
 ('deathCountry', pyspark.sql.types.StringType, True),
 ('deathState', pyspark.sql.types.StringType, True),
 ('deathCity', pyspark.sql.types.StringType, True),
 ('nameFirst', pyspark.sql.types.StringType, True),
 ('nameLast', pyspark.sql.types.StringType, True),
 ('nameGiven', pyspark.sql.types.StringType, True),
 ('weight', pyspark.sql.types.FloatType, True),
 ('height', pyspark.sql.types.FloatType, True),
 ('bats', pyspark.sql.types.StringType, True),
 ('throws', 