# Data Access SQL

This document introduces how to access the [PostgreSQL](https://www.postgresql.org/) database used for Challenge 2 and the Project.

Import the necessary packages.

In [1]:
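# install the package (only needed once; comment this line out afterwards)
install.packages("RPostgreSQL")

# load the package (this also loads its dependency DBI)
require("RPostgreSQL")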

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Loading required package: RPostgreSQL
Loading required package: DBI


Define the database connection information.

In [2]:
# define the database connection parameters
DB_HOST <- 'server2053.cs.technik.fhnw.ch' # or 86.119.36.94 depending on the network
DB_PORT <- 5432
DB_DBNAME <- 'bank_db' # or 'warenkorb_db'
DB_USERNAME <- 'db_user'
DB_PASSWORD <- 'db_user_pw'

Load the driver and set up the connection.

In [3]:
# load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
# connect to the database
con <- dbConnect(drv, dbname = DB_DBNAME,
                 host = DB_HOST, port = DB_PORT,
                 user = DB_USERNAME, password = DB_PASSWORD)

Define the query and execute it.

In [4]:
query <- "SELECT * FROM client"
df <- dbGetQuery(con, query)

The result of your query is stored in the variable `df`, which is an R data frame (class `data.frame`).
http://www.r-tutor.com/r-introduction/data-frame provides a first introduction to working with R data frames.

In [5]:
# show the documentation for data frames
# (note: help(df) would open the help page of the F distribution function df() instead)
help("data.frame")

Access the data of a data frame.

In [6]:
# preview
head(df)

client_id,birth_number,district_id
<int>,<int>,<int>
1,706213,18
2,450204,1
3,406009,1
4,561201,5
5,605703,5
6,190922,12


In [7]:
# first row and column "birth_number"
df[1, "birth_number"] 
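
Beyond a single cell, a few other access patterns are often useful. The following is a minimal sketch that does not need a database connection: `preview` is a hypothetical stand-in data frame built from the six rows shown in the preview above, and the other variable names are chosen for illustration only.

```r
# Hypothetical stand-in for the query result, built from the preview rows above
preview <- data.frame(
  client_id    = 1:6,
  birth_number = c(706213L, 450204L, 406009L, 561201L, 605703L, 190922L),
  district_id  = c(18L, 1L, 1L, 5L, 5L, 12L)
)

first_birth <- preview[1, "birth_number"]           # single cell: row 1, column by name
birth_col   <- preview$birth_number                 # whole column as a vector
district_1  <- preview[preview$district_id == 1, ]  # all rows matching a condition
dims        <- dim(preview)                         # number of rows and columns

str(preview)  # compact overview of column names and types
```

The same expressions should work on `df` from the query, since both are plain data frames with the same columns.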

Let's try the gender extraction from https://gitlab.fhnw.ch/classes/grundkompetenz_datenbanken/blob/master/date/Date_Modifier.ipynb in R.

## Birth Number

The attribute *birth_number* of the table *client* has the format "YYMMDD" (year|month|day) and encodes the gender in the month (+50 for females). 

A recurring question is how to extract gender, birth date, age, etc. from *birth_number*.

Get a first impression...

In [8]:
values <- dbGetQuery(con, "SELECT birth_number FROM client LIMIT 5")
values

birth_number
<int>
706213
450204
406009
561201
605703


Get some meta info on data types...

In [9]:
values <- dbGetQuery(con, "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'client'")
values

column_name,data_type
<chr>,<chr>
client_id,integer
birth_number,integer
district_id,integer


The attribute *birth_number* is stored as an integer (not a date). Let's think about how we can extract the gender from this number.

### Gender

To determine the gender, we need to extract the *MM* part from the 6-digit integer (format YYMMDD). If this number is in the range 01..12, the person is male; if it is in the range 51..62, the person is female (this can be simplified to: if the number is greater than 50, the person is female). Arithmetic operators and functions (https://www.postgresql.org/docs/9.3/functions-math.html) help to extract the *MM* part from an integer.

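In a first step, we need to get rid of the *DD* part of the 6-digit integer. This can be achieved by dividing the integer by 100: because both operands are integers, PostgreSQL performs integer division and drops the remainder (i.e. YYMMDD / 100 = YYMM, with remainder DD).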

Building on the result of the first step (YYMM), we can now extract the *MM* with the modulo (https://de.wikipedia.org/wiki/Division_mit_Rest#Modulo ) operation (i.e. MOD(YYMM, 100) = MM).
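
The two arithmetic steps can be checked locally in R before putting them into SQL. This is only a sketch on the five sample values shown earlier; the variable names `yymm` and `mm` are illustrative. Note that R's `/` is not an integer division, so the sketch uses `%/%` (integer division) and `%%` (modulo) to mirror what PostgreSQL does with integer operands.

```r
# Sample values taken from the birth_number preview above
birth_number <- c(706213L, 450204L, 406009L, 561201L, 605703L)

# Step 1: integer division by 100 drops the DD part (YYMMDD -> YYMM)
yymm <- birth_number %/% 100L

# Step 2: modulo 100 keeps the last two digits (YYMM -> MM)
mm <- yymm %% 100L

print(data.frame(birth_number, yymm, mm))
```

Values of `mm` above 50 (here 62, 60 and 57) then indicate female clients.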

Based on the above result, we can use a CASE expression (https://www.postgresql.org/docs/current/functions-conditional.html) to return *female* when the result is > 50 and *male* otherwise, and name this calculated attribute *gender*.

In [10]:
values <- dbGetQuery(con, "SELECT birth_number, CASE WHEN MOD(birth_number / 100, 100) > 50 THEN 'female' ELSE 'male' END as gender FROM client LIMIT 5")
values

birth_number,gender
<int>,<chr>
706213,female
450204,male
406009,female
561201,male
605703,female
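
The same decomposition can also recover the birth date, one of the recurring questions mentioned above. The following R sketch is not part of the original notebook: `decode_birth_number` is a hypothetical helper, and the `century = 1900L` default is an assumption (the two-digit year does not encode the century) that should be checked against the data.

```r
# Hypothetical helper: decode YYMMDD integers into gender and birth date.
# ASSUMPTION: all years belong to the 1900s unless another century is passed.
decode_birth_number <- function(birth_number, century = 1900L) {
  yy     <- birth_number %/% 10000L          # YYMMDD -> YY
  mm_raw <- (birth_number %/% 100L) %% 100L  # YYMMDD -> MM (still +50 for females)
  dd     <- birth_number %% 100L             # YYMMDD -> DD
  female <- mm_raw > 50L
  mm     <- ifelse(female, mm_raw - 50L, mm_raw)  # remove the gender offset
  data.frame(
    birth_number = birth_number,
    gender       = ifelse(female, "female", "male"),
    birth_date   = as.Date(sprintf("%04d-%02d-%02d", century + yy, mm, dd)),
    stringsAsFactors = FALSE
  )
}

decoded <- decode_birth_number(c(706213L, 450204L, 406009L, 561201L, 605703L))
print(decoded)
```

Under the stated century assumption, the first sample value 706213 decodes to a female client born on 1970-12-13.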


### TBL

I noticed that some of you are using TBL. This is certainly an alternative. However, in my view it is important that you practise plain SQL syntax. https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html#querying_the_database_with_the_sql_syntax describes how you can run SQL queries with TBL.

In [11]:
# close the connection (don't forget to clean up)
dbDisconnect(con)
dbUnloadDriver(drv)