# ClickHouse

## Importing libraries and creating a connection

In [4]:
#!pip install clickhouse_driver



In [3]:
from clickhouse_driver import Client


user_name = 'user_name'
pwd = 'password'

# creating a connection to ClickHouse
client = Client(host='clickhouse.lab.karpov.courses', port=9000,
                user=user_name, password=pwd, database='hardda')

# checking connection
result = client.execute("SELECT * FROM user_dm_events LIMIT 10")

# showing the first row of the result
for row in result[0:1]:
    print(row)

(datetime.date(2022, 2, 1), datetime.date(2022, 1, 31), 'android', 'f7411212fd0e2523e126cbfdd3f226c211212', '4beb10e1-aeeb-4c52-acd2-ce1ddbc1fc24b10e1', 22, 11, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0)
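
The raw tuple above is hard to read without column names. As a minimal sketch: `execute` in `clickhouse_driver` accepts `with_column_types=True` and then returns the rows together with a list of `(name, type)` pairs. The helper name `fetch_dicts` is hypothetical, and `StubClient` is only a stand-in for a real connection so that the snippet runs offline.

```python
def fetch_dicts(client, query):
    """Run a query and return each row as a dict keyed by column name."""
    # with_column_types=True makes execute() return (rows, [(name, type), ...])
    rows, columns = client.execute(query, with_column_types=True)
    names = [name for name, _type in columns]
    return [dict(zip(names, row)) for row in rows]


class StubClient:
    """Hypothetical stand-in for clickhouse_driver.Client (offline use only)."""

    def execute(self, query, with_column_types=False):
        rows = [("android", 22)]
        columns = [("platform", "String"), ("cnt_events", "UInt32")]
        return (rows, columns) if with_column_types else rows


first_rows = fetch_dicts(
    StubClient(), "SELECT platform, cnt_events FROM user_dm_events LIMIT 1"
)
print(first_rows)  # [{'platform': 'android', 'cnt_events': 22}]
```

With the real `client` created above, `fetch_dicts(client, "SELECT * FROM user_dm_events LIMIT 1")` should label every value in the row.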


## Tasks

### Task 1. 

What is the size of the table `user_dm_events`?

In [6]:
query = '''
SELECT
  database,
  table,
  formatReadableSize(SUM(bytes)) AS size
FROM 
  system.parts
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
GROUP BY
  database, table
'''

In [7]:
result = client.execute(query)

In [8]:
for row in result:
    print(row)

('hardda', 'user_dm_events', '4.72 GiB')
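
`formatReadableSize` turns the summed byte count into binary units (KiB, MiB, GiB). The sketch below approximates that formatting in plain Python to show what the `'4.72 GiB'` string stands for. The function name `readable_size` is hypothetical, and the rounding may differ from ClickHouse's own implementation in edge cases.

```python
def readable_size(num_bytes):
    """Format a byte count using binary units, two decimal places."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


print(readable_size(1024))          # 1.00 KiB
print(readable_size(5 * 1024**3))   # 5.00 GiB
```

Note that `system.parts` also lists parts that are no longer active (for example, parts already merged into larger ones), so adding `AND active` to the `WHERE` clause may give a figure closer to the live table size.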


### Task 2. 

How many rows does the table have?

In [9]:
query = '''
SELECT
  COUNT(*)
FROM 
  user_dm_events
'''

In [10]:
result = client.execute(query)

In [11]:
print(result)

[(55916675,)]


### Task 3. 

Which set of fields is the table partitioned by?

In [12]:
query = '''
SELECT
  table,
  partition_key
FROM 
  system.tables
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
'''

In [13]:
result = client.execute(query)

In [14]:
print(result)

[('user_dm_events', 'event_date')]


### Task 4. 

What set of fields is the primary key (in ClickHouse terms and definitions)?

In [24]:
query = '''
SELECT
  table,
  primary_key
FROM 
  system.tables
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
'''

In [25]:
result = client.execute(query)

In [26]:
print(result)

[('user_dm_events', 'event_date')]


### Task 5. 

What set of fields is the sorting key, i.e. the `ORDER BY` key (in ClickHouse terms and definitions)?

In [31]:
query = '''
SELECT
  table,
  sorting_key
FROM 
  system.tables
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
'''

In [32]:
result = client.execute(query)

In [33]:
print(result)

[('user_dm_events', 'event_date')]
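
Tasks 3 to 5 read three columns of the same `system.tables` row, so they can be fetched with one query. The sketch below assumes the `%(name)s` parameter substitution that `clickhouse_driver` supports in `execute`. The helper name `table_keys` is hypothetical, and `StubClient` simply replays the values printed above so the snippet runs without a server.

```python
KEYS_QUERY = """
SELECT
  partition_key,
  primary_key,
  sorting_key
FROM
  system.tables
WHERE
  database = %(database)s
    AND name = %(table)s
"""


def table_keys(client, database, table):
    """Return the partition, primary, and sorting keys of one table."""
    rows = client.execute(KEYS_QUERY, {"database": database, "table": table})
    if not rows:
        raise LookupError(f"table {database}.{table} not found")
    partition_key, primary_key, sorting_key = rows[0]
    return {
        "partition_key": partition_key,
        "primary_key": primary_key,
        "sorting_key": sorting_key,
    }


class StubClient:
    """Hypothetical stand-in that replays the results shown above."""

    def execute(self, query, params=None):
        return [("event_date", "event_date", "event_date")]


keys = table_keys(StubClient(), "hardda", "user_dm_events")
print(keys)
```

Passing the real `client` instead of `StubClient()` should return the same three values for `hardda.user_dm_events`.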


### Task 6.

What set of fields is the primary key (in terms of relational theory)?

In [46]:
query = '''
SELECT
  COUNT(*),
  COUNT(DISTINCT(user_pseudo_id || user_x_phone_id || toString(event_date)))
FROM 
  user_dm_events
'''

In [47]:
result = client.execute(query)

In [48]:
print(result)

[(55916675, 55916675)]


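Answer: `user_pseudo_id` + `user_x_phone_id` + `event_date`. The number of distinct combinations equals the total number of rows, so this set of fields identifies every row uniquely.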
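
One caveat about the check above: concatenating strings without a delimiter can map two different field combinations to the same string, so the concatenated count can only underestimate the number of distinct combinations. Since the counts match here, the conclusion still holds. The sketch below uses made-up values (not taken from the table) to show the pitfall and a tuple-based comparison that avoids it.

```python
# Made-up rows: (user_pseudo_id, user_x_phone_id, event_date as text)
rows = [
    ("ab", "c", "2022-02-01"),
    ("a", "bc", "2022-02-01"),
]

# Plain concatenation merges the two rows into the same string.
concatenated = {user + phone + day for user, phone, day in rows}

# Comparing the fields as a tuple keeps them apart.
as_tuples = set(rows)

print(len(rows), len(concatenated), len(as_tuples))  # 2 1 2
```

In ClickHouse, `uniqExact(user_pseudo_id, user_x_phone_id, event_date)` compares the columns as a tuple and should be safe against this kind of collision.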

### Task 7. 

Which string columns can be converted to the `LowCardinality` data type to improve column typing?

According to the official documentation, `LowCardinality` is usually more efficient when a column has fewer than 10,000 distinct values.

In [51]:
query = '''
SELECT
  COUNT(DISTINCT(platform)),
  COUNT(DISTINCT(user_pseudo_id)),
  COUNT(DISTINCT(user_x_phone_id))
FROM 
  user_dm_events
'''

In [52]:
result = client.execute(query)

In [53]:
print(result)

[(2, 6663048, 6446931)]


Answer: `platform`
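
The selection can be written down as a small filter over the distinct counts printed above. This is only a sketch: it treats 10,000 as a rough cutoff rather than a hard rule, and the names `LOW_CARDINALITY_LIMIT` and `distinct_counts` are introduced here for illustration.

```python
# Rough rule-of-thumb cutoff (an assumption, not a hard limit).
LOW_CARDINALITY_LIMIT = 10_000

# Distinct counts taken from the query result above.
distinct_counts = {
    "platform": 2,
    "user_pseudo_id": 6663048,
    "user_x_phone_id": 6446931,
}

candidates = [
    name for name, count in distinct_counts.items()
    if count < LOW_CARDINALITY_LIMIT
]
print(candidates)  # ['platform']
```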

### Task 8. 

Choose numeric fields for which the data type can be changed to a more compact one.

**Int Ranges**  

**`Int8`** — [-128 : 127]  
**`Int16`** — [-32768 : 32767]  
**`Int32`** — [-2147483648 : 2147483647]  
**`Int64`** — [-9223372036854775808 : 9223372036854775807]  
**`Int128`** — [-170141183460469231731687303715884105728 : 170141183460469231731687303715884105727]  
**`Int256`** — [-57896044618658097711785492504343953926634992332820282019728792003956564819968 : 57896044618658097711785492504343953926634992332820282019728792003956564819967]

**UInt Ranges**

**`UInt8`** — [0 : 255]  
**`UInt16`** — [0 : 65535]  
**`UInt32`** — [0 : 4294967295]  
**`UInt64`** — [0 : 18446744073709551615]  
**`UInt128`** — [0 : 340282366920938463463374607431768211455]  
**`UInt256`** — [0 : 115792089237316195423570985008687907853269984665640564039457584007913129639935]

First, let's check which data types the columns currently have.

In [15]:
query = '''
SELECT
  name,
  type
FROM 
  system.columns
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
'''

In [16]:
result = client.execute(query)

In [18]:
for i in result:
    print(i)

('event_date', 'Date')
('week_start_date', 'Date')
('platform', 'String')
('user_pseudo_id', 'String')
('user_x_phone_id', 'String')
('cnt_events', 'UInt32')
('cnt_view_advertisement', 'UInt32')
('cnt_view_listing', 'UInt32')
('cnt_new_advertisement_open', 'UInt32')
('cnt_new_advertisement_view_screen', 'UInt32')
('cnt_successful_new_advertisement_creation', 'UInt32')
('cnt_session_initiation', 'UInt32')
('cnt_display_phone', 'UInt32')
('cnt_send_message', 'UInt32')
('cnt_order_via_phone', 'UInt32')
('cnt_add_to_favorites', 'UInt32')
('cnt_view_ads_in_cabinet', 'UInt32')
('cnt_edit_advert_view_screen_package', 'UInt32')
('cnt_new_advert_view_screen_package', 'UInt32')


Now let's find the columns whose data type is not optimal by looking at their maximum values.

In [28]:
query = '''
SELECT
  max(cnt_events),
  max(cnt_view_advertisement),
  max(cnt_view_listing),
  max(cnt_new_advertisement_open),
  max(cnt_new_advertisement_view_screen),
  max(cnt_successful_new_advertisement_creation),
  max(cnt_session_initiation),
  max(cnt_display_phone),
  max(cnt_send_message),
  max(cnt_order_via_phone),
  max(cnt_add_to_favorites),
  max(cnt_view_ads_in_cabinet),
  max(cnt_edit_advert_view_screen_package),
  max(cnt_new_advert_view_screen_package)
FROM 
  user_dm_events
'''

In [29]:
result = client.execute(query)

In [31]:
print(result)

[(11070, 2902, 525, 346, 1713, 344, 49, 1588, 2683, 12, 1433, 960, 340, 688)]


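Answer: all of the columns could be stored using the more compact `UInt16` data type, since the largest value (11070) is well below 65535. Two of them, `cnt_session_initiation` (max 49) and `cnt_order_via_phone` (max 12), would even fit into `UInt8` with the current data.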
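
The comparison against the `UInt` ranges can be automated. The sketch below maps each observed maximum from the query result to the smallest unsigned type that can hold it. The names `smallest_uint` and `observed_max` are hypothetical helpers, and the result only reflects the data currently in the table.

```python
# Upper bounds of the unsigned integer types, smallest first.
UINT_MAX = {
    "UInt8": 255,
    "UInt16": 65535,
    "UInt32": 4294967295,
    "UInt64": 18446744073709551615,
}


def smallest_uint(max_value):
    """Return the smallest UInt type whose range contains max_value."""
    for type_name, limit in UINT_MAX.items():
        if 0 <= max_value <= limit:
            return type_name
    raise ValueError(f"{max_value} does not fit into UInt8..UInt64")


# Maximum values copied from the query result above, in column order.
observed_max = {
    "cnt_events": 11070,
    "cnt_view_advertisement": 2902,
    "cnt_view_listing": 525,
    "cnt_new_advertisement_open": 346,
    "cnt_new_advertisement_view_screen": 1713,
    "cnt_successful_new_advertisement_creation": 344,
    "cnt_session_initiation": 49,
    "cnt_display_phone": 1588,
    "cnt_send_message": 2683,
    "cnt_order_via_phone": 12,
    "cnt_add_to_favorites": 1433,
    "cnt_view_ads_in_cabinet": 960,
    "cnt_edit_advert_view_screen_package": 340,
    "cnt_new_advert_view_screen_package": 688,
}

suggested = {column: smallest_uint(m) for column, m in observed_max.items()}
for column, type_name in suggested.items():
    print(column, type_name)
```

Because future rows may exceed today's maxima, it is reasonable to leave some headroom instead of always picking the tightest type.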

### Task 9. 

Choose fields with a date or date and time, for which the type can be changed to a more compact one.

ClickHouse supports the following date and time types:
- `Date` - A date without time. Stored as a 2-byte unsigned integer representing the number of days since 1970-01-01.
- `DateTime` - A date and time. Stored as a 4-byte unsigned integer representing the number of seconds since 1970-01-01 00:00:00 UTC.
- `DateTime64` - A date and time with configurable sub-second precision. Stored as an 8-byte signed integer representing the number of ticks since 1970-01-01 00:00:00 UTC, where the tick size depends on the chosen precision (from seconds down to nanoseconds).
- `Date32` - A date without time and with a wider range than `Date`. Stored as a 4-byte signed integer representing the number of days since 1970-01-01.
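
To make the 2-byte `Date` layout concrete, the sketch below packs a date as a 16-bit day count since 1970-01-01 using Python's `struct` module. It illustrates the encoding idea only and is not ClickHouse's own code. The helper names `encode_date` and `decode_date` are hypothetical, and little-endian byte order is chosen for illustration.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def encode_date(d):
    """Pack a date as an unsigned 16-bit day count since the epoch."""
    days = (d - EPOCH).days
    # "<H" = little-endian unsigned 16-bit; raises struct.error outside 0..65535
    return struct.pack("<H", days)


def decode_date(raw):
    """Inverse of encode_date: 2 bytes back to a date."""
    (days,) = struct.unpack("<H", raw)
    return EPOCH + timedelta(days=days)


raw = encode_date(date(2022, 2, 1))
print(len(raw), decode_date(raw))      # 2 2022-02-01
print(EPOCH + timedelta(days=65535))   # 2149-06-06
```

The last line shows the latest date a 16-bit day counter can reach, which is why `Date` covers a narrower range than `Date32`.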

Let's check the data types of our date columns.

In [35]:
query = '''
SELECT
  name,
  type
FROM 
  system.columns
WHERE
  database = 'hardda'
    AND table = 'user_dm_events'
      AND (name = 'event_date' OR name = 'week_start_date')
'''

In [36]:
result = client.execute(query)

In [37]:
print(result)

[('event_date', 'Date'), ('week_start_date', 'Date')]


Answer: both date columns already use `Date`, the most compact date type, so no data type changes are needed here.

### Task 10. 

tbc..