# Python Data Types

## Unicode, UTF-8 and Strings in Python
1. __What are strings made of?__

In Python (2 or 3), __strings__ can be represented either as __bytes__ or as __unicode code points__.
A __byte__ is a unit of information made up of __8 bits__; _bytes are used to store every file on a hard disk_. So all of the CSV and JSON files on your computer are made of bytes. We can all agree that we need bytes, but then what about unicode code points?
 
2. __What is Unicode, and unicode code points?__

__Unicode__ is an __international standard__ that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering the scripts used for English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is identified by a unicode code point. So __unicode code points__ are the numbers that stand for the _actual characters that are displayed._ Code points are encoded to bytes and decoded from bytes back to code points. Example: the Unicode code point for the letter a is U+0061.
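As a small illustration of the mapping described above, the sketch below uses Python 3's built-in `ord` and `chr` to move between a character and its code point. The helper name `code_point_label` and the sample characters are only illustrative choices.

```python
def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"  # ord() gives the integer code point


for ch in ["a", "é", "你", "😀"]:
    code_point = ord(ch)          # character -> integer code point
    assert chr(code_point) == ch  # integer code point -> character
    print(ch, code_point, code_point_label(ch))
```

The first line printed should show the letter a with code point 97, which is U+0061 in hexadecimal notation.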

Three of the most popular encodings defined by Unicode are __UTF-8__, __UTF-16__ and __UTF-32__.

3. __What are Unicode encodings UTF-8, UTF-16, and UTF-32?__

We now know that __Unicode__ is an __international standard__ that maps every known character to a unique number. The next question is: how do we move these unique numbers around the internet? You already know the answer! Using bytes of information.

__UTF-8:__ It uses 1, 2, 3 or 4 bytes to encode every code point. It is backwards compatible with ASCII. All English characters need just 1 byte, which is quite efficient. We only need more bytes if we are sending non-English characters.
It is the most popular encoding, and it is the default encoding in Python 3. In Python 2, the default encoding is ASCII (unfortunately).

__UTF-16:__ It uses a variable 2 or 4 bytes per code point. This encoding is great for Asian text, as most of it can be encoded in 2 bytes per character. It's bad for English, as all English characters also need 2 bytes here.

__UTF-32:__ It uses a fixed 4 bytes per code point. Every character is encoded in 4 bytes, so it needs a lot of memory. It is not used very often.
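To make the size differences concrete, here is a minimal sketch that counts how many bytes one character takes in each encoding. The function name `encoded_lengths` is hypothetical, and the little-endian codec names (`utf-16-le`, `utf-32-le`) are used on the assumption that we do not want Python to prepend a byte order mark, which would inflate the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch):
    """Return the number of bytes a character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# An ASCII letter, an accented letter, a Chinese character and an emoji
for ch in ["a", "é", "你", "😀"]:
    print(ch, encoded_lengths(ch))
```

The letter a needs 1 byte in UTF-8 but 2 in UTF-16 and 4 in UTF-32, while the Chinese character needs 3 bytes in UTF-8 and only 2 in UTF-16.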

We need the __encode__ method to convert __unicode code points__ to __bytes__. This typically happens when writing string data to a file, for example a CSV or JSON file. We need the __decode__ method to convert __bytes__ back to __unicode code points__. This typically happens when reading data from a file into strings.

https://github.com/nyangweso-rodgers/Analytics_with_Python/blob/main/Reference_Images_Folder/image2.png
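The round trip between code points and bytes can be sketched in Python 3 as follows. The sample text and the temporary file name `sample.txt` are assumptions made for illustration only.

```python
import os
import tempfile

text = "naïve café"

# str -> bytes: 10 characters become 12 bytes, because ï and é take 2 bytes each
data = text.encode("utf-8")

# bytes -> str: decoding with the same encoding restores the original text
restored = data.decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Writing in text mode encodes the string for us
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading in binary mode shows the raw bytes stored on disk
    with open(path, "rb") as f:
        raw = f.read()

    # Reading in text mode decodes the bytes back to a string
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

print(len(text), len(data), raw == data, read_back == text)
```

Passing `encoding="utf-8"` explicitly to `open` avoids depending on the platform's default encoding.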

4. __What data types in Python handle Unicode code points and bytes?__

__Remark:__ _in Python, strings can either be represented in bytes or unicode code points._

The main takeaways in Python are:
* Python 2 uses the __str__ type to store __bytes__ and the __unicode__ type to store unicode code points. All strings are str type by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot represent those characters. In this case, we need to remember to use decode("utf-8") when reading files. This is inconvenient.

* Python 3 came and fixed this. Strings are still __str__ type by default, but they now mean __unicode code points__ instead: we carry what we see. If we want to store these str type strings in files, we use the bytes type instead. The default encoding is UTF-8 instead of ASCII. Perfect!
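The takeaways above can be checked in Python 3 with a short sketch. The Cyrillic sample word is an arbitrary choice, and the `ascii_ok` flag is only there to record whether decoding as ASCII succeeded.

```python
text = "Привет"                # str: a sequence of unicode code points
data = text.encode("utf-8")    # bytes: what actually gets stored in a file

print(type(text), len(text))   # 6 characters
print(type(data), len(data))   # 12 bytes, 2 per Cyrillic letter

# Decoding the same bytes as ASCII fails, which mirrors the Python 2 problem
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)
print(data.decode("utf-8") == text)
```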

## Generating unique strings using UUID
__UUID__ stands for __Universally Unique Identifier__. Python's built-in __uuid__ module generates 128-bit objects that can be used as ids. Where the uniqueness comes from depends on the function: __uuid1()__ builds the id from the current time and the computer hardware (MAC address), while __uuid4()__ builds it from random numbers.

#### Advantages of UUID
* Can be used as a general utility to generate unique random ids.
* Can be used in cryptography and hashing applications.
* Useful in generating random documents, addresses etc.
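As a minimal sketch of what the standard-library `uuid` module offers beyond `uuid1()`, the example below creates a random id, parses it back from its string form, and builds a name-based id with `uuid5`. The name "example.com" is a placeholder chosen for illustration.

```python
import uuid

# Random UUID (version 4)
random_id = uuid.uuid4()
print(random_id, random_id.version)

# The canonical string form is 36 characters: 32 hex digits plus 4 hyphens
as_text = str(random_id)

# A UUID can be rebuilt from its string form
parsed = uuid.UUID(as_text)

# Name-based UUID (version 5): the same namespace and name give the same id
name_id_a = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
name_id_b = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
print(name_id_a == name_id_b)
```

A name-based id is useful when the same input should always map to the same identifier, whereas `uuid4()` returns a fresh value on every call.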

In [3]:
import uuid
print(dir(uuid))

['Enum', 'NAMESPACE_DNS', 'NAMESPACE_OID', 'NAMESPACE_URL', 'NAMESPACE_X500', 'RESERVED_FUTURE', 'RESERVED_MICROSOFT', 'RESERVED_NCS', 'RFC_4122', 'SafeUUID', 'UUID', '_AIX', '_GETTERS', '_LINUX', '_OS_GETTERS', '_UuidCreate', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_arp_getnode', '_find_mac', '_generate_time_safe', '_has_uuid_generate_time_safe', '_ifconfig_getnode', '_ip_getnode', '_ipconfig_getnode', '_is_universal', '_lanscan_getnode', '_last_timestamp', '_load_system_functions', '_netbios_getnode', '_netstat_getnode', '_node', '_popen', '_random_getnode', '_unix_getnode', '_uuid', '_windll_getnode', 'bytes_', 'getnode', 'int_', 'os', 'sys', 'uuid1', 'uuid3', 'uuid4', 'uuid5']


In [4]:
print("The random id using uuid1() is :", uuid.uuid1())

The random id using uuid1() is : d5109b24-e3eb-11eb-97c7-38fc980d3a99


In [5]:
print(str(uuid.uuid4()))

f26f0820-e44f-4c3e-ad16-bceffb47bc09


In [6]:
print(type(uuid.uuid4()))

<class 'uuid.UUID'>


## Arrays
An __array__ is a data structure that contains a group of elements. Typically these elements are all of the same data type, such as integers or strings. In Python, a __List__ is implemented as a __Dynamic Array__. Other languages such as __Java__ and __C++__ offer both __static__ and __dynamic__ arrays.

* __Static Array__: The size of a __static array__ is fixed. If you create an array with a capacity of 5 and try to add more elements than that, the operation fails; in Java, for example, it throws an ArrayIndexOutOfBoundsException.
* __Dynamic Array__: A __dynamic array__ does not need a size specified up front; you can keep adding elements and it grows as needed.
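The dynamic behaviour of a Python list can be observed with a short sketch. The sizes reported by `sys.getsizeof` are an implementation detail (assumed here to be CPython) and will differ between versions, so treat the printed numbers as indicative only.

```python
import sys

numbers = []
sizes = []

# Appending never fails for lack of capacity: the list grows as needed
for value in range(20):
    numbers.append(value)
    sizes.append(sys.getsizeof(numbers))

# The memory footprint typically grows in steps rather than on every append
print(sorted(set(sizes)))

# Assigning past the current end does not grow the list; it raises IndexError
try:
    numbers[len(numbers)] = 99
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True

print(len(numbers), out_of_range_failed)
```

So a list grows through `append` (or `extend` and `insert`), not through assignment to an index that does not exist yet.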

## Dates

#### Example 1: Generating Date Arrays

In [8]:
import pandas as pd
from datetime import timedelta

# Generate a daily date range and load it into a DataFrame
date_list = pd.date_range(start="2020-01-01", end="2021-04-30").to_list()
df = pd.DataFrame(date_list, columns=['date'])

# Get the ISO week number of each date
df['week_number'] = df['date'].dt.isocalendar().week

# Compute the week start date (the Monday of each date's week)
def week_start_date(date):
    return date - timedelta(days=date.weekday())

# Apply the function to the DataFrame column
df['week_start_date'] = df['date'].apply(week_start_date)
df

| | date | week_number | week_start_date |
|---|---|---|---|
| 0 | 2020-01-01 | 1 | 2019-12-30 |
| 1 | 2020-01-02 | 1 | 2019-12-30 |
| 2 | 2020-01-03 | 1 | 2019-12-30 |
| 3 | 2020-01-04 | 1 | 2019-12-30 |
| 4 | 2020-01-05 | 1 | 2019-12-30 |
| ... | ... | ... | ... |
| 481 | 2021-04-26 | 17 | 2021-04-26 |
| 482 | 2021-04-27 | 17 | 2021-04-26 |
| 483 | 2021-04-28 | 17 | 2021-04-26 |
| 484 | 2021-04-29 | 17 | 2021-04-26 |
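The same week logic can be sketched with only the standard library, which helps when checking individual rows by hand. The helper name `week_info` is hypothetical and is not part of the pandas example.

```python
from datetime import date, timedelta


def week_info(day):
    """Return (ISO week number, Monday of that week) for a date."""
    week_number = day.isocalendar()[1]                 # ISO 8601 week number
    week_start = day - timedelta(days=day.weekday())   # weekday() is 0 on Monday
    return week_number, week_start


start = date(2020, 1, 1)
for offset in range(5):
    day = start + timedelta(days=offset)
    print(day, *week_info(day))
```

Because ISO weeks start on Monday, 2020-01-01 (a Wednesday) belongs to a week that starts on 2019-12-30, which matches the first rows of the table.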


### References
1. [A Guide to Unicode, UTF-8 and Strings in Python](https://towardsdatascience.com/a-guide-to-unicode-utf-8-and-strings-in-python-757a232db95c)