In [1]:
import numpy as np
import pandas as pd

### Numpy Data Types
Numpy has the following data types: 
- ```int```
- ```float```
- ```complex```
- ```bool```
- ```string```
- ```unicode```
- ```object```

The numeric data types have various precisions like 32-bit or 64-bit. 

NumPy data types can be referred to either by __Type__ (the name) or by __Type Code__ (a short string).

In [3]:
#create a DataFrame listing NumPy data types and their corresponding type codes
dtypes = pd.DataFrame(
    {
        'Type': [
            'int8', 
            'uint8', 
            'int16', 
            'uint16', 
            'int or int32', 
            'uint32', 
            'int64', 
            'uint64', 
            'float16', 
            'float32', 
            'float or float64',
            'float128', 
            'complex64', 
            'complex or complex128', 
            'bool', 
            'object', 
            'bytes_',
            'str_',
        ],
        
        'Type Code': [
            'i1', 
            'u1', 
            'i2', 
            'u2', 
            'i4 or i', 
            'u4', 
            'i8', 
            'u8', 
            'f2', 
            'f4 or f', 
            'f8 or d', 
            'f16 or g', 
            'c8', 
            'c16', 
            '?', 
            'O', 
            'S', 
            'U',
        ]
    }
)

dtypes

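| | Type | Type Code |
| :--- | :--- | :--- |
| 0 | int8 | i1 |
| 1 | uint8 | u1 |
| 2 | int16 | i2 |
| 3 | uint16 | u2 |
| 4 | int or int32 | i4 or i |
| 5 | uint32 | u4 |
| 6 | int64 | i8 |
| 7 | uint64 | u8 |
| 8 | float16 | f2 |
| 9 | float32 | f4 or f |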


| Datatype | Range / Precision | Memory Usage |
| :--- | :--- | :--- |
| **`int32`** | -2,147,483,648 to 2,147,483,647 | 4 bytes |
| **`int64`** | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 8 bytes |
| **`float32`** | ~7 decimal digits of precision | 4 bytes |
| **`float64`** | ~15-17 decimal digits of precision | 8 bytes |
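The figures in the table above can be checked from NumPy itself. The following is a minimal sketch, assuming only `np.iinfo`, `np.finfo`, and the `itemsize` attribute of a dtype; the loop variable names are illustrative.

```python
import numpy as np

# Integer limits come from np.iinfo; itemsize is the size of one element in bytes
for int_type in (np.int32, np.int64):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max, np.dtype(int_type).itemsize)

# Floating-point types use np.finfo; eps is the gap between 1.0 and the next representable value
for float_type in (np.float32, np.float64):
    info = np.finfo(float_type)
    print(float_type.__name__, info.eps, np.dtype(float_type).itemsize)
```

A smaller `eps` means finer precision, which is why `float64` carries roughly twice as many decimal digits as `float32`.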

A data type can be specified when the NumPy array is created, and the array can be converted to another type later. 


In [4]:
# create a numpy array with a float32 dtype using the type code
arr = np.array([1,2,3], dtype='f4')
arr

array([1., 2., 3.], dtype=float32)

In [5]:
# Identical to the above, but using the dtype name
arr = np.array([1,2,3], dtype='float32')
arr.dtype

dtype('float32')
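As a quick check that a type code and a type name describe the same thing, the sketch below builds a dtype both ways and compares them. The variable names `by_code` and `by_name` are only illustrative.

```python
import numpy as np

by_code = np.dtype('f4')       # built from the type code
by_name = np.dtype('float32')  # built from the type name

print(by_code == by_name)  # True
# name, single-character code, and bytes per element of the dtype
print(by_name.name, by_name.char, by_name.itemsize)  # float32 f 4

# The same holds for the other codes, for example 'i8' and 'int64'
print(np.dtype('i8') == np.dtype('int64'))  # True
```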

### Type Conversion

The ```astype``` method converts an array to another data type. 

Note that ```astype``` returns a converted copy of the array; it does not change the data type in place. Assign the result back to the original variable or to a new one.

In [6]:
arr = np.array([1,2,3], dtype='int16') # create an array with int16 data type
print('Original Data Type: ' + str(arr.dtype)) # print original data type

arr = arr.astype(np.float32) # convert to float32 data type
print('Data Type After Conversion: ' + str(arr.dtype)) # print data type after conversion

Original Data Type: int16
Data Type After Conversion: float32
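To make the copy behaviour concrete, the sketch below keeps both arrays around and checks that they do not share memory. The names `original` and `converted` are illustrative, and `np.shares_memory` is used here only as a convenient check.

```python
import numpy as np

original = np.array([1, 2, 3], dtype='int16')
converted = original.astype(np.float32)  # returns a new array

print(original.dtype)   # int16, the source array is untouched
print(converted.dtype)  # float32
print(np.shares_memory(original, converted))  # False

# Changing the copy does not affect the source array
converted[0] = 99
print(original)  # [1 2 3]
```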


__WARNING__: be cautious about overflow when you downcast (convert from a wider type to a narrower one). Values outside the target range are not rejected: integers silently wrap around, so the results can be surprising and such issues are usually difficult to debug. 

In [8]:
# An example of integer overflow at downcasting
arr = np.array([126,127,256], dtype='int16')
print('np array before type conversion: ' + str(arr))

# Range of int8 [-128, 127], 256 overflows after conversion
arr = arr.astype('int8')
print('np array after type conversion: ' + str(arr))

np array before type conversion: [126 127 256]
np array after type conversion: [126 127   0]


| Feature | `int8` | `int16` |
| :--- | :--- | :--- |
| **Size (Memory)** | 8 bits (1 byte) | 16 bits (2 bytes) |
| **Minimum Value** | $-2^7$ ($-128$) | $-2^{15}$ ($-32,768$) |
| **Maximum Value** | $2^7 - 1$ ($127$) | $2^{15} - 1$ ($32,767$) |
| **Total Values** | 256 | 65,536 |
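One way to avoid silent wrap-around is to compare the data against the target range before casting. The sketch below assumes a hypothetical helper named `safe_downcast` (not part of NumPy) that works on non-empty integer arrays and uses `np.iinfo` to look up the limits shown in the table.

```python
import numpy as np

def safe_downcast(values, target):
    """Hypothetical helper: cast an integer array only if every value fits in `target`."""
    info = np.iinfo(target)
    if values.min() < info.min or values.max() > info.max:
        raise OverflowError(
            f'values fall outside the {np.dtype(target).name} range [{info.min}, {info.max}]'
        )
    return values.astype(target)

# All values fit in int8, so the cast goes through
small = safe_downcast(np.array([126, 127], dtype='int16'), np.int8)
print(small, small.dtype)

# 256 does not fit in int8, so the helper refuses instead of wrapping to 0
try:
    safe_downcast(np.array([126, 127, 256], dtype='int16'), np.int8)
except OverflowError as err:
    print('refused:', err)
```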

### String and Unicode Data Type

The ```bytes_``` and ```str_``` (unicode) data types are implicitly _fixed-length_. 

The length is given by a number appended to the type code. For example, ```S3``` represents a byte string of length 3; ```U10``` represents a unicode string of length 10. If no number is given, the default length is the length of the longest string in the array.

If a string in the array is longer than the length of the data type defined or converted to, the string will be truncated.

In [10]:
# An example of truncated string
s = np.array(['abc', 'defg'], dtype='S3')
print(s)

# U11 is wide enough for both unicode strings, so nothing is truncated here
s = np.array(['abcd', 'efghi'], dtype='U11')
print(s)

[b'abc' b'def']
['abcd' 'efghi']


The 'b' stands for bytes, confirming they are raw byte strings.

In [None]:
arr = np.array(['a', 'ab', 'abc'], dtype=np.bytes_) # create numpy array with the bytes dtype
print('The array is ' + str(arr))
print('The data type is ' + str(arr.dtype) + ' because the longest string in the array is "abc" and its length is 3.')

arr = np.array(['a', 'abc', 'abcd'], dtype=np.str_) # create a numpy array with the str dtype, which becomes U4
print('The array is ' + str(arr))
print('The data type is ' + str(arr.dtype) + ' because the longest unicode in the array is "abcd" and its length is 4.')


The array is [b'a' b'ab' b'abc']
The data type is |S3 because the longest string in the array is "abc" and its length is 3.
The array is ['a' 'abc' 'abcd']
The data type is <U4 because the longest unicode in the array is "abcd" and its length is 4.
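Fixed length also means a fixed memory cost per element. The sketch below, using illustrative variable names, shows that every element occupies the full width of the dtype: one byte per character for `S`, four bytes per character for `U`.

```python
import numpy as np

byte_arr = np.array(['a', 'ab', 'abc'], dtype=np.bytes_)
text_arr = np.array(['a', 'abc', 'abcd'], dtype=np.str_)

print(byte_arr.dtype.itemsize)  # 3: S3 stores 1 byte per character
print(text_arr.dtype.itemsize)  # 16: U4 stores 4 bytes per character

# Every element takes the full width, even the one-character strings
print(byte_arr.nbytes, text_arr.nbytes)  # 9 48
```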


What do "|" and "<" in the data types above mean?

- "|" means the byte order is not applicable.
- "<" stands for little-endian: the least significant byte of a value comes first.

They are byte order indicators, which are beyond the scope of this tutorial.
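For readers who want a quick look anyway, the sketch below stores the same 16-bit integer in both byte orders and prints the raw bytes. The variable names are illustrative; `>` is the big-endian counterpart of `<`.

```python
import numpy as np

little = np.array([1], dtype='<i2')  # least significant byte first
big = np.array([1], dtype='>i2')     # most significant byte first

print(little.tobytes())  # b'\x01\x00'
print(big.tobytes())     # b'\x00\x01'

# Byte strings are made of single bytes, so byte order does not apply
print(np.dtype('S3').byteorder)  # |
```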

Further readings if you are interested:
- https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.byteorder.html
- https://en.wikipedia.org/wiki/Endianness



# Exercises


## Exercise 1: The Precision Trap
**Goal:** Understand the difference between integer types and how they handle large numbers.

Create a NumPy array containing the value 250 with a data type of np.uint8 (unsigned 8-bit integer).

Add 10 to this array.

Question: What is the resulting value? Why didn't it result in 260?

Task: Fix the code by casting the array to a larger data type (like np.int16) before performing the addition.

In [7]:
import numpy as np
#Create a NumPy array containing the value 250 with a data type of uint8
arr = np.array([250], dtype='uint8')
print('Original array: ' + str(arr))
#add 10 to the original array; 260 does not fit in uint8 (0 to 255), so it wraps around to 260 - 256 = 4
np_arr = arr + 10
print('Array after adding 10: ' + str(np_arr))

Original array: [250]
Array after adding 10: [4]


In [9]:
#Fixed Code

#Create the uint8 array containing 250, then cast it to int16 before the addition
arr2 = np.array([250], dtype='uint8').astype('int16')
print('Original array: ' + str(arr2))
#add 10 to the array; int16 can hold 260, so no overflow occurs
np_arr2 = arr2 + 10
print('Array after adding 10: ' + str(np_arr2))


Original array: [250]
Array after adding 10: [260]


## Exercise 2: Memory Footprint Comparison
**Goal:** Visualize how choosing the right dtype saves system resources.

Create a 1D array of 1,000,000 zeros using np.float64.

Create a second array of the same size using np.float16.

Use the .nbytes attribute on both arrays to see the total memory consumption in bytes.

Task: Calculate the ratio of memory saved by using float16 instead of float64.

In [12]:
#Create a 1D array of 1,000,000 zeros using np.float64
arr3 = np.zeros(1000000, dtype=np.float64)
#Create a 1D array of 1,000,000 zeros using np.float16
arr4 = np.zeros(1000000, dtype=np.float16)
print(arr3.nbytes) # check the number of bytes used by arr3
print(arr4.nbytes) # check the number of bytes used by arr4

#Calculate the ratio of memory saved by using float16 instead of float64
memory_saved = (arr3.nbytes - arr4.nbytes) / arr3.nbytes
print('Memory saved by using float16 instead of float64: ' + str(memory_saved * 100) + '%')

8000000
2000000
Memory saved by using float16 instead of float64: 75.0%
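The byte counts above follow from a simple rule: `nbytes` is the number of elements times the size of one element. The sketch below checks that rule for three float types, and also prints the largest value `float16` can hold as a reminder that the saving comes with a much smaller range. Variable names are illustrative.

```python
import numpy as np

for float_type in (np.float64, np.float32, np.float16):
    zeros = np.zeros(1_000_000, dtype=float_type)
    # nbytes equals the element count times the per-element size
    print(zeros.dtype.name, zeros.itemsize, zeros.nbytes,
          zeros.nbytes == zeros.size * zeros.itemsize)

# Largest finite float16 value, converted to a Python float for printing
float16_max = float(np.finfo(np.float16).max)
print(float16_max)  # 65504.0
```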


## Exercise 3: String Truncation and Fixed-Widths
**Goal:** Learn how NumPy handles strings differently than standard Python lists.

Create a NumPy array from the following list of names: ["Alice", "Bob", "Charlie"].

Check the dtype of the array (it should look something like <U7).

Try to change the first element ("Alice") to "Alexandria" using indexing: arr[0] = "Alexandria".

Question: What happens to the word "Alexandria" when you print the array?

Task: Re-create the array but explicitly set the dtype to a length that accommodates "Alexandria" (e.g., dtype='U10').

In [22]:
#Create a NumPy array from the following list of names: ["Alice", "Bob", "Charlie"].
names = np.array(["Alice", "Bob", "Charlie"])
print(names.dtype)
names[0] = "Alexandria"
print(names)

<U7
['Alexand' 'Bob' 'Charlie']


In [21]:
names = np.array(["Alice", "Bob", "Charlie"], dtype='U10')
names[0] = "Alexandria"
print(names)


['Alexandria' 'Bob' 'Charlie']
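Instead of hard-coding `U10`, the width can be computed from the data. This is a minimal sketch, assuming the replacement string is known before the array is created; `replacement` and `width` are illustrative names.

```python
import numpy as np

names = ["Alice", "Bob", "Charlie"]
replacement = "Alexandria"

# Width needed to hold every existing name plus the planned replacement
width = max(len(s) for s in names + [replacement])

arr = np.array(names, dtype=f'U{width}')
arr[0] = replacement  # fits, so nothing is truncated
print(width, arr)
```

If the needed width is not known in advance, an existing array can be widened with `astype` (for example `arr.astype('U20')`) before assigning longer strings.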


Next [Chapter](./2.%20Create%20an%20Array.ipynb)