*Numerical data types in Python, datetime in pandas, modulus operator `%`, `pd.options.display.max_info_columns`, `memory_usage()`, Save order, Parquet and Feather, `psutil` library, managing memory in notebooks*

![](designer.jpeg)

# Working plan for the next weeks
- [ ] Flood Competition. Modelling the data and making predictions
- [ ] AMEX competition
- [ ] Concrete competition. Modelling the data and making predictions
- [ ] Shapley 
- [ ] Inference and prediction
- [ ] PCA
- [ ] Target Transformations
- [ ] Richard Dawkins explanation of FP and FN

# Daily Note - 16/05/2024

## 1. Difference between `int32` and `int64` in Python

In Python libraries like pandas and numpy, we can use `int32` and `int64` to represent integers. The difference between them is the amount of memory they use. `int32` uses 32 bits (4 bytes) to represent an integer, while `int64` uses 64 bits (8 bytes). This means that `int64` can represent larger numbers than `int32`. For example, `int32` can represent numbers from -2,147,483,648 to 2,147,483,647, while `int64` can represent numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
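
The ranges and sizes above can be checked directly with NumPy. This is a minimal sketch, assuming NumPy is installed; the array `values` is a made-up example, not data from the competition.

```python
import numpy as np

# Print the width and representable range of each integer type
for dtype in (np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.bits // 8} bytes, range {info.min} to {info.max}")

# The same one million integers stored at each width
values = np.arange(1_000_000)
small = values.astype(np.int32)
large = values.astype(np.int64)
print(small.nbytes, large.nbytes)  # bytes used by each array
```

The `int64` copy takes twice the memory of the `int32` copy, so downcasting is worthwhile whenever the values fit in the smaller range.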

## 2. Float Numbers data types in Python

NumPy and pandas offer several floating-point data types for representing real numbers with a fractional part. The two main ones are `float32` and `float64`; Python's built-in `float` is equivalent to `float64`.

A `float64` (or double precision) stores numbers with approximately 15-17 decimal digits of precision and requires 8 bytes per number. A `float32` (or single precision) stores numbers with about 6-9 decimal digits of precision and requires only 4 bytes per number.

Converting from `float64` to `float32` can save half the memory usage without a significant loss in precision for many applications, although the exact impact depends on the specific data and requirements.

A `float16` (or half precision) provides even less precision, about 3-4 decimal digits, and requires only 2 bytes per number.
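
To see the trade-off between memory and precision, the following sketch compares the three types with `np.finfo` and measures the rounding error introduced by downcasting. It assumes NumPy is installed; the `prices` array is random illustrative data.

```python
import numpy as np

# Size and guaranteed decimal precision of each float type
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits // 8} bytes, ~{info.precision} decimal digits")

# One million random values in [0, 1), stored as float64 by default
prices = np.random.default_rng(0).random(1_000_000)
prices32 = prices.astype(np.float32)

print(prices.nbytes, prices32.nbytes)    # memory before and after downcasting
max_err = np.abs(prices - prices32).max()  # largest absolute rounding error
print(max_err)
```

Whether an error of this size matters depends on the application, so it is worth measuring it on the real data before downcasting.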

## 3. Convert year, month, day to `int8`

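When working with date columns in pandas, it is common to split them into year, month, and day columns and store those as `int8` to save memory. The `int8` data type can store integers from -128 to 127, which is enough for the month (1-12) and the day (1-31). A four-digit year such as 2024 does not fit, so the year must first be reduced to its last two digits (or stored as `int16` instead).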

`int8` occupies only 1 byte of memory per entry, whereas `int32` uses 4 bytes and `int64` uses 8 bytes. This difference becomes significant when dealing with large datasets.
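
A minimal sketch of the conversion, assuming pandas is installed. The `dates` series is hypothetical sample data. The year is reduced modulo 100 before casting, since a four-digit year is outside the `int8` range.

```python
import pandas as pd

# Hypothetical datetime column with 1,000 consecutive days
dates = pd.Series(pd.date_range("2017-03-01", periods=1_000, freq="D"))

parts = pd.DataFrame({
    "year": (dates.dt.year % 100).astype("int8"),  # two-digit year fits in int8
    "month": dates.dt.month.astype("int8"),
    "day": dates.dt.day.astype("int8"),
})

print(parts.dtypes)
print(parts.memory_usage(index=False))  # 1 byte per entry in each column
```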

## 4. Remainder and modulus operator `%`

The remainder is the amount left over after performing a division operation between two numbers. For example, when you divide 17 by 5, the quotient is 3 and the remainder is 2.

The modulus operator `%` returns the remainder of dividing one number by another. It is often used to check whether a number is even or odd (`n % 2`), or to extract the trailing digits of a number: `n % 10` gives the last digit and `n % 100` the last two. In pandas, you can extract the last two digits of a year:
```python
data['S_2'].dt.year % 100
```
Where `data['S_2']` is a datetime column and `dt.year` extracts the year from the datetime column.
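
The same operations in plain Python, as a quick check of the arithmetic above. The variable names are only for illustration.

```python
# Quotient and remainder of 17 divided by 5
quotient, remainder = divmod(17, 5)
print(quotient, remainder)   # 3 2

# Even/odd check
is_even = 10 % 2 == 0
print(is_even)               # True

# Last two digits of a year
short_year = 2024 % 100
print(short_year)            # 24

# In Python the result takes the sign of the divisor
negative_case = -7 % 5
print(negative_case)         # 3
```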

## 5. `pd.options.display.max_info_columns`

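`pd.options.display.max_info_columns` is a pandas option that controls the maximum number of columns for which `df.info()` prints per-column details. If a DataFrame has more columns than this limit, `df.info()` falls back to a short summary. Raising the limit is useful when working with datasets that have many columns.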

```python
pd.options.display.max_info_columns = 300
```

## 6. `memory_usage()` method in pandas

The `memory_usage()` method in pandas calculates the memory usage of a DataFrame. By default, it returns the memory usage of each column in bytes. For `object` columns this default count only covers the references, not the Python objects (for example strings) they point to. Pass `deep=True` to introspect those objects and get their real system-level memory consumption. Documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html).

```python
data.memory_usage(deep=True)['customer_ID']
```

## 7. Save order, Parquet and Feather

When working with large datasets, choosing the right file format based on **save order** and **compression** can be important.

### Save Order

The save order refers to how data is physically stored in a file. It can be either row-oriented or column-oriented. Row-oriented storage is when data is stored row by row, while column-oriented storage is when data is stored column by column. 

CSV files save data in a row-wise manner. Each row is written sequentially, and a reader typically processes the file row by row. If your analysis or processing often requires accessing complete rows at a time, then CSV might be suitable. However, if you only need specific columns, CSV is inefficient because every row still has to be read and parsed in full. Also keep in mind that **CSV files don't support compression natively and don't preserve datatypes.**
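
The loss of datatypes can be shown with a round trip through CSV. This sketch assumes pandas is installed and uses an in-memory buffer instead of a file; the DataFrame is made-up sample data.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "day": pd.Series([1, 2, 3], dtype="int8"),
    "when": pd.to_datetime(["2024-05-14", "2024-05-15", "2024-05-16"]),
})

# Write to CSV and read it back without any hints
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)

print(df.dtypes)        # int8 and datetime64
print(restored.dtypes)  # int8 became int64, the dates came back as text

# The types have to be supplied again by hand when reading
buf.seek(0)
typed = pd.read_csv(buf, dtype={"day": "int8"}, parse_dates=["when"])
print(typed.dtypes)
```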


### Parquet and Feather

Parquet and Feather are designed to store data column by column. This layout is particularly beneficial for analytical processing, where queries often involve specific columns across a wide range of rows. Both formats support compression and preserve datatypes.

So if your data access is mostly columnar, then use Parquet or Feather. Use Parquet if you need efficient storage and excellent compression, or use Feather if you need fast read and write times.

For the AMEX competition, where datasets are typically large, efficiency is crucial. Parquet is often the preferred choice due to its performance benefits in terms of storage, partial reads, and data type preservation.
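
A minimal sketch of writing both formats and reading back only some columns. It assumes pandas with the optional `pyarrow` package, which `to_parquet` and `to_feather` rely on here, so the sketch checks for it first. The DataFrame, file names, and temporary directory are made up for the example.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],                          # placeholder IDs
    "day": pd.Series([1, 2, 3], dtype="int8"),
    "amount": pd.Series([1.5, 2.5, 3.5], dtype="float32"),
})

# pyarrow is an optional dependency; skip the round trip if it is missing
HAVE_PYARROW = importlib.util.find_spec("pyarrow") is not None

if HAVE_PYARROW:
    with tempfile.TemporaryDirectory() as tmp:
        parquet_path = os.path.join(tmp, "sample.parquet")
        feather_path = os.path.join(tmp, "sample.feather")

        df.to_parquet(parquet_path, compression="snappy")
        df.to_feather(feather_path)

        # Partial read: only the requested columns are loaded
        subset = pd.read_parquet(parquet_path, columns=["day", "amount"])
        from_feather = pd.read_feather(feather_path)

    print(subset.dtypes)        # int8 and float32 are preserved
    print(from_feather.dtypes)
```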

## 8. `psutil` library

`psutil` is a Python library that provides an interface for retrieving information on running processes and system utilization. It can be used to monitor system resources like CPU, memory, disk, and network usage. Documentation [here](https://psutil.readthedocs.io/en/latest/)

I've created a function that returns the available system memory in gigabytes.

```python
import psutil

def available_memory_gb():
    return psutil.virtual_memory().available / (1024**3)
```


## 9. Managing memory with notebooks

When working with large datasets in Jupyter notebooks, it is important to manage memory efficiently to avoid running out of memory. 

In Python, memory management is handled automatically: an object is freed as soon as nothing references it anymore, and the garbage collector additionally cleans up objects that reference each other in cycles. However, in some cases, especially when working with large data structures, it can be beneficial to intervene manually to ensure memory is freed more promptly.

Here are some tips to manage memory in notebooks:


### Use `del` to delete variables

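When you no longer need a variable, use the `del` statement to delete it. `del` removes the name, and the object itself is freed once no other references to it remain, which makes that memory available for other operations. In a notebook, keep in mind that other variables or cached cell outputs may still hold a reference to the same object.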

```python
import gc

del variable_name  # Delete the variable to free up memory
gc.collect()  # Explicitly call garbage collector
```
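
The following sketch shows why `del` alone is sometimes not enough: the object stays alive as long as another name still refers to it. It uses only the standard library; `BigBuffer`, `buffer`, `alias`, and `probe` are hypothetical names standing in for a large object such as a DataFrame.

```python
import gc
import weakref

class BigBuffer:
    """Hypothetical stand-in for a large object such as a DataFrame."""
    def __init__(self, n_bytes):
        self.payload = bytearray(n_bytes)

buffer = BigBuffer(10_000_000)
alias = buffer                # a second reference to the same object
probe = weakref.ref(buffer)   # lets us check whether the object still exists

del buffer
alive_after_first_del = probe() is not None    # True: alias still holds it

del alias
gc.collect()
alive_after_second_del = probe() is not None   # False: no references remain

print(alive_after_first_del, alive_after_second_del)
```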

### Restart the kernel

If you are running out of memory, restarting the kernel can help free up memory. This will clear all variables and objects from memory, allowing you to start fresh.

### Monitor System Memory

You can use system tools like the task manager on Windows, or `top` and `htop` on Linux to monitor system memory usage.

### Monitor GPU use (NVIDIA GPUs)

The most straightforward way to monitor GPU usage is to use the `nvidia-smi` command in the terminal. This command provides real-time information about GPU utilization, memory usage, and temperature.

You can also run it in a continuous monitoring mode by running `nvidia-smi -l 1` in the terminal, which refreshes the output every second.

When training deep learning models, it is quite common to use `nvidia-smi dmon` to monitor GPU usage.