# Data Compression Performance Analysis for Organizations Dataset: A Comparative Study

## 1 Introduction

In the digital era, managing and processing large datasets efficiently is a paramount challenge. **Data compression techniques provide a solution by reducing the size of data for storage and transmission, which leads to improved system performance.** This report delves into a comparative analysis of three popular data compression techniques - **gzip, snappy, and lz4** - using a dataset of organizations. Our objective is to evaluate the compression and decompression speeds of these methods, considering different storage media types, and to provide recommendations for their use in real-world scenarios.

## 2 Problem Statement and Refinement

The initial problem posed was to explore data compression techniques including **gzip, snappy, and lz4**, and evaluate their performance. To make this analysis more practical and relevant, we chose to work with an organizations dataset. This dataset comprises 100,000 records of organizations, each with various attributes such as Organization Id, Name, Website, Country, Description, Founded Year, Industry, and Number of employees.

## 3 Methodology

**Data Preparation:** The dataset was sourced and preprocessed to ensure consistency and accuracy. It contains a mix of categorical and numerical attributes, which reflects a real-world scenario.

**Data Cleaning:** Empty cells and missing values can hinder accurate analysis and modeling, so the data must be cleaned before use. During preprocessing, we identified and removed every column that contained empty cells.
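The cleaning step can be illustrated with a minimal sketch that uses only the standard library. The helper name `drop_columns_with_empty_cells` and the three-row sample are hypothetical, and the project itself may have used a dataframe library instead.

```python
import csv
import io


def drop_columns_with_empty_cells(rows):
    """Return rows with every column removed that has at least one empty cell.

    `rows` is a list of dicts sharing the same keys, as csv.DictReader yields.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    # A cell counts as empty when it is None or contains only whitespace.
    keep = [
        col for col in columns
        if all(row[col] is not None and row[col].strip() != "" for row in rows)
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical three-row sample; the real dataset has 100,000 records.
sample_csv = (
    "Organization Id,Name,Website,Country\n"
    "1,Acme,http://acme.example,Canada\n"
    "2,Globex,,Germany\n"
    "3,Initech,http://initech.example,Japan\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = drop_columns_with_empty_cells(rows)
print(list(cleaned[0].keys()))
```

In this sample the `Website` column is dropped because one of its cells is empty, while all three rows are kept.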

**Compression Implementation:** For each compression technique - **gzip, snappy, and lz4** - we implemented compression and decompression functions tailored for the dataset's schema. The dataset was converted to a CSV format for compression.
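A timing harness for one compress/decompress cycle might look like the sketch below. It is not the project's actual code: the function name `time_round_trip` and the generated payload are assumptions, and only gzip is shown because it ships with the standard library. Snappy and lz4 need third-party packages, whose compress and decompress functions could be passed in the same way.

```python
import gzip
import time


def time_round_trip(name, compress, decompress, payload):
    """Time one compress/decompress cycle and check that it is lossless."""
    start = time.perf_counter()
    compressed = compress(payload)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    if restored != payload:
        raise ValueError(f"{name}: round trip changed the data")
    return {
        "method": name,
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
        "compressed_bytes": len(compressed),
    }


# Hypothetical CSV payload standing in for the organizations dataset.
header = "Organization Id,Name,Country,Founded Year\n"
body = "".join(f"{i},Org {i},Canada,{1950 + i % 70}\n" for i in range(10_000))
payload = (header + body).encode("utf-8")

result = time_round_trip("gzip", gzip.compress, gzip.decompress, payload)
print(result["method"], result["compressed_bytes"], "bytes")
```

This sketch times in-memory work only. Measuring the effect of storage type would also require writing the compressed bytes to, and reading them from, a file on each device.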

**Analysis:** We compared the performance of compression techniques in terms of compression time, decompression time, and read/write speed. We also investigated the effect of storage types (NVMe, SSD, HDD) on these performance metrics.

**ANOVA Test:** We performed ANOVA tests to analyze whether there are statistically significant differences in the running times among the storage types for both decompression/reading and writing/compression.
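For readers unfamiliar with the test, a one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes the F-statistic by hand on hypothetical timings. It does not use the report's raw samples, which are not listed here, so it does not reproduce the reported statistics. In practice a statistics library is normally used, since it also returns the p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        # Spread of this group's mean around the grand mean, weighted by size.
        ss_between += len(g) * (mean - grand_mean) ** 2
        # Spread of the observations around their own group mean.
        ss_within += sum((x - mean) ** 2 for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical timings (seconds) for three storage types; not the report's data.
nvme = [1.0, 2.0, 3.0]
ssd = [2.0, 3.0, 4.0]
hdd = [3.0, 4.0, 5.0]
f_stat, df_b, df_w = one_way_anova_f([nvme, ssd, hdd])
print(f_stat, df_b, df_w)  # 3.0 2 6
```

A larger F means the group means differ more relative to the scatter inside each group. The p-value is then read from the F distribution with the two degrees of freedom.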

**Visualization:** To visualize our results, we created histograms that compare the running times of the compression methods for decompression/reading and writing/compression across the storage types.


## 2. Results and Findings

**Decompression/Reading Time Comparison:** 

**gzip:** NVMe - 0.1777s, SSD - 0.1234s, HDD - 0.1573s 

**snappy:** NVMe - 0.0491s, SSD - 0.1075s, HDD - 0.1473s 

**lz4:** NVMe - 0.0353s, SSD - 0.0544s, HDD - 0.0946s 

**Compression/Writing Time Comparison:**

**gzip:** NVMe - 1.9478s, SSD - 1.8003s, HDD - 2.1026s

**snappy:** NVMe - 0.9619s, SSD - 0.9948s, HDD - 1.3535s

**lz4:** NVMe - 1.0759s, SSD - 0.8906s, HDD - 1.2163s

**Storage Type Impact:**
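The choice of storage type affects decompression/reading times more clearly than compression/writing times. NVMe was the fastest medium for snappy and for lz4 decompression, but not in every case: several gzip and lz4 measurements were faster on SSD than on NVMe. HDD was the slowest medium in every measurement except gzip decompression.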
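The timings reported above can be tabulated to check, method by method, which storage type was fastest. The sketch below copies the reported values into dictionaries; the helper name `fastest_storage` is an assumption introduced for illustration.

```python
# Timings in seconds as reported in the results above.
decompress_read = {
    "gzip": {"NVMe": 0.1777, "SSD": 0.1234, "HDD": 0.1573},
    "snappy": {"NVMe": 0.0491, "SSD": 0.1075, "HDD": 0.1473},
    "lz4": {"NVMe": 0.0353, "SSD": 0.0544, "HDD": 0.0946},
}
compress_write = {
    "gzip": {"NVMe": 1.9478, "SSD": 1.8003, "HDD": 2.1026},
    "snappy": {"NVMe": 0.9619, "SSD": 0.9948, "HDD": 1.3535},
    "lz4": {"NVMe": 1.0759, "SSD": 0.8906, "HDD": 1.2163},
}


def fastest_storage(table):
    """Map each method to the storage type with the smallest time."""
    return {method: min(times, key=times.get) for method, times in table.items()}


fastest_read = fastest_storage(decompress_read)
fastest_write = fastest_storage(compress_write)
print("decompression/reading:", fastest_read)
print("compression/writing:  ", fastest_write)
```

The printed mappings show that the fastest storage type depends on both the method and the operation.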



## 3. ANOVA Testing

**Decompression and Reading Times ANOVA**

- F-statistic: 22.3297, P-value: 0.0003
- Statistically significant differences in runtime performance for data decompression and reading among storage types.

**Writing and Compression Times ANOVA**

- F-statistic: 1.4266, P-value: 0.2896
- No statistically significant evidence of runtime performance variations for data writing and compression across storage types.

**Interpretation:**

With a p-value of 0.2896, which is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is no statistically significant evidence to conclude that the runtime performance for data writing and compression significantly varies across different storage types.

These ANOVA tests provided insights into how different storage types impact the runtime performance of data operations. The results can help in making informed decisions about selecting appropriate storage solutions for specific use cases. Keep in mind that the interpretation of ANOVA results should be context-dependent and should consider practical implications as well.

## 4. Histogram Visualization

To better understand the runtime performance of different compression techniques across various storage types, we have created a histogram visualization. This visualization provides a clear overview of the time taken by each compression and decompression method on different storage media. The x-axis represents the time in seconds, and the y-axis indicates the frequency of occurrences.

Examining the histogram reveals trends and differences in the distribution of runtime for each compression technique. It shows how each method performs in terms of speed across the storage types: NVMe, SSD, and HDD.

The bars in the histogram correspond to:

- Gzip Compression and Decompression
- Snappy Compression and Decompression
- LZ4 Compression and Decompression

For each method, we've measured the time taken for reading and processing data, decompression, compression, and writing operations. This visualization serves as a visual summary of the comprehensive runtime analysis presented earlier, enhancing our understanding of how compression techniques and storage types interact to impact overall performance.

![Figure_1.png](attachment:Figure_1.png)

## 5. Recommendations

**1. Fastest Overall Performance:**

For scenarios prioritizing both fast compression and decompression, snappy compression combined with NVMe storage is the most suitable choice. This combination offers optimal speed and responsiveness.

**2. Balancing Performance and Cost:**

When working with budget constraints, consider using SSD storage along with snappy or lz4 compression. This trade-off may still provide substantial performance benefits.
For use cases where high-speed storage is not readily available, snappy compression with HDD storage can provide a noticeable improvement over uncompressed data.

**3. Use Case Specificity:**

The choice of compression and storage should align with the specific use case requirements. For example, if real-time data processing is crucial, prioritize the fastest compression algorithms and storage types.

**4. Unspecified Use Case:**

- Use Gzip if you need high compression ratios and are willing to accept slower read and write times.
- Use Snappy if you prioritize speed and moderate compression ratios.
- Use LZ4 if ultra-fast read and write times are critical, and you can tolerate larger compressed files.
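Because these guidelines trade speed against compression ratio, it can help to measure the ratio on your own data. The sketch below shows one way to do that with the standard-library `gzip` module at two compression levels. The payload is hypothetical and `compression_ratio` is a helper introduced here; actual ratios depend heavily on the data.

```python
import gzip


def compression_ratio(original, compressed):
    """Original size divided by compressed size; higher means smaller output."""
    return len(original) / len(compressed)


# Hypothetical repetitive CSV payload; real ratios depend on the data.
payload = "".join(
    f"{i},Org {i},https://org{i}.example,Canada\n" for i in range(5_000)
).encode("utf-8")

fast = gzip.compress(payload, compresslevel=1)   # favors speed
small = gzip.compress(payload, compresslevel=9)  # favors size
print(f"level 1 ratio: {compression_ratio(payload, fast):.2f}")
print(f"level 9 ratio: {compression_ratio(payload, small):.2f}")
```

The same helper applies to snappy or lz4 output, provided the corresponding third-party package is installed.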

## 6. Limitations and Future Considerations

1. Dataset Diversity:
Our analysis utilized a specific dataset, which might not encompass all possible scenarios. Different data characteristics could yield varied compression results.

2. Hardware and Configurations:
Results could vary based on hardware specifications and software configurations. Different hardware setups might exhibit different performance patterns.

3. Additional Algorithms and Storage Types:
Exploring more compression algorithms (such as zstd) and storage types (cloud-based, distributed storage) could offer a more comprehensive understanding of performance trade-offs.

4. Scalability and Parallel Processing:
The analysis did not delve into scalability and parallel processing capabilities, which can be crucial when dealing with larger datasets.

## 7. Conclusion

Data compression techniques play a pivotal role in optimizing storage space and enhancing data transfer speeds. Our analysis of the **gzip**, **snappy**, and **lz4** compression techniques across various storage media types (NVMe, SSD, and HDD) provides valuable insights for making informed decisions in real-world scenarios.

### Compression and Decompression Efficiency

When evaluating compression and decompression efficiency, the following observations were made:

- **NVMe:** NVMe drives delivered fast results for snappy and lz4. Snappy compression took 0.9619 seconds and lz4 compression took 1.0759 seconds. Decompression was faster still, at 0.0491 seconds for snappy and 0.0353 seconds for lz4.

- **SSD:** SSDs performed close to NVMe drives overall and were faster in some cases. Snappy compression took 0.9948 seconds and lz4 compression took 0.8906 seconds, the latter faster than on NVMe. Snappy and lz4 decompression on SSDs (0.1075 and 0.0544 seconds) were notably faster than on HDDs (0.1473 and 0.0946 seconds).

- **HDD:** Due to their mechanical nature, HDDs were the slowest medium in nearly every measurement. The one exception is gzip decompression, where HDD (0.1573 seconds) was faster than NVMe (0.1777 seconds). However, snappy compression demonstrated a notable improvement over uncompressed data, indicating its suitability for scenarios with HDD storage.

### Reading and Writing Operations

In addition to compression and decompression times, reading and writing operations are crucial considerations. The analysis revealed the following insights:

- **NVMe:** Reading and processing data times were consistent across all compression methods on NVMe drives. Writing times were slightly higher, with writing and snappy compression taking around 0.5635 seconds. Despite this, the overall efficiency of NVMe drives ensures that even with slightly increased writing times, data operations remain rapid.

- **SSD:** SSDs maintained a balance between reading, processing, and writing times. Writing times for snappy and lz4 compression were approximately 0.6764 seconds and 0.6552 seconds, respectively. While not as fast as NVMe, SSDs provide a cost-effective solution without significant compromises on speed.

- **HDD:** HDDs exhibited slower reading, processing, and writing times compared to SSDs and NVMe drives. However, even on HDDs, snappy compression resulted in notably improved writing times (around 0.9217 seconds), making it a reasonable choice for enhancing data operations.



 

## 8. Final Reflection

Data compression techniques play a pivotal role in enhancing system performance and resource utilization. The comprehensive analysis of **gzip, snappy, and lz4 compression techniques** on an organizations dataset has illuminated their strengths and limitations. By aligning compression methods and storage solutions with the specific demands of different applications, organizations can make informed decisions, ensuring optimal performance and efficient data operations. As technology continues to evolve, staying attuned to advancements in compression algorithms and storage technologies will remain essential for ensuring data processing remains swift and streamlined.


## 9. Project Experience

**Project experience (Rafid)**

Data compression/decompression analysis, Computational Data Science (CMPT 353)

- Learned about different compression and decompression techniques
- Analyzed performance for different sized files
- Compared data and made technical report

**Project experience (Nikhil)**

Data compression/decompression analysis, Computational Data Science (CMPT 353)

- Learned about different compression and decompression techniques
- Performed ANOVA testing to statistically analyze data
- Learned about different data visualization methods

