
In [1]:
import pandas as pd
import dask.dataframe as dd
import time
import os


In [2]:
path = "/content/drive/MyDrive/data/2019-Oct.csv"

In [5]:
# Preview the first five rows only, instead of parsing the whole file
df = pd.read_csv(path, nrows=5)
print(df.head(5))

                event_time event_type  product_id          category_id  \
0  2019-10-01 00:00:00 UTC       view    44600062  2103807459595387724   
1  2019-10-01 00:00:00 UTC       view     3900821  2053013552326770905   
2  2019-10-01 00:00:01 UTC       view    17200506  2053013559792632471   
3  2019-10-01 00:00:01 UTC       view     1307067  2053013558920217191   
4  2019-10-01 00:00:04 UTC       view     1004237  2053013555631882655   

                         category_code     brand    price    user_id  \
0                                  NaN  shiseido    35.79  541312140   
1  appliances.environment.water_heater      aqua    33.20  554748717   
2           furniture.living_room.sofa       NaN   543.10  519107250   
3                   computers.notebook    lenovo   251.74  550050854   
4               electronics.smartphone     apple  1081.98  535871217   

                           user_session  
0  72d76fde-8bb3-4e00-8c23-a032dfed738c  
1  9333dfbd-b87a-4708-9857-6336556b0fc

**Method 1: pandas.read_csv(chunksize)**

In [None]:
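start = time.time()

chunk_size = 500000
chunk_means = []

# Stream the file in 500,000-row chunks so only one chunk is in memory at a time
for chunk in pd.read_csv(path, chunksize=chunk_size):
    chunk_means.append(chunk["price"].mean())

# Unweighted average of the per-chunk means: every chunk counts equally,
# so a shorter final chunk makes this an approximation of the overall mean
mean_price_chunks = sum(chunk_means) / len(chunk_means)

end = time.time()
time_chunks = end - start

# Rough footprint estimate: in-memory size of a 100,000-row sample
# (this is not the peak RAM of the loop above)
df_sample = pd.read_csv(path, nrows=100000)
storage_pandas = df_sample.memory_usage(deep=True).sum() / (1024**3)

print("✅ [1] Pandas + chunksize")
print(f"Average price: {mean_price_chunks:.2f}")
print(f"Time taken: {time_chunks:.2f} sec\n")
print(f"Memory used  {storage_pandas:.2f} GB\n")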


✅ [1] Pandas + chunksize
Average price: 290.33
Time taken: 128.71 sec

Memory used  0.03 GB
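Averaging the per-chunk means gives every chunk the same weight, even though the final chunk is usually shorter and chunks can hold different numbers of missing prices. The sketch below shows one way to get the exact mean instead, by accumulating a running sum and count. The helper name `chunked_mean` and the tiny demo data are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def chunked_mean(source, column="price", chunk_size=500_000):
    """Exact mean of one column, accumulated chunk by chunk."""
    total = 0.0
    count = 0
    # usecols limits parsing to the single column that is needed
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunk_size):
        values = chunk[column]
        total += values.sum()    # sum() skips NaN by default
        count += values.count()  # count() counts only non-NaN values
    return total / count


# Tiny in-memory demo with made-up prices (not the 2019-Oct data)
demo_csv = "price\n10\n20\n30\n40\n100\n"

exact = chunked_mean(io.StringIO(demo_csv), chunk_size=2)

# The mean-of-means approach from the cell above, for comparison
naive_means = [c["price"].mean() for c in pd.read_csv(io.StringIO(demo_csv), chunksize=2)]
naive = sum(naive_means) / len(naive_means)

print(f"exact: {exact:.1f}, mean of chunk means: {naive:.1f}")
# exact: 40.0, mean of chunk means: 50.0
```

On the real file the call would be `chunked_mean(path)`. With 500,000-row chunks the two results should differ far less than in this demo, since only the last chunk is short.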



**Method 2: Dask**

In [None]:
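start = time.time()

# Lazy read: Dask splits the CSV into partitions and parses them only on compute()
df_dask = dd.read_csv(path)
mean_price_dask = df_dask["price"].mean().compute()

end = time.time()
time_dask = end - start

# In-memory size of the whole dataset, summed over all partitions. This triggers
# a second full pass over the file and is not the peak RAM Dask needed.
storage_dask = df_dask.memory_usage(deep=True).sum().compute() / (1024**3)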


print("✅ [2] Dask")
print(f"Average price: {mean_price_dask:.2f}")
print(f"Time taken: {time_dask:.2f} sec\n")
print(f"Memory used : {storage_dask:.2f} GB\n")


✅ [2] Dask
Average price: 290.32
Time taken: 110.42 sec

Memory used : 6.16 GB
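The 6.16 GB figure is the deep in-memory size of all nine columns, while the mean only needs `price`. The sketch below uses plain pandas on three rows copied from the preview above to show how `usecols` shrinks the footprint. As far as I know, `dd.read_csv` forwards extra keyword arguments such as `usecols` to `pandas.read_csv`, but that is an assumption not verified here.

```python
import io

import pandas as pd

# Three rows taken from the preview above, reduced to three columns for brevity
demo_csv = (
    "event_type,brand,price\n"
    "view,shiseido,35.79\n"
    "view,aqua,33.20\n"
    "view,lenovo,251.74\n"
)

# Parse every column, as the notebook does
full = pd.read_csv(io.StringIO(demo_csv))

# Parse only the column the computation needs
price_only = pd.read_csv(io.StringIO(demo_csv), usecols=["price"])

full_bytes = full.memory_usage(deep=True).sum()
price_bytes = price_only.memory_usage(deep=True).sum()

print(f"all columns: {full_bytes} bytes, price only: {price_bytes} bytes")
```

String columns such as `brand` and `user_session` dominate the deep memory size, so dropping them at read time is typically the cheapest saving available.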



**Compression**

In [None]:

# Name the compressed copy after the source file (2019-Oct.csv)
compressed_path = "/content/drive/MyDrive/data/2019-Oct.csv.gz"

# Time the full round trip: read the CSV, then write it back gzip-compressed
start = time.time()

# Read the original file into memory
df = pd.read_csv(path)

# Save it compressed in gzip format
df.to_csv(compressed_path, index=False, compression='gzip')

end = time.time()
time_compress = end - start

# File sizes on disk, in GB
original_size_gb = os.path.getsize(path) / (1024**3)
compressed_size_gb = os.path.getsize(compressed_path) / (1024**3)


print("✅ [3] Compression (gzip)")
print(f"Original size: {original_size_gb:.2f} GB")
print(f"Compressed size: {compressed_size_gb:.2f} GB")
print(f"Compression time: {time_compress:.2f} sec")

# Space saving: percentage by which the compressed file is smaller than the original
compression_ratio = (1 - compressed_size_gb / original_size_gb) * 100
print(f"Compression ratio: {compression_ratio:.1f}% smaller")


✅ [3] Compression (gzip)
Original size: 5.28 GB
Compressed size: 1.58 GB
Compression time: 832.68 sec
Compression ratio: 70.1% smaller
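The cell above parses the whole CSV into a DataFrame before writing it back, which costs both RAM and time. A minimal sketch of an alternative, assuming a byte-for-byte compressed copy is acceptable, streams the file through the standard-library `gzip` module in fixed-size blocks. The helper name `gzip_file` and the temporary demo file are illustrative assumptions.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path, buffer_size=1024 * 1024):
    """Compress src_path into dst_path block by block, never loading it whole."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, buffer_size)
    return os.path.getsize(dst_path)


# Demo on a small temporary file with made-up content (not the 2019-Oct data)
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "demo.csv")
    dst_file = src_file + ".gz"

    with open(src_file, "wb") as f:
        f.write(b"price\n" + b"35.79\n" * 10_000)

    compressed_size = gzip_file(src_file, dst_file)
    original_size = os.path.getsize(src_file)

    # Decompress and compare with the original bytes
    with gzip.open(dst_file, "rb") as gz, open(src_file, "rb") as raw:
        round_trip_ok = gz.read() == raw.read()

print(f"{original_size} bytes -> {compressed_size} bytes, identical: {round_trip_ok}")
```

Unlike `df.to_csv(..., compression='gzip')`, this keeps the original text exactly as it is rather than re-serializing the values. The result can still be read with `pd.read_csv`, which infers gzip compression from the `.gz` extension by default.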


**Comparison of results**

In [None]:
print("=== 🧩 Comparison Summary ===")
print(f"Pandas + chunksize: {time_chunks:.2f}s | {storage_pandas:.2f} GB RAM")
print(f"Dask: {time_dask:.2f}s | {storage_dask:.2f} GB RAM")
print(f"Compression: {time_compress:.2f}s | {compressed_size_gb:.2f} GB on disk")

=== 🧩 Comparison Summary ===
Pandas + chunksize: 103.08s | 0.03 GB RAM
Dask: 99.72s | 6.16 GB RAM
Compression: 832.68s | 1.58 GB on disk


| Method | Time | Memory usage |
| --- | --- | --- |
| **Chunk** | Average | Low |
| **Dask** | Fast | Average |
| **Compression** | Slow | Low (size on disk) |
