# **Big Data Analysis with Google Colab**
This notebook demonstrates how to handle large datasets using **Pandas** and **Dask** in Google Colab.

In [None]:
# Step 1: Install required libraries
!pip install dask

## **Step 2: Generate a Large Dataset**
We simulate a dataset of 10 million rows with four columns (`id`, `category`, `value`, `date`) and save it as a CSV file.

In [None]:
import pandas as pd
import numpy as np

# Generate a large dataset with 10 million rows
num_rows = 10_000_000
data = {
    'id': np.arange(1, num_rows + 1),
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=num_rows),
    'value': np.random.rand(num_rows) * 100,
    'date': pd.date_range(start='1/1/2020', periods=num_rows, freq='s')  # lowercase 's' = seconds; uppercase 'S' is deprecated in recent pandas
}

df = pd.DataFrame(data)
df.to_csv('large_dataset.csv', index=False)
print("Dataset saved as 'large_dataset.csv'")
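Building all 10 million rows in a single DataFrame keeps the whole dataset in memory before it is written. If that becomes a problem, the file can be written in pieces instead. The following is a minimal sketch under assumptions: the helper name `write_dataset_in_chunks`, the chunk size, and the scaled-down row count are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_dataset_in_chunks(path, num_rows, chunk_size, seed=0):
    """Write the synthetic dataset chunk by chunk and return the rows written.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    rng = np.random.default_rng(seed)
    start_date = pd.Timestamp("2020-01-01")
    written = 0
    for start in range(0, num_rows, chunk_size):
        n = min(chunk_size, num_rows - start)
        chunk = pd.DataFrame({
            "id": np.arange(start + 1, start + n + 1),
            "category": rng.choice(["A", "B", "C", "D"], size=n),
            "value": rng.random(n) * 100,
            # One row per second, continuing where the previous chunk ended
            "date": start_date + pd.to_timedelta(np.arange(start, start + n), unit="s"),
        })
        first = start == 0
        # Create the file with a header once, then append without a header
        chunk.to_csv(path, mode="w" if first else "a", header=first, index=False)
        written += n
    return written


# Scaled-down demo so the cell runs quickly
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dataset.csv")
rows_written = write_dataset_in_chunks(demo_path, num_rows=25_000, chunk_size=10_000)
demo = pd.read_csv(demo_path, parse_dates=["date"])
print(rows_written, len(demo))
```

The same function could be called with `num_rows=10_000_000` and a larger `chunk_size`; the right chunk size depends on the memory available in the Colab session.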

## **Step 3: Load the Large Dataset Using Dask**
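Dask reads the CSV as a set of partitions and evaluates operations lazily, so the full file never has to fit in memory at once. Work runs only when a result is requested, for example with `.head()` or `.compute()`.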

In [None]:
import dask.dataframe as dd

# Read the large dataset with Dask
df_dask = dd.read_csv('large_dataset.csv')

# Display the first few rows
df_dask.head()

## **Step 4: Perform Big Data Analysis**

In [None]:
# 1. Count Rows
df_dask.shape[0].compute()

In [None]:
# 2. Summary Statistics
df_dask.describe().compute()

In [None]:
# 3. Group Data by Category
df_grouped = df_dask.groupby('category')['value'].mean().compute()
print(df_grouped)
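A grouped mean can be computed over data that is processed in pieces because per-group sums and counts can be added across pieces and divided at the end. The sketch below illustrates that idea with plain pandas. It is not Dask's actual implementation, and the helper name `chunked_group_mean` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def chunked_group_mean(chunks, key, column):
    """Combine per-chunk sums and counts into an overall per-group mean."""
    totals = None
    for chunk in chunks:
        # Partial result for this chunk: one row per group
        partial = chunk.groupby(key)[column].agg(["sum", "count"])
        # Add to the running totals; groups missing on one side count as 0
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals["sum"] / totals["count"]


# Toy data split into four pieces
rng = np.random.default_rng(42)
frame = pd.DataFrame({
    "category": rng.choice(["A", "B", "C", "D"], size=10_000),
    "value": rng.random(10_000) * 100,
})
chunks = (frame.iloc[i:i + 2_500] for i in range(0, len(frame), 2_500))

chunked = chunked_group_mean(chunks, "category", "value")
direct = frame.groupby("category")["value"].mean()
print(chunked)
```

With the real file, the chunks could come from `pd.read_csv('large_dataset.csv', chunksize=1_000_000)`, which yields DataFrames one at a time instead of loading the whole CSV.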

In [None]:
# 4. Filter Large Values
df_filtered = df_dask[df_dask['value'] > 90]
# head() computes only the rows it needs; compute().head() would first load every matching row into pandas
df_filtered.head()

## **Step 5: Visualizing Data**

In [None]:
import matplotlib.pyplot as plt

# df_grouped is already a pandas Series (the result of .compute()); turn it into a DataFrame for plotting
df_plot = df_grouped.to_frame().reset_index()

# Bar plot of category-wise average value
plt.figure(figsize=(8, 5))
plt.bar(df_plot['category'], df_plot['value'], color=['blue', 'green', 'red', 'purple'])
plt.xlabel("Category")
plt.ylabel("Average Value")
plt.title("Average Value per Category")
plt.show()