<h1 style='color:blue'>🔹 Dask for Big Data Analysis 🔹</h1>

Dask is a **parallel computing** library for Python, designed for handling **large datasets** efficiently.
It scales familiar **pandas, NumPy, and scikit-learn** workflows to data that is too large for a single machine's memory.

<h2 style='color:green'>✅ Why Use Dask?</h2>
- **Parallel Computing**: Uses multiple CPU cores.
- **Lazy Evaluation**: Computes only when needed.
- **Scalability**: Works on personal machines and clusters.
- **Familiar API**: Similar to pandas and NumPy.

<h2 style='color:purple'>🔹 Step 1: Install Dask</h2>
First, install **Dask** if it's not already installed.

In [None]:
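# The "dataframe" extra pulls in pandas; aiohttp is needed to read files over HTTP(S)
!pip install "dask[dataframe]" aiohttp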

<h2 style='color:purple'>🔹 Step 2: Import Required Libraries</h2>
Now, let's import **Dask DataFrame**, which mirrors much of the pandas API but splits the data into partitions that can be processed independently.

In [None]:
import dask.dataframe as dd

<h2 style='color:purple'>🔹 Step 3: Load a Large CSV File</h2>
Instead of loading the entire dataset into memory (as pandas does), **Dask splits it into partitions** and reads each one only when a computation needs it.

In [None]:
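# Load the dataset (a small height/weight sample file; a large CSV is read the same way)
# skipinitialspace=True strips the space after each comma so the quoted column names parse cleanly
df = dd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv', skipinitialspace=True)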

# Display first few rows
df.head()

<h2 style='color:purple'>🔹 Step 4: Compute Statistics</h2>
Dask uses **lazy evaluation**: operations only build a task graph, and nothing runs until `.compute()` is called (a few methods, such as `.head()`, compute right away).

In [None]:
# Compute summary statistics
df.describe().compute()

<h2 style='color:purple'>🔹 Step 5: Parallel Computation</h2>
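Dask runs operations **in parallel** by processing partitions on multiple CPU cores. A file as small as this example typically fits in a single partition, so the speedup only shows up on larger inputs.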

In [None]:
# Compute mean of a column
df['Height(Inches)'].mean().compute()

<h2 style='color:purple'>🔹 Step 6: Data Visualization</h2>
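Dask does not plot by itself. The usual pattern is to convert the data (or a reduced or sampled version of it) to pandas and plot with matplotlib. Converting the full dataset, as done here, is only safe when it fits in memory.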

In [None]:
import matplotlib.pyplot as plt

# Convert to pandas for visualization (loads the whole dataset into memory; fine for this small file)
df_pd = df.compute()

# Plot histogram
df_pd.hist(figsize=(8, 5))
plt.show()

<h2 style='color:blue'>🚀 Conclusion</h2>
- **Dask** helps process large datasets that **don't fit in memory**.
- Works like pandas but processes data **in parallel**.
- **Ideal for Big Data analysis** on personal machines or cloud environments.