# Data Exploration and Summary

In [28]:
import pandas as pd
import numpy as np

## 3. Value Counts & Frequency Analysis

This section focuses on understanding how frequently different values occur in your dataset, which is crucial for:
- **categorical columns**
- **deduplication**
- **imbalance detection**
- **distribution analysis**

In [3]:
data = {
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai', 'Delhi', 'Chennai', 'Kolkata', 'Mumbai'],
    'Sales': [200, 300, 150, 400, 300, 250, 450, 500, 300]
}

df = pd.DataFrame(data)
df

|     | City    | Sales |
| --- | ------- | ----- |
| 0   | Delhi   | 200   |
| 1   | Mumbai  | 300   |
| 2   | Delhi   | 150   |
| 3   | Chennai | 400   |
| 4   | Mumbai  | 300   |
| 5   | Delhi   | 250   |
| 6   | Chennai | 450   |
| 7   | Kolkata | 500   |
| 8   | Mumbai  | 300   |


### 12. `df['col'].value_counts()`
Counts how often each unique value occurs and returns the counts sorted **in descending order**. Missing values (`NaN`) are excluded by default.

In [5]:
df['City'].value_counts()

City
Delhi      3
Mumbai     3
Chennai    2
Kolkata    1
Name: count, dtype: int64

* **Sort ascending**:

In [6]:
df['City'].value_counts(ascending=True)

City
Kolkata    1
Chennai    2
Delhi      3
Mumbai     3
Name: count, dtype: int64

* **Include missing values** (if any) with `dropna=False`. `df` has no missing values, so the output here is unchanged:

In [8]:
df['City'].value_counts(dropna=False)

City
Delhi      3
Mumbai     3
Chennai    2
Kolkata    1
Name: count, dtype: int64

In [29]:
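# Append one all-NaN row so that dropna has a visible effect.
# Sales becomes float because NaN is a float; the new row keeps its own
# index label 0 (pass ignore_index=True to renumber the rows).
df1 = pd.concat([df, pd.DataFrame([[np.nan, np.nan]], columns=['City', 'Sales'])], axis=0)
df1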

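|     | City    | Sales |
| --- | ------- | ----- |
| 0   | Delhi   | 200.0 |
| 1   | Mumbai  | 300.0 |
| 2   | Delhi   | 150.0 |
| 3   | Chennai | 400.0 |
| 4   | Mumbai  | 300.0 |
| 5   | Delhi   | 250.0 |
| 6   | Chennai | 450.0 |
| 7   | Kolkata | 500.0 |
| 8   | Mumbai  | 300.0 |
| 0   | NaN     | NaN   |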


In [33]:
df1['City'].value_counts()

City
Delhi      3
Mumbai     3
Chennai    2
Kolkata    1
Name: count, dtype: int64

In [34]:
df1['City'].value_counts(dropna=False)

City
Delhi      3
Mumbai     3
Chennai    2
Kolkata    1
NaN        1
Name: count, dtype: int64
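
For numeric columns such as `Sales`, `value_counts()` can also group values into intervals, which helps with the distribution analysis mentioned at the start of this section. The following is a minimal sketch that rebuilds the `Sales` values as a standalone Series (the names `sales` and `binned` are chosen here for illustration) and uses the `bins` parameter to split the data range into equal-width intervals:

```python
import pandas as pd

# Same Sales values as in the example DataFrame above
sales = pd.Series([200, 300, 150, 400, 300, 250, 450, 500, 300], name='Sales')

# bins=3 cuts the range of the data into 3 equal-width intervals and
# counts the values in each; sort=False keeps the intervals in
# ascending order instead of sorting by count
binned = sales.value_counts(bins=3, sort=False)
print(binned)
```

With this data each of the three intervals should contain three values. The interval edges are derived from the minimum and maximum of the column, so they change whenever the data changes.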

### 13. `df['col'].value_counts(normalize=True)`
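Returns the **relative frequency** of each value instead of its count, i.e. a proportion between 0 and 1 as a float. Missing values are left out of the calculation unless `dropna=False` is passed.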

In [35]:
df1.City.value_counts(normalize=True)

City
Delhi      0.333333
Mumbai     0.333333
Chennai    0.222222
Kolkata    0.111111
Name: proportion, dtype: float64

In [37]:
# Multiply by 100 to express the proportions as percentages:

df1['City'].value_counts(normalize=True) * 100

City
Delhi      33.333333
Mumbai     33.333333
Chennai    22.222222
Kolkata    11.111111
Name: proportion, dtype: float64

In [38]:
df1['City'].value_counts(normalize=True, dropna=False) * 100

City
Delhi      30.0
Mumbai     30.0
Chennai    20.0
Kolkata    10.0
NaN        10.0
Name: proportion, dtype: float64
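
To make the difference between the two denominators explicit, the proportions can be reproduced by hand. This is a small sketch rather than a description of how pandas computes them internally; it rebuilds the `City` column of `df1` as a standalone Series, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# City column of df1: nine cities plus one missing value
city = pd.Series(
    ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai',
     'Delhi', 'Chennai', 'Kolkata', 'Mumbai', np.nan],
    name='City',
)

# Default: NaN is excluded, so the denominator is the 9 non-missing rows
counts = city.value_counts()
manual = counts / counts.sum()
builtin = city.value_counts(normalize=True)

# dropna=False: NaN is counted, so the denominator is all 10 rows
counts_all = city.value_counts(dropna=False)
manual_all = counts_all / len(city)

print(manual['Delhi'], manual_all['Delhi'])
```

This is why Delhi shows up as about 33.3% in the first case (3 of 9) and 30% in the second (3 of 10).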

### 14. `df['col'].unique()` and `df['col'].nunique()`

* `.unique()` → Returns an **array** of the distinct values in order of first appearance (`NaN` included)
* `.nunique()` → Returns the **number** of distinct values (`NaN` excluded by default)

In [39]:
df1['City'].unique()

array(['Delhi', 'Mumbai', 'Chennai', 'Kolkata', nan], dtype=object)

In [42]:
df1['City'].nunique()

4

You can also count unique values **including NaNs**:

In [41]:
df1['City'].nunique(dropna=False)

5
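
The three methods in this section are consistent with each other, which can serve as a quick sanity check. A minimal sketch, again rebuilding the `City` column of `df1` as a standalone Series with illustrative variable names:

```python
import numpy as np
import pandas as pd

city = pd.Series(
    ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai',
     'Delhi', 'Chennai', 'Kolkata', 'Mumbai', np.nan],
    name='City',
)

distinct = city.unique()                  # distinct values, NaN included
n_with_nan = city.nunique(dropna=False)   # NaN counted as a value
n_without_nan = city.nunique()            # NaN ignored (default)
n_counted = len(city.value_counts())      # value_counts also drops NaN by default

print(len(distinct), n_with_nan, n_without_nan, n_counted)
```

The length of `unique()` matches `nunique(dropna=False)`, while the default `nunique()` matches the number of rows returned by `value_counts()`.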

### ✅ Summary Table

| Method                         | Purpose                       | Output Type                         |
| ------------------------------ | ----------------------------- | ----------------------------------- |
| `value_counts()`               | Frequency count per value     | Series of integers, sorted descending |
| `value_counts(normalize=True)` | Proportion per value          | Series of floats, sorted descending |
| `unique()`                     | Distinct values               | NumPy array                         |
| `nunique()`                    | Number of distinct values     | Integer                             |

---

### 🔍 Real-Life Use Cases

| Use Case                            | Relevant Method                     |
| ----------------------------------- | ----------------------------------- |
| Checking class imbalance            | `value_counts()` / `value_counts(normalize=True)` |
| Finding unique categories in column | `unique()` / `nunique()`            |
| Data cleaning (detect typos)        | `value_counts()`                    |
| Encoding categorical values later   | `unique()` (for mapping to numbers) |
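
Two of the use cases above, detecting typos and building a mapping for encoding, can be sketched together. The messy `raw` column below is hypothetical data invented for this illustration, and the clean-up steps are one possible approach rather than a fixed recipe:

```python
import pandas as pd

# Hypothetical messy column: inconsistent case, stray whitespace, a misspelling
raw = pd.Series(['Delhi', 'delhi ', 'Mumbai', 'Mumbai', 'Dehli', 'Chennai'],
                name='City')

# Rare labels in value_counts() are a hint that something is off
print(raw.value_counts())

# Normalize whitespace and case, then fix the true misspelling explicitly
cleaned = raw.str.strip().str.title().replace({'Dehli': 'Delhi'})

# Map each category to an integer code, in order of first appearance
mapping = {name: code for code, name in enumerate(cleaned.unique())}
encoded = cleaned.map(mapping)
print(mapping)
print(encoded.tolist())
```

Building the mapping from `unique()` keeps the codes tied to the order in which categories first appear, so the same data always yields the same codes. For larger projects, a dedicated encoder or the pandas `category` dtype may be a better fit.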
