In [31]:
import pandas as pd

In [32]:
demo_data = {"Name":["Aman","Rahul","Shyam"],"Age":[20,22,32],"City":["Delhi","Ghaziabad","Mumbai"]}
df = pd.DataFrame(demo_data)
print(df)

    Name  Age       City
0   Aman   20      Delhi
1  Rahul   22  Ghaziabad
2  Shyam   32     Mumbai


In [33]:
df["Name"]+=" Surname"
print(df)

            Name  Age       City
0   Aman Surname   20      Delhi
1  Rahul Surname   22  Ghaziabad
2  Shyam Surname   32     Mumbai


In [34]:
df["Age"]+=5
print(df)

            Name  Age       City
0   Aman Surname   25      Delhi
1  Rahul Surname   27  Ghaziabad
2  Shyam Surname   37     Mumbai


In [35]:
df.iloc[:]

            Name  Age       City
0   Aman Surname   25      Delhi
1  Rahul Surname   27  Ghaziabad
2  Shyam Surname   37     Mumbai


In [36]:
demo = df[df["Age"] > 25]
print(demo)

            Name  Age       City
1  Rahul Surname   27  Ghaziabad
2  Shyam Surname   37     Mumbai


In [37]:
print(df.sort_values(by='Age',ascending=False))
print()
print(df)

            Name  Age       City
2  Shyam Surname   37     Mumbai
1  Rahul Surname   27  Ghaziabad
0   Aman Surname   25      Delhi

            Name  Age       City
0   Aman Surname   25      Delhi
1  Rahul Surname   27  Ghaziabad
2  Shyam Surname   37     Mumbai


In [38]:
df.sort_values(by='Age',ascending=False,inplace=True)
print(df)

            Name  Age       City
2  Shyam Surname   37     Mumbai
1  Rahul Surname   27  Ghaziabad
0   Aman Surname   25      Delhi


In [39]:
# set_index returns a new DataFrame; without assignment, df itself is unchanged
df.set_index("Name")
print(df)

            Name  Age       City
2  Shyam Surname   37     Mumbai
1  Rahul Surname   27  Ghaziabad
0   Aman Surname   25      Delhi
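
The output above is unchanged because `set_index` returns a new DataFrame rather than modifying `df`. A minimal sketch of capturing the result, using a small stand-in frame (the names `people` and `indexed` are illustrative, not from the notebook):

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Aman", "Rahul"], "Age": [25, 27]})

# set_index returns a new DataFrame; the original is left untouched
indexed = people.set_index("Name")

print(people.columns.tolist())      # 'Name' is still a regular column here
print(indexed.index.tolist())       # 'Name' values are now the row labels
print(indexed.loc["Rahul", "Age"])  # label-based lookup by name
```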


In [40]:
print(df.isnull())
print(df.dropna())

    Name    Age   City
2  False  False  False
1  False  False  False
0  False  False  False
            Name  Age       City
2  Shyam Surname   37     Mumbai
1  Rahul Surname   27  Ghaziabad
0   Aman Surname   25      Delhi


In [41]:
# fill missing values with 0
print(df.fillna(0))

            Name  Age       City
2  Shyam Surname   37     Mumbai
1  Rahul Surname   27  Ghaziabad
0   Aman Surname   25      Delhi
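
The frame above has no missing values, so `isnull`, `dropna`, and `fillna` leave it unchanged. A small sketch with an actual gap shows what each call does; the `scores` frame is made up for illustration:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"Name": ["Aman", "Rahul", "Shyam"],
                       "Score": [80.0, np.nan, 65.0]})

print(scores.isnull().sum())   # number of missing values per column
dropped = scores.dropna()      # drops the row with the missing Score
filled = scores.fillna(0)      # replaces the missing Score with 0
print(dropped)
print(filled)
```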


In [56]:
print(df['City'].value_counts())

City
Mumbai       1
Ghaziabad    1
Delhi        1
Name: count, dtype: int64


In [55]:
data1 = {"A":[1,2,3,2,1],"B":[3,2,6,2,3]}
df_data1 = pd.DataFrame(data1)
print(df_data1.drop_duplicates())

   A  B
0  1  3
1  2  2
2  3  6


In [57]:
def square(x):
    return x*x

df["Age_square"] = df["Age"].apply(square)
print(df)

            Name  Age       City  Age_square
2  Shyam Surname   37     Mumbai        1369
1  Rahul Surname   27  Ghaziabad         729
0   Aman Surname   25      Delhi         625
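
For simple arithmetic like squaring, a vectorized expression gives the same result as `apply` and is usually faster on large columns. A sketch on a stand-in Series (the name `ages` is illustrative):

```python
import pandas as pd

ages = pd.Series([37, 27, 25], name="Age")

via_apply = ages.apply(lambda x: x * x)   # calls the function once per element
vectorized = ages ** 2                    # operates on the whole column at once

print(vectorized.tolist())
```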


**Groupby**

In pandas, the **groupby** operation splits data into groups based on some criteria, applies a function to each group, and combines the results into a new data structure. It is similar to the GROUP BY clause in SQL.

**Steps in GroupBy**

- Splitting: Dividing the data into groups based on certain criteria (e.g., column values).
- Applying: Applying a function to each group (e.g., sum, mean, count, etc.).
- Combining: Combining the results into a new data structure.
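
The three steps can be sketched by hand to show what `groupby` does internally. This is only an illustration of the idea, not how pandas implements it; the names `df_steps`, `partial`, and `manual` are assumptions made for the example:

```python
import pandas as pd

df_steps = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                         "Values": [10, 20, 30, 40, 50]})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
# Apply: sum the 'Values' column of each piece
partial = {key: group["Values"].sum()
           for key, group in df_steps.groupby("Category")}

# Combine: collect the per-group results into one Series
manual = pd.Series(partial, name="Values")
print(manual)

# The one-line equivalent
print(df_steps.groupby("Category")["Values"].sum())
```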

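**Syntax:**

**df.groupby(by, axis=0, level=None, as_index=True, sort=True)**

- by: The column name(s), mapping, or function used to determine the groups.
- axis: Whether to group rows (axis=0) or columns (axis=1). This parameter is deprecated since pandas 2.1; to group columns, transpose first (df.T.groupby(...)).
- level: For a MultiIndex, the index level(s) to group by.
- sort: If True, the group keys are sorted in the result. Default is True.
- as_index: If True, the grouped columns become the index in the output. Default is True.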
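
The effect of `as_index` is easiest to see side by side. A minimal sketch, assuming a frame shaped like the sample used in this notebook (the names `df_idx`, `indexed`, and `flat` are illustrative):

```python
import pandas as pd

df_idx = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                       "Values": [10, 20, 30, 40, 50]})

# Default (as_index=True): 'Category' becomes the index of the result
indexed = df_idx.groupby("Category").sum()

# as_index=False: 'Category' stays a regular column with a default integer index
flat = df_idx.groupby("Category", as_index=False).sum()

print(indexed)
print(flat)
```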

In [61]:
# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Group by 'Category' and calculate the sum
grouped = df.groupby('Category')['Values'].sum()
print(grouped)

Category
A    90
B    60
Name: Values, dtype: int64


### **Common Aggregations**
You can use many functions after `groupby` to aggregate data:

| **Operation**   | **Description**                                  | **Example**                                  |
|------------------|--------------------------------------------------|----------------------------------------------|
| `sum()`          | Sum of values in each group                     | `df.groupby('col').sum()`                   |
| `mean()`         | Average of values in each group                 | `df.groupby('col').mean()`                  |
| `count()`        | Count of non-NA values in each group            | `df.groupby('col').count()`                 |
| `min()` / `max()`| Minimum or maximum value in each group          | `df.groupby('col').min()`                   |
| `size()`         | Count of all elements in each group (includes NA)| `df.groupby('col').size()`                  |
| `std()` / `var()`| Standard deviation or variance in each group    | `df.groupby('col').std()`                   |
| `first()`/`last()`| First or last value in each group              | `df.groupby('col').first()`                 |

---

### **Custom Functions with `apply()`**
You can define custom aggregation functions and use `apply`.

```python
# Custom function to calculate range
range_func = lambda x: x.max() - x.min()

grouped = df.groupby('Category')['Values'].apply(range_func)
print(grouped)
```

**Output:**
```
Category
A    40
B    20
Name: Values, dtype: int64
```
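
Because the range function reduces each group to a single value, `agg` accepts it as well, and named aggregation lets the result column carry a descriptive label. A sketch of that alternative; the function name `value_range` and the frame `df_range` are illustrative:

```python
import pandas as pd

df_range = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                         "Values": [10, 20, 30, 40, 50]})

def value_range(s):
    """Spread between the largest and smallest value in a group."""
    return s.max() - s.min()

# Named aggregation: output_column=(input_column, function)
result = df_range.groupby("Category").agg(value_range=("Values", value_range))
print(result)
```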

---

### **Grouping by Multiple Columns**
You can group by more than one column by passing a list.

```python
data = {'Category': ['A', 'A', 'B', 'B'],
        'Sub-Category': ['X', 'Y', 'X', 'Y'],
        'Values': [10, 20, 30, 40]}
df = pd.DataFrame(data)

grouped = df.groupby(['Category', 'Sub-Category'])['Values'].sum()
print(grouped)
```

**Output:**
```
Category  Sub-Category
A         X              10
          Y              20
B         X              30
          Y              40
Name: Values, dtype: int64
```

---

### **Transform and Filter**
1. **Transform**: Used to apply a function and return results aligned with the original DataFrame size.
```python
df['Mean'] = df.groupby('Category')['Values'].transform('mean')
```

2. **Filter**: Used to filter out groups based on some condition.
```python
filtered = df.groupby('Category').filter(lambda x: x['Values'].sum() > 50)
```
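
A runnable sketch that puts both calls together on a small frame, so the shapes of the results are visible (`df_tf` is a stand-in with the same categories and values as the multi-column example):

```python
import pandas as pd

df_tf = pd.DataFrame({"Category": ["A", "A", "B", "B"],
                      "Values": [10, 20, 30, 40]})

# transform: one result per original row, so it can be assigned as a column
df_tf["Mean"] = df_tf.groupby("Category")["Values"].transform("mean")

# filter: keep every row of the groups whose total exceeds 50
filtered = df_tf.groupby("Category").filter(lambda g: g["Values"].sum() > 50)

print(df_tf)
print(filtered)
```

Here `transform` returns four values (one per row), while `filter` drops group `A` entirely because its total is 30.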

---

### **Example Use Cases**
1. Aggregating sales data by product category.
2. Calculating average scores grouped by students' grades.
3. Analyzing trends grouped by time periods (e.g., daily, monthly).
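
As one illustration of the time-period use case, rows can be grouped by calendar month by converting a datetime column to monthly periods. The `sales` data below is invented for the example:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20",
                            "2024-02-03", "2024-02-17"]),
    "amount": [100, 150, 200, 50],
})

# Group rows by calendar month and total the amounts
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```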
