<table class="table table-bordered">
    <tr>
        <th style="width:200px;">
            <img src='https://bcgriseacademy.com/hs-fs/hubfs/RISE%202.0%20Logo_Options_25Jan23_RISE%20-%20For%20Black%20Background.png?width=3522&height=1986&name=RISE%202.0%20Logo_Options_25Jan23_RISE%20-%20For%20Black%20Background.png' style="background-color:black; width: 100%; height: 100%;">
        </th>
        <th style="text-align:center;">
            <h1>IBF TFIP</h1>
            <h2>Pandas I - Data Analysis using Pandas</h2>
        </th>
    </tr>
</table>

# Learning Objectives
#### After completing this lesson, you should be able to:

1. LO1 : Understand Pandas Data Manipulation
2. LO2 : Understand Pandas Functions
3. LO3 : Apply Data Manipulation using various Python Functions




# Table of Contents <a id='tc'></a>

1. [Python Functions: Pandas Data Manipulation](#p1)
2. [Pandas Functions: Writing Data](#p2)
3. [Hands-On Practice Exercise](#p3)
4. [Kahoot Quiz](#)

# 1. Python Functions: Pandas Data Manipulation <a id='p1' />


## 1.1 Replacing missing values

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../Data/weather_data_missing_replace.csv')
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,-99999,7,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,0
4,1/5/2017,32,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,0


Although this data has no explicit missing values, numbers such as `-99999` are meaningless placeholders. We can convert these special values to nulls (`NaN`) so that pandas identifies them as missing and we can handle them as required.

In [None]:
new_df = df.replace(-99999, np.nan)  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
new_df

In [None]:
# If there are multiple special values, pass a list of the values to replace with NaN.

new_df = df.replace([-99999, -88888], np.nan)
new_df

In [None]:
df

On further inspection, some rows have an **event** value of 0. These are missing values too.
Here, we replace the special values in the **temperature** and **windspeed** columns and the 0 in the **event** column.

In [None]:
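# Map each column to the value that marks "missing" in that column.
# In this file both numeric columns use -99999; event is read as text, so its marker is the string '0'.
new_df = df.replace({
    'temperature': -99999,
    'windspeed': -99999,
    'event': '0'
}, np.nan)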
new_df
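As an alternative to calling `replace` after loading, the same markers can be declared while reading the file through the `na_values` parameter of `pd.read_csv`. The sketch below is a minimal illustration that assumes the rows shown earlier; it reads them from an inline string (the `csv_text` variable is made up for this example) so that it runs without the CSV file.

```python
import io

import pandas as pd

# Inline copy of the rows displayed above, so the sketch runs without the CSV file.
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,-99999,7,Sunny
1/3/2017,28,-99999,Snow
1/4/2017,-99999,7,0
1/5/2017,32,-99999,Rain
1/6/2017,31,2,Sunny
1/6/2017,34,5,0
"""

# A dict lets each column have its own list of "missing" markers.
df_na = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"temperature": [-99999], "windspeed": [-99999], "event": ["0"]},
)

# Count the NaN values that were created in each column.
missing_per_column = df_na.isna().sum()
print(missing_per_column)
```

Declaring the markers at load time keeps the cleaning rule next to the file it applies to, while `replace` is handy when the data is already in memory.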

In [None]:
# loading another data
df = pd.read_csv('./data/weather_data_missing_replace1.csv')

df

Here, the special values `-99999` and `-88888` appear in the dataset. We will replace these with nulls, and we will also replace `No Event` in the **event** column with `Sunny`.

In [None]:
# A single dict can map each old value to its own replacement.
new_df1 = df.replace({
    -99999: np.nan,
    -88888: np.nan,
    'No Event': 'Sunny'
})

new_df1

Sometimes the dataset may have units like the one below.

In [None]:
df = pd.read_csv('./data/weather_data_missing_replace2.csv')
df

Here, we would like to remove the units C, mph from the data.

In [None]:
# Remove the units (mph, C) from the data
new_df = df.replace('[A-Za-z]', '', regex=True)  # removes every letter, in every column
new_df

Notice, however, that this also wipes out all the text in the **event** column, which is not what we want. Instead, we want to strip the letters only from the **temperature** and **windspeed** columns.

In [None]:
# Here we specify the columns from where we want to remove these values.
new_df1 = df.replace({
        'temperature': '[A-Za-z]',
        'windspeed': '[A-Za-z]'
},'',regex=True)
new_df1

Let's create another DataFrame. Here, we want to replace a list of values with another list of values.

In [None]:
listDF = pd.DataFrame({
        'score':['exceptional','average','good','poor','average','exceptional'],
        'student':['rob','maya','parthiv','tom','julian','erica']
})
listDF


In [None]:
# Replace poor -> 1, average -> 2, good -> 3, exceptional -> 4
newListDF = listDF.replace(['poor','average','good','exceptional'],[1,2,3,4])
newListDF

## 1.2 Handling missing values



In [None]:
# We can read the data by parsing column as date and setting the date to index
df = pd.read_csv('./data/weather_data_missing.csv', parse_dates=["day"], index_col= 'day')
df

### 1.2.1 Replacing NaN with 0

The easiest and most straightforward way to deal with NaN is to replace the missing values with 0s. Whether 0 is a sensible substitute depends on what the column measures, so this choice needs to be justified, but that's a separate discussion.

In [None]:
# replacing nulls with 0.
new_df = df.fillna(0)
new_df

The problem here is that the nulls in **event** also got replaced with 0.

In [None]:
# We can choose the columns that we want to replace with 0s and customize the replacement.
new_df = df.fillna({
'temperature':0,
'windspeed':0,
'event':'no event'
}) 

new_df

### 1.2.2 Replacing NaN with mean, median, mode

In [None]:
# replacing nulls with mean. For temperature column, we will replace the nulls with mean.
new_df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
new_df

In [None]:
# replacing nulls with median. For windspeed column, we will replace the nulls with median.
new_df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())
new_df

In [None]:
# replacing nulls with mode. For event column, we will replace the nulls with mode.
new_df['event'] = df['event'].fillna(df['event'].mode()[0])
new_df

### 1.2.3 Replacing missing values with bfill & ffill (backward fill & forward fill)

In [None]:
# Backward fill: each NaN takes the next valid value below it (the following day's value)
new_df_bfill = df.bfill()  # fillna(method="bfill") is deprecated in recent pandas versions
new_df_bfill

In [None]:
df

In [None]:
# Forward fill: each NaN takes the previous valid value; limit=1 fills at most one consecutive NaN
new_df = df.ffill(limit=1)
new_df

### 1.2.4 Replacing NaN with interpolation

In [None]:
df

In [None]:
new_df = df.interpolate() #works on numeric data 
new_df

In [None]:
new_df = df.interpolate(method ='time')
new_df

`method='time'`: This parameter specifies the interpolation method to be used. When using method='time', pandas will perform time-based interpolation, considering the time information present in the DataFrame's index. It is particularly useful when dealing with time series data, where the missing values can be interpolated based on the time intervals between data points.

Time-based interpolation is most useful when the time intervals between data points are irregular, because each missing value is weighted by how far (in time) it lies from its neighbours. When the intervals are constant, it gives the same result as the default linear interpolation. It requires a `DatetimeIndex`, which is why we parsed `day` as dates and set it as the index when loading the data.
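The difference between the default linear method and `method='time'` shows up on a small series with an irregular date index. This is a minimal sketch with made-up values (the series `s` is not part of the lesson data): the missing point is one day after the first observation and three days before the last one.

```python
import numpy as np
import pandas as pd

# Irregular dates: the gap after the missing value is three times the gap before it.
idx = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

# Default (linear) treats the rows as equally spaced and takes the midpoint.
linear = s.interpolate()

# method="time" weights by elapsed time: 1 day out of 4 between the known points.
time_based = s.interpolate(method="time")

print(linear.iloc[1])      # 30.0
print(time_based.iloc[1])  # 20.0
```

With evenly spaced dates the two results would be identical.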

### 1.2.5 Dropping null values

Instead of imputing missing values, we can sometimes choose to drop them. This is useful when the number of missing values is negligible.

In [None]:
df

In [None]:
# Drop rows with missing values
new_df = df.dropna()  # by default, drops every row that has at least one NaN
new_df

In [None]:
df

In [None]:
# Drop only the rows where every column is NaN. Keep a lookout for row 7 (index: 2017-01-09)
new_df = df.dropna(how ="all")
new_df

In [None]:
# thresh=1 keeps every row that has at least one valid (non-NaN) value,
# so only rows that are entirely NaN are dropped
new_df = df.dropna(thresh=1)
new_df

In [None]:
new_df = df.dropna(thresh=2)  # keep only the rows that have at least two valid (non-NaN) values
new_df

### 1.2.6 Inserting Missing dates

Here, in the dataframe there are some dates missing, like `2017-01-02`, `2017-01-03`. We would like to add all the missing dates too even if that means having nulls in the temperature, windspeed and event columns. Later, we can impute these missing values with the appropriate method. 

In [None]:
# inserting missing dates. The date format is month - day - year
dt = pd.date_range("01-01-2017","01-11-2017") 

idx = pd.DatetimeIndex(dt)
new_df = new_df.reindex(idx)
new_df

In [None]:
new_df.index

In [None]:
# To change the date format (here: year-day-month). Note that strftime turns the index into strings.
new_df.index = new_df.index.strftime("%Y-%d-%m")

In [None]:
new_df

## 1.3 Merge, Join & Concatenate DataFrames

### 1.3.1 Merging Dataframes

Merging two DataFrames in pandas allows you to combine data from different sources based on common columns or indices. There are several ways to perform merging in pandas, and the appropriate method depends on the relationship between the data in the two DataFrames.

In [None]:
df1 = pd.DataFrame({
   "city": ["new york","chicago","orlando","baltimore"],
    "State": ['US','US','US', 'US'],
    "temperature": [21,14,35,32]
})
df1

In [None]:
df2 = pd.DataFrame({
   "city": ["new york","orlando","chicago"],
    "State": ['US','US','US'],
    "humidity": [65,68,75]
})
df2

*__pd.merge() can be used to merge DataFrames. It performs an inner join by default.__*

In [None]:
df3 = pd.merge(df1,df2,on="city")
df3

*__Experiment with how = "left", "right", "outer" to see the different types of join:__*
![image.png](attachment:image.png)

In [None]:
df3 = pd.merge(df1,df2,on="city", how="left", suffixes=('_left','_right')) 
df3
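To compare the join types side by side, the sketch below counts the rows each `how` option returns and uses `indicator=True` to show where each row came from. It uses its own small frames (`temps` and `humid`); the extra city `san diego` and its humidity value are made up for illustration so that each table has a row the other lacks.

```python
import pandas as pd

temps = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
# "san diego" is a hypothetical extra row, present only in this table.
humid = pd.DataFrame({
    "city": ["new york", "orlando", "chicago", "san diego"],
    "humidity": [65, 68, 75, 71],
})

# Number of rows returned by each join type.
row_counts = {
    how: len(pd.merge(temps, humid, on="city", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# indicator=True adds a _merge column: both, left_only or right_only.
outer = pd.merge(temps, humid, on="city", how="outer", indicator=True)
print(outer)
```

The `_merge` column is a quick way to check which keys failed to match before deciding which join type to keep.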

##### When to use join vs merge:
__Main differences between df.join() and df.merge():__

- lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
- lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
- left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df1.merge(df2) does an inner join by default (returns only matching rows of df1 and df2).

*So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.*



In [None]:
df1.join(df2,lsuffix="_l",rsuffix="_r")

The above join happened on index. If you want to join on any other column, you can set that column as an index first and then perform the join.
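A minimal sketch of joining on a column rather than on the row position, using small frames defined here (`temps` and `humid` are illustrative names, with the same values as the city tables above). It shows two options: setting the column as the index on both sides, or keeping the left frame as it is and passing `on=`.

```python
import pandas as pd

temps = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
humid = pd.DataFrame({
    "city": ["new york", "orlando", "chicago"],
    "humidity": [65, 68, 75],
})

# Option 1: make city the index on both sides, then join index to index.
joined = temps.set_index("city").join(humid.set_index("city"))

# Option 2: keep the left frame unchanged and match its city column
# against the index of the right frame.
joined_on = temps.join(humid.set_index("city"), on="city")

print(joined)
print(joined_on)
```

Both are left joins, so `baltimore` is kept and its humidity is NaN.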

### 1.3.3 Concat

In pandas, the concat() function is used to concatenate (combine) DataFrames along a particular axis, either vertically or horizontally. It allows you to join multiple DataFrames into a single DataFrame based on their indices or columns. The concat() function is quite flexible and can handle various concatenation scenarios. 

![image.png](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png)

###### Column wise concatenation

In [None]:
result = pd.concat([df1, df2], axis=1, sort=False)
result

###### Row-wise concatenation

In [None]:
result = pd.concat([df1, df2], axis=0, sort=False)
result

## 1.4 Groupby 



`groupby()` is a powerful method in pandas that groups the rows of a DataFrame by one or more columns so that you can perform operations on each group. It is particularly useful for data aggregation, transformation, and analysis, especially when dealing with large datasets. `groupby()` is often combined with aggregation functions such as sum(), mean(), count(), max() and min() to compute summary statistics for each group.

![image.png](attachment:image.png)

In [None]:
df = pd.read_csv('./data/weather_data_groupBy.csv')  # forward slashes work on every OS and avoid the invalid "\w" escape

In [None]:
df

In [None]:
# average temperature for each city
df.groupby('city')['temperature'].mean() 

In [None]:
# maximum windspeed of events
df.groupby('event')['windspeed'].max()

In [None]:
# average temperature of city and event
df.groupby(['city','event'])['temperature'].mean()

In [None]:
# Calculate the average temperature and maximum windspeed for each city
# (string names are preferred; passing np.mean / np.max triggers a FutureWarning in recent pandas)
df.groupby('city').agg({'temperature': 'mean', 'windspeed': 'max'})

## 1.5 CrossTab <a id='p1.5' /> 

The crosstab() function in pandas is used to compute a cross-tabulation (also known as a contingency table) of two or more factors (variables). It is a quick way to summarize and analyze the relationship between categorical variables in a DataFrame.

![image.png](attachment:image.png)

In [None]:

df = pd.read_excel("./data/crosstab_data.xlsx")
df

In [None]:
pd.crosstab(df.Nationality, df.Handedness, margins=True)

### 1.5.1 Putting multiple fields in one column

In [None]:
pd.crosstab(df.Sex, [df.Handedness, df.Nationality], margins=True)

### 1.5.2 Calculating % and Averages in Crosstab

In [None]:
# Average Age for each Sex / Handedness combination
pd.crosstab(df.Sex, df.Handedness, values=df.Age, aggfunc='mean')
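For percentages, `crosstab` accepts a `normalize` argument. The sketch below is a minimal illustration on a small made-up table (the `people` frame is hypothetical and only borrows the column names `Sex` and `Handedness` from the lesson data); `normalize="index"` divides each row by its row total.

```python
import pandas as pd

# Hypothetical sample with the same column names as the lesson data.
people = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "F", "M"],
    "Handedness": ["Right", "Left", "Right", "Right", "Left", "Right"],
})

# Share of each handedness within each sex, expressed as a percentage.
pct = pd.crosstab(people.Sex, people.Handedness, normalize="index") * 100
print(pct.round(1))
```

Use `normalize="columns"` for column percentages or `normalize="all"` for shares of the grand total.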

## 1.6 Stack/UnStack

 Stack and unstack are two important methods used to transform data between "wide" and "long" formats. These methods are primarily used to work with hierarchical or multi-level indexed data structures, such as pandas Series and DataFrames.
 * Stack: The stack method pivots the columns of a DataFrame into rows, effectively converting a "wide" DataFrame into a "long" one. It takes the column labels and moves them into the innermost level of the index, creating a hierarchical index with two or more levels. The result is similar to "melting" a DataFrame with `melt()`, except that stack keeps the labels in the index rather than in regular columns.

 * Unstack: The unstack method is the reverse operation of stack. It pivots the rows of a DataFrame into columns, converting a "long" DataFrame back into a "wide" one. It takes a level of the index (the innermost one by default) and moves it back to the column labels.


![image.png](attachment:image.png)

Some examples of stacking:

- Imagine you have a DataFrame with sales data, where each row represents a specific product and has separate columns for sales in different regions. You can use stack to transform the DataFrame into a long format, where each row represents a single sale with the product, region, and corresponding sales value.

| Product | Region A Sales | Region B Sales | Region C Sales |
|---------|----------------|----------------|----------------|
|   P1    |      100       |      150       |      200       |
|   P2    |      50        |      80        |      120       |

After applying stack, the DataFrame becomes:

| Product | Region       | Sales |
|---------|--------------|-------|
|   P1    | Region A     | 100   |
|   P1    | Region B     | 150   |
|   P1    | Region C     | 200   |
|   P2    | Region A     | 50    |
|   P2    | Region B     | 80    |
|   P2    | Region C     | 120   |

- Consider a DataFrame that contains stock prices for multiple companies over different dates, where each column represents a company's stock price, and each row represents a specific date. By using stack, you can convert this wide DataFrame into a long-format DataFrame with three columns: date, company, and stock price.

|    Date    | Company A | Company B | Company C |
|------------|-----------|-----------|-----------|
| 2023-08-01 |    100    |    150    |    200    |
| 2023-08-02 |    120    |    180    |    220    |

After applying stack, the DataFrame becomes:

|    Date    | Company | Stock Price |
|------------|---------|-------------|
| 2023-08-01 | A       |    100      |
| 2023-08-01 | B       |    150      |
| 2023-08-01 | C       |    200      |
| 2023-08-02 | A       |    120      |
| 2023-08-02 | B       |    180      |
| 2023-08-02 | C       |    220      |

- Imagine you have a DataFrame that contains annual revenue data for different products, where each column represents a year, and each row represents a specific product's revenue. By using stack, you can convert this wide DataFrame into a long-format DataFrame with three columns: product, year, and revenue, making it easier to analyze trends and compare revenue between products and years.

|  Product  | 2022 Sales | 2023 Sales |
|-----------|------------|------------|
|    P1     |    100     |    150     |
|    P2     |    80      |    120     |

After applying stack, the DataFrame becomes:

|  Product  |   Year   | Sales |
|-----------|----------|-------|
|    P1     |   2022   |  100  |
|    P1     |   2023   |  150  |
|    P2     |   2022   |  80   |
|    P2     |   2023   |  120  |
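The first example above (sales by region) can be reproduced with a short sketch. `stack` itself returns a Series with a two-level index; the extra `rename_axis`, `rename` and `reset_index` calls are only there to turn that result into the flat three-column table shown. The variable names `wide` and `long_df` are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    "Product": ["P1", "P2"],
    "Region A": [100, 50],
    "Region B": [150, 80],
    "Region C": [200, 120],
})

long_df = (
    wide.set_index("Product")        # Product becomes the row index
        .rename_axis(columns="Region")  # name the column axis so the new level is called Region
        .stack()                     # move the region labels into the index
        .rename("Sales")             # name the resulting Series
        .reset_index()               # turn both index levels back into columns
)
print(long_df)
```

Without the final `reset_index()`, the result stays a Series indexed by (Product, Region), which is what the code cells below display.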


In [None]:


# Create a DataFrame with multiple columns
df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Category': ['X', 'Y', 'X', 'Y'],
    'Value1': [10, 20, 30, 40],
    'Value2': [50, 60, 70, 80]
})
df
# Reshape the DataFrame using stack
stacked = df.set_index(['Group', 'Category']).stack()

# Reshape the stacked DataFrame using unstack
unstacked = stacked.unstack()

# Print the original DataFrame, stacked DataFrame, and unstacked DataFrame
print("Original DataFrame:\n", df)
print("\nStacked DataFrame:\n", stacked)
print("\nUnstacked DataFrame:\n", unstacked)


In [None]:
df.stack()

In [None]:
# Reshape the DataFrame using stack
stacked = df.set_index(['Group', 'Category']).stack()
stacked

![image.png](attachment:image.png)

Some examples of unstacking:

- Suppose you have a DataFrame containing survey data, where each row represents a respondent, and one of the columns contains the survey question number, while another column contains the corresponding answer. You can use unstack to pivot the survey questions into columns and the answers as values, creating a more readable and organized summary of the survey results.

| Respondent | Question | Answer |
|------------|----------|--------|
|     R1     |   Q1     |   Yes  |
|     R1     |   Q2     |   No   |
|     R2     |   Q1     |   No   |
|     R2     |   Q2     |   Yes  |

After applying unstack, the DataFrame becomes:

| Respondent | Q1   | Q2   |
|------------|------|------|
|     R1     | Yes  | No   |
|     R2     | No   | Yes  |

- Suppose you have a DataFrame with student exam scores, where the index represents student names, and the columns contain exam subjects. Swapping the two axes puts the exam subjects in the index and the student names in the columns, making it easier to compare performance across subjects. Note that the index here has a single level, so a bare `df.unstack()` would return a long Series instead; the table below is what `df.T` or `df.stack().unstack(level=0)` produces.

|     Student     | Math | Science | History |
|-----------------|------|---------|---------|
|      Alice      |  90  |   85    |   78    |
|       Bob       |  80  |   92    |   88    |
|     Charlie     |  85  |   78    |   92    |

After swapping the axes, the DataFrame becomes:

| Subject | Alice | Bob | Charlie |
|---------|-------|-----|---------|
|  Math   |   90  |  80 |   85    |
| Science |   85  |  92 |   78    |
| History |   78  |  88 |   92    |

- Suppose you have a DataFrame whose rows are reporting periods and whose columns are regions. Swapping the two axes puts the regions in the rows and the periods in the columns, which makes it easier to compare periods for each region. The index here also has a single level, so this layout comes from `df.T` or `df.stack().unstack(level=0)` rather than from a bare `df.unstack()`.

|           | Region A | Region B | Region C |
|-----------|----------|----------|----------|
| 2019 Q1   |   100    |   150    |   200    |
| 2019 Q2   |   120    |   180    |   220    |
| 2020 Q1   |   130    |   160    |   210    |
| 2020 Q2   |   140    |   170    |   230    |

After swapping the axes, the DataFrame becomes:

|           | 2019 Q1 | 2019 Q2 | 2020 Q1 | 2020 Q2 |
|-----------|---------|---------|---------|---------|
| Region A  |   100   |   120   |   130   |   140   |
| Region B  |   150   |   180   |   160   |   170   |
| Region C  |   200   |   220   |   210   |   230   |
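A minimal sketch of the survey example above, plus a check of what `unstack` does when the index has only one level. The frames `survey` and `scores` are small illustrative copies of the tables shown; `scores` keeps just two students and two subjects.

```python
import pandas as pd

survey = pd.DataFrame({
    "Respondent": ["R1", "R1", "R2", "R2"],
    "Question": ["Q1", "Q2", "Q1", "Q2"],
    "Answer": ["Yes", "No", "No", "Yes"],
})

# Two index levels: unstack moves the innermost one (Question) into the columns.
wide = survey.set_index(["Respondent", "Question"])["Answer"].unstack()
print(wide)

scores = pd.DataFrame(
    {"Math": [90, 80], "Science": [85, 92]},
    index=["Alice", "Bob"],
)

# Single-level index: unstack returns a Series keyed by (subject, student).
flat = scores.unstack()
print(flat)

# Swapping the axes gives the subject-by-student table.
by_subject = scores.T
print(by_subject)
```

The second half is why a two-level index is set up first in the code cells of this section: `unstack` needs an index level to move.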


In [None]:
unstacked = stacked.unstack()

unstacked

In [None]:
df.stack().unstack()

## 1.7 Pivot Table

In pandas, a pivot table is a way to summarize and analyze data in a tabular format by transforming a DataFrame. It allows you to reorganize and reshape data, making it easier to perform analysis and gain insights from your data. A pivot table aggregates and groups data based on one or more columns, while applying one or more aggregate functions to the values in another column.


![image.png](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_pivot.png)

In [None]:
df = pd.read_csv('./data/weather_data_groupBy.csv')  # forward slashes work on every OS and avoid the invalid "\w" escape

df

In [None]:
df.pivot_table(index = 'city', columns = 'event', values = 'windspeed',aggfunc='sum')

### Summary

Comparison of Crosstab, Stack/Unstack and Pivot

| Method   | Description                                                                                                          | When to Use                                                                                                       |
|----------|----------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| crosstab | A function in pandas that computes a cross-tabulation of two or more factors. It calculates the frequency distribution | When you want to analyze the relationship between two or more categorical variables and their frequency of occurrence. |
| stack    | A method in pandas that moves the column labels of a DataFrame into the row index, creating a "long" format.          | When you have multiple columns representing different categories and you want to convert them into rows for analysis. |
| unstack  | A method in pandas that moves a level of the row index into the columns, creating a "wide" format.                    | When you have a hierarchical (multi-level) index and want to convert one level of the index into columns.             |
| pivot_table | A method in pandas used to create a pivot table, aggregating and summarizing data based on specified index and columns. | When you want to transform the data to create a summary table based on specified index and columns with aggregated data. |
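These tools overlap: a frequency table can be built with any of them. The sketch below is a minimal illustration on made-up weather rows (the `weather` frame is hypothetical and only borrows the column names used in this lesson) and counts rows per city and event in three equivalent ways.

```python
import pandas as pd

# Hypothetical rows with the same column names as the weather data.
weather = pd.DataFrame({
    "city": ["mumbai", "mumbai", "mumbai", "paris", "paris"],
    "event": ["Rain", "Sunny", "Rain", "Sunny", "Sunny"],
    "windspeed": [6, 7, 12, 20, 13],
})

# 1. crosstab counts co-occurrences directly.
via_crosstab = pd.crosstab(weather.city, weather.event)

# 2. pivot_table counts the non-null windspeed values in each cell.
via_pivot = weather.pivot_table(
    index="city", columns="event", values="windspeed",
    aggfunc="count", fill_value=0,
)

# 3. groupby + size gives a long result; unstack spreads event into columns.
via_groupby = weather.groupby(["city", "event"]).size().unstack(fill_value=0)

print(via_crosstab)
```

`crosstab` is the shortest for plain counts, while `pivot_table` and `groupby` are more convenient once other aggregations (sum, mean) are needed.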


## 1.8 Pandas Column Operations

In [None]:
# add 5 to the windspeed column

df['windspeed'] = df['windspeed']+5

df

In [None]:
# Use .str to perform string operations on a pandas column. Convert the cities to upper case
df['city']=df['city'].str.upper()
df

# 2. Pandas Functions: Writing Data <a id='p2' />

### 2.1 Writing DataFrame to Excel in different Sheet <a id='p2.1' />

In [None]:
df_stocks = pd.DataFrame({
    'tickers':['GOOGLE','WMT','MSFT'],
    'price':[845,65,64],
    'pe':[30.37,14,30],
    'eps':[27,4,2]
})

df_weather = pd.DataFrame({
    'day':['1/1/2017','1/2/2017','1/3/2017'],
    'temperature':[32,35,28],
    'event':['Rain','Sunny','Snow']
})

In [None]:
df_stocks.to_excel("./data/new_df_stocks.xlsx",sheet_name = "stocks")

#### Calling `to_excel` again on the same file overwrites it, even with a different `sheet_name`. Use `ExcelWriter` to store DataFrames in different sheets

In [None]:
with pd.ExcelWriter('./data/stocks_weather_combined.xlsx') as writer:
    df_stocks.to_excel(writer,sheet_name="stocks")
    df_weather.to_excel(writer,sheet_name="weather")

### 2.2 Writing to csv <a id='p2.2' />

In [None]:
df_stocks.to_csv('./data/df_stocks.csv')
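By default `to_csv` also writes the row index as the first column. When such a file is read back, that column has no header and shows up as `Unnamed: 0`. The sketch below is a minimal illustration of the difference `index=False` makes; it writes to an in-memory string instead of a file so that it runs anywhere, and uses a shortened copy of `df_stocks`.

```python
import io

import pandas as pd

df_stocks = pd.DataFrame({
    "tickers": ["GOOGLE", "WMT", "MSFT"],
    "price": [845, 65, 64],
})

# to_csv() with no path returns the CSV text instead of writing a file.
with_index = pd.read_csv(io.StringIO(df_stocks.to_csv()))
without_index = pd.read_csv(io.StringIO(df_stocks.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'tickers', 'price']
print(without_index.columns.tolist())  # ['tickers', 'price']
```

Pass `index=False` whenever the index is just the default row number and carries no information.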

# 3. Hands-On Practice Exercise <a id='p3' />

You can use ChatGPT to solve the questions too!

Consider the below dataframe.

In [2]:
data = {
    'ID': [1, 2, 3, 4, 5],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value1': [10, 20, None, 40, 50],
    'Value2': [100, None, 300, 400, 500],
    'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'Age': [25, 30, 35, 40, 45],
}

df = pd.DataFrame(data)
df

Unnamed: 0,ID,Category,Value1,Value2,Date,Age
0,1,A,10.0,100.0,2023-01-01,25
1,2,B,20.0,,2023-01-02,30
2,3,A,,300.0,2023-01-03,35
3,4,B,40.0,400.0,2023-01-04,40
4,5,A,50.0,500.0,2023-01-05,45


1. Load the dataset and perform imputation using different techniques like mean and median.

In [4]:
df1 = df.copy()
df1['Value1'] = df['Value1'].fillna(df['Value1'].mean())
df1['Value2'] = df['Value2'].fillna(df['Value2'].mean())
df1

Unnamed: 0,ID,Category,Value1,Value2,Date,Age
0,1,A,10.0,100.0,2023-01-01,25
1,2,B,20.0,325.0,2023-01-02,30
2,3,A,30.0,300.0,2023-01-03,35
3,4,B,40.0,400.0,2023-01-04,40
4,5,A,50.0,500.0,2023-01-05,45


In [5]:
df2 = df.copy()
df2['Value1'] = df['Value1'].fillna(df['Value1'].median())
df2['Value2'] = df['Value2'].fillna(df['Value2'].median())
df2

Unnamed: 0,ID,Category,Value1,Value2,Date,Age
0,1,A,10.0,100.0,2023-01-01,25
1,2,B,20.0,350.0,2023-01-02,30
2,3,A,30.0,300.0,2023-01-03,35
3,4,B,40.0,400.0,2023-01-04,40
4,5,A,50.0,500.0,2023-01-05,45


2. Group the dataset by 'Category' and calculate both sum and mean for 'Value1' and 'Value2'.

In [10]:
# sum and mean of Value1 and Value2 for each Category
df_grouped = df.groupby('Category', as_index=False) \
               .agg({'Value1': ['sum', 'mean'], 'Value2': ['sum', 'mean']})
df_grouped

Unnamed: 0_level_0,Category,Value1,Value1,Value2,Value2
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
0,A,60.0,30.0,900.0,300.0
1,B,60.0,30.0,400.0,400.0


3. Merge the dataset with itself using an inner join based on the 'ID' column.

In [11]:
df_merged = df.merge(df, on = 'ID')
df_merged

Unnamed: 0,ID,Category_x,Value1_x,Value2_x,Date_x,Age_x,Category_y,Value1_y,Value2_y,Date_y,Age_y
0,1,A,10.0,100.0,2023-01-01,25,A,10.0,100.0,2023-01-01,25
1,2,B,20.0,,2023-01-02,30,B,20.0,,2023-01-02,30
2,3,A,,300.0,2023-01-03,35,A,,300.0,2023-01-03,35
3,4,B,40.0,400.0,2023-01-04,40,B,40.0,400.0,2023-01-04,40
4,5,A,50.0,500.0,2023-01-05,45,A,50.0,500.0,2023-01-05,45


4. Select rows where 'Value1' is greater than the mean of 'Value1'.

In [12]:
df[df['Value1'] > df['Value1'].mean()]

Unnamed: 0,ID,Category,Value1,Value2,Date,Age
3,4,B,40.0,400.0,2023-01-04,40
4,5,A,50.0,500.0,2023-01-05,45


5. Create a pivot table to display the sum of 'Value1' for each 'Category' against different 'IDs'.

6. Group the dataset by both 'Category' and 'Age' and calculate the sum of 'Value1' for each group.

7. Create two new DataFrames with the same columns and concatenate them, but only include rows where 'Value1' is greater than 30. 

##### The End
[Back to Content](#tc)

Copyright © 2023 by Boston Consulting Group. All rights reserved.