Q1. List any five functions of the pandas library with execution.

In [4]:
import pandas as pd

# Example: Reading a CSV file into a DataFrame
df = pd.read_csv('titanic.csv')

# Displaying the first 3 rows of the DataFrame
print("First 3 rows of the DataFrame:")
print(df.head(3))

# Displaying information about the DataFrame
print("\nInformation about the DataFrame:")
df.info()

# Descriptive statistics of the DataFrame
print("\nDescriptive statistics of the DataFrame:")
print(df.describe())

# Grouping the DataFrame by a column and calculating the mean of each group
print("\nMean value for each category:")
grouped_data = df.groupby('Pclass')['Survived'].mean()
print(grouped_data)


First 3 rows of the DataFrame:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  

Information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int6

Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the 
DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [5]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = range(1, 2 * len(df) + 1, 2)

    df_reindexed = df.set_index(pd.Index(new_index))
    
    return df_reindexed

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
print("Original DataFrame:")
print(df)

df_reindexed = reindex_dataframe(df)
print("\nDataFrame after re-indexing:")
print(df_reindexed)


Original DataFrame:
    A   B   C
0  10  40  70
1  20  50  80
2  30  60  90

DataFrame after re-indexing:
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that 
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The 
function should print the sum to the console.

For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should 
calculate and print the sum of the first three values, which is 60.

In [6]:
import pandas as pd

def calculate_sum_of_first_three_values(df):

    if 'Values' not in df.columns:
        print("Error: 'Values' column not found in the DataFrame.")
        return

    first_three_values = df['Values'].head(3)

    sum_of_first_three_values = first_three_values.sum()
    print("Sum of the first three values:", sum_of_first_three_values)

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
print("Original DataFrame:")
print(df)

calculate_sum_of_first_three_values(df)


Original DataFrame:
   Values
0      10
1      20
2      30
3      40
4      50
Sum of the first three values: 60
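
The question asks for iteration, while the cell above relies on head(3).sum(). A minimal sketch of an explicit loop is given below, assuming the same 'Values' column. The function name sum_first_three_by_iteration is hypothetical and not part of the original answer.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    # Hypothetical helper: walk the column one value at a time
    # and stop once three values have been added.
    total = 0
    for position, value in enumerate(df['Values']):
        if position == 3:
            break
        total += value
    print("Sum of the first three values:", total)
    return total

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(df)
```

If the column has fewer than three rows, the loop simply sums whatever is there.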


Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 
'Word_Count' that contains the number of words in each row of the 'Text' column.

In [7]:
import pandas as pd

def add_word_count_column(df):

    if 'Text' not in df.columns:
        print("Error: 'Text' column not found in the DataFrame.")
        return

    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))

df = pd.DataFrame({'Text': ['This is a sample sentence.',
                            'Count the words in this text.',
                            'Pandas DataFrame example.']})
print("Original DataFrame:")
print(df)

add_word_count_column(df)

print("\nDataFrame with Word_Count column:")
print(df)


Original DataFrame:
                            Text
0     This is a sample sentence.
1  Count the words in this text.
2      Pandas DataFrame example.

DataFrame with Word_Count column:
                            Text  Word_Count
0     This is a sample sentence.           5
1  Count the words in this text.           6
2      Pandas DataFrame example.           3
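
One edge case worth checking: str(x) turns a missing value into the string 'nan', which would be counted as one word. The sketch below uses the .str accessor instead and assumes that missing text should count as zero words. The function name add_word_count_column_na_safe is hypothetical.

```python
import pandas as pd

def add_word_count_column_na_safe(df):
    # Hypothetical variant: split on whitespace, count the pieces,
    # and treat missing text as zero words (an assumption).
    counts = df['Text'].str.split().str.len()
    df['Word_Count'] = counts.fillna(0).astype(int)
    return df

df = pd.DataFrame({'Text': ['This is a sample sentence.',
                            None,
                            'Pandas DataFrame example.']})
df = add_word_count_column_na_safe(df)
print(df)
```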


Q5. How are DataFrame.size and DataFrame.shape different?

In [8]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
size_of_df = df.size
print(size_of_df)  # Output: 6 (3 rows * 2 columns = 6 elements)

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
shape_of_df = df1.shape
print(shape_of_df)  # Output: (3, 2) (3 rows, 2 columns)

6
(3, 2)
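
As a short check of how the two relate, the sketch below confirms that size is the product of the two entries in shape. It also shows that both are attributes rather than methods, so adding parentheses raises a TypeError. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, columns = df.shape      # shape is a (rows, columns) tuple
total_elements = df.size      # size is a single integer

# size equals the number of rows multiplied by the number of columns
print(total_elements == rows * columns)

# Both are attributes, so calling them with parentheses fails
try:
    df.size()
except TypeError as error:
    print("TypeError:", error)
```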


Q6. Which pandas function do we use to read an Excel file?

In [None]:
import pandas as pd

# Read an Excel file into a DataFrame
df = pd.read_excel('example.xlsx')

# Display the DataFrame
print(df)

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email 
addresses in the format 'username@domain.com'. Write a Python function that creates a new column 
'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the 
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your 
function should extract the username from each email address and store it in the new 'Username' 
column.

In [9]:
import pandas as pd

def extract_username_from_email(df):
    # Check if 'Email' column exists in the DataFrame
    if 'Email' not in df.columns:
        print("Error: 'Email' column not found in the DataFrame.")
        return

    # Extract the username from each email address and create a new 'Username' column
    df['Username'] = df['Email'].str.split('@').str.get(0)


df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@example.com']})
print("Original DataFrame:")
print(df)

# Extract and store the username in the 'Username' column
extract_username_from_email(df)


print("\nDataFrame with Username column:")
print(df)


Original DataFrame:
                    Email
0    john.doe@example.com
1  jane.smith@example.com
2         bob@example.com

DataFrame with Username column:
                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2         bob@example.com         bob


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects 
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The 
function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:

   A   B   C
0  3   5   1
1  8   2   7
2  6   9   4
3  2   3   5
4  9   1   2

Your function should select the following rows:

   A   B   C
1  8   2   7
2  6   9   4
4  9   1   2

In [10]:
import pandas as pd

def select_rows_by_condition(df):

    if 'A' not in df.columns or 'B' not in df.columns:
        print("Error: 'A' or 'B' column not found in the DataFrame.")
        return

    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]

    return selected_rows

df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})
print("Original DataFrame:")
print(df)

selected_df = select_rows_by_condition(df)

print("\nDataFrame with selected rows:")
print(selected_df)


Original DataFrame:
   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

DataFrame with selected rows:
   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2


Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, 
median, and standard deviation of the values in the 'Values' column.

In [11]:
import pandas as pd

def calculate_statistics(df):
    # Check if 'Values' column exists in the DataFrame
    if 'Values' not in df.columns:
        print("Error: 'Values' column not found in the DataFrame.")
        return

    # Calculate mean, median, and standard deviation
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()

    # Print the results
    print("Mean: {:.2f}".format(mean_value))
    print("Median: {:.2f}".format(median_value))
    print("Standard Deviation: {:.2f}".format(std_deviation))


df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
print("Original DataFrame:")
print(df)

# Calculate and print mean, median, and standard deviation
calculate_statistics(df)


Original DataFrame:
   Values
0      10
1      20
2      30
3      40
4      50
Mean: 30.00
Median: 30.00
Standard Deviation: 15.81
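
The value 15.81 is the sample standard deviation, because Series.std() divides by n - 1 by default (ddof=1). If the population standard deviation is wanted instead, passing ddof=0 is one way to get it. The sketch below compares the two on the same data; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # default ddof=1, divides by n - 1
population_std = values.std(ddof=0)  # divides by n

print("Sample standard deviation: {:.2f}".format(sample_std))
print("Population standard deviation: {:.2f}".format(population_std))
```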


Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to 
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days 
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and 
should include the current day.

In [12]:
import pandas as pd

def calculate_moving_average(df):
    # Check if 'Sales' and 'Date' columns exist in the DataFrame
    if 'Sales' not in df.columns or 'Date' not in df.columns:
        print("Error: 'Sales' or 'Date' column not found in the DataFrame.")
        return

    df.sort_values(by='Date', inplace=True)

    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()

df = pd.DataFrame({
    'Date': pd.date_range(start='2022-01-01', periods=10),
    'Sales': [10, 15, 12, 20, 18, 22, 25, 30, 28, 35]
})
print("Original DataFrame:")
print(df)

# Calculate and add the 'MovingAverage' column
calculate_moving_average(df)

# Display the DataFrame with the new 'MovingAverage' column
print("\nDataFrame with MovingAverage column:")
print(df)


Original DataFrame:
        Date  Sales
0 2022-01-01     10
1 2022-01-02     15
2 2022-01-03     12
3 2022-01-04     20
4 2022-01-05     18
5 2022-01-06     22
6 2022-01-07     25
7 2022-01-08     30
8 2022-01-09     28
9 2022-01-10     35

DataFrame with MovingAverage column:
        Date  Sales  MovingAverage
0 2022-01-01     10      10.000000
1 2022-01-02     15      12.500000
2 2022-01-03     12      12.333333
3 2022-01-04     20      14.250000
4 2022-01-05     18      15.000000
5 2022-01-06     22      16.166667
6 2022-01-07     25      17.428571
7 2022-01-08     30      20.285714
8 2022-01-09     28      22.142857
9 2022-01-10     35      25.428571
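
Note that rolling(window=7) counts rows, so it matches "the past 7 days" only when there is exactly one row per day. If some dates can be missing, a time-based window is one alternative. The sketch below assumes 'Date' is a datetime column and uses the offset '7D', which covers the current day and the six days before it. The function name calculate_calendar_moving_average is hypothetical.

```python
import pandas as pd

def calculate_calendar_moving_average(df):
    # Hypothetical variant: the window spans 7 calendar days, not 7 rows.
    result = df.sort_values(by='Date').reset_index(drop=True)
    # A time-based window needs a datetime index, so index by 'Date' temporarily.
    by_date = result.set_index('Date')['Sales']
    result['MovingAverage'] = by_date.rolling('7D').mean().to_numpy()
    return result

# Dates with a gap: 2022-01-10 is more than 7 days after the earlier rows
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10']),
    'Sales': [10, 20, 40]
})
result = calculate_calendar_moving_average(df)
print(result)
```

With these dates, the last row averages only its own value, since no other sale falls inside its 7-day window.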


Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new 
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. 
Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:

         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

Your function should create the following DataFrame:

         Date    Weekday
0  2023-01-01     Sunday
1  2023-01-02     Monday
2  2023-01-03    Tuesday
3  2023-01-04  Wednesday
4  2023-01-05   Thursday

The function should return the modified DataFrame.

In [13]:
import pandas as pd

def add_weekday_column(df):

    if 'Date' not in df.columns:
        print("Error: 'Date' column not found in the DataFrame.")
        return

    df['Date'] = pd.to_datetime(df['Date'])

    df['Weekday'] = df['Date'].dt.day_name()

    return df

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']})
print("Original DataFrame:")
print(df)

df = add_weekday_column(df)

print("\nDataFrame with Weekday column:")
print(df)


Original DataFrame:
         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

DataFrame with Weekday column:
        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python 
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [14]:
import pandas as pd

def select_rows_by_date(df):

    if 'Date' not in df.columns:
        print("Error: 'Date' column not found in the DataFrame.")
        return

    df['Date'] = pd.to_datetime(df['Date'])

    # Half-open range so that timestamps later in the day on 2023-01-31 are included
    selected_rows = df[(df['Date'] >= '2023-01-01') & (df['Date'] < '2023-02-01')]

    return selected_rows

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-15', '2023-01-25', '2023-02-05']})
print("Original DataFrame:")
print(df)

selected_df = select_rows_by_date(df)

print("\nDataFrame with selected rows:")
print(selected_df)


Original DataFrame:
         Date
0  2023-01-01
1  2023-01-15
2  2023-01-25
3  2023-02-05

DataFrame with selected rows:
        Date
0 2023-01-01
1 2023-01-15
2 2023-01-25
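
Because the 'Date' column holds timestamps, a string bound such as '2023-01-31' is interpreted as midnight at the start of that day. The sketch below assumes the intent is to keep every timestamp on January 31, and filters with a half-open interval that ends just before February 1. The function name select_january_2023 is hypothetical.

```python
import pandas as pd

def select_january_2023(df):
    # Hypothetical helper: keep rows in [2023-01-01, 2023-02-01)
    dates = pd.to_datetime(df['Date'])
    mask = (dates >= '2023-01-01') & (dates < '2023-02-01')
    return df[mask]

df = pd.DataFrame({'Date': ['2023-01-01 00:00',
                            '2023-01-31 18:30',
                            '2023-02-01 00:00']})
selected = select_january_2023(df)
print(selected)
```

This version converts a copy of the column, so the caller's DataFrame is left unchanged.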


Q13. To use the basic functions of pandas, which library must be imported first?

In [15]:
import pandas as pd
