## Question 1: List any five functions of the pandas library with execution.

Pandas is a powerful data manipulation library in Python, widely used for data analysis and preprocessing. Here are five commonly used functions in Pandas, along with examples of how to execute them:

1. read_csv()

Function: Reads a CSV file into a DataFrame.

### Example:

import pandas as pd

* Reading a CSV file into a DataFrame

df = pd.read_csv('data.csv')

print(df.head())  # Display the first five rows

2. head()

Function: Returns the first n rows of a DataFrame.

### Example:

* Display the first 5 rows (default) of the DataFrame

print(df.head())

* Display the first 3 rows of the DataFrame

print(df.head(3))

3. describe()

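Function: Generates descriptive statistics for the DataFrame. For a DataFrame with numeric columns, the default output covers count, mean, standard deviation, minimum, quartiles, and maximum.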

### Example:
* Generate descriptive statistics

description = df.describe()

print(description)

4. groupby()

Function: Groups the DataFrame using a mapper or by a Series of columns.

### Example:

* Grouping by a column and calculating the mean

grouped = df.groupby('column_name').mean(numeric_only=True)  # numeric_only avoids a TypeError on text columns

print(grouped)

5. merge()

Function: Combines two DataFrames by matching rows on one or more key columns, in the manner of a SQL join.

### Example:

* Creating two DataFrames

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})

df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

* Merging the DataFrames on the 'key' column

merged_df = pd.merge(df1, df2, on='key', how='inner')

print(merged_df)
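
The five calls above depend on a file named data.csv and a placeholder column name, so they cannot be run as they stand. The sketch below chains the same five functions on a small in-memory CSV. The column names (`region`, `units`, `price`) and the `managers` table are made up for illustration and are not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = io.StringIO(
    "region,units,price\n"
    "north,10,2.5\n"
    "south,4,3.0\n"
    "north,6,2.0\n"
    "south,8,3.5\n"
)

# 1. read_csv(): parse the CSV text into a DataFrame
sales = pd.read_csv(csv_text)

# 2. head(): keep only the first two rows
first_two = sales.head(2)

# 3. describe(): summary statistics for the numeric columns
summary = sales.describe()

# 4. groupby(): mean of the numeric columns per region
mean_by_region = sales.groupby("region")[["units", "price"]].mean()

# 5. merge(): left join against a second (hypothetical) table
managers = pd.DataFrame({"region": ["north", "east"], "manager": ["Ana", "Raj"]})
merged = pd.merge(sales, managers, on="region", how="left")

print(first_two)
print(summary)
print(mean_by_region)
print(merged)
```

With how='left', every row of `sales` is kept, and rows whose region has no match in `managers` get a missing value in the `manager` column.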

## Question 2: Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [2]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = range(1, 2 * len(df) + 1, 2)
    
    # Assign the new index to the DataFrame
    df.index = new_index
    
    return df

# Example usage
# Creating a sample DataFrame
data = {'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]}
df = pd.DataFrame(data)

# Re-indexing the DataFrame
reindexed_df = reindex_dataframe(df)
print(reindexed_df)

    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
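
The function above assigns to df.index, which also changes the DataFrame that the caller passed in. A possible variant, sketched here under the assumption that the original should stay untouched, works on a copy and builds the labels with pd.RangeIndex. The name `reindex_copy` is hypothetical. Note that DataFrame.reindex, despite its name, conforms the data to a set of labels and fills labels that did not previously exist with NaN, so it is not the right tool for relabeling rows.

```python
import pandas as pd


def reindex_copy(df):
    """Return a copy of df whose index is 1, 3, 5, ... (hypothetical helper)."""
    out = df.copy()
    # RangeIndex(start, stop, step) produces one label per row: 1, 3, 5, ...
    out.index = pd.RangeIndex(start=1, stop=2 * len(out) + 1, step=2)
    return out


original = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60], "C": [70, 80, 90]})
relabeled = reindex_copy(original)

print(list(original.index))   # the input keeps its default index
print(list(relabeled.index))  # the copy carries the new labels
```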


## Question 3: You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

In [3]:
import pandas as pd

def sum_first_three_values(df):
    # Check if 'Values' column exists and has at least three rows
    if 'Values' in df.columns and len(df) >= 3:
        # Calculate the sum of the first three values
        total = df['Values'].iloc[:3].sum()
        print(f"The sum of the first three values in the 'Values' column is: {total}")
    else:
        print("The DataFrame does not have enough values in the 'Values' column.")

# Example usage
# Creating a sample DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calling the function
sum_first_three_values(df)

The sum of the first three values in the 'Values' column is: 60
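
The question asks for a function that iterates over the DataFrame, while the answer above uses a vectorized slice. If an explicit loop is required, it could look like the sketch below. The function name `sum_first_three_iterative` is made up, and the loop simply stops early when fewer than three rows exist.

```python
from itertools import islice

import pandas as pd


def sum_first_three_iterative(df):
    """Add up the first three entries of 'Values' with an explicit loop."""
    total = 0
    # islice stops after three items, or earlier if the column is shorter
    for value in islice(df["Values"], 3):
        total += value
    print(f"The sum of the first three values in the 'Values' column is: {total}")
    return total


sample = pd.DataFrame({"Values": [10, 20, 30, 40, 50]})
result = sum_first_three_iterative(sample)
```

The vectorized version is usually preferable for large data. The loop is shown only because the question wording asks for iteration.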


## Question 4: Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [4]:
import pandas as pd

def add_word_count_column(df):
    # Check if 'Text' column exists
    if 'Text' in df.columns:
        # Create 'Word_Count' column by applying a lambda function to count words
        df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    else:
        print("The DataFrame does not have a 'Text' column.")
    
    return df

# Example usage
# Creating a sample DataFrame
data = {'Text': ["Hello world", "Pandas is great for data analysis", "Python programming"]}
df = pd.DataFrame(data)

# Adding the 'Word_Count' column
df = add_word_count_column(df)
print(df)

                                Text  Word_Count
0                        Hello world           2
1  Pandas is great for data analysis           6
2                 Python programming           2
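
One caveat with the lambda above: str(x) turns a missing value into the text 'nan', which is then counted as one word. A possible alternative, sketched here with the pandas string accessor, leaves missing entries missing and then records them as zero words. Treating missing text as zero words is an assumption of this sketch, not something the question specifies.

```python
import pandas as pd

texts = pd.DataFrame({"Text": ["Hello world", None, "Python programming"]})

# .str.split() splits on whitespace and keeps missing entries as NaN;
# .str.len() then counts the words in each resulting list
counts = texts["Text"].str.split().str.len()

# Assumption: a missing text counts as zero words
texts["Word_Count"] = counts.fillna(0).astype(int)

print(texts)
```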


## Question 5: How are DataFrame.size() and DataFrame.shape() different?

In Pandas, DataFrame.size and DataFrame.shape are attributes, not methods, so they are accessed without parentheses (calling df.size() or df.shape() raises a TypeError). Both describe the dimensions of a DataFrame, but they report different things:

#### DataFrame.size

* Definition: DataFrame.size returns the total number of elements in the DataFrame. It is calculated as the product of the number of rows and the number of columns.
* Return Type: Integer
* Example: For a DataFrame with 4 rows and 3 columns, DataFrame.size would return 12 (4 rows × 3 columns).

In [5]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df.size)  # Output: 9

9


#### DataFrame.shape

* Definition: DataFrame.shape returns a tuple representing the dimensionality of the DataFrame. The tuple contains two elements: the number of rows and the number of columns.
* Return Type: Tuple of two integers
* Example: For a DataFrame with 4 rows and 3 columns, DataFrame.shape would return (4, 3).

In [6]:
print(df.shape)  # Output: (3, 3)

(3, 3)
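
The relationship between the two attributes can be checked directly. The sketch below builds the 4-row, 3-column case mentioned in the bullet points (the cell values are arbitrary) and confirms that size is the product of the two entries of shape.

```python
import pandas as pd

# A DataFrame with 4 rows and 3 columns; the values themselves are arbitrary
frame = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

rows, cols = frame.shape  # shape unpacks into (rows, columns)

print(frame.shape)             # the tuple of dimensions
print(frame.size)              # the total number of cells
print(frame.size == rows * cols)
```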


## Question 6: Which function of pandas do we use to read an excel file?

To read an Excel file into a Pandas DataFrame, you use the pd.read_excel() function from the Pandas library. It reads .xls and .xlsx files, among other spreadsheet formats, and relies on an optional engine package such as openpyxl for .xlsx files.

#### Example Usage

import pandas as pd

* Reading an Excel file into a DataFrame

df = pd.read_excel('file_name.xlsx')

* Displaying the first few rows of the DataFrame

print(df.head())


### Parameters

* io: The Excel file you want to read. This can be a string representing the file path or a file-like object.

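* sheet_name (optional): Specifies the sheet or sheets to read. It can be a sheet name as a string, an integer position (0-based), a list of names or positions, or None to read every sheet. The default is 0, which reads the first sheet. When a list or None is passed, the function returns a dictionary of DataFrames keyed by sheet instead of a single DataFrame.

* usecols (optional): Specifies which columns to read: a string of Excel column letters or ranges (for example 'A:C'), a list of column positions, or a list of column names.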

* nrows (optional): Specifies the number of rows to read.

#### Example with Specific Parameters

* Reading the second sheet of the Excel file and only the first 5 rows

df = pd.read_excel('file_name.xlsx', sheet_name=1, nrows=5)

print(df)


## Question 7: You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

In [1]:
import pandas as pd

def add_username_column(df):
    # Check if 'Email' column exists
    if 'Email' in df.columns:
        # Create 'Username' column by splitting the 'Email' column at '@' and taking the first part
        df['Username'] = df['Email'].apply(lambda x: x.split('@')[0])
    else:
        print("The DataFrame does not have an 'Email' column.")
    
    return df

# Example usage
# Creating a sample DataFrame
data = {'Email': ['john.doe@example.com', 'jane.smith@domain.org', 'info@company.net']}
df = pd.DataFrame(data)

# Adding the 'Username' column
df = add_username_column(df)
print(df)

                   Email    Username
0   john.doe@example.com    john.doe
1  jane.smith@domain.org  jane.smith
2       info@company.net        info
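
The lambda above calls x.split on every entry, which fails with an AttributeError if the column contains a missing value. A possible alternative is the vectorized string accessor, sketched below on made-up sample data that includes one missing address. Missing emails simply produce a missing username.

```python
import pandas as pd

contacts = pd.DataFrame(
    {"Email": ["john.doe@example.com", None, "info@company.net"]}
)

# Split once at the first '@' and keep the part before it;
# missing emails stay missing instead of raising an error
contacts["Username"] = contacts["Email"].str.split("@", n=1).str[0]

print(contacts)
```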


## Question 8: You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows. For example, if df contains the following values:

       A  B  C
    0  3  5  1
    1  8  2  7
    2  6  9  4
    3  2  3  5
    4  9  1  2

In [2]:
import pandas as pd

def filter_rows(df):
    # Select rows where 'A' > 5 and 'B' < 10
    filtered_df = df[(df['A'] > 5) & (df['B'] < 10)]
    return filtered_df

# Example usage
# Creating a sample DataFrame
data = {'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]}
df = pd.DataFrame(data)

# Filtering rows based on the conditions
filtered_df = filter_rows(df)
print(filtered_df)

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
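
The same selection can also be written with DataFrame.query, which takes the condition as a string. This is only an alternative spelling of the boolean mask used above, shown on the same sample data. The variable names are chosen for this sketch.

```python
import pandas as pd

table = pd.DataFrame(
    {"A": [3, 8, 6, 2, 9], "B": [5, 2, 9, 3, 1], "C": [1, 7, 4, 5, 2]}
)

# Column names can be used directly inside the query expression
selected = table.query("A > 5 and B < 10")

print(selected)
```

Both forms return a new DataFrame and keep the original row labels of the selected rows.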


## Question 9: Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [3]:
import pandas as pd

def calculate_statistics(df):
    # Check if 'Values' column exists
    if 'Values' in df.columns:
        # Calculate mean, median, and sample standard deviation (std() uses ddof=1 by default)
        mean_val = df['Values'].mean()
        median_val = df['Values'].median()
        std_dev = df['Values'].std()
        
        # Print the results
        print(f"Mean: {mean_val}")
        print(f"Median: {median_val}")
        print(f"Standard Deviation: {std_dev}")
        
        # Optionally, return the results as a dictionary
        return {"Mean": mean_val, "Median": median_val, "Standard Deviation": std_dev}
    else:
        print("The DataFrame does not have a 'Values' column.")
        return None

# Example usage
# Creating a sample DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculating statistics
statistics = calculate_statistics(df)

Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
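
The standard deviation printed above is the sample standard deviation, because Series.std() divides by n - 1 (ddof=1) by default. If the population standard deviation is wanted instead, ddof=0 can be passed. The sketch below compares the two on the same five values.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # divides by n - 1 (pandas default)
population_std = values.std(ddof=0)  # divides by n

print(f"Sample standard deviation:     {sample_std}")
print(f"Population standard deviation: {population_std}")
```

Which of the two is appropriate depends on whether the column holds a sample or the whole population; the question does not say.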


## Question 10: Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [4]:
import pandas as pd

def calculate_moving_average(df):
    # Check if 'Sales' and 'Date' columns exist
    if 'Sales' in df.columns and 'Date' in df.columns:
        # Ensure the 'Date' column is in datetime format and sort by date
        df['Date'] = pd.to_datetime(df['Date'])
        df = df.sort_values('Date')

        # Calculate the moving average over a window of 7 rows (the current row and the 6 before it); this equals 7 days only when there is one row per day
        df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    else:
        print("The DataFrame does not have the required 'Sales' and 'Date' columns.")
    
    return df

# Example usage
# Creating a sample DataFrame
data = {
    'Date': ['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04', '2023-06-05', '2023-06-06', '2023-06-07', '2023-06-08'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}
df = pd.DataFrame(data)

# Calculating the moving average
df = calculate_moving_average(df)
print(df)

        Date  Sales  MovingAverage
0 2023-06-01    100          100.0
1 2023-06-02    150          125.0
2 2023-06-03    200          150.0
3 2023-06-04    250          175.0
4 2023-06-05    300          200.0
5 2023-06-06    350          225.0
6 2023-06-07    400          250.0
7 2023-06-08    450          300.0
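
The function above uses a window of 7 rows, which matches 7 days only when the data has exactly one row per day. If dates can be missing, a time-based window may be closer to "the past 7 days". The sketch below is one way to do that, assuming the dates are unique enough to sort and that a window covering the current day and the 6 calendar days before it is what is wanted. The function name and the sample data with a gap are made up for illustration.

```python
import pandas as pd


def add_calendar_moving_average(df):
    """Add a 7-calendar-day moving average of 'Sales' (hypothetical helper)."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values("Date")

    # A '7D' window on a datetime index covers the current timestamp and the
    # 7 days before it (left edge excluded), regardless of how many rows exist
    by_date = out.set_index("Date")["Sales"]
    out["MovingAverage"] = by_date.rolling("7D").mean().to_numpy()
    return out


# Sample data with a gap between 2023-06-02 and 2023-06-10
gappy = pd.DataFrame(
    {"Date": ["2023-06-01", "2023-06-02", "2023-06-10"], "Sales": [100, 200, 300]}
)
averaged = add_calendar_moving_average(gappy)
print(averaged)
```

With a row-based window the last row would average all three sales values; with the time-based window it averages only the sale on 2023-06-10, because the earlier rows fall outside the 7-day span.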


## Question 11: You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:

            Date
    0 2023-01-01
    1 2023-01-02
    2 2023-01-03
    3 2023-01-04
    4 2023-01-05

Your function should create the following DataFrame:

            Date    Weekday
    0 2023-01-01     Sunday
    1 2023-01-02     Monday
    2 2023-01-03    Tuesday
    3 2023-01-04  Wednesday
    4 2023-01-05   Thursday

The function should return the modified DataFrame.

In [5]:
import pandas as pd

def add_weekday_column(df):
    # Check if 'Date' column exists
    if 'Date' in df.columns:
        # Convert the 'Date' column to datetime format
        df['Date'] = pd.to_datetime(df['Date'])
        # Create a new 'Weekday' column with the weekday name
        df['Weekday'] = df['Date'].dt.day_name()
    else:
        print("The DataFrame does not have a 'Date' column.")
    
    return df

# Example usage
# Creating a sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
}
df = pd.DataFrame(data)

# Adding the 'Weekday' column
df = add_weekday_column(df)
print(df)

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


## Question 12: Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [6]:
import pandas as pd

def filter_date_range(df):
    # Check if 'Date' column exists
    if 'Date' in df.columns:
        # Convert the 'Date' column to datetime format
        df['Date'] = pd.to_datetime(df['Date'])
        
        # Define the start and end dates for filtering
        start_date = '2023-01-01'
        end_date = '2023-01-31'
        
        # Filter rows based on the date range
        filtered_df = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    else:
        print("The DataFrame does not have a 'Date' column.")
        return None
    
    return filtered_df

# Example usage
# Creating a sample DataFrame
data = {
    'Date': ['2022-12-31', '2023-01-01', '2023-01-15', '2023-01-31', '2023-02-01'],
    'Value': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filtering rows based on the date range
filtered_df = filter_date_range(df)
print(filtered_df)

        Date  Value
1 2023-01-01     20
2 2023-01-15     30
3 2023-01-31     40
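
The question mentions timestamps. When the 'Date' column carries a time of day, the comparison with '2023-01-31' above stops at midnight at the start of that day, so a row such as 2023-01-31 18:30 would be dropped. One possible fix, sketched below, uses a half-open interval that ends just before 2023-02-01. The function name and the sample rows are made up for illustration, and the sketch assumes the whole of 31 January should be included.

```python
import pandas as pd


def filter_january_2023(df):
    """Select rows whose 'Date' falls anywhere in January 2023 (hypothetical helper)."""
    dates = pd.to_datetime(df["Date"])
    start = pd.Timestamp("2023-01-01")
    end_exclusive = pd.Timestamp("2023-02-01")
    # Half-open interval: start <= date < end_exclusive
    return df[(dates >= start) & (dates < end_exclusive)]


stamps = pd.DataFrame(
    {
        "Date": ["2023-01-31 18:30:00", "2023-02-01 00:00:00", "2022-12-31 23:59:59"],
        "Value": [1, 2, 3],
    }
)
january = filter_january_2023(stamps)
print(january)
```

This version also leaves the input DataFrame unchanged, since the converted dates are kept in a separate variable.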


## Question 13: To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

In [7]:
import pandas as pd