# PW SKILLS

## Assignment Questions

### Q1. List any five functions of the pandas library with execution.
### Answer :

### The pandas library is a powerful data manipulation and analysis library for Python. Here are five commonly used pandas features (four functions or methods and one attribute), each with a short example:

### 1. read_csv: This function is used to read data from a CSV file into a DataFrame.

In [None]:
import pandas as pd

# Example CSV file: "example.csv"
# Name,Age,Location
# John,25,New York
# Alice,30,San Francisco

# Reading CSV into a DataFrame
df = pd.read_csv("example.csv")
print(df)


### 2. head: This function is used to display the first few rows of a DataFrame.

In [None]:
# Displaying the first 5 rows of the DataFrame
print(df.head())


### 3. describe: This function provides descriptive statistics of the numerical columns in a DataFrame.

In [None]:
# Displaying summary statistics of numerical columns
print(df.describe())


### 4. shape: This attribute (accessed without parentheses) returns a tuple with the number of rows and columns in the DataFrame.

In [None]:
# Displaying the shape of the DataFrame
print(df.shape)


### 5. groupby: This function is used for grouping data based on some criteria and then applying a function to each group.

In [None]:
# Grouping data by the 'Location' column and calculating the average age for each location
grouped_df = df.groupby('Location')['Age'].mean()
print(grouped_df)


### These examples assume you have pandas installed in your Python environment. If you don't have it installed, you can install it using:

In [None]:
# The %pip magic installs into the environment used by the running notebook kernel
%pip install pandas
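### The five calls above can also be tried without a CSV file on disk. The following is a minimal sketch, assuming the CSV text is held in a string and wrapped in io.StringIO; the third row (Bob) is made up for illustration and is not part of the original example.csv.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for "example.csv"; the Bob row is an added assumption
csv_text = """Name,Age,Location
John,25,New York
Alice,30,San Francisco
Bob,35,New York
"""

df = pd.read_csv(io.StringIO(csv_text))          # 1. read_csv
first_rows = df.head(2)                          # 2. head: first two rows
stats = df.describe()                            # 3. describe: numeric columns only (Age)
dims = df.shape                                  # 4. shape: an attribute, no parentheses
mean_age = df.groupby("Location")["Age"].mean()  # 5. groupby: average age per location

print(first_rows)
print(stats)
print(dims)
print(mean_age)
```

### Keeping the data in a string makes the cell reproducible on any machine, at the cost of not exercising a real file path.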


### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.
### Answer : We can call reset_index(drop=True) to get a clean 0, 1, 2, ... index and then map it to 1, 3, 5, ... with index * 2 + 1. Here's a Python function that does this:

In [1]:
import pandas as pd

def reindex_dataframe(df):
    # Resetting the index and creating a new index starting from 1 and incrementing by 2
    df_reindexed = df.reset_index(drop=True)
    df_reindexed.index = df_reindexed.index * 2 + 1
    
    return df_reindexed

# Example DataFrame df with columns 'A', 'B', and 'C'
data = {'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]}
df = pd.DataFrame(data)

# Re-indexing the DataFrame
df_reindexed = reindex_dataframe(df)

# Displaying the original and re-indexed DataFrames
print("Original DataFrame:")
print(df)

print("\nDataFrame after re-indexing:")
print(df_reindexed)


Original DataFrame:
    A   B   C
0  10  40  70
1  20  50  80
2  30  60  90

DataFrame after re-indexing:
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
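### As an alternative sketch, the new index can be assigned directly from a pandas RangeIndex instead of scaling the reset index. The function name reindex_with_range, its start and step parameters, and the choice to work on a copy are assumptions made for illustration.

```python
import pandas as pd

def reindex_with_range(df, start=1, step=2):
    # Work on a copy so the caller's DataFrame keeps its original index
    out = df.copy()
    # One index label per row: start, start + step, start + 2 * step, ...
    out.index = pd.RangeIndex(start=start, stop=start + step * len(out), step=step)
    return out

example = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
reindexed = reindex_with_range(example)
print(reindexed)
```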


### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

### For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.
### Answer : We can create a Python function that selects the first three values in the 'Values' column with head(3) and adds them up with the built-in sum, which iterates over those values. Here's an example:

In [2]:
import pandas as pd

def calculate_sum_of_first_three(df):
    # Extracting the 'Values' column from the DataFrame
    values_column = df['Values']

    # Calculating the sum of the first three values
    sum_of_first_three = sum(values_column.head(3))

    # Printing the sum to the console
    print("Sum of the first three values:", sum_of_first_three)

# Example DataFrame df with a 'Values' column
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calling the function to calculate and print the sum of the first three values
calculate_sum_of_first_three(df)


Sum of the first three values: 60


### In this example, the function calculate_sum_of_first_three extracts the 'Values' column from the DataFrame using df['Values'], then uses the head(3) method to select the first three values, and finally calculates their sum using the sum function. The result is then printed to the console.
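### Since the question asks for iteration, the same result can also be obtained with an explicit loop. This is a minimal sketch; the function name sum_first_three_with_loop and the returned total are assumptions added for illustration (the question only requires printing).

```python
import pandas as pd

def sum_first_three_with_loop(df):
    total = 0
    # iloc[:3] yields at most three values, so a shorter column does not raise an error
    for value in df['Values'].iloc[:3]:
        total += value
    print("Sum of the first three values:", total)
    return total

example = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
loop_total = sum_first_three_with_loop(example)
```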

### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.
### Answer : We can use the apply method along with a lambda function to count the number of words in each row of the 'Text' column and create a new column 'Word_Count'. Here's an example:

In [3]:
import pandas as pd

def add_word_count_column(df):
    # Function to count words in a text
    count_words = lambda text: len(str(text).split())

    # Applying the function to the 'Text' column and creating a new 'Word_Count' column
    df['Word_Count'] = df['Text'].apply(count_words)

# Example DataFrame df with a 'Text' column
data = {'Text': ["Hello, how are you?", "This is a sample sentence.", "Python programming is fun!"]}
df = pd.DataFrame(data)

# Calling the function to add the 'Word_Count' column
add_word_count_column(df)

# Displaying the DataFrame with the new 'Word_Count' column
print(df)


                         Text  Word_Count
0         Hello, how are you?           4
1  This is a sample sentence.           5
2  Python programming is fun!           4


### In this example, the add_word_count_column function defines a lambda function count_words to count the number of words in a given text. The apply function is then used to apply this lambda function to each row of the 'Text' column, and a new 'Word_Count' column is created in the DataFrame. The result is a DataFrame with the original 'Text' column and a new 'Word_Count' column containing the word count for each row.
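### One detail of the approach above: str(text) turns a missing value into the string 'None' or 'nan', which is then counted as one word. The following sketch uses the vectorized .str accessor instead and counts missing text as zero words. The function name add_word_count_vectorized, the zero-for-missing rule, and the None entry in the sample data are assumptions made for illustration.

```python
import pandas as pd

def add_word_count_vectorized(df):
    # Work on a copy so the input DataFrame is left unchanged
    out = df.copy()
    # split() breaks each text on whitespace; len() counts the resulting pieces
    counts = out['Text'].str.split().str.len()
    # Missing texts produce NaN counts; treat them as zero words
    out['Word_Count'] = counts.fillna(0).astype(int)
    return out

example = pd.DataFrame({'Text': ["Hello, how are you?", None, "Python programming is fun!"]})
counted = add_word_count_vectorized(example)
print(counted)
```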

### Q5. How are DataFrame.size() and DataFrame.shape() different?
### Answer : In pandas, DataFrame.size and DataFrame.shape are two different attributes that describe the structure of a DataFrame. Both are attributes rather than methods, so they are accessed without parentheses.

### 1. DataFrame.size:

### DataFrame.size returns the total number of elements in the DataFrame.
### It calculates the size by multiplying the number of rows by the number of columns.
### The result is a single integer value representing the total number of elements in the DataFrame.

In [1]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using DataFrame.size
size_result = df.size
print(size_result)


6


### 2. DataFrame.shape:

### DataFrame.shape returns a tuple representing the dimensions of the DataFrame.
### The tuple contains two elements: the number of rows and the number of columns.
### It is useful for quickly checking how many rows and columns the DataFrame has.

In [2]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using DataFrame.shape
shape_result = df.shape
print(shape_result)


(3, 2)


### In summary, while DataFrame.size gives you the total number of elements in the DataFrame, DataFrame.shape provides a tuple representing the number of rows and columns.
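### The relationship between the two can be checked directly. This short sketch unpacks shape, compares its product with size, and shows that calling size with parentheses fails because it is an attribute; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df.shape   # tuple unpacking: number of rows, number of columns
total = df.size         # total number of cells
print(rows, cols, total)

# size holds an integer, so adding parentheses tries to call a number
try:
    df.size()
    called_ok = True
except TypeError:
    called_ok = False
print("df.size() can be called:", called_ok)
```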

### Q6. Which function of pandas do we use to read an excel file.
### Answer : In pandas, you can use the read_excel function to read data from an Excel file. This function is part of the pandas library and allows you to read the contents of an Excel file into a DataFrame.

### Here is an example of how to use the read_excel function:

In [None]:
import pandas as pd

# Specify the Excel file path
excel_file_path = 'your_excel_file.xlsx'

# Use read_excel to read the Excel file into a DataFrame
df = pd.read_excel(excel_file_path)

# Now, 'df' contains the data from the Excel file


### In the example above, replace 'your_excel_file.xlsx' with the actual path or URL of your Excel file. Reading .xlsx files requires an Excel engine such as openpyxl to be installed alongside pandas. The read_excel function also accepts additional parameters to customize the import, such as sheet_name to choose a sheet and skiprows to skip leading rows. Refer to the pandas.read_excel documentation for the full list of options.
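### The optional parameters mentioned above can be passed as keyword arguments. The following is a minimal sketch: load_excel_sheet is a hypothetical wrapper, and the file name and sheet name in the commented usage are placeholders, not real files.

```python
import pandas as pd

def load_excel_sheet(path, sheet_name=0, skiprows=None, usecols=None):
    # sheet_name: sheet position (0 = first sheet) or sheet title
    # skiprows:   number of leading rows to skip before the header
    # usecols:    subset of columns to load, for example "A:C"
    return pd.read_excel(path, sheet_name=sheet_name, skiprows=skiprows, usecols=usecols)

# Hypothetical usage; requires an existing workbook and an installed Excel engine:
# df = load_excel_sheet('your_excel_file.xlsx', sheet_name='Sheet2', skiprows=1)
```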

### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

### The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.
### Answer : We can achieve this by using the apply method along with a custom function that extracts the username from each email address. Here's an example:

In [3]:
import pandas as pd

# Sample DataFrame with 'Email' column
data = {'Email': ['john.doe@example.com', 'alice.smith@example.com', 'bob.jones@example.com']}
df = pd.DataFrame(data)

# Function to extract username from email address
def extract_username(email):
    return email.split('@')[0]

# Create a new 'Username' column using the apply function
df['Username'] = df['Email'].apply(extract_username)

# Display the updated DataFrame
print(df)


                     Email     Username
0     john.doe@example.com     john.doe
1  alice.smith@example.com  alice.smith
2    bob.jones@example.com    bob.jones


### In this example, the extract_username function takes an email address as input, splits it at the '@' symbol, and returns the first part (username). The apply function is then used to apply this function to each element in the 'Email' column, creating a new 'Username' column in the DataFrame with the extracted usernames.
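### The extract_username helper above raises an AttributeError if an entry in 'Email' is missing, because None and NaN have no split method. A possible alternative, sketched here, wraps the logic in a function that takes the DataFrame and uses the vectorized .str accessor, which passes missing values through. The function name add_username_column and the None entry in the sample data are assumptions made for illustration.

```python
import pandas as pd

def add_username_column(df):
    # Work on a copy so the input DataFrame is left unchanged
    out = df.copy()
    # Split each address at '@' and keep the first piece; missing emails stay missing
    out['Username'] = out['Email'].str.split('@').str[0]
    return out

example = pd.DataFrame({'Email': ['john.doe@example.com', 'alice.smith@example.com', None]})
with_usernames = add_username_column(example)
print(with_usernames)
```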

### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:

   A   B   C
0  3   5   1
1  8   2   7
2  6   9   4
3  2   3   5
4  9   1   2

### Your function should select the following rows:

   A   B   C
1  8   2   7
4  9   1   2

### The function should return a new DataFrame that contains only the selected rows.
### Answer : Here's a Python function that selects rows from the DataFrame where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10:



In [4]:
import pandas as pd

# Sample DataFrame
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}

df = pd.DataFrame(data)

# Function to select rows based on conditions
def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Call the function
selected_df = select_rows(df)

# Display the selected DataFrame
print(selected_df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2


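### This function uses boolean indexing to filter rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The resulting DataFrame (selected_df) contains only the rows that satisfy these conditions. Note that row 2 (A = 6, B = 9) also satisfies both conditions, so the output has three rows, one more than the two rows listed in the example in the question.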
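### The same filter can be written with DataFrame.query, which some readers find easier to scan than chained boolean masks. This is a sketch only; the function name select_rows_with_query and the a_min and b_max parameters are assumptions made for illustration.

```python
import pandas as pd

def select_rows_with_query(df, a_min=5, b_max=10):
    # '@name' inside the query string refers to a local Python variable
    return df.query("A > @a_min and B < @b_max")

example = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                        'B': [5, 2, 9, 3, 1],
                        'C': [1, 7, 4, 5, 2]})
queried = select_rows_with_query(example)
print(queried)
```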

### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.
### Answer : We can use the mean(), median(), and std() methods provided by pandas to calculate the mean, median, and standard deviation of a specific column. Here's a Python function that does this for the 'Values' column in a given DataFrame:

In [5]:
import pandas as pd

# Sample DataFrame
data = {'Values': [3, 8, 6, 2, 9, 5, 12, 7]}
df = pd.DataFrame(data)

# Function to calculate mean, median, and standard deviation
def calculate_stats(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_dev_value = df['Values'].std()
    
    return mean_value, median_value, std_dev_value

# Call the function
mean_val, median_val, std_dev_val = calculate_stats(df)

# Display the calculated statistics
print(f'Mean: {mean_val}')
print(f'Median: {median_val}')
print(f'Standard Deviation: {std_dev_val}')


Mean: 6.5
Median: 6.5
Standard Deviation: 3.251373336211726


### In this example, the calculate_stats function takes a DataFrame as input, extracts the 'Values' column, and then uses the mean(), median(), and std() functions to calculate the mean, median, and standard deviation, respectively. The calculated values are then returned by the function.
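### One point worth knowing about the result above: pandas std() computes the sample standard deviation (ddof=1, dividing by n - 1) by default. The sketch below compares it with the population standard deviation (ddof=0, dividing by n) on the same values; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([3, 8, 6, 2, 9, 5, 12, 7], name='Values')

sample_std = values.std()            # default ddof=1: divides by n - 1
population_std = values.std(ddof=0)  # ddof=0: divides by n

print(f'Sample standard deviation: {sample_std}')
print(f'Population standard deviation: {population_std}')
```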

### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.
### Answer : We can use the rolling method in pandas to calculate the moving average for a specified window size. Here's a Python function that creates a new column 'MovingAverage' in the DataFrame, containing the moving average of the 'Sales' column over a window of 7 rows, which corresponds to the past 7 days when the data has exactly one row per day:

In [6]:
import pandas as pd

# Sample DataFrame
data = {'Date': pd.date_range('2022-01-01', '2022-01-15'),
        'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]}

df = pd.DataFrame(data)

# Function to calculate moving average
def calculate_moving_average(df, window_size=7):
    # Sort DataFrame by date (if not already sorted)
    df = df.sort_values(by='Date')

    # Calculate the moving average and create a new column 'MovingAverage'
    df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()

    return df

# Call the function
df_with_moving_average = calculate_moving_average(df)

# Display the DataFrame with the new 'MovingAverage' column
print(df_with_moving_average)


         Date  Sales  MovingAverage
0  2022-01-01     10           10.0
1  2022-01-02     15           12.5
2  2022-01-03     20           15.0
3  2022-01-04     25           17.5
4  2022-01-05     30           20.0
5  2022-01-06     35           22.5
6  2022-01-07     40           25.0
7  2022-01-08     45           30.0
8  2022-01-09     50           35.0
9  2022-01-10     55           40.0
10 2022-01-11     60           45.0
11 2022-01-12     65           50.0
12 2022-01-13     70           55.0
13 2022-01-14     75           60.0
14 2022-01-15     80           65.0


### In this example, the calculate_moving_average function takes a DataFrame as input, sorts it by the 'Date' column, and then uses the rolling method to calculate the moving average for the 'Sales' column with a window size of 7. The min_periods=1 argument ensures that the moving average is calculated even if there are fewer than 7 days of data.

### The resulting DataFrame (df_with_moving_average) contains the original 'Sales' column and a new 'MovingAverage' column.
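### The function above counts rows, so its window only matches 7 calendar days when there is one row per day. If dates can be missing, a time-based window is one option. The following sketch assumes the 'Date' column already has a datetime dtype; the function name add_moving_average_by_days, the reset of the index after sorting, and the sample data with a gap are assumptions made for illustration.

```python
import pandas as pd

def add_moving_average_by_days(df, days=7):
    # Sort by date and renumber rows so positions line up with the rolled values
    out = df.sort_values('Date').reset_index(drop=True)
    # A window such as '7D' covers the 7 calendar days ending at each row's date
    rolled = out.set_index('Date')['Sales'].rolling(f'{days}D').mean()
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Hypothetical data with a gap between 2 January and 10 January
example = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10']),
    'Sales': [10, 20, 30],
})
averaged = add_moving_average_by_days(example)
print(averaged)
```

### With a row-based window the last value would average all three sales; with the time-based window it only uses the sale from 10 January, because the earlier rows are more than 7 days old.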

### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:

         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

Your function should create the following DataFrame:

         Date    Weekday
0  2023-01-01     Sunday
1  2023-01-02     Monday
2  2023-01-03    Tuesday
3  2023-01-04  Wednesday
4  2023-01-05   Thursday

The function should return the modified DataFrame.

### Answer : We can achieve this by using the dt accessor in pandas to extract the weekday name from the 'Date' column. Here's a Python function that creates a new column 'Weekday' in the DataFrame:


In [7]:
import pandas as pd

# Sample DataFrame
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])}
df = pd.DataFrame(data)

# Function to create 'Weekday' column
def add_weekday_column(df):
    # Convert 'Date' column to datetime if not already
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Create 'Weekday' column using the dt accessor
    df['Weekday'] = df['Date'].dt.day_name()

    return df

# Call the function
df_with_weekday = add_weekday_column(df)

# Display the modified DataFrame
print(df_with_weekday)


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


### In this example, the add_weekday_column function converts the 'Date' column to datetime format (in case it's not already) and then calls the day_name() method of the dt accessor to extract the weekday names. The resulting DataFrame (df_with_weekday) contains the original 'Date' column and a new 'Weekday' column.
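### The function above modifies the DataFrame that is passed in, so df itself also gains the new column. If the caller's data should stay untouched, a copy can be used instead. This is a minimal sketch; the function name with_weekday_column and the string dates in the sample data are assumptions made for illustration.

```python
import pandas as pd

def with_weekday_column(df):
    # Copy first so the caller's DataFrame is not modified
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'])
    out['Weekday'] = out['Date'].dt.day_name()
    return out

# Dates given as plain strings to show the conversion step
example = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03']})
labelled = with_weekday_column(example)
print(labelled)
```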

### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.
### Answer : We can use boolean indexing to select rows within a specific date range in pandas. Here's a Python function that selects all rows where the date is between '2023-01-01' and '2023-01-31':

In [8]:
import pandas as pd

# Sample DataFrame
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-15', '2023-01-25', '2023-02-05'])}
df = pd.DataFrame(data)

# Function to select rows within a date range
def select_rows_by_date(df, start_date, end_date):
    # Convert 'Date' column to datetime if not already
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Use boolean indexing to select rows within the date range
    selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

    return selected_rows

# Call the function with the specified date range
start_date = '2023-01-01'
end_date = '2023-01-31'
selected_df = select_rows_by_date(df, start_date, end_date)

# Display the selected DataFrame
print(selected_df)


        Date
0 2023-01-01
1 2023-01-15
2 2023-01-25


### In this example, the select_rows_by_date function takes a DataFrame, converts the 'Date' column to datetime format (if not already), and then uses boolean indexing to select rows within the specified date range (start_date to end_date). The resulting DataFrame (selected_df) contains only the rows that satisfy these conditions.
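### Because the question mentions timestamps, one subtlety is worth a sketch: comparing against the string '2023-01-31' means comparing against midnight at the start of that day, so a timestamp such as 2023-01-31 15:30 would be left out by the function above. The variant below treats the end date as a whole day by using an exclusive upper bound one day later. The function name select_rows_in_range and the sample timestamps are assumptions made for illustration.

```python
import pandas as pd

def select_rows_in_range(df, start_date, end_date):
    # Convert without modifying the caller's DataFrame
    dates = pd.to_datetime(df['Date'])
    start = pd.Timestamp(start_date)
    # Exclusive bound at midnight after end_date, so all of end_date is included
    end_exclusive = pd.Timestamp(end_date) + pd.Timedelta(days=1)
    return df[(dates >= start) & (dates < end_exclusive)]

example = pd.DataFrame({'Date': pd.to_datetime([
    '2022-12-31 23:59', '2023-01-01 00:00', '2023-01-31 15:30', '2023-02-01 00:00',
])})
january = select_rows_in_range(example, '2023-01-01', '2023-01-31')
print(january)
```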

### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported ?
### Answer : To use the basic functions of pandas in Python, you need to import the pandas library. It is the primary library for data manipulation and analysis in Python. You typically import it using the following convention:

In [None]:
import pandas as pd


### This allows you to use the functionalities provided by pandas using the pd alias. For example, you can create a DataFrame, manipulate data, and perform various operations using the pandas functions after importing the library.