# Pandas Advanced Assignments

### Q1. List any five functions of the pandas library with execution.

Five commonly used functions in the pandas library along with their execution:

1. `read_csv()`: This function is used to read data from a CSV file and create a DataFrame.
```python
import pandas as pd
df = pd.read_csv('data.csv')
```

2. `head()`: This function returns the first few rows of a DataFrame (by default, the first five rows).
```python
df.head()
```

3. `describe()`: This method provides descriptive statistics such as count, mean, standard deviation, minimum, maximum, and quartiles. For a DataFrame with mixed column types, only the numeric columns are summarized by default.
```python
df.describe()
```

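4. `groupby()`: This method groups rows by the values in one or more columns. It returns a `GroupBy` object rather than a DataFrame, so an aggregation such as `sum()` or `mean()` is needed to produce a result.
```python
grouped = df.groupby('Category')
# Aggregate each group, here by summing the 'Value' column
totals = grouped['Value'].sum()
```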

5. `plot()`: This method draws line plots, bar plots, histograms, and other chart types from the data in the DataFrame. It relies on Matplotlib, which must be installed separately.
```python
df.plot(kind='line', x='Date', y='Value')
```

These functions are just a small sample of the extensive functionality provided by the pandas library. They help with data manipulation, exploration, and visualization, making pandas a powerful tool for data analysis.
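
The snippets above assume a file named `data.csv` already exists. As a minimal self-contained sketch, the same calls can be tried on an in-memory CSV; the column names `Date`, `Category`, and `Value` and the sample rows are assumptions made for illustration. `plot()` is left out because it needs Matplotlib.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Date,Category,Value
2023-01-01,A,10
2023-01-02,B,20
2023-01-03,A,30
2023-01-04,B,40
"""

# read_csv() accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

first_rows = df.head(2)                          # first two rows
summary = df['Value'].describe()                 # count, mean, std, min, quartiles, max
totals = df.groupby('Category')['Value'].sum()   # one total per category

print(first_rows)
print(summary)
print(totals)
```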

### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

You can achieve this by using the `set_index()` function in pandas along with a custom index array. Here's a Python function that re-indexes a DataFrame with a new index starting from 1 and incrementing by 2 for each row:

```python
import pandas as pd

def reindex_dataframe(df):
    new_index = pd.Index(range(1, len(df)*2, 2))
    df = df.set_index(new_index)
    return df
```

Here's how you can use this function with a sample DataFrame:

```python
# Sample DataFrame
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})

# Re-indexing the DataFrame
df = reindex_dataframe(df)

# Printing the updated DataFrame
print(df)
```

Output:
```
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
```

In the updated DataFrame, the rows are re-indexed with a new index starting from 1 and incrementing by 2 for each row.

### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

```python
import pandas as pd

def calculate_sum(df):
    # head(3) takes the first three rows (fewer if the DataFrame is shorter)
    values = df['Values'].head(3)
    sum_values = values.sum()
    print("Sum of the first three values:", sum_values)
```
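
The function above uses `head(3)` and a vectorized `sum()` rather than an explicit loop. Since the question asks for iteration, here is one possible sketch that walks the rows with `iterrows()`; the function name `sum_first_three_by_iteration` and the sample data are assumptions made for illustration.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    # Hypothetical variant: accumulate the total row by row
    total = 0
    for position, (_, row) in enumerate(df.iterrows()):
        if position >= 3:
            break  # stop after the first three rows
        total += row['Values']
    print("Sum of the first three values:", total)
    return total

# Sample DataFrame (illustrative values)
sample = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
total = sum_first_three_by_iteration(sample)
```

For large DataFrames the vectorized version is usually faster. The loop is shown only to match the wording of the question.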


### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

```python
def count_words(df):
    # Split each entry on whitespace and count the pieces;
    # str() guards against non-string values
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df
```

```python
# Sample DataFrame
df = pd.DataFrame({'Text': ['Hello, how are you?', 'I am doing well.', 'Python is great!']})

# Creating the 'Word_Count' column
df = count_words(df)

# Printing the updated DataFrame
print(df)
```

Output:
```
                  Text  Word_Count
0  Hello, how are you?           4
1     I am doing well.           4
2     Python is great!           3
```
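
One detail of the `apply` approach is that a missing value is first converted to the string `'nan'` (or `'None'`) and therefore counted as one word. If missing text should count as zero words instead, a possible sketch using the `.str` accessor follows; the function name `count_words_vectorized` is an assumption made for illustration.

```python
import pandas as pd

def count_words_vectorized(df):
    # Hypothetical alternative: treat missing text as empty, then split and count
    out = df.copy()
    out['Word_Count'] = out['Text'].fillna('').str.split().str.len()
    return out

# Sample DataFrame with one missing entry
texts = pd.DataFrame({'Text': ['Hello, how are you?', None, 'Python is great!']})
counted = count_words_vectorized(texts)
print(counted)
```

This version also returns a copy, so the caller's DataFrame is left unchanged.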


### Q5. How are DataFrame.size() and DataFrame.shape() different?

`DataFrame.size` and `DataFrame.shape` are attributes (properties), not methods, so they are accessed without parentheses. They report different information about the dimensions of a DataFrame.

1. `DataFrame.size`: Returns the total number of elements in the DataFrame as an integer, that is, the number of rows multiplied by the number of columns. The count covers every cell, including those holding NaN (missing) values.

2. `DataFrame.shape`: Returns a tuple `(rows, columns)` giving the length of the DataFrame along each dimension.



```python
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Total number of elements, via the size attribute
df_size = df.size

# (rows, columns) tuple, via the shape attribute
df_shape = df.shape

print("Size of DataFrame:", df_size)
print("Shape of DataFrame:", df_shape)
```

Output:
```
Size of DataFrame: 6
Shape of DataFrame: (3, 2)
```
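
Because both are attributes, writing them with parentheses (as in the question title) fails. The short sketch below illustrates this and also checks that `size` equals the product of the two `shape` entries. The variable names are chosen for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape unpacks into the two dimensions; size is their product
n_rows, n_cols = frame.shape
size_matches = frame.size == n_rows * n_cols

# Calling the attribute like a method raises TypeError,
# because the integer it returns is not callable
try:
    frame.size()
    call_failed = False
except TypeError:
    call_failed = True

print(size_matches, call_failed)
```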


### Q6. Which function of pandas do we use to read an excel file?

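Use `pd.read_excel()`, which reads a sheet of an Excel workbook into a DataFrame. Reading `.xlsx` files requires an engine such as `openpyxl` to be installed.

```python
import pandas as pd

# 'filename.xlsx' is a placeholder path
df = pd.read_excel('filename.xlsx')
```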

### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

### The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

```python
# Create a DataFrame with email addresses
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com', 'alice.walker@example.com']})

# Function to extract the username
def extract_username(df):
    df['Username'] = df['Email'].str.split('@').str[0]
    return df

# Apply the function to the DataFrame
df = extract_username(df)

# Print the updated DataFrame
print(df)
```

Output:
```
                      Email      Username
0      john.doe@example.com      john.doe
1    jane.smith@example.com    jane.smith
2  alice.walker@example.com  alice.walker
```


### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows. For example, if df contains the following values:

```
   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2
```

### Your function should select the following rows:

```
   A  B  C
1  8  2  7
4  9  1  2
```

The function should return a new DataFrame that contains only the selected rows.

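```python
# Create a DataFrame with the given values
df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

# Function to select rows
def select_rows(df):
    # Combine the two boolean masks with & (element-wise AND)
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Apply the function to the DataFrame
selected_df = select_rows(df)

# Print the selected DataFrame
print(selected_df)
```

Output:
```
   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
```

Row 2 (A = 6, B = 9) also satisfies both conditions, so it appears in the result even though the example in the question lists only rows 1 and 4.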


### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

```python
def calculate_stats(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_value = df['Values'].std()
    return mean_value, median_value, std_value

# Assuming you have a DataFrame named df with a column 'Values'

# Call the function to calculate statistics
mean, median, std = calculate_stats(df)

# Print the calculated statistics
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)
```
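
The cell above was not run against sample data. As a small worked sketch with illustrative values, the example below also shows that `Series.std()` computes the sample standard deviation by default (`ddof=1`), while passing `ddof=0` gives the population standard deviation.

```python
import pandas as pd

# Illustrative values chosen so the population standard deviation is exactly 2
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name='Values')

mean_value = values.mean()
median_value = values.median()
sample_std = values.std()            # default ddof=1 (divides by n - 1)
population_std = values.std(ddof=0)  # divides by n

print("Mean:", mean_value)
print("Median:", median_value)
print("Sample std:", sample_std)
print("Population std:", population_std)
```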


### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

```python
def calculate_moving_average(df):
    window_size = 7
    # min_periods=1 averages whatever rows are available for the first six rows,
    # instead of returning NaN until a full window exists
    df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()
    return df

# Create a sample DataFrame with columns 'Date' and 'Sales'
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'],
        'Sales': [20, 25, 18, 30, 22, 19, 24]}

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])  # Convert 'Date' column to datetime format

# Call the function to calculate the moving average
df = calculate_moving_average(df)

# Print the updated DataFrame with the moving average column
print(df)
```

Output:
```
        Date  Sales  MovingAverage
0 2023-01-01     20      20.000000
1 2023-01-02     25      22.500000
2 2023-01-03     18      21.000000
3 2023-01-04     30      23.250000
4 2023-01-05     22      23.000000
5 2023-01-06     19      22.333333
6 2023-01-07     24      22.571429
```
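
The solution above uses a window of seven rows, which matches seven days only when the data has exactly one row per day, sorted by date. If some days may be missing, a time-based window is one option. The sketch below is an assumption about how that could look, with a hypothetical function name `moving_average_by_days` and illustrative data containing a gap.

```python
import pandas as pd

def moving_average_by_days(df, days=7):
    # Hypothetical helper: use a time-based window instead of a fixed row count
    out = df.sort_values('Date').reset_index(drop=True)
    # On a DatetimeIndex, an offset window such as '7D' covers (t - 7 days, t],
    # i.e. the current day plus the six calendar days before it
    rolled = out.set_index('Date')['Sales'].rolling(f'{days}D').mean()
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Illustrative data with a gap between 2 January and 10 January
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10']),
    'Sales': [10, 20, 30],
})
result = moving_average_by_days(sales)
print(result)
```

In this sample the last row averages only itself, because the two earlier dates fall outside its seven-day window.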


### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column. For example, if df contains the following values:

```
        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
```

### Your function should create the following DataFrame:

```
        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
```

The function should return the modified DataFrame.

```python
def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

# Create a sample DataFrame with a column 'Date'
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])  # Convert 'Date' column to datetime format

# Call the function to add the weekday column
df = add_weekday_column(df)

# Print the modified DataFrame with the weekday column
print(df)
```

Output:
```
        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
```


### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

```python
def select_rows_by_date(df):
    start_date = pd.to_datetime('2023-01-01')
    end_date = pd.to_datetime('2023-01-31')
    # Both bounds are inclusive; each parses to midnight of the given day
    selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    return selected_rows

# Here df is the DataFrame from Q11, whose 'Date' column is already in datetime format

# Call the function to select rows by date
selected_df = select_rows_by_date(df)

# Print the selected DataFrame
print(selected_df)
```

Output:
```
        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
```
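
Since `pd.to_datetime('2023-01-31')` is midnight at the start of 31 January, a timestamp such as `2023-01-31 18:30` falls outside the `<=` comparison above. If the whole last day should be included, one possible sketch compares against the start of the following day instead. The function name `select_rows_in_range` and the sample timestamps are assumptions made for illustration.

```python
import pandas as pd

def select_rows_in_range(df, start='2023-01-01', end='2023-01-31'):
    # Hypothetical variant: include every timestamp on the end date
    start_ts = pd.Timestamp(start)
    end_exclusive = pd.Timestamp(end) + pd.Timedelta(days=1)
    mask = (df['Date'] >= start_ts) & (df['Date'] < end_exclusive)
    return df[mask]

# Illustrative timestamps around both boundaries
events = pd.DataFrame({'Date': pd.to_datetime([
    '2022-12-31 23:59',
    '2023-01-01 00:00',
    '2023-01-31 18:30',
    '2023-02-01 00:00',
])})
selected = select_rows_in_range(events)
print(selected)
```

Only the two middle rows are kept: the first is before the range and the last is exactly at the exclusive upper bound.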


### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

The first and foremost library that needs to be imported to use the basic functions of pandas is `pandas` itself. The `pandas` library provides various data structures and functions for data manipulation and analysis.

To import `pandas`, you can use the following line of code:

```python
import pandas as pd
```

By importing `pandas` as `pd`, you can then access the pandas functions and classes using the `pd` prefix. This is a common convention used by the pandas community. Once imported, you can use pandas functions to create, manipulate, and analyze data in your Python code.