# Consider the following code to answer the questions below:  
import pandas as pd  
course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']  
duration = [2, 3, 6, 4]  
df = pd.DataFrame(data={'course_name': course_name, 'duration': duration})


# Q1. Write a code to print the data present in the second row of the dataframe, df.

## **Solution:**  
To print the data from the second row of the DataFrame, we can use the `iloc[]` indexer, which is used for integer-location based indexing. Since Python indexing starts at 0, the second row corresponds to index 1.

### **Code Implementation:**

In [1]:
import pandas as pd

course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2, 3, 6, 4]

df = pd.DataFrame(data={'course_name': course_name, 'duration': duration})

# Printing the data in the second row of the dataframe
print(df.iloc[1])


course_name    Machine Learning
duration                      3
Name: 1, dtype: object


# Q2. What is the difference between the functions `loc` and `iloc` in pandas.DataFrame?

## **Answer:**
`loc[]` and `iloc[]` are both indexers for accessing rows and columns of a pandas DataFrame, but they differ in how they refer to them: `loc[]` uses labels, while `iloc[]` uses integer positions.

### **Key Differences:**

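1. **`loc[]`:**  
   - It is label-based indexing.
   - It accesses rows and columns by their labels (index values and column names).
   - When slicing, it includes both the start and the end label of the range.

2. **`iloc[]`:**  
   - It is integer-location based indexing.
   - It accesses rows and columns by their integer positions, starting from 0, regardless of their labels.
   - When slicing, it includes the start position but excludes the end position, like standard Python slicing.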
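
The slicing behaviour of the two indexers can be compared in a short sketch. The frame `demo` and its string labels are hypothetical and chosen only so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical frame with string row labels, used only for illustration
demo = pd.DataFrame(
    {"duration": [2, 3, 6, 4]},
    index=["a", "b", "c", "d"],
)

# Label-based slice: both end points "a" and "c" are included (3 rows)
by_label = demo.loc["a":"c"]

# Position-based slice: the end position 2 is excluded (2 rows)
by_position = demo.iloc[0:2]

print(by_label)
print(by_position)
```

The label slice returns rows `a`, `b`, and `c`, while the position slice returns only rows `a` and `b`.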

# Q3. Reindex the given dataframe using a variable, `reindex = [3, 0, 1, 2]` and store it in the variable, `new_df`. Then find the output for both `new_df.loc[2]` and `new_df.iloc[2]`. Did you observe any difference in both the outputs? If so, explain it.

## **Solution:**

To reindex the dataframe using the new index `[3, 0, 1, 2]`, we can use the `reindex()` function. After reindexing, we will access the rows using both `.loc[]` (label-based indexing) and `.iloc[]` (integer-location based indexing) and compare their outputs.

### **Code Implementation:**

In [2]:
import pandas as pd

# Original DataFrame
course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2, 3, 6, 4]
df = pd.DataFrame(data={'course_name': course_name, 'duration': duration})

# Reindexing the DataFrame
reindex = [3, 0, 1, 2]
new_df = df.reindex(reindex)

# Output using loc and iloc
print("new_df.loc[2]:")
print(new_df.loc[2])  # Using label-based indexing

print("\nnew_df.iloc[2]:")
print(new_df.iloc[2])  # Using integer-location based indexing

new_df.loc[2]:
course_name    Big Data
duration              6
Name: 2, dtype: object

new_df.iloc[2]:
course_name    Machine Learning
duration                      3
Name: 1, dtype: object


## Explanation

Yes, the two outputs differ:

- `new_df.loc[2]` returns the row whose label is 2, which holds `['Big Data', 6]`.
- `new_df.iloc[2]` returns the row at integer position 2 (the third row), which holds `['Machine Learning', 3]`.

After reindexing, the row labels appear in the order `[3, 0, 1, 2]`, so labels and positions no longer coincide. `loc[2]` looks up the label 2, which now sits in the last row, whereas `iloc[2]` takes the third row regardless of its label; that row carries the label 1.

In summary:

- `loc[]` uses the label of the row.
- `iloc[]` uses the integer position of the row.

# Consider the below code to answer further questions:
import pandas as pd  
import numpy as np  
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']  
indices = [1,2,3,4,5,6]  
#Creating a dataframe:  
df1 = pd.DataFrame(np.random.rand(6,6), columns = columns, index = indices)


# Q4. Write a code to find the following statistical measurements for the above dataframe df1:
(i) mean of each and every column present in the dataframe.
(ii) standard deviation of column, 'column_2'

## **Solution:**

We can use the `mean()` function to calculate the mean of each column and the `std()` function to calculate the standard deviation of a specific column in the DataFrame.

### **Code Implementation:**

In [3]:
import pandas as pd
import numpy as np

# Creating the DataFrame
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1, 2, 3, 4, 5, 6]
df1 = pd.DataFrame(np.random.rand(6, 6), columns = columns, index = indices)

# (i) Mean of each and every column
mean_values = df1.mean()

# (ii) Standard deviation of 'column_2'
std_column_2 = df1['column_2'].std()

# Output
print("Mean of each column:")
print(mean_values)

print("\nStandard deviation of column_2:")
print(std_column_2)


Mean of each column:
column_1    0.511501
column_2    0.489160
column_3    0.546999
column_4    0.399587
column_5    0.496963
column_6    0.568245
dtype: float64

Standard deviation of column_2:
0.31165172950938785


## Explanation:
- **Mean of each column:** `df1.mean()` returns the average of each column as a Series indexed by column name.
- **Standard deviation of `column_2`:** `std()` measures how much the values in `column_2` deviate from their mean. By default pandas computes the sample standard deviation (`ddof=1`).

Because the DataFrame is filled with `np.random.rand()`, the exact numbers change on every run.
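
The standard deviation pandas reports is the sample version (`ddof=1`), whereas NumPy's `np.std()` defaults to the population version (`ddof=0`). A small sketch with fixed, hypothetical values (not the random data above) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the result is reproducible
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)     # divide by n instead
numpy_std = np.std(values.to_numpy())   # NumPy default: ddof=0

print("sample:", sample_std)
print("population:", population_std)
print("numpy default:", numpy_std)
```

For these values the population standard deviation is exactly 2.0, and the sample standard deviation is slightly larger.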

# Q5. Replace the data present in the second row of column, ‘column_2’ by a string variable then find the mean of column, 'column_2'.  
If you are getting errors in executing it, then explain why.

## **Solution:**

To replace the value in the second row of `column_2` with a string, we can assign to it through the `loc[]` indexer; because the index labels run from 1 to 6, the second row has the label 2. Storing a string in a numeric column turns the column into mixed data (floats and one string), so calculating its mean afterwards raises an error.

### **Code Implementation:**

In [4]:
import pandas as pd
import numpy as np

# Creating the DataFrame
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1, 2, 3, 4, 5, 6]
df1 = pd.DataFrame(np.random.rand(6, 6), columns = columns, index = indices)

# Replacing the second row of 'column_2' with a string
df1.loc[2, 'column_2'] = 'string_value'

# Trying to find the mean of 'column_2'
try:
    mean_column_2 = df1['column_2'].mean()
    print("Mean of column_2:", mean_column_2)
except Exception as e:
    print("Error:", e)

Error: unsupported operand type(s) for +: 'float' and 'str'



## Explanation of Error:
The error occurs because `mean()` has to add the values in the column, and a float cannot be added to a string. Once the second row holds a string, the column no longer has a numeric dtype: pandas stores it with the `object` dtype, which can hold arbitrary Python objects. Calling `mean()` on such a column attempts the arithmetic element by element and fails with the error shown above.

To avoid this error, ensure that all values in the column are numeric before attempting to calculate the mean.
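
One possible way to do this, sketched here under the assumption that non-numeric entries should simply be ignored, is to convert the column with `pd.to_numeric(..., errors="coerce")` before taking the mean. The Series `mixed` below is a hypothetical stand-in for `column_2`:

```python
import pandas as pd

# Hypothetical column that mixes floats with one string (object dtype)
mixed = pd.Series([0.2, "string_value", 0.4], index=[1, 2, 3], dtype=object)

# errors="coerce" turns values that cannot be parsed as numbers into NaN
numeric = pd.to_numeric(mixed, errors="coerce")

# mean() skips NaN by default, so only 0.2 and 0.4 contribute
print(numeric.mean())
```

Whether dropping the invalid entry is appropriate depends on the data; in other cases it may be better to correct the value at its source.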

# Q6. What do you understand about window functions in pandas? List the types of window functions.

## **Answer:**

In pandas, the **window functions** (or **rolling operations**) allow you to perform calculations over a moving or sliding window of data. These functions help to analyze data over a specified period or range of rows and are particularly useful for time series analysis, smoothing data, and calculating moving averages, among other tasks.

### **Types of Window Functions in pandas:**

1. **Rolling Window:**
   - It applies a function over a fixed-size window that moves across the data.
   - It is used to calculate statistics like moving averages, sums, or other aggregations over a specific window of data.

2. **Expanding Window:**
   - It calculates statistics over all data from the first row up to the current row. Unlike a rolling window, which has a fixed size, an expanding window grows as you move through the data.
   - Example: calculating the cumulative sum (expanding sum) of a column.

3. **EWM (Exponentially Weighted Window):**
   - It gives more weight to recent observations when calculating statistics by applying a decay factor to older data.
   - It is often used for smoothing time series data, for example when calculating an exponentially weighted moving average.

In [5]:
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'column_1': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data)

# Rolling Window Example (Moving Average)
df['moving_avg'] = df['column_1'].rolling(window=3).mean()

# Expanding Window Example (Cumulative Sum)
df['expanding_sum'] = df['column_1'].expanding().sum()

# EWM (Exponentially Weighted Moving Average) Example
df['ewm_avg'] = df['column_1'].ewm(span=3).mean()

# Output the results
print(df)

   column_1  moving_avg  expanding_sum    ewm_avg
0        10         NaN           10.0  10.000000
1        20         NaN           30.0  16.666667
2        30        20.0           60.0  24.285714
3        40        30.0          100.0  32.666667
4        50        40.0          150.0  41.612903
5        60        50.0          210.0  50.952381
6        70        60.0          280.0  60.551181
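
The `NaN` values in the first two rows of `moving_avg` appear because a window of 3 rows is not yet full there. A minimal sketch, assuming partial windows are acceptable for the analysis, uses the `min_periods` parameter of `rolling()`; the Series `prices` is hypothetical:

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series([10, 20, 30, 40])

# Default: a result is produced only once the window holds 3 values
full_windows = prices.rolling(window=3).mean()

# min_periods=1: a result is produced as soon as 1 value is available
partial_windows = prices.rolling(window=3, min_periods=1).mean()

print(full_windows)
print(partial_windows)
```

With `min_periods=1`, the first two results are averages over one and two values respectively, so they are not directly comparable with the full three-row averages that follow.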


# Q7. Write a code to print only the current month and year at the time of answering this question.

## **Answer:**

To print the current month and year, we can call `pd.to_datetime("today")`, which returns the current date and time as a pandas `Timestamp`, and then format it with `strftime()` so that only the month and year are printed.

### **Code Implementation:**

In [6]:
import pandas as pd

# Get the current date
current_date = pd.to_datetime("today")

# Extract the current month and year
current_month_year = current_date.strftime('%B %Y')  # %B gives the full month name, %Y gives the year

# Print the current month and year
print(current_month_year)

February 2025
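
As an alternative sketch, the same information can be read from the `month` and `year` attributes of a `Timestamp`, which is convenient when the values are needed as numbers and not as formatted text:

```python
import pandas as pd

# Current local date and time as a pandas Timestamp
now = pd.Timestamp.now()

month_number = now.month   # integer from 1 to 12
year = now.year            # four-digit year as an integer

print(f"{now.month_name()} {year}")   # month name followed by the year
print(f"{month_number:02d}/{year}")   # numeric form, zero-padded month
```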


# Q8. Write a Python program that takes in two dates as input (in the format YYYY-MM-DD) and calculates the difference between them in days, hours, and minutes using Pandas time delta. The program should prompt the user to enter the dates and display the result.

## **Answer:**

To calculate the difference between two dates, we can use the `pd.to_datetime()` function to convert the input dates into datetime objects, then subtract the two dates to get a time delta. We can then extract the difference in days, hours, and minutes from the time delta.

### **Code Implementation:**

In [7]:
import pandas as pd

# Prompt the user to input two dates in the format YYYY-MM-DD
date1_input = input("Enter the first date (YYYY-MM-DD): ")
date2_input = input("Enter the second date (YYYY-MM-DD): ")

# Convert the input strings into pandas datetime objects
date1 = pd.to_datetime(date1_input)
date2 = pd.to_datetime(date2_input)

# Calculate the difference between the two dates
time_difference = abs(date2 - date1)

# Extract days, hours, and minutes from the time delta
days = time_difference.days
hours = time_difference.seconds // 3600
minutes = (time_difference.seconds % 3600) // 60

# Print the result
print(f"The difference is {days} days, {hours} hours, and {minutes} minutes.")

Enter the first date (YYYY-MM-DD): 2023-02-10
Enter the second date (YYYY-MM-DD): 2025-03-20
The difference is 769 days, 0 hours, and 0 minutes.
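
Because the inputs above contain dates only, the hours and minutes are always 0. A hedged variant, which wraps the calculation in a hypothetical helper `describe_difference` and reads the parts from `Timedelta.components`, also handles inputs that include a time of day:

```python
import pandas as pd


def describe_difference(first: str, second: str) -> str:
    """Return the absolute difference between two date strings as text."""
    delta = abs(pd.to_datetime(second) - pd.to_datetime(first))
    # components is a named tuple with days, hours, minutes, seconds, ...
    parts = delta.components
    return f"{parts.days} days, {parts.hours} hours, and {parts.minutes} minutes"


print(describe_difference("2023-02-10", "2025-03-20"))
print(describe_difference("2025-03-20 08:30", "2025-03-21 10:45"))
```

In an interactive program, the two arguments would come from `input()` as in the cell above.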


# Q9. Write a Python program that reads a CSV file containing categorical data and converts a specified column to a categorical data type. The program should prompt the user to enter the file path, column name, and category order, and then display the sorted data.

## **Answer:**

To achieve this, we can use `pandas` to read the CSV file, convert a specific column to a categorical type, and then sort the data based on the categories defined by the user.

Here is how we can implement this:

### **Code Implementation:**

In [12]:
import pandas as pd

# Sample data to create a CSV file
data = {
    'Name': ['Alice', 'Bob', 'Clarie', 'David', 'Eve'],
    'Category': ['High', 'Medium', 'Low', 'High', 'Medium']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('sample_data.csv', index=False)

# Display the DataFrame
df


     Name  Category
0   Alice      High
1     Bob    Medium
2  Clarie       Low
3   David      High
4     Eve    Medium
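
The cell above only creates the sample file. The remaining steps (reading the CSV, converting the column to an ordered categorical, and sorting) could be sketched as follows. The helper name `sort_by_category` and the in-memory CSV text are assumptions made for illustration; the in-memory text mirrors `sample_data.csv` so the example runs without a file on disk:

```python
import io

import pandas as pd


def sort_by_category(source, column, order):
    """Read a CSV, make `column` an ordered categorical, and sort by it."""
    frame = pd.read_csv(source)
    # Values that are not listed in `order` become NaN
    frame[column] = pd.Categorical(frame[column], categories=order, ordered=True)
    # A stable sort keeps the original row order within each category
    return frame.sort_values(column, kind="stable").reset_index(drop=True)


# In-memory stand-in with the same content as sample_data.csv
csv_text = "Name,Category\nAlice,High\nBob,Medium\nClarie,Low\nDavid,High\nEve,Medium\n"
sorted_df = sort_by_category(io.StringIO(csv_text), "Category", ["Low", "Medium", "High"])
print(sorted_df)

# Interactive use, as the question asks (left commented out so the sketch runs unattended):
# path = input("Enter the CSV file path: ")
# column = input("Enter the column name: ")
# order = [c.strip() for c in input("Enter the category order (comma-separated): ").split(",")]
# print(sort_by_category(path, column, order))
```

The rows are sorted by the user-defined order `Low < Medium < High`, not alphabetically, which is the main reason to use an ordered categorical here.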
