### **Advanced Assignment: Getting Started with Pandas**

### **Assignment 1: Complex DataFrame Manipulation** (20 points)
**Objective**: Practice advanced DataFrame creation, manipulation, and indexing.

In [None]:
import pandas as pd

1. **Task**:
   - Create a DataFrame from the following multi-level dictionary representing students' scores:
     ```python
     data = {
         'Grade 9': {
             'Math': [78, 88, 95, 67],
             'English': [82, 79, 88, 91],
         },
         'Grade 10': {
             'Math': [81, 85, 79, 92],
             'English': [84, 90, 73, 87],
         }
     }
     ```
   - Set the column index to reflect both `Grade` and `Subject`, with the rows representing individual students.
   - Perform the following operations:
     - Select all the `Math` scores for `Grade 10`.
     - Calculate the average `English` score across both grades.

In [None]:
data = {
         'Grade 9': {
             'Math': [78, 88, 95, 67],
             'English': [82, 79, 88, 91],
         },
         'Grade 10': {
             'Math': [81, 85, 79, 92],
             'English': [84, 90, 73, 87],
         }
     }

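# Flatten the nested dict into {(grade, subject): scores}; tuple keys become a column MultiIndex.
df = pd.DataFrame({
    (grade, subject): scores
    for grade, subjects in data.items()
    for subject, scores in subjects.items()
})
df.columns.names = ['Grade', 'Subject']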
print(df)

math_scoresof_grade10 = df[('Grade 10','Math')]
print("\nMath scores for Grade 10:")
print(math_scoresof_grade10)

# Pool the English columns of both grades and take a single overall mean.
english_avg = df.xs('English', axis=1, level=1).to_numpy().mean()
print("\nAverage English score across both grades:")
print(english_avg)
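
The selection steps in this task can also be written with `xs`, which takes a cross-section of one level of the column MultiIndex. The following is a minimal sketch, not the required solution; the level names `Grade` and `Subject` and the variable names `english` and `overall_english_avg` are assumptions chosen for illustration.

```python
import pandas as pd

data = {
    'Grade 9': {'Math': [78, 88, 95, 67], 'English': [82, 79, 88, 91]},
    'Grade 10': {'Math': [81, 85, 79, 92], 'English': [84, 90, 73, 87]},
}

# Tuple keys (grade, subject) become a two-level column index.
df = pd.DataFrame({
    (grade, subject): scores
    for grade, subjects in data.items()
    for subject, scores in subjects.items()
})
df.columns.names = ['Grade', 'Subject']  # assumed level names

# Cross-section: every 'English' column, one per grade.
english = df.xs('English', axis=1, level='Subject')

# One average over all eight English scores.
overall_english_avg = english.to_numpy().mean()
print(english)
print(overall_english_avg)
```

Because both grades have the same number of students, this pooled mean equals the mean of the two per-grade means; with unequal group sizes the two would differ.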

------

### **Assignment 2: Advanced Indexing and Filtering** (20 points)
**Objective**: Practice complex indexing and filtering on a DataFrame.

1. **Task**:
   - Create a DataFrame representing the sales of various products across different regions:
     ```python
     data = {
         'Product': ['A', 'B', 'C', 'D'],
         'Region': ['North', 'South', 'East', 'West'],
         'Sales_Q1': [120, 150, 200, 130],
         'Sales_Q2': [180, 140, 170, 160]
     }
     ```
   - Set `Product` as the index of the DataFrame.
   - Perform the following operations:
     - Select only those products where `Sales_Q2` is greater than `Sales_Q1`.
     - Filter and display rows where `Sales_Q1` or `Sales_Q2` exceeds 160.
     - Replace all sales values greater than 170 with the value `170`.

In [None]:
data = {
         'Product': ['A', 'B', 'C', 'D'],
         'Region': ['North', 'South', 'East', 'West'],
         'Sales_Q1': [120, 150, 200, 130],
         'Sales_Q2': [180, 140, 170, 160]
     }

In [None]:
df = pd.DataFrame(data).set_index('Product')
print(df)

sales_q2_greater_q1 = df[df['Sales_Q2'] > df['Sales_Q1']]
print("\nProducts where Sales_Q2 is greater than Sales_Q1:")
print(sales_q2_greater_q1)

sales_exceeds_160 = df[(df['Sales_Q1'] > 160) | (df['Sales_Q2'] > 160)]
print("\nRows where Sales_Q1 or Sales_Q2 exceeds 160:")
print(sales_exceeds_160)

# Cap sales at 170 with clip(); the element-wise applymap() is deprecated in pandas 2.1+.
df[['Sales_Q1', 'Sales_Q2']] = df[['Sales_Q1', 'Sales_Q2']].clip(upper=170)
print("\nDataFrame with sales values capped at 170:")
print(df)
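
The "either quarter exceeds 160" filter and the cap at 170 can also be expressed over a list of columns, which scales better if more quarters are added. This is a hedged sketch of that idea; `sales_cols`, `exceeds_160`, and `capped` are illustrative names, and the cap is applied to a copy so the original frame stays intact.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    'Region': ['North', 'South', 'East', 'West'],
    'Sales_Q1': [120, 150, 200, 130],
    'Sales_Q2': [180, 140, 170, 160],
}).set_index('Product')

sales_cols = ['Sales_Q1', 'Sales_Q2']

# True for a row if any sales column is strictly greater than 160.
exceeds_160 = df[sales_cols].gt(160).any(axis=1)
print(df[exceeds_160])

# where() keeps values that satisfy the condition and replaces the rest with 170.
capped = df.copy()
capped[sales_cols] = df[sales_cols].where(df[sales_cols] <= 170, 170)
print(capped)
```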

### **Assignment 3: Grouping with Multiple Aggregations** (20 points)
**Objective**: Perform complex groupby operations with multiple aggregation functions.

1. **Task**:
   - Create a DataFrame with the following data on employees, their department, and monthly salary:
     ```python
     data = {
         'Employee': ['John', 'Jane', 'Tom', 'Lucy', 'Max', 'Anna'],
         'Department': ['HR', 'Finance', 'HR', 'Finance', 'Sales', 'Sales'],
         'Salary': [3000, 4000, 3500, 4200, 3700, 3800],
         'Experience': [2, 5, 3, 4, 2, 6]
     }
     ```
   - Group the data by `Department` and calculate the following for each department:
     - The total salary.
     - The average experience.
   - Further, apply multiple aggregation functions to the `Salary` column: find the minimum, maximum, and mean salary for each department.

In [None]:
data = {
         'Employee': ['John', 'Jane', 'Tom', 'Lucy', 'Max', 'Anna'],
         'Department': ['HR', 'Finance', 'HR', 'Finance', 'Sales', 'Sales'],
         'Salary': [3000, 4000, 3500, 4200, 3700, 3800],
         'Experience': [2, 5, 3, 4, 2, 6]
     }

In [None]:
df = pd.DataFrame(data)
print(df)

grouped = df.groupby('Department')

department_totals = grouped.agg(
    Total_Salary=('Salary', 'sum'),
    Avg_Experience=('Experience', 'mean')
)

salary_stats = grouped['Salary'].agg(['min', 'max', 'mean'])
print("Total Salary and Average Experience by Department:")
print(department_totals)
print("\nSalary Statistics by Department:")
print(salary_stats)
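
If a single summary table is preferred, the two aggregation steps above can be combined in one `agg` call using named aggregation. This is a sketch under that assumption; the output column names (`Min_Salary`, `Max_Salary`, `Mean_Salary`) are illustrative and not part of the assignment.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Tom', 'Lucy', 'Max', 'Anna'],
    'Department': ['HR', 'Finance', 'HR', 'Finance', 'Sales', 'Sales'],
    'Salary': [3000, 4000, 3500, 4200, 3700, 3800],
    'Experience': [2, 5, 3, 4, 2, 6],
})

# Each keyword is an output column: name=(source column, aggregation function).
summary = df.groupby('Department').agg(
    Total_Salary=('Salary', 'sum'),
    Avg_Experience=('Experience', 'mean'),
    Min_Salary=('Salary', 'min'),
    Max_Salary=('Salary', 'max'),
    Mean_Salary=('Salary', 'mean'),
)
print(summary)
```

Named aggregation yields flat column names, whereas passing a list of functions to several columns at once produces a two-level column index.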

### **Assignment 4: Complex Merging and Joining** (15 points)
**Objective**: Use advanced merging techniques to combine multiple DataFrames.

1. **Task**:
   - Create two DataFrames:
     - One representing customer details:
       ```python
       customers = {
           'CustomerID': [1, 2, 3, 4],
           'Name': ['Alice', 'Bob', 'Charlie', 'David'],
           'Country': ['USA', 'Canada', 'USA', 'Mexico']
       }
       ```
     - Another representing order details:
       ```python
       orders = {
           'OrderID': [101, 102, 103, 104],
           'CustomerID': [1, 2, 2, 3],
           'Amount': [250, 100, 200, 150]
       }
       ```
   - Merge the two DataFrames on `CustomerID` and perform the following:
     - Display all customers, even if they don’t have any orders.
     - Calculate the total amount spent by each customer, including those with no orders.
     - Sort the merged DataFrame by `Amount` in descending order, placing missing values (`NaN`) at the bottom.


In [None]:
customers = {
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Country': ['USA', 'Canada', 'USA', 'Mexico']
}

orders = {
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 3],
    'Amount': [250, 100, 200, 150]
}

df_customers = pd.DataFrame(customers)
df_orders = pd.DataFrame(orders)

merged_df = pd.merge(df_customers, df_orders, on='CustomerID', how='left')
print("Merged DataFrame (all customers):")
print(merged_df)

total_amount_spent = merged_df.groupby(['CustomerID', 'Name', 'Country'])['Amount'].sum().reset_index()
print("\nTotal amount spent by each customer:")
print(total_amount_spent)


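# na_position='last' is the default, but stating it makes the NaN placement explicit.
sorted_df = merged_df.sort_values(by='Amount', ascending=False, na_position='last')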
print("\nMerged DataFrame sorted by Amount:")
print(sorted_df)
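
To see which customers have no orders after a left merge, `pd.merge` accepts `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below is one possible way to use it; `no_orders` and `totals` are illustrative names.

```python
import pandas as pd

df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Country': ['USA', 'Canada', 'USA', 'Mexico'],
})
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 3],
    'Amount': [250, 100, 200, 150],
})

# Left merge keeps every customer; _merge is 'left_only' when no order matched.
merged = pd.merge(df_customers, df_orders, on='CustomerID', how='left', indicator=True)
no_orders = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
print("Customers without orders:", no_orders)

# sum() skips NaN, so a customer with no orders gets a total of 0.
totals = merged.groupby('Name')['Amount'].sum()
print(totals)
```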


### **Assignment 5: Advanced Pivoting and Reshaping** (15 points)
**Objective**: Use pivoting, stacking, and unstacking techniques for reshaping data.

1. **Task**:
   - Create a DataFrame representing temperature readings at different times across three cities:
     ```python
     data = {
         'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
         'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
         'Temperature': [30, 75, 25, 32, 77, 27],
         'Time': ['Morning', 'Morning', 'Morning', 'Afternoon', 'Afternoon', 'Afternoon']
     }
     ```
   - Pivot the data to create a table where the rows represent the `Date` and the columns represent the cities, with the temperature readings as the values.
   - Stack and unstack the pivoted data to swap the hierarchy of the cities and time of the day (`Morning`, `Afternoon`).
   - Calculate the daily average temperature for each city.

In [None]:
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Temperature': [30, 75, 25, 32, 77, 27],
    'Time': ['Morning', 'Morning', 'Morning', 'Afternoon', 'Afternoon', 'Afternoon']
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print("\nPivoted DataFrame:\n", pivot_df)

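# Move Date, Time and City into a row MultiIndex (the long, "stacked" layout).
stacked = df.set_index(['Date', 'Time', 'City'])
print("\nStacked DataFrame:\n", stacked)

# Unstack the Time level into columns, leaving (Date, City) as the row index.
unstacked = stacked.unstack('Time')
print("\nUnstacked DataFrame:\n", unstacked)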

daily_avg = df.groupby(['Date', 'City'])['Temperature'].mean()
print("\nDaily Average Temperature for Each City:\n", daily_avg)
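
Swapping the hierarchy of `City` and `Time` can be illustrated on a two-level column index. The sketch below is one possible approach and uses `unstack` plus `swaplevel` rather than a literal `stack` call, since `stack` treats missing cells differently across pandas versions; `city_outer` and `time_outer` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Temperature': [30, 75, 25, 32, 77, 27],
    'Time': ['Morning', 'Morning', 'Morning', 'Afternoon', 'Afternoon', 'Afternoon'],
})

# Long form: one temperature per (Date, City, Time).
s = df.set_index(['Date', 'City', 'Time'])['Temperature']

# Columns ordered (City, Time): City is the outer level.
city_outer = s.unstack(['City', 'Time'])

# Swap the two column levels so Time becomes the outer level, then sort for readability.
time_outer = city_outer.swaplevel('City', 'Time', axis=1).sort_index(axis=1)
print(time_outer)
```

Cells for combinations that were never measured (for example, an afternoon reading on 2023-01-01) appear as `NaN` in both layouts.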


### **Assignment 7: Time Series Data Handling** (10 points)
**Objective**: Practice working with time series data using pandas.

1. **Task**:
   - Create a date range of 30 consecutive business days starting from `2023-01-01`.
   - Create a DataFrame with random stock prices for a single company for each of these days.
   - Calculate the rolling 7-day moving average of the stock prices.
   - Find the day with the highest stock price and the corresponding price.

In [None]:
import numpy as np
date_range = pd.date_range(start='2023-01-01', periods=30, freq='B')
print("Date Range:\n", date_range)

# Fix the seed so the random prices are reproducible.
np.random.seed(0)
stock_prices = np.random.uniform(low=100, high=200, size=30)
df = pd.DataFrame({'Date': date_range, 'Stock Price': stock_prices})
print("\nDataFrame with Random Stock Prices:\n", df)

df['7-Day MA'] = df['Stock Price'].rolling(window=7).mean()
print("\nDataFrame with 7-Day Moving Average:\n", df)

max_price_row = df.loc[df['Stock Price'].idxmax()]
print("\nDay with Highest Stock Price:\n", max_price_row[['Date', 'Stock Price']])
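
Two details of this task are easy to miss: `2023-01-01` is a Sunday, so a business-day range starts on the following Monday, and a rolling window of 7 leaves the first six rows empty unless `min_periods` is lowered. The sketch below illustrates both; it uses the dates as the index and NumPy's `default_rng`, which are choices made for this example rather than requirements of the assignment.

```python
import numpy as np
import pandas as pd

# freq='B' skips weekends, so the range begins on Monday 2023-01-02.
dates = pd.date_range(start='2023-01-01', periods=30, freq='B')

rng = np.random.default_rng(0)  # seeded generator for reproducible prices
prices = pd.Series(rng.uniform(100, 200, size=30), index=dates, name='Stock Price')

# Strict window: NaN until 7 observations are available.
ma_strict = prices.rolling(window=7).mean()

# Partial window: averages whatever is available from the first day onward.
ma_partial = prices.rolling(window=7, min_periods=1).mean()

# With dates as the index, idxmax() returns the date of the highest price directly.
best_day = prices.idxmax()
best_price = prices.max()
print(best_day.date(), round(best_price, 2))
```

Note that the window counts rows, so here it spans 7 business days rather than 7 calendar days.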

### **Assignment 8: Advanced Missing Data Handling** (10 points)
**Objective**: Use advanced techniques to handle missing data.

1. **Task**:
   - Create a DataFrame with missing values in multiple columns:
     ```python
     data = {
         'Name': ['Alice', 'Bob', 'Charlie', 'David'],
         'Age': [25, np.nan, 35, np.nan],
         'Score': [85, 90, np.nan, 88]
     }
     ```
   - Perform the following operations:
     - Interpolate the missing values in the `Age` column.
     - Fill the missing values in the `Score` column with the mean score.
     - Drop any rows that still have missing values after the above steps.

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, np.nan],
    'Score': [85, 90, np.nan, 88]
}

df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:\n", df)

# Step 2: Interpolate missing values in the 'Age' column
# interpolate() is linear by default; a trailing NaN (David) is filled with the last valid value (35).
df['Age'] = df['Age'].interpolate()
print("\nDataFrame after Interpolating 'Age':\n", df)

# Step 3: Fill missing values in the 'Score' column with the mean score
# `fillna(df['Score'].mean())` replaces missing values in 'Score' with the column mean.
df['Score'] = df['Score'].fillna(df['Score'].mean())
print("\nDataFrame after Filling 'Score' with Mean:\n", df)

# Step 4: Drop any rows that still contain missing values
# `dropna()` removes any rows that still have NaN values after the previous steps.
df = df.dropna()
print("\nDataFrame after Dropping Remaining Missing Values:\n", df)
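
With the default settings, interpolation also fills the trailing gap in `Age`, so the final `dropna()` has nothing left to remove. The sketch below contrasts that with `limit_area='inside'`, which only fills gaps surrounded by valid values; whether a trailing value should be filled is a modelling choice, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, np.nan],
    'Score': [85, 90, np.nan, 88],
})

# Default: linear interpolation, trailing NaN carried forward from the last valid value.
age_default = df['Age'].interpolate()

# limit_area='inside': only fill NaNs that lie between two valid values.
age_inside = df['Age'].interpolate(limit_area='inside')

# Apply the stricter variant, fill Score with its mean, then drop what is still missing.
cleaned = df.assign(
    Age=age_inside,
    Score=df['Score'].fillna(df['Score'].mean()),
).dropna()
print(age_default.tolist())
print(cleaned)
```

In this variant David's row is dropped because his age cannot be bracketed by two known values.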