### Question 1 - Fixing Messy Columns and Missing Data

Tasks:

1. Remove leading/trailing spaces from all string columns.

2. Convert "Purchase Amount" to numeric (float), removing the $ symbol.

3. Standardize "Signup Date" to YYYY-MM-DD format as datetime.

4. Fill missing "Name" values with "Unknown".

5. Drop any rows where "Customer ID" is missing.

6. Reset the index after cleaning.

In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Customer ID': ['1001', '1002', None, '1004', '1005'],
    ' Name ': [' Alice ', 'Bob', 'Charlie', None, ' Eve '],
    'Purchase Amount': ['$250.00', '$100', None, '$300.50', '$450.75'],
    'Signup Date': ['2024/01/01', '2024-02-15', '15-03-2024', '2024-04-01', None]
})

In [None]:
# Q1
df.columns = df.columns.str.strip()  # Strip whitespace from the column names
# Strip whitespace from the values of every string (object) column
df = df.apply(lambda col: col.str.strip() if col.dtype == 'object' else col)
print(df)

# Q2
df['Purchase Amount'] = df['Purchase Amount'].str.replace('$', '', regex=False).astype(float)
print(df)

# Q3
# Note: .dt.strftime returns strings, not datetimes; drop it to keep a datetime64 column
df['Signup Date'] = pd.to_datetime(df['Signup Date'], errors='coerce', format='mixed').dt.strftime("%Y-%m-%d")
print(df)

# Q4
df['Name'] = df['Name'].fillna('Unknown')
print(df)

# Q5
# Equivalent alternative: df = df[df['Customer ID'].notna()]
df = df.dropna(subset=['Customer ID'])
print(df)

# Q6
df = df.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back
df

  Customer ID     Name Purchase Amount Signup Date
0        1001    Alice         $250.00  2024/01/01
1        1002      Bob            $100  2024-02-15
2        None  Charlie            None  15-03-2024
3        1004     None         $300.50  2024-04-01
4        1005      Eve         $450.75        None
  Customer ID     Name  Purchase Amount Signup Date
0        1001    Alice           250.00  2024/01/01
1        1002      Bob           100.00  2024-02-15
2        None  Charlie              NaN  15-03-2024
3        1004     None           300.50  2024-04-01
4        1005      Eve           450.75        None
  Customer ID     Name  Purchase Amount Signup Date
0        1001    Alice           250.00  2024-01-01
1        1002      Bob           100.00  2024-02-15
2        None  Charlie              NaN  2024-03-15
3        1004     None           300.50  2024-04-01
4        1005      Eve           450.75         NaN
  Customer ID     Name  Purchase Amount Signup Date
0        1001    A

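|   | Customer ID | Name    | Purchase Amount | Signup Date |
|---|-------------|---------|-----------------|-------------|
| 0 | 1001        | Alice   | 250.0           | 2024-01-01  |
| 1 | 1002        | Bob     | 100.0           | 2024-02-15  |
| 2 | 1004        | Unknown | 300.5           | 2024-04-01  |
| 3 | 1005        | Eve     | 450.75          | NaN         |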
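
Task 3 asks for a datetime column, while `.dt.strftime` produces strings. The following is a minimal sketch, assuming pandas 2.0 or later (where `format='mixed'` is available), of keeping the parsed values as datetimes. The standalone `dates` Series and the `parsed` name are illustrative only.

```python
import pandas as pd

# Same raw values as the "Signup Date" column above.
dates = pd.Series(['2024/01/01', '2024-02-15', '15-03-2024', '2024-04-01', None])

# Parse each value on its own; missing or unparseable entries become NaT.
parsed = pd.to_datetime(dates, errors='coerce', format='mixed')

print(parsed.dtype)         # a datetime64 dtype rather than object
print(parsed.isna().sum())  # number of entries that could not be parsed
```

A datetime column already displays as YYYY-MM-DD, so formatting to strings is only needed when exporting to a text format.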


### Question 2 - Standardizing Categorical Data

Tasks:

1. Normalize "product" and "category" strings to lowercase without extra spaces.

2. Fill missing prices with the mean price per product type.

3. Drop duplicate rows after cleaning.

4. Sort by "product" then "price" ascending.

In [None]:
df = pd.DataFrame({
    'product': ['Laptop', 'laptop', ' LAPTOP ', 'Tablet', 'tablet', 'Phone', ' phone '],
    'price': [1000, 950, 1050, 400, None, 600, 580],
    'category': ['Electronics', 'electronics', ' ELECTRONICS ', 'Electronics', 'electronics', None, 'ELECTRONICS']
})

# Q1
df['product'] = df['product'].str.lower().str.strip()
df['category'] = df['category'].str.lower().str.strip()
df

# Q2
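# Fill each missing price with the mean of its own product group, not the overall mean
df['price'] = df['price'].fillna(df.groupby('product')['price'].transform('mean'))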
df

# Q3
# Drop rows that are identical in every column; a subset of
# ['product', 'category'] would also discard rows with different prices
df = df.drop_duplicates()
df

# Q4
df = df.sort_values(by=['product', 'price'])  # ascending is the default; assign the result back
df
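
Task 2 calls for the mean price per product type rather than a single overall mean. The following is a minimal sketch of a group-wise fill with `groupby(...).transform('mean')`, using the already-normalized product names. The `prices`, `group_mean`, and `filled` names are illustrative.

```python
import pandas as pd

prices = pd.DataFrame({
    'product': ['laptop', 'laptop', 'laptop', 'tablet', 'tablet', 'phone', 'phone'],
    'price': [1000, 950, 1050, 400, None, 600, 580],
})

# Mean of each product group, broadcast back to the original row positions.
group_mean = prices.groupby('product')['price'].transform('mean')

# Only the missing entries are replaced; existing prices are left untouched.
filled = prices['price'].fillna(group_mean)

print(filled.tolist())
print(round(prices['price'].mean(), 2))  # overall mean, for comparison
```

With this data the missing tablet price is filled from the one other tablet row, whereas the overall mean would also pull in laptop and phone prices.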

### Question 3

You have the following DataFrame `df`:

| date       | store | product | sales |
|------------|-------|---------|-------|
| 2024-01-01 | A     | X       | 250   |
| 2024-02-15 | A     | Y       | 100   |
| 2024-01-10 | B     | X       | 400   |
| 2024-03-04 | B     | Y       | 150   |
| 2024-04-01 | C     | Z       | 300   |

Tasks:

1. Convert "date" to datetime and extract the week number.

2. Find the total revenue per store per week.

3. Rank the products in each store by total revenue.

In [1]:
import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-01', '2024-02-15', '2024-01-10', '2024-03-04', '2024-04-01'],
    'store': ['A', 'A', 'B', 'B', 'C'],
    'product': ['X', 'Y', 'X', 'Y', 'Z'],
    'sales': [250, 100, 400, 150, 300]
})

In [11]:
# Q1
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].dt.isocalendar().week

# Q2
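# Keep the result in a variable; an unassigned expression in the middle of a cell is discarded
weekly_revenue = df.groupby(['store', 'week']).agg({'sales':'sum'}).reset_index().rename(columns={'sales':'total_revenue'})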

# Q3
df_ranked = df.groupby(['store', 'product'])['sales'].sum().reset_index()
df_ranked['rank'] = df_ranked.groupby('store')['sales'].rank(method='dense', ascending=False).astype(int)
df_ranked

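|   | store | product | sales | rank |
|---|-------|---------|-------|------|
| 0 | A     | X       | 250   | 1    |
| 1 | A     | Y       | 100   | 2    |
| 2 | B     | X       | 400   | 1    |
| 3 | B     | Y       | 150   | 2    |
| 4 | C     | Z       | 300   | 1    |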
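
The output above shows only the ranking from task 3. As a hedged sketch for checking tasks 1 and 2, the weekly totals can be rebuilt from the same data. The `sales` DataFrame and the `weekly` name are illustrative, and the week numbers follow ISO 8601, as returned by `dt.isocalendar()`.

```python
import pandas as pd

sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-02-15', '2024-01-10',
                            '2024-03-04', '2024-04-01']),
    'store': ['A', 'A', 'B', 'B', 'C'],
    'sales': [250, 100, 400, 150, 300],
})

# ISO week number of each date (ISO weeks start on Monday).
sales['week'] = sales['date'].dt.isocalendar().week

# One row per (store, week) pair with the summed sales.
weekly = (
    sales.groupby(['store', 'week'], as_index=False)['sales']
         .sum()
         .rename(columns={'sales': 'total_revenue'})
)
print(weekly)
```

Because every row here falls in a different store and week, each weekly total equals the single sale recorded in that week.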
