### Practice Problems: Data Cleaning and Preprocessing in Pandas

### Problem 1: Handling Missing Values

Dataset: A small employee dataset containing missing values.

import pandas as pd

data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 30, None, 40],
    "Department": ["HR", "Finance", None, "IT", "Marketing"],
    "Salary": [50000, 60000, None, 70000, None]
}

df = pd.DataFrame(data)

Tasks:
✅ Identify missing values in the dataset.
✅ Fill missing values in the Age column with the average age.
✅ Fill missing values in the Department column with "Unknown".
✅ Drop rows where Salary is missing.

In [44]:
import pandas as pd

In [45]:
data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 30, None, 40], 
    "Department": ["HR", "Finance", None, "IT", "Marketing"], 
    "Salary": [50000, 60000, None, 70000, None]
}

In [46]:
df = pd.DataFrame(data)

In [47]:
df

   Employee   Age  Department   Salary
0     Alice  25.0          HR  50000.0
1       Bob   NaN     Finance  60000.0
2   Charlie  30.0        None      NaN
3     David   NaN          IT  70000.0
4      Emma  40.0   Marketing      NaN


In [48]:
type(df)

pandas.core.frame.DataFrame

In [5]:
df.isna().sum()

Employee      0
Age           2
Department    1
Salary        2
dtype: int64
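
Beyond raw counts, it can help to see what share of each column is missing and which rows are affected. The following is a minimal sketch on the same dataset; the variable names `missing_share` and `rows_with_missing` are illustrative and not part of the original exercise.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 30, None, 40],
    "Department": ["HR", "Finance", None, "IT", "Marketing"],
    "Salary": [50000, 60000, None, 70000, None],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```

Here only Alice's row is complete, so the row filter returns the other four rows.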

In [6]:
df['Age'].isna().sum()

2

In [7]:
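# Assign the result back: fillna(..., inplace=True) on a selected column is
# deprecated in recent pandas and can leave df unchanged under Copy-on-Write
df['Age'] = df['Age'].fillna(df['Age'].mean())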

In [8]:
df['Age'].isna().sum()

0

In [13]:
df['Department'].isna().sum()

1

In [14]:
# Assign the result back instead of using inplace=True on a selected column
df['Department'] = df['Department'].fillna("Unknown")

In [15]:
df['Department'].isna().sum()

0
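
The two fills above can also be expressed as a single call by passing a dictionary of per-column fill values to `DataFrame.fillna`. This is a sketch of an alternative, not the approach taken in the cells above; the name `filled` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 30, None, 40],
    "Department": ["HR", "Finance", None, "IT", "Marketing"],
    "Salary": [50000, 60000, None, 70000, None],
})

# Map each column to its own fill value; columns not listed are left alone.
# fillna returns a new DataFrame, so df itself is not modified.
filled = df.fillna({
    "Age": df["Age"].mean(),      # mean() skips NaN by default
    "Department": "Unknown",
})

print(filled)
```

Because `Salary` is not in the dictionary, its missing values remain and can still be dropped in the next step.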

In [16]:
df['Salary']

0    50000.0
1    60000.0
2        NaN
3    70000.0
4        NaN
Name: Salary, dtype: float64

In [17]:
df.dropna(subset=['Salary'], inplace=True)


In [18]:
df['Salary']

0    50000.0
1    60000.0
3    70000.0
Name: Salary, dtype: float64

In [19]:
df

   Employee        Age  Department   Salary
0     Alice  25.000000          HR  50000.0
1       Bob  31.666667     Finance  60000.0
3     David  31.666667          IT  70000.0
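
Note that the index above jumps from 1 to 3 because `dropna` keeps the original row labels. If a contiguous index is preferred, one option is to chain `reset_index(drop=True)`. The sketch below uses a reduced version of the dataset, and `cleaned` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Salary": [50000, 60000, None, 70000, None],
})

# Drop rows with a missing Salary, then renumber the rows from 0.
# drop=True discards the old index instead of adding it as a column.
cleaned = df.dropna(subset=["Salary"]).reset_index(drop=True)

print(cleaned)
```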


### Problem 2: Renaming Columns and Changing Data Types

Dataset: A small sales dataset with column names that need renaming.

data = {
    "cust_name": ["John", "Sara", "Mike", "Emma"],
    "purchase_amt": ["100", "200", "300", "400"],
    "purchase_date": ["2023-01-10", "2023-02-15", "2023-03-20", "2023-04-05"]
}

df = pd.DataFrame(data)

🔹 Tasks:

✅ Rename cust_name to "Customer Name", purchase_amt to "Purchase Amount",
and purchase_date to "Purchase Date".

✅ Convert "Purchase Amount" from string to integer.

✅ Convert "Purchase Date" to datetime format.

In [53]:
data = {
    "cust_name": ["John", "Sara", "Mike", "Emma"],
    "purchase_amt": ["100", "200", "300", "400"],
    "purchase_date": ["2023-01-10", "2023-02-15", "2023-03-20","2023-04-05"]
}

df = pd.DataFrame(data)

In [54]:
df

   cust_name  purchase_amt  purchase_date
0       John           100     2023-01-10
1       Sara           200     2023-02-15
2       Mike           300     2023-03-20
3       Emma           400     2023-04-05


In [55]:
df.columns

Index(['cust_name', 'purchase_amt', 'purchase_date'], dtype='object')

In [56]:
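# Note: this solution uses snake_case names rather than the spaced names in
# the task statement (e.g. "Customer Name"). purchase_date already fits that
# style, so it needs no entry in the mapping.
df.rename(columns={
    'cust_name': "customer_name",
    'purchase_amt': "purchase_amount"
}, inplace=True)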

In [57]:
type(df)

pandas.core.frame.DataFrame

In [58]:
df.columns

Index(['customer_name', 'purchase_amount', 'purchase_date'], dtype='object')
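
The solution above settles on snake_case names. To follow the task statement literally, the same `rename` call can map to the spaced names instead. A minimal sketch, with `renamed` as an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({
    "cust_name": ["John", "Sara", "Mike", "Emma"],
    "purchase_amt": ["100", "200", "300", "400"],
    "purchase_date": ["2023-01-10", "2023-02-15", "2023-03-20", "2023-04-05"],
})

# rename returns a new DataFrame; the original column names stay on df
renamed = df.rename(columns={
    "cust_name": "Customer Name",
    "purchase_amt": "Purchase Amount",
    "purchase_date": "Purchase Date",
})

print(renamed.columns.tolist())
```

Column names containing spaces must be accessed with bracket syntax, for example `renamed["Purchase Amount"]`, since attribute access does not work for them.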

In [71]:
df['purchase_amount'] = df['purchase_amount'].astype(int)

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   customer_name    4 non-null      object
 1   purchase_amount  4 non-null      int32 
 2   purchase_date    4 non-null      object
dtypes: int32(1), object(2)
memory usage: 208.0+ bytes
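
`astype(int)` works here because every string is a clean integer. With messier data it raises a `ValueError`, and the `int32` shown above likely reflects the platform's default integer width rather than anything in the data. The sketch below shows one way to handle both points; the `"n/a"` entry is a hypothetical dirty value that does not appear in the original dataset.

```python
import pandas as pd

# Hypothetical raw values: "n/a" is added here only to illustrate the failure case
raw = pd.Series(["100", "200", "n/a", "400"], name="purchase_amount")

# errors="coerce" turns unparseable entries into NaN instead of raising
amounts = pd.to_numeric(raw, errors="coerce")

# The nullable "Int64" dtype keeps the gap as <NA> and pins the width to 64 bits
amounts_int = amounts.astype("Int64")

print(amounts_int)
```

For a column with no missing values, `astype("int64")` is enough to get a fixed width regardless of platform.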


In [76]:
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   customer_name    4 non-null      object        
 1   purchase_amount  4 non-null      int32         
 2   purchase_date    4 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int32(1), object(1)
memory usage: 208.0+ bytes
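
Once a column has a datetime dtype, the `.dt` accessor exposes its date components. The following sketch also passes an explicit `format`, which is optional for ISO-style dates like these but makes the expected layout clear; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(["2023-01-10", "2023-02-15", "2023-03-20", "2023-04-05"])

# An explicit format documents the expected layout and avoids guessing
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Date components are available through the .dt accessor
months = parsed.dt.month
day_names = parsed.dt.day_name()

print(months.tolist())
print(day_names.tolist())
```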


### Problem 3: Normalizing a Column

Dataset: A small product dataset with prices spanning a wide range.

data = {
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price": [1000, 500, 300, 800]
}

df = pd.DataFrame(data)


🔹 Tasks:

✅ Normalize the Price column using Min-Max scaling:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

✅ Add the normalized values as a new column in the DataFrame.



In [78]:
data = {
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price": [1000, 500, 300, 800]

}

In [79]:
df = pd.DataFrame(data)

In [80]:
df

   Product  Price
0   Laptop   1000
1    Phone    500
2   Tablet    300
3  Monitor    800


In [83]:
from sklearn.preprocessing import MinMaxScaler

In [84]:
scaler = MinMaxScaler()

# Apply Min-Max Scaling to the "Price" column

df["Price_Normalized"] = scaler.fit_transform(df[["Price"]])

df

   Product  Price  Price_Normalized
0   Laptop   1000          1.000000
1    Phone    500          0.285714
2   Tablet    300          0.000000
3  Monitor    800          0.714286
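
The same result can be obtained without scikit-learn by applying the Min-Max definition directly, subtracting the column minimum and dividing by the range. This is a sketch of an alternative to the `MinMaxScaler` cell above; `price_min` and `price_max` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price": [1000, 500, 300, 800],
})

price_min = df["Price"].min()   # 300
price_max = df["Price"].max()   # 1000

# (x - min) / (max - min), computed element-wise for the whole column
df["Price_Normalized"] = (df["Price"] - price_min) / (price_max - price_min)

print(df)
```

For example, the Phone row gives (500 - 300) / (1000 - 300) = 200 / 700 ≈ 0.285714, matching the scaler output. If every price were identical, the range would be zero and this expression would produce NaN, so a constant column needs separate handling.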


In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product           4 non-null      object 
 1   Price             4 non-null      int64  
 2   Price_Normalized  4 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 224.0+ bytes
