### **Coding Assignment: Data Exploration and Visualization Using Pandas**

### **Dataset 01:**
Use the [Iris Species dataset](https://www.kaggle.com/uciml/iris) from Kaggle, which contains measurements of different species of iris flowers.

#### **Instructions:**

1. **Importing the Dataset:**
   - Load the dataset into a pandas DataFrame.
   - Display the first 5 rows of the dataset.
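A minimal sketch of the loading step. It reads a small in-memory stand-in rather than the real download, so it runs without the file. The file name `Iris.csv` and the header names in the sample (`SepalLengthCm`, `Species`, and so on) are assumptions about the Kaggle copy; check them against your download. The rename maps them to the snake_case names used in the rest of this assignment.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. With the real data you would call
# pd.read_csv("Iris.csv") instead (file name and headers are assumptions).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
4,6.4,3.2,4.5,1.5,Iris-versicolor
5,6.3,3.3,6.0,2.5,Iris-virginica
6,5.8,2.7,5.1,1.9,Iris-virginica
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rename to the column names the assignment text refers to.
df = df.rename(columns={
    "SepalLengthCm": "sepal_length",
    "SepalWidthCm": "sepal_width",
    "PetalLengthCm": "petal_length",
    "PetalWidthCm": "petal_width",
    "Species": "species",
})

first_rows = df.head()  # head() returns the first 5 rows by default
print(first_rows)
```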

2. **Basic Data Exploration:**
   - **Data Summary:**
     - Display basic information about the dataset, including the number of rows and columns, data types of each column, and memory usage.
     - Generate descriptive statistics for the numerical columns (e.g., mean, median, standard deviation).
   - **Check for Missing Values:**
     - Identify any missing values in the dataset and explain how you would handle them if there were any.
   - **Value Counts:**
     - Display the count of each unique species in the `species` column.
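The exploration calls above can be sketched on a few made-up rows. The values are illustrative, and one `NaN` is planted on purpose so the missing-value check has something to report.

```python
import numpy as np
import pandas as pd

# Made-up rows; the NaN is planted deliberately for the missing-value check.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, np.nan],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

df.info(memory_usage="deep")  # row/column counts, dtypes, memory usage

# describe() covers numeric columns by default; its "50%" row is the median.
stats = df.describe()

missing = df.isna().sum()                      # missing values per column
species_counts = df["species"].value_counts()  # rows per species

print(stats, missing, species_counts, sep="\n\n")
```

If missing values did appear, common options are dropping the affected rows with `dropna()` or filling them with `fillna()`; which one is appropriate depends on how much data is missing and why.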

3. **Filtering and Sorting:**
   - Filter the dataset to show only the rows where the `sepal_length` is greater than 5.0.
   - Sort the filtered dataset by `petal_length` in descending order.
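A short sketch of the filter-then-sort pattern using a boolean mask and `sort_values`. The rows are made up; note that a `sepal_length` of exactly 5.0 is excluded because the comparison is strict.

```python
import pandas as pd

# Made-up rows, including one with sepal_length exactly 5.0.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 5.0, 6.3],
    "petal_length": [1.4, 1.4, 4.7, 1.6, 6.0],
})

# Boolean mask keeps only rows strictly greater than 5.0.
filtered = df[df["sepal_length"] > 5.0]

# Largest petal_length first.
result = filtered.sort_values("petal_length", ascending=False)
print(result)
```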


4. **Creating New Columns:**
   - Create a new column called `sepal_petal_ratio` which is the ratio of `sepal_length` to `petal_length`.
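The ratio column is an element-wise division of two Series. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
})

# Division is applied row by row; a petal_length of 0 would yield inf.
df["sepal_petal_ratio"] = df["sepal_length"] / df["petal_length"]
print(df)
```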


5. **Data Visualization:**
   - **Histogram:**
     - Create a histogram to visualize the distribution of `sepal_length` values.
   - **Scatter Plot:**
     - Create a scatter plot with `sepal_length` on the x-axis and `petal_length` on the y-axis. Use different colors to represent different species.
   - **Box Plot:**
     - Create a box plot for the `sepal_width` values, grouped by `species`.
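One possible way to draw the three plots side by side with Matplotlib and the pandas plotting helpers. The data is made up, and the `Agg` backend line is only there so the sketch runs as a plain script; in a notebook you would normally leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows, two per species.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of sepal_length.
axes[0].hist(df["sepal_length"], bins=5)
axes[0].set_xlabel("sepal_length")
axes[0].set_ylabel("count")

# Scatter plot: one scatter() call per species gives each its own color.
for name, group in df.groupby("species"):
    axes[1].scatter(group["sepal_length"], group["petal_length"], label=name)
axes[1].set_xlabel("sepal_length")
axes[1].set_ylabel("petal_length")
axes[1].legend()

# Box plot of sepal_width, one box per species.
df.boxplot(column="sepal_width", by="species", ax=axes[2])

fig.tight_layout()
```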


6. **Correlation Matrix:**
   - Calculate the correlation matrix for the numerical features and create a heatmap to visualize the correlations using `seaborn`.
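A sketch of the correlation step, assuming `seaborn` is installed. `select_dtypes` drops the text column before `corr()` so that only numerical features enter the matrix. The rows are made up, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up rows for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Pearson correlation between the numerical columns only.
corr = df.select_dtypes(include="number").corr()

# annot=True writes each coefficient into its cell; vmin/vmax pin the
# color scale to the full [-1, 1] range.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```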



---

### **Dataset 02:**
Use the [House Prices dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data) from Kaggle, which includes 79 explanatory variables that describe aspects of residential homes.

#### **Instructions:**
- Use Python in a Jupyter Notebook to solve the problems.
- Ensure your code is efficient, well-commented, and follows best practices for handling missing data and outliers.
- Use appropriate visualizations where needed and include explanations for your analysis.

### **Problem 1: Data Exploration with Pandas**  
1. Load the House Prices dataset using Pandas and perform initial exploration:
   - Display the first and last 5 rows.
   - Display the **data types** of each column.
   - Check for **missing values** and calculate the percentage of missing data for each column.
   - Summarize the **statistical properties** of the numerical columns using `describe()`.

2. Based on the dataset, identify the following:
   - Which columns are **categorical** and which are **numerical**.
   - Any potential **outliers** in the numerical columns (use visualizations like boxplots).
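A sketch of three of the checks above: percentage of missing data per column, a dtype-based split into numerical and categorical columns, and a numeric outlier check to complement the boxplots. The rows are made up, and the 1.5 × IQR rule is one common convention, not something the assignment prescribes. A dtype-based split is only a first pass: a column such as `MSSubClass` is stored as a number but encodes a category.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse a few of the dataset's column names.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245, 14115],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, np.nan, 85.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr",
                     "Crawfor", "Timber", "Mitchel"],
    "SalePrice": [208500, 181500, 223500, 140000, 375000, 143000],
})

# isna() gives booleans; their column mean is the fraction missing.
missing_pct = df.isna().mean() * 100

# First-pass split by dtype.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 1.5 * IQR rule (an assumed convention): flag values far outside the
# interquartile range.
q1 = df["LotArea"].quantile(0.25)
q3 = df["LotArea"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["LotArea"] < q1 - 1.5 * iqr) | (df["LotArea"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_pct, numerical_cols, categorical_cols, outliers, sep="\n\n")
```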

### **Problem 2: Data Manipulation with Pandas**  
1. Filter the dataset to include only houses where the **SalePrice** is above the **90th percentile**.
2. Create a new **calculated column**: `Price_Per_SqFt = SalePrice / GrLivArea`.
3. **Group the data** by `Neighborhood` and calculate the **mean** and **median** for `SalePrice`.
4. Sort the dataset by `Price_Per_SqFt` in descending order and display the **top 10 rows**.
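The four manipulations can be sketched on a handful of made-up rows. The sketch sorts in descending order, reading "top 10" as the ten highest values; that reading is an assumption.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "NoRidge"],
    "GrLivArea": [1000, 1600, 2000, 2000, 2500],
    "SalePrice": [100000, 200000, 300000, 400000, 500000],
})

# 1. Rows strictly above the 90th percentile of SalePrice.
threshold = df["SalePrice"].quantile(0.90)
top_decile = df[df["SalePrice"] > threshold]

# 2. Calculated column.
df["Price_Per_SqFt"] = df["SalePrice"] / df["GrLivArea"]

# 3. Mean and median SalePrice per neighborhood.
by_neighborhood = df.groupby("Neighborhood")["SalePrice"].agg(["mean", "median"])

# 4. Highest price per square foot first; head(10) returns at most 10 rows.
top10 = df.sort_values("Price_Per_SqFt", ascending=False).head(10)

print(top_decile, by_neighborhood, top10, sep="\n\n")
```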

### **Problem 3: Handling Missing Data and Imputation**  
1. Identify columns with **missing values**.
2. Impute the missing values:
   - For **numerical columns**, replace missing values with the **median**.
   - For **categorical columns**, replace missing values with the **most frequent** value.
3. Justify your choice of imputation methods and check if any missing data remains.
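A minimal sketch of median and most-frequent-value imputation with plain pandas, on made-up rows. It assumes every column has at least one non-missing value; for an entirely empty column `mode()` returns nothing and the indexing below would fail.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in one numerical and one categorical column.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0, np.nan],
    "GarageType": ["Attchd", "Detchd", np.nan, "Attchd", "Attchd"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numerical columns: fill with each column's median (robust to outliers).
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with the most frequent value.
# mode() ignores NaN by default and may return ties; take the first.
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Confirm that nothing is left missing.
remaining = int(df.isna().sum().sum())
print(df, remaining, sep="\n\n")
```

Before imputing the real data, it is worth checking the dataset's data description: for some columns a missing entry may encode the absence of a feature rather than an unknown value, in which case the most frequent value would be misleading.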

### **Problem 4: NumPy Array Operations**
1. Convert the numerical columns (such as `LotArea`, `YearBuilt`, `GrLivArea`, `SalePrice`, etc.) into a **NumPy array**.
2. Perform the following NumPy operations on the array:
   - Compute the **mean** and **standard deviation** of each column.
   - Normalize the data using **min-max scaling** and **z-score normalization**.
   - Compute the **dot product** between two numerical columns of your choice.
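The NumPy operations above can be sketched on a three-row made-up table. With $x$ a column, min-max scaling is $(x - \min x) / (\max x - \min x)$ and the z-score is $(x - \mu) / \sigma$. The sketch assumes no column is constant, since that would make a denominator zero.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "LotArea": [8000, 10000, 12000],
    "GrLivArea": [1000, 1500, 2000],
    "SalePrice": [100000, 200000, 300000],
})

# Rows are houses, columns are features.
X = df.to_numpy(dtype=float)

# axis=0 reduces over rows, giving one value per column.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population std (ddof=0); pandas .std() uses ddof=1

# Min-max scaling maps every column onto [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
minmax = (X - col_min) / (col_max - col_min)

# Z-score normalization gives every column mean 0 and std 1.
zscore = (X - col_mean) / col_std

# Dot product of two columns: GrLivArea . SalePrice.
dot = np.dot(X[:, 1], X[:, 2])

print(col_mean, col_std, minmax, zscore, dot, sep="\n\n")
```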

### **Problem 5: Data Visualization**  
Using Matplotlib and/or Seaborn, visualize the following:
1. **Distribution** of `SalePrice` using a **histogram**.
2. **Box plot** of `SalePrice` grouped by `OverallQual` (overall quality of the house).
3. **Scatter plot** between `GrLivArea` (above-ground living area) and `SalePrice`. Analyze the relationship.
4. Plot a **correlation heatmap** for the numerical columns.

### **Problem 6: Aggregation and Transformation**  
1. Perform a **groupby** operation on `Neighborhood` and calculate the **mean** and **median** for `SalePrice`.
2. Use the `apply()` function to create a custom transformation of `SalePrice` (e.g., min-max scaling or a log transformation).
3. Aggregate multiple statistical values (e.g., **mean**, **median**, and **count**) for `SalePrice` grouped by `OverallQual`.
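A sketch of the `apply()` and multi-statistic aggregation steps on made-up rows. The log transformation and the column name `LogSalePrice` are illustrative choices, not something the assignment requires.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "OverallQual": [5, 5, 6, 6, 7],
    "SalePrice": [100000, 120000, 150000, 170000, 250000],
})

# Custom transformation via apply(): log(1 + x) compresses the long right
# tail that price data typically has. LogSalePrice is a hypothetical name.
df["LogSalePrice"] = df["SalePrice"].apply(np.log1p)

# Several statistics at once for each quality level.
summary = df.groupby("OverallQual")["SalePrice"].agg(["mean", "median", "count"])

print(df, summary, sep="\n\n")
```

For a simple element-wise function like this, calling `np.log1p(df["SalePrice"])` directly gives the same result; `apply()` is more useful when the transformation is a custom Python function.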