# PT02 - Python for Data Science 
**AI Bootcamp - Instinct Institute**

**Author:** MORK Mongkul

--- 

## Exercise 1: NumPy Array

You are a **Data Scientist** preparing numerical data for a Machine Learning model.  
Your task is to understand how NumPy stores and manipulates numerical data efficiently.

**Task:**
1. Import NumPy as `np`.
2. Create a 1D NumPy array called `scores` containing:
   - `[60, 65, 70, 78, 85]`
3. Print the following properties:
   - Shape
   - Data type
   - Number of dimensions
4. Access:
   - The first element
   - The last element
   - A slice containing the middle three values
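
As a reference for the attributes and indexing forms listed above, here is a minimal sketch on a different array. The name `temps` and its values are hypothetical and are not the exercise data.

```python
import numpy as np

# Hypothetical example data (not the exercise's `scores` array)
temps = np.array([21.5, 23.0, 19.8, 25.1, 22.4, 20.0])

print(temps.shape)  # tuple with the length of each axis
print(temps.dtype)  # single element type shared by all values
print(temps.ndim)   # number of axes

first = temps[0]     # indexing starts at 0
last = temps[-1]     # negative indices count from the end
inner = temps[1:-1]  # slice: start is inclusive, stop is exclusive
print(first, last, inner)
```

The same attributes and slice syntax apply to `scores`; only the slice bounds need to change.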

---

In [None]:
# To Do 

## Exercise 2: Matrices and Spatial Slicing

In Data Science, a 2D array often represents a dataset where **rows** are samples and **columns** are features. Mastering 2D slicing is essential for selecting specific data subsets.

**Task:**
1. Create a 4 × 4 NumPy array named `matrix` containing integers from 1 to 16.
2. **Row/Column Access:**
   - Print the entire second row.
   - Print the entire third column.
3. **Sub-matrix Extraction:**
   - Use slicing to extract the "center" 2 × 2 square: `[[6, 7], [10, 11]]`.
4. **Corner Extraction:**
   - Extract the four corner elements using a single slicing operation (Hint: Use a **step** of 3).
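
The slicing patterns used in this exercise can be sketched on a smaller, hypothetical 3 × 5 array named `grid` (an assumption for illustration, not the exercise's `matrix`). Each axis takes its own `start:stop:step` expression, separated by a comma.

```python
import numpy as np

# Hypothetical 3 x 5 array:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]
grid = np.arange(15).reshape(3, 5)

first_row = grid[0, :]        # one full row
last_col = grid[:, -1]        # one full column
block = grid[0:2, 1:3]        # rows 0-1 and columns 1-2
every_other = grid[::2, ::2]  # step of 2 along both axes

print(first_row)
print(last_col)
print(block)
print(every_other)
```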

---

In [None]:
# To Do

## Exercise 4: Structural Transformations

Data often arrives in a "flat" format. A Data Scientist must know how to change the shape of data without changing its values to fit the input requirements of AI models.

**Task:**
1. Create a 1D array of 12 elements from 10 to 120 in steps of 10 (`10, 20, ..., 120`).
2. **Reshape:** Transform this array into a 3 × 4 matrix.
3. **Transpose:** Use the `.T` attribute to flip the matrix so that rows become columns.
4. **Inferred Reshaping:** Use `reshape(-1, 2)` on the original array. Explain what the `-1` does.
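
A minimal sketch of the three operations on a hypothetical 6-element array (`flat` and its values are assumptions, not the exercise data). Comparing the printed shapes is a good starting point for the explanation asked for in step 4.

```python
import numpy as np

flat = np.arange(6)             # hypothetical data: [0 1 2 3 4 5]

table = flat.reshape(2, 3)      # same 6 values, arranged as 2 rows x 3 columns
flipped = table.T               # transpose: rows become columns
inferred = flat.reshape(-1, 3)  # one dimension left for NumPy to work out

print(table.shape, flipped.shape, inferred.shape)
print(flat.shape)               # reshape returns a new view; `flat` keeps its shape
```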

---

In [None]:
# To Do

## Exercise 5: Feature Scaling (Min-Max Normalization)

In Machine Learning, features often have different scales (e.g., Age 0-100 vs. Salary 20k-100k). We normalize data so that all features contribute equally to the model calculations.

**Task:**
1. Create a 10 × 3 matrix of random integers between 10 and 500.
2. **Global Statistics:** Find the `min` and `max` values of the entire matrix.
3. **Vectorized Operation:** Apply the Min-Max formula to scale the data between 0 and 1:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

4. Print the first 5 rows of the normalized data to verify the values are between 0 and 1.
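
To make the formula concrete, here is a worked sketch on a small fixed array, so the arithmetic can be checked by hand. The array `X` and its values are hypothetical; the exercise itself uses random integers.

```python
import numpy as np

# Hypothetical 2 x 2 data; global min is 2 and global max is 10
X = np.array([[2, 4],
              [6, 10]])

x_min = X.min()
x_max = X.max()

# Vectorized: the subtraction and division apply to every element at once
X_norm = (X - x_min) / (x_max - x_min)

print(X_norm)
# [[0.   0.25]
#  [0.5  1.  ]]
```

The smallest value maps to 0 and the largest to 1, which gives a quick sanity check for the random matrix as well.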

---

In [None]:
# To Do

## Exercise 7: One-Hot Encoding Simulation (Masking)

ML models cannot process categorical labels (like "Red", "Green", "Blue") directly. We use boolean logic to convert categories into binary columns.

**Task:**
1. Create a 1D array `labels` with values: `[0, 1, 2, 0, 1, 2]`.
2. **Boolean Masking:** Create three separate masks: `is_cat_0`, `is_cat_1`, and `is_cat_2` using equality comparisons.
3. **Type Conversion:** Convert these boolean masks into integers (`0` and `1`) using the `.astype(int)` method.
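4. **Stacking:** Use `np.column_stack` (or `np.stack` with `axis=1`) to combine them into a 6 × 3 matrix. Note that `np.stack` with its default `axis=0` would produce a 3 × 6 matrix instead.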
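
The masking and stacking steps can be sketched on a shorter, hypothetical label array with only two categories (`colors` and the mask names are assumptions for illustration). The sketch also shows how the stacking axis affects the resulting shape.

```python
import numpy as np

colors = np.array([1, 0, 1])       # hypothetical labels with two categories

is_0 = (colors == 0).astype(int)   # [0 1 0]
is_1 = (colors == 1).astype(int)   # [1 0 1]

as_columns = np.column_stack([is_0, is_1])  # each mask becomes a column
same_thing = np.stack([is_0, is_1], axis=1)
as_rows = np.stack([is_0, is_1])            # default axis=0: each mask becomes a row

print(as_columns.shape, same_thing.shape, as_rows.shape)
```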

---

In [None]:
# To Do

## Exercise 8: Mean Squared Error (MSE) Calculation

In ML, we measure model performance by calculating the distance between predicted values ($\hat{y}$) and actual values ($y$).

**Task:**
1. Create two 1D arrays of size 50: `y_true` and `y_pred` using random values.
2. **Error Vector:** Calculate the difference (`y_true - y_pred`).
3. **Aggregate:** Square the differences and find the average. Implement this formula using NumPy:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{true},i} - y_{\text{pred},i})^2$$
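
A worked sketch on three hand-picked values makes the formula easy to verify by hand. The arrays below are hypothetical and much shorter than the size-50 arrays the exercise asks for.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred   # [ 1.  0. -2.]
squared = errors ** 2      # [1. 0. 4.]
mse = squared.mean()       # (1 + 0 + 4) / 3

print(mse)
```

Squaring removes the sign of each error, so over- and under-predictions do not cancel out in the average.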

---

In [None]:
# To Do 

## Exercise 9: Analysis of Titanic Survival Dataset

In this comprehensive exercise, you will act as a **Data Analyst**. You have been handed the passenger manifest of the RMS Titanic. Your goal is to load, clean, and analyze the data to find patterns in survival rates.

### Task 1: Load and Explore the Data

1. Import `pandas` as `pd`.
2. Load the dataset from the official bootcamp URL:  
   `https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Titanic-Dataset.csv`
3. Display the first 10 rows and use `df.info()` to identify which columns contain **NaN** (missing) values.
4. Use `df.describe()` to find the average `Age` and the maximum `Fare` paid by a passenger.
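
The exploration calls above can be tried offline on a tiny inline CSV before loading the real file. Everything in this sketch (the `toy` DataFrame, its columns, and its values) is hypothetical. `pd.read_csv` accepts a URL string in the same position as the in-memory buffer used here.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with one missing age
csv_text = """name,age,fare
Ann,34,7.25
Bob,,71.28
Cy,58,8.05
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))  # first rows
toy.info()          # non-null counts reveal columns with missing values

stats = toy.describe()  # summary statistics for numeric columns
mean_age = stats.loc["mean", "age"]
max_fare = stats.loc["max", "fare"]
missing_per_column = toy.isna().sum()

print(mean_age, max_fare)
print(missing_per_column)
```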

In [None]:
# To Do 

### Task 2: Data Cleaning

1. Fill the missing values in the `Age` column with the **median** age of the dataset to avoid losing data.
2. Remove the `Cabin` column entirely, as it contains too many missing values to be useful for AI.
3. Convert the `Sex` column from strings to integers: change `"male"` to `0` and `"female"` to `1` using the `.map()` method.
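
The three cleaning steps can be sketched on a hypothetical DataFrame with unrelated columns (`height`, `notes`, and `size` are assumptions for illustration, not Titanic columns).

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing number and a mostly empty column
toy = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 180.0],
    "notes": [None, None, "x", None],
    "size": ["small", "large", "small", "large"],
})

# 1. Fill missing numbers with the column median (median ignores NaN)
toy["height"] = toy["height"].fillna(toy["height"].median())

# 2. Drop a column that is mostly empty
toy = toy.drop(columns=["notes"])

# 3. Map string categories to integers
toy["size"] = toy["size"].map({"small": 0, "large": 1})

print(toy)
```

Values that are absent from the mapping dictionary become `NaN` after `.map()`, so it is worth checking `isna().sum()` afterwards.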

In [None]:
# To Do 

### Task 3: Filtering

1. Create a new DataFrame called `first_class_survivors` containing only passengers who were in `Pclass` 1 and had a `Survived` status of 1.
2. Use the `.query()` method to find all passengers older than 70 years. How many are there?
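
Both filtering styles can be sketched on a hypothetical DataFrame (`toy`, `team`, `won`, and `points` are made-up names). Note that each condition in a boolean mask needs its own parentheses when combined with `&`.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "team": [1, 2, 1, 3],
    "won": [1, 1, 0, 1],
    "points": [88, 45, 60, 93],
})

# Boolean masks combined with & (element-wise AND)
team1_winners = toy[(toy["team"] == 1) & (toy["won"] == 1)]

# The same kind of filter written as a query string
high_scorers = toy.query("points > 80")

print(team1_winners)
print(len(high_scorers))
```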

In [None]:
# To Do 

### Task 4: Aggregation

1. Group the data by `Pclass` and calculate the **mean** survival rate for each class.
2. Use `value_counts()` on the `Embarked` column to determine at which port (C = Cherbourg, Q = Queenstown, S = Southampton) the most passengers boarded.
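
A minimal sketch of both aggregations on hypothetical data (`group`, `passed`, and `city` are made-up column names). The mean of a 0/1 column is the share of rows equal to 1, which is why the mean of `Survived` can be read as a survival rate.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "passed": [1, 0, 1, 1, 0],
    "city": ["X", "Y", "X", "X", "Y"],
})

# Mean of a 0/1 column per group = proportion of 1s in that group
pass_rate = toy.groupby("group")["passed"].mean()

# Frequency of each category, sorted from most to least common
city_counts = toy["city"].value_counts()
most_common = city_counts.idxmax()

print(pass_rate)
print(city_counts)
print(most_common)
```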

---

In [None]:
# To Do 

## Exercise 12: Telco Customer Churn Case Study

In this exercise, you will follow the workflow used by Data Scientists to analyze customer retention. Your goal is to move from raw data to business insights using Pandas, Matplotlib, and Seaborn.

### Task 1: Data Loading and Preprocessing

1. Import `pandas`, `matplotlib.pyplot`, and `seaborn`.
2. Load the dataset URL:  
   `https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Telco-Customer-Churn.csv`
3. **Handling the ID:** The `customerID` column is unique to every row and provides no statistical value. Drop it from your DataFrame.
4. **Data Type Correction:** The `TotalCharges` column is stored as a string but should be numeric. Use `pd.to_numeric(df['TotalCharges'], errors='coerce')` to convert it; values that cannot be parsed become `NaN`.
5. **Handling Nulls:** After the conversion, identify the missing values in `TotalCharges` and fill them with the median.
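
The preprocessing steps can be sketched on a hypothetical DataFrame where one numeric value is stored as a blank string (`raw`, `id`, and `amount` are assumptions for illustration, not columns of the Telco file).

```python
import pandas as pd

# Hypothetical data: "amount" is text, and one entry is a blank string
raw = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "amount": ["10.5", " ", "30", "20.5"],
})

raw = raw.drop(columns=["id"])  # identifier carries no statistical signal

# Unparseable strings become NaN instead of raising an error
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
n_missing = raw["amount"].isna().sum()

# Fill the gaps with the median of the values that did parse
raw["amount"] = raw["amount"].fillna(raw["amount"].median())

print(n_missing)
print(raw)
```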

In [None]:
# To Do 

### Task 2: Exploratory Data Analysis (Categorical)

1. **Gender and Seniority:** Use `sns.countplot()` to visualize the distribution of `gender` and `SeniorCitizen`.
2. **Contract and Payment:** Create count plots for `Contract` and `PaymentMethod`. Which categories are the most common?
3. **Churn Distribution:** Visualize the `Churn` column. Calculate the exact percentage of customers who have left the company.
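
For the percentage in step 3, one possible approach is `value_counts(normalize=True)`, sketched here on a hypothetical four-value Series (the name `status` and its values are made up).

```python
import pandas as pd

# Hypothetical labels: one "Yes" out of four
status = pd.Series(["Yes", "No", "No", "No"], name="left")

# normalize=True returns proportions instead of raw counts
share = status.value_counts(normalize=True) * 100
pct_yes = share["Yes"]

print(share)
print(pct_yes)
```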

In [None]:
# To Do 

### Task 3: Feature Engineering and Pattern Analysis

1. **Tenure Binning:** Create a new column `tenure_group` that bins the `tenure` column into non-overlapping yearly ranges (e.g., 0-12 months, 13-24 months, etc.) or categories (Short, Medium, Long term).
2. **Service Analysis:** Create a visualization that shows the churn rate for different `InternetService` types (DSL, Fiber optic, No).
3. **Contract Impact:** Use `sns.countplot(data=df, x='Contract', hue='Churn')` to determine which contract type has the highest churn frequency.
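
One way to approach the binning and the per-group churn rate is `pd.cut` followed by a grouped mean. The sketch below uses hypothetical data (`months`, `left`, and the bin edges are assumptions); plotting is left out so the numbers can be checked directly.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "months": [0, 12, 13, 30, 48],
    "left": ["Yes", "Yes", "No", "No", "Yes"],
})

# Bins are right-inclusive: (0, 12], (12, 24], (24, 48].
# include_lowest=True also places the value 0 in the first bin.
toy["months_group"] = pd.cut(
    toy["months"],
    bins=[0, 12, 24, 48],
    labels=["0-12", "13-24", "25-48"],
    include_lowest=True,
)

# Rate per bin = mean of a True/False flag within each bin
left_flag = toy["left"].eq("Yes")
rate_by_group = left_flag.groupby(toy["months_group"], observed=True).mean()

print(toy)
print(rate_by_group)
```

Without `include_lowest=True`, a value equal to the lowest bin edge would fall outside every bin and become `NaN`.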

In [None]:
# To Do

### Task 4: Numerical Analysis and Relationships

1. **Monthly Charges:** Use `sns.kdeplot()` to compare the distribution of `MonthlyCharges` for customers who churned vs. those who stayed.
2. **Tenure vs Churn:** Use `sns.boxplot(data=df, x='Churn', y='tenure')` to compare the median tenure (in months) of customers who churned with that of customers who stayed.
3. **Total Spend:** Use `sns.scatterplot(data=df, x='tenure', y='TotalCharges', hue='Churn')` to see the relationship between length of stay and total revenue.

In [None]:
# To Do 

### Task 5: Correlation Analysis

1. **Encoding:** Convert the `Churn` column to numeric values (`Yes=1, No=0`) and use `pd.get_dummies()` to convert other categorical variables into dummy variables.
2. **Correlation Matrix:** Generate a correlation matrix for the entire encoded dataset.
3. **Visualization:** Plot the correlation matrix using `sns.heatmap()` with `annot=True`.
4. **Insight:** Identify the top 3 features that have the strongest positive or negative correlation with `Churn`.
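
The encoding and ranking steps can be sketched on a hypothetical six-row DataFrame (`left`, `plan`, and `monthly` are made-up columns). The heatmap itself is omitted; the sketch only shows how to obtain the matrix and rank features by the absolute value of their correlation with the target.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "left": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "plan": ["basic", "pro", "basic", "pro", "basic", "basic"],
    "monthly": [70.0, 30.0, 80.0, 25.0, 90.0, 40.0],
})

# Binary target to 1/0, then one dummy column per remaining category
toy["left"] = toy["left"].map({"Yes": 1, "No": 0})
encoded = pd.get_dummies(toy, dtype=int)

corr = encoded.corr()

# Rank the other columns by |correlation| with the target
ranked = corr["left"].drop("left").sort_values(key=abs, ascending=False)

print(encoded.columns.tolist())
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, which matters because step 4 asks for the strongest relationships in either direction.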

---

In [None]:
# To Do 