# PT02 - Python for Data Science 
**AI & Machine Learning | ICT Center**

**Author:** PHALLY Makara

--- 

## Exercise 1: NumPy Array

You are a **Data Scientist** preparing numerical data for a Machine Learning model.  
Your task is to understand how NumPy stores and manipulates numerical data efficiently.

**Task:**
1. Import NumPy as `np`.
2. Create a 1D NumPy array called `scores` containing:
   - `[60, 65, 70, 78, 85]`
3. Print the following properties:
   - Shape
   - Data type
   - Number of dimensions
4. Access:
   - The first element
   - The last element
   - A slice containing the middle three values
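
As a warm-up, the same ideas can be sketched on a different array. The `temps` array and its values below are hypothetical and are not the exercise's `scores`; the attributes and indexing syntax are standard NumPy.

```python
import numpy as np

# Hypothetical example data (not the exercise's `scores` array).
temps = np.array([12, 15, 19, 23, 21, 17])

print(temps.shape)   # tuple of axis lengths: (6,)
print(temps.dtype)   # element type; a platform-dependent integer type
print(temps.ndim)    # number of axes: 1

first = temps[0]     # indexing starts at 0
last = temps[-1]     # negative indices count from the end
inner = temps[1:-1]  # slice that drops the first and last elements
print(first, last, inner)
```

A slice such as `temps[1:-1]` returns a view of the original data, so no values are copied.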

---

In [None]:
# To Do 

## Exercise 2: Matrices and Spatial Slicing

In Data Science, a 2D array often represents a dataset where **rows** are samples and **columns** are features. Mastering 2D slicing is essential for selecting specific data subsets.

**Task:**
1. Create a 4 × 4 NumPy array named `matrix` containing integers from 1 to 16.
2. **Row/Column Access:**
   - Print the entire second row.
   - Print the entire third column.
3. **Sub-matrix Extraction:**
   - Use slicing to extract the "center" 2 × 2 square: `[[6, 7], [10, 11]]`.
4. **Corner Extraction:**
   - Extract the four corner elements using a single slicing operation (Hint: Use a **step** of 3).
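
The slicing patterns above can be tried first on a grid of a different shape. This is a minimal sketch; `grid` and its 3 × 5 shape are assumptions chosen for illustration, so the step sizes differ from those the exercise needs.

```python
import numpy as np

# Hypothetical 3 x 5 grid holding 0..14 (not the exercise's 4 x 4 `matrix`).
grid = np.arange(15).reshape(3, 5)

row_1 = grid[1, :]        # entire second row
col_2 = grid[:, 2]        # entire third column
block = grid[0:2, 1:3]    # rows 0-1 and columns 1-2
corners = grid[::2, ::4]  # every 2nd row and every 4th column

print(row_1)
print(col_2)
print(block)
print(corners)
```

The general form is `array[row_start:row_stop:row_step, col_start:col_stop:col_step]`; a step equal to "size minus one" along an axis keeps only the first and last index of that axis.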

---

In [None]:
# To Do

## Exercise 3: Structural Transformations

Data often arrives in a "flat" format. A Data Scientist must know how to change the shape of data without changing its values to fit the input requirements of AI models.

**Task:**
1. Create a 1D array of 12 elements from 10 to 120 in steps of 10 (`10, 20, ..., 120`).
2. **Reshape:** Transform this array into a 3 × 4 matrix.
3. **Transpose:** Use the `.T` attribute to flip the matrix so that rows become columns.
4. **Inferred Reshaping:** Use `reshape(-1, 2)` on the original array. Explain what the `-1` does.
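
The three operations can be sketched on a smaller array. The 8-element `flat` array below is a hypothetical stand-in for the exercise's 12-element array.

```python
import numpy as np

# Hypothetical 8-element array (the exercise uses 12 elements).
flat = np.arange(1, 9)            # [1 2 3 4 5 6 7 8]

two_by_four = flat.reshape(2, 4)  # same values, new shape (2, 4)
flipped = two_by_four.T           # transpose: shape (4, 2)
inferred = flat.reshape(-1, 4)    # -1 lets NumPy compute the row count: 8 / 4 = 2

print(two_by_four.shape, flipped.shape, inferred.shape)
```

Only one dimension may be given as `-1`, and the total number of elements must divide evenly by the other dimensions; otherwise `reshape` raises a `ValueError`.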

---

In [None]:
# To Do

## Exercise 4: Feature Scaling (Min-Max Normalization)

In Machine Learning, features often have different scales (e.g., Age 0-100 vs. Salary 20k-100k). We normalize data so that all features contribute equally to the model calculations.

**Task:**
1. Create a 10 × 3 matrix of random integers between 10 and 500.
2. **Global Statistics:** Find the `min` and `max` values of the entire matrix.
3. **Vectorized Operation:** Apply the Min-Max formula to scale the data between 0 and 1:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

4. Print the first 5 rows of the normalized data to verify the values are between 0 and 1.
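
The formula can be checked by hand on a tiny matrix. This is a sketch on hypothetical fixed values rather than the random 10 × 3 matrix of the exercise. The per-column variant at the end is not requested by the task; it is shown only because it is the form usually applied when features have different scales.

```python
import numpy as np

# Hypothetical 3 x 2 matrix with known values, so the result is easy to verify.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])

# Global min-max scaling: one min and one max for the whole matrix.
x_min = X.min()
x_max = X.max()
X_norm = (X - x_min) / (x_max - x_min)

# Per-column variant (an addition, not part of the task): axis=0 gives one
# min/max per column, and broadcasting applies them row by row.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_col = (X - col_min) / (col_max - col_min)

print(X_norm)
print(X_col)
```

No loop is needed: the subtraction and division are applied element-wise to the whole array.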

---

In [None]:
# To Do

## Exercise 5: One-Hot Encoding Simulation (Masking)

ML models cannot process categorical labels (like "Red", "Green", "Blue") directly. We use boolean logic to convert categories into binary columns.

**Task:**
1. Create a 1D array `labels` with values: `[0, 1, 2, 0, 1, 2]`.
2. **Boolean Masking:** Create three separate masks: `is_cat_0`, `is_cat_1`, and `is_cat_2` using equality comparisons.
3. **Type Conversion:** Convert these boolean masks into integers (`0` and `1`) using the `.astype(int)` method.
4. **Stacking:** Use `np.column_stack` (or `np.stack` with `axis=1`) to combine them into a 6 × 3 matrix.
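
The masking idea can be sketched with only two categories. The `sizes` array and the mask names below are hypothetical and are not the exercise's `labels`.

```python
import numpy as np

# Hypothetical labels with two categories (the exercise has three).
sizes = np.array([1, 0, 1, 1])

is_0 = sizes == 0  # boolean mask: True where the label is 0
is_1 = sizes == 1  # boolean mask: True where the label is 1

# Each mask becomes one column of 0/1 values.
one_hot = np.column_stack([is_0.astype(int), is_1.astype(int)])

# np.stack stacks along axis 0 by default, so axis=1 is needed
# to get one row per sample.
same = np.stack([is_0.astype(int), is_1.astype(int)], axis=1)

print(one_hot)
```

Each row of the result contains exactly one `1`, in the column that matches the sample's category.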

---

In [None]:
# To Do

## Exercise 6: Mean Squared Error (MSE) Calculation

In ML, we measure model performance by calculating the distance between predicted values ($\hat{y}$) and actual values ($y$).

**Task:**
1. Create two 1D arrays of size 50: `y_true` and `y_pred` using random values.
2. **Error Vector:** Calculate the difference (`y_true - y_pred`).
3. **Aggregate:** Square the differences and find the average. Implement this formula using NumPy:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{true},i} - y_{\text{pred},i})^2$$
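
Before using random data, the formula can be verified on four hand-picked values. The numbers below are hypothetical and chosen so the result is easy to compute by hand.

```python
import numpy as np

# Hypothetical fixed values (the exercise uses 50 random values).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred    # [0.5, 0.0, -2.0, -1.0]
mse = np.mean(errors ** 2)  # (0.25 + 0 + 4 + 1) / 4 = 1.3125

print(mse)
```

Squaring makes every term non-negative, so positive and negative errors cannot cancel each other out.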

---

## Exercise 8: Analysis of Titanic Survival Dataset

In this comprehensive exercise, you will act as a **Data Analyst**. You have been handed the passenger manifest of the RMS Titanic. Your goal is to load, clean, and analyze the data to find patterns in survival rates.

Dataset: `https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Titanic-Dataset.csv`

**About Dataset**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

---

**Task 1:**

1. Load the Titanic dataset and display the first 5 rows.

2. What is the shape of the dataset? How many rows and columns does it have?

3. Display the dataset information and give a short analysis of it.

4. Show basic descriptive statistics for the numerical columns and give a short analysis of the dataset based on them.
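
The inspection methods can be sketched on a small hand-made table. The DataFrame below is hypothetical; in the exercise, the table would instead come from `pd.read_csv` applied to the dataset URL above.

```python
import pandas as pd

# Hypothetical toy table standing in for the real CSV.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, None, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

print(df.head())      # first rows (up to 5 by default)
print(df.shape)       # (number of rows, number of columns)
df.info()             # column dtypes and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```

In the `describe()` output, a `count` lower than the number of rows is a first sign that a column has missing values.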

In [None]:
# To Do 

**Task 2:**

1. Which columns contain missing values?

2. Handle the missing values: fill `Age` with its median and `Embarked` with its mode.

3. The `Cabin` column has too many missing values, and `PassengerId` carries no useful information for this analysis. Drop both columns.

4. Verify that no missing values remain.
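
The cleaning steps can be sketched on a toy table. The column names `Id`, `Port`, and `Notes` below are hypothetical stand-ins and are not the Titanic columns.

```python
import pandas as pd

# Hypothetical toy table with gaps in three columns.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Age": [22.0, None, 35.0, 58.0],
    "Port": ["S", "C", None, "S"],
    "Notes": [None, None, None, "x"],
})

missing = df.isnull().sum()  # missing values per column
print(missing)

# Numeric column: fill with the median. Categorical column: fill with the mode.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Port"] = df["Port"].fillna(df["Port"].mode()[0])  # mode() returns a Series

# Drop an identifier column and a mostly empty column.
df = df.drop(columns=["Id", "Notes"])

remaining = df.isnull().sum().sum()  # total missing values left
print(remaining)
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it when filling a skewed numeric column.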

In [None]:
# To Do

**Task 3:**

1. What was the overall survival rate of the passengers?

2. Compare the survival rates of males and females.

3. Which passenger class (`Pclass`) had the highest survival rate?

4. Did travelling alone make a passenger more likely to survive?

5. Did the fare paid (`Fare`) affect survival within the same class?
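
Rates of a 0/1 column are simply means, which the sketch below shows on a hypothetical toy table. The columns `Group`, `Companions`, and `Outcome` are invented for illustration. For the Titanic data, one possible definition of "travelling alone" (an assumption, not given in the task) is that `SibSp` and `Parch` are both 0.

```python
import pandas as pd

# Hypothetical toy table; Outcome is 1 for a positive outcome, 0 otherwise.
df = pd.DataFrame({
    "Group": ["a", "a", "b", "b", "b"],
    "Companions": [0, 2, 0, 1, 0],
    "Outcome": [1, 0, 1, 1, 0],
})

overall = df["Outcome"].mean()                    # 3 / 5 = 0.6
by_group = df.groupby("Group")["Outcome"].mean()  # rate within each group

# Derive a boolean flag, then compare the rate for each value of the flag.
df["Alone"] = df["Companions"] == 0
by_alone = df.groupby("Alone")["Outcome"].mean()

print(overall)
print(by_group)
print(by_alone)
```

Grouping by two columns at once, for example `df.groupby(["Group", "Alone"])`, gives the rate inside each combination, which is useful for "within the same class" questions.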

In [None]:
# To Do

**Task 4: Data Visualization (matplotlib/seaborn)**

1. Create a bar plot to show the survival distribution.

2. Create a bar plot to show the survival count by sex.

3. Create a countplot to visualize survival by gender.

4. Create a countplot to visualize survival by passenger class.

5. Create a histogram to show the age distribution.

6. Create a histogram to show the age distribution by survival.

7. Create a boxplot to show the fare distribution.

8. Create a catplot to show survival by class and gender.

9. Create a heatmap to show the correlation between the numerical columns.

10. Create a pairplot of the numerical columns.
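
As a starting point for the first plot, a bar plot of category counts can be sketched with matplotlib alone. The `outcome` series below is hypothetical; seaborn's `countplot` produces a similar chart directly from a DataFrame column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 0/1 outcomes (not the Titanic data).
outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

# Count each category and order the bars by category value.
counts = outcome.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)  # one bar per category
ax.set_xlabel("Outcome")
ax.set_ylabel("Count")
ax.set_title("Outcome distribution")
```

In a notebook the figure is displayed at the end of the cell; in a plain script, call `plt.show()` afterwards.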


In [None]:
# To Do


## Exercise 9: Telecom Churn Data Analysis

In this exercise, you will follow the workflow used by Data Scientists to analyze customer retention. Your goal is to move from raw data to business insights using Pandas, Matplotlib, and Seaborn.
 
Dataset: `https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Telco-Customer-Churn.csv`

**About Dataset**

Each row represents a customer, and each column contains a customer attribute described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn

- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

- Demographic info about customers – gender, age range, and if they have partners and dependents

---

**Task 1:**

1. Load the Telco Customer Churn dataset and display the first 5 rows.

2. What is the shape of the dataset? How many rows and columns does it have?

3. Display the dataset information and give a short analysis of it.

4. Show basic descriptive statistics for the numerical columns and give a short analysis of the dataset based on them.

**Exploratory Data Analysis (Categorical):**

1. **Gender and Seniority:** Use `sns.countplot()` to visualize the distribution of `gender` and `SeniorCitizen`.
2. **Contract and Payment:** Create count plots for `Contract` and `PaymentMethod`. Which categories are the most common?
3. **Churn Distribution:** Visualize the `Churn` column. Calculate the exact percentage of customers who have left the company.

In [None]:
# To Do 

**Task 2:**

1. Are there any missing values in this dataset?

2. Do any columns have an incorrect data type?

3. After correcting the data type of the `TotalCharges` column, are there any missing values? If so, how will you handle them?
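
A numeric column stored as text can hide missing values, because a blank string is not counted as missing. The sketch below shows this on a hypothetical `Total` column; dropping the affected rows at the end is only one possible way to handle them.

```python
import pandas as pd

# Hypothetical column of numbers stored as text, with one blank entry.
df = pd.DataFrame({"Total": ["29.85", " ", "108.15", "1889.5"]})

before = df["Total"].isnull().sum()  # the blank string is not counted as missing

# errors="coerce" turns values that cannot be parsed into NaN.
df["Total"] = pd.to_numeric(df["Total"], errors="coerce")
after = df["Total"].isnull().sum()

# One option among several: drop the rows that could not be converted.
cleaned = df.dropna(subset=["Total"])

print(before, after, len(cleaned))
```

Filling the gaps with a fixed value or with the median is an alternative to dropping rows; the right choice depends on why the values are blank.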


In [None]:
# To Do

**Task 3:**

1. What is the overall churn rate?

2. Compare the average tenure of churned vs. retained customers.

3. Compare the monthly charges of churned vs. retained customers.

4. Compare the total charges of churned vs. retained customers.

5. Convert `Churn` from Yes/No to numeric values (Yes = 1, No = 0).

6. Compare the churn rate by contract type.
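
Once a Yes/No column is mapped to 1/0, its mean is the rate. The sketch below uses a hypothetical toy table; the columns `Plan`, `Months`, and `Left` are invented stand-ins for the Telco columns.

```python
import pandas as pd

# Hypothetical toy table.
df = pd.DataFrame({
    "Plan": ["monthly", "monthly", "yearly", "yearly", "monthly"],
    "Months": [2, 40, 30, 60, 5],
    "Left": ["Yes", "No", "No", "No", "Yes"],
})

# Map the text labels to numbers so that the mean equals the rate.
df["LeftFlag"] = df["Left"].map({"Yes": 1, "No": 0})

rate = df["LeftFlag"].mean()                     # 2 / 5 = 0.4
by_plan = df.groupby("Plan")["LeftFlag"].mean()  # rate per plan type
avg_months = df.groupby("Left")["Months"].mean() # average months per outcome

print(rate)
print(by_plan)
print(avg_months)
```

Any value that is missing from the mapping dictionary becomes `NaN`, so it is worth checking for unexpected labels after calling `map`.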

In [None]:
# To Do 

**Task 4:**

1. Create a `TenureGroup` column:

    - Customers with a tenure of 0-12 months are `New`.

    - Customers with a tenure of more than 12 and up to 24 months are `Junior`.

    - Customers with a tenure of more than 24 months are `Loyal`.
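
One way to build such groups is `pd.cut`, sketched below on a hypothetical `months` series. The boundary handling is an assumption: exactly 12 months is counted as `New` and exactly 24 months as `Junior`.

```python
import pandas as pd

# Hypothetical tenure values in months, including the boundary cases.
months = pd.Series([0, 6, 12, 13, 24, 25, 72])

# Bins are right-inclusive by default: (0, 12], (12, 24], (24, inf].
# include_lowest=True also places the value 0 in the first bin.
groups = pd.cut(
    months,
    bins=[0, 12, 24, float("inf")],
    labels=["New", "Junior", "Loyal"],
    include_lowest=True,
)

print(groups.tolist())
```

The result is a categorical column with an order (`New` < `Junior` < `Loyal`), which keeps the groups in a sensible order in tables and plots.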




In [None]:
# To Do 

**Task 5:**

1. Create a plot to show the churn distribution.

2. Visualize the count of churned vs. retained customers for each contract type.

3. Visualize churn for Fiber optic vs. DSL customers to show how the type of internet service affects churn.

4. Create a KDE plot to show the distribution of tenure.

5. Create subplots to show churn by contract and by internet service (bivariate analysis).

6. Create a pairplot of the numerical columns.


In [None]:
# To Do